2024-04-10 22:08:12

by Isaku Yamahata

Subject: [PATCH v2 00/10] KVM: Guest Memory Pre-Population API

From: Isaku Yamahata <[email protected]>

Thank you for the feedback on the v1 patch series [1]. I updated the series
based on the v1 review. The intention is to reach agreement on the uAPI and to
get this merged into the kvm-coco-queue branch first, so that the TDX KVM patch
series can carry fewer patches.


TL;DR
TDX (or other TEE technologies) needs to pre-populate private memory as the
initial guest memory with measurement. TDX KVM defined a TDX-specific API for
it. However, pre-population also has another use case [2]: mitigating excessive
KVM page faults during guest boot or after live migration. It applies not only
to confidential guests but to any guest type.

The v1 patch series [1] implemented the pre-population together with CoCo
operations. The major feedback was to make the API a "pure" population API and
to introduce vendor-specific APIs for the CoCo operations. The population API
doesn't perform any CoCo-specific (or other backend-specific) operation. The
changes from v1 are: dropping the flags, allowing a larger mapping than
requested, dropping CoCo operations in favor of pure KVM mapping, restricting
the API to the TDP MMU only, and expressing the structure in bytes instead of
pages. The patches needed only for TDX were dropped.


Details of feedback and changes:
- Pure KVM mapping without additional operations (Sean, Michael)
Changed the behavior to not issue any additional operations and leave them to
vendor-specific APIs. It's difficult to define common operations because each
CoCo technology has its own requirements. For example, SEV requires more
parameters for guest memory initialization, while TDX optionally extends the
measurement with memory contents. The TDX-specific API will check whether the
GFN is mapped and issue TDX-specific operations.
https://lore.kernel.org/kvm/[email protected]/
https://lore.kernel.org/kvm/[email protected]/

- Restrict the API to TDP MMU (David)
Narrow the API scope to the TDP MMU only for simplicity. The ioctl returns an
error when the vCPU is in the legacy KVM MMU mode or in guest mode; those cases
require more consideration and a more complex implementation. Given that the
use case is pre-population, we can safely assume the vCPU is not in guest mode.
https://lore.kernel.org/kvm/[email protected]/

- Drop flags for KVM page fault. (Sean, David, Michael, Kai)
Drop all defined flags and make `flags` reserved for future use. Only an RW
mapping is useful for the pre-population use cases. Keep the initial API narrow
because deprecating uAPI is difficult.
https://lore.kernel.org/kvm/[email protected]/

- Allow mapping level larger than request (David)
Allow a huge page mapping larger than requested because the mapping doesn't
have to be exact for the intended use case. The gmem_max_level() hook allows
the vendor backend to specify the desired level.
https://lore.kernel.org/kvm/CALzav=dHNYP02q_CJncwk-JdL9OSB=613v4+siBm1Cp2rmxLLw@mail.gmail.com/

- Use byte length instead of number of pages (Michael)
Changed struct kvm_memory_mapping to use a byte length instead of a page
count. Because we may want sub-page operations in the future, using bytes
avoids unnecessary API changes.
https://lore.kernel.org/kvm/[email protected]/


This patch series depends on the following patch series, which enhances the
KVM page fault handler, on top of the KVM queue branch
https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue

- https://lore.kernel.org/kvm/[email protected]/
[PATCH 00/16] KVM: x86/mmu: Page fault and MMIO cleanups

The organization of this patch series is as follows.
1: Documentation on uAPI KVM_MAP_MEMORY.
2: Architecture-independent implementation with the uAPI definition.
3: Refactoring of x86 KVM MMU as preparation.
4: Preparation for the next patch.
5: Add an x86 KVM MMU helper function to map the guest page.
6: x86 KVM arch implementation.
7: Enhancement of x86 KVM arch implementation to enforce L1 GPA independent
of vCPU mode.
8: Add x86-ops necessary for TDX and SEV-SNP.
9: SEV callback to return an error.
10: Selftest for validation.

---
The objective of this RFC patch series is to develop a uAPI aimed at
pre-populating guest memory for various use cases and underlying VM
technologies.

Pre-populating guest memory to mitigate excessive KVM page faults during guest
boot [2], or on the destination after live migration, is a need not limited to
any specific technology. KVM_MAP_MEMORY only populates the second-level page
tables and doesn't perform further technology-specific operations, for example
CoCo-related operations such as the SEV-SNP RMP operations or TDX's
TDH.MEM.PAGE.ADD() and TDH.MR.EXTEND(). Use technology-specific APIs for those.

The existing mmap(MAP_POPULATE) or madvise(MADV_WILLNEED) don't help because
they interact with the CPU page tables, not with the second-level page tables
used for virtualization (AMD NPT, Intel EPT, ARM stage-2 page table, RISC-V
G-stage page table, etc.).

For the x86 default VM (SVM or VMX) and the other x86 VM types that use the
TDP page tables (SW_PROTECTED VM, SEV/SEV-SNP, TDX, and pKVM), the API
populates the TDP page tables. Although this patch series implements it for x86
KVM, the uAPI should be applicable to other architectures that support
second-level page tables, such as the ARM stage-2 page table and the RISC-V
G-stage page table. KVM support for other CoCo technologies, like ARM CCA RMM
or RISC-V CoVE, would introduce their own vendor-specific uAPIs.

- KVM_MAP_MEMORY: Populate the second-level page table.
An x86 VM populates the TDP page tables.
Each technology defines whether KVM_MAP_MEMORY is optional or mandatory, or it
may not support the API at all and return -EOPNOTSUPP. A usage sketch follows
at the end of this section.

- vendor-specific APIs:
SEV: KVM_SEV_LAUNCH_MEASURE and the guest_memfd prepare hook.
KVM_MAP_MEMORY is optional for SEV-SNP because the second-level page
table (NPT) is not protected. SEV-SNP introduces the Reverse Map Table
(RMP) to track the correspondence from a host physical address to a
guest physical address. The RMP is protected independently from the NPT,
and the host VMM manipulates it with dedicated instructions (RMPUPDATE,
PSMASH). It doesn't matter whether the GFN is populated in the NPT or
not because KVM_SEV_LAUNCH_MEASURE works only with the RMP and PFNs.

[PATCH v12 11/29] KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
https://lore.kernel.org/kvm/[email protected]/

[PATCH v12 21/29] KVM: SEV: Implement gmem hook for initializing private pages
https://lore.kernel.org/kvm/[email protected]/

TDX: KVM_TDX_INIT_MEM_REGION
KVM_MAP_MEMORY is mandatory for TDX; an error is returned for a GFN that
hasn't been pre-populated. TDX protects the second-level page table,
called the Secure-EPT, with operations like TDH.MEM.SEPT.ADD(),
TDH.MEM.PAGE.ADD(), etc. The non-leaf entries need to be populated
before the leaf private page can be added to the guest.

https://lore.kernel.org/kvm/[email protected]/

ARM CCA RMM and RISC-V CoVE would define their own uAPIs and determine whether
KVM_MAP_MEMORY is optional, mandatory, or unsupported, because these two
technologies are similar to TDX and SEV-SNP. A small summary of the
technologies:
ARM CCA RMM introduces the Realm Translation Table (RTT) as the stage-2
translation table. The contents of an RTT are not directly accessible to the
host; the host manages it through RMM commands like RMI_DATA_CREATE(),
RMI_DATA_CREATE_UNKNOWN(), RMI_RTT_CREATE(), etc.
RISC-V CoVE introduces the Memory Tracking Table (MTT) to track whether a host
physical address is confidential or non-confidential, with operations like
sbi_covh_add_tvm_page_table_pages(), sbi_covh_add_tvm_measured_pages(),
sbi_covh_add_tvm_zero_pages(), etc. The host VMM has no access to the G-stage
page table.
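
As a usage sketch (illustrative only, not part of this series; it assumes a
vCPU fd and a <linux/kvm.h> that carries the new definitions), a VMM could
drive KVM_MAP_MEMORY like this:

  #include <errno.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Pre-populate [gpa, gpa + len) through a vCPU fd. */
  static int prepopulate(int vcpu_fd, uint64_t gpa, uint64_t len)
  {
          struct kvm_memory_mapping mapping = {
                  .base_address = gpa,
                  .size = len,
                  .flags = 0,     /* must be zero; reserved for future use */
          };

          while (mapping.size) {
                  if (!ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping))
                          continue;       /* loop exits once size reaches 0 */
                  /* Partial progress: the struct was updated, just retry. */
                  if (errno == EAGAIN || errno == EINTR)
                          continue;
                  return -errno;          /* e.g. EINVAL, EFAULT, EOPNOTSUPP */
          }
          return 0;
  }

A VMM would typically call something like this once per RAM range before the
first KVM_RUN (or on the destination after live migration), before issuing the
vendor-specific calls where those are needed.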

Changes:
v2:
- Drop flags for KVM page fault
- Allow mapping larger than requested
- Change struct kvm_memory_mapping: drop the source member and express the
  values in bytes.
- Make KVM_MAP_MEMORY a pure mapping operation. It takes no further action
  like copying memory contents, measurement, etc.
- Restrict the API to TDP MMU only.

v1:
[1] https://lore.kernel.org/kvm/[email protected]/

[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com

Thanks,

Isaku Yamahata (10):
KVM: Document KVM_MAP_MEMORY ioctl
KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()
KVM: x86/mmu: Make __kvm_mmu_do_page_fault() return mapped level
KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory
KVM: x86: Implement kvm_arch_vcpu_map_memory()
KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY
KVM: x86: Add a hook in kvm_arch_vcpu_map_memory()
KVM: SVM: Implement pre_mmu_map_page() to refuse KVM_MAP_MEMORY
KVM: selftests: x86: Add test for KVM_MAP_MEMORY

Documentation/virt/kvm/api.rst | 52 ++
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 32 ++
arch/x86/kvm/mmu/mmu_internal.h | 36 +-
arch/x86/kvm/svm/sev.c | 6 +
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/svm/svm.h | 9 +
arch/x86/kvm/x86.c | 82 +++
include/linux/kvm_host.h | 3 +
include/uapi/linux/kvm.h | 9 +
tools/include/uapi/linux/kvm.h | 8 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 479 ++++++++++++++++++
virt/kvm/kvm_main.c | 54 ++
16 files changed, 769 insertions(+), 11 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c


base-commit: 14009f39d2632aef6ef2421e21791144f2da8dab
prerequisite-patch-id: 1ca598b4dc0b7450f5e92f0f112bb810194ca31e
prerequisite-patch-id: 06e681d29f9bc5141760fbc173170eb5f6b0b448
prerequisite-patch-id: 79e9efbb9666bc38062052a94d3fe0950f666538
prerequisite-patch-id: b7df4f4b0a76cc89a33e0af033614ac41fc9a125
prerequisite-patch-id: 05ee88161ea93139190133de4bc1639ffeb999b5
prerequisite-patch-id: 2f163bc188dad6bc3ca51ad4a39de0c70b8a926f
prerequisite-patch-id: f9f4a66755c1f5728f4bdc663969c93da305af4f
prerequisite-patch-id: a5f7010a12eb69c65e1e998ef3750a0c9fb276c9
prerequisite-patch-id: ad620e25babeff7472a48311fc2903971777c1fb
prerequisite-patch-id: 34dc0058454f08f7e7e0e87ceb6073de2839eb18
prerequisite-patch-id: 7c348b473163f44046c2f28c728dbb66e42045f6
prerequisite-patch-id: 296a4b916d1261fcdf8ad1c3bb1b0ed454ccc312
prerequisite-patch-id: 5952fd4611ae5367b9378f4a37d31e2228975717
prerequisite-patch-id: 8371f63d5fbb091ec7daa6b39a6e0263415b3e56
prerequisite-patch-id: 7492a030d6b475bda7357b20f6ba08665a8ff79b
prerequisite-patch-id: 219902bb30027256f999eb311e1e3869752cd6f9
--
2.43.2



2024-04-10 22:08:39

by Isaku Yamahata

Subject: [PATCH v2 02/10] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

From: Isaku Yamahata <[email protected]>

Add a new ioctl, KVM_MAP_MEMORY, in the KVM common code. It iterates over the
memory range and calls the arch-specific function. Add a stub arch function as
a weak symbol.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Drop need_resched(). (David, Sean, Kai)
- Move cond_resched() at the end of loop. (Kai)
- Drop added check. (David)
- Use EINTR instead of ERESTART. (David, Sean)
- Fix srcu lock leak. (Kai, Sean)
- Add comment above copy_to_user().
- Drop pointless comment. (Sean)
- Drop kvm_arch_vcpu_pre_map_memory(). (Sean)
- Don't overwrite error code. (Sean, David)
- Make the parameter in bytes, not pages. (Michael)
- Drop source member in struct kvm_memory_mapping. (Sean, Michael)
---
include/linux/kvm_host.h | 3 +++
include/uapi/linux/kvm.h | 9 +++++++
virt/kvm/kvm_main.c | 54 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 66 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..e56a0c7e5b42 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2445,4 +2445,7 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..972aa9e054d3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_MAP_MEMORY 236

struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1548,4 +1549,12 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+struct kvm_memory_mapping {
+ __u64 base_address;
+ __u64 size;
+ __u64 flags;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..f2ad9e106cdb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4415,6 +4415,48 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ int idx, r;
+
+ if (mapping->flags)
+ return -EINVAL;
+
+ if (!PAGE_ALIGNED(mapping->base_address) ||
+ !PAGE_ALIGNED(mapping->size) ||
+ mapping->base_address + mapping->size <= mapping->base_address)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ r = 0;
+ while (mapping->size) {
+ if (signal_pending(current)) {
+ r = -EINTR;
+ break;
+ }
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ cond_resched();
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4616,6 +4658,18 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
break;
}
+ case KVM_MAP_MEMORY: {
+ struct kvm_memory_mapping mapping;
+
+ r = -EFAULT;
+ if (copy_from_user(&mapping, argp, sizeof(mapping)))
+ break;
+ r = kvm_vcpu_map_memory(vcpu, &mapping);
+ /* Copy back even on error so userspace can see the processed range. */
+ if (copy_to_user(argp, &mapping, sizeof(mapping)))
+ r = -EFAULT;
+ break;
+ }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
--
2.43.2


2024-04-10 22:08:46

by Isaku Yamahata

Subject: [PATCH v2 01/10] KVM: Document KVM_MAP_MEMORY ioctl

From: Isaku Yamahata <[email protected]>

Add documentation for the KVM_MAP_MEMORY ioctl [1].

It populates guest memory without performing the underlying
technology-specific initialization [2]. For example, CoCo-related operations
won't be performed; concretely for TDX, this API won't invoke
TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific APIs are required for
such operations.

The key point is the adoption of a vcpu ioctl instead of a VM ioctl. First,
populating guest memory requires a vcpu; with a VM ioctl, we would need to pick
one vcpu somehow. Secondly, a vcpu ioctl allows each vcpu to invoke the ioctl
in parallel, which helps scale with guest memory size, e.g., hundreds of GB.
A sketch of such parallel use follows below.

[1] https://lore.kernel.org/kvm/[email protected]/
[2] https://lore.kernel.org/kvm/[email protected]/
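
As an illustration of the parallelism argument (a sketch under assumptions:
the vcpu_fds[] array, the even GPA split, and the omitted error handling are
all illustrative), each vCPU thread can populate its own slice of guest memory:

  #include <errno.h>
  #include <pthread.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  struct map_job {
          int vcpu_fd;
          uint64_t gpa;
          uint64_t len;
  };

  static void *map_worker(void *arg)
  {
          struct map_job *job = arg;
          struct kvm_memory_mapping m = {
                  .base_address = job->gpa,
                  .size = job->len,
          };

          /* Each vCPU maps its own slice; EAGAIN/EINTR just mean retry. */
          while (m.size) {
                  if (ioctl(job->vcpu_fd, KVM_MAP_MEMORY, &m) &&
                      errno != EAGAIN && errno != EINTR)
                          break;
          }
          return NULL;
  }

  /* Split [base, base + size) across one thread per vCPU. */
  static void prepopulate_parallel(int *vcpu_fds, int nr_vcpus,
                                   uint64_t base, uint64_t size)
  {
          pthread_t threads[nr_vcpus];
          struct map_job jobs[nr_vcpus];
          uint64_t chunk = (size / nr_vcpus) & ~0xfffULL; /* page-align slices */
          int i;

          for (i = 0; i < nr_vcpus; i++) {
                  jobs[i].vcpu_fd = vcpu_fds[i];
                  jobs[i].gpa = base + i * chunk;
                  jobs[i].len = (i == nr_vcpus - 1) ? size - i * chunk : chunk;
                  pthread_create(&threads[i], NULL, map_worker, &jobs[i]);
          }
          for (i = 0; i < nr_vcpus; i++)
                  pthread_join(threads[i], NULL);
  }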

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Make flags reserved for future use. (Sean, Michael)
- Clarified the supposed use case. (Kai)
- Dropped source member of struct kvm_memory_mapping. (Michael)
- Change the unit from pages to bytes. (Michael)
---
Documentation/virt/kvm/api.rst | 52 ++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..6ee3d2b51a2b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,58 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_memory_mapping (in/out)
+:Returns: 0 on success, < 0 on error
+
+Errors:
+
+ ========== =============================================================
+ EINVAL invalid parameters
+ EAGAIN The region was only partially processed. The caller should
+ reissue the ioctl with the updated parameters when `size` > 0.
+ EINTR An unmasked signal is pending. The region may have been
+ partially processed.
+ EFAULT The parameter address was invalid, the specified region
+ (`base_address` and `size`) was invalid, or the region isn't
+ covered by a KVM memory slot.
+ EOPNOTSUPP The architecture doesn't support this operation. On x86, only
+ two-dimensional paging supports this API; the x86 KVM shadow
+ MMU and other architectures don't.
+ ========== =============================================================
+
+::
+
+ struct kvm_memory_mapping {
+ __u64 base_address;
+ __u64 size;
+ __u64 flags;
+ };
+
+KVM_MAP_MEMORY populates guest memory for the given range: `base_address` is
+an (L1) guest physical address (GPA) and `size` is a length in bytes. `flags`
+must be zero; it is reserved for future use. When the ioctl returns, the input
+values are updated to point to the remaining range. If `size` > 0 on return,
+the caller should reissue the ioctl with the updated parameters.
+
+Multiple vcpus are allowed to call this ioctl simultaneously. It's not
+mandatory for all vcpus to issue it; a single vcpu can suffice. Invoking it
+from multiple vcpus is a scalability measure to process the population in
+parallel. If multiple vcpus call this ioctl in parallel, it may result in
+EAGAIN due to race conditions.
+
+This population is restricted to a "pure" population that doesn't trigger the
+underlying technology-specific initialization. For example, CoCo-related
+operations won't be performed. In the case of TDX, this API won't invoke
+TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific uAPIs are required for
+such operations.
+
+
5. The kvm_run structure
========================

--
2.43.2


2024-04-10 22:09:06

by Isaku Yamahata

Subject: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()

From: Isaku Yamahata <[email protected]>

Extract __kvm_mmu_do_page_fault() out of kvm_mmu_do_page_fault(). The inner
function initializes struct kvm_page_fault and calls the fault handler; the
outer function updates stats and converts the return code. KVM_MAP_MEMORY will
call the KVM page fault handler.

This patch makes emulation_type always be set regardless of the return code.
kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(), and it
references the value only when RET_PF_EMULATE is returned. Therefore, this
adjustment doesn't affect functionality.

No functional change intended.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Newly introduced. (Sean)
---
arch/x86/kvm/mmu/mmu_internal.h | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e68a60974cf4..9baae6c223ee 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -287,8 +287,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
fault->is_private);
}

-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u64 err, bool prefetch, int *emulation_type)
+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -318,14 +318,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}

- /*
- * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
- * guest perspective and have already been counted at the time of the
- * original fault.
- */
- if (!prefetch)
- vcpu->stat.pf_taken++;
-
if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
r = kvm_tdp_page_fault(vcpu, &fault);
else
@@ -333,12 +325,30 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,

if (r == RET_PF_EMULATE && fault.is_private) {
kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
- return -EFAULT;
+ r = -EFAULT;
}

if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;

+ return r;
+}
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 err, bool prefetch, int *emulation_type)
+{
+ int r;
+
+ /*
+ * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
+ * guest perspective and have already been counted at the time of the
+ * original fault.
+ */
+ if (!prefetch)
+ vcpu->stat.pf_taken++;
+
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);
+
/*
* Similar to above, prefetch faults aren't truly spurious, and the
* async #PF path doesn't do emulation. Do count faults that are fixed
--
2.43.2


2024-04-10 22:09:08

by Isaku Yamahata

Subject: [PATCH v2 04/10] KVM: x86/mmu: Make __kvm_mmu_do_page_fault() return mapped level

From: Isaku Yamahata <[email protected]>

The guest memory population logic will need to know what page size or level
(4K, 2M, ...) is mapped.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Newly added.
---
arch/x86/kvm/mmu/mmu_internal.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 9baae6c223ee..b0a10f5a40dd 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -288,7 +288,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
}

static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u64 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch,
+ int *emulation_type, u8 *level)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -330,6 +331,8 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp

if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ if (level)
+ *level = fault.goal_level;

return r;
}
@@ -347,7 +350,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;

- r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch,
+ emulation_type, NULL);

/*
* Similar to above, prefetch faults aren't truly spurious, and the
--
2.43.2


2024-04-10 22:09:49

by Isaku Yamahata

Subject: [PATCH v2 06/10] KVM: x86: Implement kvm_arch_vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

Wire the KVM_MAP_MEMORY ioctl to kvm_tdp_map_page() to populate guest memory.
When KVM_CREATE_VCPU creates a vCPU, it initializes the x86 KVM MMU part via
kvm_mmu_create() and kvm_init_mmu(), so the vCPU is ready to invoke the KVM
page fault handler. A worked example of the mapping update follows below.
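
As a worked example of the size/base_address update in the hunk below (the GPA
values here are arbitrary): suppose mapping->base_address is 0x40001000 and the
fault handler installs a 2MB mapping, i.e. level is PG_LEVEL_2M. Then

  end = (0x40001000 & KVM_HPAGE_MASK(PG_LEVEL_2M)) + KVM_HPAGE_SIZE(PG_LEVEL_2M)
      = 0x40000000 + 0x200000
      = 0x40200000

so mapping->size is reduced by end - base_address = 0x1ff000 and
mapping->base_address advances to 0x40200000, the first GPA not covered by the
installed huge page.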

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Catch up the change of struct kvm_memory_mapping. (Sean)
- Removed mapping level check. Push it down into vendor code. (David, Sean)
- Rename goal_level to level. (Sean)
- Drop kvm_arch_pre_vcpu_map_memory(), directly call kvm_mmu_reload().
(David, Sean)
- Fixed the update of mapping.
---
arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2619d3eee4..2c765de3531e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4713,6 +4713,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_MAP_MEMORY:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
@@ -5867,6 +5868,35 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ u64 end, error_code = 0;
+ u8 level = PG_LEVEL_4K;
+ int r;
+
+ /*
+ * Shadow paging uses GVA for kvm page fault. The first implementation
+ * supports GPA only to avoid confusion.
+ */
+ if (!tdp_enabled)
+ return -EOPNOTSUPP;
+
+ /* reload is optimized for repeated call. */
+ kvm_mmu_reload(vcpu);
+
+ r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
+ if (r)
+ return r;
+
+ /* mapping->base_address is not necessarily aligned to level-hugepage. */
+ end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
+ KVM_HPAGE_SIZE(level);
+ mapping->size -= end - mapping->base_address;
+ mapping->base_address = end;
+ return r;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
--
2.43.2


2024-04-10 22:09:54

by Isaku Yamahata

Subject: [PATCH v2 05/10] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

From: Isaku Yamahata <[email protected]>

Introduce a helper function to call the KVM fault handler. It allows the new
ioctl to invoke the KVM fault handler to populate guest memory without seeing
RET_PF_* enums or other definitions internal to the x86 KVM MMU. The
implementation is restricted to two-dimensional paging for simplicity; shadow
paging uses the GVA instead of the L1 GPA for faulting, which would make the
API difficult to use.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Make the helper function two-dimensional paging specific. (David)
- Return error when vcpu is in guest mode. (David)
- Rename goal_level to level in kvm_tdp_mmu_map_page(). (Sean)
- Update return code conversion. Don't check pfn.
RET_PF_EMULATE => EINVAL, RET_PF_CONTINUE => EIO (Sean)
- Add WARN_ON_ONCE on RET_PF_CONTINUE and RET_PF_INVALID. (Sean)
- Drop unnecessary EXPORT_SYMBOL_GPL(). (Sean)
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e8b620a85627..51ff4f67e115 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}

+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 *level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 91dd4c44b7d8..a34f4af44cbd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4687,6 +4687,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}

+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+ u8 *level)
+{
+ int r;
+
+ /* Restrict to TDP page fault. */
+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+ return -EINVAL;
+
+ r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, false, NULL, level);
+ if (r < 0)
+ return r;
+
+ switch (r) {
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ return 0;
+
+ case RET_PF_EMULATE:
+ return -EINVAL;
+
+ case RET_PF_CONTINUE:
+ case RET_PF_INVALID:
+ default:
+ WARN_ON_ONCE(r);
+ return -EIO;
+ }
+}
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.43.2


2024-04-10 22:10:05

by Isaku Yamahata

Subject: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

From: Isaku Yamahata <[email protected]>

Forcibly switch the vCPU out of guest mode and SMM mode before calling the KVM
page fault handler for KVM_MAP_MEMORY.

KVM_MAP_MEMORY populates guest memory by guest physical address (GPA). If the
vCPU is in guest mode, it would populate with the L2 GPA; if the vCPU is in SMM
mode, it would populate the SMM address space. The API would be difficult to
use as such. Switch the vCPU MMU mode around populating the guest memory so
that it always populates with the L1 GPA.

There are several options to populate the L1 GPA regardless of the vCPU mode.
- Switch vCPU MMU only: This patch.
Pros: Concise implementation.
Cons: Heavily dependent on the KVM MMU implementation.
- Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
Use __get/set_sregs2() to switch to/from SMM mode.
Pros: straightforward.
Cons: This may cause unintended side effects.
- Refactor KVM page fault handler not to pass vCPU. Pass around necessary
parameters and struct kvm.
Pros: The end result will have clearly no side effects.
Cons: This will require big refactoring.
- Return error on guest mode or SMM mode: Without this patch.
Pros: No additional patch.
Cons: Difficult to use.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Newly added.
---
arch/x86/kvm/x86.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c765de3531e..8ba9c1720ac9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,8 +5871,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
struct kvm_memory_mapping *mapping)
{
+ struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
u64 end, error_code = 0;
u8 level = PG_LEVEL_4K;
+ bool is_smm;
int r;

/*
@@ -5882,18 +5884,40 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
if (!tdp_enabled)
return -EOPNOTSUPP;

+ /* Force the use of the L1 GPA regardless of the vCPU MMU mode. */
+ is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
+ if (is_smm ||
+ vcpu->arch.mmu != &vcpu->arch.root_mmu ||
+ vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
+ vcpu->arch.hflags &= ~HF_SMM_MASK;
+ mmu = vcpu->arch.mmu;
+ walk_mmu = vcpu->arch.walk_mmu;
+ vcpu->arch.mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+ kvm_mmu_reset_context(vcpu);
+ }
+
/* reload is optimized for repeated call. */
kvm_mmu_reload(vcpu);

r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
if (r)
- return r;
+ goto out;

/* mapping->base_address is not necessarily aligned to level-hugepage. */
end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
KVM_HPAGE_SIZE(level);
mapping->size -= end - mapping->base_address;
mapping->base_address = end;
+
+out:
+ /* Restore MMU state. */
+ if (is_smm || mmu) {
+ vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
+ vcpu->arch.mmu = mmu;
+ vcpu->arch.walk_mmu = walk_mmu;
+ kvm_mmu_reset_context(vcpu);
+ }
return r;
}

--
2.43.2


2024-04-10 22:10:29

by Isaku Yamahata

Subject: [PATCH v2 08/10] KVM: x86: Add a hook in kvm_arch_vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

Add a hook in kvm_arch_vcpu_map_memory() for KVM_MAP_MEMORY, called before
kvm_tdp_map_page(), to adjust the error code for the page fault. The hook can
hold vendor-specific logic to make those adjustments and enforce restrictions.
SEV and TDX KVM will use the hook.

SEV and TDX need to adjust the KVM page fault error code or refuse the
operation due to their restrictions. TDX requires that guest memory population
happen before finalizing the VM. A hypothetical sketch of such a callback
follows below.
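
For illustration only (this callback is not part of this series, and kvm_tdx,
to_kvm_tdx() and the 'finalized' field are hypothetical names), a TDX
implementation of the hook might look roughly like:

  static int tdx_pre_mmu_map_page(struct kvm_vcpu *vcpu,
                                  struct kvm_memory_mapping *mapping,
                                  u64 *error_code)
  {
          struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm); /* hypothetical */

          /* TDX requires population to happen before the VM is finalized. */
          if (kvm_tdx->finalized)
                  return -EINVAL;

          /* Private GPAs must be faulted in as private mappings. */
          if (kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
                  *error_code |= PFERR_PRIVATE_ACCESS;

          return 0;
  }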

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Make pre_mmu_map_page() to take error_code.
- Drop post_mmu_map_page().
- Drop struct kvm_memory_map.source check.
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++++++
3 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 5187fcf4b610..a5d4f4d5265d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -139,6 +139,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
+KVM_X86_OP_OPTIONAL(pre_mmu_map_page);

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3ce244ad44e5..2bf7f97f889b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1812,6 +1812,9 @@ struct kvm_x86_ops {

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+ int (*pre_mmu_map_page)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ba9c1720ac9..b76d854701d5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5868,6 +5868,26 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+static int kvm_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code)
+{
+ int r = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM) {
+ /* nothing */
+ } else if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM) {
+ if (kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
+ *error_code |= PFERR_PRIVATE_ACCESS;
+ } else if (kvm_x86_ops.pre_mmu_map_page)
+ r = static_call(kvm_x86_pre_mmu_map_page)(vcpu, mapping,
+ error_code);
+ else
+ r = -EOPNOTSUPP;
+
+ return r;
+}
+
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
struct kvm_memory_mapping *mapping)
{
@@ -5900,6 +5920,14 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
/* reload is optimized for repeated call. */
kvm_mmu_reload(vcpu);

+ /*
+ * Adjust error_code for VM-type. max_level is adjusted by
+ * gmem_max_level() callback.
+ */
+ r = kvm_pre_mmu_map_page(vcpu, mapping, &error_code);
+ if (r)
+ goto out;
+
r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
if (r)
goto out;
--
2.43.2


2024-04-10 22:10:44

by Isaku Yamahata

Subject: [PATCH v2 09/10] KVM: SVM: Implement pre_mmu_map_page() to refuse KVM_MAP_MEMORY

From: Isaku Yamahata <[email protected]>

Implement the vendor callback for KVM_MAP_MEMORY for SEV to return EOPNOTSUPP,
because SEV should use its SEV-specific ioctl instead. This patch only
demonstrates how to implement the hook.

Compile-tested only. I leave the actual implementation to the SEV folks.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Newly added
---
arch/x86/kvm/svm/sev.c | 6 ++++++
arch/x86/kvm/svm/svm.c | 2 ++
arch/x86/kvm/svm/svm.h | 9 +++++++++
3 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 1642d7d49bde..ab17d7c16636 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3322,3 +3322,9 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)

return p;
}
+
+int sev_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping, u64 *error_code)
+{
+ return -EOPNOTSUPP;
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 535018f152a3..a886d4409b00 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5057,6 +5057,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+
+ .pre_mmu_map_page = sev_pre_mmu_map_page,
};

/*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 323901782547..c8dafcb0bfc6 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -689,6 +689,9 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd);
int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd);
void sev_guest_memory_reclaimed(struct kvm *kvm);
+int sev_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping, u64 *error_code);
+
int sev_handle_vmgexit(struct kvm_vcpu *vcpu);

/* These symbols are used in common code and are stubbed below. */
@@ -713,6 +716,12 @@ static inline void __init sev_hardware_setup(void) {}
static inline void sev_hardware_unsetup(void) {}
static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
+static inline int sev_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u64 *error_code)
+{
+ return -EOPNOTSUPP;
+}
#define max_sev_asid 0
#endif

--
2.43.2


2024-04-10 22:11:20

by Isaku Yamahata

Subject: [PATCH v2 10/10] KVM: selftests: x86: Add test for KVM_MAP_MEMORY

From: Isaku Yamahata <[email protected]>

Add a test case to exercise KVM_MAP_MEMORY and run the guest to access the
pre-populated area. It tests KVM_MAP_MEMORY ioctl for KVM_X86_DEFAULT_VM
and KVM_X86_SW_PROTECTED_VM.

Signed-off-by: Isaku Yamahata <[email protected]>
---
v2:
- Catch up for uAPI change.
- Added smm mode test case.
- Added guest mode test case.
---
tools/include/uapi/linux/kvm.h | 8 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 479 ++++++++++++++++++
3 files changed, 488 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index c3308536482b..c742c403256a 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -2227,4 +2227,12 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+struct kvm_memory_mapping {
+ __u64 base_address;
+ __u64 size;
+ __u64 flags;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 871e2de3eb05..2b097b6ec267 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -144,6 +144,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test
TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += x86_64/map_memory_test

# Compiled outputs used by test targets
TEST_GEN_PROGS_EXTENDED_x86_64 += x86_64/nx_huge_pages_test
diff --git a/tools/testing/selftests/kvm/x86_64/map_memory_test.c b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
new file mode 100644
index 000000000000..d5728439542e
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024, Intel, Inc
+ *
+ * Author:
+ * Isaku Yamahata <isaku.yamahata at gmail.com>
+ */
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+/* Arbitrarily chosen value. Pick 3G */
+#define TEST_GVA 0xc0000000
+#define TEST_GPA TEST_GVA
+#define TEST_SIZE (SZ_2M + PAGE_SIZE)
+#define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
+#define TEST_SLOT 10
+
+/* Nested: VMXON and VMCS12 for VMX or VMCB for SVM */
+/* Arbitrarily chosen value. Pick 128MB after TEST_GVA. */
+#define NESTED_GVA (TEST_GVA + 128 * 1024 * 1024)
+#define NESTED_GPA (TEST_GPA + 128 * 1024 * 1024)
+#define NESTED_NPAGES 2
+#define NESTED_SIZE (NESTED_NPAGES * PAGE_SIZE)
+#define NESTED_SLOT 11
+
+static void guest_code(uint64_t base_gpa)
+{
+ volatile uint64_t val __used;
+ int i;
+
+ for (i = 0; i < TEST_NPAGES; i++) {
+ uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+
+ val = *src;
+ }
+
+ GUEST_DONE();
+}
+
+static void map_memory(struct kvm_vcpu *vcpu, u64 base_address, u64 size,
+ bool should_success)
+{
+ struct kvm_memory_mapping mapping = {
+ .base_address = base_address,
+ .size = size,
+ .flags = 0,
+ };
+ int ret;
+
+ do {
+ ret = __vcpu_ioctl(vcpu, KVM_MAP_MEMORY, &mapping);
+ } while (ret && (errno == EAGAIN || errno == EINTR));
+
+ if (should_success)
+ __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_MAP_MEMORY", ret, vcpu->vm);
+ else
+ /* No memory slot causes RET_PF_EMULATE. It results in -EINVAL. */
+ __TEST_ASSERT_VM_VCPU_IOCTL(ret && errno == EINVAL,
+ "KVM_MAP_MEMORY", ret, vcpu->vm);
+}
+
+static void set_smm(struct kvm_vcpu *vcpu, bool enter_or_leave)
+{
+ struct kvm_vcpu_events events;
+
+ vcpu_events_get(vcpu, &events);
+
+ events.smi.smm = !!enter_or_leave;
+ events.smi.pending = 0;
+ events.flags |= KVM_VCPUEVENT_VALID_SMM;
+
+ vcpu_events_set(vcpu, &events);
+}
+
+/* Copied from arch/x86/kvm/vmx/vmcs12.h */
+#define VMCS12_REVISION 0x11e57ed0
+
+struct vmcs_hdr {
+ u32 revision_id:31;
+ u32 shadow_vmcs:1;
+};
+
+typedef u64 natural_width;
+
+struct __packed vmcs12 {
+ struct vmcs_hdr hdr;
+ u32 abort;
+
+ u32 launch_state;
+ u32 padding[7];
+
+ u64 io_bitmap_a;
+ u64 io_bitmap_b;
+ u64 msr_bitmap;
+ u64 vm_exit_msr_store_addr;
+ u64 vm_exit_msr_load_addr;
+ u64 vm_entry_msr_load_addr;
+ u64 tsc_offset;
+ u64 virtual_apic_page_addr;
+ u64 apic_access_addr;
+ u64 posted_intr_desc_addr;
+ u64 ept_pointer;
+ u64 eoi_exit_bitmap0;
+ u64 eoi_exit_bitmap1;
+ u64 eoi_exit_bitmap2;
+ u64 eoi_exit_bitmap3;
+ u64 xss_exit_bitmap;
+ u64 guest_physical_address;
+ u64 vmcs_link_pointer;
+ u64 guest_ia32_debugctl;
+ u64 guest_ia32_pat;
+ u64 guest_ia32_efer;
+ u64 guest_ia32_perf_global_ctrl;
+ u64 guest_pdptr0;
+ u64 guest_pdptr1;
+ u64 guest_pdptr2;
+ u64 guest_pdptr3;
+ u64 guest_bndcfgs;
+ u64 host_ia32_pat;
+ u64 host_ia32_efer;
+ u64 host_ia32_perf_global_ctrl;
+ u64 vmread_bitmap;
+ u64 vmwrite_bitmap;
+ u64 vm_function_control;
+ u64 eptp_list_address;
+ u64 pml_address;
+ u64 encls_exiting_bitmap;
+ u64 tsc_multiplier;
+ u64 padding64[1];
+
+ natural_width cr0_guest_host_mask;
+ natural_width cr4_guest_host_mask;
+ natural_width cr0_read_shadow;
+ natural_width cr4_read_shadow;
+ natural_width dead_space[4];
+ natural_width exit_qualification;
+ natural_width guest_linear_address;
+ natural_width guest_cr0;
+ natural_width guest_cr3;
+ natural_width guest_cr4;
+ natural_width guest_es_base;
+ natural_width guest_cs_base;
+ natural_width guest_ss_base;
+ natural_width guest_ds_base;
+ natural_width guest_fs_base;
+ natural_width guest_gs_base;
+ natural_width guest_ldtr_base;
+ natural_width guest_tr_base;
+ natural_width guest_gdtr_base;
+ natural_width guest_idtr_base;
+ natural_width guest_dr7;
+ natural_width guest_rsp;
+ natural_width guest_rip;
+ natural_width guest_rflags;
+ natural_width guest_pending_dbg_exceptions;
+ natural_width guest_sysenter_esp;
+ natural_width guest_sysenter_eip;
+ natural_width host_cr0;
+ natural_width host_cr3;
+ natural_width host_cr4;
+ natural_width host_fs_base;
+ natural_width host_gs_base;
+ natural_width host_tr_base;
+ natural_width host_gdtr_base;
+ natural_width host_idtr_base;
+ natural_width host_ia32_sysenter_esp;
+ natural_width host_ia32_sysenter_eip;
+ natural_width host_rsp;
+ natural_width host_rip;
+ natural_width paddingl[8];
+
+ u32 pin_based_vm_exec_control;
+ u32 cpu_based_vm_exec_control;
+ u32 exception_bitmap;
+ u32 page_fault_error_code_mask;
+ u32 page_fault_error_code_match;
+ u32 cr3_target_count;
+ u32 vm_exit_controls;
+ u32 vm_exit_msr_store_count;
+ u32 vm_exit_msr_load_count;
+ u32 vm_entry_controls;
+ u32 vm_entry_msr_load_count;
+ u32 vm_entry_intr_info_field;
+ u32 vm_entry_exception_error_code;
+ u32 vm_entry_instruction_len;
+ u32 tpr_threshold;
+ u32 secondary_vm_exec_control;
+ u32 vm_instruction_error;
+ u32 vm_exit_reason;
+ u32 vm_exit_intr_info;
+ u32 vm_exit_intr_error_code;
+ u32 idt_vectoring_info_field;
+ u32 idt_vectoring_error_code;
+ u32 vm_exit_instruction_len;
+ u32 vmx_instruction_info;
+ u32 guest_es_limit;
+ u32 guest_cs_limit;
+ u32 guest_ss_limit;
+ u32 guest_ds_limit;
+ u32 guest_fs_limit;
+ u32 guest_gs_limit;
+ u32 guest_ldtr_limit;
+ u32 guest_tr_limit;
+ u32 guest_gdtr_limit;
+ u32 guest_idtr_limit;
+ u32 guest_es_ar_bytes;
+ u32 guest_cs_ar_bytes;
+ u32 guest_ss_ar_bytes;
+ u32 guest_ds_ar_bytes;
+ u32 guest_fs_ar_bytes;
+ u32 guest_gs_ar_bytes;
+ u32 guest_ldtr_ar_bytes;
+ u32 guest_tr_ar_bytes;
+ u32 guest_interruptibility_info;
+ u32 guest_activity_state;
+ u32 guest_sysenter_cs;
+ u32 host_ia32_sysenter_cs;
+ u32 vmx_preemption_timer_value;
+ u32 padding32[7];
+
+ u16 virtual_processor_id;
+ u16 posted_intr_nv;
+ u16 guest_es_selector;
+ u16 guest_cs_selector;
+ u16 guest_ss_selector;
+ u16 guest_ds_selector;
+ u16 guest_fs_selector;
+ u16 guest_gs_selector;
+ u16 guest_ldtr_selector;
+ u16 guest_tr_selector;
+ u16 guest_intr_status;
+ u16 host_es_selector;
+ u16 host_cs_selector;
+ u16 host_ss_selector;
+ u16 host_ds_selector;
+ u16 host_fs_selector;
+ u16 host_gs_selector;
+ u16 host_tr_selector;
+ u16 guest_pml_index;
+};
+
+/* Fill values to make KVM vmx_set_nested_state() pass. */
+void vmx_vmcs12_init(struct vmcs12 *vmcs12)
+{
+ *(__u32*)(vmcs12) = VMCS12_REVISION;
+
+ vmcs12->vmcs_link_pointer = -1;
+
+#define PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR 0x00000016
+ vmcs12->pin_based_vm_exec_control = PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
+
+#define CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR 0x0401e172
+ vmcs12->cpu_based_vm_exec_control = CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
+
+ vmcs12->secondary_vm_exec_control = 0;
+
+#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
+ vmcs12->vm_exit_controls = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+
+#define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR 0x000011ff
+ vmcs12->vm_entry_controls = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
+
+#define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
+#define VMXON_CR4_ALWAYSON X86_CR4_VMXE
+
+ /* host */
+ vmcs12->host_cr0 = VMXON_CR0_ALWAYSON;
+ vmcs12->host_cr3 = TEST_GPA;
+ vmcs12->host_cr4 = VMXON_CR4_ALWAYSON;
+
+ /* Non-zero to make KVM vmx check pass */
+ vmcs12->host_cs_selector = 8;
+ vmcs12->host_ss_selector = 8;
+ vmcs12->host_ds_selector = 8;
+ vmcs12->host_es_selector = 8;
+ vmcs12->host_fs_selector = 8;
+ vmcs12->host_gs_selector = 8;
+ vmcs12->host_tr_selector = 8;
+
+ /* guest */
+ vmcs12->guest_cr0 = VMXON_CR0_ALWAYSON;
+ vmcs12->guest_cr4 = VMXON_CR4_ALWAYSON;
+
+ vmcs12->guest_cs_selector = 0xf000;
+ vmcs12->guest_cs_base = 0xffff0000UL;
+ vmcs12->guest_cs_limit = 0xffff;
+ vmcs12->guest_cs_ar_bytes = 0x93 | 0x08;
+
+ vmcs12->guest_ds_selector = 0;
+ vmcs12->guest_ds_base = 0;
+ vmcs12->guest_ds_limit = 0xffff;
+ vmcs12->guest_ds_ar_bytes = 0x93;
+
+ vmcs12->guest_es_selector = 0;
+ vmcs12->guest_es_base = 0;
+ vmcs12->guest_es_limit = 0xffff;
+ vmcs12->guest_es_ar_bytes = 0x93;
+
+ vmcs12->guest_fs_selector = 0;
+ vmcs12->guest_fs_base = 0;
+ vmcs12->guest_fs_limit = 0xffff;
+ vmcs12->guest_fs_ar_bytes = 0x93;
+
+ vmcs12->guest_gs_selector = 0;
+ vmcs12->guest_gs_base = 0;
+ vmcs12->guest_gs_limit = 0xffff;
+ vmcs12->guest_gs_ar_bytes = 0x93;
+
+ vmcs12->guest_ss_selector = 0;
+ vmcs12->guest_ss_base = 0;
+ vmcs12->guest_ss_limit = 0xffff;
+ vmcs12->guest_ss_ar_bytes = 0x93;
+
+ vmcs12->guest_ldtr_selector = 0;
+ vmcs12->guest_ldtr_base = 0;
+ vmcs12->guest_ldtr_limit = 0xfff;
+ vmcs12->guest_ldtr_ar_bytes = 0x008b;
+
+ vmcs12->guest_gdtr_base = 0;
+ vmcs12->guest_gdtr_limit = 0xffff;
+
+ vmcs12->guest_idtr_base = 0;
+ vmcs12->guest_idtr_limit = 0xffff;
+
+ /* ACTIVE = 0 */
+ vmcs12->guest_activity_state = 0;
+
+ vmcs12->guest_interruptibility_info = 0;
+ vmcs12->guest_pending_dbg_exceptions = 0;
+
+ vmcs12->vm_entry_intr_info_field = 0;
+}
+
+void vmx_state_set(struct kvm_vcpu *vcpu, struct kvm_nested_state *state,
+ __u16 flags, __u64 vmxon_pa, __u64 vmcs12_pa)
+{
+ struct vmcs12 *vmcs12 = (struct vmcs12 *)state->data.vmx->vmcs12;
+
+ memset(state, 0, sizeof(*state) + KVM_STATE_NESTED_VMX_VMCS_SIZE);
+
+ state->flags = flags;
+ state->format = KVM_STATE_NESTED_FORMAT_VMX;
+ state->size = KVM_STATE_NESTED_VMX_VMCS_SIZE;
+
+ state->hdr.vmx.vmxon_pa = vmxon_pa;
+ state->hdr.vmx.vmcs12_pa = vmcs12_pa;
+ state->hdr.vmx.smm.flags = 0;
+ state->hdr.vmx.pad = 0;
+ state->hdr.vmx.flags = 0;
+ state->hdr.vmx.preemption_timer_deadline = 0;
+
+ vmx_vmcs12_init(vmcs12);
+
+ vcpu_nested_state_set(vcpu, state);
+}
+
+static void __test_map_memory(unsigned long vm_type, bool private,
+ bool smm, bool nested)
+{
+ struct kvm_nested_state *state = NULL;
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = vm_type,
+ };
+ struct kvm_sregs sregs;
+ struct kvm_vcpu *vcpu;
+ struct kvm_regs regs;
+ struct kvm_run *run;
+ struct kvm_vm *vm;
+ struct ucall uc;
+
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ TEST_GPA, TEST_SLOT, TEST_NPAGES,
+ private ? KVM_MEM_GUEST_MEMFD : 0);
+ virt_map(vm, TEST_GVA, TEST_GPA, TEST_NPAGES);
+
+ if (private)
+ vm_mem_set_private(vm, TEST_GPA, TEST_SIZE);
+ if (nested) {
+ size_t size = sizeof(*state);
+
+ if (kvm_cpu_has(X86_FEATURE_VMX)) {
+ size += KVM_STATE_NESTED_VMX_VMCS_SIZE;
+ vcpu_set_cpuid_feature(vcpu, X86_FEATURE_VMX);
+ }
+
+ state = malloc(size);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, NESTED_GPA,
+ NESTED_SLOT, NESTED_NPAGES, 0);
+ virt_map(vm, NESTED_GVA, NESTED_GPA, NESTED_NPAGES);
+ }
+
+ if (smm)
+ set_smm(vcpu, true);
+ if (nested) {
+ vcpu_regs_get(vcpu, &regs);
+ vcpu_sregs_get(vcpu, &sregs);
+ vmx_state_set(vcpu, state, KVM_STATE_NESTED_RUN_PENDING |
+ KVM_STATE_NESTED_GUEST_MODE,
+ NESTED_GPA, NESTED_GPA + PAGE_SIZE);
+ }
+ map_memory(vcpu, TEST_GPA, SZ_2M, true);
+ map_memory(vcpu, TEST_GPA + SZ_2M, PAGE_SIZE, true);
+ map_memory(vcpu, TEST_GPA + TEST_SIZE, PAGE_SIZE, false);
+ if (nested) {
+ vmx_state_set(vcpu, state, 0, -1, -1);
+ free(state);
+ vcpu_sregs_set(vcpu, &sregs);
+ vcpu_regs_set(vcpu, &regs);
+ }
+ if (smm)
+ set_smm(vcpu, false);
+
+ vcpu_args_set(vcpu, 1, TEST_GVA);
+ vcpu_run(vcpu);
+
+ run = vcpu->run;
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ case UCALL_DONE:
+ break;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ break;
+ }
+
+ kvm_vm_free(vm);
+}
+
+static void test_map_memory(unsigned long vm_type, bool private)
+{
+ if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type))) {
+ pr_info("Skipping tests for vm_type 0x%lx\n", vm_type);
+ return;
+ }
+
+ __test_map_memory(vm_type, private, false, false);
+
+ if (kvm_has_cap(KVM_CAP_VCPU_EVENTS) && kvm_has_cap(KVM_CAP_X86_SMM))
+ __test_map_memory(vm_type, private, true, false);
+ else
+ pr_info("skipping test for vm_type 0x%lx with smm\n", vm_type);
+
+ if (!kvm_has_cap(KVM_CAP_NESTED_STATE)) {
+ pr_info("Skipping test for vm_type 0x%lx with nesting\n", vm_type);
+ return;
+ }
+
+ if (kvm_cpu_has(X86_FEATURE_SVM)) {
+ pr_info("Implement nested SVM case\n");
+ return;
+ }
+ if (!kvm_cpu_has(X86_FEATURE_VMX)) {
+ pr_info("Skipping test for vm_type 0x%lx with nested VMX\n",
+ vm_type);
+ return;
+ }
+ __test_map_memory(vm_type, private, false, true);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_MAP_MEMORY));
+
+ test_map_memory(KVM_X86_DEFAULT_VM, false);
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, false);
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, true);
+ return 0;
+}
--
2.43.2


2024-04-15 19:14:30

by Edgecombe, Rick P

Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

I wouldn't call myself much of an expert on nested, but...

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> There are several options to populate L1 GPA irrelevant to vCPU mode.
> - Switch vCPU MMU only: This patch.
>   Pros: Concise implementation.
>   Cons: Heavily dependent on the KVM MMU implementation.

Is switching just the MMU enough here? Won't the MTRRs and other vcpu bits be
wrong?

> - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
>   Use __get/set_sregs2() to switch to/from SMM mode.
>   Pros: straightforward.
>   Cons: This may cause unintended side effects.

Cons make sense.

> - Refactor KVM page fault handler not to pass vCPU. Pass around necessary
>   parameters and struct kvm.
>   Pros: The end result will have clearly no side effects.
>   Cons: This will require big refactoring.

But doesn't the fault handler need the vCPU state?

> - Return error on guest mode or SMM mode:  Without this patch.
>   Pros: No additional patch.
>   Cons: Difficult to use.

Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
there shouldn't be an issue. If so, maybe this last one is not so horrible.

2024-04-15 19:48:50

by Edgecombe, Rick P

Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> @@ -5882,18 +5884,40 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
>         if (!tdp_enabled)
>                 return -EOPNOTSUPP;
>  
> +       /* Force to use L1 GPA despite of vcpu MMU mode. */
> +       is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +       if (is_smm ||
> +           vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +           vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +               vcpu->arch.hflags &= ~HF_SMM_MASK;
> +               mmu = vcpu->arch.mmu;
> +               walk_mmu = vcpu->arch.walk_mmu;
> +               vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +               vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +               kvm_mmu_reset_context(vcpu);
> +       }
> +
>         /* reload is optimized for repeated call. */

After the kvm_mmu_reset_context(), what benefit is there to the operation? And
it happening for every call of kvm_arch_vcpu_map_memory()?

>         kvm_mmu_reload(vcpu);

2024-04-15 21:19:59

by Sean Christopherson

Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> I wouldn't call myself much of an expert on nested, but...
>
> On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> > There are several options to populate L1 GPA irrelevant to vCPU mode.
> > - Switch vCPU MMU only: This patch.
> >   Pros: Concise implementation.
> >   Cons: Heavily dependent on the KVM MMU implementation.

Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

> Is switching just the MMU enough here? Won't the MTRRs and other vcpu bits be
> wrong?
>
> > - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
> >   Use __get/set_sregs2() to switch to/from SMM mode.
> >   Pros: straightforward.
> >   Cons: This may cause unintended side effects.
>
> Cons make sense.
>
> > - Refactor KVM page fault handler not to pass vCPU. Pass around necessary
> >   parameters and struct kvm.
> >   Pros: The end result will have clearly no side effects.
> >   Cons: This will require big refactoring.
>
> But doesn't the fault handler need the vCPU state?

Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There are
only six possible roots if TDP is enabled:

1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode

4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU eliminates
the SMM and guest_mode issues. If there is per-vCPU state that makes its way into
the TDP page tables, then we have problems, because it means that there is per-vCPU
state in per-VM structures that isn't accounted for.

There are a few edge cases where KVM treads carefully, e.g. if the fault is to
the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
per-vCPU state.

That said, I think this option is effectively 1b, because dropping the SMM vs.
guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's
just a different way of doing so.

The first question to answer is, do we want to return an error or "silently"
install mappings for !SMM, !guest_mode. And so this option becomes relevant only
_if_ we want to unconditionally install mappings for the 'base' mode.

> > - Return error on guest mode or SMM mode:  Without this patch.
> >   Pros: No additional patch.
> >   Cons: Difficult to use.
>
> Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> there shouldn't be an issue. If so, maybe this last one is not so horrible.

And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
guest_mode) basically invalidates the argument that returning an error makes
the ioctl() hard to use. I can imagine it might be hard to squeeze this
ioctl() into QEMU's existing code, but I don't buy that the ioctl() itself is
hard to use.

Literally the only thing userspace needs to do is set CPUID to implicitly select
between 4-level and 5-level paging. If userspace wants to pre-map memory during
live migration, or when jump-starting the guest with pre-defined state, simply
pre-map memory before stuffing guest state. In and of itself, that doesn't seem
difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

I would describe the overall cons for this patch versus returning an error
differently. Switching MMU state puts the complexity in the kernel. Returning
an error punts any complexity to userspace. Specifically, anything that KVM can
do regarding vCPU state to get the right MMU, userspace can do too.

Add on that silently doing things that effectively ignore guest state usually
ends badly, and I don't see a good argument for this patch (or any variant
thereof).

2024-04-15 21:36:31

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > But doesn't the fault handler need the vCPU state?
>
> Ignoring guest MTRRs, which will hopefully soon be a non-issue, no.  There are
> only six possible roots if TDP is enabled:
>
>       1. 4-level !SMM !guest_mode
>       2. 4-level  SMM !guest_mode
>       3. 5-level !SMM !guest_mode
>       4. 5-level  SMM !guest_mode
>       5. 4-level !SMM guest_mode
>       6. 5-level !SMM guest_mode
>
> 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> eliminates
> the SMM and guest_mode issues.  If there is per-vCPU state that makes its way
> into
> the TDP page tables, then we have problems, because it means that there is
> per-vCPU
> state in per-VM structures that isn't accounted for.
>
> There are a few edge cases where KVM treads carefully, e.g. if the fault is to
> the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
> per-vCPU state.
>
> That said, I think this option is effectively 1b, because dropping the SMM vs.
> guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's
> just a different way of doing so.
>
> The first question to answer is, do we want to return an error or "silently"
> install mappings for !SMM, !guest_mode.  And so this option becomes relevant
> only
> _if_ we want to unconditionally install mappings for the 'base" mode.

Ah, I thought there was some logic around CR0.CD.

>
> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For
> > TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
>
> And the fact there are so variables to control (MAXPHADDR, SMM, and
> guest_mode)
> basically invalidates the argument that returning an error makes the ioctl()
> hard
> to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.
>
> Literally the only thing userspace needs to do is set CPUID to implicitly
> select
> between 4-level and 5-level paging.  If userspace wants to pre-map memory
> during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state.  In and of itself, that doesn't
> seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be
> dangerous).
>
> I would describe the overall cons for this patch versus returning an error
> differently.  Switching MMU state puts the complexity in the kernel. 
> Returning
> an error punts any complexity to userspace.  Specifically, anything that KVM
> can
> do regarding vCPU state to get the right MMU, userspace can do too.
>  
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Great.

2024-04-15 22:59:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > > But doesn't the fault handler need the vCPU state?
> >
> > Ignoring guest MTRRs, which will hopefully soon be a non-issue, no.  There are
> > only six possible roots if TDP is enabled:
> >
> >       1. 4-level !SMM !guest_mode
> >       2. 4-level  SMM !guest_mode
> >       3. 5-level !SMM !guest_mode
> >       4. 5-level  SMM !guest_mode
> >       5. 4-level !SMM guest_mode
> >       6. 5-level !SMM guest_mode
> >
> > 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> > eliminates the SMM and guest_mode issues.  If there is per-vCPU state that
> > makes its way into the TDP page tables, then we have problems, because it
> > means that there is per-vCPU state in per-VM structures that isn't
> > accounted for.
> >
> > There are a few edge cases where KVM treads carefully, e.g. if the fault is
> > to the vCPU's APIC-access page, but KVM manually handles those to avoid
> > consuming per-vCPU state.
> >
> > That said, I think this option is effectively 1b, because dropping the SMM
> > vs. guest_mode state has the same uAPI problems as forcibly swapping the
> > MMU, it's just a different way of doing so.
> >
> > The first question to answer is, do we want to return an error or
> > "silently" install mappings for !SMM, !guest_mode.  And so this option
> > becomes relevant only _if_ we want to unconditionally install mappings for
> > the 'base" mode.
>
> Ah, I thought there was some logic around CR0.CD.

There is, but it's hopefully going the way of the dodo, along with full MTRR
virtualization:

https://lore.kernel.org/all/[email protected]

2024-04-15 23:27:38

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] KVM: Document KVM_MAP_MEMORY ioctl

Nits only...

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Adds documentation of KVM_MAP_MEMORY ioctl. [1]
>
> It populates guest memory.  It doesn't do extra operations on the
> underlying technology-specific initialization [2].  For example,
> CoCo-related operations won't be performed.  Concretely for TDX, this API
> won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND().  Vendor-specific APIs
> are required for such operations.
>
> The key point is to adapt of vcpu ioctl instead of VM ioctl.

Not sure what you are trying to say here.

>   First,
> populating guest memory requires vcpu.  If it is VM ioctl, we need to pick
> one vcpu somehow.  Secondly, vcpu ioctl allows each vcpu to invoke this
> ioctl in parallel.  It helps to scale regarding guest memory size, e.g.,
> hundreds of GB.

I guess you are explaining why this is a vCPU ioctl instead of a KVM ioctl. Is
this clearer:

Although the operation is sort of a VM operation, make the ioctl a vCPU ioctl
instead of KVM ioctl. Do this because a vCPU is needed internally for the fault
path anyway, and because... (I don't follow the second point).

>
> [1] https://lore.kernel.org/kvm/[email protected]/
> [2] https://lore.kernel.org/kvm/[email protected]/
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v2:
> - Make flags reserved for future use. (Sean, Michael)
> - Clarified the supposed use case. (Kai)
> - Dropped source member of struct kvm_memory_mapping. (Michael)
> - Change the unit from pages to bytes. (Michael)
> ---
>  Documentation/virt/kvm/api.rst | 52 ++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0b76ff5030d..6ee3d2b51a2b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,58 @@ a single guest_memfd file, but the bound ranges must
> not overlap).
>  
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_memory_mapping (in/out)
> +:Returns: 0 on success, < 0 on error
> +
> +Errors:
> +
> +  ========== =============================================================
> +  EINVAL     invalid parameters
> +  EAGAIN     The region is only processed partially.  The caller should
> +             issue the ioctl with the updated parameters when `size` > 0.
> +  EINTR      An unmasked signal is pending.  The region may be processed
> +             partially.
> +  EFAULT     The parameter address was invalid.  The specified region
> +             `base_address` and `size` was invalid.  The region isn't
> +             covered by KVM memory slot.
> +  EOPNOTSUPP The architecture doesn't support this operation. The x86 two
> +             dimensional paging supports this API.  the x86 kvm shadow mmu
> +             doesn't support it.  The other arch KVM doesn't support it.
> +  ========== =============================================================
> +
> +::
> +
> +  struct kvm_memory_mapping {
> +       __u64 base_address;
> +       __u64 size;
> +       __u64 flags;
> +  };
> +
> +KVM_MAP_MEMORY populates guest memory with the range, `base_address` in (L1)
> +guest physical address(GPA) and `size` in bytes.  `flags` must be zero.  It's
> +reserved for future use.  When the ioctl returns, the input values are
> updated
> +to point to the remaining range.  If `size` > 0 on return, the caller should
> +issue the ioctl with the updated parameters.
> +
> +Multiple vcpus are allowed to call this ioctl simultaneously.  It's not
> +mandatory for all vcpus to issue this ioctl.  A single vcpu can suffice.
> +Multiple vcpus invocations are utilized for scalability to process the
> +population in parallel.  If multiple vcpus call this ioctl in parallel, it
> may
> +result in the error of EAGAIN due to race conditions.
> +
> +This population is restricted to the "pure" population without triggering
> +underlying technology-specific initialization.  For example, CoCo-related
> +operations won't perform.  In the case of TDX, this API won't invoke
> +TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND().  Vendor-specific uAPIs are required
> for
> +such operations.

Probably don't want to have TDX bits in here yet. Since it's talking about what
KVM_MAP_MEMORY is *not* doing, it can just be dropped.

> +
> +
>  5. The kvm_run structure
>  ========================
>  
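
As a rough illustration of the retry semantics quoted above, here is a minimal
userspace sketch. The helper name and error handling are made up for the
example, and KVM_MAP_MEMORY plus struct kvm_memory_mapping are assumed to come
from the proposed uAPI header:

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Pre-populate [gpa, gpa + len) through one vCPU fd; sketch only. */
  static int premap_range(int vcpu_fd, __u64 gpa, __u64 len)
  {
          struct kvm_memory_mapping m = {
                  .base_address = gpa,
                  .size = len,
                  .flags = 0,     /* must be zero, reserved for future use */
          };

          /* On return, base_address/size describe the remaining range. */
          while (m.size) {
                  if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &m) < 0 &&
                      errno != EAGAIN && errno != EINTR)
                          return -errno;  /* EINVAL, EFAULT, EOPNOTSUPP, ... */
          }
          return 0;
  }

Per the quoted documentation, multiple vCPU threads could each run such a loop
over disjoint ranges to parallelize the population, at the cost of occasionally
retrying on EAGAIN.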

2024-04-15 23:47:16

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] KVM: Document KVM_MAP_MEMORY ioctl

On Mon, Apr 15, 2024 at 11:27:20PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> Nits only...
>
> On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Adds documentation of KVM_MAP_MEMORY ioctl. [1]
> >
> > It populates guest memory.  It doesn't do extra operations on the
> > underlying technology-specific initialization [2].  For example,
> > CoCo-related operations won't be performed.  Concretely for TDX, this API
> > won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND().  Vendor-specific APIs
> > are required for such operations.
> >
> > The key point is to adapt of vcpu ioctl instead of VM ioctl.
>
> Not sure what you are trying to say here.
>
> >   First,
> > populating guest memory requires vcpu.  If it is VM ioctl, we need to pick
> > one vcpu somehow.  Secondly, vcpu ioctl allows each vcpu to invoke this
> > ioctl in parallel.  It helps to scale regarding guest memory size, e.g.,
> > hundreds of GB.
>
> I guess you are explaining why this is a vCPU ioctl instead of a KVM ioctl. Is
> this clearer:

Right, I wanted to explain why I chose vCPU ioctl. Let me update the commit
message.


> Although the operation is sort of a VM operation, make the ioctl a vCPU ioctl
> instead of KVM ioctl. Do this because a vCPU is needed internally for the fault
> path anyway, and because... (I don't follow the second point).
>
> >
> > [1] https://lore.kernel.org/kvm/[email protected]/
> > [2] https://lore.kernel.org/kvm/[email protected]/
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > v2:
> > - Make flags reserved for future use. (Sean, Michael)
> > - Clarified the supposed use case. (Kai)
> > - Dropped source member of struct kvm_memory_mapping. (Michael)
> > - Change the unit from pages to bytes. (Michael)
> > ---
> >  Documentation/virt/kvm/api.rst | 52 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 52 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f0b76ff5030d..6ee3d2b51a2b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,58 @@ a single guest_memfd file, but the bound ranges must
> > not overlap).
> >  
> >  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >  
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_memory_mapping (in/out)
> > +:Returns: 0 on success, < 0 on error
> > +
> > +Errors:
> > +
> > +  ========== =============================================================
> > +  EINVAL     invalid parameters
> > +  EAGAIN     The region is only processed partially.  The caller should
> > +             issue the ioctl with the updated parameters when `size` > 0.
> > +  EINTR      An unmasked signal is pending.  The region may be processed
> > +             partially.
> > +  EFAULT     The parameter address was invalid.  The specified region
> > +             `base_address` and `size` was invalid.  The region isn't
> > +             covered by KVM memory slot.
> > +  EOPNOTSUPP The architecture doesn't support this operation. The x86 two
> > +             dimensional paging supports this API.  the x86 kvm shadow mmu
> > +             doesn't support it.  The other arch KVM doesn't support it.
> > +  ========== =============================================================
> > +
> > +::
> > +
> > +  struct kvm_memory_mapping {
> > +       __u64 base_address;
> > +       __u64 size;
> > +       __u64 flags;
> > +  };
> > +
> > +KVM_MAP_MEMORY populates guest memory with the range, `base_address` in (L1)
> > +guest physical address(GPA) and `size` in bytes.  `flags` must be zero.  It's
> > +reserved for future use.  When the ioctl returns, the input values are
> > updated
> > +to point to the remaining range.  If `size` > 0 on return, the caller should
> > +issue the ioctl with the updated parameters.
> > +
> > +Multiple vcpus are allowed to call this ioctl simultaneously.  It's not
> > +mandatory for all vcpus to issue this ioctl.  A single vcpu can suffice.
> > +Multiple vcpus invocations are utilized for scalability to process the
> > +population in parallel.  If multiple vcpus call this ioctl in parallel, it
> > may
> > +result in the error of EAGAIN due to race conditions.
> > +
> > +This population is restricted to the "pure" population without triggering
> > +underlying technology-specific initialization.  For example, CoCo-related
> > +operations won't perform.  In the case of TDX, this API won't invoke
> > +TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND().  Vendor-specific uAPIs are required
> > for
> > +such operations.
>
> Probably don't want to have TDX bits in here yet. Since it's talking about what
> KVM_MAP_MEMORY is *not* doing, it can just be dropped.

Ok. Will drop it.
--
Isaku Yamahata <[email protected]>

2024-04-16 01:49:47

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, Apr 15, 2024 at 02:17:02PM -0700,
Sean Christopherson <[email protected]> wrote:

> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
>
> And the fact there are so variables to control (MAXPHADDR, SMM, and guest_mode)
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.
>
> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging. If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state. In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
>
> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel. Returning
> an error punts any complexity to userspace. Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Ok, here is an experimental patch on top of 7/10 to return an error. Is this
the right direction? Or do we want to invoke the KVM page fault handler without any check?

I can see the following options.

- Error if vCPU is in SMM mode or guest mode: This patch
Defer the decision until the use cases come up. We can utilize
KVM_CAP_MAP_MEMORY and struct kvm_memory_mapping.flags for future
enhancement.
Pro: Keeps room for future enhancement by deferring the decision on
still-unclear use cases.
Con: The userspace VMM has to check/switch the vCPU mode.

- No check of vCPU mode and go on
Pro: It works.
Con: Unclear how the uAPI should behave without concrete use cases.

- Always populate with L1 GPA:
This is a bad idea.

---
arch/x86/kvm/x86.c | 32 +++++++++-----------------------
1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ba9c1720ac9..2f3ceda5c225 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,10 +5871,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
struct kvm_memory_mapping *mapping)
{
- struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
u64 end, error_code = 0;
u8 level = PG_LEVEL_4K;
- bool is_smm;
int r;

/*
@@ -5884,25 +5882,21 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
if (!tdp_enabled)
return -EOPNOTSUPP;

- /* Force to use L1 GPA despite of vcpu MMU mode. */
- is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
- if (is_smm ||
- vcpu->arch.mmu != &vcpu->arch.root_mmu ||
- vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
- vcpu->arch.hflags &= ~HF_SMM_MASK;
- mmu = vcpu->arch.mmu;
- walk_mmu = vcpu->arch.walk_mmu;
- vcpu->arch.mmu = &vcpu->arch.root_mmu;
- vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
- kvm_mmu_reset_context(vcpu);
- }
+ /*
+ * SMM mode results in populating SMM memory space with memslots id = 1.
+ * guest mode results in populating with L2 GPA.
+ * Don't support those cases for now and punt them for the future
+ * discussion.
+ */
+ if (is_smm(vcpu) || is_guest_mode(vcpu))
+ return -EOPNOTSUPP;

/* reload is optimized for repeated call. */
kvm_mmu_reload(vcpu);

r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
if (r)
- goto out;
+ return r;

/* mapping->base_address is not necessarily aligned to level-hugepage. */
end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
@@ -5910,14 +5904,6 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
mapping->size -= end - mapping->base_address;
mapping->base_address = end;

-out:
- /* Restore MMU state. */
- if (is_smm || mmu) {
- vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
- vcpu->arch.mmu = mmu;
- vcpu->arch.walk_mmu = walk_mmu;
- kvm_mmu_reset_context(vcpu);
- }
return r;
}

--
2.43.2

--
Isaku Yamahata <[email protected]>

2024-04-16 08:23:03

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()


>This patch makes the emulation_type always set irrelevant to the return
>code. kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),
>and references the value only when PF_RET_EMULATE is returned. Therefore,
>this adjustment doesn't affect functionality.

This is benign. But what's the benefit of doing this?

>+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>+ u64 err, bool prefetch, int *emulation_type)
> {
> struct kvm_page_fault fault = {
> .addr = cr2_or_gpa,
>@@ -318,14 +318,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> }
>
>- /*
>- * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
>- * guest perspective and have already been counted at the time of the
>- * original fault.
>- */
>- if (!prefetch)
>- vcpu->stat.pf_taken++;
>-
> if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
> r = kvm_tdp_page_fault(vcpu, &fault);
> else
>@@ -333,12 +325,30 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> if (r == RET_PF_EMULATE && fault.is_private) {
> kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
>- return -EFAULT;
>+ r = -EFAULT;
> }
>
> if (fault.write_fault_to_shadow_pgtable && emulation_type)
> *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>
>+ return r;
>+}
>+
>+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>+ u64 err, bool prefetch, int *emulation_type)
>+{
>+ int r;
>+
>+ /*
>+ * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
>+ * guest perspective and have already been counted at the time of the
>+ * original fault.
>+ */
>+ if (!prefetch)
>+ vcpu->stat.pf_taken++;
>+
>+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);

bail out if r < 0?

>+
> /*
> * Similar to above, prefetch faults aren't truly spurious, and the
> * async #PF path doesn't do emulation. Do count faults that are fixed
>--
>2.43.2
>
>

2024-04-16 14:25:40

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, Apr 15, 2024, Isaku Yamahata wrote:
> On Mon, Apr 15, 2024 at 02:17:02PM -0700,
> Sean Christopherson <[email protected]> wrote:
>
> > > > - Return error on guest mode or SMM mode:  Without this patch.
> > > >   Pros: No additional patch.
> > > >   Cons: Difficult to use.
> > >
> > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> >
> > And the fact there are so variables to control (MAXPHADDR, SMM, and guest_mode)
> > basically invalidates the argument that returning an error makes the ioctl() hard
> > to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's
> > existing code, but I don't buy that the ioctl() itself is hard to use.
> >
> > Literally the only thing userspace needs to do is set CPUID to implicitly select
> > between 4-level and 5-level paging. If userspace wants to pre-map memory during
> > live migration, or when jump-starting the guest with pre-defined state, simply
> > pre-map memory before stuffing guest state. In and of itself, that doesn't seem
> > difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> > kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> > disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
> >
> > I would describe the overall cons for this patch versus returning an error
> > differently. Switching MMU state puts the complexity in the kernel. Returning
> > an error punts any complexity to userspace. Specifically, anything that KVM can
> > do regarding vCPU state to get the right MMU, userspace can do too.
> >
> > Add on that silently doing things that effectively ignore guest state usually
> > ends badly, and I don't see a good argument for this patch (or any variant
> > thereof).
>
> Ok, here is a experimental patch on top of the 7/10 to return error. Is this
> a direction? or do we want to invoke KVM page fault handler without any check?
>
> I can see the following options.
>
> - Error if vCPU is in SMM mode or guest mode: This patch
> Defer the decision until the use cases come up. We can utilize
> KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future
> enhancement.
> Pro: Keep room for future enhancement for unclear use cases to defer
> the decision.
> Con: The use space VMM has to check/switch the vCPU mode.
>
> - No check of vCPU mode and go on
> Pro: It works.
> Con: Unclear how the uAPI should be without concrete use cases.
>
> - Always populate with L1 GPA:
> This is a bad idea.

Not always. If L1 is using shadow paging, then L1 and L2 GPAs are in the same
domain from KVM's perspective.

As I said in v1 (I think it was v1), whether or not mapping an L1 GPA is supported
should be a property of the mmu, not an explicit check. As far as the TDP MMU is
concerned, there's nothing special about SMM nor is there anything special about
guest_mode, except for the fact that they use different roots than "normal" mode.
But that part Just Works.

And if L1 is using TDP, i.e. KVM is shadowing L1's TDP page tables, and L2 is
active, then vcpu->arch.mmu will point at a variation of the shadow MMU, e.g.
the PTTYPE_EPT MMU on Intel CPUs. The shadow MMU won't support pre-mapping GPAs
because it's non-sensical (legacy shadow paging needs a GVA, nested TDP needs an
L2 GPA), and so the ioctl() fails because mmu->map_gpa or whatever is NULL.

In other words, unless I'm forgetting something, explicit checks for "unsupported"
modes should be unnecessary, because

2024-04-16 14:37:32

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Extract out __kvm_mmu_do_page_fault() from kvm_mmu_do_page_fault().  The
> inner function is to initialize struct kvm_page_fault and to call the fault
> handler, and the outer function handles updating stats and converting
> return code.  KVM_MAP_MEMORY will call the KVM page fault handler.
>
> This patch makes the emulation_type always set irrelevant to the return
a comma would help parse this better ^
> code.

>   kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),

Not technically correct, there are other callers that pass NULL for
emulation_type.

> and references the value only when PF_RET_EMULATE is returned.  Therefore,
> this adjustment doesn't affect functionality.

Is there a problem with dropping the argument then?

>
> No functional change intended.

Can we not use the "intended"? It sounds like hedging for excuses.

>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v2:
> - Newly introduced. (Sean)
> ---
>  arch/x86/kvm/mmu/mmu_internal.h | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e68a60974cf4..9baae6c223ee 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -287,8 +287,8 @@ static inline void
> kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>                                       fault->is_private);
>  }
>  
> -static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t
> cr2_or_gpa,
> -                                       u64 err, bool prefetch, int
> *emulation_type)
> +static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t
> cr2_or_gpa,
> +                                         u64 err, bool prefetch, int
> *emulation_type)
>  {
>         struct kvm_page_fault fault = {
>                 .addr = cr2_or_gpa,
> @@ -318,14 +318,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu
> *vcpu, gpa_t cr2_or_gpa,
>                 fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>         }
>  
> -       /*
> -        * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
> -        * guest perspective and have already been counted at the time of the
> -        * original fault.
> -        */
> -       if (!prefetch)
> -               vcpu->stat.pf_taken++;
> -
>         if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
>                 r = kvm_tdp_page_fault(vcpu, &fault);
>         else
> @@ -333,12 +325,30 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu
> *vcpu, gpa_t cr2_or_gpa,
>  
>         if (r == RET_PF_EMULATE && fault.is_private) {
>                 kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> -               return -EFAULT;
> +               r = -EFAULT;
>         }
>  
>         if (fault.write_fault_to_shadow_pgtable && emulation_type)
>                 *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>  
> +       return r;
> +}
> +
> +static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t
> cr2_or_gpa,
> +                                       u64 err, bool prefetch, int
> *emulation_type)
> +{
> +       int r;
> +
> +       /*
> +        * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
> +        * guest perspective and have already been counted at the time of the
> +        * original fault.
> +        */
> +       if (!prefetch)
> +               vcpu->stat.pf_taken++;

From the name, it makes sense to not count KVM_MAP_MEMORY as a pf_taken. But
kvm_arch_async_page_ready() increments it as well. Which makes it more like a
"faulted-in" count. I think the code in this patch is ok.

> +
> +       r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch,
> emulation_type);
> +
>         /*
>          * Similar to above, prefetch faults aren't truly spurious, and the
>          * async #PF path doesn't do emulation.  Do count faults that are
> fixed

2024-04-16 14:41:03

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] KVM: x86/mmu: Make __kvm_mmu_do_page_fault() return mapped level

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The guest memory population logic will need to know what page size or level
> (4K, 2M, ...) is mapped.

TDX needs this, but do the normal VM users need to have it fixed to 4k? Is it
actually good?

>
> Signed-off-by: Isaku Yamahata <[email protected]>


2024-04-16 14:43:25

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 02/10] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add a new ioctl KVM_MAP_MEMORY in the KVM common code. It iterates on the
> memory range and calls the arch-specific function.  Add stub arch function
> as a weak symbol.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>

Reviewed-by: Rick Edgecombe <[email protected]>

But one comment below.
> ---
> v2:
> - Drop need_resched(). (David, Sean, Kai)
> - Move cond_resched() at the end of loop. (Kai)
> - Drop added check. (David)
> - Use EINTR instead of ERESTART. (David, Sean)
> - Fix srcu lock leak. (Kai, Sean)
> - Add comment above copy_to_user().
> - Drop pointless comment. (Sean)
> - Drop kvm_arch_vcpu_pre_map_memory(). (Sean)
> - Don't overwrite error code. (Sean, David)
> - Make the parameter in bytes, not pages. (Michael)
> - Drop source member in struct kvm_memory_mapping. (Sean, Michael)
> ---
>  include/linux/kvm_host.h |  3 +++
>  include/uapi/linux/kvm.h |  9 +++++++
>  virt/kvm/kvm_main.c      | 54 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 48f31dcd318a..e56a0c7e5b42 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2445,4 +2445,7 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_PRIVATE_MEM */
>  
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> +                            struct kvm_memory_mapping *mapping);
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..972aa9e054d3 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -917,6 +917,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_MEMORY_ATTRIBUTES 233
>  #define KVM_CAP_GUEST_MEMFD 234
>  #define KVM_CAP_VM_TYPES 235
> +#define KVM_CAP_MAP_MEMORY 236

This can go in a later patch.

2024-04-16 14:47:12

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
>  
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> +                    u8 *level)
> +{
> +       int r;
> +
> +       /* Restrict to TDP page fault. */
> +       if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> +               return -EINVAL;
> +
> +       r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, false, NULL,
> level);

Why not prefetch = true? Doesn't it fit? It looks like the behavior will be to
not set the access bit.

> +       if (r < 0)
> +               return r;
> +
> +       switch (r) {
> +       case RET_PF_RETRY:
> +               return -EAGAIN;
> +
> +       case RET_PF_FIXED:
> +       case RET_PF_SPURIOUS:
> +               return 0;
> +
> +       case RET_PF_EMULATE:
> +               return -EINVAL;
> +
> +       case RET_PF_CONTINUE:
> +       case RET_PF_INVALID:
> +       default:
> +               WARN_ON_ONCE(r);
> +               return -EIO;
> +       }
> +}

2024-04-16 14:59:02

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] KVM: x86: Add a hook in kvm_arch_vcpu_map_memory()

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> +static int kvm_pre_mmu_map_page(struct kvm_vcpu *vcpu,
> +                               struct kvm_memory_mapping *mapping,
> +                               u64 *error_code)
> +{
> +       int r = 0;
> +
> +       if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM) {
> +               /* nothing */

On the Intel side, vt_pre_mmu_map_page will handle doing nothing. Is there a
reason the AMD side can't do the same thing?

> +       } else if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM) {
> +               if (kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping-
> >base_address)))
> +                       *error_code |= PFERR_PRIVATE_ACCESS;

Not suggesting to do anything about it for this series, but there seems to be a
growing amount of manual KVM_X86_SW_PROTECTED_VM checks. I guess the problem
with giving it its own x86_ops is figuring which arch calls to use. Hmm.

> +       } else if (kvm_x86_ops.pre_mmu_map_page)
> +               r = static_call(kvm_x86_pre_mmu_map_page)(vcpu, mapping,
> +                                                         error_code);
> +       else
> +               r = -EOPNOTSUPP;

Do we actually need this last check?

> +
> +       return r;
> +}

2024-04-16 15:13:02

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 06/10] KVM: x86: Implement kvm_arch_vcpu_map_memory()

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire KVM_MAP_MEMORY ioctl to kvm_mmu_map_tdp_page() to populate guest
> memory.  When KVM_CREATE_VCPU creates vCPU, it initializes the x86
> KVM MMU part by kvm_mmu_create() and kvm_init_mmu().  vCPU is ready to
> invoke the KVM page fault handler.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v2:
> - Catch up the change of struct kvm_memory_mapping. (Sean)
> - Removed mapping level check. Push it down into vendor code. (David, Sean)
> - Rename goal_level to level. (Sean)
> - Drop kvm_arch_pre_vcpu_map_memory(), directly call kvm_mmu_reload().
>   (David, Sean)
> - Fixed the update of mapping.
> ---
>  arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2d2619d3eee4..2c765de3531e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4713,6 +4713,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long
> ext)
>         case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>         case KVM_CAP_IRQFD_RESAMPLE:
>         case KVM_CAP_MEMORY_FAULT_INFO:
> +       case KVM_CAP_MAP_MEMORY:
>                 r = 1;
>                 break;

Should we add this after all of the pieces are in place?

>         case KVM_CAP_EXIT_HYPERCALL:
> @@ -5867,6 +5868,35 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu
> *vcpu,
>         }
>  }
>  
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> +                            struct kvm_memory_mapping *mapping)
> +{
> +       u64 end, error_code = 0;
> +       u8 level = PG_LEVEL_4K;
> +       int r;
> +
> +       /*
> +        * Shadow paging uses GVA for kvm page fault.  The first
> implementation
> +        * supports GPA only to avoid confusion.
> +        */
> +       if (!tdp_enabled)
> +               return -EOPNOTSUPP;

It's not confusion, it's that you can't pre-map GPAs for legacy shadow paging.
Or are you saying why not support pre-mapping GVAs? I think that discussion
belongs more in the commit log. The code should just say it's not possible to
pre-map GPAs in shadow paging.

> +
> +       /* reload is optimized for repeated call. */
> +       kvm_mmu_reload(vcpu);
> +
> +       r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
> +       if (r)
> +               return r;
> +
> +       /* mapping->base_address is not necessarily aligned to level-hugepage.
> */
> +       end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
> +               KVM_HPAGE_SIZE(level);
> +       mapping->size -= end - mapping->base_address;
> +       mapping->base_address = end;
> +       return r;
> +}
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>                          unsigned int ioctl, unsigned long arg)
>  {
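
To make the advance arithmetic in the quoted hunk concrete, a hedged worked
example with made-up numbers (the level is whatever the fault handler
returned):

  /*
   * Illustration only: suppose the fault handler mapped a 2M page and
   * mapping->base_address = 0x40201000 (not 2M-aligned).
   *
   *   end   = (0x40201000 & ~(2M - 1)) + 2M = 0x40400000
   *   size -=  0x40400000 - 0x40201000      (i.e. size -= 0x1ff000)
   *   base_address = 0x40400000
   *
   * The request is advanced past the whole huge page that contains
   * base_address, which consumes less than one full huge page of the
   * requested size when base_address is unaligned.
   */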

2024-04-16 17:12:28

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> +       /* Force to use L1 GPA despite of vcpu MMU mode. */
> +       is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +       if (is_smm ||
> +           vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +           vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +               vcpu->arch.hflags &= ~HF_SMM_MASK;

0-day informs me that the definition for HF_SMM_MASK depends on CONFIG_KVM_SMM.

> +               mmu = vcpu->arch.mmu;
> +               walk_mmu = vcpu->arch.walk_mmu;
> +               vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +               vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +               kvm_mmu_reset_context(vcpu);
> +       }
> +


2024-04-16 21:41:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> > I wouldn't call myself much of an expert on nested, but...
> >
> > On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> > > There are several options to populate L1 GPA irrelevant to vCPU mode.
> > > - Switch vCPU MMU only: This patch.
> > > Pros: Concise implementation.
> > > Cons: Heavily dependent on the KVM MMU implementation.
>
> Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

+1.

> The first question to answer is, do we want to return an error or "silently"
> install mappings for !SMM, !guest_mode. And so this option becomes relevant only
> _if_ we want to unconditionally install mappings for the 'base" mode.
>
> > > - Return error on guest mode or SMM mode: Without this patch.
> > > Pros: No additional patch.
> > > Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.

It doesn't even have to be ABI that it gives an error. As you say,
this ioctl can just be advisory only for !confidential machines. Even
if it were implemented, the shadow MMU can drop roots at any moment
and/or kill the mapping via the shrinker.

That said, I can't fully shake the feeling that this ioctl should be
an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
implementation was ugly but the API was fine. Sorry about this;
patches 3-5 can still be included in kvm-coco-queue sooner rather than
later.

> And the fact there are so variables to control (MAXPHADDR, SMM, and guest_mode)
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.

Nah, I don't think so. With TDX it's just MAXPHYADDR; just invoke it
after KVM_SET_CPUID2 or TDX_INIT_VCPU which is very early.

> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging. If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state. In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

Hehe :) the machine_done_notifier is probably a better place. /me
checks... yes it's exactly where Xiaoyao did it (tdx_finalize_vm is
the notifier, it calls KVM_TDX_INIT_VCPU, from tdx_post_init_vcpus and
then KVM_MEMORY_MAPPING).

> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel. Returning
> an error punts any complexity to userspace. Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Agreed.

Paolo


2024-04-16 23:01:04

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Tue, Apr 16, 2024, Paolo Bonzini wrote:
> On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <[email protected]> wrote:
> > The first question to answer is, do we want to return an error or "silently"
> > install mappings for !SMM, !guest_mode. And so this option becomes relevant only
> > _if_ we want to unconditionally install mappings for the 'base" mode.
> >
> > > > - Return error on guest mode or SMM mode: Without this patch.
> > > > Pros: No additional patch.
> > > > Cons: Difficult to use.
> > >
> > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
>
> It doesn't even have to be ABI that it gives an error. As you say,
> this ioctl can just be advisory only for !confidential machines. Even
> if it were implemented, the shadow MMU can drop roots at any moment

Sure, but there's a difference between KVM _potentially_ dropping roots and
guaranteed failure because userspace is trying to do something that's unsupported.
But I think this is a non-issue, because it should really just be as simple as:

if (!mmu->pre_map_memory)
return -EOPNOTSUPP;

Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:

if (!tdp_mmu_enabled || !mmu->root_role.direct)
return -EOPNOTSUPP;

> and/or kill the mapping via the shrinker.

Ugh, we really need to kill that code.

> That said, I can't fully shake the feeling that this ioctl should be
> an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> implementation was ugly but the API was fine.

Hmm, but IMO the implementation was ugly in no small part because of the constraints
put on KVM by the API. Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD in the same
ioctl() forced KVM to operate on vcpu0, and necessitated shoving temporary data
into a per-VM structure in order to get the source contents into TDH.MEM.PAGE.ADD.

We could eliminate the vcpu0 grossness, but it would require a massive refactor,
which is also not a problem per se, but it's obviously not free. Eliminating
kvm_tdx.source_page is also doable, but it's not clear to me that end result would
be a net positive.

If userspace pre-maps the S-EPT entries ahead of time, then KVM should have a
straight shot to PAGE.ADD, i.e. doesn't need to "pass" the source page via a
scratch field in kvm_tdx, and I think/hope would avoid the need to grab vcpu0
in order to get at an MMU to build the S-EPT.

And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping memory,
which is generally useful, and can be especially beneficial for confidential VMs
(and TDX in particular) due to the added cost of a page fault VM-Exit.

I'm not dead set on this generic ioctl(), but unless it ends up being a train wreck
for userspace, I think it will allow for cleaner and more reusable code in KVM.

2024-04-16 23:43:49

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()

On Tue, Apr 16, 2024 at 04:22:35PM +0800,
Chao Gao <[email protected]> wrote:

>
> >This patch makes the emulation_type always set irrelevant to the return
> >code. kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),
> >and references the value only when PF_RET_EMULATE is returned. Therefore,
> >this adjustment doesn't affect functionality.
>
> This is benign. But what's the benefit of doing this?

To avoid incrementing vcpu->stat. Because this was originally a VM ioctl, I wanted
to avoid touching the vCPU stats. Now that it's a vCPU ioctl, it's fine to increment them.

Probably we can drop this patch and use kvm_mmu_do_page_fault().

>
> >+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >+ u64 err, bool prefetch, int *emulation_type)
> > {
> > struct kvm_page_fault fault = {
> > .addr = cr2_or_gpa,
> >@@ -318,14 +318,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > }
> >
> >- /*
> >- * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
> >- * guest perspective and have already been counted at the time of the
> >- * original fault.
> >- */
> >- if (!prefetch)
> >- vcpu->stat.pf_taken++;
> >-
> > if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
> > r = kvm_tdp_page_fault(vcpu, &fault);
> > else
> >@@ -333,12 +325,30 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> > if (r == RET_PF_EMULATE && fault.is_private) {
> > kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> >- return -EFAULT;
> >+ r = -EFAULT;
> > }
> >
> > if (fault.write_fault_to_shadow_pgtable && emulation_type)
> > *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
> >
> >+ return r;
> >+}
> >+
> >+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >+ u64 err, bool prefetch, int *emulation_type)
> >+{
> >+ int r;
> >+
> >+ /*
> >+ * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
> >+ * guest perspective and have already been counted at the time of the
> >+ * original fault.
> >+ */
> >+ if (!prefetch)
> >+ vcpu->stat.pf_taken++;
> >+
> >+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);
>
> bail out if r < 0?

The following if clauses check RET_PF_xxx > 0.
--
Isaku Yamahata <[email protected]>

2024-04-16 23:52:42

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()

On Tue, Apr 16, 2024 at 02:36:31PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Extract out __kvm_mmu_do_page_fault() from kvm_mmu_do_page_fault().  The
> > inner function is to initialize struct kvm_page_fault and to call the fault
> > handler, and the outer function handles updating stats and converting
> > return code.  KVM_MAP_MEMORY will call the KVM page fault handler.
> >
> > This patch makes the emulation_type always set irrelevant to the return
> a comma would help parse this better ^
> > code.
>
> >   kvm_mmu_page_fault() is the only caller of kvm_mmu_do_page_fault(),
>
> Not technically correct, there are other callers that pass NULL for
> emulation_type.
>
> > and references the value only when PF_RET_EMULATE is returned.  Therefore,
> > this adjustment doesn't affect functionality.
>
> Is there a problem with dropping the argument then?
>
> >
> > No functional change intended.
>
> Can we not use the "intended"? It sounds like hedging for excuses.

Thanks for review.
As Chao pointed out, this patch is unnecessary. I'll use
kvm_mmu_do_page_fault() directly and update vcpu->stat.

https://lore.kernel.org/all/[email protected]/
--
Isaku Yamahata <[email protected]>

2024-04-17 00:00:03

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 04/10] KVM: x86/mmu: Make __kvm_mmu_do_page_fault() return mapped level

On Tue, Apr 16, 2024 at 02:40:39PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > The guest memory population logic will need to know what page size or level
> > (4K, 2M, ...) is mapped.
>
> TDX needs this, but do the normal VM users need to have it fixed to 4k? Is it
> actually good?

I meant that the function kvm_arch_vcpu_map_memory() in
[PATCH v2 06/10] KVM: x86: Implement kvm_arch_vcpu_map_memory() needs the level.

No logic in this patch series forces the mapping to 4K; the
gmem_max_level() hook will determine the level.

https://lore.kernel.org/all/[email protected]/

I'll update the commit message to reflect it.
--
Isaku Yamahata <[email protected]>

2024-04-17 07:05:02

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

On Wed, Apr 10, 2024 at 03:07:31PM -0700, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Introduce a helper function to call the KVM fault handler. It allows a new
>ioctl to invoke the KVM fault handler to populate without seeing RET_PF_*
>enums or other KVM MMU internal definitions because RET_PF_* are internal
>to x86 KVM MMU. The implementation is restricted to two-dimensional paging
>for simplicity. The shadow paging uses GVA for faulting instead of L1 GPA.
>It makes the API difficult to use.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
>v2:
>- Make the helper function two-dimensional paging specific. (David)
>- Return error when vcpu is in guest mode. (David)
>- Rename goal_level to level in kvm_tdp_mmu_map_page(). (Sean)
>- Update return code conversion. Don't check pfn.
> RET_PF_EMULATE => EINVAL, RET_PF_CONTINUE => EIO (Sean)
>- Add WARN_ON_ONCE on RET_PF_CONTINUE and RET_PF_INVALID. (Sean)
>- Drop unnecessary EXPORT_SYMBOL_GPL(). (Sean)
>---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
>diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>index e8b620a85627..51ff4f67e115 100644
>--- a/arch/x86/kvm/mmu.h
>+++ b/arch/x86/kvm/mmu.h
>@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> }
>
>+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>+ u8 *level);
>+
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index 91dd4c44b7d8..a34f4af44cbd 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -4687,6 +4687,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
>+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>+ u8 *level)
>+{
>+ int r;
>+
>+ /* Restrict to TDP page fault. */

need to explain why. (just as you do in the changelog)

>+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)

Page fault handlers (i.e., vcpu->arch.mmu->page_fault()) will be called
eventually. Why not let the page fault handlers reject the request to get rid of
this ad-hoc check? We just need to plumb a flag indicating this is a
pre-population request into the handlers. I think this way is clearer.

What do you think?

2024-04-17 07:21:27

by Chao Gao

[permalink] [raw]
Subject: Re: [PATCH v2 06/10] KVM: x86: Implement kvm_arch_vcpu_map_memory()

On Wed, Apr 10, 2024 at 03:07:32PM -0700, [email protected] wrote:
>From: Isaku Yamahata <[email protected]>
>
>Wire KVM_MAP_MEMORY ioctl to kvm_mmu_map_tdp_page() to populate guest
>memory. When KVM_CREATE_VCPU creates vCPU, it initializes the x86
>KVM MMU part by kvm_mmu_create() and kvm_init_mmu(). vCPU is ready to
>invoke the KVM page fault handler.
>
>Signed-off-by: Isaku Yamahata <[email protected]>
>---
>v2:
>- Catch up the change of struct kvm_memory_mapping. (Sean)
>- Removed mapping level check. Push it down into vendor code. (David, Sean)
>- Rename goal_level to level. (Sean)
>- Drop kvm_arch_pre_vcpu_map_memory(), directly call kvm_mmu_reload().
> (David, Sean)
>- Fixed the update of mapping.
>---
> arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 2d2619d3eee4..2c765de3531e 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -4713,6 +4713,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_MEMORY_FAULT_INFO:
>+ case KVM_CAP_MAP_MEMORY:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
>@@ -5867,6 +5868,35 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
>+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
>+ struct kvm_memory_mapping *mapping)
>+{
>+ u64 end, error_code = 0;
>+ u8 level = PG_LEVEL_4K;

IIUC, no need to initialize @level here.

>+ int r;
>+
>+ /*
>+ * Shadow paging uses GVA for kvm page fault. The first implementation
>+ * supports GPA only to avoid confusion.
>+ */
>+ if (!tdp_enabled)
>+ return -EOPNOTSUPP;

This check duplicates the one for vcpu->arch.mmu->page_fault() in patch 5.

>+
>+ /* reload is optimized for repeated call. */
>+ kvm_mmu_reload(vcpu);
>+
>+ r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
>+ if (r)
>+ return r;
>+
>+ /* mapping->base_address is not necessarily aligned to level-hugepage. */
>+ end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
>+ KVM_HPAGE_SIZE(level);

maybe
end = ALIGN(mapping->base_address, KVM_HPAGE_SIZE(level));

>+ mapping->size -= end - mapping->base_address;
>+ mapping->base_address = end;
>+ return r;
>+}
>+
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
>--
>2.43.2
>
>

2024-04-17 10:30:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

On Wed, Apr 17, 2024 at 1:00 AM Sean Christopherson <[email protected]> wrote:
> > > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> >
> > It doesn't even have to be ABI that it gives an error. As you say,
> > this ioctl can just be advisory only for !confidential machines. Even
> > if it were implemented, the shadow MMU can drop roots at any moment
>
> Sure, but there's a difference between KVM _potentially_ dropping roots and
> guaranteed failure because userspace is trying to do something that's unsupported.
> But I think this is a non-issue, because it should really just be as simple as:
>
> if (!mmu->pre_map_memory)
> return -EOPNOTSUPP;
>
> Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:
>
> if (!tdp_mmu_enabled || !mmu->root_role.direct)
> return -EOPNOTSUPP;
>
> > and/or kill the mapping via the shrinker.
>
> Ugh, we really need to kill that code.

Ok, so let's add a KVM_CHECK_EXTENSION so that people can check if
it's supported.
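
For example, userspace could probe for it roughly like this (KVM_CAP_MAP_MEMORY
is the capability added by this series; the snippet is only a sketch):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns true if the VM supports KVM_MAP_MEMORY. */
static bool vm_supports_map_memory(int vm_fd)
{
	/* KVM_CHECK_EXTENSION returns 0 when the capability is unsupported. */
	return ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAP_MEMORY) > 0;
}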

> > That said, I can't fully shake the feeling that this ioctl should be
> > an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> > implementation was ugly but the API was fine.
>
> Hmm, but IMO the implementation was ugly in no small part because of the constraints
> put on KVM by the API. Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD in the same
> ioctl() forced KVM to operate on vcpu0, and necessitated shoving temporary data
> into a per-VM structure in order to get the source contents into TDH.MEM.PAGE.ADD.

That's because it was trying to do two things with a single loop. It's
not needed - and in fact KVM_CAP_MEMORY_MAPPING forces userspace to do
it in two passes.

> And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping memory,
> which is generally useful, and can be especially beneficial for confidential VMs
> (and TDX in particular) due to the added cost of a page fault VM-Exit.
>
> I'm not dead set on this generic ioctl(), but unless it ends up being a train wreck
> for userspace, I think it will allow for cleaner and more reusable code in KVM.

Yes, this ioctl() can stay. Forcing it before adding memory to TDX is
ugly, but it's not a blocker. I'll look at it closely and see how far
it is from being committable to kvm-coco-queue.

Paolo


2024-04-17 11:57:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 01/10] KVM: Document KVM_MAP_MEMORY ioctl

On Tue, Apr 16, 2024 at 1:27 AM Edgecombe, Rick P
<[email protected]> wrote:
> > + EAGAIN The region is only processed partially. The caller should
> > + issue the ioctl with the updated parameters when `size` > 0.
> > + EINTR An unmasked signal is pending. The region may be processed
> > + partially.

The common convention is to only return errno if no page was processed.

> > +KVM_MAP_MEMORY populates guest memory over the range given by `base_address`,
> > +an (L1) guest physical address (GPA), and `size` in bytes. `flags` must be
> > +zero; it's reserved for future use. When the ioctl returns, the input values
> > +are updated to point to the remaining range. If `size` > 0 on return, the
> > +caller should issue the ioctl again with the updated parameters.
> > +
> > +Multiple vcpus are allowed to call this ioctl simultaneously. It's not
> > +mandatory for all vcpus to issue this ioctl. A single vcpu can suffice.
> > +Multiple vcpu invocations can be used for scalability, to process the
> > +population in parallel. If multiple vcpus call this ioctl in parallel, it
> > +may result in an EAGAIN error due to race conditions.
> > +
> > +This population is restricted to a "pure" population that does not trigger
> > +any underlying technology-specific initialization. For example, CoCo-related
> > +operations won't be performed. In the case of TDX, this API won't invoke
> > +TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND(). Vendor-specific uAPIs are required
> > +for such operations.
>
> Probably don't want to have TDX bits in here yet. Since it's talking about what
> KVM_MAP_MEMORY is *not* doing, it can just be dropped.

Let's rewrite everything to be more generic:

+KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
+When the ioctl returns, the input values are updated to point to the
+remaining range. If `size` > 0 on return, the caller should
+issue the ioctl again with updated parameters.
+
+In some cases, multiple vCPUs might share the page tables. In that
+case, if this ioctl is called in parallel for multiple vCPUs, the
+ioctl might return with `size > 0`.
+
+The ioctl may not be supported for all VMs. You may use
+`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
+supported.
+
+`flags` must currently be zero.
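
Putting that together, a minimal userspace loop might look like the sketch
below; the struct kvm_memory_mapping field names (base_address, size, flags)
follow this series, but treat the snippet as illustrative rather than final
uAPI:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: populate [gpa, gpa + len) through one vCPU, retrying on partial progress. */
static int prepopulate_range(int vcpu_fd, __u64 gpa, __u64 len)
{
	struct kvm_memory_mapping mapping = {
		.base_address = gpa,
		.size = len,
		.flags = 0,		/* must currently be zero */
	};

	while (mapping.size) {
		if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping) < 0 &&
		    errno != EAGAIN && errno != EINTR)
			return -errno;
		/* On success or EAGAIN/EINTR, base_address and size have been advanced. */
	}
	return 0;
}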


Paolo


2024-04-17 12:18:31

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 06/10] KVM: x86: Implement kvm_arch_vcpu_map_memory()

On Thu, Apr 11, 2024 at 12:08 AM <[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> Wire the KVM_MAP_MEMORY ioctl to kvm_tdp_map_page() to populate guest
> memory. When KVM_CREATE_VCPU creates a vCPU, it initializes the x86
> KVM MMU with kvm_mmu_create() and kvm_init_mmu(), so the vCPU is ready to
> invoke the KVM page fault handler.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v2:
> - Catch up the change of struct kvm_memory_mapping. (Sean)
> - Removed mapping level check. Push it down into vendor code. (David, Sean)
> - Rename goal_level to level. (Sean)
> - Drop kvm_arch_pre_vcpu_map_memory(), directly call kvm_mmu_reload().
> (David, Sean)
> - Fixed the update of mapping.
> ---
> arch/x86/kvm/x86.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2d2619d3eee4..2c765de3531e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4713,6 +4713,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_MEMORY_FAULT_INFO:
> + case KVM_CAP_MAP_MEMORY:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
> @@ -5867,6 +5868,35 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + u64 end, error_code = 0;
> + u8 level = PG_LEVEL_4K;
> + int r;
> +
> + /*
> + * Shadow paging uses GVA for kvm page fault. The first implementation
> + * supports GPA only to avoid confusion.
> + */
> + if (!tdp_enabled)
> + return -EOPNOTSUPP;
> +
> + /* reload is optimized for repeated call. */
> + kvm_mmu_reload(vcpu);
> +
> + r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
> + if (r)
> + return r;
> +
> + /* mapping->base_address is not necessarily aligned to level-hugepage. */

/*
* level can be more than the alignment of mapping->base_address if
* the mapping can use a huge page.
*/

> + end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
> + KVM_HPAGE_SIZE(level);
> + mapping->size -= end - mapping->base_address;
> + mapping->base_address = end;

Slightly safer in the case where level is more than the alignment of
mapping->base_address:

mapped = min(mapping->size, end - mapping->base_address);
mapping->size -= mapped;
mapping->base_address += mapped;
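
To illustrate with made-up numbers: if base_address = 0x201000 and size =
0x1000 but the fault is satisfied with a 2MiB mapping, then end = (0x201000 &
~0x1fffff) + 0x200000 = 0x400000, so end - base_address = 0x1ff000, which is
larger than size. The unclamped "size -= end - base_address" would wrap the
unsigned size around, whereas the min() gives mapped = 0x1000 and size drops
cleanly to zero.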

Paolo

> + return r;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> --
> 2.43.2
>


2024-04-17 12:26:40

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 08/10] KVM: x86: Add a hook in kvm_arch_vcpu_map_memory()

On Thu, Apr 11, 2024 at 12:08 AM <[email protected]> wrote:
>
> From: Isaku Yamahata <[email protected]>
>
> Add a hook in kvm_arch_vcpu_map_memory() for KVM_MAP_MEMORY, invoked before
> calling kvm_tdp_map_page(), to adjust the error code for a page fault. The
> hook can hold vendor-specific logic to make those adjustments and enforce the
> restrictions. SEV and TDX KVM will use the hook.
>
> In the case of SEV and TDX, they need to adjust the KVM page fault error
> code or refuse the operation due to their restrictions. TDX requires that
> guest memory population happen before finalizing the VM.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> v2:
> - Make pre_mmu_map_page() to take error_code.
> - Drop post_mmu_map_page().
> - Drop struct kvm_memory_map.source check.
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 3 +++
> arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++++++
> 3 files changed, 32 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 5187fcf4b610..a5d4f4d5265d 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -139,6 +139,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> +KVM_X86_OP_OPTIONAL(pre_mmu_map_page);
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3ce244ad44e5..2bf7f97f889b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1812,6 +1812,9 @@ struct kvm_x86_ops {
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> + int (*pre_mmu_map_page)(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping,
> + u64 *error_code);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8ba9c1720ac9..b76d854701d5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5868,6 +5868,26 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +static int kvm_pre_mmu_map_page(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping,
> + u64 *error_code)
> +{
> + int r = 0;
> +
> + if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM) {
> + /* nothing */
> + } else if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM) {
> + if (kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
> + *error_code |= PFERR_PRIVATE_ACCESS;

This can probably be done for all VM types, not just KVM_X86_SW_PROTECTED_VM.

For now I am going to squash

if (kvm_arch_has_private_mem(vcpu->kvm) &&
kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(mapping->base_address)))
*error_code |= PFERR_GUEST_ENC_MASK;

in the previous patch. If TDX or SEV need to adjust, they can
introduce the hook where we know if/how it is used.

Paolo


2024-04-17 16:06:48

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 03/10] KVM: x86/mmu: Extract __kvm_mmu_do_page_fault()

On Wed, Apr 17, 2024 at 1:52 AM Isaku Yamahata <[email protected]> wrote:
> As Chao pointed out, this patch is unnecessary. I'll use
> kvm_mmu_do_page_fault() directly and update vcpu->stat.

Actually I prefer to have this patch.

pf_* stats do not make sense for pre-population, and updating them
confuses things because pre-population (outside TDX) has the purpose
of avoiding page faults.
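
A rough sketch of that split (the signature and which pf_* counters stay in
the wrapper are assumptions here, not taken from the patch):

/*
 * Sketch only: keep the pf_* accounting in the wrapper used by the real
 * fault path, and let pre-population call the inner helper so the counters
 * aren't inflated by faults that never actually happened.
 */
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
					u64 err, bool prefetch, int *emulation_type)
{
	int r;

	vcpu->stat.pf_taken++;	/* assumed: stats stay out of __kvm_mmu_do_page_fault() */
	r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch,
				    emulation_type, NULL);
	/* ... pf_fixed/pf_emulate/pf_spurious accounting on the way out ... */
	return r;
}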

Paolo


2024-04-17 18:39:22

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

On Tue, Apr 16, 2024 at 02:46:17PM +0000,
"Edgecombe, Rick P" <[email protected]> wrote:

> On Wed, 2024-04-10 at 15:07 -0700, [email protected] wrote:
> >  
> > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > +                    u8 *level)
> > +{
> > +       int r;
> > +
> > +       /* Restrict to TDP page fault. */
> > +       if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > +               return -EINVAL;
> > +
> > +       r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, false, NULL,
> > level);
>
> Why not prefetch = true? Doesn't it fit? It looks like the behavior will be to
> not set the access bit.

Makes sense. Yes, the difference is whether the A/D bits get set or not.
--
Isaku Yamahata <[email protected]>

2024-04-17 18:44:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v2 05/10] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

On Wed, Apr 17, 2024 at 03:04:08PM +0800,
Chao Gao <[email protected]> wrote:

> On Wed, Apr 10, 2024 at 03:07:31PM -0700, [email protected] wrote:
> >From: Isaku Yamahata <[email protected]>
> >
> >Introduce a helper function to call the KVM fault handler. It allows a new
> >ioctl to invoke the KVM fault handler to populate guest memory without
> >exposing RET_PF_* enums or other KVM MMU internal definitions, because
> >RET_PF_* are internal to the x86 KVM MMU. The implementation is restricted to
> >two-dimensional paging for simplicity. Shadow paging uses GVA for faulting
> >instead of L1 GPA, which makes the API difficult to use.
> >
> >Signed-off-by: Isaku Yamahata <[email protected]>
> >---
> >v2:
> >- Make the helper function two-dimensional paging specific. (David)
> >- Return error when vcpu is in guest mode. (David)
> >- Rename goal_level to level in kvm_tdp_mmu_map_page(). (Sean)
> >- Update return code conversion. Don't check pfn.
> > RET_PF_EMULATE => EINVAL, RET_PF_CONTINUE => EIO (Sean)
> >- Add WARN_ON_ONCE on RET_PF_CONTINUE and RET_PF_INVALID. (Sean)
> >- Drop unnecessary EXPORT_SYMBOL_GPL(). (Sean)
> >---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
> > 2 files changed, 35 insertions(+)
> >
> >diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> >index e8b620a85627..51ff4f67e115 100644
> >--- a/arch/x86/kvm/mmu.h
> >+++ b/arch/x86/kvm/mmu.h
> >@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> > }
> >
> >+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> >+ u8 *level);
> >+
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S bits of a
> > * page fault error code pfec) causes a permission fault with the given PTE
> >diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >index 91dd4c44b7d8..a34f4af44cbd 100644
> >--- a/arch/x86/kvm/mmu/mmu.c
> >+++ b/arch/x86/kvm/mmu/mmu.c
> >@@ -4687,6 +4687,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> >+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> >+ u8 *level)
> >+{
> >+ int r;
> >+
> >+ /* Restrict to TDP page fault. */
>
> need to explain why. (just as you do in the changelog)

Sure.


> >+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
>
> The page fault handlers (i.e., vcpu->arch.mmu->page_fault()) will be called
> in the end anyway. Why not let the page fault handlers reject the request and
> get rid of this ad-hoc check? We just need to plumb a flag indicating that
> this is a pre-population request into the handlers. I think this way is
> clearer.
>
> What do you think?

__kvm_mmu_do_page_fault() doesn't check whether the MMU mode is TDP or not.
If we don't want to check the page_fault handler, the alternative check would
be if (!vcpu->arch.mmu->direct). Or we could require the caller to guarantee
that the MMU mode is TDP (direct or tdp_mmu).
--
Isaku Yamahata <[email protected]>