2023-07-20 23:36:14

by Isaku Yamahata

Subject: [RFC PATCH v4 00/10] KVM: guest_memfd(), X86: Common base for SNP and TDX (was KVM: guest memory: Misc enhancement)

From: Isaku Yamahata <[email protected]>

Hello. I've updated the "KVM: guest memory: Misc enhancement" patch series,
rebasing it on "[RFC PATCH v11 00/29] KVM: guest_memfd() and per-page
attributes" [1]. I changed the subject to better describe the patch series.

The purpose is to get agreement on the common base patches for both SNP [2]
and TDX [3] (and hopefully for other technologies that protect guest memory),
so that SNP and TDX can make progress without stepping on each other.

The main changes from the previous version are
- Rebased to v11 of KVM guest_memfd()
- Introduce KVM_X86_SNP_VM and KVM_X86_TDX_VM
- Make the KVM_MEM_ENC_OP uABI common for SNP and TDX

[1] https://lore.kernel.org/all/[email protected]/

[2] https://lore.kernel.org/lkml/[email protected]/
Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

[3] https://lore.kernel.org/all/[email protected]/
KVM TDX basic feature support

Changes:
v4:
- Rebased to v11 of KVM guest_memfd()
- Introduce KVM_X86_SNP_VM and KVM_X86_TDX_VM
- Newly include a patch to make the KVM_MEM_ENC_OP uABI common for SNP and TDX
- Include a patch to address PFERR_IMPLICIT_ACCESS

v3:
https://lore.kernel.org/all/[email protected]/

v2:
https://lore.kernel.org/all/[email protected]/

v1:
https://lore.kernel.org/all/[email protected]/

Brijesh Singh (1):
KVM: x86: Export the kvm_zap_gfn_range() for the SNP use

Isaku Yamahata (6):
KVM: x86: Add is_vm_type_supported callback
KVM: x86/mmu: Pass around full 64-bit error code for the KVM page
fault
KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private
KVM: Add new members to struct kvm_gfn_range to operate on
KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP
KVM: X86: KVM_MEM_ENC_OP check if unused field (flags, error) is zero

Michael Roth (2):
KVM: x86: Add gmem hook for initializing private memory
KVM: x86: Add gmem hook for invalidating private memory

Sean Christopherson (1):
KVM: x86/mmu: Guard against collision with KVM-defined
PFERR_IMPLICIT_ACCESS

arch/x86/include/asm/kvm-x86-ops.h | 3 ++
arch/x86/include/asm/kvm_host.h | 10 ++++-
arch/x86/include/uapi/asm/kvm.h | 35 +++++++++++++++
arch/x86/kvm/mmu.h | 2 -
arch/x86/kvm/mmu/mmu.c | 37 +++++++++++++---
arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/svm/sev.c | 68 ++++++++++++++++--------------
arch/x86/kvm/svm/svm.c | 7 +++
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/vmx.c | 7 +++
arch/x86/kvm/x86.c | 50 +++++++++++++++++++++-
arch/x86/kvm/x86.h | 2 +
include/linux/kvm_host.h | 5 +++
virt/kvm/guest_mem.c | 44 +++++++++++++++++++
virt/kvm/kvm_main.c | 4 ++
17 files changed, 249 insertions(+), 49 deletions(-)


base-commit: bfa3037d828050896ae52f6467b6ca2489ae6fb1
prerequisite-patch-id: 3bd3037b3803e2d84f0ef98bb6c678be44eddd08
prerequisite-patch-id: b474cbf4f0ea21cf945036271f5286017e0efc84
prerequisite-patch-id: bd96a89fafe51956a55fdfc08a3ea2a37a2e55e4
prerequisite-patch-id: f15d178f9000430e0089c546756ab1d8d29341a7
prerequisite-patch-id: 5b34829d7433fa81ed574d724ee476b9cc2e6a50
prerequisite-patch-id: bf75388851ee37a83b37bfa7cb0084f27301f6bc
prerequisite-patch-id: 9d77fb0e8ce8c8c21e22ff3f26bd168eb5446df0
prerequisite-patch-id: 7152514149d4b4525a0057e3460ff78861e162f5
prerequisite-patch-id: a1d688257a210564ebeb23b1eef4b9ad1f5d7be3
prerequisite-patch-id: 0b1e771c370a03e1588ed97ee77cb0493d9304f4
prerequisite-patch-id: 313219882d617e4d4cb226760d1f071f52b3f882
prerequisite-patch-id: a8ebe373e3913fd0e0a55c57f55690f432975ec0
prerequisite-patch-id: 8b06f2333214e355b145113e33c65ade85d7eac4
prerequisite-patch-id: e739dd58995d35b0f888d02a6bf4ea144476f264
prerequisite-patch-id: 0e93d19cb59f3a052a377a56ff0a4399046818aa
prerequisite-patch-id: 4e0839abbfb8885154e278b4b0071a760199ad46
prerequisite-patch-id: be193bb3393ad8a16ea376a530df20a145145259
prerequisite-patch-id: 301dbdf8448175ea609664c890a3694750ecf740
prerequisite-patch-id: ba8e6068bcef7865bb5523065e19edd49fbc02de
prerequisite-patch-id: 81b25d13169b3617c12992dce85613a2730b0e1b
prerequisite-patch-id: b4526dee5b5a95da0a13116ae0c73d4e69efa3c6
prerequisite-patch-id: 8c62bacc52a75d4a9038a3f597fe436c50e07de3
prerequisite-patch-id: 5618d2414a1ef641b4c247b5e28076f67a765b24
prerequisite-patch-id: 022b4620f6ff729eca842192259e986d126e7fa6
prerequisite-patch-id: 73ebc581a3ce9a51167785d273fe69406ccccaed
prerequisite-patch-id: 1225df90aeae430a74354bc5ad0ddf508d0707db
prerequisite-patch-id: 1e38df398ee370ad7e457f4890d6e4457e8a83fa
prerequisite-patch-id: b8812b613f5674351565ea28354e91a756efd56e
prerequisite-patch-id: e231eff2baba07c2de984dd6cf83ad1a31b792b8
--
2.25.1



2023-07-20 23:38:51

by Isaku Yamahata

Subject: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

From: Isaku Yamahata <[email protected]>

TDX KVM will also use KVM_MEM_ENC_OP. Rename struct sev_cmd and make it
common to both vendor backends, SEV and TDX, so that the struct becomes the
common uABI for KVM_MEM_ENC_OP. The TDX backend wants to return a 64-bit
error code instead of a 32-bit one; to keep the ABI for the SEV backend, use
a union to accommodate the 64-bit member. Opportunistically turn the
implicit padding after the id member into an explicit flags member, for
future use and clarity.

Some data structures for sub-commands could also be made common. The
current candidates are KVM_SEV{,_ES}_INIT, KVM_SEV_LAUNCH_FINISH,
KVM_SEV_LAUNCH_UPDATE_VMSA, KVM_SEV_DBG_DECRYPT, and KVM_SEV_DBG_ENCRYPT.

Only compile tested for SEV code.
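
For illustration only, a minimal userspace sketch of issuing a sub-command
through the proposed common struct (assuming the uapi header from this patch
is installed); the helper name and the lack of error handling are mine, not
part of this patch:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Unused fields (flags, data, error/error64) must be left zero. */
static int mem_enc_op(int vm_fd, __u32 id, __u64 data)
{
	struct kvm_mem_enc_cmd cmd;
	int ret;

	memset(&cmd, 0, sizeof(cmd));
	cmd.id = id;
	cmd.data = data;	/* immediate value or pointer to the payload */

	ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	if (ret)
		fprintf(stderr, "sub-command %u failed: %d (fw error 0x%x)\n",
			id, ret, cmd.error);
	return ret;
}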

Signed-off-by: Isaku Yamahata <[email protected]>
---
Changes v3 -> v4:
- newly added
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/uapi/asm/kvm.h | 33 ++++++++++++++++
arch/x86/kvm/svm/sev.c | 68 ++++++++++++++++++---------------
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/x86.c | 16 +++++++-
5 files changed, 87 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 440a4a13a93f..5ede982442a0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1710,7 +1710,7 @@ struct kvm_x86_ops {
void (*enable_smi_window)(struct kvm_vcpu *vcpu);
#endif

- int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+ int (*mem_enc_ioctl)(struct kvm *kvm, struct kvm_mem_enc_cmd *cmd);
int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index aa7a56a47564..32883e520b00 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
/* x86-specific KVM_EXIT_HYPERCALL flags. */
#define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)

+struct kvm_mem_enc_cmd {
+ /* sub-command id of KVM_MEM_ENC_OP. */
+ __u32 id;
+ /*
+ * Auxiliary flags for sub-command. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u32 flags;
+ /*
+ * Data for sub-command. An immediate or a pointer to the actual
+ * data in process virtual address. If sub-command doesn't use it,
+ * set zero.
+ */
+ __u64 data;
+ /*
+ * Supplemental error code in the case of error.
+ * SEV error code from the PSP or TDX SEAMCALL status code.
+ * The caller should set zero.
+ */
+ union {
+ struct {
+ __u32 error;
+ /*
+ * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
+ * require extra data. Not included in struct
+ * kvm_sev_launch_start or struct kvm_sev_receive_start.
+ */
+ __u32 sev_fd;
+ };
+ __u64 error64;
+ };
+};
+
#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
#define KVM_X86_TDX_VM 2
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 07756b7348ae..94e13bb49c86 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1835,30 +1835,39 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
return ret;
}

-int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+int sev_mem_enc_ioctl(struct kvm *kvm, struct kvm_mem_enc_cmd *cmd)
{
- struct kvm_sev_cmd sev_cmd;
+ struct kvm_sev_cmd *sev_cmd = (struct kvm_sev_cmd *)cmd;
int r;

+ /* TODO: replace struct kvm_sev_cmd with kvm_mem_enc_cmd. */
+ BUILD_BUG_ON(sizeof(*sev_cmd) != sizeof(*cmd));
+ BUILD_BUG_ON(offsetof(struct kvm_sev_cmd, id) !=
+ offsetof(struct kvm_mem_enc_cmd, id));
+ BUILD_BUG_ON(sizeof(sev_cmd->id) != sizeof(cmd->id));
+ BUILD_BUG_ON(offsetof(struct kvm_sev_cmd, data) !=
+ offsetof(struct kvm_mem_enc_cmd, data));
+ BUILD_BUG_ON(sizeof(sev_cmd->data) != sizeof(cmd->data));
+ BUILD_BUG_ON(offsetof(struct kvm_sev_cmd, error) !=
+ offsetof(struct kvm_mem_enc_cmd, error));
+ BUILD_BUG_ON(sizeof(sev_cmd->error) != sizeof(cmd->error));
+ BUILD_BUG_ON(offsetof(struct kvm_sev_cmd, sev_fd) !=
+ offsetof(struct kvm_mem_enc_cmd, sev_fd));
+ BUILD_BUG_ON(sizeof(sev_cmd->sev_fd) != sizeof(cmd->sev_fd));
+
if (!sev_enabled)
return -ENOTTY;

- if (!argp)
- return 0;
-
- if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
- return -EFAULT;
-
mutex_lock(&kvm->lock);

/* Only the enc_context_owner handles some memory enc operations. */
if (is_mirroring_enc_context(kvm) &&
- !is_cmd_allowed_from_mirror(sev_cmd.id)) {
+ !is_cmd_allowed_from_mirror(sev_cmd->id)) {
r = -EINVAL;
goto out;
}

- switch (sev_cmd.id) {
+ switch (sev_cmd->id) {
case KVM_SEV_ES_INIT:
if (!sev_es_enabled) {
r = -ENOTTY;
@@ -1866,67 +1875,64 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
}
fallthrough;
case KVM_SEV_INIT:
- r = sev_guest_init(kvm, &sev_cmd);
+ r = sev_guest_init(kvm, sev_cmd);
break;
case KVM_SEV_LAUNCH_START:
- r = sev_launch_start(kvm, &sev_cmd);
+ r = sev_launch_start(kvm, sev_cmd);
break;
case KVM_SEV_LAUNCH_UPDATE_DATA:
- r = sev_launch_update_data(kvm, &sev_cmd);
+ r = sev_launch_update_data(kvm, sev_cmd);
break;
case KVM_SEV_LAUNCH_UPDATE_VMSA:
- r = sev_launch_update_vmsa(kvm, &sev_cmd);
+ r = sev_launch_update_vmsa(kvm, sev_cmd);
break;
case KVM_SEV_LAUNCH_MEASURE:
- r = sev_launch_measure(kvm, &sev_cmd);
+ r = sev_launch_measure(kvm, sev_cmd);
break;
case KVM_SEV_LAUNCH_FINISH:
- r = sev_launch_finish(kvm, &sev_cmd);
+ r = sev_launch_finish(kvm, sev_cmd);
break;
case KVM_SEV_GUEST_STATUS:
- r = sev_guest_status(kvm, &sev_cmd);
+ r = sev_guest_status(kvm, sev_cmd);
break;
case KVM_SEV_DBG_DECRYPT:
- r = sev_dbg_crypt(kvm, &sev_cmd, true);
+ r = sev_dbg_crypt(kvm, sev_cmd, true);
break;
case KVM_SEV_DBG_ENCRYPT:
- r = sev_dbg_crypt(kvm, &sev_cmd, false);
+ r = sev_dbg_crypt(kvm, sev_cmd, false);
break;
case KVM_SEV_LAUNCH_SECRET:
- r = sev_launch_secret(kvm, &sev_cmd);
+ r = sev_launch_secret(kvm, sev_cmd);
break;
case KVM_SEV_GET_ATTESTATION_REPORT:
- r = sev_get_attestation_report(kvm, &sev_cmd);
+ r = sev_get_attestation_report(kvm, sev_cmd);
break;
case KVM_SEV_SEND_START:
- r = sev_send_start(kvm, &sev_cmd);
+ r = sev_send_start(kvm, sev_cmd);
break;
case KVM_SEV_SEND_UPDATE_DATA:
- r = sev_send_update_data(kvm, &sev_cmd);
+ r = sev_send_update_data(kvm, sev_cmd);
break;
case KVM_SEV_SEND_FINISH:
- r = sev_send_finish(kvm, &sev_cmd);
+ r = sev_send_finish(kvm, sev_cmd);
break;
case KVM_SEV_SEND_CANCEL:
- r = sev_send_cancel(kvm, &sev_cmd);
+ r = sev_send_cancel(kvm, sev_cmd);
break;
case KVM_SEV_RECEIVE_START:
- r = sev_receive_start(kvm, &sev_cmd);
+ r = sev_receive_start(kvm, sev_cmd);
break;
case KVM_SEV_RECEIVE_UPDATE_DATA:
- r = sev_receive_update_data(kvm, &sev_cmd);
+ r = sev_receive_update_data(kvm, sev_cmd);
break;
case KVM_SEV_RECEIVE_FINISH:
- r = sev_receive_finish(kvm, &sev_cmd);
+ r = sev_receive_finish(kvm, sev_cmd);
break;
default:
r = -EINVAL;
goto out;
}

- if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd)))
- r = -EFAULT;
-
out:
mutex_unlock(&kvm->lock);
return r;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 18af7e712a5a..74ecab20c24b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -716,7 +716,7 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu);
extern unsigned int max_sev_asid;

void sev_vm_destroy(struct kvm *kvm);
-int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp);
+int sev_mem_enc_ioctl(struct kvm *kvm, struct kvm_mem_enc_cmd *cmd);
int sev_mem_enc_register_region(struct kvm *kvm,
struct kvm_enc_region *range);
int sev_mem_enc_unregister_region(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2ae40fa8e178..ab36e8940f1b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7040,11 +7040,25 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
goto out;
}
case KVM_MEMORY_ENCRYPT_OP: {
+ struct kvm_mem_enc_cmd cmd;
+
r = -ENOTTY;
if (!kvm_x86_ops.mem_enc_ioctl)
goto out;

- r = static_call(kvm_x86_mem_enc_ioctl)(kvm, argp);
+ if (!argp) {
+ r = 0;
+ goto out;
+ }
+
+ if (copy_from_user(&cmd, argp, sizeof(cmd))) {
+ r = -EFAULT;
+ goto out;
+ }
+ r = static_call(kvm_x86_mem_enc_ioctl)(kvm, &cmd);
+ if (copy_to_user(argp, &cmd, sizeof(cmd)))
+ r = -EFAULT;
+
break;
}
case KVM_MEMORY_ENCRYPT_REG_REGION: {
--
2.25.1


2023-07-20 23:50:40

by Isaku Yamahata

Subject: [RFC PATCH v4 06/10] KVM: x86: Export the kvm_zap_gfn_range() for the SNP use

From: Brijesh Singh <[email protected]>

While resolving an RMP page fault, there may be cases where the page level
of the RMP entry and of the TDP entry do not match and the 2M RMP entry must
be split into 4K RMP entries, or a 2M TDP page needs to be broken into
multiple 4K pages.

To keep the RMP and TDP page levels in sync, zap the gfn range after
splitting the pages in the RMP entry. The zap forces the TDP mapping to be
rebuilt with the new page level.
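
As a rough illustration (not part of this patch), the SNP RMP fault handler
could use the newly exported symbol along these lines; psmash() and the
surrounding fault decoding are assumptions based on the SNP host series:

static void snp_handle_rmp_page_size_mismatch(struct kvm *kvm, gfn_t gfn,
					      kvm_pfn_t pfn)
{
	gfn_t gfn_2m = gfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
	kvm_pfn_t pfn_2m = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);

	/* Split the 2M RMP entry covering @pfn into 4K RMP entries. */
	if (psmash(pfn_2m))
		return;

	/*
	 * Zap the covering 2M GPA range so the TDP mapping is rebuilt at the
	 * (now 4K) RMP level on the next fault.
	 */
	kvm_zap_gfn_range(kvm, gfn_2m, gfn_2m + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));
}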

Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
Changes v3 -> v4:
- removed redundant blank line

Changes v2 -> v3:
- Newly added
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.h | 2 --
arch/x86/kvm/mmu/mmu.c | 1 +
3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab7d080bf544..e4f2938bb1fc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1842,6 +1842,7 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..963c734642f6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -235,8 +235,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
return -(u32)fault & errcode;
}

-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);

int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d2ebe26fb822..a73ddb43a2cf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6759,6 +6759,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,

return need_tlb_flush;
}
+EXPORT_SYMBOL_GPL(kvm_zap_gfn_range);

static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
--
2.25.1


2023-07-20 23:50:57

by Isaku Yamahata

Subject: [RFC PATCH v4 01/10] KVM: x86: Add is_vm_type_supported callback

From: Isaku Yamahata <[email protected]>

For TDX, allow the backend to override the supported VM types. Add
KVM_X86_TDX_VM and KVM_X86_SNP_VM to reserve the bits.
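
For context, a hypothetical userspace sketch of how a VMM could consume the
new types once a backend reports them; KVM_CAP_VM_TYPES is assumed from the
guest_memfd() series this is based on, and error handling is elided:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_vm(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	unsigned long types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
	unsigned long type = KVM_X86_DEFAULT_VM;

	if (types & (1UL << KVM_X86_TDX_VM))
		type = KVM_X86_TDX_VM;
	else if (types & (1UL << KVM_X86_SNP_VM))
		type = KVM_X86_SNP_VM;

	return ioctl(kvm_fd, KVM_CREATE_VM, type);
}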

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v3 -> v4:
- Added KVM_X86_SNP_VM

Changes v2 -> v3:
- no change
- didn't bother to rename KVM_X86_PROTECTED_VM to KVM_X86_SW_PROTECTED_VM

Changes v1 -> v2
- no change
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/uapi/asm/kvm.h | 2 ++
arch/x86/kvm/svm/svm.c | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++++
arch/x86/kvm/x86.c | 12 +++++++++++-
arch/x86/kvm/x86.h | 2 ++
7 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13bc212cd4bc..c0143906fe6d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bbefd79b7950..2c9350aa0da4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1543,6 +1543,7 @@ struct kvm_x86_ops {
bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);

+ bool (*is_vm_type_supported)(unsigned long vm_type);
unsigned int vm_size;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a448d0964fc0..aa7a56a47564 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -564,5 +564,7 @@ struct kvm_pmu_event_filter {

#define KVM_X86_DEFAULT_VM 0
#define KVM_X86_SW_PROTECTED_VM 1
+#define KVM_X86_TDX_VM 2
+#define KVM_X86_SNP_VM 3

#endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d381ad424554..d681dd7ad397 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4768,6 +4768,12 @@ static void svm_vm_destroy(struct kvm *kvm)
sev_vm_destroy(kvm);
}

+static bool svm_is_vm_type_supported(unsigned long type)
+{
+ /* FIXME: Check if CPU is capable of SEV-SNP. */
+ return __kvm_is_vm_type_supported(type);
+}
+
static int svm_vm_init(struct kvm *kvm)
{
if (!pause_filter_count || !pause_filter_thresh)
@@ -4796,6 +4802,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_free = svm_vcpu_free,
.vcpu_reset = svm_vcpu_reset,

+ .is_vm_type_supported = svm_is_vm_type_supported,
.vm_size = sizeof(struct kvm_svm),
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 946380b53cf5..693f07b80966 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7511,6 +7511,12 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
return err;
}

+static bool vmx_is_vm_type_supported(unsigned long type)
+{
+ /* TODO: Check if TDX is supported. */
+ return __kvm_is_vm_type_supported(type);
+}
+
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"

@@ -8180,6 +8186,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.hardware_disable = vmx_hardware_disable,
.has_emulated_msr = vmx_has_emulated_msr,

+ .is_vm_type_supported = vmx_is_vm_type_supported,
.vm_size = sizeof(struct kvm_vmx),
.vm_init = vmx_vm_init,
.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index de195ad83ec0..fd6c05d1883c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4427,12 +4427,18 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu,
return 0;
}

-static bool kvm_is_vm_type_supported(unsigned long type)
+bool __kvm_is_vm_type_supported(unsigned long type)
{
return type == KVM_X86_DEFAULT_VM ||
(type == KVM_X86_SW_PROTECTED_VM &&
IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_enabled);
}
+EXPORT_SYMBOL_GPL(__kvm_is_vm_type_supported);
+
+static bool kvm_is_vm_type_supported(unsigned long type)
+{
+ return static_call(kvm_x86_is_vm_type_supported)(type);
+}

int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
{
@@ -4628,6 +4634,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = BIT(KVM_X86_DEFAULT_VM);
if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
r |= BIT(KVM_X86_SW_PROTECTED_VM);
+ if (kvm_is_vm_type_supported(KVM_X86_TDX_VM))
+ r |= BIT(KVM_X86_TDX_VM);
+ if (kvm_is_vm_type_supported(KVM_X86_SNP_VM))
+ r |= BIT(KVM_X86_SNP_VM);
break;
default:
break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 82e3dafc5453..7de3a45f655a 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -9,6 +9,8 @@
#include "kvm_cache_regs.h"
#include "kvm_emulate.h"

+bool __kvm_is_vm_type_supported(unsigned long type);
+
struct kvm_caps {
/* control of guest tsc rate supported? */
bool has_tsc_control;
--
2.25.1


2023-07-20 23:51:03

by Isaku Yamahata

Subject: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory

From: Michael Roth <[email protected]>

All gmem pages are expected to be 'private' as defined by a particular
arch/platform. Platforms like SEV-SNP require additional operations to
move these pages into a private state, so implement a hook that can be
used to prepare this memory prior to mapping it into a guest.

In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table, so this hook will also be used by the KVM MMU to clamp
the maximum mapping size accordingly.
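
As a rough sketch (not part of this patch), an SEV-SNP implementation of the
hook might look something like the following; rmp_make_private() and
sev_get_asid() are assumed from the SNP host series, and the
already-assigned/error paths are elided:

static int snp_gmem_prepare(struct kvm *kvm, struct kvm_memory_slot *slot,
			    kvm_pfn_t pfn, gfn_t gfn, u8 *max_level)
{
	int level = min_t(int, *max_level, PG_LEVEL_2M);
	int rc;

	/* Assign the page to the guest's ASID in the RMP table. */
	rc = rmp_make_private(pfn, gfn_to_gpa(gfn), level, sev_get_asid(kvm), false);
	if (rc)
		return rc;

	/* Don't let the NPT map at a larger level than the RMP assignment. */
	*max_level = level;
	return 0;
}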

Signed-off-by: Michael Roth <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
Changes v2 -> v3:
- Newly added
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index c0143906fe6d..a4cb248519cf 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -134,6 +134,7 @@ KVM_X86_OP(msr_filter_changed)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e4f2938bb1fc..de7f0dffa135 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1735,6 +1735,9 @@ struct kvm_x86_ops {
* Returns vCPU specific APICv inhibit reasons
*/
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+ int (*gmem_prepare)(struct kvm *kvm, struct kvm_memory_slot *slot,
+ kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a73ddb43a2cf..35bb14363828 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
int max_order, r;
+ u8 max_level;

if (!kvm_slot_can_be_private(fault->slot))
return kvm_do_memory_fault_exit(vcpu, fault);
@@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
if (r)
return r;

- fault->max_level = min(kvm_max_level_for_order(max_order),
- fault->max_level);
+ max_level = kvm_max_level_for_order(max_order);
+ r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
+ fault->gfn, &max_level);
+ if (r) {
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
+ }
+
+ fault->max_level = min(max_level, fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
return RET_PF_CONTINUE;
}
--
2.25.1


2023-07-20 23:57:51

by Isaku Yamahata

Subject: [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private

From: Isaku Yamahata <[email protected]>

Add the PFERR_GUEST_ENC_MASK PFERR code to designate that the page fault is
private and that it requires looking up memory attributes. The vendor KVM
page fault handler should set the PFERR_GUEST_ENC_MASK bit based on its
fault information; it may use the hardware value directly or parse the
hardware value to decide whether to set the bit.

For KVM_X86_PROTECTED_VM, consult the memory attributes to determine whether
the fault is private.
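
As an illustration of the expected usage (not part of this patch), a
TDX-style EPT violation handler might synthesize the bit roughly as follows;
tdx_is_private_gpa() is a placeholder for the vendor-specific fault decoding:

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 exit_qual)
{
	/* In practice the access bits would be derived from exit_qual. */
	u64 error_code = PFERR_WRITE_MASK;

	if (tdx_is_private_gpa(vcpu->kvm, gpa))
		error_code |= PFERR_GUEST_ENC_MASK;

	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}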

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v3 -> v4:
- rename back struct kvm_page_fault::private => is_private
- catch up rename: KVM_X86_PROTECTED_VM => KVM_X86_SW_PROTECTED_VM

Changes v2 -> v3:
- Revive PFERR_GUEST_ENC_MASK
- rename struct kvm_page_fault::is_private => private
- Add check KVM_X86_PROTECTED_VM

Changes v1 -> v2:
- Introduced fault type and replaced is_private with fault_type.
- Add kvm_get_fault_type() to encapsulate the difference.
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 8 ++++++--
arch/x86/kvm/mmu/mmu_internal.h | 14 +++++++++++++-
3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c9350aa0da4..ab7d080bf544 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -255,6 +255,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_BIT 15
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
#define PFERR_IMPLICIT_ACCESS_BIT 48

#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
@@ -266,6 +267,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)

#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a2fe091e327a..d2ebe26fb822 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4399,8 +4399,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
- return kvm_do_memory_fault_exit(vcpu, fault);
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
+ return RET_PF_RETRY;
+ else
+ return kvm_do_memory_fault_exit(vcpu, fault);
+ }

if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7f9ec1e5b136..4f8f83546c37 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -282,6 +282,18 @@ enum {
RET_PF_SPURIOUS,
};

+static inline bool kvm_is_fault_private(struct kvm *kvm, gpa_t gpa, u64 error_code)
+{
+ /*
+ * This is racy with mmu_seq. If we hit a race, it would result in a
+ * spurious KVM_EXIT_MEMORY_FAULT.
+ */
+ if (kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
+ return kvm_mem_is_private(kvm, gpa_to_gfn(gpa));
+
+ return error_code & PFERR_GUEST_ENC_MASK;
+}
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u64 err, bool prefetch, int *emulation_type)
{
@@ -295,13 +307,13 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.user = err & PFERR_USER_MASK,
.prefetch = prefetch,
.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .is_private = kvm_is_fault_private(vcpu->kvm, cr2_or_gpa, err),
.nx_huge_page_workaround_enabled =
is_nx_huge_page_enabled(vcpu->kvm),

.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
};
int r;

--
2.25.1


2023-07-21 00:01:19

by Isaku Yamahata

Subject: [RFC PATCH v4 03/10] KVM: x86/mmu: Pass around full 64-bit error code for the KVM page fault

From: Isaku Yamahata <[email protected]>

Because the full 64-bit error code, i.e. more info about the fault, will be
needed for the protected VMs, TDX and SEV-SNP, update kvm_mmu_do_page_fault()
to accept the 64-bit value so it can pass it on to the callbacks.

The upper 32 bits of the error code are currently discarded at
kvm_mmu_page_fault() by lower_32_bits(); now the full 64 bits are passed
down. In the upper bits, two hardware-defined bits, PFERR_GUEST_FINAL_MASK
and PFERR_GUEST_PAGE_MASK, and one software-defined bit,
PFERR_IMPLICIT_ACCESS, are defined.

PFERR_IMPLICIT_ACCESS:
commit 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP")
introduced the software-defined bit PFERR_IMPLICIT_ACCESS at bit 48 to
indicate an implicit access for SMAP by the instruction emulator.
Concretely, emulator_read_std() and emulator_write_std() set the bit, and
permission_fault() checks the bit as an SMAP implicit access. The vendor
page fault handler shouldn't pass the bit to kvm_mmu_page_fault().

PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK:
commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
introduced them to optimize the nested page fault handling. No other code
paths use the bits, so those two bits can be safely passed down without
functional change.

The accesses of fault->error_code are as follows
- FNAME(page_fault): PFERR_IMPLICIT_ACCESS shouldn't be passed down.
PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK
aren't used.
- kvm_mmu_page_fault(): explicitly masks with PFERR_RSVD_MASK, and
PFERR_NESTED_GUEST_PAGE is checked outside of the masking of
the upper 32 bits.
- mmutrace: change u32 -> u64
- pgprintk(): change %x -> %llx

No functional change is intended. This is a preparation to pass more
information along with the page fault error code.

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v2 -> v3:
- Make depends on a patch to clear PFERR_IMPLICIT_ACCESS
- drop clearing the upper 32 bit, instead just pass whole 64 bits
- update commit message to mention about PFERR_IMPLICIT_ACCESS and
PFERR_NESTED_GUEST_PAGE

Changes v1 -> v2:
- no change
---
arch/x86/kvm/mmu/mmu.c | 5 ++---
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a9bbc20c7dfd..a2fe091e327a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4523,7 +4523,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- pgprintk("%s: gva %lx error %x\n", __func__, fault->addr, fault->error_code);
+ pgprintk("%s: gva %llx error %llx\n", __func__, fault->addr, fault->error_code);

/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
fault->max_level = PG_LEVEL_2M;
@@ -5844,8 +5844,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}

if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false,
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
&emulation_type);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index f1786698ae00..7f9ec1e5b136 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -191,7 +191,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;

/* Derived from error_code. */
@@ -283,7 +283,7 @@ enum {
};

static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 2d7555381955..2e77883c92f6 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -261,7 +261,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..42d48b1ec7b3 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -758,7 +758,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
struct guest_walker walker;
int r;

- pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code);
+ pgprintk("%s: addr %llx err %llx\n", __func__, fault->addr, fault->error_code);
WARN_ON_ONCE(fault->is_tdp);

/*
--
2.25.1


2023-07-21 00:05:17

by Isaku Yamahata

Subject: [RFC PATCH v4 05/10] KVM: Add new members to struct kvm_gfn_range to operate on

From: Isaku Yamahata <[email protected]>

TDX needs to know which mapping to operate on: Shared-EPT vs. Secure-EPT.
The following sequence to convert a GPA to private doesn't work for TDX
because the page can already be private.

1) Update memory attributes to private in memory attributes xarray
2) Zap the GPA range irrespective of private-or-shared.
Even if the page is already private, zap the entry.
3) EPT violation on the GPA
4) Populate the GPA as private
The page is zeroed, and the guest has to accept the page again.

In step 2, TDX wants to zap only shared pages and skip private ones.

Add new members to strut kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on. only_private and only_shared. Update
mmu notifier, set memory attributes ioctl or KVM gmem callback to
initialize them.

- If operating on a file that backs shared pages, zap shared pages only.
This is the mmu notifier case.
(only_private, only_shared) = (false, true)

- If operating on a file that backs private pages, zap private pages only.
This is the KVM gmem case.
(only_private, only_shared) = (true, false)

- If setting memory attributes, the vendor callback checks the new
attributes and makes decisions.
SNP would do nothing and handle it later with the gmem callback.
The TDX callback would do as follows:
when converting pages to shared, zap private pages only;
when converting pages to private, zap shared pages only.
(only_private, only_shared) = (false, false)

- If operating on both backing files, zap both private and shared pages.
This is the case when destroying the guest.
(only_private, only_shared) = (true, true)
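
As a rough sketch of how a vendor callback might consume the new members
(the zap helpers below are placeholders, not code from this patch):

static bool vt_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool flush = false;

	/*
	 * (only_private, only_shared) == (false, false) is the "vendor
	 * decides" case; a real implementation would also consult the new
	 * attributes (range->arg) before zapping both halves.
	 */
	if (!range->only_shared)
		flush |= zap_private_mapping(kvm, range->start, range->end);
	if (!range->only_private)
		flush |= zap_shared_mapping(kvm, range->start, range->end);

	return flush;
}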

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v3 -> v4:
- rebased to v11 KVM gmem

Changes v2 -> v3:
- Drop the KVM_GFN_RANGE flags
- Updated struct kvm_gfn_range
- Change kvm_arch_set_memory_attributes() to return bool for flush
- Added set_memory_attributes x86 op for vendor backends
- Refined commit message to describe TDX care concretely

Changes v1 -> v2:
- consolidate KVM_GFN_RANGE_FLAGS_GMEM_{PUNCH_HOLE, RELEASE} into
KVM_GFN_RANGE_FLAGS_GMEM.
- Update the commit message to describe TDX more. Drop SEV_SNP.
---
include/linux/kvm_host.h | 2 ++
virt/kvm/guest_mem.c | 2 ++
virt/kvm/kvm_main.c | 4 ++++
3 files changed, 8 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 091bc89ae805..ce4d91585368 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -268,6 +268,8 @@ struct kvm_gfn_range {
u64 raw;
} arg;
bool may_block;
+ bool only_private;
+ bool only_shared;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 384671a55b41..ac185c776cda 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -105,6 +105,8 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
.slot = slot,
.may_block = true,
+ .only_private = true,
+ .only_shared = false,
};

flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ee331cf8ba54..4e2a2463ab19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -603,6 +603,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
*/
gfn_range.arg.raw = range->arg.raw;
gfn_range.may_block = range->may_block;
+ gfn_range.only_private = false;
+ gfn_range.only_shared = true;

/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -2405,6 +2407,8 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,

gfn_range.arg.raw = range->arg.raw;
gfn_range.may_block = range->may_block;
+ gfn_range.only_private = false;
+ gfn_range.only_shared = false;

for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
--
2.25.1


2023-07-21 00:05:17

by Isaku Yamahata

Subject: [RFC PATCH v4 02/10] KVM: x86/mmu: Guard against collision with KVM-defined PFERR_IMPLICIT_ACCESS

From: Sean Christopherson <[email protected]>

Add an assertion in kvm_mmu_page_fault() to ensure the error code provided
by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS
flag. In the unlikely scenario that future hardware starts using bit 48
for a hardware-defined flag, preserving the bit could result in KVM
incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag.

WARN so that any such conflict can be surfaced to KVM developers and
resolved, but otherwise ignore the bit as KVM can't possibly rely on a
flag it knows nothing about.

Fixes: 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 05943ccb55a4..a9bbc20c7dfd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5822,6 +5822,17 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
int r, emulation_type = EMULTYPE_PF;
bool direct = vcpu->arch.mmu->root_role.direct;

+ /*
+ * IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
+ * checks when emulating instructions that triggers implicit access.
+ * WARN if hardware generates a fault with an error code that collides
+ * with the KVM-defined value. Clear the flag and continue on, i.e.
+ * don't terminate the VM, as KVM can't possibly be relying on a flag
+ * that KVM doesn't know about.
+ */
+ if (WARN_ON_ONCE(error_code & PFERR_IMPLICIT_ACCESS))
+ error_code &= ~PFERR_IMPLICIT_ACCESS;
+
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;

--
2.25.1


2023-07-21 00:06:18

by Isaku Yamahata

Subject: [RFC PATCH v4 08/10] KVM: x86: Add gmem hook for invalidating private memory

From: Michael Roth <[email protected]>

TODO: add a CONFIG option that can be used to completely skip the arch
invalidation loop and avoid __weak references for arches/platforms that
don't need an additional invalidation hook.

In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Add hooks to wire up handling of this sort when freeing memory in
response to FALLOC_FL_PUNCH_HOLE operations.

Also issue invalidations of all allocated pages when releasing the gmem
file so that the pages are not left in an unusable state when they get
freed back to the host.
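
As a rough sketch (not part of this patch), an SEV-SNP backend for the new
hook might look like the following; snp_rmp_make_shared() is a placeholder
for the RMP reclaim helper from the SNP host series:

static void sev_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end)
{
	kvm_pfn_t pfn;

	for (pfn = start; pfn < end; pfn++) {
		/* Return the page to a hypervisor-owned state before it is freed. */
		if (snp_rmp_make_shared(pfn, PG_LEVEL_4K))
			pr_warn_ratelimited("Failed to reclaim RMP entry for pfn 0x%llx\n",
					    pfn);
	}
}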

Signed-off-by: Michael Roth <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
Changes v2 -> v3:
- Newly added
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 6 +++++
include/linux/kvm_host.h | 3 +++
virt/kvm/guest_mem.c | 42 ++++++++++++++++++++++++++++++
5 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index a4cb248519cf..d520c6370cd6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index de7f0dffa135..440a4a13a93f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1738,6 +1738,7 @@ struct kvm_x86_ops {

int (*gmem_prepare)(struct kvm *kvm, struct kvm_memory_slot *slot,
kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
+ void (*gmem_invalidate)(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd6c05d1883c..2ae40fa8e178 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13284,6 +13284,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);

+#ifdef CONFIG_KVM_PRIVATE_MEM
+void kvm_arch_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end)
+{
+ static_call_cond(kvm_x86_gmem_invalidate)(kvm, start, end);
+}
+#endif

int kvm_spec_ctrl_test_value(u64 value)
{
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce4d91585368..6c5d39e429e9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2360,6 +2360,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
#ifdef CONFIG_KVM_PRIVATE_MEM
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+void kvm_arch_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2368,6 +2369,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+
+static inline void kvm_arch_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end) { }
#endif /* CONFIG_KVM_PRIVATE_MEM */

#endif
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index ac185c776cda..a14eaac9dbad 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -129,6 +129,46 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
KVM_MMU_UNLOCK(kvm);
}

+void __weak kvm_arch_gmem_invalidate(struct kvm *kvm, kvm_pfn_t start, kvm_pfn_t end)
+{
+}
+
+/* Handle arch-specific hooks needed before releasing guarded pages. */
+static void kvm_gmem_issue_arch_invalidate(struct kvm *kvm, struct inode *inode,
+ pgoff_t start, pgoff_t end)
+{
+ pgoff_t file_end = i_size_read(inode) >> PAGE_SHIFT;
+ pgoff_t index = start;
+
+ end = min(end, file_end);
+
+ while (index < end) {
+ struct folio *folio;
+ unsigned int order;
+ struct page *page;
+ kvm_pfn_t pfn;
+
+ folio = __filemap_get_folio(inode->i_mapping, index,
+ FGP_LOCK, 0);
+ if (!folio) {
+ index++;
+ continue;
+ }
+
+ page = folio_file_page(folio, index);
+ pfn = page_to_pfn(page);
+ order = folio_order(folio);
+
+ kvm_arch_gmem_invalidate(kvm, pfn, pfn + min((1ul << order), end - index));
+
+ index = folio_next_index(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+
+ cond_resched();
+ }
+}
+
static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
struct list_head *gmem_list = &inode->i_mapping->private_list;
@@ -145,6 +185,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
list_for_each_entry(gmem, gmem_list, entry)
kvm_gmem_invalidate_begin(gmem, start, end);

+ kvm_gmem_issue_arch_invalidate(gmem->kvm, inode, start, end);
truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

list_for_each_entry(gmem, gmem_list, entry)
@@ -255,6 +296,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
* memory, as its lifetime is associated with the inode, not the file.
*/
kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+ kvm_gmem_issue_arch_invalidate(gmem->kvm, inode, 0, -1ul);
kvm_gmem_invalidate_end(gmem, 0, -1ul);

mutex_unlock(&kvm->slots_lock);
--
2.25.1


2023-07-21 00:26:08

by Isaku Yamahata

Subject: [RFC PATCH v4 10/10] KVM: X86: KVM_MEM_ENC_OP check if unused field (flags, error) is zero

From: Isaku Yamahata <[email protected]>

Check that the unused fields (flags and error/error64) are zero. Note that
this breaks the uABI, as the current code doesn't check the padding and
sev_fd fields when they are unused, so existing callers may be passing
non-zero garbage there.
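
For callers, the practical upshot (illustration only, assuming the common
struct from the previous patch and the includes from its example) is that
unused fields now must be zero, e.g. via a designated initializer:

static int sev_launch_finish(int vm_fd)
{
	/* flags and error/error64 are left zero; non-zero values now fail. */
	struct kvm_mem_enc_cmd cmd = {
		.id = KVM_SEV_LAUNCH_FINISH,
	};

	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}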

Signed-off-by: Isaku Yamahata <[email protected]>
---
Changes v3 -> v4:
- newly added
---
arch/x86/kvm/x86.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab36e8940f1b..1d6085af6a00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7055,6 +7055,22 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
r = -EFAULT;
goto out;
}
+ /* No sub-command uses flags at the moment. */
+ if (cmd.flags) {
+ r = -EINVAL;
+ goto out;
+ }
+ if (cmd.id != KVM_SEV_LAUNCH_START &&
+ cmd.id != KVM_SEV_RECEIVE_START && cmd.error64) {
+ r = -EINVAL;
+ goto out;
+ }
+ if ((cmd.id == KVM_SEV_LAUNCH_START ||
+ cmd.id == KVM_SEV_RECEIVE_START) && cmd.error) {
+ r = -EINVAL;
+ goto out;
+ }
+
r = static_call(kvm_x86_mem_enc_ioctl)(kvm, &cmd);
if (copy_to_user(argp, &cmd, sizeof(cmd)))
r = -EFAULT;
--
2.25.1


2023-07-21 14:27:13

by Sean Christopherson

Subject: Re: [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private

s/Introduce/Use

This doesn't "introduce" anything, in the sense that it's an AMD-defined error
code flag. That matters because KVM *did* introduce/define PFERR_IMPLICIT_ACCESS.

On Thu, Jul 20, 2023, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Add two PFERR codes to designate that the page fault is private and that
> it requires looking up memory attributes. The vendor kvm page fault
> handler should set PFERR_GUEST_ENC_MASK bit based on their fault
> information. It may or may not use the hardware value directly or
> parse the hardware value to set the bit.
>
> For KVM_X86_PROTECTED_VM, ask memory attributes for the fault privateness.

...

> +static inline bool kvm_is_fault_private(struct kvm *kvm, gpa_t gpa, u64 error_code)
> +{
> + /*
> + * This is racy with mmu_seq. If we hit a race, it would result in a
> + * spurious KVM_EXIT_MEMORY_FAULT.
> + */
> + if (kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
> + return kvm_mem_is_private(kvm, gpa_to_gfn(gpa));

Please synthesize the error code flag for SW-protected VMs, same as TDX, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 20e289e872eb..de9e0a9c41e6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5751,6 +5751,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;

+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);

Functionally it's the same, but I want all VM types to have the same source of
truth for private versus shared, and I really don't want kvm_is_fault_private()
to exist.

2023-07-21 14:51:47

by Sean Christopherson

Subject: Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory

On Thu, Jul 20, 2023, [email protected] wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a73ddb43a2cf..35bb14363828 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> int max_order, r;
> + u8 max_level;
>
> if (!kvm_slot_can_be_private(fault->slot))
> return kvm_do_memory_fault_exit(vcpu, fault);
> @@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> if (r)
> return r;
>
> - fault->max_level = min(kvm_max_level_for_order(max_order),
> - fault->max_level);
> + max_level = kvm_max_level_for_order(max_order);
> + r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
> + fault->gfn, &max_level);

I think KVM should hook gmem itself, i.e. convert pages to private when they're
allocated, and then back to shared when they're freed. I don't like having
asymmetric APIs (convert on fault instead of allocate, but then convert back on
free instead of on zap), and hooking the page fault path also violates the approach
of tying the RMP/HKID to the physical allocation.

Practically speaking, hooking the fault path will result in undesirable behavior.
Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned
at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest
shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where
the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a
stupid and covers a single 2MiB chunk of gmem with 512 memslots.

That likely means KVM will need an additional hook to clamp the max_level at the
RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM
should be able to blindly do the conversion because it would be a kernel bug if
the page is already assigned to an ASID in the RMP, i.e. the additional hook
shouldn't incur an extra RMP lookup.

2023-07-21 15:06:33

by Sean Christopherson

Subject: Re: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

On Thu, Jul 20, 2023, [email protected] wrote:
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index aa7a56a47564..32883e520b00 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
> /* x86-specific KVM_EXIT_HYPERCALL flags. */
> #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
>
> +struct kvm_mem_enc_cmd {
> + /* sub-command id of KVM_MEM_ENC_OP. */
> + __u32 id;
> + /*
> + * Auxiliary flags for sub-command. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u32 flags;
> + /*
> + * Data for sub-command. An immediate or a pointer to the actual
> + * data in process virtual address. If sub-command doesn't use it,
> + * set zero.
> + */
> + __u64 data;
> + /*
> + * Supplemental error code in the case of error.
> + * SEV error code from the PSP or TDX SEAMCALL status code.
> + * The caller should set zero.
> + */
> + union {
> + struct {
> + __u32 error;
> + /*
> + * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
> + * require extra data. Not included in struct
> + * kvm_sev_launch_start or struct kvm_sev_receive_start.
> + */
> + __u32 sev_fd;
> + };
> + __u64 error64;
> + };
> +};

Eww. Why not just use an entirely different struct for TDX? I don't see what
benefit this provides other than a warm fuzzy feeling that TDX and SEV share a
struct. Practically speaking, KVM will likely take on more work to forcefully
smush the two together than if they're separate things.

2023-07-21 19:47:30

by Isaku Yamahata

Subject: Re: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

On Fri, Jul 21, 2023 at 07:51:04AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Thu, Jul 20, 2023, [email protected] wrote:
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index aa7a56a47564..32883e520b00 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
> > /* x86-specific KVM_EXIT_HYPERCALL flags. */
> > #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
> >
> > +struct kvm_mem_enc_cmd {
> > + /* sub-command id of KVM_MEM_ENC_OP. */
> > + __u32 id;
> > + /*
> > + * Auxiliary flags for sub-command. If sub-command doesn't use it,
> > + * set zero.
> > + */
> > + __u32 flags;
> > + /*
> > + * Data for sub-command. An immediate or a pointer to the actual
> > + * data in process virtual address. If sub-command doesn't use it,
> > + * set zero.
> > + */
> > + __u64 data;
> > + /*
> > + * Supplemental error code in the case of error.
> > + * SEV error code from the PSP or TDX SEAMCALL status code.
> > + * The caller should set zero.
> > + */
> > + union {
> > + struct {
> > + __u32 error;
> > + /*
> > + * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
> > + * require extra data. Not included in struct
> > + * kvm_sev_launch_start or struct kvm_sev_receive_start.
> > + */
> > + __u32 sev_fd;
> > + };
> > + __u64 error64;
> > + };
> > +};
>
> Eww. Why not just use an entirely different struct for TDX? I don't see what
> benefit this provides other than a warm fuzzy feeling that TDX and SEV share a
> struct. Practically speaking, KVM will likely take on more work to forcefully
> smush the two together than if they're separate things.

Ok, let's drop this patch. Keep the ABI different for now.
--
Isaku Yamahata <[email protected]>

2023-07-22 00:42:00

by Michael Roth

Subject: Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory

On Fri, Jul 21, 2023 at 07:25:58AM -0700, Sean Christopherson wrote:
> On Thu, Jul 20, 2023, [email protected] wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a73ddb43a2cf..35bb14363828 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4352,6 +4352,7 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > struct kvm_page_fault *fault)
> > {
> > int max_order, r;
> > + u8 max_level;
> >
> > if (!kvm_slot_can_be_private(fault->slot))
> > return kvm_do_memory_fault_exit(vcpu, fault);
> > @@ -4361,8 +4362,15 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > if (r)
> > return r;
> >
> > - fault->max_level = min(kvm_max_level_for_order(max_order),
> > - fault->max_level);
> > + max_level = kvm_max_level_for_order(max_order);
> > + r = static_call(kvm_x86_gmem_prepare)(vcpu->kvm, fault->slot, fault->pfn,
> > + fault->gfn, &max_level);
>
> I think KVM should hook gmem itself, i.e. convert pages to private when they're
> allocated, and then back to shared when they're freed. I don't like having
> asymmetric APIs (convert on fault instead of allocate, but then convert back on
> free instead of on zap), and hooking the page fault path also violates the approach
> of tying the RMP/HKID to the physical allocation.

I'd originally added an arch hook in kvm_gmem_get_pfn() to handle
RMP/HKID when the folio is allocated:

https://github.com/mdroth/linux/commit/adf78d224126f31e9096b80be21619e1ba447304

Based on your response it seemed like there was a preference to either:

a) do the RMP/HKID update via KVM MMU prior to mapping into the guest
b) implement an ioctl() to let userspace pre-allocate/pre-prepare and
do the RMP updates in advance.

https://lore.kernel.org/lkml/[email protected]/

So this patch basically implements suggestion a), or at least my
understanding of it. Is the original patch that does it via
kvm_gmem_get_pfn() more in line with what you're suggesting here, or
were you still thinking of an ioctl? Or some other place in the code?
Keep in mind the GPA is needed to do the RMP updates so that prevents
us from hooking into the more obvious place like kvm_gmem_get_folio().

>
> Practically speaking, hooking the fault path will result in undesirable behavior.
> Just because KVM *maps* at 4KiB granularity doesn't mean the RMP must be assigned
> at 4KiB granularity, e.g. if userspace chooses to *not* PUNCH_HOLE when the guest
> shares a single 4KiB page in a 2MiB chunk. Dirty logging is another case where
> the RMP can stay at 2MiB. Or as a very silly example, imagine userspace pulls a
> stupid and covers a single 2MiB chunk of gmem with 512 memslots.

Unfortunately I don't think things are quite that flexible with SNP. If the
RMP entry is 2MB, and you map a sub-page as 4K in the NPT, you'll immediately
get a PFERR_GUEST_SIZEM on the first access (presumably when the guest tries
to PVALIDATE it before use). The RMP fault handler will then subsequently
need to PSMASH the 2MB entry into 4K before that guest can access it. So you
get an extra page fault for every 2MB page that's mapped this way.
(APM Volume 2, Section 15.36.10).

That might not be a big deal for guests that are at least somewhat optimized
to make use of 2MB pages, but another situation is:

- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared but doesn't holepunch
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest converts that shared page back to private

At this point there are no mixed attributes, and KVM would normally allow
2MB NPT mappings again, but that is actually not possible here: the RMP
table entries are validated/4K and cannot be promoted by the hypervisor, so
the NPT mappings must still be limited to 4K to match. Otherwise we hit the
reverse of the PFERR_GUEST_SIZEM scenario above, where the NPT mapping level
is *larger* than the RMP entry level. Unfortunately that does not result in
a PFERR_GUEST_SIZEM where we can fix things up in response; instead it's a
general RMP fault that would be tricky to distinguish from an RMP fault
caused by an implicit page conversion or some other guest weirdness without
doing RMP table checks every time we get a general RMP fault.

So for all intents and purposes the mapping size and the RMP entry size end
up being intertwined and can't be fully decoupled, and if you don't take
that into account when updating the RMP entry, you'll have to do extra
PSMASHes in response to PFERR_GUEST_SIZEM RMP faults later.

>
> That likely means KVM will need an additional hook to clamp the max_level at the
> RMP, but that shouldn't be a big deal, e.g. if we convert on allocation, then KVM
> should be able to blindly do the conversion because it would be a kernel bug if
> the page is already assigned to an ASID in the RMP, i.e. the additional hook
> shouldn't incur an extra RMP lookup.

Yah we'd still need a hook in roughly this same place for clamping
max_level. Previous versions of SNP hypervisor patches all had a separate
hook for handling these cases, but since the work of updating the RMP table
prior to mapping isn't too dissimilar from the work of determining max
mapping size, I combined both of them into the kvm_x86_gmem_prepare()
hook/implementation.
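(For reference, a clamp-only hook could be as small as the sketch below.
snp_lookup_rmpentry() is a stand-in for whatever RMP-lookup helper the SNP
host patches provide; the name and signature here are assumptions, not final
code.)

/*
 * Sketch: clamp the NPT mapping level to the current RMP entry size so that
 * KVM never maps at a larger level than the RMP entry.
 */
static int sev_gmem_max_level(kvm_pfn_t pfn)
{
	bool assigned;
	int rmp_level;

	if (snp_lookup_rmpentry(pfn, &assigned, &rmp_level))
		return PG_LEVEL_4K;	/* be conservative on lookup failure */

	return assigned ? rmp_level : PG_LEVEL_4K;
}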

But I don't see any major issue with moving RMPUPDATE handling to an
allocation-time hook. As mentioned above we'd get additional
PFERR_GUEST_SIZEM faults by not taking MMU mapping level into account, but
I previously had it implemented similarly via a hook in kvm_gmem_get_pfn()
(because we need the GPA) and didn't notice anything major. But I'm not
sure exactly where you're suggesting we do it now, so I could use some
clarification on that if kvm_gmem_get_pfn() isn't what you had in mind.

WRT avoiding the RMP lookup, I'd tried __filemap_get_folio() without the
FGP_CREAT flag as a way to check whether the folio was already allocated
or not, but it seemed to result in a lockup, and didn't look like it
would be any better performance-wise than just doing the RMP lookup
anyway (which is just a normal memory read), so that's the route I ended
up going.
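(A sketch of that kind of probe, for context. Note that newer kernels have
__filemap_get_folio() return an ERR_PTR rather than NULL when the folio
doesn't exist, so the exact return-value handling depends on the base the
series is on.)

/*
 * Sketch: check whether the folio backing a gmem offset has already been
 * allocated, without allocating it (no FGP_CREAT).
 */
static bool kvm_gmem_folio_allocated(struct inode *inode, pgoff_t index)
{
	struct folio *folio;

	folio = __filemap_get_folio(inode->i_mapping, index, 0, 0);
	if (IS_ERR_OR_NULL(folio))
		return false;

	folio_put(folio);
	return true;
}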

-Mike

2023-07-22 01:25:53

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private

On Fri, Jul 21, 2023 at 07:11:43AM -0700,
Sean Christopherson <[email protected]> wrote:

> s/Introduce/Use
>
> This doesn't "introduce" anything, in the sense that it's an AMD-defined error
> code flag. That matters because KVM *did* introduce/define PFERR_IMPLICIT_ACCESS.
>
> On Thu, Jul 20, 2023, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Add two PFERR codes to designate that the page fault is private and that
> > it requires looking up memory attributes. The vendor kvm page fault
> > handler should set PFERR_GUEST_ENC_MASK bit based on their fault
> > information. It may or may not use the hardware value directly or
> > parse the hardware value to set the bit.
> >
> > For KVM_X86_PROTECTED_VM, ask memory attributes for the fault privateness.
>
> ...
>
> > +static inline bool kvm_is_fault_private(struct kvm *kvm, gpa_t gpa, u64 error_code)
> > +{
> > + /*
> > + * This is racy with mmu_seq. If we hit a race, it would result in a
> > + * spurious KVM_EXIT_MEMORY_FAULT.
> > + */
> > + if (kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
> > + return kvm_mem_is_private(kvm, gpa_to_gfn(gpa));
>
> Please synthesize the error code flag for SW-protected VMs, same as TDX, e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 20e289e872eb..de9e0a9c41e6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5751,6 +5751,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
> return RET_PF_RETRY;
>
> + if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
> + kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> + error_code |= PFERR_GUEST_ENC_MASK;
> +
> r = RET_PF_INVALID;
> if (unlikely(error_code & PFERR_RSVD_MASK)) {
> r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
>
> Functionally it's the same, but I want all VM types to have the same source of
> truth for private versus shared, and I really don't want kvm_is_fault_private()
> to exist.

Here is the updated patch.


From 30c452cd6a94b485eaa5f92dee4c222dd30ebcbe Mon Sep 17 00:00:00 2001
Message-Id: <30c452cd6a94b485eaa5f92dee4c222dd30ebcbe.1689987085.git.isaku.yamahata@intel.com>
In-Reply-To: <ab9d8654bd98ae24de05788a2ecaa4bea6c0c44b.1689987085.git.isaku.yamahata@intel.com>
References: <ab9d8654bd98ae24de05788a2ecaa4bea6c0c44b.1689987085.git.isaku.yamahata@intel.com>
From: Isaku Yamahata <[email protected]>
Date: Wed, 14 Jun 2023 12:34:00 -0700
Subject: [PATCH 4/8] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is
private

SEV-SNP defines PFERR_GUEST_ENC_MASK (bit 34) in the page-fault error code to
represent that the guest page is encrypted. Use the bit to designate that the
page fault is private and that it requires looking up memory attributes.
The vendor kvm page fault handler should set the PFERR_GUEST_ENC_MASK bit
based on its fault information. It may use the hardware value directly or
parse the hardware value to set the bit.

For KVM_X86_SW_PROTECTED_VM, ask the memory attributes for the fault's
privateness. For async page faults, carry the bit and use it in the kvm page
fault handler.

Signed-off-by: Isaku Yamahata <[email protected]>

---
Changes v4 -> v5:
- Eliminate kvm_is_fault_private() by open coding the function
- Make the async page fault handler carry the is_private bit

Changes v3 -> v4:
- rename back struct kvm_page_fault::private => is_private
- catch up rename: KVM_X86_PROTECTED_VM => KVM_X86_SW_PROTECTED_VM

Changes v2 -> v3:
- Revive PFERR_GUEST_ENC_MASK
- rename struct kvm_page_fault::is_private => private
- Add check KVM_X86_PROTECTED_VM

Changes v1 -> v2:
- Introduced fault type and replaced is_private with fault_type.
- Add kvm_get_fault_type() to encapsulate the difference.
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 32 +++++++++++++++++++++++---------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c9350aa0da4..a1ae3d881063 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -255,6 +255,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_BIT 15
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
#define PFERR_IMPLICIT_ACCESS_BIT 48

#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
@@ -266,6 +267,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)

#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
@@ -1770,6 +1772,7 @@ struct kvm_arch_async_pf {
gfn_t gfn;
unsigned long cr3;
bool direct_map;
+ u64 error_code;
};

extern u32 __read_mostly kvm_nr_uret_msrs;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a2fe091e327a..01e74af48e4c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4285,18 +4285,19 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
}

-static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- gfn_t gfn)
+static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_arch_async_pf arch;

arch.token = alloc_apf_token(vcpu);
- arch.gfn = gfn;
+ arch.gfn = fault->gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
+ arch.error_code = fault->error_code & PFERR_GUEST_ENC_MASK;

- return kvm_setup_async_pf(vcpu, cr2_or_gpa,
- kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+ return kvm_setup_async_pf(vcpu, fault->addr,
+ kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
}

void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
@@ -4315,7 +4316,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;

- kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
+ kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
+ true, NULL);
}

static inline u8 kvm_max_level_for_order(int order)
@@ -4399,8 +4401,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

- if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
- return kvm_do_memory_fault_exit(vcpu, fault);
+ if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
+ return RET_PF_RETRY;
+ else
+ return kvm_do_memory_fault_exit(vcpu, fault);
+ }

if (fault->is_private)
return kvm_faultin_pfn_private(vcpu, fault);
@@ -4418,7 +4424,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
@@ -5836,6 +5842,14 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;

+ /*
+ * This is racy with updating memory attributes with mmu_seq. If we
+ * hit a race, it would result in retrying page fault.
+ */
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 7f9ec1e5b136..3a423403af01 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -295,13 +295,13 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.user = err & PFERR_USER_MASK,
.prefetch = prefetch,
.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .is_private = err & PFERR_GUEST_ENC_MASK,
.nx_huge_page_workaround_enabled =
is_nx_huge_page_enabled(vcpu->kvm),

.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
};
int r;

--
2.25.1



--
Isaku Yamahata <[email protected]>

2023-07-25 10:55:24

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

On 7/21/2023 10:51 PM, Sean Christopherson wrote:
> On Thu, Jul 20, 2023, [email protected] wrote:
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index aa7a56a47564..32883e520b00 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
>> /* x86-specific KVM_EXIT_HYPERCALL flags. */
>> #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
>>
>> +struct kvm_mem_enc_cmd {
>> + /* sub-command id of KVM_MEM_ENC_OP. */
>> + __u32 id;
>> + /*
>> + * Auxiliary flags for sub-command. If sub-command doesn't use it,
>> + * set zero.
>> + */
>> + __u32 flags;
>> + /*
>> + * Data for sub-command. An immediate or a pointer to the actual
>> + * data in process virtual address. If sub-command doesn't use it,
>> + * set zero.
>> + */
>> + __u64 data;
>> + /*
>> + * Supplemental error code in the case of error.
>> + * SEV error code from the PSP or TDX SEAMCALL status code.
>> + * The caller should set zero.
>> + */
>> + union {
>> + struct {
>> + __u32 error;
>> + /*
>> + * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
>> + * require extra data. Not included in struct
>> + * kvm_sev_launch_start or struct kvm_sev_receive_start.
>> + */
>> + __u32 sev_fd;
>> + };
>> + __u64 error64;
>> + };
>> +};
>
> Eww. Why not just use an entirely different struct for TDX? I don't see what
> benefit this provides other than a warm fuzzy feeling that TDX and SEV share a
> struct. Practically speaking, KVM will likely take on more work to forcefully
> smush the two together than if they're separate things.

generalizing the struct of KVM_MEM_ENC_OP should be the first step. The
final target should be generalizing a set of commands for confidential
VMs (SEV-* VMs and TDs, maybe even for other arches), e.g., the commands
to create a confidential VM and the commands to live migrate a
confidential VM.

However, there seems to be no small divergence between the commands to
create SEV-* VMs and TDX VMs. I'm not sure if it is worth investigating and
pursuing.

2023-07-25 16:16:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

On Tue, Jul 25, 2023, Xiaoyao Li wrote:
> On 7/21/2023 10:51 PM, Sean Christopherson wrote:
> > On Thu, Jul 20, 2023, [email protected] wrote:
> > > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > > index aa7a56a47564..32883e520b00 100644
> > > --- a/arch/x86/include/uapi/asm/kvm.h
> > > +++ b/arch/x86/include/uapi/asm/kvm.h
> > > @@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
> > > /* x86-specific KVM_EXIT_HYPERCALL flags. */
> > > #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
> > > +struct kvm_mem_enc_cmd {
> > > + /* sub-command id of KVM_MEM_ENC_OP. */
> > > + __u32 id;
> > > + /*
> > > + * Auxiliary flags for sub-command. If sub-command doesn't use it,
> > > + * set zero.
> > > + */
> > > + __u32 flags;
> > > + /*
> > > + * Data for sub-command. An immediate or a pointer to the actual
> > > + * data in process virtual address. If sub-command doesn't use it,
> > > + * set zero.
> > > + */
> > > + __u64 data;
> > > + /*
> > > + * Supplemental error code in the case of error.
> > > + * SEV error code from the PSP or TDX SEAMCALL status code.
> > > + * The caller should set zero.
> > > + */
> > > + union {
> > > + struct {
> > > + __u32 error;
> > > + /*
> > > + * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
> > > + * require extra data. Not included in struct
> > > + * kvm_sev_launch_start or struct kvm_sev_receive_start.
> > > + */
> > > + __u32 sev_fd;
> > > + };
> > > + __u64 error64;
> > > + };
> > > +};
> >
> > Eww. Why not just use an entirely different struct for TDX? I don't see what
> > benefit this provides other than a warm fuzzy feeling that TDX and SEV share a
> > struct. Practically speaking, KVM will likely take on more work to forcefully
> > smush the two together than if they're separate things.
>
> generalizing the struct of KVM_MEM_ENC_OP should be the first step.

It's not just the one structure though. The "data" field is a pointer to yet
another layer of commands, and SEV has a rather big pile of those. Making
kvm_mem_enc_cmd common is just putting lipstick on a pig since the vast majority
of the structures associated with the ioctl() would still be vendor specific.

struct kvm_sev_launch_start
struct kvm_sev_launch_update_data
struct kvm_sev_launch_secret
struct kvm_sev_launch_measure
struct kvm_sev_guest_status
struct kvm_sev_dbg
struct kvm_sev_attestation_report
struct kvm_sev_send_start
struct kvm_sev_send_update_data
struct kvm_sev_receive_start
struct kvm_sev_receive_update_data

FWIW, I really dislike KVM's uAPI for KVM_MEM_ENC_OP. The above structures are
all basically copied verbatim from PSP firmware structures, i.e. the commands and
their payloads are tightly coupled to "hardware" and essentially have no abstraction
whatsoever. But that ship has already sailed, and practically speaking trying to
provide a layer of abstraction might not have worked very well anyways.

In other words, unless there's an obvious and easy path to convergence, I
recommend you don't spend much time/effort on trying to share code with TDX.

2023-07-27 01:16:43

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH v4 09/10] KVM: x86: Make struct sev_cmd common for KVM_MEM_ENC_OP

On Tue, Jul 25, 2023 at 08:36:09AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Tue, Jul 25, 2023, Xiaoyao Li wrote:
> > On 7/21/2023 10:51 PM, Sean Christopherson wrote:
> > > On Thu, Jul 20, 2023, [email protected] wrote:
> > > > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > > > index aa7a56a47564..32883e520b00 100644
> > > > --- a/arch/x86/include/uapi/asm/kvm.h
> > > > +++ b/arch/x86/include/uapi/asm/kvm.h
> > > > @@ -562,6 +562,39 @@ struct kvm_pmu_event_filter {
> > > > /* x86-specific KVM_EXIT_HYPERCALL flags. */
> > > > #define KVM_EXIT_HYPERCALL_LONG_MODE BIT(0)
> > > > +struct kvm_mem_enc_cmd {
> > > > + /* sub-command id of KVM_MEM_ENC_OP. */
> > > > + __u32 id;
> > > > + /*
> > > > + * Auxiliary flags for sub-command. If sub-command doesn't use it,
> > > > + * set zero.
> > > > + */
> > > > + __u32 flags;
> > > > + /*
> > > > + * Data for sub-command. An immediate or a pointer to the actual
> > > > + * data in process virtual address. If sub-command doesn't use it,
> > > > + * set zero.
> > > > + */
> > > > + __u64 data;
> > > > + /*
> > > > + * Supplemental error code in the case of error.
> > > > + * SEV error code from the PSP or TDX SEAMCALL status code.
> > > > + * The caller should set zero.
> > > > + */
> > > > + union {
> > > > + struct {
> > > > + __u32 error;
> > > > + /*
> > > > + * KVM_SEV_LAUNCH_START and KVM_SEV_RECEIVE_START
> > > > + * require extra data. Not included in struct
> > > > + * kvm_sev_launch_start or struct kvm_sev_receive_start.
> > > > + */
> > > > + __u32 sev_fd;
> > > > + };
> > > > + __u64 error64;
> > > > + };
> > > > +};
> > >
> > > Eww. Why not just use an entirely different struct for TDX? I don't see what
> > > benefit this provides other than a warm fuzzy feeling that TDX and SEV share a
> > > struct. Practically speaking, KVM will likely take on more work to forcefully
> > > smush the two together than if they're separate things.
> >
> > generalizing the struct of KVM_MEM_ENC_OP should be the first step.
>
> It's not just the one structure though. The "data" field is a pointer to yet
> another layer of commands, and SEV has a rather big pile of those. Making
> kvm_mem_enc_cmd common is just putting lipstick on a pig since the vast majority
> of the structures associated with the ioctl() would still be vendor specific.

> struct kvm_sev_launch_start
> struct kvm_sev_launch_update_data
> struct kvm_sev_launch_secret
> struct kvm_sev_launch_measure
> struct kvm_sev_guest_status
> struct kvm_sev_dbg
> struct kvm_sev_attestation_report
> struct kvm_sev_send_start
> struct kvm_sev_send_update_data
> struct kvm_sev_receive_start
> struct kvm_sev_receive_update_data
>
> FWIW, I really dislike KVM's uAPI for KVM_MEM_ENC_OP. The above structures are
> all basically copied verbatim from PSP firmware structures, i.e. the commands and
> their payloads are tightly coupled to "hardware" and essentially have no abstraction
> whatsoever. But that ship has already sailed, and practically speaking trying to
> provide a layer of abstraction might not have worked very well anyways.
>
> In other words, unless there's an obvious and easy path to convergence, I
> recommend you don't spend much time/effort on trying to share code with TDX.

I think we can easily unify vcpu initialization, populating/measuring initial
memory, completing guest creation, and guest memory access for debug.

KVM_SEV_LAUNCH_UPDATE_VMSA <-> KVM_TDX_INIT_VCPU
KVM_SEV_LAUNCH_UPDATE_DATA and KVM_SEV_LAUNCH_MEASURE <-> KVM_INIT_MEM_REGION
KVM_SEV_LAUNCH_FINISH <-> KVM_TDX_FINALIZE_VM
KVM_SEV_DBG_DECRYPT, KVM_SEV_DBG_ENCRYPT: KVM common API for access protected guest memory


Here's my assessment. For now I don't address migration.

For creating confidential guest:

- Get the capability of underlying platform
KVM_TDX_CAPABILITY: no sev correspondence.

- Initialize VM as confidential VM
struct kvm_sev_launch_start
KVM_SEV{,_ES}_INIT, and KVM_SEV_LAUNCH_START:
KVM_TDX_INIT_VM
They take vendor specific data.


- Initialize vcpu
KVM_SEV_LAUNCH_UPDATE_VMSA: no extra argument
KVM_TDX_INIT_VCPU: no extra argument


- populate initial memory + measurement
KVM_SEV_LAUNCH_UPDATE_DATA and KVM_SEV_LAUNCH_MEASURE,
struct kvm_sev_launch_update_data {
__u64 uaddr;
__u32 len;
};
struct kvm_sev_launch_measure {
__u64 uaddr;
__u32 len;
};
=> GPA is calculated from uaddr.

KVM_INIT_MEM_REGION:
struct kvm_tdx_init_mem_region {
__u64 source_addr; // uaddr
__u64 gpa;
__u64 nr_pages;
};

I think those can share the same structure, or be used for prefaulting or
prepopulating memory, e.g.
struct {
__u64 uaddr;
__u64 gpa;
__u64 len;
#define FLAG_MEASURE BIT(0)
#define FLAG_GPA BIT(1) // GPA is valid or calculated from uaddr
__u64 flags;
};


- Complete initialization. Make the guest ready to run vcpu
KVM_SEV_LAUNCH_FINISH: no argument
KVM_TDX_FINALIZE_VM: no argument

- KVM_SEV_LAUNCH_SECRET: no TDX correspondence
struct kvm_sev_launch_secret


For guest debug

- KVM_SEV_DBG_DECRYPT, KVM_SEV_DBG_ENCRYPT: struct kvm_sev_dbg
This is to read/write guest memory for debug. We can easily have a common
API.

- KVM_SEV_GUEST_STATUS
struct kvm_sev_guest_status
No TDX correspondence

Thanks,
--
Isaku Yamahata <[email protected]>

2023-08-29 19:19:19

by Michael Roth

[permalink] [raw]
Subject: Re: [RFC PATCH v4 07/10] KVM: x86: Add gmem hook for initializing private memory

On Fri, Aug 25, 2023 at 07:59:41PM -0500, Michael Roth wrote:
> On Fri, Aug 18, 2023 at 03:27:47PM -0700, Sean Christopherson wrote:
> > This seems like a bug in the SNP code. (a) why would KVM/SNP PSMASH in that
> > scenario and (b) why can't it zap/split the NPT before PSMASH?
>
> a) A PSMASH will be needed at some point, since, as detailed above, the 4K
> NPT mapping requires the RMP entries for the pages it maps to be
> limited to 4K RMP entries, but...
> b) What would happen normally[1] is the guest would issue PVALIDATE to
> *rescind* the validated status of that 4K GPA before issuing the GHCB
> request to convert it to shared. This would cause an
> #NPF(RMP+SIZE_MISMATCH) and handle_rmp_page_fault() would PSMASH the RMP
> entry so the PVALIDATE can succeed.
>
> So KVM doesn't even really have the option of deferring the PSMASH, it
> happens before the SET_MEMORY_ATTRIBUTES is even issued to zap the 2MB
> NPT mapping and switch the 2M range to 'mixed'. Because of that, we also
> need a hook in the KVM MMU code to clamp the max mapping level based
> on RMP entry size. Currently the kvm_gmem_prepare() in this patch
> doubles for handling that clamping, so we would still need a similar
> hook for that if we move the RMP initialization stuff to allocation
> time.
>
> [1] This handling is recommended for 'well-behaved' guests according to
> GHCB, but I don't see it documented as a hard requirement anywhere, so there
> is a possibility that we have to deal with a guest that doesn't do this.
> What would happen then is the PVALIDATE wouldn't trigger the #NPF(RMP+SIZEM),
> and instead the SET_MEMORY_ATTRIBUTES would zap the 2MB mapping, install
> 4K entries on next #NPF(NOT_PRESENT), and at *that* point we would get
> an #NPF(RMP) without the SIZEM bit set, due to the behavior described in
> the beginning of this email.
>
> handle_rmp_page_fault() can do the corresponding PSMASH to deal with that,
> but it is a little unfortunate since we can't differentiate that case from a
> spurious/unexpected RMP faults, so would need to attempt a PSMASH in all
> cases, sometimes failing.

The spurious case here is when the guest is accessing a private page
that's just been PSMASH'd by another thread. I thought these might still
occur before the PSMASH has completed so we'd still potentially see the
page-size bit set in the RMP entry, but the RMP faults only happen after
the PSMASH has finished, so the spurious cases can be filtered out by
just checking if the page-size bit is set before attempting a PSMASH.
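(Sketch of that filtering. snp_lookup_rmpentry() and psmash() stand in for
whatever helpers the SNP host patches provide, so the names and signatures
here are assumptions.)

/*
 * Sketch: only attempt a PSMASH when the RMP entry is still 2MB.  If another
 * thread already split it, the fault was spurious and the access can simply
 * be retried.
 */
static int snp_handle_rmp_size_fault(u64 pfn)
{
	bool assigned;
	int rmp_level;

	if (snp_lookup_rmpentry(pfn, &assigned, &rmp_level))
		return -EINVAL;

	if (!assigned || rmp_level == PG_LEVEL_4K)
		return 0;	/* spurious: already split */

	return psmash(pfn);
}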

>
> gmem itself could also trigger this case if the lpage_info_slot() tracking
> ever became more granular than what the guest expected (which I don't
> think would happen normally, but I may have hit one case where it does, but
> haven't had a chance to debug whether that's on the lpage_info_slot() side or
> something else on the SNP end).

Turns out that was for a case where the shared/private attributes for the
2M range really were mixed at the time of access. It happens when OVMF
converts some shared memory to private during early boot: the end address is
not 2MB-aligned, so it is in a mixed region, and the NPT mapping is 4K, but
the RMP entry is initialized as 2MB. The PVALIDATE and NPT then agree on the
4K mapping size, so the SIZEM bit isn't set, just the #NPF(RMP).

So we need to be able to deal with that even for 'well-behaved' guests.
With the RMP-init-during-mapping-time approach I had some checks that avoided
creating the 2MB RMP entry in this mixed case, which is why I didn't need
handling for this previously. But it's just one extra #NPF(RMP), and it can
be handled cleanly since it can be distinguished from the spurious cases.

-Mike

2024-02-22 02:06:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH v4 04/10] KVM: x86: Introduce PFERR_GUEST_ENC_MASK to indicate fault is private

On Fri, Jul 21, 2023, Isaku Yamahata wrote:
> From: Isaku Yamahata <[email protected]>
> Date: Wed, 14 Jun 2023 12:34:00 -0700
> Subject: [PATCH 4/8] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is
> private
>
> SEV-SNP defines PFERR_GUEST_ENC_MASK (bit 34) in the page-fault error code to
> represent that the guest page is encrypted. Use the bit to designate that the
> page fault is private and that it requires looking up memory attributes.
> The vendor kvm page fault handler should set the PFERR_GUEST_ENC_MASK bit
> based on its fault information. It may use the hardware value directly or
> parse the hardware value to set the bit.
>
> For KVM_X86_SW_PROTECTED_VM, ask the memory attributes for the fault's
> privateness. For async page faults, carry the bit and use it in the kvm page
> fault handler.
>
> Signed-off-by: Isaku Yamahata <[email protected]>

..

> @@ -4315,7 +4316,8 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
> return;
>
> - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> + true, NULL);

This is unnecessary; KVM doesn't support async page fault behavior for private
memory (and doesn't need to, because guest_memfd() doesn't support swap).

> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 7f9ec1e5b136..3a423403af01 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -295,13 +295,13 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> .user = err & PFERR_USER_MASK,
> .prefetch = prefetch,
> .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
> + .is_private = err & PFERR_GUEST_ENC_MASK,

This breaks SEV and SEV-ES guests, because AFAICT, the APM is lying by defining
PFERR_GUEST_ENC_MASK in the context of SNP. The flag isn't just set when running
SEV-SNP guests, it's set for all C-bit=1 effective accesses when running on SNP
capable hardware (at least, that's my observation).

Grumpiness about discovering yet another problem that I would have expected
_someone_ to stumble upon...

FYI, I'm going to post a rambling series to cleanup code in the page fault path
(it started as a cleanup of the "no slot" code and then grew a few more heads).
One of the patches I'm going to include is something that looks like this patch,
but I'm going to use a KVM-defined synthetic bit, because stuffing a bit that KVM
would need to _clear_ on _some_ hardware is gross.
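(For illustration only, a KVM-defined synthetic flag might look like the
sketch below; the bit position and name are made up here, not necessarily
what will be merged.)

/* Sketch: a synthetic, KVM-defined "private access" error code bit. */
#define PFERR_PRIVATE_ACCESS_BIT	49
#define PFERR_PRIVATE_ACCESS		BIT_ULL(PFERR_PRIVATE_ACCESS_BIT)

/* In kvm_mmu_page_fault(), before dispatching the fault: */
if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
    kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
	error_code |= PFERR_PRIVATE_ACCESS;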