LinuxLists.cc - [PATCH v2 00/10] KVM: MMU changes for confidential computing

2024-04-16 20:20:34

Subject: [PATCH v2 00/10] KVM: MMU changes for confidential computing

This includes the MMU parts of "TDX/SNP part 1 of n"[1] while the rest
was posted as "KVM: guest_memfd: New hooks and functionality for SEV-SNP
and TDX"[2] last week.

It includes two basic parts:

- Allow non-zero value for non-present SPTE and removed SPTE, so that
TDX can set the "suppress VE" bit

- Use PFERR_GUEST_ENC_MASK to indicate fault is private.

The changes from the previous posting were suggested by Chao Gao:

- small adjustment to comments

- fix checks for "!pte_access && !shadow_present_mask", rewriting them
as "(pte_access | shadow_present_mask) != SHADOW_NONPRESENT_VALUE".

- move allocation of ve_info outside init_vmcs()

I'll push this series later this week to kvm/next, and rebase kvm-coco-queue
on top.

Paolo

[1] https://patchew.org/linux/[email protected]/
[2] https://patchew.org/linux/[email protected]/

Isaku Yamahata (3):
KVM: x86/mmu: Add Suppress VE bit to EPT
shadow_mmio_mask/shadow_present_mask
KVM: VMX: Introduce test mode related to EPT violation VE
KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults

Paolo Bonzini (3):
KVM, x86: add architectural support code for #VE
KVM: x86/mmu: Use PFERR_GUEST_ENC_MASK to indicate fault is private
KVM: x86/mmu: check for invalid async page faults involving private
memory

Sean Christopherson (4):
KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
values
KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
SPTE
KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

arch/x86/include/asm/kvm_host.h | 5 +++
arch/x86/include/asm/vmx.h | 13 ++++++++
arch/x86/kvm/Kconfig | 13 ++++++++
arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 6 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 16 ++++-----
arch/x86/kvm/mmu/spte.c | 24 ++++++++------
arch/x86/kvm/mmu/spte.h | 24 +++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 18 +++++-----
arch/x86/kvm/vmx/vmcs.h | 5 +++
arch/x86/kvm/vmx/vmx.c | 59 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +++-
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 16 +++++++--
15 files changed, 201 insertions(+), 57 deletions(-)

--
2.43.0

2024-04-16 20:21:52

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 04/10] KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_mask

From: Isaku Yamahata <[email protected]>

To make use of the same value of shadow_mmio_mask and shadow_present_mask
for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common for both VMX and TDX.

TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for
shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
spte value is defined so as to cause EPT misconfig.

Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/mmu/spte.c | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4dba17363008..ac6da0a5f5e6 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -514,6 +514,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 768aaeddf5fa..0a0e83859c27 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -413,7 +413,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
/*
* EPT overrides the host MTRRs, and so KVM must program the desired
* memtype directly into the SPTEs. Note, this mask is just the mask
@@ -430,7 +432,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
* of an EPT paging-structure entry is 110b (write/execute).
*/
kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
+ VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);

--
2.43.0

2024-04-16 20:22:10

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 05/10] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis

From: Sean Christopherson <[email protected]>

TDX will use a different shadow PTE entry value for MMIO from VMX. Add a
member to kvm_arch and track value for MMIO per-VM instead of a global
variable. By using the per-VM EPT entry value for MMIO, the existing VMX
logic is kept working. Introduce a separate setter function so that guest
TD can use a different value later.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 7 ++++---
arch/x86/kvm/mmu/spte.c | 4 ++--
arch/x86/kvm/mmu/spte.h | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.c | 6 +++---
5 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..9f92bdb78504 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1313,6 +1313,8 @@ struct kvm_arch {
*/
spinlock_t mmu_unsync_pages_lock;

+ u64 shadow_mmio_value;
+
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
#define __KVM_HAVE_ARCH_NONCOHERENT_DMA
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fbfdc606f1f1..45b6d8f9e359 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2462,7 +2462,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
return kvm_mmu_prepare_zap_page(kvm, child,
invalid_list);
}
- } else if (is_mmio_spte(pte)) {
+ } else if (is_mmio_spte(kvm, pte)) {
mmu_spte_clear_no_track(spte);
}
return 0;
@@ -4144,7 +4144,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (WARN_ON_ONCE(reserved))
return -EINVAL;

- if (is_mmio_spte(spte)) {
+ if (is_mmio_spte(vcpu->kvm, spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
unsigned int access = get_mmio_spte_access(spte);

@@ -4760,7 +4760,7 @@ EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
- if (unlikely(is_mmio_spte(*sptep))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
@@ -6267,6 +6267,7 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)

void kvm_mmu_init_vm(struct kvm *kvm)
{
+ kvm->arch.shadow_mmio_value = shadow_mmio_value;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0a0e83859c27..a5e014d7bc62 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,10 +74,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);

access &= shadow_mmio_access_mask;
- spte |= shadow_mmio_value | access;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 8056b7853a79..5dd5405fa07a 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,9 +265,9 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
return spte_to_child_sp(root);
}

-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
- return (spte & shadow_mmio_mask) == shadow_mmio_value &&
+ return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
likely(enable_mmio_caching);
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f5401967897a..5fd618abc243 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -495,8 +495,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
- if (WARN_ON_ONCE(!is_mmio_spte(old_spte) &&
- !is_mmio_spte(new_spte) &&
+ if (WARN_ON_ONCE(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
@@ -1028,7 +1028,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
--
2.43.0

2024-04-16 20:23:37

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 10/10] KVM: x86/mmu: check for invalid async page faults involving private memory

Right now the error code is not used when an async page fault is completed.
This is not a problem in the current code, but it is untidy. For protected
VMs, we will also need to check that the page attributes match the current
state of the page, because asynchronous page faults can only occur on
shared pages (private pages go through kvm_faultin_pfn_private() instead of
__gfn_to_pfn_memslot()).

Start by piping the error code from kvm_arch_setup_async_pf() to
kvm_arch_async_page_ready() via the architecture-specific async page
fault data. For now, it can be used to assert that there are no
async page faults on private memory.

Extracted from a patch by Isaku Yamahata.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 17 ++++++++++-------
2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7c73952b6f4e..57ec96bd4221 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1850,6 +1850,7 @@ struct kvm_arch_async_pf {
gfn_t gfn;
unsigned long cr3;
bool direct_map;
+ u64 error_code;
};

extern u32 __read_mostly kvm_nr_uret_msrs;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33aea47dce8b..402d04aa5423 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4207,24 +4207,27 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
}

-static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- gfn_t gfn)
+static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_arch_async_pf arch;

arch.token = alloc_apf_token(vcpu);
- arch.gfn = gfn;
+ arch.gfn = fault->gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);

- return kvm_setup_async_pf(vcpu, cr2_or_gpa,
- kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+ return kvm_setup_async_pf(vcpu, fault->addr,
+ kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
}

void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
int r;

+ if (WARN_ON_ONCE(work->arch.error_code & PFERR_GUEST_ENC_MASK))
+ return;
+
if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
work->wakeup_all)
return;
@@ -4237,7 +4240,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;

- kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
+ kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
}

static inline u8 kvm_max_level_for_order(int order)
@@ -4342,7 +4345,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
--
2.43.0

2024-04-16 20:23:48

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 06/10] KVM, x86: add architectural support code for #VE

Dump the contents of the #VE info data structure and assert that #VE does
not happen, but do not yet do anything with it.

No functional change intended, separated for clarity only.

Extracted from a patch by Isaku Yamahata <[email protected]>.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/vmx.h | 12 ++++++++++++
arch/x86/kvm/vmx/vmx.c | 12 ++++++++++++
2 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index ac6da0a5f5e6..d77a31039f24 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -71,6 +71,7 @@
#define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
#define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
#define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
#define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
#define SECONDARY_EXEC_ENABLE_XSAVES VMCS_CONTROL_BIT(XSAVES)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -226,6 +227,8 @@ enum vmcs_field {
VMREAD_BITMAP_HIGH = 0x00002027,
VMWRITE_BITMAP = 0x00002028,
VMWRITE_BITMAP_HIGH = 0x00002029,
+ VE_INFORMATION_ADDRESS = 0x0000202A,
+ VE_INFORMATION_ADDRESS_HIGH = 0x0000202B,
XSS_EXIT_BITMAP = 0x0000202C,
XSS_EXIT_BITMAP_HIGH = 0x0000202D,
ENCLS_EXITING_BITMAP = 0x0000202E,
@@ -631,4 +634,13 @@ enum vmx_l1d_flush_state {

extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;

+struct vmx_ve_information {
+ u32 exit_reason;
+ u32 delivery;
+ u64 exit_qualification;
+ u64 guest_linear_address;
+ u64 guest_physical_address;
+ u16 eptp_index;
+};
+
#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6780313914f8..2c746318c6c3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6408,6 +6408,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ struct vmx_ve_information *ve_info;
+
+ pr_err("VE info address = 0x%016llx\n",
+ vmcs_read64(VE_INFORMATION_ADDRESS));
+ ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
+ pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
+ ve_info->exit_reason, ve_info->delivery,
+ ve_info->exit_qualification,
+ ve_info->guest_linear_address,
+ ve_info->guest_physical_address, ve_info->eptp_index);
+ }
}

/*
--
2.43.0

2024-04-16 20:33:22

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 07/10] KVM: VMX: Introduce test mode related to EPT violation VE

From: Isaku Yamahata <[email protected]>

To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
suppress #VE bit in EPT entries selectively, in order to be able to trap
non-present conditions. However, #VE isn't used for VMX and it's a bug
if it happens. To be defensive and test that VMX case isn't broken
introduce an option ept_violation_ve_test and when it's set, BUG the vm.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/Kconfig | 13 ++++++++++++
arch/x86/kvm/vmx/vmcs.h | 5 +++++
arch/x86/kvm/vmx/vmx.c | 47 ++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +++++-
4 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 3aaf7e86a859..7632fe6e4db9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -95,6 +95,19 @@ config KVM_INTEL
To compile this as a module, choose M here: the module
will be called kvm-intel.

+config KVM_INTEL_PROVE_VE
+ bool "Check that guests do not receive #VE exceptions"
+ default KVM_PROVE_MMU || DEBUG_KERNEL
+ depends on KVM_INTEL
+ help
+
+ Checks that KVM's page table management code will not incorrectly
+ let guests receive a virtualization exception. Virtualization
+ exceptions will be trapped by the hypervisor rather than injected
+ in the guest.
+
+ If unsure, say N.
+
config X86_SGX_KVM
bool "Software Guard eXtensions (SGX) Virtualization"
depends on X86_SGX && KVM_INTEL
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7c1996b433e2..b25625314658 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
return is_exception_n(intr_info, NM_VECTOR);
}

+static inline bool is_ve_fault(u32 intr_info)
+{
+ return is_exception_n(intr_info, VE_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2c746318c6c3..3c2fb1310aaa 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -869,6 +869,12 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)

eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
(1u << DB_VECTOR) | (1u << AC_VECTOR);
+ /*
+ * #VE isn't used for VMX. To test against unexpected changes
+ * related to #VE for VMX, intercept unexpected #VE and warn on it.
+ */
+ if (IS_ENABLED(CONFIG_KVM_INTEL_PROVE_VE))
+ eb |= 1u << VE_VECTOR;
/*
* Guest access to VMware backdoor ports could legitimately
* trigger #GP because of TSS I/O permission bitmap.
@@ -2602,6 +2608,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_cpu_based_2nd_exec_control))
return -EIO;
}
+ if (!IS_ENABLED(CONFIG_KVM_INTEL_PROVE_VE))
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
+
#ifndef CONFIG_X86_64
if (!(_cpu_based_2nd_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
@@ -2626,6 +2635,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
return -EIO;

vmx_cap->ept = 0;
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
}
if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
vmx_cap->vpid) {
@@ -4588,6 +4598,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
if (!enable_ept) {
exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+ exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
enable_unrestricted_guest = 0;
}
if (!enable_unrestricted_guest)
@@ -4711,8 +4722,21 @@ static void init_vmcs(struct vcpu_vmx *vmx)

exec_controls_set(vmx, vmx_exec_control(vmx));

- if (cpu_has_secondary_exec_ctrls())
+ if (cpu_has_secondary_exec_ctrls()) {
secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+ if (vmx->ve_info) {
+ vmcs_write64(VE_INFORMATION_ADDRESS,
+ __pa(vmx->ve_info));
+ } else {
+ /*
+ * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
+ * used only for debugging, it's okay to leave
+ * it disabled.
+ */
+ secondary_exec_controls_clearbit(vmx,
+ SECONDARY_EXEC_EPT_VIOLATION_VE);
+ }
+ }

if (cpu_has_tertiary_exec_ctrls())
tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
@@ -5200,6 +5224,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
if (is_invalid_opcode(intr_info))
return handle_ud(vcpu);

+ /*
+ * #VE isn't supposed to happen. Block the VM if it does.
+ */
+ if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
+ return -EIO;
+
error_code = 0;
if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -7474,6 +7504,8 @@ void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_vpid(vmx->vpid);
nested_vmx_free_vcpu(vcpu);
free_loaded_vmcs(vmx->loaded_vmcs);
+ if (vmx->ve_info)
+ free_page((unsigned long)vmx->ve_info);
}

int vmx_vcpu_create(struct kvm_vcpu *vcpu)
@@ -7567,6 +7599,19 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
goto free_vmcs;
}

+ if (vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ struct page *page;
+
+ BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
+
+ /* ve_info must be page aligned. */
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (page)
+ vmx->ve_info = page_to_virt(page);
+ else
+ pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
+ }
+
if (vmx_can_use_ipiv(vcpu))
WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
__pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 65786dbe7d60..0da79a386825 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -362,6 +362,9 @@ struct vcpu_vmx {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
} shadow_msr_intercept;
+
+ /* ve_info must be page aligned. */
+ struct vmx_ve_information *ve_info;
};

struct kvm_vmx {
@@ -574,7 +577,8 @@ static inline u8 vmx_get_rvi(void)
SECONDARY_EXEC_ENABLE_VMFUNC | \
SECONDARY_EXEC_BUS_LOCK_DETECTION | \
SECONDARY_EXEC_NOTIFY_VM_EXITING | \
- SECONDARY_EXEC_ENCLS_EXITING)
+ SECONDARY_EXEC_ENCLS_EXITING | \
+ SECONDARY_EXEC_EPT_VIOLATION_VE)

#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
#define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
--
2.43.0

2024-04-16 20:34:01

by Paolo Bonzini

[permalink] [raw]

Subject: [PATCH v2 03/10] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE

From: Sean Christopherson <[email protected]>

For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
is not able to access the private memory of TD guest and do the emulation.
Instead, TD guest expects to receive #VE when it accesses the MMIO and then
it can explicitly make hypercall to KVM to get the expected information.

To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control. And accordingly, for the MMIO spte for the shared GPA,
1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
violation happens on TD accessing MMIO range. 2. On EPT violation, KVM
sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA
that is not populated yet, EPT violation need to be triggered when TD guest
accesses such shared GPA. The non-present SPTE value for shared GPA should
set "suppress #VE" bit.

Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/paging_tmpl.h | 12 ++++++------
arch/x86/kvm/mmu/spte.c | 14 +++++++-------
arch/x86/kvm/mmu/spte.h | 16 +++++++++++++++-
3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index bebd73cd61bb..9aac3aa93d88 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -933,13 +933,13 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
return 0;

/*
- * Drop the SPTE if the new protections would result in a RWX=0
- * SPTE or if the gfn is changing. The RWX=0 case only affects
- * EPT with execute-only support, i.e. EPT without an effective
- * "present" bit, as all other paging modes will create a
- * read-only SPTE if pte_access is zero.
+ * Drop the SPTE if the new protections result in no effective
+ * "present" bit or if the gfn is changing. The former case
+ * only affects EPT with execute-only support with pte_access==0;
+ * all other paging modes will create a read-only SPTE if
+ * pte_access is zero.
*/
- if ((!pte_access && !shadow_present_mask) ||
+ if ((pte_access | shadow_present_mask) == SHADOW_NONPRESENT_VALUE ||
gfn != kvm_mmu_page_get_gfn(sp, i)) {
drop_spte(vcpu->kvm, &sp->spt[i]);
return 1;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 6c7ab3aa6aa7..768aaeddf5fa 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -144,19 +144,19 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
u64 spte = SPTE_MMU_PRESENT_MASK;
bool wrprot = false;

- WARN_ON_ONCE(!pte_access && !shadow_present_mask);
+ /*
+ * For the EPT case, shadow_present_mask has no RWX bits set if
+ * exec-only page table entries are supported. In that case,
+ * ACC_USER_MASK and shadow_user_mask are used to represent
+ * read access. See FNAME(gpte_access) in paging_tmpl.h.
+ */
+ WARN_ON_ONCE((pte_access | shadow_present_mask) == SHADOW_NONPRESENT_VALUE);

if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED;
else if (kvm_mmu_page_ad_need_write_protect(sp))
spte |= SPTE_TDP_AD_WRPROT_ONLY;

- /*
- * For the EPT case, shadow_present_mask is 0 if hardware
- * supports exec-only page table entries. In that case,
- * ACC_USER_MASK and shadow_user_mask are used to represent
- * read access. See FNAME(gpte_access) in paging_tmpl.h.
- */
spte |= shadow_present_mask;
if (!prefetch)
spte |= spte_shadow_accessed_mask(spte);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0f4ec2859474..8056b7853a79 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,7 +149,21 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+/*
+ * Non-present SPTE value needs to set bit 63 for TDX, in order to suppress
+ * #VE and get EPT violations on non-present PTEs. We can use the
+ * same value also without TDX for both VMX and SVM:
+ *
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
+ * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif

extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
@@ -192,7 +206,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
* both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
- * vulnerability. Use only low bits to avoid 64-bit immediates.
+ * vulnerability.
*
* Only used by the TDP MMU.
*/
--
2.43.0

2024-04-17 23:00:32

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v2 07/10] KVM: VMX: Introduce test mode related to EPT violation VE

On Tue, Apr 16, 2024, Paolo Bonzini wrote:
> @@ -4711,8 +4722,21 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>
> exec_controls_set(vmx, vmx_exec_control(vmx));
>
> - if (cpu_has_secondary_exec_ctrls())
> + if (cpu_has_secondary_exec_ctrls()) {
> secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
> + if (vmx->ve_info) {
> + vmcs_write64(VE_INFORMATION_ADDRESS,
> + __pa(vmx->ve_info));
> + } else {
> + /*
> + * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
> + * used only for debugging, it's okay to leave
> + * it disabled.
> + */
> + secondary_exec_controls_clearbit(vmx,
> + SECONDARY_EXEC_EPT_VIOLATION_VE);

As below, this is silly.

> + }
> + }
>
> if (cpu_has_tertiary_exec_ctrls())
> tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
> @@ -5200,6 +5224,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> if (is_invalid_opcode(intr_info))
> return handle_ud(vcpu);
>
> + /*
> + * #VE isn't supposed to happen. Block the VM if it does.
> + */

Doesn't need to be a multi-line comment. Though I would just drop the comment,
the KVM_BUG_ON() makes it pretty darn clear #VE is unexpected.

> + if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
> + return -EIO;
> +
> error_code = 0;
> if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
> error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> @@ -7474,6 +7504,8 @@ void vmx_vcpu_free(struct kvm_vcpu *vcpu)
> free_vpid(vmx->vpid);
> nested_vmx_free_vcpu(vcpu);
> free_loaded_vmcs(vmx->loaded_vmcs);
> + if (vmx->ve_info)

free_page() handles '0', though hopefully this becomes a moot point.

> + free_page((unsigned long)vmx->ve_info);
> }
>
> int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> @@ -7567,6 +7599,19 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> goto free_vmcs;
> }
>
> + if (vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_EPT_VIOLATION_VE) {
> + struct page *page;
> +
> + BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
> +
> + /* ve_info must be page aligned. */
> + page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + if (page)

Can we please just treat this as an error. The odds of us screwing up checks
against vmx->ve_info are higher than the odds of someone enabling KVM_INTEL_PROVE_VE
on a machine with such high memory pressure that a 4KiB allocation fails, all
subequent memory allocations succeeding, *and* caring that VM creation fails.

The pr_err() in the failure path is even more ridiculous.

> + vmx->ve_info = page_to_virt(page);
> + else
> + pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
> + }
> +
> if (vmx_can_use_ipiv(vcpu))
> WRITE_ONCE(to_kvm_vmx(vcpu->kvm)->pid_table[vcpu->vcpu_id],
> __pa(&vmx->pi_desc) | PID_TABLE_ENTRY_VALID);

2024-04-19 07:35:27

by Xiaoyao Li

[permalink] [raw]

Subject: Re: [PATCH v2 10/10] KVM: x86/mmu: check for invalid async page faults involving private memory

On 4/17/2024 4:19 AM, Paolo Bonzini wrote:
> Right now the error code is not used when an async page fault is completed.
> This is not a problem in the current code, but it is untidy. For protected
> VMs, we will also need to check that the page attributes match the current
> state of the page, because asynchronous page faults can only occur on
> shared pages (private pages go through kvm_faultin_pfn_private() instead of
> __gfn_to_pfn_memslot()).
>
> Start by piping the error code from kvm_arch_setup_async_pf() to
> kvm_arch_async_page_ready() via the architecture-specific async page
> fault data.

It is missed in this patch ...

> For now, it can be used to assert that there are no
> async page faults on private memory.
>
> Extracted from a patch by Isaku Yamahata.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 17 ++++++++++-------
> 2 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7c73952b6f4e..57ec96bd4221 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1850,6 +1850,7 @@ struct kvm_arch_async_pf {
> gfn_t gfn;
> unsigned long cr3;
> bool direct_map;
> + u64 error_code;
> };
>
> extern u32 __read_mostly kvm_nr_uret_msrs;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 33aea47dce8b..402d04aa5423 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4207,24 +4207,27 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
> return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
> }
>
> -static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - gfn_t gfn)
> +static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> {
> struct kvm_arch_async_pf arch;
>
> arch.token = alloc_apf_token(vcpu);
> - arch.gfn = gfn;
> + arch.gfn = fault->gfn;
> arch.direct_map = vcpu->arch.mmu->root_role.direct;
> arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);

+ arch.error_code = fault->error_code;

>
> - return kvm_setup_async_pf(vcpu, cr2_or_gpa,
> - kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> + return kvm_setup_async_pf(vcpu, fault->addr,
> + kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
> }
>
> void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> {
> int r;
>
> + if (WARN_ON_ONCE(work->arch.error_code & PFERR_GUEST_ENC_MASK))
> + return;
> +
> if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
> work->wakeup_all)
> return;
> @@ -4237,7 +4240,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
> return;
>
> - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
> }
>
> static inline u8 kvm_max_level_for_order(int order)
> @@ -4342,7 +4345,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> return RET_PF_RETRY;
> - } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
> + } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> return RET_PF_RETRY;
> }
> }

2024-04-19 07:38:55

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [PATCH v2 10/10] KVM: x86/mmu: check for invalid async page faults involving private memory

On Fri, Apr 19, 2024 at 9:35 AM Xiaoyao Li <[email protected]> wrote:
>
> On 4/17/2024 4:19 AM, Paolo Bonzini wrote:
> > Right now the error code is not used when an async page fault is completed.
> > This is not a problem in the current code, but it is untidy. For protected
> > VMs, we will also need to check that the page attributes match the current
> > state of the page, because asynchronous page faults can only occur on
> > shared pages (private pages go through kvm_faultin_pfn_private() instead of
> > __gfn_to_pfn_memslot()).
> >
> > Start by piping the error code from kvm_arch_setup_async_pf() to
> > kvm_arch_async_page_ready() via the architecture-specific async page
> > fault data.
>
> It is missed in this patch ...

Ugh, thanks Xiaoyao!

Paolo