This is a first set of hopefully non-controversial patches from the
SNP and TDX series. They mostly cover changes to generic code and new
gmem APIs, and in general they have already been reviewed when
previously posted by Isaku and Michael.
One important change is that the gmem hook for initializing memory
is designed to return -EEXIST if the page already exists in the
guestmemfd filemap. The idea is that the special case of
KVM_SEV_SNP_LAUNCH_UPDATE, where __kvm_gmem_get_pfn() is used to
return an uninitialized page and make it guest-owned, can be done at
most once per page unless the ioctl fails.
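As a rough sketch of the flow this enables (the ioctl-side code is not
in this series; snp_encrypt_in_place() and the exact argument lists are
only illustrative placeholders):

    r = kvm_gmem_get_uninit_pfn(kvm, slot, gfn, &pfn, &max_order);
    if (r)
        return r;       /* -EEXIST if the page was already initialized */

    r = snp_encrypt_in_place(pfn, src);         /* placeholder */
    if (r)
        kvm_gmem_undo_get_pfn(kvm, slot, gfn);  /* let userspace retry */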
Of course these patches add a bunch of dead code. This is intentional
because it's the only way to trim the large TDX (and to some extent SNP)
series to the point that it's possible to discuss them. The next step is
probably going to be the private<->shared page logic from the TDX series.
Paolo
Isaku Yamahata (5):
KVM: x86/mmu: Add Suppress VE bit to EPT
shadow_mmio_mask/shadow_present_mask
KVM: VMX: Introduce test mode related to EPT violation VE
KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
allocation
KVM: x86/tdp_mmu: Sprinkle __must_check
KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
Michael Roth (2):
KVM: x86: Add gmem hook for invalidating memory
KVM: x86: Add gmem hook for determining max NPT mapping level
Paolo Bonzini (6):
KVM: x86/mmu: pass error code back to MMU when async pf is ready
KVM: x86/mmu: Use PFERR_GUEST_ENC_MASK to indicate fault is private
KVM: guest_memfd: pass error up from filemap_grab_folio
filemap: add FGP_CREAT_ONLY
KVM: x86: Add gmem hook for initializing memory
KVM: guest_memfd: add API to undo kvm_gmem_get_uninit_pfn
Sean Christopherson (7):
KVM: x86: Split core of hypercall emulation to helper function
KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
values
KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
SPTE
KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
argument
Tom Lendacky (1):
KVM: SEV: Use a VMSA physical address variable for populating VMCB
arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 12 +
arch/x86/include/asm/vmx.h | 13 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 55 ++--
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 4 +-
arch/x86/kvm/mmu/spte.c | 16 +-
arch/x86/kvm/mmu/spte.h | 21 +-
arch/x86/kvm/mmu/tdp_iter.h | 12 +
arch/x86/kvm/mmu/tdp_mmu.c | 74 +++--
arch/x86/kvm/svm/sev.c | 3 +-
arch/x86/kvm/svm/svm.c | 9 +-
arch/x86/kvm/svm/svm.h | 1 +
arch/x86/kvm/vmx/main.c | 168 +++++++++++
arch/x86/kvm/vmx/vmcs.h | 5 +
arch/x86/kvm/vmx/vmx.c | 460 +++++++++++------------------
arch/x86/kvm/vmx/vmx.h | 6 +-
arch/x86/kvm/vmx/x86_ops.h | 124 ++++++++
arch/x86/kvm/x86.c | 69 +++--
include/linux/kvm_host.h | 25 ++
include/linux/kvm_types.h | 1 +
include/linux/pagemap.h | 2 +
mm/filemap.c | 4 +
virt/kvm/Kconfig | 8 +
virt/kvm/guest_memfd.c | 120 +++++++-
virt/kvm/kvm_main.c | 16 +-
29 files changed, 855 insertions(+), 387 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
--
2.39.0
From: Tom Lendacky <[email protected]>
In preparation to support SEV-SNP AP Creation, use a variable that holds
the VMSA physical address rather than converting the virtual address.
This will allow SEV-SNP AP Creation to set the new physical address that
will be used should the vCPU reset path be taken.
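As an illustrative example (not part of this patch), the AP Creation
handler will then be able to do roughly:

    svm->sev_es.vmsa_pa = new_vmsa_pa;  /* taken from the guest's request */

and the existing VMCB setup will program svm->vmcb->control.vmsa_pa
from it the next time the vCPU reset path runs.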
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/sev.c | 3 +--
arch/x86/kvm/svm/svm.c | 9 ++++++++-
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index a9e2fcf494a2..2bde1ad6bcfd 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3094,8 +3094,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
* the VMSA will be NULL if this vCPU is the destination for intrahost
* migration, and will be copied later.
*/
- if (svm->sev_es.vmsa)
- svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+ svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa;
/* Can't intercept CR register access, HV can't modify CR registers */
svm_clr_intercept(svm, INTERCEPT_CR0_READ);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f4a750426b24..8893975826f1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1459,9 +1459,16 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
svm_switch_vmcb(svm, &svm->vmcb01);
- if (vmsa_page)
+ if (vmsa_page) {
svm->sev_es.vmsa = page_address(vmsa_page);
+ /*
+ * Do not include the encryption mask on the VMSA physical
+ * address since hardware will access it using the guest key.
+ */
+ svm->sev_es.vmsa_pa = __pa(svm->sev_es.vmsa);
+ }
+
svm->guest_state_loaded = false;
return 0;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7a921acc534f..1812fd61ea56 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -198,6 +198,7 @@ struct vcpu_sev_es_state {
struct ghcb *ghcb;
u8 valid_bitmap[16];
struct kvm_host_map ghcb_map;
+ hpa_t vmsa_pa;
bool received_first_sipi;
/* SEV-ES scratch area support */
--
2.39.0
From: Isaku Yamahata <[email protected]>
Refactor tdp_mmu_alloc_sp() and tdp_mmu_init_sp() and eliminate
tdp_mmu_init_child_sp(). Currently tdp_mmu_init_sp() (or
tdp_mmu_init_child_sp()) sets kvm_mmu_page.role after tdp_mmu_alloc_sp()
allocates struct kvm_mmu_page and its page table page. This patch makes
tdp_mmu_alloc_sp() initialize kvm_mmu_page.role instead of
tdp_mmu_init_sp().
To handle private page tables, an is_private argument needs to be passed
down. Given that the page level is already passed down, it would be
cumbersome to add yet another parameter describing the sp. Instead,
replace the level argument with union kvm_mmu_page_role: the number of
arguments doesn't grow, and more information about the sp can be passed
down.
For a private sp, a secure page table will also be allocated in addition
to struct kvm_mmu_page and the page table (the spt member). The
allocation functions (tdp_mmu_alloc_sp() and
__tdp_mmu_alloc_sp_for_split()) need to know whether the allocation is
for a conventional page table or a private page table. Pass union
kvm_mmu_page_role to those functions and initialize the role member of
struct kvm_mmu_page.
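At the call sites, the change boils down to the following (taken from
the kvm_tdp_mmu_map() hunk below):

    /* before */
    sp = tdp_mmu_alloc_sp(vcpu);
    tdp_mmu_init_child_sp(sp, &iter);

    /* after */
    sp = tdp_mmu_alloc_sp(vcpu, tdp_iter_child_role(&iter));
    tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);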
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <d69acdd7f0b0b104f330a6d42ac28f9a9b1b5850.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/tdp_iter.h | 12 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 44 ++++++++++++++++---------------------
2 files changed, 31 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..e1e40e3f5eb7 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -135,4 +135,16 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);
+static inline union kvm_mmu_page_role tdp_iter_child_role(struct tdp_iter *iter)
+{
+ union kvm_mmu_page_role child_role;
+ struct kvm_mmu_page *parent_sp;
+
+ parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+ child_role = parent_sp->role;
+ child_role.level--;
+ return child_role;
+}
+
#endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d15c44a8e123..55b5e3857e98 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -184,24 +184,30 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
+ union kvm_mmu_page_role role)
{
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->role = role;
return sp;
}
static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
- gfn_t gfn, union kvm_mmu_page_role role)
+ gfn_t gfn)
{
INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
- sp->role = role;
+ /*
+ * role must be set before calling this function. At least role.level
+ * is not 0 (PG_LEVEL_NONE).
+ */
+ WARN_ON_ONCE(!sp->role.word);
sp->gfn = gfn;
sp->ptep = sptep;
sp->tdp_mmu_page = true;
@@ -209,20 +215,6 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
trace_kvm_mmu_get_page(sp, true);
}
-static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
- struct tdp_iter *iter)
-{
- struct kvm_mmu_page *parent_sp;
- union kvm_mmu_page_role role;
-
- parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
-
- role = parent_sp->role;
- role.level--;
-
- tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
-}
-
int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
@@ -260,8 +252,8 @@ int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
goto out_spin_unlock;
}
- root = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_sp(root, NULL, 0, role);
+ root = tdp_mmu_alloc_sp(vcpu, role);
+ tdp_mmu_init_sp(root, NULL, 0);
/*
* TDP MMU roots are kept until they are explicitly invalidated, either
@@ -1118,8 +1110,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* The SPTE is either non-present or points to a huge page that
* needs to be split.
*/
- sp = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_child_sp(sp, &iter);
+ sp = tdp_mmu_alloc_sp(vcpu, tdp_iter_child_role(&iter));
+ tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1362,7 +1354,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mmu_page_role role)
{
struct kvm_mmu_page *sp;
@@ -1372,6 +1364,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
if (!sp)
return NULL;
+ sp->role = role;
sp->spt = (void *)__get_free_page(gfp);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
@@ -1385,6 +1378,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
struct tdp_iter *iter,
bool shared)
{
+ union kvm_mmu_page_role role = tdp_iter_child_role(iter);
struct kvm_mmu_page *sp;
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
@@ -1398,7 +1392,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, role);
if (sp)
return sp;
@@ -1410,7 +1404,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
write_unlock(&kvm->mmu_lock);
iter->yielded = true;
- sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, role);
if (shared)
read_lock(&kvm->mmu_lock);
@@ -1505,7 +1499,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
continue;
}
- tdp_mmu_init_child_sp(sp, &iter);
+ tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
goto retry;
--
2.39.0
From: Isaku Yamahata <[email protected]>
The TDP MMU allows tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic()
to return -EBUSY or -EAGAIN. The caller must check the return value and
retry. Add __must_check to ensure that it does so.
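For reference, callers are expected to handle the error along these
lines (simplified from the existing TDP MMU loops):

    if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte)) {
            /* Lost a race with another writer; restart the walk. */
            goto retry;
    }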
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 55b5e3857e98..3627744fcab6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -539,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* no side-effects other than setting iter->old_spte to the last
* known value of the spte.
*/
-static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter,
- u64 new_spte)
+static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);
@@ -571,8 +571,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
return 0;
}
-static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter)
+static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter)
{
int ret;
--
2.39.0
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/pagemap.h | 2 ++
mm/filemap.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..e8ac0b32f84d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
* * %FGP_CREAT - If no folio is present then a new folio is allocated,
* added to the page cache and the VM's LRU list. The folio is
* returned locked.
+ * * %FGP_CREAT_ONLY - Fail if a folio is already present
* * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
* folio is already in cache. If the folio was allocated, unlock it
* before returning so the caller can do the same dance.
@@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
#define FGP_NOWAIT ((__force fgf_t)0x00000020)
#define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
#define FGP_STABLE ((__force fgf_t)0x00000080)
+#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
#define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
#define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..d5107bd0cd09 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio = NULL;
if (!folio)
goto no_page;
+ if (fgp_flags & FGP_CREAT_ONLY) {
+ folio_put(folio);
+ return ERR_PTR(-EEXIST);
+ }
if (fgp_flags & FGP_LOCK) {
if (fgp_flags & FGP_NOWAIT) {
--
2.39.0
Right now the error code is not used when an async page fault is completed.
This is not a problem in the current code, but it is untidy. For protected
VMs we need to check that the page attributes match the current state of the
page. Async page faults can only occur on shared pages (because
private pages go through kvm_faultin_pfn_private() instead of
__gfn_to_pfn_memslot()), but it is risky to rely on the polarity of
PFERR_GUEST_ENC_MASK and the high 32 bits of the error code being zero.
So, for clarity and future-proofing of the code, pipe the error code
from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the
architecture-specific async page fault data.
Extracted from a patch by Isaku Yamahata.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a4514c2ef0ec..24e30ca2ca8f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1839,6 +1839,7 @@ struct kvm_arch_async_pf {
gfn_t gfn;
unsigned long cr3;
bool direct_map;
+ u64 error_code;
};
extern u32 __read_mostly kvm_nr_uret_msrs;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f58ca6cb789a..c9890e5b6e4c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4260,18 +4260,18 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
}
-static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- gfn_t gfn)
+static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
{
struct kvm_arch_async_pf arch;
arch.token = alloc_apf_token(vcpu);
- arch.gfn = gfn;
+ arch.gfn = fault->gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
- return kvm_setup_async_pf(vcpu, cr2_or_gpa,
- kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+ return kvm_setup_async_pf(vcpu, fault->addr,
+ kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
}
void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
@@ -4290,7 +4290,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;
- kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
+ kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
}
static inline u8 kvm_max_level_for_order(int order)
@@ -4395,7 +4395,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
--
2.39.0
From: Isaku Yamahata <[email protected]>
In some cases the full 64-bit error code for the KVM page fault will be
needed to determine things like whether or not a fault was for a private
or shared guest page. Update the related code to accept the full 64-bit
value so it can be plumbed all the way through to where it is needed.
The use of lower_32_bits() moves from kvm_mmu_page_fault() to
FNAME(page_fault), since the guest page-table walk is independent of the
data in the upper bits of the error code.
Signed-off-by: Isaku Yamahata <[email protected]>
Link: https://lore.kernel.org/kvm/[email protected]/T/#mbd0b20c9a2cf50319d5d2a27b63f73c772112076
[mdr: drop references/changes to code not in current gmem tree, update
commit message]
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 3 +--
arch/x86/kvm/mmu/mmu_internal.h | 4 ++--
arch/x86/kvm/mmu/mmutrace.h | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
4 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 195e46a1f00f..f58ca6cb789a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5869,8 +5869,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
}
if (r == RET_PF_INVALID) {
- r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
- lower_32_bits(error_code), false,
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
&emulation_type);
if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
return -EIO;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..21f55e8b4dc6 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -190,7 +190,7 @@ static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
struct kvm_page_fault {
/* arguments to kvm_mmu_do_page_fault. */
const gpa_t addr;
- const u32 error_code;
+ const u64 error_code;
const bool prefetch;
/* Derived from error_code. */
@@ -280,7 +280,7 @@ enum {
};
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+ u64 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index ae86820cef69..195d98bc8de8 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -260,7 +260,7 @@ TRACE_EVENT(
TP_STRUCT__entry(
__field(int, vcpu_id)
__field(gpa_t, cr2_or_gpa)
- __field(u32, error_code)
+ __field(u64, error_code)
__field(u64 *, sptep)
__field(u64, old_spte)
__field(u64, new_spte)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index bebd73cd61bb..ed2923d9a934 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -787,7 +787,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
* The bit needs to be cleared before walking guest page tables.
*/
r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
- fault->error_code & ~PFERR_RSVD_MASK);
+ lower_32_bits(fault->error_code) & ~PFERR_RSVD_MASK);
/*
* The page is not mapped by the guest. Let the guest handle it.
--
2.39.0
From: Michael Roth <[email protected]>
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode. Go
ahead and define an arch-specific hook for x86, since it will be needed
for handling memory used for SEV-SNP guests.
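For illustration, a vendor implementation of the new hook would
eventually look something like this (sev_gmem_invalidate() and
snp_make_page_shared() are placeholders, not part of this patch):

    static void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
    {
            kvm_pfn_t pfn;

            for (pfn = start; pfn < end; pfn++)
                    snp_make_page_shared(pfn);  /* return page to the host */
    }

wired up via the .gmem_invalidate member of the vendor's kvm_x86_ops.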
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 7 +++++++
include/linux/kvm_host.h | 4 ++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/guest_memfd.c | 14 ++++++++++++++
6 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index adfaad15e7e6..42474acb7375 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,6 +140,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d873d08f739..e523b204697d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1805,6 +1805,7 @@ struct kvm_x86_ops {
gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+ void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eff532ea59c9..9d5603adf542 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13605,6 +13605,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
}
#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+ static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
int kvm_spec_ctrl_test_value(u64 value)
{
/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 192c58116220..3835732491b9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2465,4 +2465,8 @@ static inline int kvm_gmem_undo_get_pfn(struct kvm *kvm,
int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index dcce0c3b5b13..39356f5babbd 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -115,3 +115,7 @@ config KVM_GENERIC_PRIVATE_MEM
config HAVE_KVM_GMEM_PREPARE
bool
depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 535ef1aa34fb..74e19170af8a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -345,10 +345,24 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
return MF_DELAYED;
}
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+ struct page *page = folio_page(folio, 0);
+ kvm_pfn_t pfn = page_to_pfn(page);
+ int order = folio_order(folio);
+
+ kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
static const struct address_space_operations kvm_gmem_aops = {
.dirty_folio = noop_dirty_folio,
.migrate_folio = kvm_gmem_migrate_folio,
.error_remove_folio = kvm_gmem_error_folio,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+ .free_folio = kvm_gmem_free_folio,
+#endif
};
static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
--
2.39.0
Some SNP ioctls will require the page not to be in the pagecache, and as such they
will want to return -EEXIST to userspace. Start by passing the error up from
filemap_grab_folio().
Signed-off-by: Paolo Bonzini <[email protected]>
---
virt/kvm/guest_memfd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..de0d5a5c210c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -20,7 +20,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
/* TODO: Support huge pages. */
folio = filemap_grab_folio(inode->i_mapping, index);
if (IS_ERR_OR_NULL(folio))
- return NULL;
+ return folio;
/*
* Use the up-to-date flag to track whether or not the memory has been
@@ -146,8 +146,8 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
}
folio = kvm_gmem_get_folio(inode, index);
- if (!folio) {
- r = -ENOMEM;
+ if (IS_ERR_OR_NULL(folio)) {
+ r = folio ? PTR_ERR(folio) : -ENOMEM;
break;
}
--
2.39.0
From: Isaku Yamahata <[email protected]>
To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
suppress #VE bit in EPT entries selectively, in order to be able to trap
non-present conditions. However, #VE isn't used for VMX and it's a bug
if it happens. To be defensive and to test that the VMX case isn't broken,
introduce an option ept_violation_ve_test and, when it is set, BUG the VM.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/vmx.h | 12 +++++++
arch/x86/kvm/vmx/vmcs.h | 5 +++
arch/x86/kvm/vmx/vmx.c | 69 +++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +++-
4 files changed, 90 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 76ed39541a52..f703bae0c4ac 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -70,6 +70,7 @@
#define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
#define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
#define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
#define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
#define SECONDARY_EXEC_ENABLE_XSAVES VMCS_CONTROL_BIT(XSAVES)
#define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -225,6 +226,8 @@ enum vmcs_field {
VMREAD_BITMAP_HIGH = 0x00002027,
VMWRITE_BITMAP = 0x00002028,
VMWRITE_BITMAP_HIGH = 0x00002029,
+ VE_INFORMATION_ADDRESS = 0x0000202A,
+ VE_INFORMATION_ADDRESS_HIGH = 0x0000202B,
XSS_EXIT_BITMAP = 0x0000202C,
XSS_EXIT_BITMAP_HIGH = 0x0000202D,
ENCLS_EXITING_BITMAP = 0x0000202E,
@@ -630,4 +633,13 @@ enum vmx_l1d_flush_state {
extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
+struct vmx_ve_information {
+ u32 exit_reason;
+ u32 delivery;
+ u64 exit_qualification;
+ u64 guest_linear_address;
+ u64 guest_physical_address;
+ u16 eptp_index;
+};
+
#endif
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7c1996b433e2..b25625314658 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
return is_exception_n(intr_info, NM_VECTOR);
}
+static inline bool is_ve_fault(u32 intr_info)
+{
+ return is_exception_n(intr_info, VE_VECTOR);
+}
+
/* Undocumented: icebp/int1 */
static inline bool is_icebp(u32 intr_info)
{
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9239a89dea22..6468f421ba9e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -126,6 +126,9 @@ module_param(error_on_inconsistent_vmcs_config, bool, 0444);
static bool __read_mostly dump_invalid_vmcs = 0;
module_param(dump_invalid_vmcs, bool, 0644);
+static bool __read_mostly ept_violation_ve_test;
+module_param(ept_violation_ve_test, bool, 0444);
+
#define MSR_BITMAP_MODE_X2APIC 1
#define MSR_BITMAP_MODE_X2APIC_APICV 2
@@ -868,6 +871,12 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
(1u << DB_VECTOR) | (1u << AC_VECTOR);
+ /*
+ * #VE isn't used for VMX. To test against unexpected changes
+ * related to #VE for VMX, intercept unexpected #VE and warn on it.
+ */
+ if (ept_violation_ve_test)
+ eb |= 1u << VE_VECTOR;
/*
* Guest access to VMware backdoor ports could legitimately
* trigger #GP because of TSS I/O permission bitmap.
@@ -2603,6 +2613,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
&_cpu_based_2nd_exec_control))
return -EIO;
}
+ if (!ept_violation_ve_test)
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
+
#ifndef CONFIG_X86_64
if (!(_cpu_based_2nd_exec_control &
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
@@ -2627,6 +2640,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
return -EIO;
vmx_cap->ept = 0;
+ _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
}
if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
vmx_cap->vpid) {
@@ -4592,6 +4606,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
if (!enable_ept) {
exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+ exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
enable_unrestricted_guest = 0;
}
if (!enable_unrestricted_guest)
@@ -4715,8 +4730,40 @@ static void init_vmcs(struct vcpu_vmx *vmx)
exec_controls_set(vmx, vmx_exec_control(vmx));
- if (cpu_has_secondary_exec_ctrls())
+ if (cpu_has_secondary_exec_ctrls()) {
secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+ if (secondary_exec_controls_get(vmx) &
+ SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ if (!vmx->ve_info) {
+ /* ve_info must be page aligned. */
+ struct page *page;
+
+ BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (page)
+ vmx->ve_info = page_to_virt(page);
+ }
+ if (vmx->ve_info) {
+ /*
+ * Allow #VE delivery. CPU sets this field to
+ * 0xFFFFFFFF on #VE delivery. Another #VE can
+ * occur only if software clears the field.
+ */
+ vmx->ve_info->delivery = 0;
+ vmcs_write64(VE_INFORMATION_ADDRESS,
+ __pa(vmx->ve_info));
+ } else {
+ /*
+ * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
+ * used only when ept_violation_ve_test is true,
+ * it's okay to go with the bit disabled.
+ */
+ pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
+ secondary_exec_controls_clearbit(vmx,
+ SECONDARY_EXEC_EPT_VIOLATION_VE);
+ }
+ }
+ }
if (cpu_has_tertiary_exec_ctrls())
tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
@@ -5204,6 +5251,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
if (is_invalid_opcode(intr_info))
return handle_ud(vcpu);
+ /*
+ * #VE isn't supposed to happen. Block the VM if it does.
+ */
+ if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
+ return -EIO;
+
error_code = 0;
if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -6393,6 +6446,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
pr_err("Virtual processor ID = 0x%04x\n",
vmcs_read16(VIRTUAL_PROCESSOR_ID));
+ if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+ struct vmx_ve_information *ve_info;
+
+ pr_err("VE info address = 0x%016llx\n",
+ vmcs_read64(VE_INFORMATION_ADDRESS));
+ ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
+ pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
+ ve_info->exit_reason, ve_info->delivery,
+ ve_info->exit_qualification,
+ ve_info->guest_linear_address,
+ ve_info->guest_physical_address, ve_info->eptp_index);
+ }
}
/*
@@ -7433,6 +7498,8 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_vpid(vmx->vpid);
nested_vmx_free_vcpu(vcpu);
free_loaded_vmcs(vmx->loaded_vmcs);
+ if (vmx->ve_info)
+ free_page((unsigned long)vmx->ve_info);
}
static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e3b0985bb74a..1ea1e5c8930d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -364,6 +364,9 @@ struct vcpu_vmx {
DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
} shadow_msr_intercept;
+
+ /* ve_info must be page aligned. */
+ struct vmx_ve_information *ve_info;
};
struct kvm_vmx {
@@ -576,7 +579,8 @@ static inline u8 vmx_get_rvi(void)
SECONDARY_EXEC_ENABLE_VMFUNC | \
SECONDARY_EXEC_BUS_LOCK_DETECTION | \
SECONDARY_EXEC_NOTIFY_VM_EXITING | \
- SECONDARY_EXEC_ENCLS_EXITING)
+ SECONDARY_EXEC_ENCLS_EXITING | \
+ SECONDARY_EXEC_EPT_VIOLATION_VE)
#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
#define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
--
2.39.0
guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move the pages into that state by updating the
corresponding entries in the RMP table.
Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.
The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).
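Because the callback runs on every allocation/grab, an arch
implementation is expected to be idempotent, roughly along these lines
(the helper names are placeholders, not part of this patch):

    static int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn,
                                int max_order)
    {
            if (snp_page_is_private(pfn))   /* placeholder: consult the RMP */
                    return 0;               /* already in the expected state */

            return snp_make_page_private(pfn, gfn, max_order);  /* placeholder */
    }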
In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned'. So also add a
helper that performs the same function as kvm_gmem_get_pfn(), but
bypasses the preparation callback so that the pages can be accessed
beforehand.
Link: https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 6 +++
include/linux/kvm_host.h | 14 ++++++
virt/kvm/Kconfig | 4 ++
virt/kvm/guest_memfd.c | 72 +++++++++++++++++++++++++++---
6 files changed, 92 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index ac8b7614e79d..adfaad15e7e6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -139,6 +139,7 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
+KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7de8a3f2a118..6d873d08f739 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1804,6 +1804,7 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
+ int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f10a5a617120..eff532ea59c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13598,6 +13598,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
+{
+ return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
+}
+#endif
int kvm_spec_ctrl_test_value(u64 value)
{
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 97afe4519772..03bf616b7308 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2434,6 +2434,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
#ifdef CONFIG_KVM_PRIVATE_MEM
int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2442,6 +2444,18 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+
+static inline int kvm_gmem_get_uninit_pfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ kvm_pfn_t *pfn, int *max_order)
+{
+ KVM_BUG_ON(1, kvm);
+ return -EIO;
+}
#endif /* CONFIG_KVM_PRIVATE_MEM */
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
+#endif
+
#endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index a11e9c80fac9..dcce0c3b5b13 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -111,3 +111,7 @@ config KVM_GENERIC_PRIVATE_MEM
select KVM_GENERIC_MEMORY_ATTRIBUTES
select KVM_PRIVATE_MEM
bool
+
+config HAVE_KVM_GMEM_PREPARE
+ bool
+ depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index de0d5a5c210c..7ec7afafc960 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,12 +13,50 @@ struct kvm_gmem {
struct list_head entry;
};
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
+{
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+ struct list_head *gmem_list = &inode->i_mapping->i_private_list;
+ struct kvm_gmem *gmem;
+
+ list_for_each_entry(gmem, gmem_list, entry) {
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = gmem->kvm;
+ struct page *page;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+ int rc;
+
+ slot = xa_load(&gmem->bindings, index);
+ if (!slot)
+ continue;
+
+ page = folio_file_page(folio, index);
+ pfn = page_to_pfn(page);
+ gfn = slot->base_gfn + index - slot->gmem.pgoff;
+ rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
+ if (rc) {
+ pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
+ index, rc);
+ return rc;
+ }
+ }
+
+#endif
+ return 0;
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
{
struct folio *folio;
+ fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+
+ if (!prepare)
+ fgp_flags |= FGP_CREAT_ONLY;
/* TODO: Support huge pages. */
- folio = filemap_grab_folio(inode->i_mapping, index);
+ folio = __filemap_get_folio(inode->i_mapping, index, fgp_flags,
+ mapping_gfp_mask(inode->i_mapping));
if (IS_ERR_OR_NULL(folio))
return folio;
@@ -41,6 +79,15 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
folio_mark_uptodate(folio);
}
+ if (prepare) {
+ int r = kvm_gmem_prepare_folio(inode, index, folio);
+ if (r < 0) {
+ folio_unlock(folio);
+ folio_put(folio);
+ return ERR_PTR(r);
+ }
+ }
+
/*
* Ignore accessed, referenced, and dirty flags. The memory is
* unevictable and there is no storage to write back to.
@@ -145,7 +192,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
break;
}
- folio = kvm_gmem_get_folio(inode, index);
+ folio = kvm_gmem_get_folio(inode, index, true);
if (IS_ERR_OR_NULL(folio)) {
r = folio ? PTR_ERR(folio) : -ENOMEM;
break;
@@ -482,8 +529,8 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
fput(file);
}
-int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
- gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+static int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare)
{
pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
struct kvm_gmem *gmem;
@@ -503,7 +550,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
goto out_fput;
}
- folio = kvm_gmem_get_folio(file_inode(file), index);
+ folio = kvm_gmem_get_folio(file_inode(file), index, prepare);
if (!folio) {
r = -ENOMEM;
goto out_fput;
@@ -529,4 +576,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
return r;
}
+
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, true);
+}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
+
+int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+{
+ return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, false);
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_uninit_pfn);
--
2.39.0
From: Michael Roth <[email protected]>
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:
- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again. However, this is not actually allowed,
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.
Add a hook to determine the max NPT mapping size in situations like
this.
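For illustration, the eventual SEV-SNP implementation of this hook
would look roughly like the following (the RMP lookup helper is a
placeholder; the real code comes with the SNP series):

    static int sev_gmem_validate_fault(struct kvm *kvm, kvm_pfn_t pfn,
                                       gfn_t gfn, u8 *max_level)
    {
            /*
             * If the RMP entry covering this pfn has been smashed to 4K,
             * the NPT mapping must not be larger than 4K either.
             */
            if (snp_rmp_entry_is_4k(pfn))   /* placeholder */
                    *max_level = PG_LEVEL_4K;

            return 0;
    }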
Signed-off-by: Michael Roth <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 7 +++++++
3 files changed, 9 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 42474acb7375..436e3c157fae 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -140,6 +140,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
KVM_X86_OP_OPTIONAL(gmem_invalidate)
#undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e523b204697d..259e6bb1e447 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1806,6 +1806,7 @@ struct kvm_x86_ops {
gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
+ int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
};
struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6b4cb71668df..bcf12ac489f9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4339,6 +4339,13 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
fault->max_level);
fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+ r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
+ fault->gfn, &fault->max_level);
+ if (r) {
+ kvm_release_pfn_clean(fault->pfn);
+ return r;
+ }
+
return RET_PF_CONTINUE;
}
--
2.39.0
From: Sean Christopherson <[email protected]>
KVM accesses the Virtual Machine Control Structure (VMCS) with VMX
instructions to operate on a VM. TDX doesn't allow the VMM to operate on
the VMCS directly. Instead, TDX has its own data structures and TDX
SEAMCALL APIs for the VMM to indirectly operate on those data
structures. This means we must have a TDX version of kvm_x86_ops.
The existing global struct kvm_x86_ops already defines an interface
which can be adapted to TDX, but kvm_x86_ops is a system-wide structure,
not a per-VM one. To allow VMX to coexist with TDs, the kvm_x86_ops
callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick
VMX or TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX. Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.
The eventually converted code will look like this:
vmx.c:
    vmx_op() { ... }
    VMX initialization
tdx.c:
    tdx_op() { ... }
    TDX initialization
x86_ops.h:
    vmx_op();
    tdx_op();
main.c:
    static vt_op() { if (tdx) tdx_op() else vmx_op() }
    static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
    initialization functions call both VMX and TDX initialization
Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/vmx/main.c | 168 +++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 375 ++++++++++---------------------------
arch/x86/kvm/vmx/x86_ops.h | 124 ++++++++++++
4 files changed, 395 insertions(+), 274 deletions(-)
create mode 100644 arch/x86/kvm/vmx/main.c
create mode 100644 arch/x86/kvm/vmx/x86_ops.h
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 744a1ea3ee5c..8cee22145b1e 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -20,7 +20,7 @@ kvm-$(CONFIG_KVM_XEN) += xen.o
kvm-$(CONFIG_KVM_SMM) += smm.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
- vmx/nested.o vmx/posted_intr.o
+ vmx/nested.o vmx/posted_intr.o vmx/main.o
kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..63d32867065e
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+#define VMX_REQUIRED_APICV_INHIBITS \
+ (BIT(APICV_INHIBIT_REASON_DISABLE)| \
+ BIT(APICV_INHIBIT_REASON_ABSENT) | \
+ BIT(APICV_INHIBIT_REASON_HYPERV) | \
+ BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
+ BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+ .name = KBUILD_MODNAME,
+
+ .check_processor_compatibility = vmx_check_processor_compat,
+
+ .hardware_unsetup = vmx_hardware_unsetup,
+
+ .hardware_enable = vmx_hardware_enable,
+ .hardware_disable = vmx_hardware_disable,
+ .has_emulated_msr = vmx_has_emulated_msr,
+
+ .vm_size = sizeof(struct kvm_vmx),
+ .vm_init = vmx_vm_init,
+ .vm_destroy = vmx_vm_destroy,
+
+ .vcpu_precreate = vmx_vcpu_precreate,
+ .vcpu_create = vmx_vcpu_create,
+ .vcpu_free = vmx_vcpu_free,
+ .vcpu_reset = vmx_vcpu_reset,
+
+ .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+ .vcpu_load = vmx_vcpu_load,
+ .vcpu_put = vmx_vcpu_put,
+
+ .update_exception_bitmap = vmx_update_exception_bitmap,
+ .get_msr_feature = vmx_get_msr_feature,
+ .get_msr = vmx_get_msr,
+ .set_msr = vmx_set_msr,
+ .get_segment_base = vmx_get_segment_base,
+ .get_segment = vmx_get_segment,
+ .set_segment = vmx_set_segment,
+ .get_cpl = vmx_get_cpl,
+ .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+ .is_valid_cr0 = vmx_is_valid_cr0,
+ .set_cr0 = vmx_set_cr0,
+ .is_valid_cr4 = vmx_is_valid_cr4,
+ .set_cr4 = vmx_set_cr4,
+ .set_efer = vmx_set_efer,
+ .get_idt = vmx_get_idt,
+ .set_idt = vmx_set_idt,
+ .get_gdt = vmx_get_gdt,
+ .set_gdt = vmx_set_gdt,
+ .set_dr7 = vmx_set_dr7,
+ .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+ .cache_reg = vmx_cache_reg,
+ .get_rflags = vmx_get_rflags,
+ .set_rflags = vmx_set_rflags,
+ .get_if_flag = vmx_get_if_flag,
+
+ .flush_tlb_all = vmx_flush_tlb_all,
+ .flush_tlb_current = vmx_flush_tlb_current,
+ .flush_tlb_gva = vmx_flush_tlb_gva,
+ .flush_tlb_guest = vmx_flush_tlb_guest,
+
+ .vcpu_pre_run = vmx_vcpu_pre_run,
+ .vcpu_run = vmx_vcpu_run,
+ .handle_exit = vmx_handle_exit,
+ .skip_emulated_instruction = vmx_skip_emulated_instruction,
+ .update_emulated_instruction = vmx_update_emulated_instruction,
+ .set_interrupt_shadow = vmx_set_interrupt_shadow,
+ .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .patch_hypercall = vmx_patch_hypercall,
+ .inject_irq = vmx_inject_irq,
+ .inject_nmi = vmx_inject_nmi,
+ .inject_exception = vmx_inject_exception,
+ .cancel_injection = vmx_cancel_injection,
+ .interrupt_allowed = vmx_interrupt_allowed,
+ .nmi_allowed = vmx_nmi_allowed,
+ .get_nmi_mask = vmx_get_nmi_mask,
+ .set_nmi_mask = vmx_set_nmi_mask,
+ .enable_nmi_window = vmx_enable_nmi_window,
+ .enable_irq_window = vmx_enable_irq_window,
+ .update_cr8_intercept = vmx_update_cr8_intercept,
+ .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
+ .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
+ .hwapic_irr_update = vmx_hwapic_irr_update,
+ .hwapic_isr_update = vmx_hwapic_isr_update,
+ .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+ .sync_pir_to_irr = vmx_sync_pir_to_irr,
+ .deliver_interrupt = vmx_deliver_interrupt,
+ .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+ .set_tss_addr = vmx_set_tss_addr,
+ .set_identity_map_addr = vmx_set_identity_map_addr,
+ .get_mt_mask = vmx_get_mt_mask,
+
+ .get_exit_info = vmx_get_exit_info,
+
+ .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+ .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+ .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+ .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+ .write_tsc_offset = vmx_write_tsc_offset,
+ .write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+ .load_mmu_pgd = vmx_load_mmu_pgd,
+
+ .check_intercept = vmx_check_intercept,
+ .handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+ .request_immediate_exit = vmx_request_immediate_exit,
+
+ .sched_in = vmx_sched_in,
+
+ .cpu_dirty_log_size = PML_ENTITY_NUM,
+ .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+ .nested_ops = &vmx_nested_ops,
+
+ .pi_update_irte = vmx_pi_update_irte,
+ .pi_start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+ .set_hv_timer = vmx_set_hv_timer,
+ .cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+ .setup_mce = vmx_setup_mce,
+
+#ifdef CONFIG_KVM_SMM
+ .smi_allowed = vmx_smi_allowed,
+ .enter_smm = vmx_enter_smm,
+ .leave_smm = vmx_leave_smm,
+ .enable_smi_window = vmx_enable_smi_window,
+#endif
+
+ .check_emulate_instruction = vmx_check_emulate_instruction,
+ .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .migrate_timers = vmx_migrate_timers,
+
+ .msr_filter_changed = vmx_msr_filter_changed,
+ .complete_emulated_msr = kvm_complete_insn_gp,
+
+ .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+ .get_untagged_addr = vmx_get_untagged_addr,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+ .hardware_setup = vmx_hardware_setup,
+ .handle_intel_pt_intr = NULL,
+
+ .runtime_ops = &vt_x86_ops,
+ .pmu_ops = &intel_pmu_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6468f421ba9e..3d8a7e4c8e37 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -65,6 +65,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "x86_ops.h"
#include "smm.h"
#include "vmx_onhyperv.h"
@@ -519,8 +520,6 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
static unsigned long host_idt_base;
#if IS_ENABLED(CONFIG_HYPERV)
-static struct kvm_x86_ops vmx_x86_ops __initdata;
-
static bool __read_mostly enlightened_vmcs = true;
module_param(enlightened_vmcs, bool, 0444);
@@ -570,9 +569,8 @@ static __init void hv_init_evmcs(void)
}
if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
- vmx_x86_ops.enable_l2_tlb_flush
+ vt_x86_ops.enable_l2_tlb_flush
= hv_enable_l2_tlb_flush;
-
} else {
enlightened_vmcs = false;
}
@@ -1484,7 +1482,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
* Switches to specified vcpu, until a matching vcpu_put(), but assumes
* vcpu mutex is already taken.
*/
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1495,7 +1493,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmx->host_debugctlmsr = get_debugctlmsr();
}
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
vmx_vcpu_pi_put(vcpu);
@@ -1554,7 +1552,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
vmx->emulation_required = vmx_emulation_required(vcpu);
}
-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
{
return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
}
@@ -1660,8 +1658,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}
-static int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
- void *insn, int insn_len)
+int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len)
{
/*
* Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1745,7 +1743,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
* Recognizes a pending MTF VM-exit and records the nested state for later
* delivery.
*/
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1776,7 +1774,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
}
}
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
vmx_update_emulated_instruction(vcpu);
return skip_emulated_instruction(vcpu);
@@ -1795,7 +1793,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
}
-static void vmx_inject_exception(struct kvm_vcpu *vcpu)
+void vmx_inject_exception(struct kvm_vcpu *vcpu)
{
struct kvm_queued_exception *ex = &vcpu->arch.exception;
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
@@ -1916,12 +1914,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
return kvm_caps.default_tsc_scaling_ratio;
}
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
}
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu)
{
vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
}
@@ -1964,7 +1962,7 @@ static inline bool is_vmx_feature_control_msr_valid(struct vcpu_vmx *vmx,
return !(msr->data & ~valid_bits);
}
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
{
switch (msr->index) {
case KVM_FIRST_EMULATED_VMX_MSR ... KVM_LAST_EMULATED_VMX_MSR:
@@ -1981,7 +1979,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2162,7 +2160,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
* Returns 0 on success, non-0 otherwise.
* Assumes vcpu_load() was already called.
*/
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmx_uret_msr *msr;
@@ -2465,7 +2463,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return ret;
}
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
{
unsigned long guest_owned_bits;
@@ -2770,7 +2768,7 @@ static bool kvm_is_vmx_supported(void)
return supported;
}
-static int vmx_check_processor_compat(void)
+int vmx_check_processor_compat(void)
{
int cpu = raw_smp_processor_id();
struct vmcs_config vmcs_conf;
@@ -2812,7 +2810,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
return -EFAULT;
}
-static int vmx_hardware_enable(void)
+int vmx_hardware_enable(void)
{
int cpu = raw_smp_processor_id();
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2852,7 +2850,7 @@ static void vmclear_local_loaded_vmcss(void)
__loaded_vmcs_clear(v);
}
-static void vmx_hardware_disable(void)
+void vmx_hardware_disable(void)
{
vmclear_local_loaded_vmcss();
@@ -3166,7 +3164,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
#endif
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3196,7 +3194,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
return to_vmx(vcpu)->vpid;
}
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
u64 root_hpa = mmu->root.hpa;
@@ -3212,7 +3210,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
/*
* vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -3221,7 +3219,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
}
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
/*
* vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3266,7 +3264,7 @@ void ept_save_pdptrs(struct kvm_vcpu *vcpu)
#define CR3_EXITING_BITS (CPU_BASED_CR3_LOAD_EXITING | \
CPU_BASED_CR3_STORE_EXITING)
-static bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
if (is_guest_mode(vcpu))
return nested_guest_cr0_valid(vcpu, cr0);
@@ -3387,8 +3385,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
return eptp;
}
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
- int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
{
struct kvm *kvm = vcpu->kvm;
bool update_guest_cr3 = true;
@@ -3417,8 +3414,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmcs_writel(GUEST_CR3, guest_cr3);
}
-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
/*
* We operate under the default treatment of SMM, so VMX cannot be
@@ -3534,7 +3530,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
var->g = (ar >> 15) & 1;
}
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
{
struct kvm_segment s;
@@ -3611,14 +3607,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
}
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
{
__vmx_set_segment(vcpu, var, seg);
to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
}
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
@@ -3626,25 +3622,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
*l = (ar >> 13) & 1;
}
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
dt->address = vmcs_readl(GUEST_IDTR_BASE);
}
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
vmcs_writel(GUEST_IDTR_BASE, dt->address);
}
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
dt->address = vmcs_readl(GUEST_GDTR_BASE);
}
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -4116,7 +4112,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
}
}
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
void *vapic_page;
@@ -4136,7 +4132,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
return ((rvi & 0xf0) > (vppr & 0xf0));
}
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 i;
@@ -4277,8 +4273,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
return 0;
}
-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
- int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
{
struct kvm_vcpu *vcpu = apic->vcpu;
@@ -4440,7 +4436,7 @@ static u32 vmx_vmexit_ctrl(void)
~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
}
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4705,7 +4701,7 @@ static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
return 0;
}
-static int vmx_vcpu_precreate(struct kvm *kvm)
+int vmx_vcpu_precreate(struct kvm *kvm)
{
return vmx_alloc_ipiv_pid_table(kvm);
}
@@ -4892,7 +4888,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx->pi_desc.sn = 1;
}
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4951,12 +4947,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_update_fb_clear_dis(vcpu, vmx);
}
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
{
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
}
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
{
if (!enable_vnmi ||
vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4967,7 +4963,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
}
-static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
uint32_t intr;
@@ -4995,7 +4991,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
vmx_clear_hlt(vcpu);
}
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5073,7 +5069,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5095,7 +5091,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
return -EBUSY;
@@ -5110,7 +5106,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !vmx_interrupt_blocked(vcpu);
}
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
{
void __user *ret;
@@ -5130,7 +5126,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
return init_rmode_tss(kvm, ret);
}
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
{
to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
return 0;
@@ -5422,8 +5418,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
return kvm_fast_pio(vcpu, size, port, in);
}
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
{
/*
* Patch in the VMCALL instruction:
@@ -5632,7 +5627,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
return kvm_complete_insn_gp(vcpu, err);
}
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
{
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
@@ -5651,7 +5646,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
set_debugreg(DR6_RESERVED, 6);
}
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
{
vmcs_writel(GUEST_DR7, val);
}
@@ -5922,7 +5917,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 1;
}
-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
{
if (vmx_emulation_required_with_pending_exception(vcpu)) {
kvm_prepare_emulation_failure_exit(vcpu);
@@ -6186,9 +6181,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
- u64 *info1, u64 *info2,
- u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6643,7 +6637,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 0;
}
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
int ret = __vmx_handle_exit(vcpu, exit_fastpath);
@@ -6731,7 +6725,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
: "eax", "ebx", "ecx", "edx");
}
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
int tpr_threshold;
@@ -6801,7 +6795,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
vmx_update_msr_bitmap_x2apic(vcpu);
}
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
{
const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
struct kvm *kvm = vcpu->kvm;
@@ -6870,7 +6864,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
kvm_release_pfn_clean(pfn);
}
-static void vmx_hwapic_isr_update(int max_isr)
+void vmx_hwapic_isr_update(int max_isr)
{
u16 status;
u8 old;
@@ -6904,7 +6898,7 @@ static void vmx_set_rvi(int vector)
}
}
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
{
/*
* When running L2, updating RVI is only relevant when
@@ -6918,7 +6912,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
vmx_set_rvi(max_irr);
}
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -6964,7 +6958,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
return;
@@ -6975,7 +6969,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-static void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7038,7 +7032,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
}
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7055,7 +7049,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
* The kvm parameter can be NULL (module initialization, or invocation before
* VM creation). Be sure to check the kvm parameter before using it.
*/
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
{
switch (index) {
case MSR_IA32_SMBASE:
@@ -7178,7 +7172,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
IDT_VECTORING_ERROR_CODE);
}
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
{
__vmx_complete_interrupts(vcpu,
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -7333,7 +7327,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
guest_state_exit_irqoff();
}
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4;
@@ -7489,7 +7483,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return vmx_exit_handlers_fastpath(vcpu);
}
-static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7502,7 +7496,7 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
free_page((unsigned long)vmx->ve_info);
}
-static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
{
struct vmx_uret_msr *tsx_ctrl;
struct vcpu_vmx *vmx;
@@ -7611,7 +7605,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
#define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
#define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
{
if (!ple_gap)
kvm->arch.pause_in_guest = true;
@@ -7642,7 +7636,7 @@ static int vmx_vm_init(struct kvm *kvm)
return 0;
}
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7814,7 +7808,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7968,7 +7962,7 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
}
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->req_immediate_exit = true;
}
@@ -8007,10 +8001,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
}
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
- struct x86_instruction_info *info,
- enum x86_intercept_stage stage,
- struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -8090,8 +8084,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
return 0;
}
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
- bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired)
{
struct vcpu_vmx *vmx;
u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -8130,13 +8124,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
return 0;
}
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
{
to_vmx(vcpu)->hv_deadline_tsc = -1;
}
#endif
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
@@ -8165,7 +8159,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
}
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
{
if (vcpu->arch.mcg_cap & MCG_LMCE_P)
to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -8176,7 +8170,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
}
#ifdef CONFIG_KVM_SMM
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -8184,7 +8178,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
return !is_smm(vcpu);
}
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -8205,7 +8199,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
return 0;
}
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int ret;
@@ -8226,18 +8220,18 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
return 0;
}
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
{
/* RSM will cause a vmexit anyway. */
}
#endif
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
{
return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
}
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
{
if (is_guest_mode(vcpu)) {
struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -8247,7 +8241,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
}
}
-static void vmx_hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
{
kvm_set_posted_intr_wakeup_handler(NULL);
@@ -8257,18 +8251,7 @@ static void vmx_hardware_unsetup(void)
free_kvm_area();
}
-#define VMX_REQUIRED_APICV_INHIBITS \
-( \
- BIT(APICV_INHIBIT_REASON_DISABLE)| \
- BIT(APICV_INHIBIT_REASON_ABSENT) | \
- BIT(APICV_INHIBIT_REASON_HYPERV) | \
- BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
- BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
- BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) \
-)
-
-static void vmx_vm_destroy(struct kvm *kvm)
+void vmx_vm_destroy(struct kvm *kvm)
{
struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
@@ -8319,150 +8302,6 @@ gva_t vmx_get_untagged_addr(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags
return (sign_extend64(gva, lam_bit) & ~BIT_ULL(63)) | (gva & BIT_ULL(63));
}
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
- .name = KBUILD_MODNAME,
-
- .check_processor_compatibility = vmx_check_processor_compat,
-
- .hardware_unsetup = vmx_hardware_unsetup,
-
- .hardware_enable = vmx_hardware_enable,
- .hardware_disable = vmx_hardware_disable,
- .has_emulated_msr = vmx_has_emulated_msr,
-
- .vm_size = sizeof(struct kvm_vmx),
- .vm_init = vmx_vm_init,
- .vm_destroy = vmx_vm_destroy,
-
- .vcpu_precreate = vmx_vcpu_precreate,
- .vcpu_create = vmx_vcpu_create,
- .vcpu_free = vmx_vcpu_free,
- .vcpu_reset = vmx_vcpu_reset,
-
- .prepare_switch_to_guest = vmx_prepare_switch_to_guest,
- .vcpu_load = vmx_vcpu_load,
- .vcpu_put = vmx_vcpu_put,
-
- .update_exception_bitmap = vmx_update_exception_bitmap,
- .get_msr_feature = vmx_get_msr_feature,
- .get_msr = vmx_get_msr,
- .set_msr = vmx_set_msr,
- .get_segment_base = vmx_get_segment_base,
- .get_segment = vmx_get_segment,
- .set_segment = vmx_set_segment,
- .get_cpl = vmx_get_cpl,
- .get_cs_db_l_bits = vmx_get_cs_db_l_bits,
- .is_valid_cr0 = vmx_is_valid_cr0,
- .set_cr0 = vmx_set_cr0,
- .is_valid_cr4 = vmx_is_valid_cr4,
- .set_cr4 = vmx_set_cr4,
- .set_efer = vmx_set_efer,
- .get_idt = vmx_get_idt,
- .set_idt = vmx_set_idt,
- .get_gdt = vmx_get_gdt,
- .set_gdt = vmx_set_gdt,
- .set_dr7 = vmx_set_dr7,
- .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
- .cache_reg = vmx_cache_reg,
- .get_rflags = vmx_get_rflags,
- .set_rflags = vmx_set_rflags,
- .get_if_flag = vmx_get_if_flag,
-
- .flush_tlb_all = vmx_flush_tlb_all,
- .flush_tlb_current = vmx_flush_tlb_current,
- .flush_tlb_gva = vmx_flush_tlb_gva,
- .flush_tlb_guest = vmx_flush_tlb_guest,
-
- .vcpu_pre_run = vmx_vcpu_pre_run,
- .vcpu_run = vmx_vcpu_run,
- .handle_exit = vmx_handle_exit,
- .skip_emulated_instruction = vmx_skip_emulated_instruction,
- .update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
- .patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
- .inject_nmi = vmx_inject_nmi,
- .inject_exception = vmx_inject_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
- .update_cr8_intercept = vmx_update_cr8_intercept,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
- .load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
- .required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
- .guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
- .dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
- .set_tss_addr = vmx_set_tss_addr,
- .set_identity_map_addr = vmx_set_identity_map_addr,
- .get_mt_mask = vmx_get_mt_mask,
-
- .get_exit_info = vmx_get_exit_info,
-
- .vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
- .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
- .get_l2_tsc_offset = vmx_get_l2_tsc_offset,
- .get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
- .write_tsc_offset = vmx_write_tsc_offset,
- .write_tsc_multiplier = vmx_write_tsc_multiplier,
-
- .load_mmu_pgd = vmx_load_mmu_pgd,
-
- .check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
-
- .request_immediate_exit = vmx_request_immediate_exit,
-
- .sched_in = vmx_sched_in,
-
- .cpu_dirty_log_size = PML_ENTITY_NUM,
- .update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
- .nested_ops = &vmx_nested_ops,
-
- .pi_update_irte = vmx_pi_update_irte,
- .pi_start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
- .set_hv_timer = vmx_set_hv_timer,
- .cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
- .setup_mce = vmx_setup_mce,
-
-#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
-#endif
-
- .check_emulate_instruction = vmx_check_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
- .migrate_timers = vmx_migrate_timers,
-
- .msr_filter_changed = vmx_msr_filter_changed,
- .complete_emulated_msr = kvm_complete_insn_gp,
-
- .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-
- .get_untagged_addr = vmx_get_untagged_addr,
-};
-
static unsigned int vmx_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8528,9 +8367,7 @@ static void __init vmx_setup_me_spte_mask(void)
kvm_mmu_set_me_spte_mask(0, me_mask);
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
{
unsigned long host_bndcfgs;
struct desc_ptr dt;
@@ -8599,16 +8436,16 @@ static __init int hardware_setup(void)
* using the APIC_ACCESS_ADDR VMCS field.
*/
if (!flexpriority_enabled)
- vmx_x86_ops.set_apic_access_page_addr = NULL;
+ vt_x86_ops.set_apic_access_page_addr = NULL;
if (!cpu_has_vmx_tpr_shadow())
- vmx_x86_ops.update_cr8_intercept = NULL;
+ vt_x86_ops.update_cr8_intercept = NULL;
#if IS_ENABLED(CONFIG_HYPERV)
if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
&& enable_ept) {
- vmx_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
- vmx_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
+ vt_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
+ vt_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
}
#endif
@@ -8623,7 +8460,7 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_apicv())
enable_apicv = 0;
if (!enable_apicv)
- vmx_x86_ops.sync_pir_to_irr = NULL;
+ vt_x86_ops.sync_pir_to_irr = NULL;
if (!enable_apicv || !cpu_has_vmx_ipiv())
enable_ipiv = false;
@@ -8659,7 +8496,7 @@ static __init int hardware_setup(void)
enable_pml = 0;
if (!enable_pml)
- vmx_x86_ops.cpu_dirty_log_size = 0;
+ vt_x86_ops.cpu_dirty_log_size = 0;
if (!cpu_has_vmx_preemption_timer())
enable_preemption_timer = false;
@@ -8684,9 +8521,9 @@ static __init int hardware_setup(void)
}
if (!enable_preemption_timer) {
- vmx_x86_ops.set_hv_timer = NULL;
- vmx_x86_ops.cancel_hv_timer = NULL;
- vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+ vt_x86_ops.set_hv_timer = NULL;
+ vt_x86_ops.cancel_hv_timer = NULL;
+ vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
}
kvm_caps.supported_mce_cap |= MCG_LMCE_P;
@@ -8697,9 +8534,9 @@ static __init int hardware_setup(void)
if (!enable_ept || !enable_pmu || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
if (pt_mode == PT_MODE_HOST_GUEST)
- vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+ vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
else
- vmx_init_ops.handle_intel_pt_intr = NULL;
+ vt_init_ops.handle_intel_pt_intr = NULL;
setup_default_sgx_lepubkeyhash();
@@ -8722,14 +8559,6 @@ static __init int hardware_setup(void)
return r;
}
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
- .hardware_setup = hardware_setup,
- .handle_intel_pt_intr = NULL,
-
- .runtime_ops = &vmx_x86_ops,
- .pmu_ops = &intel_pmu_ops,
-};
-
static void vmx_cleanup_l1d_flush(void)
{
if (vmx_l1d_flush_pages) {
@@ -8771,7 +8600,7 @@ static int __init vmx_init(void)
*/
hv_init_evmcs();
- r = kvm_x86_vendor_init(&vmx_init_ops);
+ r = kvm_x86_vendor_init(&vt_init_ops);
if (r)
return r;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..4bdb8f33b258
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include "x86.h"
+
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_check_processor_compat(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+int vmx_vm_init(struct kvm *kvm);
+void vmx_vm_destroy(struct kvm *kvm);
+int vmx_vcpu_precreate(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+#ifdef CONFIG_KVM_SMM
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+#endif
+int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+ void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage,
+ struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+bool vmx_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_inject_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+ u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+ bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
--
2.39.0
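The point of routing all of vmx_x86_ops through main.c is that a later TDX
patch can swap a direct vmx_* callback for a small dispatcher without
touching vmx.c again. A minimal sketch of that pattern, with hypothetical
tdx_* and is_td_vcpu() placeholders rather than code from any posted patch:

    /*
     * Sketch only: vt_x86_ops.vcpu_run could later point at a wrapper
     * like this instead of directly at vmx_vcpu_run().
     */
    static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
    {
            if (is_td_vcpu(vcpu))           /* hypothetical TDX check */
                    return tdx_vcpu_run(vcpu);      /* hypothetical */

            return vmx_vcpu_run(vcpu);
    }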
From: Isaku Yamahata <[email protected]>
To let TDX and VMX share the same shadow_mmio_mask and shadow_present_mask,
add the Suppress-VE bit to both masks. TDX requires the masks to include
VMX_EPT_SUPPRESS_VE_BIT for shared GPAs so that an EPT violation is
triggered when a shared GPA is accessed. For VMX, the bit doesn't matter
for MMIO because the SPTE value is defined so as to cause an EPT
misconfiguration.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/mmu/spte.c | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..76ed39541a52 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -513,6 +513,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..02a466de2991 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -429,7 +429,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
/*
* EPT overrides the host MTRRs, and so KVM must program the desired
* memtype directly into the SPTEs. Note, this mask is just the mask
@@ -446,7 +448,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
* of an EPT paging-structure entry is 110b (write/execute).
*/
kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
+ VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
--
2.39.0
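Since the MMIO mask and value are easy to get wrong, here is a sketch of
the check implied by the new kvm_mmu_set_mmio_spte_mask() arguments (an
illustration, not actual KVM code): the mask now includes the Suppress-VE
bit while the value leaves it clear, so a TDX MMIO SPTE can deliver #VE,
and VMX still relies on the write+execute-without-read misconfiguration.

    /* Sketch only, loosely mirroring is_mmio_spte() with the EPT constants. */
    static inline bool ept_spte_is_mmio(u64 spte)
    {
            u64 mask  = VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT;
            u64 value = VMX_EPT_MISCONFIG_WX_VALUE;

            return (spte & mask) == value;
    }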
SEV-SNP defines PFERR_GUEST_ENC_MASK (bit 34) in the page-fault error code
to indicate that the guest page is encrypted. Use the bit to designate that
the page fault is private and therefore requires looking up memory
attributes. The vendor KVM page fault handler should set the
PFERR_GUEST_ENC_MASK bit based on its fault information; it may use the
hardware value directly or derive the bit from it.
Based on a patch by Isaku Yamahata.
Signed-off-by: Paolo Bonzini <[email protected]>
---
Compared to what is in the Intel TDX tree, I am dropping the
if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
return RET_PF_RETRY;
change in __kvm_faultin_pfn(). It is not well documented why it
is needed and selftests seem to pass.
Also, checking has_private_mem is needed so as not to break SEV-ES.
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 9 +++++++++
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
3 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24e30ca2ca8f..7de8a3f2a118 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -264,6 +264,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_BIT 15
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
+#define PFERR_GUEST_ENC_BIT 34
#define PFERR_IMPLICIT_ACCESS_BIT 48
#define PFERR_PRESENT_MASK BIT(PFERR_PRESENT_BIT)
@@ -275,6 +276,7 @@ enum x86_intercept_stage;
#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
+#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
#define PFERR_IMPLICIT_ACCESS BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c9890e5b6e4c..6b4cb71668df 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5846,6 +5846,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
{
int r, emulation_type = EMULTYPE_PF;
bool direct = vcpu->arch.mmu->root_role.direct;
+ struct kvm *kvm = vcpu->kvm;
/*
* IMPLICIT_ACCESS is a KVM-defined flag used to correctly perform SMAP
@@ -5861,6 +5862,14 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
return RET_PF_RETRY;
+ /*
+ * There is no vendor code that can set PFERR_GUEST_ENC_MASK for
+ * software-protected VMs. Compute it here.
+ */
+ if (kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(kvm, cr2_or_gpa >> PAGE_SHIFT))
+ error_code |= PFERR_GUEST_ENC_MASK;
+
r = RET_PF_INVALID;
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 21f55e8b4dc6..154aa44eeb33 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -290,6 +290,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.present = err & PFERR_PRESENT_MASK,
.rsvd = err & PFERR_RSVD_MASK,
.user = err & PFERR_USER_MASK,
+ .is_private = vcpu->kvm->arch.has_private_mem && (err & PFERR_GUEST_ENC_MASK),
.prefetch = prefetch,
.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
.nx_huge_page_workaround_enabled =
@@ -298,7 +299,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
};
int r;
--
2.39.0
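As a concrete example of the "vendor sets the bit" expectation above, a
hedged sketch of how an SNP-aware #NPF path might build the error code; the
helper name and the private_access predicate are placeholders, not the
actual SVM code:

    static u64 snp_build_fault_error_code(u64 hw_error_code, bool private_access)
    {
            u64 error_code = hw_error_code;

            /*
             * SNP hardware already reports bit 34; other vendor code may
             * instead synthesize it from its own fault metadata.
             */
            if (private_access)
                    error_code |= PFERR_GUEST_ENC_MASK;

            return error_code;
    }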
From: Sean Christopherson <[email protected]>
For a TD guest, the current way of emulating MMIO no longer works, as KVM
cannot access the TD guest's private memory to perform the emulation.
Instead, the TD guest expects to receive a #VE when it accesses MMIO, and
it then makes an explicit hypercall to KVM to get the expected information.
To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS controls. Accordingly, for the MMIO SPTE for a shared GPA:
1. KVM needs to set the "suppress #VE" bit in the non-present SPTE so that
an EPT violation occurs when the TD accesses the MMIO range.
2. On the EPT violation, KVM installs the MMIO SPTE with the "suppress #VE"
bit clear, so the TD guest receives a #VE instead of an EPT
misconfiguration, unlike the VMX case.
For a shared GPA that is not yet populated, an EPT violation needs to be
triggered when the TD guest accesses it, so the non-present SPTE value for
shared GPAs must have the "suppress #VE" bit set.
Add the "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Set the bit unconditionally for both AMD and Intel because:
1) AMD hardware ignores this bit when the present bit is off; 2) for a
normal VMX guest, KVM never enables "EPT-violation #VE" in the VMCS
controls, and the "suppress #VE" bit is then ignored by hardware.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/spte.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 4d1799ba2bf8..26bc95bbc962 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -149,7 +149,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+/*
+ * Non-present SPTE value for both VMX and SVM for TDP MMU.
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
+ * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
+ * For TDX:
+ * TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
@@ -196,7 +209,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
--
2.39.0
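One consequence worth spelling out: with a non-zero SHADOW_NONPRESENT_VALUE,
any code that used to clear an SPTE to 0 must now write that value instead.
A sketch of the idea (simplified, not the actual TDP MMU code):

    /* Sketch only: "clearing" an SPTE keeps the Suppress-VE bit set. */
    static void clear_spte(u64 *sptep)
    {
            WRITE_ONCE(*sptep, SHADOW_NONPRESENT_VALUE);
    }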
To make it possible to redo kvm_gmem_get_uninit_pfn, a hole must be punched
in the filemap so that FGP_CREAT_ONLY can succeed again. This will be used
whenever an operation that follows kvm_gmem_get_uninit_pfn fails.
Signed-off-by: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 7 +++++++
virt/kvm/guest_memfd.c | 28 ++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 03bf616b7308..192c58116220 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2436,6 +2436,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+int kvm_gmem_undo_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int order);
#else
static inline int kvm_gmem_get_pfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2452,6 +2454,11 @@ static inline int kvm_gmem_get_uninit_pfn(struct kvm *kvm,
KVM_BUG_ON(1, kvm);
return -EIO;
}
+
+static inline int kvm_gmem_undo_get_pfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int order)
+{ return -EIO; }
#endif /* CONFIG_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7ec7afafc960..535ef1aa34fb 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -590,3 +590,31 @@ int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, false);
}
EXPORT_SYMBOL_GPL(kvm_gmem_get_uninit_pfn);
+
+int kvm_gmem_undo_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int order)
+{
+ pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+ struct kvm_gmem *gmem;
+ struct file *file;
+ int r;
+
+ file = kvm_gmem_get_file(slot);
+ if (!file)
+ return -EFAULT;
+
+ gmem = file->private_data;
+
+ if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
+ r = -EIO;
+ goto out_fput;
+ }
+
+ r = kvm_gmem_punch_hole(file_inode(file), index << PAGE_SHIFT, PAGE_SIZE << order);
+
+out_fput:
+ fput(file);
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_undo_get_pfn);
--
2.39.0
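For reference, a hedged sketch of the caller pattern this API is meant for;
vendor_encrypt_page() is a hypothetical stand-in for whatever operation
follows kvm_gmem_get_uninit_pfn(), and page-reference handling is omitted:

    static int example_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
                                gfn_t gfn)
    {
            kvm_pfn_t pfn;
            int max_order, r;

            r = kvm_gmem_get_uninit_pfn(kvm, slot, gfn, &pfn, &max_order);
            if (r)
                    return r;

            r = vendor_encrypt_page(kvm, gfn, pfn);     /* hypothetical */
            if (r)
                    /* Punch the page back out so the ioctl can be retried. */
                    kvm_gmem_undo_get_pfn(kvm, slot, gfn, 0);

            return r;
    }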
From: Sean Christopherson <[email protected]>
TDX uses a different ABI to get information about a VM exit. Pass intr_info
to the NMI and INTR handlers instead of pulling it from vcpu_vmx, in
preparation for sharing the bulk of the handlers with TDX.
When the guest TD exits to the VMM, RAX holds the status and exit reason,
RCX holds the exit qualification, etc., rather than the VMCS fields,
because the VMM doesn't have access to the VMCS. The eventual code will be:
VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)
TDX:
- get exit reason, intr_info, exit_qualification, and etc from guest
registers
- call NMI/INTR handlers (common code)
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3d8a7e4c8e37..8aedfe0fd78c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7000,24 +7000,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}
-static void handle_exception_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
-
/* if exit due to PF check for async PF */
if (is_page_fault(intr_info))
- vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
/* if exit due to NM, handle before interrupts are enabled */
else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(&vmx->vcpu);
+ handle_nm_fault_irqoff(vcpu);
/* Handle machine checks before interrupts are enabled */
else if (is_machine_check(intr_info))
kvm_machine_check();
}
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
{
- u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
gate_desc *desc = (gate_desc *)host_idt_base + vector;
@@ -7040,9 +7038,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;
if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu);
+ handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vmx);
+ handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}
/*
--
2.39.0
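To illustrate the eventual structure sketched in the changelog, here is a
hypothetical TDX-side caller; tdexit_intr_info() and tdexit_exit_reason()
are placeholders for reading the guest-register ABI, and the two irqoff
handlers would need to be made reachable from TDX code:

    static void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
    {
            u32 intr_info = tdexit_intr_info(vcpu);     /* hypothetical */

            if (tdexit_exit_reason(vcpu) == EXIT_REASON_EXCEPTION_NMI)
                    handle_exception_irqoff(vcpu, intr_info);
            else if (tdexit_exit_reason(vcpu) == EXIT_REASON_EXTERNAL_INTERRUPT)
                    handle_external_interrupt_irqoff(vcpu, intr_info);
    }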
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> This is a first set of, hopefully non-controversial patches from the
Heh, you jinxed yourself. :-)
> SNP and TDX series. They cover mostly changes to generic code and new
> gmem APIs, and in general have already been reviewed when posted by
> Isaku and Michael.
>
> One important change is that the gmem hook for initializing memory
> is designed to return -EEXIST if the page already exists in the
> guestmemfd filemap. The idea is that the special case of
> KVM_SEV_SNP_LAUNCH_UPDATE, where __kvm_gmem_get_pfn() is used to
> return an uninitialized page and make it guest-owned, can be be done at
> most once per page unless the ioctl fails.
>
> Of course these patches add a bunch of dead code. This is intentional
> because it's the only way to trim the large TDX (and to some extent SNP)
> series to the point that it's possible to discuss them. The next step is
> probably going to be the private<->shared page logic from the TDX series.
>
> Paolo
>
> Isaku Yamahata (5):
> KVM: x86/mmu: Add Suppress VE bit to EPT
> shadow_mmio_mask/shadow_present_mask
> KVM: VMX: Introduce test mode related to EPT violation VE
> KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
> allocation
> KVM: x86/tdp_mmu: Sprinkle __must_check
> KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
I have a slight tweak to this patch (drop truncation), and a rewritten changelog.
> Michael Roth (2):
> KVM: x86: Add gmem hook for invalidating memory
> KVM: x86: Add gmem hook for determining max NPT mapping level
>
> Paolo Bonzini (6):
> KVM: x86/mmu: pass error code back to MMU when async pf is ready
> KVM: x86/mmu: Use PFERR_GUEST_ENC_MASK to indicate fault is private
This doesn't work. The ENC flag gets set on any SNP *capable* CPU, which results
in false positives for SEV and SEV-ES guests[*].
I have a medium-sized series to add a KVM-defined synthetic flag, and clean up
the related code (it also has my slight variation on the 64-bit error code patch).
I'll post my series exactly as I have it, mostly so that I don't need to redo
testing, but also because it's pretty much a drop-in replacement. This series
applies cleanly on top, except for the two obvious conflicts.
[*] https://lore.kernel.org/all/[email protected]
> KVM: guest_memfd: pass error up from filemap_grab_folio
> filemap: add FGP_CREAT_ONLY
> KVM: x86: Add gmem hook for initializing memory
> KVM: guest_memfd: add API to undo kvm_gmem_get_uninit_pfn
>
> Sean Christopherson (7):
> KVM: x86: Split core of hypercall emulation to helper function
> KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
> values
> KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
> KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
> KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
> SPTE
> KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> argument
>
> Tom Lendacky (1):
> KVM: SEV: Use a VMSA physical address variable for populating VMCB
>
> arch/x86/include/asm/kvm-x86-ops.h | 3 +
> arch/x86/include/asm/kvm_host.h | 12 +
> arch/x86/include/asm/vmx.h | 13 +
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 55 ++--
> arch/x86/kvm/mmu/mmu_internal.h | 6 +-
> arch/x86/kvm/mmu/mmutrace.h | 2 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 4 +-
> arch/x86/kvm/mmu/spte.c | 16 +-
> arch/x86/kvm/mmu/spte.h | 21 +-
> arch/x86/kvm/mmu/tdp_iter.h | 12 +
> arch/x86/kvm/mmu/tdp_mmu.c | 74 +++--
> arch/x86/kvm/svm/sev.c | 3 +-
> arch/x86/kvm/svm/svm.c | 9 +-
> arch/x86/kvm/svm/svm.h | 1 +
> arch/x86/kvm/vmx/main.c | 168 +++++++++++
> arch/x86/kvm/vmx/vmcs.h | 5 +
> arch/x86/kvm/vmx/vmx.c | 460 +++++++++++------------------
> arch/x86/kvm/vmx/vmx.h | 6 +-
> arch/x86/kvm/vmx/x86_ops.h | 124 ++++++++
> arch/x86/kvm/x86.c | 69 +++--
> include/linux/kvm_host.h | 25 ++
> include/linux/kvm_types.h | 1 +
> include/linux/pagemap.h | 2 +
> mm/filemap.c | 4 +
> virt/kvm/Kconfig | 8 +
> virt/kvm/guest_memfd.c | 120 +++++++-
> virt/kvm/kvm_main.c | 16 +-
> 29 files changed, 855 insertions(+), 387 deletions(-)
> create mode 100644 arch/x86/kvm/vmx/main.c
> create mode 100644 arch/x86/kvm/vmx/x86_ops.h
>
> --
> 2.39.0
>
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
> suppress #VE bit in EPT entries selectively, in order to be able to trap
> non-present conditions. However, #VE isn't used for VMX and it's a bug
> if it happens. To be defensive and test that VMX case isn't broken
> introduce an option ept_violation_ve_test and when it's set, BUG the vm.
This needs to be two patches:
1. Add the architecture #defines, enums, structures, and is_ve_fault().
2. Add the forced #VE enabling test code
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 12 +++++++
> arch/x86/kvm/vmx/vmcs.h | 5 +++
> arch/x86/kvm/vmx/vmx.c | 69 +++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 6 +++-
> 4 files changed, 90 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 76ed39541a52..f703bae0c4ac 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -70,6 +70,7 @@
> #define SECONDARY_EXEC_ENCLS_EXITING VMCS_CONTROL_BIT(ENCLS_EXITING)
> #define SECONDARY_EXEC_RDSEED_EXITING VMCS_CONTROL_BIT(RDSEED_EXITING)
> #define SECONDARY_EXEC_ENABLE_PML VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
> +#define SECONDARY_EXEC_EPT_VIOLATION_VE VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
> #define SECONDARY_EXEC_PT_CONCEAL_VMX VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
> #define SECONDARY_EXEC_ENABLE_XSAVES VMCS_CONTROL_BIT(XSAVES)
> #define SECONDARY_EXEC_MODE_BASED_EPT_EXEC VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
> @@ -225,6 +226,8 @@ enum vmcs_field {
> VMREAD_BITMAP_HIGH = 0x00002027,
> VMWRITE_BITMAP = 0x00002028,
> VMWRITE_BITMAP_HIGH = 0x00002029,
> + VE_INFORMATION_ADDRESS = 0x0000202A,
> + VE_INFORMATION_ADDRESS_HIGH = 0x0000202B,
> XSS_EXIT_BITMAP = 0x0000202C,
> XSS_EXIT_BITMAP_HIGH = 0x0000202D,
> ENCLS_EXITING_BITMAP = 0x0000202E,
> @@ -630,4 +633,13 @@ enum vmx_l1d_flush_state {
>
> extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
>
> +struct vmx_ve_information {
> + u32 exit_reason;
> + u32 delivery;
> + u64 exit_qualification;
> + u64 guest_linear_address;
> + u64 guest_physical_address;
> + u16 eptp_index;
> +};
Should this be __packed since it's hardware-defined, or are we ok relying on the
compiler to not be stupid?
> #endif
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index 7c1996b433e2..b25625314658 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
> return is_exception_n(intr_info, NM_VECTOR);
> }
>
> +static inline bool is_ve_fault(u32 intr_info)
> +{
> + return is_exception_n(intr_info, VE_VECTOR);
> +}
> +
> /* Undocumented: icebp/int1 */
> static inline bool is_icebp(u32 intr_info)
> {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9239a89dea22..6468f421ba9e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -126,6 +126,9 @@ module_param(error_on_inconsistent_vmcs_config, bool, 0444);
> static bool __read_mostly dump_invalid_vmcs = 0;
> module_param(dump_invalid_vmcs, bool, 0644);
>
> +static bool __read_mostly ept_violation_ve_test;
> +module_param(ept_violation_ve_test, bool, 0444);
I would much prefer to enable #VE if CONFIG_KVM_PROVE_MMU=y. We already have
too many module params to deal with for testing, and practically speaking the
only people who will ever turn this on are the same people that run with
CONFIG_KVM_PROVE_MMU=y.
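A minimal sketch of that alternative, dropping the module_param() and keeping the
flag name from the patch purely for illustration:

	#ifdef CONFIG_KVM_PROVE_MMU
	static bool __read_mostly ept_violation_ve_test = true;
	#else
	static bool __read_mostly ept_violation_ve_test;
	#endif

Everything else in the patch could then stay keyed off ept_violation_ve_test as-is.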
> #define MSR_BITMAP_MODE_X2APIC 1
> #define MSR_BITMAP_MODE_X2APIC_APICV 2
>
> @@ -868,6 +871,12 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
>
> eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
> (1u << DB_VECTOR) | (1u << AC_VECTOR);
> + /*
> + * #VE isn't used for VMX. To test against unexpected changes
> + * related to #VE for VMX, intercept unexpected #VE and warn on it.
> + */
> + if (ept_violation_ve_test)
> + eb |= 1u << VE_VECTOR;
> /*
> * Guest access to VMware backdoor ports could legitimately
> * trigger #GP because of TSS I/O permission bitmap.
> @@ -2603,6 +2613,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> &_cpu_based_2nd_exec_control))
> return -EIO;
> }
> + if (!ept_violation_ve_test)
> + _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
_If_ we add a module param, the param needs to be disabled if #VE isn't supported.
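E.g., something along these lines in setup_vmcs_config(), after the secondary
controls have been adjusted (sketch only, assuming the param survives at all):

	if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE))
		ept_violation_ve_test = false;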
> #ifndef CONFIG_X86_64
> if (!(_cpu_based_2nd_exec_control &
> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
> @@ -2627,6 +2640,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> return -EIO;
>
> vmx_cap->ept = 0;
> + _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> }
> if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
> vmx_cap->vpid) {
> @@ -4592,6 +4606,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
> if (!enable_ept) {
> exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
> + exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
> enable_unrestricted_guest = 0;
> }
> if (!enable_unrestricted_guest)
> @@ -4715,8 +4730,40 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>
> exec_controls_set(vmx, vmx_exec_control(vmx));
>
> - if (cpu_has_secondary_exec_ctrls())
> + if (cpu_has_secondary_exec_ctrls()) {
> secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
> + if (secondary_exec_controls_get(vmx) &
> + SECONDARY_EXEC_EPT_VIOLATION_VE) {
> + if (!vmx->ve_info) {
> + /* ve_info must be page aligned. */
> + struct page *page;
> +
> + BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
> + page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + if (page)
> + vmx->ve_info = page_to_virt(page);
> + }
> + if (vmx->ve_info) {
> + /*
> + * Allow #VE delivery. CPU sets this field to
> + * 0xFFFFFFFF on #VE delivery. Another #VE can
> + * occur only if software clears the field.
> + */
> + vmx->ve_info->delivery = 0;
This is completely unnecessary, the entire page is zero-allocated.
> + vmcs_write64(VE_INFORMATION_ADDRESS,
> + __pa(vmx->ve_info));
> + } else {
> + /*
> + * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
> + * used only when ept_violation_ve_test is true,
> + * it's okay to go with the bit disabled.
No, it's not. This is silly on multiple fronts. (a) KVM knows if it's going to
enable #VE when the vCPU is first created, the allocation can and should be done
at that time along with all the other allocations needed for the VM. (b) Except
for error injection from syzkaller and friends, there is basically zero chance
the VM will live on if one 4KiB allocation fails. (c) This will never be enabled
in production; it's totally fine if the vCPU creation fails during testing, because
as above, that will practically never happen outside of deliberate error
injection.
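A rough sketch of (a), with the allocation done in vmx_vcpu_create() and a failure
simply failing vCPU creation (the err variable and error label are assumed to match
the existing function):

	if (ept_violation_ve_test) {
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

		if (!page) {
			err = -ENOMEM;
			goto free_vmcs;
		}

		BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
		vmx->ve_info = page_to_virt(page);
	}

init_vmcs() would then be left with just the
vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info)) part.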
> + */
> + pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
> + secondary_exec_controls_clearbit(vmx,
> + SECONDARY_EXEC_EPT_VIOLATION_VE);
> + }
> + }
> + }
>
> if (cpu_has_tertiary_exec_ctrls())
> tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
> @@ -5204,6 +5251,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> if (is_invalid_opcode(intr_info))
> return handle_ud(vcpu);
>
> + /*
> + * #VE isn't supposed to happen. Block the VM if it does.
This is not a useful comment. Obviously #VE isn't supposed to happen, otherwise
KVM wouldn't be bugging the VM.
> + */
> + if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
> + return -EIO;
> +
> error_code = 0;
> if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
> error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> @@ -6393,6 +6446,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
> pr_err("Virtual processor ID = 0x%04x\n",
> vmcs_read16(VIRTUAL_PROCESSOR_ID));
> + if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
> + struct vmx_ve_information *ve_info;
> +
> + pr_err("VE info address = 0x%016llx\n",
> + vmcs_read64(VE_INFORMATION_ADDRESS));
> + ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
Why!?!? You have the address in vcpu_vmx, just use that. If KVM is dumping
the VMCS, then something has gone wrong, possibly in hardware or ucode.
Dereferencing an address from the VMCS, which could very well be corrupted, is a
terrible idea. This could easily escalate from a dead VM into a dead host.
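I.e., roughly (dump_vmcs() already has vmx = to_vmx(vcpu) in hand):

	if (vmx->ve_info) {
		pr_err("VE info address = 0x%016llx\n",
		       vmcs_read64(VE_INFORMATION_ADDRESS));
		pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
		       vmx->ve_info->exit_reason, vmx->ve_info->delivery,
		       vmx->ve_info->exit_qualification,
		       vmx->ve_info->guest_linear_address,
		       vmx->ve_info->guest_physical_address,
		       vmx->ve_info->eptp_index);
	}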
> + pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
> + ve_info->exit_reason, ve_info->delivery,
> + ve_info->exit_qualification,
> + ve_info->guest_linear_address,
> + ve_info->guest_physical_address, ve_info->eptp_index);
> + }
> }
>
> /*
> @@ -7433,6 +7498,8 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
> free_vpid(vmx->vpid);
> nested_vmx_free_vcpu(vcpu);
> free_loaded_vmcs(vmx->loaded_vmcs);
> + if (vmx->ve_info)
Unnecessary, free_page() does this for you.
> + free_page((unsigned long)vmx->ve_info);
> }
>
> static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index e3b0985bb74a..1ea1e5c8930d 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -364,6 +364,9 @@ struct vcpu_vmx {
> DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
> DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
> } shadow_msr_intercept;
> +
> + /* ve_info must be page aligned. */
This is also not a useful comment. Even the one at the allocation site is of
dubious value.
> + struct vmx_ve_information *ve_info;
> };
>
> struct kvm_vmx {
> @@ -576,7 +579,8 @@ static inline u8 vmx_get_rvi(void)
> SECONDARY_EXEC_ENABLE_VMFUNC | \
> SECONDARY_EXEC_BUS_LOCK_DETECTION | \
> SECONDARY_EXEC_NOTIFY_VM_EXITING | \
> - SECONDARY_EXEC_ENCLS_EXITING)
> + SECONDARY_EXEC_ENCLS_EXITING | \
> + SECONDARY_EXEC_EPT_VIOLATION_VE)
>
> #define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
> #define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
> --
> 2.39.0
>
>
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> Right now the error code is not used when an async page fault is completed.
> This is not a problem in the current code, but it is untidy. For protected
> VMs we need to check that the page attributes match the current state of the
> page. Async page faults can only occur on shared pages (because
> private pages go through kvm_faultin_pfn_private() instead of
> __gfn_to_pfn_memslot()), but it is risky to rely on the polarity of
> PFERR_GUEST_ENC_MASK and the high 32 bits of the error code being zero.
> So, for clarity and future-proofing of the code, pipe the error code
> from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the
> architecture-specific async page fault data.
>
> Extracted from a patch by Isaku Yamahata.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
> 2 files changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a4514c2ef0ec..24e30ca2ca8f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1839,6 +1839,7 @@ struct kvm_arch_async_pf {
> gfn_t gfn;
> unsigned long cr3;
> bool direct_map;
> + u64 error_code;
> };
>
> extern u32 __read_mostly kvm_nr_uret_msrs;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f58ca6cb789a..c9890e5b6e4c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4260,18 +4260,18 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
> return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
> }
>
> -static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - gfn_t gfn)
> +static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> {
> struct kvm_arch_async_pf arch;
>
> arch.token = alloc_apf_token(vcpu);
> - arch.gfn = gfn;
> + arch.gfn = fault->gfn;
> arch.direct_map = vcpu->arch.mmu->root_role.direct;
> arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
>
> - return kvm_setup_async_pf(vcpu, cr2_or_gpa,
> - kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> + return kvm_setup_async_pf(vcpu, fault->addr,
> + kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
> }
>
> void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> @@ -4290,7 +4290,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
> return;
>
> - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
This is silly. If we're going to bother plumbing in the error code, then we
should use it to do sanity checks. Things have gone off the rails if we end up with
an async #PF on private memory.
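E.g., a sketch of such a check in kvm_arch_async_page_ready(), right before the
kvm_mmu_do_page_fault() call, reusing the same notion of "private" this series uses
for fault->is_private (kvm->arch.has_private_mem comes from the SEV VM-types series):

	if (KVM_BUG_ON(vcpu->kvm->arch.has_private_mem &&
		       (work->arch.error_code & PFERR_GUEST_ENC_MASK), vcpu->kvm))
		return;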
> }
>
> static inline u8 kvm_max_level_for_order(int order)
> @@ -4395,7 +4395,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> return RET_PF_RETRY;
> - } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
> + } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> return RET_PF_RETRY;
> }
> }
> --
> 2.39.0
>
>
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
This needs a changelog, and also needs to be Cc'd to someone(s) that can give it
a thumbs up.
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> include/linux/pagemap.h | 2 ++
> mm/filemap.c | 4 ++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..e8ac0b32f84d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> * added to the page cache and the VM's LRU list. The folio is
> * returned locked.
> + * * %FGP_CREAT_ONLY - Fail if a folio is already present
> * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> * folio is already in cache. If the folio was allocated, unlock it
> * before returning so the caller can do the same dance.
> @@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
> #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> #define FGP_STABLE ((__force fgf_t)0x00000080)
> +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
>
> #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 750e779c23db..d5107bd0cd09 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> folio = NULL;
> if (!folio)
> goto no_page;
> + if (fgp_flags & FGP_CREAT_ONLY) {
> + folio_put(folio);
> + return ERR_PTR(-EEXIST);
> + }
>
> if (fgp_flags & FGP_LOCK) {
> if (fgp_flags & FGP_NOWAIT) {
> --
> 2.39.0
>
>
On Tue, Feb 27, 2024 at 6:15 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Feb 27, 2024, Paolo Bonzini wrote:
>
> This needs a changelog, and also needs to be Cc'd to someone(s) that can give it
> a thumbs up.
+Matthew Wilcox
>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > include/linux/pagemap.h | 2 ++
> > mm/filemap.c | 4 ++++
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 2df35e65557d..e8ac0b32f84d 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> > * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> > * added to the page cache and the VM's LRU list. The folio is
> > * returned locked.
> > + * * %FGP_CREAT_ONLY - Fail if a folio is already present
> > * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> > * folio is already in cache. If the folio was allocated, unlock it
> > * before returning so the caller can do the same dance.
> > @@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
> > #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> > #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> > #define FGP_STABLE ((__force fgf_t)0x00000080)
> > +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> > #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
> >
> > #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 750e779c23db..d5107bd0cd09 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> > folio = NULL;
> > if (!folio)
> > goto no_page;
> > + if (fgp_flags & FGP_CREAT_ONLY) {
> > + folio_put(folio);
> > + return ERR_PTR(-EEXIST);
> > + }
> >
> > if (fgp_flags & FGP_LOCK) {
> > if (fgp_flags & FGP_NOWAIT) {
> > --
> > 2.39.0
> >
> >
>
I would strongly prefer we target 6.10, not 6.9. The TDX and SNP folks don't need
any of this code to be in Linus' tree, they just need it in _a_ KVM tree so that
they can develop on top.
And I will have limited availability the rest of this week (potentially very
limited), and I obviously have strong opinions about some of this code. But even
if I had cycles to review this properly, I just don't see a reason to rush it in.
For the guest_memfd changes in particular, they're impossible to review in this
series. Rather than prematurely shove them into mainline, we should create a
volatile topic branch and use that to enable TDX/SNP development. That way we
can fixup patches if things need to change.
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> This is a first set of, hopefully non-controversial patches from the
> SNP and TDX series. They cover mostly changes to generic code and new
> gmem APIs, and in general have already been reviewed when posted by
> Isaku and Michael.
>
> One important change is that the gmem hook for initializing memory
> is designed to return -EEXIST if the page already exists in the
> guestmemfd filemap. The idea is that the special case of
> KVM_SEV_SNP_LAUNCH_UPDATE, where __kvm_gmem_get_pfn() is used to
> return an uninitialized page and make it guest-owned, can be be done at
> most once per page unless the ioctl fails.
>
> Of course these patches add a bunch of dead code. This is intentional
> because it's the only way to trim the large TDX (and to some extent SNP)
> series to the point that it's possible to discuss them. The next step is
> probably going to be the private<->shared page logic from the TDX series.
On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> From: Tom Lendacky <[email protected]>
>
> In preparation to support SEV-SNP AP Creation, use a variable that holds
> the VMSA physical address rather than converting the virtual address.
> This will allow SEV-SNP AP Creation to set the new physical address that
> will be used should the vCPU reset path be taken.
No, this patch belongs in the SNP series. The handling of vmsa_pa is broken
(KVM leaks the page set by the guest; I need to follow up in the SNP series).
On top of that, I detest duplicate variables, and I don't like that KVM keeps its
original VMSA (kernel allocation) after the guest creates its own.
I can't possibly imagine why this needs to be pulled in early. There's no way
TDX needs this, and while this patch is _small_, the functional change it leads
to is not.
On Tue, Feb 27, 2024 at 06:17:34PM -0800, Yosry Ahmed wrote:
> On Tue, Feb 27, 2024 at 6:15 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> >
> > This needs a changelog, and also needs to be Cc'd to someone(s) that can give it
> > a thumbs up.
>
> +Matthew Wilcox
If only there were an entry in MAINTAINERS for filemap.c ...
This looks bogus to me, and if it's not bogus, it's incomplete.
But it's hard to judge without a commit message that describes what it's
supposed to mean.
> >
> > > Signed-off-by: Paolo Bonzini <[email protected]>
> > > ---
> > > include/linux/pagemap.h | 2 ++
> > > mm/filemap.c | 4 ++++
> > > 2 files changed, 6 insertions(+)
> > >
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 2df35e65557d..e8ac0b32f84d 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> > > * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> > > * added to the page cache and the VM's LRU list. The folio is
> > > * returned locked.
> > > + * * %FGP_CREAT_ONLY - Fail if a folio is already present
> > > * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> > > * folio is already in cache. If the folio was allocated, unlock it
> > > * before returning so the caller can do the same dance.
> > > @@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
> > > #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> > > #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> > > #define FGP_STABLE ((__force fgf_t)0x00000080)
> > > +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> > > #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
> > >
> > > #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 750e779c23db..d5107bd0cd09 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> > > folio = NULL;
> > > if (!folio)
> > > goto no_page;
> > > + if (fgp_flags & FGP_CREAT_ONLY) {
> > > + folio_put(folio);
> > > + return ERR_PTR(-EEXIST);
> > > + }
> > >
> > > if (fgp_flags & FGP_LOCK) {
> > > if (fgp_flags & FGP_NOWAIT) {
> > > --
> > > 2.39.0
> > >
> > >
> >
On Wed, Feb 28, 2024 at 3:03 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > Right now the error code is not used when an async page fault is completed.
> > This is not a problem in the current code, but it is untidy. For protected
> > VMs we need to check that the page attributes match the current state of the
> > page. Async page faults can only occur on shared pages (because
> > private pages go through kvm_faultin_pfn_private() instead of
> > __gfn_to_pfn_memslot()), but it is risky to rely on the polarity of
> > PFERR_GUEST_ENC_MASK and the high 32 bits of the error code being zero.
> > So, for clarity and future-proofing of the code, pipe the error code
> > from kvm_arch_setup_async_pf() to kvm_arch_async_page_ready() via the
> > architecture-specific async page fault data.
> >
> > Extracted from a patch by Isaku Yamahata.
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 14 +++++++-------
> > 2 files changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index a4514c2ef0ec..24e30ca2ca8f 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1839,6 +1839,7 @@ struct kvm_arch_async_pf {
> > gfn_t gfn;
> > unsigned long cr3;
> > bool direct_map;
> > + u64 error_code;
> > };
> >
> > extern u32 __read_mostly kvm_nr_uret_msrs;
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f58ca6cb789a..c9890e5b6e4c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4260,18 +4260,18 @@ static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
> > return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
> > }
> >
> > -static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > - gfn_t gfn)
> > +static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
> > + struct kvm_page_fault *fault)
> > {
> > struct kvm_arch_async_pf arch;
> >
> > arch.token = alloc_apf_token(vcpu);
> > - arch.gfn = gfn;
> > + arch.gfn = fault->gfn;
> > arch.direct_map = vcpu->arch.mmu->root_role.direct;
> > arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
> >
> > - return kvm_setup_async_pf(vcpu, cr2_or_gpa,
> > - kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
> > + return kvm_setup_async_pf(vcpu, fault->addr,
> > + kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
> > }
> >
> > void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > @@ -4290,7 +4290,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> > work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
> > return;
> >
> > - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
> > + kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, true, NULL);
>
> This is silly. If we're going to bother plumbing in the error code, then we
> should use it to do sanity checks. Things have gone off the rails if we end up with
> an async #PF on private memory.
Sure, I split this part out not just because it makes sense to do so,
but also because it's not strictly necessary. I'll add the check and
tweak the changelog.
Paolo
>
> > }
> >
> > static inline u8 kvm_max_level_for_order(int order)
> > @@ -4395,7 +4395,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> > kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> > return RET_PF_RETRY;
> > - } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
> > + } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> > return RET_PF_RETRY;
> > }
> > }
> > --
> > 2.39.0
> >
> >
>
On Wed, Feb 28, 2024 at 2:15 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 06:17:34PM -0800, Yosry Ahmed wrote:
> > On Tue, Feb 27, 2024 at 6:15 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > >
> > > This needs a changelog, and also needs to be Cc'd to someone(s) that can give it
> > > a thumbs up.
> >
> > +Matthew Wilcox
>
> If only there were an entry in MAINTAINERS for filemap.c ...
Not CCing you (or mm in general) was intentional because I first
wanted a review of the KVM APIs; of course I wouldn't have committed
it without an Acked-by. But yeah, not writing the changelog yet was
pure laziness.
Since you're here: KVM would like to add an ioctl to encrypt and
install a page into guest_memfd, in preparation for launching an
encrypted guest. For this API we want to rule out the possibility of
overwriting a page that is already in the guest_memfd's filemap,
therefore this API would pass FGP_CREAT_ONLY|FGP_CREAT
into __filemap_get_folio. Do you think this is bogus...
> This looks bogus to me, and if it's not bogus, it's incomplete.
.. or if not, what incompleteness can you spot?
Thanks,
Paolo
> But it's hard to judge without a commit message that describes what it's
> supposed to mean.
>
> > >
> > > > Signed-off-by: Paolo Bonzini <[email protected]>
> > > > ---
> > > > include/linux/pagemap.h | 2 ++
> > > > mm/filemap.c | 4 ++++
> > > > 2 files changed, 6 insertions(+)
> > > >
> > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > index 2df35e65557d..e8ac0b32f84d 100644
> > > > --- a/include/linux/pagemap.h
> > > > +++ b/include/linux/pagemap.h
> > > > @@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> > > > * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> > > > * added to the page cache and the VM's LRU list. The folio is
> > > > * returned locked.
> > > > + * * %FGP_CREAT_ONLY - Fail if a folio is already present
> > > > * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> > > > * folio is already in cache. If the folio was allocated, unlock it
> > > > * before returning so the caller can do the same dance.
> > > > @@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
> > > > #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> > > > #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> > > > #define FGP_STABLE ((__force fgf_t)0x00000080)
> > > > +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> > > > #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
> > > >
> > > > #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 750e779c23db..d5107bd0cd09 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> > > > folio = NULL;
> > > > if (!folio)
> > > > goto no_page;
> > > > + if (fgp_flags & FGP_CREAT_ONLY) {
> > > > + folio_put(folio);
> > > > + return ERR_PTR(-EEXIST);
> > > > + }
> > > >
> > > > if (fgp_flags & FGP_LOCK) {
> > > > if (fgp_flags & FGP_NOWAIT) {
> > > > --
> > > > 2.39.0
> > > >
> > > >
> > >
>
On Wed, Feb 28, 2024 at 2:25 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > This is a first set of, hopefully non-controversial patches from the
>
> Heh, you jinxed yourself. :-)
Well I
> > SNP and TDX series. They cover mostly changes to generic code and new
> > gmem APIs, and in general have already been reviewed when posted by
> > Isaku and Michael.
> >
> > One important change is that the gmem hook for initializing memory
> > is designed to return -EEXIST if the page already exists in the
> > guestmemfd filemap. The idea is that the special case of
> > KVM_SEV_SNP_LAUNCH_UPDATE, where __kvm_gmem_get_pfn() is used to
> > return an uninitialized page and make it guest-owned, can be be done at
> > most once per page unless the ioctl fails.
> >
> > Of course these patches add a bunch of dead code. This is intentional
> > because it's the only way to trim the large TDX (and to some extent SNP)
> > series to the point that it's possible to discuss them. The next step is
> > probably going to be the private<->shared page logic from the TDX series.
> >
> > Paolo
> >
> > Isaku Yamahata (5):
> > KVM: x86/mmu: Add Suppress VE bit to EPT
> > shadow_mmio_mask/shadow_present_mask
> > KVM: VMX: Introduce test mode related to EPT violation VE
> > KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
> > allocation
> > KVM: x86/tdp_mmu: Sprinkle __must_check
> > KVM: x86/mmu: Pass around full 64-bit error code for KVM page faults
>
> I have a slight tweak to this patch (drop truncation), and a rewritten changelog.
>
> > Michael Roth (2):
> > KVM: x86: Add gmem hook for invalidating memory
> > KVM: x86: Add gmem hook for determining max NPT mapping level
> >
> > Paolo Bonzini (6):
> > KVM: x86/mmu: pass error code back to MMU when async pf is ready
> > KVM: x86/mmu: Use PFERR_GUEST_ENC_MASK to indicate fault is private
>
> This doesn't work. The ENC flag gets set on any SNP *capable* CPU, which results
> in false positives for SEV and SEV-ES guests[*].
You didn't look at the patch did you? :) It does check for
has_private_mem (alternatively I could have dropped the bit in SVM
code for SEV and SEV-ES guests).
> I have a medium-sized series to add a KVM-defined synthetic flag, and clean up
> the related code (it also has my slight variation on the 64-bit error code patch).
>
> I'll post my series exactly as I have it, mostly so that I don't need to redo
> testing, but also because it's pretty much a drop-in replacement. This series
> applies cleanly on top, except for the two obvious conflicts.
Ok, I will check it out. This is exactly why I posted these.
Paolo
> [*] https://lore.kernel.org/all/[email protected]
>
> > KVM: guest_memfd: pass error up from filemap_grab_folio
> > filemap: add FGP_CREAT_ONLY
> > KVM: x86: Add gmem hook for initializing memory
> > KVM: guest_memfd: add API to undo kvm_gmem_get_uninit_pfn
> >
> > Sean Christopherson (7):
> > KVM: x86: Split core of hypercall emulation to helper function
> > KVM: Allow page-sized MMU caches to be initialized with custom 64-bit
> > values
> > KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
> > KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
> > KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
> > SPTE
> > KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
> > KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
> > argument
> >
> > Tom Lendacky (1):
> > KVM: SEV: Use a VMSA physical address variable for populating VMCB
> >
> > arch/x86/include/asm/kvm-x86-ops.h | 3 +
> > arch/x86/include/asm/kvm_host.h | 12 +
> > arch/x86/include/asm/vmx.h | 13 +
> > arch/x86/kvm/Makefile | 2 +-
> > arch/x86/kvm/mmu.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 55 ++--
> > arch/x86/kvm/mmu/mmu_internal.h | 6 +-
> > arch/x86/kvm/mmu/mmutrace.h | 2 +-
> > arch/x86/kvm/mmu/paging_tmpl.h | 4 +-
> > arch/x86/kvm/mmu/spte.c | 16 +-
> > arch/x86/kvm/mmu/spte.h | 21 +-
> > arch/x86/kvm/mmu/tdp_iter.h | 12 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 74 +++--
> > arch/x86/kvm/svm/sev.c | 3 +-
> > arch/x86/kvm/svm/svm.c | 9 +-
> > arch/x86/kvm/svm/svm.h | 1 +
> > arch/x86/kvm/vmx/main.c | 168 +++++++++++
> > arch/x86/kvm/vmx/vmcs.h | 5 +
> > arch/x86/kvm/vmx/vmx.c | 460 +++++++++++------------------
> > arch/x86/kvm/vmx/vmx.h | 6 +-
> > arch/x86/kvm/vmx/x86_ops.h | 124 ++++++++
> > arch/x86/kvm/x86.c | 69 +++--
> > include/linux/kvm_host.h | 25 ++
> > include/linux/kvm_types.h | 1 +
> > include/linux/pagemap.h | 2 +
> > mm/filemap.c | 4 +
> > virt/kvm/Kconfig | 8 +
> > virt/kvm/guest_memfd.c | 120 +++++++-
> > virt/kvm/kvm_main.c | 16 +-
> > 29 files changed, 855 insertions(+), 387 deletions(-)
> > create mode 100644 arch/x86/kvm/vmx/main.c
> > create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> >
> > --
> > 2.39.0
> >
>
On Wed, Feb 28, 2024, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 2:25 AM Sean Christopherson <[email protected]> wrote:
> > > Michael Roth (2):
> > > KVM: x86: Add gmem hook for invalidating memory
> > > KVM: x86: Add gmem hook for determining max NPT mapping level
> > >
> > > Paolo Bonzini (6):
> > > KVM: x86/mmu: pass error code back to MMU when async pf is ready
> > > KVM: x86/mmu: Use PFERR_GUEST_ENC_MASK to indicate fault is private
> >
> > This doesn't work. The ENC flag gets set on any SNP *capable* CPU, which results
> > in false positives for SEV and SEV-ES guests[*].
>
> You didn't look at the patch did you? :)
Guilty, sort of. I looked (and tested) the patch from the TDX series, but I didn't
look at what you posted. But it's a moot point, because now I did look at what you
posted, and it's still broken :-)
> It does check for has_private_mem (alternatively I could have dropped the bit
> in SVM code for SEV and SEV-ES guests).
The problem isn't with *KVM* setting the bit, it's with *hardware* setting the
bit for SEV and SEV-ES guests. That results in this:
.is_private = vcpu->kvm->arch.has_private_mem && (err & PFERR_GUEST_ENC_MASK),
marking the fault as private. Which, in a vacuum, isn't technically wrong, since
from hardware's perspective the vCPU access was "private". But from KVM's
perspective, SEV and SEV-ES guests don't have private memory, they have memory
that can be *encrypted*, and marking the access as "private" results in violations
of KVM's rules for private memory. Specifically, it results in KVM triggering
emulated MMIO for faults that are marked private, which we want to disallow for
SNP and TDX.
And because the flag only gets set on SNP capable hardware (in my limited testing
of a whole two systems), running the same VM on different hardware would result
in faults being marked private on one system, but not the other. Which means that
KVM can't rely on the flag being set for SEV or SEV-ES guests, i.e. we can't
retroactively enforce anything (not to mention that that might break existing VMs).
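For reference, the "KVM-defined synthetic flag" approach mentioned earlier would be
shaped roughly like this sketch (the flag name matches the patch referenced at the
end of the thread; the bit position and placement are illustrative):

	/* Software-defined bit, kept outside anything hardware can set. */
	#define PFERR_PRIVATE_ACCESS	BIT_ULL(49)

	/* Set by KVM itself, only for VM types that actually have private memory: */
	if (vcpu->kvm->arch.has_private_mem && (error_code & PFERR_GUEST_ENC_MASK))
		error_code |= PFERR_PRIVATE_ACCESS;

fault->is_private then keys off PFERR_PRIVATE_ACCESS alone, so the raw hardware
ENC bit never directly decides whether a fault is treated as private.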
On Wed, Feb 28, 2024 at 5:39 PM Sean Christopherson <[email protected]> wrote:
> > > This doesn't work. The ENC flag gets set on any SNP *capable* CPU, which results
> > > in false positives for SEV and SEV-ES guests[*].
> >
> > You didn't look at the patch did you? :)
>
> Guilty, sort of. I looked (and tested) the patch from the TDX series, but I didn't
> look at what you posted. But it's a moot point, because now I did look at what you
> posted, and it's still broken :-)
>
> > It does check for has_private_mem (alternatively I could have dropped the bit
> > in SVM code for SEV and SEV-ES guests).
>
> The problem isn't with *KVM* setting the bit, it's with *hardware* setting the
> bit for SEV and SEV-ES guests. That results in this:
>
> .is_private = vcpu->kvm->arch.has_private_mem && (err & PFERR_GUEST_ENC_MASK),
>
> marking the fault as private. Which, in a vacuum, isn't technically wrong, since
> from hardware's perspective the vCPU access was "private". But from KVM's
> perspective, SEV and SEV-ES guests don't have private memory
vcpu->kvm->arch.has_private_mem is the flag from the SEV VM types
series. It's false on SEV and SEV-ES VMs, therefore fault->is_private
is going to be false as well. Is it ENOCOFFEE for you or ENODINNER for
me? :)
Paolo
> And because the flag only gets set on SNP capable hardware (in my limited testing
> of a whole two systems), running the same VM on different hardware would result
> in faults being marked private on one system, but not the other. Which means that
> KVM can't rely on the flag being set for SEV or SEV-ES guests, i.e. we can't
> retroactively enforce anything (not to mention that that might break existing VMs).
>
On Wed, Feb 28, 2024 at 3:00 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > From: Tom Lendacky <[email protected]>
> >
> > In preparation to support SEV-SNP AP Creation, use a variable that holds
> > the VMSA physical address rather than converting the virtual address.
> > This will allow SEV-SNP AP Creation to set the new physical address that
> > will be used should the vCPU reset path be taken.
>
> No, this patch belongs in the SNP series. The handling of vmsa_pa is broken
> (KVM leaks the page set by the guest; I need to follow up in the SNP series).
> On top of that, I detest duplicate variables, and I don't like that KVM keeps its
> original VMSA (kernel allocation) after the guest creates its own.
>
> I can't possibly imagine why this needs to be pulled in early. There's no way
> TDX needs this, and while this patch is _small_, the functional change it leads
> to is not.
Well, the point of this series (and there will be more if you agree)
is exactly to ask "why not" in a way that is more manageable than
through the huge TDX and SNP series. My reading of the above is that
you believe this is small enough that it can even be merged with "KVM:
SEV: Support SEV-SNP AP Creation NAE event" (with fixes), which I
don't disagree with.
Otherwise, if the approach was good there's no reason _not_ to get it
in early. It's just a refactoring.
Talking in general: I think I agree about keeping the gmem parts in a
kvm-coco-queue branch (and in the meanwhile involving the mm people if
mm/filemap.c changes are needed). #VE too, probably, but what I
_really_ want to avoid is that these series (the plural is not a typo)
become a new bottleneck for everybody. Basically these are meant to be
a "these seem good to go to me, please confirm or deny" between
comaintainers more than a real patch posting; having an extra branch
is extra protection against screwups but we should be mindful that
force pushes are painful for everyone.
If you think I'm misguided, please do speak out or feel free to ask me
to talk on voice.
Paolo
On Wed, Feb 28, 2024, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 5:39 PM Sean Christopherson <[email protected]> wrote:
> > > > This doesn't work. The ENC flag gets set on any SNP *capable* CPU, which results
> > > > in false positives for SEV and SEV-ES guests[*].
> > >
> > > You didn't look at the patch did you? :)
> >
> > Guilty, sort of. I looked (and tested) the patch from the TDX series, but I didn't
> > look at what you posted. But it's a moot point, because now I did look at what you
> > posted, and it's still broken :-)
> >
> > > It does check for has_private_mem (alternatively I could have dropped the bit
> > > in SVM code for SEV and SEV-ES guests).
> >
> > The problem isn't with *KVM* setting the bit, it's with *hardware* setting the
> > bit for SEV and SEV-ES guests. That results in this:
> >
> > .is_private = vcpu->kvm->arch.has_private_mem && (err & PFERR_GUEST_ENC_MASK),
> >
> > marking the fault as private. Which, in a vacuum, isn't technically wrong, since
> > from hardware's perspective the vCPU access was "private". But from KVM's
> > perspective, SEV and SEV-ES guests don't have private memory
>
> vcpu->kvm->arch.has_private_mem is the flag from the SEV VM types
> series. It's false on SEV and SEV-ES VMs, therefore fault->is_private
> is going to be false as well. Is it ENOCOFFEE for you or ENODINNER for
> me? :)
*sigh*, ENOCOFFEE.
On Tue, Feb 27, 2024 at 06:20:57PM -0500,
Paolo Bonzini <[email protected]> wrote:
> guest_memfd pages are generally expected to be in some arch-defined
> initial state prior to using them for guest memory. For SEV-SNP this
> initial state is 'private', or 'guest-owned', and requires additional
> operations to move these pages into a 'private' state by updating the
> corresponding entries the RMP table.
>
> Allow for an arch-defined hook to handle updates of this sort, and go
> ahead and implement one for x86 so KVM implementations like AMD SVM can
> register a kvm_x86_ops callback to handle these updates for SEV-SNP
> guests.
>
> The preparation callback is always called when allocating/grabbing
> folios via gmem, and it is up to the architecture to keep track of
> whether or not the pages are already in the expected state (e.g. the RMP
> table in the case of SEV-SNP).
>
> In some cases, it is necessary to defer the preparation of the pages to
> handle things like in-place encryption of initial guest memory payloads
> before marking these pages as 'private'/'guest-owned', so also add a
> helper that performs the same function as kvm_gmem_get_pfn(), but allows
> for the preparation callback to be bypassed to allow for pages to be
> accessed beforehand.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Co-developed-by: Michael Roth <[email protected]>
> Signed-off-by: Michael Roth <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/x86.c | 6 +++
> include/linux/kvm_host.h | 14 ++++++
> virt/kvm/Kconfig | 4 ++
> virt/kvm/guest_memfd.c | 72 +++++++++++++++++++++++++++---
> 6 files changed, 92 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index ac8b7614e79d..adfaad15e7e6 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -139,6 +139,7 @@ KVM_X86_OP(complete_emulated_msr)
> KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
>
> #undef KVM_X86_OP
> #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7de8a3f2a118..6d873d08f739 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1804,6 +1804,7 @@ struct kvm_x86_ops {
> unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> + int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f10a5a617120..eff532ea59c9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13598,6 +13598,12 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(kvm_arch_no_poll);
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order)
> +{
> + return static_call(kvm_x86_gmem_prepare)(kvm, pfn, gfn, max_order);
> +}
> +#endif
>
> int kvm_spec_ctrl_test_value(u64 value)
> {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 97afe4519772..03bf616b7308 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2434,6 +2434,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> #ifdef CONFIG_KVM_PRIVATE_MEM
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> +int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> #else
> static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -2442,6 +2444,18 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> KVM_BUG_ON(1, kvm);
> return -EIO;
> }
> +
> +static inline int kvm_gmem_get_uninit_pfn(struct kvm *kvm,
> + struct kvm_memory_slot *slot, gfn_t gfn,
> + kvm_pfn_t *pfn, int *max_order)
> +{
> + KVM_BUG_ON(1, kvm);
> + return -EIO;
> +}
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> +int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
> +#endif
> +
> #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index a11e9c80fac9..dcce0c3b5b13 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -111,3 +111,7 @@ config KVM_GENERIC_PRIVATE_MEM
> select KVM_GENERIC_MEMORY_ATTRIBUTES
> select KVM_PRIVATE_MEM
> bool
> +
> +config HAVE_KVM_GMEM_PREPARE
> + bool
> + depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index de0d5a5c210c..7ec7afafc960 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -13,12 +13,50 @@ struct kvm_gmem {
> struct list_head entry;
> };
>
> -static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> +static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
> +{
> +#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> + struct list_head *gmem_list = &inode->i_mapping->i_private_list;
> + struct kvm_gmem *gmem;
> +
> + list_for_each_entry(gmem, gmem_list, entry) {
> + struct kvm_memory_slot *slot;
> + struct kvm *kvm = gmem->kvm;
> + struct page *page;
> + kvm_pfn_t pfn;
> + gfn_t gfn;
> + int rc;
> +
> + slot = xa_load(&gmem->bindings, index);
> + if (!slot)
> + continue;
> +
> + page = folio_file_page(folio, index);
> + pfn = page_to_pfn(page);
> + gfn = slot->base_gfn + index - slot->gmem.pgoff;
> + rc = kvm_arch_gmem_prepare(kvm, gfn, pfn, compound_order(compound_head(page)));
> + if (rc) {
> + pr_warn_ratelimited("gmem: Failed to prepare folio for index %lx, error %d.\n",
> + index, rc);
> + return rc;
> + }
> + }
> +
> +#endif
> + return 0;
> +}
Can we make it conditional?
TDX doesn't need the prepare hook, so it would set gmem_prepare = NULL. For a
large-memory guest (several hundred GiB), this page-cache lookup loop slows down
guest startup. I think the same applies to SW_PROTECTED_VM (and pKVM in the future).
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3835732491b9..cafb8d0997b5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -842,6 +842,9 @@ struct kvm {
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
/* Protected by slots_locks (for writes) and RCU (for reads) */
struct xarray mem_attr_array;
+#endif
+#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+ bool gmem_need_prepare;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
};
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 74e19170af8a..ab7d0f7d3d38 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -16,6 +16,7 @@ struct kvm_gmem {
static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct folio *folio)
{
#ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
+ rc = kvm_arch_gmem_prepare(inode, index, folio);
struct list_head *gmem_list = &inode->i_mapping->i_private_list;
struct kvm_gmem *gmem;
@@ -27,6 +28,9 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
gfn_t gfn;
int rc;
+ if (!kvm->gmem_need_prepare)
+ continue;
+
slot = xa_load(&gmem->bindings, index);
if (!slot)
continue;
--
Isaku Yamahata <[email protected]>
On Wed, Feb 28, 2024 at 8:24 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Feb 28, 2024 at 02:28:45PM +0100, Paolo Bonzini wrote:
> > Since you're here: KVM would like to add an ioctl to encrypt and
> > install a page into guest_memfd, in preparation for launching an
> > encrypted guest. For this API we want to rule out the possibility of
> > overwriting a page that is already in the guest_memfd's filemap,
> > therefore this API would pass FGP_CREAT_ONLY|FGP_CREAT
> > into __filemap_get_folio. Do you think this is bogus...
>
> Would it work to start out by either asserting the memfd is empty of
> pages, or by evicting any existing pages? Both those seem nicer than
> starting, realising you've got some unencrypted memory and aborting.
Unfortunately it would be quite ugly to force userspace to do all the
initialization in one go. For example, there are different kinds of
pages that probably would be initialized at different points (e.g.
before vs. after vCPUs are created, because the initial vCPU state is
also encrypted).
The thing that I want to protect against is userspace trying to
initialize the same encrypted page twice.
> > > This looks bogus to me, and if it's not bogus, it's incomplete.
> >
> > ... or if not, what incompleteness can you spot?
>
> The part where we race another caller passing FGP_CREAT_ONLY and one gets
> an EEXIST back from filemap_add_folio(). Maybe that's not something
> that can happen in your use case, but it's at least semantics that
> need documenting.
From the point of view of filemap_add_folio(), one of the racers wins
and one fails. It doesn't matter to filemap.c if the missing
synchronization is in the kernel or in userspace. In the case of KVM,
the ioctl will return the number of pages it processed before finding an
existing page, or -EEXIST if that number is zero (similar to what a
nonblocking read does with EAGAIN).
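In sketch form (all names here are placeholders; only the return-value convention
matters):

	long ret = 0;

	for (i = 0; i < npages; i++) {
		struct folio *folio = __filemap_get_folio(mapping, start + i,
						FGP_LOCK | FGP_CREAT | FGP_CREAT_ONLY,
						mapping_gfp_mask(mapping));

		if (IS_ERR(folio)) {
			if (PTR_ERR(folio) != -EEXIST)
				return PTR_ERR(folio);
			/* First already-initialized page: report how many pages
			 * were processed, or -EEXIST if none were. */
			return ret ? ret : -EEXIST;
		}

		/* ... encrypt the source data and install it into the folio ... */

		folio_unlock(folio);
		folio_put(folio);
		ret++;
	}
	return ret;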
I'll improve the documentation and changelog and make sure to Cc you
on the next version.
Thanks again!
Paolo
On Wed, Feb 28, 2024 at 02:28:45PM +0100, Paolo Bonzini wrote:
> Since you're here: KVM would like to add an ioctl to encrypt and
> install a page into guest_memfd, in preparation for launching an
> encrypted guest. For this API we want to rule out the possibility of
> overwriting a page that is already in the guest_memfd's filemap,
> therefore this API would pass FGP_CREAT_ONLY|FGP_CREAT
> into __filemap_get_folio. Do you think this is bogus...
Would it work to start out by either asserting the memfd is empty of
pages, or by evicting any existing pages? Both those seem nicer than
starting, realising you've got some unencrypted memory and aborting.
> > This looks bogus to me, and if it's not bogus, it's incomplete.
>
> ... or if not, what incompleteness can you spot?
The part where we race another caller passing FGP_CREAT_ONLY and one gets
an EEXIST back from filemap_add_folio(). Maybe that's not something
that can happen in your use case, but it's at least semantics that
need documenting.
On Tue, Feb 27, 2024 at 06:20:43PM -0500, Paolo Bonzini wrote:
> From: Sean Christopherson <[email protected]>
>
> For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
> is not able to access the private memory of TD guest and do the emulation.
> Instead, TD guest expects to receive #VE when it accesses the MMIO and then
> it can explicitly make hypercall to KVM to get the expected information.
>
> To achieve this, the TDX module always enables "EPT-violation #VE" in the
> VMCS control. And accordingly, for the MMIO spte for the shared GPA,
> 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
> violation happens on TD accessing MMIO range. 2. On EPT violation, KVM
> sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
> the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA
> that is not populated yet, EPT violation need to be triggered when TD guest
> accesses such shared GPA. The non-present SPTE value for shared GPA should
> set "suppress #VE" bit.
>
> Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
> REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
> for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
> present bit is off; 2) for normal VMX guest, KVM never enables the
> "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
> hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.h | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 4d1799ba2bf8..26bc95bbc962 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -149,7 +149,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>
> #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>
> +/*
> + * Non-present SPTE value for both VMX and SVM for TDP MMU.
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
> + * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
> + * For TDX:
> + * TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
> #define SHADOW_NONPRESENT_VALUE 0ULL
> +#endif
>
> extern u64 __read_mostly shadow_host_writable_mask;
> extern u64 __read_mostly shadow_mmu_writable_mask;
> @@ -196,7 +209,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
* vulnerability. Use only low bits to avoid 64-bit immediates.
^
We may remove this comment. Others are fine.
Reviewed-by: Xu Yilun <[email protected]>
> *
> * Only used by the TDP MMU.
> */
> -#define REMOVED_SPTE 0x5a0ULL
> +#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
>
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> --
> 2.39.0
>
>
>
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Sean Christopherson <[email protected]>
>
> For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
> is not able to access the private memory of TD guest and do the emulation.
> Instead, TD guest expects to receive #VE when it accesses the MMIO and then
> it can explicitly make hypercall to KVM to get the expected information.
>
> To achieve this, the TDX module always enables "EPT-violation #VE" in the
> VMCS control. And accordingly, for the MMIO spte for the shared GPA,
> 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
> violation happens on TD accessing MMIO range. 2. On EPT violation, KVM
> sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
> the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA
> that is not populated yet, EPT violation needs to be triggered when TD guest
> accesses such shared GPA. The non-present SPTE value for shared GPA should
> set "suppress #VE" bit.
>
> Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
> REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
> for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
> present bit is off; 2) for normal VMX guest, KVM never enables the
> "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
> hardware.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
+1 to the nit pointed out by Yilun,
after that,
Reviewed-by: Xiaoyao Li <[email protected]>
On Wed, Feb 28, 2024, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 3:00 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > > From: Tom Lendacky <[email protected]>
> > >
> > > In preparation to support SEV-SNP AP Creation, use a variable that holds
> > > the VMSA physical address rather than converting the virtual address.
> > > This will allow SEV-SNP AP Creation to set the new physical address that
> > > will be used should the vCPU reset path be taken.
> >
> > No, this patch belongs in the SNP series. The handling of vmsa_pa is broken
> > (KVM leaks the page set by the guest; I need to follow up in the SNP series).
> > On top of that, I detest duplicate variables, and I don't like that KVM keeps its
> > original VMSA (kernel allocation) after the guest creates its own.
> >
> > I can't possibly imagine why this needs to be pulled in early. There's no way
> > TDX needs this, and while this patch is _small_, the functional change it leads
> > to is not.
>
> Well, the point of this series (and there will be more if you agree)
> is exactly to ask "why not" in a way that is more manageable than
> through the huge TDX and SNP series. My reading of the above is that
> you believe this is small enough that it can even be merged with "KVM:
> SEV: Support SEV-SNP AP Creation NAE event" (with fixes), which I
> don't disagree with.
Maybe? That wasn't my point.
> Otherwise, if the approach was good there's no reason _not_ to get it
> in early. It's just a refactoring.
It's not really a refactoring though, that's why I'm objecting. If this patch
stored _just_ the physical adddress of the VMSA, then I would consider it a
refactoring and would have no problem applying it earlier.
But this patch adds a second, 100% duplicate field (as of now), and the reason
it does so is to allow "svm->sev_es.vmsa" to become disconnected from the "real"
VMSA that is used by hardware, which is all kinds of messed up. That's what I
meant by "the functional change it leads to is not (small)".
> Talking in general: I think I agree about keeping the gmem parts in a
> kvm-coco-queue branch (and in the meanwhile involving the mm people if
> mm/filemap.c changes are needed). #VE too, probably, but what I
> _really_ want to avoid is that these series (the plural is not a typo)
> become a new bottleneck for everybody. Basically these are meant to be
> a "these seem good to go to me, please confirm or deny" between
> co-maintainers more than a real patch posting; having an extra branch
> is extra protection against screwups but we should be mindful that
> force pushes are painful for everyone.
Yes, which is largely why I suggested we separate out the gmem changes. I suspect we'll need
to force push to fixup gmem things, whereas I'm confident the other prep work won't
need to be tweaked once it's fully reviewed.
For the other stuff, specifically to avoid creating another bottleneck, my preference
is to follow the "normal" rules for posting patches, with slightly relaxed bundling
rules. I.e. post multiple, independent series so that they can be reviewed,
iterated upon, and applied like any other series.
E.g. my objection to this VMSA tracking patch shouldn't get in the way of the MMU
changes, the #VE patch shouldn't interfere with the vmx/main.c patch, etc. In
other words, throwing everything into a kitchen sink "TDX/SNP prep work" series
just creates another (smaller) bottleneck.
I am 100% in favor of applying prep patches in advance of the larger SNP and TDX
series. That's actually partly why I ended up posting my series that includes
the PFERR_PRIVATE_ACCESS patch; I was trying to pull in using PFERR_GUEST_ENC_MASK
and some of the other "simple" patches, and the darn thing failed on me.
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> To make use of the same value of shadow_mmio_mask and shadow_present_mask
> for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and
> shadow_present_mask so that they can be common for both VMX and TDX.
>
> TDX will require shadow_mmio_mask and shadow_present_mask to include
> VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for
> shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
> spte value is defined so as to cause EPT misconfig.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/kvm/mmu/spte.c | 6 ++++--
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0e73616b82f3..76ed39541a52 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -513,6 +513,7 @@ enum vmcs_field {
> #define VMX_EPT_IPAT_BIT (1ull << 6)
> #define VMX_EPT_ACCESS_BIT (1ull << 8)
> #define VMX_EPT_DIRTY_BIT (1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
> #define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
> VMX_EPT_WRITABLE_MASK | \
> VMX_EPT_EXECUTABLE_MASK)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4a599130e9c9..02a466de2991 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -429,7 +429,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
> shadow_nx_mask = 0ull;
> shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
> - shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> + /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
> + shadow_present_mask =
> + (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
> /*
> * EPT overrides the host MTRRs, and so KVM must program the desired
> * memtype directly into the SPTEs. Note, this mask is just the mask
> @@ -446,7 +448,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> * of an EPT paging-structure entry is 110b (write/execute).
> */
> kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> - VMX_EPT_RWX_MASK, 0);
> + VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>
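To make the effect of the hunk above concrete, here is a small standalone
sketch that computes the resulting mask values. The VMX_EPT_* constants are
copied from vmx.h; only the bit arithmetic is shown, the real logic lives in
kvm_mmu_set_ept_masks().

#include <stdint.h>
#include <stdio.h>

#define VMX_EPT_READABLE_MASK   0x1ULL
#define VMX_EPT_WRITABLE_MASK   0x2ULL
#define VMX_EPT_EXECUTABLE_MASK 0x4ULL
#define VMX_EPT_RWX_MASK        (VMX_EPT_READABLE_MASK | \
                                 VMX_EPT_WRITABLE_MASK | \
                                 VMX_EPT_EXECUTABLE_MASK)
#define VMX_EPT_SUPPRESS_VE_BIT (1ULL << 63)

int main(void)
{
        int has_exec_only = 1;

        /* After the patch, bit 63 is unconditionally part of the present mask... */
        uint64_t present_mask = (has_exec_only ? 0ULL : VMX_EPT_READABLE_MASK) |
                                VMX_EPT_SUPPRESS_VE_BIT;
        /* ...and of the MMIO mask, so the same value/mask pair can be shared with TDX. */
        uint64_t mmio_mask = VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT;

        printf("shadow_present_mask = %#llx\n", (unsigned long long)present_mask);
        printf("shadow_mmio_mask    = %#llx\n", (unsigned long long)mmio_mask);
        return 0;
}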
On Tue, Feb 27, 2024 at 06:20:50PM -0500, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> Refactor tdp_mmu_alloc_sp() and tdp_mmu_init_sp and eliminate
^
tdp_mmu_init_sp()
> tdp_mmu_init_child_sp(). Currently tdp_mmu_init_sp() (or
> tdp_mmu_init_child_sp()) sets kvm_mmu_page.role after tdp_mmu_alloc_sp()
> allocating struct kvm_mmu_page and its page table page. This patch makes
> tdp_mmu_alloc_sp() initialize kvm_mmu_page.role instead of
> tdp_mmu_init_sp().
>
> To handle private page tables, argument of is_private needs to be passed
> down. Given that already page level is passed down, it would be cumbersome
> to add one more parameter about sp. Instead replace the level argument with
> union kvm_mmu_page_role. Thus the number of argument won't be increased
This section is hard to understand. I'm lost as to which functions
mentioned here took the level argument that should be replaced by the
role.
> and more info about sp can be passed down.
My understanding of the change is:
Extra handling is needed for the allocation of private page tables, so
calculate the kvm_mmu_page_role for the sp earlier and pass it to
tdp_mmu_alloc_sp(). Since sp.role can be decided at sp allocation
time, in turn remove the role argument from tdp_mmu_init_sp() and also
eliminate the helper tdp_mmu_init_child_sp().
>
> For private sp, secure page table will be also allocated in addition to
> struct kvm_mmu_page and page table (spt member). The allocation functions
> (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to know if the
> allocation is for the conventional page table or private page table. Pass
> union kvm_mmu_role to those functions and initialize role member of struct
^
Should be kvm_mmu_page_role
Thanks,
Yilun
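For readers trying to follow the refactor discussed above, a rough, hedged
sketch of the before/after shape using stand-in types; the real functions take
struct kvm_vcpu and allocate from the MMU caches, only the role-handling
difference is illustrated here.

#include <stdlib.h>

union kvm_mmu_page_role { unsigned long word; };        /* stand-in */
struct kvm_mmu_page { union kvm_mmu_page_role role; };  /* stand-in */

/* Before: the allocator knew nothing about the role; a separate init step
 * (tdp_mmu_init_sp()/tdp_mmu_init_child_sp()) filled it in afterwards. */
static struct kvm_mmu_page *alloc_sp_old(void)
{
        return calloc(1, sizeof(struct kvm_mmu_page));
}

/* After: the allocator takes the full role and initializes sp->role itself,
 * so a private-vs-shared decision can be made at allocation time without
 * adding yet another parameter next to the old level argument. */
static struct kvm_mmu_page *alloc_sp_new(union kvm_mmu_page_role role)
{
        struct kvm_mmu_page *sp = calloc(1, sizeof(*sp));

        if (sp)
                sp->role = role;
        return sp;
}

int main(void)
{
        union kvm_mmu_page_role role = { .word = 0 };
        struct kvm_mmu_page *a = alloc_sp_old();
        struct kvm_mmu_page *b = alloc_sp_new(role);

        free(a);
        free(b);
        return 0;
}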
On Tue, Feb 27, 2024 at 06:20:55PM -0500, Paolo Bonzini wrote:
> Some SNP ioctls will require the page not to be in the pagecache, and as such they
> will want to return EEXIST to userspace. Start by passing the error up from
> filemap_grab_folio.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> virt/kvm/guest_memfd.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 0f4e0cf4f158..de0d5a5c210c 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -20,7 +20,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> /* TODO: Support huge pages. */
> folio = filemap_grab_folio(inode->i_mapping, index);
> if (IS_ERR_OR_NULL(folio))
> - return NULL;
> + return folio;
I think it impacts kvm_gmem_get_pfn(), e.g.:
{
...
folio = kvm_gmem_get_folio(file_inode(file), index);
if (!folio) {
r = -ENOMEM;
goto out_fput;
}
...
}
Thanks,
Yilun
>
> /*
> * Use the up-to-date flag to track whether or not the memory has been
> @@ -146,8 +146,8 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> }
>
> folio = kvm_gmem_get_folio(inode, index);
> - if (!folio) {
> - r = -ENOMEM;
> + if (IS_ERR_OR_NULL(folio)) {
> + r = folio ? PTR_ERR(folio) : -ENOMEM;
> break;
> }
>
> --
> 2.39.0
>
>
>
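To make the point above concrete, a tiny userspace sketch of the
IS_ERR_OR_NULL/PTR_ERR propagation pattern the caller would need once
kvm_gmem_get_folio() can return an ERR_PTR. The ERR_PTR helpers here are
simplified stand-ins for the kernel ones, and the -EEXIST failure is only an
example (it comes from a later patch in this series).

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_ERRNO       4095
#define ERR_PTR(err)    ((void *)(long)(err))
#define PTR_ERR(ptr)    ((long)(ptr))
#define IS_ERR_OR_NULL(ptr) \
        ((ptr) == NULL || (unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

/* Pretend kvm_gmem_get_folio() failed, e.g. with -EEXIST from FGP_CREAT_ONLY. */
static void *fake_gmem_get_folio(void)
{
        return ERR_PTR(-EEXIST);
}

int main(void)
{
        void *folio = fake_gmem_get_folio();
        long r = 0;

        if (IS_ERR_OR_NULL(folio)) {
                /* Propagate the real error instead of collapsing it to -ENOMEM. */
                r = folio ? PTR_ERR(folio) : -ENOMEM;
        }
        printf("r = %ld\n", r);  /* -EEXIST rather than -ENOMEM */
        return 0;
}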
On Wed, Feb 28, 2024 at 02:28:45PM +0100, Paolo Bonzini wrote:
> On Wed, Feb 28, 2024 at 2:15 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Feb 27, 2024 at 06:17:34PM -0800, Yosry Ahmed wrote:
> > > On Tue, Feb 27, 2024 at 6:15 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Tue, Feb 27, 2024, Paolo Bonzini wrote:
> > > >
> > > > This needs a changelog, and also needs to be Cc'd to someone(s) that can give it
> > > > a thumbs up.
> > >
> > > +Matthew Wilcox
> >
> > If only there were an entry in MAINTAINERS for filemap.c ...
>
> Not CCing you (or mm in general) was intentional because I first
> wanted a review of the KVM APIs; of course I wouldn't have committed
> it without an Acked-by. But yeah, not writing the changelog yet was
> pure laziness.
>
> Since you're here: KVM would like to add an ioctl to encrypt and
> install a page into guest_memfd, in preparation for launching an
> encrypted guest. For this API we want to rule out the possibility of
> overwriting a page that is already in the guest_memfd's filemap,
> therefore this API would pass FGP_CREAT_ONLY|FGP_CREAT
> into __filemap_get_folio(). Do you think this is bogus...
>
> > This looks bogus to me, and if it's not bogus, it's incomplete.
>
> ... or if not, what incompleteness can you spot?
>
> Thanks,
>
> Paolo
>
> > But it's hard to judge without a commit message that describes what it's
> > supposed to mean.
> >
> > > >
> > > > > Signed-off-by: Paolo Bonzini <[email protected]>
> > > > > ---
> > > > > include/linux/pagemap.h | 2 ++
> > > > > mm/filemap.c | 4 ++++
> > > > > 2 files changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > > > index 2df35e65557d..e8ac0b32f84d 100644
> > > > > --- a/include/linux/pagemap.h
> > > > > +++ b/include/linux/pagemap.h
> > > > > @@ -586,6 +586,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
> > > > > * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> > > > > * added to the page cache and the VM's LRU list. The folio is
> > > > > * returned locked.
> > > > > + * * %FGP_CREAT_ONLY - Fail if a folio is not present
^
So should be: Fail if a folio is present.
Thanks,
Yilun
> > > > > * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> > > > > * folio is already in cache. If the folio was allocated, unlock it
> > > > > * before returning so the caller can do the same dance.
> > > > > @@ -606,6 +607,7 @@ typedef unsigned int __bitwise fgf_t;
> > > > > #define FGP_NOWAIT ((__force fgf_t)0x00000020)
> > > > > #define FGP_FOR_MMAP ((__force fgf_t)0x00000040)
> > > > > #define FGP_STABLE ((__force fgf_t)0x00000080)
> > > > > +#define FGP_CREAT_ONLY ((__force fgf_t)0x00000100)
> > > > > #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */
> > > > >
> > > > > #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 750e779c23db..d5107bd0cd09 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -1854,6 +1854,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> > > > > folio = NULL;
> > > > > if (!folio)
> > > > > goto no_page;
> > > > > + if (fgp_flags & FGP_CREAT_ONLY) {
> > > > > + folio_put(folio);
> > > > > + return ERR_PTR(-EEXIST);
> > > > > + }
> > > > >
> > > > > if (fgp_flags & FGP_LOCK) {
> > > > > if (fgp_flags & FGP_NOWAIT) {
> > > > > --
> > > > > 2.39.0
> > > > >
> > > > >
> > > >
> >
>
>
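For context, a hedged sketch (not the actual patch) of how a guest_memfd-style
caller could use the new flag. __filemap_get_folio() already returns an
ERR_PTR on failure, so FGP_CREAT_ONLY simply adds -EEXIST to the set of errors
the caller has to expect; the function and variable names below are
illustrative.

#include <linux/pagemap.h>

/* Sketch only: names are illustrative, flags are the ones Paolo describes. */
static struct folio *gmem_grab_uninit_folio(struct address_space *mapping,
                                            pgoff_t index)
{
        /*
         * FGP_CREAT_ONLY | FGP_CREAT: allocate and insert a new folio, but
         * fail with -EEXIST if one is already present, so an already
         * initialized guest page can never be silently overwritten.
         */
        return __filemap_get_folio(mapping, index,
                                   FGP_CREAT | FGP_CREAT_ONLY,
                                   mapping_gfp_mask(mapping));
}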
On Tue, Feb 27, 2024 at 06:20:58PM -0500, Paolo Bonzini wrote:
> In order to be able to redo kvm_gmem_get_uninit_pfn, a hole must be punched
> into the filemap, thus allowing FGP_CREAT_ONLY to succeed again. This will
> be used whenever an operation that follows kvm_gmem_get_uninit_pfn fails.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> include/linux/kvm_host.h | 7 +++++++
> virt/kvm/guest_memfd.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 03bf616b7308..192c58116220 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2436,6 +2436,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
> +int kvm_gmem_undo_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, int order);
> #else
> static inline int kvm_gmem_get_pfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> @@ -2452,6 +2454,11 @@ static inline int kvm_gmem_get_uninit_pfn(struct kvm *kvm,
> KVM_BUG_ON(1, kvm);
> return -EIO;
> }
> +
> +static inline int kvm_gmem_undo_get_pfn(struct kvm *kvm,
> + struct kvm_memory_slot *slot, gfn_t gfn,
> + int order)
> +{}
return -EIO;
otherwise the compiler would complain about the missing return value.
> #endif /* CONFIG_KVM_PRIVATE_MEM */
>
> #ifdef CONFIG_HAVE_KVM_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 7ec7afafc960..535ef1aa34fb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -590,3 +590,31 @@ int kvm_gmem_get_uninit_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> return __kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order, false);
> }
> EXPORT_SYMBOL_GPL(kvm_gmem_get_uninit_pfn);
> +
> +int kvm_gmem_undo_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> + gfn_t gfn, int order)
I didn't see the caller yet, but do we need to ensure the gfn is aligned
to the page order? E.g.:
WARN_ON(gfn & ((1UL << order) - 1));
> +{
> + pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
> + struct kvm_gmem *gmem;
> + struct file *file;
> + int r;
> +
> + file = kvm_gmem_get_file(slot);
> + if (!file)
> + return -EFAULT;
> +
> + gmem = file->private_data;
> +
> + if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
> + r = -EIO;
> + goto out_fput;
> + }
> +
> + r = kvm_gmem_punch_hole(file_inode(file), index << PAGE_SHIFT, PAGE_SHIFT << order);
^
PAGE_SIZE << order
Thanks,
Yilun
> +
> +out_fput:
> + fput(file);
> +
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(kvm_gmem_undo_get_pfn);
> --
> 2.39.0
>
>
>
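Putting the pieces together, a hedged sketch of the intended usage pattern for
this pair of APIs; the SNP launch-update flow mentioned in the cover letter is
the motivating caller, and everything except the two kvm_gmem_* functions
declared in the patch above is illustrative.

#include <linux/kvm_host.h>

/* Sketch only: encrypt_and_measure_page() is a made-up placeholder. */
static int gmem_populate_one_page(struct kvm *kvm, struct kvm_memory_slot *slot,
                                  gfn_t gfn)
{
        kvm_pfn_t pfn;
        int max_order, r;

        /*
         * Grab a brand-new, uninitialized page; this fails with -EEXIST if
         * the page already sits in the guest_memfd filemap.
         */
        r = kvm_gmem_get_uninit_pfn(kvm, slot, gfn, &pfn, &max_order);
        if (r)
                return r;

        r = encrypt_and_measure_page(kvm, gfn, pfn);    /* illustrative */
        if (r)
                /* Punch the single page back out so the ioctl can be retried. */
                kvm_gmem_undo_get_pfn(kvm, slot, gfn, 0);

        return r;
}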
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX uses different ABI to get information about VM exit. Pass intr_info to
> the NMI and INTR handlers instead of pulling it from vcpu_vmx in
> preparation for sharing the bulk of the handlers with TDX.
>
> When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
> exit qualification etc rather than the VMCS fields because VMM doesn't have
> access to the VMCS. The eventual code will be
>
> VMX:
> - get exit reason, intr_info, exit_qualification, and etc from VMCS
> - call NMI/INTR handlers (common code)
>
> TDX:
> - get exit reason, intr_info, exit_qualification, and etc from guest
> registers
> - call NMI/INTR handlers (common code)
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3d8a7e4c8e37..8aedfe0fd78c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7000,24 +7000,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> }
>
> -static void handle_exception_irqoff(struct vcpu_vmx *vmx)
> +static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
> -
> /* if exit due to PF check for async PF */
> if (is_page_fault(intr_info))
> - vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> /* if exit due to NM, handle before interrupts are enabled */
> else if (is_nm_fault(intr_info))
> - handle_nm_fault_irqoff(&vmx->vcpu);
> + handle_nm_fault_irqoff(vcpu);
> /* Handle machine checks before interrupts are enabled */
> else if (is_machine_check(intr_info))
> kvm_machine_check();
> }
>
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> +static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> @@ -7040,9 +7038,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> return;
>
> if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> - handle_external_interrupt_irqoff(vcpu);
> + handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> - handle_exception_irqoff(vmx);
> + handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> }
>
> /*
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> TDP MMU allows tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() to
> return -EBUSY or -EAGAIN error. The caller must check the return value and
> retry. Add __must_check to ensure that it does so.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Binbin Wu <[email protected]>
> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 55b5e3857e98..3627744fcab6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -539,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * no side-effects other than setting iter->old_spte to the last
> * known value of the spte.
> */
> -static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
> - struct tdp_iter *iter,
> - u64 new_spte)
> +static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> + struct tdp_iter *iter,
> + u64 new_spte)
> {
> u64 *sptep = rcu_dereference(iter->sptep);
>
> @@ -571,8 +571,8 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
> return 0;
> }
>
> -static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> - struct tdp_iter *iter)
> +static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> + struct tdp_iter *iter)
> {
> int ret;
>
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
..
> The use of lower_32_bits() moves from kvm_mmu_page_fault() to
> FNAME(page_fault), since walking is independent of the data in the
> upper bits of the error code.
Is it a must? I don't see any issue if full u64 error_code is passed to
FNAME(page_fault) as well.
On Mon, Mar 04, 2024, Xiaoyao Li wrote:
> On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> ...
> > The use of lower_32_bits() moves from kvm_mmu_page_fault() to
> > FNAME(page_fault), since walking is independent of the data in the
> > upper bits of the error code.
>
> Is it a must? I don't see any issue if full u64 error_code is passed to
> FNAME(page_fault) as well.
Heh, my thought as well.
https://lore.kernel.org/all/[email protected]
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX uses different ABI to get information about VM exit. Pass intr_info to
> the NMI and INTR handlers instead of pulling it from vcpu_vmx in
> preparation for sharing the bulk of the handlers with TDX.
>
> When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
> exit qualification etc rather than the VMCS fields because VMM doesn't have
> access to the VMCS. The eventual code will be
>
> VMX:
> - get exit reason, intr_info, exit_qualification, and etc from VMCS
> - call NMI/INTR handlers (common code)
>
> TDX:
> - get exit reason, intr_info, exit_qualification, and etc from guest
> registers
> - call NMI/INTR handlers (common code)
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3d8a7e4c8e37..8aedfe0fd78c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7000,24 +7000,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> }
>
> -static void handle_exception_irqoff(struct vcpu_vmx *vmx)
> +static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
> -
> /* if exit due to PF check for async PF */
> if (is_page_fault(intr_info))
> - vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> /* if exit due to NM, handle before interrupts are enabled */
> else if (is_nm_fault(intr_info))
> - handle_nm_fault_irqoff(&vmx->vcpu);
> + handle_nm_fault_irqoff(vcpu);
> /* Handle machine checks before interrupts are enabled */
> else if (is_machine_check(intr_info))
> kvm_machine_check();
> }
>
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> +static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> @@ -7040,9 +7038,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> return;
>
> if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> - handle_external_interrupt_irqoff(vcpu);
> + handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> - handle_exception_irqoff(vmx);
> + handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> }
>
> /*
On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> To make use of the same value of shadow_mmio_mask and shadow_present_mask
> for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and
> shadow_present_mask so that they can be common for both VMX and TDX.
>
> TDX will require shadow_mmio_mask and shadow_present_mask to include
> VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for
> shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
> spte value is defined so as to cause EPT misconfig.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
> ---
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/kvm/mmu/spte.c | 6 ++++--
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0e73616b82f3..76ed39541a52 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -513,6 +513,7 @@ enum vmcs_field {
> #define VMX_EPT_IPAT_BIT (1ull << 6)
> #define VMX_EPT_ACCESS_BIT (1ull << 8)
> #define VMX_EPT_DIRTY_BIT (1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
> #define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
> VMX_EPT_WRITABLE_MASK | \
> VMX_EPT_EXECUTABLE_MASK)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 4a599130e9c9..02a466de2991 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -429,7 +429,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
> shadow_nx_mask = 0ull;
> shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
> - shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> + /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
> + shadow_present_mask =
> + (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
> /*
> * EPT overrides the host MTRRs, and so KVM must program the desired
> * memtype directly into the SPTEs. Note, this mask is just the mask
> @@ -446,7 +448,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> * of an EPT paging-structure entry is 110b (write/execute).
> */
> kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> - VMX_EPT_RWX_MASK, 0);
> + VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>
>
> +/*
> + * Non-present SPTE value for both VMX and SVM for TDP MMU.
In the previous patch, SHADOW_NONPRESENT_VALUE is also used in the
shadow MMU code, so once SHADOW_NONPRESENT_VALUE is changed to a
non-zero value here, the "for TDP MMU" part no longer holds.
I am wondering whether we can just avoid using SHADOW_NONPRESENT_VALUE
in the shadow MMU code in the previous patch, and state explicitly that
a non-zero value for non-present SPTEs is only supported with the TDP MMU?
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
> + * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
> + * For TDX:
> + * TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
> #define SHADOW_NONPRESENT_VALUE 0ULL
> +#endif
>
> extern u64 __read_mostly shadow_host_writable_mask;
> extern u64 __read_mostly shadow_mmu_writable_mask;
> @@ -196,7 +209,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> *
> * Only used by the TDP MMU.
> */
> -#define REMOVED_SPTE 0x5a0ULL
> +#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
I kinda prefer moving this chunk to the previous patch, because the
reason to have SHADOW_NONPRESENT_VALUE is to have a non-zero value for
non-present SPTEs, which include the REMOVED_SPTE.
But just my 2 cents.
On 2/28/2024 7:21 AM, Paolo Bonzini wrote:
> From: Michael Roth<[email protected]>
>
> In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
> 2MB mapping in the guest's nested page table depends on whether or not
> any subpages within the range have already been initialized as private
> in the RMP table. The existing mixed-attribute tracking in KVM is
> insufficient here, for instance:
>
> - gmem allocates 2MB page
> - guest issues PVALIDATE on 2MB page
> - guest later converts a subpage to shared
> - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> - KVM MMU splits NPT mapping to 4K
Is there a sentence missing here, i.e. "guest converts the shared subpage
back to private"?
Otherwise, it conflicts with the following statement that "there are no
mixed attributes".
> At this point there are no mixed attributes, and KVM would normally
> allow for 2MB NPT mappings again, but this is actually not allowed
> because the RMP table mappings are 4K and cannot be promoted on the
> hypervisor side, so the NPT mappings must still be limited to 4K to
> match this.
>
> Add a hook to determine the max NPT mapping size in situations like
> this.
>
> Signed-off-by: Michael Roth<[email protected]>
> Message-Id:<[email protected]>
> Signed-off-by: Paolo Bonzini<[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 7 +++++++
> 3 files changed, 9 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 42474acb7375..436e3c157fae 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -140,6 +140,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
> #undef KVM_X86_OP
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e523b204697d..259e6bb1e447 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1806,6 +1806,7 @@ struct kvm_x86_ops {
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> + int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6b4cb71668df..bcf12ac489f9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4339,6 +4339,13 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> fault->max_level);
> fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>
> + r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
> + fault->gfn, &fault->max_level);
> + if (r) {
> + kvm_release_pfn_clean(fault->pfn);
> + return r;
> + }
> +
> return RET_PF_CONTINUE;
> }
>
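For illustration, a hedged sketch of what an SNP-style backend for the new
hook could look like. rmp_mapping_is_4k() is a made-up placeholder for
whatever RMP lookup the real SNP code does; the only point is clamping
*max_level when the RMP entry cannot be promoted.

#include <linux/kvm_host.h>

/* Sketch only: rmp_mapping_is_4k() is a placeholder, not a real API. */
static int example_gmem_validate_fault(struct kvm *kvm, kvm_pfn_t pfn,
                                       gfn_t gfn, u8 *max_level)
{
        if (*max_level > PG_LEVEL_4K && rmp_mapping_is_4k(pfn))
                *max_level = PG_LEVEL_4K;

        return 0;
}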
On 28/02/2024 12:20 pm, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
> suppress #VE bit in EPT entries selectively, in order to be able to trap
> non-present conditions. However, #VE isn't used for VMX and it's a bug
> if it happens. To be defensive and test that VMX case isn't broken
> introduce an option ept_violation_ve_test and when it's set, BUG the vm.
I am wondering from HW's point of view, is it OK for the kernel to
explicitly send #VE IPI, in which case, IIUC, the guest can legally get
the #VE w/o being a TDX guest?
On 28/02/2024 12:20 pm, Paolo Bonzini wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX uses different ABI to get information about VM exit. Pass intr_info to
> the NMI and INTR handlers instead of pulling it from vcpu_vmx in
> preparation for sharing the bulk of the handlers with TDX.
>
> When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
> exit qualification etc rather than the VMCS fields because VMM doesn't have
> access to the VMCS.
IMHO this can be simpler:
TDX conveys VM exit information via GPRs, while normal VMX does so via
VMCS fields.
The eventual code will be
>
> VMX:
> - get exit reason, intr_info, exit_qualification, and etc from VMCS
> - call NMI/INTR handlers (common code)
>
> TDX:
> - get exit reason, intr_info, exit_qualification, and etc from guest
> registers
> - call NMI/INTR handlers (common code)
It would be nicer to mention why handle_exception_irqoff()'s first
argument changes from @vmx to @vcpu.
Anyway, it doesn't matter ...
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
.. Acked-by: Kai Huang <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3d8a7e4c8e37..8aedfe0fd78c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7000,24 +7000,22 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> }
>
> -static void handle_exception_irqoff(struct vcpu_vmx *vmx)
> +static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
> -
> /* if exit due to PF check for async PF */
> if (is_page_fault(intr_info))
> - vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> + vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> /* if exit due to NM, handle before interrupts are enabled */
> else if (is_nm_fault(intr_info))
> - handle_nm_fault_irqoff(&vmx->vcpu);
> + handle_nm_fault_irqoff(vcpu);
> /* Handle machine checks before interrupts are enabled */
> else if (is_machine_check(intr_info))
> kvm_machine_check();
> }
>
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> +static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> + u32 intr_info)
> {
> - u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> @@ -7040,9 +7038,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> return;
>
> if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> - handle_external_interrupt_irqoff(vcpu);
> + handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
> else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> - handle_exception_irqoff(vmx);
> + handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
> }
>
> /*
On 2/28/2024 7:21 AM, Paolo Bonzini wrote:
> From: Michael Roth <[email protected]>
>
> In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
> 2MB mapping in the guest's nested page table depends on whether or not
> any subpages within the range have already been initialized as private
> in the RMP table. The existing mixed-attribute tracking in KVM is
> insufficient here, for instance:
>
> - gmem allocates 2MB page
> - guest issues PVALIDATE on 2MB page
> - guest later converts a subpage to shared
> - SNP host code issues PSMASH to split 2MB RMP mapping to 4K
> - KVM MMU splits NPT mapping to 4K
>
> At this point there are no mixed attributes, and KVM would normally
> allow for 2MB NPT mappings again, but this is actually not allowed
> because the RMP table mappings are 4K and cannot be promoted on the
> hypervisor side, so the NPT mappings must still be limited to 4K to
> match this.
>
> Add a hook to determine the max NPT mapping size in situations like
> this.
>
> Signed-off-by: Michael Roth <[email protected]>
> Message-Id: <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 7 +++++++
> 3 files changed, 9 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 42474acb7375..436e3c157fae 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -140,6 +140,7 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL_RET0(gmem_validate_fault)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
>
> #undef KVM_X86_OP
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e523b204697d..259e6bb1e447 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1806,6 +1806,7 @@ struct kvm_x86_ops {
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> + int (*gmem_validate_fault)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, u8 *max_level);
Since it's named "gmem_validate_fault", can we just pass in the "fault"
as an argument to avoid passing in pfn, gfn and max_level individually?
I noticed that in Isaku's TDX patch set, fault->private would also need
to be passed in.
> };
>
> struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6b4cb71668df..bcf12ac489f9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4339,6 +4339,13 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> fault->max_level);
> fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
>
> + r = static_call(kvm_x86_gmem_validate_fault)(vcpu->kvm, fault->pfn,
> + fault->gfn, &fault->max_level);
> + if (r) {
> + kvm_release_pfn_clean(fault->pfn);
> + return r;
> + }
> +
> return RET_PF_CONTINUE;
> }
>
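A hedged sketch of the alternative suggested above, i.e. handing the whole
fault descriptor to the hook instead of individual fields (whether struct
kvm_page_fault, which today is MMU-internal, can reasonably be exposed to
vendor code is a separate question):

        /* Suggested shape: hand the whole fault descriptor to the hook. */
        int (*gmem_validate_fault)(struct kvm *kvm, struct kvm_page_fault *fault);

A backend could then read the pfn, gfn, max_level and the private flag from
one place instead of growing the argument list.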
On Tue, Mar 12, 2024, Kai Huang wrote:
> On 28/02/2024 12:20 pm, Paolo Bonzini wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
> > suppress #VE bit in EPT entries selectively, in order to be able to trap
> > non-present conditions. However, #VE isn't used for VMX and it's a bug
> > if it happens. To be defensive and test that VMX case isn't broken
> > introduce an option ept_violation_ve_test and when it's set, BUG the vm.
>
> I am wondering from HW's point of view, is it OK for the kernel to
> explicitly send #VE IPI, in which case, IIUC, the guest can legally get the
> #VE w/o being a TDX guest?
Ooh, fun. Short answer: there's nothing to worry about here.
Legally, no. Vectors 0-31 are reserved. However, I do _think_ the guest could
technically send IPIs on vectors 16-31, as the local APIC doesn't outright reject
such vectors. But such software would be in clear violation of the SDM.
11.5.2 Valid Interrupt Vectors
The Intel 64 and IA-32 architectures define 256 vector numbers, ranging from
0 through 255 (see Section 6.2, “Exception and Interrupt Vectors”). Local and
I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid
interrupts.
When an interrupt vector in the range of 0 to 15 is sent or received through
the local APIC, the APIC indicates an illegal vector in its Error Status
Register (see Section 11.5.3, “Error Handling”). The Intel 64 and IA-32
architectures reserve vectors 16 through 31 for predefined interrupts,
exceptions, and Intel-reserved encodings (see Table 6-1). However, the local
APIC does not treat vectors in this range as illegal.
When an illegal vector value (0 to 15) is written to an LVT entry and the delivery
mode is Fixed (bits 8-11 equal 0), the APIC may signal an illegal vector error,
without regard to whether the mask bit is set or whether an interrupt is actually
seen on the input.
where Table 6-1 defines the various exceptions, including #VE, and for vectors
22-31 says "Intel reserved. Do not use." Vectors 32-255 are explicitly described
as "User Defined (Non-reserved) Interrupts" that can be generated via "External
interrupt or INT n instruction."
However, INTn is far more interesting than IPIs, as INTn can definitely generate
interrupts for vectors 0-31, and the legality of software generating such interrupts
is questionable. E.g. KVM used to "forward" NMI VM-Exits to the kernel by doing
INTn with vector 2.
Key word "interrupts"! IPIs are hardware interrupts, and INTn generates software
interrupts, neither of which are subject to exception bitmap interception:
Exceptions (faults, traps, and aborts) cause VM exits based on the exception
bitmap (see Section 25.6.3). If an exception occurs, its vector (in the range
0–31) is used to select a bit in the exception bitmap. If the bit is 1, a VM
exit occurs; if the bit is 0, the exception is delivered normally through the
guest IDT. This use of the exception bitmap applies also to exceptions generated
by the instructions INT1, INT3, INTO, BOUND, UD0, UD1, and UD2.
with a footnote that further says:
INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively,
and not to INT n with value 1 or 3 for n.
So while a misbehaving guest could generate a software interrupt on vector 20,
it would not be a true #VE, i.e. not an exception, and thus would not generate
an EXCEPTION_NMI VM-Exit. I.e. the KVM_BUG_ON() can't be triggered by the guest
(assuming hardware isn't broken).
On Tue, 2024-03-12 at 09:54 -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kai Huang wrote:
> > On 28/02/2024 12:20 pm, Paolo Bonzini wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
> > > suppress #VE bit in EPT entries selectively, in order to be able to trap
> > > non-present conditions. However, #VE isn't used for VMX and it's a bug
> > > if it happens. To be defensive and test that VMX case isn't broken
> > > introduce an option ept_violation_ve_test and when it's set, BUG the vm.
> >
> > I am wondering from HW's point of view, is it OK for the kernel to
> > explicitly send #VE IPI, in which case, IIUC, the guest can legally get the
> > #VE w/o being a TDX guest?
>
> Ooh, fun. Short answer: there's nothing to worry about here.
>
> Legally, no. Vectors 0-31 are reserved. However, I do _think_ the guest could
> technically send IPIs on vectors 16-31, as the local APIC doesn't outright reject
> such vectors. But such software would be in clear violation of the SDM.
>
> 11.5.2 Valid Interrupt Vectors
>
> The Intel 64 and IA-32 architectures define 256 vector numbers, ranging from
> 0 through 255 (see Section 6.2, “Exception and Interrupt Vectors”). Local and
> I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid
> interrupts.
>
> When an interrupt vector in the range of 0 to 15 is sent or received through
> the local APIC, the APIC indicates an illegal vector in its Error Status
> Register (see Section 11.5.3, “Error Handling”). The Intel 64 and IA-32
> architectures reserve vectors 16 through 31 for predefined interrupts,
> exceptions, and Intel-reserved encodings (see Table 6-1). However, the local
> APIC does not treat vectors in this range as illegal.
>
> When an illegal vector value (0 to 15) is written to an LVT entry and the delivery
> mode is Fixed (bits 8-11 equal 0), the APIC may signal an illegal vector error,
> without regard to whether the mask bit is set or whether an interrupt is actually
> seen on the input.
I hate the "may" here :-)
>
> where Table 6-1 defines the various exceptions, including #VE, and for vectors
> 22-31 says "Intel reserved. Do not use." Vectors 32-255 are explicitly described
> as "User Defined (Non-reserved) Interrupts" that can be generated via "External
> interrupt or INT n instruction."
>
> However, INTn is far more interesting than IPIs, as INTn can definitely generate
> interrupts for vectors 0-31, and the legality of software generating such interrupts
> is questionable. E.g. KVM used to "forward" NMI VM-Exits to the kernel by doing
> INTn with vector 2.
>
> Key word "interrupts"! IPIs are hardware interrupts, and INTn generates software
> interrupts, neither of which are subject to exception bitmap interception:
>
> Exceptions (faults, traps, and aborts) cause VM exits based on the exception
> bitmap (see Section 25.6.3). If an exception occurs, its vector (in the range
> 0–31) is used to select a bit in the exception bitmap. If the bit is 1, a VM
> exit occurs; if the bit is 0, the exception is delivered normally through the
> guest IDT. This use of the exception bitmap applies also to exceptions generated
> by the instructions INT1, INT3, INTO, BOUND, UD0, UD1, and UD2.
>
> with a footnote that further says:
>
> INT1 and INT3 refer to the instructions with opcodes F1 and CC, respectively,
> and not to INT n with value 1 or 3 for n.
>
> So while a misbehaving guest could generate a software interrupt on vector 20,
> it would not be a true #VE, i.e. not an exception, and thus would not generate
> an EXCEPTION_NMI VM-Exit. I.e. the KVM_BUG_ON() can't be triggered by the guest
> (assuming hardware isn't broken).
>
Ah, right, software-interrupts but not exceptions.
Thanks for the full explanation!
On Tue, 2024-02-27 at 18:20 -0500, Paolo Bonzini wrote:
> From: Isaku Yamahata <[email protected]>
>
> Refactor tdp_mmu_alloc_sp() and tdp_mmu_init_sp and eliminate
> tdp_mmu_init_child_sp(). Currently tdp_mmu_init_sp() (or
> tdp_mmu_init_child_sp()) sets kvm_mmu_page.role after tdp_mmu_alloc_sp()
> allocating struct kvm_mmu_page and its page table page. This patch makes
> tdp_mmu_alloc_sp() initialize kvm_mmu_page.role instead of
> tdp_mmu_init_sp().
>
> To handle private page tables, argument of is_private needs to be passed
> down. Given that already page level is passed down, it would be cumbersome
> to add one more parameter about sp. Instead replace the level argument with
> union kvm_mmu_page_role. Thus the number of argument won't be increased
> and more info about sp can be passed down.
>
> For private sp, secure page table will be also allocated in addition to
> struct kvm_mmu_page and page table (spt member). The allocation functions
> (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to know if the
> allocation is for the conventional page table or private page table. Pass
> union kvm_mmu_role to those functions and initialize role member of struct
> kvm_mmu_page.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> Message-Id: <d69acdd7f0b0b104f330a6d42ac28f9a9b1b5850.1705965635.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <[email protected]>
We were discussing on v19 of the TDX series whether we could drop this patch and end up with simpler
code in later patches:
https://lore.kernel.org/lkml/[email protected]/
TDX can manage in either case, so the patch might not be needed for TDX. Does it have any benefit for SNP?
On Mon, Mar 4, 2024 at 9:57 AM Xiaoyao Li <[email protected]> wrote:
>
> On 2/28/2024 7:20 AM, Paolo Bonzini wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> ...
> > The use of lower_32_bits() moves from kvm_mmu_page_fault() to
> > FNAME(page_fault), since walking is independent of the data in the
> > upper bits of the error code.
>
> Is it a must? I don't see any issue if full u64 error_code is passed to
> FNAME(page_fault) as well.
The full error code *is* passed to kvm_mmu_do_page_fault() and
FNAME(page_fault), it's only dropped when passed to FNAME(walk_addr).
Paolo
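For context, a hedged sketch of the movement the commit message describes (the
exact hunk may differ): kvm_mmu_page_fault() stops truncating the error code,
and the shadow-paging path truncates only where the architectural 32-bit error
code is actually consumed, i.e. when walking the guest page tables.

        /* In FNAME(page_fault), sketch of the walker call after the change: */
        r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
                             lower_32_bits(fault->error_code));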