2022-08-07 22:38:11

by Isaku Yamahata

Subject: [RFC PATCH 00/13] KVM TDX: TDP MMU: large page support

From: Isaku Yamahata <[email protected]>

This patch series is based on "v8 KVM TDX: basic feature support". It
implements large page support for the TDP MMU by allowing large pages to be
populated and splitting them when necessary. Merging 4K/2M pages into 2M/1G
pages is not supported.

Feedback on the options for merging sub-pages into a large page is welcome.

Options for merging sub-pages into a large page
===============================================
A) Actively merge pages into a large page when the NX page recovery daemon scans them.
+ implementation would be simple
- inefficient because the implementation always scans sub-pages.
- inefficient because it may merge pages that are not being used.
B) On normal EPT violation, check whether pages can be merged into a large page
after mapping it.
+ scanning part isn't needed.
- adds more logic to a fast path, which is inefficient
C) Use TDH.MEM.RANGE.BLOCK instead of zapping EPT entry. And record that the
entry is blocked. On EPT violation, check if the entry is blocked or not.
If the EPT violation is caused by a blocked Secure-EPT entry, trigger the
page merge logic.
+ reuse scanning logic (NX recovery daemon)
+ take advantage of EPT violation
- would be complex: block instead of zap, track the blocked Secure-EPT entry,
unblock it on EPT violation, and then run the page merge logic.


The current implementation (splitting large pages when necessary)
=================================================================
* KVM already tracks whether a GFN is private or shared. When that changes,
lpage_info is updated to disallow a large page.
* TDX provides the page level on a Secure-EPT violation. Pass the page level
down to the lower-level functions that need it.
* Even if the page is a large page on the host, only some sub-pages may be
mapped at the EPT level. In such cases, give up mapping a large page and
step down to the sub-page level, unlike the conventional EPT.
* When zapping an SPTE that maps a large page, split it and then zap it,
unlike the conventional EPT, because otherwise the protected page contents
would be lost.
* Merging pages back into a large page is not implemented.


Discussion for merging pages into large page
============================================
Live migration support for TDX is planned. That means dirty page logging will
be supported, and a large page will be split when dirty page logging is
enabled. After it is disabled, the pages should be merged back into large
pages for performance.

The current implementation for the conventional EPT is
* THP or NX page recovery zaps EPT entries. This step doesn't directly map a
large page.
* On the next EPT violation, when a large page is possible, map it as a large
page.

This is because
* Mapping large pages only on EPT violation avoids unnecessary page merging
for cold SPTEs. This is also desirable for TDX, to avoid unnecessary
Secure-EPT operations.
* It reuses the KVM page fault path.
For TDX, new logic is needed to merge sub-pages into a large page.

TDX operation
-------------
* EPT violation trick
That trick (zapping the EPT entry to trigger an EPT violation) doesn't work
for TDX. For TDX, zapping a page loses the contents of the protected page
because the protected guest page is dis-associated from the guest TD.
Instead, TDX provides a different way to trigger an EPT violation without
losing the page contents, so that the VMM can detect guest TD activity:
blocking/unblocking the Secure-EPT entry with TDH.MEM.RANGE.BLOCK and
TDH.MEM.RANGE.UNBLOCK. They correspond to clearing/setting the present bit in
an EPT entry while the page contents are kept. With TDH.MEM.RANGE.BLOCK and a
TLB shoot down, the VMM can cause the guest TD to trigger an EPT violation.
After that, the VMM can unblock the entry with TDH.MEM.RANGE.UNBLOCK and
resume guest TD execution. The procedure is as follows (a rough sketch
follows the list).

- Block Secure-EPT entry by TDH.MEM.RANGE.BLOCK.
- TLB shoot down.
- Wait for guest TD to trigger EPT violation.
- Unblock Secure-EPT entry by TDH.MEM.RANGE.UNBLOCK to resume the guest TD.
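
A rough, untested sketch of that sequence. tdh_mem_range_block() and the other
helpers used here exist in this series or in KVM, while tdh_mem_range_unblock()
is an assumed wrapper for TDH.MEM.RANGE.UNBLOCK with a guessed signature.
Locking (seamcall_lock) and the bookkeeping of blocked entries are omitted.

/* Sketch only: not part of the patches below. */
static int tdx_block_private_range(struct kvm *kvm, gfn_t gfn,
				   enum pg_level level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct tdx_module_output out;
	u64 err;

	/* Block the Secure-EPT entry; the page contents are preserved. */
	err = tdh_mem_range_block(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
				  tdx_level, &out);
	if (KVM_BUG_ON(err, kvm))
		return -EIO;

	/* TLB shoot down so the guest TD faults on its next access. */
	kvm_flush_remote_tlbs(kvm);
	return 0;
}

static int tdx_unblock_private_range(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	struct tdx_module_output out;
	u64 err;

	/* Assumed wrapper for TDH.MEM.RANGE.UNBLOCK (not in this series). */
	err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
				    pg_level_to_tdx_sept_level(level), &out);
	if (KVM_BUG_ON(err, kvm))
		return -EIO;
	return 0;
}

The EPT violation handler would run the page merge logic, call
tdx_unblock_private_range(), and then resume the guest TD.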

* Merging sub-pages into a large page
The following steps are needed (a rough sketch follows).
- Ensure that all sub-pages are mapped.
- TLB shoot down.
- Merge the sub-pages into a large page (TDH.MEM.PAGE.PROMOTE).
This requires that all sub-pages are mapped.
- Flush the cache of the Secure-EPT page that was used to map the sub-pages.
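
A rough, untested sketch. tdh_mem_page_promote() is an assumed wrapper for
TDH.MEM.PAGE.PROMOTE (its signature is guessed from tdh_mem_page_demote() in
this series), and the "all sub-pages mapped" check is left abstract.

/* Sketch only: not part of the patches below. */
static int tdx_sept_merge_private_spte(struct kvm *kvm, gfn_t gfn,
				       enum pg_level level)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct tdx_module_output out;
	u64 err;

	/* Caller must ensure that all sub-pages are already mapped. */

	/* TLB shoot down before merging. */
	kvm_flush_remote_tlbs(kvm);

	/* Merge the sub-pages into one large mapping (assumed wrapper). */
	err = tdh_mem_page_promote(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
				   tdx_level, &out);
	if (KVM_BUG_ON(err, kvm))
		return -EIO;

	/*
	 * The Secure-EPT page that mapped the sub-pages is now unused;
	 * flush its cache lines (e.g. with tdx_clflush_page()) before it
	 * is reclaimed.
	 */
	return 0;
}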


Thanks,

Chao Peng (1):
KVM: Update lpage info when private/shared memory are mixed

Xiaoyao Li (12):
KVM: TDP_MMU: Go to next level if smaller private mapping exists
KVM: TDX: Pass page level to cache flush before TDX SEAMCALL
KVM: TDX: Pass KVM page level to tdh_mem_page_add() and
tdh_mem_page_aug()
KVM: TDX: Pass size to tdx_measure_page()
KVM: TDX: Pass size to reclaim_page()
KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large
page
KVM: MMU: Introduce level info in PFERR code
KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs
KVM: MMU: Pass desired page level in err code for page fault handler
KVM: TDP_MMU: Split the large page when zap leaf
KVM: TDX: Split a large page when 4KB page within it converted to
shared
KVM: x86: remove struct kvm_arch.tdp_max_page_level

arch/x86/include/asm/kvm_host.h | 14 ++-
arch/x86/kvm/mmu/mmu.c | 158 ++++++++++++++++++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 31 +++++-
arch/x86/kvm/vmx/common.h | 6 +-
arch/x86/kvm/vmx/tdx.c | 174 +++++++++++++++++++++-----------
arch/x86/kvm/vmx/tdx_arch.h | 20 ++++
arch/x86/kvm/vmx/tdx_ops.h | 46 ++++++---
arch/x86/kvm/vmx/vmx.c | 2 +-
include/linux/kvm_host.h | 10 ++
virt/kvm/kvm_main.c | 9 +-
11 files changed, 390 insertions(+), 84 deletions(-)

--
2.25.1


2022-08-07 22:41:48

by Isaku Yamahata

Subject: [RFC PATCH 11/13] KVM: TDP_MMU: Split the large page when zap leaf

From: Xiaoyao Li <[email protected]>

When TDX is enabled, a large page cannot be zapped if it contains mixed
private and shared pages. In this case, the large page has to be split.

Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index faf278e0c740..e5d31242677a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1033,6 +1033,14 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
return true;
}

+
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
+ struct tdp_iter *iter,
+ bool shared);
+
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+ struct kvm_mmu_page *sp, bool shared);
+
/*
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
@@ -1075,6 +1083,24 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

+ if (kvm_gfn_shared_mask(kvm) && is_large_pte(iter.old_spte)) {
+ gfn_t gfn = iter.gfn & ~kvm_gfn_shared_mask(kvm);
+ gfn_t mask = KVM_PAGES_PER_HPAGE(iter.level) - 1;
+ struct kvm_memory_slot *slot;
+ struct kvm_mmu_page *sp;
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (kvm_mem_attr_is_mixed(slot, gfn, iter.level) ||
+ (gfn & mask) < start ||
+ end < (gfn & mask) + KVM_PAGES_PER_HPAGE(iter.level)) {
+ sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, false);
+ WARN_ON(!sp);
+
+ tdp_mmu_split_huge_page(kvm, &iter, sp, false);
+ continue;
+ }
+ }
+
tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}
@@ -1642,8 +1668,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,

WARN_ON(kvm_mmu_page_role_is_private(role) !=
is_private_sptep(iter->sptep));
- /* TODO: Large page isn't supported for private SPTE yet. */
- WARN_ON(kvm_mmu_page_role_is_private(role));

/*
* Since we are allocating while under the MMU lock we have to be
--
2.25.1

2022-08-07 22:41:48

by Isaku Yamahata

Subject: [RFC PATCH 01/13] KVM: Update lpage info when private/shared memory are mixed

From: Chao Peng <[email protected]>

Update lpage_info when the private/shared memory attribute is changed. If
both private and shared pages fall within a large page region, it can't be
mapped as a large page. Reserve a bit in disallow_lpage to indicate that a
large page has private/shared pages mixed.
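
For illustration only, the packing can be read like this; these two helpers
are not part of the patch:

/* Illustration only: not part of this patch. */
static inline bool lpage_private_shared_mixed(struct kvm_lpage_info *linfo)
{
	/* Bit 31 flags a private/shared mix at this level. */
	return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
}

static inline int lpage_disallow_count(struct kvm_lpage_info *linfo)
{
	/* The low 31 bits keep the existing disallow reference count. */
	return linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
}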

Signed-off-by: Chao Peng <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++
arch/x86/kvm/mmu/mmu.c | 152 +++++++++++++++++++++++++++++++-
arch/x86/kvm/mmu/mmu_internal.h | 2 +
include/linux/kvm_host.h | 10 +++
virt/kvm/kvm_main.c | 9 +-
5 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d68130be5bf7..2bdb1de9bce0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
#include <asm/hyperv-tlfs.h>

#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
#define __KVM_HAVE_ZAP_GFN_RANGE

#define KVM_MAX_VCPUS 1024
@@ -981,6 +982,13 @@ struct kvm_vcpu_arch {
#endif
};

+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits will be used as a reference count for other users.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
+#define KVM_LPAGE_COUNT_MAX ((1U << 31) - 1)
+
struct kvm_lpage_info {
int disallow_lpage;
};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c61fb6848d0d..a03aa609a0da 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -818,11 +818,16 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
{
struct kvm_lpage_info *linfo;
int i;
+ int disallow_count;

for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+ disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+ WARN_ON(disallow_count + count < 0 ||
+ disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
- WARN_ON(linfo->disallow_lpage < 0);
}
}

@@ -7236,3 +7241,148 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_lpage_recovery_thread)
kthread_stop(kvm->arch.nx_lpage_recovery_thread);
}
+
+bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+ gfn_t pages = KVM_PAGES_PER_HPAGE(level);
+ gfn_t mask = ~(pages - 1);
+ struct kvm_lpage_info *linfo = lpage_info_slot(gfn & mask, slot, level);
+
+ WARN_ON(level == PG_LEVEL_4K);
+ return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+ if (mixed)
+ linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+ else
+ linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool __mem_attr_is_mixed(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+ XA_STATE(xas, &kvm->mem_attr_array, start);
+ bool mixed = false;
+ gfn_t gfn = start;
+ void *s_entry;
+ void *entry;
+
+ rcu_read_lock();
+ s_entry = xas_load(&xas);
+ while (gfn < end) {
+ if (xas_retry(&xas, entry))
+ continue;
+
+ KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+ entry = xas_next(&xas);
+ if (entry != s_entry) {
+ mixed = true;
+ break;
+ }
+ gfn++;
+ }
+ rcu_read_unlock();
+ return mixed;
+}
+
+static bool mem_attr_is_mixed(struct kvm *kvm,
+ struct kvm_memory_slot *slot, int level,
+ gfn_t start, gfn_t end)
+{
+ struct kvm_lpage_info *child_linfo;
+ unsigned long child_pages;
+ bool mixed = false;
+ unsigned long gfn;
+ void *entry;
+
+ if (WARN_ON(level == PG_LEVEL_4K))
+ return false;
+
+ if (level == PG_LEVEL_2M)
+ return __mem_attr_is_mixed(kvm, start, end);
+
+ /* This assumes that level - 1 is already updated. */
+ rcu_read_lock();
+ child_pages = KVM_PAGES_PER_HPAGE(level - 1);
+ entry = xa_load(&kvm->mem_attr_array, start);
+ for (gfn = start; gfn < end; gfn += child_pages) {
+ child_linfo = lpage_info_slot(gfn, slot, level - 1);
+ if (child_linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED) {
+ mixed = true;
+ break;
+ }
+ if (xa_load(&kvm->mem_attr_array, gfn) != entry) {
+ mixed = true;
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return mixed;
+}
+
+static void update_mem_lpage_info(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ unsigned long lpage_start, lpage_end;
+ unsigned long gfn, pages, mask;
+ int level;
+
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ pages = KVM_PAGES_PER_HPAGE(level);
+ mask = ~(pages - 1);
+ lpage_start = start & mask;
+ lpage_end = (end - 1) & mask;
+
+ /*
+ * We only need to scan the head and tail page, for middle pages
+ * we know they are not mixed.
+ */
+ update_mixed(lpage_info_slot(lpage_start, slot, level),
+ mem_attr_is_mixed(kvm, slot, level,
+ lpage_start, lpage_start + pages));
+
+ if (lpage_start == lpage_end)
+ return;
+
+ for (gfn = lpage_start + pages; gfn < lpage_end; gfn += pages) {
+ update_mixed(lpage_info_slot(gfn, slot, level), false);
+ }
+
+ update_mixed(lpage_info_slot(lpage_end, slot, level),
+ mem_attr_is_mixed(kvm, slot, level,
+ lpage_end, lpage_end + pages));
+ }
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ int idx;
+ int i;
+
+ WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
+ "Unsupported mem attribute.\n");
+
+ idx = srcu_read_lock(&kvm->srcu);
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+ slot = iter.slot;
+ start = max(start, slot->base_gfn);
+ end = min(end, slot->base_gfn + slot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
+
+ update_mem_lpage_info(kvm, slot, attr, start, end);
+ }
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 4b581209b3b9..e5d5fea29bfa 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -259,6 +259,8 @@ static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
}
#endif

+bool kvm_mem_attr_is_mixed(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3c29e0eb754c..7e3d582cc1ba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2295,6 +2295,16 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536

+#ifdef __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
+void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end);
+#else
+static inline void kvm_arch_update_mem_attr(struct kvm *kvm, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+}
+#endif /* __KVM_HAVE_ARCH_UPDATE_MEM_ATTR */
+
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot,
gfn_t gfn, kvm_pfn_t *pfn, int *order)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2ec940354749..9f9b2c0e7afc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -943,6 +943,7 @@ EXPORT_SYMBOL_GPL(kvm_vm_reserve_mem_attr);
int kvm_vm_set_mem_attr(struct kvm *kvm, int attr, gfn_t start, gfn_t end)
{
void *entry;
+ int r;

/* By default, the entry is private. */
switch (attr) {
@@ -958,8 +959,12 @@ int kvm_vm_set_mem_attr(struct kvm *kvm, int attr, gfn_t start, gfn_t end)
}

WARN_ON(start >= end);
- return xa_err(xa_store_range(&kvm->mem_attr_array, start, end - 1,
- entry, GFP_KERNEL_ACCOUNT));
+ r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end - 1,
+ entry, GFP_KERNEL_ACCOUNT));
+ if (r)
+ return r;
+ kvm_arch_update_mem_attr(kvm, attr, start, end);
+ return 0;
}
EXPORT_SYMBOL_GPL(kvm_vm_set_mem_attr);
#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM_ATTR */
--
2.25.1

2022-08-07 22:43:25

by Isaku Yamahata

Subject: [RFC PATCH 07/13] KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large page

From: Xiaoyao Li <[email protected]>

Allow large page level AUG and REMOVE for TDX pages.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0b9f9075e1ea..cdd421fb5024 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1458,20 +1458,18 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
struct tdx_module_output out;
hpa_t source_pa;
u64 err;
+ int i;

if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) ||
!kvm_pfn_to_refcounted_page(pfn)))
return;

/* To prevent page migration, do nothing on mmu notifier. */
- get_page(pfn_to_page(pfn));
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
+ get_page(pfn_to_page(pfn + i));

/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
if (likely(is_td_finalized(kvm_tdx))) {
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return;
-
err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
@@ -1530,38 +1528,40 @@ static void tdx_sept_drop_private_spte(
hpa_t hpa_with_hkid;
struct tdx_module_output out;
u64 err = 0;
+ int i;

- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return;
-
- spin_lock(&kvm_tdx->seamcall_lock);
if (is_hkid_assigned(kvm_tdx)) {
+ spin_lock(&kvm_tdx->seamcall_lock);
err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+ spin_unlock(&kvm_tdx->seamcall_lock);
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
- goto unlock;
+ return;
}

- hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
- err = tdh_phymem_page_wbinvd(hpa_with_hkid);
- if (WARN_ON_ONCE(err)) {
- pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
- goto unlock;
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
+ hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+ spin_lock(&kvm_tdx->seamcall_lock);
+ err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+ if (WARN_ON_ONCE(err))
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ else
+ tdx_unpin(kvm, gfn + i, pfn + i);
+ hpa += PAGE_SIZE;
}
- } else
+ } else {
/*
* The HKID assigned to this TD was already freed and cache
* was already flushed. We don't have to flush again.
*/
+ spin_lock(&kvm_tdx->seamcall_lock);
err = tdx_reclaim_page((unsigned long)__va(hpa), hpa, level,
false, 0);
-
-unlock:
- spin_unlock(&kvm_tdx->seamcall_lock);
-
- if (!err)
- tdx_unpin_pfn(kvm, pfn);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+ if (!err)
+ tdx_unpin(kvm, gfn, pfn);
+ }
}

static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
--
2.25.1

2022-08-07 22:44:18

by Isaku Yamahata

Subject: [RFC PATCH 08/13] KVM: MMU: Introduce level info in PFERR code

From: Xiaoyao Li <[email protected]>

For TDX, an EPT violation can happen on TDG.MEM.PAGE.ACCEPT, and
TDG.MEM.PAGE.ACCEPT carries the page level at which the TD guest wants to
accept the page.

1. KVM may map the page as 4KB while the TD guest wants to accept a 2MB page.

The TD guest will get TDX_PAGE_SIZE_MISMATCH and should retry the accept
at 4KB size.

2. KVM may map the page as 2MB while the TD guest wants to accept a 4KB page.

KVM needs to honor the request because
a) there is no way to tell the guest that KVM maps it at 2MB size, and
b) the guest accepts it at 4KB size because it knows some other 4KB page
in the same 2MB range will be used as a shared page.

For case 2, the desired page level needs to be passed to the MMU's page
fault handler. Use bits 29:31 of the KVM PF error code for this purpose.
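
For illustration only, the bits 29:31 encoding works like this; these helpers
are not part of the patch:

/* Illustration only: not part of this patch. */
static inline u64 pferr_set_level(u64 error_code, u8 level)
{
	/* level is a PG_LEVEL_* value (1 = 4K, 2 = 2M, 3 = 1G). */
	return error_code |
	       (((u64)level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK);
}

static inline u8 pferr_get_level(u64 error_code)
{
	return (error_code & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT;
}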

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 5 +++++
arch/x86/kvm/vmx/common.h | 6 +++++-
arch/x86/kvm/vmx/tdx.c | 15 ++++++++++++++-
arch/x86/kvm/vmx/tdx.h | 19 +++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 2 +-
6 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2bdb1de9bce0..c01bde832de2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -251,6 +251,8 @@ enum x86_intercept_stage;
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
#define PFERR_SGX_BIT 15
+#define PFERR_LEVEL_START_BIT 29
+#define PFERR_LEVEL_END_BIT 31
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_IMPLICIT_ACCESS_BIT 48
@@ -262,6 +264,7 @@ enum x86_intercept_stage;
#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
#define PFERR_PK_MASK (1U << PFERR_PK_BIT)
#define PFERR_SGX_MASK (1U << PFERR_SGX_BIT)
+#define PFERR_LEVEL_MASK GENMASK_ULL(PFERR_LEVEL_END_BIT, PFERR_LEVEL_START_BIT)
#define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
#define PFERR_IMPLICIT_ACCESS (1ULL << PFERR_IMPLICIT_ACCESS_BIT)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a03aa609a0da..ba21503fa46f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4451,6 +4451,11 @@ EXPORT_SYMBOL_GPL(kvm_handle_page_fault);

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
+ u8 err_level = (fault->error_code & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT;
+
+ if (err_level)
+ fault->max_level = min(fault->max_level, err_level);
+
/*
* If the guest's MTRRs may be used to compute the "real" memtype,
* restrict the mapping level to ensure KVM uses a consistent memtype
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index fd5ed3c0f894..f512eaa458a2 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -78,7 +78,8 @@ static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
}

static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
- unsigned long exit_qualification)
+ unsigned long exit_qualification,
+ int err_page_level)
{
u64 error_code;

@@ -98,6 +99,9 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;

+ if (err_page_level > 0)
+ error_code |= (err_page_level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cdd421fb5024..81d88b1e63ac 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1765,7 +1765,20 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
+ union tdx_ext_exit_qualification ext_exit_qual;
unsigned long exit_qual;
+ int err_page_level = 0;
+
+ ext_exit_qual.full = tdexit_ext_exit_qual(vcpu);
+
+ if (ext_exit_qual.type >= NUM_EXT_EXIT_QUAL) {
+ pr_err("EPT violation at gpa 0x%lx, with invalid ext exit qualification type 0x%x\n",
+ tdexit_gpa(vcpu), ext_exit_qual.type);
+ kvm_vm_bugged(vcpu->kvm);
+ return 0;
+ } else if (ext_exit_qual.type == EXT_EXIT_QUAL_ACCEPT) {
+ err_page_level = ext_exit_qual.req_sept_level + 1;
+ }

if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
/*
@@ -1792,7 +1805,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
}

trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
- return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+ return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual, err_page_level);
}

static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 8284cce0d385..3400563a2254 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -79,6 +79,25 @@ union tdx_exit_reason {
u64 full;
};

+union tdx_ext_exit_qualification {
+ struct {
+ u64 type : 4;
+ u64 reserved0 : 28;
+ u64 req_sept_level : 3;
+ u64 err_sept_level : 3;
+ u64 err_sept_state : 8;
+ u64 err_sept_is_leaf : 1;
+ u64 reserved1 : 17;
+ };
+ u64 full;
+};
+
+enum tdx_ext_exit_qualification_type {
+ EXT_EXIT_QUAL_NONE,
+ EXT_EXIT_QUAL_ACCEPT,
+ NUM_EXT_EXIT_QUAL,
+};
+
struct vcpu_tdx {
struct kvm_vcpu vcpu;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e5aa805f6db4..6ba3eded55a7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5646,7 +5646,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);

- return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, 0);
}

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.25.1

2022-08-07 22:45:05

by Isaku Yamahata

Subject: [RFC PATCH 10/13] KVM: MMU: Pass desired page level in err code for page fault handler

From: Xiaoyao Li <[email protected]>

For TDX, an EPT violation can happen on TDG.MEM.PAGE.ACCEPT, and
TDG.MEM.PAGE.ACCEPT carries the page level at which the TD guest wants to
accept the page.

1. KVM may map the page as 4KB while the TD guest wants to accept a 2MB page.

The TD guest will get TDX_PAGE_SIZE_MISMATCH and should retry the accept
at 4KB size.

2. KVM may map the page as 2MB while the TD guest wants to accept a 4KB page.

KVM needs to honor the request because
a) there is no way to tell the guest that KVM maps it at 2MB size, and
b) the guest accepts it at 4KB size because it knows some other 4KB page
in the same 2MB range will be used as a shared page.

For case 2, the desired page level needs to be passed to the MMU's page
fault handler. Use bits 29:31 of the KVM PF error code for this purpose.

Signed-off-by: Xiaoyao Li <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/common.h | 2 +-
arch/x86/kvm/vmx/tdx.c | 9 +++++++--
arch/x86/kvm/vmx/tdx.h | 19 -------------------
arch/x86/kvm/vmx/tdx_arch.h | 19 +++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 2 +-
6 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c01bde832de2..a6bfcabcbbd7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -273,6 +273,8 @@ enum x86_intercept_stage;
PFERR_WRITE_MASK | \
PFERR_PRESENT_MASK)

+#define PFERR_LEVEL(err_code) (((err_code) & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT)
+
/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
/*
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index f512eaa458a2..0835ea975250 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -99,7 +99,7 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;

- if (err_page_level > 0)
+ if (err_page_level > PG_LEVEL_NONE)
error_code |= (err_page_level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK;

return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2fdf3aa70c57..e4e193b1a758 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1803,7 +1803,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
#define TDX_SEPT_VIOLATION_EXIT_QUAL EPT_VIOLATION_ACC_WRITE
exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
} else {
- exit_qual = tdexit_exit_qual(vcpu);;
+ exit_qual = tdexit_exit_qual(vcpu);
if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
tdexit_gpa(vcpu), kvm_rip_read(vcpu));
@@ -2303,6 +2303,7 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
struct kvm_tdx_init_mem_region region;
struct kvm_vcpu *vcpu;
struct page *page;
+ u64 error_code;
kvm_pfn_t pfn;
int idx, ret = 0;

@@ -2356,7 +2357,11 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
(cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION);

- pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+ /* TODO: large page support. */
+ error_code = TDX_SEPT_PFERR;
+ error_code |= (PG_LEVEL_4K << PFERR_LEVEL_START_BIT) &
+ PFERR_LEVEL_MASK;
+ pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, error_code,
PG_LEVEL_4K);
if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
ret = -EFAULT;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 3400563a2254..8284cce0d385 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -79,25 +79,6 @@ union tdx_exit_reason {
u64 full;
};

-union tdx_ext_exit_qualification {
- struct {
- u64 type : 4;
- u64 reserved0 : 28;
- u64 req_sept_level : 3;
- u64 err_sept_level : 3;
- u64 err_sept_state : 8;
- u64 err_sept_is_leaf : 1;
- u64 reserved1 : 17;
- };
- u64 full;
-};
-
-enum tdx_ext_exit_qualification_type {
- EXT_EXIT_QUAL_NONE,
- EXT_EXIT_QUAL_ACCEPT,
- NUM_EXT_EXIT_QUAL,
-};
-
struct vcpu_tdx {
struct kvm_vcpu vcpu;

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 94258056d742..fbf334bc18c9 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -154,4 +154,23 @@ struct td_params {
#define TDX_MIN_TSC_FREQUENCY_KHZ (100 * 1000)
#define TDX_MAX_TSC_FREQUENCY_KHZ (10 * 1000 * 1000)

+union tdx_ext_exit_qualification {
+ struct {
+ u64 type : 4;
+ u64 reserved0 : 28;
+ u64 req_sept_level : 3;
+ u64 err_sept_level : 3;
+ u64 err_sept_state : 8;
+ u64 err_sept_is_leaf : 1;
+ u64 reserved1 : 17;
+ };
+ u64 full;
+};
+
+enum tdx_ext_exit_qualification_type {
+ EXT_EXIT_QUAL_NONE = 0,
+ EXT_EXIT_QUAL_ACCEPT,
+ NUM_EXT_EXIT_QUAL,
+};
+
#endif /* __KVM_X86_TDX_ARCH_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6ba3eded55a7..bb493ce80fa9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5646,7 +5646,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);

- return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, 0);
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, PG_LEVEL_NONE);
}

static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
--
2.25.1

2022-08-07 22:45:20

by Isaku Yamahata

Subject: [RFC PATCH 05/13] KVM: TDX: Pass size to tdx_measure_page()

From: Xiaoyao Li <[email protected]>

Extend tdx_measure_page() to take a size argument so that it can measure a
large page as well.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b717d50ee4d3..b7a75c0adbfa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1417,13 +1417,15 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
}

-static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
{
struct tdx_module_output out;
u64 err;
int i;

- for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+ WARN_ON_ONCE(size % TDX_EXTENDMR_CHUNKSIZE);
+
+ for (i = 0; i < size; i += TDX_EXTENDMR_CHUNKSIZE) {
err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &out);
if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
pr_tdx_error(TDH_MR_EXTEND, err, &out);
@@ -1497,7 +1499,7 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
tdx_unpin_pfn(kvm, pfn);
} else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
- tdx_measure_page(kvm_tdx, gpa); /* TODO: handle page size > 4KB */
+ tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));

kvm_tdx->source_pa = INVALID_PAGE;
}
--
2.25.1

2022-08-07 22:45:41

by Isaku Yamahata

Subject: [RFC PATCH 13/13] KVM: x86: remove struct kvm_arch.tdp_max_page_level

From: Xiaoyao Li <[email protected]>

Now that everything is in place to support large pages for TD guests, remove
tdp_max_page_level from struct kvm_arch, which limits the page size.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 1 -
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/vmx/tdx.c | 3 ---
4 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a6bfcabcbbd7..80f2bc3fbf0c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1190,7 +1190,6 @@ struct kvm_arch {
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
unsigned int indirect_shadow_pages;
- int tdp_max_page_level;
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba21503fa46f..0cbd52c476d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6232,7 +6232,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;

- kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
return 0;
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e5d5fea29bfa..82b220c4d1bd 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -395,7 +395,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
is_nx_huge_page_enabled(vcpu->kvm),
.is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),

- .max_level = vcpu->kvm->arch.tdp_max_page_level,
+ .max_level = KVM_MAX_HUGEPAGE_LEVEL,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
};
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a340caeb9c62..72f21f5f78af 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -460,9 +460,6 @@ int tdx_vm_init(struct kvm *kvm)
*/
kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK);

- /* TODO: Enable 2mb and 1gb large page support. */
- kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
-
/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
kvm->max_vcpus = 0;

--
2.25.1

2022-08-07 22:46:03

by Isaku Yamahata

Subject: [RFC PATCH 06/13] KVM: TDX: Pass size to reclaim_page()

From: Xiaoyao Li <[email protected]>

A 2MB large page can be tdh_mem_page_aug()'ed to a TD directly. In this
case, the page needs to be reclaimed and cleared at 2MB size.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7a75c0adbfa..0b9f9075e1ea 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -189,11 +189,13 @@ void tdx_hardware_disable(void)
tdx_disassociate_vp(&tdx->vcpu);
}

-static void tdx_clear_page(unsigned long page)
+static void tdx_clear_page(unsigned long page, int size)
{
const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
unsigned long i;

+ WARN_ON_ONCE(size % 64);
+
/*
* Zeroing the page is only necessary for systems with MKTME-i:
* when re-assign one page from old keyid to a new keyid, MOVDIR64B is
@@ -203,13 +205,14 @@ static void tdx_clear_page(unsigned long page)
if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
return;

- for (i = 0; i < 4096; i += 64)
+ for (i = 0; i < size; i += 64)
/* MOVDIR64B [rdx], es:rdi */
asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
: : "d" (zero_page), "D" (page + i) : "memory");
}

-static int tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
+static int tdx_reclaim_page(unsigned long va, hpa_t pa, enum pg_level level,
+ bool do_wb, u16 hkid)
{
struct tdx_module_output out;
u64 err;
@@ -219,8 +222,11 @@ static int tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
return -EIO;
}
+ /* out.r8 == tdx sept page level */
+ WARN_ON_ONCE(out.r8 != pg_level_to_tdx_sept_level(level));

- if (do_wb) {
+ /* only TDR page gets into this path */
+ if (do_wb && level == PG_LEVEL_4K) {
err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
if (WARN_ON_ONCE(err)) {
pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
@@ -228,7 +234,7 @@ static int tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
}
}

- tdx_clear_page(va);
+ tdx_clear_page(va, KVM_HPAGE_SIZE(level));
return 0;
}

@@ -257,7 +263,7 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
* was already flushed by TDH.PHYMEM.CACHE.WB before here, So
* cache doesn't need to be flushed again.
*/
- if (tdx_reclaim_page(page->va, page->pa, false, 0))
+ if (tdx_reclaim_page(page->va, page->pa, PG_LEVEL_4K, false, 0))
return;

page->added = false;
@@ -404,8 +410,8 @@ void tdx_vm_free(struct kvm *kvm)
* TDX global HKID is needed.
*/
if (kvm_tdx->tdr.added &&
- tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true,
- tdx_global_keyid))
+ tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, PG_LEVEL_4K,
+ true, tdx_global_keyid))
return;

free_page(kvm_tdx->tdr.va);
@@ -1548,7 +1554,8 @@ static void tdx_sept_drop_private_spte(
* The HKID assigned to this TD was already freed and cache
* was already flushed. We don't have to flush again.
*/
- err = tdx_reclaim_page((unsigned long)__va(hpa), hpa, false, 0);
+ err = tdx_reclaim_page((unsigned long)__va(hpa), hpa, level,
+ false, 0);

unlock:
spin_unlock(&kvm_tdx->seamcall_lock);
@@ -1667,7 +1674,8 @@ static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level le
* already flushed. We don't have to flush again.
*/
spin_lock(&kvm_tdx->seamcall_lock);
- ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page), false, 0);
+ ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page),
+ PG_LEVEL_4K, false, 0);
spin_unlock(&kvm_tdx->seamcall_lock);

return ret;
--
2.25.1

2022-08-07 22:46:16

by Isaku Yamahata

Subject: [RFC PATCH 03/13] KVM: TDX: Pass page level to cache flush before TDX SEAMCALL

From: Xiaoyao Li <[email protected]>

tdh_mem_page_aug() will support 2MB large pages in the near future. The
cache flush then also needs to cover 2MB instead of 4KB. Introduce a helper
function that flushes the cache with page size info, in preparation for
large pages.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx_ops.h | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index a50bc1445cc2..9accf2fe04ae 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -6,6 +6,7 @@

#include <linux/compiler.h>

+#include <asm/pgtable_types.h>
#include <asm/cacheflush.h>
#include <asm/asm.h>
#include <asm/kvm_host.h>
@@ -18,6 +19,11 @@

void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);

+static inline void tdx_clflush_page(hpa_t addr, enum pg_level level)
+{
+ clflush_cache_range(__va(addr), KVM_HPAGE_SIZE(level));
+}
+
/*
* Although seamcal_lock protects seamcall to avoid contention inside the TDX
* module, it doesn't protect TDH.VP.ENTER. With zero-step attack mitigation,
@@ -40,21 +46,21 @@ static inline u64 seamcall_sept_retry(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,

static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
{
- clflush_cache_range(__va(addr), PAGE_SIZE);
+ tdx_clflush_page(addr, PG_LEVEL_4K);
return __seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
}

static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
struct tdx_module_output *out)
{
- clflush_cache_range(__va(hpa), PAGE_SIZE);
+ tdx_clflush_page(hpa, PG_LEVEL_4K);
return seamcall_sept_retry(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
}

static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
struct tdx_module_output *out)
{
- clflush_cache_range(__va(page), PAGE_SIZE);
+ tdx_clflush_page(page, PG_LEVEL_4K);
return seamcall_sept_retry(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0,
out);
}
@@ -67,21 +73,21 @@ static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,

static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
{
- clflush_cache_range(__va(addr), PAGE_SIZE);
+ tdx_clflush_page(addr, PG_LEVEL_4K);
return __seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
}

static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
struct tdx_module_output *out)
{
- clflush_cache_range(__va(hpa), PAGE_SIZE);
+ tdx_clflush_page(hpa, PG_LEVEL_4K);
return __seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
}

static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
struct tdx_module_output *out)
{
- clflush_cache_range(__va(hpa), PAGE_SIZE);
+ tdx_clflush_page(hpa, PG_LEVEL_4K);
return seamcall_sept_retry(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
}

@@ -99,13 +105,13 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)

static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
{
- clflush_cache_range(__va(tdr), PAGE_SIZE);
+ tdx_clflush_page(tdr, PG_LEVEL_4K);
return __seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
}

static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
{
- clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+ tdx_clflush_page(tdvpr, PG_LEVEL_4K);
return __seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
}

--
2.25.1

2022-08-07 22:46:29

by Isaku Yamahata

Subject: [RFC PATCH 12/13] KVM: TDX: Split a large page when 4KB page within it converted to shared

From: Xiaoyao Li <[email protected]>

When mapping a shared page for TDX, the private alias needs to be zapped.

If the private page is mapped as a large page (2MB), it can be removed
directly only when the whole 2MB range is converted to shared. Otherwise,
the 2MB page has to be split into 512 4KB pages, removing only the pages
that were converted to shared.

When a present large leaf SPTE switches to a present non-leaf SPTE, TDX
needs to split the corresponding SEPT page to reflect that.

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 36 +++++++++++++++++++++++++++---------
arch/x86/kvm/vmx/tdx_arch.h | 1 +
arch/x86/kvm/vmx/tdx_ops.h | 7 +++++++
3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e4e193b1a758..a340caeb9c62 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1595,6 +1595,28 @@ static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
return 0;
}

+static int tdx_sept_split_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *sept_page)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn << PAGE_SHIFT;
+ hpa_t hpa = __pa(sept_page);
+ struct tdx_module_output out;
+ u64 err;
+
+ /* See comment in tdx_sept_set_private_spte() */
+ spin_lock(&kvm_tdx->seamcall_lock);
+ err = tdh_mem_page_demote(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
+ spin_unlock(&kvm_tdx->seamcall_lock);
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_DEMOTE, err, &out);
+ return -EIO;
+ }
+
+ return 0;
+}
+
static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level)
{
@@ -1604,8 +1626,6 @@ static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
struct tdx_module_output out;
u64 err;

- /* For now large page isn't supported yet. */
- WARN_ON_ONCE(level != PG_LEVEL_4K);
spin_lock(&kvm_tdx->seamcall_lock);
err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
spin_unlock(&kvm_tdx->seamcall_lock);
@@ -1717,13 +1737,11 @@ static void tdx_handle_changed_private_spte(
lockdep_assert_held(&kvm->mmu_lock);

if (change->new.is_present) {
- /* TDP MMU doesn't change present -> present */
- WARN_ON(change->old.is_present);
- /*
- * Use different call to either set up middle level
- * private page table, or leaf.
- */
- if (is_leaf)
+ if (level > PG_LEVEL_4K && was_leaf && !is_leaf) {
+ tdx_sept_zap_private_spte(kvm, gfn, level);
+ tdx_sept_tlb_remote_flush(kvm);
+ tdx_sept_split_private_spte(kvm, gfn, level, change->sept_page);
+ } else if (is_leaf)
tdx_sept_set_private_spte(
kvm, gfn, level, change->new.pfn);
else {
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index fbf334bc18c9..5970416e95b2 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -21,6 +21,7 @@
#define TDH_MNG_CREATE 9
#define TDH_VP_CREATE 10
#define TDH_MNG_RD 11
+#define TDH_MEM_PAGE_DEMOTE 15
#define TDH_MR_EXTEND 16
#define TDH_MR_FINALIZE 17
#define TDH_VP_FLUSH 18
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index da662aa46cd9..3b7373272d61 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -127,6 +127,13 @@ static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out
return __seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
}

+static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+ struct tdx_module_output *out)
+{
+ return seamcall_sept_retry(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page,
+ 0, out);
+}
+
static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
struct tdx_module_output *out)
{
--
2.25.1

2022-08-08 05:48:15

by Xiaoyao Li

Subject: Re: [RFC PATCH 13/13] KVM: x86: remove struct kvm_arch.tdp_max_page_level

On 8/8/2022 6:18 AM, [email protected] wrote:
> From: Xiaoyao Li <[email protected]>
>
> Now that everything is in place to support large pages for TD guests, remove
> tdp_max_page_level from struct kvm_arch, which limits the page size.

Isaku, we cannot simply remove tdp_max_page_level. Instead, we need to
assign it PG_LEVEL_2M, because TDX currently only supports AUG'ing a
4K/2M page; 1G is not supported yet.
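
I.e., something like this in tdx_vm_init() (untested sketch of the
suggestion):

	/* TDX only supports AUG at 4K/2M; cap the mapping level at 2M. */
	kvm->arch.tdp_max_page_level = PG_LEVEL_2M;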

> Signed-off-by: Xiaoyao Li <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu/mmu.c | 1 -
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/vmx/tdx.c | 3 ---
> 4 files changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a6bfcabcbbd7..80f2bc3fbf0c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1190,7 +1190,6 @@ struct kvm_arch {
> unsigned long n_requested_mmu_pages;
> unsigned long n_max_mmu_pages;
> unsigned int indirect_shadow_pages;
> - int tdp_max_page_level;
> u8 mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ba21503fa46f..0cbd52c476d7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6232,7 +6232,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
>
> - kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
> return 0;
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index e5d5fea29bfa..82b220c4d1bd 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -395,7 +395,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> is_nx_huge_page_enabled(vcpu->kvm),
> .is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
>
> - .max_level = vcpu->kvm->arch.tdp_max_page_level,
> + .max_level = KVM_MAX_HUGEPAGE_LEVEL,
> .req_level = PG_LEVEL_4K,
> .goal_level = PG_LEVEL_4K,
> };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a340caeb9c62..72f21f5f78af 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -460,9 +460,6 @@ int tdx_vm_init(struct kvm *kvm)
> */
> kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK);
>
> - /* TODO: Enable 2mb and 1gb large page support. */
> - kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
> -
> /* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> kvm->max_vcpus = 0;
>

2022-08-11 23:36:07

by Isaku Yamahata

Subject: Re: [RFC PATCH 13/13] KVM: x86: remove struct kvm_arch.tdp_max_page_level

On Mon, Aug 08, 2022 at 01:40:53PM +0800,
Xiaoyao Li <[email protected]> wrote:

> On 8/8/2022 6:18 AM, [email protected] wrote:
> > From: Xiaoyao Li <[email protected]>
> >
> > Now that everything is in place to support large pages for TD guests, remove
> > tdp_max_page_level from struct kvm_arch, which limits the page size.
>
> Isaku, we cannot simply remove tdp_max_page_level. Instead, we need to
> assign it PG_LEVEL_2M, because TDX currently only supports AUG'ing a
> 4K/2M page; 1G is not supported yet.

I went too far. I'll fix it, thanks.
--
Isaku Yamahata <[email protected]>