From: Isaku Yamahata <[email protected]>
Changes from v1:
- implemented page merging path
- rebased to UPM v10
- rebased to TDX KVM v10
- rebased to kvm.git queue + v6.1-rc8
---
This patch series is based on "v10 KVM TDX: basic feature support". It
implements large page support for the TDP MMU by allowing large pages to be
populated and splitting them when necessary.
Feedback on the options for merging sub-pages into a large page is welcome.
Splitting large pages when necessary
====================================
* KVM already tracks whether a GFN is private or shared. When that changes,
  update lpage_info to prevent a large page mapping (sketched below).
* TDX provides the page level on Secure EPT violation. Pass the page level
  down to the lower-level functions that need it.
* Even if the page is a large page in the host, only some of the sub-pages may
  be mapped at the EPT level. In such cases, give up on mapping a large page
  and step down to the sub-page level, unlike conventional EPT.
* When zapping an SPTE that maps a large page, split it before zapping, unlike
  conventional EPT, because otherwise the protected page contents would be
  lost.
* Merging pages back into a large page is not done as part of this path; see
  the next section for the merging logic and its remaining gaps.
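As a minimal sketch of the lpage_info update mentioned in the first bullet,
assuming the existing kvm_lpage_info layout (the helper name and the plain
increment are illustrative only; the actual series also has to handle the
mixed private/shared case and the matching decrement):

static void gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn)
{
        int level;

        /*
         * A non-zero disallow_lpage forces the fault handler to map this
         * range with 4K pages only, so a huge-page range that mixes private
         * and shared GFNs can never be covered by one large mapping.
         */
        for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++)
                lpage_info_slot(gfn, slot, level)->disallow_lpage++;
}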
Merging small pages into a large page if possible
=================================================
On a normal EPT violation, after mapping the page, check whether the pages can
be merged into a large page.
The missing parts are as follows:
* Make the NX recovery thread use TDH.MEM.RANGE.BLOCK instead of zapping the
  EPT entry.
* Record that the entry is blocked by introducing a bit in the SPTE. On EPT
  violation, check whether the entry is blocked. If the EPT violation is
  caused by a blocked Secure-EPT entry, trigger the page merge logic (a rough
  sketch follows this list).
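A rough sketch of the blocked-SPTE idea, assuming a free software-available
SPTE bit can be reserved (the bit position and helper names below are
hypothetical and not part of this series):

#define SPTE_PRIVATE_BLOCKED    BIT_ULL(59)     /* hypothetical software bit */

static inline bool is_private_blocked_spte(u64 spte)
{
        return !!(spte & SPTE_PRIVATE_BLOCKED);
}

/*
 * Sketch of the fault-path decision: a violation on a blocked (rather than
 * zapped) Secure-EPT entry means the page contents are still intact, so the
 * handler should unblock/promote instead of re-populating the mapping.
 */
static bool private_fault_needs_merge(u64 old_spte, bool is_private_fault)
{
        return is_private_fault && is_private_blocked_spte(old_spte);
}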
TDX operation
=============
The following describes the procedures for the relevant TDX operations.
* EPT violation trick
The usual trick (zapping the EPT entry to trigger an EPT violation) doesn't
work for TDX. For TDX, zapping a page loses the contents of the protected page
because the protected guest page is dissociated from the guest TD. Instead,
TDX provides a different way to trigger an EPT violation without losing the
page contents, so that the VMM can detect guest TD activity: blocking and
unblocking the Secure-EPT entry with TDH.MEM.RANGE.BLOCK and
TDH.MEM.RANGE.UNBLOCK. These correspond to clearing/setting the present bit in
an EPT entry while the page contents are kept. With TDH.MEM.RANGE.BLOCK and a
TLB shootdown, the VMM can cause the guest TD to trigger an EPT violation.
After that, the VMM can unblock the entry with TDH.MEM.RANGE.UNBLOCK and
resume guest TD execution. The procedure is as follows (a sketch in C follows
the list).
- Block the Secure-EPT entry with TDH.MEM.RANGE.BLOCK.
- TLB shootdown.
- Wait for the guest TD to trigger an EPT violation.
- Unblock the Secure-EPT entry with TDH.MEM.RANGE.UNBLOCK to resume the guest
  TD.
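A minimal sketch of that sequence using the wrappers from this series
(tdh_mem_range_block()/tdh_mem_range_unblock() and tdx_track()); the retry on
TDX_ERROR_SEPT_BUSY is omitted and the helper names are illustrative:

/* Steps 1-2: block the entry and shoot down stale guest TLB entries. */
static int tdx_block_gpa(struct kvm_tdx *kvm_tdx, gpa_t gpa, int tdx_level)
{
        struct tdx_module_output out;
        u64 err;

        err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
        if (KVM_BUG_ON(err, &kvm_tdx->kvm))
                return -EIO;

        /* Bump the TD's TLB epoch and kick vCPUs to flush stale entries. */
        tdx_track(kvm_tdx);
        return 0;
}

/* Step 4: after the guest TD has faulted and the VMM finished its work. */
static int tdx_unblock_gpa(struct kvm_tdx *kvm_tdx, gpa_t gpa, int tdx_level)
{
        struct tdx_module_output out;
        u64 err;

        err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
        if (KVM_BUG_ON(err, &kvm_tdx->kvm))
                return -EIO;
        return 0;
}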
* Merging sub-pages into a large page
The following steps are needed (condensed into the sketch below).
- Ensure that all sub-pages are mapped.
- TLB shootdown.
- Merge the sub-pages into a large page (TDH.MEM.PAGE.PROMOTE).
  This requires that all sub-pages are mapped.
- Cache-flush the Secure-EPT page that was used to map the sub-pages.
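Condensed into one sketch; the real flow is split between the TDP MMU and
tdx_sept_merge_private_spt() in the patch below. Retries and error reporting
are omitted, the function name is illustrative, and private_spt is the
Secure-EPT page that backed the 4K mappings:

static int tdx_promote_to_2m(struct kvm *kvm, gfn_t gfn, void *private_spt)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        struct tdx_module_output out;
        u64 err;

        /*
         * Precondition: all 512 4K sub-pages are mapped and the range was
         * blocked + tracked so no stale guest TLB entries remain.
         */
        err = tdh_mem_page_promote(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
                                   pg_level_to_tdx_sept_level(PG_LEVEL_2M),
                                   &out);
        if (err)
                return -EIO;

        /*
         * The Secure-EPT page that mapped the sub-pages is now free; flush
         * its cache lines (keyed with the TD's HKID) before KVM reuses it.
         */
        err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt),
                                                     kvm_tdx->hkid));
        return err ? -EIO : 0;
}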
Thanks,
Isaku Yamahata (3):
KVM: x86/tdp_mmu: Try to merge pages into a large page
KVM: x86/tdp_mmu: TDX: Implement merge pages into a large page
KVM: x86/mmu: Make kvm fault handler aware of large page of private
memslot
Xiaoyao Li (12):
KVM: TDP_MMU: Go to next level if smaller private mapping exists
KVM: TDX: Pass page level to cache flush before TDX SEAMCALL
KVM: TDX: Pass KVM page level to tdh_mem_page_add() and
tdh_mem_page_aug()
KVM: TDX: Pass size to tdx_measure_page()
KVM: TDX: Pass size to reclaim_page()
KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large
page
KVM: MMU: Introduce level info in PFERR code
KVM: TDX: Pin pages via get_page() right before ADD/AUG'ed to TDs
KVM: TDX: Pass desired page level in err code for page fault handler
KVM: x86/tdp_mmu: Split the large page when zap leaf
KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it
converted to shared
KVM: TDX: Allow 2MB large page for TD GUEST
arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 10 ++
arch/x86/kvm/mmu/mmu.c | 50 +++++--
arch/x86/kvm/mmu/mmu_internal.h | 10 ++
arch/x86/kvm/mmu/tdp_mmu.c | 225 +++++++++++++++++++++++++---
arch/x86/kvm/vmx/common.h | 6 +-
arch/x86/kvm/vmx/tdx.c | 227 ++++++++++++++++++++++-------
arch/x86/kvm/vmx/tdx_arch.h | 21 +++
arch/x86/kvm/vmx/tdx_errno.h | 2 +
arch/x86/kvm/vmx/tdx_ops.h | 50 +++++--
arch/x86/kvm/vmx/vmx.c | 2 +-
11 files changed, 505 insertions(+), 101 deletions(-)
--
2.25.1
From: Xiaoyao Li <[email protected]>
Allow large page level AUG and REMOVE for TDX pages.
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 63 +++++++++++++++++++++---------------------
1 file changed, 32 insertions(+), 31 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index df213b488f89..d5f93115f3ba 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1297,11 +1297,12 @@ static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa, int size)
}
}
-static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn, int level)
{
- struct page *page = pfn_to_page(pfn);
+ int i;
- put_page(page);
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
+ put_page(pfn_to_page(pfn + i));
}
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -1315,28 +1316,26 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
hpa_t source_pa;
bool measure;
u64 err;
+ int i;
if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) ||
!kvm_pfn_to_refcounted_page(pfn)))
return 0;
/* To prevent page migration, do nothing on mmu notifier. */
- get_page(pfn_to_page(pfn));
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++)
+ get_page(pfn_to_page(pfn + i));
/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
if (likely(is_td_finalized(kvm_tdx))) {
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return -EINVAL;
-
err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
if (err == TDX_ERROR_SEPT_BUSY) {
- tdx_unpin(kvm, pfn);
+ tdx_unpin(kvm, pfn, level);
return -EAGAIN;
}
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
- tdx_unpin(kvm, pfn);
+ tdx_unpin(kvm, pfn, level);
return -EIO;
}
return 0;
@@ -1359,7 +1358,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
* always uses vcpu 0's page table and protected by vcpu->mutex).
*/
if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
- tdx_unpin(kvm, pfn);
+ tdx_unpin(kvm, pfn, level);
return -EINVAL;
}
@@ -1377,7 +1376,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
} while (err == TDX_ERROR_SEPT_BUSY);
if (KVM_BUG_ON(err, kvm)) {
pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
- tdx_unpin(kvm, pfn);
+ tdx_unpin(kvm, pfn, level);
return -EIO;
} else if (measure)
tdx_measure_page(kvm_tdx, gpa, KVM_HPAGE_SIZE(level));
@@ -1394,11 +1393,9 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
gpa_t gpa = gfn_to_gpa(gfn);
hpa_t hpa = pfn_to_hpa(pfn);
hpa_t hpa_with_hkid;
+ int r = 0;
u64 err;
-
- /* TODO: handle large pages. */
- if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
- return -EINVAL;
+ int i;
if (!is_hkid_assigned(kvm_tdx)) {
/*
@@ -1408,7 +1405,7 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
err = tdx_reclaim_page(hpa, level, false, 0);
if (KVM_BUG_ON(err, kvm))
return -EIO;
- tdx_unpin(kvm, pfn);
+ tdx_unpin(kvm, pfn, level);
return 0;
}
@@ -1425,21 +1422,25 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
return -EIO;
}
- hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
- do {
- /*
- * TDX_OPERAND_BUSY can happen on locking PAMT entry. Because
- * this page was removed above, other thread shouldn't be
- * repeatedly operating on this page. Just retry loop.
- */
- err = tdh_phymem_page_wbinvd(hpa_with_hkid);
- } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
- if (KVM_BUG_ON(err, kvm)) {
- pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
- return -EIO;
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
+ hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+ do {
+ /*
+ * TDX_OPERAND_BUSY can happen on locking PAMT entry.
+ * Because this page was removed above, other thread
+ * shouldn't be repeatedly operating on this page.
+ * Simple retry should work.
+ */
+ err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+ } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ r = -EIO;
+ } else
+ tdx_unpin(kvm, pfn + i, PG_LEVEL_4K);
+ hpa += PAGE_SIZE;
}
- tdx_unpin(kvm, pfn);
- return 0;
+ return r;
}
static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
--
2.25.1
From: Isaku Yamahata <[email protected]>
Implement the merge_private_spt callback.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 70 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx_arch.h | 1 +
arch/x86/kvm/vmx/tdx_errno.h | 2 ++
arch/x86/kvm/vmx/tdx_ops.h | 6 ++++
4 files changed, 79 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ce7026136334..f20e931cf983 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1487,6 +1487,47 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}
+static int tdx_sept_merge_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_module_output out;
+ gpa_t gpa = gfn_to_gpa(gfn);
+ u64 err;
+
+ /* See comment in tdx_sept_set_private_spte() */
+ err = tdh_mem_page_promote(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+ if (err == TDX_ERROR_SEPT_BUSY)
+ return -EAGAIN;
+ if (err == TDX_EPT_INVALID_PROMOTE_CONDITIONS)
+ /*
+ * Some pages are accepted, some are pending. Need to wait for
+ * the TD to accept all pages. Tell the caller to retry.
+ */
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_PROMOTE, err, &out);
+ return -EIO;
+ }
+ WARN_ON_ONCE(out.rcx != __pa(private_spt));
+
+ /*
+ * TDH.MEM.PAGE.PROMOTE frees the Secure-EPT page for the lower level.
+ * Flush cache for reuse.
+ */
+ do {
+ err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt),
+ to_kvm_tdx(kvm)->hkid));
+ } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ return -EIO;
+ }
+
+ return 0;
+}
+
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level)
{
@@ -1556,6 +1597,33 @@ static void tdx_track(struct kvm_tdx *kvm_tdx)
}
+static int tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn);
+ struct tdx_module_output out;
+ u64 err;
+
+ do {
+ err = tdh_mem_range_unblock(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+
+ /*
+ * tdh_mem_range_block() is accompanied by tdx_track() via the kvm
+ * remote TLB flush. Wait for the caller of
+ * tdh_mem_range_block() to complete the TDX track.
+ */
+ } while (err == (TDX_TLB_TRACKING_NOT_DONE | TDX_OPERAND_ID_SEPT));
+ if (err == TDX_ERROR_SEPT_BUSY)
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_RANGE_UNBLOCK, err, &out);
+ return -EIO;
+ }
+ return 0;
+}
+
static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
enum pg_level level, void *private_spt)
{
@@ -2681,9 +2749,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
x86_ops->link_private_spt = tdx_sept_link_private_spt;
x86_ops->free_private_spt = tdx_sept_free_private_spt;
x86_ops->split_private_spt = tdx_sept_split_private_spt;
+ x86_ops->merge_private_spt = tdx_sept_merge_private_spt;
x86_ops->set_private_spte = tdx_sept_set_private_spte;
x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+ x86_ops->unzap_private_spte = tdx_sept_unzap_private_spte;
return 0;
}
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 508d9a1139ce..3a3c9c608bf0 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -29,6 +29,7 @@
#define TDH_MNG_KEY_FREEID 20
#define TDH_MNG_INIT 21
#define TDH_VP_INIT 22
+#define TDH_MEM_PAGE_PROMOTE 23
#define TDH_VP_RD 26
#define TDH_MNG_KEY_RECLAIMID 27
#define TDH_PHYMEM_PAGE_RECLAIM 28
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 389b1b53da25..74a5777c05f1 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -19,6 +19,8 @@
#define TDX_KEY_CONFIGURED 0x0000081500000000ULL
#define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000ULL
#define TDX_EPT_WALK_FAILED 0xC0000B0000000000ULL
+#define TDX_TLB_TRACKING_NOT_DONE 0xC0000B0800000000ULL
+#define TDX_EPT_INVALID_PROMOTE_CONDITIONS 0xC0000B0900000000ULL
/*
* TDG.VP.VMCALL Status Codes (returned in R10)
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 60cbc7f94b18..5d2d0b1eed28 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -140,6 +140,12 @@ static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t pag
return seamcall_sept(TDH_MEM_PAGE_DEMOTE, gpa | level, tdr, page, 0, out);
}
+static inline u64 tdh_mem_page_promote(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_output *out)
+{
+ return seamcall_sept(TDH_MEM_PAGE_PROMOTE, gpa | level, tdr, 0, 0, out);
+}
+
static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
struct tdx_module_output *out)
{
--
2.25.1