2024-02-26 09:23:41

by Isaku Yamahata

Subject: [PATCH v8 00/14] KVM TDX: TDP MMU: large page support

From: Isaku Yamahata <[email protected]>

This patch series is based on "v19 KVM TDX: basic feature support". It
implements large page support for the TDP MMU by allowing large pages to be
populated and splitting them when necessary.

No major changes from v7 other than rebasing.

Thanks,

Changes from v7:
- Rebased to v19 TDX KVM v6.8-rc5 based patch series

Changes from v6:
- Rebased to v18 TDX KVM v6.8-rc1 based patch series
- use struct tdx_module_args
- minor improve on comment, commit message

Changes from v5:
- Switched to TDX module 1.5 base.

Changes from v4:
- Rebased to v16 TDX KVM v6.6-rc2 base

Changes from v3:
- Rebased to v15 TDX KVM v6.5-rc1 base

Changes from v2:
- implemented page merging path
- rebased to TDX KVM v11

Changes from v1:
- implemented page merging path
- rebased to UPM v10
- rebased to TDX KVM v10
- rebased to kvm.git queue + v6.1-rc8

Isaku Yamahata (4):
KVM: x86/tdp_mmu: Allocate private page table for large page split
KVM: x86/tdp_mmu: Try to merge pages into a large page
KVM: TDX: Implement merge pages into a large page
KVM: x86/mmu: Make kvm fault handler aware of large page of private
memslot

Sean Christopherson (1):
KVM: Add transparent hugepage support for dedicated guest memory

Xiaoyao Li (9):
KVM: TDX: Flush cache based on page size before TDX SEAMCALL
KVM: TDX: Pass KVM page level to tdh_mem_page_aug()
KVM: TDX: Pass size to reclaim_page()
KVM: TDX: Update tdx_sept_{set,drop}_private_spte() to support large
page
KVM: MMU: Introduce level info in PFERR code
KVM: TDX: Pass desired page level in err code for page fault handler
KVM: x86/tdp_mmu: Split the large page when zap leaf
KVM: x86/tdp_mmu, TDX: Split a large page when 4KB page within it
converted to shared
KVM: TDX: Allow 2MB large page for TD GUEST

Documentation/virt/kvm/api.rst | 7 +
arch/x86/include/asm/kvm-x86-ops.h | 3 +
arch/x86/include/asm/kvm_host.h | 11 ++
arch/x86/kvm/mmu/mmu.c | 38 ++--
arch/x86/kvm/mmu/mmu_internal.h | 30 +++-
arch/x86/kvm/mmu/tdp_iter.c | 37 +++-
arch/x86/kvm/mmu/tdp_iter.h | 2 +
arch/x86/kvm/mmu/tdp_mmu.c | 276 ++++++++++++++++++++++++++---
arch/x86/kvm/vmx/common.h | 6 +-
arch/x86/kvm/vmx/tdx.c | 221 +++++++++++++++++------
arch/x86/kvm/vmx/tdx_arch.h | 21 +++
arch/x86/kvm/vmx/tdx_errno.h | 3 +
arch/x86/kvm/vmx/tdx_ops.h | 56 ++++--
arch/x86/kvm/vmx/vmx.c | 2 +-
include/uapi/linux/kvm.h | 2 +
virt/kvm/guest_memfd.c | 73 +++++++-
16 files changed, 672 insertions(+), 116 deletions(-)

--
2.25.1



2024-02-26 09:36:06

by Isaku Yamahata

Subject: [PATCH v8 06/14] KVM: MMU: Introduce level info in PFERR code

From: Xiaoyao Li <[email protected]>

For TDX, an EPT violation can occur on TDG.MEM.PAGE.ACCEPT, and
TDG.MEM.PAGE.ACCEPT carries the page level at which the TD guest wants to
accept the page.

1. KVM can map the page at 4KB while the TD guest wants to accept a 2MB page.

The TD guest will get TDX_PAGE_SIZE_MISMATCH and should retry the accept
at 4KB size.

2. KVM can map the page at 2MB while the TD guest wants to accept a 4KB page.

KVM needs to honor the guest's request because
a) there is no way to tell the guest that KVM mapped the page at 2MB, and
b) the guest accepts it at 4KB size because it knows some other 4KB page
   in the same 2MB range will be used as a shared page.

For case 2, the desired page level needs to be passed to the KVM MMU page
fault handler. Use bits 31:29 of the KVM PF error code for this purpose.
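As an illustration of the encoding described above, here is a standalone
userspace sketch of packing and extracting the level in bits 31:29 of the
error code. GENMASK_ULL is redefined locally for portability, and the
pferr_encode_level() helper is hypothetical (the kernel patch only extracts
the level; the encoding side lives in the TDX EPT-violation path):

```c
#include <stdint.h>

/* Standalone illustration: bits 31:29 of the KVM page-fault error code
 * carry the guest's desired page level; 0 means "no level hint". */
#define PFERR_LEVEL_START_BIT 29
#define PFERR_LEVEL_END_BIT   31
/* Local stand-in for the kernel's GENMASK_ULL(h, l). */
#define GENMASK_ULL_DEMO(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))
#define PFERR_LEVEL_MASK GENMASK_ULL_DEMO(PFERR_LEVEL_END_BIT, PFERR_LEVEL_START_BIT)

/* Hypothetical encoding helper: place a level into an error code. */
static inline uint64_t pferr_encode_level(uint64_t err_code, uint8_t level)
{
	return (err_code & ~PFERR_LEVEL_MASK) |
	       (((uint64_t)level << PFERR_LEVEL_START_BIT) & PFERR_LEVEL_MASK);
}

/* Mirrors the patch's PFERR_LEVEL() macro: extract the level field. */
static inline uint8_t pferr_level(uint64_t err_code)
{
	return (uint8_t)((err_code & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT);
}
```

With this encoding, a fault handler can clamp its mapping level exactly as
the patch does: if pferr_level() returns non-zero, max_level is reduced to
at most that value.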

Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/mmu/mmu.c | 5 +++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e4d40e31fc31..c864a1ff2eb1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -262,6 +262,8 @@ enum x86_intercept_stage;
#define PFERR_FETCH_BIT 4
#define PFERR_PK_BIT 5
#define PFERR_SGX_BIT 15
+#define PFERR_LEVEL_START_BIT 29
+#define PFERR_LEVEL_END_BIT 31
#define PFERR_GUEST_FINAL_BIT 32
#define PFERR_GUEST_PAGE_BIT 33
#define PFERR_GUEST_ENC_BIT 34
@@ -274,6 +276,7 @@ enum x86_intercept_stage;
#define PFERR_FETCH_MASK BIT(PFERR_FETCH_BIT)
#define PFERR_PK_MASK BIT(PFERR_PK_BIT)
#define PFERR_SGX_MASK BIT(PFERR_SGX_BIT)
+#define PFERR_LEVEL_MASK GENMASK_ULL(PFERR_LEVEL_END_BIT, PFERR_LEVEL_START_BIT)
#define PFERR_GUEST_FINAL_MASK BIT_ULL(PFERR_GUEST_FINAL_BIT)
#define PFERR_GUEST_PAGE_MASK BIT_ULL(PFERR_GUEST_PAGE_BIT)
#define PFERR_GUEST_ENC_MASK BIT_ULL(PFERR_GUEST_ENC_BIT)
@@ -283,6 +286,8 @@ enum x86_intercept_stage;
PFERR_WRITE_MASK | \
PFERR_PRESENT_MASK)

+#define PFERR_LEVEL(err_code) (((err_code) & PFERR_LEVEL_MASK) >> PFERR_LEVEL_START_BIT)
+
/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b8d6ce02e66d..081df7855065 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4625,6 +4625,11 @@ bool __kvm_mmu_honors_guest_mtrrs(bool vm_has_noncoherent_dma)

int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
+ u8 err_level = PFERR_LEVEL(fault->error_code);
+
+ if (err_level)
+ fault->max_level = min(fault->max_level, err_level);
+
/*
* If the guest's MTRRs may be used to compute the "real" memtype,
* restrict the mapping level to ensure KVM uses a consistent memtype
--
2.25.1


2024-02-26 09:37:40

by Isaku Yamahata

Subject: [PATCH v8 12/14] KVM: TDX: Implement merge pages into a large page

From: Isaku Yamahata <[email protected]>

Implement the merge_private_spt callback.
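The merge path reissues TDH.MEM.PAGE.PROMOTE while the TDX module reports
TDX_INTERRUPTED_RESTARTABLE, and converts a busy Secure-EPT lock into
-EAGAIN for the caller. A minimal userspace sketch of that retry pattern
follows; the status values and the seamcall function pointer are
illustrative stand-ins, not the kernel's real definitions:

```c
#include <stdint.h>
#include <errno.h>

/* Illustrative status codes; the real values come from the TDX module ABI. */
#define TDX_SUCCESS_DEMO                 0x0ULL
#define TDX_INTERRUPTED_RESTARTABLE_DEMO 0x8000000500000000ULL
#define TDX_ERROR_SEPT_BUSY_DEMO         0x8000020000000092ULL

typedef uint64_t (*seamcall_fn)(void);

/* Retry pattern used around TDH.MEM.PAGE.PROMOTE: restart the SEAMCALL
 * while it reports it was interrupted; map a busy SEPT lock to -EAGAIN
 * so the caller retries the whole merge later. */
static int promote_with_retry(seamcall_fn seamcall)
{
	uint64_t err;

	do {
		err = seamcall();	/* stands in for TDH.MEM.PAGE.PROMOTE */
	} while (err == TDX_INTERRUPTED_RESTARTABLE_DEMO);

	if (err == TDX_ERROR_SEPT_BUSY_DEMO)
		return -EAGAIN;		/* caller should retry later */
	if (err != TDX_SUCCESS_DEMO)
		return -EIO;		/* unexpected failure */
	return 0;
}

/* Mock seamcall: pretend the first two invocations are interrupted. */
static int demo_calls;
static uint64_t demo_seamcall(void)
{
	return ++demo_calls < 3 ? TDX_INTERRUPTED_RESTARTABLE_DEMO
				: TDX_SUCCESS_DEMO;
}
```

The real callback additionally handles the "some pages still pending
accept" status and flushes/clears the unlinked Secure-EPT page before
reuse, as the diff below shows.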

Signed-off-by: Isaku Yamahata <[email protected]>
---
v7:
- Fix subject, x86/tdp_mmu => TDX
- comment: use unlink instead of free for clarity

v6:
- repeat TDH.MEM.PAGE.PROMOTE() on TDX_INTERRUPTED_RESTARTABLE

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 74 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/tdx_arch.h | 1 +
arch/x86/kvm/vmx/tdx_errno.h | 2 +
arch/x86/kvm/vmx/tdx_ops.h | 11 ++++++
4 files changed, 88 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 88af64658a9c..5b4d94a6c6e2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1674,6 +1674,51 @@ static int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
return 0;
}

+static int tdx_sept_merge_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ struct tdx_module_args out;
+ gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+ u64 err;
+
+ /* See comment in tdx_sept_set_private_spte() */
+ do {
+ err = tdh_mem_page_promote(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+ } while (err == TDX_INTERRUPTED_RESTARTABLE);
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (unlikely(err == (TDX_EPT_INVALID_PROMOTE_CONDITIONS |
+ TDX_OPERAND_ID_RCX)))
+ /*
+ * Some pages are accepted and some are pending. Need to wait
+ * for the TD to accept all pages. Tell the caller.
+ */
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_PAGE_PROMOTE, err, &out);
+ return -EIO;
+ }
+ WARN_ON_ONCE(out.rcx != __pa(private_spt));
+
+ /*
+ * TDH.MEM.PAGE.PROMOTE unlinks the Secure-EPT page for the lower level.
+ * Flush cache for reuse.
+ */
+ do {
+ err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(__pa(private_spt),
+ to_kvm_tdx(kvm)->hkid));
+ } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+ if (WARN_ON_ONCE(err)) {
+ pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+ return -EIO;
+ }
+
+ tdx_clear_page(__pa(private_spt), PAGE_SIZE);
+ return 0;
+}
+
static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level)
{
@@ -1770,6 +1815,33 @@ static void tdx_track(struct kvm *kvm)

}

+static int tdx_sept_unzap_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level)
+{
+ int tdx_level = pg_level_to_tdx_sept_level(level);
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+ gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+ struct tdx_module_args out;
+ u64 err;
+
+ do {
+ err = tdh_mem_range_unblock(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+
+ /*
+ * tdh_mem_range_block() is accompanied by tdx_track() via KVM
+ * remote TLB flush. Wait for the caller of
+ * tdh_mem_range_block() to complete TDX track.
+ */
+ } while (err == (TDX_TLB_TRACKING_NOT_DONE | TDX_OPERAND_ID_SEPT));
+ if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+ return -EAGAIN;
+ if (KVM_BUG_ON(err, kvm)) {
+ pr_tdx_error(TDH_MEM_RANGE_UNBLOCK, err, &out);
+ return -EIO;
+ }
+ return 0;
+}
+
static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
enum pg_level level, void *private_spt)
{
@@ -3331,9 +3403,11 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
x86_ops->link_private_spt = tdx_sept_link_private_spt;
x86_ops->free_private_spt = tdx_sept_free_private_spt;
x86_ops->split_private_spt = tdx_sept_split_private_spt;
+ x86_ops->merge_private_spt = tdx_sept_merge_private_spt;
x86_ops->set_private_spte = tdx_sept_set_private_spte;
x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+ x86_ops->unzap_private_spte = tdx_sept_unzap_private_spte;

return 0;

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index bb324f744bbf..a320f6d45731 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -29,6 +29,7 @@
#define TDH_MNG_KEY_FREEID 20
#define TDH_MNG_INIT 21
#define TDH_VP_INIT 22
+#define TDH_MEM_PAGE_PROMOTE 23
#define TDH_MEM_SEPT_RD 25
#define TDH_VP_RD 26
#define TDH_MNG_KEY_RECLAIMID 27
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 416708e6cbb7..799d7166e69a 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -21,6 +21,8 @@
#define TDX_KEY_CONFIGURED 0x0000081500000000ULL
#define TDX_NO_HKID_READY_TO_WBCACHE 0x0000082100000000ULL
#define TDX_FLUSHVP_NOT_DONE 0x8000082400000000ULL
+#define TDX_TLB_TRACKING_NOT_DONE 0xC0000B0800000000ULL
+#define TDX_EPT_INVALID_PROMOTE_CONDITIONS 0xC0000B0900000000ULL
#define TDX_EPT_ENTRY_STATE_INCORRECT 0xC0000B0D00000000ULL

/*
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index d8f0d9aa7439..bf660eefa9e0 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -254,6 +254,17 @@ static inline u64 tdh_mem_page_demote(hpa_t tdr, gpa_t gpa, int level, hpa_t pag
return tdx_seamcall_sept(TDH_MEM_PAGE_DEMOTE, &in, out);
}

+static inline u64 tdh_mem_page_promote(hpa_t tdr, gpa_t gpa, int level,
+ struct tdx_module_args *out)
+{
+ struct tdx_module_args in = {
+ .rcx = gpa | level,
+ .rdx = tdr,
+ };
+
+ return tdx_seamcall_sept(TDH_MEM_PAGE_PROMOTE, &in, out);
+}
+
static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
struct tdx_module_args *out)
{
--
2.25.1


2024-03-27 00:54:36

by Yin, Fengwei

Subject: Re: [PATCH v8 00/14] KVM TDX: TDP MMU: large page support

Hi Isaku,

On 2/26/2024 4:29 PM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> This patch series is based on "v19 KVM TDX: basic feature support". It
> implements large page support for TDP MMU by allowing populating of the large
> page and splitting it when necessary.
To test hugepages for a TDX guest, we need to apply the QEMU patch
from Xiaoyao:
https://lore.kernel.org/qemu-devel/[email protected]/

According to Xiaoyao, it's still under discussion, so he hasn't sent
an updated version of the patch. For folks who want to try this series,
it may be better to mention the above link in this cover letter?

Testing on my side showed several benchmarks got a 10+% performance
gain, which is really nice. So:
Tested-by: Yin Fengwei <[email protected]>


Regards
Yin, Fengwei


2024-03-27 04:15:32

by Isaku Yamahata

Subject: Re: [PATCH v8 00/14] KVM TDX: TDP MMU: large page support

On Wed, Mar 27, 2024 at 08:53:50AM +0800,
"Yin, Fengwei" <[email protected]> wrote:

> Hi Isaku,
>
> On 2/26/2024 4:29 PM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > This patch series is based on "v19 KVM TDX: basic feature support". It
> > implements large page support for TDP MMU by allowing populating of the large
> > page and splitting it when necessary.
> To test the hugepage for TDX guest, we need to apply Qemu patch
> from Xiaoyao:
> https://lore.kernel.org/qemu-devel/[email protected]/
>
> According to Xiaoyao, it's still under discussion. So he didn't
> send updated version patch. For folks want to try this series,
> it may be better to mention above link in this cover letter?

Makes sense. Let me include it in the next version.


> Test in my side showed several benchmarks got 10+% performance
> gain which is really nice. So:

So nice.

> Tested-by: Yin Fengwei <[email protected]>

Thanks,
--
Isaku Yamahata <[email protected]>