From: Sean Christopherson <[email protected]>
TDX introduces a new EPT, Secure-EPT, in addition to the existing EPT.
Secure-EPT maps protected guest memory, which is called private. Since
Secure-EPT page tables are also protected, those page tables are also called
private. The existing EPT is often called shared EPT to distinguish it from
Secure-EPT, and the page tables for the shared EPT are likewise called shared.

Virtualization Exception (#VE) is a new processor exception in VMX non-root
operation. In certain virtualization-related conditions, #VE is injected into
the guest instead of exiting from the guest to the VMM, so that the guest is
given a chance to handle it. One important case is EPT violation: when the
"EPT-violation #VE" VM-execution control is set and the "suppress #VE" bit in
the EPT entry is cleared, a #VE is injected instead of an EPT violation exit.

Because guest memory is protected with TDX, the VMM can't parse instructions
in guest memory. Instead, the MMIO hypercall is used for the guest to pass
the necessary information to the VMM.

To make unmodified device drivers work, the guest TD expects a #VE on accesses
to shared GPAs. The #VE handler converts the MMIO access into an MMIO
hypercall; this requires an EPT entry with #VE enabled, i.e. with the
"suppress #VE" bit cleared. Before the VMM enables #VE for a GPA, it needs to
figure out via an EPT violation that the GPA is for MMIO. So the execution
flow looks like:
- Allocate unused shared EPT entry with suppress #VE bit set.
- EPT violation on that GPA.
- VMM figures out the faulted GPA is for MMIO.
- VMM clears the suppress #VE bit.
- Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
- If the GPA maps guest memory, VMM resolves it with guest pages.

For both cases, the SPTE needs the "suppress #VE" bit set initially when it
is allocated or zapped, therefore a non-zero non-present value for the SPTE
needs to be allowed.
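
For illustration only, here is a minimal standalone sketch of what a non-zero
non-present SPTE looks like; the present-bit position below is a stand-in for
KVM's SPTE_MMU_PRESENT_MASK, not its actual definition:

        #include <assert.h>
        #include <stdint.h>

        #define SUPPRESS_VE_BIT   (1ULL << 63)  /* EPT "suppress #VE" bit */
        #define FAKE_PRESENT_BIT  (1ULL << 11)  /* stand-in for SPTE_MMU_PRESENT_MASK */

        int main(void)
        {
                /* Corresponds to the SHADOW_NONPRESENT_VALUE introduced below. */
                uint64_t nonpresent_spte = SUPPRESS_VE_BIT;

                /* Still non-present from the MMU's point of view... */
                assert(!(nonpresent_spte & FAKE_PRESENT_BIT));
                /* ...but no longer zero, so "!spte" no longer means "initial value". */
                assert(nonpresent_spte != 0);
                return 0;
        }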

This change requires updating FNAME(sync_page) for shadow EPT.
"if (!sp->spt[i])" in FNAME(sync_page) means that the SPTE still holds its
initial value. With the introduction of shadow_nonpresent_value, which can be
non-zero, that no longer holds. Replace the zero check with
"!is_shadow_present_pte() && !is_mmio_spte()".

Without that, when "if (!sp->spt[i])" doesn't hold but the entry value is
shadow_nonpresent_value, the entry is wrongly synchronized from non-present to
non-present with a (wrongly) changed pfn, the rmap is wrongly removed, and a
BUG_ON() is hit.

The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special constant, an
intermediate value indicating that one thread is operating on the SPTE; the
value only needs to be semi-arbitrary. For TDX (more precisely, to use #VE),
the value must include the "suppress #VE" bit, i.e. SHADOW_NONPRESENT_VALUE.
Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE to set the "suppress #VE" bit.
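
For illustration only, a standalone sketch of how the resulting constants
relate; the present-bit position is again a stand-in for
SPTE_MMU_PRESENT_MASK, and the equality check mirrors the existing
is_removed_spte() comparison:

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define SHADOW_NONPRESENT_VALUE (1ULL << 63)
        #define __REMOVED_SPTE          0x5a0ULL
        #define REMOVED_SPTE            (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
        #define FAKE_PRESENT_BIT        (1ULL << 11)  /* stand-in */

        static bool is_removed(uint64_t spte)
        {
                return spte == REMOVED_SPTE;
        }

        int main(void)
        {
                assert(is_removed(REMOVED_SPTE));
                /* The removed marker stays non-present... */
                assert(!(REMOVED_SPTE & FAKE_PRESENT_BIT));
                /* ...and keeps the "suppress #VE" bit set. */
                assert(REMOVED_SPTE & SHADOW_NONPRESENT_VALUE);
                return 0;
        }
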
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/mmu/spte.c | 5 +++-
arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
5 files changed, 105 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 51306b80f47c..f239b6cb5d53 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}
+static inline void kvm_init_shadow_page(void *page)
+{
+#ifdef CONFIG_X86_64
+ int ign;
+
+ WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
+ asm volatile (
+ "rep stosq\n\t"
+ : "=c"(ign), "=D"(page)
+ : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
+ : "memory"
+ );
+#else
+ BUG();
+#endif
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+ int start, end, i, r;
+ bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
+
+ if (is_tdp_mmu && shadow_nonpresent_value)
+ start = kvm_mmu_memory_cache_nr_free_objects(mc);
+
+ r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+
+ if (is_tdp_mmu && shadow_nonpresent_value) {
+ end = kvm_mmu_memory_cache_nr_free_objects(mc);
+ for (i = start; i < end; i++)
+ kvm_init_shadow_page(mc->objects[i]);
+ }
+ return 0;
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -677,8 +715,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_shadow_page_cache(vcpu);
if (r)
return r;
if (maybe_indirect) {
@@ -5521,9 +5558,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
* what is used by the kernel for any given HVA, i.e. the kernel's
* capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
*/
- if (tdp_enabled)
+ if (tdp_enabled) {
+ /*
+ * For TDP MMU, always set bit 63 for TDX support. See the
+ * comment on SHADOW_NONPRESENT_VALUE.
+ */
+#ifdef CONFIG_X86_64
+ shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
+#endif
max_huge_page_level = tdp_huge_page_level;
- else if (boot_cpu_has(X86_FEATURE_GBPAGES))
+ } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
max_huge_page_level = PG_LEVEL_1G;
else
max_huge_page_level = PG_LEVEL_2M;
@@ -5654,7 +5698,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index fe35d8fd3276..ee2fb0c073f3 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1031,7 +1031,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
gpa_t pte_gpa;
gfn_t gfn;
- if (!sp->spt[i])
+ if (!is_shadow_present_pte(sp->spt[i]) &&
+ !is_mmio_spte(sp->spt[i]))
continue;
pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cda1851ec155..bd441458153f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
+#ifdef CONFIG_X86_64
+u64 __read_mostly shadow_nonpresent_value;
+#endif
u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -360,7 +363,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
* not set any RWX bits.
*/
if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
- WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
+ WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;
if (!mmio_value)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0127bb6e3c7d..1bfedbe0585f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+/*
+ * Non-present SPTE value for both VMX and SVM for the TDP MMU.
+ * For SVM NPT, when the SPTE is non-present (bit 0 = 0), the other bits
+ * are ignored.  For VMX EPT, bit 63 is ignored if #VE is disabled, and
+ * is the "suppress #VE" bit if #VE is enabled.
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
+#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
@@ -154,6 +167,12 @@ extern u64 __read_mostly shadow_present_mask;
extern u64 __read_mostly shadow_me_value;
extern u64 __read_mostly shadow_me_mask;
+#ifdef CONFIG_X86_64
+extern u64 __read_mostly shadow_nonpresent_value;
+#else
+#define shadow_nonpresent_value 0ULL
+#endif
+
/*
* SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
* shadow_acc_track_mask is the set of bits to be cleared in non-accessed
@@ -174,9 +193,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
/*
* If a thread running without exclusive control of the MMU lock must perform a
- * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
+ * multi-part operation on an SPTE, it can set the SPTE to __REMOVED_SPTE as a
* non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
+ * should not modify the SPTE.  When TDX is enabled, use
+ * SHADOW_NONPRESENT_VALUE, which has the "suppress #VE" bit set, because
+ * the TDX module always enables "EPT-violation #VE".  The bit is ignored
+ * in the non-TDX case as the present bit (bit 0) is cleared.
*
* Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
* bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
@@ -184,10 +206,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define __REMOVED_SPTE 0x5a0ULL
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
-static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
+
+/*
+ * See the above comment around __REMOVED_SPTE.  REMOVED_SPTE is the actual
+ * intermediate value set for a removed SPTE; it sets the "suppress #VE" bit.
+ */
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
static inline bool is_removed_spte(u64 spte)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b9265d67131..2ca03ec3bf52 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -692,8 +692,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* overwrite the special removed SPTE value. No bookkeeping is needed
* here since the SPTE is going from non-present to non-present. Use
* the raw write helper to avoid an unnecessary check on volatile bits.
+ *
+ * Set the non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
+ * This is because when TDX is enabled, the TDX module always enables
+ * "EPT-violation #VE", so KVM needs to set the "suppress #VE" bit in
+ * EPT entries in order to get a real EPT violation rather than a
+ * TDVMCALL.  Using SHADOW_NONPRESENT_VALUE (which sets the
+ * "suppress #VE" bit) keeps the bit set when EPT entries are
+ * zapped.
*/
- __kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
return 0;
}
@@ -870,8 +878,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
if (!shared)
- tdp_mmu_set_spte(kvm, &iter, 0);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
goto retry;
}
}
@@ -927,8 +935,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
- __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
- sp->gfn, sp->role.level + 1, true, true);
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+ SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
+ true, true);
return true;
}
@@ -965,7 +974,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;
- tdp_mmu_set_spte(kvm, &iter, 0);
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}
@@ -1330,7 +1339,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, 0);
+ tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1
On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> Secure-EPT maps protected guest memory, which is called private. Since
> Secure-EPT page tables is also protected, those page tables is also called
> private. The existing EPT is often called shared EPT to distinguish from
> Secure-EPT. And also page tables for share EPT is also called shared.
Does this patch have anything to do with Secure-EPT?
>
> Virtualization Exception, #VE, is a new processor exception in VMX non-root
#VE isn't new. It's already in the pre-TDX public spec AFAICT.
> operation. In certain virtualizatoin-related conditions, #VE is injected
> into guest instead of exiting from guest to VMM so that guest is given a
> chance to inspect it. One important one is EPT violation. When
> "ETP-violation #VE" VM-execution is set, "#VE suppress bit" in EPT entry
> is cleared, #VE is injected instead of EPT violation.
We already know this fact from the pre-TDX public spec. Instead of repeating it
here, why not focus on saying what's new in TDX, so that your paragraph below
about setting a non-zero value for non-present SPTEs can be justified?
>
> Because guest memory is protected with TDX, VMM can't parse instructions
> in the guest memory. Instead, MMIO hypercall is used for guest to pass
> necessary information to VMM.
>
> To make unmodified device driver work, guest TD expects #VE on accessing
> shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> violation.
>
As I said above, before this point you need to explain that in TDX the VMCS is
controlled by the TDX module and that it always sets the "EPT-violation #VE"
execution control bit.
> So the execution flow looks like
>
> - Allocate unused shared EPT entry with suppress #VE bit set.
> - EPT violation on that GPA.
> - VMM figures out the faulted GPA is for MMIO.
> - VMM clears the suppress #VE bit.
> - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
> - If the GPA maps guest memory, VMM resolves it with guest pages.
>
> For both cases, SPTE needs suppress #VE" bit set initially when it
> is allocated or zapped, therefore non-zero non-present value for SPTE
> needs to be allowed.
>
> This change requires to update FNAME(sync_page) for shadow EPT.
> "if(!sp->spte[i])" in FNAME(sync_page) means that the spte entry is the
> initial value. With the introduction of shadow_nonpresent_value which can
> be non-zero, it doesn't hold any more. Replace zero check with
> "!is_shadow_present_pte() && !is_mmio_spte()".
I don't think you need to mention the above paragraph. From reading the above
paragraphs it's absolutely unclear how is_mmio_spte() will be impacted by this
patch.
From the "execution flow" you mentioned above, you will change the MMIO fault
from an EPT misconfiguration to an EPT violation (in order to get #VE), so
theoretically you may effectively disable MMIO caching, in which case, if I
understand correctly, is_mmio_spte() always returns false.
I guess you can just change to check:
if (sp->spte[i] != shadow_nonpresent_value)
Anyway, IMO you can just comment in the code.
After all, what is shadow_nonpresent_value, given you haven't explained what it
is?
>
> When "if (!spt[i])" doesn't hold, but the entry value is
> shadow_nonpresent_value, the entry is wrongly synchronized from non-present
> to non-present with (wrongly) pfn changed and tries to remove rmap wrongly
> and BUG_ON() is hit.
Ditto.
>
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> intermediate value to indicate one thread is operating on it and the value
> should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> value should include suppress #VE value which is SHADOW_NONPRESENT_VALUE.
What is SHADOW_NONPRESENT_VALUE?
> Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
> SHADOW_NONPRESENT_VALUE | REMOVED_SPTE to set suppress #VE bit.
Ditto. IMHO you don't even need to mention REMOVED_SPTE in changelog.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/spte.c | 5 +++-
> arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
> 5 files changed, 105 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 51306b80f47c..f239b6cb5d53 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> + int ign;
> +
> + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> + asm volatile (
> + "rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> + : "memory"
> + );
> +#else
> + BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> +
> + if (is_tdp_mmu && shadow_nonpresent_value)
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> +
> + if (is_tdp_mmu && shadow_nonpresent_value) {
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + }
I think you can just extend this to the legacy MMU too, not only the TDP MMU.
After all, before this patch, where have you declared that TDX only supports the
TDP MMU? This is only enforced in:
[PATCH v7 043/102] KVM: x86/mmu: Focibly use TDP MMU for TDX
which is 7 patches later.
Also, shadow_nonpresent_value is only used in a couple of places, while
SHADOW_NONPRESENT_VALUE is used directly in more places. Does it make more
sense to always use shadow_nonpresent_value instead of SHADOW_NONPRESENT_VALUE?
Similar to other shadow values, we can provide a function to let the caller
(VMX/SVM) decide whether it wants to use a non-zero value for non-present SPTEs:
void kvm_mmu_set_non_present_value(u64 value)
{
shadow_nonpresent_value = value;
}
> + return 0;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -677,8 +715,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -5521,9 +5558,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> * what is used by the kernel for any given HVA, i.e. the kernel's
> * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
> */
> - if (tdp_enabled)
> + if (tdp_enabled) {
> + /*
> + * For TDP MMU, always set bit 63 for TDX support. See the
> + * comment on SHADOW_NONPRESENT_VALUE.
> + */
> +#ifdef CONFIG_X86_64
> + shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
> +#endif
'tdp_enabled' doesn't mean TDP MMU, right?
> max_huge_page_level = tdp_huge_page_level;
> - else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> + } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> max_huge_page_level = PG_LEVEL_1G;
> else
> max_huge_page_level = PG_LEVEL_2M;
> @@ -5654,7 +5698,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index fe35d8fd3276..ee2fb0c073f3 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1031,7 +1031,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> gpa_t pte_gpa;
> gfn_t gfn;
>
> - if (!sp->spt[i])
> + if (!is_shadow_present_pte(sp->spt[i]) &&
> + !is_mmio_spte(sp->spt[i]))
> continue;
As I said in the changelog, I don't think this is correct.
I guess you can just change to check:
if (sp->spte[i] != shadow_nonpresent_value)
>
> pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cda1851ec155..bd441458153f 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +#ifdef CONFIG_X86_64
> +u64 __read_mostly shadow_nonpresent_value;
> +#endif
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -360,7 +363,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> * not set any RWX bits.
> */
> if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
> - WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> + WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> mmio_value = 0;
>
> if (!mmio_value)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 0127bb6e3c7d..1bfedbe0585f 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>
> #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>
> +/*
> + * non-present SPTE value for both VMX and SVM for TDP MMU.
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled.
> + * bit 63 is #VE suppress if #VE is enabled.
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
> +#define SHADOW_NONPRESENT_VALUE 0ULL
> +#endif
> +
> extern u64 __read_mostly shadow_host_writable_mask;
> extern u64 __read_mostly shadow_mmu_writable_mask;
> extern u64 __read_mostly shadow_nx_mask;
> @@ -154,6 +167,12 @@ extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_value;
> extern u64 __read_mostly shadow_me_mask;
>
> +#ifdef CONFIG_X86_64
> +extern u64 __read_mostly shadow_nonpresent_value;
> +#else
> +#define shadow_nonpresent_value 0ULL
> +#endif
> +
> /*
> * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
> * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
> @@ -174,9 +193,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>
> /*
> * If a thread running without exclusive control of the MMU lock must perform a
> - * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
> + * multi-part operation on an SPTE, it can set the SPTE to __REMOVED_SPTE as a
> * non-present intermediate value. Other threads which encounter this value
> - * should not modify the SPTE.
> + * should not modify the SPTE. For the case that TDX is enabled,
> + * SHADOW_NONPRESENT_VALUE, which is "suppress #VE" bit set because TDX module
> + * always enables "EPT violation #VE". The bit is ignored by non-TDX case as
> + * present bit (bit 0) is cleared.
> *
> * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> @@ -184,10 +206,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> *
> * Only used by the TDP MMU.
> */
> -#define REMOVED_SPTE 0x5a0ULL
> +#define __REMOVED_SPTE 0x5a0ULL
>
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> -static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
I don't think you need this. My understanding is that REMOVED_SPTE is checked
against SPTE_MMU_PRESENT_MASK because they are both in the low 12 bits.
SHADOW_NONPRESENT_VALUE is bit 63, so it cannot conflict with REMOVED_SPTE,
which per the comment only uses low bits.
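
For what it's worth, that disjointness can be demonstrated with standalone
compile-time checks (values copied from the patch; illustrative user-space
code, not kernel code):

        #include <assert.h>     /* static_assert (C11) */

        #define SHADOW_NONPRESENT_VALUE (1ULL << 63)   /* bit 63 only */
        #define __REMOVED_SPTE          0x5a0ULL       /* low 12 bits only */
        #define LOW_12_BITS             ((1ULL << 12) - 1)

        /* __REMOVED_SPTE never leaves the low 12 bits... */
        static_assert((__REMOVED_SPTE & ~LOW_12_BITS) == 0, "low bits only");
        /* ...so it cannot overlap a value that only uses bit 63. */
        static_assert((__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE) == 0, "disjoint");

        int main(void) { return 0; }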
On Mon, Jun 27, 2022 at 02:53:28PM -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> Secure-EPT maps protected guest memory, which is called private. Since
> Secure-EPT page tables is also protected, those page tables is also called
> private. The existing EPT is often called shared EPT to distinguish from
> Secure-EPT. And also page tables for share EPT is also called shared.
>
> Virtualization Exception, #VE, is a new processor exception in VMX non-root
> operation. In certain virtualizatoin-related conditions, #VE is injected
> into guest instead of exiting from guest to VMM so that guest is given a
> chance to inspect it. One important one is EPT violation. When
> "ETP-violation #VE" VM-execution is set, "#VE suppress bit" in EPT entry
> is cleared, #VE is injected instead of EPT violation.
>
> Because guest memory is protected with TDX, VMM can't parse instructions
> in the guest memory. Instead, MMIO hypercall is used for guest to pass
> necessary information to VMM.
>
> To make unmodified device driver work, guest TD expects #VE on accessing
> shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> violation. So the execution flow looks like
>
> - Allocate unused shared EPT entry with suppress #VE bit set.
> - EPT violation on that GPA.
> - VMM figures out the faulted GPA is for MMIO.
> - VMM clears the suppress #VE bit.
> - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
> - If the GPA maps guest memory, VMM resolves it with guest pages.
>
> For both cases, SPTE needs suppress #VE" bit set initially when it
> is allocated or zapped, therefore non-zero non-present value for SPTE
> needs to be allowed.
>
> This change requires to update FNAME(sync_page) for shadow EPT.
> "if(!sp->spte[i])" in FNAME(sync_page) means that the spte entry is the
> initial value. With the introduction of shadow_nonpresent_value which can
> be non-zero, it doesn't hold any more. Replace zero check with
> "!is_shadow_present_pte() && !is_mmio_spte()".
>
> When "if (!spt[i])" doesn't hold, but the entry value is
> shadow_nonpresent_value, the entry is wrongly synchronized from non-present
> to non-present with (wrongly) pfn changed and tries to remove rmap wrongly
> and BUG_ON() is hit.
>
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> intermediate value to indicate one thread is operating on it and the value
> should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> value should include suppress #VE value which is SHADOW_NONPRESENT_VALUE.
> Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
> SHADOW_NONPRESENT_VALUE | REMOVED_SPTE to set suppress #VE bit.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/spte.c | 5 +++-
> arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
> 5 files changed, 105 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 51306b80f47c..f239b6cb5d53 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> + int ign;
> +
> + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> + asm volatile (
> + "rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> + : "memory"
> + );
> +#else
> + BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> +
> + if (is_tdp_mmu && shadow_nonpresent_value)
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> +
> + if (is_tdp_mmu && shadow_nonpresent_value) {
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + }
> + return 0;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -677,8 +715,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -5521,9 +5558,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> * what is used by the kernel for any given HVA, i.e. the kernel's
> * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
> */
> - if (tdp_enabled)
> + if (tdp_enabled) {
> + /*
> + * For TDP MMU, always set bit 63 for TDX support. See the
> + * comment on SHADOW_NONPRESENT_VALUE.
> + */
> +#ifdef CONFIG_X86_64
> + shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
> +#endif
> max_huge_page_level = tdp_huge_page_level;
> - else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> + } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> max_huge_page_level = PG_LEVEL_1G;
> else
> max_huge_page_level = PG_LEVEL_2M;
> @@ -5654,7 +5698,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
I'm not sure why this is skipped for TDX; arch.mmu_shadow_page_cache is
still used for allocating sp->spt, which is used to track the S-EPT in KVM
for a TDX guest. Is there anything I missed?
Please trim replies.
On Fri, Jul 08, 2022, Yuan Yao wrote:
> On Mon, Jun 27, 2022 at 02:53:28PM -0700, [email protected] wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 51306b80f47c..f239b6cb5d53 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static inline void kvm_init_shadow_page(void *page)
> > +{
> > +#ifdef CONFIG_X86_64
> > + int ign;
> > +
> > + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> > + asm volatile (
> > + "rep stosq\n\t"
I have a slight preference for:
asm volatile ("rep stosq\n\t"
<align here>
);
so that searching for "asm" or "asm volatile" shows the "rep stosq" in the
result without needing to capture the next line.
> > + : "=c"(ign), "=D"(page)
> > + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> > + : "memory"
> > + );
> > +#else
> > + BUG();
> > +#endif
Rather than put the #ifdef here, split mmu_topup_shadow_page_cache() on 64-bit
versus 32-bit. Then this BUG() goes away and we don't get slapped on the wrist
by Linus :-)
> > +}
> > +
> > +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> > + int start, end, i, r;
> > + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value)
> > + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> > +
> > + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
Bailing immediately is wrong. If kvm_mmu_topup_memory_cache() fails after allocating
at least one page, then KVM needs to initialize those pages, otherwise it will leave
uninitialized pages in the cache. If userspace frees up memory in response to the
-ENOMEM and resumes the vCPU, KVM will consume uninitialized data.
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value) {
So I'm pretty sure I effectively suggested keeping shadow_nonpresent_value, but
seeing it in code, I really don't like it. It's an unnecessary check on every
SPT allocation, and it's misleading because it suggests shadow_nonpresent_value
might be zero when the TDP MMU is enabled.
My vote is to drop shadow_nonpresent_value and then rename kvm_init_shadow_page()
to make it clear that it's specific to the TDP MMU.
So this? Completely untested.
#ifdef CONFIG_X86_64
static void kvm_init_tdp_mmu_shadow_page(void *page)
{
int ign;
asm volatile ("rep stosq\n\t"
: "=c"(ign), "=D"(page)
: "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
: "memory"
);
}
static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
int start, end, i, r;
if (is_tdp_mmu)
start = kvm_mmu_memory_cache_nr_free_objects(mc);
r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
/*
* Note, topup may have allocated objects even if it failed to allocate
* the minimum number of objects required to make forward progress _at
* this time_. Initialize newly allocated objects even on failure, as
* userspace can free memory and rerun the vCPU in response to -ENOMEM.
*/
if (is_tdp_mmu) {
end = kvm_mmu_memory_cache_nr_free_objects(mc);
for (i = start; i < end; i++)
kvm_init_tdp_mmu_shadow_page(mc->objects[i]);
}
return r;
}
#else
static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
{
return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
}
#endif /* CONFIG_X86_64 */
> > + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> > + for (i = start; i < end; i++)
> > + kvm_init_shadow_page(mc->objects[i]);
> > + }
> > + return 0;
> > +}
> > +
...
> > @@ -5654,7 +5698,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> > + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> I'm not sure why skip this for TDX, arch.mmu_shadow_page_cache is
> still used for allocating sp->spt which used to track the S-EPT in kvm
> for tdx guest. Anything I missed for this ?
Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise not-present
EPT violations would be reflected into the guest by hardware as #VE exceptions.
This is handled by initializing page allocations via kvm_init_shadow_page() during
cache topup if shadow_nonpresent_value is non-zero. In that case, telling the
page allocation to zero-initialize the page would be wasted effort.
The initialization is harmless for S-EPT entries because KVM's copy of the S-EPT
isn't consumed by hardware, and because under the hood S-EPT entries should never
#VE (I forget if this is enforced by hardware or if the TDX module sets SUPPRESS_VE).
On Fri, Jul 08, 2022 at 03:30:23PM +0000, Sean Christopherson wrote:
> Please trim replies.
>
> On Fri, Jul 08, 2022, Yuan Yao wrote:
> > On Mon, Jun 27, 2022 at 02:53:28PM -0700, [email protected] wrote:
> >
> > I'm not sure why skip this for TDX, arch.mmu_shadow_page_cache is
> > still used for allocating sp->spt which used to track the S-EPT in kvm
> > for tdx guest. Anything I missed for this ?
>
> Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise not-present
> EPT violations would be reflected into the guest by hardware as #VE exceptions.
> This is handled by initializing page allocations via kvm_init_shadow_page() during
> cache topup if shadow_nonpresent_value is non-zero. In that case, telling the
> page allocation to zero-initialize the page would be wasted effort.
>
> The initialization is harmless for S-EPT entries because KVM's copy of the S-EPT
> isn't consumed by hardware, and because under the hood S-EPT entries should never
> #VE (I forget if this is enforced by hardware or if the TDX module sets SUPPRESS_VE).
Ah I see, you're right, thanks for the explanation! I think that with the
changes you suggested above, __GFP_ZERO can be removed from
mmu_shadow_page_cache for VMs for which is_tdp_mmu_enabled() is true:
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8de26cbde295..0b412f3eb0c5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6483,8 +6483,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
- if (!(tdp_enabled && shadow_nonpresent_value))
- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ if (!is_tdp_mmu_enabled(vcpu->kvm))
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
On Mon, Jul 11, 2022, Yuan Yao wrote:
> On Fri, Jul 08, 2022 at 03:30:23PM +0000, Sean Christopherson wrote:
> > Please trim replies.
> > > I'm not sure why skip this for TDX, arch.mmu_shadow_page_cache is
> > > still used for allocating sp->spt which used to track the S-EPT in kvm
> > > for tdx guest. Anything I missed for this ?
> >
> > Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise not-present
> > EPT violations would be reflected into the guest by hardware as #VE exceptions.
> > This is handled by initializing page allocations via kvm_init_shadow_page() during
> > cache topup if shadow_nonpresent_value is non-zero. In that case, telling the
> > page allocation to zero-initialize the page would be wasted effort.
> >
> > The initialization is harmless for S-EPT entries because KVM's copy of the S-EPT
> > isn't consumed by hardware, and because under the hood S-EPT entries should never
> > #VE (I forget if this is enforced by hardware or if the TDX module sets SUPPRESS_VE).
>
> Ah I see, you're right, thanks for the explanation! I think with
> changes you suggested above the __GFP_ZERO can be removed from
> mmu_shadow_page_cache for VMs which is_tdp_mmu_enabled() is true:
Yep.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8de26cbde295..0b412f3eb0c5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6483,8 +6483,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - if (!(tdp_enabled && shadow_nonpresent_value))
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!is_tdp_mmu_enabled(vcpu->kvm))
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
On Thu, Jun 30, 2022 at 11:03:56PM +1200,
Kai Huang <[email protected]> wrote:
> On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> > Secure-EPT maps protected guest memory, which is called private. Since
> > Secure-EPT page tables is also protected, those page tables is also called
> > private. The existing EPT is often called shared EPT to distinguish from
> > Secure-EPT. And also page tables for share EPT is also called shared.
>
> Does this patch has anything to do with secure-EPT?
>
> >
> > Virtualization Exception, #VE, is a new processor exception in VMX non-root
>
> #VE isn't new. It's already in pre-TDX public spec AFAICT.
>
> > operation. In certain virtualizatoin-related conditions, #VE is injected
> > into guest instead of exiting from guest to VMM so that guest is given a
> > chance to inspect it. One important one is EPT violation. When
> > "ETP-violation #VE" VM-execution is set, "#VE suppress bit" in EPT entry
> > is cleared, #VE is injected instead of EPT violation.
>
> We already know such fact based on pre-TDX public spec. Instead of repeating it
> here, why not focusing on saying what's new in TDX, so your below paragraph of
> setting a non-zero value for non-present SPTE can be justified?
Ok, will drop those two paragraphs above.
> > Because guest memory is protected with TDX, VMM can't parse instructions
> > in the guest memory. Instead, MMIO hypercall is used for guest to pass
> > necessary information to VMM.
> >
> > To make unmodified device driver work, guest TD expects #VE on accessing
> > shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> > the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> > enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> > violation.
> >
>
> As I said above, before here, you need to explain in TDX VMCS is controlled by
> the TDX module and it always sets the "EPT-violation #VE" in execution control
> bit.
>
> > So the execution flow looks like
> >
> > - Allocate unused shared EPT entry with suppress #VE bit set.
> > - EPT violation on that GPA.
> > - VMM figures out the faulted GPA is for MMIO.
> > - VMM clears the suppress #VE bit.
> > - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
> > - If the GPA maps guest memory, VMM resolves it with guest pages.
> >
> > For both cases, SPTE needs suppress #VE" bit set initially when it
> > is allocated or zapped, therefore non-zero non-present value for SPTE
> > needs to be allowed.
> >
> > This change requires to update FNAME(sync_page) for shadow EPT.
> > "if(!sp->spte[i])" in FNAME(sync_page) means that the spte entry is the
> > initial value. With the introduction of shadow_nonpresent_value which can
> > be non-zero, it doesn't hold any more. Replace zero check with
> > "!is_shadow_present_pte() && !is_mmio_spte()".
>
> I don't think you need to mention the above paragraph. It's absolutely unclear
> how is_mmio_spte() will be impacted by this patch from reading the above
> paragraphs.
>
> From the "execution flow" you mentioned above, you will change MMIO fault from
> EPT misconfiguration to EPT violation (in order to get #VE), so theoretically
> you may effectively disable MMIO caching, in which case, if I understand
> correctly, is_mmio_spte() always returns false.
>
> I guess you can just change to check:
>
> if (sp->spt[i] != shadow_nonpresent_value)
>
> Anyway, IMO you can just comment in the code.
>
> After all, what is shadow_nonpresent_value, given you haven't explained what it
> is?
I'll drop the paragraph and add a comment in the code.
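Concretely, something like the below in FNAME(sync_page) (just a sketch of the
intended comment and check; the final wording may differ in the next revision):
	/*
	 * spt[i] still holds the value written when the shadow page was
	 * allocated.  With SHADOW_NONPRESENT_VALUE that initial value is no
	 * longer zero, so compare against it instead of checking for zero.
	 */
	if (sp->spt[i] != SHADOW_NONPRESENT_VALUE)
		continue;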
> > TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> > intermediate value to indicate one thread is operating on it and the value
> > should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> > value should include suppress #VE value which is SHADOW_NONPRESENT_VALUE.
>
> What is SHADOW_NONPRESENT_VALUE?
>
> > Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
> > SHADOW_NONPRESENT_VALUE | REMOVED_SPTE to set suppress #VE bit.
>
> Ditto. IMHO you don't even need to mention REMOVED_SPTE in changelog.
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
> > arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> > arch/x86/kvm/mmu/spte.c | 5 +++-
> > arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
> > arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
> > 5 files changed, 105 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 51306b80f47c..f239b6cb5d53 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -668,6 +668,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static inline void kvm_init_shadow_page(void *page)
> > +{
> > +#ifdef CONFIG_X86_64
> > + int ign;
> > +
> > + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> > + asm volatile (
> > + "rep stosq\n\t"
> > + : "=c"(ign), "=D"(page)
> > + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> > + : "memory"
> > + );
> > +#else
> > + BUG();
> > +#endif
> > +}
> > +
> > +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> > + int start, end, i, r;
> > + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value)
> > + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> > +
> > + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value) {
> > + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> > + for (i = start; i < end; i++)
> > + kvm_init_shadow_page(mc->objects[i]);
> > + }
>
> I think you can just extend this to the legacy MMU too, not only the TDP MMU.
>
> After all, before this patch, where have you declared that TDX only supports TDP
> MMU? This is only enforced in:
>
> [PATCH v7 043/102] KVM: x86/mmu: Focibly use TDP MMU for TDX
>
> Which is 7 patches later.
>
> Also, shadow_nonpresent_value is only used in a couple of places, while
> SHADOW_NONPRESENT_VALUE is used directly in more places. Does it make more
> sense to always use shadow_nonpresent_value, instead of using
> SHADOW_NONPRESENT_VALUE?
>
> Similar to other shadow values, we can provide a function to let the caller
> (VMX/SVM) decide whether it wants to use a non-zero value for non-present SPTEs.
>
> void kvm_mmu_set_non_present_value(u64 value)
> {
> shadow_nonpresent_value = value;
> }
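> E.g. VMX (and later TDX) could then opt in during hardware setup, similar to
> how kvm_mmu_set_mmio_spte_mask() is used today, with something like
> (illustrative only):
>
> 	kvm_mmu_set_non_present_value(SHADOW_NONPRESENT_VALUE);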
As you pointed out, that logic is independent of TDP MMU vs. legacy MMU,
so I'll remove is_tdp_mmu. I'll also drop shadow_nonpresent_value and use
SHADOW_NONPRESENT_VALUE directly.
--
Isaku Yamahata <[email protected]>
Thanks for the review. Here is the updated version.
From f1ee540d62ba13511b2c7d3db7662e32bd263e48 Mon Sep 17 00:00:00 2001
Message-Id: <f1ee540d62ba13511b2c7d3db7662e32bd263e48.1657823906.git.isaku.yamahata@intel.com>
In-Reply-To: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1657823906.git.isaku.yamahata@intel.com>
References: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1657823906.git.isaku.yamahata@intel.com>
From: Sean Christopherson <[email protected]>
Date: Mon, 29 Jul 2019 19:23:46 -0700
Subject: [PATCH 036/304] KVM: x86/mmu: Allow non-zero value for non-present
SPTE
TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
Secure-EPT maps protected guest memory, which is called private. Since
Secure-EPT page tables is also protected, those page tables is also called
private. The existing EPT is often called shared EPT to distinguish from
Secure-EPT. And also page tables for shared EPT is also called shared.
The TDX module enables #VE injection by setting "EPT-violation #VE" in the
secondary processor-based VM-execution controls of the TD VMCS. It also sets
the "suppress #VE" bit in Secure-EPT so that an EPT violation on Secure-EPT
causes an exit to the VMM.
Because guest memory is protected with TDX, the VMM can't parse instructions
in guest memory. Instead, an MMIO hypercall is used for the guest TD to pass
the necessary information to the VMM. To make unmodified device drivers work,
the guest TD expects #VE on accessing a shared GPA for MMIO. The #VE handler
of the guest TD converts the MMIO access into an MMIO hypercall. To trigger
#VE in the guest TD, the VMM needs to clear the "suppress #VE" bit in the
shared EPT entry that corresponds to the MMIO address.
So the execution flow for MMIO is as follows:
- The TDX module sets "EPT-violation #VE" in the secondary processor-based
VM-execution controls of the TD VMCS.
- Allocate a page for the shared EPT PML4E page. Shared EPT entries are
initialized with the suppress #VE bit set. Update the EPTP pointer.
- The TD accesses a GPA for MMIO to trigger an EPT violation. It exits to the
VMM with an EPT violation due to the suppress #VE bit of the EPT entries of
the PML4E page.
- The VMM figures out the faulted GPA is for MMIO.
- Start the shared EPT page table walk.
- Allocate non-leaf EPT pages for the shared EPT.
- Allocate a leaf EPT page for the shared EPT and initialize its EPT entries
with the suppress #VE bit set.
- The VMM clears the suppress #VE bit for the faulted GPA for MMIO.
Note that the leaf EPT page has 512 SPTEs and the other 511 SPTE entries
need to keep the "suppress #VE" bit set because the GPAs for those SPTEs are
not known to be MMIO. (That requires further lookups.)
If the GPA maps a guest page, link the guest page from the leaf SPTE entry.
- Resume the TD vCPU.
- The guest TD gets #VE and converts the MMIO access into an MMIO hypercall.
- If the GPA maps guest memory, the VMM resolves it with guest pages.
SPTEs for the shared EPT need the "suppress #VE" bit set initially when they
are allocated or zapped, therefore a non-zero non-present value for SPTEs
needs to be allowed.
The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special constant, an
intermediate value indicating that one thread is operating on the SPTE; the
value is semi-arbitrary. For TDX (more exactly, to use #VE), the value
should include the suppress #VE bit. Rename REMOVED_SPTE to __REMOVED_SPTE
and define REMOVED_SPTE as __REMOVED_SPTE with the "suppress #VE" bit set.
For simplicity, "suppress #VE" bit is set unconditionally for X86_64 for
non-present SPTE. Because "suppress #VE" bit (bit position of 63) for
non-present SPTE is ignored for non-TD case (AMD CPUs or Intel VMX case
with "EPT-violation #VE" cleared), the functionality shouldn't change.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 71 ++++++++++++++++++++++++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/mmu/spte.c | 5 ++-
arch/x86/kvm/mmu/spte.h | 28 +++++++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++----
5 files changed, 116 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 51306b80f47c..992f31458f94 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -668,6 +668,55 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}
+#ifdef CONFIG_X86_64
+static inline void kvm_init_shadow_page(void *page)
+{
+ int ign;
+
+ /*
+ * AMD: "suppress #VE" bit is ignored
+ * Intel non-TD(VMX): "suppress #VE" bit is ignored because
+ * EPT_VIOLATION_VE isn't set.
+ * guest TD: TDX module sets EPT_VIOLATION_VE
+ * conventional SEPT: "suppress #VE" bit must be set to get EPT violation
+ * private SEPT: "suppress #VE" bit is ignored. CPU doesn't walk it
+ *
+ * For simplicity, unconditionally initialize SPTEs to set "suppress #VE".
+ */
+ asm volatile ("rep stosq\n\t"
+ : "=c"(ign), "=D"(page)
+ : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
+ : "memory"
+ );
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+ int start, end, i, r;
+
+ start = kvm_mmu_memory_cache_nr_free_objects(mc);
+ r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+
+ /*
+ * Note, topup may have allocated objects even if it failed to allocate
+ * the minimum number of objects required to make forward progress _at
+ * this time_. Initialize newly allocated objects even on failure, as
+ * userspace can free memory and rerun the vCPU in response to -ENOMEM.
+ */
+ end = kvm_mmu_memory_cache_nr_free_objects(mc);
+ for (i = start; i < end; i++)
+ kvm_init_shadow_page(mc->objects[i]);
+ return r;
+}
+#else
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+ return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ PT64_ROOT_MAX_LEVEL);
+}
+#endif /* CONFIG_X86_64 */
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -677,8 +726,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_shadow_page_cache(vcpu);
if (r)
return r;
if (maybe_indirect) {
@@ -5654,7 +5702,24 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ /*
+ * When X86_64, initial SEPT entries are initialized with
+ * SHADOW_NONPRESENT_VALUE. Otherwise zeroed. See
+ * mmu_topup_shadow_page_cache().
+ *
+ * Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise
+ * not-present EPT violations would be reflected into the guest by
+ * hardware as #VE exceptions. This is handled by initializing page
+ * allocations via kvm_init_shadow_page() during cache topup.
+ * In that case, telling the page allocation to zero-initialize the page
+ * would be wasted effort.
+ *
+ * The initialization is harmless for S-EPT entries because KVM's copy
+ * of the S-EPT isn't consumed by hardware, and because under the hood
+ * S-EPT entries should never #VE.
+ */
+ if (!IS_ENABLED(CONFIG_X86_64))
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index fe35d8fd3276..964ec76579f0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1031,7 +1031,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
gpa_t pte_gpa;
gfn_t gfn;
- if (!sp->spt[i])
+ /* spt[i] has initial value of shadow page table allocation */
+ if (sp->spt[i] != SHADOW_NONPRESENT_VALUE)
continue;
pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cda1851ec155..bd441458153f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
+#ifdef CONFIG_X86_64
+u64 __read_mostly shadow_nonpresent_value;
+#endif
u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -360,7 +363,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
* not set any RWX bits.
*/
if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
- WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
+ WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;
if (!mmio_value)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0127bb6e3c7d..f5fd22f6bf5f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
+/*
+ * non-present SPTE value for both VMX and SVM for TDP MMU.
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled.
+ * bit 63 is #VE suppress if #VE is enabled.
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
+#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
@@ -178,16 +191,27 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
* non-present intermediate value. Other threads which encounter this value
* should not modify the SPTE.
*
+ * For X86_64 case, SHADOW_NONPRESENT_VALUE, "suppress #VE" bit, is set because
+ * "EPT violation #VE" in the secondary VM execution control may be enabled.
+ * Because TDX module sets "EPT violation #VE" for TD, "suppress #VE" bit for
+ * the conventional EPT needs to be set.
+ *
* Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
* both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
* vulnerability. Use only low bits to avoid 64-bit immediates.
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define __REMOVED_SPTE 0x5a0ULL
/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
-static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+
+/*
+ * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
+ * intermediate value written to the removed SPTE. It sets the "suppress #VE" bit.
+ */
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
static inline bool is_removed_spte(u64 spte)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b9265d67131..2ca03ec3bf52 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -692,8 +692,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* overwrite the special removed SPTE value. No bookkeeping is needed
* here since the SPTE is going from non-present to non-present. Use
* the raw write helper to avoid an unnecessary check on volatile bits.
+ *
+ * Set the non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
+ * This is because when TDX is enabled, the TDX module always
+ * enables "EPT-violation #VE", so KVM needs to set the
+ * "suppress #VE" bit in EPT table entries in order to get a
+ * real EPT violation, rather than a TDVMCALL. Use
+ * SHADOW_NONPRESENT_VALUE (which sets the "suppress #VE" bit) so
+ * the bit remains set when EPT table entries are zapped.
*/
- __kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
return 0;
}
@@ -870,8 +878,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
if (!shared)
- tdp_mmu_set_spte(kvm, &iter, 0);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
goto retry;
}
}
@@ -927,8 +935,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
- __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
- sp->gfn, sp->role.level + 1, true, true);
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+ SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
+ true, true);
return true;
}
@@ -965,7 +974,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;
- tdp_mmu_set_spte(kvm, &iter, 0);
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}
@@ -1330,7 +1339,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, 0);
+ tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1
--
Isaku Yamahata <[email protected]>
On Thu, 2022-07-14 at 11:41 -0700, Isaku Yamahata wrote:
> Thanks for review. Now here is the updated version.
>
> From f1ee540d62ba13511b2c7d3db7662e32bd263e48 Mon Sep 17 00:00:00 2001
> Message-Id: <f1ee540d62ba13511b2c7d3db7662e32bd263e48.1657823906.git.isaku.yamahata@intel.com>
> In-Reply-To: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1657823906.git.isaku.yamahata@intel.com>
> References: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1657823906.git.isaku.yamahata@intel.com>
> From: Sean Christopherson <[email protected]>
> Date: Mon, 29 Jul 2019 19:23:46 -0700
> Subject: [PATCH 036/304] KVM: x86/mmu: Allow non-zero value for non-present
> SPTE
>
> TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> Secure-EPT maps protected guest memory, which is called private. Since
> Secure-EPT page tables is also protected, those page tables is also called
> private. The existing EPT is often called shared EPT to distinguish from
> Secure-EPT. And also page tables for shared EPT is also called shared.
AFAICT secure-EPT isn't directly related here, so I don't think you should
spend the paragraph on it. The first paragraph should state the problem and
catch the reviewer's eye.
>
> The TDX module enables #VE injection by setting "EPT-violation #VE" in the
> secondary processor-based VM-execution controls of the TD VMCS. It also sets
> the "suppress #VE" bit in Secure-EPT so that an EPT violation on Secure-EPT
> causes an exit to the VMM.
>
> Because guest memory is protected with TDX, the VMM can't parse instructions
> in guest memory. Instead, an MMIO hypercall is used for the guest TD to pass
> the necessary information to the VMM. To make unmodified device drivers work,
> the guest TD expects #VE on accessing a shared GPA for MMIO. The #VE handler
> of the guest TD converts the MMIO access into an MMIO hypercall. To trigger
> #VE in the guest TD, the VMM needs to clear the "suppress #VE" bit in the
> shared EPT entry that corresponds to the MMIO address.
>
> So the execution flow for MMIO is as follows:
>
> - The TDX module sets "EPT-violation #VE" in the secondary processor-based
> VM-execution controls of the TD VMCS.
> - Allocate a page for the shared EPT PML4E page. Shared EPT entries are
> initialized with the suppress #VE bit set. Update the EPTP pointer.
> - The TD accesses a GPA for MMIO to trigger an EPT violation. It exits to the
> VMM with an EPT violation due to the suppress #VE bit of the EPT entries of
> the PML4E page.
> - The VMM figures out the faulted GPA is for MMIO.
> - Start the shared EPT page table walk.
> - Allocate non-leaf EPT pages for the shared EPT.
> - Allocate a leaf EPT page for the shared EPT and initialize its EPT entries
> with the suppress #VE bit set.
> - The VMM clears the suppress #VE bit for the faulted GPA for MMIO.
> Note that the leaf EPT page has 512 SPTEs and the other 511 SPTE entries
> need to keep the "suppress #VE" bit set because the GPAs for those SPTEs are
> not known to be MMIO. (That requires further lookups.)
> If the GPA maps a guest page, link the guest page from the leaf SPTE entry.
> - Resume the TD vCPU.
> - The guest TD gets #VE and converts the MMIO access into an MMIO hypercall.
> - If the GPA maps guest memory, the VMM resolves it with guest pages.
Too many details IMHO.
Also, you forgot to mention that the non-zero value for non-present SPTEs is not
just for MMIO, but also for shared memory.
How about below?
For a TD guest, the current way to emulate MMIO doesn't work any more, as KVM is
not able to access the private memory of the TD guest and do the emulation.
Instead, the TD guest expects to receive #VE when it accesses the MMIO, and then
it can explicitly make a hypercall to KVM to get the expected information.
To achieve this, the TDX module always enables "EPT-violation #VE" in the VMCS
control. Accordingly, KVM needs to configure the MMIO SPTE to trigger an EPT
violation (instead of a misconfiguration) and, at the same time, also clear the
"suppress #VE" bit so the TD guest can get the #VE instead of causing an actual
EPT violation to KVM.
In order for KVM to have a chance to set up the correct SPTE for MMIO for the TD
guest, the default non-present SPTE must have the "suppress #VE" bit set so KVM
gets a real EPT violation the first time the TD guest accesses the MMIO.
Also, when the TD guest accesses actual shared memory, it should continue to
trigger an EPT violation to KVM instead of receiving the #VE (the TDX module
guarantees KVM will receive an EPT violation for private memory accesses). This
means that for shared memory, the SPTE also must have the "suppress #VE" bit set
for the non-present SPTE.
Add support to allow a non-zero value for the non-present SPTE (i.e. when the
page table is first allocated, and when the SPTE is zapped) to allow setting the
"suppress #VE" bit for the non-present SPTE.
Introduce a new macro SHADOW_NONPRESENT_VALUE to be the "suppress #VE" bit.
Unconditionally set the "suppress #VE" bit (which is bit 63) for both AMD and
Intel as: 1) AMD hardware doesn't use this bit; 2) for a normal VMX guest, KVM
never enables "EPT-violation #VE" in the VMCS control and the "suppress #VE" bit
is ignored by hardware.
(if you want to set SHADOW_NONPRESENT_VALUE only for the TDP MMU then continue
to describe that, but I don't see it being done in your patch below)
>
> SPTEs for the shared EPT need the "suppress #VE" bit set initially when they
> are allocated or zapped, therefore a non-zero non-present value for SPTEs
> needs to be allowed.
>
> The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special constant, an
> intermediate value indicating that one thread is operating on the SPTE; the
> value is semi-arbitrary. For TDX (more exactly, to use #VE), the value
> should include the suppress #VE bit. Rename REMOVED_SPTE to __REMOVED_SPTE
> and define REMOVED_SPTE as __REMOVED_SPTE with the "suppress #VE" bit set.
IMHO REMOVED_SPTE is an implementation detail so it's OK to not mention it in
the changelog.
>
> For simplicity, "suppress #VE" bit is set unconditionally for X86_64 for
> non-present SPTE. Because "suppress #VE" bit (bit position of 63) for
> non-present SPTE is ignored for non-TD case (AMD CPUs or Intel VMX case
> with "EPT-violation #VE" cleared), the functionality shouldn't change.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 71 ++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/spte.c | 5 ++-
> arch/x86/kvm/mmu/spte.h | 28 +++++++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++----
> 5 files changed, 116 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 51306b80f47c..992f31458f94 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -668,6 +668,55 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +#ifdef CONFIG_X86_64
> +static inline void kvm_init_shadow_page(void *page)
> +{
> + int ign;
> +
> + /*
> + * AMD: "suppress #VE" bit is ignored
> + * Intel non-TD(VMX): "suppress #VE" bit is ignored because
> + * EPT_VIOLATION_VE isn't set.
> + * guest TD: TDX module sets EPT_VIOLATION_VE
> + * conventional SEPT: "suppress #VE" bit must be set to get EPT violation
> + * private SEPT: "suppress #VE" bit is ignored. CPU doesn't walk it
> + *
> + * For simplicity, unconditionally initialize SPTEs to set "suppress #VE".
> + */
> + asm volatile ("rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> + : "memory"
> + );
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> +
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> +
> + /*
> + * Note, topup may have allocated objects even if it failed to allocate
> + * the minimum number of objects required to make forward progress _at
> + * this time_. Initialize newly allocated objects even on failure, as
> + * userspace can free memory and rerun the vCPU in response to -ENOMEM.
> + */
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + return r;
> +}
> +#else
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + PT64_ROOT_MAX_LEVEL);
> +}
> +#endif /* CONFIG_X86_64 */
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -677,8 +726,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -5654,7 +5702,24 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + /*
> + * When X86_64, initial SEPT entries are initialized with
> + * SHADOW_NONPRESENT_VALUE. Otherwise zeroed. See
> + * mmu_topup_shadow_page_cache().
> + *
> + * Shared EPTEs need to be initialized with SUPPRESS_VE=1, otherwise
> + * not-present EPT violations would be reflected into the guest by
> + * hardware as #VE exceptions. This is handled by initializing page
> + * allocations via kvm_init_shadow_page() during cache topup.
> + * In that case, telling the page allocation to zero-initialize the page
> + * would be wasted effort.
> + *
> + * The initialization is harmless for S-EPT entries because KVM's copy
> + * of the S-EPT isn't consumed by hardware, and because under the hood
> + * S-EPT entries should never #VE.
> + */
> + if (!IS_ENABLED(CONFIG_X86_64))
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index fe35d8fd3276..964ec76579f0 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1031,7 +1031,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> gpa_t pte_gpa;
> gfn_t gfn;
>
> - if (!sp->spt[i])
> + /* spt[i] has initial value of shadow page table allocation */
> + if (sp->spt[i] != SHADOW_NONPRESENT_VALUE)
> continue;
>
> pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cda1851ec155..bd441458153f 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +#ifdef CONFIG_X86_64
> +u64 __read_mostly shadow_nonpresent_value;
> +#endif
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -360,7 +363,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> * not set any RWX bits.
> */
> if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
> - WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> + WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> mmio_value = 0;
>
> if (!mmio_value)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 0127bb6e3c7d..f5fd22f6bf5f 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>
> #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>
> +/*
> + * non-present SPTE value for both VMX and SVM for TDP MMU.
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled.
> + * bit 63 is #VE suppress if #VE is enabled.
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
> +#define SHADOW_NONPRESENT_VALUE 0ULL
> +#endif
> +
> extern u64 __read_mostly shadow_host_writable_mask;
> extern u64 __read_mostly shadow_mmu_writable_mask;
> extern u64 __read_mostly shadow_nx_mask;
> @@ -178,16 +191,27 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> * non-present intermediate value. Other threads which encounter this value
> * should not modify the SPTE.
> *
> + * For X86_64 case, SHADOW_NONPRESENT_VALUE, "suppress #VE" bit, is set because
> + * "EPT violation #VE" in the secondary VM execution control may be enabled.
> + * Because TDX module sets "EPT violation #VE" for TD, "suppress #VE" bit for
> + * the conventional EPT needs to be set.
> + *
> * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> * vulnerability. Use only low bits to avoid 64-bit immediates.
> *
> * Only used by the TDP MMU.
> */
> -#define REMOVED_SPTE 0x5a0ULL
> +#define __REMOVED_SPTE 0x5a0ULL
>
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> -static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +
> +/*
> + * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
> + * intermediate value written to the removed SPTE. It sets the "suppress #VE" bit.
> + */
> +#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
>
> static inline bool is_removed_spte(u64 spte)
> {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b9265d67131..2ca03ec3bf52 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -692,8 +692,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * overwrite the special removed SPTE value. No bookkeeping is needed
> * here since the SPTE is going from non-present to non-present. Use
> * the raw write helper to avoid an unnecessary check on volatile bits.
> + *
> + * Set the non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
> + * This is because when TDX is enabled, the TDX module always
> + * enables "EPT-violation #VE", so KVM needs to set the
> + * "suppress #VE" bit in EPT table entries in order to get a
> + * real EPT violation, rather than a TDVMCALL. Use
> + * SHADOW_NONPRESENT_VALUE (which sets the "suppress #VE" bit) so
> + * the bit remains set when EPT table entries are zapped.
> */
> - __kvm_tdp_mmu_write_spte(iter->sptep, 0);
> + __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
Since you also always set SHADOW_NONPRESENT_VALUE in SPTEs for the legacy MMU
when the page table is first allocated, it also makes sense to set it when an
SPTE is zapped for the legacy MMU.
This part is missing in this patch.
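E.g. something along these lines in the legacy MMU zap paths
(mmu_spte_clear_track_bits() and mmu_spte_clear_no_track(); illustrative only,
based on my reading of mmu.c -- please double check the exact call sites):
-	__update_clear_spte_fast(sptep, 0ull);
+	__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
-	old_spte = __update_clear_spte_slow(sptep, 0ull);
+	old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);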
>
> return 0;
> }
> @@ -870,8 +878,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
>
> if (!shared)
> - tdp_mmu_set_spte(kvm, &iter, 0);
> - else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> + else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> goto retry;
> }
> }
> @@ -927,8 +935,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> return false;
>
> - __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> - sp->gfn, sp->role.level + 1, true, true);
> + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> + SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
> + true, true);
>
> return true;
> }
> @@ -965,7 +974,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> - tdp_mmu_set_spte(kvm, &iter, 0);
> + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> flush = true;
> }
>
> @@ -1330,7 +1339,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> * invariant that the PFN of a present * leaf SPTE can never change.
> * See __handle_changed_spte().
> */
> - tdp_mmu_set_spte(kvm, iter, 0);
> + tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
>
> if (!pte_write(range->pte)) {
> new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
> --
> 2.25.1
>
>
>
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +#ifdef CONFIG_X86_64
> +u64 __read_mostly shadow_nonpresent_value;
> +#endif
Is this ever used?
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -360,7 +363,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> * not set any RWX bits.
> */
> if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
> - WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> + WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> mmio_value = 0;
This chunk doesn't look right, or necessary. We need an mmio_mask/mmio_value
which causes an EPT violation but with the "suppress #VE" bit clear.
So, actually, we want to make sure SHADOW_NONPRESENT_VALUE is *NOT* in mmio_mask
and mmio_value. Using (REMOVED_SPTE & mmio_mask) == mmio_value can actually
ensure SHADOW_NONPRESENT_VALUE is never set in an MMIO SPTE, correct? So I think
using REMOVED_SPTE is fine.
Or maybe additionally adding an explicit check is even better:
if (WARN_ON(mmio_mask & SHADOW_NONPRESENT_VALUE))
mmio_value = 0;
But this change should maybe be in another patch which deals with setting up the
per-VM mmio_mask/mmio_value anyway. This patch, instead, focuses on allowing a
non-zero value for non-present SPTEs.
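Putting the two together, the check could end up looking something like below
(illustrative only, combining the existing WARN_ONs with the extra check):
	if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
	    WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value) ||
	    WARN_ON(mmio_mask & SHADOW_NONPRESENT_VALUE))
		mmio_value = 0;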