2022-05-09 06:46:52

by Isaku Yamahata

Subject: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

From: Sean Christopherson <[email protected]>

TDX introduces a new EPT, Secure-EPT, in addition to the existing EPT.
Secure-EPT maps protected guest memory, which is called private. Since
Secure-EPT page tables are also protected, those page tables are also
called private. The existing EPT is often called shared EPT to distinguish
it from Secure-EPT, and its page tables are likewise called shared.

Virtualization Exception (#VE) is a new processor exception in VMX non-root
operation. In certain virtualization-related conditions, a #VE is injected
into the guest instead of exiting from the guest to the VMM, so that the
guest is given a chance to inspect the condition. One important case is an
EPT violation: when the "EPT-violation #VE" VM-execution control is set and
the "suppress #VE" bit in the EPT entry is cleared, a #VE is injected
instead of an EPT violation.

Because guest memory is protected with TDX, the VMM can't parse instructions
in guest memory. Instead, an MMIO hypercall is used for the guest to pass
the necessary information to the VMM.

To make unmodified device drivers work, a guest TD expects a #VE on accesses
to shared GPAs, and the guest #VE handler converts the MMIO access into an
MMIO hypercall. #VE is enabled for an EPT entry by clearing its "suppress
#VE" bit. Before the VMM enables #VE for a GPA, it needs to figure out that
the GPA is for MMIO, which it learns from the EPT violation. So the
execution flow looks like:

- Allocate an unused shared EPT entry with the "suppress #VE" bit set.
- An EPT violation occurs on that GPA.
- The VMM figures out that the faulted GPA is for MMIO.
- The VMM clears the "suppress #VE" bit.
- The guest TD gets a #VE and converts the MMIO access into an MMIO hypercall.
- If the GPA maps guest memory, the VMM instead resolves it with guest pages.

In both cases, the SPTE needs the "suppress #VE" bit set initially when it
is allocated or zapped, therefore a non-zero non-present value for the SPTE
needs to be allowed.

This change requires updating FNAME(sync_page) for shadow EPT.
"if (!sp->spt[i])" in FNAME(sync_page) means that the SPTE still holds the
initial value. With the introduction of shadow_nonpresent_value, which can
be non-zero, that no longer holds. Replace the zero check with
"!is_shadow_present_pte() && !is_mmio_spte()".

When "if (!spt[i])" doesn't hold, but the entry value is
shadow_nonpresent_value, the entry is wrongly synchronized from non-present
to non-present with (wrongly) pfn changed and tries to remove rmap wrongly
and BUG_ON() is hit.

The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special constant: a
semi-arbitrary intermediate value indicating that one thread is operating
on the SPTE. For TDX (more precisely, to use #VE), the value should include
the "suppress #VE" bit, i.e. SHADOW_NONPRESENT_VALUE. Rename REMOVED_SPTE
to __REMOVED_SPTE and define REMOVED_SPTE as
SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE so that the "suppress #VE" bit is
set.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
arch/x86/kvm/mmu/spte.c | 5 +++-
arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
5 files changed, 105 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4a12d862bbb6..324ea25ee0c7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -693,6 +693,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}

+static inline void kvm_init_shadow_page(void *page)
+{
+#ifdef CONFIG_X86_64
+ int ign;
+
+ WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
+ asm volatile (
+ "rep stosq\n\t"
+ : "=c"(ign), "=D"(page)
+ : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
+ : "memory"
+ );
+#else
+ BUG();
+#endif
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+ int start, end, i, r;
+ bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
+
+ if (is_tdp_mmu && shadow_nonpresent_value)
+ start = kvm_mmu_memory_cache_nr_free_objects(mc);
+
+ r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+
+ if (is_tdp_mmu && shadow_nonpresent_value) {
+ end = kvm_mmu_memory_cache_nr_free_objects(mc);
+ for (i = start; i < end; i++)
+ kvm_init_shadow_page(mc->objects[i]);
+ }
+ return 0;
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -702,8 +740,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_shadow_page_cache(vcpu);
if (r)
return r;
if (maybe_indirect) {
@@ -5510,9 +5547,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
* what is used by the kernel for any given HVA, i.e. the kernel's
* capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
*/
- if (tdp_enabled)
+ if (tdp_enabled) {
+ /*
+ * For TDP MMU, always set bit 63 for TDX support. See the
+ * comment on SHADOW_NONPRESENT_VALUE.
+ */
+#ifdef CONFIG_X86_64
+ shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
+#endif
max_huge_page_level = tdp_huge_page_level;
- else if (boot_cpu_has(X86_FEATURE_GBPAGES))
+ } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
max_huge_page_level = PG_LEVEL_1G;
else
max_huge_page_level = PG_LEVEL_2M;
@@ -5643,7 +5687,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b025decf610d..058efd4bbcbc 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1030,7 +1030,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
gpa_t pte_gpa;
gfn_t gfn;

- if (!sp->spt[i])
+ if (!is_shadow_present_pte(sp->spt[i]) &&
+ !is_mmio_spte(sp->spt[i]))
continue;

pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c9e87d446a..1bf934f64b6f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_value;
u64 __read_mostly shadow_me_mask;
u64 __read_mostly shadow_acc_track_mask;
+#ifdef CONFIG_X86_64
+u64 __read_mostly shadow_nonpresent_value;
+#endif

u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -330,7 +333,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
* not set any RWX bits.
*/
if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
- WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
+ WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;

if (!mmio_value)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index fbbab180395e..3319ca7f8f48 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)

+/*
+ * Non-present SPTE value for both VMX and SVM for the TDP MMU.
+ * For SVM NPT, when the SPTE is non-present (bit 0 = 0), the other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled, and is the
+ * "suppress #VE" bit if #VE is enabled.
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
+#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif
+
extern u64 __read_mostly shadow_host_writable_mask;
extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
@@ -154,6 +167,12 @@ extern u64 __read_mostly shadow_present_mask;
extern u64 __read_mostly shadow_me_value;
extern u64 __read_mostly shadow_me_mask;

+#ifdef CONFIG_X86_64
+extern u64 __read_mostly shadow_nonpresent_value;
+#else
+#define shadow_nonpresent_value 0ULL
+#endif
+
/*
* SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
* shadow_acc_track_mask is the set of bits to be cleared in non-accessed
@@ -174,9 +193,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;

/*
* If a thread running without exclusive control of the MMU lock must perform a
- * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
+ * multi-part operation on an SPTE, it can set the SPTE to __REMOVED_SPTE as a
* non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
+ * should not modify the SPTE. When TDX is enabled, REMOVED_SPTE also includes
+ * SHADOW_NONPRESENT_VALUE, i.e. the "suppress #VE" bit, because the TDX module
+ * always enables "EPT violation #VE". The bit is ignored in the non-TDX case,
+ * as the present bit (bit 0) is cleared.
*
* Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
* both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
@@ -184,10 +206,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
*
* Only used by the TDP MMU.
*/
-#define REMOVED_SPTE 0x5a0ULL
+#define __REMOVED_SPTE 0x5a0ULL

/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
-static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
+
+/*
+ * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
+ * intermediate value set in the removed SPTE. It sets the "suppress #VE" bit.
+ */
+#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)

static inline bool is_removed_spte(u64 spte)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4fabb2cd0ba9..383904742f44 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -673,8 +673,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* special removed SPTE value. No bookkeeping is needed
* here since the SPTE is going from non-present
* to non-present.
+ *
+ * Set the non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
+ * This is because when TDX is enabled, the TDX module always
+ * enables "EPT-violation #VE", so KVM needs to set the
+ * "suppress #VE" bit in EPT table entries in order to get a
+ * real EPT violation rather than a TDVMCALL. Using
+ * SHADOW_NONPRESENT_VALUE (which sets the "suppress #VE" bit)
+ * keeps the bit set when EPT table entries are zapped.
*/
- kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);

return 0;
}
@@ -846,8 +854,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
continue;

if (!shared)
- tdp_mmu_set_spte(kvm, &iter, 0);
- else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
goto retry;
}
}
@@ -903,8 +911,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;

- __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
- sp->gfn, sp->role.level + 1, true, true);
+ __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+ SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
+ true, true);

return true;
}
@@ -941,7 +950,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- tdp_mmu_set_spte(kvm, &iter, 0);
+ tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
flush = true;
}

@@ -1312,7 +1321,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
* invariant that the PFN of a present * leaf SPTE can never change.
* See __handle_changed_spte().
*/
- tdp_mmu_set_spte(kvm, iter, 0);
+ tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);

if (!pte_write(range->pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
--
2.25.1



2022-08-04 23:54:27

by Sean Christopherson

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Thu, Aug 04, 2022, David Matlack wrote:
> On Thu, May 05, 2022 at 11:14:31AM -0700, [email protected] wrote:
> > +static inline void kvm_init_shadow_page(void *page)
> > +{
> > +#ifdef CONFIG_X86_64
> > + int ign;
> > +
> > + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> > + asm volatile (
> > + "rep stosq\n\t"
> > + : "=c"(ign), "=D"(page)
> > + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> > + : "memory"
> > + );
>
> Use memset64()?

Huh. The optimized x86-64 versions were added in 2017 (4c51248533ad ("x86: implement
memset16, memset32 & memset64"), so I can't even claim I wrote this before there
was a perfect fit.
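
For reference, a minimal sketch of what the memset64()-based version could
look like (same function and constants as in the patch; the exact form here
is illustrative, not posted code):

static inline void kvm_init_shadow_page(void *page)
{
#ifdef CONFIG_X86_64
	/* Fill all 512 SPTEs with the "suppress #VE" initial value. */
	memset64((u64 *)page, SHADOW_NONPRESENT_VALUE, PAGE_SIZE / sizeof(u64));
#else
	BUG();
#endif
}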

> > @@ -5643,7 +5687,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> > + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> Is there any reason to prefer using __GFP_ZERO? I suspect the code would
> be simpler if KVM unconditionally initialized shadow pages.

Hmm, we'd have to implement kvm_init_shadow_page() for 32-bit builds, and I don't
love having "gfp_zero" but not using it when we need zeros, but if the end result
is simpler, I'm definitely ok with omitting __GFP_ZERO and always flowing through
kvm_init_shadow_page().
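
FWIW, a rough sketch of that simplification (assumption: kvm_init_shadow_page()
grows a 32-bit fallback that simply zeroes the page, so the cache never needs
__GFP_ZERO and the top-up path loses its conditionals):

static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
	int start = kvm_mmu_memory_cache_nr_free_objects(mc);
	int end, i, r;

	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
	if (r)
		return r;

	/*
	 * Initialize only the objects this top-up actually allocated;
	 * kvm_init_shadow_page() is assumed to handle the 32-bit
	 * (plain zeroing) case as well.
	 */
	end = kvm_mmu_memory_cache_nr_free_objects(mc);
	for (i = start; i < end; i++)
		kvm_init_shadow_page(mc->objects[i]);

	return 0;
}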

> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index fbbab180395e..3319ca7f8f48 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
> >
> > #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
> >
> > +/*
> > + * non-present SPTE value for both VMX and SVM for TDP MMU.
> > + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> > + * For VMX EPT, bit 63 is ignored if #VE is disabled.
> > + * bit 63 is #VE suppress if #VE is enabled.
> > + */
> > +#ifdef CONFIG_X86_64
> > +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> > +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> > +#else
> > +#define SHADOW_NONPRESENT_VALUE 0ULL
> > +#endif
>
> The terminology "shadow_nonpresent" implies it would be the opposite of
> e.g. is_shadow_present_pte(), when in fact they are completely
> different concepts.

You can fight Paolo over that one :-) I agree it looks a bit odd when juxtaposed
with is_shadow_present_pte(), but at the same time I agree with Paolo that
SHADOW_INIT_VALUE is also funky.

https://lore.kernel.org/all/[email protected]

> Also, this is a good opportunity to follow the same naming terminology
> as REMOVED_SPTE in the TDP MMU.
>
> How about EMPTY_SPTE?

No, because "empty" implies there's nothing there, and it very much matters that
the SUPPRESS_VE bit is set for TDX.

2022-08-05 00:09:16

by David Matlack

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Thu, May 05, 2022 at 11:14:31AM -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> Secure-EPT maps protected guest memory, which is called private. Since
> Secure-EPT page tables is also protected, those page tables is also called
> private. The existing EPT is often called shared EPT to distinguish from
> Secure-EPT. And also page tables for share EPT is also called shared.
>
> Virtualization Exception, #VE, is a new processor exception in VMX non-root
> operation. In certain virtualizatoin-related conditions, #VE is injected
> into guest instead of exiting from guest to VMM so that guest is given a
> chance to inspect it. One important one is EPT violation. When
> "ETP-violation #VE" VM-execution is set, "#VE suppress bit" in EPT entry
> is cleared, #VE is injected instead of EPT violation.
>
> Because guest memory is protected with TDX, VMM can't parse instructions
> in the guest memory. Instead, MMIO hypercall is used for guest to pass
> necessary information to VMM.
>
> To make unmodified device driver work, guest TD expects #VE on accessing
> shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> violation. So the execution flow looks like
>
> - Allocate unused shared EPT entry with suppress #VE bit set.
> - EPT violation on that GPA.
> - VMM figures out the faulted GPA is for MMIO.
> - VMM clears the suppress #VE bit.
> - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
> - If the GPA maps guest memory, VMM resolves it with guest pages.
>
> For both cases, SPTE needs suppress #VE" bit set initially when it
> is allocated or zapped, therefore non-zero non-present value for SPTE
> needs to be allowed.
>
> This change requires to update FNAME(sync_page) for shadow EPT.
> "if(!sp->spte[i])" in FNAME(sync_page) means that the spte entry is the
> initial value. With the introduction of shadow_nonpresent_value which can
> be non-zero, it doesn't hold any more. Replace zero check with
> "!is_shadow_present_pte() && !is_mmio_spte()".
>
> When "if (!spt[i])" doesn't hold, but the entry value is
> shadow_nonpresent_value, the entry is wrongly synchronized from non-present
> to non-present with (wrongly) pfn changed and tries to remove rmap wrongly
> and BUG_ON() is hit.
>
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> intermediate value to indicate one thread is operating on it and the value
> should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> value should include suppress #VE value which is SHADOW_NONPRESENT_VALUE.
> Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
> SHADOW_NONPRESENT_VALUE | REMOVED_SPTE to set suppress #VE bit.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
> arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> arch/x86/kvm/mmu/spte.c | 5 +++-
> arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
> arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
> 5 files changed, 105 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4a12d862bbb6..324ea25ee0c7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -693,6 +693,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> + int ign;
> +
> + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> + asm volatile (
> + "rep stosq\n\t"
> + : "=c"(ign), "=D"(page)
> + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> + : "memory"
> + );

Use memset64()?

> +#else
> + BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> + int start, end, i, r;
> + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> +
> + if (is_tdp_mmu && shadow_nonpresent_value)
> + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> + if (r)
> + return r;
> +
> + if (is_tdp_mmu && shadow_nonpresent_value) {
> + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> + for (i = start; i < end; i++)
> + kvm_init_shadow_page(mc->objects[i]);
> + }

Doing this during top-up is probably the right decision since we're
outside the MMU lock. In v8 you'll need to also cover the eager page
splitting code paths, which go through a different allocation path for
the shadow and TDP MMU.

> + return 0;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -702,8 +740,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_shadow_page_cache(vcpu);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -5510,9 +5547,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> * what is used by the kernel for any given HVA, i.e. the kernel's
> * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
> */
> - if (tdp_enabled)
> + if (tdp_enabled) {
> + /*
> + * For TDP MMU, always set bit 63 for TDX support. See the
> + * comment on SHADOW_NONPRESENT_VALUE.
> + */
> +#ifdef CONFIG_X86_64
> + shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
> +#endif
> max_huge_page_level = tdp_huge_page_level;
> - else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> + } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> max_huge_page_level = PG_LEVEL_1G;
> else
> max_huge_page_level = PG_LEVEL_2M;
> @@ -5643,7 +5687,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

Is there any reason to prefer using __GFP_ZERO? I suspect the code would
be simpler if KVM unconditionally initialized shadow pages.

>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index b025decf610d..058efd4bbcbc 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1030,7 +1030,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> gpa_t pte_gpa;
> gfn_t gfn;
>
> - if (!sp->spt[i])
> + if (!is_shadow_present_pte(sp->spt[i]) &&
> + !is_mmio_spte(sp->spt[i]))
> continue;
>
> pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 75c9e87d446a..1bf934f64b6f 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> u64 __read_mostly shadow_me_mask;
> u64 __read_mostly shadow_acc_track_mask;
> +#ifdef CONFIG_X86_64
> +u64 __read_mostly shadow_nonpresent_value;
> +#endif
>
> u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -330,7 +333,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> * not set any RWX bits.
> */
> if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
> - WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> + WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))

Why use __REMOVED_SPTE here and not REMOVED_SPTE?

> mmio_value = 0;
>
> if (!mmio_value)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index fbbab180395e..3319ca7f8f48 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>
> #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>
> +/*
> + * non-present SPTE value for both VMX and SVM for TDP MMU.
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled.
> + * bit 63 is #VE suppress if #VE is enabled.
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
> +#define SHADOW_NONPRESENT_VALUE 0ULL
> +#endif

The terminology "shadow_nonpresent" implies it would be the opposite of
e.g. is_shadow_present_pte(), when in fact they are completely
different concepts.

Also, this is a good opportunity to follow the same naming terminology
as REMOVED_SPTE in the TDP MMU.

How about EMPTY_SPTE?

> +
> extern u64 __read_mostly shadow_host_writable_mask;
> extern u64 __read_mostly shadow_mmu_writable_mask;
> extern u64 __read_mostly shadow_nx_mask;
> @@ -154,6 +167,12 @@ extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_value;
> extern u64 __read_mostly shadow_me_mask;
>
> +#ifdef CONFIG_X86_64
> +extern u64 __read_mostly shadow_nonpresent_value;
> +#else
> +#define shadow_nonpresent_value 0ULL
> +#endif
> +
> /*
> * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
> * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
> @@ -174,9 +193,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>
> /*
> * If a thread running without exclusive control of the MMU lock must perform a
> - * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
> + * multi-part operation on an SPTE, it can set the SPTE to __REMOVED_SPTE as a
> * non-present intermediate value. Other threads which encounter this value
> - * should not modify the SPTE.
> + * should not modify the SPTE. For the case that TDX is enabled,
> + * SHADOW_NONPRESENT_VALUE, which is "suppress #VE" bit set because TDX module
> + * always enables "EPT violation #VE". The bit is ignored by non-TDX case as
> + * present bit (bit 0) is cleared.
> *
> * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> @@ -184,10 +206,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> *
> * Only used by the TDP MMU.
> */
> -#define REMOVED_SPTE 0x5a0ULL
> +#define __REMOVED_SPTE 0x5a0ULL
>
> /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> -static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
> +
> +/*
> + * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
> + * intermediate value set to the removed SPET. it sets the "suppress #VE" bit.
> + */
> +#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
>
> static inline bool is_removed_spte(u64 spte)
> {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4fabb2cd0ba9..383904742f44 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -673,8 +673,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> * special removed SPTE value. No bookkeeping is needed
> * here since the SPTE is going from non-present
> * to non-present.
> + *
> + * Set non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
> + * It is because when TDX is enabled, TDX module always
> + * enables "EPT-violation #VE", so KVM needs to set
> + * "suppress #VE" bit in EPT table entries, in order to get
> + * real EPT violation, rather than TDVMCALL. KVM sets
> + * SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
> + * can be set when EPT table entries are zapped.
> */
> - kvm_tdp_mmu_write_spte(iter->sptep, 0);
> + kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
>
> return 0;
> }
> @@ -846,8 +854,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
>
> if (!shared)
> - tdp_mmu_set_spte(kvm, &iter, 0);
> - else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> + else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> goto retry;
> }
> }
> @@ -903,8 +911,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> return false;
>
> - __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> - sp->gfn, sp->role.level + 1, true, true);
> + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> + SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
> + true, true);
>
> return true;
> }
> @@ -941,7 +950,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> - tdp_mmu_set_spte(kvm, &iter, 0);
> + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> flush = true;
> }
>
> @@ -1312,7 +1321,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> * invariant that the PFN of a present * leaf SPTE can never change.
> * See __handle_changed_spte().
> */
> - tdp_mmu_set_spte(kvm, iter, 0);
> + tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
>
> if (!pte_write(range->pte)) {
> new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,

In addition to the suggestions above, I'd suggest breaking this patch
up, since it is doing multiple things:

1. Patch initialize shadow page tables to EMPTY_SPTE (0) and
replace TDP MMU hard-coded 0 with EMPTY_SPTE.
2. Patch to change FNAME(sync_page) to not assume EMPTY_SPTE is 0.
3. Patch to set bit 63 in EMPTY_SPTE.
4. Patch to set bit 63 in REMOVED_SPTE.

> --
> 2.25.1
>

2022-08-05 00:10:51

by David Matlack

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Thu, Aug 4, 2022 at 3:54 PM David Matlack <[email protected]> wrote:
>

Blegh, sorry. I somehow ended up looking at a v6 patch again instead
of the v7 patch.

> On Thu, May 05, 2022 at 11:14:31AM -0700, [email protected] wrote:
> > From: Sean Christopherson <[email protected]>
> >
> > TDX introduced a new ETP, Secure-EPT, in addition to the existing EPT.
> > Secure-EPT maps protected guest memory, which is called private. Since
> > Secure-EPT page tables is also protected, those page tables is also called
> > private. The existing EPT is often called shared EPT to distinguish from
> > Secure-EPT. And also page tables for share EPT is also called shared.
> >
> > Virtualization Exception, #VE, is a new processor exception in VMX non-root
> > operation. In certain virtualizatoin-related conditions, #VE is injected
> > into guest instead of exiting from guest to VMM so that guest is given a
> > chance to inspect it. One important one is EPT violation. When
> > "ETP-violation #VE" VM-execution is set, "#VE suppress bit" in EPT entry
> > is cleared, #VE is injected instead of EPT violation.
> >
> > Because guest memory is protected with TDX, VMM can't parse instructions
> > in the guest memory. Instead, MMIO hypercall is used for guest to pass
> > necessary information to VMM.
> >
> > To make unmodified device driver work, guest TD expects #VE on accessing
> > shared GPA. The #VE handler converts MMIO access into MMIO hypercall with
> > the EPT entry of enabled "#VE" by clearing "suppress #VE" bit. Before VMM
> > enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> > violation. So the execution flow looks like
> >
> > - Allocate unused shared EPT entry with suppress #VE bit set.
> > - EPT violation on that GPA.
> > - VMM figures out the faulted GPA is for MMIO.
> > - VMM clears the suppress #VE bit.
> > - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.
> > - If the GPA maps guest memory, VMM resolves it with guest pages.
> >
> > For both cases, SPTE needs suppress #VE" bit set initially when it
> > is allocated or zapped, therefore non-zero non-present value for SPTE
> > needs to be allowed.
> >
> > This change requires to update FNAME(sync_page) for shadow EPT.
> > "if(!sp->spte[i])" in FNAME(sync_page) means that the spte entry is the
> > initial value. With the introduction of shadow_nonpresent_value which can
> > be non-zero, it doesn't hold any more. Replace zero check with
> > "!is_shadow_present_pte() && !is_mmio_spte()".
> >
> > When "if (!spt[i])" doesn't hold, but the entry value is
> > shadow_nonpresent_value, the entry is wrongly synchronized from non-present
> > to non-present with (wrongly) pfn changed and tries to remove rmap wrongly
> > and BUG_ON() is hit.
> >
> > TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> > intermediate value to indicate one thread is operating on it and the value
> > should be semi-arbitrary value. For TDX (more correctly to use #VE), the
> > value should include suppress #VE value which is SHADOW_NONPRESENT_VALUE.
> > Rename REMOVED_SPTE to __REMOVED_SPTE and define REMOVED_SPTE as
> > SHADOW_NONPRESENT_VALUE | REMOVED_SPTE to set suppress #VE bit.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > arch/x86/kvm/mmu/mmu.c | 55 ++++++++++++++++++++++++++++++----
> > arch/x86/kvm/mmu/paging_tmpl.h | 3 +-
> > arch/x86/kvm/mmu/spte.c | 5 +++-
> > arch/x86/kvm/mmu/spte.h | 37 ++++++++++++++++++++---
> > arch/x86/kvm/mmu/tdp_mmu.c | 23 +++++++++-----
> > 5 files changed, 105 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4a12d862bbb6..324ea25ee0c7 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -693,6 +693,44 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static inline void kvm_init_shadow_page(void *page)
> > +{
> > +#ifdef CONFIG_X86_64
> > + int ign;
> > +
> > + WARN_ON_ONCE(shadow_nonpresent_value != SHADOW_NONPRESENT_VALUE);
> > + asm volatile (
> > + "rep stosq\n\t"
> > + : "=c"(ign), "=D"(page)
> > + : "a"(SHADOW_NONPRESENT_VALUE), "c"(4096/8), "D"(page)
> > + : "memory"
> > + );
>
> Use memset64()?
>
> > +#else
> > + BUG();
> > +#endif
> > +}
> > +
> > +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> > + int start, end, i, r;
> > + bool is_tdp_mmu = is_tdp_mmu_enabled(vcpu->kvm);
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value)
> > + start = kvm_mmu_memory_cache_nr_free_objects(mc);
> > +
> > + r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> > + if (r)
> > + return r;
> > +
> > + if (is_tdp_mmu && shadow_nonpresent_value) {
> > + end = kvm_mmu_memory_cache_nr_free_objects(mc);
> > + for (i = start; i < end; i++)
> > + kvm_init_shadow_page(mc->objects[i]);
> > + }
>
> Doing this during top-up is probably the right decision since we're
> outside the MMU lock. In v8 you'll need to also cover the eager page
> splitting code paths, which go through a different allocation path for
> the shadow and TDP MMU.
>
> > + return 0;
> > +}
> > +
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > int r;
> > @@ -702,8 +740,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - PT64_ROOT_MAX_LEVEL);
> > + r = mmu_topup_shadow_page_cache(vcpu);
> > if (r)
> > return r;
> > if (maybe_indirect) {
> > @@ -5510,9 +5547,16 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> > * what is used by the kernel for any given HVA, i.e. the kernel's
> > * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
> > */
> > - if (tdp_enabled)
> > + if (tdp_enabled) {
> > + /*
> > + * For TDP MMU, always set bit 63 for TDX support. See the
> > + * comment on SHADOW_NONPRESENT_VALUE.
> > + */
> > +#ifdef CONFIG_X86_64
> > + shadow_nonpresent_value = SHADOW_NONPRESENT_VALUE;
> > +#endif
> > max_huge_page_level = tdp_huge_page_level;
> > - else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > + } else if (boot_cpu_has(X86_FEATURE_GBPAGES))
> > max_huge_page_level = PG_LEVEL_1G;
> > else
> > max_huge_page_level = PG_LEVEL_2M;
> > @@ -5643,7 +5687,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + if (!(is_tdp_mmu_enabled(vcpu->kvm) && shadow_nonpresent_value))
> > + vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> Is there any reason to prefer using __GFP_ZERO? I suspect the code would
> be simpler if KVM unconditionally initialized shadow pages.
>
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index b025decf610d..058efd4bbcbc 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -1030,7 +1030,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> > gpa_t pte_gpa;
> > gfn_t gfn;
> >
> > - if (!sp->spt[i])
> > + if (!is_shadow_present_pte(sp->spt[i]) &&
> > + !is_mmio_spte(sp->spt[i]))
> > continue;
> >
> > pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 75c9e87d446a..1bf934f64b6f 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -36,6 +36,9 @@ u64 __read_mostly shadow_present_mask;
> > u64 __read_mostly shadow_me_value;
> > u64 __read_mostly shadow_me_mask;
> > u64 __read_mostly shadow_acc_track_mask;
> > +#ifdef CONFIG_X86_64
> > +u64 __read_mostly shadow_nonpresent_value;
> > +#endif
> >
> > u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> > @@ -330,7 +333,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> > * not set any RWX bits.
> > */
> > if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
> > - WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> > + WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
>
> Why use __REMOVED_SPTE here and not REMOVED_SPTE?
>
> > mmio_value = 0;
> >
> > if (!mmio_value)
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index fbbab180395e..3319ca7f8f48 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -140,6 +140,19 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
> >
> > #define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
> >
> > +/*
> > + * non-present SPTE value for both VMX and SVM for TDP MMU.
> > + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> > + * For VMX EPT, bit 63 is ignored if #VE is disabled.
> > + * bit 63 is #VE suppress if #VE is enabled.
> > + */
> > +#ifdef CONFIG_X86_64
> > +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> > +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> > +#else
> > +#define SHADOW_NONPRESENT_VALUE 0ULL
> > +#endif
>
> The terminology "shadow_nonpresent" implies it would be the opposite of
> e.g. is_shadow_present_pte(), when in fact they are completely
> different concepts.
>
> Also, this is a good opportunity to follow the same naming terminology
> as REMOVED_SPTE in the TDP MMU.
>
> How about EMPTY_SPTE?
>
> > +
> > extern u64 __read_mostly shadow_host_writable_mask;
> > extern u64 __read_mostly shadow_mmu_writable_mask;
> > extern u64 __read_mostly shadow_nx_mask;
> > @@ -154,6 +167,12 @@ extern u64 __read_mostly shadow_present_mask;
> > extern u64 __read_mostly shadow_me_value;
> > extern u64 __read_mostly shadow_me_mask;
> >
> > +#ifdef CONFIG_X86_64
> > +extern u64 __read_mostly shadow_nonpresent_value;
> > +#else
> > +#define shadow_nonpresent_value 0ULL
> > +#endif
> > +
> > /*
> > * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
> > * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
> > @@ -174,9 +193,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >
> > /*
> > * If a thread running without exclusive control of the MMU lock must perform a
> > - * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
> > + * multi-part operation on an SPTE, it can set the SPTE to __REMOVED_SPTE as a
> > * non-present intermediate value. Other threads which encounter this value
> > - * should not modify the SPTE.
> > + * should not modify the SPTE. For the case that TDX is enabled,
> > + * SHADOW_NONPRESENT_VALUE, which is "suppress #VE" bit set because TDX module
> > + * always enables "EPT violation #VE". The bit is ignored by non-TDX case as
> > + * present bit (bit 0) is cleared.
> > *
> > * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> > * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> > @@ -184,10 +206,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > *
> > * Only used by the TDP MMU.
> > */
> > -#define REMOVED_SPTE 0x5a0ULL
> > +#define __REMOVED_SPTE 0x5a0ULL
> >
> > /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> > -static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> > +static_assert(!(__REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> > +static_assert(!(__REMOVED_SPTE & SHADOW_NONPRESENT_VALUE));
> > +
> > +/*
> > + * See above comment around __REMOVED_SPTE. REMOVED_SPTE is the actual
> > + * intermediate value set to the removed SPET. it sets the "suppress #VE" bit.
> > + */
> > +#define REMOVED_SPTE (SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)
> >
> > static inline bool is_removed_spte(u64 spte)
> > {
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4fabb2cd0ba9..383904742f44 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -673,8 +673,16 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> > * special removed SPTE value. No bookkeeping is needed
> > * here since the SPTE is going from non-present
> > * to non-present.
> > + *
> > + * Set non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
> > + * It is because when TDX is enabled, TDX module always
> > + * enables "EPT-violation #VE", so KVM needs to set
> > + * "suppress #VE" bit in EPT table entries, in order to get
> > + * real EPT violation, rather than TDVMCALL. KVM sets
> > + * SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
> > + * can be set when EPT table entries are zapped.
> > */
> > - kvm_tdp_mmu_write_spte(iter->sptep, 0);
> > + kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
> >
> > return 0;
> > }
> > @@ -846,8 +854,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > continue;
> >
> > if (!shared)
> > - tdp_mmu_set_spte(kvm, &iter, 0);
> > - else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> > + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> > + else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> > goto retry;
> > }
> > }
> > @@ -903,8 +911,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> > return false;
> >
> > - __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> > - sp->gfn, sp->role.level + 1, true, true);
> > + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> > + SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
> > + true, true);
> >
> > return true;
> > }
> > @@ -941,7 +950,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > !is_last_spte(iter.old_spte, iter.level))
> > continue;
> >
> > - tdp_mmu_set_spte(kvm, &iter, 0);
> > + tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> > flush = true;
> > }
> >
> > @@ -1312,7 +1321,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> > * invariant that the PFN of a present * leaf SPTE can never change.
> > * See __handle_changed_spte().
> > */
> > - tdp_mmu_set_spte(kvm, iter, 0);
> > + tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
> >
> > if (!pte_write(range->pte)) {
> > new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
>
> In addition to the suggestions above, I'd suggest breaking this patch
> up, since it is doing multiple things:
>
> 1. Patch initialize shadow page tables to EMPTY_SPTE (0) and
> replace TDP MMU hard-coded 0 with EMPTY_SPTE.
> 2. Patch to change FNAME(sync_page) to not assume EMPTY_SPTE is 0.
> 3. Patch to set bit 63 in EMPTY_SPTE.
> 4. Patch to set bit 63 in REMOVED_SPTE.
>
> > --
> > 2.25.1
> >

2022-08-05 00:12:39

by David Matlack

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Thu, Aug 4, 2022 at 4:23 PM Sean Christopherson <[email protected]> wrote:
> On Thu, Aug 04, 2022, David Matlack wrote:
> > On Thu, May 05, 2022 at 11:14:31AM -0700, [email protected] wrote:
> > > +#ifdef CONFIG_X86_64
> > > +#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
> > > +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> > > +#else
> > > +#define SHADOW_NONPRESENT_VALUE 0ULL
> > > +#endif
> >
> > The terminology "shadow_nonpresent" implies it would be the opposite of
> > e.g. is_shadow_present_pte(), when in fact they are completely
> > different concepts.
>
> You can fight Paolo over that one :-) I agree it looks a bit odd when juxtaposed
> with is_shadow_present_pte(), but at the same time I agree with Paolo that
> SHADOW_INIT_VALUE is also funky.
>
> https://lore.kernel.org/all/[email protected]

Ah ok, thanks for the context.

>
> > Also, this is a good opportunity to follow the same naming terminology
> > as REMOVED_SPTE in the TDP MMU.
> >
> > How about EMPTY_SPTE?
>
> No, because "empty" implies there's nothing there, and it very much matters that
> the SUPPRESS_VE bit is set for TDX.

Fair point. My other idea was INITIAL_SPTE but that's already covered
by Paolo's objection above :)

I'll change my vote to NONPRESENT_SPTE.

2022-08-05 00:17:18

by Kai Huang

Subject: RE: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

> > In addition to the suggestions above, I'd suggest breaking this patch
> > up, since it is doing multiple things:
> >
> > 1. Patch initialize shadow page tables to EMPTY_SPTE (0) and
> > replace TDP MMU hard-coded 0 with EMPTY_SPTE.
> > 2. Patch to change FNAME(sync_page) to not assume EMPTY_SPTE is 0.
> > 3. Patch to set bit 63 in EMPTY_SPTE.
> > 4. Patch to set bit 63 in REMOVED_SPTE.

I think 1/2 can be separate patches, but 3/4 should be done together.

Patch 3 alone is broken: when the TDP MMU zaps an SPTE and replaces it with REMOVED_SPTE, it loses bit 63. This is not what we want. We always want bit 63 set whenever an SPTE is supposed to be non-present.

But I also don't see splitting into separate patches as absolutely worth doing; handling the above in one patch is also fine to me.
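
To make that concrete, here is the relationship using the constants from the
patch (nothing new, just the values spelled out):

#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)	/* EPT "suppress #VE" */
#define __REMOVED_SPTE		0x5a0ULL	/* semi-arbitrary marker */
#define REMOVED_SPTE		(SHADOW_NONPRESENT_VALUE | __REMOVED_SPTE)

If only the initial/zapped value gained bit 63 but REMOVED_SPTE stayed plain
0x5a0ULL, the TDP MMU's removed-SPTE path would write a non-present entry
with "suppress #VE" cleared, which is the ordering problem described above.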

2022-08-05 16:51:41

by David Matlack

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Thu, Aug 4, 2022 at 5:04 PM Huang, Kai <[email protected]> wrote:
>
> > > In addition to the suggestions above, I'd suggest breaking this patch
> > > up, since it is doing multiple things:
> > >
> > > 1. Patch initialize shadow page tables to EMPTY_SPTE (0) and
> > > replace TDP MMU hard-coded 0 with EMPTY_SPTE.
> > > 2. Patch to change FNAME(sync_page) to not assume EMPTY_SPTE is 0.
> > > 3. Patch to set bit 63 in EMPTY_SPTE.
> > > 4. Patch to set bit 63 in REMOVED_SPTE.
>
> I think 1/2 can be separate patches, but 3/4 should be done together.
>
> Patch 3 alone is broken as when TDP MMU zaps SPTE and replaces it with REMOVED_SPTE, it loses bit 63. This is not what we want. We always want bit 63 set if it is supposed to be set to a non-present SPTE.

How is patch 3 alone broken? The TDX support that depends on bit 63
does not exist at this point in the series, i.e. setting bit 63 is
entirely optional and only done in preparation for future patches.

>
> But I also don't see splitting to 3 patches is absolutely worth to do as doing above in one patch is also fine to me.

Splitting patches up into logically independent changes makes it a lot
easier to review, and therefore reduces the chances of bugs.

Smaller changes also make it easier for patches to get through the
review process, because reviewers can sign-off on specific patches
with Reviewed-by tags while discussion continues on patches that still
need more work. If the patches are too large, it makes it more
difficult to collect Reviewed-by tags because the entire patch has to
be correct.

Case in point, the above patch description has 9 paragraphs because
the patch is doing so many different things. It's difficult to keep
track of all of the different changes this patch aims to accomplish
when reviewing the code.

2022-08-05 17:28:03

by Sean Christopherson

Subject: Re: [RFC PATCH v6 037/104] KVM: x86/mmu: Allow non-zero value for non-present SPTE

On Fri, Aug 05, 2022, David Matlack wrote:
> On Thu, Aug 4, 2022 at 5:04 PM Huang, Kai <[email protected]> wrote:
> >
> > > > In addition to the suggestions above, I'd suggest breaking this patch
> > > > up, since it is doing multiple things:
> > > >
> > > > 1. Patch initialize shadow page tables to EMPTY_SPTE (0) and
> > > > replace TDP MMU hard-coded 0 with EMPTY_SPTE.
> > > > 2. Patch to change FNAME(sync_page) to not assume EMPTY_SPTE is 0.
> > > > 3. Patch to set bit 63 in EMPTY_SPTE.
> > > > 4. Patch to set bit 63 in REMOVED_SPTE.
> >
> > I think 1/2 can be separate patches, but 3/4 should be done together.
> >
> > Patch 3 alone is broken as when TDP MMU zaps SPTE and replaces it with
> > REMOVED_SPTE, it loses bit 63. This is not what we want. We always want
> > bit 63 set if it is supposed to be set to a non-present SPTE.
>
> How is patch 3 alone be broken? The TDX support that depends on bit 63
> does not exist at this point in the series, i.e. setting bit 63 is
> entirely optional and only done in preparation for future patches.

Hmm, I agree with Kai on this specific point. Will it cause functional problems?
No. Is KVM technically broken? IMO, yes, because the intent of the code is to
ensure bit 63 is set for all SPTEs that are not-present (and not misconfigured)
in hardware.

I 100% agree on patches doing too much, but in this particular case it's easy to
capture the above semantics in a shortlog:

KVM: x86/mmu: Set bit 63 (EPT's SUPPRESS_VE) in all not-present SPTEs