2022-06-27 22:36:13

by Isaku Yamahata

Subject: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

From: Sean Christopherson <[email protected]>

TDX will use a different shadow PTE value for MMIO than VMX does. Add
members to kvm_arch and track the MMIO value/mask per-VM instead of in
global variables. By using a per-VM EPT entry value for MMIO, the existing
VMX logic keeps working.

In the VMX VM case, the EPT entry for MMIO is either a non-present PTE
(present bit cleared) with no backing guest physical address (on EPT
violation, KVM searches for a backing guest page and finds none), or a value
that triggers EPT misconfiguration. Once MMIO is triggered on the EPT entry,
the entry is updated to trigger EPT misconfiguration for future MMIO on the
same GPA, which lets KVM recognize the access as MMIO without searching for
backing guest pages. KVM then parses the guest instruction to figure out the
address/value/width of the MMIO.

In the guest TD case, the guest memory is protected, so the VMM can't parse
guest instructions to determine the value and access width of MMIO. Instead,
the VMM sets up the (Shared) EPT to trigger #VE by clearing the suppress-#VE
bit. When the guest TD issues MMIO, #VE is injected; the guest #VE handler
converts the MMIO access into an MMIO hypercall that passes the
address/value/width to the VMM (or MMIO is directly paravirtualized into a
hypercall). The VMM then handles the MMIO hypercall without parsing guest
instructions.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/mmu.h | 4 +++-
arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++----
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 41 +++++++++++++++------------------
arch/x86/kvm/mmu/spte.h | 11 ++++-----
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++---
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 8 +++++++
10 files changed, 59 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c47aab72a1b..39215daa8576 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1161,6 +1161,10 @@ struct kvm_arch {
*/
spinlock_t mmu_unsync_pages_lock;

+ bool enable_mmio_caching;
+ u64 shadow_mmio_value;
+ u64 shadow_mmio_mask;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index c371ef695fcc..6231ef005a50 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -511,6 +511,7 @@ enum vmcs_field {
#define VMX_EPT_IPAT_BIT (1ull << 6)
#define VMX_EPT_ACCESS_BIT (1ull << 8)
#define VMX_EPT_DIRTY_BIT (1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)
#define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
VMX_EPT_WRITABLE_MASK | \
VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ccf0ba7a6387..9ba60fd79d33 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -108,7 +108,9 @@ static inline u8 kvm_get_shadow_phys_bits(void)
return boot_cpu_data.x86_phys_bits;
}

-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
+ u64 access_mask);
+void kvm_mmu_set_default_mmio_spte_mask(u64 mask);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f239b6cb5d53..496d0d30839b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2287,7 +2287,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
return kvm_mmu_prepare_zap_page(kvm, child,
invalid_list);
}
- } else if (is_mmio_spte(pte)) {
+ } else if (is_mmio_spte(kvm, pte)) {
mmu_spte_clear_no_track(spte);
}
return 0;
@@ -3067,8 +3067,13 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
* by L0 userspace (you can observe gfn > L1.MAXPHYADDR if
* and only if L1's MAXPHYADDR is inaccurate with respect to
* the hardware's).
+ *
+ * Excludes the INTEL TD guest. Because TD memory is
+ * protected, the instruction can't be emulated. Instead, use
+ * SPTE value without #VE suppress bit cleared
+ * (kvm->arch.shadow_mmio_value = 0).
*/
- if (unlikely(!enable_mmio_caching) ||
+ if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
unlikely(fault->gfn > kvm_mmu_max_gfn()))
return RET_PF_EMULATE;
}
@@ -3200,7 +3205,8 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
else
sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);

- if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
+ if (!is_shadow_present_pte(spte) ||
+ is_mmio_spte(vcpu->kvm, spte))
break;

sp = sptep_to_sp(sptep);
@@ -3907,7 +3913,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (WARN_ON(reserved))
return -EINVAL;

- if (is_mmio_spte(spte)) {
+ if (is_mmio_spte(vcpu->kvm, spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
unsigned int access = get_mmio_spte_access(spte);

@@ -4350,7 +4356,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
- if (unlikely(is_mmio_spte(*sptep))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
@@ -5864,6 +5870,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_write = kvm_mmu_pte_write;
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);
+ kvm_mmu_set_mmio_spte_mask(kvm, shadow_default_mmio_mask,
+ shadow_default_mmio_mask,
+ ACC_WRITE_MASK | ACC_USER_MASK);
+
return 0;
}

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index ee2fb0c073f3..62ae590d4e5b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1032,7 +1032,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
gfn_t gfn;

if (!is_shadow_present_pte(sp->spt[i]) &&
- !is_mmio_spte(sp->spt[i]))
+ !is_mmio_spte(vcpu->kvm, sp->spt[i]))
continue;

pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index bd441458153f..5194aef60c1f 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -29,8 +29,7 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
u64 __read_mostly shadow_user_mask;
u64 __read_mostly shadow_accessed_mask;
u64 __read_mostly shadow_dirty_mask;
-u64 __read_mostly shadow_mmio_value;
-u64 __read_mostly shadow_mmio_mask;
+u64 __read_mostly shadow_default_mmio_mask;
u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_value;
@@ -62,10 +61,11 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
+ !kvm_gfn_shared_mask(vcpu->kvm));

access &= shadow_mmio_access_mask;
- spte |= shadow_mmio_value | access;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -337,7 +337,8 @@ u64 mark_spte_for_access_track(u64 spte)
return spte;
}

-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
+ u64 access_mask)
{
BUG_ON((u64)(unsigned)access_mask != access_mask);
WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
@@ -366,11 +367,9 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;

- if (!mmio_value)
- enable_mmio_caching = false;
-
- shadow_mmio_value = mmio_value;
- shadow_mmio_mask = mmio_mask;
+ kvm->arch.enable_mmio_caching = !!mmio_value;
+ kvm->arch.shadow_mmio_value = mmio_value;
+ kvm->arch.shadow_mmio_mask = mmio_mask;
shadow_mmio_access_mask = access_mask;
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
@@ -393,24 +392,18 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
shadow_nx_mask = 0ull;
shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
- shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
shadow_acc_track_mask = VMX_EPT_RWX_MASK;
shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
-
- /*
- * EPT Misconfigurations are generated if the value of bits 2:0
- * of an EPT paging-structure entry is 110b (write/execute).
- */
- kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);

void kvm_mmu_reset_all_pte_masks(void)
{
u8 low_phys_bits;
- u64 mask;

shadow_phys_bits = kvm_get_shadow_phys_bits();

@@ -459,9 +452,13 @@ void kvm_mmu_reset_all_pte_masks(void)
* PTEs and so the reserved PA approach must be disabled.
*/
if (shadow_phys_bits < 52)
- mask = BIT_ULL(51) | PT_PRESENT_MASK;
+ shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
else
- mask = 0;
+ shadow_default_mmio_mask = 0;
+}

- kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
+void kvm_mmu_set_default_mmio_spte_mask(u64 mask)
+{
+ shadow_default_mmio_mask = mask;
}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_default_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1bfedbe0585f..96312ab4fffb 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,8 +5,6 @@

#include "mmu_internal.h"

-extern bool __read_mostly enable_mmio_caching;
-
/*
* A MMU present SPTE is backed by actual memory and may or may not be present
* in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
@@ -160,8 +158,7 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
-extern u64 __read_mostly shadow_mmio_value;
-extern u64 __read_mostly shadow_mmio_mask;
+extern u64 __read_mostly shadow_default_mmio_mask;
extern u64 __read_mostly shadow_mmio_access_mask;
extern u64 __read_mostly shadow_present_mask;
extern u64 __read_mostly shadow_me_value;
@@ -233,10 +230,10 @@ static inline bool is_removed_spte(u64 spte)
*/
extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;

-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
- return (spte & shadow_mmio_mask) == shadow_mmio_value &&
- likely(enable_mmio_caching);
+ return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
+ likely(kvm->arch.enable_mmio_caching || kvm_gfn_shared_mask(kvm));
}

static inline bool is_shadow_present_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2ca03ec3bf52..82f1bfac7ee6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -569,8 +569,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
- if (WARN_ON(!is_mmio_spte(old_spte) &&
- !is_mmio_spte(new_spte) &&
+ if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
@@ -1108,7 +1108,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 815a07c594f1..0abc43d6a115 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4870,7 +4870,7 @@ static __init void svm_adjust_mmio_mask(void)
*/
mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;

- kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
+ kvm_mmu_set_default_mmio_spte_mask(mask);
}

static __init void svm_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1d87885245cc..e2415ac55317 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7289,6 +7289,14 @@ int vmx_vm_init(struct kvm *kvm)
if (!ple_gap)
kvm->arch.pause_in_guest = true;

+ /*
+ * EPT Misconfigurations can be generated if the value of bits 2:0
+ * of an EPT paging-structure entry is 110b (write/execute).
+ */
+ if (enable_ept)
+ kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
+ VMX_EPT_RWX_MASK, 0);
+
if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
switch (l1tf_mitigation) {
case L1TF_MITIGATION_OFF:
--
2.25.1


2022-06-30 11:52:27

by Kai Huang

Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX will use a different shadow PTE value for MMIO than VMX does.  Add
> members to kvm_arch and track the MMIO value/mask per-VM instead of in
> global variables.  By using a per-VM EPT entry value for MMIO, the existing
> VMX logic keeps working.
>
> In the VMX VM case, the EPT entry for MMIO is either a non-present PTE
> (present bit cleared) with no backing guest physical address (on EPT
> violation, KVM searches for a backing guest page and finds none), or a
> value that triggers EPT misconfiguration.  Once MMIO is triggered on the
> EPT entry, the entry is updated to trigger EPT misconfiguration for future
> MMIO on the same GPA, which lets KVM recognize the access as MMIO without
> searching for backing guest pages.  KVM then parses the guest instruction
> to figure out the address/value/width of the MMIO.
>
> In the guest TD case, the guest memory is protected, so the VMM can't
> parse guest instructions to determine the value and access width of MMIO.
> Instead, the VMM sets up the (Shared) EPT to trigger #VE by clearing the
> suppress-#VE bit.  When the guest TD issues MMIO, #VE is injected; the
> guest #VE handler converts the MMIO access into an MMIO hypercall that
> passes the address/value/width to the VMM (or MMIO is directly
> paravirtualized into a hypercall).  The VMM then handles the MMIO
> hypercall without parsing guest instructions.

This is an infrastructural patch that enables per-VM MMIO caching. Why not
put this patch first so you don't need to do the below changes (which are
introduced by your previous patches)?

[...]

>  
> - if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
> + if (!is_shadow_present_pte(spte) ||
> +     is_mmio_spte(vcpu->kvm, spte))
>   break;
>  
>

[...]

> @@ -1032,7 +1032,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   gfn_t gfn;
>  
>   if (!is_shadow_present_pte(sp->spt[i]) &&
> -     !is_mmio_spte(sp->spt[i]))
> +     !is_mmio_spte(vcpu->kvm, sp->spt[i]))
>   continue;

2022-07-05 15:00:12

by Kai Huang

Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

On Mon, 2022-06-27 at 14:53 -0700, [email protected] wrote:
> From: Sean Christopherson <[email protected]>
>
> TDX will use a different shadow PTE value for MMIO than VMX does. Add
> members to kvm_arch and track the MMIO value/mask per-VM instead of in
> global variables. By using a per-VM EPT entry value for MMIO, the existing
> VMX logic keeps working.
>
> In the VMX VM case, the EPT entry for MMIO is either a non-present PTE
> (present bit cleared) with no backing guest physical address (on EPT
> violation, KVM searches for a backing guest page and finds none), or a
> value that triggers EPT misconfiguration. Once MMIO is triggered on the
> EPT entry, the entry is updated to trigger EPT misconfiguration for future
> MMIO on the same GPA, which lets KVM recognize the access as MMIO without
> searching for backing guest pages. KVM then parses the guest instruction
> to figure out the address/value/width of the MMIO.
>
> In the guest TD case, the guest memory is protected, so the VMM can't
> parse guest instructions to determine the value and access width of MMIO.
> Instead, the VMM sets up the (Shared) EPT to trigger #VE by clearing the
> suppress-#VE bit. When the guest TD issues MMIO, #VE is injected; the
> guest #VE handler converts the MMIO access into an MMIO hypercall that
> passes the address/value/width to the VMM (or MMIO is directly
> paravirtualized into a hypercall). The VMM then handles the MMIO hypercall
> without parsing guest instructions.

To me, only the first paragraph is needed. It already describes _why_ we need
this patch and _how_ you are going to implement it.

The last two paragraphs only elaborate on the _why_ in the first paragraph, but
they do not say this patch will do more. And they have been explained in
previous patches, so it looks like they are not mandatory here.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 ++++
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/kvm/mmu.h | 4 +++-
> arch/x86/kvm/mmu/mmu.c | 20 ++++++++++++----
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> arch/x86/kvm/mmu/spte.c | 41 +++++++++++++++------------------
> arch/x86/kvm/mmu/spte.h | 11 ++++-----
> arch/x86/kvm/mmu/tdp_mmu.c | 6 ++---
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 8 +++++++
> 10 files changed, 59 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c47aab72a1b..39215daa8576 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1161,6 +1161,10 @@ struct kvm_arch {
> */
> spinlock_t mmu_unsync_pages_lock;
>
> + bool enable_mmio_caching;
> + u64 shadow_mmio_value;
> + u64 shadow_mmio_mask;
> +
> struct list_head assigned_dev_head;
> struct iommu_domain *iommu_domain;
> bool iommu_noncoherent;
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index c371ef695fcc..6231ef005a50 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -511,6 +511,7 @@ enum vmcs_field {
> #define VMX_EPT_IPAT_BIT (1ull << 6)
> #define VMX_EPT_ACCESS_BIT (1ull << 8)
> #define VMX_EPT_DIRTY_BIT (1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT (1ull << 63)

Both the patch title and the changelog say this patch only does per-VM MMIO
value/mask tracking. Why do we need this bit here?

> #define VMX_EPT_RWX_MASK (VMX_EPT_READABLE_MASK | \
> VMX_EPT_WRITABLE_MASK | \
> VMX_EPT_EXECUTABLE_MASK)
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index ccf0ba7a6387..9ba60fd79d33 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -108,7 +108,9 @@ static inline u8 kvm_get_shadow_phys_bits(void)
> return boot_cpu_data.x86_phys_bits;
> }
>
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
> + u64 access_mask);
> +void kvm_mmu_set_default_mmio_spte_mask(u64 mask);
> void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f239b6cb5d53..496d0d30839b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2287,7 +2287,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
> return kvm_mmu_prepare_zap_page(kvm, child,
> invalid_list);
> }
> - } else if (is_mmio_spte(pte)) {
> + } else if (is_mmio_spte(kvm, pte)) {
> mmu_spte_clear_no_track(spte);
> }
> return 0;
> @@ -3067,8 +3067,13 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
> * by L0 userspace (you can observe gfn > L1.MAXPHYADDR if
> * and only if L1's MAXPHYADDR is inaccurate with respect to
> * the hardware's).
> + *
> + * Excludes the INTEL TD guest. Because TD memory is
> + * protected, the instruction can't be emulated. Instead, use
> + * SPTE value without #VE suppress bit cleared
> + * (kvm->arch.shadow_mmio_value = 0).
> */

Again, I don't think this chunk should be in this patch. It's out of scope for
what the patch claims to do.

I see you make the below code change in a later patch (a couple of patches later):

- if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
+ if (unlikely(!vcpu->kvm->arch.enable_mmio_caching &&
+ !kvm_gfn_shared_mask(vcpu->kvm)) ||
unlikely(fault->gfn > kvm_mmu_max_gfn()))
return RET_PF_EMULATE;

So why not put the comment and the code change together?

> - if (unlikely(!enable_mmio_caching) ||
> + if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
> unlikely(fault->gfn > kvm_mmu_max_gfn()))
> return RET_PF_EMULATE;
> }
> @@ -3200,7 +3205,8 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> else
> sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
>
> - if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
> + if (!is_shadow_present_pte(spte) ||
> + is_mmio_spte(vcpu->kvm, spte))
> break;
>
> sp = sptep_to_sp(sptep);
> @@ -3907,7 +3913,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
> if (WARN_ON(reserved))
> return -EINVAL;
>
> - if (is_mmio_spte(spte)) {
> + if (is_mmio_spte(vcpu->kvm, spte)) {
> gfn_t gfn = get_mmio_spte_gfn(spte);
> unsigned int access = get_mmio_spte_access(spte);
>
> @@ -4350,7 +4356,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
> static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
> unsigned int access)
> {
> - if (unlikely(is_mmio_spte(*sptep))) {
> + if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
> if (gfn != get_mmio_spte_gfn(*sptep)) {
> mmu_spte_clear_no_track(sptep);
> return true;
> @@ -5864,6 +5870,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_write = kvm_mmu_pte_write;
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
> + kvm_mmu_set_mmio_spte_mask(kvm, shadow_default_mmio_mask,
> + shadow_default_mmio_mask,
> + ACC_WRITE_MASK | ACC_USER_MASK);
> +

This (along with shadow_default_mmio_mask) looks a little bit weird. Please
also see comments below.

> return 0;
> }
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index ee2fb0c073f3..62ae590d4e5b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1032,7 +1032,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> gfn_t gfn;
>
> if (!is_shadow_present_pte(sp->spt[i]) &&
> - !is_mmio_spte(sp->spt[i]))
> + !is_mmio_spte(vcpu->kvm, sp->spt[i]))
> continue;
>
> pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index bd441458153f..5194aef60c1f 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -29,8 +29,7 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> u64 __read_mostly shadow_user_mask;
> u64 __read_mostly shadow_accessed_mask;
> u64 __read_mostly shadow_dirty_mask;
> -u64 __read_mostly shadow_mmio_value;
> -u64 __read_mostly shadow_mmio_mask;
> +u64 __read_mostly shadow_default_mmio_mask;

This shadow_default_mmio_mask looks a little bit weird. Please also see below.

> u64 __read_mostly shadow_mmio_access_mask;
> u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> @@ -62,10 +61,11 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
> u64 spte = generation_mmio_spte_mask(gen);
> u64 gpa = gfn << PAGE_SHIFT;
>
> - WARN_ON_ONCE(!shadow_mmio_value);
> + WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
> + !kvm_gfn_shared_mask(vcpu->kvm));

This chunk doesn't belong in this patch.

>
> access &= shadow_mmio_access_mask;
> - spte |= shadow_mmio_value | access;
> + spte |= vcpu->kvm->arch.shadow_mmio_value | access;
> spte |= gpa | shadow_nonpresent_or_rsvd_mask;
> spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
> << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
> @@ -337,7 +337,8 @@ u64 mark_spte_for_access_track(u64 spte)
> return spte;
> }
>
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
> + u64 access_mask)
> {
> BUG_ON((u64)(unsigned)access_mask != access_mask);
> WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
> @@ -366,11 +367,9 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> mmio_value = 0;
>
> - if (!mmio_value)
> - enable_mmio_caching = false;
> -
> - shadow_mmio_value = mmio_value;
> - shadow_mmio_mask = mmio_mask;
> + kvm->arch.enable_mmio_caching = !!mmio_value;
> + kvm->arch.shadow_mmio_value = mmio_value;
> + kvm->arch.shadow_mmio_mask = mmio_mask;
> shadow_mmio_access_mask = access_mask;
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> @@ -393,24 +392,18 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> shadow_dirty_mask = has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
> shadow_nx_mask = 0ull;
> shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
> - shadow_present_mask = has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> + /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
> + shadow_present_mask =
> + (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;

Again, this chunk shouldn't be in this patch.

> shadow_acc_track_mask = VMX_EPT_RWX_MASK;
> shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
> shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
> -
> - /*
> - * EPT Misconfigurations are generated if the value of bits 2:0
> - * of an EPT paging-structure entry is 110b (write/execute).
> - */
> - kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> - VMX_EPT_RWX_MASK, 0);
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>
> void kvm_mmu_reset_all_pte_masks(void)
> {
> u8 low_phys_bits;
> - u64 mask;
>
> shadow_phys_bits = kvm_get_shadow_phys_bits();
>
> @@ -459,9 +452,13 @@ void kvm_mmu_reset_all_pte_masks(void)
> * PTEs and so the reserved PA approach must be disabled.
> */
> if (shadow_phys_bits < 52)
> - mask = BIT_ULL(51) | PT_PRESENT_MASK;
> + shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
> else
> - mask = 0;
> + shadow_default_mmio_mask = 0;
> +}

shadow_default_mmio_mask alone looks a little bit weird with per-VM MMIO
tracking. I think it can be removed by moving this code to vmx_vm_init() and
using it as the VM's MMIO mask/value for the non-EPT case. If EPT is enabled,
it can be overridden with the new mask/value.
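
I.e. something like the below in vmx_vm_init() (untested sketch; it only
reuses the helpers and masks this patch/series already has):

        if (enable_ept) {
                /*
                 * EPT Misconfigurations are generated if the value of
                 * bits 2:0 of an EPT paging-structure entry is 110b
                 * (write/execute).
                 */
                kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
                                           VMX_EPT_RWX_MASK, 0);
        } else {
                /*
                 * No EPT: keep the reserved-PA approach that
                 * kvm_mmu_reset_all_pte_masks() uses today.
                 */
                u64 mask = kvm_get_shadow_phys_bits() < 52 ?
                           BIT_ULL(51) | PT_PRESENT_MASK : 0;

                kvm_mmu_set_mmio_spte_mask(kvm, mask, mask,
                                           ACC_WRITE_MASK | ACC_USER_MASK);
        }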

>
> - kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
> +void kvm_mmu_set_default_mmio_spte_mask(u64 mask)
> +{
> + shadow_default_mmio_mask = mask;
> }
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_default_mmio_spte_mask);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 1bfedbe0585f..96312ab4fffb 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -5,8 +5,6 @@
>
> #include "mmu_internal.h"
>
> -extern bool __read_mostly enable_mmio_caching;
> -
> /*
> * A MMU present SPTE is backed by actual memory and may or may not be present
> * in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
> @@ -160,8 +158,7 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> extern u64 __read_mostly shadow_user_mask;
> extern u64 __read_mostly shadow_accessed_mask;
> extern u64 __read_mostly shadow_dirty_mask;
> -extern u64 __read_mostly shadow_mmio_value;
> -extern u64 __read_mostly shadow_mmio_mask;
> +extern u64 __read_mostly shadow_default_mmio_mask;
> extern u64 __read_mostly shadow_mmio_access_mask;
> extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_value;
> @@ -233,10 +230,10 @@ static inline bool is_removed_spte(u64 spte)
> */
> extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
>
> -static inline bool is_mmio_spte(u64 spte)
> +static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
> {
> - return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> - likely(enable_mmio_caching);
> + return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
> + likely(kvm->arch.enable_mmio_caching || kvm_gfn_shared_mask(kvm));
> }

This chunk (checking kvm_gfn_shared_mask(kvm)) should not be in this patch.

>
> static inline bool is_shadow_present_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2ca03ec3bf52..82f1bfac7ee6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -569,8 +569,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * impact the guest since both the former and current SPTEs
> * are nonpresent.
> */
> - if (WARN_ON(!is_mmio_spte(old_spte) &&
> - !is_mmio_spte(new_spte) &&
> + if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
> + !is_mmio_spte(kvm, new_spte) &&
> !is_removed_spte(new_spte)))
> pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
> "should not be replaced with another,\n"
> @@ -1108,7 +1108,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> }
>
> /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> - if (unlikely(is_mmio_spte(new_spte))) {
> + if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
> vcpu->stat.pf_mmio_spte_created++;
> trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
> new_spte);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 815a07c594f1..0abc43d6a115 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4870,7 +4870,7 @@ static __init void svm_adjust_mmio_mask(void)
> */
> mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
>
> - kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
> + kvm_mmu_set_default_mmio_spte_mask(mask);

SVM doesn't need shadow_default_mmio_mask. Instead, it can define a local
'mask' variable in svm.c and call kvm_mmu_set_mmio_spte_mask(kvm, mask, mask,
PT_WRITABLE_MASK | PT_USER_MASK) from svm_vm_init().
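
E.g. (untested sketch; 'svm_shadow_mmio_mask' is just an illustrative name
for that file-local variable):

static u64 __read_mostly svm_shadow_mmio_mask;

static __init void svm_adjust_mmio_mask(void)
{
        /* ... existing enc_bit/mask_bit detection ... */
        svm_shadow_mmio_mask = (mask_bit < 52) ?
                               rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
}

static int svm_vm_init(struct kvm *kvm)
{
        kvm_mmu_set_mmio_spte_mask(kvm, svm_shadow_mmio_mask,
                                   svm_shadow_mmio_mask,
                                   PT_WRITABLE_MASK | PT_USER_MASK);

        if (!pause_filter_count || !pause_filter_thresh)
                kvm->arch.pause_in_guest = true;

        return 0;
}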

> }
>
> static __init void svm_set_cpu_caps(void)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 1d87885245cc..e2415ac55317 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7289,6 +7289,14 @@ int vmx_vm_init(struct kvm *kvm)
> if (!ple_gap)
> kvm->arch.pause_in_guest = true;
>
> + /*
> + * EPT Misconfigurations can be generated if the value of bits 2:0
> + * of an EPT paging-structure entry is 110b (write/execute).
> + */
> + if (enable_ept)
> + kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
> + VMX_EPT_RWX_MASK, 0);
> +

As commented above, I think we can remove shadow_default_mmio_mask by moving
the logic from kvm_mmu_reset_all_pte_masks() here.

Or, similar to the SVM approach, use a local variable 'mask' in vmx.c,
calculate the 'mask' during hardware_setup(), and use it here for the non-EPT
case.


> if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
> switch (l1tf_mitigation) {
> case L1TF_MITIGATION_OFF:

2022-07-19 08:57:16

by Isaku Yamahata

Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis


Here is the updated one. The changes are:
- removed hunks that should be a part of other patches.
- removed shadow_default_mmio_mask.
- trimmed down the commit message.

From ed6b4a076e515550878b069596cf156a1bc33514 Mon Sep 17 00:00:00 2001
Message-Id: <ed6b4a076e515550878b069596cf156a1bc33514.1658220363.git.isaku.yamahata@intel.com>
In-Reply-To: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1658220363.git.isaku.yamahata@intel.com>
References: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1658220363.git.isaku.yamahata@intel.com>
From: Sean Christopherson <[email protected]>
Date: Wed, 10 Jun 2020 15:46:38 -0700
Subject: [PATCH 036/306] KVM: x86/mmu: Track shadow MMIO value/mask on a
per-VM basis

TDX will use a different shadow PTE value for MMIO than VMX does. Add
members to kvm_arch and track the MMIO value/mask per-VM instead of in
global variables. By using a per-VM EPT entry value for MMIO, the existing
VMX logic keeps working. To untangle the logic to initialize
shadow_mmio_access_mask, introduce a setter function.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/mmu.h | 3 ++-
arch/x86/kvm/mmu/mmu.c | 8 +++---
arch/x86/kvm/mmu/spte.c | 45 +++++++++------------------------
arch/x86/kvm/mmu/spte.h | 10 +++-----
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++---
arch/x86/kvm/svm/svm.c | 11 +++++---
arch/x86/kvm/vmx/tdx.c | 4 +++
arch/x86/kvm/vmx/vmx.c | 26 +++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 1 +
10 files changed, 66 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c47aab72a1b..39215daa8576 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1161,6 +1161,10 @@ struct kvm_arch {
*/
spinlock_t mmu_unsync_pages_lock;

+ bool enable_mmio_caching;
+ u64 shadow_mmio_value;
+ u64 shadow_mmio_mask;
+
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
bool iommu_noncoherent;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ccf0ba7a6387..cfa3e658162c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -108,7 +108,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
return boot_cpu_data.x86_phys_bits;
}

-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask);
+void kvm_mmu_set_mmio_access_mask(u64 mmio_access_mask);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5bfccfa0f50e..34240fcc45de 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2298,7 +2298,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
return kvm_mmu_prepare_zap_page(kvm, child,
invalid_list);
}
- } else if (is_mmio_spte(pte)) {
+ } else if (is_mmio_spte(kvm, pte)) {
mmu_spte_clear_no_track(spte);
}
return 0;
@@ -3079,7 +3079,7 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
* and only if L1's MAXPHYADDR is inaccurate with respect to
* the hardware's).
*/
- if (unlikely(!enable_mmio_caching) ||
+ if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
unlikely(fault->gfn > kvm_mmu_max_gfn()))
return RET_PF_EMULATE;
}
@@ -3918,7 +3918,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
if (WARN_ON(reserved))
return -EINVAL;

- if (is_mmio_spte(spte)) {
+ if (is_mmio_spte(vcpu->kvm, spte)) {
gfn_t gfn = get_mmio_spte_gfn(spte);
unsigned int access = get_mmio_spte_access(spte);

@@ -4361,7 +4361,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
- if (unlikely(is_mmio_spte(*sptep))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
if (gfn != get_mmio_spte_gfn(*sptep)) {
mmu_spte_clear_no_track(sptep);
return true;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 92968e5605fc..9a130dd3d6a3 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -29,8 +29,6 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
u64 __read_mostly shadow_user_mask;
u64 __read_mostly shadow_accessed_mask;
u64 __read_mostly shadow_dirty_mask;
-u64 __read_mostly shadow_mmio_value;
-u64 __read_mostly shadow_mmio_mask;
u64 __read_mostly shadow_mmio_access_mask;
u64 __read_mostly shadow_present_mask;
u64 __read_mostly shadow_me_value;
@@ -62,10 +60,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
u64 spte = generation_mmio_spte_mask(gen);
u64 gpa = gfn << PAGE_SHIFT;

- WARN_ON_ONCE(!shadow_mmio_value);
+ WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);

access &= shadow_mmio_access_mask;
- spte |= shadow_mmio_value | access;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
spte |= gpa | shadow_nonpresent_or_rsvd_mask;
spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -337,9 +335,8 @@ u64 mark_spte_for_access_track(u64 spte)
return spte;
}

-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask)
{
- BUG_ON((u64)(unsigned)access_mask != access_mask);
WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);

if (!enable_mmio_caching)
@@ -366,12 +363,9 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
mmio_value = 0;

- if (!mmio_value)
- enable_mmio_caching = false;
-
- shadow_mmio_value = mmio_value;
- shadow_mmio_mask = mmio_mask;
- shadow_mmio_access_mask = access_mask;
+ kvm->arch.enable_mmio_caching = !!mmio_value;
+ kvm->arch.shadow_mmio_value = mmio_value;
+ kvm->arch.shadow_mmio_mask = mmio_mask;
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);

@@ -399,20 +393,12 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
shadow_acc_track_mask = VMX_EPT_RWX_MASK;
shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
-
- /*
- * EPT Misconfigurations are generated if the value of bits 2:0
- * of an EPT paging-structure entry is 110b (write/execute).
- */
- kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
- VMX_EPT_RWX_MASK, 0);
}
EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);

void kvm_mmu_reset_all_pte_masks(void)
{
u8 low_phys_bits;
- u64 mask;

shadow_phys_bits = kvm_get_shadow_phys_bits();

@@ -452,18 +438,11 @@ void kvm_mmu_reset_all_pte_masks(void)

shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITABLE;
shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITABLE;
+}

- /*
- * Set a reserved PA bit in MMIO SPTEs to generate page faults with
- * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
- * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
- * 52-bit physical addresses then there are no reserved PA bits in the
- * PTEs and so the reserved PA approach must be disabled.
- */
- if (shadow_phys_bits < 52)
- mask = BIT_ULL(51) | PT_PRESENT_MASK;
- else
- mask = 0;
-
- kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
+void kvm_mmu_set_mmio_access_mask(u64 mmio_access_mask)
+{
+ BUG_ON((u64)(unsigned)mmio_access_mask != mmio_access_mask);
+ shadow_mmio_access_mask = mmio_access_mask;
}
+EXPORT_SYMBOL(kvm_mmu_set_mmio_access_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f5fd22f6bf5f..99bce92b596e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,8 +5,6 @@

#include "mmu_internal.h"

-extern bool __read_mostly enable_mmio_caching;
-
/*
* A MMU present SPTE is backed by actual memory and may or may not be present
* in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
@@ -160,8 +158,6 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
-extern u64 __read_mostly shadow_mmio_value;
-extern u64 __read_mostly shadow_mmio_mask;
extern u64 __read_mostly shadow_mmio_access_mask;
extern u64 __read_mostly shadow_present_mask;
extern u64 __read_mostly shadow_me_value;
@@ -228,10 +224,10 @@ static inline bool is_removed_spte(u64 spte)
*/
extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;

-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
- return (spte & shadow_mmio_mask) == shadow_mmio_value &&
- likely(enable_mmio_caching);
+ return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
+ likely(kvm->arch.enable_mmio_caching);
}

static inline bool is_shadow_present_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2ca03ec3bf52..82f1bfac7ee6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -569,8 +569,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
- if (WARN_ON(!is_mmio_spte(old_spte) &&
- !is_mmio_spte(new_spte) &&
+ if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
@@ -1108,7 +1108,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
}

/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
- if (unlikely(is_mmio_spte(new_spte))) {
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f01821f48bfd..0f63257161a6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -198,6 +198,7 @@ module_param(dump_invalid_vmcb, bool, 0644);
bool intercept_smi = true;
module_param(intercept_smi, bool, 0444);

+static u64 __read_mostly svm_shadow_mmio_mask;

static bool svm_gp_erratum_intercept = true;

@@ -4685,6 +4686,9 @@ static bool svm_is_vm_type_supported(unsigned long type)

static int svm_vm_init(struct kvm *kvm)
{
+ kvm_mmu_set_mmio_spte_mask(kvm, svm_shadow_mmio_mask,
+ svm_shadow_mmio_mask);
+
if (!pause_filter_count || !pause_filter_thresh)
kvm->arch.pause_in_guest = true;

@@ -4834,7 +4838,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
static __init void svm_adjust_mmio_mask(void)
{
unsigned int enc_bit, mask_bit;
- u64 msr, mask;
+ u64 msr;

/* If there is no memory encryption support, use existing mask */
if (cpuid_eax(0x80000000) < 0x8000001f)
@@ -4861,9 +4865,8 @@ static __init void svm_adjust_mmio_mask(void)
*
* If the mask bit location is 52 (or above), then clear the mask.
*/
- mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
-
- kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
+ svm_shadow_mmio_mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
+ kvm_mmu_set_mmio_access_mask(PT_WRITABLE_MASK | PT_USER_MASK);
}

static __init void svm_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 36d2127cb7b7..52fb54880f9b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,6 +7,7 @@
#include "x86_ops.h"
#include "tdx.h"
#include "x86.h"
+#include "mmu.h"

#undef pr_fmt
#define pr_fmt(fmt) "tdx: " fmt
@@ -276,6 +277,9 @@ int tdx_vm_init(struct kvm *kvm)
int ret, i;
u64 err;

+ kvm_mmu_set_mmio_spte_mask(kvm, vmx_shadow_mmio_mask,
+ vmx_shadow_mmio_mask);
+
/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
kvm->max_vcpus = 0;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e129ee663498..88e893fdffe8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -141,6 +141,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
extern bool __read_mostly allow_smaller_maxphyaddr;
module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);

+u64 __ro_after_init vmx_shadow_mmio_mask;
+
#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
#define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
#define KVM_VM_CR0_ALWAYS_ON \
@@ -7359,6 +7361,17 @@ int vmx_vm_init(struct kvm *kvm)
if (!ple_gap)
kvm->arch.pause_in_guest = true;

+ /*
+ * EPT Misconfigurations can be generated if the value of bits 2:0
+ * of an EPT paging-structure entry is 110b (write/execute).
+ */
+ if (enable_ept)
+ kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
+ VMX_EPT_RWX_MASK);
+ else
+ kvm_mmu_set_mmio_spte_mask(kvm, vmx_shadow_mmio_mask,
+ vmx_shadow_mmio_mask);
+
if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
switch (l1tf_mitigation) {
case L1TF_MITIGATION_OFF:
@@ -8358,6 +8371,19 @@ int __init vmx_init(void)
if (!enable_ept)
allow_smaller_maxphyaddr = true;

+ /*
+ * Set a reserved PA bit in MMIO SPTEs to generate page faults with
+ * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
+ * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
+ * 52-bit physical addresses then there are no reserved PA bits in the
+ * PTEs and so the reserved PA approach must be disabled.
+ */
+ if (kvm_get_shadow_phys_bits() < 52)
+ vmx_shadow_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
+ else
+ vmx_shadow_mmio_mask = 0;
+ kvm_mmu_set_mmio_access_mask(0);
+
return 0;
}

diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7e38c7b756d4..279e5360c555 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -13,6 +13,7 @@ void hv_vp_assist_page_exit(void);
void __init vmx_init_early(void);
int __init vmx_init(void);
void vmx_exit(void);
+extern u64 __ro_after_init vmx_shadow_mmio_mask;

__init int vmx_cpu_has_kvm_support(void);
__init int vmx_disabled_by_bios(void);
--
2.25.1


--
Isaku Yamahata <[email protected]>

2022-07-20 04:06:39

by Kai Huang

Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

On Tue, 2022-07-19 at 01:47 -0700, Isaku Yamahata wrote:
> Here is the updated one. The changes are:
> - removed hunks that should be a part of other patches.
> - removed shadow_default_mmio_mask.
> - trimmed down the commit message.
>
> From ed6b4a076e515550878b069596cf156a1bc33514 Mon Sep 17 00:00:00 2001
> Message-Id: <ed6b4a076e515550878b069596cf156a1bc33514.1658220363.git.isaku.yamahata@intel.com>
> In-Reply-To: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1658220363.git.isaku.yamahata@intel.com>
> References: <3941849bf08a55cfbbe69b222f0fd0dac7c5ee53.1658220363.git.isaku.yamahata@intel.com>
> From: Sean Christopherson <[email protected]>
> Date: Wed, 10 Jun 2020 15:46:38 -0700
> Subject: [PATCH 036/306] KVM: x86/mmu: Track shadow MMIO value/mask on a
> per-VM basis
>
> TDX will use a different shadow PTE value for MMIO than VMX does. Add
> members to kvm_arch and track the MMIO value/mask per-VM instead of in
> global variables. By using a per-VM EPT entry value for MMIO, the existing
> VMX logic keeps working. To untangle the logic to initialize
> shadow_mmio_access_mask, introduce a setter function.

introduce a separate setter function for it.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +++
> arch/x86/kvm/mmu.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 8 +++---
> arch/x86/kvm/mmu/spte.c | 45 +++++++++------------------------
> arch/x86/kvm/mmu/spte.h | 10 +++-----
> arch/x86/kvm/mmu/tdp_mmu.c | 6 ++---
> arch/x86/kvm/svm/svm.c | 11 +++++---
> arch/x86/kvm/vmx/tdx.c | 4 +++
> arch/x86/kvm/vmx/vmx.c | 26 +++++++++++++++++++
> arch/x86/kvm/vmx/x86_ops.h | 1 +
> 10 files changed, 66 insertions(+), 52 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2c47aab72a1b..39215daa8576 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1161,6 +1161,10 @@ struct kvm_arch {
> */
> spinlock_t mmu_unsync_pages_lock;
>
> + bool enable_mmio_caching;
> + u64 shadow_mmio_value;
> + u64 shadow_mmio_mask;
> +
> struct list_head assigned_dev_head;
> struct iommu_domain *iommu_domain;
> bool iommu_noncoherent;
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index ccf0ba7a6387..cfa3e658162c 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -108,7 +108,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
> return boot_cpu_data.x86_phys_bits;
> }
>
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask);
> +void kvm_mmu_set_mmio_access_mask(u64 mmio_access_mask);
> void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5bfccfa0f50e..34240fcc45de 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2298,7 +2298,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
> return kvm_mmu_prepare_zap_page(kvm, child,
> invalid_list);
> }
> - } else if (is_mmio_spte(pte)) {
> + } else if (is_mmio_spte(kvm, pte)) {
> mmu_spte_clear_no_track(spte);
> }
> return 0;
> @@ -3079,7 +3079,7 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
> * and only if L1's MAXPHYADDR is inaccurate with respect to
> * the hardware's).
> */
> - if (unlikely(!enable_mmio_caching) ||
> + if (unlikely(!vcpu->kvm->arch.enable_mmio_caching) ||
> unlikely(fault->gfn > kvm_mmu_max_gfn()))
> return RET_PF_EMULATE;
> }
> @@ -3918,7 +3918,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
> if (WARN_ON(reserved))
> return -EINVAL;
>
> - if (is_mmio_spte(spte)) {
> + if (is_mmio_spte(vcpu->kvm, spte)) {
> gfn_t gfn = get_mmio_spte_gfn(spte);
> unsigned int access = get_mmio_spte_access(spte);
>
> @@ -4361,7 +4361,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
> static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
> unsigned int access)
> {
> - if (unlikely(is_mmio_spte(*sptep))) {
> + if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
> if (gfn != get_mmio_spte_gfn(*sptep)) {
> mmu_spte_clear_no_track(sptep);
> return true;
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 92968e5605fc..9a130dd3d6a3 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -29,8 +29,6 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> u64 __read_mostly shadow_user_mask;
> u64 __read_mostly shadow_accessed_mask;
> u64 __read_mostly shadow_dirty_mask;
> -u64 __read_mostly shadow_mmio_value;
> -u64 __read_mostly shadow_mmio_mask;
> u64 __read_mostly shadow_mmio_access_mask;
> u64 __read_mostly shadow_present_mask;
> u64 __read_mostly shadow_me_value;
> @@ -62,10 +60,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
> u64 spte = generation_mmio_spte_mask(gen);
> u64 gpa = gfn << PAGE_SHIFT;
>
> - WARN_ON_ONCE(!shadow_mmio_value);
> + WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
>
> access &= shadow_mmio_access_mask;
> - spte |= shadow_mmio_value | access;
> + spte |= vcpu->kvm->arch.shadow_mmio_value | access;
> spte |= gpa | shadow_nonpresent_or_rsvd_mask;
> spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
> << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
> @@ -337,9 +335,8 @@ u64 mark_spte_for_access_track(u64 spte)
> return spte;
> }
>
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask)
> {
> - BUG_ON((u64)(unsigned)access_mask != access_mask);
> WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
>
> if (!enable_mmio_caching)
> @@ -366,12 +363,9 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> mmio_value = 0;
>
> - if (!mmio_value)
> - enable_mmio_caching = false;
> -
> - shadow_mmio_value = mmio_value;
> - shadow_mmio_mask = mmio_mask;
> - shadow_mmio_access_mask = access_mask;
> + kvm->arch.enable_mmio_caching = !!mmio_value;

KVM has a global enable_mmio_caching boolean, and I think we should honor it
here (in this patch) by doing the below first:

        if (!enable_mmio_caching)
                mmio_value = 0;

For a TD guest, the logic around enable_mmio_caching doesn't make sense
anymore, so perhaps we can later tweak it by doing something like:

        /*
         * Treat mmio_caching as false for a TD guest, or true for it,
         * depending on how you define it.
         */
        if (kvm_gfn_shared_mask(kvm)) {
                kvm->arch.enable_mmio_caching = false; /* or true? */
        } else {
                if (!enable_mmio_caching)
                        mmio_value = 0;
                kvm->arch.enable_mmio_caching = !!mmio_value;
        }

> + kvm->arch.shadow_mmio_value = mmio_value;
> + kvm->arch.shadow_mmio_mask = mmio_mask;
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
> @@ -399,20 +393,12 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> shadow_acc_track_mask = VMX_EPT_RWX_MASK;
> shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
> shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
> -
> - /*
> - * EPT Misconfigurations are generated if the value of bits 2:0
> - * of an EPT paging-structure entry is 110b (write/execute).
> - */
> - kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> - VMX_EPT_RWX_MASK, 0);
> }
> EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>
> void kvm_mmu_reset_all_pte_masks(void)
> {
> u8 low_phys_bits;
> - u64 mask;
>
> shadow_phys_bits = kvm_get_shadow_phys_bits();
>
> @@ -452,18 +438,11 @@ void kvm_mmu_reset_all_pte_masks(void)
>
> shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITABLE;
> shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITABLE;
> +}
>
> - /*
> - * Set a reserved PA bit in MMIO SPTEs to generate page faults with
> - * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
> - * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
> - * 52-bit physical addresses then there are no reserved PA bits in the
> - * PTEs and so the reserved PA approach must be disabled.
> - */
> - if (shadow_phys_bits < 52)
> - mask = BIT_ULL(51) | PT_PRESENT_MASK;
> - else
> - mask = 0;
> -
> - kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
> +void kvm_mmu_set_mmio_access_mask(u64 mmio_access_mask)
> +{
> + BUG_ON((u64)(unsigned)mmio_access_mask != mmio_access_mask);
> + shadow_mmio_access_mask = mmio_access_mask;
> }
> +EXPORT_SYMBOL(kvm_mmu_set_mmio_access_mask);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index f5fd22f6bf5f..99bce92b596e 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -5,8 +5,6 @@
>
> #include "mmu_internal.h"
>
> -extern bool __read_mostly enable_mmio_caching;
> -

Here you removed the ability to control enable_mmio_caching globally, which is
not something the changelog says you are doing. Perhaps we should still keep
it, and enforce it in kvm_mmu_set_mmio_spte_mask() as commented above.

And in upstream KVM, it is a module parameter. What happens to it?

> /*
> * A MMU present SPTE is backed by actual memory and may or may not be present
> * in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
> @@ -160,8 +158,6 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> extern u64 __read_mostly shadow_user_mask;
> extern u64 __read_mostly shadow_accessed_mask;
> extern u64 __read_mostly shadow_dirty_mask;
> -extern u64 __read_mostly shadow_mmio_value;
> -extern u64 __read_mostly shadow_mmio_mask;
> extern u64 __read_mostly shadow_mmio_access_mask;
> extern u64 __read_mostly shadow_present_mask;
> extern u64 __read_mostly shadow_me_value;
> @@ -228,10 +224,10 @@ static inline bool is_removed_spte(u64 spte)
> */
> extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
>
> -static inline bool is_mmio_spte(u64 spte)
> +static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
> {
> - return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> - likely(enable_mmio_caching);
> + return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
> + likely(kvm->arch.enable_mmio_caching);
> }
>
> static inline bool is_shadow_present_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 2ca03ec3bf52..82f1bfac7ee6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -569,8 +569,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * impact the guest since both the former and current SPTEs
> * are nonpresent.
> */
> - if (WARN_ON(!is_mmio_spte(old_spte) &&
> - !is_mmio_spte(new_spte) &&
> + if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
> + !is_mmio_spte(kvm, new_spte) &&
> !is_removed_spte(new_spte)))
> pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
> "should not be replaced with another,\n"
> @@ -1108,7 +1108,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> }
>
> /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> - if (unlikely(is_mmio_spte(new_spte))) {
> + if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
> vcpu->stat.pf_mmio_spte_created++;
> trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
> new_spte);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f01821f48bfd..0f63257161a6 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -198,6 +198,7 @@ module_param(dump_invalid_vmcb, bool, 0644);
> bool intercept_smi = true;
> module_param(intercept_smi, bool, 0444);
>
> +static u64 __read_mostly svm_shadow_mmio_mask;
>
> static bool svm_gp_erratum_intercept = true;
>
> @@ -4685,6 +4686,9 @@ static bool svm_is_vm_type_supported(unsigned long type)
>
> static int svm_vm_init(struct kvm *kvm)
> {
> + kvm_mmu_set_mmio_spte_mask(kvm, svm_shadow_mmio_mask,
> + svm_shadow_mmio_mask);
> +
> if (!pause_filter_count || !pause_filter_thresh)
> kvm->arch.pause_in_guest = true;
>
> @@ -4834,7 +4838,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
> static __init void svm_adjust_mmio_mask(void)
> {
> unsigned int enc_bit, mask_bit;
> - u64 msr, mask;
> + u64 msr;
>
> /* If there is no memory encryption support, use existing mask */
> if (cpuid_eax(0x80000000) < 0x8000001f)
> @@ -4861,9 +4865,8 @@ static __init void svm_adjust_mmio_mask(void)
> *
> * If the mask bit location is 52 (or above), then clear the mask.
> */
> - mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
> -
> - kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
> + svm_shadow_mmio_mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
> + kvm_mmu_set_mmio_access_mask(PT_WRITABLE_MASK | PT_USER_MASK);
> }
>
> static __init void svm_set_cpu_caps(void)
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 36d2127cb7b7..52fb54880f9b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -7,6 +7,7 @@
> #include "x86_ops.h"
> #include "tdx.h"
> #include "x86.h"
> +#include "mmu.h"
>
> #undef pr_fmt
> #define pr_fmt(fmt) "tdx: " fmt
> @@ -276,6 +277,9 @@ int tdx_vm_init(struct kvm *kvm)
> int ret, i;
> u64 err;
>
> + kvm_mmu_set_mmio_spte_mask(kvm, vmx_shadow_mmio_mask,
> + vmx_shadow_mmio_mask);
> +

I'd prefer to split this chunk out into a separate patch so this patch stays purely
infrastructural. That way you can also move this patch around in the series more
easily.

> /* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> kvm->max_vcpus = 0;
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index e129ee663498..88e893fdffe8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -141,6 +141,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
> extern bool __read_mostly allow_smaller_maxphyaddr;
> module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
>
> +u64 __ro_after_init vmx_shadow_mmio_mask;
> +
> #define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
> #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
> #define KVM_VM_CR0_ALWAYS_ON \
> @@ -7359,6 +7361,17 @@ int vmx_vm_init(struct kvm *kvm)
> if (!ple_gap)
> kvm->arch.pause_in_guest = true;
>
> + /*
> + * EPT Misconfigurations can be generated if the value of bits 2:0
> + * of an EPT paging-structure entry is 110b (write/execute).
> + */
> + if (enable_ept)
> + kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
> + VMX_EPT_RWX_MASK);
> + else
> + kvm_mmu_set_mmio_spte_mask(kvm, vmx_shadow_mmio_mask,
> + vmx_shadow_mmio_mask);
> +
> if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
> switch (l1tf_mitigation) {
> case L1TF_MITIGATION_OFF:
> @@ -8358,6 +8371,19 @@ int __init vmx_init(void)
> if (!enable_ept)
> allow_smaller_maxphyaddr = true;
>
> + /*
> + * Set a reserved PA bit in MMIO SPTEs to generate page faults with
> + * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
> + * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
> + * 52-bit physical addresses then there are no reserved PA bits in the
> + * PTEs and so the reserved PA approach must be disabled.
> + */
> + if (kvm_get_shadow_phys_bits() < 52)
> + vmx_shadow_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
> + else
> + vmx_shadow_mmio_mask = 0;
> + kvm_mmu_set_mmio_access_mask(0);
> +
> return 0;
> }
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 7e38c7b756d4..279e5360c555 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -13,6 +13,7 @@ void hv_vp_assist_page_exit(void);
> void __init vmx_init_early(void);
> int __init vmx_init(void);
> void vmx_exit(void);
> +extern u64 __ro_after_init vmx_shadow_mmio_mask;
>
> __init int vmx_cpu_has_kvm_support(void);
> __init int vmx_disabled_by_bios(void);
> --
> 2.25.1
>
>

2022-07-27 23:37:07

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

On Wed, Jul 20, 2022 at 03:45:59PM +1200,
Kai Huang <[email protected]> wrote:

> > @@ -337,9 +335,8 @@ u64 mark_spte_for_access_track(u64 spte)
> > return spte;
> > }
> >
> > -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> > +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask)
> > {
> > - BUG_ON((u64)(unsigned)access_mask != access_mask);
> > WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
> >
> > if (!enable_mmio_caching)
> > @@ -366,12 +363,9 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> > WARN_ON(mmio_value && (__REMOVED_SPTE & mmio_mask) == mmio_value))
> > mmio_value = 0;
> >
> > - if (!mmio_value)
> > - enable_mmio_caching = false;
> > -
> > - shadow_mmio_value = mmio_value;
> > - shadow_mmio_mask = mmio_mask;
> > - shadow_mmio_access_mask = access_mask;
> > + kvm->arch.enable_mmio_caching = !!mmio_value;
>
> KVM has a global enable_mmio_caching boolean, and I think we should honor it
> here (in this patch) by doing below first:
>
> 	if (!enable_mmio_caching)
> 		mmio_value = 0;

This function already includes "if (!enable_mmio_caching) mmio_value = 0;" at
the beginning (though not in this hunk). So this patch honors the kernel module
parameter.
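
For reference, the top of the function as this patch leaves it looks roughly like
the below (other sanity checks elided):

	void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask)
	{
		WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);

		/* The global module parameter is still honored here ... */
		if (!enable_mmio_caching)
			mmio_value = 0;

		/* ... and only then propagated into the per-VM state. */
		kvm->arch.enable_mmio_caching = !!mmio_value;
		kvm->arch.shadow_mmio_value = mmio_value;
		kvm->arch.shadow_mmio_mask = mmio_mask;
	}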


> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index f5fd22f6bf5f..99bce92b596e 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -5,8 +5,6 @@
> >
> > #include "mmu_internal.h"
> >
> > -extern bool __read_mostly enable_mmio_caching;
> > -
>
> Here you removed the ability to control enable_mmio_caching globally, which isn't
> something the changelog says you intend to do. Perhaps we should still keep it,
> and enforce it in kvm_mmu_set_mmio_spte_mask() as commented above.
>
> And in upstream KVM, it is a module parameter. What happens to it?

Ditto. The upstream kvm_mmu_set_mmio_spte_mask() has
"if (!enable_mmio_caching) mmio_value = 0;" and this patch keeps it.


> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 36d2127cb7b7..52fb54880f9b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -7,6 +7,7 @@
> > #include "x86_ops.h"
> > #include "tdx.h"
> > #include "x86.h"
> > +#include "mmu.h"
> >
> > #undef pr_fmt
> > #define pr_fmt(fmt) "tdx: " fmt
> > @@ -276,6 +277,9 @@ int tdx_vm_init(struct kvm *kvm)
> > int ret, i;
> > u64 err;
> >
> > + kvm_mmu_set_mmio_spte_mask(kvm, vmx_shadow_mmio_mask,
> > + vmx_shadow_mmio_mask);
> > +
>
> I'd prefer to split this chunk out into a separate patch so this patch stays purely
> infrastructural. That way you can also move this patch around in the series more
> easily.

Ok. I'll move it to a patch that touches TDX.
--
Isaku Yamahata <[email protected]>

2022-07-28 00:57:15

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v7 037/102] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis

On Wed, 2022-07-27 at 16:20 -0700, Isaku Yamahata wrote:
> > KVM has a global enable_mmio_caching boolean, and I think we should honor it
> > here (in this patch) by doing below first:
> >
> > 	if (!enable_mmio_caching)
> > 		mmio_value = 0;
>
> This function already includes "if (!enable_mmio_caching) mmio_value = 0;" at
> the beginning (though not in this hunk). So this patch honors the kernel module
> parameter.

Yeah I missed that. Thanks.

--
Thanks,
-Kai