2022-07-23 01:25:08

by Sean Christopherson

Subject: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a non-leaf page instead of a huge page. Precisely track
(via kvm_mmu_page) if a non-huge page is being forced and use that info
to avoid unnecessarily forcing smaller page sizes in
disallowed_hugepage_adjust().

v2: Rebase, tweak a changelog accordingly.

v1: https://lore.kernel.org/all/[email protected]

Mingwei Zhang (1):
KVM: x86/mmu: explicitly check nx_hugepage in
disallowed_hugepage_adjust()

Sean Christopherson (5):
KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
MMUs
KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
SPTE
KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
pages
KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

arch/x86/include/asm/kvm_host.h | 17 ++---
arch/x86/kvm/mmu/mmu.c | 107 ++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 41 +++++++-----
arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
arch/x86/kvm/mmu/spte.c | 11 ++++
arch/x86/kvm/mmu/spte.h | 17 +++++
arch/x86/kvm/mmu/tdp_mmu.c | 49 +++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 2 +
8 files changed, 167 insertions(+), 83 deletions(-)


base-commit: 1a4d88a361af4f2e91861d632c6a1fe87a9665c2
--
2.37.1.359.gd136c6c3e2-goog


2022-07-23 01:25:14

by Sean Christopherson

Subject: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

Tag shadow pages that cannot be replaced with an NX huge page even if
zapping the page would not allow KVM to create a huge page, e.g. because
something else prevents creating a huge page. This will allow a future
patch to more precisely apply the mitigation by checking if an existing
shadow page can be replaced by a NX huge page. Currently, KVM assumes
that any existing shadow page encountered cannot be replaced by a NX huge
page (if the mitigation is enabled), which prevents KVM from replacing
no-longer-necessary shadow pages with huge pages, e.g. after disabling
dirty logging, zapping from the mmu_notifier due to page migration,
etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.
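
For illustration, a minimal userspace sketch of the "list_empty() as membership
flag" pattern (the list helpers are re-implemented locally so the example
compiles standalone; the struct and variable names are invented, this is not
KVM code):

#include <stdbool.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);	/* self-referential again == "not on any list" */
}

struct page_like {
	struct list_head link;	/* stands in for possible_nx_huge_page_link */
};

int main(void)
{
	struct list_head possible_nx_huge_pages;
	struct page_like sp;

	INIT_LIST_HEAD(&possible_nx_huge_pages);
	INIT_LIST_HEAD(&sp.link);	/* done once when the SP is allocated */
	printf("tracked? %d\n", !list_empty(&sp.link));	/* 0 */

	/* track analogue: no-op if the page is already on the list */
	if (list_empty(&sp.link))
		list_add_tail(&sp.link, &possible_nx_huge_pages);
	printf("tracked? %d\n", !list_empty(&sp.link));	/* 1 */

	/* untrack analogue: no-op if the page was never tracked */
	if (!list_empty(&sp.link))
		list_del_init(&sp.link);
	printf("tracked? %d\n", !list_empty(&sp.link));	/* 0 */
	return 0;
}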

Opportunistically rename most of the variables/functions involved to
provide consistency, e.g. lpage vs huge page and NX huge vs huge NX, and
clarity, e.g. to make it obvious the flag applies only to the NX huge
page mitigation, not to any condition that prevents creating a huge page.

Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 +--
arch/x86/kvm/mmu/mmu.c | 75 ++++++++++++++++++++++-----------
arch/x86/kvm/mmu/mmu_internal.h | 22 ++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 6 +--
arch/x86/kvm/mmu/tdp_mmu.c | 8 ++--
5 files changed, 79 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e8281d64a431..246b69262b93 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1143,7 +1143,7 @@ struct kvm_arch {
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
struct list_head zapped_obsolete_pages;
- struct list_head lpage_disallowed_mmu_pages;
+ struct list_head possible_nx_huge_pages;
struct kvm_page_track_notifier_node mmu_sp_tracker;
struct kvm_page_track_notifier_head track_notifier_head;
/*
@@ -1304,8 +1304,8 @@ struct kvm_arch {
* - tdp_mmu_roots (above)
* - tdp_mmu_pages (above)
* - the link field of struct kvm_mmu_pages used by the TDP MMU
- * - lpage_disallowed_mmu_pages
- * - the lpage_disallowed_link field of struct kvm_mmu_pages used
+ * - possible_nx_huge_pages;
+ * - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
* by the TDP MMU
* It is acceptable, but not necessary, to acquire this lock when
* the thread holds the MMU lock in write mode.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e477333a263..1112e3a4cf3e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,15 +802,43 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
}

-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void untrack_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
{
- if (sp->lpage_disallowed)
+ if (list_empty(&sp->possible_nx_huge_page_link))
+ return;
+
+ --kvm->stat.nx_lpage_splits;
+ list_del_init(&sp->possible_nx_huge_page_link);
+}
+
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ sp->nx_huge_page_disallowed = false;
+
+ untrack_possible_nx_huge_page(kvm, sp);
+}
+
+static void track_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
+{
+ if (!list_empty(&sp->possible_nx_huge_page_link))
return;

++kvm->stat.nx_lpage_splits;
- list_add_tail(&sp->lpage_disallowed_link,
- &kvm->arch.lpage_disallowed_mmu_pages);
- sp->lpage_disallowed = true;
+ list_add_tail(&sp->possible_nx_huge_page_link,
+ &kvm->arch.possible_nx_huge_pages);
+}
+
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible)
+{
+ sp->nx_huge_page_disallowed = true;
+
+ if (!nx_huge_page_possible)
+ untrack_possible_nx_huge_page(kvm, sp);
+ else
+ track_possible_nx_huge_page(kvm, sp);
}

static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -830,13 +858,6 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_allow_lpage(slot, gfn);
}

-void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- --kvm->stat.nx_lpage_splits;
- sp->lpage_disallowed = false;
- list_del(&sp->lpage_disallowed_link);
-}
-
static struct kvm_memory_slot *
gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
bool no_dirty_log)
@@ -2115,6 +2136,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,

set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

+ INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
+
/*
* active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
* depends on valid pages being added to the head of the list. See
@@ -2472,8 +2495,8 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
zapped_root = !is_obsolete_sp(kvm, sp);
}

- if (sp->lpage_disallowed)
- unaccount_huge_nx_page(kvm, sp);
+ if (sp->nx_huge_page_disallowed)
+ unaccount_nx_huge_page(kvm, sp);

sp->role.invalid = 1;

@@ -3112,9 +3135,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
continue;

link_shadow_page(vcpu, it.sptep, sp);
- if (fault->is_tdp && fault->huge_page_disallowed &&
- fault->req_level >= it.level)
- account_huge_nx_page(vcpu->kvm, sp);
+ if (fault->is_tdp && fault->huge_page_disallowed)
+ account_nx_huge_page(vcpu->kvm, sp,
+ fault->req_level >= it.level);
}

if (WARN_ON_ONCE(it.level != fault->goal_level))
@@ -5970,7 +5993,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
- INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
+ INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);

r = kvm_mmu_init_tdp_mmu(kvm);
@@ -6845,23 +6868,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
for ( ; to_zap; --to_zap) {
- if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
+ if (list_empty(&kvm->arch.possible_nx_huge_pages))
break;

/*
* We use a separate list instead of just using active_mmu_pages
- * because the number of lpage_disallowed pages is expected to
- * be relatively small compared to the total.
+ * because the number of shadow pages that be replaced with an
+ * NX huge page is expected to be relatively small compared to
+ * the total number of shadow pages. And because the TDP MMU
+ * doesn't use active_mmu_pages.
*/
- sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
+ sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
struct kvm_mmu_page,
- lpage_disallowed_link);
- WARN_ON_ONCE(!sp->lpage_disallowed);
+ possible_nx_huge_page_link);
+ WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
if (is_tdp_mmu_page(sp)) {
flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
} else {
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- WARN_ON_ONCE(sp->lpage_disallowed);
+ WARN_ON_ONCE(sp->nx_huge_page_disallowed);
}

if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 582def531d4d..ff4ca54b9dda 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -57,7 +57,13 @@ struct kvm_mmu_page {
bool tdp_mmu_page;
bool unsync;
u8 mmu_valid_gen;
- bool lpage_disallowed; /* Can't be replaced by an equiv large page */
+
+ /*
+ * The shadow page can't be replaced by an equivalent huge page
+ * because it is being used to map an executable page in the guest
+ * and the NX huge page mitigation is enabled.
+ */
+ bool nx_huge_page_disallowed;

/*
* The following two entries are used to key the shadow page in the
@@ -100,7 +106,14 @@ struct kvm_mmu_page {
};
};

- struct list_head lpage_disallowed_link;
+ /*
+ * Use to track shadow pages that, if zapped, would allow KVM to create
+ * an NX huge page. A shadow page will have nx_huge_page_disallowed
+ * set but not be on the list if a huge page is disallowed for other
+ * reasons, e.g. because KVM is shadowing a PTE at the same gfn, the
+ * memslot isn't properly aligned, etc...
+ */
+ struct list_head possible_nx_huge_page_link;
#ifdef CONFIG_X86_32
/*
* Used out of the mmu-lock to avoid reading spte values while an
@@ -315,7 +328,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_

void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible);
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);

#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f5958071220c..259c0f019f09 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -713,9 +713,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
continue;

link_shadow_page(vcpu, it.sptep, sp);
- if (fault->huge_page_disallowed &&
- fault->req_level >= it.level)
- account_huge_nx_page(vcpu->kvm, sp);
+ if (fault->huge_page_disallowed)
+ account_nx_huge_page(vcpu->kvm, sp,
+ fault->req_level >= it.level);
}

if (WARN_ON_ONCE(it.level != fault->goal_level))
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 40ccb5fba870..a30983947fee 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -284,6 +284,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
gfn_t gfn, union kvm_mmu_page_role role)
{
+ INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
+
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

sp->role = role;
@@ -390,8 +392,8 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
lockdep_assert_held_write(&kvm->mmu_lock);

list_del(&sp->link);
- if (sp->lpage_disallowed)
- unaccount_huge_nx_page(kvm, sp);
+ if (sp->nx_huge_page_disallowed)
+ unaccount_nx_huge_page(kvm, sp);

if (shared)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1134,7 +1136,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
if (account_nx)
- account_huge_nx_page(kvm, sp);
+ account_nx_huge_page(kvm, sp, true);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

return 0;
--
2.37.1.359.gd136c6c3e2-goog

2022-07-23 01:25:18

by Sean Christopherson

Subject: [PATCH v2 2/6] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page cannot be replaced by a NX
huge page doesn't get false positives. Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.
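
As a purely illustrative toy model of that livelock (userspace C, all names
invented; it assumes the precise nx_huge_page_disallowed check added later in
the series and compares buggy vs. fixed accounting):

#include <stdbool.h>
#include <stdio.h>

enum mapping { NONE, SMALL_EXEC, HUGE_NX };

struct region {
	enum mapping map;
	bool sp_nx_disallowed;	/* tag on the small mapping's parent SP */
};

/* Fetch fault: an NX huge page can't satisfy an instruction fetch, so the
 * mitigation forces a small executable page.  With the accounting bug, the
 * parent SP is not tagged because the MMU is nonpaging. */
static bool fetch_fault(struct region *r, bool buggy_accounting)
{
	if (r->map == SMALL_EXEC)
		return false;			/* no fault taken */
	r->map = SMALL_EXEC;
	r->sp_nx_disallowed = !buggy_accounting;
	return true;
}

/* Write fault: KVM would prefer a huge page.  The precise check only refuses
 * it if the existing SP is tagged; the huge page that gets installed is
 * non-executable because of the mitigation. */
static bool write_fault(struct region *r)
{
	if (r->map == SMALL_EXEC && !r->sp_nx_disallowed) {
		r->map = HUGE_NX;
		return true;
	}
	return false;
}

static int run(bool buggy_accounting)
{
	struct region r = { NONE, false };
	int i, faults = 0;

	/* The guest keeps fetching from and writing to the same huge region. */
	for (i = 0; i < 5; i++) {
		faults += fetch_fault(&r, buggy_accounting);
		faults += write_fault(&r);
	}
	return faults;
}

int main(void)
{
	printf("buggy accounting: %d faults (ping-pongs forever)\n", run(true));
	printf("fixed accounting: %d fault (settles immediately)\n", run(false));
	return 0;
}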

Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs. The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there are no security or performance concerns. However, getting
make_spte() to play nice now and in the future is unnecessarily complex.

In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.

Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs. Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/mmu_internal.h | 8 ++++++++
arch/x86/kvm/mmu/spte.c | 11 +++++++++++
3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1112e3a4cf3e..493cdf1c29ff 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3135,7 +3135,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
continue;

link_shadow_page(vcpu, it.sptep, sp);
- if (fault->is_tdp && fault->huge_page_disallowed)
+ if (fault->huge_page_disallowed)
account_nx_huge_page(vcpu->kvm, sp,
fault->req_level >= it.level);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ff4ca54b9dda..83644a0167ab 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -201,6 +201,14 @@ struct kvm_page_fault {

/* Derived from mmu and global state. */
const bool is_tdp;
+
+ /*
+ * Note, enforcing the NX huge page mitigation for nonpaging MMUs
+ * (shadow paging, CR0.PG=0 in the guest) is completely unnecessary.
+ * The guest doesn't have any page tables to abuse and is guaranteed
+ * to switch to a different MMU when CR0.PG is toggled on (may not
+ * always be guaranteed when KVM is using TDP). See also make_spte().
+ */
const bool nx_huge_page_workaround_enabled;

/*
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 7314d27d57a4..9f3e5af088a5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -147,6 +147,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (!prefetch)
spte |= spte_shadow_accessed_mask(spte);

+ /*
+ * For simplicity, enforce the NX huge page mitigation even if not
+ * strictly necessary. KVM could ignore if the mitigation if paging is
+ * disabled in the guest, but KVM would then have to ensure a new MMU
+ * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
+ * and that's a net negative for performance when TDP is enabled. KVM
+ * could ignore the mitigation if TDP is disabled and CR0.PG=0, as KVM
+ * will always switch to a new MMU if paging is enabled in the guest,
+ * but that adds complexity just to optimize a mode that is anything
+ * but performance critical.
+ */
if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
is_nx_huge_page_enabled(vcpu->kvm)) {
pte_access &= ~ACC_EXEC_MASK;
--
2.37.1.359.gd136c6c3e2-goog

2022-07-23 01:26:24

by Sean Christopherson

Subject: [PATCH v2 3/6] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE

Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE. This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.
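
A standalone C11 sketch of the ordering requirement, with release/acquire
standing in for the smp_wmb()/smp_rmb() pairing that patch 6 adds on top of
this; the structure and names are illustrative only, not KVM code (compile
with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct sp {
	bool nx_huge_page_disallowed;
};

/* Stand-in for the non-leaf SPTE slot: publishing the pointer here is the
 * analogue of setting the SPTE that makes the new SP visible. */
static _Atomic(struct sp *) spte_slot;

static void *vcpu_installing(void *arg)
{
	static struct sp new_sp;

	(void)arg;
	/* Set the flag *before* the SP becomes visible ... */
	new_sp.nx_huge_page_disallowed = true;
	/* ... and publish with release semantics (smp_wmb() analogue). */
	atomic_store_explicit(&spte_slot, &new_sp, memory_order_release);
	return NULL;
}

static void *vcpu_faulting(void *arg)
{
	struct sp *sp;

	(void)arg;
	/* Acquire load (smp_rmb() analogue): if the SP is visible, its
	 * nx_huge_page_disallowed value is guaranteed to be visible too. */
	sp = atomic_load_explicit(&spte_slot, memory_order_acquire);
	if (sp)
		printf("saw SP, nx_huge_page_disallowed=%d\n",
		       sp->nx_huge_page_disallowed);
	else
		printf("no SP installed yet, nothing to check\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, vcpu_installing, NULL);
	pthread_create(&b, NULL, vcpu_faulting, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}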

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 +++++-------
arch/x86/kvm/mmu/mmu_internal.h | 5 ++---
arch/x86/kvm/mmu/tdp_mmu.c | 30 +++++++++++++++++-------------
3 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 493cdf1c29ff..e9252e7cd5a2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,8 +802,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
}

-static void untrack_possible_nx_huge_page(struct kvm *kvm,
- struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (list_empty(&sp->possible_nx_huge_page_link))
return;
@@ -812,15 +811,14 @@ static void untrack_possible_nx_huge_page(struct kvm *kvm,
list_del_init(&sp->possible_nx_huge_page_link);
}

-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
sp->nx_huge_page_disallowed = false;

untrack_possible_nx_huge_page(kvm, sp);
}

-static void track_possible_nx_huge_page(struct kvm *kvm,
- struct kvm_mmu_page *sp)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
if (!list_empty(&sp->possible_nx_huge_page_link))
return;
@@ -830,8 +828,8 @@ static void track_possible_nx_huge_page(struct kvm *kvm,
&kvm->arch.possible_nx_huge_pages);
}

-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool nx_huge_page_possible)
+static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible)
{
sp->nx_huge_page_disallowed = true;

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 83644a0167ab..2a887d08b722 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -336,8 +336,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_

void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
- bool nx_huge_page_possible);
-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);

#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a30983947fee..626c40ec2af9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -392,8 +392,10 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
lockdep_assert_held_write(&kvm->mmu_lock);

list_del(&sp->link);
- if (sp->nx_huge_page_disallowed)
- unaccount_nx_huge_page(kvm, sp);
+ if (sp->nx_huge_page_disallowed) {
+ sp->nx_huge_page_disallowed = false;
+ untrack_possible_nx_huge_page(kvm, sp);
+ }

if (shared)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1111,16 +1113,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
* @sp: The new TDP page table to install.
- * @account_nx: True if this page table is being installed to split a
- * non-executable huge page.
* @shared: This operation is running under the MMU lock in read mode.
*
* Returns: 0 if the new page table was installed. Non-0 if the page table
* could not be installed (e.g. the atomic compare-exchange failed).
*/
static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
- struct kvm_mmu_page *sp, bool account_nx,
- bool shared)
+ struct kvm_mmu_page *sp, bool shared)
{
u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
int ret = 0;
@@ -1135,8 +1134,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,

spin_lock(&kvm->arch.tdp_mmu_pages_lock);
list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
- if (account_nx)
- account_nx_huge_page(kvm, sp, true);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

return 0;
@@ -1149,6 +1146,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
+ struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret;
@@ -1185,9 +1183,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}

if (!is_shadow_present_pte(iter.old_spte)) {
- bool account_nx = fault->huge_page_disallowed &&
- fault->req_level >= iter.level;
-
/*
* If SPTE has been frozen by another thread, just
* give up and retry, avoiding unnecessary page table
@@ -1199,10 +1194,19 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);

- if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
+ sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+
+ if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
tdp_mmu_free_sp(sp);
break;
}
+
+ if (fault->huge_page_disallowed &&
+ fault->req_level >= iter.level) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ track_possible_nx_huge_page(kvm, sp);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }
}
}

@@ -1490,7 +1494,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
* correctness standpoint since the translation will be the same either
* way.
*/
- ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+ ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
if (ret)
goto out;

--
2.37.1.359.gd136c6c3e2-goog

2022-07-23 01:26:57

by Sean Christopherson

Subject: [PATCH v2 6/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

From: Mingwei Zhang <[email protected]>

Explicitly check if a NX huge page is disallowed when determining if a page
fault needs to be forced to use a smaller sized page. KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a shadow page instead of a huge page. Any scenario that causes
KVM to zap leaf SPTEs may result in having a SP that can be made huge
without violating the NX huge page mitigation. E.g. disabling of dirty
logging, zapping from mmu_notifier due to page migration, guest MTRR
changes that affect the viability of a huge page, etc...
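
As an aside, for readers unfamiliar with the pfn fixup that
disallowed_hugepage_adjust() performs once it does decide to force a smaller
page, here is a standalone arithmetic sketch (assuming the usual x86
definition KVM_PAGES_PER_HPAGE(x) == 1ULL << ((x - 1) * 9); the gfn/pfn
values are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror KVM's x86 definition: number of 4KiB pages per huge
 * page at a given paging level (1 = 4K, 2 = 2M, 3 = 1G). */
#define PAGES_PER_HPAGE(level)	(1ULL << (((level) - 1) * 9))

int main(void)
{
	int cur_level = 2;		/* forcing 2M -> 4K */
	uint64_t gfn = 0x123456;	/* faulting guest frame (made up) */
	uint64_t pfn = 0x789000;	/* 2M-aligned host frame (made up) */

	/* Mask selecting the next 9 bits of the address, i.e. which 4K page
	 * within the 2M region the fault actually hit. */
	uint64_t page_mask = PAGES_PER_HPAGE(cur_level) -
			     PAGES_PER_HPAGE(cur_level - 1);

	printf("page_mask = 0x%llx\n", (unsigned long long)page_mask); /* 0x1ff */
	pfn |= gfn & page_mask;		/* pick the matching 4K pfn */
	printf("adjusted pfn = 0x%llx\n", (unsigned long long)pfn);
	return 0;
}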

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")

Reviewed-by: Ben Gardon <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
[sean: add barrier comments, use spte_to_sp()]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ed3cfb31853b..97980528bf4a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3092,6 +3092,19 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
cur_level == fault->goal_level &&
is_shadow_present_pte(spte) &&
!is_large_pte(spte)) {
+ u64 page_mask;
+
+ /*
+ * Ensure nx_huge_page_disallowed is read after checking for a
+ * present shadow page. A different vCPU may be concurrently
+ * installing the shadow page if mmu_lock is held for read.
+ * Pairs with the smp_wmb() in kvm_tdp_mmu_map().
+ */
+ smp_rmb();
+
+ if (!spte_to_sp(spte)->nx_huge_page_disallowed)
+ return;
+
/*
* A small SPTE exists for this pfn, but FNAME(fetch)
* and __direct_map would like to create a large PTE
@@ -3099,8 +3112,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
* patching back for them into pfn the next 9 bits of
* the address.
*/
- u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
- KVM_PAGES_PER_HPAGE(cur_level - 1);
+ page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+ KVM_PAGES_PER_HPAGE(cur_level - 1);
fault->pfn |= fault->gfn & page_mask;
fault->goal_level--;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fea22dc481a0..313092d4931a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1194,6 +1194,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
tdp_mmu_init_child_sp(sp, &iter);

sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+ /*
+ * Ensure nx_huge_page_disallowed is visible before the
+ * SP is marked present, as mmu_lock is held for read.
+ * Pairs with the smp_rmb() in disallowed_hugepage_adjust().
+ */
+ smp_wmb();

if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
tdp_mmu_free_sp(sp);
--
2.37.1.359.gd136c6c3e2-goog

2022-07-23 01:28:53

by Sean Christopherson

Subject: [PATCH v2 5/6] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.
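
A standalone sketch of why the masking matters: an SPTE carries non-address
bits (the NX bit 63, low attribute bits, etc.) that must be stripped before
the physical address can be used to look up the shadow page. The mask and
helpers below are simplified assumptions for illustration, not the exact KVM
definitions:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for SPTE_BASE_ADDR_MASK: physical address bits 51:12. */
#define BASE_ADDR_MASK	(((1ULL << 52) - 1) & ~0xfffULL)

/* Hypothetical helpers mirroring the to_shadow_page()/spte_to_sp() split
 * added by this patch; here they just return the page-aligned PA instead of
 * doing pfn_to_page() + page_private(). */
static uint64_t to_shadow_page(uint64_t shadow_page_pa)
{
	return shadow_page_pa;
}

static uint64_t spte_to_sp(uint64_t spte)
{
	return to_shadow_page(spte & BASE_ADDR_MASK);
}

int main(void)
{
	/* Made-up SPTE: page table at PA 0x1a2b3c000, NX bit 63 set, plus
	 * a few low attribute-style bits. */
	uint64_t spte = (1ULL << 63) | 0x1a2b3c000ULL | 0x7ULL;

	printf("without masking: %#llx\n", (unsigned long long)spte);
	printf("with masking:    %#llx\n", (unsigned long long)spte_to_sp(spte));
	return 0;
}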

Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 17 ++++++++++-------
arch/x86/kvm/mmu/mmu_internal.h | 12 ------------
arch/x86/kvm/mmu/spte.h | 17 +++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 2 ++
4 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e9252e7cd5a2..ed3cfb31853b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1798,7 +1798,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
continue;
}

- child = to_shadow_page(ent & SPTE_BASE_ADDR_MASK);
+ child = spte_to_sp(ent);

if (child->unsync_children) {
if (mmu_pages_add(pvec, child, i))
@@ -2357,7 +2357,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
* so we should update the spte at this point to get
* a new sp with the correct access.
*/
- child = to_shadow_page(*sptep & SPTE_BASE_ADDR_MASK);
+ child = spte_to_sp(*sptep);
if (child->role.access == direct_access)
return;

@@ -2378,7 +2378,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
if (is_last_spte(pte, sp->role.level)) {
drop_spte(kvm, spte);
} else {
- child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
+ child = spte_to_sp(pte);
drop_parent_pte(child, spte);

/*
@@ -2817,7 +2817,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
struct kvm_mmu_page *child;
u64 pte = *sptep;

- child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
+ child = spte_to_sp(pte);
drop_parent_pte(child, sptep);
flush = true;
} else if (pfn != spte_to_pfn(*sptep)) {
@@ -3429,7 +3429,11 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
if (!VALID_PAGE(*root_hpa))
return;

- sp = to_shadow_page(*root_hpa & SPTE_BASE_ADDR_MASK);
+ /*
+ * The "root" may be a special root, e.g. a PAE entry, treat it as a
+ * SPTE to ensure any non-PA bits are dropped.
+ */
+ sp = spte_to_sp(*root_hpa);
if (WARN_ON(!sp))
return;

@@ -3914,8 +3918,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu->pae_root[i];

if (IS_VALID_PAE_ROOT(root)) {
- root &= SPTE_BASE_ADDR_MASK;
- sp = to_shadow_page(root);
+ sp = spte_to_sp(root);
mmu_sync_children(vcpu, sp, true);
}
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2a887d08b722..04457b5ec968 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -133,18 +133,6 @@ struct kvm_mmu_page {

extern struct kmem_cache *mmu_page_header_cache;

-static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
-{
- struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
-
- return (struct kvm_mmu_page *)page_private(page);
-}
-
-static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
-{
- return to_shadow_page(__pa(sptep));
-}
-
static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
{
return role.smm ? 1 : 0;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cabe3fbb4f39..a240b7eca54f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -207,6 +207,23 @@ static inline int spte_index(u64 *sptep)
*/
extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;

+static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
+{
+ struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+
+ return (struct kvm_mmu_page *)page_private(page);
+}
+
+static inline struct kvm_mmu_page *spte_to_sp(u64 spte)
+{
+ return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
+}
+
+static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
+{
+ return to_shadow_page(__pa(sptep));
+}
+
static inline bool is_mmio_spte(u64 spte)
{
return (spte & shadow_mmio_mask) == shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index c163f7cc23ca..d3714200b932 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -5,6 +5,8 @@

#include <linux/kvm_host.h>

+#include "spte.h"
+
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);

__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
--
2.37.1.359.gd136c6c3e2-goog

2022-07-23 01:31:15

by Sean Christopherson

Subject: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, eliminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.
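
A compilable C11 sketch of the counter-only accounting this switches to: the
common link/unlink path just adjusts an atomic, and the teardown leak check
reduces to reading it (names invented, not KVM code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_long tdp_mmu_pages = 0;	/* analogue of kvm->arch.tdp_mmu_pages */

/* Common path when linking a new TDP MMU shadow page: no lock, no list. */
static void link_sp(void)
{
	atomic_fetch_add_explicit(&tdp_mmu_pages, 1, memory_order_relaxed);
}

static void unlink_sp(void)
{
	atomic_fetch_sub_explicit(&tdp_mmu_pages, 1, memory_order_relaxed);
}

int main(void)
{
	link_sp();
	link_sp();
	unlink_sp();

	/* Teardown-time check, analogue of the WARN_ON() this patch keeps. */
	long leaked = atomic_load(&tdp_mmu_pages);
	printf("pages still accounted at teardown: %ld%s\n",
	       leaked, leaked ? " (leak!)" : "");
	return 0;
}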

Keep the tracking, as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yosry Ahmed <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 +++--------
arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++----------
2 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 246b69262b93..5c269b2556d6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1271,6 +1271,9 @@ struct kvm_arch {
*/
bool tdp_mmu_enabled;

+ /* The number of TDP MMU pages across all roots. */
+ atomic64_t tdp_mmu_pages;
+
/*
* List of struct kvm_mmu_pages being used as roots.
* All struct kvm_mmu_pages in the list should have
@@ -1291,18 +1294,10 @@ struct kvm_arch {
*/
struct list_head tdp_mmu_roots;

- /*
- * List of struct kvmp_mmu_pages not being used as roots.
- * All struct kvm_mmu_pages in the list should have
- * tdp_mmu_page set and a tdp_mmu_root_count of 0.
- */
- struct list_head tdp_mmu_pages;
-
/*
* Protects accesses to the following fields when the MMU lock
* is held in read mode:
* - tdp_mmu_roots (above)
- * - tdp_mmu_pages (above)
* - the link field of struct kvm_mmu_pages used by the TDP MMU
* - possible_nx_huge_pages;
* - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 626c40ec2af9..fea22dc481a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,7 +29,6 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
kvm->arch.tdp_mmu_enabled = true;
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
- INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
kvm->arch.tdp_mmu_zap_wq = wq;
return 1;
}
@@ -54,7 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
/* Also waits for any queued work items. */
destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);

- WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
+ WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));

/*
@@ -386,16 +385,18 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
bool shared)
{
+ atomic64_dec(&kvm->arch.tdp_mmu_pages);
+
+ if (!sp->nx_huge_page_disallowed)
+ return;
+
if (shared)
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
else
lockdep_assert_held_write(&kvm->mmu_lock);

- list_del(&sp->link);
- if (sp->nx_huge_page_disallowed) {
- sp->nx_huge_page_disallowed = false;
- untrack_possible_nx_huge_page(kvm, sp);
- }
+ sp->nx_huge_page_disallowed = false;
+ untrack_possible_nx_huge_page(kvm, sp);

if (shared)
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1132,9 +1133,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
tdp_mmu_set_spte(kvm, iter, spte);
}

- spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
- spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ atomic64_inc(&kvm->arch.tdp_mmu_pages);

return 0;
}
--
2.37.1.359.gd136c6c3e2-goog

2022-07-25 23:17:35

by David Matlack

Subject: Re: [PATCH v2 2/6] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

On Mon, Jul 25, 2022 at 4:05 PM David Matlack <[email protected]> wrote:
>
> On Sat, Jul 23, 2022 at 01:23:21AM +0000, Sean Christopherson wrote:
> > Account and track NX huge pages for nonpaging MMUs so that a future
> > enhancement to precisely check if shadow page cannot be replaced by a NX
> > huge page doesn't get false positives. Without correct tracking, KVM can
> > get stuck in a loop if an instruction is fetching and writing data on the
> > same huge page, e.g. KVM installs a small executable page on the fetch
> > fault, replaces it with an NX huge page on the write fault, and faults
> > again on the fetch.
> >
> > Alternatively, and perhaps ideally, KVM would simply not enforce the
> > workaround for nonpaging MMUs. The guest has no page tables to abuse
> > and KVM is guaranteed to switch to a different MMU on CR0.PG being
> > toggled so there's no security or performance concerns. However, getting
> > make_spte() to play nice now and in the future is unnecessarily complex.
> >
> > In the current code base, make_spte() can enforce the mitigation if TDP
> > is enabled or the MMU is indirect, but make_spte() may not always have a
> > vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
> > promotion when disabling dirty logging.
> >
> > Without a vCPU/MMU, KVM could either pass in the correct information
> > and/or derive it from the shadow page, but the former is ugly and the
> > latter subtly non-trivial due to the possibility of direct shadow pages
> > in indirect MMUs. Given that using shadow paging with an unpaged guest
> > is far from top priority _and_ has been subjected to the workaround since
> > its inception, keep it simple and just fix the accounting glitch.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
>
> It's odd that KVM enforced NX Huge Pages but just skipped the accounting.
> In retrospect, that was bound to cause some issue.
>
> Aside from the comment suggestion below,
>
> Reviewed-by: David Matlack <[email protected]>
>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 2 +-
> > arch/x86/kvm/mmu/mmu_internal.h | 8 ++++++++
> > arch/x86/kvm/mmu/spte.c | 11 +++++++++++
> > 3 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1112e3a4cf3e..493cdf1c29ff 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3135,7 +3135,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > continue;
> >
> > link_shadow_page(vcpu, it.sptep, sp);
> > - if (fault->is_tdp && fault->huge_page_disallowed)
> > + if (fault->huge_page_disallowed)
> > account_nx_huge_page(vcpu->kvm, sp,
> > fault->req_level >= it.level);
> > }
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index ff4ca54b9dda..83644a0167ab 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -201,6 +201,14 @@ struct kvm_page_fault {
> >
> > /* Derived from mmu and global state. */
> > const bool is_tdp;
> > +
> > + /*
> > + * Note, enforcing the NX huge page mitigation for nonpaging MMUs
> > + * (shadow paging, CR0.PG=0 in the guest) is completely unnecessary.
> > + * The guest doesn't have any page tables to abuse and is guaranteed
> > + * to switch to a different MMU when CR0.PG is toggled on (may not
> > + * always be guaranteed when KVM is using TDP). See also make_spte().
> > + */
> > const bool nx_huge_page_workaround_enabled;
> >
> > /*
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 7314d27d57a4..9f3e5af088a5 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -147,6 +147,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > if (!prefetch)
> > spte |= spte_shadow_accessed_mask(spte);
> >
> > + /*
> > + * For simplicity, enforce the NX huge page mitigation even if not
> > + * strictly necessary. KVM could ignore if the mitigation if paging is
> > + * disabled in the guest, but KVM would then have to ensure a new MMU
> > + * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
> > + * and that's a net negative for performance when TDP is enabled. KVM
> > + * could ignore the mitigation if TDP is disabled and CR0.PG=0, as KVM
> > + * will always switch to a new MMU if paging is enabled in the guest,
> > + * but that adds complexity just to optimize a mode that is anything
> > + * but performance critical.
> > + */
>
> I had some trouble parsing the last sentence. How about this for slightly
> better flow:
>
> /*
> * For simplicity, enforce the NX huge page mitigation even if not
> * strictly necessary. KVM could ignore if the mitigation if paging is
> * disabled in the guest, but KVM would then have to ensure a new MMU
> * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
> * and that's a net negative for performance when TDP is enabled. When
> * TDP is disabled, KVM will always switch to a new MMU when CR0.PG is
> * toggled, but that would tie make_spte() further to vCPU/MMU state
> * and add complexity just to optimize a mode that is anything but
> * performance critical.

Blegh. Should be:

"... but leveraging that to ignore the mitigation would tie
make_spte() further..."

> */
>
> > if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
> > is_nx_huge_page_enabled(vcpu->kvm)) {
> > pte_access &= ~ACC_EXEC_MASK;
> > --
> > 2.37.1.359.gd136c6c3e2-goog
> >

2022-07-25 23:18:34

by David Matlack

Subject: Re: [PATCH v2 3/6] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE

On Sat, Jul 23, 2022 at 01:23:22AM +0000, Sean Christopherson wrote:
> Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
> visible to other readers, i.e. before setting its SPTE. This will allow
> KVM to query the flag when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.

It took me a minute to figure out why the same change isn't needed in
the shadow MMU (it always holds the write-lock so it's impossible for
another CPU to see an SP without a correct nx_huge_page_disallowed).
If you send a v2 can you add a short blurb to that effect here?

Otherwise,

Reviewed-by: David Matlack <[email protected]>

>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 12 +++++-------
> arch/x86/kvm/mmu/mmu_internal.h | 5 ++---
> arch/x86/kvm/mmu/tdp_mmu.c | 30 +++++++++++++++++-------------
> 3 files changed, 24 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 493cdf1c29ff..e9252e7cd5a2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -802,8 +802,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> }
>
> -static void untrack_possible_nx_huge_page(struct kvm *kvm,
> - struct kvm_mmu_page *sp)
> +void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> if (list_empty(&sp->possible_nx_huge_page_link))
> return;
> @@ -812,15 +811,14 @@ static void untrack_possible_nx_huge_page(struct kvm *kvm,
> list_del_init(&sp->possible_nx_huge_page_link);
> }
>
> -void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> sp->nx_huge_page_disallowed = false;
>
> untrack_possible_nx_huge_page(kvm, sp);
> }
>
> -static void track_possible_nx_huge_page(struct kvm *kvm,
> - struct kvm_mmu_page *sp)
> +void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> if (!list_empty(&sp->possible_nx_huge_page_link))
> return;
> @@ -830,8 +828,8 @@ static void track_possible_nx_huge_page(struct kvm *kvm,
> &kvm->arch.possible_nx_huge_pages);
> }
>
> -void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> - bool nx_huge_page_possible)
> +static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + bool nx_huge_page_possible)
> {
> sp->nx_huge_page_disallowed = true;
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 83644a0167ab..2a887d08b722 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -336,8 +336,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
> void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> - bool nx_huge_page_possible);
> -void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a30983947fee..626c40ec2af9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -392,8 +392,10 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> list_del(&sp->link);
> - if (sp->nx_huge_page_disallowed)
> - unaccount_nx_huge_page(kvm, sp);
> + if (sp->nx_huge_page_disallowed) {
> + sp->nx_huge_page_disallowed = false;
> + untrack_possible_nx_huge_page(kvm, sp);
> + }
>
> if (shared)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1111,16 +1113,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> * @kvm: kvm instance
> * @iter: a tdp_iter instance currently on the SPTE that should be set
> * @sp: The new TDP page table to install.
> - * @account_nx: True if this page table is being installed to split a
> - * non-executable huge page.
> * @shared: This operation is running under the MMU lock in read mode.
> *
> * Returns: 0 if the new page table was installed. Non-0 if the page table
> * could not be installed (e.g. the atomic compare-exchange failed).
> */
> static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> - struct kvm_mmu_page *sp, bool account_nx,
> - bool shared)
> + struct kvm_mmu_page *sp, bool shared)
> {
> u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
> int ret = 0;
> @@ -1135,8 +1134,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - if (account_nx)
> - account_nx_huge_page(kvm, sp, true);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> return 0;
> @@ -1149,6 +1146,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> + struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> int ret;
> @@ -1185,9 +1183,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> }
>
> if (!is_shadow_present_pte(iter.old_spte)) {
> - bool account_nx = fault->huge_page_disallowed &&
> - fault->req_level >= iter.level;
> -
> /*
> * If SPTE has been frozen by another thread, just
> * give up and retry, avoiding unnecessary page table
> @@ -1199,10 +1194,19 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> sp = tdp_mmu_alloc_sp(vcpu);
> tdp_mmu_init_child_sp(sp, &iter);
>
> - if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> + sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> +
> + if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> tdp_mmu_free_sp(sp);
> break;
> }
> +
> + if (fault->huge_page_disallowed &&
> + fault->req_level >= iter.level) {
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + track_possible_nx_huge_page(kvm, sp);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + }
> }
> }
>
> @@ -1490,7 +1494,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> * correctness standpoint since the translation will be the same either
> * way.
> */
> - ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> + ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
> if (ret)
> goto out;
>
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:29:31

by David Matlack

Subject: Re: [PATCH v2 2/6] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

On Sat, Jul 23, 2022 at 01:23:21AM +0000, Sean Christopherson wrote:
> Account and track NX huge pages for nonpaging MMUs so that a future
> enhancement to precisely check if shadow page cannot be replaced by a NX
> huge page doesn't get false positives. Without correct tracking, KVM can
> get stuck in a loop if an instruction is fetching and writing data on the
> same huge page, e.g. KVM installs a small executable page on the fetch
> fault, replaces it with an NX huge page on the write fault, and faults
> again on the fetch.
>
> Alternatively, and perhaps ideally, KVM would simply not enforce the
> workaround for nonpaging MMUs. The guest has no page tables to abuse
> and KVM is guaranteed to switch to a different MMU on CR0.PG being
> toggled so there's no security or performance concerns. However, getting
> make_spte() to play nice now and in the future is unnecessarily complex.
>
> In the current code base, make_spte() can enforce the mitigation if TDP
> is enabled or the MMU is indirect, but make_spte() may not always have a
> vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
> promotion when disabling dirty logging.
>
> Without a vCPU/MMU, KVM could either pass in the correct information
> and/or derive it from the shadow page, but the former is ugly and the
> latter subtly non-trivial due to the possibility of direct shadow pages
> in indirect MMUs. Given that using shadow paging with an unpaged guest
> is far from top priority _and_ has been subjected to the workaround since
> its inception, keep it simple and just fix the accounting glitch.
>
> Signed-off-by: Sean Christopherson <[email protected]>

It's odd that KVM enforced NX Huge Pages but just skipped the accounting.
In retrospect, that was bound to cause some issue.

Aside from the comment suggestion below,

Reviewed-by: David Matlack <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/mmu/mmu_internal.h | 8 ++++++++
> arch/x86/kvm/mmu/spte.c | 11 +++++++++++
> 3 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1112e3a4cf3e..493cdf1c29ff 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3135,7 +3135,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> continue;
>
> link_shadow_page(vcpu, it.sptep, sp);
> - if (fault->is_tdp && fault->huge_page_disallowed)
> + if (fault->huge_page_disallowed)
> account_nx_huge_page(vcpu->kvm, sp,
> fault->req_level >= it.level);
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ff4ca54b9dda..83644a0167ab 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -201,6 +201,14 @@ struct kvm_page_fault {
>
> /* Derived from mmu and global state. */
> const bool is_tdp;
> +
> + /*
> + * Note, enforcing the NX huge page mitigation for nonpaging MMUs
> + * (shadow paging, CR0.PG=0 in the guest) is completely unnecessary.
> + * The guest doesn't have any page tables to abuse and is guaranteed
> + * to switch to a different MMU when CR0.PG is toggled on (may not
> + * always be guaranteed when KVM is using TDP). See also make_spte().
> + */
> const bool nx_huge_page_workaround_enabled;
>
> /*
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 7314d27d57a4..9f3e5af088a5 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -147,6 +147,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> if (!prefetch)
> spte |= spte_shadow_accessed_mask(spte);
>
> + /*
> + * For simplicity, enforce the NX huge page mitigation even if not
> + * strictly necessary. KVM could ignore if the mitigation if paging is
> + * disabled in the guest, but KVM would then have to ensure a new MMU
> + * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
> + * and that's a net negative for performance when TDP is enabled. KVM
> + * could ignore the mitigation if TDP is disabled and CR0.PG=0, as KVM
> + * will always switch to a new MMU if paging is enabled in the guest,
> + * but that adds complexity just to optimize a mode that is anything
> + * but performance critical.
> + */

I had some trouble parsing the last sentence. How about this for slightly
better flow:

/*
* For simplicity, enforce the NX huge page mitigation even if not
* strictly necessary. KVM could ignore if the mitigation if paging is
* disabled in the guest, but KVM would then have to ensure a new MMU
* is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
* and that's a net negative for performance when TDP is enabled. When
* TDP is disabled, KVM will always switch to a new MMU when CR0.PG is
* toggled, but that would tie make_spte() further to vCPU/MMU state
* and add complexity just to optimize a mode that is anything but
* performance critical.
*/

> if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
> is_nx_huge_page_enabled(vcpu->kvm)) {
> pte_access &= ~ACC_EXEC_MASK;
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:31:17

by David Matlack

Subject: Re: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

On Sat, Jul 23, 2022 at 01:23:20AM +0000, Sean Christopherson wrote:
> Tag shadow pages that cannot be replaced with an NX huge page even if
> zapping the page would not allow KVM to create a huge page, e.g. because
> something else prevents creating a huge page.

This sentence looks messed up :). Should it read:

Tag shadow pages that cannot be replaced with an NX huge page, e.g.
because something else prevents creating a huge page.

?

> This will allow a future
> patch to more precisely apply the mitigation by checking if an existing
> shadow page can be replaced by a NX huge page. Currently, KVM assumes
> that any existing shadow page encountered cannot be replaced by a NX huge
> page (if the mitigation is enabled), which prevents KVM from replacing
> no-longer-necessary shadow pages with huge pages, e.g. after disabling
> dirty logging, zapping from the mmu_notifier due to page migration,
> etc...
>
> Failure to tag shadow pages appropriately could theoretically lead to
> false negatives, e.g. if a fetch fault requests a small page and thus
> isn't tracked, and a read/write fault later requests a huge page, KVM
> will not reject the huge page as it should.
>
> To avoid yet another flag, initialize the list_head and use list_empty()
> to determine whether or not a page is on the list of NX huge pages that
> should be recovered.
>
> Opportunistically rename most of the variables/functions involved to
> provide consistency, e.g. lpage vs huge page and NX huge vs huge NX, and
> clarity, e.g. to make it obvious the flag applies only to the NX huge
> page mitigation, not to any condition that prevents creating a huge page.
>
> Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 +--
> arch/x86/kvm/mmu/mmu.c | 75 ++++++++++++++++++++++-----------
> arch/x86/kvm/mmu/mmu_internal.h | 22 ++++++++--
> arch/x86/kvm/mmu/paging_tmpl.h | 6 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 8 ++--
> 5 files changed, 79 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e8281d64a431..246b69262b93 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1143,7 +1143,7 @@ struct kvm_arch {
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> struct list_head zapped_obsolete_pages;
> - struct list_head lpage_disallowed_mmu_pages;
> + struct list_head possible_nx_huge_pages;
> struct kvm_page_track_notifier_node mmu_sp_tracker;
> struct kvm_page_track_notifier_head track_notifier_head;
> /*
> @@ -1304,8 +1304,8 @@ struct kvm_arch {
> * - tdp_mmu_roots (above)
> * - tdp_mmu_pages (above)
> * - the link field of struct kvm_mmu_pages used by the TDP MMU
> - * - lpage_disallowed_mmu_pages
> - * - the lpage_disallowed_link field of struct kvm_mmu_pages used
> + * - possible_nx_huge_pages;
> + * - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
> * by the TDP MMU
> * It is acceptable, but not necessary, to acquire this lock when
> * the thread holds the MMU lock in write mode.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8e477333a263..1112e3a4cf3e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -802,15 +802,43 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> }
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static void untrack_possible_nx_huge_page(struct kvm *kvm,
> + struct kvm_mmu_page *sp)
> {
> - if (sp->lpage_disallowed)
> + if (list_empty(&sp->possible_nx_huge_page_link))
> + return;
> +
> + --kvm->stat.nx_lpage_splits;
> + list_del_init(&sp->possible_nx_huge_page_link);
> +}
> +
> +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> + sp->nx_huge_page_disallowed = false;
> +
> + untrack_possible_nx_huge_page(kvm, sp);
> +}
> +
> +static void track_possible_nx_huge_page(struct kvm *kvm,
> + struct kvm_mmu_page *sp)
> +{
> + if (!list_empty(&sp->possible_nx_huge_page_link))
> return;
>
> ++kvm->stat.nx_lpage_splits;
> - list_add_tail(&sp->lpage_disallowed_link,
> - &kvm->arch.lpage_disallowed_mmu_pages);
> - sp->lpage_disallowed = true;
> + list_add_tail(&sp->possible_nx_huge_page_link,
> + &kvm->arch.possible_nx_huge_pages);
> +}
> +
> +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + bool nx_huge_page_possible)
> +{
> + sp->nx_huge_page_disallowed = true;
> +
> + if (!nx_huge_page_possible)
> + untrack_possible_nx_huge_page(kvm, sp);

What would be a scenario where calling untrack_possible_nx_huge_page()
is actually necessary here?

> + else
> + track_possible_nx_huge_page(kvm, sp);
> }
>
> static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> @@ -830,13 +858,6 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> kvm_mmu_gfn_allow_lpage(slot, gfn);
> }
>
> -void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> -{
> - --kvm->stat.nx_lpage_splits;
> - sp->lpage_disallowed = false;
> - list_del(&sp->lpage_disallowed_link);
> -}
> -
> static struct kvm_memory_slot *
> gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
> bool no_dirty_log)
> @@ -2115,6 +2136,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> + INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
> +
> /*
> * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> * depends on valid pages being added to the head of the list. See
> @@ -2472,8 +2495,8 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
> zapped_root = !is_obsolete_sp(kvm, sp);
> }
>
> - if (sp->lpage_disallowed)
> - unaccount_huge_nx_page(kvm, sp);
> + if (sp->nx_huge_page_disallowed)
> + unaccount_nx_huge_page(kvm, sp);
>
> sp->role.invalid = 1;
>
> @@ -3112,9 +3135,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> continue;
>
> link_shadow_page(vcpu, it.sptep, sp);
> - if (fault->is_tdp && fault->huge_page_disallowed &&
> - fault->req_level >= it.level)
> - account_huge_nx_page(vcpu->kvm, sp);
> + if (fault->is_tdp && fault->huge_page_disallowed)
> + account_nx_huge_page(vcpu->kvm, sp,
> + fault->req_level >= it.level);
> }
>
> if (WARN_ON_ONCE(it.level != fault->goal_level))
> @@ -5970,7 +5993,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>
> INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
> - INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
> + INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>
> r = kvm_mmu_init_tdp_mmu(kvm);
> @@ -6845,23 +6868,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)

Can you rename this to kvm_recover_nx_huge_pages() while you're here?

> ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
> for ( ; to_zap; --to_zap) {
> - if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
> + if (list_empty(&kvm->arch.possible_nx_huge_pages))
> break;
>
> /*
> * We use a separate list instead of just using active_mmu_pages
> - * because the number of lpage_disallowed pages is expected to
> - * be relatively small compared to the total.
> + * because the number of shadow pages that can be replaced with an
> + * NX huge page is expected to be relatively small compared to
> + * the total number of shadow pages. And because the TDP MMU
> + * doesn't use active_mmu_pages.
> */
> - sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
> + sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
> struct kvm_mmu_page,
> - lpage_disallowed_link);
> - WARN_ON_ONCE(!sp->lpage_disallowed);
> + possible_nx_huge_page_link);
> + WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
> if (is_tdp_mmu_page(sp)) {
> flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
> } else {
> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> - WARN_ON_ONCE(sp->lpage_disallowed);
> + WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> }
>
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 582def531d4d..ff4ca54b9dda 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -57,7 +57,13 @@ struct kvm_mmu_page {
> bool tdp_mmu_page;
> bool unsync;
> u8 mmu_valid_gen;
> - bool lpage_disallowed; /* Can't be replaced by an equiv large page */
> +
> + /*
> + * The shadow page can't be replaced by an equivalent huge page
> + * because it is being used to map an executable page in the guest
> + * and the NX huge page mitigation is enabled.
> + */
> + bool nx_huge_page_disallowed;
>
> /*
> * The following two entries are used to key the shadow page in the
> @@ -100,7 +106,14 @@ struct kvm_mmu_page {
> };
> };
>
> - struct list_head lpage_disallowed_link;
> + /*
> + * Use to track shadow pages that, if zapped, would allow KVM to create
> + * an NX huge page. A shadow page will have nx_huge_page_disallowed
> + * set but not be on the list if a huge page is disallowed for other
> + * reasons, e.g. because KVM is shadowing a PTE at the same gfn, the
> + * memslot isn't properly aligned, etc...
> + */
> + struct list_head possible_nx_huge_page_link;
> #ifdef CONFIG_X86_32
> /*
> * Used out of the mmu-lock to avoid reading spte values while an
> @@ -315,7 +328,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
> void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> -void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> + bool nx_huge_page_possible);
> +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index f5958071220c..259c0f019f09 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -713,9 +713,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> continue;
>
> link_shadow_page(vcpu, it.sptep, sp);
> - if (fault->huge_page_disallowed &&
> - fault->req_level >= it.level)
> - account_huge_nx_page(vcpu->kvm, sp);
> + if (fault->huge_page_disallowed)
> + account_nx_huge_page(vcpu->kvm, sp,
> + fault->req_level >= it.level);
> }
>
> if (WARN_ON_ONCE(it.level != fault->goal_level))
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 40ccb5fba870..a30983947fee 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -284,6 +284,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
> gfn_t gfn, union kvm_mmu_page_role role)
> {
> + INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
> +
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> sp->role = role;
> @@ -390,8 +392,8 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> list_del(&sp->link);
> - if (sp->lpage_disallowed)
> - unaccount_huge_nx_page(kvm, sp);
> + if (sp->nx_huge_page_disallowed)
> + unaccount_nx_huge_page(kvm, sp);
>
> if (shared)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1134,7 +1136,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> if (account_nx)
> - account_huge_nx_page(kvm, sp);
> + account_nx_huge_page(kvm, sp, true);


account_nx is fault->huge_page_disallowed && fault->req_level >=
iter.level. So this is equivalent to:

if (fault->huge_page_disallowed && fault->req_level >= iter.level)
account_nx_huge_page(kvm, sp, true);

Whereas __direct_map() uses:

if (fault->is_tdp && fault->huge_page_disallowed)
account_nx_huge_page(vcpu->kvm, sp, fault->req_level >= it.level);

Aside from is_tdp (which you cover in another patch), why is there a
discrepancy in the NX Huge Page accounting?

> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> return 0;
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:32:15

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

On Sat, Jul 23, 2022 at 01:23:24AM +0000, Sean Christopherson wrote:
> Add a helper to convert a SPTE to its shadow page to deduplicate a
> variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
> get the shadow page for a SPTE without dropping high bits.
>
> Opportunistically add a comment in mmu_free_root_page() documenting why
> it treats the root HPA as a SPTE.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
[...]
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -207,6 +207,23 @@ static inline int spte_index(u64 *sptep)
> */
> extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
>
> +static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> +{
> + struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
> +
> + return (struct kvm_mmu_page *)page_private(page);
> +}
> +
> +static inline struct kvm_mmu_page *spte_to_sp(u64 spte)
> +{
> + return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
> +}

spte_to_sp() and sptep_to_sp() are a bit hard to differentiate visually.

Maybe spte_to_child_sp() or to_child_sp()?

> +
> +static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
> +{
> + return to_shadow_page(__pa(sptep));
> +}
> +
> static inline bool is_mmio_spte(u64 spte)
> {
> return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index c163f7cc23ca..d3714200b932 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -5,6 +5,8 @@
>
> #include <linux/kvm_host.h>
>
> +#include "spte.h"
> +
> hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>
> __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:33:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

On Mon, Jul 25, 2022, David Matlack wrote:
> On Sat, Jul 23, 2022 at 01:23:20AM +0000, Sean Christopherson wrote:
> > Tag shadow pages that cannot be replaced with an NX huge page even if
> > zapping the page would not allow KVM to create a huge page, e.g. because
> > something else prevents creating a huge page.
>
> This sentence looks messed up :). Should it read:
>
> Tag shadow pages that cannot be replaced with an NX huge page, e.g.
> because something else prevents creating a huge page.
>
> ?

Hmm, not quite. Does this read better?

Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

What I'm trying to call out is that, today, KVM tracks pages that were disallowed
from being huge due to the NX workaround if and only if the page could otherwise
be huge. After this patch, KVM will track pages that were disallowed regardless
of whether or not they could have been huge at the time of fault.

> > +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > + bool nx_huge_page_possible)
> > +{
> > + sp->nx_huge_page_disallowed = true;
> > +
> > + if (!nx_huge_page_possible)
> > + untrack_possible_nx_huge_page(kvm, sp);
>
> What would be a scenario where calling untrack_possible_nx_huge_page()
> is actually necessary here?

The only scenario that jumps to mind is the non-coherent DMA with funky MTRRs
case. There might be others, but it's been a while since I wrote this...

The MTRRs are per-vCPU (KVM really should just track them as per-VM, but whatever),
so it's possible that KVM could encounter a fault with a lower fault->req_level
than a previous fault that set nx_huge_page_disallowed=true (and added the page
to the possible_nx_huge_pages list because it had a higher req_level).

> > @@ -5970,7 +5993,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> >
> > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> > INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
> > - INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
> > + INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> > spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> >
> > r = kvm_mmu_init_tdp_mmu(kvm);
> > @@ -6845,23 +6868,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>
> Can you rename this to kvm_recover_nx_huge_pages() while you're here?

Will do.

> > @@ -1134,7 +1136,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> > if (account_nx)
> > - account_huge_nx_page(kvm, sp);
> > + account_nx_huge_page(kvm, sp, true);
>
>
> account_nx is fault->huge_page_disallowed && fault->req_level >=
> iter.level. So this is equivalent to:
>
> if (fault->huge_page_disallowed && fault->req_level >= iter.level)
> account_nx_huge_page(kvm, sp, true);
>
> Whereas __direct_map() uses:
>
> if (fault->is_tdp && fault->huge_page_disallowed)
> account_nx_huge_page(vcpu->kvm, sp, fault->req_level >= it.level);
>
> Aside from is_tdp (which you cover in another patch), why is there a
> discrepancy in the NX Huge Page accounting?

That wart gets fixed in patch 3. Fixing the TDP MMU requires more work due to
mmu_lock being held for read and so I wanted to separate it out. And as a minor
detail, the Fixes: from this patch predates the TDP MMU, so in a way it's kinda
sorta a different bug.

> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > return 0;
> > --
> > 2.37.1.359.gd136c6c3e2-goog
> >

2022-07-25 23:36:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

On Mon, Jul 25, 2022, David Matlack wrote:
> On Sat, Jul 23, 2022 at 01:23:24AM +0000, Sean Christopherson wrote:
> > Add a helper to convert a SPTE to its shadow page to deduplicate a
> > variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
> > get the shadow page for a SPTE without dropping high bits.
> >
> > Opportunistically add a comment in mmu_free_root_page() documenting why
> > it treats the root HPA as a SPTE.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> [...]
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -207,6 +207,23 @@ static inline int spte_index(u64 *sptep)
> > */
> > extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> >
> > +static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> > +{
> > + struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
> > +
> > + return (struct kvm_mmu_page *)page_private(page);
> > +}
> > +
> > +static inline struct kvm_mmu_page *spte_to_sp(u64 spte)
> > +{
> > + return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
> > +}
>
> spte_to_sp() and sptep_to_sp() are a bit hard to differentiate visually.

Yeah, I balked a bit when making the change, but couldn't come up with a better
alternative.

> Maybe spte_to_child_sp() or to_child_sp()?

I like to_child_sp(). Apparently I have a mental block when it comes to parent
vs. child pages and never realized that sptep_to_sp() gets the "parent" but
spte_to_sp() gets the "child". That indeed makes spte_to_sp() a bad name.
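
For reference, a minimal sketch of the renamed helper, assuming it keeps the
body of spte_to_sp() quoted above (the final name and placement are up to the
next version of the series):

static inline struct kvm_mmu_page *to_child_sp(u64 spte)
{
	/* The SPTE's physical address bits point at the child's page table. */
	return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
}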

2022-07-25 23:39:53

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

On Sat, Jul 23, 2022 at 01:23:23AM +0000, Sean Christopherson wrote:
> Track the number of TDP MMU "shadow" pages instead of tracking the pages
> themselves. With the NX huge page list manipulation moved out of the common
> linking flow, eliminating the list-based tracking means the happy path of
> adding a shadow page doesn't need to acquire a spinlock and can instead
> inc/dec an atomic.
>
> Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
> pages is very, very useful for detecting KVM bugs.
>
> Tracking the number of pages will also make it trivial to expose the
> counter to userspace as a stat in the future, which may or may not be
> desirable.
>
> Note, the TDP MMU needs to use a separate counter (and stat if that ever
> comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
> supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
> TDP MMU consumes so few pages relative to shadow paging), and including TDP
> MMU pages in that counter would break both the shrinker and shadow MMUs,
> e.g. if a VM is using nested TDP.
>
> Cc: Yosry Ahmed <[email protected]>
> Reviewed-by: Mingwei Zhang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: David Matlack <[email protected]>

> ---
> arch/x86/include/asm/kvm_host.h | 11 +++--------
> arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++----------
> 2 files changed, 12 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 246b69262b93..5c269b2556d6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1271,6 +1271,9 @@ struct kvm_arch {
> */
> bool tdp_mmu_enabled;
>
> + /* The number of TDP MMU pages across all roots. */
> + atomic64_t tdp_mmu_pages;

This is the number of non-root TDP MMU pages, right?

> +
> /*
> * List of struct kvm_mmu_pages being used as roots.
> * All struct kvm_mmu_pages in the list should have
> @@ -1291,18 +1294,10 @@ struct kvm_arch {
> */
> struct list_head tdp_mmu_roots;
>
> - /*
> - * List of struct kvmp_mmu_pages not being used as roots.
> - * All struct kvm_mmu_pages in the list should have
> - * tdp_mmu_page set and a tdp_mmu_root_count of 0.
> - */
> - struct list_head tdp_mmu_pages;
> -
> /*
> * Protects accesses to the following fields when the MMU lock
> * is held in read mode:
> * - tdp_mmu_roots (above)
> - * - tdp_mmu_pages (above)
> * - the link field of struct kvm_mmu_pages used by the TDP MMU
> * - possible_nx_huge_pages;
> * - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 626c40ec2af9..fea22dc481a0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -29,7 +29,6 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> kvm->arch.tdp_mmu_enabled = true;
> INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
> - INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
> kvm->arch.tdp_mmu_zap_wq = wq;
> return 1;
> }
> @@ -54,7 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> /* Also waits for any queued work items. */
> destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
>
> - WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
> + WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
>
> /*
> @@ -386,16 +385,18 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> bool shared)
> {
> + atomic64_dec(&kvm->arch.tdp_mmu_pages);
> +
> + if (!sp->nx_huge_page_disallowed)
> + return;
> +
> if (shared)
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> else
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> - list_del(&sp->link);
> - if (sp->nx_huge_page_disallowed) {
> - sp->nx_huge_page_disallowed = false;
> - untrack_possible_nx_huge_page(kvm, sp);
> - }
> + sp->nx_huge_page_disallowed = false;
> + untrack_possible_nx_huge_page(kvm, sp);
>
> if (shared)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1132,9 +1133,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> tdp_mmu_set_spte(kvm, iter, spte);
> }
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + atomic64_inc(&kvm->arch.tdp_mmu_pages);
>
> return 0;
> }
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:52:32

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v2 6/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()

On Sat, Jul 23, 2022 at 01:23:25AM +0000, Sean Christopherson wrote:
> From: Mingwei Zhang <[email protected]>
>
> Explicitly check if a NX huge page is disallowed when determining if a page
> fault needs to be forced to use a smaller sized page. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a shadow page instead of a huge page. Any scenario that causes
> KVM to zap leaf SPTEs may result in having a SP that can be made huge
> without violating the NX huge page mitigation. E.g. disabling of dirty
> logging, zapping from mmu_notifier due to page migration, guest MTRR
> changes that affect the viability of a huge page, etc...
>
> Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
>
> Reviewed-by: Ben Gardon <[email protected]>
> Signed-off-by: Mingwei Zhang <[email protected]>
> [sean: add barrier comments, use spte_to_sp()]
> Signed-off-by: Sean Christopherson <[email protected]>

Reviewed-by: David Matlack <[email protected]>

> ---
> arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++++
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ed3cfb31853b..97980528bf4a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3092,6 +3092,19 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> cur_level == fault->goal_level &&
> is_shadow_present_pte(spte) &&
> !is_large_pte(spte)) {
> + u64 page_mask;
> +
> + /*
> + * Ensure nx_huge_page_disallowed is read after checking for a
> + * present shadow page. A different vCPU may be concurrently
> + * installing the shadow page if mmu_lock is held for read.
> + * Pairs with the smp_wmb() in kvm_tdp_mmu_map().
> + */
> + smp_rmb();
> +
> + if (!spte_to_sp(spte)->nx_huge_page_disallowed)
> + return;
> +
> /*
> * A small SPTE exists for this pfn, but FNAME(fetch)
> * and __direct_map would like to create a large PTE
> @@ -3099,8 +3112,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> * patching back for them into pfn the next 9 bits of
> * the address.
> */
> - u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
> - KVM_PAGES_PER_HPAGE(cur_level - 1);
> + page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
> + KVM_PAGES_PER_HPAGE(cur_level - 1);
> fault->pfn |= fault->gfn & page_mask;
> fault->goal_level--;
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index fea22dc481a0..313092d4931a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1194,6 +1194,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> tdp_mmu_init_child_sp(sp, &iter);
>
> sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> + /*
> + * Ensure nx_huge_page_disallowed is visible before the
> + * SP is marked present, as mmu_lock is held for read.
> + * Pairs with the smp_rmb() in disallowed_hugepage_adjust().
> + */
> + smp_wmb();
>
> if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> tdp_mmu_free_sp(sp);
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-25 23:58:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

On Mon, Jul 25, 2022, David Matlack wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 246b69262b93..5c269b2556d6 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1271,6 +1271,9 @@ struct kvm_arch {
> > */
> > bool tdp_mmu_enabled;
> >
> > + /* The number of TDP MMU pages across all roots. */
> > + atomic64_t tdp_mmu_pages;
>
> This is the number of non-root TDP MMU pages, right?

Yes.

2022-07-26 00:00:03

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

On Mon, Jul 25, 2022 at 4:26 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Jul 25, 2022, David Matlack wrote:
> > On Sat, Jul 23, 2022 at 01:23:20AM +0000, Sean Christopherson wrote:
> > > Tag shadow pages that cannot be replaced with an NX huge page even if
> > > zapping the page would not allow KVM to create a huge page, e.g. because
> > > something else prevents creating a huge page.
> >
> > This sentence looks messed up :). Should it read:
> >
> > Tag shadow pages that cannot be replaced with an NX huge page, e.g.
> > because something else prevents creating a huge page.
> >
> > ?
>
> Hmm, not quite. Does this read better?
>
> Tag shadow pages that cannot be replaced with an NX huge page regardless
> of whether or not zapping the page would allow KVM to immediately create
> a huge page, e.g. because something else prevents creating a huge page.
>
> What I'm trying to call out is that, today, KVM tracks pages that were disallowed
> from being huge due to the NX workaround if and only if the page could otherwise
> be huge. After this patch, KVM will track pages that were disallowed regardless
> of whether or not they could have been huge at the time of fault.

Ah I see now! The way you explained it in the second paragraph here
clarified it for me, so if you can fit that into the commit message as
well that would be great.

>
> > > +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> > > + bool nx_huge_page_possible)
> > > +{
> > > + sp->nx_huge_page_disallowed = true;
> > > +
> > > + if (!nx_huge_page_possible)
> > > + untrack_possible_nx_huge_page(kvm, sp);
> >
> > What would be a scenario where calling untrack_possible_nx_huge_page()
> > is actually necessary here?
>
> The only scenario that jumps to mind is the non-coherent DMA with funky MTRRs
> case. There might be others, but it's been a while since I wrote this...
>
> The MTRRs are per-vCPU (KVM really should just track them as per-VM, but whatever),
> so it's possible that KVM could encounter a fault with a lower fault->req_level
> than a previous fault that set nx_huge_page_disallowed=true (and added the page
> to the possible_nx_huge_pages list because it had a higher req_level).

But in that case the lower level SP would already have been installed,
so we wouldn't end up calling account_nx_huge_page() and getting to
this point. (account_nx_huge_page() is only called when linking in an
SP.)

Maybe account_nx_huge_page() needs to be pulled out and called for
every SP on the walk during a fault?

>
> > > @@ -5970,7 +5993,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > >
> > > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
> > > INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
> > > - INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
> > > + INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> > > spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> > >
> > > r = kvm_mmu_init_tdp_mmu(kvm);
> > > @@ -6845,23 +6868,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> >
> > Can you rename this to kvm_recover_nx_huge_pages() while you're here?
>
> Will do.
>
> > > @@ -1134,7 +1136,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> > > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > > list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> > > if (account_nx)
> > > - account_huge_nx_page(kvm, sp);
> > > + account_nx_huge_page(kvm, sp, true);
> >
> >
> > account_nx is fault->huge_page_disallowed && fault->req_level >=
> > iter.level. So this is equivalent to:
> >
> > if (fault->huge_page_disallowed && fault->req_level >= iter.level)
> > account_nx_huge_page(kvm, sp, true);
> >
> > Whereas __direct_map() uses:
> >
> > if (fault->is_tdp && fault->huge_page_disallowed)
> > account_nx_huge_page(vcpu->kvm, sp, fault->req_level >= it.level);
> >
> > Aside from is_tdp (which you cover in another patch), why is there a
> > discrepancy in the NX Huge Page accounting?
>
> That wart gets fixed in patch 3. Fixing the TDP MMU requires more work due to
> mmu_lock being held for read and so I wanted to separate it out. And as a minor
> detail, the Fixes: from this patch predates the TDP MMU, so in a way it's kinda
> sorta a different bug.
>
> > > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > >
> > > return 0;
> > > --
> > > 2.37.1.359.gd136c6c3e2-goog
> > >

2022-07-26 00:13:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

On Mon, Jul 25, 2022, David Matlack wrote:
> On Mon, Jul 25, 2022 at 4:26 PM Sean Christopherson <[email protected]> wrote:
> > The only scenario that jumps to mind is the non-coherent DMA with funky MTRRs
> > case. There might be others, but it's been a while since I wrote this...
> >
> > The MTRRs are per-vCPU (KVM really should just track them as per-VM, but whatever),
> > so it's possible that KVM could encounter a fault with a lower fault->req_level
> > than a previous fault that set nx_huge_page_disallowed=true (and added the page
> > to the possible_nx_huge_pages list because it had a higher req_level).
>
> But in that case the lower level SP would already have been installed,
> so we wouldn't end up calling account_nx_huge_page() and getting to
> this point. (account_nx_huge_page() is only called when linking in an
> SP.)

Hrm, true. I'm 99% certain past me was just maintaining the existing logic in
account_huge_nx_page()

if (sp->lpage_disallowed)
return;

Best thing might be to turn that into a WARN as the first patch?
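
For reference, the suggested change would look something like this (a
hypothetical sketch, not a posted patch):

	/*
	 * account_huge_nx_page() is only reached when linking a brand new
	 * shadow page, so the page can't already be tagged as disallowed.
	 */
	if (WARN_ON_ONCE(sp->lpage_disallowed))
		return;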

> Maybe account_nx_huge_page() needs to be pulled out and called for
> every SP on the walk during a fault?

Eh, not worth it, the MTRR thing is bogus anyways, e.g. if vCPUs have different
MTRR settings and one vCPU allows a huge page but the other does not, KVM
may or may not install a huge page depending on which vCPU faults in the page.

2022-07-26 05:40:17

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Sat, Jul 23, 2022, Sean Christopherson wrote:
> Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a non-leaf page instead of a huge page. Precisely track
> (via kvm_mmu_page) if a non-huge page is being forced and use that info
> to avoid unnecessarily forcing smaller page sizes in
> disallowed_hugepage_adjust().
>
> v2: Rebase, tweak a changelog accordingly.

hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
ept=0. And it did not work on our internal kernel either (kernel
crashed).

Maybe there are still minor issues?

>
> v1: https://lore.kernel.org/all/[email protected]
>
> Mingwei Zhang (1):
> KVM: x86/mmu: explicitly check nx_hugepage in
> disallowed_hugepage_adjust()
>
> Sean Christopherson (5):
> KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> MMUs
> KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> SPTE
> KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> pages
> KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
>
> arch/x86/include/asm/kvm_host.h | 17 ++---
> arch/x86/kvm/mmu/mmu.c | 107 ++++++++++++++++++++++----------
> arch/x86/kvm/mmu/mmu_internal.h | 41 +++++++-----
> arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
> arch/x86/kvm/mmu/spte.c | 11 ++++
> arch/x86/kvm/mmu/spte.h | 17 +++++
> arch/x86/kvm/mmu/tdp_mmu.c | 49 +++++++++------
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +
> 8 files changed, 167 insertions(+), 83 deletions(-)
>
>
> base-commit: 1a4d88a361af4f2e91861d632c6a1fe87a9665c2
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-26 16:45:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Tue, Jul 26, 2022, Mingwei Zhang wrote:
> On Sat, Jul 23, 2022, Sean Christopherson wrote:
> > Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> > assumes that the NX huge page mitigation is the only scenario where KVM
> > will create a non-leaf page instead of a huge page. Precisely track
> > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > to avoid unnecessarily forcing smaller page sizes in
> > disallowed_hugepage_adjust().
> >
> > v2: Rebase, tweak a changelog accordingly.
>
> hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
> 1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
> ept=0. And it did not work on our internal kernel either (kernel
> crashed).
>
> Maybe there are still minor issues?

Heh, or not so minor issues. I'll see what I broke. I have a bad feeling that
it's the EPT tests; IIRC I only ran VMX on a platform with MAXPHYADDR < 40.

2022-07-26 17:39:22

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Tue, Jul 26, 2022, Sean Christopherson wrote:
> On Tue, Jul 26, 2022, Mingwei Zhang wrote:
> > On Sat, Jul 23, 2022, Sean Christopherson wrote:
> > > Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> > > assumes that the NX huge page mitigation is the only scenario where KVM
> > > will create a non-leaf page instead of a huge page. Precisely track
> > > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > > to avoid unnecessarily forcing smaller page sizes in
> > > disallowed_hugepage_adjust().
> > >
> > > v2: Rebase, tweak a changelog accordingly.
> >
> > hmm, I applied this patch set (v2) on top of kvm/queue (HEAD:
> > 1a4d88a361af) and it seems kvm-unit-tests/vmx failed on both ept=1 and
> > ept=0. And it did not work on our internal kernel either (kernel
> > crashed).
> >
> > Maybe there are still minor issues?
>
> Heh, or not so minor issues. I'll see what I broke. I have a bad feeling that
> it's the EPT tests; IIRC I only ran VMX on a platform with MAXPHYADDR < 40.

Hrm, not seeing failures (beyond the VMX_VMCS_ENUM.MAX_INDEX failure because I'm
running an older QEMU).

I'll follow up off-list to figure out what's going on.

2022-07-27 03:49:48

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

On Sat, Jul 23, 2022 at 01:23:23AM +0000, Sean Christopherson wrote:

<snip>

> @@ -386,16 +385,18 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> bool shared)
> {
> + atomic64_dec(&kvm->arch.tdp_mmu_pages);
> +
> + if (!sp->nx_huge_page_disallowed)
> + return;
> +
Does this read of sp->nx_huge_page_disallowed also need to be protected by
tdp_mmu_pages_lock in the shared path?

Thanks
Yan

> if (shared)
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> else
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> - list_del(&sp->link);
> - if (sp->nx_huge_page_disallowed) {
> - sp->nx_huge_page_disallowed = false;
> - untrack_possible_nx_huge_page(kvm, sp);
> - }
> + sp->nx_huge_page_disallowed = false;
> + untrack_possible_nx_huge_page(kvm, sp);
>
> if (shared)
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1132,9 +1133,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> tdp_mmu_set_spte(kvm, iter, spte);
> }
>
> - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> + atomic64_inc(&kvm->arch.tdp_mmu_pages);
>
> return 0;
> }
> --
> 2.37.1.359.gd136c6c3e2-goog
>

2022-07-27 20:02:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

On Wed, Jul 27, 2022, Yan Zhao wrote:
> On Sat, Jul 23, 2022 at 01:23:23AM +0000, Sean Christopherson wrote:
>
> <snip>
>
> > @@ -386,16 +385,18 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> > static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> > bool shared)
> > {
> > + atomic64_dec(&kvm->arch.tdp_mmu_pages);
> > +
> > + if (!sp->nx_huge_page_disallowed)
> > + return;
> > +
> Does this read of sp->nx_huge_page_disallowed also need to be protected by
> tdp_mmu_pages_lock in the shared path?


No, because only one CPU can call tdp_mmu_unlink_sp() for a shadow page. E.g. in
a shared walk, the SPTE is zapped atomically and only the CPU that "wins" gets to
unlink the sp. The extra lock is needed to prevent list corruption, but the
sp itself is thread safe.

FWIW, even if that guarantee didn't hold, checking the flag outside of tdp_mmu_pages_lock
is safe because false positives are ok. untrack_possible_nx_huge_page() checks that
the shadow page is actually on the list, i.e. it's a nop if a different task unlinks
the page first.

False negatives need to be avoided, but nx_huge_page_disallowed is cleared only
when untrack_possible_nx_huge_page() is guaranteed to be called, i.e. true false
negatives can't occur.

Hmm, but I think there's a missing smp_rmb(), which is needed to ensure
nx_huge_page_disallowed is read after observing the shadow-present SPTE (that's
being unlinked). I'll add that in the next version.
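
For reference, a rough sketch of where that barrier could land (the placement
is hypothetical and the rest of the function is elided, not quoted from the
series):

static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
			      bool shared)
{
	atomic64_dec(&kvm->arch.tdp_mmu_pages);

	/*
	 * Pairs with the smp_wmb() in kvm_tdp_mmu_map(); ensures the flag is
	 * read only after the shadow-present SPTE being unlinked is observed.
	 */
	smp_rmb();

	if (!sp->nx_huge_page_disallowed)
		return;
	...
}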

2022-07-28 20:19:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs

On 7/23/22 03:23, Sean Christopherson wrote:
> +
> + /*
> + * Note, enforcing the NX huge page mitigation for nonpaging MMUs
> + * (shadow paging, CR0.PG=0 in the guest) is completely unnecessary.
> + * The guest doesn't have any page tables to abuse and is guaranteed
> + * to switch to a different MMU when CR0.PG is toggled on (may not
> + * always be guaranteed when KVM is using TDP). See also make_spte().
> + */
> const bool nx_huge_page_workaround_enabled;
>
> /*
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 7314d27d57a4..9f3e5af088a5 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -147,6 +147,17 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> if (!prefetch)
> spte |= spte_shadow_accessed_mask(spte);
>
> + /*
> + * For simplicity, enforce the NX huge page mitigation even if not
> + * strictly necessary. KVM could ignore the mitigation if paging is
> + * disabled in the guest, but KVM would then have to ensure a new MMU
> + * is loaded (or all shadow pages zapped) when CR0.PG is toggled on,
> + * and that's a net negative for performance when TDP is enabled. KVM
> + * could ignore the mitigation if TDP is disabled and CR0.PG=0, as KVM
> + * will always switch to a new MMU if paging is enabled in the guest,
> + * but that adds complexity just to optimize a mode that is anything
> + * but performance critical.

Why even separate the two comments? I think they both belong in
make_spte().

Paolo

2022-07-28 20:31:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On 7/23/22 03:23, Sean Christopherson wrote:
> Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a non-leaf page instead of a huge page. Precisely track
> (via kvm_mmu_page) if a non-huge page is being forced and use that info
> to avoid unnecessarily forcing smaller page sizes in
> disallowed_hugepage_adjust().
>
> v2: Rebase, tweak a changelog accordingly.
>
> v1:https://lore.kernel.org/all/[email protected]
>
> Mingwei Zhang (1):
> KVM: x86/mmu: explicitly check nx_hugepage in
> disallowed_hugepage_adjust()
>
> Sean Christopherson (5):
> KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> MMUs
> KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> SPTE
> KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> pages
> KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

Some of the benefits are cool, such as not having to track the pages for
the TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot
of new non-obvious complexity here.

So the obligatory question is: is it worth a hundred lines of new code?

Paolo

2022-07-28 21:25:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Thu, Jul 28, 2022, Paolo Bonzini wrote:
> On 7/23/22 03:23, Sean Christopherson wrote:
> > Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> > assumes that the NX huge page mitigation is the only scenario where KVM
> > will create a non-leaf page instead of a huge page. Precisely track
> > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > to avoid unnecessarily forcing smaller page sizes in
> > disallowed_hugepage_adjust().
> >
> > v2: Rebase, tweak a changelog accordingly.
> >
> > v1:https://lore.kernel.org/all/[email protected]
> >
> > Mingwei Zhang (1):
> > KVM: x86/mmu: explicitly check nx_hugepage in
> > disallowed_hugepage_adjust()
> >
> > Sean Christopherson (5):
> > KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> > KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> > MMUs
> > KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> > SPTE
> > KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> > pages
> > KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
>
> Some of the benefits are cool, such as not having to track the pages for the
> TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot of new
> non-obvious complexity here.

100% agree on the complexity.

> So the obligatory question is: is it worth a hundred lines of new code?

Assuming I understand the bug Mingwei's patch fixes, yes. Though after
re-reading that changelog, it should more explicitly call out the scenario we
actually care about.

Anyways, the bug we really care about is that by not precisely checking if a
huge page is disallowed, KVM would refuse to create huge page after disabling
dirty logging, which is a very noticeable performance issue for large VMs if
a migration is canceled. That particular bug has since been unintentionally
fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
to a host page migration (mmu_notifier invalidation) to create a huge page would
yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.

2022-07-28 22:12:00

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Thu, Jul 28, 2022 at 2:20 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Jul 28, 2022, Paolo Bonzini wrote:
> > On 7/23/22 03:23, Sean Christopherson wrote:
> > > Patch 6 from Mingwei is the end goal of the series. KVM incorrectly
> > > assumes that the NX huge page mitigation is the only scenario where KVM
> > > will create a non-leaf page instead of a huge page. Precisely track
> > > (via kvm_mmu_page) if a non-huge page is being forced and use that info
> > > to avoid unnecessarily forcing smaller page sizes in
> > > disallowed_hugepage_adjust().
> > >
> > > v2: Rebase, tweak a changelog accordingly.
> > >
> > > v1:https://lore.kernel.org/all/[email protected]
> > >
> > > Mingwei Zhang (1):
> > > KVM: x86/mmu: explicitly check nx_hugepage in
> > > disallowed_hugepage_adjust()
> > >
> > > Sean Christopherson (5):
> > > KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
> > > KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
> > > MMUs
> > > KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
> > > SPTE
> > > KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
> > > pages
> > > KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
> >
> > Some of the benefits are cool, such as not having to track the pages for the
> > TDP MMU, and patch 2 is a borderline bugfix, but there's quite a lot of new
> > non-obvious complexity here.
>
> 100% agree on the complexity.
>
> > So the obligatory question is: is it worth a hundred lines of new code?
>
> Assuming I understand the bug Mingwei's patch fixes, yes. Though after
> re-reading that changelog, it should more explicitly call out the scenario we
> actually care about.
>
> Anyways, the bug we really care about is that by not precisely checking if a
> huge page is disallowed, KVM would refuse to create huge page after disabling
> dirty logging, which is a very noticeable performance issue for large VMs if
> a migration is canceled. That particular bug has since been unintentionally
> fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> to a host page migration (mmu_notifier invalidation) to create a huge page would
> yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> a huge page is disallowed.

Just a quick update: the kernel crash has been resolved. It turns out
to be a bug introduced when I rebase the patch.

I see the patch set is working now.

2022-07-28 22:13:31

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On 7/28/22 23:20, Sean Christopherson wrote:
>
> Anyways, the bug we really care about is that by not precisely checking if a
> huge page is disallowed, KVM would refuse to create huge page after disabling
> dirty logging, which is a very noticeable performance issue for large VMs if
> a migration is canceled. That particular bug has since been unintentionally
> fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> to a host page migration (mmu_notifier invalidation) to create a huge page would
> yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> a huge page is disallowed.

Ok, thanks. So this will be 5.21 material even during the -rc phase; I
have posted a couple comments for patch 1 and 2.

One way to simplify the rmb/wmb logic could be to place the rmb/wmb
respectively after loading iter.old_spte and in tdp_mmu_link_sp. If you
like it, feel free to integrate it in v3.

Paolo
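
For reference, a rough sketch of that alternative placement (purely
illustrative; the exact call sites and helper names are assumptions, not
quotes from the series):

	/* Reader side, e.g. in kvm_tdp_mmu_map(): after iter.old_spte is loaded. */
	smp_rmb();	/* order nx_huge_page_disallowed reads after the SPTE load */

	/* Writer side, in tdp_mmu_link_sp(): before the new SPTE becomes visible. */
	smp_wmb();	/* publish sp->nx_huge_page_disallowed before the SPTE */
	tdp_mmu_set_spte_atomic(kvm, iter, spte);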

2022-07-28 22:20:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] KVM: x86: Apply NX mitigation more precisely

On Fri, Jul 29, 2022, Paolo Bonzini wrote:
> On 7/28/22 23:20, Sean Christopherson wrote:
> >
> > Anyways, the bug we really care about is that by not precisely checking if a
> > huge page is disallowed, KVM would refuse to create huge page after disabling
> > dirty logging, which is a very noticeable performance issue for large VMs if
> > a migration is canceled. That particular bug has since been unintentionally
> > fixed in the TDP MMU by zapping the non-leaf SPTE, but there are other paths
> > that could similarly be affected, e.g. I believe zapping leaf SPTEs in response
> > to a host page migration (mmu_notifier invalidation) to create a huge page would
> > yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume
> > a huge page is disallowed.
>
> Ok, thanks. So this will be 5.21 material even during the -rc phase

Yes, this definitely needs more time in the queue before being sent to Linus.

2022-07-28 22:21:47

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 1/6] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked

On 7/23/22 03:23, Sean Christopherson wrote:
> Tag shadow pages that cannot be replaced with an NX huge page even if
> zapping the page would not allow KVM to create a huge page, e.g. because
> something else prevents creating a huge page. This will allow a future
> patch to more precisely apply the mitigation by checking if an existing
> shadow page can be replaced by a NX huge page. Currently, KVM assumes
> that any existing shadow page encountered cannot be replaced by a NX huge
> page (if the mitigation is enabled), which prevents KVM from replacing
> no-longer-necessary shadow pages with huge pages, e.g. after disabling
> dirty logging, zapping from the mmu_notifier due to page migration,
> etc...
>
> Failure to tag shadow pages appropriately could theoretically lead to
> false negatives, e.g. if a fetch fault requests a small page and thus
> isn't tracked, and a read/write fault later requests a huge page, KVM
> will not reject the huge page as it should.
>
> To avoid yet another flag, initialize the list_head and use list_empty()
> to determine whether or not a page is on the list of NX huge pages that
> should be recovered.
>
> Opportunistically rename most of the variables/functions involved to
> provide consistency, e.g. lpage vs huge page and NX huge vs huge NX, and
> clarity, e.g. to make it obvious the flag applies only to the NX huge
> page mitigation, not to any condition that prevents creating a huge page.

Please do this in a separate patch, since this one is already complex
enough.

> * The following two entries are used to key the shadow page in the
> @@ -100,7 +106,14 @@ struct kvm_mmu_page {
> };
> };
>
> - struct list_head lpage_disallowed_link;
> + /*
> + * Use to track shadow pages that, if zapped, would allow KVM to create

s/Use/Used/

Thanks,

Paolo

2022-07-29 01:36:06

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

On Wed, Jul 27, 2022 at 07:04:35PM +0000, Sean Christopherson wrote:
> On Wed, Jul 27, 2022, Yan Zhao wrote:
> > On Sat, Jul 23, 2022 at 01:23:23AM +0000, Sean Christopherson wrote:
> >
> > <snip>
> >
> > > @@ -386,16 +385,18 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> > > static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
> > > bool shared)
> > > {
> > > + atomic64_dec(&kvm->arch.tdp_mmu_pages);
> > > +
> > > + if (!sp->nx_huge_page_disallowed)
> > > + return;
> > > +
> > Does this read of sp->nx_huge_page_disallowed also need to be protected by
> > tdp_mmu_pages_lock in the shared path?
>
>
> No, because only one CPU can call tdp_mmu_unlink_sp() for a shadow page. E.g. in
> a shared walk, the SPTE is zapped atomically and only the CPU that "wins" gets to
> unlink the sp. The extra lock is needed to prevent list corruption, but the
> sp itself is thread safe.
>
> FWIW, even if that guarantee didn't hold, checking the flag outside of tdp_mmu_pages_lock
> is safe because false positives are ok. untrack_possible_nx_huge_page() checks that
> the shadow page is actually on the list, i.e. it's a nop if a different task unlinks
> the page first.
>
> False negatives need to be avoided, but nx_huge_page_disallowed is cleared only
> when untrack_possible_nx_huge_page() is guaranteed to be called, i.e. true false
> negatives can't occur.
>
> Hmm, but I think there's a missing smp_rmb(), which is needed to ensure
> nx_huge_page_disallowed is read after observing the shadow-present SPTE (that's
> being unlinked). I'll add that in the next version.

It makes sense. Thanks for such a detailed explanation!

Thanks
Yan