Currently, disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging
with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This
series optimizes TLB flushes and introduces in-place large page
promotion to bring the time to disable dirty logging down to ~2 seconds.
Testing:
Ran KVM selftests and kvm-unit-tests on an Intel Skylake. This
series introduced no new failures.
Performance:
To collect these results I needed to apply Mingwei's patch
"selftests: KVM: align guest physical memory base address to 1GB"
https://lkml.org/lkml/2021/8/29/310
David Matlack is going to send out an updated version of that patch soon.
Without this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 10.966500447s
Enabling dirty logging time: 0.002068737s
Iteration 1 dirty memory time: 0.047556280s
Iteration 1 get dirty log time: 0.001253914s
Iteration 1 clear dirty log time: 0.049716661s
Iteration 2 dirty memory time: 3.679662016s
Iteration 2 get dirty log time: 0.000659546s
Iteration 2 clear dirty log time: 1.834329322s
Disabling dirty logging time: 45.738439510s
Get dirty log over 2 iterations took 0.001913460s. (Avg 0.000956730s/iteration)
Clear dirty log over 2 iterations took 1.884045983s. (Avg 0.942022991s/iteration)
Without this series, Legacy MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.664750666s
Enabling dirty logging time: 0.002025510s
Iteration 1 dirty memory time: 0.046240875s
Iteration 1 get dirty log time: 0.001864342s
Iteration 1 clear dirty log time: 0.170243637s
Iteration 2 dirty memory time: 31.571088701s
Iteration 2 get dirty log time: 0.000626245s
Iteration 2 clear dirty log time: 1.294817729s
Disabling dirty logging time: 3.566831573s
Get dirty log over 2 iterations took 0.002490587s. (Avg 0.001245293s/iteration)
Clear dirty log over 2 iterations took 1.465061366s. (Avg 0.732530683s/iteration)
With this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.016653537s
Enabling dirty logging time: 0.001992860s
Iteration 1 dirty memory time: 0.046701599s
Iteration 1 get dirty log time: 0.001214806s
Iteration 1 clear dirty log time: 0.049519923s
Iteration 2 dirty memory time: 3.581931268s
Iteration 2 get dirty log time: 0.000621383s
Iteration 2 clear dirty log time: 1.894597059s
Disabling dirty logging time: 1.950542092s
Get dirty log over 2 iterations took 0.001836189s. (Avg 0.000918094s/iteration)
Clear dirty log over 2 iterations took 1.944116982s. (Avg 0.972058491s/iteration)
Patch breakdown:
Patch 1 is a fix for a bug in the way the TDP MMU issues TLB flushes
Patches 2-5 eliminate many unnecessary TLB flushes through better batching
Patches 6-12 remove the need for a vCPU pointer to make_spte
Patches 13-18 are small refactors in preparation for patch 19
Patch 19 implements in-place large page promotion when disabling dirty
logging (see the conceptual sketch below)
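To give a rough idea of what patch 19 does, here is a conceptual sketch
(illustrative only: the function below and its parameters are placeholders,
not the actual patch 19 code, and race/error handling is elided):

/*
 * Sketch: when dirty logging is disabled, instead of zapping the small
 * SPTEs under a non-leaf SPTE and letting vCPUs fault the huge mapping
 * back in, build a huge leaf SPTE in place using the pieces this series
 * factors out of the vCPU-only paths. @pfn is assumed to be the
 * huge-page-aligned PFN backing the range, derived by the caller from
 * the existing child SPTEs / host mapping level.
 */
static void try_promote_in_place(struct kvm *kvm,
				 const struct kvm_memory_slot *slot,
				 struct tdp_iter *iter, kvm_pfn_t pfn,
				 struct rsvd_bits_validate *shadow_zero_check)
{
	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
	u64 mt_mask;
	u64 huge_spte;

	/* New x86 op (patch 17): get the memory type without a vCPU, or bail. */
	if (!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
						  kvm_is_mmio_pfn(pfn),
						  &mt_mask))
		return;

	/* The vCPU-free make_spte() built up by patches 6-12 and 14. */
	make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, /*old_spte=*/0,
		  /*prefetch=*/false, /*can_unsync=*/true,
		  /*host_writable=*/true, /*ad_need_write_protect=*/false,
		  mt_mask, shadow_zero_check, &huge_spte);

	/* Install the huge leaf SPTE over the non-leaf SPTE, in place. */
	tdp_mmu_set_spte_atomic(kvm, iter, huge_spte);
}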
Ben Gardon (19):
KVM: x86/mmu: Fix TLB flush range when handling disconnected pt
KVM: x86/mmu: Batch TLB flushes for a single zap
KVM: x86/mmu: Factor flush and free up when zapping under MMU write
lock
KVM: x86/mmu: Yield while processing disconnected_sps
KVM: x86/mmu: Remove redundant flushes when disabling dirty logging
KVM: x86/mmu: Introduce vcpu_make_spte
KVM: x86/mmu: Factor wrprot for nested PML out of make_spte
KVM: x86/mmu: Factor mt_mask out of make_spte
KVM: x86/mmu: Remove need for a vcpu from
kvm_slot_page_track_is_active
KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages
KVM: x86/mmu: Factor shadow_zero_check out of make_spte
KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
KVM: x86/mmu: Propagate memslot const qualifier
KVM: x86/MMU: Refactor vmx_get_mt_mask
KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
on vcpu
KVM: x86/mmu: Add try_get_mt_mask to x86_ops
KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
KVM: x86/mmu: Promote pages in-place when disabling dirty logging
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/kvm_page_track.h | 6 +-
arch/x86/kvm/mmu/mmu.c | 45 +++---
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
arch/x86/kvm/mmu/page_track.c | 8 +-
arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
arch/x86/kvm/mmu/spte.c | 43 +++--
arch/x86/kvm/mmu/spte.h | 17 +-
arch/x86/kvm/mmu/tdp_mmu.c | 217 +++++++++++++++++++++-----
arch/x86/kvm/mmu/tdp_mmu.h | 5 +-
arch/x86/kvm/svm/svm.c | 8 +
arch/x86/kvm/vmx/vmx.c | 40 +++--
include/linux/kvm_host.h | 10 +-
virt/kvm/kvm_main.c | 12 +-
15 files changed, 302 insertions(+), 124 deletions(-)
--
2.34.0.rc0.344.g81b53c2807-goog
When recursively clearing out disconnected page tables, the range-based
TLB flush in handle_removed_tdp_mmu_page uses the wrong starting GFN,
resulting in the flush mostly missing the affected range. Fix this by
using base_gfn for the flush.
Fixes: a066e61f13cf ("KVM: x86/mmu: Factor out handling of removed page tables")
CC: [email protected]
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7c5dd83e52de..866c2b191e1e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -374,7 +374,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
shared);
}
- kvm_flush_remote_tlbs_with_address(kvm, gfn,
+ kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
KVM_PAGES_PER_HPAGE(level + 1));
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
--
2.34.0.rc0.344.g81b53c2807-goog
When recursively handling a removed TDP page table, the TDP MMU will
flush the TLBs and queue an RCU callback to free the PT. If the original
change zapped a non-leaf SPTE at PG_LEVEL_1G or above, that change will
result in many unnecessary TLB flushes when one would suffice. Queue all
the PTs which need to be freed on a list and wait to queue RCU callbacks
to free them until after all the recursive calls are done.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 88 ++++++++++++++++++++++++++++++--------
1 file changed, 70 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 866c2b191e1e..5b31d046df78 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -220,7 +220,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
- bool shared);
+ bool shared,
+ struct list_head *disconnected_sps);
static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
{
@@ -302,6 +303,11 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
* @shared: This operation may not be running under the exclusive use
* of the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
+ * @disconnected_sps: If null, the TLBs will be flushed and the disconnected
+ * TDP MMU page will be queued to be freed after an RCU
+ * callback. If non-null the page will be added to the list
+ * and flushing the TLBs and queueing an RCU callback to
+ * free the page will be the caller's responsibility.
*
* Given a page table that has been removed from the TDP paging structure,
* iterates through the page table to clear SPTEs and free child page tables.
@@ -312,7 +318,8 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
* early rcu_dereferences in the function.
*/
static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
- bool shared)
+ bool shared,
+ struct list_head *disconnected_sps)
{
struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
int level = sp->role.level;
@@ -371,13 +378,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
old_child_spte, REMOVED_SPTE, level,
- shared);
+ shared, disconnected_sps);
}
- kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
- KVM_PAGES_PER_HPAGE(level + 1));
-
- call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ if (disconnected_sps) {
+ list_add_tail(&sp->link, disconnected_sps);
+ } else {
+ kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
+ KVM_PAGES_PER_HPAGE(level + 1));
+ call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ }
}
/**
@@ -391,13 +401,21 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
* @shared: This operation may not be running under the exclusive use of
* the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
+ * @disconnected_sps: Only used if a page of page table memory has been
+ * removed from the paging structure by this change.
+ * If null, the TLBs will be flushed and the disconnected
+ * TDP MMU page will be queued to be freed after an RCU
+ * callback. If non-null the page will be added to the list
+ * and flushing the TLBs and queueing an RCU callback to
+ * free the page will be the caller's responsibility.
*
* Handle bookkeeping that might result from the modification of a SPTE.
* This function must be called for all TDP SPTE modifications.
*/
static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
- bool shared)
+ bool shared,
+ struct list_head *disconnected_sps)
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
@@ -475,22 +493,39 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
*/
if (was_present && !was_leaf && (pfn_changed || !is_present))
handle_removed_tdp_mmu_page(kvm,
- spte_to_child_pt(old_spte, level), shared);
+ spte_to_child_pt(old_spte, level), shared,
+ disconnected_sps);
}
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
- bool shared)
+ bool shared, struct list_head *disconnected_sps)
{
__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
- shared);
+ shared, disconnected_sps);
handle_changed_spte_acc_track(old_spte, new_spte, level);
handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
new_spte, level);
}
/*
- * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
+ * The TLBs must be flushed between the pages linked from disconnected_sps
+ * being removed from the paging structure and this function being called.
+ */
+static void handle_disconnected_sps(struct kvm *kvm,
+ struct list_head *disconnected_sps)
+{
+ struct kvm_mmu_page *sp;
+ struct kvm_mmu_page *next;
+
+ list_for_each_entry_safe(sp, next, disconnected_sps, link) {
+ list_del(&sp->link);
+ call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
+ }
+}
+
+/*
+ * __tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
* and handle the associated bookkeeping. Do not mark the page dirty
* in KVM's dirty bitmaps.
*
@@ -500,9 +535,10 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* Returns: true if the SPTE was set, false if it was not. If false is returned,
* this function will have no side-effects.
*/
-static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter,
- u64 new_spte)
+static inline bool __tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte,
+ struct list_head *disconnected_sps)
{
lockdep_assert_held_read(&kvm->mmu_lock);
@@ -522,22 +558,32 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
return false;
__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
+ new_spte, iter->level, true, disconnected_sps);
handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
return true;
}
+static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, NULL);
+}
+
static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
+ LIST_HEAD(disconnected_sps);
+
/*
* Freeze the SPTE by setting it to a special,
* non-present value. This will stop other threads from
* immediately installing a present entry in its place
* before the TLBs are flushed.
*/
- if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
+ if (!__tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE,
+ &disconnected_sps))
return false;
kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
@@ -553,6 +599,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
*/
WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+ handle_disconnected_sps(kvm, &disconnected_sps);
+
return true;
}
@@ -577,6 +625,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte, bool record_acc_track,
bool record_dirty_log)
{
+ LIST_HEAD(disconnected_sps);
+
lockdep_assert_held_write(&kvm->mmu_lock);
/*
@@ -591,7 +641,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, false);
+ new_spte, iter->level, false, &disconnected_sps);
if (record_acc_track)
handle_changed_spte_acc_track(iter->old_spte, new_spte,
iter->level);
@@ -599,6 +649,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
iter->old_spte, new_spte,
iter->level);
+
+ handle_disconnected_sps(kvm, &disconnected_sps);
}
static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
--
2.34.0.rc0.344.g81b53c2807-goog
When zapping a GFN range under the MMU write lock, there is no need to
flush the TLBs for every zap. Instead, follow the lead of the legacy MMU
and collect disconnected SPs to be freed after a flush at the end of
the routine.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5b31d046df78..a448f0f2d993 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -623,10 +623,9 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
*/
static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte, bool record_acc_track,
- bool record_dirty_log)
+ bool record_dirty_log,
+ struct list_head *disconnected_sps)
{
- LIST_HEAD(disconnected_sps);
-
lockdep_assert_held_write(&kvm->mmu_lock);
/*
@@ -641,7 +640,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, false, &disconnected_sps);
+ new_spte, iter->level, false, disconnected_sps);
if (record_acc_track)
handle_changed_spte_acc_track(iter->old_spte, new_spte,
iter->level);
@@ -649,28 +648,32 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
iter->old_spte, new_spte,
iter->level);
+}
- handle_disconnected_sps(kvm, &disconnected_sps);
+static inline void tdp_mmu_zap_spte(struct kvm *kvm, struct tdp_iter *iter,
+ struct list_head *disconnected_sps)
+{
+ __tdp_mmu_set_spte(kvm, iter, 0, true, true, disconnected_sps);
}
static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+ __tdp_mmu_set_spte(kvm, iter, new_spte, true, true, NULL);
}
static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
+ __tdp_mmu_set_spte(kvm, iter, new_spte, false, true, NULL);
}
static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
- __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+ __tdp_mmu_set_spte(kvm, iter, new_spte, true, false, NULL);
}
#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
@@ -757,6 +760,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
bool zap_all = (start == 0 && end >= max_gfn_host);
struct tdp_iter iter;
+ LIST_HEAD(disconnected_sps);
/*
* No need to try to step down in the iterator when zapping all SPTEs,
@@ -799,7 +803,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
if (!shared) {
- tdp_mmu_set_spte(kvm, &iter, 0);
+ tdp_mmu_zap_spte(kvm, &iter, &disconnected_sps);
flush = true;
} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
/*
@@ -811,6 +815,12 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
}
}
+ if (!list_empty(&disconnected_sps)) {
+ kvm_flush_remote_tlbs(kvm);
+ handle_disconnected_sps(kvm, &disconnected_sps);
+ flush = false;
+ }
+
rcu_read_unlock();
return flush;
}
--
2.34.0.rc0.344.g81b53c2807-goog
When preparing to free disconnected SPs, the list can accumulate many
entries; enough that it is likely necessary to yield while queuing RCU
callbacks to free the SPs.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a448f0f2d993..c2a9f7acf8ef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -513,7 +513,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* being removed from the paging structure and this function being called.
*/
static void handle_disconnected_sps(struct kvm *kvm,
- struct list_head *disconnected_sps)
+ struct list_head *disconnected_sps,
+ bool can_yield, bool shared)
{
struct kvm_mmu_page *sp;
struct kvm_mmu_page *next;
@@ -521,6 +522,16 @@ static void handle_disconnected_sps(struct kvm *kvm,
list_for_each_entry_safe(sp, next, disconnected_sps, link) {
list_del(&sp->link);
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
+
+ if (can_yield &&
+ (need_resched() || rwlock_needbreak(&kvm->mmu_lock))) {
+ rcu_read_unlock();
+ if (shared)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+ rcu_read_lock();
+ }
}
}
@@ -599,7 +610,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
*/
WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
- handle_disconnected_sps(kvm, &disconnected_sps);
+ handle_disconnected_sps(kvm, &disconnected_sps, false, true);
return true;
}
@@ -817,7 +828,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (!list_empty(&disconnected_sps)) {
kvm_flush_remote_tlbs(kvm);
- handle_disconnected_sps(kvm, &disconnected_sps);
+ handle_disconnected_sps(kvm, &disconnected_sps,
+ can_yield, shared);
flush = false;
}
--
2.34.0.rc0.344.g81b53c2807-goog
tdp_mmu_zap_spte_atomic flushes on every zap already, so no need to
flush again after it's done.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 +---
arch/x86/kvm/mmu/tdp_mmu.c | 21 ++++++---------------
arch/x86/kvm/mmu/tdp_mmu.h | 5 ++---
3 files changed, 9 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 354d2ca92df4..baa94acab516 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5870,9 +5870,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
- flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
- if (flush)
- kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+ kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
read_unlock(&kvm->mmu_lock);
}
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c2a9f7acf8ef..1ece645e737f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1438,10 +1438,9 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
* Clear leaf entries which could be replaced by large mappings, for
* GFNs within the slot.
*/
-static bool zap_collapsible_spte_range(struct kvm *kvm,
+static void zap_collapsible_spte_range(struct kvm *kvm,
struct kvm_mmu_page *root,
- const struct kvm_memory_slot *slot,
- bool flush)
+ const struct kvm_memory_slot *slot)
{
gfn_t start = slot->base_gfn;
gfn_t end = start + slot->npages;
@@ -1452,10 +1451,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
tdp_root_for_each_pte(iter, root, start, end) {
retry:
- if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
- flush = false;
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
- }
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1475,30 +1472,24 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
goto retry;
}
- flush = true;
}
rcu_read_unlock();
-
- return flush;
}
/*
* Clear non-leaf entries (and free associated page tables) which could
* be replaced by large mappings, for GFNs within the slot.
*/
-bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- bool flush)
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
{
struct kvm_mmu_page *root;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
- flush = zap_collapsible_spte_range(kvm, root, slot, flush);
-
- return flush;
+ zap_collapsible_spte_range(kvm, root, slot);
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 476b133544dd..3899004a5d91 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -64,9 +64,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn, unsigned long mask,
bool wrprot);
-bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- bool flush);
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
--
2.34.0.rc0.344.g81b53c2807-goog
Add a wrapper around make_spte which conveys the vCPU-specific context of
the function. This will facilitate factoring out all uses of the vCPU
pointer from make_spte in subsequent commits.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 6 +++---
arch/x86/kvm/mmu/spte.c | 17 +++++++++++++----
arch/x86/kvm/mmu/spte.h | 12 ++++++++----
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
5 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index baa94acab516..2ada6dee920a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2723,7 +2723,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
was_rmapped = 1;
}
- wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ wrprot = vcpu_make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
true, host_writable, &spte);
if (*sptep == spte) {
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f87d36898c44..edb8ebd1a775 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1129,9 +1129,9 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- make_spte(vcpu, sp, slot, pte_access, gfn,
- spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ vcpu_make_spte(vcpu, sp, slot, pte_access, gfn,
+ spte_to_pfn(spte), spte, true, false,
+ host_writable, &spte);
flush |= mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..04d26e913941 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,10 +90,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
}
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot,
- unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte)
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -191,6 +190,16 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
return wrprot;
}
+bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte)
+{
+ return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+ prefetch, can_unsync, host_writable, new_spte);
+
+}
+
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
{
u64 spte = SPTE_MMU_PRESENT_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cc432f9a966b..14f18082d505 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -330,10 +330,14 @@ static inline u64 get_mmio_spte_generation(u64 spte)
}
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot,
- unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte);
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte);
+bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct kvm_memory_slot *slot,
+ unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+ u64 old_spte, bool prefetch, bool can_unsync,
+ bool host_writable, u64 *new_spte);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1ece645e737f..836eadd4e73a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -980,9 +980,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ wrprot = vcpu_make_spte(vcpu, sp, fault->slot, ACC_ALL,
+ iter->gfn, fault->pfn, iter->old_spte,
+ fault->prefetch, true,
+ fault->map_writable, &new_spte);
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
--
2.34.0.rc0.344.g81b53c2807-goog
In service of removing the vCPU pointer from make_spte, factor the memory
type mask calculation out of make_spte.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 9 +++++----
arch/x86/kvm/mmu/spte.h | 2 +-
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 3cf08a534a16..75c666d3e7f1 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 *new_spte)
+ u64 mt_mask, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -130,8 +130,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
if (tdp_enabled)
- spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
- kvm_is_mmio_pfn(pfn));
+ spte |= mt_mask;
if (host_writable)
spte |= shadow_host_writable_mask;
@@ -197,10 +196,12 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
bool can_unsync, bool host_writable, u64 *new_spte)
{
bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
+ u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
+ kvm_is_mmio_pfn(pfn));
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
- ad_need_write_protect, new_spte);
+ ad_need_write_protect, mt_mask, new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bcf58602f224..e739f2ebf844 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -333,7 +333,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 *new_spte);
+ u64 mt_mask, u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc0.344.g81b53c2807-goog
When running a nested VM, KVM write protects SPTEs in the EPT/NPT02
instead of using PML for dirty tracking. This avoids expensive
translation later, when emptying the Page Modification Log. In service
of removing the vCPU pointer from make_spte, factor the check for nested
PML out of the function.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 10 +++++++---
arch/x86/kvm/mmu/spte.h | 3 ++-
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 04d26e913941..3cf08a534a16 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -92,7 +92,8 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
- bool can_unsync, bool host_writable, u64 *new_spte)
+ bool can_unsync, bool host_writable, bool ad_need_write_protect,
+ u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -100,7 +101,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED_MASK;
- else if (kvm_vcpu_ad_need_write_protect(vcpu))
+ else if (ad_need_write_protect)
spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;
/*
@@ -195,8 +196,11 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, u64 *new_spte)
{
+ bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
+
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
- prefetch, can_unsync, host_writable, new_spte);
+ prefetch, can_unsync, host_writable,
+ ad_need_write_protect, new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 14f18082d505..bcf58602f224 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -332,7 +332,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
- bool can_unsync, bool host_writable, u64 *new_spte);
+ bool can_unsync, bool host_writable, bool ad_need_write_protect,
+ u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc0.344.g81b53c2807-goog
kvm_slot_page_track_is_active only uses its vCPU argument to get a
pointer to the associated struct kvm, so just pass in the struct kvm to
remove the need for a vCPU pointer.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/page_track.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 9d4a3b1b25b9..e99a30a4d38b 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -63,7 +63,7 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
void kvm_slot_page_track_remove_page(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
-bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
+bool kvm_slot_page_track_is_active(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2ada6dee920a..7d0da79668c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* track machinery is used to write-protect upper-level shadow pages,
* i.e. this guards the role.level == 4K assertion below!
*/
- if (kvm_slot_page_track_is_active(vcpu, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
return -EPERM;
/*
@@ -3884,7 +3884,7 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
* guest is writing the page which is write tracked which can
* not be fixed by page fault handler.
*/
- if (kvm_slot_page_track_is_active(vcpu, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
return true;
return false;
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index cc4eb5b7fb76..35c221d5f6ce 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -173,7 +173,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
/*
* check if the corresponding access on the specified guest page is tracked.
*/
-bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
+bool kvm_slot_page_track_is_active(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode)
{
@@ -186,7 +186,7 @@ bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
return false;
if (mode == KVM_PAGE_TRACK_WRITE &&
- !kvm_page_track_write_tracking_enabled(vcpu->kvm))
+ !kvm_page_track_write_tracking_enabled(kvm))
return false;
index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
--
2.34.0.rc0.344.g81b53c2807-goog
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
pointer to the associated struct kvm, so pass in the kvm pointer from
the beginning to remove the need for a vCPU when calling the function.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 16 ++++++++--------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/spte.c | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7d0da79668c0..1e890509b93f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2561,10 +2561,10 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
return r;
}
-static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
trace_kvm_mmu_unsync_page(sp);
- ++vcpu->kvm->stat.mmu_unsync;
+ ++kvm->stat.mmu_unsync;
sp->unsync = 1;
kvm_mmu_mark_parents_unsync(sp);
@@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
* were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
* be write-protected.
*/
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch)
{
struct kvm_mmu_page *sp;
@@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* track machinery is used to write-protect upper-level shadow pages,
* i.e. this guards the role.level == 4K assertion below!
*/
- if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
return -EPERM;
/*
@@ -2596,7 +2596,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* that case, KVM must complete emulation of the guest TLB flush before
* allowing shadow pages to become unsync (writable by the guest).
*/
- for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+ for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
if (!can_unsync)
return -EPERM;
@@ -2615,7 +2615,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
*/
if (!locked) {
locked = true;
- spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
+ spin_lock(&kvm->arch.mmu_unsync_pages_lock);
/*
* Recheck after taking the spinlock, a different vCPU
@@ -2630,10 +2630,10 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}
WARN_ON(sp->role.level != PG_LEVEL_4K);
- kvm_unsync_page(vcpu, sp);
+ kvm_unsync_page(kvm, sp);
}
if (locked)
- spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
+ spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
/*
* We need to ensure that the marking of unsync pages is visible
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..1073d10cce91 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
kvm_x86_ops.cpu_dirty_log_size;
}
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch);
void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c666d3e7f1..b7271daa06c5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -160,7 +160,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
- if (mmu_try_to_unsync_pages(vcpu, slot, gfn, can_unsync, prefetch)) {
+ if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
wrprot = true;
--
2.34.0.rc0.344.g81b53c2807-goog
In the interest of developing a version of make_spte that can function
without a vCPU pointer, factor out the shadow_zero_check to be an
additional argument to the function.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 11 +++++++----
arch/x86/kvm/mmu/spte.h | 3 ++-
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b7271daa06c5..d3b059e96c6e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 mt_mask, u64 *new_spte)
+ u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+ u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -176,9 +177,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (prefetch)
spte = mark_spte_for_access_track(spte);
- WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
- get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+ get_rsvd_bits(shadow_zero_check, spte, level));
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
@@ -198,10 +199,12 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
+ struct rsvd_bits_validate *shadow_zero_check = &vcpu->arch.mmu->shadow_zero_check;
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
- ad_need_write_protect, mt_mask, new_spte);
+ ad_need_write_protect, mt_mask, shadow_zero_check,
+ new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e739f2ebf844..6134a10487c4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -333,7 +333,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 mt_mask, u64 *new_spte);
+ u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+ u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc0.344.g81b53c2807-goog
Now that nothing in make_spte actually needs the vCPU argument, just
pass in a pointer to the struct kvm. This allows the function to be used
in situations where there is no relevant struct kvm_vcpu.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 8 ++++----
arch/x86/kvm/mmu/spte.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d3b059e96c6e..d98723b14cec 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -89,7 +89,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
E820_TYPE_RAM);
}
-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
@@ -161,7 +161,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
- if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
+ if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
wrprot = true;
@@ -184,7 +184,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON(level > PG_LEVEL_4K);
- mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
+ mark_page_dirty_in_slot(kvm, slot, gfn);
}
*new_spte = spte;
@@ -201,7 +201,7 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
kvm_is_mmio_pfn(pfn));
struct rsvd_bits_validate *shadow_zero_check = &vcpu->arch.mmu->shadow_zero_check;
- return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+ return make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
ad_need_write_protect, mt_mask, shadow_zero_check,
new_spte);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6134a10487c4..5bb055688080 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -329,7 +329,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
return gen;
}
-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
--
2.34.0.rc0.344.g81b53c2807-goog
Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
helper function which does not require a vCPU pointer. The only element
of the struct kvm_mmu context used by the function is the shadow root
level, so pass that in too instead of the mmu context.
No functional change intended.
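For illustration, a later vCPU-less caller could then do something along
these lines (sketch only; assumes the helper is made reachable from the
TDP MMU code):

	struct rsvd_bits_validate shadow_zero_check;

	/* Build the TDP reserved-bits check without a struct kvm_mmu. */
	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter.root_level);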
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e890509b93f..fdf0f15ab19d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4450,17 +4450,14 @@ static inline bool boot_cpu_is_amd(void)
* possible, however, kvm currently does not do execution-protection.
*/
static void
-reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context)
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+ int shadow_root_level)
{
- struct rsvd_bits_validate *shadow_zero_check;
int i;
- shadow_zero_check = &context->shadow_zero_check;
-
if (boot_cpu_is_amd())
__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
- context->shadow_root_level, false,
+ shadow_root_level, false,
boot_cpu_has(X86_FEATURE_GBPAGES),
false, true);
else
@@ -4470,12 +4467,20 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
if (!shadow_me_mask)
return;
- for (i = context->shadow_root_level; --i >= 0;) {
+ for (i = shadow_root_level; --i >= 0;) {
shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
}
}
+static void
+reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+ build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
+ context->shadow_root_level);
+}
+
/*
* as the comments in reset_shadow_zero_bits_mask() except it
* is the shadow page table for intel nested guest.
--
2.34.0.rc0.344.g81b53c2807-goog
In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 4 ++--
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/page_track.c | 4 ++--
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 2 +-
include/linux/kvm_host.h | 10 +++++-----
virt/kvm/kvm_main.c | 12 ++++++------
8 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index e99a30a4d38b..eb186bc57f6a 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -64,8 +64,8 @@ void kvm_slot_page_track_remove_page(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
bool kvm_slot_page_track_is_active(struct kvm *kvm,
- struct kvm_memory_slot *slot, gfn_t gfn,
- enum kvm_page_track_mode mode);
+ const struct kvm_memory_slot *slot,
+ gfn_t gfn, enum kvm_page_track_mode mode);
void
kvm_page_track_register_notifier(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fdf0f15ab19d..ef7a84422463 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
* were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
* be write-protected.
*/
-int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch)
{
struct kvm_mmu_page *sp;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1073d10cce91..6563cce9c438 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
kvm_x86_ops.cpu_dirty_log_size;
}
-int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch);
void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 35c221d5f6ce..68eb1fb548b6 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -174,8 +174,8 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
* check if the corresponding access on the specified guest page is tracked.
*/
bool kvm_slot_page_track_is_active(struct kvm *kvm,
- struct kvm_memory_slot *slot, gfn_t gfn,
- enum kvm_page_track_mode mode)
+ const struct kvm_memory_slot *slot,
+ gfn_t gfn, enum kvm_page_track_mode mode)
{
int index;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d98723b14cec..7be41d2dbb02 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,7 +90,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
}
bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot, unsigned int pte_access,
+ const struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5bb055688080..d7598506fbad 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -330,7 +330,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
}
bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot, unsigned int pte_access,
+ const struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 60a35d9fe259..675da38fac7f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -435,7 +435,7 @@ struct kvm_memory_slot {
u16 as_id;
};
-static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
+static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
}
@@ -855,9 +855,9 @@ void kvm_set_page_accessed(struct page *page);
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable, hva_t *hva);
@@ -934,7 +934,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
-void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f6d450355f0..6dbf8cba1900 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2138,12 +2138,12 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
return size;
}
-static bool memslot_is_readonly(struct kvm_memory_slot *slot)
+static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_READONLY;
}
-static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
+static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -2438,7 +2438,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
return pfn;
}
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable, hva_t *hva)
{
@@ -2478,13 +2478,13 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
}
@@ -3079,7 +3079,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
EXPORT_SYMBOL_GPL(kvm_clear_guest);
void mark_page_dirty_in_slot(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *memslot,
gfn_t gfn)
{
if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
--
2.34.0.rc0.344.g81b53c2807-goog
Remove the gotos from vmx_get_mt_mask to make it easier to separate out
the parts which do not depend on vcpu state.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71f54d85f104..77f45c005f28 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6987,7 +6987,6 @@ static int __init vmx_check_processor_compat(void)
static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
- u64 ipat = 0;
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7007,30 +7006,22 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
* EPT memory type is used to emulate guest CD/MTRR.
*/
- if (is_mmio) {
- cache = MTRR_TYPE_UNCACHABLE;
- goto exit;
- }
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
- ipat = VMX_EPT_IPAT_BIT;
- cache = MTRR_TYPE_WRBACK;
- goto exit;
- }
+ if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+ return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
- ipat = VMX_EPT_IPAT_BIT;
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
cache = MTRR_TYPE_WRBACK;
else
cache = MTRR_TYPE_UNCACHABLE;
- goto exit;
- }
- cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+ return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ }
-exit:
- return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
+ return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
}
static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
--
2.34.0.rc0.344.g81b53c2807-goog
Factor out the parts of vmx_get_mt_mask which do not depend on the vCPU
argument. This also requires adding some error reporting to the helper
function to say whether it was possible to generate the MT mask without
a vCPU argument. This refactoring will allow the MT mask to be computed
without a vCPU when noncoherent DMA is not enabled on a VM.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 77f45c005f28..4129614262e8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6984,9 +6984,26 @@ static int __init vmx_check_processor_compat(void)
return 0;
}
+static bool vmx_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask)
+{
+ if (is_mmio) {
+ *mask = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+ return true;
+ }
+
+ if (!kvm_arch_has_noncoherent_dma(kvm)) {
+ *mask = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ return true;
+ }
+
+ return false;
+}
+
static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
+ u64 mask;
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7006,11 +7023,8 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
* EPT memory type is used to emulate guest CD/MTRR.
*/
- if (is_mmio)
- return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
-
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
- return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ if (vmx_try_get_mt_mask(vcpu->kvm, gfn, is_mmio, &mask))
+ return mask;
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
--
2.34.0.rc0.344.g81b53c2807-goog
Add another function for getting the memory type mask to x86_ops.
This version of the function can fail, but it does not require a vCPU
pointer. It will be used in a subsequent commit for in-place large page
promotion when disabling dirty logging.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm/svm.c | 8 ++++++++
arch/x86/kvm/vmx/vmx.c | 1 +
4 files changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..c86e9629ff1a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -84,6 +84,7 @@ KVM_X86_OP_NULL(sync_pir_to_irr)
KVM_X86_OP(set_tss_addr)
KVM_X86_OP(set_identity_map_addr)
KVM_X86_OP(get_mt_mask)
+KVM_X86_OP(try_get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_NULL(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88fce6ab4bbd..ae13075f4d4c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1400,6 +1400,8 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+ bool (*try_get_mt_mask)(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask);
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 21bb81710e0f..d073cc3985e6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4067,6 +4067,13 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
return true;
}
+static bool svm_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask)
+{
+ *mask = 0;
+ return true;
+}
+
static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
return 0;
@@ -4660,6 +4667,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.set_tss_addr = svm_set_tss_addr,
.set_identity_map_addr = svm_set_identity_map_addr,
.get_mt_mask = svm_get_mt_mask,
+ .try_get_mt_mask = svm_try_get_mt_mask,
.get_exit_info = svm_get_exit_info,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4129614262e8..8cd6c1f50d3e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7658,6 +7658,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
+ .try_get_mt_mask = vmx_try_get_mt_mask,
.get_exit_info = vmx_get_exit_info,
--
2.34.0.rc0.344.g81b53c2807-goog
Export kvm_is_mmio_pfn from spte.c. It will be used in a subsequent
commit for in-place lpage promotion when disabling dirty logging.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 7be41d2dbb02..13b6143f6333 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -68,7 +68,7 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
return spte;
}
-static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
{
if (pfn_valid(pfn))
return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d7598506fbad..909c24c733c4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -347,4 +347,5 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
void kvm_mmu_reset_all_pte_masks(void);
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn);
#endif
--
2.34.0.rc0.344.g81b53c2807-goog
When disabling dirty logging, the TDP MMU currently zaps each leaf entry
mapping memory in the relevant memslot. This is very slow. Doing the zaps
under the mmu read lock requires a TLB flush for every zap and the
zapping causes a storm of EPT/NPT violations.
Instead of zapping, replace the split large pages with large page
mappings directly. While this sort of operation has historically only
been done in the vCPU page fault handler context, refactorings earlier
in this series and the relative simplicity of the TDP MMU make it
possible here as well.
Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
of memory per vCPU, this reduces the time required to disable dirty
logging from over 45 seconds to just over 1 second. It also avoids
provoking page faults, improving vCPU performance while disabling
dirty logging.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/mmu_internal.h | 4 ++
arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++-
3 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef7a84422463..add724aa9e8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
* the direct page table on host, use as much mmu features as
* possible, however, kvm currently does not do execution-protection.
*/
-static void
+void
build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
int shadow_root_level)
{
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6563cce9c438..84d439432acf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+ int shadow_root_level);
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 836eadd4e73a..77ff7f1d0d0a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1435,6 +1435,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
}
+static void try_promote_lpage(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ struct tdp_iter *iter)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
+ struct rsvd_bits_validate shadow_zero_check;
+ /*
+ * Since the TDP MMU doesn't manage nested PTs, there's no need to
+ * write protect for a nested VM when PML is in use.
+ */
+ bool ad_need_write_protect = false;
+ bool map_writable;
+ kvm_pfn_t pfn;
+ u64 new_spte;
+ u64 mt_mask;
+
+ /*
+ * If addresses are being invalidated, don't do in-place promotion to
+ * avoid accidentally mapping an invalidated address.
+ */
+ if (unlikely(kvm->mmu_notifier_count))
+ return;
+
+ pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
+ &map_writable, NULL);
+
+ /*
+ * Can't reconstitute an lpage if the constituent pages can't be
+ * mapped higher.
+ */
+ if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
+ pfn, PG_LEVEL_NUM))
+ return;
+
+ build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
+
+ /*
+ * In some cases, a vCPU pointer is required to get the MT mask;
+ * however, in most cases it can be generated without one. If a
+ * vCPU pointer is needed, kvm_x86_try_get_mt_mask will fail.
+ * In that case, bail on in-place promotion.
+ */
+ if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
+ kvm_is_mmio_pfn(pfn),
+ &mt_mask)))
+ return;
+
+ make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
+ map_writable, ad_need_write_protect, mt_mask,
+ &shadow_zero_check, &new_spte);
+
+ tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+
+ /*
+ * Re-read the SPTE to avoid recursing into one of the removed child
+ * page tables.
+ */
+ iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+}
+
/*
* Clear leaf entries which could be replaced by large mappings, for
* GFNs within the slot.
@@ -1455,9 +1515,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
- if (!is_shadow_present_pte(iter.old_spte) ||
- !is_last_spte(iter.old_spte, iter.level))
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ /* Try to promote the constituent pages to an lpage. */
+ if (!is_last_spte(iter.old_spte, iter.level)) {
+ try_promote_lpage(kvm, slot, &iter);
continue;
+ }
pfn = spte_to_pfn(iter.old_spte);
if (kvm_is_reserved_pfn(pfn) ||
--
2.34.0.rc0.344.g81b53c2807-goog
On 11/10/21 23:30, Ben Gardon wrote:
> - WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> + WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> - get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> + get_rsvd_bits(shadow_zero_check, spte, level));
Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
really the same for all vCPUs. The only thing that varies is those
parts that actually depend on the guest's paging mode---the extended
role, the reserved bits, etc. Those are needed by the emulator, but
don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
I wonder if there's room for splitting kvm_mmu in two parts, such as
kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
pointer into a pointer to kvm_guest_paging_context. This way the
EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch. It
should simplify this series and also David's work on eager page splitting.
I'm not asking you to do this, of course, but perhaps I can trigger
Sean's itch to refactor stuff. :)
Paolo
On Wed, Nov 10, 2021 at 2:45 PM Paolo Bonzini <[email protected]> wrote:
>
> On 11/10/21 23:30, Ben Gardon wrote:
> > - WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> > + WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> > "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> > - get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> > + get_rsvd_bits(shadow_zero_check, spte, level));
>
> Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
> the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
> really the same for all vCPUs. The only thing that varies is those
> parts that actually depend on the guest's paging mode---the extended
> role, the reserved bits, etc. Those are needed by the emulator, but
> don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
>
> I wonder if there's room for splitting kvm_mmu in two parts, such as
> kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
> pointer into a pointer to kvm_guest_paging_context. This way the
> EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch. It
> should simplify this series and also David's work on eager page splitting.
>
> I'm not asking you to do this, of course, but perhaps I can trigger
> Sean's itch to refactor stuff. :)
>
> Paolo
>
I think that's a great idea. I'm frequently confused as to why the
struct kvm_mmu is a per-vcpu construct as opposed to being VM-global.
Moving part of the struct to be a member for struct kvm would also
open the door to formalizing the MMU interface a little better and
perhaps even reveal more MMU code that can be consolidated across
architectures.
On Wed, Nov 10, 2021, Ben Gardon wrote:
> On Wed, Nov 10, 2021 at 2:45 PM Paolo Bonzini <[email protected]> wrote:
> >
> > On 11/10/21 23:30, Ben Gardon wrote:
> > > - WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
> > > + WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
> > > "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
> > > - get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
> > > + get_rsvd_bits(shadow_zero_check, spte, level));
> >
> > Hmm, there is a deeper issue here, in that when using EPT/NPT (on either
> > the legacy aka shadow or the TDP MMU) large parts of vcpu->arch.mmu are
> > really the same for all vCPUs. The only thing that varies is those
> > parts that actually depend on the guest's paging mode---the extended
> > role, the reserved bits, etc. Those are needed by the emulator, but
> > don't really belong in vcpu->arch.mmu when EPT/NPT is in use.
> >
> > I wonder if there's room for splitting kvm_mmu in two parts, such as
> > kvm_mmu and kvm_guest_paging_context, and possibly change the walk_mmu
> > pointer into a pointer to kvm_guest_paging_context. This way the
> > EPT/NPT MMU (again either shadow or TDP) can be moved to kvm->arch. It
> > should simplify this series and also David's work on eager page splitting.
> >
> > I'm not asking you to do this, of course, but perhaps I can trigger
> > Sean's itch to refactor stuff. :)
> >
> > Paolo
> >
>
> I think that's a great idea. I'm frequently confused as to why the
> struct kvm_mmu is a per-vcpu construct as opposed to being VM-global.
> Moving part of the struct to be a member for struct kvm would also
> open the door to formalizing the MMU interface a little better and
> perhaps even reveal more MMU code that can be consolidated across
> architectures.
But what would you actually move? Even shadow_zero_check barely squeaks by,
e.g. if NX is ever used for NPT, then maybe it stops being a per-VM setting.
Going through the fields...
These are all related to guest context:
unsigned long (*get_guest_pgd)(struct kvm_vcpu *vcpu);
u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
void (*inject_page_fault)(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
u32 access, struct x86_exception *exception);
gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
struct x86_exception *exception);
int (*sync_page)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp);
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
union kvm_mmu_role mmu_role;
u8 root_level;
u8 permissions[16];
u32 pkru_mask;
struct rsvd_bits_validate guest_rsvd_check;
u64 pdptrs[4];
gpa_t root_pgd;
One field, ept_ad, can be straight deleted as it's redundant with respect to
the above mmu_role.ad_disabled.
u8 ept_ad;
Ditto for direct_map flag (mmu_role.direct) and shadow_root_level (mmu_role.level).
I haven't bothered to yank those because they have a lot of touchpoints.
bool direct_map;
u8 shadow_root_level;
The prev_roots could be dropped if TDP roots were tracked per-VM, but we'd still
want an equivalent for !TDP and nTDP MMUs.
struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
shadow_zero_check can be made per-VM if all vCPUs are required to have the same
cpuid.MAXPHYADDR or if we remove the (IMO) pointless 5-level vs. 4-level behavior,
which by-the-by, has my vote since we could make shadow_zero_check _global_, not
just per-VM, and everything I've heard is that the extra level has no measurable
performance overhead.
struct rsvd_bits_validate shadow_zero_check;
And that leaves us with:
hpa_t root_hpa;
u64 *pae_root;
u64 *pml4_root;
u64 *pml5_root;
Of those, _none_ of them can be per-VM, because they are all nothing more than
shadow pages, and thus cannot be per-VM unless there is exactly one set of TDP
page tables for the guest. Even if/when we strip the unnecessary role bits from
these for TDP (on my todo list), we still need up to three sets of page tables:
1. Normal
2. SMM
3. Guest (if L1 doesn't use TDP)
So I suppose we could refactor KVM to explicitly track its three possible TDP
roots, but I don't think it buys us anything and would complicate supporting
!TDP as well as nTDP.
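Purely for illustration, explicitly tracking those three roots per VM might look
something like the following hypothetical sketch (no such fields exist today):

	/* Hypothetical per-VM TDP roots, mirroring the three cases above. */
	struct kvm_arch_tdp_roots {
		hpa_t normal_root;
		hpa_t smm_root;
		hpa_t guest_root;	/* only needed if L1 doesn't use TDP */
	};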
On Thu, Nov 11, 2021, Sean Christopherson wrote:
> These are all related to guest context:
> int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> void (*inject_page_fault)(struct kvm_vcpu *vcpu,
> struct x86_exception *fault);
That's incorrect, page_fault() and inject_page_fault() could be per-VM. Neither
is particularly interesting though.
On 11/11/21 02:18, Sean Christopherson wrote:
> But what would you actually move? Even shadow_zero_check barely squeaks by,
> e.g. if NX is ever used for NPT, then maybe it stops being a per-VM setting.
Hmm, I think it would still be per-VM, just like 32-bit shadow page tables
are always built for EFER.NXE=CR4.PAE=1. Anyway, the rough sketch is to have
three structs:
* struct kvm_mmu_kind has the function pointers and the state that is
needed to operate on page tables
* struct kvm_mmu has the function pointers and the state that is
needed while the vCPU runs, including the role
* struct kvm_paging_context has the stuff related to emulation;
shadow page tables of course needs it but EPT/NPT do not (with
either the old or the new MMU)
So you'd have a "struct kvm_mmu_kind direct_mmu" in struct kvm_arch (for
either legacy EPT/NPT or the new MMU), and
struct kvm_mmu_kind shadow_mmu;
struct kvm_mmu root_mmu; /* either TDP or shadow */
struct kvm_mmu tdp12_mmu; /* always shadow */
struct kvm_mmu *mmu; /* either &kvm->direct_mmu or &vcpu->shadow_mmu */
struct kvm_paging_context root_walk; /* maybe unified with walk01 below? dunno yet */
struct kvm_paging_context walk01;
struct kvm_paging_context walk12;
struct kvm_paging_context *walk; /* either &vcpu->root_walk or &vcpu->walk12 */
in struct kvm_vcpu_arch. struct kvm_mmu* has a pointer to
struct kvm_mmu_kind*; however, if an spte.c function does not need
the data in struct kvm_mmu_state*, it can take a struct kvm_mmu_kind*
and it won't need a vCPU. Likewise the TDP MMU knows its kvm_mmu_kind
is always in &kvm->direct_mmu so it can take a struct kvm* if the struct
kvm_mmu_state* is not needed.
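To make that a little more concrete, a purely illustrative outline follows; which
fields land where is an assumption based on the description above and on Sean's
field list earlier in the thread, not something anyone has proposed as code:

	struct kvm_mmu_kind {
		/* Function pointers and state needed to operate on page tables. */
		int (*page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
		struct rsvd_bits_validate shadow_zero_check;
	};

	struct kvm_mmu {
		/* State needed while the vCPU runs, including the role. */
		struct kvm_mmu_kind *kind;
		union kvm_mmu_role mmu_role;
		hpa_t root_hpa;
	};

	struct kvm_paging_context {
		/* Guest paging state, mostly consumed by the emulator. */
		gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
				    u32 access, struct x86_exception *exception);
		struct rsvd_bits_validate guest_rsvd_check;
		u8 permissions[16];
		u32 pkru_mask;
	};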
The first part of the refactoring would be to kill the nested_mmu
and walk_mmu, replacing them by &vcpu->walk12 and vcpu->walk
respectively. The half-useless nested_mmu has always bothered me,
I was going to play with it anyway because I want to remove the
kvm_mmu_reset_context from CR0.WP writes, I'll see if I get
something useful out of it.
Paolo
On Wed, Nov 10, 2021 at 02:29:52PM -0800, Ben Gardon wrote:
> When recursively clearing out disconnected pts, the range based TLB
> flush in handle_removed_tdp_mmu_page uses the wrong starting GFN,
> resulting in the flush mostly missing the affected range. Fix this by
> using base_gfn for the flush.
>
> Fixes: a066e61f13cf ("KVM: x86/mmu: Factor out handling of removed page tables")
> CC: [email protected]
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7c5dd83e52de..866c2b191e1e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -374,7 +374,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> shared);
> }
>
> - kvm_flush_remote_tlbs_with_address(kvm, gfn,
> + kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
Suggest pulling the definition of gfn into the for loop as well (along
with sptep and old_child_spte for that matter) so that referencing it
here isn't even possible.
> KVM_PAGES_PER_HPAGE(level + 1));
>
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
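For reference, the scoping change being suggested would look roughly like this;
the loop body is heavily abridged and the exact declarations are assumptions
based on the current handle_removed_tdp_mmu_page:

	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
		/*
		 * Declared inside the loop so they cannot be referenced by the
		 * flush below, which must cover the whole page table and thus
		 * use base_gfn.
		 */
		u64 *sptep = rcu_dereference(pt) + i;
		gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
		u64 old_child_spte;

		/* ... read/zap the child SPTE and call handle_changed_spte() ... */
	}

	kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
					   KVM_PAGES_PER_HPAGE(level + 1));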
On Wed, Nov 10, 2021 at 02:29:53PM -0800, Ben Gardon wrote:
> When recursively handling a removed TDP page table, the TDP MMU will
> flush the TLBs and queue an RCU callback to free the PT. If the original
> change zapped a non-leaf SPTE at PG_LEVEL_1G or above, that change will
> result in many unnecessary TLB flushes when one would suffice. Queue all
> the PTs which need to be freed on a list and wait to queue RCU callbacks
> to free them until after all the recursive callbacks are done.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 88 ++++++++++++++++++++++++++++++--------
> 1 file changed, 70 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 866c2b191e1e..5b31d046df78 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -220,7 +220,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> u64 old_spte, u64 new_spte, int level,
> - bool shared);
> + bool shared,
> + struct list_head *disconnected_sps);
>
> static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
> {
> @@ -302,6 +303,11 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> * @shared: This operation may not be running under the exclusive use
> * of the MMU lock and the operation must synchronize with other
> * threads that might be modifying SPTEs.
> + * @disconnected_sps: If null, the TLBs will be flushed and the disconnected
> + * TDP MMU page will be queued to be freed after an RCU
> + * callback. If non-null the page will be added to the list
> + * and flushing the TLBs and queueing an RCU callback to
> + * free the page will be the caller's responsibility.
> *
> * Given a page table that has been removed from the TDP paging structure,
> * iterates through the page table to clear SPTEs and free child page tables.
> @@ -312,7 +318,8 @@ static void tdp_mmu_unlink_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> * early rcu_dereferences in the function.
> */
> static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> - bool shared)
> + bool shared,
> + struct list_head *disconnected_sps)
> {
> struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
> int level = sp->role.level;
> @@ -371,13 +378,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> }
> handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> old_child_spte, REMOVED_SPTE, level,
> - shared);
> + shared, disconnected_sps);
> }
>
> - kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
> - KVM_PAGES_PER_HPAGE(level + 1));
> -
> - call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + if (disconnected_sps) {
> + list_add_tail(&sp->link, disconnected_sps);
> + } else {
> + kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
> + KVM_PAGES_PER_HPAGE(level + 1));
> + call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + }
> }
>
> /**
> @@ -391,13 +401,21 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> * @shared: This operation may not be running under the exclusive use of
> * the MMU lock and the operation must synchronize with other
> * threads that might be modifying SPTEs.
> + * @disconnected_sps: Only used if a page of page table memory has been
> + * removed from the paging structure by this change.
> + * If null, the TLBs will be flushed and the disconnected
> + * TDP MMU page will be queued to be freed after an RCU
> + * callback. If non-null the page will be added to the list
> + * and flushing the TLBs and queueing an RCU callback to
> + * free the page will be the caller's responsibility.
> *
> * Handle bookkeeping that might result from the modification of a SPTE.
> * This function must be called for all TDP SPTE modifications.
> */
> static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> u64 old_spte, u64 new_spte, int level,
> - bool shared)
> + bool shared,
> + struct list_head *disconnected_sps)
> {
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> @@ -475,22 +493,39 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> */
> if (was_present && !was_leaf && (pfn_changed || !is_present))
> handle_removed_tdp_mmu_page(kvm,
> - spte_to_child_pt(old_spte, level), shared);
> + spte_to_child_pt(old_spte, level), shared,
> + disconnected_sps);
> }
>
> static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> u64 old_spte, u64 new_spte, int level,
> - bool shared)
> + bool shared, struct list_head *disconnected_sps)
> {
> __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> - shared);
> + shared, disconnected_sps);
> handle_changed_spte_acc_track(old_spte, new_spte, level);
> handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> new_spte, level);
> }
>
> /*
> - * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> + * The TLBs must be flushed between the pages linked from disconnected_sps
> + * being removed from the paging structure and this function being called.
> + */
> +static void handle_disconnected_sps(struct kvm *kvm,
> + struct list_head *disconnected_sps)
handle_disconnected_sps() does a very specific task so I think we could
go with a more specific function name to make the code more readable.
How about free_sps_rcu() or call_rcu_free_sps()?
> +{
> + struct kvm_mmu_page *sp;
> + struct kvm_mmu_page *next;
> +
> + list_for_each_entry_safe(sp, next, disconnected_sps, link) {
> + list_del(&sp->link);
> + call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> + }
> +}
> +
> +/*
> + * __tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> * and handle the associated bookkeeping. Do not mark the page dirty
> * in KVM's dirty bitmaps.
> *
> @@ -500,9 +535,10 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * Returns: true if the SPTE was set, false if it was not. If false is returned,
> * this function will have no side-effects.
> */
> -static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> - struct tdp_iter *iter,
> - u64 new_spte)
> +static inline bool __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> + struct tdp_iter *iter,
> + u64 new_spte,
> + struct list_head *disconnected_sps)
> {
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> @@ -522,22 +558,32 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> return false;
>
> __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + new_spte, iter->level, true, disconnected_sps);
> handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
>
> return true;
> }
>
> +static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> + struct tdp_iter *iter,
> + u64 new_spte)
> +{
> + return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, NULL);
Why not leverage disconnected_sps here as well? Then you can remove the
NULL case (and comments) from handle_removed_tdp_mmu_page.
> +}
> +
> static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter)
> {
> + LIST_HEAD(disconnected_sps);
> +
> /*
> * Freeze the SPTE by setting it to a special,
> * non-present value. This will stop other threads from
> * immediately installing a present entry in its place
> * before the TLBs are flushed.
> */
> - if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
> + if (!__tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE,
> + &disconnected_sps))
> return false;
>
> kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
> @@ -553,6 +599,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> */
> WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
>
> + handle_disconnected_sps(kvm, &disconnected_sps);
> +
> return true;
> }
>
> @@ -577,6 +625,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> u64 new_spte, bool record_acc_track,
> bool record_dirty_log)
> {
> + LIST_HEAD(disconnected_sps);
> +
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -591,7 +641,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>
> __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, false);
> + new_spte, iter->level, false, &disconnected_sps);
> if (record_acc_track)
> handle_changed_spte_acc_track(iter->old_spte, new_spte,
> iter->level);
> @@ -599,6 +649,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
> iter->old_spte, new_spte,
> iter->level);
> +
> + handle_disconnected_sps(kvm, &disconnected_sps);
Where is the TLB flush for these disconnected_sps?
> }
>
> static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
On Wed, Nov 10, 2021 at 02:29:54PM -0800, Ben Gardon wrote:
> When zapping a GFN range under the MMU write lock, there is no need to
> flush the TLBs for every zap. Instead, follow the lead of the Legacy MMU
> and collect disconnected sps to be freed after a flush at the end of
> the routine.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++---------
> 1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5b31d046df78..a448f0f2d993 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -623,10 +623,9 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> */
> static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> u64 new_spte, bool record_acc_track,
> - bool record_dirty_log)
> + bool record_dirty_log,
> + struct list_head *disconnected_sps)
> {
> - LIST_HEAD(disconnected_sps);
> -
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -641,7 +640,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>
> __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, false, &disconnected_sps);
> + new_spte, iter->level, false, disconnected_sps);
> if (record_acc_track)
> handle_changed_spte_acc_track(iter->old_spte, new_spte,
> iter->level);
> @@ -649,28 +648,32 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
> iter->old_spte, new_spte,
> iter->level);
> +}
>
> - handle_disconnected_sps(kvm, &disconnected_sps);
> +static inline void tdp_mmu_zap_spte(struct kvm *kvm, struct tdp_iter *iter,
> + struct list_head *disconnected_sps)
> +{
> + __tdp_mmu_set_spte(kvm, iter, 0, true, true, disconnected_sps);
> }
>
> static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
> + __tdp_mmu_set_spte(kvm, iter, new_spte, true, true, NULL);
> }
>
> static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
> + __tdp_mmu_set_spte(kvm, iter, new_spte, false, true, NULL);
> }
>
> static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> - __tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
> + __tdp_mmu_set_spte(kvm, iter, new_spte, true, false, NULL);
> }
>
> #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
> @@ -757,6 +760,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> bool zap_all = (start == 0 && end >= max_gfn_host);
> struct tdp_iter iter;
> + LIST_HEAD(disconnected_sps);
>
> /*
> * No need to try to step down in the iterator when zapping all SPTEs,
> @@ -799,7 +803,7 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
>
> if (!shared) {
> - tdp_mmu_set_spte(kvm, &iter, 0);
> + tdp_mmu_zap_spte(kvm, &iter, &disconnected_sps);
> flush = true;
> } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
> /*
> @@ -811,6 +815,12 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> }
> }
>
> + if (!list_empty(&disconnected_sps)) {
> + kvm_flush_remote_tlbs(kvm);
> + handle_disconnected_sps(kvm, &disconnected_sps);
It might be worth adding a comment that we purposely do not process
disconnected_sps during the cond resched earlier in the loop because it
is an expensive call and it itself needs to cond resched (next patch).
> + flush = false;
> + }
> +
> rcu_read_unlock();
> return flush;
> }
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
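For what it's worth, such a comment might read something like this (the wording
is only a suggestion, placed over the code from the hunk above):

	if (!list_empty(&disconnected_sps)) {
		/*
		 * Flush once for the whole zapped range and only then queue
		 * the RCU callbacks. This is deliberately not done in the
		 * cond resched above: processing disconnected_sps is
		 * expensive and needs to be able to yield itself (see the
		 * next patch).
		 */
		kvm_flush_remote_tlbs(kvm);
		handle_disconnected_sps(kvm, &disconnected_sps);
		flush = false;
	}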
On Wed, Nov 10, 2021 at 02:29:55PM -0800, Ben Gardon wrote:
> When preparing to free disconnected SPs, the list can accumulate many
> entries; enough that it is likely necessary to yield while queuing RCU
> callbacks to free the SPs.
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a448f0f2d993..c2a9f7acf8ef 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -513,7 +513,8 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> * being removed from the paging structure and this function being called.
> */
> static void handle_disconnected_sps(struct kvm *kvm,
> - struct list_head *disconnected_sps)
> + struct list_head *disconnected_sps,
> + bool can_yield, bool shared)
> {
> struct kvm_mmu_page *sp;
> struct kvm_mmu_page *next;
> @@ -521,6 +522,16 @@ static void handle_disconnected_sps(struct kvm *kvm,
> list_for_each_entry_safe(sp, next, disconnected_sps, link) {
> list_del(&sp->link);
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> +
> + if (can_yield &&
> + (need_resched() || rwlock_needbreak(&kvm->mmu_lock))) {
> + rcu_read_unlock();
> + if (shared)
> + cond_resched_rwlock_read(&kvm->mmu_lock);
> + else
> + cond_resched_rwlock_write(&kvm->mmu_lock);
> + rcu_read_lock();
> + }
What about something like this to cut down on the duplicate code?
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c2a9f7acf8ef..2fd010f2421e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -508,6 +508,26 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
new_spte, level);
}
+static inline bool tdp_mmu_need_resched(struct kvm *kvm)
+{
+ return need_resched() || rwlock_needbreak(&kvm->mmu_lock);
+}
+
+static void tdp_mmu_cond_resched(struct kvm *kvm, bool shared, bool flush)
+{
+ rcu_read_unlock();
+
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ if (shared)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+
+ rcu_read_lock();
+}
+
/*
* The TLBs must be flushed between the pages linked from disconnected_sps
* being removed from the paging structure and this function being called.
@@ -523,15 +543,8 @@ static void handle_disconnected_sps(struct kvm *kvm,
list_del(&sp->link);
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
- if (can_yield &&
- (need_resched() || rwlock_needbreak(&kvm->mmu_lock))) {
- rcu_read_unlock();
- if (shared)
- cond_resched_rwlock_read(&kvm->mmu_lock);
- else
- cond_resched_rwlock_write(&kvm->mmu_lock);
- rcu_read_lock();
- }
+ if (can_yield && tdp_mmu_need_resched(kvm))
+ tdp_mmu_cond_resched(kvm, shared, false);
}
}
@@ -724,18 +737,8 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
if (iter->next_last_level_gfn == iter->yielded_gfn)
return false;
- if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
- rcu_read_unlock();
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
-
- if (shared)
- cond_resched_rwlock_read(&kvm->mmu_lock);
- else
- cond_resched_rwlock_write(&kvm->mmu_lock);
-
- rcu_read_lock();
+ if (tdp_mmu_need_resched(kvm)) {
+ tdp_mmu_cond_resched(kvm, shared, flush);
WARN_ON(iter->gfn > iter->next_last_level_gfn);
> }
> }
>
> @@ -599,7 +610,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> */
> WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
>
> - handle_disconnected_sps(kvm, &disconnected_sps);
> + handle_disconnected_sps(kvm, &disconnected_sps, false, true);
>
> return true;
> }
> @@ -817,7 +828,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>
> if (!list_empty(&disconnected_sps)) {
> kvm_flush_remote_tlbs(kvm);
> - handle_disconnected_sps(kvm, &disconnected_sps);
> + handle_disconnected_sps(kvm, &disconnected_sps,
> + can_yield, shared);
> flush = false;
> }
>
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
On Wed, Nov 10, 2021 at 02:29:56PM -0800, Ben Gardon wrote:
> tdp_mmu_zap_spte_atomic flushes on every zap already, so no need to
> flush again after it's done.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 4 +---
> arch/x86/kvm/mmu/tdp_mmu.c | 21 ++++++---------------
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++---
> 3 files changed, 9 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 354d2ca92df4..baa94acab516 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5870,9 +5870,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> - flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
> - if (flush)
> - kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> + kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
> read_unlock(&kvm->mmu_lock);
> }
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c2a9f7acf8ef..1ece645e737f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1438,10 +1438,9 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> * Clear leaf entries which could be replaced by large mappings, for
> * GFNs within the slot.
> */
> -static bool zap_collapsible_spte_range(struct kvm *kvm,
> +static void zap_collapsible_spte_range(struct kvm *kvm,
> struct kvm_mmu_page *root,
> - const struct kvm_memory_slot *slot,
> - bool flush)
> + const struct kvm_memory_slot *slot)
> {
> gfn_t start = slot->base_gfn;
> gfn_t end = start + slot->npages;
> @@ -1452,10 +1451,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
>
> tdp_root_for_each_pte(iter, root, start, end) {
> retry:
> - if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
> - flush = false;
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
> - }
>
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_last_spte(iter.old_spte, iter.level))
> @@ -1475,30 +1472,24 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
> iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> goto retry;
> }
> - flush = true;
> }
>
> rcu_read_unlock();
> -
> - return flush;
> }
>
> /*
> * Clear non-leaf entries (and free associated page tables) which could
> * be replaced by large mappings, for GFNs within the slot.
> */
> -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> - const struct kvm_memory_slot *slot,
> - bool flush)
> +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> + const struct kvm_memory_slot *slot)
> {
> struct kvm_mmu_page *root;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
> - flush = zap_collapsible_spte_range(kvm, root, slot, flush);
> -
> - return flush;
> + zap_collapsible_spte_range(kvm, root, slot);
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 476b133544dd..3899004a5d91 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -64,9 +64,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> gfn_t gfn, unsigned long mask,
> bool wrprot);
> -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> - const struct kvm_memory_slot *slot,
> - bool flush);
> +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> + const struct kvm_memory_slot *slot);
>
> bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
On Wed, Nov 10, 2021, Ben Gardon wrote:
> When recursively handling a removed TDP page table, the TDP MMU will
> flush the TLBs and queue an RCU callback to free the PT. If the original
> change zapped a non-leaf SPTE at PG_LEVEL_1G or above, that change will
> result in many unnecessary TLB flushes when one would suffice. Queue all
> the PTs which need to be freed on a list and wait to queue RCU callbacks
> to free them until after all the recursive callbacks are done.
I'm pretty sure we can do this without tracking disconnected SPs. The whole point
of protecting TDP MMU with RCU is to wait until _all_ CPUs are guaranteed to have
dropped references. Emphasis on "all" because that also includes the CPU that's
doing the zapping/replacement!
And since the current CPU is required to hold RCU, we can use its RCU lock as a
proxy for all vCPUs executing in the guest. That will require either flushing in
zap_gfn_range() or requiring callers to do the flush, or more likely a mix of both so
that flows that zap multiple roots or both TDP and legacy MMU pages can batch flushes.
If this doesn't sound completely bonkers, I'd like to pick this up next week; I
wandered into KVM's handling of invalidated roots and have patches that would
conflict in weird ways with this idea.
So I think this can simply be (sans zap_gfn_range() changes):
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4e226cdb40d9..d2303bca4449 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -431,9 +431,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
shared);
}
- kvm_flush_remote_tlbs_with_address(kvm, gfn,
- KVM_PAGES_PER_HPAGE(level + 1));
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -716,11 +713,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
return false;
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
- rcu_read_unlock();
-
if (flush)
kvm_flush_remote_tlbs(kvm);
+ rcu_read_unlock();
+
if (shared)
cond_resched_rwlock_read(&kvm->mmu_lock);
else
@@ -817,7 +814,6 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
}
rcu_read_unlock();
- return flush;
}
/*
@@ -954,6 +950,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
ret = RET_PF_SPURIOUS;
else if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
+ else if (<old spte was present shadow page>)
+ kvm_flush_remote_tlbs(kvm);
/*
* If the page fault was caused by a write but the page is write
> +static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> + struct tdp_iter *iter,
> + u64 new_spte)
> +{
> + return __tdp_mmu_set_spte_atomic(kvm, iter, new_spte, NULL);
This helper and refactoring belongs in patch 19. It is impossible to review without
the context of its user(s).
On Wed, Nov 10, 2021 at 2:30 PM Ben Gardon <[email protected]> wrote:
>
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~45 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~3.5 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~2 seconds.
>
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Skylake. This
> series introduced no new failures.
>
> Performance:
> To collect these results I needed to apply Mingwei's patch
> "selftests: KVM: align guest physical memory base address to 1GB"
> https://lkml.org/lkml/2021/8/29/310
> David Matlack is going to send out an updated version of that patch soon.
>
> Without this series, TDP MMU:
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 10.966500447s
> Enabling dirty logging time: 0.002068737s
>
> Iteration 1 dirty memory time: 0.047556280s
> Iteration 1 get dirty log time: 0.001253914s
> Iteration 1 clear dirty log time: 0.049716661s
> Iteration 2 dirty memory time: 3.679662016s
> Iteration 2 get dirty log time: 0.000659546s
> Iteration 2 clear dirty log time: 1.834329322s
> Disabling dirty logging time: 45.738439510s
> Get dirty log over 2 iterations took 0.001913460s. (Avg 0.000956730s/iteration)
> Clear dirty log over 2 iterations took 1.884045983s. (Avg 0.942022991s/iteration)
>
> Without this series, Legacy MMU:
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 12.664750666s
> Enabling dirty logging time: 0.002025510s
>
> Iteration 1 dirty memory time: 0.046240875s
> Iteration 1 get dirty log time: 0.001864342s
> Iteration 1 clear dirty log time: 0.170243637s
> Iteration 2 dirty memory time: 31.571088701s
> Iteration 2 get dirty log time: 0.000626245s
> Iteration 2 clear dirty log time: 1.294817729s
> Disabling dirty logging time: 3.566831573s
> Get dirty log over 2 iterations took 0.002490587s. (Avg 0.001245293s/iteration)
> Clear dirty log over 2 iterations took 1.465061366s. (Avg 0.732530683s/iteration)
>
> With this series, TDP MMU:
> > ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 12.016653537s
> Enabling dirty logging time: 0.001992860s
>
> Iteration 1 dirty memory time: 0.046701599s
> Iteration 1 get dirty log time: 0.001214806s
> Iteration 1 clear dirty log time: 0.049519923s
> Iteration 2 dirty memory time: 3.581931268s
> Iteration 2 get dirty log time: 0.000621383s
> Iteration 2 clear dirty log time: 1.894597059s
> Disabling dirty logging time: 1.950542092s
> Get dirty log over 2 iterations took 0.001836189s. (Avg 0.000918094s/iteration)
> Clear dirty log over 2 iterations took 1.944116982s. (Avg 0.972058491s/iteration)
>
> Patch breakdown:
> Patch 1 is a fix for a bug in the way the TBP MMU issues TLB flushes
> Patches 2-5 eliminate many unnecessary TLB flushes through better batching
> Patches 6-12 remove the need for a vCPU pointer to make_spte
> Patches 13-18 are small refactors in perparation for patch 19
> Patch 19 implements in-place largepage promotion when disabling dirty logging
>
> Ben Gardon (19):
> KVM: x86/mmu: Fix TLB flush range when handling disconnected pt
> KVM: x86/mmu: Batch TLB flushes for a single zap
> KVM: x86/mmu: Factor flush and free up when zapping under MMU write
> lock
> KVM: x86/mmu: Yield while processing disconnected_sps
> KVM: x86/mmu: Remove redundant flushes when disabling dirty logging
> KVM: x86/mmu: Introduce vcpu_make_spte
> KVM: x86/mmu: Factor wrprot for nested PML out of make_spte
> KVM: x86/mmu: Factor mt_mask out of make_spte
> KVM: x86/mmu: Remove need for a vcpu from
> kvm_slot_page_track_is_active
> KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages
> KVM: x86/mmu: Factor shadow_zero_check out of make_spte
> KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
> KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
> KVM: x86/mmu: Propagate memslot const qualifier
> KVM: x86/MMU: Refactor vmx_get_mt_mask
> KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
> on vcpu
> KVM: x86/mmu: Add try_get_mt_mask to x86_ops
> KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
> KVM: x86/mmu: Promote pages in-place when disabling dirty logging
>
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/include/asm/kvm_page_track.h | 6 +-
> arch/x86/kvm/mmu/mmu.c | 45 +++---
> arch/x86/kvm/mmu/mmu_internal.h | 6 +-
> arch/x86/kvm/mmu/page_track.c | 8 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
> arch/x86/kvm/mmu/spte.c | 43 +++--
> arch/x86/kvm/mmu/spte.h | 17 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 217 +++++++++++++++++++++-----
> arch/x86/kvm/mmu/tdp_mmu.h | 5 +-
> arch/x86/kvm/svm/svm.c | 8 +
> arch/x86/kvm/vmx/vmx.c | 40 +++--
> include/linux/kvm_host.h | 10 +-
> virt/kvm/kvm_main.c | 12 +-
> 15 files changed, 302 insertions(+), 124 deletions(-)
>
> --
> 2.34.0.rc0.344.g81b53c2807-goog
>
In a conversation with Sean today, he expressed interest in taking
over patches 2-4 from this series as it conflicted with another fix he
was working on.
I'll leave it to him to incorporate the feedback on these patches.
In the meantime, I've sent another iteration of patch 1 from this
series (a standalone bug fix) and will work on putting together
another version of patches 5-19.
On Wed, Nov 10, 2021, Ben Gardon wrote:
> In the interest of developing a version of make_spte that can function
> without a vCPU pointer, factor out the shadow_zero_mask to be an
> additional argument to the function.
>
> No functional change intended.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/spte.c | 11 +++++++----
> arch/x86/kvm/mmu/spte.h | 3 ++-
> 2 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index b7271daa06c5..d3b059e96c6e 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> struct kvm_memory_slot *slot, unsigned int pte_access,
> gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
> bool can_unsync, bool host_writable, bool ad_need_write_protect,
> - u64 mt_mask, u64 *new_spte)
> + u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
Ugh, so I had a big email written about how I think we should add a module param
to control 4-level vs. 5-level for all TDP pages, but then I realized it wouldn't
work for nested EPT because that follows the root level used by L1. We could
still make a global non_nested_tdp_shadow_zero_check or whatever, but then make_spte()
would have to do some work to find the right rsvd_bits_validate, and the end result
would likely be a mess.
One idea to avoid exploding make_spte() would be to add a backpointer to the MMU
in kvm_mmu_page. I don't love the idea, but I also don't love passing in rsvd_bits_validate.
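For the sake of discussion, the backpointer idea might look something like this
hypothetical sketch (not code from any patch in this thread):

	struct kvm_mmu_page {
		/* ... existing fields ... */

		/*
		 * Hypothetical backpointer to the owning MMU, so that SPTE
		 * code could reach &sp->mmu->shadow_zero_check without being
		 * handed a vCPU or an explicit rsvd_bits_validate.
		 */
		struct kvm_mmu *mmu;
	};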
On Wed, Nov 10, 2021, Ben Gardon wrote:
> When running a nested VM, KVM write protects SPTEs in the EPT/NPT02
> instead of using PML for dirty tracking. This avoids expensive
> translation later, when emptying the Page Modification Log. In service
> of removing the vCPU pointer from make_spte, factor the check for nested
> PML out of the function.
Aha! The dependency on @vcpu can be avoided without having to take a flag from
the caller. The shadow page has everything we need. The check is really "is this
a page for L2 EPT". The kvm_x86_ops.cpu_dirty_log_size gets us the EPT part, and
kvm_mmu_page.guest_mode gets us the L2 part.
Compile tested only...
From 773414e4fd7010c38ac89221d16089f3dcc57467 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <[email protected]>
Date: Wed, 17 Nov 2021 18:08:42 -0800
Subject: [PATCH] KVM: x86/mmu: Use shadow page role to detect PML-unfriendly
pages for L2
Rework make_spte() to query the shadow page's role, specifically whether
or not it's a guest_mode page, a.k.a. a page for L2, when determining if
the SPTE is compatible with PML. This eliminates a dependency on @vcpu,
with a future goal of being able to create SPTEs without a specific vCPU.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 7 +++----
arch/x86/kvm/mmu/spte.c | 2 +-
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8ede43a826af..03882b2624c8 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -109,7 +109,7 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
return kvm_mmu_role_as_id(sp->role);
}
-static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
+static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
* When using the EPT page-modification log, the GPAs in the CPU dirty
@@ -117,10 +117,9 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
* on write protection to record dirty pages, which bypasses PML, since
* writes now result in a vmexit. Note, the check on CPU dirty logging
* being enabled is mandatory as the bits used to denote WP-only SPTEs
- * are reserved for NPT w/ PAE (32-bit KVM).
+ * are reserved for PAE paging (32-bit KVM).
*/
- return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
- kvm_x86_ops.cpu_dirty_log_size;
+ return kvm_x86_ops.cpu_dirty_log_size && sp->role.guest_mode;
}
int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..84e64dbdd89e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -101,7 +101,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED_MASK;
- else if (kvm_vcpu_ad_need_write_protect(vcpu))
+ else if (kvm_mmu_page_ad_need_write_protect(sp))
spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;
/*
--
On Thu, Nov 18, 2021, Sean Christopherson wrote:
> On Wed, Nov 10, 2021, Ben Gardon wrote:
> > In the interest of developing a version of make_spte that can function
> > without a vCPU pointer, factor out the shadow_zero_mask to be an
> > additional argument to the function.
> >
> > No functional change intended.
> >
> >
> > Signed-off-by: Ben Gardon <[email protected]>
> > ---
> > arch/x86/kvm/mmu/spte.c | 11 +++++++----
> > arch/x86/kvm/mmu/spte.h | 3 ++-
> > 2 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index b7271daa06c5..d3b059e96c6e 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > struct kvm_memory_slot *slot, unsigned int pte_access,
> > gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
> > bool can_unsync, bool host_writable, bool ad_need_write_protect,
> > - u64 mt_mask, u64 *new_spte)
> > + u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
>
> Ugh, so I had a big email written about how I think we should add a module param
> to control 4-level vs. 5-level for all TDP pages, but then I realized it wouldn't
> work for nested EPT because that follows the root level used by L1. We could
> still make a global non_nested_tdp_shadow_zero_check or whatever, but then make_spte()
> would have to do some work to find the right rsvd_bits_validate, and the end result
> would likely be a mess.
>
> One idea to avoid exploding make_spte() would be to add a backpointer to the MMU
> in kvm_mmu_page. I don't love the idea, but I also don't love passing in rsvd_bits_validate.
Another idea. The only difference between 5-level and 4-level is that 5-level
fills in index [4], and I'm pretty sure 4-level doesn't touch that index. For
PAE NPT (32-bit SVM), the shadow root level will never change, so that's not an issue.
Nested NPT is the only case where anything for an EPT/NPT MMU can change, because
that follows EFER.NX.
In other words, the non-nested TDP reserved bits don't need to be recalculated
regardless of level, they can just fill in 5-level and leave it be.
E.g. something like the below. The sp->role.direct check could be removed if we
forced EFER.NX for nested NPT.
It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
thought, but at minimum it means there's no need to recalc the reserved bits.
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 84e64dbdd89e..05db9b89dc53 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -95,10 +95,18 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
u64 old_spte, bool prefetch, bool can_unsync,
bool host_writable, u64 *new_spte)
{
+ struct rsvd_bits_validate *rsvd_check;
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
bool wrprot = false;
+ if (vcpu) {
+ rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+ } else {
+ WARN_ON_ONCE(!tdp_enabled || !sp->role.direct);
+ rsvd_check = &tdp_shadow_rsvd_bits;
+ }
+
if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED_MASK;
else if (kvm_mmu_page_ad_need_write_protect(sp))
@@ -177,9 +185,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (prefetch)
spte = mark_spte_for_access_track(spte);
- WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ WARN_ONCE(is_rsvd_spte(rsvd_check, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
- get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+ get_rsvd_bits(rsvd_check, spte, level));
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
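To make the indexing argument above concrete, here is a tiny sketch (mirroring
the existing reserved-bit check; the helper name is invented) of why a table
filled for a 5-level root also serves 4-level walks:

/*
 * Illustration only: the masks are indexed by the PS bit and by level - 1,
 * so entries filled for a 5-level root (indices 0..4) are a superset of what
 * a 4-level walk ever consults (indices 0..3).
 */
static bool spte_hits_rsvd_bits(const struct rsvd_bits_validate *rsvd_check,
				u64 spte, int level)
{
	int large = !!(spte & PT_PAGE_SIZE_MASK);

	return !!(spte & rsvd_check->rsvd_bits_mask[large][level - 1]);
}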
On Thu, Nov 18, 2021, Sean Christopherson wrote:
> Another idea. The only difference between 5-level and 4-level is that 5-level
> fills in index [4], and I'm pretty sure 4-level doesn't touch that index. For
> PAE NPT (32-bit SVM), the shadow root level will never change, so that's not an issue.
>
> Nested NPT is the only case where anything for an EPT/NPT MMU can change, because
> that follows EFER.NX.
>
> In other words, the non-nested TDP reserved bits don't need to be recalculated
> regardless of level, they can just fill in 5-level and leave it be.
>
> E.g. something like the below. The sp->role.direct check could be removed if we
> forced EFER.NX for nested NPT.
>
> It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
> thought, but at minimum it means there's no need to recalc the reserved bits.
Ok, I think my final vote is to have the reserved bits passed in, but with the
non-nested TDP reserved bits being computed at MMU init.
I would also prefer to keep the existing make_spte() name so that there's no churn
in those call sites, and to make the relationship between the wrapper, make_spte(),
and the "real" helper, __make_spte(), more obvious and aligned with the usual
kernel style.
So with the kvm_vcpu_ad_need_write_protect() change and my proposed hack-a-fix for
kvm_x86_get_mt_mask(), the end result would look like:
bool __make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, u64 *new_spte,
struct rsvd_bits_validate *shadow_rsvd_bits)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
bool wrprot = false;
if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED_MASK;
else if (kvm_mmu_page_ad_need_write_protect(sp))
spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;
/*
* For the EPT case, shadow_present_mask is 0 if hardware
* supports exec-only page table entries. In that case,
* ACC_USER_MASK and shadow_user_mask are used to represent
* read access. See FNAME(gpte_access) in paging_tmpl.h.
*/
spte |= shadow_present_mask;
if (!prefetch)
spte |= spte_shadow_accessed_mask(spte);
if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
is_nx_huge_page_enabled()) {
pte_access &= ~ACC_EXEC_MASK;
}
if (pte_access & ACC_EXEC_MASK)
spte |= shadow_x_mask;
else
spte |= shadow_nx_mask;
if (pte_access & ACC_USER_MASK)
spte |= shadow_user_mask;
if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
if (tdp_enabled)
spte |= static_call(kvm_x86_get_mt_mask)(kvm, gfn,
kvm_is_mmio_pfn(pfn));
if (host_writable)
spte |= shadow_host_writable_mask;
else
pte_access &= ~ACC_WRITE_MASK;
if (!kvm_is_mmio_pfn(pfn))
spte |= shadow_me_mask;
spte |= (u64)pfn << PAGE_SHIFT;
if (pte_access & ACC_WRITE_MASK) {
spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;
/*
* Optimization: for pte sync, if spte was writable the hash
* lookup is unnecessary (and expensive). Write protection
* is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
* Same reasoning can be applied to dirty page accounting.
*/
if (is_writable_pte(old_spte))
goto out;
/*
* Unsync shadow pages that are reachable by the new, writable
* SPTE. Write-protect the SPTE if the page can't be unsync'd,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
wrprot = true;
pte_access &= ~ACC_WRITE_MASK;
spte &= ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
}
}
if (pte_access & ACC_WRITE_MASK)
spte |= spte_shadow_dirty_mask(spte);
out:
if (prefetch)
spte = mark_spte_for_access_track(spte);
WARN_ONCE(is_rsvd_spte(shadow_rsvd_bits, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
get_rsvd_bits(shadow_rsvd_bits, spte, level));
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON(level > PG_LEVEL_4K);
mark_page_dirty_in_slot(kvm, slot, gfn);
}
*new_spte = spte;
return wrprot;
}
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
bool host_writable, u64 *new_spte)
{
return __make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable, new_spte,
&vcpu->arch.mmu->shadow_zero_check);
}
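For completeness, a sketch of the kind of vCPU-less caller this is building
toward, e.g. in-place huge page promotion when dirty logging is disabled.
Everything below other than __make_spte() itself (the caller name,
tdp_shadow_rsvd_bits, and the argument choices) is an assumption:

/* Assumed to be filled once at MMU init with the non-nested TDP reserved bits. */
static struct rsvd_bits_validate tdp_shadow_rsvd_bits;

static u64 make_promoted_huge_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
				   struct kvm_memory_slot *slot,
				   gfn_t gfn, kvm_pfn_t pfn)
{
	u64 new_spte = 0;

	/*
	 * No vCPU anywhere: the A/D policy comes from sp->role and the
	 * reserved bits from the precomputed per-VM table.
	 */
	__make_spte(kvm, sp, slot, ACC_ALL, gfn, pfn, 0 /* old_spte */,
		    false /* prefetch */, false /* can_unsync */,
		    true /* host_writable */, &new_spte, &tdp_shadow_rsvd_bits);

	return new_spte;
}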
On 11/18/21 17:37, Sean Christopherson wrote:
>> It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
>> thought, but at minimum it means there's no need to recalc the reserved bits.
>
> Ok, I think my final vote is to have the reserved bits passed in, but with the
> non-nested TDP reserved bits being computed at MMU init.
Yes, and that's also where I was going with the idea of moving part of
the "direct" MMU (man, naming these things is so hard) to struct kvm:
split the per-vCPU state from the constant one and initialize the latter
just once. Though perhaps I was putting the cart slightly before the horse.
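A rough sketch of what that split could look like, with every name below
invented for illustration:

/*
 * Per-VM, constant MMU state, filled once at VM init; only truly per-vCPU
 * state would remain in struct kvm_mmu.
 */
struct kvm_mmu_vm_const {
	/* Reserved SPTE bits for non-nested TDP, same for every vCPU. */
	struct rsvd_bits_validate shadow_zero_check;
	/* Root level used for all non-nested TDP pages. */
	int tdp_root_level;
};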
On the topic of naming, we have a lot of things to name:
- the two MMU codebases: you Googlers are trying to grandfather "legacy"
and "TDP" into upstream, but that's not a great name because the former
is used also when shadowing EPT/NPT. I'm thinking of standardizing on
"shadow" and "TDP" (it's not perfect because of the 32-bit and tdp_mmu=0
cases, but it's a start). Maybe even split parts of mmu.c out into
shadow_mmu.c.
- the two walkers (I'm quite convinced of splitting that part out of
struct kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it
can be walk01 and walk12 with "walk" pointing to one of them
- the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much
less confusing and we can keep those names.
Paolo
On Wed, Nov 17, 2021 at 6:12 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Nov 10, 2021, Ben Gardon wrote:
> > When running a nested VM, KVM write protects SPTEs in the EPT/NPT02
> > instead of using PML for dirty tracking. This avoids expensive
> > translation later, when emptying the Page Modification Log. In service
> > of removing the vCPU pointer from make_spte, factor the check for nested
> > PML out of the function.
>
> Aha! The dependency on @vcpu can be avoided without having to take a flag from
> the caller. The shadow page has everything we need. The check is really "is this
> a page for L2 EPT". The kvm_x86_ops.cpu_dirty_log_size gets us the EPT part, and
> kvm_mmu_page.guest_mode gets us the L2 part.
Haha that's way cleaner than what I was doing! Seems like an obvious
solution in retrospect. I'll include this in the next version of the
series I send out unless Paolo beats me to it and just merges it directly.
Happy to give this my reviewed-by.
>
> Compile tested only...
>
> From 773414e4fd7010c38ac89221d16089f3dcc57467 Mon Sep 17 00:00:00 2001
> From: Sean Christopherson <[email protected]>
> Date: Wed, 17 Nov 2021 18:08:42 -0800
> Subject: [PATCH] KVM: x86/mmu: Use shadow page role to detect PML-unfriendly
> pages for L2
>
> Rework make_spte() to query the shadow page's role, specifically whether
> or not it's a guest_mode page, a.k.a. a page for L2, when determining if
> the SPTE is compatible with PML. This eliminates a dependency on @vcpu,
> with a future goal of being able to create SPTEs without a specific vCPU.
>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 7 +++----
> arch/x86/kvm/mmu/spte.c | 2 +-
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8ede43a826af..03882b2624c8 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -109,7 +109,7 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> return kvm_mmu_role_as_id(sp->role);
> }
>
> -static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
> +static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> {
> /*
> * When using the EPT page-modification log, the GPAs in the CPU dirty
> @@ -117,10 +117,9 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
> * on write protection to record dirty pages, which bypasses PML, since
> * writes now result in a vmexit. Note, the check on CPU dirty logging
> * being enabled is mandatory as the bits used to denote WP-only SPTEs
> - * are reserved for NPT w/ PAE (32-bit KVM).
> + * are reserved for PAE paging (32-bit KVM).
> */
> - return vcpu->arch.mmu == &vcpu->arch.guest_mmu &&
> - kvm_x86_ops.cpu_dirty_log_size;
> + return kvm_x86_ops.cpu_dirty_log_size && sp->role.guest_mode;
> }
>
> int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 0c76c45fdb68..84e64dbdd89e 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -101,7 +101,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>
> if (sp->role.ad_disabled)
> spte |= SPTE_TDP_AD_DISABLED_MASK;
> - else if (kvm_vcpu_ad_need_write_protect(vcpu))
> + else if (kvm_mmu_page_ad_need_write_protect(sp))
> spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;
>
> /*
> --
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/18/21 17:37, Sean Christopherson wrote:
> > > It's a bit ugly in that we'd pass both @kvm and @vcpu, so that needs some more
> > > thought, but at minimum it means there's no need to recalc the reserved bits.
> >
> > Ok, I think my final vote is to have the reserved bits passed in, but with the
> > non-nested TDP reserved bits being computed at MMU init.
>
> Yes, and that's also where I was going with the idea of moving part of the
> "direct" MMU (man, naming these things is so hard) to struct kvm: split the
> per-vCPU state from the constant one and initialize the latter just once.
> Though perhaps I was putting the cart slightly before the horse.
>
> On the topic of naming, we have a lot of things to name:
>
> - the two MMU codebases: you Googlers are trying to grandfather "legacy" and
> "TDP" into upstream
Heh, I think that's like 99.9% me.
> but that's not a great name because the former is used also when shadowing
> EPT/NPT. I'm thinking of standardizing on "shadow" and "TDP" (it's not
> perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start). Maybe
> even split parts of mmu.c out into shadow_mmu.c.
But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
MMU.
> - the two walkers (I'm quite convinced of splitting that part out of struct
> kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it can be
> walk01 and walk12 with "walk" pointing to one of them
I am all in favor of walk01 and walk12, the guest_mmu vs. nested_mmu confusion
is painful.
> - the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much less
> confusing and we can keep those names.
I would prefer root_mmu and nested_tdp_mmu. guest_mmu is misleading because it's
not used for all cases of sp->role.guest_mode=1, i.e. when L1 is not using TDP
then guest_mode=1 but KVM isn't using guest_mmu.
On 11/18/21 18:43, Ben Gardon wrote:
>> Aha! The dependency on @vcpu can be avoided without having to take a flag from
>> the caller. The shadow page has everything we need. The check is really "is this
>> a page for L2 EPT". The kvm_x86_ops.cpu_dirty_log_size gets us the EPT part, and
>> kvm_mmu_page.guest_mode gets us the L2 part.
>
> Haha that's way cleaner than what I was doing! Seems like an obvious
> solution in retrospect. I'll include this in the next version of the
> series I send out unless Paolo beats me to it and just merges it directly.
> Happy to give this my reviewed-by.
Yeah, I am including the early cleanup parts because it makes no sense
to hold off on them; and Sean's patch qualifies as well.
I can't blame you for not remembering role.guest_mode. Jim added it for
a decidedly niche reason:
commit 1313cc2bd8f6568dd8801feef446afbe43e6d313
Author: Jim Mattson <[email protected]>
Date: Wed May 9 17:02:04 2018 -0400
kvm: mmu: Add guest_mode to kvm_mmu_page_role
L1 and L2 need to have disjoint mappings, so that L1's APIC access
page (under VMX) can be omitted from L2's mappings.
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
though it's actually gotten a lot more important than just that:
commit 992edeaefed682511bd173dabd2f54b1ce5387df
Author: Liran Alon <[email protected]>
Date: Wed Nov 20 14:24:52 2019 +0200
KVM: nVMX: Assume TLB entries of L1 and L2 are tagged differently if L0 use EPT
Since commit 1313cc2bd8f6 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role"),
guest_mode was added to mmu-role and therefore if L0 use EPT, it will
always run L1 and L2 with different EPTP. i.e. EPTP01!=EPTP02.
Because TLB entries are tagged with EP4TA, KVM can assume
TLB entries populated while running L2 are tagged differently
than TLB entries populated while running L1.
Paolo
On 11/18/21 19:02, Sean Christopherson wrote:
>> but that's not a great name because the former is used also when shadowing
>> EPT/NPT. I'm thinking of standardizing on "shadow" and "TDP" (it's not
>> perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start). Maybe
>> even split parts of mmu.c out into shadow_mmu.c.
> But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
> MMU.
Yeah, that's true. "full" MMU? :)
>> - the two walkers (I'm quite convinced of splitting that part out of struct
>> kvm_mmu and getting rid of walk_mmu/nested_mmu): that's easy, it can be
>> walk01 and walk12 with "walk" pointing to one of them
>
> I am all in favor of walk01 and walk12, the guest_mmu vs. nested_mmu confusion
> is painful.
>
>> - the two MMUs: with nested_mmu gone, root_mmu and guest_mmu are much less
>> confusing and we can keep those names.
>
> I would prefer root_mmu and nested_tdp_mmu. guest_mmu is misleading because it's
> not used for all cases of sp->role.guest_mode=1, i.e. when L1 is not using TDP
> then guest_mode=1 but KVM isn't using guest_mmu.
Ok, that sounds good too.
Paolo
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/18/21 19:02, Sean Christopherson wrote:
> > > but that's not a great name because the former is used also when shadowing
> > > EPT/NPT. I'm thinking of standardizing on "shadow" and "TDP" (it's not
> > > perfect because of the 32-bit and tdp_mmu=0 cases, but it's a start). Maybe
> > > even split parts of mmu.c out into shadow_mmu.c.
> > But shadow is flat out wrong until EPT and NPT support is ripped out of the "legacy"
> > MMU.
>
> Yeah, that's true. "full" MMU? :)
Or we could just rip out non-nested TDP support from the legacy MMU and call it
the shadow MMU :-)