2021-04-01 23:38:37

by Ben Gardon

Subject: [PATCH v2 00/13] More parallel operations for the TDP MMU

Now that the TDP MMU is able to handle page faults in parallel, it's a
relatively small change to expand to other operations. This series allows
zapping a range of GFNs, reclaiming collapsible SPTEs (when disabling
dirty logging), and enabling dirty logging to all happen under the MMU
lock in read mode.

This is partly a cleanup + rewrite of the last few patches of the parallel
page faults series. I've incorporated feedback from Sean and Paolo, but
the patches have changed so much that I'm sending this as a separate
series.

Ran kvm-unit-tests + selftests on an SMP kernel + Intel Skylake, with the
TDP MMU enabled and disabled. This series introduces no new failures or
warnings.

I know this will conflict horribly with the patches from Sean's series
which were just queued, and I'll send a v2 to fix those conflicts +
address any feedback on this v1.

Changelog
v2:
-- Rebased patches on top of kvm/queue to incorporate Sean's recent
TLB flushing changes
-- Dropped patch 5: "KVM: x86/mmu: comment for_each_tdp_mmu_root
requires MMU write lock" as the following patch to protect the roots
list with RCU adds lockdep which makes the comment somewhat redundant.

Ben Gardon (13):
KVM: x86/mmu: Re-add const qualifier in
kvm_tdp_mmu_zap_collapsible_sptes
KVM: x86/mmu: Move kvm_mmu_(get|put)_root to TDP MMU
KVM: x86/mmu: use tdp_mmu_free_sp to free roots
KVM: x86/mmu: Merge TDP MMU put and free root
KVM: x86/mmu: Refactor yield safe root iterator
KVM: x86/mmu: Make TDP MMU root refcount atomic
KVM: x86/mmu: handle cmpxchg failure in kvm_tdp_mmu_get_root
KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU
KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read
lock
KVM: x86/mmu: Fast invalidation for TDP MMU
KVM: x86/mmu: Tear down roots in fast invalidation thread

arch/x86/include/asm/kvm_host.h | 21 +-
arch/x86/kvm/mmu/mmu.c | 115 +++++++---
arch/x86/kvm/mmu/mmu_internal.h | 27 +--
arch/x86/kvm/mmu/tdp_mmu.c | 375 +++++++++++++++++++++++---------
arch/x86/kvm/mmu/tdp_mmu.h | 28 ++-
include/linux/kvm_host.h | 2 +-
6 files changed, 407 insertions(+), 161 deletions(-)

--
2.31.0.208.g409f899ff0-goog


2021-04-01 23:38:48

by Ben Gardon

Subject: [PATCH v2 03/13] KVM: x86/mmu: use tdp_mmu_free_sp to free roots

Minor cleanup to deduplicate the code used to free a struct kvm_mmu_page
in the TDP MMU.

No functional change intended.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6f612ac755a0..320cc4454737 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -92,6 +92,12 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush);

+static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+{
+ free_page((unsigned long)sp->spt);
+ kmem_cache_free(mmu_page_header_cache, sp);
+}
+
void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
@@ -105,8 +111,7 @@ void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)

zap_gfn_range(kvm, root, 0, max_gfn, false, false);

- free_page((unsigned long)root->spt);
- kmem_cache_free(mmu_page_header_cache, root);
+ tdp_mmu_free_sp(root);
}

static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
@@ -168,12 +173,6 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
return __pa(root->spt);
}

-static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
-{
- free_page((unsigned long)sp->spt);
- kmem_cache_free(mmu_page_header_cache, sp);
-}
-
/*
* This is called through call_rcu in order to free TDP page table memory
* safely with respect to other kernel threads that may be operating on
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:38:55

by Ben Gardon

Subject: [PATCH v2 04/13] KVM: x86/mmu: Merge TDP MMU put and free root

kvm_tdp_mmu_put_root and kvm_tdp_mmu_free_root are always called
together, so merge the functions to simplify TDP MMU root refcounting /
freeing.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 +--
arch/x86/kvm/mmu/tdp_mmu.c | 54 ++++++++++++++++++--------------------
arch/x86/kvm/mmu/tdp_mmu.h | 10 +------
3 files changed, 28 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9c7ef7ca8bf6..47d996a8074f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3153,8 +3153,8 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,

sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);

- if (is_tdp_mmu_page(sp) && kvm_tdp_mmu_put_root(kvm, sp))
- kvm_tdp_mmu_free_root(kvm, sp);
+ if (is_tdp_mmu_page(sp))
+ kvm_tdp_mmu_put_root(kvm, sp);
else if (!--sp->root_count && sp->role.invalid)
kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 320cc4454737..279a725061f7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -41,10 +41,31 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
rcu_barrier();
}

-static void tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush);
+
+static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
- if (kvm_tdp_mmu_put_root(kvm, root))
- kvm_tdp_mmu_free_root(kvm, root);
+ free_page((unsigned long)sp->spt);
+ kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ if (--root->root_count)
+ return;
+
+ WARN_ON(!root->tdp_mmu_page);
+
+ list_del(&root->link);
+
+ zap_gfn_range(kvm, root, 0, max_gfn, false, false);
+
+ tdp_mmu_free_sp(root);
}

static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
@@ -66,7 +87,7 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
struct kvm_mmu_page *next_root;

next_root = list_next_entry(root, link);
- tdp_mmu_put_root(kvm, root);
+ kvm_tdp_mmu_put_root(kvm, root);
return next_root;
}

@@ -89,31 +110,6 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush);
-
-static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
-{
- free_page((unsigned long)sp->spt);
- kmem_cache_free(mmu_page_header_cache, sp);
-}
-
-void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root)
-{
- gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
-
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- WARN_ON(root->root_count);
- WARN_ON(!root->tdp_mmu_page);
-
- list_del(&root->link);
-
- zap_gfn_range(kvm, root, 0, max_gfn, false, false);
-
- tdp_mmu_free_sp(root);
-}
-
static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
int level)
{
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index c9a081c786a5..d4e32ac5f4c9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -6,7 +6,6 @@
#include <linux/kvm_host.h>

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
-void kvm_tdp_mmu_free_root(struct kvm *kvm, struct kvm_mmu_page *root);

static inline void kvm_tdp_mmu_get_root(struct kvm *kvm,
struct kvm_mmu_page *root)
@@ -17,14 +16,7 @@ static inline void kvm_tdp_mmu_get_root(struct kvm *kvm,
++root->root_count;
}

-static inline bool kvm_tdp_mmu_put_root(struct kvm *kvm,
- struct kvm_mmu_page *root)
-{
- lockdep_assert_held(&kvm->mmu_lock);
- --root->root_count;
-
- return !root->root_count;
-}
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);

bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
gfn_t end, bool can_yield, bool flush);
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:39:02

by Ben Gardon

Subject: [PATCH v2 05/13] KVM: x86/mmu: Refactor yield safe root iterator

Refactor the yield safe TDP MMU root iterator to be more amenable to
changes in future commits which will allow it to be used under the MMU
lock in read mode. Currently the iterator requires a complicated dance
between the helper functions and different parts of the for loop which
makes it hard to reason about. Moving all the logic into a single function
simplifies the iterator substantially.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 45 ++++++++++++++++++++++----------------
1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 279a725061f7..670c5e3ad80e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -68,26 +68,34 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
tdp_mmu_free_sp(root);
}

-static inline bool tdp_mmu_next_root_valid(struct kvm *kvm,
- struct kvm_mmu_page *root)
+/*
+ * Finds the next valid root after root (or the first valid root if root
+ * is NULL), takes a reference on it, and returns that next root. If root
+ * is not NULL, this thread should have already taken a reference on it, and
+ * that reference will be dropped. If no valid root is found, this
+ * function will return NULL.
+ */
+static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
+ struct kvm_mmu_page *prev_root)
{
- lockdep_assert_held_write(&kvm->mmu_lock);
+ struct kvm_mmu_page *next_root;

- if (list_entry_is_head(root, &kvm->arch.tdp_mmu_roots, link))
- return false;
+ lockdep_assert_held_write(&kvm->mmu_lock);

- kvm_tdp_mmu_get_root(kvm, root);
- return true;
+ if (prev_root)
+ next_root = list_next_entry(prev_root, link);
+ else
+ next_root = list_first_entry(&kvm->arch.tdp_mmu_roots,
+ typeof(*next_root), link);

-}
+ if (list_entry_is_head(next_root, &kvm->arch.tdp_mmu_roots, link))
+ next_root = NULL;
+ else
+ kvm_tdp_mmu_get_root(kvm, next_root);

-static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
- struct kvm_mmu_page *root)
-{
- struct kvm_mmu_page *next_root;
+ if (prev_root)
+ kvm_tdp_mmu_put_root(kvm, prev_root);

- next_root = list_next_entry(root, link);
- kvm_tdp_mmu_put_root(kvm, root);
return next_root;
}

@@ -98,11 +106,10 @@ static inline struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* recent root. (Unless keeping a live reference is desirable.)
*/
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
- for (_root = list_first_entry(&_kvm->arch.tdp_mmu_roots, \
- typeof(*_root), link); \
- tdp_mmu_next_root_valid(_kvm, _root); \
- _root = tdp_mmu_next_root(_kvm, _root)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+ for (_root = tdp_mmu_next_root(_kvm, NULL); \
+ _root; \
+ _root = tdp_mmu_next_root(_kvm, _root)) \
+ if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:39:20

by Ben Gardon

Subject: [PATCH v2 08/13] KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU

Protect the contents of the TDP MMU roots list with RCU in preparation
for a future patch which will allow the iterator macro to be used under
the MMU lock in read mode.

Signed-off-by: Ben Gardon <[email protected]>
---

Changelog
v2:
-- add lockdep condition for tdp_mmu_pages_lock to for_each_tdp_mmu_root
-- fix problem with unexported lockdep function
-- updated comments in kvm_host.h
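
For intuition, here is a rough userspace model of the new locking rules,
assuming liburcu is available (compile with -pthread, link with -lurcu); the
struct and lock names are illustrative, not the kernel's. Readers walk the
list inside an RCU read-side critical section, writers serialize list updates
with a spinlock, and the memory is only freed after a grace period.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <urcu.h>               /* rcu_read_lock(), synchronize_rcu(), ... */
#include <urcu/list.h>          /* CDS_LIST_HEAD() */
#include <urcu/rculist.h>       /* cds_list_add_rcu(), cds_list_for_each_entry_rcu() */

struct root {
    int id;
    struct cds_list_head link;
};

static CDS_LIST_HEAD(roots);            /* plays the role of tdp_mmu_roots */
static pthread_spinlock_t roots_lock;   /* plays the role of tdp_mmu_pages_lock */

static void add_root(struct root *r)
{
    /* Writers: take the spinlock and use the RCU-aware list helpers. */
    pthread_spin_lock(&roots_lock);
    cds_list_add_rcu(&r->link, &roots);
    pthread_spin_unlock(&roots_lock);
}

static void remove_and_free_root(struct root *r)
{
    pthread_spin_lock(&roots_lock);
    cds_list_del_rcu(&r->link);
    pthread_spin_unlock(&roots_lock);

    synchronize_rcu();          /* wait for concurrent readers to finish */
    free(r);                    /* the kernel defers this via call_rcu() instead */
}

static void walk_roots(void)
{
    struct root *r;

    /* Readers: nothing needed beyond an RCU read-side critical section. */
    rcu_read_lock();
    cds_list_for_each_entry_rcu(r, &roots, link)
        printf("saw root %d\n", r->id);
    rcu_read_unlock();
}

int main(void)
{
    struct root *r = calloc(1, sizeof(*r));

    pthread_spin_init(&roots_lock, PTHREAD_PROCESS_PRIVATE);
    rcu_register_thread();      /* every thread touching RCU must register */

    r->id = 1;
    add_root(r);
    walk_roots();
    remove_and_free_root(r);

    rcu_unregister_thread();
    return 0;
}

The kernel version defers the free with call_rcu() rather than blocking in
synchronize_rcu(), but the reader/writer rules have the same shape.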

arch/x86/include/asm/kvm_host.h | 21 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 69 +++++++++++++++++++--------------
2 files changed, 55 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 99778ac51243..e02e8b8a875b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1050,25 +1050,36 @@ struct kvm_arch {
bool tdp_mmu_enabled;

/*
- * List of struct kvmp_mmu_pages being used as roots.
+ * List of struct kvm_mmu_pages being used as roots.
* All struct kvm_mmu_pages in the list should have
* tdp_mmu_page set.
- * All struct kvm_mmu_pages in the list should have a positive
- * root_count except when a thread holds the MMU lock and is removing
- * an entry from the list.
+ *
+ * For reads, this list is protected by:
+ * the MMU lock in read mode + RCU or
+ * the MMU lock in write mode
+ *
+ * For writes, this list is protected by:
+ * the MMU lock in read mode + the tdp_mmu_pages_lock or
+ * the MMU lock in write mode
+ *
+ * Roots will remain in the list until their tdp_mmu_root_count
+ * drops to zero, at which point the thread that decremented the
+ * count to zero should remove the root from the list and clean
+ * it up, freeing the root after an RCU grace period.
*/
struct list_head tdp_mmu_roots;

/*
* List of struct kvmp_mmu_pages not being used as roots.
* All struct kvm_mmu_pages in the list should have
- * tdp_mmu_page set and a root_count of 0.
+ * tdp_mmu_page set and a tdp_mmu_root_count of 0.
*/
struct list_head tdp_mmu_pages;

/*
* Protects accesses to the following fields when the MMU lock
* is held in read mode:
+ * - tdp_mmu_roots (above)
* - tdp_mmu_pages (above)
* - the link field of struct kvm_mmu_pages used by the TDP MMU
* - lpage_disallowed_mmu_pages
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 886bc170f2a5..c1d7f6b86870 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -50,6 +50,22 @@ static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
kmem_cache_free(mmu_page_header_cache, sp);
}

+/*
+ * This is called through call_rcu in order to free TDP page table memory
+ * safely with respect to other kernel threads that may be operating on
+ * the memory.
+ * By only accessing TDP MMU page table memory in an RCU read critical
+ * section, and freeing it after a grace period, lockless access to that
+ * memory won't use it after it is freed.
+ */
+static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
+{
+ struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
+ rcu_head);
+
+ tdp_mmu_free_sp(sp);
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
@@ -61,11 +77,13 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)

WARN_ON(!root->tdp_mmu_page);

- list_del(&root->link);
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_del_rcu(&root->link);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

zap_gfn_range(kvm, root, 0, max_gfn, false, false);

- tdp_mmu_free_sp(root);
+ call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}

/*
@@ -82,18 +100,21 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,

lockdep_assert_held_write(&kvm->mmu_lock);

+ rcu_read_lock();
+
if (prev_root)
- next_root = list_next_entry(prev_root, link);
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &prev_root->link,
+ typeof(*prev_root), link);
else
- next_root = list_first_entry(&kvm->arch.tdp_mmu_roots,
- typeof(*next_root), link);
+ next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ typeof(*next_root), link);

- while (!list_entry_is_head(next_root, &kvm->arch.tdp_mmu_roots, link) &&
- !kvm_tdp_mmu_get_root(kvm, next_root))
- next_root = list_next_entry(next_root, link);
+ while (next_root && !kvm_tdp_mmu_get_root(kvm, next_root))
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &next_root->link, typeof(*next_root), link);

- if (list_entry_is_head(next_root, &kvm->arch.tdp_mmu_roots, link))
- next_root = NULL;
+ rcu_read_unlock();

if (prev_root)
kvm_tdp_mmu_put_root(kvm, prev_root);
@@ -107,15 +128,17 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* if exiting the loop early, the caller must drop the reference to the most
* recent root. (Unless keeping a live reference is desirable.)
*/
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
for (_root = tdp_mmu_next_root(_kvm, NULL); \
_root; \
_root = tdp_mmu_next_root(_kvm, _root)) \
if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

-#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
- list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
+#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
+ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link, \
+ lockdep_is_held_type(&kvm->mmu_lock, 0) || \
+ lockdep_is_held(&kvm->arch.tdp_mmu_pages_lock)) \
if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

@@ -171,28 +194,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
refcount_set(&root->tdp_mmu_root_count, 1);

- list_add(&root->link, &kvm->arch.tdp_mmu_roots);
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

out:
return __pa(root->spt);
}

-/*
- * This is called through call_rcu in order to free TDP page table memory
- * safely with respect to other kernel threads that may be operating on
- * the memory.
- * By only accessing TDP MMU page table memory in an RCU read critical
- * section, and freeing it after a grace period, lockless access to that
- * memory won't use it after it is freed.
- */
-static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
-{
- struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
- rcu_head);
-
- tdp_mmu_free_sp(sp);
-}
-
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
bool shared);
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:39:46

by Ben Gardon

Subject: [PATCH v2 07/13] KVM: x86/mmu: handle cmpxchg failure in kvm_tdp_mmu_get_root

To reduce dependence on the MMU write lock, don't rely on the assumption
that the atomic operation in kvm_tdp_mmu_get_root will always succeed.
By not relying on that assumption, threads do not need to hold the MMU
lock in write mode in order to take a reference on a TDP MMU root.

In the root iterator, this change means that some roots might have to be
skipped if they are found to have a zero refcount. That cannot yet
happen as of this patch, but a future patch will need that flexibility to
make the root iterator safe under the MMU read lock.

Signed-off-by: Ben Gardon <[email protected]>
---
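A minimal userspace sketch of the new get/put contract, for illustration
only: C11 atomics stand in for refcount_t and all names are made up. It
models a get that can fail and a walker that skips roots whose count has
already reached zero.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct root {
    atomic_int refcount;        /* 0 means the root is being torn down */
    int id;
};

/* Modeled on refcount_inc_not_zero(): only succeeds if the count is nonzero. */
static bool get_root(struct root *r)
{
    int old = atomic_load(&r->refcount);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&r->refcount, &old, old + 1))
            return true;        /* reference taken */
        /* the failed CAS reloaded 'old'; retry unless it dropped to zero */
    }
    return false;               /* root already dying, caller must skip it */
}

/* Modeled on refcount_dec_and_test(): true when the last reference is dropped. */
static bool put_root(struct root *r)
{
    return atomic_fetch_sub(&r->refcount, 1) == 1;
}

int main(void)
{
    struct root roots[2] = {
        { .refcount = 1, .id = 0 },
        { .refcount = 0, .id = 1 },     /* already being torn down */
    };

    for (int i = 0; i < 2; i++) {
        if (!get_root(&roots[i])) {     /* mirrors the iterator skipping roots */
            printf("skipping dead root %d\n", roots[i].id);
            continue;
        }
        printf("using root %d\n", roots[i].id);
        if (put_root(&roots[i]))
            printf("dropped the last reference to root %d\n", roots[i].id);
    }
    return 0;
}

The kernel's refcount_t additionally saturates on overflow; the sketch only
models the inc-not-zero / dec-and-test behavior this patch relies on.
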
arch/x86/kvm/mmu/tdp_mmu.c | 11 ++++++-----
arch/x86/kvm/mmu/tdp_mmu.h | 13 +++----------
2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 697ea882a3e4..886bc170f2a5 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -88,10 +88,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
next_root = list_first_entry(&kvm->arch.tdp_mmu_roots,
typeof(*next_root), link);

+ while (!list_entry_is_head(next_root, &kvm->arch.tdp_mmu_roots, link) &&
+ !kvm_tdp_mmu_get_root(kvm, next_root))
+ next_root = list_next_entry(next_root, link);
+
if (list_entry_is_head(next_root, &kvm->arch.tdp_mmu_roots, link))
next_root = NULL;
- else
- kvm_tdp_mmu_get_root(kvm, next_root);

if (prev_root)
kvm_tdp_mmu_put_root(kvm, prev_root);
@@ -161,10 +163,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)

/* Check for an existing root before allocating a new one. */
for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
- if (root->role.word == role.word) {
- kvm_tdp_mmu_get_root(kvm, root);
+ if (root->role.word == role.word &&
+ kvm_tdp_mmu_get_root(kvm, root))
goto out;
- }
}

root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1ec7914ecff9..f0a26214e999 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,17 +7,10 @@

hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);

-static inline void kvm_tdp_mmu_get_root(struct kvm *kvm,
- struct kvm_mmu_page *root)
+__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
+ struct kvm_mmu_page *root)
{
- lockdep_assert_held_write(&kvm->mmu_lock);
-
- /*
- * This should never fail since roots are removed from the roots
- * list under the MMU write lock when their reference count falls
- * to zero.
- */
- refcount_inc_not_zero(&root->tdp_mmu_root_count);
+ return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}

void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:39:56

by Ben Gardon

Subject: [PATCH v2 09/13] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock

To reduce lock contention and interference with page fault handlers,
allow the TDP MMU function which zaps a GFN range to operate under the
MMU read lock.

Signed-off-by: Ben Gardon <[email protected]>
---
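As a rough userspace analogy for the new "shared" plumbing, purely
illustrative: a pthread rwlock stands in for the MMU lock and sched_yield()
for cond_resched() (compile with -pthread).

#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;

/*
 * Model of tdp_mmu_iter_cond_resched(): give up the lock so other threads
 * (e.g. page fault handlers) can make progress, then reacquire it in the
 * same mode the caller held it in.
 */
static void yield_mmu_lock(bool shared)
{
    pthread_rwlock_unlock(&mmu_lock);
    sched_yield();
    if (shared)
        pthread_rwlock_rdlock(&mmu_lock);   /* MMU lock in read mode */
    else
        pthread_rwlock_wrlock(&mmu_lock);   /* MMU lock in write mode */
}

static void zap_range(bool shared, int start, int end)
{
    for (int gfn = start; gfn < end; gfn++) {
        if (gfn && gfn % 1024 == 0)
            yield_mmu_lock(shared);         /* periodically yield */
        /* ... the per-GFN zap work would go here ... */
    }
}

int main(void)
{
    pthread_rwlock_rdlock(&mmu_lock);       /* shared == true path */
    zap_range(true, 0, 4096);
    pthread_rwlock_unlock(&mmu_lock);

    pthread_rwlock_wrlock(&mmu_lock);       /* shared == false path */
    zap_range(false, 0, 4096);
    pthread_rwlock_unlock(&mmu_lock);

    return 0;
}

The only point is that the yield helper has to reacquire the lock in whatever
mode the caller held it in; the zapping logic itself is unchanged.
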
arch/x86/kvm/mmu/mmu.c | 22 +++++---
arch/x86/kvm/mmu/tdp_mmu.c | 111 ++++++++++++++++++++++++++-----------
arch/x86/kvm/mmu/tdp_mmu.h | 14 +++--
3 files changed, 102 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 47d996a8074f..d03a7a8b7ea2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3154,7 +3154,7 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);

if (is_tdp_mmu_page(sp))
- kvm_tdp_mmu_put_root(kvm, sp);
+ kvm_tdp_mmu_put_root(kvm, sp, false);
else if (!--sp->root_count && sp->role.invalid)
kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);

@@ -5507,16 +5507,24 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
}
}

- if (is_tdp_mmu_enabled(kvm)) {
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
- gfn_end, flush);
- }
-
if (flush)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);

write_unlock(&kvm->mmu_lock);
+
+ if (is_tdp_mmu_enabled(kvm)) {
+ flush = false;
+
+ read_lock(&kvm->mmu_lock);
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+ flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
+ gfn_end, flush, true);
+ if (flush)
+ kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
+ gfn_end);
+
+ read_unlock(&kvm->mmu_lock);
+ }
}

static bool slot_rmap_write_protect(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c1d7f6b86870..6917403484ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -27,6 +27,15 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
}

+static __always_inline void kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
+ bool shared)
+{
+ if (shared)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ else
+ lockdep_assert_held_write(&kvm->mmu_lock);
+}
+
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
if (!kvm->arch.tdp_mmu_enabled)
@@ -42,7 +51,8 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
}

static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush);
+ gfn_t start, gfn_t end, bool can_yield, bool flush,
+ bool shared);

static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
@@ -66,11 +76,12 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}

-void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared)
{
gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);

- lockdep_assert_held_write(&kvm->mmu_lock);
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);

if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;
@@ -81,7 +92,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
list_del_rcu(&root->link);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

- zap_gfn_range(kvm, root, 0, max_gfn, false, false);
+ zap_gfn_range(kvm, root, 0, max_gfn, false, false, shared);

call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -94,11 +105,11 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
* function will return NULL.
*/
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
- struct kvm_mmu_page *prev_root)
+ struct kvm_mmu_page *prev_root,
+ bool shared)
{
struct kvm_mmu_page *next_root;

- lockdep_assert_held_write(&kvm->mmu_lock);

rcu_read_lock();

@@ -117,7 +128,7 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
rcu_read_unlock();

if (prev_root)
- kvm_tdp_mmu_put_root(kvm, prev_root);
+ kvm_tdp_mmu_put_root(kvm, prev_root, shared);

return next_root;
}
@@ -127,12 +138,16 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
* This makes it safe to release the MMU lock and yield within the loop, but
* if exiting the loop early, the caller must drop the reference to the most
* recent root. (Unless keeping a live reference is desirable.)
+ *
+ * If shared is set, this function is operating under the MMU lock in read
+ * mode. In the unlikely event that this thread must free a root, the lock
+ * will be temporarily dropped and reacquired in write mode.
*/
-#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
- for (_root = tdp_mmu_next_root(_kvm, NULL); \
- _root; \
- _root = tdp_mmu_next_root(_kvm, _root)) \
- if (kvm_mmu_page_as_id(_root) != _as_id) { \
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _shared) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _shared); \
+ _root; \
+ _root = tdp_mmu_next_root(_kvm, _root, _shared)) \
+ if (kvm_mmu_page_as_id(_root) != _as_id) { \
} else

#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
@@ -636,7 +651,8 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
* Return false if a yield was not needed.
*/
static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
- struct tdp_iter *iter, bool flush)
+ struct tdp_iter *iter, bool flush,
+ bool shared)
{
/* Ensure forward progress has been made before yielding. */
if (iter->next_last_level_gfn == iter->yielded_gfn)
@@ -648,7 +664,11 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
if (flush)
kvm_flush_remote_tlbs(kvm);

- cond_resched_rwlock_write(&kvm->mmu_lock);
+ if (shared)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+
rcu_read_lock();

WARN_ON(iter->gfn > iter->next_last_level_gfn);
@@ -666,24 +686,32 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
* non-root pages mapping GFNs strictly within that range. Returns true if
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
+ *
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
* function cannot yield, it will not release the MMU lock or reschedule and
* the caller must ensure it does not supply too large a GFN range, or the
- * operation can cause a soft lockup. Note, in some use cases a flush may be
- * required by prior actions. Ensure the pending flush is performed prior to
- * yielding.
+ * operation can cause a soft lockup.
+ *
+ * If shared is true, this thread holds the MMU lock in read mode and must
+ * account for the possibility that other threads are modifying the paging
+ * structures concurrently. If shared is false, this thread should hold the
+ * MMU lock in write mode.
*/
static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush)
+ gfn_t start, gfn_t end, bool can_yield, bool flush,
+ bool shared)
{
struct tdp_iter iter;

+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
rcu_read_lock();

tdp_root_for_each_pte(iter, root, start, end) {
+retry:
if (can_yield &&
- tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
+ tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
flush = false;
continue;
}
@@ -701,8 +729,17 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
!is_last_spte(iter.old_spte, iter.level))
continue;

- tdp_mmu_set_spte(kvm, &iter, 0);
- flush = true;
+ if (!shared) {
+ tdp_mmu_set_spte(kvm, &iter, 0);
+ flush = true;
+ } else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
+ /*
+ * The iter must explicitly re-read the SPTE because
+ * the atomic cmpxchg failed.
+ */
+ iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ goto retry;
+ }
}

rcu_read_unlock();
@@ -714,14 +751,21 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
* non-root pages mapping GFNs strictly within that range. Returns true if
* SPTEs have been cleared and a TLB flush is needed before releasing the
* MMU lock.
+ *
+ * If shared is true, this thread holds the MMU lock in read mode and must
+ * account for the possibility that other threads are modifying the paging
+ * structures concurrently. If shared is false, this thread should hold the
+ * MMU lock in write mode.
*/
bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
- gfn_t end, bool can_yield, bool flush)
+ gfn_t end, bool can_yield, bool flush,
+ bool shared)
{
struct kvm_mmu_page *root;

- for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = zap_gfn_range(kvm, root, start, end, can_yield, flush);
+ for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, shared)
+ flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
+ shared);

return flush;
}
@@ -733,7 +777,8 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
int i;

for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn, flush);
+ flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn,
+ flush, false);

if (flush)
kvm_flush_remote_tlbs(kvm);
@@ -902,7 +947,7 @@ static __always_inline int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm,
int as_id;

for (as_id = 0; as_id < KVM_ADDRESS_SPACE_NUM; as_id++) {
- for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) {
+ for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
slots = __kvm_memslots(kvm, as_id);
kvm_for_each_memslot(memslot, slots) {
unsigned long hva_start, hva_end;
@@ -942,7 +987,7 @@ static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
struct kvm_mmu_page *root, gfn_t start,
gfn_t end, unsigned long unused)
{
- return zap_gfn_range(kvm, root, start, end, false, false);
+ return zap_gfn_range(kvm, root, start, end, false, false, false);
}

int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
@@ -1103,7 +1148,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
min_level, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
continue;

if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1132,7 +1177,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
struct kvm_mmu_page *root;
bool spte_set = false;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages, min_level);

@@ -1156,7 +1201,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();

tdp_root_for_each_leaf_pte(iter, root, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
continue;

if (spte_ad_need_write_protect(iter.old_spte)) {
@@ -1191,7 +1236,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
struct kvm_mmu_page *root;
bool spte_set = false;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages);

@@ -1278,7 +1323,7 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
rcu_read_lock();

tdp_root_for_each_pte(iter, root, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, flush)) {
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
@@ -1313,7 +1358,7 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
{
struct kvm_mmu_page *root;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
flush = zap_collapsible_spte_range(kvm, root, slot, flush);

return flush;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f0a26214e999..d703c6d6024a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -13,14 +13,18 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}

-void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared);

bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
- gfn_t end, bool can_yield, bool flush);
+ gfn_t end, bool can_yield, bool flush,
+ bool shared);
static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
- gfn_t start, gfn_t end, bool flush)
+ gfn_t start, gfn_t end, bool flush,
+ bool shared)
{
- return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
+ return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
+ shared);
}
static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
@@ -37,7 +41,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
*/
lockdep_assert_held_write(&kvm->mmu_lock);
return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
- sp->gfn, end, false, false);
+ sp->gfn, end, false, false, false);
}
void kvm_tdp_mmu_zap_all(struct kvm *kvm);

--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:39:59

by Ben Gardon

Subject: [PATCH v2 11/13] KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read lock

To reduce lock contention and interference with page fault handlers,
allow the TDP MMU functions which enable and disable dirty logging
to operate under the MMU read lock.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 16 +++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++--------
2 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5939813e3043..a3837f8ad4ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5543,10 +5543,14 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
write_lock(&kvm->mmu_lock);
flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
- if (is_tdp_mmu_enabled(kvm))
- flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
write_unlock(&kvm->mmu_lock);

+ if (is_tdp_mmu_enabled(kvm)) {
+ read_lock(&kvm->mmu_lock);
+ flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
+ read_unlock(&kvm->mmu_lock);
+ }
+
/*
* We can flush all the TLBs out of the mmu lock without TLB
* corruption since we just change the spte from writable to
@@ -5649,10 +5653,14 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,

write_lock(&kvm->mmu_lock);
flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
- if (is_tdp_mmu_enabled(kvm))
- flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
write_unlock(&kvm->mmu_lock);

+ if (is_tdp_mmu_enabled(kvm)) {
+ read_lock(&kvm->mmu_lock);
+ flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
+ read_unlock(&kvm->mmu_lock);
+ }
+
/*
* It's also safe to flush TLBs out of mmu lock here as currently this
* function is only used for dirty logging, in which case flushing TLB
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0e6ffa04e5e1..501722a524a7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -496,8 +496,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
}

/*
- * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically and handle the
- * associated bookkeeping
+ * tdp_mmu_set_spte_atomic_no_dirty_log - Set a TDP MMU SPTE atomically
+ * and handle the associated bookkeeping, but do not mark the page dirty
+ * in KVM's dirty bitmaps.
*
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
@@ -505,9 +506,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
* Returns: true if the SPTE was set, false if it was not. If false is returned,
* this function will have no side-effects.
*/
-static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
- struct tdp_iter *iter,
- u64 new_spte)
+static inline bool tdp_mmu_set_spte_atomic_no_dirty_log(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
{
lockdep_assert_held_read(&kvm->mmu_lock);

@@ -522,12 +523,25 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
new_spte) != iter->old_spte)
return false;

- handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
- new_spte, iter->level, true);
+ __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+ new_spte, iter->level, true);
+ handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);

return true;
}

+static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, iter, new_spte))
+ return false;
+
+ handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ return true;
+}
+
static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
@@ -1148,7 +1162,8 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
min_level, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;

if (!is_shadow_present_pte(iter.old_spte) ||
@@ -1158,7 +1173,15 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,

new_spte = iter.old_spte & ~PT_WRITABLE_MASK;

- tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+ if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, &iter,
+ new_spte)) {
+ /*
+ * The iter must explicitly re-read the SPTE because
+ * the atomic cmpxchg failed.
+ */
+ iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ goto retry;
+ }
spte_set = true;
}

@@ -1177,7 +1200,9 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
struct kvm_mmu_page *root;
bool spte_set = false;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages, min_level);

@@ -1201,7 +1226,8 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();

tdp_root_for_each_leaf_pte(iter, root, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;

if (spte_ad_need_write_protect(iter.old_spte)) {
@@ -1216,7 +1242,15 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
}

- tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+ if (!tdp_mmu_set_spte_atomic_no_dirty_log(kvm, &iter,
+ new_spte)) {
+ /*
+ * The iter must explicitly re-read the SPTE because
+ * the atomic cmpxchg failed.
+ */
+ iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ goto retry;
+ }
spte_set = true;
}

@@ -1236,7 +1270,9 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
struct kvm_mmu_page *root;
bool spte_set = false;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages);

--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:40:01

by Ben Gardon

Subject: [PATCH v2 12/13] KVM: x86/mmu: Fast invalidation for TDP MMU

Provide a real mechanism for fast invalidation by marking roots as
invalid so that their reference count will quickly fall to zero
and they will be torn down.

One negative side effect of this approach is that a vCPU thread will
likely drop the last reference to a root and be saddled with the work of
tearing down an entire paging structure. This issue will be resolved in
a later commit.

Signed-off-by: Ben Gardon <[email protected]>
---

Changelog
v2:
-- open code root invalidation
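
A small userspace sketch of the mechanism, illustrative only: C11 atomics
stand in for refcount_t and a plain bool for root->role.invalid.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct root {
    atomic_int refcount;
    bool invalid;               /* models root->role.invalid */
};

/* New behavior: refuse to hand out references to invalidated roots. */
static bool get_root(struct root *r)
{
    int old;

    if (r->invalid)
        return false;

    old = atomic_load(&r->refcount);
    while (old != 0) {
        if (atomic_compare_exchange_weak(&r->refcount, &old, old + 1))
            return true;
    }
    return false;
}

int main(void)
{
    struct root r = { .refcount = 3, .invalid = false };

    /* Fast invalidation: mark the root invalid ... */
    r.invalid = true;

    /* ... so new users bounce off and allocate a fresh root instead ... */
    if (!get_root(&r))
        printf("root is invalid, vCPU allocates a new root\n");

    /* ... while existing holders drain their references to zero. */
    while (atomic_fetch_sub(&r.refcount, 1) != 1)
        ;
    printf("last reference dropped, root can be torn down\n");
    return 0;
}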

arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++++++++++++---
arch/x86/kvm/mmu/tdp_mmu.h | 3 +++
2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a3837f8ad4ed..ba0c65076200 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5418,6 +5418,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
*/
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
+ struct kvm_mmu_page *root;
+
lockdep_assert_held(&kvm->slots_lock);

write_lock(&kvm->mmu_lock);
@@ -5432,6 +5434,27 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
*/
kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;

+
+ if (is_tdp_mmu_enabled(kvm)) {
+ /*
+ * Mark each TDP MMU root as invalid so that other threads
+ * will drop their references and allow the root count to
+ * go to 0.
+ *
+ * This has essentially the same effect for the TDP MMU
+ * as updating mmu_valid_gen above does for the shadow
+ * MMU.
+ *
+ * In order to ensure all threads see this change when
+ * handling the MMU reload signal, this must happen in the
+ * same critical section as kvm_reload_remote_mmus, and
+ * before kvm_zap_obsolete_pages as kvm_zap_obsolete_pages
+ * could drop the MMU lock and yield.
+ */
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
+ root->role.invalid = true;
+ }
+
/*
* Notify all vcpus to reload its shadow page table and flush TLB.
* Then all vcpus will switch to new shadow page table with the new
@@ -5444,9 +5467,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)

kvm_zap_obsolete_pages(kvm);

- if (is_tdp_mmu_enabled(kvm))
- kvm_tdp_mmu_zap_all(kvm);
-
write_unlock(&kvm->mmu_lock);
}

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index d703c6d6024a..8fa3e7421a93 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,6 +10,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
struct kvm_mmu_page *root)
{
+ if (root->role.invalid)
+ return false;
+
return refcount_inc_not_zero(&root->tdp_mmu_root_count);
}

--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:40:08

by Ben Gardon

Subject: [PATCH v2 13/13] KVM: x86/mmu: Tear down roots in fast invalidation thread

To avoid saddling a vCPU thread with the work of tearing down an entire
paging structure, take a reference on each root before they become
obsolete, so that the thread initiating the fast invalidation can tear
down the paging structure and (most likely) release the last reference.
As a bonus, this teardown can happen under the MMU lock in read mode so
as not to block the progress of vCPU threads.

Signed-off-by: Ben Gardon <[email protected]>
---

Changelog
v2:
-- rename kvm_tdp_mmu_zap_all_fast to
kvm_tdp_mmu_zap_invalidated_roots
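
A small userspace sketch of why the extra reference moves the teardown work
off the vCPU, illustrative only (C11 atomics stand in for refcount_t):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of who ends up doing the teardown work. */
struct root {
    atomic_int refcount;
    bool invalid;
};

static void teardown(const char *who)
{
    printf("%s tears down the paging structure\n", who);
}

/* Whoever drops the last reference pays for the teardown. */
static void put_root(struct root *r, const char *who)
{
    if (atomic_fetch_sub(&r->refcount, 1) == 1)
        teardown(who);
}

int main(void)
{
    /* One reference held by a vCPU currently using the root. */
    struct root r = { .refcount = 1, .invalid = false };

    /*
     * Fast invalidation: take an extra reference *before* marking the
     * root invalid, so the vCPU's put below cannot be the last one.
     */
    atomic_fetch_add(&r.refcount, 1);
    r.invalid = true;

    /* The vCPU notices the invalid root, drops its reference, moves on. */
    put_root(&r, "vCPU");

    /*
     * The invalidation thread zaps the root under the MMU read lock and
     * then drops the reference it took, doing the expensive work itself.
     */
    put_root(&r, "invalidation thread");
    return 0;
}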

arch/x86/kvm/mmu/mmu.c | 21 +++++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 68 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.h | 1 +
3 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ba0c65076200..5f2064ee7220 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5441,6 +5441,18 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* will drop their references and allow the root count to
* go to 0.
*
+ * Also take a reference on all roots so that this thread
+ * can do the bulk of the work required to free the roots
+ * once they are invalidated. Without this reference, a
+ * vCPU thread might drop the last reference to a root and
+ * get stuck with tearing down the entire paging structure.
+ *
+ * Roots which have a zero refcount should be skipped as
+ * they're already being torn down.
+ * Already invalid roots should be referenced again so that
+ * they aren't freed before kvm_tdp_mmu_zap_invalidated_roots is
+ * done with them.
+ *
* This has essentially the same effect for the TDP MMU
* as updating mmu_valid_gen above does for the shadow
* MMU.
@@ -5452,7 +5464,8 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
* could drop the MMU lock and yield.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
- root->role.invalid = true;
+ if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
+ root->role.invalid = true;
}

/*
@@ -5468,6 +5481,12 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_zap_obsolete_pages(kvm);

write_unlock(&kvm->mmu_lock);
+
+ if (is_tdp_mmu_enabled(kvm)) {
+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm);
+ read_unlock(&kvm->mmu_lock);
+ }
}

static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 501722a524a7..0adcfa5750f6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -798,6 +798,74 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
kvm_flush_remote_tlbs(kvm);
}

+static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
+ struct kvm_mmu_page *prev_root)
+{
+ struct kvm_mmu_page *next_root;
+
+ if (prev_root)
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &prev_root->link,
+ typeof(*prev_root), link);
+ else
+ next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ typeof(*next_root), link);
+
+ while (next_root && !(next_root->role.invalid &&
+ refcount_read(&next_root->tdp_mmu_root_count)))
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &next_root->link,
+ typeof(*next_root), link);
+
+ return next_root;
+}
+
+/*
+ * Since kvm_mmu_zap_all_fast has acquired a reference to each
+ * invalidated root, they will not be freed until this function drops the
+ * reference. Before dropping that reference, tear down the paging
+ * structure so that whichever thread does drop the last reference
+ * only has to do a trivial amount of work. Since the roots are invalid,
+ * no new SPTEs should be created under them.
+ */
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
+{
+ gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+ struct kvm_mmu_page *next_root;
+ struct kvm_mmu_page *root;
+ bool flush = false;
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ root = next_invalidated_root(kvm, NULL);
+
+ while (root) {
+ next_root = next_invalidated_root(kvm, root);
+
+ rcu_read_unlock();
+
+ flush = zap_gfn_range(kvm, root, 0, max_gfn, true, flush,
+ true);
+
+ /*
+ * Put the reference acquired in
+ * kvm_mmu_zap_all_fast.
+ */
+ kvm_tdp_mmu_put_root(kvm, root, true);
+
+ root = next_root;
+
+ rcu_read_lock();
+ }
+
+ rcu_read_unlock();
+
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+}
+
/*
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 8fa3e7421a93..f8db381e3059 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -47,6 +47,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
sp->gfn, end, false, false, false);
}
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);

int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int map_writable, int max_level, kvm_pfn_t pfn,
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:40:34

by Ben Gardon

Subject: [PATCH v2 10/13] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock

To speed the process of disabling dirty logging, change the TDP MMU
function which zaps collapsible SPTEs to run under the MMU read lock.

Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 14 ++++++++++----
arch/x86/kvm/mmu/tdp_mmu.c | 17 +++++++++++++----
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d03a7a8b7ea2..5939813e3043 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5612,13 +5612,19 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
write_lock(&kvm->mmu_lock);
flush = slot_handle_leaf(kvm, slot, kvm_mmu_zap_collapsible_spte, true);

- if (is_tdp_mmu_enabled(kvm))
- flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
-
if (flush)
kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
-
write_unlock(&kvm->mmu_lock);
+
+ if (is_tdp_mmu_enabled(kvm)) {
+ flush = false;
+
+ read_lock(&kvm->mmu_lock);
+ flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
+ if (flush)
+ kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+ read_unlock(&kvm->mmu_lock);
+ }
}

void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6917403484ce..0e6ffa04e5e1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1323,7 +1323,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
rcu_read_lock();

tdp_root_for_each_pte(iter, root, start, end) {
- if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
flush = false;
continue;
}
@@ -1338,8 +1339,14 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
pfn, PG_LEVEL_NUM))
continue;

- tdp_mmu_set_spte(kvm, &iter, 0);
-
+ if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
+ /*
+ * The iter must explicitly re-read the SPTE because
+ * the atomic cmpxchg failed.
+ */
+ iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+ goto retry;
+ }
flush = true;
}

@@ -1358,7 +1365,9 @@ bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
{
struct kvm_mmu_page *root;

- for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, false)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
flush = zap_collapsible_spte_range(kvm, root, slot, flush);

return flush;
--
2.31.0.208.g409f899ff0-goog

2021-04-01 23:41:31

by Ben Gardon

Subject: [PATCH v2 06/13] KVM: x86/mmu: Make TDP MMU root refcount atomic

In order to parallelize more operations for the TDP MMU, make the
refcount on TDP MMU roots atomic, so that a future patch can allow
multiple threads to take a reference on the root concurrently, while
holding the MMU lock in read mode.

Signed-off-by: Ben Gardon <[email protected]>
---

Changelog
v2:
-- Split failure handling for kvm_tdp_mmu_get_root out into a
separate commit.

arch/x86/kvm/mmu/mmu_internal.h | 6 +++++-
arch/x86/kvm/mmu/tdp_mmu.c | 4 ++--
arch/x86/kvm/mmu/tdp_mmu.h | 10 +++++++---
3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 9347d73996b5..f63d0fdb8567 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -50,7 +50,11 @@ struct kvm_mmu_page {
u64 *spt;
/* hold the gfn of each spte inside spt */
gfn_t *gfns;
- int root_count; /* Currently serving as active root */
+ /* Currently serving as active root */
+ union {
+ int root_count;
+ refcount_t tdp_mmu_root_count;
+ };
unsigned int unsync_children;
struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
DECLARE_BITMAP(unsync_child_bitmap, 512);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 670c5e3ad80e..697ea882a3e4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -56,7 +56,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)

lockdep_assert_held_write(&kvm->mmu_lock);

- if (--root->root_count)
+ if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
return;

WARN_ON(!root->tdp_mmu_page);
@@ -168,7 +168,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
}

root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
- root->root_count = 1;
+ refcount_set(&root->tdp_mmu_root_count, 1);

list_add(&root->link, &kvm->arch.tdp_mmu_roots);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index d4e32ac5f4c9..1ec7914ecff9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,10 +10,14 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
static inline void kvm_tdp_mmu_get_root(struct kvm *kvm,
struct kvm_mmu_page *root)
{
- BUG_ON(!root->root_count);
- lockdep_assert_held(&kvm->mmu_lock);
+ lockdep_assert_held_write(&kvm->mmu_lock);

- ++root->root_count;
+ /*
+ * This should never fail since roots are removed from the roots
+ * list under the MMU write lock when their reference count falls
+ * to zero.
+ */
+ refcount_inc_not_zero(&root->tdp_mmu_root_count);
}

void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
--
2.31.0.208.g409f899ff0-goog

2021-04-02 07:54:38

by Paolo Bonzini

Subject: Re: [PATCH v2 09/13] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock

On 02/04/21 01:37, Ben Gardon wrote:
> +void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> + bool shared)
> {
> gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
>
> - lockdep_assert_held_write(&kvm->mmu_lock);
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>
> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> return;
> @@ -81,7 +92,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
> list_del_rcu(&root->link);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
> - zap_gfn_range(kvm, root, 0, max_gfn, false, false);
> + zap_gfn_range(kvm, root, 0, max_gfn, false, false, shared);
>
> call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);

Instead of patch 13, would it make sense to delay the zap_gfn_range and
call_rcu to a work item (either unconditionally, or only if
shared==false)? Then the zap_gfn_range would be able to yield and take
the mmu_lock for read, similar to kvm_tdp_mmu_zap_invalidated_roots.

If done unconditionally, this would also allow removing the "shared"
argument to kvm_tdp_mmu_put_root, tdp_mmu_next_root and
for_each_tdp_mmu_root_yield_safe, so I would place that change before
this patch.

Paolo

2021-04-02 11:16:14

by Paolo Bonzini

Subject: Re: [PATCH v2 10/13] KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock

On 02/04/21 01:37, Ben Gardon wrote:
> To speed the process of disabling dirty logging, change the TDP MMU
> function which zaps collapsible SPTEs to run under the MMU read lock.

Technically it only reduces the impact on the running VM; it doesn't
speed it up right? Something like:

To reduce the impact of disabling dirty logging, change the TDP MMU
function which zaps collapsible SPTEs to run under the MMU read lock.
This way, page faults on zapped SPTEs can proceed in parallel with
kvm_mmu_zap_collapsible_sptes.

Paolo

2021-04-02 11:45:07

by Paolo Bonzini

Subject: Re: [PATCH v2 00/13] More parallel operations for the TDP MMU

On 02/04/21 01:37, Ben Gardon wrote:
> Now that the TDP MMU is able to handle page faults in parallel, it's a
> relatively small change to expand to other operations. This series allows
> zapping a range of GFNs, reclaiming collapsible SPTEs (when disabling
> dirty logging), and enabling dirty logging to all happen under the MMU
> lock in read mode.
>
> This is partly a cleanup + rewrite of the last few patches of the parallel
> page faults series. I've incorporated feedback from Sean and Paolo, but
> the patches have changed so much that I'm sending this as a separate
> series.
>
> Ran kvm-unit-tests + selftests on an SMP kernel + Intel Skylake, with the
> TDP MMU enabled and disabled. This series introduces no new failures or
> warnings.
>
> I know this will conflict horribly with the patches from Sean's series
> which were just queued, and I'll send a v2 to fix those conflicts +
> address any feedback on this v1.
>
> Changelog
> v2:
> -- Rebased patches on top of kvm/queue to incorporate Sean's recent
> TLB flushing changes
> -- Dropped patch 5: "KVM: x86/mmu: comment for_each_tdp_mmu_root
> requires MMU write lock" as the following patch to protect the roots
> list with RCU adds lockdep which makes the comment somewhat redundant.
>
> Ben Gardon (13):
> KVM: x86/mmu: Re-add const qualifier in
> kvm_tdp_mmu_zap_collapsible_sptes
> KVM: x86/mmu: Move kvm_mmu_(get|put)_root to TDP MMU
> KVM: x86/mmu: use tdp_mmu_free_sp to free roots
> KVM: x86/mmu: Merge TDP MMU put and free root
> KVM: x86/mmu: Refactor yield safe root iterator
> KVM: x86/mmu: Make TDP MMU root refcount atomic
> KVM: x86/mmu: handle cmpxchg failure in kvm_tdp_mmu_get_root
> KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU
> KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
> KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
> KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read
> lock
> KVM: x86/mmu: Fast invalidation for TDP MMU
> KVM: x86/mmu: Tear down roots in fast invalidation thread
>
> arch/x86/include/asm/kvm_host.h | 21 +-
> arch/x86/kvm/mmu/mmu.c | 115 +++++++---
> arch/x86/kvm/mmu/mmu_internal.h | 27 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 375 +++++++++++++++++++++++---------
> arch/x86/kvm/mmu/tdp_mmu.h | 28 ++-
> include/linux/kvm_host.h | 2 +-
> 6 files changed, 407 insertions(+), 161 deletions(-)
>

Applied to kvm/mmu-notifier-queue, thanks.

Paolo

2021-04-13 06:26:19

by Ben Gardon

[permalink] [raw]
Subject: Re: [PATCH v2 09/13] KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock

On Fri, Apr 2, 2021 at 12:53 AM Paolo Bonzini <[email protected]> wrote:
>
> On 02/04/21 01:37, Ben Gardon wrote:
> > +void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > + bool shared)
> > {
> > gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> >
> > - lockdep_assert_held_write(&kvm->mmu_lock);
> > + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> >
> > if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> > return;
> > @@ -81,7 +92,7 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > list_del_rcu(&root->link);
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> > - zap_gfn_range(kvm, root, 0, max_gfn, false, false);
> > + zap_gfn_range(kvm, root, 0, max_gfn, false, false, shared);
> >
> > call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
>
> Instead of patch 13, would it make sense to delay the zap_gfn_range and
> call_rcu to a work item (either unconditionally, or only if
> shared==false)? Then the zap_gfn_range would be able to yield and take
> the mmu_lock for read, similar to kvm_tdp_mmu_zap_invalidated_roots.
>
> If done unconditionally, this would also allow removing the "shared"
> argument to kvm_tdp_mmu_put_root, tdp_mmu_next_root and
> for_each_tdp_mmu_root_yield_safe, so I would place that change before
> this patch.
>
> Paolo
>

I tried that and it created problems. I believe the issue was that on
VM teardown, memslots would be freed and the memory reallocated before
the root was torn down, resulting in a use-after-free from
mark_pfn_dirty. Perhaps this could be resolved by forcing memslot
changes to wait until that work item was processed before returning. I
can look into it, but I suspect there will be a lot of "gotchas"
involved.
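
A sketch of the "wait for the work item" mitigation mentioned above,
assuming the deferred zaps were queued on a dedicated per-VM workqueue
(the tdp_mmu_zap_wq field and helper name are hypothetical, not part of
the series):

	static void kvm_mmu_wait_for_deferred_zaps(struct kvm *kvm)
	{
		/*
		 * Returns only once every queued root-zap work item has
		 * finished, so a deferred zap can no longer dereference a
		 * memslot that is about to be freed and reallocated.
		 */
		flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
	}

Every path that frees or replaces memslots (and VM teardown itself) would
need such a barrier before the memory is reused, which is where the
"gotchas" come from.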

2021-05-26 21:38:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 00/13] More parallel operations for the TDP MMU

On Fri, Apr 02, 2021, Paolo Bonzini wrote:
> On 02/04/21 01:37, Ben Gardon wrote:
> > Now that the TDP MMU is able to handle page faults in parallel, it's a
> > relatively small change to expand to other operations. This series allows
> > zapping a range of GFNs, reclaiming collapsible SPTEs (when disabling
> > dirty logging), and enabling dirty logging to all happen under the MMU
> > lock in read mode.
> >
> > This is partly a cleanup + rewrite of the last few patches of the parallel
> > page faults series. I've incorporated feedback from Sean and Paolo, but
> > the patches have changed so much that I'm sending this as a separate
> > series.
> >
> > Ran kvm-unit-tests + selftests on an SMP kernel + Intel Skylake, with the
> > TDP MMU enabled and disabled. This series introduces no new failures or
> > warnings.
> >
> > I know this will conflict horribly with the patches from Sean's series
> > which were just queued, and I'll send a v2 to fix those conflicts +
> > address any feedback on this v1.
> >
> > Changelog
> > v2:
> > -- Rebased patches on top of kvm/queue to incorporate Sean's recent
> > TLB flushing changes
> > -- Dropped patch 5: "KVM: x86/mmu: comment for_each_tdp_mmu_root
> > requires MMU write lock" as the following patch to protect the roots
> > list with RCU adds lockdep which makes the comment somewhat redundant.
> >
> > Ben Gardon (13):
> > KVM: x86/mmu: Re-add const qualifier in
> > kvm_tdp_mmu_zap_collapsible_sptes
> > KVM: x86/mmu: Move kvm_mmu_(get|put)_root to TDP MMU
> > KVM: x86/mmu: use tdp_mmu_free_sp to free roots
> > KVM: x86/mmu: Merge TDP MMU put and free root
> > KVM: x86/mmu: Refactor yield safe root iterator
> > KVM: x86/mmu: Make TDP MMU root refcount atomic
> > KVM: x86/mmu: handle cmpxchg failure in kvm_tdp_mmu_get_root
> > KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU
> > KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
> > KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
> > KVM: x86/mmu: Allow enabling / disabling dirty logging under MMU read
> > lock
> > KVM: x86/mmu: Fast invalidation for TDP MMU
> > KVM: x86/mmu: Tear down roots in fast invalidation thread
> >
> > arch/x86/include/asm/kvm_host.h | 21 +-
> > arch/x86/kvm/mmu/mmu.c | 115 +++++++---
> > arch/x86/kvm/mmu/mmu_internal.h | 27 +--
> > arch/x86/kvm/mmu/tdp_mmu.c | 375 +++++++++++++++++++++++---------
> > arch/x86/kvm/mmu/tdp_mmu.h | 28 ++-
> > include/linux/kvm_host.h | 2 +-
> > 6 files changed, 407 insertions(+), 161 deletions(-)
> >
>
> Applied to kvm/mmu-notifier-queue, thanks.

What's the plan for kvm/mmu-notifier-queue? More specifically, are the hashes
stable, i.e. will non-critical review feedback get squashed? I was finally
getting around to reviewing this, but what's sitting in that branch doesn't
appear to be exactly what's posted here. If the hashes are stable, I'll probably
test and review functionality, but not do a thorough review.

Thanks!

2021-05-27 11:45:16

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2 00/13] More parallel operations for the TDP MMU

On 26/05/21 23:34, Sean Christopherson wrote:
>> Applied to kvm/mmu-notifier-queue, thanks.
> What's the plan for kvm/mmu-notifier-queue? More specifically, are the hashes
> stable, i.e. will non-critical review feedback get squashed? I was finally
> getting around to reviewing this, but what's sitting in that branch doesn't
> appear to be exactly what's posted here. If the hashes are stable, I'll probably
> test and review functionality, but not do a thorough review.

It's all in 5.13 except for the lock elision patch, for which I was
waiting for a review. I'll post that patch separately.

Thanks,

Paolo

2021-05-27 17:33:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 00/13] More parallel operations for the TDP MMU

On Thu, May 27, 2021, Paolo Bonzini wrote:
> On 26/05/21 23:34, Sean Christopherson wrote:
> > > Applied to kvm/mmu-notifier-queue, thanks.
> > What's the plan for kvm/mmu-notifier-queue? More specifically, are the hashes
> > stable, i.e. will non-critical review feedback get squashed? I was finally
> > getting around to reviewing this, but what's sitting in that branch doesn't
> > appear to be exactly what's posted here. If the hashes are stable, I'll probably
> > test and review functionality, but not do a thorough review.
>
> It's all in 5.13 except for the lock elision patch, for which I was waiting
> for a review. I'll post that patch separately.

Ha, stable indeed. Thanks!