The TDP MMU SPTE zapping process currently uses two levels of iteration. The
first level of iteration is the for loop within zap_gfn_range(), whose
purpose is to narrow the walk to the exact range to be zapped. The second
level of iteration starts at tdp_mmu_set_spte{,_atomic}(), which tears down
the whole paging structure (leaf and non-leaf SPTEs) within the range. The
former iteration is yield safe, while the latter is not.
In many cases, the SPTE zapping process could be optimized, since the
non-leaf SPTEs could most likely be retained for the next allocation. On the
other hand, for large-scale SPTE zapping scenarios, we may end up zapping too
many SPTEs and using excessive CPU time, which causes RCU stall warnings.
The following selftest reproduces the warning:
(env: kvm.tdp_mmu=Y)
./dirty_log_perf_test -v 64 -b 8G
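A back-of-the-envelope calculation shows the scale involved (a rough
estimate, assuming the mappings have been split to 4KiB by dirty logging):

  64 vCPUs * 8GiB = 512GiB of guest memory
  512GiB / 4KiB   = ~134 million leaf SPTEs
  512GiB / 2MiB   = 262,144 last-level page tables
  512GiB / 1GiB   = 512 page directories

When all of that sits under a handful of top-level SPTEs, a single
non-yielding tdp_mmu_set_spte{,_atomic}() call can end up tearing most of it
down.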
Optimize the zapping process by skipping all SPTEs above a certain level in
the first iteration. This allows us to control the granularity of the actual
zapping and to invoke tdp_mmu_iter_cond_resched() in time. In addition, we
retain some of the non-leaf SPTEs to accelerate the next allocation.
For the selection of the `certain level`, we choose PG_LEVEL_1G because it
is currently the largest page size supported and it naturally fits the
scenario of splitting large pages.
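With that cutoff, the work done by each non-yielding
tdp_mmu_set_spte{,_atomic}() call is roughly bounded as follows (again
assuming 4KiB mappings):

  1GiB / 4KiB = 262,144 leaf SPTEs
  1GiB / 2MiB = 512 last-level page tables (+ 1 page directory)

i.e. at most one 1GiB subtree is torn down between two
tdp_mmu_iter_cond_resched() checks, instead of potentially the whole address
space.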
For the `zap_all` case (usually at VM teardown time), we use a two-phase
mechanism: the 1st phase zaps all SPTEs at or below PG_LEVEL_1G, and the 2nd
phase
zaps everything else. This is achieved by the helper function
__zap_gfn_range().
Cc: Sean Christopherson <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: David Matlack <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
1 file changed, 40 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 89d16bb104de..3fadc51c004a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
* account for the possibility that other threads are modifying the paging
* structures concurrently. If shared is false, this thread should hold the
* MMU lock in write mode.
+ *
+ * If zap_all is true, eliminate all the paging structures that contains the
+ * SPTEs.
*/
-static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t start, gfn_t end, bool can_yield, bool flush,
- bool shared)
+static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush,
+ bool shared, bool zap_all)
{
- gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
- bool zap_all = (start == 0 && end >= max_gfn_host);
struct tdp_iter iter;
- /*
- * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
- * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
- * and so KVM will never install a SPTE for such addresses.
- */
- end = min(end, max_gfn_host);
-
- kvm_lockdep_assert_mmu_lock_held(kvm, shared);
-
rcu_read_lock();
tdp_root_for_each_pte(iter, root, start, end) {
@@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
continue;
}
- if (!is_shadow_present_pte(iter.old_spte))
+ /*
+ * In zap_all case, ignore the checking of present since we have
+ * to zap everything.
+ */
+ if (!zap_all && !is_shadow_present_pte(iter.old_spte))
continue;
/*
* If this is a non-last-level SPTE that covers a larger range
* than should be zapped, continue, and zap the mappings at a
- * lower level, except when zapping all SPTEs.
+ * lower level. Actual zapping started at proper granularity
+ * that is not so large as to cause a soft lockup when handling
+ * the changed pte (which does not yield).
*/
if (!zap_all &&
(iter.gfn < start ||
- iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+ iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
+ iter.level > PG_LEVEL_1G) &&
!is_last_spte(iter.old_spte, iter.level))
continue;
@@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
return flush;
}
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush,
+ bool shared)
+{
+ gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+ bool zap_all = (start == 0 && end >= max_gfn_host);
+
+ /*
+ * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
+ * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
+ * and so KVM will never install a SPTE for such addresses.
+ */
+ end = min(end, max_gfn_host);
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
+ false);
+ if (zap_all)
+ flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
+ shared, true);
+ return flush;
+}
+
/*
* Tears down the mappings for the range of gfns, [start, end), and frees the
* non-root pages mapping GFNs strictly within that range. Returns true if
--
2.34.0.rc2.393.gf8c9666880-goog
On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> TDP MMU SPTE zapping process currently uses two levels of iterations. The
> first level iteration happens at the for loop within the zap_gfn_range()
> with the purpose of calibrating the accurate range for zapping. The second
> level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
iteration
> whole paging structures (leaf and non-leaf SPTEs) within the range. The
> former iteration is yield safe, while the second one is not.
I know what you mean but I'd suggest being more specific than "yield
safe". For example:
Unlike the outer loop, the recursive zapping done under
tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
a pre-order traversal, zapping sufficiently large ranges can lead to
RCU stall warnings.
I'd also clarify here that the TDP MMU iterator uses a pre-order
traversal, which causes KVM to end up doing the maximum amount of
zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.
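To illustrate the shape of the problem, here is a toy model (plain userspace
C, not the actual KVM code): with a pre-order walk the top-level entry is
visited first, and zapping it hands the entire subtree to a recursive
teardown that has no chance to yield until it returns.

#include <stdio.h>
#include <stdlib.h>

#define ENTRIES 8			/* 512 in a real x86 page table */

struct pt {
	struct pt *child[ENTRIES];
	int level;			/* 1 == last level */
};

static unsigned long zapped;

/* Analogue of the recursive teardown under tdp_mmu_set_spte{,_atomic}(). */
static void teardown(struct pt *p)
{
	int i;

	for (i = 0; i < ENTRIES; i++)
		if (p->child[i])
			teardown(p->child[i]);
	zapped++;
	free(p);
	/* No way to yield / report an RCU quiescent state in here. */
}

static struct pt *build(int level)
{
	struct pt *p = calloc(1, sizeof(*p));
	int i;

	p->level = level;
	if (level > 1)
		for (i = 0; i < ENTRIES; i++)
			p->child[i] = build(level - 1);
	return p;
}

int main(void)
{
	/*
	 * Pre-order walk analogue: the root-level entry is visited first,
	 * so zapping it drags the entire tree into one non-yielding call.
	 */
	teardown(build(4));
	printf("zapped %lu table pages in a single call\n", zapped);
	return 0;
}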
>
> In many cases, zapping SPTE process could be optimized since the non-leaf
> SPTEs could most likely be retained for the next allocation. On the other
> hand, for large scale SPTE zapping scenarios, we may end up zapping too
> many SPTEs and use excessive CPU time that causes the RCU stall warning.
>
> The follow selftest reproduces the warning:
>
> (env: kvm.tdp_mmu=Y)
> ./dirty_log_perf_test -v 64 -b 8G
>
> Optimize the zapping process by skipping all SPTEs above a certain level in
> the first iteration. This allows us to control the granularity of the
> actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> we would retain some of the non-leaf SPTEs to accelerate next allocation.
>
> For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> it is currently the largest page size supported and it natually fits the
> scenario of splitting large pages.
>
> For `zap_all` case (usually) at VM teardown time, we use a two-phase
> mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> zaps everything else. This is achieved by the helper function
> __zap_gfn_range().
>
> Cc: Sean Christopherson <[email protected]>
> Cc: Ben Gardon <[email protected]>
> Cc: David Matlack <[email protected]>
>
> Signed-off-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
> 1 file changed, 40 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 89d16bb104de..3fadc51c004a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> * account for the possibility that other threads are modifying the paging
> * structures concurrently. If shared is false, this thread should hold the
> * MMU lock in write mode.
> + *
> + * If zap_all is true, eliminate all the paging structures that contains the
> + * SPTEs.
> */
> -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush,
> - bool shared)
> +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, bool can_yield, bool flush,
> + bool shared, bool zap_all)
> {
> - gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> - bool zap_all = (start == 0 && end >= max_gfn_host);
> struct tdp_iter iter;
>
> - /*
> - * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> - * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> - * and so KVM will never install a SPTE for such addresses.
> - */
> - end = min(end, max_gfn_host);
> -
> - kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> -
> rcu_read_lock();
>
> tdp_root_for_each_pte(iter, root, start, end) {
> @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> continue;
> }
>
> - if (!is_shadow_present_pte(iter.old_spte))
> + /*
> + * In zap_all case, ignore the checking of present since we have
> + * to zap everything.
> + */
> + if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> continue;
I don't believe there's any reason to attempt to zap a non-present spte,
even in the zap_all case. In any case, this change deserves its own
patch and a commit message that describes why the old logic is incorrect
and how this fixes it.
>
> /*
> * If this is a non-last-level SPTE that covers a larger range
> * than should be zapped, continue, and zap the mappings at a
> - * lower level, except when zapping all SPTEs.
> + * lower level. Actual zapping started at proper granularity
> + * that is not so large as to cause a soft lockup when handling
> + * the changed pte (which does not yield).
> */
> if (!zap_all &&
> (iter.gfn < start ||
> - iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> + iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> + iter.level > PG_LEVEL_1G) &&
> !is_last_spte(iter.old_spte, iter.level))
> continue;
This if statement is getting a bit long. I'd suggest breaking out the
level check and also using KVM_MAX_HUGEPAGE_LEVEL.
e.g.
/*
* If not doing zap_all, only zap up to the huge page level to
* avoid doing too much work in the recursive tdp_mmu_set_spte*
* call below, since it does not yield.
*
* This will potentially leave behind some childless page tables
* but that's ok because ...
*/
if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
continue;
And on that note, what is the reasoning for why it's ok to leave behind
childless page tables? I assume it's because most of the time we'll use
that page table again in the future, and at worst we leave the page
table allocated until the VM is cleaned up?
>
> @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> return flush;
> }
>
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> + gfn_t start, gfn_t end, bool can_yield, bool flush,
> + bool shared)
> +{
> + gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> + bool zap_all = (start == 0 && end >= max_gfn_host);
> +
> + /*
> + * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> + * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> + * and so KVM will never install a SPTE for such addresses.
> + */
> + end = min(end, max_gfn_host);
> +
> + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> +
> + flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> + false);
> + if (zap_all)
> + flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> + shared, true);
> + return flush;
> +}
> +
> /*
> * Tears down the mappings for the range of gfns, [start, end), and frees the
> * non-root pages mapping GFNs strictly within that range. Returns true if
> --
> 2.34.0.rc2.393.gf8c9666880-goog
>
On Tue, Nov 30, 2021 at 12:48:24AM +0000, David Matlack wrote:
> On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> > TDP MMU SPTE zapping process currently uses two levels of iterations. The
> > first level iteration happens at the for loop within the zap_gfn_range()
> > with the purpose of calibrating the accurate range for zapping. The second
> > level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
>
> iteration
>
> > whole paging structures (leaf and non-leaf SPTEs) within the range. The
> > former iteration is yield safe, while the second one is not.
>
> I know what you mean but I'd suggest being more specific than "yield
> safe". For example:
>
> Unlike the outer loop, the recursive zapping done under
> tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
> a pre-order traversal, zapping sufficiently large ranges can lead to
> RCU stall warnings.
>
> I'd also clarify here that the TDP MMU iterator uses a pre-order
> traversal which causes us KVM to end up doing the maximum amount of
> zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.
(Ah sorry for the redundant suggestion. I wrote this paragraph and then
reworded my suggestion to mention the pre-order traversal.)
On Tue, Nov 30, 2021, David Matlack wrote:
> > + /*
> > + * In zap_all case, ignore the checking of present since we have
> > + * to zap everything.
> > + */
> > + if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> > continue;
>
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.
Yep, at best it's wasted cycles, at worst it will trigger WARN/BUG due to using
accessors that require the caller to check for a shadow present SPTE.
> > * If this is a non-last-level SPTE that covers a larger range
> > * than should be zapped, continue, and zap the mappings at a
> > - * lower level, except when zapping all SPTEs.
> > + * lower level. Actual zapping started at proper granularity
> > + * that is not so large as to cause a soft lockup when handling
> > + * the changed pte (which does not yield).
> > */
> > if (!zap_all &&
> > (iter.gfn < start ||
> > - iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > + iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > + iter.level > PG_LEVEL_1G) &&
> > !is_last_spte(iter.old_spte, iter.level))
> > continue;
>
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
>
> e.g.
>
> /*
> * If not doing zap_all, only zap up to the huge page level to
> * avoid doing too much work in the recursive tdp_mmu_set_spte*
> * call below, since it does not yield.
> *
> * This will potentially leave behind some childless page tables
> * but that's ok because ...
> */
> if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
> continue;
>
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?
Yes. If zap_all=false, KVM is zapping a gfn range but not dropping or invalidating
the root. The non-leaf paging structures are perfectly valid and can be reused.
There's certainly no guarantee that a structure will be reused, but keeping the
pages is ok because the memory consumed scales with the size of the VM, and the
number of shadow pages used for TDP is quite small, e.g. an order of magnitude less
than what's needed for shadow paging.
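As a rough sizing sanity check (assuming 4KiB mappings and 512 entries per
table):

  page-table pages ~= guest memory / 512  (e.g. ~16MiB for an 8GiB region)
  each level above adds another factor of ~1/512

The upper-level tables that a bounded zap can leave behind childless (the
PDPTs and the root) amount to a few 4KiB pages per 512GiB of address space.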
If we're ok waiting until my zap n' flush rework[*] lands, this is much easier to
handle, e.g. I think this will do it.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 62cb357b1dff..8d3df03024b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -765,7 +765,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared, bool root_is_unreachable)
{
struct tdp_iter iter;
-
+ bool leafs_only = true;
gfn_t end = tdp_mmu_max_gfn_host();
gfn_t start = 0;
@@ -773,12 +773,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- /*
- * No need to try to step down in the iterator when zapping an entire
- * root, zapping an upper-level SPTE will recurse on its children.
- */
+again:
+ /* Add comment here. */
for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
- root->role.level, start, end) {
+ leafs_only ? PG_LEVEL_4K : root->role.level,
+ start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -786,6 +785,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
if (!is_shadow_present_pte(iter.old_spte))
continue;
+ if (leafs_only && !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
if (!shared) {
tdp_mmu_set_spte(kvm, &iter, 0);
} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
@@ -807,6 +809,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
}
+ if (leafs_only) {
+ leafs_only = false;
+ goto again;
+ }
+
rcu_read_unlock();
}
[*] https://lore.kernel.org/all/[email protected]/