Date: Tue, 30 Nov 2021 19:40:45 +0000
From: Sean Christopherson
To: David Matlack
Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Ben Gardon
Subject: Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf
 SPTEs and avoid rcu stall
References: <20211124214421.458549-1-mizhang@google.com>
 <20211124214421.458549-3-mizhang@google.com>

On Tue, Nov 30, 2021, David Matlack wrote:
> > +        /*
> > +         * In zap_all case, ignore the checking of present since we have
> > +         * to zap everything.
> > +         */
> > +        if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> >                  continue;
> 
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.

Yep, at best it's wasted cycles, at worst it will trigger WARN/BUG due to using
accessors that require the caller to check for a shadow present SPTE.
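
I.e., regardless of zap_all, the present check wants to stay unconditional,
something along the lines of (sketch only):

        if (!is_shadow_present_pte(iter.old_spte))
                continue;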

> >          * If this is a non-last-level SPTE that covers a larger range
> >          * than should be zapped, continue, and zap the mappings at a
> > -        * lower level, except when zapping all SPTEs.
> > +        * lower level. Actual zapping started at proper granularity
> > +        * that is not so large as to cause a soft lockup when handling
> > +        * the changed pte (which does not yield).
> >          */
> >         if (!zap_all &&
> >             (iter.gfn < start ||
> > -            iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > +            iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > +            iter.level > PG_LEVEL_1G) &&
> >             !is_last_spte(iter.old_spte, iter.level))
> >                 continue;
> 
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
> 
> e.g.
> 
>         /*
>          * If not doing zap_all, only zap up to the huge page level to
>          * avoid doing too much work in the recursive tdp_mmu_set_spte*
>          * call below, since it does not yield.
>          *
>          * This will potentially leave behind some childless page tables
>          * but that's ok because ...
>          */
>         if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
>                 continue;
> 
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?

Yes. If zap_all=false, KVM is zapping a gfn range but not dropping or
invalidating the root. The non-leaf paging structures are perfectly valid and
can be reused. There's certainly no guarantee that a structure will be reused,
but keeping the pages is ok because the memory consumed scales with the size of
the VM, and the number of shadow pages used for TDP is quite small, e.g. an
order of magnitude less than what's needed for shadow paging.

If we're ok waiting until my zap n' flush rework[*] lands, this is much easier
to handle, e.g. I think this will do it.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 62cb357b1dff..8d3df03024b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -765,7 +765,7 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                              bool shared, bool root_is_unreachable)
 {
         struct tdp_iter iter;
-
+        bool leafs_only;
         gfn_t end = tdp_mmu_max_gfn_host();
         gfn_t start = 0;
 
@@ -773,12 +773,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 
         rcu_read_lock();
 
-        /*
-         * No need to try to step down in the iterator when zapping an entire
-         * root, zapping an upper-level SPTE will recurse on its children.
-         */
+again:
+        /* Add comment here. */
         for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-                                   root->role.level, start, end) {
+                                   leafs_only ? PG_LEVEL_4K : root->role.level,
+                                   start, end) {
 retry:
                 if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
                         continue;
 
@@ -786,6 +785,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
                 if (!is_shadow_present_pte(iter.old_spte))
                         continue;
 
+                if (leafs_only && !is_last_spte(iter.old_spte, iter.level))
+                        continue;
+
                 if (!shared) {
                         tdp_mmu_set_spte(kvm, &iter, 0);
                 } else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
@@ -807,6 +809,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 
                 WARN_ON_ONCE(root_is_unreachable && root->role.invalid);
         }
 
+        if (leafs_only) {
+                leafs_only = false;
+                goto again;
+        }
+
         rcu_read_unlock();
 }

[*] https://lore.kernel.org/all/20211120045046.3940942-1-seanjc@google.com/
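
For anyone who wants to see the two-pass shape in isolation, below is a
self-contained toy model in plain userspace C (invented names and structure,
not KVM code): pass 1 frees only leaf entries, which keeps the work per
iteration small enough that the real code can afford to yield between entries,
and pass 2 then drops the now-childless non-leaf tables, so tearing down an
upper-level entry no longer implies a huge amount of non-yielding work.
Builds with e.g. "gcc -Wall toy_two_pass_zap.c" and just prints a completion
message.

/* toy_two_pass_zap.c - illustration only, not kernel code. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define ENTRIES_PER_TABLE 4

struct toy_node {
        bool is_leaf;                              /* leaf "SPTE" vs. page table */
        struct toy_node *child[ENTRIES_PER_TABLE]; /* used only if !is_leaf */
};

static struct toy_node *alloc_node(bool is_leaf)
{
        struct toy_node *n = calloc(1, sizeof(*n));

        if (!n)
                exit(1);
        n->is_leaf = is_leaf;
        return n;
}

/* Build a tiny two-level tree: root -> tables -> leaves. */
static struct toy_node *build_tree(void)
{
        struct toy_node *root = alloc_node(false);
        int i, j;

        for (i = 0; i < ENTRIES_PER_TABLE; i++) {
                root->child[i] = alloc_node(false);
                for (j = 0; j < ENTRIES_PER_TABLE; j++)
                        root->child[i]->child[j] = alloc_node(true);
        }
        return root;
}

/*
 * One walk of the tree.  With leafs_only=true, only leaf entries are freed;
 * with leafs_only=false, the remaining (now childless) tables are freed.
 */
static void zap_pass(struct toy_node *parent, bool leafs_only)
{
        int i;

        for (i = 0; i < ENTRIES_PER_TABLE; i++) {
                struct toy_node *child = parent->child[i];

                if (!child)
                        continue;       /* "non-present" entry, nothing to do */

                if (!child->is_leaf)
                        zap_pass(child, leafs_only);

                if (child->is_leaf == leafs_only) {
                        free(child);
                        parent->child[i] = NULL;
                }
        }
}

int main(void)
{
        struct toy_node *root = build_tree();

        zap_pass(root, true);   /* pass 1: leaves only */
        zap_pass(root, false);  /* pass 2: drop the non-leaf structures */
        free(root);

        puts("two-pass zap done");
        return 0;
}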