Date: Tue, 19 Jul 2022 19:52:40 +0000
From: Sean Christopherson
To: Lai Jiangshan
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini,
    Vitaly Kuznetsov, Maxim Levitsky, Lai Jiangshan
Subject: Re: [PATCH 05/12] KVM: X86/MMU: Clear unsync bit directly in __mmu_unsync_walk()
Message-ID:
References: <20220605064342.309219-1-jiangshanlai@gmail.com>
 <20220605064342.309219-6-jiangshanlai@gmail.com>
In-Reply-To: <20220605064342.309219-6-jiangshanlai@gmail.com>

On Sun, Jun 05, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan
>
> mmu_unsync_walk() and __mmu_unsync_walk() requires the caller to clear
> unsync for the shadow pages in the resulted pvec by synching them or
> zapping them.
>
> All callers does so.
>
> Otherwise mmu_unsync_walk() and __mmu_unsync_walk() can't work because
> they always walk from the beginning.
>
> It is possible to make mmu_unsync_walk() and __mmu_unsync_walk() lists
> unsync shadow pages in the resulted pvec without needing synching them
> or zapping them later.  It would require to change mmu_unsync_walk()
> and __mmu_unsync_walk() and make it walk from the last visited position
> derived from the resulted pvec of the previous call of mmu_unsync_walk().
>
> It would complicate the walk and no callers require the possible new
> behavior.
>
> It is better to keep the original behavior.
>
> Since the shadow pages in the resulted pvec will be synced or zapped,
> and clear_unsync_child_bit() for parents will be called anyway later.
>
> Call clear_unsync_child_bit() earlier and directly in __mmu_unsync_walk()
> to make the code more efficient (the memory of the shadow pages is hot
> in the CPU cache, and no need to visit the shadow pages again later).

The changelog and shortlog do a poor job of capturing what this patch
actually does.  This is a prime example of why I prefer that changelogs
first document what the patch is doing, and only then dive into background
details and alternatives.  This changelog has 6-7 paragraphs talking about
current KVM behaviors and alternatives before talking about the patch
itself, and then doesn't actually describe the net effect of the change.

The use of "directly" in the shortlog is also confusing because
__mmu_unsync_walk() already invokes clear_unsync_child_bit(), e.g. this
patch only affects __mmu_unsync_walk().

IIUC, the change is that __mmu_unsync_walk() will clear the unsync info
when adding to @pvec instead of having to redo the walk after
zapping/synching the page.

  KVM: x86/mmu: Clear unsync child _before_ final zap/sync

  Clear the unsync child information for a shadow page when adding it to
  the array of to-be-zapped/synced shadow pages, i.e. _before_ the actual
  zap/sync that effects the "no longer unsync" state change.
  Callers of mmu_unsync_walk() and __mmu_unsync_walk() are required to
  zap/sync all entries in @pvec before dropping mmu_lock, i.e. once a
  shadow page is added to the set of pages to zap/sync, success is
  guaranteed.  Clearing the unsync info when adding to the array yields
  more efficient code as KVM will no longer need to rewalk the shadow
  pages to "discover" that the child page is no longer unsync, and as a
  bonus, the metadata for the shadow page will be hot in the CPU cache.

  Note, this obviously doesn't work if success isn't guaranteed, but
  mmu_unsync_walk() and __mmu_unsync_walk() would require significant
  changes to allow restarting a walk after failure to zap/sync.  I.e.
  this is but one of many details that would need to change.

> Signed-off-by: Lai Jiangshan
> ---
>  arch/x86/kvm/mmu/mmu.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f35fd5c59c38..2446ede0b7b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1794,19 +1794,23 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
>  				return -ENOSPC;
>
>  			ret = __mmu_unsync_walk(child, pvec);
> -			if (!ret) {
> -				clear_unsync_child_bit(sp, i);
> -				continue;
> -			} else if (ret > 0) {
> -				nr_unsync_leaf += ret;
> -			} else
> +			if (ret < 0)
>  				return ret;
> -		} else if (child->unsync) {
> +			nr_unsync_leaf += ret;
> +		}
> +
> +		/*
> +		 * Clear unsync bit for @child directly if @child is fully
> +		 * walked and all the unsync shadow pages descended from
> +		 * @child (including itself) are added into @pvec, the caller
> +		 * must sync or zap all the unsync shadow pages in @pvec.
> +		 */
> +		clear_unsync_child_bit(sp, i);
> +		if (child->unsync) {
>  			nr_unsync_leaf++;
>  			if (mmu_pages_add(pvec, child, i))

This ordering is wrong, no?  If the child itself is unsync and can't be
added to @pvec, i.e. fails here, then clearing its bit in
unsync_child_bitmap is wrong.

I also dislike that this patch obfuscates that a shadow page can't be
unsync itself _and_ have unsync children (because only PG_LEVEL_4K can be
unsync).  In other words, keep the

	if (child->unsync_children) {

	} else if (child->unsync) {

	}

And at that point, we can streamline this further:

	int i, ret, nr_unsync_leaf = 0;

	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
		struct kvm_mmu_page *child;
		u64 ent = sp->spt[i];

		if (is_shadow_present_pte(ent) && !is_large_pte(ent)) {
			child = to_shadow_page(ent & PT64_BASE_ADDR_MASK);

			if (child->unsync_children) {
				ret = __mmu_unsync_walk_and_clear(child, pvec);
				if (ret < 0)
					return ret;
				nr_unsync_leaf += ret;
			} else if (child->unsync) {
				if (mmu_pages_add(pvec, child))
					return -ENOSPC;
				nr_unsync_leaf++;
			}
		}

		/*
		 * Clear the unsync info, the child is either already sync
		 * (bitmap is stale) or is guaranteed to be zapped/synced by
		 * the caller before mmu_lock is released.  Note, the caller
		 * is required to zap/sync all entries in @pvec even if an
		 * error is returned!
		 */
		clear_unsync_child_bit(sp, i);
	}

	return nr_unsync_leaf;
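
[Editorial note] For illustration only, a self-contained user-space sketch
of the invariant discussed above (this is not KVM code; the two-level
layout and all toy_* names are made up for the example).  It models the
ordering flagged in the review: a parent's unsync-child bit is cleared only
after the child has been successfully queued, so an -ENOSPC failure leaves
the bitmap intact and a from-scratch rewalk still finds the child.

	/*
	 * Toy model only -- NOT KVM code.  Single parent, flat set of
	 * children, no recursion, no locking.
	 */
	#include <errno.h>
	#include <stdio.h>

	#define NR_CHILDREN	8
	#define PVEC_CAPACITY	4

	struct toy_page {
		int unsync;				/* leaf with stale entries */
		unsigned long unsync_child_bitmap;	/* children needing a visit */
		struct toy_page *children[NR_CHILDREN];
	};

	struct toy_pvec {
		struct toy_page *pages[PVEC_CAPACITY];
		int nr;
	};

	static int toy_pages_add(struct toy_pvec *pvec, struct toy_page *page)
	{
		if (pvec->nr == PVEC_CAPACITY)
			return -ENOSPC;
		pvec->pages[pvec->nr++] = page;
		return 0;
	}

	/* Collect unsync children of @sp; clear a bit only after a successful add. */
	static int toy_unsync_walk(struct toy_page *sp, struct toy_pvec *pvec)
	{
		int nr_unsync_leaf = 0;

		for (int i = 0; i < NR_CHILDREN; i++) {
			struct toy_page *child = sp->children[i];

			if (!(sp->unsync_child_bitmap & (1ul << i)))
				continue;

			if (child && child->unsync) {
				if (toy_pages_add(pvec, child))
					return -ENOSPC;	/* bit stays set, walk can restart */
				nr_unsync_leaf++;
			}

			/* Child is already sync (stale bit) or now queued: clear it. */
			sp->unsync_child_bitmap &= ~(1ul << i);
		}

		return nr_unsync_leaf;
	}

	int main(void)
	{
		struct toy_page leaves[NR_CHILDREN] = { 0 };
		struct toy_page root = { 0 };
		struct toy_pvec pvec = { .nr = 0 };

		for (int i = 0; i < NR_CHILDREN; i++) {
			leaves[i].unsync = (i % 3 == 0);	/* a few unsync leaves */
			root.children[i] = &leaves[i];
			root.unsync_child_bitmap |= 1ul << i;
		}

		int ret = toy_unsync_walk(&root, &pvec);

		printf("ret=%d nr=%d bitmap=%#lx\n", ret, pvec.nr,
		       root.unsync_child_bitmap);
		return 0;
	}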