Date: Tue, 12 Jul 2022 23:21:51 +0000
From: Sean Christopherson
To: Ben Gardon
Cc: David Matlack, LKML, kvm, Paolo Bonzini, Peter Xu, Jim Mattson,
    David Dunn, Jing Zhang, Junaid Shahid
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
References: <20220321224358.1305530-1-bgardon@google.com>
 <20220321224358.1305530-10-bgardon@google.com>

On Mon, Mar 28, 2022, Ben Gardon wrote:
> On Mon, Mar 28, 2022 at 10:45 AM David Matlack wrote:
> >
> > On Mon, Mar 21, 2022 at 03:43:58PM -0700, Ben Gardon wrote:
> > > +{
> > > +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > > +	struct rsvd_bits_validate shadow_zero_check;
> > > +	bool map_writable;
> > > +	kvm_pfn_t pfn;
> > > +	u64 new_spte;
> > > +	u64 mt_mask;
> > > +
> > > +	/*
> > > +	 * If addresses are being invalidated, don't do in-place promotion to
> > > +	 * avoid accidentally mapping an invalidated address.
> > > +	 */
> > > +	if (unlikely(kvm->mmu_notifier_count))
> > > +		return false;
> >
> > Why is this necessary?  Seeing this makes me wonder if we need a similar
> > check for eager page splitting.
>
> This is needed here, but not in the page splitting case, because we
> are potentially mapping new memory.

As written, it's required because KVM doesn't check that there's at least one
leaf SPTE for the range.  If KVM were to step down and find a leaf SPTE before
stepping back up to promote, then this check could be dropped: KVM zaps leaf
SPTEs during invalidate_range_start(), and the primary MMU must invalidate the
entire range covered by a huge page if it's splitting that huge page.

I'm inclined to go that route because it allows for a more unified path (with
some other prep work).  Having to find a leaf SPTE could increase the time it
takes to disable dirty logging, but unless it's an order of magnitude or worse,
I'm not sure we care, because walking SPTEs doesn't impact vCPUs (unlike
actually zapping them).

> > > +	/* Try to promote the constituent pages to an lpage. */
> > > +	if (!is_last_spte(iter.old_spte, iter.level) &&
> > > +	    try_promote_lpage(kvm, slot, &iter))
> > > 		continue;
> >
> > If iter.old_spte is not a leaf, the old loop would always continue to
> > the next SPTE.  Now we try to promote it, and if that fails we run through
> > the rest of the loop.  This seems broken.  For example, in the next line
> > we end up grabbing the pfn of the non-leaf SPTE (which would be the PFN
> > of the TDP MMU page table?) and treat that as the PFN backing this GFN,
> > which is wrong.
> >
> > In the worst case we end up zapping an SPTE that we didn't need to, but
> > we should still fix up this code.

My thought to remedy this is to drop the @pfn argument to
kvm_mmu_max_mapping_level().
It's used only for historical reasons: KVM didn't used to walk the host page
tables to get the max mapping level, and instead pulled THP information out of
struct page, i.e. KVM needed the pfn to get the page.  Dropping it would also
allow KVM to use huge pages for things that aren't backed by struct page (I
know of at least one potential use case).

I _think_ we can do the below.  It's compile tested only at this point, and I
want to split some of the changes into separate patches, e.g. the WARN on the
step-up not going out-of-bounds.  I'll put this on the backburner for now, it's
too late for 5.20, and too many people are OOO :-)

	tdp_root_for_each_pte(iter, root, start, end) {
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		/*
		 * Step down until a PRESENT, leaf SPTE is found, even when
		 * promoting a parent shadow page.  Requiring a leaf SPTE
		 * ensures that KVM is not creating a new mapping while an MMU
		 * notifier invalidation is in-progress (KVM zaps only leaf
		 * SPTEs in response to MMU notifier invalidation events), and
		 * avoids doing work for shadow pages with no children.
		 */
		if (!is_shadow_present_pte(iter.old_spte) ||
		    !is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
							      iter.gfn, PG_LEVEL_NUM);
		if (iter.level == max_mapping_level)
			continue;

		/*
		 * KVM zaps leaf SPTEs when handling MMU notifier invalidations,
		 * and the primary MMU is supposed to invalidate secondary MMUs
		 * _before_ zapping PTEs in the host page tables.  It should be
		 * impossible for a leaf SPTE to violate the host mapping level.
		 */
		if (WARN_ON_ONCE(max_mapping_level < iter.level))
			continue;

		/*
		 * The page can be remapped at a higher level, so step
		 * up to zap the parent SPTE.
		 */
		while (max_mapping_level > iter.level)
			tdp_iter_step_up(&iter);

		/*
		 * Stepping up should not cause the iter to go out of range of
		 * the memslot, the max mapping level is bounded by the memslot
		 * (among other things).
		 */
		if (WARN_ON_ONCE(iter.gfn < start || iter.gfn >= end))
			continue;

		/*
		 * Attempt to promote the non-leaf SPTE to a huge page.  If the
		 * promotion fails, zap the SPTE and let it be rebuilt on the
		 * next associated TDP page fault.
		 */
		if (!try_promote_to_huge_page(kvm, &rsvd_bits, slot, &iter))
			continue;

		/* Note, a successful atomic zap also does a remote TLB flush. */
		tdp_mmu_zap_spte_atomic(kvm, &iter);

		/*
		 * If the atomic zap fails, the iter will recurse back into
		 * the same subtree to retry.
		 */
	}

And then the promotion helper shrinks a decent amount too, as it's just getting
the pfn and memtype, and making the SPTE:

static int try_promote_to_huge_page(struct kvm *kvm,
				    struct rsvd_bits_validate *rsvd_bits,
				    const struct kvm_memory_slot *slot,
				    struct tdp_iter *iter)
{
	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
	kvm_pfn_t pfn;
	u64 new_spte;
	u8 mt_mask;
	int r;

	/*
	 * Treat the lookup as a write "fault"; in-place promotion is used only
	 * when disabling dirty logging, which requires a writable memslot.
	 */
	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
				   NULL, NULL);
	if (is_error_noslot_pfn(pfn))
		return -EINVAL;

	/*
	 * In some cases, a vCPU pointer is required to get the MT mask,
	 * however in most cases it can be generated without one.  If a
	 * vCPU pointer is needed, kvm_x86_try_get_mt_mask will fail.
	 * In that case, bail on in-place promotion.
	 */
	r = static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
						 kvm_is_mmio_pfn(pfn),
						 &mt_mask);
	if (r)
		return r;

	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    true, mt_mask, rsvd_bits, &new_spte);

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
}
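
To flesh out the kvm_mmu_max_mapping_level() comment above, the sketch below
(completely untested, purely illustrative, and it hand-waves away
host_pfn_mapping_level() needing the same pfn-less treatment) is roughly what I
have in mind for the prototype change:

int kvm_mmu_max_mapping_level(struct kvm *kvm,
			      const struct kvm_memory_slot *slot,
			      gfn_t gfn, int max_level)
{
	int host_level;

	/* Respect the memslot's disallow_lpage tracking first. */
	max_level = min(max_level, max_huge_page_level);
	while (max_level > PG_LEVEL_4K &&
	       lpage_info_slot(gfn, slot, max_level)->disallow_lpage)
		max_level--;

	if (max_level == PG_LEVEL_4K)
		return PG_LEVEL_4K;

	/*
	 * Walk the host page tables for the hva backing @gfn instead of
	 * consulting struct page, no pfn required.  Assumes a pfn-less
	 * host_pfn_mapping_level(), which doesn't exist yet.
	 */
	host_level = host_pfn_mapping_level(kvm, gfn, slot);
	return min(host_level, max_level);
}

That keeps the memslot-based restrictions as the first filter and makes the
host page table walk the sole source of truth for the backing mapping size.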