Date: Wed, 7 Jun 2023 07:30:04 -0700
In-Reply-To: <20230607073728.vggwcoylibj3cp6s@linux.intel.com>
References: <20230602011518.787006-1-seanjc@google.com>
 <20230602011518.787006-2-seanjc@google.com>
 <20230607073728.vggwcoylibj3cp6s@linux.intel.com>
Subject: Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if invalidation is in-progress
From: Sean Christopherson
To: yu.c.zhang@linux.intel.com
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 Jason Gunthorpe, Alistair Popple,
 Robin Murphy

On Wed, Jun 07, 2023, yu.c.zhang@linux.intel.com wrote:
> On Thu, Jun 01, 2023 at 06:15:16PM -0700, Sean Christopherson wrote:
> > +	pfn = gfn_to_pfn_memslot(slot, gfn);
> > +	if (is_error_noslot_pfn(pfn))
> > +		return;
> > +
> > +	read_lock(&vcpu->kvm->mmu_lock);
> > +	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
> > +				     gfn_to_hva_memslot(slot, gfn))) {
> > +		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>
> Are the mmu_invalidate_retry_hva() and the following request meant to stall
> the vCPU if there's an ongoing invalidation?

Yep.

> If yes, may I ask how the vCPU would be stalled?
>
> Below are my understandings and confusions about this process. I must have
> missed something, waiting to be educated... :)
>
> When the backing page of the APIC access page is to be reclaimed:
> 1> kvm_mmu_notifier_invalidate_range_start() -> __kvm_handle_hva_range() will
> increase kvm->mmu_invalidate_in_progress and account the start/end of this
> page in kvm_mmu_invalidate_begin().
> 2> Then kvm_unmap_gfn_range() will zap the TDP mappings and send the request,
> KVM_REQ_APIC_PAGE_RELOAD, to all vCPUs.
> 3> While a vCPU tries to reload the APIC access page before entering the guest,
> kvm->mmu_invalidate_in_progress will be checked in mmu_invalidate_retry_hva(),
> but it is possible (or is it?) that kvm->mmu_invalidate_in_progress is still
> positive, so KVM_REQ_APIC_PAGE_RELOAD is set again. No reload, and no TLB flush.
> 4> So what if the vCPU resumes with the KVM_REQ_APIC_PAGE_RELOAD &
> KVM_REQ_TLB_FLUSH flags set, yet the APIC access page is not reloaded and the
> TLB is not flushed? Or will this happen at all?

Pending requests block KVM from actually entering the guest.  If a request comes
in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
VM-Enter and go back through the entire "outer" run loop.

This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
a fair bit of prep for VM-Enter before detecting the pending request.  The
alternative would be to have kvm_vcpu_reload_apic_access_page() return a value
instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
elected for the re-request approach because (a) it didn't require redefining the
kvm_x86_ops vendor hook, (b) this should be a rare situation that isn't
performance critical overall, and (c) there's no guarantee that bailing
immediately would actually yield better performance from the guest's
perspective, e.g. if there are other pending requests/work, KVM can handle those
items while the vCPU is stalled instead of waiting until the invalidation
completes to proceed.
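To make the stall concrete, the pattern the quoted hunk belongs to looks roughly
like the sketch below.  This is a minimal illustration rather than the literal
patch: the helper name is invented, the real vendor hook has additional checks,
and the VMCS update / pfn release handling is simplified.

static void reload_apic_access_page_sketch(struct kvm_vcpu *vcpu)
{
	const gfn_t gfn = APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT;
	struct kvm *kvm = vcpu->kvm;
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

	if (!slot)
		return;

	/* Snapshot the invalidation sequence before resolving the pfn. */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	pfn = gfn_to_pfn_memslot(slot, gfn);
	if (is_error_noslot_pfn(pfn))
		return;

	read_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(kvm, mmu_seq,
				     gfn_to_hva_memslot(slot, gfn))) {
		/*
		 * An invalidation is in-progress (or completed after the
		 * snapshot): don't install a potentially stale pfn, just
		 * re-request the reload so the vCPU retries before VM-Enter.
		 */
		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
	} else {
		vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(pfn));
	}
	read_unlock(&kvm->mmu_lock);

	kvm_release_pfn_clean(pfn);
}

The re-request is what produces the "stall": the request stays pending, so
vcpu_enter_guest() never actually enters the guest until a retry succeeds (see
the flow excerpted below).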
> One more dumb question - why does KVM not just pin the APIC access page?

Definitely not a dumb question, I asked myself the same thing multiple times
when looking at this :-)  Pinning the page would be easier, and KVM actually did
that in the original implementation.  The issue is in how KVM allocates the
backing page.  It's not a traditional kernel allocation, but is instead
anonymous memory allocated by way of vm_mmap(), i.e. for all intents and
purposes it's a user allocation.  That means the kernel expects it to be a
regular movable page, e.g. it's entirely possible the page (if it were pinned)
could be the only page in a 2MiB chunk preventing the kernel from
migrating/compacting and creating a hugepage.

In hindsight, I'm not entirely convinced that unpinning the page was the right
choice, as it resulted in a handful of nasty bugs.  But now that we've fixed all
of those bugs (knock wood), there's no good argument for undoing that work,
because while the code is subtle and requires hooks in a few paths, it's not
*that* complex and for the most part doesn't require active maintenance.

For reference, the relevant flow in vcpu_enter_guest() is:

static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	if (kvm_request_pending(vcpu)) {			<= check if any request is pending
		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
			kvm_vcpu_reload_apic_access_page(vcpu);	<= re-requests APIC_PAGE_RELOAD
		...
	}
	...
	preempt_disable();

	static_call(kvm_x86_prepare_switch_to_guest)(vcpu);

	<... guest bookkeeping ...>

	if (kvm_vcpu_exit_request(vcpu)) {			<= detects the new pending request
		vcpu->mode = OUTSIDE_GUEST_MODE;
		smp_wmb();
		local_irq_enable();
		preempt_enable();
		kvm_vcpu_srcu_read_lock(vcpu);
		r = 1;
		goto cancel_injection;				<= bails from actually entering the guest
	}

	if (req_immediate_exit) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		static_call(kvm_x86_request_immediate_exit)(vcpu);
	}

	for (;;) {
		...
	}
	...
	r = static_call(kvm_x86_handle_exit)(vcpu, exit_fastpath);
	return r;

cancel_injection:
	if (req_immediate_exit)
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	static_call(kvm_x86_cancel_injection)(vcpu);
	if (unlikely(vcpu->arch.apic_attention))
		kvm_lapic_sync_from_vapic(vcpu);
out:
	return r;
}
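For completeness, kvm_vcpu_exit_request() is the check that turns the
re-requested reload into the bail-out above; it's roughly the following
(paraphrased from arch/x86/kvm/x86.c, so treat it as a sketch rather than an
exact quote):

bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
{
	/*
	 * Bail from the inner VM-Enter path if the vCPU was kicked, has a
	 * pending request (e.g. the re-requested APIC_PAGE_RELOAD), or has
	 * other work to do before entering the guest.
	 */
	return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
		xfer_to_guest_mode_work_pending();
}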