Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Thu, 8 Jun 2023 01:23:11 +0800
From:   Yu Zhang <yu.c.zhang@linux.intel.com>
To:     Sean Christopherson <seanjc@google.com>
Cc:     Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org, Jason Gunthorpe <jgg@nvidia.com>,
        Alistair Popple <apopple@nvidia.com>,
        Robin Murphy <robin.murphy@arm.com>
Subject: Re: [PATCH 1/3] KVM: VMX: Retry APIC-access page reload if
 invalidation is in-progress
Message-ID: <20230607172243.c2bkw43hcet4sfnb@linux.intel.com>
References: <20230602011518.787006-1-seanjc@google.com>
 <20230602011518.787006-2-seanjc@google.com>
 <20230607073728.vggwcoylibj3cp6s@linux.intel.com>
 <ZICUbIF2+Cvbb9GM@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <ZICUbIF2+Cvbb9GM@google.com>
User-Agent: NeoMutt/20171215
Precedence: bulk

> Pending requests block KVM from actually entering the guest.  If a request comes
> in after vcpu_enter_guest()'s initial handling of requests, KVM will bail before
> VM-Enter and go back through the entire "outer" run loop.
> 
> This isn't necessarily the most efficient way to handle the stall, e.g. KVM does
> a fair bit of prep for VM-Enter before detecting the pending request.  The
> alternative would be to have kvm_vcpu_reload_apic_access_page() return a value
> instructing vcpu_enter_guest() whether to bail immediately or continue on.  I
> elected for the re-request approach because (a) it didn't require redefining the
> kvm_x86_ops vendor hook, (b) this should be a rare situation and not performance
> critical overall, and (c) there's no guarantee that bailing immediately would
> actually yield better performance from the guest's perspective, e.g. if there are
> other pending requests/work, then KVM can handle those items while the vCPU
> is stalled instead of waiting until the invalidation completes to proceed.
> 
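
To check my understanding of the re-request flow described above, here is a
minimal user-space sketch. It is a toy model, not KVM code: every name
(toy_vcpu, TOY_REQ_APIC_PAGE_RELOAD, toy_enter_guest(), the tick counter
standing in for an in-progress invalidation) is made up for illustration.
The reload handler re-raises its own request while the invalidation is in
flight, and the pending request makes the inner function bail before "VM-Enter".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request bit; not KVM's real KVM_REQ_* encoding. */
#define TOY_REQ_APIC_PAGE_RELOAD (1u << 0)

struct toy_vcpu {
	unsigned int requests;   /* pending request bits */
	int invalidation_ticks;  /* > 0 while an invalidation is "in progress" */
	int reload_attempts;     /* how many times the reload handler ran */
	bool apic_page_loaded;   /* set once the reload finally succeeds */
};

/* Reload handler: re-request and give up if an invalidation is in flight. */
static void toy_reload_apic_access_page(struct toy_vcpu *v)
{
	v->reload_attempts++;
	if (v->invalidation_ticks > 0) {
		v->requests |= TOY_REQ_APIC_PAGE_RELOAD;
		return;
	}
	v->apic_page_loaded = true;
}

/* One pass of the inner loop; returns true if the guest was "entered". */
static bool toy_enter_guest(struct toy_vcpu *v)
{
	/* Initial handling of requests. */
	if (v->requests & TOY_REQ_APIC_PAGE_RELOAD) {
		v->requests &= ~TOY_REQ_APIC_PAGE_RELOAD;
		toy_reload_apic_access_page(v);
	}

	/* (Prep for VM-Enter would happen here.) */

	/* A request raised since the initial handling blocks entry. */
	if (v->requests)
		return false;

	return true;
}

/* Outer run loop; returns the number of passes needed to enter the guest. */
static int toy_run(struct toy_vcpu *v)
{
	int passes = 0;

	for (;;) {
		passes++;
		if (toy_enter_guest(v))
			return passes;
		/* Model the invalidation making progress between passes. */
		if (v->invalidation_ticks > 0)
			v->invalidation_ticks--;
	}
}
```

With the reload request pending and invalidation_ticks = 2, toy_run() bails
twice and enters on the third pass, after three reload attempts. The extra
passes are the cost discussed above; in exchange, the reload handler needs no
return value.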

Wah! Thank you so much! Especially for the code snippets below! :)

> > One more dumb question - why does KVM not just pin the APIC access page?
> 
> Definitely not a dumb question, I asked myself the same thing multiple times when
> looking at this :-)  Pinning the page would be easier, and KVM actually did that
> in the original implementation.  The issue is in how KVM allocates the backing
> page.  It's not a traditional kernel allocation, but is instead anonymous memory
> allocated by way of vm_mmap(), i.e. for all intents and purposes it's a user
> allocation.  That means the kernel expects it to be a regular movable page, e.g.
> it's entirely possible the page (if it were pinned) could be the only page in a
> 2MiB chunk preventing the kernel from migrating/compacting and creating a hugepage.
> 
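
The 2MiB point can be illustrated with a deliberately simplified sketch. It
assumes 4KiB base pages (so 512 pages per 2MiB chunk) and reduces the kernel's
migration/compaction logic to one hypothetical rule, "a chunk can be compacted
only if no page in it is pinned". The real logic is far more involved; toy_page
and toy_can_compact() are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE       (4u * 1024u)          /* assumed 4KiB base page */
#define TOY_HPAGE_SIZE      (2u * 1024u * 1024u)  /* 2MiB hugepage */
#define TOY_PAGES_PER_HPAGE (TOY_HPAGE_SIZE / TOY_PAGE_SIZE)

struct toy_page {
	bool pinned;  /* a pinned page cannot be migrated in this model */
};

/* Returns true only if every page in the chunk is movable (not pinned). */
static bool toy_can_compact(const struct toy_page *chunk, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		if (chunk[i].pinned)
			return false;
	}
	return true;
}
```

In this model, pinning any one of the 512 pages (for example the one backing
the APIC access page) is enough to make toy_can_compact() return false for
the whole chunk.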
> In hindsight, I'm not entirely convinced that unpinning the page was the right
> choice, as it resulted in a handful of nasty bugs.  But, now that we've fixed all
> those bugs (knock wood), there's no good argument for undoing all of that work.
> Because while the code is subtle and requires hooks in a few paths, it's not *that*
> complex and for the most part doesn't require active maintenance.
> 

Thanks again! One more thing that bothers me about the mmu notifier is the
TLB flush request. After the APIC access page is reloaded, the TLB is
flushed (a single-context EPT invalidation on not-so-outdated CPUs) in
vmx_set_apic_access_page_addr(). But the mmu notifier also sends
KVM_REQ_TLB_FLUSH, via kvm_mmu_notifier_invalidate_range_start() ->
__kvm_handle_hva_range(), which causes the vCPU to trigger another TLB
flush - normally a global EPT invalidation, I guess.

But, is this necessary?

Could we return false from kvm_unmap_gfn_range() to indicate that no further
flush is needed if the range to be unmapped falls within the guest APIC base,
and leave the TLB invalidation work to vmx_set_apic_access_page_addr()?

But there are multiple places in vmx_set_apic_access_page_addr() that return
early (e.g., if xAPIC mode is disabled for this vCPU) without triggering a TLB
flush, so I am not sure whether doing so would cause more problems... Any comments?
Thanks!
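
To make the concern concrete, here is a toy model of the idea (not KVM code).
All names are hypothetical, and the TLB is reduced to a single "may hold a
stale translation" flag. It assumes the reload path flushes only when it does
not return early. Whether the real paths interact this way is exactly what I
am asking.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vcpu {
	bool xapic_enabled;  /* models "xAPIC mode enabled for this vCPU" */
	bool stale_tlb;      /* true while an old translation may be cached */
};

/* Models the reload path: flushes only if it does not return early. */
static void toy_set_apic_access_page_addr(struct toy_vcpu *v)
{
	if (!v->xapic_enabled)
		return;             /* early return: no flush happens here */

	v->stale_tlb = false;       /* models the flush done on reload */
}

/*
 * Models the notifier path. skip_flush stands for the proposal: the unmap
 * handler reports "no flush needed" for the APIC page range.
 */
static void toy_invalidate_apic_page(struct toy_vcpu *v, bool skip_flush)
{
	v->stale_tlb = true;        /* mapping zapped; cached entry is stale */

	if (!skip_flush)
		v->stale_tlb = false;   /* models the notifier-driven flush */

	toy_set_apic_access_page_addr(v);
}
```

In this model, skipping the notifier flush is harmless when the reload path
runs to completion, but a vCPU with xapic_enabled == false is left with
stale_tlb == true, which is the case I am worried about.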

B.R.
Yu