Date: Wed, 8 Mar 2023 11:11:58 -0800
References: <20230227171751.1211786-1-jpiotrowski@linux.microsoft.com>
Subject: Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V
From: Sean Christopherson
To: Jeremi Piotrowski
Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Vitaly Kuznetsov, Tianyu Lan, Michael Kelley

On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
> On 08/03/2023 01:39, Sean Christopherson wrote:
> > On Wed, Mar 08, 2023, Paolo Bonzini wrote:
> >> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson wrote:
> >>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
> >>> hyper-v: Remote TLB flush for SVM") or fix the thing properly straightaway.  KVM
> >>> doesn't magically handle the flushes correctly for the shadow/legacy MMU, KVM just
> >>> happens to get lucky and not run afoul of the underlying bugs.
> >>
> >> I don't think it's about luck---the legacy MMU's zapping/invalidation
> >> seems to invoke the flush hypercall correctly:
> >
> > ...for the paths that Jeremi has exercised, and for which a stale TLB entry is
> > fatal to L2.  E.g. kvm_unmap_gfn_range() does not have a range-based TLB flush
> > in its path and fully relies on the buggy kvm_flush_remote_tlbs().
> >
>
> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs calls the
> hypercall that is needed, I don't see how this might be an issue of a missing
> "range-based TLB flush".

Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to the Hyper-V
hook.

svm_flush_tlb_current() is very much broken, but in practice it doesn't matter
outside of the direct call from kvm_mmu_load(), because in all other paths KVM
will flow through a Hyper-V flush if KVM actually modifies its MMU in any way.
E.g. the request from kvm_mmu_new_pgd() when force_flush_and_sync_on_reuse=true
is neutered, but that exists only as a safeguard against MMU bugs.  And for
things like kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is
that Hyper-V will still flush the bare metal TLB, it's only Hyper-V's shadow
page tables that are stale.

Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be broken.  If
Hyper-V tracks and rebuilds only the current ASID, then bumping the ASID won't
rebuild potentially stale page tables.  But I'm guessing Hyper-V ignores the
ASID since the hypercall takes only the root PA.

The truly problematic case is kvm_mmu_load(), where KVM relies on the flush to
force Hyper-V to rebuild shadow page tables for an old, possibly stale nCR3.
This affects only the TDP MMU because of an explicit optimization in the TDP
MMU.  So in practice we could squeak by with something like this:

	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)
		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
	else
		static_call(kvm_x86_flush_tlb_current)(vcpu);

but I'm not convinced that avoiding a hypercall in svm_flush_tlb_current() just
to avoid overhead when running an L3 (nested VM from L1 KVM's perspective) is
worthwhile.  The real problem there is that KVM's nested SVM TLB/ASID support is
an unoptimized mess, and I can't imagine someone running an L3 is going to be
super concerned with performance.

I also think we should have a sanity check in the flush_tlb_all() path, i.e.
WARN if kvm_flush_remote_tlbs() falls back.
Something like this (probably doesn't compile, likely needs #ifdefs or helpers):

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7ef4f9e3b35a..38afc5cac1c4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
 	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
 }
 
-static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
+static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
 	svm->current_vmcb->asid_generation--;
 }
 
+static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb &&
+	    VALID_PAGE(vcpu->arch.mmu->root.hpa))
+		hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
+
+	svm_flush_tlb_asid(vcpu);
+}
+
+static void svm_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb))
+		hv_remote_flush_tlb(vcpu->kvm);
+
+	svm_flush_tlb_asid(vcpu);
+}
+
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4786,10 +4803,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.set_rflags = svm_set_rflags,
 	.get_if_flag = svm_get_if_flag,
 
-	.flush_tlb_all = svm_flush_tlb_current,
+	.flush_tlb_all = svm_flush_tlb_all,
 	.flush_tlb_current = svm_flush_tlb_current,
 	.flush_tlb_gva = svm_flush_tlb_gva,
-	.flush_tlb_guest = svm_flush_tlb_current,
+	.flush_tlb_guest = svm_flush_tlb_asid,
 
 	.vcpu_pre_run = svm_vcpu_pre_run,
 	.vcpu_run = svm_vcpu_run,
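
For reference, the fallback I'm talking about is the arch hook dispatch in
kvm_flush_remote_tlbs().  Roughly, paraphrasing from memory, so treat the exact
names and signatures as approximate rather than gospel:

	/* virt/kvm/kvm_main.c (sketch) */
	void kvm_flush_remote_tlbs(struct kvm *kvm)
	{
		/*
		 * The arch hook gets first crack; only if it is absent or
		 * fails does KVM fall back to sending KVM_REQ_TLB_FLUSH to
		 * all vCPUs, which is what lands in .flush_tlb_all.
		 */
		if (!kvm_arch_flush_remote_tlb(kvm) ||
		    kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
			++kvm->stat.generic.remote_tlb_flush;
	}

	/* arch/x86/include/asm/kvm_host.h (sketch) */
	static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm)
	{
		/* On Hyper-V, tlb_remote_flush is wired to hv_remote_flush_tlb. */
		if (kvm_x86_ops.tlb_remote_flush &&
		    !static_call(kvm_x86_tlb_remote_flush)(kvm))
			return 0;
		else
			return -ENOTSUPP;
	}

I.e. when the Hyper-V hook is wired up and succeeds, kvm_flush_remote_tlbs()
shouldn't fall back to the per-vCPU request path at all, which is why reaching
.flush_tlb_all with the hook in place is worth the WARN_ON_ONCE above.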