2023-02-10 18:17:14

by Jeremi Piotrowski

Subject: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

Hi Paolo/Sean,

We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing"
conflict with a nested Hyper-V enlightenment that is always enabled on AMD CPUs
(HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that is affected is L0 Hyper-V + L1 KVM on AMD,

L2 VMs fail to boot due to stale data being seen on the L1/L2 side; it looks
like the NPT is not in sync with L0. I can reproduce this on any kernel >= 5.18;
the easiest way is to launch qemu in a loop with debug OVMF. You can observe
various #GP faults, assert failures, or the guest may just suddenly die. You can try it
for yourself in Azure by launching an Ubuntu 22.10 image on an AMD SKU with nested
virtualization (Da_v5).

While investigating, I found that three things allow L2 guests to boot again:
* force tdp_mmu=N when loading kvm
* recompile L1 kernel to force disable HV_X64_NESTED_ENLIGHTENED_TLB
* revert both of these commits (found through bisecting):
bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages"
efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct roots via asynchronous worker"

I'll paste our understanding of what is happening (thanks Tianyu):
"""
Hyper-V provides the HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE
and HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST hypercalls for the L1
hypervisor to notify Hyper-V after the L1 hypervisor changes the L2 GPA <-> L1 GPA address
translation tables (what Intel calls EPT and AMD calls NPT). This way Hyper-V does not
have to mark the L1 hypervisor's entire address translation tables as write-protected,
and it avoids the vmexits that would otherwise be triggered when the L1 hypervisor
changes those tables.

These commits defer calling these two hvcalls when there are changes in the L1
hypervisor's address translation tables. Because of the delay, Hyper-V can't sync/shadow
the L1 tables right away, which may cause a mismatch between the shadow page tables in
Hyper-V and the L1 address translation tables. IIRC, the KVM side always uses
write-protected translation tables for shadowing and so doesn't hit this issue with
these commits.
"""

Let me know if either of you have any ideas on how to approach fixing this.
I'm not familiar enough with TDP MMU code to be able to contribute a fix directly
but I'm happy to help in any way I can.

Jeremi


2023-02-10 18:45:39

by Sean Christopherson

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
> Hi Paolo/Sean,
>
> We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing"
> conflict with a nested Hyper-V enlightenment that is always enabled on AMD CPUs
> (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that is affected is L0 Hyper-V + L1 KVM on AMD,
>
> L2 VMs fail to boot due to stale data being seen on L1/L2 side, it looks
> like the NPT is not in sync with L0. I can reproduce this on any kernel >=5.18,
> the easiest way is by launching qemu in a loop with debug OVMF, you can observe
> various #GP faults, assert failures, or the guest just suddenly dies. You can try it
> for yourself in Azure by launching an Ubuntu 22.10 image on an AMD SKU with nested
> virtualization (Da_v5).
>
> In investigating I found that 3 things allow L2 guests to boot again:
> * force tdp_mmu=N when loading kvm
> * recompile L1 kernel to force disable HV_X64_NESTED_ENLIGHTENED_TLB
> * revert both of these commits (found through bisecting):
> bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages"
> efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct roots via asynchronous worker"
>
> I'll paste our understanding of what is happening (thanks Tianyu):
> """
> Hyper-V provides HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE
> and HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST hvcalls for l1
> hypervisor to notify Hyper-V after L1 hypervisor changes L2 GPA <-> L1 GPA address
> translation tables(Intel calls EPT and AMD calls NPT). This may help not to
> mask whole address translation tables of L1 hypervisor to be write-protected in Hyper-V
> and avoid vmexits triggered by changing address translation table in L1 hypervisor.
>
> The following commits defers to call these two hvcalls when there are changes in the L1
> hypervisor address translation table. Hyper-V can't sync/shadow L1 address space
> table at the first time due to the delay and this may cause mismatch between shadow page table
> in the Hyper-V and L1 address translation table. IIRC, KVM side always uses write-protected
> translation table to shadow and so doesn't meet such issue with the commit.
> """
>
> Let me know if either of you have any ideas on how to approach fixing this.
> I'm not familiar enough with TDP MMU code to be able to contribute a fix directly
> but I'm happy to help in any way I can.

As a hopefully quick-and-easy first step, can you try running KVM built from:

https://github.com/kvm-x86/linux/tree/mmu

specifically to get the fixes for KVM's usage of range-based TLB flushes:

https://lore.kernel.org/all/[email protected]

2023-02-13 12:45:01

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 10/02/2023 19:45, Sean Christopherson wrote:
> On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
>> Hi Paolo/Sean,
>>
>> We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing"
>> conflict with a nested Hyper-V enlightenment that is always enabled on AMD CPUs
>> (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that is affected is L0 Hyper-V + L1 KVM on AMD,
>>
>> L2 VMs fail to boot due to stale data being seen on L1/L2 side, it looks
>> like the NPT is not in sync with L0. I can reproduce this on any kernel >=5.18,
>> the easiest way is by launching qemu in a loop with debug OVMF, you can observe
>> various #GP faults, assert failures, or the guest just suddenly dies. You can try it
>> for yourself in Azure by launching an Ubuntu 22.10 image on an AMD SKU with nested
>> virtualization (Da_v5).
>>
>> In investigating I found that 3 things allow L2 guests to boot again:
>> * force tdp_mmu=N when loading kvm
>> * recompile L1 kernel to force disable HV_X64_NESTED_ENLIGHTENED_TLB
>> * revert both of these commits (found through bisecting):
>> bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages"
>> efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct roots via asynchronous worker"
>>
>> I'll paste our understanding of what is happening (thanks Tianyu):
>> """
>> Hyper-V provides HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE
>> and HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST hvcalls for l1
>> hypervisor to notify Hyper-V after L1 hypervisor changes L2 GPA <-> L1 GPA address
>> translation tables(Intel calls EPT and AMD calls NPT). This may help not to
>> mask whole address translation tables of L1 hypervisor to be write-protected in Hyper-V
>> and avoid vmexits triggered by changing address translation table in L1 hypervisor.
>>
>> The following commits defers to call these two hvcalls when there are changes in the L1
>> hypervisor address translation table. Hyper-V can't sync/shadow L1 address space
>> table at the first time due to the delay and this may cause mismatch between shadow page table
>> in the Hyper-V and L1 address translation table. IIRC, KVM side always uses write-protected
>> translation table to shadow and so doesn't meet such issue with the commit.
>> """
>>
>> Let me know if either of you have any ideas on how to approach fixing this.
>> I'm not familiar enough with TDP MMU code to be able to contribute a fix directly
>> but I'm happy to help in any way I can.
>
> As a hopefully quick-and-easy first step, can you try running KVM built from:
>
> https://github.com/kvm-x86/linux/tree/mmu
>
> specifically to get the fixes for KVM's usage of range-based TLB flushes:
>
> https://lore.kernel.org/all/[email protected]

Just built a kernel from that tree, and it displays the same behavior. The problem
is not that the addresses are wrong, but that the flushes are now issued at the wrong
time, at least relative to what the "enlightened NPT TLB flush" requires.

Jeremi

2023-02-13 12:51:05

by Paolo Bonzini

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V


On 2/13/23 13:44, Jeremi Piotrowski wrote:
> Just built a kernel from that tree, and it displays the same behavior. The problem
> is not that the addresses are wrong, but that the flushes are issued at the wrong
> time now. At least for what "enlightened NPT TLB flush" requires.

It is not clear to me why HvCallFlushGuestPhysicalAddressSpace or
HvCallFlushGuestPhysicalAddressList would have stricter requirements
than a "regular" TLB shootdown using INVEPT.

Can you clarify what you mean by wrong time, preferably with some kind
of sequence of events?

That is, something like

CPU 0 Modify EPT from ... to ...
CPU 0 call_rcu() to free page table
CPU 1 ... which is invalid because ...
CPU 0 HvCallFlushGuestPhysicalAddressSpace

Paolo


2023-02-13 17:38:32

by Sean Christopherson

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
> Hi Paolo/Sean,
>
> We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU
> zapping and flushing" conflict with a nested Hyper-V enlightenment that is
> always enabled on AMD CPUs (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that
> is affected is L0 Hyper-V + L1 KVM on AMD,

Do you see issues with Intel and HV_X64_NESTED_GUEST_MAPPING_FLUSH? IIUC, on the
KVM side, that setup is equivalent to HV_X64_NESTED_ENLIGHTENED_TLB.

> IIRC, KVM side always uses write-protected translation table to shadow and so
> doesn't meet such issue with the commit.

This is incorrect. KVM write-protects guest PTEs that point at 2MiB and larger
pages, but 4KiB PTEs are allowed to become "unsync" and KVM's shadow NPT/EPT entries
are synchronized with the guest only on a relevant TLB flush.

I know of at least one non-KVM-hypervisor TDP TLB flushing bug that was found
specifically because of KVM's infinite software TLB. That doesn't mean that this
isn't a KVM bug, I just want to call out that KVM-on-KVM should be capable of
detecting KVM-as-L1 TLB bugs, at least on Intel/VMX/EPT (KVM's nested SVM support
is woefully naive from a TLB flushing perspective and synchronizes guest PTEs
before every nested VM-Entry to L2).

2023-02-13 17:49:51

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 13/02/2023 18:38, Sean Christopherson wrote:
> On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
>> Hi Paolo/Sean,
>>
>> We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU
>> zapping and flushing" conflict with a nested Hyper-V enlightenment that is
>> always enabled on AMD CPUs (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that
>> is affected is L0 Hyper-V + L1 KVM on AMD,
>
> Do you see issues with Intel and HV_X64_NESTED_GUEST_MAPPING_FLUSH? IIUC, on the
> KVM side, that setup is equivalent to HV_X64_NESTED_ENLIGHTENED_TLB.
>

I couldn't reproduce this in any way on Intel. I can test again tomorrow, and see what
the differences are in the ftrace.

>> IIRC, KVM side always uses write-protected translation table to shadow and so
>> doesn't meet such issue with the commit.
>
> This is incorrect. KVM write-protects guest PTEs that point at 2MiB and larger
> pages, but 4KiB PTEs are allowed to become "unsync" and KVM's shadow NPT/EPT entries
> are synchronized with the guest only on a relevant TLB flush.
>
> I know of at least one non-KVM-hypervisor TDP TLB flushing bug that was found
> specifically because of KVM's infinite software TLB. That doesn't mean that this
> isn't a KVM bug, I just want to call out that KVM-on-KVM should be capable of
> detecting KVM-as-L1 TLB bugs, at least on Intel/VMX/EPT (KVM's nested SVM support
> is woefully naive from a TLB flushing perspective and synchronizes guest PTEs
> before every nested VM-Entry to L2).

2023-02-13 18:12:00

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V



On 13/02/2023 13:50, Paolo Bonzini wrote:
>
> On 2/13/23 13:44, Jeremi Piotrowski wrote:
>> Just built a kernel from that tree, and it displays the same behavior. The problem
>> is not that the addresses are wrong, but that the flushes are issued at the wrong
>> time now. At least for what "enlightened NPT TLB flush" requires.
>
> It is not clear to me why HvCallFlushGuestPhysicalAddressSpace or HvCallFlushGuestPhysicalAddressList would have stricter requirements than a "regular" TLB shootdown using INVEPT.
>
> Can you clarify what you mean by wrong time, preferably with some kind of sequence of events?
>
> That is, something like
>
> CPU 0    Modify EPT from ... to ...
> CPU 0    call_rcu() to free page table
> CPU 1    ... which is invalid because ...
> CPU 0    HvCallFlushGuestPhysicalAddressSpace
>
> Paolo

So I looked at the ftrace (all kvm & kvmmu events + hyperv_nested_* events), and I see the following:
With tdp_mmu=0:
kvm_exit
sequence of kvm_mmu_prepare_zap_page
hyperv_nested_flush_guest_mapping (always follows every sequence of kvm_mmu_prepare_zap_page)
kvm_entry

With tdp_mmu=1 I see:
kvm_mmu_prepare_zap_page and kvm_tdp_mmu_spte_changed events from a kworker context, but
they are not followed by hyperv_nested_flush_guest_mapping. The only hyperv_nested_flush_guest_mapping
events I see happen from the qemu process context.

Also the number of flush hypercalls is significantly lower: a 7-second sequence through OVMF with
tdp_mmu=0 produces ~270 flush hypercalls. In the traces with tdp_mmu=1 I now see max 3.

So this might be easier to diagnose than I thought: the HvCallFlushGuestPhysicalAddressSpace calls
are missing now.

2023-02-13 18:12:44

by Paolo Bonzini

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 2/13/23 18:38, Sean Christopherson wrote:
> On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
>> Hi Paolo/Sean,
>>
>> We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU
>> zapping and flushing" conflict with a nested Hyper-V enlightenment that is
>> always enabled on AMD CPUs (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that
>> is affected is L0 Hyper-V + L1 KVM on AMD,
>
> Do you see issues with Intel and HV_X64_NESTED_GUEST_MAPPING_FLUSH? IIUC, on the
> KVM side, that setup is equivalent to HV_X64_NESTED_ENLIGHTENED_TLB.

My reading of the spec[1] is that HV_X64_NESTED_ENLIGHTENED_TLB will
cause svm_flush_tlb_current to behave (in Intel parlance) as an INVVPID
rather than an INVEPT. So svm_flush_tlb_current has to be changed to
also add a call to HvCallFlushGuestPhysicalAddressSpace. I'm not sure
if that's a good idea though.

First, that's a TLB shootdown rather than just a local thing;
flush_tlb_current is supposed to be relatively cheap, and there would be
a lot of them because of the unconditional calls to
nested_svm_transition_tlb_flush on vmentry/vmexit.

Second, while the nCR3 matches across virtual processors for SVM, the
(nCR3, ASID) pair does not, so it doesn't even make much sense to do a
TLB shootdown.

Depending on the performance results of adding the hypercall to
svm_flush_tlb_current, the fix could indeed be to just disable usage of
HV_X64_NESTED_ENLIGHTENED_TLB.

Paolo

[1]
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization


2023-02-13 18:28:30

by Paolo Bonzini

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 2/13/23 19:05, Jeremi Piotrowski wrote:
> So I looked at the ftrace (all kvm&kvmmu events + hyperv_nested_* events) I see the following:
> With tdp_mmu=0:
> kvm_exit
> sequence of kvm_mmu_prepare_zap_page
> hyperv_nested_flush_guest_mapping (always follows every sequence of kvm_mmu_prepare_zap_page)
> kvm_entry
>
> With tdp_mmu=1 I see: kvm_mmu_prepare_zap_page and
> kvm_tdp_mmu_spte_changed events from a kworker context, but they are
> not followed by hyperv_nested_flush_guest_mapping. The only
> hyperv_nested_flush_guest_mapping events I see happen from the qemu
> process context.
>
> Also the number of flush hypercalls is significantly lower: a 7second
> sequence through OVMF with tdp_mmu=0 produces ~270 flush hypercalls.
> In the traces with tdp_mmu=1 I now see max 3.
>
> So this might be easier to diagnose than I thought: the
> HvCallFlushGuestPhysicalAddressSpace calls are missing now.

Can you check if KVM is reusing an nCR3 value?

If so, perhaps you can just add
hyperv_flush_guest_mapping(__pa(root->spt), NULL) after
kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()?
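
Something along these lines (a sketch only; I'm assuming the single-argument
hyperv_flush_guest_mapping(u64 as) helper here, and ignoring the include and
!CONFIG_HYPERV handling that a real patch would need):

--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ ... @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	root = tdp_mmu_alloc_sp(vcpu);
+
+	/*
+	 * Sketch: ask L0 Hyper-V to drop any stale shadow of this nCR3, in
+	 * case the page was previously used as a root and freed without an
+	 * accompanying HvCallFlushGuestPhysicalAddressSpace.
+	 */
+	hyperv_flush_guest_mapping(__pa(root->spt));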

Paolo


2023-02-13 19:12:04

by Sean Christopherson

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On Mon, Feb 13, 2023, Paolo Bonzini wrote:
> On 2/13/23 18:38, Sean Christopherson wrote:
> > On Fri, Feb 10, 2023, Jeremi Piotrowski wrote:
> > > Hi Paolo/Sean,
> > >
> > > We've noticed that changes introduced in "KVM: x86/mmu: Overhaul TDP MMU
> > > zapping and flushing" conflict with a nested Hyper-V enlightenment that is
> > > always enabled on AMD CPUs (HV_X64_NESTED_ENLIGHTENED_TLB). The scenario that
> > > is affected is L0 Hyper-V + L1 KVM on AMD,
> >
> > Do you see issues with Intel and HV_X64_NESTED_GUEST_MAPPING_FLUSH? IIUC, on the
> > KVM side, that setup is equivalent to HV_X64_NESTED_ENLIGHTENED_TLB.
>
> My reading of the spec[1] is that HV_X64_NESTED_ENLIGHTENED_TLB will cause
> svm_flush_tlb_current to behave (in Intel parlance) as an INVVPID rather
> than an INVEPT.

Oh! Good catch! Yeah, that'll be a problem. Copy-pasting the relevant snippet
so future me doesn't have to reread the spec:

If the nested hypervisor opts into the enlightenment, ASID invalidations just
flush TLB entries derived from first level address translation (i.e. the
virtual address space).

Specifically, the "missing" flushes when a root's (nCR3) refcount goes to zero
are expected because KVM relies on flushing via svm_flush_tlb_current() when the
old, stale root might be reused. That would lead to consuming stale entries when
reusing a previously freed root.

int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
	int r;

	...

	/*
	 * Flush any TLB entries for the new root, the provenance of the root
	 * is unknown. Even if KVM ensures there are no stale TLB entries
	 * for a freed root, in theory another hypervisor could have left
	 * stale entries. Flushing on alloc also allows KVM to skip the TLB
	 * flush when freeing a root (see kvm_tdp_mmu_put_root()).
	 */
	static_call(kvm_x86_flush_tlb_current)(vcpu);
out:
	return r;
}

> So svm_flush_tlb_current has to be changed to also add a
> call to HvCallFlushGuestPhysicalAddressSpace. I'm not sure if that's a good
> idea though.

That's not strictly necessary, e.g. flushes from kvm_invalidate_pcid() and
kvm_post_set_cr4() don't need to effect a full flush. I believe the virtual
address flush is also sufficient for avic_activate_vmcb(). Nested (from KVM's
perspective, i.e. running L3) can just be mutually exclusive with
HV_X64_NESTED_ENLIGHTENED_TLB.

That just leaves kvm_mmu_new_pgd()'s force_flush_and_sync_on_reuse and the
aforementioned kvm_mmu_load().

That said, the above cases where a virtual address flush is sufficient are
rare operations when using NPT, so adding a new KVM_REQ_TLB_FLUSH_ROOT or
whatever probably isn't worth doing.

> First, that's a TLB shootdown rather than just a local thing;
> flush_tlb_current is supposed to be relatively cheap, and there would be a
> lot of them because of the unconditional calls to
> nested_svm_transition_tlb_flush on vmentry/vmexit.

This isn't a nested scenario for KVM though.

> Second, while the nCR3 matches across virtual processors for SVM, the (nCR3,
> ASID) pair does not, so it doesn't even make much sense to do a TLB
> shootdown.
>
> Depending on the performance results of adding the hypercall to
> svm_flush_tlb_current, the fix could indeed be to just disable usage of
> HV_X64_NESTED_ENLIGHTENED_TLB.

Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:

---
arch/x86/kvm/kvm_onhyperv.c | 9 +++++++++
arch/x86/kvm/kvm_onhyperv.h | 4 ++++
arch/x86/kvm/svm/svm.c | 3 +++
3 files changed, 16 insertions(+)

diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
index 482d6639ef88..e03e9296c1cf 100644
--- a/arch/x86/kvm/kvm_onhyperv.c
+++ b/arch/x86/kvm/kvm_onhyperv.c
@@ -107,3 +107,12 @@ void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
 	}
 }
 EXPORT_SYMBOL_GPL(hv_track_root_tdp);
+
+void hv_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops.tlb_remote_flush != hv_remote_flush_tlb)
+		return;
+
+	WARN_ON_ONCE(hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa));
+}
+EXPORT_SYMBOL_GPL(hv_flush_tlb_current);
diff --git a/arch/x86/kvm/kvm_onhyperv.h b/arch/x86/kvm/kvm_onhyperv.h
index 287e98ef9df3..30789dfd3544 100644
--- a/arch/x86/kvm/kvm_onhyperv.h
+++ b/arch/x86/kvm/kvm_onhyperv.h
@@ -11,10 +11,14 @@ int hv_remote_flush_tlb_with_range(struct kvm *kvm,
 		struct kvm_tlb_range *range);
 int hv_remote_flush_tlb(struct kvm *kvm);
 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp);
+void hv_flush_tlb_current(struct kvm_vcpu *vcpu);
 #else /* !CONFIG_HYPERV */
 static inline void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
 {
 }
+static inline void hv_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+}
 #endif /* !CONFIG_HYPERV */
 
 #endif
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b43775490074..bfc71dfa8482 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3733,6 +3733,9 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	/* blah blah blah */
+	hv_flush_tlb_current(vcpu);
+
 	/*
 	 * Unlike VMX, SVM doesn't provide a way to flush only NPT TLB entries.
 	 * A TLB flush for the current ASID flushes both "host" and "guest" TLB

base-commit: 9fa259abdb42051e5ab4cbf0bc0cd21adcf95a4f
--


2023-02-13 19:57:13

by Paolo Bonzini

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On Mon, Feb 13, 2023 at 8:12 PM Sean Christopherson <[email protected]> wrote:
> > My reading of the spec[1] is that HV_X64_NESTED_ENLIGHTENED_TLB will cause
> > svm_flush_tlb_current to behave (in Intel parlance) as an INVVPID rather
> > than an INVEPT.
>
> Oh! Good catch! Yeah, that'll be a problem.
>
> > So svm_flush_tlb_current has to be changed to also add a
> > call to HvCallFlushGuestPhysicalAddressSpace. I'm not sure if that's a good
> > idea though.
>
> That's not strictly necessary, e.g. flushes from kvm_invalidate_pcid() and
> kvm_post_set_cr4() don't need to effect a full flush. I believe the virtual
> address flush is also sufficient for avic_activate_vmcb(). Nested (from KVM's
> perspective, i.e. running L3) can just be mutually exclusive with
> HV_X64_NESTED_ENLIGHTENED_TLB.
>
> That just leaves kvm_mmu_new_pgd()'s force_flush_and_sync_on_reuse and the
> aforementioned kvm_mmu_load().
>
> That said, the above cases where a virtual address flush is sufficient are
> rare operations when using NPT, so adding a new KVM_REQ_TLB_FLUSH_ROOT or
> whatever probably isn't worth doing.
>
> > First, that's a TLB shootdown rather than just a local thing;
> > flush_tlb_current is supposed to be relatively cheap, and there would be a
> > lot of them because of the unconditional calls to
> > nested_svm_transition_tlb_flush on vmentry/vmexit.
>
> This isn't a nested scenario for KVM though.

Yes, but svm_flush_tlb_current() *is* also used in nested scenarios so
it's like you said below---you would have to disable enlightened TLB
when EFER.SVME=1 or something like that.

> > Depending on the performance results of adding the hypercall to
> > svm_flush_tlb_current, the fix could indeed be to just disable usage of
> > HV_X64_NESTED_ENLIGHTENED_TLB.
>
> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
>
> + /* blah blah blah */
> + hv_flush_tlb_current(vcpu);
> +

Yes, it's either this or disabling the feature.

Paolo


2023-02-14 20:27:36

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V



On 13/02/2023 20:56, Paolo Bonzini wrote:
> On Mon, Feb 13, 2023 at 8:12 PM Sean Christopherson <[email protected]> wrote:
>>> My reading of the spec[1] is that HV_X64_NESTED_ENLIGHTENED_TLB will cause
>>> svm_flush_tlb_current to behave (in Intel parlance) as an INVVPID rather
>>> than an INVEPT.
>>
>> Oh! Good catch! Yeah, that'll be a problem.
>>
>>> So svm_flush_tlb_current has to be changed to also add a
>>> call to HvCallFlushGuestPhysicalAddressSpace. I'm not sure if that's a good
>>> idea though.
>>
>> That's not strictly necessary, e.g. flushes from kvm_invalidate_pcid() and
>> kvm_post_set_cr4() don't need to effect a full flush. I believe the virtual
>> address flush is also sufficient for avic_activate_vmcb(). Nested (from KVM's
>> perspective, i.e. running L3) can just be mutually exclusive with
>> HV_X64_NESTED_ENLIGHTENED_TLB.
>>
>> That just leaves kvm_mmu_new_pgd()'s force_flush_and_sync_on_reuse and the
>> aforementioned kvm_mmu_load().
>>
>> That said, the above cases where a virtual address flush is sufficient are
>> rare operations when using NPT, so adding a new KVM_REQ_TLB_FLUSH_ROOT or
>> whatever probably isn't worth doing.
>>
>>> First, that's a TLB shootdown rather than just a local thing;
>>> flush_tlb_current is supposed to be relatively cheap, and there would be a
>>> lot of them because of the unconditional calls to
>>> nested_svm_transition_tlb_flush on vmentry/vmexit.
>>
>> This isn't a nested scenario for KVM though.
>
> Yes, but svm_flush_tlb_current() *is* also used in nested scenarios so
> it's like you said below---you would have to disable enlightened TLB
> when EFER.SVME=1 or something like that.
>
>>> Depending on the performance results of adding the hypercall to
>>> svm_flush_tlb_current, the fix could indeed be to just disable usage of
>>> HV_X64_NESTED_ENLIGHTENED_TLB.
>>
>> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
>>
>> + /* blah blah blah */
>> + hv_flush_tlb_current(vcpu);
>> +
>
> Yes, it's either this or disabling the feature.
>
> Paolo

Combining the two sub-threads: both of the suggestions:

a) adding a hyperv_flush_guest_mapping(__pa(root->spt)) call after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()

appear to work in my test case (L2 vm startup until panic due to missing rootfs).

But in both these cases (and also when I completely disable HV_X64_NESTED_ENLIGHTENED_TLB)
the runtime of an iteration of the test is noticeably longer compared to tdp_mmu=0.

So in terms of performance the ranking is (fastest to slowest):
1. tdp_mmu=0 + enlightened TLB
2. tdp_mmu=0 + no enlightened TLB
3. tdp_mmu=1 (enlightened TLB makes minimal difference)

2023-02-15 22:16:14

by Sean Christopherson

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On Tue, Feb 14, 2023, Jeremi Piotrowski wrote:
> On 13/02/2023 20:56, Paolo Bonzini wrote:
> > On Mon, Feb 13, 2023 at 8:12 PM Sean Christopherson <[email protected]> wrote:
> >>> Depending on the performance results of adding the hypercall to
> >>> svm_flush_tlb_current, the fix could indeed be to just disable usage of
> >>> HV_X64_NESTED_ENLIGHTENED_TLB.
> >>
> >> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
> >>
> >> + /* blah blah blah */
> >> + hv_flush_tlb_current(vcpu);
> >> +
> >
> > Yes, it's either this or disabling the feature.
> >
> > Paolo
>
> Combining the two sub-threads: both of the suggestions:
>
> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>
> appear to work in my test case (L2 vm startup until panic due to missing rootfs).
>
> But in both these cases (and also when I completely disable HV_X64_NESTED_ENLIGHTENED_TLB)
> the runtime of an iteration of the test is noticeably longer compared to tdp_mmu=0.

Hmm, what is the test doing?

> So in terms of performance the ranking is (fastest to slowest):
> 1. tdp_mmu=0 + enlightened TLB
> 2. tdp_mmu=0 + no enlightened TLB
> 3. tdp_mmu=1 (enlightened TLB makes minimal difference)

2023-02-16 14:40:29

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V



On 15/02/2023 23:16, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Jeremi Piotrowski wrote:
>> On 13/02/2023 20:56, Paolo Bonzini wrote:
>>> On Mon, Feb 13, 2023 at 8:12 PM Sean Christopherson <[email protected]> wrote:
>>>>> Depending on the performance results of adding the hypercall to
>>>>> svm_flush_tlb_current, the fix could indeed be to just disable usage of
>>>>> HV_X64_NESTED_ENLIGHTENED_TLB.
>>>>
>>>> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
>>>>
>>>> + /* blah blah blah */
>>>> + hv_flush_tlb_current(vcpu);
>>>> +
>>>
>>> Yes, it's either this or disabling the feature.
>>>
>>> Paolo
>>
>> Combining the two sub-threads: both of the suggestions:
>>
>> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
>> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>>
>> appear to work in my test case (L2 vm startup until panic due to missing rootfs).
>>
>> But in both these cases (and also when I completely disable HV_X64_NESTED_ENLIGHTENED_TLB)
>> the runtime of an iteration of the test is noticeably longer compared to tdp_mmu=0.
>
> Hmm, what is test doing?

Booting through OVMF and a kernel with no rootfs provided, with panic=-1 specified on the
kernel command line. It's a pure startup-time test.

>
>> So in terms of performance the ranking is (fastest to slowest):
>> 1. tdp_mmu=0 + enlightened TLB
>> 2. tdp_mmu=0 + no enlightened TLB
>> 3. tdp_mmu=1 (enlightened TLB makes minimal difference)

2023-02-24 16:17:29

by Jeremi Piotrowski

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 16/02/2023 15:40, Jeremi Piotrowski wrote:
> On 15/02/2023 23:16, Sean Christopherson wrote:
>> On Tue, Feb 14, 2023, Jeremi Piotrowski wrote:
>>> On 13/02/2023 20:56, Paolo Bonzini wrote:
>>>> On Mon, Feb 13, 2023 at 8:12 PM Sean Christopherson <[email protected]> wrote:
>>>>>> Depending on the performance results of adding the hypercall to
>>>>>> svm_flush_tlb_current, the fix could indeed be to just disable usage of
>>>>>> HV_X64_NESTED_ENLIGHTENED_TLB.
>>>>>
>>>>> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
>>>>>
>>>>> + /* blah blah blah */
>>>>> + hv_flush_tlb_current(vcpu);
>>>>> +
>>>>
>>>> Yes, it's either this or disabling the feature.
>>>>
>>>> Paolo
>>>
>>> Combining the two sub-threads: both of the suggestions:
>>>
>>> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
>>> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>>>
>>> appear to work in my test case (L2 vm startup until panic due to missing rootfs).
>>>
>>> But in both these cases (and also when I completely disable HV_X64_NESTED_ENLIGHTENED_TLB)
>>> the runtime of an iteration of the test is noticeably longer compared to tdp_mmu=0.
>>
>> Hmm, what is test doing?
>
> Booting through OVMF and kernel with no rootfs provided, and panic=-1 specified on the
> kernel command line. It's a pure startup time test.
>

Hi Sean,

Have you been able to reproduce this by any chance?

I would be glad to see either of the two fixes get merged (b, or a if it doesn't require
special L3 nested handling) in order to get this regression resolved.

Jeremi

2023-02-24 16:27:13

by Paolo Bonzini

Subject: Re: "KVM: x86/mmu: Overhaul TDP MMU zapping and flushing" breaks SVM on Hyper-V

On 2/24/23 17:17, Jeremi Piotrowski wrote:
>>>>>>> Depending on the performance results of adding the hypercall to
>>>>>>> svm_flush_tlb_current, the fix could indeed be to just disable usage of
>>>>>>> HV_X64_NESTED_ENLIGHTENED_TLB.
>>>>>> Minus making nested SVM (L3) mutually exclusive, I believe this will do the trick:
>>>>>>
>>>>>> + /* blah blah blah */
>>>>>> + hv_flush_tlb_current(vcpu);
>>>>>> +
>>>>> Yes, it's either this or disabling the feature.
>>>>>
>>>>> Paolo
>>>> Combining the two sub-threads: both of the suggestions:
>>>>
>>>> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
>>>> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>>>>
>>>> appear to work in my test case (L2 vm startup until panic due to missing rootfs).
>>>>
>>>> But in both these cases (and also when I completely disable HV_X64_NESTED_ENLIGHTENED_TLB)
>>>> the runtime of an iteration of the test is noticeably longer compared to tdp_mmu=0.
>>> Hmm, what is test doing?
>> Booting through OVMF and kernel with no rootfs provided, and panic=-1 specified on the
>> kernel command line. It's a pure startup time test.
>>
> Hi Sean,
>
> Have you been able to reproduce this by any chance?
>
> I would be glad to see either of the two fixes getting merged (b) or a) if it doesn't require
> special L3 nested handling) in order to get this regression resolved.

For now the easiest course of action is to just disable the TDP MMU if
HV_X64_NESTED_ENLIGHTENED_TLB is available.
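
Roughly something like this, e.g. from svm_hv_hardware_setup() (sketch only; the
"kvm_disable_tdp_mmu()" helper is hypothetical and just stands in for however this
would actually be plumbed into kvm_configure_mmu()/the tdp_mmu module param):

	/*
	 * Sketch: when L0 Hyper-V advertises the enlightened NPT TLB flush,
	 * fall back to the shadow MMU, whose zap paths already issue
	 * HvCallFlushGuestPhysicalAddressSpace at the right points.
	 */
	if (npt_enabled &&
	    (ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB))
		kvm_disable_tdp_mmu();	/* hypothetical helper */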

Paolo