From: Isaku Yamahata <[email protected]>
Wire up TDX PV HLT hypercall to the KVM backend function.
When the guest issues HLT, the hypercall instruction can immediately
follow a CLI instruction, atomically unmasking virtual interrupts and
issuing the HLT hypercall. A virtual interrupt can arrive right after
the CLI instruction, before the switch back to the VMM; in that case the
VMM should return to the guest without losing the interrupt. Check
whether an interrupt arrived before the TDX module switched to the VMM,
and return to the guest if so.
Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/vmx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f7c9170d596a..b0dcc2421649 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -917,6 +917,48 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
return 1;
}
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+ bool interrupt_disabled = tdvmcall_p1_read(vcpu);
+ union tdx_vcpu_state_details details;
+
+ tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+ if (!interrupt_disabled) {
+		/*
+		 * A virtual interrupt can arrive after TDG.VP.VMCALL<HLT> while
+		 * the TDX module is executing. KVM doesn't know whether the vcpu
+		 * was executing in the guest TD or in the TDX module at that point.
+		 *
+		 * CPU mode transition:
+		 *   TDG.VP.VMCALL<HLT> (SEAM VMX non-root mode) ->
+		 *   the TDX module (SEAM VMX root mode) ->
+		 *   KVM (Legacy VMX root mode)
+		 *
+		 * If a virtual interrupt arrives at this vcpu
+		 * - while the guest TD is executing:
+		 *   KVM can handle it the same way as the VMX case.
+		 * - while the TDX module is executing:
+		 *   The TDX module switches to KVM with the TDG.VP.VMCALL<HLT>
+		 *   exit reason. KVM thinks the guest was running, so the KVM
+		 *   vcpu wake-up logic doesn't kick in. Check whether a virtual
+		 *   interrupt is pending and resume the vcpu without blocking it.
+		 * - while KVM is executing:
+		 *   The existing logic wakes up the target vcpu on injecting a
+		 *   virtual interrupt, the same way as the VMX case.
+		 *
+		 * Check whether an interrupt is already pending. If so, resume
+		 * the vcpu from guest HLT without emulating the HLT instruction.
+		 */
+ details.full = td_state_non_arch_read64(
+ to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+ if (details.vmxip)
+ return 1;
+ }
+
+ return kvm_emulate_halt_noskip(vcpu);
+}
+
static int handle_tdvmcall(struct kvm_vcpu *vcpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -930,7 +972,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
switch (tdvmcall_exit_reason(vcpu)) {
case EXIT_REASON_CPUID:
return tdx_emulate_cpuid(vcpu);
-
+ case EXIT_REASON_HLT:
+ return tdx_emulate_hlt(vcpu);
default:
break;
}
--
2.25.1
On 3/4/22 20:49, [email protected] wrote:
> + bool interrupt_disabled = tdvmcall_p1_read(vcpu);
Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
> +	 * A virtual interrupt can arrive after TDG.VP.VMCALL<HLT> while
> +	 * the TDX module is executing. KVM doesn't know whether the vcpu
> +	 * was executing in the guest TD or in the TDX module at that point.
I don't understand this; why isn't it enough to check PI.ON or something
like that as part of HLT emulation?
> + details.full = td_state_non_arch_read64(
> + to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
TDX documentation says "the meaning of the field may change with Intel
TDX module version", where is this field documented? I cannot find any
"other guest state" fields in the TDX documentation.
Paolo
> + if (details.vmxip)
> + return 1;
> + }
> +
> + return kvm_emulate_halt_noskip(vcpu);
On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, [email protected] wrote:
> > + bool interrupt_disabled = tdvmcall_p1_read(vcpu);
>
> Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
>
> > +	 * A virtual interrupt can arrive after TDG.VP.VMCALL<HLT> while
> > +	 * the TDX module is executing. KVM doesn't know whether the vcpu
> > +	 * was executing in the guest TD or in the TDX module at that point.
>
> I don't understand this; why isn't it enough to check PI.ON or something
> like that as part of HLT emulation?
Ooh, I think I remember what this is. This is for the case where the virtual
interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT". KVM
doesn't have access to RVI and the interrupt is no longer in the PID (because it
was "recognized"). It doesn't get delivered in the guest because the TDCALL
completes before interrupts are enabled.
I lobbied to get this fixed in the TDX module by immediately resuming the guest
in this case, but obviously that was unsuccessful.
> > + details.full = td_state_non_arch_read64(
> > + to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
>
> TDX documentation says "the meaning of the field may change with Intel TDX
> module version", where is this field documented? I cannot find any "other
> guest state" fields in the TDX documentation.
IMO we should put a stake in the ground and refuse to accept code that consumes
"non-architectural" state. It's all software, having non-architectural APIs is
completely ridiculous.
On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 4/7/22 17:02, Sean Christopherson wrote:
> > On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> > > On 3/4/22 20:49, [email protected] wrote:
> > > > + bool interrupt_disabled = tdvmcall_p1_read(vcpu);
> > >
> > > Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
> > >
> > > > +	 * A virtual interrupt can arrive after TDG.VP.VMCALL<HLT> while
> > > > +	 * the TDX module is executing. KVM doesn't know whether the vcpu
> > > > +	 * was executing in the guest TD or in the TDX module at that point.
> > >
> > > I don't understand this; why isn't it enough to check PI.ON or something
> > > like that as part of HLT emulation?
> >
> > Ooh, I think I remember what this is. This is for the case where the virtual
> > interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT". KVM
> > doesn't have access to RVI and the interrupt is no longer in the PID (because it
> > was "recognized"). It doesn't get delivered in the guest because the TDCALL
> > completes before interrupts are enabled.
> >
> > I lobbied to get this fixed in the TDX module by immediately resuming the guest
> > in this case, but obviously that was unsuccessful.
>
> So the TDX module sets RVI while in an STI interrupt shadow. So far so
> good. Then:
>
> - it receives the HLT TDCALL from the guest. The interrupt shadow at this
> point is gone.
>
> - it knows that there is an interrupt that can be delivered (RVI > PPR &&
> EFLAGS.IF=1, the other conditions of 29.2.2 don't matter)
>
> - it forwards the HLT TDCALL nevertheless, to a clueless hypervisor that has
> no way to glean either RVI or PPR?
>
> It's absurd that this be treated as anything but a bug.
That's what I said! :-)
On 4/7/22 17:02, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Paolo Bonzini wrote:
>> On 3/4/22 20:49, [email protected] wrote:
>>> + bool interrupt_disabled = tdvmcall_p1_read(vcpu);
>>
>> Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
>>
>>> +	 * A virtual interrupt can arrive after TDG.VP.VMCALL<HLT> while
>>> +	 * the TDX module is executing. KVM doesn't know whether the vcpu
>>> +	 * was executing in the guest TD or in the TDX module at that point.
>>
>> I don't understand this; why isn't it enough to check PI.ON or something
>> like that as part of HLT emulation?
>
> Ooh, I think I remember what this is. This is for the case where the virtual
> interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT". KVM
> doesn't have access to RVI and the interrupt is no longer in the PID (because it
> was "recognized"). It doesn't get delivered in the guest because the TDCALL
> completes before interrupts are enabled.
>
> I lobbied to get this fixed in the TDX module by immediately resuming the guest
> in this case, but obviously that was unsuccessful.
So the TDX module sets RVI while in an STI interrupt shadow. So far so
good. Then:
- it receives the HLT TDCALL from the guest. The interrupt shadow at
this point is gone.
- it knows that there is an interrupt that can be delivered (RVI > PPR
&& EFLAGS.IF=1, the other conditions of 29.2.2 don't matter)
- it forwards the HLT TDCALL nevertheless, to a clueless hypervisor that
has no way to glean either RVI or PPR?
It's absurd that this be treated as anything but a bug.
Until that is fixed, KVM needs to do something like:
- every time a bit is set in PID.PIR, set tdx->buggy_hlt_workaround = 1
- every time TDG.VP.VMCALL<HLT> is received,
xchg(&tdx->buggy_hlt_workaround, 0) and return immediately to the guest
if it is 1.
Basically an internal version of PID.ON.
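
Roughly, as a sketch only (the buggy_hlt_workaround field and the exact
hook points are invented here for illustration):

	/* Wherever KVM sets a bit in PID.PIR for this vcpu: */
	static void tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu)
	{
		struct vcpu_tdx *tdx = to_tdx(vcpu);

		/* Remember that an interrupt may have raced with HLT. */
		WRITE_ONCE(tdx->buggy_hlt_workaround, 1);

		/* ... existing posted-interrupt delivery ... */
	}

	/* In tdx_emulate_hlt(), before emulating the halt: */
	if (xchg(&tdx->buggy_hlt_workaround, 0))
		return 1;	/* return to the guest immediately */

	return kvm_emulate_halt_noskip(vcpu);
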
>>> + details.full = td_state_non_arch_read64(
>>> + to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
>>
>> TDX documentation says "the meaning of the field may change with Intel TDX
>> module version", where is this field documented? I cannot find any "other
>> guest state" fields in the TDX documentation.
>
> IMO we should put a stake in the ground and refuse to accept code that consumes
> "non-architectural" state. It's all software, having non-architectural APIs is
> completely ridiculous.
Having them is fine, *using* them to work around undocumented bugs is
the ridiculous part.
You didn't answer the other question, which is "Where is R12 documented
for TDG.VP.VMCALL<Instruction.HLT>?" though... Should I be worried? :)
Paolo
On Fri, Mar 04, 2022, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire up TDX PV HLT hypercall to the KVM backend function.
>
> When the guest issues HLT, the hypercall instruction can immediately
> follow a CLI instruction, atomically unmasking virtual interrupts and
> issuing the HLT hypercall. A virtual interrupt can arrive right after
> the CLI instruction, before the switch back to the VMM; in that case the
> VMM should return to the guest without losing the interrupt. Check
> whether an interrupt arrived before the TDX module switched to the VMM,
> and return to the guest if so.
Pretty sure you mean STI, not CLI.
On Thu, Apr 07, 2022 at 05:56:05PM +0200,
Paolo Bonzini <[email protected]> wrote:
> You didn't answer the other question, which is "Where is R12 documented for
> TDG.VP.VMCALL<Instruction.HLT>?" though... Should I be worried? :)
It's publicly documented.
Guest-Host-Communication Interface (GHCI) spec, 344426-003US, February 2022.
3.8 TDG.VP.VMCALL<Instruction.HLT>
R12 Interrupt Blocked Flag.
The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
(that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
instruction, otherwise this flag should be set.
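
For illustration, the guest side then looks something like this sketch
(tdx_hlt_hypercall() is a hypothetical TDCALL wrapper; when clearing the
flag with interrupts masked, the real code must guarantee at the
instruction level that the TDCALL immediately follows STI):

	static void td_halt(void)
	{
		/*
		 * R12, the Interrupt Blocked Flag: set iff virtual
		 * interrupts are masked here, i.e. RFLAGS.IF == 0.
		 */
		bool irq_disabled = irqs_disabled();

		tdx_hlt_hypercall(irq_disabled);
	}
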
--
Isaku Yamahata <[email protected]>
On 4/8/22 06:58, Isaku Yamahata wrote:
> On Thu, Apr 07, 2022 at 05:56:05PM +0200,
> Paolo Bonzini <[email protected]> wrote:
>
>> You didn't answer the other question, which is "Where is R12 documented for
>> TDG.VP.VMCALL<Instruction.HLT>?" though... Should I be worried? :)
>
> It's publicly documented.
>
> Guest-Host-Communication Interface (GHCI) spec, 344426-003US, February 2022.
> 3.8 TDG.VP.VMCALL<Instruction.HLT>
> R12 Interrupt Blocked Flag.
> The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
> (that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
> instruction, otherwise this flag should be set.
Oh, Google doesn't know about this version of the spec... It can be
downloaded from
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
though.
I also found VCPU_STATE_DETAILS in
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf:
Bit 0: VMXIP, indicates that a virtual interrupt is pending
delivery, i.e. VMCS.RVI[7:4] > TDVPS.VAPIC.VPPR[7:4]
It also documents how it has to be used. So this looks more or less
okay, just rename "vmxip" to "interrupt_pending_delivery".
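
That is, something like the following; only bit 0 comes from the spec,
the rest is assumed reserved:

	union tdx_vcpu_state_details {
		struct {
			/* VMXIP: VMCS.RVI[7:4] > TDVPS.VAPIC.VPPR[7:4] */
			u64 interrupt_pending_delivery : 1;
			u64 reserved : 63;
		};
		u64 full;
	};
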
The VCPU_STATE_DETAILS being "non-architectural" is still worrisome.
Paolo
On Fri, Apr 08, 2022, Paolo Bonzini wrote:
> On 4/8/22 06:58, Isaku Yamahata wrote:
> > On Thu, Apr 07, 2022 at 05:56:05PM +0200,
> > Paolo Bonzini <[email protected]> wrote:
> >
> > > You didn't answer the other question, which is "Where is R12 documented for
> > > TDG.VP.VMCALL<Instruction.HLT>?" though... Should I be worried? :)
> >
> > It's publicly documented.
> >
> > Guest-Host-Communication Interface (GHCI) spec, 344426-003US, February 2022.
> > 3.8 TDG.VP.VMCALL<Instruction.HLT>
> > R12 Interrupt Blocked Flag.
> > The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
> > (that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
> > instruction, otherwise this flag should be set.
>
> Oh, Google doesn't know about this version of the spec... It can be
> downloaded from https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> though.
>
> I also found VCPU_STATE_DETAILS in https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf:
>
> Bit 0: VMXIP, indicates that a virtual interrupt is pending
> delivery, i.e. VMCS.RVI[7:4] > TDVPS.VAPIC.VPPR[7:4]
>
> It also documents how it has to be used. So this looks more or less okay,
> just rename "vmxip" to "interrupt_pending_delivery".
If we're keeping the call back into SEAM, then this belongs in the path of
apic_has_interrupt_for_ppr(), not in the HLT-exit path. To avoid multiple SEAMCALLs
in a single exit, VCPU_EXREG_RVI can be added.
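
E.g., a minimal sketch of the caching side (the VCPU_EXREG_RVI enum
value and the intr_pending field are new here, not existing KVM code):

	static bool tdx_interrupt_pending_delivery(struct kvm_vcpu *vcpu)
	{
		struct vcpu_tdx *tdx = to_tdx(vcpu);
		union tdx_vcpu_state_details details;

		if (!kvm_register_is_available(vcpu, VCPU_EXREG_RVI)) {
			/* At most one SEAMCALL per exit. */
			details.full = td_state_non_arch_read64(tdx,
					TD_VCPU_STATE_DETAILS_NON_ARCH);
			tdx->intr_pending = details.vmxip;
			kvm_register_mark_available(vcpu, VCPU_EXREG_RVI);
		}

		return tdx->intr_pending;
	}
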
On 4/8/22 16:51, Sean Christopherson wrote:
>> It also documents how it has to be used. So this looks more or less okay,
>> just rename "vmxip" to "interrupt_pending_delivery".
>
> If we're keeping the call back into SEAM, then this belongs in the path of
> apic_has_interrupt_for_ppr(), not in the HLT-exit path. To avoid multiple SEAMCALLs
> in a single exit, VCPU_EXREG_RVI can be added.
But apic_has_interrupt_for_ppr takes a PPR argument and that is not
available.
So I suppose you mean kvm_apic_has_interrupt? You would change that to
a callback, like
	int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
	{
		if (!kvm_apic_present(vcpu))
			return -1;

		return static_call(kvm_x86_apic_has_interrupt)(vcpu);
	}
and the default version would also be inlined in kvm_get_apic_interrupt,
like
-	int vector = kvm_apic_has_interrupt(vcpu);
+	int vector;
 	struct kvm_lapic *apic = vcpu->arch.apic;
 	u32 ppr;

-	if (vector == -1)
+	if (!kvm_apic_present(vcpu))
 		return -1;

+	__apic_update_ppr(apic, &ppr);
+	vector = apic_has_interrupt_for_ppr(apic, ppr);
Checking the SEAM state (which would likewise not be VCPU_EXREG_RVI, but
more like VCPU_EXREG_INTR_PENDING) would be done in the tdx case of
kvm_x86_apic_has_interrupt.
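
As a sketch, reusing the cached SEAM state from the earlier snippet
(modulo the VCPU_EXREG_INTR_PENDING rename; returning 0 is a stand-in
for "pending, but the vector is not visible to KVM"):

	static int tdx_apic_has_interrupt(struct kvm_vcpu *vcpu)
	{
		return tdx_interrupt_pending_delivery(vcpu) ? 0 : -1;
	}
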
Paolo
On Mon, Apr 11, 2022, Paolo Bonzini wrote:
> On 4/8/22 16:51, Sean Christopherson wrote:
> > > It also documents how it has to be used. So this looks more or less okay,
> > > just rename "vmxip" to "interrupt_pending_delivery".
> >
> > If we're keeping the call back into SEAM, then this belongs in the path of
> > apic_has_interrupt_for_ppr(), not in the HLT-exit path. To avoid multiple SEAMCALLs
> > in a single exit, VCPU_EXREG_RVI can be added.
>
> But apic_has_interrupt_for_ppr takes a PPR argument and that is not
> available.
>
> So I suppose you mean kvm_apic_has_interrupt?
Yeah, I realized that when I actually tried to implement my idea in code :-)
My hopefully-fully-thought-out idea for handling this:
https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]