MIME-Version: 1.0
Message-ID: <30cf953d-3132-4dbb-97ba-4b49aa949580@default>
Date: Tue, 9 Jan 2018 03:31:14 -0800 (PST)
From: Liran Alon <liran.alon@oracle.com>
To: <pbonzini@redhat.com>
Cc: <jmattson@google.com>, <dwmw@amazon.co.uk>, <bp@alien8.de>,
        <thomas.lendacky@amd.com>, <aliguori@amazon.com>,
        <linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>
Subject: Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org


----- pbonzini@redhat.com wrote:

> On 08/01/2018 21:00, Liran Alon wrote:
> > 
> > 
> > On 08/01/18 20:08, Paolo Bonzini wrote:
> >> From: Tom Lendacky <thomas.lendacky@amd.com>
> >>
> >> Set IBPB (Indirect Branch Prediction Barrier) when the current CPU
> is
> >> going to run a VCPU different from what was previously run. 
> Nested
> >> virtualization uses the same VMCB for the second level guest, but
> the
> >> L1 hypervisor should be using IBRS to protect itself.
> >>
> >> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >>   arch/x86/kvm/svm.c | 31 +++++++++++++++++++++++++++++++
> >>   1 file changed, 31 insertions(+)
> >>
> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> index 779981a00d01..dd744d896cec 100644
> >> --- a/arch/x86/kvm/svm.c
> >> +++ b/arch/x86/kvm/svm.c
> >> @@ -289,6 +289,7 @@ struct amd_svm_iommu_ir {
> >>   module_param(vgif, int, 0444);
> >>
> >>   static bool __read_mostly have_spec_ctrl;
> >> +static bool __read_mostly have_ibpb_support;
> >>
> >>   static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long
> cr0);
> >>   static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool
> invalidate_gpa);
> >> @@ -540,6 +541,7 @@ struct svm_cpu_data {
> >>       struct kvm_ldttss_desc *tss_desc;
> >>
> >>       struct page *save_area;
> >> +    struct vmcb *current_vmcb;
> >>   };
> >>
> >>   static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> >> @@ -1151,6 +1153,11 @@ static __init int svm_hardware_setup(void)
> >>           pr_info("kvm: SPEC_CTRL available\n");
> >>       else
> >>           pr_info("kvm: SPEC_CTRL not available\n");
> >> +    have_ibpb_support = have_spec_ctrl || cpu_has_ibpb_support();
> >> +    if (have_ibpb_support)
> >> +        pr_info("kvm: IBPB_SUPPORT available\n");
> >> +    else
> >> +        pr_info("kvm: IBPB_SUPPORT not available\n");
> >>
> >>       return 0;
> >>
> >> @@ -1725,11 +1732,19 @@ static void svm_free_vcpu(struct kvm_vcpu
> *vcpu)
> >>       __free_pages(virt_to_page(svm->nested.msrpm),
> MSRPM_ALLOC_ORDER);
> >>       kvm_vcpu_uninit(vcpu);
> >>       kmem_cache_free(kvm_vcpu_cache, svm);
> >> +
> >> +    /*
> >> +     * The VMCB could be recycled, causing a false negative in
> >> +     * svm_vcpu_load; block speculative execution.
> >> +     */
> >> +    if (have_ibpb_support)
> >> +        wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
> >>   }
> >>
> >>   static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>   {
> >>       struct vcpu_svm *svm = to_svm(vcpu);
> >> +    struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
> >>       int i;
> >>
> >>       if (unlikely(cpu != vcpu->cpu)) {
> >> @@ -1758,6 +1773,12 @@ static void svm_vcpu_load(struct kvm_vcpu
> >> *vcpu, int cpu)
> >>       if (static_cpu_has(X86_FEATURE_RDTSCP))
> >>           wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
> >>
> >> +    if (sd->current_vmcb != svm->vmcb) {
> >> +        sd->current_vmcb = svm->vmcb;
> >> +        if (have_ibpb_support)
> >> +            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
> >> +    }
> >> +
> >>       avic_vcpu_load(vcpu, cpu);
> >>   }
> >>
> >> @@ -2798,6 +2819,11 @@ static int nested_svm_vmexit(struct vcpu_svm
> *svm)
> >>       if (!nested_vmcb)
> >>           return 1;
> >>
> >> +    /*
> >> +     * No need for IBPB here, the L1 hypervisor should be running
> with
> >> +     * IBRS=1 and inserts one already when switching L2 VMs.
> >> +     */
> >> +
> > 
> > I am not fully convinced yet this is OK from security perspective.
> > From the CPU point-of-view, both L1 & L2 run in the same
> prediction-mode
> > (as they are both running in SVM guest-mode). Therefore, IBRS will
> not
> > hide the BHB & BTB of L1 from L2.
> 
> Indeed, IBRS will not hide the indirect branch predictor state of L2
> user mode from L1 user mode.
> 
> On current generation processors, as I understand it, IBRS simply
> disables the indirect branch predictor when set to 1.  Therefore, as

This is not how I understood what IBRS does.

Intel (not sure about AMD) states that if IBRS is set, indirect branches will now allow their predicted target address to be controlled by code that executed in a less privileged prediction-mode before the IBRS was last written with a value of 1.
(Intel also states that thus an indirect branch may be affected by code in a less privileged prediction-mode that executed AFTER IBRS mode was last written with a value of 1).

Therefore, L2 could also inject BTB/BHB entries that will be used by L1:
Consider the case that L2 injects values into BTB/BHB and then issues a hypercall which will cause #VMExit which will be forwarded to L1. Because L2 & L1 runs in the same prediction-mode (both Ring0 SVM guest-mode) from physical CPU perspective, Ring0 L1 code could be using BTB/BHB entries injected by Ring0 L2.
(The fact that L0 have set IBRS to 1 when exiting from L2 to L0 doesn't prevent those entries from being used by L1 after L0 enters L1).

This is different than what happens in non-nested case as L0 & L1 runs in different physical prediction-modes and therefore setting IBRS=1 on every #VMExit should suffice.

Therefore, I don't understand how L1 setting IBRS to 1 will help him in this case.
Maybe this mechanism works differently on AMD?

-Liran

> long as the L1 hypervisor sets IBRS=1, the indirect branch predictor
> state left by L2 does not affect execution of the L1 hypervisor.
> 
> Future processors might have a mode that says "just set IBRS=1 and
> I'll
> DTRT".  If that's done by indexing the BTB using the full virtual
> address, the CPL _and_ the ASID/PCID/VPID, then IBPB is not needed
> here
> because the L2 VM uses a different ASID.  If that's done by only
> augmenting the BTB index with the CPL, we'd need an IBPB.  But in
> this
> new world we've been thrown into, that would be a processor bug...
> 
> Note that AMD currently doesn't implement IBRS at all.  In order to
> mitigate against CVE-2017-5715, the hypervisor should issue an IBPB
> on
> every vmexit (in both L0 and L1).  However, the cost of that is very
> high.  While we may want a paranoia mode that does it, it is outside
> the
> scope of this series (which is more of a "let's unblock QEMU patches"
> thing than a complete fix).
> 
> > 6. One can implicitly assume that L1 hypervisor did issued a IBPB
> when
> > loading an VMCB and therefore it doesn't have to do it again in L0.
> > However, I see 2 problems with this approach:
> > (a) We don't protect old non-patched L1 hypervisors from Spectre
> even
> > though we could easily do that from L0 with small performance hit.
> 
> Yeah, that's a nice thing to have.  However, I disagree on the
> "small"
> performance hit... on a patched hypervisor, the cost of a double fix
> can
> be very noticeable.
> 
> > (b) When L1 hypervisor runs only a single L2 VM, it doesn't have to
> > issue an IBPB when loading the VMCB (as it didn't run any other
> VMCB
> > before) and it should be sufficient from L1 perspective to just
> write 1
> > to IBRS. However, in our nested-case, this is a security-hole.
> 
> This is the main difference in our reasoning.  I think IBRS=1 (or
> IBPB
> on vmexit) should be enough.
> 
> Paolo
> 
> > Am I misunderstanding something?
> > 
> > Regards,
> > -Liran
> > 
> >>       /* Exit Guest-Mode */
> >>       leave_guest_mode(&svm->vcpu);
> >>       svm->nested.vmcb = 0;
> >> @@ -3061,6 +3087,11 @@ static bool nested_svm_vmrun(struct vcpu_svm
> *svm)
> >>       if (!nested_vmcb)
> >>           return false;
> >>
> >> +    /*
> >> +     * No need for IBPB here, since the nested VM is less
> >> privileged.  The
> >> +     * L1 hypervisor inserts one already when switching L2 VMs.
> >> +     */
> >> +
> >>       if (!nested_vmcb_checks(nested_vmcb)) {
> >>           nested_vmcb->control.exit_code    = SVM_EXIT_ERR;
> >>           nested_vmcb->control.exit_code_hi = 0;
> >>
> >