Subject: Re: [PATCH 6/7] x86/svm: Set IBPB when running a different VCPU
From: Paolo Bonzini <pbonzini@redhat.com>
To: Liran Alon, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: jmattson@google.com, aliguori@amazon.com, thomas.lendacky@amd.com,
 dwmw@amazon.co.uk, bp@alien8.de
Date: Tue, 9 Jan 2018 12:07:48 +0100
Message-ID: <2ea3137a-c63c-cf1e-7d90-ed6416ba443d@redhat.com>
In-Reply-To: <5A53CDE0.7030607@ORACLE.COM>
References: <1515434925-10250-1-git-send-email-pbonzini@redhat.com>
 <1515434925-10250-7-git-send-email-pbonzini@redhat.com>
 <5A53CDE0.7030607@ORACLE.COM>

On 08/01/2018 21:00, Liran Alon wrote:
>
>
> On 08/01/18 20:08, Paolo Bonzini wrote:
>> From: Tom Lendacky
>>
>> Set IBPB (Indirect Branch Prediction Barrier) when the current CPU is
>> going to run a VCPU different from what was previously run.  Nested
>> virtualization uses the same VMCB for the second level guest, but the
>> L1 hypervisor should be using IBRS to protect itself.
>>
>> Signed-off-by: Tom Lendacky
>> Signed-off-by: Paolo Bonzini
>> ---
>>   arch/x86/kvm/svm.c | 31 +++++++++++++++++++++++++++++++
>>   1 file changed, 31 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 779981a00d01..dd744d896cec 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -289,6 +289,7 @@ struct amd_svm_iommu_ir {
>>   module_param(vgif, int, 0444);
>>
>>   static bool __read_mostly have_spec_ctrl;
>> +static bool __read_mostly have_ibpb_support;
>>
>>   static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
>>   static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa);
>> @@ -540,6 +541,7 @@ struct svm_cpu_data {
>>       struct kvm_ldttss_desc *tss_desc;
>>
>>       struct page *save_area;
>> +    struct vmcb *current_vmcb;
>>   };
>>
>>   static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
>> @@ -1151,6 +1153,11 @@ static __init int svm_hardware_setup(void)
>>           pr_info("kvm: SPEC_CTRL available\n");
>>       else
>>           pr_info("kvm: SPEC_CTRL not available\n");
>> +    have_ibpb_support = have_spec_ctrl || cpu_has_ibpb_support();
>> +    if (have_ibpb_support)
>> +        pr_info("kvm: IBPB_SUPPORT available\n");
>> +    else
>> +        pr_info("kvm: IBPB_SUPPORT not available\n");
>>
>>       return 0;
>>
>> @@ -1725,11 +1732,19 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
>>       __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
>>       kvm_vcpu_uninit(vcpu);
>>       kmem_cache_free(kvm_vcpu_cache, svm);
>> +
>> +    /*
>> +     * The VMCB could be recycled, causing a false negative in
>> +     * svm_vcpu_load; block speculative execution.
>> +     */
>> +    if (have_ibpb_support)
>> +        wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
>>   }
>>
>>   static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>   {
>>       struct vcpu_svm *svm = to_svm(vcpu);
>> +    struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
>>       int i;
>>
>>       if (unlikely(cpu != vcpu->cpu)) {
>> @@ -1758,6 +1773,12 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>       if (static_cpu_has(X86_FEATURE_RDTSCP))
>>           wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>>
>> +    if (sd->current_vmcb != svm->vmcb) {
>> +        sd->current_vmcb = svm->vmcb;
>> +        if (have_ibpb_support)
>> +            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
>> +    }
>> +
>>       avic_vcpu_load(vcpu, cpu);
>>   }
>>
>> @@ -2798,6 +2819,11 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>>       if (!nested_vmcb)
>>           return 1;
>>
>> +    /*
>> +     * No need for IBPB here, the L1 hypervisor should be running with
>> +     * IBRS=1 and inserts one already when switching L2 VMs.
>> +     */
>> +
>
> I am not fully convinced yet that this is OK from a security
> perspective.  From the CPU's point of view, both L1 and L2 run in the
> same prediction mode (as they are both running in SVM guest mode).
> Therefore, IBRS will not hide the BHB & BTB of L1 from L2.

Indeed, IBRS will not hide the indirect branch predictor state of L2
user mode from L1 user mode.  On current generation processors, as I
understand it, IBRS simply disables the indirect branch predictor when
set to 1.  Therefore, as long as the L1 hypervisor sets IBRS=1, the
indirect branch predictor state left by L2 does not affect execution
of the L1 hypervisor.
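To make that assumption concrete, the L1 discipline I have in mind is
roughly the following (only a sketch; FEATURE_ENABLE_IBRS is a stand-in
name for bit 0 of MSR_IA32_SPEC_CTRL, and the function names are
illustrative, not code from this series):

    /*
     * What a well-behaved L1 hypervisor does under the "IBRS=1
     * disables indirect branch prediction" semantics of current
     * processors.
     */
    static void l1_vmexit_path(void)
    {
        /* First thing after #VMEXIT: make the predictor inert. */
        wrmsrl(MSR_IA32_SPEC_CTRL, FEATURE_ENABLE_IBRS);

        /* ... all of L1's exit handling runs with IBRS=1 ... */
    }

    static void l1_vmentry_path(void)
    {
        /* Hand the predictor back to the guest and resume it. */
        wrmsrl(MSR_IA32_SPEC_CTRL, 0);

        /* ... VMRUN ... */
    }

With that discipline, whatever predictor state L2 leaves behind is dead
weight while L1 runs, which is why nested_svm_vmexit above can skip the
IBPB.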
Future processors might have a mode that says "just set IBRS=1 and
I'll DTRT".  If that's done by indexing the BTB using the full virtual
address, the CPL _and_ the ASID/PCID/VPID, then IBPB is not needed
here because the L2 VM uses a different ASID.  If that's done by only
augmenting the BTB index with the CPL, we'd need an IBPB.  But in this
new world we've been thrown into, that would be a processor bug...

Note that AMD currently doesn't implement IBRS at all.  In order to
mitigate against CVE-2017-5715, the hypervisor should issue an IBPB on
every vmexit (in both L0 and L1).  However, the cost of that is very
high.  While we may want a paranoia mode that does it, it is outside
the scope of this series (which is more of a "let's unblock QEMU
patches" thing than a complete fix).
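If somebody wants it anyway, the paranoia mode would be on the order
of the following (untested sketch on top of this patch; the module
parameter is hypothetical, while have_ibpb_support and FEATURE_SET_IBPB
are the ones this patch introduces):

    /* Hypothetical knob, not part of this series. */
    static bool __read_mostly ibpb_on_every_exit;
    module_param(ibpb_on_every_exit, bool, 0444);

    /* Would run on every #VMEXIT, before any host indirect branch. */
    static void svm_maybe_ibpb_on_exit(void)
    {
        /*
         * Flush the indirect branch predictor so that nothing the
         * guest trained can steer the host's indirect branches.
         * Prohibitively expensive on exit-heavy workloads.
         */
        if (ibpb_on_every_exit && have_ibpb_support)
            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
    }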
> 6. One can implicitly assume that the L1 hypervisor did issue an IBPB
> when loading a VMCB and therefore it doesn't have to do it again in
> L0.  However, I see 2 problems with this approach:
> (a) We don't protect old non-patched L1 hypervisors from Spectre even
> though we could easily do that from L0 with a small performance hit.

Yeah, that's a nice thing to have.  However, I disagree on the "small"
performance hit... on a patched hypervisor, the cost of a double fix
can be very noticeable.

> (b) When the L1 hypervisor runs only a single L2 VM, it doesn't have
> to issue an IBPB when loading the VMCB (as it didn't run any other
> VMCB before) and it should be sufficient from L1's perspective to
> just write 1 to IBRS.  However, in our nested case, this is a
> security hole.

This is the main difference in our reasoning.  I think IBRS=1 (or IBPB
on vmexit) should be enough.

Paolo

> Am I misunderstanding something?
>
> Regards,
> -Liran
>
>>       /* Exit Guest-Mode */
>>       leave_guest_mode(&svm->vcpu);
>>       svm->nested.vmcb = 0;
>> @@ -3061,6 +3087,11 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
>>       if (!nested_vmcb)
>>           return false;
>>
>> +    /*
>> +     * No need for IBPB here, since the nested VM is less privileged.  The
>> +     * L1 hypervisor inserts one already when switching L2 VMs.
>> +     */
>> +
>>       if (!nested_vmcb_checks(nested_vmcb)) {
>>           nested_vmcb->control.exit_code    = SVM_EXIT_ERR;
>>           nested_vmcb->control.exit_code_hi = 0;
>>
>