by KarimAllah Ahmed

[permalink] [raw]

Subject: Re: [PATCH v3 3/4] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 01/30/2018 01:22 AM, Raj, Ashok wrote:
> On Tue, Jan 30, 2018 at 01:10:27AM +0100, KarimAllah Ahmed wrote:
>> Future intel processors will use MSR_IA32_ARCH_CAPABILITIES MSR to indicate
>> RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
>> the contents will come directly from the hardware, but user-space can still
>> override it.
>>
>> [dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
>>
>> Cc: Asit Mallick <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Arjan Van De Ven <[email protected]>
>> Cc: Tim Chen <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Andi Kleen <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Cc: Jun Nakajima <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Greg KH <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Ashok Raj <[email protected]>
>> Signed-off-by: KarimAllah Ahmed <[email protected]>
>> Signed-off-by: David Woodhouse <[email protected]>
>> ---
>> arch/x86/kvm/cpuid.c | 2 +-
>> arch/x86/kvm/vmx.c | 15 +++++++++++++++
>> arch/x86/kvm/x86.c | 1 +
>> 3 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 033004d..1909635 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>
>> /* cpuid 7.0.edx*/
>> const u32 kvm_cpuid_7_0_edx_x86_features =
>> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
>> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
>>
>> /* all calls to cpuid_count() should be made on the same cpu */
>> get_cpu();
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index ea278ce..798a00b 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -581,6 +581,8 @@ struct vcpu_vmx {
>> u64 msr_host_kernel_gs_base;
>> u64 msr_guest_kernel_gs_base;
>> #endif
>> + u64 arch_capabilities;
>> +
>> u32 vm_entry_controls_shadow;
>> u32 vm_exit_controls_shadow;
>> u32 secondary_exec_control;
>> @@ -3224,6 +3226,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_IA32_TSC:
>> msr_info->data = guest_read_tsc(vcpu);
>> break;
>> + case MSR_IA32_ARCH_CAPABILITIES:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
>> + return 1;
>> + msr_info->data = to_vmx(vcpu)->arch_capabilities;
>> + break;
>> case MSR_IA32_SYSENTER_CS:
>> msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
>> break;
>> @@ -3339,6 +3347,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> if (data & PRED_CMD_IBPB)
>> wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>> break;
>> + case MSR_IA32_ARCH_CAPABILITIES:
>> + if (!msr_info->host_initiated)
>> + return 1;
>> + vmx->arch_capabilities = data;
>> + break;
>
> arch capabilities is read only. You don't need the set_msr handling for this.

This is only for host driven writes. This would allow QEMU/whatever to
override the default value (i.e. the value from the hardware).

>
>> case MSR_IA32_CR_PAT:
>> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
>> @@ -5599,6 +5612,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>> ++vmx->nmsrs;
>> }
>>
>> + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
>> + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
>>
>> vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 03869eb..8e889dc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1006,6 +1006,7 @@ static u32 msrs_to_save[] = {
>> #endif
>> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
>> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
>> + MSR_IA32_ARCH_CAPABILITIES
>
> Same here.. no need to save/restore this.
>
>> };
>>
>> static unsigned num_msrs_to_save;
>> --
>> 2.7.4
>>
>
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

2018-01-30 09:02:52

On Tue, 2018-01-30 at 08:22 -0600, Tom Lendacky wrote:
> > @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
> >
> > set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
> > }
> > +
> > + if (boot_cpu_has(X86_FEATURE_IBPB))
> > + set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
>
> Not sure you really need the check here.  If the feature isn't available
> in the hardware, then it won't be advertised in the CPUID bits to the
> guest, so the guest shouldn't try to write to the msr.  If it does, it
> will #GP. So I would think it could be set all the time to not be
> intercepted, no?

The check for boot_cpu_has() is wrong and is fairly redundant as you
say. What we actually want is guest_cpu_has(). We *don't* want to pass
the MSR through for a recalcitrant guest to bash on, if we have elected
not to expose this feature to the guest.

On Intel right now it's *really* important that we don't allow it to be
touched, even if a write would succeed. So even boot_cpu_has() would
not be entirely meaningless there. :)

> > @@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_IA32_TSC:
> > kvm_write_tsc(vcpu, msr_info);
> > break;
> > + case MSR_IA32_PRED_CMD:
> > + if (!msr_info->host_initiated &&
> > +     !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> > + return 1;
> > +
> > + if (data & PRED_CMD_IBPB)
> > + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> > + break;
>
> Should this also be in svm.c or as common code in x86.c?

See my response to [0/4]. I suggested that, but noted that it wasn't
entirely clear where we'd put the storage for SPEC_CTRL. We probably
*could* manage it for IBPB though.

> >
> > case MSR_IA32_CR_PAT:
> > if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> > if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> > @@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> > goto free_msrs;
> >
> > msr_bitmap = vmx->vmcs01.msr_bitmap;
> > +
> > + if (boot_cpu_has(X86_FEATURE_IBPB))
> > + vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);
>
> Same comment here as in svm.c, is the feature check necessary?

Again, yes but it should be guest_cpu_has() and we couldn't see how :)

Attachments:

smime.p7s (5.09 kB)

2018-01-30 17:38:02

by Jim Mattson

[permalink] [raw]

Subject: Re: [PATCH v3 2/4] KVM: x86: Add IBPB support

On Mon, Jan 29, 2018 at 4:10 PM, KarimAllah Ahmed <[email protected]> wrote:
> From: Ashok Raj <[email protected]>
>
> Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
> barriers on switching between VMs to avoid inter VM Spectre-v2 attacks.
>
> [peterz: rebase and changelog rewrite]
> [karahmed: - rebase
> - vmx: expose PRED_CMD whenever it is available
> - svm: only pass through IBPB if it is available
> - vmx: support !cpu_has_vmx_msr_bitmap()]
> [dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
> PRED_CMD is a write-only MSR]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 11 ++++++++++-
> arch/x86/kvm/svm.c | 14 ++++++++++++++
> arch/x86/kvm/vmx.c | 12 ++++++++++++
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c0eb337..033004d 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
> 0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
>
> + /* cpuid 0x80000008.ebx */
> + const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> + F(IBPB);
> +
> /* cpuid 0xC0000001.edx */
> const u32 kvm_cpuid_C000_0001_edx_x86_features =
> F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
> @@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> if (!g_phys_as)
> g_phys_as = phys_as;
> entry->eax = g_phys_as | (virt_as << 8);
> - entry->ebx = entry->edx = 0;
> + entry->edx = 0;
> + /* IBPB isn't necessarily present in hardware cpuid */
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + entry->ebx |= F(IBPB);
> + entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> + cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
> break;
> }
> case 0x80000019:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2744b973..c886e46 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -529,6 +529,7 @@ struct svm_cpu_data {
> struct kvm_ldttss_desc *tss_desc;
>
> struct page *save_area;
> + struct vmcb *current_vmcb;
> };
>
> static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
>
> set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
> }
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
> }
>
> static void add_msr_offset(u32 offset)
> @@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
> __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, svm);
> + /*
> + * The vmcb page can be recycled, causing a false negative in
> + * svm_vcpu_load(). So do a full IBPB now.
> + */
> + indirect_branch_prediction_barrier();
> }
>
> static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> + struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
> int i;
>
> if (unlikely(cpu != vcpu->cpu)) {
> @@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (static_cpu_has(X86_FEATURE_RDTSCP))
> wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>
> + if (sd->current_vmcb != svm->vmcb) {
> + sd->current_vmcb = svm->vmcb;
> + indirect_branch_prediction_barrier();
> + }
> avic_vcpu_load(vcpu, cpu);
> }
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index aa8638a..ea278ce 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2272,6 +2272,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> + indirect_branch_prediction_barrier();
> }
>
> if (!already_loaded) {
> @@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_PRED_CMD:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> + return 1;
> +
> + if (data & PRED_CMD_IBPB)
> + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> + break;
> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> goto free_msrs;
>
> msr_bitmap = vmx->vmcs01.msr_bitmap;
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
> --
> 2.7.4
>

Are you planning to allow L2 to write MSR_IA32_PRED_CMD without L0
intercepting it, if the MSR write intercept is disabled in both the
vmcs01 MSR permission bitmap and the vmcs12 MSR permission bitmap?

2018-01-30 18:20:54

On Tue, Jan 30, 2018 at 4:19 PM, Paolo Bonzini <[email protected]> wrote:
> The new code in nested_vmx_merge_msr_bitmap should be conditional on
> vmx->save_spec_ctrl_on_exit.

But then if L1 doesn't use MSR_IA32_SPEC_CTRL itself and it uses the
VM-entry MSR load list to set up L2's MSR_IA32_SPEC_CTRL, you will
never set vmx->save_spec_ctrl_on_exit, and L2's accesses to the MSR
will always be intercepted by L0.

2018-01-31 01:50:33

by David Woodhouse

[permalink] [raw]

Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On Tue, 2018-01-30 at 19:16 -0500, Paolo Bonzini wrote:
> On 30/01/2018 18:48, Raj, Ashok wrote:
> >
> > >
> > > Certainly not every vmexit! But doing it on every userspace vmexit and
> > > every sched_out would not be *that* bad.
> > Right.. agreed. We discussed the different scenarios that doing IBPB
> > on VMexit would help, and decided its really not required on every exit.
> >
> > One obvious case is when there is a VMexit and return back to Qemu
> > process (witout a real context switch) do we need that to be
> > protected from any poisoned BTB from guest?
> If the host is using retpolines, then some kind of barrier is needed. I
> don't know if the full PRED_CMD barrier is needed, or two IBRS=1/IBRS=0
> writes back-to-back are enough.
>
> If the host is using IBRS, then writing IBRS=1 at vmexit has established
> a barrier from the less privileged VMX guest environment.

IBRS will protect qemu userspace only if you set it at some point
before exiting the kernel. You don't want it set *in* the kernel, if
we're using retpolines in the kernel, so you'd want to clear it again
on the way back into the kernel. It's almost the opposite of what the
IBRS patch set was doing to protect the kernel.

Just use IBPB :)

Attachments:

smime.p7s (5.09 kB)

2018-01-31 06:56:02

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On 01/30/2018 04:16 PM, Paolo Bonzini wrote:
> On 30/01/2018 18:48, Raj, Ashok wrote:
>>> Certainly not every vmexit! But doing it on every userspace vmexit and
>>> every sched_out would not be *that* bad.
>> Right.. agreed. We discussed the different scenarios that doing IBPB
>> on VMexit would help, and decided its really not required on every exit.
>>
>> One obvious case is when there is a VMexit and return back to Qemu
>> process (witout a real context switch) do we need that to be
>> protected from any poisoned BTB from guest?
> If the host is using retpolines, then some kind of barrier is needed. I
> don't know if the full PRED_CMD barrier is needed, or two IBRS=1/IBRS=0
> writes back-to-back are enough.

I think the spec is pretty clear here: protection is only provided
*while* IBRS=1. Once it goes back to 0, all bets are off.