2018-01-30 00:17:24

by KarimAllah Ahmed

Subject: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

Add direct access to speculation control MSRs for KVM guests. This allows the
guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
retpoline+IBPB based approach.

It also exposes the ARCH_CAPABILITIES MSR which is going to be used by future
Intel processors to indicate RDCL_NO and IBRS_ALL.
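
For context, a guest that opts for this mitigation would use the newly
exposed MSRs roughly as follows (a minimal sketch only, using the MSR
helpers and bit definitions from msr-index.h; the guest-side integration
points are outside the scope of this series):

	/*
	 * Sketch (not part of this series): guest-side use of the
	 * exposed MSRs for a guest mitigating Spectre V2 via IBRS+IBPB.
	 */
	static void restrict_indirect_branches(void)
	{
		/* IBRS: constrain indirect branch speculation in the kernel. */
		wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
	}

	static void flush_branch_predictor(void)
	{
		/* IBPB: barrier against earlier branch predictor training. */
		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
	}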

Ashok Raj (1):
KVM: x86: Add IBPB support

KarimAllah Ahmed (3):
KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

arch/x86/kvm/cpuid.c | 22 ++++++++++----
arch/x86/kvm/cpuid.h | 1 +
arch/x86/kvm/svm.c | 14 +++++++++
arch/x86/kvm/vmx.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
5 files changed, 118 insertions(+), 6 deletions(-)

Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

--
2.7.4



2018-01-30 00:12:34

by KarimAllah Ahmed

Subject: [PATCH v3 2/4] KVM: x86: Add IBPB support

From: Ashok Raj <[email protected]>

Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD whenever it is available
- svm: only pass through IBPB if it is available
- vmx: support !cpu_has_vmx_msr_bitmap()]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]

Cc: Asit Mallick <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
arch/x86/kvm/cpuid.c | 11 ++++++++++-
arch/x86/kvm/svm.c | 14 ++++++++++++++
arch/x86/kvm/vmx.c | 12 ++++++++++++
3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c0eb337..033004d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);

+ /* cpuid 0x80000008.ebx */
+ const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+ F(IBPB);
+
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!g_phys_as)
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
- entry->ebx = entry->edx = 0;
+ entry->edx = 0;
+ /* IBPB isn't necessarily present in hardware cpuid */
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ entry->ebx |= F(IBPB);
+ entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+ cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
}
case 0x80000019:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2744b973..c886e46 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -529,6 +529,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;

struct page *save_area;
+ struct vmcb *current_vmcb;
};

static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)

set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
}
+
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
}

static void add_msr_offset(u32 offset)
@@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
+ /*
+ * The vmcb page can be recycled, causing a false negative in
+ * svm_vcpu_load(). So do a full IBPB now.
+ */
+ indirect_branch_prediction_barrier();
}

static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;

if (unlikely(cpu != vcpu->cpu)) {
@@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

+ if (sd->current_vmcb != svm->vmcb) {
+ sd->current_vmcb = svm->vmcb;
+ indirect_branch_prediction_barrier();
+ }
avic_vcpu_load(vcpu, cpu);
}

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aa8638a..ea278ce 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2272,6 +2272,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
+ indirect_branch_prediction_barrier();
}

if (!already_loaded) {
@@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
+ return 1;
+
+ if (data & PRED_CMD_IBPB)
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_msrs;

msr_bitmap = vmx->vmcs01.msr_bitmap;
+
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
--
2.7.4


2018-01-30 00:12:55

by KarimAllah Ahmed

Subject: [PATCH v3 1/4] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX

[dwmw2: Stop using KF() for bits in it, too]
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 8 +++-----
arch/x86/kvm/cpuid.h | 1 +
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..c0eb337 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)

#define F(x) bit(X86_FEATURE_##x)

-/* These are scattered features in cpufeatures.h. */
-#define KVM_CPUID_BIT_AVX512_4VNNIW 2
-#define KVM_CPUID_BIT_AVX512_4FMAPS 3
+/* For scattered features from cpufeatures.h; we currently expose none */
#define KF(x) bit(KVM_CPUID_BIT_##x)

int kvm_update_cpuid(struct kvm_vcpu *vcpu)
@@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,

/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS);

/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
entry->ecx &= ~F(PKU);
entry->edx &= kvm_cpuid_7_0_edx_x86_features;
- entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
+ cpuid_mask(&entry->edx, CPUID_7_EDX);
} else {
entry->ebx = 0;
entry->ecx = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index cdc70a3..dcfe227 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
[CPUID_7_ECX] = { 7, 0, CPUID_ECX},
[CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
+ [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
};

static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
--
2.7.4


2018-01-30 00:12:59

by KarimAllah Ahmed

Subject: [PATCH v3 3/4] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

Future Intel processors will use the MSR_IA32_ARCH_CAPABILITIES MSR to indicate
RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
the contents will come directly from the hardware, but user-space can still
override it.

[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
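
For illustration, a guest consumes the emulated MSR roughly like this
(a sketch; ARCH_CAP_RDCL_NO is the bit-0 definition from msr-index.h,
and the KPTI decision is just one example of a consumer):

	u64 ia32_cap = 0;

	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);

	if (ia32_cap & ARCH_CAP_RDCL_NO) {
		/* Hardware says it is not affected by Meltdown. */
	}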

Cc: Asit Mallick <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ashok Raj <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 033004d..1909635 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,

/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);

/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ea278ce..798a00b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -581,6 +581,8 @@ struct vcpu_vmx {
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
#endif
+ u64 arch_capabilities;
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
u32 secondary_exec_control;
@@ -3224,6 +3226,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
+ return 1;
+ msr_info->data = to_vmx(vcpu)->arch_capabilities;
+ break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@@ -3339,6 +3347,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (data & PRED_CMD_IBPB)
wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated)
+ return 1;
+ vmx->arch_capabilities = data;
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -5599,6 +5612,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
++vmx->nmsrs;
}

+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);

vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb..8e889dc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,6 +1006,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
+ MSR_IA32_ARCH_CAPABILITIES
};

static unsigned num_msrs_to_save;
--
2.7.4


2018-01-30 00:15:08

by KarimAllah Ahmed

Subject: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

[ Based on a patch from Ashok Raj <[email protected]> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of saving and restoring MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
it once a non-zero value is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
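
Condensed from the vmx_vcpu_run() hunks below, the resulting entry/exit
sequence is:

	/* vmentry: restore the guest value only if it is non-zero. */
	if (vmx->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);

	/* ... VMLAUNCH/VMRESUME ... */

	/* vmexit: read it back only once the guest has ever set it,
	 * then make sure the host runs with IBRS off. */
	if (vmx->save_spec_ctrl_on_exit)
		rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
	if (vmx->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, 0);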

Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ashok Raj <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
v2:
- remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
- special case writing '0' in SPEC_CTRL to avoid confusing live-migration
when the instance never used the MSR (dwmw@).
- depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
- add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
v3:
- Save/restore manually
- Fix CPUID handling
- Fix a copy & paste error in the name of SPEC_CTRL MSR in
disable_intercept.
- support !cpu_has_vmx_msr_bitmap()
---
arch/x86/kvm/cpuid.c | 7 +++++--
arch/x86/kvm/vmx.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 2 +-
3 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1909635..662d0c0 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,

/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
+ F(ARCH_CAPABILITIES);

/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
entry->edx = 0;
- /* IBPB isn't necessarily present in hardware cpuid */
+ /* IBRS and IBPB aren't necessarily present in hardware cpuid */
if (boot_cpu_has(X86_FEATURE_IBPB))
entry->ebx |= F(IBPB);
+ if (boot_cpu_has(X86_FEATURE_IBRS))
+ entry->ebx |= F(IBRS);
entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 798a00b..9ac9747 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -582,6 +582,8 @@ struct vcpu_vmx {
u64 msr_guest_kernel_gs_base;
#endif
u64 arch_capabilities;
+ u64 spec_ctrl;
+ bool save_spec_ctrl_on_exit;

u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
@@ -922,6 +924,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
u16 error_code);
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type);

static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -3226,6 +3230,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+ return 1;
+
+ msr_info->data = to_vmx(vcpu)->spec_ctrl;
+ break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
@@ -3339,6 +3350,31 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ vmx->spec_ctrl = data;
+
+ /*
+ * When it's written (to non-zero) for the first time, pass
+ * it through. This means we don't have to take the perf
+ * hit of saving it on vmexit for the common case of guests
+ * that don't use it.
+ */
+ if (cpu_has_vmx_msr_bitmap() && data &&
+ !vmx->save_spec_ctrl_on_exit) {
+ vmx->save_spec_ctrl_on_exit = true;
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_RW);
+ }
+ break;
case MSR_IA32_PRED_CMD:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
@@ -5644,6 +5680,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u64 cr0;

vmx->rmode.vm86_active = 0;
+ vmx->spec_ctrl = 0;

vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vcpu, 0);
@@ -9314,6 +9351,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)

vmx_arm_hv_timer(vcpu);

+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
/* Store host registers */
@@ -9420,6 +9466,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);

+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ */
+ if (vmx->save_spec_ctrl_on_exit)
+ rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e889dc..fc9724c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,7 +1006,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
- MSR_IA32_ARCH_CAPABILITIES
+ MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
};

static unsigned num_msrs_to_save;
--
2.7.4


2018-01-30 00:23:03

by Ashok Raj

Subject: Re: [PATCH v3 3/4] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On Tue, Jan 30, 2018 at 01:10:27AM +0100, KarimAllah Ahmed wrote:
> Future Intel processors will use the MSR_IA32_ARCH_CAPABILITIES MSR to indicate
> RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
> the contents will come directly from the hardware, but user-space can still
> override it.
>
> [dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Ashok Raj <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/vmx.c | 15 +++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> 3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 033004d..1909635 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 7.0.edx*/
> const u32 kvm_cpuid_7_0_edx_x86_features =
> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
>
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index ea278ce..798a00b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -581,6 +581,8 @@ struct vcpu_vmx {
> u64 msr_host_kernel_gs_base;
> u64 msr_guest_kernel_gs_base;
> #endif
> + u64 arch_capabilities;
> +
> u32 vm_entry_controls_shadow;
> u32 vm_exit_controls_shadow;
> u32 secondary_exec_control;
> @@ -3224,6 +3226,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> msr_info->data = guest_read_tsc(vcpu);
> break;
> + case MSR_IA32_ARCH_CAPABILITIES:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
> + return 1;
> + msr_info->data = to_vmx(vcpu)->arch_capabilities;
> + break;
> case MSR_IA32_SYSENTER_CS:
> msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
> break;
> @@ -3339,6 +3347,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> if (data & PRED_CMD_IBPB)
> wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> break;
> + case MSR_IA32_ARCH_CAPABILITIES:
> + if (!msr_info->host_initiated)
> + return 1;
> + vmx->arch_capabilities = data;
> + break;

arch capabilities is read only. You don't need the set_msr handling for this.

> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -5599,6 +5612,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> ++vmx->nmsrs;
> }
>
> + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
> + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
>
> vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb..8e889dc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1006,6 +1006,7 @@ static u32 msrs_to_save[] = {
> #endif
> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
> + MSR_IA32_ARCH_CAPABILITIES

Same here.. no need to save/restore this.

> };
>
> static unsigned num_msrs_to_save;
> --
> 2.7.4
>

2018-01-30 00:27:13

by KarimAllah Ahmed

Subject: Re: [PATCH v3 3/4] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 01/30/2018 01:22 AM, Raj, Ashok wrote:
> On Tue, Jan 30, 2018 at 01:10:27AM +0100, KarimAllah Ahmed wrote:
>> Future Intel processors will use the MSR_IA32_ARCH_CAPABILITIES MSR to indicate
>> RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
>> the contents will come directly from the hardware, but user-space can still
>> override it.
>>
>> [dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
>>
>> Cc: Asit Mallick <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Arjan Van De Ven <[email protected]>
>> Cc: Tim Chen <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Andi Kleen <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Cc: Jun Nakajima <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Greg KH <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Ashok Raj <[email protected]>
>> Signed-off-by: KarimAllah Ahmed <[email protected]>
>> Signed-off-by: David Woodhouse <[email protected]>
>> ---
>> arch/x86/kvm/cpuid.c | 2 +-
>> arch/x86/kvm/vmx.c | 15 +++++++++++++++
>> arch/x86/kvm/x86.c | 1 +
>> 3 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 033004d..1909635 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>
>> /* cpuid 7.0.edx*/
>> const u32 kvm_cpuid_7_0_edx_x86_features =
>> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
>> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
>>
>> /* all calls to cpuid_count() should be made on the same cpu */
>> get_cpu();
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index ea278ce..798a00b 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -581,6 +581,8 @@ struct vcpu_vmx {
>> u64 msr_host_kernel_gs_base;
>> u64 msr_guest_kernel_gs_base;
>> #endif
>> + u64 arch_capabilities;
>> +
>> u32 vm_entry_controls_shadow;
>> u32 vm_exit_controls_shadow;
>> u32 secondary_exec_control;
>> @@ -3224,6 +3226,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_IA32_TSC:
>> msr_info->data = guest_read_tsc(vcpu);
>> break;
>> + case MSR_IA32_ARCH_CAPABILITIES:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
>> + return 1;
>> + msr_info->data = to_vmx(vcpu)->arch_capabilities;
>> + break;
>> case MSR_IA32_SYSENTER_CS:
>> msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
>> break;
>> @@ -3339,6 +3347,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> if (data & PRED_CMD_IBPB)
>> wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>> break;
>> + case MSR_IA32_ARCH_CAPABILITIES:
>> + if (!msr_info->host_initiated)
>> + return 1;
>> + vmx->arch_capabilities = data;
>> + break;
>
> arch capabilities is read only. You don't need the set_msr handling for this.

This is only for host-driven writes. This would allow QEMU/whatever to
override the default value (i.e. the value from the hardware).
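
Concretely, a VMM drives that host_initiated path through KVM_SET_MSRS,
roughly like this (a sketch with error handling omitted; vcpu_fd is
assumed to be an open KVM vCPU file descriptor):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define MSR_IA32_ARCH_CAPABILITIES 0x0000010a

	static int override_arch_capabilities(int vcpu_fd, __u64 value)
	{
		struct {
			struct kvm_msrs hdr;
			struct kvm_msr_entry entry;
		} msrs;

		memset(&msrs, 0, sizeof(msrs));
		msrs.hdr.nmsrs = 1;
		msrs.entry.index = MSR_IA32_ARCH_CAPABILITIES;
		msrs.entry.data = value;	/* e.g. RDCL_NO | IBRS_ALL */

		return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
	}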

>
>> case MSR_IA32_CR_PAT:
>> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
>> @@ -5599,6 +5612,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>> ++vmx->nmsrs;
>> }
>>
>> + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
>> + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
>>
>> vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 03869eb..8e889dc 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1006,6 +1006,7 @@ static u32 msrs_to_save[] = {
>> #endif
>> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
>> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
>> + MSR_IA32_ARCH_CAPABILITIES
>
> Same here.. no need to save/restore this.
>
>> };
>>
>> static unsigned num_msrs_to_save;
>> --
>> 2.7.4
>>
>

2018-01-30 09:02:52

by David Woodhouse

Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests



On Tue, 2018-01-30 at 01:10 +0100, KarimAllah Ahmed wrote:
> Add direct access to speculation control MSRs for KVM guests. This allows the
> guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
> retpoline+IBPB based approach.
>
> It also exposes the ARCH_CAPABILITIES MSR which is going to be used by future
> Intel processors to indicate RDCL_NO and IBRS_ALL.

Thanks. I think you've already fixed the SPEC_CTRL patch in the git
tree so that it adds F(IBRS) to kvm_cpuid_8000_0008_ebx_x86_features,
right?

The SVM part of Ashok's IBPB patch is still exposing the PRED_CMD MSR
to guests based on boot_cpu_has(IBPB), not based on the *guest*
capabilities. Looking back at Paolo's patch set from January 9th, it
was done differently there but I think it had the same behaviour?

The rest of Paolo's patch set I think has been covered, except 6/8:
 lkml.kernel.org/r/[email protected]

That exposes SPEC_CTRL for SVM too (since AMD now apparently has it).
If adding that ends up with duplicate MSR handling for get/set, perhaps
that wants shifting up into kvm_[sg]et_msr_common()? Although I don't
see offhand where you'd put the ->spec_ctrl field in that case. It
doesn't want to live in the generic (even to non-x86) struct kvm_vcpu.
So maybe a little bit of duplication is the best answer.

Other than those details, I think we're mostly getting close. Do we
want to add STIBP on top? There is some complexity there which meant I
was happier getting these first bits ready first, before piling that on
too.

I believe Ashok sent you a change which made us do IBPB on *every*
vmexit; I don't think we need that. It's currently done in vcpu_load()
which means we'll definitely have done it between running one vCPU and
the next, and when vCPUs are pinned we basically never need to do it.

We know that VMM (e.g. qemu) userspace could be vulnerable to attacks
from guest ring 3, because there is no flush between the vmexit and the
host kernel "returning" to the userspace thread. Doing a full IBPB on
*every* vmexit would protect from that, but it's overkill. If that's
the reason, let's come up with something better.
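
For reference, the barrier in question is at bottom a single write to
the write-only PRED_CMD MSR; the in-kernel helper just wraps roughly
this behind a feature check (sketch):

	/* What indirect_branch_prediction_barrier() boils down to. */
	static inline void ibpb(void)
	{
		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
	}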



2018-01-30 09:34:19

by KarimAllah Ahmed

Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On 01/30/2018 10:00 AM, David Woodhouse wrote:
>
>
> On Tue, 2018-01-30 at 01:10 +0100, KarimAllah Ahmed wrote:
>> Add direct access to speculation control MSRs for KVM guests. This allows the
>> guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
>> retpoline+IBPB based approach.
>>
>> It also exposes the ARCH_CAPABILITIES MSR which is going to be used by future
>> Intel processors to indicate RDCL_NO and IBRS_ALL.
>
> Thanks. I think you've already fixed the SPEC_CTRL patch in the git
> tree so that it adds F(IBRS) to kvm_cpuid_8000_0008_ebx_x86_features,
> right?
Yup, this is already fixed in the tree.

>
> The SVM part of Ashok's IBPB patch is still exposing the PRED_CMD MSR
> to guests based on boot_cpu_has(IBPB), not based on the *guest*
> capabilities. Looking back at Paolo's patch set from January 9th, it
> was done differently there but I think it had the same behaviour?
>
> The rest of Paolo's patch set I think has been covered, except 6/8:
>  lkml.kernel.org/r/[email protected]
>
> That exposes SPEC_CTRL for SVM too (since AMD now apparently has it).
> If adding that ends up with duplicate MSR handling for get/set, perhaps
> that wants shifting up into kvm_[sg]et_msr_common()? Although I don't
> see offhand where you'd put the ->spec_ctrl field in that case. It
> doesn't want to live in the generic (even to non-x86) struct kvm_vcpu.
> So maybe a little bit of duplication is the best answer.
>
> Other than those details, I think we're mostly getting close. Do we
> want to add STIBP on top? There is some complexity there which meant I
> was happier getting these first bits ready first, before piling that on
> too.
>
> I believe Ashok sent you a change which made us do IBPB on *every*
> vmexit; I don't think we need that. It's currently done in vcpu_load()
> which means we'll definitely have done it between running one vCPU and
> the next, and when vCPUs are pinned we basically never need to do it.
>
> We know that VMM (e.g. qemu) userspace could be vulnerable to attacks
> from guest ring 3, because there is no flush between the vmexit and the
> host kernel "returning" to the userspace thread. Doing a full IBPB on
> *every* vmexit would protect from that, but it's overkill. If that's
> the reason, let's come up with something better.
>

2018-01-30 14:37:25

by Tom Lendacky

Subject: Re: [PATCH v3 2/4] KVM: x86: Add IBPB support

On 1/29/2018 6:10 PM, KarimAllah Ahmed wrote:
> From: Ashok Raj <[email protected]>
>
> Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
> barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.
>
> [peterz: rebase and changelog rewrite]
> [karahmed: - rebase
> - vmx: expose PRED_CMD whenever it is available
> - svm: only pass through IBPB if it is available
> - vmx: support !cpu_has_vmx_msr_bitmap()]
> [dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
> PRED_CMD is a write-only MSR]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 11 ++++++++++-
> arch/x86/kvm/svm.c | 14 ++++++++++++++
> arch/x86/kvm/vmx.c | 12 ++++++++++++
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c0eb337..033004d 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
> 0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
>
> + /* cpuid 0x80000008.ebx */
> + const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> + F(IBPB);
> +
> /* cpuid 0xC0000001.edx */
> const u32 kvm_cpuid_C000_0001_edx_x86_features =
> F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
> @@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> if (!g_phys_as)
> g_phys_as = phys_as;
> entry->eax = g_phys_as | (virt_as << 8);
> - entry->ebx = entry->edx = 0;
> + entry->edx = 0;
> + /* IBPB isn't necessarily present in hardware cpuid */
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + entry->ebx |= F(IBPB);
> + entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> + cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
> break;
> }
> case 0x80000019:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2744b973..c886e46 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -529,6 +529,7 @@ struct svm_cpu_data {
> struct kvm_ldttss_desc *tss_desc;
>
> struct page *save_area;
> + struct vmcb *current_vmcb;
> };
>
> static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
>
> set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
> }
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);

Not sure you really need the check here. If the feature isn't available
in the hardware, then it won't be advertised in the CPUID bits to the
guest, so the guest shouldn't try to write to the msr. If it does, it
will #GP. So I would think it could be set all the time to not be
intercepted, no?

> }
>
> static void add_msr_offset(u32 offset)
> @@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
> __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, svm);
> + /*
> + * The vmcb page can be recycled, causing a false negative in
> + * svm_vcpu_load(). So do a full IBPB now.
> + */
> + indirect_branch_prediction_barrier();
> }
>
> static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> + struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
> int i;
>
> if (unlikely(cpu != vcpu->cpu)) {
> @@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (static_cpu_has(X86_FEATURE_RDTSCP))
> wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>
> + if (sd->current_vmcb != svm->vmcb) {
> + sd->current_vmcb = svm->vmcb;
> + indirect_branch_prediction_barrier();
> + }
> avic_vcpu_load(vcpu, cpu);
> }
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index aa8638a..ea278ce 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2272,6 +2272,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> + indirect_branch_prediction_barrier();
> }
>
> if (!already_loaded) {
> @@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_PRED_CMD:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> + return 1;
> +
> + if (data & PRED_CMD_IBPB)
> + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> + break;

Should this also be in svm.c or as common code in x86.c?

> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> goto free_msrs;
>
> msr_bitmap = vmx->vmcs01.msr_bitmap;
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);

Same comment here as in svm.c, is the feature check necessary?

Thanks,
Tom

> vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
>

2018-01-30 16:14:05

by David Woodhouse

Subject: Re: [PATCH v3 2/4] KVM: x86: Add IBPB support

On Tue, 2018-01-30 at 08:22 -0600, Tom Lendacky wrote:
> > @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
> >  
> >   set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
> >   }
> > +
> > + if (boot_cpu_has(X86_FEATURE_IBPB))
> > + set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
>
> Not sure you really need the check here.  If the feature isn't available
> in the hardware, then it won't be advertised in the CPUID bits to the
> guest, so the guest shouldn't try to write to the msr.  If it does, it
> will #GP. So I would think it could be set all the time to not be
> intercepted, no?

The check for boot_cpu_has() is wrong and is fairly redundant as you
say. What we actually want is guest_cpu_has(). We *don't* want to pass
the MSR through for a recalcitrant guest to bash on, if we have elected
not to expose this feature to the guest.

On Intel right now it's *really* important that we don't allow it to be
touched, even if a write would succeed. So even boot_cpu_has() would
not be entirely meaningless there. :)


> > @@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >   case MSR_IA32_TSC:
> >   kvm_write_tsc(vcpu, msr_info);
> >   break;
> > + case MSR_IA32_PRED_CMD:
> > + if (!msr_info->host_initiated &&
> > +     !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> > + return 1;
> > +
> > + if (data & PRED_CMD_IBPB)
> > + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> > + break;
>
> Should this also be in svm.c or as common code in x86.c?

See my response to [0/4]. I suggested that, but noted that it wasn't
entirely clear where we'd put the storage for SPEC_CTRL. We probably
*could* manage it for IBPB though.

> >
> >   case MSR_IA32_CR_PAT:
> >   if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> >   if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> > @@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> >   goto free_msrs;
> >  
> >   msr_bitmap = vmx->vmcs01.msr_bitmap;
> > +
> > + if (boot_cpu_has(X86_FEATURE_IBPB))
> > + vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);
>
> Same comment here as in svm.c, is the feature check necessary?

Again, yes but it should be guest_cpu_has() and we couldn't see how :)
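
Something along these lines is presumably what is wanted on the SVM
side: take the pass-through decision per vCPU once guest CPUID has been
set, rather than at msrpm init time (a sketch only; the hook name and
call site are hypothetical):

	/* Hypothetical hook, run after userspace sets guest CPUID. */
	static void svm_update_pred_cmd_intercept(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);

		/* 1,1 = allow direct read/write access (no intercept). */
		if (guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
			set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 1, 1);
		else
			set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 0);
	}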



2018-01-30 17:38:02

by Jim Mattson

Subject: Re: [PATCH v3 2/4] KVM: x86: Add IBPB support

On Mon, Jan 29, 2018 at 4:10 PM, KarimAllah Ahmed <[email protected]> wrote:
> From: Ashok Raj <[email protected]>
>
> Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
> barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.
>
> [peterz: rebase and changelog rewrite]
> [karahmed: - rebase
> - vmx: expose PRED_CMD whenever it is available
> - svm: only pass through IBPB if it is available
> - vmx: support !cpu_has_vmx_msr_bitmap()]
> [dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
> PRED_CMD is a write-only MSR]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 11 ++++++++++-
> arch/x86/kvm/svm.c | 14 ++++++++++++++
> arch/x86/kvm/vmx.c | 12 ++++++++++++
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c0eb337..033004d 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
> 0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
>
> + /* cpuid 0x80000008.ebx */
> + const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> + F(IBPB);
> +
> /* cpuid 0xC0000001.edx */
> const u32 kvm_cpuid_C000_0001_edx_x86_features =
> F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
> @@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> if (!g_phys_as)
> g_phys_as = phys_as;
> entry->eax = g_phys_as | (virt_as << 8);
> - entry->ebx = entry->edx = 0;
> + entry->edx = 0;
> + /* IBPB isn't necessarily present in hardware cpuid */
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + entry->ebx |= F(IBPB);
> + entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> + cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
> break;
> }
> case 0x80000019:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2744b973..c886e46 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -529,6 +529,7 @@ struct svm_cpu_data {
> struct kvm_ldttss_desc *tss_desc;
>
> struct page *save_area;
> + struct vmcb *current_vmcb;
> };
>
> static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
>
> set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
> }
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
> }
>
> static void add_msr_offset(u32 offset)
> @@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
> __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
> kvm_vcpu_uninit(vcpu);
> kmem_cache_free(kvm_vcpu_cache, svm);
> + /*
> + * The vmcb page can be recycled, causing a false negative in
> + * svm_vcpu_load(). So do a full IBPB now.
> + */
> + indirect_branch_prediction_barrier();
> }
>
> static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> + struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
> int i;
>
> if (unlikely(cpu != vcpu->cpu)) {
> @@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (static_cpu_has(X86_FEATURE_RDTSCP))
> wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>
> + if (sd->current_vmcb != svm->vmcb) {
> + sd->current_vmcb = svm->vmcb;
> + indirect_branch_prediction_barrier();
> + }
> avic_vcpu_load(vcpu, cpu);
> }
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index aa8638a..ea278ce 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2272,6 +2272,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> + indirect_branch_prediction_barrier();
> }
>
> if (!already_loaded) {
> @@ -3330,6 +3331,14 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_PRED_CMD:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> + return 1;
> +
> + if (data & PRED_CMD_IBPB)
> + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> + break;
> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -9548,6 +9557,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> goto free_msrs;
>
> msr_bitmap = vmx->vmcs01.msr_bitmap;
> +
> + if (boot_cpu_has(X86_FEATURE_IBPB))
> + vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
> vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
> --
> 2.7.4
>

Are you planning to allow L2 to write MSR_IA32_PRED_CMD without L0
intercepting it, if the MSR write intercept is disabled in both the
vmcs01 MSR permission bitmap and the vmcs12 MSR permission bitmap?

2018-01-30 18:20:54

by David Woodhouse

Subject: Re: [PATCH v3 2/4] KVM: x86: Add IBPB support

On Tue, 2018-01-30 at 09:19 -0800, Jim Mattson wrote:
>
> Are you planning to allow L2 to write MSR_IA32_PRED_CMD without L0
> intercepting it, if the MSR write intercept is disabled in both the
> vmcs01 MSR permission bitmap and the vmcs12 MSR permission bitmap?

I don't see why we shouldn't.



2018-01-30 18:28:48

by Jim Mattson

Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Mon, Jan 29, 2018 at 4:10 PM, KarimAllah Ahmed <[email protected]> wrote:
> [ Based on a patch from Ashok Raj <[email protected]> ]
>
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.
>
> To avoid the overhead of saving and restoring MSR_IA32_SPEC_CTRL for
> guests that do not actually use the MSR, only start saving and restoring
> it once a non-zero value is written to it.
>
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
>
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: David Woodhouse <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Ashok Raj <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> v2:
> - remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
> - special case writing '0' in SPEC_CTRL to avoid confusing live-migration
> when the instance never used the MSR (dwmw@).
> - depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
> - add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
> v3:
> - Save/restore manually
> - Fix CPUID handling
> - Fix a copy & paste error in the name of SPEC_CTRL MSR in
> disable_intercept.
> - support !cpu_has_vmx_msr_bitmap()
> ---
> arch/x86/kvm/cpuid.c | 7 +++++--
> arch/x86/kvm/vmx.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 65 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1909635..662d0c0 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 7.0.edx*/
> const u32 kvm_cpuid_7_0_edx_x86_features =
> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> + F(ARCH_CAPABILITIES);
>
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
> @@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> g_phys_as = phys_as;
> entry->eax = g_phys_as | (virt_as << 8);
> entry->edx = 0;
> - /* IBPB isn't necessarily present in hardware cpuid */
> + /* IBRS and IBPB aren't necessarily present in hardware cpuid */
> if (boot_cpu_has(X86_FEATURE_IBPB))
> entry->ebx |= F(IBPB);
> + if (boot_cpu_has(X86_FEATURE_IBRS))
> + entry->ebx |= F(IBRS);
> entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
> break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 798a00b..9ac9747 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -582,6 +582,8 @@ struct vcpu_vmx {
> u64 msr_guest_kernel_gs_base;
> #endif
> u64 arch_capabilities;
> + u64 spec_ctrl;
> + bool save_spec_ctrl_on_exit;
>
> u32 vm_entry_controls_shadow;
> u32 vm_exit_controls_shadow;
> @@ -922,6 +924,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
> static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
> u16 error_code);
> static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
> +static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> + u32 msr, int type);
>
> static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -3226,6 +3230,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> msr_info->data = guest_read_tsc(vcpu);
> break;
> + case MSR_IA32_SPEC_CTRL:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
> + return 1;
> +
> + msr_info->data = to_vmx(vcpu)->spec_ctrl;
> + break;
> case MSR_IA32_ARCH_CAPABILITIES:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
> @@ -3339,6 +3350,31 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_SPEC_CTRL:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
> + return 1;
> +
> + /* The STIBP bit doesn't fault even if it's not advertised */
> + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> + return 1;
> +
> + vmx->spec_ctrl = data;
> +
> + /*
> + * When it's written (to non-zero) for the first time, pass
> + * it through. This means we don't have to take the perf
> + * hit of saving it on vmexit for the common case of guests
> + * that don't use it.
> + */
> + if (cpu_has_vmx_msr_bitmap() && data &&
> + !vmx->save_spec_ctrl_on_exit) {
> + vmx->save_spec_ctrl_on_exit = true;
> + vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
> + MSR_IA32_SPEC_CTRL,
> + MSR_TYPE_RW);
> + }

This code seems to assume that L1 is currently active. What if L2 is
currently active?

> + break;
> case MSR_IA32_PRED_CMD:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> @@ -5644,6 +5680,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> u64 cr0;
>
> vmx->rmode.vm86_active = 0;
> + vmx->spec_ctrl = 0;
>
> vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
> kvm_set_cr8(vcpu, 0);
> @@ -9314,6 +9351,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>
> vmx_arm_hv_timer(vcpu);
>
> + /*
> + * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> + * it's non-zero. Since vmentry is serialising on affected CPUs, there
> + * is no need to worry about the conditional branch over the wrmsr
> + * being speculatively taken.
> + */
> + if (vmx->spec_ctrl)
> + wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +
> vmx->__launched = vmx->loaded_vmcs->launched;
> asm(
> /* Store host registers */
> @@ -9420,6 +9466,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> #endif
> );
>
> + /*
> + * We do not use IBRS in the kernel. If this vCPU has used the
> + * SPEC_CTRL MSR it may have left it on; save the value and
> + * turn it off. This is much more efficient than blindly adding
> + * it to the atomic save/restore list. Especially as the former
> + * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> + */
> + if (vmx->save_spec_ctrl_on_exit)
> + rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +
> + if (vmx->spec_ctrl)
> + wrmsrl(MSR_IA32_SPEC_CTRL, 0);
> +
> /* Eliminate branch target predictions from guest mode */
> vmexit_fill_RSB();
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8e889dc..fc9724c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1006,7 +1006,7 @@ static u32 msrs_to_save[] = {
> #endif
> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
> - MSR_IA32_ARCH_CAPABILITIES
> + MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
> };
>
> static unsigned num_msrs_to_save;
> --
> 2.7.4
>

2018-01-30 21:15:21

by KarimAllah Ahmed

Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 01/30/2018 06:49 PM, Jim Mattson wrote:
> On Mon, Jan 29, 2018 at 4:10 PM, KarimAllah Ahmed <[email protected]> wrote:
>> [ Based on a patch from Ashok Raj <[email protected]> ]
>>
>> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
>> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
>> be using a retpoline+IBPB based approach.
>>
>> To avoid the overhead of atomically saving and restoring the
>> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
>> add_atomic_switch_msr when a non-zero is written to it.
>>
>> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
>> may be added in a future patch, which may require trapping all writes
>> if we don't want to pass it through directly to the guest.
>>
>> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
>>
>> Cc: Asit Mallick <[email protected]>
>> Cc: Arjan Van De Ven <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andi Kleen <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Tim Chen <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Cc: Jun Nakajima <[email protected]>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: David Woodhouse <[email protected]>
>> Cc: Greg KH <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Ashok Raj <[email protected]>
>> Signed-off-by: KarimAllah Ahmed <[email protected]>
>> Signed-off-by: David Woodhouse <[email protected]>
>> ---
>> v2:
>> - remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
>> - special case writing '0' in SPEC_CTRL to avoid confusing live-migration
>> when the instance never used the MSR (dwmw@).
>> - depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
>> - add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
>> v3:
>> - Save/restore manually
>> - Fix CPUID handling
>> - Fix a copy & paste error in the name of SPEC_CTRL MSR in
>> disable_intercept.
>> - support !cpu_has_vmx_msr_bitmap()
>> ---
>> arch/x86/kvm/cpuid.c | 7 +++++--
>> arch/x86/kvm/vmx.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> arch/x86/kvm/x86.c | 2 +-
>> 3 files changed, 65 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 1909635..662d0c0 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>
>> /* cpuid 7.0.edx*/
>> const u32 kvm_cpuid_7_0_edx_x86_features =
>> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
>> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
>> + F(ARCH_CAPABILITIES);
>>
>> /* all calls to cpuid_count() should be made on the same cpu */
>> get_cpu();
>> @@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>> g_phys_as = phys_as;
>> entry->eax = g_phys_as | (virt_as << 8);
>> entry->edx = 0;
>> - /* IBPB isn't necessarily present in hardware cpuid */
>> + /* IBRS and IBPB aren't necessarily present in hardware cpuid */
>> if (boot_cpu_has(X86_FEATURE_IBPB))
>> entry->ebx |= F(IBPB);
>> + if (boot_cpu_has(X86_FEATURE_IBRS))
>> + entry->ebx |= F(IBRS);
>> entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
>> cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
>> break;
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 798a00b..9ac9747 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -582,6 +582,8 @@ struct vcpu_vmx {
>> u64 msr_guest_kernel_gs_base;
>> #endif
>> u64 arch_capabilities;
>> + u64 spec_ctrl;
>> + bool save_spec_ctrl_on_exit;
>>
>> u32 vm_entry_controls_shadow;
>> u32 vm_exit_controls_shadow;
>> @@ -922,6 +924,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
>> static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>> u16 error_code);
>> static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
>> +static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>> + u32 msr, int type);
>>
>> static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>> static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
>> @@ -3226,6 +3230,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_IA32_TSC:
>> msr_info->data = guest_read_tsc(vcpu);
>> break;
>> + case MSR_IA32_SPEC_CTRL:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
>> + return 1;
>> +
>> + msr_info->data = to_vmx(vcpu)->spec_ctrl;
>> + break;
>> case MSR_IA32_ARCH_CAPABILITIES:
>> if (!msr_info->host_initiated &&
>> !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
>> @@ -3339,6 +3350,31 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_IA32_TSC:
>> kvm_write_tsc(vcpu, msr_info);
>> break;
>> + case MSR_IA32_SPEC_CTRL:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
>> + return 1;
>> +
>> + /* The STIBP bit doesn't fault even if it's not advertised */
>> + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
>> + return 1;
>> +
>> + vmx->spec_ctrl = data;
>> +
>> + /*
>> + * When it's written (to non-zero) for the first time, pass
>> + * it through. This means we don't have to take the perf
>> + * hit of saving it on vmexit for the common case of guests
>> + * that don't use it.
>> + */
>> + if (cpu_has_vmx_msr_bitmap() && data &&
>> + !vmx->save_spec_ctrl_on_exit) {
>> + vmx->save_spec_ctrl_on_exit = true;
>> + vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
>> + MSR_IA32_SPEC_CTRL,
>> + MSR_TYPE_RW);
>> + }
>
> This code seems to assume that L1 is currently active. What if L2 is
> currently active?

Ooops! I did not think at all about nested :)

This should be addressed now, I hope:

http://git.infradead.org/linux-retpoline.git/commitdiff/f7f0cbba3e0cffcee050a8a5a9597a162d57e572

I have not tested it yet though.

>
>> + break;
>> case MSR_IA32_PRED_CMD:
>> if (!msr_info->host_initiated &&
>> !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
>> @@ -5644,6 +5680,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> u64 cr0;
>>
>> vmx->rmode.vm86_active = 0;
>> + vmx->spec_ctrl = 0;
>>
>> vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
>> kvm_set_cr8(vcpu, 0);
>> @@ -9314,6 +9351,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>
>> vmx_arm_hv_timer(vcpu);
>>
>> + /*
>> + * If this vCPU has touched SPEC_CTRL, restore the guest's value if
>> + * it's non-zero. Since vmentry is serialising on affected CPUs, there
>> + * is no need to worry about the conditional branch over the wrmsr
>> + * being speculatively taken.
>> + */
>> + if (vmx->spec_ctrl)
>> + wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
>> +
>> vmx->__launched = vmx->loaded_vmcs->launched;
>> asm(
>> /* Store host registers */
>> @@ -9420,6 +9466,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> #endif
>> );
>>
>> + /*
>> + * We do not use IBRS in the kernel. If this vCPU has used the
>> + * SPEC_CTRL MSR it may have left it on; save the value and
>> + * turn it off. This is much more efficient than blindly adding
>> + * it to the atomic save/restore list. Especially as the former
>> + * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> + */
>> + if (vmx->save_spec_ctrl_on_exit)
>> + rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
>> +
>> + if (vmx->spec_ctrl)
>> + wrmsrl(MSR_IA32_SPEC_CTRL, 0);
>> +
>> /* Eliminate branch target predictions from guest mode */
>> vmexit_fill_RSB();
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8e889dc..fc9724c 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1006,7 +1006,7 @@ static u32 msrs_to_save[] = {
>> #endif
>> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
>> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
>> - MSR_IA32_ARCH_CAPABILITIES
>> + MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
>> };
>>
>> static unsigned num_msrs_to_save;
>> --
>> 2.7.4
>>
>

2018-01-30 22:51:04

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Tue, Jan 30, 2018 at 1:00 PM, KarimAllah Ahmed <[email protected]> wrote:
> Ooops! I did not think at all about nested :)
>
> This should be addressed now, I hope:
>
> http://git.infradead.org/linux-retpoline.git/commitdiff/f7f0cbba3e0cffcee050a8a5a9597a162d57e572

+                if (cpu_has_vmx_msr_bitmap() && data &&
+                    !vmx->save_spec_ctrl_on_exit) {
+                        vmx->save_spec_ctrl_on_exit = true;
+
+                        msr_bitmap = is_guest_mode(vcpu) ?
+                                vmx->nested.vmcs02.msr_bitmap :
+                                vmx->vmcs01.msr_bitmap;
+                        vmx_disable_intercept_for_msr(msr_bitmap,
+                                                      MSR_IA32_SPEC_CTRL,
+                                                      MSR_TYPE_RW);
+                }

There are two ways to get to this point in vmx_set_msr while
is_guest_mode(vcpu) is true:
1) L0 is processing vmcs12's VM-entry MSR load list on emulated
VM-entry (see enter_vmx_non_root_mode).
2) L2 tried to execute WRMSR, writes to the MSR are intercepted in
vmcs02's MSR permission bitmap, and writes to the MSR are not
intercepted in vmcs12's MSR permission bitmap.

In the first case, disabling the intercepts for the MSR in
vmx->nested.vmcs02.msr_bitmap is incorrect, because we haven't yet
determined that the intercepts are clear in vmcs12's MSR permission
bitmap.
In the second case, disabling *both* of the intercepts for the MSR in
vmx->nested.vmcs02.msr_bitmap is incorrect, because we don't know that
the read intercept is clear in vmcs12's MSR permission bitmap.
Furthermore, disabling the write intercept for the MSR in
vmx->nested.vmcs02.msr_bitmap is somewhat fruitless, because
nested_vmx_merge_msr_bitmap is just going to undo that change on the
next emulated VM-entry.

2018-01-31 00:36:19

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 30/01/2018 17:49, Jim Mattson wrote:
> On Tue, Jan 30, 2018 at 1:00 PM, KarimAllah Ahmed <[email protected]> wrote:
>> Ooops! I did not think at all about nested :)
>>
>> This should be addressed now, I hope:
>>
>> http://git.infradead.org/linux-retpoline.git/commitdiff/f7f0cbba3e0cffcee050a8a5a9597a162d57e572
>
> +                if (cpu_has_vmx_msr_bitmap() && data &&
> +                    !vmx->save_spec_ctrl_on_exit) {
> +                        vmx->save_spec_ctrl_on_exit = true;
> +
> +                        msr_bitmap = is_guest_mode(vcpu) ?
> +                                vmx->nested.vmcs02.msr_bitmap :
> +                                vmx->vmcs01.msr_bitmap;
> +                        vmx_disable_intercept_for_msr(msr_bitmap,
> +                                                      MSR_IA32_SPEC_CTRL,
> +                                                      MSR_TYPE_RW);
> +                }
>
> There are two ways to get to this point in vmx_set_msr while
> is_guest_mode(vcpu) is true:
> 1) L0 is processing vmcs12's VM-entry MSR load list on emulated
> VM-entry (see enter_vmx_non_root_mode).
> 2) L2 tried to execute WRMSR, writes to the MSR are intercepted in
> vmcs02's MSR permission bitmap, and writes to the MSR are not
> intercepted in vmcs12's MSR permission bitmap.
>
> In the first case, disabling the intercepts for the MSR in
> vmx->nested.vmcs02.msr_bitmap is incorrect, because we haven't yet
> determined that the intercepts are clear in vmcs12's MSR permission
> bitmap.
> In the second case, disabling *both* of the intercepts for the MSR in
> vmx->nested.vmcs02.msr_bitmap is incorrect, because we don't know that
> the read intercept is clear in vmcs12's MSR permission bitmap.
> Furthermore, disabling the write intercept for the MSR in
> vmx->nested.vmcs02.msr_bitmap is somewhat fruitless, because
> nested_vmx_merge_msr_bitmap is just going to undo that change on the
> next emulated VM-entry.
>

Let's keep the original code from David, touching the L0->L1 MSR bitmap
unconditionally, and possibly add an "&& !is_guest_mode(vcpu)" to the
condition.

Paolo
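
Concretely, the guard being suggested would look roughly like this in
vmx_set_msr() (a sketch against the patch above; not necessarily the
final upstream code):

        case MSR_IA32_SPEC_CTRL:
                ...
                vmx->spec_ctrl = data;

                /*
                 * Only touch the L0->L1 (vmcs01) bitmap, and only while
                 * L1 itself is running; the vmcs02 bitmap is regenerated
                 * by nested_vmx_merge_msr_bitmap() on nested VM-entry.
                 */
                if (cpu_has_vmx_msr_bitmap() && data &&
                    !vmx->save_spec_ctrl_on_exit &&
                    !is_guest_mode(vcpu)) {
                        vmx->save_spec_ctrl_on_exit = true;
                        vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
                                                      MSR_IA32_SPEC_CTRL,
                                                      MSR_TYPE_RW);
                }
                break;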

2018-01-31 00:37:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On 30/01/2018 04:00, David Woodhouse wrote:
> I believe Ashok sent you a change which made us do IBPB on *every*
> vmexit; I don't think we need that. It's currently done in vcpu_load()
> which means we'll definitely have done it between running one vCPU and
> the next, and when vCPUs are pinned we basically never need to do it.
>
> We know that VMM (e.g. qemu) userspace could be vulnerable to attacks
> from guest ring 3, because there is no flush between the vmexit and the
> host kernel "returning" to the userspace thread. Doing a full IBPB on
> *every* vmexit would protect from that, but it's overkill. If that's
> the reason, let's come up with something better.

Certainly not every vmexit! But doing it on every userspace vmexit and
every sched_out would not be *that* bad.

We try really hard to avoid userspace vmexits for everything remotely
critical to performance (the main exception that's left is the PMTIMER
I/O port, which Windows likes to access quite a lot), so they shouldn't
happen that often.

Paolo
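
For reference, the vcpu_load() placement described above corresponds
roughly to the following (a sketch based on the discussion, not a quote
of patch 2/4; current_vmcs is the existing per-CPU VMCS pointer):

        static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                /*
                 * Barrier against a previous VM's branch predictor
                 * state: only needed when this pCPU switches to a
                 * different VMCS, so pinned vCPUs essentially never
                 * pay for it.
                 */
                if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs)
                        wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);

                /* ... rest of the existing load path ... */
        }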

2018-01-31 00:56:28

by Ashok Raj

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On Tue, Jan 30, 2018 at 06:36:20PM -0500, Paolo Bonzini wrote:
> On 30/01/2018 04:00, David Woodhouse wrote:
> > I believe Ashok sent you a change which made us do IBPB on *every*
> > vmexit; I don't think we need that. It's currently done in vcpu_load()
> > which means we'll definitely have done it between running one vCPU and
> > the next, and when vCPUs are pinned we basically never need to do it.
> >
> > We know that VMM (e.g. qemu) userspace could be vulnerable to attacks
> > from guest ring 3, because there is no flush between the vmexit and the
> > host kernel "returning" to the userspace thread. Doing a full IBPB on
> > *every* vmexit would protect from that, but it's overkill. If that's
> > the reason, let's come up with something better.
>
> Certainly not every vmexit! But doing it on every userspace vmexit and
> every sched_out would not be *that* bad.

Right.. agreed. We discussed the different scenarios where doing IBPB
on VMexit would help, and decided it's really not required on every exit.

One obvious case is when there is a VMexit and we return back to the Qemu
process (without a real context switch): do we need that to be
protected from any poisoned BTB from the guest?

If Qemu is protected by !dumpable/retpoline, that should give that guarantee.
We do VM->VM IBPB at vmload() time, which should provide that guarantee.

Cheers,
Ashok

>
> We try really hard to avoid userspace vmexits for everything remotely
> critical to performance (the main exception that's left is the PMTIMER
> I/O port, which Windows likes to access quite a lot), so they shouldn't
> happen that often.

2018-01-31 00:57:49

by KarimAllah Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 01/30/2018 11:49 PM, Jim Mattson wrote:
> On Tue, Jan 30, 2018 at 1:00 PM, KarimAllah Ahmed <[email protected]> wrote:
>> Ooops! I did not think at all about nested :)
>>
>> This should be addressed now, I hope:
>>
>> http://git.infradead.org/linux-retpoline.git/commitdiff/f7f0cbba3e0cffcee050a8a5a9597a162d57e572
>
> +                if (cpu_has_vmx_msr_bitmap() && data &&
> +                    !vmx->save_spec_ctrl_on_exit) {
> +                        vmx->save_spec_ctrl_on_exit = true;
> +
> +                        msr_bitmap = is_guest_mode(vcpu) ?
> +                                vmx->nested.vmcs02.msr_bitmap :
> +                                vmx->vmcs01.msr_bitmap;
> +                        vmx_disable_intercept_for_msr(msr_bitmap,
> +                                                      MSR_IA32_SPEC_CTRL,
> +                                                      MSR_TYPE_RW);
> +                }
>
> There are two ways to get to this point in vmx_set_msr while
> is_guest_mode(vcpu) is true:
> 1) L0 is processing vmcs12's VM-entry MSR load list on emulated
> VM-entry (see enter_vmx_non_root_mode).
> 2) L2 tried to execute WRMSR, writes to the MSR are intercepted in
> vmcs02's MSR permission bitmap, and writes to the MSR are not
> intercepted in vmcs12's MSR permission bitmap.
>
> In the first case, disabling the intercepts for the MSR in
> vmx->nested.vmcs02.msr_bitmap is incorrect, because we haven't yet
> determined that the intercepts are clear in vmcs12's MSR permission
> bitmap.
> In the second case, disabling *both* of the intercepts for the MSR in
> vmx->nested.vmcs02.msr_bitmap is incorrect, because we don't know that
> the read intercept is clear in vmcs12's MSR permission bitmap.
> Furthermore, disabling the write intercept for the MSR in
> vmx->nested.vmcs02.msr_bitmap is somewhat fruitless, because
> nested_vmx_merge_msr_bitmap is just going to undo that change on the
> next emulated VM-entry.

Okay, I took a second look at the code (especially
nested_vmx_merge_msr_bitmap).

This means that I simply should not touch the MSR bitmap in set_msr in
the nested case; I just need to properly update the l02 msr_bitmap in
nested_vmx_merge_msr_bitmap. As in here:

http://git.infradead.org/linux-retpoline.git/commitdiff/d90eedebdd16bb00741a2c93bc13c5e444c99c2b

or am I still missing something? (sorry, did not actually look at the
nested code before!)

>

2018-01-31 01:00:49

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Tue, Jan 30, 2018 at 3:50 PM, KarimAllah Ahmed <[email protected]> wrote:
> Okay, I took a second look at the code (especially
> nested_vmx_merge_msr_bitmap).
>
> This means that I simply should not touch the MSR bitmap in set_msr in
> the nested case; I just need to properly update the l02 msr_bitmap in
> nested_vmx_merge_msr_bitmap. As in here:
>
> http://git.infradead.org/linux-retpoline.git/commitdiff/d90eedebdd16bb00741a2c93bc13c5e444c99c2b
>
> or am I still missing something? (sorry, did not actually look at the
> nested code before!)

+                if (cpu_has_vmx_msr_bitmap() && data &&
+                    !vmx->save_spec_ctrl_on_exit) {
+                        vmx->save_spec_ctrl_on_exit = true;
+
+                        if (is_guest_mode(vcpu))
+                                break;

As Paolo suggested, the test for !is_guest_mode(vcpu) should just be
folded into the condition above. If you aren't clearing a 'W' bit in
the MSR permission bitmap, there's no need to set
vmx->save_spec_ctrl_on_exit.

+
+                        vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+                                                      MSR_IA32_SPEC_CTRL,
+                                                      MSR_TYPE_RW);
+                }
+                break;

...

+        if (guest_cpuid_has(vcpu, X86_FEATURE_IBRS)) {
+                nested_vmx_disable_intercept_for_msr(
+                        msr_bitmap_l1, msr_bitmap_l0,
+                        MSR_IA32_SPEC_CTRL,
+                        MSR_TYPE_R | MSR_TYPE_W);
+        }
+

However, here, you should set vmx->save_spec_ctrl_on_exit if
nested_vmx_disable_intercept_for_msr clears the 'W' bit for
MSR_IA32_SPEC_CTRL in msr_bitmap_l0. Perhaps this would be easier if
nested_vmx_disable_intercept_for_msr returned something indicative of
which bits it cleared (if any).
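
One way to realize that is for the function to report which intercepts
it actually cleared. A sketch (the msr_*_intercepted_l1() and
clear_msr_*_intercept() helpers are placeholders for the existing bitmap
manipulation; the real nested_vmx_disable_intercept_for_msr() returns
void):

        static int nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
                                                        unsigned long *msr_bitmap_l0,
                                                        u32 msr, int type)
        {
                int cleared = 0;

                /* Clear an L0 intercept only where L1's bitmap clears it too. */
                if ((type & MSR_TYPE_R) &&
                    !msr_read_intercepted_l1(msr_bitmap_l1, msr)) {
                        clear_msr_read_intercept(msr_bitmap_l0, msr);
                        cleared |= MSR_TYPE_R;
                }
                if ((type & MSR_TYPE_W) &&
                    !msr_write_intercepted_l1(msr_bitmap_l1, msr)) {
                        clear_msr_write_intercept(msr_bitmap_l0, msr);
                        cleared |= MSR_TYPE_W;
                }

                return cleared;
        }

The caller in nested_vmx_merge_msr_bitmap() could then do:

        if (nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
                                                 MSR_IA32_SPEC_CTRL,
                                                 MSR_TYPE_R | MSR_TYPE_W) &
            MSR_TYPE_W)
                vmx->save_spec_ctrl_on_exit = true;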

2018-01-31 01:01:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On 30/01/2018 18:48, Raj, Ashok wrote:
>> Certainly not every vmexit! But doing it on every userspace vmexit and
>> every sched_out would not be *that* bad.
> Right.. agreed. We discussed the different scenarios where doing IBPB
> on VMexit would help, and decided it's really not required on every exit.
>
> One obvious case is when there is a VMexit and we return back to the Qemu
> process (without a real context switch): do we need that to be
> protected from any poisoned BTB from the guest?

If the host is using retpolines, then some kind of barrier is needed. I
don't know if the full PRED_CMD barrier is needed, or two IBRS=1/IBRS=0
writes back-to-back are enough.

If the host is using IBRS, then writing IBRS=1 at vmexit has established
a barrier from the less privileged VMX guest environment.

Paolo

> If Qemu is protected by !dumpable/retpoline, that should give that guarantee.
> We do VM->VM IBPB at vmload() time, which should provide that guarantee.


2018-01-31 01:01:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 30/01/2018 18:50, KarimAllah Ahmed wrote:
> On 01/30/2018 11:49 PM, Jim Mattson wrote:
>> On Tue, Jan 30, 2018 at 1:00 PM, KarimAllah Ahmed
>> <[email protected]> wrote:
>>> Ooops! I did not think at all about nested :)
>>>
>>> This should be addressed now, I hope:
>>>
>>> http://git.infradead.org/linux-retpoline.git/commitdiff/f7f0cbba3e0cffcee050a8a5a9597a162d57e572
>>>
>>
>> +                if (cpu_has_vmx_msr_bitmap() && data &&
>> +                    !vmx->save_spec_ctrl_on_exit) {
>> +                        vmx->save_spec_ctrl_on_exit = true;
>> +
>> +                        msr_bitmap = is_guest_mode(vcpu) ?
>> +                                vmx->nested.vmcs02.msr_bitmap :
>> +                                vmx->vmcs01.msr_bitmap;
>> +                        vmx_disable_intercept_for_msr(msr_bitmap,
>> +                                                      MSR_IA32_SPEC_CTRL,
>> +                                                      MSR_TYPE_RW);
>> +                }
>>
>> There are two ways to get to this point in vmx_set_msr while
>> is_guest_mode(vcpu) is true:
>> 1) L0 is processing vmcs12's VM-entry MSR load list on emulated
>> VM-entry (see enter_vmx_non_root_mode).
>> 2) L2 tried to execute WRMSR, writes to the MSR are intercepted in
>> vmcs02's MSR permission bitmap, and writes to the MSR are not
>> intercepted in vmcs12's MSR permission bitmap.
>>
>> In the first case, disabling the intercepts for the MSR in
>> vmx->nested.vmcs02.msr_bitmap is incorrect, because we haven't yet
>> determined that the intercepts are clear in vmcs12's MSR permission
>> bitmap.
>> In the second case, disabling *both* of the intercepts for the MSR in
>> vmx->nested.vmcs02.msr_bitmap is incorrect, because we don't know that
>> the read intercept is clear in vmcs12's MSR permission bitmap.
>> Furthermore, disabling the write intercept for the MSR in
>> vmx->nested.vmcs02.msr_bitmap is somewhat fruitless, because
>> nested_vmx_merge_msr_bitmap is just going to undo that change on the
>> next emulated VM-entry.
>
> Okay, I took a second look at the code (especially
> nested_vmx_merge_msr_bitmap).
>
> This means that I simply should not touch the MSR bitmap in set_msr in
> the nested case; I just need to properly update the l02 msr_bitmap in
> nested_vmx_merge_msr_bitmap. As in here:
>
> http://git.infradead.org/linux-retpoline.git/commitdiff/d90eedebdd16bb00741a2c93bc13c5e444c99c2b
>
>
> or am I still missing something? (sorry, did not actually look at the
> nested code before!)

The new code in nested_vmx_merge_msr_bitmap should be conditional on
vmx->save_spec_ctrl_on_exit. Also, guest_cpuid_has is pretty slow
(because of kvm_find_cpuid_entry); calling it once or twice on each and
every nested vmexit is probably not a good idea. Apart from this, it
looks good to me.

Paolo
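
A cheap alternative is to cache the CPUID lookup once, when the guest's
CPUID is set, instead of on every nested vmexit. A sketch (the
guest_has_ibrs field is made up for illustration):

        struct vcpu_vmx {
                ...
                u64 spec_ctrl;
                bool save_spec_ctrl_on_exit;
                bool guest_has_ibrs;    /* cached guest_cpuid_has(IBRS) */
                ...
        };

        static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                vmx->guest_has_ibrs = guest_cpuid_has(vcpu, X86_FEATURE_IBRS);
                /* ... existing vmx_cpuid_update() body ... */
        }

nested_vmx_merge_msr_bitmap() can then test vmx->guest_has_ibrs instead
of calling guest_cpuid_has() on every emulated VM-entry.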

2018-01-31 01:07:35

by KarimAllah Ahmed

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 01/31/2018 01:27 AM, Jim Mattson wrote:
> On Tue, Jan 30, 2018 at 4:19 PM, Paolo Bonzini <[email protected]> wrote:
>> The new code in nested_vmx_merge_msr_bitmap should be conditional on
>> vmx->save_spec_ctrl_on_exit.
>
> But then if L1 doesn't use MSR_IA32_SPEC_CTRL itself and it uses the
> VM-entry MSR load list to set up L2's MSR_IA32_SPEC_CTRL, you will
> never set vmx->save_spec_ctrl_on_exit, and L2's accesses to the MSR
> will always be intercepted by L0.

I can add another variable (actually two) to indicate if MSR
interception should be disabled or not for SPEC_CTRL and PRED_CMD in
the nested case.

That would allow us to have a fast alternative to guest_cpuid_has in
nested_vmx_merge_msr_bitmap and at the same time maintain the current
semantics of save_spec_ctrl_on_exit (i.e., we would still differentiate
between a set_msr that is called while loading MSRs for the emulated
vm-entry vs. L2 actually writing to it).

What do you think?
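
Sketched out, that proposal might look like this (both field names are
made up for illustration):

        struct vcpu_vmx {
                ...
                /* set on first guest write; checked when merging bitmaps */
                bool nested_pass_spec_ctrl;
                bool nested_pass_pred_cmd;
        };

        /* In nested_vmx_merge_msr_bitmap(): */
        if (vmx->nested_pass_spec_ctrl)
                nested_vmx_disable_intercept_for_msr(msr_bitmap_l1,
                                                     msr_bitmap_l0,
                                                     MSR_IA32_SPEC_CTRL,
                                                     MSR_TYPE_R | MSR_TYPE_W);
        if (vmx->nested_pass_pred_cmd)
                nested_vmx_disable_intercept_for_msr(msr_bitmap_l1,
                                                     msr_bitmap_l0,
                                                     MSR_IA32_PRED_CMD,
                                                     MSR_TYPE_W);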

2018-01-31 01:09:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On 30/01/2018 19:27, Jim Mattson wrote:
> On Tue, Jan 30, 2018 at 4:19 PM, Paolo Bonzini <[email protected]> wrote:
>> The new code in nested_vmx_merge_msr_bitmap should be conditional on
>> vmx->save_spec_ctrl_on_exit.
>
> But then if L1 doesn't use MSR_IA32_SPEC_CTRL itself and it uses the
> VM-entry MSR load list to set up L2's MSR_IA32_SPEC_CTRL, you will
> never set vmx->save_spec_ctrl_on_exit, and L2's accesses to the MSR
> will always be intercepted by L0.

If you don't make it conditional, L0 will forget to read back at vmexit
what value L2 has written to the MSR. The alternative is to set
vmx->save_spec_ctrl_on_exit on all writes, including those coming from
L2. That works for me.

Paolo
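
In code, that alternative reduces to setting the flag on any non-zero
write, wherever it comes from, and touching only the vmcs01 bitmap when
L1 is running (a sketch, not the final patch):

        vmx->spec_ctrl = data;
        if (!data)
                break;

        /*
         * Any non-zero write, including one coming from L2, means the
         * MSR is in use and must be read back and cleared on vmexit.
         */
        vmx->save_spec_ctrl_on_exit = true;

        /* The vmcs02 bitmap is regenerated on nested VM-entry. */
        if (cpu_has_vmx_msr_bitmap() && !is_guest_mode(vcpu))
                vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
                                              MSR_IA32_SPEC_CTRL,
                                              MSR_TYPE_RW);
        break;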

2018-01-31 01:36:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX

On 29/01/2018 19:10, KarimAllah Ahmed wrote:
> [dwmw2: Stop using KF() for bits in it, too]
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>

Reviewed-by: Paolo Bonzini <[email protected]>

> ---
> arch/x86/kvm/cpuid.c | 8 +++-----
> arch/x86/kvm/cpuid.h | 1 +
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0099e10..c0eb337 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
>
> #define F(x) bit(X86_FEATURE_##x)
>
> -/* These are scattered features in cpufeatures.h. */
> -#define KVM_CPUID_BIT_AVX512_4VNNIW 2
> -#define KVM_CPUID_BIT_AVX512_4FMAPS 3
> +/* For scattered features from cpufeatures.h; we currently expose none */
> #define KF(x) bit(KVM_CPUID_BIT_##x)
>
> int kvm_update_cpuid(struct kvm_vcpu *vcpu)
> @@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 7.0.edx*/
> const u32 kvm_cpuid_7_0_edx_x86_features =
> - KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
>
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
> @@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
> entry->ecx &= ~F(PKU);
> entry->edx &= kvm_cpuid_7_0_edx_x86_features;
> - entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
> + cpuid_mask(&entry->edx, CPUID_7_EDX);
> } else {
> entry->ebx = 0;
> entry->ecx = 0;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index cdc70a3..dcfe227 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
> [CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
> [CPUID_7_ECX] = { 7, 0, CPUID_ECX},
> [CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
> + [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
> };
>
> static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
>


2018-01-31 01:40:13

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 3/4] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 29/01/2018 19:25, KarimAllah Ahmed wrote:
>>> +    case MSR_IA32_ARCH_CAPABILITIES:
>>> +        if (!msr_info->host_initiated)
>>> +            return 1;
>>> +        vmx->arch_capabilities = data;
>>> +        break;
>>
>> arch capabilities is read only. You don't need the set_msr handling
>> for this.
>
> This is only for host driven writes. This would allow QEMU/whatever to
> override the default value (i.e. the value from the hardware).

Agreed.

Reviewed-by: Paolo Bonzini <[email protected]>

2018-01-31 01:50:23

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v3 4/4] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Tue, Jan 30, 2018 at 4:19 PM, Paolo Bonzini <[email protected]> wrote:
> The new code in nested_vmx_merge_msr_bitmap should be conditional on
> vmx->save_spec_ctrl_on_exit.

But then if L1 doesn't use MSR_IA32_SPEC_CTRL itself and it uses the
VM-entry MSR load list to set up L2's MSR_IA32_SPEC_CTRL, you will
never set vmx->save_spec_ctrl_on_exit, and L2's accesses to the MSR
will always be intercepted by L0.

2018-01-31 01:50:33

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests



On Tue, 2018-01-30 at 19:16 -0500, Paolo Bonzini wrote:
> On 30/01/2018 18:48, Raj, Ashok wrote:
> >
> > >
> > > Certainly not every vmexit!  But doing it on every userspace vmexit and
> > > every sched_out would not be *that* bad.
> > Right.. agreed. We discussed the different scenarios where doing IBPB
> > on VMexit would help, and decided it's really not required on every exit.
> >
> > One obvious case is when there is a VMexit and we return back to the Qemu
> > process (without a real context switch): do we need that to be
> > protected from any poisoned BTB from the guest?
> If the host is using retpolines, then some kind of barrier is needed.  I
> don't know if the full PRED_CMD barrier is needed, or two IBRS=1/IBRS=0
> writes back-to-back are enough.
>
> If the host is using IBRS, then writing IBRS=1 at vmexit has established
> a barrier from the less privileged VMX guest environment.

IBRS will protect qemu userspace only if you set it at some point
before exiting the kernel. You don't want it set *in* the kernel, if
we're using retpolines in the kernel, so you'd want to clear it again
on the way back into the kernel. It's almost the opposite of what the
IBRS patch set was doing to protect the kernel.

Just use IBPB :)



2018-01-31 06:56:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v3 0/4] KVM: Expose speculation control feature to guests

On 01/30/2018 04:16 PM, Paolo Bonzini wrote:
> On 30/01/2018 18:48, Raj, Ashok wrote:
>>> Certainly not every vmexit! But doing it on every userspace vmexit and
>>> every sched_out would not be *that* bad.
>> Right.. agreed. We discussed the different scenarios where doing IBPB
>> on VMexit would help, and decided it's really not required on every exit.
>>
>> One obvious case is when there is a VMexit and we return back to the Qemu
>> process (without a real context switch): do we need that to be
>> protected from any poisoned BTB from the guest?
> If the host is using retpolines, then some kind of barrier is needed. I
> don't know if the full PRED_CMD barrier is needed, or two IBRS=1/IBRS=0
> writes back-to-back are enough.

I think the spec is pretty clear here: protection is only provided
*while* IBRS=1. Once it goes back to 0, all bets are off.