Add direct access to speculation control MSRs for KVM guests. This allows the
guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
retpoline+IBPB based approach.
It also exposes the ARCH_CAPABILITIES MSR, which future Intel processors will
use to indicate RDCL_NO and IBRS_ALL.
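For context (not part of this series), the guest side is just plain MSR writes
once these registers are passed through. A minimal sketch using the usual
msr-index.h constants; the helper names below are made up for illustration:

static inline void guest_enable_ibrs(void)
{
	/* SPEC_CTRL bit 0 (IBRS): restrict indirect branch speculation */
	wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
}

static inline void guest_issue_ibpb(void)
{
	/* PRED_CMD bit 0 (IBPB): one-shot barrier, write-only command MSR */
	wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
}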
v5:
- svm: add PRED_CMD and SPEC_CTRL to direct_access_msrs list.
- vmx: also check X86_FEATURE_SPEC_CTRL for MSR reads and writes.
- vmx: Use MSR_TYPE_W instead of MSR_TYPE_R for the nested IBPB MSR
- rewrite commit message for IBPB patch [2/5] (Ashok)
v4:
- Add IBRS passthrough for SVM (5/5).
- Handle nested guests properly.
- expose F(IBRS) in kvm_cpuid_8000_0008_ebx_x86_features
Ashok Raj (1):
KVM: x86: Add IBPB support
KarimAllah Ahmed (4):
KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
KVM: SVM: Allow direct access to MSR_IA32_SPEC_CTRL
arch/x86/kvm/cpuid.c | 22 +++++++---
arch/x86/kvm/cpuid.h | 1 +
arch/x86/kvm/svm.c | 87 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/x86.c | 1 +
5 files changed, 218 insertions(+), 10 deletions(-)
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Janakarajan Natarajan <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Laura Abbott <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
--
2.7.4
Future Intel processors will use the MSR_IA32_ARCH_CAPABILITIES MSR to indicate
RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
the contents will come directly from the hardware, but user-space can still
override it.
[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
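For illustration only (not part of this patch): user-space can override the
value through the existing KVM_SET_MSRS ioctl on the vCPU fd, since
host-initiated writes are accepted. A minimal sketch:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* MSR_IA32_ARCH_CAPABILITIES is 0x10a; bit 0 = RDCL_NO, bit 1 = IBRS_ALL */
static int set_arch_capabilities(int vcpu_fd, __u64 value)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entry;
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 1;
	msrs.entry.index = 0x10a;
	msrs.entry.data = value;

	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}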
Cc: Asit Mallick <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Ashok Raj <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 033004d..1909635 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2e4e8af..a0b2bd1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -592,6 +592,8 @@ struct vcpu_vmx {
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
#endif
+ u64 arch_capabilities;
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
u32 secondary_exec_control;
@@ -3236,6 +3238,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
+ return 1;
+ msr_info->data = to_vmx(vcpu)->arch_capabilities;
+ break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@@ -3363,6 +3371,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
MSR_TYPE_W);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated)
+ return 1;
+ vmx->arch_capabilities = data;
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -5625,6 +5638,8 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
++vmx->nmsrs;
}
+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c53298d..4ec142e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,6 +1009,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
+ MSR_IA32_ARCH_CAPABILITIES
};
static unsigned num_msrs_to_save;
--
2.7.4
[dwmw2: Stop using KF() for bits in it, too]
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 8 +++-----
arch/x86/kvm/cpuid.h | 1 +
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..c0eb337 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
#define F(x) bit(X86_FEATURE_##x)
-/* These are scattered features in cpufeatures.h. */
-#define KVM_CPUID_BIT_AVX512_4VNNIW 2
-#define KVM_CPUID_BIT_AVX512_4FMAPS 3
+/* For scattered features from cpufeatures.h; we currently expose none */
#define KF(x) bit(KVM_CPUID_BIT_##x)
int kvm_update_cpuid(struct kvm_vcpu *vcpu)
@@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
entry->ecx &= ~F(PKU);
entry->edx &= kvm_cpuid_7_0_edx_x86_features;
- entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
+ cpuid_mask(&entry->edx, CPUID_7_EDX);
} else {
entry->ebx = 0;
entry->ecx = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c2cea66..9a327d5 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
[CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
[CPUID_7_ECX] = { 7, 0, CPUID_ECX},
[CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
+ [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
};
static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
--
2.7.4
From: Ashok Raj <[email protected]>
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated together in CPUID,
IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issuing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require an IBPB during a context switch in the host, or after
VMEXIT. The host process has two ways to mitigate:
- Either it can be compiled with retpoline
- If it is going through a context switch and has set !dumpable, then
there is an IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where you return to Qemu after a VMEXIT might leave Qemu
attackable from the guest when Qemu isn't compiled with retpoline.
There are reports that doing IBPB on every VMEXIT resulted in some TSC
calibration woes in the guest.
* Mitigate guest/ring0->host/ring0 attacks.
When the host kernel is using retpoline it is safe against these attacks.
If the host kernel isn't using retpoline we might need to do an IBPB flush
on every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following documentation for more details on the
enumeration, the controls and the mitigations:
https://software.intel.com/en-us/side-channel-security-support
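For reference, the barrier itself boils down to a single write of
PRED_CMD_IBPB to MSR_IA32_PRED_CMD when the CPU advertises IBPB. A simplified
sketch of what indirect_branch_prediction_barrier() does (the in-tree helper
uses an alternative keyed on a "use IBPB" feature flag rather than this
open-coded check, and the name below is illustrative):

static inline void ibpb_barrier(void)
{
	/* IBPB is a command, not a mode: one write flushes prior predictions */
	if (boot_cpu_has(X86_FEATURE_IBPB))
		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
}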
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Cc: Asit Mallick <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
v5:
- Use MSR_TYPE_W instead of MSR_TYPE_R for the MSR.
- Always merge the bitmaps unconditionally.
- Add PRED_CMD to direct_access_msrs.
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
- rewrite the commit message (from ashok.raj@)
---
arch/x86/kvm/cpuid.c | 11 ++++++++++-
arch/x86/kvm/svm.c | 28 ++++++++++++++++++++++++++++
arch/x86/kvm/vmx.c | 29 +++++++++++++++++++++++++----
3 files changed, 63 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c0eb337..033004d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
+ /* cpuid 0x80000008.ebx */
+ const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+ F(IBPB);
+
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!g_phys_as)
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
- entry->ebx = entry->edx = 0;
+ entry->edx = 0;
+ /* IBPB isn't necessarily present in hardware cpuid */
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ entry->ebx |= F(IBPB);
+ entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+ cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
}
case 0x80000019:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f40d0da..bfbb7b9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -250,6 +250,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
+ { .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
{ .index = MSR_IA32_LASTINTTOIP, .always = false },
@@ -529,6 +530,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;
struct page *save_area;
+ struct vmcb *current_vmcb;
};
static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -1703,11 +1705,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
+ /*
+ * The vmcb page can be recycled, causing a false negative in
+ * svm_vcpu_load(). So do a full IBPB now.
+ */
+ indirect_branch_prediction_barrier();
}
static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;
if (unlikely(cpu != vcpu->cpu)) {
@@ -1736,6 +1744,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
+ if (sd->current_vmcb != svm->vmcb) {
+ sd->current_vmcb = svm->vmcb;
+ indirect_branch_prediction_barrier();
+ }
avic_vcpu_load(vcpu, cpu);
}
@@ -3684,6 +3696,22 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+ if (is_guest_mode(vcpu))
+ break;
+ set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+ break;
case MSR_STAR:
svm->vmcb->save.star = data;
break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d46a61b..2e4e8af 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+
+ if (is_guest_mode(vcpu))
+ break;
+
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -10045,10 +10066,6 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
- /* This shortcut is ok because we support only x2APIC MSRs so far. */
- if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
- return false;
-
page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
if (is_error_page(page))
return false;
@@ -10056,6 +10073,10 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
memset(msr_bitmap_l0, 0xff, PAGE_SIZE);
+ nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+
if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
if (nested_cpu_has_apic_reg_virt(vmcs12))
for (msr = 0x800; msr <= 0x8ff; msr++)
--
2.7.4
[ Based on a patch from Ashok Raj <[email protected]> ]
Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.
To avoid the overhead of atomically saving and restoring the
MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
add_atomic_switch_msr when a non-zero is written to it.
No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.
[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ashok Raj <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
v5:
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
v4:
- Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
- Handling nested guests
v3:
- Save/restore manually
- Fix CPUID handling
- Fix a copy & paste error in the name of SPEC_CTRL MSR in
disable_intercept.
- support !cpu_has_vmx_msr_bitmap()
v2:
- remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
- special case writing '0' in SPEC_CTRL to avoid confusing live-migration
when the instance never used the MSR (dwmw@).
- depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
- add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
---
arch/x86/kvm/cpuid.c | 9 ++++---
arch/x86/kvm/vmx.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 2 +-
3 files changed, 80 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1909635..13f5d42 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 0x80000008.ebx */
const u32 kvm_cpuid_8000_0008_ebx_x86_features =
- F(IBPB);
+ F(IBPB) | F(IBRS);
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
@@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
+ F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
+ F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
entry->edx = 0;
- /* IBPB isn't necessarily present in hardware cpuid */
+ /* IBRS and IBPB aren't necessarily present in hardware cpuid */
if (boot_cpu_has(X86_FEATURE_IBPB))
entry->ebx |= F(IBPB);
+ if (boot_cpu_has(X86_FEATURE_IBRS))
+ entry->ebx |= F(IBRS);
entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a0b2bd1..4ee93cb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -593,6 +593,8 @@ struct vcpu_vmx {
u64 msr_guest_kernel_gs_base;
#endif
u64 arch_capabilities;
+ u64 spec_ctrl;
+ bool save_spec_ctrl_on_exit;
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
@@ -938,6 +940,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
u16 error_code);
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type);
static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -3238,6 +3242,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+ return 1;
+
+ msr_info->data = to_vmx(vcpu)->spec_ctrl;
+ break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
@@ -3351,6 +3363,36 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ vmx->spec_ctrl = data;
+
+ /*
+ * When it's written (to non-zero) for the first time, pass
+ * it through. This means we don't have to take the perf
+ * hit of saving it on vmexit for the common case of guests
+ * that don't use it.
+ */
+ if (cpu_has_vmx_msr_bitmap() && data &&
+ !vmx->save_spec_ctrl_on_exit) {
+ vmx->save_spec_ctrl_on_exit = true;
+
+ if (is_guest_mode(vcpu))
+ break;
+
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_RW);
+ }
+ break;
case MSR_IA32_PRED_CMD:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
@@ -5668,6 +5710,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u64 cr0;
vmx->rmode.vm86_active = 0;
+ vmx->spec_ctrl = 0;
vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
kvm_set_cr8(vcpu, 0);
@@ -9339,6 +9382,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx_arm_hv_timer(vcpu);
+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
/* Store host registers */
@@ -9457,6 +9509,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);
+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ */
+ if (vmx->save_spec_ctrl_on_exit)
+ rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
@@ -10115,6 +10180,14 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
MSR_TYPE_W);
}
}
+
+ if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_R | MSR_TYPE_W);
+ }
+
kunmap(page);
kvm_release_page_clean(page);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ec142e..ac38143 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,7 +1009,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
- MSR_IA32_ARCH_CAPABILITIES
+ MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
};
static unsigned num_msrs_to_save;
--
2.7.4
[ Based on a patch from Paolo Bonzini <[email protected]> ]
... basically doing exactly what we do for VMX:
- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Ashok Raj <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---
v5:
- Add SPEC_CTRL to direct_access_msrs.
---
arch/x86/kvm/svm.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bfbb7b9..0016a8a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -184,6 +184,9 @@ struct vcpu_svm {
u64 gs_base;
} host;
+ u64 spec_ctrl;
+ bool save_spec_ctrl_on_exit;
+
u32 *msrpm;
ulong nmi_iret_rip;
@@ -250,6 +253,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
+ { .index = MSR_IA32_SPEC_CTRL, .always = false },
{ .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
@@ -1584,6 +1588,8 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u32 dummy;
u32 eax = 1;
+ svm->spec_ctrl = 0;
+
if (!init_event) {
svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
@@ -3605,6 +3611,13 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_VM_CR:
msr_info->data = svm->nested.vm_cr_msr;
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+ return 1;
+
+ msr_info->data = svm->spec_ctrl;
+ break;
case MSR_IA32_UCODE_REV:
msr_info->data = 0x01000065;
break;
@@ -3696,6 +3709,30 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ svm->spec_ctrl = data;
+
+ /*
+ * When it's written (to non-zero) for the first time, pass
+ * it through. This means we don't have to take the perf
+ * hit of saving it on vmexit for the common case of guests
+ * that don't use it.
+ */
+ if (data && !svm->save_spec_ctrl_on_exit) {
+ svm->save_spec_ctrl_on_exit = true;
+ if (is_guest_mode(vcpu))
+ break;
+ set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+ }
+ break;
case MSR_IA32_PRED_CMD:
if (!msr->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
@@ -4964,6 +5001,15 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
local_irq_enable();
+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
asm volatile (
"push %%" _ASM_BP "; \n\t"
"mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t"
@@ -5056,6 +5102,19 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);
+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ */
+ if (svm->save_spec_ctrl_on_exit)
+ rdmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
--
2.7.4
On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <[email protected]> wrote:
> + nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_PRED_CMD,
> + MSR_TYPE_W);
> +
I still think this should be predicated on L1 having
guest_cpuid_has(vcpu, X86_FEATURE_IBPB) or guest_cpuid_has(vcpu,
X86_FEATURE_SPEC_CTRL), because of the potential impact to the
hypertwin. If L0 denies the feature to L1 by clearing those CPUID
bits, L1 shouldn't be able to bypass that restriction by launching L2.
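Concretely, the suggestion amounts to predicating that call on the guest
CPUID bits, e.g. (untested sketch, using the names already in this series):

	if (guest_cpuid_has(vcpu, X86_FEATURE_IBPB) ||
	    guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
		nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
						     MSR_IA32_PRED_CMD,
						     MSR_TYPE_W);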
On Wed, 2018-01-31 at 11:45 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed wrote:
>
> > + nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_PRED_CMD,
> > + MSR_TYPE_W);
> > +
>
> I still think this should be predicated on L1 having
> guest_cpuid_has(vcpu, X86_FEATURE_IBPB) or guest_cpuid_has(vcpu,
> X86_FEATURE_SPEC_CTRL), because of the potential impact to the
> hypertwin. If L0 denies the feature to L1 by clearing those CPUID
> bits, L1 shouldn't be able to bypass that restriction by launching L2.
Rather than doing the expensive guest_cpuid_has() every time (which is
worse now as we realised we need two of them) perhaps we should
introduce a local flag for that too?
On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <[email protected]> wrote:
> +
> + if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
> + nested_vmx_disable_intercept_for_msr(
> + msr_bitmap_l1, msr_bitmap_l0,
> + MSR_IA32_SPEC_CTRL,
> + MSR_TYPE_R | MSR_TYPE_W);
> + }
> +
As this is written, L2 will never get direct access to this MSR until
after L1 writes it. What if L1 never writes it? The condition should
really be something that captures, "if L0 is willing to yield this MSR
to the guest..."
On Wed, Jan 31, 2018 at 11:53 AM, David Woodhouse <[email protected]> wrote:
> Rather than doing the expensive guest_cpu_has() every time (which is
> worse now as we realised we need two of them) perhaps we should
> introduce a local flag for that too?
That sounds good to me.
On Wed, 2018-01-31 at 11:53 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed wrote:
>
> > +
> > + if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
> > + nested_vmx_disable_intercept_for_msr(
> > + msr_bitmap_l1, msr_bitmap_l0,
> > + MSR_IA32_SPEC_CTRL,
> > + MSR_TYPE_R | MSR_TYPE_W);
> > + }
> > +
>
> As this is written, L2 will never get direct access to this MSR until
> after L1 writes it. What if L1 never writes it? The condition should
> really be something that captures, "if L0 is willing to yield this MSR
> to the guest..."
I'm still kind of lost here, but don't forget the requirement that the
MSR must *not* be passed through for direct access by L1 or L2 guests,
unless that ->save_spec_ctrl_on_exit flag is set.
Because that's what makes us set it back to zero on vmexit.
So the above condition doesn't look *so* wrong to me. Perhaps the issue
is that we're missing a way for L2 to actually cause that flag to get
set?
On 01/31/2018 08:53 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <[email protected]> wrote:
>
>> +
>> + if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
>> + nested_vmx_disable_intercept_for_msr(
>> + msr_bitmap_l1, msr_bitmap_l0,
>> + MSR_IA32_SPEC_CTRL,
>> + MSR_TYPE_R | MSR_TYPE_W);
>> + }
>> +
>
> As this is written, L2 will never get direct access to this MSR until
> after L1 writes it. What if L1 never writes it? The condition should
> really be something that captures, "if L0 is willing to yield this MSR
> to the guest..."
but save_spec_ctrl_on_exit is also set for an L2 write. So once L2 writes
to it, this condition will be true and then the bitmap will be updated.
On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed <[email protected]> wrote:
> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
> to it, this condition will be true and then the bitmap will be updated.
So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
set to true, even if the MSR permission bitmap for a particular VMCS
*doesn't* allow the MSR to be written without an intercept. That's
functionally correct, but inefficient. It seems to me that
save_spec_ctrl_on_exit should indicate whether or not the *current*
MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
loaded_vmcs structure, alongside the msr_bitmap pointer that it is
associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
to L2.
On Wed, 2018-01-31 at 12:18 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed wrote:
>
> >
> > but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
> > to it, this condition will be true and then the bitmap will be updated.
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.
Reading and writing this MSR is expensive. And if it's yielded to the
guest in the MSR bitmap, that means we have to save its value on vmexit
and set it back to zero.
Some of the gymnastics here are explicitly done to avoid having to do
that save-and-zero step unless the guest has *actually* touched the
MSR. Not just if we are *willing* to let it do so.
That's the whole point in the yield-after-first-write dance.
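To make that trade-off concrete, a condensed sketch of the dance (field and
helper names match the VMX patch earlier in the thread, but the two functions
below are just for illustration; the real logic is inline in vmx_set_msr()
and vmx_vcpu_run()):

static void spec_ctrl_guest_write(struct vcpu_vmx *vmx, u64 data)
{
	vmx->spec_ctrl = data;		/* always track the guest value */

	if (data && !vmx->save_spec_ctrl_on_exit) {
		/* First non-zero write: yield the MSR to the guest ... */
		vmx->save_spec_ctrl_on_exit = true;
		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
					      MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
	}
}

static void spec_ctrl_after_vmexit(struct vcpu_vmx *vmx)
{
	/* ... and from then on every vmexit pays the save-and-zero cost */
	if (vmx->save_spec_ctrl_on_exit)
		rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
	if (vmx->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, 0);
}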
On Wed, Jan 31, 2018 at 08:37:43PM +0100, KarimAllah Ahmed wrote:
> [dwmw2: Stop using KF() for bits in it, too]
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 8 +++-----
> arch/x86/kvm/cpuid.h | 1 +
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0099e10..c0eb337 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
>
> #define F(x) bit(X86_FEATURE_##x)
>
> -/* These are scattered features in cpufeatures.h. */
> -#define KVM_CPUID_BIT_AVX512_4VNNIW 2
> -#define KVM_CPUID_BIT_AVX512_4FMAPS 3
> +/* For scattered features from cpufeatures.h; we currently expose none */
> #define KF(x) bit(KVM_CPUID_BIT_##x)
>
> int kvm_update_cpuid(struct kvm_vcpu *vcpu)
> @@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 7.0.edx*/
> const u32 kvm_cpuid_7_0_edx_x86_features =
> - KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
>
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
> @@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
> entry->ecx &= ~F(PKU);
> entry->edx &= kvm_cpuid_7_0_edx_x86_features;
> - entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
> + cpuid_mask(&entry->edx, CPUID_7_EDX);
> } else {
> entry->ebx = 0;
> entry->ecx = 0;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index c2cea66..9a327d5 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
> [CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
> [CPUID_7_ECX] = { 7, 0, CPUID_ECX},
> [CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
> + [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
> };
>
> static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
> --
> 2.7.4
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d46a61b..2e4e8af 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> vmcs_load(vmx->loaded_vmcs->vmcs);
> + indirect_branch_prediction_barrier();
> }
>
> if (!already_loaded) {
> @@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_PRED_CMD:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> + return 1;
> +
> + if (data & ~PRED_CMD_IBPB)
> + return 1;
> +
> + if (!data)
> + break;
> +
> + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> +
> + if (is_guest_mode(vcpu))
> + break;
Don't you want this the other way around? That is, first do the disable_intercept
and then add the 'if (is_guest_mode(vcpu))'? Otherwise the very first
MSR write from the guest is going to hit the condition above and never end
up executing the disabling of the intercept?
> +
> + vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
> + MSR_TYPE_W);
> + break;
> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
On 31/01/2018 15:18, Jim Mattson wrote:
>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>> to it, this condition will be true and then the bitmap will be updated.
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.
On the first nested write, (b) must be true for L0 to see the MSR write.
If L1 doesn't yield the MSR to L2, the MSR write results in an L2->L1
vmexit and save_spec_ctrl_on_exit is not set to true.
So save_spec_ctrl_on_exit is set if all of the following are true:
(a) L0 is willing to yield the MSR to L1,
(b) and the write happens in L1,
or all of the following are true:
(a) L0 is willing to yield the MSR to L1,
(b) L1 is willing to yield the MSR to L2,
(c) and the write happens in L2.
It doesn't need to be placed in loaded_vmcs, because in the end if L1 is
willing to yield the MSR to L2, it will have to do reads and writes of
the MSR too, and both loaded_vmcs structs will have
save_spec_ctrl_on_exit=1.
Paolo
On 01/31/2018 09:28 PM, Konrad Rzeszutek Wilk wrote:
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index d46a61b..2e4e8af 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>> per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>> vmcs_load(vmx->loaded_vmcs->vmcs);
>> + indirect_branch_prediction_barrier();
>> }
>>
>> if (!already_loaded) {
>> @@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_IA32_TSC:
>> kvm_write_tsc(vcpu, msr_info);
>> break;
>> + case MSR_IA32_PRED_CMD:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>> + return 1;
>> +
>> + if (data & ~PRED_CMD_IBPB)
>> + return 1;
>> +
>> + if (!data)
>> + break;
>> +
>> + wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>> +
>> + if (is_guest_mode(vcpu))
>> + break;
>
> Don't you want this the other way around? That is first do the disable_intercept
> and then add the 'if (is_guest_mode(vcpu))' ? Otherwise the very first
> MSR write from the guest is going to hit condition above and never end
> up executing the disabling of the intercept?
is_guest_mode is checking if this is an L2 guest. I *should not* do
disable_intercept on the L1 guest bitmap if it is an L2 guest, which is
why this check happens before disable_intercept.
For the short-circuited L2 path, nested_vmx_merge_msr_bitmap will
properly update the L02 MSR bitmap and use it.
So the checks are fine AFAICT.
>
>> +
>> + vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
>> + MSR_TYPE_W);
>> + break;
>> case MSR_IA32_CR_PAT:
>> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
You seem to be making the assumption that there is one L2. What if
there are 100 L2s, and only one has write-access to IA32_SPEC_CTRL? Or
what if there once was such an L2, but it's been gone for months? The
current mechanism penalizes *all* L2s if any L2, ever, has
write-access to IA32_SPEC_CTRL.
On 31/01/2018 15:54, Jim Mattson wrote:
> You seem to be making the assumption that there is one L2. What if
> there are 100 L2s, and only one has write-access to IA32_SPEC_CTRL? Or
> what if there once was such an L2, but it's been gone for months? The
> current mechanism penalizes *all* L2s if any L2, ever, has
> write-access to IA32_SPEC_CTRL.
Yes, but how would moving the field into struct loaded_vmcs do anything?
Only vmon/vmoff would change anything in vmx->nested.vmcs02.
Even then, L1 vmexits will also be penalized because L1 has probably
done an RDMSR/WRMSR on L2->L1 vmexit. So I don't think it's an issue?
Paolo
On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <[email protected]> wrote:
> Yes, but how would moving the field into struct loaded_vmcs do anything?
> Only vmon/vmoff would change anything in vmx->nested.vmcs02.
My suggestion was that nested_vmx_merge_msr_bitmap should set the
vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
permission bitmap.
> Even then, L1 vmexits will also be penalized because L1 has probably
> done an RDMSR/WRMSR on L2->L1 vmexit. So I don't think it's an issue?
Yes, it sucks to be L1 in this situation.
On Wed, Jan 31, 2018 at 12:21 PM, David Woodhouse <[email protected]> wrote:
> Reading and writing this MSR is expensive. And if it's yielded to the
> guest in the MSR bitmap, that means we have to save its value on vmexit
> and set it back to zero.
Agreed. But my point is that if it's not yielded to the guest in the
MSR bitmap, then we don't have to save its value on VM-exit and set it
back to zero. The vmcs02 MSR bitmap is reconstructed on every L1->L2
transition. Sometimes, it will yield the MSR and sometimes it won't.
> Some of the gymnastics here are explicitly done to avoid having to do
> that save-and-zero step unless the guest has *actually* touched the
> MSR. Not just if we are *willing* to let it do so.
>
> That's the whole point in the yield-after-first-write dance.
Sorry; bad choice of words on my part. All that L0 knows of L1's
"willingness" to pass the MSR through to L2 comes from the vmcs12 MSR
permission bitmap. If L1 also adopts a "clear the WRMSR intercept on
first write" strategy, then as far as L0 can tell, L1 is "unwilling"
to pass the MSR through until L2 has written it.
On Wed, 2018-01-31 at 13:05 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <[email protected]> wrote:
>
> > Yes, but how would moving the field into struct loaded_vmcs do anything?
> > Only vmon/vmoff would change anything in vmx->nested.vmcs02.
>
> My suggestion was that nested_vmx_merge_msr_bitmap should set the
> vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
> of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
> permission bitmap.
>
> > Even then, L1 vmexits will also be penalized because L1 has probably
> > done an RDMSR/WRMSR on L2->L1 vmexit. So I don't think it's an issue?
>
> Yes, it sucks to be L1 in this situation.
Well... we *could* clear the save_spec_ctrl_on_exit flag and intercept
the MSR again, any time that the actual value of spec_ctrl is zero.
I don't think we'd want to do that too aggressively, but there might be
something we could do there.
On 31/01/2018 16:05, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <[email protected]> wrote:
>
>> Yes, but how would moving the field into struct loaded_vmcs do anything?
>> Only vmon/vmoff would change anything in vmx->nested.vmcs02.
>
> My suggestion was that nested_vmx_merge_msr_bitmap should set the
> vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
> of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
> permission bitmap.
>
>> Even then, L1 vmexits will also be penalized because L1 has probably
>> done an RDMSR/WRMSR on L2->L1 vmexit. So I don't think it's an issue?
>
> Yes, it sucks to be L1 in this situation.
Can we just say it sucks to be L2 too? :) Because in the end as long as
no one ever writes to spec_ctrl, everybody is happy.
Paolo
On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <[email protected]> wrote:
> Can we just say it sucks to be L2 too? :) Because in the end as long as
> no one ever writes to spec_ctrl, everybody is happy.
Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
:-)
And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.
On 31/01/2018 16:53, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <[email protected]> wrote:
>
>> Can we just say it sucks to be L2 too? :) Because in the end as long as
>> no one ever writes to spec_ctrl, everybody is happy.
>
> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
> :-)
Not yet, but getting there. :)
> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.
And then it will suck for everyone---they will have to pay the price of
saving/restoring an MSR that is going to be written just once. Perhaps
we will have to tweak the heuristic, only passing IBRS through when the
guest writes IBRS=0.
In the end I think it's premature to try and optimize for L2 guests of
long-lived L1 hypervisors.
Paolo
On Wed, 2018-01-31 at 13:53 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <[email protected]> wrote:
>
> > Can we just say it sucks to be L2 too? :) Because in the end as long as
> > no one ever writes to spec_ctrl, everybody is happy.
>
> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
> :-)
>
> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.
I'm actually working on IBRS_ALL at the moment.
I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
just make it a no-op.
Or if that really doesn't fly, perhaps with IBRS_ALL we should invert
the logic. Set IBRS to 1 on vCPU reset, and only if it's set to *zero*
do we pass through the MSR and set the save_spec_ctrl_on_exit flag.
But let's get the code for *current* hardware done first...
On Wed, 2018-01-31 at 13:18 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:21 PM, David Woodhouse wrote:
>
> >
> > Reading and writing this MSR is expensive. And if it's yielded to the
> > guest in the MSR bitmap, that means we have to save its value on vmexit
> > and set it back to zero.
>
> Agreed. But my point is that if it's not yielded to the guest in the
> MSR bitmap, then we don't have to save its value on VM-exit and set it
> back to zero. The vmcs02 MSR bitmap is reconstructed on every L1->L2
> transition. Sometimes, it will yield the MSR and sometimes it won't.
Strictly: if SPEC_CTRL is not already set to 1 *and* hasn't been
yielded to the guest in the MSR bitmap, then we don't have to set it
back to zero.
If L1 decides it's *always* going to trap and never pass through, but
the value is already set to non-zero, we need to get that case right.
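In code terms, the post-vmexit cleanup has to run whenever either condition
holds; a trivial illustrative predicate (names made up, not from the series):

static bool need_restore_and_zero(u64 current_spec_ctrl, bool yielded_to_guest)
{
	/*
	 * If the MSR is already non-zero, or the guest can write it directly,
	 * the host value must be cleaned up (saved and zeroed) on vmexit.
	 */
	return current_spec_ctrl != 0 || yielded_to_guest;
}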
On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <[email protected]> wrote:
> I'm actually working on IBRS_ALL at the moment.
>
> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> just make it a no-op.
Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
the one in IA32_FEATURE_CONTROL.
On Wed, 2018-01-31 at 14:06 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <[email protected]> wrote:
> > I'm actually working on IBRS_ALL at the moment.
> >
> > I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> > just make it a no-op.
>
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.
Given that IBRS_ALL is supposed to be a sanely-performing option, I'd
rather convince Intel to just make it unconditional. If they've added
the appropriate tagging to the BTB, why even *have* this deliberately
insecure mode when IBRS==0?
I understand that until/unless they get a *proper* fix, software is
still going to have to use IBPB as appropriate. But there's no need for
the IBRS bit to do *anything*.
On Wed, Jan 31, 2018 at 2:10 PM, David Woodhouse <[email protected]> wrote:
>
> Given that IBRS_ALL is supposed to be a sanely-performing option, I'd
> rather convince Intel to just make it unconditional. If they've added
> the appropriate tagging to the BTB, why even *have* this deliberately
> insecure mode when IBRS==0?
>
> I understand that until/unless they get a *proper* fix, software is
> still going to have to use IBPB as appropriate. But there's no need for
> the IBRS bit to do *anything*.
Amen, brother!
Please please please can Amazon and friends push this? The current
situation with IBRS_ALL is complete nasty horrible garbage. It's
pointless on current CPU's, and it's not well-defined enough on future
CPU's.
The whole "you can enable this, but performance may or may not be
acceptable, and we won't tell you" thing is some bad mumbo-jumbo.
Before IBRS is good, we'll do retpoline and BTB stuffing and have
those (hopefully very rare) IBPB's. So the whole "badly performing
IBRS_ALL" is completely pointless, and actively wrong.
Linus
On 01/31/2018 09:18 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed <[email protected]> wrote:
>
>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>> to it, this condition will be true and then the bitmap will be updated.
>
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.
I actually got rid of this save_spec_ctrl_on_exit variable and replaced
it with another variable like the one suggested for IBPB, just to avoid
doing an expensive guest_cpuid_has. Now I peek instead into the MSR bitmap
to figure out if this MSR was supposed to be intercepted or not. This
test should provide similar semantics to save_spec_ctrl_on_exit.
Anyway, cleaning up/testing now and will post a new version.
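For reference, peeking at the write half of a VMX MSR bitmap can look roughly
like this (a sketch assuming the architectural layout, with the write-low
range at offset 0x800 and write-high at 0xc00; the function name is
illustrative):

static bool msr_write_intercepted(unsigned long *msr_bitmap, u32 msr)
{
	int f = sizeof(unsigned long);

	if (msr <= 0x1fff)
		return test_bit(msr, msr_bitmap + 0x800 / f);
	if (msr >= 0xc0000000 && msr <= 0xc0001fff)
		return test_bit(msr & 0x1fff, msr_bitmap + 0xc00 / f);

	return true;	/* MSRs outside the bitmap ranges always intercept */
}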
> On Jan 31, 2018, at 2:06 PM, Jim Mattson <[email protected]> wrote:
>
>> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <[email protected]> wrote:
>> I'm actually working on IBRS_ALL at the moment.
>>
>> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
>> just make it a no-op.
>
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.
Please no. Some BIOS vendor is going to lock it to zero to win some silly benchmark.
Hi Karim
On Wed, Jan 31, 2018 at 08:37:46PM +0100, KarimAllah Ahmed wrote:
> [ Based on a patch from Ashok Raj <[email protected]> ]
>
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.
With these changes SPEC_CTRL is properly exposed to the guest when using
the latest Qemu.
>
> To avoid the overhead of atomically saving and restoring the
> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
> add_atomic_switch_msr when a non-zero is written to it.
>
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
>
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
>
On 01/31/2018 11:52 PM, KarimAllah Ahmed wrote:
> On 01/31/2018 09:18 PM, Jim Mattson wrote:
>> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed
>> <[email protected]> wrote:
>>
>>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>>> to it, this condition will be true and then the bitmap will be updated.
>>
>> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
>> set to true, even if the MSR permission bitmap for a particular VMCS
>> *doesn't* allow the MSR to be written without an intercept. That's
>> functionally correct, but inefficient. It seems to me that
>> save_spec_ctrl_on_exit should indicate whether or not the *current*
>> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
>> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
>> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
>> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
>> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
>> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
>> to L2.
>
> I actually got rid of this save_spec_ctrl_on_exit variable and replaced
> it with another variable like the one suggested for IBPB, mainly to avoid
> doing an expensive guest_cpuid_has. Now I peek into the MSR bitmap instead
> to figure out whether this MSR was supposed to be intercepted or not. This
> test should provide similar semantics to save_spec_ctrl_on_exit.
>
> Anyway, cleaning up/testing now and will post a new version.
I think this patch should address all your concerns.
On 01/31/2018 08:55 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:53 AM, David Woodhouse <[email protected]> wrote:
>> Rather than doing the expensive guest_cpu_has() every time (which is
>> worse now as we realised we need two of them) perhaps we should
>> introduce a local flag for that too?
>
> That sounds good to me.
>
Done.
> From 9c19a8ac3f021efba6f70ad7e28f7ad06bb97e43 Mon Sep 17 00:00:00 2001
> From: KarimAllah Ahmed <[email protected]>
> Date: Mon, 29 Jan 2018 19:58:10 +0000
> Subject: [PATCH] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
>
> [ Based on a patch from Ashok Raj <[email protected]> ]
>
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.
>
> To avoid the overhead of atomically saving and restoring the
> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
> add_atomic_switch_msr when a non-zero is written to it.
^^^^^^^^^^^^^^^^^^^^^
That part of the comment does not seem to be in sync with the code.
>
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
>
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
>
> Cc: Asit Mallick <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: David Woodhouse <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Ashok Raj <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> v6:
> - got rid of save_spec_ctrl_on_exit
> - introduce spec_ctrl_intercepted
> - introduce spec_ctrl_used
> v5:
> - Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
> v4:
> - Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
> - Handling nested guests
> v3:
> - Save/restore manually
> - Fix CPUID handling
> - Fix a copy & paste error in the name of SPEC_CTRL MSR in
> disable_intercept.
> - support !cpu_has_vmx_msr_bitmap()
> v2:
> - remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
> - special case writing '0' in SPEC_CTRL to avoid confusing live-migration
> when the instance never used the MSR (dwmw@).
> - depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
> - add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
> ---
> arch/x86/kvm/cpuid.c | 9 +++--
> arch/x86/kvm/vmx.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 2 +-
> 3 files changed, 100 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1909635..13f5d42 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 0x80000008.ebx */
> const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> - F(IBPB);
> + F(IBPB) | F(IBRS);
>
> /* cpuid 0xC0000001.edx */
> const u32 kvm_cpuid_C000_0001_edx_x86_features =
> @@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>
> /* cpuid 7.0.edx*/
> const u32 kvm_cpuid_7_0_edx_x86_features =
> - F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
> + F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> + F(ARCH_CAPABILITIES);
>
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
> @@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> g_phys_as = phys_as;
> entry->eax = g_phys_as | (virt_as << 8);
> entry->edx = 0;
> - /* IBPB isn't necessarily present in hardware cpuid */
> + /* IBRS and IBPB aren't necessarily present in hardware cpuid */
> if (boot_cpu_has(X86_FEATURE_IBPB))
> entry->ebx |= F(IBPB);
> + if (boot_cpu_has(X86_FEATURE_IBRS))
> + entry->ebx |= F(IBRS);
> entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
> break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6a9f4ec..bfc80ff 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -594,6 +594,14 @@ struct vcpu_vmx {
> #endif
>
> u64 arch_capabilities;
> + u64 spec_ctrl;
> +
> + /*
> + * This indicates that:
> + * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> + * 2) The guest has actually initiated a write against the MSR.
> + */
> + bool spec_ctrl_used;
>
> /*
> * This indicates that:
> @@ -946,6 +954,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
> static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
> u16 error_code);
> static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
> +static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> + u32 msr, int type);
>
> static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -1917,6 +1927,22 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
> vmcs_write32(EXCEPTION_BITMAP, eb);
> }
>
> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> +{
> + unsigned long *msr_bitmap;
> + int f = sizeof(unsigned long);
> +
> + if (!cpu_has_vmx_msr_bitmap())
> + return true;
> +
> + msr_bitmap = is_guest_mode(vcpu) ?
> + to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> + to_vmx(vcpu)->vmcs01.msr_bitmap;
> +
> + return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> +}
> +
> static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
> unsigned long entry, unsigned long exit)
> {
> @@ -3246,6 +3272,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> msr_info->data = guest_read_tsc(vcpu);
> break;
> + case MSR_IA32_SPEC_CTRL:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> + return 1;
> +
> + msr_info->data = to_vmx(vcpu)->spec_ctrl;
> + break;
> case MSR_IA32_ARCH_CAPABILITIES:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> kvm_write_tsc(vcpu, msr_info);
> break;
> + case MSR_IA32_SPEC_CTRL:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> + return 1;
> +
> + vmx->spec_ctrl_used = true;
> +
> + /* The STIBP bit doesn't fault even if it's not advertised */
> + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> + return 1;
> +
> + vmx->spec_ctrl = data;
> +
> + /*
> + * When it's written (to non-zero) for the first time, pass
> + * it through. This means we don't have to take the perf
.. But only if it is a nested guest (as you have && is_guest_mode).
Do you want to update the comment a bit?
> + * hit of saving it on vmexit for the common case of guests
> + * that don't use it.
> + */
> + if (cpu_has_vmx_msr_bitmap() && data &&
> + spec_ctrl_intercepted(vcpu) &&
> + is_guest_mode(vcpu))
^^^^^^^^^^^^^^^^^^ <=== here
> + vmx_disable_intercept_for_msr(
> + vmx->vmcs01.msr_bitmap,
> + MSR_IA32_SPEC_CTRL,
> + MSR_TYPE_RW);
> + break;
> case MSR_IA32_PRED_CMD:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
>
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 6a9f4ec..bfc80ff 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -594,6 +594,14 @@ struct vcpu_vmx {
> > #endif
> >
> > u64 arch_capabilities;
> > + u64 spec_ctrl;
> > +
> > + /*
> > + * This indicates that:
> > + * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> > + * 2) The guest has actually initiated a write against the MSR.
> > + */
> > + bool spec_ctrl_used;
> >
> > /*
> > * This indicates that:
Thanks for persisting with the details here, Karim. In addition to
Konrad's heckling at the comments, I'll add my own request to his...
I'd like the comment for spec_ctrl_used to explain why it isn't
entirely redundant with the spec_ctrl_intercepted() function.
Without nesting, I believe it *would* be redundant, but the difference
comes when an L2 is running for which L1 has not permitted the MSR to
be passed through. That's when we have spec_ctrl_used = true but the
MSR *isn't* actually passed through in the active msr_bitmap.
Question: if spec_ctrl_used is always equivalent to the intercept bit
in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
it and always use the bit from the vmcs01.msr_bitmap?
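Roughly, just as an untested sketch (the helper name is invented), what
I have in mind is:

	/* Is SPEC_CTRL passed through to L1 in vmcs01? */
	static bool spec_ctrl_passed_through(struct kvm_vcpu *vcpu)
	{
		int f = sizeof(unsigned long);

		if (!cpu_has_vmx_msr_bitmap())
			return false;

		/* Write-intercept bit clear => L1 can write it directly. */
		return !test_bit(MSR_IA32_SPEC_CTRL,
				 to_vmx(vcpu)->vmcs01.msr_bitmap + 0x800 / f);
	}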
Sorry :)
On 31/01/2018 16:59, David Woodhouse wrote:
>
>
> On Wed, 2018-01-31 at 13:53 -0800, Jim Mattson wrote:
>> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <[email protected]> wrote:
>>
>>> Can we just say it sucks to be L2 too? :) Because in the end as long as
>>> no one ever writes to spec_ctrl, everybody is happy.
>>
>> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
>> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
>> :-)
>>
>> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
>> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.
>
>
> I'm actually working on IBRS_ALL at the moment.
>
> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> just make it a no-op.
That would be very slow.
> Or if that really doesn't fly, perhaps with IBRS_ALL we should invert
> the logic. Set IBRS to 1 on vCPU reset, and only if it's set to *zero*
> do we pass through the MSR and set the save_spec_ctrl_on_exit flag.
... but something like that would be a good idea. Even with IBRS at 0 on
vCPU reset, only pass it through once it's set back to zero. The first
IBRS=1 write would not enable pass through, and would not set the
save_spec_ctrl_on_exit flag. In fact it need not even be conditional on
IBRS_ALL.
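Something along these lines, as a rough untested sketch on top of the
patch above (all the helpers are the ones from the patch; note the old
value has to be checked before it is overwritten):

	case MSR_IA32_SPEC_CTRL:
		if (!msr_info->host_initiated &&
		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
			return 1;

		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
			return 1;

		/*
		 * Only start passing the MSR through (and paying the
		 * save/restore cost on every vmexit) once the guest
		 * writes it back to zero; a guest that sets IBRS once
		 * and leaves it set never needs the pass-through.
		 */
		if (cpu_has_vmx_msr_bitmap() && !data && vmx->spec_ctrl &&
		    spec_ctrl_intercepted(vcpu) && !is_guest_mode(vcpu))
			vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
						      MSR_IA32_SPEC_CTRL,
						      MSR_TYPE_RW);

		vmx->spec_ctrl = data;
		break;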
Paolo
> But let's get the code for *current* hardware done first...
>
.snip..
> > +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> > +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> > +{
> > + unsigned long *msr_bitmap;
> > + int f = sizeof(unsigned long);
> > +
> > + if (!cpu_has_vmx_msr_bitmap())
> > + return true;
> > +
> > + msr_bitmap = is_guest_mode(vcpu) ?
> > + to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> > + to_vmx(vcpu)->vmcs01.msr_bitmap;
> > +
> > + return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> > +}
> > +
..snip..
> > @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_IA32_TSC:
> > kvm_write_tsc(vcpu, msr_info);
> > break;
> > + case MSR_IA32_SPEC_CTRL:
> > + if (!msr_info->host_initiated &&
> > + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> > + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> > + return 1;
> > +
> > + vmx->spec_ctrl_used = true;
> > +
> > + /* The STIBP bit doesn't fault even if it's not advertised */
> > + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> > + return 1;
> > +
> > + vmx->spec_ctrl = data;
> > +
> > + /*
> > + * When it's written (to non-zero) for the first time, pass
> > + * it through. This means we don't have to take the perf
>
> .. But only if it is a nested guest (as you have && is_guest_mode).
>
> Do you want to update the comment a bit?
>
> > + * hit of saving it on vmexit for the common case of guests
> > + * that don't use it.
> > + */
> > + if (cpu_has_vmx_msr_bitmap() && data &&
> > + spec_ctrl_intercepted(vcpu) &&
> > + is_guest_mode(vcpu))
> ^^^^^^^^^^^^^^^^^^ <=== here
Would it perhaps also be good to mention in the commit message the
complexity of how we ought to be handling L1 and L2 guests?
We are all stressed and I am sure some of us haven't gotten much
sleep - but it can help in, say, three months when some unlucky new
soul is trying to understand this and gets utterly confused.
> > + vmx_disable_intercept_for_msr(
> > + vmx->vmcs01.msr_bitmap,
> > + MSR_IA32_SPEC_CTRL,
> > + MSR_TYPE_RW);
> > + break;
> > case MSR_IA32_PRED_CMD:
> > if (!msr_info->host_initiated &&
> > !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
On 02/01/2018 03:19 PM, Konrad Rzeszutek Wilk wrote:
> .snip..
>>> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
>>> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
>>> +{
>>> + unsigned long *msr_bitmap;
>>> + int f = sizeof(unsigned long);
>>> +
>>> + if (!cpu_has_vmx_msr_bitmap())
>>> + return true;
>>> +
>>> + msr_bitmap = is_guest_mode(vcpu) ?
>>> + to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
>>> + to_vmx(vcpu)->vmcs01.msr_bitmap;
>>> +
>>> + return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
>>> +}
>>> +
> ..snip..
>>> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>> case MSR_IA32_TSC:
>>> kvm_write_tsc(vcpu, msr_info);
>>> break;
>>> + case MSR_IA32_SPEC_CTRL:
>>> + if (!msr_info->host_initiated &&
>>> + !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
>>> + !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>>> + return 1;
>>> +
>>> + vmx->spec_ctrl_used = true;
>>> +
>>> + /* The STIBP bit doesn't fault even if it's not advertised */
>>> + if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
>>> + return 1;
>>> +
>>> + vmx->spec_ctrl = data;
>>> +
>>> + /*
>>> + * When it's written (to non-zero) for the first time, pass
>>> + * it through. This means we don't have to take the perf
>>
>> .. But only if it is a nested guest (as you have && is_guest_mode).
>>
>> Do you want to update the comment a bit?
>>
>>> + * hit of saving it on vmexit for the common case of guests
>>> + * that don't use it.
>>> + */
>>> + if (cpu_has_vmx_msr_bitmap() && data &&
>>> + spec_ctrl_intercepted(vcpu) &&
>>> + is_guest_mode(vcpu))
>> ^^^^^^^^^^^^^^^^^^ <=== here
>
> Would it perhaps also be good to mention in the commit message the
> complexity of how we ought to be handling L1 and L2 guests?
>
> We are all stressed and I am sure some of us haven't gotten much
> sleep - but it can help in, say, three months when some unlucky new
> soul is trying to understand this and gets utterly confused.
Yup, I will go through the patches and add as much detail as possible.
And yes, the is_guest_mode(vcpu) here is inverted :D I blame the late
night :)
>
>>> + vmx_disable_intercept_for_msr(
>>> + vmx->vmcs01.msr_bitmap,
>>> + MSR_IA32_SPEC_CTRL,
>>> + MSR_TYPE_RW);
>>> + break;
>>> case MSR_IA32_PRED_CMD:
>>> if (!msr_info->host_initiated &&
>>> !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
>
Hi Karim
Thanks for cracking at it through all the iterations.
On Wed, Jan 31, 2018 at 08:37:44PM +0100, KarimAllah Ahmed wrote:
> From: Ashok Raj <[email protected]>
>
> The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
> control mechanism. It keeps earlier branches from influencing
> later ones.
It might be better to split this patch into two parts:
- just the place where we issue the IBPB barrier for VMX/SVM
- the MSR get/set handling covering guest and nested support, etc.
The first part should be trivial, so it could make it into the
tip/x86/pti branch sooner, while we rework the rest to address
the feedback completely.
Cheers,
Ashok
On 02/01/2018 02:25 PM, David Woodhouse wrote:
>
>
> On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 6a9f4ec..bfc80ff 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>>> #endif
>>>
>>> u64 arch_capabilities;
>>> + u64 spec_ctrl;
>>> +
>>> + /*
>>> + * This indicates that:
>>> + * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
>>> + * 2) The guest has actually initiated a write against the MSR.
>>> + */
>>> + bool spec_ctrl_used;
>>>
>>> /*
>>> * This indicates that:
>
> Thanks for persisting with the details here, Karim. In addition to
> Konrad's heckling at the comments, I'll add my own request to his...
>
> I'd like the comment for spec_ctrl_used to explain why it isn't
> entirely redundant with the spec_ctrl_intercepted() function.
>
> Without nesting, I believe it *would* be redundant, but the difference
> comes when an L2 is running for which L1 has not permitted the MSR to
> be passed through. That's when we have spec_ctrl_used = true but the
> MSR *isn't* actually passed through in the active msr_bitmap.
>
> Question: if spec_ctrl_used is always equivalent to the intercept bit
> in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
> it and always use the bit from the vmcs01.msr_bitmap?
If I used the vmcs01.msr_bitmap, spec_ctrl_used would always be true if
L0 passed it to L1. Even if L1 did not actually pass it to L2 and even
if L2 has not written to it yet (!used).
This pretty much renders the short-circuit at
nested_vmx_merge_msr_bitmap useless:
if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
!to_vmx(vcpu)->pred_cmd_used &&
!to_vmx(vcpu)->spec_ctrl_used)
return false;
... and the default path will be kvm_vcpu_gpa_to_page + kmap.
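(For reference, that default path has roughly the following shape; this
is only a sketch of the relevant part of nested_vmx_merge_msr_bitmap,
not a verbatim copy:)

	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
	if (is_error_page(page))
		return false;
	msr_bitmap_l1 = (unsigned long *)kmap(page);

	/* ... merge L1's MSR permissions into the vmcs02 bitmap ... */

	kunmap(page);
	kvm_release_page_clean(page);
	return true;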
That being said, I have to admit the logic for spec_ctrl_used is not
perfect either.
If L1 or any of the L2s touched the MSR, spec_ctrl_used will be set to
true. So if one L2 used the MSR, all other L2s will also skip the short-
circuit mentioned above and end up *always* going through
kvm_vcpu_gpa_to_page + kmap.
Maybe all of this is over-thinking and in reality the short-circuit
above is really useless and all L2 guests are happily using x2apic :)
>
> Sorry :)
>
On 02/01/2018 06:37 PM, KarimAllah Ahmed wrote:
> On 02/01/2018 02:25 PM, David Woodhouse wrote:
>>
>>
>> On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 6a9f4ec..bfc80ff 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>>>> #endif
>>>> u64 arch_capabilities;
>>>> + u64 spec_ctrl;
>>>> +
>>>> + /*
>>>> + * This indicates that:
>>>> + * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
>>>> + * 2) The guest has actually initiated a write against the MSR.
>>>> + */
>>>> + bool spec_ctrl_used;
>>>> /*
>>>> * This indicates that:
>>
>> Thanks for persisting with the details here, Karim. In addition to
>> Konrad's heckling at the comments, I'll add my own request to his...
>>
>> I'd like the comment for spec_ctrl_used to explain why it isn't
>> entirely redundant with the spec_ctrl_intercepted() function.
>>
>> Without nesting, I believe it *would* be redundant, but the difference
>> comes when an L2 is running for which L1 has not permitted the MSR to
>> be passed through. That's when we have spec_ctrl_used = true but the
>> MSR *isn't* actually passed through in the active msr_bitmap.
>>
>> Question: if spec_ctrl_used is always equivalent to the intercept bit
>> in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
>> it and always use the bit from the vmcs01.msr_bitmap?
>
> If I used the vmcs01.msr_bitmap, spec_ctrl_used would always be true if
> L0 passed it to L1. Even if L1 did not actually pass it to L2 and even
> if L2 has not written to it yet (!used).
>
> This pretty much renders the short-circuit at
> nested_vmx_merge_msr_bitmap useless:
>
> if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
> !to_vmx(vcpu)->pred_cmd_used &&
> !to_vmx(vcpu)->spec_ctrl_used)
> return false;
>
> ... and the default path will be kvm_vcpu_gpa_to_page + kmap.
>
> That being said, I have to admit the logic for spec_ctrl_used is not
> perfect either.
>
> If L1 or any of the L2s touched the MSR, spec_ctrl_used will be set to
> true. So if one L2 used the MSR, all other L2s will also skip the short-
> circuit mentioned above and end up *always* going through
> kvm_vcpu_gpa_to_page + kmap.
>
> Maybe all of this is over-thinking and in reality the short-circuit
> above is really useless and all L2 guests are happily using x2apic :)
>
hehe ..
>> if spec_ctrl_used is always equivalent to the intercept bit in the
>> vmcs01.msr_bitmap
actually yes, we can.
I just forgot that we update the msr bitmap lazily! :)
>>
>> Sorry :)
>>