2018-02-06 17:32:10

by Woodhouse, David

Subject: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

I've put together a linux-4.9.y branch at
http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y

Most of it is fairly straightforward, apart from the IBPB on context
switch for which Tim has already posted a candidate. I wanted some more
review on my backports of the KVM bits though, including some extra
historical patches I pulled in.

Ashok Raj (1):
KVM/x86: Add IBPB support

David Hildenbrand (1):
KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail

David Matlack (1):
KVM: nVMX: mark vmcs12 pages dirty on L2 exit

Jim Mattson (1):
KVM: nVMX: Eliminate vmcs02 pool

KarimAllah Ahmed (3):
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL

Paolo Bonzini (2):
KVM: VMX: introduce alloc_loaded_vmcs
KVM: VMX: make MSR bitmaps per-VCPU

arch/x86/kvm/cpuid.c | 21 +-
arch/x86/kvm/cpuid.h | 31 +++
arch/x86/kvm/svm.c | 116 ++++++++
arch/x86/kvm/vmx.c | 730 +++++++++++++++++++++++++++------------------------
arch/x86/kvm/x86.c | 1 +
5 files changed, 554 insertions(+), 345 deletions(-)

--
2.7.4



2018-02-06 17:32:06

by Woodhouse, David

Subject: [PATCH 8/9] KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL

From: KarimAllah Ahmed <[email protected]>

[ Based on a patch from Ashok Raj <[email protected]> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
it when a non-zero value is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.
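
[Editorial note, not part of the patch: a minimal sketch of the lazy
save/restore scheme described above, using kernel-style MSR helpers
(wrmsrl/rdmsrl); the real implementation is in the vmx.c hunks below.]

struct spec_ctrl_state {
	u64  spec_ctrl;		/* last value the guest wrote		*/
	bool passthrough;	/* write intercept already disabled?	*/
};

/* WRMSR exit handler: intercept only until the first non-zero write. */
static void sketch_set_spec_ctrl(struct spec_ctrl_state *s, u64 data)
{
	s->spec_ctrl = data;
	if (data && !s->passthrough) {
		s->passthrough = true;
		/* disable the MSR write intercept in the bitmap here */
	}
}

/* Around guest entry/exit: only touch the MSR if the guest ever did. */
static void sketch_vcpu_run(struct spec_ctrl_state *s)
{
	if (s->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, s->spec_ctrl);

	/* ... enter the guest ... */

	if (s->passthrough)		/* guest may have changed it */
		rdmsrl(MSR_IA32_SPEC_CTRL, s->spec_ctrl);
	if (s->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, 0);	/* host does not use IBRS */
}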

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]

Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Darren Kenny <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Ashok Raj <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

(cherry picked from commit d28b387fb74da95d69d2615732f50cceb38e9a4d)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 8 ++--
arch/x86/kvm/cpuid.h | 11 ++++++
arch/x86/kvm/vmx.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 2 +-
4 files changed, 118 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 9c6493f..93f924d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -357,7 +357,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,

/* cpuid 0x80000008.ebx */
const u32 kvm_cpuid_8000_0008_ebx_x86_features =
- F(IBPB);
+ F(IBPB) | F(IBRS);

/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
@@ -382,7 +382,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,

/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(ARCH_CAPABILITIES);
+ F(SPEC_CTRL) | F(ARCH_CAPABILITIES);

/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -618,9 +618,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
entry->edx = 0;
- /* IBPB isn't necessarily present in hardware cpuid */
+ /* IBRS and IBPB aren't necessarily present in hardware cpuid */
if (boot_cpu_has(X86_FEATURE_IBPB))
entry->ebx |= F(IBPB);
+ if (boot_cpu_has(X86_FEATURE_IBRS))
+ entry->ebx |= F(IBRS);
entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 8719997..d1beb71 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -171,6 +171,17 @@ static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu)
return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
}

+static inline bool guest_cpuid_has_ibrs(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
+ if (best && (best->ebx & bit(X86_FEATURE_IBRS)))
+ return true;
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
+}
+
static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 92bf61f..764cb7c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -552,6 +552,7 @@ struct vcpu_vmx {
#endif

u64 arch_capabilities;
+ u64 spec_ctrl;

u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
@@ -1852,6 +1853,29 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
}

/*
+ * Check if MSR is intercepted for currently loaded MSR bitmap.
+ */
+static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr)
+{
+ unsigned long *msr_bitmap;
+ int f = sizeof(unsigned long);
+
+ if (!cpu_has_vmx_msr_bitmap())
+ return true;
+
+ msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
+
+ if (msr <= 0x1fff) {
+ return !!test_bit(msr, msr_bitmap + 0x800 / f);
+ } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+ msr &= 0x1fff;
+ return !!test_bit(msr, msr_bitmap + 0xc00 / f);
+ }
+
+ return true;
+}
+
+/*
* Check if MSR is intercepted for L01 MSR bitmap.
*/
static bool msr_write_intercepted_l01(struct kvm_vcpu *vcpu, u32 msr)
@@ -2981,6 +3005,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ msr_info->data = to_vmx(vcpu)->spec_ctrl;
+ break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has_arch_capabilities(vcpu))
@@ -3091,6 +3122,36 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ vmx->spec_ctrl = data;
+
+ if (!data)
+ break;
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_vmx_merge_msr_bitmap. We should not touch the
+ * vmcs02.msr_bitmap here since it gets completely overwritten
+ * in the merging. We update the vmcs01 here for L1 as well
+ * since it will end up touching the MSR anyway now.
+ */
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_RW);
+ break;
case MSR_IA32_PRED_CMD:
if (!msr_info->host_initiated &&
!guest_cpuid_has_ibpb(vcpu))
@@ -5243,6 +5304,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u64 cr0;

vmx->rmode.vm86_active = 0;
+ vmx->spec_ctrl = 0;

vmx->soft_vnmi_blocked = 0;

@@ -8828,6 +8890,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)

vmx_arm_hv_timer(vcpu);

+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
/* Store host registers */
@@ -8946,6 +9017,27 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);

+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ *
+ * For non-nested case:
+ * If the L01 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ *
+ * For nested case:
+ * If the L02 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ */
+ if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
+ rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();

@@ -9505,7 +9597,7 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
/*
- * pred_cmd is trying to verify two things:
+ * pred_cmd & spec_ctrl are trying to verify two things:
*
* 1. L0 gave a permission to L1 to actually passthrough the MSR. This
* ensures that we do not accidentally generate an L02 MSR bitmap
@@ -9518,9 +9610,10 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
* the MSR.
*/
bool pred_cmd = msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
+ bool spec_ctrl = msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);

if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
- !pred_cmd)
+ !pred_cmd && !spec_ctrl)
return false;

page = nested_get_page(vcpu, vmcs12->msr_bitmap);
@@ -9559,6 +9652,12 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
}
}

+ if (spec_ctrl)
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_R | MSR_TYPE_W);
+
if (pred_cmd)
nested_vmx_disable_intercept_for_msr(
msr_bitmap_l1, msr_bitmap_l0,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 94d1573..75f756e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -975,7 +975,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
- MSR_IA32_ARCH_CAPABILITIES
+ MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
};

static unsigned num_msrs_to_save;
--
2.7.4


2018-02-06 17:32:17

by Woodhouse, David

Subject: [PATCH 9/9] KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL

From: KarimAllah Ahmed <[email protected]>

[ Based on a patch from Paolo Bonzini <[email protected]> ]

... basically doing exactly what we do for VMX:

- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
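
[Editorial note, not part of the patch: the "only if the guest actually
used it" check on SVM comes down to inspecting the active MSR permission
map. Each MSR owns two bits (read and write intercept) in a 32-bit word
of the map; a simplified model of the msr_write_intercepted() helper
added below, assuming svm_msrpm_offset() maps an MSR to its word index:]

static bool sketch_msr_write_intercepted(const u32 *msrpm, u32 msr)
{
	u32 offset    = svm_msrpm_offset(msr);	/* word index or MSR_INVALID */
	u8  bit_write = 2 * (msr & 0x0f) + 1;	/* write-intercept bit       */

	if (offset == MSR_INVALID)
		return true;		/* unknown MSRs stay intercepted */

	return msrpm[offset] & (1u << bit_write);
}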

Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Darren Kenny <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Ashok Raj <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

(cherry picked from commit b2ac58f90540e39324e7a29a7ad471407ae0bf48)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/svm.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6de8d65..be644af 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -183,6 +183,8 @@ struct vcpu_svm {
u64 gs_base;
} host;

+ u64 spec_ctrl;
+
u32 *msrpm;

ulong nmi_iret_rip;
@@ -248,6 +250,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_CSTAR, .always = true },
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
+ { .index = MSR_IA32_SPEC_CTRL, .always = false },
{ .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
@@ -863,6 +866,25 @@ static bool valid_msr_intercept(u32 index)
return false;
}

+static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr)
+{
+ u8 bit_write;
+ unsigned long tmp;
+ u32 offset;
+ u32 *msrpm;
+
+ msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm:
+ to_svm(vcpu)->msrpm;
+
+ offset = svm_msrpm_offset(msr);
+ bit_write = 2 * (msr & 0x0f) + 1;
+ tmp = msrpm[offset];
+
+ BUG_ON(offset == MSR_INVALID);
+
+ return !!test_bit(bit_write, &tmp);
+}
+
static void set_msr_interception(u32 *msrpm, unsigned msr,
int read, int write)
{
@@ -1537,6 +1559,8 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
u32 dummy;
u32 eax = 1;

+ svm->spec_ctrl = 0;
+
if (!init_event) {
svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
@@ -3520,6 +3544,13 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_VM_CR:
msr_info->data = svm->nested.vm_cr_msr;
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ msr_info->data = svm->spec_ctrl;
+ break;
case MSR_IA32_UCODE_REV:
msr_info->data = 0x01000065;
break;
@@ -3611,6 +3642,33 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ svm->spec_ctrl = data;
+
+ if (!data)
+ break;
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_svm_vmrun_msrpm.
+ * We update the L1 MSR bit as well since it will end up
+ * touching the MSR anyway now.
+ */
+ set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+ break;
case MSR_IA32_PRED_CMD:
if (!msr->host_initiated &&
!guest_cpuid_has_ibpb(vcpu))
@@ -4854,6 +4912,15 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)

local_irq_enable();

+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
asm volatile (
"push %%" _ASM_BP "; \n\t"
"mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t"
@@ -4946,6 +5013,27 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
#endif
);

+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ *
+ * For non-nested case:
+ * If the L01 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ *
+ * For nested case:
+ * If the L02 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ */
+ if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
+ rdmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();

--
2.7.4


2018-02-06 17:32:58

by Woodhouse, David

Subject: [PATCH 3/9] KVM: nVMX: Eliminate vmcs02 pool

From: Jim Mattson <[email protected]>

The potential performance advantages of a vmcs02 pool have never been
realized. To simplify the code, eliminate the pool. Instead, a single
vmcs02 is allocated per VCPU when the VCPU enters VMX operation.
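
[Editorial note, not part of the patch: structurally, the change replaces
the per-vCPU pool bookkeeping with a single embedded loaded_vmcs, roughly:]

/* Before: a cache of vmcs02s keyed by vmptr, with LRU recycling.	*/
struct nested_vmx_before {
	struct list_head vmcs02_pool;
	int vmcs02_num;
};

/* After: exactly one vmcs02, allocated in handle_vmon() and freed in	*/
/* free_nested(); nested_vmx_run() always loads &nested.vmcs02.	*/
struct nested_vmx_after {
	struct loaded_vmcs vmcs02;
};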

Cc: [email protected] # prereq for Spectre mitigation
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Mark Kanda <[email protected]>
Reviewed-by: Ameya More <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>

(cherry picked from commit de3a0021a60635de96aa92713c1a31a96747d72c)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/vmx.c | 146 +++++++++--------------------------------------------
1 file changed, 23 insertions(+), 123 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9408ae8..55b1474 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -174,7 +174,6 @@ module_param(ple_window_max, int, S_IRUGO);
extern const ulong vmx_return;

#define NR_AUTOLOAD_MSRS 8
-#define VMCS02_POOL_SIZE 1

struct vmcs {
u32 revision_id;
@@ -208,7 +207,7 @@ struct shared_msr_entry {
* stored in guest memory specified by VMPTRLD, but is opaque to the guest,
* which must access it using VMREAD/VMWRITE/VMCLEAR instructions.
* More than one of these structures may exist, if L1 runs multiple L2 guests.
- * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * nested_vmx_run() will use the data here to build the vmcs02: a VMCS for the
* underlying hardware which will be used to run L2.
* This structure is packed to ensure that its layout is identical across
* machines (necessary for live migration).
@@ -387,13 +386,6 @@ struct __packed vmcs12 {
*/
#define VMCS12_SIZE 0x1000

-/* Used to remember the last vmcs02 used for some recently used vmcs12s */
-struct vmcs02_list {
- struct list_head list;
- gpa_t vmptr;
- struct loaded_vmcs vmcs02;
-};
-
/*
* The nested_vmx structure is part of vcpu_vmx, and holds information we need
* for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -420,15 +412,15 @@ struct nested_vmx {
*/
bool sync_shadow_vmcs;

- /* vmcs02_list cache of VMCSs recently used to run L2 guests */
- struct list_head vmcs02_pool;
- int vmcs02_num;
bool change_vmcs01_virtual_x2apic_mode;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
+
+ struct loaded_vmcs vmcs02;
+
/*
- * Guest pages referred to in vmcs02 with host-physical pointers, so
- * we must keep them pinned while L2 runs.
+ * Guest pages referred to in the vmcs02 with host-physical
+ * pointers, so we must keep them pinned while L2 runs.
*/
struct page *apic_access_page;
struct page *virtual_apic_page;
@@ -6682,94 +6674,6 @@ static int handle_monitor(struct kvm_vcpu *vcpu)
}

/*
- * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
- * We could reuse a single VMCS for all the L2 guests, but we also want the
- * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
- * allows keeping them loaded on the processor, and in the future will allow
- * optimizations where prepare_vmcs02 doesn't need to set all the fields on
- * every entry if they never change.
- * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
- * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
- *
- * The following functions allocate and free a vmcs02 in this pool.
- */
-
-/* Get a VMCS from the pool to use as vmcs02 for the current vmcs12. */
-static struct loaded_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
-{
- struct vmcs02_list *item;
- list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
- if (item->vmptr == vmx->nested.current_vmptr) {
- list_move(&item->list, &vmx->nested.vmcs02_pool);
- return &item->vmcs02;
- }
-
- if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
- /* Recycle the least recently used VMCS. */
- item = list_last_entry(&vmx->nested.vmcs02_pool,
- struct vmcs02_list, list);
- item->vmptr = vmx->nested.current_vmptr;
- list_move(&item->list, &vmx->nested.vmcs02_pool);
- return &item->vmcs02;
- }
-
- /* Create a new VMCS */
- item = kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
- if (!item)
- return NULL;
- item->vmcs02.vmcs = alloc_vmcs();
- item->vmcs02.shadow_vmcs = NULL;
- if (!item->vmcs02.vmcs) {
- kfree(item);
- return NULL;
- }
- loaded_vmcs_init(&item->vmcs02);
- item->vmptr = vmx->nested.current_vmptr;
- list_add(&(item->list), &(vmx->nested.vmcs02_pool));
- vmx->nested.vmcs02_num++;
- return &item->vmcs02;
-}
-
-/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
-static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
-{
- struct vmcs02_list *item;
- list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
- if (item->vmptr == vmptr) {
- free_loaded_vmcs(&item->vmcs02);
- list_del(&item->list);
- kfree(item);
- vmx->nested.vmcs02_num--;
- return;
- }
-}
-
-/*
- * Free all VMCSs saved for this vcpu, except the one pointed by
- * vmx->loaded_vmcs. We must be running L1, so vmx->loaded_vmcs
- * must be &vmx->vmcs01.
- */
-static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
-{
- struct vmcs02_list *item, *n;
-
- WARN_ON(vmx->loaded_vmcs != &vmx->vmcs01);
- list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
- /*
- * Something will leak if the above WARN triggers. Better than
- * a use-after-free.
- */
- if (vmx->loaded_vmcs == &item->vmcs02)
- continue;
-
- free_loaded_vmcs(&item->vmcs02);
- list_del(&item->list);
- kfree(item);
- vmx->nested.vmcs02_num--;
- }
-}
-
-/*
* The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
* set the success or error code of an emulated VMX instruction, as specified
* by Vol 2B, VMX Instruction Reference, "Conventions".
@@ -7082,6 +6986,12 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

+ vmx->nested.vmcs02.vmcs = alloc_vmcs();
+ vmx->nested.vmcs02.shadow_vmcs = NULL;
+ if (!vmx->nested.vmcs02.vmcs)
+ goto out_vmcs02;
+ loaded_vmcs_init(&vmx->nested.vmcs02);
+
if (cpu_has_vmx_msr_bitmap()) {
vmx->nested.msr_bitmap =
(unsigned long *)__get_free_page(GFP_KERNEL);
@@ -7104,9 +7014,6 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
vmx->vmcs01.shadow_vmcs = shadow_vmcs;
}

- INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
- vmx->nested.vmcs02_num = 0;
-
hrtimer_init(&vmx->nested.preemption_timer, CLOCK_MONOTONIC,
HRTIMER_MODE_REL_PINNED);
vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
@@ -7124,6 +7031,9 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
free_page((unsigned long)vmx->nested.msr_bitmap);

out_msr_bitmap:
+ free_loaded_vmcs(&vmx->nested.vmcs02);
+
+out_vmcs02:
return -ENOMEM;
}

@@ -7209,7 +7119,7 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->vmcs01.shadow_vmcs = NULL;
}
kfree(vmx->nested.cached_vmcs12);
- /* Unpin physical memory we referred to in current vmcs02 */
+ /* Unpin physical memory we referred to in the vmcs02 */
if (vmx->nested.apic_access_page) {
nested_release_page(vmx->nested.apic_access_page);
vmx->nested.apic_access_page = NULL;
@@ -7225,7 +7135,7 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->nested.pi_desc = NULL;
}

- nested_free_all_saved_vmcss(vmx);
+ free_loaded_vmcs(&vmx->nested.vmcs02);
}

/* Emulate the VMXOFF instruction */
@@ -7259,8 +7169,6 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
vmptr + offsetof(struct vmcs12, launch_state),
&zero, sizeof(zero));

- nested_free_vmcs02(vmx, vmptr);
-
skip_emulated_instruction(vcpu);
nested_vmx_succeed(vcpu);
return 1;
@@ -8049,10 +7957,11 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)

/*
* The host physical addresses of some pages of guest memory
- * are loaded into VMCS02 (e.g. L1's Virtual APIC Page). The CPU
- * may write to these pages via their host physical address while
- * L2 is running, bypassing any address-translation-based dirty
- * tracking (e.g. EPT write protection).
+ * are loaded into the vmcs02 (e.g. vmcs12's Virtual APIC
+ * Page). The CPU may write to these pages via their host
+ * physical address while L2 is running, bypassing any
+ * address-translation-based dirty tracking (e.g. EPT write
+ * protection).
*
* Mark them dirty on every exit from L2 to prevent them from
* getting out of sync with dirty tracking.
@@ -10221,7 +10130,6 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
struct vmcs12 *vmcs12;
struct vcpu_vmx *vmx = to_vmx(vcpu);
int cpu;
- struct loaded_vmcs *vmcs02;
bool ia32e;
u32 msr_entry_idx;

@@ -10361,17 +10269,13 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
* the nested entry.
*/

- vmcs02 = nested_get_current_vmcs02(vmx);
- if (!vmcs02)
- return -ENOMEM;
-
enter_guest_mode(vcpu);

if (!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
vmx->nested.vmcs01_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);

cpu = get_cpu();
- vmx->loaded_vmcs = vmcs02;
+ vmx->loaded_vmcs = &vmx->nested.vmcs02;
vmx_vcpu_put(vcpu);
vmx_vcpu_load(vcpu, cpu);
vcpu->cpu = cpu;
@@ -10886,10 +10790,6 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
vm_exit_controls_reset_shadow(vmx);
vmx_segment_cache_clear(vmx);

- /* if no vmcs02 cache requested, remove the one we used */
- if (VMCS02_POOL_SIZE == 0)
- nested_free_vmcs02(vmx, vmx->nested.current_vmptr);
-
load_vmcs12_host_state(vcpu, vmcs12);

/* Update any VMCS fields that might have changed while L2 ran */
--
2.7.4


2018-02-06 17:33:31

by Woodhouse, David

Subject: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

From: KarimAllah Ahmed <[email protected]>

Intel processors use the MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents come directly from the hardware, but user-space can still
override them.
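
[Editorial note, not part of the patch: a minimal sketch of the emulation
described above -- reads are gated on guest CPUID, writes are accepted
only when host-initiated (e.g. user-space via KVM_SET_MSRS):]

static int sketch_get_arch_capabilities(struct kvm_vcpu *vcpu, u64 *data,
					bool host_initiated)
{
	if (!host_initiated && !guest_cpuid_has_arch_capabilities(vcpu))
		return 1;			/* not advertised to the guest */
	*data = to_vmx(vcpu)->arch_capabilities;
	return 0;
}

static int sketch_set_arch_capabilities(struct kvm_vcpu *vcpu, u64 data,
					bool host_initiated)
{
	if (!host_initiated)
		return 1;			/* read-only for the guest */
	to_vmx(vcpu)->arch_capabilities = data;	/* user-space override     */
	return 0;
}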

[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]

Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Darren Kenny <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Asit Mallick <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Ashok Raj <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

(cherry picked from commit 28c1c9fabf48d6ad596273a11c46e0d0da3e14cd)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 8 +++++++-
arch/x86/kvm/cpuid.h | 8 ++++++++
arch/x86/kvm/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6f24483..9c6493f 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -380,6 +380,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.ecx*/
const u32 kvm_cpuid_7_0_ecx_x86_features = F(PKU) | 0 /*OSPKE*/;

+ /* cpuid 7.0.edx*/
+ const u32 kvm_cpuid_7_0_edx_x86_features =
+ F(ARCH_CAPABILITIES);
+
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();

@@ -462,12 +466,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* PKU is not yet implemented for shadow paging. */
if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
entry->ecx &= ~F(PKU);
+ entry->edx &= kvm_cpuid_7_0_edx_x86_features;
+ cpuid_mask(&entry->edx, CPUID_7_EDX);
} else {
entry->ebx = 0;
entry->ecx = 0;
+ entry->edx = 0;
}
entry->eax = 0;
- entry->edx = 0;
break;
}
case 9:
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index ec4f9dc..8719997 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -171,6 +171,14 @@ static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu)
return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
}

+static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_ARCH_CAPABILITIES));
+}
+

/*
* NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dd6c831..92bf61f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -551,6 +551,8 @@ struct vcpu_vmx {
u64 msr_guest_kernel_gs_base;
#endif

+ u64 arch_capabilities;
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
/*
@@ -2979,6 +2981,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_arch_capabilities(vcpu))
+ return 1;
+ msr_info->data = to_vmx(vcpu)->arch_capabilities;
+ break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@@ -3110,6 +3118,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
MSR_TYPE_W);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated)
+ return 1;
+ vmx->arch_capabilities = data;
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -5200,6 +5213,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
++vmx->nmsrs;
}

+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);

vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e023ef9..94d1573 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -975,6 +975,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
+ MSR_IA32_ARCH_CAPABILITIES
};

static unsigned num_msrs_to_save;
--
2.7.4


2018-02-06 17:34:10

by Woodhouse, David

Subject: [PATCH 5/9] KVM: VMX: make MSR bitmaps per-VCPU

From: Paolo Bonzini <[email protected]>

Place the MSR bitmap in struct loaded_vmcs, and update it in place
every time the x2apic or APICv state can change. This is rare and
the loop can handle 64 MSRs per iteration, in a similar fashion to
nested_vmx_prepare_msr_bitmap.

This prepares for choosing, on a per-VM basis, whether to intercept
the SPEC_CTRL and PRED_CMD MSRs.
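
[Editorial note, not part of the patch: a simplified model of the in-place
x2APIC range update mentioned above -- the bitmap's read-low region starts
at byte 0 and the write-low region at byte 0x800, and each loop iteration
rewrites one long word, i.e. 64 MSRs on 64-bit:]

static void sketch_update_x2apic_range(unsigned long *msr_bitmap, bool apicv)
{
	int msr;

	for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
		unsigned int word = msr / BITS_PER_LONG;

		/* reads: pass through when APICv handles them, else trap */
		msr_bitmap[word] = apicv ? 0 : ~0UL;
		/* writes: start out intercepted; TPR/EOI/SELF_IPI are    */
		/* opened up selectively afterwards                       */
		msr_bitmap[word + (0x800 / sizeof(long))] = ~0UL;
	}
}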

Cc: [email protected] # prereq for Spectre mitigation
Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
(cherry picked from commit 904e14fb7cb96401a7dc803ca2863fd5ba32ffe6)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/vmx.c | 314 +++++++++++++++++++----------------------------------
1 file changed, 114 insertions(+), 200 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8c562da..8570a17 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -110,6 +110,14 @@ static u64 __read_mostly host_xss;
static bool __read_mostly enable_pml = 1;
module_param_named(pml, enable_pml, bool, S_IRUGO);

+#define MSR_TYPE_R 1
+#define MSR_TYPE_W 2
+#define MSR_TYPE_RW 3
+
+#define MSR_BITMAP_MODE_X2APIC 1
+#define MSR_BITMAP_MODE_X2APIC_APICV 2
+#define MSR_BITMAP_MODE_LM 4
+
#define KVM_VMX_TSC_MULTIPLIER_MAX 0xffffffffffffffffULL

/* Guest_tsc -> host_tsc conversion requires 64-bit division. */
@@ -191,6 +199,7 @@ struct loaded_vmcs {
struct vmcs *shadow_vmcs;
int cpu;
int launched;
+ unsigned long *msr_bitmap;
struct list_head loaded_vmcss_on_cpu_link;
};

@@ -429,8 +438,6 @@ struct nested_vmx {
bool pi_pending;
u16 posted_intr_nv;

- unsigned long *msr_bitmap;
-
struct hrtimer preemption_timer;
bool preemption_timer_expired;

@@ -531,6 +538,7 @@ struct vcpu_vmx {
unsigned long host_rsp;
u8 fail;
bool nmi_known_unmasked;
+ u8 msr_bitmap_mode;
u32 exit_intr_info;
u32 idt_vectoring_info;
ulong rflags;
@@ -902,6 +910,7 @@ static u32 vmx_segment_access_rights(struct kvm_segment *var);
static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
static int alloc_identity_pagetable(struct kvm *kvm);
+static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);

static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -921,12 +930,6 @@ static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);

static unsigned long *vmx_io_bitmap_a;
static unsigned long *vmx_io_bitmap_b;
-static unsigned long *vmx_msr_bitmap_legacy;
-static unsigned long *vmx_msr_bitmap_longmode;
-static unsigned long *vmx_msr_bitmap_legacy_x2apic;
-static unsigned long *vmx_msr_bitmap_longmode_x2apic;
-static unsigned long *vmx_msr_bitmap_legacy_x2apic_apicv_inactive;
-static unsigned long *vmx_msr_bitmap_longmode_x2apic_apicv_inactive;
static unsigned long *vmx_vmread_bitmap;
static unsigned long *vmx_vmwrite_bitmap;

@@ -2520,36 +2523,6 @@ static void move_msr_up(struct vcpu_vmx *vmx, int from, int to)
vmx->guest_msrs[from] = tmp;
}

-static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
-{
- unsigned long *msr_bitmap;
-
- if (is_guest_mode(vcpu))
- msr_bitmap = to_vmx(vcpu)->nested.msr_bitmap;
- else if (cpu_has_secondary_exec_ctrls() &&
- (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
- SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
- if (enable_apicv && kvm_vcpu_apicv_active(vcpu)) {
- if (is_long_mode(vcpu))
- msr_bitmap = vmx_msr_bitmap_longmode_x2apic;
- else
- msr_bitmap = vmx_msr_bitmap_legacy_x2apic;
- } else {
- if (is_long_mode(vcpu))
- msr_bitmap = vmx_msr_bitmap_longmode_x2apic_apicv_inactive;
- else
- msr_bitmap = vmx_msr_bitmap_legacy_x2apic_apicv_inactive;
- }
- } else {
- if (is_long_mode(vcpu))
- msr_bitmap = vmx_msr_bitmap_longmode;
- else
- msr_bitmap = vmx_msr_bitmap_legacy;
- }
-
- vmcs_write64(MSR_BITMAP, __pa(msr_bitmap));
-}
-
/*
* Set up the vmcs to automatically save and restore system
* msrs. Don't touch the 64-bit msrs if the guest is in legacy
@@ -2590,7 +2563,7 @@ static void setup_msrs(struct vcpu_vmx *vmx)
vmx->save_nmsrs = save_nmsrs;

if (cpu_has_vmx_msr_bitmap())
- vmx_set_msr_bitmap(&vmx->vcpu);
+ vmx_update_msr_bitmap(&vmx->vcpu);
}

/*
@@ -3537,6 +3510,8 @@ static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
loaded_vmcs_clear(loaded_vmcs);
free_vmcs(loaded_vmcs->vmcs);
loaded_vmcs->vmcs = NULL;
+ if (loaded_vmcs->msr_bitmap)
+ free_page((unsigned long)loaded_vmcs->msr_bitmap);
WARN_ON(loaded_vmcs->shadow_vmcs != NULL);
}

@@ -3553,7 +3528,18 @@ static int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)

loaded_vmcs->shadow_vmcs = NULL;
loaded_vmcs_init(loaded_vmcs);
+
+ if (cpu_has_vmx_msr_bitmap()) {
+ loaded_vmcs->msr_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
+ if (!loaded_vmcs->msr_bitmap)
+ goto out_vmcs;
+ memset(loaded_vmcs->msr_bitmap, 0xff, PAGE_SIZE);
+ }
return 0;
+
+out_vmcs:
+ free_loaded_vmcs(loaded_vmcs);
+ return -ENOMEM;
}

static void free_kvm_area(void)
@@ -4562,10 +4548,8 @@ static void free_vpid(int vpid)
spin_unlock(&vmx_vpid_lock);
}

-#define MSR_TYPE_R 1
-#define MSR_TYPE_W 2
-static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
- u32 msr, int type)
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type)
{
int f = sizeof(unsigned long);

@@ -4599,8 +4583,8 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
}
}

-static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
- u32 msr, int type)
+static void __always_inline vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type)
{
int f = sizeof(unsigned long);

@@ -4634,6 +4618,15 @@ static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
}
}

+static void __always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value)
+{
+ if (value)
+ vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
+ else
+ vmx_disable_intercept_for_msr(msr_bitmap, msr, type);
+}
+
/*
* If a msr is allowed by L0, we should check whether it is allowed by L1.
* The corresponding bit will be cleared unless both of L0 and L1 allow it.
@@ -4680,58 +4673,68 @@ static void nested_vmx_disable_intercept_for_msr(unsigned long *msr_bitmap_l1,
}
}

-static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
+static u8 vmx_msr_bitmap_mode(struct kvm_vcpu *vcpu)
{
- if (!longmode_only)
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
- msr, MSR_TYPE_R | MSR_TYPE_W);
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
- msr, MSR_TYPE_R | MSR_TYPE_W);
-}
+ u8 mode = 0;

-static void vmx_enable_intercept_msr_read_x2apic(u32 msr, bool apicv_active)
-{
- if (apicv_active) {
- __vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic,
- msr, MSR_TYPE_R);
- __vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic,
- msr, MSR_TYPE_R);
- } else {
- __vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic_apicv_inactive,
- msr, MSR_TYPE_R);
- __vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic_apicv_inactive,
- msr, MSR_TYPE_R);
+ if (cpu_has_secondary_exec_ctrls() &&
+ (vmcs_read32(SECONDARY_VM_EXEC_CONTROL) &
+ SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE)) {
+ mode |= MSR_BITMAP_MODE_X2APIC;
+ if (enable_apicv && kvm_vcpu_apicv_active(vcpu))
+ mode |= MSR_BITMAP_MODE_X2APIC_APICV;
}
+
+ if (is_long_mode(vcpu))
+ mode |= MSR_BITMAP_MODE_LM;
+
+ return mode;
}

-static void vmx_disable_intercept_msr_read_x2apic(u32 msr, bool apicv_active)
+#define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4))
+
+static void vmx_update_msr_bitmap_x2apic(unsigned long *msr_bitmap,
+ u8 mode)
{
- if (apicv_active) {
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic,
- msr, MSR_TYPE_R);
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic,
- msr, MSR_TYPE_R);
- } else {
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic_apicv_inactive,
- msr, MSR_TYPE_R);
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic_apicv_inactive,
- msr, MSR_TYPE_R);
+ int msr;
+
+ for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
+ unsigned word = msr / BITS_PER_LONG;
+ msr_bitmap[word] = (mode & MSR_BITMAP_MODE_X2APIC_APICV) ? 0 : ~0;
+ msr_bitmap[word + (0x800 / sizeof(long))] = ~0;
+ }
+
+ if (mode & MSR_BITMAP_MODE_X2APIC) {
+ /*
+ * TPR reads and writes can be virtualized even if virtual interrupt
+ * delivery is not in use.
+ */
+ vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_RW);
+ if (mode & MSR_BITMAP_MODE_X2APIC_APICV) {
+ vmx_enable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_TMCCT), MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_EOI), MSR_TYPE_W);
+ vmx_disable_intercept_for_msr(msr_bitmap, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W);
+ }
}
}

-static void vmx_disable_intercept_msr_write_x2apic(u32 msr, bool apicv_active)
+static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu)
{
- if (apicv_active) {
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic,
- msr, MSR_TYPE_W);
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic,
- msr, MSR_TYPE_W);
- } else {
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy_x2apic_apicv_inactive,
- msr, MSR_TYPE_W);
- __vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode_x2apic_apicv_inactive,
- msr, MSR_TYPE_W);
- }
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
+ u8 mode = vmx_msr_bitmap_mode(vcpu);
+ u8 changed = mode ^ vmx->msr_bitmap_mode;
+
+ if (!changed)
+ return;
+
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW,
+ !(mode & MSR_BITMAP_MODE_LM));
+
+ if (changed & (MSR_BITMAP_MODE_X2APIC | MSR_BITMAP_MODE_X2APIC_APICV))
+ vmx_update_msr_bitmap_x2apic(msr_bitmap, mode);
+
+ vmx->msr_bitmap_mode = mode;
}

static bool vmx_get_enable_apicv(void)
@@ -4980,7 +4983,7 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
}

if (cpu_has_vmx_msr_bitmap())
- vmx_set_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap(vcpu);
}

static u32 vmx_exec_control(struct vcpu_vmx *vmx)
@@ -5069,7 +5072,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap));
}
if (cpu_has_vmx_msr_bitmap())
- vmcs_write64(MSR_BITMAP, __pa(vmx_msr_bitmap_legacy));
+ vmcs_write64(MSR_BITMAP, __pa(vmx->vmcs01.msr_bitmap));

vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */

@@ -6400,7 +6403,7 @@ static void wakeup_handler(void)

static __init int hardware_setup(void)
{
- int r = -ENOMEM, i, msr;
+ int r = -ENOMEM, i;

rdmsrl_safe(MSR_EFER, &host_efer);

@@ -6415,41 +6418,13 @@ static __init int hardware_setup(void)
if (!vmx_io_bitmap_b)
goto out;

- vmx_msr_bitmap_legacy = (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_legacy)
- goto out1;
-
- vmx_msr_bitmap_legacy_x2apic =
- (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_legacy_x2apic)
- goto out2;
-
- vmx_msr_bitmap_legacy_x2apic_apicv_inactive =
- (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_legacy_x2apic_apicv_inactive)
- goto out3;
-
- vmx_msr_bitmap_longmode = (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_longmode)
- goto out4;
-
- vmx_msr_bitmap_longmode_x2apic =
- (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_longmode_x2apic)
- goto out5;
-
- vmx_msr_bitmap_longmode_x2apic_apicv_inactive =
- (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx_msr_bitmap_longmode_x2apic_apicv_inactive)
- goto out6;
-
vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmread_bitmap)
- goto out7;
+ goto out1;

vmx_vmwrite_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
if (!vmx_vmwrite_bitmap)
- goto out8;
+ goto out2;

memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE);
memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE);
@@ -6458,12 +6433,9 @@ static __init int hardware_setup(void)

memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE);

- memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
- memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE);
-
if (setup_vmcs_config(&vmcs_config) < 0) {
r = -EIO;
- goto out9;
+ goto out3;
}

if (boot_cpu_has(X86_FEATURE_NX))
@@ -6520,47 +6492,8 @@ static __init int hardware_setup(void)
kvm_tsc_scaling_ratio_frac_bits = 48;
}

- vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
- vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
- vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
- vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_CS, false);
- vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
- vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
-
- memcpy(vmx_msr_bitmap_legacy_x2apic,
- vmx_msr_bitmap_legacy, PAGE_SIZE);
- memcpy(vmx_msr_bitmap_longmode_x2apic,
- vmx_msr_bitmap_longmode, PAGE_SIZE);
- memcpy(vmx_msr_bitmap_legacy_x2apic_apicv_inactive,
- vmx_msr_bitmap_legacy, PAGE_SIZE);
- memcpy(vmx_msr_bitmap_longmode_x2apic_apicv_inactive,
- vmx_msr_bitmap_longmode, PAGE_SIZE);
-
set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */

- /*
- * enable_apicv && kvm_vcpu_apicv_active()
- */
- for (msr = 0x800; msr <= 0x8ff; msr++)
- vmx_disable_intercept_msr_read_x2apic(msr, true);
-
- /* TMCCT */
- vmx_enable_intercept_msr_read_x2apic(0x839, true);
- /* TPR */
- vmx_disable_intercept_msr_write_x2apic(0x808, true);
- /* EOI */
- vmx_disable_intercept_msr_write_x2apic(0x80b, true);
- /* SELF-IPI */
- vmx_disable_intercept_msr_write_x2apic(0x83f, true);
-
- /*
- * (enable_apicv && !kvm_vcpu_apicv_active()) ||
- * !enable_apicv
- */
- /* TPR */
- vmx_disable_intercept_msr_read_x2apic(0x808, false);
- vmx_disable_intercept_msr_write_x2apic(0x808, false);
-
if (enable_ept) {
kvm_mmu_set_mask_ptes(VMX_EPT_READABLE_MASK,
(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
@@ -6606,22 +6539,10 @@ static __init int hardware_setup(void)

return alloc_kvm_area();

-out9:
- free_page((unsigned long)vmx_vmwrite_bitmap);
-out8:
- free_page((unsigned long)vmx_vmread_bitmap);
-out7:
- free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic_apicv_inactive);
-out6:
- free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
-out5:
- free_page((unsigned long)vmx_msr_bitmap_longmode);
-out4:
- free_page((unsigned long)vmx_msr_bitmap_legacy_x2apic_apicv_inactive);
out3:
- free_page((unsigned long)vmx_msr_bitmap_legacy_x2apic);
+ free_page((unsigned long)vmx_vmwrite_bitmap);
out2:
- free_page((unsigned long)vmx_msr_bitmap_legacy);
+ free_page((unsigned long)vmx_vmread_bitmap);
out1:
free_page((unsigned long)vmx_io_bitmap_b);
out:
@@ -6632,12 +6553,6 @@ static __init int hardware_setup(void)

static __exit void hardware_unsetup(void)
{
- free_page((unsigned long)vmx_msr_bitmap_legacy_x2apic);
- free_page((unsigned long)vmx_msr_bitmap_legacy_x2apic_apicv_inactive);
- free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic);
- free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic_apicv_inactive);
- free_page((unsigned long)vmx_msr_bitmap_legacy);
- free_page((unsigned long)vmx_msr_bitmap_longmode);
free_page((unsigned long)vmx_io_bitmap_b);
free_page((unsigned long)vmx_io_bitmap_a);
free_page((unsigned long)vmx_vmwrite_bitmap);
@@ -7002,13 +6917,6 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
if (r < 0)
goto out_vmcs02;

- if (cpu_has_vmx_msr_bitmap()) {
- vmx->nested.msr_bitmap =
- (unsigned long *)__get_free_page(GFP_KERNEL);
- if (!vmx->nested.msr_bitmap)
- goto out_msr_bitmap;
- }
-
vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
if (!vmx->nested.cached_vmcs12)
goto out_cached_vmcs12;
@@ -7038,9 +6946,6 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
kfree(vmx->nested.cached_vmcs12);

out_cached_vmcs12:
- free_page((unsigned long)vmx->nested.msr_bitmap);
-
-out_msr_bitmap:
free_loaded_vmcs(&vmx->nested.vmcs02);

out_vmcs02:
@@ -7119,10 +7024,6 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->nested.vmxon = false;
free_vpid(vmx->nested.vpid02);
nested_release_vmcs12(vmx);
- if (vmx->nested.msr_bitmap) {
- free_page((unsigned long)vmx->nested.msr_bitmap);
- vmx->nested.msr_bitmap = NULL;
- }
if (enable_shadow_vmcs) {
vmcs_clear(vmx->vmcs01.shadow_vmcs);
free_vmcs(vmx->vmcs01.shadow_vmcs);
@@ -8469,7 +8370,7 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
}
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, sec_exec_control);

- vmx_set_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap(vcpu);
}

static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu, hpa_t hpa)
@@ -9089,6 +8990,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
int err;
struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
+ unsigned long *msr_bitmap;
int cpu;

if (!vmx)
@@ -9129,6 +9031,15 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (err < 0)
goto free_msrs;

+ msr_bitmap = vmx->vmcs01.msr_bitmap;
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
+ vmx_disable_intercept_for_msr(msr_bitmap, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
+ vmx->msr_bitmap_mode = 0;
+
vmx->loaded_vmcs = &vmx->vmcs01;
cpu = get_cpu();
vmx_vcpu_load(&vmx->vcpu, cpu);
@@ -9523,7 +9434,7 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
int msr;
struct page *page;
unsigned long *msr_bitmap_l1;
- unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.msr_bitmap;
+ unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;

/* This shortcut is ok because we support only x2APIC MSRs so far. */
if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
@@ -10043,6 +9954,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
if (kvm_has_tsc_control)
decache_tsc_multiplier(vmx);

+ if (cpu_has_vmx_msr_bitmap())
+ vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap));
+
if (enable_vpid) {
/*
* There is no direct mapping between vpid02 and vpid12, the
@@ -10747,7 +10661,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
vmcs_write64(GUEST_IA32_DEBUGCTL, 0);

if (cpu_has_vmx_msr_bitmap())
- vmx_set_msr_bitmap(vcpu);
+ vmx_update_msr_bitmap(vcpu);

if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr,
vmcs12->vm_exit_msr_load_count))
--
2.7.4


2018-02-06 17:34:29

by Woodhouse, David

Subject: [PATCH 1/9] KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail

From: David Hildenbrand <[email protected]>

vmx_complete_nested_posted_interrupt() can't fail, so let's turn it into
a void function.

Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

(cherry picked from commit 6342c50ad12e8ce0736e722184a7dbdea4a3477f)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/vmx.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index feadff3..fd890af 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4736,7 +4736,7 @@ static bool vmx_get_enable_apicv(void)
return enable_apicv;
}

-static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -4747,13 +4747,13 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
vmx->nested.pi_pending) {
vmx->nested.pi_pending = false;
if (!pi_test_and_clear_on(vmx->nested.pi_desc))
- return 0;
+ return;

max_irr = find_last_bit(
(unsigned long *)vmx->nested.pi_desc->pir, 256);

if (max_irr == 256)
- return 0;
+ return;

vapic_page = kmap(vmx->nested.virtual_apic_page);
if (!vapic_page) {
@@ -4770,7 +4770,6 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
vmcs_write16(GUEST_INTR_STATUS, status);
}
}
- return 0;
}

static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
@@ -10491,7 +10490,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
return 0;
}

- return vmx_complete_nested_posted_interrupt(vcpu);
+ vmx_complete_nested_posted_interrupt(vcpu);
+ return 0;
}

static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu)
--
2.7.4


2018-02-06 17:35:47

by Woodhouse, David

Subject: [PATCH 6/9] KVM/x86: Add IBPB support

From: Ashok Raj <[email protected]>

The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.

Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.

IBPB helps mitigate against three potential attacks:

* Mitigate guests from being attacked by other guests.
- This is addressed by issuing an IBPB when we do a guest switch.

* Mitigate attacks from guest/ring3->host/ring3.
These would require an IBPB during context switch in the host, or after
VMEXIT. The host process has two ways to mitigate:
- Either it can be compiled with retpoline
- If it's going through a context switch and has set !dumpable, then
there is an IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return to Qemu might make
Qemu attackable from the guest when Qemu isn't compiled with retpoline.
Issues have been reported when doing an IBPB on every VMEXIT, resulting
in some TSC calibration woes in the guest.

* Mitigate guest/ring0->host/ring0 attacks.
When the host kernel is using retpoline it is safe against these attacks.
If the host kernel isn't using retpoline we might need to do an IBPB flush
on every VMEXIT.

Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.

* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146

Refer to the following link for documentation about the enumeration,
control and mitigations:

https://software.intel.com/en-us/side-channel-security-support
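
[Editorial note, not part of the patch: the guest-visible side reduces to
a write-only PRED_CMD handler -- refuse the write unless CPUID advertises
IBPB, accept only the IBPB command bit, and after the first real use drop
the intercept so later barriers go straight to hardware. A minimal sketch:]

static int sketch_set_pred_cmd(struct kvm_vcpu *vcpu, u64 data,
			       bool host_initiated)
{
	if (!host_initiated && !guest_cpuid_has_ibpb(vcpu))
		return 1;			/* not advertised: fault   */

	if (data & ~PRED_CMD_IBPB)
		return 1;			/* only bit 0 is defined   */

	if (!data)
		return 0;			/* writing 0 is a no-op    */

	wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);	/* issue the barrier */

	/* then disable the write intercept for this (write-only) MSR     */
	return 0;
}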

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]

Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: Asit Mallick <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Arjan Van De Ven <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Jun Nakajima <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]

(cherry picked from commit 15d45071523d89b3fb7372e2135fbd72f6af9506)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/cpuid.c | 11 +++++++-
arch/x86/kvm/cpuid.h | 12 ++++++++
arch/x86/kvm/svm.c | 28 +++++++++++++++++++
arch/x86/kvm/vmx.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 127 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 91af75e..6f24483 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -355,6 +355,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);

+ /* cpuid 0x80000008.ebx */
+ const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+ F(IBPB);
+
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -607,7 +611,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
if (!g_phys_as)
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
- entry->ebx = entry->edx = 0;
+ entry->edx = 0;
+ /* IBPB isn't necessarily present in hardware cpuid */
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ entry->ebx |= F(IBPB);
+ entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+ cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
}
case 0x80000019:
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 9368fec..ec4f9dc 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -160,6 +160,18 @@ static inline bool guest_cpuid_has_rdtscp(struct kvm_vcpu *vcpu)
return best && (best->edx & bit(X86_FEATURE_RDTSCP));
}

+static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
+ if (best && (best->ebx & bit(X86_FEATURE_IBPB)))
+ return true;
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
+}
+
+
/*
* NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
*/
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 24af898..6de8d65 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -248,6 +248,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_CSTAR, .always = true },
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
+ { .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
@@ -510,6 +511,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;

struct page *save_area;
+ struct vmcb *current_vmcb;
};

static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -1644,11 +1646,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
+ /*
+ * The vmcb page can be recycled, causing a false negative in
+ * svm_vcpu_load(). So do a full IBPB now.
+ */
+ indirect_branch_prediction_barrier();
}

static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;

if (unlikely(cpu != vcpu->cpu)) {
@@ -1677,6 +1685,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

+ if (sd->current_vmcb != svm->vmcb) {
+ sd->current_vmcb = svm->vmcb;
+ indirect_branch_prediction_barrier();
+ }
avic_vcpu_load(vcpu, cpu);
}

@@ -3599,6 +3611,22 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has_ibpb(vcpu))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+ if (is_guest_mode(vcpu))
+ break;
+ set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+ break;
case MSR_STAR:
svm->vmcb->save.star = data;
break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8570a17..dd6c831 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -550,6 +550,7 @@ struct vcpu_vmx {
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
#endif
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
/*
@@ -911,6 +912,8 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx);
static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
static int alloc_identity_pagetable(struct kvm *kvm);
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type);

static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -1846,6 +1849,29 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
vmcs_write32(EXCEPTION_BITMAP, eb);
}

+/*
+ * Check if MSR is intercepted for L01 MSR bitmap.
+ */
+static bool msr_write_intercepted_l01(struct kvm_vcpu *vcpu, u32 msr)
+{
+ unsigned long *msr_bitmap;
+ int f = sizeof(unsigned long);
+
+ if (!cpu_has_vmx_msr_bitmap())
+ return true;
+
+ msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+
+ if (msr <= 0x1fff) {
+ return !!test_bit(msr, msr_bitmap + 0x800 / f);
+ } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+ msr &= 0x1fff;
+ return !!test_bit(msr, msr_bitmap + 0xc00 / f);
+ }
+
+ return true;
+}
+
static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
unsigned long entry, unsigned long exit)
{
@@ -2255,6 +2281,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
+ indirect_branch_prediction_barrier();
}

if (!already_loaded) {
@@ -3056,6 +3083,33 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibpb(vcpu))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_vmx_merge_msr_bitmap. We should not touch the
+ * vmcs02.msr_bitmap here since it gets completely overwritten
+ * in the merging.
+ */
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -9435,9 +9489,23 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
struct page *page;
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
+ /*
+ * pred_cmd is trying to verify two things:
+ *
+ * 1. L0 gave a permission to L1 to actually passthrough the MSR. This
+ * ensures that we do not accidentally generate an L02 MSR bitmap
+ * from the L12 MSR bitmap that is too permissive.
+ * 2. That L1 or L2s have actually used the MSR. This avoids
+ * unnecessarily merging of the bitmap if the MSR is unused. This
+ * works properly because we only update the L01 MSR bitmap lazily.
+ * So even if L0 should pass L1 these MSRs, the L01 bitmap is only
+ * updated to reflect this when L1 (or its L2s) actually write to
+ * the MSR.
+ */
+ bool pred_cmd = msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);

- /* This shortcut is ok because we support only x2APIC MSRs so far. */
- if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+ if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
+ !pred_cmd)
return false;

page = nested_get_page(vcpu, vmcs12->msr_bitmap);
@@ -9475,6 +9543,13 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
MSR_TYPE_W);
}
}
+
+ if (pred_cmd)
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+
kunmap(page);
nested_release_page_clean(page);

--
2.7.4


2018-02-06 17:36:05

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH 2/9] KVM: nVMX: mark vmcs12 pages dirty on L2 exit

From: David Matlack <[email protected]>

The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.

Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.

Signed-off-by: David Matlack <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>

(cherry picked from commit c9f04407f2e0b3fc9ff7913c65fcfcb0a4b61570)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/vmx.c | 53 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fd890af..9408ae8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4736,6 +4736,28 @@ static bool vmx_get_enable_apicv(void)
return enable_apicv;
}

+static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu)
+{
+ struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+ gfn_t gfn;
+
+ /*
+ * Don't need to mark the APIC access page dirty; it is never
+ * written to by the CPU during APIC virtualization.
+ */
+
+ if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
+ gfn = vmcs12->virtual_apic_page_addr >> PAGE_SHIFT;
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ }
+
+ if (nested_cpu_has_posted_intr(vmcs12)) {
+ gfn = vmcs12->posted_intr_desc_addr >> PAGE_SHIFT;
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ }
+}
+
+
static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4743,18 +4765,15 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
void *vapic_page;
u16 status;

- if (vmx->nested.pi_desc &&
- vmx->nested.pi_pending) {
- vmx->nested.pi_pending = false;
- if (!pi_test_and_clear_on(vmx->nested.pi_desc))
- return;
-
- max_irr = find_last_bit(
- (unsigned long *)vmx->nested.pi_desc->pir, 256);
+ if (!vmx->nested.pi_desc || !vmx->nested.pi_pending)
+ return;

- if (max_irr == 256)
- return;
+ vmx->nested.pi_pending = false;
+ if (!pi_test_and_clear_on(vmx->nested.pi_desc))
+ return;

+ max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
+ if (max_irr != 256) {
vapic_page = kmap(vmx->nested.virtual_apic_page);
if (!vapic_page) {
WARN_ON(1);
@@ -4770,6 +4789,8 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
vmcs_write16(GUEST_INTR_STATUS, status);
}
}
+
+ nested_mark_vmcs12_pages_dirty(vcpu);
}

static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
@@ -8026,6 +8047,18 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
KVM_ISA_VMX);

+ /*
+ * The host physical addresses of some pages of guest memory
+ * are loaded into VMCS02 (e.g. L1's Virtual APIC Page). The CPU
+ * may write to these pages via their host physical address while
+ * L2 is running, bypassing any address-translation-based dirty
+ * tracking (e.g. EPT write protection).
+ *
+ * Mark them dirty on every exit from L2 to prevent them from
+ * getting out of sync with dirty tracking.
+ */
+ nested_mark_vmcs12_pages_dirty(vcpu);
+
if (vmx->nested.nested_run_pending)
return false;

--
2.7.4


2018-02-06 17:36:13

by Woodhouse, David

[permalink] [raw]
Subject: [PATCH 4/9] KVM: VMX: introduce alloc_loaded_vmcs

From: Paolo Bonzini <[email protected]>

Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.

Cc: [email protected] # prereq for Spectre mitigation
Signed-off-by: Paolo Bonzini <[email protected]>
(cherry picked from commit f21f165ef922c2146cc5bdc620f542953c41714b)
Signed-off-by: David Woodhouse <[email protected]>
---
arch/x86/kvm/vmx.c | 38 +++++++++++++++++++++++---------------
1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 55b1474..8c562da 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3522,11 +3522,6 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
return vmcs;
}

-static struct vmcs *alloc_vmcs(void)
-{
- return alloc_vmcs_cpu(raw_smp_processor_id());
-}
-
static void free_vmcs(struct vmcs *vmcs)
{
free_pages((unsigned long)vmcs, vmcs_config.order);
@@ -3545,6 +3540,22 @@ static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
WARN_ON(loaded_vmcs->shadow_vmcs != NULL);
}

+static struct vmcs *alloc_vmcs(void)
+{
+ return alloc_vmcs_cpu(raw_smp_processor_id());
+}
+
+static int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
+{
+ loaded_vmcs->vmcs = alloc_vmcs();
+ if (!loaded_vmcs->vmcs)
+ return -ENOMEM;
+
+ loaded_vmcs->shadow_vmcs = NULL;
+ loaded_vmcs_init(loaded_vmcs);
+ return 0;
+}
+
static void free_kvm_area(void)
{
int cpu;
@@ -6947,6 +6958,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
struct vmcs *shadow_vmcs;
const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
| FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
+ int r;

/* The Intel VMX Instruction Reference lists a bunch of bits that
* are prerequisite to running VMXON, most notably cr4.VMXE must be
@@ -6986,11 +6998,9 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
return 1;
}

- vmx->nested.vmcs02.vmcs = alloc_vmcs();
- vmx->nested.vmcs02.shadow_vmcs = NULL;
- if (!vmx->nested.vmcs02.vmcs)
+ r = alloc_loaded_vmcs(&vmx->nested.vmcs02);
+ if (r < 0)
goto out_vmcs02;
- loaded_vmcs_init(&vmx->nested.vmcs02);

if (cpu_has_vmx_msr_bitmap()) {
vmx->nested.msr_bitmap =
@@ -9111,17 +9121,15 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (!vmx->guest_msrs)
goto free_pml;

- vmx->loaded_vmcs = &vmx->vmcs01;
- vmx->loaded_vmcs->vmcs = alloc_vmcs();
- vmx->loaded_vmcs->shadow_vmcs = NULL;
- if (!vmx->loaded_vmcs->vmcs)
- goto free_msrs;
if (!vmm_exclusive)
kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));
- loaded_vmcs_init(vmx->loaded_vmcs);
+ err = alloc_loaded_vmcs(&vmx->vmcs01);
if (!vmm_exclusive)
kvm_cpu_vmxoff();
+ if (err < 0)
+ goto free_msrs;

+ vmx->loaded_vmcs = &vmx->vmcs01;
cpu = get_cpu();
vmx_vcpu_load(&vmx->vcpu, cpu);
vmx->vcpu.cpu = cpu;
--
2.7.4


2018-02-06 18:03:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On 06/02/2018 18:29, David Woodhouse wrote:
> I've put together a linux-4.9.y branch at
> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
>
> Most of it is fairly straightforward, apart from the IBPB on context
> switch for which Tim has already posted a candidate. I wanted some more
> review on my backports of the KVM bits though, including some extra
> historical patches I pulled in.

Looks good! Thanks for the work,

Paolo

> Ashok Raj (1):
> KVM/x86: Add IBPB support
>
> David Hildenbrand (1):
> KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail
>
> David Matlack (1):
> KVM: nVMX: mark vmcs12 pages dirty on L2 exit
>
> Jim Mattson (1):
> KVM: nVMX: Eliminate vmcs02 pool
>
> KarimAllah Ahmed (3):
> KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
> KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
> KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
>
> Paolo Bonzini (2):
> KVM: VMX: introduce alloc_loaded_vmcs
> KVM: VMX: make MSR bitmaps per-VCPU
>
> arch/x86/kvm/cpuid.c | 21 +-
> arch/x86/kvm/cpuid.h | 31 +++
> arch/x86/kvm/svm.c | 116 ++++++++
> arch/x86/kvm/vmx.c | 730 +++++++++++++++++++++++++++------------------------
> arch/x86/kvm/x86.c | 1 +
> 5 files changed, 554 insertions(+), 345 deletions(-)
>


2018-02-06 21:18:48

by Woodhouse, David

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support



On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
> On 06/02/2018 18:29, David Woodhouse wrote:
> > I've put together a linux-4.9.y branch at 
> > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > 
> > Most of it is fairly straightforward, apart from the IBPB on context 
> > switch for which Tim has already posted a candidate. I wanted some more
> > review on my backports of the KVM bits though, including some extra
> > historical patches I pulled in.
>
> Looks good!  Thanks for the work,
>
> Paolo

Thanks. In that case, Greg, the full set is lined up in
http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
or git://git.infradead.org/retpoline-stable linux-4.9.y

This matches what Linus just pulled in from tip/x86-pti-for-linus,
except that it's missing the IBPB on context switch (qv).

----------------------------------------------------------------
Andi Kleen (1):
      module/retpoline: Warn about missing retpoline in module

Andy Lutomirski (3):
      x86/entry/64: Remove the SYSCALL64 fast path
      x86/entry/64: Push extra regs right away
      x86/asm: Move 'status' from thread_struct to thread_info

Arnd Bergmann (1):
      x86/pti: Mark constant arrays as __initconst

Ashok Raj (1):
      KVM/x86: Add IBPB support

Borislav Petkov (4):
      x86/alternative: Print unadorned pointers
      x86/nospec: Fix header guards names
      x86/bugs: Drop one "mitigation" from dmesg
      x86/retpoline: Simplify vmexit_fill_RSB()

Colin Ian King (1):
      x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"

Dan Williams (12):
      array_index_nospec: Sanitize speculative array de-references
      x86: Implement array_index_mask_nospec
      x86: Introduce barrier_nospec
      x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec
      x86/usercopy: Replace open coded stac/clac with __uaccess_{begin, end}
      x86/uaccess: Use __uaccess_begin_nospec() and uaccess_try_nospec
      x86/get_user: Use pointer masking to limit speculation
      x86/syscall: Sanitize syscall table de-references under speculation
      vfs, fdtable: Prevent bounds-check bypass via speculative execution
      nl80211: Sanitize array index in parse_txq_params
      x86/spectre: Report get_user mitigation for spectre_v1
      x86/kvm: Update spectre-v1 mitigation

Darren Kenny (1):
      x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL

David Hildenbrand (1):
      KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail

David Matlack (1):
      KVM: nVMX: mark vmcs12 pages dirty on L2 exit

David Woodhouse (10):
      x86/cpufeatures: Add CPUID_7_EDX CPUID leaf
      x86/cpufeatures: Add Intel feature bits for Speculation Control
      x86/cpufeatures: Add AMD feature bits for Speculation Control
      x86/msr: Add definitions for new speculation control MSRs
      x86/pti: Do not enable PTI on CPUs which are not vulnerable to Meltdown
      x86/cpufeature: Blacklist SPEC_CTRL/PRED_CMD on early Spectre v2 microcodes
      x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
      x86/cpufeatures: Clean up Spectre v2 related CPUID flags
      x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
      x86/retpoline: Avoid retpolines for built-in __init functions

Dou Liyang (1):
      x86/spectre: Check CONFIG_RETPOLINE in command line parser

Jim Mattson (1):
      KVM: nVMX: Eliminate vmcs02 pool

Josh Poimboeuf (1):
      x86/paravirt: Remove 'noreplace-paravirt' cmdline option

KarimAllah Ahmed (4):
      x86/spectre: Simplify spectre_v2 command line parsing
      KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
      KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
      KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL

Mark Rutland (1):
      Documentation: Document array_index_nospec

Paolo Bonzini (2):
      KVM: VMX: introduce alloc_loaded_vmcs
      KVM: VMX: make MSR bitmaps per-VCPU

Peter Zijlstra (2):
      KVM: x86: Make indirect calls in emulator speculation safe
      KVM: VMX: Make indirect call speculation safe

Thomas Gleixner (1):
      x86/cpu/bugs: Make retpoline module warning conditional

Waiman Long (1):
      x86/retpoline: Remove the esp/rsp thunk

 Documentation/kernel-parameters.txt      |   2 -
 Documentation/speculation.txt            |  90 +++++++++++++
 arch/x86/entry/common.c                  |   9 +-
 arch/x86/entry/entry_32.S                |   3 +-
 arch/x86/entry/entry_64.S                | 134 ++----------------
 arch/x86/entry/syscall_64.c              |   7 +-
 arch/x86/include/asm/asm-prototypes.h    |   4 +-
 arch/x86/include/asm/barrier.h           |  28 ++++
 arch/x86/include/asm/cpufeature.h        |   7 +-
 arch/x86/include/asm/cpufeatures.h       |  22 ++-
 arch/x86/include/asm/disabled-features.h |   3 +-
 arch/x86/include/asm/intel-family.h      |   7 +-
 arch/x86/include/asm/msr-index.h         |  12 ++
 arch/x86/include/asm/msr.h               |   3 +-
 arch/x86/include/asm/nospec-branch.h     |  91 ++++---------
 arch/x86/include/asm/processor.h         |   2 -
 arch/x86/include/asm/required-features.h |   3 +-
 arch/x86/include/asm/syscall.h           |   6 +-
 arch/x86/include/asm/thread_info.h       |   3 +-
 arch/x86/include/asm/uaccess.h           |  15 ++-
 arch/x86/include/asm/uaccess_32.h        |  12 +-
 arch/x86/include/asm/uaccess_64.h        |  12 +-
 arch/x86/kernel/alternative.c            |  28 +---
 arch/x86/kernel/cpu/bugs.c               | 128 +++++++++++++-----
 arch/x86/kernel/cpu/common.c             |  70 +++++++++-
 arch/x86/kernel/cpu/intel.c              |  66 +++++++++
 arch/x86/kernel/cpu/scattered.c          |   2 -
 arch/x86/kernel/process_64.c             |   4 +-
 arch/x86/kernel/ptrace.c                 |   2 +-
 arch/x86/kernel/signal.c                 |   2 +-
 arch/x86/kvm/cpuid.c                     |  21 ++-
 arch/x86/kvm/cpuid.h                     |  31 +++++
 arch/x86/kvm/emulate.c                   |  10 +-
 arch/x86/kvm/svm.c                       | 116 ++++++++++++++++
 arch/x86/kvm/vmx.c                       | 754 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
 arch/x86/kvm/x86.c                       |   1 +
 arch/x86/lib/Makefile                    |   1 +
 arch/x86/lib/getuser.S                   |  10 ++
 arch/x86/lib/retpoline.S                 |  57 +++++++-
 arch/x86/lib/usercopy_32.c               |   8 +-
 include/linux/fdtable.h                  |   5 +-
 include/linux/init.h                     |   9 +-
 include/linux/module.h                   |   9 ++
 include/linux/nospec.h                   |  72 ++++++++++
 kernel/module.c                          |  11 ++
 net/wireless/nl80211.c                   |   9 +-
 scripts/mod/modpost.c                    |   9 ++
 47 files changed, 1230 insertions(+), 680 deletions(-)
 create mode 100644 Documentation/speculation.txt
 create mode 100644 include/linux/nospec.h



2018-02-08 02:51:56

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Tue, Feb 06, 2018 at 09:05:46PM +0000, Woodhouse, David wrote:
>
>
> On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
> > On 06/02/2018 18:29, David Woodhouse wrote:
> > > I've put together a linux-4.9.y branch at
> > > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > >
> > > Most of it is fairly straightforward, apart from the IBPB on context
> > > switch for which Tim has already posted a candidate. I wanted some more
> > > review on my backports of the KVM bits though, including some extra
> > > historical patches I pulled in.
> >
> > Looks good!  Thanks for the work,
> >
> > Paolo
>
> Thanks. In that case, Greg, the full set is lined up in
> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> or git://git.infradead.org/retpoline-stable linux-4.9.y

Many thanks for all of this work. I've now queued up all of these.

greg k-h

2018-02-08 17:28:46

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, Feb 08, 2018 at 03:49:59AM +0100, Greg KH wrote:
> On Tue, Feb 06, 2018 at 09:05:46PM +0000, Woodhouse, David wrote:
> >
> >
> > On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
> > > On 06/02/2018 18:29, David Woodhouse wrote:
> > > > I've put together a linux-4.9.y branch at
> > > > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > > >
> > > > Most of it is fairly straightforward, apart from the IBPB on context
> > > > switch for which Tim has already posted a candidate. I wanted some more
> > > > review on my backports of the KVM bits though, including some extra
> > > > historical patches I pulled in.
> > >
> > > Looks good!  Thanks for the work,
> > >
> > > Paolo
> >
> > Thanks. In that case, Greg, the full set is lined up in
> > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > or git://git.infradead.org/retpoline-stable linux-4.9.y
>
> Many thanks for all of this work. I've now queued up all of these.

There's a problem with the backport of 6342c50ad12e ("KVM: nVMX:
vmx_complete_nested_posted_interrupt() can't fail") as there is still a
check in the function that can fail:

	vapic_page = kmap(vmx->nested.virtual_apic_page);
	if (!vapic_page) {
		WARN_ON(1);
		return -ENOMEM;
	}

Do we need something else before this patch in order to fix this? I
guess kmap really can't fail, should I just drop the whole (!vapic_page)
check?

thanks,

greg k-h

2018-02-08 17:44:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On 08/02/2018 18:14, Greg KH wrote:
> On Thu, Feb 08, 2018 at 03:49:59AM +0100, Greg KH wrote:
>> On Tue, Feb 06, 2018 at 09:05:46PM +0000, Woodhouse, David wrote:
>>>
>>>
>>> On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
>>>> On 06/02/2018 18:29, David Woodhouse wrote:
>>>>> I've put together a linux-4.9.y branch at 
>>>>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
>>>>>  
>>>>> Most of it is fairly straightforward, apart from the IBPB on context 
>>>>> switch for which Tim has already posted a candidate. I wanted some more
>>>>> review on my backports of the KVM bits though, including some extra
>>>>> historical patches I pulled in.
>>>>
>>>> Looks good!  Thanks for the work,
>>>>
>>>> Paolo
>>>
>>> Thanks. In that case, Greg, the full set is lined up in
>>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
>>> or git://git.infradead.org/retpoline-stable linux-4.9.y
>>
>> Many thanks for all of this work. I've now queued up all of these.
>
> There's a problem with the backport of 6342c50ad12e ("KVM: nVMX:
> vmx_complete_nested_posted_interrupt() can't fail") as there is still a
> check in the function that can fail:
>
> 	vapic_page = kmap(vmx->nested.virtual_apic_page);
> 	if (!vapic_page) {
> 		WARN_ON(1);
> 		return -ENOMEM;
> 	}
>
> Do we need something else before this patch in order to fix this? I
> guess kmap really can't fail, should I just drop the whole (!vapic_page)
> check?

Yes, that would be commit 42cf014d38d8822cce63703a467e00f65d000952.
Should David or I respin?
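
(For reference, a rough sketch of the shape that commit gives the code;
illustration only, not the exact backported hunk: the kmap() result is used
directly and the failure branch is deleted, since kmap() of a valid struct
page always returns a mapping.)

	vapic_page = kmap(vmx->nested.virtual_apic_page);
	/* no NULL check, WARN_ON(1) or -ENOMEM return needed here */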

Thanks,

Paolo

2018-02-08 17:58:48

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, Feb 08, 2018 at 06:42:03PM +0100, Paolo Bonzini wrote:
> On 08/02/2018 18:14, Greg KH wrote:
> > On Thu, Feb 08, 2018 at 03:49:59AM +0100, Greg KH wrote:
> >> On Tue, Feb 06, 2018 at 09:05:46PM +0000, Woodhouse, David wrote:
> >>>
> >>>
> >>> On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
> >>>> On 06/02/2018 18:29, David Woodhouse wrote:
> >>>>> I've put together a linux-4.9.y branch at
> >>>>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> >>>>>
> >>>>> Most of it is fairly straightforward, apart from the IBPB on context
> >>>>> switch for which Tim has already posted a candidate. I wanted some more
> >>>>> review on my backports of the KVM bits though, including some extra
> >>>>> historical patches I pulled in.
> >>>>
> >>>> Looks good!  Thanks for the work,
> >>>>
> >>>> Paolo
> >>>
> >>> Thanks. In that case, Greg, the full set is lined up in
> >>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> >>> or git://git.infradead.org/retpoline-stable linux-4.9.y
> >>
> >> Many thanks for all of this work. I've now queued up all of these.
> >
> > There's a problem with the backport of 6342c50ad12e ("KVM: nVMX:
> > vmx_complete_nested_posted_interrupt() can't fail") as there is still a
> > check in the function that can fail:
> >
> > 	vapic_page = kmap(vmx->nested.virtual_apic_page);
> > 	if (!vapic_page) {
> > 		WARN_ON(1);
> > 		return -ENOMEM;
> > 	}
> >
> > Do we need something else before this patch in order to fix this? I
> > guess kmap really can't fail, should I just drop the whole (!vapic_page)
> > check?
>
> Yes, that would be commit 42cf014d38d8822cce63703a467e00f65d000952.
> Should David or I respin?

No need, I can sneak it into the middle of the series :) I'll do it
later tonight and let you know if I have any problems, thanks for
pointing out the needed commit.

greg k-h

2018-02-09 08:00:28

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, Feb 08, 2018 at 06:57:38PM +0100, Greg KH wrote:
> On Thu, Feb 08, 2018 at 06:42:03PM +0100, Paolo Bonzini wrote:
> > On 08/02/2018 18:14, Greg KH wrote:
> > > On Thu, Feb 08, 2018 at 03:49:59AM +0100, Greg KH wrote:
> > >> On Tue, Feb 06, 2018 at 09:05:46PM +0000, Woodhouse, David wrote:
> > >>>
> > >>>
> > >>> On Tue, 2018-02-06 at 19:01 +0100, Paolo Bonzini wrote:
> > >>>> On 06/02/2018 18:29, David Woodhouse wrote:
> > >>>>> I've put together a linux-4.9.y branch at
> > >>>>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > >>>>>
> > >>>>> Most of it is fairly straightforward, apart from the IBPB on context
> > >>>>> switch for which Tim has already posted a candidate. I wanted some more
> > >>>>> review on my backports of the KVM bits though, including some extra
> > >>>>> historical patches I pulled in.
> > >>>>
> > >>>> Looks good!  Thanks for the work,
> > >>>>
> > >>>> Paolo
> > >>>
> > >>> Thanks. In that case, Greg, the full set is lined up in
> > >>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > >>> or git://git.infradead.org/retpoline-stable linux-4.9.y
> > >>
> > >> Many thanks for all of this work. I've now queued up all of these.
> > >
> > > There's a problem with the backport of 6342c50ad12e ("KVM: nVMX:
> > > vmx_complete_nested_posted_interrupt() can't fail") as there is still a
> > > check in the function that can fail:
> > >
> > > 	vapic_page = kmap(vmx->nested.virtual_apic_page);
> > > 	if (!vapic_page) {
> > > 		WARN_ON(1);
> > > 		return -ENOMEM;
> > > 	}
> > >
> > > Do we need something else before this patch in order to fix this? I
> > > guess kmap really can't fail, should I just drop the whole (!vapic_page)
> > > check?
> >
> > Yes, that would be commit 42cf014d38d8822cce63703a467e00f65d000952.
> > Should David or I respin?
>
> No need, I can sneak it into the middle of the series :) I'll do it
> later tonight and let you know if I have any problems, thanks for
> pointing out the needed commit.

Now queued up.

2018-02-15 10:24:31

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, Feb 15, 2018 at 11:15:41AM +0100, Thomas Voegtle wrote:
>
> Hello,
>
> On Tue, 6 Feb 2018, David Woodhouse wrote:
>
> > I've put together a linux-4.9.y branch at
> > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
>
>
> Are there any plans for a backport to v4.4?
>
> I would volunteer for testing ;)

Care to volunteer to do the backport? :)

Why do you need/want this type of thing for 4.4.y? What is keeping you
from moving to 4.9.y instead? Or even better yet, 4.14.y or 4.15, where
everything is fixed "correctly"?

thanks,

greg k-h

2018-02-15 10:26:42

by Thomas Voegtle

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support


Hello,

On Tue, 6 Feb 2018, David Woodhouse wrote:

> I've put together a linux-4.9.y branch at
> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y


Are there any plans for a backport to v4.4?

I would volunteer for testing ;)


Thomas


2018-02-15 10:50:53

by Thomas Voegtle

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, 15 Feb 2018, Greg KH wrote:

> On Thu, Feb 15, 2018 at 11:15:41AM +0100, Thomas Voegtle wrote:
>>
>> Hello,
>>
>> On Tue, 6 Feb 2018, David Woodhouse wrote:
>>
>>> I've put together a linux-4.9.y branch at
>>> http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
>>
>>
>> Are there any plans for a backport to v4.4?
>>
>> I would volunteer for testing ;)
>
> Care to volunteer to do the backport? :)
>
> Why do you need/want this type of thing for 4.4.y? What is keeping you
> from moving to 4.9.y instead? Or even better yet, 4.14.y or 4.15, where
> everything is fixed "correctly"?


Puhh, I would love to do the backport but I had a quick look at the patch
series and that is beyond my scope, I'm afraid. I'm a tester ;)

If there are no plans I know I have to move to a newer kernel, sadly.


Thomas


2018-02-15 13:59:19

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [STABLE 4.9.y PATCH 0/9] Backport of KVM Speculation Control support

On Thu, Feb 15, 2018 at 11:49:54AM +0100, Thomas Voegtle wrote:
> On Thu, 15 Feb 2018, Greg KH wrote:
>
> > On Thu, Feb 15, 2018 at 11:15:41AM +0100, Thomas Voegtle wrote:
> > >
> > > Hello,
> > >
> > > On Tue, 6 Feb 2018, David Woodhouse wrote:
> > >
> > > > I've put together a linux-4.9.y branch at
> > > > http://git.infradead.org/retpoline-stable.git/shortlog/refs/heads/linux-4.9.y
> > >
> > >
> > > Are there any plans for a backport to v4.4?
> > >
> > > I would volunteer for testing ;)
> >
> > Care to volunteer to do the backport? :)
> >
> > Why do you need/want this type of thing for 4.4.y? What is keeping you
> > from moving to 4.9.y instead? Or even better yet, 4.14.y or 4.15, where
> > everything is fixed "correctly"?
>
>
> Puhh, I would love to do the backport but I had a quick look at the patch
> series and that is beyond my scope, I'm afraid. I'm a tester ;)
>
> If there are no plans I know I have to move to a newer kernel, sadly.

You should always move to a newer kernel, what is holding you back from
doing that today? Please, move to 4.14.y if you want a "LTS" kernel to
survive you for the year, and if not, then use 4.15.y please.

Seriously, staying on old kernels should not happen, unless you are
paying someone for support of such a thing (i.e. you have specific
hardware support only for an old kernel version.)

thanks,

greg k-h

2018-02-16 18:45:32

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH 8/9] KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Tue, Feb 6, 2018 at 9:29 AM, David Woodhouse <[email protected]> wrote:

> @@ -8946,6 +9017,27 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> #endif
> );
>
> + /*
> + * We do not use IBRS in the kernel. If this vCPU has used the
> + * SPEC_CTRL MSR it may have left it on; save the value and
> + * turn it off. This is much more efficient than blindly adding
> + * it to the atomic save/restore list. Especially as the former
> + * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> + *
> + * For non-nested case:
> + * If the L01 MSR bitmap does not intercept the MSR, then we need to
> + * save it.
> + *
> + * For nested case:
> + * If the L02 MSR bitmap does not intercept the MSR, then we need to
> + * save it.
> + */
> + if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
> + rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +
> + if (vmx->spec_ctrl)
> + wrmsrl(MSR_IA32_SPEC_CTRL, 0);
> +

Again, we haven't verified host support for this MSR. Perhaps this
should be something like:

	if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)
	    && !rdmsrl_safe(MSR_IA32_SPEC_CTRL, &vmx->spec_ctrl) && vmx->spec_ctrl)
		wrmsrl(MSR_IA32_SPEC_CTRL, 0);

2018-02-16 18:45:59

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH 8/9] KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL

On Tue, Feb 6, 2018 at 9:29 AM, David Woodhouse <[email protected]> wrote:

> @@ -8828,6 +8890,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>
> vmx_arm_hv_timer(vcpu);
>
> + /*
> + * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> + * it's non-zero. Since vmentry is serialising on affected CPUs, there
> + * is no need to worry about the conditional branch over the wrmsr
> + * being speculatively taken.
> + */
> + if (vmx->spec_ctrl)
> + wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
> +

Shouldn't this be wrmsrl_safe? Userspace can make an ioctl to set
vmx->spec_ctrl to non-zero even if the MSR is not supported on the
host.
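
(A minimal sketch of the suggested shape, keeping the existing guard;
illustration only, not a final patch:)

	if (vmx->spec_ctrl)
		wrmsrl_safe(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);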

2018-02-16 19:12:02

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 06/02/2018 18:29, David Woodhouse wrote:
> From: KarimAllah Ahmed <[email protected]>
>
> Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
> (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
> contents will come directly from the hardware, but user-space can still
> override it.

Uhm, taking contents from the hardware is wrong (guess why---live
migration). I'll send a revert of those two lines.

Paolo

> [dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Reviewed-by: Paolo Bonzini <[email protected]>
> Reviewed-by: Darren Kenny <[email protected]>
> Reviewed-by: Jim Mattson <[email protected]>
> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Jun Nakajima <[email protected]>
> Cc: [email protected]
> Cc: Dave Hansen <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Asit Mallick <[email protected]>
> Cc: Arjan Van De Ven <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Ashok Raj <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
>
> (cherry picked from commit 28c1c9fabf48d6ad596273a11c46e0d0da3e14cd)
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 8 +++++++-
> arch/x86/kvm/cpuid.h | 8 ++++++++
> arch/x86/kvm/vmx.c | 15 +++++++++++++++
> arch/x86/kvm/x86.c | 1 +
> 4 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 6f24483..9c6493f 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -380,6 +380,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> /* cpuid 7.0.ecx*/
> const u32 kvm_cpuid_7_0_ecx_x86_features = F(PKU) | 0 /*OSPKE*/;
>
> + /* cpuid 7.0.edx*/
> + const u32 kvm_cpuid_7_0_edx_x86_features =
> + F(ARCH_CAPABILITIES);
> +
> /* all calls to cpuid_count() should be made on the same cpu */
> get_cpu();
>
> @@ -462,12 +466,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> /* PKU is not yet implemented for shadow paging. */
> if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
> entry->ecx &= ~F(PKU);
> + entry->edx &= kvm_cpuid_7_0_edx_x86_features;
> + cpuid_mask(&entry->edx, CPUID_7_EDX);
> } else {
> entry->ebx = 0;
> entry->ecx = 0;
> + entry->edx = 0;
> }
> entry->eax = 0;
> - entry->edx = 0;
> break;
> }
> case 9:
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index ec4f9dc..8719997 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -171,6 +171,14 @@ static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu)
> return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
> }
>
> +static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpuid_entry2 *best;
> +
> + best = kvm_find_cpuid_entry(vcpu, 7, 0);
> + return best && (best->edx & bit(X86_FEATURE_ARCH_CAPABILITIES));
> +}
> +
>
> /*
> * NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd6c831..92bf61f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -551,6 +551,8 @@ struct vcpu_vmx {
> u64 msr_guest_kernel_gs_base;
> #endif
>
> + u64 arch_capabilities;
> +
> u32 vm_entry_controls_shadow;
> u32 vm_exit_controls_shadow;
> /*
> @@ -2979,6 +2981,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_IA32_TSC:
> msr_info->data = guest_read_tsc(vcpu);
> break;
> + case MSR_IA32_ARCH_CAPABILITIES:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has_arch_capabilities(vcpu))
> + return 1;
> + msr_info->data = to_vmx(vcpu)->arch_capabilities;
> + break;
> case MSR_IA32_SYSENTER_CS:
> msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
> break;
> @@ -3110,6 +3118,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
> MSR_TYPE_W);
> break;
> + case MSR_IA32_ARCH_CAPABILITIES:
> + if (!msr_info->host_initiated)
> + return 1;
> + vmx->arch_capabilities = data;
> + break;
> case MSR_IA32_CR_PAT:
> if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
> if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -5200,6 +5213,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> ++vmx->nmsrs;
> }
>
> + if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
> + rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
>
> vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e023ef9..94d1573 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -975,6 +975,7 @@ static u32 msrs_to_save[] = {
> #endif
> MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
> + MSR_IA32_ARCH_CAPABILITIES
> };
>
> static unsigned num_msrs_to_save;
>


2018-02-16 19:18:26

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES



On Fri, 2018-02-16 at 08:29 -0800, Jim Mattson wrote:
> On Fri, Feb 16, 2018 at 6:18 AM, Paolo Bonzini <[email protected]> wrote:
>
> > Uhm, taking contents from the hardware is wrong (guess why---live
> > migration).  I'll send a revert of those two lines.
>
> Hardware seems like a reasonable place to get the default value (cf.
> the VMX capability MSRs). Should these two lines just be moved to
> vmx_create_cpu?

They're already in vmx_create_vcpu(). (Well, in vmx_vcpu_setup() which is
a static function called only once, from vmx_create_vcpu().)



2018-02-16 19:19:29

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On Fri, Feb 16, 2018 at 6:18 AM, Paolo Bonzini <[email protected]> wrote:

> Uhm, taking contents from the hardware is wrong (guess why---live
> migration). I'll send a revert of those two lines.

Hardware seems like a reasonable place to get the default value (cf.
the VMX capability MSRs). Should these two lines just be moved to
vmx_create_vcpu?

2018-02-19 13:11:34

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 16/02/2018 17:29, Jim Mattson wrote:
> On Fri, Feb 16, 2018 at 6:18 AM, Paolo Bonzini <[email protected]> wrote:
>
>> Uhm, taking contents from the hardware is wrong (guess why---live
>> migration). I'll send a revert of those two lines.
>
> Hardware seems like a reasonable place to get the default value (cf.
> the VMX capability MSRs).

There are some differences:

- a zero value for ARCH_CAPABILITIES should be safe, while a zero value
for VMX capabilities doesn't really make sense. On the contrary, a
nonzero value for ARCH_CAPABILITIES is not safe across live migration (sketch below).

- VMX doesn't support live migration; before adding that support we will
probably have Tom's patches to retrieve MSR capabilities.
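
To make the first point above concrete, a hypothetical user-space sketch (not
taken from any existing VMM; vcpu_fd and the MSR constant, 0x10a, are assumed
to be defined by the VMM) of pinning the value explicitly via KVM_SET_MSRS so
the in-kernel default never matters:

	/* Hypothetical VMM code; needs <stdlib.h>, <sys/ioctl.h>, <linux/kvm.h>. */
	struct kvm_msrs *m = calloc(1, sizeof(*m) + sizeof(struct kvm_msr_entry));

	m->nmsrs = 1;
	m->entries[0].index = MSR_IA32_ARCH_CAPABILITIES;	/* 0x10a */
	m->entries[0].data  = 0;	/* only advertise bits present on every host */
	ioctl(vcpu_fd, KVM_SET_MSRS, m);
	free(m);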

Thanks,

Paolo

> Should these two lines just be moved to
> vmx_create_vcpu?
>


2018-02-19 13:36:31

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES



On Mon, 2018-02-19 at 14:10 +0100, Paolo Bonzini wrote:
> > Hardware seems like a reasonable place to get the default value (cf.
> > the VMX capability MSRs).
>
> There are some differences:
>
> - a zero value for ARCH_CAPABILITIES should be safe, while a zero value
> for VMX capabilities doesn't really make sense.  On the contrary, a
> nonzero value for ARCH_CAPABILITIES is not safe across live migration.

Any VMM which is going to support live migration surely needs to pay at
least a small amount of attention to the features it exposes? Exposing
the ARCH_CAPABILITIES CPUID bit without actually looking at the
contents of the associated MSR which that bit advertises would be... a
little strange, would it not? 

I don't see why we care so much about the *default* value, in that
context.



2018-02-19 14:08:20

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 7/9] KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES

On 19/02/2018 14:35, David Woodhouse wrote:
> On Mon, 2018-02-19 at 14:10 +0100, Paolo Bonzini wrote:
>>> Hardware seems like a reasonable place to get the default value (cf.
>>> the VMX capability MSRs).
>>
>> There are some differences:
>>
>> - a zero value for ARCH_CAPABILITIES should be safe, while a zero value
>> for VMX capabilities doesn't really make sense.  On the contrary, a
>> nonzero value for ARCH_CAPABILITIES is not safe across live migration.
>
> Any VMM which is going to support live migration surely needs to pay at
> least a small amount of attention to the features it exposes? Exposing
> the ARCH_CAPABILITIES CPUID bit without actually looking at the
> contents of the associated MSR which that bit advertises would be... a
> little strange, would it not? 

I think what we should do is simply backport Tom Lendacky's series to
4.14 and 4.9 ASAP, and add ARCH_CAPABILITIES support there. Then the
question of the default becomes moot, more or less.

Paolo

> I don't see why we care so much about the *default* value, in that
> context.