While running SVM-related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
CPUs check EAX against reserved memory regions (e.g. SMM memory on the host)
before checking the VMCB's instruction intercept. If EAX falls into such
a memory area, #GP is triggered before #VMEXIT, causing an unexpected #GP
under nested virtualization. To solve this problem, this patchset makes
KVM trap #GP and emulate these SVM instructions accordingly.
Newer AMD CPUs change this behavior by triggering #VMEXIT before #GP.
The new behavior is indicated by CPUID_0x8000000A_EDX[28]; when that bit
is set, #GP interception is not required. This patchset supports the new
feature as well.
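As a rough user-space illustration (not part of the patchset), the CPUID_0x8000000A_EDX[28] indication described above is a plain bit test; the helper name below is ours:

```c
#include <stdint.h>

/* Illustrative helper (name is ours): given the EDX value returned by
 * CPUID Fn8000_000A, report whether bit 28 — the "#VMEXIT before #GP"
 * behavior described in the cover letter — is set. */
static inline int svme_addr_chk_supported(uint32_t edx)
{
	return (edx >> 28) & 1;
}
```

On real hardware the EDX value would come from executing CPUID with EAX=0x8000000A.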
This patchset has been verified with vmrun_errata_test and vmware_backdoors
tests of kvm_unit_test on the following configs:
* Current CPU: nested, nested on nested
* New CPU with X86_FEATURE_SVME_ADDR_CHK: nested, nested on nested
v1->v2:
* Factor out instruction decode for sharing
* Re-org gp_interception() handling for both #GP and vmware_backdoor
* Use kvm_cpu_cap for X86_FEATURE_SVME_ADDR_CHK feature support
* Add nested on nested support
Thanks,
-Wei
Wei Huang (4):
KVM: x86: Factor out x86 instruction emulation with decoding
KVM: SVM: Add emulation support for #GP triggered by SVM instructions
KVM: SVM: Add support for VMCB address check change
KVM: SVM: Support #GP handling for the case of nested on nested
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kvm/svm/svm.c | 120 ++++++++++++++++++++++++-----
arch/x86/kvm/x86.c | 63 +++++++++------
arch/x86/kvm/x86.h | 2 +
4 files changed, 145 insertions(+), 41 deletions(-)
--
2.27.0
Move the instruction decode part out of x86_emulate_instruction() so it
can be used in other places. Also move kvm_clear_exception_queue()
inside the if-statement, as it doesn't apply when KVM is coming back from
userspace.
Co-developed-by: Bandan Das <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
---
arch/x86/kvm/x86.c | 63 +++++++++++++++++++++++++++++-----------------
arch/x86/kvm/x86.h | 2 ++
2 files changed, 42 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a8969a6dd06..580883cee493 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7298,6 +7298,43 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
return false;
}
+/*
+ * Decode and emulate instruction. Return EMULATION_OK if success.
+ */
+int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int emulation_type,
+ void *insn, int insn_len)
+{
+ int r = EMULATION_OK;
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+
+ init_emulate_ctxt(vcpu);
+
+ /*
+ * We will reenter on the same instruction since
+ * we do not set complete_userspace_io. This does not
+ * handle watchpoints yet, those would be handled in
+ * the emulate_ops.
+ */
+ if (!(emulation_type & EMULTYPE_SKIP) &&
+ kvm_vcpu_check_breakpoint(vcpu, &r))
+ return r;
+
+ ctxt->interruptibility = 0;
+ ctxt->have_exception = false;
+ ctxt->exception.vector = -1;
+ ctxt->perm_ok = false;
+
+ ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
+
+ r = x86_decode_insn(ctxt, insn, insn_len);
+
+ trace_kvm_emulate_insn_start(vcpu);
+ ++vcpu->stat.insn_emulation;
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(x86_emulate_decoded_instruction);
+
int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len)
{
@@ -7317,32 +7354,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
*/
write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
vcpu->arch.write_fault_to_shadow_pgtable = false;
- kvm_clear_exception_queue(vcpu);
if (!(emulation_type & EMULTYPE_NO_DECODE)) {
- init_emulate_ctxt(vcpu);
-
- /*
- * We will reenter on the same instruction since
- * we do not set complete_userspace_io. This does not
- * handle watchpoints yet, those would be handled in
- * the emulate_ops.
- */
- if (!(emulation_type & EMULTYPE_SKIP) &&
- kvm_vcpu_check_breakpoint(vcpu, &r))
- return r;
-
- ctxt->interruptibility = 0;
- ctxt->have_exception = false;
- ctxt->exception.vector = -1;
- ctxt->perm_ok = false;
-
- ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
-
- r = x86_decode_insn(ctxt, insn, insn_len);
+ kvm_clear_exception_queue(vcpu);
- trace_kvm_emulate_insn_start(vcpu);
- ++vcpu->stat.insn_emulation;
+ r = x86_emulate_decoded_instruction(vcpu, emulation_type,
+ insn, insn_len);
if (r != EMULATION_OK) {
if ((emulation_type & EMULTYPE_TRAP_UD) ||
(emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c5ee0f5ce0f1..fc42454a4c27 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -273,6 +273,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
int page_num);
bool kvm_vector_hashing_enabled(void);
void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
+int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int emulation_type,
+ void *insn, int insn_len);
int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type, void *insn, int insn_len);
fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
--
2.27.0
From: Bandan Das <[email protected]>
While running SVM-related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
CPUs check EAX against reserved memory regions (e.g. SMM memory on the host)
before checking the VMCB's instruction intercept. If EAX falls into such
a memory area, #GP is triggered before #VMEXIT. This causes problems under
nested virtualization. To solve this problem, KVM needs to trap #GP and
check the instruction that triggered it. If the faulting instruction is
one of the SVM instructions, KVM emulates it.
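The opcode test this patch relies on can be sketched stand-alone: VMRUN, VMLOAD, and VMSAVE are all two-byte 0F 01 opcodes distinguished only by their ModRM byte, matching the checks in svm_instr_opcode() in the diff. A hedged user-space sketch of the same classification:

```c
#include <stdint.h>

enum {
	NOT_SVM_INSTR,
	SVM_INSTR_VMRUN,
	SVM_INSTR_VMLOAD,
	SVM_INSTR_VMSAVE,
};

/* Classify a decoded instruction the way svm_instr_opcode() does:
 * second opcode byte 0x01 with opcode_len == 2, then the ModRM byte
 * selects the SVM instruction (0xd8 VMRUN, 0xda VMLOAD, 0xdb VMSAVE). */
static int classify_svm_instr(uint8_t opcode, int opcode_len, uint8_t modrm)
{
	if (opcode != 0x01 || opcode_len != 2)
		return NOT_SVM_INSTR;

	switch (modrm) {
	case 0xd8:
		return SVM_INSTR_VMRUN;
	case 0xda:
		return SVM_INSTR_VMLOAD;
	case 0xdb:
		return SVM_INSTR_VMSAVE;
	default:
		return NOT_SVM_INSTR;
	}
}
```

In the kernel, the opcode byte, length, and ModRM come from the shared decoder (x86_emulate_ctxt) rather than being passed in directly.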
Co-developed-by: Wei Huang <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
---
arch/x86/kvm/svm/svm.c | 99 ++++++++++++++++++++++++++++++++++--------
1 file changed, 81 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7ef171790d02..6ed523cab068 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -288,6 +288,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
if (!(efer & EFER_SVME)) {
svm_leave_nested(svm);
svm_set_gif(svm, true);
+ /* #GP intercept is still needed in vmware_backdoor */
+ if (!enable_vmware_backdoor)
+ clr_exception_intercept(svm, GP_VECTOR);
/*
* Free the nested guest state, unless we are in SMM.
@@ -309,6 +312,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
svm->vmcb->save.efer = efer | EFER_SVME;
vmcb_mark_dirty(svm->vmcb, VMCB_CR);
+ /* Enable #GP interception for SVM instructions */
+ set_exception_intercept(svm, GP_VECTOR);
+
return 0;
}
@@ -1957,24 +1963,6 @@ static int ac_interception(struct vcpu_svm *svm)
return 1;
}
-static int gp_interception(struct vcpu_svm *svm)
-{
- struct kvm_vcpu *vcpu = &svm->vcpu;
- u32 error_code = svm->vmcb->control.exit_info_1;
-
- WARN_ON_ONCE(!enable_vmware_backdoor);
-
- /*
- * VMware backdoor emulation on #GP interception only handles IN{S},
- * OUT{S}, and RDPMC, none of which generate a non-zero error code.
- */
- if (error_code) {
- kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
- return 1;
- }
- return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
-}
-
static bool is_erratum_383(void)
{
int err, i;
@@ -2173,6 +2161,81 @@ static int vmrun_interception(struct vcpu_svm *svm)
return nested_svm_vmrun(svm);
}
+enum {
+ NOT_SVM_INSTR,
+ SVM_INSTR_VMRUN,
+ SVM_INSTR_VMLOAD,
+ SVM_INSTR_VMSAVE,
+};
+
+/* Return NOT_SVM_INSTR if not SVM instrs, otherwise return decode result */
+static int svm_instr_opcode(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+
+ if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
+ return NOT_SVM_INSTR;
+
+ switch (ctxt->modrm) {
+ case 0xd8: /* VMRUN */
+ return SVM_INSTR_VMRUN;
+ case 0xda: /* VMLOAD */
+ return SVM_INSTR_VMLOAD;
+ case 0xdb: /* VMSAVE */
+ return SVM_INSTR_VMSAVE;
+ default:
+ break;
+ }
+
+ return NOT_SVM_INSTR;
+}
+
+static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
+{
+ int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
+ [SVM_INSTR_VMRUN] = vmrun_interception,
+ [SVM_INSTR_VMLOAD] = vmload_interception,
+ [SVM_INSTR_VMSAVE] = vmsave_interception,
+ };
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return svm_instr_handlers[opcode](svm);
+}
+
+/*
+ * #GP handling code. Note that #GP can be triggered under the following two
+ * cases:
+ * 1) SVM VM-related instructions (VMRUN/VMSAVE/VMLOAD) that trigger #GP on
+ * some AMD CPUs when EAX of these instructions are in the reserved memory
+ * regions (e.g. SMM memory on host).
+ * 2) VMware backdoor
+ */
+static int gp_interception(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ u32 error_code = svm->vmcb->control.exit_info_1;
+ int opcode;
+
+ /* Both #GP cases have zero error_code */
+ if (error_code)
+ goto reinject;
+
+ /* Decode the instruction for usage later */
+ if (x86_emulate_decoded_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
+ goto reinject;
+
+ opcode = svm_instr_opcode(vcpu);
+ if (opcode)
+ return emulate_svm_instr(vcpu, opcode);
+ else
+ return kvm_emulate_instruction(vcpu,
+ EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
+
+reinject:
+ kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
+ return 1;
+}
+
void svm_set_gif(struct vcpu_svm *svm, bool value)
{
if (value) {
--
2.27.0
In the nested-on-nested case (e.g. L0->L1->L2->L3), a #GP triggered
by SVM instructions can be hidden from L1. Instead, the hypervisor can
inject the proper #VMEXIT to inform L1 of what is happening, so L1
can avoid invoking the #GP workaround. For this reason, we turn on the
guest VM's X86_FEATURE_SVME_ADDR_CHK bit so that KVM running inside the
VM receives the notification and changes its behavior.
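The dispatch decision can be sketched outside the kernel (exit-code values below are the SVM EXITCODE numbers from the AMD spec; the helper name is ours): when the vCPU is in guest mode, L0 synthesizes the corresponding #VMEXIT for L1 instead of emulating the instruction itself.

```c
/* SVM EXITCODE values (per the AMD APM / asm/svm.h). */
#define SVM_EXIT_VMRUN	0x080
#define SVM_EXIT_VMLOAD	0x07a
#define SVM_EXIT_VMSAVE	0x07b

enum {
	NOT_SVM_INSTR,
	SVM_INSTR_VMRUN,
	SVM_INSTR_VMLOAD,
	SVM_INSTR_VMSAVE,
};

/* Return the exit code L0 should inject into L1's VMCB, or -1 when the
 * vCPU is not in guest mode and the instruction is emulated directly. */
static int svm_instr_exit_code(int opcode, int in_guest_mode)
{
	static const int exit_codes[] = {
		[SVM_INSTR_VMRUN]  = SVM_EXIT_VMRUN,
		[SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
		[SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
	};

	return in_guest_mode ? exit_codes[opcode] : -1;
}
```

The patch additionally zeroes exit_info_1/exit_info_2 before calling nested_svm_vmexit(), since these exits carry no extra information.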
Co-developed-by: Bandan Das <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
---
arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2a12870ac71a..89512c0e7663 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
{
+ const int guest_mode_exit_codes[] = {
+ [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
+ [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
+ [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
+ };
int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
[SVM_INSTR_VMRUN] = vmrun_interception,
[SVM_INSTR_VMLOAD] = vmload_interception,
@@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
};
struct vcpu_svm *svm = to_svm(vcpu);
- return svm_instr_handlers[opcode](svm);
+ if (is_guest_mode(vcpu)) {
+ svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
+ svm->vmcb->control.exit_info_1 = 0;
+ svm->vmcb->control.exit_info_2 = 0;
+
+ return nested_svm_vmexit(svm);
+ } else
+ return svm_instr_handlers[opcode](svm);
}
/*
@@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
/* Check again if INVPCID interception if required */
svm_check_invpcid(svm);
+ if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
+ best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
+ best->edx |= (1 << 28);
+ }
+
/* For sev guests, the memory encryption bit is not reserved in CR3. */
if (sev_guest(vcpu->kvm)) {
best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
--
2.27.0
New AMD CPUs have a change that checks the #VMEXIT intercept on these
SVM instructions before checking their EAX against the reserved memory
region. This change is indicated by CPUID_0x8000000A_EDX[28]: if it is 1,
#VMEXIT is triggered before #GP. In this case KVM doesn't need to
intercept and emulate #GP faults, since no #GP is raised in the first
place.
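For reference, cpufeatures.h encodes each X86_FEATURE_* value as word*32 + bit; the X86_FEATURE_SVME_ADDR_CHK entry added by this patch lives in word 15 (CPUID Fn8000_000A EDX), bit 28. A trivial sketch of that encoding:

```c
/* cpufeatures.h encodes X86_FEATURE_* constants as (word * 32 + bit);
 * X86_FEATURE_SVME_ADDR_CHK is (15*32 + 28) in the hunk below. */
static int x86_feature_number(int word, int bit)
{
	return word * 32 + bit;
}
```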
Co-developed-by: Bandan Das <[email protected]>
Signed-off-by: Bandan Das <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kvm/svm/svm.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 84b887825f12..ea89d6fdd79a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -337,6 +337,7 @@
#define X86_FEATURE_AVIC (15*32+13) /* Virtual Interrupt Controller */
#define X86_FEATURE_V_VMSAVE_VMLOAD (15*32+15) /* Virtual VMSAVE VMLOAD */
#define X86_FEATURE_VGIF (15*32+16) /* Virtual GIF */
+#define X86_FEATURE_SVME_ADDR_CHK (15*32+28) /* "" SVME addr check */
/* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */
#define X86_FEATURE_AVX512VBMI (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 6ed523cab068..2a12870ac71a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -313,7 +313,8 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
svm->vmcb->save.efer = efer | EFER_SVME;
vmcb_mark_dirty(svm->vmcb, VMCB_CR);
/* Enable #GP interception for SVM instructions */
- set_exception_intercept(svm, GP_VECTOR);
+ if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
+ set_exception_intercept(svm, GP_VECTOR);
return 0;
}
@@ -933,6 +934,9 @@ static __init void svm_set_cpu_caps(void)
boot_cpu_has(X86_FEATURE_AMD_SSBD))
kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
+ if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
+ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
+
/* Enable INVPCID feature */
kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
}
--
2.27.0
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> Move the instruction decode part out of x86_emulate_instruction() for it
> to be used in other places. Also kvm_clear_exception_queue() is moved
> inside the if-statement as it doesn't apply when KVM are coming back from
> userspace.
>
> Co-developed-by: Bandan Das <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> ---
> arch/x86/kvm/x86.c | 63 +++++++++++++++++++++++++++++-----------------
> arch/x86/kvm/x86.h | 2 ++
> 2 files changed, 42 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a8969a6dd06..580883cee493 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7298,6 +7298,43 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
> return false;
> }
>
> +/*
> + * Decode and emulate instruction. Return EMULATION_OK if success.
> + */
> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int emulation_type,
> + void *insn, int insn_len)
Isn't the name of this function wrong? This function decodes the instruction.
So I would expect something like x86_decode_instruction.
> +{
> + int r = EMULATION_OK;
> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +
> + init_emulate_ctxt(vcpu);
> +
> + /*
> + * We will reenter on the same instruction since
> + * we do not set complete_userspace_io. This does not
> + * handle watchpoints yet, those would be handled in
> + * the emulate_ops.
> + */
> + if (!(emulation_type & EMULTYPE_SKIP) &&
> + kvm_vcpu_check_breakpoint(vcpu, &r))
> + return r;
> +
> + ctxt->interruptibility = 0;
> + ctxt->have_exception = false;
> + ctxt->exception.vector = -1;
> + ctxt->perm_ok = false;
> +
> + ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
> +
> + r = x86_decode_insn(ctxt, insn, insn_len);
> +
> + trace_kvm_emulate_insn_start(vcpu);
> + ++vcpu->stat.insn_emulation;
> +
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(x86_emulate_decoded_instruction);
> +
> int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int emulation_type, void *insn, int insn_len)
> {
> @@ -7317,32 +7354,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> */
> write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
> vcpu->arch.write_fault_to_shadow_pgtable = false;
> - kvm_clear_exception_queue(vcpu);
I think that this change is OK, but I can't be 100% sure about this.
Best regards,
Maxim Levitsky
>
> if (!(emulation_type & EMULTYPE_NO_DECODE)) {
> - init_emulate_ctxt(vcpu);
> -
> - /*
> - * We will reenter on the same instruction since
> - * we do not set complete_userspace_io. This does not
> - * handle watchpoints yet, those would be handled in
> - * the emulate_ops.
> - */
> - if (!(emulation_type & EMULTYPE_SKIP) &&
> - kvm_vcpu_check_breakpoint(vcpu, &r))
> - return r;
> -
> - ctxt->interruptibility = 0;
> - ctxt->have_exception = false;
> - ctxt->exception.vector = -1;
> - ctxt->perm_ok = false;
> -
> - ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
> -
> - r = x86_decode_insn(ctxt, insn, insn_len);
> + kvm_clear_exception_queue(vcpu);
>
> - trace_kvm_emulate_insn_start(vcpu);
> - ++vcpu->stat.insn_emulation;
> + r = x86_emulate_decoded_instruction(vcpu, emulation_type,
> + insn, insn_len);
> if (r != EMULATION_OK) {
> if ((emulation_type & EMULTYPE_TRAP_UD) ||
> (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c5ee0f5ce0f1..fc42454a4c27 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -273,6 +273,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> int page_num);
> bool kvm_vector_hashing_enabled(void);
> void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int emulation_type,
> + void *insn, int insn_len);
> int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int emulation_type, void *insn, int insn_len);
> fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> From: Bandan Das <[email protected]>
>
> While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
> CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
> before checking VMCB's instruction intercept. If EAX falls into such
> memory areas, #GP is triggered before VMEXIT. This causes problem under
> nested virtualization. To solve this problem, KVM needs to trap #GP and
> check the instructions triggering #GP. For VM execution instructions,
> KVM emulates these instructions.
>
> Co-developed-by: Wei Huang <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 99 ++++++++++++++++++++++++++++++++++--------
> 1 file changed, 81 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7ef171790d02..6ed523cab068 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -288,6 +288,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> if (!(efer & EFER_SVME)) {
> svm_leave_nested(svm);
> svm_set_gif(svm, true);
> + /* #GP intercept is still needed in vmware_backdoor */
> + if (!enable_vmware_backdoor)
> + clr_exception_intercept(svm, GP_VECTOR);
Again I would prefer a flag for the errata workaround, but this is still
better.
>
> /*
> * Free the nested guest state, unless we are in SMM.
> @@ -309,6 +312,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>
> svm->vmcb->save.efer = efer | EFER_SVME;
> vmcb_mark_dirty(svm->vmcb, VMCB_CR);
> + /* Enable #GP interception for SVM instructions */
> + set_exception_intercept(svm, GP_VECTOR);
> +
> return 0;
> }
>
> @@ -1957,24 +1963,6 @@ static int ac_interception(struct vcpu_svm *svm)
> return 1;
> }
>
> -static int gp_interception(struct vcpu_svm *svm)
> -{
> - struct kvm_vcpu *vcpu = &svm->vcpu;
> - u32 error_code = svm->vmcb->control.exit_info_1;
> -
> - WARN_ON_ONCE(!enable_vmware_backdoor);
> -
> - /*
> - * VMware backdoor emulation on #GP interception only handles IN{S},
> - * OUT{S}, and RDPMC, none of which generate a non-zero error code.
> - */
> - if (error_code) {
> - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> - return 1;
> - }
> - return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
> -}
> -
> static bool is_erratum_383(void)
> {
> int err, i;
> @@ -2173,6 +2161,81 @@ static int vmrun_interception(struct vcpu_svm *svm)
> return nested_svm_vmrun(svm);
> }
>
> +enum {
> + NOT_SVM_INSTR,
> + SVM_INSTR_VMRUN,
> + SVM_INSTR_VMLOAD,
> + SVM_INSTR_VMSAVE,
> +};
> +
> +/* Return NOT_SVM_INSTR if not SVM instrs, otherwise return decode result */
> +static int svm_instr_opcode(struct kvm_vcpu *vcpu)
> +{
> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +
> + if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
> + return NOT_SVM_INSTR;
> +
> + switch (ctxt->modrm) {
> + case 0xd8: /* VMRUN */
> + return SVM_INSTR_VMRUN;
> + case 0xda: /* VMLOAD */
> + return SVM_INSTR_VMLOAD;
> + case 0xdb: /* VMSAVE */
> + return SVM_INSTR_VMSAVE;
> + default:
> + break;
> + }
> +
> + return NOT_SVM_INSTR;
> +}
> +
> +static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> +{
> + int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> + [SVM_INSTR_VMRUN] = vmrun_interception,
> + [SVM_INSTR_VMLOAD] = vmload_interception,
> + [SVM_INSTR_VMSAVE] = vmsave_interception,
> + };
> + struct vcpu_svm *svm = to_svm(vcpu);
> +
> + return svm_instr_handlers[opcode](svm);
> +}
> +
> +/*
> + * #GP handling code. Note that #GP can be triggered under the following two
> + * cases:
> + * 1) SVM VM-related instructions (VMRUN/VMSAVE/VMLOAD) that trigger #GP on
> + * some AMD CPUs when EAX of these instructions are in the reserved memory
> + * regions (e.g. SMM memory on host).
> + * 2) VMware backdoor
> + */
> +static int gp_interception(struct vcpu_svm *svm)
> +{
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + u32 error_code = svm->vmcb->control.exit_info_1;
> + int opcode;
> +
> + /* Both #GP cases have zero error_code */
I would have kept the original description of possible #GP reasons
for the VMWARE backdoor and that WARN_ON_ONCE that was removed.
> + if (error_code)
> + goto reinject;
> +
> + /* Decode the instruction for usage later */
> + if (x86_emulate_decoded_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
> + goto reinject;
> +
> + opcode = svm_instr_opcode(vcpu);
> + if (opcode)
I prefer opcode != NOT_SVM_INSTR.
> + return emulate_svm_instr(vcpu, opcode);
> + else
'WARN_ON_ONCE(!enable_vmware_backdoor)' I think can be placed here.
> + return kvm_emulate_instruction(vcpu,
> + EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
I tested the vmware backdoor a bit (using the kvm-unit-tests) and found a tiny pre-existing bug
there:
We shouldn't emulate the vmware backdoor for a nested guest, but rather let L1 handle it.
The below patch (on top of your patches) works for me and allows the vmware backdoor
test to pass when kvm unit tests run in a guest.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fe97b0e41824a..4557fdc9c3e1b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2243,7 +2243,7 @@ static int gp_interception(struct vcpu_svm *svm)
opcode = svm_instr_opcode(vcpu);
if (opcode)
return emulate_svm_instr(vcpu, opcode);
- else
+ else if (!is_guest_mode(vcpu))
return kvm_emulate_instruction(vcpu,
EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
Best regards,
Maxim Levitsky
> +
> +reinject:
> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + return 1;
> +}
> +
> void svm_set_gif(struct vcpu_svm *svm, bool value)
> {
> if (value) {
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> New AMD CPUs have a change that checks VMEXIT intercept on special SVM
> instructions before checking their EAX against reserved memory region.
> This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT
> is triggered before #GP. KVM doesn't need to intercept and emulate #GP
> faults as #GP is supposed to be triggered.
>
> Co-developed-by: Bandan Das <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kvm/svm/svm.c | 6 +++++-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..ea89d6fdd79a 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -337,6 +337,7 @@
> #define X86_FEATURE_AVIC (15*32+13) /* Virtual Interrupt Controller */
> #define X86_FEATURE_V_VMSAVE_VMLOAD (15*32+15) /* Virtual VMSAVE VMLOAD */
> #define X86_FEATURE_VGIF (15*32+16) /* Virtual GIF */
> +#define X86_FEATURE_SVME_ADDR_CHK (15*32+28) /* "" SVME addr check */
>
> /* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */
> #define X86_FEATURE_AVX512VBMI (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6ed523cab068..2a12870ac71a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -313,7 +313,8 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> svm->vmcb->save.efer = efer | EFER_SVME;
> vmcb_mark_dirty(svm->vmcb, VMCB_CR);
> /* Enable #GP interception for SVM instructions */
> - set_exception_intercept(svm, GP_VECTOR);
> + if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
> + set_exception_intercept(svm, GP_VECTOR);
>
> return 0;
> }
> @@ -933,6 +934,9 @@ static __init void svm_set_cpu_caps(void)
> boot_cpu_has(X86_FEATURE_AMD_SSBD))
> kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
>
> + if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
> + kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
> +
> /* Enable INVPCID feature */
> kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
> }
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> Under the case of nested on nested (e.g. L0->L1->L2->L3), #GP triggered
> by SVM instructions can be hided from L1. Instead the hypervisor can
> inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> can avoid invoking the #GP workaround. For this reason we turns on
> guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to
> receive the notification and change behavior.
>
> Co-developed-by: Bandan Das <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2a12870ac71a..89512c0e7663 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>
> static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> {
> + const int guest_mode_exit_codes[] = {
> + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> + };
> int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> [SVM_INSTR_VMRUN] = vmrun_interception,
> [SVM_INSTR_VMLOAD] = vmload_interception,
> @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> };
> struct vcpu_svm *svm = to_svm(vcpu);
>
> - return svm_instr_handlers[opcode](svm);
> + if (is_guest_mode(vcpu)) {
> + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + return nested_svm_vmexit(svm);
> + } else
> + return svm_instr_handlers[opcode](svm);
> }
>
> /*
> @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> /* Check again if INVPCID interception if required */
> svm_check_invpcid(svm);
>
> + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> + best->edx |= (1 << 28);
> + }
> +
> /* For sev guests, the memory encryption bit is not reserved in CR3. */
> if (sev_guest(vcpu->kvm)) {
> best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
Tested-by: Maxim Levitsky <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
On 21/01/21 15:04, Maxim Levitsky wrote:
>> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int emulation_type,
>> + void *insn, int insn_len)
> Isn't the name of this function wrong? This function decodes the instruction.
> So I would expect something like x86_decode_instruction.
>
Yes, that or x86_decode_emulated_instruction.
Paolo
On 21/01/21 07:55, Wei Huang wrote:
> + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> + best->edx |= (1 << 28);
> + }
> +
Instead of this, please use kvm_cpu_cap_set in svm_set_cpu_caps's "if
(nested)".
Paolo
* Wei Huang ([email protected]) wrote:
> Under the case of nested on nested (e.g. L0->L1->L2->L3), #GP triggered
> by SVM instructions can be hided from L1. Instead the hypervisor can
> inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> can avoid invoking the #GP workaround. For this reason we turns on
> guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to
> receive the notification and change behavior.
Doesn't this mean a VM migrated between levels (hmm L2 to L1???) would
see different behaviour?
(I've never tried such a migration, but I thought in principle it should
work).
Dave
> Co-developed-by: Bandan Das <[email protected]>
> Signed-off-by: Bandan Das <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> ---
> arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2a12870ac71a..89512c0e7663 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>
> static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> {
> + const int guest_mode_exit_codes[] = {
> + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> + };
> int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> [SVM_INSTR_VMRUN] = vmrun_interception,
> [SVM_INSTR_VMLOAD] = vmload_interception,
> @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> };
> struct vcpu_svm *svm = to_svm(vcpu);
>
> - return svm_instr_handlers[opcode](svm);
> + if (is_guest_mode(vcpu)) {
> + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + return nested_svm_vmexit(svm);
> + } else
> + return svm_instr_handlers[opcode](svm);
> }
>
> /*
> @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> /* Check again if INVPCID interception if required */
> svm_check_invpcid(svm);
>
> + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> + best->edx |= (1 << 28);
> + }
> +
> /* For sev guests, the memory encryption bit is not reserved in CR3. */
> if (sev_guest(vcpu->kvm)) {
> best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
> --
> 2.27.0
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Thu, 2021-01-21 at 14:56 +0000, Dr. David Alan Gilbert wrote:
> * Wei Huang ([email protected]) wrote:
> > Under the case of nested on nested (e.g. L0->L1->L2->L3), #GP triggered
> > by SVM instructions can be hided from L1. Instead the hypervisor can
> > inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> > can avoid invoking the #GP workaround. For this reason we turns on
> > guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to
> > receive the notification and change behavior.
>
> Doesn't this mean a VM migrated between levels (hmm L2 to L1???) would
> see different behaviour?
> (I've never tried such a migration, but I thought in principal it should
> work).
This is not an issue. The VM will always see X86_FEATURE_SVME_ADDR_CHK set
(regardless of whether the host has it, or whether KVM emulates it).
This is not different from what KVM does for guest's x2apic.
KVM also always emulates it regardless of the host support.
The hypervisor on the other hand can indeed either see or not that bit set,
but it is prepared to handle both cases, so it will support migrating VMs
between hosts that have and don't have that bit.
I hope that I understand this correctly.
Best regards,
Maxim Levitsky
>
> Dave
>
>
> > Co-developed-by: Bandan Das <[email protected]>
> > Signed-off-by: Bandan Das <[email protected]>
> > Signed-off-by: Wei Huang <[email protected]>
> > ---
> > arch/x86/kvm/svm/svm.c | 19 ++++++++++++++++++-
> > 1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 2a12870ac71a..89512c0e7663 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
> >
> > static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> > {
> > + const int guest_mode_exit_codes[] = {
> > + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> > + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> > + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> > + };
> > int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> > [SVM_INSTR_VMRUN] = vmrun_interception,
> > [SVM_INSTR_VMLOAD] = vmload_interception,
> > @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> > };
> > struct vcpu_svm *svm = to_svm(vcpu);
> >
> > - return svm_instr_handlers[opcode](svm);
> > + if (is_guest_mode(vcpu)) {
> > + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> > + svm->vmcb->control.exit_info_1 = 0;
> > + svm->vmcb->control.exit_info_2 = 0;
> > +
> > + return nested_svm_vmexit(svm);
> > + } else
> > + return svm_instr_handlers[opcode](svm);
> > }
> >
> > /*
> > @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > 	/* Check again if INVPCID interception is required */
> > svm_check_invpcid(svm);
> >
> > + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> > + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> > + best->edx |= (1 << 28);
> > + }
> > +
> > /* For sev guests, the memory encryption bit is not reserved in CR3. */
> > if (sev_guest(vcpu->kvm)) {
> > best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
> > --
> > 2.27.0
> >
On 1/21/21 8:23 AM, Paolo Bonzini wrote:
> On 21/01/21 15:04, Maxim Levitsky wrote:
>>> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int
>>> emulation_type,
>>> + void *insn, int insn_len)
>> Isn't the name of this function wrong? This function decodes the
>> instruction.
>> So I would expect something like x86_decode_instruction.
>>
>
> Yes, that or x86_decode_emulated_instruction.
I was debating about it while making the change. I will update it to the
new name in v3.
>
> Paolo
>
On 1/21/21 8:07 AM, Maxim Levitsky wrote:
> On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
>> From: Bandan Das <[email protected]>
>>
>> While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
>> CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
>> before checking VMCB's instruction intercept. If EAX falls into such
>> memory areas, #GP is triggered before VMEXIT. This causes problems under
>> nested virtualization. To solve this problem, KVM needs to trap #GP and
>> check the instructions triggering #GP. For VM execution instructions,
>> KVM emulates these instructions.
>>
>> Co-developed-by: Wei Huang <[email protected]>
>> Signed-off-by: Wei Huang <[email protected]>
>> Signed-off-by: Bandan Das <[email protected]>
>> ---
>> arch/x86/kvm/svm/svm.c | 99 ++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 81 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 7ef171790d02..6ed523cab068 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -288,6 +288,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>> if (!(efer & EFER_SVME)) {
>> svm_leave_nested(svm);
>> svm_set_gif(svm, true);
>> + /* #GP intercept is still needed in vmware_backdoor */
>> + if (!enable_vmware_backdoor)
>> + clr_exception_intercept(svm, GP_VECTOR);
> Again I would prefer a flag for the errata workaround, but this is still
> better.
Instead of using !enable_vmware_backdoor, would the following be better?
Or is the existing form acceptable?
if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
clr_exception_intercept(svm, GP_VECTOR);
>
>>
>> /*
>> * Free the nested guest state, unless we are in SMM.
>> @@ -309,6 +312,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>>
>> svm->vmcb->save.efer = efer | EFER_SVME;
>> vmcb_mark_dirty(svm->vmcb, VMCB_CR);
>> + /* Enable #GP interception for SVM instructions */
>> + set_exception_intercept(svm, GP_VECTOR);
>> +
>> return 0;
>> }
>>
>> @@ -1957,24 +1963,6 @@ static int ac_interception(struct vcpu_svm *svm)
>> return 1;
>> }
>>
>> -static int gp_interception(struct vcpu_svm *svm)
>> -{
>> - struct kvm_vcpu *vcpu = &svm->vcpu;
>> - u32 error_code = svm->vmcb->control.exit_info_1;
>> -
>> - WARN_ON_ONCE(!enable_vmware_backdoor);
>> -
>> - /*
>> - * VMware backdoor emulation on #GP interception only handles IN{S},
>> - * OUT{S}, and RDPMC, none of which generate a non-zero error code.
>> - */
>> - if (error_code) {
>> - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>> - return 1;
>> - }
>> - return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
>> -}
>> -
>> static bool is_erratum_383(void)
>> {
>> int err, i;
>> @@ -2173,6 +2161,81 @@ static int vmrun_interception(struct vcpu_svm *svm)
>> return nested_svm_vmrun(svm);
>> }
>>
>> +enum {
>> + NOT_SVM_INSTR,
>> + SVM_INSTR_VMRUN,
>> + SVM_INSTR_VMLOAD,
>> + SVM_INSTR_VMSAVE,
>> +};
>> +
>> +/* Return NOT_SVM_INSTR if not SVM instrs, otherwise return decode result */
>> +static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>> +{
>> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
>> +
>> + if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
>> + return NOT_SVM_INSTR;
>> +
>> + switch (ctxt->modrm) {
>> + case 0xd8: /* VMRUN */
>> + return SVM_INSTR_VMRUN;
>> + case 0xda: /* VMLOAD */
>> + return SVM_INSTR_VMLOAD;
>> + case 0xdb: /* VMSAVE */
>> + return SVM_INSTR_VMSAVE;
>> + default:
>> + break;
>> + }
>> +
>> + return NOT_SVM_INSTR;
>> +}
>> +
>> +static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
>> +{
>> + int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
>> + [SVM_INSTR_VMRUN] = vmrun_interception,
>> + [SVM_INSTR_VMLOAD] = vmload_interception,
>> + [SVM_INSTR_VMSAVE] = vmsave_interception,
>> + };
>> + struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> + return svm_instr_handlers[opcode](svm);
>> +}
>> +
>> +/*
>> + * #GP handling code. Note that #GP can be triggered under the following two
>> + * cases:
>> + * 1) SVM VM-related instructions (VMRUN/VMSAVE/VMLOAD) that trigger #GP on
>> + * some AMD CPUs when the EAX of these instructions is in the reserved memory
>> + * regions (e.g. SMM memory on host).
>> + * 2) VMware backdoor
>> + */
>> +static int gp_interception(struct vcpu_svm *svm)
>> +{
>> + struct kvm_vcpu *vcpu = &svm->vcpu;
>> + u32 error_code = svm->vmcb->control.exit_info_1;
>> + int opcode;
>> +
>> + /* Both #GP cases have zero error_code */
>
> I would have kept the original description of possible #GP reasons
> for the VMWARE backdoor and that WARN_ON_ONCE that was removed.
>
Will do
>
>> + if (error_code)
>> + goto reinject;
>> +
>> + /* Decode the instruction for usage later */
>> + if (x86_emulate_decoded_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
>> + goto reinject;
>> +
>> + opcode = svm_instr_opcode(vcpu);
>> + if (opcode)
>
> I prefer opcode != NOT_SVM_INSTR.
>
>> + return emulate_svm_instr(vcpu, opcode);
>> + else
>
> 'WARN_ON_ONCE(!enable_vmware_backdoor)' I think can be placed here.
>
>
>> + return kvm_emulate_instruction(vcpu,
>> + EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
>
> I tested the vmware backdoor a bit (using the kvm unit tests) and I found out a tiny pre-existing bug
> there:
>
> We shouldn't emulate the vmware backdoor for a nested guest, but rather let it do it.
>
> The below patch (on top of your patches) works for me and allows the vmware backdoor
> test to pass when kvm unit tests run in a guest.
>
This fix can be a separate patch? This problem exists even before this
patchset.
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index fe97b0e41824a..4557fdc9c3e1b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2243,7 +2243,7 @@ static int gp_interception(struct vcpu_svm *svm)
> opcode = svm_instr_opcode(vcpu);
> if (opcode)
> return emulate_svm_instr(vcpu, opcode);
> - else
> + else if (!is_guest_mode(vcpu))
> return kvm_emulate_instruction(vcpu,
> EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
>
>
>
> Best regards,
> Maxim Levitsky
>
>> +
>> +reinject:
>> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
>> + return 1;
>> +}
>> +
>> void svm_set_gif(struct vcpu_svm *svm, bool value)
>> {
>> if (value) {
>
>
>
>
>
On Thu, 2021-01-21 at 10:06 -0600, Wei Huang wrote:
>
> On 1/21/21 8:07 AM, Maxim Levitsky wrote:
> > On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> > > From: Bandan Das <[email protected]>
> > >
> > > While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
> > > CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
> > > before checking VMCB's instruction intercept. If EAX falls into such
> > > memory areas, #GP is triggered before VMEXIT. This causes problems under
> > > nested virtualization. To solve this problem, KVM needs to trap #GP and
> > > check the instructions triggering #GP. For VM execution instructions,
> > > KVM emulates these instructions.
> > >
> > > Co-developed-by: Wei Huang <[email protected]>
> > > Signed-off-by: Wei Huang <[email protected]>
> > > Signed-off-by: Bandan Das <[email protected]>
> > > ---
> > > arch/x86/kvm/svm/svm.c | 99 ++++++++++++++++++++++++++++++++++--------
> > > 1 file changed, 81 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 7ef171790d02..6ed523cab068 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -288,6 +288,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> > > if (!(efer & EFER_SVME)) {
> > > svm_leave_nested(svm);
> > > svm_set_gif(svm, true);
> > > + /* #GP intercept is still needed in vmware_backdoor */
> > > + if (!enable_vmware_backdoor)
> > > + clr_exception_intercept(svm, GP_VECTOR);
> > Again I would prefer a flag for the errata workaround, but this is still
> > better.
>
> Instead of using !enable_vmware_backdoor, would the following be better?
> Or is the existing form acceptable?
>
> if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
> clr_exception_intercept(svm, GP_VECTOR);
To be honest I would prefer to have a module param named something like
'enable_svm_gp_errata_workaround' that would have a 3-state value (0, 1, -1),
aka false/true/auto:
0,1 - force disable/enable the workaround.
-1 - auto-select based on X86_FEATURE_SVME_ADDR_CHK.
0 could be used if, for example, someone is paranoid with regard to attack surface.
This isn't that important to me though, so if you prefer you can leave it
as is as well.
>
> > >
> > > /*
> > > * Free the nested guest state, unless we are in SMM.
> > > @@ -309,6 +312,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> > >
> > > svm->vmcb->save.efer = efer | EFER_SVME;
> > > vmcb_mark_dirty(svm->vmcb, VMCB_CR);
> > > + /* Enable #GP interception for SVM instructions */
> > > + set_exception_intercept(svm, GP_VECTOR);
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -1957,24 +1963,6 @@ static int ac_interception(struct vcpu_svm *svm)
> > > return 1;
> > > }
> > >
> > > -static int gp_interception(struct vcpu_svm *svm)
> > > -{
> > > - struct kvm_vcpu *vcpu = &svm->vcpu;
> > > - u32 error_code = svm->vmcb->control.exit_info_1;
> > > -
> > > - WARN_ON_ONCE(!enable_vmware_backdoor);
> > > -
> > > - /*
> > > - * VMware backdoor emulation on #GP interception only handles IN{S},
> > > - * OUT{S}, and RDPMC, none of which generate a non-zero error code.
> > > - */
> > > - if (error_code) {
> > > - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> > > - return 1;
> > > - }
> > > - return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
> > > -}
> > > -
> > > static bool is_erratum_383(void)
> > > {
> > > int err, i;
> > > @@ -2173,6 +2161,81 @@ static int vmrun_interception(struct vcpu_svm *svm)
> > > return nested_svm_vmrun(svm);
> > > }
> > >
> > > +enum {
> > > + NOT_SVM_INSTR,
> > > + SVM_INSTR_VMRUN,
> > > + SVM_INSTR_VMLOAD,
> > > + SVM_INSTR_VMSAVE,
> > > +};
> > > +
> > > +/* Return NOT_SVM_INSTR if not SVM instrs, otherwise return decode result */
> > > +static int svm_instr_opcode(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> > > +
> > > + if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
> > > + return NOT_SVM_INSTR;
> > > +
> > > + switch (ctxt->modrm) {
> > > + case 0xd8: /* VMRUN */
> > > + return SVM_INSTR_VMRUN;
> > > + case 0xda: /* VMLOAD */
> > > + return SVM_INSTR_VMLOAD;
> > > + case 0xdb: /* VMSAVE */
> > > + return SVM_INSTR_VMSAVE;
> > > + default:
> > > + break;
> > > + }
> > > +
> > > + return NOT_SVM_INSTR;
> > > +}
> > > +
> > > +static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> > > +{
> > > + int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> > > + [SVM_INSTR_VMRUN] = vmrun_interception,
> > > + [SVM_INSTR_VMLOAD] = vmload_interception,
> > > + [SVM_INSTR_VMSAVE] = vmsave_interception,
> > > + };
> > > + struct vcpu_svm *svm = to_svm(vcpu);
> > > +
> > > + return svm_instr_handlers[opcode](svm);
> > > +}
> > > +
> > > +/*
> > > + * #GP handling code. Note that #GP can be triggered under the following two
> > > + * cases:
> > > + * 1) SVM VM-related instructions (VMRUN/VMSAVE/VMLOAD) that trigger #GP on
> > > + * some AMD CPUs when the EAX of these instructions is in the reserved memory
> > > + * regions (e.g. SMM memory on host).
> > > + * 2) VMware backdoor
> > > + */
> > > +static int gp_interception(struct vcpu_svm *svm)
> > > +{
> > > + struct kvm_vcpu *vcpu = &svm->vcpu;
> > > + u32 error_code = svm->vmcb->control.exit_info_1;
> > > + int opcode;
> > > +
> > > + /* Both #GP cases have zero error_code */
> >
> > I would have kept the original description of possible #GP reasons
> > for the VMWARE backdoor and that WARN_ON_ONCE that was removed.
> >
>
> Will do
>
> > > + if (error_code)
> > > + goto reinject;
> > > +
> > > + /* Decode the instruction for usage later */
> > > + if (x86_emulate_decoded_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
> > > + goto reinject;
> > > +
> > > + opcode = svm_instr_opcode(vcpu);
> > > + if (opcode)
> >
> > I prefer opcode != NOT_SVM_INSTR.
> >
> > > + return emulate_svm_instr(vcpu, opcode);
> > > + else
> >
> > 'WARN_ON_ONCE(!enable_vmware_backdoor)' I think can be placed here.
> >
> >
> > > + return kvm_emulate_instruction(vcpu,
> > > + EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
> >
> > I tested the vmware backdoor a bit (using the kvm unit tests) and I found out a tiny pre-existing bug
> > there:
> >
> > We shouldn't emulate the vmware backdoor for a nested guest, but rather let it do it.
> >
> > The below patch (on top of your patches) works for me and allows the vmware backdoor
> > test to pass when kvm unit tests run in a guest.
> >
>
> This fix can be a separate patch? This problem exists even before this
> patchset.
It should indeed be a separate patch, but it won't hurt to add it
to this series IMHO if you have time for that.
I just pointed that out because I found this bug during testing,
to avoid forgetting about it.
BTW, on an unrelated note, currently the smap test is broken in kvm-unit tests.
I bisected it to commit 322cdd6405250a2a3e48db199f97a45ef519e226
It seems that the following hack (I have no idea why it works,
since I haven't dug deep into that area) 'fixes' the smap test for me:
-#define USER_BASE (1 << 24)
+#define USER_BASE (1 << 25)
Best regards,
Maxim Levitsky
>
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index fe97b0e41824a..4557fdc9c3e1b 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2243,7 +2243,7 @@ static int gp_interception(struct vcpu_svm *svm)
> > opcode = svm_instr_opcode(vcpu);
> > if (opcode)
> > return emulate_svm_instr(vcpu, opcode);
> > - else
> > + else if (!is_guest_mode(vcpu))
> > return kvm_emulate_instruction(vcpu,
> > EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
> >
> >
> >
> > Best regards,
> > Maxim Levitsky
> >
> > > +
> > > +reinject:
> > > + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> > > + return 1;
> > > +}
> > > +
> > > void svm_set_gif(struct vcpu_svm *svm, bool value)
> > > {
> > > if (value) {
> >
> >
> >
> >
On Thu, Jan 21, 2021, Maxim Levitsky wrote:
> BTW, on an unrelated note, currently the smap test is broken in kvm-unit tests.
> I bisected it to commit 322cdd6405250a2a3e48db199f97a45ef519e226
>
> It seems that the following hack (I have no idea why it works,
> since I haven't dug deep into that area) 'fixes' the smap test for me:
>
> -#define USER_BASE (1 << 24)
> +#define USER_BASE (1 << 25)
https://lkml.kernel.org/r/[email protected]
On Thu, 2021-01-21 at 14:40 -0800, Sean Christopherson wrote:
> On Thu, Jan 21, 2021, Maxim Levitsky wrote:
> > BTW, on an unrelated note, currently the smap test is broken in kvm-unit tests.
> > I bisected it to commit 322cdd6405250a2a3e48db199f97a45ef519e226
> >
> > It seems that the following hack (I have no idea why it works,
> > since I haven't dug deep into that area) 'fixes' the smap test for me:
> >
> > -#define USER_BASE (1 << 24)
> > +#define USER_BASE (1 << 25)
>
> https://lkml.kernel.org/r/[email protected]
>
Thanks!
Best regards,
Maxim Levitsky