Compared to v1, this fixes some incorrect injections of VINTR that happen
on kvm/queue while running nested guests, and it clarifies the code
that handles INT_CTL. The most important parts here are the first three
patches, which further clean up event injection and remove another race
between inject_pending_event and kvm_cpu_has_injectable_intr.
Two other important patches are "KVM: nSVM: restore clobbered INT_CTL
fields after clearing VINTR" and "KVM: nSVM: synthesize correct EXITINTINFO
on vmexit", which fix various hangs that were happening with v1.
Nested Hyper-V is still broken with these patches; the bug is only
marginally related to event injection and the fix is simple, so it can
go into 5.7. Vitaly heroically debugged it, so I'll leave
it to him to post the fix.
Paolo
Paolo Bonzini (28):
KVM: x86: track manually whether an event has been injected
KVM: x86: enable event window in inject_pending_event
KVM: nSVM: inject exceptions via svm_check_nested_events
KVM: nSVM: remove exit_required
KVM: nSVM: correctly inject INIT vmexits
KVM: SVM: always update CR3 in VMCB
KVM: nVMX: always update CR3 in VMCS
KVM: nSVM: move map argument out of enter_svm_guest_mode
KVM: nSVM: extract load_nested_vmcb_control
KVM: nSVM: extract preparation of VMCB for nested run
KVM: nSVM: move MMU setup to nested_prepare_vmcb_control
KVM: nSVM: clean up tsc_offset update
KVM: nSVM: pass vmcb_control_area to copy_vmcb_control_area
KVM: nSVM: remove trailing padding for struct vmcb_control_area
KVM: nSVM: save all control fields in svm->nested
KVM: nSVM: restore clobbered INT_CTL fields after clearing VINTR
KVM: nSVM: synchronize VMCB controls updated by the processor on every
vmexit
KVM: nSVM: remove unnecessary if
KVM: nSVM: extract svm_set_gif
KVM: SVM: preserve VGIF across VMCB switch
KVM: nSVM: synthesize correct EXITINTINFO on vmexit
KVM: nSVM: remove HF_VINTR_MASK
KVM: nSVM: remove HF_HIF_MASK
KVM: nSVM: split nested_vmcb_check_controls
KVM: nSVM: leave guest mode when clearing EFER.SVME
KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmu
selftests: kvm: add a SVM version of state-test
KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE
arch/x86/include/asm/kvm_host.h | 12 +-
arch/x86/include/asm/svm.h | 9 +-
arch/x86/include/uapi/asm/kvm.h | 17 +-
arch/x86/kvm/cpuid.h | 5 +
arch/x86/kvm/irq.c | 1 +
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 14 +-
arch/x86/kvm/svm/nested.c | 624 ++++++++++++------
arch/x86/kvm/svm/svm.c | 154 ++---
arch/x86/kvm/svm/svm.h | 33 +-
arch/x86/kvm/vmx/nested.c | 5 -
arch/x86/kvm/vmx/vmx.c | 25 +-
arch/x86/kvm/x86.c | 141 ++--
.../testing/selftests/kvm/x86_64/state_test.c | 69 +-
14 files changed, 687 insertions(+), 424 deletions(-)
--
2.26.2
When restoring SVM nested state, the control state cache in svm->nested
will have to be filled, but the save state will not have to be moved
into svm->vmcb. Therefore, pull the code that handles the control area
into a separate function.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 38 ++++++++++++++++++++++----------------
1 file changed, 22 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d2fc3c4aa00b..ea00dcd8c39b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -228,6 +228,23 @@ static bool nested_vmcb_checks(struct vmcb *vmcb)
return true;
}
+static void load_nested_vmcb_control(struct vcpu_svm *svm,
+ struct vmcb_control_area *control)
+{
+ svm->nested.nested_cr3 = control->nested_cr3;
+
+ svm->nested.vmcb_msrpm = control->msrpm_base_pa & ~0x0fffULL;
+ svm->nested.vmcb_iopm = control->iopm_base_pa & ~0x0fffULL;
+
+ /* cache intercepts */
+ svm->nested.intercept_cr = control->intercept_cr;
+ svm->nested.intercept_dr = control->intercept_dr;
+ svm->nested.intercept_exceptions = control->intercept_exceptions;
+ svm->nested.intercept = control->intercept;
+
+ svm->vcpu.arch.tsc_offset += control->tsc_offset;
+}
+
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
struct vmcb *nested_vmcb)
{
@@ -235,15 +252,16 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
is_intercept(svm, INTERCEPT_VINTR) ||
is_intercept(svm, INTERCEPT_IRET);
+ svm->nested.vmcb = vmcb_gpa;
if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
svm->vcpu.arch.hflags |= HF_HIF_MASK;
else
svm->vcpu.arch.hflags &= ~HF_HIF_MASK;
- if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) {
- svm->nested.nested_cr3 = nested_vmcb->control.nested_cr3;
+ load_nested_vmcb_control(svm, &nested_vmcb->control);
+
+ if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
nested_svm_init_mmu_context(&svm->vcpu);
- }
/* Load the nested guest state */
svm->vmcb->save.es = nested_vmcb->save.es;
@@ -274,25 +292,15 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm->vcpu.arch.dr6 = nested_vmcb->save.dr6;
svm->vmcb->save.cpl = nested_vmcb->save.cpl;
- svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL;
- svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL;
-
- /* cache intercepts */
- svm->nested.intercept_cr = nested_vmcb->control.intercept_cr;
- svm->nested.intercept_dr = nested_vmcb->control.intercept_dr;
- svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
- svm->nested.intercept = nested_vmcb->control.intercept;
-
svm_flush_tlb(&svm->vcpu);
- svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
svm->vcpu.arch.hflags |= HF_VINTR_MASK;
else
svm->vcpu.arch.hflags &= ~HF_VINTR_MASK;
- svm->vcpu.arch.tsc_offset += nested_vmcb->control.tsc_offset;
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset;
+ svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext;
svm->vmcb->control.int_vector = nested_vmcb->control.int_vector;
svm->vmcb->control.int_state = nested_vmcb->control.int_state;
@@ -313,8 +321,6 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
*/
recalc_intercepts(svm);
- svm->nested.vmcb = vmcb_gpa;
-
/*
* If L1 had a pending IRQ/NMI before executing VMRUN,
* which wasn't delivered because it was disallowed (e.g.
--
2.26.2
Use l1_tsc_offset to compute svm->vcpu.arch.tsc_offset and
svm->vmcb->control.tsc_offset, instead of relying on hsave.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7fbd7aaa4ce0..5a9d131a153e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -241,8 +241,6 @@ static void load_nested_vmcb_control(struct vcpu_svm *svm,
svm->nested.intercept_dr = control->intercept_dr;
svm->nested.intercept_exceptions = control->intercept_exceptions;
svm->nested.intercept = control->intercept;
-
- svm->vcpu.arch.tsc_offset += control->tsc_offset;
}
static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
@@ -288,7 +286,8 @@ static void nested_prepare_vmcb_control(struct vcpu_svm *svm, struct vmcb *neste
else
svm->vcpu.arch.hflags &= ~HF_VINTR_MASK;
- svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset;
+ svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
+ svm->vcpu.arch.l1_tsc_offset + nested_vmcb->control.tsc_offset;
svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext;
@@ -553,7 +552,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
/* Restore the original control entries */
copy_vmcb_control_area(vmcb, hsave);
- svm->vcpu.arch.tsc_offset = svm->vmcb->control.tsc_offset;
+ svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
+ svm->vcpu.arch.l1_tsc_offset;
+
kvm_clear_exception_queue(&svm->vcpu);
kvm_clear_interrupt_queue(&svm->vcpu);
--
2.26.2
Restore the INT_CTL value from the guest's VMCB once we've stopped using
it, so that virtual interrupts can be injected as requested by L1.
V_TPR, however, is up to date and can change if the guest writes to CR8,
so keep it.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/svm.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 77d3f441cd87..c213011a82e3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1365,9 +1365,17 @@ static void svm_set_vintr(struct vcpu_svm *svm)
static void svm_clear_vintr(struct vcpu_svm *svm)
{
+ const u32 mask = V_TPR_MASK | V_GIF_ENABLE_MASK | V_GIF_MASK | V_INTR_MASKING_MASK;
clr_intercept(svm, INTERCEPT_VINTR);
- svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
+ /* Drop int_ctl fields related to VINTR injection. */
+ svm->vmcb->control.int_ctl &= mask;
+ if (is_guest_mode(&svm->vcpu)) {
+ WARN_ON((svm->vmcb->control.int_ctl & V_TPR_MASK) !=
+ (svm->nested.ctl.int_ctl & V_TPR_MASK));
+ svm->vmcb->control.int_ctl |= svm->nested.ctl.int_ctl & ~mask;
+ }
+
mark_dirty(svm->vmcb, VMCB_INTR);
}
--
2.26.2
In preparation for nested SVM save/restore, store all data that matters
from the VMCB control area into svm->nested. It will then become part
of the nested SVM state that is saved by KVM_GET_NESTED_STATE and
restored by KVM_SET_NESTED_STATE, just like the cached vmcs12 for nVMX.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 73 +++++++++++++++++----------------------
arch/x86/kvm/svm/svm.c | 4 +--
arch/x86/kvm/svm/svm.h | 20 +++--------
3 files changed, 39 insertions(+), 58 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 0dfd9c8c2ca7..7f739f9a797c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -60,7 +60,7 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
{
struct vcpu_svm *svm = to_svm(vcpu);
- u64 cr3 = svm->nested.nested_cr3;
+ u64 cr3 = svm->nested.ctl.nested_cr3;
u64 pdpte;
int ret;
@@ -75,7 +75,7 @@ static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
- return svm->nested.nested_cr3;
+ return svm->nested.ctl.nested_cr3;
}
static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
@@ -100,8 +100,7 @@ static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu)
void recalc_intercepts(struct vcpu_svm *svm)
{
- struct vmcb_control_area *c, *h;
- struct nested_state *g;
+ struct vmcb_control_area *c, *h, *g;
mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
@@ -110,7 +109,7 @@ void recalc_intercepts(struct vcpu_svm *svm)
c = &svm->vmcb->control;
h = &svm->nested.hsave->control;
- g = &svm->nested;
+ g = &svm->nested.ctl;
svm->nested.host_intercept_exceptions = h->intercept_exceptions;
@@ -180,7 +179,7 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
*/
int i;
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT)))
+ if (!(svm->nested.ctl.intercept & (1ULL << INTERCEPT_MSR_PROT)))
return true;
for (i = 0; i < MSRPM_OFFSETS; i++) {
@@ -191,7 +190,7 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
break;
p = msrpm_offsets[i];
- offset = svm->nested.vmcb_msrpm + (p * 4);
+ offset = svm->nested.ctl.msrpm_base_pa + (p * 4);
if (kvm_vcpu_read_guest(&svm->vcpu, offset, &value, 4))
return false;
@@ -229,16 +228,10 @@ static bool nested_vmcb_checks(struct vmcb *vmcb)
static void load_nested_vmcb_control(struct vcpu_svm *svm,
struct vmcb_control_area *control)
{
- svm->nested.nested_cr3 = control->nested_cr3;
+ copy_vmcb_control_area(&svm->nested.ctl, control);
- svm->nested.vmcb_msrpm = control->msrpm_base_pa & ~0x0fffULL;
- svm->nested.vmcb_iopm = control->iopm_base_pa & ~0x0fffULL;
-
- /* cache intercepts */
- svm->nested.intercept_cr = control->intercept_cr;
- svm->nested.intercept_dr = control->intercept_dr;
- svm->nested.intercept_exceptions = control->intercept_exceptions;
- svm->nested.intercept = control->intercept;
+ svm->nested.ctl.msrpm_base_pa &= ~0x0fffULL;
+ svm->nested.ctl.iopm_base_pa &= ~0x0fffULL;
}
static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
@@ -270,34 +263,32 @@ static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_v
svm->vmcb->save.cpl = nested_vmcb->save.cpl;
}
-static void nested_prepare_vmcb_control(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
+static void nested_prepare_vmcb_control(struct vcpu_svm *svm)
{
- if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
+ if (svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
nested_svm_init_mmu_context(&svm->vcpu);
/* Guest paging mode is active - reset mmu */
kvm_mmu_reset_context(&svm->vcpu);
svm_flush_tlb(&svm->vcpu);
- if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
+ if (svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
svm->vcpu.arch.hflags |= HF_VINTR_MASK;
else
svm->vcpu.arch.hflags &= ~HF_VINTR_MASK;
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
- svm->vcpu.arch.l1_tsc_offset + nested_vmcb->control.tsc_offset;
+ svm->vcpu.arch.l1_tsc_offset + svm->nested.ctl.tsc_offset;
- svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
- svm->vmcb->control.virt_ext = nested_vmcb->control.virt_ext;
- svm->vmcb->control.int_vector = nested_vmcb->control.int_vector;
- svm->vmcb->control.int_state = nested_vmcb->control.int_state;
- svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
- svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
+ svm->vmcb->control.int_ctl = svm->nested.ctl.int_ctl | V_INTR_MASKING_MASK;
+ svm->vmcb->control.virt_ext = svm->nested.ctl.virt_ext;
+ svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
+ svm->vmcb->control.int_state = svm->nested.ctl.int_state;
+ svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
+ svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
- svm->vmcb->control.pause_filter_count =
- nested_vmcb->control.pause_filter_count;
- svm->vmcb->control.pause_filter_thresh =
- nested_vmcb->control.pause_filter_thresh;
+ svm->vmcb->control.pause_filter_count = svm->nested.ctl.pause_filter_count;
+ svm->vmcb->control.pause_filter_thresh = svm->nested.ctl.pause_filter_thresh;
/* Enter Guest-Mode */
enter_guest_mode(&svm->vcpu);
@@ -326,7 +317,7 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
load_nested_vmcb_control(svm, &nested_vmcb->control);
nested_prepare_vmcb_save(svm, nested_vmcb);
- nested_prepare_vmcb_control(svm, nested_vmcb);
+ nested_prepare_vmcb_control(svm);
/*
* If L1 had a pending IRQ/NMI before executing VMRUN,
@@ -556,7 +547,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
kvm_clear_exception_queue(&svm->vcpu);
kvm_clear_interrupt_queue(&svm->vcpu);
- svm->nested.nested_cr3 = 0;
+ svm->nested.ctl.nested_cr3 = 0;
/* Restore selected save entries */
svm->vmcb->save.es = hsave->save.es;
@@ -606,7 +597,7 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
u32 offset, msr, value;
int write, mask;
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT)))
+ if (!(svm->nested.ctl.intercept & (1ULL << INTERCEPT_MSR_PROT)))
return NESTED_EXIT_HOST;
msr = svm->vcpu.arch.regs[VCPU_REGS_RCX];
@@ -620,7 +611,7 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
/* Offset is in 32 bit units but need in 8 bit units */
offset *= 4;
- if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.vmcb_msrpm + offset, &value, 4))
+ if (kvm_vcpu_read_guest(&svm->vcpu, svm->nested.ctl.msrpm_base_pa + offset, &value, 4))
return NESTED_EXIT_DONE;
return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
@@ -633,13 +624,13 @@ static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
u8 start_bit;
u64 gpa;
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT)))
+ if (!(svm->nested.ctl.intercept & (1ULL << INTERCEPT_IOIO_PROT)))
return NESTED_EXIT_HOST;
port = svm->vmcb->control.exit_info_1 >> 16;
size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >>
SVM_IOIO_SIZE_SHIFT;
- gpa = svm->nested.vmcb_iopm + (port / 8);
+ gpa = svm->nested.ctl.iopm_base_pa + (port / 8);
start_bit = port % 8;
iopm_len = (start_bit + size > 8) ? 2 : 1;
mask = (0xf >> (4 - size)) << start_bit;
@@ -665,13 +656,13 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
break;
case SVM_EXIT_READ_CR0 ... SVM_EXIT_WRITE_CR8: {
u32 bit = 1U << (exit_code - SVM_EXIT_READ_CR0);
- if (svm->nested.intercept_cr & bit)
+ if (svm->nested.ctl.intercept_cr & bit)
vmexit = NESTED_EXIT_DONE;
break;
}
case SVM_EXIT_READ_DR0 ... SVM_EXIT_WRITE_DR7: {
u32 bit = 1U << (exit_code - SVM_EXIT_READ_DR0);
- if (svm->nested.intercept_dr & bit)
+ if (svm->nested.ctl.intercept_dr & bit)
vmexit = NESTED_EXIT_DONE;
break;
}
@@ -690,7 +681,7 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
}
default: {
u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR);
- if (svm->nested.intercept & exit_bits)
+ if (svm->nested.ctl.intercept & exit_bits)
vmexit = NESTED_EXIT_DONE;
}
}
@@ -730,7 +721,7 @@ static bool nested_exit_on_exception(struct vcpu_svm *svm)
{
unsigned int nr = svm->vcpu.arch.exception.nr;
- return (svm->nested.intercept_exceptions & (1 << nr));
+ return (svm->nested.ctl.intercept_exceptions & (1 << nr));
}
static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
@@ -798,7 +789,7 @@ static void nested_svm_intr(struct vcpu_svm *svm)
static inline bool nested_exit_on_init(struct vcpu_svm *svm)
{
- return (svm->nested.intercept & (1ULL << INTERCEPT_INIT));
+ return (svm->nested.ctl.intercept & (1ULL << INTERCEPT_INIT));
}
static void nested_svm_init(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3321c402a753..77d3f441cd87 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2173,7 +2173,7 @@ static bool check_selective_cr0_intercepted(struct vcpu_svm *svm,
bool ret = false;
u64 intercept;
- intercept = svm->nested.intercept;
+ intercept = svm->nested.ctl.intercept;
if (!is_guest_mode(&svm->vcpu) ||
(!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0))))
@@ -3649,7 +3649,7 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
info->intercept == x86_intercept_clts)
break;
- intercept = svm->nested.intercept;
+ intercept = svm->nested.ctl.intercept;
if (!(intercept & (1ULL << INTERCEPT_SELECTIVE_CR0)))
break;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 33e3f09d7a8e..dd5418f20256 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -91,22 +91,12 @@ struct nested_state {
/* These are the merged vectors */
u32 *msrpm;
- /* gpa pointers to the real vectors */
- u64 vmcb_msrpm;
- u64 vmcb_iopm;
-
/* A VMRUN has started but has not yet been performed, so
* we cannot inject a nested vmexit yet. */
bool nested_run_pending;
- /* cache for intercepts of the guest */
- u32 intercept_cr;
- u32 intercept_dr;
- u32 intercept_exceptions;
- u64 intercept;
-
- /* Nested Paging related state */
- u64 nested_cr3;
+ /* cache for control fields of the guest */
+ struct vmcb_control_area ctl;
};
struct vcpu_svm {
@@ -381,17 +371,17 @@ static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu)
static inline bool nested_exit_on_smi(struct vcpu_svm *svm)
{
- return (svm->nested.intercept & (1ULL << INTERCEPT_SMI));
+ return (svm->nested.ctl.intercept & (1ULL << INTERCEPT_SMI));
}
static inline bool nested_exit_on_intr(struct vcpu_svm *svm)
{
- return (svm->nested.intercept & (1ULL << INTERCEPT_INTR));
+ return (svm->nested.ctl.intercept & (1ULL << INTERCEPT_INTR));
}
static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
{
- return (svm->nested.intercept & (1ULL << INTERCEPT_NMI));
+ return (svm->nested.ctl.intercept & (1ULL << INTERCEPT_NMI));
}
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
--
2.26.2
Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can
use it instead of vcpu->arch.hflags to check whether L2 is running
in V_INTR_MASKING mode.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/svm/nested.c | 6 +-----
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/svm/svm.h | 4 +++-
4 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e6f2e1a2dab6..0dfc522f96cc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1596,7 +1596,6 @@ enum {
#define HF_GIF_MASK (1 << 0)
#define HF_HIF_MASK (1 << 1)
-#define HF_VINTR_MASK (1 << 2)
#define HF_NMI_MASK (1 << 3)
#define HF_IRET_MASK (1 << 4)
#define HF_GUEST_MASK (1 << 5) /* VCPU is in guest-mode */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 9746ccbdfd2a..56d45fe6eb45 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -118,7 +118,7 @@ void recalc_intercepts(struct vcpu_svm *svm)
c->intercept_exceptions = h->intercept_exceptions;
c->intercept = h->intercept;
- if (svm->vcpu.arch.hflags & HF_VINTR_MASK) {
+ if (g->int_ctl & V_INTR_MASKING_MASK) {
/* We only want the cr8 intercept bits of L1 */
c->intercept_cr &= ~(1U << INTERCEPT_CR8_READ);
c->intercept_cr &= ~(1U << INTERCEPT_CR8_WRITE);
@@ -338,10 +338,6 @@ static void nested_prepare_vmcb_control(struct vcpu_svm *svm)
kvm_mmu_reset_context(&svm->vcpu);
svm_flush_tlb(&svm->vcpu);
- if (svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
- svm->vcpu.arch.hflags |= HF_VINTR_MASK;
- else
- svm->vcpu.arch.hflags &= ~HF_VINTR_MASK;
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
svm->vcpu.arch.l1_tsc_offset + svm->nested.ctl.tsc_offset;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index af5d4ae00cb4..d1a6e2f3db55 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3104,7 +3104,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
if (is_guest_mode(vcpu)) {
/* As long as interrupts are being delivered... */
- if ((svm->vcpu.arch.hflags & HF_VINTR_MASK)
+ if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
? !(svm->vcpu.arch.hflags & HF_HIF_MASK)
: !(kvm_get_rflags(vcpu) & X86_EFLAGS_IF))
return true;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 10b7b55720a0..be8e830f83fa 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -367,7 +367,9 @@ void svm_set_gif(struct vcpu_svm *svm, bool value);
static inline bool svm_nested_virtualize_tpr(struct kvm_vcpu *vcpu)
{
- return is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK);
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return is_guest_mode(vcpu) && (svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK);
}
static inline bool nested_exit_on_smi(struct vcpu_svm *svm)
--
2.26.2
L1's RFLAGS value can be found in the save area of svm->nested.hsave;
fish it from there so that there is one fewer thing to migrate.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/svm/nested.c | 5 -----
arch/x86/kvm/svm/svm.c | 2 +-
3 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0dfc522f96cc..3485f8454088 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1595,7 +1595,6 @@ enum {
};
#define HF_GIF_MASK (1 << 0)
-#define HF_HIF_MASK (1 << 1)
#define HF_NMI_MASK (1 << 3)
#define HF_IRET_MASK (1 << 4)
#define HF_GUEST_MASK (1 << 5) /* VCPU is in guest-mode */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 56d45fe6eb45..f3e70aee18cd 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -371,11 +371,6 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
struct vmcb *nested_vmcb)
{
svm->nested.vmcb = vmcb_gpa;
- if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
- svm->vcpu.arch.hflags |= HF_HIF_MASK;
- else
- svm->vcpu.arch.hflags &= ~HF_HIF_MASK;
-
load_nested_vmcb_control(svm, &nested_vmcb->control);
nested_prepare_vmcb_save(svm, nested_vmcb);
nested_prepare_vmcb_control(svm);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d1a6e2f3db55..f0da6f0ebf1b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3105,7 +3105,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
if (is_guest_mode(vcpu)) {
/* As long as interrupts are being delivered... */
if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
- ? !(svm->vcpu.arch.hflags & HF_HIF_MASK)
+ ? !(svm->nested.hsave->save.rflags & X86_EFLAGS_IF)
: !(kvm_get_rflags(vcpu) & X86_EFLAGS_IF))
return true;
--
2.26.2
This logic was added to nested VMX right when nested_run_pending was
introduced, but it is not yet there in nSVM. Since we can have pending
events that L0 injected directly into L2 on vmentry, we have to transfer
them into the EXITINTINFO field of the nested VMCB, so that L1 can requeue them.
For this to work, one important change is required: svm_complete_interrupts
(which clears the "injected" fields from the previous VMRUN, and updates them
from svm->vmcb's EXITINTINFO) must be placed before we inject the vmexit.
This is not too scary though; VMX even does it in vmx_vcpu_run.
While at it, the nested_vmexit_inject tracepoint is moved towards the
end of nested_svm_vmexit. This ensures that the synthesized EXITINTINFO
is visible in the trace.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 59 +++++++++++++++++++++++++++++++--------
arch/x86/kvm/svm/svm.c | 4 +--
2 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 607531db8a94..9746ccbdfd2a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -262,6 +262,43 @@ void sync_nested_vmcb_control(struct vcpu_svm *svm)
svm->nested.ctl.int_ctl |= svm->vmcb->control.int_ctl & mask;
}
+/*
+ * Transfer any event that L0 or L1 wanted to inject into L2 to
+ * EXIT_INT_INFO.
+ */
+static void nested_vmcb_save_pending_event(struct vcpu_svm *svm,
+ struct vmcb *nested_vmcb)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ u32 exit_int_info = 0;
+ unsigned int nr;
+
+ if (vcpu->arch.exception.injected) {
+ nr = vcpu->arch.exception.nr;
+ exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
+
+ if (vcpu->arch.exception.has_error_code) {
+ exit_int_info |= SVM_EVTINJ_VALID_ERR;
+ nested_vmcb->control.exit_int_info_err =
+ vcpu->arch.exception.error_code;
+ }
+
+ } else if (vcpu->arch.nmi_injected) {
+ exit_int_info = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
+
+ } else if (vcpu->arch.interrupt.injected) {
+ nr = vcpu->arch.interrupt.nr;
+ exit_int_info = nr | SVM_EVTINJ_VALID;
+
+ if (vcpu->arch.interrupt.soft)
+ exit_int_info |= SVM_EVTINJ_TYPE_SOFT;
+ else
+ exit_int_info |= SVM_EVTINJ_TYPE_INTR;
+ }
+
+ nested_vmcb->control.exit_int_info = exit_int_info;
+}
+
static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
{
/* Load the nested guest state */
@@ -466,13 +503,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
struct vmcb *vmcb = svm->vmcb;
struct kvm_host_map map;
- trace_kvm_nested_vmexit_inject(vmcb->control.exit_code,
- vmcb->control.exit_info_1,
- vmcb->control.exit_info_2,
- vmcb->control.exit_int_info,
- vmcb->control.exit_int_info_err,
- KVM_ISA_SVM);
-
rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map);
if (rc) {
if (rc == -EINVAL)
@@ -517,8 +547,9 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi;
nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1;
nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2;
- nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info;
- nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err;
+
+ if (nested_vmcb->control.exit_code != SVM_EXIT_ERR)
+ nested_vmcb_save_pending_event(svm, nested_vmcb);
if (svm->nrips_enabled)
nested_vmcb->control.next_rip = vmcb->control.next_rip;
@@ -539,9 +570,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
svm->vcpu.arch.l1_tsc_offset;
- kvm_clear_exception_queue(&svm->vcpu);
- kvm_clear_interrupt_queue(&svm->vcpu);
-
svm->nested.ctl.nested_cr3 = 0;
/* Restore selected save entries */
@@ -570,6 +598,13 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
mark_all_dirty(svm->vmcb);
+ trace_kvm_nested_vmexit_inject(nested_vmcb->control.exit_code,
+ nested_vmcb->control.exit_info_1,
+ nested_vmcb->control.exit_info_2,
+ nested_vmcb->control.exit_int_info,
+ nested_vmcb->control.exit_int_info_err,
+ KVM_ISA_SVM);
+
kvm_vcpu_unmap(&svm->vcpu, &map, true);
nested_svm_uninit_mmu_context(&svm->vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index df3fcaa827c7..af5d4ae00cb4 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2913,6 +2913,8 @@ static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
if (npt_enabled)
vcpu->arch.cr3 = svm->vmcb->save.cr3;
+ svm_complete_interrupts(svm);
+
if (is_guest_mode(vcpu)) {
int vmexit;
@@ -2932,8 +2934,6 @@ static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return 1;
}
- svm_complete_interrupts(svm);
-
if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
kvm_run->fail_entry.hardware_entry_failure_reason
--
2.26.2
Similar to VMX, the state that is captured through the currently available
IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was
running at the moment when the process was interrupted to save its state.
In particular, the SVM-specific state for nested virtualization includes
the L1 saved state (including the interrupt flag), the cached L2 controls,
and the GIF.
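For reference, below is a rough userspace sketch of how a VMM might drive
these ioctls for the new SVM format. The function and file-descriptor names
are made up for illustration, and the size negotiation and error handling
are only hinted at; none of this is part of the series itself.

  #include <linux/kvm.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>

  /* Sketch only: save nested SVM state from one vCPU, restore it on another. */
  static int migrate_nested_svm(int src_vcpu_fd, int dst_vcpu_fd)
  {
          /* Fixed header plus room for the vmcb12 data page. */
          size_t sz = sizeof(struct kvm_nested_state) +
                      KVM_STATE_NESTED_SVM_VMCB_SIZE;
          struct kvm_nested_state *state = calloc(1, sz);
          int ret = -1;

          if (!state)
                  return ret;

          /*
           * KVM fills format, flags (GUEST_MODE, RUN_PENDING, GIF_SET),
           * hdr.svm.vmcb_pa and, when L2 was running, data.svm[0].
           */
          state->size = sz;
          if (ioctl(src_vcpu_fd, KVM_GET_NESTED_STATE, state) < 0)
                  goto out;

          /* Restore on the destination vCPU alongside the usual register state. */
          if (ioctl(dst_vcpu_fd, KVM_SET_NESTED_STATE, state) < 0)
                  goto out;

          ret = 0;
  out:
          free(state);
          return ret;
  }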
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 17 +++-
arch/x86/kvm/cpuid.h | 5 ++
arch/x86/kvm/svm/nested.c | 147 ++++++++++++++++++++++++++++++++
arch/x86/kvm/svm/svm.c | 1 +
arch/x86/kvm/vmx/nested.c | 5 --
arch/x86/kvm/x86.c | 3 +-
6 files changed, 171 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 3f3f780c8c65..12075a9de1c1 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -385,18 +385,22 @@ struct kvm_sync_regs {
#define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
#define KVM_STATE_NESTED_FORMAT_VMX 0
-#define KVM_STATE_NESTED_FORMAT_SVM 1 /* unused */
+#define KVM_STATE_NESTED_FORMAT_SVM 1
#define KVM_STATE_NESTED_GUEST_MODE 0x00000001
#define KVM_STATE_NESTED_RUN_PENDING 0x00000002
#define KVM_STATE_NESTED_EVMCS 0x00000004
#define KVM_STATE_NESTED_MTF_PENDING 0x00000008
+#define KVM_STATE_NESTED_GIF_SET 0x00000100
#define KVM_STATE_NESTED_SMM_GUEST_MODE 0x00000001
#define KVM_STATE_NESTED_SMM_VMXON 0x00000002
#define KVM_STATE_NESTED_VMX_VMCS_SIZE 0x1000
+#define KVM_STATE_NESTED_SVM_VMCB_SIZE 0x1000
+
+
struct kvm_vmx_nested_state_data {
__u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
__u8 shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS_SIZE];
@@ -411,6 +415,15 @@ struct kvm_vmx_nested_state_hdr {
} smm;
};
+struct kvm_svm_nested_state_data {
+ /* Save area only used if KVM_STATE_NESTED_RUN_PENDING. */
+ __u8 vmcb12[KVM_STATE_NESTED_SVM_VMCB_SIZE];
+};
+
+struct kvm_svm_nested_state_hdr {
+ __u64 vmcb_pa;
+};
+
/* for KVM_CAP_NESTED_STATE */
struct kvm_nested_state {
__u16 flags;
@@ -419,6 +432,7 @@ struct kvm_nested_state {
union {
struct kvm_vmx_nested_state_hdr vmx;
+ struct kvm_svm_nested_state_hdr svm;
/* Pad the header to 128 bytes. */
__u8 pad[120];
@@ -431,6 +445,7 @@ struct kvm_nested_state {
*/
union {
struct kvm_vmx_nested_state_data vmx[0];
+ struct kvm_svm_nested_state_data svm[0];
} data;
};
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 63a70f6a3df3..05434cd9342f 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -303,4 +303,9 @@ static __always_inline void kvm_cpu_cap_check_and_set(unsigned int x86_feature)
kvm_cpu_cap_set(x86_feature);
}
+static inline bool page_address_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return PAGE_ALIGNED(gpa) && !(gpa >> cpuid_maxphyaddr(vcpu));
+}
+
#endif
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 840662e66976..0f02521550b9 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -25,6 +25,7 @@
#include "trace.h"
#include "mmu.h"
#include "x86.h"
+#include "cpuid.h"
#include "lapic.h"
#include "svm.h"
@@ -238,6 +239,8 @@ static void load_nested_vmcb_control(struct vcpu_svm *svm,
{
copy_vmcb_control_area(&svm->nested.ctl, control);
+ /* Copy it here because nested_svm_check_controls will check it. */
+ svm->nested.ctl.asid = control->asid;
svm->nested.ctl.msrpm_base_pa &= ~0x0fffULL;
svm->nested.ctl.iopm_base_pa &= ~0x0fffULL;
}
@@ -930,6 +933,150 @@ int nested_svm_exit_special(struct vcpu_svm *svm)
return NESTED_EXIT_CONTINUE;
}
+static int svm_get_nested_state(struct kvm_vcpu *vcpu,
+ struct kvm_nested_state __user *user_kvm_nested_state,
+ u32 user_data_size)
+{
+ struct vcpu_svm *svm;
+ struct kvm_nested_state kvm_state = {
+ .flags = 0,
+ .format = KVM_STATE_NESTED_FORMAT_SVM,
+ .size = sizeof(kvm_state),
+ };
+ struct vmcb __user *user_vmcb = (struct vmcb __user *)
+ &user_kvm_nested_state->data.svm[0];
+
+ if (!vcpu)
+ return kvm_state.size + KVM_STATE_NESTED_SVM_VMCB_SIZE;
+
+ svm = to_svm(vcpu);
+
+ if (user_data_size < kvm_state.size)
+ goto out;
+
+ /* First fill in the header and copy it out. */
+ if (is_guest_mode(vcpu)) {
+ kvm_state.hdr.svm.vmcb_pa = svm->nested.vmcb;
+ kvm_state.size += KVM_STATE_NESTED_SVM_VMCB_SIZE;
+ kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE;
+
+ if (svm->nested.nested_run_pending)
+ kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING;
+ }
+
+ if (gif_set(svm))
+ kvm_state.flags |= KVM_STATE_NESTED_GIF_SET;
+
+ if (copy_to_user(user_kvm_nested_state, &kvm_state, sizeof(kvm_state)))
+ return -EFAULT;
+
+ if (!is_guest_mode(vcpu))
+ goto out;
+
+ /*
+ * Copy over the full size of the VMCB rather than just the size
+ * of the structs.
+ */
+ if (clear_user(user_vmcb, KVM_STATE_NESTED_SVM_VMCB_SIZE))
+ return -EFAULT;
+ if (copy_to_user(&user_vmcb->control, &svm->nested.ctl,
+ sizeof(user_vmcb->control)))
+ return -EFAULT;
+ if (copy_to_user(&user_vmcb->save, &svm->nested.hsave->save,
+ sizeof(user_vmcb->save)))
+ return -EFAULT;
+
+out:
+ return kvm_state.size;
+}
+
+static int svm_set_nested_state(struct kvm_vcpu *vcpu,
+ struct kvm_nested_state __user *user_kvm_nested_state,
+ struct kvm_nested_state *kvm_state)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *hsave = svm->nested.hsave;
+ struct vmcb __user *user_vmcb = (struct vmcb __user *)
+ &user_kvm_nested_state->data.svm[0];
+ struct vmcb_control_area ctl;
+ struct vmcb_save_area save;
+ u32 cr0;
+
+ if (kvm_state->format != KVM_STATE_NESTED_FORMAT_SVM)
+ return -EINVAL;
+
+ if (kvm_state->flags & ~(KVM_STATE_NESTED_GUEST_MODE |
+ KVM_STATE_NESTED_RUN_PENDING |
+ KVM_STATE_NESTED_GIF_SET))
+ return -EINVAL;
+
+ /*
+ * If in guest mode, vcpu->arch.efer actually refers to the L2 guest's
+ * EFER.SVME, but EFER.SVME still has to be 1 for VMRUN to succeed.
+ */
+ if (!(vcpu->arch.efer & EFER_SVME)) {
+ /* GIF=1 and no guest mode are required if SVME=0. */
+ if (kvm_state->flags != KVM_STATE_NESTED_GIF_SET)
+ return -EINVAL;
+ }
+
+ /* SMM temporarily disables SVM, so we cannot be in guest mode. */
+ if (is_smm(vcpu) && (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE))
+ return -EINVAL;
+
+ if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) {
+ svm_leave_nested(svm);
+ goto out_set_gif;
+ }
+
+ if (!page_address_valid(vcpu, kvm_state->hdr.svm.vmcb_pa))
+ return -EINVAL;
+ if (kvm_state->size < sizeof(*kvm_state) + KVM_STATE_NESTED_SVM_VMCB_SIZE)
+ return -EINVAL;
+ if (copy_from_user(&ctl, &user_vmcb->control, sizeof(ctl)))
+ return -EFAULT;
+ if (copy_from_user(&save, &user_vmcb->save, sizeof(save)))
+ return -EFAULT;
+
+ if (!nested_vmcb_check_controls(&ctl))
+ return -EINVAL;
+
+ /*
+ * Processor state contains L2 state. Check that it is
+ * valid for guest mode (see nested_vmcb_checks).
+ */
+ cr0 = kvm_read_cr0(vcpu);
+ if (((cr0 & X86_CR0_CD) == 0) && (cr0 & X86_CR0_NW))
+ return -EINVAL;
+
+ /*
+ * Validate host state saved from before VMRUN (see
+ * nested_svm_check_permissions).
+ * TODO: validate reserved bits for all saved state.
+ */
+ if (!(save.cr0 & X86_CR0_PG))
+ return -EINVAL;
+
+ /*
+ * All checks done, we can enter guest mode. L1 control fields
+ * come from the nested save state. Guest state is already
+ * in the registers, the save area of the nested state instead
+ * contains saved L1 state.
+ */
+ copy_vmcb_control_area(&hsave->control, &svm->vmcb->control);
+ hsave->save = save;
+
+ svm->nested.vmcb = kvm_state->hdr.svm.vmcb_pa;
+ load_nested_vmcb_control(svm, &ctl);
+ nested_prepare_vmcb_control(svm);
+
+out_set_gif:
+ svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
+ return 0;
+}
+
struct kvm_x86_nested_ops svm_nested_ops = {
.check_events = svm_check_nested_events,
+ .get_state = svm_get_nested_state,
+ .set_state = svm_set_nested_state,
};
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c1e746ccd7da..f6a13685baa0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1193,6 +1193,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
svm->avic_is_running = true;
svm->nested.hsave = page_address(hsave_page);
+ clear_page(svm->nested.hsave);
svm->msrpm = page_address(msrpm_pages);
svm_vcpu_init_msrpm(svm->msrpm);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 51ebb60e1533..106fc6fceb97 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -437,11 +437,6 @@ static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
}
}
-static bool page_address_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
-{
- return PAGE_ALIGNED(gpa) && !(gpa >> cpuid_maxphyaddr(vcpu));
-}
-
static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e03543488d37..3bb34d53460f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4628,7 +4628,8 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (kvm_state.flags &
~(KVM_STATE_NESTED_RUN_PENDING | KVM_STATE_NESTED_GUEST_MODE
- | KVM_STATE_NESTED_EVMCS | KVM_STATE_NESTED_MTF_PENDING))
+ | KVM_STATE_NESTED_EVMCS | KVM_STATE_NESTED_MTF_PENDING
+ | KVM_STATE_NESTED_GIF_SET))
break;
/* nested_run_pending implies guest_mode. */
--
2.26.2
According to the AMD manual, the effect of turning off EFER.SVME while a
guest is running is undefined. We make it leave guest mode immediately,
similar to the effect of clearing the VMX bit in MSR_IA32_FEAT_CTL.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 16 ++++++++++++++++
arch/x86/kvm/svm/svm.c | 10 ++++++++--
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index a504963ef532..e63e62d12acd 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -618,6 +618,22 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
return 0;
}
+/*
+ * Forcibly leave nested mode in order to be able to reset the VCPU later on.
+ */
+void svm_leave_nested(struct vcpu_svm *svm)
+{
+ if (is_guest_mode(&svm->vcpu)) {
+ struct vmcb *hsave = svm->nested.hsave;
+ struct vmcb *vmcb = svm->vmcb;
+
+ svm->nested.nested_run_pending = 0;
+ leave_guest_mode(&svm->vcpu);
+ copy_vmcb_control_area(&vmcb->control, &hsave->control);
+ nested_svm_uninit_mmu_context(&svm->vcpu);
+ }
+}
+
static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
{
u32 offset, msr, value;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f0da6f0ebf1b..c1e746ccd7da 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -265,6 +265,7 @@ static int get_npt_level(struct kvm_vcpu *vcpu)
void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
{
+ struct vcpu_svm *svm = to_svm(vcpu);
vcpu->arch.efer = efer;
if (!npt_enabled) {
@@ -275,8 +276,13 @@ void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
efer &= ~EFER_LME;
}
- to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
- mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+ if (!(efer & EFER_SVME)) {
+ svm_leave_nested(svm);
+ svm_set_gif(svm, true);
+ }
+
+ svm->vmcb->save.efer = efer | EFER_SVME;
+ mark_dirty(svm->vmcb, VMCB_CR);
}
static int is_external_interrupt(u32 info)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index be8e830f83fa..6ac4c00a5d82 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -389,6 +389,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
struct vmcb *nested_vmcb);
+void svm_leave_nested(struct vcpu_svm *svm);
int nested_svm_vmrun(struct vcpu_svm *svm);
void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb);
int nested_svm_vmexit(struct vcpu_svm *svm);
--
2.26.2
All events now result in a nested vmexit being injected before vmentry,
rather than after the vmexit. Therefore, exit_required is never set anymore
and we can remove it.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 3 +--
arch/x86/kvm/svm/svm.c | 14 --------------
arch/x86/kvm/svm/svm.h | 3 ---
3 files changed, 1 insertion(+), 19 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 87fc5879dfaa..bbf991cfe24b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -792,8 +792,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
bool block_nested_events =
- kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required ||
- svm->nested.nested_run_pending;
+ kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
if (vcpu->arch.exception.pending) {
if (block_nested_events)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 251c457820a1..270061fa6cfa 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2889,13 +2889,6 @@ static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
if (npt_enabled)
vcpu->arch.cr3 = svm->vmcb->save.cr3;
- if (unlikely(svm->nested.exit_required)) {
- nested_svm_vmexit(svm);
- svm->nested.exit_required = false;
-
- return 1;
- }
-
if (is_guest_mode(vcpu)) {
int vmexit;
@@ -3327,13 +3320,6 @@ static fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
- /*
- * A vmexit emulation is required before the vcpu can be executed
- * again.
- */
- if (unlikely(svm->nested.exit_required))
- return EXIT_FASTPATH_NONE;
-
/*
* Disable singlestep if we're injecting an interrupt/exception.
* We don't want our modified rflags to be pushed on the stack where
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 8342032291fc..89fab75dd4f5 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -95,9 +95,6 @@ struct nested_state {
u64 vmcb_msrpm;
u64 vmcb_iopm;
- /* A VMEXIT is required but not yet emulated */
- bool exit_required;
-
/* A VMRUN has started but has not yet been performed, so
* we cannot inject a nested vmexit yet. */
bool nested_run_pending;
--
2.26.2
There is only one GIF flag for the whole processor, so make sure it is
not clobbered when switching to L2 (in which case we also have to include
the V_GIF_ENABLE_MASK, lest we confuse enable_gif/disable_gif/gif_set).
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f5956ecfeeac..607531db8a94 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -293,6 +293,7 @@ static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_v
static void nested_prepare_vmcb_control(struct vcpu_svm *svm)
{
+ const u32 mask = V_INTR_MASKING_MASK | V_GIF_ENABLE_MASK | V_GIF_MASK;
if (svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
nested_svm_init_mmu_context(&svm->vcpu);
@@ -308,7 +309,10 @@ static void nested_prepare_vmcb_control(struct vcpu_svm *svm)
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
svm->vcpu.arch.l1_tsc_offset + svm->nested.ctl.tsc_offset;
- svm->vmcb->control.int_ctl = svm->nested.ctl.int_ctl | V_INTR_MASKING_MASK;
+ svm->vmcb->control.int_ctl =
+ (svm->nested.ctl.int_ctl & ~mask) |
+ (svm->nested.hsave->control.int_ctl & mask);
+
svm->vmcb->control.virt_ext = svm->nested.ctl.virt_ext;
svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
svm->vmcb->control.int_state = svm->nested.ctl.int_state;
--
2.26.2
The authoritative state does not come from the VMCB once in guest mode,
but KVM_SET_NESTED_STATE can still perform checks on L1's provided SVM
controls because we get them from userspace.
Therefore, split out a function to do them.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f3e70aee18cd..a504963ef532 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -203,26 +203,31 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
return true;
}
-static bool nested_vmcb_checks(struct vmcb *vmcb)
+static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
{
- if ((vmcb->save.efer & EFER_SVME) == 0)
+ if ((control->intercept & (1ULL << INTERCEPT_VMRUN)) == 0)
return false;
- if (((vmcb->save.cr0 & X86_CR0_CD) == 0) &&
- (vmcb->save.cr0 & X86_CR0_NW))
+ if (control->asid == 0)
return false;
- if ((vmcb->control.intercept & (1ULL << INTERCEPT_VMRUN)) == 0)
+ if ((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) &&
+ !npt_enabled)
return false;
- if (vmcb->control.asid == 0)
+ return true;
+}
+
+static bool nested_vmcb_checks(struct vmcb *vmcb)
+{
+ if ((vmcb->save.efer & EFER_SVME) == 0)
return false;
- if ((vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE) &&
- !npt_enabled)
+ if (((vmcb->save.cr0 & X86_CR0_CD) == 0) &&
+ (vmcb->save.cr0 & X86_CR0_NW))
return false;
- return true;
+ return nested_vmcb_check_controls(&vmcb->control);
}
static void load_nested_vmcb_control(struct vcpu_svm *svm,
--
2.26.2
The control state changes on every L2->L0 vmexit, and we will have to
serialize it in the nested state. So keep it up to date in svm->nested.ctl
and just copy it back to the nested VMCB in nested_svm_vmexit.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 57 ++++++++++++++++++++++-----------------
arch/x86/kvm/svm/svm.c | 5 +++-
arch/x86/kvm/svm/svm.h | 1 +
3 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7f739f9a797c..4355286b2726 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -234,6 +234,34 @@ static void load_nested_vmcb_control(struct vcpu_svm *svm,
svm->nested.ctl.iopm_base_pa &= ~0x0fffULL;
}
+/*
+ * Synchronize fields that are written by the processor, so that
+ * they can be copied back into the nested_vmcb.
+ */
+void sync_nested_vmcb_control(struct vcpu_svm *svm)
+{
+ u32 mask;
+ svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
+ svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
+
+ /* Only a few fields of int_ctl are written by the processor. */
+ mask = V_IRQ_MASK | V_TPR_MASK;
+ if (!(svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK) &&
+ is_intercept(svm, SVM_EXIT_VINTR)) {
+ /*
+ * In order to request an interrupt window, L0 is usurping
+ * svm->vmcb->control.int_ctl and possibly setting V_IRQ
+ * even if it was clear in L1's VMCB. Restoring it would be
+ * wrong. However, in this case V_IRQ will remain true until
+ * interrupt_window_interception calls svm_clear_vintr and
+ * restores int_ctl. We can just leave it aside.
+ */
+ mask &= ~V_IRQ_MASK;
+ }
+ svm->nested.ctl.int_ctl &= ~mask;
+ svm->nested.ctl.int_ctl |= svm->vmcb->control.int_ctl & mask;
+}
+
static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
{
/* Load the nested guest state */
@@ -471,6 +499,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
/* Exit Guest-Mode */
leave_guest_mode(&svm->vcpu);
svm->nested.vmcb = 0;
+ WARN_ON_ONCE(svm->nested.nested_run_pending);
/* in case we halted in L2 */
svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
@@ -497,8 +526,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
nested_vmcb->save.dr6 = svm->vcpu.arch.dr6;
nested_vmcb->save.cpl = vmcb->save.cpl;
- nested_vmcb->control.int_ctl = vmcb->control.int_ctl;
- nested_vmcb->control.int_vector = vmcb->control.int_vector;
nested_vmcb->control.int_state = vmcb->control.int_state;
nested_vmcb->control.exit_code = vmcb->control.exit_code;
nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi;
@@ -510,34 +537,16 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
if (svm->nrips_enabled)
nested_vmcb->control.next_rip = vmcb->control.next_rip;
- /*
- * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
- * to make sure that we do not lose injected events. So check event_inj
- * here and copy it to exit_int_info if it is valid.
- * Exit_int_info and event_inj can't be both valid because the case
- * below only happens on a VMRUN instruction intercept which has
- * no valid exit_int_info set.
- */
- if (vmcb->control.event_inj & SVM_EVTINJ_VALID) {
- struct vmcb_control_area *nc = &nested_vmcb->control;
-
- nc->exit_int_info = vmcb->control.event_inj;
- nc->exit_int_info_err = vmcb->control.event_inj_err;
- }
-
- nested_vmcb->control.tlb_ctl = 0;
- nested_vmcb->control.event_inj = 0;
- nested_vmcb->control.event_inj_err = 0;
+ nested_vmcb->control.int_ctl = svm->nested.ctl.int_ctl;
+ nested_vmcb->control.tlb_ctl = svm->nested.ctl.tlb_ctl;
+ nested_vmcb->control.event_inj = svm->nested.ctl.event_inj;
+ nested_vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
nested_vmcb->control.pause_filter_count =
svm->vmcb->control.pause_filter_count;
nested_vmcb->control.pause_filter_thresh =
svm->vmcb->control.pause_filter_thresh;
- /* We always set V_INTR_MASKING and remember the old value in hflags */
- if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK))
- nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK;
-
/* Restore the original control entries */
copy_vmcb_control_area(&vmcb->control, &hsave->control);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c213011a82e3..97dbd6f65831 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3427,7 +3427,10 @@ static fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
sync_cr8_to_lapic(vcpu);
svm->next_rip = 0;
- svm->nested.nested_run_pending = 0;
+ if (is_guest_mode(&svm->vcpu)) {
+ sync_nested_vmcb_control(svm);
+ svm->nested.nested_run_pending = 0;
+ }
svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index dd5418f20256..7e79f0af1204 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -394,6 +394,7 @@ int nested_svm_check_permissions(struct vcpu_svm *svm);
int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
bool has_error_code, u32 error_code);
int nested_svm_exit_special(struct vcpu_svm *svm);
+void sync_nested_vmcb_control(struct vcpu_svm *svm);
extern struct kvm_x86_nested_ops svm_nested_ops;
--
2.26.2
Everything that is needed during nested state restore is now part of
nested_prepare_vmcb_control.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index af7c9ce8c5ad..7fbd7aaa4ce0 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -247,9 +247,6 @@ static void load_nested_vmcb_control(struct vcpu_svm *svm,
static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
{
- if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
- nested_svm_init_mmu_context(&svm->vcpu);
-
/* Load the nested guest state */
svm->vmcb->save.es = nested_vmcb->save.es;
svm->vmcb->save.cs = nested_vmcb->save.cs;
@@ -263,9 +260,6 @@ static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_v
svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4);
(void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3);
- /* Guest paging mode is active - reset mmu */
- kvm_mmu_reset_context(&svm->vcpu);
-
svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2;
kvm_rax_write(&svm->vcpu, nested_vmcb->save.rax);
kvm_rsp_write(&svm->vcpu, nested_vmcb->save.rsp);
@@ -282,6 +276,12 @@ static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_v
static void nested_prepare_vmcb_control(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
{
+ if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
+ nested_svm_init_mmu_context(&svm->vcpu);
+
+ /* Guest paging mode is active - reset mmu */
+ kvm_mmu_reset_context(&svm->vcpu);
+
svm_flush_tlb(&svm->vcpu);
if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
svm->vcpu.arch.hflags |= HF_VINTR_MASK;
--
2.26.2
kvm_vcpu_apicv_active must be false when nested virtualization is enabled,
so there is no need to check it in clgi_interception.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/svm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 97dbd6f65831..0654c5672b1a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2015,8 +2015,7 @@ static int clgi_interception(struct vcpu_svm *svm)
disable_gif(svm);
/* After a CLGI no interrupts should come */
- if (!kvm_vcpu_apicv_active(&svm->vcpu))
- svm_clear_vintr(svm);
+ svm_clear_vintr(svm);
return ret;
}
--
2.26.2
Split out filling svm->vmcb.save and svm->vmcb.control before VMRUN.
Only the latter will be useful when restoring nested SVM state.
This patch introduces no semantic change, so the MMU setup is still
done in nested_prepare_vmcb_save; the next patch will clean things up.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 40 +++++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ea00dcd8c39b..af7c9ce8c5ad 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -245,21 +245,8 @@ static void load_nested_vmcb_control(struct vcpu_svm *svm,
svm->vcpu.arch.tsc_offset += control->tsc_offset;
}
-void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
- struct vmcb *nested_vmcb)
+static void nested_prepare_vmcb_save(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
{
- bool evaluate_pending_interrupts =
- is_intercept(svm, INTERCEPT_VINTR) ||
- is_intercept(svm, INTERCEPT_IRET);
-
- svm->nested.vmcb = vmcb_gpa;
- if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
- svm->vcpu.arch.hflags |= HF_HIF_MASK;
- else
- svm->vcpu.arch.hflags &= ~HF_HIF_MASK;
-
- load_nested_vmcb_control(svm, &nested_vmcb->control);
-
if (nested_vmcb->control.nested_ctl & SVM_NESTED_CTL_NP_ENABLE)
nested_svm_init_mmu_context(&svm->vcpu);
@@ -291,7 +278,10 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm->vmcb->save.dr7 = nested_vmcb->save.dr7;
svm->vcpu.arch.dr6 = nested_vmcb->save.dr6;
svm->vmcb->save.cpl = nested_vmcb->save.cpl;
+}
+static void nested_prepare_vmcb_control(struct vcpu_svm *svm, struct vmcb *nested_vmcb)
+{
svm_flush_tlb(&svm->vcpu);
if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -321,6 +311,26 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
*/
recalc_intercepts(svm);
+ mark_all_dirty(svm->vmcb);
+}
+
+void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
+ struct vmcb *nested_vmcb)
+{
+ bool evaluate_pending_interrupts =
+ is_intercept(svm, INTERCEPT_VINTR) ||
+ is_intercept(svm, INTERCEPT_IRET);
+
+ svm->nested.vmcb = vmcb_gpa;
+ if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
+ svm->vcpu.arch.hflags |= HF_HIF_MASK;
+ else
+ svm->vcpu.arch.hflags &= ~HF_HIF_MASK;
+
+ load_nested_vmcb_control(svm, &nested_vmcb->control);
+ nested_prepare_vmcb_save(svm, nested_vmcb);
+ nested_prepare_vmcb_control(svm, nested_vmcb);
+
/*
* If L1 had a pending IRQ/NMI before executing VMRUN,
* which wasn't delivered because it was disallowed (e.g.
@@ -336,8 +346,6 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
enable_gif(svm);
if (unlikely(evaluate_pending_interrupts))
kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
-
- mark_all_dirty(svm->vmcb);
}
int nested_svm_vmrun(struct vcpu_svm *svm)
--
2.26.2
This will come in handy when we put a struct vmcb_control_area in
svm->nested.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 5a9d131a153e..0dfd9c8c2ca7 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -141,11 +141,9 @@ void recalc_intercepts(struct vcpu_svm *svm)
c->intercept |= g->intercept;
}
-static void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb)
+static void copy_vmcb_control_area(struct vmcb_control_area *dst,
+ struct vmcb_control_area *from)
{
- struct vmcb_control_area *dst = &dst_vmcb->control;
- struct vmcb_control_area *from = &from_vmcb->control;
-
dst->intercept_cr = from->intercept_cr;
dst->intercept_dr = from->intercept_dr;
dst->intercept_exceptions = from->intercept_exceptions;
@@ -419,7 +417,7 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
else
hsave->save.cr3 = kvm_read_cr3(&svm->vcpu);
- copy_vmcb_control_area(hsave, vmcb);
+ copy_vmcb_control_area(&hsave->control, &vmcb->control);
svm->nested.nested_run_pending = 1;
enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb);
@@ -550,7 +548,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK;
/* Restore the original control entries */
- copy_vmcb_control_area(vmcb, hsave);
+ copy_vmcb_control_area(&vmcb->control, &hsave->control);
svm->vmcb->control.tsc_offset = svm->vcpu.arch.tsc_offset =
svm->vcpu.arch.l1_tsc_offset;
--
2.26.2
This allows exceptions injected by the emulator to be properly delivered
as vmexits. The code also becomes simpler, because we can just let all
L0-intercepted exceptions go through the usual path. In particular, our
emulation of the VMX #DB exit qualification is very much simplified,
because the vmexit injection path can use kvm_deliver_exception_payload
to update DR6.
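In short, the new flow is roughly the following (a condensed sketch of the
svm_check_nested_events() hunk below, not a standalone excerpt):

	/* Evaluated among pending events, before entering L2. */
	if (vcpu->arch.exception.pending) {
		if (block_nested_events)
			return -EBUSY;	/* retry after re-injection completes */
		if (!nested_exit_on_exception(svm))
			return 0;	/* L1 does not intercept this vector */
		/* Synthesize SVM_EXIT_EXCP_BASE + nr, with DR6/DR7 fixups for #DB. */
		nested_svm_inject_exception_vmexit(svm);
		return 0;
	}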
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/svm/nested.c | 136 +++++++++++++-------------------
arch/x86/kvm/svm/svm.c | 9 ---
arch/x86/kvm/svm/svm.h | 1 +
arch/x86/kvm/x86.c | 13 +--
5 files changed, 58 insertions(+), 103 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7707bd4b0593..e6f2e1a2dab6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1495,6 +1495,8 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
void kvm_inject_nmi(struct kvm_vcpu *vcpu);
+void kvm_update_dr7(struct kvm_vcpu *vcpu);
+
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d544cce4f964..87fc5879dfaa 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -111,6 +111,8 @@ void recalc_intercepts(struct vcpu_svm *svm)
h = &svm->nested.hsave->control;
g = &svm->nested;
+ svm->nested.host_intercept_exceptions = h->intercept_exceptions;
+
c->intercept_cr = h->intercept_cr;
c->intercept_dr = h->intercept_dr;
c->intercept_exceptions = h->intercept_exceptions;
@@ -616,50 +618,6 @@ static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
}
-/* DB exceptions for our internal use must not cause vmexit */
-static int nested_svm_intercept_db(struct vcpu_svm *svm)
-{
- unsigned long dr6 = svm->vmcb->save.dr6;
-
- /* Always catch it and pass it to userspace if debugging. */
- if (svm->vcpu.guest_debug &
- (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))
- return NESTED_EXIT_HOST;
-
- /* if we're not singlestepping, it's not ours */
- if (!svm->nmi_singlestep)
- goto reflected_db;
-
- /* if it's not a singlestep exception, it's not ours */
- if (!(dr6 & DR6_BS))
- goto reflected_db;
-
- /* if the guest is singlestepping, it should get the vmexit */
- if (svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF) {
- disable_nmi_singlestep(svm);
- goto reflected_db;
- }
-
- /* it's ours, the nested hypervisor must not see this one */
- return NESTED_EXIT_HOST;
-
-reflected_db:
- /*
- * Synchronize guest DR6 here just like in kvm_deliver_exception_payload;
- * it will be moved into the nested VMCB by nested_svm_vmexit. Once
- * exceptions will be moved to svm_check_nested_events, all this stuff
- * will just go away and we could just return NESTED_EXIT_HOST
- * unconditionally. db_interception will queue the exception, which
- * will be processed by svm_check_nested_events if a nested vmexit is
- * required, and we will just use kvm_deliver_exception_payload to copy
- * the payload to DR6 before vmexit.
- */
- WARN_ON(svm->vcpu.arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT);
- svm->vcpu.arch.dr6 &= ~(DR_TRAP_BITS | DR6_RTM);
- svm->vcpu.arch.dr6 |= dr6 & ~DR6_FIXED_1;
- return NESTED_EXIT_DONE;
-}
-
static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
{
unsigned port, size, iopm_len;
@@ -710,20 +668,12 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
break;
}
case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
- u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
- if (svm->nested.intercept_exceptions & excp_bits) {
- if (exit_code == SVM_EXIT_EXCP_BASE + DB_VECTOR)
- vmexit = nested_svm_intercept_db(svm);
- else if (exit_code == SVM_EXIT_EXCP_BASE + BP_VECTOR &&
- svm->vcpu.guest_debug & KVM_GUESTDBG_USE_SW_BP)
- vmexit = NESTED_EXIT_HOST;
- else
- vmexit = NESTED_EXIT_DONE;
- }
- /* async page fault always cause vmexit */
- else if ((exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR) &&
- svm->vcpu.arch.exception.nested_apf != 0)
- vmexit = NESTED_EXIT_DONE;
+ /*
+ * Host-intercepted exceptions have been checked already in
+ * nested_svm_exit_special. There is nothing to do here,
+ * the vmexit is injected by svm_check_nested_events.
+ */
+ vmexit = NESTED_EXIT_DONE;
break;
}
case SVM_EXIT_ERR: {
@@ -768,35 +718,45 @@ int nested_svm_check_permissions(struct vcpu_svm *svm)
return 0;
}
-int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
- bool has_error_code, u32 error_code)
+static bool nested_exit_on_exception(struct vcpu_svm *svm)
{
- int vmexit;
+ unsigned int nr = svm->vcpu.arch.exception.nr;
- if (!is_guest_mode(&svm->vcpu))
- return 0;
+ return (svm->nested.intercept_exceptions & (1 << nr));
+}
- vmexit = nested_svm_intercept(svm);
- if (vmexit != NESTED_EXIT_DONE)
- return 0;
+static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
+{
+ unsigned int nr = svm->vcpu.arch.exception.nr;
svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
svm->vmcb->control.exit_code_hi = 0;
- svm->vmcb->control.exit_info_1 = error_code;
+
+ if (svm->vcpu.arch.exception.has_error_code)
+ svm->vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;
/*
* EXITINFO2 is undefined for all exception intercepts other
* than #PF.
*/
- if (svm->vcpu.arch.exception.nested_apf)
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
- else if (svm->vcpu.arch.exception.has_payload)
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
- else
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
+ if (nr == PF_VECTOR) {
+ if (svm->vcpu.arch.exception.nested_apf)
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
+ else if (svm->vcpu.arch.exception.has_payload)
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
+ else
+ svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
+ } else if (nr == DB_VECTOR) {
+ /* See inject_pending_event. */
+ kvm_deliver_exception_payload(&svm->vcpu);
+ if (svm->vcpu.arch.dr7 & DR7_GD) {
+ svm->vcpu.arch.dr7 &= ~DR7_GD;
+ kvm_update_dr7(&svm->vcpu);
+ }
+ } else
+ WARN_ON(svm->vcpu.arch.exception.has_payload);
- svm->nested.exit_required = true;
- return vmexit;
+ nested_svm_vmexit(svm);
}
static void nested_svm_smi(struct vcpu_svm *svm)
@@ -835,6 +795,15 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
kvm_event_needs_reinjection(vcpu) || svm->nested.exit_required ||
svm->nested.nested_run_pending;
+ if (vcpu->arch.exception.pending) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_exception(svm))
+ return 0;
+ nested_svm_inject_exception_vmexit(svm);
+ return 0;
+ }
+
if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
if (block_nested_events)
return -EBUSY;
@@ -872,18 +841,19 @@ int nested_svm_exit_special(struct vcpu_svm *svm)
switch (exit_code) {
case SVM_EXIT_INTR:
case SVM_EXIT_NMI:
- case SVM_EXIT_EXCP_BASE + MC_VECTOR:
- return NESTED_EXIT_HOST;
case SVM_EXIT_NPF:
- /* For now we are always handling NPFs when using them */
- if (npt_enabled)
+ return NESTED_EXIT_HOST;
+ case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
+ u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
+
+ if (get_host_vmcb(svm)->control.intercept_exceptions & excp_bits)
return NESTED_EXIT_HOST;
- break;
- case SVM_EXIT_EXCP_BASE + PF_VECTOR:
- /* Trap async PF even if not shadowing */
- if (!npt_enabled || svm->vcpu.arch.apf.host_apf_reason)
+ else if (exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR &&
+ svm->vcpu.arch.apf.host_apf_reason)
+ /* Trap async PF even if not shadowing */
return NESTED_EXIT_HOST;
break;
+ }
default:
break;
}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9ac9963405b5..251c457820a1 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -331,17 +331,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
struct vcpu_svm *svm = to_svm(vcpu);
unsigned nr = vcpu->arch.exception.nr;
bool has_error_code = vcpu->arch.exception.has_error_code;
- bool reinject = vcpu->arch.exception.injected;
u32 error_code = vcpu->arch.exception.error_code;
- /*
- * If we are within a nested VM we'd better #VMEXIT and let the guest
- * handle the exception
- */
- if (!reinject &&
- nested_svm_check_exception(svm, nr, has_error_code, error_code))
- return;
-
kvm_deliver_exception_payload(&svm->vcpu);
if (nr == BP_VECTOR && !nrips) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 5cc559ab862d..8342032291fc 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -86,6 +86,7 @@ struct nested_state {
u64 hsave_msr;
u64 vm_cr_msr;
u64 vmcb;
+ u32 host_intercept_exceptions;
/* These are the merged vectors */
u32 *msrpm;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 192238841cac..e03543488d37 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1072,7 +1072,7 @@ static void kvm_update_dr0123(struct kvm_vcpu *vcpu)
}
}
-static void kvm_update_dr7(struct kvm_vcpu *vcpu)
+void kvm_update_dr7(struct kvm_vcpu *vcpu)
{
unsigned long dr7;
@@ -1085,6 +1085,7 @@ static void kvm_update_dr7(struct kvm_vcpu *vcpu)
if (dr7 & DR7_BP_EN_MASK)
vcpu->arch.switch_db_regs |= KVM_DEBUGREG_BP_ENABLED;
}
+EXPORT_SYMBOL_GPL(kvm_update_dr7);
static u64 kvm_dr6_fixed(struct kvm_vcpu *vcpu)
{
@@ -7774,16 +7775,6 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
X86_EFLAGS_RF);
if (vcpu->arch.exception.nr == DB_VECTOR) {
- /*
- * This code assumes that nSVM doesn't use
- * check_nested_events(). If it does, the
- * DR6/DR7 changes should happen before L1
- * gets a #VMEXIT for an intercepted #DB in
- * L2. (Under VMX, on the other hand, the
- * DR6/DR7 changes should not happen in the
- * event of a VM-exit to L1 for an intercepted
- * #DB in L2.)
- */
kvm_deliver_exception_payload(vcpu);
if (vcpu->arch.dr7 & DR7_GD) {
vcpu->arch.dr7 &= ~DR7_GD;
--
2.26.2
Unmapping the nested VMCB in enter_svm_guest_mode is a bit of a wart,
since the map is not used elsewhere in the function. There are
just two callers, so move the unmapping into them.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 14 ++++++--------
arch/x86/kvm/svm/svm.c | 3 ++-
arch/x86/kvm/svm/svm.h | 2 +-
3 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 81e0fbd5e267..d2fc3c4aa00b 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -229,7 +229,7 @@ static bool nested_vmcb_checks(struct vmcb *vmcb)
}
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
- struct vmcb *nested_vmcb, struct kvm_host_map *map)
+ struct vmcb *nested_vmcb)
{
bool evaluate_pending_interrupts =
is_intercept(svm, INTERCEPT_VINTR) ||
@@ -304,8 +304,6 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm->vmcb->control.pause_filter_thresh =
nested_vmcb->control.pause_filter_thresh;
- kvm_vcpu_unmap(&svm->vcpu, map, true);
-
/* Enter Guest-Mode */
enter_guest_mode(&svm->vcpu);
@@ -368,10 +366,7 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
nested_vmcb->control.exit_code_hi = 0;
nested_vmcb->control.exit_info_1 = 0;
nested_vmcb->control.exit_info_2 = 0;
-
- kvm_vcpu_unmap(&svm->vcpu, &map, true);
-
- return ret;
+ goto out;
}
trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb_gpa,
@@ -414,7 +409,7 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
copy_vmcb_control_area(hsave, vmcb);
svm->nested.nested_run_pending = 1;
- enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map);
+ enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb);
if (!nested_svm_vmrun_msrpm(svm)) {
svm->vmcb->control.exit_code = SVM_EXIT_ERR;
@@ -425,6 +420,9 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
nested_svm_vmexit(svm);
}
+out:
+ kvm_vcpu_unmap(&svm->vcpu, &map, true);
+
return ret;
}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index abe277a3216b..3321c402a753 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3814,7 +3814,8 @@ static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb), &map) == -EINVAL)
return 1;
nested_vmcb = map.hva;
- enter_svm_guest_mode(svm, vmcb, nested_vmcb, &map);
+ enter_svm_guest_mode(svm, vmcb, nested_vmcb);
+ kvm_vcpu_unmap(&svm->vcpu, &map, true);
}
return 0;
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 89fab75dd4f5..33e3f09d7a8e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -395,7 +395,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
}
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
- struct vmcb *nested_vmcb, struct kvm_host_map *map);
+ struct vmcb *nested_vmcb);
int nested_svm_vmrun(struct vcpu_svm *svm);
void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb);
int nested_svm_vmexit(struct vcpu_svm *svm);
--
2.26.2
The test is similar to the existing one for VMX, but simpler because we
don't have to test shadow VMCS or vmptrld/vmptrst/vmclear.
Signed-off-by: Paolo Bonzini <[email protected]>
---
.../testing/selftests/kvm/x86_64/state_test.c | 69 +++++++++++++++----
1 file changed, 57 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/state_test.c b/tools/testing/selftests/kvm/x86_64/state_test.c
index 5b1a016edf55..af8b6df6a13e 100644
--- a/tools/testing/selftests/kvm/x86_64/state_test.c
+++ b/tools/testing/selftests/kvm/x86_64/state_test.c
@@ -18,14 +18,46 @@
#include "kvm_util.h"
#include "processor.h"
#include "vmx.h"
+#include "svm_util.h"
#define VCPU_ID 5
+#define L2_GUEST_STACK_SIZE 256
-void l2_guest_code(void)
+void svm_l2_guest_code(void)
+{
+ GUEST_SYNC(4);
+ /* Exit to L1 */
+ vmcall();
+ GUEST_SYNC(6);
+ /* Done, exit to L1 and never come back. */
+ vmcall();
+}
+
+static void svm_l1_guest_code(struct svm_test_data *svm)
+{
+ unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+ struct vmcb *vmcb = svm->vmcb;
+
+ GUEST_ASSERT(svm->vmcb_gpa);
+ /* Prepare for L2 execution. */
+ generic_svm_setup(svm, svm_l2_guest_code,
+ &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+ GUEST_SYNC(3);
+ run_guest(vmcb, svm->vmcb_gpa);
+ GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+ GUEST_SYNC(5);
+ vmcb->save.rip += 3;
+ run_guest(vmcb, svm->vmcb_gpa);
+ GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+ GUEST_SYNC(7);
+}
+
+void vmx_l2_guest_code(void)
{
GUEST_SYNC(6);
- /* Exit to L1 */
+ /* Exit to L1 */
vmcall();
/* L1 has now set up a shadow VMCS for us. */
@@ -42,10 +74,9 @@ void l2_guest_code(void)
vmcall();
}
-void l1_guest_code(struct vmx_pages *vmx_pages)
+static void vmx_l1_guest_code(struct vmx_pages *vmx_pages)
{
-#define L2_GUEST_STACK_SIZE 64
- unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+ unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
GUEST_ASSERT(vmx_pages->vmcs_gpa);
GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
@@ -56,7 +87,7 @@ void l1_guest_code(struct vmx_pages *vmx_pages)
GUEST_SYNC(4);
GUEST_ASSERT(vmptrstz() == vmx_pages->vmcs_gpa);
- prepare_vmcs(vmx_pages, l2_guest_code,
+ prepare_vmcs(vmx_pages, vmx_l2_guest_code,
&l2_guest_stack[L2_GUEST_STACK_SIZE]);
GUEST_SYNC(5);
@@ -106,20 +137,31 @@ void l1_guest_code(struct vmx_pages *vmx_pages)
GUEST_ASSERT(vmresume());
}
-void guest_code(struct vmx_pages *vmx_pages)
+static u32 cpuid_ecx(u32 eax)
+{
+ u32 ecx;
+ asm volatile("cpuid" : "=a" (eax), "=c" (ecx) : "0" (eax) : "ebx", "edx");
+ return ecx;
+}
+
+static void __attribute__((__flatten__)) guest_code(void *arg)
{
GUEST_SYNC(1);
GUEST_SYNC(2);
- if (vmx_pages)
- l1_guest_code(vmx_pages);
+ if (arg) {
+ if (cpuid_ecx(0x80000001) & CPUID_SVM)
+ svm_l1_guest_code(arg);
+ else
+ vmx_l1_guest_code(arg);
+ }
GUEST_DONE();
}
int main(int argc, char *argv[])
{
- vm_vaddr_t vmx_pages_gva = 0;
+ vm_vaddr_t nested_gva = 0;
struct kvm_regs regs1, regs2;
struct kvm_vm *vm;
@@ -136,8 +178,11 @@ int main(int argc, char *argv[])
vcpu_regs_get(vm, VCPU_ID, &regs1);
if (kvm_check_cap(KVM_CAP_NESTED_STATE)) {
- vcpu_alloc_vmx(vm, &vmx_pages_gva);
- vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
+ if (kvm_get_supported_cpuid_entry(0x80000001)->ecx & CPUID_SVM)
+ vcpu_alloc_svm(vm, &nested_gva);
+ else
+ vcpu_alloc_vmx(vm, &nested_gva);
+ vcpu_args_set(vm, VCPU_ID, 1, nested_gva);
} else {
pr_info("will skip nested state checks\n");
vcpu_args_set(vm, VCPU_ID, 1, 0);
--
2.26.2
svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
an optimization, but this is only correct before the nested vmentry.
If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
already been put in guest mode, the value of CR3 will not be updated.
Remove the optimization, which almost never triggers anyway.
This was added in commit 689f3bf21628 ("KVM: x86: unify callbacks
to load paging root", 2020-03-16) just to keep the two vendor-specific
modules closer, but we'll fix VMX too.
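The failure mode is easiest to see from the userspace side. This is only an
illustrative sketch (vcpu_fd and new_cr3 are hypothetical, error handling
omitted), not taken from an actual test:

	/* The vCPU is already in guest mode, e.g. after KVM_SET_NESTED_STATE. */
	struct kvm_sregs sregs;

	ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
	sregs.cr3 = new_cr3;
	ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
	/*
	 * Before this patch, svm_load_mmu_pgd() skipped the VMCB update because
	 * is_guest_mode() was true, so L2 kept running with the stale CR3.
	 */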
Fixes: 689f3bf21628 ("KVM: x86: unify callbacks to load paging root")
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 6 +-----
arch/x86/kvm/svm/svm.c | 16 +++++-----------
2 files changed, 6 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 166b88fc9509..81e0fbd5e267 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -256,11 +256,7 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm_set_efer(&svm->vcpu, nested_vmcb->save.efer);
svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0);
svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4);
- if (npt_enabled) {
- svm->vmcb->save.cr3 = nested_vmcb->save.cr3;
- svm->vcpu.arch.cr3 = nested_vmcb->save.cr3;
- } else
- (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3);
+ (void)kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3);
/* Guest paging mode is active - reset mmu */
kvm_mmu_reset_context(&svm->vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 270061fa6cfa..abe277a3216b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3447,7 +3447,6 @@ static fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root)
{
struct vcpu_svm *svm = to_svm(vcpu);
- bool update_guest_cr3 = true;
unsigned long cr3;
cr3 = __sme_set(root);
@@ -3456,18 +3455,13 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long root)
mark_dirty(svm->vmcb, VMCB_NPT);
/* Loading L2's CR3 is handled by enter_svm_guest_mode. */
- if (is_guest_mode(vcpu))
- update_guest_cr3 = false;
- else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
- cr3 = vcpu->arch.cr3;
- else /* CR3 is already up-to-date. */
- update_guest_cr3 = false;
+ if (!test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
+ return;
+ cr3 = vcpu->arch.cr3;
}
- if (update_guest_cr3) {
- svm->vmcb->save.cr3 = cr3;
- mark_dirty(svm->vmcb, VMCB_CR);
- }
+ svm->vmcb->save.cr3 = cr3;
+ mark_dirty(svm->vmcb, VMCB_CR);
}
static int is_disabled(void)
--
2.26.2
In case an interrupt arrives after nested.check_events but before the
call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
window even if the interrupt is actually going to be a vmexit. This is
useless rather than harmful, but it really complicates reasoning about
SVM's handling of the VINTR intercept. We'd like to never bother with
the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
that case there is no interrupt window and we can just exit the nested
guest whenever we want.
As a first step, this patch moves the opening of the interrupt
window inside inject_pending_event. This consolidates the check for
pending interrupt/NMI/SMI in one place, removing the repeated call to
kvm_cpu_has_injectable_intr.
The main functional change here is that re-injection of still-pending
events will also use req_immediate_exit instead of using interrupt-window
intercepts.
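Each event type now follows the same pattern; the sketch below is condensed
from the x86.c hunk further down (the *_allowed callbacks return -EBUSY, 0 or
a positive value):

	r = can_inject ? kvm_x86_ops.smi_allowed(vcpu, true) : -EBUSY;
	if (r < 0)
		goto busy;				/* request an immediate exit */
	if (r) {
		vcpu->arch.smi_pending = false;
		++vcpu->arch.smi_count;
		enter_smm(vcpu);			/* inject right away */
		can_inject = false;
	} else {
		kvm_x86_ops.enable_smi_window(vcpu);	/* wait for a window-open exit */
	}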
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +--
arch/x86/kvm/svm/svm.c | 24 +++----
arch/x86/kvm/vmx/vmx.c | 20 +++---
arch/x86/kvm/x86.c | 112 +++++++++++++++++---------------
4 files changed, 87 insertions(+), 77 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db261da578f3..7707bd4b0593 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1136,8 +1136,8 @@ struct kvm_x86_ops {
void (*set_nmi)(struct kvm_vcpu *vcpu);
void (*queue_exception)(struct kvm_vcpu *vcpu);
void (*cancel_injection)(struct kvm_vcpu *vcpu);
- bool (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
- bool (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
+ int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
+ int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
bool (*get_nmi_mask)(struct kvm_vcpu *vcpu);
void (*set_nmi_mask)(struct kvm_vcpu *vcpu, bool masked);
void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
@@ -1234,10 +1234,10 @@ struct kvm_x86_ops {
void (*setup_mce)(struct kvm_vcpu *vcpu);
- bool (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
+ int (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
int (*pre_leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
- int (*enable_smi_window)(struct kvm_vcpu *vcpu);
+ void (*enable_smi_window)(struct kvm_vcpu *vcpu);
int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9987f6fe9d88..9ac9963405b5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3053,15 +3053,15 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu)
return ret;
}
-static bool svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
struct vcpu_svm *svm = to_svm(vcpu);
if (svm->nested.nested_run_pending)
- return false;
+ return -EBUSY;
/* An NMI must not be injected into L2 if it's supposed to VM-Exit. */
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(svm))
- return false;
+ return -EBUSY;
return !svm_nmi_blocked(vcpu);
}
@@ -3112,18 +3112,18 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
return (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK);
}
-static bool svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
struct vcpu_svm *svm = to_svm(vcpu);
if (svm->nested.nested_run_pending)
- return false;
+ return -EBUSY;
/*
* An IRQ must not be injected into L2 if it's supposed to VM-Exit,
* e.g. if the IRQ arrived asynchronously after checking nested events.
*/
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(svm))
- return false;
+ return -EBUSY;
return !svm_interrupt_blocked(vcpu);
}
@@ -3793,15 +3793,15 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu)
return is_smm(vcpu);
}
-static bool svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
struct vcpu_svm *svm = to_svm(vcpu);
if (svm->nested.nested_run_pending)
- return false;
+ return -EBUSY;
/* An SMI must not be injected into L2 if it's supposed to VM-Exit. */
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_smi(svm))
- return false;
+ return -EBUSY;
return !svm_smi_blocked(vcpu);
}
@@ -3848,7 +3848,7 @@ static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
return 0;
}
-static int enable_smi_window(struct kvm_vcpu *vcpu)
+static void enable_smi_window(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -3856,9 +3856,9 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
if (vgif_enabled(svm))
set_intercept(svm, INTERCEPT_STGI);
/* STGI will cause a vm exit */
- return 1;
+ } else {
+ /* We must be in SMM; RSM will cause a vmexit anyway. */
}
- return 0;
}
static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 55712dd86baf..aedc46407b1f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4552,14 +4552,14 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
GUEST_INTR_STATE_NMI));
}
-static bool vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
- return false;
+ return -EBUSY;
/* An NMI must not be injected into L2 if it's supposed to VM-Exit. */
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(vcpu))
- return false;
+ return -EBUSY;
return !vmx_nmi_blocked(vcpu);
}
@@ -4574,17 +4574,17 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}
-static bool vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
if (to_vmx(vcpu)->nested.nested_run_pending)
- return false;
+ return -EBUSY;
/*
* An IRQ must not be injected into L2 if it's supposed to VM-Exit,
* e.g. if the IRQ arrived asynchronously after checking nested events.
*/
if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
- return false;
+ return -EBUSY;
return !vmx_interrupt_blocked(vcpu);
}
@@ -7755,11 +7755,11 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
~FEAT_CTL_LMCE_ENABLED;
}
-static bool vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
/* we need a nested vmexit to enter SMM, postpone if run is pending */
if (to_vmx(vcpu)->nested.nested_run_pending)
- return false;
+ return -EBUSY;
return !is_smm(vcpu);
}
@@ -7797,9 +7797,9 @@ static int vmx_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
return 0;
}
-static int enable_smi_window(struct kvm_vcpu *vcpu)
+static void enable_smi_window(struct kvm_vcpu *vcpu)
{
- return 0;
+ /* RSM will cause a vmexit anyway. */
}
static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 064a7ea0e671..192238841cac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7710,7 +7710,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
kvm_x86_ops.update_cr8_intercept(vcpu, tpr, max_irr);
}
-static int inject_pending_event(struct kvm_vcpu *vcpu)
+static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
{
int r;
bool can_inject = true;
@@ -7756,8 +7756,8 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
*/
if (is_guest_mode(vcpu)) {
r = kvm_x86_ops.nested_ops->check_events(vcpu);
- if (r != 0)
- return r;
+ if (r < 0)
+ goto busy;
}
/* try to inject new event if pending */
@@ -7795,27 +7795,64 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
can_inject = false;
}
- /* Finish re-injection before considering new events */
- if (!can_inject)
- return 0;
+ /*
+ * Finally, either inject the event or enable window-open exits.
+ * If an event is pending but cannot be injected right now (for
+ * example if it just arrived and we have to inject it as a
+ * vmexit), then we request an immediate exit. This is indicated
+ * by a -EBUSY return value from kvm_x86_ops.*_allowed.
+ */
+ if (vcpu->arch.smi_pending) {
+ r = can_inject ? kvm_x86_ops.smi_allowed(vcpu, true) : -EBUSY;
+ if (r < 0)
+ goto busy;
+ if (r) {
+ vcpu->arch.smi_pending = false;
+ ++vcpu->arch.smi_count;
+ enter_smm(vcpu);
+ can_inject = false;
+ } else {
+ kvm_x86_ops.enable_smi_window(vcpu);
+ }
+ }
- if (vcpu->arch.smi_pending &&
- kvm_x86_ops.smi_allowed(vcpu, true)) {
- vcpu->arch.smi_pending = false;
- ++vcpu->arch.smi_count;
- enter_smm(vcpu);
- } else if (vcpu->arch.nmi_pending &&
- kvm_x86_ops.nmi_allowed(vcpu, true)) {
- --vcpu->arch.nmi_pending;
- vcpu->arch.nmi_injected = true;
- kvm_x86_ops.set_nmi(vcpu);
- } else if (kvm_cpu_has_injectable_intr(vcpu) &&
- kvm_x86_ops.interrupt_allowed(vcpu, true)) {
- kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
- kvm_x86_ops.set_irq(vcpu);
+ if (vcpu->arch.nmi_pending) {
+ r = can_inject ? kvm_x86_ops.nmi_allowed(vcpu, true) : -EBUSY;
+ if (r < 0)
+ goto busy;
+ if (r) {
+ --vcpu->arch.nmi_pending;
+ vcpu->arch.nmi_injected = true;
+ kvm_x86_ops.set_nmi(vcpu);
+ can_inject = false;
+ } else {
+ kvm_x86_ops.enable_nmi_window(vcpu);
+ }
}
- return 0;
+ if (kvm_cpu_has_injectable_intr(vcpu)) {
+ r = can_inject ? kvm_x86_ops.interrupt_allowed(vcpu, true) : -EBUSY;
+ if (r < 0)
+ goto busy;
+ if (r) {
+ kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
+ kvm_x86_ops.set_irq(vcpu);
+ } else {
+ kvm_x86_ops.enable_irq_window(vcpu);
+ }
+ }
+
+ if (is_guest_mode(vcpu) &&
+ kvm_x86_ops.nested_ops->hv_timer_pending &&
+ kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
+ *req_immediate_exit = true;
+
+ WARN_ON(vcpu->arch.exception.pending);
+ return;
+
+busy:
+ *req_immediate_exit = true;
+ return;
}
static void process_nmi(struct kvm_vcpu *vcpu)
@@ -8353,36 +8390,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto out;
}
- if (inject_pending_event(vcpu) != 0)
- req_immediate_exit = true;
- else {
- /* Enable SMI/NMI/IRQ window open exits if needed.
- *
- * SMIs have three cases:
- * 1) They can be nested, and then there is nothing to
- * do here because RSM will cause a vmexit anyway.
- * 2) There is an ISA-specific reason why SMI cannot be
- * injected, and the moment when this changes can be
- * intercepted.
- * 3) Or the SMI can be pending because
- * inject_pending_event has completed the injection
- * of an IRQ or NMI from the previous vmexit, and
- * then we request an immediate exit to inject the
- * SMI.
- */
- if (vcpu->arch.smi_pending && !is_smm(vcpu))
- if (!kvm_x86_ops.enable_smi_window(vcpu))
- req_immediate_exit = true;
- if (vcpu->arch.nmi_pending)
- kvm_x86_ops.enable_nmi_window(vcpu);
- if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
- kvm_x86_ops.enable_irq_window(vcpu);
- if (is_guest_mode(vcpu) &&
- kvm_x86_ops.nested_ops->hv_timer_pending &&
- kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
- req_immediate_exit = true;
- WARN_ON(vcpu->arch.exception.pending);
- }
+ inject_pending_event(vcpu, &req_immediate_exit);
+ if (req_int_win)
+ kvm_x86_ops.enable_irq_window(vcpu);
if (kvm_lapic_enabled(vcpu)) {
update_cr8_intercept(vcpu);
--
2.26.2
This allows fetching the registers from the hsave area when setting
up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which
runs long after the CR0, CR4 and EFER values in vcpu have been switched
to hold L2 guest state).
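After the change, each caller passes its own view of the registers
(condensed from the hunks below):

	/* Ordinary shadow paging: use the current vCPU state. */
	kvm_init_shadow_mmu(vcpu,
			    kvm_read_cr0_bits(vcpu, X86_CR0_PG),
			    kvm_read_cr4_bits(vcpu, X86_CR4_PAE),
			    vcpu->arch.efer);

	/* Nested NPT: use L1's values, taken from the hsave area. */
	kvm_init_shadow_mmu(vcpu, X86_CR0_PG, hsave->save.cr4, hsave->save.efer);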
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 14 +++++++++-----
arch/x86/kvm/svm/nested.c | 5 ++++-
3 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8a3b1bce722a..45c1ae872a34 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -57,7 +57,7 @@ void
reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
-void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu);
+void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
bool accessed_dirty, gpa_t new_eptp);
bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2df0f347655a..fbc061df57ab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4956,7 +4956,7 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
return role;
}
-void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
+void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer)
{
struct kvm_mmu *context = vcpu->arch.mmu;
union kvm_mmu_role new_role =
@@ -4965,11 +4965,11 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
if (new_role.as_u64 == context->mmu_role.as_u64)
return;
- if (!is_paging(vcpu))
+ if (!(cr0 & X86_CR0_PG))
nonpaging_init_context(vcpu, context);
- else if (is_long_mode(vcpu))
+ else if (efer & EFER_LMA)
paging64_init_context(vcpu, context);
- else if (is_pae(vcpu))
+ else if (cr4 & X86_CR4_PAE)
paging32E_init_context(vcpu, context);
else
paging32_init_context(vcpu, context);
@@ -5047,7 +5047,11 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *context = vcpu->arch.mmu;
- kvm_init_shadow_mmu(vcpu);
+ kvm_init_shadow_mmu(vcpu,
+ kvm_read_cr0_bits(vcpu, X86_CR0_PG),
+ kvm_read_cr4_bits(vcpu, X86_CR4_PAE),
+ vcpu->arch.efer);
+
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index e63e62d12acd..840662e66976 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -80,10 +80,13 @@ static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu)
static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *hsave = svm->nested.hsave;
+
WARN_ON(mmu_is_nested(vcpu));
vcpu->arch.mmu = &vcpu->arch.guest_mmu;
- kvm_init_shadow_mmu(vcpu);
+ kvm_init_shadow_mmu(vcpu, X86_CR0_PG, hsave->save.cr4, hsave->save.efer);
vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr;
vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit;
--
2.26.2
Allow placing the VMCB structs on the stack or in other structs without
wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos.
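As an illustration of the intent (the struct and field names here are made up,
not part of this series), a container no longer drags in the 768 bytes of
trailing padding:

	struct nested_ctl_cache {
		u64 vmcb_gpa;
		/* Exactly 256 bytes now that reserved_7[] is gone. */
		struct vmcb_control_area ctl;
	};

The padding instead moves into struct vmcb itself as reserved_control[], so
the hardware-visible layout is unchanged and the BUILD_BUG_ONs keep it honest.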
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/svm.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 6ece8561ba66..8a1f5382a4ea 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -96,7 +96,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u8 reserved_6[8]; /* Offset 0xe8 */
u64 avic_logical_id; /* Offset 0xf0 */
u64 avic_physical_id; /* Offset 0xf8 */
- u8 reserved_7[768];
};
@@ -203,8 +202,16 @@ struct __attribute__ ((__packed__)) vmcb_save_area {
u64 last_excp_to;
};
+
+static inline void __unused_size_checks(void)
+{
+ BUILD_BUG_ON(sizeof(struct vmcb_save_area) != 0x298);
+ BUILD_BUG_ON(sizeof(struct vmcb_control_area) != 256);
+}
+
struct __attribute__ ((__packed__)) vmcb {
struct vmcb_control_area control;
+ u8 reserved_control[1024 - sizeof(struct vmcb_control_area)];
struct vmcb_save_area save;
};
--
2.26.2
Extract the code that is needed to implement CLGI and STGI,
so that we can run it from VMRUN and vmexit (and in the future,
KVM_SET_NESTED_STATE). Skip the request for KVM_REQ_EVENT unless needed,
subsuming the evaluate_pending_interrupts optimization that is found
in enter_svm_guest_mode.
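The resulting helper behaves as follows (condensed from the svm.c hunk below):

	svm_set_gif(svm, true);		/* VMRUN/STGI: drop the VGIF-only STGI
					 * intercept, clear a stale VINTR, set GIF,
					 * and request KVM_REQ_EVENT only if an
					 * SMI/NMI/interrupt is actually pending */

	svm_set_gif(svm, false);	/* CLGI/vmexit: clear GIF; unless vGIF is in
					 * use, also clear the VINTR intercept since
					 * no interrupt window can open */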
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/irq.c | 1 +
arch/x86/kvm/svm/nested.c | 22 ++---------------
arch/x86/kvm/svm/svm.c | 51 ++++++++++++++++++++++++++-------------
arch/x86/kvm/svm/svm.h | 1 +
4 files changed, 38 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 54f7ea68083b..99d118ffc67d 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -83,6 +83,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
+EXPORT_SYMBOL_GPL(kvm_cpu_has_injectable_intr);
/*
* check if there is pending interrupt without
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 4355286b2726..f5956ecfeeac 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -333,10 +333,6 @@ static void nested_prepare_vmcb_control(struct vcpu_svm *svm)
void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
struct vmcb *nested_vmcb)
{
- bool evaluate_pending_interrupts =
- is_intercept(svm, INTERCEPT_VINTR) ||
- is_intercept(svm, INTERCEPT_IRET);
-
svm->nested.vmcb = vmcb_gpa;
if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
svm->vcpu.arch.hflags |= HF_HIF_MASK;
@@ -347,21 +343,7 @@ void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
nested_prepare_vmcb_save(svm, nested_vmcb);
nested_prepare_vmcb_control(svm);
- /*
- * If L1 had a pending IRQ/NMI before executing VMRUN,
- * which wasn't delivered because it was disallowed (e.g.
- * interrupts disabled), L0 needs to evaluate if this pending
- * event should cause an exit from L2 to L1 or be delivered
- * directly to L2.
- *
- * Usually this would be handled by the processor noticing an
- * IRQ/NMI window request. However, VMRUN can unblock interrupts
- * by implicitly setting GIF, so force L0 to perform pending event
- * evaluation by requesting a KVM_REQ_EVENT.
- */
- enable_gif(svm);
- if (unlikely(evaluate_pending_interrupts))
- kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
+ svm_set_gif(svm, true);
}
int nested_svm_vmrun(struct vcpu_svm *svm)
@@ -505,7 +487,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
/* Give the current vmcb to the guest */
- disable_gif(svm);
+ svm_set_gif(svm, false);
nested_vmcb->save.es = vmcb->save.es;
nested_vmcb->save.cs = vmcb->save.cs;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0654c5672b1a..df3fcaa827c7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1981,6 +1981,38 @@ static int vmrun_interception(struct vcpu_svm *svm)
return nested_svm_vmrun(svm);
}
+void svm_set_gif(struct vcpu_svm *svm, bool value)
+{
+ if (value) {
+ /*
+ * If VGIF is enabled, the STGI intercept is only added to
+ * detect the opening of the SMI/NMI window; remove it now.
+ * Likewise, clear the VINTR intercept, we will set it
+ * again while processing KVM_REQ_EVENT if needed.
+ */
+ if (vgif_enabled(svm))
+ clr_intercept(svm, INTERCEPT_STGI);
+ if (is_intercept(svm, SVM_EXIT_VINTR))
+ svm_clear_vintr(svm);
+
+ enable_gif(svm);
+ if (svm->vcpu.arch.smi_pending ||
+ svm->vcpu.arch.nmi_pending ||
+ kvm_cpu_has_injectable_intr(&svm->vcpu))
+ kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
+ } else {
+ disable_gif(svm);
+
+ /*
+ * After a CLGI no interrupts should come. But if vGIF is
+ * in use, we still rely on the VINTR intercept (rather than
+ * STGI) to detect an open interrupt window.
+ */
+ if (!vgif_enabled(svm))
+ svm_clear_vintr(svm);
+ }
+}
+
static int stgi_interception(struct vcpu_svm *svm)
{
int ret;
@@ -1988,18 +2020,8 @@ static int stgi_interception(struct vcpu_svm *svm)
if (nested_svm_check_permissions(svm))
return 1;
- /*
- * If VGIF is enabled, the STGI intercept is only added to
- * detect the opening of the SMI/NMI window; remove it now.
- */
- if (vgif_enabled(svm))
- clr_intercept(svm, INTERCEPT_STGI);
-
ret = kvm_skip_emulated_instruction(&svm->vcpu);
- kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
-
- enable_gif(svm);
-
+ svm_set_gif(svm, true);
return ret;
}
@@ -2011,12 +2033,7 @@ static int clgi_interception(struct vcpu_svm *svm)
return 1;
ret = kvm_skip_emulated_instruction(&svm->vcpu);
-
- disable_gif(svm);
-
- /* After a CLGI no interrupts should come */
- svm_clear_vintr(svm);
-
+ svm_set_gif(svm, false);
return ret;
}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7e79f0af1204..10b7b55720a0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -357,6 +357,7 @@ void disable_nmi_singlestep(struct vcpu_svm *svm);
bool svm_smi_blocked(struct kvm_vcpu *vcpu);
bool svm_nmi_blocked(struct kvm_vcpu *vcpu);
bool svm_interrupt_blocked(struct kvm_vcpu *vcpu);
+void svm_set_gif(struct vcpu_svm *svm, bool value);
/* nested.c */
--
2.26.2
The usual drill at this point, except there is no code to remove because this
case was not handled at all.
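Condensed from the hunk below, an INIT that L1 intercepts now becomes a
synthetic vmexit instead of being delivered to L2:

	if (lapic_in_kernel(vcpu) &&
	    test_bit(KVM_APIC_INIT, &apic->pending_events)) {
		if (block_nested_events)
			return -EBUSY;
		if (!nested_exit_on_init(svm))
			return 0;
		nested_svm_init(svm);	/* exit_code = SVM_EXIT_INIT, then nested_svm_vmexit() */
		return 0;
	}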
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/svm/nested.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bbf991cfe24b..166b88fc9509 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -25,6 +25,7 @@
#include "trace.h"
#include "mmu.h"
#include "x86.h"
+#include "lapic.h"
#include "svm.h"
static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
@@ -788,11 +789,37 @@ static void nested_svm_intr(struct vcpu_svm *svm)
nested_svm_vmexit(svm);
}
+static inline bool nested_exit_on_init(struct vcpu_svm *svm)
+{
+ return (svm->nested.intercept & (1ULL << INTERCEPT_INIT));
+}
+
+static void nested_svm_init(struct vcpu_svm *svm)
+{
+ svm->vmcb->control.exit_code = SVM_EXIT_INIT;
+ svm->vmcb->control.exit_info_1 = 0;
+ svm->vmcb->control.exit_info_2 = 0;
+
+ nested_svm_vmexit(svm);
+}
+
+
static int svm_check_nested_events(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
bool block_nested_events =
kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ if (lapic_in_kernel(vcpu) &&
+ test_bit(KVM_APIC_INIT, &apic->pending_events)) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_init(svm))
+ return 0;
+ nested_svm_init(svm);
+ return 0;
+ }
if (vcpu->arch.exception.pending) {
if (block_nested_events)
--
2.26.2
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
an optimization, but this is only correct before the nested vmentry.
If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
already been put in guest mode, the value of CR3 will not be updated.
Remove the optimization, which almost never triggers anyway.
Fixes: 04f11ef45810 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter")
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index aedc46407b1f..7efdc3c2f7f2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3085,10 +3085,7 @@ void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, unsigned long pgd)
spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
}
- /* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */
- if (is_guest_mode(vcpu))
- update_guest_cr3 = false;
- else if (!enable_unrestricted_guest && !is_paging(vcpu))
+ if (!enable_unrestricted_guest && !is_paging(vcpu))
guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
else if (test_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail))
guest_cr3 = vcpu->arch.cr3;
--
2.26.2
Instead of calling kvm_event_needs_reinjection, track its
future return value in a variable. This will be useful in
the next patch.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b226fb8abe41..064a7ea0e671 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7713,11 +7713,14 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
static int inject_pending_event(struct kvm_vcpu *vcpu)
{
int r;
+ bool can_inject = true;
/* try to reinject previous events if any */
- if (vcpu->arch.exception.injected)
+ if (vcpu->arch.exception.injected) {
kvm_x86_ops.queue_exception(vcpu);
+ can_inject = false;
+ }
/*
* Do not inject an NMI or interrupt if there is a pending
* exception. Exceptions and interrupts are recognized at
@@ -7733,10 +7736,13 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
* fully complete the previous instruction.
*/
else if (!vcpu->arch.exception.pending) {
- if (vcpu->arch.nmi_injected)
+ if (vcpu->arch.nmi_injected) {
kvm_x86_ops.set_nmi(vcpu);
- else if (vcpu->arch.interrupt.injected)
+ can_inject = false;
+ } else if (vcpu->arch.interrupt.injected) {
kvm_x86_ops.set_irq(vcpu);
+ can_inject = false;
+ }
}
WARN_ON_ONCE(vcpu->arch.exception.injected &&
@@ -7786,10 +7792,11 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
}
kvm_x86_ops.queue_exception(vcpu);
+ can_inject = false;
}
- /* Don't consider new event if we re-injected an event */
- if (kvm_event_needs_reinjection(vcpu))
+ /* Finish re-injection before considering new events */
+ if (!can_inject)
return 0;
if (vcpu->arch.smi_pending &&
--
2.26.2
On 5/26/20 10:22 AM, Paolo Bonzini wrote:
> In case an interrupt arrives after nested.check_events but before the
> call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
> window even if the interrupt is actually going to be a vmexit. This is
> useless rather than harmful, but it really complicates reasoning about
> SVM's handling of the VINTR intercept. We'd like to never bother with
> the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
> that case there is no interrupt window and we can just exit the nested
> guest whenever we want.
>
> As a first step, this patch moves the opening of the interrupt
> window inside inject_pending_event. This consolidates the check for
> pending interrupt/NMI/SMI in one place, removing the repeated call to
> kvm_cpu_has_injectable_intr.
>
> The main functional change here is that re-injection of still-pending
> events will also use req_immediate_exit instead of using interrupt-window
> intercepts.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 8 +--
> arch/x86/kvm/svm/svm.c | 24 +++----
> arch/x86/kvm/vmx/vmx.c | 20 +++---
> arch/x86/kvm/x86.c | 112 +++++++++++++++++---------------
> 4 files changed, 87 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index db261da578f3..7707bd4b0593 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1136,8 +1136,8 @@ struct kvm_x86_ops {
> void (*set_nmi)(struct kvm_vcpu *vcpu);
> void (*queue_exception)(struct kvm_vcpu *vcpu);
> void (*cancel_injection)(struct kvm_vcpu *vcpu);
> - bool (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> - bool (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> + int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> + int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> bool (*get_nmi_mask)(struct kvm_vcpu *vcpu);
> void (*set_nmi_mask)(struct kvm_vcpu *vcpu, bool masked);
> void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> @@ -1234,10 +1234,10 @@ struct kvm_x86_ops {
>
> void (*setup_mce)(struct kvm_vcpu *vcpu);
>
> - bool (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> + int (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
> int (*pre_leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
> - int (*enable_smi_window)(struct kvm_vcpu *vcpu);
> + void (*enable_smi_window)(struct kvm_vcpu *vcpu);
>
> int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
> int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 9987f6fe9d88..9ac9963405b5 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3053,15 +3053,15 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> -static bool svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> if (svm->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
>
> /* An NMI must not be injected into L2 if it's supposed to VM-Exit. */
> if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(svm))
> - return false;
> + return -EBUSY;
>
> return !svm_nmi_blocked(vcpu);
> }
> @@ -3112,18 +3112,18 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
> return (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK);
> }
>
> -static bool svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> if (svm->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
>
> /*
> * An IRQ must not be injected into L2 if it's supposed to VM-Exit,
> * e.g. if the IRQ arrived asynchronously after checking nested events.
> */
> if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(svm))
> - return false;
> + return -EBUSY;
>
> return !svm_interrupt_blocked(vcpu);
> }
> @@ -3793,15 +3793,15 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu)
> return is_smm(vcpu);
> }
>
> -static bool svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> if (svm->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
>
> /* An SMI must not be injected into L2 if it's supposed to VM-Exit. */
> if (for_injection && is_guest_mode(vcpu) && nested_exit_on_smi(svm))
> - return false;
> + return -EBUSY;
>
> return !svm_smi_blocked(vcpu);
> }
> @@ -3848,7 +3848,7 @@ static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> return 0;
> }
>
> -static int enable_smi_window(struct kvm_vcpu *vcpu)
> +static void enable_smi_window(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> @@ -3856,9 +3856,9 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
> if (vgif_enabled(svm))
> set_intercept(svm, INTERCEPT_STGI);
> /* STGI will cause a vm exit */
> - return 1;
> + } else {
> + /* We must be in SMM; RSM will cause a vmexit anyway. */
> }
> - return 0;
> }
>
> static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 55712dd86baf..aedc46407b1f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4552,14 +4552,14 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
> GUEST_INTR_STATE_NMI));
> }
>
> -static bool vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> if (to_vmx(vcpu)->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
>
> /* An NMI must not be injected into L2 if it's supposed to VM-Exit. */
> if (for_injection && is_guest_mode(vcpu) && nested_exit_on_nmi(vcpu))
> - return false;
> + return -EBUSY;
>
> return !vmx_nmi_blocked(vcpu);
> }
> @@ -4574,17 +4574,17 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
> (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> }
>
> -static bool vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> if (to_vmx(vcpu)->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
>
> /*
> * An IRQ must not be injected into L2 if it's supposed to VM-Exit,
> * e.g. if the IRQ arrived asynchronously after checking nested events.
> */
> if (for_injection && is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> - return false;
> + return -EBUSY;
>
> return !vmx_interrupt_blocked(vcpu);
> }
> @@ -7755,11 +7755,11 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> ~FEAT_CTL_LMCE_ENABLED;
> }
>
> -static bool vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> {
> /* we need a nested vmexit to enter SMM, postpone if run is pending */
> if (to_vmx(vcpu)->nested.nested_run_pending)
> - return false;
> + return -EBUSY;
> return !is_smm(vcpu);
> }
>
> @@ -7797,9 +7797,9 @@ static int vmx_pre_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> return 0;
> }
>
> -static int enable_smi_window(struct kvm_vcpu *vcpu)
> +static void enable_smi_window(struct kvm_vcpu *vcpu)
> {
> - return 0;
> + /* RSM will cause a vmexit anyway. */
> }
>
> static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 064a7ea0e671..192238841cac 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7710,7 +7710,7 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
> kvm_x86_ops.update_cr8_intercept(vcpu, tpr, max_irr);
> }
>
> -static int inject_pending_event(struct kvm_vcpu *vcpu)
> +static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
Now that this function also opens the interrupt window instead of just
injecting an event, does it make sense to change its name to something
like process_pending_event()?
> {
> int r;
> bool can_inject = true;
> @@ -7756,8 +7756,8 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
> */
> if (is_guest_mode(vcpu)) {
> r = kvm_x86_ops.nested_ops->check_events(vcpu);
> - if (r != 0)
> - return r;
> + if (r < 0)
> + goto busy;
> }
>
> /* try to inject new event if pending */
> @@ -7795,27 +7795,64 @@ static int inject_pending_event(struct kvm_vcpu *vcpu)
> can_inject = false;
> }
>
> - /* Finish re-injection before considering new events */
> - if (!can_inject)
> - return 0;
> + /*
> + * Finally, either inject the event or enable window-open exits.
> + * If an event is pending but cannot be injected right now (for
> + * example if it just arrived and we have to inject it as a
> + * vmexit), then we request an immediate exit. This is indicated
> + * by a -EBUSY return value from kvm_x86_ops.*_allowed.
> + */
> + if (vcpu->arch.smi_pending) {
> + r = can_inject ? kvm_x86_ops.smi_allowed(vcpu, true) : -EBUSY;
> + if (r < 0)
> + goto busy;
> + if (r) {
> + vcpu->arch.smi_pending = false;
> + ++vcpu->arch.smi_count;
> + enter_smm(vcpu);
> + can_inject = false;
> + } else {
> + kvm_x86_ops.enable_smi_window(vcpu);
> + }
> + }
>
> - if (vcpu->arch.smi_pending &&
> - kvm_x86_ops.smi_allowed(vcpu, true)) {
> - vcpu->arch.smi_pending = false;
> - ++vcpu->arch.smi_count;
> - enter_smm(vcpu);
> - } else if (vcpu->arch.nmi_pending &&
> - kvm_x86_ops.nmi_allowed(vcpu, true)) {
> - --vcpu->arch.nmi_pending;
> - vcpu->arch.nmi_injected = true;
> - kvm_x86_ops.set_nmi(vcpu);
> - } else if (kvm_cpu_has_injectable_intr(vcpu) &&
> - kvm_x86_ops.interrupt_allowed(vcpu, true)) {
> - kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
> - kvm_x86_ops.set_irq(vcpu);
> + if (vcpu->arch.nmi_pending) {
> + r = can_inject ? kvm_x86_ops.nmi_allowed(vcpu, true) : -EBUSY;
> + if (r < 0)
> + goto busy;
> + if (r) {
> + --vcpu->arch.nmi_pending;
> + vcpu->arch.nmi_injected = true;
> + kvm_x86_ops.set_nmi(vcpu);
> + can_inject = false;
> + } else {
> + kvm_x86_ops.enable_nmi_window(vcpu);
> + }
> }
>
> - return 0;
> + if (kvm_cpu_has_injectable_intr(vcpu)) {
> + r = can_inject ? kvm_x86_ops.interrupt_allowed(vcpu, true) : -EBUSY;
> + if (r < 0)
> + goto busy;
> + if (r) {
> + kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
> + kvm_x86_ops.set_irq(vcpu);
> + } else {
> + kvm_x86_ops.enable_irq_window(vcpu);
> + }
> + }
> +
> + if (is_guest_mode(vcpu) &&
> + kvm_x86_ops.nested_ops->hv_timer_pending &&
> + kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
> + *req_immediate_exit = true;
Nit: Maybe we can use a goto here for consistency?
> +
> + WARN_ON(vcpu->arch.exception.pending);
> + return;
> +
> +busy:
> + *req_immediate_exit = true;
> + return;
> }
>
> static void process_nmi(struct kvm_vcpu *vcpu)
> @@ -8353,36 +8390,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> goto out;
> }
>
> - if (inject_pending_event(vcpu) != 0)
> - req_immediate_exit = true;
> - else {
> - /* Enable SMI/NMI/IRQ window open exits if needed.
> - *
> - * SMIs have three cases:
> - * 1) They can be nested, and then there is nothing to
> - * do here because RSM will cause a vmexit anyway.
> - * 2) There is an ISA-specific reason why SMI cannot be
> - * injected, and the moment when this changes can be
> - * intercepted.
> - * 3) Or the SMI can be pending because
> - * inject_pending_event has completed the injection
> - * of an IRQ or NMI from the previous vmexit, and
> - * then we request an immediate exit to inject the
> - * SMI.
> - */
> - if (vcpu->arch.smi_pending && !is_smm(vcpu))
> - if (!kvm_x86_ops.enable_smi_window(vcpu))
> - req_immediate_exit = true;
> - if (vcpu->arch.nmi_pending)
> - kvm_x86_ops.enable_nmi_window(vcpu);
> - if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
> - kvm_x86_ops.enable_irq_window(vcpu);
> - if (is_guest_mode(vcpu) &&
> - kvm_x86_ops.nested_ops->hv_timer_pending &&
> - kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
> - req_immediate_exit = true;
> - WARN_ON(vcpu->arch.exception.pending);
> - }
> + inject_pending_event(vcpu, &req_immediate_exit);
> + if (req_int_win)
> + kvm_x86_ops.enable_irq_window(vcpu);
Passing req_int_win to inject_pending_event and opening the window
in there would probably read more logically, since that is where this
action is now handled.
>
> if (kvm_lapic_enabled(vcpu)) {
> update_cr8_intercept(vcpu);
On 5/26/20 10:22 AM, Paolo Bonzini wrote:
> The usual drill at this point, except there is no code to remove because this
> case was not handled at all.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/svm/nested.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index bbf991cfe24b..166b88fc9509 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -25,6 +25,7 @@
> #include "trace.h"
> #include "mmu.h"
> #include "x86.h"
> +#include "lapic.h"
> #include "svm.h"
>
> static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> @@ -788,11 +789,37 @@ static void nested_svm_intr(struct vcpu_svm *svm)
> nested_svm_vmexit(svm);
> }
>
> +static inline bool nested_exit_on_init(struct vcpu_svm *svm)
> +{
> + return (svm->nested.intercept & (1ULL << INTERCEPT_INIT));
> +}
> +
> +static void nested_svm_init(struct vcpu_svm *svm)
Should this be named nested_svm_inject_init_vmexit, in accordance with
nested_svm_inject_exception_vmexit, which you added in patch #3?
> +{
> + svm->vmcb->control.exit_code = SVM_EXIT_INIT;
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + nested_svm_vmexit(svm);
> +}
> +
> +
> static int svm_check_nested_events(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> bool block_nested_events =
> kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + if (lapic_in_kernel(vcpu) &&
> + test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> + if (block_nested_events)
> + return -EBUSY;
> + if (!nested_exit_on_init(svm))
> + return 0;
> + nested_svm_init(svm);
> + return 0;
> + }
>
> if (vcpu->arch.exception.pending) {
> if (block_nested_events)
On 29/05/20 08:46, Krish Sadhukhan wrote:
>>
>> +static void nested_svm_init(struct vcpu_svm *svm)
>
> Should this be named nested_svm_inject_init_vmexit, in accordance with
> nested_svm_inject_exception_vmexit, which you added in patch #3?
There's also nested_svm_intr and nested_svm_nmi. I'll rename all of
them, but it will be a follow-up.
Paolo
On 29/05/20 04:16, Krish Sadhukhan wrote:
>
> On 5/26/20 10:22 AM, Paolo Bonzini wrote:
>> In case an interrupt arrives after nested.check_events but before the
>> call to kvm_cpu_has_injectable_intr, we could end up enabling the
>> interrupt
>> window even if the interrupt is actually going to be a vmexit. This is
>> useless rather than harmful, but it really complicates reasoning about
>> SVM's handling of the VINTR intercept. We'd like to never bother with
>> the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
>> that case there is no interrupt window and we can just exit the nested
>> guest whenever we want.
>>
>> As a first step, this patch moves the opening of the interrupt
>> window inside inject_pending_event. This consolidates the check for
>> pending interrupt/NMI/SMI in one place, removing the repeated call to
>> kvm_cpu_has_injectable_intr.
>>
>> The main functional change here is that re-injection of still-pending
>> events will also use req_immediate_exit instead of using interrupt-window
>> intercepts.
>>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 8 +--
>> arch/x86/kvm/svm/svm.c | 24 +++----
>> arch/x86/kvm/vmx/vmx.c | 20 +++---
>> arch/x86/kvm/x86.c | 112 +++++++++++++++++---------------
>> 4 files changed, 87 insertions(+), 77 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h
>> index db261da578f3..7707bd4b0593 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1136,8 +1136,8 @@ struct kvm_x86_ops {
>> void (*set_nmi)(struct kvm_vcpu *vcpu);
>> void (*queue_exception)(struct kvm_vcpu *vcpu);
>> void (*cancel_injection)(struct kvm_vcpu *vcpu);
>> - bool (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool
>> for_injection);
>> - bool (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>> + int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>> + int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>> bool (*get_nmi_mask)(struct kvm_vcpu *vcpu);
>> void (*set_nmi_mask)(struct kvm_vcpu *vcpu, bool masked);
>> void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>> @@ -1234,10 +1234,10 @@ struct kvm_x86_ops {
>> void (*setup_mce)(struct kvm_vcpu *vcpu);
>> - bool (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>> + int (*smi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>> int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
>> int (*pre_leave_smm)(struct kvm_vcpu *vcpu, const char *smstate);
>> - int (*enable_smi_window)(struct kvm_vcpu *vcpu);
>> + void (*enable_smi_window)(struct kvm_vcpu *vcpu);
>> int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
>> int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region
>> *argp);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 9987f6fe9d88..9ac9963405b5 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -3053,15 +3053,15 @@ bool svm_nmi_blocked(struct kvm_vcpu *vcpu)
>> return ret;
>> }
>> -static bool svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> +static int svm_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> if (svm->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> /* An NMI must not be injected into L2 if it's supposed to
>> VM-Exit. */
>> if (for_injection && is_guest_mode(vcpu) &&
>> nested_exit_on_nmi(svm))
>> - return false;
>> + return -EBUSY;
>> return !svm_nmi_blocked(vcpu);
>> }
>> @@ -3112,18 +3112,18 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
>> return (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK);
>> }
>> -static bool svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool
>> for_injection)
>> +static int svm_interrupt_allowed(struct kvm_vcpu *vcpu, bool
>> for_injection)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> if (svm->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> /*
>> * An IRQ must not be injected into L2 if it's supposed to VM-Exit,
>> * e.g. if the IRQ arrived asynchronously after checking nested
>> events.
>> */
>> if (for_injection && is_guest_mode(vcpu) &&
>> nested_exit_on_intr(svm))
>> - return false;
>> + return -EBUSY;
>> return !svm_interrupt_blocked(vcpu);
>> }
>> @@ -3793,15 +3793,15 @@ bool svm_smi_blocked(struct kvm_vcpu *vcpu)
>> return is_smm(vcpu);
>> }
>> -static bool svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> +static int svm_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> if (svm->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> /* An SMI must not be injected into L2 if it's supposed to
>> VM-Exit. */
>> if (for_injection && is_guest_mode(vcpu) &&
>> nested_exit_on_smi(svm))
>> - return false;
>> + return -EBUSY;
>> return !svm_smi_blocked(vcpu);
>> }
>> @@ -3848,7 +3848,7 @@ static int svm_pre_leave_smm(struct kvm_vcpu
>> *vcpu, const char *smstate)
>> return 0;
>> }
>> -static int enable_smi_window(struct kvm_vcpu *vcpu)
>> +static void enable_smi_window(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -3856,9 +3856,9 @@ static int enable_smi_window(struct kvm_vcpu
>> *vcpu)
>> if (vgif_enabled(svm))
>> set_intercept(svm, INTERCEPT_STGI);
>> /* STGI will cause a vm exit */
>> - return 1;
>> + } else {
>> + /* We must be in SMM; RSM will cause a vmexit anyway. */
>> }
>> - return 0;
>> }
>> static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 55712dd86baf..aedc46407b1f 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -4552,14 +4552,14 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
>> GUEST_INTR_STATE_NMI));
>> }
>> -static bool vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> +static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> {
>> if (to_vmx(vcpu)->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> /* An NMI must not be injected into L2 if it's supposed to
>> VM-Exit. */
>> if (for_injection && is_guest_mode(vcpu) &&
>> nested_exit_on_nmi(vcpu))
>> - return false;
>> + return -EBUSY;
>> return !vmx_nmi_blocked(vcpu);
>> }
>> @@ -4574,17 +4574,17 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
>> (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
>> }
>> -static bool vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool
>> for_injection)
>> +static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool
>> for_injection)
>> {
>> if (to_vmx(vcpu)->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> /*
>> * An IRQ must not be injected into L2 if it's supposed to
>> VM-Exit,
>> * e.g. if the IRQ arrived asynchronously after checking
>> nested events.
>> */
>> if (for_injection && is_guest_mode(vcpu) &&
>> nested_exit_on_intr(vcpu))
>> - return false;
>> + return -EBUSY;
>> return !vmx_interrupt_blocked(vcpu);
>> }
>> @@ -7755,11 +7755,11 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
>> ~FEAT_CTL_LMCE_ENABLED;
>> }
>> -static bool vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> +static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>> {
>> /* we need a nested vmexit to enter SMM, postpone if run is
>> pending */
>> if (to_vmx(vcpu)->nested.nested_run_pending)
>> - return false;
>> + return -EBUSY;
>> return !is_smm(vcpu);
>> }
>> @@ -7797,9 +7797,9 @@ static int vmx_pre_leave_smm(struct kvm_vcpu
>> *vcpu, const char *smstate)
>> return 0;
>> }
>> -static int enable_smi_window(struct kvm_vcpu *vcpu)
>> +static void enable_smi_window(struct kvm_vcpu *vcpu)
>> {
>> - return 0;
>> + /* RSM will cause a vmexit anyway. */
>> }
>> static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 064a7ea0e671..192238841cac 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -7710,7 +7710,7 @@ static void update_cr8_intercept(struct kvm_vcpu
>> *vcpu)
>> kvm_x86_ops.update_cr8_intercept(vcpu, tpr, max_irr);
>> }
>> -static int inject_pending_event(struct kvm_vcpu *vcpu)
>> +static void inject_pending_event(struct kvm_vcpu *vcpu, bool
>> *req_immediate_exit)
>
>
> Now that this function also opens the interrupt window in addition to
> injecting an event, does it make sense to change its name to something
> like process_pending_event()?
>
>> {
>> int r;
>> bool can_inject = true;
>> @@ -7756,8 +7756,8 @@ static int inject_pending_event(struct kvm_vcpu
>> *vcpu)
>> */
>> if (is_guest_mode(vcpu)) {
>> r = kvm_x86_ops.nested_ops->check_events(vcpu);
>> - if (r != 0)
>> - return r;
>> + if (r < 0)
>> + goto busy;
>> }
>> /* try to inject new event if pending */
>> @@ -7795,27 +7795,64 @@ static int inject_pending_event(struct
>> kvm_vcpu *vcpu)
>> can_inject = false;
>> }
>> - /* Finish re-injection before considering new events */
>> - if (!can_inject)
>> - return 0;
>> + /*
>> + * Finally, either inject the event or enable window-open exits.
>> + * If an event is pending but cannot be injected right now (for
>> + * example if it just arrived and we have to inject it as a
>> + * vmexit), then we request an immediate exit. This is indicated
>> + * by a -EBUSY return value from kvm_x86_ops.*_allowed.
>> + */
>> + if (vcpu->arch.smi_pending) {
>> + r = can_inject ? kvm_x86_ops.smi_allowed(vcpu, true) : -EBUSY;
>> + if (r < 0)
>> + goto busy;
>> + if (r) {
>> + vcpu->arch.smi_pending = false;
>> + ++vcpu->arch.smi_count;
>> + enter_smm(vcpu);
>> + can_inject = false;
>> + } else {
>> + kvm_x86_ops.enable_smi_window(vcpu);
>> + }
>> + }
>> - if (vcpu->arch.smi_pending &&
>> - kvm_x86_ops.smi_allowed(vcpu, true)) {
>> - vcpu->arch.smi_pending = false;
>> - ++vcpu->arch.smi_count;
>> - enter_smm(vcpu);
>> - } else if (vcpu->arch.nmi_pending &&
>> - kvm_x86_ops.nmi_allowed(vcpu, true)) {
>> - --vcpu->arch.nmi_pending;
>> - vcpu->arch.nmi_injected = true;
>> - kvm_x86_ops.set_nmi(vcpu);
>> - } else if (kvm_cpu_has_injectable_intr(vcpu) &&
>> - kvm_x86_ops.interrupt_allowed(vcpu, true)) {
>> - kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
>> - kvm_x86_ops.set_irq(vcpu);
>> + if (vcpu->arch.nmi_pending) {
>> + r = can_inject ? kvm_x86_ops.nmi_allowed(vcpu, true) : -EBUSY;
>> + if (r < 0)
>> + goto busy;
>> + if (r) {
>> + --vcpu->arch.nmi_pending;
>> + vcpu->arch.nmi_injected = true;
>> + kvm_x86_ops.set_nmi(vcpu);
>> + can_inject = false;
>> + } else {
>> + kvm_x86_ops.enable_nmi_window(vcpu);
>> + }
>> }
>> - return 0;
>> + if (kvm_cpu_has_injectable_intr(vcpu)) {
>> + r = can_inject ? kvm_x86_ops.interrupt_allowed(vcpu, true) :
>> -EBUSY;
>> + if (r < 0)
>> + goto busy;
>> + if (r) {
>> + kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
>> false);
>> + kvm_x86_ops.set_irq(vcpu);
>> + } else {
>> + kvm_x86_ops.enable_irq_window(vcpu);
>> + }
>> + }
>> +
>> + if (is_guest_mode(vcpu) &&
>> + kvm_x86_ops.nested_ops->hv_timer_pending &&
>> + kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
>> + *req_immediate_exit = true;
>
>
> Nit: Maybe we can use a goto here for consistency?
>
>> +
>> + WARN_ON(vcpu->arch.exception.pending);
>> + return;
>> +
>> +busy:
>> + *req_immediate_exit = true;
>> + return;
>> }
>> static void process_nmi(struct kvm_vcpu *vcpu)
>> @@ -8353,36 +8390,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>> goto out;
>> }
>> - if (inject_pending_event(vcpu) != 0)
>> - req_immediate_exit = true;
>> - else {
>> - /* Enable SMI/NMI/IRQ window open exits if needed.
>> - *
>> - * SMIs have three cases:
>> - * 1) They can be nested, and then there is nothing to
>> - * do here because RSM will cause a vmexit anyway.
>> - * 2) There is an ISA-specific reason why SMI cannot be
>> - * injected, and the moment when this changes can be
>> - * intercepted.
>> - * 3) Or the SMI can be pending because
>> - * inject_pending_event has completed the injection
>> - * of an IRQ or NMI from the previous vmexit, and
>> - * then we request an immediate exit to inject the
>> - * SMI.
>> - */
>> - if (vcpu->arch.smi_pending && !is_smm(vcpu))
>> - if (!kvm_x86_ops.enable_smi_window(vcpu))
>> - req_immediate_exit = true;
>> - if (vcpu->arch.nmi_pending)
>> - kvm_x86_ops.enable_nmi_window(vcpu);
>> - if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
>> - kvm_x86_ops.enable_irq_window(vcpu);
>> - if (is_guest_mode(vcpu) &&
>> - kvm_x86_ops.nested_ops->hv_timer_pending &&
>> - kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
>> - req_immediate_exit = true;
>> - WARN_ON(vcpu->arch.exception.pending);
>> - }
>> + inject_pending_event(vcpu, &req_immediate_exit);
>> + if (req_int_win)
>> + kvm_x86_ops.enable_irq_window(vcpu);
>
>
> Passing req_int_win to inject_pending_event and opening the window
> in there would probably read more logically, since that is where this
> action is now handled.
This is a special case for the userspace irqchip; inject_pending_event
enables the IRQ window in response to a pending event, while this call
is not tied to one. But you're right that I should rename
inject_pending_event to handle_pending_event.
Also, I'm thinking of dropping support for kernel_irqchip=off
completely, meaning that unless you do KVM_CREATE_IRQCHIP you won't be
able to inject interrupts at all.
Paolo
Hopefully I got the In-Reply-To header right...
On Thu, May 28, 2020, Paolo Bonzini wrote:
> This allows exceptions injected by the emulator to be properly delivered
> as vmexits. The code also becomes simpler, because we can just let all
> L0-intercepted exceptions go through the usual path. In particular, our
> emulation of the VMX #DB exit qualification is very much simplified,
> because the vmexit injection path can use kvm_deliver_exception_payload
> to update DR6.
Sadly, it's also completely and utterly broken for #UD and #GP, and a bit
sketchy for #AC.
Unless KVM (L0) knowingly wants to override L1, e.g. KVM_GUESTDBG_* cases, KVM
shouldn't do a damn thing except forward the exception to L1 if L1 wants the
exception.
ud_interception() and gp_interception() do quite a bit before forwarding the
exception, and in the case of #UD, it's entirely possible the #UD will never get
forwarded to L1. #GP is even more problematic because it's a contributory
exception, and kvm_multiple_exception() is not equipped to check and handle
nested intercepts before vectoring the exception, which means KVM will
incorrectly escalate a #GP->#DF and #GP->#DF->Triple Fault instead of exiting
to L1. That's a wee bit problematic since KVM also has a soon-to-be-fixed bug
where it kills L1 on a Triple Fault in L2...
I think this will fix the bugs; I'll test it properly and post next week.
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 90a1704b5752..928e11646dca 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -926,11 +926,11 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
}
case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
/*
- * Host-intercepted exceptions have been checked already in
- * nested_svm_exit_special. There is nothing to do here,
- * the vmexit is injected by svm_check_nested_events.
+ * Note, KVM may already have snagged exceptions it wants to
+ * handle even if L1 also wants the exception, e.g. #MC.
*/
- vmexit = NESTED_EXIT_DONE;
+ if (vmcb_is_intercept(&svm->nested.ctl, exit_code))
+ vmexit = NESTED_EXIT_DONE;
break;
}
case SVM_EXIT_ERR: {
@@ -1122,19 +1122,23 @@ int nested_svm_exit_special(struct vcpu_svm *svm)
case SVM_EXIT_INTR:
case SVM_EXIT_NMI:
case SVM_EXIT_NPF:
+ case SVM_EXIT_EXCP_BASE + MC_VECTOR:
return NESTED_EXIT_HOST;
- case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
+ case SVM_EXIT_EXCP_BASE + DB_VECTOR:
+ case SVM_EXIT_EXCP_BASE + BP_VECTOR: {
+ /* KVM gets first crack at #DBs and #BPs, if it wants them. */
u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
if (svm->vmcb01.ptr->control.intercepts[INTERCEPT_EXCEPTION] &
excp_bits)
return NESTED_EXIT_HOST;
- else if (exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR &&
- svm->vcpu.arch.apf.host_apf_flags)
- /* Trap async PF even if not shadowing */
- return NESTED_EXIT_HOST;
break;
}
+ case SVM_EXIT_EXCP_BASE + PF_VECTOR:
+ /* Trap async PF even if not shadowing */
+ if (svm->vcpu.arch.apf.host_apf_flags)
+ return NESTED_EXIT_HOST;
+ break;
default:
break;
}
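
To make the escalation problem concrete, here is a small standalone model
(plain C, not KVM code; the struct fields, the check_l1_first knob and the
helper names are all invented for illustration) of the ordering Sean
describes: today the contributory-exception promotion runs before anyone
asks whether L1 intercepts the vector, so a #GP that should have caused a
vmexit to L1 is turned into a #DF, and on the next fault into a triple
fault, by L0.

#include <stdbool.h>
#include <stdio.h>

enum { DE = 0, DF = 8, TS = 10, NP = 11, SS = 12, GP = 13 };

static bool contributory(int v)
{
        return v == DE || v == TS || v == NP || v == SS || v == GP;
}

struct vcpu_model {
        int pending;                 /* exception already being delivered */
        bool guest_mode;             /* vCPU is running L2 */
        unsigned int l1_intercepts;  /* vectors L1 wants to intercept */
};

static const char *queue_exception(struct vcpu_model *v, int nr,
                                   bool check_l1_first)
{
        /* The fix: ask whether L1 wants the exception before escalating. */
        if (check_l1_first && v->guest_mode && (v->l1_intercepts & (1u << nr)))
                return "vmexit to L1";

        /* Today's ordering: escalate first, so L1 never sees the #GP. */
        if (v->pending == DF)
                return "triple fault (currently kills L1)";
        if (contributory(v->pending) && contributory(nr)) {
                v->pending = DF;
                return "#DF injected into L2";
        }
        v->pending = nr;
        return "exception injected into L2";
}

int main(void)
{
        struct vcpu_model v = { .pending = GP, .guest_mode = true,
                                .l1_intercepts = 1u << GP };

        puts(queue_exception(&v, GP, false));  /* "#DF injected into L2" */
        v.pending = GP;
        puts(queue_exception(&v, GP, true));   /* "vmexit to L1" */
        return 0;
}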
On 06/03/21 02:39, Sean Christopherson wrote:
> Unless KVM (L0) knowingly wants to override L1, e.g. KVM_GUESTDBG_* cases, KVM
> shouldn't do a damn thing except forward the exception to L1 if L1 wants the
> exception.
>
> ud_interception() and gp_interception() do quite a bit before forwarding the
> exception, and in the case of #UD, it's entirely possible the #UD will never get
> forwarded to L1. #GP is even more problematic because it's a contributory
> exception, and kvm_multiple_exception() is not equipped to check and handle
> nested intercepts before vectoring the exception, which means KVM will
> incorrectly escalate a #GP->#DF and #GP->#DF->Triple Fault instead of exiting
> to L1. That's a wee bit problematic since KVM also has a soon-to-be-fixed bug
> where it kills L1 on a Triple Fault in L2...
I agree with the #GP problem, but this is on purpose. For example, if
L1 CPUID has MOVBE and it is being emulated via #UD, L1 would be right
to set MOVBE in L2's CPUID and expect it not to cause a #UD. The same
is true for the VMware #GP interception case.
Maxim is also working on this; the root cause is that
kvm_multiple_exception()'s escalation of contributory exceptions to #DF
and triple fault is incorrect in the case of nested virtualization.
Paolo
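
As a minimal sketch of the trade-off in the MOVBE example above (plain C
model, not the KVM decision code; the flag names are made up): when L0
traps #UD in order to emulate an instruction the host CPU lacks, the #UD
is an artifact of that emulation, so reflecting it to L1 would break an L2
whose CPUID, as configured by L1, advertises the feature.

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative only.  l0_emulates_insn stands for "L0 intercepts #UD to
 * emulate this instruction (e.g. MOVBE on a host without MOVBE)";
 * l1_intercepts_ud stands for the #UD bit in L1's exception intercepts.
 */
static const char *handle_l2_ud(bool l0_emulates_insn, bool l1_intercepts_ud)
{
        if (l0_emulates_insn)
                return "emulate in L0: L2 must not see a #UD its CPUID rules out";
        if (l1_intercepts_ud)
                return "reflect as a vmexit to L1";
        return "inject #UD into L2";
}

int main(void)
{
        /* Host lacks MOVBE, L1 advertises it to L2: L0 must emulate. */
        puts(handle_l2_ud(true, true));
        /* An instruction L0 does not emulate: honor L1's intercept. */
        puts(handle_l2_ud(false, true));
        return 0;
}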
On Sat, Mar 06, 2021, Paolo Bonzini wrote:
> On 06/03/21 02:39, Sean Christopherson wrote:
> > Unless KVM (L0) knowingly wants to override L1, e.g. KVM_GUESTDBG_* cases, KVM
> > shouldn't do a damn thing except forward the exception to L1 if L1 wants the
> > exception.
> >
> > ud_interception() and gp_interception() do quite a bit before forwarding the
> > exception, and in the case of #UD, it's entirely possible the #UD will never get
> > forwarded to L1. #GP is even more problematic because it's a contributory
> > exception, and kvm_multiple_exception() is not equipped to check and handle
> > nested intercepts before vectoring the exception, which means KVM will
> > incorrectly escalate a #GP->#DF and #GP->#DF->Triple Fault instead of exiting
> > to L1. That's a wee bit problematic since KVM also has a soon-to-be-fixed bug
> > where it kills L1 on a Triple Fault in L2...
>
> I agree with the #GP problem, but this is on purpose. For example, if L1
> CPUID has MOVBE and it is being emulated via #UD, L1 would be right to set
> MOVBE in L2's CPUID and expect it not to cause a #UD.
The opposite is also true, since KVM has no way of knowing what CPU model L1 has
exposed to L2. Though admittedly hiding MOVBE is a rather contrived case. But
the other EmulateOnUD instructions that don't have an intercept condition
(SYSENTER, SYSEXIT, SYSCALL, and VMCALL) are also suspect. SYS* will mostly do
the right thing, though it's again technically possible that KVM will do the
wrong thing since KVM doesn't know L2's CPU model. VMCALL is also probably ok
in most scenarios, but patching L2's code from L0 KVM is sketchy.
> The same is true for the VMware #GP interception case.
I highly doubt that will ever work out as intended for the modified IO #GP
behavior. The only way emulating #GP in L2 is correct is if L1 passes the
capabilities through to L2, i.e. the I/O access isn't intercepted by L1.
That seems unlikely. If the I/O is intercepted by L1, bypassing the IOPL and
TSS-bitmap checks is wrong and will cause L1 to emulate I/O for L2 userspace
that should never be allowed. Odds are there isn't a corresponding emulated
port in L1, i.e. there's no major security flaw, but it's far from good
behavior.
I can see how some of the instructions will kinda sorta work, but IMO forwarding
them to L1 is much safer, even if it means that L1 will see faults that should
be impossible. At least for KVM-on-KVM, those spurious faults are benign since
L1 likely also knows how to emulate the #UD and #GP instructions.
On 08/03/21 17:44, Sean Christopherson wrote:
> VMCALL is also probably ok
> in most scenarios, but patching L2's code from L0 KVM is sketchy.
I agree that patching is sketchy and I'll send a patch. However...
>> The same is true for the VMware #GP interception case.
>
> I highly doubt that will ever work out as intended for the modified IO #GP
> behavior. The only way emulating #GP in L2 is correct is if L1 passes the
> capabilities through to L2, i.e. the I/O access isn't intercepted by L1.
> That seems unlikely.
... not all hypervisors trap everything. In particular in this case the
VMCS12 I/O permission bitmap should be consulted (which we do in
vmx_check_intercept_io), but if the I/O is not trapped by L1 it should
bypass the IOPL and TSS-bitmap checks in my opinion.
Paolo
> If the I/O is intercepted by L1, bypassing the IOPL and
> TSS-bitmap checks is wrong and will cause L1 to emulate I/O for L2 userspace
> that should never be allowed. Odds are there isn't a corresponding emulated
> port in L1, i.e. there's no major security flaw, but it's far from good
> behavior.
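
For reference, a hedged sketch of the bitmap lookup being referred to
(standalone C, not the actual vmx_check_intercept_io() implementation; it
assumes the vmcs12 "use I/O bitmaps" control is set and ignores port
wrap-around). Bitmap A covers ports 0x0000-0x7fff and bitmap B covers
0x8000-0xffff, one bit per port, and a multi-byte access is intercepted if
any byte's bit is set:

#include <stdbool.h>
#include <stdint.h>

/*
 * bitmap_a and bitmap_b are the two 4 KiB I/O permission bitmaps referenced
 * by vmcs12.  Returns true if L1 asked for a vmexit on any byte of the
 * access; in that case the #GP shortcuts must not be applied and the exit
 * should be forwarded to L1 instead.
 */
bool l1_intercepts_io(const uint8_t *bitmap_a, const uint8_t *bitmap_b,
                      uint16_t port, unsigned int size)
{
        unsigned int i;

        for (i = 0; i < size; i++) {
                uint16_t p = port + i;
                const uint8_t *bitmap = p < 0x8000 ? bitmap_a : bitmap_b;
                unsigned int bit = p & 0x7fff;

                if (bitmap[bit / 8] & (1u << (bit % 8)))
                        return true;
        }
        return false;
}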
On Mon, Mar 08, 2021, Paolo Bonzini wrote:
> On 08/03/21 17:44, Sean Christopherson wrote:
> > VMCALL is also probably ok
> > in most scenarios, but patching L2's code from L0 KVM is sketchy.
>
> I agree that patching is sketchy and I'll send a patch. However...
>
> > > The same is true for the VMware #GP interception case.
> >
> > I highly doubt that will ever work out as intended for the modified IO #GP
> > behavior. The only way emulating #GP in L2 is correct is if L1 passes the
> > capabilities through to L2, i.e. the I/O access isn't intercepted by L1.
> > That seems unlikely.
>
> ... not all hypervisors trap everything. In particular in this case the
> VMCS12 I/O permission bitmap should be consulted (which we do in
> vmx_check_intercept_io), but if the I/O is not trapped by L1 it should
> bypass the IOPL and TSS-bitmap checks in my opinion.
I agree, _if_ it's not trapped. But bypassing the checks when it is trapped is
clearly wrong.
On 08/03/21 21:43, Sean Christopherson wrote:
> On Mon, Mar 08, 2021, Paolo Bonzini wrote:
>> On 08/03/21 17:44, Sean Christopherson wrote:
>>> VMCALL is also probably ok
>>> in most scenarios, but patching L2's code from L0 KVM is sketchy.
>>
>> I agree that patching is sketchy and I'll send a patch. However...
>>
>>>> The same is true for the VMware #GP interception case.
>>>
>>> I highly doubt that will ever work out as intended for the modified IO #GP
>>> behavior. The only way emulating #GP in L2 is correct is if L1 passes the
>>> capabilities through to L2, i.e. the I/O access isn't intercepted by L1.
>>> That seems unlikely.
>>
>> ... not all hypervisors trap everything. In particular in this case the
>> VMCS12 I/O permission bitmap should be consulted (which we do in
>> vmx_check_intercept_io), but if the I/O is not trapped by L1 it should
>> bypass the IOPL and TSS-bitmap checks in my opinion.
>
> I agree, _if_ it's not trapped. But bypassing the checks when it is trapped is
> clearly wrong.
You can still trap #GP unconditionally and run the emulator. The
intercept check will (or should) handle it.
Paolo