2021-05-03 18:10:32

by Vitaly Kuznetsov

Subject: [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use

Win10 guests with WSL2 enabled sometimes crash on migration when
enlightened VMCS is in use. The condition seems to be induced by an
L2->L1 exit happening immediately after migration and before L2 gets
a chance to run (e.g. when there's an interrupt pending).
The issue was introduced by commit f2c7ef3ba955 ("KVM: nSVM: cancel
KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit") and the first patch
of the series addresses the immediate issue. The eVMCS mapping restoration
path, however, seems to be fragile, and the rest of the series tries to
make it more future proof by including the eVMCS GPA in the migration data.

Vitaly Kuznetsov (4):
KVM: nVMX: Always make an attempt to map eVMCS after migration
KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'
KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()
KVM: nVMX: Map enlightened VMCS upon restore when possible

arch/x86/include/uapi/asm/kvm.h | 4 ++
arch/x86/kvm/vmx/nested.c | 82 +++++++++++++++++++++++----------
2 files changed, 61 insertions(+), 25 deletions(-)

--
2.30.2


2021-05-03 18:10:53

by Vitaly Kuznetsov

Subject: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

It now looks like a bad idea not to restore the eVMCS mapping directly from
vmx_set_nested_state(). The restoration path now depends on whether KVM
will continue executing L2 (vmx_get_nested_state_pages()) or will have to
exit to L1 (nested_vmx_vmexit()); this complicates error propagation and
diverges too much from the 'native' path, where 'nested.current_vmptr' is
set directly from vmx_set_nested_state().

The existing solution postponing eVMCS mapping also seems to be fragile.
In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
state.

Also, in case vmx_get_nested_state() is called right after
vmx_set_nested_state() without executing the guest first, the resulting
state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
missing.

Fix all these issues by making the eVMCS restoration path closer to its
'native' sibling: put the eVMCS GPA into 'struct kvm_vmx_nested_state_hdr'.
To avoid ABI incompatibility, do not introduce a new flag and keep the
original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
place. To distinguish between the 'new' and 'old' formats, treat eVMCS
GPA == 0 as an unset GPA (thus forcing the KVM_REQ_GET_NESTED_STATE_PAGES
path). While a genuine GPA of 0 is technically possible, it seems
extremely unlikely.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/include/uapi/asm/kvm.h | 2 ++
arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++------
2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0662f644aad9..3845977b739e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {

__u32 flags;
__u64 preemption_timer_deadline;
+
+ __u64 evmcs_pa;
};

struct kvm_svm_nested_state_data {
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 37fdc34f7afc..4261cf4755c8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
.hdr.vmx.vmxon_pa = -1ull,
.hdr.vmx.vmcs12_pa = -1ull,
.hdr.vmx.preemption_timer_deadline = 0,
+ .hdr.vmx.evmcs_pa = -1ull,
};
struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
&user_kvm_nested_state->data.vmx[0];
@@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
if (vmx_has_valid_vmcs12(vcpu)) {
kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);

- if (vmx->nested.hv_evmcs)
+ if (vmx->nested.hv_evmcs) {
kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
+ kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
+ }

if (is_guest_mode(vcpu) &&
nested_cpu_has_shadow_vmcs(vmcs12) &&
@@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,

set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
} else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
+ u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
+
/*
- * nested_vmx_handle_enlightened_vmptrld() cannot be called
- * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
- * restored yet. EVMCS will be mapped from
- * nested_get_vmcs12_pages().
+ * EVMCS GPA == 0 most likely indicates that the migration data is
+ * coming from an older KVM which doesn't support 'evmcs_pa' in
+ * 'struct kvm_vmx_nested_state_hdr'.
*/
- kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+ if (evmcs_gpa && (evmcs_gpa != -1ull) &&
+ (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
+ EVMPTRLD_SUCCEEDED)) {
+ return -EINVAL;
+ } else if (!evmcs_gpa) {
+ /*
+ * EVMCS GPA can't be acquired from VP assist page here because
+ * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
+ * EVMCS will be mapped from nested_get_evmcs_page().
+ */
+ kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+ }
} else {
return -EINVAL;
}
--
2.30.2

2021-05-03 18:11:06

by Vitaly Kuznetsov

Subject: [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()

As a preparation for mapping eVMCS from vmx_set_nested_state(), split
the actual eVMCS mapping from acquiring the eVMCS GPA.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2febb1dd68e8..37fdc34f7afc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1972,18 +1972,11 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
* This is an equivalent of the nested hypervisor executing the vmptrld
* instruction.
*/
-static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
- struct kvm_vcpu *vcpu, bool from_launch)
+static enum nested_evmptrld_status __nested_vmx_handle_enlightened_vmptrld(
+ struct kvm_vcpu *vcpu, u64 evmcs_gpa, bool from_launch)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
bool evmcs_gpa_changed = false;
- u64 evmcs_gpa;
-
- if (likely(!vmx->nested.enlightened_vmcs_enabled))
- return EVMPTRLD_DISABLED;
-
- if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
- return EVMPTRLD_DISABLED;

if (unlikely(!vmx->nested.hv_evmcs ||
evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
@@ -2055,6 +2048,21 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
return EVMPTRLD_SUCCEEDED;
}

+static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
+ struct kvm_vcpu *vcpu, bool from_launch)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ u64 evmcs_gpa;
+
+ if (likely(!vmx->nested.enlightened_vmcs_enabled))
+ return EVMPTRLD_DISABLED;
+
+ if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
+ return EVMPTRLD_DISABLED;
+
+ return __nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, from_launch);
+}
+
void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
--
2.30.2

2021-05-03 18:11:51

by Paolo Bonzini

Subject: Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

On 03/05/21 17:08, Vitaly Kuznetsov wrote:
> It now looks like a bad idea not to restore the eVMCS mapping directly from
> vmx_set_nested_state(). The restoration path now depends on whether KVM
> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
> exit to L1 (nested_vmx_vmexit()); this complicates error propagation and
> diverges too much from the 'native' path, where 'nested.current_vmptr' is
> set directly from vmx_set_nested_state().
>
> The existing solution postponing eVMCS mapping also seems to be fragile.
> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
> state.
>
> Also, in case vmx_get_nested_state() is called right after
> vmx_set_nested_state() without executing the guest first, the resulting
> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
> missing.
>
> Fix all these issues by making eVMCS restoration path closer to its
> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
> To avoid ABI incompatibility, do not introduce a new flag and keep the

I'm not sure what is the disadvantage of not having a new flag.

Having two different paths with subtly different side effects however
seems really worse for maintenance. We are already discussing in
another thread how to get rid of the check_nested_events side effects;
that might possibly even remove the need for patch 1, so it's at least
worth pursuing more than adding this second path.

I have queued patch 1, but I'd rather have a kvm selftest for it. It
doesn't seem impossible to have one...

Paolo

> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
> place. To distinguish between 'new' and 'old' formats consider eVMCS
> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
> path). While technically possible, it seems to be an extremely unlikely
> case.


> Signed-off-by: Vitaly Kuznetsov<[email protected]>

2021-05-03 18:12:57

by Vitaly Kuznetsov

Subject: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration

When enlightened VMCS is in use and nested state is migrated with
vmx_get_nested_state()/vmx_set_nested_state(), KVM can't map the eVMCS
page right away: the eVMCS GPA is not in 'struct kvm_vmx_nested_state_hdr'
and we can't read it from the VP assist page because userspace may decide
to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
(and QEMU, for example, does exactly that). To make sure eVMCS gets
mapped, vmx_set_nested_state() raises a KVM_REQ_GET_NESTED_STATE_PAGES
request.

Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
when an immediate exit from L2 to L1 happens right after migration (caused
by a pending event, for example). Unfortunately, in the exact same
situation we still need to have eVMCS mapped so
nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.

As a band-aid, restore nested_get_evmcs_page() when clearing
KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
from being ideal as we can't easily propagate possible failures and even if
we could, this is most likely already too late to do so. The whole
'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
seems to be fragile as we diverge too much from the 'native' path when
vmptr loading happens on vmx_set_nested_state().

Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1e069aac7410..2febb1dd68e8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
nested_vmx_handle_enlightened_vmptrld(vcpu, false);

if (evmptrld_status == EVMPTRLD_VMFAIL ||
- evmptrld_status == EVMPTRLD_ERROR) {
- pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
- __func__);
- vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
- vcpu->run->internal.suberror =
- KVM_INTERNAL_ERROR_EMULATION;
- vcpu->run->internal.ndata = 0;
+ evmptrld_status == EVMPTRLD_ERROR)
return false;
- }
}

return true;
@@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)

static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
{
- if (!nested_get_evmcs_page(vcpu))
+ if (!nested_get_evmcs_page(vcpu)) {
+ pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
+ __func__);
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror =
+ KVM_INTERNAL_ERROR_EMULATION;
+ vcpu->run->internal.ndata = 0;
+
return false;
+ }

if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
return false;
@@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
/* trying to cancel vmlaunch/vmresume is a bug */
WARN_ON_ONCE(vmx->nested.nested_run_pending);

- kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+ if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+ /*
+ * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
+ * Enlightened VMCS after migration and we still need to
+ * do that when something is forcing L2->L1 exit prior to
+ * the first L2 run.
+ */
+ (void)nested_get_evmcs_page(vcpu);
+ }

/* Service the TLB flush request for L2 before switching to L1. */
if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
--
2.30.2

2021-05-04 08:07:27

by Vitaly Kuznetsov

Subject: Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

Paolo Bonzini <[email protected]> writes:

> On 03/05/21 17:08, Vitaly Kuznetsov wrote:
>> It now looks like a bad idea not to restore the eVMCS mapping directly from
>> vmx_set_nested_state(). The restoration path now depends on whether KVM
>> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
>> exit to L1 (nested_vmx_vmexit()); this complicates error propagation and
>> diverges too much from the 'native' path, where 'nested.current_vmptr' is
>> set directly from vmx_set_nested_state().
>>
>> The existing solution postponing eVMCS mapping also seems to be fragile.
>> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
>> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
>> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
>> state.
>>
>> Also, in case vmx_get_nested_state() is called right after
>> vmx_set_nested_state() without executing the guest first, the resulting
>> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
>> missing.
>>
>> Fix all these issues by making eVMCS restoration path closer to its
>> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
>> To avoid ABI incompatibility, do not introduce a new flag and keep the
>
> I'm not sure what is the disadvantage of not having a new flag.
>

Adding a new flag would make us backwards-incompatible both ways:

1) Migrating 'new' state to an older KVM will fail the

if (kvm_state->hdr.vmx.flags & ~KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE)
return -EINVAL;

check.

2) Migrating 'old' state to a 'new' KVM would still require supporting the
old ('KVM_REQ_GET_NESTED_STATE_PAGES') path, so the flag would remain
'optional' anyway.

> Having two different paths with subtly different side effects however
> seems really worse for maintenance. We are already discussing in
> another thread how to get rid of the check_nested_events side effects;
> that might possibly even remove the need for patch 1, so it's at least
> worth pursuing more than adding this second path.

I have to admit I don't fully like this solution either :-( If we make
sure KVM_REQ_GET_NESTED_STATE_PAGES always gets handled, the fix can
indeed be omitted; however, I still dislike the divergence and the fact
that the 'if (vmx->nested.hv_evmcs)' checks scattered across the code are
not fully valid. E.g. how do we fix the problem of an immediate
KVM_GET_NESTED_STATE after KVM_SET_NESTED_STATE without executing the vCPU?

>
> I have queued patch 1, but I'd rather have a kvm selftest for it. It
> doesn't seem impossible to have one...

Thank you, the band-aid solves a real problem. Let me try to come up
with a selftest for it.

>
> Paolo
>
>> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
>> place. To distinguish between 'new' and 'old' formats consider eVMCS
>> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
>> path). While technically possible, it seems to be an extremely unlikely
>> case.
>
>
>> Signed-off-by: Vitaly Kuznetsov<[email protected]>
>

--
Vitaly

2021-05-04 08:56:46

by Paolo Bonzini

Subject: Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

On 04/05/21 10:02, Vitaly Kuznetsov wrote:
> I still dislike the divergence and the fact
> that the 'if (vmx->nested.hv_evmcs)' checks scattered across the code are
> not fully valid. E.g. how do we fix the problem of an immediate
> KVM_GET_NESTED_STATE after KVM_SET_NESTED_STATE without executing the vCPU?

You obviously have thought about this more than I did, but if you can
write a testcase for that as well, I can take a look.

Thanks,

Paolo

2021-05-05 08:24:15

by Maxim Levitsky

Subject: Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> When enlightened VMCS is in use and nested state is migrated with
> vmx_get_nested_state()/vmx_set_nested_state(), KVM can't map the eVMCS
> page right away: the eVMCS GPA is not in 'struct kvm_vmx_nested_state_hdr'
> and we can't read it from the VP assist page because userspace may decide
> to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
> (and QEMU, for example, does exactly that). To make sure eVMCS gets
> mapped, vmx_set_nested_state() raises a KVM_REQ_GET_NESTED_STATE_PAGES
> request.
>
> Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
> on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
> nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
> when an immediate exit from L2 to L1 happens right after migration (caused
> by a pending event, for example). Unfortunately, in the exact same
> situation we still need to have eVMCS mapped so
> nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
>
> As a band-aid, restore nested_get_evmcs_page() when clearing
> KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
> from being ideal as we can't easily propagate possible failures and even if
> we could, this is most likely already too late to do so. The whole
> 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
> seems to be fragile as we diverge too much from the 'native' path when
> vmptr loading happens on vmx_set_nested_state().
>
> Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
> 1 file changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 1e069aac7410..2febb1dd68e8 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
> nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>
> if (evmptrld_status == EVMPTRLD_VMFAIL ||
> - evmptrld_status == EVMPTRLD_ERROR) {
> - pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> - __func__);
> - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> - vcpu->run->internal.suberror =
> - KVM_INTERNAL_ERROR_EMULATION;
> - vcpu->run->internal.ndata = 0;
> + evmptrld_status == EVMPTRLD_ERROR)
> return false;
> - }
> }
>
> return true;
> @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>
> static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
> {
> - if (!nested_get_evmcs_page(vcpu))
> + if (!nested_get_evmcs_page(vcpu)) {
> + pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> + __func__);
> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> + vcpu->run->internal.suberror =
> + KVM_INTERNAL_ERROR_EMULATION;
> + vcpu->run->internal.ndata = 0;
> +
> return false;
> + }

Hi!

Any reason to move the debug prints out of nested_get_evmcs_page?


>
> if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
> return false;
> @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> /* trying to cancel vmlaunch/vmresume is a bug */
> WARN_ON_ONCE(vmx->nested.nested_run_pending);
>
> - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> + if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> + /*
> + * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
> + * Enlightened VMCS after migration and we still need to
> + * do that when something is forcing L2->L1 exit prior to
> + * the first L2 run.
> + */
> + (void)nested_get_evmcs_page(vcpu);
> + }
Yes, this is a band-aid, but I agree it has to be done.

>
> /* Service the TLB flush request for L2 before switching to L1. */
> if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))




I also tested this and it survives a bit better: it used to crash instantly
after a single migration cycle, but the guest still crashes after roughly 20
iterations of my regular nested migration test.

The blue screen shows that the stop code is HYPERVISOR ERROR and nothing else.

I tested both this patch alone and all 4 patches.

Without eVMCS, the same VM with the same host kernel and QEMU survived an
overnight test and passed about 1800 migration iterations.
(My synthetic migration test doesn't yet work on Intel; I need to investigate why.)

For reference, this is the VM that you gave me to test, on a kvm/queue kernel
with mainline merged into it, and a mostly up-to-date QEMU (updated about a
week ago or so):

qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b with v5.12 from mainline
merged in.

Best regards,
Maxim Levitsky




2021-05-05 08:26:30

by Maxim Levitsky

Subject: Re: [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> As a preparation for mapping eVMCS from vmx_set_nested_state(), split
> the actual eVMCS mapping from acquiring the eVMCS GPA.
>
> No functional change intended.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2febb1dd68e8..37fdc34f7afc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1972,18 +1972,11 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
> * This is an equivalent of the nested hypervisor executing the vmptrld
> * instruction.
> */
> -static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> - struct kvm_vcpu *vcpu, bool from_launch)
> +static enum nested_evmptrld_status __nested_vmx_handle_enlightened_vmptrld(
> + struct kvm_vcpu *vcpu, u64 evmcs_gpa, bool from_launch)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> bool evmcs_gpa_changed = false;
> - u64 evmcs_gpa;
> -
> - if (likely(!vmx->nested.enlightened_vmcs_enabled))
> - return EVMPTRLD_DISABLED;
> -
> - if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
> - return EVMPTRLD_DISABLED;
>
> if (unlikely(!vmx->nested.hv_evmcs ||
> evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
> @@ -2055,6 +2048,21 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> return EVMPTRLD_SUCCEEDED;
> }
>
> +static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> + struct kvm_vcpu *vcpu, bool from_launch)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + u64 evmcs_gpa;
> +
> + if (likely(!vmx->nested.enlightened_vmcs_enabled))
> + return EVMPTRLD_DISABLED;
> +
> + if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
> + return EVMPTRLD_DISABLED;
> +
> + return __nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, from_launch);
> +}
> +
> void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);

Reviewed-by: Maxim Levitsky <[email protected]>

Best regards,
Maxim Levitsky

2021-05-05 08:34:37

by Maxim Levitsky

Subject: Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> It now looks like a bad idea not to restore the eVMCS mapping directly from
> vmx_set_nested_state(). The restoration path now depends on whether KVM
> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
> exit to L1 (nested_vmx_vmexit()); this complicates error propagation and
> diverges too much from the 'native' path, where 'nested.current_vmptr' is
> set directly from vmx_set_nested_state().
>
> The existing solution postponing eVMCS mapping also seems to be fragile.
> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
> state.
>
> Also, in case vmx_get_nested_state() is called right after
> vmx_set_nested_state() without executing the guest first, the resulting
> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
> missing.
>
> Fix all these issues by making eVMCS restoration path closer to its
> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
> To avoid ABI incompatibility, do not introduce a new flag and keep the
> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
> place. To distinguish between 'new' and 'old' formats consider eVMCS
> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
> path). While technically possible, it seems to be an extremely unlikely
> case.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm.h | 2 ++
> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++------
> 2 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 0662f644aad9..3845977b739e 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {
>
> __u32 flags;
> __u64 preemption_timer_deadline;
> +
> + __u64 evmcs_pa;
> };
>
> struct kvm_svm_nested_state_data {
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 37fdc34f7afc..4261cf4755c8 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
> .hdr.vmx.vmxon_pa = -1ull,
> .hdr.vmx.vmcs12_pa = -1ull,
> .hdr.vmx.preemption_timer_deadline = 0,
> + .hdr.vmx.evmcs_pa = -1ull,
> };
> struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
> &user_kvm_nested_state->data.vmx[0];
> @@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
> if (vmx_has_valid_vmcs12(vcpu)) {
> kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>
> - if (vmx->nested.hv_evmcs)
> + if (vmx->nested.hv_evmcs) {
> kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
> + kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
> + }
>
> if (is_guest_mode(vcpu) &&
> nested_cpu_has_shadow_vmcs(vmcs12) &&
> @@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
>
> set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
> } else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
> + u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
> +
> /*
> - * nested_vmx_handle_enlightened_vmptrld() cannot be called
> - * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
> - * restored yet. EVMCS will be mapped from
> - * nested_get_vmcs12_pages().
> + * EVMCS GPA == 0 most likely indicates that the migration data is
> + * coming from an older KVM which doesn't support 'evmcs_pa' in
> + * 'struct kvm_vmx_nested_state_hdr'.
> */
> - kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> + if (evmcs_gpa && (evmcs_gpa != -1ull) &&
> + (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
> + EVMPTRLD_SUCCEEDED)) {
> + return -EINVAL;
> + } else if (!evmcs_gpa) {
> + /*
> + * EVMCS GPA can't be acquired from VP assist page here because
> + * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
> + * EVMCS will be mapped from nested_get_evmcs_page().
> + */
> + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> + }
> } else {
> return -EINVAL;
> }

Hi everyone!

Let me explain my concern about this patch, and also check that I understand
the situation correctly.

In a nutshell, if I understand this correctly, we are not allowed to access
any guest memory while setting the nested state.

Now, if I also understand correctly, the reason for the above is that
userspace is allowed to set the nested state first, then fiddle with the
KVM memslots, maybe even update the guest memory, and only later issue the
KVM_RUN ioctl.

And so this is the major reason why the KVM_REQ_GET_NESTED_STATE_PAGES
request exists in the first place.

If that is correct, I assume that we either have to keep loading the eVMCS
page on the KVM_REQ_GET_NESTED_STATE_PAGES request, or we want to include
the eVMCS itself in the migration state in addition to its physical address,
similar to how we treat the VMCS12 and the VMCB12.

I personally tinkered with QEMU to try to reproduce this situation; in my
tests I wasn't able to make it update the memory map after the load of the
nested state but prior to KVM_RUN, but neither was I able to prove that this
can't happen.

In addition, I don't know how QEMU behaves when it does guest RAM post-copy,
because so far I haven't tried to tinker with that.

Finally, other userspace hypervisors exist, and they might rely on this
assumption as well.

Looking forward for any comments,
Best regards,
Maxim Levitsky



2021-05-05 08:42:01

by Vitaly Kuznetsov

Subject: Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration

Maxim Levitsky <[email protected]> writes:

> On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> When enlightened VMCS is in use and nested state is migrated with
>> vmx_get_nested_state()/vmx_set_nested_state(), KVM can't map the eVMCS
>> page right away: the eVMCS GPA is not in 'struct kvm_vmx_nested_state_hdr'
>> and we can't read it from the VP assist page because userspace may decide
>> to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
>> (and QEMU, for example, does exactly that). To make sure eVMCS gets
>> mapped, vmx_set_nested_state() raises a KVM_REQ_GET_NESTED_STATE_PAGES
>> request.
>>
>> Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
>> on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
>> nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
>> when an immediate exit from L2 to L1 happens right after migration (caused
>> by a pending event, for example). Unfortunately, in the exact same
>> situation we still need to have eVMCS mapped so
>> nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
>>
>> As a band-aid, restore nested_get_evmcs_page() when clearing
>> KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
>> from being ideal as we can't easily propagate possible failures and even if
>> we could, this is most likely already too late to do so. The whole
>> 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
>> seems to be fragile as we diverge too much from the 'native' path when
>> vmptr loading happens on vmx_set_nested_state().
>>
>> Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
>> 1 file changed, 19 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 1e069aac7410..2febb1dd68e8 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>> nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>>
>> if (evmptrld_status == EVMPTRLD_VMFAIL ||
>> - evmptrld_status == EVMPTRLD_ERROR) {
>> - pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> - __func__);
>> - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> - vcpu->run->internal.suberror =
>> - KVM_INTERNAL_ERROR_EMULATION;
>> - vcpu->run->internal.ndata = 0;
>> + evmptrld_status == EVMPTRLD_ERROR)
>> return false;
>> - }
>> }
>>
>> return true;
>> @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>>
>> static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>> {
>> - if (!nested_get_evmcs_page(vcpu))
>> + if (!nested_get_evmcs_page(vcpu)) {
>> + pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> + __func__);
>> + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> + vcpu->run->internal.suberror =
>> + KVM_INTERNAL_ERROR_EMULATION;
>> + vcpu->run->internal.ndata = 0;
>> +
>> return false;
>> + }
>
> Hi!
>
> Any reason to move the debug prints out of nested_get_evmcs_page?
>

Debug print could've probably stayed or could've been dropped
completely -- I don't really believe it's going to help
anyone. Debugging such issues without instrumentation/tracing seems to
be hard-to-impossible...

>
>>
>> if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
>> return false;
>> @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>> /* trying to cancel vmlaunch/vmresume is a bug */
>> WARN_ON_ONCE(vmx->nested.nested_run_pending);
>>
>> - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> + if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
>> + /*
>> + * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
>> + * Enlightened VMCS after migration and we still need to
>> + * do that when something is forcing L2->L1 exit prior to
>> + * the first L2 run.
>> + */
>> + (void)nested_get_evmcs_page(vcpu);
>> + }
> Yes, this is a band-aid, but it has to be done, I agree.
>

To restore the status quo, yes.

>>
>> /* Service the TLB flush request for L2 before switching to L1. */
>> if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
>
>
>
>
> I also tested this and it survives a bit better (used to crash instantly
> after a single migration cycle, but the guest still crashes after around ~20 iterations of my
> regular nested migration test).
>
> Blue screen shows that the stop code is HYPERVISOR ERROR and nothing else.
>
> I tested both this patch alone and all 4 patches.
>
> Without evmcs, the same VM with same host kernel and qemu survived an overnight
> test and passed about 1800 migration iterations.
> (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
>

It would be great to compare on Intel to be 100% sure the issue is eVMCS
related, Hyper-V may be behaving quite differently on AMD.

> For reference this is the VM that you gave me to test, kvm/queue kernel,
> with merged mainline in it,
> and mostly latest qemu (updated about a week ago or so)
>
> qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
> kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b and
> then merge v5.12 from mainline.

Thanks for testing! I'll try to come up with a selftest for this issue,
maybe it'll help us discover others :)

--
Vitaly

2021-05-05 09:20:58

by Maxim Levitsky

Subject: Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration

On Wed, 2021-05-05 at 10:39 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> > > When enlightened VMCS is in use and nested state is migrated with
> > > vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
> > > page right away: evmcs gpa is not in 'struct kvm_vmx_nested_state_hdr'
> > > and we can't read it from VP assist page because userspace may decide
> > > to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
> > > (and QEMU, for example, does exactly that). To make sure eVMCS is
> > > mapped, vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
> > > request.
> > >
> > > Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
> > > on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
> > > nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
> > > when an immediate exit from L2 to L1 happens right after migration (caused
> > > by a pending event, for example). Unfortunately, in the exact same
> > > situation we still need to have eVMCS mapped so
> > > nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
> > >
> > > As a band-aid, restore nested_get_evmcs_page() when clearing
> > > KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
> > > from being ideal as we can't easily propagate possible failures and even if
> > > we could, this is most likely already too late to do so. The whole
> > > 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
> > > seems to be fragile as we diverge too much from the 'native' path when
> > > vmptr loading happens on vmx_set_nested_state().
> > >
> > > Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
> > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
> > > 1 file changed, 19 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index 1e069aac7410..2febb1dd68e8 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
> > > nested_vmx_handle_enlightened_vmptrld(vcpu, false);
> > >
> > > if (evmptrld_status == EVMPTRLD_VMFAIL ||
> > > - evmptrld_status == EVMPTRLD_ERROR) {
> > > - pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> > > - __func__);
> > > - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> > > - vcpu->run->internal.suberror =
> > > - KVM_INTERNAL_ERROR_EMULATION;
> > > - vcpu->run->internal.ndata = 0;
> > > + evmptrld_status == EVMPTRLD_ERROR)
> > > return false;
> > > - }
> > > }
> > >
> > > return true;
> > > @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> > >
> > > static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
> > > {
> > > - if (!nested_get_evmcs_page(vcpu))
> > > + if (!nested_get_evmcs_page(vcpu)) {
> > > + pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> > > + __func__);
> > > + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> > > + vcpu->run->internal.suberror =
> > > + KVM_INTERNAL_ERROR_EMULATION;
> > > + vcpu->run->internal.ndata = 0;
> > > +
> > > return false;
> > > + }
> >
> > Hi!
> >
> > Any reason to move the debug prints out of nested_get_evmcs_page?
> >
>
> Debug print could've probably stayed or could've been dropped
> completely -- I don't really believe it's going to help
> anyone. Debugging such issues without instrumentation/tracing seems to
> be hard-to-impossible...
>
> > >
> > > if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
> > > return false;
> > > @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > > /* trying to cancel vmlaunch/vmresume is a bug */
> > > WARN_ON_ONCE(vmx->nested.nested_run_pending);
> > >
> > > - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > + if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > + /*
> > > + * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
> > > + * Enlightened VMCS after migration and we still need to
> > > + * do that when something is forcing L2->L1 exit prior to
> > > + * the first L2 run.
> > > + */
> > > + (void)nested_get_evmcs_page(vcpu);
> > > + }
> > Yes this is a band-aid, but it has to be done I agree.
> >
>
> To restore the status quo, yes.
>
> > >
> > > /* Service the TLB flush request for L2 before switching to L1. */
> > > if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
> >
> >
> >
> > I also tested this and it survives a bit better (used to crash instantly
> > after a single migration cycle, but the guest still crashes after around ~20 iterations of my
> > regular nested migration test).
> >
> > Blue screen shows that the stop code is HYPERVISOR ERROR and nothing else.
> >
> > I tested both this patch alone and all 4 patches.
> >
> > Without evmcs, the same VM with same host kernel and qemu survived an overnight
> > test and passed about 1800 migration iterations.
> > (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
> >
>
> It would be great to compare on Intel to be 100% sure the issue is eVMCS
> related, Hyper-V may be behaving quite differently on AMD.
Hi!

I tested this on my Intel machine with and without eVMCS, without changing
any other parameters, running the same VM from a snapshot.

As I said, without eVMCS the test survived an overnight stress run of ~1800 migrations.
With eVMCS, it fails pretty much on the first try.
With those patches, it fails after about 20 iterations.

Best regards,
Maxim Levitsky

>
> > For reference this is the VM that you gave me to test, kvm/queue kernel,
> > with merged mainline in it,
> > and mostly latest qemu (updated about a week ago or so)
> >
> > qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
> > kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b and
> > then merge v5.12 from mainline.
>
> Thanks for testing! I'll try to come up with a selftest for this issue,
> maybe it'll help us discover others :)
>


2021-05-05 09:20:59

by Vitaly Kuznetsov

Subject: Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible

Maxim Levitsky <[email protected]> writes:

> On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> It now looks like a bad idea to not restore eVMCS mapping directly from
>> vmx_set_nested_state(). The restoration path now depends on whether KVM
>> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
>> exit to L1 (nested_vmx_vmexit()), this complicates error propagation and
>> diverges too much from the 'native' path when 'nested.current_vmptr' is
>> set directly from vmx_get_nested_state_pages().
>>
>> The existing solution postponing eVMCS mapping also seems to be fragile.
>> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
>> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
>> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
>> state.
>>
>> Also, in case vmx_get_nested_state() is called right after
>> vmx_set_nested_state() without executing the guest first, the resulting
>> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
>> missing.
>>
>> Fix all these issues by making eVMCS restoration path closer to its
>> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
>> To avoid ABI incompatibility, do not introduce a new flag and keep the
>> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
>> place. To distinguish between 'new' and 'old' formats consider eVMCS
>> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
>> path). While technically possible, it seems to be an extremely unlikely
>> case.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/kvm.h | 2 ++
>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++------
>> 2 files changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 0662f644aad9..3845977b739e 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {
>>
>> __u32 flags;
>> __u64 preemption_timer_deadline;
>> +
>> + __u64 evmcs_pa;
>> };
>>
>> struct kvm_svm_nested_state_data {
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 37fdc34f7afc..4261cf4755c8 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>> .hdr.vmx.vmxon_pa = -1ull,
>> .hdr.vmx.vmcs12_pa = -1ull,
>> .hdr.vmx.preemption_timer_deadline = 0,
>> + .hdr.vmx.evmcs_pa = -1ull,
>> };
>> struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
>> &user_kvm_nested_state->data.vmx[0];
>> @@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>> if (vmx_has_valid_vmcs12(vcpu)) {
>> kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>>
>> - if (vmx->nested.hv_evmcs)
>> + if (vmx->nested.hv_evmcs) {
>> kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
>> + kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
>> + }
>>
>> if (is_guest_mode(vcpu) &&
>> nested_cpu_has_shadow_vmcs(vmcs12) &&
>> @@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
>>
>> set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
>> } else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
>> + u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
>> +
>> /*
>> - * nested_vmx_handle_enlightened_vmptrld() cannot be called
>> - * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
>> - * restored yet. EVMCS will be mapped from
>> - * nested_get_vmcs12_pages().
>> + * EVMCS GPA == 0 most likely indicates that the migration data is
>> + * coming from an older KVM which doesn't support 'evmcs_pa' in
>> + * 'struct kvm_vmx_nested_state_hdr'.
>> */
>> - kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> + if (evmcs_gpa && (evmcs_gpa != -1ull) &&
>> + (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
>> + EVMPTRLD_SUCCEEDED)) {
>> + return -EINVAL;
>> + } else if (!evmcs_gpa) {
>> + /*
>> + * EVMCS GPA can't be acquired from VP assist page here because
>> + * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
>> + * EVMCS will be mapped from nested_get_evmcs_page().
>> + */
>> + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> + }
>> } else {
>> return -EINVAL;
>> }
>
> Hi everyone!
>
> Let me explain my concern about this patch and also ask if I understand this correctly.
>
> In a nutshell if I understand this correctly, we are not allowed to access any guest
> memory while setting the nested state.
>
> Now, if I understand correctly as well, the reason for the above
> is that the userspace is allowed to set the nested state first, then fiddle with
> the KVM memslots, maybe even update the guest memory and only later do the KVM_RUN ioctl,

Indeed, userspace is currently free to restore the guest in any order. I've
probably missed post-copy, but even the fact that guest MSRs can be
restored after restoring the nested state doesn't make our life easier.

>
> And so this is the major reason why the KVM_REQ_GET_NESTED_STATE_PAGES
> request exists in the first place.
>
> If that is correct I assume that we either have to keep loading the EVMCS page on
> KVM_REQ_GET_NESTED_STATE_PAGES request, or we want to include the EVMCS itself
> in the migration state in addition to its physical address, similar to how we treat
> the VMCS12 and the VMCB12.

Keeping eVMCS load from KVM_REQ_GET_NESTED_STATE_PAGES is OK I believe
(or at least I still don't see a reason for us to carry a copy in the
migration data). What I still don't like is the transient state after
vmx_set_nested_state():
- vmx->nested.current_vmptr is -1ull because no 'real' vmptrld was done
(we skip set_current_vmptr() when KVM_STATE_NESTED_EVMCS)
- vmx->nested.hv_evmcs/vmx->nested.hv_evmcs_vmptr are also NULL because
we haven't performed nested_vmx_handle_enlightened_vmptrld() yet.

I know of at least one real problem with this state: in case
vmx_get_nested_state() happens before KVM_RUN, the resulting state won't
have the KVM_STATE_NESTED_EVMCS flag, which is incorrect. Take a look at
the check in nested_vmx_fail() for example:

if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
return nested_vmx_failInvalid(vcpu);

this also seems off (I'm not sure it matters in any context but still).

>
> I personally tinkered with qemu to try and reproduce this situation
> and in my tests I wasn't able to make it update the memory
> map after the load of the nested state but prior to KVM_RUN
> but neither was I able to prove that this can't happen.

Userspace has multiple ways to mess with the state, of course; in KVM we
only need to make sure we don't crash :-) On migration, well-behaved
userspace is supposed to restore exactly what it got though. The
restoration sequence may vary.

>
> In addition to that I don't know how qemu behaves when it does
> guest ram post-copy because so far I haven't tried to tinker with it.
>
> Finally, other userspace hypervisors exist, and they might rely on this
> assumption as well.
>
> Looking forward to any comments,
> Best regards,
> Maxim Levitsky
>
>
>

--
Vitaly

2021-05-05 09:27:12

by Vitaly Kuznetsov

Subject: Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration

Maxim Levitsky <[email protected]> writes:

> On Wed, 2021-05-05 at 10:39 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <[email protected]> writes:
>>
>> > On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> > > When enlightened VMCS is in use and nested state is migrated with
>> > > vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
>> > > page right away: evmcs gpa is not in 'struct kvm_vmx_nested_state_hdr'
>> > > and we can't read it from VP assist page because userspace may decide
>> > > to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
>> > > (and QEMU, for example, does exactly that). To make sure eVMCS is
>> > > mapped, vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
>> > > request.
>> > >
>> > > Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
>> > > on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
>> > > nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
>> > > when an immediate exit from L2 to L1 happens right after migration (caused
>> > > by a pending event, for example). Unfortunately, in the exact same
>> > > situation we still need to have eVMCS mapped so
>> > > nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
>> > >
>> > > As a band-aid, restore nested_get_evmcs_page() when clearing
>> > > KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
>> > > from being ideal as we can't easily propagate possible failures and even if
>> > > we could, this is most likely already too late to do so. The whole
>> > > 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
>> > > seems to be fragile as we diverge too much from the 'native' path when
>> > > vmptr loading happens on vmx_set_nested_state().
>> > >
>> > > Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
>> > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> > > ---
>> > > arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
>> > > 1 file changed, 19 insertions(+), 10 deletions(-)
>> > >
>> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> > > index 1e069aac7410..2febb1dd68e8 100644
>> > > --- a/arch/x86/kvm/vmx/nested.c
>> > > +++ b/arch/x86/kvm/vmx/nested.c
>> > > @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>> > > nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>> > >
>> > > if (evmptrld_status == EVMPTRLD_VMFAIL ||
>> > > - evmptrld_status == EVMPTRLD_ERROR) {
>> > > - pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> > > - __func__);
>> > > - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> > > - vcpu->run->internal.suberror =
>> > > - KVM_INTERNAL_ERROR_EMULATION;
>> > > - vcpu->run->internal.ndata = 0;
>> > > + evmptrld_status == EVMPTRLD_ERROR)
>> > > return false;
>> > > - }
>> > > }
>> > >
>> > > return true;
>> > > @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>> > >
>> > > static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>> > > {
>> > > - if (!nested_get_evmcs_page(vcpu))
>> > > + if (!nested_get_evmcs_page(vcpu)) {
>> > > + pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> > > + __func__);
>> > > + vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> > > + vcpu->run->internal.suberror =
>> > > + KVM_INTERNAL_ERROR_EMULATION;
>> > > + vcpu->run->internal.ndata = 0;
>> > > +
>> > > return false;
>> > > + }
>> >
>> > Hi!
>> >
>> > Any reason to move the debug prints out of nested_get_evmcs_page?
>> >
>>
>> Debug print could've probably stayed or could've been dropped
>> completely -- I don't really believe it's going to help
>> anyone. Debugging such issues without instrumentation/tracing seems to
>> be hard-to-impossible...
>>
>> > >
>> > > if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
>> > > return false;
>> > > @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>> > > /* trying to cancel vmlaunch/vmresume is a bug */
>> > > WARN_ON_ONCE(vmx->nested.nested_run_pending);
>> > >
>> > > - kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> > > + if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
>> > > + /*
>> > > + * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
>> > > + * Enlightened VMCS after migration and we still need to
>> > > + * do that when something is forcing L2->L1 exit prior to
>> > > + * the first L2 run.
>> > > + */
>> > > + (void)nested_get_evmcs_page(vcpu);
>> > > + }
>> > Yes this is a band-aid, but it has to be done I agree.
>> >
>>
>> To restore the status quo, yes.
>>
>> > >
>> > > /* Service the TLB flush request for L2 before switching to L1. */
>> > > if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
>> >
>> >
>> >
>> > I also tested this and it survives a bit better (used to crash instantly
>> > after a single migration cycle, but the guest still crashes after around ~20 iterations of my
>> > regular nested migration test).
>> >
>> > Blue screen shows that the stop code is HYPERVISOR ERROR and nothing else.
>> >
>> > I tested both this patch alone and all 4 patches.
>> >
>> > Without evmcs, the same VM with same host kernel and qemu survived an overnight
>> > test and passed about 1800 migration iterations.
>> > (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
>> >
>>
>> It would be great to compare on Intel to be 100% sure the issue is eVMCS
>> related, Hyper-V may be behaving quite differently on AMD.
> Hi!
>
> I tested this on my Intel machine with and without eVMCS, without changing
> any other parameters, running the same VM from a snapshot.
>
> As I said, without eVMCS the test survived an overnight stress run of ~1800 migrations.
> With eVMCS, it fails pretty much on the first try.
> With those patches, it fails after about 20 iterations.
>

Ah, sorry, misunderstood your 'synthetic migration test doesn't yet work
on Intel' :-)

--
Vitaly