2019-09-27 21:49:40

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...

*sigh*

v2 was shaping up to be a trivial update, until I started working on
Vitaly's suggestion to add a helper to test for register availability.

The primary purpose of this series is to fix a CR3 corruption in L2
reported by Reto Buerki when running with HLT interception disabled in L1.
On a nested VM-Enter that puts L2 into HLT, KVM never actually enters L2
and instead mimics HLT interception by canceling the nested run and
pretending that VM-Enter to L2 completed and then exited on HLT (which
KVM intercepted). Because KVM never actually runs L2, KVM skips the
pending MMU update for L2 and so leaves a stale value in vmcs02.GUEST_CR3.
If the next wake event for L2 triggers a nested VM-Exit, KVM will refresh
vmcs12->guest_cr3 from vmcs02.GUEST_CR3 and consume the stale value.

Fix the issue by unconditionally writing vmcs02.GUEST_CR3 during nested
VM-Enter instead of deferring the update to vmx_set_cr3(), and skip the
update of GUEST_CR3 in vmx_set_cr3() when running L2. I.e. make the
nested code fully responsible for vmcs02.GUEST_CR3.

Patch 02/08 is a minor optimization to skip the GUEST_CR3 update if
vmcs01 is already up-to-date.

Patches 03 and beyond are Vitaly's fault ;-).

Patches 03 and 04 are tangentially related cleanup to vmx_set_rflags()
that was discovered when working through the avail/dirty testing code.
Ideally they'd be sent as a separate series, but they conflict with the
avail/dirty helper changes and are themselves minor and straightforward.

Patches 05 and 06 clean up the register caching code so that there is a
single enum for all registers which use avail/dirty tracking. While not
a true prerequisite for the avail/dirty helpers, the cleanup allows the
new helpers to take an 'enum kvm_reg' instead of a less helpful 'int reg'.

Patch 07 is the helpers themselves, as suggested by Vitaly.

Patch 08 is a truly optional change to ditch decache_cr3() in favor of
handling CR3 via cache_reg() like any other avail/dirty register.


Note, I collected the Reviewed-by and Tested-by tags for patches 01 and 02
even though I inverted the boolean from 'skip_cr3' to 'update_guest_cr3'.
Please drop the tags if that constitutes a non-trivial functional change.

v2:
- Invert skip_cr3 to update_guest_cr3. [Liran]
- Reword the changelog and comment to be more explicit in detailing
how/when KVM will process a nested VM-Enter without runnin L2. [Liran]
- Added Reviewed-by and Tested-by tags.
- Add a comment in vmx_set_cr3() to explicitly state that nested
VM-Enter is responsible for loading vmcs02.GUEST_CR3. [Jim]
- All of the loveliness in patches 03-08. [Vitaly]

Sean Christopherson (8):
KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
KVM: x86: Add WARNs to detect out-of-bounds register indices
KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
KVM: x86: Add helpers to test/mark reg availability and dirtiness
KVM: x86: Fold decache_cr3() into cache_reg()

arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/kvm/kvm_cache_regs.h | 67 +++++++++++++++++------
arch/x86/kvm/svm.c | 5 --
arch/x86/kvm/vmx/nested.c | 14 ++++-
arch/x86/kvm/vmx/vmx.c | 94 ++++++++++++++++++---------------
arch/x86/kvm/x86.c | 13 ++---
arch/x86/kvm/x86.h | 6 +--
7 files changed, 123 insertions(+), 81 deletions(-)

--
2.22.0


2019-09-27 21:49:51

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE
is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
VM-Exit occurs without actually entering L2, e.g. if the nested run
is squashed because nested VM-Enter (from L1) is putting L2 into HLT.

Note, the above scenario can occur regardless of whether L1 is
intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will
likely put a guest into HALTED if and only if it's not intercepting HLT.

In an ideal world where EPT *requires* unrestricted guest (and vice
versa), VMX could handle CR3 similar to how it handles RSP and RIP,
e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But
the unrestricted guest silliness complicates the dirty tracking logic
to the point that explicitly handling vmcs02.GUEST_CR3 during nested
VM-Enter is a simpler overall implementation.

Cc: [email protected]
Reported-and-tested-by: Reto Buerki <[email protected]>
Tested-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 10 ++++++++++
arch/x86/kvm/vmx/vmx.c | 10 +++++++---
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 41abc62c9a8a..b72a00b53e4a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2418,6 +2418,16 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
entry_failure_code))
return -EINVAL;

+ /*
+ * Immediately write vmcs02.GUEST_CR3. It will be propagated to vmcs12
+ * on nested VM-Exit, which can occur without actually running L2 and
+ * thus without hitting vmx_set_cr3(), e.g. if L1 is entering L2 with
+ * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the
+ * transition to HLT instead of running L2.
+ */
+ if (enable_ept)
+ vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
+
/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
is_pae_paging(vcpu)) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4575ffb3cec..7679c2a05a50 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2984,6 +2984,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
struct kvm *kvm = vcpu->kvm;
+ bool update_guest_cr3 = true;
unsigned long guest_cr3;
u64 eptp;

@@ -3000,15 +3001,18 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
}

- if (enable_unrestricted_guest || is_paging(vcpu) ||
- is_guest_mode(vcpu))
+ /* Loading vmcs02.GUEST_CR3 is handled by nested VM-Enter. */
+ if (is_guest_mode(vcpu))
+ update_guest_cr3 = false;
+ else if (enable_unrestricted_guest || is_paging(vcpu))
guest_cr3 = kvm_read_cr3(vcpu);
else
guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
ept_load_pdptrs(vcpu);
}

- vmcs_writel(GUEST_CR3, guest_cr3);
+ if (update_guest_cr3)
+ vmcs_writel(GUEST_CR3, guest_cr3);
}

int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
--
2.22.0

2019-09-27 23:40:22

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v2 1/8] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

On Fri, Sep 27, 2019 at 2:45 PM Sean Christopherson
<[email protected]> wrote:
>
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because nested VM-Enter (from L1) is putting L2 into HLT.
>
> Note, the above scenario can occur regardless of whether L1 is
> intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
> vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will
> likely put a guest into HALTED if and only if it's not intercepting HLT.
>
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
>
> Cc: [email protected]
> Reported-and-tested-by: Reto Buerki <[email protected]>
> Tested-by: Vitaly Kuznetsov <[email protected]>
> Reviewed-by: Liran Alon <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>

2019-09-30 10:43:41

by Reto Buerki

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...

On 9/27/19 11:45 PM, Sean Christopherson wrote:
> *sigh*
>
> v2 was shaping up to be a trivial update, until I started working on
> Vitaly's suggestion to add a helper to test for register availability.
>
> The primary purpose of this series is to fix a CR3 corruption in L2
> reported by Reto Buerki when running with HLT interception disabled in L1.
> On a nested VM-Enter that puts L2 into HLT, KVM never actually enters L2
> and instead mimics HLT interception by canceling the nested run and
> pretending that VM-Enter to L2 completed and then exited on HLT (which
> KVM intercepted). Because KVM never actually runs L2, KVM skips the
> pending MMU update for L2 and so leaves a stale value in vmcs02.GUEST_CR3.
> If the next wake event for L2 triggers a nested VM-Exit, KVM will refresh
> vmcs12->guest_cr3 from vmcs02.GUEST_CR3 and consume the stale value.
>
> Fix the issue by unconditionally writing vmcs02.GUEST_CR3 during nested
> VM-Enter instead of deferring the update to vmx_set_cr3(), and skip the
> update of GUEST_CR3 in vmx_set_cr3() when running L2. I.e. make the
> nested code fully responsible for vmcs02.GUEST_CR3.
>
> Patch 02/08 is a minor optimization to skip the GUEST_CR3 update if
> vmcs01 is already up-to-date.
>
> Patches 03 and beyond are Vitaly's fault ;-).
>
> Patches 03 and 04 are tangentially related cleanup to vmx_set_rflags()
> that was discovered when working through the avail/dirty testing code.
> Ideally they'd be sent as a separate series, but they conflict with the
> avail/dirty helper changes and are themselves minor and straightforward.
>
> Patches 05 and 06 clean up the register caching code so that there is a
> single enum for all registers which use avail/dirty tracking. While not
> a true prerequisite for the avail/dirty helpers, the cleanup allows the
> new helpers to take an 'enum kvm_reg' instead of a less helpful 'int reg'.
>
> Patch 07 is the helpers themselves, as suggested by Vitaly.
>
> Patch 08 is a truly optional change to ditch decache_cr3() in favor of
> handling CR3 via cache_reg() like any other avail/dirty register.
>
>
> Note, I collected the Reviewed-by and Tested-by tags for patches 01 and 02
> even though I inverted the boolean from 'skip_cr3' to 'update_guest_cr3'.
> Please drop the tags if that constitutes a non-trivial functional change.
>
> v2:
> - Invert skip_cr3 to update_guest_cr3. [Liran]
> - Reword the changelog and comment to be more explicit in detailing
> how/when KVM will process a nested VM-Enter without runnin L2. [Liran]
> - Added Reviewed-by and Tested-by tags.
> - Add a comment in vmx_set_cr3() to explicitly state that nested
> VM-Enter is responsible for loading vmcs02.GUEST_CR3. [Jim]
> - All of the loveliness in patches 03-08. [Vitaly]
>
> Sean Christopherson (8):
> KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
> KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
> KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
> KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
> KVM: x86: Add WARNs to detect out-of-bounds register indices
> KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
> KVM: x86: Add helpers to test/mark reg availability and dirtiness
> KVM: x86: Fold decache_cr3() into cache_reg()
>
> arch/x86/include/asm/kvm_host.h | 5 +-
> arch/x86/kvm/kvm_cache_regs.h | 67 +++++++++++++++++------
> arch/x86/kvm/svm.c | 5 --
> arch/x86/kvm/vmx/nested.c | 14 ++++-
> arch/x86/kvm/vmx/vmx.c | 94 ++++++++++++++++++---------------
> arch/x86/kvm/x86.c | 13 ++---
> arch/x86/kvm/x86.h | 6 +--
> 7 files changed, 123 insertions(+), 81 deletions(-)

Series:
Tested-by: Reto Buerki <[email protected]>

Thanks.

2019-10-29 15:13:10

by Martin Lucina

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...

(Cc:s trimmed)

Hi,

On Monday, 30.09.2019 at?12:42, Reto Buerki wrote:
> On 9/27/19 11:45 PM, Sean Christopherson wrote:
> > *sigh*
> >
> > v2 was shaping up to be a trivial update, until I started working on
> > Vitaly's suggestion to add a helper to test for register availability.
> >
> > The primary purpose of this series is to fix a CR3 corruption in L2
> > reported by Reto Buerki when running with HLT interception disabled in L1.
> > On a nested VM-Enter that puts L2 into HLT, KVM never actually enters L2
> > and instead mimics HLT interception by canceling the nested run and
> > pretending that VM-Enter to L2 completed and then exited on HLT (which
> > KVM intercepted). Because KVM never actually runs L2, KVM skips the
> > pending MMU update for L2 and so leaves a stale value in vmcs02.GUEST_CR3.
> > If the next wake event for L2 triggers a nested VM-Exit, KVM will refresh
> > vmcs12->guest_cr3 from vmcs02.GUEST_CR3 and consume the stale value.
> >
> > Fix the issue by unconditionally writing vmcs02.GUEST_CR3 during nested
> > VM-Enter instead of deferring the update to vmx_set_cr3(), and skip the
> > update of GUEST_CR3 in vmx_set_cr3() when running L2. I.e. make the
> > nested code fully responsible for vmcs02.GUEST_CR3.
> >
> > Patch 02/08 is a minor optimization to skip the GUEST_CR3 update if
> > vmcs01 is already up-to-date.
> >
> > Patches 03 and beyond are Vitaly's fault ;-).
> >
> > Patches 03 and 04 are tangentially related cleanup to vmx_set_rflags()
> > that was discovered when working through the avail/dirty testing code.
> > Ideally they'd be sent as a separate series, but they conflict with the
> > avail/dirty helper changes and are themselves minor and straightforward.
> >
> > Patches 05 and 06 clean up the register caching code so that there is a
> > single enum for all registers which use avail/dirty tracking. While not
> > a true prerequisite for the avail/dirty helpers, the cleanup allows the
> > new helpers to take an 'enum kvm_reg' instead of a less helpful 'int reg'.
> >
> > Patch 07 is the helpers themselves, as suggested by Vitaly.
> >
> > Patch 08 is a truly optional change to ditch decache_cr3() in favor of
> > handling CR3 via cache_reg() like any other avail/dirty register.
> >
> >
> > Note, I collected the Reviewed-by and Tested-by tags for patches 01 and 02
> > even though I inverted the boolean from 'skip_cr3' to 'update_guest_cr3'.
> > Please drop the tags if that constitutes a non-trivial functional change.
> >
> > v2:
> > - Invert skip_cr3 to update_guest_cr3. [Liran]
> > - Reword the changelog and comment to be more explicit in detailing
> > how/when KVM will process a nested VM-Enter without runnin L2. [Liran]
> > - Added Reviewed-by and Tested-by tags.
> > - Add a comment in vmx_set_cr3() to explicitly state that nested
> > VM-Enter is responsible for loading vmcs02.GUEST_CR3. [Jim]
> > - All of the loveliness in patches 03-08. [Vitaly]
> >
> > Sean Christopherson (8):
> > KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
> > KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
> > KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
> > KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
> > KVM: x86: Add WARNs to detect out-of-bounds register indices
> > KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
> > KVM: x86: Add helpers to test/mark reg availability and dirtiness
> > KVM: x86: Fold decache_cr3() into cache_reg()
> >
> > arch/x86/include/asm/kvm_host.h | 5 +-
> > arch/x86/kvm/kvm_cache_regs.h | 67 +++++++++++++++++------
> > arch/x86/kvm/svm.c | 5 --
> > arch/x86/kvm/vmx/nested.c | 14 ++++-
> > arch/x86/kvm/vmx/vmx.c | 94 ++++++++++++++++++---------------
> > arch/x86/kvm/x86.c | 13 ++---
> > arch/x86/kvm/x86.h | 6 +--
> > 7 files changed, 123 insertions(+), 81 deletions(-)
>
> Series:
> Tested-by: Reto Buerki <[email protected]>

Any chance of this series making it into 5.4? Unless I'm looking in the
wrong place, I don't see the changes in either kvm.git or Linus' tree.

Thanks,

Martin

2019-10-30 09:10:57

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] KVM: x86: nVMX GUEST_CR3 bug fix, and then some...

On Tue, Oct 29, 2019 at 04:03:04PM +0100, Martin Lucina wrote:
> (Cc:s trimmed)
>
> Hi,
>
> On Monday, 30.09.2019 at?12:42, Reto Buerki wrote:
> > On 9/27/19 11:45 PM, Sean Christopherson wrote:
> > > Sean Christopherson (8):
> > > KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter
> > > KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date
> > > KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors
> > > KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest
> > > KVM: x86: Add WARNs to detect out-of-bounds register indices
> > > KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg'
> > > KVM: x86: Add helpers to test/mark reg availability and dirtiness
> > > KVM: x86: Fold decache_cr3() into cache_reg()
> > >
> > > arch/x86/include/asm/kvm_host.h | 5 +-
> > > arch/x86/kvm/kvm_cache_regs.h | 67 +++++++++++++++++------
> > > arch/x86/kvm/svm.c | 5 --
> > > arch/x86/kvm/vmx/nested.c | 14 ++++-
> > > arch/x86/kvm/vmx/vmx.c | 94 ++++++++++++++++++---------------
> > > arch/x86/kvm/x86.c | 13 ++---
> > > arch/x86/kvm/x86.h | 6 +--
> > > 7 files changed, 123 insertions(+), 81 deletions(-)
> >
> > Series:
> > Tested-by: Reto Buerki <[email protected]>
>
> Any chance of this series making it into 5.4? Unless I'm looking in the
> wrong place, I don't see the changes in either kvm.git or Linus' tree.

It's queued up in kvm.git for 5.5. That being said, the first patch
should go into 5.4 (it's also tagged for stable). The next round of KVM
fixes for 5.4 will probably be delayed due to KVM Forum.

Paolo?