2021-06-07 09:04:14

by Maxim Levitsky

Subject: [PATCH v3 0/8] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration

This patch set aims to fix a few flaws that were discovered
in KVM_{GET|SET}_SREGS on x86:

* There is no support for reading/writing PDPTRs, although
  these are considered to be part of the guest state.

* There is a useless interrupt bitmap which isn't needed.

* There is no support for future extensions (via flags and such).

Also, if the user doesn't use the new SREGS2 API, the PDPTR
load after migration is now done on KVM_REQ_GET_NESTED_STATE_PAGES,
so that the PDPTRs are at least read correctly in cases where the
guest memory map is not up to date when the nested state is loaded.

This patch series was tested by doing a nested migration test
of 32 bit PAE L1 + 32 bit PAE L2 on AMD and Intel, and by a
nested migration test of 64 bit L1 + 32 bit PAE L2 on AMD.
The latter test currently fails on Intel (regardless of my patches).

Changes from V2:
- I took in the patch series from Sean Christopherson that
  removes the pdptrs_changed function and rebased my code
  on top of it.
- I updated the SET_SREGS2 ioctl to load PDPTRs from memory
  when the user hasn't provided them.
- Minor refactoring all over the place.

Changes from V1:
- move only the PDPTR load to KVM_REQ_GET_NESTED_STATE_PAGES on VMX
- rebase on top of kvm/queue
- improve KVM_GET_SREGS2 to have a flag for the PDPTRs
  and remove padding

Patches to QEMU to enable this feature were sent as well.

Maxim Levitsky (5):
  KVM: nSVM: refactor the CR3 reload on migration
  KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  KVM: x86: introduce kvm_register_clear_available
  KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  KVM: x86: avoid loading PDPTRs after migration when possible

Sean Christopherson (3):
  KVM: nVMX: Drop obsolete (and pointless) pdptrs_changed() check
  KVM: nSVM: Drop pointless pdptrs_changed() check on nested transition
  KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE
    guest

 Documentation/virt/kvm/api.rst  |  48 +++++++++
 arch/x86/include/asm/kvm_host.h |   7 +-
 arch/x86/include/uapi/asm/kvm.h |  13 +++
 arch/x86/kvm/kvm_cache_regs.h   |  12 +++
 arch/x86/kvm/svm/nested.c       |  39 +++++--
 arch/x86/kvm/svm/svm.c          |   6 +-
 arch/x86/kvm/vmx/nested.c       |  32 ++++--
 arch/x86/kvm/x86.c              | 176 +++++++++++++++++++++-----------
 include/uapi/linux/kvm.h        |   4 +
 9 files changed, 253 insertions(+), 84 deletions(-)

--
2.26.3


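For background, a PAE guest's four PDPTRs live in a 32-byte-aligned table
pointed to by CR3, and KVM reads them from guest memory in load_pdptrs().
A minimal sketch of that lookup (simplified from the pdptrs_changed() code
removed in patch 3; read_guest() is a hypothetical stand-in for
kvm_read_guest_page_mmu(), and error handling is elided):

/* Sketch: fetch the four PDPTEs of a PAE guest from the table at CR3. */
u64 pdpte[4];
unsigned long cr3 = kvm_read_cr3(vcpu);
gfn_t gfn = (cr3 & 0xffffffe0ul) >> PAGE_SHIFT;       /* page holding the table */
int offset = (cr3 & 0xffffffe0ul) & (PAGE_SIZE - 1);  /* table is 32-byte aligned */

if (read_guest(vcpu, gfn, pdpte, offset, sizeof(pdpte)) < 0)
	return -EFAULT;  /* table unreadable in the current memory map */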



2021-06-07 09:04:31

by Maxim Levitsky

Subject: [PATCH v3 2/8] KVM: nSVM: Drop pointless pdptrs_changed() check on nested transition

From: Sean Christopherson <[email protected]>

Remove the "PDPTRs unchanged" check to skip PDPTR loading during nested
SVM transitions as it's not at all an optimization. Reading guest memory
to get the PDPTRs isn't magically cheaper by doing it in pdptrs_changed(),
and if the PDPTRs did change, KVM will end up doing the read twice.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/svm/nested.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 5e8d8443154e..8f5dbc80f57f 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -391,10 +391,8 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
return -EINVAL;

if (!nested_npt && is_pae_paging(vcpu) &&
- (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
- if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
- return -EINVAL;
- }
+ CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
+ return -EINVAL;

/*
* TODO: optimize unconditional TLB flush/MMU sync here and in
--
2.26.3

2021-06-07 09:04:52

by Maxim Levitsky

Subject: [PATCH v3 3/8] KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE guest

From: Sean Christopherson <[email protected]>

Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
for PAE guests, even if the new CR3 is the same as the current CR3. For
VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
unavailable after VM-Exit, i.e. the optimization is dead code except for
SVM without NPT.

In the unlikely scenario that anyone cares about SVM without NPT _and_ a
PAE guest, they've got bigger problems if their guest is loading the same
CR3 so frequently that the performance of kvm_set_cr3() is notable,
especially since KVM's fast PGD switching means reloading the same CR3
does not require a full rebuild. Given that PAE and PCID are mutually
exclusive, i.e. a sync and flush are guaranteed in any case, the actual
benefits of the pdptrs_changed() optimization are marginal at best.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/x86.c | 34 ++-------------------------------
2 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..83f948bdc59a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1476,7 +1476,6 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);

int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
-bool pdptrs_changed(struct kvm_vcpu *vcpu);

int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
const void *val, int bytes);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b6bca616929..0be35c37c958 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -778,13 +778,6 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page_mmu);

-static int kvm_read_nested_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
- void *data, int offset, int len, u32 access)
-{
- return kvm_read_guest_page_mmu(vcpu, vcpu->arch.walk_mmu, gfn,
- data, offset, len, access);
-}
-
static inline u64 pdptr_rsvd_bits(struct kvm_vcpu *vcpu)
{
return vcpu->arch.reserved_gpa_bits | rsvd_bits(5, 8) | rsvd_bits(1, 2);
@@ -826,30 +819,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
}
EXPORT_SYMBOL_GPL(load_pdptrs);

-bool pdptrs_changed(struct kvm_vcpu *vcpu)
-{
- u64 pdpte[ARRAY_SIZE(vcpu->arch.walk_mmu->pdptrs)];
- int offset;
- gfn_t gfn;
- int r;
-
- if (!is_pae_paging(vcpu))
- return false;
-
- if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
- return true;
-
- gfn = (kvm_read_cr3(vcpu) & 0xffffffe0ul) >> PAGE_SHIFT;
- offset = (kvm_read_cr3(vcpu) & 0xffffffe0ul) & (PAGE_SIZE - 1);
- r = kvm_read_nested_guest_page(vcpu, gfn, pdpte, offset, sizeof(pdpte),
- PFERR_USER_MASK | PFERR_WRITE_MASK);
- if (r < 0)
- return true;
-
- return memcmp(pdpte, vcpu->arch.walk_mmu->pdptrs, sizeof(pdpte)) != 0;
-}
-EXPORT_SYMBOL_GPL(pdptrs_changed);
-
void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
{
unsigned long update_bits = X86_CR0_PG | X86_CR0_WP;
@@ -1096,7 +1065,8 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
}
#endif

- if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
+ /* PDPTRs are always reloaded for PAE paging. */
+ if (cr3 == kvm_read_cr3(vcpu) && !is_pae_paging(vcpu)) {
if (!skip_tlb_flush) {
kvm_mmu_sync_roots(vcpu);
kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
--
2.26.3

2021-06-07 09:04:57

by Maxim Levitsky

Subject: [PATCH v3 4/8] KVM: nSVM: refactor the CR3 reload on migration

Document the actual reason why we need to reload CR3
on migration, and move the call in svm_set_nested_state
to be closer to the VMX code.

To avoid loading the PDPTRs from a possibly out-of-date memory map
in nested_svm_load_cr3 after the move, move that code to
.get_nested_state_pages.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/svm/nested.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 8f5dbc80f57f..e3e5775b8f1c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -385,12 +385,12 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
* if we are emulating VM-Entry into a guest with NPT enabled.
*/
static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
- bool nested_npt)
+ bool nested_npt, bool reload_pdptrs)
{
if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3)))
return -EINVAL;

- if (!nested_npt && is_pae_paging(vcpu) &&
+ if (reload_pdptrs && !nested_npt && is_pae_paging(vcpu) &&
CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
return -EINVAL;

@@ -574,7 +574,7 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
nested_vmcb02_prepare_save(svm, vmcb12);

ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,
- nested_npt_enabled(svm));
+ nested_npt_enabled(svm), true);
if (ret)
return ret;

@@ -803,7 +803,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)

nested_svm_uninit_mmu_context(vcpu);

- rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false);
+ rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false, true);
if (rc)
return 1;

@@ -1299,6 +1299,19 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
!nested_vmcb_valid_sregs(vcpu, save))
goto out_free;

+ /*
+ * While the nested guest CR3 is already checked and set by
+	 * KVM_SET_SREGS, it was set when the nested state was not yet
+	 * loaded, thus the MMU might not be initialized correctly.
+ * Set it again to fix this.
+ */
+
+ ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+ nested_npt_enabled(svm), false);
+ if (WARN_ON_ONCE(ret))
+ goto out_free;
+
+
/*
* All checks done, we can enter guest mode. Userspace provides
* vmcb12.control, which will be combined with L1 and stored into
@@ -1356,9 +1369,14 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
if (WARN_ON(!is_guest_mode(vcpu)))
return true;

- if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
- nested_npt_enabled(svm)))
- return false;
+ if (!nested_npt_enabled(svm) && is_pae_paging(vcpu))
+ /*
+ * Reload the guest's PDPTRs since after a migration
+ * the guest CR3 might be restored prior to setting the nested
+ * state which can lead to a load of wrong PDPTRs.
+ */
+ if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3)))
+ return false;

if (!nested_svm_vmrun_msrpm(svm)) {
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
--
2.26.3

2021-06-07 09:05:04

by Maxim Levitsky

Subject: [PATCH v3 6/8] KVM: x86: introduce kvm_register_clear_available

Small refactoring that will be used in the next patch.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/kvm_cache_regs.h | 7 +++++++
arch/x86/kvm/svm/svm.c | 6 ++----
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 3db5c42c9ecd..b8559b83c4ca 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -55,6 +55,13 @@ static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
}

+static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ __clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ __clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dfa351e605de..491833ff2495 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3848,10 +3848,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.apf.host_apf_flags =
kvm_read_and_reset_apf_flags();

- if (npt_enabled) {
- vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
- vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
- }
+ if (npt_enabled)
+ kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);

/*
* We need to handle MC intercepts here before the vcpu has a chance to
--
2.26.3

2021-06-07 09:05:26

by Maxim Levitsky

Subject: [PATCH v3 5/8] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES

Similar to the rest of the guest page accesses after a migration,
this access should be delayed until KVM_REQ_GET_NESTED_STATE_PAGES
is processed.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c45189898a64..0acdda85f36a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1105,7 +1105,8 @@ static bool nested_vmx_transition_mmu_sync(struct kvm_vcpu *vcpu)
* Exit Qualification (for a VM-Entry consistency check VM-Exit) is assigned to
* @entry_failure_code.
*/
-static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool nested_ept,
+static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
+ bool nested_ept, bool reload_pdptrs,
enum vm_entry_failure_code *entry_failure_code)
{
if (CC(kvm_vcpu_is_illegal_gpa(vcpu, cr3))) {
@@ -1117,7 +1118,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
* If PAE paging and EPT are both on, CR3 is not used by the CPU and
* must not be dereferenced.
*/
- if (!nested_ept && is_pae_paging(vcpu) &&
+ if (reload_pdptrs && !nested_ept && is_pae_paging(vcpu) &&
CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))) {
*entry_failure_code = ENTRY_FAIL_PDPTE;
return -EINVAL;
@@ -2486,6 +2487,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
* is assigned to entry_failure_code on failure.
*/
static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
+ bool from_vmentry,
enum vm_entry_failure_code *entry_failure_code)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -2570,7 +2572,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,

/* Shadow page tables on either EPT or shadow page tables. */
if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
- entry_failure_code))
+ from_vmentry, entry_failure_code))
return -EINVAL;

/*
@@ -3111,6 +3113,17 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
struct page *page;
u64 hpa;

+ if (!nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) {
+ /*
+ * Reload the guest's PDPTRs since after a migration
+ * the guest CR3 might be restored prior to setting the nested
+ * state which can lead to a load of wrong PDPTRs.
+ */
+ if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3)))
+ return false;
+ }
+
+
if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
/*
* Translate L1 physical address to host physical
@@ -3355,7 +3368,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
vcpu->arch.tsc_offset += vmcs12->tsc_offset;

- if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
+ if (prepare_vmcs02(vcpu, vmcs12, from_vmentry, &entry_failure_code)) {
exit_reason.basic = EXIT_REASON_INVALID_STATE;
vmcs12->exit_qualification = entry_failure_code;
goto vmentry_fail_vmexit_guest_mode;
@@ -4204,7 +4217,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
* Only PDPTE load can fail as the value of cr3 was checked on entry and
* couldn't have changed.
*/
- if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
+ if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, true, &ignored))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);

nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
--
2.26.3

2021-06-07 09:05:51

by Maxim Levitsky

Subject: [PATCH v3 8/8] KVM: x86: avoid loading PDPTRs after migration when possible

If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
part of the migration state and are correctly
restored by those ioctls.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/svm/nested.c | 3 ++-
arch/x86/kvm/vmx/nested.c | 3 ++-
arch/x86/kvm/x86.c | 3 +++
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 83f948bdc59a..8eb107ceb45a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -851,6 +851,12 @@ struct kvm_vcpu_arch {

/* Protected Guests */
bool guest_state_protected;
+
+ /*
+	 * Set when the PDPTRs were loaded directly by userspace without
+	 * reading the guest memory.
+ */
+ bool pdptrs_restored_oob;
};

struct kvm_lpage_info {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index e3e5775b8f1c..0f80d68a45e1 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1369,7 +1369,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
if (WARN_ON(!is_guest_mode(vcpu)))
return true;

- if (!nested_npt_enabled(svm) && is_pae_paging(vcpu))
+ if (!vcpu->arch.pdptrs_restored_oob &&
+ !nested_npt_enabled(svm) && is_pae_paging(vcpu))
/*
* Reload the guest's PDPTRs since after a migration
* the guest CR3 might be restored prior to setting the nested
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0acdda85f36a..78d6c71ab03b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3113,7 +3113,8 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
struct page *page;
u64 hpa;

- if (!nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) {
+ if (!vcpu->arch.pdptrs_restored_oob &&
+ !nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) {
/*
* Reload the guest's PDPTRs since after a migration
* the guest CR3 might be restored prior to setting the nested
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 11260e83518f..eadfc9caf500 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -815,6 +815,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)

memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+ vcpu->arch.pdptrs_restored_oob = false;
+
out:

return ret;
@@ -10113,6 +10115,7 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)

kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
mmu_reset_needed = 1;
+ vcpu->arch.pdptrs_restored_oob = true;
}
if (mmu_reset_needed)
kvm_mmu_reset_context(vcpu);
--
2.26.3
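For context, the restore-side ioctl ordering this patch assumes looks roughly
like the following VMM-side sketch (fd setup, struct population and error
handling are elided; the ioctls and the flag are the ones this series adds):

/* Migration restore, simplified. */
struct kvm_sregs2 sregs2;
/* ... fill segments, control registers and sregs2.pdptrs[] ... */
sregs2.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;  /* PDPTRs restored OOB */
ioctl(vcpu_fd, KVM_SET_SREGS2, &sregs2);

ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &nested_state);

/* The first KVM_RUN processes KVM_REQ_GET_NESTED_STATE_PAGES; with
 * pdptrs_restored_oob set, the PDPTR reload from guest memory is skipped. */
ioctl(vcpu_fd, KVM_RUN, 0);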

2021-06-07 09:06:42

by Maxim Levitsky

Subject: [PATCH v3 7/8] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2

This is a new version of KVM_GET_SREGS / KVM_SET_SREGS.

It has the following changes:
* Has flags for future extensions
* Has the vCPU's PDPTRs, which allows saving/restoring them on migration
* Lacks the obsolete interrupt bitmap (now done via KVM_SET_VCPU_EVENTS)

A new capability, KVM_CAP_SREGS2, is added to signal to
userspace that this ioctl is available.

Signed-off-by: Maxim Levitsky <[email protected]>
---
Documentation/virt/kvm/api.rst | 48 +++++++++++
arch/x86/include/uapi/asm/kvm.h | 13 +++
arch/x86/kvm/kvm_cache_regs.h | 5 ++
arch/x86/kvm/x86.c | 141 ++++++++++++++++++++++++++------
include/uapi/linux/kvm.h | 4 +
5 files changed, 184 insertions(+), 27 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..292522fbf3ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5034,6 +5034,54 @@ see KVM_XEN_VCPU_SET_ATTR above.
The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
with the KVM_XEN_VCPU_GET_ATTR ioctl.

+
+4.131 KVM_GET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+This ioctl (when supported) replaces the KVM_GET_SREGS.
+
+::
+
+struct kvm_sregs2 {
+ /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+ __u64 flags;
+ __u64 pdptrs[4];
+};
+
+flags values for ``kvm_sregs2``:
+
+``KVM_SREGS2_FLAGS_PDPTRS_VALID``
+
+  Indicates that the struct contains valid PDPTR values.
+
+
+4.132 KVM_SET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.
+See KVM_GET_SREGS2 for the data structures.
+This ioctl (when supported) replaces the KVM_SET_SREGS.
+
+
5. The kvm_run structure
========================

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0662f644aad9..a6c327f8ad9e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -159,6 +159,19 @@ struct kvm_sregs {
__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
};

+struct kvm_sregs2 {
+ /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+ __u64 flags;
+ __u64 pdptrs[4];
+};
+#define KVM_SREGS2_FLAGS_PDPTRS_VALID 1
+
/* for KVM_GET_FPU and KVM_SET_FPU */
struct kvm_fpu {
__u8 fpr[8][16];
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b8559b83c4ca..842a09711b80 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -125,6 +125,11 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
return vcpu->arch.walk_mmu->pdptrs[index];
}

+static inline void kvm_pdptr_write(struct kvm_vcpu *vcpu, int index, u64 value)
+{
+ vcpu->arch.walk_mmu->pdptrs[index] = value;
+}
+
static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0be35c37c958..11260e83518f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -113,6 +113,9 @@ static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
static void store_regs(struct kvm_vcpu *vcpu);
static int sync_regs(struct kvm_vcpu *vcpu);

+static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+
struct kvm_x86_ops kvm_x86_ops __read_mostly;
EXPORT_SYMBOL_GPL(kvm_x86_ops);

@@ -812,7 +815,6 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)

memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
-
out:

return ret;
@@ -3862,6 +3864,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SGX_ATTRIBUTE:
#endif
case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
+ case KVM_CAP_SREGS2:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
@@ -4778,6 +4781,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
void __user *argp = (void __user *)arg;
int r;
union {
+ struct kvm_sregs2 *sregs2;
struct kvm_lapic_state *lapic;
struct kvm_xsave *xsave;
struct kvm_xcrs *xcrs;
@@ -5150,6 +5154,28 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
break;
}
#endif
+ case KVM_GET_SREGS2: {
+ u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), GFP_KERNEL);
+ r = -ENOMEM;
+ if (!u.sregs2)
+ goto out;
+ __get_sregs2(vcpu, u.sregs2);
+ r = -EFAULT;
+ if (copy_to_user(argp, u.sregs2, sizeof(struct kvm_sregs2)))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_SET_SREGS2: {
+ u.sregs2 = memdup_user(argp, sizeof(struct kvm_sregs2));
+ if (IS_ERR(u.sregs2)) {
+ r = PTR_ERR(u.sregs2);
+ u.sregs2 = NULL;
+ goto out;
+ }
+ r = __set_sregs2(vcpu, u.sregs2);
+ break;
+ }
default:
r = -EINVAL;
}
@@ -9793,7 +9819,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
}
EXPORT_SYMBOL_GPL(kvm_get_cs_db_l_bits);

-static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
{
struct desc_ptr dt;

@@ -9826,14 +9852,36 @@ static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
sregs->cr8 = kvm_get_cr8(vcpu);
sregs->efer = vcpu->arch.efer;
sregs->apic_base = kvm_get_apic_base(vcpu);
+}

- memset(sregs->interrupt_bitmap, 0, sizeof(sregs->interrupt_bitmap));
+static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+ __get_sregs_common(vcpu, sregs);
+
+ if (vcpu->arch.guest_state_protected)
+ return;

if (vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft)
set_bit(vcpu->arch.interrupt.nr,
(unsigned long *)sregs->interrupt_bitmap);
}

+static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
+{
+ int i;
+
+ __get_sregs_common(vcpu, (struct kvm_sregs *)sregs2);
+
+ if (vcpu->arch.guest_state_protected)
+ return;
+
+ if (is_pae_paging(vcpu)) {
+ for (i = 0 ; i < 4 ; i++)
+ sregs2->pdptrs[i] = kvm_pdptr_read(vcpu, i);
+ sregs2->flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
+ }
+}
+
int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
struct kvm_sregs *sregs)
{
@@ -9945,24 +9993,23 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
return kvm_is_valid_cr4(vcpu, sregs->cr4);
}

-static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs,
+ int *mmu_reset_needed, bool update_pdptrs)
{
struct msr_data apic_base_msr;
- int mmu_reset_needed = 0;
- int pending_vec, max_bits, idx;
+ int idx;
struct desc_ptr dt;
- int ret = -EINVAL;

if (!kvm_is_valid_sregs(vcpu, sregs))
- goto out;
+ return -EINVAL;

apic_base_msr.data = sregs->apic_base;
apic_base_msr.host_initiated = true;
if (kvm_set_apic_base(vcpu, &apic_base_msr))
- goto out;
+ return -EINVAL;

if (vcpu->arch.guest_state_protected)
- goto skip_protected_regs;
+ return 0;

dt.size = sregs->idt.limit;
dt.address = sregs->idt.base;
@@ -9972,31 +10019,30 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
static_call(kvm_x86_set_gdt)(vcpu, &dt);

vcpu->arch.cr2 = sregs->cr2;
- mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
+ *mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
vcpu->arch.cr3 = sregs->cr3;
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);

kvm_set_cr8(vcpu, sregs->cr8);

- mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
+ *mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
static_call(kvm_x86_set_efer)(vcpu, sregs->efer);

- mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
+ *mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
static_call(kvm_x86_set_cr0)(vcpu, sregs->cr0);
vcpu->arch.cr0 = sregs->cr0;

- mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
+ *mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
static_call(kvm_x86_set_cr4)(vcpu, sregs->cr4);

- idx = srcu_read_lock(&vcpu->kvm->srcu);
- if (is_pae_paging(vcpu)) {
- load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
- mmu_reset_needed = 1;
+ if (update_pdptrs) {
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ if (is_pae_paging(vcpu)) {
+ load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+ *mmu_reset_needed = 1;
+ }
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
}
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
-
- if (mmu_reset_needed)
- kvm_mmu_reset_context(vcpu);

kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
@@ -10016,20 +10062,61 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
!is_protmode(vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

-skip_protected_regs:
+ return 0;
+}
+
+static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+ int pending_vec, max_bits;
+ int mmu_reset_needed = 0;
+ int ret = __set_sregs_common(vcpu, sregs, &mmu_reset_needed, true);
+
+ if (ret)
+ return ret;
+
+ if (mmu_reset_needed)
+ kvm_mmu_reset_context(vcpu);
+
max_bits = KVM_NR_INTERRUPTS;
pending_vec = find_first_bit(
(const unsigned long *)sregs->interrupt_bitmap, max_bits);
+
if (pending_vec < max_bits) {
kvm_queue_interrupt(vcpu, pending_vec, false);
pr_debug("Set back pending irq %d\n", pending_vec);
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
}
+ return 0;
+}

- kvm_make_request(KVM_REQ_EVENT, vcpu);
+static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
+{
+ int mmu_reset_needed = 0;
+ bool valid_pdptrs = sregs2->flags & KVM_SREGS2_FLAGS_PDPTRS_VALID;
+ int i, ret;

- ret = 0;
-out:
- return ret;
+ if (sregs2->flags & ~KVM_SREGS2_FLAGS_PDPTRS_VALID)
+ return -EINVAL;
+
+ ret = __set_sregs_common(vcpu, (struct kvm_sregs *)sregs2,
+ &mmu_reset_needed, !valid_pdptrs);
+ if (ret)
+ return ret;
+
+ if (valid_pdptrs) {
+ if (!is_pae_paging(vcpu))
+ return -EINVAL;
+ if (vcpu->arch.guest_state_protected)
+ return -EINVAL;
+ for (i = 0 ; i < 4 ; i++)
+ kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
+
+ kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+ mmu_reset_needed = 1;
+ }
+ if (mmu_reset_needed)
+ kvm_mmu_reset_context(vcpu);
+ return 0;
}

int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..da98bc314ce0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_SGX_ATTRIBUTE 196
#define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
#define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_SREGS2 200

#ifdef KVM_CAP_IRQ_ROUTING

@@ -1620,6 +1621,9 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_VCPU_GET_ATTR _IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
#define KVM_XEN_VCPU_SET_ATTR _IOW(KVMIO, 0xcb, struct kvm_xen_vcpu_attr)

+#define KVM_GET_SREGS2 _IOR(KVMIO, 0xcc, struct kvm_sregs2)
+#define KVM_SET_SREGS2 _IOW(KVMIO, 0xcd, struct kvm_sregs2)
+
struct kvm_xen_vcpu_attr {
__u16 type;
__u16 pad[3];
--
2.26.3
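As a usage illustration, a minimal userspace sketch (assuming an open /dev/kvm
fd and an already-created vCPU fd; error handling elided) that probes for the
capability and round-trips the special registers:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Prefer the new API when available; fall back to KVM_GET/SET_SREGS. */
if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SREGS2) > 0) {
	struct kvm_sregs2 sregs2;

	ioctl(vcpu_fd, KVM_GET_SREGS2, &sregs2);
	/* sregs2.flags has KVM_SREGS2_FLAGS_PDPTRS_VALID set iff the vCPU
	 * is in PAE mode and sregs2.pdptrs[] holds valid entries. */
	ioctl(vcpu_fd, KVM_SET_SREGS2, &sregs2);
}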

2021-06-10 15:07:51

by Paolo Bonzini

Subject: Re: [PATCH v3 7/8] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2

On 07/06/21 11:02, Maxim Levitsky wrote:
> +static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
> +{
> + int mmu_reset_needed = 0;
> + bool valid_pdptrs = sregs2->flags & KVM_SREGS2_FLAGS_PDPTRS_VALID;
> + int i, ret;
>
> + if (sregs2->flags & ~KVM_SREGS2_FLAGS_PDPTRS_VALID)
> + return -EINVAL;
> +
> + ret = __set_sregs_common(vcpu, (struct kvm_sregs *)sregs2,
> + &mmu_reset_needed, !valid_pdptrs);
> + if (ret)
> + return ret;
> +
> + if (valid_pdptrs) {
> + if (!is_pae_paging(vcpu))
> + return -EINVAL;
> + if (vcpu->arch.guest_state_protected)
> + return -EINVAL;
> + for (i = 0 ; i < 4 ; i++)
> + kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
> +
> + kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> + mmu_reset_needed = 1;
> + }
> + if (mmu_reset_needed)
> + kvm_mmu_reset_context(vcpu);
> + return 0;
> }

It's a bit nicer if the checks are done early:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f20c7c06bd4a..c6f8fec78c53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10248,22 +10248,23 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
{
int mmu_reset_needed = 0;
bool valid_pdptrs = sregs2->flags & KVM_SREGS2_FLAGS_PDPTRS_VALID;
+ bool pae = (sregs2->cr0 & X86_CR0_PG) && (sregs2->cr4 & X86_CR4_PAE) &&
+ !(sregs2->efer & EFER_LMA);
int i, ret;

if (sregs2->flags & ~KVM_SREGS2_FLAGS_PDPTRS_VALID)
return -EINVAL;

+ if (valid_pdptrs && (!pae || vcpu->arch.guest_state_protected))
+ return -EINVAL;
+
ret = __set_sregs_common(vcpu, (struct kvm_sregs *)sregs2,
&mmu_reset_needed, !valid_pdptrs);
if (ret)
return ret;

if (valid_pdptrs) {
- if (!is_pae_paging(vcpu))
- return -EINVAL;
- if (vcpu->arch.guest_state_protected)
- return -EINVAL;
- for (i = 0 ; i < 4 ; i++)
+ for (i = 0; i < 4 ; i++)
kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);

kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);

Paolo

2021-06-10 15:08:26

by Paolo Bonzini

Subject: Re: [PATCH v3 0/8] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration

On 07/06/21 11:01, Maxim Levitsky wrote:
> This patch set aims to fix a few flaws that were discovered
> in KVM_{GET|SET}_SREGS on x86:
> [...]

Queued, thanks.

Paolo

2021-06-19 06:40:56

by Sean Christopherson

Subject: Re: [PATCH v3 8/8] KVM: x86: avoid loading PDPTRs after migration when possible

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 11260e83518f..eadfc9caf500 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -815,6 +815,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>
> memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> + vcpu->arch.pdptrs_restored_oob = false;
> +
> out:
>
> return ret;
> @@ -10113,6 +10115,7 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
>
> kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> mmu_reset_needed = 1;
> + vcpu->arch.pdptrs_restored_oob = true;

Setting pdptrs_restored_oob[*] here and _only_ clearing it on successful
load_pdptrs() is not robust. Potential problems once the flag is set:

1. Userspace calls KVM_SET_SREGS{,2} without valid PDPTRs. Flag is now stale.
2. kvm_check_nested_events() VM-Exits to L1 before the flag is processed.
Flag is now stale.

(2) might not be problematic in practice since the "normal" load_pdptrs()
should reset the flag on the next VM-Enter, but it's really, really hard to tell.
E.g. what if an SMI causes an exit and _that_ non-VM-Enter reload of L2 state
is the first to trip the flag? The bool is essentially an extension of
KVM_REQ_GET_NESTED_STATE_PAGES, I think it makes sense to clear the flag whenever
KVM_REQ_GET_NESTED_STATE_PAGES is cleared.

Another thing that's not obvious is the required ordering between KVM_SET_SREGS2
and KVM_SET_NESTED_STATE. AFAICT it's not documented, but that may be PEBKAC on
my end. E.g. what happens if walk_mmu == &root_mmu (L1 active in the target KVM)
when SET_SREGS2 is called, and _then_ KVM_SET_NESTED_STATE is called?

[*] pdptrs_from_userspace in Paolo's tree.

2021-06-19 10:02:00

by Paolo Bonzini

Subject: Re: [PATCH v3 8/8] KVM: x86: avoid loading PDPTRs after migration when possible

On 18/06/21 22:53, Sean Christopherson wrote:
> The bool is essentially an extension of
> KVM_REQ_GET_NESTED_STATE_PAGES, I think it makes sense to clear the flag whenever
> KVM_REQ_GET_NESTED_STATE_PAGES is cleared.

So in vcpu_enter_guest?

> Another thing that's not obvious is the required ordering between KVM_SET_SREGS2
> and KVM_SET_NESTED_STATE. AFAICT it's not documented, but that may be PEBKAC on
> my end. E.g. what happens if walk_mmu == &root_mmu (L1 active in the target KVM)
> when SET_SREGS2 is called, and _then_ KVM_SET_NESTED_STATE is called?

Either ordering should work.

Paolo

2021-06-20 22:27:48

by Maxim Levitsky

Subject: Re: [PATCH v3 8/8] KVM: x86: avoid loading PDPTRs after migration when possible

On Fri, 2021-06-18 at 20:53 +0000, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 11260e83518f..eadfc9caf500 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -815,6 +815,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> >
> > memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > + vcpu->arch.pdptrs_restored_oob = false;
> > +
> > out:
> >
> > return ret;
> > @@ -10113,6 +10115,7 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
> >
> > kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > mmu_reset_needed = 1;
> > + vcpu->arch.pdptrs_restored_oob = true;
>
> Setting pdptrs_restored_oob[*] here and _only_ clearing it on successful
> load_pdptrs() is not robust. Potential problems once the flag is set:

Hi Sean Christopherson!
Thanks for the review!

I also thought about the exact same thing when I submitted the last version of
these patches (the prior version didn't clear the flag at all, but I noticed
that while doing a self-review of my patches).


>
> 1. Userspace calls KVM_SET_SREGS{,2} without valid PDPTRs. Flag is now stale.

True. It isn't that big an issue though, since the only way to do this is to also
disable PAE mode in CR4 during or before the call, thus the PDPTRs become irrelevant.
Once PAE is enabled again, the PDPTRs will be loaded again, resetting this flag.

Something to note is that we also don't clear the available/dirty status of
VCPU_EXREG_PDPTR in this case.



> 2. kvm_check_nested_events() VM-Exits to L1 before the flag is processed.
> Flag is now stale.

Also true. However, this means that we enter L1 now, and once we are ready to enter
L2 again, we will load the PDPTRs from guest memory, as thankfully VM entries in PAE
mode do load the PDPTRs from guest memory (both on Intel and AMD).



>
> (2) might not be problematic in practice since the "normal" load_pdptrs()
> should reset the flag on the next VM-Enter, but it's really, really hard to tell.
> E.g. what if an SMI causes an exit and _that_ non-VM-Enter reload of L2 state
> is the first to trip the flag? The bool is essentially an extension of
> KVM_REQ_GET_NESTED_STATE_PAGES, I think it makes sense to clear the flag whenever
> KVM_REQ_GET_NESTED_STATE_PAGES is cleared.

Could you explain the SMM case a bit better? When SMM entry is done, at least the
Intel spec is silent on whether the PDPTRs are preserved in SMRAM (and it doesn't
have any place allocated for them).

We currently don't preserve the PDPTRs on SMM entry, and we reload them via a
CR3/CR0 write when we exit SMM.

Ah, I see now: on VMX the non-VM-Enter code path is also used for returns from SMM,
and it sets KVM_REQ_GET_NESTED_STATE_PAGES, so this is a real issue.
If the SMM code enabled PAE, and then KVM_SET_SREGS2 was used, followed by
RSM, we indeed have a risk of not loading the PDPTRs.

I think that this is best fixed by resetting this flag in vmx_leave_smm,
since RSM otherwise always loads the PDPTRs from memory.

I also note that we don't raise KVM_REQ_GET_NESTED_STATE_PAGES on SVM
on return from SMM at all;
thus on SVM I think I broke resuming from SMM to a guest if the guest is PAE.
Oh well...

I will now extend the testing I usually do to SMM and prepare a patch to fix this.
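
A sketch of the vmx_leave_smm idea above (illustrative only, not a committed
fix; the hook name comes from the discussion and its exact signature may
differ):

/* On RSM emulation, forget any out-of-band PDPTR restore: RSM reloads
 * the PDPTRs from guest memory anyway, so the flag is no longer valid. */
vcpu->arch.pdptrs_restored_oob = false;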

>
> Another thing that's not obvious is the required ordering between KVM_SET_SREGS2
> and KVM_SET_NESTED_STATE. AFAICT it's not documented, but that may be PEBKAC on
> my end. E.g. what happens if walk_mmu == &root_mmu (L1 active in the target KVM)
> when SET_SREGS2 is called, and _then_ KVM_SET_NESTED_STATE is called?

Isn't that exactly the current ordering (and the reason why I had to do this patch series)?
First, KVM_SET_SREGS is indeed called prior to KVM_SET_NESTED_STATE, and it can
potentially load wrong PDPTRs; then KVM_SET_NESTED_STATE is called, which used
to 'fix' this by always reloading them.



>
> [*] pdptrs_from_userspace in Paolo's tree.
>


I think that, strictly speaking, this flag should be cleared when PAE mode is disabled,
together with clearing the availability of VCPU_EXREG_PDPTR.

I don't agree that this flag is an extension of KVM_REQ_GET_NESTED_STATE_PAGES.
I think this flag is more like an extra property of VCPU_EXREG_PDPTR.
In addition to being dirty/available, this "register" can be loaded from memory
or restored from the migration stream.

Thanks again for the review,
Best regards,
Maxim Levitsky