Guest memory can either be directly managed by the kernel (i.e. have a "struct
page") or it can live outside kernel control (i.e. not have a "struct page").
KVM mostly supports these two modes, except in a few places where the code
seems to assume that guest memory must have a "struct page".
This patchset introduces a new mapping interface to map guest memory into host
kernel memory, which also supports PFN-based memory (i.e. memory without a
'struct page'). It also converts all offending code to this interface or to
direct reads/writes from/to guest memory. Patch 2 additionally fixes an
incorrect page release and marks the page as dirty (as a side-effect of
using the helper function to write).
As far as I can see, all offending code is now fixed except the APIC-access
page, which I will handle in a separate series along with dropping
kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
The current implementation of the new API uses memremap to map memory that does
not have a "struct page". This proves to be very slow for high-frequency
mappings. Since this does not affect the normal use-case where a "struct page"
is available, the performance of this API will be handled by a separate patch
series.
v4 -> v5:
- Introduce a new parameter 'dirty' into kvm_vcpu_unmap
- A horrible rebase due to nested.c :)
- Dropped a couple of hyperv patches as the code was fixed already as a
side-effect of another patch.
- Added a new trivial cleanup patch.
v3 -> v4:
- Rebase
- Add a new patch to also fix the newly introduced enlightened VMCS.
v2 -> v3:
- Rebase
- Add a new patch to also fix the newly introduced shadow VMCS.
Filippo Sironi (1):
X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
KarimAllah Ahmed (12):
X86/nVMX: handle_vmon: Read 4 bytes from guest memory
X86/nVMX: Update the PML table without mapping and unmapping the page
KVM: Introduce a new guest mapping API
X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from
guest memory
KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt
descriptor table
KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
KVM/nSVM: Use the new mapping API for mapping guest memory
KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCS
KVM/nVMX: Use page_address_valid in a few more locations
arch/x86/kvm/paging_tmpl.h | 38 ++++++++---
arch/x86/kvm/svm.c | 97 +++++++++++++--------------
arch/x86/kvm/vmx/nested.c | 160 ++++++++++++++++-----------------------------
arch/x86/kvm/vmx/vmx.c | 19 ++----
arch/x86/kvm/vmx/vmx.h | 9 ++-
arch/x86/kvm/x86.c | 13 ++--
include/linux/kvm_host.h | 9 +++
virt/kvm/kvm_main.c | 53 +++++++++++++++
8 files changed, 217 insertions(+), 181 deletions(-)
--
2.7.4
Use kvm_vcpu_map to map the VMCS12 from guest memory because
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- Switch to the new guest mapping API instead of reading directly from
guest.
- unmap with dirty flag
v3 -> v4:
- Return VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID on failure (jmattson@)
v1 -> v2:
- Massage commit message a bit.
---
arch/x86/kvm/vmx/nested.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 536468a..5602b0c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4521,11 +4521,10 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
return 1;
if (vmx->nested.current_vmptr != vmptr) {
+ struct kvm_host_map map;
struct vmcs12 *new_vmcs12;
- struct page *page;
- page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
- if (is_error_page(page)) {
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmptr), &map)) {
/*
* Reads from an unbacked page return all 1s,
* which means that the 32 bits located at the
@@ -4536,12 +4535,13 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
return kvm_skip_emulated_instruction(vcpu);
}
- new_vmcs12 = kmap(page);
+
+ new_vmcs12 = map.hva;
+
if (new_vmcs12->hdr.revision_id != VMCS12_REVISION ||
(new_vmcs12->hdr.shadow_vmcs &&
!nested_cpu_has_vmx_shadow_vmcs(vcpu))) {
- kunmap(page);
- kvm_release_page_clean(page);
+ kvm_vcpu_unmap(&map, false);
return nested_vmx_failValid(vcpu,
VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
}
@@ -4553,8 +4553,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
* cached.
*/
memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE);
- kunmap(page);
- kvm_release_page_clean(page);
+ kvm_vcpu_unmap(&map, false);
set_current_vmptr(vmx, vmptr);
}
--
2.7.4
Use page_address_valid in a few more locations that already check for a
page-aligned address that does not cross the maximum physical address.
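As a user-space sketch of the check being consolidated (names and the 48-bit width in the test are illustrative; the real helper uses cpuid_maxphyaddr(vcpu)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PAGE_ALIGNED(x) (((x) & (PAGE_SIZE - 1)) == 0)

/* Model of what page_address_valid() checks: the GPA is page aligned
 * and does not exceed the vCPU's physical address width. The width is
 * a plain parameter here instead of cpuid_maxphyaddr(vcpu). */
static bool page_address_valid_model(uint64_t gpa, unsigned int maxphyaddr)
{
	return PAGE_ALIGNED(gpa) && !(gpa >> maxphyaddr);
}
```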
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ccb3b63..77aad46 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4203,7 +4203,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
* Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
* which replaces physical address width with 32
*/
- if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
+ if (!page_address_valid(vcpu, vmptr))
return nested_vmx_failInvalid(vcpu);
if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
@@ -4266,7 +4266,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
if (nested_vmx_get_vmptr(vcpu, &vmptr))
return 1;
- if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
+ if (!page_address_valid(vcpu, vmptr))
return nested_vmx_failValid(vcpu,
VMXERR_VMCLEAR_INVALID_ADDRESS);
@@ -4473,7 +4473,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
if (nested_vmx_get_vmptr(vcpu, &vmptr))
return 1;
- if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
+ if (!page_address_valid(vcpu, vmptr))
return nested_vmx_failValid(vcpu,
VMXERR_VMPTRLD_INVALID_ADDRESS);
--
2.7.4
Use kvm_vcpu_map when mapping the virtual APIC page since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime as the pinning of the
virtual APIC page on the host side.
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
- Use pfn_to_hpa instead of gfn_to_gpa
---
arch/x86/kvm/vmx/nested.c | 32 +++++++++++---------------------
arch/x86/kvm/vmx/vmx.c | 5 ++---
arch/x86/kvm/vmx/vmx.h | 2 +-
3 files changed, 14 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 4127ad9..dcff99d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -229,10 +229,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
kvm_release_page_dirty(vmx->nested.apic_access_page);
vmx->nested.apic_access_page = NULL;
}
- if (vmx->nested.virtual_apic_page) {
- kvm_release_page_dirty(vmx->nested.virtual_apic_page);
- vmx->nested.virtual_apic_page = NULL;
- }
+ kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
if (vmx->nested.pi_desc_page) {
kunmap(vmx->nested.pi_desc_page);
kvm_release_page_dirty(vmx->nested.pi_desc_page);
@@ -2817,6 +2814,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct kvm_host_map *map;
struct page *page;
u64 hpa;
@@ -2849,11 +2847,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
}
if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
- if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
- kvm_release_page_dirty(vmx->nested.virtual_apic_page);
- vmx->nested.virtual_apic_page = NULL;
- }
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr);
+ map = &vmx->nested.virtual_apic_map;
/*
* If translation failed, VM entry will fail because
@@ -2868,11 +2862,9 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
* control. But such a configuration is useless, so
* let's keep the code simple.
*/
- if (!is_error_page(page)) {
- vmx->nested.virtual_apic_page = page;
- hpa = page_to_phys(vmx->nested.virtual_apic_page);
- vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
- }
+ if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
+ vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
+
}
if (nested_cpu_has_posted_intr(vmcs12)) {
@@ -3279,11 +3271,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
if (max_irr != 256) {
- vapic_page = kmap(vmx->nested.virtual_apic_page);
+ vapic_page = vmx->nested.virtual_apic_map.hva;
+ if (!vapic_page)
+ return;
+
__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
vapic_page, &max_irr);
- kunmap(vmx->nested.virtual_apic_page);
-
status = vmcs_read16(GUEST_INTR_STATUS);
if ((u8)max_irr > ((u8)status & 0xff)) {
status &= ~0xff;
@@ -3917,10 +3910,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
kvm_release_page_dirty(vmx->nested.apic_access_page);
vmx->nested.apic_access_page = NULL;
}
- if (vmx->nested.virtual_apic_page) {
- kvm_release_page_dirty(vmx->nested.virtual_apic_page);
- vmx->nested.virtual_apic_page = NULL;
- }
+ kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
if (vmx->nested.pi_desc_page) {
kunmap(vmx->nested.pi_desc_page);
kvm_release_page_dirty(vmx->nested.pi_desc_page);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71d88df..e13308e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3627,14 +3627,13 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
- WARN_ON_ONCE(!vmx->nested.virtual_apic_page))
+ WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
return false;
rvi = vmx_get_rvi();
- vapic_page = kmap(vmx->nested.virtual_apic_page);
+ vapic_page = vmx->nested.virtual_apic_map.hva;
vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
- kunmap(vmx->nested.virtual_apic_page);
return ((rvi & 0xf0) > (vppr & 0xf0));
}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 6fb69d8..f618f52 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -142,7 +142,7 @@ struct nested_vmx {
* pointers, so we must keep them pinned while L2 runs.
*/
struct page *apic_access_page;
- struct page *virtual_apic_page;
+ struct kvm_host_map virtual_apic_map;
struct page *pi_desc_page;
struct kvm_host_map msr_bitmap_map;
--
2.7.4
Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
---
arch/x86/kvm/vmx/nested.c | 11 +++++------
arch/x86/kvm/vmx/vmx.h | 3 +++
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 5602b0c..4127ad9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -507,9 +507,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
int msr;
- struct page *page;
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
+ struct kvm_host_map *map = &to_vmx(vcpu)->nested.msr_bitmap_map;
+
/*
* pred_cmd & spec_ctrl are trying to verify two things:
*
@@ -535,11 +536,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
!pred_cmd && !spec_ctrl)
return false;
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
- if (is_error_page(page))
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), map))
return false;
- msr_bitmap_l1 = (unsigned long *)kmap(page);
+ msr_bitmap_l1 = (unsigned long *)map->hva;
if (nested_cpu_has_apic_reg_virt(vmcs12)) {
/*
* L0 need not intercept reads for MSRs between 0x800 and 0x8ff, it
@@ -587,8 +587,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
MSR_IA32_PRED_CMD,
MSR_TYPE_W);
- kunmap(page);
- kvm_release_page_clean(page);
+ kvm_vcpu_unmap(&to_vmx(vcpu)->nested.msr_bitmap_map, false);
return true;
}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9932895..6fb69d8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -144,6 +144,9 @@ struct nested_vmx {
struct page *apic_access_page;
struct page *virtual_apic_page;
struct page *pi_desc_page;
+
+ struct kvm_host_map msr_bitmap_map;
+
struct pi_desc *pi_desc;
bool pi_pending;
u16 posted_intr_nv;
--
2.7.4
Use kvm_vcpu_map for accessing the shadow VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
---
arch/x86/kvm/vmx/nested.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b4230ce..04a8b43 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -588,20 +588,20 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
static void nested_cache_shadow_vmcs12(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
+ struct kvm_host_map map;
struct vmcs12 *shadow;
- struct page *page;
if (!nested_cpu_has_shadow_vmcs(vmcs12) ||
vmcs12->vmcs_link_pointer == -1ull)
return;
shadow = get_shadow_vmcs12(vcpu);
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer);
- memcpy(shadow, kmap(page), VMCS12_SIZE);
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map))
+ return;
- kunmap(page);
- kvm_release_page_clean(page);
+ memcpy(shadow, map.hva, VMCS12_SIZE);
+ kvm_vcpu_unmap(&map, false);
}
static void nested_flush_cached_shadow_vmcs12(struct kvm_vcpu *vcpu,
@@ -2637,9 +2637,9 @@ static int nested_vmx_check_vmentry_prereqs(struct kvm_vcpu *vcpu,
static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12)
{
- int r;
- struct page *page;
+ int r = 0;
struct vmcs12 *shadow;
+ struct kvm_host_map map;
if (vmcs12->vmcs_link_pointer == -1ull)
return 0;
@@ -2647,17 +2647,16 @@ static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu,
if (!page_address_valid(vcpu, vmcs12->vmcs_link_pointer))
return -EINVAL;
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->vmcs_link_pointer);
- if (is_error_page(page))
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->vmcs_link_pointer), &map))
return -EINVAL;
- r = 0;
- shadow = kmap(page);
+ shadow = map.hva;
+
if (shadow->hdr.revision_id != VMCS12_REVISION ||
shadow->hdr.shadow_vmcs != nested_cpu_has_shadow_vmcs(vmcs12))
r = -EINVAL;
- kunmap(page);
- kvm_release_page_clean(page);
+
+ kvm_vcpu_unmap(&map, false);
return r;
}
--
2.7.4
Use kvm_vcpu_map for accessing the enlightened VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- s/enhanced/enlightened
- unmap with dirty flag
---
arch/x86/kvm/vmx/nested.c | 14 +++++---------
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 04a8b43..ccb3b63 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -193,10 +193,8 @@ static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
if (!vmx->nested.hv_evmcs)
return;
- kunmap(vmx->nested.hv_evmcs_page);
- kvm_release_page_dirty(vmx->nested.hv_evmcs_page);
+ kvm_vcpu_unmap(&vmx->nested.hv_evmcs_map, true);
vmx->nested.hv_evmcs_vmptr = -1ull;
- vmx->nested.hv_evmcs_page = NULL;
vmx->nested.hv_evmcs = NULL;
}
@@ -1769,13 +1767,11 @@ static int nested_vmx_handle_enlightened_vmptrld(struct kvm_vcpu *vcpu,
nested_release_evmcs(vcpu);
- vmx->nested.hv_evmcs_page = kvm_vcpu_gpa_to_page(
- vcpu, assist_page.current_nested_vmcs);
-
- if (unlikely(is_error_page(vmx->nested.hv_evmcs_page)))
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(assist_page.current_nested_vmcs),
+ &vmx->nested.hv_evmcs_map))
return 0;
- vmx->nested.hv_evmcs = kmap(vmx->nested.hv_evmcs_page);
+ vmx->nested.hv_evmcs = vmx->nested.hv_evmcs_map.hva;
/*
* Currently, KVM only supports eVMCS version 1
@@ -4278,7 +4274,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
return nested_vmx_failValid(vcpu,
VMXERR_VMCLEAR_VMXON_POINTER);
- if (vmx->nested.hv_evmcs_page) {
+ if (vmx->nested.hv_evmcs_map.hva) {
if (vmptr == vmx->nested.hv_evmcs_vmptr)
nested_release_evmcs(vcpu);
} else {
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bd04725..a130139 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -172,7 +172,7 @@ struct nested_vmx {
} smm;
gpa_t hv_evmcs_vmptr;
- struct page *hv_evmcs_page;
+ struct kvm_host_map hv_evmcs_map;
struct hv_enlightened_vmcs *hv_evmcs;
};
--
2.7.4
Update the PML table without mapping and unmapping the page. This also
avoids using kvm_vcpu_gpa_to_page(), which assumes that there is a "struct
page" for guest memory.
As a side-effect of using kvm_write_guest_page the page is also properly
marked as dirty.
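The destination GPA the patch computes can be checked in isolation (a user-space sketch; the macros mirror the kernel's, and the values in the test are made up):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define gpa_to_gfn(gpa)     ((uint64_t)(gpa) >> PAGE_SHIFT)
#define offset_in_page(gpa) ((uint64_t)(gpa) & (PAGE_SIZE - 1))

/* GPA of PML slot 'guest_pml_index' inside the page at 'pml_address';
 * gfn and page offset are then what kvm_write_guest_page() receives. */
static uint64_t pml_slot_gpa(uint64_t pml_address, uint16_t guest_pml_index)
{
	return pml_address + sizeof(uint64_t) * guest_pml_index;
}
```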
Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
v1 -> v2:
- Use kvm_write_guest_page instead of kvm_write_guest (pbonzini)
- Do not use pointer arithmetic for pml_address (pbonzini)
---
arch/x86/kvm/vmx/vmx.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4d39f73..71d88df 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7199,9 +7199,7 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
{
struct vmcs12 *vmcs12;
struct vcpu_vmx *vmx = to_vmx(vcpu);
- gpa_t gpa;
- struct page *page = NULL;
- u64 *pml_address;
+ gpa_t gpa, dst;
if (is_guest_mode(vcpu)) {
WARN_ON_ONCE(vmx->nested.pml_full);
@@ -7221,15 +7219,13 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
}
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS) & ~0xFFFull;
+ dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->pml_address);
- if (is_error_page(page))
+ if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
+ offset_in_page(dst), sizeof(gpa)))
return 0;
- pml_address = kmap(page);
- pml_address[vmcs12->guest_pml_index--] = gpa;
- kunmap(page);
- kvm_release_page_clean(page);
+ vmcs12->guest_pml_index--;
}
return 0;
--
2.7.4
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
that has a "struct page".
One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime as the pinning of the
interrupt descriptor table page on the host side.
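Since the descriptor need not start at a page boundary, the patch rebuilds the host physical address from the mapped PFN plus the GPA's page offset. A user-space sketch (macros mirror the kernel's; the test values are made up):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define pfn_to_hpa(pfn)     ((uint64_t)(pfn) << PAGE_SHIFT)
#define offset_in_page(gpa) ((uint64_t)(gpa) & ((1ULL << PAGE_SHIFT) - 1))

/* HPA written to POSTED_INTR_DESC_ADDR: the mapped page's host physical
 * base plus the descriptor's offset within that page. */
static uint64_t pi_desc_hpa(uint64_t host_pfn, uint64_t desc_gpa)
{
	return pfn_to_hpa(host_pfn) + offset_in_page(desc_gpa);
}
```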
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
v1 -> v2:
- Do not change the lifecycle of the mapping (pbonzini)
---
arch/x86/kvm/vmx/nested.c | 43 ++++++++++++-------------------------------
arch/x86/kvm/vmx/vmx.h | 2 +-
2 files changed, 13 insertions(+), 32 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index dcff99d..b4230ce 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -230,12 +230,8 @@ static void free_nested(struct kvm_vcpu *vcpu)
vmx->nested.apic_access_page = NULL;
}
kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
- if (vmx->nested.pi_desc_page) {
- kunmap(vmx->nested.pi_desc_page);
- kvm_release_page_dirty(vmx->nested.pi_desc_page);
- vmx->nested.pi_desc_page = NULL;
- vmx->nested.pi_desc = NULL;
- }
+ kvm_vcpu_unmap(&vmx->nested.pi_desc_map, true);
+ vmx->nested.pi_desc = NULL;
kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
@@ -2868,26 +2864,15 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
}
if (nested_cpu_has_posted_intr(vmcs12)) {
- if (vmx->nested.pi_desc_page) { /* shouldn't happen */
- kunmap(vmx->nested.pi_desc_page);
- kvm_release_page_dirty(vmx->nested.pi_desc_page);
- vmx->nested.pi_desc_page = NULL;
- vmx->nested.pi_desc = NULL;
- vmcs_write64(POSTED_INTR_DESC_ADDR, -1ull);
+ map = &vmx->nested.pi_desc_map;
+
+ if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) {
+ vmx->nested.pi_desc =
+ (struct pi_desc *)(((void *)map->hva) +
+ offset_in_page(vmcs12->posted_intr_desc_addr));
+ vmcs_write64(POSTED_INTR_DESC_ADDR,
+ pfn_to_hpa(map->pfn) + offset_in_page(vmcs12->posted_intr_desc_addr));
}
- page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->posted_intr_desc_addr);
- if (is_error_page(page))
- return;
- vmx->nested.pi_desc_page = page;
- vmx->nested.pi_desc = kmap(vmx->nested.pi_desc_page);
- vmx->nested.pi_desc =
- (struct pi_desc *)((void *)vmx->nested.pi_desc +
- (unsigned long)(vmcs12->posted_intr_desc_addr &
- (PAGE_SIZE - 1)));
- vmcs_write64(POSTED_INTR_DESC_ADDR,
- page_to_phys(vmx->nested.pi_desc_page) +
- (unsigned long)(vmcs12->posted_intr_desc_addr &
- (PAGE_SIZE - 1)));
}
if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12))
vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
@@ -3911,12 +3896,8 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
vmx->nested.apic_access_page = NULL;
}
kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
- if (vmx->nested.pi_desc_page) {
- kunmap(vmx->nested.pi_desc_page);
- kvm_release_page_dirty(vmx->nested.pi_desc_page);
- vmx->nested.pi_desc_page = NULL;
- vmx->nested.pi_desc = NULL;
- }
+ kvm_vcpu_unmap(&vmx->nested.pi_desc_map, true);
+ vmx->nested.pi_desc = NULL;
/*
* We are now running in L2, mmu_notifier will force to reload the
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f618f52..bd04725 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -143,7 +143,7 @@ struct nested_vmx {
*/
struct page *apic_access_page;
struct kvm_host_map virtual_apic_map;
- struct page *pi_desc_page;
+ struct kvm_host_map pi_desc_map;
struct kvm_host_map msr_bitmap_map;
--
2.7.4
Read the data directly from guest memory instead of the map->read->unmap
sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap(), which
assume that there is a "struct page" for guest memory.
Suggested-by: Jim Mattson <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
---
v1 -> v2:
- Massage commit message a bit.
---
arch/x86/kvm/vmx/nested.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3170e29..536468a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4192,7 +4192,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
{
int ret;
gpa_t vmptr;
- struct page *page;
+ uint32_t revision;
struct vcpu_vmx *vmx = to_vmx(vcpu);
const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
| FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
@@ -4241,18 +4241,10 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
return nested_vmx_failInvalid(vcpu);
- page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
- if (is_error_page(page))
+ if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
+ revision != VMCS12_REVISION)
return nested_vmx_failInvalid(vcpu);
- if (*(u32 *)kmap(page) != VMCS12_REVISION) {
- kunmap(page);
- kvm_release_page_clean(page);
- return nested_vmx_failInvalid(vcpu);
- }
- kunmap(page);
- kvm_release_page_clean(page);
-
vmx->nested.vmxon_ptr = vmptr;
ret = enter_vmx_operation(vcpu);
if (ret)
--
2.7.4
From: Filippo Sironi <[email protected]>
cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
pages and the respective struct page to map in the kernel virtual
address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g., when
limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).
If get_user_pages_fast() fails, look up the VMA that backs the userspace
virtual address, compute the PFN and the physical address, and map it into
the kernel virtual address space with memremap().
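The PFN computation in that fallback path can be checked in isolation (a user-space sketch; the field names come from struct vm_area_struct, the test values are made up):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* For a VM_PFNMAP vma, the PFN backing 'vaddr' is the vma's starting
 * page offset (vm_pgoff) plus the page distance of vaddr from vm_start. */
static uint64_t pfnmap_pfn(uint64_t vaddr, uint64_t vm_start,
			   uint64_t vm_pgoff)
{
	return ((vaddr - vm_start) >> PAGE_SHIFT) + vm_pgoff;
}
```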
Signed-off-by: Filippo Sironi <[email protected]>
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
arch/x86/kvm/paging_tmpl.h | 38 +++++++++++++++++++++++++++++---------
1 file changed, 29 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6bdca39..c40af67 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
struct page *page;
npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
- /* Check if the user is doing something meaningless. */
- if (unlikely(npages != 1))
- return -EFAULT;
-
- table = kmap_atomic(page);
- ret = CMPXCHG(&table[index], orig_pte, new_pte);
- kunmap_atomic(table);
-
- kvm_release_page_dirty(page);
+ if (likely(npages == 1)) {
+ table = kmap_atomic(page);
+ ret = CMPXCHG(&table[index], orig_pte, new_pte);
+ kunmap_atomic(table);
+
+ kvm_release_page_dirty(page);
+ } else {
+ struct vm_area_struct *vma;
+ unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
+ unsigned long pfn;
+ unsigned long paddr;
+
+ down_read(¤t->mm->mmap_sem);
+ vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE);
+ if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
+ up_read(¤t->mm->mmap_sem);
+ return -EFAULT;
+ }
+ pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ paddr = pfn << PAGE_SHIFT;
+ table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
+ if (!table) {
+ up_read(¤t->mm->mmap_sem);
+ return -EFAULT;
+ }
+ ret = CMPXCHG(&table[index], orig_pte, new_pte);
+ memunmap(table);
+ up_read(¤t->mm->mmap_sem);
+ }
return (ret != orig_pte);
}
--
2.7.4
In KVM, especially for nested guests, there is a dominant pattern of:
=> map guest memory -> do_something -> unmap guest memory
In addition to all the unnecessary boilerplate noise this adds to the code,
most of the time the mapping function does not properly handle memory that
is not backed by a "struct page". This new guest mapping API encapsulates
most of this boilerplate code and also handles guest memory that is not
backed by a "struct page".
The current implementation of this API uses memremap for memory that is
not backed by a "struct page", which would lead to a huge slow-down if it
were used for high-frequency mapping operations. The API does not have any
effect on current setups where guest memory is backed by a "struct page".
Further patches will also introduce a pfn-cache which should
significantly improve the performance of the memremap case.
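A user-space model of the kvm_vcpu_unmap() contract introduced below (stubbed and illustrative only: the counters stand in for kvm_release_pfn_dirty()/kvm_release_pfn_clean()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct host_map_model {
	void *hva;
};

static int released_dirty;
static int released_clean;

/* Mirrors kvm_vcpu_unmap(): a no-op when nothing is mapped, releases the
 * backing pfn dirty or clean depending on the flag, and clears only hva
 * so a second unmap is safe. */
static void unmap_model(struct host_map_model *map, bool dirty)
{
	if (!map->hva)
		return;
	if (dirty)
		released_dirty++;
	else
		released_clean++;
	map->hva = NULL;
}
```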
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v3 -> v4:
- Update the commit message.
v1 -> v2:
- Drop the caching optimization (pbonzini)
- Use 'hva' instead of 'kaddr' (pbonzini)
- Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
AMD patch (pbonzini)
- Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
- Only clear map->hva instead of memsetting the whole structure.
- Drop kvm_vcpu_map_valid since it is no longer used.
- Fix EXPORT_MODULE naming.
---
include/linux/kvm_host.h | 9 ++++++++
virt/kvm/kvm_main.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5e..8a2f5fa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -205,6 +205,13 @@ enum {
READING_SHADOW_PAGE_TABLES,
};
+struct kvm_host_map {
+ struct page *page;
+ void *hva;
+ kvm_pfn_t pfn;
+ kvm_pfn_t gfn;
+};
+
/*
* Sometimes a large or cross-page mmio needs to be broken up into separate
* exits for userspace servicing.
@@ -710,7 +717,9 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
+void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty);
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f888a1..4d8f2e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1733,6 +1733,59 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(gfn_to_page);
+static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
+ struct kvm_host_map *map)
+{
+ kvm_pfn_t pfn;
+ void *hva = NULL;
+ struct page *page = NULL;
+
+ pfn = gfn_to_pfn_memslot(slot, gfn);
+ if (is_error_noslot_pfn(pfn))
+ return -EINVAL;
+
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ hva = kmap(page);
+ } else {
+ hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
+ }
+
+ if (!hva)
+ return -EFAULT;
+
+ map->page = page;
+ map->hva = hva;
+ map->pfn = pfn;
+ map->gfn = gfn;
+
+ return 0;
+}
+
+int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
+{
+ return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_map);
+
+void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty)
+{
+ if (!map->hva)
+ return;
+
+ if (map->page)
+ kunmap(map->page);
+ else
+ memunmap(map->hva);
+
+ if (dirty)
+ kvm_release_pfn_dirty(map->pfn);
+ else
+ kvm_release_pfn_clean(map->pfn);
+ map->hva = NULL;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
+
struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
{
kvm_pfn_t pfn;
--
2.7.4
Use the new mapping API for mapping guest memory to avoid depending on
"struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
---
arch/x86/kvm/svm.c | 97 +++++++++++++++++++++++++++---------------------------
1 file changed, 49 insertions(+), 48 deletions(-)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 307e5bd..d886664 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3062,32 +3062,6 @@ static inline bool nested_svm_nmi(struct vcpu_svm *svm)
return false;
}
-static void *nested_svm_map(struct vcpu_svm *svm, u64 gpa, struct page **_page)
-{
- struct page *page;
-
- might_sleep();
-
- page = kvm_vcpu_gfn_to_page(&svm->vcpu, gpa >> PAGE_SHIFT);
- if (is_error_page(page))
- goto error;
-
- *_page = page;
-
- return kmap(page);
-
-error:
- kvm_inject_gp(&svm->vcpu, 0);
-
- return NULL;
-}
-
-static void nested_svm_unmap(struct page *page)
-{
- kunmap(page);
- kvm_release_page_dirty(page);
-}
-
static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
{
unsigned port, size, iopm_len;
@@ -3290,10 +3264,11 @@ static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *fr
static int nested_svm_vmexit(struct vcpu_svm *svm)
{
+ int rc;
struct vmcb *nested_vmcb;
struct vmcb *hsave = svm->nested.hsave;
struct vmcb *vmcb = svm->vmcb;
- struct page *page;
+ struct kvm_host_map map;
trace_kvm_nested_vmexit_inject(vmcb->control.exit_code,
vmcb->control.exit_info_1,
@@ -3302,9 +3277,14 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
vmcb->control.exit_int_info_err,
KVM_ISA_SVM);
- nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, &page);
- if (!nested_vmcb)
+ rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.vmcb), &map);
+ if (rc) {
+ if (rc == -EINVAL)
+ kvm_inject_gp(&svm->vcpu, 0);
return 1;
+ }
+
+ nested_vmcb = map.hva;
/* Exit Guest-Mode */
leave_guest_mode(&svm->vcpu);
@@ -3408,7 +3388,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
mark_all_dirty(svm->vmcb);
- nested_svm_unmap(page);
+ kvm_vcpu_unmap(&map, true);
nested_svm_uninit_mmu_context(&svm->vcpu);
kvm_mmu_reset_context(&svm->vcpu);
@@ -3466,7 +3446,7 @@ static bool nested_vmcb_checks(struct vmcb *vmcb)
}
static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
- struct vmcb *nested_vmcb, struct page *page)
+ struct vmcb *nested_vmcb, struct kvm_host_map *map)
{
if (kvm_get_rflags(&svm->vcpu) & X86_EFLAGS_IF)
svm->vcpu.arch.hflags |= HF_HIF_MASK;
@@ -3550,7 +3530,7 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm->vmcb->control.pause_filter_thresh =
nested_vmcb->control.pause_filter_thresh;
- nested_svm_unmap(page);
+ kvm_vcpu_unmap(map, true);
/* Enter Guest-Mode */
enter_guest_mode(&svm->vcpu);
@@ -3570,17 +3550,23 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
static bool nested_svm_vmrun(struct vcpu_svm *svm)
{
+ int rc;
struct vmcb *nested_vmcb;
struct vmcb *hsave = svm->nested.hsave;
struct vmcb *vmcb = svm->vmcb;
- struct page *page;
+ struct kvm_host_map map;
u64 vmcb_gpa;
vmcb_gpa = svm->vmcb->save.rax;
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
+ rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map);
+ if (rc) {
+ if (rc == -EINVAL)
+ kvm_inject_gp(&svm->vcpu, 0);
return false;
+ }
+
+ nested_vmcb = map.hva;
if (!nested_vmcb_checks(nested_vmcb)) {
nested_vmcb->control.exit_code = SVM_EXIT_ERR;
@@ -3588,7 +3574,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
nested_vmcb->control.exit_info_1 = 0;
nested_vmcb->control.exit_info_2 = 0;
- nested_svm_unmap(page);
+ kvm_vcpu_unmap(&map, true);
return false;
}
@@ -3632,7 +3618,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
copy_vmcb_control_area(hsave, vmcb);
- enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, page);
+ enter_svm_guest_mode(svm, vmcb_gpa, nested_vmcb, &map);
return true;
}
@@ -3656,21 +3642,26 @@ static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb)
static int vmload_interception(struct vcpu_svm *svm)
{
struct vmcb *nested_vmcb;
- struct page *page;
+ struct kvm_host_map map;
int ret;
if (nested_svm_check_permissions(svm))
return 1;
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
+ ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map);
+ if (ret) {
+ if (ret == -EINVAL)
+ kvm_inject_gp(&svm->vcpu, 0);
return 1;
+ }
+
+ nested_vmcb = map.hva;
svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
ret = kvm_skip_emulated_instruction(&svm->vcpu);
nested_svm_vmloadsave(nested_vmcb, svm->vmcb);
- nested_svm_unmap(page);
+ kvm_vcpu_unmap(&map, true);
return ret;
}
@@ -3678,21 +3669,26 @@ static int vmload_interception(struct vcpu_svm *svm)
static int vmsave_interception(struct vcpu_svm *svm)
{
struct vmcb *nested_vmcb;
- struct page *page;
+ struct kvm_host_map map;
int ret;
if (nested_svm_check_permissions(svm))
return 1;
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
+ ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map);
+ if (ret) {
+ if (ret == -EINVAL)
+ kvm_inject_gp(&svm->vcpu, 0);
return 1;
+ }
+
+ nested_vmcb = map.hva;
svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
ret = kvm_skip_emulated_instruction(&svm->vcpu);
nested_svm_vmloadsave(svm->vmcb, nested_vmcb);
- nested_svm_unmap(page);
+ kvm_vcpu_unmap(&map, true);
return ret;
}
@@ -6220,7 +6216,7 @@ static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, u64 smbase)
{
struct vcpu_svm *svm = to_svm(vcpu);
struct vmcb *nested_vmcb;
- struct page *page;
+ struct kvm_host_map map;
struct {
u64 guest;
u64 vmcb;
@@ -6234,11 +6230,16 @@ static int svm_pre_leave_smm(struct kvm_vcpu *vcpu, u64 smbase)
if (svm_state_save.guest) {
vcpu->arch.hflags &= ~HF_SMM_MASK;
- nested_vmcb = nested_svm_map(svm, svm_state_save.vmcb, &page);
+ if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm_state_save.vmcb), &map) == -EINVAL)
+ kvm_inject_gp(&svm->vcpu, 0);
+
+ nested_vmcb = map.hva;
+
if (nested_vmcb)
- enter_svm_guest_mode(svm, svm_state_save.vmcb, nested_vmcb, page);
+ enter_svm_guest_mode(svm, svm_state_save.vmcb, nested_vmcb, &map);
else
ret = 1;
+
vcpu->arch.hflags |= HF_SMM_MASK;
}
return ret;
--
2.7.4
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: KarimAllah Ahmed <[email protected]>
---
v4 -> v5:
- unmap with dirty flag
v1 -> v2:
- Update to match the new API return codes
---
arch/x86/kvm/x86.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02c8e09..0c35cfc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5492,9 +5492,9 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
unsigned int bytes,
struct x86_exception *exception)
{
+ struct kvm_host_map map;
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
gpa_t gpa;
- struct page *page;
char *kaddr;
bool exchanged;
@@ -5511,12 +5511,11 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
goto emul_write;
- page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
- if (is_error_page(page))
+ if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
goto emul_write;
- kaddr = kmap_atomic(page);
- kaddr += offset_in_page(gpa);
+ kaddr = map.hva + offset_in_page(gpa);
+
switch (bytes) {
case 1:
exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
@@ -5533,8 +5532,8 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
default:
BUG();
}
- kunmap_atomic(kaddr);
- kvm_release_page_dirty(page);
+
+ kvm_vcpu_unmap(&map, true);
if (!exchanged)
return X86EMUL_CMPXCHG_FAILED;
--
2.7.4
On 09.01.19 10:42, KarimAllah Ahmed wrote:
> In KVM, especially for nested guests, there is a dominant pattern of:
>
> => map guest memory -> do_something -> unmap guest memory
>
> In addition to all the unnecessary noise in the code due to this boilerplate,
> most of the time the mapping function does not properly handle memory that is
> not backed by a "struct page". This new guest mapping API encapsulates most of
> the boilerplate code and also handles guest memory that is not backed by a
> "struct page".
>
> The current implementation of this API is using memremap for memory that is
> not backed by a "struct page" which would lead to a huge slow-down if it
> was used for high-frequency mapping operations. The API does not have any
> effect on current setups where guest memory is backed by a "struct page".
> Further patches are going to also introduce a pfn-cache which would
> significantly improve the performance of the memremap case.
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> v3 -> v4:
> - Update the commit message.
> v1 -> v2:
> - Drop the caching optimization (pbonzini)
> - Use 'hva' instead of 'kaddr' (pbonzini)
> - Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
> AMD patch (pbonzini)
> - Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
> - Only clear map->hva instead of memsetting the whole structure.
> - Drop kvm_vcpu_map_valid since it is no longer used.
> - Fix EXPORT_MODULE naming.
> ---
> include/linux/kvm_host.h | 9 ++++++++
> virt/kvm/kvm_main.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 62 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c38cc5e..8a2f5fa 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -205,6 +205,13 @@ enum {
> READING_SHADOW_PAGE_TABLES,
> };
>
> +struct kvm_host_map {
> + struct page *page;
Can you add some comments on what it means when there is a page vs.
when there is none?
> + void *hva;
> + kvm_pfn_t pfn;
> + kvm_pfn_t gfn;
> +};
> +
> /*
> * Sometimes a large or cross-page mmio needs to be broken up into separate
> * exits for userspace servicing.
> @@ -710,7 +717,9 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
> struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
> kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map);
> struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
> +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty);
> unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
> unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
> int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1f888a1..4d8f2e3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1733,6 +1733,59 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_GPL(gfn_to_page);
>
> +static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
> + struct kvm_host_map *map)
> +{
> + kvm_pfn_t pfn;
> + void *hva = NULL;
> + struct page *page = NULL;
nit: I prefer these in a growing line-length fashion.
> +
> + pfn = gfn_to_pfn_memslot(slot, gfn);
> + if (is_error_noslot_pfn(pfn))
> + return -EINVAL;
> +
> + if (pfn_valid(pfn)) {
> + page = pfn_to_page(pfn);
> + hva = kmap(page);
> + } else {
> + hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
> + }
> +
> + if (!hva)
> + return -EFAULT;
> +
> + map->page = page;
> + map->hva = hva;
> + map->pfn = pfn;
> + map->gfn = gfn;
> +
> + return 0;
> +}
> +
> +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
> +{
> + return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
> +}
> +EXPORT_SYMBOL_GPL(kvm_vcpu_map);
> +
> +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty)
> +{
> + if (!map->hva)
> + return;
> +
> + if (map->page)
> + kunmap(map->page);
> + else
> + memunmap(map->hva);
> +
> + if (dirty)
I am wondering if this would also be the right place for
kvm_vcpu_mark_page_dirty() to mark the page dirty for migration.
> + kvm_release_pfn_dirty(map->pfn);
> + else
> + kvm_release_pfn_clean(map->pfn);
> + map->hva = NULL;
> +}
> +EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
> +
> struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
> {
> kvm_pfn_t pfn;
>
--
Thanks,
David / dhildenb
On Wed, Jan 09, 2019 at 10:42:01AM +0100, KarimAllah Ahmed wrote:
> Read the data directly from guest memory instead of the map->read->unmap
> sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap(), which
> assume that there is a "struct page" for guest memory.
>
> Suggested-by: Jim Mattson <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Reviewed-by: Jim Mattson <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
Mind if I join the party?
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
>
> ---
> v1 -> v2:
> - Massage commit message a bit.
> ---
> arch/x86/kvm/vmx/nested.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 3170e29..536468a 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4192,7 +4192,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
> {
> int ret;
> gpa_t vmptr;
> - struct page *page;
> + uint32_t revision;
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
> | FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
> @@ -4241,18 +4241,10 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
> if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> return nested_vmx_failInvalid(vcpu);
>
> - page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
> - if (is_error_page(page))
> + if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
> + revision != VMCS12_REVISION)
> return nested_vmx_failInvalid(vcpu);
>
> - if (*(u32 *)kmap(page) != VMCS12_REVISION) {
> - kunmap(page);
> - kvm_release_page_clean(page);
> - return nested_vmx_failInvalid(vcpu);
> - }
> - kunmap(page);
> - kvm_release_page_clean(page);
> -
> vmx->nested.vmxon_ptr = vmptr;
> ret = enter_vmx_operation(vcpu);
> if (ret)
> --
> 2.7.4
>
On Wed, Jan 09, 2019 at 10:42:02AM +0100, KarimAllah Ahmed wrote:
> Update the PML table without mapping and unmapping the page. This also
> avoids using kvm_vcpu_gpa_to_page(..) which assumes that there is a "struct
> page" for guest memory.
>
> As a side-effect of using kvm_write_guest_page the page is also properly
> marked as dirty.
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> v1 -> v2:
> - Use kvm_write_guest_page instead of kvm_write_guest (pbonzini)
> - Do not use pointer arithmetic for pml_address (pbonzini)
> ---
> arch/x86/kvm/vmx/vmx.c | 14 +++++---------
> 1 file changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4d39f73..71d88df 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7199,9 +7199,7 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
> {
> struct vmcs12 *vmcs12;
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> - gpa_t gpa;
> - struct page *page = NULL;
> - u64 *pml_address;
> + gpa_t gpa, dst;
>
> if (is_guest_mode(vcpu)) {
> WARN_ON_ONCE(vmx->nested.pml_full);
> @@ -7221,15 +7219,13 @@ static int vmx_write_pml_buffer(struct kvm_vcpu *vcpu)
> }
>
> gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS) & ~0xFFFull;
> + dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index;
>
> - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->pml_address);
> - if (is_error_page(page))
> + if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa,
> + offset_in_page(dst), sizeof(gpa)))
> return 0;
>
> - pml_address = kmap(page);
> - pml_address[vmcs12->guest_pml_index--] = gpa;
> - kunmap(page);
> - kvm_release_page_clean(page);
> + vmcs12->guest_pml_index--;
> }
>
> return 0;
> --
> 2.7.4
>
On Wed, Jan 09, 2019 at 10:42:03AM +0100, KarimAllah Ahmed wrote:
> From: Filippo Sironi <[email protected]>
>
> cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
> pages and the respective struct page to map in the kernel virtual
> address space.
> This doesn't work if get_user_pages_fast() is invoked with a userspace
> virtual address that's backed by PFNs outside of kernel reach (e.g., when
> limiting the kernel memory with mem= in the command line and using
> /dev/mem to map memory).
>
> If get_user_pages_fast() fails, look up the VMA that backs the userspace
> virtual address, compute the PFN and the physical address, and map it in
> the kernel virtual address space with memremap().
>
> Signed-off-by: Filippo Sironi <[email protected]>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
I personally would have used some crafty goto statements to jump to an
'err' label which would have
    up_read(&current->mm->mmap_sem);
    return -EFAULT;
and which would be placed after
    return (ret != orig_pte);
But that is bike-shedding so feel free to ignore it.
Either way:
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/kvm/paging_tmpl.h | 38 +++++++++++++++++++++++++++++---------
> 1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 6bdca39..c40af67 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -141,15 +141,35 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> struct page *page;
>
> npages = get_user_pages_fast((unsigned long)ptep_user, 1, 1, &page);
> - /* Check if the user is doing something meaningless. */
> - if (unlikely(npages != 1))
> - return -EFAULT;
> -
> - table = kmap_atomic(page);
> - ret = CMPXCHG(&table[index], orig_pte, new_pte);
> - kunmap_atomic(table);
> -
> - kvm_release_page_dirty(page);
> + if (likely(npages == 1)) {
> + table = kmap_atomic(page);
> + ret = CMPXCHG(&table[index], orig_pte, new_pte);
> + kunmap_atomic(table);
> +
> + kvm_release_page_dirty(page);
> + } else {
> + struct vm_area_struct *vma;
> + unsigned long vaddr = (unsigned long)ptep_user & PAGE_MASK;
> + unsigned long pfn;
> + unsigned long paddr;
> +
> + down_read(&current->mm->mmap_sem);
> + vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE);
> + if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
> + up_read(&current->mm->mmap_sem);
> + return -EFAULT;
> + }
> + pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + paddr = pfn << PAGE_SHIFT;
> + table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
> + if (!table) {
> + up_read(&current->mm->mmap_sem);
> + return -EFAULT;
> + }
> + ret = CMPXCHG(&table[index], orig_pte, new_pte);
> + memunmap(table);
> + up_read(&current->mm->mmap_sem);
> + }
>
> return (ret != orig_pte);
> }
> --
> 2.7.4
>
On Wed, Jan 09, 2019 at 10:42:04AM +0100, KarimAllah Ahmed wrote:
> In KVM, especially for nested guests, there is a dominant pattern of:
>
> => map guest memory -> do_something -> unmap guest memory
>
> In addition to all the unnecessary noise in the code due to this boilerplate,
> most of the time the mapping function does not properly handle memory that is
> not backed by a "struct page". This new guest mapping API encapsulates most of
> the boilerplate code and also handles guest memory that is not backed by a
> "struct page".
>
> The current implementation of this API is using memremap for memory that is
> not backed by a "struct page" which would lead to a huge slow-down if it
> was used for high-frequency mapping operations. The API does not have any
> effect on current setups where guest memory is backed by a "struct page".
> Further patches are going to also introduce a pfn-cache which would
> significantly improve the performance of the memremap case.
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> v3 -> v4:
> - Update the commit message.
> v1 -> v2:
> - Drop the caching optimization (pbonzini)
> - Use 'hva' instead of 'kaddr' (pbonzini)
> - Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
> AMD patch (pbonzini)
> - Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
> - Only clear map->hva instead of memsetting the whole structure.
> - Drop kvm_vcpu_map_valid since it is no longer used.
> - Fix EXPORT_MODULE naming.
> ---
> include/linux/kvm_host.h | 9 ++++++++
> virt/kvm/kvm_main.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 62 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c38cc5e..8a2f5fa 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -205,6 +205,13 @@ enum {
> READING_SHADOW_PAGE_TABLES,
> };
>
> +struct kvm_host_map {
> + struct page *page;
> + void *hva;
> + kvm_pfn_t pfn;
> + kvm_pfn_t gfn;
> +};
> +
> /*
> * Sometimes a large or cross-page mmio needs to be broken up into separate
> * exits for userspace servicing.
> @@ -710,7 +717,9 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
> struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
> kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map);
> struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
> +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty);
> unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
> unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
> int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1f888a1..4d8f2e3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1733,6 +1733,59 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_GPL(gfn_to_page);
>
> +static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
> + struct kvm_host_map *map)
> +{
> + kvm_pfn_t pfn;
> + void *hva = NULL;
> + struct page *page = NULL;
> +
Would it make sense to check if 'map' is NULL first? Ditto on the unmap?
Either way:
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
On Wed, Jan 09, 2019 at 10:42:05AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map to map the VMCS12 from guest memory because
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> v4 -> v5:
> - Switch to the new guest mapping API instead of reading directly from
> guest.
> - unmap with dirty flag
> v3 -> v4:
> - Return VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID on failure (jmattson@)
> v1 -> v2:
> - Massage commit message a bit.
> ---
> arch/x86/kvm/vmx/nested.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 536468a..5602b0c 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4521,11 +4521,10 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> return 1;
>
> if (vmx->nested.current_vmptr != vmptr) {
> + struct kvm_host_map map;
> struct vmcs12 *new_vmcs12;
> - struct page *page;
>
> - page = kvm_vcpu_gpa_to_page(vcpu, vmptr);
> - if (is_error_page(page)) {
> + if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmptr), &map)) {
> /*
> * Reads from an unbacked page return all 1s,
> * which means that the 32 bits located at the
> @@ -4536,12 +4535,13 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
> return kvm_skip_emulated_instruction(vcpu);
> }
> - new_vmcs12 = kmap(page);
> +
> + new_vmcs12 = map.hva;
> +
> if (new_vmcs12->hdr.revision_id != VMCS12_REVISION ||
> (new_vmcs12->hdr.shadow_vmcs &&
> !nested_cpu_has_vmx_shadow_vmcs(vcpu))) {
> - kunmap(page);
> - kvm_release_page_clean(page);
> + kvm_vcpu_unmap(&map, false);
> return nested_vmx_failValid(vcpu,
> VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
> }
> @@ -4553,8 +4553,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> * cached.
> */
> memcpy(vmx->nested.cached_vmcs12, new_vmcs12, VMCS12_SIZE);
> - kunmap(page);
> - kvm_release_page_clean(page);
> + kvm_vcpu_unmap(&map, false);
>
> set_current_vmptr(vmx, vmptr);
> }
> --
> 2.7.4
>
> + if (dirty)
> + kvm_release_pfn_dirty(map->pfn);
> + else
> + kvm_release_pfn_clean(map->pfn);
> + map->hva = NULL;
I keep on having this gnawing feeling that we MUST set map->page to
NULL.
That is, I can see how it is not needed if you are using 'map' and
'unmap' together - for that we are good. But what I am worried about is
that someone unmaps it .. and instead of checking map->hva they end up
checking map->page and think the page is mapped.
Would you be OK adding that extra statement just as a fail-safe
mechanism in case someone misuses the APIs?
On Wed, Jan 09, 2019 at 10:42:07AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map when mapping the virtual APIC page since using
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
>
> One additional semantic change is that the virtual host mapping lifecycle
> has changed a bit. It now has the same lifetime of the pinning of the
> virtual APIC page on the host side.
Could you expand a bit on the 'same lifetime .. on the host side' so it
is obvious to folks what exactly the semantics are?
And how does this square with this comment:
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> v4 -> v5:
> - unmap with dirty flag
>
> v1 -> v2:
> - Do not change the lifecycle of the mapping (pbonzini)
.. where Paolo does not want the semantics of the mapping to be changed?
Code wise feel free to smack my Reviewed-by on it, but obviously the
question on the above comment needs to be resolved.
Thank you.
> - Use pfn_to_hpa instead of gfn_to_gpa
> ---
> arch/x86/kvm/vmx/nested.c | 32 +++++++++++---------------------
> arch/x86/kvm/vmx/vmx.c | 5 ++---
> arch/x86/kvm/vmx/vmx.h | 2 +-
> 3 files changed, 14 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4127ad9..dcff99d 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -229,10 +229,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
> kvm_release_page_dirty(vmx->nested.apic_access_page);
> vmx->nested.apic_access_page = NULL;
> }
> - if (vmx->nested.virtual_apic_page) {
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
> if (vmx->nested.pi_desc_page) {
> kunmap(vmx->nested.pi_desc_page);
> kvm_release_page_dirty(vmx->nested.pi_desc_page);
> @@ -2817,6 +2814,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> {
> struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> + struct kvm_host_map *map;
> struct page *page;
> u64 hpa;
>
> @@ -2849,11 +2847,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> }
>
> if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
> - if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr);
> + map = &vmx->nested.virtual_apic_map;
>
> /*
> * If translation failed, VM entry will fail because
> @@ -2868,11 +2862,9 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> * control. But such a configuration is useless, so
> * let's keep the code simple.
> */
> - if (!is_error_page(page)) {
> - vmx->nested.virtual_apic_page = page;
> - hpa = page_to_phys(vmx->nested.virtual_apic_page);
> - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
> - }
> + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
> + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
> +
> }
>
> if (nested_cpu_has_posted_intr(vmcs12)) {
> @@ -3279,11 +3271,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>
> max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
> if (max_irr != 256) {
> - vapic_page = kmap(vmx->nested.virtual_apic_page);
> + vapic_page = vmx->nested.virtual_apic_map.hva;
> + if (!vapic_page)
> + return;
> +
> __kvm_apic_update_irr(vmx->nested.pi_desc->pir,
> vapic_page, &max_irr);
> - kunmap(vmx->nested.virtual_apic_page);
> -
> status = vmcs_read16(GUEST_INTR_STATUS);
> if ((u8)max_irr > ((u8)status & 0xff)) {
> status &= ~0xff;
> @@ -3917,10 +3910,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> kvm_release_page_dirty(vmx->nested.apic_access_page);
> vmx->nested.apic_access_page = NULL;
> }
> - if (vmx->nested.virtual_apic_page) {
> - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> - vmx->nested.virtual_apic_page = NULL;
> - }
> + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
> if (vmx->nested.pi_desc_page) {
> kunmap(vmx->nested.pi_desc_page);
> kvm_release_page_dirty(vmx->nested.pi_desc_page);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 71d88df..e13308e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3627,14 +3627,13 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
>
> if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
> !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
> - WARN_ON_ONCE(!vmx->nested.virtual_apic_page))
> + WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
> return false;
>
> rvi = vmx_get_rvi();
>
> - vapic_page = kmap(vmx->nested.virtual_apic_page);
> + vapic_page = vmx->nested.virtual_apic_map.hva;
> vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
> - kunmap(vmx->nested.virtual_apic_page);
>
> return ((rvi & 0xf0) > (vppr & 0xf0));
> }
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 6fb69d8..f618f52 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -142,7 +142,7 @@ struct nested_vmx {
> * pointers, so we must keep them pinned while L2 runs.
> */
> struct page *apic_access_page;
> - struct page *virtual_apic_page;
> + struct kvm_host_map virtual_apic_map;
> struct page *pi_desc_page;
>
> struct kvm_host_map msr_bitmap_map;
> --
> 2.7.4
>
On Wed, Jan 09, 2019 at 10:42:08AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
> using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
> that has a "struct page".
>
> One additional semantic change is that the virtual host mapping lifecycle
> has changed a bit. It now has the same lifetime of the pinning of the
> interrupt descriptor table page on the host side.
Is the description stale? I am not seeing how you are changing the
semantics here. You follow the same path - map/unmap.
Could you expand please?
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> v4 -> v5:
> - unmap with dirty flag
>
> v1 -> v2:
> - Do not change the lifecycle of the mapping (pbonzini)
On Wed, Jan 09, 2019 at 10:42:09AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
On Wed, Jan 09, 2019 at 10:42:10AM +0100, KarimAllah Ahmed wrote:
> Use the new mapping API for mapping guest memory to avoid depending on
> "struct page".
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
On Wed, Jan 09, 2019 at 10:42:11AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map for accessing the shadow VMCS since using
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
On Wed, Jan 09, 2019 at 10:42:12AM +0100, KarimAllah Ahmed wrote:
> Use kvm_vcpu_map for accessing the enhanced VMCS since using
> kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> a "struct page".
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
On Wed, Jan 09, 2019 at 10:42:00AM +0100, KarimAllah Ahmed wrote:
> Guest memory can either be directly managed by the kernel (i.e. have a "struct
> page") or it can simply live outside kernel control (i.e. not have a
> "struct page"). KVM mostly supports these two modes, except in a few places
> where the code seems to assume that guest memory must have a "struct page".
>
> This patchset introduces a new mapping interface to map guest memory into host
> kernel memory which also supports PFN-based memory (i.e. memory without 'struct
> page'). It also converts all offending code to this interface or to simply
> reading/writing directly from guest memory. Patch 2 additionally fixes an
> incorrect page release and marks the page as dirty (i.e. as a side-effect of
> using the helper function to write).
>
> As far as I can see all offending code is now fixed except the APIC-access page,
> which I will handle in a separate series along with dropping
> kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
>
> The current implementation of the new API uses memremap to map memory that does
> not have a "struct page". This proves to be very slow for high-frequency
> mappings. Since this does not affect the normal use-case where a "struct page"
> is available, the performance of this API will be handled by a separate patch
> series.
Where could one find this patchset?
Also, is there a simple test-case (or a writeup) you have for testing
this code? Specifically I am thinking about the use-case of "memory
without the 'struct page'".
And thank you for posting this patchset. It was a pleasure reviewing the
code!
On Wed, Jan 09, 2019 at 10:42:13AM +0100, KarimAllah Ahmed wrote:
> Use page_address_valid in a few more locations that already check for
> a page-aligned address that does not cross the maximum physical address.
Where is this page_address_valid declared? The latest Linus tree does
not have it, nor does your patchset?
>
> Signed-off-by: KarimAllah Ahmed <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ccb3b63..77aad46 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4203,7 +4203,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
> * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
> * which replaces physical address width with 32
> */
> - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> + if (!page_address_valid(vcpu, vmptr))
> return nested_vmx_failInvalid(vcpu);
>
> if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
> @@ -4266,7 +4266,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
> if (nested_vmx_get_vmptr(vcpu, &vmptr))
> return 1;
>
> - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> + if (!page_address_valid(vcpu, vmptr))
> return nested_vmx_failValid(vcpu,
> VMXERR_VMCLEAR_INVALID_ADDRESS);
>
> @@ -4473,7 +4473,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> if (nested_vmx_get_vmptr(vcpu, &vmptr))
> return 1;
>
> - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> + if (!page_address_valid(vcpu, vmptr))
> return nested_vmx_failValid(vcpu,
> VMXERR_VMPTRLD_INVALID_ADDRESS);
>
> --
> 2.7.4
>
On Thu, 2019-01-10 at 14:07 +0100, David Hildenbrand wrote:
> On 09.01.19 10:42, KarimAllah Ahmed wrote:
> >
> > In KVM, especially for nested guests, there is a dominant pattern of:
> >
> > => map guest memory -> do_something -> unmap guest memory
> >
> > In addition to all the unnecessary noise in the code due to boilerplate
> > code, most of the time the mapping function does not properly handle memory
> > that is not backed by "struct page". This new guest mapping API encapsulates
> > most of this boilerplate code and also handles guest memory that is not
> > backed by "struct page".
> >
> > The current implementation of this API uses memremap for memory that is
> > not backed by a "struct page", which would lead to a huge slow-down if it
> > were used for high-frequency mapping operations. The API does not have any
> > effect on current setups where guest memory is backed by a "struct page".
> > Further patches are going to also introduce a pfn-cache which would
> > significantly improve the performance of the memremap case.
> >
> > Signed-off-by: KarimAllah Ahmed <[email protected]>
> > ---
> > v3 -> v4:
> > - Update the commit message.
> > v1 -> v2:
> > - Drop the caching optimization (pbonzini)
> > - Use 'hva' instead of 'kaddr' (pbonzini)
> > - Return 0/-EINVAL/-EFAULT instead of true/false. -EFAULT will be used for
> > AMD patch (pbonzini)
> > - Introduce __kvm_map_gfn which accepts a memory slot and use it (pbonzini)
> > - Only clear map->hva instead of memsetting the whole structure.
> > - Drop kvm_vcpu_map_valid since it is no longer used.
> > - Fix EXPORT_MODULE naming.
> > ---
> > include/linux/kvm_host.h | 9 ++++++++
> > virt/kvm/kvm_main.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 62 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c38cc5e..8a2f5fa 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -205,6 +205,13 @@ enum {
> > READING_SHADOW_PAGE_TABLES,
> > };
> >
> > +struct kvm_host_map {
> > + struct page *page;
>
> Can you add some comments on what it means when there is a page vs.
> when there is none?
>
> >
> > + void *hva;
> > + kvm_pfn_t pfn;
> > + kvm_pfn_t gfn;
> > +};
> > +
> > /*
> > * Sometimes a large or cross-page mmio needs to be broken up into separate
> > * exits for userspace servicing.
> > @@ -710,7 +717,9 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
> > struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
> > kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
> > kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> > +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
> > struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn);
> > +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty);
> > unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
> > unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
> > int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1f888a1..4d8f2e3 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1733,6 +1733,59 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> > }
> > EXPORT_SYMBOL_GPL(gfn_to_page);
> >
> > +static int __kvm_map_gfn(struct kvm_memory_slot *slot, gfn_t gfn,
> > + struct kvm_host_map *map)
> > +{
> > + kvm_pfn_t pfn;
> > + void *hva = NULL;
> > + struct page *page = NULL;
>
> nit: I prefer these in a growing line-length fashion.
>
> >
> > +
> > + pfn = gfn_to_pfn_memslot(slot, gfn);
> > + if (is_error_noslot_pfn(pfn))
> > + return -EINVAL;
> > +
> > + if (pfn_valid(pfn)) {
> > + page = pfn_to_page(pfn);
> > + hva = kmap(page);
> > + } else {
> > + hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
> > + }
> > +
> > + if (!hva)
> > + return -EFAULT;
> > +
> > + map->page = page;
> > + map->hva = hva;
> > + map->pfn = pfn;
> > + map->gfn = gfn;
> > +
> > + return 0;
> > +}
> > +
> > +int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
> > +{
> > + return __kvm_map_gfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, map);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_map);
> > +
> > +void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty)
> > +{
> > + if (!map->hva)
> > + return;
> > +
> > + if (map->page)
> > + kunmap(map->page);
> > + else
> > + memunmap(map->hva);
> > +
> > + if (dirty)
>
>
> I am wondering if this would also be the right place for
>
> kvm_vcpu_mark_page_dirty() to mark the page dirty for migration.
I indeed considered this; however, either I am missing something or this
mark_page_dirty is accidentally missing in a couple of places! For example:
1) When unmapping the EVMCS page (in nested_release_evmcs), where is it marked
as dirty?
2) For the mapping changes in svm.c, where is it marked dirty?
However, it is handled properly in the rest:
3) For emulator_cmpxchg_emulated it is done, so no problem here.
4) The posted interrupts for L12 are done, so no problem here.
5) The virtual APIC page is done, so no problem here.
Is there any reason why it would not be needed in 1 and 2 above, other than
being a bug?
That being said, good point. I will merge your suggestion in v6 when I rebase
again :)
>
> >
> > + kvm_release_pfn_dirty(map->pfn);
> > + else
> > + kvm_release_pfn_clean(map->pfn);
> > + map->hva = NULL;
>
> >
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
> > +
> > struct page *kvm_vcpu_gfn_to_page(struct kvm_vcpu *vcpu, gfn_t gfn)
> > {
> > kvm_pfn_t pfn;
> >
>
>
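For illustration only, the map -> do_something -> unmap calling convention that kvm_vcpu_map/kvm_vcpu_unmap encapsulate can be sketched in user space. Everything below is a malloc-backed stub, not the kernel implementation from the patch; the stub pretends every gfn is backed by "struct page" memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for the KVM types in the patch. */
typedef uint64_t kvm_pfn_t;
typedef uint64_t gfn_t;

struct kvm_host_map {
	void *page;    /* non-NULL when the pfn has a "struct page" (stubbed) */
	void *hva;     /* host virtual address of the mapping */
	kvm_pfn_t pfn;
	gfn_t gfn;
};

/* Stub: allocate a page-sized buffer instead of kmap()/memremap(). */
static int kvm_vcpu_map(gfn_t gfn, struct kvm_host_map *map)
{
	void *hva = calloc(1, 4096);

	if (!hva)
		return -1;
	map->page = hva;
	map->hva = hva;
	map->pfn = gfn;   /* identity translation in this stub */
	map->gfn = gfn;
	return 0;
}

static void kvm_vcpu_unmap(struct kvm_host_map *map, bool dirty)
{
	if (!map->hva)
		return;
	(void)dirty;      /* real code releases the pfn dirty or clean here */
	free(map->hva);
	map->hva = NULL;
	map->page = NULL; /* fail-safe reset discussed later in the thread */
}

/* The dominant pattern the API encapsulates. */
static int write_guest_u32(gfn_t gfn, unsigned int offset, uint32_t val)
{
	struct kvm_host_map map;

	if (kvm_vcpu_map(gfn, &map))
		return -1;
	memcpy((char *)map.hva + offset, &val, sizeof(val));
	kvm_vcpu_unmap(&map, true); /* we wrote, so unmap dirty */
	return 0;
}
```

The point of the pattern is that the caller never sees whether the backing memory had a "struct page" (kmap path) or not (memremap path); it only deals with map.hva.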
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrer: Christian Schlaeger, Ralf Herbrich
Ust-ID: DE 289 237 879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
On Wed, 2019-01-23 at 12:50 -0500, Konrad Rzeszutek Wilk wrote:
> >
> > + if (dirty)
> > + kvm_release_pfn_dirty(map->pfn);
> > + else
> > + kvm_release_pfn_clean(map->pfn);
> > + map->hva = NULL;
>
> I keep on having this gnawing feeling that we MUST set map->page to
> NULL.
>
> That is, I can see how it is not needed if you are using 'map' and
> 'unmap' together - for that we are good. But what I am worried about is that
> someone unmaps it .. and instead of checking map->hva they end up
> checking map->page and think the page is mapped.
>
> Would you be OK adding that extra statement just as a fail-safe
> mechanism in case someone misuses the APIs?
Good point, will do.
On Wed, 2019-01-23 at 12:57 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:07AM +0100, KarimAllah Ahmed wrote:
> >
> > Use kvm_vcpu_map when mapping the virtual APIC page since using
> > kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
> > a "struct page".
> >
> > One additional semantic change is that the virtual host mapping lifecycle
> > has changed a bit. It now has the same lifetime as the pinning of the
> > virtual APIC page on the host side.
>
> Could you expand a bit on the 'same lifetime .. on the host side' to be
> obvious for folks what exactly the semantic is?
>
> And how does this ring with this comment:
> >
> >
> > Signed-off-by: KarimAllah Ahmed <[email protected]>
> > ---
> > v4 -> v5:
> > - unmap with dirty flag
> >
> > v1 -> v2:
> > - Do not change the lifecycle of the mapping (pbonzini)
>
> .. Where Paolo does not want the semantics of the mapping to be changed?
>
> Code wise feel free to smack my Reviewed-by on it, but obviously the
> question on the above comment needs to be resolved.
Ah, right. So there were two life-cycle changes:
1- Lazily unmap the mapping and only do it on: a) release, or b) if the gfn is
different. This was done as an optimization (i.e. a cache of a single entry).
This is the first life-cycle change that Paolo asked me to do separately and
suggested to create a PFN cache instead. This has indeed been dropped
between v1 -> v2.
2- The life-cycle change now is the fact that the kvm_vcpu_map interface does:
a) map to a virtual address, and b) translate a gfn to a pfn.
The original code was doing the kmap in one location and the gfn_to_page in
another. Using kvm_vcpu_map means that both kmap+gfn_to_page are tied
together and are not done separately. So far no one complained about
this one so I kept it :D
>
> Thank you.
>
> >
> > - Use pfn_to_hpa instead of gfn_to_gpa
> > ---
> > arch/x86/kvm/vmx/nested.c | 32 +++++++++++---------------------
> > arch/x86/kvm/vmx/vmx.c | 5 ++---
> > arch/x86/kvm/vmx/vmx.h | 2 +-
> > 3 files changed, 14 insertions(+), 25 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 4127ad9..dcff99d 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -229,10 +229,7 @@ static void free_nested(struct kvm_vcpu *vcpu)
> > kvm_release_page_dirty(vmx->nested.apic_access_page);
> > vmx->nested.apic_access_page = NULL;
> > }
> > - if (vmx->nested.virtual_apic_page) {
> > - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> > - vmx->nested.virtual_apic_page = NULL;
> > - }
> > + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
> > if (vmx->nested.pi_desc_page) {
> > kunmap(vmx->nested.pi_desc_page);
> > kvm_release_page_dirty(vmx->nested.pi_desc_page);
> > @@ -2817,6 +2814,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> > {
> > struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > + struct kvm_host_map *map;
> > struct page *page;
> > u64 hpa;
> >
> > @@ -2849,11 +2847,7 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> > }
> >
> > if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
> > - if (vmx->nested.virtual_apic_page) { /* shouldn't happen */
> > - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> > - vmx->nested.virtual_apic_page = NULL;
> > - }
> > - page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->virtual_apic_page_addr);
> > + map = &vmx->nested.virtual_apic_map;
> >
> > /*
> > * If translation failed, VM entry will fail because
> > @@ -2868,11 +2862,9 @@ static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> > * control. But such a configuration is useless, so
> > * let's keep the code simple.
> > */
> > - if (!is_error_page(page)) {
> > - vmx->nested.virtual_apic_page = page;
> > - hpa = page_to_phys(vmx->nested.virtual_apic_page);
> > - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa);
> > - }
> > + if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map))
> > + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn));
> > +
> > }
> >
> > if (nested_cpu_has_posted_intr(vmcs12)) {
> > @@ -3279,11 +3271,12 @@ static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> >
> > max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
> > if (max_irr != 256) {
> > - vapic_page = kmap(vmx->nested.virtual_apic_page);
> > + vapic_page = vmx->nested.virtual_apic_map.hva;
> > + if (!vapic_page)
> > + return;
> > +
> > __kvm_apic_update_irr(vmx->nested.pi_desc->pir,
> > vapic_page, &max_irr);
> > - kunmap(vmx->nested.virtual_apic_page);
> > -
> > status = vmcs_read16(GUEST_INTR_STATUS);
> > if ((u8)max_irr > ((u8)status & 0xff)) {
> > status &= ~0xff;
> > @@ -3917,10 +3910,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> > kvm_release_page_dirty(vmx->nested.apic_access_page);
> > vmx->nested.apic_access_page = NULL;
> > }
> > - if (vmx->nested.virtual_apic_page) {
> > - kvm_release_page_dirty(vmx->nested.virtual_apic_page);
> > - vmx->nested.virtual_apic_page = NULL;
> > - }
> > + kvm_vcpu_unmap(&vmx->nested.virtual_apic_map, true);
> > if (vmx->nested.pi_desc_page) {
> > kunmap(vmx->nested.pi_desc_page);
> > kvm_release_page_dirty(vmx->nested.pi_desc_page);
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 71d88df..e13308e 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -3627,14 +3627,13 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> >
> > if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
> > !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
> > - WARN_ON_ONCE(!vmx->nested.virtual_apic_page))
> > + WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
> > return false;
> >
> > rvi = vmx_get_rvi();
> >
> > - vapic_page = kmap(vmx->nested.virtual_apic_page);
> > + vapic_page = vmx->nested.virtual_apic_map.hva;
> > vppr = *((u32 *)(vapic_page + APIC_PROCPRI));
> > - kunmap(vmx->nested.virtual_apic_page);
> >
> > return ((rvi & 0xf0) > (vppr & 0xf0));
> > }
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index 6fb69d8..f618f52 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -142,7 +142,7 @@ struct nested_vmx {
> > * pointers, so we must keep them pinned while L2 runs.
> > */
> > struct page *apic_access_page;
> > - struct page *virtual_apic_page;
> > + struct kvm_host_map virtual_apic_map;
> > struct page *pi_desc_page;
> >
> > struct kvm_host_map msr_bitmap_map;
> > --
> > 2.7.4
> >
On Wed, 2019-01-23 at 13:03 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:08AM +0100, KarimAllah Ahmed wrote:
> >
> > Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
> > using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
> > that has a "struct page".
> >
> > One additional semantic change is that the virtual host mapping lifecycle
> > has changed a bit. It now has the same lifetime as the pinning of the
> > interrupt descriptor table page on the host side.
>
> Is the description stale? I am not seeing how you are changing the
> semantics here. You follow the same path - map/unmap.
>
> Could you expand please?
This is pretty much the same case as in 7/13: there were two different
life-cycle changes, I dropped one of them, and the other one is still there :D
>
> >
> >
> > Signed-off-by: KarimAllah Ahmed <[email protected]>
> > ---
> > v4 -> v5:
> > - unmap with dirty flag
> >
> > v1 -> v2:
> > - Do not change the lifecycle of the mapping (pbonzini)
On Wed, 2019-01-23 at 13:18 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:13AM +0100, KarimAllah Ahmed wrote:
> >
> > Use page_address_valid in a few more locations that already check for
> > a page-aligned address that does not cross the maximum physical address.
>
> Where is this page_address_valid declared? The latest Linus tree does
> not have it, nor does your patchset?
It is already defined in the code; I cannot see any commits that removed it:

$ git grep page_address_valid
arch/x86/kvm/vmx/nested.c:static bool page_address_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, vmcs12->io_bitmap_a) ||
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, vmcs12->io_bitmap_b))
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, vmcs12->msr_bitmap))
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr))
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, vmcs12->apic_access_addr))
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, vmcs12->pml_address))
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, vmcs12->vmread_bitmap) ||
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, vmcs12->vmwrite_bitmap))
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, vmcs12->eptp_list_address))
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, vmcs12->vmcs_link_pointer))
arch/x86/kvm/vmx/nested.c:	if (!page_address_valid(vcpu, kvm_state->vmx.vmxon_pa))
arch/x86/kvm/vmx/nested.c:	    !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
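For reference, the open-coded check being replaced, `!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu))`, can be modeled as a standalone predicate. This is a sketch, not the kernel code: `maxphyaddr` stands in for cpuid_maxphyaddr(vcpu), the guest's physical-address width in bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
typedef uint64_t gpa_t;

/* True iff gpa is page-aligned and does not exceed the guest's
 * maximum physical address (i.e. no bits set at or above maxphyaddr). */
static bool page_address_valid(gpa_t gpa, unsigned int maxphyaddr)
{
	bool aligned = (gpa & (PAGE_SIZE - 1)) == 0;
	bool in_range = (gpa >> maxphyaddr) == 0;

	return aligned && in_range;
}
```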
> >
> >
> > Signed-off-by: KarimAllah Ahmed <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index ccb3b63..77aad46 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4203,7 +4203,7 @@ static int handle_vmon(struct kvm_vcpu *vcpu)
> > * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case;
> > * which replaces physical address width with 32
> > */
> > - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > + if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failInvalid(vcpu);
> >
> > if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
> > @@ -4266,7 +4266,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
> > if (nested_vmx_get_vmptr(vcpu, &vmptr))
> > return 1;
> >
> > - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > + if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failValid(vcpu,
> > VMXERR_VMCLEAR_INVALID_ADDRESS);
> >
> > @@ -4473,7 +4473,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
> > if (nested_vmx_get_vmptr(vcpu, &vmptr))
> > return 1;
> >
> > - if (!PAGE_ALIGNED(vmptr) || (vmptr >> cpuid_maxphyaddr(vcpu)))
> > + if (!page_address_valid(vcpu, vmptr))
> > return nested_vmx_failValid(vcpu,
> > VMXERR_VMPTRLD_INVALID_ADDRESS);
> >
> > --
> > 2.7.4
> >
On Wed, 2019-01-23 at 13:16 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 10:42:00AM +0100, KarimAllah Ahmed wrote:
> >
> > Guest memory can either be directly managed by the kernel (i.e. have a "struct
> > page") or it can simply live outside kernel control (i.e. not have a
> > "struct page"). KVM mostly supports these two modes, except in a few places
> > where the code seems to assume that guest memory must have a "struct page".
> >
> > This patchset introduces a new mapping interface to map guest memory into host
> > kernel memory which also supports PFN-based memory (i.e. memory without 'struct
> > page'). It also converts all offending code to this interface or simply
> > reads/writes guest memory directly. Patch 2 additionally fixes an
> > incorrect page release and marks the page as dirty (i.e. as a side-effect of
> > using the helper function to write).
> >
> > As far as I can see all offending code is now fixed except the APIC-access page
> > which I will handle in a separate series along with dropping
> > kvm_vcpu_gfn_to_page and kvm_vcpu_gpa_to_page from the internal KVM API.
> >
> > The current implementation of the new API uses memremap to map memory that does
> > not have a "struct page". This proves to be very slow for high-frequency
> > mappings. Since this does not affect the normal use-case where a "struct page"
> > is available, the performance of this API will be handled by a separate patch
> > series.
>
> Where could one find this patchset?
Let me clean it and send it out as well :)
>
> Also, is there a simple test-case (or a writeup) you have for testing
> this code? Specifically I am thinking about the use-case of "memory
> without the 'struct page'".
So the simple way to do it is:
1- Pass 'mem=' in the kernel command-line to limit the amount of memory managed
by the kernel.
2- Map this physical memory you want to give to the guest with
mmap("/dev/mem", physical_address_offset, ..)
3- Use the user-space virtual address as the "userspace_addr" field
in KVM_SET_USER_MEMORY_REGION ioctl.
You will also need this patch (hopefully I will repost next week as well):
https://patchwork.kernel.org/patch/9191755/
I will make sure to expand on this in the cover letter in v6.
>
> And thank you for posting this patchset. It was a pleasure reviewing the
> code!
On 23/01/19 18:50, Konrad Rzeszutek Wilk wrote:
>> + if (dirty)
>> + kvm_release_pfn_dirty(map->pfn);
>> + else
>> + kvm_release_pfn_clean(map->pfn);
>> + map->hva = NULL;
> I keep on having this gnawing feeling that we MUST set map->page to
> NULL.
>
> That is, I can see how it is not needed if you are using 'map' and
> 'unmap' together - for that we are good. But what I am worried about is that
> someone unmaps it .. and instead of checking map->hva they end up
> checking map->page and think the page is mapped.
I think that would break the memremap case anyway.
So I think we should indeed reset map->page, but we should set it to a
poison value:
#define KVM_UNMAPPED_PAGE ((void *) 0x500 + POISON_POINTER_DELTA)
map->page = KVM_UNMAPPED_PAGE;
This should make it clear to everyone that checking map->page is _not_
the right thing to do in any case.
Paolo
> Would you be OK adding that extra statement just as a fail-safe
> mechanism in case someone misuses the APIs?
On 25/01/19 19:28, Raslan, KarimAllah wrote:
> So the simple way to do it is:
>
> 1- Pass 'mem=' in the kernel command-line to limit the amount of memory managed
> by the kernel.
> 2- Map this physical memory you want to give to the guest with
> mmap("/dev/mem", physical_address_offset, ..)
> 3- Use the user-space virtual address as the "userspace_addr" field
> in KVM_SET_USER_MEMORY_REGION ioctl.
>
> You will also need this patch (hopefully I will repost next week as well):
> https://patchwork.kernel.org/patch/9191755/
I took a look again at that patch and I guess I've changed my mind now
that the kernel provides e820__mapped_any and e820__mapped_all.
However, please do use e820__mapped_any instead of adding a new function
e820_is_ram.
Thanks,
Paolo
> I will make sure to expand on this in the cover letter in v6.
On Wed, 2019-01-30 at 18:14 +0100, Paolo Bonzini wrote:
> On 25/01/19 19:28, Raslan, KarimAllah wrote:
> >
> > So the simple way to do it is:
> >
> > 1- Pass 'mem=' in the kernel command-line to limit the amount of memory managed
> > by the kernel.
> > 2- Map this physical memory you want to give to the guest with
> > mmap("/dev/mem", physical_address_offset, ..)
> > 3- Use the user-space virtual address as the "userspace_addr" field
> > in KVM_SET_USER_MEMORY_REGION ioctl.
> >
> > You will also need this patch (hopefully I will repost next week as well):
> > https://patchwork.kernel.org/patch/9191755/
>
> I took a look again at that patch and I guess I've changed my mind now
> that the kernel provides e820__mapped_any and e820__mapped_all.
> However, please do use e820__mapped_any instead of adding a new function
> e820_is_ram.
The problem with e820__mapped_* is that they iterate over 'e820_table',
which is already truncated by the 'mem=' and 'memmap=' parameters:
"""
* - 'e820_table': this is the main E820 table that is massaged by the
* low level x86 platform code, or modified by boot parameters, before
* passed on to higher level MM layers.
"""
... so I really still cannot use it for this purpose. The structure that I want
to look at is actually 'e820_table_firmware', which is:
"""
* - 'e820_table_firmware': the original firmware version passed to us by the
* bootloader - not modified by the kernel. It is composed of two parts:
* the first 128 E820 memory entries in boot_params.e820_table and the
remaining
* (if any) entries of the SETUP_E820_EXT nodes. We use this to:
*
* - inform the user about the firmware's notion of memory layout
* via /sys/firmware/memmap
*
* - the hibernation code uses it to generate a kernel-independent MD5
* fingerprint of the physical memory layout of a system.
"""
The users of e820__mapped_any expect these semantics, so even changing the
implementation of these functions to use 'e820_table_firmware' to handle this
would not be an option!
One option here would be to add 'e820__mapped_raw_any' (or whatever
other name) and make it identical to the current implementation of
e820__mapped_any. Would that be slightly more acceptable? :)
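A sketch of what such an e820__mapped_raw_any() could look like. The entry type and the table here are local stand-ins for the kernel's struct e820_entry and e820_table_firmware, and the overlap logic is written from the description above as an assumption, not copied from the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the kernel's e820 types. */
enum e820_type { E820_TYPE_RAM = 1, E820_TYPE_RESERVED = 2 };

struct e820_entry {
	uint64_t addr;
	uint64_t size;
	enum e820_type type;
};

/* Returns true if any part of [start, end) is covered by an entry of the
 * given type in the (untruncated, firmware-provided) table. */
static bool mapped_raw_any(const struct e820_entry *table, int nr,
			   uint64_t start, uint64_t end, enum e820_type type)
{
	for (int i = 0; i < nr; i++) {
		const struct e820_entry *e = &table[i];

		if (type && e->type != type)
			continue;
		/* skip entries that do not overlap [start, end) */
		if (e->addr >= end || e->addr + e->size <= start)
			continue;
		return true;
	}
	return false;
}
```

The difference from e820__mapped_any would only be which table is walked: the firmware table instead of the one massaged by boot parameters.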
>
> Thanks,
>
> Paolo
>
> >
> > I will make sure to expand on this in the cover letter in v6.
>
On 31/01/19 14:03, Raslan, KarimAllah wrote:
>
> One option here would be to add 'e820__mapped_raw_any' (or whatever
> other name) and make it identical to the current implementation of
> e820__mapped_any. Would that be slightly more acceptable? :)
Yes, of course it would (for me at least :)).
Paolo