From: Yulei Zhang <[email protected]>
Currently KVM memory virtualization relies on mmu_lock to synchronize
memory mapping updates, which makes the vCPUs fault in their mappings
serially and slows down execution. Especially after migration,
rebuilding a substantial number of memory mappings causes a visible
performance drop, and it gets worse as the guest has more vCPUs and
more memory.
The idea presented in this patch set is to mitigate the issue with a
pre-constructed memory mapping table. We pin the guest memory up front
and build a global memory mapping table according to the guest memslot
changes, then apply it to CR3, so that once the guest starts up all
the vCPUs can update memory simultaneously without page fault exits;
this is where the performance improvement is expected to come from.
We used a memory-dirtying workload to test the initial patch set and
got positive results even with huge pages enabled. For example, we
created a guest with 32 vCPUs and 64G of memory and let the vCPUs
dirty the entire memory region concurrently; since the patch set
eliminates the mmu_lock overhead, the job completes about 50% faster
in 2M/1G huge page mode.
We have only validated this feature on the Intel x86 platform. As Ben
pointed out in RFC V1, we currently disable SMM for resource
considerations and drop the MMU notifiers, since in this mode the
memory is pinned.
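For orientation only (this sketch is not part of any patch), the
intended flow can be summarized roughly as follows; the function and
field names are the ones introduced later in the series, and the
wrapper function itself is purely illustrative:

/*
 * Illustrative sketch only: when a memslot is created or moved, pin
 * its pages and pre-build the shared EPT, then let every vCPU pick up
 * the global root on its next MMU reload (error handling omitted).
 */
static int prepopulate_new_slot(struct kvm *kvm,
                                struct kvm_memory_slot *slot)
{
        int r;

        /* Pins the slot's pages and fills kvm->arch.global_root_hpa. */
        r = kvm_direct_tdp_populate_page_table(kvm, slot);
        if (r)
                return r;

        /*
         * kvm_mmu_load() then points each vCPU's root_hpa at
         * kvm->arch.global_root_hpa, so guest RAM accesses no longer
         * take EPT violations.
         */
        return 0;
}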
V1->V2:
* Rebase the code to kernel version 5.9.0-rc1.
Yulei Zhang (9):
Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
support
Introduce page table population function for direct build EPT feature
Introduce page table remove function for direct build EPT feature
Add release function for direct build ept when guest VM exit
Modify the page fault path to meet the direct build EPT requirement
Apply the direct build EPT according to the memory slots change
Add migration support when using direct build EPT
Introduce kvm module parameter global_tdp to turn on the direct build
EPT mode
Handle certain mmu exposed functions properly while turn on direct
build EPT mode
arch/mips/kvm/mips.c | 13 +
arch/powerpc/kvm/powerpc.c | 13 +
arch/s390/kvm/kvm-s390.c | 13 +
arch/x86/include/asm/kvm_host.h | 13 +-
arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 7 +-
arch/x86/kvm/x86.c | 55 ++--
include/linux/kvm_host.h | 7 +-
virt/kvm/kvm_main.c | 43 ++-
10 files changed, 639 insertions(+), 60 deletions(-)
--
2.17.1
From: Yulei Zhang <[email protected]>
Add a global_root_hpa field for saving the direct build global EPT
root pointer, and add a per-vCPU flag direct_build_tdp to indicate
that the vCPU is using the global EPT root.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ab3af7275d8..485b1239ad39 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -788,6 +788,9 @@ struct kvm_vcpu_arch {
/* AMD MSRC001_0015 Hardware Configuration */
u64 msr_hwcr;
+
+ /* vcpu use pre-constructed EPT */
+ bool direct_build_tdp;
};
struct kvm_lpage_info {
@@ -963,6 +966,8 @@ struct kvm_arch {
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
+ /* global root hpa for pre-constructed EPT */
+ hpa_t global_root_hpa;
};
struct kvm_vm_stat {
--
2.17.1
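For context, a hedged sketch of how these two fields are consumed,
mirroring the kvm_mmu_load() change made by a later patch in this
series; it is shown here only to explain what the fields are for:

/* Sketch (from a later patch in this series): if the VM already has a
 * pre-built global root, reuse it instead of allocating per-vCPU
 * roots. */
if (vcpu->kvm->arch.global_root_hpa) {
        vcpu->arch.direct_build_tdp = true;
        vcpu->arch.mmu->root_hpa = vcpu->kvm->arch.global_root_hpa;
}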
From: Yulei Zhang <[email protected]>
During boot the guest modifies the memory slots multiple times, so add
a page table remove function to free the pre-pinned memory according
to the memory slot changes.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 56 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bfe4d2b3e809..03c5e73b96cb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6482,6 +6482,62 @@ int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *
return 0;
}
+static int __kvm_remove_spte(struct kvm *kvm, u64 *addr, gfn_t gfn, int level)
+{
+ int i;
+ int ret = level;
+ bool present = false;
+ kvm_pfn_t pfn;
+ u64 *sptep = (u64 *)__va((*addr) & PT64_BASE_ADDR_MASK);
+ unsigned index = SHADOW_PT_INDEX(gfn << PAGE_SHIFT, level);
+
+ for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+ if (is_shadow_present_pte(sptep[i])) {
+ if (i == index) {
+ if (!is_last_spte(sptep[i], level)) {
+ ret = __kvm_remove_spte(kvm, &sptep[i], gfn, level - 1);
+ if (is_shadow_present_pte(sptep[i]))
+ return ret;
+ } else {
+ pfn = spte_to_pfn(sptep[i]);
+ mmu_spte_clear_track_bits(&sptep[i]);
+ kvm_release_pfn_clean(pfn);
+ if (present)
+ return ret;
+ }
+ } else {
+ if (i > index)
+ return ret;
+ else
+ present = true;
+ }
+ }
+ }
+
+ if (!present) {
+ pfn = spte_to_pfn(*addr);
+ mmu_spte_clear_track_bits(addr);
+ kvm_release_pfn_clean(pfn);
+ }
+ return ret;
+}
+
+void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ gfn_t gfn = slot->base_gfn;
+ int host_level;
+
+ if (!kvm->arch.global_root_hpa)
+ return;
+
+ for (gfn = slot->base_gfn;
+ gfn < slot->base_gfn + slot->npages;
+ gfn += KVM_PAGES_PER_HPAGE(host_level))
+ host_level = __kvm_remove_spte(kvm, &(kvm->arch.global_root_hpa), gfn, PT64_ROOT_4LEVEL);
+
+ kvm_flush_remote_tlbs(kvm);
+}
+
/*
* Calculate mmu pages needed for kvm.
*/
--
2.17.1
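For reference, a later patch in the series drives this removal from
the common memslot update path; a minimal sketch of that call site, in
kvm_set_memslot() for DELETE/MOVE before the old slot is marked
invalid, looks like:

/* Sketch of the call site added later in the series: tear down the
 * pre-built mappings for address space 0 before the slot goes away. */
if (!as_id)
        kvm_direct_tdp_remove_page_table(kvm, slot);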
From: Yulei Zhang <[email protected]>
Release the pre-pinned memory used by the direct build EPT when the
guest VM exits.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 37 ++++++++++++++++++++++++++++---------
1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 03c5e73b96cb..f2124f52b286 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4309,8 +4309,11 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush,
bool skip_mmu_sync)
{
- __kvm_mmu_new_pgd(vcpu, new_pgd, kvm_mmu_calc_root_page_role(vcpu),
- skip_tlb_flush, skip_mmu_sync);
+ if (!vcpu->arch.direct_build_tdp)
+ __kvm_mmu_new_pgd(vcpu, new_pgd, kvm_mmu_calc_root_page_role(vcpu),
+ skip_tlb_flush, skip_mmu_sync);
+ else
+ vcpu->arch.mmu->root_hpa = INVALID_PAGE;
}
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
@@ -5207,10 +5210,14 @@ EXPORT_SYMBOL_GPL(kvm_mmu_load);
void kvm_mmu_unload(struct kvm_vcpu *vcpu)
{
- kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
- kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
- WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
+ if (!vcpu->arch.direct_build_tdp) {
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.root_mmu.root_hpa));
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON(VALID_PAGE(vcpu->arch.guest_mmu.root_hpa));
+ }
+ vcpu->arch.direct_build_tdp = false;
+ vcpu->arch.mmu->root_hpa = INVALID_PAGE;
}
EXPORT_SYMBOL_GPL(kvm_mmu_unload);
@@ -6538,6 +6545,14 @@ void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *s
kvm_flush_remote_tlbs(kvm);
}
+void kvm_direct_tdp_release_global_root(struct kvm *kvm)
+{
+ if (kvm->arch.global_root_hpa)
+ __kvm_walk_global_page(kvm, kvm->arch.global_root_hpa, max_tdp_level);
+
+ return;
+}
+
/*
* Calculate mmu pages needed for kvm.
*/
@@ -6564,9 +6579,13 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
{
- kvm_mmu_unload(vcpu);
- free_mmu_pages(&vcpu->arch.root_mmu);
- free_mmu_pages(&vcpu->arch.guest_mmu);
+ if (vcpu->arch.direct_build_tdp) {
+ vcpu->arch.mmu->root_hpa = INVALID_PAGE;
+ } else {
+ kvm_mmu_unload(vcpu);
+ free_mmu_pages(&vcpu->arch.root_mmu);
+ free_mmu_pages(&vcpu->arch.guest_mmu);
+ }
mmu_free_memory_caches(vcpu);
}
--
2.17.1
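The expected teardown ordering, as wired up by a later patch in
kvm_destroy_vm(), is sketched below: per-vCPU MMU state is dropped
first, then the shared global root and its pinned pages are released
once per VM:

/* Sketch of the VM destruction path (from a later patch): the global
 * root is released after the architecture teardown, which has already
 * run kvm_mmu_destroy() for every vCPU. */
kvm_arch_destroy_vm(kvm);
kvm_destroy_devices(kvm);
kvm_direct_tdp_release_global_root(kvm);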
From: Yulei Zhang <[email protected]>
Skip gfn write protection, the fast zap of all pages, and gfn-range
zapping when the pre-constructed global EPT is in use, since these
paths do not apply to the pinned direct build page table.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6639d9c7012e..35bd87bf965f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1719,6 +1719,9 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
int i;
bool write_protected = false;
+ if (kvm->arch.global_root_hpa)
+ return write_protected;
+
for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
write_protected |= __rmap_write_protect(kvm, rmap_head, true);
@@ -5862,6 +5865,9 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
*/
static void kvm_mmu_zap_all_fast(struct kvm *kvm)
{
+ if (kvm->arch.global_root_hpa)
+ return;
+
lockdep_assert_held(&kvm->slots_lock);
spin_lock(&kvm->mmu_lock);
@@ -5924,6 +5930,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
struct kvm_memory_slot *memslot;
int i;
+ if (kvm->arch.global_root_hpa)
+ return;
+
spin_lock(&kvm->mmu_lock);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
slots = __kvm_memslots(kvm, i);
--
2.17.1
From: Yulei Zhang <[email protected]>
Currently global_tdp is only supported on Intel x86 systems with EPT
support, and SMM is turned off when global_tdp is enabled.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/mmu/mmu.c | 5 ++++-
arch/x86/kvm/x86.c | 11 ++++++++++-
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 429a50c89268..330cb254b34b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1357,6 +1357,8 @@ extern u64 kvm_default_tsc_scaling_ratio;
extern u64 kvm_mce_cap_supported;
+extern bool global_tdp;
+
/*
* EMULTYPE_NO_DECODE - Set when re-emulating an instruction (after completing
* userspace I/O) to indicate that the emulation context
@@ -1689,6 +1691,8 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
#endif
}
+inline bool boot_cpu_is_amd(void);
+
#define put_smstate(type, buf, offset, val) \
*(type *)((buf) + (offset) - 0x7e00) = val
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f03bf8efcefe..6639d9c7012e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4573,7 +4573,7 @@ reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
}
EXPORT_SYMBOL_GPL(reset_shadow_zero_bits_mask);
-static inline bool boot_cpu_is_amd(void)
+inline bool boot_cpu_is_amd(void)
{
WARN_ON_ONCE(!tdp_enabled);
return shadow_x_mask == 0;
@@ -6497,6 +6497,9 @@ int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *
kvm_pfn_t pfn;
int host_level;
+ if (!global_tdp)
+ return 0;
+
if (!kvm->arch.global_root_hpa) {
struct page *page;
WARN_ON(!tdp_enabled);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee898003f22f..57d64f3239e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -161,6 +161,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
int __read_mostly pi_inject_timer = -1;
module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
+bool __read_mostly global_tdp;
+module_param_named(global_tdp, global_tdp, bool, S_IRUGO);
+
#define KVM_NR_SHARED_MSRS 16
struct kvm_shared_msrs_global {
@@ -3539,7 +3542,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
* fringe case that is not enabled except via specific settings
* of the module parameters.
*/
- r = kvm_x86_ops.has_emulated_msr(MSR_IA32_SMBASE);
+ if (global_tdp)
+ r = 0;
+ else
+ r = kvm_x86_ops.has_emulated_msr(MSR_IA32_SMBASE);
break;
case KVM_CAP_VAPIC:
r = !kvm_x86_ops.cpu_has_accelerated_tpr();
@@ -9808,6 +9814,9 @@ int kvm_arch_hardware_setup(void *opaque)
if (r != 0)
return r;
+ if ((tdp_enabled == false) || boot_cpu_is_amd())
+ global_tdp = 0;
+
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
--
2.17.1
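As a usage note, global_tdp must be set when the kvm module is loaded
(it is read-only afterwards) and only takes effect on hosts where
EPT/TDP is available; a hedged sketch of the effective gating,
combining the checks from this patch, is:

/* Sketch of the effective gating in this patch: the parameter is
 * force-cleared on hosts without TDP or on AMD CPUs, and while it is
 * set the SMM capability is reported as unsupported to userspace. */
if (!tdp_enabled || boot_cpu_is_amd())
        global_tdp = 0;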
From: Yulei Zhang <[email protected]>
Make migration available in direct build EPT mode whether PML is
enabled or not.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/mmu/mmu.c | 153 +++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 44 +++++----
3 files changed, 178 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ab3cbef8c1aa..429a50c89268 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1318,6 +1318,8 @@ void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long kvm_nr_mmu_pages);
+void kvm_mmu_slot_direct_build_handle_wp(struct kvm *kvm,
+ struct kvm_memory_slot *memslot);
int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
bool pdptrs_changed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 47d2a1c18f36..f03bf8efcefe 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -249,6 +249,8 @@ struct kvm_shadow_walk_iterator {
static struct kmem_cache *pte_list_desc_cache;
static struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;
+static int __kvm_write_protect_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int level);
static u64 __read_mostly shadow_nx_mask;
static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
@@ -1644,11 +1646,18 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
gfn_t gfn_offset, unsigned long mask)
{
struct kvm_rmap_head *rmap_head;
+ gfn_t gfn;
while (mask) {
- rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
- PG_LEVEL_4K, slot);
- __rmap_write_protect(kvm, rmap_head, false);
+ if (kvm->arch.global_root_hpa) {
+ gfn = slot->base_gfn + gfn_offset + __ffs(mask);
+
+ __kvm_write_protect_spte(kvm, slot, gfn, PG_LEVEL_4K);
+ } else {
+ rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ __rmap_write_protect(kvm, rmap_head, false);
+ }
/* clear the first set bit */
mask &= mask - 1;
@@ -6584,6 +6593,144 @@ void kvm_direct_tdp_release_global_root(struct kvm *kvm)
return;
}
+static int __kvm_write_protect_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int level)
+{
+ int ret = 0;
+ /* add write protect on pte, tear down the page table if large page is enabled */
+ struct kvm_shadow_walk_iterator iterator;
+ unsigned long i;
+ kvm_pfn_t pfn;
+ struct page *page;
+ u64 *sptep;
+ u64 spte, t_spte;
+
+ for_each_direct_build_shadow_entry(iterator, kvm->arch.global_root_hpa,
+ gfn << PAGE_SHIFT, max_tdp_level) {
+ if (iterator.level == level) {
+ break;
+ }
+ }
+
+ if (level != PG_LEVEL_4K) {
+ sptep = iterator.sptep;
+
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return ret;
+
+ t_spte = page_to_phys(page) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+ shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+
+ for (i = 0; i < KVM_PAGES_PER_HPAGE(level); i++) {
+
+ for_each_direct_build_shadow_entry(iterator, t_spte & PT64_BASE_ADDR_MASK,
+ gfn << PAGE_SHIFT, level - 1) {
+ if (iterator.level == PG_LEVEL_4K) {
+ break;
+ }
+
+ if (!is_shadow_present_pte(*iterator.sptep)) {
+ struct page *page;
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page) {
+ __kvm_walk_global_page(kvm, t_spte & PT64_BASE_ADDR_MASK, level - 1);
+ return ret;
+ }
+ spte = page_to_phys(page) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+ shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+ mmu_spte_set(iterator.sptep, spte);
+ }
+ }
+
+ pfn = gfn_to_pfn_try_write(slot, gfn);
+ if ((pfn & KVM_PFN_ERR_FAULT) || is_noslot_pfn(pfn))
+ return ret;
+
+ if (kvm_x86_ops.slot_enable_log_dirty)
+ direct_build_tdp_set_spte(kvm, slot, iterator.sptep,
+ ACC_ALL, iterator.level, gfn, pfn, false, false, true);
+
+ else
+ direct_build_tdp_set_spte(kvm, slot, iterator.sptep,
+ ACC_EXEC_MASK | ACC_USER_MASK, iterator.level, gfn, pfn, false, true, true);
+ gfn++;
+ }
+ WARN_ON(!is_last_spte(*sptep, level));
+ pfn = spte_to_pfn(*sptep);
+ mmu_spte_clear_track_bits(sptep);
+ kvm_release_pfn_clean(pfn);
+ mmu_spte_set(sptep, t_spte);
+ } else {
+ if (kvm_x86_ops.slot_enable_log_dirty)
+ spte_clear_dirty(iterator.sptep);
+ else
+ spte_write_protect(iterator.sptep, false);
+ }
+ return ret;
+}
+
+static void __kvm_remove_wp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int level)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ kvm_pfn_t pfn;
+ u64 addr, spte;
+
+ for_each_direct_build_shadow_entry(iterator, kvm->arch.global_root_hpa,
+ gfn << PAGE_SHIFT, max_tdp_level) {
+ if (iterator.level == level)
+ break;
+ }
+
+ if (level != PG_LEVEL_4K) {
+ if (is_shadow_present_pte(*iterator.sptep)) {
+ addr = (*iterator.sptep) & PT64_BASE_ADDR_MASK;
+
+ pfn = gfn_to_pfn_try_write(slot, gfn);
+ if ((pfn & KVM_PFN_ERR_FAULT) || is_noslot_pfn(pfn)) {
+ printk("Failed to alloc page\n");
+ return;
+ }
+ mmu_spte_clear_track_bits(iterator.sptep);
+ direct_build_tdp_set_spte(kvm, slot, iterator.sptep,
+ ACC_ALL, level, gfn, pfn, false, true, true);
+
+ __kvm_walk_global_page(kvm, addr, level - 1);
+ }
+ } else {
+ if (is_shadow_present_pte(*iterator.sptep)) {
+ if (kvm_x86_ops.slot_enable_log_dirty) {
+ spte_set_dirty(iterator.sptep);
+ } else {
+ spte = (*iterator.sptep) | PT_WRITABLE_MASK;
+ mmu_spte_update(iterator.sptep, spte);
+ }
+ }
+ }
+}
+
+void kvm_mmu_slot_direct_build_handle_wp(struct kvm *kvm,
+ struct kvm_memory_slot *memslot)
+{
+ gfn_t gfn = memslot->base_gfn;
+ int host_level;
+
+ /* remove write mask from PTE */
+ for (gfn = memslot->base_gfn; gfn < memslot->base_gfn + memslot->npages; ) {
+
+ host_level = direct_build_mapping_level(kvm, memslot, gfn);
+
+ if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES)
+ __kvm_write_protect_spte(kvm, memslot, gfn, host_level);
+ else
+ __kvm_remove_wp_spte(kvm, memslot, gfn, host_level);
+ gfn += KVM_PAGES_PER_HPAGE(host_level);
+ }
+
+ kvm_flush_remote_tlbs(kvm);
+}
+
/*
* Calculate mmu pages needed for kvm.
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 599d73206299..ee898003f22f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10196,9 +10196,12 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
* kvm_arch_flush_shadow_memslot()
*/
if ((old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
- !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
- kvm_mmu_zap_collapsible_sptes(kvm, new);
-
+ !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
+ if (kvm->arch.global_root_hpa)
+ kvm_mmu_slot_direct_build_handle_wp(kvm, (struct kvm_memory_slot *)new);
+ else
+ kvm_mmu_zap_collapsible_sptes(kvm, new);
+ }
/*
* Enable or disable dirty logging for the slot.
*
@@ -10228,25 +10231,30 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
* is enabled the D-bit or the W-bit will be cleared.
*/
if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
- if (kvm_x86_ops.slot_enable_log_dirty) {
- kvm_x86_ops.slot_enable_log_dirty(kvm, new);
+ if (kvm->arch.global_root_hpa) {
+ kvm_mmu_slot_direct_build_handle_wp(kvm, new);
} else {
- int level =
- kvm_dirty_log_manual_protect_and_init_set(kvm) ?
- PG_LEVEL_2M : PG_LEVEL_4K;
+ if (kvm_x86_ops.slot_enable_log_dirty) {
+ kvm_x86_ops.slot_enable_log_dirty(kvm, new);
+ } else {
+ int level =
+ kvm_dirty_log_manual_protect_and_init_set(kvm) ?
+ PG_LEVEL_2M : PG_LEVEL_4K;
- /*
- * If we're with initial-all-set, we don't need
- * to write protect any small page because
- * they're reported as dirty already. However
- * we still need to write-protect huge pages
- * so that the page split can happen lazily on
- * the first write to the huge page.
- */
- kvm_mmu_slot_remove_write_access(kvm, new, level);
+ /*
+ * If we're with initial-all-set, we don't need
+ * to write protect any small page because
+ * they're reported as dirty already. However
+ * we still need to write-protect huge pages
+ * so that the page split can happen lazily on
+ * the first write to the huge page.
+ */
+ kvm_mmu_slot_remove_write_access(kvm, new, level);
+ }
}
} else {
- if (kvm_x86_ops.slot_disable_log_dirty)
+ if (kvm_x86_ops.slot_disable_log_dirty
+ && !kvm->arch.global_root_hpa)
kvm_x86_ops.slot_disable_log_dirty(kvm, new);
}
}
--
2.17.1
From: Yulei Zhang <[email protected]>
Construct the direct build EPT when the guest memory slots have been
changed, and issue an MMU reload request to update CR3 so that the
guest can use the pre-constructed EPT without taking page faults.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/mips/kvm/mips.c | 13 +++++++++++++
arch/powerpc/kvm/powerpc.c | 13 +++++++++++++
arch/s390/kvm/kvm-s390.c | 13 +++++++++++++
arch/x86/kvm/mmu/mmu.c | 33 ++++++++++++++++++++++++++-------
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 13 +++++++++++++
6 files changed, 81 insertions(+), 7 deletions(-)
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 7de85d2253ff..05d053a53ebf 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -267,6 +267,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
}
}
+int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ return 0;
+}
+
+void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+void kvm_direct_tdp_release_global_root(struct kvm *kvm)
+{
+}
+
static inline void dump_handler(const char *symbol, void *start, void *end)
{
u32 *p;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 13999123b735..c6964cbeb6da 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -715,6 +715,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvmppc_core_commit_memory_region(kvm, mem, old, new, change);
}
+int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ return 0;
+}
+
+void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+void kvm_direct_tdp_release_global_root(struct kvm *kvm)
+{
+}
+
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot)
{
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 6b74b92c1a58..d6f7cf1a30a3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -5021,6 +5021,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
return;
}
+int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ return 0;
+}
+
+void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+void kvm_direct_tdp_release_global_root(struct kvm *kvm)
+{
+}
+
static inline unsigned long nonhyp_mask(int i)
{
unsigned int nonhyp_fai = (sclp.hmfai << i * 2) >> 30;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fda6c4196854..47d2a1c18f36 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5206,13 +5206,20 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;
- r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
- if (r)
- goto out;
- r = mmu_alloc_roots(vcpu);
- kvm_mmu_sync_roots(vcpu);
- if (r)
- goto out;
+ if (vcpu->kvm->arch.global_root_hpa) {
+ vcpu->arch.direct_build_tdp = true;
+ vcpu->arch.mmu->root_hpa = vcpu->kvm->arch.global_root_hpa;
+ }
+
+ if (!vcpu->arch.direct_build_tdp) {
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
+ if (r)
+ goto out;
+ r = mmu_alloc_roots(vcpu);
+ kvm_mmu_sync_roots(vcpu);
+ if (r)
+ goto out;
+ }
kvm_mmu_load_pgd(vcpu);
kvm_x86_ops.tlb_flush_current(vcpu);
out:
@@ -6464,6 +6471,17 @@ int direct_build_mapping_level(struct kvm *kvm, struct kvm_memory_slot *slot, gf
return host_level;
}
+static void kvm_make_direct_build_update(struct kvm *kvm)
+{
+ int i;
+ struct kvm_vcpu *vcpu;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+ kvm_vcpu_kick(vcpu);
+ }
+}
+
int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
{
gfn_t gfn;
@@ -6498,6 +6516,7 @@ int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *
direct_build_tdp_map(kvm, slot, gfn, pfn, host_level);
}
+ kvm_make_direct_build_update(kvm);
return 0;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8901862ba2a3..b2aa0daad6dd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -694,6 +694,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
enum kvm_mr_change change);
+int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_direct_tdp_remove_page_table(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_direct_tdp_release_global_root(struct kvm *kvm);
void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
/* flush all memory translations */
void kvm_arch_flush_shadow_all(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 47fc18b05c53..fd1b419f4eb4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -876,6 +876,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
#endif
kvm_arch_destroy_vm(kvm);
kvm_destroy_devices(kvm);
+ kvm_direct_tdp_release_global_root(kvm);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
kvm_free_memslots(kvm, __kvm_memslots(kvm, i));
cleanup_srcu_struct(&kvm->irq_srcu);
@@ -1195,6 +1196,10 @@ static int kvm_set_memslot(struct kvm *kvm,
* in the freshly allocated memslots, not in @old or @new.
*/
slot = id_to_memslot(slots, old->id);
+ /* Remove pre-constructed page table */
+ if (!as_id)
+ kvm_direct_tdp_remove_page_table(kvm, slot);
+
slot->flags |= KVM_MEMSLOT_INVALID;
/*
@@ -1222,6 +1227,14 @@ static int kvm_set_memslot(struct kvm *kvm,
update_memslots(slots, new, change);
slots = install_new_memslots(kvm, as_id, slots);
+ if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
+ if (!as_id) {
+ r = kvm_direct_tdp_populate_page_table(kvm, new);
+ if (r)
+ goto out_slots;
+ }
+ }
+
kvm_arch_commit_memory_region(kvm, mem, old, new, change);
kvfree(slots);
--
2.17.1
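From the userspace side nothing new is required: registering guest
memory through the usual KVM_SET_USER_MEMORY_REGION ioctl is what
drives the pre-population above when global_tdp is enabled. A minimal
sketch (vm_fd, host_mem and mem_size are assumed to come from the
usual VM setup):

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: create a memslot the normal way; with global_tdp=1 the
 * kernel pins the backing pages and pre-builds the EPT during this
 * ioctl. */
static void register_slot(int vm_fd, void *host_mem, uint64_t mem_size)
{
        struct kvm_userspace_memory_region region = {
                .slot            = 0,
                .guest_phys_addr = 0,
                .memory_size     = mem_size,
                .userspace_addr  = (uintptr_t)host_mem,
        };

        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
                err(1, "KVM_SET_USER_MEMORY_REGION");
}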
From: Yulei Zhang <[email protected]>
Refine the fast page fault code so that it can be used in either
normal EPT mode or direct build EPT mode.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f2124f52b286..fda6c4196854 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3443,12 +3443,13 @@ static bool page_fault_can_be_fast(u32 error_code)
* someone else modified the SPTE from its original value.
*/
static bool
-fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, gpa_t gpa,
u64 *sptep, u64 old_spte, u64 new_spte)
{
gfn_t gfn;
- WARN_ON(!sp->role.direct);
+ WARN_ON(!vcpu->arch.direct_build_tdp &&
+ (!sptep_to_sp(sptep)->role.direct));
/*
* Theoretically we could also set dirty bit (and flush TLB) here in
@@ -3470,7 +3471,8 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* The gfn of direct spte is stable since it is
* calculated by sp->gfn.
*/
- gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
+
+ gfn = gpa >> PAGE_SHIFT;
kvm_vcpu_mark_page_dirty(vcpu, gfn);
}
@@ -3498,10 +3500,10 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 error_code)
{
struct kvm_shadow_walk_iterator iterator;
- struct kvm_mmu_page *sp;
bool fault_handled = false;
u64 spte = 0ull;
uint retry_count = 0;
+ int pte_level = 0;
if (!page_fault_can_be_fast(error_code))
return false;
@@ -3515,8 +3517,15 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!is_shadow_present_pte(spte))
break;
- sp = sptep_to_sp(iterator.sptep);
- if (!is_last_spte(spte, sp->role.level))
+ if (iterator.level < PG_LEVEL_4K)
+ pte_level = PG_LEVEL_4K;
+ else
+ pte_level = iterator.level;
+
+ WARN_ON(!vcpu->arch.direct_build_tdp &&
+ (pte_level != sptep_to_sp(iterator.sptep)->role.level));
+
+ if (!is_last_spte(spte, pte_level))
break;
/*
@@ -3559,7 +3568,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
*
* See the comments in kvm_arch_commit_memory_region().
*/
- if (sp->role.level > PG_LEVEL_4K)
+ if (pte_level > PG_LEVEL_4K)
break;
}
@@ -3573,7 +3582,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* since the gfn is not stable for indirect shadow page. See
* Documentation/virt/kvm/locking.rst to get more detail.
*/
- fault_handled = fast_pf_fix_direct_spte(vcpu, sp,
+ fault_handled = fast_pf_fix_direct_spte(vcpu, cr2_or_gpa,
iterator.sptep, spte,
new_spte);
if (fault_handled)
@@ -4106,6 +4115,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
if (fast_page_fault(vcpu, gpa, error_code))
return RET_PF_RETRY;
+ if (vcpu->arch.direct_build_tdp)
+ return RET_PF_EMULATE;
+
r = mmu_topup_memory_caches(vcpu, false);
if (r)
return r;
--
2.17.1
From: Yulei Zhang <[email protected]>
The page table population function pins the memory and pre-constructs
the EPT based on the given memory slot configuration, so that it does
not rely on page faults to set up the page table.
Signed-off-by: Yulei Zhang <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 212 +++++++++++++++++++++++++++++++-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 7 +-
include/linux/kvm_host.h | 4 +-
virt/kvm/kvm_main.c | 30 ++++-
6 files changed, 244 insertions(+), 13 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 485b1239ad39..ab3cbef8c1aa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1138,7 +1138,7 @@ struct kvm_x86_ops {
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
- u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+ u64 (*get_mt_mask)(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, unsigned long pgd,
int pgd_level);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e03841f053d..bfe4d2b3e809 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -241,6 +241,11 @@ struct kvm_shadow_walk_iterator {
({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
__shadow_walk_next(&(_walker), spte))
+#define for_each_direct_build_shadow_entry(_walker, shadow_addr, _addr, level) \
+ for (__shadow_walk_init(&(_walker), shadow_addr, _addr, level); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
static struct kmem_cache *pte_list_desc_cache;
static struct kmem_cache *mmu_page_header_cache;
static struct percpu_counter kvm_total_used_mmu_pages;
@@ -2506,13 +2511,20 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
return sp;
}
+static void __shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+ hpa_t shadow_addr, u64 addr, int level)
+{
+ iterator->addr = addr;
+ iterator->shadow_addr = shadow_addr;
+ iterator->level = level;
+ iterator->sptep = NULL;
+}
+
static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
struct kvm_vcpu *vcpu, hpa_t root,
u64 addr)
{
- iterator->addr = addr;
- iterator->shadow_addr = root;
- iterator->level = vcpu->arch.mmu->shadow_root_level;
+ __shadow_walk_init(iterator, root, addr, vcpu->arch.mmu->shadow_root_level);
if (iterator->level == PT64_ROOT_4LEVEL &&
vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
@@ -3014,7 +3026,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
if (tdp_enabled)
- spte |= kvm_x86_ops.get_mt_mask(vcpu, gfn,
+ spte |= kvm_x86_ops.get_mt_mask(vcpu->kvm, vcpu, gfn,
kvm_is_mmio_pfn(pfn));
if (host_writable)
@@ -6278,6 +6290,198 @@ int kvm_mmu_module_init(void)
return ret;
}
+static int direct_build_tdp_set_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+ u64 *sptep, unsigned pte_access, int level,
+ gfn_t gfn, kvm_pfn_t pfn, bool speculative,
+ bool dirty, bool host_writable)
+{
+ u64 spte = 0;
+ int ret = 0;
+ /*
+ * For the EPT case, shadow_present_mask is 0 if hardware
+ * supports exec-only page table entries. In that case,
+ * ACC_USER_MASK and shadow_user_mask are used to represent
+ * read access. See FNAME(gpte_access) in paging_tmpl.h.
+ */
+ spte |= shadow_present_mask;
+ if (!speculative)
+ spte |= shadow_accessed_mask;
+
+ if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
+ is_nx_huge_page_enabled()) {
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+
+ if (pte_access & ACC_EXEC_MASK)
+ spte |= shadow_x_mask;
+ else
+ spte |= shadow_nx_mask;
+
+ if (pte_access & ACC_USER_MASK)
+ spte |= shadow_user_mask;
+
+ if (level > PG_LEVEL_4K)
+ spte |= PT_PAGE_SIZE_MASK;
+
+ if (tdp_enabled)
+ spte |= kvm_x86_ops.get_mt_mask(kvm, NULL, gfn, kvm_is_mmio_pfn(pfn));
+
+ if (host_writable)
+ spte |= SPTE_HOST_WRITEABLE;
+ else
+ pte_access &= ~ACC_WRITE_MASK;
+
+ spte |= (u64)pfn << PAGE_SHIFT;
+
+ if (pte_access & ACC_WRITE_MASK) {
+
+ spte |= PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE;
+
+ if (dirty) {
+ mark_page_dirty_in_slot(slot, gfn);
+ spte |= shadow_dirty_mask;
+ }
+ }
+
+ if (mmu_spte_update(sptep, spte))
+ kvm_flush_remote_tlbs(kvm);
+
+ return ret;
+}
+
+static void __kvm_walk_global_page(struct kvm *kvm, u64 addr, int level)
+{
+ int i;
+ kvm_pfn_t pfn;
+ u64 *sptep = (u64 *)__va(addr);
+
+ for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+ if (is_shadow_present_pte(sptep[i])) {
+ if (!is_last_spte(sptep[i], level)) {
+ __kvm_walk_global_page(kvm, sptep[i] & PT64_BASE_ADDR_MASK, level - 1);
+ } else {
+ pfn = spte_to_pfn(sptep[i]);
+ mmu_spte_clear_track_bits(&sptep[i]);
+ kvm_release_pfn_clean(pfn);
+ }
+ }
+ }
+ put_page(pfn_to_page(addr >> PAGE_SHIFT));
+}
+
+static int direct_build_tdp_map(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn,
+ kvm_pfn_t pfn, int level)
+{
+ int ret = 0;
+
+ struct kvm_shadow_walk_iterator iterator;
+ kvm_pfn_t old_pfn;
+ u64 spte;
+
+ for_each_direct_build_shadow_entry(iterator, kvm->arch.global_root_hpa,
+ gfn << PAGE_SHIFT, max_tdp_level) {
+ if (iterator.level == level) {
+ break;
+ }
+
+ if (!is_shadow_present_pte(*iterator.sptep)) {
+ struct page *page;
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return 0;
+
+ spte = page_to_phys(page) | PT_PRESENT_MASK | PT_WRITABLE_MASK |
+ shadow_user_mask | shadow_x_mask | shadow_accessed_mask;
+ mmu_spte_set(iterator.sptep, spte);
+ }
+ }
+ /* if presented pte, release the original pfn */
+ if (is_shadow_present_pte(*iterator.sptep)) {
+ if (level > PG_LEVEL_4K)
+ __kvm_walk_global_page(kvm, (*iterator.sptep) & PT64_BASE_ADDR_MASK, level - 1);
+ else {
+ old_pfn = spte_to_pfn(*iterator.sptep);
+ mmu_spte_clear_track_bits(iterator.sptep);
+ kvm_release_pfn_clean(old_pfn);
+ }
+ }
+ direct_build_tdp_set_spte(kvm, slot, iterator.sptep, ACC_ALL, level, gfn, pfn, false, true, true);
+
+ return ret;
+}
+
+static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
+{
+ unsigned long page_size;
+ int i, ret = 0;
+
+ page_size = kvm_host_page_size(kvm, NULL, gfn);
+
+ for (i = PG_LEVEL_4K; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+ if (page_size >= KVM_HPAGE_SIZE(i))
+ ret = i;
+ else
+ break;
+ }
+
+ return ret;
+}
+
+int direct_build_mapping_level(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ int host_level, max_level, level;
+ struct kvm_lpage_info *linfo;
+
+ host_level = host_mapping_level(kvm, gfn);
+ if (host_level != PG_LEVEL_4K) {
+ max_level = min(max_huge_page_level, host_level);
+ for (level = PG_LEVEL_4K; level <= max_level; ++level) {
+ linfo = lpage_info_slot(gfn, slot, level);
+ if (linfo->disallow_lpage)
+ break;
+ }
+ host_level = level - 1;
+ }
+ return host_level;
+}
+
+int kvm_direct_tdp_populate_page_table(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ gfn_t gfn;
+ kvm_pfn_t pfn;
+ int host_level;
+
+ if (!kvm->arch.global_root_hpa) {
+ struct page *page;
+ WARN_ON(!tdp_enabled);
+ WARN_ON(max_tdp_level != PT64_ROOT_4LEVEL);
+
+ /* init global root hpa */
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return -ENOMEM;
+
+ kvm->arch.global_root_hpa = page_to_phys(page);
+ }
+
+ /* setup page table for the slot */
+ for (gfn = slot->base_gfn;
+ gfn < slot->base_gfn + slot->npages;
+ gfn += KVM_PAGES_PER_HPAGE(host_level)) {
+ pfn = gfn_to_pfn_try_write(slot, gfn);
+ if ((pfn & KVM_PFN_ERR_FAULT) || is_noslot_pfn(pfn))
+ return -ENOMEM;
+
+ host_level = direct_build_mapping_level(kvm, slot, gfn);
+
+ if (host_level > PG_LEVEL_4K)
+ MMU_WARN_ON(gfn & (KVM_PAGES_PER_HPAGE(host_level) - 1));
+ direct_build_tdp_map(kvm, slot, gfn, pfn, host_level);
+ }
+
+ return 0;
+}
+
/*
* Calculate mmu pages needed for kvm.
*/
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 03dd7bac8034..3b7ee65cd941 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3607,7 +3607,7 @@ static bool svm_has_emulated_msr(u32 index)
return true;
}
-static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u64 svm_get_mt_mask(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
return 0;
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46ba2e03a892..6f79343ed40e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7106,7 +7106,7 @@ static int __init vmx_check_processor_compat(void)
return 0;
}
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u64 vmx_get_mt_mask(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
u64 ipat = 0;
@@ -7134,12 +7134,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
goto exit;
}
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
+ if (!kvm_arch_has_noncoherent_dma(kvm)) {
ipat = VMX_EPT_IPAT_BIT;
cache = MTRR_TYPE_WRBACK;
goto exit;
}
+ if (!vcpu)
+ vcpu = kvm->vcpus[0];
+
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
ipat = VMX_EPT_IPAT_BIT;
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a23076765b4c..8901862ba2a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -694,6 +694,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
enum kvm_mr_change change);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
/* flush all memory translations */
void kvm_arch_flush_shadow_all(struct kvm *kvm);
/* flush memory translations pointing to 'slot' */
@@ -721,6 +722,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable);
+kvm_pfn_t gfn_to_pfn_try_write(struct kvm_memory_slot *slot, gfn_t gfn);
void kvm_release_pfn_clean(kvm_pfn_t pfn);
void kvm_release_pfn_dirty(kvm_pfn_t pfn);
@@ -775,7 +777,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
-unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
+unsigned long kvm_host_page_size(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 737666db02de..47fc18b05c53 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -143,7 +143,7 @@ static void hardware_disable_all(void);
static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
__visible bool kvm_rebooting;
EXPORT_SYMBOL_GPL(kvm_rebooting);
@@ -1689,14 +1689,17 @@ bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(kvm_vcpu_is_visible_gfn);
-unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
+unsigned long kvm_host_page_size(struct kvm *kvm, struct kvm_vcpu *vcpu, gfn_t gfn)
{
struct vm_area_struct *vma;
unsigned long addr, size;
size = PAGE_SIZE;
- addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);
+ if (vcpu)
+ addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);
+ else
+ addr = gfn_to_hva(kvm, gfn);
if (kvm_is_error_hva(addr))
return PAGE_SIZE;
@@ -1989,6 +1992,25 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
return pfn;
}
+/* Map pfn for direct EPT mode, if map failed and it is readonly memslot,
+ * will try to remap it with readonly flag.
+ */
+kvm_pfn_t gfn_to_pfn_try_write(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ kvm_pfn_t pfn;
+ unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, !memslot_is_readonly(slot));
+
+ if (kvm_is_error_hva(addr))
+ return KVM_PFN_NOSLOT;
+
+ pfn = hva_to_pfn(addr, false, NULL, true, NULL);
+ if (pfn & KVM_PFN_ERR_FAULT) {
+ if (memslot_is_readonly(slot))
+ pfn = hva_to_pfn(addr, false, NULL, false, NULL);
+ }
+ return pfn;
+}
+
kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable)
@@ -2638,7 +2660,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
}
EXPORT_SYMBOL_GPL(kvm_clear_guest);
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
gfn_t gfn)
{
if (memslot && memslot->dirty_bitmap) {
--
2.17.1
Any comments? guys!
On Tue, 1 Sep 2020 at 19:52, <[email protected]> wrote:
>
> From: Yulei Zhang <[email protected]>
>
> Currently in KVM memory virtulization we relay on mmu_lock to
> synchronize the memory mapping update, which make vCPUs work
> in serialize mode and slow down the execution, especially after
> migration to do substantial memory mapping will cause visible
> performance drop, and it can get worse if guest has more vCPU
> numbers and memories.
>
> The idea we present in this patch set is to mitigate the issue
> with pre-constructed memory mapping table. We will fast pin the
> guest memory to build up a global memory mapping table according
> to the guest memslots changes and apply it to cr3, so that after
> guest starts up all the vCPUs would be able to update the memory
> simultaneously without page fault exception, thus the performance
> improvement is expected.
>
> We use memory dirty pattern workload to test the initial patch
> set and get positive result even with huge page enabled. For example,
> we create guest with 32 vCPUs and 64G memories, and let the vcpus
> dirty the entire memory region concurrently, as the initial patch
> eliminate the overhead of mmu_lock, in 2M/1G huge page mode we would
> get the job done in about 50% faster.
>
> We only validate this feature on Intel x86 platform. And as Ben
> pointed out in RFC V1, so far we disable the SMM for resource
> consideration, drop the mmu notification as in this case the
> memory is pinned.
>
> V1->V2:
> * Rebase the code to kernel version 5.9.0-rc1.
>
> Yulei Zhang (9):
> Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> support
> Introduce page table population function for direct build EPT feature
> Introduce page table remove function for direct build EPT feature
> Add release function for direct build ept when guest VM exit
> Modify the page fault path to meet the direct build EPT requirement
> Apply the direct build EPT according to the memory slots change
> Add migration support when using direct build EPT
> Introduce kvm module parameter global_tdp to turn on the direct build
> EPT mode
> Handle certain mmu exposed functions properly while turn on direct
> build EPT mode
>
> arch/mips/kvm/mips.c | 13 +
> arch/powerpc/kvm/powerpc.c | 13 +
> arch/s390/kvm/kvm-s390.c | 13 +
> arch/x86/include/asm/kvm_host.h | 13 +-
> arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 7 +-
> arch/x86/kvm/x86.c | 55 ++--
> include/linux/kvm_host.h | 7 +-
> virt/kvm/kvm_main.c | 43 ++-
> 10 files changed, 639 insertions(+), 60 deletions(-)
>
> --
> 2.17.1
>
Any comments? Paolo! :)
On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <[email protected]> wrote:
>
> Any comments? guys!
> On Tue, 1 Sep 2020 at 19:52, <[email protected]> wrote:
> >
> > From: Yulei Zhang <[email protected]>
> >
> > Currently in KVM memory virtulization we relay on mmu_lock to
> > synchronize the memory mapping update, which make vCPUs work
> > in serialize mode and slow down the execution, especially after
> > migration to do substantial memory mapping will cause visible
> > performance drop, and it can get worse if guest has more vCPU
> > numbers and memories.
> >
> > The idea we present in this patch set is to mitigate the issue
> > with pre-constructed memory mapping table. We will fast pin the
> > guest memory to build up a global memory mapping table according
> > to the guest memslots changes and apply it to cr3, so that after
> > guest starts up all the vCPUs would be able to update the memory
> > simultaneously without page fault exception, thus the performance
> > improvement is expected.
> >
> > We use memory dirty pattern workload to test the initial patch
> > set and get positive result even with huge page enabled. For example,
> > we create guest with 32 vCPUs and 64G memories, and let the vcpus
> > dirty the entire memory region concurrently, as the initial patch
> > eliminate the overhead of mmu_lock, in 2M/1G huge page mode we would
> > get the job done in about 50% faster.
> >
> > We only validate this feature on Intel x86 platform. And as Ben
> > pointed out in RFC V1, so far we disable the SMM for resource
> > consideration, drop the mmu notification as in this case the
> > memory is pinned.
> >
> > V1->V2:
> > * Rebase the code to kernel version 5.9.0-rc1.
> >
> > Yulei Zhang (9):
> > Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> > support
> > Introduce page table population function for direct build EPT feature
> > Introduce page table remove function for direct build EPT feature
> > Add release function for direct build ept when guest VM exit
> > Modify the page fault path to meet the direct build EPT requirement
> > Apply the direct build EPT according to the memory slots change
> > Add migration support when using direct build EPT
> > Introduce kvm module parameter global_tdp to turn on the direct build
> > EPT mode
> > Handle certain mmu exposed functions properly while turn on direct
> > build EPT mode
> >
> > arch/mips/kvm/mips.c | 13 +
> > arch/powerpc/kvm/powerpc.c | 13 +
> > arch/s390/kvm/kvm-s390.c | 13 +
> > arch/x86/include/asm/kvm_host.h | 13 +-
> > arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
> > arch/x86/kvm/svm/svm.c | 2 +-
> > arch/x86/kvm/vmx/vmx.c | 7 +-
> > arch/x86/kvm/x86.c | 55 ++--
> > include/linux/kvm_host.h | 7 +-
> > virt/kvm/kvm_main.c | 43 ++-
> > 10 files changed, 639 insertions(+), 60 deletions(-)
> >
> > --
> > 2.17.1
> >
On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <[email protected]> wrote:
>
> Any comments? Paolo! :)
Hi, sorry to be so late in replying! I wanted to post the first part
of the TDP MMU series I've been working on before responding so we
could discuss the two together, but I haven't been able to get it out
as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
it will ultimately help address some of the page fault handling and
lock contention issues you're addressing with these patches. I'd also
be happy to work together to add a prepopulation feature to it. I'll
put in some more comments inline below.
> On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <[email protected]> wrote:
> >
> > Any comments? guys!
> > On Tue, 1 Sep 2020 at 19:52, <[email protected]> wrote:
> > >
> > > From: Yulei Zhang <[email protected]>
> > >
> > > Currently in KVM memory virtulization we relay on mmu_lock to
> > > synchronize the memory mapping update, which make vCPUs work
> > > in serialize mode and slow down the execution, especially after
> > > migration to do substantial memory mapping will cause visible
> > > performance drop, and it can get worse if guest has more vCPU
> > > numbers and memories.
> > >
> > > The idea we present in this patch set is to mitigate the issue
> > > with pre-constructed memory mapping table. We will fast pin the
> > > guest memory to build up a global memory mapping table according
> > > to the guest memslots changes and apply it to cr3, so that after
> > > guest starts up all the vCPUs would be able to update the memory
> > > simultaneously without page fault exception, thus the performance
> > > improvement is expected.
My understanding from this RFC is that your primary goal is to
eliminate page fault latencies and lock contention arising from the
first page faults incurred by vCPUs when initially populating the EPT.
Is that right?
I have the impression that the pinning and generally static memory
mappings are more a convenient simplification than part of a larger
goal to avoid incurring page faults down the line. Is that correct?
I ask because I didn't fully understand, from our conversation on v1
of this RFC, why reimplementing the page fault handler and associated
functions was necessary for the above goals, as I understood them.
My impression of the prepopulation approach is that, KVM will
sequentially populate all the EPT entries to map guest memory. I
understand how this could be optimized to be quite efficient, but I
don't understand how it would scale better than the existing
implementation with one vCPU accessing memory.
> > >
> > > We use memory dirty pattern workload to test the initial patch
> > > set and get positive result even with huge page enabled. For example,
> > > we create guest with 32 vCPUs and 64G memories, and let the vcpus
> > > dirty the entire memory region concurrently, as the initial patch
> > > eliminate the overhead of mmu_lock, in 2M/1G huge page mode we would
> > > get the job done in about 50% faster.
In this benchmark did you include the time required to pre-populate
the EPT or just the time required for the vCPUs to dirty memory?
I ask because I'm curious if your priority is to decrease the total
end-to-end time, or you just care about the guest experience, and not
so much the VM startup time.
How does this compare to the case where 1 vCPU reads every page of
memory and then 32 vCPUs concurrently dirty every page?
> > >
> > > We only validate this feature on Intel x86 platform. And as Ben
> > > pointed out in RFC V1, so far we disable the SMM for resource
> > > consideration, drop the mmu notification as in this case the
> > > memory is pinned.
I'm excited to see big MMU changes like this, and I look forward to
combining our needs towards a better MMU for the x86 TDP case. Have
you thought about how you would build SMM and MMU notifier support
onto this patch series? I know that the invalidate range notifiers, at
least, added a lot of non-trivial complexity to the direct MMU
implementation I presented last year.
> > >
> > > V1->V2:
> > > * Rebase the code to kernel version 5.9.0-rc1.
> > >
> > > Yulei Zhang (9):
> > > Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> > > support
> > > Introduce page table population function for direct build EPT feature
> > > Introduce page table remove function for direct build EPT feature
> > > Add release function for direct build ept when guest VM exit
> > > Modify the page fault path to meet the direct build EPT requirement
> > > Apply the direct build EPT according to the memory slots change
> > > Add migration support when using direct build EPT
> > > Introduce kvm module parameter global_tdp to turn on the direct build
> > > EPT mode
> > > Handle certain mmu exposed functions properly while turn on direct
> > > build EPT mode
> > >
> > > arch/mips/kvm/mips.c | 13 +
> > > arch/powerpc/kvm/powerpc.c | 13 +
> > > arch/s390/kvm/kvm-s390.c | 13 +
> > > arch/x86/include/asm/kvm_host.h | 13 +-
> > > arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
> > > arch/x86/kvm/svm/svm.c | 2 +-
> > > arch/x86/kvm/vmx/vmx.c | 7 +-
> > > arch/x86/kvm/x86.c | 55 ++--
> > > include/linux/kvm_host.h | 7 +-
> > > virt/kvm/kvm_main.c | 43 ++-
> > > 10 files changed, 639 insertions(+), 60 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >
On Fri, Sep 25, 2020 at 1:14 AM Ben Gardon <[email protected]> wrote:
>
> On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <[email protected]> wrote:
> >
> > Any comments? Paolo! :)
>
> Hi, sorry to be so late in replying! I wanted to post the first part
> of the TDP MMU series I've been working on before responding so we
> could discuss the two together, but I haven't been able to get it out
> as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
> it will ultimately help address some of the page fault handling and
> lock contention issues you're addressing with these patches. I'd also
> be happy to work together to add a prepopulation feature to it. I'll
> put in some more comments inline below.
>
Thanks for the feedback and looking forward to your patchset.
> > On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <[email protected]> wrote:
> > >
> > > Any comments? guys!
> > > On Tue, 1 Sep 2020 at 19:52, <[email protected]> wrote:
> > > >
> > > > From: Yulei Zhang <[email protected]>
> > > >
> > > > Currently in KVM memory virtulization we relay on mmu_lock to
> > > > synchronize the memory mapping update, which make vCPUs work
> > > > in serialize mode and slow down the execution, especially after
> > > > migration to do substantial memory mapping will cause visible
> > > > performance drop, and it can get worse if guest has more vCPU
> > > > numbers and memories.
> > > >
> > > > The idea we present in this patch set is to mitigate the issue
> > > > with pre-constructed memory mapping table. We will fast pin the
> > > > guest memory to build up a global memory mapping table according
> > > > to the guest memslots changes and apply it to cr3, so that after
> > > > guest starts up all the vCPUs would be able to update the memory
> > > > simultaneously without page fault exception, thus the performance
> > > > improvement is expected.
>
> My understanding from this RFC is that your primary goal is to
> eliminate page fault latencies and lock contention arising from the
> first page faults incurred by vCPUs when initially populating the EPT.
> Is that right?
>
That's right.
> I have the impression that the pinning and generally static memory
> mappings are more a convenient simplification than part of a larger
> goal to avoid incurring page faults down the line. Is that correct?
>
> I ask because I didn't fully understand, from our conversation on v1
> of this RFC, why reimplementing the page fault handler and associated
> functions was necessary for the above goals, as I understood them.
> My impression of the prepopulation approach is that, KVM will
> sequentially populate all the EPT entries to map guest memory. I
> understand how this could be optimized to be quite efficient, but I
> don't understand how it would scale better than the existing
> implementation with one vCPU accessing memory.
>
I don't think our goal is simply to eliminate page faults. Our target
scenario is live migration: when the workload resumes on the
destination VM after migration, it kicks off the vCPUs to build the
gfn-to-pfn mappings, but the mmu_lock forces the vCPUs to execute
sequentially, which significantly slows down workload execution in the
VM and hurts the end-user experience, especially for memory-sensitive
workloads. Pre-populating the EPT entries solves the problem smoothly,
as it allows the vCPUs to execute in parallel after migration.
> > > >
> > > > We use memory dirty pattern workload to test the initial patch
> > > > set and get positive result even with huge page enabled. For example,
> > > > we create guest with 32 vCPUs and 64G memories, and let the vcpus
> > > > dirty the entire memory region concurrently, as the initial patch
> > > > eliminate the overhead of mmu_lock, in 2M/1G huge page mode we would
> > > > get the job done in about 50% faster.
>
> In this benchmark did you include the time required to pre-populate
> the EPT or just the time required for the vCPUs to dirty memory?
> I ask because I'm curious if your priority is to decrease the total
> end-to-end time, or you just care about the guest experience, and not
> so much the VM startup time.
We compare the time it takes each vCPU thread to finish the dirty job. Yes,
the page table pre-population does take some time, but because each vCPU
thread gains a huge advantage from the concurrent dirty writes, the result
is still better even when that time is counted in the total.
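For reference, the per-thread dirty job we time is roughly the following
(illustrative only, not the exact harness; each of the 32 threads runs on
its own vCPU inside the guest and touches a disjoint 2G chunk of the 64G
region):

/* Illustrative per-vCPU dirty workload run inside the guest. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 32
#define CHUNK    (2ULL << 30)	/* 64G total split across 32 threads */
#define PAGE_SZ  4096

static void *dirty_job(void *arg)
{
	char *base = arg;
	struct timespec t0, t1;
	uint64_t off;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < CHUNK; off += PAGE_SZ)
		base[off] = 1;			/* dirty one byte per page */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("dirty job took %ld ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1000 +
	       (t1.tv_nsec - t0.tv_nsec) / 1000000);
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, dirty_job, malloc(CHUNK));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}

With the pre-constructed EPT those writes take no EPT-violation exits,
which is where the improvement quoted above comes from.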
> How does this compare to the case where 1 vCPU reads every page of
> memory and then 32 vCPUs concurrently dirty every page?
>
We haven't tried this yet; I think the major difference would be the page
fault latency introduced by the single vCPU doing the reads.
> > > >
> > > > We only validate this feature on Intel x86 platform. And as Ben
> > > > pointed out in RFC V1, so far we disable the SMM for resource
> > > > consideration, drop the mmu notification as in this case the
> > > > memory is pinned.
>
> I'm excited to see big MMU changes like this, and I look forward to
> combining our needs towards a better MMU for the x86 TDP case. Have
> you thought about how you would build SMM and MMU notifier support
> onto this patch series? I know that the invalidate range notifiers, at
> least, added a lot of non-trivial complexity to the direct MMU
> implementation I presented last year.
>
Thanks for the suggestion, I will think about it.
> > > >
> > > > V1->V2:
> > > > * Rebase the code to kernel version 5.9.0-rc1.
> > > >
> > > > Yulei Zhang (9):
> > > > Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> > > > support
> > > > Introduce page table population function for direct build EPT feature
> > > > Introduce page table remove function for direct build EPT feature
> > > > Add release function for direct build ept when guest VM exit
> > > > Modify the page fault path to meet the direct build EPT requirement
> > > > Apply the direct build EPT according to the memory slots change
> > > > Add migration support when using direct build EPT
> > > > Introduce kvm module parameter global_tdp to turn on the direct build
> > > > EPT mode
> > > > Handle certain mmu exposed functions properly while turn on direct
> > > > build EPT mode
> > > >
> > > > arch/mips/kvm/mips.c | 13 +
> > > > arch/powerpc/kvm/powerpc.c | 13 +
> > > > arch/s390/kvm/kvm-s390.c | 13 +
> > > > arch/x86/include/asm/kvm_host.h | 13 +-
> > > > arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
> > > > arch/x86/kvm/svm/svm.c | 2 +-
> > > > arch/x86/kvm/vmx/vmx.c | 7 +-
> > > > arch/x86/kvm/x86.c | 55 ++--
> > > > include/linux/kvm_host.h | 7 +-
> > > > virt/kvm/kvm_main.c | 43 ++-
> > > > 10 files changed, 639 insertions(+), 60 deletions(-)
> > > >
> > > > --
> > > > 2.17.1
> > > >
On Fri, Sep 25, 2020 at 5:04 AM yulei zhang <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 1:14 AM Ben Gardon <[email protected]> wrote:
> >
> > On Wed, Sep 23, 2020 at 11:28 PM Wanpeng Li <[email protected]> wrote:
> > >
> > > Any comments? Paolo! :)
> >
> > Hi, sorry to be so late in replying! I wanted to post the first part
> > of the TDP MMU series I've been working on before responding so we
> > could discuss the two together, but I haven't been able to get it out
> > as fast as I would have liked. (I'll send it ASAP!) I'm hopeful that
> > it will ultimately help address some of the page fault handling and
> > lock contention issues you're addressing with these patches. I'd also
> > be happy to work together to add a prepopulation feature to it. I'll
> > put in some more comments inline below.
> >
>
> Thanks for the feedback and looking forward to your patchset.
>
> > > On Wed, 9 Sep 2020 at 11:04, Wanpeng Li <[email protected]> wrote:
> > > >
> > > > Any comments? guys!
> > > > On Tue, 1 Sep 2020 at 19:52, <[email protected]> wrote:
> > > > >
> > > > > From: Yulei Zhang <[email protected]>
> > > > >
> > > > > Currently in KVM memory virtulization we relay on mmu_lock to
> > > > > synchronize the memory mapping update, which make vCPUs work
> > > > > in serialize mode and slow down the execution, especially after
> > > > > migration to do substantial memory mapping will cause visible
> > > > > performance drop, and it can get worse if guest has more vCPU
> > > > > numbers and memories.
> > > > >
> > > > > The idea we present in this patch set is to mitigate the issue
> > > > > with pre-constructed memory mapping table. We will fast pin the
> > > > > guest memory to build up a global memory mapping table according
> > > > > to the guest memslots changes and apply it to cr3, so that after
> > > > > guest starts up all the vCPUs would be able to update the memory
> > > > > simultaneously without page fault exception, thus the performance
> > > > > improvement is expected.
> >
> > My understanding from this RFC is that your primary goal is to
> > eliminate page fault latencies and lock contention arising from the
> > first page faults incurred by vCPUs when initially populating the EPT.
> > Is that right?
> >
>
> That's right.
>
> > I have the impression that the pinning and generally static memory
> > mappings are more a convenient simplification than part of a larger
> > goal to avoid incurring page faults down the line. Is that correct?
> >
> > I ask because I didn't fully understand, from our conversation on v1
> > of this RFC, why reimplementing the page fault handler and associated
> > functions was necessary for the above goals, as I understood them.
> > My impression of the prepopulation approach is that, KVM will
> > sequentially populate all the EPT entries to map guest memory. I
> > understand how this could be optimized to be quite efficient, but I
> > don't understand how it would scale better than the existing
> > implementation with one vCPU accessing memory.
> >
>
> Our goal is not simply to eliminate the page fault. The target scenario
> is live migration: when the workload resumes on the destination VM after
> migrating, the vCPUs start rebuilding the gfn-to-pfn mappings, but the
> mmu_lock forces them to execute sequentially, which significantly slows
> down the workload in the VM and hurts the end-user experience, especially
> for memory-sensitive workloads. Pre-populating the EPT entries solves the
> problem smoothly, as it allows the vCPUs to execute in parallel after
> migration.
Oh, thank you for explaining that. I didn't realize the goal here was
to improve LM performance. I was under the impression that this was to
give VMs a better experience on startup for fast scaling or something.
In your testing with live migration how has this affected the
distribution of time between the phases of live migration? Just for
terminology (since I'm not sure how standard it is across the
industry) I think of a live migration as consisting of 3 stages:
precopy, blackout, and postcopy. In precopy we're tracking the VM's
working set via dirty logging and sending the contents of its memory
to the target host. In blackout we pause the vCPUs on the source, copy
minimal data to the target, and resume the vCPUs on the target. In
postcopy we may still have some pages that have not been copied to the
target and so request those in response to vCPU page faults via user
fault fd or some other mechanism.
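For concreteness, the on-demand path in that last phase usually looks
something like this in the VMM. This is just a minimal sketch on top of
the stock userfaultfd ioctls; guest_mem, mem_len and page_from_source are
placeholders:

/* Minimal sketch of postcopy on-demand paging via userfaultfd. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd_setup(void *guest_mem, size_t mem_len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)guest_mem, .len = mem_len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	return uffd;
}

/* Resolve one fault with a page already fetched from the source host. */
static void serve_one_fault(int uffd, void *page_from_source, size_t page_size)
{
	struct uffd_msg msg;
	struct uffdio_copy copy;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
	    msg.event != UFFD_EVENT_PAGEFAULT)
		return;

	copy.dst  = msg.arg.pagefault.address & ~(page_size - 1);
	copy.src  = (unsigned long)page_from_source;
	copy.len  = page_size;
	copy.mode = 0;
	ioctl(uffd, UFFDIO_COPY, &copy);
}

In postcopy, every vCPU fault on a not-yet-copied page is resolved through
this path rather than in the kernel.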
Does EPT pre-population preclude the use of a postcopy phase? I would
expect that to make the blackout phase really long. Has that not been
a problem for you?
I love the idea of partial EPT pre-population during precopy if you
could still handle postcopy and just pre-populate as memory came in.
>
> > > > >
> > > > > We use memory dirty pattern workload to test the initial patch
> > > > > set and get positive result even with huge page enabled. For example,
> > > > > we create guest with 32 vCPUs and 64G memories, and let the vcpus
> > > > > dirty the entire memory region concurrently, as the initial patch
> > > > > eliminate the overhead of mmu_lock, in 2M/1G huge page mode we would
> > > > > get the job done in about 50% faster.
> >
> > In this benchmark did you include the time required to pre-populate
> > the EPT or just the time required for the vCPUs to dirty memory?
> > I ask because I'm curious if your priority is to decrease the total
> > end-to-end time, or you just care about the guest experience, and not
> > so much the VM startup time.
>
> We compare the time it takes each vCPU thread to finish the dirty job. Yes,
> the page table pre-population does take some time, but because each vCPU
> thread gains a huge advantage from the concurrent dirty writes, the result
> is still better even when that time is counted in the total.
That makes sense to me. Your implementation definitely seems more
efficient than the existing PF handling path. It's probably much
easier to parallelize as a sort of recursive population operation too.
>
> > How does this compare to the case where 1 vCPU reads every page of
> > memory and then 32 vCPUs concurrently dirty every page?
> >
>
> We haven't tried this yet; I think the major difference would be the page
> fault latency introduced by the single vCPU doing the reads.
I agree. The whole VM exit path adds a lot of overhead. I wonder what
kind of numbers you'd get if you cranked PTE_PREFETCH_NUM way up
though. If you set that to >= your memory size, one PF could
pre-populate the entire EPT. It's a silly approach, but it would be a
lot more efficient as an easy POC.
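For anyone who wants to try that POC: the knob lives in
arch/x86/kvm/mmu/mmu.c. From memory (so treat this as a sketch rather than
a verbatim copy), the 5.9 direct-map prefetch loop is roughly the
following, with direct_pte_prefetch_one() standing in for the real fill
path; note the window is aligned inside the leaf page table that contains
the faulting spte:

/* arch/x86/kvm/mmu/mmu.c, 5.9-era, simplified from memory. */
#define PTE_PREFETCH_NUM 8

static void direct_pte_prefetch_window(struct kvm_vcpu *vcpu,
				       struct kvm_mmu_page *sp, u64 *sptep)
{
	/* Align to a PTE_PREFETCH_NUM-sized window in this leaf table. */
	int i = (sptep - sp->spt) & ~(PTE_PREFETCH_NUM - 1);
	u64 *spte = sp->spt + i;

	for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
		if (is_shadow_present_pte(*spte) || spte == sptep)
			continue;
		/* fill the neighbouring spte via the usual gfn->pfn path */
		direct_pte_prefetch_one(vcpu, sp, spte);  /* stand-in name */
	}
}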
>
> > > > >
> > > > > We only validate this feature on Intel x86 platform. And as Ben
> > > > > pointed out in RFC V1, so far we disable the SMM for resource
> > > > > consideration, drop the mmu notification as in this case the
> > > > > memory is pinned.
> >
> > I'm excited to see big MMU changes like this, and I look forward to
> > combining our needs towards a better MMU for the x86 TDP case. Have
> > you thought about how you would build SMM and MMU notifier support
> > onto this patch series? I know that the invalidate range notifiers, at
> > least, added a lot of non-trivial complexity to the direct MMU
> > implementation I presented last year.
> >
>
> Thanks for the suggestion, I will think about it.
>
> > > > >
> > > > > V1->V2:
> > > > > * Rebase the code to kernel version 5.9.0-rc1.
> > > > >
> > > > > Yulei Zhang (9):
> > > > > Introduce new fields in kvm_arch/vcpu_arch struct for direct build EPT
> > > > > support
> > > > > Introduce page table population function for direct build EPT feature
> > > > > Introduce page table remove function for direct build EPT feature
> > > > > Add release function for direct build ept when guest VM exit
> > > > > Modify the page fault path to meet the direct build EPT requirement
> > > > > Apply the direct build EPT according to the memory slots change
> > > > > Add migration support when using direct build EPT
> > > > > Introduce kvm module parameter global_tdp to turn on the direct build
> > > > > EPT mode
> > > > > Handle certain mmu exposed functions properly while turn on direct
> > > > > build EPT mode
> > > > >
> > > > > arch/mips/kvm/mips.c | 13 +
> > > > > arch/powerpc/kvm/powerpc.c | 13 +
> > > > > arch/s390/kvm/kvm-s390.c | 13 +
> > > > > arch/x86/include/asm/kvm_host.h | 13 +-
> > > > > arch/x86/kvm/mmu/mmu.c | 533 ++++++++++++++++++++++++++++++--
> > > > > arch/x86/kvm/svm/svm.c | 2 +-
> > > > > arch/x86/kvm/vmx/vmx.c | 7 +-
> > > > > arch/x86/kvm/x86.c | 55 ++--
> > > > > include/linux/kvm_host.h | 7 +-
> > > > > virt/kvm/kvm_main.c | 43 ++-
> > > > > 10 files changed, 639 insertions(+), 60 deletions(-)
> > > > >
> > > > > --
> > > > > 2.17.1
> > > > >
On 25/09/20 19:30, Ben Gardon wrote:
> Oh, thank you for explaining that. I didn't realize the goal here was
> to improve LM performance. I was under the impression that this was to
> give VMs a better experience on startup for fast scaling or something.
> In your testing with live migration how has this affected the
> distribution of time between the phases of live migration? Just for
> terminology (since I'm not sure how standard it is across the
> industry) I think of a live migration as consisting of 3 stages:
> precopy, blackout, and postcopy. In precopy we're tracking the VM's
> working set via dirty logging and sending the contents of its memory
> to the target host. In blackout we pause the vCPUs on the source, copy
> minimal data to the target, and resume the vCPUs on the target. In
> postcopy we may still have some pages that have not been copied to the
> target and so request those in response to vCPU page faults via user
> fault fd or some other mechanism.
>
> Does EPT pre-population preclude the use of a postcopy phase?
I think so.
As a quick recap, postcopy migration handles two kinds of
pages---they can be copied to the destination either in background
(stuff that was dirty when userspace decided to transition to the
blackout phase) or on-demand (relayed from KVM to userspace via
get_user_pages and userfaultfd). Normally only on-demand pages would be
served through userfaultfd, while with prepopulation every missing page
would be faulted in from the kernel through userfaultfd. In practice
this would just extend the blackout phase.
Paolo
> I would
> expect that to make the blackout phase really long. Has that not been
> a problem for you?
>
> I love the idea of partial EPT pre-population during precopy if you
> could still handle postcopy and just pre-populate as memory came in.
>
On Sat, Sep 26, 2020 at 4:50 AM Paolo Bonzini <[email protected]> wrote:
>
> On 25/09/20 19:30, Ben Gardon wrote:
> > Oh, thank you for explaining that. I didn't realize the goal here was
> > to improve LM performance. I was under the impression that this was to
> > give VMs a better experience on startup for fast scaling or something.
> > In your testing with live migration how has this affected the
> > distribution of time between the phases of live migration? Just for
> > terminology (since I'm not sure how standard it is across the
> > industry) I think of a live migration as consisting of 3 stages:
> > precopy, blackout, and postcopy. In precopy we're tracking the VM's
> > working set via dirty logging and sending the contents of its memory
> > to the target host. In blackout we pause the vCPUs on the source, copy
> > minimal data to the target, and resume the vCPUs on the target. In
> > postcopy we may still have some pages that have not been copied to the
> > target and so request those in response to vCPU page faults via user
> > fault fd or some other mechanism.
> >
> > Does EPT pre-population preclude the use of a postcopy phase?
>
> I think so.
>
> As a quick recap, postcopy migration handles two kinds of
> pages---they can be copied to the destination either in background
> (stuff that was dirty when userspace decided to transition to the
> blackout phase) or on-demand (relayed from KVM to userspace via
> get_user_pages and userfaultfd). Normally only on-demand pages would be
> served through userfaultfd, while with prepopulation every missing page
> would be faulted in from the kernel through userfaultfd. In practice
> this would just extend the blackout phase.
>
> Paolo
>
Yep, you are right: the current implementation doesn't support postcopy.
Thanks for the suggestion; we will try to fill the gap with proper EPT
population during the postcopy phase.
> > I would
> > expect that to make the blackout phase really long. Has that not been
> > a problem for you?
> >
> > I love the idea of partial EPT pre-population during precopy if you
> > could still handle postcopy and just pre-populate as memory came in.
> >
>