Testing:
Ran KVM selftests and kvm-unit-tests on an Intel Skylake. This
series introduced no new failures.
Performance:
To collect these results I needed to apply Mingwei's patch
"selftests: KVM: align guest physical memory base address to 1GB"
https://lkml.org/lkml/2021/8/29/310
David Matlack is going to send out an updated version of that patch soon.
Without this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 10.966500447s
Enabling dirty logging time: 0.002068737s
Iteration 1 dirty memory time: 0.047556280s
Iteration 1 get dirty log time: 0.001253914s
Iteration 1 clear dirty log time: 0.049716661s
Iteration 2 dirty memory time: 3.679662016s
Iteration 2 get dirty log time: 0.000659546s
Iteration 2 clear dirty log time: 1.834329322s
Disabling dirty logging time: 45.738439510s
Get dirty log over 2 iterations took 0.001913460s. (Avg 0.000956730s/iteration)
Clear dirty log over 2 iterations took 1.884045983s. (Avg 0.942022991s/iteration)
Without this series, Legacy MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.664750666s
Enabling dirty logging time: 0.002025510s
Iteration 1 dirty memory time: 0.046240875s
Iteration 1 get dirty log time: 0.001864342s
Iteration 1 clear dirty log time: 0.170243637s
Iteration 2 dirty memory time: 31.571088701s
Iteration 2 get dirty log time: 0.000626245s
Iteration 2 clear dirty log time: 1.294817729s
Disabling dirty logging time: 3.566831573s
Get dirty log over 2 iterations took 0.002490587s. (Avg 0.001245293s/iteration)
Clear dirty log over 2 iterations took 1.465061366s. (Avg 0.732530683s/iteration)
With this series, TDP MMU:
(Updated since the RFC; dropping the RFC's patches 1-4 could have had a performance impact.)
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 12.225242366s
Enabling dirty logging time: 0.002063442s
Iteration 1 dirty memory time: 0.047598123s
Iteration 1 get dirty log time: 0.001247702s
Iteration 1 clear dirty log time: 0.051062420s
Iteration 2 dirty memory time: 3.660439803s
Iteration 2 get dirty log time: 0.000736229s
Iteration 2 clear dirty log time: 1.043469951s
Disabling dirty logging time: 1.400549627s
Get dirty log over 2 iterations took 0.001983931s. (Avg 0.000991965s/iteration)
Clear dirty log over 2 iterations took 1.094532371s. (Avg 0.547266185s/iteration)
Patch breakdown:
Patch 1 eliminates extra TLB flushes while disabling dirty logging.
Patches 2-8 remove the need for a vCPU pointer to make_spte (the resulting
interface is sketched below).
Patches 9-14 are small refactors in preparation for patch 15.
Patch 15 implements in-place largepage promotion when disabling dirty logging.
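For orientation, the make_spte interface this series converges on looks
roughly as follows; this is assembled from the diffs below rather than
copied from a single patch, with the vCPU-derived arguments moving into
the vcpu_make_spte wrapper:

bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
	       const struct kvm_memory_slot *slot, unsigned int pte_access,
	       gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
	       bool can_unsync, bool host_writable, bool ad_need_write_protect,
	       u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
	       u64 *new_spte);

bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
		    struct kvm_memory_slot *slot, unsigned int pte_access,
		    gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
		    bool can_unsync, bool host_writable, u64 *new_spte);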
Changelog:
RFC -> v1:
Dropped the first 4 patches from the series. Patch 1 was sent
separately, patches 2-4 will be taken over by Sean Christopherson.
Incorporated David Matlack's Reviewed-by.
Ben Gardon (15):
KVM: x86/mmu: Remove redundant flushes when disabling dirty logging
KVM: x86/mmu: Introduce vcpu_make_spte
KVM: x86/mmu: Factor wrprot for nested PML out of make_spte
KVM: x86/mmu: Factor mt_mask out of make_spte
KVM: x86/mmu: Remove need for a vcpu from
kvm_slot_page_track_is_active
KVM: x86/mmu: Remove need for a vcpu from mmu_try_to_unsync_pages
KVM: x86/mmu: Factor shadow_zero_check out of make_spte
KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
KVM: x86/mmu: Propagate memslot const qualifier
KVM: x86/MMU: Refactor vmx_get_mt_mask
KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
on vcpu
KVM: x86/mmu: Add try_get_mt_mask to x86_ops
KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
KVM: x86/mmu: Promote pages in-place when disabling dirty logging
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/asm/kvm_page_track.h | 6 +-
arch/x86/kvm/mmu/mmu.c | 45 +++++++------
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
arch/x86/kvm/mmu/page_track.c | 8 +--
arch/x86/kvm/mmu/paging_tmpl.h | 6 +-
arch/x86/kvm/mmu/spte.c | 43 ++++++++----
arch/x86/kvm/mmu/spte.h | 17 +++--
arch/x86/kvm/mmu/tdp_mmu.c | 97 +++++++++++++++++++++------
arch/x86/kvm/mmu/tdp_mmu.h | 5 +-
arch/x86/kvm/svm/svm.c | 8 +++
arch/x86/kvm/vmx/vmx.c | 40 ++++++-----
include/linux/kvm_host.h | 10 +--
virt/kvm/kvm_main.c | 12 ++--
15 files changed, 205 insertions(+), 101 deletions(-)
--
2.34.0.rc1.387.gb447b232ab-goog
tdp_mmu_zap_spte_atomic flushes on every zap already, so no need to
flush again after it's done.
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 4 +---
arch/x86/kvm/mmu/tdp_mmu.c | 21 ++++++---------------
arch/x86/kvm/mmu/tdp_mmu.h | 5 ++---
3 files changed, 9 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 354d2ca92df4..baa94acab516 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5870,9 +5870,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
- flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
- if (flush)
- kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
+ kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
read_unlock(&kvm->mmu_lock);
}
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7c5dd83e52de..b3c78568ae60 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1364,10 +1364,9 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
* Clear leaf entries which could be replaced by large mappings, for
* GFNs within the slot.
*/
-static bool zap_collapsible_spte_range(struct kvm *kvm,
+static void zap_collapsible_spte_range(struct kvm *kvm,
struct kvm_mmu_page *root,
- const struct kvm_memory_slot *slot,
- bool flush)
+ const struct kvm_memory_slot *slot)
{
gfn_t start = slot->base_gfn;
gfn_t end = start + slot->npages;
@@ -1378,10 +1377,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
tdp_root_for_each_pte(iter, root, start, end) {
retry:
- if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
- flush = false;
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
- }
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1401,30 +1398,24 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
goto retry;
}
- flush = true;
}
rcu_read_unlock();
-
- return flush;
}
/*
* Clear non-leaf entries (and free associated page tables) which could
* be replaced by large mappings, for GFNs within the slot.
*/
-bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- bool flush)
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
{
struct kvm_mmu_page *root;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
- flush = zap_collapsible_spte_range(kvm, root, slot, flush);
-
- return flush;
+ zap_collapsible_spte_range(kvm, root, slot);
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 476b133544dd..3899004a5d91 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -64,9 +64,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn, unsigned long mask,
bool wrprot);
-bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
- const struct kvm_memory_slot *slot,
- bool flush);
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
--
2.34.0.rc1.387.gb447b232ab-goog
Add a wrapper around make_spte which supplies the vCPU-specific context to
the function. This will facilitate factoring all uses of the vCPU pointer
out of make_spte in subsequent commits.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 6 +++---
arch/x86/kvm/mmu/spte.c | 17 +++++++++++++----
arch/x86/kvm/mmu/spte.h | 12 ++++++++----
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
5 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index baa94acab516..2ada6dee920a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2723,7 +2723,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
was_rmapped = 1;
}
- wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ wrprot = vcpu_make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
true, host_writable, &spte);
if (*sptep == spte) {
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f87d36898c44..edb8ebd1a775 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1129,9 +1129,9 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- make_spte(vcpu, sp, slot, pte_access, gfn,
- spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ vcpu_make_spte(vcpu, sp, slot, pte_access, gfn,
+ spte_to_pfn(spte), spte, true, false,
+ host_writable, &spte);
flush |= mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 0c76c45fdb68..04d26e913941 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,10 +90,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
}
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot,
- unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte)
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -191,6 +190,16 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
return wrprot;
}
+bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte)
+{
+ return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+ prefetch, can_unsync, host_writable, new_spte);
+
+}
+
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
{
u64 spte = SPTE_MMU_PRESENT_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cc432f9a966b..14f18082d505 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -330,10 +330,14 @@ static inline u64 get_mmio_spte_generation(u64 spte)
}
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot,
- unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
- u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte);
+ struct kvm_memory_slot *slot, unsigned int pte_access,
+ gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
+ bool can_unsync, bool host_writable, u64 *new_spte);
+bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct kvm_memory_slot *slot,
+ unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+ u64 old_spte, bool prefetch, bool can_unsync,
+ bool host_writable, u64 *new_spte);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
u64 mark_spte_for_access_track(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b3c78568ae60..43c7834b4f0a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -906,9 +906,10 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
- wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ wrprot = vcpu_make_spte(vcpu, sp, fault->slot, ACC_ALL,
+ iter->gfn, fault->pfn, iter->old_spte,
+ fault->prefetch, true,
+ fault->map_writable, &new_spte);
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
--
2.34.0.rc1.387.gb447b232ab-goog
When running a nested VM, KVM write protects SPTEs in the EPT/NPT02
instead of using PML for dirty tracking. This avoids expensive
translation later, when emptying the Page Modification Log. In service
of removing the vCPU pointer from make_spte, factor the check for nested
PML out of the function.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 10 +++++++---
arch/x86/kvm/mmu/spte.h | 3 ++-
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 04d26e913941..3cf08a534a16 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -92,7 +92,8 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
- bool can_unsync, bool host_writable, u64 *new_spte)
+ bool can_unsync, bool host_writable, bool ad_need_write_protect,
+ u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -100,7 +101,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (sp->role.ad_disabled)
spte |= SPTE_TDP_AD_DISABLED_MASK;
- else if (kvm_vcpu_ad_need_write_protect(vcpu))
+ else if (ad_need_write_protect)
spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK;
/*
@@ -195,8 +196,11 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, u64 *new_spte)
{
+ bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
+
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
- prefetch, can_unsync, host_writable, new_spte);
+ prefetch, can_unsync, host_writable,
+ ad_need_write_protect, new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 14f18082d505..bcf58602f224 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -332,7 +332,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
- bool can_unsync, bool host_writable, u64 *new_spte);
+ bool can_unsync, bool host_writable, bool ad_need_write_protect,
+ u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc1.387.gb447b232ab-goog
In service of removing the vCPU pointer from make_spte, factor the memory
type mask calculation out of make_spte.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 9 +++++----
arch/x86/kvm/mmu/spte.h | 2 +-
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 3cf08a534a16..75c666d3e7f1 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 *new_spte)
+ u64 mt_mask, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -130,8 +130,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
if (tdp_enabled)
- spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
- kvm_is_mmio_pfn(pfn));
+ spte |= mt_mask;
if (host_writable)
spte |= shadow_host_writable_mask;
@@ -197,10 +196,12 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
bool can_unsync, bool host_writable, u64 *new_spte)
{
bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
+ u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
+ kvm_is_mmio_pfn(pfn));
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
- ad_need_write_protect, new_spte);
+ ad_need_write_protect, mt_mask, new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bcf58602f224..e739f2ebf844 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -333,7 +333,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 *new_spte);
+ u64 mt_mask, u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc1.387.gb447b232ab-goog
kvm_slot_page_track_is_active only uses its vCPU argument to get a
pointer to the associated struct kvm, so just pass in the struct kvm to
remove the need for a vCPU pointer.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 4 ++--
arch/x86/kvm/mmu/page_track.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index 9d4a3b1b25b9..e99a30a4d38b 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -63,7 +63,7 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
void kvm_slot_page_track_remove_page(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
-bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
+bool kvm_slot_page_track_is_active(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2ada6dee920a..7d0da79668c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* track machinery is used to write-protect upper-level shadow pages,
* i.e. this guards the role.level == 4K assertion below!
*/
- if (kvm_slot_page_track_is_active(vcpu, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
return -EPERM;
/*
@@ -3884,7 +3884,7 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
* guest is writing the page which is write tracked which can
* not be fixed by page fault handler.
*/
- if (kvm_slot_page_track_is_active(vcpu, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
return true;
return false;
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index cc4eb5b7fb76..35c221d5f6ce 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -173,7 +173,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
/*
* check if the corresponding access on the specified guest page is tracked.
*/
-bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
+bool kvm_slot_page_track_is_active(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode)
{
@@ -186,7 +186,7 @@ bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
return false;
if (mode == KVM_PAGE_TRACK_WRITE &&
- !kvm_page_track_write_tracking_enabled(vcpu->kvm))
+ !kvm_page_track_write_tracking_enabled(kvm))
return false;
index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
--
2.34.0.rc1.387.gb447b232ab-goog
The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
pointer to the associated struct kvm, so pass in the kvm pointer from
the beginning to remove the need for a vCPU when calling the function.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 16 ++++++++--------
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/spte.c | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7d0da79668c0..1e890509b93f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2561,10 +2561,10 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
return r;
}
-static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
trace_kvm_mmu_unsync_page(sp);
- ++vcpu->kvm->stat.mmu_unsync;
+ ++kvm->stat.mmu_unsync;
sp->unsync = 1;
kvm_mmu_mark_parents_unsync(sp);
@@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
* were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
* be write-protected.
*/
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch)
{
struct kvm_mmu_page *sp;
@@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* track machinery is used to write-protect upper-level shadow pages,
* i.e. this guards the role.level == 4K assertion below!
*/
- if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
+ if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
return -EPERM;
/*
@@ -2596,7 +2596,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
* that case, KVM must complete emulation of the guest TLB flush before
* allowing shadow pages to become unsync (writable by the guest).
*/
- for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+ for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
if (!can_unsync)
return -EPERM;
@@ -2615,7 +2615,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
*/
if (!locked) {
locked = true;
- spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
+ spin_lock(&kvm->arch.mmu_unsync_pages_lock);
/*
* Recheck after taking the spinlock, a different vCPU
@@ -2630,10 +2630,10 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}
WARN_ON(sp->role.level != PG_LEVEL_4K);
- kvm_unsync_page(vcpu, sp);
+ kvm_unsync_page(kvm, sp);
}
if (locked)
- spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
+ spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
/*
* We need to ensure that the marking of unsync pages is visible
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 52c6527b1a06..1073d10cce91 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
kvm_x86_ops.cpu_dirty_log_size;
}
-int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch);
void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c666d3e7f1..b7271daa06c5 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -160,7 +160,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
- if (mmu_try_to_unsync_pages(vcpu, slot, gfn, can_unsync, prefetch)) {
+ if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
wrprot = true;
--
2.34.0.rc1.387.gb447b232ab-goog
In the interest of developing a version of make_spte that can function
without a vCPU pointer, factor out the shadow_zero_check to be an
additional argument to the function.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 11 +++++++----
arch/x86/kvm/mmu/spte.h | 3 ++-
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b7271daa06c5..d3b059e96c6e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -93,7 +93,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 mt_mask, u64 *new_spte)
+ u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+ u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -176,9 +177,9 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if (prefetch)
spte = mark_spte_for_access_track(spte);
- WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ WARN_ONCE(is_rsvd_spte(shadow_zero_check, spte, level),
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
- get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+ get_rsvd_bits(shadow_zero_check, spte, level));
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
@@ -198,10 +199,12 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
bool ad_need_write_protect = kvm_vcpu_ad_need_write_protect(vcpu);
u64 mt_mask = static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
+ struct rsvd_bits_validate *shadow_zero_check = &vcpu->arch.mmu->shadow_zero_check;
return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
- ad_need_write_protect, mt_mask, new_spte);
+ ad_need_write_protect, mt_mask, shadow_zero_check,
+ new_spte);
}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e739f2ebf844..6134a10487c4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -333,7 +333,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
- u64 mt_mask, u64 *new_spte);
+ u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
+ u64 *new_spte);
bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.34.0.rc1.387.gb447b232ab-goog
In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm_page_track.h | 4 ++--
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/mmu_internal.h | 2 +-
arch/x86/kvm/mmu/page_track.c | 4 ++--
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 2 +-
include/linux/kvm_host.h | 10 +++++-----
virt/kvm/kvm_main.c | 12 ++++++------
8 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index e99a30a4d38b..eb186bc57f6a 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -64,8 +64,8 @@ void kvm_slot_page_track_remove_page(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
enum kvm_page_track_mode mode);
bool kvm_slot_page_track_is_active(struct kvm *kvm,
- struct kvm_memory_slot *slot, gfn_t gfn,
- enum kvm_page_track_mode mode);
+ const struct kvm_memory_slot *slot,
+ gfn_t gfn, enum kvm_page_track_mode mode);
void
kvm_page_track_register_notifier(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fdf0f15ab19d..ef7a84422463 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
* were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
* be write-protected.
*/
-int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch)
{
struct kvm_mmu_page *sp;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1073d10cce91..6563cce9c438 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
kvm_x86_ops.cpu_dirty_log_size;
}
-int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
gfn_t gfn, bool can_unsync, bool prefetch);
void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 35c221d5f6ce..68eb1fb548b6 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -174,8 +174,8 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
* check if the corresponding access on the specified guest page is tracked.
*/
bool kvm_slot_page_track_is_active(struct kvm *kvm,
- struct kvm_memory_slot *slot, gfn_t gfn,
- enum kvm_page_track_mode mode)
+ const struct kvm_memory_slot *slot,
+ gfn_t gfn, enum kvm_page_track_mode mode)
{
int index;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d98723b14cec..7be41d2dbb02 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,7 +90,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
}
bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot, unsigned int pte_access,
+ const struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 5bb055688080..d7598506fbad 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -330,7 +330,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
}
bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
- struct kvm_memory_slot *slot, unsigned int pte_access,
+ const struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 60a35d9fe259..675da38fac7f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -435,7 +435,7 @@ struct kvm_memory_slot {
u16 as_id;
};
-static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
+static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
}
@@ -855,9 +855,9 @@ void kvm_set_page_accessed(struct page *page);
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable, hva_t *hva);
@@ -934,7 +934,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
-void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f6d450355f0..6dbf8cba1900 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2138,12 +2138,12 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
return size;
}
-static bool memslot_is_readonly(struct kvm_memory_slot *slot)
+static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_READONLY;
}
-static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
+static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
gfn_t *nr_pages, bool write)
{
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -2438,7 +2438,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
return pfn;
}
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
bool *writable, hva_t *hva)
{
@@ -2478,13 +2478,13 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
-kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
-kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
+kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
}
@@ -3079,7 +3079,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
EXPORT_SYMBOL_GPL(kvm_clear_guest);
void mark_page_dirty_in_slot(struct kvm *kvm,
- struct kvm_memory_slot *memslot,
+ const struct kvm_memory_slot *memslot,
gfn_t gfn)
{
if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
--
2.34.0.rc1.387.gb447b232ab-goog
Factor out the implementation of reset_tdp_shadow_zero_bits_mask to a
helper function which does not require a vCPU pointer. The only element
of the struct kvm_mmu context used by the function is the shadow root
level, so pass that in too instead of the mmu context.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e890509b93f..fdf0f15ab19d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4450,17 +4450,14 @@ static inline bool boot_cpu_is_amd(void)
* possible, however, kvm currently does not do execution-protection.
*/
static void
-reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context)
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+ int shadow_root_level)
{
- struct rsvd_bits_validate *shadow_zero_check;
int i;
- shadow_zero_check = &context->shadow_zero_check;
-
if (boot_cpu_is_amd())
__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
- context->shadow_root_level, false,
+ shadow_root_level, false,
boot_cpu_has(X86_FEATURE_GBPAGES),
false, true);
else
@@ -4470,12 +4467,20 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
if (!shadow_me_mask)
return;
- for (i = context->shadow_root_level; --i >= 0;) {
+ for (i = shadow_root_level; --i >= 0;) {
shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
}
}
+static void
+reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+ build_tdp_shadow_zero_bits_mask(&context->shadow_zero_check,
+ context->shadow_root_level);
+}
+
/*
* as the comments in reset_shadow_zero_bits_mask() except it
* is the shadow page table for intel nested guest.
--
2.34.0.rc1.387.gb447b232ab-goog
Now that nothing in make_spte actually needs the vCPU argument, just
pass in a pointer to the struct kvm. This allows the function to be used
in situations where there is no relevant struct kvm_vcpu.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 8 ++++----
arch/x86/kvm/mmu/spte.h | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d3b059e96c6e..d98723b14cec 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -89,7 +89,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
E820_TYPE_RAM);
}
-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
@@ -161,7 +161,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* e.g. it's write-tracked (upper-level SPs) or has one or more
* shadow pages and unsync'ing pages is not allowed.
*/
- if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
+ if (mmu_try_to_unsync_pages(kvm, slot, gfn, can_unsync, prefetch)) {
pgprintk("%s: found shadow page for %llx, marking ro\n",
__func__, gfn);
wrprot = true;
@@ -184,7 +184,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON(level > PG_LEVEL_4K);
- mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
+ mark_page_dirty_in_slot(kvm, slot, gfn);
}
*new_spte = spte;
@@ -201,7 +201,7 @@ bool vcpu_make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
kvm_is_mmio_pfn(pfn));
struct rsvd_bits_validate *shadow_zero_check = &vcpu->arch.mmu->shadow_zero_check;
- return make_spte(vcpu, sp, slot, pte_access, gfn, pfn, old_spte,
+ return make_spte(vcpu->kvm, sp, slot, pte_access, gfn, pfn, old_spte,
prefetch, can_unsync, host_writable,
ad_need_write_protect, mt_mask, shadow_zero_check,
new_spte);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6134a10487c4..5bb055688080 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -329,7 +329,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
return gen;
}
-bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
struct kvm_memory_slot *slot, unsigned int pte_access,
gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
bool can_unsync, bool host_writable, bool ad_need_write_protect,
--
2.34.0.rc1.387.gb447b232ab-goog
Remove the gotos from vmx_get_mt_mask to make it easier to separate out
the parts which do not depend on vcpu state.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 71f54d85f104..77f45c005f28 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6987,7 +6987,6 @@ static int __init vmx_check_processor_compat(void)
static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
- u64 ipat = 0;
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7007,30 +7006,22 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
* EPT memory type is used to emulate guest CD/MTRR.
*/
- if (is_mmio) {
- cache = MTRR_TYPE_UNCACHABLE;
- goto exit;
- }
+ if (is_mmio)
+ return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
- ipat = VMX_EPT_IPAT_BIT;
- cache = MTRR_TYPE_WRBACK;
- goto exit;
- }
+ if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+ return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
- ipat = VMX_EPT_IPAT_BIT;
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
cache = MTRR_TYPE_WRBACK;
else
cache = MTRR_TYPE_UNCACHABLE;
- goto exit;
- }
- cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+ return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ }
-exit:
- return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
+ return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
}
static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
--
2.34.0.rc1.387.gb447b232ab-goog
When disabling dirty logging, the TDP MMU currently zaps each leaf entry
mapping memory in the relevant memslot. This is very slow. Doing the zaps
under the MMU read lock requires a TLB flush for every zap, and the
zapping causes a storm of EPT/NPT violations.
Instead of zapping, replace the split large pages with large page
mappings directly. While this sort of operation has historically only
been done in the vCPU page fault handler context, refactorings earlier
in this series and the relative simplicity of the TDP MMU make it
possible here as well.
Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
of memory per vCPU, this reduces the time required to disable dirty
logging from over 45 seconds to just over 1 second. It also avoids
provoking page faults, improving vCPU performance while disabling
dirty logging.
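Condensed from the diff below, the resulting flow in
zap_collapsible_spte_range() is roughly (sketch only; resched and pfn
checks omitted):

	tdp_root_for_each_pte(iter, root, start, end) {
		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		/* Non-leaf SPTE: try in-place promotion to a large page. */
		if (!is_last_spte(iter.old_spte, iter.level)) {
			try_promote_lpage(kvm, slot, &iter);
			continue;
		}

		/* Leaf SPTEs that could be collapsed are still zapped. */
		...
	}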
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/mmu_internal.h | 4 ++
arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++-
3 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef7a84422463..add724aa9e8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
* the direct page table on host, use as much mmu features as
* possible, however, kvm currently does not do execution-protection.
*/
-static void
+void
build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
int shadow_root_level)
{
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6563cce9c438..84d439432acf 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void
+build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
+ int shadow_root_level);
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 43c7834b4f0a..b15c8cd11cf9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
}
+static void try_promote_lpage(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ struct tdp_iter *iter)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
+ struct rsvd_bits_validate shadow_zero_check;
+ /*
+ * Since the TDP MMU doesn't manage nested PTs, there's no need to
+ * write protect for a nested VM when PML is in use.
+ */
+ bool ad_need_write_protect = false;
+ bool map_writable;
+ kvm_pfn_t pfn;
+ u64 new_spte;
+ u64 mt_mask;
+
+ /*
+ * If addresses are being invalidated, don't do in-place promotion to
+ * avoid accidentally mapping an invalidated address.
+ */
+ if (unlikely(kvm->mmu_notifier_count))
+ return;
+
+ pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
+ &map_writable, NULL);
+
+ /*
+	 * Can't reconstitute an lpage if the constituent pages can't be
+ * mapped higher.
+ */
+ if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
+ pfn, PG_LEVEL_NUM))
+ return;
+
+ build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
+
+ /*
+ * In some cases, a vCPU pointer is required to get the MT mask,
+ * however in most cases it can be generated without one. If a
+ * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
+ * In that case, bail on in-place promotion.
+ */
+ if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
+ kvm_is_mmio_pfn(pfn),
+ &mt_mask)))
+ return;
+
+ make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
+ map_writable, ad_need_write_protect, mt_mask,
+ &shadow_zero_check, &new_spte);
+
+ tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+
+ /*
+ * Re-read the SPTE to avoid recursing into one of the removed child
+ * page tables.
+ */
+ iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
+}
+
/*
* Clear leaf entries which could be replaced by large mappings, for
* GFNs within the slot.
@@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
- if (!is_shadow_present_pte(iter.old_spte) ||
- !is_last_spte(iter.old_spte, iter.level))
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+	/* Try to promote the constituent pages to an lpage. */
+ if (!is_last_spte(iter.old_spte, iter.level)) {
+ try_promote_lpage(kvm, slot, &iter);
continue;
+ }
pfn = spte_to_pfn(iter.old_spte);
if (kvm_is_reserved_pfn(pfn) ||
--
2.34.0.rc1.387.gb447b232ab-goog
Add another function for getting the memory type mask to x86_ops.
This version of the function can fail, but it does not require a vCPU
pointer. It will be used in a subsequent commit for in-place large page
promotion when disabling dirty logging.
No functional change intended.
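For context, a caller is expected to use the new hook roughly as follows;
this mirrors the later use in the TDP MMU promotion path and is only a
sketch, not part of this patch:

	u64 mt_mask;

	/* Bail if the memory type cannot be determined without a vCPU. */
	if (!static_call(kvm_x86_try_get_mt_mask)(kvm, gfn,
						  kvm_is_mmio_pfn(pfn),
						  &mt_mask))
		return;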
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm/svm.c | 8 ++++++++
arch/x86/kvm/vmx/vmx.c | 1 +
4 files changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..c86e9629ff1a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -84,6 +84,7 @@ KVM_X86_OP_NULL(sync_pir_to_irr)
KVM_X86_OP(set_tss_addr)
KVM_X86_OP(set_identity_map_addr)
KVM_X86_OP(get_mt_mask)
+KVM_X86_OP(try_get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_NULL(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88fce6ab4bbd..ae13075f4d4c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1400,6 +1400,8 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+ bool (*try_get_mt_mask)(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask);
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int root_level);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 21bb81710e0f..d073cc3985e6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4067,6 +4067,13 @@ static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
return true;
}
+static bool svm_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask)
+{
+ *mask = 0;
+ return true;
+}
+
static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
return 0;
@@ -4660,6 +4667,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.set_tss_addr = svm_set_tss_addr,
.set_identity_map_addr = svm_set_identity_map_addr,
.get_mt_mask = svm_get_mt_mask,
+ .try_get_mt_mask = svm_try_get_mt_mask,
.get_exit_info = svm_get_exit_info,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4129614262e8..8cd6c1f50d3e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7658,6 +7658,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
.get_mt_mask = vmx_get_mt_mask,
+ .try_get_mt_mask = vmx_try_get_mt_mask,
.get_exit_info = vmx_get_exit_info,
--
2.34.0.rc1.387.gb447b232ab-goog
Factor out the parts of vmx_get_mt_mask which do not depend on the vCPU
argument. This also requires adding some error reporting to the helper
function to say whether it was possible to generate the MT mask without
a vCPU argument. This refactoring will allow the MT mask to be computed
when noncoherent DMA is not enabled on a VM.
No functional change intended.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 77f45c005f28..4129614262e8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6984,9 +6984,26 @@ static int __init vmx_check_processor_compat(void)
return 0;
}
+static bool vmx_try_get_mt_mask(struct kvm *kvm, gfn_t gfn,
+ bool is_mmio, u64 *mask)
+{
+ if (is_mmio) {
+ *mask = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+ return true;
+ }
+
+ if (!kvm_arch_has_noncoherent_dma(kvm)) {
+ *mask = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ return true;
+ }
+
+ return false;
+}
+
static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
u8 cache;
+ u64 mask;
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
* memory aliases with conflicting memory types and sometimes MCEs.
@@ -7006,11 +7023,8 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
* EPT memory type is used to emulate guest CD/MTRR.
*/
- if (is_mmio)
- return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
-
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
- return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ if (vmx_try_get_mt_mask(vcpu->kvm, gfn, is_mmio, &mask))
+ return mask;
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
--
2.34.0.rc1.387.gb447b232ab-goog
Export kvm_is_mmio_pfn from spte.c. It will be used in a subsequent
commit for in-place lpage promotion when disabling dirty logging.
Signed-off-by: Ben Gardon <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/mmu/spte.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 7be41d2dbb02..13b6143f6333 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -68,7 +68,7 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
return spte;
}
-static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
{
if (pfn_valid(pfn))
return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d7598506fbad..909c24c733c4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -347,4 +347,5 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
void kvm_mmu_reset_all_pte_masks(void);
+bool kvm_is_mmio_pfn(kvm_pfn_t pfn);
#endif
--
2.34.0.rc1.387.gb447b232ab-goog
On Mon, Nov 15, 2021 at 3:46 PM Ben Gardon <[email protected]> wrote:
Haha oops, I lost the cover letter title there. It should have been
"KVM: x86/mmu: Optimize disabling dirty logging" and the subject line
should have been the first paragraph.
I'll correct that on any future versions of this series I send out.
On 11/16/21 00:45, Ben Gardon wrote:
> kvm_slot_page_track_is_active only uses its vCPU argument to get a
> pointer to the associated struct kvm, so just pass in the struct kvm to
> remove the need for a vCPU pointer.
>
> No functional change intended.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/include/asm/kvm_page_track.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 4 ++--
> arch/x86/kvm/mmu/page_track.c | 4 ++--
> 3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index 9d4a3b1b25b9..e99a30a4d38b 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -63,7 +63,7 @@ void kvm_slot_page_track_add_page(struct kvm *kvm,
> void kvm_slot_page_track_remove_page(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> enum kvm_page_track_mode mode);
> -bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
> +bool kvm_slot_page_track_is_active(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> enum kvm_page_track_mode mode);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2ada6dee920a..7d0da79668c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> * track machinery is used to write-protect upper-level shadow pages,
> * i.e. this guards the role.level == 4K assertion below!
> */
> - if (kvm_slot_page_track_is_active(vcpu, slot, gfn, KVM_PAGE_TRACK_WRITE))
> + if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
> return -EPERM;
>
> /*
> @@ -3884,7 +3884,7 @@ static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
> * guest is writing the page which is write tracked which can
> * not be fixed by page fault handler.
> */
> - if (kvm_slot_page_track_is_active(vcpu, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
> + if (kvm_slot_page_track_is_active(vcpu->kvm, fault->slot, fault->gfn, KVM_PAGE_TRACK_WRITE))
> return true;
>
> return false;
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index cc4eb5b7fb76..35c221d5f6ce 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -173,7 +173,7 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
> /*
> * check if the corresponding access on the specified guest page is tracked.
> */
> -bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
> +bool kvm_slot_page_track_is_active(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> enum kvm_page_track_mode mode)
> {
> @@ -186,7 +186,7 @@ bool kvm_slot_page_track_is_active(struct kvm_vcpu *vcpu,
> return false;
>
> if (mode == KVM_PAGE_TRACK_WRITE &&
> - !kvm_page_track_write_tracking_enabled(vcpu->kvm))
> + !kvm_page_track_write_tracking_enabled(kvm))
> return false;
>
> index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
>
Queued, thanks.
Paolo
On 11/16/21 00:45, Ben Gardon wrote:
> tdp_mmu_zap_spte_atomic flushes on every zap already, so no need to
> flush again after it's done.
>
> Reviewed-by: David Matlack <[email protected]>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 4 +---
> arch/x86/kvm/mmu/tdp_mmu.c | 21 ++++++---------------
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++---
> 3 files changed, 9 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 354d2ca92df4..baa94acab516 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5870,9 +5870,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>
> if (is_tdp_mmu_enabled(kvm)) {
> read_lock(&kvm->mmu_lock);
> - flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush);
> - if (flush)
> - kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> + kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
> read_unlock(&kvm->mmu_lock);
> }
> }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7c5dd83e52de..b3c78568ae60 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1364,10 +1364,9 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> * Clear leaf entries which could be replaced by large mappings, for
> * GFNs within the slot.
> */
> -static bool zap_collapsible_spte_range(struct kvm *kvm,
> +static void zap_collapsible_spte_range(struct kvm *kvm,
> struct kvm_mmu_page *root,
> - const struct kvm_memory_slot *slot,
> - bool flush)
> + const struct kvm_memory_slot *slot)
> {
> gfn_t start = slot->base_gfn;
> gfn_t end = start + slot->npages;
> @@ -1378,10 +1377,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
>
> tdp_root_for_each_pte(iter, root, start, end) {
> retry:
> - if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
> - flush = false;
> + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
> - }
>
> if (!is_shadow_present_pte(iter.old_spte) ||
> !is_last_spte(iter.old_spte, iter.level))
> @@ -1401,30 +1398,24 @@ static bool zap_collapsible_spte_range(struct kvm *kvm,
> iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
> goto retry;
> }
> - flush = true;
> }
>
> rcu_read_unlock();
> -
> - return flush;
> }
>
> /*
> * Clear non-leaf entries (and free associated page tables) which could
> * be replaced by large mappings, for GFNs within the slot.
> */
> -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> - const struct kvm_memory_slot *slot,
> - bool flush)
> +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> + const struct kvm_memory_slot *slot)
> {
> struct kvm_mmu_page *root;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true)
> - flush = zap_collapsible_spte_range(kvm, root, slot, flush);
> -
> - return flush;
> + zap_collapsible_spte_range(kvm, root, slot);
> }
>
> /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 476b133544dd..3899004a5d91 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -64,9 +64,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> gfn_t gfn, unsigned long mask,
> bool wrprot);
> -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> - const struct kvm_memory_slot *slot,
> - bool flush);
> +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> + const struct kvm_memory_slot *slot);
>
> bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
>
Queued, thanks.
Paolo
On 11/16/21 00:45, Ben Gardon wrote:
> The vCPU argument to mmu_try_to_unsync_pages is now only used to get a
> pointer to the associated struct kvm, so pass in the kvm pointer from
> the beginning to remove the need for a vCPU when calling the function.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 16 ++++++++--------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/mmu/spte.c | 2 +-
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7d0da79668c0..1e890509b93f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2561,10 +2561,10 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
> return r;
> }
>
> -static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> trace_kvm_mmu_unsync_page(sp);
> - ++vcpu->kvm->stat.mmu_unsync;
> + ++kvm->stat.mmu_unsync;
> sp->unsync = 1;
>
> kvm_mmu_mark_parents_unsync(sp);
> @@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
> * be write-protected.
> */
> -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> +int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, bool can_unsync, bool prefetch)
> {
> struct kvm_mmu_page *sp;
> @@ -2587,7 +2587,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> * track machinery is used to write-protect upper-level shadow pages,
> * i.e. this guards the role.level == 4K assertion below!
> */
> - if (kvm_slot_page_track_is_active(vcpu->kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
> + if (kvm_slot_page_track_is_active(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE))
> return -EPERM;
>
> /*
> @@ -2596,7 +2596,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> * that case, KVM must complete emulation of the guest TLB flush before
> * allowing shadow pages to become unsync (writable by the guest).
> */
> - for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> + for_each_gfn_indirect_valid_sp(kvm, sp, gfn) {
> if (!can_unsync)
> return -EPERM;
>
> @@ -2615,7 +2615,7 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> */
> if (!locked) {
> locked = true;
> - spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
> + spin_lock(&kvm->arch.mmu_unsync_pages_lock);
>
> /*
> * Recheck after taking the spinlock, a different vCPU
> @@ -2630,10 +2630,10 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> }
>
> WARN_ON(sp->role.level != PG_LEVEL_4K);
> - kvm_unsync_page(vcpu, sp);
> + kvm_unsync_page(kvm, sp);
> }
> if (locked)
> - spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
> + spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
>
> /*
> * We need to ensure that the marking of unsync pages is visible
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 52c6527b1a06..1073d10cce91 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
> kvm_x86_ops.cpu_dirty_log_size;
> }
>
> -int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> +int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, bool can_unsync, bool prefetch);
>
> void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 75c666d3e7f1..b7271daa06c5 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -160,7 +160,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> * e.g. it's write-tracked (upper-level SPs) or has one or more
> * shadow pages and unsync'ing pages is not allowed.
> */
> - if (mmu_try_to_unsync_pages(vcpu, slot, gfn, can_unsync, prefetch)) {
> + if (mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, can_unsync, prefetch)) {
> pgprintk("%s: found shadow page for %llx, marking ro\n",
> __func__, gfn);
> wrprot = true;
>
Queued, thanks.
Paolo
On 11/16/21 00:45, Ben Gardon wrote:
> In preparation for implementing in-place hugepage promotion, various
> functions will need to be called from zap_collapsible_spte_range, which
> has the const qualifier on its memslot argument. Propagate the const
> qualifier to the various functions which will be needed. This just serves
> to simplify the following patch.
>
> No functional change intended.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/include/asm/kvm_page_track.h | 4 ++--
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/mmu/mmu_internal.h | 2 +-
> arch/x86/kvm/mmu/page_track.c | 4 ++--
> arch/x86/kvm/mmu/spte.c | 2 +-
> arch/x86/kvm/mmu/spte.h | 2 +-
> include/linux/kvm_host.h | 10 +++++-----
> virt/kvm/kvm_main.c | 12 ++++++------
> 8 files changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index e99a30a4d38b..eb186bc57f6a 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -64,8 +64,8 @@ void kvm_slot_page_track_remove_page(struct kvm *kvm,
> struct kvm_memory_slot *slot, gfn_t gfn,
> enum kvm_page_track_mode mode);
> bool kvm_slot_page_track_is_active(struct kvm *kvm,
> - struct kvm_memory_slot *slot, gfn_t gfn,
> - enum kvm_page_track_mode mode);
> + const struct kvm_memory_slot *slot,
> + gfn_t gfn, enum kvm_page_track_mode mode);
>
> void
> kvm_page_track_register_notifier(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fdf0f15ab19d..ef7a84422463 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2576,7 +2576,7 @@ static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
> * be write-protected.
> */
> -int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
> +int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
> gfn_t gfn, bool can_unsync, bool prefetch)
> {
> struct kvm_mmu_page *sp;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1073d10cce91..6563cce9c438 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -118,7 +118,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
> kvm_x86_ops.cpu_dirty_log_size;
> }
>
> -int mmu_try_to_unsync_pages(struct kvm *kvm, struct kvm_memory_slot *slot,
> +int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
> gfn_t gfn, bool can_unsync, bool prefetch);
>
> void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 35c221d5f6ce..68eb1fb548b6 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -174,8 +174,8 @@ EXPORT_SYMBOL_GPL(kvm_slot_page_track_remove_page);
> * check if the corresponding access on the specified guest page is tracked.
> */
> bool kvm_slot_page_track_is_active(struct kvm *kvm,
> - struct kvm_memory_slot *slot, gfn_t gfn,
> - enum kvm_page_track_mode mode)
> + const struct kvm_memory_slot *slot,
> + gfn_t gfn, enum kvm_page_track_mode mode)
> {
> int index;
>
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index d98723b14cec..7be41d2dbb02 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -90,7 +90,7 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> }
>
> bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> - struct kvm_memory_slot *slot, unsigned int pte_access,
> + const struct kvm_memory_slot *slot, unsigned int pte_access,
> gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
> bool can_unsync, bool host_writable, bool ad_need_write_protect,
> u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 5bb055688080..d7598506fbad 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -330,7 +330,7 @@ static inline u64 get_mmio_spte_generation(u64 spte)
> }
>
> bool make_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> - struct kvm_memory_slot *slot, unsigned int pte_access,
> + const struct kvm_memory_slot *slot, unsigned int pte_access,
> gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool prefetch,
> bool can_unsync, bool host_writable, bool ad_need_write_protect,
> u64 mt_mask, struct rsvd_bits_validate *shadow_zero_check,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 60a35d9fe259..675da38fac7f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -435,7 +435,7 @@ struct kvm_memory_slot {
> u16 as_id;
> };
>
> -static inline bool kvm_slot_dirty_track_enabled(struct kvm_memory_slot *slot)
> +static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> }
> @@ -855,9 +855,9 @@ void kvm_set_page_accessed(struct page *page);
> kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable);
> -kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
> -kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
> -kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> +kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
> +kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> bool atomic, bool *async, bool write_fault,
> bool *writable, hva_t *hva);
>
> @@ -934,7 +934,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
> bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
> bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
> unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
> -void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, gfn_t gfn);
> +void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
> void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
>
> struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3f6d450355f0..6dbf8cba1900 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2138,12 +2138,12 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
> return size;
> }
>
> -static bool memslot_is_readonly(struct kvm_memory_slot *slot)
> +static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_READONLY;
> }
>
> -static unsigned long __gfn_to_hva_many(struct kvm_memory_slot *slot, gfn_t gfn,
> +static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn,
> gfn_t *nr_pages, bool write)
> {
> if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
> @@ -2438,7 +2438,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
> return pfn;
> }
>
> -kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> bool atomic, bool *async, bool write_fault,
> bool *writable, hva_t *hva)
> {
> @@ -2478,13 +2478,13 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>
> -kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>
> -kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
> +kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
> }
> @@ -3079,7 +3079,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
> EXPORT_SYMBOL_GPL(kvm_clear_guest);
>
> void mark_page_dirty_in_slot(struct kvm *kvm,
> - struct kvm_memory_slot *memslot,
> + const struct kvm_memory_slot *memslot,
> gfn_t gfn)
> {
> if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
>
Queued, thanks.
Paolo
On 11/16/21 00:45, Ben Gardon wrote:
> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> the parts which do not depend on vcpu state.
>
> No functional change intended.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
Queued, thanks (with a slightly edited commit message; the patch is a
simplification anyway).
Paolo
> ---
> arch/x86/kvm/vmx/vmx.c | 23 +++++++----------------
> 1 file changed, 7 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 71f54d85f104..77f45c005f28 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6987,7 +6987,6 @@ static int __init vmx_check_processor_compat(void)
> static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
> u8 cache;
> - u64 ipat = 0;
>
> /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> * memory aliases with conflicting memory types and sometimes MCEs.
> @@ -7007,30 +7006,22 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> * EPT memory type is used to emulate guest CD/MTRR.
> */
>
> - if (is_mmio) {
> - cache = MTRR_TYPE_UNCACHABLE;
> - goto exit;
> - }
> + if (is_mmio)
> + return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>
> - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
> - ipat = VMX_EPT_IPAT_BIT;
> - cache = MTRR_TYPE_WRBACK;
> - goto exit;
> - }
> + if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> + return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> - ipat = VMX_EPT_IPAT_BIT;
> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> cache = MTRR_TYPE_WRBACK;
> else
> cache = MTRR_TYPE_UNCACHABLE;
> - goto exit;
> - }
>
> - cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> + return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> + }
>
> -exit:
> - return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
> + return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> }
>
> static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
>
On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> On 11/16/21 00:45, Ben Gardon wrote:
> > Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> > the parts which do not depend on vcpu state.
> >
> > No functional change intended.
> >
> >
> > Signed-off-by: Ben Gardon <[email protected]>
>
> Queued, thanks (with a slightly edited commit message; the patch is a
> simplification anyway).
Don't know what message you've queued, but just in case you kept some of the original,
can you further edit it to remove any snippets that mention separating out the parts
that don't depend on vCPU state?
IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
because the per-vCPU variant is a lie. The memtype of a SPTE is not tracked anywhere,
which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
happily let the guest consume SPTEs with the incorrect memtype. In practice, this
isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
DMA operations running while the cacheability state is in flux.
If we really want to make this state per-vCPU, KVM would need to incorporate the
CR0.CD and MTRR settings in kvm_mmu_page_role. For MTRRs in particular, the worst
case scenario is that every vCPU has different MTRR settings, which means that
kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
possible vcpu_idx (currently capped at 1024).
So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
memtypes are a per-VM property, e.g. on top:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 77f45c005f28..8a84d30f1dbd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
return 0;
}
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
{
+ struct kvm_vcpu *vcpu;
u8 cache;
/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
@@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (is_mmio)
return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+ if (!kvm_arch_has_noncoherent_dma(kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+ vcpu = kvm_get_vcpu_by_id(kvm, 0);
+ if (KVM_BUG_ON(!vcpu, kvm))
+ return;
+
if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
- if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+ if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
cache = MTRR_TYPE_WRBACK;
else
cache = MTRR_TYPE_UNCACHABLE;
On 11/18/21 16:30, Sean Christopherson wrote:
> On Thu, Nov 18, 2021, Paolo Bonzini wrote:
>> On 11/16/21 00:45, Ben Gardon wrote:
>>> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
>>> the parts which do not depend on vcpu state.
>>>
>>> No functional change intended.
>>>
>>>
>>> Signed-off-by: Ben Gardon <[email protected]>
>>
>> Queued, thanks (with a slightly edited commit message; the patch is a
>> simplification anyway).
>
> Don't know what message you've queued, but just in case you kept some of the original,
> can you further edit it to remove any snippets that mention separating out the parts
> that don't depend on vCPU state?
Indeed I did keep some:
commit b7297e02826857e068d03f844c8336ce48077d78
Author: Ben Gardon <[email protected]>
Date: Mon Nov 15 15:45:59 2021 -0800
KVM: x86/MMU: Simplify flow of vmx_get_mt_mask
Remove the gotos from vmx_get_mt_mask. This may later make it easier
to separate out the parts which do not depend on vcpu state, but it also
simplifies the code in general.
No functional change intended.
i.e. keeping it conditional but I can edit it further, like
Remove the gotos from vmx_get_mt_mask. It's easier to build the whole
memory type at once, than it is to combine separate cacheability and ipat
fields.
Paolo
> IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
> because the per-vCPU variant is a lie. The memtype of a SPTE is not tracked anywhere,
> which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
> happily let the guest consume SPTEs with the incorrect memtype. In practice, this
> isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
> DMA operations running while the cacheability state is in flux.
>
> If we really want to make this state per-vCPU, KVM would need to incorporate the
> CR0.CD and MTRR settings in kvm_mmu_page_role. For MTRRs in particular, the worst
> case scenario is that every vCPU has different MTRR settings, which means that
> kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> possible vcpu_idx (currently capped at 1024).
Yes, that's insanity. I was also a bit skeptical about Ben's try_get_mt_mask callback,
but this would be much much worse.
Paolo
> So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
> scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
> memtypes are a per-VM property, e.g. on top:
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 77f45c005f28..8a84d30f1dbd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
> return 0;
> }
>
> -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
> {
> + struct kvm_vcpu *vcpu;
> u8 cache;
>
> /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> @@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> if (is_mmio)
> return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>
> - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> + if (!kvm_arch_has_noncoherent_dma(kvm))
> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> + vcpu = kvm_get_vcpu_by_id(kvm, 0);
> + if (KVM_BUG_ON(!vcpu, kvm))
> + return;
> +
> if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> - if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> + if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> cache = MTRR_TYPE_WRBACK;
> else
> cache = MTRR_TYPE_UNCACHABLE;
>
On Fri, Nov 19, 2021 at 1:03 AM Paolo Bonzini <[email protected]> wrote:
>
> On 11/18/21 16:30, Sean Christopherson wrote:
> > On Thu, Nov 18, 2021, Paolo Bonzini wrote:
> >> On 11/16/21 00:45, Ben Gardon wrote:
> >>> Remove the gotos from vmx_get_mt_mask to make it easier to separate out
> >>> the parts which do not depend on vcpu state.
> >>>
> >>> No functional change intended.
> >>>
> >>>
> >>> Signed-off-by: Ben Gardon <[email protected]>
> >>
> >> Queued, thanks (with a slightly edited commit message; the patch is a
> >> simplification anyway).
> >
> > Don't know what message you've queued, but just in case you kept some of the original,
> > can you further edit it to remove any snippets that mention separating out the parts
> > that don't depend on vCPU state?
>
> Indeed I did keep some:
>
> commit b7297e02826857e068d03f844c8336ce48077d78
> Author: Ben Gardon <[email protected]>
> Date: Mon Nov 15 15:45:59 2021 -0800
>
> KVM: x86/MMU: Simplify flow of vmx_get_mt_mask
>
> Remove the gotos from vmx_get_mt_mask. This may later make it easier
> to separate out the parts which do not depend on vcpu state, but it also
> simplifies the code in general.
>
> No functional change intended.
>
> i.e. keeping it conditional but I can edit it further, like
>
> Remove the gotos from vmx_get_mt_mask. It's easier to build the whole
> memory type at once, than it is to combine separate cacheability and ipat
> fields.
>
> Paolo
>
> > IMO, we should not separate vmx_get_mt_mask() into per-VM and per-vCPU variants,
> > because the per-vCPU variant is a lie. The memtype of a SPTE is not tracked anywhere,
> > which means that if the guest has non-uniform CR0.CD/NW or MTRR settings, KVM will
> > happily let the guest consume SPTEs with the incorrect memtype. In practice, this
> > isn't an issue because no sane BIOS or kernel uses per-CPU MTRRs, nor do they have
> > DMA operations running while the cacheability state is in flux.
> >
> > If we really want to make this state per-vCPU, KVM would need to incorporate the
> > CR0.CD and MTRR settings in kvm_mmu_page_role. For MTRRs in particular, the worst
> > case scenario is that every vCPU has different MTRR settings, which means that
> > kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> > possible vcpu_idx (currently capped at 1024).
>
> Yes, that's insanity. I was also a bit skeptical about Ben's try_get_mt_mask callback,
> but this would be much much worse.
Yeah, the implementation of that felt a bit kludgy to me too, but
refactoring the handling of all those CR bits was way more complex
than I wanted to handle in this patch set.
I'd love to see some of those CR0 / MTRR settings be set on a VM basis
and enforced as uniform across vCPUs.
Looking up vCPU 0 and basing things on that feels extra hacky though,
especially if we're still not asserting uniformity of settings across
vCPUs.
If we need to track that state to accurately virtualize the hardware
though, that would be unfortunate.
>
> Paolo
>
> > So unless we want to massively complicate kvm_mmu_page_role and gfn_track for a
> > scenario no one cares about, I would strongly prefer to acknowledge that KVM assumes
> > memtypes are a per-VM property, e.g. on top:
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 77f45c005f28..8a84d30f1dbd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6984,8 +6984,9 @@ static int __init vmx_check_processor_compat(void)
> > return 0;
> > }
> >
> > -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > +static u64 vmx_get_mt_mask(struct kvm *kvm, gfn_t gfn, bool is_mmio)
> > {
> > + struct kvm_vcpu *vcpu;
> > u8 cache;
> >
> > /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > @@ -7009,11 +7010,15 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > if (is_mmio)
> > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> >
> > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > + if (!kvm_arch_has_noncoherent_dma(kvm))
> > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >
> > + vcpu = kvm_get_vcpu_by_id(kvm, 0);
> > + if (KVM_BUG_ON(!vcpu, kvm))
> > + return;
> > +
> > if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> > - if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> > + if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> > cache = MTRR_TYPE_WRBACK;
> > else
> > cache = MTRR_TYPE_UNCACHABLE;
> >
>
On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 1:03 AM Paolo Bonzini <[email protected]> wrote:
> >
> > On 11/18/21 16:30, Sean Christopherson wrote:
> > > If we really want to make this state per-vCPU, KVM would need to incorporate the
> > > CR0.CD and MTRR settings in kvm_mmu_page_role. For MTRRs in particular, the worst
> > > case scenario is that every vCPU has different MTRR settings, which means that
> > > kvm_mmu_page_role would need to be expanded by 10 bits in order to track every
> > > possible vcpu_idx (currently capped at 1024).
> >
> > Yes, that's insanity. I was also a bit skeptical about Ben's try_get_mt_mask callback,
> > but this would be much much worse.
>
> Yeah, the implementation of that felt a bit kludgy to me too, but
> refactoring the handling of all those CR bits was way more complex
> than I wanted to handle in this patch set.
> I'd love to see some of those CR0 / MTRR settings be set on a VM basis
> and enforced as uniform across vCPUs.
Architecturally, we can't do that. Even a perfectly well-behaved guest will have
(small) periods where the BSP has different settings than APs. And it's technically
legal to have non-uniform MTRR and CR0.CD/NW configurations, even though no modern
BIOS/kernel does that. Except for non-coherent DMA, it's a moot point because KVM
can simply ignore guest cacheability settings.
> Looking up vCPU 0 and basing things on that feels extra hacky though,
> especially if we're still not asserting uniformity of settings across
> vCPUs.
IMO, it's marginally less hacky than what KVM has today as it allows KVM's behavior
to be clearly and sanely stated, e.g. KVM uses vCPU0's cacheability settings when
mapping non-coherent DMA. Compare that with today's behavior where the cacheability
settings depend on which vCPU first faulted in the address for a given MMU role and
instance of the associated root, and whether other vCPUs share an MMU role/root.
> If we need to track that state to accurately virtualize the hardware
> though, that would be unfortunate.
Hi, Ben,
On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote:
> When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> mapping memory in the relevant memslot. This is very slow. Doing the zaps
> under the mmu read lock requires a TLB flush for every zap and the
> zapping causes a storm of EPT/NPT violations.
>
> Instead of zapping, replace the split large pages with large page
> mappings directly. While this sort of operation has historically only
> been done in the vCPU page fault handler context, refactorings earlier
> in this series and the relative simplicity of the TDP MMU make it
> possible here as well.
Thanks for this patch, it looks very useful.
I've got a few comments below, but before that I've also got one off-topic
question too; it'll be great if you can help answer.
When I was looking into how the old code recovers the huge pages I found that
we'll leave the full-zero pgtable page there until the next page fault, then I
_think_ it'll be released only when __handle_changed_spte() drops the old
spte (handle_removed_tdp_mmu_page).
As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
current thread should have exclusive ownership of this orphaned and abandoned
pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
besides the current thread?
>
> Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> of memory per vCPU, this reduces the time required to disable dirty
> logging from over 45 seconds to just over 1 second. It also avoids
> provoking page faults, improving vCPU performance while disabling
> dirty logging.
>
>
> Signed-off-by: Ben Gardon <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/mmu/mmu_internal.h | 4 ++
> arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++-
> 3 files changed, 72 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ef7a84422463..add724aa9e8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
> * the direct page table on host, use as much mmu features as
> * possible, however, kvm currently does not do execution-protection.
> */
> -static void
> +void
> build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> int shadow_root_level)
> {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 6563cce9c438..84d439432acf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> + int shadow_root_level);
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 43c7834b4f0a..b15c8cd11cf9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> }
>
> +static void try_promote_lpage(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + struct tdp_iter *iter)
> +{
> + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> + struct rsvd_bits_validate shadow_zero_check;
> + /*
> + * Since the TDP MMU doesn't manage nested PTs, there's no need to
> + * write protect for a nested VM when PML is in use.
> + */
> + bool ad_need_write_protect = false;
Shall we just pass in "false" in make_spte() and just move the comment there?
> + bool map_writable;
> + kvm_pfn_t pfn;
> + u64 new_spte;
> + u64 mt_mask;
> +
> + /*
> + * If addresses are being invalidated, don't do in-place promotion to
> + * avoid accidentally mapping an invalidated address.
> + */
> + if (unlikely(kvm->mmu_notifier_count))
> + return;
> +
> + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> + &map_writable, NULL);
Should we better check pfn validity and bail out otherwise? E.g. for atomic I
think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow.
> +
> + /*
> + * Can't reconstitute an lpage if the consituent pages can't be
> + * mapped higher.
> + */
> + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> + pfn, PG_LEVEL_NUM))
> + return;
> +
> + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> + /*
> + * In some cases, a vCPU pointer is required to get the MT mask,
> + * however in most cases it can be generated without one. If a
> + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> + * In that case, bail on in-place promotion.
> + */
> + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> + kvm_is_mmio_pfn(pfn),
> + &mt_mask)))
> + return;
> +
> + make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> + map_writable, ad_need_write_protect, mt_mask,
> + &shadow_zero_check, &new_spte);
> +
> + tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> +
> + /*
> + * Re-read the SPTE to avoid recursing into one of the removed child
> + * page tables.
> + */
> + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
Is this redundant since it seems the iterator logic handles this already, I'm
reading try_step_down() here:
/*
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
The rest looks good to me, thanks.
> +}
> +
> /*
> * Clear leaf entries which could be replaced by large mappings, for
> * GFNs within the slot.
> @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> - if (!is_shadow_present_pte(iter.old_spte) ||
> - !is_last_spte(iter.old_spte, iter.level))
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> + /* Try to promote the constitutent pages to an lpage. */
> + if (!is_last_spte(iter.old_spte, iter.level)) {
> + try_promote_lpage(kvm, slot, &iter);
> continue;
> + }
>
> pfn = spte_to_pfn(iter.old_spte);
> if (kvm_is_reserved_pfn(pfn) ||
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>
--
Peter Xu
On Wed, Nov 24, 2021 at 8:19 PM Peter Xu <[email protected]> wrote:
>
> Hi, Ben,
>
> On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote:
> > When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> > mapping memory in the relevant memslot. This is very slow. Doing the zaps
> > under the mmu read lock requires a TLB flush for every zap and the
> > zapping causes a storm of EPT/NPT violations.
> >
> > Instead of zapping, replace the split large pages with large page
> > mappings directly. While this sort of operation has historically only
> > been done in the vCPU page fault handler context, refactorings earlier
> > in this series and the relative simplicity of the TDP MMU make it
> > possible here as well.
>
> Thanks for this patch, it looks very useful.
Thanks for taking a look!
>
> I've got a few comments below, but before that I've also got one off-topic
> question too; it'll be great if you can help answer.
>
> When I was looking into how the old code recovers the huge pages I found that
> we'll leave the full-zero pgtable page there until the next page fault, then I
> _think_ it'll be released only when __handle_changed_spte() drops the old
> spte (handle_removed_tdp_mmu_page).
That seems likely, though Sean's recent series that heavily refactored
zapping probably changed that.
>
> As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
> current thread should have exclusive ownership of this orphaned and abandoned
> pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
> atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
> Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
> besides the current thread?
The cmpxchg does nothing to guarantee that other threads can't have a
pointer to the page table, only that this thread knows it's the one
that removed it from the page table. Other threads could still have
pointers to it in two ways:
1. A kernel thread could be in the process of modifying an SPTE in the
page table, under the MMU lock in read mode. In that case, there's no
guarantee that there's not another kernel thread with a pointer to the
SPTE until the end of an RCU grace period.
2. There could be a pointer to the page table in a vCPU's paging
structure caches, which are similar to the TLB but cache partial
translations. These are also cleared out on TLB flush.
Sean's recent series linked the RCU grace period and TLB flush in a
clever way so that we can ensure that the end of a grace period
implies that the necessary flushes have happened already, but we still
need to clear out the disconnected page table with atomic operations.
We need to clear it out mostly to collect dirty / accessed bits and
update page size stats.
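
To make that concrete, here is a tiny stand-alone C model of the pattern
(made-up names, with C11 atomics standing in for the kernel's REMOVED_SPTE,
cmpxchg and READ_ONCE; this is only a sketch of the idea, not the KVM code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for KVM's REMOVED_SPTE sentinel; the value here is made up. */
#define REMOVED_SENTINEL ((uint64_t)0x5a00000000000000ULL)

/* One "SPTE" in a parent page table (a model, not a real page table). */
static _Atomic uint64_t parent_entry;

/* Walker: re-read the entry and refuse to step into a removed subtree. */
static int walker_follow(void)
{
        uint64_t spte = atomic_load_explicit(&parent_entry,
                                             memory_order_acquire);

        if (spte == REMOVED_SENTINEL)
                return 0;       /* subtree being torn down; back off */

        /* ... a real walker would dereference the child table here ... */
        return 1;
}

/*
 * Remover: atomically install the sentinel so that new walkers back off.
 * Freeing the detached child table must still wait for a grace period
 * (not modeled here), since old walkers may hold pointers into it.
 */
static int remover_detach(uint64_t old_spte)
{
        return atomic_compare_exchange_strong(&parent_entry, &old_spte,
                                              REMOVED_SENTINEL);
}

int main(void)
{
        atomic_store(&parent_entry, 0x1000ULL);
        printf("walker before detach: %d\n", walker_follow());
        printf("detach succeeded:     %d\n", remover_detach(0x1000ULL));
        printf("walker after detach:  %d\n", walker_follow());
        return 0;
}

In the real code the sentinel is REMOVED_SPTE, the re-read is the READ_ONCE()
in the TDP iterator, and the free is deferred until the RCU grace period
mentioned above.
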
>
> >
> > Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> > of memory per vCPU, this reduces the time required to disable dirty
> > logging from over 45 seconds to just over 1 second. It also avoids
> > provoking page faults, improving vCPU performance while disabling
> > dirty logging.
> >
> >
> > Signed-off-by: Ben Gardon <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 2 +-
> > arch/x86/kvm/mmu/mmu_internal.h | 4 ++
> > arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++-
> > 3 files changed, 72 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index ef7a84422463..add724aa9e8c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
> > * the direct page table on host, use as much mmu features as
> > * possible, however, kvm currently does not do execution-protection.
> > */
> > -static void
> > +void
> > build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > int shadow_root_level)
> > {
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 6563cce9c438..84d439432acf 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > + int shadow_root_level);
> > +
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 43c7834b4f0a..b15c8cd11cf9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> > clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> > }
> >
> > +static void try_promote_lpage(struct kvm *kvm,
> > + const struct kvm_memory_slot *slot,
> > + struct tdp_iter *iter)
> > +{
> > + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > + struct rsvd_bits_validate shadow_zero_check;
> > + /*
> > + * Since the TDP MMU doesn't manage nested PTs, there's no need to
> > + * write protect for a nested VM when PML is in use.
> > + */
> > + bool ad_need_write_protect = false;
>
> Shall we just pass in "false" in make_spte() and just move the comment there?
We could, but given the egregious number of arguments to the function
(totally my fault), I think this is probably a bit more readable.
>
> > + bool map_writable;
> > + kvm_pfn_t pfn;
> > + u64 new_spte;
> > + u64 mt_mask;
> > +
> > + /*
> > + * If addresses are being invalidated, don't do in-place promotion to
> > + * avoid accidentally mapping an invalidated address.
> > + */
> > + if (unlikely(kvm->mmu_notifier_count))
> > + return;
> > +
> > + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > + &map_writable, NULL);
>
> Should we better check pfn validity and bail out otherwise? E.g. for atomic I
> think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow.
That's an excellent point. We should absolutely do that.
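Something along those lines right after the __gfn_to_pfn_memslot() call should
cover it; just a sketch of the kind of check being discussed (not the final
patch), using is_error_noslot_pfn() since that also catches KVM_PFN_ERR_FAULT:

        pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
                                   &map_writable, NULL);
        if (is_error_noslot_pfn(pfn))
                return;
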
>
> > +
> > + /*
> > + * Can't reconstitute an lpage if the consituent pages can't be
> > + * mapped higher.
> > + */
> > + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > + pfn, PG_LEVEL_NUM))
> > + return;
> > +
> > + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > +
> > + /*
> > + * In some cases, a vCPU pointer is required to get the MT mask,
> > + * however in most cases it can be generated without one. If a
> > + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > + * In that case, bail on in-place promotion.
> > + */
> > + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> > + kvm_is_mmio_pfn(pfn),
> > + &mt_mask)))
> > + return;
> > +
> > + make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> > + map_writable, ad_need_write_protect, mt_mask,
> > + &shadow_zero_check, &new_spte);
> > +
> > + tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > +
> > + /*
> > + * Re-read the SPTE to avoid recursing into one of the removed child
> > + * page tables.
> > + */
> > + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>
> Is this redundant since it seems the iterator logic handles this already, I'm
> reading try_step_down() here:
>
> /*
> * Reread the SPTE before stepping down to avoid traversing into page
> * tables that are no longer linked from this entry.
> */
> iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
Oh, I had forgotten about that, but it'd be great if it was redundant.
I'll double check.
>
> The rest looks good to me, thanks.
Thanks for your review!
>
> > +}
> > +
> > /*
> > * Clear leaf entries which could be replaced by large mappings, for
> > * GFNs within the slot.
> > @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> > if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> > continue;
> >
> > - if (!is_shadow_present_pte(iter.old_spte) ||
> > - !is_last_spte(iter.old_spte, iter.level))
> > + if (!is_shadow_present_pte(iter.old_spte))
> > + continue;
> > +
> > + /* Try to promote the constitutent pages to an lpage. */
> > + if (!is_last_spte(iter.old_spte, iter.level)) {
> > + try_promote_lpage(kvm, slot, &iter);
> > continue;
> > + }
> >
> > pfn = spte_to_pfn(iter.old_spte);
> > if (kvm_is_reserved_pfn(pfn) ||
> > --
> > 2.34.0.rc1.387.gb447b232ab-goog
> >
>
> --
> Peter Xu
>
On Mon, Nov 29, 2021, Ben Gardon wrote:
> On Wed, Nov 24, 2021 at 8:19 PM Peter Xu <[email protected]> wrote:
> > I've got a few comments below, but before that I've also got one off-topic
> > question too; it'll be great if you can help answer.
> >
> > When I was looking into how the old code recovers the huge pages I found that
> > we'll leave the full-zero pgtable page there until the next page fault, then I
> > _think_ it'll be released only when __handle_changed_spte() drops the old
> > spte (handle_removed_tdp_mmu_page).
>
> That seems likely, though Sean's recent series that heavily refactored
> zapping probably changed that.
Hmm, no, that behavior shouldn't change for this path in my series. Only the leaf
SPTEs are zapped; the shadow page hangs around until it's replaced by a hugepage,
reclaimed due to memory pressure, etc... FWIW, that's consistent with the
legacy/full MMU.
Zapping the SP is doable in theory, but it would require completely different
iteration behavior and a small amount of additional logic to check the entire gfn
range covered by the SP for compatibility. If we were starting from scratch, I
would probably advocate zapping the parent SP directly, but since the net work
done is roughly equivalent, the cost of keeping the page around is negligible,
and we have functional code already...
On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
> > current thread should have exclusive ownership of this orphaned and abandoned
> > pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
> > atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
> > Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
> > besides the current thread?
>
> The cmpxchg does nothing to guarantee that other threads can't have a
> pointer to the page table, only that this thread knows it's the one
> that removed it from the page table. Other threads could still have
> pointers to it in two ways:
> 1. A kernel thread could be in the process of modifying an SPTE in the
> page table, under the MMU lock in read mode. In that case, there's no
> guarantee that there's not another kernel thread with a pointer to the
> SPTE until the end of an RCU grace period.
Right, I definitely missed that whole picture of the RCU usage. Thanks.
> 2. There could be a pointer to the page table in a vCPU's paging
> structure caches, which are similar to the TLB but cache partial
> translations. These are also cleared out on TLB flush.
Could you elaborate what's the structure cache that you mentioned? I thought
the processor page walker will just use the data cache (L1-L3) as pgtable
caches, in which case IIUC the invalidation happens when we do WRITE_ONCE()
that'll invalidate all the rest data cache besides the writer core. But I
could be completely missing something..
> Sean's recent series linked the RCU grace period and TLB flush in a
> clever way so that we can ensure that the end of a grace period
> implies that the necessary flushes have happened already, but we still
> need to clear out the disconnected page table with atomic operations.
> We need to clear it out mostly to collect dirty / accessed bits and
> update page size stats.
Yes, this sounds reasonable too.
--
Peter Xu
On Tue, Nov 30, 2021, Peter Xu wrote:
> On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > 2. There could be a pointer to the page table in a vCPU's paging
> > structure caches, which are similar to the TLB but cache partial
> > translations. These are also cleared out on TLB flush.
>
> Could you elaborate what's the structure cache that you mentioned? I thought
> the processor page walker will just use the data cache (L1-L3) as pgtable
> caches, in which case IIUC the invalidation happens when we do WRITE_ONCE()
> that'll invalidate all the rest data cache besides the writer core. But I
> could be completely missing something..
Ben is referring to the Intel SDM's use of the term "paging-structure caches".
Intel CPUs, and I'm guessing other x86 CPUs, cache upper level entries, e.g. the
L4 PTE for a given address, to avoid having to do data cache lookups, reserved
bits checked, A/D assists, etc... Like full VA=>PA TLB entries, these entries
are associated with the PCID, VPID, EPT4A, etc...
The data caches are still used when reading PTEs that aren't cached in the TLB,
the extra caching in the "TLB" is optimization on top.
28.3.1 Information That May Be Cached
Section 4.10, “Caching Translation Information” in Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A identifies two kinds of translation-related
information that may be cached by a logical processor: translations, which are mappings
from linear page numbers to physical page frames, and paging-structure caches, which
map the upper bits of a linear page number to information from the paging-structure
entries used to translate linear addresses matching those upper bits.
On Tue, Nov 30, 2021 at 04:01:50PM +0000, Sean Christopherson wrote:
> On Tue, Nov 30, 2021, Peter Xu wrote:
> > On Mon, Nov 29, 2021 at 10:31:14AM -0800, Ben Gardon wrote:
> > > 2. There could be a pointer to the page table in a vCPU's paging
> > > structure caches, which are similar to the TLB but cache partial
> > > translations. These are also cleared out on TLB flush.
> >
> > Could you elaborate what's the structure cache that you mentioned? I thought
> > the processor page walker will just use the data cache (L1-L3) as pgtable
> > caches, in which case IIUC the invalidation happens when we do WRITE_ONCE()
> > that'll invalidate all the rest data cache besides the writer core. But I
> > could be completely missing something..
>
> Ben is referring to the Intel SDM's use of the term "paging-structure caches".
> Intel CPUs, and I'm guessing other x86 CPUs, cache upper level entries, e.g. the
> L4 PTE for a given address, to avoid having to do data cache lookups, reserved
> bits checked, A/D assists, etc... Like full VA=>PA TLB entries, these entries
> are associated with the PCID, VPID, EPT4A, etc...
>
> The data caches are still used when reading PTEs that aren't cached in the TLB,
> the extra caching in the "TLB" is optimization on top.
>
> 28.3.1 Information That May Be Cached
> Section 4.10, “Caching Translation Information” in Intel® 64 and IA-32 Architectures
> Software Developer’s Manual, Volume 3A identifies two kinds of translation-related
> information that may be cached by a logical processor: translations, which are mappings
> from linear page numbers to physical page frames, and paging-structure caches, which
> map the upper bits of a linear page number to information from the paging-structure
> entries used to translate linear addresses matching those upper bits.
Ah, I should have tried harder when reading the spec; I just stopped at
4.10.2... :) They're also described in the general section 4.10.3, along with
how TLB invalidations affect these caches in 4.10.4.
Thanks again to both!
--
Peter Xu