2023-03-06 22:41:43

by Vipin Sharma

Subject: [Patch v4 00/18] NUMA aware page table allocation

Hi,

This series builds on the feedback received on v3.

The biggest feature change is enabling NUMA aware page tables on a
per-VM basis instead of using a module parameter for all VMs on a host.
This was decided based on an internal discussion, to avoid forcing all
VMs on a host to be NUMA aware. We still need to collect more data to
see how much performance degradation a VM can suffer in negative
testing, where its vCPUs always access remote NUMA node memory instead
of staying local, compared to a VM which is not NUMA aware.
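
As a rough usage sketch (not part of this series), enabling the
capability from userspace could look like the code below; the vm_fd
handling and error reporting are illustrative assumptions, and the
capability must be enabled before any vCPU is created or KVM rejects it
with EINVAL:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Opt a VM into NUMA aware page table allocation. */
  static int enable_numa_aware_page_tables(int vm_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,
          };

          /* Must run before KVM_CREATE_VCPU. */
          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }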

Other changes are mentioned in the v4 changelog below.

Thanks
Vipin

v4:
- Removed module parameter for enabling NUMA aware page table.
- Added new capability KVM_CAP_NUMA_AWARE_PAGE_TABLE to enable this
feature per VM.
- Added documentation for the new capability.
- Hold the mutex just before the topup and release it after the
fault/split is handled. The previous version took spinlocks twice,
first for the topup and again when fetching the page from the cache.
- Using the existing slots_lock for split_shadow_page_cache operations.
- KVM MMU shrinker will also shrink mm_shadow_info_cache besides
split_shadow_page_cache and mmu_shadow_page_cache.
- Reduced cache default size to 4.
- Split patches into smaller ones.

v3: https://lore.kernel.org/lkml/[email protected]/
- Split patches into smaller ones.
- Repurposed the KVM MMU shrinker to free cache pages instead of the
oldest page table pages.
- Reduced cache size from 40 to 5
- Removed the __weak function and initialized the node value in all
architectures.
- Some name changes.

v2: https://lore.kernel.org/lkml/[email protected]/
- All page table pages will be allocated on underlying physical page's
NUMA node.
- Introduced module parameter, numa_aware_pagetable, to disable this
feature.
- Using kvm_pfn_to_refcounted_page to get page from a pfn.

v1: https://lore.kernel.org/all/[email protected]/

Vipin Sharma (18):
KVM: x86/mmu: Change KVM mmu shrinker to no-op
KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}
KVM: x86/mmu: Track count of pages in KVM MMU page caches globally
KVM: x86/mmu: Shrink shadow page caches via MMU shrinker
KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU
cache pages
KVM: x86/mmu: Shrink split_shadow_page_cache via MMU shrinker
KVM: x86/mmu: Unconditionally count allocations from MMU page caches
KVM: x86/mmu: Track unused mmu_shadowed_info_cache pages count via
global counter
KVM: x86/mmu: Shrink mmu_shadowed_info_cache via MMU shrinker
KVM: x86/mmu: Add per VM NUMA aware page table capability
KVM: x86/mmu: Add documentation of NUMA aware page table capability
KVM: x86/mmu: Allocate NUMA aware page tables on TDP huge page splits
KVM: mmu: Add common initialization logic for struct
kvm_mmu_memory_cache{}
KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by
default
KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}
KVM: x86/mmu: Allocate numa aware page tables during page fault
KVM: x86/mmu: Allocate shadow mmu page table on huge page split on the
same NUMA node
KVM: x86/mmu: Reduce default mmu memory cache size

Documentation/virt/kvm/api.rst | 29 +++
arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/mmu.c | 2 +-
arch/mips/kvm/mips.c | 3 +
arch/riscv/kvm/mmu.c | 8 +-
arch/riscv/kvm/vcpu.c | 2 +-
arch/x86/include/asm/kvm_host.h | 17 +-
arch/x86/include/asm/kvm_types.h | 6 +-
arch/x86/kvm/mmu/mmu.c | 319 +++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 38 ++++
arch/x86/kvm/mmu/paging_tmpl.h | 29 +--
arch/x86/kvm/mmu/tdp_mmu.c | 23 ++-
arch/x86/kvm/x86.c | 18 +-
include/linux/kvm_host.h | 2 +
include/linux/kvm_types.h | 21 ++
include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 24 ++-
17 files changed, 386 insertions(+), 158 deletions(-)

--
2.40.0.rc0.216.gc4246ad0f0-goog



2023-03-06 22:41:47

by Vipin Sharma

Subject: [Patch v4 01/18] KVM: x86/mmu: Change KVM mmu shrinker to no-op

Remove the page zapping logic from the shrinker. Keep the shrinker
infrastructure in place; it will be reused in future commits to free KVM
page caches.

mmu_shrink_scan() is very disruptive to VMs. It picks the first VM in
the vm_list and zaps the oldest pages, which are most likely upper level
SPTEs and most likely to be reused. Prior to the TDP MMU, this was even
more disruptive in the nested VM case, considering that L1 SPTEs will be
the oldest even though most of the entries are for L2 SPTEs.

As discussed in
https://lore.kernel.org/lkml/[email protected]/ the shrinker logic
has not been very useful in actually keeping VMs performant or reducing
memory usage.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 87 +++---------------------------------------
1 file changed, 5 insertions(+), 82 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..0d07767f7922 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -166,7 +166,6 @@ struct kvm_shadow_walk_iterator {

static struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;

static void mmu_spte_set(u64 *sptep, u64 spte);

@@ -1704,27 +1703,15 @@ static int is_empty_shadow_page(u64 *spt)
}
#endif

-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values. We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
- kvm->arch.n_used_mmu_pages += nr;
- percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, +1);
+ kvm->arch.n_used_mmu_pages++;
kvm_account_pgtable_pages((void *)sp->spt, +1);
}

static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, -1);
+ kvm->arch.n_used_mmu_pages--;
kvm_account_pgtable_pages((void *)sp->spt, -1);
}

@@ -6072,11 +6059,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}

-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
- return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
@@ -6666,66 +6648,13 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- struct kvm *kvm;
- int nr_to_scan = sc->nr_to_scan;
- unsigned long freed = 0;
-
- mutex_lock(&kvm_lock);
-
- list_for_each_entry(kvm, &vm_list, vm_list) {
- int idx;
- LIST_HEAD(invalid_list);
-
- /*
- * Never scan more than sc->nr_to_scan VM instances.
- * Will not hit this condition practically since we do not try
- * to shrink more than one VM and it is very unlikely to see
- * !n_used_mmu_pages so many times.
- */
- if (!nr_to_scan--)
- break;
- /*
- * n_used_mmu_pages is accessed without holding kvm->mmu_lock
- * here. We may skip a VM instance errorneosly, but we do not
- * want to shrink a VM that only started to populate its MMU
- * anyway.
- */
- if (!kvm->arch.n_used_mmu_pages &&
- !kvm_has_zapped_obsolete_pages(kvm))
- continue;
-
- idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
-
- if (kvm_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm,
- &kvm->arch.zapped_obsolete_pages);
- goto unlock;
- }
-
- freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
- write_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
-
- /*
- * unfair on small ones
- * per-vm shrinkers cry out
- * sadness comes quickly
- */
- list_move_tail(&kvm->vm_list, &vm_list);
- break;
- }
-
- mutex_unlock(&kvm_lock);
- return freed;
+ return SHRINK_STOP;
}

static unsigned long
mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
- return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+ return SHRINK_EMPTY;
}

static struct shrinker mmu_shrinker = {
@@ -6840,17 +6769,12 @@ int kvm_mmu_vendor_module_init(void)
if (!mmu_page_header_cache)
goto out;

- if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
- goto out;
-
ret = register_shrinker(&mmu_shrinker, "x86-mmu");
if (ret)
- goto out_shrinker;
+ goto out;

return 0;

-out_shrinker:
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
out:
mmu_destroy_caches();
return ret;
@@ -6867,7 +6791,6 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
void kvm_mmu_vendor_module_exit(void)
{
mmu_destroy_caches();
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
unregister_shrinker(&mmu_shrinker);
}

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:41:50

by Vipin Sharma

Subject: [Patch v4 02/18] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}

Remove zapped_obsolete_pages from struct kvm_arch{} and use a local list
in kvm_zap_obsolete_pages().

The zapped_obsolete_pages list was used in struct kvm_arch{} to provide
pages to the KVM MMU shrinker. Since the KVM MMU shrinker is now a
no-op, the list is not needed.

Signed-off-by: Vipin Sharma <[email protected]>
Reviewed-by: David Matlack <[email protected]>

---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 8 ++++----
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 808c292ad3f4..ebbe692acf3f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1252,7 +1252,6 @@ struct kvm_arch {
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
- struct list_head zapped_obsolete_pages;
/*
* A list of kvm_mmu_page structs that, if zapped, could possibly be
* replaced by an NX huge page. A shadow page is on this list if its
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d07767f7922..3a452989f5cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5947,6 +5947,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
int nr_zapped, batch = 0;
+ LIST_HEAD(invalid_list);
bool unstable;

restart:
@@ -5979,8 +5980,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
goto restart;
}

- unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
- &kvm->arch.zapped_obsolete_pages, &nr_zapped);
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
+ &nr_zapped);
batch += nr_zapped;

if (unstable)
@@ -5996,7 +5997,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
* kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
* running with an obsolete MMU.
*/
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
}

/*
@@ -6072,7 +6073,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
int r;

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
- INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:41:54

by Vipin Sharma

Subject: [Patch v4 04/18] KVM: x86/mmu: Shrink shadow page caches via MMU shrinker

Shrink shadow page caches via the MMU shrinker based on
kvm_total_unused_cached_pages. Traverse each vCPU of all of the VMs,
empty the caches, and exit the shrinker when a sufficient number of
pages has been freed. Also, move processed VMs to the end of vm_list so
that next time other VMs are tortured first.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 55 +++++++++++++++++++++++++++++++++++-----
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 6 ++++-
3 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 13f41b7ac280..df8dcb7e5de7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6693,16 +6693,57 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}

-static unsigned long
-mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
-{
- return SHRINK_STOP;
+static unsigned long mmu_shrink_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct kvm *kvm, *next_kvm, *first_kvm = NULL;
+ struct kvm_mmu_memory_cache *cache;
+ unsigned long i, freed = 0;
+ struct mutex *cache_lock;
+ struct kvm_vcpu *vcpu;
+
+ mutex_lock(&kvm_lock);
+ list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
+ if (first_kvm == kvm)
+ break;
+
+ if (!first_kvm)
+ first_kvm = kvm;
+
+ list_move_tail(&kvm->vm_list, &vm_list);
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ cache = &vcpu->arch.mmu_shadow_page_cache;
+ cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
+ if (mutex_trylock(cache_lock)) {
+ if (cache->nobjs) {
+ freed += cache->nobjs;
+ kvm_mmu_empty_memory_cache(cache);
+ }
+ mutex_unlock(cache_lock);
+ if (freed >= sc->nr_to_scan)
+ goto out;
+ }
+ }
+ }
+out:
+ mutex_unlock(&kvm_lock);
+ if (freed) {
+ percpu_counter_sub(&kvm_total_unused_cached_pages, freed);
+ return freed;
+ } else {
+ return SHRINK_STOP;
+ }
}

-static unsigned long
-mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long mmu_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc)
{
- return SHRINK_EMPTY;
+ s64 count = percpu_counter_sum(&kvm_total_unused_cached_pages);
+
+ WARN_ON(count < 0);
+ return count <= 0 ? SHRINK_EMPTY : count;
+
}

static struct shrinker mmu_shrinker = {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ada23756b0e..5cfa42c130e0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1361,6 +1361,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
+void kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..536d8ab6e61f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -430,7 +430,7 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
return mc->nobjs;
}

-void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+void kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
{
while (mc->nobjs) {
if (mc->kmem_cache)
@@ -438,7 +438,11 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
else
free_page((unsigned long)mc->objects[--mc->nobjs]);
}
+}

+void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+{
+ kvm_mmu_empty_memory_cache(mc);
kvfree(mc->objects);

mc->objects = NULL;
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:41:58

by Vipin Sharma

Subject: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

Create a global counter for the total number of pages available in MMU
page caches across all VMs. Add mmu_shadow_page_cache pages to this
counter.

This accounting will be used in future commits to shrink MMU caches via
the KVM MMU shrinker.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 ++
arch/x86/kvm/mmu/mmu.c | 90 ++++++++++++++++++++++++++++-----
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/paging_tmpl.h | 25 +++++----
arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
5 files changed, 100 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ebbe692acf3f..4322c7020d5d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -791,6 +791,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

+ /*
+ * Protect allocation and release of pages from mmu_shadow_page_cache.
+ */
+ struct mutex mmu_shadow_page_cache_lock;
+
/*
* QEMU userspace and the guest each have their own FPU state.
* In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3a452989f5cd..13f41b7ac280 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -167,6 +167,11 @@ struct kvm_shadow_walk_iterator {
static struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;

+/*
+ * Global count of unused pages in MMU page caches across all VMs.
+ */
+static struct percpu_counter kvm_total_unused_cached_pages;
+
static void mmu_spte_set(u64 *sptep, u64 spte);

struct kvm_mmu_role_regs {
@@ -667,6 +672,34 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}

+/**
+ * Caller should hold mutex lock corresponding to cache, if available.
+ */
+static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
+ int min)
+{
+ int orig_nobjs, r;
+
+ orig_nobjs = cache->nobjs;
+ r = kvm_mmu_topup_memory_cache(cache, min);
+ if (orig_nobjs != cache->nobjs)
+ percpu_counter_add(&kvm_total_unused_cached_pages,
+ (cache->nobjs - orig_nobjs));
+
+ return r;
+}
+
+/**
+ * Caller should hold mutex lock corresponding to kvm_mmu_memory_cache, if
+ * available.
+ */
+static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
+{
+ if (cache->nobjs)
+ percpu_counter_sub(&kvm_total_unused_cached_pages, cache->nobjs);
+ kvm_mmu_free_memory_cache(cache);
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -676,10 +709,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
if (r)
return r;
+
if (maybe_indirect) {
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
PT64_ROOT_MAX_LEVEL);
@@ -693,7 +727,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -2148,6 +2184,7 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *page_header_cache;
struct kvm_mmu_memory_cache *shadow_page_cache;
struct kvm_mmu_memory_cache *shadowed_info_cache;
+ bool count_shadow_page_allocation;
};

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2159,7 +2196,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
+ sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
+ caches->count_shadow_page_allocation);
if (!role.direct)
sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

@@ -2216,6 +2254,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
+ .count_shadow_page_allocation = true,
};

return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -4314,29 +4353,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (r != RET_PF_INVALID)
return r;

+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
r = mmu_topup_memory_caches(vcpu, false);
if (r)
- return r;
+ goto out_page_cache_unlock;

r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
if (r != RET_PF_CONTINUE)
- return r;
+ goto out_page_cache_unlock;

r = RET_PF_RETRY;
write_lock(&vcpu->kvm->mmu_lock);

if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
+ goto out_mmu_unlock;

r = make_mmu_pages_available(vcpu);
if (r)
- goto out_unlock;
+ goto out_mmu_unlock;

r = direct_map(vcpu, fault);

-out_unlock:
+out_mmu_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(fault->pfn);
+out_page_cache_unlock:
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
return r;
}

@@ -4396,25 +4438,28 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
if (r != RET_PF_INVALID)
return r;

+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
r = mmu_topup_memory_caches(vcpu, false);
if (r)
- return r;
+ goto out_page_cache_unlock;

r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
if (r != RET_PF_CONTINUE)
- return r;
+ goto out_page_cache_unlock;

r = RET_PF_RETRY;
read_lock(&vcpu->kvm->mmu_lock);

if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
+ goto out_mmu_unlock;

r = kvm_tdp_mmu_map(vcpu, fault);

-out_unlock:
+out_mmu_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(fault->pfn);
+out_page_cache_unlock:
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
return r;
}
#endif
@@ -5394,6 +5439,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;

+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
if (r)
goto out;
@@ -5420,6 +5466,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
*/
static_call(kvm_x86_flush_tlb_current)(vcpu);
out:
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
return r;
}

@@ -5924,6 +5971,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6769,12 +6817,17 @@ int kvm_mmu_vendor_module_init(void)
if (!mmu_page_header_cache)
goto out;

+ if (percpu_counter_init(&kvm_total_unused_cached_pages, 0, GFP_KERNEL))
+ goto out;
+
ret = register_shrinker(&mmu_shrinker, "x86-mmu");
if (ret)
- goto out;
+ goto out_shrinker;

return 0;

+out_shrinker:
+ percpu_counter_destroy(&kvm_total_unused_cached_pages);
out:
mmu_destroy_caches();
return ret;
@@ -6792,6 +6845,7 @@ void kvm_mmu_vendor_module_exit(void)
{
mmu_destroy_caches();
unregister_shrinker(&mmu_shrinker);
+ percpu_counter_destroy(&kvm_total_unused_cached_pages);
}

/*
@@ -6994,3 +7048,11 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
}
+
+void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
+ bool count_allocation)
+{
+ if (count_allocation && shadow_page_cache->nobjs)
+ percpu_counter_dec(&kvm_total_unused_cached_pages);
+ return kvm_mmu_memory_cache_alloc(shadow_page_cache);
+}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cc58631e2336..798cfbf0a36b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -338,5 +338,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
+ bool count_allocation);

#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 57f0b75c80f9..1dea9be6849d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -821,9 +821,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_EMULATE;
}

+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
r = mmu_topup_memory_caches(vcpu, true);
if (r)
- return r;
+ goto out_page_cache_unlock;

vcpu->arch.write_fault_to_shadow_pgtable = false;

@@ -837,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

r = kvm_faultin_pfn(vcpu, fault, walker.pte_access);
if (r != RET_PF_CONTINUE)
- return r;
+ goto out_page_cache_unlock;

/*
* Do not change pte_access if the pfn is a mmio page, otherwise
@@ -862,16 +863,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
write_lock(&vcpu->kvm->mmu_lock);

if (is_page_fault_stale(vcpu, fault))
- goto out_unlock;
+ goto out_mmu_unlock;

r = make_mmu_pages_available(vcpu);
if (r)
- goto out_unlock;
+ goto out_mmu_unlock;
r = FNAME(fetch)(vcpu, fault, &walker);

-out_unlock:
+out_mmu_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
kvm_release_pfn_clean(fault->pfn);
+out_page_cache_unlock:
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
return r;
}

@@ -897,17 +900,18 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)

vcpu_clear_mmio_info(vcpu, gva);

+ if (!VALID_PAGE(root_hpa)) {
+ WARN_ON(1);
+ return;
+ }
+
+ mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
/*
* No need to check return value here, rmap_can_add() can
* help us to skip pte prefetch later.
*/
mmu_topup_memory_caches(vcpu, true);

- if (!VALID_PAGE(root_hpa)) {
- WARN_ON(1);
- return;
- }
-
write_lock(&vcpu->kvm->mmu_lock);
for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
level = iterator.level;
@@ -943,6 +947,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
break;
}
write_unlock(&vcpu->kvm->mmu_lock);
+ mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
}

/* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7c25dbf32ecc..fa6eb1e9101e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -265,7 +265,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
+ true);

return sp;
}
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:02

by Vipin Sharma

Subject: [Patch v4 05/18] KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU cache pages

Add pages in split_shadow_page_cache to the global counter
kvm_total_unused_cached_pages. These pages will be freed by the MMU
shrinker in a future commit.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df8dcb7e5de7..0ebb8a2eaf47 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6149,7 +6149,9 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
- kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+ mutex_lock(&kvm->slots_lock);
+ mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
+ mutex_unlock(&kvm->slots_lock);
}

void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6303,7 +6305,7 @@ static int topup_split_caches(struct kvm *kvm)
if (r)
return r;

- return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+ return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
}

static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
@@ -6328,6 +6330,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+ caches.count_shadow_page_allocation = true;

/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:05

by Vipin Sharma

Subject: [Patch v4 06/18] KVM: x86/mmu: Shrink split_shadow_page_cache via MMU shrinker

Use the MMU shrinker to free unused pages in split_shadow_page_cache.
Refactor the code and add a common function to try emptying a page cache.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ebb8a2eaf47..73a0ac9c11ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6696,13 +6696,24 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}
}

+static int mmu_memory_cache_try_empty(struct kvm_mmu_memory_cache *cache,
+ struct mutex *cache_lock)
+{
+ int freed = 0;
+
+ if (mutex_trylock(cache_lock)) {
+ freed = cache->nobjs;
+ kvm_mmu_empty_memory_cache(cache);
+ mutex_unlock(cache_lock);
+ }
+ return freed;
+}
+
static unsigned long mmu_shrink_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
struct kvm *kvm, *next_kvm, *first_kvm = NULL;
- struct kvm_mmu_memory_cache *cache;
unsigned long i, freed = 0;
- struct mutex *cache_lock;
struct kvm_vcpu *vcpu;

mutex_lock(&kvm_lock);
@@ -6716,18 +6727,15 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
list_move_tail(&kvm->vm_list, &vm_list);

kvm_for_each_vcpu(i, vcpu, kvm) {
- cache = &vcpu->arch.mmu_shadow_page_cache;
- cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
- if (mutex_trylock(cache_lock)) {
- if (cache->nobjs) {
- freed += cache->nobjs;
- kvm_mmu_empty_memory_cache(cache);
- }
- mutex_unlock(cache_lock);
- if (freed >= sc->nr_to_scan)
- goto out;
- }
+ freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
+ if (freed >= sc->nr_to_scan)
+ goto out;
}
+ freed += mmu_memory_cache_try_empty(&kvm->arch.split_shadow_page_cache,
+ &kvm->slots_lock);
+ if (freed >= sc->nr_to_scan)
+ goto out;
}
out:
mutex_unlock(&kvm_lock);
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:08

by Vipin Sharma

Subject: [Patch v4 07/18] KVM: x86/mmu: Unconditionally count allocations from MMU page caches

Remove count_shadow_page_allocation from struct shadow_page_caches{}.
Remove the count_allocation boolean condition check from
mmu_sp_memory_cache_alloc().

Both split_shadow_page_cache and mmu_shadow_page_cache are now counted
in the global count of unused cache pages, so the
count_shadow_page_allocation boolean is obsolete and can be removed.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 11 +++--------
arch/x86/kvm/mmu/mmu_internal.h | 3 +--
arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73a0ac9c11ce..0a0962d8108b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2184,7 +2184,6 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *page_header_cache;
struct kvm_mmu_memory_cache *shadow_page_cache;
struct kvm_mmu_memory_cache *shadowed_info_cache;
- bool count_shadow_page_allocation;
};

static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
@@ -2196,8 +2195,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
- sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
- caches->count_shadow_page_allocation);
+ sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache);
if (!role.direct)
sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

@@ -2254,7 +2252,6 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
- .count_shadow_page_allocation = true,
};

return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -6330,7 +6327,6 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
- caches.count_shadow_page_allocation = true;

/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -7101,10 +7097,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
}

-void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
- bool count_allocation)
+void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache)
{
- if (count_allocation && shadow_page_cache->nobjs)
+ if (shadow_page_cache->nobjs)
percpu_counter_dec(&kvm_total_unused_cached_pages);
return kvm_mmu_memory_cache_alloc(shadow_page_cache);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 798cfbf0a36b..a607314348e3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -338,7 +338,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);

void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
- bool count_allocation);
+void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache);

#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fa6eb1e9101e..d1e85012a008 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -265,8 +265,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
- true);
+ sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);

return sp;
}
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:17

by Vipin Sharma

Subject: [Patch v4 08/18] KVM: x86/mmu: Track unused mmu_shadowed_info_cache pages count via global counter

Add unused pages in mmu_shadowed_info_cache to the global MMU unused
page cache counter, i.e. kvm_total_unused_cached_pages. These pages
will be freed by the MMU shrinker in a future commit.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++-
arch/x86/kvm/mmu/mmu.c | 8 ++++----
2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4322c7020d5d..185719dbeb81 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -792,7 +792,8 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_header_cache;

/*
- * Protect allocation and release of pages from mmu_shadow_page_cache.
+ * Protect allocation and release of pages from mmu_shadow_page_cache
+ * and mmu_shadowed_info_cache.
*/
struct mutex mmu_shadow_page_cache_lock;

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0a0962d8108b..b7ca31b5699c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -715,8 +715,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
return r;

if (maybe_indirect) {
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
+ PT64_ROOT_MAX_LEVEL);
if (r)
return r;
}
@@ -729,8 +729,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}

@@ -2197,7 +2197,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache);
if (!role.direct)
- sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
+ sp->shadowed_translation = mmu_sp_memory_cache_alloc(caches->shadowed_info_cache);

set_page_private(virt_to_page(sp->spt), (unsigned long)sp);

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:20

by Vipin Sharma

Subject: [Patch v4 09/18] KVM: x86/mmu: Shrink mmu_shadowed_info_cache via MMU shrinker

Shrink mmu_shadowed_info_cache via the MMU shrinker based on
kvm_total_unused_cached_pages.

Tested by running dirty_log_perf_test while dropping caches via
"echo 2 > /proc/sys/vm/drop_caches" at a 1 second interval. The global
counter always returns to 0. The shrinker also gets invoked to remove
pages from the caches.

The above test was run with three configurations:
- EPT=N
- EPT=Y, TDP_MMU=N
- EPT=Y, TDP_MMU=Y

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b7ca31b5699c..a4bf2e433030 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6725,6 +6725,8 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
kvm_for_each_vcpu(i, vcpu, kvm) {
freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadow_page_cache,
&vcpu->arch.mmu_shadow_page_cache_lock);
+ freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadowed_info_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
if (freed >= sc->nr_to_scan)
goto out;
}
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:32

by Vipin Sharma

Subject: [Patch v4 10/18] KVM: x86/mmu: Add per VM NUMA aware page table capability

Add KVM_CAP_NUMA_AWARE_PAGE_TABLE capability. This capability enables a
VM to allocate its page tables, specifically lower level page tables, on
the NUMA node of the underlying leaf physical page pointed to by the
page table entry.

This patch only adds the option; future patches will use the boolean
numa_aware_page_table to allocate page tables on the appropriate NUMA
node.

For now this capability is x86 only; it can be extended to other
architectures in the future if needed.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/x86.c | 10 ++++++++++
include/uapi/linux/kvm.h | 1 +
3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 185719dbeb81..64de083cd6b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1467,6 +1467,12 @@ struct kvm_arch {
*/
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
+
+ /*
+ * If true then allocate page tables near to underlying physical page
+ * NUMA node.
+ */
+ bool numa_aware_page_table;
};

struct kvm_vm_stat {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f706621c35b8..71728abd7f92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4425,6 +4425,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VAPIC:
case KVM_CAP_ENABLE_CAP:
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+ case KVM_CAP_NUMA_AWARE_PAGE_TABLE:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
@@ -6391,6 +6392,15 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
}
mutex_unlock(&kvm->lock);
break;
+ case KVM_CAP_NUMA_AWARE_PAGE_TABLE:
+ r = -EINVAL;
+ mutex_lock(&kvm->lock);
+ if (!kvm->created_vcpus) {
+ kvm->arch.numa_aware_page_table = true;
+ r = 0;
+ }
+ mutex_unlock(&kvm->lock);
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d77aef872a0a..5f367a93762a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1184,6 +1184,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_NUMA_AWARE_PAGE_TABLE 227

#ifdef KVM_CAP_IRQ_ROUTING

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:39

by Vipin Sharma

Subject: [Patch v4 11/18] KVM: x86/mmu: Add documentation of NUMA aware page table capability

Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
explain why it is needed.

Signed-off-by: Vipin Sharma <[email protected]>
---
Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 62de0768d6aa..7e3a1299ca8e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
cause CPU stuck (due to event windows don't open up) and make the CPU
unavailable to host or other VMs.

+7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
+------------------------------
+
+:Architectures: x86
+:Target: VM
+:Returns: 0 on success, -EINVAL if vCPUs are already created.
+
+This capability allows userspace to enable NUMA aware page tables allocations.
+NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
+creation, any page table allocated during the life of a VM will be allocated
+preferably from the NUMA node of the leaf page.
+
+Without this capability, default feature is to use current thread mempolicy and
+allocate page table based on that.
+
+This capability is useful to improve page accesses by a guest. For example, an
+initialization thread which access lots of remote memory and ends up creating
+page tables on local NUMA node, or some service thread allocates memory on
+remote NUMA nodes and later worker/background threads accessing that memory
+will end up accessing remote NUMA node page tables. So, a multi NUMA node
+guest, can with high confidence access local memory faster instead of going
+through remote page tables first.
+
+This capability is also helpful for host to reduce live migration impact when
+splitting huge pages during dirty log operations. If the thread splitting huge
+page is on remote NUMA node it will create page tables on remote node. Even if
+guest is careful in making sure that it only access local memory they will end
+up accessing remote page tables.
+
8. Other capabilities.
======================

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:51

by Vipin Sharma

Subject: [Patch v4 12/18] KVM: x86/mmu: Allocate NUMA aware page tables on TDP huge page splits

When splitting a huge page, try to allocate the new lower level page
tables on the same NUMA node as the huge page. Only do NUMA aware page
splits if KVM has NUMA aware page tables enabled for the VM, else fall
back to the default method of using the current thread's mempolicy.

When huge pages are split for dirty logging, new page tables are created
based on the current thread's mempolicy, which by default will be the
NUMA node of the pCPU executing the thread. If the thread enabling dirty
logging runs on a NUMA node remote from the huge page's NUMA node, then
all page tables mapping the 4KiB pages of that huge page will be created
on the remote node. This reduces the performance of vCPUs which are NUMA
bound and only access local NUMA memory, as they will have to go through
remote NUMA node page tables to reach their local NUMA node memory.

Tested this feature with a synthetic read-write-heavy workload in a 416
vCPU VM on an 8 NUMA node host. This workload creates multiple threads,
partitions the data into equal sizes, and assigns a partition to each
thread. Each thread iterates over its own data in strides, reading and
writing values in its partition. While executing, the workload
continuously outputs the combined rate at which it is performing
operations.

When dirty logging is enabled in WRPROT mode, the workload's performance:
- Without NUMA aware page tables drops by ~75%
- With NUMA aware page tables drops by ~20%

Raw data from one example run:
1. Without NUMA aware page table
Before dirty log: ~2750000 accesses/sec
After dirty log: ~700000 accesses/sec

2. With NUMA aware page table
Before dirty log: ~2750000 accesses/sec
After dirty log: ~2250000 accesses/sec

NUMA aware page tables improved performance by more than 200%.
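
For illustration only, below is a minimal sketch of the kind of striding
read-write workload described above; the thread count, partition size,
and stride are made-up values and this is not the actual test program
used for the numbers quoted here:

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define NR_THREADS 8
  #define PART_SIZE  (64UL << 20)  /* 64 MiB of data per thread */
  #define STRIDE     64            /* touch one cache line per step */

  static atomic_ulong total_ops;

  struct part {
          uint8_t *base;
          size_t size;
  };

  static void *worker(void *arg)
  {
          struct part *p = arg;
          size_t i;

          for (;;) {
                  /* Stride through the private partition, read-modify-write. */
                  for (i = 0; i < p->size; i += STRIDE)
                          p->base[i] += 1;
                  atomic_fetch_add(&total_ops, p->size / STRIDE);
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t threads[NR_THREADS];
          struct part parts[NR_THREADS];
          uint8_t *buf = calloc(NR_THREADS, PART_SIZE);
          unsigned long prev = 0, cur;
          int i;

          if (!buf)
                  return 1;

          for (i = 0; i < NR_THREADS; i++) {
                  parts[i].base = buf + (size_t)i * PART_SIZE;
                  parts[i].size = PART_SIZE;
                  pthread_create(&threads[i], NULL, worker, &parts[i]);
          }

          /* Continuously report the combined access rate. */
          for (;;) {
                  sleep(1);
                  cur = atomic_load(&total_ops);
                  printf("%lu accesses/sec\n", cur - prev);
                  prev = cur;
          }
  }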

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 15 +++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 9 +++++----
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 16 ++++++++++++++++
4 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a607314348e3..b9d0e09ae974 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -340,4 +340,19 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache);

+static inline int kvm_pfn_to_page_table_nid(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ struct page *page;
+
+ if (!kvm->arch.numa_aware_page_table)
+ return NUMA_NO_NODE;
+
+ page = kvm_pfn_to_refcounted_page(pfn);
+
+ if (page)
+ return page_to_nid(page);
+ else
+ return numa_mem_id();
+}
+
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d1e85012a008..61fd9c177694 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1412,7 +1412,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
return spte_set;
}

-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, int nid)
{
struct kvm_mmu_page *sp;

@@ -1422,7 +1422,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
if (!sp)
return NULL;

- sp->spt = (void *)__get_free_page(gfp);
+ sp->spt = kvm_mmu_get_free_page(gfp, nid);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
@@ -1435,6 +1435,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
struct tdp_iter *iter,
bool shared)
{
+ int nid = kvm_pfn_to_page_table_nid(kvm, spte_to_pfn(iter->old_spte));
struct kvm_mmu_page *sp;

/*
@@ -1446,7 +1447,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
- sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
if (sp)
return sp;

@@ -1458,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
write_unlock(&kvm->mmu_lock);

iter->yielded = true;
- sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+ sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, nid);

if (shared)
read_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5cfa42c130e0..31586a65e346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1358,6 +1358,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);
void kvm_flush_remote_tlbs(struct kvm *kvm);

#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
+void *kvm_mmu_get_free_page(gfp_t gfp, int nid);
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 536d8ab6e61f..47006d209309 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -377,6 +377,22 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
}

#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
+
+void *kvm_mmu_get_free_page(gfp_t gfp, int nid)
+{
+#ifdef CONFIG_NUMA
+ struct page *page;
+
+ if (nid != NUMA_NO_NODE) {
+ page = alloc_pages_node(nid, gfp, 0);
+ if (!page)
+ return (void *)0;
+ return page_address(page);
+ }
+#endif /* CONFIG_NUMA */
+ return (void *)__get_free_page(gfp);
+}
+
static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
gfp_t gfp_flags)
{
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:42:57

by Vipin Sharma

Subject: [Patch v4 13/18] KVM: mmu: Add common initialization logic for struct kvm_mmu_memory_cache{}

Add macros and a function to provide common logic for declaring and
initializing struct kvm_mmu_memory_cache{}.

Any user that wants different values in struct kvm_mmu_memory_cache{}
will overwrite the default values explicitly after the initialization.

Suggested-by: David Matlack <[email protected]>
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/mmu.c | 3 ++-
arch/riscv/kvm/mmu.c | 9 +++++----
arch/riscv/kvm/vcpu.c | 1 +
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
include/linux/kvm_types.h | 10 ++++++++++
6 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 3bd732eaf087..2b3d88e4ace8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -330,6 +330,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.target = -1;
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;

/*
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ff..8a56f071ca66 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -895,7 +895,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
{
phys_addr_t addr;
int ret = 0;
- struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
+ KVM_MMU_MEMORY_CACHE(cache);
struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
KVM_PGTABLE_PROT_R |
@@ -904,6 +904,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
if (is_protected_kvm_enabled())
return -EPERM;

+ cache.gfp_zero = __GFP_ZERO;
size += offset_in_page(guest_ipa);
guest_ipa &= PAGE_MASK;

diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index 78211aed36fa..bdd8c17958dd 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
int ret = 0;
unsigned long pfn;
phys_addr_t addr, end;
- struct kvm_mmu_memory_cache pcache = {
- .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
- .gfp_zero = __GFP_ZERO,
- };
+ KVM_MMU_MEMORY_CACHE(pcache);
+
+ pcache.gfp_zero = __GFP_ZERO;
+ if (in_atomic)
+ pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;

end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
pfn = __phys_to_pfn(hpa);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 7d010b0be54e..bc743e9122d1 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -163,6 +163,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

/* Mark this VCPU never ran */
vcpu->arch.ran_atleast_once = false;
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a4bf2e433030..b706087ef74e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5961,15 +5961,20 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
int ret;

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
+
vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;

@@ -6131,11 +6136,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
kvm_page_track_register_notifier(kvm, node);

+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;

+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;

+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 2728d49bbdf6..192516eeccac 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -98,6 +98,16 @@ struct kvm_mmu_memory_cache {
int capacity;
void **objects;
};
+
+#define KVM_MMU_MEMORY_CACHE_INIT() { }
+
+#define KVM_MMU_MEMORY_CACHE(_name) \
+ struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
+
+static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
+{
+ *cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
+}
#endif

#define HALT_POLL_HIST_COUNT 32
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:43:08

by Vipin Sharma

Subject: [Patch v4 14/18] KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by default

Set gfp_zero to __GFP_ZERO in the default initialization of struct
kvm_mmu_memory_cache{}.

All users of the default initialization code of struct
kvm_mmu_memory_cache{} explicitly set gfp_zero to __GFP_ZERO. This can
be moved to the common initialization logic.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/arm64/kvm/arm.c | 1 -
arch/arm64/kvm/mmu.c | 1 -
arch/riscv/kvm/mmu.c | 1 -
arch/riscv/kvm/vcpu.c | 1 -
arch/x86/kvm/mmu/mmu.c | 6 ------
include/linux/kvm_types.h | 4 +++-
6 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2b3d88e4ace8..b4243978d962 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -331,7 +331,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
- vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;

/*
* Default value for the FP state, will be overloaded at load
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8a56f071ca66..133eba96c41f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -904,7 +904,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
if (is_protected_kvm_enabled())
return -EPERM;

- cache.gfp_zero = __GFP_ZERO;
size += offset_in_page(guest_ipa);
guest_ipa &= PAGE_MASK;

diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index bdd8c17958dd..62550fd91c70 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -353,7 +353,6 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
phys_addr_t addr, end;
KVM_MMU_MEMORY_CACHE(pcache);

- pcache.gfp_zero = __GFP_ZERO;
if (in_atomic)
pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index bc743e9122d1..f5a96ed1e426 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -164,7 +164,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
/* Mark this VCPU never ran */
vcpu->arch.ran_atleast_once = false;
INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
- vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);

/* Setup ISA features available to VCPU */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b706087ef74e..d96afc849ee8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5963,14 +5963,11 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
- vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
- vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
- vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
@@ -6138,14 +6135,11 @@ int kvm_mmu_init_vm(struct kvm *kvm)

INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
- kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;

INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
- kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;

INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
- kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;

return 0;
}
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 192516eeccac..5da7953532ce 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -99,7 +99,9 @@ struct kvm_mmu_memory_cache {
void **objects;
};

-#define KVM_MMU_MEMORY_CACHE_INIT() { }
+#define KVM_MMU_MEMORY_CACHE_INIT() { \
+ .gfp_zero = __GFP_ZERO, \
+}

#define KVM_MMU_MEMORY_CACHE(_name) \
struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:43:16

by Vipin Sharma

[permalink] [raw]
Subject: [Patch v4 15/18] KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}

Add a NUMA node id field to struct kvm_mmu_memory_cache{}. This field
denotes the preferred NUMA node from which memory will be allocated for
this memory cache.

Set this field to NUMA_NO_NODE if there is no preferred node.

MIPS doesn't do any initialization of struct kvm_mmu_memory_cache{}.
Keep MIPS behavior unchanged by explicitly setting gfp_zero back to 0,
as INIT_KVM_MMU_MEMORY_CACHE() now initializes it to __GFP_ZERO.

"node" cannot be left as 0, as 0 is a valid NUMA node id.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/mips/kvm/mips.c | 3 +++
include/linux/kvm_types.h | 3 +++
2 files changed, 6 insertions(+)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 36c8991b5d39..5ec5ce919918 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -294,6 +294,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
HRTIMER_MODE_REL);
vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;

+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
+ vcpu->arch.mmu_page_cache.gfp_zero = 0;
+
/*
* Allocate space for host mode exception handlers that handle
* guest mode exits
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 5da7953532ce..b2a405c8e629 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -97,10 +97,13 @@ struct kvm_mmu_memory_cache {
struct kmem_cache *kmem_cache;
int capacity;
void **objects;
+ /* Preferred NUMA node of memory allocation. */
+ int node;
};

#define KVM_MMU_MEMORY_CACHE_INIT() { \
.gfp_zero = __GFP_ZERO, \
+ .node = NUMA_NO_NODE, \
}

#define KVM_MMU_MEMORY_CACHE(_name) \
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:43:19

by Vipin Sharma

[permalink] [raw]
Subject: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

Allocate page tables on the preferred NUMA node via the memory caches
during page faults. If a memory cache doesn't have a preferred NUMA node
(its node value is set to NUMA_NO_NODE) then fall back to the default
logic where pages are allocated based on the thread's mempolicy. Also,
free the NUMA aware page caches, mmu_shadow_page_cache, when the memory
shrinker is invoked.

Allocate root pages based on the current thread's NUMA node as there is
no way to know which will be the ideal NUMA node in the long run.

This commit allocates page tables on the same NUMA node as the physical
page they point to, even if the vCPU causing the page fault is on a
different NUMA node. If memory is not available on the requested NUMA
node then the nearest NUMA node is selected by default. NUMA aware page
tables can be beneficial in cases where a thread touches a lot of far
memory initially and then divides the work among multiple threads. VMs
generally take advantage of the NUMA architecture for faster memory
access by moving threads to the NUMA node of the memory they are
accessing. This change helps them access those pages faster.

The downside of this change is that a synthetic worst-case workload can
be constructed where guest threads always access remote memory instead
of the memory local to them. This will degrade performance compared to
VMs where NUMA aware page tables are not enabled. Ideally, VMs running
on a non-uniform memory access machine should be taking advantage of the
NUMA architecture to improve their performance in the first place.
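
Condensed from the TDP MMU hunks below (an illustrative summary, not
additional code), the fault path boils down to:

	/*
	 * Cache index comes from the faulting pfn's node; it falls back to
	 * index 0 (KVM_MMU_DEFAULT_CACHE_INDEX) when NUMA awareness is off.
	 */
	nid = kvm_pfn_to_mmu_cache_nid(kvm, fault->pfn);
	...
	sp = tdp_mmu_alloc_sp(vcpu, nid);	/* uses mmu_shadow_page_cache[nid] */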

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++++---------
arch/x86/kvm/mmu/mmu_internal.h | 24 ++++++++++++-
arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++---
include/linux/kvm_types.h | 6 ++++
virt/kvm/kvm_main.c | 2 +-
7 files changed, 88 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64de083cd6b9..77d3aa368e5e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;

struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
- struct kvm_mmu_memory_cache mmu_shadow_page_cache;
+ struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d96afc849ee8..86f0d74d35ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)

static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
- int r;
+ int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;

/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
@@ -710,7 +710,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;

- r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
+ if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
+ for_each_online_node(nid) {
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
+ PT64_ROOT_MAX_LEVEL);
+ }
+ } else {
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
+ PT64_ROOT_MAX_LEVEL);
+ }
+
if (r)
return r;

@@ -726,9 +735,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)

static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
+ int nid;
+
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
- mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ for_each_node(nid)
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid]);
mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
@@ -2245,12 +2257,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
}

static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
+ gfn_t gfn, int nid,
union kvm_mmu_page_role role)
{
struct shadow_page_caches caches = {
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
- .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
};

@@ -2305,15 +2317,18 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,

static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
u64 *sptep, gfn_t gfn,
- bool direct, unsigned int access)
+ bool direct, unsigned int access,
+ kvm_pfn_t pfn)
{
union kvm_mmu_page_role role;
+ int nid;

if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
return ERR_PTR(-EEXIST);

role = kvm_mmu_child_role(sptep, direct, access);
- return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ nid = kvm_pfn_to_mmu_cache_nid(vcpu->kvm, pfn);
+ return kvm_mmu_get_shadow_page(vcpu, gfn, nid, role);
}

static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -3205,7 +3220,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (it.level == fault->goal_level)
break;

- sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
+ ACC_ALL, fault->pfn);
if (sp == ERR_PTR(-EEXIST))
continue;

@@ -3625,6 +3641,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
{
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
struct kvm_mmu_page *sp;
+ int nid;

role.level = level;
role.quadrant = quadrant;
@@ -3632,7 +3649,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);

- sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ nid = kvm_mmu_root_page_cache_nid(vcpu->kvm);
+ sp = kvm_mmu_get_shadow_page(vcpu, gfn, nid, role);
++sp->root_count;

return __pa(sp->spt);
@@ -5959,7 +5977,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)

int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
- int ret;
+ int ret, nid;

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
@@ -5967,7 +5985,12 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;

- INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
+ for_each_node(nid) {
+ INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid]);
+ if (kvm_numa_aware_page_table_enabled(vcpu->kvm))
+ vcpu->arch.mmu_shadow_page_cache[nid].node = nid;
+ }
+
mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
@@ -6695,13 +6718,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
}

static int mmu_memory_cache_try_empty(struct kvm_mmu_memory_cache *cache,
- struct mutex *cache_lock)
+ int cache_count, struct mutex *cache_lock)
{
- int freed = 0;
+ int freed = 0, nid;

if (mutex_trylock(cache_lock)) {
- freed = cache->nobjs;
- kvm_mmu_empty_memory_cache(cache);
+ for (nid = 0; nid < cache_count; nid++) {
+ if (!cache[nid].nobjs)
+ continue;
+ freed += cache[nid].nobjs;
+ kvm_mmu_empty_memory_cache(&cache[nid]);
+ }
mutex_unlock(cache_lock);
}
return freed;
@@ -6725,15 +6752,17 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
list_move_tail(&kvm->vm_list, &vm_list);

kvm_for_each_vcpu(i, vcpu, kvm) {
- freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadow_page_cache,
+ freed += mmu_memory_cache_try_empty(vcpu->arch.mmu_shadow_page_cache,
+ MAX_NUMNODES,
&vcpu->arch.mmu_shadow_page_cache_lock);
freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadowed_info_cache,
+ 1,
&vcpu->arch.mmu_shadow_page_cache_lock);
if (freed >= sc->nr_to_scan)
goto out;
}
freed += mmu_memory_cache_try_empty(&kvm->arch.split_shadow_page_cache,
- &kvm->slots_lock);
+ 1, &kvm->slots_lock);
if (freed >= sc->nr_to_scan)
goto out;
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b9d0e09ae974..652fd0c2bcba 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -340,11 +340,16 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache);

+static inline bool kvm_numa_aware_page_table_enabled(struct kvm *kvm)
+{
+ return kvm->arch.numa_aware_page_table;
+}
+
static inline int kvm_pfn_to_page_table_nid(struct kvm *kvm, kvm_pfn_t pfn)
{
struct page *page;

- if (!kvm->arch.numa_aware_page_table)
+ if (!kvm_numa_aware_page_table_enabled(kvm))
return NUMA_NO_NODE;

page = kvm_pfn_to_refcounted_page(pfn);
@@ -355,4 +360,21 @@ static inline int kvm_pfn_to_page_table_nid(struct kvm *kvm, kvm_pfn_t pfn)
return numa_mem_id();
}

+static inline int kvm_pfn_to_mmu_cache_nid(struct kvm *kvm, kvm_pfn_t pfn)
+{
+ int index = kvm_pfn_to_page_table_nid(kvm, pfn);
+
+ if (index == NUMA_NO_NODE)
+ return KVM_MMU_DEFAULT_CACHE_INDEX;
+
+ return index;
+}
+
+static inline int kvm_mmu_root_page_cache_nid(struct kvm *kvm)
+{
+ if (kvm_numa_aware_page_table_enabled(kvm))
+ return numa_mem_id();
+
+ return KVM_MMU_DEFAULT_CACHE_INDEX;
+}
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1dea9be6849d..9db8b3df434d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
table_gfn = gw->table_gfn[it.level - 2];
access = gw->pt_access[it.level - 2];
sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
- false, access);
+ false, access, fault->pfn);

if (sp != ERR_PTR(-EEXIST)) {
/*
@@ -706,7 +706,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
validate_direct_spte(vcpu, it.sptep, direct_access);

sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
- true, direct_access);
+ true, direct_access, fault->pfn);
if (sp == ERR_PTR(-EEXIST))
continue;

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 61fd9c177694..63113a66f560 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -260,12 +260,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
kvm_mmu_page_as_id(_root) != _as_id) { \
} else

-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
{
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);

return sp;
}
@@ -304,6 +304,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
+ int nid;

lockdep_assert_held_write(&kvm->mmu_lock);

@@ -317,7 +318,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
goto out;
}

- root = tdp_mmu_alloc_sp(vcpu);
+ nid = kvm_mmu_root_page_cache_nid(vcpu->kvm);
+ root = tdp_mmu_alloc_sp(vcpu, nid);
tdp_mmu_init_sp(root, NULL, 0, role);

refcount_set(&root->tdp_mmu_root_count, 1);
@@ -1149,12 +1151,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
- int ret = RET_PF_RETRY;
+ int ret = RET_PF_RETRY, nid;

kvm_mmu_hugepage_adjust(vcpu, fault);

trace_kvm_mmu_spte_requested(fault);

+ nid = kvm_pfn_to_mmu_cache_nid(kvm, fault->pfn);
+
rcu_read_lock();

tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
@@ -1182,7 +1186,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* The SPTE is either non-present or points to a huge page that
* needs to be split.
*/
- sp = tdp_mmu_alloc_sp(vcpu);
+ sp = tdp_mmu_alloc_sp(vcpu, nid);
tdp_mmu_init_child_sp(sp, &iter);

sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index b2a405c8e629..13032da2ddfc 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -113,6 +113,12 @@ static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
{
*cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
}
+
+/*
+ * When NUMA aware page table option is disabled for a VM then use cache at the
+ * below index in the array of NUMA caches.
+ */
+#define KVM_MMU_DEFAULT_CACHE_INDEX 0
#endif

#define HALT_POLL_HIST_COUNT 32
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 47006d209309..25a549705c8e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -401,7 +401,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
if (mc->kmem_cache)
return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
else
- return (void *)__get_free_page(gfp_flags);
+ return kvm_mmu_get_free_page(gfp_flags, mc->node);
}

int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:43:22

by Vipin Sharma

[permalink] [raw]
Subject: [Patch v4 17/18] KVM: x86/mmu: Allocate shadow mmu page table on huge page split on the same NUMA node

When splitting a huge page and the NUMA aware page table option is
enabled, try to allocate the new lower level page tables on the same
NUMA node as the huge page. If the NUMA aware page table option is
disabled then fall back to the default policy of using the current
thread's mempolicy.
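
Condensed from the diff below (illustrative only, with the locking and
retry details omitted), the split path derives the node from the huge
SPTE's pfn and uses the matching per-node split cache:

	nid = kvm_pfn_to_mmu_cache_nid(kvm, spte_to_pfn(spte));
	if (need_topup_split_caches_or_resched(kvm, nid))
		r = topup_split_caches(kvm, nid) ?: -EAGAIN;
	else
		shadow_mmu_split_huge_page(kvm, slot, huge_sptep, nid);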

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 42 ++++++++++++++++++++-------------
arch/x86/kvm/x86.c | 8 ++++++-
3 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77d3aa368e5e..041302d6132c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1453,7 +1453,7 @@ struct kvm_arch {
*
* Protected by kvm->slots_lock.
*/
- struct kvm_mmu_memory_cache split_shadow_page_cache;
+ struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
struct kvm_mmu_memory_cache split_page_header_cache;

/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 86f0d74d35ed..6d44a4e08328 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6140,7 +6140,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
int kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
- int r;
+ int r, nid;

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
@@ -6159,7 +6159,9 @@ int kvm_mmu_init_vm(struct kvm *kvm)
INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;

- INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
+ for_each_node(nid)
+ INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid]);
+

INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
@@ -6169,10 +6171,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)

static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
+ int nid;
+
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
mutex_lock(&kvm->slots_lock);
- mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
+ for_each_node(nid)
+ mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid]);
mutex_unlock(&kvm->slots_lock);
}

@@ -6282,7 +6287,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
}

-static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
{
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
return true;
@@ -6294,10 +6299,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
*/
return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
need_topup(&kvm->arch.split_page_header_cache, 1) ||
- need_topup(&kvm->arch.split_shadow_page_cache, 1);
+ need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
}

-static int topup_split_caches(struct kvm *kvm)
+static int topup_split_caches(struct kvm *kvm, int nid)
{
/*
* Allocating rmap list entries when splitting huge pages for nested
@@ -6327,10 +6332,11 @@ static int topup_split_caches(struct kvm *kvm)
if (r)
return r;

- return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+ return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid], 1);
}

-static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep,
+ int nid)
{
struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
struct shadow_page_caches caches = {};
@@ -6351,7 +6357,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu

/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
- caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];

/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6359,7 +6365,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu

static void shadow_mmu_split_huge_page(struct kvm *kvm,
const struct kvm_memory_slot *slot,
- u64 *huge_sptep)
+ u64 *huge_sptep, int nid)

{
struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
@@ -6370,7 +6376,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
gfn_t gfn;
int index;

- sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+ sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, nid);

for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
sptep = &sp->spt[index];
@@ -6408,7 +6414,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
u64 *huge_sptep)
{
struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
- int level, r = 0;
+ int level, r = 0, nid;
gfn_t gfn;
u64 spte;

@@ -6422,7 +6428,9 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
goto out;
}

- if (need_topup_split_caches_or_resched(kvm)) {
+ nid = kvm_pfn_to_mmu_cache_nid(kvm, spte_to_pfn(spte));
+
+ if (need_topup_split_caches_or_resched(kvm, nid)) {
write_unlock(&kvm->mmu_lock);
cond_resched();
/*
@@ -6430,12 +6438,12 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
* rmap iterator should be restarted because the MMU lock was
* dropped.
*/
- r = topup_split_caches(kvm) ?: -EAGAIN;
+ r = topup_split_caches(kvm, nid) ?: -EAGAIN;
write_lock(&kvm->mmu_lock);
goto out;
}

- shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
+ shadow_mmu_split_huge_page(kvm, slot, huge_sptep, nid);

out:
trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
@@ -6761,8 +6769,8 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
if (freed >= sc->nr_to_scan)
goto out;
}
- freed += mmu_memory_cache_try_empty(&kvm->arch.split_shadow_page_cache,
- 1, &kvm->slots_lock);
+ freed += mmu_memory_cache_try_empty(kvm->arch.split_shadow_page_cache,
+ MAX_NUMNODES, &kvm->slots_lock);
if (freed >= sc->nr_to_scan)
goto out;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 71728abd7f92..d8ea39b248cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6176,7 +6176,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
- int r;
+ int r, nid;

if (cap->flags)
return -EINVAL;
@@ -6397,6 +6397,12 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
mutex_lock(&kvm->lock);
if (!kvm->created_vcpus) {
kvm->arch.numa_aware_page_table = true;
+
+ mutex_lock(&kvm->slots_lock);
+ for_each_node(nid) {
+ kvm->arch.split_shadow_page_cache[nid].node = nid;
+ }
+ mutex_unlock(&kvm->slots_lock);
r = 0;
}
mutex_unlock(&kvm->lock);
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-06 22:43:25

by Vipin Sharma

[permalink] [raw]
Subject: [Patch v4 18/18] KVM: x86/mmu: Reduce default mmu memory cache size

Reduce KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to PT64_ROOT_MAX_LEVEL - 1.
Opportunistically, use this reduced value for topping up caches.

There was no specific reason to set this value to 40. With the addition
of the per NUMA node caches it is good to save space and keep these
caches lean; with PT64_ROOT_MAX_LEVEL equal to 5 this reduces the
default cache size from 40 to 4 objects per cache.

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_types.h | 6 +++++-
arch/x86/kvm/mmu/mmu.c | 8 ++++----
2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_types.h b/arch/x86/include/asm/kvm_types.h
index 08f1b57d3b62..80aff231b708 100644
--- a/arch/x86/include/asm/kvm_types.h
+++ b/arch/x86/include/asm/kvm_types.h
@@ -2,6 +2,10 @@
#ifndef _ASM_X86_KVM_TYPES_H
#define _ASM_X86_KVM_TYPES_H

-#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE 40
+/*
+ * For each fault only PT64_ROOT_MAX_LEVEL - 1 pages are needed. Root
+ * page is allocated in a separate flow.
+ */
+#define KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE (PT64_ROOT_MAX_LEVEL - 1)

#endif /* _ASM_X86_KVM_TYPES_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d44a4e08328..5463ce6e52fa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -713,11 +713,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
for_each_online_node(nid) {
r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
- PT64_ROOT_MAX_LEVEL);
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
}
} else {
r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
- PT64_ROOT_MAX_LEVEL);
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
}

if (r)
@@ -725,12 +725,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)

if (maybe_indirect) {
r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
- PT64_ROOT_MAX_LEVEL);
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
if (r)
return r;
}
return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
- PT64_ROOT_MAX_LEVEL);
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
}

static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-07 11:35:08

by kernel test robot

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

Hi Vipin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvmarm/next linus/master v6.3-rc1 next-20230307]
[cannot apply to mst-vhost/linux-next kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Vipin-Sharma/KVM-x86-mmu-Change-KVM-mmu-shrinker-to-no-op/20230307-064510
base: https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/r/20230306224127.1689967-4-vipinsh%40google.com
patch subject: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally
config: i386-randconfig-a003-20230306 (https://download.01.org/0day-ci/archive/20230307/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/511e837798da25063830276b8a3345c7601c6459
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vipin-Sharma/KVM-x86-mmu-Change-KVM-mmu-shrinker-to-no-op/20230307-064510
git checkout 511e837798da25063830276b8a3345c7601c6459
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:676: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Caller should hold mutex lock corresponding to cache, if available.
arch/x86/kvm/mmu/mmu.c:693: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Caller should hold mutex lock corresponding to kvm_mmu_memory_cache, if
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'kvm' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'slot' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'gfn_offset' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'mask' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'


vim +676 arch/x86/kvm/mmu/mmu.c

674
675 /**
> 676 * Caller should hold mutex lock corresponding to cache, if available.
677 */
678 static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
679 int min)
680 {
681 int orig_nobjs, r;
682
683 orig_nobjs = cache->nobjs;
684 r = kvm_mmu_topup_memory_cache(cache, min);
685 if (orig_nobjs != cache->nobjs)
686 percpu_counter_add(&kvm_total_unused_cached_pages,
687 (cache->nobjs - orig_nobjs));
688
689 return r;
690 }
691

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-07 12:16:27

by kernel test robot

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

Hi Vipin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvmarm/next linus/master v6.3-rc1 next-20230307]
[cannot apply to mst-vhost/linux-next kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Vipin-Sharma/KVM-x86-mmu-Change-KVM-mmu-shrinker-to-no-op/20230307-064510
base: https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/r/20230306224127.1689967-4-vipinsh%40google.com
patch subject: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally
config: x86_64-randconfig-a016-20230306 (https://download.01.org/0day-ci/archive/20230307/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/511e837798da25063830276b8a3345c7601c6459
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vipin-Sharma/KVM-x86-mmu-Change-KVM-mmu-shrinker-to-no-op/20230307-064510
git checkout 511e837798da25063830276b8a3345c7601c6459
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash arch/x86/kvm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> arch/x86/kvm/mmu/mmu.c:676: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Caller should hold mutex lock corresponding to cache, if available.
arch/x86/kvm/mmu/mmu.c:693: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Caller should hold mutex lock corresponding to kvm_mmu_memory_cache, if
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'kvm' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'slot' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'gfn_offset' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'
arch/x86/kvm/mmu/mmu.c:1404: warning: Function parameter or member 'mask' not described in 'kvm_arch_mmu_enable_log_dirty_pt_masked'


vim +676 arch/x86/kvm/mmu/mmu.c

674
675 /**
> 676 * Caller should hold mutex lock corresponding to cache, if available.
677 */
678 static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
679 int min)
680 {
681 int orig_nobjs, r;
682
683 orig_nobjs = cache->nobjs;
684 r = kvm_mmu_topup_memory_cache(cache, min);
685 if (orig_nobjs != cache->nobjs)
686 percpu_counter_add(&kvm_total_unused_cached_pages,
687 (cache->nobjs - orig_nobjs));
688
689 return r;
690 }
691

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-03-07 18:24:42

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 00/18] NUMA aware page table allocation

On Mon, Mar 06, 2023, Vipin Sharma wrote:
> Hi,
>
> This series build up based on the feedback on v3.
>
> Biggest change in features is to enable NUMA aware page table per VM
> basis instead of using a module parameter for all VMs on a host. This
> was decided based on an internal discussion to avoid forcing all VMs to
> be NUMA aware on a host. We need to collect more data to see how much
> performance degradation a VM can get in negative testing, where vCPUs in
> VM are always accessing remote NUMA nodes memory instead of staying
> local compared to a VM which is not NUMA aware.
>
> There are other changes which are mentioned in the change log below for
> v4.
>
> Thanks
> Vipin
>
> v4:
> - Removed module parameter for enabling NUMA aware page table.

Could you add a space before the dash? I think mutt mistakenly treats
it as a 'diff' where you remove a line.
> - Added new capability KVM_CAP_NUMA_AWARE_PAGE_TABLE to enable this
> feature per VM.
> - Added documentation for the new capability.
> - Holding mutex just before the top up and releasing it after the
> fault/split is addressed. Previous version were using spinlocks two
> times, first time for topup and second time fetching the page from
> cache.
> - Using the existing slots_lock for split_shadow_page_cache operations.
> - KVM MMU shrinker will also shrink mm_shadow_info_cache besides
> split_shadow_page_cache and mmu_shadow_page_cache.
> - Reduced cache default size to 4.
> - Split patches into smaller ones.
>
> v3: https://lore.kernel.org/lkml/[email protected]/
> - Split patches into smaller ones.
> - Repurposed KVM MMU shrinker to free cache pages instead of oldest page table
> pages
> - Reduced cache size from 40 to 5
> - Removed __weak function and initializing node value in all architectures.
> - Some name changes.
>
> v2: https://lore.kernel.org/lkml/[email protected]/
> - All page table pages will be allocated on underlying physical page's
> NUMA node.
> - Introduced module parameter, numa_aware_pagetable, to disable this
> feature.
> - Using kvm_pfn_to_refcounted_page to get page from a pfn.
>
> v1: https://lore.kernel.org/all/[email protected]/
>
> Vipin Sharma (18):
> KVM: x86/mmu: Change KVM mmu shrinker to no-op
> KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}
> KVM: x86/mmu: Track count of pages in KVM MMU page caches globally
> KVM: x86/mmu: Shrink shadow page caches via MMU shrinker
> KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU
> cache pages
> KVM: x86/mmu: Shrink split_shadow_page_cache via MMU shrinker
> KVM: x86/mmu: Unconditionally count allocations from MMU page caches
> KVM: x86/mmu: Track unused mmu_shadowed_info_cache pages count via
> global counter
> KVM: x86/mmu: Shrink mmu_shadowed_info_cache via MMU shrinker
> KVM: x86/mmu: Add per VM NUMA aware page table capability
> KVM: x86/mmu: Add documentation of NUMA aware page table capability
> KVM: x86/mmu: Allocate NUMA aware page tables on TDP huge page splits
> KVM: mmu: Add common initialization logic for struct
> kvm_mmu_memory_cache{}
> KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by
> default
> KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}
> KVM: x86/mmu: Allocate numa aware page tables during page fault
> KVM: x86/mmu: Allocate shadow mmu page table on huge page split on the
> same NUMA node
> KVM: x86/mmu: Reduce default mmu memory cache size
>
> Documentation/virt/kvm/api.rst | 29 +++
> arch/arm64/kvm/arm.c | 2 +-
> arch/arm64/kvm/mmu.c | 2 +-
> arch/mips/kvm/mips.c | 3 +
> arch/riscv/kvm/mmu.c | 8 +-
> arch/riscv/kvm/vcpu.c | 2 +-
> arch/x86/include/asm/kvm_host.h | 17 +-
> arch/x86/include/asm/kvm_types.h | 6 +-
> arch/x86/kvm/mmu/mmu.c | 319 +++++++++++++++++++------------
> arch/x86/kvm/mmu/mmu_internal.h | 38 ++++
> arch/x86/kvm/mmu/paging_tmpl.h | 29 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 23 ++-
> arch/x86/kvm/x86.c | 18 +-
> include/linux/kvm_host.h | 2 +
> include/linux/kvm_types.h | 21 ++
> include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 24 ++-
> 17 files changed, 386 insertions(+), 158 deletions(-)
>
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

May I know your base? It seems I cannot apply the series to kvm/master
or kvm/queue without manual manipulation.

2023-03-07 18:45:22

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 00/18] NUMA aware page table allocation

On Tue, Mar 7, 2023 at 10:19 AM Mingwei Zhang <[email protected]> wrote:
>
> On Mon, Mar 06, 2023, Vipin Sharma wrote:

> > v4:
> > - Removed module parameter for enabling NUMA aware page table.
>
> Could you add a space before the dash? I think mutt mistakenly treats
> it as a 'diff' where you remove a line.

From the next version I will add a space before the dash.


> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >
>
> May I know your base? It seems I cannot apply the series to kvm/master
> or kvm/queue without manual manipulation.

My patch series is on the latest kvm/queue branch which is currently
on commit 45dd9bc75d9a ("KVM: SVM: hyper-v: placate modpost section
mismatch error")

What manual manipulation do you have to do to apply this series?

2023-03-07 19:28:54

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Tue, Mar 7, 2023 at 3:33 AM kernel test robot <[email protected]> wrote:
>
> Hi Vipin,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on kvm/queue]
> [also build test WARNING on kvmarm/next linus/master v6.3-rc1 next-20230307]
> [cannot apply to mst-vhost/linux-next kvm/linux-next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> All warnings (new ones prefixed by >>):
>
> >> arch/x86/kvm/mmu/mmu.c:676: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

> vim +676 arch/x86/kvm/mmu/mmu.c
>
> 674
> 675 /**
> > 676 * Caller should hold mutex lock corresponding to cache, if available.
> 677 */

I will fix it in the next version.

2023-03-07 20:18:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Tue, Mar 07, 2023, Vipin Sharma wrote:
> On Tue, Mar 7, 2023 at 3:33 AM kernel test robot <[email protected]> wrote:
> >
> > Hi Vipin,
> >
> > Thank you for the patch! Perhaps something to improve:
> >
> > [auto build test WARNING on kvm/queue]
> > [also build test WARNING on kvmarm/next linus/master v6.3-rc1 next-20230307]
> > [cannot apply to mst-vhost/linux-next kvm/linux-next]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> arch/x86/kvm/mmu/mmu.c:676: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
>
> > vim +676 arch/x86/kvm/mmu/mmu.c
> >
> > 674
> > 675 /**
> > > 676 * Caller should hold mutex lock corresponding to cache, if available.
> > 677 */
>
> I will fix it in the next version.

Don't bother reworking the code/comment, I will likely have feedback that results
in the demise of the comment altogether (comments that say "lock X must be held"
are almost always flawed).

2023-03-08 20:33:46

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Mon, 6 Mar 2023 14:41:12 -0800
Vipin Sharma <[email protected]> wrote:

> Create a global counter for total number of pages available
> in MMU page caches across all VMs. Add mmu_shadow_page_cache
> pages to this counter.
>
> This accounting will be used in future commits to shrink MMU caches via
> KVM MMU shrinker.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 ++
> arch/x86/kvm/mmu/mmu.c | 90 ++++++++++++++++++++++++++++-----
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 +++++----
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> 5 files changed, 100 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ebbe692acf3f..4322c7020d5d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -791,6 +791,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protect allocation and release of pages from mmu_shadow_page_cache.
> + */
> + struct mutex mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3a452989f5cd..13f41b7ac280 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -167,6 +167,11 @@ struct kvm_shadow_walk_iterator {
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
>
> +/*
> + * Global count of unused pages in MMU page caches across all VMs.
> + */
> +static struct percpu_counter kvm_total_unused_cached_pages;
> +
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> struct kvm_mmu_role_regs {
> @@ -667,6 +672,34 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +/**
> + * Caller should hold mutex lock corresponding to cache, if available.
> + */
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + int min)
> +{
> + int orig_nobjs, r;
> +
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, min);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_cached_pages,
> + (cache->nobjs - orig_nobjs));
> +
> + return r;
> +}
> +

Maybe kvm_mmu_topup_shadow_page_cache() would be better?

As a user of kvm_mmu_topup_memory_cache(), mmu_topup_memory_cache() is not
supposed to directly touch the kvm_mmu_memory_cache meta data.

The name "mmu_topup_sp_memory_cache()" seems similar with "mmu_topup_memory_cache()".
Renaming it would make its level self-documenting.

> +/**
> + * Caller should hold mutex lock corresponding to kvm_mmu_memory_cache, if
> + * available.
> + */
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
> +{
> + if (cache->nobjs)
> + percpu_counter_sub(&kvm_total_unused_cached_pages, cache->nobjs);
> + kvm_mmu_free_memory_cache(cache);
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -676,10 +709,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> +
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> if (r)
> return r;
> +
> if (maybe_indirect) {
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> PT64_ROOT_MAX_LEVEL);
> @@ -693,7 +727,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -2148,6 +2184,7 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + bool count_shadow_page_allocation;
> };
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> @@ -2159,7 +2196,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->count_shadow_page_allocation);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2216,6 +2254,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .count_shadow_page_allocation = true,
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -4314,29 +4353,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> if (r != RET_PF_INVALID)
> return r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, false);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = make_mmu_pages_available(vcpu);
> if (r)
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = direct_map(vcpu, fault);
>
> -out_unlock:
> +out_mmu_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -4396,25 +4438,28 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> if (r != RET_PF_INVALID)
> return r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);

Can you elaborate more why this lock is required? When will this lock contend?

1) Previously mmu_topup_memory_caches() works fine without a lock.
2) IMHO I was suspecting if this lock seems affects the parallelization
of the TDP MMU fault handling.

TDP MMU fault handling is intend to be optimized for parallelization fault
handling by taking a read lock and operating the page table via atomic
operations. Multiple fault handling can enter the TDP MMU fault path
because of read_lock(&vcpu->kvm->mmu_lock) below.

W/ this lock, it seems the part of benefit of parallelization is gone
because the lock can contend earlier above. Will this cause performance
regression?

If the lock will not contend above, then I am not sure if we need it.

> r = mmu_topup_memory_caches(vcpu, false);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> r = RET_PF_RETRY;
> read_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = kvm_tdp_mmu_map(vcpu, fault);
>
> -out_unlock:
> +out_mmu_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
> #endif
> @@ -5394,6 +5439,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> {
> int r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
> if (r)
> goto out;
> @@ -5420,6 +5466,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> */
> static_call(kvm_x86_flush_tlb_current)(vcpu);
> out:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -5924,6 +5971,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6769,12 +6817,17 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> + if (percpu_counter_init(&kvm_total_unused_cached_pages, 0, GFP_KERNEL))
> + goto out;
> +
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> if (ret)
> - goto out;
> + goto out_shrinker;
>
> return 0;
>
> +out_shrinker:
> + percpu_counter_destroy(&kvm_total_unused_cached_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6792,6 +6845,7 @@ void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> unregister_shrinker(&mmu_shrinker);
> + percpu_counter_destroy(&kvm_total_unused_cached_pages);
> }
>
> /*
> @@ -6994,3 +7048,11 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> if (kvm->arch.nx_huge_page_recovery_thread)
> kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
> +
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + bool count_allocation)
> +{
> + if (count_allocation && shadow_page_cache->nobjs)
> + percpu_counter_dec(&kvm_total_unused_cached_pages);
> + return kvm_mmu_memory_cache_alloc(shadow_page_cache);
> +}
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..798cfbf0a36b 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -338,5 +338,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
> + bool count_allocation);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..1dea9be6849d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -821,9 +821,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, true);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> vcpu->arch.write_fault_to_shadow_pgtable = false;
>
> @@ -837,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> r = kvm_faultin_pfn(vcpu, fault, walker.pte_access);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> /*
> * Do not change pte_access if the pfn is a mmio page, otherwise
> @@ -862,16 +863,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = make_mmu_pages_available(vcpu);
> if (r)
> - goto out_unlock;
> + goto out_mmu_unlock;
> r = FNAME(fetch)(vcpu, fault, &walker);
>
> -out_unlock:
> +out_mmu_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -897,17 +900,18 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
>
> vcpu_clear_mmio_info(vcpu, gva);
>
> + if (!VALID_PAGE(root_hpa)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> /*
> * No need to check return value here, rmap_can_add() can
> * help us to skip pte prefetch later.
> */
> mmu_topup_memory_caches(vcpu, true);
>
> - if (!VALID_PAGE(root_hpa)) {
> - WARN_ON(1);
> - return;
> - }
> -
> write_lock(&vcpu->kvm->mmu_lock);
> for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
> level = iterator.level;
> @@ -943,6 +947,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
> break;
> }
> write_unlock(&vcpu->kvm->mmu_lock);
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> }
>
> /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7c25dbf32ecc..fa6eb1e9101e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -265,7 +265,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + true);
>
> return sp;
> }


2023-03-08 22:17:35

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Wed, Mar 8, 2023 at 12:33 PM Zhi Wang <[email protected]> wrote:
>
> On Mon, 6 Mar 2023 14:41:12 -0800
> Vipin Sharma <[email protected]> wrote:
> > +/**
> > + * Caller should hold mutex lock corresponding to cache, if available.
> > + */
> > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + int min)
> > +{
> > + int orig_nobjs, r;
> > +
> > + orig_nobjs = cache->nobjs;
> > + r = kvm_mmu_topup_memory_cache(cache, min);
> > + if (orig_nobjs != cache->nobjs)
> > + percpu_counter_add(&kvm_total_unused_cached_pages,
> > + (cache->nobjs - orig_nobjs));
> > +
> > + return r;
> > +}
> > +
>
> Maybe kvm_mmu_topup_shadow_page_cache() would be better?
>
> As a user of kvm_mmu_topup_memory_cache(), mmu_topup_memory_cache() is not
> supposed to directly touch the kvm_mmu_memory_cache meta data.
>
> The name "mmu_topup_sp_memory_cache()" seems similar with "mmu_topup_memory_cache()".
> Renaming it would make its level self-documenting.
>

Sounds good. I will rename it.

> > @@ -4396,25 +4438,28 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > if (r != RET_PF_INVALID)
> > return r;
> >
> > + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> Can you elaborate more why this lock is required? When will this lock contend?

This lock is not needed in this patch. It is used in patch 4, where the
cache is freed up by the MMU shrinker. Sean also mentioned this in an
internal discussion. I will move it to the patch where it is actually
used.

>
> 1) Previously mmu_topup_memory_caches() works fine without a lock.
> 2) IMHO I was suspecting if this lock seems affects the parallelization
> of the TDP MMU fault handling.
>
> TDP MMU fault handling is intend to be optimized for parallelization fault
> handling by taking a read lock and operating the page table via atomic
> operations. Multiple fault handling can enter the TDP MMU fault path
> because of read_lock(&vcpu->kvm->mmu_lock) below.
>
> W/ this lock, it seems the part of benefit of parallelization is gone
> because the lock can contend earlier above. Will this cause performance
> regression?

This is a per-vCPU lock; with it, each vCPU will still be able to handle
faults in parallel without contending for a shared lock.

>
> If the lock will not contend above, then I am not sure if we need it.
>

Not in this patch, but we will need it in patch 4 when clearing the
cache via the MMU shrinker. I will move it to the patch where it is
actually needed.

2023-03-09 05:18:24

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

> >
> > 1) Previously mmu_topup_memory_caches() works fine without a lock.
> > 2) IMHO I was suspecting if this lock seems affects the parallelization
> > of the TDP MMU fault handling.
> >
> > TDP MMU fault handling is intend to be optimized for parallelization fault
> > handling by taking a read lock and operating the page table via atomic
> > operations. Multiple fault handling can enter the TDP MMU fault path
> > because of read_lock(&vcpu->kvm->mmu_lock) below.
> >
> > W/ this lock, it seems the part of benefit of parallelization is gone
> > because the lock can contend earlier above. Will this cause performance
> > regression?
>
> This is a per vCPU lock, with this lock each vCPU will still be able
> to perform parallel fault handling without contending for lock.
>

I am curious how effective it is to try acquiring this per-vCPU lock.
If a vCPU thread stays within the (host) kernel (VMX root/non-root) for
the vast majority of the time, won't the shrinker always fail to make
any progress?

2023-03-09 12:52:29

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, 9 Mar 2023 05:18:11 +0000
Mingwei Zhang <[email protected]> wrote:

> > >
> > > 1) Previously mmu_topup_memory_caches() works fine without a lock.
> > > 2) IMHO I was suspecting if this lock seems affects the parallelization
> > > of the TDP MMU fault handling.
> > >
> > > TDP MMU fault handling is intend to be optimized for parallelization fault
> > > handling by taking a read lock and operating the page table via atomic
> > > operations. Multiple fault handling can enter the TDP MMU fault path
> > > because of read_lock(&vcpu->kvm->mmu_lock) below.
> > >
> > > W/ this lock, it seems the part of benefit of parallelization is gone
> > > because the lock can contend earlier above. Will this cause performance
> > > regression?
> >
> > This is a per vCPU lock, with this lock each vCPU will still be able
> > to perform parallel fault handling without contending for lock.
> >
>
> I am curious how effective it is by trying to accquiring this per vCPU
> lock? If a vcpu thread should stay within the (host) kernel (vmx
> root/non-root) for the vast majority of the time, isn't the shrinker
> always fail to make any progress?

IMHO the lock is to prevent the faulting path from being disturbed by the
shrinker. I guess even if a vCPU thread stays in the host kernel, the
shrinker should still be able to harvest pages from the cache as long as
there is no flood of faults.

I am curious about the effectiveness as well. It would be nice if there
were some unit tests that people could run themselves to see the results:
when the shrinker isn't triggered, is faulting still as effective as
before, and when the shrinker is triggered, what actually happens under
different levels of memory pressure (e.g. how much faulting slows down)?


2023-03-09 15:37:22

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Mon, 6 Mar 2023 14:41:12 -0800
Vipin Sharma <[email protected]> wrote:

> Create a global counter for total number of pages available
> in MMU page caches across all VMs. Add mmu_shadow_page_cache
> pages to this counter.
>
> This accounting will be used in future commits to shrink MMU caches via
> KVM MMU shrinker.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 ++
> arch/x86/kvm/mmu/mmu.c | 90 ++++++++++++++++++++++++++++-----
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/paging_tmpl.h | 25 +++++----
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> 5 files changed, 100 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ebbe692acf3f..4322c7020d5d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -791,6 +791,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protect allocation and release of pages from mmu_shadow_page_cache.
> + */
> + struct mutex mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3a452989f5cd..13f41b7ac280 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -167,6 +167,11 @@ struct kvm_shadow_walk_iterator {
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
>
> +/*
> + * Global count of unused pages in MMU page caches across all VMs.
> + */
> +static struct percpu_counter kvm_total_unused_cached_pages;
> +
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> struct kvm_mmu_role_regs {
> @@ -667,6 +672,34 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +/**
> + * Caller should hold mutex lock corresponding to cache, if available.
> + */
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + int min)
> +{
> + int orig_nobjs, r;
> +
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, min);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_cached_pages,
> + (cache->nobjs - orig_nobjs));
> +
> + return r;
> +}
> +
> +/**
> + * Caller should hold mutex lock corresponding to kvm_mmu_memory_cache, if
> + * available.
> + */
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
> +{
> + if (cache->nobjs)
> + percpu_counter_sub(&kvm_total_unused_cached_pages, cache->nobjs);
> + kvm_mmu_free_memory_cache(cache);
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -676,10 +709,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> +
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> if (r)
> return r;
> +
> if (maybe_indirect) {
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> PT64_ROOT_MAX_LEVEL);
> @@ -693,7 +727,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -2148,6 +2184,7 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + bool count_shadow_page_allocation;
> };
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> @@ -2159,7 +2196,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->count_shadow_page_allocation);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2216,6 +2254,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .count_shadow_page_allocation = true,
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -4314,29 +4353,32 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> if (r != RET_PF_INVALID)
> return r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, false);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> r = RET_PF_RETRY;
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = make_mmu_pages_available(vcpu);
> if (r)
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = direct_map(vcpu, fault);
>
> -out_unlock:
> +out_mmu_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -4396,25 +4438,28 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> if (r != RET_PF_INVALID)
> return r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, false);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> r = RET_PF_RETRY;
> read_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = kvm_tdp_mmu_map(vcpu, fault);
>
> -out_unlock:
> +out_mmu_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
> #endif
> @@ -5394,6 +5439,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> {
> int r;
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
> if (r)
> goto out;
> @@ -5420,6 +5466,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> */
> static_call(kvm_x86_flush_tlb_current)(vcpu);
> out:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -5924,6 +5971,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6769,12 +6817,17 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> + if (percpu_counter_init(&kvm_total_unused_cached_pages, 0, GFP_KERNEL))
> + goto out;
> +
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> if (ret)
> - goto out;
> + goto out_shrinker;
>
> return 0;
>
> +out_shrinker:
> + percpu_counter_destroy(&kvm_total_unused_cached_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6792,6 +6845,7 @@ void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> unregister_shrinker(&mmu_shrinker);
> + percpu_counter_destroy(&kvm_total_unused_cached_pages);
> }
>
> /*
> @@ -6994,3 +7048,11 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> if (kvm->arch.nx_huge_page_recovery_thread)
> kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
> +
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + bool count_allocation)

Is it necessary to pass count_allocation to every call of
mmu_sp_memory_cache_alloc() instead of reading
shadow_page_cache->count_shadow_page_allocation directly?

> +{
> + if (count_allocation && shadow_page_cache->nobjs)
> + percpu_counter_dec(&kvm_total_unused_cached_pages);
> + return kvm_mmu_memory_cache_alloc(shadow_page_cache);
> +}
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cc58631e2336..798cfbf0a36b 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -338,5 +338,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
> + bool count_allocation);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 57f0b75c80f9..1dea9be6849d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -821,9 +821,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_EMULATE;
> }
>
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> r = mmu_topup_memory_caches(vcpu, true);
> if (r)
> - return r;
> + goto out_page_cache_unlock;
>
> vcpu->arch.write_fault_to_shadow_pgtable = false;
>
> @@ -837,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> r = kvm_faultin_pfn(vcpu, fault, walker.pte_access);
> if (r != RET_PF_CONTINUE)
> - return r;
> + goto out_page_cache_unlock;
>
> /*
> * Do not change pte_access if the pfn is a mmio page, otherwise
> @@ -862,16 +863,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> write_lock(&vcpu->kvm->mmu_lock);
>
> if (is_page_fault_stale(vcpu, fault))
> - goto out_unlock;
> + goto out_mmu_unlock;
>
> r = make_mmu_pages_available(vcpu);
> if (r)
> - goto out_unlock;
> + goto out_mmu_unlock;
> r = FNAME(fetch)(vcpu, fault, &walker);
>
> -out_unlock:
> +out_mmu_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> kvm_release_pfn_clean(fault->pfn);
> +out_page_cache_unlock:
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> return r;
> }
>
> @@ -897,17 +900,18 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
>
> vcpu_clear_mmio_info(vcpu, gva);
>
> + if (!VALID_PAGE(root_hpa)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> /*
> * No need to check return value here, rmap_can_add() can
> * help us to skip pte prefetch later.
> */
> mmu_topup_memory_caches(vcpu, true);
>
> - if (!VALID_PAGE(root_hpa)) {
> - WARN_ON(1);
> - return;
> - }
> -
> write_lock(&vcpu->kvm->mmu_lock);
> for_each_shadow_entry_using_root(vcpu, root_hpa, gva, iterator) {
> level = iterator.level;
> @@ -943,6 +947,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
> break;
> }
> write_unlock(&vcpu->kvm->mmu_lock);
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> }
>
> /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7c25dbf32ecc..fa6eb1e9101e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -265,7 +265,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + true);
>
> return sp;
> }


2023-03-09 15:59:07

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 05/18] KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU cache pages

On Mon, 6 Mar 2023 14:41:14 -0800
Vipin Sharma <[email protected]> wrote:

> Add pages in split_shadow_page_cache to the global counter
> kvm_total_unused_cached_pages. These pages will be freed by MMU shrinker
> in future commit.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index df8dcb7e5de7..0ebb8a2eaf47 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6149,7 +6149,9 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> {
> kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> - kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> + mutex_lock(&kvm->slots_lock);
> + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
> + mutex_unlock(&kvm->slots_lock);

Taking a lock that belongs to the calling path inside the cache
topup/free layer seems off.

My vote goes to having a lock for each cache and taking that cache's lock
when topping up/freeing it. That is more self-contained and
architecturally nicer.

> }
>
> void kvm_mmu_uninit_vm(struct kvm *kvm)
> @@ -6303,7 +6305,7 @@ static int topup_split_caches(struct kvm *kvm)
> if (r)
> return r;
>
> - return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> + return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
> }
>
> static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
> @@ -6328,6 +6330,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.count_shadow_page_allocation = true;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);


2023-03-09 16:02:41

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 06/18] KVM: x86/mmu: Shrink split_shadow_page_cache via MMU shrinker

On Mon, 6 Mar 2023 14:41:15 -0800
Vipin Sharma <[email protected]> wrote:

> Use MMU shrinker to free unused pages in split_shadow_page_cache.
> Refactor the code and make common function to try emptying the page cache.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 34 +++++++++++++++++++++-------------
> 1 file changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0ebb8a2eaf47..73a0ac9c11ce 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6696,13 +6696,24 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> }
> }
>

If the lock is moved into kvm_mmu_memory_cache, the cache_lock no longer
needs to be passed here or in mmu_shrink_scan().

> +static int mmu_memory_cache_try_empty(struct kvm_mmu_memory_cache *cache,
> + struct mutex *cache_lock)
> +{
> + int freed = 0;
> +
> + if (mutex_trylock(cache_lock)) {
> + freed = cache->nobjs;
> + kvm_mmu_empty_memory_cache(cache);
> + mutex_unlock(cache_lock);
> + }
> + return freed;
> +}
> +
> static unsigned long mmu_shrink_scan(struct shrinker *shrink,
> struct shrink_control *sc)
> {
> struct kvm *kvm, *next_kvm, *first_kvm = NULL;
> - struct kvm_mmu_memory_cache *cache;
> unsigned long i, freed = 0;
> - struct mutex *cache_lock;
> struct kvm_vcpu *vcpu;
>
> mutex_lock(&kvm_lock);
> @@ -6716,18 +6727,15 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
> list_move_tail(&kvm->vm_list, &vm_list);
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - cache = &vcpu->arch.mmu_shadow_page_cache;
> - cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> - if (mutex_trylock(cache_lock)) {
> - if (cache->nobjs) {
> - freed += cache->nobjs;
> - kvm_mmu_empty_memory_cache(cache);
> - }
> - mutex_unlock(cache_lock);
> - if (freed >= sc->nr_to_scan)
> - goto out;
> - }
> + freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> + if (freed >= sc->nr_to_scan)
> + goto out;
> }
> + freed += mmu_memory_cache_try_empty(&kvm->arch.split_shadow_page_cache,
> + &kvm->slots_lock);
> + if (freed >= sc->nr_to_scan)
> + goto out;
> }
> out:
> mutex_unlock(&kvm_lock);


2023-03-09 16:11:35

by Zhi Wang

[permalink] [raw]
Subject: Re: [Patch v4 07/18] KVM: x86/mmu: Unconditionally count allocations from MMU page caches

On Mon, 6 Mar 2023 14:41:16 -0800
Vipin Sharma <[email protected]> wrote:

Ah, it is removed here. :)

> Remove count_shadow_page_allocations from struct shadow_page_caches{}.
> Remove count_allocation boolean condition check from
> mmu_sp_memory_cache_alloc().
>
> Both split_shadow_page_cache and mmu_shadow_page_cache are counted in
> global count of unused cache pages. count_shadow_page_allocations
> boolean is obsolete and can be removed.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 11 +++--------
> arch/x86/kvm/mmu/mmu_internal.h | 3 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +--
> 3 files changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 73a0ac9c11ce..0a0962d8108b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2184,7 +2184,6 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> - bool count_shadow_page_allocation;
> };
>
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> @@ -2196,8 +2195,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> - caches->count_shadow_page_allocation);
> + sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2254,7 +2252,6 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> - .count_shadow_page_allocation = true,
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -6330,7 +6327,6 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> - caches.count_shadow_page_allocation = true;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -7101,10 +7097,9 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
>
> -void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> - bool count_allocation)
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache)
> {
> - if (count_allocation && shadow_page_cache->nobjs)
> + if (shadow_page_cache->nobjs)
> percpu_counter_dec(&kvm_total_unused_cached_pages);
> return kvm_mmu_memory_cache_alloc(shadow_page_cache);
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 798cfbf0a36b..a607314348e3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -338,7 +338,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> -void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
> - bool count_allocation);
> +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache);
>
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index fa6eb1e9101e..d1e85012a008 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -265,8 +265,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> - true);
> + sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>
> return sp;
> }


2023-03-09 18:20:16

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 9, 2023 at 7:37 AM Zhi Wang <[email protected]> wrote:
>
> On Mon, 6 Mar 2023 14:41:12 -0800
> Vipin Sharma <[email protected]> wrote:
> > /*
> > @@ -6994,3 +7048,11 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_huge_page_recovery_thread)
> > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > }
> > +
> > +void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + bool count_allocation)
>
> Is it necessary to have the control of count_allocation in every call of
> mmu_sp_memory_cache_alloc() instead of taking
> shadow_page_cache->count_shadow_page_allocation directly?
>
As you found, this is cleaned up in patch 7.

2023-03-09 19:53:12

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 9, 2023 at 4:52 AM Zhi Wang <[email protected]> wrote:
>
> On Thu, 9 Mar 2023 05:18:11 +0000
> Mingwei Zhang <[email protected]> wrote:
>
> > > >
> > > > 1) Previously mmu_topup_memory_caches() works fine without a lock.
> > > > 2) IMHO I was suspecting if this lock seems affects the parallelization
> > > > of the TDP MMU fault handling.
> > > >
> > > > TDP MMU fault handling is intend to be optimized for parallelization fault
> > > > handling by taking a read lock and operating the page table via atomic
> > > > operations. Multiple fault handling can enter the TDP MMU fault path
> > > > because of read_lock(&vcpu->kvm->mmu_lock) below.
> > > >
> > > > W/ this lock, it seems the part of benefit of parallelization is gone
> > > > because the lock can contend earlier above. Will this cause performance
> > > > regression?
> > >
> > > This is a per vCPU lock, with this lock each vCPU will still be able
> > > to perform parallel fault handling without contending for lock.
> > >
> >
> > I am curious how effective it is by trying to accquiring this per vCPU
> > lock? If a vcpu thread should stay within the (host) kernel (vmx
> > root/non-root) for the vast majority of the time, isn't the shrinker
> > always fail to make any progress?
>
> IMHO the lock is to prevent the faulting path from being disturbed by the
> shrinker. I guess even a vCPU thread stays in the host kernel, the shrinker
> should still be able to harvest the pages from the cache as long as there is
> no faulting flood.

Yes, the lock is to prevent the faulting path from being disturbed by
the shrinker. In this new approach, the shrinker goes through each vCPU
of every VM alive on the host. All of these vCPUs collectively being in
the fault path while the shrinker is invoked seems unlikely.

Say we free the cache while a fault is in flight: when a vCPU then asks
for a page from the now-empty cache, it will dynamically allocate one
via GFP_ATOMIC, which has a higher chance of failing if the host is
already under memory pressure. The shrinker should by default run at a
lower priority, and based on the discussions referenced in patch 1 it
seems it was of little practical use before either.
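
For reference, the fallback I am describing is the one in the generic
cache allocator; roughly (paraphrasing kvm_mmu_memory_cache_alloc() from
virt/kvm/kvm_main.c as of this series' baseline, not a verbatim copy):

void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
	void *p;

	/* Cache emptied (e.g. by the shrinker): fall back to GFP_ATOMIC. */
	if (WARN_ON(!mc->nobjs))
		p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
	else
		p = mc->objects[--mc->nobjs];
	BUG_ON(!p);
	return p;
}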

>
> I am curious about the effectiveness as well. It would be nice there can be
> some unit tests that people can try by themselves to see the results, like
> when the shrinker isn't triggered, the faulting is still as effective as
> before and when the shrinker is triggered, what would actually happen when
> the system memory is under different pressure. (like how much the faulting
> will be slowed down)
>

I am not sure what the right test to measure this would be. My manual
testing was to just run dirty_log_perf_test with and without the
shrinker, and I didn't notice much difference. I did add some log prints
to verify that the shrinker is invoked, that caches are freed by it,
and, when a VM is freed, that the page accounting is correct with patch
9 of the series.

2023-03-09 20:00:01

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 05/18] KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU cache pages

On Thu, Mar 9, 2023 at 7:58 AM Zhi Wang <[email protected]> wrote:
>
> On Mon, 6 Mar 2023 14:41:14 -0800
> Vipin Sharma <[email protected]> wrote:
>
> > Add pages in split_shadow_page_cache to the global counter
> > kvm_total_unused_cached_pages. These pages will be freed by MMU shrinker
> > in future commit.
> >
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index df8dcb7e5de7..0ebb8a2eaf47 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6149,7 +6149,9 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> > {
> > kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> > kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> > - kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> > + mutex_lock(&kvm->slots_lock);
> > + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
> > + mutex_unlock(&kvm->slots_lock);
>
> Taking the lock of the calling path in the layer of cache topping/free layer
> seems off.
>
> My vote goes to have a lock for each cache and take the lock of the cache when
> topping/free the cache. It is more self-contained and architecturally nice.
>

Yeah, that is one option. However, in future patches, when I add
per-NUMA-node caches, it will add a lot of locks on the same code path
before a topup. In the split-huge-page case we know which NUMA node we
need to allocate from, so we can fine-tune which lock to take, but in
the fault path we don't know which NUMA node the page will come from,
so we need to top up all of the NUMA caches. Having a single lock
simplifies the code a little bit; a rough sketch of what I mean follows.
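
Purely illustrative sketch of the kind of per-node layout I have in mind
(the array/field names here are placeholders, not necessarily what the
series will end up using):

struct kvm_vcpu_arch_numa_sketch {
	/* One shadow page cache per NUMA node the fault path may top up. */
	struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
	/*
	 * A single mutex covering all of the node caches: the fault path
	 * takes it once before topping up every node, instead of taking
	 * up to MAX_NUMNODES per-cache locks.
	 */
	struct mutex mmu_shadow_page_cache_lock;
};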

I agree with you on being more self-contained. I will wait for others
to also chime in on this and go from there.

2023-03-09 20:00:50

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 06/18] KVM: x86/mmu: Shrink split_shadow_page_cache via MMU shrinker

On Thu, Mar 9, 2023 at 8:01 AM Zhi Wang <[email protected]> wrote:
>
> On Mon, 6 Mar 2023 14:41:15 -0800
> Vipin Sharma <[email protected]> wrote:
>
> > Use MMU shrinker to free unused pages in split_shadow_page_cache.
> > Refactor the code and make common function to try emptying the page cache.
> >
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 34 +++++++++++++++++++++-------------
> > 1 file changed, 21 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0ebb8a2eaf47..73a0ac9c11ce 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6696,13 +6696,24 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > }
> > }
> >
>
> After adding the lock in the kvm_mmu_memory_cache, the cache_lock doesn't need
> to be passed here and in mmu_shrink_scan().
>
Agreed. Let us see what the decision is on moving the lock inside the cache.

2023-03-09 23:53:55

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> Create a global counter for total number of pages available
> in MMU page caches across all VMs. Add mmu_shadow_page_cache
> pages to this counter.

I think I prefer counting the objects on-demand in mmu_shrink_count(),
instead of keeping track of the count. Keeping track of the count adds
complexity to the topup/alloc paths for the sole benefit of the
shrinker. I'd rather contain that complexity to the shrinker code unless
there is a compelling reason to optimize mmu_shrink_count().

IIRC we discussed this at one point. Was there a reason to take this
approach that I'm just forgetting?

2023-03-10 00:06:01

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 05/18] KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU cache pages

On Thu, Mar 09, 2023 at 11:59:00AM -0800, Vipin Sharma wrote:
> On Thu, Mar 9, 2023 at 7:58 AM Zhi Wang <[email protected]> wrote:
> >
> > On Mon, 6 Mar 2023 14:41:14 -0800
> > Vipin Sharma <[email protected]> wrote:
> >
> > > Add pages in split_shadow_page_cache to the global counter
> > > kvm_total_unused_cached_pages. These pages will be freed by MMU shrinker
> > > in future commit.
> > >
> > > Signed-off-by: Vipin Sharma <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index df8dcb7e5de7..0ebb8a2eaf47 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -6149,7 +6149,9 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> > > {
> > > kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> > > kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> > > - kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> > > + mutex_lock(&kvm->slots_lock);
> > > + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
> > > + mutex_unlock(&kvm->slots_lock);
> >
> > Taking the lock of the calling path in the layer of cache topping/free layer
> > seems off.
> >
> > My vote goes to have a lock for each cache and take the lock of the cache when
> > topping/free the cache. It is more self-contained and architecturally nice.
> >
>
> Yeah, this can be one way. However, in future patches when I am adding
> per NUMA node cache, it will add up a lot of locks for the same code
> path before a topup. In split huge page case we know what NUMA node we
> need to allocate from so we can fine tune which lock to take but in
> fault path code we don't know what NUMA node the page will be coming
> from so we need to topup all of the NUMA caches. Having a single lock
> simplifies code a little bit.
>
> I agree with you on being more self-contained. I will wait for others
> to also chime in on this and go from there.

As a general rule, please only add locking when it's needed. Adding
the lock in this commit is just confusing.

But that aside, I don't think acquiring the slots lock is even needed in
this commit. mmu_free_vm_memory_caches() is never called while the
VM is on vm_list, i.e. this can never race with the shrinker.

If you want to be paranoid you can add a WARN to ensure that stays true
going forward:

/* ... comment ... */
WARN_ON_ONCE(!list_empty(&kvm->vm_list));

2023-03-10 00:07:32

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 05/18] KVM: x86/mmu: Add split_shadow_page_cache pages to global count of MMU cache pages

On Thu, Mar 9, 2023 at 4:05 PM David Matlack <[email protected]> wrote:
>
> On Thu, Mar 09, 2023 at 11:59:00AM -0800, Vipin Sharma wrote:
> > On Thu, Mar 9, 2023 at 7:58 AM Zhi Wang <[email protected]> wrote:
> > >
> > > On Mon, 6 Mar 2023 14:41:14 -0800
> > > Vipin Sharma <[email protected]> wrote:
> > >
> > > > Add pages in split_shadow_page_cache to the global counter
> > > > kvm_total_unused_cached_pages. These pages will be freed by MMU shrinker
> > > > in future commit.
> > > >
> > > > Signed-off-by: Vipin Sharma <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 7 +++++--
> > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index df8dcb7e5de7..0ebb8a2eaf47 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -6149,7 +6149,9 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
> > > > {
> > > > kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
> > > > kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
> > > > - kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
> > > > + mutex_lock(&kvm->slots_lock);
> > > > + mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache);
> > > > + mutex_unlock(&kvm->slots_lock);
> > >
> > > Taking the lock of the calling path in the layer of cache topping/free layer
> > > seems off.
> > >
> > > My vote goes to have a lock for each cache and take the lock of the cache when
> > > topping/free the cache. It is more self-contained and architecturally nice.
> > >
> >
> > Yeah, this can be one way. However, in future patches when I am adding
> > per NUMA node cache, it will add up a lot of locks for the same code
> > path before a topup. In split huge page case we know what NUMA node we
> > need to allocate from so we can fine tune which lock to take but in
> > fault path code we don't know what NUMA node the page will be coming
> > from so we need to topup all of the NUMA caches. Having a single lock
> > simplifies code a little bit.
> >
> > I agree with you on being more self-contained. I will wait for others
> > to also chime in on this and go from there.
>
> As a general rule, please only added locking when it's needed. Adding
> the lock in this commit is just confusing.
>
> But that aside, I don't think acquiring the slots lock is even needed in
> this commit.

Correction: even needed in the *next* commit

> mmu_free_vm_memory_caches() is never called while the the
> VM is on vm_list. i.e. This can never race with the shrinker.
>
> If you want to be paranoid you can add a WARN to ensure that stays true
> going forward:
>
> /* ... comment ... */
> WARN_ON_ONCE(!list_empty(&kvm->vm_list));

2023-03-10 00:22:46

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
>
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);

Is this lock necessary (even when the shrinker is hooked up)?
mmu_free_memory_caches() is only called when KVM fails to create a vCPU
(before it has been added to vcpu_array) or during VM destruction (after
the VM has been removed from vm_list).

2023-03-10 00:28:54

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 9, 2023 at 3:53 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> > Create a global counter for total number of pages available
> > in MMU page caches across all VMs. Add mmu_shadow_page_cache
> > pages to this counter.
>
> I think I prefer counting the objects on-demand in mmu_shrink_count(),
> instead of keeping track of the count. Keeping track of the count adds
> complexity to the topup/alloc paths for the sole benefit of the
> shrinker. I'd rather contain that complexity to the shrinker code unless
> there is a compelling reason to optimize mmu_shrink_count().
>
> IIRC we discussed this at one point. Was there a reason to take this
> approach that I'm just forgetting?

To count on demand, we first need to take kvm_lock and then, for each
VM, iterate through each vCPU, take its lock, and sum the object counts
in its caches. Once NUMA support is introduced later in this series, we
will have to iterate over even more caches. We can't/shouldn't use
mutex_trylock() as it will not give the correct picture, and by the time
shrink_scan is called the count can be totally different.

The scan_count() API comment says not to do any deadlock checks (I don't
know exactly what that means), and percpu_counter is very fast when we
are adding/subtracting pages, so the overhead of using it to keep a
global count is minimal. Since there is not much impact compared to the
previous approach, we ended our discussion by keeping this per-CPU
counter.

2023-03-10 00:37:08

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 9, 2023 at 4:22 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> >
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> Is this lock necessary (even when the shrinker is hooked up)?
> mmu_free_memory_caches() is only called when KVM fails to create a vCPU
> (before it has been added to vcpu_array) or during VM destruction (after
> the VM has been removed from vm_list).

My thinking was that if the shrinker ran just before VM destruction and
removed pages, it would decrement the nobjs variable in the cache. Then,
when the VM is destroyed, mmu_free_sp_memory_cache() will first read
nobjs to update the global counter and then free the cache. I used the
mutex to be sure the latest value is read and there is no memory
ordering issue.

I discussed this with Sean offline, and he pointed out that x86 is
strongly ordered and the mutex is not needed when freeing the memory
caches.

2023-03-10 00:56:20

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 09, 2023 at 04:28:10PM -0800, Vipin Sharma wrote:
> On Thu, Mar 9, 2023 at 3:53 PM David Matlack <[email protected]> wrote:
> >
> > On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> > > Create a global counter for total number of pages available
> > > in MMU page caches across all VMs. Add mmu_shadow_page_cache
> > > pages to this counter.
> >
> > I think I prefer counting the objects on-demand in mmu_shrink_count(),
> > instead of keeping track of the count. Keeping track of the count adds
> > complexity to the topup/alloc paths for the sole benefit of the
> > shrinker. I'd rather contain that complexity to the shrinker code unless
> > there is a compelling reason to optimize mmu_shrink_count().
> >
> > IIRC we discussed this at one point. Was there a reason to take this
> > approach that I'm just forgetting?
>
> To count on demand, we first need to lock on kvm_lock and then for
> each VMs iterate through each vCPU, take a lock, and sum the objects
> count in caches. When the NUMA support will be introduced in this
> series then it means we have to iterate even more caches. We
> can't/shouldn't use mutex_trylock() as it will not give the correct
> picture and when shrink_scan is called count can be totally different.

Yeah good point. Hm, do we need to take the cache mutex to calculate the
count though? mmu_shrink_count() is inherently racy (something could get
freed or allocated in between count() and scan()). I don't think holding
the mutex buys us anything over just reading the count without the
mutex.

e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index df8dcb7e5de7..c80a5c52f0ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6739,10 +6739,20 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
static unsigned long mmu_shrink_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- s64 count = percpu_counter_sum(&kvm_total_unused_cached_pages);
+ struct kvm *kvm, *next_kvm;
+ unsigned long count = 0;

- WARN_ON(count < 0);
- return count <= 0 ? SHRINK_EMPTY : count;
+ mutex_lock(&kvm_lock);
+ list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ count += READ_ONCE(vcpu->arch.mmu_shadow_page_cache.nobjs);
+ }
+ mutex_unlock(&kvm_lock);
+
+ return count == 0 ? SHRINK_EMPTY : count;

}

Then the only concern is an additional acquire of kvm_lock. But it
should be fairly quick (quicker than mmu_shrink_scan()). If we can
tolerate the kvm_lock overhead of mmu_shrink_scan(), then we should be
able to tolerate some here.

>
> scan_count() API comment says to not do any deadlock check (I don't
> know what does that mean) and percpu_counter is very fast when we are
> adding/subtracting pages so the effect of using it to keep global
> count is very minimal. Since, there is not much impact to using
> percpu_count compared to previous one, we ended our discussion with
> keeping this per cpu counter.

Yeah, it's just the code complexity of maintaining
kvm_total_unused_cached_pages that I'm hoping to avoid. We have to
create the counter, destroy it, and keep it up to date. Some
kvm_mmu_memory_caches have to update the counter, and others don't. It
just adds a lot of bookkeeping code that I'm not convinced is worth it.

2023-03-10 01:09:55

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 03/18] KVM: x86/mmu: Track count of pages in KVM MMU page caches globally

On Thu, Mar 9, 2023 at 4:56 PM David Matlack <[email protected]> wrote:
>
> On Thu, Mar 09, 2023 at 04:28:10PM -0800, Vipin Sharma wrote:
> > On Thu, Mar 9, 2023 at 3:53 PM David Matlack <[email protected]> wrote:
> > >
> > > On Mon, Mar 06, 2023 at 02:41:12PM -0800, Vipin Sharma wrote:
> > > > Create a global counter for total number of pages available
> > > > in MMU page caches across all VMs. Add mmu_shadow_page_cache
> > > > pages to this counter.
> > >
> > > I think I prefer counting the objects on-demand in mmu_shrink_count(),
> > > instead of keeping track of the count. Keeping track of the count adds
> > > complexity to the topup/alloc paths for the sole benefit of the
> > > shrinker. I'd rather contain that complexity to the shrinker code unless
> > > there is a compelling reason to optimize mmu_shrink_count().
> > >
> > > IIRC we discussed this at one point. Was there a reason to take this
> > > approach that I'm just forgetting?
> >
> > To count on demand, we first need to lock on kvm_lock and then for
> > each VMs iterate through each vCPU, take a lock, and sum the objects
> > count in caches. When the NUMA support will be introduced in this
> > series then it means we have to iterate even more caches. We
> > can't/shouldn't use mutex_trylock() as it will not give the correct
> > picture and when shrink_scan is called count can be totally different.
>
> Yeah good point. Hm, do we need to take the cache mutex to calculate the
> count though? mmu_shrink_count() is inherently racy (something could get
> freed or allocated in between count() and scan()). I don't think holding
> the mutex buys us anything over just reading the count without the
> mutex.
>

You are right, neither the mutex nor the percpu_counter solves the
accuracy problem with the shrinker, so this can be removed.

> e.g.
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index df8dcb7e5de7..c80a5c52f0ea 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6739,10 +6739,20 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
> static unsigned long mmu_shrink_count(struct shrinker *shrink,
> struct shrink_control *sc)
> {
> - s64 count = percpu_counter_sum(&kvm_total_unused_cached_pages);
> + struct kvm *kvm, *next_kvm;
> + unsigned long count = 0;
>
> - WARN_ON(count < 0);
> - return count <= 0 ? SHRINK_EMPTY : count;
> + mutex_lock(&kvm_lock);
> + list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + count += READ_ONCE(vcpu->arch.mmu_shadow_page_cache.nobjs);
> + }
> + mutex_unlock(&kvm_lock);
> +
> + return count == 0 ? SHRINK_EMPTY : count;
>
> }
>
> Then the only concern is an additional acquire of kvm_lock. But it
> should be fairly quick (quicker than mmu_shrink_scan()). If we can
> tolerate the kvm_lock overhead of mmu_shrink_scan(), then we should be
> able to tolerate some here.
>
> >
> > scan_count() API comment says to not do any deadlock check (I don't
> > know what does that mean) and percpu_counter is very fast when we are
> > adding/subtracting pages so the effect of using it to keep global
> > count is very minimal. Since, there is not much impact to using
> > percpu_count compared to previous one, we ended our discussion with
> > keeping this per cpu counter.
>
> Yeah it's just the code complexity of maintaing
> kvm_total_unused_cached_pages that I'm hoping to avoid. We have to
> create the counter, destroy it, and keep it up to date. Some
> kvm_mmu_memory_caches have to update the counter, and others don't. It's
> just adds a lot of bookkeeping code that I'm not convinced is worth the
> it.

Yeah, it will simplify the code a lot. Also, we don't need 100% accuracy
with the shrinker. I will remove this global counter in the next
version.

2023-03-23 22:06:27

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 11/18] KVM: x86/mmu: Add documentation of NUMA aware page table capability

On Mon, Mar 06, 2023 at 02:41:20PM -0800, Vipin Sharma wrote:
> Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
> explain why it is needed.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 62de0768d6aa..7e3a1299ca8e 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
> cause CPU stuck (due to event windows don't open up) and make the CPU
> unavailable to host or other VMs.
>
> +7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
> +------------------------------
> +
> +:Architectures: x86
> +:Target: VM
> +:Returns: 0 on success, -EINVAL if vCPUs are already created.
> +
> +This capability allows userspace to enable NUMA aware page tables allocations.

Call out that this capability overrides task mempolicies. e.g.

This capability causes KVM to use a custom NUMA memory policy when
allocating page tables. Specifically, KVM will attempt to co-locate
page tables pages with the memory that they map, rather than following
the mempolicy of the current task.

> +NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
> +creation, any page table allocated during the life of a VM will be allocated

The "prior to vCPU creation" part here is confusing because it sounds
like you're talking about any page tables allocated before vCPU
creation. Just delete that part and put it in a separate paragraph.

KVM_CAP_NUMA_AWARE_PAGE_TABLE must be enabled before any vCPU is
created, otherwise KVM will return -EINVAL.
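
It might also be worth showing how userspace is expected to use it; a
rough sketch (error handling trimmed, and the capability number is
whatever this series assigns to KVM_CAP_NUMA_AWARE_PAGE_TABLE):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int enable_numa_aware_page_tables(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,	/* added by this series */
	};

	/* Must be issued before the first KVM_CREATE_VCPU, else -EINVAL. */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}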

> +preferably from the NUMA node of the leaf page.
> +
> +Without this capability, default feature is to use current thread mempolicy and

s/default feature is to/KVM will/

> +allocate page table based on that.

s/and allocate page table based on that./to allocate page tables./

> +
> +This capability is useful to improve page accesses by a guest. For example, an

nit: Be more specific about how.

This capability aims to minimize the cost of TLB misses when a vCPU is
accessing NUMA-local memory, by reducing the number of remote memory
accesses needed to walk KVM's page tables.

> +initialization thread which access lots of remote memory and ends up creating
> +page tables on local NUMA node, or some service thread allocates memory on
> +remote NUMA nodes and later worker/background threads accessing that memory
> +will end up accessing remote NUMA node page tables.

It's not clear if these examples are talking about what happens when
KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled or disabled.

Also it's important to distinguish virtual NUMA nodes from physical NUMA
nodes and where these "threads" are running. How about this:

For example, when KVM_CAP_NUMA_AWARE_PAGE_TABLE is disabled and a vCPU
accesses memory on a remote NUMA node and triggers a KVM page fault,
KVM will allocate page tables to handle that fault on the node where
the vCPU is running rather than the node where the memory is allocated.
When KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled, KVM will allocate the
page tables on the node where the memory is located.

This is intended to be used in VM configurations that properly
virtualize NUMA. i.e. VMs with one or more virtual NUMA nodes, each of
which is mapped to a physical NUMA node. With this capability enabled
on such VMs, any guest memory access to virtually-local memory will be
translated through mostly[*] physically-local page tables, regardless
of how the memory was faulted in.

[*] KVM will fallback to allocating from remote NUMA nodes if the
preferred node is out of memory. Also, in VMs with 2 or more NUMA
nodes, higher level page tables will necessarily map memory across
multiple physical nodes.

> So, a multi NUMA node
> +guest, can with high confidence access local memory faster instead of going
> +through remote page tables first.
> +
> +This capability is also helpful for host to reduce live migration impact when
> +splitting huge pages during dirty log operations. If the thread splitting huge
> +page is on remote NUMA node it will create page tables on remote node. Even if
> +guest is careful in making sure that it only access local memory they will end
> +up accessing remote page tables.

Please also cover the limitations of this feature:

- Impact on remote memory accesses (more expensive).
- How KVM handles NUMA node exhaustion.
- How high-level page tables can span multiple nodes.
- What KVM does if it can't determine the NUMA node of the pfn.
- What KVM does for faults on GPAs that aren't backed by a pfn.

> +
> 8. Other capabilities.
> ======================
>
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-03-23 22:16:51

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 12/18] KVM: x86/mmu: Allocate NUMA aware page tables on TDP huge page splits

On Mon, Mar 06, 2023 at 02:41:21PM -0800, Vipin Sharma wrote:
> +
> +void *kvm_mmu_get_free_page(gfp_t gfp, int nid)
> +{
> +#ifdef CONFIG_NUMA

Is this #ifdef necessary? alloc_pages_node() is defined regardless of
CONFIG_NUMA.

> + struct page *page;
> +
> + if (nid != NUMA_NO_NODE) {
> + page = alloc_pages_node(nid, gfp, 0);
> + if (!page)
> + return (void *)0;
> + return page_address(page);
> + }
> +#endif /* CONFIG_NUMA */
> + return (void *)__get_free_page(gfp);
> +}
> +
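
For illustration, dropping only the #ifdef might look roughly like this
(keeping the NUMA_NO_NODE case so the current task's mempolicy still
applies via __get_free_page() when no node is specified):

void *kvm_mmu_get_free_page(gfp_t gfp, int nid)
{
	struct page *page;

	if (nid != NUMA_NO_NODE) {
		/* alloc_pages_node() is available even without CONFIG_NUMA. */
		page = alloc_pages_node(nid, gfp, 0);
		return page ? page_address(page) : NULL;
	}

	return (void *)__get_free_page(gfp);
}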

2023-03-23 22:25:57

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 13/18] KVM: mmu: Add common initialization logic for struct kvm_mmu_memory_cache{}

On Mon, Mar 06, 2023 at 02:41:22PM -0800, Vipin Sharma wrote:
> Add macros and function to make common logic for struct
> kvm_mmu_memory_cache{} declaration and initialization.
>
> Any user which wants different values in struct kvm_mmu_memory_cache{}
> will overwrite the default values explicitly after the initialization.
>
> Suggested-by: David Matlack <[email protected]>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 1 +
> arch/arm64/kvm/mmu.c | 3 ++-
> arch/riscv/kvm/mmu.c | 9 +++++----
> arch/riscv/kvm/vcpu.c | 1 +

MIPS also has cache (git grep "struct kvm_mmu_memory_cache").

> arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> include/linux/kvm_types.h | 10 ++++++++++
> 6 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 3bd732eaf087..2b3d88e4ace8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -330,6 +330,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.target = -1;
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> /*
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7113587222ff..8a56f071ca66 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -895,7 +895,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> {
> phys_addr_t addr;
> int ret = 0;
> - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> + KVM_MMU_MEMORY_CACHE(cache);

nit: DEFINE_KVM_MMU_MEMORY_CACHE()

(Based on similar existing macros in the kernel, e.g. DEFINE_MUTEX(),
DEFINE_TIMER().)

> struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> KVM_PGTABLE_PROT_R |
> @@ -904,6 +904,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> + cache.gfp_zero = __GFP_ZERO;
> size += offset_in_page(guest_ipa);
> guest_ipa &= PAGE_MASK;
>
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index 78211aed36fa..bdd8c17958dd 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> int ret = 0;
> unsigned long pfn;
> phys_addr_t addr, end;
> - struct kvm_mmu_memory_cache pcache = {
> - .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> - .gfp_zero = __GFP_ZERO,
> - };
> + KVM_MMU_MEMORY_CACHE(pcache);
> +
> + pcache.gfp_zero = __GFP_ZERO;
> + if (in_atomic)
> + pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
>
> end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> pfn = __phys_to_pfn(hpa);
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index 7d010b0be54e..bc743e9122d1 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -163,6 +163,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>
> /* Mark this VCPU never ran */
> vcpu->arch.ran_atleast_once = false;
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a4bf2e433030..b706087ef74e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5961,15 +5961,20 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> int ret;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> +
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
>
> @@ -6131,11 +6136,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> kvm_page_track_register_notifier(kvm, node);
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
> kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
> kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
> kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
>
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 2728d49bbdf6..192516eeccac 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -98,6 +98,16 @@ struct kvm_mmu_memory_cache {
> int capacity;
> void **objects;
> };
> +
> +#define KVM_MMU_MEMORY_CACHE_INIT() { }
> +
> +#define KVM_MMU_MEMORY_CACHE(_name) \
> + struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()

nit: There's an extra tab here.

> +
> +static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
> +{
> + *cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
> +}
> #endif
>
> #define HALT_POLL_HIST_COUNT 32
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-03-23 22:29:36

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 14/18] KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by default

On Mon, Mar 06, 2023 at 02:41:23PM -0800, Vipin Sharma wrote:
> Set __GFP_ZERO to gfp_zero in default initizliation of struct
> kvm_mmu_memory_cache{}
>
> All of the users of default initialization code of struct
> kvm_mmu_memory_cache{} explicitly sets gfp_zero to __GFP_ZERO. This can
> be moved to common initialization logic.

If that were true we could get rid of gfp_zero entirely and hard-code
__GFP_ZERO in the memory allocator! mmu_shadowed_info_cache is the one
that does not set __GFP_ZERO.

>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 1 -
> arch/arm64/kvm/mmu.c | 1 -
> arch/riscv/kvm/mmu.c | 1 -
> arch/riscv/kvm/vcpu.c | 1 -
> arch/x86/kvm/mmu/mmu.c | 6 ------
> include/linux/kvm_types.h | 4 +++-
> 6 files changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 2b3d88e4ace8..b4243978d962 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -331,7 +331,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> /*
> * Default value for the FP state, will be overloaded at load
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8a56f071ca66..133eba96c41f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -904,7 +904,6 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> - cache.gfp_zero = __GFP_ZERO;
> size += offset_in_page(guest_ipa);
> guest_ipa &= PAGE_MASK;
>
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index bdd8c17958dd..62550fd91c70 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -353,7 +353,6 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> phys_addr_t addr, end;
> KVM_MMU_MEMORY_CACHE(pcache);
>
> - pcache.gfp_zero = __GFP_ZERO;
> if (in_atomic)
> pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
>
> diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> index bc743e9122d1..f5a96ed1e426 100644
> --- a/arch/riscv/kvm/vcpu.c
> +++ b/arch/riscv/kvm/vcpu.c
> @@ -164,7 +164,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> /* Mark this VCPU never ran */
> vcpu->arch.ran_atleast_once = false;
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> - vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
>
> /* Setup ISA features available to VCPU */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b706087ef74e..d96afc849ee8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5963,14 +5963,11 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> - vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> - vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> - vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> @@ -6138,14 +6135,11 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
> kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> - kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
>
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
> - kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
>
> INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
> kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> - kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
>
> return 0;
> }
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 192516eeccac..5da7953532ce 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -99,7 +99,9 @@ struct kvm_mmu_memory_cache {
> void **objects;
> };
>
> -#define KVM_MMU_MEMORY_CACHE_INIT() { }
> +#define KVM_MMU_MEMORY_CACHE_INIT() { \
> + .gfp_zero = __GFP_ZERO, \
> +}
>
> #define KVM_MMU_MEMORY_CACHE(_name) \
> struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-03-23 22:31:07

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 15/18] KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}

On Mon, Mar 06, 2023 at 02:41:24PM -0800, Vipin Sharma wrote:
> Add NUMA node id variable in struct kvm_mmu_memory_cache{}. This
> variable denotes preferable NUMA node from which memory will be
> allocated under this memory cache.
>
> Set this variable to NUMA_NO_NODE if there is no preferred node.
>
> MIPS doesn't do any sort of initializatino of struct
> kvm_mmu_memory_cache{}. Keep things similar in MIPS by setting gfp_zero
> to 0 as INIT_KVM_MMU_MEMORY_CACHE() will initialize it to __GFP_ZERO.
>
> "node" cannot be left as 0, as 0 is a valid NUMA node value.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/mips/kvm/mips.c | 3 +++
> include/linux/kvm_types.h | 3 +++
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 36c8991b5d39..5ec5ce919918 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -294,6 +294,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> HRTIMER_MODE_REL);
> vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
>
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> + vcpu->arch.mmu_page_cache.gfp_zero = 0;

Oh MIPS is here. Why isn't MIPS covered in the previous commits?

> +
> /*
> * Allocate space for host mode exception handlers that handle
> * guest mode exits
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 5da7953532ce..b2a405c8e629 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -97,10 +97,13 @@ struct kvm_mmu_memory_cache {
> struct kmem_cache *kmem_cache;
> int capacity;
> void **objects;
> + /* Preferred NUMA node of memory allocation. */
> + int node;
> };
>
> #define KVM_MMU_MEMORY_CACHE_INIT() { \
> .gfp_zero = __GFP_ZERO, \
> + .node = NUMA_NO_NODE, \
> }
>
> #define KVM_MMU_MEMORY_CACHE(_name) \
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-03-28 16:53:51

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 11/18] KVM: x86/mmu: Add documentation of NUMA aware page table capability

On Thu, Mar 23, 2023 at 2:59 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:20PM -0800, Vipin Sharma wrote:
> > Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
> > explain why it is needed.
> >
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
> > 1 file changed, 29 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 62de0768d6aa..7e3a1299ca8e 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
> > cause CPU stuck (due to event windows don't open up) and make the CPU
> > unavailable to host or other VMs.
> >
> > +7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Target: VM
> > +:Returns: 0 on success, -EINVAL if vCPUs are already created.
> > +
> > +This capability allows userspace to enable NUMA aware page tables allocations.
>
> Call out that this capability overrides task mempolicies. e.g.
>
> This capability causes KVM to use a custom NUMA memory policy when
> allocating page tables. Specifically, KVM will attempt to co-locate
> page table pages with the memory that they map, rather than following
> the mempolicy of the current task.
>
> > +NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
> > +creation, any page table allocated during the life of a VM will be allocated
>
> The "prior to vCPU creation" part here is confusing because it sounds
> like you're talking about any page tables allocated before vCPU
> creation. Just delete that part and put it in a separate paragraph.
>
> KVM_CAP_NUMA_AWARE_PAGE_TABLE must be enabled before any vCPU is
> created, otherwise KVM will return -EINVAL.
>
> > +preferably from the NUMA node of the leaf page.
> > +
> > +Without this capability, default feature is to use current thread mempolicy and
>
> s/default feature is to/KVM will/
>
> > +allocate page table based on that.
>
> s/and allocate page table based on that./to allocate page tables./
>
> > +
> > +This capability is useful to improve page accesses by a guest. For example, an
>
> nit: Be more specific about how.
>
> This capability aims to minimize the cost of TLB misses when a vCPU is
> accessing NUMA-local memory, by reducing the number of remote memory
> accesses needed to walk KVM's page tables.
>
> > +initialization thread which access lots of remote memory and ends up creating
> > +page tables on local NUMA node, or some service thread allocates memory on
> > +remote NUMA nodes and later worker/background threads accessing that memory
> > +will end up accessing remote NUMA node page tables.
>
> It's not clear if these examples are talking about what happens when
> KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled or disabled.
>
> Also it's important to distinguish virtual NUMA nodes from physical NUMA
> nodes and where these "threads" are running. How about this:
>
> For example, when KVM_CAP_NUMA_AWARE_PAGE_TABLE is disabled and a vCPU
> accesses memory on a remote NUMA node and triggers a KVM page fault,
> KVM will allocate page tables to handle that fault on the node where
> the vCPU is running rather than the node where the memory is allocated.
> When KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled, KVM will allocate the
> page tables on the node where the memory is located.
>
> This is intended to be used in VM configurations that properly
> virtualize NUMA. i.e. VMs with one or more virtual NUMA nodes, each of
> which is mapped to a physical NUMA node. With this capability enabled
> on such VMs, any guest memory access to virtually-local memory will be
> translated through mostly[*] physically-local page tables, regardless
> of how the memory was faulted in.
>
> [*] KVM will fallback to allocating from remote NUMA nodes if the
> preferred node is out of memory. Also, in VMs with 2 or more NUMA
> nodes, higher level page tables will necessarily map memory across
> multiple physical nodes.
>
> > So, a multi NUMA node
> > +guest, can with high confidence access local memory faster instead of going
> > +through remote page tables first.
> > +
> > +This capability is also helpful for host to reduce live migration impact when
> > +splitting huge pages during dirty log operations. If the thread splitting huge
> > +page is on remote NUMA node it will create page tables on remote node. Even if
> > +guest is careful in making sure that it only access local memory they will end
> > +up accessing remote page tables.
>
> Please also cover the limitations of this feature:
>
> - Impact on remote memory accesses (more expensive).
> - How KVM handles NUMA node exhaustion.
> - How high-level page tables can span multiple nodes.
> - What KVM does if it can't determine the NUMA node of the pfn.
> - What KVM does for faults on GPAs that aren't backed by a pfn.
>

Thanks for the suggestions, I will incorporate them in the next version.
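
Enabling a per-VM capability like this from userspace goes through the
standard KVM_ENABLE_CAP VM ioctl. A minimal sketch, assuming the
KVM_CAP_NUMA_AWARE_PAGE_TABLE constant proposed by this series (not yet
in mainline) and that the call is made before any vCPU is created:

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	static int enable_numa_aware_page_tables(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,
		};

		/*
		 * Per the proposed documentation, this must be issued
		 * before creating any vCPU, otherwise KVM returns -EINVAL.
		 */
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}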

2023-03-28 17:16:28

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 12/18] KVM: x86/mmu: Allocate NUMA aware page tables on TDP huge page splits

On Thu, Mar 23, 2023 at 3:15 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:21PM -0800, Vipin Sharma wrote:
> > +
> > +void *kvm_mmu_get_free_page(gfp_t gfp, int nid)
> > +{
> > +#ifdef CONFIG_NUMA
>
> Is this #ifdef necessary? alloc_pages_node() is defined regardless of
> CONFIG_NUMA.
>

It is not necessary. The only advantage is skipping the if() condition
check. I will remove it.

> > + struct page *page;
> > +
> > + if (nid != NUMA_NO_NODE) {
> > + page = alloc_pages_node(nid, gfp, 0);
> > + if (!page)
> > + return (void *)0;
> > + return page_address(page);
> > + }
> > +#endif /* CONFIG_NUMA */
> > + return (void *)__get_free_page(gfp);
> > +}
> > +
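
With the #ifdef dropped, the helper reduces to something like the sketch
below (based on the hunk quoted above, not on a posted patch):

	void *kvm_mmu_get_free_page(gfp_t gfp, int nid)
	{
		struct page *page;

		if (nid != NUMA_NO_NODE) {
			/*
			 * Prefer @nid; the page allocator falls back to
			 * other nodes if the preferred one is exhausted.
			 */
			page = alloc_pages_node(nid, gfp, 0);
			if (!page)
				return NULL;
			return page_address(page);
		}

		/* No preferred node: follow the current task's mempolicy. */
		return (void *)__get_free_page(gfp);
	}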

2023-03-28 17:24:48

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 13/18] KVM: mmu: Add common initialization logic for struct kvm_mmu_memory_cache{}

On Thu, Mar 23, 2023 at 3:23 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:22PM -0800, Vipin Sharma wrote:
> > Add macros and function to make common logic for struct
> > kvm_mmu_memory_cache{} declaration and initialization.
> >
> > Any user which wants different values in struct kvm_mmu_memory_cache{}
> > will overwrite the default values explicitly after the initialization.
> >
> > Suggested-by: David Matlack <[email protected]>
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/arm64/kvm/arm.c | 1 +
> > arch/arm64/kvm/mmu.c | 3 ++-
> > arch/riscv/kvm/mmu.c | 9 +++++----
> > arch/riscv/kvm/vcpu.c | 1 +
>
> MIPS also has cache (git grep "struct kvm_mmu_memory_cache").
>

I will respond in Patch 15 where I added stuff for MIPS.

> > arch/x86/kvm/mmu/mmu.c | 8 ++++++++
> > include/linux/kvm_types.h | 10 ++++++++++
> > 6 files changed, 27 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 3bd732eaf087..2b3d88e4ace8 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -330,6 +330,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.target = -1;
> > bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >
> > /*
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 7113587222ff..8a56f071ca66 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -895,7 +895,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > {
> > phys_addr_t addr;
> > int ret = 0;
> > - struct kvm_mmu_memory_cache cache = { .gfp_zero = __GFP_ZERO };
> > + KVM_MMU_MEMORY_CACHE(cache);
>
> nit: DEFINE_KVM_MMU_MEMORY_CACHE()
>
> (Based on similar existing macros in the kernel, e.g. DEFINE_MUTEX(),
> DEFINE_TIMER().)
>

I will update in v5.

> > struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_DEVICE |
> > KVM_PGTABLE_PROT_R |
> > @@ -904,6 +904,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> > if (is_protected_kvm_enabled())
> > return -EPERM;
> >
> > + cache.gfp_zero = __GFP_ZERO;
> > size += offset_in_page(guest_ipa);
> > guest_ipa &= PAGE_MASK;
> >
> > diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> > index 78211aed36fa..bdd8c17958dd 100644
> > --- a/arch/riscv/kvm/mmu.c
> > +++ b/arch/riscv/kvm/mmu.c
> > @@ -351,10 +351,11 @@ int kvm_riscv_gstage_ioremap(struct kvm *kvm, gpa_t gpa,
> > int ret = 0;
> > unsigned long pfn;
> > phys_addr_t addr, end;
> > - struct kvm_mmu_memory_cache pcache = {
> > - .gfp_custom = (in_atomic) ? GFP_ATOMIC | __GFP_ACCOUNT : 0,
> > - .gfp_zero = __GFP_ZERO,
> > - };
> > + KVM_MMU_MEMORY_CACHE(pcache);
> > +
> > + pcache.gfp_zero = __GFP_ZERO;
> > + if (in_atomic)
> > + pcache.gfp_custom = GFP_ATOMIC | __GFP_ACCOUNT;
> >
> > end = (gpa + size + PAGE_SIZE - 1) & PAGE_MASK;
> > pfn = __phys_to_pfn(hpa);
> > diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
> > index 7d010b0be54e..bc743e9122d1 100644
> > --- a/arch/riscv/kvm/vcpu.c
> > +++ b/arch/riscv/kvm/vcpu.c
> > @@ -163,6 +163,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> > /* Mark this VCPU never ran */
> > vcpu->arch.ran_atleast_once = false;
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> > bitmap_zero(vcpu->arch.isa, RISCV_ISA_EXT_MAX);
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a4bf2e433030..b706087ef74e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5961,15 +5961,20 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > {
> > int ret;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> > vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> > vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> > vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> > +
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> >
> > @@ -6131,11 +6136,14 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > kvm_page_track_register_notifier(kvm, node);
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
> > kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
> > kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache);
> > kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
> > kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
> > kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> >
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 2728d49bbdf6..192516eeccac 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -98,6 +98,16 @@ struct kvm_mmu_memory_cache {
> > int capacity;
> > void **objects;
> > };
> > +
> > +#define KVM_MMU_MEMORY_CACHE_INIT() { }
> > +
> > +#define KVM_MMU_MEMORY_CACHE(_name) \
> > + struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()
>
> nit: There's an extra tab here.
>

Auto formatting is happy with two tabs only. I will update in the next
version. Thanks for catching it.

> > +
> > +static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
> > +{
> > + *cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
> > +}
> > #endif
> >
> > #define HALT_POLL_HIST_COUNT 32
> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >
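
Folding in both nits (the DEFINE_ prefix and the stray tab), the
kvm_types.h additions might end up looking like this sketch in v5; the
final form is of course up to the next version:

	#define KVM_MMU_MEMORY_CACHE_INIT() { }

	#define DEFINE_KVM_MMU_MEMORY_CACHE(_name)	\
		struct kvm_mmu_memory_cache _name = KVM_MMU_MEMORY_CACHE_INIT()

	static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
	{
		*cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
	}

with declaration sites such as the one in kvm_phys_addr_ioremap() then
reading:

	DEFINE_KVM_MMU_MEMORY_CACHE(cache);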

2023-03-28 17:36:34

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 14/18] KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by default

On Thu, Mar 23, 2023 at 3:28 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:23PM -0800, Vipin Sharma wrote:
> > Set __GFP_ZERO to gfp_zero in default initizliation of struct
> > kvm_mmu_memory_cache{}
> >
> > All of the users of default initialization code of struct
> > kvm_mmu_memory_cache{} explicitly sets gfp_zero to __GFP_ZERO. This can
> > be moved to common initialization logic.
>
> If that were true we could get rid of gfp_zero entirely and hard-code
> __GFP_ZERO in the memory allocator! mmu_shadowed_info_cache is the one
> that does not set __GFP_ZERO.
>

Can we use __GFP_ZERO for mmu_shadowed_info_cache? Also, MIPS doesn't
use __GFP_ZERO. I think it might be an oversight in MIPS rather than
intentional.

2023-03-28 17:56:57

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 15/18] KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}

On Thu, Mar 23, 2023 at 3:30 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:24PM -0800, Vipin Sharma wrote:
> > Add NUMA node id variable in struct kvm_mmu_memory_cache{}. This
> > variable denotes preferable NUMA node from which memory will be
> > allocated under this memory cache.
> >
> > Set this variable to NUMA_NO_NODE if there is no preferred node.
> >
> > MIPS doesn't do any sort of initializatino of struct
> > kvm_mmu_memory_cache{}. Keep things similar in MIPS by setting gfp_zero
> > to 0 as INIT_KVM_MMU_MEMORY_CACHE() will initialize it to __GFP_ZERO.
> >
> > "node" cannot be left as 0, as 0 is a valid NUMA node value.
> >
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/mips/kvm/mips.c | 3 +++
> > include/linux/kvm_types.h | 3 +++
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index 36c8991b5d39..5ec5ce919918 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -294,6 +294,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > HRTIMER_MODE_REL);
> > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> >
> > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > + vcpu->arch.mmu_page_cache.gfp_zero = 0;
>
> Oh MIPS is here. Why isn't MIPS covered in the previous commits?
>

Because this is the patch where MIPS gets impacted. MIPS doesn't
initialize gfp_zero, so there was no need to change the code in MIPS.
However, with the addition of "node" in kvm_mmu_memory_cache{} in this
patch, we need initialization in MIPS to (1) set node to NUMA_NO_NODE,
as 0 is now a valid value, and (2) keep gfp_zero at 0, since
INIT_KVM_MMU_MEMORY_CACHE() would otherwise set it to __GFP_ZERO,
which differs from the existing MIPS behavior.

I asked MIPS maintainers in the previous version to see if GFP_ZERO
can be added but didn't get any response.
https://lore.kernel.org/lkml/CAHVum0c+17Z-RbGAFdU-xmRejDjDQ+MKOfN4XaObh2SwgWAjLg@mail.gmail.com/

> > +
> > /*
> > * Allocate space for host mode exception handlers that handle
> > * guest mode exits
> > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > index 5da7953532ce..b2a405c8e629 100644
> > --- a/include/linux/kvm_types.h
> > +++ b/include/linux/kvm_types.h
> > @@ -97,10 +97,13 @@ struct kvm_mmu_memory_cache {
> > struct kmem_cache *kmem_cache;
> > int capacity;
> > void **objects;
> > + /* Preferred NUMA node of memory allocation. */
> > + int node;
> > };
> >
> > #define KVM_MMU_MEMORY_CACHE_INIT() { \
> > .gfp_zero = __GFP_ZERO, \
> > + .node = NUMA_NO_NODE, \
> > }
> >
> > #define KVM_MMU_MEMORY_CACHE(_name) \
> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog
> >

2023-03-28 23:18:19

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 14/18] KVM: mmu: Initialize kvm_mmu_memory_cache.gfp_zero to __GFP_ZERO by default

On Tue, Mar 28, 2023 at 10:31 AM Vipin Sharma <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 3:28 PM David Matlack <[email protected]> wrote:
> >
> > On Mon, Mar 06, 2023 at 02:41:23PM -0800, Vipin Sharma wrote:
> > > Set __GFP_ZERO to gfp_zero in default initizliation of struct
> > > kvm_mmu_memory_cache{}
> > >
> > > All of the users of default initialization code of struct
> > > kvm_mmu_memory_cache{} explicitly sets gfp_zero to __GFP_ZERO. This can
> > > be moved to common initialization logic.
> >
> > If that were true we could get rid of gfp_zero entirely and hard-code
> > __GFP_ZERO in the memory allocator! mmu_shadowed_info_cache is the one
> > that does not set __GFP_ZERO.
> >
>
> Can we use __GFP_ZERO for mmu_shadowed_info_cache?

Yes but doing so would add CPU cost to zero the memory on allocation.
Someone would need to do some performance testing to confirm that the
cost of zeroing is acceptable.

> Also, MIPS doesn't
> use __GFP_ZERO. I think it might be an oversight in MIPS rather than
> intentional.

2023-03-28 23:34:45

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 15/18] KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}

On Tue, Mar 28, 2023 at 10:51 AM Vipin Sharma <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 3:30 PM David Matlack <[email protected]> wrote:
> >
> > On Mon, Mar 06, 2023 at 02:41:24PM -0800, Vipin Sharma wrote:
> > > Add NUMA node id variable in struct kvm_mmu_memory_cache{}. This
> > > variable denotes preferable NUMA node from which memory will be
> > > allocated under this memory cache.
> > >
> > > Set this variable to NUMA_NO_NODE if there is no preferred node.
> > >
> > > MIPS doesn't do any sort of initializatino of struct
> > > kvm_mmu_memory_cache{}. Keep things similar in MIPS by setting gfp_zero
> > > to 0 as INIT_KVM_MMU_MEMORY_CACHE() will initialize it to __GFP_ZERO.
> > >
> > > "node" cannot be left as 0, as 0 is a valid NUMA node value.
> > >
> > > Signed-off-by: Vipin Sharma <[email protected]>
> > > ---
> > > arch/mips/kvm/mips.c | 3 +++
> > > include/linux/kvm_types.h | 3 +++
> > > 2 files changed, 6 insertions(+)
> > >
> > > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > > index 36c8991b5d39..5ec5ce919918 100644
> > > --- a/arch/mips/kvm/mips.c
> > > +++ b/arch/mips/kvm/mips.c
> > > @@ -294,6 +294,9 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > HRTIMER_MODE_REL);
> > > vcpu->arch.comparecount_timer.function = kvm_mips_comparecount_wakeup;
> > >
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > > + vcpu->arch.mmu_page_cache.gfp_zero = 0;
> >
> > Oh MIPS is here. Why isn't MIPS covered in the previous commits?
>
> Because this is the patch where MIPS gets impacted. MIPS doesn't
> initialize gfp_zero, so there was no need to change the code in MIPS.
> However, with the addition of "node" in kvm_mmu_memory_cache{} in this
> patch, we need initialization in MIPS to (1) set node to NUMA_NO_NODE,
> as 0 is now a valid value, and (2) keep gfp_zero at 0, since
> INIT_KVM_MMU_MEMORY_CACHE() would otherwise set it to __GFP_ZERO,
> which differs from the existing MIPS behavior.
>
> I asked MIPS maintainers in the previous version to see if GFP_ZERO
> can be added but didn't get any response.
> https://lore.kernel.org/lkml/CAHVum0c+17Z-RbGAFdU-xmRejDjDQ+MKOfN4XaObh2SwgWAjLg@mail.gmail.com/

I see. IMO it's more logical to convert the MIPS cache to
INIT_KVM_MMU_MEMORY_CACHE() in patch 13, along with all the other
users of struct kvm_mmu_memory_cache. Then in patch 14, add the line
to set gfp_zero to 0 for MIPS to preserve the existing behavior. That
produces a very simple chain of changes:

Patch 13: Convert all users of struct kvm_mmu_memory_cache to INIT()
Patch 14: Invert the default value of kvm_mmu_memory_cache.gfp_zero
Patch 15: Add node to kvm_mmu_memory_cache
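
In code terms the MIPS hunk would then evolve roughly as follows across
those two patches (a sketch of the suggested ordering, not taken from
the posted series):

	/* Patch 13: convert MIPS along with every other user. */
	INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);

	/*
	 * Patch 14: gfp_zero now defaults to __GFP_ZERO, so explicitly
	 * preserve MIPS's existing non-zeroing behavior.
	 */
	INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
	vcpu->arch.mmu_page_cache.gfp_zero = 0;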


>
> > > +
> > > /*
> > > * Allocate space for host mode exception handlers that handle
> > > * guest mode exits
> > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> > > index 5da7953532ce..b2a405c8e629 100644
> > > --- a/include/linux/kvm_types.h
> > > +++ b/include/linux/kvm_types.h
> > > @@ -97,10 +97,13 @@ struct kvm_mmu_memory_cache {
> > > struct kmem_cache *kmem_cache;
> > > int capacity;
> > > void **objects;
> > > + /* Preferred NUMA node of memory allocation. */
> > > + int node;
> > > };
> > >
> > > #define KVM_MMU_MEMORY_CACHE_INIT() { \
> > > .gfp_zero = __GFP_ZERO, \
> > > + .node = NUMA_NO_NODE, \
> > > }
> > >
> > > #define KVM_MMU_MEMORY_CACHE(_name) \
> > > --
> > > 2.40.0.rc0.216.gc4246ad0f0-goog
> > >

2023-03-29 00:24:22

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> Allocate page tables on the preferred NUMA node via memory cache during
> page faults. If memory cache doesn't have a preferred NUMA node (node
> value is set to NUMA_NO_NODE) then fallback to the default logic where
> pages are selected based on thread's mempolicy. Also, free NUMA aware
> page caches, mmu_shadow_page_cache, when memory shrinker is invoked.
>
> Allocate root pages based on the current thread's NUMA node as there is
> no way to know which will be the ideal NUMA node in long run.
>
> This commit allocate page tables to be on the same NUMA node as the
> physical page pointed by them, even if a vCPU causing page fault is on a
> different NUMA node. If memory is not available on the requested NUMA
> node then the other nearest NUMA node is selected by default. NUMA aware
> page tables can be beneficial in cases where a thread touches lot of far
> memory initially and then divide work among multiple threads. VMs
> generally take advantage of NUMA architecture for faster memory access
> by moving threads to the NUMA node of the memory they are accessing.
> This change will help them in accessing pages faster.
>
> Downside of this change is that an experimental workload can be created
> where a guest threads are always accessing remote memory and not the one
> local to them. This will cause performance to degrade compared to VMs
> where numa aware page tables are not enabled. Ideally, these VMs when
> using non-uniform memory access machine should generally be taking
> advantage of NUMA architecture to improve their performance in the first
> place.
>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++++---------
> arch/x86/kvm/mmu/mmu_internal.h | 24 ++++++++++++-
> arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++---
> include/linux/kvm_types.h | 6 ++++
> virt/kvm/kvm_main.c | 2 +-
> 7 files changed, 88 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 64de083cd6b9..77d3aa368e5e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> struct kvm_mmu *walk_mmu;
>
> struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];

I think we need an abstraction for a NUMA-aware mmu cache, since there
is more than one by the end of this series.

e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
common helper functions for operations like initializing, topping-up,
and freeing.

I have some ideas I want to try but I ran out of time today.
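
One possible shape for such a wrapper, using the helper names that show
up in the follow-up diff later in this thread (a sketch only; the
enabled/disabled handling and the global page-count accounting are
elided):

	struct kvm_mmu_numa_memory_cache {
		struct kvm_mmu_memory_cache nodes[MAX_NUMNODES];
	};

	static inline void
	kvm_mmu_init_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache)
	{
		int nid;

		for_each_node(nid)
			INIT_KVM_MMU_MEMORY_CACHE(&cache->nodes[nid]);
	}

	static inline int
	kvm_mmu_topup_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache,
					int min)
	{
		int nid, r;

		for_each_online_node(nid) {
			r = kvm_mmu_topup_memory_cache(&cache->nodes[nid], min);
			if (r)
				return r;
		}
		return 0;
	}

	static inline void
	kvm_mmu_free_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache)
	{
		int nid;

		for_each_node(nid)
			kvm_mmu_free_memory_cache(&cache->nodes[nid]);
	}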

> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d96afc849ee8..86f0d74d35ed 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
>
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> - int r;
> + int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;
>
> /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> @@ -710,7 +710,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> if (r)
> return r;
>
> - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> + if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
> + for_each_online_node(nid) {

Blegh. This is going to potentially waste a lot of memory. Yes the
shrinker can free it, but the next fault will re-allocate all the online
node caches.

The reason we have to top-up all nodes is because KVM tops up caches
before faulting in the PFN, and there is concern that changing this will
increase the rate of guest page-fault retries [1].

I think we should revisit that concern. Can we do any testing to
validate that hypothesis? Or can we convince ourselves that re-ordering
is ok?

[1] https://lore.kernel.org/kvm/CAHVum0cjqsdG2NEjRF3ZRtUY2t2=Tb9H4OyOz9wpmsrN--Sjhg@mail.gmail.com/

> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> + PT64_ROOT_MAX_LEVEL);

This ignores the return value of mmu_topup_sp_memory_cache() for all but
the last node.

> + }
> + } else {
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> + PT64_ROOT_MAX_LEVEL);
> + }
> +
> if (r)
> return r;
>
> @@ -726,9 +735,12 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> + int nid;
> +
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> - mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + for_each_node(nid)
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid]);
> mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> @@ -2245,12 +2257,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> }
>
> static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> - gfn_t gfn,
> + gfn_t gfn, int nid,
> union kvm_mmu_page_role role)
> {
> struct shadow_page_caches caches = {
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> };
>
> @@ -2305,15 +2317,18 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
>
> static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> u64 *sptep, gfn_t gfn,
> - bool direct, unsigned int access)
> + bool direct, unsigned int access,
> + kvm_pfn_t pfn)
> {
> union kvm_mmu_page_role role;
> + int nid;
>
> if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> return ERR_PTR(-EEXIST);
>
> role = kvm_mmu_child_role(sptep, direct, access);
> - return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> + nid = kvm_pfn_to_mmu_cache_nid(vcpu->kvm, pfn);
> + return kvm_mmu_get_shadow_page(vcpu, gfn, nid, role);
> }
>
> static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> @@ -3205,7 +3220,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> if (it.level == fault->goal_level)
> break;
>
> - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> + ACC_ALL, fault->pfn);
> if (sp == ERR_PTR(-EEXIST))
> continue;
>
> @@ -3625,6 +3641,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> {
> union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> struct kvm_mmu_page *sp;
> + int nid;
>
> role.level = level;
> role.quadrant = quadrant;
> @@ -3632,7 +3649,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
>
> - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> + nid = kvm_mmu_root_page_cache_nid(vcpu->kvm);
> + sp = kvm_mmu_get_shadow_page(vcpu, gfn, nid, role);
> ++sp->root_count;
>
> return __pa(sp->spt);
> @@ -5959,7 +5977,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
> int kvm_mmu_create(struct kvm_vcpu *vcpu)
> {
> - int ret;
> + int ret, nid;
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
> vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
> @@ -5967,7 +5985,12 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
> vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>
> - INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache);
> + for_each_node(nid) {
> + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid]);
> + if (kvm_numa_aware_page_table_enabled(vcpu->kvm))
> + vcpu->arch.mmu_shadow_page_cache[nid].node = nid;
> + }
> +
> mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadowed_info_cache);
> @@ -6695,13 +6718,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> }
>
> static int mmu_memory_cache_try_empty(struct kvm_mmu_memory_cache *cache,

nit: s/cache/caches/

> - struct mutex *cache_lock)
> + int cache_count, struct mutex *cache_lock)

nit: s/cache_count/nr_caches/

> {
> - int freed = 0;
> + int freed = 0, nid;

nit: s/nid/i/

(nothing in this function knows about NUMA so "nid" is an odd name here)
>
> if (mutex_trylock(cache_lock)) {
> - freed = cache->nobjs;
> - kvm_mmu_empty_memory_cache(cache);
> + for (nid = 0; nid < cache_count; nid++) {
> + if (!cache[nid].nobjs)
> + continue;
> + freed += cache[nid].nobjs;
> + kvm_mmu_empty_memory_cache(&cache[nid]);
> + }
> mutex_unlock(cache_lock);
> }
> return freed;
> @@ -6725,15 +6752,17 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
> list_move_tail(&kvm->vm_list, &vm_list);
>
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadow_page_cache,
> + freed += mmu_memory_cache_try_empty(vcpu->arch.mmu_shadow_page_cache,
> + MAX_NUMNODES,
> &vcpu->arch.mmu_shadow_page_cache_lock);
> freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadowed_info_cache,
> + 1,
> &vcpu->arch.mmu_shadow_page_cache_lock);
> if (freed >= sc->nr_to_scan)
> goto out;
> }
> freed += mmu_memory_cache_try_empty(&kvm->arch.split_shadow_page_cache,
> - &kvm->slots_lock);
> + 1, &kvm->slots_lock);
> if (freed >= sc->nr_to_scan)
> goto out;
> }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index b9d0e09ae974..652fd0c2bcba 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -340,11 +340,16 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache);
>
> +static inline bool kvm_numa_aware_page_table_enabled(struct kvm *kvm)
> +{
> + return kvm->arch.numa_aware_page_table;

No need for this helper function. Accessing the variable directly makes
lines shorter, does not introduce any code duplication, and reduces
abstraction.

> +}
> +
> static inline int kvm_pfn_to_page_table_nid(struct kvm *kvm, kvm_pfn_t pfn)
> {
> struct page *page;
>
> - if (!kvm->arch.numa_aware_page_table)
> + if (!kvm_numa_aware_page_table_enabled(kvm))
> return NUMA_NO_NODE;
>
> page = kvm_pfn_to_refcounted_page(pfn);
> @@ -355,4 +360,21 @@ static inline int kvm_pfn_to_page_table_nid(struct kvm *kvm, kvm_pfn_t pfn)
> return numa_mem_id();
> }
>
> +static inline int kvm_pfn_to_mmu_cache_nid(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> + int index = kvm_pfn_to_page_table_nid(kvm, pfn);
> +
> + if (index == NUMA_NO_NODE)
> + return KVM_MMU_DEFAULT_CACHE_INDEX;
> +
> + return index;
> +}
> +
> +static inline int kvm_mmu_root_page_cache_nid(struct kvm *kvm)
> +{
> + if (kvm_numa_aware_page_table_enabled(kvm))
> + return numa_mem_id();
> +
> + return KVM_MMU_DEFAULT_CACHE_INDEX;
> +}
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 1dea9be6849d..9db8b3df434d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> table_gfn = gw->table_gfn[it.level - 2];
> access = gw->pt_access[it.level - 2];
> sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> - false, access);
> + false, access, fault->pfn);
>
> if (sp != ERR_PTR(-EEXIST)) {
> /*
> @@ -706,7 +706,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> validate_direct_spte(vcpu, it.sptep, direct_access);
>
> sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> - true, direct_access);
> + true, direct_access, fault->pfn);
> if (sp == ERR_PTR(-EEXIST))
> continue;
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 61fd9c177694..63113a66f560 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -260,12 +260,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> {
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);
>
> return sp;
> }
> @@ -304,6 +304,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
> + int nid;
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> @@ -317,7 +318,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> goto out;
> }
>
> - root = tdp_mmu_alloc_sp(vcpu);
> + nid = kvm_mmu_root_page_cache_nid(vcpu->kvm);
> + root = tdp_mmu_alloc_sp(vcpu, nid);
> tdp_mmu_init_sp(root, NULL, 0, role);
>
> refcount_set(&root->tdp_mmu_root_count, 1);
> @@ -1149,12 +1151,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> struct kvm *kvm = vcpu->kvm;
> struct tdp_iter iter;
> struct kvm_mmu_page *sp;
> - int ret = RET_PF_RETRY;
> + int ret = RET_PF_RETRY, nid;
>
> kvm_mmu_hugepage_adjust(vcpu, fault);
>
> trace_kvm_mmu_spte_requested(fault);
>
> + nid = kvm_pfn_to_mmu_cache_nid(kvm, fault->pfn);
> +
> rcu_read_lock();
>
> tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> @@ -1182,7 +1186,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> * The SPTE is either non-present or points to a huge page that
> * needs to be split.
> */
> - sp = tdp_mmu_alloc_sp(vcpu);
> + sp = tdp_mmu_alloc_sp(vcpu, nid);
> tdp_mmu_init_child_sp(sp, &iter);
>
> sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index b2a405c8e629..13032da2ddfc 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -113,6 +113,12 @@ static inline void INIT_KVM_MMU_MEMORY_CACHE(struct kvm_mmu_memory_cache *cache)
> {
> *cache = (struct kvm_mmu_memory_cache)KVM_MMU_MEMORY_CACHE_INIT();
> }
> +
> +/*
> + * When NUMA aware page table option is disabled for a VM then use cache at the
> + * below index in the array of NUMA caches.
> + */
> +#define KVM_MMU_DEFAULT_CACHE_INDEX 0
> #endif
>
> #define HALT_POLL_HIST_COUNT 32
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 47006d209309..25a549705c8e 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -401,7 +401,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> if (mc->kmem_cache)
> return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> else
> - return (void *)__get_free_page(gfp_flags);
> + return kvm_mmu_get_free_page(gfp_flags, mc->node);
> }
>
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-03-29 00:30:33

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

On Tue, Mar 28, 2023 at 5:21 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > Allocate page tables on the preferred NUMA node via memory cache during
> > page faults. If memory cache doesn't have a preferred NUMA node (node
> > value is set to NUMA_NO_NODE) then fallback to the default logic where
> > pages are selected based on thread's mempolicy. Also, free NUMA aware
> > page caches, mmu_shadow_page_cache, when memory shrinker is invoked.
> >
> > Allocate root pages based on the current thread's NUMA node as there is
> > no way to know which will be the ideal NUMA node in long run.
> >
> > This commit allocate page tables to be on the same NUMA node as the
> > physical page pointed by them, even if a vCPU causing page fault is on a
> > different NUMA node. If memory is not available on the requested NUMA
> > node then the other nearest NUMA node is selected by default. NUMA aware
> > page tables can be beneficial in cases where a thread touches lot of far
> > memory initially and then divide work among multiple threads. VMs
> > generally take advantage of NUMA architecture for faster memory access
> > by moving threads to the NUMA node of the memory they are accessing.
> > This change will help them in accessing pages faster.
> >
> > Downside of this change is that an experimental workload can be created
> > where a guest threads are always accessing remote memory and not the one
> > local to them. This will cause performance to degrade compared to VMs
> > where numa aware page tables are not enabled. Ideally, these VMs when
> > using non-uniform memory access machine should generally be taking
> > advantage of NUMA architecture to improve their performance in the first
> > place.
> >
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++++---------
> > arch/x86/kvm/mmu/mmu_internal.h | 24 ++++++++++++-
> > arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> > arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++---
> > include/linux/kvm_types.h | 6 ++++
> > virt/kvm/kvm_main.c | 2 +-
> > 7 files changed, 88 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 64de083cd6b9..77d3aa368e5e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu *walk_mmu;
> >
> > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
>
> I think we need an abstraction for a NUMA-aware mmu cache, since there
> is more than one by the end of this series.
>
> e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
> NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
> common helper functions for operations like initializing, topping-up,
> and freeing.
>
> I have some ideas I want to try but I ran out of time today.
>
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d96afc849ee8..86f0d74d35ed 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
> >
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > - int r;
> > + int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;
> >
> > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > @@ -710,7 +710,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > if (r)
> > return r;
> >
> > - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> > + if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
> > + for_each_online_node(nid) {
>
> Blegh. This is going to potentially waste a lot of memory. Yes the
> shrinker can free it, but the next fault will re-allocate all the online
> node caches.
>
> The reason we have to top-up all nodes is because KVM tops up caches
> before faulting in the PFN, and there is concern that changing this will
> increase the rate of guest page-fault retries [1].
>
> I think we should revisit that concern. Can we do any testing to
> validate that hypothesis? Or can we convince ourselves that re-ordering
> is ok?
>
> [1] https://lore.kernel.org/kvm/CAHVum0cjqsdG2NEjRF3ZRtUY2t2=Tb9H4OyOz9wpmsrN--Sjhg@mail.gmail.com/

Ah I forgot about patch 18 reducing the default cache size. So at the
end of this series, even with topping up every node, the maximum
number of objects per cache will be 4 * num_online_nodes. So only
hosts with more than 10 online NUMA nodes would have larger caches
than today (40). That seems more reasonable to me.
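
(Concretely: a 2-node host would top up at most 2 * 4 = 8 shadow-page
cache pages per vCPU, and even an 8-node host sits at 32, still below
today's 40.)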

2023-03-29 19:09:10

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

On Tue, Mar 28, 2023 at 05:21:29PM -0700, David Matlack wrote:
> On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 64de083cd6b9..77d3aa368e5e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu *walk_mmu;
> >
> > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
>
> I think we need an abstraction for a NUMA-aware mmu cache, since there
> is more than one by the end of this series.
>
> e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
> NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
> common helper functions for operations like initializing, topping-up,
> and freeing.
>
> I have some ideas I want to try but I ran out of time today.

Something like this (compile-tested only, applies on top of this series):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 041302d6132c..b44f867d0ed2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
struct kvm_mmu *walk_mmu;

struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
- struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
+ struct kvm_mmu_numa_memory_cache mmu_shadow_page_cache;
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

@@ -1453,7 +1453,7 @@ struct kvm_arch {
*
* Protected by kvm->slots_lock.
*/
- struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
+ struct kvm_mmu_numa_memory_cache split_shadow_page_cache;
struct kvm_mmu_memory_cache split_page_header_cache;

/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5463ce6e52fa..fb7b3932f08d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)

static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
- int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;
+ int r;

/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
@@ -710,16 +710,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
if (r)
return r;

- if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
- for_each_online_node(nid) {
- r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
- KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
- }
- } else {
- r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
- KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
- }
-
+ r = kvm_mmu_topup_numa_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
if (r)
return r;

@@ -735,12 +727,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)

static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
- int nid;
-
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
- for_each_node(nid)
- mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid]);
+ kvm_mmu_free_numa_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
@@ -2262,7 +2251,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
{
struct shadow_page_caches caches = {
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
- .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache.nodes[nid],
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
};

@@ -5977,7 +5966,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)

int kvm_mmu_create(struct kvm_vcpu *vcpu)
{
- int ret, nid;
+ int ret;

INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache);
vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
@@ -5985,11 +5974,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache);
vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;

- for_each_node(nid) {
- INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid]);
- if (kvm_numa_aware_page_table_enabled(vcpu->kvm))
- vcpu->arch.mmu_shadow_page_cache[nid].node = nid;
- }
+ kvm_mmu_init_numa_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ if (kvm_numa_aware_page_table_enabled(vcpu->kvm))
+ kvm_mmu_enable_numa_memory_cache(&vcpu->arch.mmu_shadow_page_cache);

mutex_init(&vcpu->arch.mmu_shadow_page_cache_lock);

@@ -6140,7 +6127,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
int kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
- int r, nid;
+ int r;

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
@@ -6159,9 +6146,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_page_header_cache);
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;

- for_each_node(nid)
- INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_shadow_page_cache[nid]);
-
+ kvm_mmu_init_numa_memory_cache(&kvm->arch.split_shadow_page_cache);

INIT_KVM_MMU_MEMORY_CACHE(&kvm->arch.split_desc_cache);
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
@@ -6171,13 +6156,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)

static void mmu_free_vm_memory_caches(struct kvm *kvm)
{
- int nid;
-
kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
mutex_lock(&kvm->slots_lock);
- for_each_node(nid)
- mmu_free_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid]);
+ kvm_mmu_free_numa_memory_cache(&kvm->arch.split_shadow_page_cache);
mutex_unlock(&kvm->slots_lock);
}

@@ -6299,7 +6281,7 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
*/
return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
need_topup(&kvm->arch.split_page_header_cache, 1) ||
- need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
+ need_topup(&kvm->arch.split_shadow_page_cache.nodes[nid], 1);
}

static int topup_split_caches(struct kvm *kvm, int nid)
@@ -6332,7 +6314,7 @@ static int topup_split_caches(struct kvm *kvm, int nid)
if (r)
return r;

- return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache[nid], 1);
+ return mmu_topup_sp_memory_cache(&kvm->arch.split_shadow_page_cache.nodes[nid], 1);
}

static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep,
@@ -6357,7 +6339,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu

/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
- caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache.nodes[nid];

/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6760,7 +6742,7 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
list_move_tail(&kvm->vm_list, &vm_list);

kvm_for_each_vcpu(i, vcpu, kvm) {
- freed += mmu_memory_cache_try_empty(vcpu->arch.mmu_shadow_page_cache,
+ freed += mmu_memory_cache_try_empty(vcpu->arch.mmu_shadow_page_cache.nodes,
MAX_NUMNODES,
&vcpu->arch.mmu_shadow_page_cache_lock);
freed += mmu_memory_cache_try_empty(&vcpu->arch.mmu_shadowed_info_cache,
@@ -6769,7 +6751,7 @@ static unsigned long mmu_shrink_scan(struct shrinker *shrink,
if (freed >= sc->nr_to_scan)
goto out;
}
- freed += mmu_memory_cache_try_empty(kvm->arch.split_shadow_page_cache,
+ freed += mmu_memory_cache_try_empty(kvm->arch.split_shadow_page_cache.nodes,
MAX_NUMNODES, &kvm->slots_lock);
if (freed >= sc->nr_to_scan)
goto out;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 63113a66f560..721d5a415807 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -265,7 +265,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);
+ sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache.nodes[nid]);

return sp;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d8ea39b248cd..940099629626 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6176,7 +6176,7 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
- int r, nid;
+ int r;

if (cap->flags)
return -EINVAL;
@@ -6399,9 +6399,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.numa_aware_page_table = true;

mutex_lock(&kvm->slots_lock);
- for_each_node(nid) {
- kvm->arch.split_shadow_page_cache[nid].node = nid;
- }
+ kvm_mmu_enable_numa_memory_cache(&kvm->arch.split_shadow_page_cache);
mutex_unlock(&kvm->slots_lock);
r = 0;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31586a65e346..d5d966e4a8bf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1365,6 +1365,11 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
+
+void kvm_mmu_init_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache);
+void kvm_mmu_enable_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache);
+int kvm_mmu_topup_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache, int min);
+void kvm_mmu_free_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache);
#endif

void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 13032da2ddfc..7a58ea37bc15 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -101,6 +101,10 @@ struct kvm_mmu_memory_cache {
int node;
};

+struct kvm_mmu_numa_memory_cache {
+ struct kvm_mmu_memory_cache nodes[MAX_NUMNODES];
+};
+
#define KVM_MMU_MEMORY_CACHE_INIT() { \
.gfp_zero = __GFP_ZERO, \
.node = NUMA_NO_NODE, \
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25a549705c8e..2607b546c3c9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -476,6 +476,43 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
BUG_ON(!p);
return p;
}
+
+void kvm_mmu_init_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache)
+{
+ int node;
+
+ for_each_node(node)
+ INIT_KVM_MMU_MEMORY_CACHE(&cache->nodes[node]);
+}
+
+void kvm_mmu_enable_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache)
+{
+ int node;
+
+ for_each_node(node)
+ cache->nodes[node].node = node;
+}
+
+int kvm_mmu_topup_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache, int min)
+{
+ int r, node;
+
+ for_each_online_node(node) {
+ r = kvm_mmu_topup_memory_cache(&cache->nodes[node], min);
+ if (r)
+ return r;
+ }
+
+ return 0;
+}
+
+void kvm_mmu_free_numa_memory_cache(struct kvm_mmu_numa_memory_cache *cache)
+{
+ int node;
+
+ for_each_node(node)
+ kvm_mmu_free_memory_cache(&cache->nodes[node]);
+}
#endif

static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
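
For readers skimming the diff, a minimal sketch of how the pieces above
compose on the vCPU fault path (not part of the posted diff; it only
reuses names introduced there):

	/*
	 * Top up the per-node caches, then allocate the shadow page from
	 * the cache of the node (nid) backing the faulting PFN.
	 */
	r = kvm_mmu_topup_numa_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
					    KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
	if (r)
		return r;

	sp->spt = mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache.nodes[nid]);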

2023-03-30 05:08:52

by Yang, Weijiang

[permalink] [raw]
Subject: Re: [Patch v4 08/18] KVM: x86/mmu: Track unused mmu_shadowed_info_cache pages count via global counter


On 3/7/2023 6:41 AM, Vipin Sharma wrote:
> Add unused pages in mmu_shadowed_info_cache to global MMU unused page
> cache counter i.e. kvm_total_unused_cached_pages. These pages will be
> freed by MMU shrinker in future commit.

This patch mainly renames some functions, but the commit log doesn't
reflect what this patch does. Please change the commit log or squash
the patch.


>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 8 ++++----
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4322c7020d5d..185719dbeb81 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -792,7 +792,8 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> /*
> - * Protect allocation and release of pages from mmu_shadow_page_cache.
> + * Protect allocation and release of pages from mmu_shadow_page_cache
> + * and mmu_shadowed_info_cache.
> */
> struct mutex mmu_shadow_page_cache_lock;
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0a0962d8108b..b7ca31b5699c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -715,8 +715,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> return r;
>
> if (maybe_indirect) {
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> + PT64_ROOT_MAX_LEVEL);
> if (r)
> return r;
> }
> @@ -729,8 +729,8 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
> mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
>
> @@ -2197,7 +2197,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> sp->spt = mmu_sp_memory_cache_alloc(caches->shadow_page_cache);
> if (!role.direct)
> - sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> + sp->shadowed_translation = mmu_sp_memory_cache_alloc(caches->shadowed_info_cache);
>
> set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>

2023-04-03 22:53:21

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

On Tue, Mar 28, 2023 at 5:21 PM David Matlack <[email protected]> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > + PT64_ROOT_MAX_LEVEL);
>
> This ignores the return value of mmu_topup_sp_memory_cache() for all but
> the last node.
>

Yeah, I will change it to exit the function on the first error.
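
i.e. something along these lines (a sketch based on the hunk quoted
earlier in this thread, not the final code):

	for_each_online_node(nid) {
		r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
					      PT64_ROOT_MAX_LEVEL);
		/* Bail out on the first failed top-up instead of continuing. */
		if (r)
			return r;
	}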

> > static int mmu_memory_cache_try_empty(struct kvm_mmu_memory_cache *cache,
>
> nit: s/cache/caches/
>

Okay.

> > - struct mutex *cache_lock)
> > + int cache_count, struct mutex *cache_lock)
>
> nit: s/cache_count/nr_caches/

Okay.

>
> > {
> > - int freed = 0;
> > + int freed = 0, nid;
>
> nit: s/nid/i/
>
> (nothing in this function knows about NUMA so "nid" is an odd name here)

Okay.

> > +static inline bool kvm_numa_aware_page_table_enabled(struct kvm *kvm)
> > +{
> > + return kvm->arch.numa_aware_page_table;
>
> No need for this helper function. Accessing the variable directly makes
> lines shorter, does not introduce any code duplication, and reduces
> abstraction.
>

Okay.

2023-04-03 22:56:19

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

On Wed, Mar 29, 2023 at 12:04 PM David Matlack <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 05:21:29PM -0700, David Matlack wrote:
> > On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 64de083cd6b9..77d3aa368e5e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> > > struct kvm_mmu *walk_mmu;
> > >
> > > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> >
> > I think we need an abstraction for a NUMA-aware mmu cache, since there
> > is more than one by the end of this series.
> >
> > e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
> > NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
> > common helper functions for operations like initializing, topping-up,
> > and freeing.
> >
> > I have some ideas I want to try but I ran out of time today.
>
> Something like this (compile-tested only, applies on top of this series):
>

It looks good to me. I was not sure in the first place whether adding a
new struct would be acceptable. The proposed abstraction works for me;
I will update my patches accordingly in the next version.

Thanks

2023-04-03 23:22:46

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 15/18] KVM: mmu: Add NUMA node support in struct kvm_mmu_memory_cache{}

On Tue, Mar 28, 2023 at 4:25 PM David Matlack <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 10:51 AM Vipin Sharma <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 3:30 PM David Matlack <[email protected]> wrote:
> > >
> > > On Mon, Mar 06, 2023 at 02:41:24PM -0800, Vipin Sharma wrote:
> > > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);
> > > > + vcpu->arch.mmu_page_cache.gfp_zero = 0;
> > >
> > > Oh MIPS is here. Why isn't MIPS covered in the previous commits?
> >
> > Because this is the patch where MIPS gets impacted. MIPS doesn't
> > initialize gfp_zero, so there was no need to change its code before.
> > However, with the addition of "node" in kvm_mmu_memory_cache{} in this
> > patch, MIPS needs explicit initialization to (1) set node to
> > NUMA_NO_NODE, since 0 is now a valid value, and (2) keep gfp_zero at
> > 0, because INIT_KVM_MMU_MEMORY_CACHE() would otherwise set it to
> > __GFP_ZERO, unlike the existing MIPS code.
> >
> > I asked MIPS maintainers in the previous version to see if GFP_ZERO
> > can be added but didn't get any response.
> > https://lore.kernel.org/lkml/CAHVum0c+17Z-RbGAFdU-xmRejDjDQ+MKOfN4XaObh2SwgWAjLg@mail.gmail.com/
>
> I see. IMO it's more logical to convert the MIPS cache to
> INIT_KVM_MMU_MEMORY_CACHE() in patch 13, along with all the other
> users of struct kvm_mmu_memory_cache. Then in patch 14, add the line
> to set gfp_zero to 0 for MIPS to preserve the existing behavior. That
> produces a very simple chain of changes:
>
> Patch 13: Convert all users of struct kvm_mmu_memory_cache to INIT()
> Patch 14: Invert the default value of kvm_mmu_memory_cache.gfp_zero
> Patch 15: Add node to kvm_mmu_memory_cache
>
>

Yeah, this looks better. I will do this.
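
With that ordering, the MIPS side ends up as two one-line changes
(sketch only, reusing the lines quoted above; not a posted patch):

	/* Patch 13: convert MIPS to the common initializer. */
	INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_cache);

	/* Patch 14: preserve MIPS's existing behavior of not zeroing pages. */
	vcpu->arch.mmu_page_cache.gfp_zero = 0;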

2023-04-03 23:23:34

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v4 08/18] KVM: x86/mmu: Track unused mmu_shadowed_info_cache pages count via global counter

On Wed, Mar 29, 2023 at 9:53 PM Yang, Weijiang <[email protected]> wrote:
>
>
> On 3/7/2023 6:41 AM, Vipin Sharma wrote:
> > Add unused pages in mmu_shadowed_info_cache to global MMU unused page
> > cache counter i.e. kvm_total_unused_cached_pages. These pages will be
> > freed by MMU shrinker in future commit.
>
> This patch mainly renames some functions, but the commit log doesn't
> reflect what this patch does. Please change the commit log or squash
> the patch.
>
>

This is not just function renaming; the patch switches to a function
that also does page accounting. I will expand the commit log to capture
more details instead of squashing.
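
Roughly, the distinction is the following (a sketch under assumptions,
not the code from the series; the counter type and helper internals may
differ):

	/*
	 * Assumed shape: the mmu_sp_* variants wrap the generic cache
	 * helpers and additionally adjust the global unused-page counter
	 * (kvm_total_unused_cached_pages) consumed by the MMU shrinker.
	 */
	static void *mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache)
	{
		/* One fewer unused page is now sitting in a cache. */
		percpu_counter_dec(&kvm_total_unused_cached_pages);
		return kvm_mmu_memory_cache_alloc(cache);
	}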

Thanks