2022-12-22 03:02:18

by Vipin Sharma

Subject: [Patch v3 0/9] NUMA aware page table's pages allocation

Hi,

This series has expanded from v2 based on the feedback. The main items in
this series are:

1. KVM MMU shrinker now shrinks KVM caches.
The MMU shrinker frees the shadow page caches and split caches whenever it
is invoked.

2. Page table pages are allocated on the correct NUMA node during fault and split.
Page table pages are allocated on the NUMA node of the underlying physical
page that a page table entry points to (a rough sketch follows this list).
This gave a performance improvement of up to 150% in a 416-vCPU VM during
live migration.

3. Cache size is reduced from 40 to 5.
40 is the current cache size for KVM memory caches; this series reduces them
to 5. I didn't see any negative performance impact while running perf and
dirty_log_perf_test, and I saw fewer calls to get a free page.
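
As a rough illustration of item 2 above (a sketch only, not part of the
patches below; alloc_spt_on_pfn_node() is a made-up helper name), the NUMA
node is derived from the physical page that the page table entry maps:

static void *alloc_spt_on_pfn_node(kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);
	int nid = page ? page_to_nid(page) : NUMA_NO_NODE;
	struct page *spt_page;

	/* NUMA_NO_NODE falls back to the default local allocation policy. */
	spt_page = alloc_pages_node(nid, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);

	return spt_page ? page_address(spt_page) : NULL;
}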

Thanks
Vipin

v3:
- Split patches into smaller ones.
- Repurposed KVM MMU shrinker to free cache pages instead of oldest page table
pages
- Reduced cache size from 40 to 5
- Removed the __weak function; the node value is now initialized in all
architectures.
- Some name changes.

v2: https://lore.kernel.org/lkml/[email protected]/
- All page table pages will be allocated on the underlying physical page's
NUMA node.
- Introduced module parameter, numa_aware_pagetable, to disable this
feature.
- Using kvm_pfn_to_refcounted_page to get page from a pfn.

v1: https://lore.kernel.org/all/[email protected]/

Vipin Sharma (9):
KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches
KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}
KVM: x86/mmu: Shrink split_shadow_page_cache via KVM MMU shrinker
KVM: Add module param to make page tables NUMA aware
KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on
split
KVM: Provide NUMA node support to kvm_mmu_memory_cache{}
KVM: x86/mmu: Allocate page table's pages on NUMA node of the
underlying pages
KVM: x86/mmu: Make split_shadow_page_cache NUMA aware
KVM: x86/mmu: Reduce default cache size in KVM from 40 to
PT64_ROOT_MAX_LEVEL

arch/arm64/kvm/arm.c | 2 +-
arch/arm64/kvm/mmu.c | 4 +-
arch/mips/kvm/mips.c | 2 +
arch/riscv/kvm/mmu.c | 2 +-
arch/riscv/kvm/vcpu.c | 2 +-
arch/x86/include/asm/kvm_host.h | 15 +-
arch/x86/include/asm/kvm_types.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 282 +++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/paging_tmpl.h | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 24 ++-
include/linux/kvm_host.h | 27 +++
include/linux/kvm_types.h | 2 +
virt/kvm/kvm_main.c | 35 +++-
14 files changed, 277 insertions(+), 128 deletions(-)

--
2.39.0.314.g84b9a713c41-goog


2022-12-22 03:02:37

by Vipin Sharma

Subject: [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}

The zapped_obsolete_pages list in struct kvm_arch{} was used to provide
pages to the KVM MMU shrinker. It is no longer needed now that the KVM MMU
shrinker has been repurposed to free shadow page caches instead of
zapped_obsolete_pages.

Remove zapped_obsolete_pages from struct kvm_arch{} and use a local list
in kvm_zap_obsolete_pages().

Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 8 ++++----
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 89cc809e4a00..f89f02e18080 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1215,7 +1215,6 @@ struct kvm_arch {
u8 mmu_valid_gen;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
struct list_head active_mmu_pages;
- struct list_head zapped_obsolete_pages;
/*
* A list of kvm_mmu_page structs that, if zapped, could possibly be
* replaced by an NX huge page. A shadow page is on this list if its
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 157417e1cb6e..3364760a1695 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5987,6 +5987,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
struct kvm_mmu_page *sp, *node;
int nr_zapped, batch = 0;
+ LIST_HEAD(zapped_pages);
bool unstable;

restart:
@@ -6019,8 +6020,8 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
goto restart;
}

- unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
- &kvm->arch.zapped_obsolete_pages, &nr_zapped);
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &zapped_pages,
+ &nr_zapped);
batch += nr_zapped;

if (unstable)
@@ -6036,7 +6037,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
* kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
* running with an obsolete MMU.
*/
- kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+ kvm_mmu_commit_zap_page(kvm, &zapped_pages);
}

/*
@@ -6112,7 +6113,6 @@ int kvm_mmu_init_vm(struct kvm *kvm)
int r;

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
- INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);

--
2.39.0.314.g84b9a713c41-goog

2022-12-22 03:07:47

by Vipin Sharma

Subject: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

mmu_shrink_scan() is very disruptive to VMs. It picks the first VM in the
vm_list and zaps the oldest pages, which are most likely upper level SPTEs
and most likely to be reused. Prior to the TDP MMU, this was even more
disruptive in the nested VM case, since L1 SPTEs will be the oldest even
though most of the entries are for L2 SPTEs.

As discussed in
https://lore.kernel.org/lkml/[email protected]/
the shrinker logic has not been very useful in actually keeping VMs
performant or reducing memory usage.

Change mmu_shrink_scan() to free pages from the vCPU's shadow page cache.
Freeing pages from the cache doesn't cause vCPU exits, so a VM's
performance should not be affected.

This also allows changing cache capacities without worrying too much
about high memory usage in the caches.

Tested this change by running dirty_log_perf_test while continuously
dropping caches via "echo 2 > /proc/sys/vm/drop_caches" at a 1-second
interval. There were WARN_ON(!mc->nobjs) messages printed in the kernel
logs from kvm_mmu_memory_cache_alloc(), which is expected.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Vipin Sharma <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
arch/x86/kvm/mmu/mmu_internal.h | 2 +
arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 11 ++-
6 files changed, 114 insertions(+), 71 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa4eb8cfcd7e..89cc809e4a00 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;

+ /*
+ * Protects change in size of mmu_shadow_page_cache cache.
+ */
+ spinlock_t mmu_shadow_page_cache_lock;
+
/*
* QEMU userspace and the guest each have their own FPU state.
* In vcpu_run, we switch between the user and guest FPU contexts.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 254bc46234e0..157417e1cb6e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {

static struct kmem_cache *pte_list_desc_cache;
struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
+/*
+ * Total number of unused pages in MMU shadow page cache.
+ */
+static struct percpu_counter kvm_total_unused_mmu_pages;

static void mmu_spte_set(u64 *sptep, u64 spte);

@@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
}
}

+static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+ int r;
+
+ spin_lock(cache_lock);
+ orig_nobjs = cache->nobjs;
+ r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
+ if (orig_nobjs != cache->nobjs)
+ percpu_counter_add(&kvm_total_unused_mmu_pages,
+ (cache->nobjs - orig_nobjs));
+ spin_unlock(cache_lock);
+ return r;
+}
+
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
@@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
if (r)
return r;
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
- PT64_ROOT_MAX_LEVEL);
+ r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
if (r)
return r;
if (maybe_indirect) {
@@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
PT64_ROOT_MAX_LEVEL);
}

+static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+
+ spin_lock(cache_lock);
+ orig_nobjs = cache->nobjs;
+ kvm_mmu_free_memory_cache(cache);
+ if (orig_nobjs)
+ percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
+
+ spin_unlock(cache_lock);
+}
+
static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
{
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
- kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
}
@@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
}
#endif

-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values. We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
- kvm->arch.n_used_mmu_pages += nr;
- percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, +1);
+ kvm->arch.n_used_mmu_pages++;
kvm_account_pgtable_pages((void *)sp->spt, +1);
}

static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
- kvm_mod_used_mmu_pages(kvm, -1);
+ kvm->arch.n_used_mmu_pages--;
kvm_account_pgtable_pages((void *)sp->spt, -1);
}

@@ -2150,8 +2172,31 @@ struct shadow_page_caches {
struct kvm_mmu_memory_cache *page_header_cache;
struct kvm_mmu_memory_cache *shadow_page_cache;
struct kvm_mmu_memory_cache *shadowed_info_cache;
+ /*
+ * Protects change in size of shadow_page_cache cache.
+ */
+ spinlock_t *shadow_page_cache_lock;
};

+void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
+ spinlock_t *cache_lock)
+{
+ int orig_nobjs;
+ void *page;
+
+ if (!cache_lock) {
+ spin_lock(cache_lock);
+ orig_nobjs = shadow_page_cache->nobjs;
+ }
+ page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
+ if (!cache_lock) {
+ if (orig_nobjs)
+ percpu_counter_dec(&kvm_total_unused_mmu_pages);
+ spin_unlock(cache_lock);
+ }
+ return page;
+}
+
static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct shadow_page_caches *caches,
gfn_t gfn,
@@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
+ sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
+ caches->shadow_page_cache_lock);
if (!role.direct)
sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);

@@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
.page_header_cache = &vcpu->arch.mmu_page_header_cache,
.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
+ .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
};

return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
@@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;

vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+ spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);

vcpu->arch.mmu = &vcpu->arch.root_mmu;
vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm);
}

-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
- return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
struct kvm_memory_slot *slot,
struct kvm_page_track_notifier_node *node)
@@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
/* Direct SPs do not require a shadowed_info_cache. */
caches.page_header_cache = &kvm->arch.split_page_header_cache;
caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+ caches.shadow_page_cache_lock = NULL;

/* Safe to pass NULL for vCPU since requesting a direct SP. */
return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- struct kvm *kvm;
- int nr_to_scan = sc->nr_to_scan;
+ struct kvm_mmu_memory_cache *cache;
+ struct kvm *kvm, *first_kvm = NULL;
unsigned long freed = 0;
+ /* spinlock for memory cache */
+ spinlock_t *cache_lock;
+ struct kvm_vcpu *vcpu;
+ unsigned long i;

mutex_lock(&kvm_lock);

list_for_each_entry(kvm, &vm_list, vm_list) {
- int idx;
- LIST_HEAD(invalid_list);
-
- /*
- * Never scan more than sc->nr_to_scan VM instances.
- * Will not hit this condition practically since we do not try
- * to shrink more than one VM and it is very unlikely to see
- * !n_used_mmu_pages so many times.
- */
- if (!nr_to_scan--)
+ if (first_kvm == kvm)
break;
- /*
- * n_used_mmu_pages is accessed without holding kvm->mmu_lock
- * here. We may skip a VM instance errorneosly, but we do not
- * want to shrink a VM that only started to populate its MMU
- * anyway.
- */
- if (!kvm->arch.n_used_mmu_pages &&
- !kvm_has_zapped_obsolete_pages(kvm))
- continue;
+ if (!first_kvm)
+ first_kvm = kvm;
+ list_move_tail(&kvm->vm_list, &vm_list);

- idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ cache = &vcpu->arch.mmu_shadow_page_cache;
+ cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
+ if (READ_ONCE(cache->nobjs)) {
+ spin_lock(cache_lock);
+ freed += kvm_mmu_empty_memory_cache(cache);
+ spin_unlock(cache_lock);
+ }

- if (kvm_has_zapped_obsolete_pages(kvm)) {
- kvm_mmu_commit_zap_page(kvm,
- &kvm->arch.zapped_obsolete_pages);
- goto unlock;
}

- freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
- write_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
-
- /*
- * unfair on small ones
- * per-vm shrinkers cry out
- * sadness comes quickly
- */
- list_move_tail(&kvm->vm_list, &vm_list);
- break;
+ if (freed >= sc->nr_to_scan)
+ break;
}

+ if (freed)
+ percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
mutex_unlock(&kvm_lock);
+ percpu_counter_sync(&kvm_total_unused_mmu_pages);
return freed;
}

static unsigned long
mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
- return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+ return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
}

static struct shrinker mmu_shrinker = {
@@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
if (!mmu_page_header_cache)
goto out;

- if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
+ if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
goto out;

ret = register_shrinker(&mmu_shrinker, "x86-mmu");
@@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
return 0;

out_shrinker:
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
+ percpu_counter_destroy(&kvm_total_unused_mmu_pages);
out:
mmu_destroy_caches();
return ret;
@@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
void kvm_mmu_vendor_module_exit(void)
{
mmu_destroy_caches();
- percpu_counter_destroy(&kvm_total_used_mmu_pages);
+ percpu_counter_destroy(&kvm_total_unused_mmu_pages);
unregister_shrinker(&mmu_shrinker);
}

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ac00bfbf32f6..c2a342028b6a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);

+void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
+ spinlock_t *cache_lock);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 764f7c87286f..4974fa96deff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
struct kvm_mmu_page *sp;

sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
- sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+ sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
+ &vcpu->arch.mmu_shadow_page_cache_lock);

return sp;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01aad8b74162..efd9b38ea9a2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
#endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13e88297f999..f2d762878b97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
return mc->nobjs;
}

-void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
{
+ int freed = mc->nobjs;
+
while (mc->nobjs) {
if (mc->kmem_cache)
kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
@@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
free_page((unsigned long)mc->objects[--mc->nobjs]);
}

- kvfree(mc->objects);
+ return freed;
+}

+void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+{
+ kvm_mmu_empty_memory_cache(mc);
+ kvfree(mc->objects);
mc->objects = NULL;
mc->capacity = 0;
}
--
2.39.0.314.g84b9a713c41-goog

2022-12-27 18:49:25

by Ben Gardon

Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
>
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/[email protected]/
> shrinker logic has not be very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.
>
> This also allows to change cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.

Oh, that's not a good thing. I don't think we want to be hitting those
warnings. For one, kernel warnings should not be expected behavior,
probably for many reasons, but at least because Syzbot will find it.
In this particular case, we don't want to hit that because in that
case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
we'll BUG:

void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
void *p;

if (WARN_ON(!mc->nobjs))
p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
else
p = mc->objects[--mc->nobjs];
BUG_ON(!p);
return p;
}

Perhaps the risk of actually panicking is small, but it probably
indicates that we need better error handling around failed allocations
from the cache.
Or, the slightly less elegant approach might be to just hold the cache
lock around the cache topup and use of pages from the cache, but
adding better error handling would probably be cleaner.
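
For example, something in this direction (rough, untested sketch; the
counter bookkeeping is omitted) where an empty cache is handed back to the
caller instead of falling back to GFP_ATOMIC:

void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *cache,
				    spinlock_t *cache_lock)
{
	void *page = NULL;

	if (cache_lock)
		spin_lock(cache_lock);
	/* No GFP_ATOMIC fallback: let the caller see the empty cache. */
	if (cache->nobjs)
		page = cache->objects[--cache->nobjs];
	if (cache_lock)
		spin_unlock(cache_lock);

	/* Callers would need to check for NULL, e.g. return RET_PF_RETRY. */
	return page;
}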

>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}
> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }

I believe this is guaranteed to cause a null pointer dereference.

> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);

Again, this will cause a null-pointer dereference. The check above
just needs to be inverted.

> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);

I think we still want to do the SRCU read lock here to prevent
use-after-free on the vCPUs.

> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);

Would it make sense to just have kvm_mmu_empty_memory_cache()
decrement the per-cpu counter itself? I don't think there's much perf
to be gained by reducing percpu counter updates here and it would
consolidate the bookkeeping.

> + spin_unlock(cache_lock);
> + }
>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */

Nooooo, don't delete the beautiful poem!

> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);

This will return 0 if the sum of all the per-cpu counters is negative.
It should never be negative though. Might be nice to add a warning if
we would get a negative sum.
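
Something like this (untested):

static unsigned long
mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
	s64 unused = percpu_counter_sum(&kvm_total_unused_mmu_pages);

	WARN_ON_ONCE(unused < 0);

	return unused > 0 ? unused : 0;
}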

> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>

2022-12-28 22:22:35

by Vipin Sharma

Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <[email protected]> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> >
> > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > VM in the vm_list, zaps the oldest page which is most likely an upper
> > level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > oldest even though most of the entries are for L2 SPTEs.
> >
> > As discussed in
> > https://lore.kernel.org/lkml/[email protected]/
> > shrinker logic has not be very useful in actually keeping VMs performant
> > and reducing memory usage.
> >
> > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > VM's performance should not be affected.
> >
> > This also allows to change cache capacities without worrying too much
> > about high memory usage in cache.
> >
> > Tested this change by running dirty_log_perf_test while dropping cache
> > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Oh, that's not a good thing. I don't think we want to be hitting those
> warnings. For one, kernel warnings should not be expected behavior,
> probably for many reasons, but at least because Syzbot will find it.
> In this particular case, we don't want to hit that because in that
> case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> we'll BUG:
>
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> {
> void *p;
>
> if (WARN_ON(!mc->nobjs))
> p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> else
> p = mc->objects[--mc->nobjs];
> BUG_ON(!p);
> return p;
> }
>
> Perhaps the risk of actually panicking is small, but it probably
> indicates that we need better error handling around failed allocations
> from the cache.
> Or, the slightly less elegant approach might be to just hold the cache
> lock around the cache topup and use of pages from the cache, but
> adding better error handling would probably be cleaner.

I was counting on the fact that the shrinker will ideally run only in
extreme cases, i.e. when the host is running low on memory, so this
WARN_ON will only rarely be hit. I was not aware of Syzbot; it seems like
it will be a concern if it does this kind of testing.

I thought about keeping a mutex, taking it during topup and releasing it
after the whole operation is done, but I decided against it as the mutex
would be held for a long time and might block the memory shrinker longer.
I am not sure, though, if this is a valid concern.

I can't think of better error handling for this situation. I can change
the logic to hold a mutex if the above mutex hold duration concern isn't
an issue compared to the current WARN_ON() approach.

>
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 5 +
> > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 11 ++-
> > 6 files changed, 114 insertions(+), 71 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aa4eb8cfcd7e..89cc809e4a00 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > + /*
> > + * Protects change in size of mmu_shadow_page_cache cache.
> > + */
> > + spinlock_t mmu_shadow_page_cache_lock;
> > +
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > * In vcpu_run, we switch between the user and guest FPU contexts.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..157417e1cb6e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> >
> > static struct kmem_cache *pte_list_desc_cache;
> > struct kmem_cache *mmu_page_header_cache;
> > -static struct percpu_counter kvm_total_used_mmu_pages;
> > +/*
> > + * Total number of unused pages in MMU shadow page cache.
> > + */
> > +static struct percpu_counter kvm_total_unused_mmu_pages;
> >
> > static void mmu_spte_set(u64 *sptep, u64 spte);
> >
> > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + int r;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > + if (orig_nobjs != cache->nobjs)
> > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > + (cache->nobjs - orig_nobjs));
> > + spin_unlock(cache_lock);
> > + return r;
> > +}
> > +
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > int r;
> > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - PT64_ROOT_MAX_LEVEL);
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > if (r)
> > return r;
> > if (maybe_indirect) {
> > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > PT64_ROOT_MAX_LEVEL);
> > }
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
> > +
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > }
> > #endif
> >
> > -/*
> > - * This value is the sum of all of the kvm instances's
> > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > - * aggregate version in order to make the slab shrinker
> > - * faster
> > - */
> > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > -{
> > - kvm->arch.n_used_mmu_pages += nr;
> > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > -}
> > -
> > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, +1);
> > + kvm->arch.n_used_mmu_pages++;
> > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > }
> >
> > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, -1);
> > + kvm->arch.n_used_mmu_pages--;
> > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > }
> >
> > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > struct kvm_mmu_memory_cache *page_header_cache;
> > struct kvm_mmu_memory_cache *shadow_page_cache;
> > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > + /*
> > + * Protects change in size of shadow_page_cache cache.
> > + */
> > + spinlock_t *shadow_page_cache_lock;
> > };
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + void *page;
> > +
> > + if (!cache_lock) {
> > + spin_lock(cache_lock);
> > + orig_nobjs = shadow_page_cache->nobjs;
> > + }
>
> I believe this is guaranteed to cause a null pointer dereference.
>
> > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > + if (!cache_lock) {
> > + if (orig_nobjs)
> > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > + spin_unlock(cache_lock);
>
> Again, this will cause a null-pointer dereference. The check above
> just needs to be inverted.

Yes, I forgot to change it in this commit, and a later patch in the series
removes this whole "if (!cache_lock)" condition, so it skipped my
attention. Thanks for catching it.

>
> > + }
> > + return page;
> > +}
> > +
> > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct shadow_page_caches *caches,
> > gfn_t gfn,
> > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > + caches->shadow_page_cache_lock);
> > if (!role.direct)
> > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> >
> > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > };
> >
> > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > }
> >
> > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > -{
> > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > -}
> > -
> > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > struct kvm_page_track_notifier_node *node)
> > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > /* Direct SPs do not require a shadowed_info_cache. */
> > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > + caches.shadow_page_cache_lock = NULL;
> >
> > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > static unsigned long
> > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - struct kvm *kvm;
> > - int nr_to_scan = sc->nr_to_scan;
> > + struct kvm_mmu_memory_cache *cache;
> > + struct kvm *kvm, *first_kvm = NULL;
> > unsigned long freed = 0;
> > + /* spinlock for memory cache */
> > + spinlock_t *cache_lock;
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> >
> > mutex_lock(&kvm_lock);
> >
> > list_for_each_entry(kvm, &vm_list, vm_list) {
> > - int idx;
> > - LIST_HEAD(invalid_list);
> > -
> > - /*
> > - * Never scan more than sc->nr_to_scan VM instances.
> > - * Will not hit this condition practically since we do not try
> > - * to shrink more than one VM and it is very unlikely to see
> > - * !n_used_mmu_pages so many times.
> > - */
> > - if (!nr_to_scan--)
> > + if (first_kvm == kvm)
> > break;
> > - /*
> > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > - * here. We may skip a VM instance errorneosly, but we do not
> > - * want to shrink a VM that only started to populate its MMU
> > - * anyway.
> > - */
> > - if (!kvm->arch.n_used_mmu_pages &&
> > - !kvm_has_zapped_obsolete_pages(kvm))
> > - continue;
> > + if (!first_kvm)
> > + first_kvm = kvm;
> > + list_move_tail(&kvm->vm_list, &vm_list);
> >
> > - idx = srcu_read_lock(&kvm->srcu);
>
> I think we still want to do the SRCU read lock here to prevent
> use-after-free on the vCPUs.

Since I am holding kvm_lock, a kvm will not be removed from vm_list; this
blocks kvm_destroy_vm() from going further to destroy vCPUs via
kvm_arch_destroy_vm() -> kvm_destroy_vcpus(). Do we still need the
srcu_read_lock()? Also, kvm_for_each_vcpu() uses xa_for_each_range(),
which uses RCU for traversing the loop; won't these two be sufficient to
avoid needing srcu_read_lock() here?

>
> > - write_lock(&kvm->mmu_lock);
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> > + if (READ_ONCE(cache->nobjs)) {
> > + spin_lock(cache_lock);
> > + freed += kvm_mmu_empty_memory_cache(cache);
>
> Would it make sense to just have kvm_mmu_empty_memory_cache()
> decrement the per-cpu counter itself? I don't think there's much perf
> to be gained by reducing percpu counter updates here and it would
> consolidate the bookkeeping.

kvm_mmu_empty_memory_cache() is also used by other caches for which we
are not keeping a count.

>
> > + spin_unlock(cache_lock);
> > + }
> >
> > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > - kvm_mmu_commit_zap_page(kvm,
> > - &kvm->arch.zapped_obsolete_pages);
> > - goto unlock;
> > }
> >
> > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > -
> > -unlock:
> > - write_unlock(&kvm->mmu_lock);
> > - srcu_read_unlock(&kvm->srcu, idx);
> > -
> > - /*
> > - * unfair on small ones
> > - * per-vm shrinkers cry out
> > - * sadness comes quickly
> > - */
>
> Nooooo, don't delete the beautiful poem!

I will fix this mistake in the next version, pardon my ignorance :)

>
> > - list_move_tail(&kvm->vm_list, &vm_list);
> > - break;
> > + if (freed >= sc->nr_to_scan)
> > + break;
> > }
> >
> > + if (freed)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > mutex_unlock(&kvm_lock);
> > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > return freed;
> > }
> >
> > static unsigned long
> > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
>
> This will return 0 if the sum of all the per-cpu counters is negative.
> It should never be negative though. Might be nice to add a warning if
> we would get a negative sum.
>

Sounds good.


> > }
> >
> > static struct shrinker mmu_shrinker = {
> > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > if (!mmu_page_header_cache)
> > goto out;
> >
> > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > goto out;
> >
> > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > return 0;
> >
> > out_shrinker:
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > out:
> > mmu_destroy_caches();
> > return ret;
> > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > void kvm_mmu_vendor_module_exit(void)
> > {
> > mmu_destroy_caches();
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > unregister_shrinker(&mmu_shrinker);
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index ac00bfbf32f6..c2a342028b6a 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock);
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 764f7c87286f..4974fa96deff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > return sp;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 01aad8b74162..efd9b38ea9a2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 13e88297f999..f2d762878b97 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > return mc->nobjs;
> > }
> >
> > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > {
> > + int freed = mc->nobjs;
> > +
> > while (mc->nobjs) {
> > if (mc->kmem_cache)
> > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > }
> >
> > - kvfree(mc->objects);
> > + return freed;
> > +}
> >
> > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +{
> > + kvm_mmu_empty_memory_cache(mc);
> > + kvfree(mc->objects);
> > mc->objects = NULL;
> > mc->capacity = 0;
> > }
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >

2022-12-29 21:52:34

by David Matlack

Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Wed, Dec 28, 2022 at 02:07:49PM -0800, Vipin Sharma wrote:
> On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <[email protected]> wrote:
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> > >
> > > Tested this change by running dirty_log_perf_test while dropping cache
> > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> >
> > Oh, that's not a good thing. I don't think we want to be hitting those
> > warnings. For one, kernel warnings should not be expected behavior,
> > probably for many reasons, but at least because Syzbot will find it.
> > In this particular case, we don't want to hit that because in that
> > case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> > we'll BUG:
> >
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> > {
> > void *p;
> >
> > if (WARN_ON(!mc->nobjs))
> > p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> > else
> > p = mc->objects[--mc->nobjs];
> > BUG_ON(!p);
> > return p;
> > }
> >
> > Perhaps the risk of actually panicking is small, but it probably
> > indicates that we need better error handling around failed allocations
> > from the cache.
> > Or, the slightly less elegant approach might be to just hold the cache
> > lock around the cache topup and use of pages from the cache, but
> > adding better error handling would probably be cleaner.
>
> I was counting on the fact that shrinker will ideally run only in
> extreme cases, i.e. host is running on low memory. So, this WARN_ON
> will only be rarely used. I was not aware of Syzbot, it seems like it
> will be a concern if it does this kind of testing.

In an extreme low-memory situation, forcing vCPUs to do GFP_ATOMIC
allocations to handle page faults is risky. Plus it's a waste of time to
free that memory since it's just going to get immediately reallocated.

>
> I thought about keeping a mutex, taking it during topup and releasing
> it after the whole operation is done but I stopped it as the duration
> of holding mutex will be long and might block the memory shrinker
> longer. I am not sure though, if this is a valid concern.

Use mutex_trylock() to skip any vCPUs that are currently handling page
faults.
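
Roughly something like the following (rough, untested sketch: it assumes
mmu_shadow_page_cache_lock is converted from a spinlock to a mutex that
vCPUs hold across cache topup and allocation, and mmu_shrink_vcpu_cache()
is a made-up helper the shrinker would call from its kvm_for_each_vcpu()
loop):

static unsigned long mmu_shrink_vcpu_cache(struct kvm_vcpu *vcpu)
{
        unsigned long freed;

        /* Skip vCPUs that are currently handling a page fault. */
        if (!mutex_trylock(&vcpu->arch.mmu_shadow_page_cache_lock))
                return 0;

        freed = kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
        mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);

        return freed;
}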

2022-12-29 22:01:56

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/[email protected]/
> shrinker logic has not be very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.

Can you split this commit up? e.g. First drop the old shrinking logic in
one commit (but leave the shrinking infrastructure in place). Then a
commit to make the shrinker free the per-vCPU shadow page caches. And
then perhaps another to make the shrinker free the per-VM shadow page
cache used for eager splitting.

>
> This also allows to change cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}

It would be nice to avoid adding these wrapper functions.

Once you add a mutex to protect the caches from being freed while vCPUs
are in the middle of a page fault you can drop the spin lock. After that
the only reason to have these wrappers is to update
kvm_total_unused_mmu_pages.

Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
calculate the number of unused pages in mmu_shrink_count()? Or just
estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
Or have per-VM or per-vCPU shrinkers to avoid needing to do any
aggregation?
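
e.g. the estimate-only version could look something like this (untested
sketch, and it glosses over whether taking kvm_lock in the count callback
is acceptable from reclaim context):

static unsigned long
mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
        unsigned long count = 0;
        struct kvm *kvm;

        /* Upper bound: assume every vCPU's shadow page cache is full. */
        mutex_lock(&kvm_lock);
        list_for_each_entry(kvm, &vm_list, vm_list)
                count += atomic_read(&kvm->online_vcpus) *
                         KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
        mutex_unlock(&kvm_lock);

        return count;
}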

> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);

mmu_shadowed_info_cache can be freed by the shrinker as well.

> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }
> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);
> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);
> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {

What protects this from racing with vCPU creation/deletion?

> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);
> + spin_unlock(cache_lock);
> + }

What about freeing kvm->arch.split_shadow_page_cache as well?

>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */
> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>

2022-12-29 23:04:00

by David Matlack

[permalink] [raw]
Subject: Re: [Patch v3 2/9] KVM: x86/mmu: Remove zapped_obsolete_pages from struct kvm_arch{}

On Wed, Dec 21, 2022 at 06:34:50PM -0800, Vipin Sharma wrote:
> zapped_obsolete_pages list was used in struct kvm_arch{} to provide
> pages for KVM MMU shrinker. This is not needed now as KVM MMU shrinker
> has been repurposed to free shadow page caches and not
> zapped_obsolete_pages.
>
> Remove zapped_obsolete_pages from struct kvm_arch{} and use local list
> in kvm_zap_obsolete_pages().
>
> Signed-off-by: Vipin Sharma <[email protected]>

Reviewed-by: David Matlack <[email protected]>

> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu/mmu.c | 8 ++++----
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 89cc809e4a00..f89f02e18080 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1215,7 +1215,6 @@ struct kvm_arch {
> u8 mmu_valid_gen;
> struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> struct list_head active_mmu_pages;
> - struct list_head zapped_obsolete_pages;
> /*
> * A list of kvm_mmu_page structs that, if zapped, could possibly be
> * replaced by an NX huge page. A shadow page is on this list if its
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 157417e1cb6e..3364760a1695 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5987,6 +5987,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
> {
> struct kvm_mmu_page *sp, *node;
> int nr_zapped, batch = 0;
> + LIST_HEAD(zapped_pages);

optional nit: The common name of this is invalid_list (see other callers
of __kvm_mmu_prepare_zap_page()).

2023-01-03 17:55:17

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Thu, Dec 29, 2022 at 1:15 PM David Matlack <[email protected]> wrote:
>
> On Wed, Dec 28, 2022 at 02:07:49PM -0800, Vipin Sharma wrote:
> > On Tue, Dec 27, 2022 at 10:37 AM Ben Gardon <[email protected]> wrote:
> > > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> > > >
> > > > Tested this change by running dirty_log_perf_test while dropping cache
> > > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> > >
> > > Oh, that's not a good thing. I don't think we want to be hitting those
> > > warnings. For one, kernel warnings should not be expected behavior,
> > > probably for many reasons, but at least because Syzbot will find it.
> > > In this particular case, we don't want to hit that because in that
> > > case we'll try to do a GFP_ATOMIC, which can fail, and if it fails,
> > > we'll BUG:
> > >
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
> > > {
> > > void *p;
> > >
> > > if (WARN_ON(!mc->nobjs))
> > > p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
> > > else
> > > p = mc->objects[--mc->nobjs];
> > > BUG_ON(!p);
> > > return p;
> > > }
> > >
> > > Perhaps the risk of actually panicking is small, but it probably
> > > indicates that we need better error handling around failed allocations
> > > from the cache.
> > > Or, the slightly less elegant approach might be to just hold the cache
> > > lock around the cache topup and use of pages from the cache, but
> > > adding better error handling would probably be cleaner.
> >
> > I was counting on the fact that shrinker will ideally run only in
> > extreme cases, i.e. host is running on low memory. So, this WARN_ON
> > will only be rarely used. I was not aware of Syzbot, it seems like it
> > will be a concern if it does this kind of testing.
>
> In an extreme low-memory situation, forcing vCPUs to do GFP_ATOMIC
> allocations to handle page faults is risky. Plus it's a waste of time to
> free that memory since it's just going to get immediately reallocated.
>
> >
> > I thought about keeping a mutex, taking it during topup and releasing
> > it after the whole operation is done but I stopped it as the duration
> > of holding mutex will be long and might block the memory shrinker
> > longer. I am not sure though, if this is a valid concern.
>
> Use mutex_trylock() to skip any vCPUs that are currently handling page
> faults.

oh yeah! Thanks.

2023-01-03 18:05:38

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Thu, Dec 29, 2022 at 1:55 PM David Matlack <[email protected]> wrote:
>
> On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > VM in the vm_list, zaps the oldest page which is most likely an upper
> > level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > oldest even though most of the entries are for L2 SPTEs.
> >
> > As discussed in
> > https://lore.kernel.org/lkml/[email protected]/
> > shrinker logic has not be very useful in actually keeping VMs performant
> > and reducing memory usage.
> >
> > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > VM's performance should not be affected.
>
> Can you split this commit up? e.g. First drop the old shrinking logic in
> one commit (but leave the shrinking infrastructure in place). Then a
> commit to make the shrinker free the per-vCPU shadow page caches. And
> then perhaps another to make the shrinker free the per-VM shadow page
> cache used for eager splitting.
>

Sounds good, I will separate it into two parts: one for dropping the old
logic and one for adding the per-vCPU shadow page caches. Patch 3 already
enables the shrinker to free the per-VM shadow page cache.

> >
> > This also allows to change cache capacities without worrying too much
> > about high memory usage in cache.
> >
> > Tested this change by running dirty_log_perf_test while dropping cache
> > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Vipin Sharma <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 5 +
> > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/kvm_main.c | 11 ++-
> > 6 files changed, 114 insertions(+), 71 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aa4eb8cfcd7e..89cc809e4a00 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > + /*
> > + * Protects change in size of mmu_shadow_page_cache cache.
> > + */
> > + spinlock_t mmu_shadow_page_cache_lock;
> > +
> > /*
> > * QEMU userspace and the guest each have their own FPU state.
> > * In vcpu_run, we switch between the user and guest FPU contexts.
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 254bc46234e0..157417e1cb6e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> >
> > static struct kmem_cache *pte_list_desc_cache;
> > struct kmem_cache *mmu_page_header_cache;
> > -static struct percpu_counter kvm_total_used_mmu_pages;
> > +/*
> > + * Total number of unused pages in MMU shadow page cache.
> > + */
> > +static struct percpu_counter kvm_total_unused_mmu_pages;
> >
> > static void mmu_spte_set(u64 *sptep, u64 spte);
> >
> > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > }
> > }
> >
> > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + int r;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > + if (orig_nobjs != cache->nobjs)
> > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > + (cache->nobjs - orig_nobjs));
> > + spin_unlock(cache_lock);
> > + return r;
> > +}
> > +
> > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > {
> > int r;
> > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > if (r)
> > return r;
> > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > - PT64_ROOT_MAX_LEVEL);
> > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > if (r)
> > return r;
> > if (maybe_indirect) {
> > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > PT64_ROOT_MAX_LEVEL);
> > }
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
>
> It would be nice to avoid adding these wrapper functions.
>
> Once you add a mutex to protect the caches from being freed while vCPUs
> are in the middle of a page fault you can drop the spin lock. After that
> the only reason to have these wrappers is to update
> kvm_total_unused_mmu_pages.
>
> Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
> calculate the number of unused pages in mmu_shrink_count()? Or just
> estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
> Or have per-VM or per-vCPU shrinkers to avoid needing to do any
> aggregation?
>

I think we can drop this; by default mmu_shrink_count() can return
num_kvms * num_vcpus * nodes * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE.

Whenever mmu_shrink_scan() is called, if there are no pages to free then
it can return SHRINK_STOP, which will stop any subsequent calls during
that time.
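
Something like this at the end of mmu_shrink_scan() (untested sketch, the
actual freeing loop is elided):

static unsigned long
mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        unsigned long freed = 0;
        struct kvm *kvm;

        mutex_lock(&kvm_lock);
        list_for_each_entry(kvm, &vm_list, vm_list) {
                /* ... empty per-vCPU and per-VM shadow page caches here ... */
                if (freed >= sc->nr_to_scan)
                        break;
        }
        mutex_unlock(&kvm_lock);

        /* Nothing to free, tell the shrinker core to stop calling us. */
        return freed ? freed : SHRINK_STOP;
}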


> > +
> > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > {
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
>
> mmu_shadowed_info_cache can be freed by the shrinker as well.
>
> > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > }
> > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > }
> > #endif
> >
> > -/*
> > - * This value is the sum of all of the kvm instances's
> > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > - * aggregate version in order to make the slab shrinker
> > - * faster
> > - */
> > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > -{
> > - kvm->arch.n_used_mmu_pages += nr;
> > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > -}
> > -
> > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, +1);
> > + kvm->arch.n_used_mmu_pages++;
> > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > }
> >
> > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > {
> > - kvm_mod_used_mmu_pages(kvm, -1);
> > + kvm->arch.n_used_mmu_pages--;
> > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > }
> >
> > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > struct kvm_mmu_memory_cache *page_header_cache;
> > struct kvm_mmu_memory_cache *shadow_page_cache;
> > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > + /*
> > + * Protects change in size of shadow_page_cache cache.
> > + */
> > + spinlock_t *shadow_page_cache_lock;
> > };
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > + void *page;
> > +
> > + if (!cache_lock) {
> > + spin_lock(cache_lock);
> > + orig_nobjs = shadow_page_cache->nobjs;
> > + }
> > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > + if (!cache_lock) {
> > + if (orig_nobjs)
> > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > + spin_unlock(cache_lock);
> > + }
> > + return page;
> > +}
> > +
> > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct shadow_page_caches *caches,
> > gfn_t gfn,
> > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > + caches->shadow_page_cache_lock);
> > if (!role.direct)
> > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> >
> > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > };
> >
> > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> >
> > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > }
> >
> > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > -{
> > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > -}
> > -
> > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> > struct kvm_page_track_notifier_node *node)
> > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > /* Direct SPs do not require a shadowed_info_cache. */
> > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > + caches.shadow_page_cache_lock = NULL;
> >
> > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > static unsigned long
> > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - struct kvm *kvm;
> > - int nr_to_scan = sc->nr_to_scan;
> > + struct kvm_mmu_memory_cache *cache;
> > + struct kvm *kvm, *first_kvm = NULL;
> > unsigned long freed = 0;
> > + /* spinlock for memory cache */
> > + spinlock_t *cache_lock;
> > + struct kvm_vcpu *vcpu;
> > + unsigned long i;
> >
> > mutex_lock(&kvm_lock);
> >
> > list_for_each_entry(kvm, &vm_list, vm_list) {
> > - int idx;
> > - LIST_HEAD(invalid_list);
> > -
> > - /*
> > - * Never scan more than sc->nr_to_scan VM instances.
> > - * Will not hit this condition practically since we do not try
> > - * to shrink more than one VM and it is very unlikely to see
> > - * !n_used_mmu_pages so many times.
> > - */
> > - if (!nr_to_scan--)
> > + if (first_kvm == kvm)
> > break;
> > - /*
> > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > - * here. We may skip a VM instance errorneosly, but we do not
> > - * want to shrink a VM that only started to populate its MMU
> > - * anyway.
> > - */
> > - if (!kvm->arch.n_used_mmu_pages &&
> > - !kvm_has_zapped_obsolete_pages(kvm))
> > - continue;
> > + if (!first_kvm)
> > + first_kvm = kvm;
> > + list_move_tail(&kvm->vm_list, &vm_list);
> >
> > - idx = srcu_read_lock(&kvm->srcu);
> > - write_lock(&kvm->mmu_lock);
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
>
> What protects this from racing with vCPU creation/deletion?
>
> > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> > + if (READ_ONCE(cache->nobjs)) {
> > + spin_lock(cache_lock);
> > + freed += kvm_mmu_empty_memory_cache(cache);
> > + spin_unlock(cache_lock);
> > + }
>
> What about freeing kvm->arch.split_shadow_page_cache as well?
>
> >
> > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > - kvm_mmu_commit_zap_page(kvm,
> > - &kvm->arch.zapped_obsolete_pages);
> > - goto unlock;
> > }
> >
> > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > -
> > -unlock:
> > - write_unlock(&kvm->mmu_lock);
> > - srcu_read_unlock(&kvm->srcu, idx);
> > -
> > - /*
> > - * unfair on small ones
> > - * per-vm shrinkers cry out
> > - * sadness comes quickly
> > - */
> > - list_move_tail(&kvm->vm_list, &vm_list);
> > - break;
> > + if (freed >= sc->nr_to_scan)
> > + break;
> > }
> >
> > + if (freed)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > mutex_unlock(&kvm_lock);
> > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > return freed;
> > }
> >
> > static unsigned long
> > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> > }
> >
> > static struct shrinker mmu_shrinker = {
> > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > if (!mmu_page_header_cache)
> > goto out;
> >
> > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > goto out;
> >
> > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > return 0;
> >
> > out_shrinker:
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > out:
> > mmu_destroy_caches();
> > return ret;
> > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > void kvm_mmu_vendor_module_exit(void)
> > {
> > mmu_destroy_caches();
> > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > unregister_shrinker(&mmu_shrinker);
> > }
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index ac00bfbf32f6..c2a342028b6a 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > + spinlock_t *cache_lock);
> > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 764f7c87286f..4974fa96deff 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > struct kvm_mmu_page *sp;
> >
> > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > + &vcpu->arch.mmu_shadow_page_cache_lock);
> >
> > return sp;
> > }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 01aad8b74162..efd9b38ea9a2 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > #endif
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 13e88297f999..f2d762878b97 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > return mc->nobjs;
> > }
> >
> > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > {
> > + int freed = mc->nobjs;
> > +
> > while (mc->nobjs) {
> > if (mc->kmem_cache)
> > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > }
> >
> > - kvfree(mc->objects);
> > + return freed;
> > +}
> >
> > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > +{
> > + kvm_mmu_empty_memory_cache(mc);
> > + kvfree(mc->objects);
> > mc->objects = NULL;
> > mc->capacity = 0;
> > }
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >

2023-01-03 20:19:12

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
>
> mmu_shrink_scan() is very disruptive to VMs. It picks the first
> VM in the vm_list, zaps the oldest page which is most likely an upper
> level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> more disruptive in nested VMs case, considering L1 SPTEs will be the
> oldest even though most of the entries are for L2 SPTEs.
>
> As discussed in
> https://lore.kernel.org/lkml/[email protected]/
> shrinker logic has not be very useful in actually keeping VMs performant
> and reducing memory usage.
>
> Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> VM's performance should not be affected.
>
> This also allows to change cache capacities without worrying too much
> about high memory usage in cache.
>
> Tested this change by running dirty_log_perf_test while dropping cache
> via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> logs from kvm_mmu_memory_cache_alloc(), which is expected.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Vipin Sharma <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 5 +
> arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> arch/x86/kvm/mmu/mmu_internal.h | 2 +
> arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 ++-
> 6 files changed, 114 insertions(+), 71 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aa4eb8cfcd7e..89cc809e4a00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> + * Protects change in size of mmu_shadow_page_cache cache.
> + */
> + spinlock_t mmu_shadow_page_cache_lock;
> +
> /*
> * QEMU userspace and the guest each have their own FPU state.
> * In vcpu_run, we switch between the user and guest FPU contexts.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 254bc46234e0..157417e1cb6e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
>
> static struct kmem_cache *pte_list_desc_cache;
> struct kmem_cache *mmu_page_header_cache;
> -static struct percpu_counter kvm_total_used_mmu_pages;
> +/*
> + * Total number of unused pages in MMU shadow page cache.
> + */
> +static struct percpu_counter kvm_total_unused_mmu_pages;
>
> static void mmu_spte_set(u64 *sptep, u64 spte);
>
> @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> }
> }
>
> +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + int r;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> + if (orig_nobjs != cache->nobjs)
> + percpu_counter_add(&kvm_total_unused_mmu_pages,
> + (cache->nobjs - orig_nobjs));
> + spin_unlock(cache_lock);
> + return r;
> +}
> +
> static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> {
> int r;
> @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> if (r)
> return r;
> - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> - PT64_ROOT_MAX_LEVEL);
> + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> if (r)
> return r;
> if (maybe_indirect) {
> @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> PT64_ROOT_MAX_LEVEL);
> }
>
> +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> +
> + spin_lock(cache_lock);
> + orig_nobjs = cache->nobjs;
> + kvm_mmu_free_memory_cache(cache);
> + if (orig_nobjs)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> +
> + spin_unlock(cache_lock);
> +}

I think the mmu_cache allocation and deallocation may force the use of
GFP_ATOMIC (as observed by other reviewers as well). Adding a new lock
definitely sounds like a plan, but I think it might affect performance.
Alternatively, I am wondering if we could use an mmu_cache_sequence,
similar to mmu_notifier_seq, to help avoid the concurrency issue?

Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by the
mmu_lock write lock. In the page fault path, each vCPU has to collect a
snapshot of mmu_cache_sequence before calling into
mmu_topup_memory_caches() and check the value again while holding the mmu
lock. If the value is different, that means the mmu_shrinker has removed
the cache objects, and the vCPU should retry.
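
As a rough sketch of the check (mmu_cache_seq here is a hypothetical new
field in struct kvm_arch, not existing KVM code; the shrinker would bump
it under the mmu_lock write lock after emptying the caches):

static bool mmu_cache_topup_invalidated(struct kvm_vcpu *vcpu,
                                        unsigned long cache_seq)
{
        lockdep_assert_held_write(&vcpu->kvm->mmu_lock);

        /* The shrinker emptied the caches after they were topped up. */
        return unlikely(cache_seq != vcpu->kvm->arch.mmu_cache_seq);
}

The fault path would snapshot kvm->arch.mmu_cache_seq before
mmu_topup_memory_caches(), and if this returns true after taking the mmu
lock, drop the lock and redo the topup.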


> +
> static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> {
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> }
> @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> }
> #endif
>
> -/*
> - * This value is the sum of all of the kvm instances's
> - * kvm->arch.n_used_mmu_pages values. We need a global,
> - * aggregate version in order to make the slab shrinker
> - * faster
> - */
> -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> -{
> - kvm->arch.n_used_mmu_pages += nr;
> - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> -}
> -
> static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, +1);
> + kvm->arch.n_used_mmu_pages++;
> kvm_account_pgtable_pages((void *)sp->spt, +1);
> }
>
> static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> - kvm_mod_used_mmu_pages(kvm, -1);
> + kvm->arch.n_used_mmu_pages--;
> kvm_account_pgtable_pages((void *)sp->spt, -1);
> }
>
> @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> struct kvm_mmu_memory_cache *page_header_cache;
> struct kvm_mmu_memory_cache *shadow_page_cache;
> struct kvm_mmu_memory_cache *shadowed_info_cache;
> + /*
> + * Protects change in size of shadow_page_cache cache.
> + */
> + spinlock_t *shadow_page_cache_lock;
> };
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock)
> +{
> + int orig_nobjs;
> + void *page;
> +
> + if (!cache_lock) {
> + spin_lock(cache_lock);
> + orig_nobjs = shadow_page_cache->nobjs;
> + }
> + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> + if (!cache_lock) {
> + if (orig_nobjs)
> + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> + spin_unlock(cache_lock);
> + }
> + return page;
> +}
> +
> static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct shadow_page_caches *caches,
> gfn_t gfn,
> @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> + caches->shadow_page_cache_lock);
> if (!role.direct)
> sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
>
> @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> };
>
> return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>
> vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
>
> vcpu->arch.mmu = &vcpu->arch.root_mmu;
> vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
> }
>
> -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> -{
> - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> -}
> -
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> /* Direct SPs do not require a shadowed_info_cache. */
> caches.page_header_cache = &kvm->arch.split_page_header_cache;
> caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> + caches.shadow_page_cache_lock = NULL;
>
> /* Safe to pass NULL for vCPU since requesting a direct SP. */
> return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> static unsigned long
> mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> - struct kvm *kvm;
> - int nr_to_scan = sc->nr_to_scan;
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm *kvm, *first_kvm = NULL;
> unsigned long freed = 0;
> + /* spinlock for memory cache */
> + spinlock_t *cache_lock;
> + struct kvm_vcpu *vcpu;
> + unsigned long i;
>
> mutex_lock(&kvm_lock);
>
> list_for_each_entry(kvm, &vm_list, vm_list) {
> - int idx;
> - LIST_HEAD(invalid_list);
> -
> - /*
> - * Never scan more than sc->nr_to_scan VM instances.
> - * Will not hit this condition practically since we do not try
> - * to shrink more than one VM and it is very unlikely to see
> - * !n_used_mmu_pages so many times.
> - */
> - if (!nr_to_scan--)
> + if (first_kvm == kvm)
> break;
> - /*
> - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> - * here. We may skip a VM instance errorneosly, but we do not
> - * want to shrink a VM that only started to populate its MMU
> - * anyway.
> - */
> - if (!kvm->arch.n_used_mmu_pages &&
> - !kvm_has_zapped_obsolete_pages(kvm))
> - continue;
> + if (!first_kvm)
> + first_kvm = kvm;
> + list_move_tail(&kvm->vm_list, &vm_list);
>
> - idx = srcu_read_lock(&kvm->srcu);
> - write_lock(&kvm->mmu_lock);
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + cache = &vcpu->arch.mmu_shadow_page_cache;
> + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> + if (READ_ONCE(cache->nobjs)) {
> + spin_lock(cache_lock);
> + freed += kvm_mmu_empty_memory_cache(cache);
> + spin_unlock(cache_lock);
> + }
>
> - if (kvm_has_zapped_obsolete_pages(kvm)) {
> - kvm_mmu_commit_zap_page(kvm,
> - &kvm->arch.zapped_obsolete_pages);
> - goto unlock;
> }
>
> - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> -
> -unlock:
> - write_unlock(&kvm->mmu_lock);
> - srcu_read_unlock(&kvm->srcu, idx);
> -
> - /*
> - * unfair on small ones
> - * per-vm shrinkers cry out
> - * sadness comes quickly
> - */
> - list_move_tail(&kvm->vm_list, &vm_list);
> - break;
> + if (freed >= sc->nr_to_scan)
> + break;
> }
>
> + if (freed)
> + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> mutex_unlock(&kvm_lock);
> + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> return freed;
> }
>
> static unsigned long
> mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> {
> - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> }
>
> static struct shrinker mmu_shrinker = {
> @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> if (!mmu_page_header_cache)
> goto out;
>
> - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> goto out;
>
> ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> return 0;
>
> out_shrinker:
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> out:
> mmu_destroy_caches();
> return ret;
> @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> void kvm_mmu_vendor_module_exit(void)
> {
> mmu_destroy_caches();
> - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> unregister_shrinker(&mmu_shrinker);
> }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index ac00bfbf32f6..c2a342028b6a 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> + spinlock_t *cache_lock);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 764f7c87286f..4974fa96deff 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> struct kvm_mmu_page *sp;
>
> sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> + &vcpu->arch.mmu_shadow_page_cache_lock);
>
> return sp;
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 01aad8b74162..efd9b38ea9a2 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> #endif
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..f2d762878b97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> return mc->nobjs;
> }
>
> -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> {
> + int freed = mc->nobjs;
> +
> while (mc->nobjs) {
> if (mc->kmem_cache)
> kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> free_page((unsigned long)mc->objects[--mc->nobjs]);
> }
>
> - kvfree(mc->objects);
> + return freed;
> +}
>
> +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> +{
> + kvm_mmu_empty_memory_cache(mc);
> + kvfree(mc->objects);
> mc->objects = NULL;
> mc->capacity = 0;
> }
> --
> 2.39.0.314.g84b9a713c41-goog
>

2023-01-04 00:42:18

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Jan 3, 2023 at 10:01 AM Vipin Sharma <[email protected]> wrote:
>
> On Thu, Dec 29, 2022 at 1:55 PM David Matlack <[email protected]> wrote:
> >
> > On Wed, Dec 21, 2022 at 06:34:49PM -0800, Vipin Sharma wrote:
> > > mmu_shrink_scan() is very disruptive to VMs. It picks the first
> > > VM in the vm_list, zaps the oldest page which is most likely an upper
> > > level SPTEs and most like to be reused. Prior to TDP MMU, this is even
> > > more disruptive in nested VMs case, considering L1 SPTEs will be the
> > > oldest even though most of the entries are for L2 SPTEs.
> > >
> > > As discussed in
> > > https://lore.kernel.org/lkml/[email protected]/
> > > shrinker logic has not be very useful in actually keeping VMs performant
> > > and reducing memory usage.
> > >
> > > Change mmu_shrink_scan() to free pages from the vCPU's shadow page
> > > cache. Freeing pages from cache doesn't cause vCPU exits, therefore, a
> > > VM's performance should not be affected.
> >
> > Can you split this commit up? e.g. First drop the old shrinking logic in
> > one commit (but leave the shrinking infrastructure in place). Then a
> > commit to make the shrinker free the per-vCPU shadow page caches. And
> > then perhaps another to make the shrinker free the per-VM shadow page
> > cache used for eager splitting.
> >
>
> Sounds good, I will separate it into two parts: one for dropping the old
> logic and one for adding the per-vCPU shadow page caches. Patch 3 already
> enables the shrinker to free the per-VM shadow page cache.
>
> > >
> > > This also allows to change cache capacities without worrying too much
> > > about high memory usage in cache.
> > >
> > > Tested this change by running dirty_log_perf_test while dropping cache
> > > via "echo 2 > /proc/sys/vm/drop_caches" at 1 second interval
> > > continuously. There were WARN_ON(!mc->nobjs) messages printed in kernel
> > > logs from kvm_mmu_memory_cache_alloc(), which is expected.
> > >
> > > Suggested-by: Sean Christopherson <[email protected]>
> > > Signed-off-by: Vipin Sharma <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 5 +
> > > arch/x86/kvm/mmu/mmu.c | 163 +++++++++++++++++++-------------
> > > arch/x86/kvm/mmu/mmu_internal.h | 2 +
> > > arch/x86/kvm/mmu/tdp_mmu.c | 3 +-
> > > include/linux/kvm_host.h | 1 +
> > > virt/kvm/kvm_main.c | 11 ++-
> > > 6 files changed, 114 insertions(+), 71 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index aa4eb8cfcd7e..89cc809e4a00 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -786,6 +786,11 @@ struct kvm_vcpu_arch {
> > > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > >
> > > + /*
> > > + * Protects change in size of mmu_shadow_page_cache cache.
> > > + */
> > > + spinlock_t mmu_shadow_page_cache_lock;
> > > +
> > > /*
> > > * QEMU userspace and the guest each have their own FPU state.
> > > * In vcpu_run, we switch between the user and guest FPU contexts.
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 254bc46234e0..157417e1cb6e 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -164,7 +164,10 @@ struct kvm_shadow_walk_iterator {
> > >
> > > static struct kmem_cache *pte_list_desc_cache;
> > > struct kmem_cache *mmu_page_header_cache;
> > > -static struct percpu_counter kvm_total_used_mmu_pages;
> > > +/*
> > > + * Total number of unused pages in MMU shadow page cache.
> > > + */
> > > +static struct percpu_counter kvm_total_unused_mmu_pages;
> > >
> > > static void mmu_spte_set(u64 *sptep, u64 spte);
> > >
> > > @@ -655,6 +658,22 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> > > }
> > > }
> > >
> > > +static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > + int r;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL);
> > > + if (orig_nobjs != cache->nobjs)
> > > + percpu_counter_add(&kvm_total_unused_mmu_pages,
> > > + (cache->nobjs - orig_nobjs));
> > > + spin_unlock(cache_lock);
> > > + return r;
> > > +}
> > > +
> > > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > {
> > > int r;
> > > @@ -664,8 +683,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > > if (r)
> > > return r;
> > > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - PT64_ROOT_MAX_LEVEL);
> > > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > if (r)
> > > return r;
> > > if (maybe_indirect) {
> > > @@ -678,10 +697,25 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > PT64_ROOT_MAX_LEVEL);
> > > }
> > >
> > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + kvm_mmu_free_memory_cache(cache);
> > > + if (orig_nobjs)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > +
> > > + spin_unlock(cache_lock);
> > > +}
> >
> > It would be nice to avoid adding these wrapper functions.
> >
> > Once you add a mutex to protect the caches from being freed while vCPUs
> > are in the middle of a page fault you can drop the spin lock. After that
> > the only reason to have these wrappers is to update
> > kvm_total_unused_mmu_pages.
> >
> > Do we really need kvm_total_unused_mmu_pages? Why not just dynamically
> > calculate the number of of unused pages in mmu_shrink_count()? Or just
> > estimate the count, e.g. num_vcpus * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE?
> > Or have per-VM or per-vCPU shrinkers to avoid needing to do any
> > aggregation?
> >
>
> I think we can drop this, by default we can return num_kvms *
> num_vcpus * nodes * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
>
> Whenever mmu_shrink_scan() is called if there are no pages to free
> then return SHRINK_STOP which will stop any subsequent calls during
> that time.
>
>
> > > +
> > > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > > {
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > > - kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> > > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> >
> > mmu_shadowed_info_cache can be freed by the shrinker as well.
> >

Yes, I can do that as well.

> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > > }
> > > @@ -1693,27 +1727,15 @@ static int is_empty_shadow_page(u64 *spt)
> > > }
> > > #endif
> > >
> > > -/*
> > > - * This value is the sum of all of the kvm instances's
> > > - * kvm->arch.n_used_mmu_pages values. We need a global,
> > > - * aggregate version in order to make the slab shrinker
> > > - * faster
> > > - */
> > > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
> > > -{
> > > - kvm->arch.n_used_mmu_pages += nr;
> > > - percpu_counter_add(&kvm_total_used_mmu_pages, nr);
> > > -}
> > > -
> > > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > {
> > > - kvm_mod_used_mmu_pages(kvm, +1);
> > > + kvm->arch.n_used_mmu_pages++;
> > > kvm_account_pgtable_pages((void *)sp->spt, +1);
> > > }
> > >
> > > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> > > {
> > > - kvm_mod_used_mmu_pages(kvm, -1);
> > > + kvm->arch.n_used_mmu_pages--;
> > > kvm_account_pgtable_pages((void *)sp->spt, -1);
> > > }
> > >
> > > @@ -2150,8 +2172,31 @@ struct shadow_page_caches {
> > > struct kvm_mmu_memory_cache *page_header_cache;
> > > struct kvm_mmu_memory_cache *shadow_page_cache;
> > > struct kvm_mmu_memory_cache *shadowed_info_cache;
> > > + /*
> > > + * Protects change in size of shadow_page_cache cache.
> > > + */
> > > + spinlock_t *shadow_page_cache_lock;
> > > };
> > >
> > > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > + void *page;
> > > +
> > > + if (!cache_lock) {
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = shadow_page_cache->nobjs;
> > > + }
> > > + page = kvm_mmu_memory_cache_alloc(shadow_page_cache);
> > > + if (!cache_lock) {
> > > + if (orig_nobjs)
> > > + percpu_counter_dec(&kvm_total_unused_mmu_pages);
> > > + spin_unlock(cache_lock);
> > > + }
> > > + return page;
> > > +}
> > > +
> > > static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > > struct shadow_page_caches *caches,
> > > gfn_t gfn,
> > > @@ -2161,7 +2206,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
> > > - sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(caches->shadow_page_cache,
> > > + caches->shadow_page_cache_lock);
> > > if (!role.direct)
> > > sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
> > >
> > > @@ -2218,6 +2264,7 @@ static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > > .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > > + .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > > };
> > >
> > > return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
> > > @@ -5916,6 +5963,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
> > >
> > > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> > > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> > > @@ -6051,11 +6099,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> > > kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > > }
> > >
> > > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> > > -{
> > > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> > > -}
> > > -
> > > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> > > struct kvm_memory_slot *slot,
> > > struct kvm_page_track_notifier_node *node)
> > > @@ -6277,6 +6320,7 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
> > > /* Direct SPs do not require a shadowed_info_cache. */
> > > caches.page_header_cache = &kvm->arch.split_page_header_cache;
> > > caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
> > > + caches.shadow_page_cache_lock = NULL;
> > >
> > > /* Safe to pass NULL for vCPU since requesting a direct SP. */
> > > return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
> > > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > static unsigned long
> > > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > {
> > > - struct kvm *kvm;
> > > - int nr_to_scan = sc->nr_to_scan;
> > > + struct kvm_mmu_memory_cache *cache;
> > > + struct kvm *kvm, *first_kvm = NULL;
> > > unsigned long freed = 0;
> > > + /* spinlock for memory cache */
> > > + spinlock_t *cache_lock;
> > > + struct kvm_vcpu *vcpu;
> > > + unsigned long i;
> > >
> > > mutex_lock(&kvm_lock);
> > >
> > > list_for_each_entry(kvm, &vm_list, vm_list) {
> > > - int idx;
> > > - LIST_HEAD(invalid_list);
> > > -
> > > - /*
> > > - * Never scan more than sc->nr_to_scan VM instances.
> > > - * Will not hit this condition practically since we do not try
> > > - * to shrink more than one VM and it is very unlikely to see
> > > - * !n_used_mmu_pages so many times.
> > > - */
> > > - if (!nr_to_scan--)
> > > + if (first_kvm == kvm)
> > > break;
> > > - /*
> > > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > > - * here. We may skip a VM instance errorneosly, but we do not
> > > - * want to shrink a VM that only started to populate its MMU
> > > - * anyway.
> > > - */
> > > - if (!kvm->arch.n_used_mmu_pages &&
> > > - !kvm_has_zapped_obsolete_pages(kvm))
> > > - continue;
> > > + if (!first_kvm)
> > > + first_kvm = kvm;
> > > + list_move_tail(&kvm->vm_list, &vm_list);
> > >
> > > - idx = srcu_read_lock(&kvm->srcu);
> > > - write_lock(&kvm->mmu_lock);
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> >
> > What protects this from racing with vCPU creation/deletion?
> >

vCPU deletion:
We take kvm_lock in mmu_shrink_scan(), and the same lock is taken in
kvm_destroy_vm() to remove a VM from vm_list. So, while we are
iterating vm_list we will not see any VM removal, which means no vCPU
removal either.

I didn't find any other code path for vCPU deletion except failures
during VM and vCPU setup, and a VM is only added to vm_list after
successful creation.

vCPU creation:
I think it will work.

kvm_vm_ioctl_create_vcpu() initializes the vCPU and adds it to
kvm->vcpu_array, which is an xarray managed via RCU; only after that is
online_vcpus incremented. So kvm_for_each_vcpu(), which uses RCU to
read entries, will also see a fully initialized vCPU whenever it sees
the incremented online_vcpus value.

@Sean, Paolo

Is the above explanation correct, i.e. is kvm_for_each_vcpu() safe without any lock?
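
For reference, the pattern in question boils down to the following, a
minimal sketch of the shrinker side from the hunk above with the actual
cache freeing elided:

	mutex_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list) {
		/*
		 * kvm_for_each_vcpu() walks kvm->vcpu_array under RCU and
		 * only up to online_vcpus, which is incremented after a vCPU
		 * is fully initialized, so a half-created vCPU is never
		 * visible here; kvm_lock keeps the VM (and its vCPUs) from
		 * going away.
		 */
		kvm_for_each_vcpu(i, vcpu, kvm) {
			/* free objects from vcpu->arch.mmu_shadow_page_cache */
		}
	}
	mutex_unlock(&kvm_lock);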

> > > + cache = &vcpu->arch.mmu_shadow_page_cache;
> > > + cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock;
> > > + if (READ_ONCE(cache->nobjs)) {
> > > + spin_lock(cache_lock);
> > > + freed += kvm_mmu_empty_memory_cache(cache);
> > > + spin_unlock(cache_lock);
> > > + }
> >
> > What about freeing kvm->arch.split_shadow_page_cache as well?
> >

I am doing this in patch 3.

> > >
> > > - if (kvm_has_zapped_obsolete_pages(kvm)) {
> > > - kvm_mmu_commit_zap_page(kvm,
> > > - &kvm->arch.zapped_obsolete_pages);
> > > - goto unlock;
> > > }
> > >
> > > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
> > > -
> > > -unlock:
> > > - write_unlock(&kvm->mmu_lock);
> > > - srcu_read_unlock(&kvm->srcu, idx);
> > > -
> > > - /*
> > > - * unfair on small ones
> > > - * per-vm shrinkers cry out
> > > - * sadness comes quickly
> > > - */
> > > - list_move_tail(&kvm->vm_list, &vm_list);
> > > - break;
> > > + if (freed >= sc->nr_to_scan)
> > > + break;
> > > }
> > >
> > > + if (freed)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, freed);
> > > mutex_unlock(&kvm_lock);
> > > + percpu_counter_sync(&kvm_total_unused_mmu_pages);
> > > return freed;
> > > }
> > >
> > > static unsigned long
> > > mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > {
> > > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
> > > + return percpu_counter_sum_positive(&kvm_total_unused_mmu_pages);
> > > }
> > >
> > > static struct shrinker mmu_shrinker = {
> > > @@ -6820,7 +6847,7 @@ int kvm_mmu_vendor_module_init(void)
> > > if (!mmu_page_header_cache)
> > > goto out;
> > >
> > > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
> > > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL))
> > > goto out;
> > >
> > > ret = register_shrinker(&mmu_shrinker, "x86-mmu");
> > > @@ -6830,7 +6857,7 @@ int kvm_mmu_vendor_module_init(void)
> > > return 0;
> > >
> > > out_shrinker:
> > > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > > out:
> > > mmu_destroy_caches();
> > > return ret;
> > > @@ -6847,7 +6874,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > void kvm_mmu_vendor_module_exit(void)
> > > {
> > > mmu_destroy_caches();
> > > - percpu_counter_destroy(&kvm_total_used_mmu_pages);
> > > + percpu_counter_destroy(&kvm_total_unused_mmu_pages);
> > > unregister_shrinker(&mmu_shrinker);
> > > }
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index ac00bfbf32f6..c2a342028b6a 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -325,4 +325,6 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > >
> > > +void *kvm_mmu_sp_memory_cache_alloc(struct kvm_mmu_memory_cache *shadow_page_cache,
> > > + spinlock_t *cache_lock);
> > > #endif /* __KVM_X86_MMU_INTERNAL_H */
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 764f7c87286f..4974fa96deff 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -264,7 +264,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > return sp;
> > > }
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 01aad8b74162..efd9b38ea9a2 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1362,6 +1362,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm);
> > > int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
> > > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
> > > int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
> > > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
> > > void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > > #endif
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 13e88297f999..f2d762878b97 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -438,8 +438,10 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc)
> > > return mc->nobjs;
> > > }
> > >
> > > -void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > +int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > {
> > > + int freed = mc->nobjs;
> > > +
> > > while (mc->nobjs) {
> > > if (mc->kmem_cache)
> > > kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
> > > @@ -447,8 +449,13 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > free_page((unsigned long)mc->objects[--mc->nobjs]);
> > > }
> > >
> > > - kvfree(mc->objects);
> > > + return freed;
> > > +}
> > >
> > > +void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
> > > +{
> > > + kvm_mmu_empty_memory_cache(mc);
> > > + kvfree(mc->objects);
> > > mc->objects = NULL;
> > > mc->capacity = 0;
> > > }
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >

2023-01-04 01:18:20

by Vipin Sharma

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <[email protected]> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> >
> > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > + spinlock_t *cache_lock)
> > +{
> > + int orig_nobjs;
> > +
> > + spin_lock(cache_lock);
> > + orig_nobjs = cache->nobjs;
> > + kvm_mmu_free_memory_cache(cache);
> > + if (orig_nobjs)
> > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > +
> > + spin_unlock(cache_lock);
> > +}
>
> I think the mmu_cache allocation and deallocation may cause the usage
> of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> lock would definitely sound like a plan, but I think it might affect
> the performance. Alternatively, I am wondering if we could use a
> mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> concurrency?
>

Can you explain more about the performance impact? Each vCPU will have
its own mutex, so the only contention will be with the mmu_shrinker.
The shrinker will use mutex_trylock(), which will not block waiting for
the lock; it will just move on to the next vCPU. While the shrinker is
holding the lock, the vCPU will be blocked in the page fault path, but
I think that should not have a huge impact considering the shrinker
will run rarely and only for a short time.
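
As a rough sketch of what I am planning for the next version, assuming
mmu_shadow_page_cache_lock is turned into a mutex (the other names are
from this series):

	/* Shrinker side: skip a busy vCPU instead of waiting for it. */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (!mutex_trylock(&vcpu->arch.mmu_shadow_page_cache_lock))
			continue;	/* vCPU is topping up or allocating */
		freed += kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
		mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
		if (freed >= sc->nr_to_scan)
			break;
	}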

> Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> mmu write lock. In the page fault path, each vcpu has to collect a
> snapshot of mmu_cache_sequence before calling into
> mmu_topup_memory_caches() and check the value again when holding the
> mmu lock. If the value is different, that means the mmu_shrinker has
> removed the cache objects and because of that, the vcpu should retry.
>

Yeah, this can be one approach. I think it will come down to the
performance impact of using a mutex, which I don't think should be a
concern.

2023-01-04 07:21:59

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Jan 3, 2023 at 10:29 PM Mingwei Zhang <[email protected]> wrote:
>
> On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <[email protected]> wrote:
> >
> > On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <[email protected]> wrote:
> > >
> > > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> > > >
> > > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > > + spinlock_t *cache_lock)
> > > > +{
> > > > + int orig_nobjs;
> > > > +
> > > > + spin_lock(cache_lock);
> > > > + orig_nobjs = cache->nobjs;
> > > > + kvm_mmu_free_memory_cache(cache);
> > > > + if (orig_nobjs)
> > > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > > +
> > > > + spin_unlock(cache_lock);
> > > > +}
> > >
> > > I think the mmu_cache allocation and deallocation may cause the usage
> > > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > > lock would definitely sound like a plan, but I think it might affect
> > > the performance. Alternatively, I am wondering if we could use a
> > > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > > concurrency?
> > >
> >
> > Can you explain more about the performance impact? Each vcpu will have
> > its own mutex. So, only contention will be with the mmu_shrinker. This
> > shrinker will use mutex_try_lock() which will not block to wait for
> > the lock, it will just pass on to the next vcpu. While shrinker is
> > holding the lock, vcpu will be blocked in the page fault path but I
> > think it should not have a huge impact considering it will execute
> > rarely and for a small time.
> >
> > > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > > mmu write lock. In the page fault path, each vcpu has to collect a
> > > snapshot of mmu_cache_sequence before calling into
> > > mmu_topup_memory_caches() and check the value again when holding the
> > > mmu lock. If the value is different, that means the mmu_shrinker has
> > > removed the cache objects and because of that, the vcpu should retry.
> > >
> >
> > Yeah, this can be one approach. I think it will come down to the
> > performance impact of using mutex which I don't think should be a
> > concern.
>
> hmm, I think you are right that there is no performance overhead by
> adding a mutex and letting the shrinker using mutex_trylock(). The
> point of using a sequence counter is to avoid the new lock, since
> introducing a new lock will increase management burden. So unless it
> is necessary, we probably should choose a simple solution first.
>
> In this case, I think we do have such a choice and since a similar
> mechanism has already been used by mmu_notifiers.
>

Let me take it back. The per-vCPU sequence number in this case has to
be protected by the VM-level mmu write lock. I think this might be less
performant than using a per-vCPU mutex.

2023-01-04 07:32:27

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <[email protected]> wrote:
>
> On Tue, Jan 3, 2023 at 11:32 AM Mingwei Zhang <[email protected]> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <[email protected]> wrote:
> > >
> > > +static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > > + spinlock_t *cache_lock)
> > > +{
> > > + int orig_nobjs;
> > > +
> > > + spin_lock(cache_lock);
> > > + orig_nobjs = cache->nobjs;
> > > + kvm_mmu_free_memory_cache(cache);
> > > + if (orig_nobjs)
> > > + percpu_counter_sub(&kvm_total_unused_mmu_pages, orig_nobjs);
> > > +
> > > + spin_unlock(cache_lock);
> > > +}
> >
> > I think the mmu_cache allocation and deallocation may cause the usage
> > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > lock would definitely sound like a plan, but I think it might affect
> > the performance. Alternatively, I am wondering if we could use a
> > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > concurrency?
> >
>
> Can you explain more about the performance impact? Each vcpu will have
> its own mutex. So, only contention will be with the mmu_shrinker. This
> shrinker will use mutex_try_lock() which will not block to wait for
> the lock, it will just pass on to the next vcpu. While shrinker is
> holding the lock, vcpu will be blocked in the page fault path but I
> think it should not have a huge impact considering it will execute
> rarely and for a small time.
>
> > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > mmu write lock. In the page fault path, each vcpu has to collect a
> > snapshot of mmu_cache_sequence before calling into
> > mmu_topup_memory_caches() and check the value again when holding the
> > mmu lock. If the value is different, that means the mmu_shrinker has
> > removed the cache objects and because of that, the vcpu should retry.
> >
>
> Yeah, this can be one approach. I think it will come down to the
> performance impact of using mutex which I don't think should be a
> concern.

Hmm, I think you are right that there is no performance overhead to
adding a mutex and letting the shrinker use mutex_trylock(). The
point of using a sequence counter is to avoid the new lock, since
introducing a new lock will increase the management burden. So unless
it is necessary, we should probably choose the simpler solution first.

In this case, I think we do have such a choice, since a similar
mechanism has already been used by the mmu_notifiers.
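
Concretely, it would mirror the mmu_notifier_seq pattern in the fault
path, roughly like the sketch below (the mmu_cache_seq field name is
hypothetical, purely for illustration):

	/* vCPU, page fault path. */
	cache_seq = READ_ONCE(kvm->arch.mmu_cache_seq);

	r = mmu_topup_memory_caches(vcpu, true);
	if (r)
		return r;

	write_lock(&kvm->mmu_lock);
	if (cache_seq != kvm->arch.mmu_cache_seq) {
		/* The shrinker emptied the caches in between, retry the fault. */
		write_unlock(&kvm->mmu_lock);
		return RET_PF_RETRY;
	}
	/* ... handle the fault, consuming objects from the caches ... */

The shrinker side would bump kvm->arch.mmu_cache_seq under the mmu
write lock before freeing the cache objects.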

best
-Mingwei

2023-01-16 04:40:32

by Yujie Liu

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

Greetings,

FYI, we noticed BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h due to commit (built with gcc-11):

commit: 99e2853d906a7593e6a3f0e5bc7ecc503b6b9462 ("[Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches")
url: https://github.com/intel-lab-lkp/linux/commits/Vipin-Sharma/NUMA-aware-page-table-s-pages-allocation/20221222-104911
base: https://git.kernel.org/cgit/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

in testcase: kvm-unit-tests-qemu
version: kvm-unit-tests-x86_64-e11a0e2-1_20230106

on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[ 159.416792][T16345] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
[ 159.426638][T16345] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16345, name: qemu-system-x86
[ 159.426641][T16345] preempt_count: 1, expected: 0
[ 159.426644][T16345] CPU: 122 PID: 16345 Comm: qemu-system-x86 Not tainted 6.1.0-rc8-00451-g99e2853d906a #1
[ 159.426647][T16345] Call Trace:
[ 159.426649][T16345] <TASK>
[159.426650][T16345] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[159.445592][T16345] __might_resched.cold (kernel/sched/core.c:9909)
[159.459683][T16345] ? __kvm_mmu_topup_memory_cache (arch/x86/kvm/../../../virt/kvm/kvm_main.c:411) kvm
[159.472465][T16345] __kmem_cache_alloc_node (include/linux/sched/mm.h:274 mm/slab.h:710 mm/slub.c:3318 mm/slub.c:3437)
[159.479626][T16345] ? kasan_set_track (mm/kasan/common.c:52)
[159.486869][T16345] ? __kvm_mmu_topup_memory_cache (arch/x86/kvm/../../../virt/kvm/kvm_main.c:411) kvm
[159.503129][T16345] __kmalloc_node (include/linux/kasan.h:211 mm/slab_common.c:955 mm/slab_common.c:962)
[159.510635][T16345] __kvm_mmu_topup_memory_cache (arch/x86/kvm/../../../virt/kvm/kvm_main.c:411) kvm
[159.525074][T16345] ? _raw_write_lock_irq (kernel/locking/spinlock.c:153)
[159.533706][T16345] ? down_read (arch/x86/include/asm/atomic64_64.h:34 include/linux/atomic/atomic-long.h:41 include/linux/atomic/atomic-instrumented.h:1280 kernel/locking/rwsem.c:176 kernel/locking/rwsem.c:181 kernel/locking/rwsem.c:249 kernel/locking/rwsem.c:1259 kernel/locking/rwsem.c:1269 kernel/locking/rwsem.c:1511)
[159.533710][T16345] mmu_topup_memory_caches (arch/x86/kvm/mmu/mmu.c:670 arch/x86/kvm/mmu/mmu.c:686) kvm
[159.547875][T16345] kvm_mmu_load (arch/x86/kvm/mmu/mmu.c:5436) kvm
[159.556325][T16345] vcpu_enter_guest+0x1ad7/0x30f0 kvm
[159.571283][T16345] ? ttwu_queue_wakelist (kernel/sched/core.c:3844 kernel/sched/core.c:3839)
[159.577747][T16345] ? vmx_prepare_switch_to_guest (arch/x86/kvm/vmx/vmx.c:1322) kvm_intel
[159.593219][T16345] ? kvm_check_and_inject_events (arch/x86/kvm/x86.c:10215) kvm
[159.600193][T16345] ? try_to_wake_up (include/linux/sched.h:2239 kernel/sched/core.c:4197)
[159.600197][T16345] ? kernel_fpu_begin_mask (arch/x86/kernel/fpu/core.c:137)
[159.616366][T16345] vcpu_run (arch/x86/kvm/x86.c:10687) kvm
[159.623697][T16345] ? fpu_swap_kvm_fpstate (arch/x86/kernel/fpu/core.c:368)
[159.623700][T16345] kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:10908) kvm
[159.640555][T16345] kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4107) kvm
[159.649090][T16345] ? vfs_fileattr_set (fs/ioctl.c:774)
[159.649094][T16345] ? kvm_dying_cpu (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4063) kvm
[159.659190][T16345] ? do_futex (kernel/futex/syscalls.c:111)
[159.673538][T16345] ? __x64_sys_get_robust_list (kernel/futex/syscalls.c:87)
[159.673542][T16345] ? __x64_sys_rt_sigaction (kernel/signal.c:4242)
[159.680957][T16345] ? _raw_spin_lock_bh (kernel/locking/spinlock.c:169)
[159.680960][T16345] ? __x64_sys_futex (kernel/futex/syscalls.c:183 kernel/futex/syscalls.c:164 kernel/futex/syscalls.c:164)
[159.697043][T16345] ? __fget_files (arch/x86/include/asm/atomic64_64.h:22 include/linux/atomic/atomic-arch-fallback.h:2363 include/linux/atomic/atomic-arch-fallback.h:2388 include/linux/atomic/atomic-arch-fallback.h:2404 include/linux/atomic/atomic-long.h:497 include/linux/atomic/atomic-instrumented.h:1854 fs/file.c:882 fs/file.c:913)
[159.697047][T16345] __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
[159.704290][T16345] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[159.714133][T16345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 159.714136][T16345] RIP: 0033:0x7f1ca20ffcc7
[ 159.728227][T16345] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
All code
========
0: 00 00 add %al,(%rax)
2: 00 48 8b add %cl,-0x75(%rax)
5: 05 c9 91 0c 00 add $0xc91c9,%eax
a: 64 c7 00 26 00 00 00 movl $0x26,%fs:(%rax)
11: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
18: c3 retq
19: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
20: 00 00 00
23: b8 10 00 00 00 mov $0x10,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 99 91 0c 00 mov 0xc9199(%rip),%rcx # 0xc91d3
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W

Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 99 91 0c 00 mov 0xc9199(%rip),%rcx # 0xc91a9
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 159.728230][T16345] RSP: 002b:00007f1ca11ea848 EFLAGS: 00000246
[ 159.736078][T16345] ORIG_RAX: 0000000000000010
[ 159.736080][T16345] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f1ca20ffcc7
[ 159.736082][T16345] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
[ 159.751470][T16345] RBP: 0000555803999500 R08: 0000000000000000 R09: 0000555801cd6d80
[ 159.751472][T16345] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 159.761055][T16345] R13: 0000555801cdd060 R14: 00007f1ca11eab00 R15: 0000000000802000
[ 159.761058][T16345] </TASK>
[ 159.780317][T16345] x86/split lock detection: #AC: qemu-system-x86/16345 took a split_lock trap at address: 0x1e3


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Attachments:
(No filename) (7.53 kB)
config-6.1.0-rc8-00451-g99e2853d906a (168.33 kB)
job-script (6.17 kB)
dmesg.xz (119.35 kB)
kvm-unit-tests-qemu (226.55 kB)
job.yaml (4.98 kB)
reproduce (153.00 B)

2023-01-18 17:48:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

On Tue, Jan 03, 2023, Mingwei Zhang wrote:
> On Tue, Jan 3, 2023 at 5:00 PM Vipin Sharma <[email protected]> wrote:
> > > I think the mmu_cache allocation and deallocation may cause the usage
> > > of GFP_ATOMIC (as observed by other reviewers as well). Adding a new
> > > lock would definitely sound like a plan, but I think it might affect
> > > the performance. Alternatively, I am wondering if we could use a
> > > mmu_cache_sequence similar to mmu_notifier_seq to help avoid the
> > > concurrency?
> > >
> >
> > Can you explain more about the performance impact? Each vcpu will have
> > its own mutex. So, only contention will be with the mmu_shrinker. This
> > shrinker will use mutex_try_lock() which will not block to wait for
> > the lock, it will just pass on to the next vcpu. While shrinker is
> > holding the lock, vcpu will be blocked in the page fault path but I
> > think it should not have a huge impact considering it will execute
> > rarely and for a small time.
> >
> > > Similar to mmu_notifier_seq, mmu_cache_sequence should be protected by
> > > mmu write lock. In the page fault path, each vcpu has to collect a
> > > snapshot of mmu_cache_sequence before calling into
> > > mmu_topup_memory_caches() and check the value again when holding the
> > > mmu lock. If the value is different, that means the mmu_shrinker has
> > > removed the cache objects and because of that, the vcpu should retry.
> > >
> >
> > Yeah, this can be one approach. I think it will come down to the
> > performance impact of using mutex which I don't think should be a
> > concern.
>
> hmm, I think you are right that there is no performance overhead by
> adding a mutex and letting the shrinker using mutex_trylock(). The
> point of using a sequence counter is to avoid the new lock, since
> introducing a new lock will increase management burden.

No, more locks don't necessarily mean higher maintenance cost. More complexity
definitely means more maintenance, but additional locks don't necessarily equate
to increased complexity.

Lockless algorithms are almost always more difficult to reason about, i.e. trying
to use a sequence counter for this case would be more complex than using a per-vCPU
mutex. The only complexity in adding another mutex is understanding why an additional
lock is necessary, and IMO that's fairly easy to explain/understand (the shrinker will
almost never succeed if it has to wait for vcpu->mutex to be dropped).
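
For example, with a dedicated per-vCPU cache mutex the two sides pair up
roughly as below (a sketch, not the exact code; reusing the existing
vcpu->mutex instead would be pointless since it is held for the entire
vCPU ioctl, so a trylock on it would almost always fail):

	/* vCPU, page fault path: sleepable, so a plain mutex_lock() is fine. */
	mutex_lock(&vcpu->arch.mmu_shadow_page_cache_lock);
	r = mmu_topup_memory_caches(vcpu, true);
	if (!r) {
		/* ... handle the fault, consuming objects from the cache ... */
	}
	mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);

	/* Shrinker: never sleep waiting for a vCPU that is mid-fault. */
	if (mutex_trylock(&vcpu->arch.mmu_shadow_page_cache_lock)) {
		freed += kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
		mutex_unlock(&vcpu->arch.mmu_shadow_page_cache_lock);
	}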

> So unless it is necessary, we probably should choose a simple solution first.
>
> In this case, I think we do have such a choice and since a similar
> mechanism has already been used by mmu_notifiers.

The mmu_notifier case is very different. The invalidations affect the entire VM,
notifiers _must_ succeed and may or may not allow sleeping, the readers (vCPUs)
effectively need protection while running in the guest, and practically speaking
holding a per-VM (or global) lock of any kind while a vCPU is running in the guest
is not viable, e.g. even holding kvm->srcu is disallowed.

In other words, using a traditional locking scheme to serialize guest accesses
with host-initiated page table (or memslot) updates is simply not an option.

2023-01-18 18:02:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v3 1/9] KVM: x86/mmu: Repurpose KVM MMU shrinker to purge shadow page caches

@all, trim your replies!

On Tue, Jan 03, 2023, Vipin Sharma wrote:
> On Tue, Jan 3, 2023 at 10:01 AM Vipin Sharma <[email protected]> wrote:
> >
> > On Thu, Dec 29, 2022 at 1:55 PM David Matlack <[email protected]> wrote:
> > > > @@ -6646,66 +6690,49 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > > static unsigned long
> > > > mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > {
> > > > - struct kvm *kvm;
> > > > - int nr_to_scan = sc->nr_to_scan;
> > > > + struct kvm_mmu_memory_cache *cache;
> > > > + struct kvm *kvm, *first_kvm = NULL;
> > > > unsigned long freed = 0;
> > > > + /* spinlock for memory cache */
> > > > + spinlock_t *cache_lock;
> > > > + struct kvm_vcpu *vcpu;
> > > > + unsigned long i;
> > > >
> > > > mutex_lock(&kvm_lock);
> > > >
> > > > list_for_each_entry(kvm, &vm_list, vm_list) {
> > > > - int idx;
> > > > - LIST_HEAD(invalid_list);
> > > > -
> > > > - /*
> > > > - * Never scan more than sc->nr_to_scan VM instances.
> > > > - * Will not hit this condition practically since we do not try
> > > > - * to shrink more than one VM and it is very unlikely to see
> > > > - * !n_used_mmu_pages so many times.
> > > > - */
> > > > - if (!nr_to_scan--)
> > > > + if (first_kvm == kvm)
> > > > break;
> > > > - /*
> > > > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock
> > > > - * here. We may skip a VM instance errorneosly, but we do not
> > > > - * want to shrink a VM that only started to populate its MMU
> > > > - * anyway.
> > > > - */
> > > > - if (!kvm->arch.n_used_mmu_pages &&
> > > > - !kvm_has_zapped_obsolete_pages(kvm))
> > > > - continue;
> > > > + if (!first_kvm)
> > > > + first_kvm = kvm;
> > > > + list_move_tail(&kvm->vm_list, &vm_list);
> > > >
> > > > - idx = srcu_read_lock(&kvm->srcu);
> > > > - write_lock(&kvm->mmu_lock);
> > > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > >
> > > What protects this from racing with vCPU creation/deletion?
> > >
>
> vCPU deletion:
> We take kvm_lock in mmu_shrink_scan(), the same lock is taken in
> kvm_destroy_vm() to remove a vm from vm_list. So, once we are
> iterating vm_list we will not see any VM removal which will means no
> vcpu removal.
>
> I didn't find any other code for vCPU deletion except failures during
> VM and VCPU set up. A VM is only added to vm_list after successful
> creation.

Yep, KVM doesn't support destroying/freeing a vCPU after it's been added.

> vCPU creation:
> I think it will work.
>
> kvm_vm_ioctl_create_vcpus() initializes the vcpu, adds it to
> kvm->vcpu_array which is of the type xarray and is managed by RCU.
> After this online_vcpus is incremented. So, kvm_for_each_vcpu() which
> uses RCU to read entries, if it sees incremented online_vcpus value
> then it will also sees all of the vcpu initialization.

Yep. The shrinker may race with a vCPU creation, e.g. not process a just-created
vCPU, but that's totally ok in this case since the shrinker path is best effort
(and purging the caches of a newly created vCPU is likely pointless).

> @Sean, Paolo
>
> Is the above explanation correct, kvm_for_each_vcpu() is safe without any lock?

Well, in this case, you do need to hold kvm_lock ;-)

But yes, iterating over vCPUs without holding the per-VM kvm->lock is safe; the
caller just needs to ensure the VM can't be destroyed, i.e. either needs to hold
a reference to the VM or needs to hold kvm_lock.