2023-05-26 23:45:36

by Yu Zhao

Subject: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

TLDR
====
This patchset adds a fast path to clear the accessed bit without
taking kvm->mmu_lock. It can significantly improve the performance of
guests when the host is under heavy memory pressure.

ChromeOS has been using a similar approach [1] since mid-2021, and it
has proven successful on tens of millions of devices.

This v2 addresses the feedback on v1 [2]: refactoring the code,
removing inaccurate/redundant text, etc.

[1] https://crrev.com/c/2987928
[2] https://lore.kernel.org/r/[email protected]/

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on a simple yet
common case where hardware sets the accessed bit in KVM PTEs and VMs
are not nested. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

The fast path relies on two techniques to safely clear the accessed
bit: RCU and CAS. The former protects KVM page tables from being
freed while the latter clears the accessed bit atomically against
both the hardware and other software page table walkers.
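
As a rough sketch of the pattern (illustrative only; the PTE helpers
below are placeholders for the arch-specific primitives each patch
actually uses):

/*
 * Illustrative sketch, not code from this series: kvm_pte_young() and
 * kvm_pte_mkold() stand in for the arch-specific helpers.
 */
static bool kvm_clear_accessed_bit_lockless(u64 *ptep)
{
        bool cleared = false;
        u64 old, new;

        /* RCU keeps the page table containing this PTE from being freed. */
        rcu_read_lock();

        old = READ_ONCE(*ptep);
        if (kvm_pte_young(old)) {
                new = kvm_pte_mkold(old);
                /*
                 * The CAS succeeds only if neither the hardware nor another
                 * page table walker changed the PTE in the meantime; a
                 * failure is harmless and simply leaves the accessed bit set.
                 */
                cleared = cmpxchg(ptep, old, new) == old;
        }

        rcu_read_unlock();

        return cleared;
}

The arch-specific kvm_arch_test_clear_young() implementations below
follow this shape, differing mainly in how they walk the KVM page
tables and which primitive performs the atomic update: cmpxchg() via
stage2_try_set_pte() on arm64, pte_xchg() on powerpc and clear_bit()
on x86.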

A new mmu_notifier_ops member, test_clear_young(), supersedes the
existing clear_young() and test_young(). This extended callback can
operate on a range of KVM PTEs individually according to a bitmap, if
the caller provides it.
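
For reference, the new callback (added in patch 01/10) takes a range,
a clear flag and an optional bitmap:

int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                        unsigned long start, unsigned long end,
                        bool clear, unsigned long *bitmap);

When testing, the implementation sets a bit in the bitmap for each
young KVM PTE; when clearing, it clears the accessed bit only for
those KVM PTEs whose bits are set.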

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 128 CPUs and
256GB DRAM:

$ sudo max_guest_memory_test -c 64 -m 250 -s 250

MGLRU run2
------------------
Before [1]    ~64s
After         ~51s

kswapd (MGLRU before)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.71% evict_folios
            97.29% shrink_folio_list
==>>          13.05% folio_referenced
                12.83% rmap_walk_file
                  12.31% folio_referenced_one
                    7.90% __mmu_notifier_clear_young
                      7.72% kvm_mmu_notifier_clear_young
                        7.34% _raw_write_lock

kswapd (MGLRU after)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.59% evict_folios
            80.37% shrink_folio_list
==>>          3.74% folio_referenced
                3.59% rmap_walk_file
                  3.19% folio_referenced_one
                    2.53% lru_gen_look_around
                      1.06% __mmu_notifier_test_clear_young

Comprehensive benchmarks are coming soon.

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

Yu Zhao (10):
mm/kvm: add mmu_notifier_ops->test_clear_young()
mm/kvm: use mmu_notifier_ops->test_clear_young()
kvm/arm64: export stage2_try_set_pte() and macros
kvm/arm64: make stage2 page tables RCU safe
kvm/arm64: add kvm_arch_test_clear_young()
kvm/powerpc: make radix page tables RCU safe
kvm/powerpc: add kvm_arch_test_clear_young()
kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask
kvm/x86: add kvm_arch_test_clear_young()
mm: multi-gen LRU: use mmu_notifier_test_clear_young()

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/include/asm/kvm_host.h | 6 +
arch/arm64/include/asm/kvm_pgtable.h | 55 +++++++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 61 +-------
arch/arm64/kvm/mmu.c | 53 ++++++-
arch/powerpc/include/asm/kvm_host.h | 8 +
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/kvm/book3s.c | 6 +
arch/powerpc/kvm/book3s.h | 1 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 65 +++++++-
arch/powerpc/kvm/book3s_hv.c | 5 +
arch/x86/include/asm/kvm_host.h | 13 ++
arch/x86/kvm/mmu.h | 6 -
arch/x86/kvm/mmu/spte.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 34 +++++
include/linux/kvm_host.h | 22 +++
include/linux/mmu_notifier.h | 79 ++++++----
include/linux/mmzone.h | 6 +-
include/trace/events/kvm.h | 15 --
mm/mmu_notifier.c | 48 ++----
mm/rmap.c | 8 +-
mm/vmscan.c | 139 ++++++++++++++++--
virt/kvm/kvm_main.c | 114 ++++++++------
24 files changed, 546 insertions(+), 207 deletions(-)

--
2.41.0.rc0.172.g3f132b7071-goog



2023-05-26 23:46:24

by Yu Zhao

Subject: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., radix MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
pte_xchg() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/powerpc/include/asm/kvm_host.h | 8 ++++
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/kvm/book3s.c | 6 +++
arch/powerpc/kvm/book3s.h | 1 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
arch/powerpc/kvm/book3s_hv.c | 5 +++
6 files changed, 80 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..75c260ea8a9e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+ cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
+ radix_enabled();
+}
+
#endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 79a9c0bb8bba..ff1af6a7b44f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -287,6 +287,7 @@ struct kvmppc_ops {
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+ bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..37bf40b0c4ff 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
}

+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return !kvm->arch.kvm_ops->test_clear_young ||
+ kvm->arch.kvm_ops->test_clear_young(kvm, range);
+}
+
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fa2659e21ccc 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);

extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 3b65b3b11041..0a392e9a100a 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
return ref;
}

+bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ bool err;
+ gfn_t gfn = range->start;
+
+ rcu_read_lock();
+
+ err = !kvm_is_radix(kvm);
+ if (err)
+ goto unlock;
+
+ /*
+ * Case 1: This function kvmppc_switch_mmu_to_hpt()
+ *
+ * rcu_read_lock()
+ * Test kvm_is_radix() kvm->arch.radix = 0
+ * Use kvm->arch.pgtable synchronize_rcu()
+ * rcu_read_unlock()
+ * kvmppc_free_radix()
+ *
+ *
+ * Case 2: This function kvmppc_switch_mmu_to_radix()
+ *
+ * kvmppc_init_vm_radix()
+ * smp_wmb()
+ * Test kvm_is_radix() kvm->arch.radix = 1
+ * smp_rmb()
+ * Use kvm->arch.pgtable
+ */
+ smp_rmb();
+
+ while (gfn < range->end) {
+ pte_t *ptep;
+ pte_t old, new;
+ unsigned int shift;
+
+ ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
+ if (!ptep)
+ goto next;
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
+
+ old = READ_ONCE(*ptep);
+ if (!pte_present(old) || !pte_young(old))
+ goto next;
+
+ new = pte_mkold(old);
+
+ if (kvm_should_clear_young(range, gfn))
+ pte_xchg(ptep, old, new);
+next:
+ gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
+ }
+unlock:
+ rcu_read_unlock();
+
+ return err;
+}
+
/* Returns the number of PAGE_SIZE pages that are dirty */
static int kvm_radix_test_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot, int pagenum)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..20a81ec9fde8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
spin_lock(&kvm->mmu_lock);
kvm->arch.radix = 0;
spin_unlock(&kvm->mmu_lock);
+ /* see the comments in kvm_test_clear_young_hv() */
+ synchronize_rcu();
kvmppc_free_radix(kvm);

lpcr = LPCR_VPM1;
@@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
if (err)
return err;
kvmppc_rmap_reset(kvm);
+ /* see the comments in kvm_test_clear_young_hv() */
+ smp_wmb();
/* Mutual exclusion with kvm_unmap_gfn_range etc. */
spin_lock(&kvm->mmu_lock);
kvm->arch.radix = 1;
@@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
.unmap_gfn_range = kvm_unmap_gfn_range_hv,
.age_gfn = kvm_age_gfn_hv,
.test_age_gfn = kvm_test_age_gfn_hv,
+ .test_clear_young = kvm_test_clear_young_hv,
.set_spte_gfn = kvm_set_spte_gfn_hv,
.free_memslot = kvmppc_core_free_memslot_hv,
.init_vm = kvmppc_core_init_vm_hv,
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:46:27

by Yu Zhao

Subject: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

KVM page tables are currently not RCU safe against remapping, i.e.,
kvmppc_unmap_free_pmd_entry_table() et al. The previous
mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
that operation.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, orphan page tables need to
be freed by RCU.

Unmapping, specifically kvm_unmap_radix(), does not free page tables,
so it is not a concern.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..3b65b3b11041 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
{
unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;

- kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+ kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+ SLAB_TYPESAFE_BY_RCU, pte_ctor);
if (!kvm_pte_cache)
return -ENOMEM;

size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;

- kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+ kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+ SLAB_TYPESAFE_BY_RCU, pmd_ctor);
if (!kvm_pmd_cache) {
kmem_cache_destroy(kvm_pte_cache);
return -ENOMEM;
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:47:12

by Yu Zhao

Subject: [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_has_test_clear_young(). Move their declarations into
asm/kvm_host.h so that they are visible outside KVM's MMU internals.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/mmu.h | 6 ------
arch/x86/kvm/mmu/spte.h | 1 -
3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb9d1f2d6136..753c67072c47 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {

extern u32 __read_mostly kvm_nr_uret_msrs;
extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
extern bool __read_mostly allow_smaller_maxphyaddr;
extern bool __read_mostly enable_apicv;
extern struct kvm_x86_ops kvm_x86_ops;
@@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
bool mask);

extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif

u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..84aedb2671ef 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -253,12 +253,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}

-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..a82c4fa1c47b 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:51:12

by Yu Zhao

Subject: [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., hardware sets the accessed bit in
KVM PTEs and VMs are not protected, where it can rely on RCU and
cmpxchg to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 6 ++++++
arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..da32b0890716 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);

+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c3b3e2afe26f..26a8d955b49c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
range->start << PAGE_SHIFT);
}

+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags flags)
+{
+ kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+
+ if (!kvm_pte_valid(new))
+ return 0;
+
+ if (new == ctx->old)
+ return 0;
+
+ if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
+ stage2_try_set_pte(ctx, new);
+
+ return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ u64 start = range->start * PAGE_SIZE;
+ u64 end = range->end * PAGE_SIZE;
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_test_clear_young,
+ .arg = range,
+ .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
+ };
+
+ BUILD_BUG_ON(is_hyp_code());
+
+ kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+ return false;
+}
+
phys_addr_t kvm_mmu_get_httbr(void)
{
return __pa(hyp_pgtable->pgd);
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:51:12

by Yu Zhao

Subject: [PATCH mm-unstable v2 10/10] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

Use mmu_notifier_test_clear_young() to handle KVM PTEs in batches
when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

$ sudo max_guest_memory_test -c 64 -m 250 -s 250

MGLRU run2
------------------
Before [1]    ~64s
After         ~51s

kswapd (MGLRU before)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.71% evict_folios
            97.29% shrink_folio_list
==>>          13.05% folio_referenced
                12.83% rmap_walk_file
                  12.31% folio_referenced_one
                    7.90% __mmu_notifier_clear_young
                      7.72% kvm_mmu_notifier_clear_young
                        7.34% _raw_write_lock

kswapd (MGLRU after)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.59% evict_folios
            80.37% shrink_folio_list
==>>          3.74% folio_referenced
                3.59% rmap_walk_file
                  3.19% folio_referenced_one
                    2.53% lru_gen_look_around
                      1.06% __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

Signed-off-by: Yu Zhao <[email protected]>
---
Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
include/linux/mmzone.h | 6 +-
mm/rmap.c | 8 +-
mm/vmscan.c | 139 ++++++++++++++++--
4 files changed, 138 insertions(+), 21 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+ batches, when KVM MMU sets it (e.g., on x86_64). This can
+ improve the performance of guests when the host is under memory
+ pressure.
[yYnN] Apply to all the components above.
====== ===============================================================

@@ -56,7 +60,7 @@ E.g.,

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
- 0x0007
+ 0x000f
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5a7ada0413da..1b148f39fabc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_KVM_MMU_WALK,
NR_LRU_GEN_CAPS
};

@@ -471,7 +472,7 @@ struct lru_gen_mm_walk {
};

void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG

@@ -559,8 +560,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

#ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index ae127f60a4fb..3a2c4e6a0887 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -825,12 +825,10 @@ static bool folio_referenced_one(struct folio *folio,
return false; /* To break the loop */
}

- if (pvmw.pte) {
- if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef687f9be13c..3f734588b600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -58,6 +58,7 @@
#include <linux/rculist_nulls.h>
#include <linux/random.h>
#include <linux/srcu.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3244,6 +3245,20 @@ static bool should_clear_pmd_young(void)
return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

+#if IS_ENABLED(CONFIG_KVM)
+#include <linux/kvm_host.h>
+
+static bool should_walk_kvm_mmu(void)
+{
+ return kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_KVM_MMU_WALK);
+}
+#else
+static bool should_walk_kvm_mmu(void)
+{
+ return false;
+}
+#endif
+
/******************************************************************************
* shorthand helpers
******************************************************************************/
@@ -3982,6 +3997,55 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
return folio;
}

+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned long end,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (!should_walk_kvm_mmu())
+ return false;
+
+ if (*last > addr)
+ goto done;
+
+ *last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+ addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+ bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+ mmu_notifier_test_clear_young(mm, addr, *last + 1, false, bitmap);
+done:
+ return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ int i;
+ unsigned long start, end = *last + 1;
+
+ if (addr + PAGE_SIZE != end)
+ return;
+
+ i = find_last_bit(bitmap, MIN_LRU_BATCH);
+ if (i == MIN_LRU_BATCH)
+ return;
+
+ start = end - (i + 1) * PAGE_SIZE;
+
+ i = find_first_bit(bitmap, MIN_LRU_BATCH);
+
+ end -= i * PAGE_SIZE;
+
+ mmu_notifier_test_clear_young(mm, start, end, true, bitmap);
+}
+
+static void skip_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (*last > addr)
+ __clear_bit((*last - addr) / PAGE_SIZE, bitmap);
+
+ clear_spte_young(mm, addr, bitmap, last);
+}
+
static bool suitable_to_scan(int total, int young)
{
int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
@@ -3997,6 +4061,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
+ DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+ unsigned long last = 0;
int total = 0;
int young = 0;
struct lru_gen_mm_walk *walk = args->private;
@@ -4015,6 +4081,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pte = pte_offset_map(pmd, start & PMD_MASK);
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool ret;
unsigned long pfn;
struct folio *folio;

@@ -4022,20 +4089,27 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;

pfn = get_pte_pfn(pte[i], args->vma, addr);
- if (pfn == -1)
+ if (pfn == -1) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(pte[i])) {
+ ret = test_spte_young(args->vma->vm_mm, addr, end, bitmap, &last);
+ if (!ret && !pte_young(pte[i])) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}

folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(args->vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(pte[i]))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -4629,6 +4703,23 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ int ret = mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
+
+ if (pte_young(*pte)) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return true;
+ }
+
+ if (ret)
+ *young = true;
+
+ return ret & MMU_NOTIFIER_RANGE_LOCKLESS;
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4636,12 +4727,14 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
unsigned long end;
struct lru_gen_mm_walk *walk;
+ DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+ unsigned long last = 0;
int young = 0;
pte_t *pte = pvmw->pte;
unsigned long addr = pvmw->address;
@@ -4655,8 +4748,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(pvmw->vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4664,6 +4760,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, pvmw->vma->vm_start);
end = min(addr | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4677,28 +4776,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

pte -= (addr - start) / PAGE_SIZE;

for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool ret;
unsigned long pfn;

pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
- if (pfn == -1)
+ if (pfn == -1) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(pte[i]))
+ ret = test_spte_young(pvmw->vma->vm_mm, addr, end, bitmap, &last);
+ if (!ret && !pte_young(pte[i])) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(pte[i]))
+ ptep_test_and_clear_young(pvmw->vma, addr, pte + i);

young++;

@@ -4728,6 +4836,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (suitable_to_scan(i, young))
update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5745,6 +5855,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
if (should_clear_pmd_young())
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (should_walk_kvm_mmu())
+ caps |= BIT(LRU_GEN_KVM_MMU_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:51:15

by Yu Zhao

Subject: [PATCH mm-unstable v2 03/10] kvm/arm64: export stage2_try_set_pte() and macros

stage2_try_set_pte() and KVM_PTE_LEAF_ATTR_LO_S2_AF are needed to
implement kvm_arch_test_clear_young().

Signed-off-by: Yu Zhao <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 53 ++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 53 ----------------------------
2 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dc3c072e862f..ff520598b62c 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -44,6 +44,49 @@ typedef u64 kvm_pte_t;

#define KVM_PHYS_INVALID (-1ULL)

+#define KVM_PTE_TYPE BIT(1)
+#define KVM_PTE_TYPE_BLOCK 0
+#define KVM_PTE_TYPE_PAGE 1
+#define KVM_PTE_TYPE_TABLE 1
+
+#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+ KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+ KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID 1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED BIT(10)
+
static inline bool kvm_pte_valid(kvm_pte_t pte)
{
return pte & KVM_PTE_VALID;
@@ -224,6 +267,16 @@ static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *c
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
}

+static inline bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
+{
+ if (!kvm_pgtable_walk_shared(ctx)) {
+ WRITE_ONCE(*ctx->ptep, new);
+ return true;
+ }
+
+ return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
+}
+
/**
* struct kvm_pgtable_walker - Hook into a page-table walk.
* @cb: Callback function to invoke during the walk.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5282cb9ca4cf..24678ccba76a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
#include <asm/stage2_pgtable.h>


-#define KVM_PTE_TYPE BIT(1)
-#define KVM_PTE_TYPE_BLOCK 0
-#define KVM_PTE_TYPE_PAGE 1
-#define KVM_PTE_TYPE_TABLE 1
-
-#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
- KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
- KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID 1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED BIT(10)
-
struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker *walker;

@@ -702,16 +659,6 @@ static bool stage2_pte_is_locked(kvm_pte_t pte)
return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
}

-static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
-{
- if (!kvm_pgtable_walk_shared(ctx)) {
- WRITE_ONCE(*ctx->ptep, new);
- return true;
- }
-
- return cmpxchg(ctx->ptep, ctx->old, new) == ctx->old;
-}
-
/**
* stage2_try_break_pte() - Invalidates a pte according to the
* 'break-before-make' requirements of the
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:52:16

by Yu Zhao

Subject: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The previous mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU and kvm_free_stage2_pgd() needs to be after
mmu_notifier_unregister().

Remapping, specifically stage2_free_removed_table(), is already RCU
safe.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 2 ++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 8 ++++++--
arch/arm64/kvm/mmu.c | 17 ++++++++++++++++-
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
* @put_page: Decrement the refcount on a page. When the
* refcount reaches 0 the page is automatically
* freed.
+ * @put_page_rcu: RCU variant of the above.
* @page_count: Return the refcount of a page.
* @phys_to_virt: Convert a physical address into a virtual
* address mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
void (*free_removed_table)(void *addr, u32 level);
void (*get_page)(void *addr);
void (*put_page)(void *addr);
+ void (*put_page_rcu)(void *addr);
int (*page_count)(void *addr);
void* (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
*/
void kvm_arch_destroy_vm(struct kvm *kvm)
{
+ kvm_free_stage2_pgd(&kvm->arch.mmu);
bitmap_free(kvm->arch.pmu_filter);
free_cpumask_var(kvm->arch.supported_cpus);

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
kvm_granule_size(ctx->level));

- if (childp)
- mm_ops->put_page(childp);
+ if (childp) {
+ if (mm_ops->put_page_rcu)
+ mm_ops->put_page_rcu(childp);
+ else
+ mm_ops->put_page(childp);
+ }

return 0;
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
return page_count(virt_to_page(addr));
}

+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+ put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+ struct page *page = virt_to_page(addr);
+
+ if (kvm_host_page_count(addr) == 1)
+ kvm_account_pgtable_pages(addr, -1);
+
+ call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
static phys_addr_t kvm_host_pa(void *addr)
{
return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
.free_removed_table = stage2_free_removed_table,
.get_page = kvm_host_get_page,
.put_page = kvm_s2_put_page,
+ .put_page_rcu = kvm_s2_put_page_rcu,
.page_count = kvm_host_page_count,
.phys_to_virt = kvm_host_va,
.virt_to_phys = kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_free_stage2_pgd(&kvm->arch.mmu);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:52:23

by Yu Zhao

Subject: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

Add mmu_notifier_ops->test_clear_young() to supersede test_young()
and clear_young().

test_clear_young() has a fast path, which if supported, allows its
callers to safely clear the accessed bit without taking
kvm->mmu_lock.

The fast path requires arch-specific code that generally relies on
RCU and CAS: the former protects KVM page tables from being freed
while the latter clears the accessed bit atomically against both the
hardware and other software page table walkers. If the fast path is
unsupported, test_clear_young() falls back to the existing slow path
where kvm->mmu_lock is then taken.

test_clear_young() can also operate on a range of KVM PTEs
individually according to a bitmap, if the caller provides it.
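
As an illustration of the bitmap convention documented below, a
hypothetical caller (kvm_batch_example() and the batch size are made
up for this sketch) could batch a test pass and a clear pass like so:

#define EXAMPLE_BATCH	64	/* arbitrary batch size for the sketch */

static void kvm_batch_example(struct mm_struct *mm, unsigned long start)
{
        DECLARE_BITMAP(bitmap, EXAMPLE_BATCH);
        unsigned long end = start + EXAMPLE_BATCH * PAGE_SIZE;
        unsigned long addr;

        bitmap_zero(bitmap, EXAMPLE_BATCH);

        /* Test only: set a bit for each young KVM PTE in [start, end). */
        mmu_notifier_test_clear_young(mm, start, end, false, bitmap);

        for (addr = start; addr < end; addr += PAGE_SIZE) {
                /* Bit offsets are relative to end: (end-addr)/PAGE_SIZE-1. */
                if (test_bit((end - addr) / PAGE_SIZE - 1, bitmap))
                        pr_debug("KVM PTE for %#lx was young\n", addr);
        }

        /* Clear the accessed bit only where its bitmap bit is set. */
        mmu_notifier_test_clear_young(mm, start, end, true, bitmap);
}

This mirrors how the multi-gen LRU uses the API in patch 10/10: one
call with clear=false to collect young KVM PTEs for a batch, followed
by a call with clear=true to clear only the ones it ages.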

Signed-off-by: Yu Zhao <[email protected]>
---
include/linux/kvm_host.h | 22 +++++++++++
include/linux/mmu_notifier.h | 52 ++++++++++++++++++++++++
mm/mmu_notifier.c | 24 ++++++++++++
virt/kvm/kvm_main.c | 76 +++++++++++++++++++++++++++++++++++-
4 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e571e973bc2..374262545f96 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
+ void *args;
gfn_t start;
gfn_t end;
pte_t pte;
@@ -267,6 +268,27 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
+ * can return true if kvm_arch_test_clear_young() is supported but disabled due
+ * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return true; otherwise, it should return false.
+ *
+ * For each young KVM PTE, kvm_arch_test_clear_young() should call
+ * kvm_should_clear_young() to decide whether to clear the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return false;
+}
#endif

enum {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..dfdbb370682d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -60,6 +60,8 @@ enum mmu_notifier_event {
};

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
+#define MMU_NOTIFIER_RANGE_LOCKLESS (1 << 1)
+#define MMU_NOTIFIER_RANGE_YOUNG (1 << 2)

struct mmu_notifier_ops {
/*
@@ -122,6 +124,10 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap);
+
/*
* change_pte is called in cases that pte mapping to page is changed:
* for example, when ksm remaps pte to point to a new shared page.
@@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte);
extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
}

+/*
+ * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
+ * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
+ * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
+ *
+ * The last parameter to the function is a bitmap and only the fast path
+ * supports it: if it is NULL, the function falls back to the slow path if the
+ * fast path was unsuccessful; otherwise, the function bails out.
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ * 2. The offset of each bit should be relative to the end, i.e., the offset
+ * corresponding to addr should be (end-addr)/PAGE_SIZE-1. This is convenient
+ * for batching while forward looping.
+ *
+ * When testing, this function sets the corresponding bit in the bitmap for each
+ * young KVM PTE. When clearing, this function clears the accessed bit for each
+ * young KVM PTE whose corresponding bit in the bitmap is set.
+ */
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_test_clear_young(mm, start, end, clear, bitmap);
+
+ return 0;
+}
+
static inline void mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte)
{
@@ -684,12 +722,26 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return 0;
+}
+
static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
return 0;
}

+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap)
+{
+ return 0;
+}
+
static inline void mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte)
{
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..7e6aba4bddcb 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -424,6 +424,30 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

+int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap)
+{
+ int idx;
+ struct mmu_notifier *mn;
+ int young = 0;
+
+ idx = srcu_read_lock(&srcu);
+
+ hlist_for_each_entry_srcu(mn, &mm->notifier_subscriptions->list, hlist,
+ srcu_read_lock_held(&srcu)) {
+ if (mn->ops->test_clear_young)
+ young |= mn->ops->test_clear_young(mn, mm, start, end, clear, bitmap);
+
+ if (young && !clear)
+ break;
+ }
+
+ srcu_read_unlock(&srcu, idx);
+
+ return young;
+}
+
void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
pte_t pte)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 51e4882d0873..31ee58754b19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,6 +541,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
typedef void (*on_unlock_fn_t)(struct kvm *kvm);

struct kvm_hva_range {
+ void *args;
unsigned long start;
unsigned long end;
pte_t pte;
@@ -549,6 +550,7 @@ struct kvm_hva_range {
on_unlock_fn_t on_unlock;
bool flush_on_ret;
bool may_block;
+ bool lockless;
};

/*
@@ -602,6 +604,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));

+ gfn_range.args = range->args;
+
/*
* To optimize for the likely case where the address
* range is covered by zero or one memslots, don't
@@ -619,7 +623,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;

- if (!locked) {
+ if (!range->lockless && !locked) {
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
@@ -628,6 +632,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
break;
}
ret |= range->handler(kvm, &gfn_range);
+
+ if (range->lockless && ret)
+ break;
}
}

@@ -880,6 +887,72 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
kvm_test_age_gfn);
}

+struct test_clear_young_args {
+ unsigned long *bitmap;
+ unsigned long end;
+ bool clear;
+ bool young;
+};
+
+bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
+{
+ struct test_clear_young_args *args = range->args;
+
+ VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+
+ args->young = true;
+
+ if (args->bitmap) {
+ int offset = hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
+
+ if (args->clear)
+ return test_bit(offset, args->bitmap);
+
+ __set_bit(offset, args->bitmap);
+ }
+
+ return args->clear;
+}
+
+static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool clear, unsigned long *bitmap)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+ struct kvm_hva_range range = {
+ .start = start,
+ .end = end,
+ .on_lock = (void *)kvm_null_fn,
+ .on_unlock = (void *)kvm_null_fn,
+ };
+
+ trace_kvm_age_hva(start, end);
+
+ if (kvm_arch_has_test_clear_young()) {
+ struct test_clear_young_args args = {
+ .bitmap = bitmap,
+ .end = end,
+ .clear = clear,
+ };
+
+ range.args = &args;
+ range.lockless = true;
+ range.handler = kvm_arch_test_clear_young;
+
+ if (!__kvm_handle_hva_range(kvm, &range))
+ return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
+ }
+
+ if (bitmap)
+ return 0;
+
+ range.args = NULL;
+ range.lockless = false;
+ range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;
+
+ return __kvm_handle_hva_range(kvm, &range) ? MMU_NOTIFIER_RANGE_YOUNG : 0;
+}
+
static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
@@ -898,6 +971,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
.clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
+ .test_clear_young = kvm_mmu_notifier_test_clear_young,
.change_pte = kvm_mmu_notifier_change_pte,
.release = kvm_mmu_notifier_release,
};
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:52:27

by Yu Zhao

Subject: [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()

Implement kvm_arch_test_clear_young() to support the fast path in
mmu_notifier_ops->test_clear_young().

It focuses on a simple case, i.e., TDP MMU sets the accessed bit in
KVM PTEs and VMs are not nested, where it can rely on RCU and
clear_bit() to safely clear the accessed bit without taking
kvm->mmu_lock. Complex cases fall back to the existing slow path
where kvm->mmu_lock is then taken.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 +++++++
arch/x86/kvm/mmu/tdp_mmu.c | 34 +++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 753c67072c47..d6dfdebe3d94 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2223,4 +2223,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return IS_ENABLED(CONFIG_X86_64) &&
+ (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask));
+}
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 08340219c35a..6875a819e007 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
}

+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ struct kvm_mmu_page *root;
+ int offset = ffs(shadow_accessed_mask) - 1;
+
+ if (kvm_shadow_root_allocated(kvm))
+ return true;
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ struct tdp_iter iter;
+
+ if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+ continue;
+
+ tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+ u64 *sptep = rcu_dereference(iter.sptep);
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+
+ if (!(iter.old_spte & shadow_accessed_mask))
+ continue;
+
+ if (kvm_should_clear_young(range, iter.gfn))
+ clear_bit(offset, (unsigned long *)sptep);
+ }
+ }
+
+ rcu_read_unlock();
+
+ return false;
+}
+
static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-26 23:52:57

by Yu Zhao

Subject: [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()

Replace test_young() and clear_young() with test_clear_young().

Signed-off-by: Yu Zhao <[email protected]>
---
include/linux/mmu_notifier.h | 29 ++-----------------
include/trace/events/kvm.h | 15 ----------
mm/mmu_notifier.c | 42 ----------------------------
virt/kvm/kvm_main.c | 54 ------------------------------------
4 files changed, 2 insertions(+), 138 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index dfdbb370682d..c8f35fc08703 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -104,26 +104,6 @@ struct mmu_notifier_ops {
unsigned long start,
unsigned long end);

- /*
- * clear_young is a lightweight version of clear_flush_young. Like the
- * latter, it is supposed to test-and-clear the young/accessed bitflag
- * in the secondary pte, but it may omit flushing the secondary tlb.
- */
- int (*clear_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-
- /*
- * test_young is called to check the young/accessed bitflag in
- * the secondary pte. This is used to know if the page is
- * frequently used without actually clearing the flag or tearing
- * down the secondary mapping on the page.
- */
- int (*test_young)(struct mmu_notifier *subscription,
- struct mm_struct *mm,
- unsigned long address);
-
int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap);
@@ -393,11 +373,6 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
-extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-extern int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address);
extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap);
@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_clear_young(mm, start, end);
+ return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
return 0;
}

@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_test_young(mm, address);
+ return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
return 0;
}

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 3bd31ea23fee..46c347e56e60 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,21 +489,6 @@ TRACE_EVENT(kvm_age_hva,
__entry->start, __entry->end)
);

-TRACE_EVENT(kvm_test_age_hva,
- TP_PROTO(unsigned long hva),
- TP_ARGS(hva),
-
- TP_STRUCT__entry(
- __field( unsigned long, hva )
- ),
-
- TP_fast_assign(
- __entry->hva = hva;
- ),
-
- TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
-);
-
#endif /* _TRACE_KVM_MAIN_H */

/* This part must be outside protection */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7e6aba4bddcb..c7e9747c9920 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -382,48 +382,6 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
return young;
}

-int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- struct mmu_notifier *subscription;
- int young = 0, id;
-
- id = srcu_read_lock(&srcu);
- hlist_for_each_entry_rcu(subscription,
- &mm->notifier_subscriptions->list, hlist,
- srcu_read_lock_held(&srcu)) {
- if (subscription->ops->clear_young)
- young |= subscription->ops->clear_young(subscription,
- mm, start, end);
- }
- srcu_read_unlock(&srcu, id);
-
- return young;
-}
-
-int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
-{
- struct mmu_notifier *subscription;
- int young = 0, id;
-
- id = srcu_read_lock(&srcu);
- hlist_for_each_entry_rcu(subscription,
- &mm->notifier_subscriptions->list, hlist,
- srcu_read_lock_held(&srcu)) {
- if (subscription->ops->test_young) {
- young = subscription->ops->test_young(subscription, mm,
- address);
- if (young)
- break;
- }
- }
- srcu_read_unlock(&srcu, id);
-
- return young;
-}
-
int __mmu_notifier_test_clear_young(struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 31ee58754b19..977baaf1b248 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -674,25 +674,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
return __kvm_handle_hva_range(kvm, &range);
}

-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
- unsigned long start,
- unsigned long end,
- hva_handler_t handler)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
- const struct kvm_hva_range range = {
- .start = start,
- .end = end,
- .pte = __pte(0),
- .handler = handler,
- .on_lock = (void *)kvm_null_fn,
- .on_unlock = (void *)kvm_null_fn,
- .flush_on_ret = false,
- .may_block = false,
- };
-
- return __kvm_handle_hva_range(kvm, &range);
-}
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -854,39 +835,6 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
}

-static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- trace_kvm_age_hva(start, end);
-
- /*
- * Even though we do not flush TLB, this will still adversely
- * affect performance on pre-Haswell Intel EPT, where there is
- * no EPT Access Bit to clear so that we have to tear down EPT
- * tables instead. If we find this unacceptable, we can always
- * add a parameter to kvm_age_hva so that it effectively doesn't
- * do anything on clear_young.
- *
- * Also note that currently we never issue secondary TLB flushes
- * from clear_young, leaving this job up to the regular system
- * cadence. If we find this inaccurate, we might come up with a
- * more sophisticated heuristic later.
- */
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
-}
-
-static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- trace_kvm_test_age_hva(address);
-
- return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
-}
-
struct test_clear_young_args {
unsigned long *bitmap;
unsigned long end;
@@ -969,8 +917,6 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
- .clear_young = kvm_mmu_notifier_clear_young,
- .test_young = kvm_mmu_notifier_test_young,
.test_clear_young = kvm_mmu_notifier_test_clear_young,
.change_pte = kvm_mmu_notifier_change_pte,
.release = kvm_mmu_notifier_release,
--
2.41.0.rc0.172.g3f132b7071-goog


2023-05-27 18:40:32

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

Yu,

On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> Stage2 page tables are currently not RCU safe against unmapping or VM
> destruction. The previous mmu_notifier_ops members rely on
> kvm->mmu_lock to synchronize with those operations.
>
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, unmapped page tables need
> to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> mmu_notifier_unregister().
>
> Remapping, specifically stage2_free_removed_table(), is already RCU
> safe.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 2 ++
> arch/arm64/kvm/arm.c | 1 +
> arch/arm64/kvm/hyp/pgtable.c | 8 ++++++--
> arch/arm64/kvm/mmu.c | 17 ++++++++++++++++-
> 4 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index ff520598b62c..5cab52e3a35f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> * @put_page: Decrement the refcount on a page. When the
> * refcount reaches 0 the page is automatically
> * freed.
> + * @put_page_rcu: RCU variant of the above.

You don't need to add yet another hook to implement this. I was working
on lock-free walks in a separate context and arrived at the following:

commit f82d264a37745e07ee28e116c336f139f681fd7f
Author: Oliver Upton <[email protected]>
Date: Mon May 1 08:53:37 2023 +0000

KVM: arm64: Consistently use free_removed_table() for stage-2

free_removed_table() is essential to the RCU-protected parallel walking
scheme, as behind the scenes the cleanup is deferred until an RCU grace
period. Nonetheless, the stage-2 unmap path calls put_page() directly,
which leads to table memory being freed inline with the table walk.

This is safe for the time being, as the stage-2 unmap walker is called
while holding the write lock. A future change to KVM will further relax
the locking mechanics around the stage-2 page tables to allow lock-free
walkers protected only by RCU. As such, switch to the RCU-safe mechanism
for freeing table memory.

Signed-off-by: Oliver Upton <[email protected]>

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..bfbebdcb4ef0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
kvm_granule_size(ctx->level));

if (childp)
- mm_ops->put_page(childp);
+ mm_ops->free_removed_table(childp, ctx->level);

return 0;
}

--
Thanks,
Oliver

2023-05-27 20:28:26

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Sat, May 27, 2023 at 12:08 PM Oliver Upton <[email protected]> wrote:
>
> Yu,
>
> On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> > Stage2 page tables are currently not RCU safe against unmapping or VM
> > destruction. The previous mmu_notifier_ops members rely on
> > kvm->mmu_lock to synchronize with those operations.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, unmapped page tables need
> > to be freed by RCU and kvm_free_stage2_pgd() needs to be after
> > mmu_notifier_unregister().
> >
> > Remapping, specifically stage2_free_removed_table(), is already RCU
> > safe.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 2 ++
> > arch/arm64/kvm/arm.c | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 8 ++++++--
> > arch/arm64/kvm/mmu.c | 17 ++++++++++++++++-
> > 4 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index ff520598b62c..5cab52e3a35f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> > * @put_page: Decrement the refcount on a page. When the
> > * refcount reaches 0 the page is automatically
> > * freed.
> > + * @put_page_rcu: RCU variant of the above.
>
> You don't need to add yet another hook to implement this. I was working
> on lock-free walks in a separate context and arrived at the following:
>
> commit f82d264a37745e07ee28e116c336f139f681fd7f
> Author: Oliver Upton <[email protected]>
> Date: Mon May 1 08:53:37 2023 +0000
>
> KVM: arm64: Consistently use free_removed_table() for stage-2
>
> free_removed_table() is essential to the RCU-protected parallel walking
> scheme, as behind the scenes the cleanup is deferred until an RCU grace
> period. Nonetheless, the stage-2 unmap path calls put_page() directly,
> which leads to table memory being freed inline with the table walk.
>
> This is safe for the time being, as the stage-2 unmap walker is called
> while holding the write lock. A future change to KVM will further relax
> the locking mechanics around the stage-2 page tables to allow lock-free
> walkers protected only by RCU. As such, switch to the RCU-safe mechanism
> for freeing table memory.
>
> Signed-off-by: Oliver Upton <[email protected]>
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 3d61bd3e591d..bfbebdcb4ef0 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> kvm_granule_size(ctx->level));
>
> if (childp)
> - mm_ops->put_page(childp);
> + mm_ops->free_removed_table(childp, ctx->level);

Thanks, Oliver.

A couple of things I haven't had the chance to verify -- I'm hoping
you could help clarify:
1. For unmapping, with free_removed_table(), wouldn't we have to
unnecessarily look into a table we already know is empty?
2. For remapping and unmapping, how does free_removed_table() put the
final refcnt on the table passed in? (Previously we had
put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
have to do something equivalent with free_removed_table().)

2023-05-30 19:50:24

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

Hi Yu,

On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> On Sat, May 27, 2023 at 12:08 PM Oliver Upton <[email protected]> wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > kvm_granule_size(ctx->level));
> >
> > if (childp)
> > - mm_ops->put_page(childp);
> > + mm_ops->free_removed_table(childp, ctx->level);
>
> Thanks, Oliver.
>
> A couple of things I haven't had the chance to verify -- I'm hoping
> you could help clarify:
> 1. For unmapping, with free_removed_table(), wouldn't we have to
> unnecessarily look into a table we already know is empty?

As it is currently implemented, yes. But, there's potential to fast-path
the implementation by checking page_count() before starting the walk.

> 2. For remapping and unmapping, how does free_removed_table() put the
> final refcnt on the table passed in? (Previously we had
> put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> have to do something equivalent with free_removed_table().)

Heh, that's a bug, and an embarrassing one at that!

Sent out a fix for that, since it would appear we leak memory on
table->block transitions. PTAL if you have a chance.

https://lore.kernel.org/all/[email protected]/

--
Thanks,
Oliver

2023-05-30 20:15:18

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Tue, May 30, 2023 at 1:37 PM Oliver Upton <[email protected]> wrote:
>
> Hi Yu,
>
> On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <[email protected]> wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > kvm_granule_size(ctx->level));
> > >
> > > if (childp)
> > > - mm_ops->put_page(childp);
> > > + mm_ops->free_removed_table(childp, ctx->level);
> >
> > Thanks, Oliver.
> >
> > A couple of things I haven't had the chance to verify -- I'm hoping
> > you could help clarify:
> > 1. For unmapping, with free_removed_table(), wouldn't we have to
> > unnecessarily look into a table we already know is empty?
>
> As it is currently implemented, yes. But, there's potential to fast-path
> the implementation by checking page_count() before starting the walk.

Do you mind posting another patch? I'd be happy to ack it, as well as
the one you suggested above.

> > 2. For remapping and unmapping, how does free_removed_table() put the
> > final refcnt on the table passed in? (Previously we had
> > put_page(childp) in stage2_map_walk_table_post(). So I'm assuming we'd
> > have to do something equivalent with free_removed_table().)
>
> Heh, that's a bug, and an embarrassing one at that!
>
> Sent out a fix for that, since it would appear we leak memory on
> table->block transitions. PTAL if you have a chance.
>
> https://lore.kernel.org/all/[email protected]/

Awesome.

2023-05-31 19:34:01

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> On Tue, May 30, 2023 at 1:37 PM Oliver Upton <[email protected]> wrote:
> >
> > Hi Yu,
> >
> > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <[email protected]> wrote:
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > kvm_granule_size(ctx->level));
> > > >
> > > > if (childp)
> > > > - mm_ops->put_page(childp);
> > > > + mm_ops->free_removed_table(childp, ctx->level);
> > >
> > > Thanks, Oliver.
> > >
> > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > you could help clarify:
> > > 1. For unmapping, with free_removed_table(), wouldn't we have to
> > > unnecessarily look into a table we already know is empty?
> >
> > As it is currently implemented, yes. But, there's potential to fast-path
> > the implementation by checking page_count() before starting the walk.
>
> Do you mind posting another patch? I'd be happy to ack it, as well as
> the one you suggested above.

I'd rather not take such a patch independent of the test_clear_young
series if you're OK with that. Do you mind implementing something
similar to the above patch w/ the proposed optimization if you need it?

--
Thanks,
Oliver

2023-05-31 19:34:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> @@ -122,6 +124,10 @@ struct mmu_notifier_ops {
> struct mm_struct *mm,
> unsigned long address);
>
> + int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + bool clear, unsigned long *bitmap);
> +

Why leave clear_young behind? Just make a NULL bitmap mean
clear_young?

Jason

2023-05-31 19:35:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 02/10] mm/kvm: use mmu_notifier_ops->test_clear_young()

On Fri, May 26, 2023 at 05:44:27PM -0600, Yu Zhao wrote:
> Replace test_young() and clear_young() with test_clear_young().
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> include/linux/mmu_notifier.h | 29 ++-----------------
> include/trace/events/kvm.h | 15 ----------
> mm/mmu_notifier.c | 42 ----------------------------
> virt/kvm/kvm_main.c | 54 ------------------------------------
> 4 files changed, 2 insertions(+), 138 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index dfdbb370682d..c8f35fc08703 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -104,26 +104,6 @@ struct mmu_notifier_ops {
> unsigned long start,
> unsigned long end);
>
> - /*
> - * clear_young is a lightweight version of clear_flush_young. Like the
> - * latter, it is supposed to test-and-clear the young/accessed bitflag
> - * in the secondary pte, but it may omit flushing the secondary tlb.
> - */
> - int (*clear_young)(struct mmu_notifier *subscription,
> - struct mm_struct *mm,
> - unsigned long start,
> - unsigned long end);
> -
> - /*
> - * test_young is called to check the young/accessed bitflag in
> - * the secondary pte. This is used to know if the page is
> - * frequently used without actually clearing the flag or tearing
> - * down the secondary mapping on the page.
> - */
> - int (*test_young)(struct mmu_notifier *subscription,
> - struct mm_struct *mm,
> - unsigned long address);
> -
> int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
> unsigned long start, unsigned long end,
> bool clear, unsigned long *bitmap);

Oh, you split the patch. This MMU notifier stuff seems OK for both
patches then, and KVM is the only user:

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2023-05-31 20:19:14

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

Hi Yu,

On Fri, May 26, 2023 at 05:44:30PM -0600, Yu Zhao wrote:
> Implement kvm_arch_test_clear_young() to support the fast path in
> mmu_notifier_ops->test_clear_young().
>
> It focuses on a simple case, i.e., hardware sets the accessed bit in
> KVM PTEs and VMs are not protected, where it can rely on RCU and
> cmpxchg to safely clear the accessed bit without taking
> kvm->mmu_lock. Complex cases fall back to the existing slow path
> where kvm->mmu_lock is then taken.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 6 ++++++
> arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++
> 2 files changed, 42 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 7e7e19ef6993..da32b0890716 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
> void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
>
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return cpu_has_hw_af() && !is_protected_kvm_enabled();
> +}

I would *strongly* suggest you consider supporting test_clear_young on
systems that do software Access Flag management. FEAT_HAFDBS is an
*optional* extension to the architecture, so we're going to support
software AF management for a very long time in KVM. It is also a valid
fallback option in the case of hardware errata which render HAFDBS
broken.

So, we should expect (and support) systems of all shapes and sizes that
do software AF. I'm sure we'll hear about more in the not-too-distant
future...

For future reference (even though I'm suggesting you support software
AF), decisions such as these need an extremely verbose comment
describing the rationale behind the decision.

> +
> #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c3b3e2afe26f..26a8d955b49c 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c

Please do not implement page table walkers outside of hyp/pgtable.c

> @@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> range->start << PAGE_SHIFT);
> }
>
> +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags flags)
> +{
> + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));

This sort of sanity checking is a bit excessive. Isn't there a risk of
false negatives here too? IOW, if we tragically mess up RCU in the page
table code, what's stopping a prematurely freed page from being
allocated to another user?

> + if (!kvm_pte_valid(new))
> + return 0;
> +
> + if (new == ctx->old)
> + return 0;
> +
> + if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
> + stage2_try_set_pte(ctx, new);
> +
> + return 0;
> +}
> +
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + u64 start = range->start * PAGE_SIZE;
> + u64 end = range->end * PAGE_SIZE;
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_test_clear_young,
> + .arg = range,
> + .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
> + };
> +
> + BUILD_BUG_ON(is_hyp_code());

Delete this assertion.

> + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
> +
> + return false;
> +}
> +
> phys_addr_t kvm_mmu_get_httbr(void)
> {
> return __pa(hyp_pgtable->pgd);
> --
> 2.41.0.rc0.172.g3f132b7071-goog
>

--
Thanks,
Oliver

2023-05-31 21:30:54

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 05/10] kvm/arm64: add kvm_arch_test_clear_young()

On Wed, May 31, 2023 at 1:56 PM Oliver Upton <[email protected]> wrote:
>
> Hi Yu,
>
> On Fri, May 26, 2023 at 05:44:30PM -0600, Yu Zhao wrote:
> > Implement kvm_arch_test_clear_young() to support the fast path in
> > mmu_notifier_ops->test_clear_young().
> >
> > It focuses on a simple case, i.e., hardware sets the accessed bit in
> > KVM PTEs and VMs are not protected, where it can rely on RCU and
> > cmpxchg to safely clear the accessed bit without taking
> > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > where kvm->mmu_lock is then taken.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 6 ++++++
> > arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++++
> > 2 files changed, 42 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 7e7e19ef6993..da32b0890716 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1113,4 +1113,10 @@ static inline void kvm_hyp_reserve(void) { }
> > void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}
>
> I would *strongly* suggest you consider supporting test_clear_young on
> systems that do software Access Flag management. FEAT_HAFDBS is an
> *optional* extension to the architecture, so we're going to support
> software AF management for a very long time in KVM. It is also a valid
> fallback option in the case of hardware errata which render HAFDBS
> broken.

Hi Oliver,

It's not about willingness but resources. Ideally we want to make
everything perfect, but in reality, we can only move forward one step
at a time.

If I looked at your request from ARM's POV, I would agree with you.
But my goal is to lay the foundation for all architectures that could
benefit, so I may not be able to cover a lot for each architecture.
Specifically, I don't have the bandwidth to test the !FEAT_HAFDBS case
for ARM.

So here are some options I could offer, ordered by my preferences:
1. We proceed as it is for now. I *will* find someone from my team (or
yours) to follow up -- this way we can make sure !FEAT_HAFDBS is well
tested.
2. I drop the cpu_has_hw_af() check above. Not that I think there is
much risk, I'm just trying to be cautious.
3. I drop the entire ARM support (and include the RISC-V support which
I previously deprioritized). We revisit after the test is done.

Sounds reasonable?
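
For concreteness, option 2 above boils down to something like the following
sketch (illustrative only, not code from this series); it simply drops the
cpu_has_hw_af() check from the helper quoted above:

#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
static inline bool kvm_arch_has_test_clear_young(void)
{
        /*
         * Sketch of option 2: advertise the fast path even when the Access
         * Flag is managed in software; protected VMs still opt out.
         */
        return !is_protected_kvm_enabled();
}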

> So, we should expect (and support) systems of all shapes and sizes that
> do software AF. I'm sure we'll hear about more in the not-too-distant
> future...
>
> For future reference (even though I'm suggesting you support software
> AF), decisions such as these need an extremely verbose comment
> describing the rationale behind the decision.
>
> > +
> > #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c3b3e2afe26f..26a8d955b49c 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
>
> Please do not implement page table walkers outside of hyp/pgtable.c
>
> > @@ -1678,6 +1678,42 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > range->start << PAGE_SHIFT);
> > }
> >
> > +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags flags)
> > +{
> > + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > +
> > + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
>
> This sort of sanity checking is a bit excessive. Isn't there a risk of
> false negatives here too? IOW, if we tragically mess up RCU in the page
> table code, what's stopping a prematurely freed page from being
> allocated to another user?

Yes, but from my aforementioned POV (the breadth I'm focusing on),
this is a good practice. I can live without this assertion if you feel
strongly about it.

> > + if (!kvm_pte_valid(new))
> > + return 0;
> > +
> > + if (new == ctx->old)
> > + return 0;
> > +
> > + if (kvm_should_clear_young(ctx->arg, ctx->addr / PAGE_SIZE))
> > + stage2_try_set_pte(ctx, new);
> > +
> > + return 0;
> > +}
> > +
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > + u64 start = range->start * PAGE_SIZE;
> > + u64 end = range->end * PAGE_SIZE;
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_test_clear_young,
> > + .arg = range,
> > + .flags = KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
> > + };
> > +
> > + BUILD_BUG_ON(is_hyp_code());
>
> Delete this assertion.

Will do.

2023-05-31 23:25:23

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Wed, May 31, 2023 at 1:28 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <[email protected]> wrote:
> > >
> > > Hi Yu,
> > >
> > > On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> > > > On Sat, May 27, 2023 at 12:08 PM Oliver Upton <[email protected]> wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 3d61bd3e591d..bfbebdcb4ef0 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > > kvm_granule_size(ctx->level));
> > > > >
> > > > > if (childp)
> > > > > - mm_ops->put_page(childp);
> > > > > + mm_ops->free_removed_table(childp, ctx->level);
> > > >
> > > > Thanks, Oliver.
> > > >
> > > > A couple of things I haven't had the chance to verify -- I'm hoping
> > > > you could help clarify:
> > > > 1. For unmapping, with free_removed_table(), wouldn't we have to
> > > > unnecessarily look into a table we already know is empty?
> > >
> > > As it is currently implemented, yes. But, there's potential to fast-path
> > > the implementation by checking page_count() before starting the walk.
> >
> > Do you mind posting another patch? I'd be happy to ack it, as well as
> > the one you suggested above.
>
> I'd rather not take such a patch independent of the test_clear_young
> series if you're OK with that. Do you mind implementing something
> similar to the above patch w/ the proposed optimization if you need it?

No worries. I can take the above together with the following, which
would form a new series with its own merits, since apparently you
think the !AF case is important.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26a8d955b49c..6ce73ce9f146 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1453,10 +1453,10 @@ static void handle_access_fault(struct
kvm_vcpu *vcpu, phys_addr_t fault_ipa)

trace_kvm_access_fault(fault_ipa);

- read_lock(&vcpu->kvm->mmu_lock);
+ rcu_read_lock();
mmu = vcpu->arch.hw_mmu;
pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
- read_unlock(&vcpu->kvm->mmu_lock);
+ rcu_read_unlock();

if (kvm_pte_valid(pte))
kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));

2023-05-31 23:43:18

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> On Wed, May 31, 2023 at 1:28 PM Oliver Upton <[email protected]> wrote:
> > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <[email protected]> wrote:
> > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > the implementation by checking page_count() before starting the walk.
> > >
> > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > the one you suggested above.
> >
> > I'd rather not take such a patch independent of the test_clear_young
> > series if you're OK with that. Do you mind implementing something
> > similar to the above patch w/ the proposed optimization if you need it?
>
> No worries. I can take the above together with the following, which
> would form a new series with its own merits, since apparently you
> think the !AF case is important.

Sorry if my suggestion was unclear.

I thought we were talking about ->free_removed_table() being called from
the stage-2 unmap path, in which case we wind up unnecessarily visiting
PTEs on a table known to be empty. You could fast-path that by only
initiating a walk if page_count() > 1:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 95dae02ccc2e..766563dc465c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
.end = kvm_granule_size(level),
};

- WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
+ if (mm_ops->page_count(pgtable) > 1)
+ WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));

WARN_ON(mm_ops->page_count(pgtable) != 1);
mm_ops->put_page(pgtable);


A lock-free access fault walker is interesting, but in my testing it hasn't
led to any significant improvements over acquiring the MMU lock for
read. Because of that I hadn't bothered with posting the series upstream.

--
Thanks,
Oliver

2023-05-31 23:50:10

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe

On Wed, May 31, 2023 at 5:23 PM Oliver Upton <[email protected]> wrote:
>
> On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> > On Wed, May 31, 2023 at 1:28 PM Oliver Upton <[email protected]> wrote:
> > > On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> > > > On Tue, May 30, 2023 at 1:37 PM Oliver Upton <[email protected]> wrote:
> > > > > As it is currently implemented, yes. But, there's potential to fast-path
> > > > > the implementation by checking page_count() before starting the walk.
> > > >
> > > > Do you mind posting another patch? I'd be happy to ack it, as well as
> > > > the one you suggested above.
> > >
> > > I'd rather not take such a patch independent of the test_clear_young
> > > series if you're OK with that. Do you mind implementing something
> > > similar to the above patch w/ the proposed optimization if you need it?
> >
> > No worries. I can take the above together with the following, which
> > would form a new series with its own merits, since apparently you
> > think the !AF case is important.
>
> Sorry if my suggestion was unclear.
>
> I thought we were talking about ->free_removed_table() being called from
> the stage-2 unmap path

Yes, we were, or in general, about how to make KVM PTs RCU safe for ARM.

So I'm thinking about taking 1) your patch above, 2) what I just
suggested and 3) what you suggested below to form a mini series, which
could land independently and would make my job here easier.

> in which case we wind up unnecessarily visiting
> PTEs on a table known to be empty. You could fast-path that by only
> initiating a walk if page_count() > 1:

Yes, this is what I meant.

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 95dae02ccc2e..766563dc465c 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
> .end = kvm_granule_size(level),
> };
>
> - WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
> + if (mm_ops->page_count(pgtable) > 1)
> + WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
>
> WARN_ON(mm_ops->page_count(pgtable) != 1);
> mm_ops->put_page(pgtable);
>
>
> A lock-free access fault walker is interesting, but in my testing it hasn't
> led to any significant improvements over acquiring the MMU lock for
> read. Because of that I hadn't bothered with posting the series upstream.

It's hard to measure but we have perf benchmarks on ChromeOS which should help.

2023-06-06 09:05:37

by Tzung-Bi Shih

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> +/*
> + * Architectures that implement kvm_arch_test_clear_young() should override
> + * kvm_arch_has_test_clear_young().
> + *
> + * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
> + * can return true if kvm_arch_test_clear_young() is supported but disabled due
> + * to some runtime constraint. In this case, kvm_arch_test_clear_young() should

Is it a typo here? s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

> +static inline int mmu_notifier_clear_young(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + return 0;
> +}
> +

This looks irrelevant to the patch but rather a fix for commit 1d7715c676a1
("mmu-notifier: add clear_young callback") instead.

2023-06-09 01:01:15

by Yu Zhao

[permalink] [raw]
Subject: kvm/powerpc: memcached benchmark

TLDR
====
Memcached achieved 10% more operations per second (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 184
On-line CPU(s) list: 0-183
Model name: POWER9 (raw), altivec supported
Model: 2.2 (pvr 004e 1202)
Thread(s) per core: 4
Core(s) per socket: 23
Socket(s): 2
CPU max MHz: 3000.0000
CPU min MHz: 2300.0000
Caches (sum of all):
L1d: 1.4 MiB (46 instances)
L1i: 1.4 MiB (46 instances)
L2: 12 MiB (24 instances)
L3: 240 MiB (24 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-91
NUMA node1 CPU(s): 92-183
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Mitigation; RFI Flush, L1D private per thread
Mds: Not affected
Meltdown: Mitigation; RFI Flush, L1D private per thread
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Spectre v1: Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Spectre v2: Mitigation; Indirect branch serialisation (kernel only), Indirect branch cache disabled, Software link stack flush
Srbds: Not affected
Tsx async abort: Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node 0 1
0: 10 40
1: 40 10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun 4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le GNU/Linux

HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838272 0 -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766

GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
-M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g \
-nographic -nic user \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
--key-minimum=1 --key-maximum=120000000 --key-pattern=P:P \
-n allkeys -d 2000

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
--key-minimum=1 --key-maximum=120000000 --key-pattern=R:R \
-n allkeys --randomize --distinct-client-seed

Results
=======
                Before [1]    After        Change
-------------------------------------------------
Ops/sec         721586.10     800210.12    +10%
Avg. Latency    0.12546       0.11260      -10%
p50 Latency     0.08700       0.08700      N/C
p99 Latency     0.28700       0.24700      -13%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

2023-06-09 01:02:25

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On Tue, Jun 6, 2023 at 2:34 AM Tzung-Bi Shih <[email protected]> wrote:
>
> On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> > +/*
> > + * Architectures that implement kvm_arch_test_clear_young() should override
> > + * kvm_arch_has_test_clear_young().
> > + *
> > + * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
> > + * can return true if kvm_arch_test_clear_young() is supported but disabled due
> > + * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
>
> Is it a typo here? s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

Not a typo.

2023-06-09 01:07:04

by Yu Zhao

[permalink] [raw]
Subject: kvm/arm64: Spark benchmark

TLDR
====
Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 2800.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 8 MiB (128 instances)
L1i: 8 MiB (128 instances)
L2: 128 MiB (128 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node 0 1
0: 10 20
1: 20 10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun 3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838356 116922112 -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, sharing)

GUEST $ spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.

Procedure
=========
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
-M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
-bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
def main(args: Array[String]): Unit = {
val file = new File("/dev/shm/dataset.txt")
val writer = new BufferedWriter(new FileWriter(file))
val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
for(_ <- 0 until 400000000) {
for (i <- 0 until 10) {
buf.update(i, scala.util.Random.nextLong())
}
writer.write(s"${buf.mkString(",")}\n")
}
writer.close()
}
}
GenData.main(Array())

GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().getOrCreate()
val file = sc.textFile("/dev/shm/dataset.txt", 64)
val results = file.flatMap(_.split(",")).map(x => (x, 1)).sortByKey().takeOrdered(10)
results.foreach(println)
spark.stop()
}
}
SparkSort.main(Array())

GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/

spark-shell <gen.scala

start=$SECONDS

for ((i=0; i<20; i++))
do
spark-3.4.0-bin-hadoop3/bin/spark-shell --master "local[64]" --driver-memory 160g <sort.scala
done

echo "wall time: $((SECONDS - start))"

Results
=======
                      Before [1]    After     Change
----------------------------------------------------
Wall time (seconds)   14455         12865     -12%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

2023-06-09 01:13:09

by Yu Zhao

[permalink] [raw]
Subject: kvm/x86: multichase benchmark

TLDR
====
Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].

Hardware
========
HOST $ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Stepping: 0
Frequency boost: disabled
CPU max MHz: 4308.3979
CPU min MHz: 2200.0000
BogoMIPS: 5390.20
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
...
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-127
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node 0
0: 10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software
========
HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

HOST $ cat /proc/swaps
Filename Type Size Used Priority
/dev/nvme0n1p2 partition 466838356 0 -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

Procedure
=========
HOST $ git clone https://github.com/google/multichase

HOST $ <Build multichase>
HOST $ <Unpack /boot/initrd.img into ./initrd/>

HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
"/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
./initrd/init

HOST $ <Pack ./initrd/ into ./initrd.img>

HOST $ cat run_microvms.sh
memcgs=64

run() {
path=/sys/fs/cgroup/memcg$1

mkdir $path
echo $BASHPID >$path/cgroup.procs

qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
-nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
-append "console=ttyS0 loglevel=0"
}

for ((memcg = 0; memcg < $memcgs; memcg++)); do
run $memcg &
done

wait

Results
=======
                Before [1]    After     Change
----------------------------------------------
Total samples   6824          7237      +6%

Notes
=====
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

2023-06-09 09:16:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On 5/31/23 21:17, Jason Gunthorpe wrote:
>> + int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
>> + unsigned long start, unsigned long end,
>> + bool clear, unsigned long *bitmap);
>> +
> Why leave clear_young behind? Just make a NULL bitmap mean
> clear_young?

It goes away in patch 2, together with:

@@ -437,7 +412,7 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_clear_young(mm, start, end);
+ return __mmu_notifier_test_clear_young(mm, start, end, true, NULL);
return 0;
}

@@ -445,7 +420,7 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_test_young(mm, address);
+ return __mmu_notifier_test_clear_young(mm, address, address + 1, false, NULL);
return 0;
}


Paolo


2023-06-09 09:17:36

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()

On 5/27/23 01:44, Yu Zhao wrote:
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return IS_ENABLED(CONFIG_X86_64) &&
> + (!IS_REACHABLE(CONFIG_KVM) || (tdp_mmu_enabled && shadow_accessed_mask));
> +}

I don't think you need IS_REACHABLE(CONFIG_KVM) here; it would be a bug
if this is called from outside KVM code.

Maybe make it a BUILD_BUG_ON?

Paolo
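
A hedged sketch of that suggestion (assuming, as Paolo notes, that the helper
is only ever called from KVM code; this is not an actual patch from the
thread) would drop the IS_REACHABLE() term and let the build break if that
assumption is violated:

#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
static inline bool kvm_arch_has_test_clear_young(void)
{
        /* Only KVM code is expected to instantiate this helper. */
        BUILD_BUG_ON(!IS_REACHABLE(CONFIG_KVM));

        return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled && shadow_accessed_mask;
}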


2023-06-09 09:29:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

On 5/27/23 01:44, Yu Zhao wrote:
> TLDR
> ====
> This patchset adds a fast path to clear the accessed bit without
> taking kvm->mmu_lock. It can significantly improve the performance of
> guests when the host is under heavy memory pressure.
>
> ChromeOS has been using a similar approach [1] since mid 2021 and it
> was proven successful on tens of millions devices.
>
> This v2 addressed previous requests [2] on refactoring code, removing
> inaccurate/redundant texts, etc.
>
> [1]https://crrev.com/c/2987928
> [2]https://lore.kernel.org/r/[email protected]/

From the KVM point of view the patches look good (though I wouldn't
mind if Nicholas took a look at the ppc part). Jason's comments on the
MMU notifier side are promising as well. Can you send v3 with Oliver's
comments addressed?

Thanks,

Paolo


2023-06-09 13:23:27

by Marc Zyngier

[permalink] [raw]
Subject: Re: kvm/arm64: Spark benchmark

On Fri, 09 Jun 2023 01:59:35 +0100,
Yu Zhao <[email protected]> wrote:
>
> TLDR
> ====
> Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].

Why are the 3 architectures you have considered being evaluated with 3
different benchmarks? I am not suspecting you of having cherry-picked
the best results, but I'd really like to see a variety of benchmarks
that exercise this stuff differently.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-06-15 17:26:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 08/10] kvm/x86: move tdp_mmu_enabled and shadow_accessed_mask

On Fri, May 26, 2023, Yu Zhao wrote:
> tdp_mmu_enabled and shadow_accessed_mask are needed to implement
> kvm_arch_has_test_clear_young().
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 ++++++
> arch/x86/kvm/mmu.h | 6 ------
> arch/x86/kvm/mmu/spte.h | 1 -
> 3 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb9d1f2d6136..753c67072c47 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1772,6 +1772,7 @@ struct kvm_arch_async_pf {
>
> extern u32 __read_mostly kvm_nr_uret_msrs;
> extern u64 __read_mostly host_efer;
> +extern u64 __read_mostly shadow_accessed_mask;
> extern bool __read_mostly allow_smaller_maxphyaddr;
> extern bool __read_mostly enable_apicv;
> extern struct kvm_x86_ops kvm_x86_ops;
> @@ -1855,6 +1856,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> bool mask);
>
> extern bool tdp_enabled;
> +#ifdef CONFIG_X86_64
> +extern bool tdp_mmu_enabled;
> +#else
> +#define tdp_mmu_enabled false
> +#endif

I would much prefer that these be kept in kvm/mmu.h. And looking at all the arch
code, there's no reason to make kvm_arch_has_test_clear_young() a runtime callback;
all of the logic is constant relative to when KVM is loaded.

So rather than have generic KVM pull from arch code, what if we have arch code
push info to generic KVM? We could even avoid #ifdefs if arch code passed in its
handler. That might result in an extra indirect branch though, so it might be
better to just use a flag? E.g. the x86 conversion would be something like this.

---
arch/x86/kvm/mmu/mmu.c | 5 +++++
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.h | 1 +
include/linux/kvm_host.h | 24 ++++--------------------
virt/kvm/kvm_main.c | 14 ++++++++++----
5 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8ebe542c565..84a4a83540f0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5809,6 +5809,11 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
max_huge_page_level = PG_LEVEL_1G;
else
max_huge_page_level = PG_LEVEL_2M;
+
+ if (tdp_mmu_enabled && kvm_ad_enabled())
+ kvm_init_test_clear_young(kvm_tdp_mmu_test_clear_young);
+ else
+ kvm_init_test_clear_young(NULL);
}
EXPORT_SYMBOL_GPL(kvm_configure_mmu);

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f463d54228f8..e878c88f0e02 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1308,7 +1308,7 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
}

-bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
+bool kvm_tdp_mmu_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
{
struct kvm_mmu_page *root;
int offset = ffs(shadow_accessed_mask) - 1;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 0a63b1afabd3..aaa0b75b3896 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -34,6 +34,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);

bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
const struct kvm_memory_slot *slot, int min_level);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1714f82a0c47..7a0922cbc36f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -264,31 +264,15 @@ struct kvm_gfn_range {
pte_t pte;
bool may_block;
};
+
+typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn);
-bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range);
-#endif
-
-/*
- * Architectures that implement kvm_arch_test_clear_young() should override
- * kvm_arch_has_test_clear_young().
- *
- * kvm_arch_has_test_clear_young() is allowed to return false positive, i.e., it
- * can return true if kvm_arch_test_clear_young() is supported but disabled due
- * to some runtime constraint. In this case, kvm_arch_test_clear_young() should
- * return true; otherwise, it should return false.
- *
- * For each young KVM PTE, kvm_arch_test_clear_young() should call
- * kvm_should_clear_young() to decide whether to clear the accessed bit.
- */
-#ifndef kvm_arch_has_test_clear_young
-static inline bool kvm_arch_has_test_clear_young(void)
-{
- return false;
-}
+void kvm_init_test_clear_young(hva_handler_t arch_test_clear_young);
#endif

enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ef2790469fda..ac83cfb30771 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -530,8 +530,6 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}

-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-
typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
unsigned long end);

@@ -859,6 +857,14 @@ bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
return args->clear;
}

+static hva_handler_t kvm_test_clear_young;
+
+void kvm_init_test_clear_young(hva_handler_t arch_test_clear_young)
+{
+ WARN_ON_ONCE(!list_empty(&vm_list));
+ kvm_test_clear_young = arch_test_clear_young;
+}
+
static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long start, unsigned long end,
bool clear, unsigned long *bitmap)
@@ -873,7 +879,7 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_

trace_kvm_age_hva(start, end);

- if (kvm_arch_has_test_clear_young()) {
+ if (kvm_test_clear_young) {
struct test_clear_young_args args = {
.bitmap = bitmap,
.end = end,
@@ -882,7 +888,7 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_

range.args = &args;
range.lockless = true;
- range.handler = kvm_arch_test_clear_young;
+ range.handler = kvm_test_clear_young;

if (!__kvm_handle_hva_range(kvm, &range))
return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;

base-commit: 39ca80f27cc0d2a37b4e3d07bbf763d4954934d7
--
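
The flag-based alternative mentioned above (avoiding the indirection of a
registered handler) could look roughly like the following sketch; the names
are illustrative and not taken from any posted patch:

/* Arch code pushes a capability bit instead of a function pointer. */
static bool kvm_has_test_clear_young __read_mostly;

void kvm_set_test_clear_young(bool supported)
{
        /* As in the patch above, expected to run before any VM is created. */
        WARN_ON_ONCE(!list_empty(&vm_list));
        kvm_has_test_clear_young = supported;
}

kvm_mmu_notifier_test_clear_young() would then check kvm_has_test_clear_young
and keep assigning kvm_arch_test_clear_young to range.handler directly, rather
than loading a handler registered at init time.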


2023-06-15 18:01:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On Fri, May 26, 2023, Yu Zhao wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0e571e973bc2..374262545f96 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> struct kvm_gfn_range {
> struct kvm_memory_slot *slot;
> + void *args;

There's no reason to make this "void *"; just declare "struct test_clear_young_args"
in the header. Arch code won't be able to use it regardless. And I vote for
something more like "test_clear_young_metadata", as there's far more information
in there than just function arguments.

And to stave off the argument that "void *" would allow reuse, take this opportunity
to unionize the test_clear_young field with the change_pte field, e.g.

/* comment about these fields being callback specific. */
union {
struct test_clear_young_metadata *metadata;
pte_t pte;
unsigned long callback_arg; /* needs a better name */
};

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 51e4882d0873..31ee58754b19 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -541,6 +541,7 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
> typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
> struct kvm_hva_range {
> + void *args;

Same feedback as kvm_gfn_range.

> unsigned long start;
> unsigned long end;
> pte_t pte;
> @@ -549,6 +550,7 @@ struct kvm_hva_range {
> on_unlock_fn_t on_unlock;
> bool flush_on_ret;
> bool may_block;
> + bool lockless;
> };
>
> /*
> @@ -602,6 +604,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> hva_end = min(range->end, slot->userspace_addr +
> (slot->npages << PAGE_SHIFT));
>
> + gfn_range.args = range->args;

And this goes away because the generic callback_arg is what gets propagated.

> +
> /*
> * To optimize for the likely case where the address
> * range is covered by zero or one memslots, don't
> @@ -619,7 +623,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
>
> - if (!locked) {
> + if (!range->lockless && !locked) {
> locked = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
> @@ -628,6 +632,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> break;
> }
> ret |= range->handler(kvm, &gfn_range);
> +
> + if (range->lockless && ret)

I don't like overloading "lockless" to also mean "stop on ret". Just add another
flag, there's literally no cost for most callbacks as everything is constant at
compile time and gets optimized away.

> + range.args = &args;
> + range.lockless = true;

The lockless and stop_on_ret behavior needs comments.

> + range.handler = kvm_arch_test_clear_young;
> +
> + if (!__kvm_handle_hva_range(kvm, &range))
> + return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
> + }
> +
> + if (bitmap)
> + return 0;
> +
> + range.args = NULL;
> + range.lockless = false;

No need to manually clear these, they'll be zeroed by the initialization code.

E.g. all in all, something like so

---
include/linux/kvm_host.h | 9 +++++++--
virt/kvm/kvm_main.c | 29 +++++++++++++++++------------
2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a0922cbc36f..e04605061f5e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,12 +256,17 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+struct test_clear_young_metadata;
+
struct kvm_gfn_range {
struct kvm_memory_slot *slot;
- void *args;
gfn_t start;
gfn_t end;
- pte_t pte;
+ union {
+ struct test_clear_young_metadata *metadata;
+ pte_t pte;
+ unsigned long callback_arg;
+ };
bool may_block;
};

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ac83cfb30771..8cf4fee9cd8b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -536,16 +536,20 @@ typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
typedef void (*on_unlock_fn_t)(struct kvm *kvm);

struct kvm_hva_range {
- void *args;
unsigned long start;
unsigned long end;
- pte_t pte;
hva_handler_t handler;
+ union {
+ struct test_clear_young_metadata *metadata;
+ pte_t pte;
+ unsigned long callback_arg;
+ };
on_lock_fn_t on_lock;
on_unlock_fn_t on_unlock;
bool flush_on_ret;
bool may_block;
bool lockless;
+ bool stop_on_ret;
};

/*
@@ -576,6 +580,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_memslots *slots;
int i, idx;

+ BUILD_BUG_ON(sizeof(gfn_range.callback_arg) != sizeof(gfn_range.pte) ||
+ sizeof(gfn_range.callback_arg) != sizeof(gfn_range.metadata));
+
if (WARN_ON_ONCE(range->end <= range->start))
return 0;

@@ -599,16 +606,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
hva_end = min(range->end, slot->userspace_addr +
(slot->npages << PAGE_SHIFT));

- gfn_range.args = range->args;
-
/*
* To optimize for the likely case where the address
* range is covered by zero or one memslots, don't
* bother making these conditional (to avoid writes on
* the second or later invocation of the handler).
*/
- gfn_range.pte = range->pte;
gfn_range.may_block = range->may_block;
+ gfn_range.callback_arg = range->callback_arg;

/*
* {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -628,7 +633,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
}
ret |= range->handler(kvm, &gfn_range);

- if (range->lockless && ret)
+ /* comment goes here. */
+ if (range->stop_on_ret && ret)
break;
}
}
@@ -830,7 +836,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
return kvm_handle_hva_range(mn, start, end, __pte(0), kvm_age_gfn);
}

-struct test_clear_young_args {
+struct test_clear_young_metadata {
unsigned long *bitmap;
unsigned long end;
bool clear;
@@ -839,7 +845,7 @@ struct test_clear_young_args {

bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
{
- struct test_clear_young_args *args = range->args;
+ struct test_clear_young_metadata *args = range->metadata;

VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);

@@ -880,14 +886,15 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
trace_kvm_age_hva(start, end);

if (kvm_test_clear_young) {
- struct test_clear_young_args args = {
+ struct test_clear_young_metadata args = {
.bitmap = bitmap,
.end = end,
.clear = clear,
};

- range.args = &args;
range.lockless = true;
+ range.stop_on_ret = true;
+ range.metadata = &args;
range.handler = kvm_test_clear_young;

if (!__kvm_handle_hva_range(kvm, &range))
@@ -897,8 +904,6 @@ static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_
if (bitmap)
return 0;

- range.args = NULL;
- range.lockless = false;
range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;

return __kvm_handle_hva_range(kvm, &range) ? MMU_NOTIFIER_RANGE_YOUNG : 0;

base-commit: 7a5d8be2c18415b73b9380741095f439d6983a40
--


2023-06-15 19:16:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 09/10] kvm/x86: add kvm_arch_test_clear_young()

On Fri, May 26, 2023, Yu Zhao wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 08340219c35a..6875a819e007 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1232,6 +1232,40 @@ bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
> }
>
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + struct kvm_mmu_page *root;
> + int offset = ffs(shadow_accessed_mask) - 1;
> +
> + if (kvm_shadow_root_allocated(kvm))

This needs a comment.

> + return true;
> +
> + rcu_read_lock();
> +
> + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {

As requested in v1[1], please add a macro for a lockless walk.

[1] https://lkml.kernel.org/r/Y%2Fed0XYAPx%2B7pukA%40google.com

> + struct tdp_iter iter;
> +
> + if (kvm_mmu_page_as_id(root) != range->slot->as_id)
> + continue;
> +
> + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> + u64 *sptep = rcu_dereference(iter.sptep);
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));

Hrm, I don't like adding this in KVM. The primary MMU might guarantee that this
callback is invoked if and only if the SPTE is backed by struct page memory, but
there's no reason to assume that's true in KVM. If we want the sanity check, then
this needs to use kvm_pfn_to_refcounted_page().

And it should use KVM's MMU_WARN_ON(), which is a mess and effectively dead code,
but I'm working on changing that[2], i.e. by the time this gets to Linus' tree,
the sanity check should have a much cleaner implementation.

[2] https://lore.kernel.org/all/[email protected]

> +
> + if (!(iter.old_spte & shadow_accessed_mask))
> + continue;
> +
> + if (kvm_should_clear_young(range, iter.gfn))
> + clear_bit(offset, (unsigned long *)sptep);

If/when you rebase on https://github.com/kvm-x86/linux/tree/next, can you pull
out the atomic bits of tdp_mmu_clear_spte_bits() and use that new helper? E.g.

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..914c34518829 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -58,15 +58,18 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
return old_spte;
}

+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+ atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+ return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
u64 mask, int level)
{
- atomic64_t *sptep_atomic;
-
- if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
- sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
- return (u64)atomic64_fetch_and(~mask, sptep_atomic);
- }
+ if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+ return tdp_mmu_clear_spte_bits_atomic(sptep, mask);

__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
return old_spte;


2023-06-18 19:24:36

by Yu Zhao

[permalink] [raw]
Subject: Re: kvm/x86: multichase benchmark

On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao <[email protected]> wrote:
>
> TLDR
> ====
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after this patchset [1].
>
> Hardware
> ========
> HOST $ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 43 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 128
> On-line CPU(s) list: 0-127
> Vendor ID: AuthenticAMD
> Model name: AMD Ryzen Threadripper PRO 3995WX 64-Cores
> CPU family: 23
> Model: 49
> Thread(s) per core: 2
> Core(s) per socket: 64
> Socket(s): 1
> Stepping: 0
> Frequency boost: disabled
> CPU max MHz: 4308.3979
> CPU min MHz: 2200.0000
> BogoMIPS: 5390.20
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2
> ...
> Virtualization features:
> Virtualization: AMD-V
> Caches (sum of all):
> L1d: 2 MiB (64 instances)
> L1i: 2 MiB (64 instances)
> L2: 32 MiB (64 instances)
> L3: 256 MiB (16 instances)
> NUMA:
> NUMA node(s): 1
> NUMA node0 CPU(s): 0-127
> Vulnerabilities:
> Itlb multihit: Not affected
> L1tf: Not affected
> Mds: Not affected
> Meltdown: Not affected
> Mmio stale data: Not affected
> Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
> Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
> Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
> Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
> Srbds: Not affected
> Tsx async abort: Not affected
>
> HOST $ numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0-127
> node 0 size: 257542 MB
> node 0 free: 224855 MB
> node distances:
> node 0
> 0: 10
>
> HOST $ cat /sys/class/nvme/nvme0/model
> INTEL SSDPF21Q800GB
>
> HOST $ cat /sys/class/nvme/nvme0/numa_node
> 0
>
> Software
> ========
> HOST $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=22.04
> DISTRIB_CODENAME=jammy
> DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
>
> HOST $ uname -a
> Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun 7 22:17:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>
> HOST $ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/nvme0n1p2 partition 466838356 0 -2
>
> HOST $ cat /sys/kernel/mm/lru_gen/enabled
> 0x000f
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer defer+madvise madvise [never]
>
> Procedure
> =========
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ <Build multichase>
> HOST $ <Unpack /boot/initrd.img into ./initrd/>
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
> "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \

I was reminded that I missed one parameter above, i.e.,

"/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
^^

> ./initrd/init
>
> HOST $ <Pack ./initrd/ into ./initrd.img>
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
> path=/sys/fs/cgroup/memcg$1
>
> mkdir $path
> echo $BASHPID >$path/cgroup.procs

And one line here:

echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills

> qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
> -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
> -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
> run $memcg &
> done
>
> wait
>
> Results
> =======
> Before [1] After Change
> ----------------------------------------------
> Total samples 6824 7237 +6%
>
> Notes
> =====
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
> reference" was included so that the comparison is apples to
> apples.
> https://lore.kernel.org/r/[email protected]/

2023-06-18 20:16:23

by Yu Zhao

[permalink] [raw]
Subject: Re: kvm/arm64: Spark benchmark

On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier <[email protected]> wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao <[email protected]> wrote:
> >
> > TLDR
> > ====
> > Apache Spark spent 12% less time sorting four billion random integers twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?

I was hoping people having special interests in different archs might
try to reproduce the benchmarks that I didn't report (but did cover)
and see what happens.

> I am not suspecting you to have cherry-picked
> the best results

I'm generally very conservative when reporting *synthetic* results.
For example, the same memcached benchmark used on powerpc yielded >50%
improvement on aarch64, because the default Ubuntu Kconfig uses 64KB
base page size for powerpc but 4KB for aarch64. (Before the series,
the reclaim (swap) path takes kvm->mmu_lock for *write* on O(nr of all
pages to consider); after the series, it becomes O(actual nr of pages
to swap), which is <10% given how the benchmark was set up.)

Ops/sec Avg. Latency p50 Latency p99 Latency p99.9 Latency
------------------------------------------------------------------------
Before 639511.40 0.09940 0.04700 0.27100 22.52700
After 974184.60 0.06471 0.04700 0.15900 3.75900

> but I'd really like to see a variety of benchmarks
> that exercise this stuff differently.

I'd be happy to try other synthetic workloads that people think are
relatively representative. Also, I've backported the series and
started an A/B experiment involving ~1 million devices (real-world
workloads). We should have the preliminary results by the time I post
the next version.

2023-06-20 02:58:03

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 00/10] mm/kvm: locklessly clear the accessed bit

On Fri, Jun 9, 2023 at 3:08 AM Paolo Bonzini <[email protected]> wrote:
>
> On 5/27/23 01:44, Yu Zhao wrote:
> > TLDR
> > ====
> > This patchset adds a fast path to clear the accessed bit without
> > taking kvm->mmu_lock. It can significantly improve the performance of
> > guests when the host is under heavy memory pressure.
> >
> > ChromeOS has been using a similar approach [1] since mid 2021 and it
> > was proven successful on tens of millions devices.
> >
> > This v2 addressed previous requests [2] on refactoring code, removing
> > inaccurate/redundant texts, etc.
> >
> > [1]https://crrev.com/c/2987928
> > [2]https://lore.kernel.org/r/[email protected]/
>
> From the KVM point of view the patches look good (though I wouldn't
> mind if Nicholas took a look at the ppc part). Jason's comment on the
> MMU notifier side are promising as well. Can you send v3 with Oliver's
> comments addressed?

Thanks. I'll address all the comments in v3 and post it asap.

Meanwhile, some updates on the recent progress from my side:
1. I've asked some downstream kernels to pick up v2 for testing; the
Archlinux Zen kernel did. I don't really expect its enthusiastic
testers to find this series relevant to their use cases. But who
knows.
2. I've also asked openbenchmarking.org to run their popular highmem
benchmark suites with v2. Hopefully they'll have some independent
results soon.
3. I've backported v2 to v5.15 and v6.1 and started an A/B experiment
involving ~1 million devices, as I mentioned in another email in this
thread. I should have some results to share when posting v3.

2023-06-20 06:48:03

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> KVM page tables are currently not RCU safe against remapping, i.e.,
> kvmppc_unmap_free_pmd_entry_table() et al. The previous

Minor nit but the "page table" is not RCU-safe against something. It
is RCU-freed, and therefore some algorithm that accesses it can have
the existence guarantee provided by RCU (usually there still needs
to be more to it).
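
In generic terms, the existence guarantee looks something like the sketch
below (purely illustrative names; the patch gets the equivalent effect via
SLAB_TYPESAFE_BY_RCU, which roughly keeps the freed memory a page table of
the same type until a grace period has elapsed, so a stale walker only ever
pokes at *some* page table):

	/* Lockless walker */
	rcu_read_lock();
	pte = rcu_dereference(dir->table);
	if (pte)
		clear_accessed_bit(pte);	/* harmless even if the table
						 * was concurrently detached */
	rcu_read_unlock();

	/* Freer */
	old = dir->table;
	WRITE_ONCE(dir->table, NULL);
	synchronize_rcu();			/* or call_rcu()/kfree_rcu() */
	free_page_table(old);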

> mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> that operation.
>
> However, the new mmu_notifier_ops member test_clear_young() provides
> a fast path that does not take kvm->mmu_lock. To implement
> kvm_arch_test_clear_young() for that path, orphan page tables need to
> be freed by RCU.

Short version: clear the referenced bit using RCU instead of MMU lock
to protect against page table freeing, and there is no problem with
clearing the bit in a table that has been freed.

Seems reasonable.

>
> Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> hence not a concern.

Not sure if you really need to make the distinction about why the page
table is freed, we might free them via unmapping. The point is just
anything that frees them while there can be concurrent access, right?

>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 461307b89c3a..3b65b3b11041 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> {
> unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
>
> - kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> + kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> + SLAB_TYPESAFE_BY_RCU, pte_ctor);
> if (!kvm_pte_cache)
> return -ENOMEM;
>
> size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
>
> - kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> + kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> + SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> if (!kvm_pmd_cache) {
> kmem_cache_destroy(kvm_pte_cache);
> return -ENOMEM;

KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
(for some reason), which are not RCU freed. I think you need them too?
I wouldn't mind if the kvm pud table slab was moved in here instead of
shared.

Thanks,
Nick

2023-06-20 07:35:28

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> Add mmu_notifier_ops->test_clear_young() to supersede test_young()
> and clear_young().
>
> test_clear_young() has a fast path, which if supported, allows its
> callers to safely clear the accessed bit without taking
> kvm->mmu_lock.
>
> The fast path requires arch-specific code that generally relies on
> RCU and CAS: the former protects KVM page tables from being freed
> while the latter clears the accessed bit atomically against both the
> hardware and other software page table walkers. If the fast path is
> unsupported, test_clear_young() falls back to the existing slow path
> where kvm->mmu_lock is then taken.
>
> test_clear_young() can also operate on a range of KVM PTEs
> individually according to a bitmap, if the caller provides it.

It would be better if you could do patch 1 that only touches the
mmu_notifier code and implements mmu_notifier_test_clear_young() in
terms of existing callbacks, and next patch swaps KVM to new callbacks
and remove the old ones.

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 64a3e051c3c4..dfdbb370682d 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -60,6 +60,8 @@ enum mmu_notifier_event {
> };
>
> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> +#define MMU_NOTIFIER_RANGE_LOCKLESS (1 << 1)
> +#define MMU_NOTIFIER_RANGE_YOUNG (1 << 2)
>
> struct mmu_notifier_ops {
> /*
> @@ -122,6 +124,10 @@ struct mmu_notifier_ops {
> struct mm_struct *mm,
> unsigned long address);
>
> + int (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + bool clear, unsigned long *bitmap);

This should have a comment like the others. Callback wants to know how
to implement it.

Could add a _range on it as well while you're here, to correct that
inconsistency.

> +
> /*
> * change_pte is called in cases that pte mapping to page is changed:
> * for example, when ksm remaps pte to point to a new shared page.
> @@ -392,6 +398,9 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
> unsigned long end);
> extern int __mmu_notifier_test_young(struct mm_struct *mm,
> unsigned long address);
> +extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + bool clear, unsigned long *bitmap);
> extern void __mmu_notifier_change_pte(struct mm_struct *mm,
> unsigned long address, pte_t pte);
> extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
> @@ -440,6 +449,35 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
> return 0;
> }
>
> +/*
> + * mmu_notifier_test_clear_young() returns nonzero if any of the KVM PTEs within
> + * a given range was young. Specifically, it returns MMU_NOTIFIER_RANGE_LOCKLESS
> + * if the fast path was successful, MMU_NOTIFIER_RANGE_YOUNG otherwise.
> + *
> + * The last parameter to the function is a bitmap and only the fast path
> + * supports it: if it is NULL, the function falls back to the slow path if the
> + * fast path was unsuccessful; otherwise, the function bails out.

Then if it was NULL, you would just not populate it. Minimize differences
and cases for the caller/implementations.

> + *
> + * The bitmap has the following specifications:
> + * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
> + * 2. The offset of each bit should be relative to the end, i.e., the offset
> + * corresponding to addr should be (end-addr)/PAGE_SIZE-1. This is convenient
> + * for batching while forward looping.
> + *
> + * When testing, this function sets the corresponding bit in the bitmap for each
> + * young KVM PTE. When clearing, this function clears the accessed bit for each
> + * young KVM PTE whose corresponding bit in the bitmap is set.

I think this is over-designed as a first pass. The secondary MMU should
just implement the call always. If it can't do it locklessly, then just
do individual lookups. If the benefit is in the batching as you say then
the locked version will get similar benefit. Possibly more because locks
like some amount of batching when contended.

I think that would reduce some concerns about cases of secondary MMUs
that do not support the lockless version yet, and avoid
proliferation of code paths by platform.

_If_ that was insufficient then I would like to see numbers and profiles
and incremental patch to expose more complexity like this.

Also mmu notifier code should say nothing about KVM PTEs or use kvm
names in any code or comments either. "if the page was accessed via the
secondary MMU" or similar is mutually understandable to KVM and mm
developers.

> @@ -880,6 +887,72 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> kvm_test_age_gfn);
> }
>
> +struct test_clear_young_args {
> + unsigned long *bitmap;
> + unsigned long end;
> + bool clear;
> + bool young;
> +};
> +
> +bool kvm_should_clear_young(struct kvm_gfn_range *range, gfn_t gfn)
> +{
> + struct test_clear_young_args *args = range->args;
> +
> + VM_WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> +
> + args->young = true;
> +
> + if (args->bitmap) {
> + int offset = hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
> +
> + if (args->clear)
> + return test_bit(offset, args->bitmap);
> +
> + __set_bit(offset, args->bitmap);
> + }
> +
> + return args->clear;
> +}

I don't quite understand what's going on here. This is actually the
function that notes the young pte, despite its name suggesting it is
only a query.

Shouldn't it set the bitmap bit even in the clear case? And why is it
testing at all? Oh, it seems to be some strange mix of test *or* clear
young. With the bitmap being a predicate in some cases for the clear
case.

This is a fairly confusing multi-modal API then. I think it should
take 2 bitmaps, one is the young bitmap and the other is the predicate
bitmap, and either/or can be NULL.

Also, this kvm_should_clear_young helper is clunky and misnamed. If you
just provided an inline helper to get the test_clear_young bitmap offset
from a gfn, then setting/clearing the bit in the caller would be quite
trivial.
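
To spell out the two modes for anyone else reading along, the quoted helper
is equivalent to something like this (illustrative only, not a proposed
replacement; "offset" is the bit index relative to the end of the range, as
documented in the header):

	static bool should_clear_young(struct test_clear_young_args *args,
				       int offset)
	{
		/* callers only invoke this for PTEs they found young */
		args->young = true;

		if (!args->bitmap)
			return args->clear;	/* clear iff in clearing mode */

		if (args->clear)
			/* clearing mode: the bitmap is a predicate */
			return test_bit(offset, args->bitmap);

		/* testing mode: the bitmap is an output */
		__set_bit(offset, args->bitmap);
		return false;
	}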

> +
> +static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + bool clear, unsigned long *bitmap)
> +{
> + struct kvm *kvm = mmu_notifier_to_kvm(mn);
> + struct kvm_hva_range range = {
> + .start = start,
> + .end = end,
> + .on_lock = (void *)kvm_null_fn,
> + .on_unlock = (void *)kvm_null_fn,
> + };
> +
> + trace_kvm_age_hva(start, end);
> +
> + if (kvm_arch_has_test_clear_young()) {
> + struct test_clear_young_args args = {
> + .bitmap = bitmap,
> + .end = end,
> + .clear = clear,
> + };
> +
> + range.args = &args;
> + range.lockless = true;
> + range.handler = kvm_arch_test_clear_young;
> +
> + if (!__kvm_handle_hva_range(kvm, &range))
> + return args.young ? MMU_NOTIFIER_RANGE_LOCKLESS : 0;
> + }
> +
> + if (bitmap)
> + return 0;
> +
> + range.args = NULL;
> + range.lockless = false;
> + range.handler = clear ? kvm_age_gfn : kvm_test_age_gfn;

Minor thing, but KVM's "young" handling has been called "age" until now.
Any reason not to stick with that theme?

Thanks,
Nick

2023-06-20 08:05:53

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> Implement kvm_arch_test_clear_young() to support the fast path in
> mmu_notifier_ops->test_clear_young().
>
> It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> KVM PTEs and VMs are not nested, where it can rely on RCU and
> pte_xchg() to safely clear the accessed bit without taking
> kvm->mmu_lock. Complex cases fall back to the existing slow path
> where kvm->mmu_lock is then taken.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/powerpc/include/asm/kvm_host.h | 8 ++++
> arch/powerpc/include/asm/kvm_ppc.h | 1 +
> arch/powerpc/kvm/book3s.c | 6 +++
> arch/powerpc/kvm/book3s.h | 1 +
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv.c | 5 +++
> 6 files changed, 80 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 14ee0dece853..75c260ea8a9e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> + cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> + radix_enabled();

This could probably be radix_enabled() && !kvmhv_on_pseries(). Although
unclear why not nested hypervisor... I'd have to think about it a bit
more. Do you have any idea what might go wrong, or just didn't have the
time to consider it? (Not just powerpc nested but any others).

> +}
> +
> #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 79a9c0bb8bba..ff1af6a7b44f 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -287,6 +287,7 @@ struct kvmppc_ops {
> bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> + bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
> bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> void (*free_memslot)(struct kvm_memory_slot *slot);
> int (*init_vm)(struct kvm *kvm);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 686d8d9eda3e..37bf40b0c4ff 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> }
>
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + return !kvm->arch.kvm_ops->test_clear_young ||
> + kvm->arch.kvm_ops->test_clear_young(kvm, range);
> +}
> +
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> index 58391b4b32ed..fa2659e21ccc 100644
> --- a/arch/powerpc/kvm/book3s.h
> +++ b/arch/powerpc/kvm/book3s.h
> @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
>
> extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 3b65b3b11041..0a392e9a100a 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> return ref;
> }
>
> +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + bool err;
> + gfn_t gfn = range->start;
> +
> + rcu_read_lock();
> +
> + err = !kvm_is_radix(kvm);
> + if (err)
> + goto unlock;
> +
> + /*
> + * Case 1: This function kvmppc_switch_mmu_to_hpt()
> + *
> + * rcu_read_lock()
> + * Test kvm_is_radix() kvm->arch.radix = 0
> + * Use kvm->arch.pgtable synchronize_rcu()
> + * rcu_read_unlock()
> + * kvmppc_free_radix()
> + *
> + *
> + * Case 2: This function kvmppc_switch_mmu_to_radix()
> + *
> + * kvmppc_init_vm_radix()
> + * smp_wmb()
> + * Test kvm_is_radix() kvm->arch.radix = 1
> + * smp_rmb()
> + * Use kvm->arch.pgtable
> + */
> + smp_rmb();

Comment could stand to expand slightly on what you are solving, in
words.

If you use synchronize_rcu() on both sides, you wouldn't need the
barrier, right?

> + while (gfn < range->end) {
> + pte_t *ptep;
> + pte_t old, new;
> + unsigned int shift;
> +
> + ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> + if (!ptep)
> + goto next;
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));

Not really appropriate at the KVM level. mm enforces this kind of
thing (with notifiers).

> +
> + old = READ_ONCE(*ptep);
> + if (!pte_present(old) || !pte_young(old))
> + goto next;
> +
> + new = pte_mkold(old);
> +
> + if (kvm_should_clear_young(range, gfn))
> + pte_xchg(ptep, old, new);

*Probably* will work. I can't think of a reason why not at the
moment anyway :)

Thanks,
Nick
> +next:
> + gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
> + }
> +unlock:
> + rcu_read_unlock();
> +
> + return err;
> +}
> +
> /* Returns the number of PAGE_SIZE pages that are dirty */
> static int kvm_radix_test_clear_dirty(struct kvm *kvm,
> struct kvm_memory_slot *memslot, int pagenum)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 130bafdb1430..20a81ec9fde8 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
> spin_lock(&kvm->mmu_lock);
> kvm->arch.radix = 0;
> spin_unlock(&kvm->mmu_lock);
> + /* see the comments in kvm_test_clear_young_hv() */

I'm guilty of such comments at times, but it wouldn't hurt to say
/* Finish concurrent kvm_test_clear_young_hv access to page tables */

Then you know where to look for more info and you have a vague
idea what it's for.

> + synchronize_rcu();

> kvmppc_free_radix(kvm);
>
> lpcr = LPCR_VPM1;
> @@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
> if (err)
> return err;
> kvmppc_rmap_reset(kvm);
> + /* see the comments in kvm_test_clear_young_hv() */
> + smp_wmb();
> /* Mutual exclusion with kvm_unmap_gfn_range etc. */
> spin_lock(&kvm->mmu_lock);
> kvm->arch.radix = 1;
> @@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> .unmap_gfn_range = kvm_unmap_gfn_range_hv,
> .age_gfn = kvm_age_gfn_hv,
> .test_age_gfn = kvm_test_age_gfn_hv,
> + .test_clear_young = kvm_test_clear_young_hv,
> .set_spte_gfn = kvm_set_spte_gfn_hv,
> .free_memslot = kvmppc_core_free_memslot_hv,
> .init_vm = kvmppc_core_init_vm_hv,

Thanks for looking at the powerpc conversion!

Thanks,
Nick

2023-06-20 08:47:44

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

On Tue, Jun 20, 2023 at 12:33 AM Nicholas Piggin <[email protected]> wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > KVM page tables are currently not RCU safe against remapping, i.e.,
> > kvmppc_unmap_free_pmd_entry_table() et al. The previous
>
> Minor nit but the "page table" is not RCU-safe against something. It
> is RCU-freed, and therefore some algorithm that accesses it can have
> the existence guarantee provided by RCU (usually there still needs
> to be more to it).
>
> > mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> > that operation.
> >
> > However, the new mmu_notifier_ops member test_clear_young() provides
> > a fast path that does not take kvm->mmu_lock. To implement
> > kvm_arch_test_clear_young() for that path, orphan page tables need to
> > be freed by RCU.
>
> Short version: clear the referenced bit using RCU instead of MMU lock
> to protect against page table freeing, and there is no problem with
> clearing the bit in a table that has been freed.
>
> Seems reasonable.

Thanks. All above points taken.

> > Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> > hence not a concern.
>
> Not sure if you really need to make the distinction about why the page
> table is freed, we might free them via unmapping. The point is just
> anything that frees them while there can be concurrent access, right?

Correct.

> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index 461307b89c3a..3b65b3b11041 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> > {
> > unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
> >
> > - kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> > + kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> > + SLAB_TYPESAFE_BY_RCU, pte_ctor);
> > if (!kvm_pte_cache)
> > return -ENOMEM;
> >
> > size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
> >
> > - kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> > + kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> > + SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> > if (!kvm_pmd_cache) {
> > kmem_cache_destroy(kvm_pte_cache);
> > return -ENOMEM;
>
> KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
> (for some reason), which are not RCU freed. I think you need them too?

We don't. The use of the arch/powerpc allocator for PUD tables seems
appropriate to me because, unlike PMD/PTE tables, we never free PUD
tables during the lifetime of a VM:
* We don't free PUD/PMD/PTE tables when they become empty, i.e., not
mapping any pages but still attached. (We could in theory, as
x86/aarch64 do.)
* We have to free PMD/PTE tables when we replace them with 1GB/2MB
pages. (Otherwise we'd lose track of detached tables.) And we
currently don't support huge pages at P4D level, so we never detach
and free PUD tables.

2023-06-20 11:43:09

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 06/10] kvm/powerpc: make radix page tables RCU safe

On Tue Jun 20, 2023 at 6:00 PM AEST, Yu Zhao wrote:
> On Tue, Jun 20, 2023 at 12:33 AM Nicholas Piggin <[email protected]> wrote:
> >
> > On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > > KVM page tables are currently not RCU safe against remapping, i.e.,
> > > kvmppc_unmap_free_pmd_entry_table() et al. The previous
> >
> > Minor nit but the "page table" is not RCU-safe against something. It
> > is RCU-freed, and therefore some algorithm that accesses it can have
> > the existence guarantee provided by RCU (usually there still needs
> > to be more to it).
> >
> > > mmu_notifier_ops members rely on kvm->mmu_lock to synchronize with
> > > that operation.
> > >
> > > However, the new mmu_notifier_ops member test_clear_young() provides
> > > a fast path that does not take kvm->mmu_lock. To implement
> > > kvm_arch_test_clear_young() for that path, orphan page tables need to
> > > be freed by RCU.
> >
> > Short version: clear the referenced bit using RCU instead of MMU lock
> > to protect against page table freeing, and there is no problem with
> > clearing the bit in a table that has been freed.
> >
> > Seems reasonable.
>
> Thanks. All above points taken.
>
> > > Unmapping, specifically kvm_unmap_radix(), does not free page tables,
> > > hence not a concern.
> >
> > Not sure if you really need to make the distinction about why the page
> > table is freed, we might free them via unmapping. The point is just
> > anything that frees them while there can be concurrent access, right?
>
> Correct.
>
> > > Signed-off-by: Yu Zhao <[email protected]>
> > > ---
> > > arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index 461307b89c3a..3b65b3b11041 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -1469,13 +1469,15 @@ int kvmppc_radix_init(void)
> > > {
> > > unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;
> > >
> > > - kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
> > > + kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
> > > + SLAB_TYPESAFE_BY_RCU, pte_ctor);
> > > if (!kvm_pte_cache)
> > > return -ENOMEM;
> > >
> > > size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;
> > >
> > > - kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
> > > + kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
> > > + SLAB_TYPESAFE_BY_RCU, pmd_ctor);
> > > if (!kvm_pmd_cache) {
> > > kmem_cache_destroy(kvm_pte_cache);
> > > return -ENOMEM;
> >
> > KVM PPC HV radix PUD level page tables use the arch/powerpc allocators
> > (for some reason), which are not RCU freed. I think you need them too?
>
> We don't. The use of the arch/powerpc allocator for PUD tables seems
> appropriate to me because, unlike PMD/PTE tables, we never free PUD
> tables during the lifetime of a VM:

Ah you're right, the pud_free only comes from the double alloc case
so it's never visible to concurrent threads.

> * We don't free PUD/PMD/PTE tables when they become empty, i.e., not
> mapping any pages but still attached. (We could in theory, as
> x86/aarch64 do.)

We may try to do that at some point, but that's not related to your
patch for now so no worries.

Thanks,
Nick

2023-06-21 01:43:25

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

On Tue, Jun 20, 2023 at 1:48 AM Nicholas Piggin <[email protected]> wrote:
>
> On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > Implement kvm_arch_test_clear_young() to support the fast path in
> > mmu_notifier_ops->test_clear_young().
> >
> > It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> > KVM PTEs and VMs are not nested, where it can rely on RCU and
> > pte_xchg() to safely clear the accessed bit without taking
> > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > where kvm->mmu_lock is then taken.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/powerpc/include/asm/kvm_host.h | 8 ++++
> > arch/powerpc/include/asm/kvm_ppc.h | 1 +
> > arch/powerpc/kvm/book3s.c | 6 +++
> > arch/powerpc/kvm/book3s.h | 1 +
> > arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
> > arch/powerpc/kvm/book3s_hv.c | 5 +++
> > 6 files changed, 80 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > index 14ee0dece853..75c260ea8a9e 100644
> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> > static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> > static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> >
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> > + cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> > + radix_enabled();
>
> This could probably be radix_enabled() && !kvmhv_on_pseries().

Will do. (I used !kvmhv_on_pseries() in v1 but had second thoughts on
moving kvmhv_on_pseries() into this file.)

> Although unclear why not nested hypervisor... I'd have to think about it a bit
> more. Do you have any idea what might go wrong, or just didn't have the
> time to consider it? (Not just powerpc nested but any others).

Yes, this series excludes nested KVM support on all architectures. The
common reason for such a decision on powerpc and x86 (aarch64 doesn't
support nested yet) is that it's quite challenging to make the rmap, a
complex data structure that maps one PFN to multiple GFNs, lockless.
(See kvmhv_insert_nest_rmap().) It might be possible, however, the
potential ROI would be in question.

> > +}
> > +
> > #endif /* __POWERPC_KVM_HOST_H__ */
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > index 79a9c0bb8bba..ff1af6a7b44f 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -287,6 +287,7 @@ struct kvmppc_ops {
> > bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > + bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > void (*free_memslot)(struct kvm_memory_slot *slot);
> > int (*init_vm)(struct kvm *kvm);
> > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > index 686d8d9eda3e..37bf40b0c4ff 100644
> > --- a/arch/powerpc/kvm/book3s.c
> > +++ b/arch/powerpc/kvm/book3s.c
> > @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> > }
> >
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > + return !kvm->arch.kvm_ops->test_clear_young ||
> > + kvm->arch.kvm_ops->test_clear_young(kvm, range);
> > +}
> > +
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> > index 58391b4b32ed..fa2659e21ccc 100644
> > --- a/arch/powerpc/kvm/book3s.h
> > +++ b/arch/powerpc/kvm/book3s.h
> > @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> > extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> >
> > extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > index 3b65b3b11041..0a392e9a100a 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > return ref;
> > }
> >
> > +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> > +{
> > + bool err;
> > + gfn_t gfn = range->start;
> > +
> > + rcu_read_lock();
> > +
> > + err = !kvm_is_radix(kvm);
> > + if (err)
> > + goto unlock;
> > +
> > + /*
> > + * Case 1: This function kvmppc_switch_mmu_to_hpt()
> > + *
> > + * rcu_read_lock()
> > + * Test kvm_is_radix() kvm->arch.radix = 0
> > + * Use kvm->arch.pgtable synchronize_rcu()
> > + * rcu_read_unlock()
> > + * kvmppc_free_radix()
> > + *
> > + *
> > + * Case 2: This function kvmppc_switch_mmu_to_radix()
> > + *
> > + * kvmppc_init_vm_radix()
> > + * smp_wmb()
> > + * Test kvm_is_radix() kvm->arch.radix = 1
> > + * smp_rmb()
> > + * Use kvm->arch.pgtable
> > + */
> > + smp_rmb();
>
> Comment could stand to expand slightly on what you are solving, in
> words.

Will do.

> If you use synchronize_rcu() on both sides, you wouldn't need the
> barrier, right?

Case 2 is about memory ordering, which is orthogonal to case 1 (RCU
freeing). So we need the r/w barrier regardless.

> > + while (gfn < range->end) {
> > + pte_t *ptep;
> > + pte_t old, new;
> > + unsigned int shift;
> > +
> > + ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> > + if (!ptep)
> > + goto next;
> > +
> > + VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
>
> Not really appropriate at the KVM level. mm enforces this kind of
> thing (with notifiers).

Will remove this.

> > +
> > + old = READ_ONCE(*ptep);
> > + if (!pte_present(old) || !pte_young(old))
> > + goto next;
> > +
> > + new = pte_mkold(old);
> > +
> > + if (kvm_should_clear_young(range, gfn))
> > + pte_xchg(ptep, old, new);
>
> *Probably* will work. I can't think of a reason why not at the
> moment anyway :)

My reasoning:
* It should work if we only change the dedicated A bit, i.e., not
shared for other purposes, because races are safe (the case here).
* It may not work for x86 EPT without the A bit (excluded in this
series) where accesses are trapped by the R/X bits, because races in
changing the R/X bits can be unsafe.
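
Spelled out against the loop above (illustrative; pte_xchg() is assumed to
be a compare-and-exchange of the whole PTE):

	old = READ_ONCE(*ptep);
	if (pte_present(old) && pte_young(old)) {
		/*
		 * old and pte_mkold(old) differ only in the dedicated A
		 * bit. If the exchange fails because another walker
		 * changed the PTE first, nothing breaks: either that
		 * walker already cleared the A bit, or the page simply
		 * stays "young" until the next aging pass. No mapping,
		 * permission or dirty state is touched here.
		 */
		pte_xchg(ptep, old, pte_mkold(old));
	}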

> > +next:
> > + gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
> > + }
> > +unlock:
> > + rcu_read_unlock();
> > +
> > + return err;
> > +}
> > +
> > /* Returns the number of PAGE_SIZE pages that are dirty */
> > static int kvm_radix_test_clear_dirty(struct kvm *kvm,
> > struct kvm_memory_slot *memslot, int pagenum)
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 130bafdb1430..20a81ec9fde8 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5262,6 +5262,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
> > spin_lock(&kvm->mmu_lock);
> > kvm->arch.radix = 0;
> > spin_unlock(&kvm->mmu_lock);
> > + /* see the comments in kvm_test_clear_young_hv() */
>
> I'm guilty of such comments at times, but it wouldn't hurt to say
> /* Finish concurrent kvm_test_clear_young_hv access to page tables */
>
> Then you know where to look for more info and you have a vague
> idea what it's for.

Will do.

> > + synchronize_rcu();
>
> > kvmppc_free_radix(kvm);
> >
> > lpcr = LPCR_VPM1;
> > @@ -5286,6 +5288,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
> > if (err)
> > return err;
> > kvmppc_rmap_reset(kvm);
> > + /* see the comments in kvm_test_clear_young_hv() */
> > + smp_wmb();
> > /* Mutual exclusion with kvm_unmap_gfn_range etc. */
> > spin_lock(&kvm->mmu_lock);
> > kvm->arch.radix = 1;
> > @@ -6185,6 +6189,7 @@ static struct kvmppc_ops kvm_ops_hv = {
> > .unmap_gfn_range = kvm_unmap_gfn_range_hv,
> > .age_gfn = kvm_age_gfn_hv,
> > .test_age_gfn = kvm_test_age_gfn_hv,
> > + .test_clear_young = kvm_test_clear_young_hv,
> > .set_spte_gfn = kvm_set_spte_gfn_hv,
> > .free_memslot = kvmppc_core_free_memslot_hv,
> > .init_vm = kvmppc_core_init_vm_hv,
>
> Thanks for looking at the powerpc conversion!

Thanks for reviewing!

2023-06-21 03:09:36

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v2 07/10] kvm/powerpc: add kvm_arch_test_clear_young()

On Wed Jun 21, 2023 at 10:38 AM AEST, Yu Zhao wrote:
> On Tue, Jun 20, 2023 at 1:48 AM Nicholas Piggin <[email protected]> wrote:
> >
> > On Sat May 27, 2023 at 9:44 AM AEST, Yu Zhao wrote:
> > > Implement kvm_arch_test_clear_young() to support the fast path in
> > > mmu_notifier_ops->test_clear_young().
> > >
> > > It focuses on a simple case, i.e., radix MMU sets the accessed bit in
> > > KVM PTEs and VMs are not nested, where it can rely on RCU and
> > > pte_xchg() to safely clear the accessed bit without taking
> > > kvm->mmu_lock. Complex cases fall back to the existing slow path
> > > where kvm->mmu_lock is then taken.
> > >
> > > Signed-off-by: Yu Zhao <[email protected]>
> > > ---
> > > arch/powerpc/include/asm/kvm_host.h | 8 ++++
> > > arch/powerpc/include/asm/kvm_ppc.h | 1 +
> > > arch/powerpc/kvm/book3s.c | 6 +++
> > > arch/powerpc/kvm/book3s.h | 1 +
> > > arch/powerpc/kvm/book3s_64_mmu_radix.c | 59 ++++++++++++++++++++++++++
> > > arch/powerpc/kvm/book3s_hv.c | 5 +++
> > > 6 files changed, 80 insertions(+)
> > >
> > > diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> > > index 14ee0dece853..75c260ea8a9e 100644
> > > --- a/arch/powerpc/include/asm/kvm_host.h
> > > +++ b/arch/powerpc/include/asm/kvm_host.h
> > > @@ -883,4 +883,12 @@ static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
> > > static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
> > > static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
> > >
> > > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > > +static inline bool kvm_arch_has_test_clear_young(void)
> > > +{
> > > + return IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
> > > + cpu_has_feature(CPU_FTR_HVMODE) && cpu_has_feature(CPU_FTR_ARCH_300) &&
> > > + radix_enabled();
> >
> > This could probably be radix_enabled() && !kvmhv_on_pseries().
>
> Will do. (I used !kvmhv_on_pseries() in v1 but had second thoughts on
> moving kvmhv_on_pseries() into this file.)

That should be okay. kvmhv_on_pseries is a property of the host so it
seems reasonable to move it here if needed.

> > Although unclear why not nested hypervisor... I'd have to think about it a bit
> > more. Do you have any idea what might go wrong, or just didn't have the
> > time to consider it? (Not just powerpc nested but any others).
>
> Yes, this series excludes nested KVM support on all architectures. The
> common reason for such a decision on powerpc and x86 (aarch64 doesn't
> support nested yet) is that it's quite challenging to make the rmap, a
> complex data structure that maps one PFN to multiple GFNs, lockless.
> (See kvmhv_insert_nest_rmap().) It might be possible, however, the
> potential ROI would be in question.

Okay just wondering. rmap (at least the powerpc one) is just a list
I think, with a few details. If that is all it is, it might not be
so hard to make that lock-free or a fine-grained lock on the rmap
chains maybe. But fine to ignore it to start with.

> > > +}
> > > +
> > > #endif /* __POWERPC_KVM_HOST_H__ */
> > > diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> > > index 79a9c0bb8bba..ff1af6a7b44f 100644
> > > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > > @@ -287,6 +287,7 @@ struct kvmppc_ops {
> > > bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > + bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
> > > void (*free_memslot)(struct kvm_memory_slot *slot);
> > > int (*init_vm)(struct kvm *kvm);
> > > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > > index 686d8d9eda3e..37bf40b0c4ff 100644
> > > --- a/arch/powerpc/kvm/book3s.c
> > > +++ b/arch/powerpc/kvm/book3s.c
> > > @@ -899,6 +899,12 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
> > > }
> > >
> > > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > + return !kvm->arch.kvm_ops->test_clear_young ||
> > > + kvm->arch.kvm_ops->test_clear_young(kvm, range);
> > > +}
> > > +
> > > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
> > > diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
> > > index 58391b4b32ed..fa2659e21ccc 100644
> > > --- a/arch/powerpc/kvm/book3s.h
> > > +++ b/arch/powerpc/kvm/book3s.h
> > > @@ -12,6 +12,7 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
> > > extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > > extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > > extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > > +extern bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > > extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
> > >
> > > extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index 3b65b3b11041..0a392e9a100a 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -1088,6 +1088,65 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > > return ref;
> > > }
> > >
> > > +bool kvm_test_clear_young_hv(struct kvm *kvm, struct kvm_gfn_range *range)
> > > +{
> > > + bool err;
> > > + gfn_t gfn = range->start;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + err = !kvm_is_radix(kvm);
> > > + if (err)
> > > + goto unlock;
> > > +
> > > + /*
> > > + * Case 1: This function kvmppc_switch_mmu_to_hpt()
> > > + *
> > > + * rcu_read_lock()
> > > + * Test kvm_is_radix() kvm->arch.radix = 0
> > > + * Use kvm->arch.pgtable synchronize_rcu()
> > > + * rcu_read_unlock()
> > > + * kvmppc_free_radix()
> > > + *
> > > + *
> > > + * Case 2: This function kvmppc_switch_mmu_to_radix()
> > > + *
> > > + * kvmppc_init_vm_radix()
> > > + * smp_wmb()
> > > + * Test kvm_is_radix() kvm->arch.radix = 1
> > > + * smp_rmb()
> > > + * Use kvm->arch.pgtable
> > > + */
> > > + smp_rmb();
> >
> > Comment could stand to expand slightly on what you are solving, in
> > words.
>
> Will do.
>
> > If you use synchronize_rcu() on both sides, you wouldn't need the
> > barrier, right?
>
> Case 2 is about memory ordering, which is orthogonal to case 1 (RCU
> freeing). So we need the r/w barrier regardless.

RCU can take care of memory ordering too though. If you had
synchronize_rcu() where smp_wmb() is, then no smp_rmb() is needed here.

>
> > > + while (gfn < range->end) {
> > > + pte_t *ptep;
> > > + pte_t old, new;
> > > + unsigned int shift;
> > > +
> > > + ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
> > > + if (!ptep)
> > > + goto next;
> > > +
> > > + VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
> >
> > Not really appropriate at the KVM level. mm enforces this kind of
> > thing (with notifiers).
>
> Will remove this.
>
> > > +
> > > + old = READ_ONCE(*ptep);
> > > + if (!pte_present(old) || !pte_young(old))
> > > + goto next;
> > > +
> > > + new = pte_mkold(old);
> > > +
> > > + if (kvm_should_clear_young(range, gfn))
> > > + pte_xchg(ptep, old, new);
> >
> > *Probably* will work. I can't think of a reason why not at the
> > moment anyway :)
>
> My reasoning:
> * It should work if we only change the dedicated A bit, i.e., not
> shared for other purposes, because races are safe (the case here).
> * It may not work for x86 EPT without the A bit (excluded in this
> series) where accesses are trapped by the R/X bits, because races in
> changing the R/X bits can be unsafe.

(For the benefit of others reading, it works because powerpc's pte_xchg
is actually a cmpxchg, for some reason which we really should fix).
Although it can fail to clear the bit if the cmpxchg fails.

I think pte_xchg is only used with the hash MMU in Linux before this
change. I think we may want to keep it that way and use something
like kvmppc_radix_update_pte() here to clear out the bit.
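
For concreteness, something along these lines (the helper's signature is
from memory, so treat this as a sketch):

	/* atomically clear only the accessed bit */
	kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
				gfn * PAGE_SIZE, shift);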
But don't worry too much about fine details before sorting out the
core changes; I will have a better look after that.

Thanks,
Nick