2023-02-17 04:12:41

by Yu Zhao

Subject: [PATCH mm-unstable v1 0/5] mm/kvm: lockless accessed bit harvest

TLDR
====
This patchset RCU-protects KVM page tables and compare-and-exchanges
KVM PTEs with the accessed bit set by hardware. It significantly
improves the performance of guests when the host is under heavy
memory pressure.

ChromeOS has been using a similar approach [1] since mid-2021, and it
has proven successful on tens of millions of devices.

[1] https://crrev.com/c/2987928

Overview
========
The goal of this patchset is to optimize the performance of guests
when the host memory is overcommitted. It focuses on the vast
majority of VMs that are not nested and run on hardware that sets the
accessed bit in KVM page tables.

Note that nested VMs and hardware that does not support the accessed
bit are both out of scope.

This patchset relies on two techniques, RCU and cmpxchg, to safely
test and clear the accessed bit without taking kvm->mmu_lock. The
former protects KVM page tables from being freed while the latter
clears the accessed bit atomically against both hardware and other
software page table walkers.
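
To make this concrete, the core of the lockless harvest looks roughly
like the sketch below. This is an illustration only, not code from this
series: for_each_leaf_pte() and ACCESSED_MASK stand in for the
arch-specific walker and accessed-bit mask.

  rcu_read_lock();    /* keeps KVM page tables from being freed */

  for_each_leaf_pte(ptep, start, end) {     /* placeholder walker */
          u64 old = READ_ONCE(*ptep);

          if (!(old & ACCESSED_MASK))       /* placeholder mask */
                  continue;
          /*
           * cmpxchg is atomic against both the hardware setting the
           * accessed bit and other software walkers; if it loses the
           * race, move on rather than retry.
           */
          cmpxchg64(ptep, old, old & ~ACCESSED_MASK);
  }

  rcu_read_unlock();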

A new MMU notifier API, mmu_notifier_test_clear_young(), is
introduced. It follows two design patterns: fallback and batching.
For any unsupported cases, it can optionally fall back to
mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can test
or test and clear their accessed bits according to a bitmap provided
by the caller.
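
For illustration, a caller might use the API as sketched below. The
signature matches patch 1/5; mm, addr and start are assumptions made up
for this example.

  /* fallback mode: test and clear a single page, like clear_young() */
  unsigned long one = 1;        /* 1-bit bitmap; 1 => also clear */
  int young = mmu_notifier_test_clear_young(mm, addr, addr + PAGE_SIZE,
                                            true, &one);
  if (!one)     /* bit flipped => the KVM PTE was young */
          young = 1;

  /* batched mode: one bit per page, no fallback */
  unsigned long bitmap[BITS_TO_LONGS(64)];

  bitmap_zero(bitmap, 64);      /* 0 => test only; set to 1 to also clear */
  mmu_notifier_test_clear_young(mm, start, start + 64 * PAGE_SIZE,
                                false, bitmap);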

This patchset only applies mmu_notifier_test_clear_young() to MGLRU.
A follow-up patchset will apply it to /proc/PID/pagemap and
/proc/PID/clear_refs.

Evaluation
==========
An existing selftest can quickly demonstrate the effectiveness of
this patchset. On a generic workstation equipped with 64 CPUs and
256GB DRAM:

$ sudo max_guest_memory_test -c 64 -m 256 -s 256

MGLRU run2
---------------
Before ~600s
After ~50s
Off ~250s

kswapd (MGLRU before)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.97% try_to_shrink_lruvec
99.06% evict_folios
97.41% shrink_folio_list
31.33% folio_referenced
31.06% rmap_walk_file
30.89% folio_referenced_one
20.83% __mmu_notifier_clear_flush_young
20.54% kvm_mmu_notifier_clear_flush_young
=> 19.34% _raw_write_lock

kswapd (MGLRU after)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.97% try_to_shrink_lruvec
99.51% evict_folios
71.70% shrink_folio_list
7.08% folio_referenced
6.78% rmap_walk_file
6.72% folio_referenced_one
5.60% lru_gen_look_around
=> 1.53% __mmu_notifier_test_clear_young

kswapd (MGLRU off)
100.00% balance_pgdat
100.00% shrink_node
99.92% shrink_lruvec
69.95% shrink_folio_list
19.35% folio_referenced
18.37% rmap_walk_file
17.88% folio_referenced_one
13.20% __mmu_notifier_clear_flush_young
11.64% kvm_mmu_notifier_clear_flush_young
=> 9.93% _raw_write_lock
26.23% shrink_active_list
25.50% folio_referenced
25.35% rmap_walk_file
25.28% folio_referenced_one
23.87% __mmu_notifier_clear_flush_young
23.69% kvm_mmu_notifier_clear_flush_young
=> 18.98% _raw_write_lock

Comprehensive benchmarks are coming soon.

Yu Zhao (5):
mm/kvm: add mmu_notifier_test_clear_young()
kvm/x86: add kvm_arch_test_clear_young()
kvm/arm64: add kvm_arch_test_clear_young()
kvm/powerpc: add kvm_arch_test_clear_young()
mm: multi-gen LRU: use mmu_notifier_test_clear_young()

arch/arm64/include/asm/kvm_host.h | 7 ++
arch/arm64/include/asm/kvm_pgtable.h | 8 ++
arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 51 ++--------
arch/arm64/kvm/mmu.c | 77 +++++++++++++-
arch/powerpc/include/asm/kvm_host.h | 18 ++++
arch/powerpc/include/asm/kvm_ppc.h | 14 +--
arch/powerpc/kvm/book3s.c | 7 ++
arch/powerpc/kvm/book3s.h | 2 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 ++++++++++++++-
arch/powerpc/kvm/book3s_hv.c | 10 +-
arch/x86/include/asm/kvm_host.h | 27 +++++
arch/x86/kvm/mmu/spte.h | 12 ---
arch/x86/kvm/mmu/tdp_mmu.c | 41 ++++++++
include/linux/kvm_host.h | 29 ++++++
include/linux/mmu_notifier.h | 40 ++++++++
include/linux/mmzone.h | 6 +-
mm/mmu_notifier.c | 26 +++++
mm/rmap.c | 8 +-
mm/vmscan.c | 127 +++++++++++++++++++++---
virt/kvm/kvm_main.c | 58 +++++++++++
22 files changed, 593 insertions(+), 97 deletions(-)

--
2.39.2.637.g21b0678d19-goog



2023-02-17 04:12:45

by Yu Zhao

Subject: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

mmu_notifier_test_clear_young() allows the caller to safely test and
clear the accessed bit in KVM PTEs without taking the MMU lock.

This patch adds the generic infrastructure to invoke the subsequent
arch-specific patches. The arch-specific implementations generally
rely on two techniques: RCU and cmpxchg. The former protects KVM page
tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

mmu_notifier_test_clear_young() follows two design patterns: fallback
and batching. For any unsupported cases, it can optionally fall back
to mmu_notifier_ops->clear_young(). For a range of KVM PTEs, it can
test or test and clear their accessed bits according to a bitmap
provided by the caller.

mmu_notifier_test_clear_young() always returns 0 if fallback is not
allowed. If fallback happens, its return value is similar to that of
mmu_notifier_clear_young().

The bitmap parameter has the following specifications:
1. The number of bits should be at least (end-start)/PAGE_SIZE.
2. The offset of each bit is relative to the end. E.g., the offset
corresponding to addr is (end-addr)/PAGE_SIZE-1. This layout better
suits batching while looping forward over the range.
3. For each KVM PTE with the accessed bit set (young), the arch-specific
implementation flips the corresponding bit in the bitmap, and it only
clears the accessed bit if the caller set that bit to 1. A caller can
therefore test, or test and clear, the accessed bit by setting the
corresponding bit in the bitmap to 0 or 1; for a young KVM PTE, the new
value will be 1 or 0 respectively.
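
As a concrete example of the offset convention (an illustration only;
mm, start, end, addr and the bitmap are assumed to be set up by the
caller as described above):

  /* bit index of addr within a bitmap covering [start, end) */
  unsigned long idx = (end - addr) / PAGE_SIZE - 1;

  __set_bit(idx, bitmap);       /* 1: request test and clear for addr */
  mmu_notifier_test_clear_young(mm, start, end, false, bitmap);

  /* flipped back to 0 => the KVM PTE mapping addr was young */
  bool young = !test_bit(idx, bitmap);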

Signed-off-by: Yu Zhao <[email protected]>
---
include/linux/kvm_host.h | 29 ++++++++++++++++++
include/linux/mmu_notifier.h | 40 +++++++++++++++++++++++++
mm/mmu_notifier.c | 26 ++++++++++++++++
virt/kvm/kvm_main.c | 58 ++++++++++++++++++++++++++++++++++++
4 files changed, 153 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..df46fc815c8b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2281,4 +2281,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536

+/*
+ * Architectures that implement kvm_arch_test_clear_young() should override
+ * kvm_arch_has_test_clear_young().
+ *
+ * kvm_arch_has_test_clear_young() is allowed to return false positive. It can
+ * return true if kvm_arch_test_clear_young() is supported but disabled due to
+ * some runtime constraint. In this case, kvm_arch_test_clear_young() should
+ * return false.
+ *
+ * The last parameter to kvm_arch_test_clear_young() is a bitmap with the
+ * following specifications:
+ * 1. The offset of each bit is relative to the second to the last parameter
+ * lsb_gfn. E.g., the offset corresponding to gfn is lsb_gfn-gfn. This is to
+ * better suit batching while forward looping.
+ * 2. For each KVM PTE with the accessed bit set, the implementation should flip
+ * the corresponding bit in the bitmap. It should only clear the accessed bit
+ * if the old value is 1. This allows the caller to test or test and clear
+ * the accessed bit.
+ */
+#ifndef kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return false;
+}
+#endif
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap);
+
#endif
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..432b51cd6843 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,6 +122,11 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ /* see the comments on mmu_notifier_test_clear_young() */
+ bool (*test_clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap);
+
/*
* change_pte is called in cases that pte mapping to page is changed:
* for example, when ksm remaps pte to point to a new shared page.
@@ -390,6 +395,9 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
extern int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
+extern int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool fallback, unsigned long *bitmap);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -432,6 +440,31 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
return 0;
}

+/*
+ * This function always returns 0 if fallback is not allowed. If fallback
+ * happens, its return value is similar to that of mmu_notifier_clear_young().
+ *
+ * The bitmap has the following specifications:
+ * 1. The number of bits should be at least (end-start)/PAGE_SIZE.
+ * 2. The offset of each bit is relative to the end. E.g., the offset
+ * corresponding to addr is (end-addr)/PAGE_SIZE-1. This is to better suit
+ * batching while forward looping.
+ * 3. For each KVM PTE with the accessed bit set (young), this function flips
+ * the corresponding bit in the bitmap. It only clears the accessed bit if
+ * the old value is 1. A caller can test or test and clear the accessed bit
+ * by setting the corresponding bit in the bitmap to 0 or 1, and the new
+ * value will be 1 or 0 for a young KVM PTE.
+ */
+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool fallback, unsigned long *bitmap)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_test_clear_young(mm, start, end, fallback, bitmap);
+
+ return 0;
+}
+
static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
@@ -684,6 +717,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool fallback, unsigned long *bitmap)
+{
+ return 0;
+}
+
static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..dd39b9b4d6d3 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -402,6 +402,32 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
return young;
}

+/* see the comments on mmu_notifier_test_clear_young() */
+int __mmu_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool fallback, unsigned long *bitmap)
+{
+ int key;
+ struct mmu_notifier *mn;
+ int young = 0;
+
+ key = srcu_read_lock(&srcu);
+
+ hlist_for_each_entry_srcu(mn, &mm->notifier_subscriptions->list,
+ hlist, srcu_read_lock_held(&srcu)) {
+ if (mn->ops->test_clear_young &&
+ mn->ops->test_clear_young(mn, mm, start, end, bitmap))
+ continue;
+
+ if (fallback && mn->ops->clear_young)
+ young |= mn->ops->clear_young(mn, mm, start, end);
+ }
+
+ srcu_read_unlock(&srcu, key);
+
+ return young;
+}
+
int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..1b465df4a93d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
}

+static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
+ unsigned long end, unsigned long *bitmap)
+{
+ int i;
+ int key;
+ bool success = true;
+
+ trace_kvm_age_hva(start, end);
+
+ key = srcu_read_lock(&kvm->srcu);
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ struct interval_tree_node *node;
+ struct kvm_memslots *slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_hva_range(node, slots, start, end - 1) {
+ gfn_t lsb_gfn;
+ unsigned long hva_start, hva_end;
+ struct kvm_gfn_range range = {
+ .slot = container_of(node, struct kvm_memory_slot,
+ hva_node[slots->node_idx]),
+ };
+
+ hva_start = max(start, range.slot->userspace_addr);
+ hva_end = min(end - 1, range.slot->userspace_addr +
+ range.slot->npages * PAGE_SIZE - 1);
+
+ range.start = hva_to_gfn_memslot(hva_start, range.slot);
+ range.end = hva_to_gfn_memslot(hva_end, range.slot) + 1;
+
+ if (WARN_ON_ONCE(range.end <= range.start))
+ continue;
+
+ /* see the comments on the generic kvm_arch_has_test_clear_young() */
+ lsb_gfn = hva_to_gfn_memslot(end - 1, range.slot);
+
+ success = kvm_arch_test_clear_young(kvm, &range, lsb_gfn, bitmap);
+ if (!success)
+ break;
+ }
+ }
+
+ srcu_read_unlock(&kvm->srcu, key);
+
+ return success;
+}
+
+static bool kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap)
+{
+ if (kvm_arch_has_test_clear_young())
+ return kvm_test_clear_young(mmu_notifier_to_kvm(mn), start, end, bitmap);
+
+ return false;
+}
+
static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address)
@@ -903,6 +960,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
.clear_young = kvm_mmu_notifier_clear_young,
.test_young = kvm_mmu_notifier_test_young,
+ .test_clear_young = kvm_mmu_notifier_test_clear_young,
.change_pte = kvm_mmu_notifier_change_pte,
.release = kvm_mmu_notifier_release,
};
--
2.39.2.637.g21b0678d19-goog


2023-02-17 04:12:52

by Yu Zhao

Subject: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware that sets the accessed bit
in TDP MMU page tables.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 27 ++++++++++++++++++++++
arch/x86/kvm/mmu/spte.h | 12 ----------
arch/x86/kvm/mmu/tdp_mmu.c | 41 +++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6aaae18f1854..d2995c9e8f07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1367,6 +1367,12 @@ struct kvm_arch {
* the MMU lock in read mode + the tdp_mmu_pages_lock or
* the MMU lock in write mode
*
+ * kvm_arch_test_clear_young() is a special case. It relies on two
+ * techniques, RCU and cmpxchg, to safely test and clear the accessed
+ * bit without taking the MMU lock. The former protects KVM page tables
+ * from being freed while the latter clears the accessed bit atomically
+ * against both the hardware and other software page table walkers.
+ *
* Roots will remain in the list until their tdp_mmu_root_count
* drops to zero, at which point the thread that decremented the
* count to zero should removed the root from the list and clean
@@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)

+extern u64 __read_mostly shadow_accessed_mask;
+
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+ return shadow_accessed_mask;
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
+ (!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && tdp_enabled));
+}
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 6f54dc9409c9..0dc7fed1f3fd 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
@@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
return !!(pte & SPTE_MMU_PRESENT_MASK);
}

-/*
- * Returns true if A/D bits are supported in hardware and are enabled by KVM.
- * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
- * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
- * scenario where KVM is using A/D bits for L1, but not L2.
- */
-static inline bool kvm_ad_enabled(void)
-{
- return !!shadow_accessed_mask;
-}
-
static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
{
return sp->role.ad_disabled;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d6df38d371a0..9028e09f1aab 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1309,6 +1309,47 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
}

+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap)
+{
+ struct kvm_mmu_page *root;
+
+ if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+ return false;
+
+ if (kvm_memslots_have_rmaps(kvm))
+ return false;
+
+ /* see the comments on kvm_arch->tdp_mmu_roots */
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
+ struct tdp_iter iter;
+
+ if (kvm_mmu_page_as_id(root) != range->slot->as_id)
+ continue;
+
+ tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+ u64 *sptep = rcu_dereference(iter.sptep);
+ u64 new_spte = iter.old_spte & ~shadow_accessed_mask;
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
+ VM_WARN_ON_ONCE(iter.gfn < range->start || iter.gfn >= range->end);
+
+ if (new_spte == iter.old_spte)
+ continue;
+
+ /* see the comments on the generic kvm_arch_has_test_clear_young() */
+ if (__test_and_change_bit(lsb_gfn - iter.gfn, bitmap))
+ cmpxchg64(sptep, iter.old_spte, new_spte);
+ }
+ }
+
+ rcu_read_unlock();
+
+ return true;
+}
+
static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
--
2.39.2.637.g21b0678d19-goog


2023-02-17 04:12:57

by Yu Zhao

Subject: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not pKVM and run on hardware that sets the accessed bit
in KVM page tables.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 7 +++
arch/arm64/include/asm/kvm_pgtable.h | 8 +++
arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
6 files changed, 141 insertions(+), 46 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..572bcd321586 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);

+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
+}
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 63f81b27a4e3..8c9a04388c88 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
* @put_page: Decrement the refcount on a page. When the
* refcount reaches 0 the page is automatically
* freed.
+ * @put_page_rcu: RCU variant of put_page().
* @page_count: Return the refcount of a page.
* @phys_to_virt: Convert a physical address into a virtual
* address mapped in the current context.
@@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
void (*free_removed_table)(void *addr, u32 level);
void (*get_page)(void *addr);
void (*put_page)(void *addr);
+ void (*put_page_rcu)(void *addr);
int (*page_count)(void *addr);
void* (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
@@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
* children.
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
* with other software walkers.
+ *
+ * kvm_arch_test_clear_young() is a special case. It relies on two
+ * techniques, RCU and cmpxchg, to safely test and clear the accessed
+ * bit without taking the MMU lock. The former protects KVM page tables
+ * from being freed while the latter clears the accessed bit atomically
+ * against both the hardware and other software page table walkers.
*/
enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF = BIT(0),
diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
index c8dca8ae359c..350437661d4b 100644
--- a/arch/arm64/include/asm/stage2_pgtable.h
+++ b/arch/arm64/include/asm/stage2_pgtable.h
@@ -30,4 +30,47 @@
*/
#define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)

+#define KVM_PTE_TYPE BIT(1)
+#define KVM_PTE_TYPE_BLOCK 0
+#define KVM_PTE_TYPE_PAGE 1
+#define KVM_PTE_TYPE_TABLE 1
+
+#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+ KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+ KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID 1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED BIT(10)
+
#endif /* __ARM64_S2_PGTABLE_H_ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..6770bc47f5c9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
*/
void kvm_arch_destroy_vm(struct kvm *kvm)
{
+ kvm_free_stage2_pgd(&kvm->arch.mmu);
bitmap_free(kvm->arch.pmu_filter);
free_cpumask_var(kvm->arch.supported_cpus);

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6..8d65ee4767f1 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -12,49 +12,6 @@
#include <asm/stage2_pgtable.h>


-#define KVM_PTE_TYPE BIT(1)
-#define KVM_PTE_TYPE_BLOCK 0
-#define KVM_PTE_TYPE_PAGE 1
-#define KVM_PTE_TYPE_TABLE 1
-
-#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
- KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
- KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID 1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED BIT(10)
-
struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker *walker;

@@ -994,8 +951,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
kvm_granule_size(ctx->level));

- if (childp)
- mm_ops->put_page(childp);
+ if (childp) {
+ if (mm_ops->put_page_rcu)
+ mm_ops->put_page_rcu(childp);
+ else
+ mm_ops->put_page(childp);
+ }

return 0;
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a3ee3b605c9b..761fffc788f5 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -171,6 +171,21 @@ static int kvm_host_page_count(void *addr)
return page_count(virt_to_page(addr));
}

+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+ put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+ struct page *page = virt_to_page(addr);
+
+ if (kvm_host_page_count(addr) == 1)
+ kvm_account_pgtable_pages(addr, -1);
+
+ call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
static phys_addr_t kvm_host_pa(void *addr)
{
return __pa(addr);
@@ -684,6 +699,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
.free_removed_table = stage2_free_removed_table,
.get_page = kvm_host_get_page,
.put_page = kvm_s2_put_page,
+ .put_page_rcu = kvm_s2_put_page_rcu,
.page_count = kvm_host_page_count,
.phys_to_virt = kvm_host_va,
.virt_to_phys = kvm_host_pa,
@@ -1624,6 +1640,66 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return pte_valid(pte) && pte_young(pte);
}

+struct test_clear_young_arg {
+ struct kvm_gfn_range *range;
+ gfn_t lsb_gfn;
+ unsigned long *bitmap;
+};
+
+static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags flags)
+{
+ struct test_clear_young_arg *arg = ctx->arg;
+ gfn_t gfn = ctx->addr / PAGE_SIZE;
+ kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
+ VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);
+
+ if (!kvm_pte_valid(new))
+ return 0;
+
+ if (new == ctx->old)
+ return 0;
+
+ /* see the comments on the generic kvm_arch_has_test_clear_young() */
+ if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
+ cmpxchg64(ctx->ptep, ctx->old, new);
+
+ return 0;
+}
+
+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap)
+{
+ u64 start = range->start * PAGE_SIZE;
+ u64 end = range->end * PAGE_SIZE;
+ struct test_clear_young_arg arg = {
+ .range = range,
+ .lsb_gfn = lsb_gfn,
+ .bitmap = bitmap,
+ };
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_test_clear_young,
+ .arg = &arg,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ BUILD_BUG_ON(is_hyp_code());
+
+ if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+ return false;
+
+ /* see the comments on kvm_pgtable_walk_flags */
+ rcu_read_lock();
+
+ kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
+
+ rcu_read_unlock();
+
+ return true;
+}
+
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
if (!kvm->arch.mmu.pgt)
@@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_free_stage2_pgd(&kvm->arch.mmu);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
2.39.2.637.g21b0678d19-goog


2023-02-17 04:13:08

by Yu Zhao

Subject: [PATCH mm-unstable v1 4/5] kvm/powerpc: add kvm_arch_test_clear_young()

This patch adds kvm_arch_test_clear_young() for the vast majority of
VMs that are not nested and run on hardware with Radix MMU enabled.

It relies on two techniques, RCU and cmpxchg, to safely test and clear
the accessed bit without taking the MMU lock. The former protects KVM
page tables from being freed while the latter clears the accessed bit
atomically against both the hardware and other software page table
walkers.

Signed-off-by: Yu Zhao <[email protected]>
---
arch/powerpc/include/asm/kvm_host.h | 18 ++++++
arch/powerpc/include/asm/kvm_ppc.h | 14 +----
arch/powerpc/kvm/book3s.c | 7 +++
arch/powerpc/kvm/book3s.h | 2 +
arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 +++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_hv.c | 10 ++--
6 files changed, 110 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index caea15dcb91d..996850029ce0 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -886,4 +886,22 @@ static inline void kvm_arch_exit(void) {}
static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

+static inline int kvmppc_radix_possible(void)
+{
+ return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
+}
+
+static inline bool kvmhv_on_pseries(void)
+{
+ return IS_ENABLED(CONFIG_PPC_PSERIES) && !cpu_has_feature(CPU_FTR_HVMODE);
+}
+
+/* see the comments on the generic kvm_arch_has_test_clear_young() */
+#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
+static inline bool kvm_arch_has_test_clear_young(void)
+{
+ return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_KVM_BOOK3S_HV_POSSIBLE) &&
+ kvmppc_radix_possible() && !kvmhv_on_pseries();
+}
+
#endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eae9619b6190..0bb772fc12b1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -277,6 +277,8 @@ struct kvmppc_ops {
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
+ bool (*test_clear_young)(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
void (*free_memslot)(struct kvm_memory_slot *slot);
int (*init_vm)(struct kvm *kvm);
@@ -580,18 +582,6 @@ static inline bool kvm_hv_mode_active(void) { return false; }

#endif

-#ifdef CONFIG_PPC_PSERIES
-static inline bool kvmhv_on_pseries(void)
-{
- return !cpu_has_feature(CPU_FTR_HVMODE);
-}
-#else
-static inline bool kvmhv_on_pseries(void)
-{
- return false;
-}
-#endif
-
#ifdef CONFIG_KVM_XICS
static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
{
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 6d525285dbe8..f4cf330e3e81 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -877,6 +877,13 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
}

+bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap)
+{
+ return kvm->arch.kvm_ops->test_clear_young &&
+ kvm->arch.kvm_ops->test_clear_young(kvm, range, lsb_gfn, bitmap);
+}
+
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm->arch.kvm_ops->set_spte_gfn(kvm, range);
diff --git a/arch/powerpc/kvm/book3s.h b/arch/powerpc/kvm/book3s.h
index 58391b4b32ed..fe9cac423817 100644
--- a/arch/powerpc/kvm/book3s.h
+++ b/arch/powerpc/kvm/book3s.h
@@ -12,6 +12,8 @@ extern void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
extern bool kvm_unmap_gfn_range_hv(struct kvm *kvm, struct kvm_gfn_range *range);
extern bool kvm_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
extern bool kvm_test_age_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);
+extern bool kvmhv_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap);
extern bool kvm_set_spte_gfn_hv(struct kvm *kvm, struct kvm_gfn_range *range);

extern int kvmppc_mmu_init_pr(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 9d3743ca16d5..8476646c554c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1083,6 +1083,78 @@ bool kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
return ref;
}

+bool kvmhv_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
+ gfn_t lsb_gfn, unsigned long *bitmap)
+{
+ bool success;
+ gfn_t gfn = range->start;
+
+ if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
+ return false;
+
+ /*
+ * This function relies on two techniques, RCU and cmpxchg, to safely
+ * test and clear the accessed bit without taking the MMU lock. The
+ * former protects KVM page tables from being freed while the latter
+ * clears the accessed bit atomically against both the hardware and
+ * other software page table walkers.
+ */
+ rcu_read_lock();
+
+ success = kvm_is_radix(kvm);
+ if (!success)
+ goto unlock;
+
+ /*
+ * case 1: this function kvmppc_switch_mmu_to_hpt()
+ *
+ * rcu_read_lock()
+ * test kvm_is_radix() kvm->arch.radix = 0
+ * use kvm->arch.pgtable
+ * rcu_read_unlock()
+ * synchronize_rcu()
+ * kvmppc_free_radix()
+ *
+ *
+ * case 2: this function kvmppc_switch_mmu_to_radix()
+ *
+ * kvmppc_init_vm_radix()
+ * smp_wmb()
+ * test kvm_is_radix() kvm->arch.radix = 1
+ * smp_rmb()
+ * use kvm->arch.pgtable
+ */
+ smp_rmb();
+
+ while (gfn < range->end) {
+ pte_t *ptep;
+ pte_t old, new;
+ unsigned int shift;
+
+ ptep = find_kvm_secondary_pte_unlocked(kvm, gfn * PAGE_SIZE, &shift);
+ if (!ptep)
+ goto next;
+
+ VM_WARN_ON_ONCE(!page_count(virt_to_page(ptep)));
+
+ old = READ_ONCE(*ptep);
+ if (!pte_present(old) || !pte_young(old))
+ goto next;
+
+ new = pte_mkold(old);
+
+ /* see the comments on the generic kvm_arch_has_test_clear_young() */
+ if (__test_and_change_bit(lsb_gfn - gfn, bitmap))
+ pte_xchg(ptep, old, new);
+next:
+ gfn += shift ? BIT(shift - PAGE_SHIFT) : 1;
+ }
+unlock:
+ rcu_read_unlock();
+
+ return success;
+}
+
/* Returns the number of PAGE_SIZE pages that are dirty */
static int kvm_radix_test_clear_dirty(struct kvm *kvm,
struct kvm_memory_slot *memslot, int pagenum)
@@ -1464,13 +1536,15 @@ int kvmppc_radix_init(void)
{
unsigned long size = sizeof(void *) << RADIX_PTE_INDEX_SIZE;

- kvm_pte_cache = kmem_cache_create("kvm-pte", size, size, 0, pte_ctor);
+ kvm_pte_cache = kmem_cache_create("kvm-pte", size, size,
+ SLAB_TYPESAFE_BY_RCU, pte_ctor);
if (!kvm_pte_cache)
return -ENOMEM;

size = sizeof(void *) << RADIX_PMD_INDEX_SIZE;

- kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size, 0, pmd_ctor);
+ kvm_pmd_cache = kmem_cache_create("kvm-pmd", size, size,
+ SLAB_TYPESAFE_BY_RCU, pmd_ctor);
if (!kvm_pmd_cache) {
kmem_cache_destroy(kvm_pte_cache);
return -ENOMEM;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6ba68dd6190b..17b415661282 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5242,6 +5242,8 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
spin_lock(&kvm->mmu_lock);
kvm->arch.radix = 0;
spin_unlock(&kvm->mmu_lock);
+ /* see the comments in kvmhv_test_clear_young() */
+ synchronize_rcu();
kvmppc_free_radix(kvm);

lpcr = LPCR_VPM1;
@@ -5266,6 +5268,8 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
if (err)
return err;
kvmppc_rmap_reset(kvm);
+ /* see the comments in kvmhv_test_clear_young() */
+ smp_wmb();
/* Mutual exclusion with kvm_unmap_gfn_range etc. */
spin_lock(&kvm->mmu_lock);
kvm->arch.radix = 1;
@@ -6165,6 +6169,7 @@ static struct kvmppc_ops kvm_ops_hv = {
.unmap_gfn_range = kvm_unmap_gfn_range_hv,
.age_gfn = kvm_age_gfn_hv,
.test_age_gfn = kvm_test_age_gfn_hv,
+ .test_clear_young = kvmhv_test_clear_young,
.set_spte_gfn = kvm_set_spte_gfn_hv,
.free_memslot = kvmppc_core_free_memslot_hv,
.init_vm = kvmppc_core_init_vm_hv,
@@ -6225,11 +6230,6 @@ static int kvm_init_subcore_bitmap(void)
return 0;
}

-static int kvmppc_radix_possible(void)
-{
- return cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled();
-}
-
static int kvmppc_book3s_init_hv(void)
{
int r;
--
2.39.2.637.g21b0678d19-goog


2023-02-17 04:13:16

by Yu Zhao

Subject: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

An existing selftest can quickly demonstrate the effectiveness of this
patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:

$ sudo max_guest_memory_test -c 64 -m 250 -s 250

MGLRU run2
---------------
Before ~600s
After ~50s
Off ~250s

kswapd (MGLRU before)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.97% try_to_shrink_lruvec
99.06% evict_folios
97.41% shrink_folio_list
31.33% folio_referenced
31.06% rmap_walk_file
30.89% folio_referenced_one
20.83% __mmu_notifier_clear_flush_young
20.54% kvm_mmu_notifier_clear_flush_young
=> 19.34% _raw_write_lock

kswapd (MGLRU after)
100.00% balance_pgdat
100.00% shrink_node
100.00% shrink_one
99.97% try_to_shrink_lruvec
99.51% evict_folios
71.70% shrink_folio_list
7.08% folio_referenced
6.78% rmap_walk_file
6.72% folio_referenced_one
5.60% lru_gen_look_around
=> 1.53% __mmu_notifier_test_clear_young

kswapd (MGLRU off)
100.00% balance_pgdat
100.00% shrink_node
99.92% shrink_lruvec
69.95% shrink_folio_list
19.35% folio_referenced
18.37% rmap_walk_file
17.88% folio_referenced_one
13.20% __mmu_notifier_clear_flush_young
11.64% kvm_mmu_notifier_clear_flush_young
=> 9.93% _raw_write_lock
26.23% shrink_active_list
25.50% folio_referenced
25.35% rmap_walk_file
25.28% folio_referenced_one
23.87% __mmu_notifier_clear_flush_young
23.69% kvm_mmu_notifier_clear_flush_young
=> 18.98% _raw_write_lock

Signed-off-by: Yu Zhao <[email protected]>
---
include/linux/mmzone.h | 6 +-
mm/rmap.c | 8 +--
mm/vmscan.c | 127 ++++++++++++++++++++++++++++++++++++-----
3 files changed, 121 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9fb1b03b83b2..ce34b7ea8e4c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -379,6 +379,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_SPTE_WALK,
NR_LRU_GEN_CAPS
};

@@ -485,7 +486,7 @@ struct lru_gen_mm_walk {
};

void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

#ifdef CONFIG_MEMCG

@@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

#ifdef CONFIG_MEMCG
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..eb0089f8f112 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -823,12 +823,10 @@ static bool folio_referenced_one(struct folio *folio,
return false; /* To break the loop */
}

- if (pvmw.pte) {
- if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..d6d69f0baabf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,8 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>
+#include <linux/kvm_host.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3923,6 +3925,55 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
return folio;
}

+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned long end,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (!kvm_arch_has_test_clear_young() || !get_cap(LRU_GEN_SPTE_WALK))
+ return false;
+
+ if (*last > addr)
+ goto done;
+
+ *last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+ addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+ bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+ mmu_notifier_test_clear_young(mm, addr, *last + 1, false, bitmap);
+done:
+ return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ int i;
+ unsigned long start, end = *last + 1;
+
+ if (addr + PAGE_SIZE != end)
+ return;
+
+ i = find_last_bit(bitmap, MIN_LRU_BATCH);
+ if (i == MIN_LRU_BATCH)
+ return;
+
+ start = end - (i + 1) * PAGE_SIZE;
+
+ i = find_first_bit(bitmap, MIN_LRU_BATCH);
+
+ end -= i * PAGE_SIZE;
+
+ mmu_notifier_test_clear_young(mm, start, end, false, bitmap);
+}
+
+static void skip_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (*last > addr)
+ __clear_bit((*last - addr) / PAGE_SIZE, bitmap);
+
+ clear_spte_young(mm, addr, bitmap, last);
+}
+
static bool suitable_to_scan(int total, int young)
{
int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
@@ -3938,6 +3989,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+ unsigned long last = 0;
int total = 0;
int young = 0;
struct lru_gen_mm_walk *walk = args->private;
@@ -3956,6 +4009,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pte = pte_offset_map(pmd, start & PMD_MASK);
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool success;
unsigned long pfn;
struct folio *folio;

@@ -3963,20 +4017,27 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;

pfn = get_pte_pfn(pte[i], args->vma, addr);
- if (pfn == -1)
+ if (pfn == -1) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(pte[i])) {
+ success = test_spte_young(args->vma->vm_mm, addr, end, bitmap, &last);
+ if (!success && !pte_young(pte[i])) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}

folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(args->vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(pte[i]))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -4581,6 +4642,24 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ unsigned long old = true;
+
+ *young = mmu_notifier_test_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE, true, &old);
+ if (!old)
+ *young = true;
+
+ if (pte_young(*pte)) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return true;
+ }
+
+ return !old && get_cap(LRU_GEN_SPTE_WALK);
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4588,12 +4667,14 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
unsigned long end;
struct lru_gen_mm_walk *walk;
+ unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)];
+ unsigned long last = 0;
int young = 0;
pte_t *pte = pvmw->pte;
unsigned long addr = pvmw->address;
@@ -4607,8 +4688,11 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(pvmw->vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4616,6 +4700,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, pvmw->vma->vm_start);
end = min(addr | ~PMD_MASK, pvmw->vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4629,28 +4716,37 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

pte -= (addr - start) / PAGE_SIZE;

for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool success;
unsigned long pfn;

pfn = get_pte_pfn(pte[i], pvmw->vma, addr);
- if (pfn == -1)
+ if (pfn == -1) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(pte[i]))
+ success = test_spte_young(pvmw->vma->vm_mm, addr, end, bitmap, &last);
+ if (!success && !pte_young(pte[i])) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(pte[i]))
+ ptep_test_and_clear_young(pvmw->vma, addr, pte + i);

young++;

@@ -4680,6 +4776,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (suitable_to_scan(i, young))
update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
+ caps |= BIT(LRU_GEN_SPTE_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.39.2.637.g21b0678d19-goog


2023-02-17 04:20:51

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not nested and run on hardware that sets the accessed bit
> in TDP MMU page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 27 ++++++++++++++++++++++
> arch/x86/kvm/mmu/spte.h | 12 ----------
> arch/x86/kvm/mmu/tdp_mmu.c | 41 +++++++++++++++++++++++++++++++++
> 3 files changed, 68 insertions(+), 12 deletions(-)

Adding Sean and David.

Can you please add other interested parties that I've missed?

Thanks.

2023-02-17 04:22:18

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not pKVM and run on hardware that sets the accessed bit
> in KVM page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 7 +++
> arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> arch/arm64/kvm/arm.c | 1 +
> arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> 6 files changed, 141 insertions(+), 46 deletions(-)

Adding Marc and Will.

Can you please add other interested parties that I've missed?

Thanks.

2023-02-17 04:24:55

by Yu Zhao

Subject: Re: [PATCH mm-unstable v1 4/5] kvm/powerpc: add kvm_arch_test_clear_young()

On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
>
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not nested and run on hardware with Radix MMU enabled.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/powerpc/include/asm/kvm_host.h | 18 ++++++
> arch/powerpc/include/asm/kvm_ppc.h | 14 +----
> arch/powerpc/kvm/book3s.c | 7 +++
> arch/powerpc/kvm/book3s.h | 2 +
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 78 +++++++++++++++++++++++++-
> arch/powerpc/kvm/book3s_hv.c | 10 ++--
> 6 files changed, 110 insertions(+), 19 deletions(-)

Adding Michael, Nicholas and Christophe.

I'm not sure who I should add for this patch. Can you please add any
interested parties that I've missed?

Thank you.

2023-02-17 09:00:35

by Marc Zyngier

Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Fri, 17 Feb 2023 04:21:28 +0000,
Yu Zhao <[email protected]> wrote:
>
> On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
> >
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.

I'm really interested in how you can back this statement. 90% of the
HW I have access to is not FEAT_HAFDBS capable, either because it
predates the feature or because the feature is too buggy to be useful.

Do you have numbers?

> >
> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 7 +++
> > arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> > arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > arch/arm64/kvm/arm.c | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> > arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> > 6 files changed, 141 insertions(+), 46 deletions(-)
>
> Adding Marc and Will.
>
> Can you please add other interested parties that I've missed?

The MAINTAINERS file has it all:

KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
M: Marc Zyngier <[email protected]>
M: Oliver Upton <[email protected]>
R: James Morse <[email protected]>
R: Suzuki K Poulose <[email protected]>
R: Zenghui Yu <[email protected]>
L: [email protected]

May I suggest that you repost your patch and Cc the interested
parties yourself? I guess most folks will want to see this in context,
and not as a random, isolated change with no rationale.

M.

--
Without deviation from the norm, progress is not possible.

2023-02-17 09:09:50

by Oliver Upton

Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

Hi Yu,

scripts/get_maintainer.pl is your friend for getting the right set of
emails for a series :) Don't know about others, but generally I would
prefer to be Cc'ed on an entire series (to gather context) than just an
individual patch.

On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> This patch adds kvm_arch_test_clear_young() for the vast majority of
> VMs that are not pKVM and run on hardware that sets the accessed bit
> in KVM page tables.
>
> It relies on two techniques, RCU and cmpxchg, to safely test and clear
> the accessed bit without taking the MMU lock. The former protects KVM
> page tables from being freed while the latter clears the accessed bit
> atomically against both the hardware and other software page table
> walkers.
>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 7 +++
> arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> arch/arm64/kvm/arm.c | 1 +
> arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> 6 files changed, 141 insertions(+), 46 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..572bcd321586 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
>
> +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> +}

Why does the lack of FEAT_HAFDBS preclude the use of the test-and-clear
notifier?

On implementations without FEAT_HAFDBS, hardware will generate a data
abort for software to set the access flag. Regardless of whether
software or hardware is responsible for updating the PTE that
information is available in the page tables.

Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
My expectation is that we should provide an implementation that returns
false if !CONFIG_KVM, avoiding the need to repeat that bit in every
single implementation of the function.

> +
> #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 63f81b27a4e3..8c9a04388c88 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> * @put_page: Decrement the refcount on a page. When the
> * refcount reaches 0 the page is automatically
> * freed.
> + * @put_page_rcu: RCU variant of put_page().
> * @page_count: Return the refcount of a page.
> * @phys_to_virt: Convert a physical address into a virtual
> * address mapped in the current context.
> @@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
> void (*free_removed_table)(void *addr, u32 level);
> void (*get_page)(void *addr);
> void (*put_page)(void *addr);
> + void (*put_page_rcu)(void *addr);

Why do we need this? We already defer dropping the last reference count
on a page to an RCU callback. Have you observed issues with the existing
implementation?

> int (*page_count)(void *addr);
> void* (*phys_to_virt)(phys_addr_t phys);
> phys_addr_t (*virt_to_phys)(void *addr);
> @@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
> * children.
> * @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
> * with other software walkers.
> + *
> + * kvm_arch_test_clear_young() is a special case. It relies on two
> + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> + * bit without taking the MMU lock. The former protects KVM page tables
> + * from being freed while the latter clears the accessed bit atomically
> + * against both the hardware and other software page table walkers.
> */
> enum kvm_pgtable_walk_flags {
> KVM_PGTABLE_WALK_LEAF = BIT(0),
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index c8dca8ae359c..350437661d4b 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -30,4 +30,47 @@
> */
> #define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
>
> +#define KVM_PTE_TYPE BIT(1)
> +#define KVM_PTE_TYPE_BLOCK 0
> +#define KVM_PTE_TYPE_PAGE 1
> +#define KVM_PTE_TYPE_TABLE 1
> +
> +#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
> +
> +#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
> +#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
> +#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
> +#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
> +
> +#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
> +#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
> +#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
> +
> +#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
> +
> +#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
> +
> +#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
> + KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
> + KVM_PTE_LEAF_ATTR_HI_S2_XN)
> +
> +#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
> +#define KVM_MAX_OWNER_ID 1
> +
> +/*
> + * Used to indicate a pte for which a 'break-before-make' sequence is in
> + * progress.
> + */
> +#define KVM_INVALID_PTE_LOCKED BIT(10)
> +

If there is a need to do these sort of moves, please do it in a separate
patch. It pollutes the context of the functional change you're making.

> #endif /* __ARM64_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..6770bc47f5c9 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> */
> void kvm_arch_destroy_vm(struct kvm *kvm)
> {
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>
> bitmap_free(kvm->arch.pmu_filter);
> free_cpumask_var(kvm->arch.supported_cpus);
>

[...]

> +struct test_clear_young_arg {
> + struct kvm_gfn_range *range;
> + gfn_t lsb_gfn;
> + unsigned long *bitmap;
> +};
> +
> +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags flags)
> +{
> + struct test_clear_young_arg *arg = ctx->arg;
> + gfn_t gfn = ctx->addr / PAGE_SIZE;
> + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
> + VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);

Do we really need to be _this_ pedantic about sanity checking?

> + if (!kvm_pte_valid(new))
> + return 0;
> +
> + if (new == ctx->old)
> + return 0;
> +
> + /* see the comments on the generic kvm_arch_has_test_clear_young() */
> + if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
> + cmpxchg64(ctx->ptep, ctx->old, new);

Why not use stage2_try_set_pte()? Not only is it idiomatic with the rest
of the stage-2 code, it also will 'do the right thing' according to the
locking scheme of the caller if we decide to change it at some point.
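
A minimal sketch of what the callback could look like with that helper
(assuming stage2_try_set_pte(), today a static helper in
arch/arm64/kvm/hyp/pgtable.c, were made available here):

	if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
		stage2_try_set_pte(ctx, new);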

> + return 0;
> +}
> +
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> + gfn_t lsb_gfn, unsigned long *bitmap)
> +{
> + u64 start = range->start * PAGE_SIZE;
> + u64 end = range->end * PAGE_SIZE;
> + struct test_clear_young_arg arg = {
> + .range = range,
> + .lsb_gfn = lsb_gfn,
> + .bitmap = bitmap,
> + };
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_test_clear_young,
> + .arg = &arg,
> + .flags = KVM_PGTABLE_WALK_LEAF,
> + };
> +
> + BUILD_BUG_ON(is_hyp_code());

See prior comment about sanity checking.

> + if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> + return false;

Same here...

> + /* see the comments on kvm_pgtable_walk_flags */
> + rcu_read_lock();
> +
> + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
> +
> + rcu_read_unlock();

The rcu_read_{lock,unlock}() is entirely superfluous.

Of course, it is somewhat hidden by the fact that we must use
abstractions to support host and EL2 use of the page table code, but we
already make use of RCU to protect the stage-2 of a 'regular' VM.

> + return true;
> +}
> +
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> if (!kvm->arch.mmu.pgt)
> @@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
>
> void kvm_arch_flush_shadow_all(struct kvm *kvm)
> {
> - kvm_free_stage2_pgd(&kvm->arch.mmu);
> }

Doesn't this become a blatant correctness issue? This entirely fails to
uphold the exact expectations of the call.

--
Thanks,
Oliver

2023-02-17 16:00:18

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Fri, Feb 17, 2023, Oliver Upton wrote:
> Hi Yu,
>
> scripts/get_maintainers.pl is your friend for getting the right set of
> emails for a series :) Don't know about others, but generally I would
> prefer to be Cc'ed on an entire series (to gather context) than just an
> individual patch.

+1

>
> On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.

At least for the x86 changes, please read Documentation/process/maintainer-tip.rst
and rewrite the changelogs.

> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 7 +++
> > arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> > arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > arch/arm64/kvm/arm.c | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> > arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> > 6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..572bcd321586 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> > void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */

Please eliminate all of these "see the comments on blah", in every case they do
nothing more than redirect the reader to something they're likely already aware of.

> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}

...

> Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> My expectation is that we should provide an implementation that returns
> false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> single implementation of the function.

mm/vmscan.c uses kvm_arch_has_test_clear_young(). I have opinions on that, but
I'll hold off on expressing them until there's actual justification presented
somewhere.

Yu, this series and each patch needs a big pile of "why". I get that the goal
is to optimize memory oversubscribe, but there needs to be justification for
why this is KVM only, why nested VMs and !A/D hardware are out of scope, why yet
another mmu_notifier hook is being added, etc.

2023-02-17 16:28:15

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 16, 2023, Yu Zhao wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6aaae18f1854..d2995c9e8f07 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1367,6 +1367,12 @@ struct kvm_arch {
> * the MMU lock in read mode + the tdp_mmu_pages_lock or
> * the MMU lock in write mode
> *
> + * kvm_arch_test_clear_young() is a special case. It relies on two

No, it's not. The TDP MMU already employs RCU and CMPXCHG. Just drop the
entire comment.

> + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> + * bit without taking the MMU lock. The former protects KVM page tables
> + * from being freed while the latter clears the accessed bit atomically
> + * against both the hardware and other software page table walkers.
> + *
> * Roots will remain in the list until their tdp_mmu_root_count
> * drops to zero, at which point the thread that decremented the
> * count to zero should removed the root from the list and clean
> @@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
> KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
>
> +extern u64 __read_mostly shadow_accessed_mask;
> +
> +/*
> + * Returns true if A/D bits are supported in hardware and are enabled by KVM.
> + * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
> + * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
> + * scenario where KVM is using A/D bits for L1, but not L2.
> + */
> +static inline bool kvm_ad_enabled(void)
> +{
> + return shadow_accessed_mask;
> +}

Absolutely not. This information is not getting directly exposed outside of KVM.

> +
> +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> +static inline bool kvm_arch_has_test_clear_young(void)
> +{
> + return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
> + (!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && tdp_enabled));
> +}

Pending the justification for why this is KVM-only, I would strongly prefer we
find a way to have the mmu_notifier framework track whether or not any listeners
have a test_clear_young(). E.g. have KVM nullify its hook during module load.
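
A rough sketch of that direction (illustrative only; note the ops struct
is const in today's code, so making it writable at init time is part of
the assumption):

	static struct mmu_notifier_ops kvm_mmu_notifier_ops = {
		.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
		.clear_young		= kvm_mmu_notifier_clear_young,
		.test_young		= kvm_mmu_notifier_test_young,
		.test_clear_young	= kvm_mmu_notifier_test_clear_young,
	};

	static void kvm_init_mmu_notifier_ops(void)
	{
		/* let the mmu_notifier core see "no listener" directly */
		if (!kvm_arch_has_test_clear_young())
			kvm_mmu_notifier_ops.test_clear_young = NULL;
	}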

> +
> #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 6f54dc9409c9..0dc7fed1f3fd 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
> extern u64 __read_mostly shadow_nx_mask;
> extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> extern u64 __read_mostly shadow_user_mask;
> -extern u64 __read_mostly shadow_accessed_mask;
> extern u64 __read_mostly shadow_dirty_mask;
> extern u64 __read_mostly shadow_mmio_value;
> extern u64 __read_mostly shadow_mmio_mask;
> @@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
> return !!(pte & SPTE_MMU_PRESENT_MASK);
> }
>
> -/*
> - * Returns true if A/D bits are supported in hardware and are enabled by KVM.
> - * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
> - * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
> - * scenario where KVM is using A/D bits for L1, but not L2.
> - */
> -static inline bool kvm_ad_enabled(void)
> -{
> - return !!shadow_accessed_mask;
> -}

As Oliver said in the ARM patch, _if_ this is justified, please do code movement
in a separate patch.

> -
> static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> {
> return sp->role.ad_disabled;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d6df38d371a0..9028e09f1aab 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1309,6 +1309,47 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
> }
>
> +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> + gfn_t lsb_gfn, unsigned long *bitmap)
> +{
> + struct kvm_mmu_page *root;
> +
> + if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> + return false;
> +
> + if (kvm_memslots_have_rmaps(kvm))

This completely disables the API on VMs that have _ever_ run a nested VM. I doubt
that's the intended behavior.

> + return false;
> +
> + /* see the comments on kvm_arch->tdp_mmu_roots */
> + rcu_read_lock();
> +
> + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
> + struct tdp_iter iter;
> +
> + if (kvm_mmu_page_as_id(root) != range->slot->as_id)
> + continue;

for_each_tdp_mmu_root() does this for you.

> +
> + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> + u64 *sptep = rcu_dereference(iter.sptep);

kvm_tdp_mmu_read_spte(), though it's not clear to me why this doesn't test+clear
the SPTE's accessed bit and then toggle the bitmap.

> + u64 new_spte = iter.old_spte & ~shadow_accessed_mask;
> +
> + VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));

This doesn't do what I assume it's intended to do. The sptep points at a KVM,
a.k.a. kernel, allocated page, not at guest memory. Assuming the intent is to
assert that the memory being aged has an elevated refcount, this would need to
extract the pfn out of the SPTE and get the struct page for that. But that's
completely unsafe because KVM supports mapping VM_PFNMAP and VM_IO memory into
the guest. Maybe the proposed caller only operates on struct page memory, but
I am not willing to make that assumption in KVM.

TL;DR: drop this.

> + VM_WARN_ON_ONCE(iter.gfn < range->start || iter.gfn >= range->end);

This adds no value, KVM is completely hosed if tdp_root_for_each_leaf_pte() botches
the ranges.

> +
> + if (new_spte == iter.old_spte)
> + continue;
> +
> + /* see the comments on the generic kvm_arch_has_test_clear_young() */

No, "see xyz" for unintuitive logic is not acceptable. Add a helper and document
the logic there, don't splatter "see XYZ" comments everywhere.

> + if (__test_and_change_bit(lsb_gfn - iter.gfn, bitmap))
> + cmpxchg64(sptep, iter.old_spte, new_spte);

Clearing a single bit doesn't need a CMPXCHG. Please weigh in on a relevant series
that is modifying the aging flows[*], I want to have exactly one helper for aging
TDP MMU SPTEs.

[*] https://lore.kernel.org/all/[email protected]
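
For reference, a minimal sketch of the clear_bit() form, similar to what
mmu_spte_age() does for the shadow MMU (illustrative only; it ignores
this series' test-and-change bitmap protocol):

	if (iter.old_spte & shadow_accessed_mask) {
		clear_bit((ffs(shadow_accessed_mask) - 1),
			  (unsigned long *)sptep);
		__set_bit(lsb_gfn - iter.gfn, bitmap);
	}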

2023-02-23 03:59:28

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <[email protected]> wrote:
>
> On Fri, 17 Feb 2023 04:21:28 +0000,
> Yu Zhao <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
> > >
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> I'm really interested in how you can back this statement. 90% of the
> HW I have access to is not FEAT_HWAFDB capable, either because it
> predates the feature or because the feature is too buggy to be useful.

This is my experience too -- most devices are pre v8.2.

> Do you have numbers?

Let's do a quick market survey by segment. The following only applies
to ARM CPUs:

1. Phones: none of the major Android phone vendors sell phones running
VMs; no other major Linux phone vendors.
2. Laptops: only a very limited number of Chromebooks run VMs, namely
ACRVM. No other major Linux laptop vendors.
3. Desktops: no major Linux desktop vendors.
4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
can be a VM guest on QNX host).
5. Cloud: this is where the vast majority of VMs come from. Among the
vendors available to the general public, Ampere is the biggest player.
Here [1] is a list of its customers. The A-bit works well even on its
EVT products (Neoverse cores).

[1] https://en.wikipedia.org/wiki/Ampere_Computing

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 7 +++
> > > arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> > > arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > > arch/arm64/kvm/arm.c | 1 +
> > > arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> > > arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> > > 6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > Adding Marc and Will.
> >
> > Can you please add other interested parties that I've missed?
>
> The MAINTAINERS file has it all:
>
> KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)
> M: Marc Zyngier <[email protected]>
> M: Oliver Upton <[email protected]>
> R: James Morse <[email protected]>
> R: Suzuki K Poulose <[email protected]>
> R: Zenghui Yu <[email protected]>
> L: [email protected]
>
> May I suggest that you repost your patch and Cc the interested
> parties yourself? I guess most folks will want to see this in context,
> and not as a random, isolated change with no rationale.

This clarified it. Thanks. (I was hesitant to spam people with the
entire series containing changes to other architectures.)

2023-02-23 04:44:54

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Fri, Feb 17, 2023 at 2:09 AM Oliver Upton <[email protected]> wrote:
>
> Hi Yu,
>
> scripts/get_maintainers.pl is your friend for getting the right set of
> emails for a series :) Don't know about others, but generally I would
> prefer to be Cc'ed on an entire series (to gather context) than just an
> individual patch.

Will do. Thank you.

> On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > VMs that are not pKVM and run on hardware that sets the accessed bit
> > in KVM page tables.
> >
> > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > the accessed bit without taking the MMU lock. The former protects KVM
> > page tables from being freed while the latter clears the accessed bit
> > atomically against both the hardware and other software page table
> > walkers.
> >
> > Signed-off-by: Yu Zhao <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 7 +++
> > arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> > arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > arch/arm64/kvm/arm.c | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> > arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> > 6 files changed, 141 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..572bcd321586 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> > void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> >
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > +}
>
> Why does the lack of FEAT_HAFDBS preclude the use of the test-and-clear
> notifier?

This all comes down to the return on investment. We could
theoretically make it work but the complexity and the poor performance
would outweigh the benefits -- VM memory overcommit mostly happens in
Cloud and none of the major Cloud vendors use pre v8.2 [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufbbs2gG+DPvSOw_N_Kx7FWdZvpdJUvLzko-BDQ8vfd6Xg@mail.gmail.com/

> On implementations without FEAT_HAFDBS, hardware will generate a data
> abort for software to set the access flag. Regardless of whether
> software or hardware is responsible for updating the PTE that
> information is available in the page tables.

Agreed, the s/w emulation of the A-bit has poor performance. Looking
forward, businesses that wish to overcommit host memory will eventually
all move to v8.2 and later. This is another reason not to worry about
pre v8.2 (or 32-bit, for that matter).

> Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> My expectation is that we should provide an implementation that returns
> false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> single implementation of the function.

We do have that default implementation. But we still need to disable
this implementation when !CONFIG_KVM (it isn't protected by #ifdef
CONFIG_KVM).

> > +
> > #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 63f81b27a4e3..8c9a04388c88 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -105,6 +105,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
> > * @put_page: Decrement the refcount on a page. When the
> > * refcount reaches 0 the page is automatically
> > * freed.
> > + * @put_page_rcu: RCU variant of put_page().
> > * @page_count: Return the refcount of a page.
> > * @phys_to_virt: Convert a physical address into a virtual
> > * address mapped in the current context.
> > @@ -122,6 +123,7 @@ struct kvm_pgtable_mm_ops {
> > void (*free_removed_table)(void *addr, u32 level);
> > void (*get_page)(void *addr);
> > void (*put_page)(void *addr);
> > + void (*put_page_rcu)(void *addr);
>
> Why do we need this? We already defer dropping the last reference count
> on a page to an RCU callback. Have you observed issues with the existing
> implementation?

That's on the reader path, i.e., collapsing PTEs into a PMD, which
RCU-frees the PTE table.

On the writer path, unmapping wasn't protected by RCU before this
patch, and put_page_rcu() makes it so.
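
A minimal sketch of the idea behind put_page_rcu() (names and details
are illustrative, not the actual patch): defer the final put_page() past
a grace period so that lockless walkers under rcu_read_lock() cannot see
a table page freed from under them.

	static void stage2_put_page_rcu_cb(struct rcu_head *head)
	{
		put_page(container_of(head, struct page, rcu_head));
	}

	static void stage2_put_page_rcu(void *addr)
	{
		struct page *page = virt_to_page(addr);

		/* only the last reference needs to wait for a grace period */
		if (page_count(page) == 1)
			call_rcu(&page->rcu_head, stage2_put_page_rcu_cb);
		else
			put_page(page);
	}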

> > int (*page_count)(void *addr);
> > void* (*phys_to_virt)(phys_addr_t phys);
> > phys_addr_t (*virt_to_phys)(void *addr);
> > @@ -188,6 +190,12 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
> > * children.
> > * @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
> > * with other software walkers.
> > + *
> > + * kvm_arch_test_clear_young() is a special case. It relies on two
> > + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> > + * bit without taking the MMU lock. The former protects KVM page tables
> > + * from being freed while the latter clears the accessed bit atomically
> > + * against both the hardware and other software page table walkers.
> > */
> > enum kvm_pgtable_walk_flags {
> > KVM_PGTABLE_WALK_LEAF = BIT(0),
> > diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> > index c8dca8ae359c..350437661d4b 100644
> > --- a/arch/arm64/include/asm/stage2_pgtable.h
> > +++ b/arch/arm64/include/asm/stage2_pgtable.h
> > @@ -30,4 +30,47 @@
> > */
> > #define kvm_mmu_cache_min_pages(kvm) (kvm_stage2_levels(kvm) - 1)
> >
> > +#define KVM_PTE_TYPE BIT(1)
> > +#define KVM_PTE_TYPE_BLOCK 0
> > +#define KVM_PTE_TYPE_PAGE 1
> > +#define KVM_PTE_TYPE_TABLE 1
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO 3
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW 1
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
> > +#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
> > +
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
> > +#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
> > +
> > +#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
> > +
> > +#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
> > + KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
> > + KVM_PTE_LEAF_ATTR_HI_S2_XN)
> > +
> > +#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
> > +#define KVM_MAX_OWNER_ID 1
> > +
> > +/*
> > + * Used to indicate a pte for which a 'break-before-make' sequence is in
> > + * progress.
> > + */
> > +#define KVM_INVALID_PTE_LOCKED BIT(10)
> > +
>
> If there is a need to do these sort of moves, please do it in a separate
> patch. It pollutes the context of the functional change you're making.
>
> > #endif /* __ARM64_S2_PGTABLE_H_ */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 9c5573bc4614..6770bc47f5c9 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
> > */
> > void kvm_arch_destroy_vm(struct kvm *kvm)
> > {
> > + kvm_free_stage2_pgd(&kvm->arch.mmu);
> >
> > bitmap_free(kvm->arch.pmu_filter);
> > free_cpumask_var(kvm->arch.supported_cpus);
> >
>
> [...]
>
> > +struct test_clear_young_arg {
> > + struct kvm_gfn_range *range;
> > + gfn_t lsb_gfn;
> > + unsigned long *bitmap;
> > +};
> > +
> > +static int stage2_test_clear_young(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags flags)
> > +{
> > + struct test_clear_young_arg *arg = ctx->arg;
> > + gfn_t gfn = ctx->addr / PAGE_SIZE;
> > + kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > +
> > + VM_WARN_ON_ONCE(!page_count(virt_to_page(ctx->ptep)));
> > + VM_WARN_ON_ONCE(gfn < arg->range->start || gfn >= arg->range->end);
>
> Do we really need to be _this_ pedantic about sanity checking?

Will remove them. (My experience with the world's large fleets is that
Murphy's law is always true.)

> > + if (!kvm_pte_valid(new))
> > + return 0;
> > +
> > + if (new == ctx->old)
> > + return 0;
> > +
> > + /* see the comments on the generic kvm_arch_has_test_clear_young() */
> > + if (__test_and_change_bit(arg->lsb_gfn - gfn, arg->bitmap))
> > + cmpxchg64(ctx->ptep, ctx->old, new);
>
> Why not use stage2_try_set_pte()? Not only is it idiomatic with the rest
> of the stage-2 code, it also will 'do the right thing' according to the
> locking scheme of the caller if we decide to change it at some point.

It's not exported. Do you prefer it to be exported?

> > + return 0;
> > +}
> > +
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> > + gfn_t lsb_gfn, unsigned long *bitmap)
> > +{
> > + u64 start = range->start * PAGE_SIZE;
> > + u64 end = range->end * PAGE_SIZE;
> > + struct test_clear_young_arg arg = {
> > + .range = range,
> > + .lsb_gfn = lsb_gfn,
> > + .bitmap = bitmap,
> > + };
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_test_clear_young,
> > + .arg = &arg,
> > + .flags = KVM_PGTABLE_WALK_LEAF,
> > + };
> > +
> > + BUILD_BUG_ON(is_hyp_code());
>
> See prior comment about sanity checking.
>
> > + if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> > + return false;
>
> Same here...
>
> > + /* see the comments on kvm_pgtable_walk_flags */
> > + rcu_read_lock();
> > +
> > + kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);
> > +
> > + rcu_read_unlock();
>
> The rcu_read_{lock,unlock}() is entirely superfluous.

Not really. I didn't use the KVM_PGTABLE_WALK_SHARED flag above. Yes,
using it would be more consistent with the rest of the ARM code. My POV
is that it would be less consistent with other archs -- a POV which I
fully expect you to disagree with :)

I could add it and remove rcu_read_{lock,unlock}() if you prefer that way.
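
For the record, a sketch of that variant (the only assumption is that
KVM_PGTABLE_WALK_SHARED makes the walker core take the RCU read lock
itself, as discussed above, so the explicit pair goes away):

	struct kvm_pgtable_walker walker = {
		.cb	= stage2_test_clear_young,
		.arg	= &arg,
		.flags	= KVM_PGTABLE_WALK_LEAF | KVM_PGTABLE_WALK_SHARED,
	};

	kvm_pgtable_walk(kvm->arch.mmu.pgt, start, end - start, &walker);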

> Of course, it is somewhat hidden by the fact that we must use
> abstractions to support host and EL2 use of the page table code, but we
> already make use of RCU to protect the stage-2 of a 'regular' VM.
>
> > + return true;
> > +}
> > +
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > if (!kvm->arch.mmu.pgt)
> > @@ -1848,7 +1924,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
> >
> > void kvm_arch_flush_shadow_all(struct kvm *kvm)
> > {
> > - kvm_free_stage2_pgd(&kvm->arch.mmu);
> > }
>
> Doesn't this become a blatant correctness issue? This entirely fails to
> uphold the exact expectations of the call.

I moved kvm_free_stage2_pgd() into kvm_arch_destroy_vm() above so
that the mmu notifier SRCU will protect the pgd at destruction. Don't
worry about this for now. I'll refactor this change and put_page_rcu()
into a separate patch to make it clearer -- without them, RCU page
table walkers won't be safe against VM destruction and unmap.

2023-02-23 05:26:25

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Fri, Feb 17, 2023 at 9:00 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Feb 17, 2023, Oliver Upton wrote:
> > Hi Yu,
> >
> > scripts/get_maintainers.pl is your friend for getting the right set of
> > emails for a series :) Don't know about others, but generally I would
> > prefer to be Cc'ed on an entire series (to gather context) than just an
> > individual patch.
>
> +1
>
> >
> > On Thu, Feb 16, 2023 at 09:12:28PM -0700, Yu Zhao wrote:
> > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > in KVM page tables.
>
> At least for the x86 changes, please read Documentation/process/maintainer-tip.rst
> and rewrite the changelogs.

I see -- will remove "this patch".

> > > It relies on two techniques, RCU and cmpxchg, to safely test and clear
> > > the accessed bit without taking the MMU lock. The former protects KVM
> > > page tables from being freed while the latter clears the accessed bit
> > > atomically against both the hardware and other software page table
> > > walkers.
> > >
> > > Signed-off-by: Yu Zhao <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 7 +++
> > > arch/arm64/include/asm/kvm_pgtable.h | 8 +++
> > > arch/arm64/include/asm/stage2_pgtable.h | 43 ++++++++++++++
> > > arch/arm64/kvm/arm.c | 1 +
> > > arch/arm64/kvm/hyp/pgtable.c | 51 ++--------------
> > > arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++-
> > > 6 files changed, 141 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..572bcd321586 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -1031,4 +1031,11 @@ static inline void kvm_hyp_reserve(void) { }
> > > void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu);
> > > bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
> > >
> > > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
>
> Please eliminate all of these "see the comments on blah", in every case they do
> nothing more than redirect the reader to something they're likely already aware of.
>
> > > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > > +static inline bool kvm_arch_has_test_clear_young(void)
> > > +{
> > > + return IS_ENABLED(CONFIG_KVM) && cpu_has_hw_af() && !is_protected_kvm_enabled();
> > > +}
>
> ...
>
> > Also, I'm at a loss for why we'd need to test if CONFIG_KVM is enabled.
> > My expectation is that we should provide an implementation that returns
> > false if !CONFIG_KVM, avoiding the need to repeat that bit in every
> > single implementation of the function.
>
> mm/vmscan.c uses kvm_arch_has_test_clear_young(). I have opinions on that, but
> I'll hold off on expressing them until there's actual justification presented
> somewhere.
>
> Yu, this series and each patch needs a big pile of "why". I get that the goal
> is to optimize memory oversubscribe, but there needs to be justification for
> why this is KVM only, why nested VMs and !A/D hardware are out of scope, why yet
> another mmu_notifier hook is being added, etc.

I totally agree.

This is an optimization, not a bug fix. It can't be justified without
performance numbers from some common use cases. That burden of proof
clearly rests on me -- I will follow up on that.

For now, I want to make sure the methodical part is clear:
1. We only have limited resources and we need to prioritize major use cases.
2. We can only improve one thing at a time and we can't cover
everything at the same time.
3. We need to focus on the return on investment and the future.

I hope everyone by now agrees with my "the vast majority of VMs ..."
assertion. If not, I'm happy to revisit that [1]. If so, the next step
would be whether we want to focus on the vast majority first. I think
this naturally answers why the nested VMs and !AD h/w are out of
scope, at the moment (I didn't spell this out; probably I should in
v2). After we have taken the first step, we probably can decide
whether there are enough resources and demand to cover the
low-return-on-investment part (higher complexity, less common use cases).

[1] https://lore.kernel.org/linux-mm/[email protected]/

2023-02-23 05:59:45

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Fri, Feb 17, 2023 at 9:27 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 6aaae18f1854..d2995c9e8f07 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1367,6 +1367,12 @@ struct kvm_arch {
> > * the MMU lock in read mode + the tdp_mmu_pages_lock or
> > * the MMU lock in write mode
> > *
> > + * kvm_arch_test_clear_young() is a special case. It relies on two
>
> No, it's not. The TDP MMU already employs RCU and CMPXCHG.

It is -- you read it out of context :)

* For reads, this list is protected by:
* the MMU lock in read mode + RCU or
* the MMU lock in write mode
*
* For writes, this list is protected by:
* the MMU lock in read mode + the tdp_mmu_pages_lock or
* the MMU lock in write mode
*
* kvm_arch_test_clear_young() is a special case.
...

struct list_head tdp_mmu_roots;

> Just drop the
> entire comment.

Let me move it into kvm_arch_test_clear_young().

Also I want to be clear:
1. We can't just focus on here and now; we need to consider the distant future.
2. From my POV, "see the comments on ..." is like the index of a book.

> > + * techniques, RCU and cmpxchg, to safely test and clear the accessed
> > + * bit without taking the MMU lock. The former protects KVM page tables
> > + * from being freed while the latter clears the accessed bit atomically
> > + * against both the hardware and other software page table walkers.
> > + *
> > * Roots will remain in the list until their tdp_mmu_root_count
> > * drops to zero, at which point the thread that decremented the
> > * count to zero should removed the root from the list and clean
> > @@ -2171,4 +2177,25 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
> > KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
> >
> > +extern u64 __read_mostly shadow_accessed_mask;
> > +
> > +/*
> > + * Returns true if A/D bits are supported in hardware and are enabled by KVM.
> > + * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
> > + * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
> > + * scenario where KVM is using A/D bits for L1, but not L2.
> > + */
> > +static inline bool kvm_ad_enabled(void)
> > +{
> > + return shadow_accessed_mask;
> > +}
>
> Absolutely not. This information is not getting directly exposed outside of KVM.

Will do.

> > +
> > +/* see the comments on the generic kvm_arch_has_test_clear_young() */
> > +#define kvm_arch_has_test_clear_young kvm_arch_has_test_clear_young
> > +static inline bool kvm_arch_has_test_clear_young(void)
> > +{
> > + return IS_ENABLED(CONFIG_KVM) && IS_ENABLED(CONFIG_X86_64) &&
> > + (!IS_REACHABLE(CONFIG_KVM) || (kvm_ad_enabled() && tdp_enabled));
> > +}
>
> Pending the justification for why this is KVM-only

Nothing else has *_young()... IOW, KVM is the only user of *_young().

> I would strongly prefer we
> find a way to have the mmu_notifier framework track whether or not any listeners
> have a test_clear_young(). E.g. have KVM nullify its hook during module load.

It's already done that way. This function is just for the caller to
avoid unnecessary trips into the MMU notifier on archs that don't
support this capability. (The caller would find it unsupported.)

> > +
> > #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 6f54dc9409c9..0dc7fed1f3fd 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -153,7 +153,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
> > extern u64 __read_mostly shadow_nx_mask;
> > extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> > extern u64 __read_mostly shadow_user_mask;
> > -extern u64 __read_mostly shadow_accessed_mask;
> > extern u64 __read_mostly shadow_dirty_mask;
> > extern u64 __read_mostly shadow_mmio_value;
> > extern u64 __read_mostly shadow_mmio_mask;
> > @@ -247,17 +246,6 @@ static inline bool is_shadow_present_pte(u64 pte)
> > return !!(pte & SPTE_MMU_PRESENT_MASK);
> > }
> >
> > -/*
> > - * Returns true if A/D bits are supported in hardware and are enabled by KVM.
> > - * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
> > - * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
> > - * scenario where KVM is using A/D bits for L1, but not L2.
> > - */
> > -static inline bool kvm_ad_enabled(void)
> > -{
> > - return !!shadow_accessed_mask;
> > -}
>
> As Oliver said in the ARM patch, _if_ this is justified, please do code movement
> in a separate patch.

I'll just drop this.

> > static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
> > {
> > return sp->role.ad_disabled;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index d6df38d371a0..9028e09f1aab 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1309,6 +1309,47 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
> > }
> >
> > +bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range,
> > + gfn_t lsb_gfn, unsigned long *bitmap)
> > +{
> > + struct kvm_mmu_page *root;
> > +
> > + if (WARN_ON_ONCE(!kvm_arch_has_test_clear_young()))
> > + return false;
> > +
> > + if (kvm_memslots_have_rmaps(kvm))
>
> This completely disables the API on VMs that have _ever_ run a nested VM.

Ok, we definitely don't want this.

> I doubt
> that's the intended behavior.

Good catch, thanks.

> > + return false;
> > +
> > + /* see the comments on kvm_arch->tdp_mmu_roots */
> > + rcu_read_lock();
> > +
> > + list_for_each_entry_rcu(root, &kvm->arch.tdp_mmu_roots, link) {
> > + struct tdp_iter iter;
> > +
> > + if (kvm_mmu_page_as_id(root) != range->slot->as_id)
> > + continue;
>
> for_each_tdp_mmu_root() does this for you.
>
> > +
> > + tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> > + u64 *sptep = rcu_dereference(iter.sptep);
>
> kvm_tdp_mmu_read_spte(), though it's not clear to me why this doesn't test+clear
> the SPTE's accessed bit and then toggle the bitmap.
>
> > + u64 new_spte = iter.old_spte & ~shadow_accessed_mask;
> > +
> > + VM_WARN_ON_ONCE(!page_count(virt_to_page(sptep)));
>
> This doesn't do what I assume it's intended to do.

This asserts the page table page is not freed, i.e., the memory we are
modifying still belongs to someone.

If we forget to do RCU free, i.e., we free immediately, this can
catch it, assuming no reuse.

> The sptep points at a KVM,
> a.k.a. kernel, allocated page, not at guest memory. Assuming the intent is to
> assert that the memory being aged has an elevated refcount, this would need to
> extract the pfn out of the SPTE and get the struct page for that. But that's
> completely unsafe because KVM supports mapping VM_PFNMAP and VM_IO memory into
> the guest. Maybe the proposed caller only operates on struct page memory, but
> I am not willing to make that assumption in KVM.
>
> TL;DR: drop this.
>
> > + VM_WARN_ON_ONCE(iter.gfn < range->start || iter.gfn >= range->end);
>
> This adds no value, KVM is completely hosed if tdp_root_for_each_leaf_pte() botches
> the ranges.

Ok.

> > +
> > + if (new_spte == iter.old_spte)
> > + continue;
> > +
> > + /* see the comments on the generic kvm_arch_has_test_clear_young() */
>
> No, "see xyz" for unintuitive logic is not acceptable. Add a helper and document
> the logic there, don't splatter "see XYZ" comments everywhere.
>
> > + if (__test_and_change_bit(lsb_gfn - iter.gfn, bitmap))
> > + cmpxchg64(sptep, iter.old_spte, new_spte);
>
> Clearing a single bit doesn't need a CMPXCHG. Please weigh in on a relevant series
> that is modifying the aging flows[*], I want to have exactly one helper for aging
> TDP MMU SPTEs.
>
> [*] https://lore.kernel.org/all/[email protected]

I'll take a look at that series. clear_bit() probably won't cause any
practical damage but is technically wrong because, for example, it can
end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
in this case, obviously.)

2023-02-23 09:03:19

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Thu, 23 Feb 2023 03:58:47 +0000,
Yu Zhao <[email protected]> wrote:
>
> On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <[email protected]> wrote:
> >
> > On Fri, 17 Feb 2023 04:21:28 +0000,
> > Yu Zhao <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
> > > >
> > > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > > in KVM page tables.
> >
> > I'm really interested in how you can back this statement. 90% of the
> > HW I have access to is not FEAT_HWAFDB capable, either because it
> > predates the feature or because the feature is too buggy to be useful.
>
> This is my experience too -- most devices are pre v8.2.

And yet you have no issue writing the above. Puzzling.

>
> > Do you have numbers?
>
> Let's do a quick market survey by segment. The following only applies
> to ARM CPUs:
>
> 1. Phones: none of the major Android phone vendors sell phones running
> VMs; no other major Linux phone vendors.

Maybe you should have a reality check and look at what your own
employer is shipping.

> 2. Laptops: only a very limited number of Chromebooks run VMs, namely
> ACRVM. No other major Linux laptop vendors.

Again, your employer disagrees.

> 3. Desktops: no major Linux desktop vendors.

My desktop disagrees (I'm sending this from my arm64 desktop VM).

> 4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
> can be a VM guest on QNX host).

This email is brought to you via a router VM on an arm64 box.

> 5. Cloud: this is where the vast majority of VMs come from. Among the
> vendors available to the general public, Ampere is the biggest player.
> Here [1] is a list of its customers. The A-bit works well even on its
> EVT products (Neoverse cores).

Just the phone stuff dwarfs the number of cloud hosts.

Hopefully your patches are better than your market analysis...

M.

--
Without deviation from the norm, progress is not possible.

2023-02-23 09:19:05

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 3/5] kvm/arm64: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023 at 2:03 AM Marc Zyngier <[email protected]> wrote:
>
> On Thu, 23 Feb 2023 03:58:47 +0000,
> Yu Zhao <[email protected]> wrote:
> >
> > On Fri, Feb 17, 2023 at 2:00 AM Marc Zyngier <[email protected]> wrote:
> > >
> > > On Fri, 17 Feb 2023 04:21:28 +0000,
> > > Yu Zhao <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 9:12 PM Yu Zhao <[email protected]> wrote:
> > > > >
> > > > > This patch adds kvm_arch_test_clear_young() for the vast majority of
> > > > > VMs that are not pKVM and run on hardware that sets the accessed bit
> > > > > in KVM page tables.
> > >
> > > I'm really interested in how you can back this statement. 90% of the
> > > HW I have access to is not FEAT_HWAFDB capable, either because it
> > > predates the feature or because the feature is too buggy to be useful.
> >
> > This is my experience too -- most devices are pre v8.2.
>
> And yet you have no issue writing the above. Puzzling.

That's to the best of my knowledge. Mind enlightening me?

> > > Do you have numbers?
> >
> > Let's do a quick market survey by segment. The following only applies
> > to ARM CPUs:
> >
> > 1. Phones: none of the major Android phone vendors sell phones running
> > VMs; no other major Linux phone vendors.
>
> Maybe you should have a reality check and look at what your own
> employer is shipping.

Which model? I'll look it up and see how I missed it.

> > 2. Laptops: only a very limited number of Chromebooks run VMs, namely
> > ACRVM. No other major Linux laptop vendors.
>
> Again, your employer disagrees.

What do you mean? Sorry, I'm a little surprised here... I do know *a
lot* about Chromebooks.

> > 3. Desktops: no major Linux desktop vendors.
>
> My desktop disagrees (I'm sending this from my arm64 desktop VM).

A model number please?

> > 4. Embedded/IoT/Router: no major Linux vendors run VMs (Android Auto
> > can be a VM guest on QNX host).
>
> This email is brought to you via a router VM on an arm64 box.

More details?

> > 5. Cloud: this is where the vast majority of VMs come from. Among the
> > vendors available to the general public, Ampere is the biggest player.
> > Here [1] is a list of its customers. The A-bit works well even on its
> > EVT products (Neoverse cores).
>
> Just the phone stuff dwarfs the number of cloud hosts.

Please point me to something that I can work on so that I wouldn't
sound so ignorant next time.

2023-02-23 17:09:47

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Wed, Feb 22, 2023, Yu Zhao wrote:
> On Fri, Feb 17, 2023 at 9:27 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 6aaae18f1854..d2995c9e8f07 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1367,6 +1367,12 @@ struct kvm_arch {
> > > * the MMU lock in read mode + the tdp_mmu_pages_lock or
> > > * the MMU lock in write mode
> > > *
> > > + * kvm_arch_test_clear_young() is a special case. It relies on two
> >
> > No, it's not. The TDP MMU already employs RCU and CMPXCHG.
>
> It is -- you read it out of context :)

Ah, the special case is that it's fully lockless. That's still not all that
special, e.g. see kvm_tdp_mmu_walk_lockless_{begin,end}().
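
For readers following along, a hedged sketch of that existing bracket
(the helpers essentially take and drop the RCU read lock around an SPTE
walk; the usage below is illustrative):

	kvm_tdp_mmu_walk_lockless_begin();
	/* ... read SPTEs, e.g. via a tdp_iter over the current root ... */
	kvm_tdp_mmu_walk_lockless_end();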

> * For reads, this list is protected by:
> * the MMU lock in read mode + RCU or
> * the MMU lock in write mode
> *
> * For writes, this list is protected by:
> * the MMU lock in read mode + the tdp_mmu_pages_lock or
> * the MMU lock in write mode
> *
> * kvm_arch_test_clear_young() is a special case.
> ...
>
> struct list_head tdp_mmu_roots;
>
> > Just drop the
> > entire comment.
>
> Let me move it into kvm_arch_test_clear_young().

No, I do not want kvm_arch_test_clear_young(), or any other one-off function, to
be "special". I love the idea of a lockless walk, but I want it to be a formal,
documented way to walk TDP MMU roots. I.e. add a macro to go with for_each_tdp_mmu_root()
and the yield-safe variants.

/* blah blah blah */
#define for_each_tdp_mmu_root_lockless(_kvm, _root, _as_id) \
list_for_each_entry_rcu(_root, &kvm->arch.tdp_mmu_roots, link) \
if (refcount_read(&root->tdp_mmu_root_count) && \
kvm_mmu_page_as_id(_root) != _as_id) { \
} else

> Also I want to be clear:
> 1. We can't just focus on here and now; we need to consider the distant future.

I 100% agree, but those words need to be backed up by actions. This series is
littered with code that is not maintainable long term, e.g. open coding stuff
that belongs in helpers and/or for which KVM already provides helpers, copy-pasting
__kvm_handle_hva_range() instead of extending it to have a lockless option, poking
directly into KVM from mm/ code, etc.

I apologize for being so blunt. My intent isn't to be rude/snarky, it's to set
very clear expectations for getting any of these changes merged. I absolutely do
want to land improvements to KVM's test+clear young flows, but it needs to be done
in a way that is maintainable and doesn't saddle KVM with more tech debt.

> 2. From my POV, "see the comments on ..." is like the index of a book.

And my _very_ strong preference is to provide the "index" via code, not comments.

> > Clearing a single bit doesn't need a CMPXCHG. Please weigh in on a relevant series
> > that is modifying the aging flows[*], I want to have exactly one helper for aging
> > TDP MMU SPTEs.
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> I'll take a look at that series. clear_bit() probably won't cause any
> practical damage but is technically wrong because, for example, it can
> end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> in this case, obviously.)

Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
wrong because the target gfn may or may not have been accessed. The only way for
KVM to clear an A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
replaced between the "is leaf" and the clear_bit().

2023-02-23 17:14:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

On Thu, Feb 16, 2023, Yu Zhao wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9c60384b5ae0..1b465df4a93d 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> }
>
> +static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
> + unsigned long end, unsigned long *bitmap)
> +{
> + int i;
> + int key;
> + bool success = true;
> +
> + trace_kvm_age_hva(start, end);
> +
> + key = srcu_read_lock(&kvm->srcu);
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + struct interval_tree_node *node;
> + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_hva_range(node, slots, start, end - 1) {
> + gfn_t lsb_gfn;
> + unsigned long hva_start, hva_end;
> + struct kvm_gfn_range range = {
> + .slot = container_of(node, struct kvm_memory_slot,
> + hva_node[slots->node_idx]),
> + };
> +
> + hva_start = max(start, range.slot->userspace_addr);
> + hva_end = min(end - 1, range.slot->userspace_addr +
> + range.slot->npages * PAGE_SIZE - 1);
> +
> + range.start = hva_to_gfn_memslot(hva_start, range.slot);
> + range.end = hva_to_gfn_memslot(hva_end, range.slot) + 1;
> +
> + if (WARN_ON_ONCE(range.end <= range.start))
> + continue;

Extend __kvm_handle_hva_range() instead of copy-pasting. At a very quick glance,
I believe all that is needed is (minus sanity checks):

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..3296ae2cf6fa 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,6 +544,7 @@ struct kvm_hva_range {
hva_handler_t handler;
on_lock_fn_t on_lock;
on_unlock_fn_t on_unlock;
+ bool lockless;
bool flush_on_ret;
bool may_block;
};
@@ -616,7 +617,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;

- if (!locked) {
+ if (!range->lockless && !locked) {
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
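
A hedged sketch of how a lockless caller could then use it (the handler
name is hypothetical, and unset fields rely on their implicit zero
values):

	const struct kvm_hva_range range = {
		.start		= start,
		.end		= end,
		.handler	= kvm_test_clear_young_gfn_range, /* hypothetical */
		.on_lock	= (void *)kvm_null_fn,
		.on_unlock	= (void *)kvm_null_fn,
		.lockless	= true,
		.flush_on_ret	= false,
		.may_block	= false,
	};

	return __kvm_handle_hva_range(kvm, &range);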

> +
> + /* see the comments on the generic kvm_arch_has_test_clear_young() */
> + lsb_gfn = hva_to_gfn_memslot(end - 1, range.slot);
> +
> + success = kvm_arch_test_clear_young(kvm, &range, lsb_gfn, bitmap);
> + if (!success)
> + break;
> + }
> + }
> +
> + srcu_read_unlock(&kvm->srcu, key);
> +
> + return success;
> +}

2023-02-23 17:28:15

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Feb 22, 2023, Yu Zhao wrote:
> > On Fri, Feb 17, 2023 at 9:27 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6aaae18f1854..d2995c9e8f07 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1367,6 +1367,12 @@ struct kvm_arch {
> > > > * the MMU lock in read mode + the tdp_mmu_pages_lock or
> > > > * the MMU lock in write mode
> > > > *
> > > > + * kvm_arch_test_clear_young() is a special case. It relies on two
> > >
> > > No, it's not. The TDP MMU already employs RCU and CMPXCHG.
> >
> > It is -- you read it out of context :)
>
> Ah, the special case is that it's fully lockless. That's still not all that
> special, e.g. see kvm_tdp_mmu_walk_lockless_{begin,end}().
>
> > * For reads, this list is protected by:
> > * the MMU lock in read mode + RCU or
> > * the MMU lock in write mode
> > *
> > * For writes, this list is protected by:
> > * the MMU lock in read mode + the tdp_mmu_pages_lock or
> > * the MMU lock in write mode
> > *
> > * kvm_arch_test_clear_young() is a special case.
> > ...
> >
> > struct list_head tdp_mmu_roots;
> >
> > > Just drop the
> > > entire comment.
> >
> > Let me move it into kvm_arch_test_clear_young().
>
> No, I do not want kvm_arch_test_clear_young(), or any other one-off function, to
> be "special". I love the idea of a lockless walk, but I want it to be a formal,
> documented way to walk TDP MMU roots. I.e. add a macro to go with for_each_tdp_mmu_root()
> and the yield-safe variants.

I see what you mean now. will do.

> /* blah blah blah */
> #define for_each_tdp_mmu_root_lockless(_kvm, _root, _as_id) \
> list_for_each_entry_rcu(_root, &kvm->arch.tdp_mmu_roots, link) \
> if (refcount_read(&root->tdp_mmu_root_count) && \
> kvm_mmu_page_as_id(_root) != _as_id) { \
> } else
>
> > Also I want to be clear:
> > 1. We can't just focus on here and now; we need to consider the distant future.
>
> I 100% agree, but those words need to be backed up by actions. This series is
> littered with code that is not maintainable long term, e.g. open coding stuff
> that belongs in helpers and/or for which KVM already provides helpers, copy-pasting
> __kvm_handle_hva_range() instead of extending it to have a lockless option, poking
> directly into KVM from mm/ code, etc.
>
> I apologize for being so blunt. My intent isn't to be rude/snarky, it's to set
> very clear expectations for getting any of these changes merged.

No worries at all. I appreciate you directly telling me how you prefer
it to be done, and that makes the job easier for both of us. Please do
bear with me though, because I'm very unfamiliar with the KVM side of
expectations.

> I absolutely do
> want to land improvements to KVM's test+clear young flows, but it needs to be done
> in a way that is maintainable and doesn't saddle KVM with more tech debt.

Agreed.

> > 2. From my POV, "see the comments on ..." is like the index of a book.
>
> And my _very_ strong preference is to provide the "index" via code, not comments.

Will do.

> > > Clearing a single bit doesn't need a CMPXCHG. Please weigh in on a relevant series
> > > that is modifying the aging flows[*], I want to have exactly one helper for aging
> > > TDP MMU SPTEs.
> > >
> > > [*] https://lore.kernel.org/all/[email protected]
> >
> > I'll take a look at that series. clear_bit() probably won't cause any
> > practical damage but is technically wrong because, for example, it can
> > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > in this case, obviously.)
>
> Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> wrong because the target gfn may or may not have been accessed.

Sorry, I don't understand. You mean clear_bit() on a huge PTE is
technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
is not.)

> The only way for
> KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> replaced between the "is leaf" and the clear_bit().

I think there is a misunderstanding here. Let me be more specific:
1. Clearing the A-bit in a non-leaf entry is technically wrong because
that's not our intention.
2. When we try to clear_bit() on a leaf PMD, it can at the same time
become a non-leaf PMD, which causes 1) above, and therefore is
technically wrong.
3. I don't think 2) could do any real harm, so there's practically no problem.
4. cmpxchg() can avoid 2).

Does this make sense?

Thanks.
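
For illustration, a kernel-style sketch of the distinction in points 1-4
above; is_leaf() and ACCESSED_MASK are placeholders rather than KVM's
actual helpers, and this is not the series' code:

/*
 * Minimal sketch only: test and clear the A-bit of a leaf SPTE with
 * cmpxchg(), so a concurrent change of the entry makes the update fail
 * instead of touching whatever entry is there now.
 */
static bool sketch_test_clear_young(u64 *sptep)
{
	u64 old = READ_ONCE(*sptep);

	if (!is_leaf(old) || !(old & ACCESSED_MASK))
		return false;

	/*
	 * If the entry changed after the is_leaf() check (e.g. a leaf PMD
	 * was replaced by a non-leaf PMD), the cmpxchg fails and the new
	 * entry is left untouched.  A plain clear_bit() in the same window
	 * would clear the bit unconditionally -- harmless in practice per
	 * point 3, but that is the race points 1-2 describe.
	 */
	return cmpxchg64(sptep, old, old & ~ACCESSED_MASK) == old;
}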

2023-02-23 17:34:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

On Thu, Feb 16, 2023, Yu Zhao wrote:
> +static bool kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
> + unsigned long start, unsigned long end,
> + unsigned long *bitmap)
> +{
> + if (kvm_arch_has_test_clear_young())
> + return kvm_test_clear_young(mmu_notifier_to_kvm(mn), start, end, bitmap);
> +
> + return false;
> +}
> +
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long address)
> @@ -903,6 +960,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
> .clear_young = kvm_mmu_notifier_clear_young,
> .test_young = kvm_mmu_notifier_test_young,
> + .test_clear_young = kvm_mmu_notifier_test_clear_young,

I am strongly opposed to adding yet another "young" mmu_notifier hook for this.
I would much rather we extend (and rename) mmu_notifier_clear_young() to take the
bitmap as an optional parameter, and then have KVM hide the details of whether or
not it supports lockless processing of the range/bitmap.
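
For illustration, such an extended hook might take the following shape;
the struct name and exact signature here are assumptions, not an
existing mmu_notifier API:

/*
 * Illustrative sketch only: clear_young() extended with an optional
 * bitmap.  A NULL bitmap keeps today's per-range behavior; with a
 * bitmap, the implementation may batch the range and report which PTEs
 * were young, hiding whether it can process them locklessly.
 */
struct mmu_notifier_ops_sketch {
	int (*clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long start, unsigned long end,
			   unsigned long *bitmap);
};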

I also think for KVM x86 in particular, this series should first convert to a
lockless walk for aging ranges, and then add the batching. It might also make
sense to land x86 first and then follow up with ARM and PPC. I haven't looked at
the ARM or PPC patches in too much depth, but on the x86 side of things KVM already
has the pieces in place to support a fully lockless walk, i.e. the x86 specific
changes aren't all that contentious, the only thing we need to figure out is what
to do about nested VMs.

2023-02-23 17:41:02

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023 at 10:14 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9c60384b5ae0..1b465df4a93d 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> > }
> >
> > +static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
> > + unsigned long end, unsigned long *bitmap)
> > +{
> > + int i;
> > + int key;
> > + bool success = true;
> > +
> > + trace_kvm_age_hva(start, end);
> > +
> > + key = srcu_read_lock(&kvm->srcu);
> > +
> > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > + struct interval_tree_node *node;
> > + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > + kvm_for_each_memslot_in_hva_range(node, slots, start, end - 1) {
> > + gfn_t lsb_gfn;
> > + unsigned long hva_start, hva_end;
> > + struct kvm_gfn_range range = {
> > + .slot = container_of(node, struct kvm_memory_slot,
> > + hva_node[slots->node_idx]),
> > + };
> > +
> > + hva_start = max(start, range.slot->userspace_addr);
> > + hva_end = min(end - 1, range.slot->userspace_addr +
> > + range.slot->npages * PAGE_SIZE - 1);
> > +
> > + range.start = hva_to_gfn_memslot(hva_start, range.slot);
> > + range.end = hva_to_gfn_memslot(hva_end, range.slot) + 1;
> > +
> > + if (WARN_ON_ONCE(range.end <= range.start))
> > + continue;
>
> Extend __kvm_handle_hva_range() instead of copy-pasting. At a very quick glance,
> I believe all that is needed is (minus sanity checks):

Yes, will do.

I do need to add one more parameter to kvm_gfn_range, because that's
what the current kvm_arch_test_clear_young() needs, assuming that
function is acceptable.

Also, just a side note, from MM's POV, the following in
__kvm_handle_hva_range() seems to forget to handle end == 0, if that's
possible?

hva_end = min(range->end, slot->userspace_addr + (slot->npages <<
PAGE_SHIFT));

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331..3296ae2cf6fa 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -544,6 +544,7 @@ struct kvm_hva_range {
> hva_handler_t handler;
> on_lock_fn_t on_lock;
> on_unlock_fn_t on_unlock;
> + bool lockless;
> bool flush_on_ret;
> bool may_block;
> };
> @@ -616,7 +617,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
>
> - if (!locked) {
> + if (!range->lockless && !locked) {
> locked = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
>
> > +
> > + /* see the comments on the generic kvm_arch_has_test_clear_young() */
> > + lsb_gfn = hva_to_gfn_memslot(end - 1, range.slot);
> > +
> > + success = kvm_arch_test_clear_young(kvm, &range, lsb_gfn, bitmap);
> > + if (!success)
> > + break;
> > + }
> > + }
> > +
> > + srcu_read_unlock(&kvm->srcu, key);
> > +
> > + return success;
> > +}

2023-02-23 17:43:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 16, 2023, Yu Zhao wrote:
> An existing selftest can quickly demonstrate the effectiveness of this
> patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:

Not my area of maintenance, but a non-existent changelog (for all intents and
purposes) for a change of this size and complexity is not acceptable.

> $ sudo max_guest_memory_test -c 64 -m 250 -s 250
>
> MGLRU run2
> ---------------
> Before ~600s
> After ~50s
> Off ~250s
>
> kswapd (MGLRU before)
> 100.00% balance_pgdat
> 100.00% shrink_node
> 100.00% shrink_one
> 99.97% try_to_shrink_lruvec
> 99.06% evict_folios
> 97.41% shrink_folio_list
> 31.33% folio_referenced
> 31.06% rmap_walk_file
> 30.89% folio_referenced_one
> 20.83% __mmu_notifier_clear_flush_young
> 20.54% kvm_mmu_notifier_clear_flush_young
> => 19.34% _raw_write_lock
>
> kswapd (MGLRU after)
> 100.00% balance_pgdat
> 100.00% shrink_node
> 100.00% shrink_one
> 99.97% try_to_shrink_lruvec
> 99.51% evict_folios
> 71.70% shrink_folio_list
> 7.08% folio_referenced
> 6.78% rmap_walk_file
> 6.72% folio_referenced_one
> 5.60% lru_gen_look_around
> => 1.53% __mmu_notifier_test_clear_young

Do you happen to know how much of the improvement is due to batching, and how
much is due to using a lockless walk?

> @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
>
> + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> + caps |= BIT(LRU_GEN_SPTE_WALK);

As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
if we want to avoid batching when there are no mmu_notifier listeners, probe
mmu_notifiers. But don't call into KVM directly.

2023-02-23 18:09:22

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > An existing selftest can quickly demonstrate the effectiveness of this
> > patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:
>
> Not my area of maintenance, but a non-existent changelog (for all intents and
> purposes) for a change of this size and complexity is not acceptable.

Will fix.

> > $ sudo max_guest_memory_test -c 64 -m 250 -s 250
> >
> > MGLRU run2
> > ---------------
> > Before ~600s
> > After ~50s
> > Off ~250s
> >
> > kswapd (MGLRU before)
> > 100.00% balance_pgdat
> > 100.00% shrink_node
> > 100.00% shrink_one
> > 99.97% try_to_shrink_lruvec
> > 99.06% evict_folios
> > 97.41% shrink_folio_list
> > 31.33% folio_referenced
> > 31.06% rmap_walk_file
> > 30.89% folio_referenced_one
> > 20.83% __mmu_notifier_clear_flush_young
> > 20.54% kvm_mmu_notifier_clear_flush_young
> > => 19.34% _raw_write_lock
> >
> > kswapd (MGLRU after)
> > 100.00% balance_pgdat
> > 100.00% shrink_node
> > 100.00% shrink_one
> > 99.97% try_to_shrink_lruvec
> > 99.51% evict_folios
> > 71.70% shrink_folio_list
> > 7.08% folio_referenced
> > 6.78% rmap_walk_file
> > 6.72% folio_referenced_one
> > 5.60% lru_gen_look_around
> > => 1.53% __mmu_notifier_test_clear_young
>
> Do you happen to know how much of the improvement is due to batching, and how
> much is due to using a walkless walk?

No. I have three benchmarks running at the moment:
1. Windows SQL server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that, I can reprioritize -- I need to
stop 1) and use that machine to get the number for you.

> > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> >
> > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > + caps |= BIT(LRU_GEN_SPTE_WALK);
>
> As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> if we want to avoid batching when there are no mmu_notifier listeners, probe
> mmu_notifiers. But don't call into KVM directly.

I'm not sure I fully understand. Let's present the problem on the MM
side: assuming KVM supports lockless walks, batching can still be
worse (very unlikely), because GFNs can exhibit no memory locality at
all. So this option allows userspace to disable batching.

I fully understand why you don't want MM to call into KVM directly. No
acceptable ways to set up a clear interface between MM and KVM other
than the MMU notifier?

2023-02-23 18:24:05

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > I'll take a look at that series. clear_bit() probably won't cause any
> > > practical damage but is technically wrong because, for example, it can
> > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > in this case, obviously.)
> >
> > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > wrong because the target gfn may or may not have been accessed.
>
> Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> is not.)
>
> > The only way for
> > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > replaced between the "is leaf" and the clear_bit().
>
> I think there is a misunderstanding here. Let me be more specific:
> 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> that's not our intention.
> 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> become a non-leaf PMD, which causes 1) above, and therefore is
> technically wrong.
> 3. I don't think 2) could do any real harm, so no practically no problem.
> 4. cmpxchg() can avoid 2).
>
> Does this make sense?

I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
_just_ got converted from a leaf PMD is "wrong" if and only if the intended
behavior is nonsensical.

Without an explicit granularity from the caller, the intent is to either (a) reap
A-bit on leaf PTEs, or (b) reap A-bit at the highest possible granularity. (a) is
nonsensical because it provides zero guarantees to the caller as to the
granularity of the information. Leaf vs. non-leaf matters for the life cycle of
page tables and guest accesses, e.g. KVM needs to zap _only_ leaf SPTEs when
handling an mmu_notifier invalidation, but when it comes to the granularity of the
A-bit, leaf vs. non-leaf has no meaning. On KVM x86, a PMD covers 2MiB of GPAs
regardless of whether it's a leaf or non-leaf PMD.

If the intent is (b), then clearing the A-bit on a PMD a few cycles after the PMD
was converted from leaf to non-leaf is a pointless distinction, because it yields
the same end result as clearing the A-bit just a few cycles earlier, when the PMD
was a leaf.

Actually, if I'm reading patch 5 correctly, this is all much ado about nothing,
because the MGLRU code only kicks in for non-huge PTEs, and KVM cannot have
larger mappings than the primary MMU, i.e. should not encounter huge PTEs.

On that topic, if the assumption is that the bitmap is used only for non-huge PTEs,
then x86's kvm_arch_test_clear_young() needs to be hardened to process only 4KiB
PTEs, and probably to WARN if a huge PTE is encountered. That assumption should
also be documented.

If that assumption is incorrect, then kvm_arch_test_clear_young() is broken and/or
the expected behavior of the bitmap isn't fully defined.

2023-02-23 18:35:14

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > practical damage but is technically wrong because, for example, it can
> > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > in this case, obviously.)
> > >
> > > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > > wrong because the target gfn may or may not have been accessed.
> >
> > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > is not.)
> >
> > > The only way for
> > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > > replaced between the "is leaf" and the clear_bit().
> >
> > I think there is a misunderstanding here. Let me be more specific:
> > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > that's not our intention.
> > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > become a non-leaf PMD, which causes 1) above, and therefore is
> > technically wrong.
> > 3. I don't think 2) could do any real harm, so no practically no problem.
> > 4. cmpxchg() can avoid 2).
> >
> > Does this make sense?
>
> I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> _just_ got converted from a leaf PMD is "wrong" if and only if the intented
> behavior is nonsensical.

Sorry, let me rephrase:
1. Clearing the A-bit in a non-leaf entry is technically wrong because
we didn't make sure there is the A-bit there -- the bit we are
clearing can be something else. (Yes, we know it's not, but we didn't
define this behavior, e.g., a macro to designate that bit for non-leaf
entries. Also I didn't check the spec -- does EPT actually support the
A-bit in non-leaf entries? My guess is that NPT does.)

2023-02-23 18:48:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > > practical damage but is technically wrong because, for example, it can
> > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > > in this case, obviously.)
> > > >
> > > > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > > > wrong because the target gfn may or may not have been accessed.
> > >
> > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > is not.)
> > >
> > > > The only way for
> > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > > > replaced between the "is leaf" and the clear_bit().
> > >
> > > I think there is a misunderstanding here. Let me be more specific:
> > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > that's not our intention.
> > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > technically wrong.
> > > 3. I don't think 2) could do any real harm, so no practically no problem.
> > > 4. cmpxchg() can avoid 2).
> > >
> > > Does this make sense?
> >
> > I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> > _just_ got converted from a leaf PMD is "wrong" if and only if the intented
> > behavior is nonsensical.
>
> Sorry, let me rephrase:
> 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> we didn't make sure there is the A-bit there -- the bit we are
> clearing can be something else. (Yes, we know it's not, but we didn't
> define this behavior, e.g., a macro to designate that bit for non-leaf
> entries.

Heh, by that definition, anything and everything is "technically wrong". An Intel
CPU might support SVM, even though we know no such CPUs exist, so requiring AMD or
Hygon to enable SVM is technically wrong.

> Also I didn't check the spec -- does EPT actually support the
> A-bit in non-leaf entries? My guess is that NPT does.)

If A/D bits are enabled, both EPT and 64-bit NPT support the Accessed bit at all
levels irrespective of whether or not the entry maps a huge page.

PAE NPT is a different story, but the TDP MMU is limited to 64-bit kernels, i.e.
requires 64-bit NPT.

2023-02-23 19:02:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023 at 11:47 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > > > practical damage but is technically wrong because, for example, it can
> > > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > > > in this case, obviously.)
> > > > >
> > > > > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > > > > wrong because the target gfn may or may not have been accessed.
> > > >
> > > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > > is not.)
> > > >
> > > > > The only way for
> > > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > > > > replaced between the "is leaf" and the clear_bit().
> > > >
> > > > I think there is a misunderstanding here. Let me be more specific:
> > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > that's not our intention.
> > > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > > technically wrong.
> > > > 3. I don't think 2) could do any real harm, so no practically no problem.
> > > > 4. cmpxchg() can avoid 2).
> > > >
> > > > Does this make sense?
> > >
> > > I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> > > _just_ got converted from a leaf PMD is "wrong" if and only if the intented
> > > behavior is nonsensical.
> >
> > Sorry, let me rephrase:
> > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > we didn't make sure there is the A-bit there -- the bit we are
> > clearing can be something else. (Yes, we know it's not, but we didn't
> > define this behavior, e.g., a macro to designate that bit for non-leaf
> > entries.
>
> Heh, by that definition, anything and everything is "technically wrong".

I really don't see how what I said, in our context,

"Clearing the A-bit in a non-leaf entry is technically wrong because
we didn't make sure there is the A-bit there"

can imply

"anything and everything is "technically wrong"."

And how what I said can be an analogy to

"An Intel CPU might support SVM, even though we know no such CPUs
exist, so requiring AMD or Hygon to enable SVM is technically wrong."

BTW, here is a bug caused by clearing the A-bit in non-leaf entries in
a different scenario:
https://lore.kernel.org/linux-mm/[email protected]/

Let's just agree to disagree.

2023-02-23 19:12:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > kswapd (MGLRU before)
> > > 100.00% balance_pgdat
> > > 100.00% shrink_node
> > > 100.00% shrink_one
> > > 99.97% try_to_shrink_lruvec
> > > 99.06% evict_folios
> > > 97.41% shrink_folio_list
> > > 31.33% folio_referenced
> > > 31.06% rmap_walk_file
> > > 30.89% folio_referenced_one
> > > 20.83% __mmu_notifier_clear_flush_young
> > > 20.54% kvm_mmu_notifier_clear_flush_young
> > > => 19.34% _raw_write_lock
> > >
> > > kswapd (MGLRU after)
> > > 100.00% balance_pgdat
> > > 100.00% shrink_node
> > > 100.00% shrink_one
> > > 99.97% try_to_shrink_lruvec
> > > 99.51% evict_folios
> > > 71.70% shrink_folio_list
> > > 7.08% folio_referenced
> > > 6.78% rmap_walk_file
> > > 6.72% folio_referenced_one
> > > 5.60% lru_gen_look_around
> > > => 1.53% __mmu_notifier_test_clear_young
> >
> > Do you happen to know how much of the improvement is due to batching, and how
> > much is due to using a walkless walk?
>
> No. I have three benchmarks running at the moment:
> 1. Windows SQL server guest on x86 host,
> 2. Apache Spark guest on arm64 host, and
> 3. Memcached guest on ppc64 host.
>
> If you are really interested in that, I can reprioritize -- I need to
> stop 1) and use that machine to get the number for you.

After looking at the "MGLRU before" stack again, it's definitely worth getting
those numbers. The "before" isn't just taking mmu_lock, it's taking mmu_lock for
write _and_ flushing remote TLBs on _every_ PTE. I suspect the batching is a
tiny percentage of the overall win (might be larger with RETPOLINE and friends),
and that the bulk of the improvement comes from avoiding the insanity of
kvm_mmu_notifier_clear_flush_young().

Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
entirely? I.e. why can MGLRU tolerate stale information but !MGLRU cannot? If
we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young()
instead, would anyone notice, let alone care?

> > > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > > if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > > caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> > >
> > > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > > + caps |= BIT(LRU_GEN_SPTE_WALK);
> >
> > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > mmu_notifiers. But don't call into KVM directly.
>
> I'm not sure I fully understand. Let's present the problem on the MM
> side: assuming KVM supports lockless walks, batching can still be
> worse (very unlikely), because GFNs can exhibit no memory locality at
> all. So this option allows userspace to disable batching.

I'm asking the opposite. Is there a scenario where batching+lock is worse than
!batching+lock? If not, then don't make batching depend on lockless walks.

> I fully understand why you don't want MM to call into KVM directly. No
> acceptable ways to set up a clear interface between MM and KVM other
> than the MMU notifier?

There are several options I can think of, but before we go spend time designing
the best API, I'd rather figure out if we care in the first place.

2023-02-23 19:21:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 11:47 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > > > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > > > > practical damage but is technically wrong because, for example, it can
> > > > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > > > > in this case, obviously.)
> > > > > >
> > > > > > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > > > > > wrong because the target gfn may or may not have been accessed.
> > > > >
> > > > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > > > is not.)
> > > > >
> > > > > > The only way for
> > > > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > > > > > replaced between the "is leaf" and the clear_bit().
> > > > >
> > > > > I think there is a misunderstanding here. Let me be more specific:
> > > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > > that's not our intention.
> > > > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > > > technically wrong.
> > > > > 3. I don't think 2) could do any real harm, so no practically no problem.
> > > > > 4. cmpxchg() can avoid 2).
> > > > >
> > > > > Does this make sense?
> > > >
> > > > I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> > > > _just_ got converted from a leaf PMD is "wrong" if and only if the intented
> > > > behavior is nonsensical.
> > >
> > > Sorry, let me rephrase:
> > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > we didn't make sure there is the A-bit there -- the bit we are
> > > clearing can be something else. (Yes, we know it's not, but we didn't
> > > define this behavior, e.g., a macro to designate that bit for non-leaf
> > > entries.
> >
> > Heh, by that definition, anything and everything is "technically wrong".
>
> I really don't see how what I said, in our context,
>
> "Clearing the A-bit in a non-leaf entry is technically wrong because
> we didn't make sure there is the A-bit there"
>
> can infer
>
> "anything and everything is "technically wrong"."
>
> And how what I said can be an analogy to
>
> "An Intel CPU might support SVM, even though we know no such CPUs
> exist, so requiring AMD or Hygon to enable SVM is technically wrong."
>
> BTW, here is a bug caused by clearing the A-bit in non-leaf entries in
> a different scenario:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Let's just agree to disagree.

No, because I don't want anyone to leave with the impression that relying on the
Accessed bit to uniformly exist (or not) at all levels in the TDP MMU is somehow
technically wrong. The link you posted is about running as a Xen guest, and is
in arch-agnostic code. That is wildly different than what we are talking about
here, where the targets are strictly limited to x86-64 TDP, and the existence of
the Accessed bit is architecturally defined.

In this code, there are exactly two flavors of paging that can be in use, and
using clear_bit() to clear shadow_accessed_mask is safe for both, full stop.

2023-02-23 19:25:56

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 2/5] kvm/x86: add kvm_arch_test_clear_young()

On Thu, Feb 23, 2023 at 12:21 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 11:47 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 11:24 AM Sean Christopherson <[email protected]> wrote:
> > > > >
> > > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > > On Thu, Feb 23, 2023 at 10:09 AM Sean Christopherson <[email protected]> wrote:
> > > > > > > > I'll take a look at that series. clear_bit() probably won't cause any
> > > > > > > > practical damage but is technically wrong because, for example, it can
> > > > > > > > end up clearing the A-bit in a non-leaf PMD. (cmpxchg will just fail
> > > > > > > > in this case, obviously.)
> > > > > > >
> > > > > > > Eh, not really. By that argument, clearing an A-bit in a huge PTE is also technically
> > > > > > > wrong because the target gfn may or may not have been accessed.
> > > > > >
> > > > > > Sorry, I don't understand. You mean clear_bit() on a huge PTE is
> > > > > > technically wrong? Yes, that's what I mean. (cmpxchg() on a huge PTE
> > > > > > is not.)
> > > > > >
> > > > > > > The only way for
> > > > > > > KVM to clear a A-bit in a non-leaf entry is if the entry _was_ a huge PTE, but was
> > > > > > > replaced between the "is leaf" and the clear_bit().
> > > > > >
> > > > > > I think there is a misunderstanding here. Let me be more specific:
> > > > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > > > that's not our intention.
> > > > > > 2. When we try to clear_bit() on a leaf PMD, it can at the same time
> > > > > > become a non-leaf PMD, which causes 1) above, and therefore is
> > > > > > technically wrong.
> > > > > > 3. I don't think 2) could do any real harm, so no practically no problem.
> > > > > > 4. cmpxchg() can avoid 2).
> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > I understand what you're saying, but clearing an A-bit on a non-leaf PMD that
> > > > > _just_ got converted from a leaf PMD is "wrong" if and only if the intented
> > > > > behavior is nonsensical.
> > > >
> > > > Sorry, let me rephrase:
> > > > 1. Clearing the A-bit in a non-leaf entry is technically wrong because
> > > > we didn't make sure there is the A-bit there -- the bit we are
> > > > clearing can be something else. (Yes, we know it's not, but we didn't
> > > > define this behavior, e.g., a macro to designate that bit for non-leaf
> > > > entries.
> > >
> > > Heh, by that definition, anything and everything is "technically wrong".
> >
> > I really don't see how what I said, in our context,
> >
> > "Clearing the A-bit in a non-leaf entry is technically wrong because
> > we didn't make sure there is the A-bit there"
> >
> > can infer
> >
> > "anything and everything is "technically wrong"."
> >
> > And how what I said can be an analogy to
> >
> > "An Intel CPU might support SVM, even though we know no such CPUs
> > exist, so requiring AMD or Hygon to enable SVM is technically wrong."
> >
> > BTW, here is a bug caused by clearing the A-bit in non-leaf entries in
> > a different scenario:
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Let's just agree to disagree.
>
> No, because I don't want anyone to leave with the impression that relying on the
> Accessed bit to uniformly exist (or not) at all levels in the TDP MMU is somehow
> technically wrong. The link you posted is about running as a Xen guest, and is
> in arch-agnostic code. That is wildly different than what we are talking about
> here, where the targets are strictly limited to x86-64 TDP, and the existence of
> the Accessed bit is architecturally defined.

Yes, I see now.

Sorry to say this, but this is all I needed to hear: "the existence of
the Accessed bit is architecturally defined". (I didn't know that, and
didn't see that this was what you meant.)

> In this code, there are exactly two flavors of paging that can be in use, and
> using clear_bit() to clear shadow_accessed_mask is safe for both, full stop.

2023-02-23 19:38:08

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > > kswapd (MGLRU before)
> > > > 100.00% balance_pgdat
> > > > 100.00% shrink_node
> > > > 100.00% shrink_one
> > > > 99.97% try_to_shrink_lruvec
> > > > 99.06% evict_folios
> > > > 97.41% shrink_folio_list
> > > > 31.33% folio_referenced
> > > > 31.06% rmap_walk_file
> > > > 30.89% folio_referenced_one
> > > > 20.83% __mmu_notifier_clear_flush_young
> > > > 20.54% kvm_mmu_notifier_clear_flush_young
> > > > => 19.34% _raw_write_lock
> > > >
> > > > kswapd (MGLRU after)
> > > > 100.00% balance_pgdat
> > > > 100.00% shrink_node
> > > > 100.00% shrink_one
> > > > 99.97% try_to_shrink_lruvec
> > > > 99.51% evict_folios
> > > > 71.70% shrink_folio_list
> > > > 7.08% folio_referenced
> > > > 6.78% rmap_walk_file
> > > > 6.72% folio_referenced_one
> > > > 5.60% lru_gen_look_around
> > > > => 1.53% __mmu_notifier_test_clear_young
> > >
> > > Do you happen to know how much of the improvement is due to batching, and how
> > > much is due to using a walkless walk?
> >
> > No. I have three benchmarks running at the moment:
> > 1. Windows SQL server guest on x86 host,
> > 2. Apache Spark guest on arm64 host, and
> > 3. Memcached guest on ppc64 host.
> >
> > If you are really interested in that, I can reprioritize -- I need to
> > stop 1) and use that machine to get the number for you.
>
> After looking at the "MGLRU before" stack again, it's definitely worth getting
> those numbers. The "before" isn't just taking mmu_lock, it's taking mmu_lock for
> write _and_ flushing remote TLBs on _every_ PTE.

Correct.

> I suspect the batching is a
> tiny percentage of the overall win (might be larger with RETPOLINE and friends),

Same here.

> and that the bulk of the improvement comes from avoiding the insanity of
> kvm_mmu_notifier_clear_flush_young().
>
> Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
> entirely?

That's not my call :)

Adding Johannes.

> I.e. why can MGLRU tolerate stale information but !MGLRU cannot?

Good question. The native clear API doesn't flush:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

> If
> we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young()
> instead, would anyone notice, let alone care?

I tend to agree.

> > > > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > > > if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > > > caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> > > >
> > > > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > > > + caps |= BIT(LRU_GEN_SPTE_WALK);
> > >
> > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > mmu_notifiers. But don't call into KVM directly.
> >
> > I'm not sure I fully understand. Let's present the problem on the MM
> > side: assuming KVM supports lockless walks, batching can still be
> > worse (very unlikely), because GFNs can exhibit no memory locality at
> > all. So this option allows userspace to disable batching.
>
> I'm asking the opposite. Is there a scenario where batching+lock is worse than
> !batching+lock? If not, then don't make batching depend on lockless walks.

Yes, absolutely. batching+lock means we take/release mmu_lock for
every single PTE in the entire VA space -- each small batch contains
64 PTEs but the entire batch is the whole KVM.
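
For illustration only -- age_64_sptes() is a hypothetical helper, not
code from the series -- that case amounts to:

/*
 * Batching with mmu_lock: the lock is still taken and released once per
 * 64-PTE batch, repeated across the entire space being swept.
 * KVM_MMU_LOCK() is the same macro __kvm_handle_hva_range() uses.
 */
static void age_with_batching_and_lock(struct kvm *kvm,
				       unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += 64 * PAGE_SIZE) {
		KVM_MMU_LOCK(kvm);
		age_64_sptes(kvm, addr);
		KVM_MMU_UNLOCK(kvm);
	}
}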

> > I fully understand why you don't want MM to call into KVM directly. No
> > acceptable ways to set up a clear interface between MM and KVM other
> > than the MMU notifier?
>
> There are several options I can think of, but before we go spend time designing
> the best API, I'd rather figure out if we care in the first place.

This is self-serving -- MGLRU would be the only user in the near
future. But I never assume there will be no common ground; at least it
doesn't hurt to check.

2023-02-23 19:58:25

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > > mmu_notifiers. But don't call into KVM directly.
> > >
> > > I'm not sure I fully understand. Let's present the problem on the MM
> > > side: assuming KVM supports lockless walks, batching can still be
> > > worse (very unlikely), because GFNs can exhibit no memory locality at
> > > all. So this option allows userspace to disable batching.
> >
> > I'm asking the opposite. Is there a scenario where batching+lock is worse than
> > !batching+lock? If not, then don't make batching depend on lockless walks.
>
> Yes, absolutely. batching+lock means we take/release mmu_lock for
> every single PTE in the entire VA space -- each small batch contains
> 64 PTEs but the entire batch is the whole KVM.

Who is "we"? I don't see anything in the kernel that triggers walking the whole
VMA, e.g. lru_gen_look_around() limits the walk to a single PMD. I feel like I'm
missing something...

2023-02-23 20:10:19

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > > > mmu_notifiers. But don't call into KVM directly.
> > > >
> > > > I'm not sure I fully understand. Let's present the problem on the MM
> > > > side: assuming KVM supports lockless walks, batching can still be
> > > > worse (very unlikely), because GFNs can exhibit no memory locality at
> > > > all. So this option allows userspace to disable batching.
> > >
> > > I'm asking the opposite. Is there a scenario where batching+lock is worse than
> > > !batching+lock? If not, then don't make batching depend on lockless walks.
> >
> > Yes, absolutely. batching+lock means we take/release mmu_lock for
> > every single PTE in the entire VA space -- each small batch contains
> > 64 PTEs but the entire batch is the whole KVM.
>
> Who is "we"?

Oops -- shouldn't have used "we".

> I don't see anything in the kernel that triggers walking the whole
> VMA, e.g. lru_gen_look_around() limits the walk to a single PMD. I feel like I'm
> missing something...

walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
-> test_spte_young() -> mmu_notifier_test_clear_young().

MGLRU takes two passes: during the first pass, it sweeps the entire VA
space of each MM (per MM/KVM); during the second pass, it uses the
rmap on each folio (per folio). The look-around exploits the (spatial)
locality in the second pass, to get the best out of the expensive
per-folio rmap walk.

(The first pass can't handle shared mappings; the second pass can.)

2023-02-23 20:29:05

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > > > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > > > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > > > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > > > > mmu_notifiers. But don't call into KVM directly.
> > > > >
> > > > > I'm not sure I fully understand. Let's present the problem on the MM
> > > > > side: assuming KVM supports lockless walks, batching can still be
> > > > > worse (very unlikely), because GFNs can exhibit no memory locality at
> > > > > all. So this option allows userspace to disable batching.
> > > >
> > > > I'm asking the opposite. Is there a scenario where batching+lock is worse than
> > > > !batching+lock? If not, then don't make batching depend on lockless walks.
> > >
> > > Yes, absolutely. batching+lock means we take/release mmu_lock for
> > > every single PTE in the entire VA space -- each small batch contains
> > > 64 PTEs but the entire batch is the whole KVM.
> >
> > Who is "we"?
>
> Oops -- shouldn't have used "we".
>
> > I don't see anything in the kernel that triggers walking the whole
> > VMA, e.g. lru_gen_look_around() limits the walk to a single PMD. I feel like I'm
> > missing something...
>
> walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
> -> test_spte_young() -> mmu_notifier_test_clear_young().
>
> MGLRU takes two passes: during the first pass, it sweeps entire VA
> space on each MM (per MM/KVM); during the second pass, it uses the rmap on each
> folio (per folio).

Ah. IIUC, userspace can use LRU_GEN_SPTE_WALK to control whether or not to walk
secondary MMUs, and the kernel further restricts LRU_GEN_SPTE_WALK to secondary
MMUs that implement a lockless walk. And if the answer is "no", secondary MMUs
are simply not consulted.

If that's correct, then the proper way to handle this is by extending mmu_notifier_ops
to query (a) if there's at least one registered listener that implements
test_clear_young() and (b) if all registered listeners that implement test_clear_young()
support lockless walks. That avoids direct dependencies on KVM, and avoids making
assumptions that may not always hold true, e.g. that KVM is the only mmu_notifier
user that supports the young APIs.

P.S. all of this info absolutely belongs in documentation and/or changelogs.
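
For illustration, such a probe might look like the sketch below.  The
test_clear_young() hook is the one this series proposes and
has_lockless_young is a made-up flag; the real subscription list in
mm/mmu_notifier.c is SRCU-protected, which is glossed over here.

static bool mm_supports_lockless_young(struct mm_struct *mm)
{
	struct mmu_notifier *mn;
	bool found = false, all_lockless = true;

	if (!mm_has_notifiers(mm))
		return false;

	rcu_read_lock();
	hlist_for_each_entry_rcu(mn, &mm->notifier_subscriptions->list, hlist) {
		if (!mn->ops->test_clear_young)
			continue;
		found = true;
		if (!mn->ops->has_lockless_young)
			all_lockless = false;
	}
	rcu_read_unlock();

	/* True only if every test_clear_young() listener is lockless. */
	return found && all_lockless;
}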

2023-02-23 20:49:58

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023 at 1:29 PM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Feb 23, 2023, Yu Zhao wrote:
> > On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson <[email protected]> wrote:
> > > > >
> > > > > On Thu, Feb 23, 2023, Yu Zhao wrote:
> > > > > > > As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> > > > > > > a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> > > > > > > I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> > > > > > > if we want to avoid batching when there are no mmu_notifier listeners, probe
> > > > > > > mmu_notifiers. But don't call into KVM directly.
> > > > > >
> > > > > > I'm not sure I fully understand. Let's present the problem on the MM
> > > > > > side: assuming KVM supports lockless walks, batching can still be
> > > > > > worse (very unlikely), because GFNs can exhibit no memory locality at
> > > > > > all. So this option allows userspace to disable batching.
> > > > >
> > > > > I'm asking the opposite. Is there a scenario where batching+lock is worse than
> > > > > !batching+lock? If not, then don't make batching depend on lockless walks.
> > > >
> > > > Yes, absolutely. batching+lock means we take/release mmu_lock for
> > > > every single PTE in the entire VA space -- each small batch contains
> > > > 64 PTEs but the entire batch is the whole KVM.
> > >
> > > Who is "we"?
> >
> > Oops -- shouldn't have used "we".
> >
> > > I don't see anything in the kernel that triggers walking the whole
> > > VMA, e.g. lru_gen_look_around() limits the walk to a single PMD. I feel like I'm
> > > missing something...
> >
> > walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
> > -> test_spte_young() -> mmu_notifier_test_clear_young().
> >
> > MGLRU takes two passes: during the first pass, it sweeps entire VA
> > space on each MM (per MM/KVM); during the second pass, it uses the rmap on each
> > folio (per folio).
>
> Ah. IIUC, userspace can use LRU_GEN_SPTE_WALK to control whether or not to walk
> secondary MMUs, and the kernel further restricts LRU_GEN_SPTE_WALK to secondary
> MMUs that implement a lockless walk. And if the answer is "no", secondary MMUs
> are simply not consulted.

Sorry for the bad naming -- probably LRU_GEN_SPTE_BATCH_WALK would be
less confusing.

MGLRU always consults the secondary MMU for each page it's going to
reclaim (during the second pass), i.e., it checks, via the rmap, the
A-bit in the SPTE mapping a page it plans to reclaim so that it won't
take a hot page away from KVM.

If the lockless walk is supported, MGLRU doesn't need to work at page
granularity: (physical) pages on the LRU list may have nothing in
common (e.g., they may belong to different processes), so checking
their PTEs/SPTEs one by one is expensive. Instead, it sweeps the
entire KVM address space in the first pass and checks the *adjacent
SPTEs* of a page it plans to reclaim in the second pass. Both rely on
the *assumption* that there is some spatial locality to exploit. This
assumption can be wrong, and LRU_GEN_SPTE_WALK disables it.
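
As a rough illustration of the first pass described above -- the mm-side
signature mirrors the KVM hook from patch 1, while the bitmap layout and
record_young_pte() are assumptions, not the series' actual code:

/*
 * Sketch: one 64-PTE batch of the first-pass sweep.  One notifier call
 * covers the whole batch instead of one call per PTE.
 */
static void sweep_one_batch(struct mm_struct *mm, unsigned long start)
{
	unsigned long bitmap = ~0UL;	/* ask about all 64 PTEs in the batch */
	unsigned long end = start + 64UL * PAGE_SIZE;
	unsigned int i;

	/* Returns false when no listener supports the batched API. */
	if (!mmu_notifier_test_clear_young(mm, start, end, &bitmap))
		return;

	for_each_set_bit(i, &bitmap, BITS_PER_LONG)
		record_young_pte(mm, start + i * PAGE_SIZE);
}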

> If that's correct, then the proper way to handle this is by extending mmu_notifier_ops
> to query (a) if there's at least one register listeners that implements
> test_clear_young() and (b) if all registered listeners that implement test_clear_young()
> support lockless walks. That avoids direct dependencies on KVM, and avoids making
> assumptions that may not always hold true, e.g. that KVM is the only mmu_notifier
> user that supports the young APIs.
>
> P.S. all of this info absolutely belongs in documentation and/or changelogs.

Will do.

2023-02-23 21:12:53

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH mm-unstable v1 1/5] mm/kvm: add mmu_notifier_test_clear_young()

On Thu, Feb 23, 2023, Yu Zhao wrote:
> On Thu, Feb 23, 2023 at 10:14 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023, Yu Zhao wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 9c60384b5ae0..1b465df4a93d 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -875,6 +875,63 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > > return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> > > }
> > >
> > > +static bool kvm_test_clear_young(struct kvm *kvm, unsigned long start,
> > > + unsigned long end, unsigned long *bitmap)
> > > +{
> > > + int i;
> > > + int key;
> > > + bool success = true;
> > > +
> > > + trace_kvm_age_hva(start, end);
> > > +
> > > + key = srcu_read_lock(&kvm->srcu);
> > > +
> > > + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > > + struct interval_tree_node *node;
> > > + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > > +
> > > + kvm_for_each_memslot_in_hva_range(node, slots, start, end - 1) {
> > > + gfn_t lsb_gfn;
> > > + unsigned long hva_start, hva_end;
> > > + struct kvm_gfn_range range = {
> > > + .slot = container_of(node, struct kvm_memory_slot,
> > > + hva_node[slots->node_idx]),
> > > + };
> > > +
> > > + hva_start = max(start, range.slot->userspace_addr);
> > > + hva_end = min(end - 1, range.slot->userspace_addr +
> > > + range.slot->npages * PAGE_SIZE - 1);
> > > +
> > > + range.start = hva_to_gfn_memslot(hva_start, range.slot);
> > > + range.end = hva_to_gfn_memslot(hva_end, range.slot) + 1;
> > > +
> > > + if (WARN_ON_ONCE(range.end <= range.start))
> > > + continue;
> >
> > Extend __kvm_handle_hva_range() instead of copy-pasting. At a very quick glance,
> > I believe all that is needed is (minus sanity checks):
>
> Yes, will do.
>
> I do need to add one more parameter to kvm_gfn_range, because that's
> what the current kvm_arch_test_clear_young() needs, assuming that
> function is acceptable.
>
> Also, just a side note, from MM's POV, the following in
> __kvm_handle_hva_range() seems to forget to handle end == 0, if that's
> possible?

It's handled by the WARN_ON_ONCE() at the very top:

static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
						  const struct kvm_hva_range *range)
{
	if (WARN_ON_ONCE(range->end <= range->start))
		return 0;


>
> hva_end = min(range->end, slot->userspace_addr + (slot->npages <<
> PAGE_SHIFT));