2024-06-11 00:22:19

by James Houghton

Subject: [PATCH v5 0/9] mm: multi-gen LRU: Walk secondary MMU page tables while aging

This patchset makes it possible for MGLRU to consult secondary MMUs
while doing aging, not just during eviction. This allows for more
accurate reclaim decisions, which is especially important for proactive
reclaim.

This series does the following:
1. Improve locking for the existing test/clear_young notifiers for x86
and arm64.
2. Add a new notifier test_clear_young_fast_only(), implemented only by
KVM/x86.
3. Incorporate test_clear_young_fast_only() into MGLRU aging.

To make aging work for architectures other than x86, the
test_clear_young_fast_only() notifier must be implemented by those
architectures as well.
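
For reference, wiring the new notifier into a secondary MMU boils down
to something like the sketch below. The ops hook and the return bits
come from patch 4 of this series; the my_mmu_*() helpers are
hypothetical stand-ins for the architecture-specific work:

static int my_mmu_test_clear_young_fast_only(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end,
					     bool clear)
{
	/* Hypothetical check: bail out if aging cannot be done locklessly. */
	if (!my_mmu_can_age_locklessly())
		return MMU_NOTIFIER_FAST_FAILED;

	/* Hypothetical helper: test (and optionally clear) the accessed bits. */
	return my_mmu_age(mm, start, end, clear) ? MMU_NOTIFIER_FAST_YOUNG : 0;
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	/* ...existing callbacks... */
	.test_clear_young_fast_only	= my_mmu_test_clear_young_fast_only,
};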

access_tracking_perf_test now has a mode (-p) to check performance of
MGLRU aging while the VM is faulting memory in. See the v4 cover
letter[1] for performance data collected with this test.

Previous versions of this series included logic in MGLRU and KVM to
support batching the updates to secondary page tables. This version
removes this logic, as it was complex and not necessary to enable
proactive reclaim. This optimization, as well as the additional
optimizations for arm64 and powerpc, can be done in a later series.

=== Previous Versions ===

This v5 re-adds a lot of logic that was present in v3 and earlier
versions of the series. There is an important difference I want to call
out:

- should_look_around() can sometimes require two notifiers instead of
one. This is necessary because I have chosen not to modify
mmu_notifier_clear_young().

It may simply be better to do what v2/v3 did and not have a fast-only
notifier at all, merging everything into the existing notifiers; that
makes their API slightly more complex. I'm not sure which approach is
better.

Change log:

Since v4[1]:
- Removed Kconfig that controlled when aging was enabled. Aging will
be done whenever the architecture supports it (thanks Yu).
- Added a new MMU notifier, test_clear_young_fast_only(), specifically
for MGLRU to use.
- Added kvm_fast_{test_,}age_gfn, implemented by x86.
- Fixed locking for clear_flush_young().
- Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
(thanks Sean).
- Fixed a WARN_ON and did other cleanup for the arm64 locking changes
(thanks Oliver).

Since v3[2]:
- Vastly simplified the series (thanks David). Removed mmu notifier
batching logic entirely.
- Cleaned up how locking is done for mmu_notifier_test/clear_young
(thanks David).
- Look-around is now only done when there are no secondary MMUs
subscribed to MMU notifiers.
- CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
- Fixed the lockless implementation of kvm_{test,}age_gfn for x86
(thanks David).
- Added MGLRU functional and performance tests to
access_tracking_perf_test (thanks Axel).
- In v3, an mm would be completely ignored (for aging) if there was a
secondary MMU but support for secondary MMU walking was missing. Now,
missing secondary MMU walking support simply skips the notifier
calls (except for eviction).
- Added a sanity check that range->lockless and range->on_lock are
never both provided for the memslot walk.

For the changes since v2[3], see v3.

Based on 6.10-rc3.

[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://lore.kernel.org/linux-mm/[email protected]/
[3]: https://lore.kernel.org/kvmarm/[email protected]/

James Houghton (8):
KVM: Add lockless memslot walk to KVM
KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
mm: Add test_clear_young_fast_only MMU notifier
KVM: Add kvm_fast_age_gfn and kvm_fast_test_age_gfn
KVM: x86: Implement kvm_fast_test_age_gfn and kvm_fast_age_gfn
mm: multi-gen LRU: Have secondary MMUs participate in aging
KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Yu Zhao (1):
KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/hyp/pgtable.c | 15 +-
arch/arm64/kvm/mmu.c | 26 +-
arch/x86/include/asm/kvm_host.h | 14 +
arch/x86/kvm/Kconfig | 2 +
arch/x86/kvm/mmu.h | 6 -
arch/x86/kvm/mmu/mmu.c | 60 ++-
arch/x86/kvm/mmu/spte.h | 1 -
arch/x86/kvm/mmu/tdp_iter.h | 27 +-
arch/x86/kvm/mmu/tdp_mmu.c | 67 ++-
include/linux/kvm_host.h | 8 +
include/linux/mmu_notifier.h | 50 +++
include/linux/mmzone.h | 6 +-
include/trace/events/kvm.h | 22 +
mm/mmu_notifier.c | 26 ++
mm/rmap.c | 9 +-
mm/vmscan.c | 185 +++++++--
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
virt/kvm/Kconfig | 7 +
virt/kvm/kvm_main.c | 73 +++-
24 files changed, 1283 insertions(+), 140 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
--
2.45.2.505.gda0bf45e8d-goog



2024-06-11 00:22:57

by James Houghton

Subject: [PATCH v5 2/9] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Walk the TDP MMU in an RCU read-side critical section. This requires a
way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
macro. The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for
the fact that kvm_age_gfn can now locklessly update the accessed bit
and the R/X bits.

If the cmpxchg for marking the spte for access tracking fails, we simply
retry if the spte is still a leaf PTE. If it isn't, we return false
to continue the walk.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 10 ++++-
arch/x86/kvm/mmu/tdp_iter.h | 27 +++++++------
arch/x86/kvm/mmu/tdp_mmu.c | 67 +++++++++++++++++++++++++--------
5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8ca74e7678f..011c8eb7c8d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1438,6 +1438,7 @@ struct kvm_arch {
* tdp_mmu_page set.
*
* For reads, this list is protected by:
+ * RCU alone or
* the MMU lock in read mode + RCU or
* the MMU lock in write mode
*
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fec95a770270..9dda7f8c72ed 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
depends on X86_LOCAL_APIC
select KVM_COMMON
select KVM_GENERIC_MMU_NOTIFIER
+ select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
select HAVE_KVM_IRQCHIP
select HAVE_KVM_PFNCACHE
select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d74bdef68c1..51061f1fb3d1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..510936a8455a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
return xchg(rcu_dereference(sptep), new_spte);
}

+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+ atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+ return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
{
KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
}

/*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
+ * cleared by age_gfn_range.
*
* Note, non-leaf SPTEs do have Accessed bits and those bits are
* technically volatile, but KVM doesn't consume the Accessed bit of
@@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
{
return is_shadow_present_pte(old_spte) &&
- is_last_spte(old_spte, level) &&
- spte_has_volatile_bits(old_spte);
+ is_last_spte(old_spte, level);
}

static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
u64 mask, int level)
{
- atomic64_t *sptep_atomic;
-
- if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
- sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
- return (u64)atomic64_fetch_and(~mask, sptep_atomic);
- }
+ if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+ return tdp_mmu_clear_spte_bits_atomic(sptep, mask);

__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
return old_spte;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36539c1b36cd..46abd04914c2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,

return true;
}
+static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return true;
+}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
@@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
((_only_valid) && (_root)->role.invalid))) { \
} else

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id) \
+ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_rcu_read_lock_held() && \
+ (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) { \
+ } else
+
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)

@@ -1223,6 +1237,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
return ret;
}

+static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
+ struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ tdp_handler_t handler)
+{
+ struct kvm_mmu_page *root;
+ struct tdp_iter iter;
+ bool ret = false;
+
+ rcu_read_lock();
+
+ for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
+ tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ ret |= handler(kvm, &iter, range);
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
/*
* Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
* if any of the GFNs in the range have been accessed.
@@ -1236,28 +1271,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
{
u64 new_spte;

+retry:
/* If we have a non-accessed entry we don't need to change the pte. */
if (!is_accessed_spte(iter->old_spte))
return false;

if (spte_ad_enabled(iter->old_spte)) {
- iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
- iter->old_spte,
- shadow_accessed_mask,
- iter->level);
+ iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+ shadow_accessed_mask);
new_spte = iter->old_spte & ~shadow_accessed_mask;
} else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
+ new_spte = mark_spte_for_access_track(iter->old_spte);
+ if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
+ /*
+ * The cmpxchg failed. If the spte is still a
+ * last-level spte, we can safely retry.
+ */
+ if (is_shadow_present_pte(iter->old_spte) &&
+ is_last_spte(iter->old_spte, iter->level))
+ goto retry;
+ /* Otherwise, continue walking. */
+ return false;
+ }
if (is_writable_pte(iter->old_spte))
kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
- new_spte = mark_spte_for_access_track(iter->old_spte);
- iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
- iter->old_spte, new_spte,
- iter->level);
}

trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1267,7 +1304,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range);
}

static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1278,7 +1315,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn);
}

/*
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:23:17

by James Houghton

Subject: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Replace the MMU write locks (taken in the memslot iteration loop) with
read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..7a1af8141c0e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,6 +22,7 @@ menuconfig KVM
select KVM_COMMON
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_GENERIC_MMU_NOTIFIER
+ select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..b1b0f7148cff 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
data->young = true;

/*
- * stage2_age_walker() is always called while holding the MMU lock for
- * write, so this will always succeed. Nonetheless, this deliberately
- * follows the race detection pattern of the other stage-2 walkers in
- * case the locking mechanics of the MMU notifiers is ever changed.
+ * This walk may not be exclusive; the PTE is permitted to change
+ * from under us. If there is a race to update this PTE, then the
+ * GFN is most likely young, so failing to clear the AF is likely
+ * to be inconsequential.
*/
if (data->mkold && !stage2_try_set_pte(ctx, new))
return -EAGAIN;
@@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
struct kvm_pgtable_walker walker = {
.cb = stage2_age_walker,
.arg = &data,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_SHARED,
};
+ int r;

- WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+ r = kvm_pgtable_walk(pgt, addr, size, &walker);
+ WARN_ON(r && r != -EAGAIN);
return data.young;
}

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8bcab0cc3fe9..a62c27a347ed 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, true);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, true);
+
+out:
+ read_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, false);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, false);
+
+out:
+ read_unlock(&kvm->mmu_lock);
+ return young;
}

phys_addr_t kvm_mmu_get_httbr(void)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:06

by James Houghton

Subject: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

This new notifier is for multi-gen LRU specifically, as it wants to be
able to get and clear age information from secondary MMUs, but only if
it can be done "fast".

Because this notifier is created specifically for MGLRU, what "fast"
means comes down to what is fast enough to improve MGLRU's reclaim
decisions most of the time.
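
As a sketch of the intended calling convention (this mirrors how MGLRU
ends up consuming the notifier later in this series; the wrapper
function itself is hypothetical):

static bool secondary_mmus_saw_access(struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	int ret;

	/* Test and clear the accessed bits in any secondary MMUs, fast-only. */
	ret = mmu_notifier_test_clear_young_fast_only(mm, start, end,
						      /*clear=*/true);

	/*
	 * A secondary MMU could not answer quickly; the caller must fall
	 * back to the existing notifiers (or treat the range as not young).
	 */
	if (ret & MMU_NOTIFIER_FAST_FAILED)
		return false;

	return ret & MMU_NOTIFIER_FAST_YOUNG;
}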

Signed-off-by: James Houghton <[email protected]>
---
include/linux/mmu_notifier.h | 50 ++++++++++++++++++++++++++++++++++++
mm/mmu_notifier.c | 26 +++++++++++++++++++
2 files changed, 76 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d39ebb10caeb..2655d841a409 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ enum mmu_notifier_event {

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+/*
+ * Bits in the return value for test_clear_young_fast_only.
+ *
+ * MMU_NOTIFIER_FAST_YOUNG: notifier succeeded, secondary MMU reports young.
+ * MMU_NOTIFIER_FAST_FAILED: notifier failed.
+ */
+#define MMU_NOTIFIER_FAST_YOUNG (1 << 0)
+#define MMU_NOTIFIER_FAST_FAILED (1 << 1)
+
struct mmu_notifier_ops {
/*
* Called either by mmu_notifier_unregister or when the mm is
@@ -122,6 +131,24 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ /*
+ * test_clear_young_fast_only is called to check (and optionally clear)
+ * the young/accessed bitflag in the secondary pte such that the
+ * secondary MMU must implement it in a way that will not significantly
+ * disrupt other MMU operations. In other words, speed is more
+ * important than accuracy.
+ *
+ * Returns MMU_NOTIFIER_FAST_YOUNG if the secondary pte(s) were young.
+ * Returns MMU_NOTIFIER_FAST_FAILED if the secondary MMU could not do
+ * an accurate fast-only test and/or clear of the young/accessed
+ * flag.
+ */
+ int (*test_clear_young_fast_only)(struct mmu_notifier *subscription,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear);
+
/*
* invalidate_range_start() and invalidate_range_end() must be
* paired and are called only when the mmap_lock and/or the
@@ -383,6 +410,10 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
+extern int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear);
extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -428,6 +459,17 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_test_clear_young_fast_only(mm, start, end,
+ clear);
+ return 0;
+}
+
static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
@@ -612,6 +654,14 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ return 0;
+}
+
static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 8982e6139d07..7b77ad6cf833 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -424,6 +424,32 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

+int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ struct mmu_notifier *subscription;
+ int ret = 0, id;
+
+ id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(subscription,
+ &mm->notifier_subscriptions->list, hlist,
+ srcu_read_lock_held(&srcu)) {
+ if (subscription->ops->test_clear_young_fast_only) {
+ ret = subscription->ops->test_clear_young_fast_only(
+ subscription, mm, start, end, clear);
+ if (ret & MMU_NOTIFIER_FAST_FAILED)
+ break;
+ if (!clear && (ret & MMU_NOTIFIER_FAST_YOUNG))
+ break;
+ }
+ }
+ srcu_read_unlock(&srcu, id);
+
+ return ret;
+}
+
static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
const struct mmu_notifier_range *range)
{
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:11

by James Houghton

Subject: [PATCH v5 5/9] KVM: Add kvm_fast_age_gfn and kvm_fast_test_age_gfn

Provide the basics for allowing architectures to implement
mmu_notifier_test_clear_young_fast_only().

Add CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER that architectures will set
if they implement the fast-only notifier.

kvm_fast_age_gfn and kvm_fast_test_age_gfn both need to support
returning a tri-state result of:
1. fast && young,
2. fast && !young,
3. !fast
This could be done by making gfn_handler_t return int, but that would
mean a lot of churn. Instead, include a new kvm_mmu_notifier_arg
'bool *failed' for kvm_fast_{test_,}age_gfn to optionally use.
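
For example, an architecture's fast-only handlers are expected to look
roughly like the sketch below. The my_arch_*() helpers are
hypothetical; x86's real implementation later in this series uses the
presence of rmaps (i.e., the shadow MMU) as its "can this be done fast"
check:

bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	/* Hypothetical check: can aging be done without taking mmu_lock? */
	if (!my_arch_can_age_fast(kvm)) {
		*range->arg.failed = true;
		return false;
	}

	/* Hypothetical helper: lockless equivalent of kvm_test_age_gfn(). */
	return my_arch_test_age_fast(kvm, range);
}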

Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 7 ++++++
include/trace/events/kvm.h | 22 ++++++++++++++++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/kvm_main.c | 47 ++++++++++++++++++++++++++++++--------
4 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4d7c3e8632e6..e4efeba51222 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,9 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
union kvm_mmu_notifier_arg {
unsigned long attributes;
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ bool *failed;
+#endif
};

struct kvm_gfn_range {
@@ -271,7 +274,11 @@ struct kvm_gfn_range {
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+bool kvm_fast_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
#endif
+#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */

enum {
OUTSIDE_GUEST_MODE,
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af4..7ba6c35c2426 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,6 +489,28 @@ TRACE_EVENT(kvm_test_age_hva,
TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
);

+TRACE_EVENT(kvm_fast_test_age_hva,
+ TP_PROTO(unsigned long start, unsigned long end, bool clear),
+ TP_ARGS(start, end, clear),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, start )
+ __field( unsigned long, end )
+ __field( bool, clear )
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ __entry->end = end;
+ __entry->clear = clear;
+ ),
+
+ TP_printk("mmu notifier fast test age: hva: %#016lx -- %#016lx "
+ "clear: %d",
+ __entry->start, __entry->end,
+ __entry->clear)
+);
+
#endif /* _TRACE_KVM_MAIN_H */

/* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 0404857c1702..77ac680af60c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,6 +100,10 @@ config KVM_GENERIC_MMU_NOTIFIER
config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
bool

+config HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
+ bool
+
config KVM_GENERIC_MEMORY_ATTRIBUTES
depends on KVM_GENERIC_MMU_NOTIFIER
bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d8fa0d617f12..aa930a8b903f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -699,7 +699,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
- gfn_handler_t handler)
+ gfn_handler_t handler,
+ bool *failed)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
@@ -711,6 +712,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.may_block = false,
.lockless =
IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+ .arg.failed = failed,
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -901,7 +903,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
* cadence. If we find this inaccurate, we might come up with a
* more sophisticated heuristic later.
*/
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn, NULL);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -911,9 +913,32 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
trace_kvm_test_age_hva(address);

return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
+ kvm_test_age_gfn, NULL);
}

+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+static int kvm_mmu_notifier_test_clear_young_fast_only(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ gfn_handler_t handler;
+ bool failed = false, young;
+
+ trace_kvm_fast_test_age_hva(start, end, clear);
+
+ handler = clear ? kvm_fast_age_gfn : kvm_fast_test_age_gfn;
+
+ young = kvm_handle_hva_range_no_flush(mn, start, end, handler, &failed);
+
+ if (failed)
+ return MMU_NOTIFIER_FAST_FAILED;
+
+ return young ? MMU_NOTIFIER_FAST_YOUNG : 0;
+}
+#endif
+
static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
@@ -926,12 +951,16 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
- .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
- .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
- .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
- .clear_young = kvm_mmu_notifier_clear_young,
- .test_young = kvm_mmu_notifier_test_young,
- .release = kvm_mmu_notifier_release,
+ .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
+ .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+ .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+ .clear_young = kvm_mmu_notifier_clear_young,
+ .test_young = kvm_mmu_notifier_test_young,
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ .test_clear_young_fast_only =
+ kvm_mmu_notifier_test_clear_young_fast_only,
+#endif
+ .release = kvm_mmu_notifier_release,
};

static int kvm_init_mmu_notifier(struct kvm *kvm)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:33

by James Houghton

Subject: [PATCH v5 7/9] KVM: x86: Implement kvm_fast_test_age_gfn and kvm_fast_age_gfn

The fast-only notifier will only report an accurate result when the
shadow MMU is not in use.

Implement kvm_arch_young_notifier_likely_fast(); MGLRU will check this
function to see if it should even attempt the fast-only notifier. We
only want to attempt the notifier if there is a chance that it will
succeed (i.e., if we're using the TDP MMU).
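
For context, the MGLRU side (added later in this series) gates on this
helper roughly as follows:

static bool should_walk_secondary_mmu(void)
{
	/* Only attempt the fast-only notifier if it can plausibly succeed. */
	return kvm_arch_young_notifier_likely_fast() &&
	       get_cap(LRU_GEN_SECONDARY_MMU_WALK);
}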

Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 +++++
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++++++++++++++---
3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0dc1fa99cdbb..ca2fbc162e51 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2338,4 +2338,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+#define kvm_arch_young_notifier_likely_fast kvm_arch_young_notifier_likely_fast
+static inline bool kvm_arch_young_notifier_likely_fast(void)
+{
+ return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled &&
+ shadow_accessed_mask;
+}
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 9dda7f8c72ed..84ae043c7d43 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,6 +24,7 @@ config KVM
select KVM_COMMON
select KVM_GENERIC_MMU_NOTIFIER
select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+ select HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
select HAVE_KVM_IRQCHIP
select HAVE_KVM_PFNCACHE
select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 51061f1fb3d1..ed50e78755ab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1629,11 +1629,15 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
}

-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool fast_only)
{
bool young = false;

if (kvm_memslots_have_rmaps(kvm)) {
+ if (fast_only)
+ return -1;
+
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
@@ -1642,14 +1646,18 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

- return young;
+ return (int)young;
}

-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool fast_only)
{
bool young = false;

if (kvm_memslots_have_rmaps(kvm)) {
+ if (fast_only)
+ return -1;
+
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
write_unlock(&kvm->mmu_lock);
@@ -1658,7 +1666,41 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

- return young;
+ return (int)young;
+}
+
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_age_gfn(kvm, range, false);
+}
+
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_test_age_gfn(kvm, range, false);
+}
+
+bool kvm_fast_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ int ret = __kvm_age_gfn(kvm, range, true);
+
+ if (ret < 0) {
+ *range->arg.failed = true;
+ return false;
+ }
+
+ return ret != 0;
+}
+
+bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ int ret = __kvm_test_age_gfn(kvm, range, true);
+
+ if (ret < 0) {
+ *range->arg.failed = true;
+ return false;
+ }
+
+ return ret != 0;
}

static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:34

by James Houghton

Subject: [PATCH v5 6/9] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

From: Yu Zhao <[email protected]>

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_young_notifier_likely_fast().

Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/mmu.h | 6 ------
arch/x86/kvm/mmu/spte.h | 1 -
3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 011c8eb7c8d3..0dc1fa99cdbb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1855,6 +1855,7 @@ struct kvm_arch_async_pf {

extern u32 __read_mostly kvm_nr_uret_msrs;
extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
extern bool __read_mostly allow_smaller_maxphyaddr;
extern bool __read_mostly enable_apicv;
extern struct kvm_x86_ops kvm_x86_ops;
@@ -1960,6 +1961,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
bool mask);

extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif

u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2e454316f2a2..267f72b065f5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -271,12 +271,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}

-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 52fa004a1fbc..9ca6d80fb86c 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -172,7 +172,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:25:08

by James Houghton

Subject: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

Secondary MMUs are currently consulted for access/age information at
eviction time, but before then, we don't get accurate age information.
That is, pages that are mostly accessed through a secondary MMU (like
guest memory, used by KVM) will always just proceed down to the oldest
generation, and then at eviction time, if KVM reports the page to be
young, the page will be activated/promoted back to the youngest
generation.

If the added feature bit (0x8) is disabled, MGLRU will behave as if
there are no secondary MMUs subscribed to MMU notifiers, except at
eviction time.

Implement aging with the new mmu_notifier_test_clear_young_fast_only()
notifier. For architectures that do not support this notifier, this
becomes a no-op. For architectures that do implement it, it should be
fast enough to make aging worth it.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---

Notes:
should_look_around() can sometimes use two notifiers now instead of one.

This simply comes from restricting myself from changing
mmu_notifier_clear_young() to return more than just "young or not".

I could change mmu_notifier_clear_young() (and
mmu_notifier_test_young()) to return whether or not the result was
gathered fast. At that point, I could just as well combine all the
notifiers into one notifier, like what was done in v2 and v3.

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
include/linux/mmzone.h | 6 +-
mm/rmap.c | 9 +-
mm/vmscan.c | 185 ++++++++++++++----
4 files changed, 164 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..1e578e0c4c0c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Continuously clear the accessed bit in secondary MMU page
+ tables instead of waiting until eviction time. This results in
+ accurate page age information for pages that are mainly used by
+ a secondary MMU.
[yYnN] Apply to all the components above.
====== ===============================================================

@@ -56,7 +60,7 @@ E.g.,

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
- 0x0007
+ 0x000f
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..869824ef5f3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -400,6 +400,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_SECONDARY_MMU_WALK,
NR_LRU_GEN_CAPS
};

@@ -557,7 +558,7 @@ struct lru_gen_memcg {

void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +577,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..24a3ff639919 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
continue;
}

- if (pvmw.pte) {
- if (lru_gen_enabled() &&
- pte_young(ptep_get(pvmw.pte))) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..348f3ffc8d5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2579,6 +2580,21 @@ static bool should_clear_pmd_young(void)
return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+#include <linux/kvm_host.h>
+static bool should_walk_secondary_mmu(void)
+{
+ return kvm_arch_young_notifier_likely_fast() &&
+ get_cap(LRU_GEN_SECONDARY_MMU_WALK);
+}
+#else
+static bool should_walk_secondary_mmu(void)
+{
+ return false;
+}
+#endif
+
+
/******************************************************************************
* shorthand helpers
******************************************************************************/
@@ -3276,7 +3292,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
return false;
}

-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pte_pfn(pte);

@@ -3291,10 +3308,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pmd_pfn(pmd);

@@ -3309,6 +3331,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

@@ -3317,10 +3343,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
{
struct folio *folio;

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- return NULL;
-
folio = pfn_folio(pfn);
if (folio_nid(folio) != pgdat->node_id)
return NULL;
@@ -3343,6 +3365,43 @@ static bool suitable_to_scan(int total, int young)
return young * n >= total;
}

+static bool lru_gen_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ return should_walk_secondary_mmu() &&
+ (mmu_notifier_test_clear_young_fast_only(
+ mm, start, end, clear) &
+ MMU_NOTIFIER_FAST_YOUNG);
+}
+
+static bool lru_gen_notifier_test_young(struct mm_struct *mm,
+ unsigned long addr)
+{
+ return lru_gen_notifier_test_clear_young(mm, addr, addr + PAGE_SIZE,
+ false);
+}
+
+static bool lru_gen_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return lru_gen_notifier_test_clear_young(mm, start, end, true);
+}
+
+static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr,
+ pmd_t *pmd)
+{
+ bool young = pmdp_test_and_clear_young(vma, addr, pmd);
+
+ if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE))
+ young = true;
+
+ return young;
+}
+
static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct mm_walk *args)
{
@@ -3357,8 +3416,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = args->mm;

- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
@@ -3376,11 +3436,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
total++;
walk->mm_stats[MM_LEAF_TOTAL]++;

- pfn = get_pte_pfn(ptent, args->vma, addr);
+ pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent)) {
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr)) {
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (!folio)
continue;

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3456,22 +3518,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;

- pfn = get_pmd_pfn(pmd[i], vma, addr);
- if (pfn == -1)
- goto next;
-
- if (!pmd_trans_huge(pmd[i])) {
- if (should_clear_pmd_young())
+ if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+ if (should_clear_pmd_young() &&
+ !should_walk_secondary_mmu())
pmdp_test_and_clear_young(vma, addr, pmd + i);
goto next;
}

+ pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+ if (pfn == -1)
+ goto next;
+
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
goto next;

- if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
goto next;
+ }

walk->mm_stats[MM_LEAF_YOUNG]++;

@@ -3528,19 +3593,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
}

if (pmd_trans_huge(val)) {
- unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);

walk->mm_stats[MM_LEAF_TOTAL]++;

- if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
+ if (pfn == -1)
continue;
- }

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
continue;
+ }

walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
@@ -3548,7 +3612,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,

walk->mm_stats[MM_NONLEAF_TOTAL]++;

- if (should_clear_pmd_young()) {
+ if (should_clear_pmd_young() && !should_walk_secondary_mmu()) {
if (!pmd_young(val))
continue;

@@ -3994,6 +4058,47 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ int notifier_result = MMU_NOTIFIER_FAST_FAILED;
+ bool notifier_was_fast = false;
+ bool secondary_young = false;
+
+ if (should_walk_secondary_mmu()) {
+ notifier_result =
+ mmu_notifier_test_clear_young_fast_only(
+ vma->vm_mm, addr, addr + PAGE_SIZE,
+ /*clear=*/true);
+ }
+
+ if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
+ secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
+ addr + PAGE_SIZE);
+ else {
+ secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
+ notifier_was_fast = true;
+ }
+
+ /*
+ * Look around if (1) the PTE is young or (2) the secondary PTE was
+ * young and the results were gathered fast (so look-around will
+ * probably be accurate).
+ */
+ if (pte_young(ptep_get(pte))) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return true;
+ }
+
+ if (secondary_young) {
+ *young = true;
+ return notifier_was_fast;
+ }
+
+ return false;
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4001,7 +4106,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
@@ -4019,16 +4124,20 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
DEFINE_MAX_SEQ(lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = pvmw->vma->vm_mm;

lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4036,6 +4145,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, vma->vm_start);
end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4049,7 +4161,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

@@ -4059,19 +4171,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);

- pfn = get_pte_pfn(ptent, vma, addr);
+ pfn = get_pte_pfn(ptent, vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent))
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr))
continue;

folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
if (!folio)
continue;

- if (!ptep_test_and_clear_young(vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(vma, addr, pte + i);

young++;

@@ -4101,6 +4215,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5137,6 +5253,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
if (should_clear_pmd_young())
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (should_walk_secondary_mmu())
+ caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:25:40

by James Houghton

Subject: [PATCH v5 9/9] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

This test now has two modes of operation:
1. (default) To check how much vCPU performance is affected by access
tracking (this mode already existed; it now also supports MGLRU aging).
2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
faulting in memory.

Mode (1) also serves as a way to verify that aging is working properly
for pages only accessed by KVM. It will fail if the 0x8 lru_gen feature
bit is not present.

To support MGLRU, the test creates a memory cgroup, moves itself into
it, then uses the lru_gen debugfs output to track memory in that cgroup.
The logic to parse the lru_gen debugfs output has been put into
selftests/kvm/lib/lru_gen_util.c.

Co-developed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
4 files changed, 782 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ac280dcba996..f34b4cdf6524 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
LIBKVM += lib/guest_modes.c
LIBKVM += lib/io.c
LIBKVM += lib/kvm_util.c
+LIBKVM += lib/lru_gen_util.c
LIBKVM += lib/memstress.c
LIBKVM += lib/guest_sprintf.c
LIBKVM += lib/rbtree.c
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..15be99ff3bdc 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -38,6 +38,7 @@
#include <inttypes.h>
#include <limits.h>
#include <pthread.h>
+#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
@@ -47,6 +48,20 @@
#include "memstress.h"
#include "guest_modes.h"
#include "processor.h"
+#include "lru_gen_util.h"
+
+static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
+static const int LRU_GEN_ENABLED = 0x1;
+static const int LRU_GEN_MM_WALK = 0x2;
+static const int LRU_GEN_SECONDARY_MMU_WALK = 0x8;
+static const char *CGROUP_PROCS = "cgroup.procs";
+/*
+ * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
+ * is mounted at cgroup_root.
+ *
+ * Can be changed with -r.
+ */
+static const char *cgroup_root = "/sys/fs/cgroup";

/* Global variable used to synchronize all of the vCPU threads. */
static int iteration;
@@ -62,6 +77,9 @@ static enum {
/* The iteration that was last completed by each vCPU. */
static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];

+/* The time at which the last iteration was completed */
+static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
+
/* Whether to overlap the regions of memory vCPUs access. */
static bool overlap_memory_access;

@@ -74,6 +92,12 @@ struct test_params {

/* The number of vCPUs to create in the VM. */
int nr_vcpus;
+
+ /* Whether to use lru_gen aging instead of idle page tracking. */
+ bool lru_gen;
+
+ /* Whether to test the performance of aging itself. */
+ bool benchmark_lru_gen;
};

static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
@@ -89,6 +113,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)

}

+static void write_file_long(const char *path, long v)
+{
+ FILE *f;
+
+ f = fopen(path, "w");
+ TEST_ASSERT(f, "fopen(%s) failed", path);
+ TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
+ "fprintf to %s failed", path);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
+}
+
+static char *path_join(const char *parent, const char *child)
+{
+ char *out = NULL;
+
+ return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
+}
+
+static char *memcg_path(const char *memcg)
+{
+ return path_join(cgroup_root, memcg);
+}
+
+static char *memcg_file_path(const char *memcg, const char *file)
+{
+ char *mp = memcg_path(memcg);
+ char *fp;
+
+ if (!mp)
+ return NULL;
+ fp = path_join(mp, file);
+ free(mp);
+ return fp;
+}
+
+static void move_to_memcg(const char *memcg, pid_t pid)
+{
+ char *procs = memcg_file_path(memcg, CGROUP_PROCS);
+
+ TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
+ write_file_long(procs, pid);
+ free(procs);
+}
+
#define PAGEMAP_PRESENT (1ULL << 63)
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)

@@ -242,6 +310,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
};

vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+ clock_gettime(CLOCK_MONOTONIC,
+ &vcpu_last_completed_time[vcpu_idx]);
}
}

@@ -253,38 +323,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
}
}

+static bool all_vcpus_done(int target_iteration, int nr_vcpus)
+{
+ for (int i = 0; i < nr_vcpus; ++i)
+ if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+ target_iteration)
+ return false;
+
+ return true;
+}
+
/* The type of memory accesses to perform in the VM. */
enum access_type {
ACCESS_READ,
ACCESS_WRITE,
};

-static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
+static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
+ bool wait)
{
- struct timespec ts_start;
- struct timespec ts_elapsed;
int next_iteration, i;

/* Kick off the vCPUs by incrementing iteration. */
next_iteration = ++iteration;

- clock_gettime(CLOCK_MONOTONIC, &ts_start);
-
/* Wait for all vCPUs to finish the iteration. */
- for (i = 0; i < nr_vcpus; i++)
- spin_wait_for_vcpu(i, next_iteration);
+ if (wait) {
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);

- ts_elapsed = timespec_elapsed(ts_start);
- pr_info("%-30s: %ld.%09lds\n",
- description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ for (i = 0; i < nr_vcpus; i++)
+ spin_wait_for_vcpu(i, next_iteration);
+
+ ts_elapsed = timespec_elapsed(ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n",
+ description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ } else
+ pr_info("%-30s\n", description);
}

-static void access_memory(struct kvm_vm *vm, int nr_vcpus,
- enum access_type access, const char *description)
+static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description,
+ bool wait)
{
memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
iteration_work = ITERATION_ACCESS_MEMORY;
- run_iteration(vm, nr_vcpus, description);
+ run_iteration(vm, nr_vcpus, description, wait);
+}
+
+static void access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, true);
+}
+
+static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access,
+ const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, false);
}

static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
@@ -297,19 +397,111 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
*/
pr_debug("Marking VM memory idle (slow)...\n");
iteration_work = ITERATION_MARK_IDLE;
- run_iteration(vm, nr_vcpus, "Mark memory idle");
+ run_iteration(vm, nr_vcpus, "Mark memory idle", true);
}

-static void run_test(enum vm_guest_mode mode, void *arg)
+static void create_memcg(const char *memcg)
+{
+ const char *full_memcg_path = memcg_path(memcg);
+ int ret;
+
+ TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
+retry:
+ ret = mkdir(full_memcg_path, 0755);
+ if (ret && errno == EEXIST) {
+ TEST_ASSERT(!rmdir(full_memcg_path),
+ "Found existing memcg at %s, but rmdir failed",
+ full_memcg_path);
+ goto retry;
+ }
+ TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
+ full_memcg_path);
+
+ pr_info("Created memcg at %s\n", full_memcg_path);
+}
+
+/*
+ * Test lru_gen aging speed while vCPUs are faulting memory in.
+ *
+ * This test will run lru_gen aging until the vCPUs have finished all of
+ * the faulting work, reporting:
+ * - vcpu wall time (wall time for slowest vCPU)
+ * - average aging pass duration
+ * - total number of aging passes
+ * - total time spent aging
+ *
+ * This test produces the most useful results when the vcpu wall time and the
+ * total time spent aging are similar (i.e., we want to avoid timing aging
+ * while the vCPUs aren't doing any work).
+ */
+static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
{
- struct test_params *params = arg;
- struct kvm_vm *vm;
int nr_vcpus = params->nr_vcpus;
+ struct memcg_stats stats;
+ struct timespec ts_start, ts_max, ts_vcpus_elapsed,
+ ts_aging_elapsed, ts_aging_elapsed_avg;
+ int num_passes = 0;

- vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
- params->backing_src, !overlap_memory_access);
+ printf("Running lru_gen benchmark...\n");

- memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ access_memory_async(vm, nr_vcpus, ACCESS_WRITE,
+ "Populating memory (async)");
+ while (!all_vcpus_done(iteration, nr_vcpus)) {
+ lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME);
+ ++num_passes;
+ }
+
+ ts_aging_elapsed = timespec_elapsed(ts_start);
+ ts_aging_elapsed_avg = timespec_div(ts_aging_elapsed, num_passes);
+
+ /* Find out when the slowest vCPU finished. */
+ ts_max = ts_start;
+ for (int i = 0; i < nr_vcpus; ++i) {
+ struct timespec *vcpu_ts = &vcpu_last_completed_time[i];
+
+ if (ts_max.tv_sec < vcpu_ts->tv_sec ||
+ (ts_max.tv_sec == vcpu_ts->tv_sec &&
+ ts_max.tv_nsec < vcpu_ts->tv_nsec))
+ ts_max = *vcpu_ts;
+ }
+
+ ts_vcpus_elapsed = timespec_sub(ts_max, ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n", "vcpu wall time",
+ ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec);
+
+ pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n",
+ "lru_gen avg pass duration",
+ ts_aging_elapsed_avg.tv_sec,
+ ts_aging_elapsed_avg.tv_nsec,
+ num_passes,
+ ts_aging_elapsed.tv_sec,
+ ts_aging_elapsed.tv_nsec);
+}
+
+/*
+ * Test how much access tracking affects vCPU performance.
+ *
+ * Supports two modes of access tracking:
+ * - idle page tracking
+ * - lru_gen aging
+ *
+ * When using lru_gen, this test additionally verifies that the pages are in
+ * fact getting younger and older, otherwise the performance data would be
+ * invalid.
+ *
+ * The forced lru_gen aging can race with aging that occurs naturally.
+ */
+static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
+{
+ int nr_vcpus = params->nr_vcpus;
+ bool lru_gen = params->lru_gen;
+ struct memcg_stats stats;
+ long total_pages = nr_vcpus * params->vcpu_memory_bytes / getpagesize();
+ int found_gens[5];

pr_info("\n");
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
@@ -319,11 +511,83 @@ static void run_test(enum vm_guest_mode mode, void *arg)
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");

/* Repeat on memory that has been marked as idle. */
- mark_memory_idle(vm, nr_vcpus);
+ if (lru_gen) {
+ /* Do an initial page table scan */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+ TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
+ "Not all pages tracked in lru_gen stats.\n"
+ "Is lru_gen enabled? Did the memcg get created properly?");
+
+ /* Find the generation we're currently in (probably youngest) */
+ found_gens[0] = lru_gen_find_generation(&stats, total_pages);
+
+ /* Do an aging pass now */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation, but a newer generation has been made */
+ found_gens[1] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[1] == found_gens[0],
+ "unexpected gen change: %d vs. %d",
+ found_gens[1], found_gens[0]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
- mark_memory_idle(vm, nr_vcpus);
+
+ if (lru_gen) {
+ /* Scan the page tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[2] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[2] > found_gens[1],
+ "pages did not get younger");
+
+ /* Do another aging pass */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation; new generation has been made */
+ found_gens[3] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[3] == found_gens[2],
+ "unexpected gen change: %d vs. %d",
+ found_gens[3], found_gens[2]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");

+ if (lru_gen) {
+ /* Scan the page tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[4] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[4] > found_gens[3],
+ "pages did not get younger");
+ }
+}
+
+static void setup_vm_and_run(enum vm_guest_mode mode, void *arg)
+{
+ struct test_params *params = arg;
+ int nr_vcpus = params->nr_vcpus;
+ struct kvm_vm *vm;
+
+ if (params->lru_gen) {
+ create_memcg(TEST_MEMCG_NAME);
+ move_to_memcg(TEST_MEMCG_NAME, getpid());
+ }
+
+ vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+ params->backing_src, !overlap_memory_access);
+
+ memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+
+ if (params->benchmark_lru_gen)
+ run_benchmark(mode, vm, params);
+ else
+ run_test(mode, vm, params);
+
memstress_join_vcpu_threads(nr_vcpus);
memstress_destroy_vm(vm);
}
@@ -331,8 +595,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
static void help(char *name)
{
puts("");
- printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o] [-s mem_type]\n",
- name);
+ printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]"
+ " [-s mem_type] [-l] [-r memcg_root]\n", name);
puts("");
printf(" -h: Display this help message.");
guest_modes_help();
@@ -342,6 +606,9 @@ static void help(char *name)
printf(" -v: specify the number of vCPUs to run.\n");
printf(" -o: Overlap guest memory accesses instead of partitioning\n"
" them into a separate region of memory for each vCPU.\n");
+ printf(" -l: Use MGLRU aging instead of idle page tracking\n");
+ printf(" -p: Benchmark MGLRU aging while faulting memory in\n");
+ printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n");
backing_src_help("-s");
puts("");
exit(0);
@@ -353,13 +620,15 @@ int main(int argc, char *argv[])
.backing_src = DEFAULT_VM_MEM_SRC,
.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
.nr_vcpus = 1,
+ .lru_gen = false,
+ .benchmark_lru_gen = false,
};
int page_idle_fd;
int opt;

guest_modes_append_default();

- while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+ while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
@@ -376,6 +645,15 @@ int main(int argc, char *argv[])
case 's':
params.backing_src = parse_backing_src_type(optarg);
break;
+ case 'l':
+ params.lru_gen = true;
+ break;
+ case 'p':
+ params.benchmark_lru_gen = true;
+ break;
+ case 'r':
+ cgroup_root = strdup(optarg);
+ break;
case 'h':
default:
help(argv[0]);
@@ -383,12 +661,39 @@ int main(int argc, char *argv[])
}
}

- page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
- __TEST_REQUIRE(page_idle_fd >= 0,
- "CONFIG_IDLE_PAGE_TRACKING is not enabled");
- close(page_idle_fd);
+ if (!params.lru_gen) {
+ page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+ __TEST_REQUIRE(page_idle_fd >= 0,
+ "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+ close(page_idle_fd);
+ } else {
+ int lru_gen_fd, lru_gen_debug_fd;
+ long mglru_features;
+ char mglru_feature_str[8] = {};
+
+ lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
+ __TEST_REQUIRE(lru_gen_fd >= 0,
+ "CONFIG_LRU_GEN is not enabled");
+ TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
+ "couldn't read lru_gen features");
+ mglru_features = strtol(mglru_feature_str, NULL, 16);
+ __TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
+ "lru_gen is not enabled");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
+ "lru_gen does not support MM_WALK");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_SECONDARY_MMU_WALK,
+ "lru_gen does not support SECONDARY_MMU_WALK");
+
+ lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
+ __TEST_REQUIRE(lru_gen_debug_fd >= 0,
+ "Cannot access %s", DEBUGFS_LRU_GEN);
+ close(lru_gen_debug_fd);
+ }
+
+ TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
+ "-p specified without -l");

- for_each_guest_mode(run_test, &params);
+ for_each_guest_mode(setup_vm_and_run, &params);

return 0;
}
diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
new file mode 100644
index 000000000000..4eef8085a3cb
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
+ *
+ * Copyright (C) 2024, Google LLC.
+ */
+#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
+#define SELFTEST_KVM_LRU_GEN_UTIL_H
+
+#include <inttypes.h>
+#include <limits.h>
+#include <stdlib.h>
+
+#include "test_util.h"
+
+#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
+#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
+
+static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
+
+struct generation_stats {
+ int gen;
+ long age_ms;
+ long nr_anon;
+ long nr_file;
+};
+
+struct node_stats {
+ int node;
+ int nr_gens; /* Number of populated gens entries. */
+ struct generation_stats gens[MAX_NR_GENS];
+};
+
+struct memcg_stats {
+ unsigned long memcg_id;
+ int nr_nodes; /* Number of populated nodes entries. */
+ struct node_stats nodes[MAX_NR_NODES];
+};
+
+void print_memcg_stats(const struct memcg_stats *stats, const char *name);
+
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+long sum_memcg_stats(const struct memcg_stats *stats);
+
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg);
+
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg);
+
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages);
+
+#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ * node (id)
+ * (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+ bool consumed; /* Whether or not this line was consumed */
+ /* Next parse handler to invoke */
+ void (*next_handler)(struct memcg_stats *,
+ struct memcg_stats_parse_context *, char *);
+ int current_node_idx; /* Current index in nodes array */
+ const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+
+struct split_iterator {
+ char *str;
+ char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+ char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+ it->str = NULL;
+ return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *memcg_id = split_next(&it);
+ char *memcg_name = split_next(&it);
+ char *end;
+
+ ctx->consumed = true;
+
+ if (!prefix || strcmp("memcg", prefix))
+ return; /* Not a memcg line (maybe empty), skip */
+
+ TEST_ASSERT(memcg_id && memcg_name,
+ "malformed memcg line; no memcg id or memcg_name");
+
+ if (strcmp(memcg_name + 1, ctx->name))
+ return; /* Wrong memcg, skip */
+
+ /* Found it! */
+
+ stats->memcg_id = strtoul(memcg_id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+ if (!stats->memcg_id)
+ return; /* Removed memcg? */
+
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+}
+
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *id = split_next(&it);
+ long found_node_id;
+ char *end;
+
+ ctx->consumed = true;
+ ctx->current_node_idx = -1;
+
+ if (!prefix)
+ return; /* Skip empty lines */
+
+ if (!strcmp("memcg", prefix)) {
+ /* Memcg done, found next one; stop. */
+ ctx->next_handler = NULL;
+ return;
+ } else if (strcmp("node", prefix))
+ TEST_ASSERT(false, "found malformed line after 'memcg ...',"
+ "token: '%s'", prefix);
+
+ /* At this point we know we have a node line. Parse the ID. */
+
+ TEST_ASSERT(id, "malformed node line; no node id");
+
+ found_node_id = strtol(id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+ ctx->current_node_idx = stats->nr_nodes++;
+ TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+ "memcg has stats for too many nodes, max is %d",
+ MAX_NR_NODES);
+ stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+ ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ /* Have to copy since we might not consume */
+ char *my_line = strdup(line);
+ struct split_iterator it = { .str = my_line };
+ char *gen, *age, *nr_anon, *nr_file;
+ struct node_stats *node_stats;
+ struct generation_stats *gen_stats;
+ char *end;
+
+ TEST_ASSERT(it.str, "failed to copy input line");
+
+ gen = split_next(&it);
+
+ /* Skip empty lines */
+ if (!gen)
+ goto out_consume;
+
+ if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+ /*
+ * Reached next memcg or node section. Don't consume, let the
+ * other handler deal with this.
+ */
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+ goto out;
+ }
+
+ node_stats = &stats->nodes[ctx->current_node_idx];
+ TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+ "found too many generation lines; max is %d",
+ MAX_NR_GENS);
+ gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+ age = split_next(&it);
+ nr_anon = split_next(&it);
+ nr_file = split_next(&it);
+
+ TEST_ASSERT(age && nr_anon && nr_file,
+ "malformed generation line; not enough tokens");
+
+ gen_stats->gen = (int)strtol(gen, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+ gen_stats->age_ms = strtol(age, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+ gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+ nr_anon);
+
+ gen_stats->nr_file = strtol(nr_file, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+ ctx->consumed = true;
+out:
+ free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+ int node, gen;
+
+ fprintf(stderr, "stats for memcg %s (id %lu):\n",
+ name, stats->memcg_id);
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ const struct generation_stats *gstats =
+ &stats->nodes[node].gens[gen];
+
+ fprintf(stderr,
+ "\t\tgen %d\tage_ms %ld"
+ "\tnr_anon %ld\tnr_file %ld\n",
+ gstats->gen, gstats->age_ms, gstats->nr_anon,
+ gstats->nr_file);
+ }
+ }
+}
+
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ FILE *f;
+ ssize_t read = 0;
+ char *line = NULL;
+ size_t bufsz;
+ struct memcg_stats_parse_context ctx = {
+ .next_handler = memcg_stats_handle_searching,
+ .name = memcg,
+ };
+
+ memset(stats, 0, sizeof(struct memcg_stats));
+
+ f = fopen(DEBUGFS_LRU_GEN, "r");
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+ while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+ ctx.consumed = false;
+
+ do {
+ ctx.next_handler(stats, &ctx, line);
+ if (!ctx.next_handler)
+ break;
+ } while (!ctx.consumed);
+ }
+
+ if (read < 0 && !feof(f))
+ TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+ TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+ "Did the memcg get created in the proper mount?",
+ memcg);
+ if (line)
+ free(line);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, look for all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+ const struct memcg_stats *stats)
+{
+ int node, gen;
+ long total_nr = 0;
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ const struct node_stats *node_stats = &stats->nodes[node];
+
+ for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+ const struct generation_stats *gen_stats =
+ &node_stats->gens[gen];
+
+ if (target_gen >= 0 && gen_stats->gen != target_gen)
+ continue;
+
+ total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+ }
+ }
+
+ return total_nr;
+}
+
+/* Find all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+ return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+ print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * Whether lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+ FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+ char *command;
+ size_t sz;
+
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+ sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+ memcg_id, node_id, max_gen, force_scan);
+ TEST_ASSERT(sz > 0, "creating aging command failed");
+
+ pr_debug("Running aging command: %s", command);
+ if (fwrite(command, sizeof(char), sz, f) < sz) {
+ TEST_ASSERT(false, "writing aging command %s to %s failed",
+ command, DEBUGFS_LRU_GEN);
+ }
+
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+ bool verbose)
+{
+ int node, gen;
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ pr_debug("lru_gen: invoking aging...\n");
+
+ /* Must read memcg stats to construct the proper aging command. */
+ read_print_memcg_stats(stats, memcg);
+
+ if (verbose)
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ int max_gen = 0;
+
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ int this_gen = stats->nodes[node].gens[gen].gen;
+
+ max_gen = max_gen > this_gen ? max_gen : this_gen;
+ }
+
+ run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+ max_gen);
+ }
+
+ if (verbose) {
+ ts_elapsed = timespec_elapsed(ts_start);
+ pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+ ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ }
+
+ /* Re-read so callers get updated information */
+ read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages)
+{
+ int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+ for (node = 0; node < stats->nr_nodes; ++node)
+ for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+ ++gen_idx) {
+ gen = stats->nodes[node].gens[gen_idx].gen;
+ max_gen = gen > max_gen ? gen : max_gen;
+ min_gen = gen < min_gen ? gen : min_gen;
+ }
+
+ for (gen = min_gen; gen < max_gen; ++gen)
+ /* See if the most pages are in this generation. */
+ if (sum_memcg_stats_for_gen(gen, stats) >
+ total_pages / 2)
+ return gen;
+
+ TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+ total_pages);
+
+ /* unreachable, but make the compiler happy */
+ return -1;
+}
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:33:10

by James Houghton

[permalink] [raw]
Subject: [PATCH v5 1/9] KVM: Add lockless memslot walk to KVM

Give architectures the flexibility to synchronize as optimally as they
can instead of always taking the MMU lock for writing.

Architectures that do their own locking must select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 26 +++++++++++++++++++-------
3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 692c01e41a18..4d7c3e8632e6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@ struct kvm_gfn_range {
gfn_t end;
union kvm_mmu_notifier_arg arg;
bool may_block;
+ bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 29b73eedfe74..0404857c1702 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -97,6 +97,9 @@ config KVM_GENERIC_MMU_NOTIFIER
select MMU_NOTIFIER
bool

+config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+ bool
+
config KVM_GENERIC_MEMORY_ATTRIBUTES
depends on KVM_GENERIC_MMU_NOTIFIER
bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14841acb8b95..d8fa0d617f12 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -558,6 +558,7 @@ struct kvm_mmu_notifier_range {
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
+ bool lockless;
};

/*
@@ -612,6 +613,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
IS_KVM_NULL_FN(range->handler)))
return r;

+ /* on_lock will never be called for lockless walks */
+ if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+ return r;
+
idx = srcu_read_lock(&kvm->srcu);

for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -643,15 +648,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
+ gfn_range.lockless = range->lockless;

if (!r.found_memslot) {
r.found_memslot = true;
- KVM_MMU_LOCK(kvm);
- if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm);
-
- if (IS_KVM_NULL_FN(range->handler))
- break;
+ if (!range->lockless) {
+ KVM_MMU_LOCK(kvm);
+ if (!IS_KVM_NULL_FN(range->on_lock))
+ range->on_lock(kvm);
+
+ if (IS_KVM_NULL_FN(range->handler))
+ break;
+ }
}
r.ret |= range->handler(kvm, &gfn_range);
}
@@ -660,7 +668,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);

- if (r.found_memslot)
+ if (r.found_memslot && !range->lockless)
KVM_MMU_UNLOCK(kvm);

srcu_read_unlock(&kvm->srcu, idx);
@@ -681,6 +689,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
+ .lockless =
+ IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -699,6 +709,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
+ .lockless =
+ IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 05:44:36

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
>
> This new notifier is for multi-gen LRU specifically

Let me call it out before others do: we can't be this self-serving.

> as it wants to be
> able to get and clear age information from secondary MMUs only if it can
> be done "fast".
>
> By having this notifier specifically created for MGLRU, what "fast"
> means comes down to what is "fast" enough to improve MGLRU's ability to
> reclaim most of the time.
>
> Signed-off-by: James Houghton <[email protected]>

If we'd like this to pass other MM reviewers, especially the MMU
notifier maintainers, we'd need to design a generic API that can
benefit all the *existing* users: idle page tracking [1], DAMON [2]
and MGLRU.

Also I personally prefer to extend the existing callbacks by adding
new parameters, and on top of that, I'd try to consolidate the
existing callbacks -- it'd be less of a hard sell if my changes result
in less code, not more.

(v2 did all these, btw.)

[1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
[2] https://www.kernel.org/doc/html/latest/mm/damon/index.html

2024-06-11 06:00:00

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Tue, Jun 11, 2024 at 12:21:39AM +0000, James Houghton wrote:
> Replace the MMU write locks (taken in the memslot iteration loop) for
> read locks.
>
> Grabbing the read lock instead of the write lock is safe because the
> only requirement we have is that the stage-2 page tables do not get
> deallocated while we are walking them. The stage2_age_walker() callback
> is safe to race with itself; update the comment to reflect the
> synchronization change.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
> arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
> 3 files changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 58f09370d17e..7a1af8141c0e 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -22,6 +22,7 @@ menuconfig KVM
> select KVM_COMMON
> select KVM_GENERIC_HARDWARE_ENABLING
> select KVM_GENERIC_MMU_NOTIFIER
> + select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> select HAVE_KVM_CPU_RELAX_INTERCEPT
> select KVM_MMIO
> select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9e2bbee77491..b1b0f7148cff 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> data->young = true;
>
> /*
> - * stage2_age_walker() is always called while holding the MMU lock for
> - * write, so this will always succeed. Nonetheless, this deliberately
> - * follows the race detection pattern of the other stage-2 walkers in
> - * case the locking mechanics of the MMU notifiers is ever changed.
> + * This walk may not be exclusive; the PTE is permitted to change

s/may not/is not/

> + * from under us. If there is a race to update this PTE, then the
> + * GFN is most likely young, so failing to clear the AF is likely
> + * to be inconsequential.
> */
> if (data->mkold && !stage2_try_set_pte(ctx, new))
> return -EAGAIN;
> @@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
> struct kvm_pgtable_walker walker = {
> .cb = stage2_age_walker,
> .arg = &data,
> - .flags = KVM_PGTABLE_WALK_LEAF,
> + .flags = KVM_PGTABLE_WALK_LEAF |
> + KVM_PGTABLE_WALK_SHARED,
> };
> + int r;
>
> - WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
> + r = kvm_pgtable_walk(pgt, addr, size, &walker);
> + WARN_ON(r && r != -EAGAIN);

I could've been more explicit last time around, could you please tone
this down to WARN_ON_ONCE() as well?

> return data.young;
> }
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8bcab0cc3fe9..a62c27a347ed 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> u64 size = (range->end - range->start) << PAGE_SHIFT;
> + bool young = false;
> +
> + read_lock(&kvm->mmu_lock);
>
> if (!kvm->arch.mmu.pgt)
> return false;

I'm guessing you meant to have 'goto out' here, since this early return
fails to drop the mmu_lock.

--
Thanks,
Oliver

2024-06-11 16:52:44

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> >
> > This new notifier is for multi-gen LRU specifically
>
> Let me call it out before others do: we can't be this self-serving.
>
> > as it wants to be
> > able to get and clear age information from secondary MMUs only if it can
> > be done "fast".
> >
> > By having this notifier specifically created for MGLRU, what "fast"
> > means comes down to what is "fast" enough to improve MGLRU's ability to
> > reclaim most of the time.
> >
> > Signed-off-by: James Houghton <[email protected]>
>
> If we'd like this to pass other MM reviewers, especially the MMU
> notifier maintainers, we'd need to design a generic API that can
> benefit all the *existing* users: idle page tracking [1], DAMON [2]
> and MGLRU.
>
> Also I personally prefer to extend the existing callbacks by adding
> new parameters, and on top of that, I'd try to consolidate the
> existing callbacks -- it'd be less of a hard sell if my changes result
> in less code, not more.
>
> (v2 did all these, btw.)

I think consolidating the callbacks is cleanest, like you had it in
v2. I really wasn't sure about this change honestly, but it was my
attempt to incorporate feedback like this[3] from v4. I'll consolidate
the callbacks like you had in v2.

Instead of the bitmap like you had, I imagine we'll have some kind of
flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
that sound ok?
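
Something like this, roughly (names and bit positions are purely
illustrative, not final):

/* Hypothetical flag bits for a consolidated young notifier. */
#define MMU_NOTIFIER_YOUNG_CLEAR	(1UL << 0) /* clear the A bit, don't just test it */
#define MMU_NOTIFIER_YOUNG_FAST_ONLY	(1UL << 1) /* only consult MMUs that can answer cheaply */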

Do idle page tracking and DAMON need this new "fast-only" notifier? Or
do they benefit from a generic API in other ways? Sorry if I missed
this from some other mail.

I've got feedback saying that tying the definition of "fast" to MGLRU
specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY,
maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU
notifier". It sounds like you'd prefer the more generic one.

Thanks for the feedback -- I don't want to keep this series lingering
on the list, so I'll try and get newer versions out sooner rather than
later.

[3]: https://lore.kernel.org/linux-mm/[email protected]/

>
> [1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
> [2] https://www.kernel.org/doc/html/latest/mm/damon/index.html

2024-06-11 16:54:10

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Mon, Jun 10, 2024 at 10:58 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:21:39AM +0000, James Houghton wrote:
> > Replace the MMU write locks (taken in the memslot iteration loop) for
> > read locks.
> >
> > Grabbing the read lock instead of the write lock is safe because the
> > only requirement we have is that the stage-2 page tables do not get
> > deallocated while we are walking them. The stage2_age_walker() callback
> > is safe to race with itself; update the comment to reflect the
> > synchronization change.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > arch/arm64/kvm/Kconfig | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
> > arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
> > 3 files changed, 30 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 58f09370d17e..7a1af8141c0e 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -22,6 +22,7 @@ menuconfig KVM
> > select KVM_COMMON
> > select KVM_GENERIC_HARDWARE_ENABLING
> > select KVM_GENERIC_MMU_NOTIFIER
> > + select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> > select HAVE_KVM_CPU_RELAX_INTERCEPT
> > select KVM_MMIO
> > select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 9e2bbee77491..b1b0f7148cff 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > data->young = true;
> >
> > /*
> > - * stage2_age_walker() is always called while holding the MMU lock for
> > - * write, so this will always succeed. Nonetheless, this deliberately
> > - * follows the race detection pattern of the other stage-2 walkers in
> > - * case the locking mechanics of the MMU notifiers is ever changed.
> > + * This walk may not be exclusive; the PTE is permitted to change
>
> s/may not/is not/

Will fix.

>
> > + * from under us. If there is a race to update this PTE, then the
> > + * GFN is most likely young, so failing to clear the AF is likely
> > + * to be inconsequential.
> > */
> > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > return -EAGAIN;
> > @@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_age_walker,
> > .arg = &data,
> > - .flags = KVM_PGTABLE_WALK_LEAF,
> > + .flags = KVM_PGTABLE_WALK_LEAF |
> > + KVM_PGTABLE_WALK_SHARED,
> > };
> > + int r;
> >
> > - WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
> > + r = kvm_pgtable_walk(pgt, addr, size, &walker);
> > + WARN_ON(r && r != -EAGAIN);
>
> I could've been more explicit last time around, could you please tone
> this down to WARN_ON_ONCE() as well?

Will do, thanks.

>
> > return data.young;
> > }
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 8bcab0cc3fe9..a62c27a347ed 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > u64 size = (range->end - range->start) << PAGE_SHIFT;
> > + bool young = false;
> > +
> > + read_lock(&kvm->mmu_lock);
> >
> > if (!kvm->arch.mmu.pgt)
> > return false;
>
> I'm guessing you meant to have 'goto out' here, since this early return
> fails to drop the mmu_lock.

Ah sorry! Thanks for catching this.

2024-06-11 18:54:48

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.

Establishing motivation for a change is always a good idea. The wording
could be a bit crisper, but the connection between the new MMU notifier
and MGLRU is valuable. I do not view the wording of the changeset as
excluding other users of the 'fast' notifier.

> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

My strong preference is to have the caller's expectations of the
secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
abundantly clear both at the callsite and in the implementation.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?
>
> Do idle page tracking and DAMON need this new "fast-only" notifier? Or
> do they benefit from a generic API in other ways? Sorry if I missed
> this from some other mail.

Let's also keep in mind we aren't establishing an ABI here. If we have
direct line of sight (i.e. patches) on how to leverage the new MMU
notifier for DAMON and idle page tracking then great, let's try and
build something that satisfies all users. Otherwise, it isn't
that big of a deal if the interface needs to change slightly when
someone decides to leverage the MMU notifier for something else.

> I've got feedback saying that tying the definition of "fast" to MGLRU
> specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY,
> maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU
> notifier". It sounds like you'd prefer the more generic one.
>
> Thanks for the feedback -- I don't want to keep this series lingering
> on the list, so I'll try and get newer versions out sooner rather than
> later.

Let's make sure we get alignment on this before you proceed, I don't get
the sense that we're getting to a common understanding of where to go
with this.

--
Thanks,
Oliver

2024-06-11 19:42:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
>
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

James, wait for others to chime in before committing yourself to a course of
action, otherwise you're going to get ping-ponged to hell and back.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?

Why do we need a bundle of flags? If we extend .clear_young() and .test_young()
as Yu suggests, then we only need a single "bool fast_only".

As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
Extending will require a small amount of additional churn, e.g. to pass in false,
but that doesn't seem problematic on its own. On the plus side, there would be
less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
macros :-) ).

E.g.

--
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7b77ad6cf833..07872ae00fa6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,

int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool fast_only)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
- if (subscription->ops->clear_young)
- young |= subscription->ops->clear_young(subscription,
- mm, start, end);
+ if (!subscription->ops->clear_young ||
+ fast_only && !subscription->ops->has_fast_aging)
+ continue;
+
+ young |= subscription->ops->clear_young(subscription,
+ mm, start, end);
}
srcu_read_unlock(&srcu, id);

@@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
}

int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ bool fast_only)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
- if (subscription->ops->test_young) {
- young = subscription->ops->test_young(subscription, mm,
- address);
- if (young)
- break;
- }
+ if (!subscription->ops->test_young)
+ continue;
+
+ if (fast_only && !subscription->ops->has_fast_aging)
+ continue;
+
+ young = subscription->ops->test_young(subscription, mm, address);
+ if (young)
+ break;
}
srcu_read_unlock(&srcu, id);
--

It might also require multiplexing the return value to differentiate between
"young" and "failed". Ugh, but the code already does that, just in a bespoke way.

Double ugh. Peeking ahead at the "failure" code, NAK to adding
kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
that, I will NAK each and every attempt to have core mm/ code call directly into KVM.

Anyways, back to this code, before we spin another version, we need to agree on
exactly what behavior we want out of secondary MMUs. Because to me, the behavior
proposed in this version doesn't make any sense.

Signalling failure because KVM _might_ have relevant aging information in SPTEs
that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
KVM should return "young", not "failed".

If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
if and only if L1 ran a nested VM at some point. But as proposed, KVM doesn't
actually check if there are any shadow TDP entries to process. That could be
fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.

E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
L2, then "failure" is pretty much guaranteed to a false positive. And even for
the pages that are exposed to L2, "failure" will occur if and only if the pages
are being accessed _only_ by L2.

There most definitely are use cases where the majority of a VM's memory is accessed
only by L2. But if those use cases are performing poorly under MGLRU, then IMO
we should figure out a way to enhance KVM to do a fast harvest of nested TDP
Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
fails is suboptimal.

So rather than failing the fast aging, I think what we want is to know if an
mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
is an optional optimization to avoid taking mmu_lock for write in paths where a
(very rare) false negative is acceptable.

static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
{
return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
}

static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
bool fast_only)
{
int young = 0;

if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
young = 1 | MMU_NOTIFY_WAS_FAST;

return (int)young;
}

and then in lru_gen_look_around():

if (spin_is_contended(pvmw->ptl))
return false;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
return false;

young = ptep_clear_young_notify(vma, addr, pte);
if (!young)
return false;

if (!(young & MMU_NOTIFY_WAS_FAST))
return true;

young = 1;

with the lookaround done using ptep_clear_young_notify_fast().

The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
needing to update all users of ptep_clear_young_notify() and friends.

2024-06-11 19:50:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, Oliver Upton wrote:
> On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> My strong preference is to have the caller's expectations of the
> secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> abundantly clear both at the callsite and in the implementation.

Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
new ptep_clear_young_notify_fast_only() does the obvious.

On the back end, odds are very good KVM is going to squish the "fast" and "slow"
paths back into a common helper, so IMO having dedicated fast_only() APIs for the
mmu_notifier hooks doesn't add much value in the end.

I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
that passing a fast_only flag will end up being cleaner for all parties.

2024-06-11 20:40:02

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 10:50 AM James Houghton <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
>
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

Only if you think updating the MMU notifier API is overall the best option.

As I said earlier, I don't mind not doing the extra work, i.e., the
bitmap/_fast_only parameters and the _FAST_YOUNG return value, as long
as we know where we win and lose, i.e., simplicity and regressions. I
would call all the potential regressions (the nested VM case, the
arm64 case and the memcached benchmark) minor, compared with what we
got from v2. But my opinion doesn't really matter much because I'm not
your customer :)

2024-06-11 23:05:39

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > > >
> > > > This new notifier is for multi-gen LRU specifically
> > >
> > > Let me call it out before others do: we can't be this self-serving.
> > >
> > > > as it wants to be
> > > > able to get and clear age information from secondary MMUs only if it can
> > > > be done "fast".
> > > >
> > > > By having this notifier specifically created for MGLRU, what "fast"
> > > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > > reclaim most of the time.
> > > >
> > > > Signed-off-by: James Houghton <[email protected]>
> > >
> > > If we'd like this to pass other MM reviewers, especially the MMU
> > > notifier maintainers, we'd need to design a generic API that can
> > > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > > and MGLRU.
> > >
> > > Also I personally prefer to extend the existing callbacks by adding
> > > new parameters, and on top of that, I'd try to consolidate the
> > > existing callbacks -- it'd be less of a hard sell if my changes result
> > > in less code, not more.
> > >
> > > (v2 did all these, btw.)
> >
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> James, wait for others to chime in before committing yourself to a course of
> action, otherwise you're going to get ping-ponged to hell and back.

Ah yeah. I really mean "I'll do it, provided the other feedback is in
line with this".

>
> > Instead of the bitmap like you had, I imagine we'll have some kind of
> > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> > that sound ok?
>
> Why do we need a bundle of flags? If we extend .clear_young() and .test_young()
> as Yu suggests, then we only need a single "bool fast_only".

We don't need to. In my head it's a little easier to collapse them
(slightly less code, and at the callsite you have a flag with a name
instead of a true/false). Making it a bool SGTM.

> As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
> Extending will require a small amount of additional churn, e.g. to pass in false,
> but that doesn't seem problematic on its own. On the plus side, there would be
> less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
> macros :-) ).

I think having the extra bool is cleaner than the new fast_only
notifier, definitely.

>
> E.g.
>
> --
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7b77ad6cf833..07872ae00fa6 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>
> int __mmu_notifier_clear_young(struct mm_struct *mm,
> unsigned long start,
> - unsigned long end)
> + unsigned long end,
> + bool fast_only)
> {
> struct mmu_notifier *subscription;
> int young = 0, id;
> @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> hlist_for_each_entry_rcu(subscription,
> &mm->notifier_subscriptions->list, hlist,
> srcu_read_lock_held(&srcu)) {
> - if (subscription->ops->clear_young)
> - young |= subscription->ops->clear_young(subscription,
> - mm, start, end);
> + if (!subscription->ops->clear_young ||
> + fast_only && !subscription->ops->has_fast_aging)
> + continue;
> +
> + young |= subscription->ops->clear_young(subscription,
> + mm, start, end);

KVM changing has_fast_aging dynamically would be slow, wouldn't it? I
feel like it's simpler to just pass in fast_only into `clear_young`
itself (and this is how I interpreted what you wrote above anyway).
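
i.e., roughly this shape for the callback (purely illustrative, not
meant to be the final signature):

	int (*clear_young)(struct mmu_notifier *subscription,
			   struct mm_struct *mm,
			   unsigned long start, unsigned long end,
			   bool fast_only);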

> }
> srcu_read_unlock(&srcu, id);
>
> @@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> }
>
> int __mmu_notifier_test_young(struct mm_struct *mm,
> - unsigned long address)
> + unsigned long address,
> + bool fast_only)
> {
> struct mmu_notifier *subscription;
> int young = 0, id;
> @@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
> hlist_for_each_entry_rcu(subscription,
> &mm->notifier_subscriptions->list, hlist,
> srcu_read_lock_held(&srcu)) {
> - if (subscription->ops->test_young) {
> - young = subscription->ops->test_young(subscription, mm,
> - address);
> - if (young)
> - break;
> - }
> + if (!subscription->ops->test_young)
> + continue;
> +
> + if (fast_only && !subscription->ops->has_fast_aging)
> + continue;
> +
> + young = subscription->ops->test_young(subscription, mm, address);
> + if (young)
> + break;
> }
> srcu_read_unlock(&srcu, id);
> --
>
> It might also require multiplexing the return value to differentiate between
> "young" and "failed". Ugh, but the code already does that, just in a bespoke way.

Yeah, that is necessary.

> Double ugh. Peeking ahead at the "failure" code, NAK to adding
> kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> that, I will NAK each and every attempt to have core mm/ code call directly into KVM.

Sorry to make you repeat yourself; I'll leave it out of v6. I don't
like it either, but I wasn't sure how important it was to avoid
calling into unnecessary notifiers if the TDP MMU were completely
disabled.

> Anyways, back to this code, before we spin another version, we need to agree on
> exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> proposed in this version doesn't make any sense.
>
> Signalling failure because KVM _might_ have relevant aging information in SPTEs
> that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> KVM should return "young", not "failed".

Sorry for this oversight. What about something like:

1. test (and maybe clear) A bits on TDP MMU
2. If accessed && !should_clear: return (fast)
3. if (fast_only): return (fast)
4. If !(must check shadow MMU): return (fast)
5. test (and maybe clear) A bits in shadow MMU
6. return (slow)

Some of this reordering (and maybe a change from
kvm_shadow_root_allocated() to checking indirect_shadow_pages or
something else) can be done in its own patch.

> If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
> if and only if L1 ran a nested VM at some point. But as proposed, KVM doesn't
> actually check if there are any shadow TDP entries to process. That could be
> fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
> that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.
>
> E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
> L2, then "failure" is pretty much guaranteed to a false positive. And even for
> the pages that are exposed to L2, "failure" will occur if and only if the pages
> are being accessed _only_ by L2.
>
> There most definitely are use cases where the majority of a VM's memory is accessed
> only by L2. But if those use cases are performing poorly under MGLRU, then IMO
> we should figure out a way to enhance KVM to do a fast harvest of nested TDP
> Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

This makes sense. I don't have data today to show that speeding up
harvesting of Accessed information from the shadow MMU would be a huge
win. Getting this information for the TDP MMU is at least better than
no information at all.

>
> Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
> fails is suboptimal.

Agreed. I didn't like this when I wrote it. This can easily be fixed
by making mmu_notifier_clear_young() return both a "was fast"
indication and the "young" result, which I will do.
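
To spell it out, I'm picturing something like the sketch below. The
encoding and the caller-side check are only an illustration (the
MMU_NOTIFY_WAS_FAST name is borrowed from your snippet further down,
and do_look_around is just a stand-in for whatever MGLRU ends up
doing); exact names/bits are TBD:

        /*
         * Sketch: bit 0 keeps the existing "young" meaning; an extra bit
         * reports that the answer came from a fast (lockless) walk.
         */
        #define MMU_NOTIFY_WAS_FAST     (1 << 1)

        int young = mmu_notifier_clear_young(vma->vm_mm, addr,
                                             addr + PAGE_SIZE);

        if (young & 1) {
                /* The page is young... */
                if (young & MMU_NOTIFY_WAS_FAST)
                        /*
                         * ...and the information was cheap to get, so
                         * looking around at neighboring PTEs is worth it.
                         */
                        do_look_around = true;
        }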

> So rather than failing the fast aging, I think what we want is to know if an
> mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> is an optional optimization to avoid taking mmu_lock for write in paths where a
> (very rare) false negative is acceptable.
>
> static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> {
> return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> }
>
> static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> bool fast_only)
> {
> int young = 0;
>
> if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> write_lock(&kvm->mmu_lock);
> young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> write_unlock(&kvm->mmu_lock);
> }
>
> if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> young = 1 | MMU_NOTIFY_WAS_FAST;

I don't think this line is quite right. We might set
MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
you mean though, thanks.

>
> return (int)young;
> }
>
> and then in lru_gen_look_around():
>
> if (spin_is_contended(pvmw->ptl))
> return false;
>
> /* exclude special VMAs containing anon pages from COW */
> if (vma->vm_flags & VM_SPECIAL)
> return false;
>
> young = ptep_clear_young_notify(vma, addr, pte);
> if (!young)
> return false;
>
> if (!(young & MMU_NOTIFY_WAS_FAST))
> return true;
>
> young = 1;
>
> with the lookaround done using ptep_clear_young_notify_fast().
>
> The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
> needing to update all users of ptep_clear_young_notify() and friends.

Sounds good to me.

Thanks for all the feedback!

2024-06-12 00:34:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, James Houghton wrote:
> On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
> > --
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 7b77ad6cf833..07872ae00fa6 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> >
> > int __mmu_notifier_clear_young(struct mm_struct *mm,
> > unsigned long start,
> > - unsigned long end)
> > + unsigned long end,
> > + bool fast_only)
> > {
> > struct mmu_notifier *subscription;
> > int young = 0, id;
> > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> > hlist_for_each_entry_rcu(subscription,
> > &mm->notifier_subscriptions->list, hlist,
> > srcu_read_lock_held(&srcu)) {
> > - if (subscription->ops->clear_young)
> > - young |= subscription->ops->clear_young(subscription,
> > - mm, start, end);
> > + if (!subscription->ops->clear_young ||
> > + fast_only && !subscription->ops->has_fast_aging)
> > + continue;
> > +
> > + young |= subscription->ops->clear_young(subscription,
> > + mm, start, end);
>
> KVM changing has_fast_aging dynamically would be slow, wouldn't it?

No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging
be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this
MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially
have multiple secondary MMUs wired up to the same mmu_notifier.

> I feel like it's simpler to just pass in fast_only into `clear_young` itself
> (and this is how I interpreted what you wrote above anyway).

Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires
secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by
default.

It's somewhat of a moot point because KVM is the only secondary MMU that implements
.clear_young() and .test_young() (which I keep forgetting), and that seems unlikely
to change.

A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
probably a pretty unlikely combination, so it's probably not a valid argument.

So, I guess I don't have a strong opinion?

> > Double ugh. Peeking ahead at the "failure" code, NAK to adding
> > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> > that, I will NAK each and every attempt to have core mm/ code call directly into KVM.
>
> Sorry to make you repeat yourself; I'll leave it out of v6. I don't
> like it either, but I wasn't sure how important it was to avoid
> calling into unnecessary notifiers if the TDP MMU were completely
> disabled.

If it's important, e.g. for performance, then the mmu_notifier should have a flag
so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging
suggestion.

> > Anyways, back to this code, before we spin another version, we need to agree on
> > exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> > proposed in this version doesn't make any sense.
> >
> > Signalling failure because KVM _might_ have relevant aging information in SPTEs
> > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> > KVM should return "young", not "failed".
>
> Sorry for this oversight. What about something like:
>
> 1. test (and maybe clear) A bits on TDP MMU
> 2. If accessed && !should_clear: return (fast)
> 3. if (fast_only): return (fast)
> 4. If !(must check shadow MMU): return (fast)
> 5. test (and maybe clear) A bits in shadow MMU
> 6. return (slow)

I don't understand where the "must check shadow MMU" in #4 comes from. I also
don't think it's necessary; see below.

> Some of this reordering (and maybe a change from
> kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> something else) can be done in its own patch.
>
> > So rather than failing the fast aging, I think what we want is to know if an
> > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > (very rare) false negative is acceptable.
> >
> > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > {
> > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > }
> >
> > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > bool fast_only)
> > {
> > int young = 0;
> >
> > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > write_lock(&kvm->mmu_lock);
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > young = 1 | MMU_NOTIFY_WAS_FAST;
>
> I don't think this line is quite right. We might set
> MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
> you mean though, thanks.

The name sucks, but I believe the logic is correct. As posted here in v5, the
MGLRU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always
get aging information, but only look around at other PTEs if it can be done fast.

        if (should_walk_secondary_mmu()) {
                notifier_result =
                        mmu_notifier_test_clear_young_fast_only(
                                vma->vm_mm, addr, addr + PAGE_SIZE,
                                /*clear=*/true);
        }

        if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
                secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
                                                           addr + PAGE_SIZE);
        else {
                secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
                notifier_was_fast = true;
        }

The change, relative to v5, that I am proposing is that MGLRU looks around if
the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
only if _all_ secondary MMUs are fast.

In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
fast_only flag.

2024-06-12 16:17:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Tue, Jun 11, 2024, James Houghton wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ecb59b2..24a3ff639919 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> continue;
> }
>
> - if (pvmw.pte) {
> - if (lru_gen_enabled() &&
> - pte_young(ptep_get(pvmw.pte))) {
> - lru_gen_look_around(&pvmw);
> + if (lru_gen_enabled() && pvmw.pte) {
> + if (lru_gen_look_around(&pvmw))
> referenced++;
> - }
> -
> + } else if (pvmw.pte) {
> if (ptep_clear_flush_young_notify(vma, address,
> pvmw.pte))
> referenced++;

Random question not really related to KVM/secondary MMU participation. AFAICT,
the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
false negatives on pxx_young() due to the CPU not setting Accessed bits because
of stale TLB entries?

2024-06-12 17:00:29

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index e8fc5ecb59b2..24a3ff639919 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> > continue;
> > }
> >
> > - if (pvmw.pte) {
> > - if (lru_gen_enabled() &&
> > - pte_young(ptep_get(pvmw.pte))) {
> > - lru_gen_look_around(&pvmw);
> > + if (lru_gen_enabled() && pvmw.pte) {
> > + if (lru_gen_look_around(&pvmw))
> > referenced++;
> > - }
> > -
> > + } else if (pvmw.pte) {
> > if (ptep_clear_flush_young_notify(vma, address,
> > pvmw.pte))
> > referenced++;
>
> Random question not really related to KVM/secondary MMU participation. AFAICT,
> the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
> false negatives on pxx_young() due to the CPU not setting Accessed bits because
> of stale TLB entries?

I do think there can be false negatives but we have not been able to
measure their practical impacts since we disabled the flush on some
host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
ptep_clear_flush_young() is just ptep_test_and_clear_young(). The
theoretical basis is that, given the TLB coverage trend (Figure 1 in
[1]), when a system is running out of memory, it's unlikely to have
many long-lived entries in its TLB. IOW, if that system had a stable
working set (hot memory) that can fit into its TLB, it wouldn't hit
page reclaim. Again, this is based on the theory (proposition) that
for most systems, their TLB coverages are much smaller than their
memory sizes.

If/when the above proposition doesn't hold, the next step in the page
reclaim path, which is to unmap the PTE, will cause a page fault. The
fault can be minor or major (requires IO), depending on the race
between the reclaiming and accessing threads. In this case, the
tradeoff, in a steady state, is between the PF cost of pages we
shouldn't reclaim and the flush cost of pages we scan. The PF cost is
higher than the flush cost per page. But we scan many pages and only
reclaim a few of them; pages we shouldn't reclaim are a (small)
portion of the latter.

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

2024-06-12 17:47:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024, Yu Zhao wrote:
> On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Tue, Jun 11, 2024, James Houghton wrote:
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index e8fc5ecb59b2..24a3ff639919 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> > > continue;
> > > }
> > >
> > > - if (pvmw.pte) {
> > > - if (lru_gen_enabled() &&
> > > - pte_young(ptep_get(pvmw.pte))) {
> > > - lru_gen_look_around(&pvmw);
> > > + if (lru_gen_enabled() && pvmw.pte) {
> > > + if (lru_gen_look_around(&pvmw))
> > > referenced++;
> > > - }
> > > -
> > > + } else if (pvmw.pte) {
> > > if (ptep_clear_flush_young_notify(vma, address,
> > > pvmw.pte))
> > > referenced++;
> >
> > Random question not really related to KVM/secondary MMU participation. AFAICT,
> > the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
> > false negatives on pxx_young() due to the CPU not setting Accessed bits because
> > of stale TLB entries?
>
> I do think there can be false negatives but we have not been able to
> measure their practical impacts since we disabled the flush on some
> host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> ptep_clear_flush_young() is just ptep_test_and_clear_young().

Aha! That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

That begs the question, why does KVM flush TLBs on architectures that don't need
to? And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
even any KVM-supported architectures for which the flush is mandatory?

Skipping the flush on KVM x86 seems like a complete no-brainer.

Will, Marc and/or Oliver, what are arm64's requirements in this area? E.g. I see
that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB. Should
KVM be doing something similar? Can KVM safely skip even the TLBI?

> theoretical basis is that, given the TLB coverage trend (Figure 1 in
> [1]), when a system is running out of memory, it's unlikely to have
> many long-lived entries in its TLB. IOW, if that system had a stable
> working set (hot memory) that can fit into its TLB, it wouldn't hit
> page reclaim. Again, this is based on the theory (proposition) that
> for most systems, their TLB coverages are much smaller than their
> memory sizes.
>
> If/when the above proposition doesn't hold, the next step in the page
> reclaim path, which is to unmap the PTE, will cause a page fault. The
> fault can be minor or major (requires IO), depending on the race
> between the reclaiming and accessing threads. In this case, the
> tradeoff, in a steady state, is between the PF cost of pages we
> shouldn't reclaim and the flush cost of pages we scan. The PF cost is
> higher than the flush cost per page. But we scan many pages and only
> reclaim a few of them; pages we shouldn't reclaim are a (small)
> portion of the latter.
>
> [1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

2024-06-13 06:50:17

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024 at 10:23:38AM -0700, Sean Christopherson wrote:
> On Wed, Jun 12, 2024, Yu Zhao wrote:
> > I do think there can be false negatives but we have not been able to
> > measure their practical impacts since we disabled the flush on some
> > host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> > ptep_clear_flush_young() is just ptep_test_and_clear_young().
>
> Aha! That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

Heh, well the helper name isn't exactly giving any hints...

> That begs the question, why does KVM flush TLBs on architectures that don't need
> to? And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
> even any KVM-supported architectures for which the flush is mandatory?
>
> Skipping the flush on KVM x86 seems like a complete no-brainer.
>
> Will, Marc and/or Oliver, what are arm64's requirements in this area? E.g. I see
> that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB. Should
> KVM be doing something similar? Can KVM safely skip even the TLBI?

Short answer, yes, KVM can elide TLBIs when clearing AF.

Long answer: Software needs to be extremely careful to ensure that TLBI
elision doesn't lead to a failure to uphold break-before-make requirements,
if we're only concerned with architecture-specific requirements. IOW, the AF
cannot be used as a hint for the presence of TLB entries for a given PTE.

There's the obvious failure of skipping TLBIs for old pages when
unmapping, but that isn't an architecture-specific issue.

So, since KVM/arm64 doesn't play any games with the AF at stage-2, leaving
out a TLBI when aging ought to be fine.

--
Thanks,
Oliver

2024-06-13 06:53:40

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote:
> On Tue, Jun 11, 2024, Oliver Upton wrote:
> > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > > I think consolidating the callbacks is cleanest, like you had it in
> > > v2. I really wasn't sure about this change honestly, but it was my
> > > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > > the callbacks like you had in v2.
> >
> > My strong preference is to have the caller's expectations of the
> > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> > abundantly clear both at the callsite and in the implementation.
>
> Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
> for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
> new ptep_clear_young_notify_fast_only() does the obvious.
>
> On the back end, odds are very good KVM is going to squish the "fast" and "slow"
> paths back into a common helper, so IMO having dedicated fast_only() APIs for the
> mmu_notifier hooks doesn't add much value in the end.
>
> I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
> that passing a fast_only flag will end up being less clean for all parties.

Yeah, I think I'm headed in the same direction after actually reading
the MM side of this, heh.

--
Thanks,
Oliver

2024-06-14 00:46:32

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
> > > --
> > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > index 7b77ad6cf833..07872ae00fa6 100644
> > > --- a/mm/mmu_notifier.c
> > > +++ b/mm/mmu_notifier.c
> > > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> > >
> > > int __mmu_notifier_clear_young(struct mm_struct *mm,
> > > unsigned long start,
> > > - unsigned long end)
> > > + unsigned long end,
> > > + bool fast_only)
> > > {
> > > struct mmu_notifier *subscription;
> > > int young = 0, id;
> > > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> > > hlist_for_each_entry_rcu(subscription,
> > > &mm->notifier_subscriptions->list, hlist,
> > > srcu_read_lock_held(&srcu)) {
> > > - if (subscription->ops->clear_young)
> > > - young |= subscription->ops->clear_young(subscription,
> > > - mm, start, end);
> > > + if (!subscription->ops->clear_young ||
> > > + fast_only && !subscription->ops->has_fast_aging)
> > > + continue;
> > > +
> > > + young |= subscription->ops->clear_young(subscription,
> > > + mm, start, end);
> >
> > KVM changing has_fast_aging dynamically would be slow, wouldn't it?
>
> No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging
> be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this
> MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially
> have multiple secondary MMUs wired up to the same mmu_notifier.
>
> > I feel like it's simpler to just pass in fast_only into `clear_young` itself
> > (and this is how I interpreted what you wrote above anyway).
>
> Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires
> secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by
> default.
>
> It's somewhat of a moot point because KVM is the only secondary MMU that implements
> .clear_young() and .test_young() (which I keep forgetting), and that seems unlikely
> to change.
>
> A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> probably a pretty unlikely combination, so it's probably not a valid argument.
>
> So, I guess I don't have a strong opinion?

(Sorry for the somewhat delayed response... spent some time actually
writing what this would look like.)

I see what you mean, thanks! So has_fast_aging might be set by KVM if
the architecture sets a Kconfig saying that it understands the concept
of fast aging, basically what the presence of this v5's
test_clear_young_fast_only() indicates.

>
> > > Double ugh. Peeking ahead at the "failure" code, NAK to adding
> > > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> > > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> > > that, I will NAK each and every attempt to have core mm/ code call directly into KVM.
> >
> > Sorry to make you repeat yourself; I'll leave it out of v6. I don't
> > like it either, but I wasn't sure how important it was to avoid
> > calling into unnecessary notifiers if the TDP MMU were completely
> > disabled.
>
> If it's important, e.g. for performance, then the mmu_notifier should have a flag
> so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging
> suggestion.

Thanks! That makes sense.

> > > Anyways, back to this code, before we spin another version, we need to agree on
> > > exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> > > proposed in this version doesn't make any sense.
> > >
> > > Signalling failure because KVM _might_ have relevant aging information in SPTEs
> > > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> > > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> > > KVM should return "young", not "failed".
> >
> > Sorry for this oversight. What about something like:
> >
> > 1. test (and maybe clear) A bits on TDP MMU
> > 2. If accessed && !should_clear: return (fast)
> > 3. if (fast_only): return (fast)
> > 4. If !(must check shadow MMU): return (fast)
> > 5. test (and maybe clear) A bits in shadow MMU
> > 6. return (slow)
>
> I don't understand where the "must check shadow MMU" in #4 comes from. I also
> don't think it's necessary; see below.

I just meant `kvm_has_shadow_mmu_sptes()` or
`kvm_memslots_have_rmaps()`. I like the logic you suggest below. :)

> > Some of this reordering (and maybe a change from
> > kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> > something else) can be done in its own patch.

So just to be clear, for test_young(), I intend to have a patch in v6
to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
like a pure win; no reason not to include it if we're making logic
changes here anyway.

> >
> > > So rather than failing the fast aging, I think what we want is to know if an
> > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > (very rare) false negative is acceptable.
> > >
> > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > {
> > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > }
> > >
> > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > bool fast_only)
> > > {
> > > int young = 0;
> > >
> > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > write_lock(&kvm->mmu_lock);
> > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > write_unlock(&kvm->mmu_lock);
> > > }
> > >
> > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > young = 1 | MMU_NOTIFY_WAS_FAST;

The most straightforward way (IMHO) to return something like `1 |
MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
gfn_handler_t return int instead of bool.

In this v5, I worked around this need by using `bool *failed` in patch
5[1]. I think the cleaner way to do this in v6 would be to actually
change gfn_handler_t to return int, and then we can write something
like what you wrote here. What do you think?

[1]: https://lore.kernel.org/linux-mm/[email protected]/

> > I don't think this line is quite right. We might set
> > MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
> > you mean though, thanks.
>
> The name sucks, but I believe the logic is correct. As posted here in v5, the
> MGLRU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always
> get aging information, but only look around at other PTEs if it can be done fast.
>
> if (should_walk_secondary_mmu()) {
> notifier_result =
> mmu_notifier_test_clear_young_fast_only(
> vma->vm_mm, addr, addr + PAGE_SIZE,
> /*clear=*/true);
> }
>
> if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
> secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
> addr + PAGE_SIZE);
> else {
> secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
> notifier_was_fast = true;
> }
>
> The change, relative to v5, that I am proposing is that MGLRU looks around if
> the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> only if _all_ secondary MMUs are fast.
>
> In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> fast_only flag.

Oh, yeah, that's a lot more intelligent than what I had. I think I
fully understand your suggestion; I guess we'll see in v6. :)

I wonder if this still makes sense if whether or not an MMU is "fast"
is determined by how contended some lock(s) are at the time. I think
it does, but I guess we can discuss more if it turns out that having
an architecture participate like this is actually something we want to
do (i.e., that performance results say it's a good idea).

Thanks Sean!

2024-06-14 00:49:11

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Wed, Jun 12, 2024 at 11:53 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote:
> > On Tue, Jun 11, 2024, Oliver Upton wrote:
> > > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > > > I think consolidating the callbacks is cleanest, like you had it in
> > > > v2. I really wasn't sure about this change honestly, but it was my
> > > > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > > > the callbacks like you had in v2.
> > >
> > > My strong preference is to have the caller's expectations of the
> > > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> > > abundantly clear both at the callsite and in the implementation.
> >
> > Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
> > for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
> > new ptep_clear_young_notify_fast_only() does the obvious.
> >
> > On the back end, odds are very good KVM is going to squish the "fast" and "slow"
> > paths back into a common helper, so IMO having dedicated fast_only() APIs for the
> > mmu_notifier hooks doesn't add much value in the end.
> >
> > I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
> > that passing a fast_only flag will end up being less clean for all parties.
>
> Yeah, I think I'm headed in the same direction after actually reading
> the MM side of this, heh.

Yeah, I agree. What I have now for v6 is that the test_young() and
clear_young() notifiers themselves now take a `bool fast_only`. When
called with the existing helpers (e.g. `mmu_notifier_test_young()`,
`fast_only` is false, and I have new helpers (e.g.
`mmu_notifier_test_young_fast_only()`) that will set `fast_only` to
true. Seems clean to me. Thanks!
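
For reference, here is a rough sketch of what those wrappers might look
like, assuming the __mmu_notifier_*() internals grow the fast_only
parameter as in your earlier diff (this is not the final v6 code):

        /* include/linux/mmu_notifier.h, sketch only */
        static inline int mmu_notifier_test_young(struct mm_struct *mm,
                                                  unsigned long address)
        {
                /* Existing helper: always allowed to take the slow path. */
                if (mm_has_notifiers(mm))
                        return __mmu_notifier_test_young(mm, address,
                                                         /*fast_only=*/false);
                return 0;
        }

        static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
                                                            unsigned long address)
        {
                /* New helper: only consult MMUs that can age quickly. */
                if (mm_has_notifiers(mm))
                        return __mmu_notifier_test_young(mm, address,
                                                         /*fast_only=*/true);
                return 0;
        }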

2024-06-14 16:13:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Thu, Jun 13, 2024, James Houghton wrote:
> On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> > probably a pretty unlikely combination, so it's probably not a valid argument.
> >
> > So, I guess I don't have a strong opinion?
>
> (Sorry for the somewhat delayed response... spent some time actually
> writing what this would look like.)
>
> I see what you mean, thanks! So has_fast_aging might be set by KVM if
> the architecture sets a Kconfig saying that it understands the concept
> of fast aging, basically what the presence of this v5's
> test_clear_young_fast_only() indicates.

It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
doesn't support fast aging (uses the shadow MMU even for TDP).

> > I don't understand where the "must check shadow MMU" in #4 comes from. I also
> > don't think it's necessary; see below.
>
> I just meant `kvm_has_shadow_mmu_sptes()` or
> `kvm_memslots_have_rmaps()`. I like the logic you suggest below. :)
>
> > > Some of this reordering (and maybe a change from
> > > kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> > > something else) can be done in its own patch.
>
> So just to be clear, for test_young(), I intend to have a patch in v6
> to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> like a pure win; no reason not to include it if we're making logic
> changes here anyway.

I don't think that's correct. The initial fast_only=false aging should process
shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
get a false positive on young due to failing to clear the Accessed bit in the
shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
accessed again, the Accessed bit would still be set in the page tables for L2.

My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to
communicate to MGLRU that the page was found to be young in an MMU that supports
fast aging, i.e. that looking around at other SPTEs is worth doing.

> > > > So rather than failing the fast aging, I think what we want is to know if an
> > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > > (very rare) false negative is acceptable.
> > > >
> > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > > {
> > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > > }
> > > >
> > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > > bool fast_only)
> > > > {
> > > > int young = 0;
> > > >
> > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > > write_lock(&kvm->mmu_lock);
> > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > > write_unlock(&kvm->mmu_lock);
> > > > }
> > > >
> > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > > young = 1 | MMU_NOTIFY_WAS_FAST;
>
> The most straightforward way (IMHO) to return something like `1 |
> MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
> gfn_handler_t return int instead of bool.

Hrm, all the options are unpleasant. Modifying gfn_handler_t to return an int
will require an absurd amount of churn (all implementations in all architectures),
and I don't love that the APIs that return true/false to indicate "flush" would
lose their boolean-ness.

One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perfect either,
but it requires far less churn and keeps some of the KVM<=>mmu_notifier details in
common KVM code.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b9d2633a931..c11a359b6ff5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
union kvm_mmu_notifier_arg {
unsigned long attributes;
+ bool aging_was_fast;
};

struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 436ca41f61e5..a936f6bedd97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
return __kvm_handle_hva_range(kvm, &range).ret;
}

-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
- unsigned long start,
- unsigned long end,
- gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
+ unsigned long start,
+ unsigned long end,
+ bool flush_if_young)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
@@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.end = end,
.handler = handler,
.on_lock = (void *)kvm_null_fn,
- .flush_on_ret = false,
+ .flush_on_ret = flush_if_young,
.may_block = false,
+ .aging_was_fast = false,
};

- return __kvm_handle_hva_range(kvm, &range).ret;
+ bool young = __kvm_handle_hva_range(kvm, &range).ret;
+
+ return (int)young | (range.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
}

void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
{
trace_kvm_age_hva(start, end);

- return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+ return kvm_age_hva_range(mn, start, end, true);
}

static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
{
trace_kvm_age_hva(start, end);

- /*
- * Even though we do not flush TLB, this will still adversely
- * affect performance on pre-Haswell Intel EPT, where there is
- * no EPT Access Bit to clear so that we have to tear down EPT
- * tables instead. If we find this unacceptable, we can always
- * add a parameter to kvm_age_hva so that it effectively doesn't
- * do anything on clear_young.
- *
- * Also note that currently we never issue secondary TLB flushes
- * from clear_young, leaving this job up to the regular system
- * cadence. If we find this inaccurate, we might come up with a
- * more sophisticated heuristic later.
- */
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_age_hva_range(mn, start, end, false);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
{
trace_kvm_test_age_hva(address);

- return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
+ return kvm_age_hva_range(mn, address, address + 1, false);
}

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,


> > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > only if _all_ secondary MMUs are fast.
> >
> > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > fast_only flag.
>
> Oh, yeah, that's a lot more intelligent than what I had. I think I
> fully understand your suggestion; I guess we'll see in v6. :)
>
> I wonder if this still makes sense if whether or not an MMU is "fast"
> is determined by how contended some lock(s) are at the time.

No. Just because a lock wasn't contended on the initial aging doesn't mean it
won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
takes mmu_lock for write for all operations, an aging operation could get lucky
and sneak in while mmu_lock happened to be free, but then get stuck behind a large
queue of operations.

The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
an MMU that takes mmu_lock for read in all but the most rare paths.

2024-06-14 18:24:41

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Jun 13, 2024, James Houghton wrote:
> > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > > > i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> > > probably a pretty unlikely combination, so it's probably not a valid argument.
> > >
> > > So, I guess I don't have a strong opinion?
> >
> > (Sorry for the somewhat delayed response... spent some time actually
> > writing what this would look like.)
> >
> > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > the architecture sets a Kconfig saying that it understands the concept
> > of fast aging, basically what the presence of this v5's
> > test_clear_young_fast_only() indicates.
>
> It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
> doesn't support fast aging (uses the shadow MMU even for TDP).

I see. I'm not sure if it makes sense to put this in `ops` as you
originally had it then (it seems like a bit of a pain anyway). I could
just make it a member of `struct mmu_notifier` itself.
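
Something like this, roughly (just a sketch of where the flag would
live and how KVM/x86 might set it at registration time;
kvm_arch_has_fast_aging() is made up for illustration):

        /* include/linux/mmu_notifier.h */
        struct mmu_notifier {
                ...
                /* New: runtime property, set by the owner at registration. */
                bool has_fast_aging;
        };

        /* In KVM's mmu_notifier registration path: */
        kvm->mmu_notifier.has_fast_aging = kvm_arch_has_fast_aging(kvm);

        /* arch/x86/kvm: a runtime check rather than a Kconfig. */
        bool kvm_arch_has_fast_aging(struct kvm *kvm)
        {
                return tdp_mmu_enabled;
        }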

> > So just to be clear, for test_young(), I intend to have a patch in v6
> > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > like a pure win; no reason not to include it if we're making logic
> > changes here anyway.
>
> I don't think that's correct. The initial fast_only=false aging should process
> shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
> get a false positive on young due to failing to clear the Accessed bit in the
> shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
> accessed again, the Accessed bit would still be set in the page tables for L2.

For clear_young(fast_only=false), yeah we need to check and clear
Accessed for both MMUs. But for test_young(fast_only=false), I don't
see why we couldn't just return early if the TDP MMU reports young.
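
I.e., roughly the following (a sketch only, reusing
kvm_has_shadow_mmu_sptes() and MMU_NOTIFY_WAS_FAST from your earlier
snippet, with the existing kvm_tdp_mmu_test_age_gfn() and
kvm_test_age_rmap() handlers):

        static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
                                      bool fast_only)
        {
                int young = 0;

                /*
                 * Check the TDP MMU first: nothing is cleared on the
                 * test_young() path, so if the TDP MMU already says "young"
                 * there is no need to visit the shadow MMU at all.
                 */
                if (tdp_mmu_enabled && kvm_tdp_mmu_test_age_gfn(kvm, range))
                        return 1 | MMU_NOTIFY_WAS_FAST;

                if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
                        write_lock(&kvm->mmu_lock);
                        young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
                        write_unlock(&kvm->mmu_lock);
                }

                return young;
        }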

> My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to
> communicate to MGLRU that the page was found to be young in an MMU that supports
> fast aging, i.e. that looking around at other SPTEs is worth doing.

That makes sense; I don't think this little test_young() optimization
affects that.

> > > > > So rather than failing the fast aging, I think what we want is to know if an
> > > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > > > (very rare) false negative is acceptable.
> > > > >
> > > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > > > {
> > > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > > > }
> > > > >
> > > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > > > bool fast_only)
> > > > > {
> > > > > int young = 0;
> > > > >
> > > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > > > write_lock(&kvm->mmu_lock);
> > > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > > > write_unlock(&kvm->mmu_lock);
> > > > > }
> > > > >
> > > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > > > young = 1 | MMU_NOTIFY_WAS_FAST;
> >
> > The most straightforward way (IMHO) to return something like `1 |
> > MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
> > gfn_handler_t return int instead of bool.
>
> Hrm, all the options are unpleasant. Modifying gfn_handler_t to return an int
> will require an absurd amount of churn (all implementations in all architectures),
> and I don't love that the APIs that return true/false to indicate "flush" would
> lose their boolean-ness.
>
> One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
> refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
> it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perfect either,
> but it requires far less churn and keeps some of the KVM<=>mmu_notifier details in
> common KVM code.

SGTM. I think this will work. Thanks!

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7b9d2633a931..c11a359b6ff5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> union kvm_mmu_notifier_arg {
> unsigned long attributes;
> + bool aging_was_fast;
> };
>
> struct kvm_gfn_range {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 436ca41f61e5..a936f6bedd97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> return __kvm_handle_hva_range(kvm, &range).ret;
> }
>
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> - unsigned long start,
> - unsigned long end,
> - gfn_handler_t handler)
> +static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
> + unsigned long start,
> + unsigned long end,
> + bool flush_if_young)
> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> const struct kvm_mmu_notifier_range range = {
> @@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> .end = end,
> .handler = handler,
> .on_lock = (void *)kvm_null_fn,
> - .flush_on_ret = false,
> + .flush_on_ret = flush_if_young,
> .may_block = false,
> + .aging_was_fast = false,
> };
>
> - return __kvm_handle_hva_range(kvm, &range).ret;
> + bool young = __kvm_handle_hva_range(kvm, &range).ret;
> +
> + return (int)young | (range.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
> }
>
> void kvm_mmu_invalidate_begin(struct kvm *kvm)
> @@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> {
> trace_kvm_age_hva(start, end);
>
> - return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
> + return kvm_age_hva_range(mn, start, end, true);
> }
>
> static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> @@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> {
> trace_kvm_age_hva(start, end);
>
> - /*
> - * Even though we do not flush TLB, this will still adversely
> - * affect performance on pre-Haswell Intel EPT, where there is
> - * no EPT Access Bit to clear so that we have to tear down EPT
> - * tables instead. If we find this unacceptable, we can always
> - * add a parameter to kvm_age_hva so that it effectively doesn't
> - * do anything on clear_young.
> - *
> - * Also note that currently we never issue secondary TLB flushes
> - * from clear_young, leaving this job up to the regular system
> - * cadence. If we find this inaccurate, we might come up with a
> - * more sophisticated heuristic later.
> - */
> - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> + return kvm_age_hva_range(mn, start, end, false);
> }
>
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> {
> trace_kvm_test_age_hva(address);
>
> - return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> - kvm_test_age_gfn);
> + return kvm_age_hva_range(mn, address, address + 1, false);
> }
>
> static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
>
>
> > > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > > only if _all_ secondary MMUs are fast.
> > >
> > > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > > fast_only flag.
> >
> > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > fully understand your suggestion; I guess we'll see in v6. :)
> >
> > I wonder if this still makes sense if whether or not an MMU is "fast"
> > is determined by how contended some lock(s) are at the time.
>
> No. Just because a lock wasn't contended on the initial aging doesn't mean it
> won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
> takes mmu_lock for write for all operations, an aging operation could get lucky
> and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> queue of operations.
>
> The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> an MMU that takes mmu_lock for read in all but the most rare paths.

Aging and look-around themselves only use the fast-only notifiers, so
they won't ever wait on a lock (well... provided KVM is written like
that, which I think is a given). should_look_around() will use the
slow notifier because it (despite its name) is responsible for
accurately determining if a page is young lest we evict a young page.

So in this case where "fast" means "lock not contended for now", I
don't think it's necessarily wrong for MGLRU to attempt to find young
pages, even if sometimes it will bail out because a lock is
contended/held for a few or even a majority of the pages. Not doing
look-around is the same as doing look-around and finding that no pages
are young.

Anyway, I don't think this bit is really all that important unless we
can demonstrate that KVM participating like this actually results in a
measurable win.

2024-06-14 23:22:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Fri, Jun 14, 2024, James Houghton wrote:
> On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Jun 13, 2024, James Houghton wrote:
> > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > > > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's
> > > > probably a pretty unlikely combination, so it's probably not a valid argument.
> > > >
> > > > So, I guess I don't have a strong opinion?
> > >
> > > (Sorry for the somewhat delayed response... spent some time actually
> > > writing what this would look like.)
> > >
> > > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > > the architecture sets a Kconfig saying that it understands the concept
> > > of fast aging, basically what the presence of this v5's
> > > test_clear_young_fast_only() indicates.
> >
> > It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
> > doesn't support fast aging (uses the shadow MMU even for TDP).
>
> I see. I'm not sure if it makes sense to put this in `ops` as you
> originally had it then (it seems like a bit of a pain anyway). I could
> just make it a member of `struct mmu_notifier` itself.

Ah, right, because the ops are const. Yeah, losing the const just to set a flag
would be unfortunate.

> > > So just to be clear, for test_young(), I intend to have a patch in v6
> > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > > like a pure win; no reason not to include it if we're making logic
> > > changes here anyway.
> >
> > I don't think that's correct. The initial fast_only=false aging should process
> > shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
> > get a false positive on young due to failing to clear the Accessed bit in the
> > shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
> > accessed again, the Accessed bit would still be set in the page tables for L2.
>
> For clear_young(fast_only=false), yeah we need to check and clear
> Accessed for both MMUs. But for test_young(fast_only=false), I don't
> see why we couldn't just return early if the TDP MMU reports young.

Ooh, good point.

> > > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > > fully understand your suggestion; I guess we'll see in v6. :)
> > >
> > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > is determined by how contended some lock(s) are at the time.
> >
> > No. Just because a lock wasn't contended on the initial aging doesn't mean it
> > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
> > takes mmu_lock for write for all operations, an aging operation could get lucky
> > and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> > queue of operations.
> >
> > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> > an MMU that takes mmu_lock for read in all but the most rare paths.
>
> Aging and look-around themselves only use the fast-only notifiers, so
> they won't ever wait on a lock (well... provided KVM is written like
> that, which I think is a given).

Regarding aging, is that actually the behavior that we want? I thought the plan
is to have the initial test look at all MMUs, i.e. be potentially slow, but only
do the lookaround if it can be fast. IIUC, that was Yu's intent (and peeking back
at v2, that is indeed the case, unless I'm misreading the code).

If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
end up with hot pages (used by L2) swapped out.

Or are you saying that the "test" could be slow, but not "clear"? That's also
suboptimal, because any pages accessed by L2 will always appear hot.

> should_look_around() will use the slow notifier because it (despite its name)
> is responsible for accurately determining if a page is young lest we evict a
> young page.
>
> So in this case where "fast" means "lock not contended for now",

No. In KVM, "fast" needs to be a property of the MMU, not a reflection of system
state at some random snapshot in time.

> I don't think it's necessarily wrong for MGLRU to attempt to find young
> pages, even if sometimes it will bail out because a lock is contended/held

lru_gen_look_around() skips lookaround if something else is waiting on the page
table lock, but that is a far cry from what KVM would be doing. (a) the PTL is
already held, and (b) it is scoped precisely to the range being processed. Not
looking around makes sense because there's a high probability of the PTEs in
question being modified by a different task, i.e. of the look around being a
waste of time.

In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
waiting, and would need to bail from the middle of the aging walk if a different
task contends mmu_lock.

I agree that's not "wrong", but in part because mmu_lock is scoped to the entire
VM, it risks ending up with semi-random, hard to debug behavior. E.g. a user
could see intermittent "failures" that come and go based on seemingly unrelated
behavior in KVM. And implementing the "bail" behavior in the shadow MMU would
require non-trivial changes.

In other words, I would very strongly prefer that the shadow MMU be all or nothing,
i.e. is either part of look-around or isn't. And if nested TDP doesn't fare well
with MGLRU, then we (or whoever cares) can spend the time+effort to make it work
with fast-aging.

Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow
MMU would be, I'm pretty sure we can go straight there for nested TDP. Or rather,
I suspect/hope we can get close enough for an initial merge, which would allow
aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.

Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
in such a way that a lockless walk would be painfully complex. But if there is
exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g.
mapping the same page into multiple L2s, the overwhelming majority of rmaps
have only one entry. That's not the case for legacy shadow paging because kernels
almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
along with any userspace mappings.

E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory during
boot).

So, if we bifurcate aging behavior based on whether or not the TDP MMU is enabled,
then whether or not aging is fast is constant (after KVM loads). Roughly, the KVM
side of things would be the below, plus a bunch of conversions to WRITE_ONCE() to
ensure a stable rmap value (KVM already plays nice with lockless accesses to SPTEs,
thanks to the fast page fault path).

If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still
well because the original SPTE pointer is still valid. If the rmap entry is
removed, then KVM just needs to ensure the owning page table isn't freed. That
could be done either with a cmpxchg() (KVM zaps leaf SPTEs before freeing page
tables, and the rmap stuff doesn't actually walk upper level entries), or by
enhancing the shadow MMU's lockless walk logic to allow lockless walks from
non-vCPU tasks.

And in the (hopefully) unlikely scenario someone has a use case where L1 maps a
gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making the
nested TDP shadow MMU rmap walks always lockless.

E.g. again very roughly, if we went with the latter:

@@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
}

+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ typedefme handler)
+{
+ gfn_t gfn;
+ int level;
+ u64 *spte;
+ bool ret;
+
+ walk_shadow_page_lockless_begin(???);
+
+ for (gfn = range->start; gfn < range->end; gfn++) {
+ for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
+
+ /* Skip the gfn if there are multiple SPTEs. */
+ if ((unsigned long)spte & 1)
+ continue;
+
+ ret |= handler(spte);
+ }
+ }
+
+ walk_shadow_page_lockless_end(???);
+}
+
static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
bool fast_only)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm)) {
- if (fast_only)
- return -1;
-
- write_lock(&kvm->mmu_lock);
- young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
- write_unlock(&kvm->mmu_lock);
- }
-
- if (tdp_mmu_enabled)
+ if (tdp_mmu_enabled) {
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
+ young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
+ } else if (!fast_only) {
+ write_lock(&kvm->mmu_lock);
+ young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

return (int)young;
}

> for a few or even a majority of the pages. Not doing look-around is the same
> as doing look-around and finding that no pages are young.

No, because the former is deterministic and predictable, the latter is not.

> Anyway, I don't think this bit is really all that important unless we
> can demonstrate that KVM participating like this actually results in a
> measurable win.

Participating like what? You've lost me a bit. Are we talking past each other?

What I am saying is that we do this (note that this is slightly different than
an earlier sketch; I botched the ordering of spin_is_contended() in that one, and
didn't account for the page being young in the primary MMU).

        if (pte_young(ptep_get(pte)))
                young = 1 | MMU_NOTIFY_WAS_FAST;

        young |= ptep_clear_young_notify(vma, addr, pte);
        if (!young)
                return false;

        if (!(young & MMU_NOTIFY_WAS_FAST))
                return true;

        if (spin_is_contended(pvmw->ptl))
                return false;

        /* exclude special VMAs containing anon pages from COW */
        if (vma->vm_flags & VM_SPECIAL)
                return false;