2024-05-29 18:05:41

by James Houghton

Subject: [PATCH v4 0/7] mm: multi-gen LRU: Walk secondary MMU page tables while aging

This patchset makes it possible for MGLRU to consult secondary MMUs
while doing aging, not just during eviction. This allows for more
accurate reclaim decisions, which is especially important for proactive
reclaim.

This series makes the necessary MMU notifier changes to MGLRU and then
adds optimizations on top of that. It also now includes changes to
access_tracking_perf_test to verify that aging works properly for pages
that are mainly used by KVM.

access_tracking_perf_test also has a mode (-p) to check performance of
MGLRU aging while the VM is faulting memory in. Here are some results:

Testing MGLRU aging while vCPUs are faulting in memory on x86 with the
TDP MMU. THPs disabled.

The test results varied a decent amount from run to run. I did my best to
take representative averages, but nonetheless, the big picture is the
important part.

Main takeaways:
- With the optimizations, the workload is much less impacted
by the presence of aging.
- With the optimizations, MGLRU is able to do aging much more
quickly, especially at 8+ vCPUs on my machine.

./access_tracking_perf_test -p -l -b 1G -v $N_VCPUS # 1G per vCPU

num_vcpus        vcpu wall time (s)   avg aging pass time (s)

1  (no aging)    0.878822016          n/a
1  (no opt)      0.938250568          0.008236007
1  (opt)         0.912270190          0.007314582

2  (no aging)    0.984959659          n/a
2  (no opt)      1.057880728          0.017989741
2  (opt)         1.037735641          0.013996319

4  (no aging)    1.264881581          n/a
4  (no opt)      1.318849182          0.056164918
4  (opt)         1.314653465          0.029311993

8  (no aging)    1.473883699          n/a
8  (no opt)      1.589441079          0.227419586
8  (opt)         1.498439592          0.063857740

16 (no aging)    2.048766096          n/a
16 (no opt)      2.399335597          1.247142841
16 (opt)         2.000914001          0.121628689

32 (no aging)    3.316256321          n/a
32 (no opt)      3.955417018          4.347290433
32 (opt)         3.355274507          0.250886289

64 (no aging)    6.498958516          n/a
64 (no opt)      7.127533884          9.815592054
64 (opt)         6.442582168          1.392907010

112 (no aging)   8.498029491          n/a
112 (no opt)     10.21372495          13.47381656
112 (opt)        8.896963554          2.292223850

Previous versions of this series included logic in MGLRU and KVM to
support batching the updates to secondary page tables. This version
removes this logic, as it was complex and not necessary to enable
proactive reclaim. This optimization, as well as the additional
optimizations for arm64 and powerpc, can be done in a later series.

Changes since v3[1]:
- Vastly simplified the series (thanks David). Removed mmu notifier
batching logic entirely.
- Cleaned up how locking is done for mmu_notifier_test/clear_young
(thanks David).
- Look-around is now only done when there are no secondary MMUs
subscribed to MMU notifiers.
- CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
- Fixed the lockless implementation of kvm_{test,}age_gfn for x86
(thanks David).
- Added MGLRU functional and performance tests to
access_tracking_perf_test (thanks Axel).
- In v3, an mm would be completely ignored (for aging) if there was a
secondary MMU but support for secondary MMU walking was missing. Now,
missing secondary MMU walking support simply skips the notifier
calls (except for eviction).
- Added a sanity check that range->lockless and range->on_lock are never
both provided for the memslot walk.

For the changes from v2[2] to v3, see v3[1].

This series applies cleanly to mm/mm-unstable and kvm/queue.

[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://lore.kernel.org/kvmarm/[email protected]/

James Houghton (7):
mm/Kconfig: Add LRU_GEN_WALKS_SECONDARY_MMU
mm: multi-gen LRU: Have secondary MMUs participate in aging
KVM: Add lockless memslot walk to KVM
KVM: Move MMU lock acquisition for test/clear_young to architecture
KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/kvm/hyp/pgtable.c | 9 +-
arch/arm64/kvm/mmu.c | 30 +-
arch/loongarch/kvm/mmu.c | 20 +-
arch/mips/kvm/mmu.c | 21 +-
arch/powerpc/kvm/book3s.c | 14 +-
arch/riscv/kvm/mmu.c | 26 +-
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 10 +-
arch/x86/kvm/mmu/tdp_iter.h | 27 +-
arch/x86/kvm/mmu/tdp_mmu.c | 67 ++-
include/linux/kvm_host.h | 1 +
include/linux/mmzone.h | 6 +-
mm/Kconfig | 8 +
mm/rmap.c | 9 +-
mm/vmscan.c | 144 +++++--
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
virt/kvm/kvm_main.c | 38 +-
21 files changed, 1104 insertions(+), 145 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: e0cce98fe279b64f4a7d81b7f5c3a23d80b92fbc
--
2.45.1.288.g0e0cd299f1-goog



2024-05-29 18:05:48

by James Houghton

Subject: [PATCH v4 1/7] mm/Kconfig: Add LRU_GEN_WALKS_SECONDARY_MMU

Add this option so that whoever builds the kernel can choose whether or
not to support walking secondary MMUs during aging.

We want users to be able to blindly enable all lru_gen features and get
the best possible performance most of the time. Walking secondary MMUs
is mainly useful for proactive reclaim, and it is possible that doing so
can harm VM performance.

This option should be enabled by users who run VMs and also want to do
proactive aging/reclaim with MGLRU.

With this config option enabled, a user can still disable the
new functionality at runtime through sysfs.

Signed-off-by: James Houghton <[email protected]>
---
mm/Kconfig | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..3ac4b1dbf745 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1222,6 +1222,14 @@ config LRU_GEN_STATS

This option has a per-memcg and per-node memory overhead.

+config LRU_GEN_WALKS_SECONDARY_MMU
+ bool "Walk secondary MMUs when aging"
+ depends on LRU_GEN && LRU_GEN_WALKS_MMU
+ help
+ This option allows multi-gen LRU to walk secondary MMU page tables
+ when aging. This allows for proactive reclaim, but this can reduce
+ overall performance (e.g. for a KVM VM).
+
config LRU_GEN_WALKS_MMU
def_bool y
depends on LRU_GEN && ARCH_HAS_HW_PTE_YOUNG
--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 18:06:29

by James Houghton

Subject: [PATCH v4 3/7] KVM: Add lockless memslot walk to KVM

Give architectures the flexibility to synchronize as optimally as they
can instead of always taking the MMU lock for writing.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.
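
As an illustration of the new interface, a hypothetical architecture
handler that can do the aging work without the MMU lock would check
range->lockless and skip its usual locking (sketch only; the
arch_age_gfn_*() helpers are made-up placeholders, not functions from
this series):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	if (range->lockless)
		/* The common memslot walker took no locks for this call. */
		return arch_age_gfn_lockless(kvm, range);

	/* Otherwise, synchronize as before (e.g. under the MMU lock). */
	return arch_age_gfn_locked(kvm, range);
}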

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 38 +++++++++++++++++++++++++-------------
2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 692c01e41a18..4d7c3e8632e6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@ struct kvm_gfn_range {
gfn_t end;
union kvm_mmu_notifier_arg arg;
bool may_block;
+ bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14841acb8b95..d197b6725cb3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -558,6 +558,7 @@ struct kvm_mmu_notifier_range {
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
+ bool lockless;
};

/*
@@ -612,6 +613,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
IS_KVM_NULL_FN(range->handler)))
return r;

+ /* on_lock will never be called for lockless walks */
+ if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+ return r;
+
idx = srcu_read_lock(&kvm->srcu);

for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -643,15 +648,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
+ gfn_range.lockless = range->lockless;

if (!r.found_memslot) {
r.found_memslot = true;
- KVM_MMU_LOCK(kvm);
- if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm);
-
- if (IS_KVM_NULL_FN(range->handler))
- break;
+ if (!range->lockless) {
+ KVM_MMU_LOCK(kvm);
+ if (!IS_KVM_NULL_FN(range->on_lock))
+ range->on_lock(kvm);
+
+ if (IS_KVM_NULL_FN(range->handler))
+ break;
+ }
}
r.ret |= range->handler(kvm, &gfn_range);
}
@@ -660,7 +668,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);

- if (r.found_memslot)
+ if (r.found_memslot && !range->lockless)
KVM_MMU_UNLOCK(kvm);

srcu_read_unlock(&kvm->srcu, idx);
@@ -686,10 +694,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
return __kvm_handle_hva_range(kvm, &range).ret;
}

-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
- unsigned long start,
- unsigned long end,
- gfn_handler_t handler)
+static __always_inline int kvm_handle_hva_range_no_flush(
+ struct mmu_notifier *mn,
+ unsigned long start,
+ unsigned long end,
+ gfn_handler_t handler,
+ bool lockless)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
@@ -699,6 +709,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
+ .lockless = lockless,
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -889,7 +900,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
* cadence. If we find this inaccurate, we might come up with a
* more sophisticated heuristic later.
*/
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_handle_hva_range_no_flush(mn, start, end,
+ kvm_age_gfn, false);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -899,7 +911,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
trace_kvm_test_age_hva(address);

return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
+ kvm_test_age_gfn, false);
}

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 18:06:35

by James Houghton

Subject: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

Secondary MMUs are currently consulted for access/age information at
eviction time, but before then, we don't get accurate age information.
That is, pages that are mostly accessed through a secondary MMU (like
guest memory, used by KVM) will always just proceed down to the oldest
generation, and then at eviction time, if KVM reports the page to be
young, the page will be activated/promoted back to the youngest
generation.

Instead, consult secondary MMUs via the test/clear_young MMU notifiers
during the aging page table walks as well. Do not do look-around if
there is a secondary MMU we have to interact with.

The added feature bit (0x8), if disabled, will make MGLRU behave as if
there are no secondary MMUs subscribed to MMU notifiers except at
eviction time.
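
Conceptually, the aging walk now treats a page as young if either the
primary PTE or a secondary MMU reports an access. A minimal sketch of
that check (the real helpers added below are should_walk_secondary_mmu()
and lru_gen_notifier_test/clear_young(); pte_is_young_anywhere() is just
an illustrative name):

static bool pte_is_young_anywhere(struct vm_area_struct *vma,
				  unsigned long addr, pte_t ptent)
{
	bool young = pte_young(ptent);

	/* Only consult secondary MMUs if the new capability (0x8) is set. */
	if (should_walk_secondary_mmu())
		young |= mmu_notifier_clear_young(vma->vm_mm, addr,
						  addr + PAGE_SIZE);
	return young;
}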

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
include/linux/mmzone.h | 6 +-
mm/rmap.c | 9 +-
mm/vmscan.c | 144 ++++++++++++++----
4 files changed, 123 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..1e578e0c4c0c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Continuously clear the accessed bit in secondary MMU page
+ tables instead of waiting until eviction time. This results in
+ accurate page age information for pages that are mainly used by
+ a secondary MMU.
[yYnN] Apply to all the components above.
====== ===============================================================

@@ -56,7 +60,7 @@ E.g.,

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
- 0x0007
+ 0x000f
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..869824ef5f3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -400,6 +400,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_SECONDARY_MMU_WALK,
NR_LRU_GEN_CAPS
};

@@ -557,7 +558,7 @@ struct lru_gen_memcg {

void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +577,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..24a3ff639919 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
continue;
}

- if (pvmw.pte) {
- if (lru_gen_enabled() &&
- pte_young(ptep_get(pvmw.pte))) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d55e8d07ffc4..0d89f712f45c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2579,6 +2580,12 @@ static bool should_clear_pmd_young(void)
return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

+static bool should_walk_secondary_mmu(void)
+{
+ return IS_ENABLED(CONFIG_LRU_GEN_WALKS_SECONDARY_MMU) &&
+ get_cap(LRU_GEN_SECONDARY_MMU_WALK);
+}
+
/******************************************************************************
* shorthand helpers
******************************************************************************/
@@ -3276,7 +3283,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
return false;
}

-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pte_pfn(pte);

@@ -3291,10 +3299,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pmd_pfn(pmd);

@@ -3309,6 +3322,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

@@ -3317,10 +3334,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
{
struct folio *folio;

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- return NULL;
-
folio = pfn_folio(pfn);
if (folio_nid(folio) != pgdat->node_id)
return NULL;
@@ -3343,6 +3356,32 @@ static bool suitable_to_scan(int total, int young)
return young * n >= total;
}

+static bool lru_gen_notifier_test_young(struct mm_struct *mm,
+ unsigned long addr)
+{
+ return should_walk_secondary_mmu() && mmu_notifier_test_young(mm, addr);
+}
+
+static bool lru_gen_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return should_walk_secondary_mmu() &&
+ mmu_notifier_clear_young(mm, start, end);
+}
+
+static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr,
+ pmd_t *pmd)
+{
+ bool young = pmdp_test_and_clear_young(vma, addr, pmd);
+
+ if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE))
+ young = true;
+
+ return young;
+}
+
static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct mm_walk *args)
{
@@ -3357,8 +3396,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = args->mm;

- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
@@ -3376,11 +3416,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
total++;
walk->mm_stats[MM_LEAF_TOTAL]++;

- pfn = get_pte_pfn(ptent, args->vma, addr);
+ pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent)) {
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr)) {
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3389,8 +3430,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (!folio)
continue;

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3456,22 +3498,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;

- pfn = get_pmd_pfn(pmd[i], vma, addr);
- if (pfn == -1)
- goto next;
-
- if (!pmd_trans_huge(pmd[i])) {
- if (should_clear_pmd_young())
+ if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+ if (should_clear_pmd_young() &&
+ !should_walk_secondary_mmu())
pmdp_test_and_clear_young(vma, addr, pmd + i);
goto next;
}

+ pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+ if (pfn == -1)
+ goto next;
+
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
goto next;

- if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
goto next;
+ }

walk->mm_stats[MM_LEAF_YOUNG]++;

@@ -3528,19 +3573,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
}

if (pmd_trans_huge(val)) {
- unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);

walk->mm_stats[MM_LEAF_TOTAL]++;

- if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
+ if (pfn == -1)
continue;
- }

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
continue;
+ }

walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
@@ -3548,7 +3592,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,

walk->mm_stats[MM_NONLEAF_TOTAL]++;

- if (should_clear_pmd_young()) {
+ if (should_clear_pmd_young() && !should_walk_secondary_mmu()) {
if (!pmd_young(val))
continue;

@@ -3994,6 +4038,26 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ bool secondary_was_young =
+ mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
+
+ /*
+ * Look around if (1) the PTE is young and (2) we do not need to
+ * consult any secondary MMUs.
+ */
+ if (pte_young(ptep_get(pte))) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return !mm_has_notifiers(vma->vm_mm);
+ } else if (secondary_was_young)
+ *young = true;
+
+ return false;
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4001,7 +4065,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
@@ -4019,16 +4083,20 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
DEFINE_MAX_SEQ(lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = pvmw->vma->vm_mm;

lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4036,6 +4104,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, vma->vm_start);
end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4049,7 +4120,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

@@ -4059,19 +4130,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);

- pfn = get_pte_pfn(ptent, vma, addr);
+ pfn = get_pte_pfn(ptent, vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent))
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr))
continue;

folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
if (!folio)
continue;

- if (!ptep_test_and_clear_young(vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(vma, addr, pte + i);

young++;

@@ -4101,6 +4174,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5137,6 +5212,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
if (should_clear_pmd_young())
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (should_walk_secondary_mmu())
+ caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 18:06:58

by James Houghton

Subject: [PATCH v4 4/7] KVM: Move MMU lock acquisition for test/clear_young to architecture

When implementing mmu_notifier_{test,clear}_young, the KVM memslot
walker used to take the MMU lock for the architecture. Now make the
architectures take it themselves.

Don't relax locking for any architecture except powerpc e500; its
implementations of kvm_age_gfn and kvm_test_age_gfn simply return false,
so there is no need to grab the KVM MMU lock.
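
The shape of the change is the same in every converted architecture;
roughly (sketch only, assuming a spinlock mmu_lock; x86 and arm64 use
write_lock() on their rwlock instead, and arch_do_aging() stands in for
the existing per-arch logic):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young;

	/* The generic notifier path no longer takes mmu_lock for us. */
	spin_lock(&kvm->mmu_lock);
	young = arch_do_aging(kvm, range);
	spin_unlock(&kvm->mmu_lock);

	return young;
}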

Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/kvm/mmu.c | 30 ++++++++++++++++++++++--------
arch/loongarch/kvm/mmu.c | 20 +++++++++++++++-----
arch/mips/kvm/mmu.c | 21 ++++++++++++++++-----
arch/powerpc/kvm/book3s.c | 14 ++++++++++++--
arch/riscv/kvm/mmu.c | 26 ++++++++++++++++++++------
arch/x86/kvm/mmu/mmu.c | 8 ++++++++
virt/kvm/kvm_main.c | 4 ++--
7 files changed, 95 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8bcab0cc3fe9..8337009dde77 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ write_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, true);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, true);
+
+out:
+ write_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ write_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, false);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, false);
+
+out:
+ write_unlock(&kvm->mmu_lock);
+ return young;
}

phys_addr_t kvm_mmu_get_httbr(void)
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index 98883aa23ab8..5eb262bcf6b0 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -497,24 +497,34 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
kvm_ptw_ctx ctx;
+ bool young;
+
+ spin_lock(&kvm->mmu_lock);

ctx.flag = 0;
ctx.ops = kvm_mkold_pte;
kvm_ptw_prepare(kvm, &ctx);

- return kvm_ptw_top(kvm->arch.pgd, range->start << PAGE_SHIFT,
+ young = kvm_ptw_top(kvm->arch.pgd, range->start << PAGE_SHIFT,
range->end << PAGE_SHIFT, &ctx);
+
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
- kvm_pte_t *ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);
+ kvm_pte_t *ptep;
+ bool young;

- if (ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep))
- return true;
+ spin_lock(&kvm->mmu_lock);
+ ptep = kvm_populate_gpa(kvm, NULL, gpa, 0);

- return false;
+ young = ptep && kvm_pte_present(NULL, ptep) && kvm_pte_young(*ptep);
+
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

/*
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index c17157e700c0..db3b7cf22db1 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -446,17 +446,28 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
+ bool young;
+
+ spin_lock(&kvm->mmu_lock);
+ young = kvm_mips_mkold_gpa_pt(kvm, range->start, range->end);
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
gpa_t gpa = range->start << PAGE_SHIFT;
- pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
+ pte_t *gpa_pte;
+ bool young = false;

- if (!gpa_pte)
- return false;
- return pte_young(*gpa_pte);
+ spin_lock(&kvm->mmu_lock);
+ gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
+
+ if (gpa_pte)
+ young = pte_young(*gpa_pte);
+
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

/**
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index ff6c38373957..f503ab9ac3a5 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -887,12 +887,22 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm->arch.kvm_ops->age_gfn(kvm, range);
+ bool young;
+
+ spin_lock(&kvm->mmu_lock);
+ young = kvm->arch.kvm_ops->age_gfn(kvm, range);
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm->arch.kvm_ops->test_age_gfn(kvm, range);
+ bool young;
+
+ spin_lock(&kvm->mmu_lock);
+ young = kvm->arch.kvm_ops->test_age_gfn(kvm, range);
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

int kvmppc_core_init_vm(struct kvm *kvm)
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index b63650f9b966..c78abe8041fb 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -555,17 +555,24 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ spin_lock(&kvm->mmu_lock);

if (!kvm->arch.pgd)
- return false;
+ goto out;

WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);

if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ goto out;
+
+ young = ptep_test_and_clear_young(NULL, 0, ptep);

- return ptep_test_and_clear_young(NULL, 0, ptep);
+out:
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -573,17 +580,24 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
pte_t *ptep;
u32 ptep_level = 0;
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ spin_lock(&kvm->mmu_lock);

if (!kvm->arch.pgd)
- return false;
+ goto out;

WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);

if (!gstage_get_leaf_entry(kvm, range->start << PAGE_SHIFT,
&ptep, &ptep_level))
- return false;
+ goto out;
+
+ young = pte_young(ptep_get(ptep));

- return pte_young(ptep_get(ptep));
+out:
+ spin_unlock(&kvm->mmu_lock);
+ return young;
}

int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 662f62dfb2aa..6a2a557c2c31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1630,12 +1630,16 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

+ write_lock(&kvm->mmu_lock);
+
if (kvm_memslots_have_rmaps(kvm))
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

+ write_unlock(&kvm->mmu_lock);
+
return young;
}

@@ -1643,12 +1647,16 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

+ write_lock(&kvm->mmu_lock);
+
if (kvm_memslots_have_rmaps(kvm))
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

+ write_unlock(&kvm->mmu_lock);
+
return young;
}

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d197b6725cb3..8d2d3acf18d8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -901,7 +901,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
* more sophisticated heuristic later.
*/
return kvm_handle_hva_range_no_flush(mn, start, end,
- kvm_age_gfn, false);
+ kvm_age_gfn, true);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -911,7 +911,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
trace_kvm_test_age_hva(address);

return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn, false);
+ kvm_test_age_gfn, true);
}

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 18:08:09

by James Houghton

Subject: [PATCH v4 7/7] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

This test now has two modes of operation:
1. (default) To check how much vCPU performance was affected by access
tracking (previously existed, now supports MGLRU aging).
2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
faulting in memory.

Mode (1) also serves as a way to verify that aging is working properly
for pages only accessed by KVM. It will fail if the 0x8 lru_gen feature
bit is not enabled.

To support MGLRU, the test creates a memory cgroup, moves itself into
it, then uses the lru_gen debugfs output to track memory in that cgroup.
The logic to parse the lru_gen debugfs output has been put into
selftests/kvm/lib/lru_gen_util.c.
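
For reference, the aging verification in mode (1) boils down to roughly
the following (simplified from run_test() below, using the helpers this
patch adds in lru_gen_util.c):

	struct memcg_stats stats;
	long total_pages = nr_vcpus * vcpu_memory_bytes / getpagesize();
	int gen_before, gen_after;

	/* Scan page tables once so all test pages land in a generation. */
	lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
	gen_before = lru_gen_find_generation(&stats, total_pages);

	/* Touch the memory from the vCPUs, then age again. */
	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
	lru_gen_do_aging(&stats, TEST_MEMCG_NAME);

	/* The touched pages must now show up in a newer generation. */
	gen_after = lru_gen_find_generation(&stats, total_pages);
	TEST_ASSERT(gen_after > gen_before, "pages did not get younger");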

Co-developed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
4 files changed, 782 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ce8ff8e8ce3a..86415f524c48 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
LIBKVM += lib/guest_modes.c
LIBKVM += lib/io.c
LIBKVM += lib/kvm_util.c
+LIBKVM += lib/lru_gen_util.c
LIBKVM += lib/memstress.c
LIBKVM += lib/guest_sprintf.c
LIBKVM += lib/rbtree.c
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..15be99ff3bdc 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -38,6 +38,7 @@
#include <inttypes.h>
#include <limits.h>
#include <pthread.h>
+#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
@@ -47,6 +48,20 @@
#include "memstress.h"
#include "guest_modes.h"
#include "processor.h"
+#include "lru_gen_util.h"
+
+static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
+static const int LRU_GEN_ENABLED = 0x1;
+static const int LRU_GEN_MM_WALK = 0x2;
+static const int LRU_GEN_SECONDARY_MMU_WALK = 0x8;
+static const char *CGROUP_PROCS = "cgroup.procs";
+/*
+ * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
+ * is mounted at cgroup_root.
+ *
+ * Can be changed with -r.
+ */
+static const char *cgroup_root = "/sys/fs/cgroup";

/* Global variable used to synchronize all of the vCPU threads. */
static int iteration;
@@ -62,6 +77,9 @@ static enum {
/* The iteration that was last completed by each vCPU. */
static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];

+/* The time at which the last iteration was completed */
+static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
+
/* Whether to overlap the regions of memory vCPUs access. */
static bool overlap_memory_access;

@@ -74,6 +92,12 @@ struct test_params {

/* The number of vCPUs to create in the VM. */
int nr_vcpus;
+
+ /* Whether to use lru_gen aging instead of idle page tracking. */
+ bool lru_gen;
+
+ /* Whether to test the performance of aging itself. */
+ bool benchmark_lru_gen;
};

static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
@@ -89,6 +113,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)

}

+static void write_file_long(const char *path, long v)
+{
+ FILE *f;
+
+ f = fopen(path, "w");
+ TEST_ASSERT(f, "fopen(%s) failed", path);
+ TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
+ "fprintf to %s failed", path);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
+}
+
+static char *path_join(const char *parent, const char *child)
+{
+ char *out = NULL;
+
+ return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
+}
+
+static char *memcg_path(const char *memcg)
+{
+ return path_join(cgroup_root, memcg);
+}
+
+static char *memcg_file_path(const char *memcg, const char *file)
+{
+ char *mp = memcg_path(memcg);
+ char *fp;
+
+ if (!mp)
+ return NULL;
+ fp = path_join(mp, file);
+ free(mp);
+ return fp;
+}
+
+static void move_to_memcg(const char *memcg, pid_t pid)
+{
+ char *procs = memcg_file_path(memcg, CGROUP_PROCS);
+
+ TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
+ write_file_long(procs, pid);
+ free(procs);
+}
+
#define PAGEMAP_PRESENT (1ULL << 63)
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)

@@ -242,6 +310,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
};

vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+ clock_gettime(CLOCK_MONOTONIC,
+ &vcpu_last_completed_time[vcpu_idx]);
}
}

@@ -253,38 +323,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
}
}

+static bool all_vcpus_done(int target_iteration, int nr_vcpus)
+{
+ for (int i = 0; i < nr_vcpus; ++i)
+ if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+ target_iteration)
+ return false;
+
+ return true;
+}
+
/* The type of memory accesses to perform in the VM. */
enum access_type {
ACCESS_READ,
ACCESS_WRITE,
};

-static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
+static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
+ bool wait)
{
- struct timespec ts_start;
- struct timespec ts_elapsed;
int next_iteration, i;

/* Kick off the vCPUs by incrementing iteration. */
next_iteration = ++iteration;

- clock_gettime(CLOCK_MONOTONIC, &ts_start);
-
/* Wait for all vCPUs to finish the iteration. */
- for (i = 0; i < nr_vcpus; i++)
- spin_wait_for_vcpu(i, next_iteration);
+ if (wait) {
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);

- ts_elapsed = timespec_elapsed(ts_start);
- pr_info("%-30s: %ld.%09lds\n",
- description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ for (i = 0; i < nr_vcpus; i++)
+ spin_wait_for_vcpu(i, next_iteration);
+
+ ts_elapsed = timespec_elapsed(ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n",
+ description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ } else
+ pr_info("%-30s\n", description);
}

-static void access_memory(struct kvm_vm *vm, int nr_vcpus,
- enum access_type access, const char *description)
+static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description,
+ bool wait)
{
memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
iteration_work = ITERATION_ACCESS_MEMORY;
- run_iteration(vm, nr_vcpus, description);
+ run_iteration(vm, nr_vcpus, description, wait);
+}
+
+static void access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, true);
+}
+
+static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access,
+ const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, false);
}

static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
@@ -297,19 +397,111 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
*/
pr_debug("Marking VM memory idle (slow)...\n");
iteration_work = ITERATION_MARK_IDLE;
- run_iteration(vm, nr_vcpus, "Mark memory idle");
+ run_iteration(vm, nr_vcpus, "Mark memory idle", true);
}

-static void run_test(enum vm_guest_mode mode, void *arg)
+static void create_memcg(const char *memcg)
+{
+ const char *full_memcg_path = memcg_path(memcg);
+ int ret;
+
+ TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
+retry:
+ ret = mkdir(full_memcg_path, 0755);
+ if (ret && errno == EEXIST) {
+ TEST_ASSERT(!rmdir(full_memcg_path),
+ "Found existing memcg at %s, but rmdir failed",
+ full_memcg_path);
+ goto retry;
+ }
+ TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
+ full_memcg_path);
+
+ pr_info("Created memcg at %s\n", full_memcg_path);
+}
+
+/*
+ * Test lru_gen aging speed while vCPUs are faulting memory in.
+ *
+ * This test will run lru_gen aging until the vCPUs have finished all of
+ * the faulting work, reporting:
+ * - vcpu wall time (wall time for slowest vCPU)
+ * - average aging pass duration
+ * - total number of aging passes
+ * - total time spent aging
+ *
+ * This test produces the most useful results when the vcpu wall time and the
+ * total time spent aging are similar (i.e., we want to avoid timing aging
+ * while the vCPUs aren't doing any work).
+ */
+static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
{
- struct test_params *params = arg;
- struct kvm_vm *vm;
int nr_vcpus = params->nr_vcpus;
+ struct memcg_stats stats;
+ struct timespec ts_start, ts_max, ts_vcpus_elapsed,
+ ts_aging_elapsed, ts_aging_elapsed_avg;
+ int num_passes = 0;

- vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
- params->backing_src, !overlap_memory_access);
+ printf("Running lru_gen benchmark...\n");

- memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ access_memory_async(vm, nr_vcpus, ACCESS_WRITE,
+ "Populating memory (async)");
+ while (!all_vcpus_done(iteration, nr_vcpus)) {
+ lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME);
+ ++num_passes;
+ }
+
+ ts_aging_elapsed = timespec_elapsed(ts_start);
+ ts_aging_elapsed_avg = timespec_div(ts_aging_elapsed, num_passes);
+
+ /* Find out when the slowest vCPU finished. */
+ ts_max = ts_start;
+ for (int i = 0; i < nr_vcpus; ++i) {
+ struct timespec *vcpu_ts = &vcpu_last_completed_time[i];
+
+ if (ts_max.tv_sec < vcpu_ts->tv_sec ||
+ (ts_max.tv_sec == vcpu_ts->tv_sec &&
+ ts_max.tv_nsec < vcpu_ts->tv_nsec))
+ ts_max = *vcpu_ts;
+ }
+
+ ts_vcpus_elapsed = timespec_sub(ts_max, ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n", "vcpu wall time",
+ ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec);
+
+ pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n",
+ "lru_gen avg pass duration",
+ ts_aging_elapsed_avg.tv_sec,
+ ts_aging_elapsed_avg.tv_nsec,
+ num_passes,
+ ts_aging_elapsed.tv_sec,
+ ts_aging_elapsed.tv_nsec);
+}
+
+/*
+ * Test how much access tracking affects vCPU performance.
+ *
+ * Supports two modes of access tracking:
+ * - idle page tracking
+ * - lru_gen aging
+ *
+ * When using lru_gen, this test additionally verifies that the pages are in
+ * fact getting younger and older, otherwise the performance data would be
+ * invalid.
+ *
+ * The forced lru_gen aging can race with aging that occurs naturally.
+ */
+static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
+{
+ int nr_vcpus = params->nr_vcpus;
+ bool lru_gen = params->lru_gen;
+ struct memcg_stats stats;
+ long total_pages = nr_vcpus * params->vcpu_memory_bytes / getpagesize();
+ int found_gens[5];

pr_info("\n");
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
@@ -319,11 +511,83 @@ static void run_test(enum vm_guest_mode mode, void *arg)
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");

/* Repeat on memory that has been marked as idle. */
- mark_memory_idle(vm, nr_vcpus);
+ if (lru_gen) {
+ /* Do an initial page table scan */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+ TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
+ "Not all pages tracked in lru_gen stats.\n"
+ "Is lru_gen enabled? Did the memcg get created properly?");
+
+ /* Find the generation we're currently in (probably youngest) */
+ found_gens[0] = lru_gen_find_generation(&stats, total_pages);
+
+ /* Do an aging pass now */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation, but a newer generation has been made */
+ found_gens[1] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[1] == found_gens[0],
+ "unexpected gen change: %d vs. %d",
+ found_gens[1], found_gens[0]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
- mark_memory_idle(vm, nr_vcpus);
+
+ if (lru_gen) {
+ /* Scan the page tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[2] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[2] > found_gens[1],
+ "pages did not get younger");
+
+ /* Do another aging pass */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation; new generation has been made */
+ found_gens[3] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[3] == found_gens[2],
+ "unexpected gen change: %d vs. %d",
+ found_gens[3], found_gens[2]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");

+ if (lru_gen) {
+ /* Scan the pages tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[4] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[4] > found_gens[3],
+ "pages did not get younger");
+ }
+}
+
+static void setup_vm_and_run(enum vm_guest_mode mode, void *arg)
+{
+ struct test_params *params = arg;
+ int nr_vcpus = params->nr_vcpus;
+ struct kvm_vm *vm;
+
+ if (params->lru_gen) {
+ create_memcg(TEST_MEMCG_NAME);
+ move_to_memcg(TEST_MEMCG_NAME, getpid());
+ }
+
+ vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+ params->backing_src, !overlap_memory_access);
+
+ memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+
+ if (params->benchmark_lru_gen)
+ run_benchmark(mode, vm, params);
+ else
+ run_test(mode, vm, params);
+
memstress_join_vcpu_threads(nr_vcpus);
memstress_destroy_vm(vm);
}
@@ -331,8 +595,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
static void help(char *name)
{
puts("");
- printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o] [-s mem_type]\n",
- name);
+ printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]"
+ " [-s mem_type] [-l] [-r memcg_root]\n", name);
puts("");
printf(" -h: Display this help message.");
guest_modes_help();
@@ -342,6 +606,9 @@ static void help(char *name)
printf(" -v: specify the number of vCPUs to run.\n");
printf(" -o: Overlap guest memory accesses instead of partitioning\n"
" them into a separate region of memory for each vCPU.\n");
+ printf(" -l: Use MGLRU aging instead of idle page tracking\n");
+ printf(" -p: Benchmark MGLRU aging while faulting memory in\n");
+ printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n");
backing_src_help("-s");
puts("");
exit(0);
@@ -353,13 +620,15 @@ int main(int argc, char *argv[])
.backing_src = DEFAULT_VM_MEM_SRC,
.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
.nr_vcpus = 1,
+ .lru_gen = false,
+ .benchmark_lru_gen = false,
};
int page_idle_fd;
int opt;

guest_modes_append_default();

- while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+ while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
@@ -376,6 +645,15 @@ int main(int argc, char *argv[])
case 's':
params.backing_src = parse_backing_src_type(optarg);
break;
+ case 'l':
+ params.lru_gen = true;
+ break;
+ case 'p':
+ params.benchmark_lru_gen = true;
+ break;
+ case 'r':
+ cgroup_root = strdup(optarg);
+ break;
case 'h':
default:
help(argv[0]);
@@ -383,12 +661,39 @@ int main(int argc, char *argv[])
}
}

- page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
- __TEST_REQUIRE(page_idle_fd >= 0,
- "CONFIG_IDLE_PAGE_TRACKING is not enabled");
- close(page_idle_fd);
+ if (!params.lru_gen) {
+ page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+ __TEST_REQUIRE(page_idle_fd >= 0,
+ "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+ close(page_idle_fd);
+ } else {
+ int lru_gen_fd, lru_gen_debug_fd;
+ long mglru_features;
+ char mglru_feature_str[8] = {};
+
+ lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
+ __TEST_REQUIRE(lru_gen_fd >= 0,
+ "CONFIG_LRU_GEN is not enabled");
+ TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
+ "couldn't read lru_gen features");
+ mglru_features = strtol(mglru_feature_str, NULL, 16);
+ __TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
+ "lru_gen is not enabled");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
+ "lru_gen does not support MM_WALK");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_SECONDARY_MMU_WALK,
+ "lru_gen does not support SECONDARY_MMU_WALK");
+
+ lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
+ __TEST_REQUIRE(lru_gen_debug_fd >= 0,
+ "Cannot access %s", DEBUGFS_LRU_GEN);
+ close(lru_gen_debug_fd);
+ }
+
+ TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
+ "-p specified without -l");

- for_each_guest_mode(run_test, &params);
+ for_each_guest_mode(setup_vm_and_run, &params);

return 0;
}
diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
new file mode 100644
index 000000000000..4eef8085a3cb
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
+ *
+ * Copyright (C) 2024, Google LLC.
+ */
+#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
+#define SELFTEST_KVM_LRU_GEN_UTIL_H
+
+#include <inttypes.h>
+#include <limits.h>
+#include <stdlib.h>
+
+#include "test_util.h"
+
+#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
+#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
+
+static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
+
+struct generation_stats {
+ int gen;
+ long age_ms;
+ long nr_anon;
+ long nr_file;
+};
+
+struct node_stats {
+ int node;
+ int nr_gens; /* Number of populated gens entries. */
+ struct generation_stats gens[MAX_NR_GENS];
+};
+
+struct memcg_stats {
+ unsigned long memcg_id;
+ int nr_nodes; /* Number of populated nodes entries. */
+ struct node_stats nodes[MAX_NR_NODES];
+};
+
+void print_memcg_stats(const struct memcg_stats *stats, const char *name);
+
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+long sum_memcg_stats(const struct memcg_stats *stats);
+
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg);
+
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg);
+
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages);
+
+#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ * node (id)
+ * (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+ bool consumed; /* Whether or not this line was consumed */
+ /* Next parse handler to invoke */
+ void (*next_handler)(struct memcg_stats *,
+ struct memcg_stats_parse_context *, char *);
+ int current_node_idx; /* Current index in nodes array */
+ const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+
+struct split_iterator {
+ char *str;
+ char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+ char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+ it->str = NULL;
+ return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *memcg_id = split_next(&it);
+ char *memcg_name = split_next(&it);
+ char *end;
+
+ ctx->consumed = true;
+
+ if (!prefix || strcmp("memcg", prefix))
+ return; /* Not a memcg line (maybe empty), skip */
+
+ TEST_ASSERT(memcg_id && memcg_name,
+ "malformed memcg line; no memcg id or memcg_name");
+
+ if (strcmp(memcg_name + 1, ctx->name))
+ return; /* Wrong memcg, skip */
+
+ /* Found it! */
+
+ stats->memcg_id = strtoul(memcg_id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+ if (!stats->memcg_id)
+ return; /* Removed memcg? */
+
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+}
+
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *id = split_next(&it);
+ long found_node_id;
+ char *end;
+
+ ctx->consumed = true;
+ ctx->current_node_idx = -1;
+
+ if (!prefix)
+ return; /* Skip empty lines */
+
+ if (!strcmp("memcg", prefix)) {
+ /* Memcg done, found next one; stop. */
+ ctx->next_handler = NULL;
+ return;
+ } else if (strcmp("node", prefix))
+ TEST_ASSERT(false, "found malformed line after 'memcg ...',"
+ "token: '%s'", prefix);
+
+ /* At this point we know we have a node line. Parse the ID. */
+
+ TEST_ASSERT(id, "malformed node line; no node id");
+
+ found_node_id = strtol(id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+ ctx->current_node_idx = stats->nr_nodes++;
+ TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+ "memcg has stats for too many nodes, max is %d",
+ MAX_NR_NODES);
+ stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+ ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ /* Have to copy since we might not consume */
+ char *my_line = strdup(line);
+ struct split_iterator it = { .str = my_line };
+ char *gen, *age, *nr_anon, *nr_file;
+ struct node_stats *node_stats;
+ struct generation_stats *gen_stats;
+ char *end;
+
+ TEST_ASSERT(it.str, "failed to copy input line");
+
+ gen = split_next(&it);
+
+ /* Skip empty lines */
+ if (!gen)
+ goto out_consume; /* Skip empty lines */
+
+ if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+ /*
+ * Reached next memcg or node section. Don't consume, let the
+ * other handler deal with this.
+ */
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+ goto out;
+ }
+
+ node_stats = &stats->nodes[ctx->current_node_idx];
+ TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+ "found too many generation lines; max is %d",
+ MAX_NR_GENS);
+ gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+ age = split_next(&it);
+ nr_anon = split_next(&it);
+ nr_file = split_next(&it);
+
+ TEST_ASSERT(age && nr_anon && nr_file,
+ "malformed generation line; not enough tokens");
+
+ gen_stats->gen = (int)strtol(gen, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+ gen_stats->age_ms = strtol(age, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+ gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+ nr_anon);
+
+ gen_stats->nr_file = strtol(nr_file, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+ ctx->consumed = true;
+out:
+ free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+ int node, gen;
+
+ fprintf(stderr, "stats for memcg %s (id %lu):\n",
+ name, stats->memcg_id);
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ const struct generation_stats *gstats =
+ &stats->nodes[node].gens[gen];
+
+ fprintf(stderr,
+ "\t\tgen %d\tage_ms %ld"
+ "\tnr_anon %ld\tnr_file %ld\n",
+ gstats->gen, gstats->age_ms, gstats->nr_anon,
+ gstats->nr_file);
+ }
+ }
+}
+
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ FILE *f;
+ ssize_t read = 0;
+ char *line = NULL;
+ size_t bufsz;
+ struct memcg_stats_parse_context ctx = {
+ .next_handler = memcg_stats_handle_searching,
+ .name = memcg,
+ };
+
+ memset(stats, 0, sizeof(struct memcg_stats));
+
+ f = fopen(DEBUGFS_LRU_GEN, "r");
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+ while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+ ctx.consumed = false;
+
+ do {
+ ctx.next_handler(stats, &ctx, line);
+ if (!ctx.next_handler)
+ break;
+ } while (!ctx.consumed);
+ }
+
+ if (read < 0 && !feof(f))
+ TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+ TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+ "Did the memcg get created in the proper mount?",
+ memcg);
+ if (line)
+ free(line);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, look for all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+ const struct memcg_stats *stats)
+{
+ int node, gen;
+ long total_nr = 0;
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ const struct node_stats *node_stats = &stats->nodes[node];
+
+ for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+ const struct generation_stats *gen_stats =
+ &node_stats->gens[gen];
+
+ if (target_gen >= 0 && gen_stats->gen != target_gen)
+ continue;
+
+ total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+ }
+ }
+
+ return total_nr;
+}
+
+/* Find all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+ return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+ print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * If lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+ FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+ char *command;
+ int sz;
+
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
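+ /* lru_gen aging command: "+ memcg_id node_id max_gen [can_swap [force_scan]]" */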
+ sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+ memcg_id, node_id, max_gen, force_scan);
+ TEST_ASSERT(sz > 0, "creating aging command failed");
+
+ pr_debug("Running aging command: %s", command);
+ if (fwrite(command, sizeof(char), sz, f) < sz) {
+ TEST_ASSERT(false, "writing aging command %s to %s failed",
+ command, DEBUGFS_LRU_GEN);
+ }
+
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+ bool verbose)
+{
+ int node, gen;
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ pr_debug("lru_gen: invoking aging...\n");
+
+ /* Must read memcg stats to construct the proper aging command. */
+ read_print_memcg_stats(stats, memcg);
+
+ if (verbose)
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ int max_gen = 0;
+
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ int this_gen = stats->nodes[node].gens[gen].gen;
+
+ max_gen = max_gen > this_gen ? max_gen : this_gen;
+ }
+
+ run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+ max_gen);
+ }
+
+ if (verbose) {
+ ts_elapsed = timespec_elapsed(ts_start);
+ pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+ ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ }
+
+ /* Re-read so callers get updated information */
+ read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages)
+{
+ int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+ for (node = 0; node < stats->nr_nodes; ++node)
+ for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+ ++gen_idx) {
+ gen = stats->nodes[node].gens[gen_idx].gen;
+ max_gen = gen > max_gen ? gen : max_gen;
+ min_gen = gen < min_gen ? gen : min_gen;
+ }
+
+ for (gen = min_gen; gen < max_gen; ++gen)
+ /* See if the most pages are in this generation. */
+ if (sum_memcg_stats_for_gen(gen, stats) >
+ total_pages / 2)
+ return gen;
+
+ TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+ total_pages);
+
+ /* unreachable, but make the compiler happy */
+ return -1;
+}
--
2.45.1.288.g0e0cd299f1-goog
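
As a quick orientation, a minimal, hypothetical sketch of how the helpers
above fit together (the actual consumer is access_tracking_perf_test
elsewhere in this series, whose wiring differs; the memcg name is made up):

/*
 * Hypothetical usage sketch only; assumes the test already runs inside a
 * memcg named "lru_gen_test".
 */
static void example_lru_gen_usage(unsigned long guest_pages)
{
	struct memcg_stats stats;
	int gen;

	/* Parse /sys/kernel/debug/lru_gen for our memcg. */
	read_memcg_stats(&stats, "lru_gen_test");
	pr_info("lru_gen is tracking %ld pages\n", sum_memcg_stats(&stats));

	/* Run one aging pass per node and report how long it took. */
	lru_gen_do_aging(&stats, "lru_gen_test");

	/* Which generation now holds the bulk of the guest's memory? */
	gen = lru_gen_find_generation(&stats, guest_pages);
	pr_info("most of the %lu pages are in generation %d\n",
		guest_pages, gen);
}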


2024-05-29 18:10:20

by James Houghton

[permalink] [raw]
Subject: [PATCH v4 5/7] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Walk the TDP MMU in an RCU read-side critical section. This requires a
way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
macro. The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the
fact that kvm_age_gfn can now locklessly update the accessed bit and the
R/X bits.

If the cmpxchg for marking the spte for access tracking fails, we simply
retry if the spte is still a leaf PTE. If it isn't, we return false
to continue the walk.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 18 ++++-----
arch/x86/kvm/mmu/tdp_iter.h | 27 +++++++------
arch/x86/kvm/mmu/tdp_mmu.c | 67 +++++++++++++++++++++++++--------
4 files changed, 76 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ece45b3f6f20..48fb29bb782f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1438,6 +1438,7 @@ struct kvm_arch {
* tdp_mmu_page set.
*
* For reads, this list is protected by:
+ * RCU alone or
* the MMU lock in read mode + RCU or
* the MMU lock in write mode
*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6a2a557c2c31..956834da8a0e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1630,16 +1630,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- write_lock(&kvm->mmu_lock);
-
- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

- write_unlock(&kvm->mmu_lock);
-
return young;
}

@@ -1647,16 +1646,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- write_lock(&kvm->mmu_lock);
-
- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

- write_unlock(&kvm->mmu_lock);
-
return young;
}

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..f558ae9054af 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -24,16 +24,24 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
return xchg(rcu_dereference(sptep), new_spte);
}

+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+ atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+ return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
{
WRITE_ONCE(*rcu_dereference(sptep), new_spte);
}

/*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
+ * cleared by age_gfn_range.
*
* Note, non-leaf SPTEs do have Accessed bits and those bits are
* technically volatile, but KVM doesn't consume the Accessed bit of
@@ -44,8 +52,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
{
return is_shadow_present_pte(old_spte) &&
- is_last_spte(old_spte, level) &&
- spte_has_volatile_bits(old_spte);
+ is_last_spte(old_spte, level);
}

static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
@@ -61,12 +68,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
u64 mask, int level)
{
- atomic64_t *sptep_atomic;
-
- if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
- sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
- return (u64)atomic64_fetch_and(~mask, sptep_atomic);
- }
+ if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+ return tdp_mmu_clear_spte_bits_atomic(sptep, mask);

__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
return old_spte;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1259dd63defc..c74b0221dae0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,

return true;
}
+static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return true;
+}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
@@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
((_only_valid) && (_root)->role.invalid))) { \
} else

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id) \
+ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_rcu_read_lock_held() && \
+ (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) { \
+ } else
+
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)

@@ -1223,6 +1237,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
return ret;
}

+static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
+ struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ tdp_handler_t handler)
+{
+ struct kvm_mmu_page *root;
+ struct tdp_iter iter;
+ bool ret = false;
+
+ rcu_read_lock();
+
+ for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
+ tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ ret |= handler(kvm, &iter, range);
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
/*
* Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
* if any of the GFNs in the range have been accessed.
@@ -1236,28 +1271,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
{
u64 new_spte;

+retry:
/* If we have a non-accessed entry we don't need to change the pte. */
if (!is_accessed_spte(iter->old_spte))
return false;

if (spte_ad_enabled(iter->old_spte)) {
- iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
- iter->old_spte,
- shadow_accessed_mask,
- iter->level);
+ iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+ shadow_accessed_mask);
new_spte = iter->old_spte & ~shadow_accessed_mask;
} else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
+ new_spte = mark_spte_for_access_track(iter->old_spte);
+ if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
+ /*
+ * The cmpxchg failed. If the spte is still a
+ * last-level spte, we can safely retry.
+ */
+ if (is_shadow_present_pte(iter->old_spte) &&
+ is_last_spte(iter->old_spte, iter->level))
+ goto retry;
+ /* Otherwise, continue walking. */
+ return false;
+ }
if (is_writable_pte(iter->old_spte))
kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
- new_spte = mark_spte_for_access_track(iter->old_spte);
- iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
- iter->old_spte, new_spte,
- iter->level);
}

trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1267,7 +1304,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range);
}

static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1278,7 +1315,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn);
}

/*
--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 18:10:44

by James Houghton

[permalink] [raw]
Subject: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Replace the MMU write locks with read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 9 ++++-----
arch/arm64/kvm/mmu.c | 8 ++++----
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..eabb07c66a07 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
data->young = true;

/*
- * stage2_age_walker() is always called while holding the MMU lock for
- * write, so this will always succeed. Nonetheless, this deliberately
- * follows the race detection pattern of the other stage-2 walkers in
- * case the locking mechanics of the MMU notifiers is ever changed.
+ * This walk may not be exclusive; the PTE is permitted to change
+ * from under us.
*/
if (data->mkold && !stage2_try_set_pte(ctx, new))
return -EAGAIN;
@@ -1345,7 +1343,8 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
struct kvm_pgtable_walker walker = {
.cb = stage2_age_walker,
.arg = &data,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_SHARED,
};

WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8337009dde77..40e7427462a7 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1775,7 +1775,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
u64 size = (range->end - range->start) << PAGE_SHIFT;
bool young = false;

- write_lock(&kvm->mmu_lock);
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
goto out;
@@ -1785,7 +1785,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
size, true);

out:
- write_unlock(&kvm->mmu_lock);
+ read_unlock(&kvm->mmu_lock);
return young;
}

@@ -1794,7 +1794,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
u64 size = (range->end - range->start) << PAGE_SHIFT;
bool young = false;

- write_lock(&kvm->mmu_lock);
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
goto out;
@@ -1804,7 +1804,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
size, false);

out:
- write_unlock(&kvm->mmu_lock);
+ read_unlock(&kvm->mmu_lock);
return young;
}

--
2.45.1.288.g0e0cd299f1-goog


2024-05-29 21:04:11

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
>
> Secondary MMUs are currently consulted for access/age information at
> eviction time, but before then, we don't get accurate age information.
> That is, pages that are mostly accessed through a secondary MMU (like
> guest memory, used by KVM) will always just proceed down to the oldest
> generation, and then at eviction time, if KVM reports the page to be
> young, the page will be activated/promoted back to the youngest
> generation.

Correct, and as I explained offline, this is the only reasonable
behavior if we can't locklessly walk secondary MMUs.

Just for the record, the (crude) analogy I used was:
Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
but you are only allowed to pick up 10 of them (and put them in your
pocket). A smart move would be to survey the room *first and then*
pick up the largest ones. But if you are carrying a 500 lbs backpack,
you would just want to pick up whichever that's in front of you rather
than walk the entire room.

MGLRU should only scan (or lookaround) secondary MMUs if it can be
done lockless. Otherwise, it should just fall back to the existing
approach, which existed in previous versions but is removed in this
version.

> Do not do look around if there is a secondary MMU we have to interact
> with.
>
> The added feature bit (0x8), if disabled, will make MGLRU behave as if
> there are no secondary MMUs subscribed to MMU notifiers except at
> eviction time.
>
> Suggested-by: Yu Zhao <[email protected]>
> Signed-off-by: James Houghton <[email protected]>

This is not what I suggested, and it would have been done in the first
place if it hadn't regressed the non-lockless case.

NAK.

2024-05-29 21:53:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] KVM: Add lockless memslot walk to KVM

On Wed, May 29, 2024, James Houghton wrote:
> @@ -686,10 +694,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> return __kvm_handle_hva_range(kvm, &range).ret;
> }
>
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> - unsigned long start,
> - unsigned long end,
> - gfn_handler_t handler)
> +static __always_inline int kvm_handle_hva_range_no_flush(
> + struct mmu_notifier *mn,
> + unsigned long start,
> + unsigned long end,
> + gfn_handler_t handler,
> + bool lockless)

Unnecessary and unwanted style change.

> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> const struct kvm_mmu_notifier_range range = {
> @@ -699,6 +709,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> .on_lock = (void *)kvm_null_fn,
> .flush_on_ret = false,
> .may_block = false,
> + .lockless = lockless,

Why add @lockless to kvm_handle_hva_range_no_flush()? Both callers immediately
pass %false, and conceptually, locking is always optional for a "no flush" variant.

> };
>
> return __kvm_handle_hva_range(kvm, &range).ret;
> @@ -889,7 +900,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> * cadence. If we find this inaccurate, we might come up with a
> * more sophisticated heuristic later.
> */
> - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> + return kvm_handle_hva_range_no_flush(mn, start, end,
> + kvm_age_gfn, false);
> }
>
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -899,7 +911,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> trace_kvm_test_age_hva(address);
>
> return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> - kvm_test_age_gfn);
> + kvm_test_age_gfn, false);
> }
>
> static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
> --
> 2.45.1.288.g0e0cd299f1-goog
>

2024-05-29 21:56:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] KVM: Move MMU lock acquisition for test/clear_young to architecture

On Wed, May 29, 2024, James Houghton wrote:
> For implementing mmu_notifier_{test,clear}_young, the KVM memslot
> walker used to take the MMU lock for us. Now make the architectures
> take it themselves.

Hmm, *forcing* architectures to take mmu_lock is a step backwards. Rather than
add all of this churn, what about adding CONFIG_KVM_MMU_NOTIFIER_LOCKLESS, e.g.

static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start = start,
.end = end,
.handler = handler,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
.lockless = IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
}

2024-05-29 22:00:15

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024, Yu Zhao wrote:
> On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> >
> > Secondary MMUs are currently consulted for access/age information at
> > eviction time, but before then, we don't get accurate age information.
> > That is, pages that are mostly accessed through a secondary MMU (like
> > guest memory, used by KVM) will always just proceed down to the oldest
> > generation, and then at eviction time, if KVM reports the page to be
> > young, the page will be activated/promoted back to the youngest
> > generation.
>
> Correct, and as I explained offline, this is the only reasonable
> behavior if we can't locklessly walk secondary MMUs.
>
> Just for the record, the (crude) analogy I used was:
> Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> but you are only allowed to pick up 10 of them (and put them in your
> pocket). A smart move would be to survey the room *first and then*
> pick up the largest ones. But if you are carrying a 500 lbs backpack,
> you would just want to pick up whichever that's in front of you rather
> than walk the entire room.
>
> MGLRU should only scan (or lookaround) secondary MMUs if it can be
> done lockless. Otherwise, it should just fall back to the existing
> approach, which existed in previous versions but is removed in this
> version.

IIUC, by "existing approach" you mean completely ignore secondary MMUs that don't
implement a lockless walk?

2024-05-29 22:22:49

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 29, 2024, Yu Zhao wrote:
> > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > >
> > > Secondary MMUs are currently consulted for access/age information at
> > > eviction time, but before then, we don't get accurate age information.
> > > That is, pages that are mostly accessed through a secondary MMU (like
> > > guest memory, used by KVM) will always just proceed down to the oldest
> > > generation, and then at eviction time, if KVM reports the page to be
> > > young, the page will be activated/promoted back to the youngest
> > > generation.
> >
> > Correct, and as I explained offline, this is the only reasonable
> > behavior if we can't locklessly walk secondary MMUs.
> >
> > Just for the record, the (crude) analogy I used was:
> > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > but you are only allowed to pick up 10 of them (and put them in your
> > pocket). A smart move would be to survey the room *first and then*
> > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > you would just want to pick up whichever that's in front of you rather
> > than walk the entire room.
> >
> > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > done lockless. Otherwise, it should just fall back to the existing
> > approach, which existed in previous versions but is removed in this
> > version.
>
> IIUC, by "existing approach" you mean completely ignore secondary MMUs that don't
> implement a lockless walk?

No, the existing approach only checks secondary MMUs for LRU folios,
i.e., those at the end of the LRU list. It might not find the best
candidates (the coldest ones) on the entire list, but it doesn't pay
as much for the locking. MGLRU can *optionally* scan MMUs (secondary
included) to find the best candidates, but it can only be a win if the
scanning incurs a relatively low overhead, e.g., done locklessly for
the secondary MMU. IOW, this is a balance between the cost of
reclaiming not-so-cold (warm) folios and that of finding the coldest
folios.

Scanning host MMUs is likely to be a win because 1) there is usually
access locality and 2) there is no coarse-grained locking. If neither holds,
scanning secondary MMUs would likely be a loss. And 1) is generally
weaker for secondary MMUs, since it's about (guest) physical address
space.

2024-05-29 22:58:24

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024, Yu Zhao wrote:
> On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, May 29, 2024, Yu Zhao wrote:
> > > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > > >
> > > > Secondary MMUs are currently consulted for access/age information at
> > > > eviction time, but before then, we don't get accurate age information.
> > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > generation, and then at eviction time, if KVM reports the page to be
> > > > young, the page will be activated/promoted back to the youngest
> > > > generation.
> > >
> > > Correct, and as I explained offline, this is the only reasonable
> > > behavior if we can't locklessly walk secondary MMUs.
> > >
> > > Just for the record, the (crude) analogy I used was:
> > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > but you are only allowed to pick up 10 of them (and put them in your
> > > pocket). A smart move would be to survey the room *first and then*
> > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > you would just want to pick up whichever that's in front of you rather
> > > than walk the entire room.
> > >
> > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > done lockless. Otherwise, it should just fall back to the existing
> > > approach, which existed in previous versions but is removed in this
> > > version.
> >
> > IIUC, by "existing approach" you mean completely ignore secondary MMUs that
> > don't implement a lockless walk?
>
> No, the existing approach only checks secondary MMUs for LRU folios,
> i.e., those at the end of the LRU list. It might not find the best
> candidates (the coldest ones) on the entire list, but it doesn't pay
> as much for the locking. MGLRU can *optionally* scan MMUs (secondary
> included) to find the best candidates, but it can only be a win if the
> scanning incurs a relatively low overhead, e.g., done locklessly for
> the secondary MMU. IOW, this is a balance between the cost of
> reclaiming not-so-cold (warm) folios and that of finding the coldest
> folios.

Gotcha.

I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler
_code_, but I think it increases the overall system complexity. E.g. distros
will likely enable the Kconfig, and in my experience people using KVM with a
distro kernel usually aren't kernel experts, i.e. likely won't know that there's
even a decision to be made, let alone be able to make an informed decision.

Having an mmu_notifier hook that is conditionally implemented doesn't seem overly
complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for
KVM to nullify its mmu_notifier hook during initialization. The hardest part is
likely going to be figuring out the threshold for how much overhead is too much.

2024-05-30 01:09:03

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 29, 2024, Yu Zhao wrote:
> > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Wed, May 29, 2024, Yu Zhao wrote:
> > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > > > >
> > > > > Secondary MMUs are currently consulted for access/age information at
> > > > > eviction time, but before then, we don't get accurate age information.
> > > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > > generation, and then at eviction time, if KVM reports the page to be
> > > > > young, the page will be activated/promoted back to the youngest
> > > > > generation.
> > > >
> > > > Correct, and as I explained offline, this is the only reasonable
> > > > behavior if we can't locklessly walk secondary MMUs.
> > > >
> > > > Just for the record, the (crude) analogy I used was:
> > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > > but you are only allowed to pick up 10 of them (and put them in your
> > > > pocket). A smart move would be to survey the room *first and then*
> > > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > > you would just want to pick up whichever that's in front of you rather
> > > > than walk the entire room.
> > > >
> > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > > done lockless. Otherwise, it should just fall back to the existing
> > > > approach, which existed in previous versions but is removed in this
> > > > version.
> > >
> > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that
> > > don't implement a lockless walk?
> >
> > No, the existing approach only checks secondary MMUs for LRU folios,
> > i.e., those at the end of the LRU list. It might not find the best
> > candidates (the coldest ones) on the entire list, but it doesn't pay
> > as much for the locking. MGLRU can *optionally* scan MMUs (secondary
> > included) to find the best candidates, but it can only be a win if the
> > scanning incurs a relatively low overhead, e.g., done locklessly for
> > the secondary MMU. IOW, this is a balance between the cost of
> > reclaiming not-so-cold (warm) folios and that of finding the coldest
> > folios.
>
> Gotcha.
>
> I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler
> _code_, but I think it increases the overall system complexity. E.g. distros
> will likely enable the Kconfig, and in my experience people using KVM with a
> distro kernel usually aren't kernel experts, i.e. likely won't know that there's
> even a decision to be made, let alone be able to make an informed decision.
>
> Having an mmu_notifier hook that is conditionally implemented doesn't seem overly
> complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for
> KVM to nullify its mmu_notifier hook during initialization. The hardest part is
> likely going to be figuring out the threshold for how much overhead is too much.

Hi Yu, Sean,

Perhaps I "simplified" this bit of the series a little bit too much.
Being able to opportunistically do aging with KVM (even without
setting the Kconfig) is valuable.

IIUC, we have the following possibilities:
- v4: aging with KVM is done if the new Kconfig is set.
- v3: aging with KVM is always done.
- v2: aging with KVM is done when the architecture reports that it can
probably be done locklessly, set at KVM MMU init time.
- Another possibility?: aging with KVM is only done exactly when it
can be done locklessly (i.e., mmu_notifier_test/clear_young() called
such that it will not grab any locks).

I like the v4 approach because:
1. We can choose whether or not to do aging with KVM no matter what
architecture we're using (without requiring userspace be aware to
disable the feature at runtime with sysfs to avoid regressing
performance if they don't care about proactive reclaim).
2. If we check the new feature bit (0x8) in sysfs, we can know for
sure if aging is meant to be working or not. The selftest changes I
made won't work properly unless there is a way to be sure that aging
is working with KVM.

For look-around at eviction time:
- v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
- v2/v3: done if the main mm PTE was young or (the SPTE was young and
the MMU notifier was lockless/fast).

I made this logic change as part of removing batching.

I'd really appreciate guidance on what the correct thing to do is.

In my mind, what would work great is: by default, do aging exactly
when KVM can do it locklessly, and then have a Kconfig to always have
MGLRU to do aging with KVM if a user really cares about proactive
reclaim (when the feature bit is set). The selftest can check the
Kconfig + feature bit to know for sure if aging will be done.

I'm not sure what the exact right thing to do for look-around is.

Thanks for the quick feedback.

2024-05-30 03:27:54

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 3/7] KVM: Add lockless memslot walk to KVM

On Wed, May 29, 2024 at 2:51 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 29, 2024, James Houghton wrote:
> > @@ -686,10 +694,12 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> > return __kvm_handle_hva_range(kvm, &range).ret;
> > }
> >
> > -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> > - unsigned long start,
> > - unsigned long end,
> > - gfn_handler_t handler)
> > +static __always_inline int kvm_handle_hva_range_no_flush(
> > + struct mmu_notifier *mn,
> > + unsigned long start,
> > + unsigned long end,
> > + gfn_handler_t handler,
> > + bool lockless)
>
> Unnecessary and unwanted style change.

Sorry -- this will be fixed.

>
> > {
> > struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > const struct kvm_mmu_notifier_range range = {
> > @@ -699,6 +709,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> > .on_lock = (void *)kvm_null_fn,
> > .flush_on_ret = false,
> > .may_block = false,
> > + .lockless = lockless,
>
> Why add @lockless to kvm_handle_hva_range_no_flush()? Both callers immediately
> pass %false, and conceptually, locking is always optional for a "no flush" variant.

Right, this isn't needed in this patch. But I think I need it
eventually (like, in the next patch), so I'll move it where it is
really needed.



>
> > };
> >
> > return __kvm_handle_hva_range(kvm, &range).ret;
> > @@ -889,7 +900,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > * cadence. If we find this inaccurate, we might come up with a
> > * more sophisticated heuristic later.
> > */
> > - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> > + return kvm_handle_hva_range_no_flush(mn, start, end,
> > + kvm_age_gfn, false);
> > }
> >
> > static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> > @@ -899,7 +911,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> > trace_kvm_test_age_hva(address);
> >
> > return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> > - kvm_test_age_gfn);
> > + kvm_test_age_gfn, false);
> > }
> >
> > static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
> > --
> > 2.45.1.288.g0e0cd299f1-goog
> >

2024-05-30 03:28:33

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 4/7] KVM: Move MMU lock acquisition for test/clear_young to architecture

On Wed, May 29, 2024 at 2:55 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 29, 2024, James Houghton wrote:
> > For implementing mmu_notifier_{test,clear}_young, the KVM memslot
> > walker used to take the MMU lock for us. Now make the architectures
> > take it themselves.
>
> Hmm, *forcing* architectures to take mmu_lock is a step backwards. Rather than
> add all of this churn, what about adding CONFIG_KVM_MMU_NOTIFIER_LOCKLESS, e.g.
>
> static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> unsigned long start,
> unsigned long end,
> gfn_handler_t handler)
> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> const struct kvm_mmu_notifier_range range = {
> .start = start,
> .end = end,
> .handler = handler,
> .on_lock = (void *)kvm_null_fn,
> .flush_on_ret = false,
> .may_block = false,
> .lockless = IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_LOCKLESS),
> };
>
> return __kvm_handle_hva_range(kvm, &range).ret;
> }

Thanks Sean, yes this is a lot better. I will do this for v5.

2024-05-31 06:06:48

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024 at 7:08 PM James Houghton <[email protected]> wrote:
>
> On Wed, May 29, 2024 at 3:58 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, May 29, 2024, Yu Zhao wrote:
> > > On Wed, May 29, 2024 at 3:59 PM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Wed, May 29, 2024, Yu Zhao wrote:
> > > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > > > > >
> > > > > > Secondary MMUs are currently consulted for access/age information at
> > > > > > eviction time, but before then, we don't get accurate age information.
> > > > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > > > generation, and then at eviction time, if KVM reports the page to be
> > > > > > young, the page will be activated/promoted back to the youngest
> > > > > > generation.
> > > > >
> > > > > Correct, and as I explained offline, this is the only reasonable
> > > > > behavior if we can't locklessly walk secondary MMUs.
> > > > >
> > > > > Just for the record, the (crude) analogy I used was:
> > > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > > > but you are only allowed to pick up 10 of them (and put them in your
> > > > > pocket). A smart move would be to survey the room *first and then*
> > > > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > > > you would just want to pick up whichever that's in front of you rather
> > > > > than walk the entire room.
> > > > >
> > > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > > > done lockless. Otherwise, it should just fall back to the existing
> > > > > approach, which existed in previous versions but is removed in this
> > > > > version.
> > > >
> > > > IIUC, by "existing approach" you mean completely ignore secondary MMUs that
> > > > don't implement a lockless walk?
> > >
> > > No, the existing approach only checks secondary MMUs for LRU folios,
> > > i.e., those at the end of the LRU list. It might not find the best
> > > candidates (the coldest ones) on the entire list, but it doesn't pay
> > > as much for the locking. MGLRU can *optionally* scan MMUs (secondary
> > > included) to find the best candidates, but it can only be a win if the
> > > scanning incurs a relatively low overhead, e.g., done locklessly for
> > > the secondary MMU. IOW, this is a balance between the cost of
> > > reclaiming not-so-cold (warm) folios and that of finding the coldest
> > > folios.
> >
> > Gotcha.
> >
> > I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler
> > _code_, but I think it increases the overall system complexity. E.g. distros
> > will likely enable the Kconfig, and in my experience people using KVM with a
> > distro kernel usually aren't kernel experts, i.e. likely won't know that there's
> > even a decision to be made, let alone be able to make an informed decision.
> >
> > Having an mmu_notifier hook that is conditionally implemented doesn't seem overly
> > complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for
> > KVM to nullify its mmu_notifier hook during initialization. The hardest part is
> > likely going to be figuring out the threshold for how much overhead is too much.
>
> Hi Yu, Sean,
>
> Perhaps I "simplified" this bit of the series a little bit too much.
> Being able to opportunistically do aging with KVM (even without
> setting the Kconfig) is valuable.
>
> IIUC, we have the following possibilities:
> - v4: aging with KVM is done if the new Kconfig is set.
> - v3: aging with KVM is always done.

This is not true -- in v3, MGLRU only scans secondary MMUs if it can
be done locklessly on x86. It uses a bitmap to imply this requirement.

> - v2: aging with KVM is done when the architecture reports that it can
> probably be done locklessly, set at KVM MMU init time.

Not really -- it's only done if it can be done locklessly on both x86 and arm64.

> - Another possibility?: aging with KVM is only done exactly when it
> can be done locklessly (i.e., mmu_notifier_test/clear_young() called
> such that it will not grab any locks).

This is exactly the case for v2.

> I like the v4 approach because:
> 1. We can choose whether or not to do aging with KVM no matter what
> architecture we're using (without requiring userspace be aware to
> disable the feature at runtime with sysfs to avoid regressing
> performance if they don't care about proactive reclaim).
> 2. If we check the new feature bit (0x8) in sysfs, we can know for
> sure if aging is meant to be working or not. The selftest changes I
> made won't work properly unless there is a way to be sure that aging
> is working with KVM.

I'm not convinced, but it doesn't mean your point of view is invalid.
If you fully understand the implications of your design choice and
document them, I will not object.

All optimizations in v2 were measured step by step. Even that bitmap,
which might be considered overengineered, brought a readily
measurable 4% improvement in memcached throughput on Altra Max
swapping to Optane:

Using the bitmap (64 KVM PTEs for each call)
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---           ---          ---          ---            ---       0.00
Gets      1012801.92    431436.92     14965.11       0.06246      0.04700      0.16700        4.31900   39635.83
Waits           0.00          ---          ---           ---          ---          ---            ---        ---
Totals    1012801.92    431436.92     14965.11       0.06246      0.04700      0.16700        4.31900   39635.83


Not using the bitmap (1 KVM PTE for each call)
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---           ---          ---          ---            ---       0.00
Gets       968210.02    412443.85     14303.89       0.06517      0.04700      0.15900        7.42300   37890.74
Waits           0.00          ---          ---           ---          ---          ---            ---        ---
Totals     968210.02    412443.85     14303.89       0.06517      0.04700      0.15900        7.42300   37890.74


FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached.

What I don't think is acceptable is simplifying those optimizations
out without documenting your justifications (I would even call it a
design change, rather than simplification, from v3 to v4).

> For look-around at eviction time:
> - v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
> - v2/v3: done if the main mm PTE was young or (the SPTE was young and
> the MMU notifier was lockless/fast).

The host and secondary MMUs are two *independent* cases, IMO:
1. lookaround the host MMU if the PTE mapping the folio under reclaim is young.
2. lookaround the secondary MMU if it can be done locklessly.

So the v2/v3 behavior sounds a lot more reasonable to me.
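
In rough code terms (illustrative only; the helper names below are made up
and this is not the actual should_look_around() body), that means:

	/* The two look-around decisions are independent. */
	if (host_pte_young)
		look_around_host_mmu();
	if (secondary_mmu_is_lockless)
		look_around_secondary_mmu();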

Also a nit -- don't use 'else' in the following case (should_look_around()):

if (foo)
return bar;
else
do_something();

> I made this logic change as part of removing batching.
>
> I'd really appreciate guidance on what the correct thing to do is.
>
> In my mind, what would work great is: by default, do aging exactly
> when KVM can do it locklessly, and then have a Kconfig to always have
> MGLRU to do aging with KVM if a user really cares about proactive
> reclaim (when the feature bit is set). The selftest can check the
> Kconfig + feature bit to know for sure if aging will be done.

I still don't see how that Kconfig helps. Or why the new static branch
isn't enough?

> I'm not sure what the exact right thing to do for look-around is.
>
> Thanks for the quick feedback.


Attachments:
2.svg (70.08 kB)
1.svg (89.94 kB)

2024-05-31 07:04:34

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:

[...]

> All optimizations in v2 were measured step by step. Even that bitmap,
> which might be considered overengineered, brought a readily
> measuarable 4% improvement in memcached throughput on Altra Max
> swapping to Optane:

That's great, but taking an iterative approach to the problem allows
the reviewers and maintainers to come to their own conclusions about
each optimization independently. Squashing all of that together and
posting the result doesn't allow for this.

Even if we were to take the series as-is, the door is wide open to
subsequent improvements.

> What I don't think is acceptable is simplifying those optimizations
> out without documenting your justifications (I would even call it a
> design change, rather than simplification, from v3 to v4).

No, sorry, there's nothing wrong with James' approach here.

The discussion that led to the design of v4 happened on list; you were
on CC. The general consensus on the KVM side was that the bitmap was
complicated and lacked independent justification. There was ample
opportunity to voice your concerns before he spent the time on v4.

You seriously cannot fault a contributor for respinning their work based
on the provided feedback.

--
Thanks,
Oliver

2024-05-31 07:24:52

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> >
> > Secondary MMUs are currently consulted for access/age information at
> > eviction time, but before then, we don't get accurate age information.
> > That is, pages that are mostly accessed through a secondary MMU (like
> > guest memory, used by KVM) will always just proceed down to the oldest
> > generation, and then at eviction time, if KVM reports the page to be
> > young, the page will be activated/promoted back to the youngest
> > generation.
>
> Correct, and as I explained offline, this is the only reasonable
> behavior if we can't locklessly walk secondary MMUs.
>
> Just for the record, the (crude) analogy I used was:
> Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> but you are only allowed to pick up 10 of them (and put them in your
> pocket). A smart move would be to survey the room *first and then*
> pick up the largest ones. But if you are carrying a 500 lbs backpack,
> you would just want to pick up whichever that's in front of you rather
> than walk the entire room.
>
> MGLRU should only scan (or lookaround) secondary MMUs if it can be
> done lockless. Otherwise, it should just fall back to the existing
> approach, which existed in previous versions but is removed in this
> version.

Grabbing the MMU lock for write to scan sucks, no argument there. But
can you please be specific about the impact of read lock v. RCU in the
case of arm64? I had asked about this before and you never replied.

My concern remains that adding support for software table walkers
outside of the MMU lock entirely requires more work than just deferring
the deallocation to an RCU callback. Walkers that previously assumed
'exclusive' access while holding the MMU lock for write must now cope
with volatile PTEs.

Yes, this problem already exists when hardware sets the AF, but the
lock-free walker implementation needs to be generic so it can be applied
for other PTE bits.
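
For concreteness, the pattern such a shared walker ends up needing looks
roughly like the sketch below (the function name is made up; this is not
code from the series):

/* Illustrative only: clear the AF without assuming exclusive access. */
static int clear_af_shared(u64 *ptep)
{
	u64 old = READ_ONCE(*ptep);

	if (!(old & PTE_AF))
		return 0;

	/* Publish atomically; a racing writer means the PTE just changed. */
	if (cmpxchg64(ptep, old, old & ~PTE_AF) != old)
		return -EAGAIN;

	return 0;
}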

--
Thanks,
Oliver

2024-05-31 16:46:26

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 1:03 AM Oliver Upton <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:

Let me add back what I said earlier:

I'm not convinced, but it doesn't mean your point of view is
invalid. If you fully understand the implications of your design
choice and document them, I will not object.

> > All optimizations in v2 were measured step by step. Even that bitmap,
> > which might be considered overengineered, brought a readily
> measurable 4% improvement in memcached throughput on Altra Max
> > swapping to Optane:
>
> That's great, but taking an iterative approach to the problem allows
> the reviewers and maintainers to come to their own conclusions about
> each optimization independently. Squashing all of that together and
> posting the result doesn't allow for this.

That's your methodology, which I respect: as I said I won't stand in your way.

But mine is backed by data, please do respect that as well, by doing
what I asked: document your justifications.

> Even if we were to take the series as-is, the door is wide open to
> subsequent improvements.
>
> > What I don't think is acceptable is simplifying those optimizations
> > out without documenting your justifications (I would even call it a
> > design change, rather than simplification, from v3 to v4).
>
> No, sorry, there's nothing wrong with James' approach here.

Sorry, are you saying "without documenting your justifications" is
nothing wrong? If so, please elaborate.

> The discussion that led to the design of v4 happened on list; you were
> on CC. The general consensus on the KVM side was that the bitmap was
> complicated and lacked independent justification. There was ample
> opportunity to voice your concerns before he spent the time on v4.

Please re-read my previous emails -- I never object to the removal of
the bitmap or James' approach.

And please stop making assumptions -- I did voice my concerns with
James privately.

> You seriously cannot fault a contributor for respinning their work based
> on the provided feedback.

Are you saying I faulted James for taking others' feedback? If so,
where? And I'll make sure I don't give such an impression in the
future.

Also what do you think about the technical flaws and inaccurate
understandings I pointed out? You seem to have a strong opinion on
your iterate approach, but I hope you didn't choose to overlook the
real meat of this discussion.

2024-05-31 18:42:04

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 10:45:04AM -0600, Yu Zhao wrote:
> On Fri, May 31, 2024 at 1:03 AM Oliver Upton <[email protected]> wrote:
> >
> > On Fri, May 31, 2024 at 12:05:48AM -0600, Yu Zhao wrote:
>
> Let me add back what I said earlier:
>
> I'm not convinced, but it doesn't mean your point of view is
> invalid. If you fully understand the implications of your design
> choice and document them, I will not object.

Thanks, I appreciate the sentiment. Hopefully we can align here.

> > > All optimizations in v2 were measured step by step. Even that bitmap,
> > > which might be considered overengineered, brought a readily
> > measurable 4% improvement in memcached throughput on Altra Max
> > > swapping to Optane:
> >
> > That's great, but taking an iterative approach to the problem allows
> > the reviewers and maintainers to come to their own conclusions about
> > each optimization independently. Squashing all of that together and
> > posting the result doesn't allow for this.
>
> That's your methodology, which I respect: as I said I won't stand in your way.
>
> But mine is backed by data, please do respect that as well,

Data is useful and expected for changes that aim to improve the
performance of a system in one way or another. That is, after all, the
sole intention of the work, no?

What I'm also looking for is a controlled experiment, where a single
independent variable (e.g. locking) can be evaluated against the
baseline. All-or-nothing data has limited usefulness.

> by doing what I asked: document your justifications.

The justification for a series is against the upstream tree, not some
out-of-tree stuff. The cover letter explicitly calls out the decision
to simplify the patch series along with performance data I can reproduce
on my own systems.

This is a perfect example of how to contribute changes upstream.

> > > What I don't think is acceptable is simplifying those optimizations
> > > out without documenting your justifications (I would even call it a
> > > design change, rather than simplification, from v3 to v4).
> >
> > No, sorry, there's nothing wrong with James' approach here.
>
> Sorry, are you saying "without documenting your justifications" is
> nothing wrong? If so, please elaborate.

As I mentioned above, the reasoning is adequately documented and the
discussion that led to v4 is public. OTOH...

> > The discussion that led to the design of v4 happened on list; you were
> > on CC. The general consensus on the KVM side was that the bitmap was
> > complicated and lacked independent justification. There was ample
> > opportunity to voice your concerns before he spent the time on v4.
>
> Please re-read my previous emails -- I never object to the removal of
> the bitmap or James' approach.
>
> And please stop making assumptions -- I did voice my concerns with
> James privately.
^~~~~~~~~

If it happened in private then its no better than having said nothing at
all.

Please, keep the conversation on-list next time so we can iron out any
disagreements there. Otherwise contributors are put in a *very* awkward
situation of mediating the on- and off-list dialogue.

> > You seriously cannot fault a contributor for respinning their work based
> > on the provided feedback.
>
> Are you saying I faulted James for taking others' feedback?

No. Sufficient justification is captured in the public review feedback +
series cover letter. Your statement that the approach was changed without
justification is unsubstantiated.

> Also what do you think about the technical flaws and inaccurate
> understandings I pointed out? You seem to have a strong opinion on
> your iterate approach, but I hope you didn't choose to overlook the
> real meat of this discussion.

Serious question: are you not receiving my mail or something?

I re-raised my question to you from ages ago about locking on the arm64
MMU. You didn't answer last time, I'd appreciate a reply this time
around.

Otherwise I couldn't be bothered about the color of the Kconfig bikeshed
and don't have anything meaningful to add there. I think the three of
you are trending in the right direction.

--
Thanks,
Oliver

2024-05-31 19:12:01

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:

[...]

> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9e2bbee77491..eabb07c66a07 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> data->young = true;
>
> /*
> - * stage2_age_walker() is always called while holding the MMU lock for
> - * write, so this will always succeed. Nonetheless, this deliberately
> - * follows the race detection pattern of the other stage-2 walkers in
> - * case the locking mechanics of the MMU notifiers is ever changed.
> + * This walk may not be exclusive; the PTE is permitted to change
> + * from under us.
> */
> if (data->mkold && !stage2_try_set_pte(ctx, new))
> return -EAGAIN;

It is probably worth mentioning that if there was a race to update the
PTE then the GFN is most likely young, so failing to clear AF probably
isn't even consequential.

--
Thanks,
Oliver

2024-05-31 19:20:01

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
>
> [...]
>
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 9e2bbee77491..eabb07c66a07 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > data->young = true;
> >
> > /*
> > - * stage2_age_walker() is always called while holding the MMU lock for
> > - * write, so this will always succeed. Nonetheless, this deliberately
> > - * follows the race detection pattern of the other stage-2 walkers in
> > - * case the locking mechanics of the MMU notifiers is ever changed.
> > + * This walk may not be exclusive; the PTE is permitted to change
> > + * from under us.
> > */
> > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > return -EAGAIN;
>
> It is probably worth mentioning that if there was a race to update the
> PTE then the GFN is most likely young, so failing to clear AF probably
> isn't even consequential.

Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
now. Maybe demote it to:

r = kvm_pgtable_walk(...);
WARN_ON_ONCE(r && r != -EAGAIN);
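
For reference, a minimal sketch of what that could look like in
kvm_pgtable_stage2_test_clear_young() (illustrative only -- the walker
flags and surrounding details depend on the rest of the series):

bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
					 u64 size, bool mkold)
{
	struct stage2_age_data data = {
		.mkold		= mkold,
	};
	struct kvm_pgtable_walker walker = {
		.cb		= stage2_age_walker,
		.arg		= &data,
		.flags		= KVM_PGTABLE_WALK_LEAF,
	};
	int r;

	r = kvm_pgtable_walk(pgt, addr, size, &walker);
	/* -EAGAIN only means we lost a race with a concurrent update. */
	WARN_ON_ONCE(r && r != -EAGAIN);

	return data.young;
}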

--
Thanks,
Oliver

2024-05-31 20:32:11

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 1:24 AM Oliver Upton <[email protected]> wrote:
>
> On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > >
> > > Secondary MMUs are currently consulted for access/age information at
> > > eviction time, but before then, we don't get accurate age information.
> > > That is, pages that are mostly accessed through a secondary MMU (like
> > > guest memory, used by KVM) will always just proceed down to the oldest
> > > generation, and then at eviction time, if KVM reports the page to be
> > > young, the page will be activated/promoted back to the youngest
> > > generation.
> >
> > Correct, and as I explained offline, this is the only reasonable
> > behavior if we can't locklessly walk secondary MMUs.
> >
> > Just for the record, the (crude) analogy I used was:
> > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > but you are only allowed to pick up 10 of them (and put them in your
> > pocket). A smart move would be to survey the room *first and then*
> > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > you would just want to pick up whichever that's in front of you rather
> > than walk the entire room.
> >
> > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > done lockless. Otherwise, it should just fall back to the existing
> > approach, which existed in previous versions but is removed in this
> > version.
>
> Grabbing the MMU lock for write to scan sucks, no argument there. But
> can you please be specific about the impact of read lock v. RCU in the
> case of arm64? I had asked about this before and you never replied.
>
> My concern remains that adding support for software table walkers
> outside of the MMU lock entirely requires more work than just deferring
> the deallocation to an RCU callback. Walkers that previously assumed
> 'exclusive' access while holding the MMU lock for write must now cope
> with volatile PTEs.
>
> Yes, this problem already exists when hardware sets the AF, but the
> lock-free walker implementation needs to be generic so it can be applied
> for other PTE bits.

Direct reclaim is multi-threaded and each reclaimer can take the mmu
lock for read (testing the A-bit) or write (unmapping before paging
out) on arm64. The fundamental problem of using the readers-writer
lock in this case is priority inversion: the readers have lower
priority than the writers, so ideally, we don't want the readers to
block the writers at all.

Using my previous (crude) analogy: putting the bill right in front of
you (the writers) profits immediately whereas searching for the
largest bill (the readers) can be futile.

As I said earlier, I prefer we drop the arm64 support for now, but I
will not object to taking the mmu lock for read when clearing the
A-bit, as long as we fully understand the problem here and document it
clearly.

2024-05-31 21:10:37

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 2:06 PM David Matlack <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 1:31 PM Yu Zhao <[email protected]> wrote:
> >
> > On Fri, May 31, 2024 at 1:24 AM Oliver Upton <[email protected]> wrote:
> > >
> > > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> > > > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > > > >
> > > > > Secondary MMUs are currently consulted for access/age information at
> > > > > eviction time, but before then, we don't get accurate age information.
> > > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > > generation, and then at eviction time, if KVM reports the page to be
> > > > > young, the page will be activated/promoted back to the youngest
> > > > > generation.
> > > >
> > > > Correct, and as I explained offline, this is the only reasonable
> > > > behavior if we can't locklessly walk secondary MMUs.
> > > >
> > > > Just for the record, the (crude) analogy I used was:
> > > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > > but you are only allowed to pick up 10 of them (and put them in your
> > > > pocket). A smart move would be to survey the room *first and then*
> > > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > > you would just want to pick up whichever that's in front of you rather
> > > > than walk the entire room.
> > > >
> > > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > > done lockless. Otherwise, it should just fall back to the existing
> > > > approach, which existed in previous versions but is removed in this
> > > > version.
> > >
> > > Grabbing the MMU lock for write to scan sucks, no argument there. But
> > > can you please be specific about the impact of read lock v. RCU in the
> > > case of arm64? I had asked about this before and you never replied.
> > >
> > > My concern remains that adding support for software table walkers
> > > outside of the MMU lock entirely requires more work than just deferring
> > > the deallocation to an RCU callback. Walkers that previously assumed
> > > 'exclusive' access while holding the MMU lock for write must now cope
> > > with volatile PTEs.
> > >
> > > Yes, this problem already exists when hardware sets the AF, but the
> > > lock-free walker implementation needs to be generic so it can be applied
> > > for other PTE bits.
> >
> > Direct reclaim is multi-threaded and each reclaimer can take the mmu
> > lock for read (testing the A-bit) or write (unmapping before paging
> > out) on arm64. The fundamental problem of using the readers-writer
> > lock in this case is priority inversion: the readers have lower
> > priority than the writers, so ideally, we don't want the readers to
> > block the writers at all.
> >
> > Using my previous (crude) analogy: putting the bill right in front of
> > you (the writers) profits immediately whereas searching for the
> > largest bill (the readers) can be futile.
> >
> > As I said earlier, I prefer we drop the arm64 support for now, but I
> > will not object to taking the mmu lock for read when clearing the
> > A-bit, as long as we fully understand the problem here and document it
> > clearly.
>
> FWIW, Google Cloud has been doing proactive reclaim and kstaled-based
> aging (a Google-internal page aging daemon, for those outside of
> Google) for many years on x86 VMs with the A-bit harvesting
> under the write-lock. So I'm skeptical that making ARM64 lockless is
> necessary to allow Secondary MMUs to participate in MGLRU aging with
> acceptable performance for Cloud usecases. I don't even think it's
> necessary on x86 but it's a simple enough change that we might as well
> just do it.

The obvious caveat here: If MGLRU aging and kstaled aging are
substantially different in how frequently they trigger mmu_notifiers,
then my analysis may not be correct. I'm hoping Yu can shed some
light on that. I'm also operating under the assumption that Secondary
MMUs are only participating in aging, and not look-around (i.e. what
is implemented in v4).

>
> I suspect under pathological conditions (host under intense memory
> pressure and high rate of reclaim occurring) making A-bit harvesting
> lockless will perform better. But under such conditions VM performance
> is likely going to suffer regardless. In a Cloud environment we deal
> with that through other mechanisms to reduce the rate of reclaim and
> make the host healthy.
>
> For these reasons, I think there's value in giving users the option to
> enable Secondary MMUs' participation in MGLRU aging even when A-bit
> test/clearing is not done locklessly. I believe this was James' intent
> with the Kconfig. Perhaps a default-off writable module parameter
> would be better to avoid distros accidentally turning it on?
>
> If and when there is a usecase for optimizing VM performance under
> pathological reclaim conditions on ARM, we can make it lockless then.

2024-05-31 21:18:24

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 1:31 PM Yu Zhao <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 1:24 AM Oliver Upton <[email protected]> wrote:
> >
> > On Wed, May 29, 2024 at 03:03:21PM -0600, Yu Zhao wrote:
> > > On Wed, May 29, 2024 at 12:05 PM James Houghton <[email protected]> wrote:
> > > >
> > > > Secondary MMUs are currently consulted for access/age information at
> > > > eviction time, but before then, we don't get accurate age information.
> > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > generation, and then at eviction time, if KVM reports the page to be
> > > > young, the page will be activated/promoted back to the youngest
> > > > generation.
> > >
> > > Correct, and as I explained offline, this is the only reasonable
> > > behavior if we can't locklessly walk secondary MMUs.
> > >
> > > Just for the record, the (crude) analogy I used was:
> > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > but you are only allowed to pick up 10 of them (and put them in your
> > > pocket). A smart move would be to survey the room *first and then*
> > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > you would just want to pick up whichever that's in front of you rather
> > > than walk the entire room.
> > >
> > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > done lockless. Otherwise, it should just fall back to the existing
> > > approach, which existed in previous versions but is removed in this
> > > version.
> >
> > Grabbing the MMU lock for write to scan sucks, no argument there. But
> > can you please be specific about the impact of read lock v. RCU in the
> > case of arm64? I had asked about this before and you never replied.
> >
> > My concern remains that adding support for software table walkers
> > outside of the MMU lock entirely requires more work than just deferring
> > the deallocation to an RCU callback. Walkers that previously assumed
> > 'exclusive' access while holding the MMU lock for write must now cope
> > with volatile PTEs.
> >
> > Yes, this problem already exists when hardware sets the AF, but the
> > lock-free walker implementation needs to be generic so it can be applied
> > for other PTE bits.
>
> Direct reclaim is multi-threaded and each reclaimer can take the mmu
> lock for read (testing the A-bit) or write (unmapping before paging
> out) on arm64. The fundamental problem of using the readers-writer
> lock in this case is priority inversion: the readers have lower
> priority than the writers, so ideally, we don't want the readers to
> block the writers at all.
>
> Using my previous (crude) analogy: putting the bill right in front of
> you (the writers) profits immediately whereas searching for the
> largest bill (the readers) can be futile.
>
> As I said earlier, I prefer we drop the arm64 support for now, but I
> will not object to taking the mmu lock for read when clearing the
> A-bit, as long as we fully understand the problem here and document it
> clearly.

FWIW, Google Cloud has been doing proactive reclaim and kstaled-based
aging (a Google-internal page aging daemon, for those outside of
Google) for many years on x86 VMs with the A-bit harvesting
under the write-lock. So I'm skeptical that making ARM64 lockless is
necessary to allow Secondary MMUs to participate in MGLRU aging with
acceptable performance for Cloud usecases. I don't even think it's
necessary on x86 but it's a simple enough change that we might as well
just do it.

I suspect under pathological conditions (host under intense memory
pressure and high rate of reclaim occurring) making A-bit harvesting
lockless will perform better. But under such conditions VM performance
is likely going to suffer regardless. In a Cloud environment we deal
with that through other mechanisms to reduce the rate of reclaim and
make the host healthy.

For these reasons, I think there's value in giving users the option to
enable Secondary MMUs' participation in MGLRU aging even when A-bit
test/clearing is not done locklessly. I believe this was James' intent
with the Kconfig. Perhaps a default-off writable module parameter
would be better to avoid distros accidentally turning it on?

If and when there is a usecase for optimizing VM performance under
pathological reclaim conditions on ARM, we can make it lockless then.

2024-05-31 21:18:53

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Fri, May 31, 2024 at 02:31:17PM -0600, Yu Zhao wrote:
> On Fri, May 31, 2024 at 1:24 AM Oliver Upton <[email protected]> wrote:

[...]

> > Grabbing the MMU lock for write to scan sucks, no argument there. But
> > can you please be specific about the impact of read lock v. RCU in the
> > case of arm64? I had asked about this before and you never replied.
> >
> > My concern remains that adding support for software table walkers
> > outside of the MMU lock entirely requires more work than just deferring
> > the deallocation to an RCU callback. Walkers that previously assumed
> > 'exclusive' access while holding the MMU lock for write must now cope
> > with volatile PTEs.
> >
> > Yes, this problem already exists when hardware sets the AF, but the
> > lock-free walker implementation needs to be generic so it can be applied
> > for other PTE bits.
>
> Direct reclaim is multi-threaded and each reclaimer can take the mmu
> lock for read (testing the A-bit) or write (unmapping before paging
> out) on arm64. The fundamental problem of using the readers-writer
> lock in this case is priority inversion: the readers have lower
> priority than the writers, so ideally, we don't want the readers to
> block the writers at all.

So we already have this sort of problem of stage-2 fault handling v.
secondary MMU invalidations, which is why I've been doubtful of the
perceived issue. In fact, I'd argue that needing to wait for faults is
worse than aging participation since those can be trivially influenced
by userspace/guest.

In any case, we shouldn't ever be starved since younger readers cannot
enter the critical section with a pending writer.

> As I said earlier, I prefer we drop the arm64 support for now, but I
> will not object to taking the mmu lock for read when clearing the
> A-bit, as long as we fully understand the problem here and document it
> clearly.

I'd be convinced of this if there's data that shows read lock
acquisition is in fact consequential. Otherwise, I'm not sure the added
complexity of RCU table walkers (per my statement above) is worth the
effort / maintenance burden.

--
Thanks,
Oliver

2024-06-03 22:56:57

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Thu, May 30, 2024 at 11:06 PM Yu Zhao <[email protected]> wrote:
>
> On Wed, May 29, 2024 at 7:08 PM James Houghton <[email protected]> wrote:
> >
> > Hi Yu, Sean,
> >
> > Perhaps I "simplified" this bit of the series a little bit too much.
> > Being able to opportunistically do aging with KVM (even without
> > setting the Kconfig) is valuable.
> >
> > IIUC, we have the following possibilities:
> > - v4: aging with KVM is done if the new Kconfig is set.
> > - v3: aging with KVM is always done.
>
> This is not true -- in v3, MGLRU only scans secondary MMUs if it can
> be done locklessly on x86. It uses a bitmap to imply this requirement.
>
> > - v2: aging with KVM is done when the architecture reports that it can
> > probably be done locklessly, set at KVM MMU init time.
>
> Not really -- it's only done if it can be done locklessly on both x86 and arm64.
>
> > - Another possibility?: aging with KVM is only done exactly when it
> > can be done locklessly (i.e., mmu_notifier_test/clear_young() called
> > such that it will not grab any locks).
>
> This is exactly the case for v2.

Thanks for clarifying; sorry for getting this wrong.

>
> > I like the v4 approach because:
> > 1. We can choose whether or not to do aging with KVM no matter what
> > architecture we're using (without requiring userspace be aware to
> > disable the feature at runtime with sysfs to avoid regressing
> > performance if they don't care about proactive reclaim).
> > 2. If we check the new feature bit (0x8) in sysfs, we can know for
> > sure if aging is meant to be working or not. The selftest changes I
> > made won't work properly unless there is a way to be sure that aging
> > is working with KVM.
>
> I'm not convinced, but it doesn't mean your point of view is invalid.
> If you fully understand the implications of your design choice and
> document them, I will not object.
>
> All optimizations in v2 were measured step by step. Even that bitmap,
> which might be considered overengineered, brought a readily
> measurable 4% improvement in memcached throughput on Altra Max
> swapping to Optane:
>
> Using the bitmap (64 KVM PTEs for each call)
> ===========================================================================================================
> Type        Ops/sec     Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
> -----------------------------------------------------------------------------------------------------------
> Sets           0.00          ---         ---           ---          ---          ---            ---      0.00
> Gets     1012801.92    431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83
> Waits          0.00          ---         ---           ---          ---          ---            ---       ---
> Totals   1012801.92    431436.92    14965.11       0.06246      0.04700      0.16700        4.31900  39635.83
>
>
> Not using the bitmap (1 KVM PTE for each call)
> ===========================================================================================================
> Type        Ops/sec     Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency    KB/sec
> -----------------------------------------------------------------------------------------------------------
> Sets           0.00          ---         ---           ---          ---          ---            ---      0.00
> Gets      968210.02    412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
> Waits          0.00          ---         ---           ---          ---          ---            ---       ---
> Totals    968210.02    412443.85    14303.89       0.06517      0.04700      0.15900        7.42300  37890.74
>
>
> FlameGraphs with bitmap (1.svg) and without bitmap (2.svg) attached.
>
> What I don't think is acceptable is simplifying those optimizations
> out without documenting your justifications (I would even call it a
> design change, rather than simplification, from v3 to v4).

I'll put back something similar to what you had before (like a
test_clear_young() with a "fast" parameter instead of "bitmap"). I
like the idea of having a new mmu notifier, like
fast_test_clear_young(), while leaving test_young() and clear_young()
unchanged (where "fast" means "prioritize speed over accuracy"). It
seems a little more straightforward that way.
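
A rough sketch of the shape such a hook could take, purely to make the
idea concrete (the name and exact signature here are illustrative, and
the naming gets revisited below):

/*
 * Hypothetical new callback in struct mmu_notifier_ops, alongside the
 * existing clear_young() and test_young() hooks. "Fast" means the
 * implementation prioritizes speed over accuracy and may bail out
 * rather than block.
 */
int (*fast_test_clear_young)(struct mmu_notifier *mn,
			     struct mm_struct *mm,
			     unsigned long start,
			     unsigned long end);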

>
> > For look-around at eviction time:
> > - v4: done if the main mm PTE was young and no MMU notifiers are subscribed.
> > - v2/v3: done if the main mm PTE was young or (the SPTE was young and
> > the MMU notifier was lockless/fast).
>
> The host and secondary MMUs are two *independent* cases, IMO:
> 1. lookaround the host MMU if the PTE mapping the folio under reclaim is young.
> 2. lookaround the secondary MMU if it can be done locklessly.
>
> So the v2/v3 behavior sounds a lot more reasonable to me.

I'll restore the v2/v3 behavior. I initially removed it because,
without batching, we (mostly) lose the spatial locality that, IIUC,
look-around is designed to exploit.

>
> Also a nit -- don't use 'else' in the following case (should_look_around()):
>
> if (foo)
> return bar;
> else
> do_something();

Oh, yes, sorry. I wrote and rewrote should_look_around() quite a few
times while trying to figure out what made sense in a no-batching
series. I'll fix this.
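
(For reference, the preferred form simply drops the else:)

if (foo)
	return bar;

do_something();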

>
> > I made this logic change as part of removing batching.
> >
> > I'd really appreciate guidance on what the correct thing to do is.
> >
> > In my mind, what would work great is: by default, do aging exactly
> > when KVM can do it locklessly, and then have a Kconfig to always have
> > MGLRU to do aging with KVM if a user really cares about proactive
> > reclaim (when the feature bit is set). The selftest can check the
> > Kconfig + feature bit to know for sure if aging will be done.
>
> I still don't see how that Kconfig helps. Or why the new static branch
> isn't enough?

Without a special Kconfig, the feature bit just tells us that aging
with KVM is possible, not that it will necessarily be done. For the
self-test, it'd be good to know exactly when aging is being done or
not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
help make the self-test set the right expectations for aging.

The Kconfig would also allow a user to know that, no matter what,
we're going to get correct age data for VMs, even if, say, we're using
the shadow MMU. This is somewhat important for me/Google Cloud. Is
that reasonable? Maybe there's a better solution.

2024-06-03 23:03:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Mon, Jun 03, 2024, James Houghton wrote:
> On Thu, May 30, 2024 at 11:06 PM Yu Zhao <[email protected]> wrote:
> > What I don't think is acceptable is simplifying those optimizations
> > out without documenting your justifications (I would even call it a
> > design change, rather than simplification, from v3 to v4).
>
> I'll put back something similar to what you had before (like a
> test_clear_young() with a "fast" parameter instead of "bitmap"). I
> like the idea of having a new mmu notifier, like
> fast_test_clear_young(), while leaving test_young() and clear_young()
> unchanged (where "fast" means "prioritize speed over accuracy").

Those two statements are contradicting each other, aren't they? Anyways, I vote
for a "fast only" variant, e.g. test_clear_young_fast_only() or so. gup() has
already established that terminology in mm/, so hopefully it would be familiar
to readers. We could pass a param, but then the MGLRU code would likely end up
doing a bunch of useless indirect calls into secondary MMUs, whereas a dedicated
hook allows implementations to nullify the pointer if the API isn't supported
for whatever reason.

And pulling in Oliver's comments about locking, I think it's important that the
mmu_notifier API express its requirement that the operation be "fast", not that
it be lockless. E.g. if a secondary MMU can guarantee that a lock will be
contended only in rare, slow cases, then taking a lock is a-ok. Or a secondary
MMU could do try-lock and bail if the lock is contended.

That way KVM can honor the intent of the API with an implementation that works
best for KVM _and_ for MGLRU. I'm sure there will be future adjustments and fixes,
but that's just more motivation for using something like "fast only" instead of
"lockless".

> > > I made this logic change as part of removing batching.
> > >
> > > I'd really appreciate guidance on what the correct thing to do is.
> > >
> > > In my mind, what would work great is: by default, do aging exactly
> > > when KVM can do it locklessly, and then have a Kconfig to always have
> > > MGLRU to do aging with KVM if a user really cares about proactive
> > > reclaim (when the feature bit is set). The selftest can check the
> > > Kconfig + feature bit to know for sure if aging will be done.
> >
> > I still don't see how that Kconfig helps. Or why the new static branch
> > isn't enough?
>
> Without a special Kconfig, the feature bit just tells us that aging
> with KVM is possible, not that it will necessarily be done. For the
> self-test, it'd be good to know exactly when aging is being done or
> not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
> help make the self-test set the right expectations for aging.
>
> The Kconfig would also allow a user to know that, no matter what,
> we're going to get correct age data for VMs, even if, say, we're using
> the shadow MMU.

Heh, unless KVM flushes, you won't get "correct" age data.

> This is somewhat important for me/Google Cloud. Is that reasonable? Maybe
> there's a better solution.

Hmm, no? There's no reason to use a Kconfig, e.g. if we _really_ want to prioritize
accuracy over speed, then a KVM (x86?) module param to have KVM walk nested TDP
page tables would give us what we want.
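
Such a knob would be a one-liner in KVM; a sketch, with the parameter name
and default invented purely for illustration (this is not an existing
parameter):

/* Hypothetical module param controlling aging via the shadow MMU. */
static bool aging_walks_shadow_mmu;
module_param(aging_walks_shadow_mmu, bool, 0444);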

But before we do that, I think we need to perform due diligence (or provide data)
showing that having KVM take mmu_lock for write in the "fast only" API provides
better total behavior. I.e. that the additional accuracy is indeed worth the cost.

2024-06-03 23:17:33

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Mon, Jun 3, 2024 at 4:03 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Jun 03, 2024, James Houghton wrote:
> > On Thu, May 30, 2024 at 11:06 PM Yu Zhao <[email protected]> wrote:
> > > What I don't think is acceptable is simplifying those optimizations
> > > out without documenting your justifications (I would even call it a
> > > design change, rather than simplification, from v3 to v4).
> >
> > I'll put back something similar to what you had before (like a
> > test_clear_young() with a "fast" parameter instead of "bitmap"). I
> > like the idea of having a new mmu notifier, like
> > fast_test_clear_young(), while leaving test_young() and clear_young()
> > unchanged (where "fast" means "prioritize speed over accuracy").
>
> Those two statements are contradicting each other, aren't they?

I guess it depends on how you define "similar". :)

> Anyways, I vote
> for a "fast only" variant, e.g. test_clear_young_fast_only() or so. gup() has
> already established that terminology in mm/, so hopefully it would be familiar
> to readers. We could pass a param, but then the MGLRU code would likely end up
> doing a bunch of useless indirect calls into secondary MMUs, whereas a dedicated
> hook allows implementations to nullify the pointer if the API isn't supported
> for whatever reason.
>
> And pulling in Oliver's comments about locking, I think it's important that the
> mmu_notifier API express its requirement that the operation be "fast", not that
> it be lockless. E.g. if a secondary MMU can guarantee that a lock will be
> contended only in rare, slow cases, then taking a lock is a-ok. Or a secondary
> MMU could do try-lock and bail if the lock is contended.
>
> That way KVM can honor the intent of the API with an implementation that works
> best for KVM _and_ for MGLRU. I'm sure there will be future adjustments and fixes,
> but that's just more motivation for using something like "fast only" instead of
> "lockless".

Yes, thanks, this is exactly what I meant. I really should have "only"
in the name to signify that it is a requirement that it be fast.
Thanks for wording it so clearly.

>
> > > > I made this logic change as part of removing batching.
> > > >
> > > > I'd really appreciate guidance on what the correct thing to do is.
> > > >
> > > > In my mind, what would work great is: by default, do aging exactly
> > > > when KVM can do it locklessly, and then have a Kconfig to always have
> > > > MGLRU to do aging with KVM if a user really cares about proactive
> > > > reclaim (when the feature bit is set). The selftest can check the
> > > > Kconfig + feature bit to know for sure if aging will be done.
> > >
> > > I still don't see how that Kconfig helps. Or why the new static branch
> > > isn't enough?
> >
> > Without a special Kconfig, the feature bit just tells us that aging
> > with KVM is possible, not that it will necessarily be done. For the
> > self-test, it'd be good to know exactly when aging is being done or
> > not, so having a Kconfig like LRU_GEN_ALWAYS_WALK_SECONDARY_MMU would
> > help make the self-test set the right expectations for aging.
> >
> > The Kconfig would also allow a user to know that, no matter what,
> > we're going to get correct age data for VMs, even if, say, we're using
> > the shadow MMU.
>
> Heh, unless KVM flushes, you won't get "correct" age data.
>
> > This is somewhat important for me/Google Cloud. Is that reasonable? Maybe
> > there's a better solution.
>
> Hmm, no? There's no reason to use a Kconfig, e.g. if we _really_ want to prioritize
> accuracy over speed, then a KVM (x86?) module param to have KVM walk nested TDP
> page tables would give us what we want.
>
> But before we do that, I think we need to perform due diligence (or provide data)
> showing that having KVM take mmu_lock for write in the "fast only" API provides
> better total behavior. I.e. that the additional accuracy is indeed worth the cost.

That sounds good to me. I'll drop the Kconfig. I'm not really sure
what to do about the self-test, but that's not really all that
important.

2024-06-04 00:23:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Mon, Jun 03, 2024, James Houghton wrote:
> On Mon, Jun 3, 2024 at 4:03 PM Sean Christopherson <[email protected]> wrote:
> > But before we do that, I think we need to perform due diligence (or provide data)
> > showing that having KVM take mmu_lock for write in the "fast only" API provides
> > better total behavior. I.e. that the additional accuracy is indeed worth the cost.
>
> That sounds good to me. I'll drop the Kconfig. I'm not really sure
> what to do about the self-test, but that's not really all that
> important.

Enable it only on architectures+setups that are guaranteed to implement the
fast-only API? E.g. on x86, it darn well better be active if the TDP MMU is
enabled. If the test fails because that doesn't hold true, then we _want_ the
failure.
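
For example, one possible x86-only guard in the selftest (treating the
helper usage here as illustrative rather than a concrete proposal):

/* Require the TDP MMU, where the fast-only aging path must be active. */
TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));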

2024-06-04 22:21:21

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Fri, May 31, 2024 at 12:18 PM Oliver Upton <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> >
> > [...]
> >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 9e2bbee77491..eabb07c66a07 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > data->young = true;
> > >
> > > /*
> > > - * stage2_age_walker() is always called while holding the MMU lock for
> > > - * write, so this will always succeed. Nonetheless, this deliberately
> > > - * follows the race detection pattern of the other stage-2 walkers in
> > > - * case the locking mechanics of the MMU notifiers is ever changed.
> > > + * This walk may not be exclusive; the PTE is permitted to change
> > > + * from under us.
> > > */
> > > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > > return -EAGAIN;
> >
> > It is probably worth mentioning that if there was a race to update the
> > PTE then the GFN is most likely young, so failing to clear AF probably
> > isn't even consequential.

Thanks Oliver.

>
> Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> now. Maybe demote it to:
>
> r = kvm_pgtable_walk(...);
> WARN_ON_ONCE(r && r != -EAGAIN);

Oh, indeed, thank you. Just to make sure -- does it make sense to
retry the cmpxchg if it fails? For example, the way I have it now for
x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
move on to the next one having done nothing. Does something like that
make sense for arm64?

[1]: https://lore.kernel.org/linux-mm/[email protected]/
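
For context, the retry pattern being referred to looks roughly like this
(a paraphrased sketch, not the linked patch; sptep and level come from the
surrounding SPTE iterator):

for (;;) {
	u64 old_spte = READ_ONCE(*sptep);

	/* The SPTE was zapped or is no longer a leaf: nothing to age. */
	if (!is_shadow_present_pte(old_spte) ||
	    !is_last_spte(old_spte, level))
		break;

	/* Try to clear the accessed bit; go around again only if we raced. */
	if (try_cmpxchg64(sptep, &old_spte,
			  old_spte & ~shadow_accessed_mask))
		break;
}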

2024-06-04 23:06:40

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> On Fri, May 31, 2024 at 12:18 PM Oliver Upton <[email protected]> wrote:
> >
> > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > >
> > > [...]
> > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 9e2bbee77491..eabb07c66a07 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > > data->young = true;
> > > >
> > > > /*
> > > > - * stage2_age_walker() is always called while holding the MMU lock for
> > > > - * write, so this will always succeed. Nonetheless, this deliberately
> > > > - * follows the race detection pattern of the other stage-2 walkers in
> > > > - * case the locking mechanics of the MMU notifiers is ever changed.
> > > > + * This walk may not be exclusive; the PTE is permitted to change
> > > > + * from under us.
> > > > */
> > > > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > > > return -EAGAIN;
> > >
> > > It is probably worth mentioning that if there was a race to update the
> > > PTE then the GFN is most likely young, so failing to clear AF probably
> > > isn't even consequential.
>
> Thanks Oliver.
>
> >
> > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > now. Maybe demote it to:
> >
> > r = kvm_pgtable_walk(...);
> > WARN_ON_ONCE(r && r != -EAGAIN);
>
> Oh, indeed, thank you. Just to make sure -- does it make sense to
> retry the cmpxchg if it fails? For example, the way I have it now for
> x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> move on to the next one having done nothing. Does something like that
> make sense for arm64?

At least for arm64 I do not see a need for retry. The only possible
races are:

- A stage-2 fault handler establishing / adjusting the mapping for the
GFN. If the guest is directly accessing the GFN in question, what's
the point of wiping out AF?

Even when returning -EAGAIN we've already primed stage2_age_data::young,
so we report the correct state back to the primary MMU.

- Another kvm_age_gfn() trying to age the same GFN. I haven't even
looked to see if this is possible from the primary MMU POV, but in
theory one of the calls will win the race and clear AF.

Given Yu's concerns about making pending writers wait, we should take
every opportunity to bail on the walk.

--
Thanks,
Oliver

2024-06-04 23:37:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Tue, Jun 04, 2024, Oliver Upton wrote:
> On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> > On Fri, May 31, 2024 at 12:18 PM Oliver Upton <[email protected]> wrote:
> > >
> > > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > > now. Maybe demote it to:
> > >
> > > r = kvm_pgtable_walk(...);
> > > WARN_ON_ONCE(r && r != -EAGAIN);
> >
> > Oh, indeed, thank you. Just to make sure -- does it make sense to
> > retry the cmpxchg if it fails? For example, the way I have it now for
> > x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> > move on to the next one having done nothing. Does something like that
> > make sense for arm64?
>
> At least for arm64 I do not see a need for retry. The only possible
> races are:
>
> - A stage-2 fault handler establishing / adjusting the mapping for the
> GFN. If the guest is directly accessing the GFN in question, what's
> the point of wiping out AF?
>
> Even when returning -EAGAIN we've already primed stage2_age_data::young,
> so we report the correct state back to the primary MMU.
>
> - Another kvm_age_gfn() trying to age the same GFN. I haven't even
> looked to see if this is possible from the primary MMU POV, but in
> theory one of the calls will win the race and clear AF.
>
> Given Yu's concerns about making pending writers wait, we should take
> every opportunity to bail on the walk.

+1. The x86 path that retries is, for all intents and purposes, limited to Intel
CPUs that don't support EPT A/D bits, i.e. to pre-HSW CPUs. I wouldn't make any
decisions based on that code.