2024-04-01 23:30:15

by James Houghton

Subject: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

This patchset adds a fast path in KVM to test and clear access bits on
sptes without taking the mmu_lock. It also adds support for using a
bitmap to (1) test the access bits for many sptes in a single call to
mmu_notifier_test_young, and (2) clear the access bits for many sptes
in a single call to mmu_notifier_clear_young.
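
To give a rough idea of the caller side, the batched interface gets used
roughly like this (a sketch only; the real users are test_spte_young()
and clear_spte_young() in the MGLRU patch at the end of this series):

        unsigned long end = addr + MIN_LRU_BATCH * PAGE_SIZE;
        DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);

        bitmap_zero(bitmap, MIN_LRU_BATCH);
        /* Ask the secondary MMUs which pages in [addr, end) were accessed. */
        mmu_notifier_test_young_bitmap(mm, addr, end, bitmap);

        /* ... clear the bits for pages we do not want to age ... */

        /* Clear the accessed bit only for pages still set in the bitmap. */
        mmu_notifier_clear_young_bitmap(mm, addr, end, bitmap);

Note that the bitmap is indexed from the end of the range: bit 0
corresponds to the last page in [addr, end) (see kvm_gfn_record_young()).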

With Yu's permission, I'm now working on getting this series into a
mergeable state.

I'm posting this as an RFC because I'm not sure if the arm64 bits are
correct, and I haven't done complete performance testing. I want to do
broader experimentation to see how much this improves VM performance in
a cloud environment, but I want to be sure that the code is mergeable
first.

Yu has posted other performance results[1], [2]. This v3 shouldn't
significantly change the x86 results, but the arm64 results may have
changed.

The most important changes since v2[3]:

- Split the test_clear_young MMU notifier back into test_young and
clear_young. I did this because the bitmap passed in has a distinct
meaning for each of them, and I felt that this was cleaner.

- The return value of test_young / clear_young now indicates if the
bitmap was used.

- Removed the custom spte walker to implement the lockless path. This
was important for arm64 to be functionally correct (thanks Oliver),
and it avoids a lot of problems brought up in review of v2 (for
example[4]).

- Add kvm_arch_prepare_bitmap_age and kvm_arch_finish_bitmap_age so that
arm64 can take the MMU lock for reading around its bitmap-based aging
while x86 remains lockless.

- The powerpc changes have been dropped.

- The logic to inform architectures how to use the bitmap has been
cleaned up (kvm_should_clear_young has been split into
kvm_gfn_should_age and kvm_gfn_record_young) (thanks Nicolas).

There were some smaller changes too:
- Added test_clear_young_metadata (thanks Sean).
- MMU_NOTIFIER_RANGE_LOCKLESS has been renamed to
MMU_NOTIFIER_YOUNG_FAST, to indicate to the caller that passing a
bitmap for MGLRU look-around is likely to be beneficial.
- Cleaned up comments that describe the changes to
mmu_notifier_test_young / mmu_notifier_clear_young (thanks Nicolas).

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/
[3]: https://lore.kernel.org/kvmarm/[email protected]/
[4]: https://lore.kernel.org/all/[email protected]/

James Houghton (5):
mm: Add a bitmap into mmu_notifier_{clear,test}_young
KVM: Move MMU notifier function declarations
KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young
KVM: x86: Participate in bitmap-based PTE aging
KVM: arm64: Participate in bitmap-based PTE aging

Yu Zhao (2):
KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask
mm: multi-gen LRU: use mmu_notifier_test_clear_young()

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/include/asm/kvm_host.h | 5 +
arch/arm64/include/asm/kvm_pgtable.h | 4 +-
arch/arm64/kvm/hyp/pgtable.c | 21 +-
arch/arm64/kvm/mmu.c | 23 ++-
arch/x86/include/asm/kvm_host.h | 20 ++
arch/x86/kvm/mmu.h | 6 -
arch/x86/kvm/mmu/mmu.c | 16 +-
arch/x86/kvm/mmu/spte.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 10 +-
include/linux/kvm_host.h | 101 ++++++++--
include/linux/mmu_notifier.h | 93 ++++++++-
include/linux/mmzone.h | 6 +-
include/trace/events/kvm.h | 13 +-
mm/mmu_notifier.c | 20 +-
mm/rmap.c | 9 +-
mm/vmscan.c | 183 ++++++++++++++----
virt/kvm/kvm_main.c | 100 +++++++---
18 files changed, 509 insertions(+), 128 deletions(-)


base-commit: 0cef2c0a2a356137b170c3cb46cb9c1dd2ca3e6b
--
2.44.0.478.gd926399ef9-goog



2024-04-01 23:30:33

by James Houghton

Subject: [PATCH v3 2/7] KVM: Move MMU notifier function declarations

To allow new MMU-notifier-related functions to use gfn_to_hva_memslot(),
move some declarations around.

Also move mmu_notifier_to_kvm() for wider use later.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 41 +++++++++++++++++++++-------------------
virt/kvm/kvm_main.c | 5 -----
2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..1800d03a06a9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -257,25 +257,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#endif

-#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
-union kvm_mmu_notifier_arg {
- pte_t pte;
- unsigned long attributes;
-};
-
-struct kvm_gfn_range {
- struct kvm_memory_slot *slot;
- gfn_t start;
- gfn_t end;
- union kvm_mmu_notifier_arg arg;
- bool may_block;
-};
-bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-#endif
-
enum {
OUTSIDE_GUEST_MODE,
IN_GUEST_MODE,
@@ -2012,6 +1993,11 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];

#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+ return container_of(mn, struct kvm, mmu_notifier);
+}
+
static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
if (unlikely(kvm->mmu_invalidate_in_progress))
@@ -2089,6 +2075,23 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,

return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
}
+
+union kvm_mmu_notifier_arg {
+ pte_t pte;
+ unsigned long attributes;
+};
+
+struct kvm_gfn_range {
+ struct kvm_memory_slot *slot;
+ gfn_t start;
+ gfn_t end;
+ union kvm_mmu_notifier_arg arg;
+ bool may_block;
+};
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
#endif

#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ca4b1ef9dfc2..d0545d88c802 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -534,11 +534,6 @@ void kvm_destroy_vcpus(struct kvm *kvm)
EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);

#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
-static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
-{
- return container_of(mn, struct kvm, mmu_notifier);
-}
-
typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);

typedef void (*on_lock_fn_t)(struct kvm *kvm);
--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:30:42

by James Houghton

Subject: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

The bitmap is provided for secondary MMUs to use if they support it. For
test_young(), after it returns, the bitmap represents the pages that
were young in the interval [start, end). For clear_young, it represents
the pages that we wish the secondary MMU to clear the accessed/young bit
for.

If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
should be unchanged except that if young PTEs are found and the
architecture supports passing in a bitmap, instead of returning 1,
MMU_NOTIFIER_YOUNG_FAST is returned.

This allows MGLRU's look-around logic to work faster, resulting in a 4%
improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
to indicate to main mm that doing look-around is likely to be
beneficial.

If the secondary MMU doesn't support the bitmap, it must return
an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
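
For illustration, a caller that passes a bitmap is expected to handle
the return value along these lines (a sketch only; the real consumer is
MGLRU, added later in this series):

        bitmap_zero(bitmap, nr_pages);
        ret = mmu_notifier_test_young_bitmap(mm, start, end, bitmap);
        if (ret & MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE) {
                /* Some secondary MMU ignored the bitmap; its contents
                 * cannot be trusted, so check pages individually. */
        } else if (ret & MMU_NOTIFIER_YOUNG_FAST) {
                /* The walk was cheap; batching more pages (look-around)
                 * is likely to be worthwhile. */
        }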

[1]: https://lore.kernel.org/all/[email protected]/

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
include/trace/events/kvm.h | 13 +++--
mm/mmu_notifier.c | 20 +++++---
virt/kvm/kvm_main.c | 19 ++++++--
4 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+#define MMU_NOTIFIER_YOUNG (1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
+#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)
+
struct mmu_notifier_ops {
/*
* Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
* clear_young is a lightweight version of clear_flush_young. Like the
* latter, it is supposed to test-and-clear the young/accessed bitflag
* in the secondary pte, but it may omit flushing the secondary tlb.
+ *
+ * If @bitmap is given but is not supported, return
+ * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ *
+ * If the walk is done "quickly" and there were young PTEs,
+ * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
int (*clear_young)(struct mmu_notifier *subscription,
struct mm_struct *mm,
unsigned long start,
- unsigned long end);
+ unsigned long end,
+ unsigned long *bitmap);

/*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
* down the secondary mapping on the page.
+ *
+ * If @bitmap is given but is not supported, return
+ * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ *
+ * If the walk is done "quickly" and there were young PTEs,
+ * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
int (*test_young)(struct mmu_notifier *subscription,
struct mm_struct *mm,
- unsigned long address);
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap);

/*
* change_pte is called in cases that pte mapping to page is changed:
@@ -388,10 +407,11 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
unsigned long start,
unsigned long end);
extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address);
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte);
extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -427,7 +447,25 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_clear_young(mm, start, end);
+ return __mmu_notifier_clear_young(mm, start, end, NULL);
+ return 0;
+}
+
+/*
+ * When @bitmap is not provided, clear the young bits in the secondary
+ * MMUs for all of the pages in the interval [start, end).
+ *
+ * If any subscribed secondary MMU does not support @bitmap, this function
+ * will return an integer containing MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ * Some work may have been done in the secondary MMU.
+ */
+static inline int mmu_notifier_clear_young_bitmap(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_clear_young(mm, start, end, bitmap);
return 0;
}

@@ -435,7 +473,25 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
if (mm_has_notifiers(mm))
- return __mmu_notifier_test_young(mm, address);
+ return __mmu_notifier_test_young(mm, address, address + 1,
+ NULL);
+ return 0;
+}
+
+/*
+ * When @bitmap is not provided, test the young bits in the secondary
+ * MMUs for all of the pages in the interval [start, end).
+ *
+ * If any subscribed secondary MMU does not support @bitmap, this function
+ * will return an integer containing MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+ */
+static inline int mmu_notifier_test_young_bitmap(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_test_young(mm, start, end, bitmap);
return 0;
}

@@ -644,12 +700,35 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return 0;
+}
+
+static inline int mmu_notifier_clear_young_bitmap(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
+{
+ return 0;
+}
+
static inline int mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address)
{
return 0;
}

+static inline int mmu_notifier_test_young_bitmap(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
+{
+ return 0;
+}
+
static inline void mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte)
{
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 011fba6b5552..e4ace8cfdbba 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -490,18 +490,21 @@ TRACE_EVENT(kvm_age_hva,
);

TRACE_EVENT(kvm_test_age_hva,
- TP_PROTO(unsigned long hva),
- TP_ARGS(hva),
+ TP_PROTO(unsigned long start, unsigned long end),
+ TP_ARGS(start, end),

TP_STRUCT__entry(
- __field( unsigned long, hva )
+ __field( unsigned long, start )
+ __field( unsigned long, end )
),

TP_fast_assign(
- __entry->hva = hva;
+ __entry->start = start;
+ __entry->end = end;
),

- TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
+ TP_printk("mmu notifier test age hva: %#016lx -- %#016lx",
+ __entry->start, __entry->end)
);

#endif /* _TRACE_KVM_MAIN_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ec3b068cbbe6..e70c6222944c 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,

int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ unsigned long *bitmap)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -395,7 +396,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->clear_young)
young |= subscription->ops->clear_young(subscription,
- mm, start, end);
+ mm, start, end,
+ bitmap);
}
srcu_read_unlock(&srcu, id);

@@ -403,7 +405,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
}

int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -413,9 +416,14 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
if (subscription->ops->test_young) {
- young = subscription->ops->test_young(subscription, mm,
- address);
- if (young)
+ young |= subscription->ops->test_young(subscription, mm,
+ start, end,
+ bitmap);
+ if (young && !bitmap)
+ /*
+ * We're not using a bitmap, so there is no
+ * need to check any more secondary MMUs.
+ */
break;
}
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb49c2a60200..ca4b1ef9dfc2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ unsigned long *bitmap)
{
trace_kvm_age_hva(start, end);

+ /* We don't support bitmaps. Don't test or clear anything. */
+ if (bitmap)
+ return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
+
/*
* Even though we do not flush TLB, this will still adversely
* affect performance on pre-Haswell Intel EPT, where there is
@@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long address)
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
{
- trace_kvm_test_age_hva(address);
+ trace_kvm_test_age_hva(start, end);
+
+ /* We don't support bitmaps. Don't test or clear anything. */
+ if (bitmap)
+ return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

- return kvm_handle_hva_range_no_flush(mn, address, address + 1,
+ return kvm_handle_hva_range_no_flush(mn, start, end,
kvm_test_age_gfn);
}

--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:30:48

by James Houghton

Subject: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

Add kvm_arch_prepare_bitmap_age() for architectures to indicate that
they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
and that they do not need KVM to grab the MMU lock for writing. This
function also gives architectures a place to take other locks or do any
other preparatory work they need.

If an architecture does not implement kvm_arch_prepare_bitmap_age() or
is unable to do bitmap-based aging at runtime (and marks the bitmap as
unreliable):
1. If a bitmap was provided, we inform the caller that the bitmap is
unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
2. If a bitmap was not provided, fall back to the old logic.

Also add helpers so that architectures can easily use the provided
bitmap when they are able to. The expectation is that the architecture's
implementation of kvm_test_age_gfn() will use kvm_gfn_record_young(),
and kvm_age_gfn() will use kvm_gfn_should_age().
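
For example (purely illustrative; the real users are the x86 and arm64
patches later in this series, and spte_is_young()/spte_clear_young()
below are stand-ins for an architecture's own accessors):

        /* kvm_test_age_gfn() path: report young pages back to the caller. */
        if (spte_is_young(spte))        /* stand-in for the arch accessor */
                kvm_gfn_record_young(range, gfn);

        /* kvm_age_gfn() path: only clear what the caller asked us to. */
        if (kvm_gfn_should_age(range, gfn))
                spte_clear_young(spte); /* stand-in for the arch accessor */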

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 60 ++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++++++++-----------
2 files changed, 127 insertions(+), 25 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1800d03a06a9..5862fd7b5f9b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
extern const struct kvm_stats_header kvm_vcpu_stats_header;
extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];

+/*
+ * Architectures that support using bitmaps for kvm_age_gfn() and
+ * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
+ * and do any work they need to prepare. The subsequent walk will not
+ * automatically grab the KVM MMU lock, so some architectures may opt
+ * to grab it.
+ *
+ * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
+ * guaranteed.
+ */
+#ifndef kvm_arch_prepare_bitmap_age
+static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+ return false;
+}
+#endif
+#ifndef kvm_arch_finish_bitmap_age
+static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
+#endif
+
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
{
@@ -2076,9 +2096,16 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
}

+struct test_clear_young_metadata {
+ unsigned long *bitmap;
+ unsigned long bitmap_offset_end;
+ unsigned long end;
+ bool unreliable;
+};
union kvm_mmu_notifier_arg {
pte_t pte;
unsigned long attributes;
+ struct test_clear_young_metadata *metadata;
};

struct kvm_gfn_range {
@@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
gfn_t end;
union kvm_mmu_notifier_arg arg;
bool may_block;
+ bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+
+static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
+{
+ struct test_clear_young_metadata *args = range->arg.metadata;
+
+ args->unreliable = true;
+}
+static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range *range,
+ gfn_t gfn)
+{
+ struct test_clear_young_metadata *args = range->arg.metadata;
+
+ return hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
+}
+static inline void kvm_gfn_record_young(struct kvm_gfn_range *range, gfn_t gfn)
+{
+ struct test_clear_young_metadata *args = range->arg.metadata;
+
+ WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+ if (args->bitmap)
+ __set_bit(kvm_young_bitmap_offset(range, gfn), args->bitmap);
+}
+static inline bool kvm_gfn_should_age(struct kvm_gfn_range *range, gfn_t gfn)
+{
+ struct test_clear_young_metadata *args = range->arg.metadata;
+
+ WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+ if (args->bitmap)
+ return test_bit(kvm_young_bitmap_offset(range, gfn),
+ args->bitmap);
+ return true;
+}
#endif

#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0545d88c802..7d80321e2ece 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -550,6 +550,7 @@ struct kvm_mmu_notifier_range {
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
+ bool lockless;
};

/*
@@ -598,6 +599,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_memslots *slots;
int i, idx;

+ BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.pte));
+
if (WARN_ON_ONCE(range->end <= range->start))
return r;

@@ -637,15 +640,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
+ gfn_range.lockless = range->lockless;

if (!r.found_memslot) {
r.found_memslot = true;
- KVM_MMU_LOCK(kvm);
- if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm);
-
- if (IS_KVM_NULL_FN(range->handler))
- break;
+ if (!range->lockless) {
+ KVM_MMU_LOCK(kvm);
+ if (!IS_KVM_NULL_FN(range->on_lock))
+ range->on_lock(kvm);
+
+ if (IS_KVM_NULL_FN(range->handler))
+ break;
+ }
}
r.ret |= range->handler(kvm, &gfn_range);
}
@@ -654,7 +660,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);

- if (r.found_memslot)
+ if (r.found_memslot && !range->lockless)
KVM_MMU_UNLOCK(kvm);

srcu_read_unlock(&kvm->srcu, idx);
@@ -682,19 +688,24 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
return __kvm_handle_hva_range(kvm, &range).ret;
}

-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
- unsigned long start,
- unsigned long end,
- gfn_handler_t handler)
+static __always_inline int kvm_handle_hva_range_no_flush(
+ struct mmu_notifier *mn,
+ unsigned long start,
+ unsigned long end,
+ gfn_handler_t handler,
+ union kvm_mmu_notifier_arg arg,
+ bool lockless)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start = start,
.end = end,
.handler = handler,
+ .arg = arg,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
+ .lockless = lockless,
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -909,15 +920,36 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
kvm_age_gfn);
}

-static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long start,
- unsigned long end,
- unsigned long *bitmap)
+static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap,
+ bool clear)
{
- trace_kvm_age_hva(start, end);
+ if (kvm_arch_prepare_bitmap_age(mn)) {
+ struct test_clear_young_metadata args = {
+ .bitmap = bitmap,
+ .end = end,
+ .unreliable = false,
+ };
+ union kvm_mmu_notifier_arg arg = {
+ .metadata = &args
+ };
+ bool young;
+
+ young = kvm_handle_hva_range_no_flush(
+ mn, start, end,
+ clear ? kvm_age_gfn : kvm_test_age_gfn,
+ arg, true);
+
+ kvm_arch_finish_bitmap_age(mn);

- /* We don't support bitmaps. Don't test or clear anything. */
+ if (!args.unreliable)
+ return young ? MMU_NOTIFIER_YOUNG_FAST : 0;
+ }
+
+ /* A bitmap was passed but the architecture doesn't support bitmaps */
if (bitmap)
return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

@@ -934,7 +966,21 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
* cadence. If we find this inaccurate, we might come up with a
* more sophisticated heuristic later.
*/
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_handle_hva_range_no_flush(
+ mn, start, end, clear ? kvm_age_gfn : kvm_test_age_gfn,
+ KVM_MMU_NOTIFIER_NO_ARG, false);
+}
+
+static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap)
+{
+ trace_kvm_age_hva(start, end);
+
+ return kvm_mmu_notifier_test_clear_young(mn, mm, start, end, bitmap,
+ true);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -945,12 +991,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
{
trace_kvm_test_age_hva(start, end);

- /* We don't support bitmaps. Don't test or clear anything. */
- if (bitmap)
- return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
-
- return kvm_handle_hva_range_no_flush(mn, start, end,
- kvm_test_age_gfn);
+ return kvm_mmu_notifier_test_clear_young(mn, mm, start, end, bitmap,
+ false);
}

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:31:03

by James Houghton

Subject: [PATCH v3 4/7] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

From: Yu Zhao <[email protected]>

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_prepare_bitmap_age().

Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/mmu.h | 6 ------
arch/x86/kvm/mmu/spte.h | 1 -
3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16e07a2eee19..3b58e2306621 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1847,6 +1847,7 @@ struct kvm_arch_async_pf {

extern u32 __read_mostly kvm_nr_uret_msrs;
extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
extern bool __read_mostly allow_smaller_maxphyaddr;
extern bool __read_mostly enable_apicv;
extern struct kvm_x86_ops kvm_x86_ops;
@@ -1952,6 +1953,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
bool mask);

extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif

u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..8ae279035900 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -270,12 +270,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}

-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..f791fe045c7d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -154,7 +154,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:31:43

by James Houghton

Subject: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

Only handle the TDP MMU case for now. In other cases, if a bitmap was
not provided, fall back to the slow path that takes the mmu_lock; if a
bitmap was provided, inform the caller that the bitmap is unreliable.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
3 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b58e2306621..c30918d0887e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+ /*
+ * Indicate that we support bitmap-based aging when using the TDP MMU
+ * and the accessed bit is available in the TDP page tables.
+ *
+ * We have no other preparatory work to do here, so we do not need to
+ * redefine kvm_arch_finish_bitmap_age().
+ */
+ return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
+ && shadow_accessed_mask;
+}
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..fae1a75750bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ if (range->lockless) {
+ kvm_age_set_unreliable(range);
+ return false;
+ }
+
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1687,8 +1693,14 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ if (range->lockless) {
+ kvm_age_set_unreliable(range);
+ return false;
+ }
+
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..edea01bc145f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1217,6 +1217,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
if (!is_accessed_spte(iter->old_spte))
return false;

+ if (!kvm_gfn_should_age(range, iter->gfn))
+ return false;
+
if (spte_ad_enabled(iter->old_spte)) {
iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
iter->old_spte,
@@ -1250,7 +1253,12 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
- return is_accessed_spte(iter->old_spte);
+ bool young = is_accessed_spte(iter->old_spte);
+
+ if (young)
+ kvm_gfn_record_young(range, iter->gfn);
+
+ return young;
}

bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:31:47

by James Houghton

Subject: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

Participate in bitmap-based aging while grabbing the KVM MMU lock for
reading. Ideally we wouldn't need to grab this lock at all, but that
would require a more intrustive and risky change. Also pass
KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in
parallel with other walkers.

It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable
is destroyed while holding the lock for writing, and freeing of the page
table pages is either done while holding the MMU lock for writing or
after an RCU grace period.

When mkold == false, record the young pages in the passed-in bitmap.

When mkold == true, only age the pages that need aging according to the
passed-in bitmap.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 5 +++++
arch/arm64/include/asm/kvm_pgtable.h | 4 +++-
arch/arm64/kvm/hyp/pgtable.c | 21 ++++++++++++++-------
arch/arm64/kvm/mmu.c | 23 +++++++++++++++++++++--
4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9e8a496fb284..e503553cb356 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1331,4 +1331,9 @@ bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
(get_idreg_field((kvm), id, fld) >= expand_field_sign(id, fld, min) && \
get_idreg_field((kvm), id, fld) <= expand_field_sign(id, fld, max))

+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn);
+#define kvm_arch_finish_bitmap_age kvm_arch_finish_bitmap_age
+void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn);
+
#endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe7978..1976b4e26188 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -644,6 +644,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr);
* @addr: Intermediate physical address to identify the page-table entry.
* @size: Size of the address range to visit.
* @mkold: True if the access flag should be cleared.
+ * @range: The kvm_gfn_range that is being used for the memslot walker.
*
* The offset of @addr within a page is ignored.
*
@@ -657,7 +658,8 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr);
* Return: True if any of the visited PTEs had the access flag set.
*/
bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
- u64 size, bool mkold);
+ u64 size, bool mkold,
+ struct kvm_gfn_range *range);

/**
* kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..e881d3595aca 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1281,6 +1281,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
}

struct stage2_age_data {
+ struct kvm_gfn_range *range;
bool mkold;
bool young;
};
@@ -1290,20 +1291,24 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
{
kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
struct stage2_age_data *data = ctx->arg;
+ gfn_t gfn = ctx->addr / PAGE_SIZE;

if (!kvm_pte_valid(ctx->old) || new == ctx->old)
return 0;

data->young = true;

+
/*
- * stage2_age_walker() is always called while holding the MMU lock for
- * write, so this will always succeed. Nonetheless, this deliberately
- * follows the race detection pattern of the other stage-2 walkers in
- * case the locking mechanics of the MMU notifiers is ever changed.
+ * stage2_age_walker() may not be holding the MMU lock for write, so
+ * follow the race detection pattern of the other stage-2 walkers.
*/
- if (data->mkold && !stage2_try_set_pte(ctx, new))
- return -EAGAIN;
+ if (data->mkold) {
+ if (kvm_gfn_should_age(data->range, gfn) &&
+ !stage2_try_set_pte(ctx, new))
+ return -EAGAIN;
+ } else
+ kvm_gfn_record_young(data->range, gfn);

/*
* "But where's the TLBI?!", you scream.
@@ -1315,10 +1320,12 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
}

bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
- u64 size, bool mkold)
+ u64 size, bool mkold,
+ struct kvm_gfn_range *range)
{
struct stage2_age_data data = {
.mkold = mkold,
+ .range = range,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_age_walker,
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 18680771cdb0..104cc23e9bb3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1802,6 +1802,25 @@ bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
return false;
}

+bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+ /*
+ * We need to hold the MMU lock for reading to prevent page tables
+ * from being freed underneath us.
+ */
+ read_lock(&kvm->mmu_lock);
+ return true;
+}
+
+void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn)
+{
+ struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+ read_unlock(&kvm->mmu_lock);
+}
+
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
@@ -1811,7 +1830,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)

return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
- size, true);
+ size, true, range);
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1823,7 +1842,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)

return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT,
- size, false);
+ size, false, range);
}

phys_addr_t kvm_mmu_get_httbr(void)
--
2.44.0.478.gd926399ef9-goog


2024-04-01 23:32:09

by James Houghton

Subject: [PATCH v3 7/7] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

From: Yu Zhao <[email protected]>

Use mmu_notifier_{test,clear}_young_bitmap() to handle KVM PTEs in
batches when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

$ sudo max_guest_memory_test -c 64 -m 250 -s 250

MGLRU run2
------------------
Before [1] ~64s
After ~51s

kswapd (MGLRU before)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.71% evict_folios
            97.29% shrink_folio_list
==>>          13.05% folio_referenced
                12.83% rmap_walk_file
                  12.31% folio_referenced_one
                    7.90% __mmu_notifier_clear_young
                      7.72% kvm_mmu_notifier_clear_young
                        7.34% _raw_write_lock

kswapd (MGLRU after)
  100.00% balance_pgdat
    100.00% shrink_node
      100.00% shrink_one
        99.99% try_to_shrink_lruvec
          99.59% evict_folios
            80.37% shrink_folio_list
==>>          3.74% folio_referenced
                3.59% rmap_walk_file
                  3.19% folio_referenced_one
                    2.53% lru_gen_look_around
                      1.06% __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/[email protected]/

Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
include/linux/mmzone.h | 6 +-
mm/rmap.c | 9 +-
mm/vmscan.c | 183 ++++++++++++++----
4 files changed, 159 insertions(+), 45 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+ batches, when KVM MMU sets it (e.g., on x86_64). This can
+ improve the performance of guests when the host is under memory
+ pressure.
[yYnN] Apply to all the components above.
====== ===============================================================

@@ -56,7 +60,7 @@ E.g.,

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
- 0x0007
+ 0x000f
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..a98de5106990 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -397,6 +397,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_KVM_MMU_WALK,
NR_LRU_GEN_CAPS
};

@@ -554,7 +555,7 @@ struct lru_gen_memcg {

void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index 56b313aa2ebf..41e9fc25684e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -871,13 +871,10 @@ static bool folio_referenced_one(struct folio *folio,
continue;
}

- if (pvmw.pte) {
- if (lru_gen_enabled() &&
- pte_young(ptep_get(pvmw.pte))) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 293120fe54f3..fd65f3466dfc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2596,6 +2597,11 @@ static bool should_clear_pmd_young(void)
return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

+static bool should_walk_kvm_mmu(void)
+{
+ return get_cap(LRU_GEN_KVM_MMU_WALK);
+}
+
/******************************************************************************
* shorthand helpers
******************************************************************************/
@@ -3293,7 +3299,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
return false;
}

-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pte_pfn(pte);

@@ -3308,10 +3315,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pmd_pfn(pmd);

@@ -3326,6 +3338,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

@@ -3334,10 +3350,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
{
struct folio *folio;

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- return NULL;
-
folio = pfn_folio(pfn);
if (folio_nid(folio) != pgdat->node_id)
return NULL;
@@ -3352,6 +3364,52 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
return folio;
}

+static bool test_spte_young(struct mm_struct *mm, unsigned long addr, unsigned long end,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (*last > addr)
+ goto done;
+
+ *last = end - addr > MIN_LRU_BATCH * PAGE_SIZE ?
+ addr + MIN_LRU_BATCH * PAGE_SIZE - 1 : end - 1;
+ bitmap_zero(bitmap, MIN_LRU_BATCH);
+
+ mmu_notifier_test_young_bitmap(mm, addr, *last + 1, bitmap);
+done:
+ return test_bit((*last - addr) / PAGE_SIZE, bitmap);
+}
+
+static void clear_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ int i;
+ unsigned long start, end = *last + 1;
+
+ if (addr + PAGE_SIZE != end)
+ return;
+
+ i = find_last_bit(bitmap, MIN_LRU_BATCH);
+ if (i == MIN_LRU_BATCH)
+ return;
+
+ start = end - (i + 1) * PAGE_SIZE;
+
+ i = find_first_bit(bitmap, MIN_LRU_BATCH);
+
+ end -= i * PAGE_SIZE;
+
+ mmu_notifier_clear_young_bitmap(mm, start, end, bitmap);
+}
+
+static void skip_spte_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *bitmap, unsigned long *last)
+{
+ if (*last > addr)
+ __clear_bit((*last - addr) / PAGE_SIZE, bitmap);
+
+ clear_spte_young(mm, addr, bitmap, last);
+}
+
static bool suitable_to_scan(int total, int young)
{
int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
@@ -3367,6 +3425,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
+ DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+ unsigned long last = 0;
int total = 0;
int young = 0;
struct lru_gen_mm_walk *walk = args->private;
@@ -3386,6 +3446,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
arch_enter_lazy_mmu_mode();
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool ret;
unsigned long pfn;
struct folio *folio;
pte_t ptent = ptep_get(pte + i);
@@ -3393,21 +3454,28 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
total++;
walk->mm_stats[MM_LEAF_TOTAL]++;

- pfn = get_pte_pfn(ptent, args->vma, addr);
- if (pfn == -1)
+ pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
+ if (pfn == -1) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(ptent)) {
+ ret = test_spte_young(args->vma->vm_mm, addr, end, bitmap, &last);
+ if (!ret && !pte_young(ptent)) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}

folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(args->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(args->vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3473,22 +3541,24 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;

- pfn = get_pmd_pfn(pmd[i], vma, addr);
- if (pfn == -1)
- goto next;
-
- if (!pmd_trans_huge(pmd[i])) {
- if (should_clear_pmd_young())
+ if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+ if (should_clear_pmd_young() && !mm_has_notifiers(args->mm))
pmdp_test_and_clear_young(vma, addr, pmd + i);
goto next;
}

+ pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+ if (pfn == -1)
+ goto next;
+
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
goto next;

- if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ if (!pmdp_clear_young_notify(vma, addr, pmd + i)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
goto next;
+ }

walk->mm_stats[MM_LEAF_YOUNG]++;

@@ -3545,19 +3615,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
}

if (pmd_trans_huge(val)) {
- unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);

walk->mm_stats[MM_LEAF_TOTAL]++;

- if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
+ if (pfn == -1)
continue;
- }

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
continue;
+ }

walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
@@ -3565,7 +3634,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,

walk->mm_stats[MM_NONLEAF_TOTAL]++;

- if (should_clear_pmd_young()) {
+ if (should_clear_pmd_young() && !mm_has_notifiers(args->mm)) {
if (!pmd_young(val))
continue;

@@ -3646,6 +3715,9 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
struct lruvec *lruvec = walk->lruvec;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);

+ if (!should_walk_kvm_mmu() && mm_has_notifiers(mm))
+ return;
+
walk->next_addr = FIRST_USER_ADDRESS;

do {
@@ -4011,6 +4083,23 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ int ret = mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);
+
+ if (pte_young(ptep_get(pte))) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return true;
+ }
+
+ if (ret)
+ *young = true;
+
+ return ret & MMU_NOTIFIER_YOUNG_FAST;
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4018,12 +4107,14 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
unsigned long end;
struct lru_gen_mm_walk *walk;
+ DECLARE_BITMAP(bitmap, MIN_LRU_BATCH);
+ unsigned long last = 0;
int young = 0;
pte_t *pte = pvmw->pte;
unsigned long addr = pvmw->address;
@@ -4040,12 +4131,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(pvmw->vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4053,6 +4147,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, vma->vm_start);
end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4066,29 +4163,38 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

pte -= (addr - start) / PAGE_SIZE;

for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+ bool ret;
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);

- pfn = get_pte_pfn(ptent, vma, addr);
- if (pfn == -1)
+ pfn = get_pte_pfn(ptent, vma, addr, pgdat);
+ if (pfn == -1) {
+ skip_spte_young(vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!pte_young(ptent))
+ ret = test_spte_young(pvmw->vma->vm_mm, addr, end, bitmap, &last);
+ if (!ret && !pte_young(ptent)) {
+ skip_spte_young(pvmw->vma->vm_mm, addr, bitmap, &last);
continue;
+ }

folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
- if (!folio)
+ if (!folio) {
+ skip_spte_young(vma->vm_mm, addr, bitmap, &last);
continue;
+ }

- if (!ptep_test_and_clear_young(vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ clear_spte_young(vma->vm_mm, addr, bitmap, &last);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(vma, addr, pte + i);

young++;

@@ -4118,6 +4224,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5154,6 +5262,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
if (should_clear_pmd_young())
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (should_walk_kvm_mmu())
+ caps |= BIT(LRU_GEN_KVM_MMU_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.44.0.478.gd926399ef9-goog


2024-04-02 04:07:47

by Yu Zhao

Subject: Re: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

On Mon, Apr 1, 2024 at 7:30 PM James Houghton <[email protected]> wrote:
>
> Participate in bitmap-based aging while grabbing the KVM MMU lock for
> reading. Ideally we wouldn't need to grab this lock at all, but that
> would require a more intrustive and risky change.
^^^^^^^^^^ intrusive
This sounds subjective -- I'd just present the challenges and let
reviewers make their own judgements.

> Also pass
> KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in
> parallel with other walkers.
>
> It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable
> is destroyed while holding the lock for writing, and freeing of the page
> table pages is either done while holding the MMU lock for writing or
> after an RCU grace period.
>
> When mkold == false, record the young pages in the passed-in bitmap.
>
> When mkold == true, only age the pages that need aging according to the
> passed-in bitmap.
>
> Suggested-by: Yu Zhao <[email protected]>

Thanks but I did not suggest this.

What I have in v2 is RCU based. I hope Oliver or someone else can help
make that work. Otherwise we can just drop this for now and revisit
later.

(I have no problems with this patch if the Arm folks think the
RCU-based version doesn't have a good ROI.)

2024-04-02 07:00:56

by Oliver Upton

Subject: Re: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

On Tue, Apr 02, 2024 at 12:06:56AM -0400, Yu Zhao wrote:
> On Mon, Apr 1, 2024 at 7:30 PM James Houghton <[email protected]> wrote:
> > Suggested-by: Yu Zhao <[email protected]>
>
> Thanks but I did not suggest this.

Entirely up to you, but I would still want to credit everyone who
contributed to a feature even if the underlying implementation has
changed since the original attempt.

> What I have in v2 is RCU based. I hope Oliver or someone else can help
> make that work.

Why? If there's data to show that RCU has a material benefit over taking
the MMU lock for read then I'm all for it. Otherwise, the work required
to support page-table walkers entirely outside of the MMU lock isn't
justified.

In addition to ensuring that page table teardown is always RCU-safe,
we'd need to make sure all of the walkers that take the write lock are
prepared to handle races.

> Otherwise we can just drop this for now and revisit
> later.

I really wouldn't get hung up on the locking as the make-or-break for
whether arm64 supports this MMU notifier.

--
Thanks,
Oliver

2024-04-02 07:33:51

by Marc Zyngier

Subject: Re: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

On Tue, 02 Apr 2024 05:06:56 +0100,
Yu Zhao <[email protected]> wrote:
>
> On Mon, Apr 1, 2024 at 7:30 PM James Houghton <[email protected]> wrote:
> >
> > Participate in bitmap-based aging while grabbing the KVM MMU lock for
> > reading. Ideally we wouldn't need to grab this lock at all, but that
> > would require a more intrustive and risky change.
> ^^^^^^^^^^ intrusive
> This sounds subjective -- I'd just present the challenges and let
> reviewers make their own judgements.

Quite the opposite.

This sort of comment actually indicates that the author has at least
understood some of the complexity behind the proposed changes. It is a
qualitative comment that conveys useful information to reviewers, and
even more to the maintainers of this code.

That's the difference between a human developer and a bot, and I'm not
overly fond of bots.

M.

--
Without deviation from the norm, progress is not possible.

2024-04-04 19:21:08

by David Hildenbrand

Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

On 02.04.24 01:29, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
>
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
>
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
>
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Suggested-by: Yu Zhao <[email protected]>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
> include/trace/events/kvm.h | 13 +++--
> mm/mmu_notifier.c | 20 +++++---
> virt/kvm/kvm_main.c | 19 ++++++--
> 4 files changed, 123 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>
> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>
> +#define MMU_NOTIFIER_YOUNG (1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)

Especially this one really deserves some documentation :)

> +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)

And that one as well.

Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).

> +
> struct mmu_notifier_ops {
> /*
> * Called either by mmu_notifier_unregister or when the mm is
> @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
> * clear_young is a lightweight version of clear_flush_young. Like the
> * latter, it is supposed to test-and-clear the young/accessed bitflag
> * in the secondary pte, but it may omit flushing the secondary tlb.
> + *
> + * If @bitmap is given but is not supported, return
> + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> + *
> + * If the walk is done "quickly" and there were young PTEs,
> + * MMU_NOTIFIER_YOUNG_FAST is returned.
> */
> int (*clear_young)(struct mmu_notifier *subscription,
> struct mm_struct *mm,
> unsigned long start,
> - unsigned long end);
> + unsigned long end,
> + unsigned long *bitmap);
>
> /*
> * test_young is called to check the young/accessed bitflag in
> * the secondary pte. This is used to know if the page is
> * frequently used without actually clearing the flag or tearing
> * down the secondary mapping on the page.
> + *
> + * If @bitmap is given but is not supported, return
> + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> + *
> + * If the walk is done "quickly" and there were young PTEs,
> + * MMU_NOTIFIER_YOUNG_FAST is returned.
> */
> int (*test_young)(struct mmu_notifier *subscription,
> struct mm_struct *mm,
> - unsigned long address);
> + unsigned long start,
> + unsigned long end,
> + unsigned long *bitmap);

What does "quickly" mean (why not use "fast")? What are the semantics, I
don't find any existing usage of that in this file.

Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

--
Cheers,

David / dhildenb


2024-04-09 18:32:25

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

Ah, I didn't see this in my inbox, sorry David!

On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
>
> On 02.04.24 01:29, James Houghton wrote:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> > #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> Especially this one really deserves some documentation :)

Yes, will do. Something like

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).

>
> > +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)
>
> And that one as well.

Something like

Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.
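
To make that concrete, a caller could branch on the flags roughly like
this (only a sketch, not code from the series; do_lookaround() and
record_young_page() are placeholder names, and the four-argument
mmu_notifier_test_young() is the updated signature from patch 1):

int ret = mmu_notifier_test_young(mm, start, end, NULL);

if (ret & MMU_NOTIFIER_YOUNG_FAST) {
        /*
         * Harvesting was cheap (lockless on x86, MMU read lock on
         * arm64), so bitmap-based look-around is likely worthwhile.
         */
        do_lookaround(mm, start, end);
} else if (ret & MMU_NOTIFIER_YOUNG) {
        /* Young, but only record this single page. */
        record_young_page();
}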

>
> Likely best to briefly document all of them, and how they are
> supposed to be used (return value for X).

Right. Will do.

>
> > +
> > struct mmu_notifier_ops {
> > /*
> > * Called either by mmu_notifier_unregister or when the mm is
> > @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
> > * clear_young is a lightweight version of clear_flush_young. Like the
> > * latter, it is supposed to test-and-clear the young/accessed bitflag
> > * in the secondary pte, but it may omit flushing the secondary tlb.
> > + *
> > + * If @bitmap is given but is not supported, return
> > + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > + *
> > + * If the walk is done "quickly" and there were young PTEs,
> > + * MMU_NOTIFIER_YOUNG_FAST is returned.
> > */
> > int (*clear_young)(struct mmu_notifier *subscription,
> > struct mm_struct *mm,
> > unsigned long start,
> > - unsigned long end);
> > + unsigned long end,
> > + unsigned long *bitmap);
> >
> > /*
> > * test_young is called to check the young/accessed bitflag in
> > * the secondary pte. This is used to know if the page is
> > * frequently used without actually clearing the flag or tearing
> > * down the secondary mapping on the page.
> > + *
> > + * If @bitmap is given but is not supported, return
> > + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > + *
> > + * If the walk is done "quickly" and there were young PTEs,
> > + * MMU_NOTIFIER_YOUNG_FAST is returned.
> > */
> > int (*test_young)(struct mmu_notifier *subscription,
> > struct mm_struct *mm,
> > - unsigned long address);
> > + unsigned long start,
> > + unsigned long end,
> > + unsigned long *bitmap);
>
> What does "quickly" mean (why not use "fast")? What are the semantics, I
> don't find any existing usage of that in this file.

"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.

>
> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.

Thanks David!

2024-04-09 19:35:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

On 09.04.24 20:31, James Houghton wrote:
> Ah, I didn't see this in my inbox, sorry David!

No worries :)

>
> On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 02.04.24 01:29, James Houghton wrote:
>>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>>> index f349e08a9dfe..daaa9db625d3 100644
>>> --- a/include/linux/mmu_notifier.h
>>> +++ b/include/linux/mmu_notifier.h
>>> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>>>
>>> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>>>
>>> +#define MMU_NOTIFIER_YOUNG (1 << 0)
>>> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>>
>> Especially this one really deserves some documentation :)
>
> Yes, will do. Something like
>
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
> bitmap either (1) does not accurately represent the age of the pages
> (in the case of test_young), or (2) was not able to be used to
> completely clear the age/access bit (in the case of clear_young).

Make sense. I do wonder what the expected reaction from the caller is :)

>
>>
>>> +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)
>>
>> And that one as well.
>
> Something like
>
> Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
> would have been supported for this address range.
>
> The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
> is able to harvest the access bit "fast" (so for x86, locklessly, and
> for arm64, with the KVM MMU read lock), "fast" enough that using a
> bitmap to do look-around is probably a good idea.

Is that really the right way to communicate that ("would have been
supported") -- wouldn't we want to sense support differently?

>
>>
>> Likely best to briefly document all of them, and how they are
>> supposed to be used (return value for X).
>
> Right. Will do.
>
>>
>>> +
>>> struct mmu_notifier_ops {
>>> /*
>>> * Called either by mmu_notifier_unregister or when the mm is
>>> @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
>>> * clear_young is a lightweight version of clear_flush_young. Like the
>>> * latter, it is supposed to test-and-clear the young/accessed bitflag
>>> * in the secondary pte, but it may omit flushing the secondary tlb.
>>> + *
>>> + * If @bitmap is given but is not supported, return
>>> + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>>> + *
>>> + * If the walk is done "quickly" and there were young PTEs,
>>> + * MMU_NOTIFIER_YOUNG_FAST is returned.
>>> */
>>> int (*clear_young)(struct mmu_notifier *subscription,
>>> struct mm_struct *mm,
>>> unsigned long start,
>>> - unsigned long end);
>>> + unsigned long end,
>>> + unsigned long *bitmap);
>>>
>>> /*
>>> * test_young is called to check the young/accessed bitflag in
>>> * the secondary pte. This is used to know if the page is
>>> * frequently used without actually clearing the flag or tearing
>>> * down the secondary mapping on the page.
>>> + *
>>> + * If @bitmap is given but is not supported, return
>>> + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>>> + *
>>> + * If the walk is done "quickly" and there were young PTEs,
>>> + * MMU_NOTIFIER_YOUNG_FAST is returned.
>>> */
>>> int (*test_young)(struct mmu_notifier *subscription,
>>> struct mm_struct *mm,
>>> - unsigned long address);
>>> + unsigned long start,
>>> + unsigned long end,
>>> + unsigned long *bitmap);
>>
>> What does "quickly" mean (why not use "fast")? What are the semantics, I
>> don't find any existing usage of that in this file.
>
> "fast" means fast enough such that using a bitmap to scan adjacent
> pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
> this comment. Perhaps I should just rename it to
> MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
> beneficial" thing -- that's for MGLRU/etc. to decide really.

Yes!

>
>>
>> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?
>
> MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
> (1) didn't use a bitmap, and (2) the "fast" access bit harvesting
> wasn't possible. In this case we simply return 1, which is
> MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
> properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
> it will be 1.

Yes, that will clarify it!

--
Cheers,

David / dhildenb


2024-04-11 00:36:00

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

On Tue, Apr 9, 2024 at 12:35 PM David Hildenbrand <[email protected]> wrote:
>
> On 09.04.24 20:31, James Houghton wrote:
> > Ah, I didn't see this in my inbox, sorry David!
>
> No worries :)
>
> >
> > On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand <david@redhatcom> wrote:
> >>
> >> On 02.04.24 01:29, James Houghton wrote:
> >>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> >>> index f349e08a9dfe..daaa9db625d3 100644
> >>> --- a/include/linux/mmu_notifier.h
> >>> +++ b/include/linux/mmu_notifier.h
> >>> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >>>
> >>> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >>>
> >>> +#define MMU_NOTIFIER_YOUNG (1 << 0)
> >>> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
> >>
> >> Especially this one really deserves some documentation :)
> >
> > Yes, will do. Something like
> >
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
> > bitmap either (1) does not accurately represent the age of the pages
> > (in the case of test_young), or (2) was not able to be used to
> > completely clear the age/access bit (in the case of clear_young).
>
> Make sense. I do wonder what the expected reaction from the caller is :)

In this series the caller doesn't actually care (matching what Yu had
in his v2[1]). test_spte_young() probably ought to return false if it
finds MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (and I'll do this in v4 if
no one objects), but it doesn't have to. The bitmap will never say
that a page is young when it was actually not, only the other way
around.
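
Concretely, the conservative handling could look something like this
(just a sketch; spte_was_young() is a made-up helper name here, and
"ret" is whatever the notifier returned):

static bool spte_was_young(int ret, unsigned long *bitmap, unsigned long idx)
{
        /*
         * The bitmap never claims a page is young when it is not; if it
         * is marked unreliable, conservatively report "not young".
         */
        if (ret & MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE)
                return false;
        return test_bit(idx, bitmap);
}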

>
> >
> >>
> >>> +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)
> >>
> >> And that one as well.
> >
> > Something like
> >
> > Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
> > would have been supported for this address range.
> >
> > The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
> > is able to harvest the access bit "fast" (so for x86, locklessly, and
> > for arm64, with the KVM MMU read lock), "fast" enough that using a
> > bitmap to do look-around is probably a good idea.
>
> Is that really the right way to communicate that ("would have been
> supported") -- wouldn't we want to sense support differently?

What I have now seems fine to me. It would be a little nicer to have a
way to query for bitmap support and make sure that the answer will not
be stale by the time we call the bitmap notifiers, but the complexity
to make that work seems unnecessary for dealing with such an uncommon
scenario.

Maybe the right thing to do is just to have KVM always return the same
answer. So instead of checking if the shadow root is allocated, we
could check something else (I'm not sure what exactly yet though...).

[1]: https://lore.kernel.org/kvmarm/[email protected]/

2024-04-11 18:12:24

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.
>
> Suggested-by: Yu Zhao <[email protected]>
> Signed-off-by: James Houghton <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> 3 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3b58e2306621..c30918d0887e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> */
> #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
>
> +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> +{
> + /*
> + * Indicate that we support bitmap-based aging when using the TDP MMU
> + * and the accessed bit is available in the TDP page tables.
> + *
> + * We have no other preparatory work to do here, so we do not need to
> + * redefine kvm_arch_finish_bitmap_age().
> + */
> + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> + && shadow_accessed_mask;
> +}
> +
> #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..fae1a75750bb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool young = false;
>
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + if (range->lockless) {
> + kvm_age_set_unreliable(range);
> + return false;
> + }

If a VM has TDP MMU enabled, supports A/D bits, and is using nested
virtualization, MGLRU will effectively be blind to all accesses made by
the VM.

kvm_arch_prepare_bitmap_age() will return true indicating that the
bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
return false immediately and indicate the bitmap is unreliable because a
shadow root is allocated. The notifier will then return
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
consumed or used. So I think MGLRU will assume all memory is
unaccessed?

One way to improve the situation would be to re-order the calls so the
TDP MMU walk runs first and return young instead of false. That way
MGLRU at least has visibility into accesses made by L1 (and L2 if EPT
is disabled in L2). But that still means MGLRU is blind to accesses
made by L2.

What about grabbing the mmu_lock if there's a shadow root allocated and
getting rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?

if (kvm_memslots_have_rmaps(kvm)) {
        write_lock(&kvm->mmu_lock);
        young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
        write_unlock(&kvm->mmu_lock);
}

The TDP MMU walk would still be lockless. KVM only has to take the
mmu_lock to collect accesses made by L2.

kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
as well, but that seems relatively simple with the helper functions.
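
As a rough illustration of that last point (the real rmap handlers have
a different signature, and the kvm_gfn_range would still need to be
plumbed into them), the helpers from patch 3 would gate and record the
aging per gfn:

        /* Inside a bitmap-aware kvm_age_rmap()-style handler, per gfn: */
        if (!kvm_gfn_should_age(range, gfn))
                return false;   /* the caller's bitmap says to skip this gfn */
        /* ... test-and-clear the accessed bit on the sptes mapping gfn ... */

        /* And in the kvm_test_age_rmap() counterpart, for each young gfn: */
        kvm_gfn_record_young(range, gfn);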

2024-04-11 18:30:55

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On 2024-04-11 10:08 AM, David Matlack wrote:
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
> >
> > Suggested-by: Yu Zhao <[email protected]>
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
> > arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> > 3 files changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 3b58e2306621..c30918d0887e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > */
> > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> >
> > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > +{
> > + /*
> > + * Indicate that we support bitmap-based aging when using the TDP MMU
> > + * and the accessed bit is available in the TDP page tables.
> > + *
> > + * We have no other preparatory work to do here, so we do not need to
> > + * redefine kvm_arch_finish_bitmap_age().
> > + */
> > + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > + && shadow_accessed_mask;
> > +}
> > +
> > #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 992e651540e8..fae1a75750bb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > - if (kvm_memslots_have_rmaps(kvm))
> > + if (kvm_memslots_have_rmaps(kvm)) {
> > + if (range->lockless) {
> > + kvm_age_set_unreliable(range);
> > + return false;
> > + }
>
> If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> virtualization, MGLRU will effectively be blind to all accesses made by
> the VM.
>
> kvm_arch_prepare_bitmap_age() will return true indicating that the
> bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> return false immediately and indicate the bitmap is unreliable because a
> shadow root is allocate. The notfier will then return
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> consumed or used. So I think MGLRU will assume all memory is
> unaccessed?
>
> One way to improve the situation would be to re-order the TDP MMU
> function first and return young instead of false, so that way MGLRU at
> least has visibility into accesses made by L1 (and L2 if EPT is disable
> in L2). But that still means MGLRU is blind to accesses made by L2.
>
> What about grabbing the mmu_lock if there's a shadow root allocated and
> get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
>
> if (kvm_memslots_have_rmaps(kvm)) {
> write_lock(&kvm->mmu_lock);
> young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> write_unlock(&kvm->mmu_lock);
> }
>
> The TDP MMU walk would still be lockless. KVM only has to take the
> mmu_lock to collect accesses made by L2.
>
> kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> as well, but that seems relatively simple with the helper functions.

Wait, even simpler, just check kvm_memslots_have_rmaps() in
kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
bitmap request.

i.e.

static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct mmu_notifier *mn)
{
        /*
         * Indicate that we support bitmap-based aging when using the TDP MMU
         * and the accessed bit is available in the TDP page tables.
         *
         * We have no other preparatory work to do here, so we do not need to
         * redefine kvm_arch_finish_bitmap_age().
         */
        return IS_ENABLED(CONFIG_X86_64)
               && tdp_mmu_enabled
               && shadow_accessed_mask
               && !kvm_memslots_have_rmaps(kvm);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young = false;

        if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
                young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);

        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

        return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young = false;

        if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
                young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);

        if (tdp_mmu_enabled)
                young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

        return young;
}

Sure this could race with the creation of a shadow root but so can the
non-bitmap code.

2024-04-11 18:31:13

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On Thu, Apr 11, 2024 at 10:28 AM David Matlack <[email protected]> wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao <[email protected]>
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
> > > arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> > > 3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > > */
> > > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > + /*
> > > + * Indicate that we support bitmap-based aging when using the TDP MMU
> > > + * and the accessed bit is available in the TDP page tables.
> > > + *
> > > + * We have no other preparatory work to do here, so we do not need to
> > > + * redefine kvm_arch_finish_bitmap_age().
> > > + */
> > > + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > + && shadow_accessed_mask;
> > > +}
> > > +
> > > #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > bool young = false;
> > >
> > > - if (kvm_memslots_have_rmaps(kvm))
> > > + if (kvm_memslots_have_rmaps(kvm)) {
> > > + if (range->lockless) {
> > > + kvm_age_set_unreliable(range);
> > > + return false;
> > > + }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocate. The notfier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

The control flow of all this, and the naming of the functions and
macros, are confusing overall. args.unreliable vs.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is one example. Also, I now
realize kvm_arch_prepare/finish_bitmap_age() are used even when the
bitmap is _not_ provided, so those names are also misleading.

2024-04-11 19:25:14

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On Thu, Apr 11, 2024 at 11:00 AM David Matlack <[email protected]> wrote:
>
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack <[email protected]> wrote:
> >
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > > >
> > > > Suggested-by: Yu Zhao <[email protected]>
> > > > Signed-off-by: James Houghton <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > > > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
> > > > arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> > > > 3 files changed, 37 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 3b58e2306621..c30918d0887e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > > > */
> > > > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > > >
> > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > > +{
> > > > + /*
> > > > + * Indicate that we support bitmap-based aging when using the TDP MMU
> > > > + * and the accessed bit is available in the TDP page tables.
> > > > + *
> > > > + * We have no other preparatory work to do here, so we do not need to
> > > > + * redefine kvm_arch_finish_bitmap_age().
> > > > + */
> > > > + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > > + && shadow_accessed_mask;
> > > > +}
> > > > +
> > > > #endif /* _ASM_X86_KVM_HOST_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 992e651540e8..fae1a75750bb 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > bool young = false;
> > > >
> > > > - if (kvm_memslots_have_rmaps(kvm))
> > > > + if (kvm_memslots_have_rmaps(kvm)) {
> > > > + if (range->lockless) {
> > > > + kvm_age_set_unreliable(range);
> > > > + return false;
> > > > + }
> > >
> > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > > virtualization, MGLRU will effectively be blind to all accesses made by
> > > the VM.
> > >
> > > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > > return false immediately and indicate the bitmap is unreliable because a
> > > shadow root is allocate. The notfier will then return
> > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
> return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

Nope, wrong again. Just ignore me while I try to figure out how this
actually works :)

2024-04-12 18:41:27

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

On 2024-04-01 11:29 PM, James Houghton wrote:
> This patchset adds a fast path in KVM to test and clear access bits on
> sptes without taking the mmu_lock. It also adds support for using a
> bitmap to (1) test the access bits for many sptes in a single call to
> mmu_notifier_test_young, and to (2) clear the access bits for many ptes
> in a single call to mmu_notifier_clear_young.

How much improvement would we get if we _just_ made test/clear_young
lockless on x86 and held the read-lock on arm64? And then how much
benefit does the bitmap look-around add on top of that?

2024-04-12 18:49:52

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

On 2024-04-01 11:29 PM, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
>
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
>
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
>
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> [1]: https://lore.kernel.org/all/[email protected]/
>
> Suggested-by: Yu Zhao <[email protected]>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
> include/trace/events/kvm.h | 13 +++--
> mm/mmu_notifier.c | 20 +++++---
> virt/kvm/kvm_main.c | 19 ++++++--
> 4 files changed, 123 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>
> #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>
> +#define MMU_NOTIFIER_YOUNG (1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
of test/clear_young(). I would vote to remove it.

> +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)

Instead of MMU_NOTIFIER_YOUNG_FAST, how about
MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. the secondary MMU returns it to
say it recommends doing a look-around and passing in a bitmap?

That would avoid the whole "what does FAST really mean" confusion.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..ca4b1ef9dfc2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long start,
> - unsigned long end)
> + unsigned long end,
> + unsigned long *bitmap)
> {
> trace_kvm_age_hva(start, end);
>
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Wouldn't it be a bug to get a bitmap here? The main MM is only supposed
to pass in a bitmap if the secondary MMU returns
MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.

Put another way, this check seems unnecessary.

> +
> /*
> * Even though we do not flush TLB, this will still adversely
> * affect performance on pre-Haswell Intel EPT, where there is
> @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> struct mm_struct *mm,
> - unsigned long address)
> + unsigned long start,
> + unsigned long end,
> + unsigned long *bitmap)
> {
> - trace_kvm_test_age_hva(address);
> + trace_kvm_test_age_hva(start, end);
> +
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Same thing here.

2024-04-12 20:32:08

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

On 2024-04-01 11:29 PM, James Houghton wrote:
> Add kvm_arch_prepare_bitmap_age() for architectures to indiciate that
> they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
> and that they do not need KVM to grab the MMU lock for writing. This
> function allows architectures to do other locking or other preparatory
> work that it needs.

There's a lot going on here. I know it's extra work, but I think the
series would be easier to understand, and simpler, if you introduced
the KVM support for lockless test/clear_young() first, and then
introduced support for the bitmap-based look-around.

Specifically:

1. Make all test/clear_young() notifiers lockless. i.e. Move the
mmu_lock into the architecture-specific code (kvm_age_gfn() and
kvm_test_age_gfn()).

2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP
MMU.

3. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in
read-mode.

4. Add bitmap-based look-around support to KVM/x86 and KVM/arm64
(probably 2-3 patches).

>
> If an architecture does not implement kvm_arch_prepare_bitmap_age() or
> is unable to do bitmap-based aging at runtime (and marks the bitmap as
> unreliable):
> 1. If a bitmap was provided, we inform the caller that the bitmap is
> unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
> 2. If a bitmap was not provided, fall back to the old logic.
>
> Also add logic for architectures to easily use the provided bitmap if
> they are able. The expectation is that the architecture's implementation
> of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
> kvm_gfn_age() will use kvm_gfn_should_age().
>
> Suggested-by: Yu Zhao <[email protected]>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/kvm_host.h | 60 ++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++++++++-----------
> 2 files changed, 127 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1800d03a06a9..5862fd7b5f9b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
> extern const struct kvm_stats_header kvm_vcpu_stats_header;
> extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>
> +/*
> + * Architectures that support using bitmaps for kvm_age_gfn() and
> + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
> + * and do any work they need to prepare. The subsequent walk will not
> + * automatically grab the KVM MMU lock, so some architectures may opt
> + * to grab it.
> + *
> + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
> + * guaranteed.
> + */
> +#ifndef kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)

I find the name of these architecture callbacks misleading/confusing.
The lockless path is used even when a bitmap is not provided. i.e.
bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age().

> +{
> + return false;
> +}
> +#endif
> +#ifndef kvm_arch_finish_bitmap_age
> +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
> +#endif

kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64
code could acquire/release the mmu_lock in read-mode in
kvm_test_age_gfn() and kvm_age_gfn() right?
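
Something along these lines, just as a sketch (kvm_age_stage2() is a
placeholder for whatever the stage-2 walk ends up being):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
        bool young;

        /* mmu_lock is an rwlock on arm64, so read mode is enough here. */
        read_lock(&kvm->mmu_lock);
        young = kvm_age_stage2(kvm, range);     /* placeholder for the walk */
        read_unlock(&kvm->mmu_lock);

        return young;
}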

> +
> #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> {
> @@ -2076,9 +2096,16 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
> }
>
> +struct test_clear_young_metadata {
> + unsigned long *bitmap;
> + unsigned long bitmap_offset_end;

bitmap_offset_end is unused.

> + unsigned long end;
> + bool unreliable;
> +};
> union kvm_mmu_notifier_arg {
> pte_t pte;
> unsigned long attributes;
> + struct test_clear_young_metadata *metadata;

nit: Maybe s/metadata/test_clear_young/ ?

> };
>
> struct kvm_gfn_range {
> @@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
> gfn_t end;
> union kvm_mmu_notifier_arg arg;
> bool may_block;
> + bool lockless;

Please document this as it's somewhat subtle. A reader might think this
implies the entire operation runs without taking the mmu_lock.

> };
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + args->unreliable = true;
> +}
> +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range *range,
> + gfn_t gfn)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + return hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
> +}
> +static inline void kvm_gfn_record_young(struct kvm_gfn_range *range, gfn_t gfn)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> + if (args->bitmap)
> + __set_bit(kvm_young_bitmap_offset(range, gfn), args->bitmap);
> +}
> +static inline bool kvm_gfn_should_age(struct kvm_gfn_range *range, gfn_t gfn)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> + if (args->bitmap)
> + return test_bit(kvm_young_bitmap_offset(range, gfn),
> + args->bitmap);
> + return true;
> +}
> #endif
>
> #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0545d88c802..7d80321e2ece 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -550,6 +550,7 @@ struct kvm_mmu_notifier_range {
> on_lock_fn_t on_lock;
> bool flush_on_ret;
> bool may_block;
> + bool lockless;
> };
>
> /*
> @@ -598,6 +599,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> struct kvm_memslots *slots;
> int i, idx;
>
> + BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.pte));
> +
> if (WARN_ON_ONCE(range->end <= range->start))
> return r;
>
> @@ -637,15 +640,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
> + gfn_range.lockless = range->lockless;
>
> if (!r.found_memslot) {
> r.found_memslot = true;
> - KVM_MMU_LOCK(kvm);
> - if (!IS_KVM_NULL_FN(range->on_lock))
> - range->on_lock(kvm);
> -
> - if (IS_KVM_NULL_FN(range->handler))
> - break;
> + if (!range->lockless) {
> + KVM_MMU_LOCK(kvm);
> + if (!IS_KVM_NULL_FN(range->on_lock))
> + range->on_lock(kvm);
> +
> + if (IS_KVM_NULL_FN(range->handler))
> + break;
> + }
> }
> r.ret |= range->handler(kvm, &gfn_range);
> }
> @@ -654,7 +660,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> if (range->flush_on_ret && r.ret)
> kvm_flush_remote_tlbs(kvm);
>
> - if (r.found_memslot)
> + if (r.found_memslot && !range->lockless)
> KVM_MMU_UNLOCK(kvm);
>
> srcu_read_unlock(&kvm->srcu, idx);
> @@ -682,19 +688,24 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> return __kvm_handle_hva_range(kvm, &range).ret;
> }
>
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> - unsigned long start,
> - unsigned long end,
> - gfn_handler_t handler)
> +static __always_inline int kvm_handle_hva_range_no_flush(
> + struct mmu_notifier *mn,
> + unsigned long start,
> + unsigned long end,
> + gfn_handler_t handler,
> + union kvm_mmu_notifier_arg arg,
> + bool lockless)
> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> const struct kvm_mmu_notifier_range range = {
> .start = start,
> .end = end,
> .handler = handler,
> + .arg = arg,
> .on_lock = (void *)kvm_null_fn,
> .flush_on_ret = false,
> .may_block = false,
> + .lockless = lockless,
> };
>
> return __kvm_handle_hva_range(kvm, &range).ret;
> @@ -909,15 +920,36 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> kvm_age_gfn);
> }
>
> -static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> - struct mm_struct *mm,
> - unsigned long start,
> - unsigned long end,
> - unsigned long *bitmap)
> +static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end,
> + unsigned long *bitmap,
> + bool clear)

Perhaps pass in the callback (kvm_test_age_gfn/kvm_age_gfn) instead of
true/false to avoid the naked booleans at the callsites?

> {
> - trace_kvm_age_hva(start, end);
> + if (kvm_arch_prepare_bitmap_age(mn)) {
> + struct test_clear_young_metadata args = {
> + .bitmap = bitmap,
> + .end = end,
> + .unreliable = false,
> + };
> + union kvm_mmu_notifier_arg arg = {
> + .metadata = &args
> + };
> + bool young;
> +
> + young = kvm_handle_hva_range_no_flush(
> + mn, start, end,
> + clear ? kvm_age_gfn : kvm_test_age_gfn,
> + arg, true);

I suspect the end result will be cleaner if we make all architectures
lockless, i.e. move the mmu_lock acquire/release into the
architecture-specific code.

This could result in more acquire/release calls (one per memslot that
overlaps the provided range) but that should be a single memslot in the
majority of cases I think?

Then unconditionally pass in the metadata structure.

Then you don't need any special casing for the fast path / bitmap path.
The only thing needed is to figure out whether to return
MMU_NOTIFIER_YOUNG vs MMU_NOTIFIER_YOUNG_LOOK_AROUND and that can be
plumbed via test_clear_young_metadata or by changing gfn_handler_t to
return an int instead of a bool.
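
Very roughly, and only as a sketch (the "lookaround" field is made up
here to stand in for however the look-around recommendation gets
plumbed back, and the handler argument follows the earlier suggestion
of passing the callback instead of a bool):

static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end,
                                             unsigned long *bitmap,
                                             gfn_handler_t handler)
{
        struct test_clear_young_metadata args = {
                .bitmap = bitmap,
                .end = end,
        };
        union kvm_mmu_notifier_arg arg = { .metadata = &args };

        /* Each architecture takes whatever lock it needs in its handler. */
        if (!kvm_handle_hva_range_no_flush(mn, start, end, handler, arg, true))
                return 0;

        /* "lookaround" would be set by the arch handler when it applies. */
        return args.lookaround ? MMU_NOTIFIER_YOUNG_LOOK_AROUND
                               : MMU_NOTIFIER_YOUNG;
}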

> +
> + kvm_arch_finish_bitmap_age(mn);
>
> - /* We don't support bitmaps. Don't test or clear anything. */
> + if (!args.unreliable)
> + return young ? MMU_NOTIFIER_YOUNG_FAST : 0;
> + }
> +
> + /* A bitmap was passed but the architecture doesn't support bitmaps */
> if (bitmap)
> return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
>
> @@ -934,7 +966,21 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> * cadence. If we find this inaccurate, we might come up with a
> * more sophisticated heuristic later.
> */
> - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> + return kvm_handle_hva_range_no_flush(
> + mn, start, end, clear ? kvm_age_gfn : kvm_test_age_gfn,
> + KVM_MMU_NOTIFIER_NO_ARG, false);

Should this return MMU_NOTIFIER_YOUNG explicitly? This code is assuming
MMU_NOTIFIER_YOUNG == (int)true.

> +}
> +
> +static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end,
> + unsigned long *bitmap)
> +{
> + trace_kvm_age_hva(start, end);
> +
> + return kvm_mmu_notifier_test_clear_young(mn, mm, start, end, bitmap,
> + true);
> }
>
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -945,12 +991,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> {
> trace_kvm_test_age_hva(start, end);
>
> - /* We don't support bitmaps. Don't test or clear anything. */
> - if (bitmap)
> - return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
> -
> - return kvm_handle_hva_range_no_flush(mn, start, end,
> - kvm_test_age_gfn);
> + return kvm_mmu_notifier_test_clear_young(mn, mm, start, end, bitmap,
> + false);
> }
>
> static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
> --
> 2.44.0.478.gd926399ef9-goog
>

2024-04-19 20:35:37

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

On Fri, Apr 12, 2024 at 11:45 AM David Matlack <[email protected]> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > The bitmap is provided for secondary MMUs to use if they support it. For
> > test_young(), after it returns, the bitmap represents the pages that
> > were young in the interval [start, end). For clear_young, it represents
> > the pages that we wish the secondary MMU to clear the accessed/young bit
> > for.
> >
> > If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> > should be unchanged except that if young PTEs are found and the
> > architecture supports passing in a bitmap, instead of returning 1,
> > MMU_NOTIFIER_YOUNG_FAST is returned.
> >
> > This allows MGLRU's look-around logic to work faster, resulting in a 4%
> > improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> > to indicate to main mm that doing look-around is likely to be
> > beneficial.
> >
> > If the secondary MMU doesn't support the bitmap, it must return
> > an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> >
> > [1]: https://lore.kernel.org/all/[email protected]/
> >
> > Suggested-by: Yu Zhao <[email protected]>
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/mmu_notifier.h | 93 +++++++++++++++++++++++++++++++++---
> > include/trace/events/kvm.h | 13 +++--
> > mm/mmu_notifier.c | 20 +++++---
> > virt/kvm/kvm_main.c | 19 ++++++--
> > 4 files changed, 123 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> > #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
> of test/clear_young(). I would vote to remove it.

Works for me.

>
> > +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2)
>
> Instead of MMU_NOTIFIER_YOUNG_FAST, how about
> MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning
> saying it recommends doing a look-around and passing in a bitmap?
>
> That would avoid the whole "what does FAST really mean" confusion.

I think MMU_NOTIFIER_YOUNG_LOOK_AROUND is fine.

>
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index fb49c2a60200..ca4b1ef9dfc2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> > static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > struct mm_struct *mm,
> > unsigned long start,
> > - unsigned long end)
> > + unsigned long end,
> > + unsigned long *bitmap)
> > {
> > trace_kvm_age_hva(start, end);
> >
> > + /* We don't support bitmaps. Don't test or clear anything. */
> > + if (bitmap)
> > + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
>
> Wouldn't it be a bug to get a bitmap here? The main MM is only suppost
> to pass in a bitmap if the secondary MMU returns
> MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.
>
> Put another way, this check seems unneccessary.
>
> > +
> > /*
> > * Even though we do not flush TLB, this will still adversely
> > * affect performance on pre-Haswell Intel EPT, where there is
> > @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> >
> > static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> > struct mm_struct *mm,
> > - unsigned long address)
> > + unsigned long start,
> > + unsigned long end,
> > + unsigned long *bitmap)
> > {
> > - trace_kvm_test_age_hva(address);
> > + trace_kvm_test_age_hva(start, end);
> > +
> > + /* We don't support bitmaps. Don't test or clear anything. */
> > + if (bitmap)
> > + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
>
> Same thing here.

I will remove them; they are indeed unnecessary, as it is just dead code.

2024-04-19 20:44:02

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

On Fri, Apr 12, 2024 at 1:28 PM David Matlack <[email protected]> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Add kvm_arch_prepare_bitmap_age() for architectures to indiciate that
> > they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
> > and that they do not need KVM to grab the MMU lock for writing. This
> > function allows architectures to do other locking or other preparatory
> > work that it needs.
>
> There's a lot going on here. I know it's extra work but I think the
> series would be easier to understand and simplify if you introduced the
> KVM support for lockless test/clear_young() first, and then introduce
> support for the bitmap-based look-around.

Yeah I think this is the right thing to do. Thanks.

>
> Specifically:
>
> 1. Make all test/clear_young() notifiers lockless. i.e. Move the
> mmu_lock into the architecture-specific code (kvm_age_gfn() and
> kvm_test_age_gfn()).
>
> 2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP
> MMU.
>
> 4. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in
> read-mode.
>
> 5. Add bitmap-based look-around support to KVM/x86 and KVM/arm64
> (probably 2-3 patches).

This all sounds good to me. Thanks for laying it out for me -- this
should be a lot simpler.

>
> >
> > If an architecture does not implement kvm_arch_prepare_bitmap_age() or
> > is unable to do bitmap-based aging at runtime (and marks the bitmap as
> > unreliable):
> > 1. If a bitmap was provided, we inform the caller that the bitmap is
> > unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
> > 2. If a bitmap was not provided, fall back to the old logic.
> >
> > Also add logic for architectures to easily use the provided bitmap if
> > they are able. The expectation is that the architecture's implementation
> > of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
> > kvm_gfn_age() will use kvm_gfn_should_age().
> >
> > Suggested-by: Yu Zhao <[email protected]>
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/kvm_host.h | 60 ++++++++++++++++++++++++++
> > virt/kvm/kvm_main.c | 92 +++++++++++++++++++++++++++++-----------
> > 2 files changed, 127 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 1800d03a06a9..5862fd7b5f9b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
> > extern const struct kvm_stats_header kvm_vcpu_stats_header;
> > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
> >
> > +/*
> > + * Architectures that support using bitmaps for kvm_age_gfn() and
> > + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
> > + * and do any work they need to prepare. The subsequent walk will not
> > + * automatically grab the KVM MMU lock, so some architectures may opt
> > + * to grab it.
> > + *
> > + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
> > + * guaranteed.
> > + */
> > +#ifndef kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
>
> I find the name of these architecture callbacks misleading/confusing.
> The lockless path is used even when a bitmap is not provided. i.e.
> bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age().

Yes. I am really terrible at picking names.... I'm happy to just nix
this, following your other suggestions.

>
> > +{
> > + return false;
> > +}
> > +#endif
> > +#ifndef kvm_arch_finish_bitmap_age
> > +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
> > +#endif
>
> kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64
> code could acquire/release the mmu_lock in read-mode in
> kvm_test_age_gfn() and kvm_age_gfn() right?

Yes you're right, except that the way it is now, we only lock/unlock
once for the notifier instead of once for each overlapping memslot,
but that's not an issue, as you mention below.

>
> > +
> > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
> > {
> > @@ -2076,9 +2096,16 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
> > }
> >
> > +struct test_clear_young_metadata {
> > + unsigned long *bitmap;
> > + unsigned long bitmap_offset_end;
>
> bitmap_offset_end is unused.

Indeed, sorry about that.

>
> > + unsigned long end;
> > + bool unreliable;
> > +};
> > union kvm_mmu_notifier_arg {
> > pte_t pte;
> > unsigned long attributes;
> > + struct test_clear_young_metadata *metadata;
>
> nit: Maybe s/metadata/test_clear_young/ ?

Yes, that's better.

>
> > };
> >
> > struct kvm_gfn_range {
> > @@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
> > gfn_t end;
> > union kvm_mmu_notifier_arg arg;
> > bool may_block;
> > + bool lockless;
>
> Please document this as it's somewhat subtle. A reader might think this
> implies the entire operation runs without taking the mmu_lock.

Will do, and I'll improve the comments for the other "lockless"
variables. (In fact, it might be better to rename/adjust this one to
"mmu_lock_taken" instead -- it's a little more obvious what that
means.)

>
> > };
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
> > +{
> > + struct test_clear_young_metadata *args = range->arg.metadata;
> > +
> > + args->unreliable = true;
> > +}
> > +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range *range,
> > + gfn_t gfn)
> > +{
> > + struct test_clear_young_metadata *args = range->arg.metadata;
> > +
> > + return hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
> > +}
> > +static inline void kvm_gfn_record_young(struct kvm_gfn_range *range, gfn_t gfn)
> > +{
> > + struct test_clear_young_metadata *args = range->arg.metadata;
> > +
> > + WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> > + if (args->bitmap)
> > + __set_bit(kvm_young_bitmap_offset(range, gfn), args->bitmap);
> > +}
> > +static inline bool kvm_gfn_should_age(struct kvm_gfn_range *range, gfn_t gfn)
> > +{
> > + struct test_clear_young_metadata *args = range->arg.metadata;
> > +
> > + WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
> > + if (args->bitmap)
> > + return test_bit(kvm_young_bitmap_offset(range, gfn),
> > + args->bitmap);
> > + return true;
> > +}
> > #endif
> >
> > #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d0545d88c802..7d80321e2ece 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -550,6 +550,7 @@ struct kvm_mmu_notifier_range {
> > on_lock_fn_t on_lock;
> > bool flush_on_ret;
> > bool may_block;
> > + bool lockless;
> > };
> >
> > /*
> > @@ -598,6 +599,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > struct kvm_memslots *slots;
> > int i, idx;
> >
> > + BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.pte));
> > +
> > if (WARN_ON_ONCE(range->end <= range->start))
> > return r;
> >
> > @@ -637,15 +640,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
> > gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> > gfn_range.slot = slot;
> > + gfn_range.lockless = range->lockless;
> >
> > if (!r.found_memslot) {
> > r.found_memslot = true;
> > - KVM_MMU_LOCK(kvm);
> > - if (!IS_KVM_NULL_FN(range->on_lock))
> > - range->on_lock(kvm);
> > -
> > - if (IS_KVM_NULL_FN(range->handler))
> > - break;
> > + if (!range->lockless) {
> > + KVM_MMU_LOCK(kvm);
> > + if (!IS_KVM_NULL_FN(range->on_lock))
> > + range->on_lock(kvm);
> > +
> > + if (IS_KVM_NULL_FN(range->handler))
> > + break;
> > + }
> > }
> > r.ret |= range->handler(kvm, &gfn_range);
> > }
> > @@ -654,7 +660,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > if (range->flush_on_ret && r.ret)
> > kvm_flush_remote_tlbs(kvm);
> >
> > - if (r.found_memslot)
> > + if (r.found_memslot && !range->lockless)
> > KVM_MMU_UNLOCK(kvm);
> >
> > srcu_read_unlock(&kvm->srcu, idx);
> > @@ -682,19 +688,24 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> > return __kvm_handle_hva_range(kvm, &range).ret;
> > }
> >
> > -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> > - unsigned long start,
> > - unsigned long end,
> > - gfn_handler_t handler)
> > +static __always_inline int kvm_handle_hva_range_no_flush(
> > + struct mmu_notifier *mn,
> > + unsigned long start,
> > + unsigned long end,
> > + gfn_handler_t handler,
> > + union kvm_mmu_notifier_arg arg,
> > + bool lockless)
> > {
> > struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > const struct kvm_mmu_notifier_range range = {
> > .start = start,
> > .end = end,
> > .handler = handler,
> > + .arg = arg,
> > .on_lock = (void *)kvm_null_fn,
> > .flush_on_ret = false,
> > .may_block = false,
> > + .lockless = lockless,
> > };
> >
> > return __kvm_handle_hva_range(kvm, &range).ret;
> > @@ -909,15 +920,36 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> > kvm_age_gfn);
> > }
> >
> > -static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > - struct mm_struct *mm,
> > - unsigned long start,
> > - unsigned long end,
> > - unsigned long *bitmap)
> > +static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn,
> > + struct mm_struct *mm,
> > + unsigned long start,
> > + unsigned long end,
> > + unsigned long *bitmap,
> > + bool clear)
>
> Perhaps pass in the callback (kvm_test_age_gfn/kvm_age_gfn) instead of
> true/false to avoid the naked booleans at the callsites?

Will do. Thank you.

>
> > {
> > - trace_kvm_age_hva(start, end);
> > + if (kvm_arch_prepare_bitmap_age(mn)) {
> > + struct test_clear_young_metadata args = {
> > + .bitmap = bitmap,
> > + .end = end,
> > + .unreliable = false,
> > + };
> > + union kvm_mmu_notifier_arg arg = {
> > + .metadata = &args
> > + };
> > + bool young;
> > +
> > + young = kvm_handle_hva_range_no_flush(
> > + mn, start, end,
> > + clear ? kvm_age_gfn : kvm_test_age_gfn,
> > + arg, true);
>
> I suspect the end result will be cleaner if we make all architectures
> lockless, i.e. move the mmu_lock acquire/release into the
> architecture-specific code.
>
> This could result in more acquire/release calls (one per memslot that
> overlaps the provided range), but the range should only overlap a single
> memslot in the majority of cases, I think?
>
> Then unconditionally pass in the metadata structure.
>
> Then you don't need any special casing for the fast path / bitmap path.
> The only thing needed is to figure out whether to return
> MMU_NOTIFIER_YOUNG vs MMU_NOTIFIER_YOUNG_LOOK_AROUND and that can be
> plumbed via test_clear_young_metadata or by changing gfn_handler_t to
> return an int instead of a bool.

Yes I think this simplification is a great idea. I agree that usually
there will only be one memslot that overlaps a virtual address range
in practice (MIN_LRU_BATCH is BITS_PER_LONG), so the theoretical
additional locking/unlocking shouldn't be an issue.
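
Concretely, I imagine the combined result looking roughly like this
(sketch only; exactly how the return value gets plumbed back still needs
to be worked out):

static int kvm_mmu_notifier_test_clear_young(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end,
					     unsigned long *bitmap,
					     gfn_handler_t handler)
{
	struct test_clear_young_metadata args = {
		.bitmap	= bitmap,
		.end	= end,
	};
	union kvm_mmu_notifier_arg arg = {
		.metadata = &args,
	};
	bool young;

	trace_kvm_age_hva(start, end);

	/*
	 * Never take mmu_lock here; each architecture takes (or skips)
	 * whatever lock it needs, per memslot, inside the handler.
	 */
	young = kvm_handle_hva_range_no_flush(mn, start, end, handler, arg,
					      /* lockless = */ true);

	/*
	 * Deciding between MMU_NOTIFIER_YOUNG and MMU_NOTIFIER_YOUNG_FAST
	 * would be plumbed via test_clear_young_metadata (or by making the
	 * handler return an int), as you suggest.
	 */
	return young ? MMU_NOTIFIER_YOUNG : 0;
}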

>
> > +
> > + kvm_arch_finish_bitmap_age(mn);
> >
> > - /* We don't support bitmaps. Don't test or clear anything. */
> > + if (!args.unreliable)
> > + return young ? MMU_NOTIFIER_YOUNG_FAST : 0;
> > + }
> > +
> > + /* A bitmap was passed but the architecture doesn't support bitmaps */
> > if (bitmap)
> > return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;
> >
> > @@ -934,7 +966,21 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> > * cadence. If we find this inaccurate, we might come up with a
> > * more sophisticated heuristic later.
> > */
> > - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> > + return kvm_handle_hva_range_no_flush(
> > + mn, start, end, clear ? kvm_age_gfn : kvm_test_age_gfn,
> > + KVM_MMU_NOTIFIER_NO_ARG, false);
>
> Should this return MMU_NOTIFIER_YOUNG explicitly? This code is assuming
> MMU_NOTIFIER_YOUNG == (int)true.

Yes.

Thank you for all the feedback!

2024-04-19 20:48:42

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On Thu, Apr 11, 2024 at 10:28 AM David Matlack <[email protected]> wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fall back to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao <[email protected]>
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 14 ++++++++++++++
> > > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++--
> > > arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++++-
> > > 3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
> > > */
> > > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > + /*
> > > + * Indicate that we support bitmap-based aging when using the TDP MMU
> > > + * and the accessed bit is available in the TDP page tables.
> > > + *
> > > + * We have no other preparatory work to do here, so we do not need to
> > > + * redefine kvm_arch_finish_bitmap_age().
> > > + */
> > > + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > + && shadow_accessed_mask;
> > > +}
> > > +
> > > #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > {
> > > bool young = false;
> > >
> > > - if (kvm_memslots_have_rmaps(kvm))
> > > + if (kvm_memslots_have_rmaps(kvm)) {
> > > + if (range->lockless) {
> > > + kvm_age_set_unreliable(range);
> > > + return false;
> > > + }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true, indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocated. The notifier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> >
> > Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> > consumed or used. So I think MGLRU will assume all memory is
> > unaccessed?
> >
> > One way to improve the situation would be to reorder so the TDP MMU
> > function runs first and return young instead of false, so that MGLRU at
> > least has visibility into accesses made by L1 (and L2 if EPT is disabled
> > in L2). But that still means MGLRU is blind to accesses made by L2.
> >
> > What about grabbing the mmu_lock if there's a shadow root allocated and
> > getting rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> >
> > if (kvm_memslots_have_rmaps(kvm)) {
> > write_lock(&kvm->mmu_lock);
> > young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > The TDP MMU walk would still be lockless. KVM only has to take the
> > mmu_lock to collect accesses made by L2.
> >
> > kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> > as well, but that seems relatively simple with the helper functions.
>
> Wait, even simpler, just check kvm_memslots_have_rmaps() in
> kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
> bitmap request.
>
> i.e.
>
> static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct mmu_notifier *mn)
> {
> /*
> * Indicate that we support bitmap-based aging when using the TDP MMU
> * and the accessed bit is available in the TDP page tables.
> *
> * We have no other preparatory work to do here, so we do not need to
> * redefine kvm_arch_finish_bitmap_age().
> */
> return IS_ENABLED(CONFIG_X86_64)
> && tdp_mmu_enabled
> && shadow_accessed_mask
> && !kvm_memslots_have_rmaps(kvm);
> }
>
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool young = false;
>
> if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>
> if (tdp_mmu_enabled)
> young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
>
> return young;
> }
>
> bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> bool young = false;
>
> if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
>
> if (tdp_mmu_enabled)
> young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
>
> return young;


Yeah I think this is the right thing to do. Given your other
suggestions (on patch 3), I think this will look something like this
-- let me know if I've misunderstood something:

bool check_rmap = !bitmap && kvm_memslots_have_rmaps(kvm);

if (check_rmap)
KVM_MMU_LOCK(kvm);

rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?

if (check_rmap)
kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)

if (tdp_mmu_enabled)
kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe

rcu_read_unlock();
if (check_rmap)
KVM_MMU_UNLOCK(kvm);

> }
>
> Sure this could race with the creation of a shadow root but so can the
> non-bitmap code.

2024-04-19 20:55:35

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On Fri, Apr 12, 2024 at 1:44 PM David Matlack <[email protected]> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fall back to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
>
> I think this patch will trigger a lockdep assert in
>
> kvm_tdp_mmu_age_gfn_range
> kvm_tdp_mmu_handle_gfn
> for_each_tdp_mmu_root
> __for_each_tdp_mmu_root
> kvm_lockdep_assert_mmu_lock_held
>
> ... because it walks tdp_mmu_roots without holding mmu_lock.

Indeed, thanks. I'll make sure to build with CONFIG_LOCKDEP for
future versions and check for errors.

>
> Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
> similar here and also update the comment above tdp_mmu_roots describing
> how tdp_mmu_roots can be read locklessly.

I'll add the macro / function to do the lockless walk of tdp_mmu_roots
and explain why it is safe. Thanks for pointing out this big mistake.
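
Something along these lines is what I have in mind (sketch only --
ignoring as_id filtering, and the real version needs to follow Yu's
patch and document why walking the list under RCU is safe against
roots being freed):

/* Hypothetical lockless counterpart to __for_each_tdp_mmu_root(). */
#define for_each_tdp_mmu_root_rcu(_kvm, _root)				\
	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)

static bool kvm_tdp_mmu_handle_gfn_lockless(struct kvm *kvm,
					    struct kvm_gfn_range *range,
					    tdp_handler_t handler)
{
	struct kvm_mmu_page *root;
	struct tdp_iter iter;
	bool ret = false;

	/* Caller is expected to hold rcu_read_lock(). */
	for_each_tdp_mmu_root_rcu(kvm, root) {
		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
			ret |= handler(kvm, &iter, range);
	}

	return ret;
}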

> [1] https://lore.kernel.org/kvmarm/[email protected]/

2024-04-19 20:58:30

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

On Fri, Apr 12, 2024 at 11:41 AM David Matlack <[email protected]> wrote:
>
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > This patchset adds a fast path in KVM to test and clear access bits on
> > sptes without taking the mmu_lock. It also adds support for using a
> > bitmap to (1) test the access bits for many sptes in a single call to
> > mmu_notifier_test_young, and to (2) clear the access bits for many ptes
> > in a single call to mmu_notifier_clear_young.
>
> How much improvement would we get if we _just_ made test/clear_young
> lockless on x86 and hold the read-lock on arm64? And then how much
> benefit does the bitmap look-around add on top of that?

I don't have these results right now. For the next version I will (1)
separate the series into the locking change and the bitmap change, and
I will (2) have performance data for each change separately. It is
conceivable that the bitmap change should just be considered as a
completely separate patchset.

2024-04-19 21:07:15

by David Matlack

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On 2024-04-19 01:47 PM, James Houghton wrote:
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack <[email protected]> wrote:
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> >
> > return young;
>
>
> Yeah I think this is the right thing to do. Given your other
> suggestions (on patch 3), I think this will look something like this
> -- let me know if I've misunderstood something:
>
> bool check_rmap = !bitmap && kvm_memslots_have_rmaps(kvm);
>
> if (check_rmap)
> KVM_MMU_LOCK(kvm);
>
> rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
>
> if (check_rmap)
> kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
>
> if (tdp_mmu_enabled)
> kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
>
> rcu_read_unlock();
> if (check_rmap)
> KVM_MMU_UNLOCK(kvm);

I was thinking a little differently. If you follow my suggestion to first
make the TDP MMU aging lockless, you'll end up with something like this
prior to adding bitmap support (note: the comments are just for
demonstrative purposes):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

/* Shadow MMU aging holds write-lock. */
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

/* TDP MMU aging is lockless. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

Then when you add bitmap support it would look something like this:

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
unsigned long *bitmap = range->arg.metadata->bitmap;
bool young = false;

/* Shadow MMU aging holds write-lock and does not support bitmap. */
if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

/* TDP MMU aging is lockless and supports bitmap. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().
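
i.e. something like this (illustrative; assumes kvm_tdp_mmu_handle_gfn()
has been made safe to call without holding mmu_lock):

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young;

	rcu_read_lock();
	young = kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
	rcu_read_unlock();

	return young;
}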

That brings up a question I've been wondering about. If KVM only
advertises support for the bitmap lookaround when shadow roots are not
allocated, does that mean MGLRU will be blind to accesses made by L2
when nested virtualization is enabled? And does that mean the Linux MM
will think all L2 memory is cold (i.e. good candidate for swapping)
because it isn't seeing accesses made by L2?

2024-04-19 22:24:05

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

On Fri, Apr 19, 2024 at 01:57:03PM -0700, James Houghton wrote:
> On Fri, Apr 12, 2024 at 11:41 AM David Matlack <[email protected]> wrote:
> >
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > This patchset adds a fast path in KVM to test and clear access bits on
> > > sptes without taking the mmu_lock. It also adds support for using a
> > > bitmap to (1) test the access bits for many sptes in a single call to
> > > mmu_notifier_test_young, and to (2) clear the access bits for many ptes
> > > in a single call to mmu_notifier_clear_young.
> >
> > How much improvement would we get if we _just_ made test/clear_young
> > lockless on x86 and hold the read-lock on arm64? And then how much
> > benefit does the bitmap look-around add on top of that?

Thanks David for providing the suggestion.

> I don't have these results right now. For the next version I will (1)
> separate the series into the locking change and the bitmap change, and
> I will (2) have performance data for each change separately. It is
> conceivable that the bitmap change should just be considered as a
> completely separate patchset.

That'd be great. Having the performance numbers will make it even more
compelling, but I'd be tempted to go for the lock improvement just
because it doesn't add any new complexity and leverages existing patterns
in the architectures that people seem to want improvements for.

The bitmap interface, OTOH, is rather complex. At least the current
implementation breaks some of the isolation we have between the MMU code
and the page table walker library on arm64, which I'm not ecstatic about.
It _could_ be justified by a massive performance uplift over locking, but
it'd have to be a sizable win.

--
Thanks,
Oliver

2024-04-21 00:20:24

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

On Fri, Apr 19, 2024 at 3:48 PM James Houghton <[email protected]> wrote:
>
> On Fri, Apr 19, 2024 at 2:07 PM David Matlack <[email protected]> wrote:
> >
> > On 2024-04-19 01:47 PM, James Houghton wrote:
> > > On Thu, Apr 11, 2024 at 10:28 AM David Matlack <[email protected]> wrote:
> > > > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > bool young = false;
> > > >
> > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > >
> > > > if (tdp_mmu_enabled)
> > > > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> > > >
> > > > return young;
> > > > }
> > > >
> > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > > > {
> > > > bool young = false;
> > > >
> > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > > > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> > > >
> > > > if (tdp_mmu_enabled)
> > > > young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> > > >
> > > > return young;
> > >
> > >
> > > Yeah I think this is the right thing to do. Given your other
> > > suggestions (on patch 3), I think this will look something like this
> > > -- let me know if I've misunderstood something:
> > >
> > > bool check_rmap = !bitmap && kvm_memslots_have_rmaps(kvm);
> > >
> > > if (check_rmap)
> > > KVM_MMU_LOCK(kvm);
> > >
> > > rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> > >
> > > if (check_rmap)
> > > kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> > >
> > > if (tdp_mmu_enabled)
> > > kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> > >
> > > rcu_read_unlock();
> > > if (check_rmap)
> > > KVM_MMU_UNLOCK(kvm);
> >
> > I was thinking a little differently. If you follow my suggestion to first
> > make the TDP MMU aging lockless, you'll end up with something like this
> > prior to adding bitmap support (note: the comments are just for
> > demonstrative purposes):
> >
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > /* Shadow MMU aging holds write-lock. */
> > if (kvm_memslots_have_rmaps(kvm)) {
> > write_lock(&kvm->mmu_lock);
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > /* TDP MMU aging is lockless. */
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > Then when you add bitmap support it would look something like this:
> >
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > unsigned long *bitmap = range->arg.metadata->bitmap;
> > bool young = false;
> >
> > /* Shadow MMU aging holds write-lock and does not support bitmap. */
> > if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
> > write_lock(&kvm->mmu_lock);
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > /* TDP MMU aging is lockless and supports bitmap. */
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().
>
> Oh yes this is a lot better. I hope I would have seen this when it
> came time to actually update this patch. Thanks.
>
> >
> > That brings up a question I've been wondering about. If KVM only
> > advertises support for the bitmap lookaround when shadow roots are not
> > allocated, does that mean MGLRU will be blind to accesses made by L2
> > when nested virtualization is enabled? And does that mean the Linux MM
> > will think all L2 memory is cold (i.e. good candidate for swapping)
> > because it isn't seeing accesses made by L2?
>
> Yes, I think so (for both questions). That's better than KVM not
> participating in MGLRU aging at all, which is the case today (IIUC --
> also ignoring the case where KVM accesses guest memory directly). We
> could have MGLRU always invoke the mmu notifiers, but frequently
> taking the MMU lock for writing might be worse than evicting when we
> shouldn't. Maybe Yu tried this at some point, but I can't find any
> results for this.

No, in this case only the fast path (page table scanning) is disabled.
MGLRU still sees the A-bit from L2 using the rmap, i.e., the slow path
calling folio_check_references().