2024-03-20 00:50:47

by Sean Christopherson

Subject: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
i.e. when creating mappings for KVM guests, instead of when zapping or
modifying SPTEs, e.g. when dropping mappings.

The motivation is twofold:

1. Marking folios dirty and accessed when zapping can be extremely
expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
dirty and accessed for all 512*512 SPTEs.

2. x86 diverges from literally every other architecture, which updates
folios when mappings are created. AFAIK, x86 is unique in that it's
the only KVM arch that prefetches PTEs, so it's not quite an apples-
to-apples comparison, but I don't see any reason for the dirty logic
in particular to be different.

I tagged this RFC as it is barely tested, and because I'm not 100% positive
there isn't some weird edge case I'm missing, which is why I Cc'd David H.
and Matthew.

Note, I'm going to be offline from ~now until April 1st. I rushed this out
as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
E.g. if KVM stops marking pages dirty and accessed everywhere, adding
SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
seems like a poor tradeoff (medium complexity and annoying to maintain, for
not much benefit).

Regarding David S.'s series, I wouldn't be at all opposed to going even
further and having x86 follow all architectures by marking pages accessed
_only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
as appropriate, and thus dedup a fair bit of arch code.

Lastly, regarding bullet #1 above, we might be able to eke out better
performance by batching calls to folio_mark_{accessed,dirty}() on the backend,
e.g. when zapping SPTEs that KVM knows are covered by a single hugepage. But
I think in practice any benefit would be marginal, as it would be quite odd
for KVM to fault-in a 1GiB hugepage at 4KiB granularity.
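
As a purely illustrative sketch of what such backend batching could look
like (not part of this series; kvm_zap_leaf_spte() is a made-up
placeholder, not an existing helper):

        /*
         * Hypothetical batching sketch: update the backing folio once per
         * hugepage-sized run of SPTEs instead of once per 4KiB SPTE.
         * kvm_zap_leaf_spte() is a placeholder, not an existing helper.
         */
        static void zap_sptes_for_huge_folio(struct kvm *kvm,
                                             kvm_pfn_t base_pfn,
                                             unsigned long nr_sptes,
                                             bool dirty)
        {
                struct folio *folio = pfn_folio(base_pfn);
                unsigned long i;

                for (i = 0; i < nr_sptes; i++)
                        kvm_zap_leaf_spte(kvm, base_pfn + i);

                /* One folio update for the entire run, not one per SPTE. */
                if (dirty)
                        folio_mark_dirty(folio);
                folio_mark_accessed(folio);
        }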

And _if_ we wanted to optimize that case, I suspect we'd be better off
pre-mapping all SPTEs for a pfn that is mapped at a larger granularity in
the primary MMU. E.g. if KVM is dirty logging a 1GiB HugeTLB page, install
MMU-writable 4KiB SPTEs for the entire 1GiB region when any pfn is accessed.

P.S. Matthew ruined the "nothing but Davids!" Cc list.

[*] https://lore.kernel.org/all/[email protected]

Sean Christopherson (4):
KVM: x86/mmu: Skip the "try unsync" path iff the old SPTE was a leaf
SPTE
KVM: x86/mmu: Mark folio dirty when creating SPTE, not when
zapping/modifying
KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs
KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit

Documentation/virt/kvm/locking.rst | 76 +++++++++++++++---------------
arch/x86/kvm/mmu/mmu.c | 60 +++++------------------
arch/x86/kvm/mmu/paging_tmpl.h | 7 ++-
arch/x86/kvm/mmu/spte.c | 27 ++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 19 ++------
5 files changed, 78 insertions(+), 111 deletions(-)


base-commit: 964d0c614c7f71917305a5afdca9178fe8231434
--
2.44.0.291.gc1ea87d7ee-goog



2024-03-20 00:51:06

by Sean Christopherson

Subject: [RFC PATCH 1/4] KVM: x86/mmu: Skip the "try unsync" path iff the old SPTE was a leaf SPTE

Apply make_spte()'s optimization to skip trying to unsync shadow pages if
and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs
are always writable, i.e. could trigger a false positive and incorrectly
lead to KVM creating a SPTE without write-protecting or marking shadow
pages unsync.

This bug only affects the TDP MMU, as the shadow MMU only overwrites a
shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be
unsync). Specifically, mmu_set_spte() drops any non-leaf SPTEs *before*
calling make_spte(), whereas the TDP MMU can do a direct replacement of a
page table with the leaf SPTE.

Opportunistically update the comment to explain why skipping the unsync
stuff is safe, as opposed to simply saying "it's someone else's problem".

Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..b4c1119cc48b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -206,12 +206,20 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;

/*
- * Optimization: for pte sync, if spte was writable the hash
- * lookup is unnecessary (and expensive). Write protection
- * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
- * Same reasoning can be applied to dirty page accounting.
+ * When overwriting an existing leaf SPTE, and the old SPTE was
+ * writable, skip trying to unsync shadow pages as any relevant
+ * shadow pages must already be unsync, i.e. the hash lookup is
+ * unnecessary (and expensive).
+ *
+ * The same reasoning applies to dirty page/folio accounting;
+ * KVM will mark the folio dirty using the old SPTE, thus
+ * there's no need to immediately mark the new SPTE as dirty.
+ *
+ * Note, both cases rely on KVM not changing PFNs without first
+ * zapping the old SPTE, which is guaranteed by both the shadow
+ * MMU and the TDP MMU.
*/
- if (is_writable_pte(old_spte))
+ if (is_last_spte(old_spte, level) && is_writable_pte(old_spte))
goto out;

/*
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-20 00:51:27

by Sean Christopherson

Subject: [RFC PATCH 2/4] KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying

Mark pages/folios dirty when creating SPTEs to map PFNs into the guest,
not when zapping or modifying SPTEs, as marking folios dirty when zapping
or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping
collapsible SPTEs to reconstitute a hugepage after disabling dirty logging,
KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are
guaranteed to be in a single folio (the SPTE couldn't potentially be huge
if that weren't the case). The problem only becomes worse for 1GiB
HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.

Marking a folio dirty when mapping is functionally safe as KVM drops all
relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures
that the guest can't dirty a folio after access has been removed.

And because KVM already marks folios dirty when zapping/modifying SPTEs
for KVM reasons, i.e. not in response to an mmu_notifier invalidation,
there is no danger of "prematurely" marking a folio dirty. E.g. if a
filesystem cleans a folio without first removing write access, then there
already exist races where KVM could mark a folio dirty before remote TLBs
are flushed, i.e. before guest writes are guaranteed to stop. Furthermore,
x86 is literally the only architecture that marks folios dirty on the
backend; every other KVM architecture marks folios dirty at map time.

x86's unique behavior likely stems from the fact that x86's MMU predates
mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking
pages dirty when zapping SPTEs was logical, and perhaps even necessary, as
KVM held references to pages, i.e. kept a page's refcount elevated while
the page was mapped into the guest. At the time, KVM's rmap_remove()
simply did:

        if (is_writeble_pte(*spte))
                kvm_release_pfn_dirty(pfn);
        else
                kvm_release_pfn_clean(pfn);

i.e. dropped the refcount and marked the page dirty at the same time.
After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU:
dont hold pagecount reference for mapped sptes pages") removed the
refcount logic, but kept the dirty logic, i.e. converted the above to:

        if (is_writeble_pte(*spte))
                kvm_release_pfn_dirty(pfn);

And for KVM x86, that's essentially how things have stayed over the last
~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at
zap/modification time, e.g. the behavior was blindly carried forward to
the TDP MMU.

Practically speaking, the only downside to marking a folio dirty during
mapping is that KVM could trigger writeback of memory that was never
actually written. Except that can't actually happen if KVM marks folios
dirty if and only if a writable SPTE is created (as done here), because
KVM always marks writable SPTEs as dirty during make_spte(). See commit
9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa
2015.

Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a
guest PTE is dirty and writable, KVM will create a writable SPTE, but then
mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as
it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.

To keep things simple, mark the folio dirty before access tracking comes
into play, as an access-tracked SPTE can be restored in the fast page
fault path, i.e. without holding mmu_lock. While writing SPTEs and
accessing memslots outside of mmu_lock is safe, marking a folio dirty is
not. E.g. if the fast path gets interrupted _just_ after setting a SPTE,
the primary MMU could theoretically invalidate and free a folio before KVM
marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to
an IPI, the TDP MMU only guarantees the page tables themselves won't be
freed (via RCU).

Opportunistically update a few stale comments.

Cc: David Hildenbrand <[email protected]>
Cc: David Matlack <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 29 ++++-------------------------
arch/x86/kvm/mmu/paging_tmpl.h | 7 +++----
arch/x86/kvm/mmu/spte.c | 13 ++++++++++---
arch/x86/kvm/mmu/tdp_mmu.c | 12 ------------
4 files changed, 17 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..bd2240b94ff6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -544,10 +544,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}

- if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
+ if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
- }

return flush;
}
@@ -590,9 +588,6 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
if (is_accessed_spte(old_spte))
kvm_set_pfn_accessed(pfn);

- if (is_dirty_spte(old_spte))
- kvm_set_pfn_dirty(pfn);
-
return old_spte;
}

@@ -623,13 +618,6 @@ static bool mmu_spte_age(u64 *sptep)
clear_bit((ffs(shadow_accessed_mask) - 1),
(unsigned long *)sptep);
} else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
- if (is_writable_pte(spte))
- kvm_set_pfn_dirty(spte_to_pfn(spte));
-
spte = mark_spte_for_access_track(spte);
mmu_spte_update_no_track(sptep, spte);
}
@@ -1263,16 +1251,6 @@ static bool spte_clear_dirty(u64 *sptep)
return mmu_spte_update(sptep, spte);
}

-static bool spte_wrprot_for_clear_dirty(u64 *sptep)
-{
- bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
- (unsigned long *)sptep);
- if (was_writable && !spte_ad_enabled(*sptep))
- kvm_set_pfn_dirty(spte_to_pfn(*sptep));
-
- return was_writable;
-}
-
/*
* Gets the GFN ready for another round of dirty logging by clearing the
* - D bit on ad-enabled SPTEs, and
@@ -1288,7 +1266,8 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,

for_each_rmap_spte(rmap_head, &iter, sptep)
if (spte_ad_need_write_protect(*sptep))
- flush |= spte_wrprot_for_clear_dirty(sptep);
+ flush |= test_and_clear_bit(PT_WRITABLE_SHIFT,
+ (unsigned long *)sptep);
else
flush |= spte_clear_dirty(sptep);

@@ -3392,7 +3371,7 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
* harm. This also avoids the TLB flush needed after setting dirty bit
* so non-PML cases won't be impacted.
*
- * Compare with set_spte where instead shadow_dirty_mask is set.
+ * Compare with make_spte() where instead shadow_dirty_mask is set.
*/
if (!try_cmpxchg64(sptep, &old_spte, new_spte))
return false;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4d4e98fe4f35..ec24a6679153 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -554,7 +554,6 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
return false;

mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL);
- kvm_release_pfn_clean(pfn);
return true;
}

@@ -891,9 +890,9 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,

/*
* Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is
- * safe because:
- * - The spte has a reference to the struct page, so the pfn for a given gfn
- * can't change unless all sptes pointing to it are nuked first.
+ * safe because SPTEs are protected by mmu_notifiers and memslot generations, so
+ * the pfn for a given gfn can't change unless all SPTEs pointing to the gfn are
+ * nuked first.
*
* Returns
* < 0: failed to sync spte
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b4c1119cc48b..490966bc893c 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -212,8 +212,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
* unnecessary (and expensive).
*
* The same reasoning applies to dirty page/folio accounting;
- * KVM will mark the folio dirty using the old SPTE, thus
- * there's no need to immediately mark the new SPTE as dirty.
+ * KVM marked the folio dirty when the old SPTE was created,
+ * thus there's no need to mark the folio dirty again.
*
* Note, both cases rely on KVM not changing PFNs without first
* zapping the old SPTE, which is guaranteed by both the shadow
@@ -235,8 +235,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
}
}

- if (pte_access & ACC_WRITE_MASK)
+ if (pte_access & ACC_WRITE_MASK) {
spte |= spte_shadow_dirty_mask(spte);
+ kvm_set_pfn_dirty(pfn);
+ }

out:
if (prefetch)
@@ -246,6 +248,11 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
"spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));

+ /*
+ * Mark the memslot dirty *after* modifying it for access tracking.
+ * Unlike folios, memslots can be safely marked dirty out of mmu_lock,
+ * i.e. in the fast page fault handler.
+ */
if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
/* Enforced by kvm_mmu_hugepage_adjust. */
WARN_ON_ONCE(level > PG_LEVEL_4K);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..5866a664f46e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -511,10 +511,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (is_leaf != was_leaf)
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);

- if (was_leaf && is_dirty_spte(old_spte) &&
- (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-
/*
* Recursively handle child PTs if the change removed a subtree from
* the paging structure. Note the WARN on the PFN changing without the
@@ -1224,13 +1220,6 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
iter->level);
new_spte = iter->old_spte & ~shadow_accessed_mask;
} else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
- if (is_writable_pte(iter->old_spte))
- kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
new_spte = mark_spte_for_access_track(iter->old_spte);
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
iter->old_spte, new_spte,
@@ -1652,7 +1641,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
iter.old_spte,
iter.old_spte & ~dbit);
- kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
}

rcu_read_unlock();
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-20 00:51:39

by Sean Christopherson

Subject: [RFC PATCH 3/4] KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs

Mark folios as accessed only when zapping leaf SPTEs, which is a rough
heuristic for "only in response to an mmu_notifier invalidation". Page
aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to
be precise for correctness, and re-marking folios as accessed when zapping
entire roots or when zapping collapsible SPTEs is expensive and adds very
little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value. Similarly, because KVM
marks folios accessed when creating SPTEs, marking all folios accessed
when userspace happens to delete a memslot doesn't add value. The folio
was marked accessed when the old SPTE was created, and will be marked
accessed yet again if a vCPU accesses the pfn again after reloading a new
root. Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disabled dirty logging is a side effect
of KVM behavior, not a deliberate goal.

Mark folios accessed when the primary MMU might be invalidating mappings,
i.e. instead of completely dropping calls to kvm_set_pfn_accessed(), as
such zaps are not KVM-initiated and thus might actually be related to
page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Signed-off-by: Sean Christopherson <[email protected]>
---
Documentation/virt/kvm/locking.rst | 76 +++++++++++++++---------------
arch/x86/kvm/mmu/mmu.c | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 7 ++-
3 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..8b3bb9fe60bf 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -138,49 +138,51 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn.

2) Dirty bit tracking

-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set since the
Accessed bit and Dirty bit can not be lost.

But it is not true after fast page fault since the spte can be marked
writable between reading spte and updating spte. Like below case:

-+------------------------------------------------------------------------+
-| At the beginning:: |
-| |
-| spte.W = 0 |
-| spte.Accessed = 1 |
-+------------------------------------+-----------------------------------+
-| CPU 0: | CPU 1: |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits():: | |
-| | |
-| old_spte = *spte; | |
-| | |
-| | |
-| /* 'if' condition is satisfied. */| |
-| if (old_spte.Accessed == 1 && | |
-| old_spte.W == 0) | |
-| spte = 0ull; | |
-+------------------------------------+-----------------------------------+
-| | on fast page fault path:: |
-| | |
-| | spte.W = 1 |
-| | |
-| | memory write on the spte:: |
-| | |
-| | spte.Dirty = 1 |
-+------------------------------------+-----------------------------------+
-| :: | |
-| | |
-| else | |
-| old_spte = xchg(spte, 0ull) | |
-| if (old_spte.Accessed == 1) | |
-| kvm_set_pfn_accessed(spte.pfn);| |
-| if (old_spte.Dirty == 1) | |
-| kvm_set_pfn_dirty(spte.pfn); | |
-| OOPS!!! | |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning:: |
+| |
+| spte.W = 0 |
+| spte.Accessed = 1 |
++-------------------------------------+-----------------------------------+
+| CPU 0: | CPU 1: |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update():: | |
+| | |
+| old_spte = *spte; | |
+| | |
+| | |
+| /* 'if' condition is satisfied. */ | |
+| if (old_spte.Accessed == 1 && | |
+| old_spte.W == 0) | |
+| spte = new_spte; | |
++-------------------------------------+-----------------------------------+
+| | on fast page fault path:: |
+| | |
+| | spte.W = 1 |
+| | |
+| | memory write on the spte:: |
+| | |
+| | spte.Dirty = 1 |
++-------------------------------------+-----------------------------------+
+| :: | |
+| | |
+| else | |
+| old_spte = xchg(spte, new_spte);| |
+| if (old_spte.Accessed && | |
+| !new_spte.Accessed) | |
+| flush = true; | |
+| if (old_spte.Dirty && | |
+| !new_spte.Dirty) | |
+| flush = true; | |
+| OOPS!!! | |
++-------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2240b94ff6..0a6c6619d213 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -539,10 +539,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* to guarantee consistency between TLB and page tables.
*/

- if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+ if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
- }

if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
flush = true;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5866a664f46e..340d5af454c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -520,10 +520,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (was_present && !was_leaf &&
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-
- if (was_leaf && is_accessed_spte(old_spte) &&
- (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}

/*
@@ -841,6 +837,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,

tdp_mmu_iter_set_spte(kvm, &iter, 0);

+ if (is_accessed_spte(iter.old_spte))
+ kvm_set_pfn_accessed(spte_to_pfn(iter.old_spte));
+
/*
* Zappings SPTEs in invalid roots doesn't require a TLB flush,
* see kvm_tdp_mmu_zap_invalidated_roots() for details.
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-20 00:52:00

by Sean Christopherson

Subject: [RFC PATCH 4/4] KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit

Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as
access tracking tolerates false negatives, as evidenced by the
mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB
flush.

In practice, this is very nearly a nop. spte_write_protect() and
spte_clear_dirty() never clear the Accessed bit. make_spte() always
sets the Accessed bit for !prefetch scenarios. FNAME(sync_spte) only sets
the SPTE if the protection bits are changing, i.e. if a flush will be
needed regardless of the Accessed bits. And FNAME(pte_prefetch) sets the
SPTE if and only if the old SPTE is !PRESENT.

That leaves kvm_arch_async_page_ready() as the one path that will generate
a !ACCESSED SPTE *and* overwrite a PRESENT SPTE. And that's very arguably
a bug, as clobbering a valid SPTE in that case is nonsensical.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0a6c6619d213..77d1072b130d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -515,37 +515,24 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
* TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
* spte, even though the writable spte might be cached on a CPU's TLB.
*
+ * Remote TLBs also need to be flushed if the Dirty bit is cleared, as false
+ * negatives are not acceptable, e.g. if KVM is using D-bit based PML on VMX.
+ *
+ * Don't flush if the Accessed bit is cleared, as access tracking tolerates
+ * false negatives, and the one path that does care about TLB flushes,
+ * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
+ *
* Returns true if the TLB needs to be flushed
*/
static bool mmu_spte_update(u64 *sptep, u64 new_spte)
{
- bool flush = false;
u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);

if (!is_shadow_present_pte(old_spte))
return false;

- /*
- * For the spte updated out of mmu-lock is safe, since
- * we always atomically update it, see the comments in
- * spte_has_volatile_bits().
- */
- if (is_mmu_writable_spte(old_spte) &&
- !is_writable_pte(new_spte))
- flush = true;
-
- /*
- * Flush TLB when accessed/dirty states are changed in the page tables,
- * to guarantee consistency between TLB and page tables.
- */
-
- if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
- flush = true;
-
- if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
- flush = true;
-
- return flush;
+ return (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
+ (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte));
}

/*
--
2.44.0.291.gc1ea87d7ee-goog


2024-03-20 12:57:01

by David Hildenbrand

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 20.03.24 01:50, Sean Christopherson wrote:
> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> i.e. when creating mappings for KVM guests, instead of when zapping or
> modifying SPTEs, e.g. when dropping mappings.
>
> The motivation is twofold:
>
> 1. Marking folios dirty and accessed when zapping can be extremely
> expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
> 512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
> dirty and accessed for all 512*512 SPTEs.
>
> 2. x86 diverges from literally every other architecture, which updates
> folios when mappings are created. AFAIK, x86 is unique in that it's
> the only KVM arch that prefetches PTEs, so it's not quite an apples-
> to-apples comparison, but I don't see any reason for the dirty logic
> in particular to be different.
>

Already sorry for the lengthy reply.


On "ordinary" process page tables on x86, it behaves as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
the folio (folio_mark_accessed()). The latter is done conditionally
only (vma_has_recency()); see the rough sketch after 4) below.

BUT, when zapping an anon folio, we don't do that, because there zapping
implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
primarily only move the dirty PTE bit to the folio.
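
(A rough sketch of the zap-time transfer in 3), paraphrased from
zap_present_folio_ptes() rather than quoted verbatim; batching, rmap and
TLB handling omitted:)

        static void sketch_zap_transfer(struct vm_area_struct *vma,
                                        struct folio *folio, pte_t pte)
        {
                /* Anon folios are "gone for good" on zap, nothing to transfer. */
                if (folio_test_anon(folio))
                        return;

                if (pte_dirty(pte))
                        folio_mark_dirty(folio);
                if (pte_young(pte) && vma_has_recency(vma))
                        folio_mark_accessed(folio);
        }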


GUP is different, because the PTEs might change after we pinned the page
and wrote to it. We don't modify the PTEs and expect the GUP user to do
the right thing (set dirty/accessed). For example,
unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
where the PTE might long be gone.

So GUP does not really behave like HW access.
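
(Again only a sketch of the pattern, with made-up caller-supplied
arguments; the point is that dirty/accessed state flows via the GUP user,
not via the PTEs:)

        /* Pin, write, and transfer dirty state at unpin time. */
        static int sketch_gup_write(unsigned long uaddr, int nr,
                                    struct page **pages)
        {
                int pinned = pin_user_pages_fast(uaddr, nr,
                                                 FOLL_WRITE | FOLL_LONGTERM,
                                                 pages);
                if (pinned <= 0)
                        return -EFAULT;

                /* ... write to the pinned pages; PTEs may change or go away ... */

                /* The folio is marked dirty here, long after the PTE might be gone. */
                unpin_user_pages_dirty_lock(pages, pinned, true);
                return 0;
        }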


Secondary page tables are different to ordinary GUP, and KVM ends up
using GUP to some degree to simulate HW access; regarding NUMA-hinting,
KVM already revealed to be very different to all other GUP users. [1]

And I recall that at some point I raised that we might want to have a
dedicated interface for this "mmu-notifier" based page table
synchronization mechanism.

But KVM ends up setting folio dirty/access flags itself, like other GUP
users. I do wonder if secondary page tables should be messing with folio
flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process
page tables using MMU notifiers: when we write-protect/unmap a PTE, we
write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
differently.


I once had the following idea, but I am not sure about all implications,
just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a
write access, we would expect the original PTE to become dirty, not the
mapped folio.

1) When KVM wants to map a page into the secondary page table, we
require the PTE to be young (like a HW access). The SPTE can remain
old.

2) When KVM wants to map a page writable into the secondary page table,
we require the PTE to be dirty (like a HW access). The SPTE can
remain old.

3) When core MM clears the PTE dirty/young bit, we notify the secondary
page table to adjust: for example, if the dirty bit gets cleared,
the page cannot be writable in the secondary MMU.

4) GUP-fast cannot set the pte dirty/young, so we would fall back to slow
GUP, where we hold the PTL, and simply modify the PTE to have the
accessed/dirty bit set.

5) Prefetching would similarly be limited to that (only prefetch if PTE
is already dirty etc.).

6) Dirty/accessed bits no longer have to be synced from the secondary
page table to the process page table. Because an SPTE being dirty
implies that the PTE is dirty.


One tricky bit, why ordinary GUP modifies the folio and not the PTE, is
concurrent HW access. For example, when we want to mark a PTE accessed,
it could happen that HW concurrently tries marking the PTE dirty. We
must not lose that update, so we have to guarantee an atomic update
(maybe avoidable in some cases).

What would be the implications? We'd leave setting folio flags to the MM
core. That also implies that if you shut down a VM and zap all anon
folios, you wouldn't have to mark any folio dirty: the pte is dirty, and
MM core can decide to ignore that flag since it will discard the page
either way.

Downsides? Likely many I have not yet thought about (TLB flushes etc).
Just mentioning it because in context of [1] I was wondering if
something that uses MMU notifiers should really be messing with
dirty/young flags :)


> I tagged this RFC as it is barely tested, and because I'm not 100% positive
> there isn't some weird edge case I'm missing, which is why I Cc'd David H.
> and Matthew.

We'd be in trouble if someone were to detect that all PTEs are clean and
thus clear the folio dirty flag (for example, after writeback). Then,
we would write using the SPTE and the folio+PTE would be clean. If we
then evict the "clean" folio that is actually dirty, we would be in trouble.

Well, we would set the SPTE dirty flag I guess. But I cannot immediately
tell if that one would be synced back to the folio? Would we have a
mechanism in place to prevent that?

>
> Note, I'm going to be offline from ~now until April 1st. I rushed this out
> as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
> E.g. if KVM stops marking pages dirty and accessed everywhere, adding
> SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
> seems like a poor tradeoff (medium complexity and annoying to maintain, for
> not much benefit).
>
> Regarding David S.'s series, I wouldn't be at all opposed to going even
> further and having x86 follow all architectures by marking pages accessed
> _only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
> as appropriate, and thus dedup a fair bit of arch code.

FOLL_TOUCH is weird (excluding weird devmap stuff):

1) For PTEs (follow_page_pte), we set the page dirty and accessed, and
do not modify the PTE. For THP (follow_trans_huge_pmd), we set the
PMD young/dirty and don't mess with the folio.

2) FOLL_TOUCH is not implemented for hugetlb.

3) FOLL_TOUCH is not implemented for GUP-fast.

I'd leave that alone :)


[1]
https://lore.kernel.org/lkml/[email protected]/T/#u
--
Cheers,

David / dhildenb


2024-04-02 17:42:37

by David Matlack

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <[email protected]> wrote:
>
> On 20.03.24 01:50, Sean Christopherson wrote:
> > Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> > i.e. when creating mappings for KVM guests, instead of when zapping or
> > modifying SPTEs, e.g. when dropping mappings.
> >
> > The motivation is twofold:
> >
> > 1. Marking folios dirty and accessed when zapping can be extremely
> > expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
> > 512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
> > dirty and accessed for all 512*512 SPTEs.
> >
> > 2. x86 diverges from literally every other architecture, which updates
> > folios when mappings are created. AFAIK, x86 is unique in that it's
> > the only KVM arch that prefetches PTEs, so it's not quite an apples-
> > to-apples comparison, but I don't see any reason for the dirty logic
> > in particular to be different.
> >
>
> Already sorry for the lengthy reply.
>
>
> On "ordinary" process page tables on x86, it behaves as follows:
>
> 1) A page might be mapped writable but the PTE might not be dirty. Once
> written to, HW will set the PTE dirty bit.
>
> 2) A page might be mapped but the PTE might not be young. Once accessed,
> HW will set the PTE young bit.
>
> 3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
> PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
> the folio (folio_mark_accessed()). The latter is done conditionally
> only (vma_has_recency()).
>
> BUT, when zapping an anon folio, we don't do that, because there zapping
> implies "gone for good" and not "content must go to a file".
>
> 4) When temporarily unmapping a folio for migration/swapout, we
> primarily only move the dirty PTE bit to the folio.
>
>
> GUP is different, because the PTEs might change after we pinned the page
> and wrote to it. We don't modify the PTEs and expect the GUP user to do
> the right thing (set dirty/accessed). For example,
> unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
> where the PTE might long be gone.
>
> So GUP does not really behave like HW access.
>
>
> Secondary page tables are different to ordinary GUP, and KVM ends up
> using GUP to some degree to simulate HW access; regarding NUMA-hinting,
> KVM already revealed to be very different to all other GUP users. [1]
>
> And I recall that at some point I raised that we might want to have a
> dedicate interface for these "mmu-notifier" based page table
> synchonization mechanism.
>
> But KVM ends up setting folio dirty/access flags itself, like other GUP
> users. I do wonder if secondary page tables should be messing with folio
> flags *at all*, and if there would be ways to to it differently using PTEs.
>
> We make sure to synchronize the secondary page tables to the process
> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> different.

Accessed bits have the test/clear young MMU-notifiers. But I agree
there aren't any notifiers for dirty tracking.

Are there any cases where the primary MMU transfers the PTE dirty bit
to the folio _other_ than zapping (which already has an MMU-notifier
to KVM)? If not, then there might not be any reason to add a new
notifier. Instead, the contract should just be that secondary MMUs must
also transfer their dirty bits to folios in sync with (or before) the
primary MMU zapping its PTE.

>
>
> I once had the following idea, but I am not sure about all implications,
> just wanted to raise it because it matches the topic here:
>
> Secondary page tables kind-of behave like "HW" access. If there is a
> write access, we would expect the original PTE to become dirty, not the
> mapped folio.

Propagating SPTE dirty bits to folios indirectly via the primary MMU
PTEs won't work for guest_memfd where there is no primary MMU PTE. In
order to avoid having two different ways to propagate SPTE dirty bits,
KVM should probably be responsible for updating the folio directly.

2024-04-02 18:31:53

by David Hildenbrand

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 02.04.24 19:38, David Matlack wrote:
> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 20.03.24 01:50, Sean Christopherson wrote:
>>> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
>>> i.e. when creating mappings for KVM guests, instead of when zapping or
>>> modifying SPTEs, e.g. when dropping mappings.
>>>
>>> The motivation is twofold:
>>>
>>> 1. Marking folios dirty and accessed when zapping can be extremely
>>> expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>>> 512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>>> dirty and accessed for all 512*512 SPTEs.
>>>
>>> 2. x86 diverges from literally every other architecture, which updates
>>> folios when mappings are created. AFAIK, x86 is unique in that it's
>>> the only KVM arch that prefetches PTEs, so it's not quite an apples-
>>> to-apples comparison, but I don't see any reason for the dirty logic
>>> in particular to be different.
>>>
>>
>> Already sorry for the lengthy reply.
>>
>>
>> On "ordinary" process page tables on x86, it behaves as follows:
>>
>> 1) A page might be mapped writable but the PTE might not be dirty. Once
>> written to, HW will set the PTE dirty bit.
>>
>> 2) A page might be mapped but the PTE might not be young. Once accessed,
>> HW will set the PTE young bit.
>>
>> 3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
>> PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
>> the folio (folio_mark_accessed()). The latter is done conditionally
>> only (vma_has_recency()).
>>
>> BUT, when zapping an anon folio, we don't do that, because there zapping
>> implies "gone for good" and not "content must go to a file".
>>
>> 4) When temporarily unmapping a folio for migration/swapout, we
>> primarily only move the dirty PTE bit to the folio.
>>
>>
>> GUP is different, because the PTEs might change after we pinned the page
>> and wrote to it. We don't modify the PTEs and expect the GUP user to do
>> the right thing (set dirty/accessed). For example,
>> unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
>> where the PTE might long be gone.
>>
>> So GUP does not really behave like HW access.
>>
>>
>> Secondary page tables are different to ordinary GUP, and KVM ends up
>> using GUP to some degree to simulate HW access; regarding NUMA-hinting,
>> KVM already revealed to be very different to all other GUP users. [1]
>>
>> And I recall that at some point I raised that we might want to have a
>> dedicate interface for these "mmu-notifier" based page table
>> synchonization mechanism.
>>
>> But KVM ends up setting folio dirty/access flags itself, like other GUP
>> users. I do wonder if secondary page tables should be messing with folio
>> flags *at all*, and if there would be ways to to it differently using PTEs.
>>
>> We make sure to synchronize the secondary page tables to the process
>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>> different.
>
> Accessed bits have the test/clear young MMU-notifiers. But I agree
> there aren't any notifiers for dirty tracking.
>

Yes, and I am questioning if the "test" part should exist -- or if
having a spte in the secondary MMU should require the access bit to be
set (derived from the primary MMU). (again, my explanation about fake HW
page table walkers)

There might be a good reason to do it like that nowadays, so I'm only
raising it as something I was wondering. Likely, frequent clearing of
the access bit would result in many PTEs in the secondary MMU getting
invalidated, requiring a new GUP-fast lookup where we would set the
access bit in the primary MMU PTE. But I'm not an expert on the
implications with MMU notifiers and access bit clearing.

> Are there any cases where the primary MMU transfers the PTE dirty bit
> to the folio _other_ than zapping (which already has an MMU-notifier
> to KVM). If not then there might not be any reason to add a new
> notifier. Instead the contract should just be that secondary MMUs must
> also transfer their dirty bits to folios in sync (or before) the
> primary MMU zaps its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use
MMU notifier, because they either clear the PTE or also remove write
permissions.

But there are madvise_free_pte_range() and
clean_record_shared_mapping_range()...->clean_record_pte(), which might
only clear the dirty bit without clearing/changing permissions and
consequently not call MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.
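
(Roughly, and paraphrasing from memory rather than quoting
madvise_free_pte_range() exactly, the MADV_FREE path does something like
the following for each PTE, i.e. the PTE stays present and writable while
the dirty/young bits are cleared:)

        if (pte_young(ptent) || pte_dirty(ptent)) {
                /* Clear the PTE, then re-install it old+clean. */
                ptent = ptep_get_and_clear(mm, addr, pte);
                ptent = pte_mkold(ptent);
                ptent = pte_mkclean(ptent);
                set_pte_at(mm, addr, pte, ptent);
                tlb_remove_tlb_entry(tlb, pte, addr);
        }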

So I am questioning whether having a writable PTE in the secondary MMU
with a clean PTE in the primary MMU should be valid to exist. It can
exist today, and I am not sure if that's the right approach.

>
>>
>>
>> I once had the following idea, but I am not sure about all implications,
>> just wanted to raise it because it matches the topic here:
>>
>> Secondary page tables kind-of behave like "HW" access. If there is a
>> write access, we would expect the original PTE to become dirty, not the
>> mapped folio.
>
> Propagating SPTE dirty bits to folios indirectly via the primary MMU
> PTEs won't work for guest_memfd where there is no primary MMU PTE. In
> order to avoid having two different ways to propagate SPTE dirty bits,
> KVM should probably be responsible for updating the folio directly.
>

But who really cares about access/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core-MM, so different
rules are to be expected. This discussion is all about integration with
core-MM that relies on correct dirty bits, which does not really apply
to guest_memfd.

--
Cheers,

David / dhildenb


2024-04-03 22:19:49

by Sean Christopherson

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Wed, Apr 03, 2024, David Hildenbrand wrote:
> On 03.04.24 02:17, Sean Christopherson wrote:
> > On Tue, Apr 02, 2024, David Hildenbrand wrote:
> > Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> > also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> > that it needs to play nice with a GUP'd page being marked dirty before the
> > reference is dropped.
>
> >
> > * Must be careful with the order of the tests. When someone has
> > * a ref to the folio, it may be possible that they dirty it then
> > * drop the reference. So if the dirty flag is tested before the
> > * refcount here, then the following race may occur:
> >
> > So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> > code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> > while holding a reference or (b) marks the folio dirty before returning from its
> > mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> > mappings in response to mmu_notifier_invalidate_range_start().
> >
>
> Yes, I agree that it should work in the context of vmscan. But (b) is
> certainly a bit harder to swallow than "ordinary" (a) :)

Heh, all the more reason to switch KVM x86 from (b) => (a).

> As raised, if having a writable SPTE would imply having a writable+dirty
> PTE, then KVM MMU code wouldn't have to worry about syncing any dirty bits
> ever back to core-mm, so patch #2 would not be required. ... well, it would
> be replaces by an MMU notifier that notifies about clearing the PTE dirty
> bit :)

Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
to be invalidated before consuming dirty status. Isn't the end result essentially
a sane FOLL_TOUCH?

> ... because, then, there is also a subtle difference between
> folio_set_dirty() and folio_mark_dirty(), and I am still confused about the
> difference and not competent enough to explain the difference ... and KVM
> always does the former, while zapping code of pagecache folios does the
> latter ... hm

Ugh, just when I thought I finally had my head wrapped around this.

> Related note: IIRC, we usually expect most anon folios to be dirty.
>
> kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
> SetPageDirty()->folio_set_dirty(). Doing a test-before-set might frequently
> avoid atomic ops.

Noted, definitely worth poking at.

2024-04-04 15:44:46

by David Hildenbrand

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 04.04.24 00:19, Sean Christopherson wrote:
> On Wed, Apr 03, 2024, David Hildenbrand wrote:
>> On 03.04.24 02:17, Sean Christopherson wrote:
>>> On Tue, Apr 02, 2024, David Hildenbrand wrote:
>>> Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
>>> also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
>>> that it needs to play nice with a GUP'd page being marked dirty before the
>>> reference is dropped.
>>
>>>
>>> * Must be careful with the order of the tests. When someone has
>>> * a ref to the folio, it may be possible that they dirty it then
>>> * drop the reference. So if the dirty flag is tested before the
>>> * refcount here, then the following race may occur:
>>>
>>> So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
>>> code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
>>> while holding a reference or (b) marks the folio dirty before returning from its
>>> mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
>>> mappings in response to mmu_notifier_invalidate_range_start().
>>>
>>
>> Yes, I agree that it should work in the context of vmscan. But (b) is
>> certainly a bit harder to swallow than "ordinary" (a) :)
>
> Heh, all the more reason to switch KVM x86 from (b) => (a).

:)

>
>> As raised, if having a writable SPTE would imply having a writable+dirty
>> PTE, then KVM MMU code wouldn't have to worry about syncing any dirty bits
>> ever back to core-mm, so patch #2 would not be required. ... well, it would
>> be replaces by an MMU notifier that notifies about clearing the PTE dirty
>> bit :)
>
> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> to be invalidated before consuming dirty status. Isn't the end result essentially
> a sane FOLL_TOUCH?

Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.

Having something that makes sure the writable PTE/PMD is dirty (or
alternatively sets it dirty), paired with MMU notifiers notifying on any
mkclean, would be one option that would leave how to handle dirtying of
folios completely to the core. It would behave just like a
CPU writing to the page table, which would set the pte dirty.

Of course, if frequent clearing of the dirty PTE/PMD bit would be a
problem (like we discussed for the accessed bit), that would not be an
option. But from what I recall, only clearing the PTE/PMD dirty bit is
rather rare.

I think this GUP / GUP-user setting of folio dirty bits is a bit of a
mess, but it seems to be working so far, so ... who am I to judge :)

--
Cheers,

David / dhildenb


2024-04-04 17:31:18

by Sean Christopherson

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 00:19, Sean Christopherson wrote:
> > Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> > to be invalidated before consuming dirty status. Isn't the end result essentially
> > a sane FOLL_TOUCH?
>
> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>
> Having something that makes sure the writable PTE/PMD is dirty (or
> alternatively sets it dirty), paired with MMU notifiers notifying on any
> mkclean would be one option that would leave handling how to handle dirtying
> of folios completely to the core. It would behave just like a CPU writing to
> the page table, which would set the pte dirty.
>
> Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
> (like we discussed for the accessed bit), that would not be an option. But
> from what I recall, only clearing the PTE/PMD dirty bit is rather rare.

And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
it would probably be a net positive, e.g. the notification could more precisely
say that SPTEs need to be read-only, not blasted away completely.

2024-04-04 18:24:01

by David Hildenbrand

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 04.04.24 19:31, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 00:19, Sean Christopherson wrote:
>>> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
>>> to be invalidated before consuming dirty status. Isn't the end result essentially
>>> a sane FOLL_TOUCH?
>>
>> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>>
>> Having something that makes sure the writable PTE/PMD is dirty (or
>> alternatively sets it dirty), paired with MMU notifiers notifying on any
>> mkclean would be one option that would leave handling how to handle dirtying
>> of folios completely to the core. It would behave just like a CPU writing to
>> the page table, which would set the pte dirty.
>>
>> Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
>> (like we discussed for the accessed bit), that would not be an option. But
>> from what I recall, only clearing the PTE/PMD dirty bit is rather rare.
>
> And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
> it would probably be a net positive, e.g. the notification could more precisely
> say that SPTEs need to be read-only, not blasted away completely.
>

As discussed, I think at least madvise_free_pte_range() wouldn't do
that. Notifiers would only get called later when actually zapping the
folio. So at least for some time, you would have the PTE not dirty, but
the SPTE writable or even dirty. So you'd have to set the page dirty
when zapping the SPTE ... and IMHO that is what we should maybe try to
avoid :)

--
Cheers,

David / dhildenb


2024-04-04 22:03:14

by Sean Christopherson

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 19:31, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Hildenbrand wrote:
> > > On 04.04.24 00:19, Sean Christopherson wrote:
> > > > Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> > > > to be invalidated before consuming dirty status. Isn't the end result essentially
> > > > a sane FOLL_TOUCH?
> > >
> > > Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
> > >
> > > Having something that makes sure the writable PTE/PMD is dirty (or
> > > alternatively sets it dirty), paired with MMU notifiers notifying on any
> > > mkclean would be one option that would leave handling how to handle dirtying
> > > of folios completely to the core. It would behave just like a CPU writing to
> > > the page table, which would set the pte dirty.
> > >
> > > Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
> > > (like we discussed for the accessed bit), that would not be an option. But
> > > from what I recall, only clearing the PTE/PMD dirty bit is rather rare.
> >
> > And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
> > it would probably be a net positive, e.g. the notification could more precisely
> > say that SPTEs need to be read-only, not blasted away completely.
>
> As discussed, I think at least madvise_free_pte_range() wouldn't do that.

I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
would do in this future world, or what madvise_free_pte_range() does today? Because
today, unless I'm really misreading the code, secondary MMUs are invalidated before
the dirty bit is cleared.

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
                                range.start, range.end);

        lru_add_drain();
        tlb_gather_mmu(&tlb, mm);
        update_hiwater_rss(mm);

        mmu_notifier_invalidate_range_start(&range);
        tlb_start_vma(&tlb, vma);
        walk_page_range(vma->vm_mm, range.start, range.end,
                        &madvise_free_walk_ops, &tlb);
        tlb_end_vma(&tlb, vma);
        mmu_notifier_invalidate_range_end(&range);

KVM (or any other secondary MMU) can re-establish its mapping with W=1,D=0 in the
PTE, but the costly invalidation (zap+flush+fault) still happens.

> Notifiers would only get called later when actually zapping the folio.

And in case we're talking about a hypothetical future, I was thinking the above
could do MMU_NOTIFY_WRITE_PROTECT instead of MMU_NOTIFY_CLEAR.

> So at least for some time, you would have the PTE not dirty, but the SPTE
> writable or even dirty. So you'd have to set the page dirty when zapping the
> SPTE ... and IMHO that is what we should maybe try to avoid :)

2024-04-03 21:44:58

by David Hildenbrand

Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 03.04.24 02:17, Sean Christopherson wrote:
> On Tue, Apr 02, 2024, David Hildenbrand wrote:
>> On 02.04.24 19:38, David Matlack wrote:
>>> On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <[email protected]> wrote:
>>>> Secondary page tables are different to ordinary GUP, and KVM ends up
>>>> using GUP to some degree to simulate HW access; regarding NUMA-hinting,
>>>> KVM already revealed to be very different to all other GUP users. [1]
>>>>
>>>> And I recall that at some point I raised that we might want to have a
>>>> dedicate interface for these "mmu-notifier" based page table
>>>> synchonization mechanism.
>>>>
>>>> But KVM ends up setting folio dirty/access flags itself, like other GUP
>>>> users. I do wonder if secondary page tables should be messing with folio
>>>> flags *at all*, and if there would be ways to to it differently using PTEs.
>>>>
>>>> We make sure to synchronize the secondary page tables to the process
>>>> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
>>>> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
>>>> different.
>>>
>>> Accessed bits have the test/clear young MMU-notifiers. But I agree
>>> there aren't any notifiers for dirty tracking.
>>
>> Yes, and I am questioning if the "test" part should exist -- or if having a
>> spte in the secondary MMU should require the access bit to be set (derived
>> from the primary MMU). (again, my explanation about fake HW page table
>> walkers)
>>
>> There might be a good reason to do it like that nowadays, so I'm only
>> raising it as something I was wondering. Likely, frequent clearing of the
>> access bit would result in many PTEs in the secondary MMU getting
>> invalidated, requiring a new GUP-fast lookup where we would set the access
>> bit in the primary MMU PTE. But I'm not an expert on the implications with
>> MMU notifiers and access bit clearing.
>
> Ya. KVM already does something similar this for dirty logging, where SPTEs are
> write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
> lot of code to mitigates the resulting pain, e.g. has a lockless fast-path to
> make the SPTE writable and propagate the dirty status to bitmaps, and userspace
> is heavily/carefully optimized to balance harvesting/clearing dirty status against
> guest activity.
>
> In general, assumptions that core MM makes about the cost of PTE modifications
> tend to fall apart for KVM. Much of that comes down to mmu_notifiers invalidations
> being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
> any prot changes (NUMA balancing in particular) can have disastrous impact on KVM
> guests because those changes invalidate secondary MMUs at PMD granularity, and
> effective force KVM to do a remote TLB for every PMD that is affected, whereas
> the primary is able to batch TLB flushes until the entire VMA is processed.
>
> So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
> comically slow for KVM guests without a crazy amount of optimization.

Right, so access bits are certainly special. Fortunately, they are not
as important to get right as dirty bits.

>
>>> Are there any cases where the primary MMU transfers the PTE dirty bit
>>> to the folio _other_ than zapping (which already has an MMU-notifier
>>> to KVM)? If not, then there might not be any reason to add a new
>>> notifier. Instead the contract should just be that secondary MMUs must
>>> also transfer their dirty bits to folios in sync with (or before) the
>>> primary MMU zapping its PTE.
>>
>> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
>> notifier, because they either clear the PTE or also remove write
>> permissions.
>>
>> But there are madvise_free_pte_range() and
>> clean_record_shared_mapping_range()...->clean_record_pte(), that might only
>> clear the dirty bit without clearing/changing permissions and consequently
>> not calling MMU notifiers.
>
> Heh, I was staring at these earlier today. They all are wrapped with
> mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper aggressive response
> to mmu_notifier invalidations ensures all KVM SPTEs get dropped.
>
>> Getting a writable PTE without the dirty bit set should be possible.
>>
>> So I am questioning whether having a writable PTE in the secondary MMU with
>> a clean PTE in the primary MMU should be valid to exist. It can exist today,
>> and I am not sure if that's the right approach.
>
> I came to the same conclusion (that it can exist today). Without (sane) FOLL_TOUCH,
> KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier invalidation, and
> thus end up with a writable SPTE that isn't dirty in core MM.
>
> For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
> at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
> *after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
> dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
> folio.

Right, we have to check if it has been re-dirtied. And any unexpected
reference (i.e., GUP) could dirty it.
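
To make that concrete, the shape of the lazyfree decision in
try_to_unmap_one() is roughly the following (schematic only; the helper name
is made up and the real code is more involved):

  /* Schematic of the MADV_FREE discard decision, not the literal code. */
  static bool lazyfree_can_discard_sketch(struct folio *folio,
                                          int ref_count, int map_count)
  {
          /*
           * Discard only if nobody re-dirtied the folio and no extra
           * references (e.g. from GUP) remain beyond the mappings.
           */
          if (ref_count == 1 + map_count && !folio_test_dirty(folio))
                  return true;

          /* Re-dirtied or pinned: keep it, treat it as swap-backed again. */
          folio_set_swapbacked(folio);
          return false;
  }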

>
> Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
> that it needs to play nice with a GUP'd page being marked dirty before the
> reference is dropped.

>
> * Must be careful with the order of the tests. When someone has
> * a ref to the folio, it may be possible that they dirty it then
> * drop the reference. So if the dirty flag is tested before the
> * refcount here, then the following race may occur:
>
> So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> while holding a reference or (b) marks the folio dirty before returning from its
> mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> mappings in response to mmu_notifier_invalidate_range_start().
>

Yes, I agree that it should work in the context of vmscan. But (b) is
certainly a bit harder to swallow than "ordinary" (a) :)

As raised, if having a writable SPTE would imply having a writable+dirty
PTE, then KVM MMU code wouldn't have to worry about syncing any dirty
bits ever back to core-mm, so patch #2 would not be required. ... well,
it would be replaced by an MMU notifier that notifies about clearing the
PTE dirty bit :)

... because, then, there is also a subtle difference between
folio_set_dirty() and folio_mark_dirty(), and I am still confused about
the difference and not competent enough to explain it ... and KVM always
does the former, while the zapping code for pagecache folios does the
latter ... hm

Related note: IIRC, we usually expect most anon folios to be dirty.

kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
SetPageDirty()->folio_set_dirty(). Doing a test-before-set might
frequently avoid atomic ops.
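
Something like the following hypothetical variant (not a real patch)
illustrates the idea:

  /* Hypothetical test-before-set variant of kvm_set_page_dirty(). */
  static void kvm_set_page_dirty_sketch(struct page *page)
  {
          struct folio *folio = page_folio(page);

          /* Only pay for the atomic bitop when the flag actually changes. */
          if (!folio_test_dirty(folio))
                  folio_set_dirty(folio);
  }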

>>>> I once had the following idea, but I am not sure about all implications,
>>>> just wanted to raise it because it matches the topic here:
>>>>
>>>> Secondary page tables kind-of behave like "HW" access. If there is a
>>>> write access, we would expect the original PTE to become dirty, not the
>>>> mapped folio.
>>>
>>> Propagating SPTE dirty bits to folios indirectly via the primary MMU
>>> PTEs won't work for guest_memfd where there is no primary MMU PTE. In
>>> order to avoid having two different ways to propagate SPTE dirty bits,
>>> KVM should probably be responsible for updating the folio directly.
>>>
>>
>> But who really cares about access/dirty bits for guest_memfd?
>>
>> guest_memfd already wants to disable/bypass all of core-MM, so different
>> rules are to be expected. This discussion is all about integration with
>> core-MM that relies on correct dirty bits, which does not really apply to
>> guest_memfd.
>
> +1, this is one of many reasons I don't want to support swap/reclaim for guest_memfd.
> The instant guest_memfd gets involved with anything LRU related, the interactions
> and complexity skyrocket.

Agreed ... but it's unfortunate :/

--
Cheers,

David / dhildenb


2024-04-03 00:17:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Tue, Apr 02, 2024, David Hildenbrand wrote:
> On 02.04.24 19:38, David Matlack wrote:
> > On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <[email protected]> wrote:
> > > Secondary page tables are different to ordinary GUP, and KVM ends up
> > > using GUP to some degree to simulate HW access; regarding NUMA-hinting,
> > > KVM has already proven to be very different from all other GUP users. [1]
> > >
> > > And I recall that at some point I raised that we might want to have a
> > > dedicated interface for this "mmu-notifier" based page table
> > > synchronization mechanism.
> > >
> > > But KVM ends up setting folio dirty/access flags itself, like other GUP
> > > users. I do wonder if secondary page tables should be messing with folio
> > > flags *at all*, and if there would be ways to do it differently using PTEs.
> > >
> > > We make sure to synchronize the secondary page tables to the process
> > > page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> > > write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> > > differently.
> >
> > Accessed bits have the test/clear young MMU-notifiers. But I agree
> > there aren't any notifiers for dirty tracking.
>
> Yes, and I am questioning if the "test" part should exist -- or if having a
> spte in the secondary MMU should require the access bit to be set (derived
> from the primary MMU). (again, my explanation about fake HW page table
> walkers)
>
> There might be a good reason to do it like that nowadays, so I'm only
> raising it as something I was wondering. Likely, frequent clearing of the
> access bit would result in many PTEs in the secondary MMU getting
> invalidated, requiring a new GUP-fast lookup where we would set the access
> bit in the primary MMU PTE. But I'm not an expert on the implications with
> MMU notifiers and access bit clearing.

Ya. KVM already does something similar to this for dirty logging, where SPTEs are
write-protected, i.e. generate faults to track dirty status. But KVM x86 has a
lot of code to mitigate the resulting pain, e.g. has a lockless fast-path to
make the SPTE writable and propagate the dirty status to bitmaps, and userspace
is heavily/carefully optimized to balance harvesting/clearing dirty status against
guest activity.

In general, assumptions that core MM makes about the cost of PTE modifications
tend to fall apart for KVM. Much of that comes down to mmu_notifier invalidations
being an imperfect boundary between the primary MMU and secondary MMUs. E.g.
any prot changes (NUMA balancing in particular) can have disastrous impact on KVM
guests because those changes invalidate secondary MMUs at PMD granularity, and
effectively force KVM to do a remote TLB flush for every PMD that is affected, whereas
the primary is able to batch TLB flushes until the entire VMA is processed.

So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would be
comically slow for KVM guests without a crazy amount of optimization.
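
For reference, the aging hooks that already exist and avoid that
zap-and-refault dance look like this (condensed from struct mmu_notifier_ops
in include/linux/mmu_notifier.h; the wrapper struct name is only for this
excerpt):

  /* Excerpt of the accessed-bit ("young") callbacks; dirty has no equivalent. */
  struct mmu_notifier_aging_ops_excerpt {
          /* Clear the accessed bit and flush TLBs for the affected range. */
          int (*clear_flush_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                                   unsigned long start, unsigned long end);
          /* Clear the accessed bit without a TLB flush (cheaper, less strict). */
          int (*clear_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                             unsigned long start, unsigned long end);
          /* Report whether the address was accessed, without clearing it. */
          int (*test_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long address);
  };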

> > Are there any cases where the primary MMU transfers the PTE dirty bit
> > to the folio _other_ than zapping (which already has an MMU-notifier
> > to KVM)? If not, then there might not be any reason to add a new
> > notifier. Instead the contract should just be that secondary MMUs must
> > also transfer their dirty bits to folios in sync with (or before) the
> > primary MMU zapping its PTE.
>
> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
> notifier, because they either clear the PTE or also remove write
> permissions.
>
> But there are madvise_free_pte_range() and
> clean_record_shared_mapping_range()...->clean_record_pte(), that might only
> clear the dirty bit without clearing/changing permissions and consequently
> not calling MMU notifiers.

Heh, I was staring at these earlier today. They all are wrapped with
mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper aggressive response
to mmu_notifier invalidations ensures all KVM SPTEs get dropped.

> Getting a writable PTE without the dirty bit set should be possible.
>
> So I am questioning whether having a writable PTE in the secondary MMU with
> a clean PTE in the primary MMU should be valid to exist. It can exist today,
> and I am not sure if that's the right approach.

I came to the same conclusion (that it can exist today). Without (sane) FOLL_TOUCH,
KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier invalidation, and
thus end up with a writable SPTE that isn't dirty in core MM.

For MADV_FREE, I don't see how KVM's current behavior of marking the folio dirty
at zap time changes anything. Ah, try_to_unmap_one() checks folio_test_dirty()
*after* invoking mmu_notifier_invalidate_range_start(), which ensures that KVM's
dirty status is propagated to the folio, and thus try_to_unmap_one() keeps the
folio.

Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
also keep the folio if it has been GUP'd. And __remove_mapping() explicitly states
that it needs to play nice with a GUP'd page being marked dirty before the
reference is dropped.

* Must be careful with the order of the tests. When someone has
* a ref to the folio, it may be possible that they dirty it then
* drop the reference. So if the dirty flag is tested before the
* refcount here, then the following race may occur:

So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
while holding a reference or (b) marks the folio dirty before returning from its
mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
mappings in response to mmu_notifier_invalidate_range_start().
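
E.g. a minimal sketch of (a), modeled on KVM's existing
kvm_release_page_dirty() helper (illustrative only):

  /* Sketch of pattern (a): dirty the page while the fault-time ref is held. */
  static void kvm_release_page_dirty_sketch(struct page *page)
  {
          SetPageDirty(page);     /* mark dirty while the reference is held */
          put_page(page);         /* ... and only then drop KVM's reference */
  }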

> > > I once had the following idea, but I am not sure about all implications,
> > > just wanted to raise it because it matches the topic here:
> > >
> > > Secondary page tables kind-of behave like "HW" access. If there is a
> > > write access, we would expect the original PTE to become dirty, not the
> > > mapped folio.
> >
> > Propagating SPTE dirty bits to folios indirectly via the primary MMU
> > PTEs won't work for guest_memfd where there is no primary MMU PTE. In
> > order to avoid having two different ways to propagate SPTE dirty bits,
> > KVM should probably be responsible for updating the folio directly.
> >
>
> But who really cares about access/dirty bits for guest_memfd?
>
> guest_memfd already wants to disable/bypass all of core-MM, so different
> rules are to be expected. This discussion is all about integration with
> core-MM that relies on correct dirty bits, which does not really apply to
> guest_memfd.

+1, this is one of many reasons I don't want to support swap/reclaim for guest_memfd.
The instant guest_memfd gets involved with anything LRU related, the interactions
and complexity skyrocket.

2024-04-05 07:06:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 05.04.24 00:02, Sean Christopherson wrote:
> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>> On 04.04.24 19:31, Sean Christopherson wrote:
>>> On Thu, Apr 04, 2024, David Hildenbrand wrote:
>>>> On 04.04.24 00:19, Sean Christopherson wrote:
>>>>> Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
>>>>> to be invalidated before consuming dirty status. Isn't the end result essentially
>>>>> a sane FOLL_TOUCH?
>>>>
>>>> Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
>>>>
>>>> Having something that makes sure the writable PTE/PMD is dirty (or
>>>> alternatively sets it dirty), paired with MMU notifiers notifying on any
>>>> mkclean would be one option that would leave how to handle dirtying
>>>> of folios completely to the core. It would behave just like a CPU writing to
>>>> the page table, which would set the pte dirty.
>>>>
>>>> Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
>>>> (like we discussed for the accessed bit), that would not be an option. But
>>>> from what I recall, only clearing the PTE/PMD dirty bit is rather rare.
>>>
>>> And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
>>> it would probably be a net positive, e.g. the notification could more precisely
>>> say that SPTEs need to be read-only, not blasted away completely.
>>
>> As discussed, I think at least madvise_free_pte_range() wouldn't do that.
>
> I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
> would do in this future world, or what madvise_free_pte_range() does today? Because
> today, unless I'm really misreading the code, secondary MMUs are invalidated before
> the dirty bit is cleared.

Likely I missed it, sorry! I was talking about the possible future. :)

>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> range.start, range.end);
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm);
> update_hiwater_rss(mm);
>
> mmu_notifier_invalidate_range_start(&range);
> tlb_start_vma(&tlb, vma);
> walk_page_range(vma->vm_mm, range.start, range.end,
> &madvise_free_walk_ops, &tlb);
> tlb_end_vma(&tlb, vma);
> mmu_notifier_invalidate_range_end(&range);
>

Indeed, we do setup the MMU notifier invalidation. We do the start/end
... I was looking for PTE notifications.

I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
other code. Maybe that's not required here (digging through
documentation I'm still left clueless).

"
set_pte_at_notify() sets the pte _after_ running the notifier. This is
safe to start by updating the secondary MMUs, because the primary MMU
pte invalidate must have already happened with a ptep_clear_flush()
before set_pte_at_notify() has been invoked. ...
"

Absolutely unclear to me when we *must* use it, or if it is. Likely
it's a pure optimization when we're *changing* a PTE.

KSM and wp_page_copy() use it with MMU_NOTIFY_CLEAR, when replacing the
mapped page by another page (or write-protecting an existing one).
__replace_page() uses it when replacing the mapped
page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.

Other code (e.g., khugepaged) doesn't seem to use it either.

... so I guess it's fine? Confusing.
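
For reference, the notify variant only adds one callback before the PTE
write; condensed from include/linux/mmu_notifier.h (the real macro has extra
local-variable plumbing):

  /* Condensed form of set_pte_at_notify(): fire .change_pte(), then write. */
  #define set_pte_at_notify_sketch(mm, address, ptep, pte)        \
  ({                                                              \
          mmu_notifier_change_pte(mm, address, pte);              \
          set_pte_at(mm, address, ptep, pte);                     \
  })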

--
Cheers,

David / dhildenb


2024-04-05 09:38:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <[email protected]> wrote:
> > mmu_notifier_invalidate_range_start(&range);
> > tlb_start_vma(&tlb, vma);
> > walk_page_range(vma->vm_mm, range.start, range.end,
> > &madvise_free_walk_ops, &tlb);
> > tlb_end_vma(&tlb, vma);
> > mmu_notifier_invalidate_range_end(&range);
> >
>
> Indeed, we do setup the MMU notifier invalidation. We do the start/end
> ... I was looking for PTE notifications.
>
> I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
> other code. Maybe that's not required here (digging through
> documentation I'm still left clueless). [...]
> Absolutely unclear to me when we *must* use it, or if it is. Likely
> it's a pure optimization when we're *changing* a PTE.

Yes, .change_pte() is just an optimization. The original point of it
was for KSM, so that KVM could flip the sPTE to a new location without
first zapping it. At the time there was also an .invalidate_page()
callback, and both of them were *not* bracketed by calls to
mmu_notifier_invalidate_range_{start,end}()

Later on, both callbacks were changed to occur within an
invalidate_range_start/end() block.

Commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) did so
for .change_pte(). The reason to do so was a bit roundabout, namely to
allow sleepable .invalidate_page() hooks (because .change_pte() is not
sleepable and at the time .invalidate_page() was used as a fallback
for .change_pte()).

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of succeeding.
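
Schematically (not literal call sites), the ordering that makes it a nop:

  /*
   * mmu_notifier_invalidate_range_start()   -> KVM zaps its SPTEs
   * ptep_clear_flush()                      -> primary PTE is gone
   * set_pte_at_notify()
   *   -> .change_pte()                      -> no SPTE left to update, nop
   * mmu_notifier_invalidate_range_end()     -> KVM can fault the page back in
   */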

(Commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2", 2017-08-31) is where the other set of non-bracketed calls to MMU
notifier callbacks was changed; calls to
mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range(), bracketed by calls to
mmu_notifier_invalidate_range_{start,end}()).

Since KVM is the only user of .change_pte(), we can remove
.change_pte() and set_pte_at_notify() completely.

Paolo

> __replace_page() uses it when replacing the mapped
> page. And migrate_vma_insert_page() uses it with MMU_NOTIFY_MIGRATE.
>
> Other code (e.g., khugepaged) doesn't seem to use it either.
>
> ... so I guess it's fine? Confusing.
>
> --
> Cheers,
>
> David / dhildenb
>


2024-04-05 10:14:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 05.04.24 11:37, Paolo Bonzini wrote:
> On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <[email protected]> wrote:
>>> mmu_notifier_invalidate_range_start(&range);
>>> tlb_start_vma(&tlb, vma);
>>> walk_page_range(vma->vm_mm, range.start, range.end,
>>> &madvise_free_walk_ops, &tlb);
>>> tlb_end_vma(&tlb, vma);
>>> mmu_notifier_invalidate_range_end(&range);
>>>
>>
>> Indeed, we do setup the MMU notifier invalidation. We do the start/end
>> ... I was looking for PTE notifications.
>>
>> I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
>> other code. Maybe that's not required here (digging through
>> documentation I'm still left clueless). [...]
>> Absolutely unclear to me when we *must* use it, or if it is. Likely
>> it's a pure optimization when we're *changing* a PTE.
>
> Yes, .change_pte() is just an optimization. The original point of it
> was for KSM, so that KVM could flip the sPTE to a new location without
> first zapping it. At the time there was also an .invalidate_page()
> callback, and both of them were *not* bracketed by calls to
> mmu_notifier_invalidate_range_{start,end}()
>
> Later on, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
>
> Commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
> invalidate_range_start and invalidate_range_end", 2012-10-09) did so
> for .change_pte(). The reason to do so was a bit roundabout, namely to
> allow sleepable .invalidate_page() hooks (because .change_pte() is not
> sleepable and at the time .invalidate_page() was used as a fallback
> for .change_pte()).
>
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of succeeding.
>
> (Commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
> v2", 2017-08-31) is where the other set of non-bracketed calls to MMU
> notifier callbacks was changed; calls to
> mmu_notifier_invalidate_page() were replaced by calls to
> mmu_notifier_invalidate_range(), bracketed by calls to
> mmu_notifier_invalidate_range_{start,end}()).
>
> Since KVM is the only user of .change_pte(), we can remove
> .change_pte() and set_pte_at_notify() completely.

Nice, thanks for all that information!

--
Cheers,

David / dhildenb


2024-04-05 14:07:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On 4/5/24 15:59, Sean Christopherson wrote:
> Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
> .change_pte()"):
>
> x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
> is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
> which again should be a nop due to the invalidation. arm64 is a bit
> murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
> called without a cache pointer, which means it will map an entry if and
> only if an existing PTE was found.
>
> I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
> only reason I haven't sent a patch is because I didn't want it to prompt someone
> into resurrecting the original behavior. :-)

Ah, this indeed reminded me of discussions we had in the past. Patches
sent now:

https://lore.kernel.org/kvm/[email protected]/

and I don't think anyone would try to resurrect the original behavior.
It made some sense when there was an .invalidate_page() callback as
well, but the whole thing started to crumble when .invalidate_page() was
made sleepable, plus as David noticed it's very poorly documented where
to use set_pte_at_notify() vs. set_pte_at().

As I wrote in the cover letter above, the only reason why it has never
caused a bug is because it was doing nothing...

Paolo


2024-04-05 14:00:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Fri, Apr 05, 2024, David Hildenbrand wrote:
> On 05.04.24 11:37, Paolo Bonzini wrote:
> > On Fri, Apr 5, 2024 at 8:53 AM David Hildenbrand <[email protected]> wrote:
> > > > mmu_notifier_invalidate_range_start(&range);
> > > > tlb_start_vma(&tlb, vma);
> > > > walk_page_range(vma->vm_mm, range.start, range.end,
> > > > &madvise_free_walk_ops, &tlb);
> > > > tlb_end_vma(&tlb, vma);
> > > > mmu_notifier_invalidate_range_end(&range);
> > > >
> > >
> > > Indeed, we do setup the MMU notifier invalidation. We do the start/end
> > > ... I was looking for PTE notifications.
> > >
> > > I spotted the set_pte_at(), not a set_pte_at_notify() like we do in
> > > other code. Maybe that's not required here (digging through
> > > documentation I'm still left clueless). [...]
> > > Absolutely unclear to me when we *must* use it, or if it is. Likely
> > > it's a pure optimization when we're *changing* a PTE.
> >
> > Yes, .change_pte() is just an optimization. The original point of it
> > was for KSM, so that KVM could flip the sPTE to a new location without
> > first zapping it. At the time there was also an .invalidate_page()
> > callback, and both of them were *not* bracketed by calls to
> > mmu_notifier_invalidate_range_{start,end}()
> >
> > Later on, both callbacks were changed to occur within an
> > invalidate_range_start/end() block.
> >
> > Commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
> > invalidate_range_start and invalidate_range_end", 2012-10-09) did so
> > for .change_pte(). The reason to do so was a bit roundabout, namely to
> > allow sleepable .invalidate_page() hooks (because .change_pte() is not
> > sleepable and at the time .invalidate_page() was used as a fallback
> > for .change_pte()).
> >
> > This however made KVM's usage of the .change_pte() callback completely
> > moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> > and therefore .change_pte() has no hope of succeeding.
> >
> > (Commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
> > v2", 2017-08-31) is where the other set of non-bracketed calls to MMU
> > notifier callbacks was changed; calls to
> > mmu_notifier_invalidate_page() were replaced by calls to
> > mmu_notifier_invalidate_range(), bracketed by calls to
> > mmu_notifier_invalidate_range_{start,end}()).
> >
> > Since KVM is the only user of .change_pte(), we can remove
> > .change_pte() and set_pte_at_notify() completely.
>
> Nice, thanks for all that information!

Ya, from commit c13fda237f08 ("KVM: Assert that notifier count is elevated in
change_pte()"):

x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
which again should be a nop due to the invalidation. arm64 is a bit
murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
called without a cache pointer, which means it will map an entry if and
only if an existing PTE was found.

I'm 100% in favor of removing .change_pte(). As I've said multiple times, the
only reason I haven't sent a patch is because I didn't want it to prompt someone
into resurrecting the original behavior. :-)