2024-06-11 00:22:19

by James Houghton

Subject: [PATCH v5 0/9] mm: multi-gen LRU: Walk secondary MMU page tables while aging

This patchset makes it possible for MGLRU to consult secondary MMUs
while doing aging, not just during eviction. This allows for more
accurate reclaim decisions, which is especially important for proactive
reclaim.

This series does the following:
1. Improve locking for the existing test/clear_young notifiers for x86
and arm64.
2. Add a new notifier test_clear_young_fast_only(), implemented only by
KVM/x86.
3. Incorporate test_clear_young_fast_only() into MGLRU aging.

To make aging work for architectures other than x86, the
test_clear_young_fast_only() notifier must be implemented by those
architectures as well.
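
For reference, wiring the new notifier into a secondary MMU boils down
to something like the sketch below. The ops hook and the return bits
come from patch 4 of this series; the my_mmu_*() helpers are
hypothetical stand-ins for the architecture-specific work:

static int my_mmu_test_clear_young_fast_only(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end,
					     bool clear)
{
	/* Hypothetical check: bail out if aging cannot be done locklessly. */
	if (!my_mmu_can_age_locklessly())
		return MMU_NOTIFIER_FAST_FAILED;

	/* Hypothetical helper: test (and optionally clear) the accessed bits. */
	return my_mmu_age(mm, start, end, clear) ? MMU_NOTIFIER_FAST_YOUNG : 0;
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	/* ...existing callbacks... */
	.test_clear_young_fast_only	= my_mmu_test_clear_young_fast_only,
};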

access_tracking_perf_test now has a mode (-p) to check performance of
MGLRU aging while the VM is faulting memory in. See the v4 cover
letter[1] for performance data collected with this test.

Previous versions of this series included logic in MGLRU and KVM to
support batching the updates to secondary page tables. This version
removes this logic, as it was complex and not necessary to enable
proactive reclaim. This optimization, as well as the additional
optimizations for arm64 and powerpc, can be done in a later series.

=== Previous Versions ===

This v5 re-adds a lot of logic that was present in v3 and earlier
versions of the series. There is an important difference I want to call
out:

- should_look_around() can sometimes require two notifiers instead of
one. This is necessary because I have chosen not to modify
mmu_notifier_clear_young().

It may simply be better to do what v2/v3 did and not have a fast-only
notifier at all, merging everything into the existing notifiers; that
makes their API slightly more complex. I'm not sure which approach is
better.

Change log:

Since v4[1]:
- Removed Kconfig that controlled when aging was enabled. Aging will
be done whenever the architecture supports it (thanks Yu).
- Added a new MMU notifier, test_clear_young_fast_only(), specifically
for MGLRU to use.
- Added kvm_fast_{test_,}age_gfn, implemented by x86.
- Fixed locking for clear_flush_young().
- Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
(thanks Sean).
- Fixed a WARN_ON and did other cleanup for the arm64 locking changes
(thanks Oliver).

Since v3[2]:
- Vastly simplified the series (thanks David). Removed mmu notifier
batching logic entirely.
- Cleaned up how locking is done for mmu_notifier_test/clear_young
(thanks David).
- Look-around is now only done when there are no secondary MMUs
subscribed to MMU notifiers.
- CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
- Fixed the lockless implementation of kvm_{test,}age_gfn for x86
(thanks David).
- Added MGLRU functional and performance tests to
access_tracking_perf_test (thanks Axel).
- In v3, an mm would be completely ignored (for aging) if there was a
secondary MMU but support for secondary MMU walking was missing. Now,
missing secondary MMU walking support simply skips the notifier
calls (except for eviction).
- Added a sanity check that range->lockless and range->on_lock are
never both provided for the memslot walk.

For the changes since v2[3], see v3.

Based on 6.10-rc3.

[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://lore.kernel.org/linux-mm/[email protected]/
[3]: https://lore.kernel.org/kvmarm/[email protected]/

James Houghton (8):
KVM: Add lockless memslot walk to KVM
KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
mm: Add test_clear_young_fast_only MMU notifier
KVM: Add kvm_fast_age_gfn and kvm_fast_test_age_gfn
KVM: x86: Implement kvm_fast_test_age_gfn and kvm_fast_age_gfn
mm: multi-gen LRU: Have secondary MMUs participate in aging
KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Yu Zhao (1):
KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/hyp/pgtable.c | 15 +-
arch/arm64/kvm/mmu.c | 26 +-
arch/x86/include/asm/kvm_host.h | 14 +
arch/x86/kvm/Kconfig | 2 +
arch/x86/kvm/mmu.h | 6 -
arch/x86/kvm/mmu/mmu.c | 60 ++-
arch/x86/kvm/mmu/spte.h | 1 -
arch/x86/kvm/mmu/tdp_iter.h | 27 +-
arch/x86/kvm/mmu/tdp_mmu.c | 67 ++-
include/linux/kvm_host.h | 8 +
include/linux/mmu_notifier.h | 50 +++
include/linux/mmzone.h | 6 +-
include/trace/events/kvm.h | 22 +
mm/mmu_notifier.c | 26 ++
mm/rmap.c | 9 +-
mm/vmscan.c | 185 +++++++--
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
virt/kvm/Kconfig | 7 +
virt/kvm/kvm_main.c | 73 +++-
24 files changed, 1283 insertions(+), 140 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
--
2.45.2.505.gda0bf45e8d-goog



2024-06-11 00:22:57

by James Houghton

Subject: [PATCH v5 2/9] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Walk the TDP MMU in an RCU read-side critical section. This requires a
way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
macro. The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for
the fact that kvm_age_gfn can now locklessly update the accessed bit
and the R/X bits.

If the cmpxchg for marking the spte for access tracking fails, we simply
retry if the spte is still a leaf PTE. If it isn't, we return false
to continue the walk.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 10 ++++-
arch/x86/kvm/mmu/tdp_iter.h | 27 +++++++------
arch/x86/kvm/mmu/tdp_mmu.c | 67 +++++++++++++++++++++++++--------
5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8ca74e7678f..011c8eb7c8d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1438,6 +1438,7 @@ struct kvm_arch {
* tdp_mmu_page set.
*
* For reads, this list is protected by:
+ * RCU alone or
* the MMU lock in read mode + RCU or
* the MMU lock in write mode
*
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fec95a770270..9dda7f8c72ed 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
depends on X86_LOCAL_APIC
select KVM_COMMON
select KVM_GENERIC_MMU_NOTIFIER
+ select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
select HAVE_KVM_IRQCHIP
select HAVE_KVM_PFNCACHE
select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d74bdef68c1..51061f1fb3d1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm))
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..510936a8455a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
return xchg(rcu_dereference(sptep), new_spte);
}

+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+ atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+ return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
{
KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
}

/*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
+ * cleared by age_gfn_range.
*
* Note, non-leaf SPTEs do have Accessed bits and those bits are
* technically volatile, but KVM doesn't consume the Accessed bit of
@@ -46,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
{
return is_shadow_present_pte(old_spte) &&
- is_last_spte(old_spte, level) &&
- spte_has_volatile_bits(old_spte);
+ is_last_spte(old_spte, level);
}

static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
u64 mask, int level)
{
- atomic64_t *sptep_atomic;
-
- if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
- sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
- return (u64)atomic64_fetch_and(~mask, sptep_atomic);
- }
+ if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+ return tdp_mmu_clear_spte_bits_atomic(sptep, mask);

__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
return old_spte;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 36539c1b36cd..46abd04914c2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,

return true;
}
+static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return true;
+}

void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
@@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
((_only_valid) && (_root)->role.invalid))) { \
} else

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id) \
+ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_rcu_read_lock_held() && \
+ (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) { \
+ } else
+
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)

@@ -1223,6 +1237,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
return ret;
}

+static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
+ struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ tdp_handler_t handler)
+{
+ struct kvm_mmu_page *root;
+ struct tdp_iter iter;
+ bool ret = false;
+
+ rcu_read_lock();
+
+ for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
+ tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+ ret |= handler(kvm, &iter, range);
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
/*
* Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
* if any of the GFNs in the range have been accessed.
@@ -1236,28 +1271,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
{
u64 new_spte;

+retry:
/* If we have a non-accessed entry we don't need to change the pte. */
if (!is_accessed_spte(iter->old_spte))
return false;

if (spte_ad_enabled(iter->old_spte)) {
- iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
- iter->old_spte,
- shadow_accessed_mask,
- iter->level);
+ iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+ shadow_accessed_mask);
new_spte = iter->old_spte & ~shadow_accessed_mask;
} else {
- /*
- * Capture the dirty status of the page, so that it doesn't get
- * lost when the SPTE is marked for access tracking.
- */
+ new_spte = mark_spte_for_access_track(iter->old_spte);
+ if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
+ /*
+ * The cmpxchg failed. If the spte is still a
+ * last-level spte, we can safely retry.
+ */
+ if (is_shadow_present_pte(iter->old_spte) &&
+ is_last_spte(iter->old_spte, iter->level))
+ goto retry;
+ /* Otherwise, continue walking. */
+ return false;
+ }
if (is_writable_pte(iter->old_spte))
kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
- new_spte = mark_spte_for_access_track(iter->old_spte);
- iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
- iter->old_spte, new_spte,
- iter->level);
}

trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1267,7 +1304,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range);
}

static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1278,7 +1315,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,

bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
- return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+ return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn);
}

/*
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:23:17

by James Houghton

Subject: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

Replace the MMU write locks (taken in the memslot iteration loop) with
read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..7a1af8141c0e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,6 +22,7 @@ menuconfig KVM
select KVM_COMMON
select KVM_GENERIC_HARDWARE_ENABLING
select KVM_GENERIC_MMU_NOTIFIER
+ select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
select KVM_GENERIC_DIRTYLOG_READ_PROTECT
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..b1b0f7148cff 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
data->young = true;

/*
- * stage2_age_walker() is always called while holding the MMU lock for
- * write, so this will always succeed. Nonetheless, this deliberately
- * follows the race detection pattern of the other stage-2 walkers in
- * case the locking mechanics of the MMU notifiers is ever changed.
+ * This walk may not be exclusive; the PTE is permitted to change
+ * from under us. If there is a race to update this PTE, then the
+ * GFN is most likely young, so failing to clear the AF is likely
+ * to be inconsequential.
*/
if (data->mkold && !stage2_try_set_pte(ctx, new))
return -EAGAIN;
@@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
struct kvm_pgtable_walker walker = {
.cb = stage2_age_walker,
.arg = &data,
- .flags = KVM_PGTABLE_WALK_LEAF,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_SHARED,
};
+ int r;

- WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+ r = kvm_pgtable_walk(pgt, addr, size, &walker);
+ WARN_ON(r && r != -EAGAIN);
return data.young;
}

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8bcab0cc3fe9..a62c27a347ed 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, true);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, true);
+
+out:
+ read_unlock(&kvm->mmu_lock);
+ return young;
}

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
u64 size = (range->end - range->start) << PAGE_SHIFT;
+ bool young = false;
+
+ read_lock(&kvm->mmu_lock);

if (!kvm->arch.mmu.pgt)
- return false;
+ goto out;

- return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
- range->start << PAGE_SHIFT,
- size, false);
+ young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+ range->start << PAGE_SHIFT,
+ size, false);
+
+out:
+ read_unlock(&kvm->mmu_lock);
+ return young;
}

phys_addr_t kvm_mmu_get_httbr(void)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:06

by James Houghton

Subject: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

This new notifier is for multi-gen LRU specifically, as it wants to be
able to get and clear age information from secondary MMUs, but only if
it can be done "fast".

Because this notifier is created specifically for MGLRU, what "fast"
means comes down to what is fast enough to improve MGLRU's reclaim
decisions most of the time.
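
As a sketch of the intended calling convention (this mirrors how MGLRU
ends up consuming the notifier later in this series; the wrapper
function itself is hypothetical):

static bool secondary_mmus_saw_access(struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	int ret;

	/* Test and clear the accessed bits in any secondary MMUs, fast-only. */
	ret = mmu_notifier_test_clear_young_fast_only(mm, start, end,
						      /*clear=*/true);

	/*
	 * A secondary MMU could not answer quickly; the caller must fall
	 * back to the existing notifiers (or treat the range as not young).
	 */
	if (ret & MMU_NOTIFIER_FAST_FAILED)
		return false;

	return ret & MMU_NOTIFIER_FAST_YOUNG;
}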

Signed-off-by: James Houghton <[email protected]>
---
include/linux/mmu_notifier.h | 50 ++++++++++++++++++++++++++++++++++++
mm/mmu_notifier.c | 26 +++++++++++++++++++
2 files changed, 76 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d39ebb10caeb..2655d841a409 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ enum mmu_notifier_event {

#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+/*
+ * Bits in the return value for test_clear_young_fast_only.
+ *
+ * MMU_NOTIFIER_FAST_YOUNG: notifier succeeded, secondary MMU reports young.
+ * MMU_NOTIFIER_FAST_FAILED: notifier failed.
+ */
+#define MMU_NOTIFIER_FAST_YOUNG (1 << 0)
+#define MMU_NOTIFIER_FAST_FAILED (1 << 1)
+
struct mmu_notifier_ops {
/*
* Called either by mmu_notifier_unregister or when the mm is
@@ -122,6 +131,24 @@ struct mmu_notifier_ops {
struct mm_struct *mm,
unsigned long address);

+ /*
+ * test_clear_young_fast_only is called to check (and optionally clear)
+ * the young/accessed bitflag in the secondary pte such that the
+ * secondary MMU must implement it in a way that will not significantly
+ * disrupt other MMU operations. In other words, speed is more
+ * important than accuracy.
+ *
+ * Returns MMU_NOTIFIER_FAST_YOUNG if the secondary pte(s) were young.
+ * Returns MMU_NOTIFIER_FAST_FAILED if the secondary MMU could not do
+ * an accurate fast-only test and/or clear of the young/accessed
+ * flag.
+ */
+ int (*test_clear_young_fast_only)(struct mmu_notifier *subscription,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear);
+
/*
* invalidate_range_start() and invalidate_range_end() must be
* paired and are called only when the mmap_lock and/or the
@@ -383,6 +410,10 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long end);
extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
+extern int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear);
extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -428,6 +459,17 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ if (mm_has_notifiers(mm))
+ return __mmu_notifier_test_clear_young_fast_only(mm, start, end,
+ clear);
+ return 0;
+}
+
static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
@@ -612,6 +654,14 @@ static inline int mmu_notifier_test_young(struct mm_struct *mm,
return 0;
}

+static inline int mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ return 0;
+}
+
static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 8982e6139d07..7b77ad6cf833 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -424,6 +424,32 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
return young;
}

+int __mmu_notifier_test_clear_young_fast_only(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ struct mmu_notifier *subscription;
+ int ret = 0, id;
+
+ id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(subscription,
+ &mm->notifier_subscriptions->list, hlist,
+ srcu_read_lock_held(&srcu)) {
+ if (subscription->ops->test_clear_young_fast_only) {
+ ret = subscription->ops->test_clear_young_fast_only(
+ subscription, mm, start, end, clear);
+ if (ret & MMU_NOTIFIER_FAST_FAILED)
+ break;
+ if (!clear && (ret & MMU_NOTIFIER_FAST_YOUNG))
+ break;
+ }
+ }
+ srcu_read_unlock(&srcu, id);
+
+ return ret;
+}
+
static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
const struct mmu_notifier_range *range)
{
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:11

by James Houghton

Subject: [PATCH v5 5/9] KVM: Add kvm_fast_age_gfn and kvm_fast_test_age_gfn

Provide the basics for allowing architectures to implement
mmu_notifier_test_clear_young_fast_only().

Add CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER that architectures will set
if they implement the fast-only notifier.

kvm_fast_age_gfn and kvm_fast_test_age_gfn both need to support
returning a tri-state result of:
1. fast && young,
2. fast && !young,
3. !fast
This could be done by making gfn_handler_t return int, but that would
mean a lot of churn. Instead, include a new kvm_mmu_notifier_arg
'bool *failed' for kvm_fast_{test_,}age_gfn to optionally use.
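
For example, an architecture's fast-only handlers are expected to look
roughly like the sketch below. The my_arch_*() helpers are
hypothetical; x86's real implementation later in this series uses the
presence of rmaps (i.e., the shadow MMU) as its "can this be done fast"
check:

bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	/* Hypothetical check: can aging be done without taking mmu_lock? */
	if (!my_arch_can_age_fast(kvm)) {
		*range->arg.failed = true;
		return false;
	}

	/* Hypothetical helper: lockless equivalent of kvm_test_age_gfn(). */
	return my_arch_test_age_fast(kvm, range);
}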

Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 7 ++++++
include/trace/events/kvm.h | 22 ++++++++++++++++++
virt/kvm/Kconfig | 4 ++++
virt/kvm/kvm_main.c | 47 ++++++++++++++++++++++++++++++--------
4 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4d7c3e8632e6..e4efeba51222 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,9 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
union kvm_mmu_notifier_arg {
unsigned long attributes;
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ bool *failed;
+#endif
};

struct kvm_gfn_range {
@@ -271,7 +274,11 @@ struct kvm_gfn_range {
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+bool kvm_fast_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
#endif
+#endif /* CONFIG_KVM_GENERIC_MMU_NOTIFIER */

enum {
OUTSIDE_GUEST_MODE,
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af4..7ba6c35c2426 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,6 +489,28 @@ TRACE_EVENT(kvm_test_age_hva,
TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
);

+TRACE_EVENT(kvm_fast_test_age_hva,
+ TP_PROTO(unsigned long start, unsigned long end, bool clear),
+ TP_ARGS(start, end, clear),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, start )
+ __field( unsigned long, end )
+ __field( bool, clear )
+ ),
+
+ TP_fast_assign(
+ __entry->start = start;
+ __entry->end = end;
+ __entry->clear = clear;
+ ),
+
+ TP_printk("mmu notifier fast test age: hva: %#016lx -- %#016lx "
+ "clear: %d",
+ __entry->start, __entry->end,
+ __entry->clear)
+);
+
#endif /* _TRACE_KVM_MAIN_H */

/* This part must be outside protection */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 0404857c1702..77ac680af60c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -100,6 +100,10 @@ config KVM_GENERIC_MMU_NOTIFIER
config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
bool

+config HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ select KVM_GENERIC_MMU_NOTIFIER
+ bool
+
config KVM_GENERIC_MEMORY_ATTRIBUTES
depends on KVM_GENERIC_MMU_NOTIFIER
bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d8fa0d617f12..aa930a8b903f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -699,7 +699,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
- gfn_handler_t handler)
+ gfn_handler_t handler,
+ bool *failed)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
@@ -711,6 +712,7 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.may_block = false,
.lockless =
IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
+ .arg.failed = failed,
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -901,7 +903,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
* cadence. If we find this inaccurate, we might come up with a
* more sophisticated heuristic later.
*/
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn, NULL);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -911,9 +913,32 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
trace_kvm_test_age_hva(address);

return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
+ kvm_test_age_gfn, NULL);
}

+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+static int kvm_mmu_notifier_test_clear_young_fast_only(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ gfn_handler_t handler;
+ bool failed = false, young;
+
+ trace_kvm_fast_test_age_hva(start, end, clear);
+
+ handler = clear ? kvm_fast_age_gfn : kvm_fast_test_age_gfn;
+
+ young = kvm_handle_hva_range_no_flush(mn, start, end, handler, &failed);
+
+ if (failed)
+ return MMU_NOTIFIER_FAST_FAILED;
+
+ return young ? MMU_NOTIFIER_FAST_YOUNG : 0;
+}
+#endif
+
static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
@@ -926,12 +951,16 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
- .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
- .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
- .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
- .clear_young = kvm_mmu_notifier_clear_young,
- .test_young = kvm_mmu_notifier_test_young,
- .release = kvm_mmu_notifier_release,
+ .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
+ .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
+ .clear_flush_young = kvm_mmu_notifier_clear_flush_young,
+ .clear_young = kvm_mmu_notifier_clear_young,
+ .test_young = kvm_mmu_notifier_test_young,
+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+ .test_clear_young_fast_only =
+ kvm_mmu_notifier_test_clear_young_fast_only,
+#endif
+ .release = kvm_mmu_notifier_release,
};

static int kvm_init_mmu_notifier(struct kvm *kvm)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:33

by James Houghton

Subject: [PATCH v5 7/9] KVM: x86: Implement kvm_fast_test_age_gfn and kvm_fast_age_gfn

The fast-only notifier will only report an accurate result when the
shadow MMU is not in use.

Implement kvm_arch_young_notifier_likely_fast(); MGLRU will check this
function to see if it should even attempt the fast-only notifier. We
only want to attempt the notifier if there is a chance that it will
succeed (i.e., if we're using the TDP MMU).
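
For context, the MGLRU side (added later in this series) gates on this
helper roughly as follows:

static bool should_walk_secondary_mmu(void)
{
	/* Only attempt the fast-only notifier if it can plausibly succeed. */
	return kvm_arch_young_notifier_likely_fast() &&
	       get_cap(LRU_GEN_SECONDARY_MMU_WALK);
}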

Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 7 +++++
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 50 ++++++++++++++++++++++++++++++---
3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0dc1fa99cdbb..ca2fbc162e51 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2338,4 +2338,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
*/
#define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)

+#define kvm_arch_young_notifier_likely_fast kvm_arch_young_notifier_likely_fast
+static inline bool kvm_arch_young_notifier_likely_fast(void)
+{
+ return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled &&
+ shadow_accessed_mask;
+}
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 9dda7f8c72ed..84ae043c7d43 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -24,6 +24,7 @@ config KVM
select KVM_COMMON
select KVM_GENERIC_MMU_NOTIFIER
select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+ select HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
select HAVE_KVM_IRQCHIP
select HAVE_KVM_PFNCACHE
select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 51061f1fb3d1..ed50e78755ab 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1629,11 +1629,15 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
}

-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool fast_only)
{
bool young = false;

if (kvm_memslots_have_rmaps(kvm)) {
+ if (fast_only)
+ return -1;
+
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
@@ -1642,14 +1646,18 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

- return young;
+ return (int)young;
}

-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool fast_only)
{
bool young = false;

if (kvm_memslots_have_rmaps(kvm)) {
+ if (fast_only)
+ return -1;
+
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
write_unlock(&kvm->mmu_lock);
@@ -1658,7 +1666,41 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

- return young;
+ return (int)young;
+}
+
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_age_gfn(kvm, range, false);
+}
+
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_test_age_gfn(kvm, range, false);
+}
+
+bool kvm_fast_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ int ret = __kvm_age_gfn(kvm, range, true);
+
+ if (ret < 0) {
+ *range->arg.failed = true;
+ return false;
+ }
+
+ return ret != 0;
+}
+
+bool kvm_fast_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ int ret = __kvm_test_age_gfn(kvm, range, true);
+
+ if (ret < 0) {
+ *range->arg.failed = true;
+ return false;
+ }
+
+ return ret != 0;
}

static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:24:34

by James Houghton

Subject: [PATCH v5 6/9] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

From: Yu Zhao <[email protected]>

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_young_notifier_likely_fast().

Signed-off-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/mmu.h | 6 ------
arch/x86/kvm/mmu/spte.h | 1 -
3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 011c8eb7c8d3..0dc1fa99cdbb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1855,6 +1855,7 @@ struct kvm_arch_async_pf {

extern u32 __read_mostly kvm_nr_uret_msrs;
extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
extern bool __read_mostly allow_smaller_maxphyaddr;
extern bool __read_mostly enable_apicv;
extern struct kvm_x86_ops kvm_x86_ops;
@@ -1960,6 +1961,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
bool mask);

extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif

u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2e454316f2a2..267f72b065f5 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -271,12 +271,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}

-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 52fa004a1fbc..9ca6d80fb86c 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -172,7 +172,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
extern u64 __read_mostly shadow_nx_mask;
extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
extern u64 __read_mostly shadow_dirty_mask;
extern u64 __read_mostly shadow_mmio_value;
extern u64 __read_mostly shadow_mmio_mask;
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:25:08

by James Houghton

Subject: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

Secondary MMUs are currently consulted for access/age information at
eviction time, but before then, we don't get accurate age information.
That is, pages that are mostly accessed through a secondary MMU (like
guest memory, used by KVM) will always just proceed down to the oldest
generation, and then at eviction time, if KVM reports the page to be
young, the page will be activated/promoted back to the youngest
generation.

If the added feature bit (0x8) is disabled, MGLRU will behave as if
there are no secondary MMUs subscribed to MMU notifiers, except at
eviction time.

Implement aging with the new mmu_notifier_test_clear_young_fast_only()
notifier. For architectures that do not support this notifier, this
becomes a no-op. For architectures that do implement it, it should be
fast enough to make aging worth it.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---

Notes:
should_look_around() can sometimes use two notifiers now instead of one.

This simply comes from restricting myself from changing
mmu_notifier_clear_young() to return more than just "young or not".

I could change mmu_notifier_clear_young() (and
mmu_notifier_test_young()) to return whether or not the result was
gathered fast. At that point, I could just as well combine all the
notifiers into one notifier, like what was done in v2 and v3.

Documentation/admin-guide/mm/multigen_lru.rst | 6 +-
include/linux/mmzone.h | 6 +-
mm/rmap.c | 9 +-
mm/vmscan.c | 185 ++++++++++++++----
4 files changed, 164 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..1e578e0c4c0c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Continuously clear the accessed bit in secondary MMU page
+ tables instead of waiting until eviction time. This results in
+ accurate page age information for pages that are mainly used by
+ a secondary MMU.
[yYnN] Apply to all the components above.
====== ===============================================================

@@ -56,7 +60,7 @@ E.g.,

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
- 0x0007
+ 0x000f
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..869824ef5f3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -400,6 +400,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+ LRU_GEN_SECONDARY_MMU_WALK,
NR_LRU_GEN_CAPS
};

@@ -557,7 +558,7 @@ struct lru_gen_memcg {

void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);

void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +577,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}

-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}

static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..24a3ff639919 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
continue;
}

- if (pvmw.pte) {
- if (lru_gen_enabled() &&
- pte_young(ptep_get(pvmw.pte))) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..348f3ffc8d5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -2579,6 +2580,21 @@ static bool should_clear_pmd_young(void)
return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

+#ifdef CONFIG_HAVE_KVM_YOUNG_FAST_ONLY_NOTIFIER
+#include <linux/kvm_host.h>
+static bool should_walk_secondary_mmu(void)
+{
+ return kvm_arch_young_notifier_likely_fast() &&
+ get_cap(LRU_GEN_SECONDARY_MMU_WALK);
+}
+#else
+static bool should_walk_secondary_mmu(void)
+{
+ return false;
+}
+#endif
+
+
/******************************************************************************
* shorthand helpers
******************************************************************************/
@@ -3276,7 +3292,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
return false;
}

-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pte_pfn(pte);

@@ -3291,10 +3308,15 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pmd_pfn(pmd);

@@ -3309,6 +3331,10 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

+ /* try to avoid unnecessary memory loads */
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}

@@ -3317,10 +3343,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
{
struct folio *folio;

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- return NULL;
-
folio = pfn_folio(pfn);
if (folio_nid(folio) != pgdat->node_id)
return NULL;
@@ -3343,6 +3365,43 @@ static bool suitable_to_scan(int total, int young)
return young * n >= total;
}

+static bool lru_gen_notifier_test_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end,
+ bool clear)
+{
+ return should_walk_secondary_mmu() &&
+ (mmu_notifier_test_clear_young_fast_only(
+ mm, start, end, clear) &
+ MMU_NOTIFIER_FAST_YOUNG);
+}
+
+static bool lru_gen_notifier_test_young(struct mm_struct *mm,
+ unsigned long addr)
+{
+ return lru_gen_notifier_test_clear_young(mm, addr, addr + PAGE_SIZE,
+ false);
+}
+
+static bool lru_gen_notifier_clear_young(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return lru_gen_notifier_test_clear_young(mm, start, end, true);
+}
+
+static bool lru_gen_pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr,
+ pmd_t *pmd)
+{
+ bool young = pmdp_test_and_clear_young(vma, addr, pmd);
+
+ if (lru_gen_notifier_clear_young(vma->vm_mm, addr, addr + PMD_SIZE))
+ young = true;
+
+ return young;
+}
+
static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct mm_walk *args)
{
@@ -3357,8 +3416,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = args->mm;

- pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ pte = pte_offset_map_nolock(mm, pmd, start & PMD_MASK, &ptl);
if (!pte)
return false;
if (!spin_trylock(ptl)) {
@@ -3376,11 +3436,12 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
total++;
walk->mm_stats[MM_LEAF_TOTAL]++;

- pfn = get_pte_pfn(ptent, args->vma, addr);
+ pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent)) {
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr)) {
walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3389,8 +3450,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (!folio)
continue;

- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(args->vma, addr, pte + i);

young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3456,22 +3518,25 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;

- pfn = get_pmd_pfn(pmd[i], vma, addr);
- if (pfn == -1)
- goto next;
-
- if (!pmd_trans_huge(pmd[i])) {
- if (should_clear_pmd_young())
+ if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+ if (should_clear_pmd_young() &&
+ !should_walk_secondary_mmu())
pmdp_test_and_clear_young(vma, addr, pmd + i);
goto next;
}

+ pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+ if (pfn == -1)
+ goto next;
+
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
goto next;

- if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ if (!lru_gen_pmdp_test_and_clear_young(vma, addr, pmd + i)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
goto next;
+ }

walk->mm_stats[MM_LEAF_YOUNG]++;

@@ -3528,19 +3593,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
}

if (pmd_trans_huge(val)) {
- unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);

walk->mm_stats[MM_LEAF_TOTAL]++;

- if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
+ if (pfn == -1)
continue;
- }

- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ if (!pmd_young(val) && !mm_has_notifiers(args->mm)) {
+ walk->mm_stats[MM_LEAF_OLD]++;
continue;
+ }

walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
@@ -3548,7 +3612,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,

walk->mm_stats[MM_NONLEAF_TOTAL]++;

- if (should_clear_pmd_young()) {
+ if (should_clear_pmd_young() && !should_walk_secondary_mmu()) {
if (!pmd_young(val))
continue;

@@ -3994,6 +4058,47 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* rmap/PT walk feedback
******************************************************************************/

+static bool should_look_around(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, int *young)
+{
+ int notifier_result = MMU_NOTIFIER_FAST_FAILED;
+ bool notifier_was_fast = false;
+ bool secondary_young = false;
+
+ if (should_walk_secondary_mmu()) {
+ notifier_result =
+ mmu_notifier_test_clear_young_fast_only(
+ vma->vm_mm, addr, addr + PAGE_SIZE,
+ /*clear=*/true);
+ }
+
+ if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
+ secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
+ addr + PAGE_SIZE);
+ else {
+ secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
+ notifier_was_fast = true;
+ }
+
+ /*
+ * Look around if (1) the PTE is young or (2) the secondary PTE was
+ * young and the results were gathered fast (so look-around will
+ * probably be accurate).
+ */
+ if (pte_young(ptep_get(pte))) {
+ ptep_test_and_clear_young(vma, addr, pte);
+ *young = true;
+ return true;
+ }
+
+ if (secondary_young) {
+ *young = true;
+ return notifier_was_fast;
+ }
+
+ return false;
+}
+
/*
* This function exploits spatial locality when shrink_folio_list() walks the
* rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
@@ -4001,7 +4106,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
@@ -4019,16 +4124,20 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
DEFINE_MAX_SEQ(lruvec);
int old_gen, new_gen = lru_gen_from_seq(max_seq);
+ struct mm_struct *mm = pvmw->vma->vm_mm;

lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);

+ if (!should_look_around(vma, addr, pte, &young))
+ return young;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return young;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
- return;
+ return young;

/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4036,6 +4145,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, vma->vm_start);
end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;

+ if (end - start == PAGE_SIZE)
+ return young;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4049,7 +4161,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)

/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return young;

arch_enter_lazy_mmu_mode();

@@ -4059,19 +4171,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);

- pfn = get_pte_pfn(ptent, vma, addr);
+ pfn = get_pte_pfn(ptent, vma, addr, pgdat);
if (pfn == -1)
continue;

- if (!pte_young(ptent))
+ if (!pte_young(ptent) &&
+ !lru_gen_notifier_test_young(mm, addr))
continue;

folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
if (!folio)
continue;

- if (!ptep_test_and_clear_young(vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ lru_gen_notifier_clear_young(mm, addr, addr + PAGE_SIZE);
+ if (pte_young(ptent))
+ ptep_test_and_clear_young(vma, addr, pte + i);

young++;

@@ -4101,6 +4215,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+ return young;
}

/******************************************************************************
@@ -5137,6 +5253,9 @@ static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, c
if (should_clear_pmd_young())
caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+ if (should_walk_secondary_mmu())
+ caps |= BIT(LRU_GEN_SECONDARY_MMU_WALK);
+
return sysfs_emit(buf, "0x%04x\n", caps);
}

--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:25:40

by James Houghton

Subject: [PATCH v5 9/9] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

This test now has two modes of operation:
1. (default) To check how much vCPU performance is affected by access
tracking (this mode already existed; it now also supports MGLRU aging).
2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are
faulting in memory.

Mode (1) also serves as a way to verify that aging is working properly
for pages only accessed by KVM. It will fail if the 0x8 lru_gen feature
bit is not present.

To support MGLRU, the test creates a memory cgroup, moves itself into
it, then uses the lru_gen debugfs output to track memory in that cgroup.
The logic to parse the lru_gen debugfs output has been put into
selftests/kvm/lib/lru_gen_util.c.

Co-developed-by: Axel Rasmussen <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/access_tracking_perf_test.c | 365 ++++++++++++++--
.../selftests/kvm/include/lru_gen_util.h | 55 +++
.../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++
4 files changed, 782 insertions(+), 30 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ac280dcba996..f34b4cdf6524 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ LIBKVM += lib/elf.c
LIBKVM += lib/guest_modes.c
LIBKVM += lib/io.c
LIBKVM += lib/kvm_util.c
+LIBKVM += lib/lru_gen_util.c
LIBKVM += lib/memstress.c
LIBKVM += lib/guest_sprintf.c
LIBKVM += lib/rbtree.c
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..15be99ff3bdc 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -38,6 +38,7 @@
#include <inttypes.h>
#include <limits.h>
#include <pthread.h>
+#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
@@ -47,6 +48,20 @@
#include "memstress.h"
#include "guest_modes.h"
#include "processor.h"
+#include "lru_gen_util.h"
+
+static const char *TEST_MEMCG_NAME = "access_tracking_perf_test";
+static const int LRU_GEN_ENABLED = 0x1;
+static const int LRU_GEN_MM_WALK = 0x2;
+static const int LRU_GEN_SECONDARY_MMU_WALK = 0x8;
+static const char *CGROUP_PROCS = "cgroup.procs";
+/*
+ * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hierarchy
+ * is mounted at cgroup_root.
+ *
+ * Can be changed with -r.
+ */
+static const char *cgroup_root = "/sys/fs/cgroup";

/* Global variable used to synchronize all of the vCPU threads. */
static int iteration;
@@ -62,6 +77,9 @@ static enum {
/* The iteration that was last completed by each vCPU. */
static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];

+/* The time at which the last iteration was completed */
+static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS];
+
/* Whether to overlap the regions of memory vCPUs access. */
static bool overlap_memory_access;

@@ -74,6 +92,12 @@ struct test_params {

/* The number of vCPUs to create in the VM. */
int nr_vcpus;
+
+ /* Whether to use lru_gen aging instead of idle page tracking. */
+ bool lru_gen;
+
+ /* Whether to test the performance of aging itself. */
+ bool benchmark_lru_gen;
};

static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)
@@ -89,6 +113,50 @@ static uint64_t pread_uint64(int fd, const char *filename, uint64_t index)

}

+static void write_file_long(const char *path, long v)
+{
+ FILE *f;
+
+ f = fopen(path, "w");
+ TEST_ASSERT(f, "fopen(%s) failed", path);
+ TEST_ASSERT(fprintf(f, "%ld\n", v) > 0,
+ "fprintf to %s failed", path);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", path);
+}
+
+static char *path_join(const char *parent, const char *child)
+{
+ char *out = NULL;
+
+ return asprintf(&out, "%s/%s", parent, child) >= 0 ? out : NULL;
+}
+
+static char *memcg_path(const char *memcg)
+{
+ return path_join(cgroup_root, memcg);
+}
+
+static char *memcg_file_path(const char *memcg, const char *file)
+{
+ char *mp = memcg_path(memcg);
+ char *fp;
+
+ if (!mp)
+ return NULL;
+ fp = path_join(mp, file);
+ free(mp);
+ return fp;
+}
+
+static void move_to_memcg(const char *memcg, pid_t pid)
+{
+ char *procs = memcg_file_path(memcg, CGROUP_PROCS);
+
+ TEST_ASSERT(procs, "Failed to construct cgroup.procs path");
+ write_file_long(procs, pid);
+ free(procs);
+}
+
#define PAGEMAP_PRESENT (1ULL << 63)
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)

@@ -242,6 +310,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args *vcpu_args)
};

vcpu_last_completed_iteration[vcpu_idx] = current_iteration;
+ clock_gettime(CLOCK_MONOTONIC,
+ &vcpu_last_completed_time[vcpu_idx]);
}
}

@@ -253,38 +323,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int target_iteration)
}
}

+static bool all_vcpus_done(int target_iteration, int nr_vcpus)
+{
+ for (int i = 0; i < nr_vcpus; ++i)
+ if (READ_ONCE(vcpu_last_completed_iteration[i]) !=
+ target_iteration)
+ return false;
+
+ return true;
+}
+
/* The type of memory accesses to perform in the VM. */
enum access_type {
ACCESS_READ,
ACCESS_WRITE,
};

-static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description)
+static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *description,
+ bool wait)
{
- struct timespec ts_start;
- struct timespec ts_elapsed;
int next_iteration, i;

/* Kick off the vCPUs by incrementing iteration. */
next_iteration = ++iteration;

- clock_gettime(CLOCK_MONOTONIC, &ts_start);
-
/* Wait for all vCPUs to finish the iteration. */
- for (i = 0; i < nr_vcpus; i++)
- spin_wait_for_vcpu(i, next_iteration);
+ if (wait) {
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);

- ts_elapsed = timespec_elapsed(ts_start);
- pr_info("%-30s: %ld.%09lds\n",
- description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ for (i = 0; i < nr_vcpus; i++)
+ spin_wait_for_vcpu(i, next_iteration);
+
+ ts_elapsed = timespec_elapsed(ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n",
+ description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ } else
+ pr_info("%-30s\n", description);
}

-static void access_memory(struct kvm_vm *vm, int nr_vcpus,
- enum access_type access, const char *description)
+static void _access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description,
+ bool wait)
{
memstress_set_write_percent(vm, (access == ACCESS_READ) ? 0 : 100);
iteration_work = ITERATION_ACCESS_MEMORY;
- run_iteration(vm, nr_vcpus, description);
+ run_iteration(vm, nr_vcpus, description, wait);
+}
+
+static void access_memory(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access, const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, true);
+}
+
+static void access_memory_async(struct kvm_vm *vm, int nr_vcpus,
+ enum access_type access,
+ const char *description)
+{
+ return _access_memory(vm, nr_vcpus, access, description, false);
}

static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
@@ -297,19 +397,111 @@ static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus)
*/
pr_debug("Marking VM memory idle (slow)...\n");
iteration_work = ITERATION_MARK_IDLE;
- run_iteration(vm, nr_vcpus, "Mark memory idle");
+ run_iteration(vm, nr_vcpus, "Mark memory idle", true);
}

-static void run_test(enum vm_guest_mode mode, void *arg)
+static void create_memcg(const char *memcg)
+{
+ const char *full_memcg_path = memcg_path(memcg);
+ int ret;
+
+ TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path");
+retry:
+ ret = mkdir(full_memcg_path, 0755);
+ if (ret && errno == EEXIST) {
+ TEST_ASSERT(!rmdir(full_memcg_path),
+ "Found existing memcg at %s, but rmdir failed",
+ full_memcg_path);
+ goto retry;
+ }
+ TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed",
+ full_memcg_path);
+
+ pr_info("Created memcg at %s\n", full_memcg_path);
+}
+
+/*
+ * Test lru_gen aging speed while vCPUs are faulting memory in.
+ *
+ * This test will run lru_gen aging until the vCPUs have finished all of
+ * the faulting work, reporting:
+ * - vcpu wall time (wall time for slowest vCPU)
+ * - average aging pass duration
+ * - total number of aging passes
+ * - total time spent aging
+ *
+ * This test produces the most useful results when the vcpu wall time and the
+ * total time spent aging are similar (i.e., we want to avoid timing aging
+ * while the vCPUs aren't doing any work).
+ */
+static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
{
- struct test_params *params = arg;
- struct kvm_vm *vm;
int nr_vcpus = params->nr_vcpus;
+ struct memcg_stats stats;
+ struct timespec ts_start, ts_max, ts_vcpus_elapsed,
+ ts_aging_elapsed, ts_aging_elapsed_avg;
+ int num_passes = 0;

- vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
- params->backing_src, !overlap_memory_access);
+ printf("Running lru_gen benchmark...\n");

- memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+ access_memory_async(vm, nr_vcpus, ACCESS_WRITE,
+ "Populating memory (async)");
+ while (!all_vcpus_done(iteration, nr_vcpus)) {
+ lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME);
+ ++num_passes;
+ }
+
+ ts_aging_elapsed = timespec_elapsed(ts_start);
+ ts_aging_elapsed_avg = timespec_div(ts_aging_elapsed, num_passes);
+
+ /* Find out when the slowest vCPU finished. */
+ ts_max = ts_start;
+ for (int i = 0; i < nr_vcpus; ++i) {
+ struct timespec *vcpu_ts = &vcpu_last_completed_time[i];
+
+ if (ts_max.tv_sec < vcpu_ts->tv_sec ||
+ (ts_max.tv_sec == vcpu_ts->tv_sec &&
+ ts_max.tv_nsec < vcpu_ts->tv_nsec))
+ ts_max = *vcpu_ts;
+ }
+
+ ts_vcpus_elapsed = timespec_sub(ts_max, ts_start);
+
+ pr_info("%-30s: %ld.%09lds\n", "vcpu wall time",
+ ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec);
+
+ pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n",
+ "lru_gen avg pass duration",
+ ts_aging_elapsed_avg.tv_sec,
+ ts_aging_elapsed_avg.tv_nsec,
+ num_passes,
+ ts_aging_elapsed.tv_sec,
+ ts_aging_elapsed.tv_nsec);
+}
+
+/*
+ * Test how much access tracking affects vCPU performance.
+ *
+ * Supports two modes of access tracking:
+ * - idle page tracking
+ * - lru_gen aging
+ *
+ * When using lru_gen, this test additionally verifies that the pages are in
+ * fact getting younger and older, otherwise the performance data would be
+ * invalid.
+ *
+ * The forced lru_gen aging can race with aging that occurs naturally.
+ */
+static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm,
+ struct test_params *params)
+{
+ int nr_vcpus = params->nr_vcpus;
+ bool lru_gen = params->lru_gen;
+ struct memcg_stats stats;
+ long total_pages = nr_vcpus * params->vcpu_memory_bytes / getpagesize();
+ int found_gens[5];

pr_info("\n");
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
@@ -319,11 +511,83 @@ static void run_test(enum vm_guest_mode mode, void *arg)
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");

/* Repeat on memory that has been marked as idle. */
- mark_memory_idle(vm, nr_vcpus);
+ if (lru_gen) {
+ /* Do an initial page table scan */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+ TEST_ASSERT(sum_memcg_stats(&stats) >= total_pages,
+ "Not all pages tracked in lru_gen stats.\n"
+ "Is lru_gen enabled? Did the memcg get created properly?");
+
+ /* Find the generation we're currently in (probably youngest) */
+ found_gens[0] = lru_gen_find_generation(&stats, total_pages);
+
+ /* Do an aging pass now */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation, but a newer generation has been made */
+ found_gens[1] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[1] == found_gens[0],
+ "unexpected gen change: %d vs. %d",
+ found_gens[1], found_gens[0]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
- mark_memory_idle(vm, nr_vcpus);
+
+ if (lru_gen) {
+ /* Scan the page tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[2] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[2] > found_gens[1],
+ "pages did not get younger");
+
+ /* Do another aging pass */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* Same generation; new generation has been made */
+ found_gens[3] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[3] == found_gens[2],
+ "unexpected gen change: %d vs. %d",
+ found_gens[3], found_gens[2]);
+ } else
+ mark_memory_idle(vm, nr_vcpus);
+
access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");

+ if (lru_gen) {
+ /* Scan the page tables again */
+ lru_gen_do_aging(&stats, TEST_MEMCG_NAME);
+
+ /* The pages should now be young again, so in a newer generation */
+ found_gens[4] = lru_gen_find_generation(&stats, total_pages);
+ TEST_ASSERT(found_gens[4] > found_gens[3],
+ "pages did not get younger");
+ }
+}
+
+static void setup_vm_and_run(enum vm_guest_mode mode, void *arg)
+{
+ struct test_params *params = arg;
+ int nr_vcpus = params->nr_vcpus;
+ struct kvm_vm *vm;
+
+ if (params->lru_gen) {
+ create_memcg(TEST_MEMCG_NAME);
+ move_to_memcg(TEST_MEMCG_NAME, getpid());
+ }
+
+ vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+ params->backing_src, !overlap_memory_access);
+
+ memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
+
+ if (params->benchmark_lru_gen)
+ run_benchmark(mode, vm, params);
+ else
+ run_test(mode, vm, params);
+
memstress_join_vcpu_threads(nr_vcpus);
memstress_destroy_vm(vm);
}
@@ -331,8 +595,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
static void help(char *name)
{
puts("");
- printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o] [-s mem_type]\n",
- name);
+ printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]"
+ " [-s mem_type] [-l] [-r memcg_root]\n", name);
puts("");
printf(" -h: Display this help message.");
guest_modes_help();
@@ -342,6 +606,9 @@ static void help(char *name)
printf(" -v: specify the number of vCPUs to run.\n");
printf(" -o: Overlap guest memory accesses instead of partitioning\n"
" them into a separate region of memory for each vCPU.\n");
+ printf(" -l: Use MGLRU aging instead of idle page tracking\n");
+ printf(" -p: Benchmark MGLRU aging while faulting memory in\n");
+ printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n");
backing_src_help("-s");
puts("");
exit(0);
@@ -353,13 +620,15 @@ int main(int argc, char *argv[])
.backing_src = DEFAULT_VM_MEM_SRC,
.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
.nr_vcpus = 1,
+ .lru_gen = false,
+ .benchmark_lru_gen = false,
};
int page_idle_fd;
int opt;

guest_modes_append_default();

- while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+ while ((opt = getopt(argc, argv, "hm:b:v:os:lr:p")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
@@ -376,6 +645,15 @@ int main(int argc, char *argv[])
case 's':
params.backing_src = parse_backing_src_type(optarg);
break;
+ case 'l':
+ params.lru_gen = true;
+ break;
+ case 'p':
+ params.benchmark_lru_gen = true;
+ break;
+ case 'r':
+ cgroup_root = strdup(optarg);
+ break;
case 'h':
default:
help(argv[0]);
@@ -383,12 +661,39 @@ int main(int argc, char *argv[])
}
}

- page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
- __TEST_REQUIRE(page_idle_fd >= 0,
- "CONFIG_IDLE_PAGE_TRACKING is not enabled");
- close(page_idle_fd);
+ if (!params.lru_gen) {
+ page_idle_fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
+ __TEST_REQUIRE(page_idle_fd >= 0,
+ "CONFIG_IDLE_PAGE_TRACKING is not enabled");
+ close(page_idle_fd);
+ } else {
+ int lru_gen_fd, lru_gen_debug_fd;
+ long mglru_features;
+ char mglru_feature_str[8] = {};
+
+ lru_gen_fd = open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY);
+ __TEST_REQUIRE(lru_gen_fd >= 0,
+ "CONFIG_LRU_GEN is not enabled");
+ TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0,
+ "couldn't read lru_gen features");
+ mglru_features = strtol(mglru_feature_str, NULL, 16);
+ __TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED,
+ "lru_gen is not enabled");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK,
+ "lru_gen does not support MM_WALK");
+ __TEST_REQUIRE(mglru_features & LRU_GEN_SECONDARY_MMU_WALK,
+ "lru_gen does not support SECONDARY_MMU_WALK");
+
+ lru_gen_debug_fd = open(DEBUGFS_LRU_GEN, O_RDWR);
+ __TEST_REQUIRE(lru_gen_debug_fd >= 0,
+ "Cannot access %s", DEBUGFS_LRU_GEN);
+ close(lru_gen_debug_fd);
+ }
+
+ TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen,
+ "-p specified without -l");

- for_each_guest_mode(run_test, &params);
+ for_each_guest_mode(setup_vm_and_run, &params);

return 0;
}
diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/testing/selftests/kvm/include/lru_gen_util.h
new file mode 100644
index 000000000000..4eef8085a3cb
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/lru_gen_util.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Tools for integrating with lru_gen, like parsing the lru_gen debugfs output.
+ *
+ * Copyright (C) 2024, Google LLC.
+ */
+#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H
+#define SELFTEST_KVM_LRU_GEN_UTIL_H
+
+#include <inttypes.h>
+#include <limits.h>
+#include <stdlib.h>
+
+#include "test_util.h"
+
+#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */
+#define MAX_NR_NODES 4 /* Maximum number of nodes we support */
+
+static const char *DEBUGFS_LRU_GEN = "/sys/kernel/debug/lru_gen";
+
+struct generation_stats {
+ int gen;
+ long age_ms;
+ long nr_anon;
+ long nr_file;
+};
+
+struct node_stats {
+ int node;
+ int nr_gens; /* Number of populated gens entries. */
+ struct generation_stats gens[MAX_NR_GENS];
+};
+
+struct memcg_stats {
+ unsigned long memcg_id;
+ int nr_nodes; /* Number of populated nodes entries. */
+ struct node_stats nodes[MAX_NR_NODES];
+};
+
+void print_memcg_stats(const struct memcg_stats *stats, const char *name);
+
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg);
+
+long sum_memcg_stats(const struct memcg_stats *stats);
+
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg);
+
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg);
+
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages);
+
+#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ * node (id)
+ * (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+ bool consumed; /* Whether or not this line was consumed */
+ /* Next parse handler to invoke */
+ void (*next_handler)(struct memcg_stats *,
+ struct memcg_stats_parse_context *, char *);
+ int current_node_idx; /* Current index in nodes array */
+ const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line);
+
+struct split_iterator {
+ char *str;
+ char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+ char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+ it->str = NULL;
+ return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *memcg_id = split_next(&it);
+ char *memcg_name = split_next(&it);
+ char *end;
+
+ ctx->consumed = true;
+
+ if (!prefix || strcmp("memcg", prefix))
+ return; /* Not a memcg line (maybe empty), skip */
+
+ TEST_ASSERT(memcg_id && memcg_name,
+ "malformed memcg line; no memcg id or memcg_name");
+
+ if (strcmp(memcg_name + 1, ctx->name))
+ return; /* Wrong memcg, skip */
+
+ /* Found it! */
+
+ stats->memcg_id = strtoul(memcg_id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+ if (!stats->memcg_id)
+ return; /* Removed memcg? */
+
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+}
+
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ struct split_iterator it = { .str = line };
+ char *prefix = split_next(&it);
+ char *id = split_next(&it);
+ long found_node_id;
+ char *end;
+
+ ctx->consumed = true;
+ ctx->current_node_idx = -1;
+
+ if (!prefix)
+ return; /* Skip empty lines */
+
+ if (!strcmp("memcg", prefix)) {
+ /* Memcg done, found next one; stop. */
+ ctx->next_handler = NULL;
+ return;
+ } else if (strcmp("node", prefix))
+ TEST_ASSERT(false, "found malformed line after 'memcg ...',"
+ "token: '%s'", prefix);
+
+ /* At this point we know we have a node line. Parse the ID. */
+
+ TEST_ASSERT(id, "malformed node line; no node id");
+
+ found_node_id = strtol(id, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+ ctx->current_node_idx = stats->nr_nodes++;
+ TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+ "memcg has stats for too many nodes, max is %d",
+ MAX_NR_NODES);
+ stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+ ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+ struct memcg_stats_parse_context *ctx,
+ char *line)
+{
+ /* Have to copy since we might not consume */
+ char *my_line = strdup(line);
+ struct split_iterator it = { .str = my_line };
+ char *gen, *age, *nr_anon, *nr_file;
+ struct node_stats *node_stats;
+ struct generation_stats *gen_stats;
+ char *end;
+
+ TEST_ASSERT(it.str, "failed to copy input line");
+
+ gen = split_next(&it);
+
+ /* Skip empty lines */
+ if (!gen)
+ goto out_consume;
+
+ if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+ /*
+ * Reached next memcg or node section. Don't consume, let the
+ * other handler deal with this.
+ */
+ ctx->next_handler = memcg_stats_handle_in_memcg;
+ goto out;
+ }
+
+ node_stats = &stats->nodes[ctx->current_node_idx];
+ TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+ "found too many generation lines; max is %d",
+ MAX_NR_GENS);
+ gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+ age = split_next(&it);
+ nr_anon = split_next(&it);
+ nr_file = split_next(&it);
+
+ TEST_ASSERT(age && nr_anon && nr_file,
+ "malformed generation line; not enough tokens");
+
+ gen_stats->gen = (int)strtol(gen, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+ gen_stats->age_ms = strtol(age, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+ gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+ nr_anon);
+
+ gen_stats->nr_file = strtol(nr_file, &end, 10);
+ TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+ ctx->consumed = true;
+out:
+ free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+ int node, gen;
+
+ fprintf(stderr, "stats for memcg %s (id %lu):\n",
+ name, stats->memcg_id);
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ const struct generation_stats *gstats =
+ &stats->nodes[node].gens[gen];
+
+ fprintf(stderr,
+ "\t\tgen %d\tage_ms %ld"
+ "\tnr_anon %ld\tnr_file %ld\n",
+ gstats->gen, gstats->age_ms, gstats->nr_anon,
+ gstats->nr_file);
+ }
+ }
+}
+
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ FILE *f;
+ ssize_t read = 0;
+ char *line = NULL;
+ size_t bufsz;
+ struct memcg_stats_parse_context ctx = {
+ .next_handler = memcg_stats_handle_searching,
+ .name = memcg,
+ };
+
+ memset(stats, 0, sizeof(struct memcg_stats));
+
+ f = fopen(DEBUGFS_LRU_GEN, "r");
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+ while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+ ctx.consumed = false;
+
+ do {
+ ctx.next_handler(stats, &ctx, line);
+ if (!ctx.next_handler)
+ break;
+ } while (!ctx.consumed);
+ }
+
+ if (read < 0 && !feof(f))
+ TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+ TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+ "Did the memcg get created in the proper mount?",
+ memcg);
+ if (line)
+ free(line);
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, look for all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+ const struct memcg_stats *stats)
+{
+ int node, gen;
+ long total_nr = 0;
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ const struct node_stats *node_stats = &stats->nodes[node];
+
+ for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+ const struct generation_stats *gen_stats =
+ &node_stats->gens[gen];
+
+ if (target_gen >= 0 && gen_stats->gen != target_gen)
+ continue;
+
+ total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+ }
+ }
+
+ return total_nr;
+}
+
+/* Find all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+ return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+ read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+ print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * Whether lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+ FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+ char *command;
+ size_t sz;
+
+ TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+ sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+ memcg_id, node_id, max_gen, force_scan);
+ TEST_ASSERT(sz > 0, "creating aging command failed");
+
+ pr_debug("Running aging command: %s", command);
+ if (fwrite(command, sizeof(char), sz, f) < sz) {
+ TEST_ASSERT(false, "writing aging command %s to %s failed",
+ command, DEBUGFS_LRU_GEN);
+ }
+
+ TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+ bool verbose)
+{
+ int node, gen;
+ struct timespec ts_start;
+ struct timespec ts_elapsed;
+
+ pr_debug("lru_gen: invoking aging...\n");
+
+ /* Must read memcg stats to construct the proper aging command. */
+ read_print_memcg_stats(stats, memcg);
+
+ if (verbose)
+ clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+ for (node = 0; node < stats->nr_nodes; ++node) {
+ int max_gen = 0;
+
+ for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+ int this_gen = stats->nodes[node].gens[gen].gen;
+
+ max_gen = max_gen > this_gen ? max_gen : this_gen;
+ }
+
+ run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+ max_gen);
+ }
+
+ if (verbose) {
+ ts_elapsed = timespec_elapsed(ts_start);
+ pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+ ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+ }
+
+ /* Re-read so callers get updated information */
+ read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+ return _lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+ unsigned long total_pages)
+{
+ int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+ for (node = 0; node < stats->nr_nodes; ++node)
+ for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+ ++gen_idx) {
+ gen = stats->nodes[node].gens[gen_idx].gen;
+ max_gen = gen > max_gen ? gen : max_gen;
+ min_gen = gen < min_gen ? gen : min_gen;
+ }
+
+ for (gen = min_gen; gen < max_gen; ++gen)
+ /* See if the most pages are in this generation. */
+ if (sum_memcg_stats_for_gen(gen, stats) >
+ total_pages / 2)
+ return gen;
+
+ TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+ total_pages);
+
+ /* unreachable, but make the compiler happy */
+ return -1;
+}
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 00:33:10

by James Houghton

[permalink] [raw]
Subject: [PATCH v5 1/9] KVM: Add lockless memslot walk to KVM

Give architectures the flexibility to synchronize as optimally as they
can instead of always taking the MMU lock for writing.

Architectures that do their own locking must select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.

Suggested-by: Yu Zhao <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 26 +++++++++++++++++++-------
3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 692c01e41a18..4d7c3e8632e6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -266,6 +266,7 @@ struct kvm_gfn_range {
gfn_t end;
union kvm_mmu_notifier_arg arg;
bool may_block;
+ bool lockless;
};
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 29b73eedfe74..0404857c1702 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -97,6 +97,9 @@ config KVM_GENERIC_MMU_NOTIFIER
select MMU_NOTIFIER
bool

+config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
+ bool
+
config KVM_GENERIC_MEMORY_ATTRIBUTES
depends on KVM_GENERIC_MMU_NOTIFIER
bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 14841acb8b95..d8fa0d617f12 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -558,6 +558,7 @@ struct kvm_mmu_notifier_range {
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
+ bool lockless;
};

/*
@@ -612,6 +613,10 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
IS_KVM_NULL_FN(range->handler)))
return r;

+ /* on_lock will never be called for lockless walks */
+ if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+ return r;
+
idx = srcu_read_lock(&kvm->srcu);

for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -643,15 +648,18 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
gfn_range.slot = slot;
+ gfn_range.lockless = range->lockless;

if (!r.found_memslot) {
r.found_memslot = true;
- KVM_MMU_LOCK(kvm);
- if (!IS_KVM_NULL_FN(range->on_lock))
- range->on_lock(kvm);
-
- if (IS_KVM_NULL_FN(range->handler))
- break;
+ if (!range->lockless) {
+ KVM_MMU_LOCK(kvm);
+ if (!IS_KVM_NULL_FN(range->on_lock))
+ range->on_lock(kvm);
+
+ if (IS_KVM_NULL_FN(range->handler))
+ break;
+ }
}
r.ret |= range->handler(kvm, &gfn_range);
}
@@ -660,7 +668,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);

- if (r.found_memslot)
+ if (r.found_memslot && !range->lockless)
KVM_MMU_UNLOCK(kvm);

srcu_read_unlock(&kvm->srcu, idx);
@@ -681,6 +689,8 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = true,
.may_block = false,
+ .lockless =
+ IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
@@ -699,6 +709,8 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.on_lock = (void *)kvm_null_fn,
.flush_on_ret = false,
.may_block = false,
+ .lockless =
+ IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
--
2.45.2.505.gda0bf45e8d-goog


2024-06-11 05:44:36

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
>
> This new notifier is for multi-gen LRU specifically

Let me call it out before others do: we can't be this self-serving.

> as it wants to be
> able to get and clear age information from secondary MMUs only if it can
> be done "fast".
>
> By having this notifier specifically created for MGLRU, what "fast"
> means comes down to what is "fast" enough to improve MGLRU's ability to
> reclaim most of the time.
>
> Signed-off-by: James Houghton <[email protected]>

If we'd like this to pass other MM reviewers, especially the MMU
notifier maintainers, we'd need to design a generic API that can
benefit all the *existing* users: idle page tracking [1], DAMON [2]
and MGLRU.

Also I personally prefer to extend the existing callbacks by adding
new parameters, and on top of that, I'd try to consolidate the
existing callbacks -- it'd be less of a hard sell if my changes result
in less code, not more.

(v2 did all these, btw.)

[1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
[2] https://www.kernel.org/doc/html/latest/mm/damon/index.html

2024-06-11 06:00:00

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Tue, Jun 11, 2024 at 12:21:39AM +0000, James Houghton wrote:
> Replace the MMU write locks (taken in the memslot iteration loop) for
> read locks.
>
> Grabbing the read lock instead of the write lock is safe because the
> only requirement we have is that the stage-2 page tables do not get
> deallocated while we are walking them. The stage2_age_walker() callback
> is safe to race with itself; update the comment to reflect the
> synchronization change.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
> arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
> 3 files changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 58f09370d17e..7a1af8141c0e 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -22,6 +22,7 @@ menuconfig KVM
> select KVM_COMMON
> select KVM_GENERIC_HARDWARE_ENABLING
> select KVM_GENERIC_MMU_NOTIFIER
> + select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> select HAVE_KVM_CPU_RELAX_INTERCEPT
> select KVM_MMIO
> select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 9e2bbee77491..b1b0f7148cff 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> data->young = true;
>
> /*
> - * stage2_age_walker() is always called while holding the MMU lock for
> - * write, so this will always succeed. Nonetheless, this deliberately
> - * follows the race detection pattern of the other stage-2 walkers in
> - * case the locking mechanics of the MMU notifiers is ever changed.
> + * This walk may not be exclusive; the PTE is permitted to change

s/may not/is not/

> + * from under us. If there is a race to update this PTE, then the
> + * GFN is most likely young, so failing to clear the AF is likely
> + * to be inconsequential.
> */
> if (data->mkold && !stage2_try_set_pte(ctx, new))
> return -EAGAIN;
> @@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
> struct kvm_pgtable_walker walker = {
> .cb = stage2_age_walker,
> .arg = &data,
> - .flags = KVM_PGTABLE_WALK_LEAF,
> + .flags = KVM_PGTABLE_WALK_LEAF |
> + KVM_PGTABLE_WALK_SHARED,
> };
> + int r;
>
> - WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
> + r = kvm_pgtable_walk(pgt, addr, size, &walker);
> + WARN_ON(r && r != -EAGAIN);

I could've been more explicit last time around, could you please tone
this down to WARN_ON_ONCE() as well?

> return data.young;
> }
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 8bcab0cc3fe9..a62c27a347ed 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> u64 size = (range->end - range->start) << PAGE_SHIFT;
> + bool young = false;
> +
> + read_lock(&kvm->mmu_lock);
>
> if (!kvm->arch.mmu.pgt)
> return false;

I'm guessing you meant to have 'goto out' here, since this early return
fails to drop the mmu_lock.

--
Thanks,
Oliver

2024-06-11 16:52:44

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> >
> > This new notifier is for multi-gen LRU specifically
>
> Let me call it out before others do: we can't be this self-serving.
>
> > as it wants to be
> > able to get and clear age information from secondary MMUs only if it can
> > be done "fast".
> >
> > By having this notifier specifically created for MGLRU, what "fast"
> > means comes down to what is "fast" enough to improve MGLRU's ability to
> > reclaim most of the time.
> >
> > Signed-off-by: James Houghton <[email protected]>
>
> If we'd like this to pass other MM reviewers, especially the MMU
> notifier maintainers, we'd need to design a generic API that can
> benefit all the *existing* users: idle page tracking [1], DAMON [2]
> and MGLRU.
>
> Also I personally prefer to extend the existing callbacks by adding
> new parameters, and on top of that, I'd try to consolidate the
> existing callbacks -- it'd be less of a hard sell if my changes result
> in less code, not more.
>
> (v2 did all these, btw.)

I think consolidating the callbacks is cleanest, like you had it in
v2. I really wasn't sure about this change honestly, but it was my
attempt to incorporate feedback like this[3] from v4. I'll consolidate
the callbacks like you had in v2.

Instead of the bitmap like you had, I imagine we'll have some kind of
flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
that sound ok?
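
Something like this, roughly (names and bit positions are purely
illustrative, not final):

/* Hypothetical flag bits for a consolidated young notifier. */
#define MMU_NOTIFIER_YOUNG_CLEAR	(1UL << 0) /* clear the A bit, don't just test it */
#define MMU_NOTIFIER_YOUNG_FAST_ONLY	(1UL << 1) /* only consult MMUs that can answer cheaply */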

Do idle page tracking and DAMON need this new "fast-only" notifier? Or
do they benefit from a generic API in other ways? Sorry if I missed
this from some other mail.

I've got feedback saying that tying the definition of "fast" to MGLRU
specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY,
maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU
notifier". It sounds like you'd prefer the more generic one.

Thanks for the feedback -- I don't want to keep this series lingering
on the list, so I'll try and get newer versions out sooner rather than
later.

[3]: https://lore.kernel.org/linux-mm/[email protected]/

>
> [1] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
> [2] https://www.kernel.org/doc/html/latest/mm/damon/index.html

2024-06-11 16:54:10

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn

On Mon, Jun 10, 2024 at 10:58 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:21:39AM +0000, James Houghton wrote:
> > Replace the MMU write locks (taken in the memslot iteration loop) for
> > read locks.
> >
> > Grabbing the read lock instead of the write lock is safe because the
> > only requirement we have is that the stage-2 page tables do not get
> > deallocated while we are walking them. The stage2_age_walker() callback
> > is safe to race with itself; update the comment to reflect the
> > synchronization change.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > arch/arm64/kvm/Kconfig | 1 +
> > arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
> > arch/arm64/kvm/mmu.c | 26 ++++++++++++++++++++------
> > 3 files changed, 30 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 58f09370d17e..7a1af8141c0e 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -22,6 +22,7 @@ menuconfig KVM
> > select KVM_COMMON
> > select KVM_GENERIC_HARDWARE_ENABLING
> > select KVM_GENERIC_MMU_NOTIFIER
> > + select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
> > select HAVE_KVM_CPU_RELAX_INTERCEPT
> > select KVM_MMIO
> > select KVM_GENERIC_DIRTYLOG_READ_PROTECT
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 9e2bbee77491..b1b0f7148cff 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > data->young = true;
> >
> > /*
> > - * stage2_age_walker() is always called while holding the MMU lock for
> > - * write, so this will always succeed. Nonetheless, this deliberately
> > - * follows the race detection pattern of the other stage-2 walkers in
> > - * case the locking mechanics of the MMU notifiers is ever changed.
> > + * This walk may not be exclusive; the PTE is permitted to change
>
> s/may not/is not/

Will fix.

>
> > + * from under us. If there is a race to update this PTE, then the
> > + * GFN is most likely young, so failing to clear the AF is likely
> > + * to be inconsequential.
> > */
> > if (data->mkold && !stage2_try_set_pte(ctx, new))
> > return -EAGAIN;
> > @@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_age_walker,
> > .arg = &data,
> > - .flags = KVM_PGTABLE_WALK_LEAF,
> > + .flags = KVM_PGTABLE_WALK_LEAF |
> > + KVM_PGTABLE_WALK_SHARED,
> > };
> > + int r;
> >
> > - WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
> > + r = kvm_pgtable_walk(pgt, addr, size, &walker);
> > + WARN_ON(r && r != -EAGAIN);
>
> I could've been more explicit last time around, could you please tone
> this down to WARN_ON_ONCE() as well?

Will do, thanks.

>
> > return data.young;
> > }
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 8bcab0cc3fe9..a62c27a347ed 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1773,25 +1773,39 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > u64 size = (range->end - range->start) << PAGE_SHIFT;
> > + bool young = false;
> > +
> > + read_lock(&kvm->mmu_lock);
> >
> > if (!kvm->arch.mmu.pgt)
> > return false;
>
> I'm guessing you meant to have 'goto out' here, since this early return
> fails to drop the mmu_lock.

Ah sorry! Thanks for catching this.

2024-06-11 18:54:48

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.

Establishing motivation for a change is always a good idea. The wording
could be a bit crisper, but the connection between the new MMU notifier
and MGLRU is valuable. I do not view the wording of the changeset as
excluding other users of the 'fast' notifier.

> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

My strong preference is to have the caller's expectations of the
secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
abundantly clear both at the callsite and in the implementation.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?
>
> Do idle page tracking and DAMON need this new "fast-only" notifier? Or
> do they benefit from a generic API in other ways? Sorry if I missed
> this from some other mail.

Let's also keep in mind we aren't establishing an ABI here. If we have
direct line of sight (i.e. patches) on how to leverage the new MMU
notifier for DAMON and idle page tracking then great, let's try and
build something that satisfies all users. Otherwise, it isn't
that big of a deal if the interface needs to change slightly when
someone decides to leverage the MMU notifier for something else.

> I've got feedback saying that tying the definition of "fast" to MGLRU
> specifically is helpful. So instead of MMU_NOTIFIER_YOUNG_FAST_ONLY,
> maybe MMU_NOTIFIER_YOUNG_LRU_GEN_FAST to mean "do fast-for-MGLRU
> notifier". It sounds like you'd prefer the more generic one.
>
> Thanks for the feedback -- I don't want to keep this series lingering
> on the list, so I'll try and get newer versions out sooner rather than
> later.

Let's make sure we get alignment on this before you proceed, I don't get
the sense that we're getting to a common understanding of where to go
with this.

--
Thanks,
Oliver

2024-06-11 19:42:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, James Houghton wrote:
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
>
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

James, wait for others to chime in before committing yourself to a course of
action, otherwise you're going to get ping-ponged to hell and back.

> Instead of the bitmap like you had, I imagine we'll have some kind of
> flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> that sound ok?

Why do we need a bundle of flags? If we extend .clear_young() and .test_young()
as Yu suggests, then we only need a single "bool fast_only".

As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
Extending will require a small amount of additional churn, e.g. to pass in false,
but that doesn't seem problematic on its own. On the plus side, there would be
less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
macros :-) ).

E.g.

--
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7b77ad6cf833..07872ae00fa6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,

int __mmu_notifier_clear_young(struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool fast_only)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
- if (subscription->ops->clear_young)
- young |= subscription->ops->clear_young(subscription,
- mm, start, end);
+ if (!subscription->ops->clear_young ||
+ fast_only && !subscription->ops->has_fast_aging)
+ continue;
+
+ young |= subscription->ops->clear_young(subscription,
+ mm, start, end);
}
srcu_read_unlock(&srcu, id);

@@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
}

int __mmu_notifier_test_young(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ bool fast_only)
{
struct mmu_notifier *subscription;
int young = 0, id;
@@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
hlist_for_each_entry_rcu(subscription,
&mm->notifier_subscriptions->list, hlist,
srcu_read_lock_held(&srcu)) {
- if (subscription->ops->test_young) {
- young = subscription->ops->test_young(subscription, mm,
- address);
- if (young)
- break;
- }
+ if (!subscription->ops->test_young)
+ continue;
+
+ if (fast_only && !subscription->ops->has_fast_aging)
+ continue;
+
+ young = subscription->ops->test_young(subscription, mm, address);
+ if (young)
+ break;
}
srcu_read_unlock(&srcu, id);
--

It might also require multiplexing the return value to differentiate between
"young" and "failed". Ugh, but the code already does that, just in a bespoke way.

Double ugh. Peeking ahead at the "failure" code, NAK to adding
kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
that, I will NAK each and every attempt to have core mm/ code call directly into KVM.

Anyways, back to this code, before we spin another version, we need to agree on
exactly what behavior we want out of secondary MMUs. Because to me, the behavior
proposed in this version doesn't make any sense.

Signalling failure because KVM _might_ have relevant aging information in SPTEs
that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
KVM should return "young", not "failed".

If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
if and only if L1 ran a nested VM at some point. But as proposed, KVM doesn't
actually check if there are any shadow TDP entries to process. That could be
fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.

E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
L2, then "failure" is pretty much guaranteed to a false positive. And even for
the pages that are exposed to L2, "failure" will occur if and only if the pages
are being accessed _only_ by L2.

There most definitely are use cases where the majority of a VM's memory is accessed
only by L2. But if those use cases are performing poorly under MGLRU, then IMO
we should figure out a way to enhance KVM to do a fast harvest of nested TDP
Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
fails is suboptimal.

So rather than failing the fast aging, I think what we want is to know if an
mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
is an optional optimization to avoid taking mmu_lock for write in paths where a
(very rare) false negative is acceptable.

static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
{
return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
}

static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
bool fast_only)
{
int young = 0;

if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
write_lock(&kvm->mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(&kvm->mmu_lock);
}

if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
young = 1 | MMU_NOTIFY_WAS_FAST;

return (int)young;
}

and then in lru_gen_look_around():

if (spin_is_contended(pvmw->ptl))
return false;

/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
return false;

young = ptep_clear_young_notify(vma, addr, pte);
if (!young)
return false;

if (!(young & MMU_NOTIFY_WAS_FAST))
return true;

young = 1;

with the lookaround done using ptep_clear_young_notify_fast().

The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
needing to update all users of ptep_clear_young_notify() and friends.

2024-06-11 19:50:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, Oliver Upton wrote:
> On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> My strong preference is to have the caller's expectations of the
> secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> abundantly clear both at the callsite and in the implementation.

Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
new ptep_clear_young_notify_fast_only() does the obvious.

On the back end, odds are very good KVM is going to squish the "fast" and "slow"
paths back into a common helper, so IMO having dedicated fast_only() APIs for the
mmu_notifier hooks doesn't add much value in the end.

I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
that passing a fast_only flag will end up being cleaner for all parties.

2024-06-11 20:40:02

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 10:50 AM James Houghton <[email protected]> wrote:
>
> On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > >
> > > This new notifier is for multi-gen LRU specifically
> >
> > Let me call it out before others do: we can't be this self-serving.
> >
> > > as it wants to be
> > > able to get and clear age information from secondary MMUs only if it can
> > > be done "fast".
> > >
> > > By having this notifier specifically created for MGLRU, what "fast"
> > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > reclaim most of the time.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> >
> > If we'd like this to pass other MM reviewers, especially the MMU
> > notifier maintainers, we'd need to design a generic API that can
> > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > and MGLRU.
> >
> > Also I personally prefer to extend the existing callbacks by adding
> > new parameters, and on top of that, I'd try to consolidate the
> > existing callbacks -- it'd be less of a hard sell if my changes result
> > in less code, not more.
> >
> > (v2 did all these, btw.)
>
> I think consolidating the callbacks is cleanest, like you had it in
> v2. I really wasn't sure about this change honestly, but it was my
> attempt to incorporate feedback like this[3] from v4. I'll consolidate
> the callbacks like you had in v2.

Only if you think updating the MMU notifier API is overall the best option.

As I said earlier, I don't mind not doing the extra work, i.e., the
bitmap/_fast_only parameters and the _FAST_YOUNG return value, as long
as we know where we win and lose, i.e., simplicity and regressions. I
would call all the potential regressions (the nested VM case, the
arm64 case and the memcached benchmark) minor, compared with what we
got from v2. But my opinion doesn't really matter much because I'm not
your customer :)

2024-06-11 23:05:39

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > On Mon, Jun 10, 2024 at 10:34 PM Yu Zhao <[email protected]> wrote:
> > >
> > > On Mon, Jun 10, 2024 at 6:22 PM James Houghton <[email protected]> wrote:
> > > >
> > > > This new notifier is for multi-gen LRU specifically
> > >
> > > Let me call it out before others do: we can't be this self-serving.
> > >
> > > > as it wants to be
> > > > able to get and clear age information from secondary MMUs only if it can
> > > > be done "fast".
> > > >
> > > > By having this notifier specifically created for MGLRU, what "fast"
> > > > means comes down to what is "fast" enough to improve MGLRU's ability to
> > > > reclaim most of the time.
> > > >
> > > > Signed-off-by: James Houghton <[email protected]>
> > >
> > > If we'd like this to pass other MM reviewers, especially the MMU
> > > notifier maintainers, we'd need to design a generic API that can
> > > benefit all the *existing* users: idle page tracking [1], DAMON [2]
> > > and MGLRU.
> > >
> > > Also I personally prefer to extend the existing callbacks by adding
> > > new parameters, and on top of that, I'd try to consolidate the
> > > existing callbacks -- it'd be less of a hard sell if my changes result
> > > in less code, not more.
> > >
> > > (v2 did all these, btw.)
> >
> > I think consolidating the callbacks is cleanest, like you had it in
> > v2. I really wasn't sure about this change honestly, but it was my
> > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > the callbacks like you had in v2.
>
> James, wait for others to chime in before committing yourself to a course of
> action, otherwise you're going to get ping-ponged to hell and back.

Ah yeah. I really mean "I'll do it, provided the other feedback is in
line with this".

>
> > Instead of the bitmap like you had, I imagine we'll have some kind of
> > flags argument that has bits like MMU_NOTIFIER_YOUNG_CLEAR,
> > MMU_NOTIFIER_YOUNG_FAST_ONLY, and other ones as they come up. Does
> > that sound ok?
>
> Why do we need a bundle of flags? If we extend .clear_young() and .test_young()
> as Yu suggests, then we only need a single "bool fast_only".

We don't need to. In my head it's a little easier to collapse them
(slightly less code, and at the callsite you have a flag with a name
instead of a true/false). Making it a bool SGTM.

> As for adding a fast_only versus dedicated APIs, I don't have a strong preference.
> Extending will require a small amount of additional churn, e.g. to pass in false,
> but that doesn't seem problematic on its own. On the plus side, there would be
> less copy+paste in include/linux/mmu_notifier.h (though that could be solved with
> macros :-) ).

I think having the extra bool is cleaner than the new fast_only
notifier, definitely.

>
> E.g.
>
> --
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7b77ad6cf833..07872ae00fa6 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
>
> int __mmu_notifier_clear_young(struct mm_struct *mm,
> unsigned long start,
> - unsigned long end)
> + unsigned long end,
> + bool fast_only)
> {
> struct mmu_notifier *subscription;
> int young = 0, id;
> @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> hlist_for_each_entry_rcu(subscription,
> &mm->notifier_subscriptions->list, hlist,
> srcu_read_lock_held(&srcu)) {
> - if (subscription->ops->clear_young)
> - young |= subscription->ops->clear_young(subscription,
> - mm, start, end);
> + if (!subscription->ops->clear_young ||
> + fast_only && !subscription->ops->has_fast_aging)
> + continue;
> +
> + young |= subscription->ops->clear_young(subscription,
> + mm, start, end);

KVM changing has_fast_aging dynamically would be slow, wouldn't it? I
feel like it's simpler to just pass in fast_only into `clear_young`
itself (and this is how I interpreted what you wrote above anyway).
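
i.e., roughly this shape for the callback (purely illustrative, not
meant to be the final signature):

	int (*clear_young)(struct mmu_notifier *subscription,
			   struct mm_struct *mm,
			   unsigned long start, unsigned long end,
			   bool fast_only);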

> }
> srcu_read_unlock(&srcu, id);
>
> @@ -403,7 +407,8 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> }
>
> int __mmu_notifier_test_young(struct mm_struct *mm,
> - unsigned long address)
> + unsigned long address,
> + bool fast_only)
> {
> struct mmu_notifier *subscription;
> int young = 0, id;
> @@ -412,12 +417,15 @@ int __mmu_notifier_test_young(struct mm_struct *mm,
> hlist_for_each_entry_rcu(subscription,
> &mm->notifier_subscriptions->list, hlist,
> srcu_read_lock_held(&srcu)) {
> - if (subscription->ops->test_young) {
> - young = subscription->ops->test_young(subscription, mm,
> - address);
> - if (young)
> - break;
> - }
> + if (!subscription->ops->test_young)
> + continue;
> +
> + if (fast_only && !subscription->ops->has_fast_aging)
> + continue;
> +
> + young = subscription->ops->test_young(subscription, mm, address);
> + if (young)
> + break;
> }
> srcu_read_unlock(&srcu, id);
> --
>
> It might also require multiplexing the return value to differentiate between
> "young" and "failed". Ugh, but the code already does that, just in a bespoke way.

Yeah, that is necessary.

> Double ugh. Peeking ahead at the "failure" code, NAK to adding
> kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> that, I will NAK each and every attempt to have core mm/ code call directly into KVM.

Sorry to make you repeat yourself; I'll leave it out of v6. I don't
like it either, but I wasn't sure how important it was to avoid
calling into unnecessary notifiers if the TDP MMU were completely
disabled.

> Anyways, back to this code, before we spin another version, we need to agree on
> exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> proposed in this version doesn't make any sense.
>
> Signalling failure because KVM _might_ have relevant aging information in SPTEs
> that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> KVM should return "young", not "failed".

Sorry for this oversight. What about something like:

1. test (and maybe clear) A bits on TDP MMU
2. If accessed && !should_clear: return (fast)
3. if (fast_only): return (fast)
4. If !(must check shadow MMU): return (fast)
5. test (and maybe clear) A bits in shadow MMU
6. return (slow)

Some of this reordering (and maybe a change from
kvm_shadow_root_allocated() to checking indirect_shadow_pages or
something else) can be done in its own patch.

> If KVM is using the TDP MMU, i.e. has_fast_aging=true, then there will be rmaps
> if and only if L1 ran a nested VM at some point. But as proposed, KVM doesn't
> actually check if there are any shadow TDP entries to process. That could be
> fixed by looking at kvm->arch.indirect_shadow_pages, but even then it's not clear
> that bailing if kvm->arch.indirect_shadow_pages > 0 makes sense.
>
> E.g. if L1 happens to be running an L2, but <10% of the VM's memory is exposed to
> L2, then "failure" is pretty much guaranteed to a false positive. And even for
> the pages that are exposed to L2, "failure" will occur if and only if the pages
> are being accessed _only_ by L2.
>
> There most definitely are use cases where the majority of a VM's memory is accessed
> only by L2. But if those use cases are performing poorly under MGLRU, then IMO
> we should figure out a way to enhance KVM to do a fast harvest of nested TDP
> Accessed information, not make MGLRU+KVM suck for VMs that run nested VMs.

This makes sense. I don't have data today to show that speeding up
harvesting of Accessed information from the shadow MMU would be a huge
win. Getting this information for the TDP MMU is at least better than
no information at all.

>
> Oh, and calling into mmu_notifiers to do the "slow" version if the fast version
> fails is suboptimal.

Agreed. I didn't like this when I wrote it. This can easily be fixed
by making mmu_notifier_clear_young() return both a "was fast"
indication and the "young" result, which I will do.
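
To spell it out, I'm picturing something like the sketch below. The
encoding and the caller-side check are only an illustration (the
MMU_NOTIFY_WAS_FAST name is borrowed from your snippet further down,
and do_look_around is just a stand-in for whatever MGLRU ends up
doing); exact names/bits are TBD:

        /*
         * Sketch: bit 0 keeps the existing "young" meaning; an extra bit
         * reports that the answer came from a fast (lockless) walk.
         */
        #define MMU_NOTIFY_WAS_FAST     (1 << 1)

        int young = mmu_notifier_clear_young(vma->vm_mm, addr,
                                             addr + PAGE_SIZE);

        if (young & 1) {
                /* The page is young... */
                if (young & MMU_NOTIFY_WAS_FAST)
                        /*
                         * ...and the information was cheap to get, so
                         * looking around at neighboring PTEs is worth it.
                         */
                        do_look_around = true;
        }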

> So rather than failing the fast aging, I think what we want is to know if an
> mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> is an optional optimization to avoid taking mmu_lock for write in paths where a
> (very rare) false negative is acceptable.
>
> static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> {
> return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> }
>
> static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> bool fast_only)
> {
> int young = 0;
>
> if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> write_lock(&kvm->mmu_lock);
> young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> write_unlock(&kvm->mmu_lock);
> }
>
> if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> young = 1 | MMU_NOTIFY_WAS_FAST;

I don't think this line is quite right. We might set
MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
you mean though, thanks.

>
> return (int)young;
> }
>
> and then in lru_gen_look_around():
>
> if (spin_is_contended(pvmw->ptl))
> return false;
>
> /* exclude special VMAs containing anon pages from COW */
> if (vma->vm_flags & VM_SPECIAL)
> return false;
>
> young = ptep_clear_young_notify(vma, addr, pte);
> if (!young)
> return false;
>
> if (!(young & MMU_NOTIFY_WAS_FAST))
> return true;
>
> young = 1;
>
> with the lookaround done using ptep_clear_young_notify_fast().
>
> The MMU_NOTIFY_WAS_FAST flag is gross, but AFAICT it would Just Work without
> needing to update all users of ptep_clear_young_notify() and friends.

Sounds good to me.

Thanks for all the feedback!

2024-06-12 00:34:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024, James Houghton wrote:
> On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
> > --
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 7b77ad6cf833..07872ae00fa6 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> >
> > int __mmu_notifier_clear_young(struct mm_struct *mm,
> > unsigned long start,
> > - unsigned long end)
> > + unsigned long end,
> > + bool fast_only)
> > {
> > struct mmu_notifier *subscription;
> > int young = 0, id;
> > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> > hlist_for_each_entry_rcu(subscription,
> > &mm->notifier_subscriptions->list, hlist,
> > srcu_read_lock_held(&srcu)) {
> > - if (subscription->ops->clear_young)
> > - young |= subscription->ops->clear_young(subscription,
> > - mm, start, end);
> > + if (!subscription->ops->clear_young ||
> > + fast_only && !subscription->ops->has_fast_aging)
> > + continue;
> > +
> > + young |= subscription->ops->clear_young(subscription,
> > + mm, start, end);
>
> KVM changing has_fast_aging dynamically would be slow, wouldn't it?

No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging
be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this
MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially
have multiple secondary MMUs wired up to the same mmu_notifier.

> I feel like it's simpler to just pass in fast_only into `clear_young` itself
> (and this is how I interpreted what you wrote above anyway).

Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires
secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by
default.

It's somewhat of a moot point because KVM is the only secondary MMU that implements
.clear_young() and .test_young() (which I keep forgetting), and that seems unlikely
to change.

A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
probably a pretty unlikely combination, so it's probably not a valid argument.

So, I guess I don't have a strong opinion?

> > Double ugh. Peeking ahead at the "failure" code, NAK to adding
> > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> > that, I will NAK each and every attempt to have core mm/ code call directly into KVM.
>
> Sorry to make you repeat yourself; I'll leave it out of v6. I don't
> like it either, but I wasn't sure how important it was to avoid
> calling into unnecessary notifiers if the TDP MMU were completely
> disabled.

If it's important, e.g. for performance, then the mmu_notifier should have a flag
so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging
suggestion.

> > Anyways, back to this code, before we spin another version, we need to agree on
> > exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> > proposed in this version doesn't make any sense.
> >
> > Signalling failure because KVM _might_ have relevant aging information in SPTEs
> > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> > KVM should return "young", not "failed".
>
> Sorry for this oversight. What about something like:
>
> 1. test (and maybe clear) A bits on TDP MMU
> 2. If accessed && !should_clear: return (fast)
> 3. if (fast_only): return (fast)
> 4. If !(must check shadow MMU): return (fast)
> 5. test (and maybe clear) A bits in shadow MMU
> 6. return (slow)

I don't understand where the "must check shadow MMU" in #4 comes from. I also
don't think it's necessary; see below.

> Some of this reordering (and maybe a change from
> kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> something else) can be done in its own patch.
>
> > So rather than failing the fast aging, I think what we want is to know if an
> > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > (very rare) false negative is acceptable.
> >
> > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > {
> > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > }
> >
> > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > bool fast_only)
> > {
> > int young = 0;
> >
> > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > write_lock(&kvm->mmu_lock);
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > young = 1 | MMU_NOTIFY_WAS_FAST;
>
> I don't think this line is quite right. We might set
> MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
> you mean though, thanks.

The name sucks, but I believe the logic is correct. As posted here in v5, the
MGLRU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always
get aging information, but only look around at other PTEs if it can be done fast.

        if (should_walk_secondary_mmu()) {
                notifier_result =
                        mmu_notifier_test_clear_young_fast_only(
                                vma->vm_mm, addr, addr + PAGE_SIZE,
                                /*clear=*/true);
        }

        if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
                secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
                                                           addr + PAGE_SIZE);
        else {
                secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
                notifier_was_fast = true;
        }

The change, relative to v5, that I am proposing is that MGLRU looks around if
the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
only if _all_ secondary MMUs are fast.

In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
fast_only flag.

2024-06-12 16:17:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Tue, Jun 11, 2024, James Houghton wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e8fc5ecb59b2..24a3ff639919 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> continue;
> }
>
> - if (pvmw.pte) {
> - if (lru_gen_enabled() &&
> - pte_young(ptep_get(pvmw.pte))) {
> - lru_gen_look_around(&pvmw);
> + if (lru_gen_enabled() && pvmw.pte) {
> + if (lru_gen_look_around(&pvmw))
> referenced++;
> - }
> -
> + } else if (pvmw.pte) {
> if (ptep_clear_flush_young_notify(vma, address,
> pvmw.pte))
> referenced++;

Random question not really related to KVM/secondary MMU participation. AFAICT,
the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
false negatives on pxx_young() due to the CPU not setting Accessed bits because
of stale TLB entries?

2024-06-12 17:00:29

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index e8fc5ecb59b2..24a3ff639919 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> > continue;
> > }
> >
> > - if (pvmw.pte) {
> > - if (lru_gen_enabled() &&
> > - pte_young(ptep_get(pvmw.pte))) {
> > - lru_gen_look_around(&pvmw);
> > + if (lru_gen_enabled() && pvmw.pte) {
> > + if (lru_gen_look_around(&pvmw))
> > referenced++;
> > - }
> > -
> > + } else if (pvmw.pte) {
> > if (ptep_clear_flush_young_notify(vma, address,
> > pvmw.pte))
> > referenced++;
>
> Random question not really related to KVM/secondary MMU participation. AFAICT,
> the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
> false negatives on pxx_young() due to the CPU not setting Accessed bits because
> of stale TLB entries?

I do think there can be false negatives but we have not been able to
measure their practical impacts since we disabled the flush on some
host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
ptep_clear_flush_young() is just ptep_test_and_clear_young(). The
theoretical basis is that, given the TLB coverage trend (Figure 1 in
[1]), when a system is running out of memory, it's unlikely to have
many long-lived entries in its TLB. IOW, if that system had a stable
working set (hot memory) that can fit into its TLB, it wouldn't hit
page reclaim. Again, this is based on the theory (proposition) that
for most systems, their TLB coverages are much smaller than their
memory sizes.

If/when the above proposition doesn't hold, the next step in the page
reclaim path, which is to unmap the PTE, will cause a page fault. The
fault can be minor or major (requires IO), depending on the race
between the reclaiming and accessing threads. In this case, the
tradeoff, in a steady state, is between the PF cost of pages we
shouldn't reclaim and the flush cost of pages we scan. The PF cost is
higher than the flush cost per page. But we scan many pages and only
reclaim a few of them; pages we shouldn't reclaim are a (small)
portion of the latter.

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

2024-06-12 17:47:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024, Yu Zhao wrote:
> On Wed, Jun 12, 2024 at 10:02 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Tue, Jun 11, 2024, James Houghton wrote:
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index e8fc5ecb59b2..24a3ff639919 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -870,13 +870,10 @@ static bool folio_referenced_one(struct folio *folio,
> > > continue;
> > > }
> > >
> > > - if (pvmw.pte) {
> > > - if (lru_gen_enabled() &&
> > > - pte_young(ptep_get(pvmw.pte))) {
> > > - lru_gen_look_around(&pvmw);
> > > + if (lru_gen_enabled() && pvmw.pte) {
> > > + if (lru_gen_look_around(&pvmw))
> > > referenced++;
> > > - }
> > > -
> > > + } else if (pvmw.pte) {
> > > if (ptep_clear_flush_young_notify(vma, address,
> > > pvmw.pte))
> > > referenced++;
> >
> > Random question not really related to KVM/secondary MMU participation. AFAICT,
> > the MGLRU approach doesn't flush TLBs after aging pages. How does MGLRU mitigate
> > false negatives on pxx_young() due to the CPU not setting Accessed bits because
> > of stale TLB entries?
>
> I do think there can be false negatives but we have not been able to
> measure their practical impacts since we disabled the flush on some
> host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> ptep_clear_flush_young() is just ptep_test_and_clear_young().

Aha! That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

That begs the question, why does KVM flush TLBs on architectures that don't need
to? And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
even any KVM-supported architectures for which the flush is mandatory?

Skipping the flush on KVM x86 seems like a complete no-brainer.

Will, Marc and/or Oliver, what are arm64's requirements in this area? E.g. I see
that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB. Should
KVM be doing something similar? Can KVM safely skip even the TLBI?

> theoretical basis is that, given the TLB coverage trend (Figure 1 in
> [1]), when a system is running out of memory, it's unlikely to have
> many long-lived entries in its TLB. IOW, if that system had a stable
> working set (hot memory) that can fit into its TLB, it wouldn't hit
> page reclaim. Again, this is based on the theory (proposition) that
> for most systems, their TLB coverages are much smaller than their
> memory sizes.
>
> If/when the above proposition doesn't hold, the next step in the page
> reclaim path, which is to unmap the PTE, will cause a page fault. The
> fault can be minor or major (requires IO), depending on the race
> between the reclaiming and accessing threads. In this case, the
> tradeoff, in a steady state, is between the PF cost of pages we
> shouldn't reclaim and the flush cost of pages we scan. The PF cost is
> higher than the flush cost per page. But we scan many pages and only
> reclaim a few of them; pages we shouldn't reclaim are a (small)
> portion of the latter.
>
> [1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

2024-06-13 06:50:17

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 8/9] mm: multi-gen LRU: Have secondary MMUs participate in aging

On Wed, Jun 12, 2024 at 10:23:38AM -0700, Sean Christopherson wrote:
> On Wed, Jun 12, 2024, Yu Zhao wrote:
> > I do think there can be false negatives but we have not been able to
> > measure their practical impacts since we disabled the flush on some
> > host MMUs long ago (NOT by MGLRU), e.g., on x86 and ppc,
> > ptep_clear_flush_young() is just ptep_test_and_clear_young().
>
> Aha! That's what I was missing, I somehow didn't see x86's ptep_clear_flush_young().

Heh, well the helper name isn't exactly giving any hints...

> That begs the question, why does KVM flush TLBs on architectures that don't need
> to? And since kvm_mmu_notifier_clear_young() explicitly doesn't flush, are there
> even any KVM-supported architectures for which the flush is mandatory?
>
> Skipping the flush on KVM x86 seems like a complete no-brainer.
>
> Will, Marc and/or Oliver, what are arm64's requirements in this area? E.g. I see
> that arm64's version of __ptep_clear_flush_young() does TLBI but not DSB. Should
> KVM be doing something similar? Can KVM safely skip even the TLBI?

Short answer, yes, KVM can elide TLBIs when clearing AF.

Long answer: Software needs to be extremely careful to ensure that TLBI
elision doesn't lead to a failure to uphold break-before-make requirements,
if we're only concerned with architecture-specific requirements. IOW, the AF
cannot be used as a hint for the presence of TLB entries for a given PTE.

There's the obvious failure of skipping TLBIs for old pages when
unmapping, but that isn't an architecture-specific issue.

So, since KVM/arm64 doesn't play any games with the AF at stage-2, leaving
out a TLBI when aging ought to be fine.

--
Thanks,
Oliver

2024-06-13 06:53:40

by Oliver Upton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote:
> On Tue, Jun 11, 2024, Oliver Upton wrote:
> > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > > I think consolidating the callbacks is cleanest, like you had it in
> > > v2. I really wasn't sure about this change honestly, but it was my
> > > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > > the callbacks like you had in v2.
> >
> > My strong preference is to have the caller's expectations of the
> > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> > abundantly clear both at the callsite and in the implementation.
>
> Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
> for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
> new ptep_clear_young_notify_fast_only() does the obvious.
>
> On the back end, odds are very good KVM is going to squish the "fast" and "slow"
> paths back into a common helper, so IMO having dedicated fast_only() APIs for the
> mmu_notifier hooks doesn't add much value in the end.
>
> I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
> that passing a fast_only flag will end up being less clean for all parties.

Yeah, I think I'm headed in the same direction after actually reading
the MM side of this, heh.

--
Thanks,
Oliver

2024-06-14 00:46:32

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Jun 11, 2024, James Houghton wrote:
> > On Tue, Jun 11, 2024 at 12:42 PM Sean Christopherson <[email protected]> wrote:
> > > --
> > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > index 7b77ad6cf833..07872ae00fa6 100644
> > > --- a/mm/mmu_notifier.c
> > > +++ b/mm/mmu_notifier.c
> > > @@ -384,7 +384,8 @@ int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> > >
> > > int __mmu_notifier_clear_young(struct mm_struct *mm,
> > > unsigned long start,
> > > - unsigned long end)
> > > + unsigned long end,
> > > + bool fast_only)
> > > {
> > > struct mmu_notifier *subscription;
> > > int young = 0, id;
> > > @@ -393,9 +394,12 @@ int __mmu_notifier_clear_young(struct mm_struct *mm,
> > > hlist_for_each_entry_rcu(subscription,
> > > &mm->notifier_subscriptions->list, hlist,
> > > srcu_read_lock_held(&srcu)) {
> > > - if (subscription->ops->clear_young)
> > > - young |= subscription->ops->clear_young(subscription,
> > > - mm, start, end);
> > > + if (!subscription->ops->clear_young ||
> > > + fast_only && !subscription->ops->has_fast_aging)
> > > + continue;
> > > +
> > > + young |= subscription->ops->clear_young(subscription,
> > > + mm, start, end);
> >
> > KVM changing has_fast_aging dynamically would be slow, wouldn't it?
>
> No, it could/would be done quite quickly. But, I'm not suggesting has_fast_aging
> be dynamic, i.e. it's not an "all aging is guaranteed to be fast", it's a "this
> MMU _can_ do fast aging". It's a bit fuzzy/weird mostly because KVM can essentially
> have multiple secondary MMUs wired up to the same mmu_notifier.
>
> > I feel like it's simpler to just pass in fast_only into `clear_young` itself
> > (and this is how I interpreted what you wrote above anyway).
>
> Eh, maybe? A "has_fast_aging" flag is more robust in the sense that it requires
> secondary MMUs to opt-in, i.e. all secondary MMUs will be considered "slow" by
> default.
>
> It's somewhat of a moot point because KVM is the only secondary MMU that implements
> .clear_young() and .test_young() (which I keep forgetting), and that seems unlikely
> to change.
>
> A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> probably a pretty unlikely combination, so it's probably not a valid argument.
>
> So, I guess I don't have a strong opinion?

(Sorry for the somewhat delayed response... spent some time actually
writing what this would look like.)

I see what you mean, thanks! So has_fast_aging might be set by KVM if
the architecture sets a Kconfig saying that it understands the concept
of fast aging, basically what the presence of this v5's
test_clear_young_fast_only() indicates.

>
> > > Double ugh. Peeking ahead at the "failure" code, NAK to adding
> > > kvm_arch_young_notifier_likely_fast for all the same reasons I objected to
> > > kvm_arch_has_test_clear_young() in v1. Please stop trying to do anything like
> > > that, I will NAK each and every attempt to have core mm/ code call directly into KVM.
> >
> > Sorry to make you repeat yourself; I'll leave it out of v6. I don't
> > like it either, but I wasn't sure how important it was to avoid
> > calling into unnecessary notifiers if the TDP MMU were completely
> > disabled.
>
> If it's important, e.g. for performance, then the mmu_notifier should have a flag
> so that the behavior doesn't assume a KVM backend. Hence my has_fast_aging
> suggestion.

Thanks! That makes sense.

> > > Anyways, back to this code, before we spin another version, we need to agree on
> > > exactly what behavior we want out of secondary MMUs. Because to me, the behavior
> > > proposed in this version doesn't make any sense.
> > >
> > > Signalling failure because KVM _might_ have relevant aging information in SPTEs
> > > that require taking kvm->mmu_lock is a terrible tradeoff. And for the test_young
> > > case, it's flat out wrong, e.g. if a page is marked Accessed in the TDP MMU, then
> > > KVM should return "young", not "failed".
> >
> > Sorry for this oversight. What about something like:
> >
> > 1. test (and maybe clear) A bits on TDP MMU
> > 2. If accessed && !should_clear: return (fast)
> > 3. if (fast_only): return (fast)
> > 4. If !(must check shadow MMU): return (fast)
> > 5. test (and maybe clear) A bits in shadow MMU
> > 6. return (slow)
>
> I don't understand where the "must check shadow MMU" in #4 comes from. I also
> don't think it's necessary; see below.

I just meant `kvm_has_shadow_mmu_sptes()` or
`kvm_memslots_have_rmaps()`. I like the logic you suggest below. :)

> > Some of this reordering (and maybe a change from
> > kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> > something else) can be done in its own patch.

So just to be clear, for test_young(), I intend to have a patch in v6
to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
like a pure win; no reason not to include it if we're making logic
changes here anyway.

> >
> > > So rather than failing the fast aging, I think what we want is to know if an
> > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > (very rare) false negative is acceptable.
> > >
> > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > {
> > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > }
> > >
> > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > bool fast_only)
> > > {
> > > int young = 0;
> > >
> > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > write_lock(&kvm->mmu_lock);
> > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > write_unlock(&kvm->mmu_lock);
> > > }
> > >
> > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > young = 1 | MMU_NOTIFY_WAS_FAST;

The most straightforward way (IMHO) to return something like `1 |
MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
gfn_handler_t return int instead of bool.

In this v5, I worked around this need by using `bool *failed` in patch
5[1]. I think the cleaner way to do this in v6 would be to actually
change gfn_handler_t to return int, and then we can write something
like what you wrote here. What do you think?

[1]: https://lore.kernel.org/linux-mm/[email protected]/

> > I don't think this line is quite right. We might set
> > MMU_NOTIFY_WAS_FAST even when we took the mmu_lock. I understand what
> > you mean though, thanks.
>
> The name sucks, but I believe the logic is correct. As posted here in v5, the
> MGLRU code wants to age both fast _and_ slow MMUs. AIUI, the intent is to always
> get aging information, but only look around at other PTEs if it can be done fast.
>
> if (should_walk_secondary_mmu()) {
> notifier_result =
> mmu_notifier_test_clear_young_fast_only(
> vma->vm_mm, addr, addr + PAGE_SIZE,
> /*clear=*/true);
> }
>
> if (notifier_result & MMU_NOTIFIER_FAST_FAILED)
> secondary_young = mmu_notifier_clear_young(vma->vm_mm, addr,
> addr + PAGE_SIZE);
> else {
> secondary_young = notifier_result & MMU_NOTIFIER_FAST_YOUNG;
> notifier_was_fast = true;
> }
>
> The change, relative to v5, that I am proposing is that MGLRU looks around if
> the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> only if _all_ secondary MMUs are fast.
>
> In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> fast_only flag.

Oh, yeah, that's a lot more intelligent than what I had. I think I
fully understand your suggestion; I guess we'll see in v6. :)

I wonder if this still makes sense if whether or not an MMU is "fast"
is determined by how contended some lock(s) are at the time. I think
it does, but I guess we can discuss more if it turns out that having
an architecture participate like this is actually something we want to
do (i.e., that performance results say it's a good idea).

Thanks Sean!

2024-06-14 00:49:11

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Wed, Jun 12, 2024 at 11:53 PM Oliver Upton <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 12:49:49PM -0700, Sean Christopherson wrote:
> > On Tue, Jun 11, 2024, Oliver Upton wrote:
> > > On Tue, Jun 11, 2024 at 09:49:59AM -0700, James Houghton wrote:
> > > > I think consolidating the callbacks is cleanest, like you had it in
> > > > v2. I really wasn't sure about this change honestly, but it was my
> > > > attempt to incorporate feedback like this[3] from v4. I'll consolidate
> > > > the callbacks like you had in v2.
> > >
> > > My strong preference is to have the caller's expectations of the
> > > secondary MMU be explicit. Having ->${BLAH}_fast_only() makes this
> > > abundantly clear both at the callsite and in the implementation.
> >
> > Partially agreed. We don't need a dedicated mmu_notifier API to achieve that
> > for the callsites, e.g. ptep_clear_young_notify() passes fast_only=false, and a
> > new ptep_clear_young_notify_fast_only() does the obvious.
> >
> > On the back end, odds are very good KVM is going to squish the "fast" and "slow"
> > paths back into a common helper, so IMO having dedicated fast_only() APIs for the
> > mmu_notifier hooks doesn't add much value in the end.
> >
> > I'm not opposed to dedicated hooks, but after poking around a bit, I suspect
> > that passing a fast_only flag will end up being less clean for all parties.
>
> Yeah, I think I'm headed in the same direction after actually reading
> the MM side of this, heh.

Yeah, I agree. What I have now for v6 is that the test_young() and
clear_young() notifiers themselves now take a `bool fast_only`. When
called with the existing helpers (e.g. `mmu_notifier_test_young()`,
`fast_only` is false, and I have new helpers (e.g.
`mmu_notifier_test_young_fast_only()`) that will set `fast_only` to
true. Seems clean to me. Thanks!
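
For reference, here is a rough sketch of what those wrappers might look
like, assuming the __mmu_notifier_*() internals grow the fast_only
parameter as in your earlier diff (this is not the final v6 code):

        /* include/linux/mmu_notifier.h, sketch only */
        static inline int mmu_notifier_test_young(struct mm_struct *mm,
                                                  unsigned long address)
        {
                /* Existing helper: always allowed to take the slow path. */
                if (mm_has_notifiers(mm))
                        return __mmu_notifier_test_young(mm, address,
                                                         /*fast_only=*/false);
                return 0;
        }

        static inline int mmu_notifier_test_young_fast_only(struct mm_struct *mm,
                                                            unsigned long address)
        {
                /* New helper: only consult MMUs that can age quickly. */
                if (mm_has_notifiers(mm))
                        return __mmu_notifier_test_young(mm, address,
                                                         /*fast_only=*/true);
                return 0;
        }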

2024-06-14 16:13:32

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Thu, Jun 13, 2024, James Houghton wrote:
> On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> > probably a pretty unlikely combination, so it's probably not a valid argument.
> >
> > So, I guess I don't have a strong opinion?
>
> (Sorry for the somewhat delayed response... spent some time actually
> writing what this would look like.)
>
> I see what you mean, thanks! So has_fast_aging might be set by KVM if
> the architecture sets a Kconfig saying that it understands the concept
> of fast aging, basically what the presence of this v5's
> test_clear_young_fast_only() indicates.

It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
doesn't support fast aging (uses the shadow MMU even for TDP).

> > I don't understand where the "must check shadow MMU" in #4 comes from. I also
> > don't think it's necessary; see below.
>
> I just meant `kvm_has_shadow_mmu_sptes()` or
> `kvm_memslots_have_rmaps()`. I like the logic you suggest below. :)
>
> > > Some of this reordering (and maybe a change from
> > > kvm_shadow_root_allocated() to checking indirect_shadow_pages or
> > > something else) can be done in its own patch.
>
> So just to be clear, for test_young(), I intend to have a patch in v6
> to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> like a pure win; no reason not to include it if we're making logic
> changes here anyway.

I don't think that's correct. The initial fast_only=false aging should process
shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
get a false positive on young due to failing to clear the Accessed bit in the
shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
accessed again, the Accessed bit would still be set in the page tables for L2.

My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to
communicate to MGLRU that the page was found to be young in an MMU that supports
fast aging, i.e. that looking around at other SPTEs is worth doing.

> > > > So rather than failing the fast aging, I think what we want is to know if an
> > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > > (very rare) false negative is acceptable.
> > > >
> > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > > {
> > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > > }
> > > >
> > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > > bool fast_only)
> > > > {
> > > > int young = 0;
> > > >
> > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > > write_lock(&kvm->mmu_lock);
> > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > > write_unlock(&kvm->mmu_lock);
> > > > }
> > > >
> > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > > young = 1 | MMU_NOTIFY_WAS_FAST;
>
> The most straightforward way (IMHO) to return something like `1 |
> MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
> gfn_handler_t return int instead of bool.

Hrm, all the options are unpleasant. Modifying gfn_handler_t to return an int
will require an absurd amount of churn (all implementations in all architectures),
and I don't love that the APIs that return true/false to indicate "flush" would
lose their boolean-ness.

One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perfect either,
but it requires far less churn and keeps some of the KVM<=>mmu_notifier details in
common KVM code.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b9d2633a931..c11a359b6ff5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
union kvm_mmu_notifier_arg {
unsigned long attributes;
+ bool aging_was_fast;
};

struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 436ca41f61e5..a936f6bedd97 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
return __kvm_handle_hva_range(kvm, &range).ret;
}

-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
- unsigned long start,
- unsigned long end,
- gfn_handler_t handler)
+static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
+ unsigned long start,
+ unsigned long end,
+ bool flush_if_young)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
@@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.end = end,
.handler = handler,
.on_lock = (void *)kvm_null_fn,
- .flush_on_ret = false,
+ .flush_on_ret = flush_if_young,
.may_block = false,
+ .aging_was_fast = false,
};

- return __kvm_handle_hva_range(kvm, &range).ret;
+ bool young = __kvm_handle_hva_range(kvm, &range).ret;
+
+ return (int)young | (range.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
}

void kvm_mmu_invalidate_begin(struct kvm *kvm)
@@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
{
trace_kvm_age_hva(start, end);

- return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
+ return kvm_age_hva_range(mn, start, end, true);
}

static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
{
trace_kvm_age_hva(start, end);

- /*
- * Even though we do not flush TLB, this will still adversely
- * affect performance on pre-Haswell Intel EPT, where there is
- * no EPT Access Bit to clear so that we have to tear down EPT
- * tables instead. If we find this unacceptable, we can always
- * add a parameter to kvm_age_hva so that it effectively doesn't
- * do anything on clear_young.
- *
- * Also note that currently we never issue secondary TLB flushes
- * from clear_young, leaving this job up to the regular system
- * cadence. If we find this inaccurate, we might come up with a
- * more sophisticated heuristic later.
- */
- return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+ return kvm_age_hva_range(mn, start, end, false);
}

static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
@@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
{
trace_kvm_test_age_hva(address);

- return kvm_handle_hva_range_no_flush(mn, address, address + 1,
- kvm_test_age_gfn);
+ return kvm_age_hva_range(mn, address, address + 1, false);
}

static void kvm_mmu_notifier_release(struct mmu_notifier *mn,


> > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > only if _all_ secondary MMUs are fast.
> >
> > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > fast_only flag.
>
> Oh, yeah, that's a lot more intelligent than what I had. I think I
> fully understand your suggestion; I guess we'll see in v6. :)
>
> I wonder if this still makes sense if whether or not an MMU is "fast"
> is determined by how contended some lock(s) are at the time.

No. Just because a lock wasn't contended on the initial aging doesn't mean it
won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
takes mmu_lock for write for all operations, an aging operation could get lucky
and sneak in while mmu_lock happened to be free, but then get stuck behind a large
queue of operations.

The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
an MMU that takes mmu_lock for read in all but the most rare paths.

2024-06-14 18:24:41

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Jun 13, 2024, James Houghton wrote:
> > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > > > i.e. would be a minor optimization when KVM doesn't support fast aging. But that's
> > > probably a pretty unlikely combination, so it's probably not a valid argument.
> > >
> > > So, I guess I don't have a strong opinion?
> >
> > (Sorry for the somewhat delayed response... spent some time actually
> > writing what this would look like.)
> >
> > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > the architecture sets a Kconfig saying that it understands the concept
> > of fast aging, basically what the presence of this v5's
> > test_clear_young_fast_only() indicates.
>
> It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
> doesn't support fast aging (uses the shadow MMU even for TDP).

I see. I'm not sure if it makes sense to put this in `ops` as you
originally had it then (it seems like a bit of a pain anyway). I could
just make it a member of `struct mmu_notifier` itself.
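
Something like this, roughly (just a sketch of where the flag would
live and how KVM/x86 might set it at registration time;
kvm_arch_has_fast_aging() is made up for illustration):

        /* include/linux/mmu_notifier.h */
        struct mmu_notifier {
                ...
                /* New: runtime property, set by the owner at registration. */
                bool has_fast_aging;
        };

        /* In KVM's mmu_notifier registration path: */
        kvm->mmu_notifier.has_fast_aging = kvm_arch_has_fast_aging(kvm);

        /* arch/x86/kvm: a runtime check rather than a Kconfig. */
        bool kvm_arch_has_fast_aging(struct kvm *kvm)
        {
                return tdp_mmu_enabled;
        }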

> > So just to be clear, for test_young(), I intend to have a patch in v6
> > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > like a pure win; no reason not to include it if we're making logic
> > changes here anyway.
>
> I don't think that's correct. The initial fast_only=false aging should process
> shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
> get a false positive on young due to failing to clear the Accessed bit in the
> shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
> accessed again, the Accessed bit would still be set in the page tables for L2.

For clear_young(fast_only=false), yeah we need to check and clear
Accessed for both MMUs. But for test_young(fast_only=false), I don't
see why we couldn't just return early if the TDP MMU reports young.
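
I.e., roughly the following (a sketch only, reusing
kvm_has_shadow_mmu_sptes() and MMU_NOTIFY_WAS_FAST from your earlier
snippet, with the existing kvm_tdp_mmu_test_age_gfn() and
kvm_test_age_rmap() handlers):

        static int __kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
                                      bool fast_only)
        {
                int young = 0;

                /*
                 * Check the TDP MMU first: nothing is cleared on the
                 * test_young() path, so if the TDP MMU already says "young"
                 * there is no need to visit the shadow MMU at all.
                 */
                if (tdp_mmu_enabled && kvm_tdp_mmu_test_age_gfn(kvm, range))
                        return 1 | MMU_NOTIFY_WAS_FAST;

                if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
                        write_lock(&kvm->mmu_lock);
                        young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
                        write_unlock(&kvm->mmu_lock);
                }

                return young;
        }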

> My thought for MMU_NOTIFY_WAS_FAST below (which again is a bad name) is to
> communicate to MGLRU that the page was found to be young in an MMU that supports
> fast aging, i.e. that looking around at other SPTEs is worth doing.

That makes sense; I don't think this little test_young() optimization
affects that.

> > > > > So rather than failing the fast aging, I think what we want is to know if an
> > > > > mmu_notifier found a young SPTE during a fast lookup. E.g. something like this
> > > > > in KVM, where using kvm_has_shadow_mmu_sptes() instead of kvm_memslots_have_rmaps()
> > > > > is an optional optimization to avoid taking mmu_lock for write in paths where a
> > > > > (very rare) false negative is acceptable.
> > > > >
> > > > > static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
> > > > > {
> > > > > return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
> > > > > }
> > > > >
> > > > > static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
> > > > > bool fast_only)
> > > > > {
> > > > > int young = 0;
> > > > >
> > > > > if (!fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
> > > > > write_lock(&kvm->mmu_lock);
> > > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> > > > > write_unlock(&kvm->mmu_lock);
> > > > > }
> > > > >
> > > > > if (tdp_mmu_enabled && kvm_tdp_mmu_age_gfn_range(kvm, range))
> > > > > young = 1 | MMU_NOTIFY_WAS_FAST;
> >
> > The most straightforward way (IMHO) to return something like `1 |
> > MMU_NOTIFY_WAS_FAST` up to the MMU notifier itself is to make
> > gfn_handler_t return int instead of bool.
>
> Hrm, all the options are unpleasant. Modifying gfn_handler_t to return an int
> will require an absurd amount of churn (all implementations in all architectures),
> and I don't love that the APIs that return true/false to indicate "flush" would
> lose their boolean-ness.
>
> One idea would be to add kvm_mmu_notifier_arg.aging_was_fast or so, and then
> refactor kvm_handle_hva_range_no_flush() into a dedicated aging helper, and have
> it morph the KVM-internal flag into an MMU_NOTIFIER flag. It's not perfect either,
> but it requires far less churn and keeps some of the KVM<=>mmu_notifier details in
> common KVM code.

SGTM. I think this will work. Thanks!

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7b9d2633a931..c11a359b6ff5 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -258,6 +258,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> union kvm_mmu_notifier_arg {
> unsigned long attributes;
> + bool aging_was_fast;
> };
>
> struct kvm_gfn_range {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 436ca41f61e5..a936f6bedd97 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -685,10 +685,10 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> return __kvm_handle_hva_range(kvm, &range).ret;
> }
>
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> - unsigned long start,
> - unsigned long end,
> - gfn_handler_t handler)
> +static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
> + unsigned long start,
> + unsigned long end,
> + bool flush_if_young)
> {
> struct kvm *kvm = mmu_notifier_to_kvm(mn);
> const struct kvm_mmu_notifier_range range = {
> @@ -696,11 +696,14 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> .end = end,
> .handler = handler,
> .on_lock = (void *)kvm_null_fn,
> - .flush_on_ret = false,
> + .flush_on_ret = flush_if_young,
> .may_block = false,
> + .aging_was_fast = false,
> };
>
> - return __kvm_handle_hva_range(kvm, &range).ret;
> + bool young = __kvm_handle_hva_range(kvm, &range).ret;
> +
> + return (int)young | (range.aging_was_fast ? MMU_NOTIFIER_FAST_AGING : 0);
> }
>
> void kvm_mmu_invalidate_begin(struct kvm *kvm)
> @@ -865,7 +868,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
> {
> trace_kvm_age_hva(start, end);
>
> - return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
> + return kvm_age_hva_range(mn, start, end, true);
> }
>
> static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> @@ -875,20 +878,7 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
> {
> trace_kvm_age_hva(start, end);
>
> - /*
> - * Even though we do not flush TLB, this will still adversely
> - * affect performance on pre-Haswell Intel EPT, where there is
> - * no EPT Access Bit to clear so that we have to tear down EPT
> - * tables instead. If we find this unacceptable, we can always
> - * add a parameter to kvm_age_hva so that it effectively doesn't
> - * do anything on clear_young.
> - *
> - * Also note that currently we never issue secondary TLB flushes
> - * from clear_young, leaving this job up to the regular system
> - * cadence. If we find this inaccurate, we might come up with a
> - * more sophisticated heuristic later.
> - */
> - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> + return kvm_age_hva_range(mn, start, end, false);
> }
>
> static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -897,8 +887,7 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> {
> trace_kvm_test_age_hva(address);
>
> - return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> - kvm_test_age_gfn);
> + return kvm_age_hva_range(mn, address, address + 1, false);
> }
>
> static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
>
>
> > > The change, relative to v5, that I am proposing is that MGLRU looks around if
> > > the page was young in _a_ "fast" secondary MMU, whereas v5 looks around if and
> > > only if _all_ secondary MMUs are fast.
> > >
> > > In other words, if a fast MMU had a young SPTE, look around _that_ MMU, via the
> > > fast_only flag.
> >
> > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > fully understand your suggestion; I guess we'll see in v6. :)
> >
> > I wonder if this still makes sense if whether or not an MMU is "fast"
> > is determined by how contended some lock(s) are at the time.
>
> No. Just because a lock wasn't contended on the initial aging doesn't mean it
> won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
> takes mmu_lock for write for all operations, an aging operation could get lucky
> and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> queue of operations.
>
> The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> an MMU that takes mmu_lock for read in all but the most rare paths.

Aging and look-around themselves only use the fast-only notifiers, so
they won't ever wait on a lock (well... provided KVM is written like
that, which I think is a given). should_look_around() will use the
slow notifier because it (despite its name) is responsible for
accurately determining if a page is young lest we evict a young page.

So in this case where "fast" means "lock not contended for now", I
don't think it's necessarily wrong for MGLRU to attempt to find young
pages, even if sometimes it will bail out because a lock is
contended/held for a few or even a majority of the pages. Not doing
look-around is the same as doing look-around and finding that no pages
are young.

Anyway, I don't think this bit is really all that important unless we
can demonstrate that KVM participating like this actually results in a
measurable win.

2024-06-14 23:22:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Fri, Jun 14, 2024, James Houghton wrote:
> On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Jun 13, 2024, James Houghton wrote:
> > > On Tue, Jun 11, 2024 at 5:34 PM Sean Christopherson <[email protected]> wrote:
> > > > A flag would also avoid an indirect call and thus a RETPOLINE when CONFIG_RETPOLINE=y,
> > > > i.e. would be a minor optimization when KVM doesn't suppport fast aging. But that's
> > > > probably a pretty unlikely combination, so it's probably not a valid argument.
> > > >
> > > > So, I guess I don't have a strong opinion?
> > >
> > > (Sorry for the somewhat delayed response... spent some time actually
> > > writing what this would look like.)
> > >
> > > I see what you mean, thanks! So has_fast_aging might be set by KVM if
> > > the architecture sets a Kconfig saying that it understands the concept
> > > of fast aging, basically what the presence of this v5's
> > > test_clear_young_fast_only() indicates.
> >
> > It would need to be a runtime setting, because KVM x86-64 with tdp_mmu_enabled=false
> > doesn't support fast aging (uses the shadow MMU even for TDP).
>
> I see. I'm not sure if it makes sense to put this in `ops` as you
> originally had it then (it seems like a bit of a pain anyway). I could
> just make it a member of `struct mmu_notifier` itself.

Ah, right, because the ops are const. Yeah, losing the const just to set a flag
would be unfortunate.

> > > So just to be clear, for test_young(), I intend to have a patch in v6
> > > to elide the shadow MMU check if the TDP MMU indicates Accessed. Seems
> > > like a pure win; no reason not to include it if we're making logic
> > > changes here anyway.
> >
> > I don't think that's correct. The initial fast_only=false aging should process
> > shadow MMUs (nested TDP) and TDP MMUs, otherwise a future fast_only=false would
> > get a false positive on young due to failing to clear the Accessed bit in the
> > shadow MMU. E.g. if page X is accessed by both L1 and L2, then aged, and never
> > accessed again, the Accessed bit would still be set in the page tables for L2.
>
> For clear_young(fast_only=false), yeah we need to check and clear
> Accessed for both MMUs. But for test_young(fast_only=false), I don't
> see why we couldn't just return early if the TDP MMU reports young.

Ooh, good point.

> > > Oh, yeah, that's a lot more intelligent than what I had. I think I
> > > fully understand your suggestion; I guess we'll see in v6. :)
> > >
> > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > is determined by how contended some lock(s) are at the time.
> >
> > No. Just because a lock wasn't contended on the initial aging doesn't mean it
> > won't be contended on the next round. E.g. when using KVM x86's shadow MMU, which
> > takes mmu_lock for write for all operations, an aging operation could get lucky
> > and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> > queue of operations.
> >
> > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> > an MMU that takes mmu_lock for read in all but the most rare paths.
>
> Aging and look-around themselves only use the fast-only notifiers, so
> they won't ever wait on a lock (well... provided KVM is written like
> that, which I think is a given).

Regarding aging, is that actually the behavior that we want? I thought the plan
is to have the initial test look at all MMUs, i.e. be potentially slow, but only
do the lookaround if it can be fast. IIUC, that was Yu's intent (and peeking back
at v2, that is indeed the case, unless I'm misreading the code).

If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
end up with hot pages (used by L2) swapped out.

Or are you saying that the "test" could be slow, but not "clear"? That's also
suboptimal, because any pages accessed by L2 will always appear hot.

> should_look_around() will use the slow notifier because it (despite its name)
> is responsible for accurately determining if a page is young lest we evict a
> young page.
>
> So in this case where "fast" means "lock not contended for now",

No. In KVM, "fast" needs to be a property of the MMU, not a reflection of system
state at some random snapshot in time.

> I don't think it's necessarily wrong for MGLRU to attempt to find young
> pages, even if sometimes it will bail out because a lock is contended/held

lru_gen_look_around() skips lookaround if something else is waiting on the page
table lock, but that is a far cry from what KVM would be doing. (a) the PTL is
already held, and (b) it is scoped precisely to the range being processed. Not
looking around makes sense because there's a high probability of the PTEs in
question being modified by a different task, i.e. of the look around being a
waste of time.

In KVM, mmu_lock is not yet held, so KVM would need to use try-lock to avoid
waiting, and would need to bail from the middle of the aging walk if a different
task contends mmu_lock.

I agree that's not "wrong", but in part because mmu_lock is scoped to the entire
VM, it risks ending up with semi-random, hard to debug behavior. E.g. a user
could see intermittent "failures" that come and go based on seemingly unrelated
behavior in KVM. And implementing the "bail" behavior in the shadow MMU would
require non-trivial changes.

In other words, I would very strongly prefer that the shadow MMU be all or nothing,
i.e. is either part of look-around or isn't. And if nested TDP doesn't fare well
with MGLRU, then we (or whoever cares) can spend the time+effort to make it work
with fast-aging.

Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow
MMU would be, I'm pretty sure we can go straight there for nested TDP. Or rather,
I suspect/hope we can get close enough for an initial merge, which would allow
aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.

Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
in such a way that a lockless walk would be painfully complex. But if there is
exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g.
mapping the same page into multiple L2s, the overwhelming majority of rmaps
have only one entry. That's not the case for legacy shadow paging because kernels
almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
along with any userspace mappings.

E.g. with a QEMU+KVM setup running Linux as L2, in my setup, only one gfn has
multiple rmap entries (IIRC, it's from QEMU remapping BIOS into low memory during
boot).

So, if we bifurcate aging behavior based on whether or not the TDP MMU is enabled,
then whether or not aging is fast is constant (after KVM loads). Roughly, the KVM
side of things would be the below, plus a bunch of conversions to WRITE_ONCE() to
ensure a stable rmap value (KVM already plays nice with lockless accesses to SPTEs,
thanks to the fast page fault path).

If KVM adds an rmap entry after the READ_ONCE(), then functionally all is still
well because the original SPTE pointer is still valid. If the rmap entry is
removed, then KVM just needs to ensure the owning page table isn't freed. That
could be done either with a cmpxchg() (KVM zaps leaf SPTEs before freeing page
tables, and the rmap stuff doesn't actually walk upper level entries), or by
enhancing the shadow MMU's lockless walk logic to allow lockless walks from
non-vCPU tasks.

And in the (hopefully) unlikely scenario someone has a use case where L1 maps a
gfn into multiple L2s (or aliases in bizarre ways), then we can tackle making the
nested TDP shadow MMU rmap walks always lockless.

E.g. again very roughly, if we went with the latter:

@@ -1629,22 +1629,45 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
}

+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ typedefme handler)
+{
+ gfn_t gfn;
+ int level;
+ u64 *spte;
+ bool ret;
+
+ walk_shadow_page_lockless_begin(???);
+
+ for (gfn = range->start; gfn < range->end; gfn++) {
+ for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ spte = (void *)READ_ONCE(gfn_to_rmap(gfn, level, range->slot)->val);
+
+ /* Skip the gfn if there are multiple SPTEs. */
+ if ((unsigned long)spte & 1)
+ continue;
+
+ ret |= handler(spte);
+ }
+ }
+
+ walk_shadow_page_lockless_end(???);
+}
+
static int __kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range,
bool fast_only)
{
bool young = false;

- if (kvm_memslots_have_rmaps(kvm)) {
- if (fast_only)
- return -1;
-
- write_lock(&kvm->mmu_lock);
- young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
- write_unlock(&kvm->mmu_lock);
- }
-
- if (tdp_mmu_enabled)
+ if (tdp_mmu_enabled) {
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
+ young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);
+ } else if (!fast_only) {
+ write_lock(&kvm->mmu_lock);
+ young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+ write_unlock(&kvm->mmu_lock);
+ }

return (int)young;
}

> for a few or even a majority of the pages. Not doing look-around is the same
> as doing look-around and finding that no pages are young.

No, because the former is deterministic and predictable, the latter is not.

> Anyway, I don't think this bit is really all that important unless we
> can demonstrate that KVM participating like this actually results in a
> measurable win.

Participating like what? You've lost me a bit. Are we talking past each other?

What I am saying is that we do this (note that this is slightly different than
an earlier sketch; I botched the ordering of spin_is_contended() in that one, and
didn't account for the page being young in the primary MMU).

        if (pte_young(ptep_get(pte)))
                young = 1 | MMU_NOTIFY_WAS_FAST;

        young |= ptep_clear_young_notify(vma, addr, pte);
        if (!young)
                return false;

        if (!(young & MMU_NOTIFY_WAS_FAST))
                return true;

        if (spin_is_contended(pvmw->ptl))
                return false;

        /* exclude special VMAs containing anon pages from COW */
        if (vma->vm_flags & VM_SPECIAL)
                return false;