2024-01-10 01:21:21

by Sean Christopherson

Subject: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

Retry page faults without acquiring mmu_lock if the resolved gfn is covered
by an active invalidation. Contending for mmu_lock is especially
problematic on preemptible kernels as the mmu_notifier invalidation task
will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
invalidation, and ultimately increase the latency of resolving the page
fault. And in the worst case scenario, yielding will be accompanied by a
remote TLB flush, e.g. if the invalidation covers a large range of memory
and vCPUs are accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Add a dedicated lockless version of the range-based retry check to avoid
false positives on the sanity check on start+end WARN, and so that it's
super obvious that checking for a racing invalidation without holding
mmu_lock is unsafe (though obviously useful).

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
invalidation in a loop won't put KVM into an infinite loop, e.g. due to
caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Do the pre-check even for non-preemptible kernels, as waiting to detect
the invalidation until mmu_lock is held guarantees the vCPU will observe
the worst case latency in terms of handling the fault, and can generate
even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
eventually re-acquire mmu_lock. This behavior is also why there are no
new starvation issues due to losing the fairness guarantees provided by
rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
on mmu_lock doesn't guarantee forward progress in the face of _another_
mmu_notifier invalidation event.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Reported-by: Yan Zhao <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Acked-by: Kai Huang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---

Note, this version adds a dedicated helper, mmu_invalidate_retry_gfn_unsafe(),
instead of making mmu_invalidate_retry_gfn() play nice with being called without
mmu_lock held. I was hesitant to drop the lockdep assertion before, and the
recently introduced sanity check on the gfn start/end values pushed this past
the threshold of being worth the duplicate code (preserving the start/end sanity
check in lock-free code would be comically difficult, and would add almost no value
since it would have to be quite conservative to avoid false positives).

Kai, I kept your Ack even though the code is obviously a little different.
Holler if you want me to drop it.

v2:
- Introduce a dedicated helper and collapse to a single patch (because
adding an unused helper would be quite silly).
- Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
- Add Kai's Ack.

v1: https://lore.kernel.org/all/[email protected]

arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
include/linux/kvm_host.h | 26 ++++++++++++++++++++++++++
2 files changed, 42 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..92f51540c4a7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4415,6 +4415,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	/*
+	 * Pre-check for a relevant mmu_notifier invalidation event prior to
+	 * acquiring mmu_lock. If there is an in-progress invalidation and the
+	 * kernel allows preemption, the invalidation task may drop mmu_lock
+	 * and yield in response to mmu_lock being contended, which is *very*
+	 * counter-productive as this vCPU can't actually make forward progress
+	 * until the invalidation completes. This "unsafe" check can get false
+	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock. Do
+	 * the pre-check even for non-preemptible kernels, i.e. even if KVM will
+	 * never yield mmu_lock in response to contention, as this vCPU is
+	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
+	 * to detect retry guarantees the worst case latency for the vCPU.
+	 */
+	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..179df96b20f8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
 		return 1;
 	return 0;
 }
+
+/*
+ * This lockless version of the range-based retry check *must* be paired with a
+ * call to the locked version after acquiring mmu_lock, i.e. this is safe to
+ * use only as a pre-check to avoid contending mmu_lock. This version *will*
+ * get false negatives and false positives.
+ */
+static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
+						   unsigned long mmu_seq,
+						   gfn_t gfn)
+{
+	/*
+	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
+	 * are always read from memory, e.g. so that checking for retry in a
+	 * loop won't result in an infinite retry loop. Don't force loads for
+	 * start+end, as the key to avoiding infinite retry loops is observing
+	 * the 1=>0 transition of in-progress, i.e. getting false negatives
+	 * due to stale start+end values is acceptable.
+	 */
+	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
+	    gfn >= kvm->mmu_invalidate_range_start &&
+	    gfn < kvm->mmu_invalidate_range_end)
+		return true;
+
+	return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
+}
 #endif

#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING

base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.43.0.472.g3155946c3a-goog
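
For context, a rough sketch of how the lockless pre-check pairs with the
existing locked re-check in the TDP MMU fault path (is_page_fault_stale() and
kvm_tdp_mmu_map() are existing KVM helpers; the flow below is simplified and
illustrative, not the exact upstream code, and pfn reference handling is
elided):

	static int tdp_page_fault_sketch(struct kvm_vcpu *vcpu,
					 struct kvm_page_fault *fault)
	{
		int r;

		/* Snapshot the invalidation sequence before resolving the pfn. */
		fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
		smp_rmb();

		/* May now bail early with RET_PF_RETRY via the lockless pre-check. */
		r = kvm_faultin_pfn(vcpu, fault, ACC_ALL);
		if (r != RET_PF_CONTINUE)
			return r;

		read_lock(&vcpu->kvm->mmu_lock);

		/*
		 * The locked re-check is still required; the pre-check is only
		 * an optimization and can get false negatives.
		 */
		if (is_page_fault_stale(vcpu, fault)) {
			r = RET_PF_RETRY;
			goto out_unlock;
		}

		r = kvm_tdp_mmu_map(vcpu, fault);

	out_unlock:
		read_unlock(&vcpu->kvm->mmu_lock);
		return r;
	}

I.e. the pre-check can only avoid acquiring mmu_lock in the first place;
correctness still comes entirely from the locked re-check.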



2024-01-11 02:41:17

by Yan Zhao

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

Tested in my environment with the below improvements observed:
For a VM with 8 vcpus + 16G memory + OVMF

Avg cycles of kvm_zap_gfn_range() is reduced from 1100k to 800k;
Avg cycles of kvm_unmap_gfn_range() is reduced from 470 to 180.

Reviewed-by: Yan Zhao <[email protected]>


2024-01-12 07:49:00

by Yuan Yao

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved gfn is covered

The check happens after the gfn is resolved, i.e. very close to where mmu_lock
is taken, so this should increase the possibility of hitting an in-progress
invalidation for the gfn?

> by an active invalidation. Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault. And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
>
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
>
> Add a dedicated lockess version of the range-based retry check to avoid
> false positives on the sanity check on start+end WARN, and so that it's
> super obvious that checking for a racing invalidation without holding
> mmu_lock is unsafe (though obviously useful).
>
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
> invalidation in a loop won't put KVM into an infinite loop, e.g. due to
> caching the in-progress flag and never seeing it go to '0'.
>
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.
>
> Do the pre-check even for non-preemptible kernels, as waiting to detect
> the invalidation until mmu_lock is held guarantees the vCPU will observe
> the worst case latency in terms of handling the fault, and can generate
> even more mmu_lock contention. E.g. the vCPU will acquire mmu_lock,
> detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
> eventually re-acquire mmu_lock. This behavior is also why there are no
> new starvation issues due to losing the fairness guarantees provided by
> rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
> on mmu_lock doesn't guarantee forward progress in the face of _another_
> mmu_notifier invalidation event.
>
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
>
> Reported-by: Yan Zhao <[email protected]>
> Closes: https://lore.kernel.org/all/[email protected]
> Acked-by: Kai Huang <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
>
> Note, this version adds a dedicated helper, mmu_invalidate_retry_gfn_unsafe(),
> instead of making mmu_invalidate_retry_gfn() play nice with being called without
> mmu_lock held. I was hesitant to drop the lockdep assertion before, and the
> recently introduced sanity check on the gfn start/end values pushed this past
> the threshold of being worth the duplicate code (preserving the start/end sanity
> check in lock-free code would comically difficult, and would add almost no value
> since it would have to be quite conservative to avoid false positives).
>
> Kai, I kept your Ack even though the code is obviously a little different.
> Holler if you want me to drop it.
>
> v2:
> - Introduce a dedicated helper and collapse to a single patch (because
> adding an unused helper would be quite silly).
> - Add a comment to explain the "unsafe" check in kvm_faultin_pfn(). [Kai]
> - Add Kai's Ack.
>
> v1: https://lore.kernel.org/all/[email protected]
>
> arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
> include/linux/kvm_host.h | 26 ++++++++++++++++++++++++++
> 2 files changed, 42 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..92f51540c4a7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4415,6 +4415,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> if (unlikely(!fault->slot))
> return kvm_handle_noslot_fault(vcpu, fault, access);
>
> + /*
> + * Pre-check for a relevant mmu_notifier invalidation event prior to
> + * acquiring mmu_lock. If there is an in-progress invalidation and the
> + * kernel allows preemption, the invalidation task may drop mmu_lock
> + * and yield in response to mmu_lock being contended, which is *very*
> + * counter-productive as this vCPU can't actually make forward progress
> + * until the invalidation completes. This "unsafe" check can get false
> + * negatives, i.e. KVM needs to re-check after acquiring mmu_lock. Do
> + * the pre-check even for non-preemtible kernels, i.e. even if KVM will
> + * never yield mmu_lock in response to contention, as this vCPU ob
> + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> + * to detect retry guarantees the worst case latency for the vCPU.
> + */
> + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> + return RET_PF_RETRY;

This breaks the contract of kvm_faultin_pfn(), i.e. the pfn's refcount is
increased after the pfn is resolved from the gfn, but the caller won't decrease it.

How about calling kvm_release_pfn_clean() just before returning RET_PF_RETRY here,
so we don't need to duplicate it in 3 different places?
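
I.e. something like this (a sketch of the suggestion only, assuming fault->pfn
holds the reference taken when the pfn was resolved):

	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}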

> +
> return RET_PF_CONTINUE;
> }
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..179df96b20f8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> return 1;
> return 0;
> }
> +
> +/*
> + * This lockless version of the range-based retry check *must* be paired with a

s/lockess/lockless

> + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> + * use only as a pre-check to avoid contending mmu_lock. This version *will*
> + * get false negatives and false positives.
> + */
> +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> + unsigned long mmu_seq,
> + gfn_t gfn)
> +{
> + /*
> + * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> + * are always read from memory, e.g. so that checking for retry in a
> + * loop won't result in an infinite retry loop. Don't force loads for
> + * start+end, as the key to avoiding infinite retry loops is observing
> + * the 1=>0 transition of in-progress, i.e. getting false negatives
> + * due to stale start+end values is acceptable.
> + */
> + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> + gfn >= kvm->mmu_invalidate_range_start &&
> + gfn < kvm->mmu_invalidate_range_end)
> + return true;
> +
> + return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
> +}
> #endif
>
> #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
>
> base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
> --
> 2.43.0.472.g3155946c3a-goog
>
>

2024-01-12 16:32:04

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Fri, Jan 12, 2024, Yuan Yao wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c844e428684..92f51540c4a7 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4415,6 +4415,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > if (unlikely(!fault->slot))
> > return kvm_handle_noslot_fault(vcpu, fault, access);
> >
> > + /*
> > + * Pre-check for a relevant mmu_notifier invalidation event prior to
> > + * acquiring mmu_lock. If there is an in-progress invalidation and the
> > + * kernel allows preemption, the invalidation task may drop mmu_lock
> > + * and yield in response to mmu_lock being contended, which is *very*
> > + * counter-productive as this vCPU can't actually make forward progress
> > + * until the invalidation completes. This "unsafe" check can get false
> > + * negatives, i.e. KVM needs to re-check after acquiring mmu_lock. Do
> > + * the pre-check even for non-preemtible kernels, i.e. even if KVM will
> > + * never yield mmu_lock in response to contention, as this vCPU ob
> > + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > + * to detect retry guarantees the worst case latency for the vCPU.
> > + */
> > + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> > + return RET_PF_RETRY;
>
> This breaks the contract of kvm_faultin_pfn(), i.e. the pfn's refcount is
> increased after the pfn is resolved from the gfn, but the caller won't decrease it.

Oof, good catch.

> How about calling kvm_release_pfn_clean() just before returning RET_PF_RETRY here,
> so we don't need to duplicate it in 3 different places?

Hrm, yeah, that does seem to be the best option. Thanks!

> > +
> > return RET_PF_CONTINUE;
> > }
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 7e7fd25b09b3..179df96b20f8 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2031,6 +2031,32 @@ static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> > return 1;
> > return 0;
> > }
> > +
> > +/*
> > + * This lockless version of the range-based retry check *must* be paired with a
>
> s/lockess/lockless

Heh, unless mine eyes deceive me, that's what I wrote :-)

2024-01-18 17:24:39

by Sean Christopherson

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Fri, Jan 19, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> > Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> > by an active invalidation. Contending for mmu_lock is especially
> > problematic on preemptible kernels as the mmu_notifier invalidation task
> > will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
>
> Is it possible for the fault-in task to avoid contending mmu_lock by using _trylock()?
> Like:
>
> while (!read_trylock(&vcpu->kvm->mmu_lock))
> cpu_relax();
>
> if (is_page_fault_stale(vcpu, fault))
> goto out_unlock;
>
> r = kvm_tdp_mmu_map(vcpu, fault);
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock)

It's definitely possible, but the downsides far outweigh any potential benefits.

Doing trylock means the CPU is NOT put into the queue for acquiring the lock,
which means that forward progress isn't guaranteed. E.g. in a pathological
scenario (and by "pathological", I mean if NUMA balancing or KSM is active ;-)),
it's entirely possible for a near-endless stream of mmu_lock writers to be in
the queue, thus preventing the vCPU from acquiring mmu_lock in a timely manner.

And hacking the page fault path to bypass KVM's lock contention detection would
be a very willful, deliberate violation of the intent of the MMU's yielding logic
for preemptible kernels.

That said, I would love to revisit KVM's use of rwlock_needbreak(), at least in
the TDP MMU. As evidenced by this rash of issues, it's not at all obvious that
yielding on mmu_lock contention is *ever* a net positive for KVM, or even for the
kernel. The shadow MMU is probably a different story since it can't parallelize
page faults with other operations, e.g. yielding in kvm_zap_obsolete_pages() to
allow vCPUs to make forward progress is probably a net positive.

But AFAIK, no one has done any testing to prove that yielding on contention in
the TDP MMU is actually a good thing. I'm 99% certain the only reason the TDP
MMU yields on contention is because the shadow MMU yields on contention, i.e.
I'm confident that no one ever did performance testing to show that there is
any benefit whatsoever to yielding mmu_lock in the TDP MMU.
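
For reference, the yielding in question is roughly the following pattern (a
simplified sketch of the TDP MMU's cond_resched helper, not the exact upstream
code, which also handles RCU and restarting the iterator):

	/* Yield mmu_lock if the scheduler needs the CPU or the lock is contended. */
	static bool tdp_mmu_yield_sketch(struct kvm *kvm, bool flush, bool shared)
	{
		if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
			return false;

		/* The remote TLB flush mentioned above, if zaps are pending. */
		if (flush)
			kvm_flush_remote_tlbs(kvm);

		if (shared)
			cond_resched_rwlock_read(&kvm->mmu_lock);
		else
			cond_resched_rwlock_write(&kvm->mmu_lock);

		return true;
	}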

> > invalidation, and ultimately increase the latency of resolving the page
> > fault. And in the worst case scenario, yielding will be accompanied by a
> > remote TLB flush, e.g. if the invalidation covers a large range of memory
> > and vCPUs are accessing addresses that were already zapped.
>
> This case covers all usage of mmu_invalidate_retry_gfn(), doesn't it? Should
> we also consider vmx_set_apic_access_page_addr()?

Nah, reloading the APIC access page is a very, very infrequent operation. And
post-boot (of the VM), the page will only be reloaded if it's migrated by the
primary MMU, i.e. after a relevant mmu_notifier operation. And the APIC access
page is a one-off allocation, i.e. it has its own VMA of PAGE_SIZE, so even if
vmx_set_apic_access_page_addr() does manage to trigger contention with a relevant
mmu_notifier event, e.g. races ahead of page migration completion, the odds of it
forcing KVM's mmu_notifier handler to yield mmu_lock are low (maybe impossible?)
because the invalidation event shouldn't be spanning multiple pages.

2024-01-18 21:45:36

by Xu Yilun

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> by an active invalidation. Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress

Is it possible for the fault-in task to avoid contending mmu_lock by using _trylock()?
Like:

while (!read_trylock(&vcpu->kvm->mmu_lock))
cpu_relax();

if (is_page_fault_stale(vcpu, fault))
goto out_unlock;

r = kvm_tdp_mmu_map(vcpu, fault);

out_unlock:
read_unlock(&vcpu->kvm->mmu_lock)

> invalidation, and ultimately increase the latency of resolving the page
> fault. And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.

This case covers all usage of mmu_invalidate_retry_gfn(), doesn't it? Should
we also consider vmx_set_apic_access_page_addr()?

Thanks,
Yilun

2024-01-19 07:10:08

by Xu Yilun

Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

On Thu, Jan 18, 2024 at 09:22:37AM -0800, Sean Christopherson wrote:
> On Fri, Jan 19, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> > > Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> > > by an active invalidation. Contending for mmu_lock is especially
> > > problematic on preemptible kernels as the mmu_notifier invalidation task
> > > will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> >
> > Is it possible for the fault-in task to avoid contending mmu_lock by using _trylock()?
> > Like:
> >
> > while (!read_trylock(&vcpu->kvm->mmu_lock))
> > cpu_relax();
> >
> > if (is_page_fault_stale(vcpu, fault))
> > goto out_unlock;
> >
> > r = kvm_tdp_mmu_map(vcpu, fault);
> >
> > out_unlock:
> > read_unlock(&vcpu->kvm->mmu_lock)
>
> It's definitely possible, but the downsides far outweigh any potential benefits.
>
> Doing trylock means the CPU is NOT put into the queue for acquiring the lock,
> which means that forward progress isn't guaranteed. E.g. in a pathological
> scenario (and by "pathological", I mean if NUMA balancing or KSM is active ;-)),
> it's entirely possible for a near-endless stream of mmu_lock writers to be in
> the queue, thus preventing the vCPU from acquiring mmu_lock in a timely manner.

Ah yes, I forgot the main purpose of yielding is to let the vCPU make
forward progress when the page being faulted in is not covered by the invalidation.

Thanks,
Yilun

>
> And hacking the page fault path to bypass KVM's lock contention detection would
> be a very willful, deliberate violation of the intent of the MMU's yielding logic
> for preemptible kernels.
>
> That said, I would love to revisit KVM's use of rwlock_needbreak(), at least in
> the TDP MMU. As evidenced by this rash of issues, it's not at all obvious that
> yielding on mmu_lock contention is *ever* a net positive for KVM, or even for the
> kernel. The shadow MMU is probably a different story since it can't parallelize
> page faults with other operations, e.g. yielding in kvm_zap_obsolete_pages() to
> allow vCPUs to make forward progress is probably a net positive.
>
> But AFAIK, no one has done any testing to prove that yielding on contention in
> the TDP MMU is actually a good thing. I'm 99% certain the only reason the TDP
> MMU yields on contention is because the shadow MMU yields on contention, i.e.
> I'm confident that no one ever did performance testing to show that there is
> any benefit whatsoever to yielding mmu_lock in the TDP MMU.


>