2024-01-10 21:47:37

by Sean Christopherson

Subject: [PATCH] sched/core: Drop spinlocks on contention iff kernel is preemptible

Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary". In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.
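
For reference, the runtime check boils down to roughly the following
(simplified sketch, not the exact sched.h code; the real
preempt_model_*() helpers are wired up to the PREEMPT_DYNAMIC
machinery):

  /*
   * Sketch: only the "full" model (or PREEMPT_RT) preempts kernel
   * code, so only then is rescheduling to drop a contended lock
   * useful.  Unlike an #ifdef CONFIG_PREEMPTION check, this tracks
   * the model selected at boot on PREEMPT_DYNAMIC=y kernels.
   */
  static inline bool preempt_model_preemptible(void)
  {
  	return preempt_model_full() || preempt_model_rt();
  }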

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read). For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.
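
For context, the write-side pattern in play looks roughly like the
below (hand-wavy sketch; zap_one_chunk() and CHUNK are made up for
illustration, KVM's real zap loops live in arch/x86/kvm/mmu/):

  static void invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)
  {
  	gfn_t gfn;

  	write_lock(&kvm->mmu_lock);
  	for (gfn = start; gfn < end; gfn += CHUNK) {
  		zap_one_chunk(kvm, gfn, min(gfn + CHUNK, end));
  		/*
  		 * Checks need_resched() and rwlock_needbreak(); with this
  		 * patch the needbreak half is false on "none"/"voluntary",
  		 * so the invalidation isn't forced to hand mmu_lock to a
  		 * faulting vCPU after every chunk.
  		 */
  		cond_resched_rwlock_write(&kvm->mmu_lock);
  	}
  	write_unlock(&kvm->mmu_lock);
  }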

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Cc: Valentin Schneider <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
include/linux/sched.h | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03bfe9ab2951..51550e64ce14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2229,11 +2229,10 @@ static inline bool preempt_model_preemptible(void)
  */
 static inline int spin_needbreak(spinlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return spin_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 /*
@@ -2246,11 +2245,10 @@ static inline int spin_needbreak(spinlock_t *lock)
  */
 static inline int rwlock_needbreak(rwlock_t *lock)
 {
-#ifdef CONFIG_PREEMPTION
+	if (!preempt_model_preemptible())
+		return 0;
+
 	return rwlock_is_contended(lock);
-#else
-	return 0;
-#endif
 }
 
 static __always_inline bool need_resched(void)

base-commit: cdb3033e191fd03da2d7da23b9cd448dfa180a8e
--
2.43.0.275.g3460e3d667-goog



2024-02-01 15:27:43

by Friedrich Weber

Subject: Re: [PATCH] sched/core: Drop spinlocks on contention iff kernel is preemptible

On 10/01/2024 22:47, Sean Christopherson wrote:
> Use preempt_model_preemptible() to detect a preemptible kernel when
> deciding whether or not to reschedule in order to drop a contended
> spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
> built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
> preemption model is "none" or "voluntary". In short, make kernels with
> dynamically selected models behave the same as kernels with statically
> selected models.
>
> Somewhat counter-intuitively, NOT yielding a lock can provide better
> latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
> rwlock, is often contended between an invalidation event (takes mmu_lock
> for write) and a vCPU servicing a guest page fault (takes mmu_lock for
> read). For _some_ setups, letting the invalidation task complete even
> if there is mmu_lock contention provides lower latency for *all* tasks,
> i.e. the invalidation completes sooner *and* the vCPU services the guest
> page fault sooner.

I've been testing this patch for some time now:

Applied on top of Linux 6.7 (0dd3ee31) on a PREEMPT_DYNAMIC kernel with
preempt=voluntary, it fixes an issue for me where KVM guests would
temporarily freeze if NUMA balancing and KSM are active on a NUMA host.
See [1] for more details.

In addition, I've been running with this patch on my (non-NUMA)
workstation with (admittedly fairly light) VM workloads for two weeks
now and so far haven't noticed any negative effects (this is on top of
a modified 6.5.11 kernel, though).

Side note: I noticed the patch no longer applies on 6.8-rc2; it seems
sched.h was refactored in the meantime.

[1]
https://lore.kernel.org/kvm/[email protected]/


2024-05-13 14:13:08

by Paolo Bonzini

Subject: Re: [PATCH] sched/core: Drop spinlocks on contention iff kernel is preemptible

On 1/10/24 22:47, Sean Christopherson wrote:
> Use preempt_model_preemptible() to detect a preemptible kernel when
> deciding whether or not to reschedule in order to drop a contended
> spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
> built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
> preemption model is "none" or "voluntary". In short, make kernels with
> dynamically selected models behave the same as kernels with statically
> selected models.

Peter, looks like this patch fell through the cracks. Could this be
applied for 6.10?

There is a slightly confusing line in the commit message below, which
makes it read more like an RFC; but the patch fixes a CONFIG_PREEMPT_DYNAMIC
regression wrt static preemption models and has no functional change for
!CONFIG_PREEMPT_DYNAMIC.
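
To spell out the behavior change for spin_needbreak()/rwlock_needbreak()
(my summary of the commit message, not new data):

- PREEMPT_DYNAMIC=y booted with preempt=none or preempt=voluntary:
  currently reports contention (CONFIG_PREEMPTION is always set),
  returns 0 with the patch applied; this is the regression being fixed.
- PREEMPT_DYNAMIC=y with preempt=full: reports contention before and
  after.
- !CONFIG_PREEMPT_DYNAMIC: the compile-time check and
  preempt_model_preemptible() agree, so no functional change.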

> Somewhat counter-intuitively, NOT yielding a lock can provide better
> latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
> rwlock, is often contended between an invalidation event (takes mmu_lock
> for write) and a vCPU servicing a guest page fault (takes mmu_lock for
> read). For _some_ setups, letting the invalidation task complete even
> if there is mmu_lock contention provides lower latency for *all* tasks,
> i.e. the invalidation completes sooner *and* the vCPU services the guest
> page fault sooner.
>
> But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
> can vary depending on the host VMM, the guest workload, the number of
> vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
>
> In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This should be "deleting the preempt_model_preemptible() guard", given
that the patch does delete the CONFIG_PREEMPTION check and only leaves
preempt_model_preemptible() in place.

Thanks,

Paolo

> opposite and removing contention yielding entirely) needs to come with a
> big pile of data proving that changing the status quo is a net positive.
>
> Cc: Valentin Schneider <[email protected]>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> Cc: Marco Elver <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: David Matlack <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> include/linux/sched.h | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 03bfe9ab2951..51550e64ce14 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2229,11 +2229,10 @@ static inline bool preempt_model_preemptible(void)
>   */
>  static inline int spin_needbreak(spinlock_t *lock)
>  {
> -#ifdef CONFIG_PREEMPTION
> +	if (!preempt_model_preemptible())
> +		return 0;
> +
>  	return spin_is_contended(lock);
> -#else
> -	return 0;
> -#endif
>  }
>  
>  /*
> @@ -2246,11 +2245,10 @@ static inline int spin_needbreak(spinlock_t *lock)
>   */
>  static inline int rwlock_needbreak(rwlock_t *lock)
>  {
> -#ifdef CONFIG_PREEMPTION
> +	if (!preempt_model_preemptible())
> +		return 0;
> +
>  	return rwlock_is_contended(lock);
> -#else
> -	return 0;
> -#endif
>  }
>  
>  static __always_inline bool need_resched(void)
>
> base-commit: cdb3033e191fd03da2d7da23b9cd448dfa180a8e


2024-05-13 15:35:31

by Sean Christopherson

Subject: Re: [PATCH] sched/core: Drop spinlocks on contention iff kernel is preemptible

On Mon, May 13, 2024, Paolo Bonzini wrote:
> On 1/10/24 22:47, Sean Christopherson wrote:
> > Use preempt_model_preemptible() to detect a preemptible kernel when
> > deciding whether or not to reschedule in order to drop a contended
> > spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
> > built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
> > preemption model is "none" or "voluntary". In short, make kernels with
> > dynamically selected models behave the same as kernels with statically
> > selected models.
>
> Peter, looks like this patch fell through the cracks. Could this be applied
> for 6.10?
>
> There is a slightly confusing line in the commit message below, which makes
> it read more like an RFC; but the patch fixes a CONFIG_PREEMPT_DYNAMIC
> regression wrt static preemption models and has no functional change for
> !CONFIG_PREEMPT_DYNAMIC.
>
> > Somewhat counter-intuitively, NOT yielding a lock can provide better
> > latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
> > rwlock, is often contended between an invalidation event (takes mmu_lock
> > for write) and a vCPU servicing a guest page fault (takes mmu_lock for
> > read). For _some_ setups, letting the invalidation task complete even
> > if there is mmu_lock contention provides lower latency for *all* tasks,
> > i.e. the invalidation completes sooner *and* the vCPU services the guest
> > page fault sooner.
> >
> > But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
> > can vary depending on the host VMM, the guest workload, the number of
> > vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
> >
> > In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This should be "deleting the preempt_model_preemptible() guard", given that
> the patch does delete the CONFIG_PREEMPTION check and only leaves
> preempt_model_preemptible() in place.

Note, this version won't apply cleanly; v2[*] handles the code movement and
still applies on Linus' tree.

[*] https://lore.kernel.org/all/[email protected]