While optimistic spinning is beneficial to performance, I have found that
threads can potentially spin for a long time while there is no lock owner
in highly contended cases. In these scenarios, too much spinning can reduce
performance. This RFC patchset attempts to address some of the issues caused
by spinning too much when there is no owner.
Patch 1 changes the mutex_can_spin_on_owner() function to address the
need_resched() case. Patch 2 modifies the way mutex spinners are queued to
reduce mspin_lock and mspin_unlock overhead when there is no owner, and is also
a prerequisite for patch 3. Patch 3 limits the number of times each thread
can spin on lock->count when there is no lock owner.
These changes benefit the AIM7 fserver workload (run on disk) at 1000-2000 users
on an 8 socket (80 core) box, with a +19.5% gain for 3.13-rc7 + patchset
compared to the baseline 3.13-rc7 kernel. At 100-900 users, the
gain was about 4.9%, and there wasn't much performance difference
at 10-90 users. On a 2 socket (8 core) box, there was about a 5.5%
improvement at 1000-2000 users.
The mutex_can_spin_on_owner() function should also return false if the
task needs to be rescheduled.
Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4dd6e4c..85c6be1 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -212,6 +212,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
struct task_struct *owner;
int retval = 1;
+ if (need_resched())
+ return 0;
+
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
if (owner)
--
1.7.1
This patch is needed for patch 3, but should also be beneficial in general.
The mutex->spin_mlock was introduced in order to ensure that only 1 thread
loops on lock->owner field at a time to reduce cache line contention. When
lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
continually release and obtain the lock->spin_mlock. This can generate
quite a bit of overhead/contention, and also might just delay the spinner
from getting the lock.
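For context, the spin_mlock is an MCS-style queued lock: each spinner adds a
node to a queue and busy-waits on a flag in its own node, and the lock is
handed from node to node on release. Paraphrasing the mspin helpers in
mutex.c (a sketch; details may differ slightly from the actual source):
struct mspin_node {
	struct mspin_node *next;
	int		   locked;	/* 1 if lock acquired */
};
static void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
{
	struct mspin_node *prev;
	node->locked = 0;
	node->next   = NULL;
	prev = xchg(lock, node);		/* atomically become the queue tail */
	if (prev == NULL) {
		node->locked = 1;		/* queue was empty: lock acquired */
		return;
	}
	ACCESS_ONCE(prev->next) = node;
	while (!ACCESS_ONCE(node->locked))	/* spin on our own node only */
		arch_mutex_cpu_relax();
}
static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
{
	struct mspin_node *next = ACCESS_ONCE(node->next);
	if (!next) {
		if (cmpxchg(lock, node, NULL) == node)	/* no successor: free the lock */
			return;
		while (!(next = ACCESS_ONCE(node->next)))	/* successor is enqueueing */
			arch_mutex_cpu_relax();
	}
	ACCESS_ONCE(next->locked) = 1;		/* hand the lock to the successor */
}
So each release/re-acquire of the spin_mlock costs at least one atomic
operation plus a possible handoff, which is the overhead described above.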
This patch modifies the way optimistic spinners are queued by queuing before
entering the optimistic spinning loop, as opposed to acquiring the spin_mlock
before every call to mutex_spin_on_owner(). In situations where the spinner
requires extra spins before obtaining the lock, there will only be one spinner
trying to get the lock, and we avoid the overhead of unnecessarily unlocking
and relocking the spin_mlock.
Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 85c6be1..b500cc7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -419,6 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct mutex_waiter waiter;
unsigned long flags;
int ret;
+ struct mspin_node node;
preempt_disable();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -449,9 +450,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;
+ mspin_lock(MLOCK(lock), &node);
for (;;) {
struct task_struct *owner;
- struct mspin_node node;
if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -465,15 +466,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* As such, when deadlock detection needs to be
* performed the optimistic spinning cannot be done.
*/
- if (ACCESS_ONCE(ww->ctx))
+ if (ACCESS_ONCE(ww->ctx)) {
+ mspin_unlock(MLOCK(lock), &node);
goto slowpath;
+ }
}
/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
if (owner && !mutex_spin_on_owner(lock, owner)) {
mspin_unlock(MLOCK(lock), &node);
@@ -495,7 +497,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
preempt_enable();
return 0;
}
- mspin_unlock(MLOCK(lock), &node);
/*
* When there's no owner, we might have preempted between the
@@ -503,8 +504,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* we're an RT task that will live-lock because we won't let
* the owner complete.
*/
- if (!owner && (need_resched() || rt_task(task)))
+ if (!owner && (need_resched() || rt_task(task))) {
+ mspin_unlock(MLOCK(lock), &node);
goto slowpath;
+ }
/*
* The cpu_relax() call is a compiler barrier which forces
--
1.7.1
When running workloads that have high contention in mutexes on an 8 socket
machine, spinners would often spin for a long time with no lock owner.
One of the potential reasons for this is because a thread can be preempted
after clearing lock->owner but before releasing the lock, or preempted after
acquiring the mutex but before setting lock->owner. In those cases, the
spinner cannot check if owner is not on_cpu because lock->owner is NULL.
A solution that would address the preemption part of this problem would
be to disable preemption between acquiring/releasing the mutex and
setting/clearing the lock->owner. However, that will require adding overhead
to the mutex fastpath.
The solution used in this patch is to limit the number of times a thread can
spin on lock->count when !owner.
The threshold used in this patch for each spinner was 128, which appeared to
be a generous value, but any suggestions on another method to determine
the threshold are welcomed.
Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b500cc7..9465604 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -43,6 +43,7 @@
* mutex.
*/
#define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
+#define MUTEX_SPIN_THRESHOLD (128)
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
@@ -418,7 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct task_struct *task = current;
struct mutex_waiter waiter;
unsigned long flags;
- int ret;
+ int ret, nr_spins = 0;
struct mspin_node node;
preempt_disable();
@@ -453,6 +454,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
mspin_lock(MLOCK(lock), &node);
for (;;) {
struct task_struct *owner;
+ nr_spins++;
if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -502,9 +504,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* When there's no owner, we might have preempted between the
* owner acquiring the lock and setting the owner field. If
* we're an RT task that will live-lock because we won't let
- * the owner complete.
+ * the owner complete. Additionally, when there is no owner,
+ * stop spinning after too many tries.
*/
- if (!owner && (need_resched() || rt_task(task))) {
+ if (!owner && (need_resched() || rt_task(task) ||
+ nr_spins > MUTEX_SPIN_THRESHOLD)) {
mspin_unlock(MLOCK(lock), &node);
goto slowpath;
}
--
1.7.1
On Tue, 14 Jan 2014 16:33:10 -0800 Jason Low <[email protected]> wrote:
> When running workloads that have high contention in mutexes on an 8 socket
> machine, spinners would often spin for a long time with no lock owner.
>
> One of the potential reasons for this is because a thread can be preempted
> after clearing lock->owner but before releasing the lock, or preempted after
> acquiring the mutex but before setting lock->owner. In those cases, the
> spinner cannot check if owner is not on_cpu because lock->owner is NULL.
That sounds like a very small window. And your theory is that this
window is being hit sufficiently often to impact aggregate runtime
measurements, which sounds improbable to me?
> A solution that would address the preemption part of this problem would
> be to disable preemption between acquiring/releasing the mutex and
> setting/clearing the lock->owner. However, that will require adding overhead
> to the mutex fastpath.
preempt_disable() is cheap, and sometimes free.
Have you confirmed that the preempt_disable() approach actually fixes
the performance issues? If it does then this would confirm your
"potential reason" hypothesis. If it doesn't then we should be hunting
further for the explanation.
> The solution used in this patch is to limit the # of times thread can spin on
> lock->count when !owner.
>
> The threshold used in this patch for each spinner was 128, which appeared to
> be a generous value, but any suggestions on another method to determine
> the threshold are welcomed.
It's a bit hacky, isn't it? If your "owner got preempted in the
window" theory is correct then I guess this is reasonableish. But if
!owner is occurring for other reasons then perhaps there are better
solutions.
On Tue, 2014-01-14 at 16:33 -0800, Jason Low wrote:
> When running workloads that have high contention in mutexes on an 8 socket
> machine, spinners would often spin for a long time with no lock owner.
>
> One of the potential reasons for this is because a thread can be preempted
> after clearing lock->owner but before releasing the lock
What happens if you invert the order here? So mutex_clear_owner() is
called after the actual unlocking (__mutex_fastpath_unlock).
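I.e., roughly this ordering (sketch only, ignoring the CONFIG_DEBUG_MUTEXES
ifdef around mutex_clear_owner()):
void __sched mutex_unlock(struct mutex *lock)
{
	/*
	 * Unlock first, then clear the owner, so a spinner never sees
	 * owner == NULL while the lock is still held by the releaser.
	 */
	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
	mutex_clear_owner(lock);
}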
> or preempted after
> acquiring the mutex but before setting lock->owner.
That would be the case _only_ for the fastpath. For the slowpath
(including optimistic spinning) preemption is already disabled at that
point.
> In those cases, the
> spinner cannot check if owner is not on_cpu because lock->owner is NULL.
>
> A solution that would address the preemption part of this problem would
> be to disable preemption between acquiring/releasing the mutex and
> setting/clearing the lock->owner. However, that will require adding overhead
> to the mutex fastpath.
It's not uncommon to disable preemption in hotpaths; the overhead should
be quite small, actually.
>
> The solution used in this patch is to limit the # of times thread can spin on
> lock->count when !owner.
>
> The threshold used in this patch for each spinner was 128, which appeared to
> be a generous value, but any suggestions on another method to determine
> the threshold are welcomed.
Hmm generous compared to what? Could you elaborate further on how you
reached this value? These kinds of magic numbers have produced
significant debate in the past.
Thanks,
Davidlohr
On Tue, 2014-01-14 at 17:00 -0800, Andrew Morton wrote:
> On Tue, 14 Jan 2014 16:33:10 -0800 Jason Low <[email protected]> wrote:
>
> > When running workloads that have high contention in mutexes on an 8 socket
> > machine, spinners would often spin for a long time with no lock owner.
> >
> > One of the potential reasons for this is because a thread can be preempted
> > after clearing lock->owner but before releasing the lock, or preempted after
> > acquiring the mutex but before setting lock->owner. In those cases, the
> > spinner cannot check if owner is not on_cpu because lock->owner is NULL.
>
> That sounds like a very small window. And your theory is that this
> window is being hit sufficiently often to impact aggregate runtime
> measurements, which sounds improbable to me?
>
> > A solution that would address the preemption part of this problem would
> > be to disable preemption between acquiring/releasing the mutex and
> > setting/clearing the lock->owner. However, that will require adding overhead
> > to the mutex fastpath.
>
> preempt_disable() is cheap, and sometimes free.
>
> Have you confirmed that the preempt_disable() approach actually fixes
> the performance issues? If it does then this would confirm your
> "potential reason" hypothesis. If it doesn't then we should be hunting
> further for the explanation.
Using Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50),
which can also generate high mutex contention, the preempt_disable()
approach did provide approximately a 4% improvement at 160 threads, but
not nearly the 25+% I was seeing with this patchset. So, it looks like
preemption is not the main cause of the problem then.
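For reference, a sketch of that kind of unlock-side change (not necessarily
the exact patch tested, and with the matching change on the lock side):
void __sched mutex_unlock(struct mutex *lock)
{
	/*
	 * Keep the owner-clear and the actual unlock in one
	 * non-preemptible region so a spinner cannot observe
	 * owner == NULL while the mutex is still held.
	 * (Sketch only; CONFIG_DEBUG_MUTEXES details omitted.)
	 */
	preempt_disable();
	mutex_clear_owner(lock);
	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
	preempt_enable();
}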
Thanks,
Jason
On Tue, 2014-01-14 at 17:06 -0800, Davidlohr Bueso wrote:
> On Tue, 2014-01-14 at 16:33 -0800, Jason Low wrote:
> > When running workloads that have high contention in mutexes on an 8 socket
> > machine, spinners would often spin for a long time with no lock owner.
> >
> > One of the potential reasons for this is because a thread can be preempted
> > after clearing lock->owner but before releasing the lock
>
> What happens if you invert the order here? So mutex_clear_owner() is
> called after the actual unlocking (__mutex_fastpath_unlock).
Reversing the mutex_fastpath_unlock and mutex_clear_owner resulted in a
20+% performance improvement to Ingo's test-mutex application at 160
threads on an 8 socket box.
I have tried this method before, but what initially concerned me about
clearing the owner after unlocking was that the following scenario may
occur:
thread 1 releases the lock
thread 2 acquires the lock (in the fastpath)
thread 2 sets the owner
thread 1 clears owner
In this situation, lock owner is NULL but thread 2 has the lock.
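In code terms, the interleaving looks roughly like this:
/*
 *   thread 1 (releasing)              thread 2 (acquiring, fastpath)
 *   --------------------              ------------------------------
 *   __mutex_fastpath_unlock()
 *                                     __mutex_fastpath_lock()   -> gets the lock
 *                                     mutex_set_owner()         -> owner = thread 2
 *   mutex_clear_owner()               -> owner = NULL, thread 2 still holds the lock
 */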
> > or preempted after
> > acquiring the mutex but before setting lock->owner.
>
> That would be the case _only_ for the fastpath. For the slowpath
> (including optimistic spinning) preemption is already disabled at that
> point.
Right, for just the fastpath_lock.
> > In those cases, the
> > spinner cannot check if owner is not on_cpu because lock->owner is NULL.
> >
> > A solution that would address the preemption part of this problem would
> > be to disable preemption between acquiring/releasing the mutex and
> > setting/clearing the lock->owner. However, that will require adding overhead
> > to the mutex fastpath.
>
> It's not uncommon to disable preemption in hotpaths, the overhead should
> be quite smaller, actually.
>
> >
> > The solution used in this patch is to limit the # of times thread can spin on
> > lock->count when !owner.
> >
> > The threshold used in this patch for each spinner was 128, which appeared to
> > be a generous value, but any suggestions on another method to determine
> > the threshold are welcomed.
>
> Hmm generous compared to what? Could you elaborate further on how you
> reached this value? These kind of magic numbers have produced
> significant debate in the past.
I've observed that when running workloads which don't exhibit this
behavior (long spins with no owner), threads rarely take more than 100
extra spins. So I went with 128 based on those numbers.
On Tue, Jan 14, 2014 at 04:33:08PM -0800, Jason Low wrote:
> The mutex_can_spin_on_owner() function should also return false if the
> task needs to be rescheduled.
>
While I was staring at mutex_can_spin_on_owner(); don't we need this?
kernel/locking/mutex.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4dd6e4c219de..480d2f437964 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -214,8 +214,10 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
- if (owner)
+ if (owner) {
+ smp_read_barrier_depends();
retval = owner->on_cpu;
+ }
rcu_read_unlock();
/*
* if lock->owner is not set, the mutex owner may have just acquired
On Wed, Jan 15, 2014 at 08:44:20AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 14, 2014 at 04:33:08PM -0800, Jason Low wrote:
> > The mutex_can_spin_on_owner() function should also return false if the
> > task needs to be rescheduled.
> >
>
> While I was staring at mutex_can_spin_on_owner(); don't we need this?
>
> kernel/locking/mutex.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 4dd6e4c219de..480d2f437964 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -214,8 +214,10 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
>
> rcu_read_lock();
> owner = ACCESS_ONCE(lock->owner);
> - if (owner)
> + if (owner) {
That is, it's an unmatched barrier, as mutex_set_owner() doesn't include
a barrier, and I don't think it needs to; but on Alpha we still need this
read barrier to ensure we do not mess up this dependent load, afaik.
Paul? can you explain an unpaired read_barrier_depends?
> + smp_read_barrier_depends();
> retval = owner->on_cpu;
> + }
> rcu_read_unlock();
> /*
> * if lock->owner is not set, the mutex owner may have just acquired
On 01/14/2014 07:33 PM, Jason Low wrote:
> This patch is needed for patch 3, but should also be beneficial in general.
>
> The mutex->spin_mlock was introduced in order to ensure that only 1 thread
> loops on lock->owner field at a time to reduce cache line contention. When
> lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
> continually release and obtain the lock->spin_mlock. This can generate
> quite a bit of overhead/contention, and also might just delay the spinner
> from getting the lock.
>
> This patch modifies the way optimistic spinners are queued by queuing before
> entering the optimistic spinning loop as oppose to acquiring before every
> call to mutex_spin_on_owner(). So in situations where the spinner requires
> extra spins before obtaining the lock, then there will only be 1 spinner
> trying to get the lock and it will avoid the overhead from unnecessarily
> unlocking and locking the spin_mlock.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/locking/mutex.c | 13 ++++++++-----
> 1 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 85c6be1..b500cc7 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -419,6 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> struct mutex_waiter waiter;
> unsigned long flags;
> int ret;
> + struct mspin_node node;
>
> preempt_disable();
> mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
> @@ -449,9 +450,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> if (!mutex_can_spin_on_owner(lock))
> goto slowpath;
>
> + mspin_lock(MLOCK(lock), &node);
> for (;;) {
> struct task_struct *owner;
> - struct mspin_node node;
>
> if (use_ww_ctx && ww_ctx->acquired > 0) {
> struct ww_mutex *ww;
> @@ -465,15 +466,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> * As such, when deadlock detection needs to be
> * performed the optimistic spinning cannot be done.
> */
> - if (ACCESS_ONCE(ww->ctx))
> + if (ACCESS_ONCE(ww->ctx)) {
> + mspin_unlock(MLOCK(lock), &node);
> goto slowpath;
> + }
> }
>
> /*
> * If there's an owner, wait for it to either
> * release the lock or go to sleep.
> */
> - mspin_lock(MLOCK(lock), &node);
> owner = ACCESS_ONCE(lock->owner);
> if (owner && !mutex_spin_on_owner(lock, owner)) {
> mspin_unlock(MLOCK(lock), &node);
> @@ -495,7 +497,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> preempt_enable();
> return 0;
> }
> - mspin_unlock(MLOCK(lock), &node);
>
> /*
> * When there's no owner, we might have preempted between the
> @@ -503,8 +504,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> * we're an RT task that will live-lock because we won't let
> * the owner complete.
> */
> - if (!owner && (need_resched() || rt_task(task)))
> + if (!owner && (need_resched() || rt_task(task))) {
> + mspin_unlock(MLOCK(lock), &node);
> goto slowpath;
> + }
>
> /*
> * The cpu_relax() call is a compiler barrier which forces
Maybe you can consider restructuring the code as follows to reduce the
number of mspin_unlock() call sites:
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4dd6e4c..0a78a0c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -416,6 +416,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct mutex_waiter waiter;
unsigned long flags;
int ret;
+ struct mspin_node node;
preempt_disable();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -446,9 +447,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;
+ mspin_lock(MLOCK(lock), &node);
for (;;) {
struct task_struct *owner;
- struct mspin_node node;
if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -463,19 +464,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* performed the optimistic spinning cannot be done.
*/
if (ACCESS_ONCE(ww->ctx))
- goto slowpath;
+ break;
}
/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner)) {
- mspin_unlock(MLOCK(lock), &node);
- goto slowpath;
- }
+ if (owner && !mutex_spin_on_owner(lock, owner))
+ break;
if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -492,7 +490,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
preempt_enable();
return 0;
}
- mspin_unlock(MLOCK(lock), &node);
/*
* When there's no owner, we might have preempted between the
@@ -501,7 +498,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* the owner complete.
*/
if (!owner && (need_resched() || rt_task(task)))
- goto slowpath;
+ break;
/*
* The cpu_relax() call is a compiler barrier which forces
@@ -511,6 +508,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+ mspin_unlock(MLOCK(lock), &node);
slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);
On 01/14/2014 07:33 PM, Jason Low wrote:
> When running workloads that have high contention in mutexes on an 8 socket
> machine, spinners would often spin for a long time with no lock owner.
>
> One of the potential reasons for this is because a thread can be preempted
> after clearing lock->owner but before releasing the lock, or preempted after
> acquiring the mutex but before setting lock->owner. In those cases, the
> spinner cannot check if owner is not on_cpu because lock->owner is NULL.
>
> A solution that would address the preemption part of this problem would
> be to disable preemption between acquiring/releasing the mutex and
> setting/clearing the lock->owner. However, that will require adding overhead
> to the mutex fastpath.
>
> The solution used in this patch is to limit the # of times thread can spin on
> lock->count when !owner.
>
> The threshold used in this patch for each spinner was 128, which appeared to
> be a generous value, but any suggestions on another method to determine
> the threshold are welcomed.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/locking/mutex.c | 10 +++++++---
> 1 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index b500cc7..9465604 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -43,6 +43,7 @@
> * mutex.
> */
> #define MUTEX_SHOW_NO_WAITER(mutex) (atomic_read(&(mutex)->count) >= 0)
> +#define MUTEX_SPIN_THRESHOLD (128)
>
> void
> __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
> @@ -418,7 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> struct task_struct *task = current;
> struct mutex_waiter waiter;
> unsigned long flags;
> - int ret;
> + int ret, nr_spins = 0;
> struct mspin_node node;
>
> preempt_disable();
> @@ -453,6 +454,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> mspin_lock(MLOCK(lock), &node);
> for (;;) {
> struct task_struct *owner;
> + nr_spins++;
>
> if (use_ww_ctx && ww_ctx->acquired > 0) {
> struct ww_mutex *ww;
> @@ -502,9 +504,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> * When there's no owner, we might have preempted between the
> * owner acquiring the lock and setting the owner field. If
> * we're an RT task that will live-lock because we won't let
> - * the owner complete.
> + * the owner complete. Additionally, when there is no owner,
> + * stop spinning after too many tries.
> */
> - if (!owner && (need_resched() || rt_task(task))) {
> + if (!owner && (need_resched() || rt_task(task) ||
> + nr_spins > MUTEX_SPIN_THRESHOLD)) {
> mspin_unlock(MLOCK(lock), &node);
> goto slowpath;
> }
The time that a thread spends on one iteration of the loop can be highly
variable. Instead of a predefined iteration count, you may consider
setting an elapsed time limit on how long a thread can spin in the loop,
using changes in jiffies as a proxy. Say a limit of 40ms: on a 1000Hz
system, a change of 40 or more in jiffies would indicate that it is
time to leave and go to the slowpath.
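For illustration, a time-bounded check could look roughly like this
(sketch only; the 40ms figure is just the example above):
#define MUTEX_SPIN_TIMEOUT	msecs_to_jiffies(40)	/* example value */
	/* before entering the optimistic spin loop */
	unsigned long spin_deadline = jiffies + MUTEX_SPIN_TIMEOUT;
	for (;;) {
		struct task_struct *owner;
		...
		/* in place of the nr_spins > MUTEX_SPIN_THRESHOLD check */
		if (!owner && (need_resched() || rt_task(task) ||
			       time_after(jiffies, spin_deadline))) {
			mspin_unlock(MLOCK(lock), &node);
			goto slowpath;
		}
		...
		arch_mutex_cpu_relax();
	}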
-Longman
On Wed, 2014-01-15 at 10:10 -0500, Waiman Long wrote:
> On 01/14/2014 07:33 PM, Jason Low wrote:
> > * When there's no owner, we might have preempted between the
> > @@ -503,8 +504,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > * we're an RT task that will live-lock because we won't let
> > * the owner complete.
> > */
> > - if (!owner && (need_resched() || rt_task(task)))
> > + if (!owner && (need_resched() || rt_task(task))) {
> > + mspin_unlock(MLOCK(lock), &node);
> > goto slowpath;
> > + }
> >
> > /*
> > * The cpu_relax() call is a compiler barrier which forces
>
> Maybe you can consider restructure the code as follows to reduce the
> number of mspin_unlock() call sites:
Yeah, I would prefer your method of using break and having the
mspin_unlock() at the end of the loop, since it results in fewer
mspin_unlock() call sites.
Commit ec83f425dbca47e19c6737e8e7db0d0924a5de1b changed the breaks to
goto slowpath to make the code more intuitive to read, but with this
patch there are benefits to using break again.
On Wed, Jan 15, 2014 at 08:48:44AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2014 at 08:44:20AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 14, 2014 at 04:33:08PM -0800, Jason Low wrote:
> > > The mutex_can_spin_on_owner() function should also return false if the
> > > task needs to be rescheduled.
> > >
> >
> > While I was staring at mutex_can_spin_on_owner(); don't we need this?
> >
> > kernel/locking/mutex.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 4dd6e4c219de..480d2f437964 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -214,8 +214,10 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
> >
> > rcu_read_lock();
> > owner = ACCESS_ONCE(lock->owner);
> > - if (owner)
> > + if (owner) {
>
> That is, it's an unmatched barrier, as mutex_set_owner() doesn't include
> a barrier, and I don't think it needs to; but on Alpha we still need this
> read barrier to ensure we do not mess up this dependent load, afaik.
>
> Paul? can you explain an unpaired read_barrier_depends?
That is an impressive one! ;-)
My rationale for the code without smp_read_barrier_depends() is that
(1) the task struct was already exposed to readers and (2) the check
is heuristic in nature -- if we miss the assignment to ->on_cpu due
to memory order (or for any other reason), we just sleep unnecessarily.
If we did need full ordering (which I do -not- believe that we do at
the moment) then the above ACCESS_ONCE() can become rcu_dereference()
and mutex_set_owner() needs an smp_store_release().
So if we need a barrier here (which again I believe we do not), then
there needs to be a paired barrier in mutex_set_owner().
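Sketched out (illustrative only, and again not something I am proposing),
that fully ordered pairing would be roughly:
/* publisher: a release store, so everything written before it is visible
 * before the new lock->owner value is */
static inline void mutex_set_owner(struct mutex *lock)
{
	smp_store_release(&lock->owner, current);
}
/* reader: rcu_dereference() orders the ->on_cpu load after the pointer
 * load on all architectures, including Alpha (modulo __rcu annotation) */
	rcu_read_lock();
	owner = rcu_dereference(lock->owner);
	if (owner)
		retval = owner->on_cpu;
	rcu_read_unlock();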
Thanx, Paul
> > + smp_read_barrier_depends();
> > retval = owner->on_cpu;
> > + }
> > rcu_read_unlock();
> > /*
> > * if lock->owner is not set, the mutex owner may have just acquired
>
On Tue, 2014-01-14 at 16:33 -0800, Jason Low wrote:
> When running workloads that have high contention in mutexes on an 8 socket
> machine, spinners would often spin for a long time with no lock owner.
>
> One of the potential reasons for this is because a thread can be preempted
> after clearing lock->owner but before releasing the lock, or preempted after
> acquiring the mutex but before setting lock->owner. In those cases, the
> spinner cannot check if owner is not on_cpu because lock->owner is NULL.
Looks like a bigger source of !owner latency is in
__mutex_unlock_common_slowpath(). If __mutex_slowpath_needs_to_unlock(),
then the owner needs to acquire the wait_lock before setting lock->count
to 1. If the wait_lock is being contended, which is occurring with some
workloads on my box, then this can delay the owner from releasing
the lock by quite a bit.
Any comments on the below change which unlocks the mutex before taking
the lock->wait_lock to wake up a waiter? Thanks.
---
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b500cc7..38f0eb0 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -723,10 +723,6 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
struct mutex *lock = container_of(lock_count, struct mutex, count);
unsigned long flags;
- spin_lock_mutex(&lock->wait_lock, flags);
- mutex_release(&lock->dep_map, nested, _RET_IP_);
- debug_mutex_unlock(lock);
-
/*
* some architectures leave the lock unlocked in the fastpath failure
* case, others need to leave it locked. In the later case we have to
@@ -735,6 +731,10 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
if (__mutex_slowpath_needs_to_unlock())
atomic_set(&lock->count, 1);
+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ debug_mutex_unlock(lock);
+
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
On Thu, Jan 16, 2014 at 9:45 AM, Jason Low <[email protected]> wrote:
>
> Any comments on the below change which unlocks the mutex before taking
> the lock->wait_lock to wake up a waiter? Thanks.
Hmm. Doesn't that mean that a new lock owner can come in *before*
you've called debug_mutex_unlock and the lockdep stuff, and get the
lock? And then debug_mutex_lock() will be called *before* the unlocker
called debug_mutex_unlock(), which I'm sure confuses things.
Linus
On Thu, 2014-01-16 at 10:14 +0700, Linus Torvalds wrote:
> On Thu, Jan 16, 2014 at 9:45 AM, Jason Low <[email protected]> wrote:
> >
> > Any comments on the below change which unlocks the mutex before taking
> > the lock->wait_lock to wake up a waiter? Thanks.
>
> Hmm. Doesn't that mean that a new lock owner can come in *before*
> you've called debug_mutex_unlock and the lockdep stuff, and get the
> lock? And then debug_mutex_lock() will be called *before* the unlocker
> called debug_mutex_unlock(), which I'm sure confuses things.
If obtaining the wait_lock for debug_mutex_unlock is the issue, then
perhaps we can address that with an #ifdef on CONFIG_DEBUG_MUTEXES:
in the CONFIG_DEBUG_MUTEXES case, take the wait_lock first, and in the
regular case, take the wait_lock after releasing the mutex.
#ifdef CONFIG_DEBUG_MUTEXES
spin_lock_mutex(&lock->wait_lock, flags);
#endif
mutex_release(&lock->dep_map, nested, _RET_IP_);
debug_mutex_unlock(lock);
if (__mutex_slowpath_needs_to_unlock())
atomic_set(&lock->count, 1);
#ifndef CONFIG_DEBUG_MUTEXES
spin_lock_mutex(&lock->wait_lock, flags);
#endif
On Wed, Jan 15, 2014 at 10:46:17PM -0800, Jason Low wrote:
> On Thu, 2014-01-16 at 10:14 +0700, Linus Torvalds wrote:
> > On Thu, Jan 16, 2014 at 9:45 AM, Jason Low <[email protected]> wrote:
> > >
> > > Any comments on the below change which unlocks the mutex before taking
> > > the lock->wait_lock to wake up a waiter? Thanks.
> >
> > Hmm. Doesn't that mean that a new lock owner can come in *before*
> > you've called debug_mutex_unlock and the lockdep stuff, and get the
> > lock? And then debug_mutex_lock() will be called *before* the unlocker
> > called debug_mutex_unlock(), which I'm sure confuses things.
>
> If obtaining the wait_lock for debug_mutex_unlock is the issue, then
> perhaps we can address that by taking care of
> #ifdef CONFIG_DEBUG_MUTEXES. In the CONFIG_DEBUG_MUTEXES case, we can
> take the wait_lock first, and in the regular case, take the wait_lock
> after releasing the mutex.
I think we're already good for DEBUG_MUTEXES, because DEBUG_MUTEXES has
to work for archs that have !__mutex_slowpath_needs_to_unlock() and also
the DEBUG_MUTEXES code is entirely serialized on ->wait_lock.
Note that we cannot do the optimistic spinning for DEBUG_MUTEXES exactly
because of this reason.
On Thu, 2014-01-16 at 13:05 +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2014 at 10:46:17PM -0800, Jason Low wrote:
> > On Thu, 2014-01-16 at 10:14 +0700, Linus Torvalds wrote:
> > > On Thu, Jan 16, 2014 at 9:45 AM, Jason Low <[email protected]> wrote:
> > > >
> > > > Any comments on the below change which unlocks the mutex before taking
> > > > the lock->wait_lock to wake up a waiter? Thanks.
> > >
> > > Hmm. Doesn't that mean that a new lock owner can come in *before*
> > > you've called debug_mutex_unlock and the lockdep stuff, and get the
> > > lock? And then debug_mutex_lock() will be called *before* the unlocker
> > > called debug_mutex_unlock(), which I'm sure confuses things.
> >
> > If obtaining the wait_lock for debug_mutex_unlock is the issue, then
> > perhaps we can address that by taking care of
> > #ifdef CONFIG_DEBUG_MUTEXES. In the CONFIG_DEBUG_MUTEXES case, we can
> > take the wait_lock first, and in the regular case, take the wait_lock
> > after releasing the mutex.
>
> I think we're already good for DEBUG_MUTEXES, because DEBUG_MUTEXES has
> to work for archs that have !__mutex_slowpath_needs_to_unlock() and also
> the DEBUG_MUTEXES code is entirely serialized on ->wait_lock.
Yeah, in the !__mutex_slowpath_needs_to_unlock() case, we release the
mutex before calling __mutex_unlock_common_slowpath().
Jason