2014-01-28 19:13:22

by Jason Low

Subject: [PATCH v2 0/5] mutex: Mutex scalability patches

v1->v2:
- Replace the previous patch that limits the # of times a thread can spin with
!lock->owner with a patch that releases the mutex before holding the wait_lock
in the __mutex_unlock_common_slowpath() function.
- Add a patch which allows a thread to attempt 1 mutex_spin_on_owner() without
checking need_resched() if need_resched() triggered while in the MCS queue.
- Add a patch which disables preemption between modifying lock->owner and
acquiring/releasing the mutex.

This patchset addresses a few scalability issues with mutexes.

Patch 1 has the mutex_can_spin_on_owner() function check for need_resched()
before the thread is added to the MCS queue.

Patches 2 and 3 fix issues with threads spinning when
there is no lock owner while the mutex is under high contention.

Patches 4 and 5 are RFC patches. Patch 4 disables preemption between modifying
lock->owner and locking/unlocking the mutex. Patch 5 addresses the situation
where spinners can potentially wait a long time in the MCS queue for a chance
to spin on the mutex owner (not checking for need_resched()), yet end up not
getting to spin.

These changes benefit the AIM7 fserver and high_systime workloads (run on disk)
on an 8 socket, 80 core box. The table below shows the performance
improvements with 3.13 + patches 1, 2, 3 when compared to the 3.13 baseline,
and the performance improvements with 3.13 + all 5 patches compared to
the 3.13 baseline.

Note: I split the % improvement into these 2 categories because
patch 3 and patch 5 are the most interesting/important patches in
this patchset in terms of performance improvements.

---------------------------------------------------------------
                          high_systime
---------------------------------------------------------------
  # users   | avg % improvement with   | avg % improvement with
            | 3.13 + patch 1, 2, 3     | 3.13 + patch 1, 2, 3, 4, 5
---------------------------------------------------------------
 1000-2000  |         +27.05%          |         +53.35%
---------------------------------------------------------------
  100-900   |         +36.11%          |         +52.56%
---------------------------------------------------------------
   10-90    |          +2.47%          |          +4.05%
---------------------------------------------------------------


---------------------------------------------------------------
                             fserver
---------------------------------------------------------------
  # users   | avg % improvement with   | avg % improvement with
            | 3.13 + patch 1, 2, 3     | 3.13 + patch 1, 2, 3, 4, 5
---------------------------------------------------------------
 1000-2000  |         +18.31%          |         +37.65%
---------------------------------------------------------------
  100-900   |          +5.99%          |         +17.50%
---------------------------------------------------------------
   10-90    |          +2.47%          |          +6.10%
---------------------------------------------------------------



2014-01-28 19:13:26

by Jason Low

Subject: [PATCH v2 1/5] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()

The mutex_can_spin_on_owner() function should also return false if the
task needs to be rescheduled to avoid entering the MCS queue when it
needs to reschedule.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4dd6e4c..85c6be1 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -212,6 +212,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
struct task_struct *owner;
int retval = 1;

+ if (need_resched())
+ return 0;
+
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
if (owner)
--
1.7.1

2014-01-28 19:13:38

by Jason Low

Subject: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

The mutex->spin_mlock was introduced in order to ensure that only 1 thread
spins for lock acquisition at a time to reduce cache line contention. When
lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
continually release and obtain the lock->spin_mlock. This can generate
quite a bit of overhead/contention, and also might just delay the spinner
from getting the lock.

This patch modifies the way optimistic spinners are queued by queuing before
entering the optimistic spinning loop, as opposed to acquiring it before every
call to mutex_spin_on_owner(). So in situations where the spinner requires
a few extra spins before obtaining the lock, there will only be 1 spinner
trying to get the lock, and it will avoid the overhead of unnecessarily
unlocking and locking the spin_mlock.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 85c6be1..7519d27 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -419,6 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct mutex_waiter waiter;
unsigned long flags;
int ret;
+ struct mspin_node node;

preempt_disable();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -449,9 +450,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;

+ mspin_lock(MLOCK(lock), &node);
for (;;) {
struct task_struct *owner;
- struct mspin_node node;

if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -466,19 +467,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* performed the optimistic spinning cannot be done.
*/
if (ACCESS_ONCE(ww->ctx))
- goto slowpath;
+ break;
}

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mspin_lock(MLOCK(lock), &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner)) {
- mspin_unlock(MLOCK(lock), &node);
- goto slowpath;
- }
+ if (owner && !mutex_spin_on_owner(lock, owner))
+ break;

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -495,7 +493,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
preempt_enable();
return 0;
}
- mspin_unlock(MLOCK(lock), &node);

/*
* When there's no owner, we might have preempted between the
@@ -504,7 +501,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* the owner complete.
*/
if (!owner && (need_resched() || rt_task(task)))
- goto slowpath;
+ break;

/*
* The cpu_relax() call is a compiler barrier which forces
@@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+ mspin_unlock(MLOCK(lock), &node);
slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);
--
1.7.1

2014-01-28 19:13:42

by Jason Low

Subject: [PATCH v2 3/5] mutex: Unlock the mutex without the wait_lock

When running workloads that have high contention in mutexes on an 8 socket
machine, mutex spinners would often spin for a long time with no lock owner.

The main reason why this is occurring is that in __mutex_unlock_common_slowpath(),
if __mutex_slowpath_needs_to_unlock() is true, then the owner needs to acquire the
mutex->wait_lock before releasing the mutex (setting lock->count to 1). When
the wait_lock is contended, this delays the mutex from being released.
We should be able to release the mutex without holding the wait_lock.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7519d27..6d85b08 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -718,10 +718,6 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
struct mutex *lock = container_of(lock_count, struct mutex, count);
unsigned long flags;

- spin_lock_mutex(&lock->wait_lock, flags);
- mutex_release(&lock->dep_map, nested, _RET_IP_);
- debug_mutex_unlock(lock);
-
/*
* some architectures leave the lock unlocked in the fastpath failure
* case, others need to leave it locked. In the later case we have to
@@ -730,6 +726,10 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
if (__mutex_slowpath_needs_to_unlock())
atomic_set(&lock->count, 1);

+ spin_lock_mutex(&lock->wait_lock, flags);
+ mutex_release(&lock->dep_map, nested, _RET_IP_);
+ debug_mutex_unlock(lock);
+
if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */
struct mutex_waiter *waiter =
--
1.7.1
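
For clarity, the resulting order of operations in
__mutex_unlock_common_slowpath() with this patch applied looks roughly like
this (reconstructed from the hunks above; the comment about architectures that
leave the lock locked in the fastpath failure case, and the waiter wake-up that
follows, are abbreviated):

	struct mutex *lock = container_of(lock_count, struct mutex, count);
	unsigned long flags;

	/* Release the mutex first, so spinners are not held up by wait_lock. */
	if (__mutex_slowpath_needs_to_unlock())
		atomic_set(&lock->count, 1);

	/* Only then take the wait_lock to wake up the first waiter. */
	spin_lock_mutex(&lock->wait_lock, flags);
	mutex_release(&lock->dep_map, nested, _RET_IP_);
	debug_mutex_unlock(lock);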

2014-01-28 19:13:58

by Jason Low

Subject: [RFC][PATCH v2 4/5] mutex: Disable preemption between modifying lock->owner and locking/unlocking mutex

This RFC patch disables preemption between modifying lock->owner and
locking/unlocking the mutex lock. This prevents situations where the owner
can be preempted between those 2 operations, which causes optimistic spinners to
be unable to check if lock->owner is not on CPU. As mentioned in the
thread for this v1 patchset, disabling preemption is a cheap operation.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 30 ++++++++++++++++++++++++++++--
1 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 6d85b08..cfaaf53 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -98,8 +98,10 @@ void __sched mutex_lock(struct mutex *lock)
* The locking fastpath is the 1->0 transition from
* 'unlocked' into 'locked' state.
*/
+ preempt_disable();
__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
mutex_set_owner(lock);
+ preempt_enable();
}

EXPORT_SYMBOL(mutex_lock);
@@ -253,9 +255,13 @@ void __sched mutex_unlock(struct mutex *lock)
* the slow path will always be taken, and that clears the owner field
* after verifying that it was indeed current.
*/
+ preempt_disable();
mutex_clear_owner(lock);
#endif
__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+#ifndef CONFIG_DEBUG_MUTEXES
+ preempt_enable();
+#endif
}

EXPORT_SYMBOL(mutex_unlock);
@@ -292,9 +298,13 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
* the slow path will always be taken, and that clears the owner field
* after verifying that it was indeed current.
*/
+ preempt_disable();
mutex_clear_owner(&lock->base);
#endif
__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+#ifndef CONFIG_DEBUG_MUTEXES
+ preempt_enable();
+#endif
}
EXPORT_SYMBOL(ww_mutex_unlock);

@@ -780,12 +790,16 @@ int __sched mutex_lock_interruptible(struct mutex *lock)
int ret;

might_sleep();
+ preempt_disable();
ret = __mutex_fastpath_lock_retval(&lock->count);
if (likely(!ret)) {
mutex_set_owner(lock);
+ preempt_enable();
return 0;
- } else
+ } else {
+ preempt_enable();
return __mutex_lock_interruptible_slowpath(lock);
+ }
}

EXPORT_SYMBOL(mutex_lock_interruptible);
@@ -795,12 +809,16 @@ int __sched mutex_lock_killable(struct mutex *lock)
int ret;

might_sleep();
+ preempt_disable();
ret = __mutex_fastpath_lock_retval(&lock->count);
if (likely(!ret)) {
mutex_set_owner(lock);
+ preempt_enable();
return 0;
- } else
+ } else {
+ preempt_enable();
return __mutex_lock_killable_slowpath(lock);
+ }
}
EXPORT_SYMBOL(mutex_lock_killable);

@@ -889,9 +907,11 @@ int __sched mutex_trylock(struct mutex *lock)
{
int ret;

+ preempt_disable();
ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
if (ret)
mutex_set_owner(lock);
+ preempt_enable();

return ret;
}
@@ -904,6 +924,7 @@ __ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
int ret;

might_sleep();
+ preempt_disable();

ret = __mutex_fastpath_lock_retval(&lock->base.count);

@@ -912,6 +933,8 @@ __ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
mutex_set_owner(&lock->base);
} else
ret = __ww_mutex_lock_slowpath(lock, ctx);
+
+ preempt_enable();
return ret;
}
EXPORT_SYMBOL(__ww_mutex_lock);
@@ -922,6 +945,7 @@ __ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
int ret;

might_sleep();
+ preempt_disable();

ret = __mutex_fastpath_lock_retval(&lock->base.count);

@@ -930,6 +954,8 @@ __ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
mutex_set_owner(&lock->base);
} else
ret = __ww_mutex_lock_interruptible_slowpath(lock, ctx);
+
+ preempt_enable();
return ret;
}
EXPORT_SYMBOL(__ww_mutex_lock_interruptible);
--
1.7.1

2014-01-28 19:14:08

by Jason Low

Subject: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

Before a thread attempts mutex spin on owner, it is first added to a queue
using an MCS lock so that only 1 thread spins on owner at a time. However, when
the spinner is queued, it is unable to check if it needs to reschedule and
will remain on the queue. This could cause the spinner to spin longer
than its allocated time. However, once it is the spinner's turn to spin on
owner, it would immediately go to slowpath if it need_resched() and gets no spin
time. In these situations, not only does the spinner take up more time for a
chance to spin for the mutex, it also ends up not getting to spin once it
gets its turn.

One solution would be to exit the MCS queue and go to mutex slowpath if
need_resched(). However, that may require a lot of overhead. For instance, if a
thread at the end of the queue need_resched(), in order to remove it from the
queue, we would have to unqueue and requeue all other spinners in front of it.

This RFC patch tries to address the issue in another context by avoiding
situations where spinners immediately get sent to the slowpath on
need_resched() upon getting to spin. We will first check for need_resched()
right after acquiring the MCS lock. If need_resched() is true, then
need_resched() triggered while the thread is waiting in the MCS queue (since
patch 1 makes the spinner check for need_resched() before entering the queue).
In this case, we will allow the thread to have at least 1 try to do
mutex_spin_on_owner() regardless of need_resched(). This patch also removes
the need_resched() in the outer loop in case we require a few extra spins to
observe lock->count == 1, and patch 4 ensures we won't be spinning with
lock owner preempted.

And if the need_resched() check after acquiring the MCS lock is false, then
we won't give the spinner any extra spinning.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 22 ++++++++++++++++------
1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cfaaf53..2281a48 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -187,11 +187,11 @@ static inline bool owner_running(struct mutex *lock, struct task_struct *owner)
* access and not reliable.
*/
static noinline
-int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, int spin_grace_period)
{
rcu_read_lock();
while (owner_running(lock, owner)) {
- if (need_resched())
+ if (need_resched() && !spin_grace_period)
break;

arch_mutex_cpu_relax();
@@ -428,7 +428,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct task_struct *task = current;
struct mutex_waiter waiter;
unsigned long flags;
- int ret;
+ int ret, spin_grace_period;
struct mspin_node node;

preempt_disable();
@@ -461,6 +461,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
goto slowpath;

mspin_lock(MLOCK(lock), &node);
+ /*
+ * If need_resched() triggered while queued, then we will still give
+ * this spinner a chance to spin for the mutex, rather than send this
+ * immediately to slowpath after waiting for its turn.
+ */
+ spin_grace_period = (need_resched()) ? 1 : 0;
for (;;) {
struct task_struct *owner;

@@ -485,8 +491,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* release the lock or go to sleep.
*/
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner))
- break;
+ if (owner) {
+ if (!mutex_spin_on_owner(lock, owner, spin_grace_period))
+ break;
+
+ spin_grace_period = 0;
+ }

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -510,7 +520,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* we're an RT task that will live-lock because we won't let
* the owner complete.
*/
- if (!owner && (need_resched() || rt_task(task)))
+ if (!owner && rt_task(task))
break;

/*
--
1.7.1

2014-01-28 20:20:50

by Paul E. McKenney

Subject: Re: [PATCH v2 1/5] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()

On Tue, Jan 28, 2014 at 11:13:12AM -0800, Jason Low wrote:
> The mutex_can_spin_on_owner() function should also return false if the
> task needs to be rescheduled to avoid entering the MCS queue when it
> needs to reschedule.
>
> Signed-off-by: Jason Low <[email protected]>

Reviewed-by: Paul E. McKenney <[email protected]>

But I cannot help asking how this affects performance. (My guess is
"not much", but always good to know.)

> ---
> kernel/locking/mutex.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 4dd6e4c..85c6be1 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -212,6 +212,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
> struct task_struct *owner;
> int retval = 1;
>
> + if (need_resched())
> + return 0;
> +
> rcu_read_lock();
> owner = ACCESS_ONCE(lock->owner);
> if (owner)
> --
> 1.7.1
>

2014-01-28 20:23:43

by Paul E. McKenney

Subject: Re: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

On Tue, Jan 28, 2014 at 11:13:13AM -0800, Jason Low wrote:
> The mutex->spin_mlock was introduced in order to ensure that only 1 thread
> spins for lock acquisition at a time to reduce cache line contention. When
> lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
> continually release and obtain the lock->spin_mlock. This can generate
> quite a bit of overhead/contention, and also might just delay the spinner
> from getting the lock.
>
> This patch modifies the way optimistic spinners are queued by queuing before
> entering the optimistic spinning loop, as opposed to acquiring it before every
> call to mutex_spin_on_owner(). So in situations where the spinner requires
> a few extra spins before obtaining the lock, there will only be 1 spinner
> trying to get the lock, and it will avoid the overhead of unnecessarily
> unlocking and locking the spin_mlock.
>
> Signed-off-by: Jason Low <[email protected]>

One question below. Also, this might well have a visible effect on
performance, so would be good to see the numbers.

Thanx, Paul

> ---
> kernel/locking/mutex.c | 16 +++++++---------
> 1 files changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 85c6be1..7519d27 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -419,6 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> struct mutex_waiter waiter;
> unsigned long flags;
> int ret;
> + struct mspin_node node;
>
> preempt_disable();
> mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
> @@ -449,9 +450,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> if (!mutex_can_spin_on_owner(lock))
> goto slowpath;
>
> + mspin_lock(MLOCK(lock), &node);
> for (;;) {
> struct task_struct *owner;
> - struct mspin_node node;
>
> if (use_ww_ctx && ww_ctx->acquired > 0) {
> struct ww_mutex *ww;
> @@ -466,19 +467,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> * performed the optimistic spinning cannot be done.
> */
> if (ACCESS_ONCE(ww->ctx))
> - goto slowpath;
> + break;
> }
>
> /*
> * If there's an owner, wait for it to either
> * release the lock or go to sleep.
> */
> - mspin_lock(MLOCK(lock), &node);
> owner = ACCESS_ONCE(lock->owner);
> - if (owner && !mutex_spin_on_owner(lock, owner)) {
> - mspin_unlock(MLOCK(lock), &node);
> - goto slowpath;
> - }
> + if (owner && !mutex_spin_on_owner(lock, owner))
> + break;
>
> if ((atomic_read(&lock->count) == 1) &&
> (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
> @@ -495,7 +493,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> preempt_enable();
> return 0;
> }
> - mspin_unlock(MLOCK(lock), &node);
>
> /*
> * When there's no owner, we might have preempted between the
> @@ -504,7 +501,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> * the owner complete.
> */
> if (!owner && (need_resched() || rt_task(task)))
> - goto slowpath;
> + break;
>
> /*
> * The cpu_relax() call is a compiler barrier which forces
> @@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> */
> arch_mutex_cpu_relax();
> }
> + mspin_unlock(MLOCK(lock), &node);
> slowpath:

Are there any remaining goto statements to slowpath? If so, they need
to release the lock. If not, this label should be removed.

> #endif
> spin_lock_mutex(&lock->wait_lock, flags);
> --
> 1.7.1
>

2014-01-28 20:24:12

by Paul E. McKenney

Subject: Re: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

On Tue, Jan 28, 2014 at 12:23:34PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 28, 2014 at 11:13:13AM -0800, Jason Low wrote:
> > The mutex->spin_mlock was introduced in order to ensure that only 1 thread
> > spins for lock acquisition at a time to reduce cache line contention. When
> > lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
> > continually release and obtain the lock->spin_mlock. This can generate
> > quite a bit of overhead/contention, and also might just delay the spinner
> > from getting the lock.
> >
> > This patch modifies the way optimistic spinners are queued by queuing before
> > entering the optimistic spinning loop, as opposed to acquiring it before every
> > call to mutex_spin_on_owner(). So in situations where the spinner requires
> > a few extra spins before obtaining the lock, there will only be 1 spinner
> > trying to get the lock, and it will avoid the overhead of unnecessarily
> > unlocking and locking the spin_mlock.
> >
> > Signed-off-by: Jason Low <[email protected]>
>
> One question below. Also, this might well have a visible effect on
> performance, so would be good to see the numbers.

Never mind, I see the numbers in your patch 0. :-/

Thanx, Paul

> > ---
> > kernel/locking/mutex.c | 16 +++++++---------
> > 1 files changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> > index 85c6be1..7519d27 100644
> > --- a/kernel/locking/mutex.c
> > +++ b/kernel/locking/mutex.c
> > @@ -419,6 +419,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > struct mutex_waiter waiter;
> > unsigned long flags;
> > int ret;
> > + struct mspin_node node;
> >
> > preempt_disable();
> > mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
> > @@ -449,9 +450,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > if (!mutex_can_spin_on_owner(lock))
> > goto slowpath;
> >
> > + mspin_lock(MLOCK(lock), &node);
> > for (;;) {
> > struct task_struct *owner;
> > - struct mspin_node node;
> >
> > if (use_ww_ctx && ww_ctx->acquired > 0) {
> > struct ww_mutex *ww;
> > @@ -466,19 +467,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > * performed the optimistic spinning cannot be done.
> > */
> > if (ACCESS_ONCE(ww->ctx))
> > - goto slowpath;
> > + break;
> > }
> >
> > /*
> > * If there's an owner, wait for it to either
> > * release the lock or go to sleep.
> > */
> > - mspin_lock(MLOCK(lock), &node);
> > owner = ACCESS_ONCE(lock->owner);
> > - if (owner && !mutex_spin_on_owner(lock, owner)) {
> > - mspin_unlock(MLOCK(lock), &node);
> > - goto slowpath;
> > - }
> > + if (owner && !mutex_spin_on_owner(lock, owner))
> > + break;
> >
> > if ((atomic_read(&lock->count) == 1) &&
> > (atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
> > @@ -495,7 +493,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > preempt_enable();
> > return 0;
> > }
> > - mspin_unlock(MLOCK(lock), &node);
> >
> > /*
> > * When there's no owner, we might have preempted between the
> > @@ -504,7 +501,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > * the owner complete.
> > */
> > if (!owner && (need_resched() || rt_task(task)))
> > - goto slowpath;
> > + break;
> >
> > /*
> > * The cpu_relax() call is a compiler barrier which forces
> > @@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > */
> > arch_mutex_cpu_relax();
> > }
> > + mspin_unlock(MLOCK(lock), &node);
> > slowpath:
>
> Are there any remaining goto statements to slowpath? If so, they need
> to release the lock. If not, this label should be removed.
>
> > #endif
> > spin_lock_mutex(&lock->wait_lock, flags);
> > --
> > 1.7.1
> >

2014-01-28 20:54:27

by Peter Zijlstra

Subject: Re: [RFC][PATCH v2 4/5] mutex: Disable preemption between modifying lock->owner and locking/unlocking mutex

On Tue, Jan 28, 2014 at 11:13:15AM -0800, Jason Low wrote:
> This RFC patch disables preemption between modifying lock->owner and
> locking/unlocking the mutex lock. This prevents situations where the owner
> can be preempted between those 2 operations, which causes optimistic spinners to
> be unable to check if lock->owner is not on CPU. As mentioned in the
> thread for this v1 patchset, disabling preemption is a cheap operation.

In that same thread it was also said that this wasn't really an issue at
all. So what is the justification?

The patch is rather hideous.

2014-01-28 21:08:10

by Peter Zijlstra

Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote:
> Before a thread attempts mutex spin on owner, it is first added to a queue
> using an MCS lock so that only 1 thread spins on owner at a time. However, when
> the spinner is queued, it is unable to check if it needs to reschedule and
> will remain on the queue. This could cause the spinner to spin longer
> than its allocated time.

what allocated time?

> However, once it is the spinner's turn to spin on
> owner, it would immediately go to slowpath if it need_resched() and gets no spin
> time. In these situations, not only does the spinner take up more time for a
> chance to spin for the mutex, it also ends up not getting to spin once it
> gets its turn.
>
> One solution would be to exit the MCS queue and go to mutex slowpath if
> need_resched(). However, that may require a lot of overhead. For instance, if a
> thread at the end of the queue need_resched(), in order to remove it from the
> queue, we would have to unqueue and requeue all other spinners in front of it.

If you can do that, you can also walk the list and find prev and cmpxchg
the entry out. But I don't think you can do either, as we simply don't
have a head pointer.
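
For reference, the MCS node that 3.13's mutex spinners use looks roughly like
this (paraphrasing the declarations near the top of kernel/locking/mutex.c that
the patches above touch; shown only to illustrate the point about the missing
head/prev pointer):

	struct mspin_node {
		struct mspin_node *next;
		int		   locked;	/* 1 if lock acquired */
	};

	/* The mutex itself only holds a pointer to the tail of the queue. */
	#define MLOCK(mutex)	((struct mspin_node **)&((mutex)->spin_mlock))

Each queued node only knows its successor, and the lock word only tracks the
tail, so a node in the middle has no way to find its predecessor in order to
unlink itself.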

> This RFC patch tries to address the issue in another context by avoiding
> situations where spinners immediately get sent to the slowpath on
> need_resched() upon getting to spin.

> We will first check for need_resched()
> right after acquiring the MCS lock. If need_resched() is true, then
> need_resched() triggered while the thread is waiting in the MCS queue (since
> patch 1 makes the spinner check for need_resched() before entering the queue).

> In this case, we will allow the thread to have at least 1 try to do
> mutex_spin_on_owner() regardless of need_resched().

No! We should absolutely not ever willfully ignore need_resched(). Esp.
not for unbounded spins.

> This patch also removes
> the need_resched() in the outer loop in case we require a few extra spins to
> observe lock->count == 1, and patch 4 ensures we won't be spinning with
> lock owner preempted.
>
> And if the need_resched() check after acquiring the MCS lock is false, then
> we won't give the spinner any extra spinning.

But urgh, nasty problem. Lemme ponder this a bit.

2014-01-28 21:08:38

by Davidlohr Bueso

Subject: Re: [PATCH v2 0/5] mutex: Mutex scalability patches

On Tue, 2014-01-28 at 11:13 -0800, Jason Low wrote:
> v1->v2:
> - Replace the previous patch that limits the # of times a thread can spin with
> !lock->owner with a patch that releases the mutex before holding the wait_lock
> in the __mutex_unlock_common_slowpath() function.
> - Add a patch which allows a thread to attempt 1 mutex_spin_on_owner() without
> checking need_resched() if need_resched() triggered while in the MCS queue.
> - Add a patch which disables preemption between modifying lock->owner and
> acquiring/releasing the mutex.
>
> This patchset addresses a few scalability issues with mutexes.
>
> Patch 1 has the mutex_can_spin_on_owner() function check for need_resched()
> before the thread is added to the MCS queue.
>
> Patches 2 and 3 fix issues with threads spinning when
> there is no lock owner while the mutex is under high contention.
>
> Patches 4 and 5 are RFC patches. Patch 4 disables preemption between modifying
> lock->owner and locking/unlocking the mutex. Patch 5 addresses the situation
> where spinners can potentially wait a long time in the MCS queue for a chance
> to spin on the mutex owner (not checking for need_resched()), yet end up not
> getting to spin.
>
> These changes benefit the AIM7 fserver and high_systime workloads (run on disk)
> on an 8 socket, 80 core box. The table below shows the performance
> improvements with 3.13 + patches 1, 2, 3 when compared to the 3.13 baseline,
> and the performance improvements with 3.13 + all 5 patches compared to
> the 3.13 baseline.

A lot of these changes are quite subtle. It would be good to see how
smaller systems are impacted with other workloads, not only big servers.
Since you see improvement in fserver, perhaps similar workloads could
also be of use: fio, filebench, postmark, fstress, etc.

Thanks,
Davidlohr

2014-01-28 21:09:29

by Davidlohr Bueso

Subject: Re: [PATCH v2 1/5] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()

On Tue, 2014-01-28 at 11:13 -0800, Jason Low wrote:
> The mutex_can_spin_on_owner() function should also return false if the
> task needs to be rescheduled to avoid entering the MCS queue when it
> needs to reschedule.
>
> Signed-off-by: Jason Low <[email protected]>

Reviewed-by: Davidlohr Bueso <[email protected]>

2014-01-28 21:17:48

by Davidlohr Bueso

Subject: Re: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

On Tue, 2014-01-28 at 12:23 -0800, Paul E. McKenney wrote:
> On Tue, Jan 28, 2014 at 11:13:13AM -0800, Jason Low wrote:
> > ...
> > if (!owner && (need_resched() || rt_task(task)))
> > - goto slowpath;
> > + break;
> >
> > /*
> > * The cpu_relax() call is a compiler barrier which forces
> > @@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > */
> > arch_mutex_cpu_relax();
> > }
> > + mspin_unlock(MLOCK(lock), &node);
> > slowpath:
>
> Are there any remaining goto statements to slowpath? If so, they need
> to release the lock. If not, this label should be removed.

We still have the !mutex_can_spin_on_owner case.

Thanks,
Davidlohr

2014-01-28 22:01:04

by Jason Low

Subject: Re: [PATCH v2 1/5] mutex: In mutex_can_spin_on_owner(), return false if task need_resched()

On Tue, 2014-01-28 at 12:20 -0800, Paul E. McKenney wrote:
> On Tue, Jan 28, 2014 at 11:13:12AM -0800, Jason Low wrote:
> > The mutex_can_spin_on_owner() function should also return false if the
> > task needs to be rescheduled to avoid entering the MCS queue when it
> > needs to reschedule.
> >
> > Signed-off-by: Jason Low <[email protected]>
>
> Reviewed-by: Paul E. McKenney <[email protected]>
>
> But I cannot help asking how this affects performance. (My guess is
> "not much", but always good to know.)

Hi Paul,

In the past, when I tested this particular patch, I did not see any
noticeable performance gains. Patch 1 is really more of a "correctness"
change which was why I didn't retest this by itself. I can be sure to
include the benchmark numbers on this particular patch though next
time.

Thanks.

2014-01-28 22:10:45

by Jason Low

Subject: Re: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

On Tue, 2014-01-28 at 12:23 -0800, Paul E. McKenney wrote:
> On Tue, Jan 28, 2014 at 11:13:13AM -0800, Jason Low wrote:
> > /*
> > * The cpu_relax() call is a compiler barrier which forces
> > @@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > */
> > arch_mutex_cpu_relax();
> > }
> > + mspin_unlock(MLOCK(lock), &node);
> > slowpath:
>
> Are there any remaining goto statements to slowpath? If so, they need
> to release the lock. If not, this label should be removed.

Yes, if mutex_can_spin_on_owner() returns false, then the thread
goes directly to the slowpath, bypassing the optimistic spinning loop. In
that case, the thread avoids acquiring the MCS lock, and doesn't unlock
the MCS lock.
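
A condensed view of the control flow after patch 2, with most of the loop body
elided (this is only a sketch of the structure, not the actual code):

	if (!mutex_can_spin_on_owner(lock))
		goto slowpath;			/* MCS lock never taken */

	mspin_lock(MLOCK(lock), &node);
	for (;;) {
		/* ... optimistic spinning; every early exit is now a break ... */
	}
	mspin_unlock(MLOCK(lock), &node);	/* one unlock covers all breaks */
slowpath:
	spin_lock_mutex(&lock->wait_lock, flags);

So the only way to reach the slowpath label without passing through
mspin_unlock() is the !mutex_can_spin_on_owner() check, which runs before the
MCS lock is taken.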

Thanks,
Jason

2014-01-28 22:17:52

by Jason Low

Subject: Re: [RFC][PATCH v2 4/5] mutex: Disable preemption between modifying lock->owner and locking/unlocking mutex

On Tue, 2014-01-28 at 21:54 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 11:13:15AM -0800, Jason Low wrote:
> > This RFC patch disables preemption between modifying lock->owner and
> > locking/unlocking the mutex lock. This prevents situations where the owner
> > can be preempted between those 2 operations, which causes optimistic spinners to
> > be unable to check if lock->owner is not on CPU. As mentioned in the
> > thread for this v1 patchset, disabling preemption is a cheap operation.
>
> In that same thread it was also said that this wasn't really an issue at
> all. So what is the justification?

This patch is mainly just a complementary patch for patch 5 so that we
can allow the spinner to wait slightly longer for lock->count to get set
without worrying about need_resched()/owner getting preempted.

Jason

2014-01-28 22:51:39

by Jason Low

Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Tue, 2014-01-28 at 22:07 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 11:13:16AM -0800, Jason Low wrote:
> > Before a thread attempts mutex spin on owner, it is first added to a queue
> > using an MCS lock so that only 1 thread spins on owner at a time. However, when
> > the spinner is queued, it is unable to check if it needs to reschedule and
> > will remain on the queue. This could cause the spinner to spin longer
> > than its allocated time.
>
> what allocated time?

Hi Peter,

By "spin longer than its allocated time", I meant to say that the thread
can continue spinning/waiting in the MCS queue after need_resched() has
been set.
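
The wait loop in 3.13's mspin_lock() is, roughly (paraphrased):

	/* Wait until the lock holder passes the lock down. */
	while (!ACCESS_ONCE(node->locked))
		arch_mutex_cpu_relax();

It only watches node->locked, so need_resched() is never consulted while the
thread sits in the queue.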

> > However, once it is the spinner's turn to spin on
> > owner, it would immediately go to slowpath if it need_resched() and gets no spin
> > time. In these situations, not only does the spinner take up more time for a
> > chance to spin for the mutex, it also ends up not getting to spin once it
> > gets its turn.
> >
> > One solution would be to exit the MCS queue and go to mutex slowpath if
> > need_resched(). However, that may require a lot of overhead. For instance, if a
> > thread at the end of the queue need_resched(), in order to remove it from the
> > queue, we would have to unqueue and requeue all other spinners in front of it.
>
> If you can do that, you can also walk the list and find prev and cmpxchg
> the entry out. But I don't think you can do either, as we simply don't
> have a head pointer.
>
> > This RFC patch tries to address the issue in another context by avoiding
> > situations where spinners immediately get sent to the slowpath on
> > need_resched() upon getting to spin.
>
> > We will first check for need_resched()
> > right after acquiring the MCS lock. If need_resched() is true, then
> > need_resched() triggered while the thread is waiting in the MCS queue (since
> > patch 1 makes the spinner check for need_resched() before entering the queue).
>
> > In this case, we will allow the thread to have at least 1 try to do
> > mutex_spin_on_owner() regardless of need_resched().
>
> No! We should absolutely not ever willfully ignore need_resched(). Esp.
> not for unbounded spins.

Ok. This was sort of a proof of concept patch to illustrate the type of
performance gains we can get by addressing this issue.

> > This patch also removes
> > the need_resched() in the outer loop in case we require a few extra spins to
> > observe lock->count == 1, and patch 4 ensures we won't be spinning with
> > lock owner preempted.
> >
> > And if the need_resched() check after acquiring the MCS lock is false, then
> > we won't give the spinner any extra spinning.
>
> But urgh, nasty problem. Lemme ponder this a bit.

Thanks,
Jason

2014-01-28 23:11:27

by Jason Low

Subject: Re: [PATCH v2 0/5] mutex: Mutex scalability patches

On Tue, 2014-01-28 at 13:08 -0800, Davidlohr Bueso wrote:
> On Tue, 2014-01-28 at 11:13 -0800, Jason Low wrote:
> > v1->v2:
> > - Replace the previous patch that limits the # of times a thread can spin with
> > !lock->owner with a patch that releases the mutex before holding the wait_lock
> > in the __mutex_unlock_common_slowpath() function.
> > - Add a patch which allows a thread to attempt 1 mutex_spin_on_owner() without
> > checking need_resched() if need_resched() triggered while in the MCS queue.
> > - Add a patch which disables preemption between modifying lock->owner and
> > acquiring/releasing the mutex.
> >
> > This patchset addresses a few scalability issues with mutexes.
> >
> > Patch 1 has the mutex_can_spin_on_owner() function check for need_resched()
> > before the thread is added to the MCS queue.
> >
> > Patches 2 and 3 fix issues with threads spinning when
> > there is no lock owner while the mutex is under high contention.
> >
> > Patches 4 and 5 are RFC patches. Patch 4 disables preemption between modifying
> > lock->owner and locking/unlocking the mutex. Patch 5 addresses the situation
> > where spinners can potentially wait a long time in the MCS queue for a chance
> > to spin on the mutex owner (not checking for need_resched()), yet end up not
> > getting to spin.
> >
> > These changes benefit the AIM7 fserver and high_systime workloads (run on disk)
> > on an 8 socket, 80 core box. The table below shows the performance
> > improvements with 3.13 + patches 1, 2, 3 when compared to the 3.13 baseline,
> > and the performance improvements with 3.13 + all 5 patches compared to
> > the 3.13 baseline.
>
> A lot of these changes are quite subtle. It would be good to see how
> smaller systems are impacted with other workloads, not only big servers.
> Since you see improvement in fserver, perhaps similar workloads could
> also be of use: fio, filebench, postmark, fstress, etc.

Okay, I will include the numbers I collect on smaller systems next time
(even if the % difference is small).

Thanks,
Jason

2014-01-29 11:52:09

by Peter Zijlstra

Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > But urgh, nasty problem. Lemme ponder this a bit.

OK, please have a very careful look at the below. It survived a boot
with udev -- which usually stresses mutex contention enough to explode
(in fact it did a few times when I got the contention/cancel path wrong),
however I have not run anything else on it.

The below is an MCS variant that allows relatively cheap unqueueing. But
it's somewhat tricky and I might have gotten a case wrong, esp. the
double concurrent cancel case got my head hurting (I didn't attempt a
triple unqueue).

Applies to tip/master but does generate a few (harmless) compile
warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
thing.

Also, there's a comment in the slowpath that bears consideration.

---
kernel/locking/mutex.c | 158 +++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 148 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 45fe1b5293d6..4a69da73903c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
struct task_struct *owner;
int retval = 1;

+ if (need_resched())
+ return 0;
+
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
if (owner)
@@ -358,6 +361,134 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock,
spin_unlock_mutex(&lock->base.wait_lock, flags);
}

+struct m_spinlock {
+ struct m_spinlock *next, *prev;
+ int locked; /* 1 if lock acquired */
+};
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct m_spinlock, m_node);
+
+static bool m_spin_lock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *prev, *next;
+
+ node->locked = 0;
+ node->next = NULL;
+
+ node->prev = prev = xchg(lock, node);
+ if (likely(prev == NULL))
+ return true;
+
+ ACCESS_ONCE(prev->next) = node;
+
+ /*
+ * Normally @prev is untouchable after the above store; because at that
+ * moment unlock can proceed and wipe the node element from stack.
+ *
+ * However, since our nodes are static per-cpu storage, we're
+ * guaranteed their existence -- this allows us to apply
+ * cmpxchg in an attempt to undo our queueing.
+ */
+
+ while (!smp_load_acquire(&node->locked)) {
+ if (need_resched())
+ goto unqueue;
+ arch_mutex_cpu_relax();
+ }
+ return true;
+
+unqueue:
+
+ /*
+ * Undo our @prev->next assignment; this will make @prev's unlock()
+ * wait for a next pointer since @lock points to us.
+ */
+ if (cmpxchg(&prev->next, node, NULL) != node) { /* A -> B */
+ /*
+ * @prev->next no longer pointed to us; either we hold the lock
+ * or @prev cancelled the wait too and we need to reload and
+ * retry.
+ */
+ if (smp_load_acquire(&node->locked))
+ return true;
+
+ /*
+ * Because we observed the new @prev->next, the smp_wmb() at (B)
+ * ensures that we must now observe the new @node->prev.
+ */
+ prev = ACCESS_ONCE(node->prev);
+ goto unqueue;
+ }
+
+ if (smp_load_acquire(&node->locked)) {
+ /*
+ * OOPS, we were too late, we already got the lock. No harm
+ * done though; @prev is now unused and nobody cares we frobbed
+ * it.
+ */
+ return true;
+ }
+
+ /*
+ * Per the above logic @prev's unlock() will now wait,
+ * therefore @prev is now stable.
+ */
+
+ if (cmpxchg(lock, node, prev) == node) {
+ /*
+ * We were still the last queued, we moved @lock back. @prev
+ * will now observe @lock and will complete its unlock().
+ */
+ return false;
+ }
+
+ /*
+ * We're not the last to be queued, obtain ->next.
+ */
+
+ while (!(next = ACCESS_ONCE(node->next)))
+ arch_mutex_cpu_relax();
+
+ ACCESS_ONCE(next->prev) = prev;
+
+ /*
+ * Ensure that @next->prev is written before we write @prev->next,
+ * this guarantees that when the cmpxchg at (A) fails we must
+ * observe the new prev value.
+ */
+ smp_wmb(); /* B -> A */
+
+ /*
+ * And point @prev to our next, and we're unlinked. We can use an
+ * non-atomic op because only we modify @prev->next.
+ */
+ ACCESS_ONCE(prev->next) = next;
+
+ return false;
+}
+
+static void m_spin_unlock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *next = ACCESS_ONCE(node->next);
+
+ while (likely(!next)) {
+ if (likely(cmpxchg(lock, node, NULL) == node))
+ return;
+
+ arch_mutex_cpu_relax();
+ next = ACCESS_ONCE(node->next);
+ }
+
+ smp_store_release(&next->locked, 1);
+}
+
/*
* Lock a mutex (possibly interruptible), slowpath:
*/
@@ -400,9 +531,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;

+ if (!m_spin_lock(&lock->mcs_lock))
+ goto slowpath;
+
for (;;) {
struct task_struct *owner;
- struct mcs_spinlock node;

if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -417,19 +550,16 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* performed the optimistic spinning cannot be done.
*/
if (ACCESS_ONCE(ww->ctx))
- goto slowpath;
+ break;
}

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mcs_spin_lock(&lock->mcs_lock, &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner)) {
- mcs_spin_unlock(&lock->mcs_lock, &node);
- goto slowpath;
- }
+ if (owner && !mutex_spin_on_owner(lock, owner))
+ break;

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -442,11 +572,10 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}

mutex_set_owner(lock);
- mcs_spin_unlock(&lock->mcs_lock, &node);
+ m_spin_unlock(&lock->mcs_lock);
preempt_enable();
return 0;
}
- mcs_spin_unlock(&lock->mcs_lock, &node);

/*
* When there's no owner, we might have preempted between the
@@ -455,7 +584,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* the owner complete.
*/
if (!owner && (need_resched() || rt_task(task)))
- goto slowpath;
+ break;

/*
* The cpu_relax() call is a compiler barrier which forces
@@ -465,10 +594,19 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
*/
arch_mutex_cpu_relax();
}
+ m_spin_unlock(&lock->mcs_lock);
slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

+ /*
+ * XXX arguably, when need_resched() is set and there's other pending
+ * owners we should not try-acquire and simply queue and schedule().
+ *
+ * There's nothing worse than obtaining a lock only to get immediately
+ * scheduled out.
+ */
+
/* once more, can we acquire the lock? */
if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1))
goto skip_wait;

2014-01-31 03:29:40

by Jason Low

Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > But urgh, nasty problem. Lemme ponder this a bit.
>
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> > (in fact it did a few times when I got the contention/cancel path wrong),
> > however I have not run anything else on it.

I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
workload. After 100 users, the system gets soft lockups.

Some condition may be causing threads to not leave the "goto unqueue"
loop. I added a debug counter, and threads were able to reach more than
1,000,000,000 "goto unqueue".

I also was initially thinking if there can be problems when multiple
threads need_resched() and unqueue at the same time. As an example, 2
nodes that need to reschedule are next to each other in the middle of
the MCS queue. The 1st node executes "while (!(next =
ACCESS_ONCE(node->next)))" and exits the while loop because next is not
NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
NULL) != node)". We may then end up in a situation where the node before
the 1st node gets linked with the outdated 2nd node.

2014-01-31 14:10:05

by Peter Zijlstra

Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Thu, Jan 30, 2014 at 07:29:37PM -0800, Jason Low wrote:
> On Wed, 2014-01-29 at 12:51 +0100, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> > > > But urgh, nasty problem. Lemme ponder this a bit.
> >
> > OK, please have a very careful look at the below. It survived a boot
> > with udev -- which usually stresses mutex contention enough to explode
> > > (in fact it did a few times when I got the contention/cancel path wrong),
> > > however I have not run anything else on it.
>
> I tested this patch on a 2 socket, 8 core machine with the AIM7 fserver
> workload. After 100 users, the system gets soft lockups.
>
> Some condition may be causing threads to not leave the "goto unqueue"
> loop. I added a debug counter, and threads were able to reach more than
> 1,000,000,000 "goto unqueue".
>
> I also was initially thinking if there can be problems when multiple
> threads need_resched() and unqueue at the same time. As an example, 2
> nodes that need to reschedule are next to each other in the middle of
> the MCS queue. The 1st node executes "while (!(next =
> ACCESS_ONCE(node->next)))" and exits the while loop because next is not
> NULL. Then, the 2nd node executes its "if (cmpxchg(&prev->next, node,
> NULL) != node)". We may then end up in a situation where the node before
> the 1st node gets linked with the outdated 2nd node.

Yes indeed, I found two bugs. If you consider the MCS lock with 4 queued
nodes:

           .----.
    L ->   | N4 |
           `----'
            ^  V
           .----.
           | N3 |
           `----'
            ^  V
           .----.
           | N2 |
           `----'
            ^  V
           .----.
           | N1 |
           `----'

And look at the 3 unqueue steps in the below patch, and read 3A to
mean Node 3 ran step A.

Both bugs were in Step B, the first was triggered through:

3A,4A,3B,4B

In this case the old code would have 3 spinning in @n3->next, however
4B would have moved L to N3 and left @n3->next empty. FAIL.

The second was:

2A,2B,3A,2C,3B,3C

In this case the cancellation of both 2 and 3 'succeed' and we end up
with N1 linked to N3 and N2 linked to N4. Total fail.


The first bug was fixed by having Step-B spin on both @n->next and @l
just like the unlink() path already did.

The second bug was fixed by having Step-B xchg(@n->next, NULL) instead
of only reading it; this way the 3A above cannot complete until after 2C
and we have a coherent chain back.

I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
loads but I'm not entirely sure I got this thing right, it's not really
making progress with or without patch :/

---
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -166,6 +166,9 @@ static inline int mutex_can_spin_on_owne
struct task_struct *owner;
int retval = 1;

+ if (need_resched())
+ return 0;
+
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
if (owner)
@@ -358,6 +361,166 @@ ww_mutex_set_context_fastpath(struct ww_
spin_unlock_mutex(&lock->base.wait_lock, flags);
}

+struct m_spinlock {
+ struct m_spinlock *next, *prev;
+ int locked; /* 1 if lock acquired */
+};
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct m_spinlock, m_node);
+
+static bool m_spin_lock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *prev, *next;
+
+ node->locked = 0;
+ node->next = NULL;
+
+ node->prev = prev = xchg(lock, node);
+ if (likely(prev == NULL))
+ return true;
+
+ ACCESS_ONCE(prev->next) = node;
+
+ /*
+ * Normally @prev is untouchable after the above store; because at that
+ * moment unlock can proceed and wipe the node element from stack.
+ *
+ * However, since our nodes are static per-cpu storage, we're
+ * guaranteed their existence -- this allows us to apply
+ * cmpxchg in an attempt to undo our queueing.
+ */
+
+ while (!smp_load_acquire(&node->locked)) {
+ if (need_resched())
+ goto unqueue;
+ arch_mutex_cpu_relax();
+ }
+ return true;
+
+unqueue:
+ /*
+ * Step - A -- stabilize @prev
+ *
+ * Undo our @prev->next assignment; this will make @prev's
+ * unlock()/cancel() wait for a next pointer since @lock points to us
+ * (or later).
+ */
+
+ for (;;) {
+ next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */
+
+ /*
+ * Because the unlock() path retains @prev->next (for
+ * performance) we must check @node->locked after clearing
+ * @prev->next to see if we raced.
+ *
+ * Ordered by the cmpxchg() above and the conditional-store in
+ * unlock().
+ */
+ if (smp_load_acquire(&node->locked)) {
+ /*
+ * OOPS, we were too late, we already got the lock. No
+ * harm done though; @prev is now unused and nobody
+ * cares we frobbed it.
+ */
+ return true;
+ }
+
+ if (next == node)
+ break;
+
+ /*
+ * @prev->next didn't point to us anymore, we didn't own the
+ * lock, so reload and try again.
+ *
+ * Because we observed the new @prev->next, the smp_wmb() at
+ * (C) ensures that we must now observe the new @node->prev.
+ */
+ prev = ACCESS_ONCE(node->prev);
+ }
+
+ /*
+ * Step - B -- stabilize @next
+ *
+ * Similar to unlock(), wait for @node->next or move @lock from @node
+ * back to @prev.
+ */
+
+ for (;;) {
+ if (*lock == node && cmpxchg(lock, node, prev) == node) {
+ /*
+ * We were the last queued, we moved @lock back. @prev
+ * will now observe @lock and will complete its
+ * unlock()/cancel().
+ */
+ return false;
+ }
+
+ /*
+ * We must xchg() the @node->next value, because if we were to
+ * leave it in, a concurrent cancel() from @node->next might
+ * complete Step-A and think its @prev is still valid.
+ *
+ * If the concurrent cancel() wins the race, we'll wait for
+ * either @lock to point to us, through its Step-B, or wait for
+ * a new @node->next from its Step-C.
+ */
+ next = xchg(&node->next, NULL); /* B -> A */
+ if (next)
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+
+ /*
+ * Step - C -- unlink
+ *
+ * @prev is stable because it's still waiting for a new @prev->next
+ * pointer, @next is stable because our @node->next pointer is NULL and
+ * it will wait in Step-A.
+ */
+
+ ACCESS_ONCE(next->prev) = prev;
+
+ /*
+ * Ensure that @next->prev is written before we write @prev->next,
+ * this guarantees that when the cmpxchg at (A) fails we must
+ * observe the new prev value.
+ */
+ smp_wmb(); /* C -> A */
+
+ /*
+ * And point @prev to our next, and we're unlinked.
+ */
+ ACCESS_ONCE(prev->next) = next;
+
+ return false;
+}
+
+static void m_spin_unlock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *next;
+
+ for (;;) {
+ if (likely(cmpxchg(lock, node, NULL) == node))
+ return;
+
+ next = ACCESS_ONCE(node->next);
+ if (unlikely(next))
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+ smp_store_release(&next->locked, 1);
+}
+
/*
* Lock a mutex (possibly interruptible), slowpath:
*/
@@ -400,9 +563,11 @@ __mutex_lock_common(struct mutex *lock,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;

+ if (!m_spin_lock(&lock->mcs_lock))
+ goto slowpath;
+
for (;;) {
struct task_struct *owner;
- struct mcs_spinlock node;

if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -417,19 +582,16 @@ __mutex_lock_common(struct mutex *lock,
* performed the optimistic spinning cannot be done.
*/
if (ACCESS_ONCE(ww->ctx))
- goto slowpath;
+ break;
}

/*
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mcs_spin_lock(&lock->mcs_lock, &node);
owner = ACCESS_ONCE(lock->owner);
- if (owner && !mutex_spin_on_owner(lock, owner)) {
- mcs_spin_unlock(&lock->mcs_lock, &node);
- goto slowpath;
- }
+ if (owner && !mutex_spin_on_owner(lock, owner))
+ break;

if ((atomic_read(&lock->count) == 1) &&
(atomic_cmpxchg(&lock->count, 1, 0) == 1)) {
@@ -442,11 +604,10 @@ __mutex_lock_common(struct mutex *lock,
}

mutex_set_owner(lock);
- mcs_spin_unlock(&lock->mcs_lock, &node);
+ m_spin_unlock(&lock->mcs_lock);
preempt_enable();
return 0;
}
- mcs_spin_unlock(&lock->mcs_lock, &node);

/*
* When there's no owner, we might have preempted between the
@@ -455,7 +616,7 @@ __mutex_lock_common(struct mutex *lock,
* the owner complete.
*/
if (!owner && (need_resched() || rt_task(task)))
- goto slowpath;
+ break;

/*
* The cpu_relax() call is a compiler barrier which forces
@@ -465,10 +626,19 @@ __mutex_lock_common(struct mutex *lock,
*/
arch_mutex_cpu_relax();
}
+ m_spin_unlock(&lock->mcs_lock);
slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

+ /*
+ * XXX arguably, when need_resched() is set and there's other pending
+ * owners we should not try-acquire and simply queue and schedule().
+ *
+ * There's nothing worse than obtaining a lock only to get immediately
+ * scheduled out.
+ */
+
/* once more, can we acquire the lock? */
if (MUTEX_SHOW_NO_WAITER(lock) && (atomic_xchg(&lock->count, 0) == 1))
goto skip_wait;

2014-01-31 20:01:40

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <[email protected]> wrote:

> I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> loads but I'm not entirely sure I got this thing right, it's not really
> making progress with or without patch :/

Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
be able to generate similar mutex contention.

Currently still getting soft lockups with the updated version.

2014-01-31 20:08:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <[email protected]> wrote:
>
> > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > loads but I'm not entirely sure I got this thing right, it's not really
> > making progress with or without patch :/
>
> Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> be able to generate similar mutex contention.
>
> Currently still getting soft lockups with the updated version.

Bugger.. ok clearly I need to think harder still. I'm fairly sure this
cancelation can work though, just seems tricky to get right :-)

I'll give that proglet from Ingo a go, although that might be Monday ere
I get to it.

2014-02-02 20:02:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Fri, Jan 31, 2014 at 03:09:41PM +0100, Peter Zijlstra wrote:
> +struct m_spinlock {
> + struct m_spinlock *next, *prev;
> + int locked; /* 1 if lock acquired */
> +};
> +
> +/*
> + * Using a single mcs node per CPU is safe because mutex_lock() should not be
> + * called from interrupt context and we have preemption disabled over the mcs
> + * lock usage.
> + */
> +static DEFINE_PER_CPU(struct m_spinlock, m_node);
> +
> +static bool m_spin_lock(struct m_spinlock **lock)
> +{
> + struct m_spinlock *node = this_cpu_ptr(&m_node);
> + struct m_spinlock *prev, *next;
> +
> + node->locked = 0;
> + node->next = NULL;
> +
> + node->prev = prev = xchg(lock, node);
> + if (likely(prev == NULL))
> + return true;
> +
> + ACCESS_ONCE(prev->next) = node;
> +
> + /*
> + * Normally @prev is untouchable after the above store; because at that
> + * moment unlock can proceed and wipe the node element from stack.
> + *
> + * However, since our nodes are static per-cpu storage, we're
> + * guaranteed their existence -- this allows us to apply
> + * cmpxchg in an attempt to undo our queueing.
> + */
> +
> + while (!smp_load_acquire(&node->locked)) {
> + if (need_resched())
> + goto unqueue;
> + arch_mutex_cpu_relax();
> + }
> + return true;
> +
> +unqueue:
> + /*
> + * Step - A -- stabilize @prev
> + *
> + * Undo our @prev->next assignment; this will make @prev's
> + * unlock()/cancel() wait for a next pointer since @lock points to us
> + * (or later).
> + */
> +
> + for (;;) {
> + next = cmpxchg(&prev->next, node, NULL); /* A -> B,C */
> +
> + /*
> + * Because the unlock() path retains @prev->next (for
> + * performance) we must check @node->locked after clearing
> + * @prev->next to see if we raced.
> + *
> + * Ordered by the cmpxchg() above and the conditional-store in
> + * unlock().
> + */
> + if (smp_load_acquire(&node->locked)) {
> + /*
> + * OOPS, we were too late, we already got the lock. No
> + * harm done though; @prev is now unused and nobody
> + * cares we frobbed it.
> + */
> + return true;
> + }
> +
> + if (next == node)
> + break;
> +
> + /*
> + * @prev->next didn't point to us anymore, we didn't own the
> + * lock, so reload and try again.
> + *
> + * Because we observed the new @prev->next, the smp_wmb() at
> + * (C) ensures that we must now observe the new @node->prev.
> + */
> + prev = ACCESS_ONCE(node->prev);
> + }
> +
> + /*
> + * Step - B -- stabilize @next
> + *
> + * Similar to unlock(), wait for @node->next or move @lock from @node
> + * back to @prev.
> + */
> +
> + for (;;) {
> + if (*lock == node && cmpxchg(lock, node, prev) == node) {
> + /*
> + * We were the last queued, we moved @lock back. @prev
> + * will now observe @lock and will complete its
> + * unlock()/cancel().
> + */
> + return false;
> + }
> +
> + /*
> + * We must xchg() the @node->next value, because if we were to
> + * leave it in, a concurrent cancel() from @node->next might
> + * complete Step-A and think its @prev is still valid.
> + *
> + * If the concurrent cancel() wins the race, we'll wait for
> + * either @lock to point to us, through its Step-B, or wait for
> + * a new @node->next from its Step-C.
> + */
> + next = xchg(&node->next, NULL); /* B -> A */
> + if (next)
> + break;
> +
> + arch_mutex_cpu_relax();
> + }
> +
> + /*
> + * Step - C -- unlink
> + *
> + * @prev is stable because it's still waiting for a new @prev->next
> + * pointer, @next is stable because our @node->next pointer is NULL and
> + * it will wait in Step-A.
> + */
> +
> + ACCESS_ONCE(next->prev) = prev;
> +
> + /*
> + * Ensure that @next->prev is written before we write @prev->next,
> + * this guarantees that when the cmpxchg at (A) fails we must
> + * observe the new prev value.
> + */
> + smp_wmb(); /* C -> A */

OK, I've definitely stared at this code for too long :/

I don't think the above barrier is right -- or even required for that
matter. At this point I can't see anything wrong with the order of
either of these stores.

If the latter store hits first, an unlock can happen and release the
@next entry, which is fine as the Step-A loop can deal with that; if the
unlock doesn't happen, we'll simply wait for the first store to become
visible before trying to undo later again.

If the former store hits first, we'll simply wait for the later store to
appear and we'll try to undo it.

This obviously doesn't explain lockups, but it does reduce the code to
stare at ever so slightly.

> + /*
> + * And point @prev to our next, and we're unlinked.
> + */
> + ACCESS_ONCE(prev->next) = next;
> +
> + return false;
> +}
> +
> +static void m_spin_unlock(struct m_spinlock **lock)
> +{
> + struct m_spinlock *node = this_cpu_ptr(&m_node);
> + struct m_spinlock *next;
> +
> + for (;;) {
> + if (likely(cmpxchg(lock, node, NULL) == node))
> + return;
> +
> + next = ACCESS_ONCE(node->next);
> + if (unlikely(next))
> + break;
> +
> + arch_mutex_cpu_relax();
> + }
> + smp_store_release(&next->locked, 1);
> +}

2014-02-02 21:01:30

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > Currently still getting soft lockups with the updated version.
>
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

Ok, I believe I have found a race condition between m_spin_lock() and
m_spin_unlock().

In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if
next is not NULL, we proceed to set next->locked to 1.

A thread in m_spin_lock() in the unqueue path could execute
"next = cmpxchg(&prev->next, node, NULL)" after the thread in
m_spin_unlock() accesses its node->next and finds that it is not NULL.
Then, the thread in m_spin_lock() could check !node->locked before
the thread in m_spin_unlock() sets next->locked to 1.
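
To make the window concrete, here is one possible interleaving (illustrative
only; the names are the ones used in the RFC patch):

  CPU A: m_spin_unlock()                   CPU B: m_spin_lock(), unqueue path
  ----------------------                   ----------------------------------
  cmpxchg(lock, node, NULL) fails
  next = ACCESS_ONCE(node->next)
    (sees CPU B's node, non-NULL)
                                           next = cmpxchg(&prev->next, node, NULL)
                                             (clears CPU A's node->next)
                                           smp_load_acquire(&node->locked)
                                             (still 0, so CPU B keeps unqueueing)
  smp_store_release(&next->locked, 1)
    (handoff to a node that has already
     left the queue, so the lock handoff
     is lost)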

The following additional change was able to solve the initial lockups that were
occurring when running fserver on a 2-socket box.

---
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 9eb4dbe..e71a84a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock)
return;

next = ACCESS_ONCE(node->next);
- if (unlikely(next))
- break;
+
+ if (unlikely(next)) {
+ next = cmpxchg(&node->next, next, NULL);
+
+ if (next)
+ break;
+ }

arch_mutex_cpu_relax();
}

2014-02-02 21:12:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Sun, Feb 02, 2014 at 01:01:23PM -0800, Jason Low wrote:
> On Fri, 2014-01-31 at 21:08 +0100, Peter Zijlstra wrote:
> > On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > > Currently still getting soft lockups with the updated version.
> >
> > Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> > cancelation can work though, just seems tricky to get right :-)
>
> Ok, I believe I have found a race condition between m_spin_lock() and
> m_spin_unlock().
>
> In m_spin_unlock(), we do "next = ACCESS_ONCE(node->next)". Then, if
> next is not NULL, we proceed to set next->locked to 1.
>
> A thread in m_spin_lock() in the unqueue path could execute
> "next = cmpxchg(&prev->next, node, NULL)" after the thread in
> m_spin_unlock() accesses its node->next and finds that it is not NULL.
> Then, the thread in m_spin_lock() could check !node->locked before
> the thread in m_spin_unlock() sets next->locked to 1.

Yes indeed. How silly of me to not spot that!

> The following additional change was able to solve the initial lockups that were
> occurring when running fserver on a 2-socket box.
>
> ---
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 9eb4dbe..e71a84a 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -513,8 +513,13 @@ static void m_spin_unlock(struct m_spinlock **lock)
> return;
>
> next = ACCESS_ONCE(node->next);
> - if (unlikely(next))
> - break;
> +
> + if (unlikely(next)) {
> + next = cmpxchg(&node->next, next, NULL);
> +
> + if (next)

The cmpxchg could fail and next still be !NULL I suppose.

> + break;
> + }


The way I wrote that same loop in step-B, is:


for (;;) {
if (*lock == node && cmpxchg(lock, node, prev) == node)
return

next = xchg(&node->next, NULL); /* B -> A */
if (next)
break;

arch_mutex_cpu_relax();
}

I suppose we can make that something like:


if (node->next) {
next = xchg(&node->next, NULL);
if (next)
break
}

To avoid the xchg on every loop.

I had wanted to avoid the additional locked op in the unlock path, but
yes that does make things easier.

2014-02-02 21:58:24

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mutex: Modify the way optimistic spinners are queued

On Tue, Jan 28, 2014 at 02:10:41PM -0800, Jason Low wrote:
> On Tue, 2014-01-28 at 12:23 -0800, Paul E. McKenney wrote:
> > On Tue, Jan 28, 2014 at 11:13:13AM -0800, Jason Low wrote:
> > > /*
> > > * The cpu_relax() call is a compiler barrier which forces
> > > @@ -514,6 +511,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> > > */
> > > arch_mutex_cpu_relax();
> > > }
> > > + mspin_unlock(MLOCK(lock), &node);
> > > slowpath:
> >
> > Are there any remaining goto statements to slowpath? If so, they need
> > to release the lock. If not, this label should be removed.
>
> Yes, if the mutex_can_spin_on_owner() returns false, then the thread
> goes directly to the slowpath, bypassing the optimistic spinning loop. In
> that case, the thread avoids acquiring the MCS lock, and doesn't unlock
> the MCS lock.

Got it, apologies for my confusion!

Thanx, Paul

2014-02-02 22:03:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Fri, Jan 31, 2014 at 09:08:25PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 12:01:37PM -0800, Jason Low wrote:
> > On Fri, Jan 31, 2014 at 6:09 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > > I've downloaded AIM7 from sf.net and I hope I'm running it with 100+
> > > loads but I'm not entirely sure I got this thing right, it's not really
> > > making progress with or without patch :/
> >
> > Ingo's program http://lkml.org/lkml/2006/1/8/50 using the V option may
> > be able to generate similar mutex contention.
> >
> > Currently still getting soft lockups with the updated version.
>
> Bugger.. ok clearly I need to think harder still. I'm fairly sure this
> cancelation can work though, just seems tricky to get right :-)

We used to do something similar to avoid passing locks off to tasks that
had been interrupted while spinning, and it was a bit tricky. But we
had it a bit easier, because we didn't actually have to remove the element
from the queue, just bypass it at lock-grant time.

Thanx, Paul

> I'll give that proglet from Ingo a go, although that might be Monday ere
> I get to it.
>

2014-02-03 18:39:27

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Sun, 2014-02-02 at 22:12 +0100, Peter Zijlstra wrote:
> The way I wrote that same loop in step-B, is:
>
>
> for (;;) {
> if (*lock == node && cmpxchg(lock, node, prev) == node)
> return
>
> next = xchg(&node->next, NULL); /* B -> A */
> if (next)
> break;
>
> arch_mutex_cpu_relax();
> }
>
> I suppose we can make that something like:
>
>
> if (node->next) {
> next = xchg(&node->next, NULL);
> if (next)
> break
> }
>
> To avoid the xchg on every loop.

Ah yes, we want to use xchg() on &node->next.

Since the cmpxchg() is now in a loop in the unlock function, an
additional (*lock == node) check before the cmpxchg() would also be nice
to avoid spinning on cmpxchg() there too.

2014-02-03 19:25:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote:
> > To avoid the xchg on every loop.
>
> Ah yes, we want to use xchg() on &node->next.
>
> Since the cmpxchg() is now in a loop in the unlock function, an
> additional (*lock == node) check before the cmpxchg() would also be nice
> to avoid spinning on cmpxchg() there too.

Right, I have the below; you can find the patches this depends upon
here:

http://programming.kicks-ass.net/sekrit/patches.tar.bz2

---
Subject: locking, mutex: Cancelable MCS lock for adaptive spinning
From: Peter Zijlstra <[email protected]>
Date: Wed, 29 Jan 2014 12:51:42 +0100

Since we want a task waiting for a mutex_lock() to go to sleep and
reschedule on need_resched() we must be able to abort the
mcs_spin_lock() around the adaptive spin.

Therefore implement a cancelable mcs lock.

XXX: anybody got a better name than m_spinlock?

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Jason Low <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
include/linux/mutex.h | 4 -
kernel/locking/Makefile | 2
kernel/locking/mcs_spinlock.c | 156 ++++++++++++++++++++++++++++++++++++++++++
kernel/locking/mcs_spinlock.h | 18 ++++
kernel/locking/mutex.c | 10 +-
5 files changed, 183 insertions(+), 7 deletions(-)

--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -46,7 +46,7 @@
* - detects multi-task circular deadlocks and prints out all affected
* locks and tasks (and only those tasks)
*/
-struct mcs_spinlock;
+struct m_spinlock;
struct mutex {
/* 1: unlocked, 0: locked, negative: locked, possible waiters */
atomic_t count;
@@ -56,7 +56,7 @@ struct mutex {
struct task_struct *owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
- struct mcs_spinlock *mcs_lock; /* Spinner MCS lock */
+ struct m_spinlock *m_lock; /* Spinner MCS lock */
#endif
#ifdef CONFIG_DEBUG_MUTEXES
const char *name;
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -1,5 +1,5 @@

-obj-y += mutex.o semaphore.o rwsem.o lglock.o
+obj-y += mutex.o semaphore.o rwsem.o lglock.o mcs_spinlock.o

ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_lockdep.o = -pg
--- /dev/null
+++ b/kernel/locking/mcs_spinlock.c
@@ -0,0 +1,156 @@
+
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include "mcs_spinlock.h"
+
+#ifdef CONFIG_SMP
+
+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct m_spinlock, m_node);
+
+/*
+ * Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
+ * Can return NULL in case we were the last queued and we updated @lock instead.
+ */
+static inline struct m_spinlock *
+m_spin_wait_next(struct m_spinlock **lock, struct m_spinlock *node,
+ struct m_spinlock *prev)
+{
+ struct m_spinlock *next = NULL;
+
+ for (;;) {
+ if (*lock == node && cmpxchg(lock, node, prev) == node) {
+ /*
+ * We were the last queued, we moved @lock back. @prev
+ * will now observe @lock and will complete its
+ * unlock()/unqueue().
+ */
+ break;
+ }
+
+ /*
+ * We must xchg() the @node->next value, because if we were to
+ * leave it in, a concurrent unlock()/unqueue() from
+ * @node->next might complete Step-A and think its @prev is
+ * still valid.
+ *
+ * If the concurrent unlock()/unqueue() wins the race, we'll
+ * wait for either @lock to point to us, through its Step-B, or
+ * wait for a new @node->next from its Step-C.
+ */
+ if (node->next) {
+ next = xchg(&node->next, NULL);
+ if (next)
+ break;
+ }
+
+ arch_mutex_cpu_relax();
+ }
+
+ return next;
+}
+
+bool m_spin_lock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *prev, *next;
+
+ node->locked = 0;
+ node->next = NULL;
+
+ node->prev = prev = xchg(lock, node);
+ if (likely(prev == NULL))
+ return true;
+
+ ACCESS_ONCE(prev->next) = node;
+
+ /*
+ * Normally @prev is untouchable after the above store; because at that
+ * moment unlock can proceed and wipe the node element from stack.
+ *
+ * However, since our nodes are static per-cpu storage, we're
+ * guaranteed their existence -- this allows us to apply
+ * cmpxchg in an attempt to undo our queueing.
+ */
+
+ while (!smp_load_acquire(&node->locked)) {
+ if (need_resched())
+ goto unqueue;
+ arch_mutex_cpu_relax();
+ }
+ return true;
+
+unqueue:
+ /*
+ * Step - A -- stabilize @prev
+ *
+ * Undo our @prev->next assignment; this will make @prev's
+ * unlock()/unqueue() wait for a next pointer since @lock points to us
+ * (or later).
+ */
+
+ for (;;) {
+ if (prev->next == node &&
+ cmpxchg(&prev->next, node, NULL) == node)
+ break;
+
+ /*
+ * We can only fail the cmpxchg() racing against an unlock(),
+ * in which case we should observe @node->locked becoming
+ * true.
+ */
+ if (smp_load_acquire(&node->locked))
+ return true;
+
+ /*
+ * Or we race against a concurrent unqueue()'s step-B, in which
+ * case its step-C will write us a new @node->prev pointer.
+ */
+ prev = ACCESS_ONCE(node->prev);
+ }
+
+ /*
+ * Step - B -- stabilize @next
+ *
+ * Similar to unlock(), wait for @node->next or move @lock from @node
+ * back to @prev.
+ */
+
+ next = m_spin_wait_next(lock, node, prev);
+ if (!next)
+ return false;
+
+ /*
+ * Step - C -- unlink
+ *
+ * @prev is stable because it's still waiting for a new @prev->next
+ * pointer, @next is stable because our @node->next pointer is NULL and
+ * it will wait in Step-A.
+ */
+
+ ACCESS_ONCE(next->prev) = prev;
+ ACCESS_ONCE(prev->next) = next;
+
+ return false;
+}
+
+void m_spin_unlock(struct m_spinlock **lock)
+{
+ struct m_spinlock *node = this_cpu_ptr(&m_node);
+ struct m_spinlock *next;
+
+ if (likely(cmpxchg(lock, node, NULL) == node))
+ return;
+
+ next = m_spin_wait_next(lock, node, NULL);
+ if (next)
+ ACCESS_ONCE(next->locked) = 1;
+}
+
+#endif
+
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -109,4 +109,22 @@ void mcs_spin_unlock(struct mcs_spinlock
arch_mcs_spin_unlock_contended(&next->locked);
}

+/*
+ * Cancellable version of the MCS lock above.
+ *
+ * This version can fail acquisition and unqueue a spinner; it assumes no
+ * nesting.
+ *
+ * Intended for adaptive spinning of sleeping locks:
+ * mutex_lock()/rwsem_down_{read,write}() etc.
+ */
+
+struct m_spinlock {
+ struct m_spinlock *next, *prev;
+ int locked; /* 1 if lock acquired */
+};
+
+extern bool m_spin_lock(struct m_spinlock **lock);
+extern void m_spin_unlock(struct m_spinlock **lock);
+
#endif /* __LINUX_MCS_SPINLOCK_H */
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -53,7 +53,7 @@ __mutex_init(struct mutex *lock, const c
INIT_LIST_HEAD(&lock->wait_list);
mutex_clear_owner(lock);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
- lock->mcs_lock = NULL;
+ lock->m_lock = NULL;
#endif

debug_mutex_init(lock, name, key);
@@ -403,7 +403,9 @@ __mutex_lock_common(struct mutex *lock,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;

- mcs_spin_lock(&lock->mcs_lock);
+ if (!m_spin_lock(&lock->m_lock))
+ goto slowpath;
+
for (;;) {
struct task_struct *owner;

@@ -442,7 +444,7 @@ __mutex_lock_common(struct mutex *lock,
}

mutex_set_owner(lock);
- mcs_spin_unlock(&lock->mcs_lock);
+ m_spin_unlock(&lock->m_lock);
preempt_enable();
return 0;
}
@@ -464,7 +466,7 @@ __mutex_lock_common(struct mutex *lock,
*/
arch_mutex_cpu_relax();
}
- mcs_spin_unlock(&lock->mcs_lock);
+ m_spin_unlock(&lock->m_lock);
slowpath:
#endif
spin_lock_mutex(&lock->wait_lock, flags);

2014-02-03 20:55:42

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 10:39:20AM -0800, Jason Low wrote:
> > > To avoid the xchg on every loop.
> >
> > Ah yes, we want to use xchg() on &node->next.
> >
> > Since the cmpxchg() is now in a loop in the unlock function, an
> > additional (*lock == node) check before the cmpxchg() would also be nice
> > to avoid spinning on cmpxchg() there too.
>
> Right, I have the below; you can find the patches this depends upon
> here:
>
> http://programming.kicks-ass.net/sekrit/patches.tar.bz2
>
> ---
> Subject: locking, mutex: Cancelable MCS lock for adaptive spinning
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 29 Jan 2014 12:51:42 +0100
>
> Since we want a task waiting for a mutex_lock() to go to sleep and
> reschedule on need_resched() we must be able to abort the
> mcs_spin_lock() around the adaptive spin.
>
> Therefore implement a cancelable mcs lock.
>
> XXX: anybody got a better name than m_spinlock?

So I was thinking something along the lines of
mcs_spin_lock_cancelable() as that's essentially what this function
does.

2014-02-03 21:06:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > XXX: anybody got a better name than m_spinlock?
>
> So I was thinking something along the lines of
> mcs_spin_lock_cancelable() as that's essentially what this function
> does.

sure, but what do we call the data structure that goes with it? Can't
have two struct mcs_spinlock :/

2014-02-03 21:56:29

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Mon, 2014-02-03 at 22:06 +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 12:55:34PM -0800, Jason Low wrote:
> > On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:
> > > XXX: anybody got a better name than m_spinlock?
> >
> > So I was thinking something along the lines of
> > mcs_spin_lock_cancelable() as that's essentially what this function
> > does.
>
> sure, but what do we call the data structure that goes with it? Can't
> have two struct mcs_spinlock :/

If this structure is only going to be used for cancelable mcs locking,
what do you think of "struct cancelable_mcs_spinlock"?

2014-02-04 07:13:23

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Mon, 2014-02-03 at 20:25 +0100, Peter Zijlstra wrote:

> +void m_spin_unlock(struct m_spinlock **lock)
> +{
> + struct m_spinlock *node = this_cpu_ptr(&m_node);
> + struct m_spinlock *next;
> +
> + if (likely(cmpxchg(lock, node, NULL) == node))
> + return;

At this point, (node->next != NULL) is a likely scenario.
Perhaps we can also add the following code here:

next = xchg(&node->next, NULL);
if (next) {
ACCESS_ONCE(next->locked) = 1;
return;
}

> + next = m_spin_wait_next(lock, node, NULL);
> + if (next)
> + ACCESS_ONCE(next->locked) = 1;
> +}
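
For completeness, a sketch of how the whole function would then read
(illustrative only; it just folds the above fast path into the
m_spin_unlock() from your patch, reusing m_spin_wait_next() and the
per-cpu m_node unchanged):

void m_spin_unlock(struct m_spinlock **lock)
{
	struct m_spinlock *node = this_cpu_ptr(&m_node);
	struct m_spinlock *next;

	/* Uncontended: we are still the tail, just release the lock. */
	if (likely(cmpxchg(lock, node, NULL) == node))
		return;

	/* Likely contended case: the successor has already linked itself in. */
	next = xchg(&node->next, NULL);
	if (next) {
		ACCESS_ONCE(next->locked) = 1;
		return;
	}

	/* Slow path: wait for a stable @next, or drop the lock outright if we
	 * turn out to be the last queued after all. */
	next = m_spin_wait_next(lock, node, NULL);
	if (next)
		ACCESS_ONCE(next->locked) = 1;
}

Whether the extra xchg() pays off in practice is of course something the
numbers would have to show.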

2014-02-05 21:45:04

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
>>> But urgh, nasty problem. Lemme ponder this a bit.
> OK, please have a very careful look at the below. It survived a boot
> with udev -- which usually stresses mutex contention enough to explode
> (in fact it did a few times when I got the contention/cancel path wrong),
> however I have not run anything else on it.
>
> The below is an MCS variant that allows relatively cheap unqueueing. But
> it's somewhat tricky and I might have gotten a case wrong, esp. the
> double concurrent cancel case got my head hurting (I didn't attempt a
> triple unqueue).
>
> Applies to tip/master but does generate a few (harmless) compile
> warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
> thing.
>
> Also, there's a comment in the slowpath that bears consideration.
>
>

I have an alternative way of breaking out of the MCS lock waiting queue
when need_resched() is set. I overload the locked flag to indicate a
skipped node if negative. I ran the patch through the AIM7 high-systime
workload on a 4-socket server and it seemed to run fine.

Please check the following POC patch to see if you have any comment.
diff --git a/include/linux/mcs_spinlock.h b/include/linux/mcs_spinlock.h
index e9a4d74..84a0b32 100644
--- a/include/linux/mcs_spinlock.h
+++ b/include/linux/mcs_spinlock.h
@@ -14,17 +14,45 @@

struct mcs_spinlock {
struct mcs_spinlock *next;
- int locked; /* 1 if lock acquired */
+ int locked; /* 1 if lock acquired, -1 if skipping node */
};

+/*
+ * The node skipping feature requires the MCS node to be persistent,
+ * i.e. not allocated on stack. In addition, the MCS node cannot be
+ * reused until after the locked flag is cleared. The mcs_skip_node()
+ * macro must be defined before including this header file.
+ */
+#ifdef mcs_skip_node
+#undef arch_mcs_spin_lock_contended
+#undef arch_mcs_spin_unlock_contended
+
+#define arch_mcs_spin_lock_contended(n) \
+do { \
+ while (!smp_load_acquire(&(n)->locked)) { \
+ if (mcs_skip_node(n)) { \
+ if (cmpxchg(&(n)->locked, 0, -1) == 0) \
+ return; \
+ } \
+ arch_mutex_cpu_relax(); \
+ }; \
+} while (0)
+
+#define arch_mcs_spin_unlock_contended(n) \
+ if (cmpxchg(&(n)->locked, 0, 1) == 0) \
+ return
+
+#define mcs_set_locked(n, v) (n)->locked = (v)
+#else /* mcs_skip_node */
+
#ifndef arch_mcs_spin_lock_contended
/*
* Using smp_load_acquire() provides a memory barrier that ensures
* subsequent operations happen after the lock is acquired.
*/
-#define arch_mcs_spin_lock_contended(l) \
+#define arch_mcs_spin_lock_contended(n) \
do { \
- while (!(smp_load_acquire(l))) \
+ while (!(smp_load_acquire(&(n)->locked))) \
arch_mutex_cpu_relax(); \
} while (0)
#endif
@@ -35,9 +63,12 @@ do { \
* operations in the critical section has been completed before
* unlocking.
*/
-#define arch_mcs_spin_unlock_contended(l) \
- smp_store_release((l), 1)
+#define arch_mcs_spin_unlock_contended(n) \
+ smp_store_release(&(n)->locked, 1)
+
+#define mcs_set_locked(n, v)
#endif
+#endif /* mcs_skip_node */

/*
* Note: the smp_load_acquire/smp_store_release pair is not
@@ -77,12 +108,13 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
* value won't be used. If a debug mode is needed to
* audit lock status, then set node->locked value here.
*/
+ mcs_set_locked(node, 1);
return;
}
ACCESS_ONCE(prev->next) = node;

/* Wait until the lock holder passes the lock down. */
- arch_mcs_spin_lock_contended(&node->locked);
+ arch_mcs_spin_lock_contended(node);
}

/*
@@ -94,19 +126,35 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
struct mcs_spinlock *next = ACCESS_ONCE(node->next);

+#ifdef mcs_skip_node
+check_next_node:
+#endif
if (likely(!next)) {
/*
* Release the lock by setting it to NULL
*/
- if (likely(cmpxchg(lock, node, NULL) == node))
+ if (likely(cmpxchg(lock, node, NULL) == node)) {
+ mcs_set_locked(node, 0);
return;
+ }
/* Wait until the next pointer is set */
while (!(next = ACCESS_ONCE(node->next)))
arch_mutex_cpu_relax();
}
+ /* Clear the lock flag to indicate the node can be used again. */
+ mcs_set_locked(node, 0);

/* Pass lock to next waiter. */
- arch_mcs_spin_unlock_contended(&next->locked);
+ arch_mcs_spin_unlock_contended(next);
+
+#ifdef mcs_skip_node
+ /*
+ * The next node should be skipped
+ */
+ node = next;
+ next = ACCESS_ONCE(node->next);
+ goto check_next_node;
+#endif
}

#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 45fe1b5..6351eca 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -25,6 +25,10 @@
#include <linux/spinlock.h>
#include <linux/interrupt.h>
#include <linux/debug_locks.h>
+/*
+ * Allow an MCS node to be skipped if need_resched() is true.
+ */
+#define mcs_skip_node(n) need_resched()
#include <linux/mcs_spinlock.h>

/*
@@ -45,6 +49,13 @@
*/
#define MUTEX_SHOW_NO_WAITER(mutex)	(atomic_read(&(mutex)->count) >= 0)

+/*
+ * Using a single mcs node per CPU is safe because mutex_lock() should not be
+ * called from interrupt context and we have preemption disabled over the mcs
+ * lock usage.
+ */
+static DEFINE_PER_CPU(struct mcs_spinlock, m_node);
+
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
@@ -166,6 +177,9 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
struct task_struct *owner;
int retval = 1;

+ if (need_resched())
+ return 0;
+
rcu_read_lock();
owner = ACCESS_ONCE(lock->owner);
if (owner)
@@ -370,6 +384,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct mutex_waiter waiter;
unsigned long flags;
int ret;
+ struct mcs_spinlock *node;

preempt_disable();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
@@ -400,9 +415,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
if (!mutex_can_spin_on_owner(lock))
goto slowpath;

+ node = this_cpu_ptr(&m_node);
+ if (node->locked < 0)
+ /* MCS node not available yet */
+ goto slowpath;
+
for (;;) {
struct task_struct *owner;
- struct mcs_spinlock node;

if (use_ww_ctx && ww_ctx->acquired > 0) {
struct ww_mutex *ww;
@@ -424,10 +443,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
* If there's an owner, wait for it to either
* release the lock or go to sleep.
*/
- mcs_spin_lock(&lock->mcs_lock, &node);
+ mcs_spin_lock(&lock->mcs_lock, node);
+ if (node->locked < 0)
+ /*
+ * need_resched() true, no unlock needed
+ */
+ goto slowpath;
owner = ACCESS_ONCE(lock->owner);
if (owner && !mutex_spin_on_owner(lock, owner)) {
- mcs_spin_unlock(&lock->mcs_lock, &node);
+ mcs_spin_unlock(&lock->mcs_lock, node);
goto slowpath;
}

@@ -442,11 +466,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
}

mutex_set_owner(lock);
- mcs_spin_unlock(&lock->mcs_lock, &node);
+ mcs_spin_unlock(&lock->mcs_lock, node);
preempt_enable();
return 0;
}
- mcs_spin_unlock(&lock->mcs_lock, &node);
+ mcs_spin_unlock(&lock->mcs_lock, node);

/*
* When there's no owner, we might have preempted between the


2014-02-06 14:04:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
> I have an alternative way of breaking out of the MCS lock waiting queue when
> need_resched() is set. I overload the locked flag to indicate a skipped node
> if negative.

I'm not quite seeing how it works (then again, I've not really read the
patch carefully).

Suppose you break out; at that point you get queued and go to sleep.
Suppose you got woken up while your MCS entry is still 'pending' and
magically win the race and acquire the lock.

At that point your MCS entry can be re-used while it's still part of the
list.

It's a fantastically small race window, but I don't see anything that
makes it impossible.

> I ran the patch through the AIM7 high-systime workload on a
> 4-socket server and it seemed to run fine.

How do people run this AIM7 piece of shit? I let it run for over an hour
and it generated exactly 0 numbers, it just sits there eating cpu-time
and creating a racket from my pantry.

2014-02-06 17:44:57

by Jason Low

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote:
> On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
> > On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
> >>> But urgh, nasty problem. Lemme ponder this a bit.
> > OK, please have a very careful look at the below. It survived a boot
> > with udev -- which usually stresses mutex contention enough to explode
> > (in fact it did a few times when I got the contention/cancel path wrong),
> > however I have not run anything else on it.
> >
> > The below is an MCS variant that allows relatively cheap unqueueing. But
> > it's somewhat tricky and I might have gotten a case wrong, esp. the
> > double concurrent cancel case got my head hurting (I didn't attempt a
> > triple unqueue).
> >
> > Applies to tip/master but does generate a few (harmless) compile
> > warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
> > thing.
> >
> > Also, there's a comment in the slowpath that bears consideration.
> >
> >
>
> I have an alternative way of breaking out of the MCS lock waiting queue
> when need_resched() is set. I overload the locked flag to indicate a
> skipped node if negative. I run the patch through the AIM7 high-systime
> workload on a 4-socket server and it seemed to run fine.
>
> Please check the following POC patch to see if you have any comment.

So one of the concerns I had with the approach of skipping nodes was
that, under heavy contention, we potentially could cause optimistic
spinning to be disabled on CPUs for a while since the nodes can't be
used until they have been released. One advantage of the unqueuing
method would be that nodes are usable after the spinners exit the MCS
queue and go to sleep.

Jason

2014-02-06 18:37:44

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On 02/06/2014 12:44 PM, Jason Low wrote:
> On Wed, 2014-02-05 at 16:44 -0500, Waiman Long wrote:
>> On 01/29/2014 06:51 AM, Peter Zijlstra wrote:
>>> On Tue, Jan 28, 2014 at 02:51:35PM -0800, Jason Low wrote:
>>>>> But urgh, nasty problem. Lemme ponder this a bit.
>>> OK, please have a very careful look at the below. It survived a boot
>>> with udev -- which usually stresses mutex contention enough to explode
>>> (in fact it did a few times when I got the contention/cancel path wrong),
>>> however I have not run anything else on it.
>>>
>>> The below is an MCS variant that allows relatively cheap unqueueing. But
>>> it's somewhat tricky and I might have gotten a case wrong, esp. the
>>> double concurrent cancel case got my head hurting (I didn't attempt a
>>> triple unqueue).
>>>
>>> Applies to tip/master but does generate a few (harmless) compile
>>> warnings because I didn't fully clean up the mcs_spinlock vs m_spinlock
>>> thing.
>>>
>>> Also, there's a comment in the slowpath that bears consideration.
>>>
>>>
>> I have an alternative way of breaking out of the MCS lock waiting queue
>> when need_resched() is set. I overload the locked flag to indicate a
>> skipped node if negative. I ran the patch through the AIM7 high-systime
>> workload on a 4-socket server and it seemed to run fine.
>>
>> Please check the following POC patch to see if you have any comment.
> So one of the concerns I had with the approach of skipping nodes was
> that, under heavy contention, we potentially could cause optimistic
> spinning to be disabled on CPUs for a while since the nodes can't be
> used until they have been released. One advantage of the unqueuing
> method would be that nodes are usable after the spinners exit the MCS
> queue and go to sleep.
>
> Jason
>

Under heavy contention, when many threads are trying to access the
mutexes using optimistic spinning, this patch can actually reduce the
number of wasted CPU cycles spent waiting in the MCS spin loop and let the
CPUs do other useful work. So I don't see that as a negative. I think
this kind of self-tuning is actually good for the overall throughput of
the system.

-Longman

2014-02-06 18:46:01

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On 02/06/2014 09:04 AM, Peter Zijlstra wrote:
> On Wed, Feb 05, 2014 at 04:44:34PM -0500, Waiman Long wrote:
>> I have an alternative way of breaking out of the MCS lock waiting queue when
>> need_resched() is set. I overload the locked flag to indicate a skipped node
>> if negative.
> I'm not quite seeing how it works (then again, I've not really read the
> patch carefully).
>
> Suppose you break out; at that point you get queued and go to sleep.
> Suppose you got woken up while your MCS entry is still 'pending' and
> magically win the race and acquire the lock.
>
> At that point your MCS entry can be re-used while it's still part of the
> list.

Actually the MCS node entry cannot be reused until an MCS queue head
task traverses that entry and clears the locked flag. So it is possible
that the affected CPU won't be able to participate in optimistic
spinning for a while if the mutex is heavily contended.

> It's a fantastically small race window, but I don't see anything that
> makes it impossible.

I will send out an official patch with some performance data to solicit
further feedback.

>> I ran the patch through the AIM7 high-systime workload on a
>> 4-socket server and it seemed to run fine.
> How do people run this AIM7 piece of shit? I let it run for over an hour
> and it generated exactly 0 numbers, it just sits there eating cpu-time
> and creating a racket from my pantry.

AIM7 can be tricky to set up. Fortunately, someone in our team had done
the ground work so I can just grab and run it.

-Longman

2014-02-06 20:11:08

by Norton, Scott J

[permalink] [raw]
Subject: RE: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

>> I run the patch through the AIM7 high-systime workload on a
>> 4-socket server and it seemed to run fine.
> How do people run this AIM7 piece of shit? I let it run for over an hour
> and it generated exactly 0 numbers, it just sits there eating cpu-time
> and creating a racket from my pantry.

./reaim -s100 -e2000 -t -j100 -i100 -f workfile.high_systime

The reaim.config file contains:

FILESIZE 10k
POOLSIZE 1m
DISKDIR /t0
DISKDIR /t1
DISKDIR /t2
DISKDIR /t3
DISKDIR /t4
DISKDIR /t5
DISKDIR /t6
DISKDIR /t7
DISKDIR /t8
DISKDIR /t9
DISKDIR /t10
DISKDIR /t11
DISKDIR /t12
DISKDIR /t13
DISKDIR /t14
DISKDIR /t15

The way Longman uses this is to create 16 ramdisk filesystems through
/dev/ram* and then mount those filesystems to the /t* directories.
Although you could run it through a regular filesystem also. It will use
whatever you place in the reaim.config file as DISKDIR. You can specify
one or more DISKDIR directories.


Scott

2014-02-10 17:02:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH v2 5/5] mutex: Give spinners a chance to spin_on_owner if need_resched() triggered while queued

On Thu, Feb 06, 2014 at 08:10:02PM +0000, Norton, Scott J wrote:
> > How do people run this AIM7 piece of shit? I let it run for over an hour
> > and it generated exactly 0 numbers, it just sits there eating cpu-time
> > and creating a racket from my pantry.
>
> ./reaim -s100 -e2000 -t -j100 -i100 -f workfile.high_systime
>
> The reaim.config file contains:
>
> FILESIZE 10k
> POOLSIZE 1m
> DISKDIR /t0
> DISKDIR /t1
> DISKDIR /t2
> DISKDIR /t3
> DISKDIR /t4
> DISKDIR /t5
> DISKDIR /t6
> DISKDIR /t7
> DISKDIR /t8
> DISKDIR /t9
> DISKDIR /t10
> DISKDIR /t11
> DISKDIR /t12
> DISKDIR /t13
> DISKDIR /t14
> DISKDIR /t15
>
> The way Longman uses this is to create 16 ramdisk filesystems through
> /dev/ram* and then mount those filesystems to the /t* directories.
> Although you could run it through a regular filesystem also. It will use
> whatever you place in the reaim.config file as DISKDIR. You can specify
> one or more DISKDIR directories.

OK, and we're back to creating a racket but not producing useful
numbers; how long is that crap supposed to run before it gives a number?

Surely it can produce a useful number after a few minutes of runtime..
Letting it run for hours is just a waste of time and money.

Note that I'm running this on a WSM-EP with (2*6*2) 24 CPUs and 24G of ram.