v2->v3:
- Remove patch 4 as it is not useful.
- Allow need_resched() check for waiter & add more comments about
changes to address issues raised by PeterZ.
v1->v2:
- Set task state to running before doing optimistic spinning.
- Add 2 more patches to handle possible missed wakeups and wasteful
spinning in try_to_wake_up() function.
This patchset is a variant of PeterZ's "locking/mutex: Avoid spinner
vs waiter starvation" patch. The major difference is that the
waiter-spinner won't enter the OSQ used by the spinners. Instead,
it will spin directly on the lock in parallel with the queue head
of the OSQ. There will be a bit more contention on the lock
cacheline, but that shouldn't have a noticeable impact on system
performance.
This patchset tries to address 2 issues with Peter's patch:
1) Ding Tianhong still found that a hanging task could happen in some
   cases.
2) Jason Low found that there was a performance regression for some
   AIM7 workloads.
By making the waiter-spinner spin directly on the mutex, we increase
its chance of getting the lock instead of having it wait in the OSQ
for its turn.
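Conceptually, the waiter's slowpath loop after patch 2 looks roughly
like the following simplified sketch (signal handling, ww-mutex
handling and the wait_lock locking around schedule() are elided; see
the actual diff in patch 2 for the real code):

	while (!acquired) {
		/* try to grab the lock; -1 marks that waiters are present */
		if (atomic_read(&lock->count) >= 0 &&
		    atomic_xchg_acquire(&lock->count, -1) == 1)
			break;
		schedule_preempt_disabled();	/* sleep until woken */
		/* woken up: spin on the lock directly, bypassing the OSQ */
		acquired = mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx,
						 true);
	}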
Patch 1 modifies the mutex_optimistic_spin() function to enable it
to be called by a waiter-spinner that doesn't need to go into the OSQ.
Patch 2 modifies the mutex locking slowpath to make the waiter call
mutex_optimistic_spin() to do the spinning after being woken up.
Patch 3 reverses the sequence of setting the task state and changing
the mutex count to -1 to prevent the possibility of a missed wakeup.
Patch 4, which has been dropped in v3, modified the wakeup code to
abandon the wakeup operation while spinning on the on_cpu flag if the
task has changed back to a non-sleeping state.
My own test on a 4-socket E7-4820 v3 system showed a regression of
about 4% in the high_systime workload with Peter's patch, which this
new patchset effectively eliminates.
Testing on an 8-socket Westmere-EX server, however, showed a
performance change ranging from -9% to more than +140% on the AIM7
fserver workload, depending on how the system was set up.
Waiman Long (3):
locking/mutex: Add waiter parameter to mutex_optimistic_spin()
locking/mutex: Enable optimistic spinning of woken task in wait queue
locking/mutex: Avoid missed wakeup of mutex waiter
kernel/locking/mutex.c | 126 ++++++++++++++++++++++++++++++++++--------------
1 files changed, 90 insertions(+), 36 deletions(-)
Ding Tianhong reported a live-lock situation where a constant stream
of incoming optimistic spinners blocked a task in the wait list from
getting the mutex.
This patch attempts to fix this live-lock condition by enabling the
woken task in the wait queue to enter into an optimistic spinning
loop itself in parallel with the regular spinners in the OSQ. This
should prevent the live-lock condition from happening.
When running the AIM7 benchmarks on a 4-socket E7-4820 v3 system (with
an ext4 filesystem), the additional waiter spinning improved
performance for the following workloads at high user counts:
  Workload        % Improvement
  --------        -------------
  alltests             3.9%
  disk                 3.4%
  fserver              2.0%
  long                 3.8%
  new_fserver         10.5%
The other workloads were about the same as before.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/mutex.c | 16 +++++++++++++++-
1 files changed, 15 insertions(+), 1 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5dd6171..5c0acee 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -538,6 +538,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
struct task_struct *task = current;
struct mutex_waiter waiter;
unsigned long flags;
+ bool acquired = false; /* True if the lock is acquired */
int ret;
preempt_disable();
@@ -568,7 +569,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
lock_contended(&lock->dep_map, ip);
- for (;;) {
+ while (!acquired) {
/*
* Lets try to take the lock again - this is needed even if
* we get here for the first time (shortly after failing to
@@ -603,6 +604,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
/* didn't get the lock, go to sleep: */
spin_unlock_mutex(&lock->wait_lock, flags);
schedule_preempt_disabled();
+
+ /*
+ * Optimistically spin on the mutex without holding the wait lock.
+ * The state has to be set to running to avoid another waker
+ * spinning on the on_cpu flag while the woken waiter is
+ * spinning on the mutex.
+ */
+ acquired = mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx,
+ true);
spin_lock_mutex(&lock->wait_lock, flags);
}
__set_task_state(task, TASK_RUNNING);
@@ -613,6 +623,9 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
atomic_set(&lock->count, 0);
debug_mutex_free_waiter(&waiter);
+ if (acquired)
+ goto unlock;
+
skip_wait:
/* got the lock - cleanup and rejoice! */
lock_acquired(&lock->dep_map, ip);
@@ -623,6 +636,7 @@ skip_wait:
ww_mutex_set_context_slowpath(ww, ww_ctx);
}
+unlock:
spin_unlock_mutex(&lock->wait_lock, flags);
preempt_enable();
return 0;
--
1.7.1
This patch adds a new waiter parameter to the mutex_optimistic_spin()
function to prepare it to be used by a waiter-spinner that doesn't
need to go into the OSQ, since there can only be one waiter-spinner,
namely the head of the wait queue.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/mutex.c | 66 +++++++++++++++++++++++++++++++++--------------
1 files changed, 46 insertions(+), 20 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 0551c21..5dd6171 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -273,11 +273,15 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
/*
* Atomically try to take the lock when it is available
+ *
+ * For a waiter-spinner, the count needs to be set to -1 first; it will be
+ * cleared to 0 later on if the wait list becomes empty. For a regular
+ * spinner, the count will be set to 0.
*/
-static inline bool mutex_try_to_acquire(struct mutex *lock)
+static inline bool mutex_try_to_acquire(struct mutex *lock, int waiter)
{
return !mutex_is_locked(lock) &&
- (atomic_cmpxchg_acquire(&lock->count, 1, 0) == 1);
+ (atomic_cmpxchg_acquire(&lock->count, 1, waiter ? -1 : 0) == 1);
}
/*
@@ -302,22 +306,42 @@ static inline bool mutex_try_to_acquire(struct mutex *lock)
*
* Returns true when the lock was taken, otherwise false, indicating
* that we need to jump to the slowpath and sleep.
+ *
+ * The waiter flag is set to true if the spinner is a waiter in the wait
+ * queue. As the waiter has slept for a while, it should have priority to
+ * get the lock over the regular spinners. So going to wait at the end of
+ * the OSQ isn't fair to the waiter. Instead, it will spin on the lock
+ * directly and concurrently with the spinner at the head of the OSQ, if
+ * present. There may be a bit more cacheline contention in this case.
+ * The waiter also needs to set the lock to -1 instead of 0 on lock
+ * acquisition.
*/
static bool mutex_optimistic_spin(struct mutex *lock,
- struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ const bool use_ww_ctx, int waiter)
{
struct task_struct *task = current;
+ bool acquired = false;
- if (!mutex_can_spin_on_owner(lock))
- goto done;
+ if (!waiter) {
+ /*
+ * The purpose of the mutex_can_spin_on_owner() function is
+ * to eliminate the overhead of osq_lock() and osq_unlock()
+ * in case spinning isn't possible. As a waiter-spinner
+ * is not going to take OSQ lock anyway, there is no need
+ * to call mutex_can_spin_on_owner().
+ */
+ if (!mutex_can_spin_on_owner(lock))
+ goto done;
- /*
- * In order to avoid a stampede of mutex spinners trying to
- * acquire the mutex all at once, the spinners need to take a
- * MCS (queued) lock first before spinning on the owner field.
- */
- if (!osq_lock(&lock->osq))
- goto done;
+ /*
+ * In order to avoid a stampede of mutex spinners trying to
+ * acquire the mutex all at once, the spinners need to take a
+ * MCS (queued) lock first before spinning on the owner field.
+ */
+ if (!osq_lock(&lock->osq))
+ goto done;
+ }
while (true) {
struct task_struct *owner;
@@ -347,7 +371,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
break;
/* Try to acquire the mutex if it is unlocked. */
- if (mutex_try_to_acquire(lock)) {
+ if (mutex_try_to_acquire(lock, waiter)) {
lock_acquired(&lock->dep_map, ip);
if (use_ww_ctx) {
@@ -358,8 +382,8 @@ static bool mutex_optimistic_spin(struct mutex *lock,
}
mutex_set_owner(lock);
- osq_unlock(&lock->osq);
- return true;
+ acquired = true;
+ break;
}
/*
@@ -380,14 +404,15 @@ static bool mutex_optimistic_spin(struct mutex *lock,
cpu_relax_lowlatency();
}
- osq_unlock(&lock->osq);
+ if (!waiter)
+ osq_unlock(&lock->osq);
done:
/*
* If we fell out of the spin path because of need_resched(),
* reschedule now, before we try-lock the mutex. This avoids getting
* scheduled out right after we obtained the mutex.
*/
- if (need_resched()) {
+ if (!acquired && need_resched()) {
/*
* We _should_ have TASK_RUNNING here, but just in case
* we do not, make it so, otherwise we might get stuck.
@@ -396,11 +421,12 @@ done:
schedule_preempt_disabled();
}
- return false;
+ return acquired;
}
#else
static bool mutex_optimistic_spin(struct mutex *lock,
- struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+ struct ww_acquire_ctx *ww_ctx,
+ const bool use_ww_ctx, int waiter)
{
return false;
}
@@ -517,7 +543,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
preempt_disable();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
- if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
+ if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx, false)) {
/* got the lock, yay! */
preempt_enable();
return 0;
--
1.7.1
The current mutex code sets count to -1 and then sets the task
state. This is the same order in which the mutex unlock path checks
the count and the task state. That could lead to a missed wakeup, even
though the problem will be cleared when a new waiter enters the waiting
queue.
This patch reverses the order in the locking slowpath so that the task
state is set first before setting the count. This should eliminate
the potential missed wakeup and improve latency.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/mutex.c | 44 +++++++++++++++++++++++++++++---------------
1 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5c0acee..c9af28c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -571,20 +571,6 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
while (!acquired) {
/*
- * Lets try to take the lock again - this is needed even if
- * we get here for the first time (shortly after failing to
- * acquire the lock), to make sure that we get a wakeup once
- * it's unlocked. Later on, if we sleep, this is the
- * operation that gives us the lock. We xchg it to -1, so
- * that when we release the lock, we properly wake up the
- * other waiters. We only attempt the xchg if the count is
- * non-negative in order to avoid unnecessary xchg operations:
- */
- if (atomic_read(&lock->count) >= 0 &&
- (atomic_xchg_acquire(&lock->count, -1) == 1))
- break;
-
- /*
* got a signal? (This code gets eliminated in the
* TASK_UNINTERRUPTIBLE case.)
*/
@@ -599,7 +585,35 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
goto err;
}
- __set_task_state(task, state);
+
+ /*
+ * We need to set the state first before changing the count
+ * to -1 to avoid missed wakeup even though the problem can
+ * be cleared by a new waiter entering the queue.
+ *
+ *    Sleep                          Wakeup
+ *    -----                          ------
+ * [S] p->state = state         [RmW] count = 1
+ *        MB                            MB
+ * [RmW] count = -1              [L] if ((prev_count < 0) &&
+ *     if (prev_count < 1)               (p->state & state))
+ *         sleep                           wakeup
+ */
+ set_task_state(task, state);
+
+ /*
+ * Lets try to take the lock again - this is needed even if
+ * we get here for the first time (shortly after failing to
+ * acquire the lock), to make sure that we get a wakeup once
+ * it's unlocked. Later on, if we sleep, this is the
+ * operation that gives us the lock. We xchg it to -1, so
+ * that when we release the lock, we properly wake up the
+ * other waiters. We only attempt the xchg if the count is
+ * non-negative in order to avoid unnecessary xchg operations:
+ */
+ if (atomic_read(&lock->count) >= 0 &&
+ (atomic_xchg_acquire(&lock->count, -1) == 1))
+ break;
/* didn't get the lock, go to sleep: */
spin_unlock_mutex(&lock->wait_lock, flags);
--
1.7.1
On Tue, Mar 22, 2016 at 01:46:43PM -0400, Waiman Long wrote:
> Ding Tianhong reported a live-lock situation where a constant stream
> of incoming optimistic spinners blocked a task in the wait list from
> getting the mutex.
>
> This patch attempts to fix this live-lock condition by enabling the
> woken task in the wait queue to enter into an optimistic spinning
> loop itself in parallel with the regular spinners in the OSQ. This
> should prevent the live-lock condition from happening.
I would very much like a few words on how fairness is preserved.
Because while the waiter remains on the wait_list while it spins, and
therefore unlock()s will only wake it, and we'll only contend with the
one waiter, the fact that we have two spinners is not fair or starvation
proof at all.
By adding the waiter to the OSQ we get only a single spinner and force
'fairness' by queuing.
I say 'fairness' because the OSQ (need_resched) cancellation can still
take the waiter out again and let even more new spinners in.
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 5dd6171..5c0acee 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -538,6 +538,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> struct task_struct *task = current;
> struct mutex_waiter waiter;
> unsigned long flags;
> + bool acquired = false; /* True if the lock is acquired */
Superfluous space there.
On Tue, Mar 22, 2016 at 01:46:44PM -0400, Waiman Long wrote:
> The current mutex code sets count to -1 and then sets the task
> state. This is the same order in which the mutex unlock path checks
> the count and the task state. That could lead to a missed wakeup, even
> though the problem will be cleared when a new waiter enters the waiting
> queue.
>
> This patch reverses the order in the locking slowpath so that the task
> state is set first before setting the count. This should eliminate
> the potential missed wakeup and improve latency.
Is it really a problem though?
So the 'race' is __mutex_lock_common() against
__mutex_fastpath_unlock(), and that is fully serialized as per the
atomic instructions. Either the fast unlock path does 1->0 and the lock
acquires, or the lock sets -1, at which the unlock fails and enters
__mutex_unlock_common_slowpath, which is fully serialised against
__mutex_lock_common by the lock->wait_lock.
I agree that the code is nicer after your patch, but I don't actually
see a problem.
On Tue, Mar 29, 2016 at 05:39:35PM +0200, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 01:46:43PM -0400, Waiman Long wrote:
> > Ding Tianhong reported a live-lock situation where a constant stream
> > of incoming optimistic spinners blocked a task in the wait list from
> > getting the mutex.
> >
> > This patch attempts to fix this live-lock condition by enabling the
> > woken task in the wait queue to enter into an optimistic spinning
> > loop itself in parallel with the regular spinners in the OSQ. This
> > should prevent the live-lock condition from happening.
>
> I would very much like a few words on how fairness is preserved.
>
> Because while the waiter remains on the wait_list while it spins, and
> therefore unlock()s will only wake it, and we'll only contend with the
> one waiter, the fact that we have two spinners is not fair or starvation
> proof at all.
Alternatively, we can say this is good enough until proven deficient,
but then we should still very much document this.
On 03/29/2016 12:36 PM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 01:46:44PM -0400, Waiman Long wrote:
>> The current mutex code sets count to -1 and then sets the task
>> state. This is the same order in which the mutex unlock path checks
>> the count and the task state. That could lead to a missed wakeup, even
>> though the problem will be cleared when a new waiter enters the waiting
>> queue.
>>
>> This patch reverses the order in the locking slowpath so that the task
>> state is set first before setting the count. This should eliminate
>> the potential missed wakeup and improve latency.
> Is it really a problem though?
>
> So the 'race' is __mutex_lock_common() against
> __mutex_fastpath_unlock(), and that is fully serialized as per the
> atomic instructions. Either the fast unlock path does 1->0 and the lock
> acquires, or the lock sets -1, at which the unlock fails and enters
> __mutex_unlock_common_slowpath, which is fully serialised against
> __mutex_lock_common by the lock->wait_lock.
>
> I agree that the code is nicer after your patch, but I don't actually
> see a problem.
You are right again. I think I missed the spinlock serialization part
in my analysis. So this patch isn't really necessary. I can withdraw
it or mark it as a cleanup.
Cheers,
Longman
On 03/29/2016 12:42 PM, Peter Zijlstra wrote:
> On Tue, Mar 29, 2016 at 05:39:35PM +0200, Peter Zijlstra wrote:
>> On Tue, Mar 22, 2016 at 01:46:43PM -0400, Waiman Long wrote:
>>> Ding Tianhong reported a live-lock situation where a constant stream
>>> of incoming optimistic spinners blocked a task in the wait list from
>>> getting the mutex.
>>>
>>> This patch attempts to fix this live-lock condition by enabling the
>>> woken task in the wait queue to enter into an optimistic spinning
>>> loop itself in parallel with the regular spinners in the OSQ. This
>>> should prevent the live-lock condition from happening.
>> I would very much like a few words on how fairness is preserved.
>>
>> Because while the waiter remains on the wait_list while it spins, and
>> therefore unlock()s will only wake it, and we'll only contend with the
>> one waiter, the fact that we have two spinners is not fair or starvation
>> proof at all.
> Alternatively, we can say this is good enough until proven deficient,
> but then we should still very much document this.
Yes, we can certainly do that.
Cheers,
Longman
On 03/29/2016 11:39 AM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 01:46:43PM -0400, Waiman Long wrote:
>> Ding Tianhong reported a live-lock situation where a constant stream
>> of incoming optimistic spinners blocked a task in the wait list from
>> getting the mutex.
>>
>> This patch attempts to fix this live-lock condition by enabling the
>> woken task in the wait queue to enter into an optimistic spinning
>> loop itself in parallel with the regular spinners in the OSQ. This
>> should prevent the live-lock condition from happening.
> I would very much like a few words on how fairness is preserved.
>
> Because while the waiter remains on the wait_list while it spins, and
> therefore unlock()s will only wake it, and we'll only contend with the
> one waiter, the fact that we have two spinners is not fair or starvation
> proof at all.
>
> By adding the waiter to the OSQ we get only a single spinner and force
> 'fairness' by queuing.
>
> I say 'fairness' because the OSQ (need_resched) cancellation can still
> take the waiter out again and let even more new spinners in.
>
In my v1 patch, I added a flag to the mutex structure to signal that
the waiter is spinning and that the OSQ spinner should yield, in order
to address this fairness issue. I took it out in my later patches as
you said you wanted to make the patch simpler.
Yes, I do agree that it is not guaranteed that the waiter-spinner will
have a decent chance to get the lock, but I think it is still better
than queuing at the end of the OSQ, as the time slice may expire before
the waiter bubbles up to the head of the queue. This can be especially
problematic if the waiter has a lower priority, which means a shorter
time slice.
What do you think about the idea of adding a flag as in my v1 patch? For
64-bit systems, there is a 4-byte hole below osq and so it won't
increase the structure size. There will be a 4-byte increase in size for
32-bit systems, though.
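For reference, a very rough sketch of what such a flag might look like
(the field name, its placement and the check are hypothetical
illustrations here, not the actual v1 code):

	/* hypothetical: fill the 4-byte hole below osq in struct mutex */
	int	yield_to_waiter;	/* nonzero while a waiter spins on the lock */

	/* hypothetical check in the OSQ spinner's loop in mutex_optimistic_spin() */
	if (READ_ONCE(lock->yield_to_waiter))
		break;	/* back off so the spinning waiter can take the lock */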
Alternatively, I can certainly add a bit more comments to explain the
situation and the choice that we made.
>> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
>> index 5dd6171..5c0acee 100644
>> --- a/kernel/locking/mutex.c
>> +++ b/kernel/locking/mutex.c
>> @@ -538,6 +538,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>> struct task_struct *task = current;
>> struct mutex_waiter waiter;
>> unsigned long flags;
>> + bool acquired = false; /* True if the lock is acquired */
> Superfluous space there.
OK, will remove that.
Cheers,
Longman