2015-05-19 17:25:15

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH -tip 0/4] rtmutex: Spin on owner

Hello,

First three patches are straightforward and found while
going through the code. Patch 4 is the actual meat of the
set, but similar to what we have already in regular mutexes.
While having optimistic spinning in rtmutexes is a desired
feature, I've marked it RFC as I could very well have missed
something inherent in the rt flavor.

Details are in the individual patches. The set passes the pi tests
from Darren's futextest suite, as well as a weekend-long run of
pi_stress from rt-tests on a 60-core box.

Thanks!

Davidlohr Bueso (4):
locking/rtmutex: Implement lockless top-waiter wakeup
locking/rtmutex: Use cmp-cmpxchg
locking/rtmutex: Update stale plist comments
locking/rtmutex: Support spin on owner (osq)

include/linux/rtmutex.h | 4 ++
kernel/Kconfig.locks | 4 ++
kernel/locking/rtmutex.c | 175 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 162 insertions(+), 21 deletions(-)

--
2.1.4


2015-05-19 17:25:29

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup

Mark the task for later wakeup after the wait_lock has been released.
This way, once the next task is awoken, it will have a better chance
of finding the wait_lock free when continuing its execution in
__rt_mutex_slowlock() when trying to acquire the rtmutex, calling
try_to_take_rt_mutex(). In contended scenarios, other tasks attempting
to take the lock may acquire it first, right after the wait_lock is
released, but (a) this can also occur with the current code, as it
relies on the spinlock fairness, and (b) we are dealing with the
top-waiter anyway, so it will always take the lock next.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/locking/rtmutex.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 36573e9..74188d8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -955,14 +955,13 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
}

/*
- * Wake up the next waiter on the lock.
- *
* Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * queue it up.
*
* Called with lock->wait_lock held.
*/
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
+ struct rt_mutex *lock)
{
struct rt_mutex_waiter *waiter;
unsigned long flags;
@@ -991,12 +990,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)

raw_spin_unlock_irqrestore(&current->pi_lock, flags);

- /*
- * It's safe to dereference waiter as it cannot go away as
- * long as we hold lock->wait_lock. The waiter task needs to
- * acquire it in order to dequeue the waiter.
- */
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

/*
@@ -1255,6 +1249,8 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
static void __sched
rt_mutex_slowunlock(struct rt_mutex *lock)
{
+ WAKE_Q(wake_q);
+
raw_spin_lock(&lock->wait_lock);

debug_rt_mutex_unlock(lock);
@@ -1303,10 +1299,13 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
/*
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
+ *
+ * Queue the next waiter for wakeup once we release the wait_lock.
*/
- wakeup_next_waiter(lock);
+ mark_wakeup_next_waiter(&wake_q, lock);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);

/* Undo pi boosting if necessary: */
rt_mutex_adjust_prio(current);
--
2.1.4
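
The pattern introduced above, record the wakeup while still holding
wait_lock and perform it only after the lock has been dropped, is easy to
model outside the kernel. Below is a minimal user-space sketch of that
deferred-wakeup idea, assuming POSIX threads and semaphores; the
wake_queue type and its helpers are invented for illustration and are not
the kernel's wake_q API.

/*
 * User-space sketch of the deferred-wakeup pattern above
 * (wake_q_add() under the lock, wake_up_q() after dropping it).
 * Illustrative only; not kernel code.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define MAX_WAKEUPS	8

struct wake_queue {
	sem_t *tasks[MAX_WAKEUPS];	/* tasks marked for later wakeup */
	int nr;
};

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t waiter_sem;

/* Analogue of wake_q_add(): remember the task, do not wake it yet. */
static void wake_queue_add(struct wake_queue *wq, sem_t *task)
{
	if (wq->nr < MAX_WAKEUPS)
		wq->tasks[wq->nr++] = task;
}

/* Analogue of wake_up_q(): deliver the wakeups, lock already dropped. */
static void wake_queue_flush(struct wake_queue *wq)
{
	for (int i = 0; i < wq->nr; i++)
		sem_post(wq->tasks[i]);
	wq->nr = 0;
}

static void *waiter(void *arg)
{
	(void)arg;
	sem_wait(&waiter_sem);
	/* wait_lock was released before the wakeup, so no collision here. */
	pthread_mutex_lock(&wait_lock);
	puts("waiter: acquired wait_lock after wakeup");
	pthread_mutex_unlock(&wait_lock);
	return NULL;
}

int main(void)
{
	struct wake_queue wq = { .nr = 0 };
	pthread_t t;

	sem_init(&waiter_sem, 0, 0);
	pthread_create(&t, NULL, waiter, NULL);

	pthread_mutex_lock(&wait_lock);
	/* ... dequeue the top waiter under the lock ... */
	wake_queue_add(&wq, &waiter_sem);
	pthread_mutex_unlock(&wait_lock);
	wake_queue_flush(&wq);		/* wake only after unlocking */

	pthread_join(t, NULL);
	sem_destroy(&waiter_sem);
	return 0;
}

Built with cc -pthread, the only point of the sketch is the ordering: the
wakeup is recorded under wait_lock but delivered after the unlock, which
is what gives the woken task a better chance of finding wait_lock free.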

2015-05-19 17:25:25

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

Avoid unnecessary cmpxchg calls; all of our other locks
use this technique as well.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/locking/rtmutex.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 74188d8..1d5cc9d 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
* set up.
*/
#ifndef CONFIG_DEBUG_RT_MUTEXES
-# define rt_mutex_cmpxchg(l,c,n) (cmpxchg(&l->owner, c, n) == c)
+# define rt_mutex_cmpxchg(l,c,n) \
+ (l->owner == c && cmpxchg(&l->owner, c, n) == c)
+
static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
{
unsigned long owner, *p = (unsigned long *) &lock->owner;
--
2.1.4
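
The change above is the compare-before-cmpxchg (CCAS) pattern: a plain
load filters out the already-owned case so the atomic read-modify-write,
and the exclusive cache-line access it requires, is only attempted when
it can succeed. A stand-alone C11 sketch of the idea follows; the names
are made up and this is not the rtmutex code itself.

/*
 * CCAS sketch: filter with a cheap load before the expensive cmpxchg.
 * Illustrative user-space code, not kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_lock {
	_Atomic(void *) owner;		/* NULL means unlocked */
};

/* Always issues the atomic RMW, even when it is bound to fail. */
static bool plain_trylock(struct fake_lock *l, void *me)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&l->owner, &expected, me);
}

/* CCAS: a cheap read-only test first, the cmpxchg only if it can win. */
static bool ccas_trylock(struct fake_lock *l, void *me)
{
	void *expected = NULL;

	if (atomic_load_explicit(&l->owner, memory_order_relaxed) != NULL)
		return false;

	return atomic_compare_exchange_strong(&l->owner, &expected, me);
}

int main(void)
{
	struct fake_lock lock1 = { .owner = NULL };
	struct fake_lock lock2 = { .owner = NULL };
	int me;

	bool a = plain_trylock(&lock1, &me);	/* works, but always does the RMW  */
	bool b = ccas_trylock(&lock2, &me);	/* same result, cheaper when owned */

	return (a && b) ? 0 : 1;
}

Whether the extra load pays off depends on how often the lock is observed
owned at that point; that is essentially the question Thomas and Jason
raise later in the thread about using it in the unlocked fastpath versus
a contended spin loop.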

2015-05-19 17:29:29

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 3/4] locking/rtmutex: Update stale plist comments

... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/locking/rtmutex.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1d5cc9d..81aad32 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -626,7 +626,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*/
prerequeue_top_waiter = rt_mutex_top_waiter(lock);

- /* [7] Requeue the waiter in the lock waiter list. */
+ /* [7] Requeue the waiter in the lock waiter tree. */
rt_mutex_dequeue(lock, waiter);
waiter->prio = task->prio;
rt_mutex_enqueue(lock, waiter);
@@ -664,7 +664,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter became the new top (highest priority)
* waiter on the lock. Replace the previous top waiter
- * in the owner tasks pi waiters list with this waiter
+ * in the owner tasks pi waiters tree with this waiter
* and adjust the priority of the owner.
*/
rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -675,7 +675,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter was the top waiter on the lock, but is
* no longer the top prority waiter. Replace waiter in
- * the owner tasks pi waiters list with the new top
+ * the owner tasks pi waiters tree with the new top
* (highest priority) waiter and adjust the priority
* of the owner.
* The new top waiter is stored in @waiter so that
@@ -749,7 +749,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*
* @lock: The lock to be acquired.
* @task: The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
* callsite called task_blocked_on_lock(), otherwise NULL
*/
static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -784,7 +784,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* If @waiter != NULL, @task has already enqueued the waiter
- * into @lock waiter list. If @waiter == NULL then this is a
+ * into @lock waiter tree. If @waiter == NULL then this is a
* trylock attempt.
*/
if (waiter) {
@@ -797,7 +797,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* We can acquire the lock. Remove the waiter from the
- * lock waiters list.
+ * lock waiters tree.
*/
rt_mutex_dequeue(lock, waiter);

@@ -829,7 +829,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
* No waiters. Take the lock without the
* pi_lock dance.@task->pi_blocked_on is NULL
* and we have no waiters to enqueue in @task
- * pi waiters list.
+ * pi waiters tree.
*/
goto takeit;
}
@@ -846,7 +846,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
/*
* Finish the lock acquisition. @task is the new owner. If
* other waiters exist we have to insert the highest priority
- * waiter into @task->pi_waiters list.
+ * waiter into @task->pi_waiters tree.
*/
if (rt_mutex_has_waiters(lock))
rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -957,7 +957,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
}

/*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
* queue it up.
*
* Called with lock->wait_lock held.
--
2.1.4

2015-05-19 17:25:35

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

Similar to what we have in other locks, particularly regular mutexes, the
idea is that as long as the owner is running, there is a fair chance it'll
release the lock soon, and thus a task trying to acquire the rtmutex will
be better off spinning instead of blocking immediately after the fastpath.
Conditions to stop spinning and enter the slowpath are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

Spinning tasks' performance is further enhanced by queueing in the
cancelable MCS lock (osq). Because rtmutexes track the lock owner
atomically, we can extend the fastpath to continue polling on the lock
owner via cmpxchg(lock->owner, NULL, current).

This is a conservative approach: upon entering the slowpath, we force
all lock spinners (including future ones) to serialize on the wait_lock,
as mark_rt_mutex_waiters() is called unconditionally.

   CPU0                            CPU1
 optimistic_spin() (failed)
 try_to_take_rt_mutex()
   mark_rt_mutex_waiters()
                                   optimistic_spin() (failed cmpxchg)
                                   spin_lock(&lock->wait_lock)

As such, we check for the RT_MUTEX_HAS_WAITERS bit 0
(rt_mutex_has_waiters_fast()). This allows priority boosting to take
precedence over spinning, as otherwise we could starve a higher priority
queued-up task (i.e. the top waiter) if spinners constantly steal the
lock. An alternative would be to stop spinning when we should really
wake up a higher priority waiter, but of course we do not hold the
wait_lock, so that is racy.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/rtmutex.h | 4 ++
kernel/Kconfig.locks | 4 ++
kernel/locking/rtmutex.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 140 insertions(+)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..b5e85ca 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/rbtree.h>
#include <linux/spinlock_types.h>
+#include <linux/osq_lock.h>

extern int max_lock_depth; /* for sysctl */

@@ -31,6 +32,9 @@ struct rt_mutex {
struct rb_root waiters;
struct rb_node *waiters_leftmost;
struct task_struct *owner;
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+ struct optimistic_spin_queue osq; /* Spinner MCS lock */
+#endif
#ifdef CONFIG_DEBUG_RT_MUTEXES
int save_state;
const char *name, *file;
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..628915d 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW

+config RT_MUTEX_SPIN_ON_OWNER
+ def_bool y
+ depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
config RWSEM_SPIN_ON_OWNER
def_bool y
depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 81aad32..6524c7c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -63,6 +63,20 @@ static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
}

+/*
+ * Lockless alternative to rt_mutex_has_waiters() as we do not need the
+ * wait_lock to check if we are in, for instance, a transitional state
+ * after calling mark_rt_mutex_waiters().
+ */
+static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
+{
+ unsigned long val = (unsigned long)lock->owner;
+
+ if (!val)
+ return false;
+ return val & RT_MUTEX_HAS_WAITERS;
+}
+
static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
{
if (!rt_mutex_has_waiters(lock))
@@ -1152,6 +1166,121 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
}
}

+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static noinline
+bool rt_mutex_spin_on_owner(struct rt_mutex *lock, struct task_struct *owner)
+{
+ bool ret = true;
+
+ rcu_read_lock();
+ while (rt_mutex_owner(lock) == owner) {
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+
+ if (!owner->on_cpu || need_resched()) {
+ ret = false;
+ break;
+ }
+
+ cpu_relax_lowlatency();
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
+{
+ struct task_struct *owner;
+ /* default return to spin: if no owner, the lock is free */
+ int ret = true;
+
+ if (need_resched())
+ return 0;
+
+ rcu_read_lock();
+ owner = rt_mutex_owner(lock);
+ if (owner)
+ ret = owner->on_cpu;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+ bool taken = false;
+
+ preempt_disable();
+
+ if (!rt_mutex_can_spin_on_owner(lock))
+ goto done;
+ /*
+ * In order to avoid a stampede of mutex spinners trying to
+ * acquire the mutex all at once, the spinners need to take a
+ * MCS (queued) lock first before spinning on the owner field.
+ */
+ if (!osq_lock(&lock->osq))
+ goto done;
+
+ while (true) {
+ struct task_struct *owner;
+
+ /*
+ * When another task is competing for the lock in the
+ * slowpath (either transitional or not), avoid the
+ * cmpxchg and immediately block. If the situation is
+ * later fixed by clearing the waiters bit, future
+ * tasks that attempt to take the rtmutex can spin.
+ */
+ if (rt_mutex_has_waiters_fast(lock))
+ goto done_unlock;
+
+ /*
+ * If there's an owner, wait for it to either
+ * release the lock or go to sleep.
+ */
+ owner = rt_mutex_owner(lock);
+ if (owner && !rt_mutex_spin_on_owner(lock, owner))
+ break;
+
+ /* Try to acquire the lock, if it is unlocked. */
+ if (rt_mutex_cmpxchg(lock, NULL, current)) {
+ taken = true;
+ goto done_unlock;
+ }
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ cpu_relax_lowlatency();
+ }
+
+done_unlock:
+ osq_unlock(&lock->osq);
+done:
+ preempt_enable();
+ return taken;
+}
+
+#else
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+ return false;
+}
+#endif
+
/*
* Slow path lock function:
*/
@@ -1163,6 +1292,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter waiter;
int ret = 0;

+ if (rt_mutex_optimistic_spin(lock))
+ return ret;
+
debug_rt_mutex_init_waiter(&waiter);
RB_CLEAR_NODE(&waiter.pi_tree_entry);
RB_CLEAR_NODE(&waiter.tree_entry);
--
2.1.4
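
The policy described in the changelog, spin while the lock looks clean,
take it with a cmpxchg once it is free, and fall back to blocking as soon
as waiters are known to exist, can be modelled in plain user-space C. The
sketch below is illustrative only: the names, the fixed spin budget and
the sched_yield() stand-in for blocking are invented, and there is no osq
queueing, owner->on_cpu tracking or priority boosting in it.

/*
 * User-space model of the spin-then-block policy from the patch above.
 * Purely illustrative: no osq, no owner->on_cpu check, no PI.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sched.h>

#define HAS_WAITERS	1UL	/* models RT_MUTEX_HAS_WAITERS in bit 0 */

struct rtlock_model {
	_Atomic uintptr_t owner;	/* 0: free, otherwise an (even) task cookie */
};

static bool has_waiters_fast(struct rtlock_model *l)
{
	/* Lockless peek at the waiters bit, in the spirit of
	 * rt_mutex_has_waiters_fast() above. */
	return atomic_load_explicit(&l->owner, memory_order_relaxed) & HAS_WAITERS;
}

static bool optimistic_spin(struct rtlock_model *l, uintptr_t me, int budget)
{
	while (budget--) {
		/* Somebody queued up as a waiter: stop stealing, go block. */
		if (has_waiters_fast(l))
			return false;

		uintptr_t expected = 0;
		if (atomic_compare_exchange_weak(&l->owner, &expected, me))
			return true;	/* lock was free, we now own it */
		/* otherwise keep spinning on the owner field */
	}
	return false;
}

static void slowpath_lock(struct rtlock_model *l, uintptr_t me)
{
	/* Stand-in for blocking on the rtmutex: yield and retry. */
	uintptr_t expected;

	do {
		sched_yield();
		expected = 0;
	} while (!atomic_compare_exchange_weak(&l->owner, &expected, me));
}

static void lock_model(struct rtlock_model *l, uintptr_t me)
{
	if (!optimistic_spin(l, me, 1000))
		slowpath_lock(l, me);
}

static void unlock_model(struct rtlock_model *l)
{
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}

int main(void)
{
	struct rtlock_model l = { .owner = 0 };
	uintptr_t me = 2;		/* any non-zero cookie with bit 0 clear */

	lock_model(&l, me);
	unlock_model(&l);
	return 0;
}

The real patch additionally stops spinning when the owner is no longer
running or when need_resched() is set, and serializes spinners through
the osq lock; the sketch only captures the lockless has-waiters check
that the rest of the thread debates.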

2015-05-20 07:11:18

by Paul Bolle

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

(This is posted as an RFC, so you probably don't want feedback going
into detail yet, but I couldn't help spotting this.)

On Tue, 2015-05-19 at 10:24 -0700, Davidlohr Bueso wrote:
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks

> +config RT_MUTEX_SPIN_ON_OWNER
> + def_bool y
> + depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW

s/CONFIG_DEBUG_RT_MUTEXES/DEBUG_RT_MUTEXES/?

Or perhaps even drop "&& !CONFIG_DEBUG_RT_MUTEXES" entirely: that
test currently always evaluates to true, so it might not be needed.

Running
scripts/checkkconfigsymbols.py --diff $sha1..$sha2

helps catch typos like this.


Paul Bolle

2015-05-25 20:35:29

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Wed, 2015-05-20 at 09:11 +0200, Paul Bolle wrote:
> (This is pasted as an RFC, so you probably don't want feedback going
> into detail yet, but I couldn't help spotting this.)
>
> On Tue, 2015-05-19 at 10:24 -0700, Davidlohr Bueso wrote:
> > --- a/kernel/Kconfig.locks
> > +++ b/kernel/Kconfig.locks
>
> > +config RT_MUTEX_SPIN_ON_OWNER
> > + def_bool y
> > + depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
>
> s/CONFIG_DEBUG_RT_MUTEXES/DEBUG_RT_MUTEXES/?

Ah, yes thanks.

2015-05-26 19:06:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH -tip 0/4] rtmutex: Spin on owner

On Mon, 25 May 2015, Davidlohr Bueso wrote:

> ping?

pong!

It's in my review backlog list and it's gonna stay there until I
return from vacation and conferencing end of next week.

Thanks,

tglx

2015-06-05 12:35:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> Mark the task for later wakeup after the wait_lock has been released.
> This way, once the next task is awoken, it will have a better chance
> to of finding the wait_lock free when continuing executing in
> __rt_mutex_slowlock() when trying to acquire the rtmutex, calling
> try_to_take_rt_mutex(). Upon contended scenarios, other tasks attempting
> take the lock may acquire it first, right after the wait_lock is released,
> but (a) this can also occur with the current code, as it relies on the
> spinlock fairness, and (b) we are dealing with the top-waiter anyway,
> so it will always take the lock next.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2015-06-05 12:38:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> Avoid unnecessary cmpxchg calls, all of our other locks
> use it as well.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> kernel/locking/rtmutex.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 74188d8..1d5cc9d 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
> * set up.
> */
> #ifndef CONFIG_DEBUG_RT_MUTEXES
> -# define rt_mutex_cmpxchg(l,c,n) (cmpxchg(&l->owner, c, n) == c)
> +# define rt_mutex_cmpxchg(l,c,n) \
> + (l->owner == c && cmpxchg(&l->owner, c, n) == c)

Errm. How does that improve stuff like rt_mutex_lock() ?

Thanks,

tglx

2015-06-05 12:39:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] locking/rtmutex: Update stale plist comments

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
> no longer use plists for queuing any waiters. Update stale comments.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>

Reviewed-by: Thomas Gleixner <[email protected]>

2015-06-05 13:59:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Tue, 19 May 2015, Davidlohr Bueso wrote:
>
> +/*
> + * Lockless alternative to rt_mutex_has_waiters() as we do not need the
> + * wait_lock to check if we are in, for instance, a transitional state
> + * after calling mark_rt_mutex_waiters().

Before I get into a state of brain melt, could you please explain that
in an understandable way?

rt_mutex_has_waiters() looks at whether the root pointer of the rbtree
head is empty. You can do a lockless check of that as well,
right? So what's the FAST part of that function and how is that
related to a point after we called mark_rt_mutex_waiters()?

> + */
> +static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
> +{
> + unsigned long val = (unsigned long)lock->owner;
> +
> + if (!val)
> + return false;
> + return val & RT_MUTEX_HAS_WAITERS;
> +}
> +

> +/*
> + * Initial check for entering the mutex spinning loop
> + */
> +static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
> +{
> + struct task_struct *owner;
> + /* default return to spin: if no owner, the lock is free */


Rather than having a comment in the middle of the variable declaration
section, I'd prefer a comment explaining the whole logic of this
function.

> + int ret = true;

> +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> +{
> + bool taken = false;
> +
> + preempt_disable();
> +
> + if (!rt_mutex_can_spin_on_owner(lock))
> + goto done;
> + /*
> + * In order to avoid a stampede of mutex spinners trying to
> + * acquire the mutex all at once, the spinners need to take a
> + * MCS (queued) lock first before spinning on the owner field.
> + */
> + if (!osq_lock(&lock->osq))
> + goto done;

Hmm. The queue lock is serializing potential spinners, right?

So that's going to lead to a potential priority ordering problem
because if a lower prio task wins the race to the osq_lock queue,
then the higher prio waiter will be queued behind and blocked from
taking the lock first.

Thanks,

tglx

2015-06-06 15:28:02

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

On Fri, 2015-06-05 at 14:38 +0200, Thomas Gleixner wrote:
> On Tue, 19 May 2015, Davidlohr Bueso wrote:
>
> > Avoid unnecessary cmpxchg calls, all of our other locks
> > use it as well.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > ---
> > kernel/locking/rtmutex.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 74188d8..1d5cc9d 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
> > * set up.
> > */
> > #ifndef CONFIG_DEBUG_RT_MUTEXES
> > -# define rt_mutex_cmpxchg(l,c,n) (cmpxchg(&l->owner, c, n) == c)
> > +# define rt_mutex_cmpxchg(l,c,n) \
> > + (l->owner == c && cmpxchg(&l->owner, c, n) == c)
>
> Errm. How does that improve stuff like rt_mutex_lock() ?

It just avoids a cmpxchg in the fastpath when locked, at the cost of an
extra test when unlocked. CCAS techniques have been proven to boost some
workloads for both rwsems and mutexes. That's all.

Thanks,
Davidlohr

2015-06-09 04:42:02

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Fri, 2015-06-05 at 15:59 +0200, Thomas Gleixner wrote:
> On Tue, 19 May 2015, Davidlohr Bueso wrote:
> >
> > +/*
> > + * Lockless alternative to rt_mutex_has_waiters() as we do not need the
> > + * wait_lock to check if we are in, for instance, a transitional state
> > + * after calling mark_rt_mutex_waiters().
>
> Before I get into a state of brain melt, could you please explain that
> in an understandable way?

What I meant is that we could check the owner field to see if the
RT_MUTEX_HAS_WAITERS bit was set without taking the wait_lock, even
when there is no owner.

>
> rt_mutex_has_waiters() looks at the root pointer of the rbtree head
> whether that's empty. You can do a lockless check of that as well,
> right? So what's the FAST part of that function and how is that
> related to a point after we called mark_rt_mutex_waiters()?

You're right, we could use rt_mutex_has_waiters(). When I thought of
this originally, I was considering something like:

if (rt_mutex_has_waiters(lock)) {
	if (current->prio >= rt_mutex_top_waiter(lock)->prio)
		...

Which obviously requires the wait_lock, but I did not consider just
using the tree. However, the consequence I see in doing this is that we
would miss scenarios where mark_rt_mutex_waiters() is called (under nil
owner, for example), so we would force tasks to block only when there
are truly waiters.

> > + */
> > +static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
> > +{
> > + unsigned long val = (unsigned long)lock->owner;
> > +
> > + if (!val)
> > + return false;
> > + return val & RT_MUTEX_HAS_WAITERS;
> > +}
> > +
>
> > +/*
> > + * Initial check for entering the mutex spinning loop
> > + */
> > +static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
> > +{
> > + struct task_struct *owner;
> > + /* default return to spin: if no owner, the lock is free */
>
>
> Rather than having a comment in the middle of the variable declaration
> section, I'd prefer a comment explaing the whole logic of this
> function.

Ok.

> > + int ret = true;
>
> > +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> > +{
> > + bool taken = false;
> > +
> > + preempt_disable();
> > +
> > + if (!rt_mutex_can_spin_on_owner(lock))
> > + goto done;
> > + /*
> > + * In order to avoid a stampede of mutex spinners trying to
> > + * acquire the mutex all at once, the spinners need to take a
> > + * MCS (queued) lock first before spinning on the owner field.
> > + */
> > + if (!osq_lock(&lock->osq))
> > + goto done;
>
> Hmm. The queue lock is serializing potential spinners, right?

Yes.

>
> So that's going to lead to a potential priority ordering problem
> because if a lower prio task wins the racing to the ocq_lock queue,
> then the higher prio waiter will be queued behind and blocked from
> taking the lock first.

Hmm yes, osq is a fair lock. However I believe this is mitigated by (a)
the conservative spinning approach, and (b) by osq_lock's need_resched()
check, so at least a spinner will abort if a higher prio task comes in.
But of course, this only deals with spinners, and we cannot account for
a lower prio owner task.

So if this is not acceptable, I guess we'll have to do without the mcs
like properties.

Thanks,
Davidlohr

2015-06-09 09:30:20

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Mon, 8 Jun 2015, Davidlohr Bueso wrote:
> On Fri, 2015-06-05 at 15:59 +0200, Thomas Gleixner wrote:
> > rt_mutex_has_waiters() looks at the root pointer of the rbtree head
> > whether that's empty. You can do a lockless check of that as well,
> > right? So what's the FAST part of that function and how is that
> > related to a point after we called mark_rt_mutex_waiters()?
>
> You're right, we could use rt_mutex_has_waiters(). When I thought of
> this originally, I was considering something like:
>
> if (rt_mutex_has_waiters(lock)) {
> if (current->prio >= rt_mutex_top_waiter(lock)->prio)
> ...
>
> Which obviously requires the wait_lock, but I did not consider just
> using the tree. However, the consequence I see in doing this is that we
> would miss scenarios where mark_rt_mutex_waiters() is called (under nil
> owner, for example), so we would force tasks to block only when there
> are truly waiters.

Fair enough. But we really want a proper comment explaining it.

> > > +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> > > +{
> > > + bool taken = false;
> > > +
> > > + preempt_disable();
> > > +
> > > + if (!rt_mutex_can_spin_on_owner(lock))
> > > + goto done;
> > > + /*
> > > + * In order to avoid a stampede of mutex spinners trying to
> > > + * acquire the mutex all at once, the spinners need to take a
> > > + * MCS (queued) lock first before spinning on the owner field.
> > > + */
> > > + if (!osq_lock(&lock->osq))
> > > + goto done;
> >
> > Hmm. The queue lock is serializing potential spinners, right?
>
> Yes.
>
> >
> > So that's going to lead to a potential priority ordering problem
> > because if a lower prio task wins the racing to the ocq_lock queue,
> > then the higher prio waiter will be queued behind and blocked from
> > taking the lock first.
>
> Hmm yes, ocq is a fair lock. However I believe this is mitigated by (a)
> the conservative spinning approach, and (b) by osq_lock's need_resched()
> check, so at least a spinner will abort if a higher prio task comes in.
> But of course, this only deals with spinners, and we cannot account for
> a lower prio owner task.
>
> So if this is not acceptable, I guess we'll have to do without the mcs
> like properties.

I'd say it amounts to priority inversion.

If you look at the RT code, then you'll notice that in the slow lock
path we queue the incoming waiter (including the PI dance) and then
spin only if the waiter is the top waiter on the lock.

Surely it would be nice to avoid the whole PI dance, but OTOH it can
lead to the following issue (and some others):

CPU0                    CPU1

T0 prio=10              T1 prio=20
lock(RTM);
                        lock(RTM);
                        spin()
->preempt()
T2 prio=15              if (!owner->on_cpu)
                                break;
                        block_on_rtmutex();
                        prio_boost(T0);
->preempt()
T0 prio=20
unlock(RTM);

IIRC, the initial RT attempt was to spin w/o the PI dance but we gave
up on it due to latency and correctness issues.

I wrapped my head around doing the following:

1) Lightweight mark the owner as boosted, w/o actually boosting it to
keep it on the cpu

2) Have a priority check in the spin to drop out when a higher prio
waiter comes in.

But #1 is a nightmare in the scheduler to do, so I gave up on it. If
you have a better idea, you're welcome.

Thanks,

tglx

2015-06-09 11:21:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Tue, Jun 09, 2015 at 11:29:59AM +0200, Thomas Gleixner wrote:
>
> If you look at the RT code, then you'll notice that in the slow lock
> path we queue the incoming waiter (including the PI dance) and then
> spin only if the waiter is the top waiter on the lock.

On a related note; does the RT code still add lock stealing to rtmutex?

We allow higher prio lock stealing, but I seem to remember a longish
debate if we should allow equal prio lock stealing.

2015-06-09 12:54:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)

On Tue, 9 Jun 2015, Peter Zijlstra wrote:

> On Tue, Jun 09, 2015 at 11:29:59AM +0200, Thomas Gleixner wrote:
> >
> > If you look at the RT code, then you'll notice that in the slow lock
> > path we queue the incoming waiter (including the PI dance) and then
> > spin only if the waiter is the top waiter on the lock.
>
> On a related note; does the RT code still add lock stealing to rtmutex?

No.

> We allow higher prio lock stealing, but I seem to remember a longish
> debate if we should allow equal prio lock stealing.

Right. We never finished that one:)

Thanks,

tglx

2015-06-15 18:34:34

by Jason Low

[permalink] [raw]
Subject: Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

Hi David,

On Sat, Jun 6, 2015 at 8:27 AM, Davidlohr Bueso <[email protected]> wrote:
> On Fri, 2015-06-05 at 14:38 +0200, Thomas Gleixner wrote:
>> On Tue, 19 May 2015, Davidlohr Bueso wrote:
>>
>> > Avoid unnecessary cmpxchg calls, all of our other locks
>> > use it as well.
>> >
>> > Signed-off-by: Davidlohr Bueso <[email protected]>
>> > ---
>> > kernel/locking/rtmutex.c | 4 +++-
>> > 1 file changed, 3 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
>> > index 74188d8..1d5cc9d 100644
>> > --- a/kernel/locking/rtmutex.c
>> > +++ b/kernel/locking/rtmutex.c
>> > @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
>> > * set up.
>> > */
>> > #ifndef CONFIG_DEBUG_RT_MUTEXES
>> > -# define rt_mutex_cmpxchg(l,c,n) (cmpxchg(&l->owner, c, n) == c)
>> > +# define rt_mutex_cmpxchg(l,c,n) \
>> > + (l->owner == c && cmpxchg(&l->owner, c, n) == c)
>>
>> Errm. How does that improve stuff like rt_mutex_lock() ?
>
> It just avoids a cmpxchg in the fastpath when locked, at the cost of an
> extra test when unlocked. CCAS techniques have been proven to boost some
> workloads for both rwsems and mutexes. That's all.

The CCAS technique was typically used in the slow paths for those
other locks, where the chance of the operation returning false is
higher.

The rt_mutex_cmpxchg is used in places such as the rt_mutex fastlocks. We
might not want to add extra cost to the fast paths, particularly when
the rt_mutex_cmpxchg calls are marked "likely" in those cases.

2015-06-15 19:37:22

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

On Mon, 2015-06-15 at 11:34 -0700, Jason Low wrote:
> The CCAS technique was typically used in the slow paths for those
> other locks, where the chance of the operation returning false is
> higher.

That is true. Although I really want to use it in patch 4, I guess I
could move the check in there, and thus avoid having it in the fastpath.

Thanks,
Davidlohr

2015-06-16 01:00:32

by Jason Low

[permalink] [raw]
Subject: Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg

On Mon, Jun 15, 2015 at 12:37 PM, Davidlohr Bueso <[email protected]> wrote:
> On Mon, 2015-06-15 at 11:34 -0700, Jason Low wrote:
>> The CCAS technique was typically used in the slow paths for those
>> other locks, where the chance of the operation returning false is
>> higher.
>
> That is true. Although I really want to use it in patch 4, I guess I
> could move the check in there, and thus avoid having it in the fastpath.

I agree; that way we only have the extra check in cases where it is
beneficial, like in the optimistic spin loop.

Jason

Subject: [PATCH] futex: lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[bigeasy: redo ontop of lockless wake-queues]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

Davidlohr, would it work for you to replace that patch #1 from your
series with this one?

kernel/futex.c | 32 ++++++++++++++++++++---
kernel/locking/rtmutex.c | 56 +++++++++++++++++++++++++++++------------
kernel/locking/rtmutex_common.h | 3 +++
3 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ea6ca0bca525..026594f02bd2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,12 +1117,15 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
q->lock_ptr = NULL;
}

-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+ struct futex_hash_bucket *hb)
{
struct task_struct *new_owner;
struct futex_pi_state *pi_state = this->pi_state;
u32 uninitialized_var(curval), newval;
int ret = 0;
+ WAKE_Q(wake_q);
+ bool deboost;

if (!pi_state)
return -EINVAL;
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
raw_spin_unlock_irq(&new_owner->pi_lock);

raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- rt_mutex_unlock(&pi_state->pi_mutex);
+
+ deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+ /*
+ * First unlock HB so the waiter does not spin on it once he got woken
+ * up. Second wake up the waiter before the priority is adjusted. If we
+ * deboost first (and lose our higher priority), then the task might get
+ * scheduled away before the wake up can take place.
+ */
+ spin_unlock(&hb->lock);
+ wake_up_q(&wake_q);
+ if (deboost)
+ rt_mutex_adjust_prio(current);

return 0;
}
@@ -2410,13 +2425,23 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
match = futex_top_waiter(hb, &key);
if (match) {
- ret = wake_futex_pi(uaddr, uval, match);
+ ret = wake_futex_pi(uaddr, uval, match, hb);
+ /*
+ * In case of success wake_futex_pi dropped the hash
+ * bucket lock.
+ */
+ if (!ret)
+ goto out_putkey;
/*
* The atomic access to the futex value generated a
* pagefault, so retry the user-access and the wakeup:
*/
if (ret == -EFAULT)
goto pi_faulted;
+ /*
+ * wake_futex_pi has detected invalid state. Tell user
+ * space.
+ */
goto out_unlock;
}

@@ -2437,6 +2462,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)

out_unlock:
spin_unlock(&hb->lock);
+out_putkey:
put_futex_key(&key);
return ret;

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 36573e96a477..0c8f43baead6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
* of task. We do not use the spin_xx_mutex() variants here as we are
* outside of the debug path.)
*/
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
{
unsigned long flags;

@@ -957,12 +957,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
/*
* Wake up the next waiter on the lock.
*
- * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * Remove the top waiter from the current tasks pi waiter list, put him on the
+ * wake head (for later wake up).
*
* Called with lock->wait_lock held.
*/
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void wakeup_next_waiter(struct rt_mutex *lock, struct wake_q_head *wqh)
{
struct rt_mutex_waiter *waiter;
unsigned long flags;
@@ -996,7 +996,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
* long as we hold lock->wait_lock. The waiter task needs to
* acquire it in order to dequeue the waiter.
*/
- wake_up_process(waiter->task);
+ wake_q_add(wqh, waiter->task);
}

/*
@@ -1250,10 +1250,11 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
}

/*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
*/
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh)
{
raw_spin_lock(&lock->wait_lock);

@@ -1295,7 +1296,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
if (unlock_rt_mutex_safe(lock) == true)
- return;
+ return false;
/* Relock the rtmutex and try again */
raw_spin_lock(&lock->wait_lock);
}
@@ -1304,12 +1305,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
*/
- wakeup_next_waiter(lock);
+ wakeup_next_waiter(lock, wqh);

raw_spin_unlock(&lock->wait_lock);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* check PI boosting */
+ return true;
}

/*
@@ -1360,12 +1361,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,

static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
- void (*slowfn)(struct rt_mutex *lock))
+ bool (*slowfn)(struct rt_mutex *lock,
+ struct wake_q_head *wqh))
{
- if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+ WAKE_Q(wake_q);
+
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);
- else
- slowfn(lock);
+
+ } else if (slowfn(lock, &wake_q)) {
+ /* Undo pi boosting if necessary: */
+ rt_mutex_adjust_prio(current);
+ }
}

/**
@@ -1467,6 +1474,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
EXPORT_SYMBOL_GPL(rt_mutex_unlock);

/**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh)
+{
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+ rt_mutex_deadlock_account_unlock(current);
+ return false;
+ }
+ return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
* rt_mutex_destroy - mark a mutex unusable
* @lock: the mutex to be destroyed
*
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..7844f8f0e639 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *to,
struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);

#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
--
2.1.4
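
The contention the changelog describes comes from the woken waiter's
first action being to take the very lock the waker still holds. This is
the same defer-the-wakeup idea as in patch 1, applied to the hash bucket
lock; a small user-space sketch, with invented names and without the
deboost step, shows the ordering:

/*
 * Model of the ordering this patch changes: the woken waiter's first
 * action is to take the hash-bucket lock, so waking it while still
 * holding that lock just makes it block again.  Illustrative only.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t hb_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t futex_waiter;

static void *waiter(void *arg)
{
	(void)arg;
	sem_wait(&futex_waiter);	/* "woken up" by the unlocker */

	/* First thing the woken task does: it needs hb_lock itself. */
	pthread_mutex_lock(&hb_lock);
	puts("waiter: got hb_lock without spinning on the unlocker");
	pthread_mutex_unlock(&hb_lock);
	return NULL;
}

static void unlock_pi_futex(void)
{
	pthread_mutex_lock(&hb_lock);
	/* ... pick the top waiter, fix up the futex word ... */

	/*
	 * Patch ordering: release hb_lock first, then deliver the wakeup.
	 * Waking first (the old order) would let the waiter run straight
	 * into the still-held hb_lock on an SMP machine.
	 */
	pthread_mutex_unlock(&hb_lock);
	sem_post(&futex_waiter);
}

int main(void)
{
	pthread_t t;

	sem_init(&futex_waiter, 0, 0);
	pthread_create(&t, NULL, waiter, NULL);
	unlock_pi_futex();
	pthread_join(t, NULL);
	sem_destroy(&futex_waiter);
	return 0;
}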

2015-06-16 19:50:42

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH] futex: lower the lock contention on the HB lock during wake up

On Tue, 2015-06-16 at 21:29 +0200, Sebastian Andrzej Siewior wrote:

> Davidlohr, would it work for you to replace that patch #1 from your
> series with this one?

I prefer having two separate patches, thus keeping their own changelog
for the change justification.

Thanks,
Davidlohr

Subject: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[bigeasy: redo ontop of lockless wake-queues]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
* Davidlohr Bueso | 2015-06-16 12:50:26 [-0700]:

>I prefer having two separate patches, thus keeping their own changelog
>for the change justification.

okay, here it is on top of #1.

kernel/futex.c | 32 +++++++++++++++++++++++---
kernel/locking/rtmutex.c | 51 +++++++++++++++++++++++++++++------------
kernel/locking/rtmutex_common.h | 3 +++
3 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ea6ca0bca525..026594f02bd2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,12 +1117,15 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
q->lock_ptr = NULL;
}

-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+ struct futex_hash_bucket *hb)
{
struct task_struct *new_owner;
struct futex_pi_state *pi_state = this->pi_state;
u32 uninitialized_var(curval), newval;
int ret = 0;
+ WAKE_Q(wake_q);
+ bool deboost;

if (!pi_state)
return -EINVAL;
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
raw_spin_unlock_irq(&new_owner->pi_lock);

raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- rt_mutex_unlock(&pi_state->pi_mutex);
+
+ deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+ /*
+ * First unlock HB so the waiter does not spin on it once he got woken
+ * up. Second wake up the waiter before the priority is adjusted. If we
+ * deboost first (and lose our higher priority), then the task might get
+ * scheduled away before the wake up can take place.
+ */
+ spin_unlock(&hb->lock);
+ wake_up_q(&wake_q);
+ if (deboost)
+ rt_mutex_adjust_prio(current);

return 0;
}
@@ -2410,13 +2425,23 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
*/
match = futex_top_waiter(hb, &key);
if (match) {
- ret = wake_futex_pi(uaddr, uval, match);
+ ret = wake_futex_pi(uaddr, uval, match, hb);
+ /*
+ * In case of success wake_futex_pi dropped the hash
+ * bucket lock.
+ */
+ if (!ret)
+ goto out_putkey;
/*
* The atomic access to the futex value generated a
* pagefault, so retry the user-access and the wakeup:
*/
if (ret == -EFAULT)
goto pi_faulted;
+ /*
+ * wake_futex_pi has detected invalid state. Tell user
+ * space.
+ */
goto out_unlock;
}

@@ -2437,6 +2462,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)

out_unlock:
spin_unlock(&hb->lock);
+out_putkey:
put_futex_key(&key);
return ret;

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 74188d83b065..53fab686a1c2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
* of task. We do not use the spin_xx_mutex() variants here as we are
* outside of the debug path.)
*/
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
{
unsigned long flags;

@@ -1244,13 +1244,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
}

/*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
*/
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+ struct wake_q_head *wake_q)
{
- WAKE_Q(wake_q);
-
raw_spin_lock(&lock->wait_lock);

debug_rt_mutex_unlock(lock);
@@ -1291,7 +1290,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
if (unlock_rt_mutex_safe(lock) == true)
- return;
+ return false;
/* Relock the rtmutex and try again */
raw_spin_lock(&lock->wait_lock);
}
@@ -1302,13 +1301,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
*
* Queue the next waiter for wakeup once we release the wait_lock.
*/
- mark_wakeup_next_waiter(&wake_q, lock);
+ mark_wakeup_next_waiter(wake_q, lock);

raw_spin_unlock(&lock->wait_lock);
- wake_up_q(&wake_q);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* check PI boosting */
+ return true;
}

/*
@@ -1359,12 +1357,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,

static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
- void (*slowfn)(struct rt_mutex *lock))
+ bool (*slowfn)(struct rt_mutex *lock,
+ struct wake_q_head *wqh))
{
- if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+ WAKE_Q(wake_q);
+
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);
- else
- slowfn(lock);
+
+ } else if (slowfn(lock, &wake_q)) {
+ /* Undo pi boosting if necessary: */
+ rt_mutex_adjust_prio(current);
+ }
}

/**
@@ -1466,6 +1470,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
EXPORT_SYMBOL_GPL(rt_mutex_unlock);

/**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh)
+{
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+ rt_mutex_deadlock_account_unlock(current);
+ return false;
+ }
+ return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
* rt_mutex_destroy - mark a mutex unusable
* @lock: the mutex to be destroyed
*
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..7844f8f0e639 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *to,
struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);

#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
--
2.1.4

2015-06-17 14:17:15

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> wake_futex_pi() wakes the task before releasing the hash bucket lock
> (HB). The first thing the woken up task usually does is to acquire the
> lock which requires the HB lock. On SMP Systems this leads to blocking
> on the HB lock which is released by the owner shortly after.
> This patch rearranges the unlock path by first releasing the HB lock and
> then waking up the task.
>
> [bigeasy: redo ontop of lockless wake-queues]
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
DL980. I ran a few iterations of futextests and stockfish, then mixed
two loops of futextest at different rt prios, with stockfish also rt,
and ltplight tossed in as... crack filler. The box is still doing that,
is way too busy, but not griping about it.

-Mike

Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
>> wake_futex_pi() wakes the task before releasing the hash bucket lock
>> (HB). The first thing the woken up task usually does is to acquire the
>> lock which requires the HB lock. On SMP Systems this leads to blocking
>> on the HB lock which is released by the owner shortly after.
>> This patch rearranges the unlock path by first releasing the HB lock and
>> then waking up the task.
>>
>> [bigeasy: redo ontop of lockless wake-queues]
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
>
> 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> DL980. I ran a few iterations of futextests and stockfish, then mixed
> two loops of futextest at different rt prios, with stockfish also rt,
> and ltplight as tossed in as... crack filler. Box is still doing that,
> is way too busy, but not griping about it.

There are two patches mostly doing the same thing. The patch posted
here is a redo on top of "lockless wake-queues". It does hb-unlock,
wakeup, de-boost. The patch merged into -RT is the original approach,
not using "lockless wake-queues", and performs wakeup, hb-unlock,
de-boost.

I plan to get the final solution into -RT once it hits upstream.

>
> -Mike
>

Sebastian

2015-06-17 14:31:56

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On Wed, 2015-06-17 at 16:28 +0200, Sebastian Andrzej Siewior wrote:
> On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> >> wake_futex_pi() wakes the task before releasing the hash bucket lock
> >> (HB). The first thing the woken up task usually does is to acquire the
> >> lock which requires the HB lock. On SMP Systems this leads to blocking
> >> on the HB lock which is released by the owner shortly after.
> >> This patch rearranges the unlock path by first releasing the HB lock and
> >> then waking up the task.
> >>
> >> [bigeasy: redo ontop of lockless wake-queues]
> >> Signed-off-by: Thomas Gleixner <[email protected]>
> >> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> >
> > 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> > DL980. I ran a few iterations of futextests and stockfish, then mixed
> > two loops of futextest at different rt prios, with stockfish also rt,
> > and ltplight as tossed in as... crack filler. Box is still doing that,
> > is way too busy, but not griping about it.
>
> There are two patches mostly doing the same thing. The patch posted
> here is a redo ontop of "lockless wake-queues". It does hb-unlock,
> wakeup, de-boost. The patch merged into -RT is the original approach
> not using "lockless wake-queues" and performing wakeup, hb-unlock,
> de-boost.
>
> I plan to get into -RT the final solution once it hits upstream.

OK, a glance wasn't enough. Guess I can let tired ole box rest.

-Mike

Subject: [tip:sched/core] locking/rtmutex: Implement lockless top-waiter wakeup

Commit-ID: 45ab4effc3bee6f8a5cb05652b7bb895ec5b6a7a
Gitweb: http://git.kernel.org/tip/45ab4effc3bee6f8a5cb05652b7bb895ec5b6a7a
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Tue, 19 May 2015 10:24:55 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Jun 2015 22:27:46 +0200

locking/rtmutex: Implement lockless top-waiter wakeup

Mark the task for later wakeup after the wait_lock has been released.
This way, once the next task is awoken, it will have a better chance
of finding the wait_lock free when continuing its execution in
__rt_mutex_slowlock() when trying to acquire the rtmutex, calling
try_to_take_rt_mutex(). In contended scenarios, other tasks attempting
to take the lock may acquire it first, right after the wait_lock is
released, but (a) this can also occur with the current code, as it
relies on the spinlock fairness, and (b) we are dealing with the
top-waiter anyway, so it will always take the lock next.

Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/locking/rtmutex.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b025295..44ee8f8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -955,14 +955,13 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
}

/*
- * Wake up the next waiter on the lock.
- *
* Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * queue it up.
*
* Called with lock->wait_lock held.
*/
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
+ struct rt_mutex *lock)
{
struct rt_mutex_waiter *waiter;
unsigned long flags;
@@ -991,12 +990,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)

raw_spin_unlock_irqrestore(&current->pi_lock, flags);

- /*
- * It's safe to dereference waiter as it cannot go away as
- * long as we hold lock->wait_lock. The waiter task needs to
- * acquire it in order to dequeue the waiter.
- */
- wake_up_process(waiter->task);
+ wake_q_add(wake_q, waiter->task);
}

/*
@@ -1258,6 +1252,8 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
static void __sched
rt_mutex_slowunlock(struct rt_mutex *lock)
{
+ WAKE_Q(wake_q);
+
raw_spin_lock(&lock->wait_lock);

debug_rt_mutex_unlock(lock);
@@ -1306,10 +1302,13 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
/*
* The wakeup next waiter path does not suffer from the above
* race. See the comments there.
+ *
+ * Queue the next waiter for wakeup once we release the wait_lock.
*/
- wakeup_next_waiter(lock);
+ mark_wakeup_next_waiter(&wake_q, lock);

raw_spin_unlock(&lock->wait_lock);
+ wake_up_q(&wake_q);

/* Undo pi boosting if necessary: */
rt_mutex_adjust_prio(current);

Subject: [tip:sched/core] futex: Lower the lock contention on the HB lock during wake up

Commit-ID: 881bd58d6e9eba4240b9dbc49fdc03a3374d7508
Gitweb: http://git.kernel.org/tip/881bd58d6e9eba4240b9dbc49fdc03a3374d7508
Author: Sebastian Andrzej Siewior <[email protected]>
AuthorDate: Wed, 17 Jun 2015 10:33:50 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Jun 2015 22:27:46 +0200

futex: Lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

Originally-from: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/futex.c | 32 +++++++++++++++++++++++---
kernel/locking/rtmutex.c | 51 +++++++++++++++++++++++++++++------------
kernel/locking/rtmutex_common.h | 3 +++
3 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f9984c3..a0cf6fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,11 +1117,14 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
q->lock_ptr = NULL;
}

-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+ struct futex_hash_bucket *hb)
{
struct task_struct *new_owner;
struct futex_pi_state *pi_state = this->pi_state;
u32 uninitialized_var(curval), newval;
+ WAKE_Q(wake_q);
+ bool deboost;
int ret = 0;

if (!pi_state)
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
raw_spin_unlock_irq(&new_owner->pi_lock);

raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- rt_mutex_unlock(&pi_state->pi_mutex);
+
+ deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+ /*
+ * First unlock HB so the waiter does not spin on it once he got woken
+ * up. Second wake up the waiter before the priority is adjusted. If we
+ * deboost first (and lose our higher priority), then the task might get
+ * scheduled away before the wake up can take place.
+ */
+ spin_unlock(&hb->lock);
+ wake_up_q(&wake_q);
+ if (deboost)
+ rt_mutex_adjust_prio(current);

return 0;
}
@@ -2413,13 +2428,23 @@ retry:
*/
match = futex_top_waiter(hb, &key);
if (match) {
- ret = wake_futex_pi(uaddr, uval, match);
+ ret = wake_futex_pi(uaddr, uval, match, hb);
+ /*
+ * In case of success wake_futex_pi dropped the hash
+ * bucket lock.
+ */
+ if (!ret)
+ goto out_putkey;
/*
* The atomic access to the futex value generated a
* pagefault, so retry the user-access and the wakeup:
*/
if (ret == -EFAULT)
goto pi_faulted;
+ /*
+ * wake_futex_pi has detected invalid state. Tell user
+ * space.
+ */
goto out_unlock;
}

@@ -2440,6 +2465,7 @@ retry:

out_unlock:
spin_unlock(&hb->lock);
+out_putkey:
put_futex_key(&key);
return ret;

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 44ee8f8..1130130 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
* of task. We do not use the spin_xx_mutex() variants here as we are
* outside of the debug path.)
*/
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
{
unsigned long flags;

@@ -1247,13 +1247,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
}

/*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
*/
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+ struct wake_q_head *wake_q)
{
- WAKE_Q(wake_q);
-
raw_spin_lock(&lock->wait_lock);

debug_rt_mutex_unlock(lock);
@@ -1294,7 +1293,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
if (unlock_rt_mutex_safe(lock) == true)
- return;
+ return false;
/* Relock the rtmutex and try again */
raw_spin_lock(&lock->wait_lock);
}
@@ -1305,13 +1304,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
*
* Queue the next waiter for wakeup once we release the wait_lock.
*/
- mark_wakeup_next_waiter(&wake_q, lock);
+ mark_wakeup_next_waiter(wake_q, lock);

raw_spin_unlock(&lock->wait_lock);
- wake_up_q(&wake_q);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* check PI boosting */
+ return true;
}

/*
@@ -1362,12 +1360,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,

static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
- void (*slowfn)(struct rt_mutex *lock))
+ bool (*slowfn)(struct rt_mutex *lock,
+ struct wake_q_head *wqh))
{
- if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+ WAKE_Q(wake_q);
+
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);
- else
- slowfn(lock);
+
+ } else if (slowfn(lock, &wake_q)) {
+ /* Undo pi boosting if necessary: */
+ rt_mutex_adjust_prio(current);
+ }
}

/**
@@ -1462,6 +1466,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
EXPORT_SYMBOL_GPL(rt_mutex_unlock);

/**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh)
+{
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+ rt_mutex_deadlock_account_unlock(current);
+ return false;
+ }
+ return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
* rt_mutex_destroy - mark a mutex unusable
* @lock: the mutex to be destroyed
*
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 8552125..7844f8f 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *to,
struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);

#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"

Subject: [tip:sched/core] locking/rtmutex: Update stale plist comments

Commit-ID: f5de9f848cdfba072a0466c24891167c0c8b3be3
Gitweb: http://git.kernel.org/tip/f5de9f848cdfba072a0466c24891167c0c8b3be3
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Tue, 19 May 2015 10:24:57 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Jun 2015 22:53:36 +0200

locking/rtmutex: Update stale plist comments

... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/locking/rtmutex.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1130130..5537ebc 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -624,7 +624,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*/
prerequeue_top_waiter = rt_mutex_top_waiter(lock);

- /* [7] Requeue the waiter in the lock waiter list. */
+ /* [7] Requeue the waiter in the lock waiter tree. */
rt_mutex_dequeue(lock, waiter);
waiter->prio = task->prio;
rt_mutex_enqueue(lock, waiter);
@@ -662,7 +662,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter became the new top (highest priority)
* waiter on the lock. Replace the previous top waiter
- * in the owner tasks pi waiters list with this waiter
+ * in the owner tasks pi waiters tree with this waiter
* and adjust the priority of the owner.
*/
rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -673,7 +673,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter was the top waiter on the lock, but is
* no longer the top prority waiter. Replace waiter in
- * the owner tasks pi waiters list with the new top
+ * the owner tasks pi waiters tree with the new top
* (highest priority) waiter and adjust the priority
* of the owner.
* The new top waiter is stored in @waiter so that
@@ -747,7 +747,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*
* @lock: The lock to be acquired.
* @task: The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
* callsite called task_blocked_on_lock(), otherwise NULL
*/
static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -782,7 +782,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* If @waiter != NULL, @task has already enqueued the waiter
- * into @lock waiter list. If @waiter == NULL then this is a
+ * into @lock waiter tree. If @waiter == NULL then this is a
* trylock attempt.
*/
if (waiter) {
@@ -795,7 +795,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* We can acquire the lock. Remove the waiter from the
- * lock waiters list.
+ * lock waiters tree.
*/
rt_mutex_dequeue(lock, waiter);

@@ -827,7 +827,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
* No waiters. Take the lock without the
* pi_lock dance.@task->pi_blocked_on is NULL
* and we have no waiters to enqueue in @task
- * pi waiters list.
+ * pi waiters tree.
*/
goto takeit;
}
@@ -844,7 +844,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
/*
* Finish the lock acquisition. @task is the new owner. If
* other waiters exist we have to insert the highest priority
- * waiter into @task->pi_waiters list.
+ * waiter into @task->pi_waiters tree.
*/
if (rt_mutex_has_waiters(lock))
rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -955,7 +955,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
}

/*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
* queue it up.
*
* Called with lock->wait_lock held.

2015-06-19 17:51:48

by Kevin Hilman

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
<[email protected]> wrote:
> wake_futex_pi() wakes the task before releasing the hash bucket lock
> (HB). The first thing the woken up task usually does is to acquire the
> lock which requires the HB lock. On SMP Systems this leads to blocking
> on the HB lock which is released by the owner shortly after.
> This patch rearranges the unlock path by first releasing the HB lock and
> then waking up the task.
>
> [bigeasy: redo ontop of lockless wake-queues]
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> * Davidlohr Bueso | 2015-06-16 12:50:26 [-0700]:
>
>>I prefer having two separate patches, thus keeping their own changelog
>>for the change justification.
>
> okay, here it is on top of #1.

A handful of boot test failures on ARM/OMAP were found by kernelci.org
in next-20150619[1] and were bisected down to this patch, which hit
next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
lock contention on the HB lock during wake up). I confirmed that
reverting that patch on top of next-20150619 gets things booting again
for the affected platforms.

I haven't debugged this any further, but full boot logs are available
for the boot failures[2][3] and the linux-omap list and maintainer are
Cc'd here to help investigate further if needed.

Kevin

[1] http://kernelci.org/boot/all/job/next/kernel/next-20150619/
[2] http://storage.kernelci.org/next/next-20150619/arm-multi_v7_defconfig/lab-khilman/boot-omap5-uevm.html
[3] http://storage.kernelci.org/next/next-20150619/arm-omap2plus_defconfig/lab-tbaker/boot-omap3-beagle-xm.html

2015-06-19 18:55:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On Fri, 19 Jun 2015, Kevin Hilman wrote:
> On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
> A handful of boot test failures on ARM/OMAP were found by kernelci.org
> in next-20150619[1] and were bisected down to this patch, which hit
> next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
> lock contention on the HB lock during wake up). I confirmed that
> reverting that patch on top of next-20150619 gets things booting again
> for the affected platforms.
>
> I haven't debugged this any further, but full boot logs are available
> for the boot failures[2][3] and the linux-omap list and maintainer are
> Cc'd here to help investigate further if needed.

Found it. Dunno how I missed that one. Fix below.

Thanks,

tglx
---

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 10dbeb6fe96f..5674b073473c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1365,9 +1365,14 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);

- } else if (slowfn(lock, &wake_q)) {
+ } else {
+ bool deboost = slowfn(lock, &wake_q);
+
+ wake_up_q(&wake_q);
+
/* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ if (deboost)
+ rt_mutex_adjust_prio(current);
}
}


2015-06-19 19:33:09

by Kevin Hilman

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

Thomas Gleixner <[email protected]> writes:

> On Fri, 19 Jun 2015, Kevin Hilman wrote:
>> On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
>> A handful of boot test failures on ARM/OMAP were found by kernelci.org
>> in next-20150619[1] and were bisected down to this patch, which hit
>> next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
>> lock contention on the HB lock during wake up). I confirmed that
>> reverting that patch on top of next-20150619 gets things booting again
>> for the affected platforms.
>>
>> I haven't debugged this any further, but full boot logs are available
>> for the boot failures[2][3] and the linux-omap list and maintainer are
>> Cc'd here to help investigate further if needed.
>
> Found it. Dunno, how I missed that one. Fix below.
>

Yup, that fix on top of next-20150619 gets the two OMAP platforms
booting again.

Tested-by: Kevin Hilman <[email protected]>

Kevin

Subject: [tip:sched/locking] futex: Lower the lock contention on the HB lock during wake up

Commit-ID: 802ab58da74bb49ab348d2872190ef26ddc1a3e0
Gitweb: http://git.kernel.org/tip/802ab58da74bb49ab348d2872190ef26ddc1a3e0
Author: Sebastian Andrzej Siewior <[email protected]>
AuthorDate: Wed, 17 Jun 2015 10:33:50 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 19 Jun 2015 21:26:38 +0200

futex: Lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[ tglx: Fixed up the rtmutex unlock path ]

Originally-from: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/futex.c | 32 ++++++++++++++++++++---
kernel/locking/rtmutex.c | 56 ++++++++++++++++++++++++++++++-----------
kernel/locking/rtmutex_common.h | 3 +++
3 files changed, 73 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f9984c3..a0cf6fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,11 +1117,14 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
q->lock_ptr = NULL;
}

-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+ struct futex_hash_bucket *hb)
{
struct task_struct *new_owner;
struct futex_pi_state *pi_state = this->pi_state;
u32 uninitialized_var(curval), newval;
+ WAKE_Q(wake_q);
+ bool deboost;
int ret = 0;

if (!pi_state)
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
raw_spin_unlock_irq(&new_owner->pi_lock);

raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
- rt_mutex_unlock(&pi_state->pi_mutex);
+
+ deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+ /*
+ * First unlock HB so the waiter does not spin on it once he got woken
+ * up. Second wake up the waiter before the priority is adjusted. If we
+ * deboost first (and lose our higher priority), then the task might get
+ * scheduled away before the wake up can take place.
+ */
+ spin_unlock(&hb->lock);
+ wake_up_q(&wake_q);
+ if (deboost)
+ rt_mutex_adjust_prio(current);

return 0;
}
@@ -2413,13 +2428,23 @@ retry:
*/
match = futex_top_waiter(hb, &key);
if (match) {
- ret = wake_futex_pi(uaddr, uval, match);
+ ret = wake_futex_pi(uaddr, uval, match, hb);
+ /*
+ * In case of success wake_futex_pi dropped the hash
+ * bucket lock.
+ */
+ if (!ret)
+ goto out_putkey;
/*
* The atomic access to the futex value generated a
* pagefault, so retry the user-access and the wakeup:
*/
if (ret == -EFAULT)
goto pi_faulted;
+ /*
+ * wake_futex_pi has detected invalid state. Tell user
+ * space.
+ */
goto out_unlock;
}

@@ -2440,6 +2465,7 @@ retry:

out_unlock:
spin_unlock(&hb->lock);
+out_putkey:
put_futex_key(&key);
return ret;

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 44ee8f8..0add724 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
* of task. We do not use the spin_xx_mutex() variants here as we are
* outside of the debug path.)
*/
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
{
unsigned long flags;

@@ -1247,13 +1247,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
}

/*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
*/
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+ struct wake_q_head *wake_q)
{
- WAKE_Q(wake_q);
-
raw_spin_lock(&lock->wait_lock);

debug_rt_mutex_unlock(lock);
@@ -1294,7 +1293,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
while (!rt_mutex_has_waiters(lock)) {
/* Drops lock->wait_lock ! */
if (unlock_rt_mutex_safe(lock) == true)
- return;
+ return false;
/* Relock the rtmutex and try again */
raw_spin_lock(&lock->wait_lock);
}
@@ -1305,13 +1304,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
*
* Queue the next waiter for wakeup once we release the wait_lock.
*/
- mark_wakeup_next_waiter(&wake_q, lock);
+ mark_wakeup_next_waiter(wake_q, lock);

raw_spin_unlock(&lock->wait_lock);
- wake_up_q(&wake_q);

- /* Undo pi boosting if necessary: */
- rt_mutex_adjust_prio(current);
+ /* check PI boosting */
+ return true;
}

/*
@@ -1362,12 +1360,23 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,

static inline void
rt_mutex_fastunlock(struct rt_mutex *lock,
- void (*slowfn)(struct rt_mutex *lock))
+ bool (*slowfn)(struct rt_mutex *lock,
+ struct wake_q_head *wqh))
{
- if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+ WAKE_Q(wake_q);
+
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
rt_mutex_deadlock_account_unlock(current);
- else
- slowfn(lock);
+
+ } else {
+ bool deboost = slowfn(lock, &wake_q);
+
+ wake_up_q(&wake_q);
+
+ /* Undo pi boosting if necessary: */
+ if (deboost)
+ rt_mutex_adjust_prio(current);
+ }
}

/**
@@ -1462,6 +1471,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
EXPORT_SYMBOL_GPL(rt_mutex_unlock);

/**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh)
+{
+ if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+ rt_mutex_deadlock_account_unlock(current);
+ return false;
+ }
+ return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
* rt_mutex_destroy - mark a mutex unusable
* @lock: the mutex to be destroyed
*
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 8552125..7844f8f 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
struct hrtimer_sleeper *to,
struct rt_mutex_waiter *waiter);
extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+ struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);

#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"

Subject: [tip:sched/locking] locking/rtmutex: Update stale plist comments

Commit-ID: 9f40a51a35a0e1445cc4873251c3df2631eda294
Gitweb: http://git.kernel.org/tip/9f40a51a35a0e1445cc4873251c3df2631eda294
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Tue, 19 May 2015 10:24:57 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 19 Jun 2015 21:27:21 +0200

locking/rtmutex: Update stale plist comments

... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/locking/rtmutex.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 0add724..86d4853 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -624,7 +624,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*/
prerequeue_top_waiter = rt_mutex_top_waiter(lock);

- /* [7] Requeue the waiter in the lock waiter list. */
+ /* [7] Requeue the waiter in the lock waiter tree. */
rt_mutex_dequeue(lock, waiter);
waiter->prio = task->prio;
rt_mutex_enqueue(lock, waiter);
@@ -662,7 +662,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter became the new top (highest priority)
* waiter on the lock. Replace the previous top waiter
- * in the owner tasks pi waiters list with this waiter
+ * in the owner tasks pi waiters tree with this waiter
* and adjust the priority of the owner.
*/
rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -673,7 +673,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
/*
* The waiter was the top waiter on the lock, but is
* no longer the top prority waiter. Replace waiter in
- * the owner tasks pi waiters list with the new top
+ * the owner tasks pi waiters tree with the new top
* (highest priority) waiter and adjust the priority
* of the owner.
* The new top waiter is stored in @waiter so that
@@ -747,7 +747,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
*
* @lock: The lock to be acquired.
* @task: The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
* callsite called task_blocked_on_lock(), otherwise NULL
*/
static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -782,7 +782,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* If @waiter != NULL, @task has already enqueued the waiter
- * into @lock waiter list. If @waiter == NULL then this is a
+ * into @lock waiter tree. If @waiter == NULL then this is a
* trylock attempt.
*/
if (waiter) {
@@ -795,7 +795,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,

/*
* We can acquire the lock. Remove the waiter from the
- * lock waiters list.
+ * lock waiters tree.
*/
rt_mutex_dequeue(lock, waiter);

@@ -827,7 +827,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
* No waiters. Take the lock without the
* pi_lock dance.@task->pi_blocked_on is NULL
* and we have no waiters to enqueue in @task
- * pi waiters list.
+ * pi waiters tree.
*/
goto takeit;
}
@@ -844,7 +844,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
/*
* Finish the lock acquisition. @task is the new owner. If
* other waiters exist we have to insert the highest priority
- * waiter into @task->pi_waiters list.
+ * waiter into @task->pi_waiters tree.
*/
if (rt_mutex_has_waiters(lock))
rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -955,7 +955,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
}

/*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
* queue it up.
*
* Called with lock->wait_lock held.

2015-06-21 04:35:55

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up

On Wed, 2015-06-17 at 16:28 +0200, Sebastian Andrzej Siewior wrote:
> On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> >> wake_futex_pi() wakes the task before releasing the hash bucket lock
> >> (HB). The first thing the woken up task usually does is to acquire the
> >> lock which requires the HB lock. On SMP Systems this leads to blocking
> >> on the HB lock which is released by the owner shortly after.
> >> This patch rearranges the unlock path by first releasing the HB lock and
> >> then waking up the task.
> >>
> >> [bigeasy: redo ontop of lockless wake-queues]
> >> Signed-off-by: Thomas Gleixner <[email protected]>
> >> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> >
> > 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> > DL980. I ran a few iterations of futextests and stockfish, then mixed
> > two loops of futextest at different rt prios, with stockfish also rt,
> > and ltplight as tossed in as... crack filler. Box is still doing that,
> > is way too busy, but not griping about it.
>
> There are two patches mostly doing the same thing. The patch posted
> here is a redo ontop of "lockless wake-queues". It does hb-unlock,
> wakeup, de-boost. The patch merged into -RT is the original approach
> not using "lockless wake-queues" and performing wakeup, hb-unlock,
> de-boost.
>
> I plan to get into -RT the final solution once it hits upstream.

I plugged patch1 and tip version into rt and beat it, seems solid.

Converting the rest of rtmutex.c to use wake queues with ->save_state to
select the wake function went less well. The kernel does a good
impersonation of a working kernel until I beat it up; then it loses
wakeups. Hohum, so much for yet another early-morning tinker session.

-Mike