2016-04-04 06:33:12

by Davidlohr Bueso

Subject: [PATCH -tip v2 0/4] rtmutex: Another crack at spin on owner

Changes from v1:
- Dropped patch 4 that removed the barrier in the deadlock path.
- Added patch 2 (trivial) to use poison.h defines.
- Sprinkled READ/WRITE_ONCE around lock->owner, even when it is
updated inside the critical region (peterz); see the sketch below.
- More testing time.
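
For reference, the READ/WRITE_ONCE annotations follow the usual pattern
for a field that is read locklessly but updated under the lock's
wait_lock; a minimal sketch of the pattern (condensed from the hunks in
patch 4):

	/* update side, serialized by lock->wait_lock */
	WRITE_ONCE(lock->owner, (struct task_struct *)val);

	/* lockless reader, e.g. the spinning top waiter */
	struct task_struct *owner = READ_ONCE(lock->owner);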

This is a followup to a proposal from some time ago to add spin-on-owner
to rtmutexes. My first attempt was rather permissive in that I tried to
avoid the pi dance and let the lock be stolen. However, due to -rt
constraints, this series only deals with the top waiter, based on what
we do in the preempt-rt patchset.

The first two patches are trivial, and the whole patchset has survived
a week of simultaneous locktorture+pi_stress pounding without anything
breaking. That said, I'm sure it needs more testing and eyeballs; these
paths make my head hurt.

Thanks!

Davidlohr Bueso (4):
rtmutex: Delete save_state member of struct rt_mutex
rtmutex: Use waiter debug init,free magic numbers
rtmutex: Add rt_mutex_init_waiter helper
rtmutex: Reduce top-waiter blocking on a lock

include/linux/poison.h | 4 +-
include/linux/rtmutex.h | 3 +-
kernel/Kconfig.locks | 4 ++
kernel/futex.c | 5 +--
kernel/locking/mutex-debug.c | 4 +-
kernel/locking/rtmutex-debug.c | 4 +-
kernel/locking/rtmutex.c | 88 ++++++++++++++++++++++++++++++++++-------
kernel/locking/rtmutex_common.h | 17 +++++++-
8 files changed, 102 insertions(+), 27 deletions(-)

--
2.7.4


2016-04-04 06:33:16

by Davidlohr Bueso

Subject: [PATCH 4/4] rtmutex: Reduce top-waiter blocking on a lock

By applying well-known spin-on-lock-owner techniques, we can avoid the
blocking overhead when a task is trying to take the rtmutex. The idea
is that as long as the owner is running, there is a fair chance it will
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
This is similar to what we use for other locks, borrowed from -rt. The
main difference (due to the obvious realtime constraints) is that
top-waiter spinning must account for any new higher-priority waiter,
and therefore cannot steal the lock and avoid the pi-dance. As such,
there will be at most one spinning waiter on a contended lock.

Conditions to stop spinning and block are simple:

(1) Upon need_resched()
(2) Current lock owner blocks
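
In the code (see rt_mutex_spin_on_owner() in the diff below), both
conditions collapse into a single check inside the spin loop:

	if (!owner->on_cpu || need_resched()) {
		ret = false;	/* stop spinning; fall back to blocking */
		break;
	}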

The unlock side remains unchanged, as wake_up_process() can safely deal
with calls where the task is not actually blocked (TASK_NORMAL). As
such, the only cost is some unnecessary overhead dealing with the
wake_q, but this ensures we do not miss any wakeups between the
spinning step and the unlocking side.
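
That is, the unlock path always queues the top waiter for wakeup,
whether it blocked or is still spinning; roughly (a condensed sketch of
the existing slow unlock path -- wake_up_q() resolving to
wake_up_process() on an already-running task is simply a no-op):

	mark_wakeup_next_waiter(&wake_q, lock);
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
	wake_up_q(&wake_q);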

Measuring the number of inversions performed by the pi_stress program,
there is a fairly constant improvement in throughput during a 30-second
window. On a 32-core box, with increasing thread-group counts:

pistress
4.4.3 4.4.3
vanilla rtmutex-spinv1
Hmean 1 2321586.73 ( 0.00%) 2339847.23 ( 0.79%)
Hmean 4 8209026.49 ( 0.00%) 8597809.55 ( 4.74%)
Hmean 7 12655322.45 ( 0.00%) 13194896.45 ( 4.26%)
Hmean 12 4210477.03 ( 0.00%) 4348643.08 ( 3.28%)
Hmean 21 2996823.05 ( 0.00%) 3104513.47 ( 3.59%)
Hmean 30 2463107.53 ( 0.00%) 2584275.71 ( 4.91%)
Hmean 48 2656668.46 ( 0.00%) 2719324.53 ( 2.36%)
Hmean 64 2397253.65 ( 0.00%) 2471628.92 ( 3.10%)
Stddev 1 653473.88 ( 0.00%) 527076.59 (-19.34%)
Stddev 4 664995.50 ( 0.00%) 359487.15 (-45.94%)
Stddev 7 248476.88 ( 0.00%) 278307.31 ( 12.01%)
Stddev 12 74537.42 ( 0.00%) 54305.86 (-27.14%)
Stddev 21 72143.80 ( 0.00%) 40371.42 (-44.04%)
Stddev 30 31981.43 ( 0.00%) 42306.07 ( 32.28%)
Stddev 48 21317.95 ( 0.00%) 42608.50 ( 99.87%)
Stddev 64 23433.99 ( 0.00%) 21502.56 ( -8.24%)

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/rtmutex.h | 2 +-
kernel/Kconfig.locks | 4 ++
kernel/locking/rtmutex.c | 84 +++++++++++++++++++++++++++++++++++------
kernel/locking/rtmutex_common.h | 2 +-
4 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 9c50e2efb660..94a301c50788 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -82,7 +82,7 @@ struct hrtimer_sleeper;
*/
static inline int rt_mutex_is_locked(struct rt_mutex *lock)
{
- return lock->owner != NULL;
+ return READ_ONCE(lock->owner) != NULL;
}

extern void __rt_mutex_init(struct rt_mutex *lock, const char *name);
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb0043203a..8602871bf2ac 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW

+config RT_MUTEX_SPIN_ON_OWNER
+ def_bool y
+ depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
config RWSEM_SPIN_ON_OWNER
def_bool y
depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 60fa79ae6d99..3ffdd7ba7c9f 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -54,13 +54,13 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner)
if (rt_mutex_has_waiters(lock))
val |= RT_MUTEX_HAS_WAITERS;

- lock->owner = (struct task_struct *)val;
+ WRITE_ONCE(lock->owner, (struct task_struct *)val);
}

static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
{
- lock->owner = (struct task_struct *)
- ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
+ WRITE_ONCE(lock->owner, (struct task_struct *)
+ ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS));
}

static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
@@ -69,6 +69,48 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
clear_rt_mutex_waiters(lock);
}

+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+ struct task_struct *owner)
+{
+ bool ret = true;
+
+ /*
+ * The last owner could have just released the lock;
+ * immediately try taking it again.
+ */
+ if (!owner)
+ goto done;
+
+ rcu_read_lock();
+ while (rt_mutex_owner(lock) == owner) {
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+ if (!owner->on_cpu || need_resched()) {
+ ret = false;
+ break;
+ }
+
+ cpu_relax_lowlatency();
+ }
+ rcu_read_unlock();
+done:
+ return ret;
+}
+
+#else
+static bool rt_mutex_spin_on_owner(struct rt_mutex *lock,
+ struct task_struct *owner)
+{
+ return false;
+}
+#endif
+
/*
* We can speed up the acquire/release, if there's no debugging state to be
* set up.
@@ -130,6 +172,12 @@ static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
* unlock(wait_lock);
* lock(wait_lock);
* acquire(lock);
+ *
+ * Care must be taken when performing spin-on-owner techniques for
+ * the top waiter. Because this relies on a lockless owner->on_cpu
+ * read, it could otherwise lead to a NULL owner pointer dereference.
+ * This is taken care of in rt_mutex_spin_on_owner() and will cause
+ * the task to immediately block.
*/
return rt_mutex_cmpxchg_release(lock, owner, NULL);
}
@@ -989,12 +1037,15 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
rt_mutex_dequeue_pi(current, waiter);

/*
- * As we are waking up the top waiter, and the waiter stays
- * queued on the lock until it gets the lock, this lock
- * obviously has waiters. Just set the bit here and this has
- * the added benefit of forcing all new tasks into the
- * slow path making sure no task of lower priority than
- * the top waiter can steal this lock.
+ * As we are potentially waking up the top waiter, and the waiter
+ * stays queued on the lock until it gets the lock, this lock
+ * obviously has waiters. Just set the bit here and this has the
+ * added benefit of forcing all new tasks into the slow path
+ * making sure no task of lower priority than the top waiter can
+ * steal this lock.
+ *
+ * If the top waiter, otoh, is spinning on ->owner, this will also
+ * serve to exit the spin loop and have it try to acquire the lock.
*/
lock->owner = (void *) RT_MUTEX_HAS_WAITERS;

@@ -1090,7 +1141,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
}

/**
- * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
+ * __rt_mutex_slowlock() - Perform the spin or wait-wake-try-to-take loop
* @lock: the rt_mutex to take
* @state: the state the task should block in (TASK_INTERRUPTIBLE
* or TASK_UNINTERRUPTIBLE)
@@ -1105,6 +1156,7 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
struct rt_mutex_waiter *waiter)
{
int ret = 0;
+ struct rt_mutex_waiter *top_waiter;

for (;;) {
/* Try to acquire the lock: */
@@ -1125,11 +1177,21 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
break;
}

+ top_waiter = rt_mutex_top_waiter(lock);
+
raw_spin_unlock_irq(&lock->wait_lock);

debug_rt_mutex_print_deadlock(waiter);

- schedule();
+ /*
+ * At this point the PI-dance is done, and, as the top waiter,
+ * we are next in line for the lock. Try to spin on the current
+ * owner for a while, in the hope that the lock will be released
+ * soon. Otherwise fallback and block.
+ */
+ if (top_waiter != waiter ||
+ !rt_mutex_spin_on_owner(lock, rt_mutex_owner(lock)))
+ schedule();

raw_spin_lock_irq(&lock->wait_lock);
set_current_state(state);
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index c7147376324e..cba4b81d2945 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -76,7 +76,7 @@ task_top_pi_waiter(struct task_struct *p)
static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
{
return (struct task_struct *)
- ((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
+ ((unsigned long)READ_ONCE(lock->owner) & ~RT_MUTEX_OWNER_MASKALL);
}

/*
--
2.7.4

2016-04-04 06:33:13

by Davidlohr Bueso

Subject: [PATCH 1/4] rtmutex: Delete save_state member of struct rt_mutex

This debug field is unused. Furthermore, it looks like a leftover from
porting rtmutex from -rt into upstream, where it serves to determine
if the wakeup is for a task blocked on a "sleeping spinlock", iow if
this is a regular rt_mutex_lock() or rt_spin_lock().

Of course, upstream we only have regular rt_mutexes, thus we can safely
assume nothing will be done with this field anyway. There is also the
fact that it only exists under debug.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/rtmutex.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5ce2a2f..9c50e2efb660 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -32,7 +32,6 @@ struct rt_mutex {
struct rb_node *waiters_leftmost;
struct task_struct *owner;
#ifdef CONFIG_DEBUG_RT_MUTEXES
- int save_state;
const char *name, *file;
int line;
void *magic;
--
2.7.4

2016-04-04 06:33:35

by Davidlohr Bueso

Subject: [PATCH 3/4] rtmutex: Add rt_mutex_init_waiter helper

... encapsulates debug and regular waiter initialization. In the case
of rtmutexes, we now also set the waiter task to NULL until it is later
explicitly set to the blocking task, instead of leaving the magic
number. This is safe as the waiter is on the stack and we are doing
very basic initialization anyway.
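
With the helper, the on-stack waiter setup at the call sites reduces to
(as the futex and slowlock hunks below show):

	struct rt_mutex_waiter waiter;

	rt_mutex_init_waiter(&waiter);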

Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 5 +----
kernel/locking/rtmutex.c | 4 +---
kernel/locking/rtmutex_common.h | 15 +++++++++++++++
3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a5d2e74c89e0..d4e0da984741 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2782,10 +2782,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
* The waiter is allocated on our stack, manipulated by the requeue
* code while we sleep on uaddr.
*/
- debug_rt_mutex_init_waiter(&rt_waiter);
- RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
- RB_CLEAR_NODE(&rt_waiter.tree_entry);
- rt_waiter.task = NULL;
+ rt_mutex_init_waiter(&rt_waiter);

ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3e746607abe5..60fa79ae6d99 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1171,9 +1171,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
unsigned long flags;
int ret = 0;

- debug_rt_mutex_init_waiter(&waiter);
- RB_CLEAR_NODE(&waiter.pi_tree_entry);
- RB_CLEAR_NODE(&waiter.tree_entry);
+ rt_mutex_init_waiter(&waiter);

/*
* Technically we could use raw_spin_[un]lock_irq() here, but this can
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c7d2d3..c7147376324e 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -119,4 +119,19 @@ extern void rt_mutex_adjust_prio(struct task_struct *task);
# include "rtmutex.h"
#endif

+/*
+ * Waiter structure basic initialization. A waiter is not considered
+ * actually usable until after calling task_blocks_on_rt_mutex(),
+ * which sets up the relevant entries.
+ */
+static inline void
+rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+ debug_rt_mutex_init_waiter(waiter);
+
+ RB_CLEAR_NODE(&waiter->pi_tree_entry);
+ RB_CLEAR_NODE(&waiter->tree_entry);
+ waiter->task = NULL;
+}
+
#endif
--
2.7.4

2016-04-04 06:33:54

by Davidlohr Bueso

Subject: [PATCH 2/4] rtmutex: Use waiter debug init,free magic numbers

... we already use these for regular mutexes; rtmutexes can also use
them, and while at it, rename the whole thing since these values are
specific to waiters.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/poison.h | 4 ++--
kernel/locking/mutex-debug.c | 4 ++--
kernel/locking/rtmutex-debug.c | 4 ++--
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 4a27153574e2..f6c413f9d8f6 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -74,8 +74,8 @@
#define ATM_POISON 0xdeadbeef

/********** kernel/mutexes **********/
-#define MUTEX_DEBUG_INIT 0x11
-#define MUTEX_DEBUG_FREE 0x22
+#define MUTEX_WAITER_DEBUG_INIT 0x11
+#define MUTEX_WAITER_DEBUG_FREE 0x22

/********** lib/flex_array.c **********/
#define FLEX_ARRAY_FREE 0x6c /* for use-after-free poisoning */
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 3ef3736002d8..cd7539c88a8a 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -29,7 +29,7 @@
*/
void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
{
- memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
+ memset(waiter, MUTEX_WAITER_DEBUG_INIT, sizeof(*waiter));
waiter->magic = waiter;
INIT_LIST_HEAD(&waiter->list);
}
@@ -45,7 +45,7 @@ void debug_mutex_wake_waiter(struct mutex *lock, struct mutex_waiter *waiter)
void debug_mutex_free_waiter(struct mutex_waiter *waiter)
{
DEBUG_LOCKS_WARN_ON(!list_empty(&waiter->list));
- memset(waiter, MUTEX_DEBUG_FREE, sizeof(*waiter));
+ memset(waiter, MUTEX_WAITER_DEBUG_FREE, sizeof(*waiter));
}

void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 62b6cee8ea7f..4659980d9614 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -154,14 +154,14 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)

void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
{
- memset(waiter, 0x11, sizeof(*waiter));
+ memset(waiter, MUTEX_WAITER_DEBUG_INIT, sizeof(*waiter));
waiter->deadlock_task_pid = NULL;
}

void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
{
put_pid(waiter->deadlock_task_pid);
- memset(waiter, 0x22, sizeof(*waiter));
+ memset(waiter, MUTEX_WAITER_DEBUG_FREE, sizeof(*waiter));
}

void debug_rt_mutex_init(struct rt_mutex *lock, const char *name)
--
2.7.4

2016-04-28 17:34:35

by Davidlohr Bueso

Subject: Re: [PATCH -tip v2 0/4] rtmutex: Another crack at spin on owner

Folks, any thoughts on this?

Thanks,
Davidlohr