The following series is v7 of the requeue_pi patches against
linux-2.6-tip/core/futexes. The current futex implementation doesn't allow for
requeueing of PI futexes, which leads to a thundering herd during
pthread_cond_broadcast() (as opposed to a civilized priority ordered wakeup
sequence). The core of the problem is that the underlying rt_mutex cannot be
left with waiters and no owner (which would break the PI logic). This patch
series updates the futex code to allow for requeueing from non-PI to PI futexes
in support of PI-aware pthread_cond_* calls, along with the needed rt_mutex
helper routines. The credit for the conceptual design goes to Thomas Gleixner,
while the bugs and other idiocies present in this implementation should be
attributed to me.
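For reference, the usage pattern this series targets is the classic condvar
idiom with a PRIO_INHERIT mutex, sketched below (error handling omitted; with
the patched glibc, pthread_cond_wait() maps to FUTEX_WAIT_REQUEUE_PI and
pthread_cond_broadcast() to FUTEX_CMP_REQUEUE_PI):

	pthread_mutexattr_t ma;
	pthread_mutex_t m;
	pthread_cond_t c = PTHREAD_COND_INITIALIZER;

	pthread_mutexattr_init(&ma);
	pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&m, &ma);

	/* waiters */
	pthread_mutex_lock(&m);
	while (!condition)
		pthread_cond_wait(&c, &m);
	pthread_mutex_unlock(&m);

	/* waker: previously woke all waiters at once (thundering herd);
	 * with requeue_pi it wakes the top waiter and requeues the rest
	 * onto the PI futex backing the mutex */
	pthread_mutex_lock(&m);
	condition = 1;
	pthread_cond_broadcast(&c);
	pthread_mutex_unlock(&m);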
New in V7:
-refactored futex_wait_requeue_pi()
-fixed pi_state handling for some corner cases in the wake-up path
-fixed a wakeup race bug introduced by refactoring futex_wait_queue_me()
-corrected a bug in futex_wait_requeue_pi() calling fixup_owner on the wrong
uaddr
-rewrote finish_futex_lock_pi() as fixup_owner() with more intuitive logic
-fixed a couple logic errors
-cleaned up some comments and clarified some locking approaches
This version has been tested with a rough raw futex syscall test case as well
as with a preliminary glibc patch that updates the pthread_cond* calls to use
the new syscalls and allow for the PI calls to take ownership of the rt_mutex
inside the kernel (see the "glibc hacks for requeue_pi" at the end of this
series). With this patched glibc the LTP realtime/func/prio-wake test case
passes consistently [1] (whereas before it would fail 10% of the time).
prio-wake tests the priority ordered wakeup of a pthread_cond_broadcast() using
a PI mutex. I have exercised the timeout and signal paths of
futex_wait_requeue_pi() prior to requeue. I am working to add more
sophisticated tests to be able to exercise the post-requeue error paths as
well. Additionally, I'd like to add some fault-injection.
I'd really appreciate feedback on the implementation as well as any design
critique. Answers to the questions posed in the patch headers and code
comments are particularly welcome.
1. In the interest of full disclosure I should mention that I have seen a rare
hang of the prio-wake testcase. Upon closer inspection I now believe this
to be due to a race inherent in the testcase, and not due to any flaw in the
kernel.
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
Darren Hart (9):
RFC: futex: add requeue_pi calls
RFC: futex: add futex_wait_setup()
RFC: futex: Add requeue_futex() call
RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
RFC: rt_mutex: add proxy lock routines
RFC: futex: fixup_owner()
RFC: futex: futex_lock_pi_atomic()
RFC: futex: futex_top_waiter()
RFC: futex: futex_wait_queue_me()
include/linux/futex.h | 8
include/linux/thread_info.h | 3
kernel/futex.c | 1179 +++++++++++++++++++++++++++++++++----------
kernel/rtmutex.c | 240 +++++++--
kernel/rtmutex_common.h | 8
5 files changed, 1103 insertions(+), 335 deletions(-)
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In order to
reuse a good chunk of the futex_wait() code for the upcoming
futex_wait_requeue_pi() function, this patch breaks out the queue-to-wakeup
section of futex_wait() into futex_wait_queue_me().
Changelog:
V7: -Updates from tglx
-Make futex_wait() use current->timer_slack_ns directly
-Removed needless likely() optimization
-Minor comment updates and coding style fixes
-Move the declaration of the wait_queue back into the caller to avoid a
timeout/signal vs. wakeup race where the wait_queue_t may go out of
scope
V6: -Incremental build fixes
V4: -Nesting cleanups
-Delayed hrtimer start until after setting TASK_INTERRUPTIBLE
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 138 +++++++++++++++++++++++++++++++-------------------------
1 files changed, 76 insertions(+), 62 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 6b50a02..d1d3e67 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1115,24 +1115,87 @@ handle_fault:
static long futex_wait_restart(struct restart_block *restart);
+/**
+ * futex_wait_queue_me() - queue_me() and wait for wakeup, timeout, or signal
+ * @hb: the futex hash bucket, must be locked by the caller
+ * @q: the futex_q to queue up on
+ * @timeout: the prepared hrtimer_sleeper, or NULL for no timeout
+ * @wait: the wait_queue to add to the futex_q after queueing in the hb
+ */
+static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
+ struct hrtimer_sleeper *timeout,
+ wait_queue_t *wait)
+{
+ queue_me(q, hb);
+
+ /*
+ * There might have been scheduling since the queue_me(), as we
+ * cannot hold a spinlock across the get_user() in case it
+ * faults, and we cannot just set TASK_INTERRUPTIBLE state when
+ * queueing ourselves into the futex hash. This code thus has to
+ * rely on the futex_wake() code removing us from hash when it
+ * wakes us up.
+ */
+
+ /* add_wait_queue is the barrier after __set_current_state. */
+ __set_current_state(TASK_INTERRUPTIBLE);
+
+ /*
+ * Add current as the futex_q waiter. We don't remove ourselves from
+ * the wait_queue because we are the only user of it.
+ */
+ add_wait_queue(&q->waiter, wait);
+
+ /* Arm the timer */
+ if (timeout) {
+ hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&timeout->timer))
+ timeout->task = NULL;
+ }
+
+ /*
+ * !plist_node_empty() is safe here without any lock.
+ * q->lock_ptr != 0 is not safe, because of ordering against wakeup.
+ */
+ if (likely(!plist_node_empty(&q->list))) {
+ /*
+ * If the timer has already expired, current will already be
+ * flagged for rescheduling. Only call schedule if there
+ * is no timeout, or if it has yet to expire.
+ */
+ if (!timeout || timeout->task)
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
static int futex_wait(u32 __user *uaddr, int fshared,
u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
{
- struct task_struct *curr = current;
+ struct hrtimer_sleeper timeout, *to = NULL;
+ DECLARE_WAITQUEUE(wait, current);
struct restart_block *restart;
- DECLARE_WAITQUEUE(wait, curr);
struct futex_hash_bucket *hb;
struct futex_q q;
u32 uval;
int ret;
- struct hrtimer_sleeper t;
- int rem = 0;
if (!bitset)
return -EINVAL;
q.pi_state = NULL;
q.bitset = bitset;
+
+ if (abs_time) {
+ to = &timeout;
+
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time,
+ current->timer_slack_ns);
+ }
+
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, fshared, &q.key);
@@ -1178,75 +1241,22 @@ retry_private:
goto retry;
}
ret = -EWOULDBLOCK;
+
+ /* Only actually queue if *uaddr contained val. */
if (unlikely(uval != val)) {
queue_unlock(&q, hb);
goto out_put_key;
}
- /* Only actually queue if *uaddr contained val. */
- queue_me(&q, hb);
-
- /*
- * There might have been scheduling since the queue_me(), as we
- * cannot hold a spinlock across the get_user() in case it
- * faults, and we cannot just set TASK_INTERRUPTIBLE state when
- * queueing ourselves into the futex hash. This code thus has to
- * rely on the futex_wake() code removing us from hash when it
- * wakes us up.
- */
-
- /* add_wait_queue is the barrier after __set_current_state. */
- __set_current_state(TASK_INTERRUPTIBLE);
- add_wait_queue(&q.waiter, &wait);
- /*
- * !plist_node_empty() is safe here without any lock.
- * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
- */
- if (likely(!plist_node_empty(&q.list))) {
- if (!abs_time)
- schedule();
- else {
- hrtimer_init_on_stack(&t.timer,
- clockrt ? CLOCK_REALTIME :
- CLOCK_MONOTONIC,
- HRTIMER_MODE_ABS);
- hrtimer_init_sleeper(&t, current);
- hrtimer_set_expires_range_ns(&t.timer, *abs_time,
- current->timer_slack_ns);
-
- hrtimer_start_expires(&t.timer, HRTIMER_MODE_ABS);
- if (!hrtimer_active(&t.timer))
- t.task = NULL;
-
- /*
- * the timer could have already expired, in which
- * case current would be flagged for rescheduling.
- * Don't bother calling schedule.
- */
- if (likely(t.task))
- schedule();
-
- hrtimer_cancel(&t.timer);
-
- /* Flag if a timeout occured */
- rem = (t.task == NULL);
-
- destroy_hrtimer_on_stack(&t.timer);
- }
- }
- __set_current_state(TASK_RUNNING);
-
- /*
- * NOTE: we don't remove ourselves from the waitqueue because
- * we are the only user of it.
- */
+ /* queue_me and wait for wakeup, timeout, or a signal. */
+ futex_wait_queue_me(hb, &q, to, &wait);
/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
if (!unqueue_me(&q))
goto out_put_key;
ret = -ETIMEDOUT;
- if (rem)
+ if (to && !to->task)
goto out_put_key;
/*
@@ -1275,6 +1285,10 @@ retry_private:
out_put_key:
put_futex_key(fshared, &q.key);
out:
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
return ret;
}
Improve legibility by wrapping finding the top waiter in a function. This
will be used by the follow-on patches for enabling requeue pi.
Changelog:
V7: -Minor kernel doc comment update
V6: -Shortened function length after review from tglx
-Corrected kernel-doc formatting
V4: -Simplified logic in the hb loop
-Return the futex_q as the task can be had from there if needed
-Updated comments and dropped the -safe iterator per review from tglx
V3: -Comment linewidth cleanup
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index d1d3e67..a13369d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -276,6 +276,25 @@ void put_futex_key(int fshared, union futex_key *key)
drop_futex_key_refs(key);
}
+/**
+ * futex_top_waiter() - Return the highest priority waiter on a futex
+ * @hb: the hash bucket the futex_q's reside in
+ * @key: the futex key (to distinguish it from other futex futex_q's)
+ *
+ * Must be called with the hb lock held.
+ */
+static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb,
+ union futex_key *key)
+{
+ struct futex_q *this;
+
+ plist_for_each_entry(this, &hb->chain, list) {
+ if (match_futex(&this->key, key))
+ return this;
+ }
+ return NULL;
+}
+
static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval)
{
u32 curval;
Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().
This logic will be needed elsewhere, so modularize it to reduce code
duplication. The only significant change is passing in the task to try to
take the lock for. This simplifies the -EDEADLK test: if the lock is owned
by task t, it's a deadlock, regardless of whether we are doing requeue pi.
This patch updates the corresponding comment accordingly.
Changelog:
V7: -coding style updates from tglx
V6: -comment update from peterz
V4: -fixes for style errors caught by checkpatch.pl
-use task instead of t for legibility whilst searching
-pass key and pi_state directly rather than trying to rely on the futex_q
since the key may be different.
-minor cleanups suggested by tglx
V2: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 224 +++++++++++++++++++++++++++++++++-----------------------
1 files changed, 130 insertions(+), 94 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index a13369d..36b5367 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -556,6 +556,127 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
return 0;
}
+/**
+ * futex_lock_pi_atomic() - atomic work required to acquire a pi aware futex
+ * @uaddr: the pi futex user address
+ * @hb: the pi futex hash bucket
+ * @key: the futex key associated with uaddr and hb
+ * @ps: the pi_state pointer where we store the result of the lookup
+ * @task: the task to perform the atomic lock work for. This will be
+ * "current" except in the case of requeue pi.
+ *
+ * Returns:
+ * 0 - ready to wait
+ * 1 - acquired the lock
+ * <0 - error
+ *
+ * The hb->lock and futex_key refs shall be held by the caller.
+ */
+static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
+ union futex_key *key,
+ struct futex_pi_state **ps,
+ struct task_struct *task)
+{
+ int lock_taken, ret, ownerdied = 0;
+ u32 uval, newval, curval;
+
+retry:
+ ret = lock_taken = 0;
+
+ /*
+ * To avoid races, we attempt to take the lock here again
+ * (by doing a 0 -> TID atomic cmpxchg), while holding all
+ * the locks. It will most likely not succeed.
+ */
+ newval = task_pid_vnr(task);
+
+ curval = cmpxchg_futex_value_locked(uaddr, 0, newval);
+
+ if (unlikely(curval == -EFAULT))
+ return -EFAULT;
+
+ /*
+ * Detect deadlocks.
+ */
+ if ((unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(task))))
+ return -EDEADLK;
+
+ /*
+ * Surprise - we got the lock. Just return to userspace:
+ */
+ if (unlikely(!curval))
+ return 1;
+
+ uval = curval;
+
+ /*
+ * Set the FUTEX_WAITERS flag, so the owner will know it has someone
+ * to wake at the next unlock.
+ */
+ newval = curval | FUTEX_WAITERS;
+
+ /*
+ * There are two cases, where a futex might have no owner (the
+ * owner TID is 0): OWNER_DIED. We take over the futex in this
+ * case. We also do an unconditional take over, when the owner
+ * of the futex died.
+ *
+ * This is safe as we are protected by the hash bucket lock !
+ */
+ if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ /* Keep the OWNER_DIED bit */
+ newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(task);
+ ownerdied = 0;
+ lock_taken = 1;
+ }
+
+ curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+ if (unlikely(curval == -EFAULT))
+ return -EFAULT;
+ if (unlikely(curval != uval))
+ goto retry;
+
+ /*
+ * We took the lock due to owner died take over.
+ */
+ if (unlikely(lock_taken))
+ return 1;
+
+ /*
+ * We don't have the lock. Look up the PI state (or create it if
+ * we are the first waiter):
+ */
+ ret = lookup_pi_state(uval, hb, key, ps);
+
+ if (unlikely(ret)) {
+ switch (ret) {
+ case -ESRCH:
+ /*
+ * No owner found for this futex. Check if the
+ * OWNER_DIED bit is set to figure out whether
+ * this is a robust futex or not.
+ */
+ if (get_futex_value_locked(&curval, uaddr))
+ return -EFAULT;
+
+ /*
+ * We simply start over in case of a robust
+ * futex. The code above will take the futex
+ * and return happy.
+ */
+ if (curval & FUTEX_OWNER_DIED) {
+ ownerdied = 1;
+ goto retry;
+ }
+ default:
+ break;
+ }
+ }
+
+ return ret;
+}
+
/*
* The hash bucket lock must be held when this is called.
* Afterwards, the futex_q must not be accessed.
@@ -1340,9 +1461,9 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
struct hrtimer_sleeper timeout, *to = NULL;
struct task_struct *curr = current;
struct futex_hash_bucket *hb;
- u32 uval, newval, curval;
+ u32 uval;
struct futex_q q;
- int ret, lock_taken, ownerdied = 0;
+ int ret;
if (refill_pi_state_cache())
return -ENOMEM;
@@ -1365,81 +1486,15 @@ retry:
retry_private:
hb = queue_lock(&q);
-retry_locked:
- ret = lock_taken = 0;
-
- /*
- * To avoid races, we attempt to take the lock here again
- * (by doing a 0 -> TID atomic cmpxchg), while holding all
- * the locks. It will most likely not succeed.
- */
- newval = task_pid_vnr(current);
-
- curval = cmpxchg_futex_value_locked(uaddr, 0, newval);
-
- if (unlikely(curval == -EFAULT))
- goto uaddr_faulted;
-
- /*
- * Detect deadlocks. In case of REQUEUE_PI this is a valid
- * situation and we return success to user space.
- */
- if (unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(current))) {
- ret = -EDEADLK;
- goto out_unlock_put_key;
- }
-
- /*
- * Surprise - we got the lock. Just return to userspace:
- */
- if (unlikely(!curval))
- goto out_unlock_put_key;
-
- uval = curval;
-
- /*
- * Set the WAITERS flag, so the owner will know it has someone
- * to wake at next unlock
- */
- newval = curval | FUTEX_WAITERS;
-
- /*
- * There are two cases, where a futex might have no owner (the
- * owner TID is 0): OWNER_DIED. We take over the futex in this
- * case. We also do an unconditional take over, when the owner
- * of the futex died.
- *
- * This is safe as we are protected by the hash bucket lock !
- */
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
- /* Keep the OWNER_DIED bit */
- newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(current);
- ownerdied = 0;
- lock_taken = 1;
- }
-
- curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
-
- if (unlikely(curval == -EFAULT))
- goto uaddr_faulted;
- if (unlikely(curval != uval))
- goto retry_locked;
-
- /*
- * We took the lock due to owner died take over.
- */
- if (unlikely(lock_taken))
- goto out_unlock_put_key;
-
- /*
- * We dont have the lock. Look up the PI state (or create it if
- * we are the first waiter):
- */
- ret = lookup_pi_state(uval, hb, &q.key, &q.pi_state);
-
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current);
if (unlikely(ret)) {
switch (ret) {
-
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
case -EAGAIN:
/*
* Task is exiting and we just wait for the
@@ -1449,25 +1504,6 @@ retry_locked:
put_futex_key(fshared, &q.key);
cond_resched();
goto retry;
-
- case -ESRCH:
- /*
- * No owner found for this futex. Check if the
- * OWNER_DIED bit is set to figure out whether
- * this is a robust futex or not.
- */
- if (get_futex_value_locked(&curval, uaddr))
- goto uaddr_faulted;
-
- /*
- * We simply start over in case of a robust
- * futex. The code above will take the futex
- * and return happy.
- */
- if (curval & FUTEX_OWNER_DIED) {
- ownerdied = 1;
- goto retry_locked;
- }
default:
goto out_unlock_put_key;
}
Refactor the post lock acquisition logic from futex_lock_pi(). This code will
be reused in futex_wait_requeue_pi().
Changelog:
V7: -Rename finish_futex_lock_pi() to fixup_owner() (per Dinakar G.)
V6: -Minor comment updates
V4: -Corrected string paranoia check
-Move the spinlock(q->lock_ptr) out of finish_futex_lock_pi to retain
some semblance of lock/unlock in the same function.
V3: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 158 ++++++++++++++++++++++++++++++++------------------------
1 files changed, 89 insertions(+), 69 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 36b5367..c2b58d8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1256,6 +1256,79 @@ handle_fault:
static long futex_wait_restart(struct restart_block *restart);
/**
+ * fixup_owner() - Post lock pi_state and corner case management
+ * @uaddr: user address of the futex
+ * @fshared: whether the futex is shared (1) or not (0)
+ * @q: futex_q (contains pi_state and access to the rt_mutex)
+ * @locked: if the attempt to take the rt_mutex succeeded (1) or not (0)
+ *
+ * After attempting to lock an rt_mutex, this function is called to cleanup
+ * the pi_state owner as well as handle race conditions that may allow us to
+ * acquire the lock. Must be called with the hb lock held.
+ *
+ * Returns:
+ * 1 - success, lock taken
+ * 0 - success, lock not taken
+ * <0 - on error (-EFAULT)
+ */
+static int fixup_owner(u32 __user *uaddr, int fshared, struct futex_q *q,
+ int locked)
+{
+ struct task_struct *owner;
+ int ret = 0;
+
+ if (locked) {
+ /*
+ * Got the lock. We might not be the anticipated owner if we
+ * did a lock-steal - fix up the PI-state in that case:
+ */
+ if (q->pi_state->owner != current)
+ ret = fixup_pi_state_owner(uaddr, q, current, fshared);
+ goto out;
+ }
+
+ /*
+ * Catch the rare case, where the lock was released when we were on the
+ * way back before we locked the hash bucket.
+ */
+ if (q->pi_state->owner == current) {
+ /*
+ * Try to get the rt_mutex now. This might fail as some other
+ * task acquired the rt_mutex after we removed ourself from the
+ * rt_mutex waiters list.
+ */
+ if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+ locked = 1;
+ goto out;
+ }
+
+ /*
+ * pi_state is incorrect, some other task did a lock steal and
+ * we returned due to timeout or signal without taking the
+ * rt_mutex. Too late. We can access the rt_mutex_owner without
+ * locking, as the other task is now blocked on the hash bucket
+ * lock. Fix the state up.
+ */
+ owner = rt_mutex_owner(&q->pi_state->pi_mutex);
+ ret = fixup_pi_state_owner(uaddr, q, owner, fshared);
+ goto out;
+ }
+
+ /*
+ * Paranoia check. If we did not take the lock, then we should not be
+ * the owner, nor the pending owner, of the rt_mutex.
+ */
+ if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+ printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
+ "pi-state %p\n", ret,
+ q->pi_state->pi_mutex.owner,
+ q->pi_state->owner);
+
+out:
+ return ret ? ret : locked;
+}
+
+/**
* futex_wait_queue_me() - queue_me() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
@@ -1459,11 +1532,10 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
int detect, ktime_t *time, int trylock)
{
struct hrtimer_sleeper timeout, *to = NULL;
- struct task_struct *curr = current;
struct futex_hash_bucket *hb;
u32 uval;
struct futex_q q;
- int ret;
+ int res, ret;
if (refill_pi_state_cache())
return -ENOMEM;
@@ -1527,71 +1599,21 @@ retry_private:
}
spin_lock(q.lock_ptr);
-
- if (!ret) {
- /*
- * Got the lock. We might not be the anticipated owner
- * if we did a lock-steal - fix up the PI-state in
- * that case:
- */
- if (q.pi_state->owner != curr)
- ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
- } else {
- /*
- * Catch the rare case, where the lock was released
- * when we were on the way back before we locked the
- * hash bucket.
- */
- if (q.pi_state->owner == curr) {
- /*
- * Try to get the rt_mutex now. This might
- * fail as some other task acquired the
- * rt_mutex after we removed ourself from the
- * rt_mutex waiters list.
- */
- if (rt_mutex_trylock(&q.pi_state->pi_mutex))
- ret = 0;
- else {
- /*
- * pi_state is incorrect, some other
- * task did a lock steal and we
- * returned due to timeout or signal
- * without taking the rt_mutex. Too
- * late. We can access the
- * rt_mutex_owner without locking, as
- * the other task is now blocked on
- * the hash bucket lock. Fix the state
- * up.
- */
- struct task_struct *owner;
- int res;
-
- owner = rt_mutex_owner(&q.pi_state->pi_mutex);
- res = fixup_pi_state_owner(uaddr, &q, owner,
- fshared);
-
- /* propagate -EFAULT, if the fixup failed */
- if (res)
- ret = res;
- }
- } else {
- /*
- * Paranoia check. If we did not take the lock
- * in the trylock above, then we should not be
- * the owner of the rtmutex, neither the real
- * nor the pending one:
- */
- if (rt_mutex_owner(&q.pi_state->pi_mutex) == curr)
- printk(KERN_ERR "futex_lock_pi: ret = %d "
- "pi-mutex: %p pi-state %p\n", ret,
- q.pi_state->pi_mutex.owner,
- q.pi_state->owner);
- }
- }
+ /*
+ * Fixup the pi_state owner and possibly acquire the lock if we
+ * haven't already.
+ */
+ res = fixup_owner(uaddr, fshared, &q, !ret);
+ /*
+ * If fixup_owner() returned an error, propagate that. If it acquired
+ * the lock, clear our -ETIMEDOUT or -EINTR.
+ */
+ if (res)
+ ret = (res < 0) ? res : 0;
/*
- * If fixup_pi_state_owner() faulted and was unable to handle the
- * fault, unlock it and return the fault to userspace.
+ * If fixup_owner() faulted and was unable to handle the fault, unlock
+ * it and return the fault to userspace.
*/
if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
rt_mutex_unlock(&q.pi_state->pi_mutex);
@@ -1599,9 +1621,7 @@ retry_private:
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- if (to)
- destroy_hrtimer_on_stack(&to->timer);
- return ret != -EINTR ? ret : -ERESTARTNOINTR;
+ goto out;
out_unlock_put_key:
queue_unlock(&q, hb);
@@ -1611,7 +1631,7 @@ out_put_key:
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);
- return ret;
+ return ret != -EINTR ? ret : -ERESTARTNOINTR;
uaddr_faulted:
/*
Currently restart is only used if there is a timeout. The requeue_pi
functionality requires restarting to futex_lock_pi() on signal after wakeup in
futex_wait_requeue_pi() regardless of whether there was a timeout. Using 0
for the timeout value is confusing, as that could indicate an expired timer.
The flag makes this explicit. While the check is not technically needed in
futex_wait_restart(), doing it there keeps the code consistent and avoids
confusion should the need arise to restart a wait without a timeout.
Changelog:
V7: -Corrected FLAGS_HAS_TIMEOUT flag detection logic per Eric Dumazet
V6: - Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index c2b58d8..04e5431 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1252,6 +1252,7 @@ handle_fault:
*/
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
@@ -1486,7 +1487,7 @@ retry_private:
restart->futex.val = val;
restart->futex.time = abs_time->tv64;
restart->futex.bitset = bitset;
- restart->futex.flags = 0;
+ restart->futex.flags = FLAGS_HAS_TIMEOUT;
if (fshared)
restart->futex.flags |= FLAGS_SHARED;
@@ -1510,13 +1511,16 @@ static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
int fshared = 0;
- ktime_t t;
+ ktime_t t, *tp = NULL;
- t.tv64 = restart->futex.time;
+ if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
restart->fn = do_no_restart_syscall;
if (restart->futex.flags & FLAGS_SHARED)
fshared = 1;
- return (long)futex_wait(uaddr, fshared, restart->futex.val, &t,
+ return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
restart->futex.bitset,
restart->futex.flags & FLAGS_CLOCKRT);
}
This patch is required for the first half of requeue_pi to function. It
basically splits rt_mutex_slowlock() right down the middle, just before the
first call to schedule().
This patch uses a new futex_q field, rt_waiter, for now. I think
I should be able to use task->pi_blocked_on in a future version of this patch.
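To illustrate the intended use of the new pair (a rough sketch only; the
actual call sites land in the requeue_pi patch at the end of this series):

	/* Waker context (futex_requeue), on behalf of the waiter: */
	ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex, &rt_waiter,
					waiter_task, 0);
	/* 1: lock acquired for waiter_task, wake it directly
	 * 0: waiter_task is now enqueued as an rt_mutex waiter */

	/* Waiter context (futex_wait_requeue_pi), after waking on uaddr2: */
	ret = rt_mutex_finish_proxy_lock(&pi_state->pi_mutex, to,
					 &rt_waiter, 0);
	/* 0 on success, else -EINTR, -ETIMEDOUT, or -EDEADLK */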
Changelog:
V6: -add mark_rt_mutex_waiters() to rt_mutex_start_proxy_lock() to avoid
the race condition evident in previous versions
-cleanup kernel-docs formatting and comments
-try to take the lock in rt_mutex_start_proxy_lock() rather than assume
the lock is held
-remove initial schedule in finish_proxy_lock to allow for signal and
timeout detection.
V5: -remove EXPORT_SYMBOL_GPL from the new routines
-minor cleanups
V4: -made detect_deadlock a parameter to rt_mutex_enqueue_task
-refactored rt_mutex_slowlock to share code with new functions
-renamed rt_mutex_enqueue_task and rt_mutex_handle_wakeup to
rt_mutex_start_proxy_lock and rt_mutex_finish_proxy_lock, respectively
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/rtmutex.c | 240 +++++++++++++++++++++++++++++++++++++----------
kernel/rtmutex_common.h | 8 ++
2 files changed, 195 insertions(+), 53 deletions(-)
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 69d9cb9..fec77e7 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -300,7 +300,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* assigned pending owner [which might not have taken the
* lock yet]:
*/
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock,
+ struct task_struct *task)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -309,11 +310,11 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
if (!rt_mutex_owner_pending(lock))
return 0;
- if (pendowner == current)
+ if (pendowner == task)
return 1;
spin_lock_irqsave(&pendowner->pi_lock, flags);
- if (current->prio >= pendowner->prio) {
+ if (task->prio >= pendowner->prio) {
spin_unlock_irqrestore(&pendowner->pi_lock, flags);
return 0;
}
@@ -338,21 +339,21 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
* We are going to steal the lock and a waiter was
* enqueued on the pending owners pi_waiters queue. So
* we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
+ * task->pi_waiters list. This covers the case,
+ * where task is boosted because it holds another
* lock and gets unboosted because the booster is
* interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
+ * priority as task->normal_prio.
*
* Note: in the rare case of a SCHED_OTHER task changing
* its priority and thus stealing the lock, next->task
- * might be current:
+ * might be task:
*/
- if (likely(next->task != current)) {
- spin_lock_irqsave(¤t->pi_lock, flags);
- plist_add(&next->pi_list_entry, ¤t->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock_irqrestore(¤t->pi_lock, flags);
+ if (likely(next->task != task)) {
+ spin_lock_irqsave(&task->pi_lock, flags);
+ plist_add(&next->pi_list_entry, &task->pi_waiters);
+ __rt_mutex_adjust_prio(task);
+ spin_unlock_irqrestore(&task->pi_lock, flags);
}
return 1;
}
@@ -389,7 +390,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
mark_rt_mutex_waiters(lock);
- if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+ if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, current))
return 0;
/* We got the lock. */
@@ -411,6 +412,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
int detect_deadlock)
{
struct task_struct *owner = rt_mutex_owner(lock);
@@ -418,21 +420,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
unsigned long flags;
int chain_walk = 0, res;
- spin_lock_irqsave(¤t->pi_lock, flags);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
+ spin_lock_irqsave(&task->pi_lock, flags);
+ __rt_mutex_adjust_prio(task);
+ waiter->task = task;
waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ plist_node_init(&waiter->list_entry, task->prio);
+ plist_node_init(&waiter->pi_list_entry, task->prio);
/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);
- current->pi_blocked_on = waiter;
+ task->pi_blocked_on = waiter;
- spin_unlock_irqrestore(¤t->pi_lock, flags);
+ spin_unlock_irqrestore(&task->pi_lock, flags);
if (waiter == rt_mutex_top_waiter(lock)) {
spin_lock_irqsave(&owner->pi_lock, flags);
@@ -460,7 +462,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
spin_unlock(&lock->wait_lock);
res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current);
+ task);
spin_lock(&lock->wait_lock);
@@ -605,37 +607,25 @@ void rt_mutex_adjust_pi(struct task_struct *task)
rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task);
}
-/*
- * Slow path lock function:
+/**
+ * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
+ * @lock: the rt_mutex to take
+ * @state: the state the task should block in (TASK_INTERRUPTIBLE
+ * or TASK_UNINTERRUPTIBLE)
+ * @timeout: the pre-initialized and started timer, or NULL for none
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * lock->wait_lock must be held by the caller.
*/
static int __sched
-rt_mutex_slowlock(struct rt_mutex *lock, int state,
- struct hrtimer_sleeper *timeout,
- int detect_deadlock)
+__rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
{
- struct rt_mutex_waiter waiter;
int ret = 0;
- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
-
- spin_lock(&lock->wait_lock);
-
- /* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
- spin_unlock(&lock->wait_lock);
- return 0;
- }
-
- set_current_state(state);
-
- /* Setup the timer, when timeout != NULL */
- if (unlikely(timeout)) {
- hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
- if (!hrtimer_active(&timeout->timer))
- timeout->task = NULL;
- }
-
for (;;) {
/* Try to acquire the lock: */
if (try_to_take_rt_mutex(lock))
@@ -656,19 +646,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
}
/*
- * waiter.task is NULL the first time we come here and
+ * waiter->task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by a higher prio task.
*/
- if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
+ if (!waiter->task) {
+ ret = task_blocks_on_rt_mutex(lock, waiter, current,
detect_deadlock);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
* to get the lock now:
*/
- if (unlikely(!waiter.task)) {
+ if (unlikely(!waiter->task)) {
/*
* Reset the return value. We might
* have returned with -EDEADLK and the
@@ -684,15 +674,52 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
spin_unlock(&lock->wait_lock);
- debug_rt_mutex_print_deadlock(&waiter);
+ debug_rt_mutex_print_deadlock(waiter);
- if (waiter.task)
+ if (waiter->task)
schedule_rt_mutex(lock);
spin_lock(&lock->wait_lock);
set_current_state(state);
}
+ return ret;
+}
+
+/*
+ * Slow path lock function:
+ */
+static int __sched
+rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ int detect_deadlock)
+{
+ struct rt_mutex_waiter waiter;
+ int ret = 0;
+
+ debug_rt_mutex_init_waiter(&waiter);
+ waiter.task = NULL;
+
+ spin_lock(&lock->wait_lock);
+
+ /* Try to acquire the lock again: */
+ if (try_to_take_rt_mutex(lock)) {
+ spin_unlock(&lock->wait_lock);
+ return 0;
+ }
+
+ set_current_state(state);
+
+ /* Setup the timer, when timeout != NULL */
+ if (unlikely(timeout)) {
+ hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&timeout->timer))
+ timeout->task = NULL;
+ }
+
+ ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
+ detect_deadlock);
+
set_current_state(TASK_RUNNING);
if (unlikely(waiter.task))
@@ -986,6 +1013,59 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
}
/**
+ * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ * @detect_deadlock: perform deadlock detection (1) or not (0)
+ *
+ * Returns:
+ * 0 - task blocked on lock
+ * 1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task, int detect_deadlock)
+{
+ int ret;
+
+ spin_lock(&lock->wait_lock);
+
+ mark_rt_mutex_waiters(lock);
+
+ if (!rt_mutex_owner(lock) || try_to_steal_lock(lock, task)) {
+ /* We got the lock for task. */
+ debug_rt_mutex_lock(lock);
+
+ rt_mutex_set_owner(lock, task, 0);
+
+ rt_mutex_deadlock_account_lock(lock, task);
+ return 1;
+ }
+
+ ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
+
+ if (ret && !waiter->task) {
+ /*
+ * Reset the return value. We might have
+ * returned with -EDEADLK and the owner
+ * released the lock while we were walking the
+ * pi chain. Let the waiter sort it out.
+ */
+ ret = 0;
+ }
+ spin_unlock(&lock->wait_lock);
+
+ debug_rt_mutex_print_deadlock(waiter);
+
+ return ret;
+}
+
+/**
* rt_mutex_next_owner - return the next owner of the lock
*
* @lock: the rt lock query
@@ -1004,3 +1084,57 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
return rt_mutex_top_waiter(lock)->task;
}
+
+/**
+ * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * @lock: the rt_mutex we were woken on
+ * @to: the timeout, NULL if none. hrtimer should already have
+ * been started.
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: perform deadlock detection (1) or not (0)
+ *
+ * Complete the lock acquisition started on our behalf by another thread.
+ *
+ * Returns:
+ * 0 - success
+ * <0 - error, one of -EINTR, -ETIMEDOUT, or -EDEADLK
+ *
+ * Special API call for PI-futex requeue support
+ */
+int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
+{
+ int ret;
+
+ spin_lock(&lock->wait_lock);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
+ detect_deadlock);
+
+ set_current_state(TASK_RUNNING);
+
+ if (unlikely(waiter->task))
+ remove_waiter(lock, waiter);
+
+ /*
+ * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+ * have to fix that up.
+ */
+ fixup_rt_mutex_waiters(lock);
+
+ spin_unlock(&lock->wait_lock);
+
+ /*
+ * Readjust priority, when we did not get the lock. We might have been
+ * the pending owner and boosted. Since we did not take the lock, the
+ * PI boost has to go.
+ */
+ if (unlikely(ret))
+ rt_mutex_adjust_prio(current);
+
+ return ret;
+}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index e124bf5..97a2f81 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);
extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
struct task_struct *proxy_owner);
+extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
+ int detect_deadlock);
+extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock);
#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
futex_requeue() is getting a bit long-winded, and will be getting more so after
the requeue_pi patch. Factor out the actual requeueing into a nicely contained
inline function to reduce function length and improve legibility.
Changelog:
V6: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 43 +++++++++++++++++++++++++++++--------------
1 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 04e5431..4fa5de1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -940,6 +940,34 @@ out:
return ret;
}
+/**
+ * requeue_futex() - Requeue a futex_q from one hb to another
+ * @q: the futex_q to requeue
+ * @hb1: the source hash_bucket
+ * @hb2: the target hash_bucket
+ * @key2: the new key for the requeued futex_q
+ */
+static inline
+void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2, union futex_key *key2)
+{
+
+ /*
+ * If key1 and key2 hash to the same bucket, no need to
+ * requeue.
+ */
+ if (likely(&hb1->chain != &hb2->chain)) {
+ plist_del(&q->list, &hb1->chain);
+ plist_add(&q->list, &hb2->chain);
+ q->lock_ptr = &hb2->lock;
+#ifdef CONFIG_DEBUG_PI_LIST
+ q->list.plist.lock = &hb2->lock;
+#endif
+ }
+ get_futex_key_refs(key2);
+ q->key = *key2;
+}
+
/*
* Requeue all waiters hashed on one physical page to another
* physical page.
@@ -999,20 +1027,7 @@ retry_private:
if (++ret <= nr_wake) {
wake_futex(this);
} else {
- /*
- * If key1 and key2 hash to the same bucket, no need to
- * requeue.
- */
- if (likely(head1 != &hb2->chain)) {
- plist_del(&this->list, &hb1->chain);
- plist_add(&this->list, &hb2->chain);
- this->lock_ptr = &hb2->lock;
-#ifdef CONFIG_DEBUG_PI_LIST
- this->list.plist.lock = &hb2->lock;
-#endif
- }
- this->key = key2;
- get_futex_key_refs(&key2);
+ requeue_futex(this, hb1, hb2, &key2);
drop_count++;
if (ret - nr_wake >= nr_requeue)
Refactor the code to validate the expected futex value in order to
reuse it with the requeue_pi code.
Changelog:
V7: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 116 +++++++++++++++++++++++++++++++++++---------------------
1 files changed, 72 insertions(+), 44 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 4fa5de1..bf657db 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1398,42 +1398,29 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
__set_current_state(TASK_RUNNING);
}
-static int futex_wait(u32 __user *uaddr, int fshared,
- u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
+/**
+ * futex_wait_setup() - Prepare to wait on a futex
+ * @uaddr: the futex userspace address
+ * @val: the expected value
+ * @fshared: whether the futex is shared (1) or not (0)
+ * @q: the associated futex_q
+ * @hb: storage for hash_bucket pointer to be returned to caller
+ *
+ * Setup the futex_q and locate the hash_bucket. Get the futex value and
+ * compare it with the expected value. Handle atomic faults internally.
+ * Return with the hb lock held and a q.key reference on success, and unlocked
+ * with no q.key reference on failure.
+ *
+ * Returns:
+ * 0 - uaddr contains val and hb has been locked
+ * <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
+ */
+static int futex_wait_setup(u32 __user *uaddr, u32 val, int fshared,
+ struct futex_q *q, struct futex_hash_bucket **hb)
{
- struct hrtimer_sleeper timeout, *to = NULL;
- DECLARE_WAITQUEUE(wait, current);
- struct restart_block *restart;
- struct futex_hash_bucket *hb;
- struct futex_q q;
u32 uval;
int ret;
- if (!bitset)
- return -EINVAL;
-
- q.pi_state = NULL;
- q.bitset = bitset;
-
- if (abs_time) {
- to = &timeout;
-
- hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
- CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
- hrtimer_init_sleeper(to, current);
- hrtimer_set_expires_range_ns(&to->timer, *abs_time,
- current->timer_slack_ns);
- }
-
-retry:
- q.key = FUTEX_KEY_INIT;
- ret = get_futex_key(uaddr, fshared, &q.key);
- if (unlikely(ret != 0))
- goto out;
-
-retry_private:
- hb = queue_lock(&q);
-
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
@@ -1450,33 +1437,74 @@ retry_private:
* A consequence is that futex_wait() can return zero and absorb
* a wakeup when *uaddr != val on entry to the syscall. This is
* rare, but normal.
- *
- * For shared futexes, we hold the mmap semaphore, so the mapping
- * cannot have changed since we looked it up in get_futex_key.
*/
+retry:
+ q->key = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr, fshared, &q->key);
+ if (unlikely(ret != 0))
+ goto out;
+
+retry_private:
+ *hb = queue_lock(q);
+
ret = get_futex_value_locked(&uval, uaddr);
- if (unlikely(ret)) {
- queue_unlock(&q, hb);
+ if (ret) {
+ queue_unlock(q, *hb);
ret = get_user(uval, uaddr);
if (ret)
- goto out_put_key;
+ goto out;
if (!fshared)
goto retry_private;
- put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &q->key);
goto retry;
}
- ret = -EWOULDBLOCK;
- /* Only actually queue if *uaddr contained val. */
- if (unlikely(uval != val)) {
- queue_unlock(&q, hb);
- goto out_put_key;
+ if (uval != val) {
+ queue_unlock(q, *hb);
+ ret = -EWOULDBLOCK;
}
+out:
+ if (ret)
+ put_futex_key(fshared, &q->key);
+ return ret;
+}
+
+static int futex_wait(u32 __user *uaddr, int fshared,
+ u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
+{
+ struct hrtimer_sleeper timeout, *to = NULL;
+ DECLARE_WAITQUEUE(wait, current);
+ struct restart_block *restart;
+ struct futex_hash_bucket *hb;
+ struct futex_q q;
+ int ret;
+
+ if (!bitset)
+ return -EINVAL;
+
+ q.pi_state = NULL;
+ q.bitset = bitset;
+
+ if (abs_time) {
+ to = &timeout;
+
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time,
+ current->timer_slack_ns);
+ }
+
+ /* Prepare to wait on uaddr. */
+ ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
+ if (ret)
+ goto out;
+
/* queue_me and wait for wakeup, timeout, or a signal. */
futex_wait_queue_me(hb, &q, to, &wait);
PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
pending waiters as this will break the PI boosting logic, so the standard
requeue commands aren't sufficient. The new commands properly manage pi futex
ownership by ensuring a futex with waiters has an owner at all times. This
will allow glibc to properly handle pi mutexes with pthread_condvars.
The approach taken here is to create two new futex op codes:
FUTEX_WAIT_REQUEUE_PI:
Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will acquire this pi futex (and the underlying rt_mutex).
futex_wait_requeue_pi() is the result of a high speed collision between
futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many tasks the caller intends to wake or requeue.
pthread_cond_broadcast() should call this with nr_wake=1 and
nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
nr_requeue=0. This is because both callers need the benefit of the
futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
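At the raw syscall level the intended usage looks roughly like this (a
sketch; the multiplexed futex syscall argument layout is assumed to follow
FUTEX_WAIT_BITSET and FUTEX_CMP_REQUEUE, with the glibc patch hiding these
details):

	/* waiter: wait on uaddr, return holding the PI lock at uaddr2 */
	syscall(SYS_futex, uaddr, FUTEX_WAIT_REQUEUE_PI, val,
		timeout, uaddr2, 0);

	/* waker: wake 1 waiter, requeue up to INT_MAX more onto uaddr2 */
	syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE_PI, 1,
		(void *)(unsigned long)INT_MAX, uaddr2, cmpval);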
Changelog:
V7: -Corrected FLAGS_HAS_TIMEOUT flag detection logic (per Eric Dumazet)
-Updated futex key management logic in futex_wait_requeue_pi() to match
futex_wait, adding retry_private: label
-Added requeue_pi_wake_futex()
-Handle lock acquisition during requeue loop (per tglx)
-Several comment and coding style updates (per tglx)
-Don't use a restart_block in futex_wait_requeue_pi() when interrupted
prior to requeue since we use an absolute timeout.
-Handle lock stealing case in futex_wait_requeue_pi() (per tglx)
-Make use of futex_wait_setup()
-Use futex_q.rt_waiter instead of futex_q.lock_ptr to detect lock
acquisition in futex_wait_requeue_pi()
-Fix bug in futex_wait_requeue_pi() passing wrong uaddr to fixup_owner()
V6: -Moved non requeue_pi related fixes/changes into separate patches
-Make use of new double_unlock_hb()
-Futex key management updates
-Removed unnecessary futex_requeue_pi_cleanup() routine
-Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
-Rewrote futex_wait_requeue_pi() wakeup logic
-Rewrote requeue/wakeup loop
-Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
-Handle third party owner, removed -EMORON :-(
-Comment updates
V5: -Update futex_requeue to allow for nr_requeue == 0
-Whitespace cleanup
-Added task_count var to futex_requeue to avoid confusion between
ret, res, and ret used to count wakes and requeues
V4: -Cleanups to pass checkpatch.pl
-Added missing goto out; in futex_wait_requeue_pi()
-Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
are a functional pair.
-Fixed several error exit paths that failed to unqueue the futex_q, which
not only would leave the futex_q on the hb, but would have caused an exit
race with the waiter since they weren't synchronized on the hb lock. Thanks
Sripathi for catching this.
-Fix pi_state handling in futex_requeue
-Several other minor fixes to futex_requeue_pi
-add requeue_futex function and force the requeue in requeue_pi even for the
task we wake in the requeue loop
-refill the pi state cache at the beginning of futex_requeue for requeue_pi
-have futex_requeue_pi_init ensure it stores off the pi_state for use in
futex_requeue
- Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
- Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
receives a signal after waking on uaddr2. Added has_timeout to the
restart->futex structure.
V3: -Added FUTEX_CMP_REQUEUE_PI op
-Put fshared support back. So long as it is encoded in the op code, we
assume both uaddrs are either private or shared, but not mixed.
-Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
-Updated fault handling and exit logic
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
include/linux/futex.h | 8 +
include/linux/thread_info.h | 3
kernel/futex.c | 519 +++++++++++++++++++++++++++++++++++++++++--
3 files changed, 510 insertions(+), 20 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 3bf5bb5..b05519c 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -23,6 +23,9 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_WAIT_BITSET 9
#define FUTEX_WAKE_BITSET 10
+#define FUTEX_WAIT_REQUEUE_PI 11
+#define FUTEX_REQUEUE_PI 12
+#define FUTEX_CMP_REQUEUE_PI 13
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -38,6 +41,11 @@ union ktime;
#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PI_PRIVATE (FUTEX_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index e6b820f..a8cc4e1 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -21,13 +21,14 @@ struct restart_block {
struct {
unsigned long arg0, arg1, arg2, arg3;
};
- /* For futex_wait */
+ /* For futex_wait and futex_wait_requeue_pi */
struct {
u32 *uaddr;
u32 val;
u32 flags;
u32 bitset;
u64 time;
+ u32 *uaddr2;
} futex;
/* For nanosleep */
struct {
diff --git a/kernel/futex.c b/kernel/futex.c
index bf657db..32494ea 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -19,6 +19,10 @@
* PRIVATE futexes by Eric Dumazet
* Copyright (C) 2007 Eric Dumazet <[email protected]>
*
+ * Requeue-PI support by Darren Hart <[email protected]>
+ * Copyright (C) IBM Corporation, 2009
+ * Thanks to Thomas Gleixner for conceptual design and careful reviews.
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -109,6 +113,9 @@ struct futex_q {
struct futex_pi_state *pi_state;
struct task_struct *task;
+ /* rt_waiter storage for requeue_pi: */
+ struct rt_mutex_waiter *rt_waiter;
+
/* Bitset for the optional bitmasked wakeup */
u32 bitset;
};
@@ -827,7 +834,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
- if (this->pi_state) {
+ if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
@@ -968,20 +975,138 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
q->key = *key2;
}
-/*
- * Requeue all waiters hashed on one physical page to another
- * physical page.
+/**
+ * requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue
+ * @q: the futex_q
+ * @key: the key of the requeue target futex
+ *
+ * During futex_requeue, with requeue_pi=1, it is possible to acquire the
+ * target futex if it is uncontended or via a lock steal. Set the futex_q key
+ * to the requeue target futex so the waiter can detect the wakeup on the right
+ * futex, but remove it from the hb and NULL the rt_waiter so it can detect
+ * atomic lock acquisition. Must be called with the q->lock_ptr held.
+ */
+static inline
+void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key)
+{
+ drop_futex_key_refs(&q->key);
+ get_futex_key_refs(key);
+ q->key = *key;
+
+ WARN_ON(plist_node_empty(&q->list));
+ plist_del(&q->list, &q->list.plist);
+
+ WARN_ON(!q->rt_waiter);
+ q->rt_waiter = NULL;
+
+ wake_up(&q->waiter);
+}
+
+/**
+ * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
+ * @pifutex: the user address of the to futex
+ * @hb1: the from futex hash bucket, must be locked by the caller
+ * @hb2: the to futex hash bucket, must be locked by the caller
+ * @key1: the from futex key
+ * @key2: the to futex key
+ *
+ * Try and get the lock on behalf of the top waiter if we can do it atomically.
+ * Wake the top waiter if we succeed. hb1 and hb2 must be held by the caller.
+ *
+ * Returns:
+ * 0 - failed to acquire the lock atomically
+ * 1 - acquired the lock
+ * <0 - error
+ */
+static int futex_proxy_trylock_atomic(u32 __user *pifutex,
+ struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2,
+ union futex_key *key1, union futex_key *key2,
+ struct futex_pi_state **ps)
+{
+ struct futex_q *top_waiter;
+ u32 curval;
+ int ret;
+
+ if (get_futex_value_locked(&curval, pifutex))
+ return -EFAULT;
+
+ top_waiter = futex_top_waiter(hb1, key1);
+
+ /* There are no waiters, nothing for us to do. */
+ if (!top_waiter)
+ return 0;
+
+ /*
+ * Either take the lock for top_waiter or set the FUTEX_WAITERS bit.
+ * The pi_state is returned in ps in contended cases.
+ */
+ ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task);
+ if (ret == 1)
+ requeue_pi_wake_futex(top_waiter, key2);
+
+ return ret;
+}
+
+/**
+ * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
+ * @uaddr1: source futex user address
+ * @fshared: whether the futexes are shared (1) or private (0)
+ * @uaddr2: target futex user address
+ * @nr_wake: number of waiters to wake (must be 1 for requeue_pi)
+ * @nr_requeue: number of waiters to requeue (0-INT_MAX)
+ * @cmpval: expected value of uaddr1, or NULL to skip the comparison
+ * @requeue_pi: if we are attempting to requeue from a non-pi futex to a
+ * pi futex (pi to pi requeue is not supported)
+ *
+ * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
+ * uaddr2 atomically on behalf of the top waiter.
+ *
+ * Returns:
+ * >=0 - on success, the number of tasks requeued or woken
+ * <0 - on error
*/
static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
- int nr_wake, int nr_requeue, u32 *cmpval)
+ int nr_wake, int nr_requeue, u32 *cmpval,
+ int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
+ int drop_count = 0, task_count = 0, ret;
+ struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct plist_head *head1;
struct futex_q *this, *next;
- int ret, drop_count = 0;
+ u32 curval2;
+
+ if (requeue_pi) {
+ /*
+ * requeue_pi requires a pi_state; try to allocate it now
+ * without any locks in case it fails.
+ */
+ if (refill_pi_state_cache())
+ return -ENOMEM;
+ /*
+ * requeue_pi must wake as many tasks as it can, up to nr_wake
+ * + nr_requeue, since it acquires the rt_mutex prior to
+ * returning to userspace, so as to not leave the rt_mutex with
+ * waiters and no owner. However, second and third wake-ups
+ * cannot be predicted as they involve race conditions with the
+ * first wake and a fault while looking up the pi_state. Both
+ * pthread_cond_signal() and pthread_cond_broadcast() should
+ * use nr_wake=1.
+ */
+ if (nr_wake != 1)
+ return -EINVAL;
+ }
retry:
+ if (pi_state != NULL) {
+ /*
+ * We will have to lookup the pi_state again, so free this one
+ * to keep the accounting correct.
+ */
+ free_pi_state(pi_state);
+ pi_state = NULL;
+ }
+
ret = get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret != 0))
goto out;
@@ -1020,19 +1145,94 @@ retry_private:
}
}
+ if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
+ /* Attempt to acquire uaddr2 and wake the top_waiter. */
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state);
+
+ /*
+ * At this point the top_waiter has either taken uaddr2 or is
+ * waiting on it. If the former, then the pi_state will not
+ * exist yet; look it up one more time to ensure we have a
+ * reference to it.
+ */
+ if (ret == 1) {
+ WARN_ON(pi_state);
+ task_count++;
+ ret = get_futex_value_locked(&curval2, uaddr2);
+ if (!ret)
+ ret = lookup_pi_state(curval2, hb2, &key2,
+ &pi_state);
+ }
+
+ switch (ret) {
+ case 0:
+ break;
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ ret = get_user(curval2, uaddr2);
+ if (!ret)
+ goto retry;
+ goto out;
+ case -EAGAIN:
+ /* The owner was exiting, try again. */
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
+ }
+
head1 = &hb1->chain;
plist_for_each_entry_safe(this, next, head1, list) {
- if (!match_futex (&this->key, &key1))
+ if (task_count - nr_wake >= nr_requeue)
+ break;
+
+ if (!match_futex(&this->key, &key1))
continue;
- if (++ret <= nr_wake) {
+
+ WARN_ON(!requeue_pi && this->rt_waiter);
+ WARN_ON(requeue_pi && !this->rt_waiter);
+
+ /*
+ * Wake nr_wake waiters. For requeue_pi, if we acquired the
+ * lock, we already woke the top_waiter. If not, it will be
+ * woken by futex_unlock_pi().
+ */
+ if (++task_count <= nr_wake && !requeue_pi) {
wake_futex(this);
- } else {
- requeue_futex(this, hb1, hb2, &key2);
- drop_count++;
+ continue;
+ }
- if (ret - nr_wake >= nr_requeue)
- break;
+ /*
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ */
+ if (requeue_pi) {
+ /* Prepare the waiter to take the rt_mutex. */
+ atomic_inc(&pi_state->refcount);
+ this->pi_state = pi_state;
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task, 1);
+ if (ret == 1) {
+ /* We got the lock. */
+ requeue_pi_wake_futex(this, &key2);
+ continue;
+ } else if (ret) {
+ /* -EDEADLK */
+ this->pi_state = NULL;
+ free_pi_state(pi_state);
+ goto out_unlock;
+ }
}
+ requeue_futex(this, hb1, hb2, &key2);
+ drop_count++;
}
out_unlock:
@@ -1047,7 +1247,9 @@ out_put_keys:
out_put_key1:
put_futex_key(fshared, &key1);
out:
- return ret;
+ if (pi_state != NULL)
+ free_pi_state(pi_state);
+ return ret ? ret : task_count;
}
/* The key must be already stored in q->key. */
@@ -1270,6 +1472,7 @@ handle_fault:
#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
+static long futex_lock_pi_restart(struct restart_block *restart);
/**
* fixup_owner() - Post lock pi_state and corner case management
@@ -1489,6 +1692,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,
q.pi_state = NULL;
q.bitset = bitset;
+ q.rt_waiter = NULL;
if (abs_time) {
to = &timeout;
@@ -1596,6 +1800,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
}
q.pi_state = NULL;
+ q.rt_waiter = NULL;
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, fshared, &q.key);
@@ -1701,6 +1906,20 @@ uaddr_faulted:
goto retry;
}
+static long futex_lock_pi_restart(struct restart_block *restart)
+{
+ u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
+ ktime_t t, *tp = NULL;
+ int fshared = restart->futex.flags & FLAGS_SHARED;
+
+ if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
+ restart->fn = do_no_restart_syscall;
+
+ return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
+}
/*
* Userspace attempted a TID -> 0 atomic transition, and failed.
@@ -1803,6 +2022,253 @@ pi_faulted:
return ret;
}
+/**
+ * handle_early_requeue_pi_wakeup() - Detect early wakeup on the initial futex
+ * @hb: the hash_bucket futex_q was originally enqueued on
+ * @q: the futex_q woken while waiting to be requeued
+ * @key2: the futex_key of the requeue target futex
+ * @timeout: the timeout associated with the wait (NULL if none)
+ *
+ * Detect if the task was woken on the initial futex as opposed to the requeue
+ * target futex. If so, determine if it was a timeout or a signal that caused
+ * the wakeup and return the appropriate error code to the caller. Must be
+ * called with the hb lock held.
+ *
+ * Returns:
+ * 0 - no early wakeup detected
+ * <0 - -ETIMEDOUT or -ERESTARTSYS (FIXME: or ERESTARTNOINTR?)
+ */
+static inline
+int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
+ struct futex_q *q, union futex_key *key2,
+ struct hrtimer_sleeper *timeout)
+{
+ int ret = 0;
+
+ /*
+ * With the hb lock held, we avoid races while we process the wakeup.
+ * We only need to hold hb (and not hb2) to ensure atomicity as the
+ * wakeup code can't change q.key from uaddr to uaddr2 if we hold hb.
+ * It can't be requeued from uaddr2 to something else since we don't
+ * support a PI aware source futex for requeue.
+ */
+ if (!match_futex(&q->key, key2)) {
+ WARN_ON(q->lock_ptr && (&hb->lock != q->lock_ptr));
+ /*
+ * We were woken prior to requeue by a timeout or a signal.
+ * Unqueue the futex_q and determine which it was.
+ */
+ plist_del(&q->list, &q->list.plist);
+ drop_futex_key_refs(&q->key);
+
+ if (timeout && !timeout->task)
+ ret = -ETIMEDOUT;
+ else {
+ /*
+ * We expect signal_pending(current), but another
+ * thread may have handled it for us already.
+ */
+ /* FIXME: ERESTARTSYS or ERESTARTNOINTR? Do we care if
+ * the user specified SA_RESTART or not? */
+ ret = -ERESTARTSYS;
+ }
+ }
+ return ret;
+}
+
+/**
+ * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
+ * @uaddr: the futex we initially wait on (non-pi)
+ * @fshared: whether the futexes are shared (1) or not (0). They must be
+ * the same type, no requeueing from private to shared, etc.
+ * @val: the expected value of uaddr
+ * @abs_time: absolute timeout
+ * @bitset: 32 bit wakeup bitset set by userspace, defaults to all.
+ * @clockrt: whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
+ * @uaddr2: the pi futex we will take prior to returning to user-space
+ *
+ * The caller will wait on uaddr and will be requeued by futex_requeue() to
+ * uaddr2 which must be PI aware. Normal wakeup will wake on uaddr2 and
+ * complete the acquisition of the rt_mutex prior to returning to userspace.
+ * This ensures the rt_mutex maintains an owner when it has waiters; without
+ * one, the pi logic wouldn't know which task to boost/deboost, if there was a
+ * need to.
+ *
+ * We call schedule in futex_wait_queue_me() when we enqueue and return there
+ * via the following:
+ * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
+ * 2) wakeup on uaddr2 after a requeue and subsequent unlock
+ * 3) signal (before or after requeue)
+ * 4) timeout (before or after requeue)
+ *
+ * If 3, we setup a restart_block with futex_wait_requeue_pi() as the function.
+ *
+ * If 2, we may then block on trying to take the rt_mutex and return via:
+ * 5) successful lock
+ * 6) signal
+ * 7) timeout
+ * 8) other lock acquisition failure
+ *
+ * If 6, we setup a restart_block with futex_lock_pi() as the function.
+ *
+ * If 4 or 7, we cleanup and return with -ETIMEDOUT.
+ *
+ * Returns:
+ * 0 - On success
+ * <0 - On error
+ */
+static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
+ u32 val, ktime_t *abs_time, u32 bitset,
+ int clockrt, u32 __user *uaddr2)
+{
+ struct hrtimer_sleeper timeout, *to = NULL;
+ struct rt_mutex_waiter rt_waiter;
+ struct rt_mutex *pi_mutex = NULL;
+ DECLARE_WAITQUEUE(wait, current);
+ struct restart_block *restart;
+ struct futex_hash_bucket *hb;
+ union futex_key key2;
+ struct futex_q q;
+ int res, ret;
+ u32 uval;
+
+ if (!bitset)
+ return -EINVAL;
+
+ if (abs_time) {
+ to = &timeout;
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time,
+ current->timer_slack_ns);
+ }
+
+ /*
+ * The waiter is allocated on our stack, manipulated by the requeue
+ * code while we sleep on uaddr.
+ */
+ debug_rt_mutex_init_waiter(&rt_waiter);
+ rt_waiter.task = NULL;
+
+ q.pi_state = NULL;
+ q.bitset = bitset;
+ q.rt_waiter = &rt_waiter;
+
+ key2 = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr2, fshared, &key2);
+ if (unlikely(ret != 0))
+ goto out;
+
+ /* Prepare to wait on uaddr. */
+ ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
+ if (ret) {
+ put_futex_key(fshared, &key2);
+ goto out;
+ }
+
+ /* Queue the futex_q, drop the hb lock, wait for wakeup. */
+ futex_wait_queue_me(hb, &q, to, &wait);
+
+ spin_lock(&hb->lock);
+ ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
+ spin_unlock(&hb->lock);
+ if (ret)
+ goto out_put_keys;
+
+ /*
+ * In order for us to be here, we know our q.key == key2, and since
+ * we took the hb->lock above, we also know that futex_requeue() has
+ * completed and we no longer have to concern ourselves with a wakeup
+ * race with the atomic proxy lock acquisition by the requeue code.
+ */
+
+ /* Check if the requeue code acquired the second futex for us. */
+ if (!q.rt_waiter) {
+ /*
+ * Got the lock. We might not be the anticipated owner if we
+ * did a lock-steal - fix up the PI-state in that case.
+ */
+ if (q.pi_state && (q.pi_state->owner != current)) {
+ spin_lock(q.lock_ptr);
+ ret = fixup_pi_state_owner(uaddr2, &q, current,
+ fshared);
+ spin_unlock(q.lock_ptr);
+ }
+ } else {
+ /*
+ * We have been woken up by futex_unlock_pi(), a timeout, or a
+ * signal. futex_unlock_pi() will not destroy the lock_ptr nor
+ * the pi_state.
+ */
+ WARN_ON(!q.pi_state);
+ pi_mutex = &q.pi_state->pi_mutex;
+ ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
+ debug_rt_mutex_free_waiter(&rt_waiter);
+
+ spin_lock(q.lock_ptr);
+ /*
+ * Fixup the pi_state owner and possibly acquire the lock if we
+ * haven't already.
+ */
+ res = fixup_owner(uaddr2, fshared, &q, !ret);
+ /*
+ * If fixup_owner() returned an error, propagate that. If it
+ * acquired the lock, clear our -ETIMEDOUT or -EINTR.
+ */
+ if (res)
+ ret = (res < 0) ? res : 0;
+
+ /* Unqueue and drop the lock. */
+ unqueue_me_pi(&q);
+ }
+
+ /*
+ * If fixup_pi_state_owner() faulted and was unable to handle the
+ * fault, unlock the rt_mutex and return the fault to userspace.
+ */
+ if (ret == -EFAULT) {
+ if (rt_mutex_owner(pi_mutex) == current)
+ rt_mutex_unlock(pi_mutex);
+ } else if (ret == -EINTR) {
+ ret = -EFAULT;
+ if (get_user(uval, uaddr2))
+ goto out_put_keys;
+
+ /*
+ * We've already been requeued, so restart by calling
+ * futex_lock_pi() directly, rather than returning to this
+ * function.
+ */
+ ret = -ERESTART_RESTARTBLOCK;
+ restart = &current_thread_info()->restart_block;
+ restart->fn = futex_lock_pi_restart;
+ restart->futex.uaddr = (u32 *)uaddr2;
+ restart->futex.val = uval;
+ restart->futex.flags = 0;
+ if (abs_time) {
+ restart->futex.flags |= FLAGS_HAS_TIMEOUT;
+ restart->futex.time = abs_time->tv64;
+ }
+
+ if (fshared)
+ restart->futex.flags |= FLAGS_SHARED;
+ if (clockrt)
+ restart->futex.flags |= FLAGS_CLOCKRT;
+ }
+
+out_put_keys:
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+
+out:
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
+ return ret;
+}
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -2025,7 +2491,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
fshared = 1;
clockrt = op & FUTEX_CLOCK_REALTIME;
- if (clockrt && cmd != FUTEX_WAIT_BITSET)
+ if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
switch (cmd) {
@@ -2040,10 +2506,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
ret = futex_wake(uaddr, fshared, val, val3);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 0);
break;
case FUTEX_WAKE_OP:
ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
@@ -2060,6 +2527,18 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
if (futex_cmpxchg_enabled)
ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
break;
+ case FUTEX_WAIT_REQUEUE_PI:
+ val3 = FUTEX_BITSET_MATCH_ANY;
+ ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
+ clockrt, uaddr2);
+ break;
+ case FUTEX_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 1);
+ break;
+ case FUTEX_CMP_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 1);
+ break;
default:
ret = -ENOSYS;
}
@@ -2077,7 +2556,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
int cmd = op & FUTEX_CMD_MASK;
if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
- cmd == FUTEX_WAIT_BITSET)) {
+ cmd == FUTEX_WAIT_BITSET ||
+ cmd == FUTEX_WAIT_REQUEUE_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
@@ -2089,10 +2569,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
tp = &t;
}
/*
- * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
* number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
*/
if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
+ cmd == FUTEX_REQUEUE_PI || cmd == FUTEX_CMP_REQUEUE_PI ||
cmd == FUTEX_WAKE_OP)
val2 = (u32) (unsigned long) utime;
Darren,
On Fri, 3 Apr 2009, Darren Hart wrote:
> [snip]
I went through the patches with a fine comb again and there is nothing
left which triggered my futex wreckage sensors. Thanks for your
patience to go through the lather, rinse, repeat drill.
I think we are at a point where that code simply needs exposure to the
hostile environment of RT-Java VMs. I'm going to pull that into
tip/next and into -rt. Even if we have no requeue_pi user right now we
definitely want to test the heck out of the changes which also affect
the existing futex ops.
Uli, Jakub, can you please go over the design and the user space
interface?
Darren, could you please polish the initial design notes - especially
point out the subtle differences between requeue and requeue_pi - and
send them into the thread? That might help Uli and Jakub and we
definitely want to have that info preserved in Documentation/ as well.
Thanks,
tglx
Thomas Gleixner wrote:
> Darren,
>
> On Fri, 3 Apr 2009, Darren Hart wrote:
>> [snip]
>
> [snip]
>
> Darren, could you please polish the initial design notes - especially
> point out the subtle differences between requeue and requeue_pi - and
> send them into the thread? That might help Uli and Jakub and we
> definitely want to have that info preserved in Documentation/ as well.
>
Thanks Thomas! I'll review and update the docs (the emails you sent me
last year along with git commit messages for these patches) and send out
a new requeue_pi design and implementation document that we can consider
for inclusion in Documentation/. I'll try to have something out on Monday.
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Thomas Gleixner wrote:
> Darren,
>
> On Fri, 3 Apr 2009, Darren Hart wrote:
>> [snip]
>
> [snip]
>
> Darren, could you please polish the initial design notes - especially
> point out the subtle differences between requeue and requeue_pi - and
> send them into the thread? That might help Uli and Jakub and we
> definitely want to have that info preserved in Documentation/ as well.
I'll be away most of the day so I've whipped something up quickly in
order to get discussion started. It is probably a little heavy on the
glibc details and a little light on the kernel details for inclusion in
Documentation/. Please review for high-level content and let me know
what you would like to see more or less of. I'll pick this back up
tomorrow to incorporate any suggestions and flesh out the kernel details
a bit more.
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Futex Requeue PI
----------------
Requeueing of tasks from a non-PI futex to a PI futex requires special
handling in order to ensure the underlying rt_mutex is never left without an
owner if it has waiters; doing so would break the PI boosting logic [see
rt-mutex-desgin.txt] For the purposes of brevity, this action will be referred
to as "requeue_pi" throughout this document. Priority inheritance is
abbreviated throughout as "PI".
Motivation
----------
Without requeue_pi, the current implementation of pthread_cond_broadcast()
must resort to waking all the tasks waiting on a pthread_condvar and letting
them try and sort out which task gets to run first in classic thundering-herd
formation. An ideal implementation would wake the highest priority waiter,
and leave the rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.
Consider the simplified glibc calls:
/* caller must lock mutex */
pthread_cond_wait(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
lock(mutex);
}
pthread_cond_broadcast(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue(cond->data.__futex, cond->mutex);
}
Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
Note that pthread_cond_wait() attempts to lock the mutex only after it has
returned to userspace. This will leave the underlying rt_mutex with waiters,
and no owner, breaking the previously mentioned PI boosting algorithms.
In order to support PI-aware pthread_condvar's, the kernel needs to be able to
requeue tasks to PI futexes. This support implies that upon a successful
futex_wait system call, the caller would return to userspace already holding
the PI futex. As such the glibc implemenation would be modified as follows:
/* caller must lock mutex */
pthread_cond_wait_pi(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait_requeue_pi(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
/* the kernel acquired the the mutex for us */
}
pthread_cond_broadcast_pi(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue_pi(cond->data.__futex, cond->mutex);
}
The actual glibc implemenation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new calls
for the PI cases. Similar changes are needed for pthread_cond_timedwait()
and pthread_cond_signal().
Implementation
--------------
In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
for the both the requeue code as well as the waiting code to be able to acquire
the rt_mutex before returning to userspace (the requeue code cannot simply wake
the waiter and leave it to acquire the rt_mutex as it would open a race window
between the requeue call returning to userspace and the waiter waking and
starting to run. This is especially true in the uncontended case.
To accomplish this, rt_mutex lock acquistion has to be able to be accomlished
vicariously by the requeue code on behalf of the waiter. This is accomplished
via two new rt_mutex helper routines: rt_mutex_start_proxy_lock() (called by
the requeue code) and rt_mutex_finish_proxy_lock(), called by the waiter upon
wakeup.
Two new system calls provide the kernel<->userspace interface to requeue_pi:
FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
to a PI aware futex. The implemenation is basically the result of a high speed
collision between futex_wait() and futex_lock_pi(), with some extra logic to
check for the additional wake-up scenarios.
FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
Internally, this system call is still handled by futex_requeue (by passing
requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
requeue target PI futex on behalf of the top waiter, if it can, this waiter is
woken. It then proceeds to requeue the remaining nr_wake+nr_requeue tasks to
the PI futex. It calls rt_mutex_start_proxy_lock() prior to each requeue to
prepare the task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well, if so, the next waiter is woken
to finish the acquisition of the lock. FUTEX_REQUEUE_PI accepts nr_wake and
nr_requeue as arguments, but their sum is all that really matters. It will
wake or requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases should be
0 as good programming practice dictates that the caller of either
pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to
making the call. FUTEX_REQUEUE_PI requires that nr_wake=1. nr_requeue should
be INT_MAX for broadcast and 0 for signal.
Darren Hart wrote:
> Thomas Gleixner wrote:
>> Darren,
>>
>> On Fri, 3 Apr 2009, Darren Hart wrote:
[snip]
>> Darren, could you please polish the initial design notes - especially
>> point out the subtle differences between requeue and requeue_pi - and
>> send them into the thread? That might help Uli and Jakub and we
>> definitly want to have that info preserved in Documentation/ as well.
>
> I'll be away most of the day so I've whipped something up quickly in
> order to get discussion started. It is probably a little heavy on the
> glibc details and a little light on the kernel details for inclusion in
> Documentation/. Please review for high-level content and let me know
> what you would like to see more or less of. I'll pick this back up
> tomorrow to incorporate any suggestions and flesh out the kernel details
> a bit more.
>
Darren,
A gentle review of your documentation follows.
> Futex Requeue PI
> ----------------
>
> Requeueing of tasks from a non-PI futex to a PI futex requires special
> handling in order to ensure the underlying rt_mutex is never left without an
> owner if it has waiters; doing so would break the PI boosting logic [see
> rt-mutex-desgin.txt] For the purposes of brevity, this action will be referred
> to as "requeue_pi" throughout this document. Priority inheritance is
> abbreviated throughout as "PI".
>
> Motivation
> ----------
>
> Without requeue_pi, the current implementation of pthread_cond_broadcast()
> must resort to waking all the tasks waiting on a pthread_condvar and letting
> them try and sort out which task gets to run first in classic thundering-herd
^try to sort out^
> formation. An ideal implementation would wake the highest priority waiter,
^highest-priority^
> and leave the rest to the natural wakeup inherent in unlocking the mutex
> associated with the condvar.
>
> Consider the simplified glibc calls:
>
> /* caller must lock mutex */
> pthread_cond_wait(cond, mutex)
> {
> lock(cond->__data.__lock);
> unlock(mutex);
> do {
> unlock(cond->__data.__lock);
> futex_wait(cond->__data.__futex);
> lock(cond->__data.__lock);
> } while(...)
> unlock(cond->__data.__lock);
> lock(mutex);
> }
>
> pthread_cond_broadcast(cond)
> {
> lock(cond->__data.__lock);
> unlock(cond->__data.__lock);
> futex_requeue(cond->data.__futex, cond->mutex);
> }
>
> Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
> Note that pthread_cond_wait() attempts to lock the mutex only after it has
> returned to userspace. This will leave the underlying rt_mutex with waiters,
> and no owner, breaking the previously mentioned PI boosting algorithms.
^PI-boosting^
>
> In order to support PI-aware pthread_condvar's, the kernel needs to be able to
> requeue tasks to PI futexes. This support implies that upon a successful
> futex_wait system call, the caller would return to userspace already holding
> the PI futex. As such the glibc implemenation would be modified as follows:
^implementation^
>
>
> /* caller must lock mutex */
> pthread_cond_wait_pi(cond, mutex)
> {
> lock(cond->__data.__lock);
> unlock(mutex);
> do {
> unlock(cond->__data.__lock);
> futex_wait_requeue_pi(cond->__data.__futex);
> lock(cond->__data.__lock);
> } while(...)
> unlock(cond->__data.__lock);
> /* the kernel acquired the the mutex for us */
> }
>
> pthread_cond_broadcast_pi(cond)
> {
> lock(cond->__data.__lock);
> unlock(cond->__data.__lock);
> futex_requeue_pi(cond->data.__futex, cond->mutex);
> }
>
> The actual glibc implemenation will likely test for PI and make the
^implementation^
> necessary changes inside the existing calls rather than creating new calls
> for the PI cases. Similar changes are needed for pthread_cond_timedwait()
> and pthread_cond_signal().
>
> Implementation
> --------------
>
> In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
> for the both the requeue code as well as the waiting code to be able to acquire
> the rt_mutex before returning to userspace (the requeue code cannot simply wake
I would suggest not using a parenthetical expression here because of the length
of the enclosed sentence. Also, you have a missing right parenthesis.
> the waiter and leave it to acquire the rt_mutex as it would open a race window
> between the requeue call returning to userspace and the waiter waking and
> starting to run. This is especially true in the uncontended case.
>
> To accomplish this, rt_mutex lock acquistion has to be able to be accomlished
^acquisition^ ^accomplished^
> vicariously by the requeue code on behalf of the waiter. This is accomplished
Lots of accomplishing going on here^^^. And one is missing its 'p'. Maybe
rewrite it as:
To accomplish this, the rt_mutex lock must be acquired vicariously by the
requeue code on behalf of the waiter. This is done by two new rt_mutex
helper routines: rt_mutex_start_proxy_lock(), called by the requeue code;
and rt_mutex_finish_proxy_lock(), called by the waiter upon wakeup.
> via two new rt_mutex helper routines: rt_mutex_start_proxy_lock() (called by
> the requeue code) and rt_mutex_finish_proxy_lock(), called by the waiter upon
> wakeup.
I also changed the punctuation there to remove the parenthetical expression
modifying the first function, where only a comma was used on the second. Too
many commas and modifications made me think that maybe a semicolon would work
well.
>
> Two new system calls provide the kernel<->userspace interface to requeue_pi:
> FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
>
> FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
> pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
> to a PI aware futex. The implemenation is basically the result of a high speed
^PI-aware^ ^implementation^ ^high-speed^
> collision between futex_wait() and futex_lock_pi(), with some extra logic to
> check for the additional wake-up scenarios.
>
> FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
> pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
> Internally, this system call is still handled by futex_requeue (by passing
> requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
> requeue target PI futex on behalf of the top waiter, if it can, this waiter is
^. If^
> woken. It then proceeds to requeue the remaining nr_wake+nr_requeue tasks to
> the PI futex. It calls rt_mutex_start_proxy_lock() prior to each requeue to
> prepare the task as a waiter on the underlying rt_mutex. It is possible that
> the lock can be acquired at this stage as well, if so, the next waiter is woken
^. If^
> to finish the acquisition of the lock. FUTEX_REQUEUE_PI accepts nr_wake and
> nr_requeue as arguments, but their sum is all that really matters. It will
Lots of 'It' by this point. Maybe replacing 'it' with a proper noun a couple
of times would help make sure we are all on the same page.
> wake or requeue up to nr_wake + nr_requeue tasks. It will wake only as many
> tasks as it can acquire the lock for, which in the majority of cases should be
> 0 as good programming practice dictates that the caller of either
^0, as^
> pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to
> making the call. FUTEX_REQUEUE_PI requires that nr_wake=1. nr_requeue should
> be INT_MAX for broadcast and 0 for signal.
One final point is that generally numbers less than ten are spelled out, but
since this is a technical document where we see things like nr_wake=1, I figured
it would look more consistent to let those 0s slide.
Keep in mind that I am not really an editor so take my changes with a grain of
salt. (I am only *married* to an editor and if she gets wind that I am
pretending to be one on the side I will be in big trouble :)
--Vernon
Thanks to Vernon and Nivedita for some feedback on the documentation. Included
below is the updated document, mostly grammatical and spelling fixes. I'd still
appreciate some feedback as to content depth and how much context I should
assume in this document.
Futex Requeue PI
----------------
Requeueing of tasks from a non-PI futex to a PI futex requires special handling
in order to ensure the underlying rt_mutex is never left without an owner if it
has waiters; doing so would break the PI-boosting logic [see
rt-mutex-design.txt]. For brevity, this action will be referred
to as "requeue_pi" throughout this document. Priority inheritance is
abbreviated throughout as "PI".
Motivation
----------
Without requeue_pi, the current implementation of pthread_cond_broadcast() must
resort to waking all the tasks waiting on a pthread_condvar and letting them
try to sort out which task gets to run first in classic thundering-herd
formation. An ideal implementation would wake the highest-priority waiter, and
leave the rest to the natural wakeup inherent in unlocking the mutex associated
with the condvar.
Consider the simplified glibc calls:
/* caller must lock mutex */
pthread_cond_wait(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
lock(mutex);
}
pthread_cond_broadcast(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue(cond->data.__futex, cond->mutex);
}
Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
Note that pthread_cond_wait() attempts to lock the mutex only after it has
returned to userspace. This will leave the underlying rt_mutex with waiters,
and no owner, breaking the previously mentioned PI-boosting algorithms.
In order to support PI-aware pthread_condvars, the kernel needs to be able to
requeue tasks to PI futexes. This support implies that upon a successful
futex_wait system call, the caller would return to userspace already holding
the PI futex. The glibc implementation would be modified as follows:
/* caller must lock mutex */
pthread_cond_wait_pi(cond, mutex)
{
lock(cond->__data.__lock);
unlock(mutex);
do {
unlock(cond->__data.__lock);
futex_wait_requeue_pi(cond->__data.__futex);
lock(cond->__data.__lock);
} while(...)
unlock(cond->__data.__lock);
/* the kernel acquired the mutex for us */
}
pthread_cond_broadcast_pi(cond)
{
lock(cond->__data.__lock);
unlock(cond->__data.__lock);
futex_requeue_pi(cond->data.__futex, cond->mutex);
}
The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new calls
for the PI cases. Similar changes are needed for pthread_cond_timedwait()
and pthread_cond_signal().
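For illustration, the signal case can be sketched in the same simplified
style as the examples above (not actual glibc source); it differs from
broadcast only in how many waiters are requeued:

	pthread_cond_signal_pi(cond)
	{
		lock(cond->__data.__lock);
		unlock(cond->__data.__lock);
		/* nr_wake=1 as required; nr_requeue=0 here,
		 * INT_MAX for the broadcast case */
		futex_requeue_pi(cond->data.__futex, cond->mutex);
	}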
Implementation
--------------
In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
for both the requeue code and the waiting code to be able to acquire
the rt_mutex before returning to userspace. The requeue code cannot simply wake
the waiter and leave it to acquire the rt_mutex as it would open a race window
between the requeue call returning to userspace and the waiter waking and
starting to run. This is especially true in the uncontended case.
The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which allow the
requeue code to acquire an uncontended rt_mutex on behalf of the waiter and to
enqueue the waiter on a contended rt_mutex.
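Condensed from futex_requeue() and futex_wait_requeue_pi() in this series
(locking and error handling elided), the two halves of a proxy lock pair up
as follows:

	/* Requeue side, in futex_requeue(), with both hash bucket locks
	 * held: enqueue the waiter on the rt_mutex without it running. */
	ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
					this->rt_waiter, this->task, 1);

	/* Waiter side, in futex_wait_requeue_pi(), after waking on uaddr2:
	 * complete the acquisition, blocking with an optional timeout. */
	ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);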
Two new system calls provide the kernel<->userspace interface to requeue_pi:
FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
to a PI-aware futex. The implementation is the result of a high-speed
collision between futex_wait() and futex_lock_pi(), with some extra logic to
check for the additional wake-up scenarios.
FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
Internally, this system call is still handled by futex_requeue() (by passing
requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
requeue target PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling rt_mutex_start_proxy_lock()
prior to each requeue to prepare the task as a waiter on the underlying
rt_mutex. It is possible that the lock can be acquired at this stage as well;
if so, the next waiter is woken to finish the acquisition of the lock.
FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but their sum is
all that really matters. futex_requeue() will wake or requeue up to nr_wake +
nr_requeue tasks. It will wake only as many tasks as it can acquire the lock
for, which in the majority of cases should be 0, as good programming practice
dictates that the caller of either pthread_cond_broadcast() or
pthread_cond_signal() acquire the mutex prior to making the call.
FUTEX_REQUEUE_PI requires that nr_wake=1. nr_requeue should be INT_MAX for
broadcast and 0 for signal.
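To make the interface concrete, below is a rough userspace sketch in the
spirit of the raw futex syscall test case mentioned in the cover letter. It
is illustrative only: the opcode values are assumed to match this series'
futex.h additions (hence the #ifndef guards), synchronization between the
threads is crude, and the retry loops and error handling a real
implementation needs are omitted. Build with -lpthread.

	#define _GNU_SOURCE
	#include <limits.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	#ifndef FUTEX_WAIT_REQUEUE_PI	/* assumed per the patched futex.h */
	#define FUTEX_WAIT_REQUEUE_PI	11
	#define FUTEX_CMP_REQUEUE_PI	12
	#endif
	#ifndef FUTEX_UNLOCK_PI
	#define FUTEX_UNLOCK_PI		7
	#endif

	static int f1;	/* non-PI "condvar" futex */
	static int f2;	/* PI "mutex" futex, 0 == unlocked */

	static void *waiter(void *arg)
	{
		/* Block on f1 (expected value 0); on a successful return
		 * the kernel has already acquired f2 on our behalf. */
		if (syscall(SYS_futex, &f1, FUTEX_WAIT_REQUEUE_PI,
			    0, NULL, &f2, 0))
			perror("FUTEX_WAIT_REQUEUE_PI");
		else {
			pid_t tid = syscall(SYS_gettid);

			printf("woke up owning the PI futex (f2=%#x)\n", f2);
			/* Uncontended fast path: TID -> 0; otherwise have
			 * the kernel hand the lock to the next waiter. */
			if (!__sync_bool_compare_and_swap(&f2, tid, 0))
				syscall(SYS_futex, &f2, FUTEX_UNLOCK_PI,
					0, NULL, NULL, 0);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		long ret;

		pthread_create(&t, NULL, waiter, NULL);
		sleep(1);	/* crude: let the waiter block on f1 first */

		/* nr_wake=1 as required; nr_requeue rides in the timeout
		 * slot; val3 is the expected value of f1 for the CMP check. */
		ret = syscall(SYS_futex, &f1, FUTEX_CMP_REQUEUE_PI, 1,
			      (void *)(unsigned long)INT_MAX, &f2, 0);
		if (ret < 0)
			perror("FUTEX_CMP_REQUEUE_PI");
		else
			printf("woke/requeued %ld task(s)\n", ret);

		pthread_join(t, NULL);
		return 0;
	}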