The following series is v6 of the requeue_pi patches against
linux-2.6-tip/core/futexes + core/urgent + master. The current futex
implementation doesn't allow for requeueing of PI futexes, which leads to a
thundering herd during pthread_cond_broadcast() (as opposed to a civilized
priority ordered wakeup sequence). The core of the problem is that the
underlying rt_mutex cannot be left with waiters and no owner (which would break
the PI logic). This patch series updates the futex code to allow for
requeueing from non-PI to PI futexes in support of PI aware pthread_cond_*
calls along with some needful rt_mutex helper routines. The credit for the
conceptual design goes to Thomas Gleixner, while the bugs and other idiocies
present in this implementation should be attributed to me.
Since the last version I have reworked the requeue logic, renamed
futex_requeue_pi_init() to futex_proxy_trylock_atomic(), fixed several bugs in
timeout and signal handling in futex_wait_requeue_pi(), incorporated some
feedback from Thomas and others, and corrected some inconsistencies in my
comments and comment formats.
This version has been tested with a rough raw futex syscall test case as well
as with a preliminary glibc patch that updates the pthread_cond* calls to use
the new syscalls and allow for the PI calls to take ownership of the rt_mutex
inside the kernel (see the "glibc hacks for requeue_pi" at the end of this
series). With this patched glibc the LTP realtime/func/prio-wake test case has
passed more than 6k consecutive iterations (whereas before it would fail 10% of
the time). prio-wake tests the priority ordered wakeup of a
pthread_cond_broadcast() using a PI mutex. I have exercised the timeout and
signal paths of futex_wait_requeue_pi() prior to requeue. I am working to add
more sophisticated tests to be able to exercise the post-requeue paths as well.
Additionally, I'd like to add some fault-injection.
I'd really appreciate feedback on the implementation as well as any design
critique. Answers to the questions posed in the patch headers and code
comments are particularly welcome. If we agree on the general approach, I'd
like to refactor futex_wait_requeue_pi() as it is rather lengthy at this point.
---
Darren Hart (8):
RFC: futex: add requeue_pi calls
RFC: futex: Add requeue_futex() call
RFC: futex: Add FLAGS_HAS_TIMEOUT flag to restart.futex.flags
RFC: rt_mutex: add proxy lock routines
RFC: futex: finish_futex_lock_pi()
RFC: futex: futex_lock_pi_atomic()
RFC: futex: futex_top_waiter()
RFC: futex: futex_wait_queue_me()
include/linux/futex.h | 8
include/linux/thread_info.h | 3
kernel/futex.c | 1116 +++++++++++++++++++++++++++++++++----------
kernel/rtmutex.c | 240 +++++++--
kernel/rtmutex_common.h | 8
5 files changed, 1063 insertions(+), 312 deletions(-)
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In order to
reuse a good chunk of the futex_wait() code for the upcoming
futex_wait_requeue_pi() function, this patch breaks out the queue-to-wakeup
section of futex_wait() into futex_wait_queue_me().
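For reference, the wait path in futex_wait() then reduces to the following
outline (condensed from the diff below; error handling and the retry loop
elided):

	hb = queue_lock(&q);
	ret = get_futex_value_locked(&uval, uaddr);
	if (uval != val) {			/* only queue if *uaddr == val */
		queue_unlock(&q, hb);
		goto out_put_key;
	}
	futex_wait_queue_me(hb, &q, to);	/* queue_me() and schedule() */
	if (!unqueue_me(&q))			/* woken and unqueued: success */
		goto out_put_key;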
Changelog:
V6: -Incremental build fixes
V4: -Nesting cleanups
-Delayed hrtimer start until after setting TASK_INTERRUPTIBLE
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 138 +++++++++++++++++++++++++++++++-------------------------
1 files changed, 76 insertions(+), 62 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 6b50a02..b2ed762 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1115,24 +1115,87 @@ handle_fault:
static long futex_wait_restart(struct restart_block *restart);
+/**
+ * futex_wait_queue_me() - queue_me and wait for wakeup, timeout, or signal.
+ * @hb: the futex hash bucket, must be locked by the caller
+ * @q: the futex_q to queue up on
+ * @timeout:	the prepared hrtimer_sleeper, or NULL for no timeout.
+ */
+static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
+ struct hrtimer_sleeper *timeout)
+{
+ DECLARE_WAITQUEUE(wait, current);
+
+ queue_me(q, hb);
+
+ /*
+ * There might have been scheduling since the queue_me(), as we
+ * cannot hold a spinlock across the get_user() in case it
+ * faults, and we cannot just set TASK_INTERRUPTIBLE state when
+ * queueing ourselves into the futex hash. This code thus has to
+ * rely on the futex_wake() code removing us from hash when it
+ * wakes us up.
+ */
+
+ /* add_wait_queue is the barrier after __set_current_state. */
+ __set_current_state(TASK_INTERRUPTIBLE);
+ add_wait_queue(&q->waiter, &wait);
+ /*
+ * NOTE: we don't remove ourselves from the waitqueue because
+ * we are the only user of it.
+ */
+
+ /* Arm the timer */
+ if (timeout) {
+ hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&timeout->timer))
+ timeout->task = NULL;
+ }
+
+ /*
+ * !plist_node_empty() is safe here without any lock.
+ * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
+ */
+ if (likely(!plist_node_empty(&q->list))) {
+ /*
+ * If the timer has already expired, current will already be
+ * flagged for rescheduling. Only call schedule if there
+ * is no timeout, or if it has yet to expire.
+ */
+ if (!timeout || likely(timeout->task))
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
static int futex_wait(u32 __user *uaddr, int fshared,
u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
{
- struct task_struct *curr = current;
+ struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
- DECLARE_WAITQUEUE(wait, curr);
struct futex_hash_bucket *hb;
struct futex_q q;
u32 uval;
int ret;
- struct hrtimer_sleeper t;
- int rem = 0;
if (!bitset)
return -EINVAL;
q.pi_state = NULL;
q.bitset = bitset;
+
+ if (abs_time) {
+ unsigned long slack;
+ to = &timeout;
+ slack = current->timer_slack_ns;
+ if (rt_task(current))
+ slack = 0;
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time, slack);
+ }
+
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, fshared, &q.key);
@@ -1178,75 +1241,22 @@ retry_private:
goto retry;
}
ret = -EWOULDBLOCK;
+
+ /* Only actually queue if *uaddr contained val. */
if (unlikely(uval != val)) {
queue_unlock(&q, hb);
goto out_put_key;
}
- /* Only actually queue if *uaddr contained val. */
- queue_me(&q, hb);
-
- /*
- * There might have been scheduling since the queue_me(), as we
- * cannot hold a spinlock across the get_user() in case it
- * faults, and we cannot just set TASK_INTERRUPTIBLE state when
- * queueing ourselves into the futex hash. This code thus has to
- * rely on the futex_wake() code removing us from hash when it
- * wakes us up.
- */
-
- /* add_wait_queue is the barrier after __set_current_state. */
- __set_current_state(TASK_INTERRUPTIBLE);
- add_wait_queue(&q.waiter, &wait);
- /*
- * !plist_node_empty() is safe here without any lock.
- * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
- */
- if (likely(!plist_node_empty(&q.list))) {
- if (!abs_time)
- schedule();
- else {
- hrtimer_init_on_stack(&t.timer,
- clockrt ? CLOCK_REALTIME :
- CLOCK_MONOTONIC,
- HRTIMER_MODE_ABS);
- hrtimer_init_sleeper(&t, current);
- hrtimer_set_expires_range_ns(&t.timer, *abs_time,
- current->timer_slack_ns);
-
- hrtimer_start_expires(&t.timer, HRTIMER_MODE_ABS);
- if (!hrtimer_active(&t.timer))
- t.task = NULL;
-
- /*
- * the timer could have already expired, in which
- * case current would be flagged for rescheduling.
- * Don't bother calling schedule.
- */
- if (likely(t.task))
- schedule();
-
- hrtimer_cancel(&t.timer);
-
- /* Flag if a timeout occured */
- rem = (t.task == NULL);
-
- destroy_hrtimer_on_stack(&t.timer);
- }
- }
- __set_current_state(TASK_RUNNING);
-
- /*
- * NOTE: we don't remove ourselves from the waitqueue because
- * we are the only user of it.
- */
+ /* queue_me and wait for wakeup, timeout, or a signal. */
+ futex_wait_queue_me(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
if (!unqueue_me(&q))
goto out_put_key;
ret = -ETIMEDOUT;
- if (rem)
+ if (to && !to->task)
goto out_put_key;
/*
@@ -1275,6 +1285,10 @@ retry_private:
out_put_key:
put_futex_key(fshared, &q.key);
out:
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
return ret;
}
Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().
This logic will be needed elsewhere, so modularize it to reduce code
duplication. The only significant change is the passing of the task to try to
take the lock for. This simplifies the -EDEADLK test: if the lock is owned by
task t, it's a deadlock, regardless of whether we are doing requeue pi or not.
This patch updates the corresponding comment accordingly.
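With that, the call site in futex_lock_pi() collapses to roughly the
following (condensed from the diff below):

	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current);
	if (unlikely(ret)) {
		switch (ret) {
		case 1:			/* we got the lock */
			ret = 0;
			goto out_unlock_put_key;
		case -EFAULT:
			goto uaddr_faulted;
		case -EAGAIN:		/* owner exiting; drop locks, retry */
			goto retry;
		default:
			goto out_unlock_put_key;
		}
	}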
Changelog:
V6: -comment update from peterz
V4: -fixes for style errors caught by checkpatch.pl
-use task instead of t for legibility whilst searching
-pass key and pi_state directly rather than trying to rely on the futex_q
since the key may be different.
-minor cleanups suggested by tglx
V2: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 226 +++++++++++++++++++++++++++++++++-----------------------
1 files changed, 132 insertions(+), 94 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 8ebed88..905b52a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -556,6 +556,129 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
return 0;
}
+/**
+ * futex_lock_pi_atomic() - atomic work required to acquire a pi aware futex
+ * @uaddr: the pi futex user address
+ * @hb: the pi futex hash bucket
+ * @key: the futex key associated with uaddr and hb
+ * @ps: the pi_state pointer where we store the result of the lookup
+ * @task: the task to perform the atomic lock work for. This will be
+ * "current" except in the case of requeue pi.
+ *
+ * Returns:
+ * 0 - ready to wait
+ * 1 - acquired the lock
+ * <0 - error
+ *
+ * The hb->lock and futex_key refs shall be held by the caller.
+ */
+static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
+ union futex_key *key,
+ struct futex_pi_state **ps,
+ struct task_struct *task)
+{
+ u32 uval, newval, curval;
+ int lock_taken;
+ int ret;
+ int ownerdied = 0;
+
+retry:
+ ret = lock_taken = 0;
+
+ /*
+ * To avoid races, we attempt to take the lock here again
+ * (by doing a 0 -> TID atomic cmpxchg), while holding all
+ * the locks. It will most likely not succeed.
+ */
+ newval = task_pid_vnr(task);
+
+ curval = cmpxchg_futex_value_locked(uaddr, 0, newval);
+
+ if (unlikely(curval == -EFAULT))
+ return -EFAULT;
+
+ /*
+ * Detect deadlocks.
+ */
+ if ((unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(task))))
+ return -EDEADLK;
+
+ /*
+ * Surprise - we got the lock. Just return to userspace:
+ */
+ if (unlikely(!curval))
+ return 1;
+
+ uval = curval;
+
+ /*
+ * Set the FUTEX_WAITERS flag, so the owner will know it has someone
+ * to wake at the next unlock.
+ */
+ newval = curval | FUTEX_WAITERS;
+
+ /*
+ * There are two cases, where a futex might have no owner (the
+ * owner TID is 0): OWNER_DIED. We take over the futex in this
+ * case. We also do an unconditional take over, when the owner
+ * of the futex died.
+ *
+ * This is safe as we are protected by the hash bucket lock !
+ */
+ if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ /* Keep the OWNER_DIED bit */
+ newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(task);
+ ownerdied = 0;
+ lock_taken = 1;
+ }
+
+ curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+ if (unlikely(curval == -EFAULT))
+ return -EFAULT;
+ if (unlikely(curval != uval))
+ goto retry;
+
+ /*
+ * We took the lock due to owner died take over.
+ */
+ if (unlikely(lock_taken))
+ return 1;
+
+ /*
+ * We dont have the lock. Look up the PI state (or create it if
+ * we are the first waiter):
+ */
+ ret = lookup_pi_state(uval, hb, key, ps);
+
+ if (unlikely(ret)) {
+ switch (ret) {
+ case -ESRCH:
+ /*
+ * No owner found for this futex. Check if the
+ * OWNER_DIED bit is set to figure out whether
+ * this is a robust futex or not.
+ */
+ if (get_futex_value_locked(&curval, uaddr))
+ return -EFAULT;
+
+ /*
+ * We simply start over in case of a robust
+ * futex. The code above will take the futex
+ * and return happy.
+ */
+ if (curval & FUTEX_OWNER_DIED) {
+ ownerdied = 1;
+ goto retry;
+ }
+ default:
+ break;
+ }
+ }
+
+ return ret;
+}
+
/*
* The hash bucket lock must be held when this is called.
* Afterwards, the futex_q must not be accessed.
@@ -1340,9 +1463,9 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
struct hrtimer_sleeper timeout, *to = NULL;
struct task_struct *curr = current;
struct futex_hash_bucket *hb;
- u32 uval, newval, curval;
+ u32 uval;
struct futex_q q;
- int ret, lock_taken, ownerdied = 0;
+ int ret;
if (refill_pi_state_cache())
return -ENOMEM;
@@ -1365,81 +1488,15 @@ retry:
retry_private:
hb = queue_lock(&q);
-retry_locked:
- ret = lock_taken = 0;
-
- /*
- * To avoid races, we attempt to take the lock here again
- * (by doing a 0 -> TID atomic cmpxchg), while holding all
- * the locks. It will most likely not succeed.
- */
- newval = task_pid_vnr(current);
-
- curval = cmpxchg_futex_value_locked(uaddr, 0, newval);
-
- if (unlikely(curval == -EFAULT))
- goto uaddr_faulted;
-
- /*
- * Detect deadlocks. In case of REQUEUE_PI this is a valid
- * situation and we return success to user space.
- */
- if (unlikely((curval & FUTEX_TID_MASK) == task_pid_vnr(current))) {
- ret = -EDEADLK;
- goto out_unlock_put_key;
- }
-
- /*
- * Surprise - we got the lock. Just return to userspace:
- */
- if (unlikely(!curval))
- goto out_unlock_put_key;
-
- uval = curval;
-
- /*
- * Set the WAITERS flag, so the owner will know it has someone
- * to wake at next unlock
- */
- newval = curval | FUTEX_WAITERS;
-
- /*
- * There are two cases, where a futex might have no owner (the
- * owner TID is 0): OWNER_DIED. We take over the futex in this
- * case. We also do an unconditional take over, when the owner
- * of the futex died.
- *
- * This is safe as we are protected by the hash bucket lock !
- */
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
- /* Keep the OWNER_DIED bit */
- newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(current);
- ownerdied = 0;
- lock_taken = 1;
- }
-
- curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
-
- if (unlikely(curval == -EFAULT))
- goto uaddr_faulted;
- if (unlikely(curval != uval))
- goto retry_locked;
-
- /*
- * We took the lock due to owner died take over.
- */
- if (unlikely(lock_taken))
- goto out_unlock_put_key;
-
- /*
- * We dont have the lock. Look up the PI state (or create it if
- * we are the first waiter):
- */
- ret = lookup_pi_state(uval, hb, &q.key, &q.pi_state);
-
+ ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current);
if (unlikely(ret)) {
switch (ret) {
-
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
case -EAGAIN:
/*
* Task is exiting and we just wait for the
@@ -1449,25 +1506,6 @@ retry_locked:
put_futex_key(fshared, &q.key);
cond_resched();
goto retry;
-
- case -ESRCH:
- /*
- * No owner found for this futex. Check if the
- * OWNER_DIED bit is set to figure out whether
- * this is a robust futex or not.
- */
- if (get_futex_value_locked(&curval, uaddr))
- goto uaddr_faulted;
-
- /*
- * We simply start over in case of a robust
- * futex. The code above will take the futex
- * and return happy.
- */
- if (curval & FUTEX_OWNER_DIED) {
- ownerdied = 1;
- goto retry_locked;
- }
default:
goto out_unlock_put_key;
}
Improve legibility by wrapping the search for the top waiter in a function. This
will be used by the follow-on patches for enabling requeue pi.
Changelog:
V6: -Shortened function length after review from tglx
-Corrected kernel-doc formatting
V4: -Simplified logic in the hb loop
-Return the futex_q as the task can be had from there if needed
-Updated comments and dropped the -safe iterator per review from tglx
V3: -Comment linewidth cleanup
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index b2ed762..8ebed88 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -276,6 +276,25 @@ void put_futex_key(int fshared, union futex_key *key)
drop_futex_key_refs(key);
}
+/**
+ * futex_top_waiter() - return the highest priority waiter on a futex
+ * @hb: the hash bucket the futex_q's reside in
+ * @key: the futex key (to distinguish it from other futex futex_q's)
+ *
+ * Must be called with the hb lock held.
+ */
+static struct futex_q *futex_top_waiter(struct futex_hash_bucket *hb,
+ union futex_key *key)
+{
+ struct futex_q *this;
+
+ plist_for_each_entry(this, &hb->chain, list) {
+ if (match_futex(&this->key, key))
+ return this;
+ }
+ return NULL;
+}
+
static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval)
{
u32 curval;
Refactor the post lock acquisition logic from futex_lock_pi(). This code will
be reused in futex_wait_requeue_pi().
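The net effect is that the post-acquisition block in futex_lock_pi() becomes
a two-liner, with the lock/unlock of q.lock_ptr kept visible at the call site
(condensed from the diff below):

	spin_lock(q.lock_ptr);
	ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);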
V6: -Minor comment updates
V4: -Corrected string paranoia check
-Move the spinlock(q->lock_ptr) out of finish_futex_lock_pi to retain
some semblance of lock/unlock in the same function.
V3: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 147 +++++++++++++++++++++++++++++++-------------------------
1 files changed, 81 insertions(+), 66 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 905b52a..dd05803 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1258,6 +1258,84 @@ handle_fault:
static long futex_wait_restart(struct restart_block *restart);
/**
+ * finish_futex_lock_pi() - Post lock pi_state and corner case management
+ * @uaddr: user address of the futex
+ * @fshared: whether the futex is shared (1) or not (0)
+ * @q: futex_q (contains pi_state and access to the rt_mutex)
+ * @ret:	the return value of the preceding attempt to lock the rt_mutex
+ *
+ * After attempting to lock an rt_mutex, process the return code and cleanup
+ * the pi_state as well as handle race conditions that may have caused us to
+ * lose the lock. Must be called with the hb lock held.
+ *
+ * Returns:
+ * 0 - on success
+ * <0 - on error
+ */
+static int finish_futex_lock_pi(u32 __user *uaddr, int fshared,
+ struct futex_q *q, int ret)
+{
+ if (!ret) {
+ /*
+ * Got the lock. We might not be the anticipated owner
+ * if we did a lock-steal - fix up the PI-state in
+ * that case:
+ */
+ if (q->pi_state->owner != current)
+ ret = fixup_pi_state_owner(uaddr, q, current, fshared);
+ return ret;
+ }
+
+ /*
+ * Catch the rare case, where the lock was released when we were on the
+ * way back before we locked the hash bucket.
+ */
+ if (q->pi_state->owner == current) {
+ /*
+ * Try to get the rt_mutex now. This might fail as some other
+ * task acquired the rt_mutex after we removed ourself from the
+ * rt_mutex waiters list.
+ */
+ if (rt_mutex_trylock(&q->pi_state->pi_mutex))
+ ret = 0;
+ else {
+ /*
+ * pi_state is incorrect, some other task did a lock
+ * steal and we returned due to timeout or signal
+ * without taking the rt_mutex. Too late. We can access
+ * the rt_mutex_owner without locking, as the other
+ * task is now blocked on the hash bucket lock. Fix the
+ * state up.
+ */
+ struct task_struct *owner;
+ int res;
+
+ owner = rt_mutex_owner(&q->pi_state->pi_mutex);
+ res = fixup_pi_state_owner(uaddr, q, owner,
+ fshared);
+
+ /* propagate -EFAULT, if the fixup failed */
+ if (res)
+ ret = res;
+ }
+ } else {
+ /*
+ * Paranoia check. If we did not take the lock in the trylock
+ * above, then we should not be the owner of the rtmutex,
+ * neither the real nor the pending one:
+ */
+ if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+ printk(KERN_ERR "finish_futex_lock_pi: "
+ "ret = %d pi-mutex: %p "
+ "pi-state %p\n", ret,
+ q->pi_state->pi_mutex.owner,
+ q->pi_state->owner);
+ }
+
+ return ret;
+}
+
+/**
* futex_wait_queue_me() - queue_me and wait for wakeup, timeout, or signal.
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
@@ -1461,7 +1539,6 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
int detect, ktime_t *time, int trylock)
{
struct hrtimer_sleeper timeout, *to = NULL;
- struct task_struct *curr = current;
struct futex_hash_bucket *hb;
u32 uval;
struct futex_q q;
@@ -1529,67 +1606,7 @@ retry_private:
}
spin_lock(q.lock_ptr);
-
- if (!ret) {
- /*
- * Got the lock. We might not be the anticipated owner
- * if we did a lock-steal - fix up the PI-state in
- * that case:
- */
- if (q.pi_state->owner != curr)
- ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
- } else {
- /*
- * Catch the rare case, where the lock was released
- * when we were on the way back before we locked the
- * hash bucket.
- */
- if (q.pi_state->owner == curr) {
- /*
- * Try to get the rt_mutex now. This might
- * fail as some other task acquired the
- * rt_mutex after we removed ourself from the
- * rt_mutex waiters list.
- */
- if (rt_mutex_trylock(&q.pi_state->pi_mutex))
- ret = 0;
- else {
- /*
- * pi_state is incorrect, some other
- * task did a lock steal and we
- * returned due to timeout or signal
- * without taking the rt_mutex. Too
- * late. We can access the
- * rt_mutex_owner without locking, as
- * the other task is now blocked on
- * the hash bucket lock. Fix the state
- * up.
- */
- struct task_struct *owner;
- int res;
-
- owner = rt_mutex_owner(&q.pi_state->pi_mutex);
- res = fixup_pi_state_owner(uaddr, &q, owner,
- fshared);
-
- /* propagate -EFAULT, if the fixup failed */
- if (res)
- ret = res;
- }
- } else {
- /*
- * Paranoia check. If we did not take the lock
- * in the trylock above, then we should not be
- * the owner of the rtmutex, neither the real
- * nor the pending one:
- */
- if (rt_mutex_owner(&q.pi_state->pi_mutex) == curr)
- printk(KERN_ERR "futex_lock_pi: ret = %d "
- "pi-mutex: %p pi-state %p\n", ret,
- q.pi_state->pi_mutex.owner,
- q.pi_state->owner);
- }
- }
+ ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);
/*
* If fixup_pi_state_owner() faulted and was unable to handle the
@@ -1601,9 +1618,7 @@ retry_private:
/* Unqueue and drop the lock */
unqueue_me_pi(&q);
- if (to)
- destroy_hrtimer_on_stack(&to->timer);
- return ret != -EINTR ? ret : -ERESTARTNOINTR;
+ goto out;
out_unlock_put_key:
queue_unlock(&q, hb);
@@ -1613,7 +1628,7 @@ out_put_key:
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);
- return ret;
+ return ret != -EINTR ? ret : -ERESTARTNOINTR;
uaddr_faulted:
/*
This patch is required for the first half of requeue_pi to function. It
basically splits rt_mutex_slowlock() right down the middle, just before the
first call to schedule().
This patch uses a new futex_q field, rt_waiter, for now. I think
I should be able to use task->pi_blocked_on in a future version of this patch.
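Pieced together from this patch and the requeue_pi patch later in the series,
the intended two-phase calling sequence looks roughly like this (a sketch,
error handling elided):

	/* Requeuer (futex_requeue()), on behalf of the sleeping waiter: */
	ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
					this->rt_waiter, this->task, 1);
	/* ret == 1: lock acquired for task, wake it; 0: task is blocked */

	/* Waiter, once woken on the pi futex, completes the acquisition: */
	ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
	/* 0 on success, else -EINTR, -ETIMEDOUT, or -EDEADLK */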
V6: -add mark_rt_mutex_waiters() to rt_mutex_start_proxy_lock() to avoid
the race condition evident in previous versions
-cleanup kernel-docs formatting and comments
-try to take the lock in rt_mutex_start_proxy_lock() rather than assume
the lock is held
-remove initial schedule in finish_proxy_lock to allow for signal and
timeout detection.
V5: -remove EXPORT_SYMBOL_GPL from the new routines
-minor cleanups
V4: -made detect_deadlock a parameter to rt_mutex_enqueue_task
-refactored rt_mutex_slowlock to share code with new functions
-renamed rt_mutex_enqueue_task and rt_mutex_handle_wakeup to
rt_mutex_start_proxy_lock and rt_mutex_finish_proxy_lock, respectively
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/rtmutex.c | 240 +++++++++++++++++++++++++++++++++++++----------
kernel/rtmutex_common.h | 8 ++
2 files changed, 195 insertions(+), 53 deletions(-)
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 69d9cb9..fec77e7 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -300,7 +300,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
* assigned pending owner [which might not have taken the
* lock yet]:
*/
-static inline int try_to_steal_lock(struct rt_mutex *lock)
+static inline int try_to_steal_lock(struct rt_mutex *lock,
+ struct task_struct *task)
{
struct task_struct *pendowner = rt_mutex_owner(lock);
struct rt_mutex_waiter *next;
@@ -309,11 +310,11 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
if (!rt_mutex_owner_pending(lock))
return 0;
- if (pendowner == current)
+ if (pendowner == task)
return 1;
spin_lock_irqsave(&pendowner->pi_lock, flags);
- if (current->prio >= pendowner->prio) {
+ if (task->prio >= pendowner->prio) {
spin_unlock_irqrestore(&pendowner->pi_lock, flags);
return 0;
}
@@ -338,21 +339,21 @@ static inline int try_to_steal_lock(struct rt_mutex *lock)
* We are going to steal the lock and a waiter was
* enqueued on the pending owners pi_waiters queue. So
* we have to enqueue this waiter into
- * current->pi_waiters list. This covers the case,
- * where current is boosted because it holds another
+ * task->pi_waiters list. This covers the case,
+ * where task is boosted because it holds another
* lock and gets unboosted because the booster is
* interrupted, so we would delay a waiter with higher
- * priority as current->normal_prio.
+ * priority as task->normal_prio.
*
* Note: in the rare case of a SCHED_OTHER task changing
* its priority and thus stealing the lock, next->task
- * might be current:
+ * might be task:
*/
- if (likely(next->task != current)) {
- spin_lock_irqsave(¤t->pi_lock, flags);
- plist_add(&next->pi_list_entry, ¤t->pi_waiters);
- __rt_mutex_adjust_prio(current);
- spin_unlock_irqrestore(¤t->pi_lock, flags);
+ if (likely(next->task != task)) {
+ spin_lock_irqsave(&task->pi_lock, flags);
+ plist_add(&next->pi_list_entry, &task->pi_waiters);
+ __rt_mutex_adjust_prio(task);
+ spin_unlock_irqrestore(&task->pi_lock, flags);
}
return 1;
}
@@ -389,7 +390,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
mark_rt_mutex_waiters(lock);
- if (rt_mutex_owner(lock) && !try_to_steal_lock(lock))
+ if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, current))
return 0;
/* We got the lock. */
@@ -411,6 +412,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
int detect_deadlock)
{
struct task_struct *owner = rt_mutex_owner(lock);
@@ -418,21 +420,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
unsigned long flags;
int chain_walk = 0, res;
- spin_lock_irqsave(¤t->pi_lock, flags);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
+ spin_lock_irqsave(&task->pi_lock, flags);
+ __rt_mutex_adjust_prio(task);
+ waiter->task = task;
waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ plist_node_init(&waiter->list_entry, task->prio);
+ plist_node_init(&waiter->pi_list_entry, task->prio);
/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);
- current->pi_blocked_on = waiter;
+ task->pi_blocked_on = waiter;
- spin_unlock_irqrestore(¤t->pi_lock, flags);
+ spin_unlock_irqrestore(&task->pi_lock, flags);
if (waiter == rt_mutex_top_waiter(lock)) {
spin_lock_irqsave(&owner->pi_lock, flags);
@@ -460,7 +462,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
spin_unlock(&lock->wait_lock);
res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current);
+ task);
spin_lock(&lock->wait_lock);
@@ -605,37 +607,25 @@ void rt_mutex_adjust_pi(struct task_struct *task)
rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task);
}
-/*
- * Slow path lock function:
+/**
+ * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
+ * @lock: the rt_mutex to take
+ * @state: the state the task should block in (TASK_INTERRUPTIBLE
+ * or TASK_UNINTERRUPTIBLE)
+ * @timeout: the pre-initialized and started timer, or NULL for none
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * lock->wait_lock must be held by the caller.
*/
static int __sched
-rt_mutex_slowlock(struct rt_mutex *lock, int state,
- struct hrtimer_sleeper *timeout,
- int detect_deadlock)
+__rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
{
- struct rt_mutex_waiter waiter;
int ret = 0;
- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
-
- spin_lock(&lock->wait_lock);
-
- /* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
- spin_unlock(&lock->wait_lock);
- return 0;
- }
-
- set_current_state(state);
-
- /* Setup the timer, when timeout != NULL */
- if (unlikely(timeout)) {
- hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
- if (!hrtimer_active(&timeout->timer))
- timeout->task = NULL;
- }
-
for (;;) {
/* Try to acquire the lock: */
if (try_to_take_rt_mutex(lock))
@@ -656,19 +646,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
}
/*
- * waiter.task is NULL the first time we come here and
+ * waiter->task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by a higher prio task.
*/
- if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
+ if (!waiter->task) {
+ ret = task_blocks_on_rt_mutex(lock, waiter, current,
detect_deadlock);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
* to get the lock now:
*/
- if (unlikely(!waiter.task)) {
+ if (unlikely(!waiter->task)) {
/*
* Reset the return value. We might
* have returned with -EDEADLK and the
@@ -684,15 +674,52 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
spin_unlock(&lock->wait_lock);
- debug_rt_mutex_print_deadlock(&waiter);
+ debug_rt_mutex_print_deadlock(waiter);
- if (waiter.task)
+ if (waiter->task)
schedule_rt_mutex(lock);
spin_lock(&lock->wait_lock);
set_current_state(state);
}
+ return ret;
+}
+
+/*
+ * Slow path lock function:
+ */
+static int __sched
+rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ int detect_deadlock)
+{
+ struct rt_mutex_waiter waiter;
+ int ret = 0;
+
+ debug_rt_mutex_init_waiter(&waiter);
+ waiter.task = NULL;
+
+ spin_lock(&lock->wait_lock);
+
+ /* Try to acquire the lock again: */
+ if (try_to_take_rt_mutex(lock)) {
+ spin_unlock(&lock->wait_lock);
+ return 0;
+ }
+
+ set_current_state(state);
+
+ /* Setup the timer, when timeout != NULL */
+ if (unlikely(timeout)) {
+ hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&timeout->timer))
+ timeout->task = NULL;
+ }
+
+ ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
+ detect_deadlock);
+
set_current_state(TASK_RUNNING);
if (unlikely(waiter.task))
@@ -986,6 +1013,59 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
}
/**
+ * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock: the rt_mutex to take
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @task: the task to prepare
+ * @detect_deadlock: perform deadlock detection (1) or not (0)
+ *
+ * Returns:
+ * 0 - task blocked on lock
+ * 1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task, int detect_deadlock)
+{
+ int ret;
+
+ spin_lock(&lock->wait_lock);
+
+ mark_rt_mutex_waiters(lock);
+
+ if (!rt_mutex_owner(lock) || try_to_steal_lock(lock, task)) {
+ /* We got the lock for task. */
+ debug_rt_mutex_lock(lock);
+
+ rt_mutex_set_owner(lock, task, 0);
+
+ rt_mutex_deadlock_account_lock(lock, task);
+ return 1;
+ }
+
+ ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
+
+
+ if (ret && !waiter->task) {
+ /*
+ * Reset the return value. We might have
+ * returned with -EDEADLK and the owner
+ * released the lock while we were walking the
+ * pi chain. Let the waiter sort it out.
+ */
+ ret = 0;
+ }
+ spin_unlock(&lock->wait_lock);
+
+ debug_rt_mutex_print_deadlock(waiter);
+
+ return ret;
+}
+
+/**
* rt_mutex_next_owner - return the next owner of the lock
*
* @lock: the rt lock query
@@ -1004,3 +1084,57 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
return rt_mutex_top_waiter(lock)->task;
}
+
+/**
+ * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * @lock: the rt_mutex we were woken on
+ * @to:	the timeout, NULL if none. hrtimer should already have
+ * been started.
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: perform deadlock detection (1) or not (0)
+ *
+ * Complete the lock acquisition started on our behalf by another thread.
+ *
+ * Returns:
+ * 0 - success
+ * <0 - error, one of -EINTR, -ETIMEDOUT, or -EDEADLK
+ *
+ * Special API call for PI-futex requeue support
+ */
+int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
+{
+ int ret;
+
+ spin_lock(&lock->wait_lock);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
+ detect_deadlock);
+
+ set_current_state(TASK_RUNNING);
+
+ if (unlikely(waiter->task))
+ remove_waiter(lock, waiter);
+
+ /*
+ * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+ * have to fix that up.
+ */
+ fixup_rt_mutex_waiters(lock);
+
+ spin_unlock(&lock->wait_lock);
+
+ /*
+ * Readjust priority, when we did not get the lock. We might have been
+ * the pending owner and boosted. Since we did not take the lock, the
+ * PI boost has to go.
+ */
+ if (unlikely(ret))
+ rt_mutex_adjust_prio(current);
+
+ return ret;
+}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index e124bf5..97a2f81 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);
extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
struct task_struct *proxy_owner);
+extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
+ int detect_deadlock);
+extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock);
#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
futex_requeue() is getting a bit long-winded, and will be getting more so after
the requeue_pi patch. Factor out the actual requeueing into a nicely contained
inline function to reduce function length and improve legibility.
Changelog:
V6: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 43 +++++++++++++++++++++++++++++--------------
1 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index f54faf5..61a8a8b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -942,6 +942,34 @@ out:
return ret;
}
+/**
+ * requeue_futex() - Requeue a futex_q from one hb to another
+ * @q: the futex_q to requeue
+ * @hb1: the source hash_bucket
+ * @hb2: the target hash_bucket
+ * @key2: the new key for the requeued futex_q
+ */
+static inline
+void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2, union futex_key *key2)
+{
+
+ /*
+ * If key1 and key2 hash to the same bucket, no need to
+ * requeue.
+ */
+ if (likely(&hb1->chain != &hb2->chain)) {
+ plist_del(&q->list, &hb1->chain);
+ plist_add(&q->list, &hb2->chain);
+ q->lock_ptr = &hb2->lock;
+#ifdef CONFIG_DEBUG_PI_LIST
+ q->list.plist.lock = &hb2->lock;
+#endif
+ }
+ get_futex_key_refs(key2);
+ q->key = *key2;
+}
+
/*
* Requeue all waiters hashed on one physical page to another
* physical page.
@@ -1001,20 +1029,7 @@ retry_private:
if (++ret <= nr_wake) {
wake_futex(this);
} else {
- /*
- * If key1 and key2 hash to the same bucket, no need to
- * requeue.
- */
- if (likely(head1 != &hb2->chain)) {
- plist_del(&this->list, &hb1->chain);
- plist_add(&this->list, &hb2->chain);
- this->lock_ptr = &hb2->lock;
-#ifdef CONFIG_DEBUG_PI_LIST
- this->list.plist.lock = &hb2->lock;
-#endif
- }
- this->key = key2;
- get_futex_key_refs(&key2);
+ requeue_futex(this, hb1, hb2, &key2);
drop_count++;
if (ret - nr_wake >= nr_requeue)
Currently restart is only used if there is a timeout. The requeue_pi
functionality requires restarting to futex_lock_pi() on signal after wakeup
in futex_wait_requeue_pi(), regardless of whether there was a timeout or
not. Using 0 for the timeout value is confusing, as that could indicate an
expired timer. The flag makes this explicit.
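The resulting pattern (as the diff below implements it) is to record the
flag when arming the restart block and to test for it on restart:

	restart->futex.flags = FLAGS_HAS_TIMEOUT;	/* timed wait */
	...
	if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
		t.tv64 = restart->futex.time;
		tp = &t;
	}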
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index dd05803..f54faf5 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1254,6 +1254,7 @@ handle_fault:
*/
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
@@ -1493,7 +1494,7 @@ retry_private:
restart->futex.val = val;
restart->futex.time = abs_time->tv64;
restart->futex.bitset = bitset;
- restart->futex.flags = 0;
+ restart->futex.flags = FLAGS_HAS_TIMEOUT;
if (fshared)
restart->futex.flags |= FLAGS_SHARED;
@@ -1517,13 +1518,16 @@ static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
int fshared = 0;
- ktime_t t;
+ ktime_t t, *tp = NULL;
- t.tv64 = restart->futex.time;
+	if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
restart->fn = do_no_restart_syscall;
if (restart->futex.flags & FLAGS_SHARED)
fshared = 1;
- return (long)futex_wait(uaddr, fshared, restart->futex.val, &t,
+ return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
restart->futex.bitset,
restart->futex.flags & FLAGS_CLOCKRT);
}
PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
pending waiters as this will break the PI boosting logic, so the standard
requeue commands aren't sufficient. The new commands properly manage pi futex
ownership by ensuring a futex with waiters has an owner at all times. This
will allow glibc to properly handle pi mutexes with pthread_condvars.
The approach taken here is to create two new futex op codes:
FUTEX_WAIT_REQUEUE_PI:
Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will acquire this pi futex (and the underlying rt_mutex).
futex_wait_requeue_pi() is the result of a high speed collision between
futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many tasks the caller intends to wake or requeue.
pthread_cond_broadcast() should call this with nr_wake=1 and
nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
nr_requeue=0. This is because both callers need to get the benefit of the
futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
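To illustrate the intended division of labor from userspace, here is a rough
sketch of a condvar-style wait and broadcast using the raw futex syscall.
This is illustrative only: it assumes the usual six-argument sys_futex
argument convention (uaddr, op, val, timeout/val2, uaddr2, val3), passes
FUTEX_BITSET_MATCH_ANY as the assumed bitset, and glosses over all of the
condvar bookkeeping glibc has to do:

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <limits.h>

	/* Assume: u32 cond, mutex; cond_val = value read from cond. */

	/* Waiter: block on &cond, wake holding the PI futex &mutex. */
	syscall(SYS_futex, &cond, FUTEX_WAIT_REQUEUE_PI, cond_val,
		NULL /* no timeout */, &mutex,
		FUTEX_BITSET_MATCH_ANY /* bitset must be non-zero */);

	/* Broadcaster: wake one waiter, requeue the rest onto &mutex. */
	syscall(SYS_futex, &cond, FUTEX_CMP_REQUEUE_PI, 1 /* nr_wake */,
		(void *)(unsigned long)INT_MAX /* nr_requeue via val2 */,
		&mutex, cond_val /* expected value of cond */);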
Changelog:
V6: -Moved non requeue_pi related fixes/changes into separate patches
-Make use of new double_unlock_hb()
-Futex key management updates
-Removed unnecessary futex_requeue_pi_cleanup() routine
-Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
-Rewrote futex_wait_requeue_pi() wakeup logic
-Rewrote requeue/wakeup loop
-Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
-Handle third party owner, removed -EMORON :-(
-Comment updates
V5: -Update futex_requeue to allow for nr_requeue == 0
-Whitespace cleanup
-Added task_count var to futex_requeue to avoid confusion between
ret, res, and ret used to count wakes and requeues
V4: -Cleanups to pass checkpatch.pl
-Added missing goto out; in futex_wait_requeue_pi()
-Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
are a functional pair.
-Fixed several error exit paths that failed to unqueue the futex_q, which
not only would leave the futex_q on the hb, but would have caused an exit
race with the waiter since they weren't synchronized on the hb lock. Thanks
Sripathi for catching this.
-Fix pi_state handling in futex_requeue
-Several other minor fixes to futex_requeue_pi
-add requeue_futex function and force the requeue in requeue_pi even for the
task we wake in the requeue loop
-refill the pi state cache at the beginning of futex_requeue for requeue_pi
-have futex_requeue_pi_init ensure it stores off the pi_state for use in
futex_requeue
- Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
- Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
receives a signal after waking on uaddr2. Added has_timeout to the
restart->futex structure.
V3: -Added FUTEX_CMP_REQUEUE_PI op
-Put fshared support back. So long as it is encoded in the op code, we
assume both the uaddr's are either private or shared, but not mixed.
-Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
-Updated fault handling and exit logic
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
include/linux/futex.h | 8 +
include/linux/thread_info.h | 3
kernel/futex.c | 533 +++++++++++++++++++++++++++++++++++++++++--
3 files changed, 524 insertions(+), 20 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 3bf5bb5..b05519c 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -23,6 +23,9 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_WAIT_BITSET 9
#define FUTEX_WAKE_BITSET 10
+#define FUTEX_WAIT_REQUEUE_PI 11
+#define FUTEX_REQUEUE_PI 12
+#define FUTEX_CMP_REQUEUE_PI 13
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -38,6 +41,11 @@ union ktime;
#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PI_PRIVATE (FUTEX_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index e6b820f..a8cc4e1 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -21,13 +21,14 @@ struct restart_block {
struct {
unsigned long arg0, arg1, arg2, arg3;
};
- /* For futex_wait */
+ /* For futex_wait and futex_wait_requeue_pi */
struct {
u32 *uaddr;
u32 val;
u32 flags;
u32 bitset;
u64 time;
+ u32 *uaddr2;
} futex;
/* For nanosleep */
struct {
diff --git a/kernel/futex.c b/kernel/futex.c
index 61a8a8b..a673bb3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -19,6 +19,10 @@
* PRIVATE futexes by Eric Dumazet
* Copyright (C) 2007 Eric Dumazet <[email protected]>
*
+ * Requeue-PI support by Darren Hart <[email protected]>
+ * Copyright (C) IBM Corporation, 2009
+ * Thanks to Thomas Gleixner for conceptual design and careful reviews.
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -109,6 +113,9 @@ struct futex_q {
struct futex_pi_state *pi_state;
struct task_struct *task;
+ /* rt_waiter storage for requeue_pi: */
+ struct rt_mutex_waiter *rt_waiter;
+
/* Bitset for the optional bitmasked wakeup */
u32 bitset;
};
@@ -829,7 +836,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
- if (this->pi_state) {
+ if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
@@ -970,20 +977,116 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
q->key = *key2;
}
-/*
- * Requeue all waiters hashed on one physical page to another
- * physical page.
+/**
+ * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
+ * @pifutex: the user address of the to futex
+ * @hb1: the from futex hash bucket, must be locked by the caller
+ * @hb2: the to futex hash bucket, must be locked by the caller
+ * @key1: the from futex key
+ * @key2: the to futex key
+ *
+ * Try and get the lock on behalf of the top waiter if we can do it atomically.
+ * Wake the top waiter if we succeed. hb1 and hb2 must be held by the caller.
+ *
+ * Faults occur for two primary reasons at this point:
+ * 1) The address isn't mapped
+ * 2) The address isn't writeable
+ *
+ * We return EFAULT on either of these cases and rely on the caller to handle
+ * them.
+ *
+ * Returns:
+ * 0 - failed to acquire the lock atomically
+ * 1 - acquired the lock
+ * <0 - error
+ */
+static int futex_proxy_trylock_atomic(u32 __user *pifutex,
+ struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2,
+ union futex_key *key1, union futex_key *key2,
+ struct futex_pi_state **ps)
+{
+ struct futex_q *top_waiter;
+ u32 curval;
+ int ret;
+
+ if (get_futex_value_locked(&curval, pifutex))
+ return -EFAULT;
+
+ top_waiter = futex_top_waiter(hb1, key1);
+
+ /* There are no waiters, nothing for us to do. */
+ if (!top_waiter)
+ return 0;
+
+ /*
+ * Either take the lock for top_waiter or set the FUTEX_WAITERS bit.
+ * The pi_state is returned in ps in contended cases.
+ */
+ ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task);
+ if (ret == 1) {
+ /*
+ * Set the top_waiter key for the requeue target futex so the
+ * waiter can detect the wakeup on the right futex, but remove
+ * it from the hb so it can detect atomic lock acquisition.
+ */
+ drop_futex_key_refs(&top_waiter->key);
+ get_futex_key_refs(key2);
+ top_waiter->key = *key2;
+ WARN_ON(plist_node_empty(&top_waiter->list));
+ plist_del(&top_waiter->list, &top_waiter->list.plist);
+ /*
+ * FIXME: wake_futex() wakes first, then nulls the lock_ptr,
+ * and uses a memory barrier. Do we need to?
+ */
+ top_waiter->lock_ptr = NULL;
+ wake_up(&top_waiter->waiter);
+ }
+
+ return ret;
+}
+
+/**
+ * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
+ * @uaddr1:	source futex user address
+ * @uaddr2:	target futex user address
+ * @nr_wake:	number of waiters to wake (must be 1 for requeue_pi)
+ * @nr_requeue:	number of waiters to requeue (0-INT_MAX)
+ * @requeue_pi:	if we are attempting to requeue from a non-pi futex to a
+ *		pi futex (pi to pi requeue is not supported)
+ *
+ * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
+ * uaddr2 atomically on behalf of the top waiter.
+ *
+ * Returns:
+ * >=0: on success, the number of tasks requeued or woken
+ * <0: on error
*/
static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
- int nr_wake, int nr_requeue, u32 *cmpval)
+ int nr_wake, int nr_requeue, u32 *cmpval,
+ int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
+ int drop_count = 0, task_count = 0, ret;
+ struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct plist_head *head1;
struct futex_q *this, *next;
- int ret, drop_count = 0;
+ u32 curval2;
+
+ if (requeue_pi) {
+ if (refill_pi_state_cache())
+ return -ENOMEM;
+ if (nr_wake != 1)
+ return -EINVAL;
+ }
retry:
+ if (pi_state != NULL) {
+ free_pi_state(pi_state);
+ pi_state = NULL;
+ }
+
ret = get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret != 0))
goto out;
@@ -1022,19 +1125,92 @@ retry_private:
}
}
+ if (requeue_pi) {
+ /* Attempt to acquire uaddr2 and wake the top_waiter. */
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state);
+
+ /*
+ * At this point the top_waiter has either taken uaddr2 or is
+ * waiting on it. If the former, then the pi_state will not
+ * exist yet, look it up one more time to ensure we have a
+ * reference to it.
+ */
+ if (ret == 1 && !pi_state) {
+ task_count++;
+ ret = get_futex_value_locked(&curval2, uaddr2);
+ if (!ret)
+ ret = lookup_pi_state(curval2, hb2, &key2,
+ &pi_state);
+ }
+
+ switch (ret) {
+ case 0:
+ break;
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ ret = get_user(curval2, uaddr2);
+ if (!ret)
+ goto retry;
+ goto out;
+ case -EAGAIN:
+ /* The owner was exiting, try again. */
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
+ }
+
head1 = &hb1->chain;
plist_for_each_entry_safe(this, next, head1, list) {
- if (!match_futex (&this->key, &key1))
+ if (task_count - nr_wake >= nr_requeue)
+ break;
+
+ if (!match_futex(&this->key, &key1))
continue;
- if (++ret <= nr_wake) {
+
+ /* This can go after we're satisfied with testing. */
+ if (!requeue_pi)
+ WARN_ON(this->rt_waiter);
+
+ /*
+ * Wake nr_wake waiters. For requeue_pi, if we acquired the
+ * lock, we already woke the top_waiter. If not, it will be
+ * woken by futex_unlock_pi().
+ */
+ if (++task_count <= nr_wake && !requeue_pi) {
wake_futex(this);
- } else {
- requeue_futex(this, hb1, hb2, &key2);
- drop_count++;
+ continue;
+ }
- if (ret - nr_wake >= nr_requeue)
- break;
+ /*
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ */
+ if (requeue_pi) {
+ /* This can go after we're satisfied with testing. */
+ WARN_ON(!this->rt_waiter);
+
+ /* Prepare the waiter to take the rt_mutex. */
+ atomic_inc(&pi_state->refcount);
+ this->pi_state = pi_state;
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task, 1);
+ if (ret) {
+ this->pi_state = NULL;
+ free_pi_state(pi_state);
+ goto out_unlock;
+ }
}
+ requeue_futex(this, hb1, hb2, &key2);
+ drop_count++;
}
out_unlock:
@@ -1049,7 +1225,9 @@ out_put_keys:
out_put_key1:
put_futex_key(fshared, &key1);
out:
- return ret;
+ if (pi_state != NULL)
+ free_pi_state(pi_state);
+ return ret ? ret : task_count;
}
/* The key must be already stored in q->key. */
@@ -1272,6 +1450,8 @@ handle_fault:
#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
+static long futex_wait_requeue_pi_restart(struct restart_block *restart);
+static long futex_lock_pi_restart(struct restart_block *restart);
/**
* finish_futex_lock_pi() - Post lock pi_state and corner case management
@@ -1419,6 +1599,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,
q.pi_state = NULL;
q.bitset = bitset;
+ q.rt_waiter = NULL;
if (abs_time) {
unsigned long slack;
@@ -1575,6 +1756,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
}
q.pi_state = NULL;
+ q.rt_waiter = NULL;
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, fshared, &q.key);
@@ -1670,6 +1852,20 @@ uaddr_faulted:
goto retry;
}
+static long futex_lock_pi_restart(struct restart_block *restart)
+{
+ u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
+ ktime_t t, *tp = NULL;
+ int fshared = restart->futex.flags & FLAGS_SHARED;
+
+	if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
+ restart->fn = do_no_restart_syscall;
+
+ return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
+}
/*
* Userspace attempted a TID -> 0 atomic transition, and failed.
@@ -1772,6 +1968,290 @@ pi_faulted:
return ret;
}
+/**
+ * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
+ * @uaddr:	the futex we initially wait on (non-pi)
+ * @fshared: whether the futexes are shared (1) or not (0). They must be
+ * the same type, no requeueing from private to shared, etc.
+ * @val: the expected value of uaddr
+ * @abs_time: absolute timeout
+ * @bitset: 32 bit wakeup bitset set by userspace, defaults to all.
+ * @clockrt: whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
+ * @uaddr2: the pi futex we will take prior to returning to user-space
+ *
+ * The caller will wait on uaddr and will be requeued by futex_requeue() to
+ * uaddr2 which must be PI aware. Normal wakeup will wake on uaddr2 and
+ * complete the acquisition of the rt_mutex prior to returning to userspace.
+ * This ensures the rt_mutex maintains an owner when it has waiters; without
+ * one, the pi logic wouldn't know which task to boost/deboost, if there was a
+ * need to.
+ *
+ * We call schedule in futex_wait_queue_me() when we enqueue and return there
+ * via the following:
+ * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
+ * 2) wakeup on uaddr2 after a requeue and subsequent unlock
+ * 3) signal (before or after requeue)
+ * 4) timeout (before or after requeue)
+ *
+ * If 3, we set up a restart_block with futex_wait_requeue_pi() as the function.
+ *
+ * If 2, we may then block on trying to take the rt_mutex and return via:
+ * 5) successful lock
+ * 6) signal
+ * 7) timeout
+ * 8) other lock acquisition failure
+ *
+ * If 6, we set up a restart_block with futex_lock_pi() as the function.
+ *
+ * If 4 or 7, we cleanup and return with -ETIMEDOUT.
+ *
+ * Returns:
+ * 0 - On success
+ * <0 - On error
+ */
+static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
+ u32 val, ktime_t *abs_time, u32 bitset,
+ int clockrt, u32 __user *uaddr2)
+{
+ struct hrtimer_sleeper timeout, *to = NULL;
+ struct rt_mutex_waiter rt_waiter;
+ struct restart_block *restart;
+ struct futex_hash_bucket *hb;
+ struct rt_mutex *pi_mutex;
+ union futex_key key2;
+ struct futex_q q;
+ u32 uval;
+ int ret;
+
+ if (!bitset)
+ return -EINVAL;
+
+ if (abs_time) {
+ to = &timeout;
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time,
+ current->timer_slack_ns);
+ }
+
+ /*
+ * The waiter is allocated on our stack, manipulated by the requeue
+ * code while we sleep on uaddr.
+ */
+ debug_rt_mutex_init_waiter(&rt_waiter);
+ rt_waiter.task = NULL;
+
+ q.pi_state = NULL;
+ q.bitset = bitset;
+ q.rt_waiter = &rt_waiter;
+
+retry:
+ q.key = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr, fshared, &q.key);
+ if (unlikely(ret != 0))
+ goto out;
+
+ key2 = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr2, fshared, &key2);
+ if (unlikely(ret != 0)) {
+ put_futex_key(fshared, &q.key);
+ goto out;
+ }
+
+ hb = queue_lock(&q);
+
+ /*
+ * Access the page AFTER the hash-bucket is locked.
+ * Order is important:
+ *
+ * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+ * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
+ *
+ * The basic logical guarantee of a futex is that it blocks ONLY
+ * if cond(var) is known to be true at the time of blocking, for
+ * any cond. If we queued after testing *uaddr, that would open
+ * a race condition where we could block indefinitely with
+ * cond(var) false, which would violate the guarantee.
+ *
+ * A consequence is that futex_wait() can return zero and absorb
+ * a wakeup when *uaddr != val on entry to the syscall. This is
+ * rare, but normal.
+ */
+ ret = get_futex_value_locked(&uval, uaddr);
+
+ if (unlikely(ret)) {
+ queue_unlock(&q, hb);
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+
+ ret = get_user(uval, uaddr);
+ if (!ret)
+ goto retry;
+ goto out;
+ }
+
+ /* Only actually queue if *uaddr contained val. */
+ ret = -EWOULDBLOCK;
+ if (uval != val) {
+ queue_unlock(&q, hb);
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+ goto out;
+ }
+
+ /* Queue the futex_q, drop the hb lock, wait for wakeup. */
+ futex_wait_queue_me(hb, &q, to);
+
+ /*
+ * Ensure the requeue is atomic to avoid races while we process the
+ * wakeup. We only need to hold hb->lock to ensure atomicity as the
+ * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
+ * lock. It can't be requeued from uaddr2 to something else since we
+ * don't support a PI aware source futex for requeue.
+ */
+ spin_lock(&hb->lock);
+ if (!match_futex(&q.key, &key2)) {
+ WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
+ /*
+ * We were not requeued, handle wakeup from futex1 (uaddr). We
+ * cannot have been unqueued and already hold the lock, no need
+ * to call unqueue_me, just do it directly.
+ */
+ plist_del(&q.list, &q.list.plist);
+ drop_futex_key_refs(&q.key);
+
+ ret = -ETIMEDOUT;
+ if (to && !to->task) {
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+
+ /*
+ * We expect signal_pending(current), but another thread may
+ * have handled it for us already.
+ */
+ ret = -ERESTARTSYS;
+ if (!abs_time) {
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+
+	restart = &current_thread_info()->restart_block;
+ restart->fn = futex_wait_requeue_pi_restart;
+ restart->futex.uaddr = (u32 *)uaddr;
+ restart->futex.val = val;
+ restart->futex.time = abs_time->tv64;
+ restart->futex.bitset = bitset;
+ restart->futex.flags = 0;
+ restart->futex.uaddr2 = (u32 *)uaddr2;
+ restart->futex.flags = FLAGS_HAS_TIMEOUT;
+
+ if (fshared)
+ restart->futex.flags |= FLAGS_SHARED;
+ if (clockrt)
+ restart->futex.flags |= FLAGS_CLOCKRT;
+
+ ret = -ERESTART_RESTARTBLOCK;
+
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+ spin_unlock(&hb->lock);
+
+ ret = 0;
+ /*
+ * Check if the waker acquired the second futex for us. If the lock_ptr
+ * is NULL, but our key is key2, then the requeue target futex was
+ * uncontended and the waker gave it to us. This is safe without a lock
+ * as futex_requeue() will not release the hb lock until after it's
+ * nulled the lock_ptr and removed us from the hb.
+ */
+ if (!q.lock_ptr)
+ goto out_put_keys;
+
+ /*
+ * At this point we have been requeued. We have been woken up by
+ * futex_unlock_pi(), a timeout, or a signal, but not futex_requeue().
+ * futex_unlock_pi() will not destroy the lock_ptr nor the pi_state.
+ */
+	WARN_ON(!q.pi_state);
+ pi_mutex = &q.pi_state->pi_mutex;
+ ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
+	debug_rt_mutex_free_waiter(&rt_waiter);
+
+ spin_lock(q.lock_ptr);
+ ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);
+
+ /* Unqueue and drop the lock. */
+ unqueue_me_pi(&q);
+
+ /*
+ * If fixup_pi_state_owner() faulted and was unable to handle the
+ * fault, unlock it and return the fault to userspace.
+ */
+ if (ret == -EFAULT) {
+ if (rt_mutex_owner(pi_mutex) == current)
+ rt_mutex_unlock(pi_mutex);
+ } else if (ret == -EINTR) {
+ if (get_user(uval, uaddr2)) {
+ ret = -EFAULT;
+ goto out_put_keys;
+ }
+
+ /*
+ * We've already been requeued, so restart by calling
+ * futex_lock_pi() directly, rather then returning to this
+ * function.
+ */
+	restart = &current_thread_info()->restart_block;
+ restart->fn = futex_lock_pi_restart;
+ restart->futex.uaddr = (u32 *)uaddr2;
+ restart->futex.val = uval;
+ restart->futex.flags = 0;
+ if (abs_time) {
+ restart->futex.flags |= FLAGS_HAS_TIMEOUT;
+ restart->futex.time = abs_time->tv64;
+ }
+
+ if (fshared)
+ restart->futex.flags |= FLAGS_SHARED;
+ if (clockrt)
+ restart->futex.flags |= FLAGS_CLOCKRT;
+ ret = -ERESTART_RESTARTBLOCK;
+ }
+
+out_put_keys:
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+
+out:
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
+ return ret;
+}
+
+static long futex_wait_requeue_pi_restart(struct restart_block *restart)
+{
+ u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
+ u32 __user *uaddr2 = (u32 __user *)restart->futex.uaddr2;
+ int fshared = restart->futex.flags & FLAGS_SHARED;
+ int clockrt = restart->futex.flags & FLAGS_CLOCKRT;
+ ktime_t t, *tp = NULL;
+
+ if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
+ restart->fn = do_no_restart_syscall;
+
+ return (long)futex_wait_requeue_pi(uaddr, fshared, restart->futex.val,
+ tp, restart->futex.bitset, clockrt,
+ uaddr2);
+}
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -1994,7 +2474,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
fshared = 1;
clockrt = op & FUTEX_CLOCK_REALTIME;
- if (clockrt && cmd != FUTEX_WAIT_BITSET)
+ if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
switch (cmd) {
@@ -2009,10 +2489,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
ret = futex_wake(uaddr, fshared, val, val3);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 0);
break;
case FUTEX_WAKE_OP:
ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
@@ -2029,6 +2510,18 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
if (futex_cmpxchg_enabled)
ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
break;
+ case FUTEX_WAIT_REQUEUE_PI:
+ val3 = FUTEX_BITSET_MATCH_ANY;
+ ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
+ clockrt, uaddr2);
+ break;
+ case FUTEX_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 1);
+ break;
+ case FUTEX_CMP_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 1);
+ break;
default:
ret = -ENOSYS;
}
@@ -2046,7 +2539,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
int cmd = op & FUTEX_CMD_MASK;
if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
- cmd == FUTEX_WAIT_BITSET)) {
+ cmd == FUTEX_WAIT_BITSET ||
+ cmd == FUTEX_WAIT_REQUEUE_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
@@ -2058,10 +2552,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
tp = &t;
}
/*
- * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
* number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
*/
if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
+ cmd == FUTEX_REQUEUE_PI || cmd == FUTEX_CMP_REQUEUE_PI ||
cmd == FUTEX_WAKE_OP)
val2 = (u32) (unsigned long) utime;
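For anyone who wants to exercise the new op codes directly, the raw
syscalls line up with the do_futex() wiring above roughly as follows.
This is a minimal illustrative sketch, not part of the patch: the
futex() wrapper and the variable names are invented for the example,
and error handling is elided.

#include <stdint.h>
#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
/* Assumes the patched <linux/futex.h> is installed; otherwise define
 * the op codes from the header hunk in this series (11 and 13). */
#include <linux/futex.h>

/* glibc provides no futex() entry point, so wrap the raw syscall. */
static long futex(uint32_t *uaddr, int op, uint32_t val, void *utime,
		  uint32_t *uaddr2, uint32_t val3)
{
	return syscall(SYS_futex, uaddr, op, val, utime, uaddr2, val3);
}

uint32_t cond_seq;	/* non-pi condvar futex (uaddr) */
uint32_t pi_mutex;	/* pi mutex futex (uaddr2) */

/* Waiter side: block on &cond_seq, wake up owning &pi_mutex. */
void waiter(uint32_t seen_seq)
{
	futex(&cond_seq, FUTEX_WAIT_REQUEUE_PI, seen_seq,
	      NULL /* no timeout */, &pi_mutex, 0);
}

/* Broadcast side: wake one waiter, requeue the rest onto the pi mutex.
 * Note nr_requeue travels in the utime argument slot, per the
 * val2 = (u32)(unsigned long) utime assignment above. */
void broadcaster(uint32_t cond_val)
{
	futex(&cond_seq, FUTEX_CMP_REQUEUE_PI, 1 /* nr_wake */,
	      (void *)(unsigned long)INT_MAX /* nr_requeue */,
	      &pi_mutex, cond_val /* cmpval */);
}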
Darren Hart wrote:
> Currently restart is only used if there is a timeout. The requeue_pi
> functionality requires restarting to futex_lock_pi() on signal after
> wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
> or not. Using 0 for the timeout value is confusing as that could
> indicate an expired timer. The flag makes this explicit.
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Sripathi Kodi <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Dinakar Guniguntala <[email protected]>
> Cc: Ulrich Drepper <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jakub Jelinek <[email protected]>
> ---
>
> kernel/futex.c | 12 ++++++++----
> 1 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index dd05803..f54faf5 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1254,6 +1254,7 @@ handle_fault:
> */
> #define FLAGS_SHARED 0x01
> #define FLAGS_CLOCKRT 0x02
> +#define FLAGS_HAS_TIMEOUT 0x04
>
> static long futex_wait_restart(struct restart_block *restart);
>
> @@ -1493,7 +1494,7 @@ retry_private:
> restart->futex.val = val;
> restart->futex.time = abs_time->tv64;
> restart->futex.bitset = bitset;
> - restart->futex.flags = 0;
> + restart->futex.flags = FLAGS_HAS_TIMEOUT;
>
> if (fshared)
> restart->futex.flags |= FLAGS_SHARED;
> @@ -1517,13 +1518,16 @@ static long futex_wait_restart(struct restart_block *restart)
> {
> u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> int fshared = 0;
> - ktime_t t;
> + ktime_t t, *tp = NULL;
>
> - t.tv64 = restart->futex.time;
> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
You probably meant:
if (restart->futex.flags & FLAGS_HAS_TIMEOUT)
> + t.tv64 = restart->futex.time;
> + tp = &t;
> + }
> restart->fn = do_no_restart_syscall;
> if (restart->futex.flags & FLAGS_SHARED)
> fshared = 1;
> - return (long)futex_wait(uaddr, fshared, restart->futex.val, &t,
> + return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
> restart->futex.bitset,
> restart->futex.flags & FLAGS_CLOCKRT);
> }
>
>
Darren Hart wrote:
> PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
> pending waiters as this will break the PI boosting logic, so the standard
> requeue commands aren't sufficient. The new commands properly manage pi futex
> ownership by ensuring a futex with waiters has an owner at all times. This
> will allow glibc to properly handle pi mutexes with pthread_condvars.
>
> The approach taken here is to create two new futex op codes:
>
> FUTEX_WAIT_REQUEUE_PI:
> Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
> and wake after they have been requeued to a pi futex. Prior to returning to
> userspace, they will acquire this pi futex (and the underlying rt_mutex).
>
> futex_wait_requeue_pi() is the result of a high speed collision between
> futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
> done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
>
> FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
> This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
> regardless of how many tasks the caller intends to wake or requeue.
> pthread_cond_broadcast() should call this with nr_wake=1 and
> nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
> nr_requeue=0. The reason being we need both callers to get the benefit of the
> futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
> top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
>
> Changelog:
> V6: -Moved non requeue_pi related fixes/changes into separate patches
> -Make use of new double_unlock_hb()
> -Futex key management updates
> -Removed unnecessary futex_requeue_pi_cleanup() routine
> -Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
> -Rewrote futex_wait_requeue_pi() wakeup logic
> -Rewrote requeue/wakeup loop
> -Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
> -Handle third party owner, removed -EMORON :-(
> -Comment updates
> V5: -Update futex_requeue to allow for nr_requeue == 0
> -Whitespace cleanup
> -Added task_count var to futex_requeue to avoid confusion between
> ret, res, and ret used to count wakes and requeues
> V4: -Cleanups to pass checkpatch.pl
> -Added missing goto out; in futex_wait_requeue_pi()
> -Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
> are a functional pair.
> -Fixed several error exit paths that failed to unqueue the futex_q, which
> not only would leave the futex_q on the hb, but would have caused an exit
> race with the waiter since they weren't synchronized on the hb lock. Thanks
> Sripathi for catching this.
> -Fix pi_state handling in futex_requeue
> -Several other minor fixes to futex_requeue_pi
> -add requeue_futex function and force the requeue in requeue_pi even for the
> task we wake in the requeue loop
> -refill the pi state cache at the beginning of futex_requeue for requeue_pi
> -have futex_requeue_pi_init ensure it stores off the pi_state for use in
> futex_requeue
> - Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
> - Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
> receives a signal after waking on uaddr2. Added has_timeout to the
> restart->futex structure.
> V3: -Added FUTEX_CMP_REQUEUE_PI op
> -Put fshared support back. So long as it is encoded in the op code, we
> assume both the uaddr's are either private or shared, but not mixed.
> -Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
> V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
> -Updated fault handling and exit logic
> V1: -Initial version
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Sripathi Kodi <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Dinakar Guniguntala <[email protected]>
> Cc: Ulrich Drepper <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jakub Jelinek <[email protected]>
> ---
>
> include/linux/futex.h | 8 +
> include/linux/thread_info.h | 3
> kernel/futex.c | 533 +++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 524 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 3bf5bb5..b05519c 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> +static long futex_lock_pi_restart(struct restart_block *restart)
> +{
> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> + ktime_t t, *tp = NULL;
> + int fshared = restart->futex.flags & FLAGS_SHARED;
> +
> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> + t.tv64 = restart->futex.time;
> + tp = &t;
> + }
> + restart->fn = do_no_restart_syscall;
> +
> + return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
> +}
>
> /*
> * Userspace attempted a TID -> 0 atomic transition, and failed.
> @@ -1772,6 +1968,290 @@ pi_faulted:
> return ret;
> }
>
> +
> +static long futex_wait_requeue_pi_restart(struct restart_block *restart)
> +{
> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> + u32 __user *uaddr2 = (u32 __user *)restart->futex.uaddr2;
> + int fshared = restart->futex.flags & FLAGS_SHARED;
> + int clockrt = restart->futex.flags & FLAGS_CLOCKRT;
> + ktime_t t, *tp = NULL;
> +
> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> + t.tv64 = restart->futex.time;
> + tp = &t;
> + }
> + restart->fn = do_no_restart_syscall;
> +
Strange your compiler did not complain...
The attached patch is what we used to test glibc with requeue_pi.
As a prototype we deleted the assembly optimizations and just updated
the .c implementation. Credit for the prototype goes to Sripathi Kodi.
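To give a rough idea of the shape of the glibc-side change without
posting the whole prototype here, the cond_wait fast path ends up
looking something like the sketch below. This is purely illustrative;
the type and helper names are invented and do not match the actual
prototype patch, which should be consulted instead.

struct cond {
	uint32_t seq;		/* condvar futex word */
	uint32_t *mutex_word;	/* associated PI mutex futex word */
};

/* futex() is the raw-syscall wrapper from the earlier sketch. */
int cond_wait_pi(struct cond *c)
{
	uint32_t seq = c->seq;

	mutex_unlock_pi(c->mutex_word);	/* hypothetical unlock helper */

	/*
	 * The kernel requeues us from c->seq to the mutex and, on
	 * wakeup, we return already owning the rt_mutex -- no
	 * userspace relock and no thundering herd.
	 */
	return futex(&c->seq, FUTEX_WAIT_REQUEUE_PI, seq, NULL,
		     c->mutex_word, 0);
}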
Thanks,
--
Darren Hart
Eric Dumazet wrote:
Hi Eric,
Thanks for the prompt review...
> Darren Hart wrote:
>> - t.tv64 = restart->futex.time;
>> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
>
> You probably meant:
>
> if (restart->futex.flags & FLAGS_HAS_TIMEOUT)
Gah! You are correct. Corrected patch below.
RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
From: Darren Hart <[email protected]>
Currently restart is only used if there is a timeout. The requeue_pi
functionality requires restarting to futex_lock_pi() on signal after
wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
or not. Using 0 for the timeout value is confusing as that could
indicate an expired timer. The flag makes this explicit.
Changelog:
V7pre: -Corrected FLAGS_HAS_TIMEOUT flag detection logic per Eric Dumazet
V6: - Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
kernel/futex.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index dd05803..07e314c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1254,6 +1254,7 @@ handle_fault:
*/
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
@@ -1493,7 +1494,7 @@ retry_private:
restart->futex.val = val;
restart->futex.time = abs_time->tv64;
restart->futex.bitset = bitset;
- restart->futex.flags = 0;
+ restart->futex.flags = FLAGS_HAS_TIMEOUT;
if (fshared)
restart->futex.flags |= FLAGS_SHARED;
@@ -1517,13 +1518,16 @@ static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
int fshared = 0;
- ktime_t t;
+ ktime_t t, *tp = NULL;
- t.tv64 = restart->futex.time;
+ if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
restart->fn = do_no_restart_syscall;
if (restart->futex.flags & FLAGS_SHARED)
fshared = 1;
- return (long)futex_wait(uaddr, fshared, restart->futex.val, &t,
+ return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
restart->futex.bitset,
restart->futex.flags & FLAGS_CLOCKRT);
}
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Eric Dumazet wrote:
Two more nice catches, thanks. Corrected patch below.
>> +static long futex_lock_pi_restart(struct restart_block *restart)
>> +{
>> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
>> + ktime_t t, *tp = NULL;
>> + int fshared = restart->futex.flags & FLAGS_SHARED;
>> +
>> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
>
> if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
>
>> + t.tv64 = restart->futex.time;
>> + tp = &t;
>> + }
>> + restart->fn = do_no_restart_syscall;
>> +
>
>
> Strange your compiler dit not complains...
Well, the comparison with an "|" is still valid - just happens to always
be true :-) I didn't get any errors - perhaps I should be compiling
with some additional options?
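To illustrate why it slips through: the OR expression is legal C that
merges the flag bit in and therefore tests nonzero no matter what, so
by default the compiler has nothing to object to. A standalone snippet
using the flag values from the patch:

#include <assert.h>

#define FLAGS_SHARED		0x01
#define FLAGS_CLOCKRT		0x02
#define FLAGS_HAS_TIMEOUT	0x04

int main(void)
{
	unsigned int flags = FLAGS_SHARED;	/* timeout bit NOT set */

	/* OR merges the bit in: always nonzero, so always "true". */
	assert((flags | FLAGS_HAS_TIMEOUT) != 0);

	/* AND masks the bit out: the intended test, false here. */
	assert((flags & FLAGS_HAS_TIMEOUT) == 0);

	return 0;
}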
RFC: futex: add requeue_pi calls
From: Darren Hart <[email protected]>
PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
pending waiters as this will break the PI boosting logic, so the standard
requeue commands aren't sufficient. The new commands properly manage pi futex
ownership by ensuring a futex with waiters has an owner at all times. This
will allow glibc to properly handle pi mutexes with pthread_condvars.
The approach taken here is to create two new futex op codes:
FUTEX_WAIT_REQUEUE_PI:
Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will acquire this pi futex (and the underlying rt_mutex).
futex_wait_requeue_pi() is the result of a high speed collision between
futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many tasks the caller intends to wake or requeue.
pthread_cond_broadcast() should call this with nr_wake=1 and
nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
nr_requeue=0. The reason being we need both callers to get the benefit of the
futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
Changelog:
V7pre: -Corrected FLAGS_HAS_TIMEOUT flag detection logic per Eric Dumazet
V6: -Moved non requeue_pi related fixes/changes into separate patches
-Make use of new double_unlock_hb()
-Futex key management updates
-Removed unnecessary futex_requeue_pi_cleanup() routine
-Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
-Rewrote futex_wait_requeue_pi() wakeup logic
-Rewrote requeue/wakeup loop
-Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
-Handle third party owner, removed -EMORON :-(
-Comment updates
V5: -Update futex_requeue to allow for nr_requeue == 0
-Whitespace cleanup
-Added task_count var to futex_requeue to avoid confusion between
ret, res, and ret used to count wakes and requeues
V4: -Cleanups to pass checkpatch.pl
-Added missing goto out; in futex_wait_requeue_pi()
-Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
are a functional pair.
-Fixed several error exit paths that failed to unqueue the futex_q, which
not only would leave the futex_q on the hb, but would have caused an exit
race with the waiter since they weren't synchronized on the hb lock. Thanks
Sripathi for catching this.
-Fix pi_state handling in futex_requeue
-Several other minor fixes to futex_requeue_pi
-add requeue_futex function and force the requeue in requeue_pi even for the
task we wake in the requeue loop
-refill the pi state cache at the beginning of futex_requeue for requeue_pi
-have futex_requeue_pi_init ensure it stores off the pi_state for use in
futex_requeue
- Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
- Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
receives a signal after waking on uaddr2. Added has_timeout to the
restart->futex structure.
V3: -Added FUTEX_CMP_REQUEUE_PI op
-Put fshared support back. So long as it is encoded in the op code, we
assume both the uaddr's are either private or shared, but not mixed.
-Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
-Updated fault handling and exit logic
V1: -Initial version
Signed-off-by: Darren Hart <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Sripathi Kodi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Dinakar Guniguntala <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jakub Jelinek <[email protected]>
---
include/linux/futex.h | 8 +
include/linux/thread_info.h | 3
kernel/futex.c | 533 +++++++++++++++++++++++++++++++++++++++++--
3 files changed, 524 insertions(+), 20 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 3bf5bb5..b05519c 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -23,6 +23,9 @@ union ktime;
#define FUTEX_TRYLOCK_PI 8
#define FUTEX_WAIT_BITSET 9
#define FUTEX_WAKE_BITSET 10
+#define FUTEX_WAIT_REQUEUE_PI 11
+#define FUTEX_REQUEUE_PI 12
+#define FUTEX_CMP_REQUEUE_PI 13
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -38,6 +41,11 @@ union ktime;
#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PI_PRIVATE (FUTEX_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
+ FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index e6b820f..a8cc4e1 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -21,13 +21,14 @@ struct restart_block {
struct {
unsigned long arg0, arg1, arg2, arg3;
};
- /* For futex_wait */
+ /* For futex_wait and futex_wait_requeue_pi */
struct {
u32 *uaddr;
u32 val;
u32 flags;
u32 bitset;
u64 time;
+ u32 *uaddr2;
} futex;
/* For nanosleep */
struct {
diff --git a/kernel/futex.c b/kernel/futex.c
index a9c7da1..115ec52 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -19,6 +19,10 @@
* PRIVATE futexes by Eric Dumazet
* Copyright (C) 2007 Eric Dumazet <[email protected]>
*
+ * Requeue-PI support by Darren Hart <[email protected]>
+ * Copyright (C) IBM Corporation, 2009
+ * Thanks to Thomas Gleixner for conceptual design and careful reviews.
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -109,6 +113,9 @@ struct futex_q {
struct futex_pi_state *pi_state;
struct task_struct *task;
+ /* rt_waiter storage for requeue_pi: */
+ struct rt_mutex_waiter *rt_waiter;
+
/* Bitset for the optional bitmasked wakeup */
u32 bitset;
};
@@ -829,7 +836,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
- if (this->pi_state) {
+ if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
@@ -970,20 +977,116 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
q->key = *key2;
}
-/*
- * Requeue all waiters hashed on one physical page to another
- * physical page.
+/**
+ * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
+ * @pifutex: the user address of the to futex
+ * @hb1: the from futex hash bucket, must be locked by the caller
+ * @hb2: the to futex hash bucket, must be locked by the caller
+ * @key1: the from futex key
+ * @key2: the to futex key
+ *
+ * Try and get the lock on behalf of the top waiter if we can do it atomically.
+ * Wake the top waiter if we succeed. hb1 and hb2 must be held by the caller.
+ *
+ * Faults occur for two primary reasons at this point:
+ * 1) The address isn't mapped
+ * 2) The address isn't writeable
+ *
+ * We return EFAULT on either of these cases and rely on the caller to handle
+ * them.
+ *
+ * Returns:
+ *  0 - failed to acquire the lock atomically
+ * 1 - acquired the lock
+ * <0 - error
+ */
+static int futex_proxy_trylock_atomic(u32 __user *pifutex,
+ struct futex_hash_bucket *hb1,
+ struct futex_hash_bucket *hb2,
+ union futex_key *key1, union futex_key *key2,
+ struct futex_pi_state **ps)
+{
+ struct futex_q *top_waiter;
+ u32 curval;
+ int ret;
+
+ if (get_futex_value_locked(&curval, pifutex))
+ return -EFAULT;
+
+ top_waiter = futex_top_waiter(hb1, key1);
+
+ /* There are no waiters, nothing for us to do. */
+ if (!top_waiter)
+ return 0;
+
+ /*
+ * Either take the lock for top_waiter or set the FUTEX_WAITERS bit.
+ * The pi_state is returned in ps in contended cases.
+ */
+ ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task);
+ if (ret == 1) {
+ /*
+ * Set the top_waiter key for the requeue target futex so the
+ * waiter can detect the wakeup on the right futex, but remove
+ * it from the hb so it can detect atomic lock acquisition.
+ */
+ drop_futex_key_refs(&top_waiter->key);
+ get_futex_key_refs(key2);
+ top_waiter->key = *key2;
+ WARN_ON(plist_node_empty(&top_waiter->list));
+ plist_del(&top_waiter->list, &top_waiter->list.plist);
+ /*
+ * FIXME: wake_futex() wakes first, then nulls the lock_ptr,
+ * and uses a memory barrier. Do we need to?
+ */
+ top_waiter->lock_ptr = NULL;
+ wake_up(&top_waiter->waiter);
+ }
+
+ return ret;
+}
+
+/**
+ * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
+ * @uaddr1:	source futex user address
+ * @uaddr2:	target futex user address
+ * @nr_wake:	number of waiters to wake (must be 1 for requeue_pi)
+ * @nr_requeue:	number of waiters to requeue (0-INT_MAX)
+ * @requeue_pi:	if we are attempting to requeue from a non-pi futex to a
+ *		pi futex (pi to pi requeue is not supported)
+ *
+ * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
+ * uaddr2 atomically on behalf of the top waiter.
+ *
+ * Returns:
+ * >=0: on success, the number of tasks requeued or woken
+ * <0: on error
*/
static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
- int nr_wake, int nr_requeue, u32 *cmpval)
+ int nr_wake, int nr_requeue, u32 *cmpval,
+ int requeue_pi)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
+ int drop_count = 0, task_count = 0, ret;
+ struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
struct plist_head *head1;
struct futex_q *this, *next;
- int ret, drop_count = 0;
+ u32 curval2;
+
+ if (requeue_pi) {
+ if (refill_pi_state_cache())
+ return -ENOMEM;
+ if (nr_wake != 1)
+ return -EINVAL;
+ }
retry:
+ if (pi_state != NULL) {
+ free_pi_state(pi_state);
+ pi_state = NULL;
+ }
+
ret = get_futex_key(uaddr1, fshared, &key1);
if (unlikely(ret != 0))
goto out;
@@ -1022,19 +1125,92 @@ retry_private:
}
}
+ if (requeue_pi) {
+ /* Attempt to acquire uaddr2 and wake the top_waiter. */
+ ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
+ &key2, &pi_state);
+
+ /*
+ * At this point the top_waiter has either taken uaddr2 or is
+ * waiting on it. If the former, then the pi_state will not
+ * exist yet, look it up one more time to ensure we have a
+ * reference to it.
+ */
+ if (ret == 1 && !pi_state) {
+ task_count++;
+ ret = get_futex_value_locked(&curval2, uaddr2);
+ if (!ret)
+ ret = lookup_pi_state(curval2, hb2, &key2,
+ &pi_state);
+ }
+
+ switch (ret) {
+ case 0:
+ break;
+ case -EFAULT:
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ ret = get_user(curval2, uaddr2);
+ if (!ret)
+ goto retry;
+ goto out;
+ case -EAGAIN:
+ /* The owner was exiting, try again. */
+ double_unlock_hb(hb1, hb2);
+ put_futex_key(fshared, &key2);
+ put_futex_key(fshared, &key1);
+ cond_resched();
+ goto retry;
+ default:
+ goto out_unlock;
+ }
+ }
+
head1 = &hb1->chain;
plist_for_each_entry_safe(this, next, head1, list) {
- if (!match_futex (&this->key, &key1))
+ if (task_count - nr_wake >= nr_requeue)
+ break;
+
+ if (!match_futex(&this->key, &key1))
continue;
- if (++ret <= nr_wake) {
+
+ /* This can go after we're satisfied with testing. */
+ if (!requeue_pi)
+ WARN_ON(this->rt_waiter);
+
+ /*
+ * Wake nr_wake waiters. For requeue_pi, if we acquired the
+ * lock, we already woke the top_waiter. If not, it will be
+ * woken by futex_unlock_pi().
+ */
+ if (++task_count <= nr_wake && !requeue_pi) {
wake_futex(this);
- } else {
- requeue_futex(this, hb1, hb2, &key2);
- drop_count++;
+ continue;
+ }
- if (ret - nr_wake >= nr_requeue)
- break;
+ /*
+ * Requeue nr_requeue waiters and possibly one more in the case
+ * of requeue_pi if we couldn't acquire the lock atomically.
+ */
+ if (requeue_pi) {
+ /* This can go after we're satisfied with testing. */
+ WARN_ON(!this->rt_waiter);
+
+ /* Prepare the waiter to take the rt_mutex. */
+ atomic_inc(&pi_state->refcount);
+ this->pi_state = pi_state;
+ ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
+ this->rt_waiter,
+ this->task, 1);
+ if (ret) {
+ this->pi_state = NULL;
+ free_pi_state(pi_state);
+ goto out_unlock;
+ }
}
+ requeue_futex(this, hb1, hb2, &key2);
+ drop_count++;
}
out_unlock:
@@ -1049,7 +1225,9 @@ out_put_keys:
out_put_key1:
put_futex_key(fshared, &key1);
out:
- return ret;
+ if (pi_state != NULL)
+ free_pi_state(pi_state);
+ return ret ? ret : task_count;
}
/* The key must be already stored in q->key. */
@@ -1272,6 +1450,8 @@ handle_fault:
#define FLAGS_HAS_TIMEOUT 0x04
static long futex_wait_restart(struct restart_block *restart);
+static long futex_wait_requeue_pi_restart(struct restart_block *restart);
+static long futex_lock_pi_restart(struct restart_block *restart);
/**
* finish_futex_lock_pi() - Post lock pi_state and corner case management
@@ -1419,6 +1599,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,
q.pi_state = NULL;
q.bitset = bitset;
+ q.rt_waiter = NULL;
if (abs_time) {
unsigned long slack;
@@ -1575,6 +1756,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
}
q.pi_state = NULL;
+ q.rt_waiter = NULL;
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, fshared, &q.key);
@@ -1670,6 +1852,20 @@ uaddr_faulted:
goto retry;
}
+static long futex_lock_pi_restart(struct restart_block *restart)
+{
+ u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
+ ktime_t t, *tp = NULL;
+ int fshared = restart->futex.flags & FLAGS_SHARED;
+
+ if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
+ restart->fn = do_no_restart_syscall;
+
+ return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
+}
/*
* Userspace attempted a TID -> 0 atomic transition, and failed.
@@ -1772,6 +1968,290 @@ pi_faulted:
return ret;
}
+/**
+ * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
+ * @uaddr:	the futex we initially wait on (non-pi)
+ * @fshared: whether the futexes are shared (1) or not (0). They must be
+ * the same type, no requeueing from private to shared, etc.
+ * @val: the expected value of uaddr
+ * @abs_time: absolute timeout
+ * @bitset: 32 bit wakeup bitset set by userspace, defaults to all.
+ * @clockrt: whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
+ * @uaddr2: the pi futex we will take prior to returning to user-space
+ *
+ * The caller will wait on uaddr and will be requeued by futex_requeue() to
+ * uaddr2 which must be PI aware. Normal wakeup will wake on uaddr2 and
+ * complete the acquisition of the rt_mutex prior to returning to userspace.
+ * This ensures the rt_mutex maintains an owner when it has waiters; without
+ * one, the pi logic wouldn't know which task to boost/deboost, if there was a
+ * need to.
+ *
+ * We call schedule in futex_wait_queue_me() when we enqueue and return there
+ * via the following:
+ * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
+ * 2) wakeup on uaddr2 after a requeue and subsequent unlock
+ * 3) signal (before or after requeue)
+ * 4) timeout (before or after requeue)
+ *
+ * If 3, we setup a restart_block with futex_wait_requeue_pi() as the function.
+ *
+ * If 2, we may then block on trying to take the rt_mutex and return via:
+ * 5) successful lock
+ * 6) signal
+ * 7) timeout
+ * 8) other lock acquisition failure
+ *
+ * If 6, we setup a restart_block with futex_lock_pi() as the function.
+ *
+ * If 4 or 7, we cleanup and return with -ETIMEDOUT.
+ *
+ * Returns:
+ * 0 - On success
+ * <0 - On error
+ */
+static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
+ u32 val, ktime_t *abs_time, u32 bitset,
+ int clockrt, u32 __user *uaddr2)
+{
+ struct hrtimer_sleeper timeout, *to = NULL;
+ struct rt_mutex_waiter rt_waiter;
+ struct restart_block *restart;
+ struct futex_hash_bucket *hb;
+ struct rt_mutex *pi_mutex;
+ union futex_key key2;
+ struct futex_q q;
+ u32 uval;
+ int ret;
+
+ if (!bitset)
+ return -EINVAL;
+
+ if (abs_time) {
+ to = &timeout;
+ hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires_range_ns(&to->timer, *abs_time,
+ current->timer_slack_ns);
+ }
+
+ /*
+ * The waiter is allocated on our stack, manipulated by the requeue
+ * code while we sleep on uaddr.
+ */
+ debug_rt_mutex_init_waiter(&rt_waiter);
+ rt_waiter.task = NULL;
+
+ q.pi_state = NULL;
+ q.bitset = bitset;
+ q.rt_waiter = &rt_waiter;
+
+retry:
+ q.key = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr, fshared, &q.key);
+ if (unlikely(ret != 0))
+ goto out;
+
+ key2 = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr2, fshared, &key2);
+ if (unlikely(ret != 0)) {
+ put_futex_key(fshared, &q.key);
+ goto out;
+ }
+
+ hb = queue_lock(&q);
+
+ /*
+ * Access the page AFTER the hash-bucket is locked.
+ * Order is important:
+ *
+ * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+ * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
+ *
+ * The basic logical guarantee of a futex is that it blocks ONLY
+ * if cond(var) is known to be true at the time of blocking, for
+ * any cond. If we queued after testing *uaddr, that would open
+ * a race condition where we could block indefinitely with
+ * cond(var) false, which would violate the guarantee.
+ *
+ * A consequence is that futex_wait() can return zero and absorb
+ * a wakeup when *uaddr != val on entry to the syscall. This is
+ * rare, but normal.
+ */
+ ret = get_futex_value_locked(&uval, uaddr);
+
+ if (unlikely(ret)) {
+ queue_unlock(&q, hb);
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+
+ ret = get_user(uval, uaddr);
+ if (!ret)
+ goto retry;
+ goto out;
+ }
+
+ /* Only actually queue if *uaddr contained val. */
+ ret = -EWOULDBLOCK;
+ if (uval != val) {
+ queue_unlock(&q, hb);
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+ goto out;
+ }
+
+ /* Queue the futex_q, drop the hb lock, wait for wakeup. */
+ futex_wait_queue_me(hb, &q, to);
+
+ /*
+ * Ensure the requeue is atomic to avoid races while we process the
+ * wakeup. We only need to hold hb->lock to ensure atomicity as the
+ * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
+ * lock. It can't be requeued from uaddr2 to something else since we
+ * don't support a PI aware source futex for requeue.
+ */
+ spin_lock(&hb->lock);
+ if (!match_futex(&q.key, &key2)) {
+ WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
+ /*
+ * We were not requeued, handle wakeup from futex1 (uaddr). We
+ * cannot have been unqueued and already hold the lock, no need
+ * to call unqueue_me, just do it directly.
+ */
+ plist_del(&q.list, &q.list.plist);
+ drop_futex_key_refs(&q.key);
+
+ ret = -ETIMEDOUT;
+ if (to && !to->task) {
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+
+ /*
+ * We expect signal_pending(current), but another thread may
+ * have handled it for us already.
+ */
+ ret = -ERESTARTSYS;
+ if (!abs_time) {
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+
+	restart = &current_thread_info()->restart_block;
+ restart->fn = futex_wait_requeue_pi_restart;
+ restart->futex.uaddr = (u32 *)uaddr;
+ restart->futex.val = val;
+ restart->futex.time = abs_time->tv64;
+ restart->futex.bitset = bitset;
+ restart->futex.flags = 0;
+ restart->futex.uaddr2 = (u32 *)uaddr2;
+ restart->futex.flags = FLAGS_HAS_TIMEOUT;
+
+ if (fshared)
+ restart->futex.flags |= FLAGS_SHARED;
+ if (clockrt)
+ restart->futex.flags |= FLAGS_CLOCKRT;
+
+ ret = -ERESTART_RESTARTBLOCK;
+
+ spin_unlock(&hb->lock);
+ goto out_put_keys;
+ }
+ spin_unlock(&hb->lock);
+
+ ret = 0;
+ /*
+ * Check if the waker acquired the second futex for us. If the lock_ptr
+ * is NULL, but our key is key2, then the requeue target futex was
+ * uncontended and the waker gave it to us. This is safe without a lock
+ * as futex_requeue() will not release the hb lock until after it's
+ * nulled the lock_ptr and removed us from the hb.
+ */
+ if (!q.lock_ptr)
+ goto out_put_keys;
+
+ /*
+ * At this point we have been requeued. We have been woken up by
+ * futex_unlock_pi(), a timeout, or a signal, but not futex_requeue().
+ * futex_unlock_pi() will not destroy the lock_ptr nor the pi_state.
+ */
+	WARN_ON(!q.pi_state);
+ pi_mutex = &q.pi_state->pi_mutex;
+ ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
+	debug_rt_mutex_free_waiter(&rt_waiter);
+
+ spin_lock(q.lock_ptr);
+ ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);
+
+ /* Unqueue and drop the lock. */
+ unqueue_me_pi(&q);
+
+ /*
+ * If fixup_pi_state_owner() faulted and was unable to handle the
+ * fault, unlock it and return the fault to userspace.
+ */
+ if (ret == -EFAULT) {
+ if (rt_mutex_owner(pi_mutex) == current)
+ rt_mutex_unlock(pi_mutex);
+ } else if (ret == -EINTR) {
+ if (get_user(uval, uaddr2)) {
+ ret = -EFAULT;
+ goto out_put_keys;
+ }
+
+ /*
+ * We've already been requeued, so restart by calling
+ * futex_lock_pi() directly, rather then returning to this
+ * function.
+ */
+	restart = &current_thread_info()->restart_block;
+ restart->fn = futex_lock_pi_restart;
+ restart->futex.uaddr = (u32 *)uaddr2;
+ restart->futex.val = uval;
+ restart->futex.flags = 0;
+ if (abs_time) {
+ restart->futex.flags |= FLAGS_HAS_TIMEOUT;
+ restart->futex.time = abs_time->tv64;
+ }
+
+ if (fshared)
+ restart->futex.flags |= FLAGS_SHARED;
+ if (clockrt)
+ restart->futex.flags |= FLAGS_CLOCKRT;
+ ret = -ERESTART_RESTARTBLOCK;
+ }
+
+out_put_keys:
+ put_futex_key(fshared, &q.key);
+ put_futex_key(fshared, &key2);
+
+out:
+ if (to) {
+ hrtimer_cancel(&to->timer);
+ destroy_hrtimer_on_stack(&to->timer);
+ }
+ return ret;
+}
+
+static long futex_wait_requeue_pi_restart(struct restart_block *restart)
+{
+ u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
+ u32 __user *uaddr2 = (u32 __user *)restart->futex.uaddr2;
+ int fshared = restart->futex.flags & FLAGS_SHARED;
+ int clockrt = restart->futex.flags & FLAGS_CLOCKRT;
+ ktime_t t, *tp = NULL;
+
+ if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
+ t.tv64 = restart->futex.time;
+ tp = &t;
+ }
+ restart->fn = do_no_restart_syscall;
+
+ return (long)futex_wait_requeue_pi(uaddr, fshared, restart->futex.val,
+ tp, restart->futex.bitset, clockrt,
+ uaddr2);
+}
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -1994,7 +2474,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
fshared = 1;
clockrt = op & FUTEX_CLOCK_REALTIME;
- if (clockrt && cmd != FUTEX_WAIT_BITSET)
+ if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
switch (cmd) {
@@ -2009,10 +2489,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
ret = futex_wake(uaddr, fshared, val, val3);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 0);
break;
case FUTEX_WAKE_OP:
ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
@@ -2029,6 +2510,18 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
if (futex_cmpxchg_enabled)
ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
break;
+ case FUTEX_WAIT_REQUEUE_PI:
+ val3 = FUTEX_BITSET_MATCH_ANY;
+ ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
+ clockrt, uaddr2);
+ break;
+ case FUTEX_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 1);
+ break;
+ case FUTEX_CMP_REQUEUE_PI:
+ ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
+ 1);
+ break;
default:
ret = -ENOSYS;
}
@@ -2046,7 +2539,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
int cmd = op & FUTEX_CMD_MASK;
if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
- cmd == FUTEX_WAIT_BITSET)) {
+ cmd == FUTEX_WAIT_BITSET ||
+ cmd == FUTEX_WAIT_REQUEUE_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
@@ -2058,10 +2552,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
tp = &t;
}
/*
- * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
+ * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
* number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
*/
if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
+ cmd == FUTEX_REQUEUE_PI || cmd == FUTEX_CMP_REQUEUE_PI ||
cmd == FUTEX_WAKE_OP)
val2 = (u32) (unsigned long) utime;
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Darren Hart wrote:
> Eric Dumazet wrote:
>
> Two more nice catches, thanks. Corrected patch below.
If anyone is still wanting to pull these from git, you can grab them
from my -dev branch. Note: I pop and push branches to this branch,
whereas the versioned branches will remain constant.
http://git.kernel.org/?p=linux/kernel/git/dvhart/linux-2.6-tip-hacks.git;a=shortlog;h=requeue-pi-dev
Thanks,
Darren
>
>>> +static long futex_lock_pi_restart(struct restart_block *restart)
>>> +{
>>> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
>>> + ktime_t t, *tp = NULL;
>>> + int fshared = restart->futex.flags & FLAGS_SHARED;
>>> +
>>> + if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
>>
>> if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
>
>
>> if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
>>
>>> + t.tv64 = restart->futex.time;
>>> + tp = &t;
>>> + }
>>> + restart->fn = do_no_restart_syscall;
>>> +
>>
>>
>> Strange your compiler did not complain...
>
> Well, the comparison with an "|" is still valid - just happens to always
> be true :-) I didn't get any errors - perhaps I should be compiling
> with some additional options?
>
>
> RFC: futex: add requeue_pi calls
>
> From: Darren Hart <[email protected]>
>
> PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
> pending waiters as this will break the PI boosting logic, so the standard
> requeue commands aren't sufficient. The new commands properly manage pi futex
> ownership by ensuring a futex with waiters has an owner at all times. This
> will allow glibc to properly handle pi mutexes with pthread_condvars.
>
> The approach taken here is to create two new futex op codes:
>
> FUTEX_WAIT_REQUEUE_PI:
> Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
> and wake after they have been requeued to a pi futex. Prior to returning to
> userspace, they will acquire this pi futex (and the underlying rt_mutex).
>
> futex_wait_requeue_pi() is the result of a high speed collision between
> futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
> done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
>
> FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
> This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
> regardless of how many tasks the caller intends to wake or requeue.
> pthread_cond_broadcast() should call this with nr_wake=1 and
> nr_requeue=INT_MAX. pthread_cond_signal() should call this with
> nr_wake=1 and
> nr_requeue=0. The reason being we need both callers to get the benefit
> of the
> futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
> top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
>
> Changelog:
> V7pre: -Corrected FLAGS_HAS_TIMEOUT flag detection logic per Eric Dumazet
> V6: -Moved non requeue_pi related fixes/changes into separate patches
> -Make use of new double_unlock_hb()
> -Futex key management updates
> -Removed unnecessary futex_requeue_pi_cleanup() routine
> -Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
> -Rewrote futex_wait_requeue_pi() wakeup logic
> -Rewrote requeue/wakeup loop
> -Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
> -Handle third party owner, removed -EMORON :-(
> -Comment updates
> V5: -Update futex_requeue to allow for nr_requeue == 0
> -Whitespace cleanup
> -Added task_count var to futex_requeue to avoid confusion between
> ret, res, and ret used to count wakes and requeues
> V4: -Cleanups to pass checkpatch.pl
> -Added missing goto out; in futex_wait_requeue_pi()
> -Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
> are a functional pair.
> -Fixed several error exit paths that failed to unqueue the futex_q, which
> not only would leave the futex_q on the hb, but would have caused an exit
> race with the waiter since they weren't synchronized on the hb lock. Thanks
> Sripathi for catching this.
> -Fix pi_state handling in futex_requeue
> -Several other minor fixes to futex_requeue_pi
> -add requeue_futex function and force the requeue in requeue_pi even for the
> task we wake in the requeue loop
> -refill the pi state cache at the beginning of futex_requeue for requeue_pi
> -have futex_requeue_pi_init ensure it stores off the pi_state for use in
> futex_requeue
> - Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
> - Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
> receives a signal after waking on uaddr2. Added has_timeout to the
> restart->futex structure.
> V3: -Added FUTEX_CMP_REQUEUE_PI op
> -Put fshared support back. So long as it is encoded in the op code, we
> assume both the uaddr's are either private or shared, but not mixed.
> -Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
> V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
> -Updated fault handling and exit logic
> V1: -Initial version
>
> Signed-off-by: Darren Hart <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Sripathi Kodi <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Dinakar Guniguntala <[email protected]>
> Cc: Ulrich Drepper <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jakub Jelinek <[email protected]>
> ---
>
> include/linux/futex.h | 8 +
> include/linux/thread_info.h |    3
> kernel/futex.c              |  533 +++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 524 insertions(+), 20 deletions(-)
>
>
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 3bf5bb5..b05519c 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -23,6 +23,9 @@ union ktime;
> #define FUTEX_TRYLOCK_PI 8
> #define FUTEX_WAIT_BITSET 9
> #define FUTEX_WAKE_BITSET 10
> +#define FUTEX_WAIT_REQUEUE_PI 11
> +#define FUTEX_REQUEUE_PI 12
> +#define FUTEX_CMP_REQUEUE_PI 13
>
> #define FUTEX_PRIVATE_FLAG 128
> #define FUTEX_CLOCK_REALTIME 256
> @@ -38,6 +41,11 @@ union ktime;
> #define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
> #define FUTEX_WAIT_BITSET_PRIVATE (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
> #define FUTEX_WAKE_BITSET_PRIVATE (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
> +#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
> + FUTEX_PRIVATE_FLAG)
> +#define FUTEX_REQUEUE_PI_PRIVATE	(FUTEX_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
> +#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
> + FUTEX_PRIVATE_FLAG)
>
> /*
> * Support for robust futexes: the kernel cleans up held futexes at
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index e6b820f..a8cc4e1 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -21,13 +21,14 @@ struct restart_block {
> struct {
> unsigned long arg0, arg1, arg2, arg3;
> };
> - /* For futex_wait */
> + /* For futex_wait and futex_wait_requeue_pi */
> struct {
> u32 *uaddr;
> u32 val;
> u32 flags;
> u32 bitset;
> u64 time;
> + u32 *uaddr2;
> } futex;
> /* For nanosleep */
> struct {
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a9c7da1..115ec52 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -19,6 +19,10 @@
> * PRIVATE futexes by Eric Dumazet
> * Copyright (C) 2007 Eric Dumazet <[email protected]>
> *
> + * Requeue-PI support by Darren Hart <[email protected]>
> + * Copyright (C) IBM Corporation, 2009
> + * Thanks to Thomas Gleixner for conceptual design and careful reviews.
> + *
> * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
> * enough at me, Linus for the original (flawed) idea, Matthew
> * Kirkwood for proof-of-concept implementation.
> @@ -109,6 +113,9 @@ struct futex_q {
> struct futex_pi_state *pi_state;
> struct task_struct *task;
>
> + /* rt_waiter storage for requeue_pi: */
> + struct rt_mutex_waiter *rt_waiter;
> +
> /* Bitset for the optional bitmasked wakeup */
> u32 bitset;
> };
> @@ -829,7 +836,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
>
> plist_for_each_entry_safe(this, next, head, list) {
> if (match_futex (&this->key, &key)) {
> - if (this->pi_state) {
> + if (this->pi_state || this->rt_waiter) {
> ret = -EINVAL;
> break;
> }
> @@ -970,20 +977,116 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
> q->key = *key2;
> }
>
> -/*
> - * Requeue all waiters hashed on one physical page to another
> - * physical page.
> +/**
> + * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
> + * @pifutex: the user address of the to futex
> + * @hb1: the from futex hash bucket, must be locked by the caller
> + * @hb2: the to futex hash bucket, must be locked by the caller
> + * @key1: the from futex key
> + * @key2: the to futex key
> + *
> + * Try and get the lock on behalf of the top waiter if we can do it atomically.
> + * Wake the top waiter if we succeed. hb1 and hb2 must be held by the caller.
> + *
> + * Faults occur for two primary reasons at this point:
> + * 1) The address isn't mapped
> + * 2) The address isn't writeable
> + *
> + * We return EFAULT on either of these cases and rely on the caller to handle
> + * them.
> + *
> + * Returns:
> + *  0 - failed to acquire the lock atomically
> + * 1 - acquired the lock
> + * <0 - error
> + */
> +static int futex_proxy_trylock_atomic(u32 __user *pifutex,
> + struct futex_hash_bucket *hb1,
> + struct futex_hash_bucket *hb2,
> + union futex_key *key1, union futex_key *key2,
> + struct futex_pi_state **ps)
> +{
> + struct futex_q *top_waiter;
> + u32 curval;
> + int ret;
> +
> + if (get_futex_value_locked(&curval, pifutex))
> + return -EFAULT;
> +
> + top_waiter = futex_top_waiter(hb1, key1);
> +
> + /* There are no waiters, nothing for us to do. */
> + if (!top_waiter)
> + return 0;
> +
> + /*
> + * Either take the lock for top_waiter or set the FUTEX_WAITERS bit.
> + * The pi_state is returned in ps in contended cases.
> + */
> + ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task);
> + if (ret == 1) {
> + /*
> + * Set the top_waiter key for the requeue target futex so the
> + * waiter can detect the wakeup on the right futex, but remove
> + * it from the hb so it can detect atomic lock acquisition.
> + */
> + drop_futex_key_refs(&top_waiter->key);
> + get_futex_key_refs(key2);
> + top_waiter->key = *key2;
> + WARN_ON(plist_node_empty(&top_waiter->list));
> + plist_del(&top_waiter->list, &top_waiter->list.plist);
> + /*
> + * FIXME: wake_futex() wakes first, then nulls the lock_ptr,
> + * and uses a memory barrier. Do we need to?
> + */
> + top_waiter->lock_ptr = NULL;
> + wake_up(&top_waiter->waiter);
> + }
> +
> + return ret;
> +}
> +
> +/**
> + * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
> + * @uaddr1:	source futex user address
> + * @uaddr2:	target futex user address
> + * @nr_wake:	number of waiters to wake (must be 1 for requeue_pi)
> + * @nr_requeue:	number of waiters to requeue (0-INT_MAX)
> + * @requeue_pi:	if we are attempting to requeue from a non-pi futex to a
> + *		pi futex (pi to pi requeue is not supported)
> + *
> + * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
> + * uaddr2 atomically on behalf of the top waiter.
> + *
> + * Returns:
> + * >=0: on success, the number of tasks requeued or woken
> + * <0: on error
> */
> static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
> - int nr_wake, int nr_requeue, u32 *cmpval)
> + int nr_wake, int nr_requeue, u32 *cmpval,
> + int requeue_pi)
> {
> union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
> + int drop_count = 0, task_count = 0, ret;
> + struct futex_pi_state *pi_state = NULL;
> struct futex_hash_bucket *hb1, *hb2;
> struct plist_head *head1;
> struct futex_q *this, *next;
> - int ret, drop_count = 0;
> + u32 curval2;
> +
> + if (requeue_pi) {
> + if (refill_pi_state_cache())
> + return -ENOMEM;
> + if (nr_wake != 1)
> + return -EINVAL;
> + }
>
> retry:
> + if (pi_state != NULL) {
> + free_pi_state(pi_state);
> + pi_state = NULL;
> + }
> +
> ret = get_futex_key(uaddr1, fshared, &key1);
> if (unlikely(ret != 0))
> goto out;
> @@ -1022,19 +1125,92 @@ retry_private:
> }
> }
>
> + if (requeue_pi) {
> + /* Attempt to acquire uaddr2 and wake the top_waiter. */
> + ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
> + &key2, &pi_state);
> +
> + /*
> + * At this point the top_waiter has either taken uaddr2 or is
> + * waiting on it. If the former, then the pi_state will not
> + * exist yet, look it up one more time to ensure we have a
> + * reference to it.
> + */
> + if (ret == 1 && !pi_state) {
> + task_count++;
> + ret = get_futex_value_locked(&curval2, uaddr2);
> + if (!ret)
> + ret = lookup_pi_state(curval2, hb2, &key2,
> + &pi_state);
> + }
> +
> + switch (ret) {
> + case 0:
> + break;
> + case -EFAULT:
> + double_unlock_hb(hb1, hb2);
> + put_futex_key(fshared, &key2);
> + put_futex_key(fshared, &key1);
> + ret = get_user(curval2, uaddr2);
> + if (!ret)
> + goto retry;
> + goto out;
> + case -EAGAIN:
> + /* The owner was exiting, try again. */
> + double_unlock_hb(hb1, hb2);
> + put_futex_key(fshared, &key2);
> + put_futex_key(fshared, &key1);
> + cond_resched();
> + goto retry;
> + default:
> + goto out_unlock;
> + }
> + }
> +
> head1 = &hb1->chain;
> plist_for_each_entry_safe(this, next, head1, list) {
> - if (!match_futex (&this->key, &key1))
> + if (task_count - nr_wake >= nr_requeue)
> + break;
> +
> + if (!match_futex(&this->key, &key1))
> continue;
> - if (++ret <= nr_wake) {
> +
> + /* This can go after we're satisfied with testing. */
> + if (!requeue_pi)
> + WARN_ON(this->rt_waiter);
> +
> + /*
> + * Wake nr_wake waiters. For requeue_pi, if we acquired the
> + * lock, we already woke the top_waiter. If not, it will be
> + * woken by futex_unlock_pi().
> + */
> + if (++task_count <= nr_wake && !requeue_pi) {
> wake_futex(this);
> - } else {
> - requeue_futex(this, hb1, hb2, &key2);
> - drop_count++;
> + continue;
> + }
>
> - if (ret - nr_wake >= nr_requeue)
> - break;
> + /*
> + * Requeue nr_requeue waiters and possibly one more in the case
> + * of requeue_pi if we couldn't acquire the lock atomically.
> + */
> + if (requeue_pi) {
> + /* This can go after we're satisfied with testing. */
> + WARN_ON(!this->rt_waiter);
> +
> + /* Prepare the waiter to take the rt_mutex. */
> + atomic_inc(&pi_state->refcount);
> + this->pi_state = pi_state;
> + ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
> + this->rt_waiter,
> + this->task, 1);
> + if (ret) {
> + this->pi_state = NULL;
> + free_pi_state(pi_state);
> + goto out_unlock;
> + }
> }
> + requeue_futex(this, hb1, hb2, &key2);
> + drop_count++;
> }
>
> out_unlock:
> @@ -1049,7 +1225,9 @@ out_put_keys:
> out_put_key1:
> put_futex_key(fshared, &key1);
> out:
> - return ret;
> + if (pi_state != NULL)
> + free_pi_state(pi_state);
> + return ret ? ret : task_count;
> }
>
> /* The key must be already stored in q->key. */
> @@ -1272,6 +1450,8 @@ handle_fault:
> #define FLAGS_HAS_TIMEOUT 0x04
>
> static long futex_wait_restart(struct restart_block *restart);
> +static long futex_wait_requeue_pi_restart(struct restart_block *restart);
> +static long futex_lock_pi_restart(struct restart_block *restart);
>
> /**
> * finish_futex_lock_pi() - Post lock pi_state and corner case management
> @@ -1419,6 +1599,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,
>
> q.pi_state = NULL;
> q.bitset = bitset;
> + q.rt_waiter = NULL;
>
> if (abs_time) {
> unsigned long slack;
> @@ -1575,6 +1756,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
> }
>
> q.pi_state = NULL;
> + q.rt_waiter = NULL;
> retry:
> q.key = FUTEX_KEY_INIT;
> ret = get_futex_key(uaddr, fshared, &q.key);
> @@ -1670,6 +1852,20 @@ uaddr_faulted:
> goto retry;
> }
>
> +static long futex_lock_pi_restart(struct restart_block *restart)
> +{
> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> + ktime_t t, *tp = NULL;
> + int fshared = restart->futex.flags & FLAGS_SHARED;
> +
> + if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> + t.tv64 = restart->futex.time;
> + tp = &t;
> + }
> + restart->fn = do_no_restart_syscall;
> +
> + return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
> +}
>
> /*
> * Userspace attempted a TID -> 0 atomic transition, and failed.
> @@ -1772,6 +1968,290 @@ pi_faulted:
> return ret;
> }
>
> +/**
> + * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
> + * @uaddr: the futex we initially wait on (non-pi)
> + * @fshared: whether the futexes are shared (1) or not (0). They must be
> + *           the same type, no requeueing from private to shared, etc.
> + * @val: the expected value of uaddr
> + * @abs_time: absolute timeout
> + * @bitset: 32 bit wakeup bitset set by userspace, defaults to all.
> + * @clockrt: whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
> + * @uaddr2: the pi futex we will take prior to returning to user-space
> + *
> + * The caller will wait on uaddr and will be requeued by futex_requeue() to
> + * uaddr2 which must be PI aware. Normal wakeup will wake on uaddr2 and
> + * complete the acquisition of the rt_mutex prior to returning to userspace.
> + * This ensures the rt_mutex maintains an owner when it has waiters; without
> + * one, the pi logic wouldn't know which task to boost/deboost, if there
> + * was a need to.
> + *
> + * We call schedule in futex_wait_queue_me() when we enqueue and return
> + * there via the following:
> + * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
> + * 2) wakeup on uaddr2 after a requeue and subsequent unlock
> + * 3) signal (before or after requeue)
> + * 4) timeout (before or after requeue)
> + *
> + * If 3, we set up a restart_block with futex_wait_requeue_pi() as the
> + * function.
> + *
> + * If 2, we may then block on trying to take the rt_mutex and return via:
> + * 5) successful lock
> + * 6) signal
> + * 7) timeout
> + * 8) other lock acquisition failure
> + *
> + * If 6, we set up a restart_block with futex_lock_pi() as the function.
> + *
> + * If 4 or 7, we clean up and return with -ETIMEDOUT.
> + *
> + * Returns:
> + *  0 - On success
> + * <0 - On error
> + */
> +static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> + u32 val, ktime_t *abs_time, u32 bitset,
> + int clockrt, u32 __user *uaddr2)
> +{
> + struct hrtimer_sleeper timeout, *to = NULL;
> + struct rt_mutex_waiter rt_waiter;
> + struct restart_block *restart;
> + struct futex_hash_bucket *hb;
> + struct rt_mutex *pi_mutex;
> + union futex_key key2;
> + struct futex_q q;
> + u32 uval;
> + int ret;
> +
> + if (!bitset)
> + return -EINVAL;
> +
> + if (abs_time) {
> + to = &timeout;
> + hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
> + CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
> + hrtimer_init_sleeper(to, current);
> + hrtimer_set_expires_range_ns(&to->timer, *abs_time,
> + current->timer_slack_ns);
> + }
> +
> + /*
> + * The waiter is allocated on our stack, manipulated by the requeue
> + * code while we sleep on uaddr.
> + */
> + debug_rt_mutex_init_waiter(&rt_waiter);
> + rt_waiter.task = NULL;
> +
> + q.pi_state = NULL;
> + q.bitset = bitset;
> + q.rt_waiter = &rt_waiter;
> +
> +retry:
> + q.key = FUTEX_KEY_INIT;
> + ret = get_futex_key(uaddr, fshared, &q.key);
> + if (unlikely(ret != 0))
> + goto out;
> +
> + key2 = FUTEX_KEY_INIT;
> + ret = get_futex_key(uaddr2, fshared, &key2);
> + if (unlikely(ret != 0)) {
> + put_futex_key(fshared, &q.key);
> + goto out;
> + }
> +
> + hb = queue_lock(&q);
> +
> + /*
> + * Access the page AFTER the hash-bucket is locked.
> + * Order is important:
> + *
> + * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
> + * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
> + *
> + * The basic logical guarantee of a futex is that it blocks ONLY
> + * if cond(var) is known to be true at the time of blocking, for
> + * any cond. If we queued after testing *uaddr, that would open
> + * a race condition where we could block indefinitely with
> + * cond(var) false, which would violate the guarantee.
> + *
> + * A consequence is that futex_wait() can return zero and absorb
> + * a wakeup when *uaddr != val on entry to the syscall. This is
> + * rare, but normal.
> + */
> + ret = get_futex_value_locked(&uval, uaddr);
> +
> + if (unlikely(ret)) {
> + queue_unlock(&q, hb);
> + put_futex_key(fshared, &q.key);
> + put_futex_key(fshared, &key2);
> +
> + ret = get_user(uval, uaddr);
> + if (!ret)
> + goto retry;
> + goto out;
> + }
> +
> + /* Only actually queue if *uaddr contained val. */
> + ret = -EWOULDBLOCK;
> + if (uval != val) {
> + queue_unlock(&q, hb);
> + put_futex_key(fshared, &q.key);
> + put_futex_key(fshared, &key2);
> + goto out;
> + }
> +
> + /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> + futex_wait_queue_me(hb, &q, to);
> +
> + /*
> + * Ensure the requeue is atomic to avoid races while we process the
> + * wakeup. We only need to hold hb->lock to ensure atomicity as the
> + * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
> + * lock. It can't be requeued from uaddr2 to something else since we
> + * don't support a PI aware source futex for requeue.
> + */
> + spin_lock(&hb->lock);
> + if (!match_futex(&q.key, &key2)) {
> + WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
> + /*
> + * We were not requeued, handle wakeup from futex1 (uaddr). We
> + * cannot have been unqueued and already hold the lock, no need
> + * to call unqueue_me, just do it directly.
> + */
> + plist_del(&q.list, &q.list.plist);
> + drop_futex_key_refs(&q.key);
> +
> + ret = -ETIMEDOUT;
> + if (to && !to->task) {
> + spin_unlock(&hb->lock);
> + goto out_put_keys;
> + }
> +
> + /*
> + * We expect signal_pending(current), but another thread may
> + * have handled it for us already.
> + */
> + ret = -ERESTARTSYS;
> + if (!abs_time) {
> + spin_unlock(&hb->lock);
> + goto out_put_keys;
> + }
> +
> + restart = &current_thread_info()->restart_block;
> + restart->fn = futex_wait_requeue_pi_restart;
> + restart->futex.uaddr = (u32 *)uaddr;
> + restart->futex.val = val;
> + restart->futex.time = abs_time->tv64;
> + restart->futex.bitset = bitset;
> + restart->futex.uaddr2 = (u32 *)uaddr2;
> + restart->futex.flags = FLAGS_HAS_TIMEOUT;
> +
> + if (fshared)
> + restart->futex.flags |= FLAGS_SHARED;
> + if (clockrt)
> + restart->futex.flags |= FLAGS_CLOCKRT;
> +
> + ret = -ERESTART_RESTARTBLOCK;
> +
> + spin_unlock(&hb->lock);
> + goto out_put_keys;
> + }
> + spin_unlock(&hb->lock);
> +
> + ret = 0;
> + /*
> + * Check if the waker acquired the second futex for us. If the lock_ptr
> + * is NULL, but our key is key2, then the requeue target futex was
> + * uncontended and the waker gave it to us. This is safe without a lock
> + * as futex_requeue() will not release the hb lock until after it's
> + * nulled the lock_ptr and removed us from the hb.
> + */
> + if (!q.lock_ptr)
> + goto out_put_keys;
> +
> + /*
> + * At this point we have been requeued. We have been woken up by
> + * futex_unlock_pi(), a timeout, or a signal, but not futex_requeue().
> + * futex_unlock_pi() will not destroy the lock_ptr nor the pi_state.
> + */
> + WARN_ON(!q.pi_state);
> + pi_mutex = &q.pi_state->pi_mutex;
> + ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
> + debug_rt_mutex_free_waiter(&rt_waiter);
> +
> + spin_lock(q.lock_ptr);
> + ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);
> +
> + /* Unqueue and drop the lock. */
> + unqueue_me_pi(&q);
> +
> + /*
> + * If fixup_pi_state_owner() faulted and was unable to handle the
> + * fault, unlock it and return the fault to userspace.
> + */
> + if (ret == -EFAULT) {
> + if (rt_mutex_owner(pi_mutex) == current)
> + rt_mutex_unlock(pi_mutex);
> + } else if (ret == -EINTR) {
> + if (get_user(uval, uaddr2)) {
> + ret = -EFAULT;
> + goto out_put_keys;
> + }
> +
> + /*
> + * We've already been requeued, so restart by calling
> + * futex_lock_pi() directly, rather than returning to this
> + * function.
> + */
> + restart = &current_thread_info()->restart_block;
> + restart->fn = futex_lock_pi_restart;
> + restart->futex.uaddr = (u32 *)uaddr2;
> + restart->futex.val = uval;
> + restart->futex.flags = 0;
> + if (abs_time) {
> + restart->futex.flags |= FLAGS_HAS_TIMEOUT;
> + restart->futex.time = abs_time->tv64;
> + }
> +
> + if (fshared)
> + restart->futex.flags |= FLAGS_SHARED;
> + if (clockrt)
> + restart->futex.flags |= FLAGS_CLOCKRT;
> + ret = -ERESTART_RESTARTBLOCK;
> + }
> +
> +out_put_keys:
> + put_futex_key(fshared, &q.key);
> + put_futex_key(fshared, &key2);
> +
> +out:
> + if (to) {
> + hrtimer_cancel(&to->timer);
> + destroy_hrtimer_on_stack(&to->timer);
> + }
> + return ret;
> +}
> +
> +static long futex_wait_requeue_pi_restart(struct restart_block *restart)
> +{
> + u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> + u32 __user *uaddr2 = (u32 __user *)restart->futex.uaddr2;
> + int fshared = restart->futex.flags & FLAGS_SHARED;
> + int clockrt = restart->futex.flags & FLAGS_CLOCKRT;
> + ktime_t t, *tp = NULL;
> +
> + if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> + t.tv64 = restart->futex.time;
> + tp = &t;
> + }
> + restart->fn = do_no_restart_syscall;
> +
> + return (long)futex_wait_requeue_pi(uaddr, fshared, restart->futex.val,
> + tp, restart->futex.bitset, clockrt,
> + uaddr2);
> +}
> +
> /*
> * Support for robust futexes: the kernel cleans up held futexes at
> * thread exit time.
> @@ -1994,7 +2474,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
> fshared = 1;
>
> clockrt = op & FUTEX_CLOCK_REALTIME;
> - if (clockrt && cmd != FUTEX_WAIT_BITSET)
> - if (clockrt && cmd != FUTEX_WAIT_BITSET)
> + if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
>
> switch (cmd) {
> @@ -2009,10 +2489,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
> ret = futex_wake(uaddr, fshared, val, val3);
> break;
> case FUTEX_REQUEUE:
> - ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
> + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
> break;
> case FUTEX_CMP_REQUEUE:
> - ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
> + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
> + 0);
> break;
> case FUTEX_WAKE_OP:
> ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
> @@ -2029,6 +2510,18 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
> if (futex_cmpxchg_enabled)
> ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
> break;
> + case FUTEX_WAIT_REQUEUE_PI:
> + val3 = FUTEX_BITSET_MATCH_ANY;
> + ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
> + clockrt, uaddr2);
> + break;
> + case FUTEX_REQUEUE_PI:
> + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 1);
> + break;
> + case FUTEX_CMP_REQUEUE_PI:
> + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
> + 1);
> + break;
> default:
> ret = -ENOSYS;
> }
> @@ -2046,7 +2539,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
> int cmd = op & FUTEX_CMD_MASK;
>
> if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
> - cmd == FUTEX_WAIT_BITSET)) {
> + cmd == FUTEX_WAIT_BITSET ||
> + cmd == FUTEX_WAIT_REQUEUE_PI)) {
> if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
> return -EFAULT;
> if (!timespec_valid(&ts))
> @@ -2058,10 +2552,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
> tp = &t;
> }
> /*
> - * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
> + * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
> * number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
> */
> if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
> + cmd == FUTEX_REQUEUE_PI || cmd == FUTEX_CMP_REQUEUE_PI ||
> cmd == FUTEX_WAKE_OP)
> val2 = (u32) (unsigned long) utime;
>
>
>
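For the curious, here is a rough sketch of how the new ops might be driven
from userspace via the raw futex syscall. These wrappers are hypothetical
and untested - the real consumer is the glibc patch at the end of this
series - and assume the new FUTEX_WAIT_REQUEUE_PI/FUTEX_CMP_REQUEUE_PI
constants are visible via linux/futex.h:

	#include <stdint.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/futex.h>

	/* Wait on the non-pi futex 'cond', expecting to be requeued onto
	 * the pi futex 'mutex' by a later FUTEX_CMP_REQUEUE_PI. */
	static long futex_wait_requeue_pi(uint32_t *cond, uint32_t val,
					  const struct timespec *abs_to,
					  uint32_t *mutex)
	{
		return syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, val,
			       abs_to, mutex, 0);
	}

	/* Wake one waiter (nr_wake must be 1 for requeue_pi) and requeue up
	 * to nr_requeue waiters from 'cond' to the pi futex 'mutex',
	 * provided *cond still equals cmpval. The nr_requeue value travels
	 * in the utime slot, per the existing FUTEX_CMP_REQUEUE convention. */
	static long futex_cmp_requeue_pi(uint32_t *cond, int nr_requeue,
					 uint32_t *mutex, uint32_t cmpval)
	{
		return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PI, 1,
			       (void *)(unsigned long)nr_requeue, mutex,
			       cmpval);
	}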
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
On Mon, 30 Mar 2009, Darren Hart wrote:
>
> Darren Hart (8):
> RFC: futex: add requeue_pi calls
> RFC: futex: Add requeue_futex() call
> RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
I just finished reviewing the first three patches, and I don't see
anything wrong with them. I'll review more tomorrow, but just for now, you
can add on to the first three:
Reviewed-by: Steven Rostedt <[email protected]>
-- Steve
> RFC: rt_mutex: add proxy lock routines
> RFC: futex: finish_futex_lock_pi()
> RFC: futex: futex_lock_pi_atomic()
> RFC: futex: futex_top_waiter()
> RFC: futex: futex_wait_queue_me()
>
Steven Rostedt wrote:
>
>
> On Mon, 30 Mar 2009, Darren Hart wrote:
>> Darren Hart (8):
>> RFC: futex: add requeue_pi calls
>> RFC: futex: Add requeue_futex() call
>> RFC: futex: Add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
>
> I just finished reviewing the first three patches, and I don't see
> anything wrong with them. I'll review more tomorrow, but just for now, you
> can add on to the first three:
>
> Reviewed-by: Steven Rostedt <[email protected]>
Thanks Steven!
I presume you mean the FIRST three, as in 0/8, 1/8, and 2/8 as opposed
to the top three above which would be 6/8, 7/8, and 8/8 ?
>
> -- Steve
>
>
>> RFC: rt_mutex: add proxy lock routines
>> RFC: futex: finish_futex_lock_pi()
>> RFC: futex: futex_lock_pi_atomic()
>> RFC: futex: futex_top_waiter()
>> RFC: futex: futex_wait_queue_me()
>>
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
On Mon, 30 Mar 2009, Darren Hart wrote:
> +
> + /* add_wait_queue is the barrier after __set_current_state. */
> + __set_current_state(TASK_INTERRUPTIBLE);
> + add_wait_queue(&q->waiter, &wait);
> + /*
> + * NOTE: we don't remove ourselves from the waitqueue because
> + * we are the only user of it.
> + */
This comment, while correct, is at an odd place.
> + /* Arm the timer */
> + if (timeout) {
> + hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
> + if (!hrtimer_active(&timeout->timer))
> + timeout->task = NULL;
> + }
> +
> + /*
> + * !plist_node_empty() is safe here without any lock.
> + * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
> + */
> + if (likely(!plist_node_empty(&q->list))) {
> + /*
> + * If the timer has already expired, current will already be
> + * flagged for rescheduling. Only call schedule if there
> + * is no timeout, or if it has yet to expire.
> + */
> + if (!timeout || likely(timeout->task))
Remove the likely(). It does not make sense
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> +}
> +
> static int futex_wait(u32 __user *uaddr, int fshared,
> u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
> {
> - struct task_struct *curr = current;
> + struct hrtimer_sleeper timeout, *to = NULL;
> struct restart_block *restart;
> - DECLARE_WAITQUEUE(wait, curr);
> struct futex_hash_bucket *hb;
> struct futex_q q;
> u32 uval;
> int ret;
> - struct hrtimer_sleeper t;
> - int rem = 0;
>
> if (!bitset)
> return -EINVAL;
>
> q.pi_state = NULL;
> q.bitset = bitset;
> +
> + if (abs_time) {
> + unsigned long slack;
missing new line
> + to = &timeout;
> + slack = current->timer_slack_ns;
> + if (rt_task(current))
> + slack = 0;
Hmm. I thought we would use current->timer_slack_ns ?
Thanks,
tglx
On Mon, 30 Mar 2009, Darren Hart wrote:
> +static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
> + union futex_key *key,
> + struct futex_pi_state **ps,
> + struct task_struct *task)
> +{
> + u32 uval, newval, curval;
> + int lock_taken;
> + int ret;
> + int ownerdied = 0;
int lock_taken, ret, ownerdied = 0;
Thanks,
tglx
On Mon, 30 Mar 2009, Darren Hart wrote:
> +
> + if (requeue_pi) {
> + if (refill_pi_state_cache())
> + return -ENOMEM;
> + if (nr_wake != 1)
> + return -EINVAL;
> + }
Needs a comment
> retry:
> + if (pi_state != NULL) {
> + free_pi_state(pi_state);
> + pi_state = NULL;
> + }
Ditto
> + if (requeue_pi) {
> + /* Attempt to acquire uaddr2 and wake the top_waiter. */
> + ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
> + &key2, &pi_state);
> +
> + /*
> + * At this point the top_waiter has either taken uaddr2 or is
> + * waiting on it. If the former, then the pi_state will not
> + * exist yet, look it up one more time to ensure we have a
> + * reference to it.
> + */
> + if (ret == 1 && !pi_state) {
> + task_count++;
> + ret = get_futex_value_locked(&curval2, uaddr2);
> + if (!ret)
> + ret = lookup_pi_state(curval2, hb2, &key2,
> + &pi_state);
> + }
> +
> + switch (ret) {
> + case 0:
> + break;
> + case -EFAULT:
> + double_unlock_hb(hb1, hb2);
> + put_futex_key(fshared, &key2);
> + put_futex_key(fshared, &key1);
> + ret = get_user(curval2, uaddr2);
> + if (!ret)
> + goto retry;
Hmm. What happens if futex_proxy_trylock_atomic() succeeds, but
get_futex_value_locked() fails? I guess it's unlikely to happen, but
in case it does are we sure that our state is correct ?
> +
> + /* This can go after we're satisfied with testing. */
> + if (!requeue_pi)
> + WARN_ON(this->rt_waiter);
Keep those WARN_ONs. They are useful when code is changed later. Just make it
WARN_ON(!requeue_pi && this->rt_waiter);
WARN_ON(requeue_pi && !this->rt_waiter);
> + /*
> + * Requeue nr_requeue waiters and possibly one more in the case
> + * of requeue_pi if we couldn't acquire the lock atomically.
> + */
> + if (requeue_pi) {
> + /* This can go after we're satisfied with testing. */
> + WARN_ON(!this->rt_waiter);
> +
> + /* Prepare the waiter to take the rt_mutex. */
> + atomic_inc(&pi_state->refcount);
> + this->pi_state = pi_state;
> + ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
> + this->rt_waiter,
> + this->task, 1);
> + if (ret) {
> + this->pi_state = NULL;
> + free_pi_state(pi_state);
> + goto out_unlock;
What are the reasons for dropping out here?
> + /*
> + * Ensure the requeue is atomic to avoid races while we process the
> + * wakeup. We only need to hold hb->lock to ensure atomicity as the
> + * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
> + * lock. It can't be requeued from uaddr2 to something else since we
> + * don't support a PI aware source futex for requeue.
> + */
> + spin_lock(&hb->lock);
> + if (!match_futex(&q.key, &key2)) {
> + WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
> + /*
> + * We were not requeued, handle wakeup from futex1 (uaddr). We
> + * cannot have been unqueued and already hold the lock, no need
> + * to call unqueue_me, just do it directly.
> + */
> + plist_del(&q.list, &q.list.plist);
> + drop_futex_key_refs(&q.key);
> +
> + ret = -ETIMEDOUT;
> + if (to && !to->task) {
> + spin_unlock(&hb->lock);
> + goto out_put_keys;
> + }
> +
> + /*
> + * We expect signal_pending(current), but another thread may
> + * have handled it for us already.
> + */
> + ret = -ERESTARTSYS;
> + if (!abs_time) {
> + spin_unlock(&hb->lock);
> + goto out_put_keys;
> + }
Hmm. Why do we use restart block here ? The timeout is absolute, so
we can just restart the syscall, can't we ?
> + ret = 0;
> + /*
> + * Check if the waker acquired the second futex for us. If the lock_ptr
> + * is NULL, but our key is key2, then the requeue target futex was
> + * uncontended and the waker gave it to us. This is safe without a lock
> + * as futex_requeue() will not release the hb lock until after it's
> + * nulled the lock_ptr and removed us from the hb.
> + */
> + if (!q.lock_ptr)
> + goto out_put_keys;
At this point we hold the rt_mutex, the pi state is correct and the
user space variable at *uaddr2 is updated, right ??
Thanks,
tglx
Thomas Gleixner wrote:
> On Mon, 30 Mar 2009, Darren Hart wrote:
>> +
>> + /* add_wait_queue is the barrier after __set_current_state. */
>> + __set_current_state(TASK_INTERRUPTIBLE);
>> + add_wait_queue(&q->waiter, &wait);
>> + /*
>> + * NOTE: we don't remove ourselves from the waitqueue because
>> + * we are the only user of it.
>> + */
>
> This comment, while correct, is at an odd place.
How about something like this:
/* add_wait_queue is the barrier after __set_current_state. */
__set_current_state(TASK_INTERRUPTIBLE);
/*
* Add current as the futex_q waiter. We don't remove ourselves from
* the wait_queue because we are the only user of it.
*/
add_wait_queue(&q->waiter, &wait);
>
>> + /* Arm the timer */
>> + if (timeout) {
>> + hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
>> + if (!hrtimer_active(&timeout->timer))
>> + timeout->task = NULL;
>> + }
>> +
>> + /*
>> + * !plist_node_empty() is safe here without any lock.
>> + * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
>> + */
>> + if (likely(!plist_node_empty(&q->list))) {
>> + /*
>> + * If the timer has already expired, current will already be
>> + * flagged for rescheduling. Only call schedule if there
>> + * is no timeout, or if it has yet to expire.
>> + */
>> + if (!timeout || likely(timeout->task))
>
> Remove the likely(). It does not make sense
Done.
>
>> + schedule();
>> + }
>> + __set_current_state(TASK_RUNNING);
>> +}
>> +
>> static int futex_wait(u32 __user *uaddr, int fshared,
>> u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
>> {
>> - struct task_struct *curr = current;
>> + struct hrtimer_sleeper timeout, *to = NULL;
>> struct restart_block *restart;
>> - DECLARE_WAITQUEUE(wait, curr);
>> struct futex_hash_bucket *hb;
>> struct futex_q q;
>> u32 uval;
>> int ret;
>> - struct hrtimer_sleeper t;
>> - int rem = 0;
>>
>> if (!bitset)
>> return -EINVAL;
>>
>> q.pi_state = NULL;
>> q.bitset = bitset;
>> +
>> + if (abs_time) {
>> + unsigned long slack;
>
> missing new line
>
>> + to = &timeout;
>> + slack = current->timer_slack_ns;
>> + if (rt_task(current))
>> + slack = 0;
>
> Hmm. I thought we would use current->timer_slack_ns ?
Hrm, right - so long as I'm changing futex_wait I might as well correct this. Done (which also voids the "missing new line" comment above).
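The timeout setup then collapses to match what futex_wait_requeue_pi()
already does in this series:

	if (abs_time) {
		to = &timeout;
		hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
				      CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
		hrtimer_init_sleeper(to, current);
		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
					     current->timer_slack_ns);
	}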
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Thomas Gleixner wrote:
> On Mon, 30 Mar 2009, Darren Hart wrote:
>> +static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
>> + union futex_key *key,
>> + struct futex_pi_state **ps,
>> + struct task_struct *task)
>> +{
>> + u32 uval, newval, curval;
>> + int lock_taken;
>> + int ret;
>> + int ownerdied = 0;
>
> int lock_taken, ret, ownerdied = 0;
>
Fixed, and reordered to meet preferred reverse line length order of
declarations:
int lock_taken, ret, ownerdied = 0;
u32 uval, newval, curval;
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Thomas Gleixner wrote:
> On Mon, 30 Mar 2009, Darren Hart wrote:
>> +
>> + if (requeue_pi) {
>> + if (refill_pi_state_cache())
>> + return -ENOMEM;
>> + if (nr_wake != 1)
>> + return -EINVAL;
>> + }
>
> Needs a comment
/*
* requeue_pi requires a pi_state, try to allocate it now in case it
* fails. requeue_pi can wake at most one task since it acquires the
* rt_mutex prior to returning to userspace, and must wake it if it can
* so as to not leave the rt_mutex with waiters and no owner.
*/
>
>> retry:
>> + if (pi_state != NULL) {
>> + free_pi_state(pi_state);
>> + pi_state = NULL;
>> + }
>
> Ditto
/*
* We will have to lookup the pi_state again, so
* free this one to keep the accounting correct.
*/
>> + if (requeue_pi) {
>> + /* Attempt to acquire uaddr2 and wake the top_waiter. */
>> + ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
>> + &key2, &pi_state);
>> +
>> + /*
>> + * At this point the top_waiter has either taken uaddr2 or is
>> + * waiting on it. If the former, then the pi_state will not
>> + * exist yet, look it up one more time to ensure we have a
>> + * reference to it.
>> + */
>> + if (ret == 1 && !pi_state) {
>> + task_count++;
>> + ret = get_futex_value_locked(&curval2, uaddr2);
>> + if (!ret)
>> + ret = lookup_pi_state(curval2, hb2, &key2,
>> + &pi_state);
>> + }
>> +
>> + switch (ret) {
>> + case 0:
>> + break;
>> + case -EFAULT:
>> + double_unlock_hb(hb1, hb2);
>> + put_futex_key(fshared, &key2);
>> + put_futex_key(fshared, &key1);
>> + ret = get_user(curval2, uaddr2);
>> + if (!ret)
>> + goto retry;
>
> Hmm. What happens if futex_proxy_trylock_atomic() succeeds, but
> get_futex_value_locked() fails? I guess it's unlikely to happen, but
> in case it does are we sure that our state is correct ?
Right so...
First, I suppose the !pi_state in that if block above should be turned
into a WARN_ON as ret == 1 should imply !pi_state.
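Something like this, perhaps (untested sketch, keeping the existing lookup
logic):

	if (ret == 1) {
		WARN_ON(pi_state);
		task_count++;
		ret = get_futex_value_locked(&curval2, uaddr2);
		if (!ret)
			ret = lookup_pi_state(curval2, hb2, &key2,
					      &pi_state);
	}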
If get_futex_value_locked() faults, uaddr2 still contains the
top_waiter's pid and that task is on its way to waking up. We unlock the
hash buckets, which allows the lock to be released (maybe the top_waiter
is REALLY fast) or additional waiters to be added. If we fault while not
atomic (get_user() above) then we goto out, returning a fatal fault to
userspace, but the top_waiter will continue to wake. I think this is
OK. We can't return both the number woken AND an error code without
passing another userspace variable for the number woken/requeued (which
might not be a bad idea actually).
If we don't fault (most likely) we goto retry which will not free the
pi_state since it will be NULL. We grab the keys and the hb locks, do
the cmpval test (on uaddr, not uaddr2, so we're still ok). We'll then
try to take the lock atomically again. We may actually succeed again if
the top_waiter was really fast and dropped uaddr2 already...
HRM... this means we might wake additional tasks... so for requeue_pi
nr_wake is really ignored... we'll wake however many waiters we can/have
to in order to ensure a contended rt_mutex always has an owner... which
raises the question of is it ok to return >1 when user called with
nr_wake=1 and nr_requeue=0 (for cond_signal)?
If the lock is still contended, the next futex_proxy_trylock_atomic()
will fail and we'll continue to requeue the remaining tasks.
So the only concern is the muddied relevance of nr_wake.... Any thoughts
on how we should treat nr_wake? I suppose if nr_wake=1 and
nr_requeue=0, then we should avoid calling futex_proxy_trylock_atomic()
a second time... something like this:
if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
/* Attempt to acquire uaddr2 and wake the top_waiter. */
ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
&key2, &pi_state);
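With that guard, a cond_signal style caller (nr_wake=1, nr_requeue=0) would
skip any second futex_proxy_trylock_atomic() attempt once task_count reaches
1, so we would never wake more tasks than requested.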
>
>> +
>> + /* This can go after we're satisfied with testing. */
>> + if (!requeue_pi)
>> + WARN_ON(this->rt_waiter);
>
> Keep those WARN_ONs. They are useful when code is changed later. Just make it
>
> WARN_ON(!requeue_pi && this->rt_waiter);
> WARN_ON(requeue_pi && !this->rt_waiter);
Makes sense, done.
>
>> + /*
>> + * Requeue nr_requeue waiters and possibly one more in the case
>> + * of requeue_pi if we couldn't acquire the lock atomically.
>> + */
>> + if (requeue_pi) {
>> + /* This can go after we're satisfied with testing. */
>> + WARN_ON(!this->rt_waiter);
>> +
>> + /* Prepare the waiter to take the rt_mutex. */
>> + atomic_inc(&pi_state->refcount);
>> + this->pi_state = pi_state;
>> + ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
>> + this->rt_waiter,
>> + this->task, 1);
>> + if (ret) {
>> + this->pi_state = NULL;
>> + free_pi_state(pi_state);
>> + goto out_unlock;
>
> What are the reasons for dropping out here?
Eeek. Good catch.
if ret == 1, then we need to wake up the waiter much like we do in
futex_proxy_trylock_atomic().
if ret < 0 (-EDEADLK), I think we should return the error to userspace
immediately, aborting any pending requeues - thoughts?
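Roughly this (untested sketch - the wake-up mirrors what
futex_proxy_trylock_atomic() does, and task_count was already incremented
for this waiter at the top of the loop):

	ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
					this->rt_waiter,
					this->task, 1);
	if (ret == 1) {
		/* Lock acquired on the waiter's behalf, wake it up. */
		drop_futex_key_refs(&this->key);
		get_futex_key_refs(&key2);
		this->key = key2;
		WARN_ON(plist_node_empty(&this->list));
		plist_del(&this->list, &this->list.plist);
		this->lock_ptr = NULL;
		wake_up(&this->waiter);
		continue;
	}
	if (ret < 0) {
		/* -EDEADLK, abort any pending requeues. */
		this->pi_state = NULL;
		free_pi_state(pi_state);
		goto out_unlock;
	}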
>
>> + /*
>> + * Ensure the requeue is atomic to avoid races while we process the
>> + * wakeup. We only need to hold hb->lock to ensure atomicity as the
>> + * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
>> + * lock. It can't be requeued from uaddr2 to something else since we
>> + * don't support a PI aware source futex for requeue.
>> + */
>> + spin_lock(&hb->lock);
>> + if (!match_futex(&q.key, &key2)) {
>> + WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
>> + /*
>> + * We were not requeued, handle wakeup from futex1 (uaddr). We
>> + * cannot have been unqueued and already hold the lock, no need
>> + * to call unqueue_me, just do it directly.
>> + */
>> + plist_del(&q.list, &q.list.plist);
>> + drop_futex_key_refs(&q.key);
>> +
>> + ret = -ETIMEDOUT;
>> + if (to && !to->task) {
>> + spin_unlock(&hb->lock);
>> + goto out_put_keys;
>> + }
>> +
>> + /*
>> + * We expect signal_pending(current), but another thread may
>> + * have handled it for us already.
>> + */
>> + ret = -ERESTARTSYS;
>> + if (!abs_time) {
>> + spin_unlock(&hb->lock);
>> + goto out_put_keys;
>> + }
>
> Hmm. Why do we use restart block here ? The timeout is absolute, so
> we can just restart the syscall, can't we ?
Yup, I caught this last night too. I'll update.
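With an absolute timeout there is nothing to carry across a restart, so the
whole restart_block setup can go; on a signal something like this should do
(sketch):

	/*
	 * The timeout is absolute; a plain syscall restart with the
	 * original arguments is sufficient.
	 */
	ret = -ERESTARTNOINTR;
	spin_unlock(&hb->lock);
	goto out_put_keys;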
>
>> + ret = 0;
>> + /*
>> + * Check if the waker acquired the second futex for us. If the lock_ptr
>> + * is NULL, but our key is key2, then the requeue target futex was
>> + * uncontended and the waker gave it to us. This is safe without a lock
>> + * as futex_requeue() will not release the hb lock until after it's
>> + * nulled the lock_ptr and removed us from the hb.
>> + */
>> + if (!q.lock_ptr)
>> + goto out_put_keys;
>
> At this point we hold the rt_mutex, the pi state is correct and the
> user space variable at *uaddr2 is updated, right ??
uaddr2 has already been updated by futex_lock_pi_atomic().
We do hold the rt_mutex if there is one (a waiter would have to have
been added after we took the lock and set up the pi_state, creating the
rt_mutex and assigning us as the owner).
We only get to this point if we acquired the uncontended lock, so there is
no lock stealing to worry about. I believe this means we can avoid the
corner case cleanup of finish_futex_lock_pi(). I'll pore over this a bit
more though and make sure I can convince myself of that.
Thank you for the review!
>
> Thanks,
>
> tglx
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team