2010-04-05 20:24:24

by Darren Hart

[permalink] [raw]
Subject: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

In-Reply-To:

NOT FOR INCLUSION

The following patch series implements a new experimental kernel side futex mutex
via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin
follows the kernel mutex model of allowing one spinner until the lock is
released or the owner is descheduled. The patch currently allows the user to
specify if they want no spinning, a single adaptive spinner, or multiple
spinners (aggressive adaptive spinning, or aas... which I have mistyped as "ass"
enough times to realize a better term is indeed required :-).
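
To make the interface concrete, here is a minimal sketch of how userspace
might drive the new op codes. The op code values are taken from this series;
the fastpath wrapper and TID plumbing are illustrative assumptions, not part
of the patches:

    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define FUTEX_LOCK          13  /* from this series */
    #define FUTEX_LOCK_ADAPTIVE 14

    /* Hypothetical fastpath: attempt the 0 -> TID transition in
     * userspace and enter the kernel only on contention. */
    static void futex_mutex_lock(atomic_uint *lock, unsigned int tid)
    {
            unsigned int zero = 0;

            if (atomic_compare_exchange_strong(lock, &zero, tid))
                    return; /* uncontended: we now own the lock */
            /* Contended: the kernel (adaptively) spins and/or blocks,
             * returning with the futex value set to our TID. */
            syscall(SYS_futex, lock, FUTEX_LOCK_ADAPTIVE, 0, NULL, NULL, 0);
    }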

I'm using the futex_lock branch of my futextest test suite to gather results.
The test is performance/futex_lock.c and can be checked out here:

git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git

Per Avi's suggestion I added asm instruction based period and duty-cycle options
to do some testing. A series of plots with 256 threads, one per period length,
is shown at the URL below. Each plot compares raw futex lock (normal, no
spinning) with a single spinner (adaptive) and with multiple spinners (aas) for
each of several duty-cycles to determine performance at various levels of
contention. The test measures the number of lock/unlock pairs performed per
second (in thousands).

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/
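
The period and duty-cycle options boil down to something like the following
per-thread loop (a sketch; futextest's actual implementation differs in
detail, and futex_mutex_lock()/futex_mutex_unlock() are the assumed helpers
from the sketch above):

    /* Burn roughly 'n' instruction slots; the empty asm keeps the
     * compiler from deleting the loop. */
    static inline void burn(unsigned long n)
    {
            while (n--)
                    __asm__ __volatile__("" ::: "memory");
    }

    /* One iteration: hold the lock for duty% of the period and run
     * unlocked for the remainder. Lock/unlock pairs per second fall
     * out of timing a fixed number of iterations. */
    static void one_iteration(atomic_uint *lock, unsigned int tid,
                              unsigned long period, unsigned long duty)
    {
            futex_mutex_lock(lock, tid);
            burn(period * duty / 100);              /* critical section */
            futex_mutex_unlock(lock);               /* assumed helper */
            burn(period * (100 - duty) / 100);      /* parallel work */
    }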

Avi's suggested starting point was 1000 instructions with a 10% duty cycle.
That plot "Futex Lock/Unlock (Period=1000, Threads=256)" does appear to be the
most interesting of the lot. It's not so short that other overhead becomes the
bottleneck, and not so long as to make adaptive spinning benefits show up
as noise in the numbers. The adaptive spin (with a single waiter) consistently
beats the normal run, and outperforms aas for most duty-cycles. I rechecked a
few points on this plot to confirm and the relative scores remained consistent.
These plots were generated using 10,000,000 iterations per datapoint.

Lastly, I should mention that these results all underperform a simple
pthread_mutex_lock()/pthread_mutex_unlock() pair. I'm looking into why but felt
I was overdue in sharing what I have to date. A test comparing this to a
sched_yield() style userspace spinlock would probably be more appropriate.
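
For reference, the sched_yield() style userspace spinlock mentioned above is
typically nothing more than the following (a sketch, not part of the series):

    #include <sched.h>
    #include <stdatomic.h>

    /* Naive userspace spinlock that yields the CPU instead of blocking;
     * the kernel never learns who owns the lock. */
    static void yield_spin_lock(atomic_flag *lock)
    {
            while (atomic_flag_test_and_set_explicit(lock,
                                                     memory_order_acquire))
                    sched_yield(); /* let the (maybe preempted) owner run */
    }

    static void yield_spin_unlock(atomic_flag *lock)
    {
            atomic_flag_clear_explicit(lock, memory_order_release);
    }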

Thanks,

Darren Hart
IBM Linux Technology Center
Real-Time Linux Team


2010-04-05 20:24:38

by Darren Hart

[permalink] [raw]
Subject: [PATCH 2/6] futex: add futex_q static initializer

The futex_q struct has grown considerably over the last year or so. I
believe it now merits a static initializer to avoid uninitialized data
errors (having just spent more time than I care to admit debugging
an uninitialized q.bitset in an experimental new op code).

I originally planned on following the __THING_INITIALIZER/DECLARE_THING
method, but since we already had FUTEX_KEY_INIT, and I personally prefer
that method, I went that route.
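
For contrast, the __THING_INITIALIZER/DECLARE_THING route would have looked
roughly like this (hypothetical, not what the patch does):

    #define __FUTEX_Q_INITIALIZER() \
            { .task = NULL, .lock_ptr = NULL, .key = FUTEX_KEY_INIT, \
              .pi_state = NULL, .rt_waiter = NULL, \
              .requeue_pi_key = NULL, .bitset = FUTEX_BITSET_MATCH_ANY }

    #define DECLARE_FUTEX_Q(name) \
            struct futex_q name = __FUTEX_Q_INITIALIZER()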

Signed-off-by: Darren Hart <[email protected]>
---
kernel/futex.c | 19 +++++++++----------
1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ed08cfd..2ae18cd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -130,6 +130,12 @@ struct futex_q {
u32 bitset;
};

+#define FUTEX_Q_INIT \
+ { /* list gets initialized in queue_me() */ \
+ .task = NULL, .lock_ptr = NULL, .key = FUTEX_KEY_INIT, .pi_state = NULL, \
+ .rt_waiter = NULL, .requeue_pi_key = NULL, .bitset = FUTEX_BITSET_MATCH_ANY }
+
+
/*
* Hash buckets are shared by all the futex_keys that hash to the same
* location. Each key may have multiple futex_q structures, one for each task
@@ -1799,16 +1805,13 @@ static int futex_wait(u32 __user *uaddr, int flags, u32 val, ktime_t *abs_time,
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
struct futex_hash_bucket *hb;
- struct futex_q q;
+ struct futex_q q = FUTEX_Q_INIT;
int ret;

if (!bitset)
return -EINVAL;

- q.pi_state = NULL;
q.bitset = bitset;
- q.rt_waiter = NULL;
- q.requeue_pi_key = NULL;

if (abs_time) {
to = &timeout;
@@ -1899,7 +1902,7 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
{
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
- struct futex_q q;
+ struct futex_q q = FUTEX_Q_INIT;
int res, ret;

if (refill_pi_state_cache())
@@ -1913,9 +1916,6 @@ static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
hrtimer_set_expires(&to->timer, *time);
}

- q.pi_state = NULL;
- q.rt_waiter = NULL;
- q.requeue_pi_key = NULL;
retry:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
@@ -2206,7 +2206,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int flags,
struct rt_mutex *pi_mutex = NULL;
struct futex_hash_bucket *hb;
union futex_key key2;
- struct futex_q q;
+ struct futex_q q = FUTEX_Q_INIT;
int res, ret;

if (!bitset)
@@ -2234,7 +2234,6 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int flags,
if (unlikely(ret != 0))
goto out;

- q.pi_state = NULL;
q.bitset = bitset;
q.rt_waiter = &rt_waiter;
q.requeue_pi_key = &key2;
--
1.6.3.3

2010-04-05 20:24:45

by Darren Hart

[permalink] [raw]
Subject: [PATCH 1/6] futex: replace fshared and clockrt with combined flags

In the early days we passed the mmap sem around. That became the
"int fshared" with the fast gup improvements. Then we added
"int clockrt" in places. This patch unifies these options as "flags" and
cleans up various calls which had fshared in their signature but no
longer used it.

V4: Fix an operator precedence bug
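
(The changelog does not identify the offending expression; as a generic
illustration of this class of bug, note that == binds tighter than &:)

    #define FLAGS_SHARED 0x01

    /* Wrong: parses as flags & (FLAGS_SHARED == 0), i.e. flags & 0,
     * which is always false. */
    static int is_private_buggy(int flags)
    {
            return flags & FLAGS_SHARED == 0;
    }

    /* Right: */
    static int is_private(int flags)
    {
            return !(flags & FLAGS_SHARED);
    }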

Signed-off-by: Darren Hart <[email protected]>
---
kernel/futex.c | 215 +++++++++++++++++++++++++++-----------------------------
1 files changed, 104 insertions(+), 111 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index e7a35f1..ed08cfd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,14 @@ int __read_mostly futex_cmpxchg_enabled;
#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)

/*
+ * Futex flags used to encode options to functions and preserve them across
+ * restarts.
+ */
+#define FLAGS_SHARED 0x01
+#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
+
+/*
* Priority Inheritance state:
*/
struct futex_pi_state {
@@ -283,7 +291,7 @@ again:
}

static inline
-void put_futex_key(int fshared, union futex_key *key)
+void put_futex_key(union futex_key *key)
{
drop_futex_key_refs(key);
}
@@ -878,7 +886,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
/*
* Wake up waiters matching bitset queued on this futex (uaddr).
*/
-static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
+static int futex_wake(u32 __user *uaddr, int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -889,7 +897,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
if (!bitset)
return -EINVAL;

- ret = get_futex_key(uaddr, fshared, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
if (unlikely(ret != 0))
goto out;

@@ -915,7 +923,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
}

spin_unlock(&hb->lock);
- put_futex_key(fshared, &key);
+ put_futex_key(&key);
out:
return ret;
}
@@ -925,7 +933,7 @@ out:
* to this virtual address:
*/
static int
-futex_wake_op(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
+futex_wake_op(u32 __user *uaddr1, int flags, u32 __user *uaddr2,
int nr_wake, int nr_wake2, int op)
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
@@ -935,10 +943,10 @@ futex_wake_op(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
int ret, op_ret;

retry:
- ret = get_futex_key(uaddr1, fshared, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, fshared, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -970,11 +978,11 @@ retry_private:
if (ret)
goto out_put_keys;

- if (!fshared)
+ if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(fshared, &key2);
- put_futex_key(fshared, &key1);
+ put_futex_key(&key2);
+ put_futex_key(&key1);
goto retry;
}

@@ -1004,9 +1012,9 @@ retry_private:

double_unlock_hb(hb1, hb2);
out_put_keys:
- put_futex_key(fshared, &key2);
+ put_futex_key(&key2);
out_put_key1:
- put_futex_key(fshared, &key1);
+ put_futex_key(&key1);
out:
return ret;
}
@@ -1140,11 +1148,13 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,

/**
* futex_requeue() - Requeue waiters from uaddr1 to uaddr2
- * uaddr1: source futex user address
- * uaddr2: target futex user address
- * nr_wake: number of waiters to wake (must be 1 for requeue_pi)
- * nr_requeue: number of waiters to requeue (0-INT_MAX)
- * requeue_pi: if we are attempting to requeue from a non-pi futex to a
+ * @uaddr1: source futex user address
+ * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.). The futexes
+ * must be the same type; no requeueing from private to shared, etc.
+ * @uaddr2: target futex user address
+ * @nr_wake: number of waiters to wake (must be 1 for requeue_pi)
+ * @nr_requeue: number of waiters to requeue (0-INT_MAX)
+ * @requeue_pi: if we are attempting to requeue from a non-pi futex to a
* pi futex (pi to pi requeue is not supported)
*
* Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
@@ -1154,7 +1164,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
* >=0 - on success, the number of tasks requeued or woken
* <0 - on error
*/
-static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
+static int futex_requeue(u32 __user *uaddr1, int flags, u32 __user *uaddr2,
int nr_wake, int nr_requeue, u32 *cmpval,
int requeue_pi)
{
@@ -1197,10 +1207,10 @@ retry:
pi_state = NULL;
}

- ret = get_futex_key(uaddr1, fshared, &key1);
+ ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, fshared, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
if (unlikely(ret != 0))
goto out_put_key1;

@@ -1222,11 +1232,11 @@ retry_private:
if (ret)
goto out_put_keys;

- if (!fshared)
+ if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(fshared, &key2);
- put_futex_key(fshared, &key1);
+ put_futex_key(&key2);
+ put_futex_key(&key1);
goto retry;
}
if (curval != *cmpval) {
@@ -1266,8 +1276,8 @@ retry_private:
break;
case -EFAULT:
double_unlock_hb(hb1, hb2);
- put_futex_key(fshared, &key2);
- put_futex_key(fshared, &key1);
+ put_futex_key(&key2);
+ put_futex_key(&key1);
ret = fault_in_user_writeable(uaddr2);
if (!ret)
goto retry;
@@ -1275,8 +1285,8 @@ retry_private:
case -EAGAIN:
/* The owner was exiting, try again. */
double_unlock_hb(hb1, hb2);
- put_futex_key(fshared, &key2);
- put_futex_key(fshared, &key1);
+ put_futex_key(&key2);
+ put_futex_key(&key1);
cond_resched();
goto retry;
default:
@@ -1358,9 +1368,9 @@ out_unlock:
drop_futex_key_refs(&key1);

out_put_keys:
- put_futex_key(fshared, &key2);
+ put_futex_key(&key2);
out_put_key1:
- put_futex_key(fshared, &key1);
+ put_futex_key(&key1);
out:
if (pi_state != NULL)
free_pi_state(pi_state);
@@ -1500,7 +1510,7 @@ static void unqueue_me_pi(struct futex_q *q)
* private futexes.
*/
static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
- struct task_struct *newowner, int fshared)
+ struct task_struct *newowner)
{
u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
struct futex_pi_state *pi_state = q->pi_state;
@@ -1593,20 +1603,11 @@ handle_fault:
goto retry;
}

-/*
- * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'flags' shared capability
- */
-#define FLAGS_SHARED 0x01
-#define FLAGS_CLOCKRT 0x02
-#define FLAGS_HAS_TIMEOUT 0x04
-
static long futex_wait_restart(struct restart_block *restart);

/**
* fixup_owner() - Post lock pi_state and corner case management
* @uaddr: user address of the futex
- * @fshared: whether the futex is shared (1) or not (0)
* @q: futex_q (contains pi_state and access to the rt_mutex)
* @locked: if the attempt to take the rt_mutex succeeded (1) or not (0)
*
@@ -1619,8 +1620,7 @@ static long futex_wait_restart(struct restart_block *restart);
* 0 - success, lock not taken
* <0 - on error (-EFAULT)
*/
-static int fixup_owner(u32 __user *uaddr, int fshared, struct futex_q *q,
- int locked)
+static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
{
struct task_struct *owner;
int ret = 0;
@@ -1631,7 +1631,7 @@ static int fixup_owner(u32 __user *uaddr, int fshared, struct futex_q *q,
* did a lock-steal - fix up the PI-state in that case:
*/
if (q->pi_state->owner != current)
- ret = fixup_pi_state_owner(uaddr, q, current, fshared);
+ ret = fixup_pi_state_owner(uaddr, q, current);
goto out;
}

@@ -1658,7 +1658,7 @@ static int fixup_owner(u32 __user *uaddr, int fshared, struct futex_q *q,
* lock. Fix the state up.
*/
owner = rt_mutex_owner(&q->pi_state->pi_mutex);
- ret = fixup_pi_state_owner(uaddr, q, owner, fshared);
+ ret = fixup_pi_state_owner(uaddr, q, owner);
goto out;
}

@@ -1721,7 +1721,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
- * @fshared: whether the futex is shared (1) or not (0)
+ * @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
@@ -1734,7 +1734,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
* 0 - uaddr contains val and hb has been locked
* <0 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
-static int futex_wait_setup(u32 __user *uaddr, u32 val, int fshared,
+static int futex_wait_setup(u32 __user *uaddr, u32 val, int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
@@ -1759,7 +1759,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, int fshared,
*/
retry:
q->key = FUTEX_KEY_INIT;
- ret = get_futex_key(uaddr, fshared, &q->key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key);
if (unlikely(ret != 0))
return ret;

@@ -1775,10 +1775,10 @@ retry_private:
if (ret)
goto out;

- if (!fshared)
+ if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(fshared, &q->key);
+ put_futex_key(&q->key);
goto retry;
}

@@ -1789,12 +1789,12 @@ retry_private:

out:
if (ret)
- put_futex_key(fshared, &q->key);
+ put_futex_key(&q->key);
return ret;
}

-static int futex_wait(u32 __user *uaddr, int fshared,
- u32 val, ktime_t *abs_time, u32 bitset, int clockrt)
+static int futex_wait(u32 __user *uaddr, int flags, u32 val, ktime_t *abs_time,
+ u32 bitset)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct restart_block *restart;
@@ -1813,8 +1813,9 @@ static int futex_wait(u32 __user *uaddr, int fshared,
if (abs_time) {
to = &timeout;

- hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
- CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
+ CLOCK_REALTIME : CLOCK_MONOTONIC,
+ HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
hrtimer_set_expires_range_ns(&to->timer, *abs_time,
current->timer_slack_ns);
@@ -1822,7 +1823,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,

retry:
/* Prepare to wait on uaddr. */
- ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
goto out;

@@ -1842,7 +1843,7 @@ retry:
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current)) {
- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
goto retry;
}

@@ -1856,17 +1857,12 @@ retry:
restart->futex.val = val;
restart->futex.time = abs_time->tv64;
restart->futex.bitset = bitset;
- restart->futex.flags = FLAGS_HAS_TIMEOUT;
-
- if (fshared)
- restart->futex.flags |= FLAGS_SHARED;
- if (clockrt)
- restart->futex.flags |= FLAGS_CLOCKRT;
+ restart->futex.flags = flags;

ret = -ERESTART_RESTARTBLOCK;

out_put_key:
- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
out:
if (to) {
hrtimer_cancel(&to->timer);
@@ -1879,7 +1875,6 @@ out:
static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
- int fshared = 0;
ktime_t t, *tp = NULL;

if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
@@ -1887,11 +1882,9 @@ static long futex_wait_restart(struct restart_block *restart)
tp = &t;
}
restart->fn = do_no_restart_syscall;
- if (restart->futex.flags & FLAGS_SHARED)
- fshared = 1;
- return (long)futex_wait(uaddr, fshared, restart->futex.val, tp,
- restart->futex.bitset,
- restart->futex.flags & FLAGS_CLOCKRT);
+
+ return (long)futex_wait(uaddr, restart->futex.flags,
+ restart->futex.val, tp, restart->futex.bitset);
}


@@ -1901,8 +1894,8 @@ static long futex_wait_restart(struct restart_block *restart)
* if there are waiters then it will block, it does PI, etc. (Due to
* races the kernel might see a 0 value of the futex too.)
*/
-static int futex_lock_pi(u32 __user *uaddr, int fshared,
- int detect, ktime_t *time, int trylock)
+static int futex_lock_pi(u32 __user *uaddr, int flags, int detect,
+ ktime_t *time, int trylock)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct futex_hash_bucket *hb;
@@ -1925,7 +1918,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
q.requeue_pi_key = NULL;
retry:
q.key = FUTEX_KEY_INIT;
- ret = get_futex_key(uaddr, fshared, &q.key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
if (unlikely(ret != 0))
goto out;

@@ -1947,7 +1940,7 @@ retry_private:
* exit to complete.
*/
queue_unlock(&q, hb);
- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
cond_resched();
goto retry;
default:
@@ -1977,7 +1970,7 @@ retry_private:
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
*/
- res = fixup_owner(uaddr, fshared, &q, !ret);
+ res = fixup_owner(uaddr, &q, !ret);
/*
* If fixup_owner() returned an error, propagate that. If it acquired
* the lock, clear our -ETIMEDOUT or -EINTR.
@@ -2001,7 +1994,7 @@ out_unlock_put_key:
queue_unlock(&q, hb);

out_put_key:
- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);
@@ -2014,10 +2007,10 @@ uaddr_faulted:
if (ret)
goto out_put_key;

- if (!fshared)
+ if (!(flags & FLAGS_SHARED))
goto retry_private;

- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
goto retry;
}

@@ -2026,7 +2019,7 @@ uaddr_faulted:
* This is the in-kernel slowpath: we look up the PI state (if any),
* and do the rt-mutex unlock.
*/
-static int futex_unlock_pi(u32 __user *uaddr, int fshared)
+static int futex_unlock_pi(u32 __user *uaddr, int flags)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -2044,7 +2037,7 @@ retry:
if ((uval & FUTEX_TID_MASK) != task_pid_vnr(current))
return -EPERM;

- ret = get_futex_key(uaddr, fshared, &key);
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key);
if (unlikely(ret != 0))
goto out;

@@ -2099,14 +2092,14 @@ retry:

out_unlock:
spin_unlock(&hb->lock);
- put_futex_key(fshared, &key);
+ put_futex_key(&key);

out:
return ret;

pi_faulted:
spin_unlock(&hb->lock);
- put_futex_key(fshared, &key);
+ put_futex_key(&key);

ret = fault_in_user_writeable(uaddr);
if (!ret)
@@ -2166,7 +2159,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
/**
* futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
* @uaddr: the futex we initially wait on (non-pi)
- * @fshared: whether the futexes are shared (1) or not (0). They must be
+ * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, etc.). The futexes
* must be the same type; no requeueing from private to shared, etc.
* @val: the expected value of uaddr
* @abs_time: absolute timeout
@@ -2204,9 +2197,9 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
* 0 - On success
* <0 - On error
*/
-static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
+static int futex_wait_requeue_pi(u32 __user *uaddr, int flags,
u32 val, ktime_t *abs_time, u32 bitset,
- int clockrt, u32 __user *uaddr2)
+ u32 __user *uaddr2)
{
struct hrtimer_sleeper timeout, *to = NULL;
struct rt_mutex_waiter rt_waiter;
@@ -2221,8 +2214,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,

if (abs_time) {
to = &timeout;
- hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
- CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
+ CLOCK_REALTIME : CLOCK_MONOTONIC,
+ HRTIMER_MODE_ABS);
hrtimer_init_sleeper(to, current);
hrtimer_set_expires_range_ns(&to->timer, *abs_time,
current->timer_slack_ns);
@@ -2236,7 +2230,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
rt_waiter.task = NULL;

key2 = FUTEX_KEY_INIT;
- ret = get_futex_key(uaddr2, fshared, &key2);
+ ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2);
if (unlikely(ret != 0))
goto out;

@@ -2246,7 +2240,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
q.requeue_pi_key = &key2;

/* Prepare to wait on uaddr. */
- ret = futex_wait_setup(uaddr, val, fshared, &q, &hb);
+ ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
goto out_key2;

@@ -2274,8 +2268,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
*/
if (q.pi_state && (q.pi_state->owner != current)) {
spin_lock(q.lock_ptr);
- ret = fixup_pi_state_owner(uaddr2, &q, current,
- fshared);
+ ret = fixup_pi_state_owner(uaddr2, &q, current);
spin_unlock(q.lock_ptr);
}
} else {
@@ -2294,7 +2287,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
*/
- res = fixup_owner(uaddr2, fshared, &q, !ret);
+ res = fixup_owner(uaddr2, &q, !ret);
/*
* If fixup_owner() returned an error, propagate that. If it
* acquired the lock, clear -ETIMEDOUT or -EINTR.
@@ -2325,9 +2318,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
}

out_put_keys:
- put_futex_key(fshared, &q.key);
+ put_futex_key(&q.key);
out_key2:
- put_futex_key(fshared, &key2);
+ put_futex_key(&key2);

out:
if (to) {
@@ -2551,58 +2544,58 @@ void exit_robust_list(struct task_struct *curr)
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
- int clockrt, ret = -ENOSYS;
+ int ret = -ENOSYS;
int cmd = op & FUTEX_CMD_MASK;
- int fshared = 0;
+ int flags = 0;

if (!(op & FUTEX_PRIVATE_FLAG))
- fshared = 1;
+ flags |= FLAGS_SHARED;

- clockrt = op & FUTEX_CLOCK_REALTIME;
- if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
- return -ENOSYS;
+ if (op & FUTEX_CLOCK_REALTIME) {
+ flags |= FLAGS_CLOCKRT;
+ if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
+ return -ENOSYS;
+ }

switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
case FUTEX_WAIT_BITSET:
- ret = futex_wait(uaddr, fshared, val, timeout, val3, clockrt);
+ ret = futex_wait(uaddr, flags, val, timeout, val3);
break;
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
case FUTEX_WAKE_BITSET:
- ret = futex_wake(uaddr, fshared, val, val3);
+ ret = futex_wake(uaddr, flags, val, val3);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
+ ret = futex_requeue(uaddr, flags, uaddr2, val, val2, NULL, 0);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
- 0);
+ ret = futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 0);
break;
case FUTEX_WAKE_OP:
- ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
+ ret = futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
break;
case FUTEX_LOCK_PI:
if (futex_cmpxchg_enabled)
- ret = futex_lock_pi(uaddr, fshared, val, timeout, 0);
+ ret = futex_lock_pi(uaddr, flags, val, timeout, 0);
break;
case FUTEX_UNLOCK_PI:
if (futex_cmpxchg_enabled)
- ret = futex_unlock_pi(uaddr, fshared);
+ ret = futex_unlock_pi(uaddr, flags);
break;
case FUTEX_TRYLOCK_PI:
if (futex_cmpxchg_enabled)
- ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
+ ret = futex_lock_pi(uaddr, flags, 0, timeout, 1);
break;
case FUTEX_WAIT_REQUEUE_PI:
val3 = FUTEX_BITSET_MATCH_ANY;
- ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
- clockrt, uaddr2);
+ ret = futex_wait_requeue_pi(uaddr, flags, val, timeout, val3,
+ uaddr2);
break;
case FUTEX_CMP_REQUEUE_PI:
- ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
- 1);
+ ret = futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
break;
default:
ret = -ENOSYS;
--
1.6.3.3

2010-04-05 20:24:54

by Darren Hart

[permalink] [raw]
Subject: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

Add a non-pi TID value based futex locking mechanism. This enables the
use of adaptive spinning which was problematic with the basic FUTEX_WAIT
operation.

V4: Remove update_futex_waiters() and perform all waiter clearing in
userspace. Turns out to be generally better to do an O(1) operation in
userspace that forces one extra syscall than to do an O(N) operation during
every syscall. Go figure :).
Cleanup timeout when exiting adaptive loop.
Perform adaptive spinning after retry:
General cleanups.
V3: Fix some locking bugs
Use cmpxchg directly in the adaptive loop
Move the adaptive loop into its own function
Use FUTEX_WAIT instead of FUTEX_UNLOCK
Add update_futex_waiters()
V2: Fix preempt-count breakage

This patch is a forward port of the following patch from Peter Zijlstra
with some bug fixes and a reworking of the futex op codes.

> futex: Implement kernel-side adaptive spinning for futexes
>
> This is implemented as a new futex op: FUTEX_WAIT_ADAPTIVE, because it
> requires the futex lock field to contain a TID and regular futexes do
> not have that restriction.
>
> FUTEX_WAIT_ADAPTIVE will spin while the lock owner is running (on a
> different cpu) or until the task gets preempted. After that it behaves
> like FUTEX_WAIT.
>
> The spin loop tries to acquire the lock by cmpxchg(lock, 0, tid) == tid
> on the lower 30 bits (TID_MASK) of the lock field -- leaving the
> WAITERS and OWNER_DIED flags intact.
>
> NOTE: we could fold mutex_spin_on_owner() and futex_spin_on_owner() by
> adding a lock_owner function argument.
>
> void lock_spin_on_owner(struct thread_info *(*lock_owner)(void *lock),
> void *lock, struct thread_info *owner);
>
> Peter Zijlstra's original patch forward ported by Darren Hart.
>
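
The acquire step described above reduces to the following standalone sketch:
install the TID in the owner field while leaving the WAITERS and OWNER_DIED
bits alone. The mask values match the futex ABI; the helper itself is
illustrative:

    #include <stdatomic.h>

    #define FUTEX_WAITERS    0x80000000
    #define FUTEX_OWNER_DIED 0x40000000
    #define FUTEX_TID_MASK   0x3fffffff

    /* Returns 1 if we became the owner, 0 if somebody holds the lock. */
    static int futex_try_acquire(atomic_uint *lock, unsigned int tid)
    {
            unsigned int old = atomic_load(lock);

            if (old & FUTEX_TID_MASK)
                    return 0; /* owned: the TID field is non-zero */
            /* Claim the TID field, preserving the flag bits. */
            return atomic_compare_exchange_strong(lock, &old,
                            (old & ~FUTEX_TID_MASK) | tid);
    }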

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Darren Hart <[email protected]>
---
include/linux/futex.h | 6 +
include/linux/sched.h | 1 +
kernel/futex.c | 282 +++++++++++++++++++++++++++++++++++++++++++------
kernel/sched.c | 59 ++++++++++
4 files changed, 314 insertions(+), 34 deletions(-)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 1e5a26d..3ca93d1 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -20,6 +20,8 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_LOCK 13
+#define FUTEX_LOCK_ADAPTIVE 14

#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -39,6 +41,9 @@
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PRIVATE (FUTEX_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_ADAPTIVE_PRIVATE (FUTEX_LOCK_ADAPTIVE | \
+ FUTEX_PRIVATE_FLAG)

/*
* Support for robust futexes: the kernel cleans up held futexes at
@@ -177,6 +182,7 @@ union futex_key {
extern void exit_robust_list(struct task_struct *curr);
extern void exit_pi_state_list(struct task_struct *curr);
extern int futex_cmpxchg_enabled;
+extern struct thread_info *futex_owner(u32 __user *uaddr);
#else
static inline void exit_robust_list(struct task_struct *curr)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..885d659 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -362,6 +362,7 @@ extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
asmlinkage void schedule(void);
extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int futex_spin_on_owner(u32 __user *uaddr, struct thread_info *owner);

struct nsproxy;
struct user_namespace;
diff --git a/kernel/futex.c b/kernel/futex.c
index 8c1bb16..c33ac2a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,6 +75,7 @@ int __read_mostly futex_cmpxchg_enabled;
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_ADAPTIVE 0x08

/*
* Priority Inheritance state:
@@ -368,6 +369,35 @@ static int get_futex_value_locked(u32 *dest, u32 __user *from)
return ret ? -EFAULT : 0;
}

+#ifdef CONFIG_SMP
+struct thread_info *futex_owner(u32 __user *uaddr)
+{
+ struct thread_info *ti = NULL;
+ struct task_struct *p;
+ pid_t pid;
+ u32 uval;
+
+ if (get_futex_value_locked(&uval, uaddr))
+ return NULL;
+
+ pid = uval & FUTEX_TID_MASK;
+ rcu_read_lock();
+ p = find_task_by_vpid(pid);
+ if (p) {
+ const struct cred *cred, *pcred;
+
+ cred = current_cred();
+ pcred = __task_cred(p);
+ if (cred->euid == pcred->euid ||
+ cred->euid == pcred->uid)
+ ti = task_thread_info(p);
+ }
+ rcu_read_unlock();
+
+ return ti;
+}
+#endif
+

/*
* PI code:
@@ -717,7 +747,7 @@ retry:
* @hb: the pi futex hash bucket
* @key: the futex key associated with uaddr and hb
* @ps: the pi_state pointer where we store the result of the
- * lookup
+ * lookup. If non-NULL, hb and key must also be non-NULL.
* @task: the task to perform the atomic lock work for. This will
* be "current" except in the case of requeue pi.
* @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
@@ -727,7 +757,8 @@ retry:
* 1 - acquired the lock
* <0 - error
*
- * The hb->lock and futex_key refs shall be held by the caller.
+ * The hb->lock and futex_key refs shall be held by the caller if the pi_state
+ * pointer is non-NULL.
*/
static int lock_pi_futex_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
union futex_key *key,
@@ -741,37 +772,32 @@ retry:
ret = lock_futex_atomic(uaddr, task, set_waiters);
if (ret)
return ret;
-
- /*
- * We dont have the lock. Look up the PI state (or create it if
- * we are the first waiter):
+ /*
+ * We don't have the lock.
+ * Look up the PI state (or create it if we are the first waiter).
+ * futex_lock_atomic() calls us with ps==NULL for non-pi atomic futex
+ * acquisition.
*/
- if (get_futex_value_locked(&uval, uaddr))
- return -EFAULT;
- ret = lookup_pi_state(uval, hb, key, ps);
+ if (ps) {
+ if (get_futex_value_locked(&uval, uaddr))
+ return -EFAULT;
+ ret = lookup_pi_state(uval, hb, key, ps);
+ }

- if (unlikely(ret)) {
- switch (ret) {
- case -ESRCH:
- /*
- * No owner found for this futex. Check if the
- * OWNER_DIED bit is set to figure out whether
- * this is a robust futex or not.
- */
- if (get_futex_value_locked(&uval, uaddr))
- return -EFAULT;
+ if (!ps || unlikely(ret == -ESRCH)) {
+ /*
+ * No owner found for this futex. Check if the
+ * OWNER_DIED bit is set to figure out whether
+ * this is a robust futex or not.
+ */
+ if (get_futex_value_locked(&uval, uaddr))
+ return -EFAULT;

- /*
- * We simply start over in case of a robust
- * futex. The code above will take the futex
- * and return happy.
- */
- if (uval & FUTEX_OWNER_DIED)
- goto retry;
- default:
- break;
- }
- }
+ /*
+ * We simply start over in case of a robust
+ * futex. The code above will take the futex
+ * and return happy.
+ */
+ if (uval & FUTEX_OWNER_DIED)
+ goto retry;
+ }

return ret;
}
@@ -874,13 +900,13 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
return 0;
}

-static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
+static int unlock_futex(u32 __user *uaddr, u32 uval)
{
u32 oldval;

/*
* There is no waiter, so we unlock the futex. The owner died
- * bit has not to be preserved here. We are the owner:
+ * bit does not need to be preserved here. We are the owner:
*/
oldval = cmpxchg_futex_value_locked(uaddr, uval, 0);

@@ -2112,7 +2138,7 @@ retry:
* No waiters - kernel unlocks the futex:
*/
if (!(uval & FUTEX_OWNER_DIED)) {
- ret = unlock_futex_pi(uaddr, uval);
+ ret = unlock_futex(uaddr, uval);
if (ret == -EFAULT)
goto pi_faulted;
}
@@ -2356,6 +2382,187 @@ out:
return ret;
}

+/**
+ * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
+ * @uaddr: the futex user address
+ *
+ * Try to acquire a futex lock in a loop until the owner changes or the owner
+ * is descheduled. To lock the futex, set the value to the current TID.
+ *
+ * Returns:
+ * 0 - Gave up, lock not acquired
+ * 1 - Futex lock acquired
+ * <0 - On error
+ */
+static int trylock_futex_adaptive(u32 __user *uaddr)
+{
+ int ret = 0;
+ u32 curval;
+
+ for (;;) {
+ struct thread_info *owner;
+
+ owner = futex_owner(uaddr);
+ if (owner && futex_spin_on_owner(uaddr, owner))
+ break;
+
+ /*
+ * Try to acquire the lock. If we acquire it or error out,
+ * break to enable preemption and handle ret outside the loop.
+ */
+ curval = cmpxchg_futex_value_locked(uaddr, 0,
+ task_pid_vnr(current));
+
+ if (curval == 0) {
+ ret = 1;
+ break;
+ }
+
+ if (unlikely(curval == -EFAULT)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ /*
+ * Futex locks manage the owner atomically. We are not in
+ * danger of deadlock by preempting a pending owner. Abort the
+ * loop if our time slice has expired. RT threads can spin
+ * until they preempt the owner, at which point they will break
+ * out of the loop anyway.
+ */
+ if (need_resched())
+ break;
+
+ cpu_relax();
+ }
+ return ret;
+}
+
+/**
+ * futex_lock() - Acquire the lock and set the futex value
+ * @uaddr: the futex user address
+ * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, FLAGS_ADAPTIVE, etc.)
+ * @detect: detect deadlock (1) or not (0)
+ * @time: absolute timeout
+ *
+ * futex_(un)lock() define a futex value policy and implement a full mutex. The
+ * futex value stores the owner's TID or'd with FUTEX_WAITERS and/or
+ * FUTEX_OWNER_DIED as appropriate.
+ *
+ * Userspace tried a 0 -> TID atomic transition of the futex value and failed.
+ * Try to acquire the lock in the kernel, blocking if necessary. Return to
+ * userspace with the lock held and the futex value set accordingly (or after a
+ * timeout).
+ *
+ * Returns:
+ * 0 - On success
+ * <0 - On error
+ */
+static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
+{
+ struct hrtimer_sleeper timeout, *to = NULL;
+ struct futex_hash_bucket *hb;
+ struct futex_q q = FUTEX_Q_INIT;
+ int ret = 0;
+
+ if (refill_pi_state_cache())
+ return -ENOMEM;
+
+ if (time) {
+ to = &timeout;
+ hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
+ HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(to, current);
+ hrtimer_set_expires(&to->timer, *time);
+ }
+
+retry:
+#ifdef CONFIG_SMP
+ if (flags & FLAGS_ADAPTIVE) {
+ preempt_disable();
+ ret = trylock_futex_adaptive(uaddr);
+ preempt_enable();
+
+ /* We got the lock. */
+ if (ret == 1) {
+ ret = 0;
+ goto out;
+ }
+
+ /* We encountered an error, -EFAULT most likely. */
+ if (ret)
+ goto out;
+ }
+#endif
+ q.key = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
+ if (unlikely(ret != 0))
+ goto out;
+
+ hb = queue_lock(&q);
+
+ ret = lock_futex_atomic(uaddr, current, 0);
+ if (unlikely(ret)) {
+ switch (ret) {
+ case 1:
+ /* We got the lock. */
+ ret = 0;
+ goto out_unlock_put_key;
+ case -EFAULT:
+ goto uaddr_faulted;
+ default:
+ goto out_unlock_put_key;
+ }
+ }
+ /*
+ * Only actually queue now that the atomic ops are done:
+ */
+ futex_wait_queue_me(hb, &q, to);
+
+ ret = unqueue_me(&q);
+
+ /* If we were woken by the owner, try and acquire the lock. */
+ if (!ret)
+ goto retry;
+
+ ret = -ETIMEDOUT;
+ if (to && !to->task)
+ goto out_put_key;
+
+ /*
+ * We expect signal_pending(current), but we might be the
+ * victim of a spurious wakeup as well.
+ */
+ ret = -ERESTARTNOINTR;
+ if (!signal_pending(current)) {
+ put_futex_key(&q.key);
+ goto retry;
+ }
+
+ goto out_put_key;
+
+out_unlock_put_key:
+ queue_unlock(&q, hb);
+
+out_put_key:
+ put_futex_key(&q.key);
+out:
+ if (to)
+ destroy_hrtimer_on_stack(&to->timer);
+ return ret != -EINTR ? ret : -ERESTARTNOINTR;
+
+uaddr_faulted:
+ queue_unlock(&q, hb);
+
+ ret = fault_in_user_writeable(uaddr);
+ if (ret)
+ goto out_put_key;
+
+ put_futex_key(&q.key);
+ goto retry;
+}
+
+
/*
* Support for robust futexes: the kernel cleans up held futexes at
* thread exit time.
@@ -2623,6 +2830,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
case FUTEX_CMP_REQUEUE_PI:
ret = futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
break;
+ case FUTEX_LOCK_ADAPTIVE:
+ flags |= FLAGS_ADAPTIVE;
+ case FUTEX_LOCK:
+ if (futex_cmpxchg_enabled)
+ ret = futex_lock(uaddr, flags, val, timeout);
+ break;
default:
ret = -ENOSYS;
}
@@ -2641,7 +2854,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,

if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
cmd == FUTEX_WAIT_BITSET ||
- cmd == FUTEX_WAIT_REQUEUE_PI)) {
+ cmd == FUTEX_WAIT_REQUEUE_PI ||
+ cmd == FUTEX_LOCK || cmd == FUTEX_LOCK_ADAPTIVE)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
diff --git a/kernel/sched.c b/kernel/sched.c
index 9ab3cd7..a2dbb5b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3818,6 +3818,65 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
out:
return 1;
}
+
+#ifdef CONFIG_FUTEX
+#include <linux/futex.h>
+
+int futex_spin_on_owner(u32 __user *uaddr, struct thread_info *owner)
+{
+ unsigned int cpu;
+ struct rq *rq;
+
+ if (!sched_feat(OWNER_SPIN))
+ return 0;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ /*
+ * Need to access the cpu field knowing that
+ * DEBUG_PAGEALLOC could have unmapped it if
+ * the mutex owner just released it and exited.
+ */
+ if (probe_kernel_address(&owner->cpu, cpu))
+ goto out;
+#else
+ cpu = owner->cpu;
+#endif
+
+ /*
+ * Even if the access succeeded (likely case),
+ * the cpu field may no longer be valid.
+ */
+ if (cpu >= nr_cpumask_bits)
+ goto out;
+
+ /*
+ * We need to validate that we can do a
+ * get_cpu() and that we have the percpu area.
+ */
+ if (!cpu_online(cpu))
+ goto out;
+
+ rq = cpu_rq(cpu);
+
+ for (;;) {
+ /*
+ * Owner changed, break to re-assess state.
+ */
+ if (futex_owner(uaddr) != owner)
+ break;
+
+ /*
+ * Is that owner really running on that cpu?
+ */
+ if (task_thread_info(rq->curr) != owner || need_resched())
+ return 0;
+
+ cpu_relax();
+ }
+out:
+ return 1;
+}
+#endif
#endif

#ifdef CONFIG_PREEMPT
--
1.6.3.3

2010-04-05 20:25:04

by Darren Hart

[permalink] [raw]
Subject: [PATCH 3/6] futex: refactor futex_lock_pi_atomic

Prepare for FUTEX_LOCK by refactoring futex_lock_pi_atomic() into
lock_futex_atomic() and lock_pi_futex_atomic(). The name change is meant to
reflect the naming convention in futex.c that futex_*() functions map directly
to futex op codes and the others are internal helper functions.

Signed-off-by: Darren Hart <[email protected]>

---
kernel/futex.c | 79 +++++++++++++++++++++++++++++++++++++------------------
1 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2ae18cd..8c1bb16 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -625,30 +625,23 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
}

/**
- * futex_lock_pi_atomic() - Atomic work required to acquire a pi aware futex
- * @uaddr: the pi futex user address
- * @hb: the pi futex hash bucket
- * @key: the futex key associated with uaddr and hb
- * @ps: the pi_state pointer where we store the result of the
- * lookup
- * @task: the task to perform the atomic lock work for. This will
- * be "current" except in the case of requeue pi.
- * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ * lock_futex_atomic() - Try to acquire the futex lock atomically
+ * @uaddr: user address of the futex
+ * @task: the task to perform the atomic lock work for.
+ * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ *
+ * The hb->lock shall be held by the caller.
*
* Returns:
* 0 - ready to wait
* 1 - acquired the lock
* <0 - error
- *
- * The hb->lock and futex_key refs shall be held by the caller.
*/
-static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
- union futex_key *key,
- struct futex_pi_state **ps,
- struct task_struct *task, int set_waiters)
+static int lock_futex_atomic(u32 __user *uaddr, struct task_struct *task,
+ int set_waiters)
{
- int lock_taken, ret, ownerdied = 0;
u32 uval, newval, curval;
+ int lock_taken, ret;

retry:
ret = lock_taken = 0;
@@ -695,10 +688,10 @@ retry:
*
* This is safe as we are protected by the hash bucket lock !
*/
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ if (unlikely((curval & FUTEX_OWNER_DIED) ||
+ !(curval & FUTEX_TID_MASK))) {
/* Keep the OWNER_DIED bit */
newval = (curval & ~FUTEX_TID_MASK) | task_pid_vnr(task);
- ownerdied = 0;
lock_taken = 1;
}

@@ -715,10 +708,46 @@ retry:
if (unlikely(lock_taken))
return 1;

+ return ret;
+}
+
+/**
+ * lock_pi_futex_atomic() - Atomic work required to acquire a pi aware futex
+ * @uaddr: the pi futex user address
+ * @hb: the pi futex hash bucket
+ * @key: the futex key associated with uaddr and hb
+ * @ps: the pi_state pointer where we store the result of the
+ * lookup
+ * @task: the task to perform the atomic lock work for. This will
+ * be "current" except in the case of requeue pi.
+ * @set_waiters: force setting the FUTEX_WAITERS bit (1) or not (0)
+ *
+ * Returns:
+ * 0 - ready to wait
+ * 1 - acquired the lock
+ * <0 - error
+ *
+ * The hb->lock and futex_key refs shall be held by the caller.
+ */
+static int lock_pi_futex_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
+ union futex_key *key,
+ struct futex_pi_state **ps,
+ struct task_struct *task, int set_waiters)
+{
+ u32 uval;
+ int ret;
+
+retry:
+ ret = lock_futex_atomic(uaddr, task, set_waiters);
+ if (ret)
+ return ret;
+
/*
* We dont have the lock. Look up the PI state (or create it if
* we are the first waiter):
*/
+ if (get_futex_value_locked(&uval, uaddr))
+ return -EFAULT;
ret = lookup_pi_state(uval, hb, key, ps);

if (unlikely(ret)) {
@@ -729,7 +758,7 @@ retry:
* OWNER_DIED bit is set to figure out whether
* this is a robust futex or not.
*/
- if (get_futex_value_locked(&curval, uaddr))
+ if (get_futex_value_locked(&uval, uaddr))
return -EFAULT;

/*
@@ -737,10 +766,8 @@ retry:
* futex. The code above will take the futex
* and return happy.
*/
- if (curval & FUTEX_OWNER_DIED) {
- ownerdied = 1;
+ if (uval & FUTEX_OWNER_DIED)
goto retry;
- }
default:
break;
}
@@ -1100,7 +1127,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
*
* Try and get the lock on behalf of the top waiter if we can do it atomically.
* Wake the top waiter if we succeed. If the caller specified set_waiters,
- * then direct futex_lock_pi_atomic() to force setting the FUTEX_WAITERS bit.
+ * then direct lock_pi_futex_atomic() to force setting the FUTEX_WAITERS bit.
* hb1 and hb2 must be held by the caller.
*
* Returns:
@@ -1124,7 +1151,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
/*
* Find the top_waiter and determine if there are additional waiters.
* If the caller intends to requeue more than 1 waiter to pifutex,
- * force futex_lock_pi_atomic() to set the FUTEX_WAITERS bit now,
+ * force lock_pi_futex_atomic() to set the FUTEX_WAITERS bit now,
* as we have means to handle the possible fault. If not, don't set
* the bit unnecessarily as it will force the subsequent unlock to enter
* the kernel.
@@ -1144,7 +1171,7 @@ static int futex_proxy_trylock_atomic(u32 __user *pifutex,
* the contended case or if set_waiters is 1. The pi_state is returned
* in ps in contended cases.
*/
- ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
+ ret = lock_pi_futex_atomic(pifutex, hb2, key2, ps, top_waiter->task,
set_waiters);
if (ret == 1)
requeue_pi_wake_futex(top_waiter, key2, hb2);
@@ -1925,7 +1952,7 @@ retry:
retry_private:
hb = queue_lock(&q);

- ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
+ ret = lock_pi_futex_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
if (unlikely(ret)) {
switch (ret) {
case 1:
--
1.6.3.3

2010-04-05 20:25:14

by Darren Hart

[permalink] [raw]
Subject: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

Signed-off-by: Darren Hart <[email protected]>
---
kernel/futex.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index c33ac2a..af61dcd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2385,6 +2385,7 @@ out:
/**
* trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
* @uaddr: the futex user address
+ * @timeout: absolute timeout or NULL if none
*
* Try to acquire a futex lock in a loop until the owner changes or the owner
* is descheduled. To lock the futex, set the value to the current TID.
@@ -2394,10 +2395,11 @@ out:
* 1 - Futex lock acquired
* <0 - On error
*/
-static int trylock_futex_adaptive(u32 __user *uaddr)
+static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
{
int ret = 0;
u32 curval;
+ ktime_t now;

for (;;) {
struct thread_info *owner;
@@ -2433,6 +2435,22 @@ static int trylock_futex_adaptive(u32 __user *uaddr)
if (need_resched())
break;

+ if (timeout) {
+ now = ktime_get();
+/* FIXME: consider creating ktime_less_than(lhs, rhs) */
+#if (BITS_PER_LONG == 64) || defined(CONFIG_KTIME_SCALAR)
+ if (timeout->tv64 < now.tv64) {
+ ret = -ETIMEDOUT;
+ break;
+ }
+#else
+ if (timeout->tv.sec < now.tv.sec ||
+ (timeout->tv.sec == now.tv.sec &&
+ timeout->tv.nsec < now.tv.nsec)) {
+ ret = -ETIMEDOUT;
+ break;
+ }
+#endif
+ }
+
cpu_relax();
}
return ret;
@@ -2480,7 +2498,7 @@ retry:
#ifdef CONFIG_SMP
if (flags & FLAGS_ADAPTIVE) {
preempt_disable();
- ret = trylock_futex_adaptive(uaddr);
+ ret = trylock_futex_adaptive(uaddr, time);
preempt_enable();

/* We got the lock. */
@@ -2489,7 +2507,7 @@ retry:
goto out;
}

- /* We encountered an error, -EFAULT most likely. */
+ /* We encountered an error, -EFAULT or -ETIMEDOUT */
if (ret)
goto out;
}
--
1.6.3.3
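
The ktime_less_than() helper the FIXME above asks for could be as simple as
the following sketch against the ktime_t of this era (not in the tree):

    /* Compare two ktime_t values, covering both the 64-bit scalar and
     * the split sec/nsec representations. */
    static inline int ktime_less_than(const ktime_t lhs, const ktime_t rhs)
    {
    #if (BITS_PER_LONG == 64) || defined(CONFIG_KTIME_SCALAR)
            return lhs.tv64 < rhs.tv64;
    #else
            return lhs.tv.sec < rhs.tv.sec ||
                   (lhs.tv.sec == rhs.tv.sec && lhs.tv.nsec < rhs.tv.nsec);
    #endif
    }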

2010-04-05 20:25:26

by Darren Hart

[permalink] [raw]
Subject: [PATCH 6/6] futex: Add aggressive adaptive spinning argument to FUTEX_LOCK

Aggressive adaptive spinning (aas) allows for more than one spinner in the
adaptive loop at a given time. In some scenarios this yields better
results than the default single spinner mode.
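
From userspace, the mode is selected via the syscall's val3 argument; a
sketch, with the op code value taken from this series:

    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define FUTEX_LOCK_ADAPTIVE 14 /* from this series */

    /* aas=1 allows multiple spinners, aas=0 the default single spinner. */
    static long futex_lock_adaptive(unsigned int *uaddr,
                                    const struct timespec *timeout, int aas)
    {
            return syscall(SYS_futex, uaddr, FUTEX_LOCK_ADAPTIVE,
                           0 /* detect */, timeout, NULL, aas);
    }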

Signed-off-by: Darren Hart <[email protected]>
---
kernel/futex.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index af61dcd..ddbd158 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2462,6 +2462,9 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
* @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, FLAGS_ADAPTIVE, etc.)
* @detect: detect deadlock (1) or not (0)
* @time: absolute timeout
+ * @aas: aggressive adaptive spinning: allow multiple spinners (1) or only
+ * one spinner (0). Only valid in conjunction with the FLAGS_ADAPTIVE
+ * flag.
*
* futex_(un)lock() define a futex value policy and implement a full mutex. The
* futex value stores the owner's TID or'd with FUTEX_WAITERS and/or
@@ -2476,12 +2479,14 @@ static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
* 0 - On success
* <0 - On error
*/
-static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
+static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time,
+ int aas)
{
struct hrtimer_sleeper timeout, *to = NULL;
- struct futex_hash_bucket *hb;
struct futex_q q = FUTEX_Q_INIT;
+ struct futex_hash_bucket *hb;
int ret = 0;
+ u32 uval;

if (refill_pi_state_cache())
return -ENOMEM;
@@ -2497,6 +2502,14 @@ static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
retry:
#ifdef CONFIG_SMP
if (flags & FLAGS_ADAPTIVE) {
+ if (!aas) {
+ ret = get_user(uval, uaddr);
+ if (ret)
+ goto out;
+ if (uval & FUTEX_WAITERS)
+ goto skip_adaptive;
+ }
+
preempt_disable();
ret = trylock_futex_adaptive(uaddr, time);
preempt_enable();
@@ -2512,6 +2525,8 @@ retry:
goto out;
}
#endif
+
+skip_adaptive:
q.key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
if (unlikely(ret != 0))
@@ -2852,7 +2867,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
flags |= FLAGS_ADAPTIVE;
case FUTEX_LOCK:
if (futex_cmpxchg_enabled)
- ret = futex_lock(uaddr, flags, val, timeout);
+ ret = futex_lock(uaddr, flags, val, timeout, val3);
break;
default:
ret = -ENOSYS;
--
1.6.3.3

2010-04-05 20:48:43

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2^W V4 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

And by V2 I meant V4....

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-05 21:16:34

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/05/2010 11:23 PM, Darren Hart wrote:
> In-Reply-To:
>
> NOT FOR INCLUSION
>
> The following patch series implements a new experimental kernel side futex mutex
> via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin
> follows the kernel mutex model of allowing one spinner until the lock is
> released or the owner is descheduled. The patch currently allows the user to
> specify if they want no spinning, a single adaptive spinner, or multiple
> spinners (aggressive adaptive spinning, or aas... which I have mistyped as "ass"
> enough times to realize a better term is indeed required :-).
>

An interesting (but perhaps difficult to achieve) optimization would be
to spin in userspace.

> I'm using the futex_lock branch of my futextest test suite to gather results.
> The test is performance/futex_lock.c and can be checked out here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git
>
> Per Avi's suggestion I added asm instruction based period and duty-cycle options
> to do some testing. A series of plots with 256 threads, one per period length,
> is shown at the URL below. Each plot compares raw futex lock (normal, no
> spinning) with a single spinner (adaptive) and with multiple spinners (aas) for
> each of several duty-cycles to determine performance at various levels of
> contention. The test measures the number of lock/unlock pairs performed per
> second (in thousands).
>
> http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/
>
> Avi's suggested starting point was 1000 instructions with a 10% duty cycle.
> That plot "Futex Lock/Unlock (Period=1000, Threads=256)" does appear to be the
> most interesting of the lot. It's not so short that other overhead becomes the
> bottleneck, and not so long as to make adaptive spinning benefits show up
> as noise in the numbers. The adaptive spin (with a single waiter) consistently
> beats the normal run, and outperforms aas for most duty-cycles. I rechecked a
> few points on this plot to confirm and the relative scores remained consistent.
> These plots were generated using 10,000,000 iterations per datapoint.
>
> Lastly, I should mention that these results all underperform a simple
> pthread_mutex_lock()/pthread_mutex_unlock() pair. I'm looking into why but felt
> I was overdue in sharing what I have to date. A test comparing this to a
> sched_yield() style userspace spinlock would probably be more appropriate.
>
>

How many cores (or hardware threads) does this machine have? At 10%
duty cycle you have 25 waiters behind the lock on average (256 threads,
each wanting the lock 10% of the time, gives 256 * 0.10 ~= 25.6
contenders at any instant). I don't think this is realistic, and it
means that spinning is invoked only rarely.

I'd be interested in seeing runs where the average number of waiters is
0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention. 25
average waiters on compute bound code means the application needs to be
rewritten, no amount of mutex tweaking will help it.

Does the wakeup code select the spinning waiter, or just a random waiter?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-05 21:54:21

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Avi Kivity wrote:
> On 04/05/2010 11:23 PM, Darren Hart wrote:
>> In-Reply-To:
>>
>> NOT FOR INCLUSION
>>
>> The following patch series implements a new experimental kernel side
>> futex mutex
>> via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The
>> adaptive spin
>> follows the kernel mutex model of allowing one spinner until the lock is
>> released or the owner is descheduled. The patch currently allows the
>> user to
>> specify if they want no spinning, a single adaptive spinner, or multiple
>> spinners (aggressive adaptive spinning, or aas... which I have
>> mistyped as "ass"
>> enough times to realize a better term is indeed required :-).
>>
>
> An interesting (but perhaps difficult to achieve) optimization would be
> to spin in userspace.

I couldn't think of a lightweight way to determine when the owner has
been scheduled out in userspace. Kernel assistance is required. You
could do this on the schedule() side of things, but I figured I'd get
some strong pushback if I tried to add a hook into descheduling that
flipped a bit in the futex value stating the owner was about to
deschedule(). Still, that might be something to explore.

>
> How many cores (or hardware threads) does this machine have?

Sorry, I meant to include that. I tested on an 8 CPU (no hardware
threads) 2.6 GHz Opteron 2218 (2 QuadCore CPUs) system.

> At 10%
> duty cycle you have 25 waiters behind the lock on average. I don't
> think this is realistic, and it means that spinning is invoked only rarely.

Perhaps some instrumentation is in order; it seems to get invoked enough
to achieve some 20% increase in lock/unlock iterations. Perhaps another
metric would be of more value - such as average wait time?

> I'd be interested in seeing runs where the average number of waiters is
> 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
> 25 average waiters on compute bound code means the application needs to be
> rewritten, no amount of mutex tweaking will help it.

Perhaps something NR_CPUS threads would be of more interest? At 10%
that's about .8 and at 25% the 2 of your upper limit. I could add a few
more duty-cycle points and make 25% the max. I'll kick that off and post
the results... probably tomorrow, 10M iterations takes a while, but
makes the results relatively stable.

> Does the wakeup code select the spinning waiter, or just a random waiter?

The wakeup code selects the highest priority task in fifo order to
wake-up - however, under contention it is most likely going to go back
to sleep as another waiter will steal the lock out from under it. This
locking strategy is unashamedly about as "unfair" as it gets.


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-05 22:22:39

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 12:54 AM, Darren Hart wrote:
> Avi Kivity wrote:
>> On 04/05/2010 11:23 PM, Darren Hart wrote:
>>> In-Reply-To:
>>>
>>> NOT FOR INCLUSION
>>>
>>> The following patch series implements a new experimental kernel side
>>> futex mutex
>>> via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The
>>> adaptive spin
>>> follows the kernel mutex model of allowing one spinner until the
>>> lock is
>>> released or the owner is descheduled. The patch currently allows the
>>> user to
>>> specify if they want no spinning, a single adaptive spinner, or
>>> multiple
>>> spinners (aggressive adaptive spinning, or aas... which I have
>>> mistyped as "ass"
>>> enough times to realize a better term is indeed required :-).
>>
>> An interesting (but perhaps difficult to achieve) optimization would
>> be to spin in userspace.
>
> I couldn't think of a lightweight way to determine when the owner has
> been scheduled out in userspace. Kernel assistance is required. You
> could do this on the schedule() side of things, but I figured I'd get
> some strong pushback if I tried to add a hook into descheduling that
> flipped a bit in the futex value stating the owner was about to
> deschedule(). Still, that might be something to explore.

In the futex value it's hopeless (since a thread can hold many locks),
but I don't think it's unreasonable to set a bit in the thread local
storage area. The futex format would then need to be extended to
contain a pointer to this bit.
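
Made concrete, the extended format might look like the following (entirely
hypothetical, just to illustrate the idea):

    /* Hypothetical extended futex: the usual word plus a pointer to a
     * per-thread "on cpu" flag the kernel would update at context
     * switch time. */
    struct futex_ext {
            unsigned int val;        /* TID | WAITERS, as today */
            const int *owner_on_cpu; /* points into the owner's TLS */
    };

    /* Userspace adaptive spin: keep trying while the owner runs, bail
     * out to FUTEX_WAIT the moment it is scheduled out. */
    static int spin_while_owner_runs(struct futex_ext *f, unsigned int tid)
    {
            while (*f->owner_on_cpu) {
                    unsigned int zero = 0;

                    if (__atomic_compare_exchange_n(&f->val, &zero, tid, 0,
                                                    __ATOMIC_ACQUIRE,
                                                    __ATOMIC_RELAXED))
                            return 1; /* acquired while spinning */
            }
            return 0; /* owner descheduled: fall back to the kernel */
    }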


>
>>
>> How many cores (or hardware threads) does this machine have?
>
> Sorry, I meant to include that. I tested on an 8 CPU (no hardware
> threads) 2.6 GHz Opteron 2218 (2 QuadCore CPUs) system.
>
> > At 10%
>> duty cycle you have 25 waiters behind the lock on average. I don't
>> think this is realistic, and it means that spinning is invoked only
>> rarely.
>
> Perhaps some instrumentation is in order, it seems to get invoked
> enough to achieve some 20% increase in lock/unlock iterations. Perhaps
> another metric would be of more value - such as average wait time?

Why measure an unrealistic workload?

>
>> I'd be interested in seeing runs where the average number of waiters
>> is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
>> 25 average waiters on compute bound code means the application needs
>> to be rewritten, no amount of mutex tweaking will help it.
>
> Perhaps something NR_CPUS threads would be of more interest?

That seems artificial.

> At 10% that's about .8 and at 25% the 2 of your upper limit. I could
> add a few more duty-cycle points and make 25% the max. I'll kick that
> off and post the results... probably tomorrow, 10M iterations takes a
> while, but makes the results relatively stable.

Thanks. But why not vary the number of threads as well?

>
>> Does the wakeup code select the spinning waiter, or just a random
>> waiter?
>
> The wakeup code selects the highest priority task in fifo order to
> wake-up - however, under contention it is most likely going to go back
> to sleep as another waiter will steal the lock out from under it. This
> locking strategy is unashamedly about as "unfair" as it gets.

Best to avoid the wakeup if we notice the lock was stolen.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-05 23:00:24

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Avi Kivity wrote:

>> > At 10%
>>> duty cycle you have 25 waiters behind the lock on average. I don't
>>> think this is realistic, and it means that spinning is invoked only
>>> rarely.
>>
>> Perhaps some instrumentation is in order, it seems to get invoked
>> enough to achieve some 20% increase in lock/unlock iterations. Perhaps
>> another metric would be of more value - such as average wait time?
>
> Why measure an unrealistic workload?

No argument there, thus my proposal for an alternate configuration below.

>>> I'd be interested in seeing runs where the average number of waiters
>>> is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
>>> 25 average waiters on compute bound code means the application needs
>>> to be rewritten, no amount of mutex tweaking will help it.
>>
>> Perhaps something NR_CPUS threads would be of more interest?
>
> That seems artificial.

How so? Several real world applications use one thread per CPU to
dispatch work to, wait for events, etc.

>
>> At 10% that's about .8 and at 25% the 2 of your upper limit. I could
>> add a few more duty-cycle points and make 25% the max. I'll kick that
>> off and post the results... probably tomorrow, 10M iterations takes a
>> while, but makes the results relatively stable.
>
> Thanks. But why not vary the number of threads as well?

Absolutely, I don't disagree that all the variables should vary in order
to get a complete picture. I'm starting with 8 - it takes several hours
to collect the data.

>>> Does the wakeup code select the spinning waiter, or just a random
>>> waiter?
>>
>> The wakeup code selects the highest priority task in fifo order to
>> wake-up - however, under contention it is most likely going to go back
>> to sleep as another waiter will steal the lock out from under it. This
>> locking strategy is unashamedly about as "unfair" as it gets.
>
> Best to avoid the wakeup if we notice the lock was stolen.

You really can't do this precisely. You can read the futex value at
various points along the wakeup path, but at some point you have to
commit to waking a task, and you still have a race between the time you
wake_up_task() and when it is scheduled and attempts the cmpxchg itself.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-05 23:17:58

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Avi Kivity wrote:

>>> An interesting (but perhaps difficult to achieve) optimization would
>>> be to spin in userspace.
>>
>> I couldn't think of a lightweight way to determine when the owner has
>> been scheduled out in userspace. Kernel assistance is required. You
>> could do this on the schedule() side of things, but I figured I'd get
>> some strong pushback if I tried to add a hook into descheduling that
>> flipped a bit in the futex value stating the owner was about to
>> deschedule(). Still, that might be something to explore.
>
> In the futex value it's hopeless (since a thread can hold many locks),

It can, but there is a futex value per lock. If the task_struct had a
list of held futex locks (as it does for pi futex locks) the
deschedule() path could walk that and mark the FUTEX_OWNER_SLEEPING bit.
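
Something along these lines, as a sketch - FUTEX_OWNER_SLEEPING and the
per-task list are hypothetical, and a real schedule() path could not
fault on user memory, which is a big part of the pushback I'd expect:

/*
 * Hypothetical deschedule-side hook: mark each held futex so spinning
 * waiters can see the owner is off the CPU. Best effort only; the
 * locked accessors below return -EFAULT rather than faulting.
 */
static void futex_mark_owner_sleeping(struct task_struct *p)
{
        struct futex_held_lock *hl;     /* hypothetical list node */
        u32 uval;

        list_for_each_entry(hl, &p->held_futex_list, list) {
                if (get_futex_value_locked(&uval, hl->uaddr))
                        continue;
                cmpxchg_futex_value_locked(hl->uaddr, uval,
                                           uval | FUTEX_OWNER_SLEEPING);
        }
}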


> but I don't think it's unreasonable to set a bit in the thread local
> storage area. The futex format would then need to be extended to
> contain a pointer to this bit.

This appears to be 1 bit per task instead of 1 bit per lock. Also, the
value is thread-specific... so how would a potential waiter be able to
determine if the owner of a particular lock was running or not with this
method? ... maybe I'm missing some core bit about TLS... are you
talking about pthread_key_create() and pthread_getspecific() ?

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-05 23:31:31

by Chris Wright

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

* Darren Hart ([email protected]) wrote:
> Avi Kivity wrote:
>
>>>> An interesting (but perhaps difficult to achieve) optimization
>>>> would be to spin in userspace.
>>>
>>> I couldn't think of a lightweight way to determine when the owner has
>>> been scheduled out in userspace. Kernel assistance is required. You
>>> could do this on the schedule() side of things, but I figured I'd get
>>> some strong pushback if I tried to add a hook into descheduling that
>>> flipped a bit in the futex value stating the owner was about to
>>> deschedule(). Still, that might be something to explore.
>>
>> In the futex value it's hopeless (since a thread can hold many locks),
>
> It can, but there is a futex value per lock. If the task_struct had a
> list of held futex locks (as it does for pi futex locks) the
> deschedule() path could walk that and mark the FUTEX_OWNER_SLEEPING bit.

You also have a notification scheme (preempt_notifiers will fire on
schedule out). However, you'd have to register the notifiers from a
non-current context (i.e. on slow path acquire reaching out to lock owner
and registering notifier on lock owner's behalf, which would kind of
defeat the point AFAICT).
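
For reference, the hooks themselves would look something like this - the
preempt_ops/preempt_notifier API exists today, struct owner_flag and its
"running" field are hypothetical, and note preempt_notifier_register()
only attaches to current, which is exactly the limitation above:

struct owner_flag {
        struct preempt_notifier notifier;
        int running;                    /* read by spinning waiters */
};

static void lock_owner_sched_in(struct preempt_notifier *pn, int cpu)
{
        container_of(pn, struct owner_flag, notifier)->running = 1;
}

static void lock_owner_sched_out(struct preempt_notifier *pn,
                                 struct task_struct *next)
{
        container_of(pn, struct owner_flag, notifier)->running = 0;
}

static struct preempt_ops lock_owner_preempt_ops = {
        .sched_in  = lock_owner_sched_in,
        .sched_out = lock_owner_sched_out,
};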

thanks,
-chris

2010-04-06 08:29:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

On Mon, 5 Apr 2010, Darren Hart wrote:

> Signed-off-by: Darren Hart <[email protected]>
> ---
> kernel/futex.c | 24 +++++++++++++++++++++---
> 1 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index c33ac2a..af61dcd 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2385,6 +2385,7 @@ out:
> /**
> * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
> * @uaddr: the futex user address
> + * @timeout: absolute timeout or NULL if none
> *
> * Try to acquire a futex lock in a loop until the owner changes or the owner
> * is descheduled. To lock the futex, set the value to the current TID.
> @@ -2394,10 +2395,11 @@ out:
> * 1 - Futex lock acquired
> * <0 - On error
> */
> -static int trylock_futex_adaptive(u32 __user *uaddr)
> +static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
> {
> int ret = 0;
> u32 curval;
> + ktime_t now;
>
> for (;;) {
> struct thread_info *owner;
> @@ -2433,6 +2435,22 @@ static int trylock_futex_adaptive(u32 __user *uaddr)
> if (need_resched())
> break;
>
> + if (timeout) {
> + now = ktime_get();

Hmm. Calling that in every iteration might hurt, especially on non-TSC
systems, but well...

> +/* FIXME: consider creating ktime_less_than(lhs, rhs) */

No need. The .tv64 comparison works in both cases. :)
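
i.e. the check in the loop can simply be (sketch):

        if (timeout) {
                now = ktime_get();
                /*
                 * .tv64 compares correctly for both the 64-bit scalar
                 * and the sec/nsec union representation of ktime_t.
                 */
                if (now.tv64 > timeout->tv64) {
                        ret = -ETIMEDOUT;
                        break;
                }
        }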

Thanks,

tglx

2010-04-06 09:00:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 2010-04-06 at 00:15 +0300, Avi Kivity wrote:
>
> An interesting (but perhaps difficult to achieve) optimization would be
> to spin in userspace.

System call overhead isn't that large, but if you want to amortize that
you can do a very small spin in userspace and then take the syscall:

try
spin
try
syscall

like that.

Complicating anything beyond that is just not worth it.
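
In code the whole fast path is a handful of lines. A sketch, where
futex_lock_syscall() is a stand-in for the eventual FUTEX_LOCK call;
everything else is plain GCC builtins:

static inline void lock(unsigned int *f, unsigned int tid)
{
        /* try */
        if (__sync_bool_compare_and_swap(f, 0, tid))
                return;

        /* spin: a single pause to amortize the syscall cost (x86) */
        asm volatile("pause" ::: "memory");

        /* try again */
        if (__sync_bool_compare_and_swap(f, 0, tid))
                return;

        /* still contended: let the kernel block us */
        futex_lock_syscall(f);
}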

2010-04-06 13:29:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 01:59 AM, Darren Hart wrote:
>
>
>>>> I'd be interested in seeing runs where the average number of
>>>> waiters is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad
>>>> contention.
>>>> 25 average waiters on compute bound code means the application
>>>> needs to be rewritten, no amount of mutex tweaking will help it.
>>>
>>> Perhaps something NR_CPUS threads would be of more interest?
>>
>> That seems artificial.
>
> How so? Several real world applications use one thread per CPU to
> dispatch work to, wait for events, etc.

Yes, but that's the best case for spinning. You could simply use a
userspace spinlock in this case.


>
>>>> Does the wakeup code select the spinning waiter, or just a random
>>>> waiter?
>>>
>>> The wakeup code selects the highest priority task in fifo order to
>>> wake-up - however, under contention it is most likely going to go
>>> back to sleep as another waiter will steal the lock out from under
>>> it. This locking strategy is unashamedly about as "unfair" as it gets.
>>
>> Best to avoid the wakeup if we notice the lock was stolen.
>
> You really can't do this precisely. You can read the futex value at
> various points along the wakeup path, but at some point you have to
> commit to waking a task, and you still have a race between the time
> you wake_up_task() and when it is scheduled and attempts the cmpxchg
> itself.
>

The race is not a problem as it's just an optimization. If you lose it
you get a spurious wakeup; if you win, you save some CPU cycles.


--
error compiling committee.c: too many arguments to function

2010-04-06 13:31:14

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 02:15 AM, Darren Hart wrote:
> Avi Kivity wrote:
>
>>>> An interesting (but perhaps difficult to achieve) optimization
>>>> would be to spin in userspace.
>>>
>>> I couldn't think of a lightweight way to determine when the owner
>>> has been scheduled out in userspace. Kernel assistance is required.
>>> You could do this on the schedule() side of things, but I figured
>>> I'd get some strong pushback if I tried to add a hook into
>>> descheduling that flipped a bit in the futex value stating the owner
>>> was about to deschedule(). Still, that might be something to explore.
>>
>> In the futex value it's hopeless (since a thread can hold many locks),
>
> It can, but there is a futex value per lock. If the task_struct had a
> list of held futex locks (as it does for pi futex locks) the
> deschedule() path could walk that and mark the FUTEX_OWNER_SLEEPING bit.
>

You don't want the context switch path to walk a list whose length is
user controlled.

>> but I don't think it's unreasonable to set a bit in the thread local
>> storage area. The futex format would then need to be extended to
>> contain a pointer to this bit.
>
> This appears to be 1 bit per task instead of 1 bit per lock.

Yes. O(1) on context switch instead of O(n).

> Also, the value is thread-specific... so how would a potential waiter
> be able to determine if the owner of a particular lock was running or
> not with this method? ... maybe I'm missing some core bit about
> TLS... are you talking about pthread_key_create() and
> pthread_getspecific() ?

The lock would need to contain a pointer to the owning task. Could be
set with cmpxchg{8,16}b on x86.
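
A sketch of the layout and the acquire - all names and the layout are
made up, and this is the x86-64 variant, needing cmpxchg16b (-mcx16):

struct lock_ext {
        unsigned long val;              /* 0 free, else owner TID */
        const int *owner_running;       /* -> flag in owner's TLS */
} __attribute__((aligned(16)));

static int try_acquire(struct lock_ext *l, unsigned long tid,
                       const int *my_flag)
{
        struct lock_ext old = { 0, NULL };
        struct lock_ext new = { tid, my_flag };

        /* publish TID and owner pointer in one 16-byte cmpxchg */
        return __sync_bool_compare_and_swap((__int128 *)l,
                                            *(__int128 *)&old,
                                            *(__int128 *)&new);
}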

--
error compiling committee.c: too many arguments to function

2010-04-06 13:35:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
>
> Yes, but that's the best case for spinning. You could simply use a
> userspace spinlock in this case.

Userspace spinlocks are evil.. they should _never_ be used.

2010-04-06 13:42:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 04:35 PM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
>
>> Yes, but that's the best case for spinning. You could simply use a
>> userspace spinlock in this case.
>>
> Userspace spinlocks are evil.. they should _never_ be used.
>

But in this case they're fastest. If we don't provide a non-evil
alternative, people will use them.

--
error compiling committee.c: too many arguments to function

2010-04-06 13:49:29

by Alan

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 06 Apr 2010 15:35:31 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
> >
> > Yes, but that's the best case for spinning. You could simply use a
> > userspace spinlock in this case.
>
> Userspace spinlocks are evil.. they should _never_ be used.

That's a gross and inaccurate simplification. For the case Avi is talking
about, spinning in userspace makes sense in a lot of environments. Once
you've got one thread pinned per cpu (or gang scheduling >-) ) there are
various environments where it makes complete and utter sense.

2010-04-06 14:09:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 2010-04-06 at 16:41 +0300, Avi Kivity wrote:
> On 04/06/2010 04:35 PM, Peter Zijlstra wrote:
> > On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
> >
> >> Yes, but that's the best case for spinning. You could simply use a
> >> userspace spinlock in this case.
> >>
> > Userspace spinlocks are evil.. they should _never_ be used.
> >
>
> But in this case they're fastest. If we don't provide a non-evil
> alternative, people will use them.
>

That's what FUTEX_LOCK is about.

2010-04-06 14:48:12

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, Apr 6, 2010 at 01:48, Peter Zijlstra <[email protected]> wrote:
>  try
>  spin
>  try
>  syscall

This has been available for a long time in the mutex implementation
(the PTHREAD_MUTEX_ADAPTIVE_NP mutex type). It hasn't shown much
improvement, if any. There were some people demanding this support but,
as far as I know, they are not using it now. This is adaptive
spinning, learning from previous calls how long to wait. But it's
still unguided. There is no way to get information like "the owner
has been descheduled".

2010-04-06 14:51:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 2010-04-06 at 07:47 -0700, Ulrich Drepper wrote:
> On Tue, Apr 6, 2010 at 01:48, Peter Zijlstra <[email protected]>
> wrote:
> > try
> > spin
> > try
> > syscall
>
> This is available for a long time in the mutex implementation
> (PTHREAD_MUTEX_ADAPTIVE_NP mutex type). It hasn't show much
> improvement if any. There were some people demanding this support for
> as far as I know they are not using it now. This is adaptive
> spinning, learning from previous calls how long to wait. But it's
> still unguided. There is no way to get information like "the owner
> has been descheduled".

That's where the FUTEX_LOCK thing comes in; it does all of that. The above
was a single spin loop to amortize the syscall overhead.

I wouldn't make it any more complex than a single pause insn; syscalls
are terribly cheap these days.

2010-04-06 15:28:49

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Alan Cox wrote:
> On Tue, 06 Apr 2010 15:35:31 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
>>> Yes, but that's the best case for spinning. You could simply use a
>>> userspace spinlock in this case.
>> Userspace spinlocks are evil.. they should _never_ be used.
>
> Thats a gross and inaccurate simplification. For the case Avi is talking
> about spinning in userspace makes sense in a lot of environments. Once
> you've got one thread pinned per cpu (or gang scheduling >-) ) there are
> various environments where it makes complete and utter sense.

Hi Alan,

Do you feel some of these situations would also benefit from some kernel
assistance to stop spinning when the owner schedules out? Or are you
saying that there are situations where pure userspace spinlocks will
always be the best option?

If the latter, I'd think that they would also be situations where
sched_yield() is not used as part of the spin loop. If so, then these
are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to
provide a better informed mechanism for making spin or sleep decisions.
If sleeping isn't part of the locking construct implementation, then
FUTEX_LOCK_ADAPTIVE doesn't have much to offer.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-06 15:29:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Mon, 2010-04-05 at 13:23 -0700, Darren Hart wrote:
> Lastly, I should mention that these results all underperform a simple
> pthread_mutex_lock()/pthread_mutex_unlock() pair. I'm looking into why but felt
> I was overdue in sharing what I have to date. A test comparing this to a
> sched_yield() style userspace spinlock would probably be more appropraite.

This really should be able to out-perform a regular pthread_mutex_lock();
we saw a significant performance gain when we added adaptive spins to
the kernel mutex implementation, so I'd expect a gain on the futex one
as well.

2010-04-06 15:33:39

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 07:47 -0700, Ulrich Drepper wrote:
>> On Tue, Apr 6, 2010 at 01:48, Peter Zijlstra <[email protected]>
>> wrote:
>>> try
>>> spin
>>> try
>>> syscall
>> This is available for a long time in the mutex implementation
>> (PTHREAD_MUTEX_ADAPTIVE_NP mutex type). It hasn't show much
>> improvement if any. There were some people demanding this support for
>> as far as I know they are not using it now. This is adaptive
>> spinning, learning from previous calls how long to wait. But it's
>> still unguided. There is no way to get information like "the owner
>> has been descheduled".
>
> That's where the FUTEX_LOCK thing comes in, it does all those, the above
> was a single spin loop to amortize the syscall overhead.
>
> I wouldn't make it any more complex than a single pause ins, syscalls
> are terribly cheap these days.

And yet they still seem to have a real impact on the futex_lock
benchmark. Perhaps I am just still looking at pathological cases, but
there is a strong correlation between high syscall counts and really low
iterations per second. Granted this also correlates with lock
contention. However, when using the same period and duty-cycle I find
that a locking mechanism that makes significantly fewer syscalls also
significantly outperforms one that makes more. Kind of handwavy still;
I'll have more numbers this afternoon.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-06 15:37:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 2010-04-06 at 08:33 -0700, Darren Hart wrote:
> Peter Zijlstra wrote:
> > On Tue, 2010-04-06 at 07:47 -0700, Ulrich Drepper wrote:
> >> On Tue, Apr 6, 2010 at 01:48, Peter Zijlstra <[email protected]>
> >> wrote:
> >>> try
> >>> spin
> >>> try
> >>> syscall
> >> This is available for a long time in the mutex implementation
> >> (PTHREAD_MUTEX_ADAPTIVE_NP mutex type). It hasn't show much
> >> improvement if any. There were some people demanding this support for
> >> as far as I know they are not using it now. This is adaptive
> >> spinning, learning from previous calls how long to wait. But it's
> >> still unguided. There is no way to get information like "the owner
> >> has been descheduled".
> >
> > That's where the FUTEX_LOCK thing comes in, it does all those, the above
> > was a single spin loop to amortize the syscall overhead.
> >
> > I wouldn't make it any more complex than a single pause ins, syscalls
> > are terribly cheap these days.
>
> And yet they still seem to have a real impact on the futex_lock
> benchmark. Perhaps I am just still looking at pathological cases, but
> there is a strong correlation between high syscall counts and really low
> iterations per second. Granted this also correlates with lock
> contention. However, when using the same period and duty-cycle I find
> that a locking mechanism that makes significantly fewer syscalls also
> significantly outperforms one that makes more. Kind of handwavy stilly,
> I'll have more numbers this afternoon.

Sure, but I'm still not sure why FUTEX_LOCK ends up making more syscalls
than FUTEX_WAIT based locking. Both should only do the syscall when the
lock is contended, and both should only ever do one syscall per acquire,
right?


2010-04-06 16:07:38

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 06:28 PM, Darren Hart wrote:
> Alan Cox wrote:
>> On Tue, 06 Apr 2010 15:35:31 +0200
>> Peter Zijlstra <[email protected]> wrote:
>>
>>> On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
>>>> Yes, but that's the best case for spinning. You could simply use a
>>>> userspace spinlock in this case.
>>> Userspace spinlocks are evil.. they should _never_ be used.
>>
>> Thats a gross and inaccurate simplification. For the case Avi is talking
>> about spinning in userspace makes sense in a lot of environments. Once
>> you've got one thread pinned per cpu (or gang scheduling >-) ) there are
>> various environments where it makes complete and utter sense.
>
> Hi Alan,
>
> Do you feel some of these situations would also benefit from some
> kernel assistance to stop spinning when the owner schedules out? Or
> are you saying that there are situations where pure userspace
> spinlocks will always be the best option?
>
> If the latter, I'd think that they would also be situations where
> sched_yield() is not used as part of the spin loop. If so, then these
> are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to
> provide a better informed mechanism for making spin or sleep
> decisions. If sleeping isn't part of the locking construct
> implementation, then FUTEX_LOCK_ADAPTIVE doesn't have much to offer.

IMO the best solution is to spin in userspace while the lock holder is
running, fall into the kernel when it is scheduled out.

--
error compiling committee.c: too many arguments to function

2010-04-06 16:11:06

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 05:09 PM, Peter Zijlstra wrote:
> On Tue, 2010-04-06 at 16:41 +0300, Avi Kivity wrote:
>
>> On 04/06/2010 04:35 PM, Peter Zijlstra wrote:
>>
>>> On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
>>>
>>>
>>>> Yes, but that's the best case for spinning. You could simply use a
>>>> userspace spinlock in this case.
>>>>
>>>>
>>> Userspace spinlocks are evil.. they should _never_ be used.
>>>
>>>
>> But in this case they're fastest. If we don't provide a non-evil
>> alternative, people will use them.
>>
>>
> That's what FUTEX_LOCK is about.
>

That works for the uncontended case. For the contended case, the waiter
and the owner have to go into the kernel and back out to transfer
ownership. In the non-adaptive case you have to switch to the idle task
and back as well, and send an IPI. That's a lot of latency if the
unlock happened just after the waiter started the descent into the kernel.

--
error compiling committee.c: too many arguments to function

2010-04-06 16:16:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 6 Apr 2010, Avi Kivity wrote:

> On 04/06/2010 06:28 PM, Darren Hart wrote:
> > Alan Cox wrote:
> > > On Tue, 06 Apr 2010 15:35:31 +0200
> > > Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Tue, 2010-04-06 at 16:28 +0300, Avi Kivity wrote:
> > > > > Yes, but that's the best case for spinning. You could simply use a
> > > > > userspace spinlock in this case.
> > > > Userspace spinlocks are evil.. they should _never_ be used.
> > >
> > > Thats a gross and inaccurate simplification. For the case Avi is talking
> > > about spinning in userspace makes sense in a lot of environments. Once
> > > you've got one thread pinned per cpu (or gang scheduling >-) ) there are
> > > various environments where it makes complete and utter sense.
> >
> > Hi Alan,
> >
> > Do you feel some of these situations would also benefit from some kernel
> > assistance to stop spinning when the owner schedules out? Or are you saying
> > that there are situations where pure userspace spinlocks will always be the
> > best option?
> >
> > If the latter, I'd think that they would also be situations where
> > sched_yield() is not used as part of the spin loop. If so, then these are
> > not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to provide a
> > better informed mechanism for making spin or sleep decisions. If sleeping
> > isn't part of the locking construct implementation, then FUTEX_LOCK_ADAPTIVE
> > doesn't have much to offer.
>
> IMO the best solution is to spin in userspace while the lock holder is
> running, fall into the kernel when it is scheduled out.

That's just not realistic as user space has no idea whether the lock
holder is running or not and when it's scheduled out without a syscall :)

Thanks,

tglx

2010-04-06 16:21:28

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 07:14 PM, Thomas Gleixner wrote:
>
>> IMO the best solution is to spin in userspace while the lock holder is
>> running, fall into the kernel when it is scheduled out.
>>
> That's just not realistic as user space has no idea whether the lock
> holder is running or not and when it's scheduled out without a syscall :)
>

The kernel could easily expose this information by writing into the
thread's TLS area.

So:

- the kernel maintains a current_cpu field in a thread's tls
- lock() atomically writes a pointer to the current thread's current_cpu
when acquiring
- the kernel writes an invalid value to current_cpu when switching out
- a contended lock() retrieves the current_cpu pointer, and spins as
long as it is a valid cpu
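
The contended path then reads something like this - a sketch only, with
made-up field names, try_acquire() as the usual atomic attempt, and
cpu_relax() standing in for a pause instruction in userspace:

#define CPU_INVALID (-1)

static void lock_contended(struct lock_ext *l)
{
        while (!try_acquire(l)) {
                const int *cpu = l->owner_current_cpu;

                /* owner switched out: stop burning cycles, block */
                if (!cpu || *cpu == CPU_INVALID) {
                        futex_lock_syscall(&l->val);
                        return;
                }
                cpu_relax();
        }
}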

--
error compiling committee.c: too many arguments to function

2010-04-06 16:41:59

by Alan

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

> Do you feel some of these situations would also benefit from some kernel
> assistance to stop spinning when the owner schedules out? Or are you
> saying that there are situations where pure userspace spinlocks will
> always be the best option?

There are cases where it's the best option - you are assuming, for example,
that the owner can get scheduled out. E.g. nailing one thread per CPU in
some specialist high performance situations means they can't.

> If the latter, I'd think that they would also be situations where
> sched_yield() is not used as part of the spin loop. If so, then these
> are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to
> provide a better informed mechanism for making spin or sleep decisions.
> If sleeping isn't part of the locking construct implementation, then
> FUTEX_LOCK_ADAPTIVE doesn't have much to offer.

I am unsure about the approach. As Avi says, knowing that the lock owner is
scheduled out allows for far better behaviour. It doesn't need complex
per-lock stuff or per-lock notifier entries on preempt either.

A given task is either preempted or not, and in the normal case of things
you need this within a process, so you've got shared pages anyway. So you
only need one instance of the 'is thread X preempted' bit somewhere in a
non-swappable page.

That gives you something along the lines of

runaddr = find_run_flag(lock);
do {
        while (*runaddr == RUNNING) {
                if (trylock(lock))
                        return WHOOPEE;
                cpu_relax();
        }
        yield_on(thread);
} while (*runaddr != DEAD);


which, unlike blindly spinning, can avoid the worst of any hit on CPU
power and would be a bit more guided?

2010-04-06 16:50:10

by Alan

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

> That works for the uncontended case. For the contended case, the waiter
> and the owner have to go into the kernel and back out to transfer
> ownership.

If you insist on doing it that way, yes, but knowing the lock owner is
likely to be away for a while also lets you do things like punt work
either by picking another work package in the meantime, or by queueing
the work you can't do on a list the pre-empted lock owner will review
before dropping the lock.

2010-04-06 16:51:06

by Alan

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

> > IMO the best solution is to spin in userspace while the lock holder is
> > running, fall into the kernel when it is scheduled out.
>
> That's just not realistic as user space has no idea whether the lock
> holder is running or not and when it's scheduled out without a syscall :)

Which is the real problem that wants addressing and can be addressed very
cheaply. That would bring us up to par with 1970s RTOS environments ;)

Alan

2010-04-06 16:56:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

On Mon, 5 Apr 2010, Darren Hart wrote:
> +#ifdef CONFIG_SMP
> +struct thread_info *futex_owner(u32 __user *uaddr)
> +{
> + struct thread_info *ti = NULL;
> + struct task_struct *p;
> + pid_t pid;
> + u32 uval;
> +
> + if (get_futex_value_locked(&uval, uaddr))
> + return NULL;
> +
> + pid = uval & FUTEX_TID_MASK;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + const struct cred *cred, *pcred;
> +
> + cred = current_cred();
> + pcred = __task_cred(p);
> + if (cred->euid == pcred->euid ||
> + cred->euid == pcred->uid)
> + ti = task_thread_info(p);

We want a get_task_struct() here, don't we ?

> + }
> + rcu_read_unlock();
> +
> + return ti;
> +}
> +#endif
> +

> +/**
> + * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
> + * @uaddr: the futex user address
> + *
> + * Try to acquire a futex lock in a loop until the owner changes or the owner
> + * is descheduled. To lock the futex, set the value to the current TID.
> + *
> + * Returns:
> + * 0 - Gave up, lock not acquired
> + * 1 - Futex lock acquired
> + * <0 - On error
> + */
> +static int trylock_futex_adaptive(u32 __user *uaddr)
> +{
> + int ret = 0;
> + u32 curval;
> +
> + for (;;) {
> + struct thread_info *owner;
> +
> + owner = futex_owner(uaddr);
> + if (owner && futex_spin_on_owner(uaddr, owner))
> + break;
> +
> + /*
> + * Try to acquire the lock. If we acquire it or error out,
> + * break to enable preemption and handle ret outside the loop.
> + */
> + curval = cmpxchg_futex_value_locked(uaddr, 0,
> + task_pid_vnr(current));
> +
> + if (curval == 0) {
> + ret = 1;
> + break;
> + }
> +
> + if (unlikely(curval == -EFAULT)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + /*
> + * Futex locks manage the owner atomically. We are not in
> + * danger of deadlock by preempting a pending owner. Abort the
> + * loop if our time slice has expired. RT Threads can spin
> + * until they preempt the owner, at which point they will break
> + * out of the loop anyway.
> + */
> + if (need_resched())
> + break;
> +
> + cpu_relax();
> + }
> + return ret;
> +}


Hmm. The order is weird. Why not do it more simply?

Get the uval, the tid and the thread_info pointer outside of the
loop. Also task_pid_vnr(current) just needs a one time lookup.

change the loop to do:

for (;;) {
        curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);

        if (!curval)
                return 1;

        if ((curval & FUTEX_TID_MASK) != ownertid) {
                ownertid = curval & FUTEX_TID_MASK;
                owner = update_owner(ownertid);
        }

        if (owner && futex_spin_on_owner(....))
                .....

}

> +/**
> + * futex_lock() - Acquire the lock and set the futex value
> + * @uaddr: the futex user address
> + * @flags: futex flags (FLAGS_SHARED, FLAGS_CLOCKRT, FLAGS_ADAPTIVE, etc.)
> + * @detect: detect deadlock (1) or not (0)
> + * @time: absolute timeout
> + *
> + * futex_(un)lock() define a futex value policy and implement a full mutex. The
> + * futex value stores the owner's TID or'd with FUTEX_WAITERS and/or
> + * FUTEX_OWNER_DIED as appropriate.
> + *
> + * Userspace tried a 0 -> TID atomic transition of the futex value and failed.
> + * Try to acquire the lock in the kernel, blocking if necessary. Return to
> + * userspace with the lock held and the futex value set accordingly (or after a
> + * timeout).
> + *
> + * Returns:
> + * 0 - On success
> + * <0 - On error
> + */
> +static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
> +{
> + struct hrtimer_sleeper timeout, *to = NULL;
> + struct futex_hash_bucket *hb;
> + struct futex_q q = FUTEX_Q_INIT;
> + int ret = 0;
> +
> + if (refill_pi_state_cache())
> + return -ENOMEM;
> +
> + if (time) {
> + to = &timeout;
> + hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
> + HRTIMER_MODE_ABS);

Please make that like the WAIT_BITSET one, which can select the
clock and defaults to MONOTONIC.

> + hrtimer_init_sleeper(to, current);
> + hrtimer_set_expires(&to->timer, *time);
> + }

Why set up all this _before_ trying the adaptive spin ?

> +retry:
> +#ifdef CONFIG_SMP
> + if (flags & FLAGS_ADAPTIVE) {

Why do we want that non-adaptive version at all ?

> + preempt_disable();
> + ret = trylock_futex_adaptive(uaddr);
> + preempt_enable();
> +
> + /* We got the lock. */
> + if (ret == 1) {
> + ret = 0;
> + goto out;
> + }
> +
> + /* We encountered an error, -EFAULT most likely. */
> + if (ret)
> + goto out;
> + }
> +#endif
> + q.key = FUTEX_KEY_INIT;
> + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
> + if (unlikely(ret != 0))
> + goto out;
> +
> + hb = queue_lock(&q);
> +
> + ret = lock_futex_atomic(uaddr, current, 0);
> + if (unlikely(ret)) {
> + switch (ret) {
> + case 1:
> + /* We got the lock. */
> + ret = 0;
> + goto out_unlock_put_key;
> + case -EFAULT:
> + goto uaddr_faulted;
> + default:
> + goto out_unlock_put_key;
> + }
> + }
> + /*
> + * Only actually queue now that the atomic ops are done:
> + */
> + futex_wait_queue_me(hb, &q, to);
> +
> + ret = unqueue_me(&q);
> +
> + /* If we were woken by the owner, try and acquire the lock. */
> + if (!ret)
> + goto retry;
> +
> + ret = -ETIMEDOUT;
> + if (to && !to->task)
> + goto out_put_key;
> +
> + /*
> + * We expect signal_pending(current), but we might be the
> + * victim of a spurious wakeup as well.
> + */
> + ret = -ERESTARTNOINTR;
> + if (!signal_pending(current)) {
> + put_futex_key(&q.key);
> + goto retry;
> + }
> +
> + goto out_put_key;
> +
> +out_unlock_put_key:
> + queue_unlock(&q, hb);
> +
> +out_put_key:
> + put_futex_key(&q.key);
> +out:
> + if (to)
> + destroy_hrtimer_on_stack(&to->timer);
> + return ret != -EINTR ? ret : -ERESTARTNOINTR;

Which code sets -EINTR ?

> +
> +uaddr_faulted:
> + queue_unlock(&q, hb);
> +
> + ret = fault_in_user_writeable(uaddr);
> + if (ret)
> + goto out_put_key;
> +
> + put_futex_key(&q.key);
> + goto retry;
> +}
> +
> +
> /*
> * Support for robust futexes: the kernel cleans up held futexes at
> * thread exit time.
> @@ -2623,6 +2830,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
> case FUTEX_CMP_REQUEUE_PI:
> ret = futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
> break;
> + case FUTEX_LOCK_ADAPTIVE:
> + flags |= FLAGS_ADAPTIVE;
> + case FUTEX_LOCK:
> + if (futex_cmpxchg_enabled)
> + ret = futex_lock(uaddr, flags, val, timeout);
> + break;
> default:
> ret = -ENOSYS;
> }
> @@ -2641,7 +2854,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>
> if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
> cmd == FUTEX_WAIT_BITSET ||
> - cmd == FUTEX_WAIT_REQUEUE_PI)) {
> + cmd == FUTEX_WAIT_REQUEUE_PI ||
> + cmd == FUTEX_LOCK || cmd == FUTEX_LOCK_ADAPTIVE)) {
> if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
> return -EFAULT;
> if (!timespec_valid(&ts))
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9ab3cd7..a2dbb5b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3818,6 +3818,65 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
> out:
> return 1;
> }
> +
> +#ifdef CONFIG_FUTEX
> +#include <linux/futex.h>
> +
> +int futex_spin_on_owner(u32 __user *uaddr, struct thread_info *owner)
> +{
> + unsigned int cpu;
> + struct rq *rq;
> +
> + if (!sched_feat(OWNER_SPIN))
> + return 0;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + /*
> + * Need to access the cpu field knowing that
> + * DEBUG_PAGEALLOC could have unmapped it if
> + * the mutex owner just released it and exited.
> + */
> + if (probe_kernel_address(&owner->cpu, cpu))
> + goto out;
> +#else
> + cpu = owner->cpu;
> +#endif
> +
> + /*
> + * Even if the access succeeded (likely case),
> + * the cpu field may no longer be valid.
> + */
> + if (cpu >= nr_cpumask_bits)
> + goto out;
> +
> + /*
> + * We need to validate that we can do a
> + * get_cpu() and that we have the percpu area.
> + */
> + if (!cpu_online(cpu))
> + goto out;
> +
> + rq = cpu_rq(cpu);
> +
> + for (;;) {
> + /*
> + * Owner changed, break to re-assess state.
> + */
> + if (futex_owner(uaddr) != owner)
> + break;

Uurg. It's enough to check whether the TID value changed. No need to
look up the full thing in every iteration.

> + /*
> + * Is that owner really running on that cpu?
> + */
> + if (task_thread_info(rq->curr) != owner || need_resched())
> + return 0;
> +
> + cpu_relax();
> + }
> +out:
> + return 1;
> +}
> +#endif
> #endif

Do we really need all this code ? A simple owner->on_cpu (owner needs
to be the task_struct then) would be sufficient to figure that out,
wouldn't it?
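
i.e. roughly this, as a sketch - folding in the TID check from above,
with on_cpu per the suggestion rather than the current field name:

static int futex_spin_on_owner(u32 __user *uaddr,
                               struct task_struct *owner, u32 ownertid)
{
        u32 uval;

        while (owner->on_cpu) {
                if (need_resched())
                        return 0;       /* give the CPU up */
                if (get_futex_value_locked(&uval, uaddr))
                        return -EFAULT;
                if ((uval & FUTEX_TID_MASK) != ownertid)
                        return 1;       /* owner changed: retry acquire */
                cpu_relax();
        }
        return 0;                       /* owner descheduled: block */
}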

Thanks,

tglx

2010-04-06 17:34:43

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, Apr 6, 2010 at 09:44, Alan Cox <[email protected]> wrote:
> That gives you something along the lines of
>
>        runaddr = find_run_flag(lock);
>        do {
>                while(*runaddr == RUNNING) {
>                        if (trylock(lock))
>                                return WHOOPEE;
>                        cpu relax
>                }
>                yield (_on(thread));
>        } while(*runaddr != DEAD);

There still has to be an upper limit on the number of rounds of the
wait loop (some locks are held for a long time) since otherwise CPUs
are tied up unnecessarily long. And the DEAD case is only for robust
mutex handling. But in theory I agree.
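
In Alan's loop that's just a bound on the inner spin - a sketch, reusing
the (hypothetical) names from his fragment:

#define MAX_SPIN_ROUNDS 4096    /* the right bound needs experimentation */

        int rounds = 0;

        while (*runaddr == RUNNING && rounds++ < MAX_SPIN_ROUNDS) {
                if (trylock(lock))
                        return WHOOPEE;
                cpu_relax();
        }
        yield_on(thread);       /* or fall back to blocking */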

We already have the set_tid_address syscall. This could be
generalized with a new syscall which can provide the kernel with more
than one pointer to store "stuff" in: TIDs, scheduling info, etc.

The non-swappable part will be tricky. One doesn't know how many
threads will be created in a process. This mechanism shouldn't put an
arbitrary limit in place. So where to allocate the memory? Perhaps
it's better to implicitly mark the memory page pointed to by the new
syscall as non-swappable? This could mean one page per thread...

2010-04-06 18:18:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 6 Apr 2010, Alan Cox wrote:

> > > IMO the best solution is to spin in userspace while the lock holder is
> > > running, fall into the kernel when it is scheduled out.
> >
> > That's just not realistic as user space has no idea whether the lock
> > holder is running or not and when it's scheduled out without a syscall :)
>
> Which is the real problem that wants addressing and can be addressed very
> cheaply. That would bring us up to par with 1970s RTOS environments ;)

Well, 1970's RTOSes had other features as well like preemption disable
mechanisms and other interesting stuff. I hope you're not going to
propose that next to bring us up to par with those :)

Thanks,

tglx

2010-04-06 19:33:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 6 Apr 2010, Alan Cox wrote:
> > Do you feel some of these situations would also benefit from some kernel
> > assistance to stop spinning when the owner schedules out? Or are you
> > saying that there are situations where pure userspace spinlocks will
> > always be the best option?
>
> There are cases its the best option - you are assuming for example that
> the owner can get scheduled out. Eg nailing one thread per CPU in some
> specialist high performance situations means they can't.

Fair enough, but that's not the problem Darren is targeting.

> > If the latter, I'd think that they would also be situations where
> > sched_yield() is not used as part of the spin loop. If so, then these
> > are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to
> > provide a better informed mechanism for making spin or sleep decisions.
> > If sleeping isn't part of the locking construct implementation, then
> > FUTEX_LOCK_ADAPTIVE doesn't have much to offer.
>
> I am unsure about the approach. As Avi says knowing that the lock owner is
> scheduled out allows for far better behaviour. It doesn't need complex
> per lock stuff or per lock notifier entries on pre-empt either.
>
> A given task is either pre-empted or not and in the normal case of things
> you need this within a process so you've got shared pages anyway. So you
> only need one instance of the 'is thread X pre-empted' bit somewhere in a
> non swappable page.

I fear we might end up with a pinned page per thread to get this
working properly, and it restricts the mechanism to process-private
locks.

> That gives you something along the lines of
>
> runaddr = find_run_flag(lock);
> do {
> while(*runaddr == RUNNING) {
> if (trylock(lock))
> return WHOOPEE;
> cpu relax
> }
> yield (_on(thread));

That would require a new yield_to_target() syscall, which either
blocks the caller when the target thread is not runnable or returns an
error code which signals to go into the slow path.

> } while(*runaddr != DEAD);
>
>
> which unlike blindly spinning can avoid the worst of any hit on the CPU
> power and would be a bit more guided ?

I doubt that the syscall overhead per se is large enough to justify
all of the above horror. We need to figure out a more efficient way to
do the spinning in the kernel where we have all the necessary
information already. Darren's implementation is suboptimal AFAICT.

Thanks,

tglx

2010-04-06 20:02:58

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, Apr 6, 2010 at 12:31, Thomas Gleixner <[email protected]> wrote:
> We need to figure out a more efficient way to
> do the spinning in the kernel where we have all the necessary
> information already.

Really? The owner information isn't in general available in the
kernel. Futex operation doesn't require the value used to be the PID
(or negative of the PID). That is a dramatic limitation of the
usefulness of futexes.

At userlevel there is access to other fields of the data structure
which can contain the owner information.

I would like to see the method using a per-thread pinned page and an
update of a memory location on scheduling. For benchmarking at least.
I agree that a sys_yield_to() syscall would be at the very least
useful as well. But it's useful for other things already.

2010-04-06 21:22:35

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Darren Hart wrote:
> Avi Kivity wrote:
>
>>> > At 10%
>>>> duty cycle you have 25 waiters behind the lock on average. I don't
>>>> think this is realistic, and it means that spinning is invoked only
>>>> rarely.
>>>
>>> Perhaps some instrumentation is in order, it seems to get invoked
>>> enough to achieve some 20% increase in lock/unlock iterations.
>>> Perhaps another metric would be of more value - such as average wait
>>> time?
>>
>> Why measure an unrealistic workload?
>
> No argument there, thus my proposal for an alternate configuration below.
>
>>>> I'd be interested in seeing runs where the average number of waiters
>>>> is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
>>>> 25 average waiters on compute bound code means the application needs
>>>> to be rewritten, no amount of mutex tweaking will help it.
>>>
>>> Perhaps something NR_CPUS threads would be of more interest?
>>
>> That seems artificial.
>
> How so? Several real world applications use one thread per CPU to
> dispatch work to, wait for events, etc.
>
>>
>>> At 10% that's about .8 and at 25% the 2 of your upper limit. I could
>>> add a few more duty-cycle points and make 25% the max. I'll kick that
>>> off and post the results... probably tomorrow, 10M iterations takes a
>>> while, but makes the results relatively stable.
>>
>> Thanks. But why not vary the number of threads as well?
>
> Absolutely, I don't disagree that all the variables should vary in order
> to get a complete picture. I'm starting with 8 - it takes several hours
> to collect the data.

While this might be of less interest after today's discussion, I
promised to share the results of a run with 8 threads with a wider
selection of lower duty-cycles. The results are very poor for adaptive
and worse for aas (multiple spinners) compared to normal FUTEX_LOCK. As
Thomas and Peter have pointed out, the implementation is sub-optimal.
Before abandoning this approach I will see if I can find the bottlenecks
and simplify the kernel side of things. My impression is that I am doing
a lot more work in the kernel, especially in the adaptive loop, than is
really necessary.

Both the 8 and 256 Thread plots can be viewed here:

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-06 23:18:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, 6 Apr 2010, Ulrich Drepper wrote:

> On Tue, Apr 6, 2010 at 12:31, Thomas Gleixner <[email protected]> wrote:
> > We need to figure out a more efficient way to
> > do the spinning in the kernel where we have all the necessary
> > information already.
>
> Really? The owner information isn't in general available in the
> kernel. Futex operation doesn't require the value used to be the PID
> (or negative of the PID). That is a dramatic limitation of the
> usefulness of futexes.

I know that you can do any weird stuff with the futex value, but I
don't see the "dramatic" limitation. Care to elaborate ?

> At userlevel there is access to other fields of the data structure
> which can contain the owner information.
>
> I would like to see the method using a per-thread pinned page and an
> update of a memory location on scheduling. For benchmarking at least.

The per thread pinned page would be unconditional, right ?

I agree that benchmarking would be interesting, but OTOH I fear that
we open up a huge can of worms with exposing scheduler details and the
related necessary syscalls like sys_yield_to: User space thread
management/scheduling comes to my mind and I hope we agree that we do
not want to revisit that.

> I agree that a sys_yield_to() syscall would be at the very least
> useful as well. But it's useful for other things already.

Useful for what ?

What are the exact semantics of such a syscall ?

How does that fit into the various scheduling constraints ?

Thanks,

tglx

2010-04-06 23:36:49

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Thomas Gleixner wrote:
> On Tue, 6 Apr 2010, Ulrich Drepper wrote:
>
>> On Tue, Apr 6, 2010 at 12:31, Thomas Gleixner <[email protected]> wrote:
>>> We need to figure out a more efficient way to
>>> do the spinning in the kernel where we have all the necessary
>>> information already.
>> Really? The owner information isn't in general available in the
>> kernel. Futex operation doesn't require the value used to be the PID
>> (or negative of the PID). That is a dramatic limitation of the
>> usefulness of futexes.
>
> I know that you can do any weird stuff with the futex value, but I
> don't see the "dramatic" limitation. Care to elaborate ?
>
>> At userlevel there is access to other fields of the data structure
>> which can contain the owner information.
>>
>> I would like to see the method using a per-thread pinned page and an
>> update of a memory location on scheduling. For benchmarking at least.
>
> The per thread pinned page would be unconditional, right ?
>
> I agree that benchmarking would be interesting, but OTOH I fear that
> we open up a huge can of worms with exposing scheduler details and the
> related necessary syscalls like sys_yield_to: User space thread
> management/scheduling comes to my mind and I hope we agree that we do
> not want to revisit that.
>
>> I agree that a sys_yield_to() syscall would be at the very least
>> useful as well. But it's useful for other things already.
>
> Useful for what ?
>
> What are the exact semantics of such a syscall ?
>
> How does that fit into the various scheduling constraints ?


I believe this comes back to the discussions of a directed yield. The
idea is that a thread yields its remaining timeslice to a thread of
its choosing - usually because the target thread holds a resource the
yielding thread needs access to. This makes the yield more explicit, so
the yielding thread is more likely to get some benefit out of yielding.

I believe the arguments would be either a TID or a thread group -
however that is specified. I believe the KVM guys would like to see
something like this as well - which might be the "other things" referred
to above.
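
A sketch of what the interface might look like (entirely hypothetical):

/*
 * Donate the remainder of the caller's timeslice to tid. Could return
 * 0 on success, -ESRCH if tid does not exist, or an error (or simply
 * no-op) when scheduling constraints forbid the transfer.
 */
long sys_yield_to(pid_t tid, unsigned int flags);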

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-07 05:34:33

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On 04/06/2010 10:31 PM, Thomas Gleixner wrote:
>
>> That gives you something along the lines of
>>
>> runaddr = find_run_flag(lock);
>> do {
>> while(*runaddr == RUNNING) {
>> if (trylock(lock))
>> return WHOOPEE;
>> cpu relax
>> }
>> yield (_on(thread));
>>
> That would require a new yield_to_target() syscall, which either
> blocks the caller when the target thread is not runnable or returns an
> error code which signals to go into the slow path.
>

Note, directed yield is something that kvm wants for its own ends (as
an internal kernel API, not a syscall), but it's apparently useful.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-07 06:09:07

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Tue, Apr 6, 2010 at 16:16, Thomas Gleixner <[email protected]> wrote:
> I know that you can do any weird stuff with the futex value, but I
> don't see the "dramatic" limitation. Care to elaborate ?

If we have to fill in the PID we can represent only three states in a futex: 0, PID, -PID. Today we can represent 2^32 states. Quite a difference.


> The per thread pinned page would be unconditional, right ?

Only if the process were using these adaptive mutexes. It could be conditional.


> I agree that benchmarking would be interesting, but OTOH I fear that
> we open up a huge can of worms with exposing scheduler details and the
> related necessary syscalls like sys_yield_to: User space thread
> management/scheduling comes to my mind and I hope we agree that we do
> not want to revisit that.

I'm not sure. We never got to the bottom of this. Why are these details that should not be disclosed? It's clear that descheduling happens, and the sys_yield_to syscall would require nothing to happen beyond indicating to the kernel execution dependencies which it cannot necessarily discover on its own, at least not efficiently.


> Useful for what ?

We already have places where we could spin a bit using sys_yield_to because we know what we are waiting on.


> What are the exact semantics of such a syscall ?

It gives the kernel the hint that the current thread is willing to hand over the remaining time on its timeslice to the target thread. This target thread, if sleeping, can immediately make progress. Yes, this might mean moving the target thread to the core executing the yielding thread. Perhaps this doesn't make sense in some situations. In this case the syscall could be a no-op, perhaps indicating this in the return value.


> How does that fit into the various scheduling constraints ?

I don't know enough about all the constraints. As I said, it could be a hint. If the constraints forbid the timeslice transfer it need not happen.



2010-04-07 06:31:18

by john cooper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Avi Kivity wrote:
> On 04/06/2010 07:14 PM, Thomas Gleixner wrote:
>>
>>> IMO the best solution is to spin in userspace while the lock holder is
>>> running, fall into the kernel when it is scheduled out.
>>>
>> That's just not realistic as user space has no idea whether the lock
>> holder is running or not and when it's scheduled out without a syscall :)
>>
>
> The kernel could easily expose this information by writing into the
> thread's TLS area.
>
> So:
>
> - the kernel maintains a current_cpu field in a thread's tls
> - lock() atomically writes a pointer to the current thread's current_cpu
> when acquiring
> - the kernel writes an invalid value to current_cpu when switching out
> - a contended lock() retrieves the current_cpu pointer, and spins as
> long as it is a valid cpu

There are certainly details to sort through in the packaging
of the mechanism but conceptually that should do the job.
So here the application has chosen a blocking lock as being
the optimal synchronization operation and we're detecting a
scenario where we can factor out the aggregate overhead of two
context switch operations.

There is also the case where the application requires a
polled lock, the rationale being that the assumed lock
hold/wait time is substantially less than the above context
switch overhead.
open to indiscriminate scheduling preemption even though
we may be holding a userland lock.

The adaptive mutex above is an optimization beyond what
is normally expected for the associated model. The preemption
of a polled lock OTOH can easily inflict latency several orders
of magnitude beyond what is expected in that model. Two use
cases exist here which IMO aren't related except for the latter
unintentionally degenerating into the former.

-john

--
[email protected]

2010-04-07 17:26:15

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

Thomas Gleixner wrote:
> On Mon, 5 Apr 2010, Darren Hart wrote:
>> +#ifdef CONFIG_SMP
>> +struct thread_info *futex_owner(u32 __user *uaddr)
>> +{
>> + struct thread_info *ti = NULL;
>> + struct task_struct *p;
>> + pid_t pid;
>> + u32 uval;
>> +
>> + if (get_futex_value_locked(&uval, uaddr))
>> + return NULL;
>> +
>> + pid = uval & FUTEX_TID_MASK;
>> + rcu_read_lock();
>> + p = find_task_by_vpid(pid);
>> + if (p) {
>> + const struct cred *cred, *pcred;
>> +
>> + cred = current_cred();
>> + pcred = __task_cred(p);
>> + if (cred->euid == pcred->euid ||
>> + cred->euid == pcred->uid)
>> + ti = task_thread_info(p);
>
> We want a get_task_struct() here, don't we ?


It would appear so. We access owner->cpu in futex_spin_on_owner. Done.


>> + }
>> + rcu_read_unlock();
>> +
>> + return ti;
>> +}
>> +#endif
>> +
>
>> +/**
>> + * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
>> + * @uaddr: the futex user address
>> + *
>> + * Try to acquire a futex lock in a loop until the owner changes or the owner
>> + * is descheduled. To lock the futex, set the value to the current TID.
>> + *
>> + * Returns:
>> + * 0 - Gave up, lock not acquired
>> + * 1 - Futex lock acquired
>> + * <0 - On error
>> + */
>> +static int trylock_futex_adaptive(u32 __user *uaddr)
>> +{
>> + int ret = 0;
>> + u32 curval;
>> +
>> + for (;;) {
>> + struct thread_info *owner;
>> +
>> + owner = futex_owner(uaddr);
>> + if (owner && futex_spin_on_owner(uaddr, owner))
>> + break;
>> +
>> + /*
>> + * Try to acquire the lock. If we acquire it or error out,
>> + * break to enable preemption and handle ret outside the loop.
>> + */
>> + curval = cmpxchg_futex_value_locked(uaddr, 0,
>> + task_pid_vnr(current));
>> +
>> + if (curval == 0) {
>> + ret = 1;
>> + break;
>> + }
>> +
>> + if (unlikely(curval == -EFAULT)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + /*
>> + * Futex locks manage the owner atomically. We are not in
>> + * danger of deadlock by preempting a pending owner. Abort the
>> + * loop if our time slice has expired. RT Threads can spin
>> + * until they preempt the owner, at which point they will break
>> + * out of the loop anyway.
>> + */
>> + if (need_resched())
>> + break;
>> +
>> + cpu_relax();
>> + }
>> + return ret;
>> +}
>
>
> Hmm. The order is weird. Why don't you do that simpler ?
>
> Get the uval, the tid and the thread_info pointer outside of the
> loop. Also task_pid_vnr(current) just needs a one time lookup.

Eeek. Having the owner in the loop is a good way to negate the benefits
of adaptive spinning by spinning forever (unlikely, but it could
certainly spin across multiple owners). Nice catch.

As for the uval.... I'm not sure what you mean. You get curval below
inside the loop, and there is no "uval" in my version of the code.

As for the order, I had put the initial spin prior to the cmpxchg to
avoid doing too many cmpxchg's in a row, as they are rather expensive.
However, since this is (now) the first opportunity to try to acquire
the lock atomically after entering the futex syscall, I think you're
right, it should be the first thing in the loop.

>
> change the loop to do:
>
> for (;;) {
> curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
>
> if (!curval)
> return 1;

Single return point makes instrumentation so much easier. Unless folks
_really_ object, I'll leave it as is until we're closer to merging.

> if ((curval & FUTEX_TID_MASK) != ownertid) {
> ownertid = curval & FUTEX_TID_MASK;
> owner = update_owner(ownertid);
> }


Hrm... at this point the owner has changed... so we should break and go
to sleep, not update the owner and start spinning again. The
futex_spin_on_owner() will detect this and abort, so I'm not seeing the
purpose of the above if() block.

...

>> +static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
>> +{
>> + struct hrtimer_sleeper timeout, *to = NULL;
>> + struct futex_hash_bucket *hb;
>> + struct futex_q q = FUTEX_Q_INIT;
>> + int ret = 0;
>> +
>> + if (refill_pi_state_cache())
>> + return -ENOMEM;
>> +
>> + if (time) {
>> + to = &timeout;
>> + hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
>> + HRTIMER_MODE_ABS);
>
> Please make that like the WAIT_BITSET one, which can select the
> clock and defaults to MONOTONIC.


Done.

>
>> + hrtimer_init_sleeper(to, current);
>> + hrtimer_set_expires(&to->timer, *time);
>> + }
>
> Why setup all this _before_ trying the adaptive spin ?


I placed the retry: label above the adaptive spin loop. This way, if we
wake a task and the lock is "stolen", it doesn't just go right back to
sleep. This should aid fairness and also performance in less contended
cases. I didn't think it was worth an "if (first_time_through && time)"
sort of block to be able to set up the timer after the spin loop.


>> +retry:
>> +#ifdef CONFIG_SMP
>> + if (flags & FLAGS_ADAPTIVE) {
>
> Why do we want that non adaptive version at all ?


We don't. I'm using it here simply to measure the impact of adaptive
spinning. Eventually all that will matter is how it stacks up against a
raw futex_wait/wake mutex implementation like
pthread_mutex_lock/unlock(). But in the early stages it's nice to be
able to eliminate all differences other than adaptive spinning.

...

>> + if (to)
>> + destroy_hrtimer_on_stack(&to->timer);
>> + return ret != -EINTR ? ret : -ERESTARTNOINTR;
>
> Which code sets -EINTR ?


Cruft left over from the futex_lock_pi() roots. Removed.

...

>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 9ab3cd7..a2dbb5b 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3818,6 +3818,65 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
>> out:
>> return 1;
>> }
>> +
>> +#ifdef CONFIG_FUTEX
>> +#include <linux/futex.h>
>> +
>> +int futex_spin_on_owner(u32 __user *uaddr, struct thread_info *owner)
>> +{
>> + unsigned int cpu;
>> + struct rq *rq;
>> +
>> + if (!sched_feat(OWNER_SPIN))
>> + return 0;
>> +
>> +#ifdef CONFIG_DEBUG_PAGEALLOC
>> + /*
>> + * Need to access the cpu field knowing that
>> + * DEBUG_PAGEALLOC could have unmapped it if
>> + * the mutex owner just released it and exited.
>> + */
>> + if (probe_kernel_address(&owner->cpu, cpu))
>> + goto out;
>> +#else
>> + cpu = owner->cpu;
>> +#endif
>> +
>> + /*
>> + * Even if the access succeeded (likely case),
>> + * the cpu field may no longer be valid.
>> + */
>> + if (cpu >= nr_cpumask_bits)
>> + goto out;
>> +
>> + /*
>> + * We need to validate that we can do a
>> + * get_cpu() and that we have the percpu area.
>> + */
>> + if (!cpu_online(cpu))
>> + goto out;
>> +
>> + rq = cpu_rq(cpu);
>> +
>> + for (;;) {
>> + /*
>> + * Owner changed, break to re-assess state.
>> + */
>> + if (futex_owner(uaddr) != owner)
>> + break;
>
> Uurg. It's enough to check whether the TID value changed. No need to
> look up the full thing in every iteration.


OK, we can skip the euid, uid credentials checking by doing a get_user()
instead.
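
Something like this, presumably (a sketch; ownertid would be the TID
cached at the initial lookup, and a fault is simply treated as "stop
spinning"):

/* Sketch: compare the futex word's TID bits against the cached owner
 * TID; much cheaper than re-running the credential-checked task lookup
 * on every iteration. */
static inline int futex_owner_changed(u32 __user *uaddr, pid_t ownertid)
{
        u32 curval;

        if (get_user(curval, uaddr))
                return 1;       /* fault: stop spinning, re-assess */

        return (curval & FUTEX_TID_MASK) != ownertid;
}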


>
>> + /*
>> + * Is that owner really running on that cpu?
>> + */
>> + if (task_thread_info(rq->curr) != owner || need_resched())
>> + return 0;
>> +
>> + cpu_relax();
>> + }
>> +out:
>> + return 1;
>> +}
>> +#endif
>> #endif
>
> Do we really need all this code ? A simple owner->on_cpu (owner needs
> to be the task_struct then) would be sufficient to figure that out,
> wouldn't it?

As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling
through the mutex_spin_on_owner() discussions to see if I can determine
why that's the case.

Thank you for the review.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-07 17:31:49

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

Thomas Gleixner wrote:
> On Mon, 5 Apr 2010, Darren Hart wrote:
>
>> Signed-off-by: Darren Hart <[email protected]>
>> ---
>> kernel/futex.c | 24 +++++++++++++++++++++---
>> 1 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index c33ac2a..af61dcd 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -2385,6 +2385,7 @@ out:
>> /**
>> * trylock_futex_adaptive() - Try to acquire the futex lock in a busy loop
>> * @uaddr: the futex user address
>> + * @timeout: absolute timeout or NULL if none
>> *
>> * Try to acquire a futex lock in a loop until the owner changes or the owner
>> * is descheduled. To lock the futex, set the value to the current TID.
>> @@ -2394,10 +2395,11 @@ out:
>> * 1 - Futex lock acquired
>> * <0 - On error
>> */
>> -static int trylock_futex_adaptive(u32 __user *uaddr)
>> +static int trylock_futex_adaptive(u32 __user *uaddr, ktime_t *timeout)
>> {
>> int ret = 0;
>> u32 curval;
>> + ktime_t now;
>>
>> for (;;) {
>> struct thread_info *owner;
>> @@ -2433,6 +2435,22 @@ static int trylock_futex_adaptive(u32 __user *uaddr)
>> if (need_resched())
>> break;
>>
>> + if (timeout) {
>> + now = ktime_get();
>
> Hmm. Calling that in every iteration might hurt especially on non
> TSC systems, but well...

I haven't come across a better alternative since arming the timer before
setting TASK_INTERRUPTIBLE isn't appropriate.

>
>> +/* FIXME: consider creating ktime_less_than(lhs, rhs) */
>
> No need. The .tv64 comparison works in both cases. :)

Ah, for some reason I was thinking that was only the case if
CONFIG_KTIME_SCALAR was set. Very nice, thanks.
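
The per-iteration check in the spin loop then reduces to a plain 64-bit
compare, something like:

if (timeout) {
        now = ktime_get();
        /* .tv64 compares correctly with or without CONFIG_KTIME_SCALAR */
        if (now.tv64 > timeout->tv64) {
                ret = -ETIMEDOUT;
                break;
        }
}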

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-07 18:44:27

by Gregory Haskins

[permalink] [raw]
Subject: Re: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

>>> On 4/7/2010 at 01:31 PM, in message <[email protected]>, Darren Hart
<[email protected]> wrote:
> Thomas Gleixner wrote:
>> On Mon, 5 Apr 2010, Darren Hart wrote:
>>
>>> Signed-off-by: Darren Hart <[email protected]>

>>> + if (timeout) {
>>> + now = ktime_get();
>>
>> Hmm. Calling that in every iteration might hurt especially on non
>> TSC systems, but well...
>
> I haven't come across a better alternative since arming the timer before
> setting TASK_INTERRUPTIBLE isn't appropriate.

Hey Darren,

I remember we tried something similar in early versions of the adaptive locks and this was definitely bad. :(

It ended up putting so much contention on the xtime_lock (IIRC) that it
resulted in adaptive locks hurting overall performance versus not using
adaptive at all. Alternative mechanisms employed a hybrid where the inner
loops used a pseudo-calibrated counter loop and the outer loop checked
periodically against a real clock. It all plays into the "you are burning
CPU cycles anyway, so might as well put them to use" theory. Hacky, yes,
but it did relieve the pressure on the time subsystem locks and freed up
a _lot_ of performance. Without this, the concept of timeouts+adaptive
was unworkable. I think Steven ultimately rejected the timeout-related
patches outright when he merged adaptive to -rt, but I think Sven pulled
them into SLERT if you would like a potential code reference to a working
solution.
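
The shape of that hybrid is roughly as follows (a sketch of the idea
only, not the SLERT code; INNER_SPINS is an uncalibrated placeholder and
try_acquire() stands in for the cmpxchg attempt):

#define INNER_SPINS     1000    /* placeholder: calibrate per platform */

static int spin_with_timeout(u32 __user *uaddr, ktime_t *timeout)
{
        int i;

        for (;;) {
                /* Inner loop: burn cycles without touching time sources. */
                for (i = 0; i < INNER_SPINS; i++) {
                        if (try_acquire(uaddr))         /* stand-in helper */
                                return 1;
                        cpu_relax();
                }

                /* Outer loop: consult the real clock only occasionally. */
                if (timeout && ktime_get().tv64 > timeout->tv64)
                        return -ETIMEDOUT;

                if (need_resched())
                        return 0;
        }
}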

-Greg


2010-04-07 20:01:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

On Wed, 7 Apr 2010, Darren Hart wrote:
> Thomas Gleixner wrote:
> > On Mon, 5 Apr 2010, Darren Hart wrote:
> > Hmm. The order is weird. Why don't you do that simpler ?
> >
> > Get the uval, the tid and the thread_info pointer outside of the
> > loop. Also task_pid_vnr(current) just needs a one time lookup.
>
> Eeek. Having the owner in the loop is a good way to negate the benefits
> of adaptive spinning by spinning forever (unlikely, but it could
> certainly spin across multiple owners). Nice catch.
>
> As for the uval.... I'm not sure what you mean. You get curval below
> inside the loop, and there is no "uval" in the my version of the code.

Well, you need a first time lookup of owner and ownertid for which you
need the user space value (uval), but thinking more about it it's not
even necessary. Just initialize ownertid to 0 so it will drop into the
lookup code when we did not acquire the futex in the cmpxchg.

> As for the order, I had put the initial spin prior to the cmpxchg to
> avoid doing too many cmpxchg's in a row as they are rather expensive.
> However, since this is (now) the first opportunity to do try and acquire
> the lock atomically after entering the futex syscall, I think you're
> right, it should be the first thing in the loop.
>
> >
> > change the loop to do:
> >
> > for (;;) {
> > curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
> > if (!curval)
> > return 1;
>
> Single return point makes instrumentation so much easier. Unless folks
> _really_ object, I'll leave it as is until we're closer to merging.

I don't care either way. That was just example code.

> > if ((curval & FUTEX_TID_MASK) != ownertid) {
> > ownertid = curval & FUTEX_TID_MASK;
> > owner = update_owner(ownertid);
> > }
>
>
> Hrm... at this point the owner has changed... so we should break and go
> to sleep, not update the owner and start spinning again. The
> futex_spin_on_owner() will detect this and abort, so I'm not seeing the
> purpose of the above if() block.

Why ? If the owner has changed and the new owner is running on another
cpu then why not spin further ?

> > > + hrtimer_init_sleeper(to, current);
> > > + hrtimer_set_expires(&to->timer, *time);
> > > + }
> >
> > Why setup all this _before_ trying the adaptive spin ?
>
>
> I placed the retry: label above the adaptive spin loop. This way if we wake a
> task and the lock is "stolen" it doesn't just go right back to sleep. This
> should aid in fairness and also performance in less contended cases. I didn't
> think it was worth a "if (first_time_through && time)" sort of block to be
> able to setup the timer after the spin loop.

Hmm, ok.

> >
> > Do we really need all this code ? A simple owner->on_cpu (owner needs
> > to be the task_struct then) would be sufficient to figure that out,
> > wouldn't it?
>
> As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling through
> the mutex_spin_on_owner() discussions to see if I can determine why that's the
> case.

AFAICT p->oncpu is the correct thing to use when CONFIG_SMP=y. All it
needs is a simple accessor function and you can keep all the futex
cruft in futex.c where it belongs.
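
One shape for that accessor (the name is illustrative; the
__ARCH_WANT_UNLOCKED_CTXSW dependency is presumably why p->oncpu "isn't
generic" in this era, and task_curr() is one possible generic fallback):

#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
static inline int futex_owner_running(struct task_struct *owner)
{
        return owner->oncpu;
}
#else
static inline int futex_owner_running(struct task_struct *owner)
{
        /* Fall back to comparing against the rq's current task. */
        return task_curr(owner);
}
#endif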

Thanks,

tglx

2010-04-07 23:15:31

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 5/6] futex: handle timeout inside adaptive lock spin

Gregory Haskins wrote:
>>>> On 4/7/2010 at 01:31 PM, in message <[email protected]>,
Darren Hart
> <[email protected]> wrote:
>> Thomas Gleixner wrote:
>>> On Mon, 5 Apr 2010, Darren Hart wrote:
>>>
>>>> Signed-off-by: Darren Hart <[email protected]>
>
>>>> + if (timeout) {
>>>> + now = ktime_get();
>>> Hmm. Calling that in every iteration might hurt especially on non
>>> TSC systems, but well...
>> I haven't come across a better alternative since arming the timer
before
>> setting TASK_INTERRUPTIBLE isn't appropriate.
>
> Hey Darren,
>
> I remember we tried something similar in early versions of the
> adaptive locks and this was definitely bad. :(
>
> It ended up putting so much contention on the xtime_lock (IIRC) that
> it resulted in adaptive locks hurting overall performance verses not
> using adaptive at all. Alternative mechanisms employed a hybrid where
> the inner loops used a pseudo calibrated counter loop, and the outer
> loop checks periodically against a real clock. It all plays into "you
> are burning CPU cycles anyway, so might as well put them to use"
> theory. Hacky, yes, but it did relieve the pressure on the time
> subsystem locks and freed up a _lot_ of performance. Without this,
> the concept of timeouts+adaptive was unworkable. I think Steven
> ultimately rejected the timeout related patches outright when he
> merged adaptive to -rt, but I think Sven pulled them into SLERT if you
> would like a potential code reference to a working solution.
>

Hi Greg,

Thanks for the info! I haven't tested with timeouts yet as I'm still
struggling to get decent performance out of just plain old adaptive
right now. I'll keep that in mind when I get to that, and yeah, if the
plan is to burn CPU cycles, might as well do something constructive :-)

I do feel the timeouts are a necessary feature. Interruptibility may be
as well, but I'm going to ignore it for the time being...

--
Darren

2010-04-08 03:25:18

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

Thomas Gleixner wrote:
> On Wed, 7 Apr 2010, Darren Hart wrote:
>> Thomas Gleixner wrote:
>>> On Mon, 5 Apr 2010, Darren Hart wrote:
>>> Hmm. The order is weird. Why don't you do that simpler ?
>>>
>>> Get the uval, the tid and the thread_info pointer outside of the
>>> loop. Also task_pid_vnr(current) just needs a one time lookup.
>> Eeek. Having the owner in the loop is a good way to negate the benefits
>> of adaptive spinning by spinning forever (unlikely, but it could
>> certainly spin across multiple owners). Nice catch.
>>
>> As for the uval.... I'm not sure what you mean. You get curval below
>> inside the loop, and there is no "uval" in the my version of the code.
>
> Well, you need a first time lookup of owner and ownertid for which you
> need the user space value (uval),
> but thinking more about it it's not
> even necessary. Just initialize ownertid to 0 so it will drop into the
> lookup code when we did not acquire the futex in the cmpxchg.

No need for ownertid at all really. The cmpxchg always tries to go from
0 to curtid. I've pushed the futex_owner() call outside the loop for a
one time lookup.
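
With those changes the loop would look roughly like this (a sketch
reconstructed from this exchange, not the code as later posted; owner
lifetime/refcounting is elided):

static int trylock_futex_adaptive(u32 __user *uaddr)
{
        int ret = 0;
        u32 curval;
        pid_t curtid = task_pid_vnr(current);           /* one-time lookup */
        struct thread_info *owner = futex_owner(uaddr); /* one-time lookup */

        for (;;) {
                /* The atomic acquisition attempt now comes first. */
                curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
                if (curval == 0) {
                        ret = 1;
                        break;
                }
                if (unlikely(curval == -EFAULT)) {
                        ret = -EFAULT;
                        break;
                }

                /* Spin on the owner as in the posted patch. */
                if (owner && futex_spin_on_owner(uaddr, owner))
                        break;

                if (need_resched())
                        break;

                cpu_relax();
        }
        return ret;
}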

>
>> As for the order, I had put the initial spin prior to the cmpxchg to
>> avoid doing too many cmpxchg's in a row as they are rather expensive.
>> However, since this is (now) the first opportunity to do try and acquire
>> the lock atomically after entering the futex syscall, I think you're
>> right, it should be the first thing in the loop.
>>
>>> change the loop to do:
>>>
>>> for (;;) {
>>> curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
>>> if (!curval)
>>> return 1;
>> Single return point makes instrumentation so much easier. Unless folks
>> _really_ object, I'll leave it as is until we're closer to merging.
>
> I don't care either way. That was just example code.
>
>>> if ((curval & FUTEX_TID_MASK) != ownertid) {
>>> ownertid = curval & FUTEX_TID_MASK;
>>> owner = update_owner(ownertid);
>>> }
>>
>> Hrm... at this point the owner has changed... so we should break and go
>> to sleep, not update the owner and start spinning again. The
>> futex_spin_on_owner() will detect this and abort, so I'm not seeing the
>> purpose of the above if() block.
>
> Why ? If the owner has changed and the new owner is running on another
> cpu then why not spin further ?

That's an interesting question, and I'm not sure what the right answer
is. The current approach of the adaptive spinning in the kernel is to
spin until the owner changes or deschedules, then stop and block. The
idea is that if you didn't get the lock before the owner changed, you
aren't going to get it in a very short period of time (you have at least
an entire critical section to wait through plus whatever time you've
already spent spinning). However, blocking just so another task can spin
doesn't really make sense either, and makes the lock less fair than it
could otherwise be.

My goal in starting this is to provide a more intelligent mechanism than
sched_yield() for userspace to use to determine when to spin and when to
sleep. The current implementation allows for spinning up until the owner
changes, deschedules, or the timeslice expires. I believe these are much
better than spinning for some fixed number of cycles and then yielding
for some unpredictable amount of time until CFS decides to schedule you
back in.

Still, the criteria for breaking the spin are something that needs more
eyes, and more numbers before I can be confident in any approach.

>
>>>> + hrtimer_init_sleeper(to, current);
>>>> + hrtimer_set_expires(&to->timer, *time);
>>>> + }
>>> Why setup all this _before_ trying the adaptive spin ?
>>
>> I placed the retry: label above the adaptive spin loop. This way if we wake a
>> task and the lock is "stolen" it doesn't just go right back to sleep. This
>> should aid in fairness and also performance in less contended cases. I didn't
>> think it was worth a "if (first_time_through && time)" sort of block to be
>> able to setup the timer after the spin loop.
>
> Hmm, ok.
>
>>> Do we really need all this code ? A simple owner->on_cpu (owner needs
>>> to be the task_struct then) would be sufficient to figure that out,
>>> wouldn't it?
>> As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling through
>> the mutex_spin_on_owner() discussions to see if I can determine why that's the
>> case.
>
> AFAICT p->oncpu is the correct thing to use when CONFIG_SMP=y. All it
> needs is a simple accessor function and you can keep all the futex
> cruft in futex.c where it belongs.

Noted.

Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-08 03:33:56

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

john cooper wrote:
> Avi Kivity wrote:
>> On 04/06/2010 07:14 PM, Thomas Gleixner wrote:
>>>> IMO the best solution is to spin in userspace while the lock holder is
>>>> running, fall into the kernel when it is scheduled out.
>>>>
>>> That's just not realistic as user space has no idea whether the lock
>>> holder is running or not and when it's scheduled out without a syscall :)
>>>
>> The kernel could easily expose this information by writing into the
>> thread's TLS area.
>>
>> So:
>>
>> - the kernel maintains a current_cpu field in a thread's tls
>> - lock() atomically writes a pointer to the current thread's current_cpu
>> when acquiring
>> - the kernel writes an invalid value to current_cpu when switching out
>> - a contended lock() retrieves the current_cpu pointer, and spins as
>> long as it is a valid cpu
>
> There are certainly details to sort through in the packaging
> of the mechanism but conceptually that should do the job.
> So here the application has chosen a blocking lock as being
> the optimal synchronization operation and we're detecting a
> scenario where we can factor out the aggregate overhead of two
> context switch operations.

I didn't intend to change the behavior of an existing blocking call with
adaptive spinning if that is what you are getting at here. Initially
there would be a new futex op, something like FUTEX_LOCK_ADAPTIVE or
maybe just FUTEX_WAIT_ADAPTIVE. Applications can use this directly to
implement adaptive spinlocks. Ideally glibc would make use of this via
either the existing adaptive spinning NP API or via a new one. Before we
even go there, we need to see if this can provide a real benefit.

>
> There is also the case where the application requires a
> polled lock with the rational being the assumed lock
> hold/wait time is substantially less than the above context
> switch overhead.

Polled lock == userspace spinlock?

> But here we're otherwise completely
> open to indiscriminate scheduling preemption even though
> we may be holding a userland lock.

That's true with any userland lock.

> The adaptive mutex above is an optimization beyond what
> is normally expected for the associated model. The preemption
> of a polled lock OTOH can easily inflict latency several orders
> of magnitude beyond what is expected in that model. Two use
> cases exist here which IMO aren't related except for the latter
> unintentionally degenerating into the former.

Again, my intention is not to replace any existing functionality, so
applications would have to explicitly request this behavior.

If I'm missing your point, please elaborate.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-08 03:41:30

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

[email protected] wrote:
> On Tue, Apr 6, 2010 at 16:16, Thomas Gleixner <[email protected]> wrote:
>> I know that you can do any weird stuff with the futex value, but I
>> don't see the "dramatic" limitation. Care to elaborate ?
>
> If we have to fill in the PID we can represent only three states in a
> futex: 0, PID, -PID. Today we can represent 2^32 states. Quite a
> difference.

For general futexes sure, but not for robust or PI mutexes. Having
adaptive futexes be based on the TID|WAITERS_FLAG policy certainly isn't
breaking new ground, and is consistent with the other kernel-side futex
locking implementations.

However, I agree that a FUTEX_SET_WAIT_ADAPTIVE sort of call might be
very powerful. Then again, that might be just academic until I can show
an improvement with adaptive futexes.

>> The per thread pinned page would be unconditional, right ?
>
> Only if the process would be using these adaptive mutexes. It could be
> conditional.

What about the concern of this TLS approach only working for process
private locks? I would very much like to make this work for both shared
and private locks.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-08 04:29:39

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Wed, Apr 7, 2010 at 20:41, Darren Hart <[email protected]> wrote:
> For general futexes sure, but not for robust or PI mutexes. Having adaptive
> futexes be based on the TID|WAITERS_FLAG policy certainly isn't breaking new
> ground, and is consistent with the other kernel-side futex locking
> implementations.

PI mutexes are really unimportant in the big world. I know your employer cares but overall it's a minute fraction. The focus should be primarily on the normal futexes.

BTW, you want to stuff a flag in the futex word? This doesn't work in general. For a mutex we need three distinct values. For PI futexes they are 0, TID, and -TID. If we have 31-bit TID values there isn't enough room for another bit.


> What about the concern of this TLS approach only working for process private
> locks? I would very much like to make this work for both shared and private
> locks.

Again, hardly a general concern. It's a minute fraction of mutexes which is shared.

It should be clear that the kernel approach and the userlevel approach are complementary and could even be used together. If the userlevel variant proves significantly faster (and I assume it will) then the kernel variant could be used for shared mutexes etc.



2010-04-08 05:58:52

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 6/6] futex: Add aggressive adaptive spinning argument to FUTEX_LOCK

To eliminate syscall overhead from the equation, I modified the testcase
to allow for forcing the syscall on lock(). Doing so cut the
non-adaptive scores by more than half. The adaptive scores dropped
accordingly. The relative difference between normal and adaptive
remained intact (with my adaptive implementation lagging by 10x). So
while the syscall overhead does impact the scores, it is not the source
of the performance issue with the adaptive futex implementation I posted.

The following bits were being used to test for spinners and attempt to
only allow one spinner. Obviously it failed miserably at that. I found
up to 8 spinners running at a time with an instrumented kernel.

> @@ -2497,6 +2502,14 @@ static int futex_lock(u32 __user *uaddr, int flags, int detect, ktime_t *time)
> retry:
> #ifdef CONFIG_SMP
> if (flags & FLAGS_ADAPTIVE) {
> + if (!aas) {
> + ret = get_user(uval, uaddr);
> + if (ret)
> + goto out;
> + if (uval & FUTEX_WAITERS)
> + goto skip_adaptive;
> + }

Trouble is, at this point there are no more bits in the word available
for a FUTEX_SPINNER bit. The futex word is the only per-futex storage we
have; the futex_q is per task.

If we overload the FUTEX_WAITERS bit it will force more futex_wake()
calls on the unlock() path. It also will effectively disable spinning
under contention as there are bound to be FUTEX_WAITERS in that case.

Another option I dislike is to forget about robust futexes in
conjunction with adaptive futexes and overload the FUTEX_OWNER_DIED bit.
Ulrich mentioned in another mail that "If we have 31 bit TID values
there isn't enough room for another bit." Since we have two flag bits
now, I figured TID values were 30 bits. Is there an option to run with
31 bits or something?
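
For reference, the existing layout that consumes those two bits, from
include/linux/futex.h:

#define FUTEX_WAITERS           0x80000000
#define FUTEX_OWNER_DIED        0x40000000
#define FUTEX_TID_MASK          0x3fffffff      /* 30 bits of TID */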

Assuming we all agree that these options are "bad", that leaves us with
looking for somewhere else to store the information we need, which in
turn brings us back around to what Avi, Alan, and Ulrich were discussing
regarding non-swappable TLS data and a pointer in the futex value.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-08 23:10:48

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

On Wed, 2010-04-07 at 20:25 -0700, Darren Hart wrote:
> Thomas Gleixner wrote:
> > On Wed, 7 Apr 2010, Darren Hart wrote:
> >> Thomas Gleixner wrote:
> >>> On Mon, 5 Apr 2010, Darren Hart wrote:
> >>> Hmm. The order is weird. Why don't you do that simpler ?
> >>>
> >>> Get the uval, the tid and the thread_info pointer outside of the
> >>> loop. Also task_pid_vnr(current) just needs a one time lookup.
> >> Eeek. Having the owner in the loop is a good way to negate the benefits
> >> of adaptive spinning by spinning forever (unlikely, but it could
> >> certainly spin across multiple owners). Nice catch.
> >>
> >> As for the uval.... I'm not sure what you mean. You get curval below
> >> inside the loop, and there is no "uval" in the my version of the code.
> >
> > Well, you need a first time lookup of owner and ownertid for which you
> > need the user space value (uval),
> > but thinking more about it it's not
> > even necessary. Just initialize ownertid to 0 so it will drop into the
> > lookup code when we did not acquire the futex in the cmpxchg.
>
> No need for ownertid at all really. The cmpxchg always tries to go from
> 0 to curtid. I've pushed the futex_owner() call outside the loop for a
> one time lookup.
>
> >
> >> As for the order, I had put the initial spin prior to the cmpxchg to
> >> avoid doing too many cmpxchg's in a row as they are rather expensive.
> >> However, since this is (now) the first opportunity to do try and acquire
> >> the lock atomically after entering the futex syscall, I think you're
> >> right, it should be the first thing in the loop.
> >>
> >>> change the loop to do:
> >>>
> >>> for (;;) {
> >>> curval = cmpxchg_futex_value_locked(uaddr, 0, curtid);
> >>> if (!curval)
> >>> return 1;
> >> Single return point makes instrumentation so much easier. Unless folks
> >> _really_ object, I'll leave it as is until we're closer to merging.
> >
> > I don't care either way. That was just example code.
> >
> >>> if ((curval & FUTEX_TID_MASK) != ownertid) {
> >>> ownertid = curval & FUTEX_TID_MASK;
> >>> owner = update_owner(ownertid);
> >>> }
> >>
> >> Hrm... at this point the owner has changed... so we should break and go
> >> to sleep, not update the owner and start spinning again. The
> >> futex_spin_on_owner() will detect this and abort, so I'm not seeing the
> >> purpose of the above if() block.
> >
> > Why ? If the owner has changed and the new owner is running on another
> > cpu then why not spin further ?
>
> That's an interesting question, and I'm not sure what the right answer
> is. The current approach of the adaptive spinning in the kernel is to
> spin until the owner changes or deschedules, then stop and block. The
> idea is that if you didn't get the lock before the owner changed, you
> aren't going to get it in a very short period of time (you have at least
> an entire critical section to wait through plus whatever time you've
> already spent spinning). However, blocking just so another task can spin
> doesn't really make sense either, and makes the lock less fair than it
> could otherwise be.

Not only less fair, but potentially could cause starvation, no? Perhaps
you could see this if you changed your model to allow all contended
tasks to spin instead of just one.

If a spinning task blocks because of an owner change, and a new task
enters and starts spinning directly after the owner change, at what
point does the original task get woken up? It's likely that the new
spinner will get the resource next, no? Rinse/repeat with another task
and the original spinner is starved.

(Or am I missing something? My understanding was that unfairness was
built into this algo... If so, then the above is a possibility, right?)

Best,
-PWM

>
> My goal in starting this is to provide a more intelligent mechanism than
> sched_yield() for userspace to use to determine when to spin and when to
> sleep. The current implementation allows for spinning up until the owner
> changes, deschedules, or the timeslice expires. I believe these are much
> better than spinning for some fixed number of cycles and then yield for
> some unpredictable amount of time until CFS decides to schedule you back in.
>
> Still, the criteria for breaking the spin are something that needs more
> eyes, and more numbers before I can be confident in any approach.
>
> >
> >>>> + hrtimer_init_sleeper(to, current);
> >>>> + hrtimer_set_expires(&to->timer, *time);
> >>>> + }
> >>> Why setup all this _before_ trying the adaptive spin ?
> >>
> >> I placed the retry: label above the adaptive spin loop. This way if we wake a
> >> task and the lock is "stolen" it doesn't just go right back to sleep. This
> >> should aid in fairness and also performance in less contended cases. I didn't
> >> think it was worth a "if (first_time_through && time)" sort of block to be
> >> able to setup the timer after the spin loop.
> >
> > Hmm, ok.
> >
> >>> Do we really need all this code ? A simple owner->on_cpu (owner needs
> >>> to be the task_struct then) would be sufficient to figure that out,
> >>> wouldn't it?
> >> As Peter pointed out in IRC, p->oncpu isn't generic. I'll go trolling through
> >> the mutex_spin_on_owner() discussions to see if I can determine why that's the
> >> case.
> >
> > AFAICT p->oncpu is the correct thing to use when CONFIG_SMP=y. All it
> > needs is a simple accessor function and you can keep all the futex
> > cruft in futex.c where it belongs.
>
> Noted.
>
> Thanks,

2010-04-09 05:41:38

by Darren Hart

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

Peter W. Morreale wrote:
> On Wed, 2010-04-07 at 20:25 -0700, Darren Hart wrote:
>> Thomas Gleixner wrote:
>>> On Wed, 7 Apr 2010, Darren Hart wrote:
>>>> Thomas Gleixner wrote:

>>>>> if ((curval & FUTEX_TID_MASK) != ownertid) {
>>>>> ownertid = curval & FUTEX_TID_MASK;
>>>>> owner = update_owner(ownertid);
>>>>> }
>>>> Hrm... at this point the owner has changed... so we should break and go
>>>> to sleep, not update the owner and start spinning again. The
>>>> futex_spin_on_owner() will detect this and abort, so I'm not seeing the
>>>> purpose of the above if() block.
>>> Why ? If the owner has changed and the new owner is running on another
>>> cpu then why not spin further ?
>> That's an interesting question, and I'm not sure what the right answer
>> is. The current approach of the adaptive spinning in the kernel is to
>> spin until the owner changes or deschedules, then stop and block. The
>> idea is that if you didn't get the lock before the owner changed, you
>> aren't going to get it in a very short period of time (you have at least
>> an entire critical section to wait through plus whatever time you've
>> already spent spinning). However, blocking just so another task can spin
>> doesn't really make sense either, and makes the lock less fair than it
>> could otherwise be.
>
> Not only less fair, but potentially could cause starvation, no? Perhaps
> you could see this if you changed your model to allow all contended
> tasks to spin instead of just one.

Agreed, and V5 (just posted) does just that.

>
> If a spinning task blocks because of an owner change, and a new task
> enters and starts spinning directly after the owner change, at what
> point does the original task get woken up?

At the time of unlock the owner will have to call FUTEX_WAKE. This task
will wake and attempt to acquire the lock - it will lose races with
already running contenders. Lock stealing, adaptive spinning, etc. are
all going to lead to less fair locks in exchange for throughput.
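
For context, the unlock path that issues the FUTEX_WAKE follows the
usual TID|WAITERS pattern (a userspace sketch in that style, not code
from this series):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <stdint.h>

/* Sketch: release a TID-valued futex and wake one sleeper only if the
 * FUTEX_WAITERS bit was set while we held the lock. */
static void futex_lock_unlock(uint32_t *uaddr)
{
        uint32_t old = __atomic_exchange_n(uaddr, 0, __ATOMIC_RELEASE);

        if (old & FUTEX_WAITERS)
                syscall(SYS_futex, uaddr, FUTEX_WAKE, 1, NULL, NULL, 0);
}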

> Its likely that the new
> spinner will get the resource next, no? Rinse/repeat with another task
> and the original spinner is starved.
>
> (Or am I missing something? My understanding was that unfairness was
> built-in to this algo... If so, then the above is a possibility, right?)

Yes it is. These locks are typically used in situations where it is more
important that some work gets completed than _which_ work gets completed.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-09 06:04:51

by john cooper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

Darren Hart wrote:
> john cooper wrote:
>> But here we're otherwise completely
>> open to indiscriminate scheduling preemption even though
>> we may be holding a userland lock.
>
> That's true with any userland lock.

Agreed. However, from your earlier mail it seemed
addressing this scenario was within the scope of
consideration.

There are two ways to deal with this condition: either
reactively, in the sense that we act after the lock
holder has been preempted and we subsequently find
ourselves spinning in sibling thread context attempting
to acquire the lock; or proactively, where we provide a
time-bounded deferral of lock holder preemption, with
the assumption that the lock hold path's overhead has a
negligible effect upon deferring a potentially
coincident scheduling operation.

It is fairly straightforward to demonstrate the impact
to performance with a focused micro benchmark, less so
for a more "typical" application with the effect being
diluted among other app activity.

The two approaches are complementary, with differing
system wide tradeoffs. Both require some insight into
the scheduling disposition of the lock holder, the
preemption deferral mechanism more so. If a scheme to
expose scheduler state transitions to (or cooperate
with) userland locking primitives is being considered,
it seems opportune to consider support as well for
this scenario.

-john

--
[email protected]

2010-04-09 13:13:34

by Peter W. Morreale

[permalink] [raw]
Subject: Re: [PATCH 4/6] futex: Add FUTEX_LOCK with optional adaptive spinning

On Thu, 2010-04-08 at 22:41 -0700, Darren Hart wrote:
> Peter W. Morreale wrote:
> > On Wed, 2010-04-07 at 20:25 -0700, Darren Hart wrote:
> >> Thomas Gleixner wrote:
> >>> On Wed, 7 Apr 2010, Darren Hart wrote:
> >>>> Thomas Gleixner wrote:
>
> >>>>> if ((curval & FUTEX_TID_MASK) != ownertid) {
> >>>>> ownertid = curval & FUTEX_TID_MASK;
> >>>>> owner = update_owner(ownertid);
> >>>>> }
> >>>> Hrm... at this point the owner has changed... so we should break and go
> >>>> to sleep, not update the owner and start spinning again. The
> >>>> futex_spin_on_owner() will detect this and abort, so I'm not seeing the
> >>>> purpose of the above if() block.
> >>> Why ? If the owner has changed and the new owner is running on another
> >>> cpu then why not spin further ?
> >> That's an interesting question, and I'm not sure what the right answer
> >> is. The current approach of the adaptive spinning in the kernel is to
> >> spin until the owner changes or deschedules, then stop and block. The
> >> idea is that if you didn't get the lock before the owner changed, you
> >> aren't going to get it in a very short period of time (you have at least
> >> an entire critical section to wait through plus whatever time you've
> >> already spent spinning). However, blocking just so another task can spin
> >> doesn't really make sense either, and makes the lock less fair than it
> >> could otherwise be.
> >
> > Not only less fair, but potentially could cause starvation, no? Perhaps
> > you could see this if you changed your model to allow all contended
> > tasks to spin instead of just one.
>
> Agreed, and V5 (just posted) does just that.
>
> >
> > If a spinning task blocks because of an owner change, and a new task
> > enters and starts spinning directly after the owner change, at what
> > point does the original task get woken up?
>
> At the time of unlock the owner will have to call FUTEX_WAKE. This task
> will wake and attempt to acquire the lock - it will lose races with
> aclready running contenders. Lock stealing, adaptive spinning, etc are
> all going to lead to less fair locks in exchange for throughput.

Nod. My only point was that if only one can spin, and it sleeps when the
owner is not on a CPU, then you virtually guarantee worst-case
performance under certain conditions (i.e., multiple threads coming in
at various times).

If all spin, and they sleep only on owner block, then the 'unfairness'
is potentially more evenly distributed via hardware (and stealing) and
when the owner is blocked.


Best,
-PWM

>
> > Its likely that the new
> > spinner will get the resource next, no? Rinse/repeat with another task
> > and the original spinner is starved.
> >
> > (Or am I missing something? My understanding was that unfairness was
> > built-in to this algo... If so, then the above is a possibility, right?)
>
> Yes it is. These locks are typically used in situations where it is more
> important that some work gets completed than _which_ work gets completed.
>
> Thanks,
>
> --
> Darren Hart
> IBM Linux Technology Center
> Real-Time Linux Team

2010-04-10 23:32:30

by Alan

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

> The non-swappable part will be tricky. One doesn't know how many
> threads will be created in a process. This mechanism shouldn't put an
> arbitrary limit in place. So where to allocate the memory? Perhaps
> it's better to implicitly mark the memory page pointed to by the new
> syscall as non-swappable? This could mean one page per thread...

You only need one page per 4096 threads or so if you create the page on
the first request, tied to, say, the signals and the like in threaded
tasks, and then allocate 'slots' in the page for future calls until
you've got 4096 live.

i.e., you'd see something like:

addr = set_tid_notifier_addr();

[1st call: map page at x to x + 4095, probably R/O]
[returns x]

next thread:
addr = set_tid_notifier_addr();

[returns x + 1]
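
On the kernel side the slot handout might look something like this
(every name below is hypothetical; it only illustrates the per-process
bookkeeping):

/* All names here are hypothetical. One page serves the whole process;
 * each thread gets a slot until the page fills up. */
#define NOTIFY_SLOTS_PER_PAGE   4096

struct tid_notify_page {
        void __user     *uaddr;         /* user mapping, probably R/O */
        unsigned int     next_slot;
};

static void __user *set_tid_notifier_addr(struct signal_struct *sig)
{
        struct tid_notify_page *np = sig->tid_notify;   /* assumed field */

        if (!np || np->next_slot == NOTIFY_SLOTS_PER_PAGE)
                np = alloc_and_map_notify_page(sig);    /* assumed helper */

        return np->uaddr + np->next_slot++;
}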

Alan

2010-04-10 23:53:56

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH V2 0/6][RFC] futex: FUTEX_LOCK with optional adaptive spinning

On Sat, Apr 10, 2010 at 16:35, Alan Cox <[email protected]> wrote:
> You only need one page per 4096 threads

Very expensive. Each cache line would be fought over by 64 threads.
Constant RFOs make context switches significantly slower.

At most 4096/64 = 64 threads should share one page. One page would
still cover almost all processes.
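
That is, each thread's slot would be padded out to a cache line, along
the lines of (64-byte lines assumed):

/* One cache line per thread avoids the RFO ping-pong: 4096/64 = 64
 * threads per page rather than 4096. */
struct tid_notify_slot {
        int     current_cpu;    /* updated by the kernel on switch */
        char    pad[64 - sizeof(int)];
} __attribute__((aligned(64)));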