2014-01-12 23:31:43

by Davidlohr Bueso

Subject: [PATCH v6 0/5] futex: Wakeup optimizations

Changes from v5 [https://lkml.org/lkml/2014/1/2/178]:
 - Added latest review tags.

 - Removed redundant CONFIG_SMP ifdef in futex_get_mm().

 - Updated comments per Darren's suggestions.

Changes from v3/v4 [http://lkml.org/lkml/2013/12/19/627]:
 - Almost completely redid patch 4, based on suggestions
   by Linus. Instead of adding an atomic counter to keep
   track of the plist size, couple the list's head empty
   call with a check to see if the hb lock is locked.
   This solves the race that motivated the use of the new
   atomic field.

 - Fix grammar in patch 3

 - Fix SOB tags.

Changes from v2 [http://lwn.net/Articles/575449/]:
 - Reordered SOB tags to reflect me as primary author.

 - Improved ordering guarantee comments for patch 4.

 - Rebased patch 4 against Linus' tree (this patch didn't
   apply after the recent futex changes/fixes).

Changes from v1 [https://lkml.org/lkml/2013/11/22/525]:
 - Removed patch "futex: Check for pi futex_q only once".

 - Cleaned up ifdefs for larger hash table.

 - Added a doc patch from tglx that describes the futex
   ordering guarantees.

 - Improved the lockless plist check for the wake calls.
   Based on the community feedback, the necessary abstractions
   and barriers are added to maintain ordering guarantees.
   Code documentation is also updated.

 - Removed patch "sched,futex: Provide delayed wakeup list".
   Based on feedback from PeterZ, I will look into this as
   a separate issue once the other patches are settled.

We have been dealing with a customer database workload on a large
12 TB, 240-core, 16-socket NUMA system that exhibits high amounts
of contention on some of the locks that serialize the internal futex
data structures. This workload especially suffers in the wakeup
paths, where waiting on the corresponding hb->lock can account for
up to ~60% of the time. The result of such calls can mostly be
classified as (i) nothing to wake up, and (ii) a large number of
tasks to wake up.

Before these patches are applied, we can see this pathological behavior:

 37.12%  826174  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--97.14%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.70%-- 0x7f383fbdea1f
             |          |           yyy

 43.71%  762296  xxx  [kernel.kallsyms] [k] _raw_spin_lock
            --- _raw_spin_lock
             |
             |--53.74%-- futex_wake
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          |
             |          |--99.40%-- 0x7fe7d44a4c05
             |          |           zzz
             |--45.90%-- futex_wait_setup
             |          futex_wait
             |          do_futex
             |          sys_futex
             |          system_call_fastpath
             |          0x7fe7ba315789
             |          syscall


With these patches, contention is practically non existent:

 0.10%     49   xxx  [kernel.kallsyms]   [k] _raw_spin_lock
               --- _raw_spin_lock
                |
                |--76.06%-- futex_wait_setup
                |          futex_wait
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--99.90%-- 0x7f3165e63789
                |          |          syscall
                           ...
                |--6.27%-- futex_wake
                |          do_futex
                |          sys_futex
                |          system_call_fastpath
                |          |
                |          |--54.56%-- 0x7f317fff2c05
                ...

Patch 1 is a cleanup.

Patch 2 addresses the well-known issue of the global hash table.
By creating a larger and NUMA-aware table, we can reduce false
sharing and collisions, thus reducing the chance of different futexes
contending on the same hb->lock.

Patch 3 documents the futex ordering guarantees.

Patch 4 reduces contention on the corresponding hb->lock by not trying to
acquire it if there are no blocked tasks in the waitqueue.
This particularly deals with point (i) above, where we see that it is not
uncommon for up to 90% of wakeup calls to return 0, indicating that no
tasks were woken.
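
For reference, the heart of this optimization is the hb_waiters_pending()
helper added by patch 4 (excerpted from the full diff below); futex_wake()
calls it before taking the lock and simply returns when it yields false:

static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	/*
	 * Tasks trying to enter the critical region are most likely
	 * potential waiters that will be added to the plist. Ensure
	 * that wakers won't miss to-be-slept tasks in the window between
	 * the wait call and the actual plist_add.
	 */
	if (spin_is_locked(&hb->lock))
		return true;
	smp_rmb(); /* Make sure we check the lock state first */

	return !plist_head_empty(&hb->chain);
#else
	return true;
#endif
}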

Patch 5 silences a gcc warning message.

This patchset has also been tested on smaller systems for a variety of
benchmarks, including java workloads, kernel builds and custom programs
that bang the hell out of the hb locks. So far, no functional or performance
regressions have been seen.
Furthermore, no issues were found when running the different tests in the futextest
suite: http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/

This patchset applies on top of Linus' tree as of v3.13-rc6 (9a0bb296)

Special thanks to Scott Norton, Tom Vaden, Mark Ray and Aswin Chandramouleeswaran
for their help presenting, debugging and analyzing the data.

futex: Misc cleanups
futex: Larger hash table
futex: Document ordering guarantees
futex: Avoid taking hb lock if nothing to wakeup
futex: Silence uninitialized warnings

kernel/futex.c | 205 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 163 insertions(+), 42 deletions(-)

--
1.8.1.4


2014-01-12 23:32:25

by Davidlohr Bueso

Subject: [PATCH 5/5] futex: Silence uninitialized warnings

From: Davidlohr Bueso <[email protected]>

Callers of cmpxchg_futex_value_locked() can trigger the following:

kernel/futex.c: In function ‘futex_lock_pi_atomic’:
kernel/futex.c:725: warning: ‘curval’ may be used uninitialized in this function

This was initially addressed by commit 7cfdaf38, but others still remain. Silence
these messages once and for all, as the variables are never actually used
uninitialized.
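
For context, uninitialized_var() is not magic: in the kernel headers of this
era (compiler-gcc.h) it is just a self-assignment that stops gcc from warning,
without emitting any code:

#define uninitialized_var(x) x = x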

Cc: Ingo Molnar <[email protected]>
Acked-by: Darren Hart <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Jason Low <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index be6399a..8d40953 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -838,7 +838,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
struct task_struct *task, int set_waiters)
{
int lock_taken, ret, force_take = 0;
- u32 uval, newval, curval, vpid = task_pid_vnr(task);
+ u32 uval, newval, uninitialized_var(curval), vpid = task_pid_vnr(task);

retry:
ret = lock_taken = 0;
@@ -2227,7 +2227,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
- u32 uval, vpid = task_pid_vnr(current);
+ u32 uninitialized_var(uval), vpid = task_pid_vnr(current);
int ret;

retry:
@@ -2843,7 +2843,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,

static int __init futex_init(void)
{
- u32 curval;
+ u32 uninitialized_var(curval);
unsigned long i;

#if CONFIG_BASE_SMALL
--
1.8.1.4

2014-01-12 23:32:29

by Davidlohr Bueso

Subject: [PATCH 4/5] futex: Avoid taking hb lock if nothing to wakeup

From: Davidlohr Bueso <[email protected]>

In futex_wake() there is clearly no point in taking the hb->lock if we know
beforehand that there are no tasks to be woken. While the hash bucket's plist
head is a cheap way of knowing this, we cannot rely 100% on it as there is a
racy window between the futex_wait call and when the task is actually added to
the plist. To this end, we couple it with the spinlock check, as tasks trying to
enter the critical region are most likely potential waiters that will be added
to the plist, thus preventing tasks from sleeping forever if wakers don't
acknowledge all possible waiters.

Furthermore, the futex ordering guarantees are preserved, ensuring that waiters
either observe the changed user space value before blocking or are woken by a
concurrent waker. For wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have an implicit mb in atomic_inc(),
we explicitly add them through a new futex_get_mm() function. For waiters we rely
on the fact that spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the code and related
discussion: https://lkml.org/lkml/2013/11/26/556
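
The ordering argument boils down to a store-buffering litmus test. Below is a
minimal userspace sketch using C11 atomics (illustrative only, not kernel
code): X stands for the waiters indication, Y for the futex word, and the two
seq_cst fences play the roles of barriers (A) and (B). With both fences in
place, rx == 0 && ry == 0 can never be observed:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int X, Y;		/* X := "waiters", Y := futex word */
static int rx, ry;

static void *waiter(void *arg)
{
	atomic_store_explicit(&X, 1, memory_order_relaxed);	/* waiters++ */
	atomic_thread_fence(memory_order_seq_cst);		/* MB (A) */
	ry = atomic_load_explicit(&Y, memory_order_relaxed);	/* uval = *futex */
	return NULL;
}

static void *waker(void *arg)
{
	atomic_store_explicit(&Y, 1, memory_order_relaxed);	/* *futex = newval */
	atomic_thread_fence(memory_order_seq_cst);		/* MB (B) */
	rx = atomic_load_explicit(&X, memory_order_relaxed);	/* check waiters */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, waiter, NULL);
	pthread_create(&t2, NULL, waker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* The paired fences forbid both sides missing each other: */
	printf("rx=%d ry=%d%s\n", rx, ry,
	       rx == 0 && ry == 0 ? " (forbidden!)" : "");
	return 0;
}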

Special thanks to tglx for careful review and feedback.

Cc: Ingo Molnar <[email protected]>
Reviewed-by: Darren Hart <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Jason Low <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 115 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 91 insertions(+), 24 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..be6399a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,17 +75,20 @@
* The waiter reads the futex value in user space and calls
* futex_wait(). This function computes the hash bucket and acquires
* the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
*
* The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
*
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In futex wake up scenarios where no tasks are blocked on a futex, taking
+ * the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
*
* CPU 0 CPU 1
* val = *futex;
@@ -106,24 +109,52 @@
* This would cause the waiter on CPU 0 to wait forever because it
* missed the transition of the user space value from val to newval
* and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
+ *
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
*
* CPU 0 CPU 1
* val = *futex;
* sys_futex(WAIT, futex, val);
* futex_wait(futex, val);
- * lock(hash_bucket(futex));
- * uval = *futex;
- * *futex = newval;
- * sys_futex(WAKE, futex);
- * futex_wake(futex);
- * lock(hash_bucket(futex));
+ *
+ * waiters++;
+ * mb(); (A) <-- paired with -.
+ * |
+ * lock(hash_bucket(futex)); |
+ * |
+ * uval = *futex; |
+ * | *futex = newval;
+ * | sys_futex(WAKE, futex);
+ * | futex_wake(futex);
+ * |
+ * `-------> mb(); (B)
* if (uval == val)
- * queue();
+ * queue();
* unlock(hash_bucket(futex));
- * schedule(); if (!queue_empty())
- * wake_waiters(futex);
- * unlock(hash_bucket(futex));
+ * schedule(); if (waiters)
+ * lock(hash_bucket(futex));
+ * wake_waiters(futex);
+ * unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read -- this is done by the
+ * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
+ * depending on the futex type.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ * X = Y = 0
+ *
+ * w[X]=1 w[Y]=1
+ * MB MB
+ * r[Y]=y r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
*/

int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +242,36 @@ static unsigned long __read_mostly futex_hashsize;

static struct futex_hash_bucket *futex_queues;

+static inline void futex_get_mm(union futex_key *key)
+{
+ atomic_inc(&key->private.mm->mm_count);
+ /*
+ * Ensure futex_get_mm() implies a full barrier such that
+ * get_futex_key() implies a full barrier. This is relied upon
+ * as full barrier (B), see the ordering comment above.
+ */
+ smp_mb__after_atomic_inc();
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+ /*
+ * Tasks trying to enter the critical region are most likely
+ * potential waiters that will be added to the plist. Ensure
+ * that wakers won't miss to-be-slept tasks in the window between
+ * the wait call and the actual plist_add.
+ */
+ if (spin_is_locked(&hb->lock))
+ return true;
+ smp_rmb(); /* Make sure we check the lock state first */
+
+ return !plist_head_empty(&hb->chain);
+#else
+ return true;
+#endif
+}
+
/*
* We hash on the keys returned from get_futex_key (see below).
*/
@@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)

switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- ihold(key->shared.inode);
+ ihold(key->shared.inode); /* implies MB (B) */
break;
case FUT_OFF_MMSHARED:
- atomic_inc(&key->private.mm->mm_count);
+ futex_get_mm(key); /* implies MB (B) */
break;
}
}
@@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
- get_futex_key_refs(key);
+ get_futex_key_refs(key); /* implies MB (B) */
return 0;
}

@@ -429,7 +490,7 @@ again:
key->shared.pgoff = basepage_index(page);
}

- get_futex_key_refs(key);
+ get_futex_key_refs(key); /* implies MB (B) */

out:
unlock_page(page_head);
@@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
goto out;

hb = hash_futex(&key);
+
+ /* Make sure we really have tasks to wakeup */
+ if (!hb_waiters_pending(hb))
+ goto out_put_key;
+
spin_lock(&hb->lock);

plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}

spin_unlock(&hb->lock);
+out_put_key:
put_futex_key(&key);
out:
return ret;
@@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
hb = hash_futex(&q->key);
q->lock_ptr = &hb->lock;

- spin_lock(&hb->lock);
+ spin_lock(&hb->lock); /* implies MB (A) */
return hb;
}

--
1.8.1.4

2014-01-12 23:32:27

by Davidlohr Bueso

Subject: [PATCH v6 1/5] futex: Misc cleanups

From: Jason Low <[email protected]>

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Cc: Ingo Molnar <[email protected]>
Reviewed-by: Darren Hart <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 39 ++++++++++++---------------------------
1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f6ff019..085f5fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -598,13 +598,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
{
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
- struct plist_head *head;
struct task_struct *p;
pid_t pid = uval & FUTEX_TID_MASK;

- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex(&this->key, key)) {
/*
* Another waiter already exists - bump up
@@ -986,7 +983,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
int ret;

@@ -999,9 +995,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)

hb = hash_futex(&key);
spin_lock(&hb->lock);
- head = &hb->chain;

- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1034,7 +1029,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret;

@@ -1082,9 +1076,7 @@ retry_private:
goto retry;
}

- head = &hb1->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (match_futex (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1097,10 +1089,8 @@ retry_private:
}

if (op_ret > 0) {
- head = &hb2->chain;
-
op_ret = 0;
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
if (match_futex (&this->key, &key2)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1270,7 +1260,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
int drop_count = 0, task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head1;
struct futex_q *this, *next;
u32 curval2;

@@ -1393,8 +1382,7 @@ retry_private:
}
}

- head1 = &hb1->chain;
- plist_for_each_entry_safe(this, next, head1, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (task_count - nr_wake >= nr_requeue)
break;

@@ -1494,7 +1482,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
}

static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
spin_unlock(&hb->lock);
@@ -1867,7 +1855,7 @@ retry_private:
ret = get_futex_value_locked(&uval, uaddr);

if (ret) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);

ret = get_user(uval, uaddr);
if (ret)
@@ -1881,7 +1869,7 @@ retry_private:
}

if (uval != val) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);
ret = -EWOULDBLOCK;
}

@@ -2029,7 +2017,7 @@ retry_private:
* Task is exiting and we just wait for the
* exit to complete.
*/
- queue_unlock(&q, hb);
+ queue_unlock(hb);
put_futex_key(&q.key);
cond_resched();
goto retry;
@@ -2081,7 +2069,7 @@ retry_private:
goto out_put_key;

out_unlock_put_key:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

out_put_key:
put_futex_key(&q.key);
@@ -2091,7 +2079,7 @@ out:
return ret != -EINTR ? ret : -ERESTARTNOINTR;

uaddr_faulted:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

ret = fault_in_user_writeable(uaddr);
if (ret)
@@ -2113,7 +2101,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
u32 uval, vpid = task_pid_vnr(current);
int ret;
@@ -2153,9 +2140,7 @@ retry:
* Ok, other tasks may need to be woken up - check waiters
* and do the wakeup if necessary:
*/
- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (!match_futex (&this->key, &key))
continue;
ret = wake_futex_pi(uaddr, uval, this);
--
1.8.1.4

2014-01-12 23:32:23

by Davidlohr Bueso

Subject: [PATCH v6 3/5] futex: Document ordering guarantees

From: Thomas Gleixner <[email protected]>

That's essential, if you want to hack on futexes.

Cc: Ingo Molnar <[email protected]>
Reviewed-by: Darren Hart <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Jason Low <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 577481d..fcc6850 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,63 @@

#include "locking/rtmutex_common.h"

+/*
+ * Basic futex operation and ordering guarantees:
+ *
+ * The waiter reads the futex value in user space and calls
+ * futex_wait(). This function computes the hash bucket and acquires
+ * the hash bucket lock. After that it reads the futex user space value
+ * again and verifies that the data has not changed. If it has not
+ * changed it enqueues itself into the hash bucket, releases the hash
+ * bucket lock and schedules.
+ *
+ * The waker side modifies the user space value of the futex and calls
+ * futex_wake(). This functions computes the hash bucket and acquires
+ * the hash bucket lock. Then it looks for waiters on that futex in the
+ * hash bucket and wakes them.
+ *
+ * Note that the spin_lock serializes waiters and wakers, so that the
+ * following scenario is avoided:
+ *
+ * CPU 0 CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ * futex_wait(futex, val);
+ * uval = *futex;
+ * *futex = newval;
+ * sys_futex(WAKE, futex);
+ * futex_wake(futex);
+ * if (queue_empty())
+ * return;
+ * if (uval == val)
+ * lock(hash_bucket(futex));
+ * queue();
+ * unlock(hash_bucket(futex));
+ * schedule();
+ *
+ * This would cause the waiter on CPU 0 to wait forever because it
+ * missed the transition of the user space value from val to newval
+ * and the waker did not find the waiter in the hash bucket queue.
+ * The spinlock serializes that:
+ *
+ * CPU 0 CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ * futex_wait(futex, val);
+ * lock(hash_bucket(futex));
+ * uval = *futex;
+ * *futex = newval;
+ * sys_futex(WAKE, futex);
+ * futex_wake(futex);
+ * lock(hash_bucket(futex));
+ * if (uval == val)
+ * queue();
+ * unlock(hash_bucket(futex));
+ * schedule(); if (!queue_empty())
+ * wake_waiters(futex);
+ * unlock(hash_bucket(futex));
+ */
+
int __read_mostly futex_cmpxchg_enabled;

/*
--
1.8.1.4

2014-01-12 23:32:20

by Davidlohr Bueso

Subject: [PATCH v6 2/5] futex: Larger hash table

From: Davidlohr Bueso <[email protected]>

Currently, the futex global hash table suffers from its fixed, smallish
(by today's standards) size of 256 entries, as well as its lack of NUMA
awareness. Large systems, using many futexes, can be prone to high amounts
of collisions, where these futexes hash to the same bucket and lead to
extra contention on the same hb->lock. Furthermore, cacheline bouncing is a
reality when we have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small systems,
and otherwise uses 256 * ncpus (or larger, as we need to round the number up
to a power of 2). Note that this number of CPUs accounts for all CPUs that can
ever be available in the system, taking into consideration things like
hotplugging. While we do impose extra overhead at bootup by making the hash
table larger, this is a one-time thing, and does not overshadow the benefits
of this patch.
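
As a concrete sanity check of the sizing logic, here is a userspace sketch
(not part of the patch; the 240-CPU count is an assumption based on the
cover letter's machine) that mirrors the computation added to futex_init():

#include <stdio.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned long ncpus = 240;	/* assumed num_possible_cpus() */
	unsigned long futex_hashsize = roundup_pow_of_two(256 * ncpus);

	printf("%lu\n", futex_hashsize);	/* 61440 rounds up to 65536 */
	return 0;
}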

Furthermore, as suggested by tglx, by cache aligning the hash buckets we can
avoid access across cacheline boundaries and also avoid massive cache line
bouncing if multiple cpus are hammering away at different hash buckets which
happen to reside in the same cache line.

Also, similar to other core kernel components (pid, dcache, tcp), by using
alloc_large_system_hash() we benefit from its NUMA awareness and thus the
table is distributed among the nodes instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing -- making the wait
path fail at futex_wait_setup() returning -EWOULDBLOCK for large numbers of
futexes -- we can see the following benefits on an 80-core, 8-socket 1 TB server:

+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
|     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
|     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
|     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
|      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
|      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
|      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
+---------+--------------------+------------------------+-----------------------+-------------------------------+

Cc: Ingo Molnar <[email protected]>
Reviewed-by: Darren Hart <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Reviewed-and-tested-by: Jason Low <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
kernel/futex.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 085f5fa..577481d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
#include <linux/sched/rt.h>
#include <linux/hugetlb.h>
#include <linux/freezer.h>
+#include <linux/bootmem.h>

#include <asm/futex.h>

@@ -70,8 +71,6 @@

int __read_mostly futex_cmpxchg_enabled;

-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
/*
* Futex flags used to encode options to functions and preserve them across
* restarts.
@@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
struct futex_hash_bucket {
spinlock_t lock;
struct plist_head chain;
-};
+} ____cacheline_aligned_in_smp;

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static unsigned long __read_mostly futex_hashsize;
+
+static struct futex_hash_bucket *futex_queues;

/*
* We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & (futex_hashsize - 1)];
}

/*
@@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
static int __init futex_init(void)
{
u32 curval;
- int i;
+ unsigned long i;
+
+#if CONFIG_BASE_SMALL
+ futex_hashsize = 16;
+#else
+ futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+#endif
+
+ futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+ futex_hashsize, 0,
+ futex_hashsize < 256 ? HASH_SMALL : 0,
+ NULL, NULL, futex_hashsize, futex_hashsize);

/*
* This will fail and we want it. Some arch implementations do
@@ -2734,7 +2746,7 @@ static int __init futex_init(void)
if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
futex_cmpxchg_enabled = 1;

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i < futex_hashsize; i++) {
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
}
--
1.8.1.4

2014-01-13 10:03:39

by Thomas Gleixner

Subject: Re: [PATCH v6 0/5] futex: Wakeup optimizations

On Sun, 12 Jan 2014, Davidlohr Bueso wrote:

For the whole series:

Reviewed-by: Thomas Gleixner <[email protected]>

2014-01-13 10:05:24

by Ingo Molnar

Subject: Re: [PATCH v6 0/5] futex: Wakeup optimizations


* Thomas Gleixner <[email protected]> wrote:

> On Sun, 12 Jan 2014, Davidlohr Bueso wrote:
>
> For the whole series:
>
> Reviewed-by: Thomas Gleixner <[email protected]>

Thanks Thomas, I'll pick them up into the locking tree.

Ingo

2014-01-13 10:16:34

by Ingo Molnar

Subject: Re: [PATCH 5/5] futex: Silence uninitialized warnings


* Davidlohr Bueso <[email protected]> wrote:

> From: Davidlohr Bueso <[email protected]>
>
> Callers of cmpxchg_futex_value_locked() can trigger the following:
>
> kernel/futex.c: In function ‘futex_lock_pi_atomic’:
> kernel/futex.c:725: warning: ‘curval’ may be used uninitialized in this function
>
> This was initially addressed by commit 7cfdaf38, but others still remain. Silence
> these messages once and for all, as the variables are never actually used
> uninitialized.
>
> [...]

I'm skipping this patch though.

I consider the use of uninitialized_var() dangerous and broken: if for
whatever reason a future change to the code makes the warning trigger
and makes it _true_, then we won't notice it because it's hidden
unconditionally ...

The following alternative measures can be used to make spurious
old-compiler warnings go away:

 - Consider upgrading your compiler.

 - If for whatever reason you can't upgrade your compiler then
   restructure the code so that the flow of logic is more apparent
   even to older GCC versions. (Chances are that the flow will be more
   readable to humans too, so it's a win-win!)

 - If you think the flow is exactly perfect and (older) GCC which you
   cannot upgrade is still being silly, then initialize the variable
   to zero. On a new compiler this won't mean a thing because GCC will
   notice the superfluous initialization and will optimize it out -
   and it's a lot safer than just shutting the warning up forever.
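
Applied to this patch, that last option would amount to the following (a
sketch of the alternative, not something that was actually merged):

- u32 uval, newval, curval, vpid = task_pid_vnr(task);
+ u32 uval, newval, curval = 0, vpid = task_pid_vnr(task);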

Thanks,

Ingo

Subject: [tip:core/locking] futexes: Increase hash table size for better performance

Commit-ID: a52b89ebb6d4499be38780db8d176c5d3a6fbc17
Gitweb: http://git.kernel.org/tip/a52b89ebb6d4499be38780db8d176c5d3a6fbc17
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Sun, 12 Jan 2014 15:31:23 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 13 Jan 2014 11:45:18 +0100

futexes: Increase hash table size for better performance

Currently, the futex global hash table suffers from its fixed,
smallish (for today's standards) size of 256 entries, as well as
its lack of NUMA awareness. Large systems, using many futexes,
can be prone to high amounts of collisions, where these futexes
hash to the same bucket and lead to extra contention on the same
hb->lock. Furthermore, cacheline bouncing is a reality when we
have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.

This patch keeps the current static size of 16 entries for small
systems, and otherwise uses 256 * ncpus (or larger, as we need to
round the number up to a power of 2). Note that this number of CPUs
accounts for all CPUs that can ever be available in the system,
taking into consideration things like hotplugging. While we do
impose extra overhead at bootup by making the hash table larger,
this is a one-time thing, and does not overshadow the benefits of
this patch.

Furthermore, as suggested by tglx, by cache aligning the hash
buckets we can avoid access across cacheline boundaries and also
avoid massive cache line bouncing if multiple cpus are hammering
away at different hash buckets which happen to reside in the
same cache line.

Also, similar to other core kernel components (pid, dcache,
tcp), by using alloc_large_system_hash() we benefit from its
NUMA awareness and thus the table is distributed among the nodes
instead of in a single one.

For a custom microbenchmark that pounds on the uaddr hashing --
making the wait path fail at futex_wait_setup() returning
-EWOULDBLOCK for large numbers of futexes -- we can see the
following benefits on an 80-core, 8-socket 1 TB server:

+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
|     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
|     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
|     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
|      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
|      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
|      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
+---------+--------------------+------------------------+-----------------------+-------------------------------+

Reviewed-by: Darren Hart <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Reviewed-and-tested-by: Jason Low <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/futex.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 085f5fa..577481d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -63,6 +63,7 @@
#include <linux/sched/rt.h>
#include <linux/hugetlb.h>
#include <linux/freezer.h>
+#include <linux/bootmem.h>

#include <asm/futex.h>

@@ -70,8 +71,6 @@

int __read_mostly futex_cmpxchg_enabled;

-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
-
/*
* Futex flags used to encode options to functions and preserve them across
* restarts.
@@ -149,9 +148,11 @@ static const struct futex_q futex_q_init = {
struct futex_hash_bucket {
spinlock_t lock;
struct plist_head chain;
-};
+} ____cacheline_aligned_in_smp;

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static unsigned long __read_mostly futex_hashsize;
+
+static struct futex_hash_bucket *futex_queues;

/*
* We hash on the keys returned from get_futex_key (see below).
@@ -161,7 +162,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & (futex_hashsize - 1)];
}

/*
@@ -2719,7 +2720,18 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
static int __init futex_init(void)
{
u32 curval;
- int i;
+ unsigned long i;
+
+#if CONFIG_BASE_SMALL
+ futex_hashsize = 16;
+#else
+ futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
+#endif
+
+ futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
+ futex_hashsize, 0,
+ futex_hashsize < 256 ? HASH_SMALL : 0,
+ NULL, NULL, futex_hashsize, futex_hashsize);

/*
* This will fail and we want it. Some arch implementations do
@@ -2734,7 +2746,7 @@ static int __init futex_init(void)
if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
futex_cmpxchg_enabled = 1;

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i < futex_hashsize; i++) {
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
}

Subject: [tip:core/locking] futexes: Clean up various details

Commit-ID: 0d00c7b20c7716ce08399570ea48813ecf001aa8
Gitweb: http://git.kernel.org/tip/0d00c7b20c7716ce08399570ea48813ecf001aa8
Author: Jason Low <[email protected]>
AuthorDate: Sun, 12 Jan 2014 15:31:22 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 13 Jan 2014 11:45:17 +0100

futexes: Clean up various details

- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().

Reviewed-by: Darren Hart <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jason Low <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/futex.c | 39 ++++++++++++---------------------------
1 file changed, 12 insertions(+), 27 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f6ff019..085f5fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -598,13 +598,10 @@ lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
{
struct futex_pi_state *pi_state = NULL;
struct futex_q *this, *next;
- struct plist_head *head;
struct task_struct *p;
pid_t pid = uval & FUTEX_TID_MASK;

- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex(&this->key, key)) {
/*
* Another waiter already exists - bump up
@@ -986,7 +983,6 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
int ret;

@@ -999,9 +995,8 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)

hb = hash_futex(&key);
spin_lock(&hb->lock);
- head = &hb->chain;

- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (match_futex (&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1034,7 +1029,6 @@ futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
{
union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head;
struct futex_q *this, *next;
int ret, op_ret;

@@ -1082,9 +1076,7 @@ retry_private:
goto retry;
}

- head = &hb1->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (match_futex (&this->key, &key1)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1097,10 +1089,8 @@ retry_private:
}

if (op_ret > 0) {
- head = &hb2->chain;
-
op_ret = 0;
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb2->chain, list) {
if (match_futex (&this->key, &key2)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
@@ -1270,7 +1260,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
int drop_count = 0, task_count = 0, ret;
struct futex_pi_state *pi_state = NULL;
struct futex_hash_bucket *hb1, *hb2;
- struct plist_head *head1;
struct futex_q *this, *next;
u32 curval2;

@@ -1393,8 +1382,7 @@ retry_private:
}
}

- head1 = &hb1->chain;
- plist_for_each_entry_safe(this, next, head1, list) {
+ plist_for_each_entry_safe(this, next, &hb1->chain, list) {
if (task_count - nr_wake >= nr_requeue)
break;

@@ -1494,7 +1482,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
}

static inline void
-queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
+queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
spin_unlock(&hb->lock);
@@ -1867,7 +1855,7 @@ retry_private:
ret = get_futex_value_locked(&uval, uaddr);

if (ret) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);

ret = get_user(uval, uaddr);
if (ret)
@@ -1881,7 +1869,7 @@ retry_private:
}

if (uval != val) {
- queue_unlock(q, *hb);
+ queue_unlock(*hb);
ret = -EWOULDBLOCK;
}

@@ -2029,7 +2017,7 @@ retry_private:
* Task is exiting and we just wait for the
* exit to complete.
*/
- queue_unlock(&q, hb);
+ queue_unlock(hb);
put_futex_key(&q.key);
cond_resched();
goto retry;
@@ -2081,7 +2069,7 @@ retry_private:
goto out_put_key;

out_unlock_put_key:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

out_put_key:
put_futex_key(&q.key);
@@ -2091,7 +2079,7 @@ out:
return ret != -EINTR ? ret : -ERESTARTNOINTR;

uaddr_faulted:
- queue_unlock(&q, hb);
+ queue_unlock(hb);

ret = fault_in_user_writeable(uaddr);
if (ret)
@@ -2113,7 +2101,6 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
- struct plist_head *head;
union futex_key key = FUTEX_KEY_INIT;
u32 uval, vpid = task_pid_vnr(current);
int ret;
@@ -2153,9 +2140,7 @@ retry:
* Ok, other tasks may need to be woken up - check waiters
* and do the wakeup if necessary:
*/
- head = &hb->chain;
-
- plist_for_each_entry_safe(this, next, head, list) {
+ plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (!match_futex (&this->key, &key))
continue;
ret = wake_futex_pi(uaddr, uval, this);

Subject: [tip:core/locking] futexes: Avoid taking the hb->lock if there's nothing to wake up

Commit-ID: b0c29f79ecea0b6fbcefc999e70f2843ae8306db
Gitweb: http://git.kernel.org/tip/b0c29f79ecea0b6fbcefc999e70f2843ae8306db
Author: Davidlohr Bueso <[email protected]>
AuthorDate: Sun, 12 Jan 2014 15:31:25 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 13 Jan 2014 11:45:21 +0100

futexes: Avoid taking the hb->lock if there's nothing to wake up

In futex_wake() there is clearly no point in taking the hb->lock
if we know beforehand that there are no tasks to be woken. While
the hash bucket's plist head is a cheap way of knowing this, we
cannot rely 100% on it as there is a racy window between the
futex_wait call and when the task is actually added to the
plist. To this end, we couple it with the spinlock check, as
tasks trying to enter the critical region are most likely
potential waiters that will be added to the plist, thus
preventing tasks from sleeping forever if wakers don't
acknowledge all possible waiters.

Furthermore, the futex ordering guarantees are preserved,
ensuring that waiters either observe the changed user space
value before blocking or are woken by a concurrent waker. For
wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have an implicit mb
in atomic_inc(), we explicitly add them through a new
futex_get_mm() function. For waiters we rely on the fact that
spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.

For more details please refer to the updated comments in the
code and related discussion:

https://lkml.org/lkml/2013/11/26/556

Special thanks to tglx for careful review and feedback.

Suggested-by: Linus Torvalds <[email protected]>
Reviewed-by: Darren Hart <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/futex.c | 117 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 92 insertions(+), 25 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index fcc6850..30971b5 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -75,17 +75,20 @@
* The waiter reads the futex value in user space and calls
* futex_wait(). This function computes the hash bucket and acquires
* the hash bucket lock. After that it reads the futex user space value
- * again and verifies that the data has not changed. If it has not
- * changed it enqueues itself into the hash bucket, releases the hash
- * bucket lock and schedules.
+ * again and verifies that the data has not changed. If it has not changed
+ * it enqueues itself into the hash bucket, releases the hash bucket lock
+ * and schedules.
*
* The waker side modifies the user space value of the futex and calls
- * futex_wake(). This functions computes the hash bucket and acquires
- * the hash bucket lock. Then it looks for waiters on that futex in the
- * hash bucket and wakes them.
+ * futex_wake(). This function computes the hash bucket and acquires the
+ * hash bucket lock. Then it looks for waiters on that futex in the hash
+ * bucket and wakes them.
*
- * Note that the spin_lock serializes waiters and wakers, so that the
- * following scenario is avoided:
+ * In futex wake up scenarios where no tasks are blocked on a futex, taking
+ * the hb spinlock can be avoided and simply return. In order for this
+ * optimization to work, ordering guarantees must exist so that the waiter
+ * being added to the list is acknowledged when the list is concurrently being
+ * checked by the waker, avoiding scenarios like the following:
*
* CPU 0 CPU 1
* val = *futex;
@@ -106,24 +109,52 @@
* This would cause the waiter on CPU 0 to wait forever because it
* missed the transition of the user space value from val to newval
* and the waker did not find the waiter in the hash bucket queue.
- * The spinlock serializes that:
*
- * CPU 0 CPU 1
+ * The correct serialization ensures that a waiter either observes
+ * the changed user space value before blocking or is woken by a
+ * concurrent waker:
+ *
+ * CPU 0 CPU 1
* val = *futex;
* sys_futex(WAIT, futex, val);
* futex_wait(futex, val);
- * lock(hash_bucket(futex));
- * uval = *futex;
- * *futex = newval;
- * sys_futex(WAKE, futex);
- * futex_wake(futex);
- * lock(hash_bucket(futex));
+ *
+ * waiters++;
+ * mb(); (A) <-- paired with -.
+ * |
+ * lock(hash_bucket(futex)); |
+ * |
+ * uval = *futex; |
+ * | *futex = newval;
+ * | sys_futex(WAKE, futex);
+ * | futex_wake(futex);
+ * |
+ * `-------> mb(); (B)
* if (uval == val)
- * queue();
+ * queue();
* unlock(hash_bucket(futex));
- * schedule(); if (!queue_empty())
- * wake_waiters(futex);
- * unlock(hash_bucket(futex));
+ * schedule(); if (waiters)
+ * lock(hash_bucket(futex));
+ * wake_waiters(futex);
+ * unlock(hash_bucket(futex));
+ *
+ * Where (A) orders the waiters increment and the futex value read -- this
+ * is guaranteed by the head counter in the hb spinlock; and where (B)
+ * orders the write to futex and the waiters read -- this is done by the
+ * barriers in get_futex_key_refs(), through either ihold or atomic_inc,
+ * depending on the futex type.
+ *
+ * This yields the following case (where X:=waiters, Y:=futex):
+ *
+ * X = Y = 0
+ *
+ * w[X]=1 w[Y]=1
+ * MB MB
+ * r[Y]=y r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible; which translates back into
+ * the guarantee that we cannot both miss the futex variable change and the
+ * enqueue.
*/

int __read_mostly futex_cmpxchg_enabled;
@@ -211,6 +242,36 @@ static unsigned long __read_mostly futex_hashsize;

static struct futex_hash_bucket *futex_queues;

+static inline void futex_get_mm(union futex_key *key)
+{
+ atomic_inc(&key->private.mm->mm_count);
+ /*
+ * Ensure futex_get_mm() implies a full barrier such that
+ * get_futex_key() implies a full barrier. This is relied upon
+ * as full barrier (B), see the ordering comment above.
+ */
+ smp_mb__after_atomic_inc();
+}
+
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+ /*
+ * Tasks trying to enter the critical region are most likely
+ * potential waiters that will be added to the plist. Ensure
+ * that wakers won't miss to-be-slept tasks in the window between
+ * the wait call and the actual plist_add.
+ */
+ if (spin_is_locked(&hb->lock))
+ return true;
+ smp_rmb(); /* Make sure we check the lock state first */
+
+ return !plist_head_empty(&hb->chain);
+#else
+ return true;
+#endif
+}
+
/*
* We hash on the keys returned from get_futex_key (see below).
*/
@@ -245,10 +306,10 @@ static void get_futex_key_refs(union futex_key *key)

switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
- ihold(key->shared.inode);
+ ihold(key->shared.inode); /* implies MB (B) */
break;
case FUT_OFF_MMSHARED:
- atomic_inc(&key->private.mm->mm_count);
+ futex_get_mm(key); /* implies MB (B) */
break;
}
}
@@ -322,7 +383,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
if (!fshared) {
key->private.mm = mm;
key->private.address = address;
- get_futex_key_refs(key);
+ get_futex_key_refs(key); /* implies MB (B) */
return 0;
}

@@ -429,7 +490,7 @@ again:
key->shared.pgoff = basepage_index(page);
}

- get_futex_key_refs(key);
+ get_futex_key_refs(key); /* implies MB (B) */

out:
unlock_page(page_head);
@@ -1052,6 +1113,11 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
goto out;

hb = hash_futex(&key);
+
+ /* Make sure we really have tasks to wakeup */
+ if (!hb_waiters_pending(hb))
+ goto out_put_key;
+
spin_lock(&hb->lock);

plist_for_each_entry_safe(this, next, &hb->chain, list) {
@@ -1072,6 +1138,7 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
}

spin_unlock(&hb->lock);
+out_put_key:
put_futex_key(&key);
out:
return ret;
@@ -1535,7 +1602,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
hb = hash_futex(&q->key);
q->lock_ptr = &hb->lock;

- spin_lock(&hb->lock);
+ spin_lock(&hb->lock); /* implies MB (A) */
return hb;
}

Subject: [tip:core/locking] futexes: Document multiprocessor ordering guarantees

Commit-ID: 99b60ce69734dfeda58c6184a326b9475ce1dba3
Gitweb: http://git.kernel.org/tip/99b60ce69734dfeda58c6184a326b9475ce1dba3
Author: Thomas Gleixner <[email protected]>
AuthorDate: Sun, 12 Jan 2014 15:31:24 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 13 Jan 2014 11:45:19 +0100

futexes: Document multiprocessor ordering guarantees

That's essential, if you want to hack on futexes.

Reviewed-by: Darren Hart <[email protected]>
Reviewed-by: Peter Zijlstra <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Scott Norton <[email protected]>
Cc: Tom Vaden <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/futex.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 577481d..fcc6850 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -69,6 +69,63 @@

#include "locking/rtmutex_common.h"

+/*
+ * Basic futex operation and ordering guarantees:
+ *
+ * The waiter reads the futex value in user space and calls
+ * futex_wait(). This function computes the hash bucket and acquires
+ * the hash bucket lock. After that it reads the futex user space value
+ * again and verifies that the data has not changed. If it has not
+ * changed it enqueues itself into the hash bucket, releases the hash
+ * bucket lock and schedules.
+ *
+ * The waker side modifies the user space value of the futex and calls
+ * futex_wake(). This functions computes the hash bucket and acquires
+ * the hash bucket lock. Then it looks for waiters on that futex in the
+ * hash bucket and wakes them.
+ *
+ * Note that the spin_lock serializes waiters and wakers, so that the
+ * following scenario is avoided:
+ *
+ * CPU 0 CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ * futex_wait(futex, val);
+ * uval = *futex;
+ * *futex = newval;
+ * sys_futex(WAKE, futex);
+ * futex_wake(futex);
+ * if (queue_empty())
+ * return;
+ * if (uval == val)
+ * lock(hash_bucket(futex));
+ * queue();
+ * unlock(hash_bucket(futex));
+ * schedule();
+ *
+ * This would cause the waiter on CPU 0 to wait forever because it
+ * missed the transition of the user space value from val to newval
+ * and the waker did not find the waiter in the hash bucket queue.
+ * The spinlock serializes that:
+ *
+ * CPU 0 CPU 1
+ * val = *futex;
+ * sys_futex(WAIT, futex, val);
+ * futex_wait(futex, val);
+ * lock(hash_bucket(futex));
+ * uval = *futex;
+ * *futex = newval;
+ * sys_futex(WAKE, futex);
+ * futex_wake(futex);
+ * lock(hash_bucket(futex));
+ * if (uval == val)
+ * queue();
+ * unlock(hash_bucket(futex));
+ * schedule(); if (!queue_empty())
+ * wake_waiters(futex);
+ * unlock(hash_bucket(futex));
+ */
+
int __read_mostly futex_cmpxchg_enabled;

/*

2014-01-13 17:29:23

by Davidlohr Bueso

Subject: Re: [PATCH 5/5] futex: Silence uninitialized warnings

On Mon, 2014-01-13 at 11:16 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <[email protected]> wrote:
>
> > From: Davidlohr Bueso <[email protected]>
> >
> > Callers of cmpxchg_futex_value_locked() can trigger the following:
> >
> > kernel/futex.c: In function ‘futex_lock_pi_atomic’:
> > kernel/futex.c:725: warning: ‘curval’ may be used uninitialized in this function
> >
> > This was initially addressed by commit 7cfdaf38, but others still remain. Silence
> > these messages once and for all, as the variables are never actually used
> > uninitialized.
> >
> > [...]
>
> I'm skipping this patch though.
>
> I consider the use of uninitialized_var() dangerous and broken: if for
> whatever reason a future change to the code makes the warning trigger
> and makes it _true_, then we won't notice it because it's hidden
> unconditionally ...

Sure, so we should then get rid of the already existing ones from
7cfdaf38.

>
> The following alternative measures can be used to make spurious
> old-compiler warnings go away:
>
> - Consider upgrading your compiler.
>
> - If for whatever reason you can't upgrade your compiler then
> restructure the code so that the flow of logic is more apparent
> even to older GCC versions. (Chances are that the flow will be more
> readable to humans too, so it's a win-win!)
>
> - If you think the flow is exactly perfect and (older) GCC which you
> cannot upgrade is still being silly, then initialize the variable
> to zero. On a new compiler this won't mean a thing because GCC will
> notice the superfluous initialization and will optimize it out -
> and it's a lot safer than just shutting the warning up forever.

Ok, we could go with this last option then.