2016-04-02 11:11:11

by Thomas Gleixner

Subject: [RFC patch 4/7] futex: Add support for attached futexes

The standard futex mechanism in the Linux kernel uses a global hash to store
transient state. Collisions on that hash can lead to performance degradation
and on real-time enabled kernels even to priority inversions.

To guarantee futexes without collisions on the global kernel hash, we provide
a mechanism to attach to a futex. This creates futex-private state which
avoids hash collisions and, on NUMA systems, cross-node memory accesses.

To utilize this mechanism, each thread has to attach to the futex before
performing any other operations on that futex.

The inner workings are as follows:

Attach:

sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);

If this is the first attach to uaddr then a 'global state' object is
created. This global state contains a futex hash bucket and a futex_q
object which is enqueued into the global hash for reference so subsequent
attachers can find it. Each attacher takes a reference count on the
'global state' object and hashes 'uaddr' into a thread local hash. This
thread local hash is lock free and dynamically expanded to avoid
collisions. Each populated entry in the thread local hash stores 'uaddr'
and a pointer to the 'global state' object.

Futex ops:

sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);

If the attached flag is set, then 'uaddr' is hashed and the thread local
hash is checked to see whether the entry contains 'uaddr'. If not, an error
code is returned. If so, the hash slot number is stored in the futex key
which is used for further operations on the futex. When the hash bucket is
looked up, attached futexes use the slot number to retrieve the
pointer to the 'global state' object and use the embedded hash bucket for
the operation. Non-attached futexes just use the global hash as before.

Detach:

sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);

Detach removes the entry in the thread local hash and decrements the
refcount on the 'global state' object. Once the refcount drops to zero the
'global state' object is removed from the global hash and destroyed.

Thread exit cleans up the thread local hash and the 'global state' objects,
as we already do for other futex-related storage.

The thread local hash and the 'global state' object are allocated on the node
on which the attaching thread runs.

Attached mode works with all futex operations and with both private and shared
futexes. For operations which involve two futexes, e.g. FUTEX_REQUEUE_*, both
futexes have to be either attached or detached (analogous to FUTEX_PRIVATE).
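From user space the attach/operate/detach protocol would look roughly like
the sketch below. This is hypothetical example code, not part of the patch:
the FUTEX_ATTACH, FUTEX_DETACH and FUTEX_ATTACHED values are the ones this
patch adds to the uapi header and do not exist in released kernels, hence
the fallback defines and the tolerance for a failing attach:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* Added by this patch; fall back to the proposed values elsewhere */
#ifndef FUTEX_ATTACH
#define FUTEX_ATTACH	13
#define FUTEX_DETACH	14
#define FUTEX_ATTACHED	512
#endif

static int futex_op(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Per-thread protocol: attach first, then operate, finally detach.
 * On a kernel without this patch the attach/detach calls simply fail
 * with ENOSYS, which this demo tolerates.
 */
static int attached_demo(void)
{
	int futex_word = 0;

	if (futex_op(&futex_word, FUTEX_ATTACH | FUTEX_ATTACHED, 0))
		perror("FUTEX_ATTACH");

	/* Regular ops carry FUTEX_ATTACHED to route via the private state */
	futex_op(&futex_word, FUTEX_WAKE | FUTEX_ATTACHED, 1);

	if (futex_op(&futex_word, FUTEX_DETACH | FUTEX_ATTACHED, 0))
		perror("FUTEX_DETACH");
	return 0;
}
```

Each thread has to run the attach step itself; the slot lives in the task
local cache, so an attach done by one thread does not cover its siblings.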

Why not auto attaching?

Auto attaching has the following problems:

- Memory consumption
- Lifetime issues
- Performance issues due to the necessary allocations

So, no. It must be opt-in and reserved for explicit isolation purposes.

A modified version of 'perf bench futex hash' shows the following results:

# ./perf bench futex hash -s -r 60
# Running 'futex/hash' benchmark:
Run summary [PID 4456]: 144 threads, each operating on 1024 [private ] futexes for 60 secs.

Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

Perf stat analysis for such a run:

Performance counter stats for './perf bench futex hash -s -r 60':

9,400,551,930 node-load-misses
74,404,109 node-loads
11,001,148,922 node-store-misses
574,505,561 node-stores

60.016993846 seconds time elapsed

# Running 'futex/hash' benchmark:
Run summary [PID 3675]: 144 threads, each operating on 1024 [private attached] futexes for 60 secs.

Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

That's a performance increase of 18%.

Perf stat analysis for such a run:

Performance counter stats for './perf bench futex hash -s -r 60 -a':

2,609,871,190 node-load-misses
9,671,312 node-loads
2,346,187,670 node-store-misses
144,453,796 node-stores

60.020374008 seconds time elapsed

Now someone might argue that this is an unrealistic test case. Sure, the number
of futexes here exceeds the size of the global hash:

144 * 1024 > 144 * 256

But in real-world applications with a far smaller number of futexes we have
observed hash collisions on both real-time and non-real-time systems, which
cause performance or determinism issues. Surely we could increase the hash
size, but that only reduces the probability of a collision and does not
eliminate the chance of hitting a bad spot.

We wrote a - too ugly to share - collision scanner which runs on each node of
a NUMA system. It scans through malloc'ed and mmap'ed memory and invokes
thousands of consecutive (failing, like in the hash bench test) futex_wait
operations on each address, observing the consumed time. In case of a
collision the time jumps massively due to the atomic_inc/dec and spinlock
operations on the hash bucket and the resulting cache line bouncing.

Once it detects a collision it stays there for 60 seconds and then switches
to attached mode. This immediately restores the performance of the involved
scanners. A collision between two threads, and therefore two futexes, on
different nodes results in a performance degradation of 30 percent or more in
this test. On real-time enabled systems even a one-time collision can have
fatal consequences due to possibly unbounded priority inversions.

An interesting observation is that the time to find a collision varies from a
few seconds (20 addresses scanned) to a couple of hours. That variation is due
to the different mm_struct addresses which are mixed into the global hash and
different malloc/mmap addresses which are handed out on each run.

A few smallish issues:

- we must limit the number of allocated 'global state' objects. That must
be a per-process limit, but I have no idea where best to put it and how
to let the sysadmin tune it.

- this needs massive review and exposure to fuzzers as I don't want to end
up with another exploit disaster in that code.

[ tglx: Wrote changelog ]

Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/futex.h | 12 -
include/linux/sched.h | 2
include/uapi/linux/futex.h | 6
kernel/fork.c | 7
kernel/futex.c | 444 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 458 insertions(+), 13 deletions(-)

--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -47,6 +47,8 @@ union futex_key {
unsigned long word;
void *ptr;
int offset;
+ unsigned int slot;
+ bool attached;
} both;
};

@@ -55,17 +57,15 @@ union futex_key {
#ifdef CONFIG_FUTEX
extern void exit_robust_list(struct task_struct *curr);
extern void exit_pi_state_list(struct task_struct *curr);
+extern void exit_futex_task_cache(struct task_struct *curr);
#ifdef CONFIG_HAVE_FUTEX_CMPXCHG
#define futex_cmpxchg_enabled 1
#else
extern int futex_cmpxchg_enabled;
#endif
#else
-static inline void exit_robust_list(struct task_struct *curr)
-{
-}
-static inline void exit_pi_state_list(struct task_struct *curr)
-{
-}
+static inline void exit_robust_list(struct task_struct *curr) { }
+static inline void exit_pi_state_list(struct task_struct *curr) { }
+static inline void exit_futex_task_cache(struct task_struct *curr) { }
#endif
#endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,6 +128,7 @@ struct sched_attr {
};

struct futex_pi_state;
+struct futex_cache;
struct robust_list_head;
struct bio_list;
struct fs_struct;
@@ -1702,6 +1703,7 @@ struct task_struct {
#endif
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
+ struct futex_cache *futex_cache;
#endif
#ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,10 +20,14 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_ATTACH 13
+#define FUTEX_DETACH 14

#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
-#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
+#define FUTEX_ATTACHED 512
+#define FUTEX_CMD_MASK ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME | \
+ FUTEX_ATTACHED)

#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -880,6 +880,12 @@ void mm_release(struct task_struct *tsk,
#endif
if (unlikely(!list_empty(&tsk->pi_state_list)))
exit_pi_state_list(tsk);
+
+ /* Must be last as the other cleanups might deref it */
+ if (unlikely(tsk->futex_cache)) {
+ exit_futex_task_cache(tsk);
+ tsk->futex_cache = NULL;
+ }
#endif

uprobe_free_utask(tsk);
@@ -1489,6 +1495,7 @@ static struct task_struct *copy_process(
#endif
INIT_LIST_HEAD(&p->pi_state_list);
p->pi_state_cache = NULL;
+ p->futex_cache = NULL;
#endif
/*
* sigaltstack should be cleared when sharing the same VM
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
* Copyright (C) IBM Corporation, 2009
* Thanks to Thomas Gleixner for conceptual design and careful reviews.
*
+ * Attached futex support by Sebastian Siewior and Thomas Gleixner
+ * Copyright (C) Linutronix GmbH, 2016
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -182,6 +185,7 @@ int __read_mostly futex_cmpxchg_enabled;
#define FLAGS_SHARED 0x01
#define FLAGS_CLOCKRT 0x02
#define FLAGS_HAS_TIMEOUT 0x04
+#define FLAGS_ATTACHED 0x08

/*
* Priority Inheritance state:
@@ -255,6 +259,30 @@ struct futex_hash_bucket {
struct plist_head chain;
} ____cacheline_aligned_in_smp;

+struct futex_state {
+ struct futex_hash_bucket hb;
+ struct futex_q q;
+ struct futex_hash_bucket *global_hb;
+ int refcount;
+};
+
+/* Task cache sizes must be power of 2! */
+#define TASK_CACHE_BASE_SIZE 16
+/* FIXME: Make this a tunable */
+#define TASK_CACHE_MAX_SIZE 4096
+
+struct futex_cache_slot {
+ u32 __user *uaddr;
+ struct futex_state *fs;
+};
+
+struct futex_cache {
+ unsigned int cache_size;
+ unsigned int hash_prime;
+ unsigned long cache_map[BITS_TO_LONGS(TASK_CACHE_MAX_SIZE)];
+ struct futex_cache_slot slots[];
+};
+
/*
* The base of the bucket array and its size are always used together
* (after initialization only in hash_futex()), so ensure that they
@@ -403,13 +431,13 @@ static inline void hb_remove_q(struct fu
}

/**
- * hash_futex - Return the hash bucket in the global hash
+ * hash_global_futex - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
* corresponding hash bucket in the global hash.
*/
-static struct futex_hash_bucket *hash_futex(union futex_key *key)
+static struct futex_hash_bucket *hash_global_futex(union futex_key *key)
{
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
@@ -417,6 +445,38 @@ static struct futex_hash_bucket *hash_fu
return &futex_queues[hash & (futex_hashsize - 1)];
}

+/**
+ * hash_local_futex - Return the hash bucket in the task local cache
+ * @uaddr: The user space address of the futex
+ * @prime: The prime number for the modulo operation
+ *
+ * That's a primitive hash function, but it turned out to be the most
+ * efficient one for the task local cache as we don't have anything to
+ * mix in like we have for the global hash.
+ */
+static inline unsigned int
+hash_local_futex(void __user *uaddr, unsigned int prime)
+{
+ return ((unsigned long) uaddr) % prime;
+}
+
+/**
+ * hash_futex - Get the hash bucket for a futex
+ *
+ * Returns either the local or the global hash bucket which fits the key.
+ *
+ * In case of an attached futex, we already verified that the cache and the
+ * slot exists, so we can unconditionally dereference it.
+ */
+static struct futex_hash_bucket *hash_futex(union futex_key *key)
+{
+ struct futex_cache *tc = current->futex_cache;
+
+ if (!key->both.attached)
+ return hash_global_futex(key);
+
+ return &tc->slots[key->both.slot].fs->hb;
+}

/**
* match_futex - Check whether to futex keys are equal
@@ -430,22 +490,107 @@ static inline int match_futex(union fute
return (key1 && key2
&& key1->both.word == key2->both.word
&& key1->both.ptr == key2->both.ptr
- && key1->both.offset == key2->both.offset);
+ && key1->both.offset == key2->both.offset
+ && key1->both.attached == key2->both.attached);
+}
+
+/**
+ * futex_detach_task - Detach task from global state
+ * @slot: Slot number in the task local cache
+ *
+ * If the global state refcount drops to zero, the global state is destroyed.
+ */
+static void futex_detach_task(int slot)
+{
+ struct futex_cache *tc = current->futex_cache;
+ struct futex_state *fs = tc->slots[slot].fs;
+ struct futex_hash_bucket *hb = fs->global_hb;
+ struct futex_q *q = &fs->q;
+
+ /* Remove it from the task local cache */
+ __clear_bit(slot, tc->cache_map);
+ tc->slots[slot].uaddr = NULL;
+ tc->slots[slot].fs = NULL;
+
+ /*
+ * Lock the global hash bucket. Decrement global state refcount. If 0
+ * remove it from the global hash and free it.
+ */
+ spin_lock(&hb->lock);
+ if (--fs->refcount == 0)
+ hb_remove_q(q, hb);
+ else
+ fs = NULL;
+ spin_unlock(&hb->lock);
+ kfree(fs);
+}
+
+/**
+ * futex_attach_task - Attach current to a global state
+ * @fs: Pointer to global state
+ * @uaddr: User space address of the futex
+ * @slot: Hash slot to reference @fs in current
+ *
+ * Take a refcount on the global state and store the pointer to it in the
+ * given @slot of the current tasks futex cache along with @uaddr. Mark the
+ * slot as occupied.
+ *
+ * Must be called with fs->global_hb->lock held
+ */
+static void
+futex_attach_task(struct futex_state *fs, u32 __user *uaddr, int slot)
+{
+ struct futex_cache *tc = current->futex_cache;
+
+ fs->refcount++;
+ tc->slots[slot].fs = fs;
+ tc->slots[slot].uaddr = uaddr;
+ __set_bit(slot, tc->cache_map);
+}
+
+/**
+ * futex_queue_state - Queue a futex state object in the global hash
+ * @fs: Pointer to the futex state object
+ * @hb: Pointer to the hash bucket
+ *
+ * Must be called with hb->lock held
+ */
+static void
+futex_queue_state(struct futex_state *fs, struct futex_hash_bucket *hb)
+{
+ int prio = NICE_TO_PRIO(MIN_NICE);
+
+ fs->global_hb = hb;
+ fs->q.lock_ptr = &hb->lock;
+ hb_insert_q(&fs->q, hb, prio);
}

/**
* futex_key_init - Initialize a futex key
* @key: Pointer to the key to initialize
* @uaddr: User space address of the futex
- * @flags: Flags to check for futex mode. Not yet used
+ * @flags: Flags to check for attached mode
*
- * Returns: @uaddr
+ * Returns:
+ * @uaddr or NULL, if no match in the task local cache for attached mode
*/
static u32 __user *futex_key_init(union futex_key *key, u32 __user *uaddr,
unsigned int flags)
{
+ struct futex_cache *tc = current->futex_cache;
+ int slot;
+
*key = FUTEX_KEY_INIT;
- return uaddr;
+ if (!(flags & FLAGS_ATTACHED))
+ return uaddr;
+
+ if (!tc)
+ return NULL;
+
+ slot = hash_local_futex(uaddr, tc->hash_prime);
+ key->both.attached = true;
+ key->both.slot = slot;
+ return tc->slots[slot].uaddr == uaddr ? uaddr : NULL;
}

/*
@@ -3216,6 +3361,286 @@ void exit_robust_list(struct task_struct
curr, pip);
}

+/**
+ * exit_futex_task_cache -- Cleanup the task cache
+ *
+ * Called when the task exits.
+ */
+void exit_futex_task_cache(struct task_struct *tsk)
+{
+ struct futex_cache *tc = tsk->futex_cache;
+ unsigned long *map = tc->cache_map;
+ unsigned int slot, size = tc->cache_size;
+
+ slot = find_first_bit(map, size);
+ for (; slot < size; slot = find_next_bit(map, size, slot + 1))
+ futex_detach_task(slot);
+ kfree(tc);
+}
+
+static unsigned int hash_prime(unsigned int size)
+{
+ switch(size) {
+ case 16:
+ default: return 13;
+ case 32: return 31;
+ case 64: return 61;
+ case 128: return 127;
+ case 256: return 251;
+ case 512: return 509;
+ case 1024: return 1021;
+ case 2048: return 2039;
+ case 4096: return 4093;
+ }
+}
+
+static struct futex_cache *futex_alloc_cache(int cache_size)
+{
+ struct futex_cache *tc;
+ size_t size;
+
+ /* Allocate a new task cache */
+ size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
+ tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
+ if (tc) {
+ tc->hash_prime = hash_prime(cache_size);
+ tc->cache_size = cache_size;
+ }
+ return tc;
+}
+
+static int
+futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
+{
+ unsigned long *newmap = tcnew->cache_map;
+ unsigned int prime = tcnew->hash_prime;
+ unsigned long *map = tc->cache_map;
+ unsigned int size = tc->cache_size;
+ unsigned int slot, newslot;
+
+ slot = find_first_bit(map, size);
+ for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
+ newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
+ /*
+ * Paranoia. Rehashing to a larger cache should not result in
+ * collisions which did not exist in the small one.
+ */
+ if (__test_and_set_bit(newslot, newmap))
+ return -ENOSPC;
+ /* Copy uaddr and futex state pointer */
+ tcnew->slots[newslot] = tc->slots[slot];
+ }
+ return 0;
+}
+
+/**
+ * futex_get_task_cache_slot - Get a slot in the tasks local cache
+ *
+ * If the cache is not yet available it's allocated. If the existing cache is
+ * too small the cache is extended.
+ *
+ * Returns a valid slot or an error code
+ */
+static int futex_get_task_cache_slot(u32 __user *uaddr)
+{
+ struct futex_cache *tcnew, *tc = current->futex_cache;
+ unsigned int cache_size;
+ int slot;
+
+ /* First caller allocates the initial cache */
+ if (!tc) {
+ tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
+ if (!tc)
+ return -ENOMEM;
+ current->futex_cache = tc;
+ slot = hash_local_futex(uaddr, tc->hash_prime);
+ return slot;
+ }
+
+ slot = hash_local_futex(uaddr, tc->hash_prime);
+
+ /* Check whether the slot is populated already */
+ if (!test_bit(slot, tc->cache_map))
+ return slot;
+
+ /* Was this futex attached already ? */
+ if (tc->slots[slot].uaddr == uaddr)
+ return -EEXIST;
+
+ cache_size = tc->cache_size;
+retry:
+ /* Task has reached max cache size? */
+ if (cache_size >= TASK_CACHE_MAX_SIZE)
+ return -ENOSPC;
+
+ cache_size *= 2;
+ tcnew = futex_alloc_cache(cache_size);
+ if (!tcnew)
+ return -ENOMEM;
+
+ /*
+ * If the rehashing fails or the slot for uaddr is busy after
+ * rehashing, try with a larger cache.
+ */
+ slot = hash_local_futex(uaddr, tcnew->hash_prime);
+
+ if (futex_rehash_task_cache(tc, tcnew) ||
+ test_bit(slot, tcnew->cache_map)) {
+ kfree(tcnew);
+ goto retry;
+ }
+
+ /* Populate the new cache and return the slot number */
+ current->futex_cache = tcnew;
+ return slot;
+}
+
+/**
+ * futex_create - Create an attached futex object and attach to it
+ * @uaddr: The user space address of the futex
+ * @key: Pointer to a initialized futex key object
+ * @hb: Pointer to the hash bucket in the global hash corresponding
+ * to @key
+ * @slot: Free task cache slot number
+ *
+ * Returns:
+ * Success: 0
+ * Failure: Proper error code
+ * ENOMEM: Out of memory
+ * EEXIST: Global state exists already
+ */
+static int futex_create(u32 __user *uaddr, union futex_key *key,
+ struct futex_hash_bucket *hb, int slot)
+{
+ struct futex_state *fs;
+ struct futex_q *match;
+
+ fs = kzalloc_node(sizeof(*fs), GFP_KERNEL, numa_node_id());
+ if (!fs)
+ return -ENOMEM;
+
+ atomic_set(&fs->hb.waiters, 0);
+ plist_head_init(&fs->hb.chain);
+ spin_lock_init(&fs->hb.lock);
+
+ fs->q = futex_q_init;
+ /* This is the global state object. Set an invalid slot */
+ fs->q.key = *key;
+ fs->q.key.both.slot = ~0U;
+
+ /* Verify again whether global state for this futex exists already */
+ spin_lock(&hb->lock);
+ match = futex_top_waiter(hb, &fs->q.key);
+ if (match) {
+ spin_unlock(&hb->lock);
+ kfree(fs);
+ return -EEXIST;
+ }
+ /*
+ * Queue the new global state in the global hash and attach the task
+ * to it.
+ */
+ futex_queue_state(fs, hb);
+ futex_attach_task(fs, uaddr, slot);
+ spin_unlock(&hb->lock);
+ return slot;
+}
+
+/**
+ * futex_attach - Attach a task to a registered futex
+ * @uaddr: The user space address of the futex
+ * @flags: User supplied flags
+ *
+ * Returns:
+ * Success: 0
+ * Failure: Proper error code
+ * ENOMEM: Out of memory
+ * EINVAL: Invalid @flags or invalid @uaddr
+ * EFAULT: @uaddr access would fault
+ * EEXIST: @uaddr is already attached
+ * ENOSPC: TASK_CACHE_MAX_SIZE reached, no free slots
+ *
+ */
+static int futex_attach(u32 __user *uaddr, unsigned int flags)
+{
+ struct futex_hash_bucket *hb;
+ struct futex_state *fs;
+ struct futex_q *match;
+ union futex_key key;
+ int ret, slot;
+
+ if (!(flags & FLAGS_ATTACHED))
+ return -EINVAL;
+
+ if (futex_key_init(&key, uaddr, flags) != NULL)
+ return -EEXIST;
+
+ key = FUTEX_KEY_INIT;
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ if (ret)
+ return ret;
+
+ /* Get a slot in the task local cache */
+ slot = futex_get_task_cache_slot(uaddr);
+ if (slot < 0) {
+ ret = slot;
+ goto out_put_key;
+ }
+
+ /* Find the global state and attach to it */
+ key.both.attached = true;
+ hb = hash_global_futex(&key);
+
+again:
+ spin_lock(&hb->lock);
+ match = futex_top_waiter(hb, &key);
+ if (!match) {
+ spin_unlock(&hb->lock);
+ ret = futex_create(uaddr, &key, hb, slot);
+ /* We raced against another task creating the global state */
+ if (ret == -EEXIST)
+ goto again;
+ goto out_put_key;
+ }
+
+ /* Attach to it */
+ fs = container_of(match, struct futex_state, q);
+ futex_attach_task(fs, uaddr, slot);
+ spin_unlock(&hb->lock);
+ ret = 0;
+
+out_put_key:
+ put_futex_key(&key);
+ return ret;
+
+}
+
+/**
+ * futex_detach - Detach a task from a registered futex
+ * @uaddr: The cookie which was returned from attach
+ * @flags: User supplied flags
+ *
+ * Returns:
+ * Success: 0
+ * Failure: Proper error code
+ * EINVAL: Invalid @flags or invalid @uaddr
+ */
+static int futex_detach(u32 __user *uaddr, unsigned int flags)
+{
+ union futex_key key;
+
+ if (!(flags & FLAGS_ATTACHED))
+ return -EINVAL;
+
+ /* We look up the slot and verify that it is populated */
+ uaddr = futex_key_init(&key, uaddr, flags);
+ if (!uaddr)
+ return -EINVAL;
+
+ futex_detach_task(key.both.slot);
+ return 0;
+}
+
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
@@ -3232,6 +3657,9 @@ long do_futex(u32 __user *uaddr, int op,
return -ENOSYS;
}

+ if (op & FUTEX_ATTACHED)
+ flags |= FLAGS_ATTACHED;
+
switch (cmd) {
case FUTEX_LOCK_PI:
case FUTEX_UNLOCK_PI:
@@ -3269,6 +3697,10 @@ long do_futex(u32 __user *uaddr, int op,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+ case FUTEX_ATTACH:
+ return futex_attach(uaddr, flags);
+ case FUTEX_DETACH:
+ return futex_detach(uaddr, flags);
}
return -ENOSYS;
}



2016-04-02 16:26:42

by Peter Zijlstra

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> +/**
> + * hash_local_futex - Return the hash bucket in the task local cache
> + * @uaddr: The user space address of the futex
> + * @prime: The prime number for the modulo operation
> + *
> + * That's a primitive hash function, but it turned out to be the most
> + * efficient one for the task local cache as we don't have anything to
> + * mix in like we have for the global hash.
> + */
> +static inline unsigned int
> +hash_local_futex(void __user *uaddr, unsigned int prime)
> +{
> + return ((unsigned long) uaddr) % prime;
> +}

> +static unsigned int hash_prime(unsigned int size)
> +{
> + switch(size) {
> + case 16:
> + default: return 13;
> + case 32: return 31;
> + case 64: return 61;
> + case 128: return 127;
> + case 256: return 251;
> + case 512: return 509;
> + case 1024: return 1021;
> + case 2048: return 2039;
> + case 4096: return 4093;
> + }
> +}

One wonders what's wrong with include/linux/hash.h ? It appears to me
hash_ptr() is pretty much what you want, no?
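For reference, hash_ptr() boils down to a multiplicative hash into 2^bits
slots. A userspace sketch of the idea follows; the constant is the one later
kernels use in include/linux/hash.h (the 2016-era kernel derived a different
golden-ratio constant, so treat the exact value as illustrative):

```c
#include <stdint.h>

/*
 * 2^64 divided by the golden ratio, as used by later versions of
 * hash_64() in include/linux/hash.h. Multiplying by it scrambles the
 * input bits; the top @bits of the product select the slot.
 */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Userspace sketch of what hash_ptr(ptr, bits) does in the kernel */
static inline unsigned int hash_ptr_sketch(const void *ptr, unsigned int bits)
{
	return (unsigned int)(((uint64_t)(uintptr_t)ptr * GOLDEN_RATIO_64)
			      >> (64 - bits));
}
```

With power-of-2 table sizes and a hash like this, the hash_prime() lookup
table and the modulo operation in the patch become unnecessary, which is
presumably the point of the suggestion.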

2016-04-02 16:29:13

by Peter Zijlstra

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> +/**
> + * futex_detach_task - Detach task from global state
> + * @slot: Slot number in the task local cache
> + *
> + * If the global state refcount drops to zero, the global state is destroyed.
> + */
> +static void futex_detach_task(int slot)
> +{
> + struct futex_cache *tc = current->futex_cache;
> + struct futex_state *fs = tc->slots[slot].fs;
> + struct futex_hash_bucket *hb = fs->global_hb;
> + struct futex_q *q = &fs->q;
> +
> + /* Remove it from the task local cache */
> + __clear_bit(slot, tc->cache_map);
> + tc->slots[slot].uaddr = NULL;
> + tc->slots[slot].fs = NULL;
> +
> + /*
> + * Lock the global hash bucket. Decrement global state refcount. If 0
> + * remove it from the global hash and free it.
> + */
> + spin_lock(&hb->lock);
> + if (--fs->refcount == 0)
> + hb_remove_q(q, hb);
> + else
> + fs = NULL;
> + spin_unlock(&hb->lock);

So you could play funny games like:

if (atomic_add_unless(&fs->refcount, -1, 1))
return;

spin_lock(&hb->lock);
if (atomic_dec_return(&fs->refcount) == 0)
hb_remove_q(q, hb);
else
fs = NULL;
spin_unlock(&hb->lock);

To avoid taking that lock entirely in the 'fast' path, but I'm not sure
how performance critical this path is.

> + kfree(fs);
> +}

2016-04-02 18:03:10

by Thomas Gleixner

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, 2 Apr 2016, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> > +/**
> > + * hash_local_futex - Return the hash bucket in the task local cache
> > + * @uaddr: The user space address of the futex
> > + * @prime: The prime number for the modulo operation
> > + *
> > + * That's a primitive hash function, but it turned out to be the most
> > + * efficient one for the task local cache as we don't have anything to
> > + * mix in like we have for the global hash.
> > + */
> > +static inline unsigned int
> > +hash_local_futex(void __user *uaddr, unsigned int prime)
> > +{
> > + return ((unsigned long) uaddr) % prime;
> > +}
>
> > +static unsigned int hash_prime(unsigned int size)
> > +{
> > + switch(size) {
> > + case 16:
> > + default: return 13;
> > + case 32: return 31;
> > + case 64: return 61;
> > + case 128: return 127;
> > + case 256: return 251;
> > + case 512: return 509;
> > + case 1024: return 1021;
> > + case 2048: return 2039;
> > + case 4096: return 4093;
> > + }
> > +}
>
> One wonders what's wrong with include/linux/hash.h ? It appears to me
> hash_ptr() is pretty much what you want, no?

Nothing is wrong with that. Just my futex fried brain not being able to see
the obvious:)

Thanks,

tglx

2016-04-02 18:20:01

by Andy Lutomirski

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
[omitted due to some Thunderbird bug, sigh]

What happens if you mix attached and non-attached ops on the same futex?

--Andy

2016-04-02 23:48:13

by Rasmus Villemoes

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, Apr 02 2016, Thomas Gleixner <[email protected]> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.

Hi,

A few minor comments inline below, and a question about the design:

How is an application supposed to handle it when the kernel fails to
achieve the no-collision goal? With any reasonable upper bound on the
size of the local hash table (which of course has to be there, whether
sysctl'ed or not), and regardless of the hashing scheme used, it
seems inevitable that someone is going to get -ENOSPC when trying to
attach. Moreover, since different threads can attach to different sets
of futexes, one thread may successfully attach to a futex while another
fails - the second thread is then permanently prevented from operating
on that futex (?).

Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of this in the change log.

Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On successful FUTEX_ATTACH,
return an index into a small per-task array of struct futex_state*. On
subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
this index somehow (either instead of uaddr, in which case the kernel
array would have to include this in addition to the futex_state pointer,
or by making uaddr actually point to a struct { int *futex_addr; int
attach_idx; }, or...) Then each thread would have to maintain a (futex
address => index) mapping, but that's more or less what the kernel
otherwise has to do.

> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> + switch(size) {
> + case 16:
> + default: return 13;
> + case 32: return 31;
> + case 64: return 61;
> + case 128: return 127;
> + case 256: return 251;
> + case 512: return 509;
> + case 1024: return 1021;
> + case 2048: return 2039;
> + case 4096: return 4093;
> + }
> +}
> +

There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those would also look here, so we don't end up
using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
!= {16,4096}) should do it.

> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> + struct futex_cache *tc;
> + size_t size;
> +
> + /* Allocate a new task cache */
> + size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> + tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> + if (tc) {
> + tc->hash_prime = hash_prime(cache_size);
> + tc->cache_size = cache_size;
> + }
> + return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> + unsigned long *newmap = tcnew->cache_map;
> + unsigned int prime = tcnew->hash_prime;
> + unsigned long *map = tc->cache_map;
> + unsigned int size = tc->cache_size;
> + unsigned int slot, newslot;
> +
> + slot = find_first_bit(map, size);
> + for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> + newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> + /*
> + * Paranoia. Rehashing to a larger cache should not result in
> + * collisions which did not exist in the small one.
> + */

This doesn't sound right when you're doing mod prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13.
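A concrete pair makes the objection tangible (an illustrative sketch, using
4-byte aligned values as futex uaddrs would be): 4 and 128 hash to distinct
slots mod 13 but to the same slot mod 31, so growing the table can introduce
a collision the smaller table did not have.

```c
#include <assert.h>

/* hash_local_futex() reduced to its core: uaddr mod prime */
static unsigned int slot_of(unsigned long uaddr, unsigned int prime)
{
	return uaddr % prime;
}

static void rehash_collision_demo(void)
{
	/* Two 4-byte aligned addresses, as futex uaddrs would be */
	unsigned long a = 4, b = 128;

	/* Distinct slots in the 16-entry cache (prime 13) ... */
	assert(slot_of(a, 13) != slot_of(b, 13));
	/* ... but the same slot after growing to 32 entries (prime 31) */
	assert(slot_of(a, 31) == slot_of(b, 31));
}
```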


> + if (__test_and_set_bit(newslot, newmap))
> + return -ENOSPC;
> + /* Copy uaddr and futex state pointer */
> + tcnew->slots[newslot] = tc->slots[slot];
> + }
> + return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> + struct futex_cache *tcnew, *tc = current->futex_cache;
> + unsigned int cache_size;
> + int slot;
> +
> + /* First caller allocates the initial cache */
> + if (!tc) {
> + tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> + if (!tc)
> + return -ENOMEM;
> + current->futex_cache = tc;
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> + return slot;
> + }
> +
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> + /* Check whether the slot is populated already */
> + if (!test_bit(slot, tc->cache_map))
> + return slot;
> +
> + /* Was this futex attached already ? */
> + if (tc->slots[slot].uaddr == uaddr)
> + return -EEXIST;
> +
> + cache_size = tc->cache_size;
> +retry:
> + /* Task has reached max cache size? */
> + if (cache_size >= TASK_CACHE_MAX_SIZE)
> + return -ENOSPC;
> +
> + cache_size *= 2;
> + tcnew = futex_alloc_cache(cache_size);
> + if (!tcnew)
> + return -ENOMEM;
> +
> + /*
> + * If the rehashing fails or the slot for uaddr is busy after
> + * rehashing, try with a larger cache.
> + */
> + slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> + if (futex_rehash_task_cache(tc, tcnew) ||
> + test_bit(slot, tcnew->cache_map)) {
> + kfree(tcnew);
> + goto retry;
> + }
> +
> + /* Populate the new cache and return the slot number */
> + current->futex_cache = tcnew;
> + return slot;
> +}

I may be misreading it, but this seems to leak the old ->futex_cache?


Rasmus

2016-04-03 09:59:30

by Thomas Gleixner

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, 2 Apr 2016, Andy Lutomirski wrote:

> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
> [omitted due to some Thunderbird bug, sigh]
>
> What happens if you mix attached and non-attached ops on the same futex?

Not much. You might get an error code, sleep forever or the call will just
result in a NOP wasting cpu cycles. That's the same when you mix
shared/private operations on the same futex.

Thanks,

tglx

2016-04-03 10:00:45

by Thomas Gleixner

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sat, 2 Apr 2016, Peter Zijlstra wrote:
> On Sat, Apr 02, 2016 at 11:09:18AM -0000, Thomas Gleixner wrote:
> > + /*
> > + * Lock the global hash bucket. Decrement global state refcount. If 0
> > + * remove it from the global hash and free it.
> > + */
> > + spin_lock(&hb->lock);
> > + if (--fs->refcount == 0)
> > + hb_remove_q(q, hb);
> > + else
> > + fs = NULL;
> > + spin_unlock(&hb->lock);
>
> So you could play funny games like:
>
> if (atomic_add_unless(&fs->recount, -1, 1))
> return;
>
> spin_lock(&hb->lock);
> if (atomic_dec_return(&fs->refcount) == 0)
> hb_remove_q(q, hb);
> else
> fs = NULL;
> spin_unlock(&hb->lock);
>
> To avoid taking that lock entirely in the 'fast' path, but I'm not sure
> how performance critical this path is.

Attach/detach is not really critical. The futex ops are critical and they do
not touch the global hash bucket lock at all.

Thanks,

tglx

2016-04-03 10:06:59

by Thomas Gleixner

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, 3 Apr 2016, Rasmus Villemoes wrote:
> On Sat, Apr 02 2016, Thomas Gleixner <[email protected]> wrote:
>
> > The standard futex mechanism in the Linux kernel uses a global hash to store
> > transient state. Collisions on that hash can lead to performance degradation
> > and on real-time enabled kernels even to priority inversions.
> >
> > To guarantee futexes without collisions on the global kernel hash, we provide
> > a mechanism to attach to a futex. This creates futex private state which
> > avoids hash collisions and on NUMA systems also cross node memory access.
>
> Hi,
>
> A few minor comments inline below, and a question about the design:
>
> How is an application supposed to handle it when the kernel fails to
> achieve the no collision-goal? With any reasonable upper bound on the
> size of the local hash table (which of course has to be there, whether
> sysctl'ed or not), and regardless of the hashing scheme used, it
> seems inevitable that someone is going to get -ENOSPC when trying to
> attach. Moreover, since different threads can attach to different sets
> of futexes, one thread may successfully attach to a futex, while another
> fails - the second thread is then permanently prevented from operating
> on that futex (?).

Yes. There is not much we can do about that except adding it to the
documentation.

> Why not use some sort of tree instead? Or fall back to a traditional
> chained hash table once we're no longer allowed to increase the table
> size? Of course these have worse lookup performance, and maybe failing
> the attach in the rare case is better than penalizing the common case,
> but it would be nice to have some mention of this in the change log.

The lookup performance is critical for the futex ops (wait/wake ....)

> Alternatively [this is not really thought through], maybe one could move
> the decision and the complexity to userspace: On successful FUTEX_ATTACH,
> return an index into a small per-task array of struct futex_state*. On
> subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
> this index somehow (either instead of uaddr, in which case the kernel
> array would have to include this in addition to the futex_state pointer,
> or by making uaddr actually point to a struct { int *futex_addr; int
> attach_idx; }, or...) Then each thread would have to maintain a (futex
> address => index) mapping, but that's more or less what the kernel
> otherwise has to do.

Right. We tried this as a first attempt. That moves the complexity of the
futex-to-index hashing per thread to user space. It can be done simply
enough, though the resulting

> > + case 4096: return 4093;
> > + }
> > +}
> > +
>
> There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
> so that anyone updating those would also look here, so we don't end up
> using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
> != {16,4096}) should do it.

As Peter pointed out there is hash_ptr() already so this will go away.

> > + slot = find_first_bit(map, size);
> > + for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> > + newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> > + /*
> > + * Paranoia. Rehashing to a larger cache should not result in
> > + * collisions which did not exist in the small one.
> > + */
>
> This doesn't sound right when you're doing mod prime hashing; two
> numbers can easily have the same remainder mod 31 while having distinct
> remainders mod 13.

I know, but as far as my extensive testing went I never hit that case.

> > +
> > + /* Populate the new cache and return the slot number */
> > + current->futex_cache = tcnew;
> > + return slot;
> > +}
>
> I may be misreading it, but this seems to leak the old ->futex_cache?

Duh, you are right.

Thanks,

tglx

2016-04-03 11:16:36

by Ingo Molnar

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes


* Thomas Gleixner <[email protected]> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.
>
> To utilize this mechanism each thread has to attach to the futex before any
> other operations on that futex.
>
> The inner workings are as follows:
>
> Attach:
>
> sys_futex(FUTEX_ATTACH | FUTEX_ATTACHED, uaddr, ....);
>
> If this is the first attach to uaddr then a 'global state' object is
> created. This global state contains a futex hash bucket and a futex_q
> object which is enqueued into the global hash for reference so subsequent
> attachers can find it. Each attacher takes a reference count on the
> 'global state' object and hashes 'uaddr' into a thread local hash. This
> thread local hash is lock free and dynamically expanded to avoid
> collisions. Each populated entry in the thread local hash stores 'uaddr'
> and a pointer to the 'global state' object.
>
> Futex ops:
>
> sys_futex(FUTEX_XXX | FUTEX_ATTACHED, uaddr, ....);
>
> If the attached flag is set, then 'uaddr' is hashed and the thread local
> hash is checked whether the hash entry contains 'uaddr'. If no, an error
> code is returned. If yes, the hash slot number is stored in the futex key
> which is used for further operations on the futex. When the hash bucket is
> looked up then attached futexes will use the slot number to retrieve the
> pointer to the 'global state' object and use the embedded hash bucket for
> the operation. Non-attached futexes just use the global hash as before.
>
> Detach:
>
> sys_futex(FUTEX_DETACH | FUTEX_ATTACHED, uaddr, ....);
>
> Detach removes the entry in the thread local hash and decrements the
> refcount on the 'global state' object. Once the refcount drops to zero the
> 'global state' object is removed from the global hash and destroyed.
>
> Thread exit cleans up the thread local hash and the 'global state' objects
> as we do for other futex related storage already.
>
> The thread local hash and the 'global state' object are allocated on the node
> on which the attaching thread runs.
>
> Attached mode works with all futex operations and with both private and shared
> futexes. For operations which involve two futexes, i.e. FUTEX_REQUEUE_* both
> futexes have to be either attached or detached (like FUTEX_PRIVATE).
>
> Why not auto attaching?
>
> Auto attaching has the following problems:
>
> - Memory consumption
> - Life time issues
> - Performance issues due to the necessary allocations

But those are mostly setup-only costs, right?

So I don't think this conclusion is necessarily true, even on smaller systems:

> So, no. It must be opt-in and reserved for explicit isolation purposes.
>
> A modified version of 'perf bench futex hash' shows the following results:

and look at the very measurable performance advantages on a small NUMA system:

Before:

> Averaged 1451441 operations/sec (+- 3.65%), total secs = 60

After:

> Averaged 1709712 operations/sec (+- 4.67%), total secs = 60

> That's a performance increase of 18%.

... and I suspect that on a larger NUMA system the speedup is probably a lot more
pronounced.

Also, the thing is, allocation/deallocation costs are a second order concern IMHO,
because most of the futex's usage is the lock/unlock operations.

So my prediction: in real life large systems will want to have collision-free
futexes most of the time, and they don't want to modify every futex using
application or library. So this is a mostly kernel side system sizing
question/decision, not really a user-side system purpose policy question.

So an ABI distinction and offloading the decision to every single application that
wants to use it and hardcode it into actual application source code via an ABI is
pretty much the _WORST_ way to go about it IMHO...

So how about this: don't add any ABI details, but make futexes auto-attached on
NUMA systems (and obviously PREEMPT_RT systems)?

I.e. make it a build time or boot time decision at most, don't start a messy
'should we use attached futexes or not' decision on the ABI side, which we know
from Linux ABI history won't be answered and utilized very well by applications!

Thanks,

Ingo

2016-04-03 11:30:45

by Linus Torvalds

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <[email protected]> wrote:
>
> So an ABI distinction and offloading the decision to every single application that
> wants to use it and hardcode it into actual application source code via an ABI is
> pretty much the _WORST_ way to go about it IMHO...
>
> So how about this: don't add any ABI details, but make futexes auto-attached on
> NUMA systems (and obviously PREEMPT_RT systems)?

I agree.

Do *not* make this a visible new ABI.

You will find that people will make exactly the wrong choices - either
not using it (because the futex is deep in a standard library!) when
they want to, or using it when they shouldn't (because the futex is
deep in a standard library, and the library writer knows *his* code is
so important that it should get a special faster futex).

So I absolutely detest this approach. It's the wrong way to go about
things. User space does *not* know whether they want to use this or
not, and they *will* be wrong.

So automatically using a local hashtable (for private mutexes - I
think people need to just accept that a shared mutex is more costly)
according to some heuristic is definitely the way to go. And yes, the
heuristic may well be - at least to start - "this is a preempt-RT
system" (for people who clearly care about having predictable
latencies) or "this is actually a multi-node NUMA system, and I have
heaps of memory".

Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Because the *last* thing you want is programmers saying "I'm so
important that I want the special futex". Because every single
programmer thinks they are special and that _their_ code is special. I
know - because I'm special.

Linus

2016-04-03 13:18:58

by Andy Lutomirski

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <[email protected]> wrote:
> On Sat, 2 Apr 2016, Andy Lutomirski wrote:
>
>> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
>> [omitted due to some Thunderbird bug, sigh]
>>
>> What happens if you mix attached and non-attached ops on the same futex?
>
> Not much. You might get an error code, sleep forever or the call will just
> result in a NOP wasting cpu cycles. That's the same when you mix
> shared/private operations on the same futex.

What's the workflow?

Can the creation of an attached futex fail due to memory allocation
problems or any other reason? If so, how does a library make sure it
falls back to a normal futex safely?

Why can't private futexes be attached by default?

--Andy

2016-04-03 15:58:32

by Thomas Gleixner

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, 3 Apr 2016, Andy Lutomirski wrote:

> On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <[email protected]> wrote:
> > On Sat, 2 Apr 2016, Andy Lutomirski wrote:
> >
> >> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
> >> [omitted due to some Thunderbird bug, sigh]
> >>
> >> What happens if you mix attached and non-attached ops on the same futex?
> >
> > Not much. You might get an error code, sleep forever or the call will just
> > result in a NOP wasting cpu cycles. That's the same when you mix
> > shared/private operations on the same futex.
>
> What's the workflow?
>
> Can the creation of an attached futex fail due to memory allocation
> problems or any other reason? If so, how does a library make sure it
> falls back to a normal futex safely?

Well, other operations on futexes can fail as well and the library or the
usage site has to take care of it. It's not that different.

> Why can't private futexes be attached by default?

We _can_ attach any futex - private or not - by default.

Thanks,

tglx

2016-04-03 16:12:06

by Andy Lutomirski

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, Apr 3, 2016 at 8:56 AM, Thomas Gleixner <[email protected]> wrote:
> On Sun, 3 Apr 2016, Andy Lutomirski wrote:
>
>> On Sun, Apr 3, 2016 at 2:57 AM, Thomas Gleixner <[email protected]> wrote:
>> > On Sat, 2 Apr 2016, Andy Lutomirski wrote:
>> >
>> >> On 04/02/2016 04:09 AM, Thomas Gleixner wrote:
>> >> [omitted due to some Thunderbird bug, sigh]
>> >>
>> >> What happens if you mix attached and non-attached ops on the same futex?
>> >
>> > Not much. You might get an error code, sleep forever or the call will just
>> > result in a NOP wasting cpu cycles. That's the same when you mix
>> > shared/private operations on the same futex.
>>
>> What's the workflow?
>>
>> Can the creation of an attached futex fail due to memory allocation
>> problems or any other reason? If so, how does a library make sure it
>> falls back to a normal futex safely?
>
> Well, other operations on futexes can fail as well and the library or the
> usage site has to take care of it. It's not that different.
>
>> Why can't private futexes be attached by default?
>
> We _can_ attach any futex - private or not - by default.

Then I feel like I'm missing something. Why does this need an API
change? Why can't the kernel just attach all futexes?

The reason I'm asking about failures is this:

int some_futex;

Thread 1:
attach some_futex;
...
wait for some_futex;

Thread 2:
attach some_futex;
...
wake some_futex;

If one of the attach operations fails and the other succeeds, then
either the thread that failed to attach can't operate on the futex or,
if it tries to fall back to a normal futex, the wake won't wake the
waiter, right? This seems fragile.

--Andy

2016-04-05 07:44:40

by Torvald Riegel

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On Sun, 2016-04-03 at 06:30 -0500, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <[email protected]> wrote:
> >
> > So an ABI distinction and offloading the decision to every single application that
> > wants to use it and hardcode it into actual application source code via an ABI is
> > pretty much the _WORST_ way to go about it IMHO...
> >
> > So how about this: don't add any ABI details, but make futexes auto-attached on
> > NUMA systems (and obviously PREEMPT_RT systems)?
>
> I agree.
>
> Do *not* make this a visible new ABI.

From a glibc perspective, I agree that this shouldn't require an
extension of the ABI unless it's really the only possible way to solve
this.

For "special" mutex kinds such as PI mutexes, the change in the
interface might be justifiable -- but for ordinary mutexes, there's no
good place to add the attach/detach calls in each thread: An
implementation of, say, C11 mutexes cannot easily estimate whether it
should use attached futexes, and it would have to track whether a
particular mutex has been attached to by the current thread; this might
just move the overhead of tracking and caching associations from the
kernel to userspace.

> You will find that people will make exactly the wrong choices - either
> not using it (because the futex is deep in a standard library!) when
> they want to, or using it when they shouldn't (because the futex is
> deep in a standard library, and the library writer knows *his* code is
> so important that it should get a special faster futex).

There are cases where userspace might know that it should use a
"special" futex. Consider an MCS-lock implementation in glibc by which
all pthreads, C, and C++ mutexes are backed: the lock nodes that threads
would be spinning on would be per-thread and exist for the whole
lifetime of the thread; attach and detach would be simple to do, and
there would be a limited number of these in the system.
A Java VM's park/unpark implementation might be another candidate.
However, these use cases are pretty specific (eg, there's only a single
waiter), so any kernel support for these might be more special too.

2016-04-05 15:58:17

by Carlos O'Donell

Subject: Re: [RFC patch 4/7] futex: Add support for attached futexes

On 04/03/2016 07:30 AM, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar <[email protected]> wrote:
>>
>> So an ABI distinction and offloading the decision to every single application that
>> wants to use it and hardcode it into actual application source code via an ABI is
>> pretty much the _WORST_ way to go about it IMHO...
>>
>> So how about this: don't add any ABI details, but make futexes auto-attached on
>> NUMA systems (and obviously PREEMPT_RT systems)?
>
> I agree.
>
> Do *not* make this a visible new ABI.

Agreed.

We had similar requests in glibc to add APIs to tweak the parameters of
the elision for locks backed by hardware transactional memory.

The person submitting the patches always thinks this is a great API
because it allows them to write tests to verify their own work (which
means it still might be useful for internal testing or developing
auto-tuning).

Users have no clue what to do with the API and, worse, the state space of
the parameters is immense. You can't possibly do any kind of sensible
optimization without knowing a lot about the hardware.

So no public API was ever added in glibc for pthread_mutex_lock elision
parameters. Either the parameters work by default or you have to post
patches to change the auto-tuning used internally in glibc.

> So automatically using a local hashtable (for private mutexes - I
> think people need to just accept that a shared mutex is more costly)
> according to some heuristic is definitely the way to go. And yes, the
> heuristic may be well be - at least to start - "this is a preempt-RT
> system" (for people who clearly care about having predictable
> latencies) or "this is actually a multi-node NUMA system, and I have
> heaps of memory".

Agreed.

> Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Agreed.

--
Cheers,
Carlos.