This patch series introduces two new futex command codes to support
a new optimistic spinning futex for implementing an exclusive lock
primitive that should perform better than the same primitive using
the wait-wake futex in cases where the lock owner is actively working
instead of waiting for I/O completion.
Optimistic spinning means that the waiting tasks spin within the
kernel, waiting for the running lock owner to release the lock, instead
of going to sleep and being woken up later on. It is the same
mechanism that improves kernel mutex and rw semaphore performance,
especially on large systems with many sockets and CPUs.
This patch series improves futex performance on two different fronts:
1) Reducing the amount of the futex spinlock contention by using 2
different spinlocks instead of just one for the wait-wake futex.
2) Eliminating the context switching overhead and latency due to the
sleeping and the waking of the waiting tasks.
The performance improvement depends, to a certain extent, on the
length of the critical section protected by the futex.
Testing done on a 4-socket Westmere-EX box with 40 cores (HT off)
showed the following performance data (average kops/s) with various
load factors (number of pause instructions) used in the critical
section of a userspace mutex microbenchmark.
Threads Load Waiting Futex Spinning Futex %Change
------- ---- ------------- -------------- -------
256 1 6894 8883 +29%
256 10 3656 4912 +34%
256 50 1332 4358 +227%
256 100 792 2753 +248%
10 1 6382 4838 -24%
10 10 3614 4748 +31%
10 50 1319 3900 +196%
10 100 782 2459 +214%
2 1 7905 7194 -9.0%
2 10 4556 4717 +3.5%
2 50 2191 4167 +90%
2 100 1767 2407 +36%
Patch 1 introduces new futex command codes to provide a
mutex abstraction where the waiting tasks just go to sleep (no
spinning).
Patch 2 adds spinning to the mix.
Patch 3 enables woken tasks to go back to the spinning state if
the owner is running.
Patch 4 changes the sleeping queue from a simple FIFO list to an rbtree
sorted by process priority as well as the sequence in which the tasks
enter the kernel.
Patch 5 adds a new document file to describe the spinning futex.
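As described in the document added by patch 5, user space can probe for
kernel support at run time: a FUTEX_SPIN_UNLOCK call on an unowned futex
fails with ESRCH when this series is present and with ENOSYS otherwise.
A rough probe (a sketch only; the command value is taken from patch 1)
could look like this:

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef FUTEX_SPIN_UNLOCK
#define FUTEX_SPIN_UNLOCK	14	/* value added by patch 1 */
#endif

/* Return 1 if the running kernel implements the spinning futex commands */
static int have_spinning_futex(void)
{
	int dummy = 0;	/* an empty (unowned) futex word */

	if (syscall(SYS_futex, &dummy, FUTEX_SPIN_UNLOCK, 0, NULL, NULL, 0) == 0)
		return 1;	/* command accepted; support is present */
	return errno == ESRCH;	/* ENOSYS means the commands are unknown */
}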
Waiman Long (5):
futex: add new exclusive lock & unlock command codes
futex: add optimistic spinning to FUTEX_SPIN_LOCK
spinning futex: move a wakened task to spinning
spinning futex: put waiting tasks in a sorted rbtree
futex, doc: add a document on how to use the spinning futexes
Documentation/spinning-futex.txt | 109 +++++++
include/uapi/linux/futex.h | 4 +
kernel/futex.c | 659 ++++++++++++++++++++++++++++++++++++++
3 files changed, 772 insertions(+), 0 deletions(-)
create mode 100644 Documentation/spinning-futex.txt
This patch moves a woken task back to the optimistic spinning loop
if the owner is running.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/futex.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ddc24d1..cca6457 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3215,8 +3215,8 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
* loop. Preemption should have been disabled before calling this function.
*
* The number of spinners may temporarily exceed the threshold due to
- * racing (the spin count check and add aren't atomic), but that shouldn't
- * be a big problem.
+ * racing and from waiter joining the OSQ(the spin count check and add
+ * aren't atomic), but that shouldn't be a real problem.
*/
static inline int futex_optspin(struct futex_q_head *qh,
struct futex_q_node *qn,
@@ -3297,7 +3297,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
struct futex_q_node qnode;
union futex_key key;
struct task_struct *owner;
- bool gotlock;
+ bool gotlock, slept;
int ret, cnt;
u32 uval, vpid, old;
@@ -3345,6 +3345,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
owner = ACCESS_ONCE(qh->owner);
if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
(!owner || owner->on_cpu))
+optspin:
if (futex_optspin(qh, &qnode, uaddr, vpid))
goto penable_out;
@@ -3356,7 +3357,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
list_add_tail(&qnode.wnode, &qh->waitq);
__set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&qh->wlock);
- gotlock = false;
+ slept = gotlock = false;
for (;;) {
ret = get_user(uval, uaddr);
if (ret)
@@ -3393,7 +3394,16 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
break;
}
+ /*
+ * Go back to spinning if the lock owner is running and the
+ * spinner limit hasn't been reached.
+ */
+ if (slept && (!owner || owner->on_cpu) &&
+ (FUTEX_SPINCNT(qh) < futex_spincnt_max))
+ break;
+
schedule_preempt_disabled();
+ slept = true;
/*
* Got a signal? Return EINTR
@@ -3427,6 +3437,8 @@ dequeue:
}
}
spin_unlock(&qh->wlock);
+ if (!ret && !gotlock)
+ goto optspin;
penable_out:
preempt_enable();
--
1.7.1
This patch adds a new document file on how to use the spinning futexes.
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
1 files changed, 109 insertions(+), 0 deletions(-)
create mode 100644 Documentation/spinning-futex.txt
diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
new file mode 100644
index 0000000..e3cb5a2
--- /dev/null
+++ b/Documentation/spinning-futex.txt
@@ -0,0 +1,109 @@
+Started by: Waiman Long <[email protected]>
+
+Spinning Futex
+--------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space lock primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+ in the futex queue to be woken up by the lock owner when it is done
+ with the lock. Waking up a sleeping task, however, introduces some
+ additional latency which can be large especially if the critical
+ section protected by the lock is relatively short. This may cause
+ a performance bottleneck on large systems with many CPUs running
+ applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+ spinlock-constrained. When many threads are contending for a
+ futex in a large system with many CPUs, it is not unusual to have
+ spinlock contention accounting for more than 90% of the total
+ CPU cycles consumed at various points in time.
+
+Spinning futex is a solution to both the wakeup latency and spinlock
+contention problems by optimistically spinning on a locked futex
+when the lock owner is running within the kernel until the lock is
+free. This is the same optimistic spinning mechanism used by the kernel
+mutex and rw semaphore implementations to improve performance. The
+optimistic spinning was done without taking any lock.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, a lock acquirer has to atomically
+put its thread ID (TID) into the lower 30 bits of the 32-bit futex
+which should has an original value of 0. If it succeeds, it will be
+the owner of the futex. Otherwise, it has to call into the kernel
+using the new FUTEX_SPIN_LOCK futex(2) syscall.
+
+The kernel will use the setting of the most significant bit
+(FUTEX_WAITERS) in the futex value to indicate one or more waiters
+are sleeping and need to be woken up later on.
+
+When it is time to unlock, the lock owner has to atomically clear
+the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
+it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
+sleeping task.
+
+A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
+indicates a task has been woken up. The syscall returns 0 if no
+sleeping task is found or spinners are present to take the lock.
+
+The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
+futex can be used to decide if the spinning futex functionality is
+implemented in the kernel. If it is present, the returned error number
+should be ESRCH. Otherwise it will be ENOSYS.
+
+Currently, only the first and the second arguments (the futex address
+and the opcode) of the futex(2) syscall is used. All the other
+arguments must be set to 0 or NULL to avoid forward compatibility
+problem.
+
+The spinning futex requires the kernel to have support for the cmpxchg
+functionality. For architectures that don't support cmpxchg, spinning
+futex will not be supported as well.
+
+Usage Scenario
+--------------
+
+A spinning futex can be used as an exclusive lock to guard a critical
+section which are unlikely to go to sleep in the kernel. The spinners
+in a spinning futex, however, will fall back to sleep in a wait queue
+if the lock owner isn't running. Therefore, it can also be used when
+the critical section is long and prone to sleeping. However, it may
+not have the performance benefit when compared with a wait-wake futex
+in this case.
+
+Sample Code
+-----------
+
+The following are sample code to implement a simple lock and unlock
+function.
+
+__thread int tid; /* Thread ID */
+
+void mutex_lock(int *faddr)
+{
+ if (cmpxchg(faddr, 0, tid) == 0)
+ return;
+ for (;;)
+ if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
+ break;
+}
+
+void mutex_unlock(int *faddr)
+{
+ int old, fval;
+
+ if ((fval = cmpxchg(faddr, tid, 0)) == tid)
+ return;
+ /* Clear only the TID portion of the futex */
+ for (;;) {
+ old = fval;
+ fval = cmpxchg(faddr, old, old & ~FUTEX_TID_MASK);
+ if (fval == old)
+ break;
+ }
+ if (fval & FUTEX_WAITERS)
+ futex(faddr, FUTEX_SPIN_UNLOCK, ...);
+}
--
1.7.1
A simple FIFO list for the waiting tasks is easy to use. However, it
differs in behavior from the other futex primitives, where the waiting
tasks are kept in a priority list.
In order to produce a list ranked on priority as well as the order in
which the tasks enter the kernel, we need to convert the simple list
into an rbtree.
This patch changes the waiting queue into an rbtree sorted first by
process priority and then by the order in which the tasks enter the
kernel.
The optimistic spinning itself is strictly FIFO. There is no easy
way to make it priority based. In fact, the spinlock contention in
a wait-wake futex when the contending tasks enter the kernel also
serves as a kind of FIFO queue for processing the requests. Priority
doesn't play a role there either.
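In effect, the wait queue is ordered by the comparison sketched below
(a stand-alone illustration of the ordering rule only; the structure is
simplified from the kernel's futex_q_node):

struct waiter {
	int		prio;	/* priority; a lower value means higher priority */
	unsigned long	seq;	/* sequence number of kernel entry */
};

/* Return nonzero if a should be queued (and woken) ahead of b */
static int waiter_before(const struct waiter *a, const struct waiter *b)
{
	if (a->prio != b->prio)
		return a->prio < b->prio;	/* rank by priority first */
	return a->seq < b->seq;			/* then FIFO by kernel entry */
}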
Signed-off-by: Waiman Long <[email protected]>
---
kernel/futex.c | 64 +++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 52 insertions(+), 12 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index cca6457..12d855c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3011,10 +3011,11 @@ void exit_robust_list(struct task_struct *curr)
/**
* struct futex_q_head - head of the optspin futex queue, one per unique key
* @hnode: list entry from the hash bucket
- * @waitq: a list of waiting tasks
+ * @waitq: a rbtree of waiting tasks sorted by priority & sequence number
* @key: the key the futex is hashed on
* @osq: pointer to optimisitic spinning queue
* @owner: task_struct pointer of the futex owner
+ * @seq: last used sequence number + 1
* @otid: TID of the futex owner
* @wlock: spinlock for managing wait queue
* @lcnt: locker count (spinners + waiters)
@@ -3027,7 +3028,7 @@ void exit_robust_list(struct task_struct *curr)
* - enqueue into the spinner queue and wait for its turn.
* 4) For waiter:
* - lock futex queue head spinlock
- * - enqueue into the wait queue
+ * - enqueue into the wait rbtree queue
* - release the lock & sleep
* 5) For unlocker:
* - locate the queue head just like a locker does
@@ -3037,10 +3038,11 @@ void exit_robust_list(struct task_struct *curr)
*/
struct futex_q_head {
struct list_head hnode;
- struct list_head waitq;
+ struct rb_root waitq;
union futex_key key;
struct optimistic_spin_queue *osq;
struct task_struct *owner;
+ unsigned long seq;
pid_t otid;
spinlock_t wlock;
atomic_t lcnt;
@@ -3050,10 +3052,14 @@ struct futex_q_head {
* struct futex_q_node - a node in the optspin futex queue
* @wnode: a list entry for the waiting queue
* @task: task_struct pointer of the current task
+ * @seq: unique sequence number that shows order of kernel entry
+ * @prio: process priority
*/
struct futex_q_node {
- struct list_head wnode;
+ struct rb_node wnode;
struct task_struct *task;
+ unsigned long seq;
+ int prio;
};
/*
@@ -3113,7 +3119,7 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
if (unlikely(!qh->owner))
qh->otid = 0;
atomic_set(&qh->lcnt, 1);
- INIT_LIST_HEAD(&qh->waitq);
+ qh->waitq = RB_ROOT;
spin_lock_init(&qh->wlock);
list_add(&qh->hnode, &hb->oslist);
}
@@ -3155,7 +3161,7 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
/*
* Free the queue head structure
*/
- BUG_ON(!list_empty(&qh->waitq) || qh->osq);
+ BUG_ON(!RB_EMPTY_ROOT(&qh->waitq) || qh->osq);
list_del(&qh->hnode);
spin_unlock(&hb->lock);
if (qh->owner)
@@ -3167,6 +3173,38 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
}
/*
+ * qnode_insert - insert queue node into an rbtree sorted by priority
+ * and then the unique sequence number
+ */
+static inline void qnode_insert(struct rb_root *root, struct futex_q_node *node)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /*
+ * Locate where to put the new node
+ */
+ while (*new) {
+ struct futex_q_node *this = container_of(*new,
+ struct futex_q_node, wnode);
+ parent = *new;
+ if (likely(node->prio == this->prio)) {
+ if (node->seq < this->seq)
+ new = &(parent->rb_left);
+ else /* node->seq > this->seq */
+ new = &(parent->rb_right);
+ } else if (node->prio < this->prio) {
+ new = &(parent->rb_left);
+ } else {
+ new = &(parent->rb_right);
+ }
+ }
+
+ /* Add new node and rebalance tree */
+ rb_link_node(&node->wnode, parent, new);
+ rb_insert_color(&node->wnode, root);
+}
+
+/*
* futex_spin_trylock - attempt to take the lock
* Return: 1 if successful or an error happen
* 0 otherwise
@@ -3336,6 +3374,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
} else {
atomic_inc(&qh->lcnt);
}
+ qnode.seq = qh->seq++;
spin_unlock(&hb->lock);
/*
@@ -3353,8 +3392,9 @@ optspin:
* Put the task into the wait queue and sleep
*/
get_task_struct(qnode.task);
+ qnode.prio = min(qnode.task->normal_prio, MAX_RT_PRIO);
spin_lock(&qh->wlock);
- list_add_tail(&qnode.wnode, &qh->waitq);
+ qnode_insert(&qh->waitq, &qnode);
__set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&qh->wlock);
slept = gotlock = false;
@@ -3421,12 +3461,12 @@ dequeue:
*/
put_task_struct(qnode.task);
spin_lock(&qh->wlock);
- list_del(&qnode.wnode);
+ rb_erase(&qnode.wnode, &qh->waitq);
/*
* Try to clear the waiter bit if the wait queue is empty
*/
- if (list_empty(&qh->waitq)) {
+ if (RB_EMPTY_ROOT(&qh->waitq)) {
int retval = get_futex_value_locked(&uval, uaddr);
if (!retval && FUTEX_HAS_WAITERS(uval)) {
@@ -3529,9 +3569,9 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
lcnt = atomic_read(&qh->lcnt);
if (lcnt) {
spin_lock(&qh->wlock);
- if (!list_empty(&qh->waitq))
- wtask = list_entry(qh->waitq.next, struct futex_q_node,
- wnode)->task;
+ if (!RB_EMPTY_ROOT(&qh->waitq))
+ wtask = container_of(rb_first(&qh->waitq),
+ struct futex_q_node, wnode)->task;
spin_unlock(&qh->wlock);
}
spin_unlock(&hb->lock);
--
1.7.1
This patch adds two new command codes (FUTEX_SPIN_LOCK and
FUTEX_SPIN_UNLOCK) that provide simple exclusive lock and unlock
primitives to users of the futex(2) syscall. Together they provide
an abstraction for a mutually exclusive lock (mutex).
The semantics are about the same as the PI and robust futexes: a
locker atomically puts its thread ID into the futex word to take the
lock and clears it to free the lock.
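For illustration, the user-space side of that protocol might look
roughly like the sketch below (a condensed version of the sample code
in the patch 5 documentation; it assumes the raw syscall(2) wrapper,
the command values from this patch, and that tid holds the caller's
thread ID):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#ifndef FUTEX_SPIN_LOCK
#define FUTEX_SPIN_LOCK		13
#define FUTEX_SPIN_UNLOCK	14
#endif

static __thread int tid;	/* must be set to the caller's thread ID */

static void spin_mutex_lock(int *faddr)
{
	/* Fast path: take a free futex by installing our TID */
	if (__sync_val_compare_and_swap(faddr, 0, tid) == 0)
		return;
	/* Slow path: let the kernel spin or sleep until we own the lock */
	while (syscall(SYS_futex, faddr, FUTEX_SPIN_LOCK, 0, NULL, NULL, 0) != 0)
		;
}

static void spin_mutex_unlock(int *faddr)
{
	int old, fval = __sync_val_compare_and_swap(faddr, tid, 0);

	if (fval == tid)
		return;		/* uncontended: TID cleared, we are done */
	/* Contended: clear only the TID portion of the futex word */
	do {
		old = fval;
		fval = __sync_val_compare_and_swap(faddr, old,
						   old & ~FUTEX_TID_MASK);
	} while (fval != old);
	if (fval & FUTEX_WAITERS)	/* wake one task sleeping in the kernel */
		syscall(SYS_futex, faddr, FUTEX_SPIN_UNLOCK, 0, NULL, NULL, 0);
}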
The major difference between the new primitives and the existing
ones is in how the waiters are enqueued into the waiting queue. The
existing primitives use a priority list to store the waiting tasks
from all the futexes that hash to the same hash bucket.
The new primitives use two different lists. The first list, off the
hash bucket, stores a new dynamically allocated data structure
(futex_q_head) that contains information about all the waiters for
a given futex. So if two futexes hash to the same bucket, the list
will have two entries irrespective of the number of tasks contending
for the futexes.
The second list, in the futex_q_head, contains all the waiters
waiting for the futex to be freed. There are also two spinlocks:
one in the hash bucket and one in the futex_q_head structure.
The main advantage of the two-list approach is that it reduces the
spinlock contention that can occur on a heavily contended futex. It
is not unusual for over 90% of the CPU cycles to be spent spinning
on the lock for a heavily contended futex. By making the locking
more granular, the spinlock contention can be reduced.
The disadvantage, however, is the need to dynamically allocate the new
structure. To reduce that overhead, a single-slot cache entry is added
to the futex hash bucket structure to hold a single free futex_q_head
structure. So as long as there is no futex address hash collision, the
futex code doesn't need to do a lot of memory allocation and
deallocation.
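The single-slot cache itself follows a simple lock-free pattern along
these lines (a stand-alone sketch, not the kernel code; the kernel
keeps the slot in the futex hash bucket):

#include <stddef.h>

/* One-entry cache: claim with an atomic exchange, return with a
   compare-and-swap so that at most one free object is ever kept. */
static void *qh_cache;

static void *cache_get(void)
{
	return __atomic_exchange_n(&qh_cache, NULL, __ATOMIC_ACQUIRE);
}

/* Return nonzero if the object was parked in the cache; on failure the
   caller simply frees the object instead. */
static int cache_put(void *obj)
{
	void *expected = NULL;

	return __atomic_compare_exchange_n(&qh_cache, &expected, obj, 0,
					   __ATOMIC_RELEASE, __ATOMIC_RELAXED);
}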
With the new primitives, a simple 256-thread mutex micro-benchmark
on a 4-socket 40-core system gave the following perf profile:
6.19% futex_test [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.17% futex_test [kernel.kallsyms] [k] futex_spin_lock
3.87% futex_test [kernel.kallsyms] [k] _raw_spin_lock
2.72% futex_test [kernel.kallsyms] [k] drop_futex_key_refs
2.46% futex_test [kernel.kallsyms] [k] futex_spin_unlock
2.36% futex_test [kernel.kallsyms] [k] gup_pte_range
1.87% futex_test [kernel.kallsyms] [k] get_user_pages_fast
1.84% futex_test [kernel.kallsyms] [k] __audit_syscall_exit
The corresponding one with the wait-wake futex was as follows:
93.52% futex_test [kernel.kallsyms] [k] _raw_spin_lock
1.14% futex_test [kernel.kallsyms] [k] futex_wait_setup
0.97% futex_test [kernel.kallsyms] [k] futex_wake
In terms of the reported system time used for 10 million mutex
operations, the new primitives used only 12.7s while the old ones
needed 104.4s.
Signed-off-by: Waiman Long <[email protected]>
---
include/uapi/linux/futex.h | 4 +
kernel/futex.c | 453 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 457 insertions(+), 0 deletions(-)
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..58aa4fb 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_SPIN_LOCK 13
+#define FUTEX_SPIN_UNLOCK 14
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -39,6 +41,8 @@
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
+#define FUTEX_SPIN_LOCK_PRIVATE (FUTEX_SPIN_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_SPIN_UNLOCK_PRIVATE (FUTEX_SPIN_UNLOCK | FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index b632b5f..ec9b6ee 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
* Copyright (C) IBM Corporation, 2009
* Thanks to Thomas Gleixner for conceptual design and careful reviews.
*
+ * Optimistic-spinning futexes by Waiman Long
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -253,6 +256,8 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+ struct list_head oslist; /* Optimistic spinning list */
+ void *qhcache; /* Cache for a free queue head */
} ____cacheline_aligned_in_smp;
static unsigned long __read_mostly futex_hashsize;
@@ -2939,6 +2944,446 @@ void exit_robust_list(struct task_struct *curr)
curr, pip);
}
+/*=========================================
+ * Optimistic Spinning Futexes
+ *=========================================
+ *
+ * A major difference between the optimistic spinning (optspin) futex and a
+ * wait-wake futex is the fact we need a central data structure to hold
+ * information on all the spinning and waiting tasks for a given optspin
+ * futex. We can't use the futex_q structure for that purpose. Instead,
+ * we have to dynamically allocate a new futex_q_head structure to hold that
+ * information.
+ *
+ * To ease implementation, these new futex_q_head structures are queued in a
+ * different list within the futex hash bucket. To reduce the memory allocation
+ * and deallocation overhead, another field is added to the hash bucket to
+ * cache the last freed futex_q_head structure. This improves performance
+ * at the expense of a bit higher memory overhead. As long as there is no futex
+ * address hash collision, there is no need to do a lot of memory allocation
+ * and deallocation for using optspin futex.
+ *
+ * The states of each spinning or waiting task are stored in a futex_q_node
+ * structure allocated from the task's kernel stack, just like futex_q.
+ *
+ * Data structure diagram:
+ *
+ * hash futex futex
+ * table qhead 0 qhead 1
+ * | |
+ * +----+ +------+ +------+
+ * | p |---->| next |---->| next |----> ....
+ * +----+ +------+ +------+
+ * | | | |
+ * |
+ * +------+ +------+
+ * |node 0|---->|node 1|----> ....
+ * +------+ +------+
+ *
+ * Locking
+ * -------
+ * Two spinlocks are used by the optspin futexes - the hash bucket spinlock
+ * and the queue head spinlock.
+ *
+ * The hash bucket spinlock protects against the searching and modification
+ * of the futex queue head list as well as the increment of the atomic locker
+ * count. The decrement of locker count is done outside the lock. When the
+ * count reaches 0, the queue head structure can then be deallocated.
+ *
+ * The queue head spinlock protects the futex wait queue from modification.
+ */
+#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
+#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
+
+/**
+ * struct futex_q_head - head of the optspin futex queue, one per unique key
+ * @hnode: list entry from the hash bucket
+ * @waitq: a list of waiting tasks
+ * @key: the key the futex is hashed on
+ * @wlock: spinlock for managing wait queue
+ * @lcnt: locker count
+ *
+ * Locking sequence
+ * ----------------
+ * 1) Lock hash bucket spinlock, locate the futex queue head
+ * 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
+ * 3) For waiter:
+ * - lock futex queue head spinlock
+ * - enqueue into the wait queue
+ * - release the lock & sleep
+ * 4) For unlocker:
+ * - locate the queue head just like a locker does
+ * - Take the queue head lock and wake up the first waiter there.
+ */
+struct futex_q_head {
+ struct list_head hnode;
+ struct list_head waitq;
+ union futex_key key;
+ pid_t otid;
+ spinlock_t wlock;
+ atomic_t lcnt;
+};
+
+/**
+ * struct futex_q_node - a node in the optspin futex queue
+ * @wnode: a list entry for the waiting queue
+ * @task: task_struct pointer of the current task
+ */
+struct futex_q_node {
+ struct list_head wnode;
+ struct task_struct *task;
+};
+
+
+/*
+ * find_qhead - Find a queue head structure with the matching key
+ *
+ * The caller must hold the hash bucket spinlock.
+ */
+static inline struct futex_q_head *
+find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
+{
+ struct futex_q_head *qh;
+
+ list_for_each_entry(qh, &hb->oslist, hnode)
+ if (match_futex(&qh->key, key))
+ return qh;
+ return NULL;
+}
+
+/*
+ * qhead_alloc_init - allocate and initialize a queue head structure.
+ *
+ * The caller must hold the hash bucket spinlock.
+ *
+ * A one-slot queue head structure cache is added to the futex hash bucket
+ * to minimize overhead due to memory allocation and deallocation under light
+ * contention with no hash bucket collision.
+ */
+static inline struct futex_q_head *
+qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
+{
+ struct futex_q_head *qh = NULL;
+ static const struct futex_q_head qh0 = { { 0 } };
+
+ if (hb->qhcache)
+ qh = xchg(&hb->qhcache, NULL);
+ if (!qh)
+ qh = kmalloc(sizeof(struct futex_q_head), GFP_KERNEL);
+
+ /*
+ * Initialize the queue head structure
+ */
+ if (qh) {
+ *qh = qh0;
+ qh->key = *key;
+ atomic_set(&qh->lcnt, 1);
+ INIT_LIST_HEAD(&qh->waitq);
+ spin_lock_init(&qh->wlock);
+ list_add(&qh->hnode, &hb->oslist);
+ }
+ return qh;
+}
+
+/*
+ * free_qhead - put queue head structure back to the free qh cache or free it
+ */
+static inline void
+qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
+{
+ bool found;
+ struct futex_q_head *qhead;
+
+ /*
+ * Try to free the queue head structure if no one is using it
+ * Must take the hash bucket lock to make sure that no one
+ * else is touching the locker count.
+ */
+ spin_lock(&hb->lock);
+
+ /*
+ * First make sure that qh is still there and unused in the
+ * hashed list.
+ */
+ found = false;
+ list_for_each_entry(qhead, &hb->oslist, hnode)
+ if (qhead == qh) {
+ found = true;
+ break;
+ }
+
+ if (!found || atomic_read(&qh->lcnt)) {
+ spin_unlock(&hb->lock);
+ return;
+ }
+
+ /*
+ * Free the queue head structure
+ */
+ BUG_ON(!list_empty(&qh->waitq));
+ list_del(&qh->hnode);
+ spin_unlock(&hb->lock);
+
+ if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
+ return;
+ kfree(qh);
+}
+
+/*
+ * futex_spin_trylock - attempt to take the lock
+ * Return: 1 if successful or an error happen
+ * 0 otherwise
+ *
+ * Side effect: The uval and ret will be updated.
+ */
+static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
+ int *pret, u32 vpid)
+{
+ u32 old;
+
+ *pret = get_futex_value_locked(puval, uaddr);
+ if (*pret)
+ return 1;
+
+ if (FUTEX_TID(*puval))
+ return 0; /* The mutex is not free */
+
+ old = *puval;
+
+ *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+ if (*pret)
+ return 1;
+ if (*puval == old) {
+ /* Adjust uval to reflect current value */
+ *puval = vpid | old;
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * futex_spin_lock
+ */
+static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
+{
+ struct futex_hash_bucket *hb;
+ struct futex_q_head *qh = NULL;
+ struct futex_q_node qnode;
+ union futex_key key;
+ bool gotlock;
+ int ret, cnt;
+ u32 uval, vpid, old;
+
+ qnode.task = current;
+ vpid = task_pid_vnr(qnode.task);
+
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ if (unlikely(ret))
+ return ret;
+
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
+
+ /*
+ * Locate the queue head for the given key
+ */
+ qh = find_qhead(hb, &key);
+
+ /*
+ * Check the futex value under the hash bucket lock as it might
+ * be changed.
+ */
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
+ goto hbunlock_out;
+
+ if (!qh) {
+ /*
+ * First waiter:
+ * Allocate a queue head structure & initialize it
+ */
+ qh = qhead_alloc_init(hb, &key);
+ if (unlikely(!qh)) {
+ ret = -ENOMEM;
+ goto hbunlock_out;
+ }
+ } else {
+ atomic_inc(&qh->lcnt);
+ }
+ spin_unlock(&hb->lock);
+
+ /*
+ * Put the task into the wait queue and sleep
+ */
+ preempt_disable();
+ get_task_struct(qnode.task);
+ spin_lock(&qh->wlock);
+ list_add_tail(&qnode.wnode, &qh->waitq);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock(&qh->wlock);
+ gotlock = false;
+ for (;;) {
+ ret = get_user(uval, uaddr);
+ if (ret)
+ break;
+ while ((FUTEX_TID(uval) == 0) || !FUTEX_HAS_WAITERS(uval)) {
+ /*
+ * Make sure the waiter bit is set before sleeping
+ * and get the lock if it is available.
+ */
+ if (FUTEX_TID(uval) == 0) {
+ old = uval;
+ ret = futex_atomic_cmpxchg_inatomic(&uval,
+ uaddr, old, vpid);
+ if (ret) {
+ goto dequeue;
+ } else if (uval == old) {
+ gotlock = true;
+ goto dequeue;
+ }
+ continue;
+ }
+
+ old = uval;
+ ret = futex_atomic_cmpxchg_inatomic(&uval, uaddr, old,
+ FUTEX_TID(old) | FUTEX_WAITERS);
+ if (ret)
+ goto dequeue;
+ if (uval == old)
+ break;
+ }
+
+ schedule_preempt_disabled();
+
+ /*
+ * Got a signal? Return EINTR
+ */
+ if (unlikely(signal_pending(qnode.task))) {
+ ret = -EINTR;
+ break; /* Remove from queue */
+ }
+ }
+dequeue:
+ __set_current_state(TASK_RUNNING);
+ /*
+ * Remove itself from the wait queue and go back to optimistic
+ * spinning if it hasn't got the lock yet.
+ */
+ put_task_struct(qnode.task);
+ spin_lock(&qh->wlock);
+ list_del(&qnode.wnode);
+
+ /*
+ * Try to clear the waiter bit if the wait queue is empty
+ */
+ if (list_empty(&qh->waitq)) {
+ int retval = get_futex_value_locked(&uval, uaddr);
+
+ if (!retval && FUTEX_HAS_WAITERS(uval)) {
+ old = uval;
+ uval &= ~FUTEX_WAITERS;
+ (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
+ uval);
+ }
+ }
+ spin_unlock(&qh->wlock);
+ preempt_enable();
+
+ cnt = atomic_dec_return(&qh->lcnt);
+ if (cnt == 0)
+ qhead_free(qh, hb);
+ /*
+ * Need to set the waiters bit there are still waiters
+ */
+ else if (!ret)
+ ret = put_user(vpid | FUTEX_WAITERS, uaddr);
+out:
+ put_futex_key(&key);
+ return ret;
+
+hbunlock_out:
+ spin_unlock(&hb->lock);
+ goto out;
+}
+
+/*
+ * futex_spin_unlock
+ */
+static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
+{
+ struct futex_hash_bucket *hb;
+ struct futex_q_head *qh;
+ union futex_key key;
+ struct task_struct *wtask; /* Task to be woken */
+ int ret, lcnt;
+ u32 uval, old, vpid = task_pid_vnr(current);
+
+ ret = get_user(uval, uaddr);
+ if (ret)
+ return ret;
+
+ /*
+ * The unlocker may have cleared the TID value and another task may
+ * steal it. However, if its TID is still set, we need to clear
+ * it as well as the FUTEX_WAITERS bit.
+ */
+ while (FUTEX_TID(uval) == vpid) {
+ old = uval;
+ uval &= ~(FUTEX_TID_MASK | FUTEX_WAITERS);
+ ret = futex_atomic_cmpxchg_inatomic(&uval, uaddr, old, uval);
+ if (ret)
+ return ret;
+ if (uval == old)
+ break;
+ }
+ /*
+ * Need to wake up a waiter
+ */
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ if (unlikely(ret))
+ return ret;
+
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
+
+ /*
+ * Locate the queue head for the given key
+ */
+ qh = find_qhead(hb, &key);
+ if (!qh) {
+ spin_unlock(&hb->lock);
+ ret = -ESRCH; /* Invalid futex address */
+ goto out;
+ }
+
+ /*
+ * The hash bucket lock is being hold while retrieving the task
+ * structure from the queue head to make sure that the qh structure
+ * won't go away under the hood.
+ * Preemption is disabled during the wakeup process to avoid having
+ * a long gap between getting the task structure and waking it up.
+ */
+ wtask = NULL;
+ preempt_disable();
+ lcnt = atomic_read(&qh->lcnt);
+ if (lcnt) {
+ spin_lock(&qh->wlock);
+ if (!list_empty(&qh->waitq))
+ wtask = list_entry(qh->waitq.next, struct futex_q_node,
+ wnode)->task;
+ spin_unlock(&qh->wlock);
+ }
+ spin_unlock(&hb->lock);
+ if (wtask && wake_up_process(wtask))
+ ret = 1;
+ else if (!wtask)
+ ret = -ESRCH;
+ preempt_enable();
+out:
+ put_futex_key(&key);
+ return ret;
+}
+
+
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
@@ -2960,6 +3405,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
case FUTEX_TRYLOCK_PI:
case FUTEX_WAIT_REQUEUE_PI:
case FUTEX_CMP_REQUEUE_PI:
+ case FUTEX_SPIN_LOCK:
+ case FUTEX_SPIN_UNLOCK:
if (!futex_cmpxchg_enabled)
return -ENOSYS;
}
@@ -2991,6 +3438,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+ case FUTEX_SPIN_LOCK:
+ return futex_spin_lock(uaddr, flags);
+ case FUTEX_SPIN_UNLOCK:
+ return futex_spin_unlock(uaddr, flags);
}
return -ENOSYS;
}
@@ -3072,7 +3523,9 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++) {
atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
+ INIT_LIST_HEAD(&futex_queues[i].oslist);
spin_lock_init(&futex_queues[i].lock);
+ futex_queues[i].qhcache = NULL;
}
return 0;
--
1.7.1
This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
primitive on the futex value when the lock owner is running. It is
the same optimistic spinning technique that is done in the mutex and
rw semaphore code to improve their performance especially on large
systems with large number of CPUs. When the lock owner is not running,
the spinning tasks will go to sleep.
There are two major advantages of doing optimistic spinning here:
1) It eliminates the context switching latency and overhead (at
least a few us) associated with sleeping and wakeup.
2) It eliminates most of the need to call futex(2) with
FUTEX_SPIN_UNLOCK as spinning is done without the need to set
the FUTEX_WAITERS bit.
Active spinning, however, does consume time in the current quantum of
time slice, making it a bit more likely to be preempted while running
in the critical section due to time slice expiration. The heavy spinlock
contention of a wait-wake futex has the same effect, so it is not
specific to this new primitive.
With the addition of optimistic spinning, it can significantly speed
up the amount of mutex operations that can be done in a certain unit
of time. With a userspace mutex microbenchmark running 10 million
mutex operations with 256 threads on a 4-socket 40-core server, the
spinning futex can achieve a rate of about 4.9 Mops/s with a critical
section load of 10 pause instructions. Whereas the wait-wake futex can
only achieve a rate of 3.7 Mops/s. When increasing the load to 100,
the corresponding rates become 2.8 Mops/s and 0.8 Mops/s respectively.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/futex.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 172 insertions(+), 18 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ec9b6ee..ddc24d1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -71,6 +71,7 @@
#include <asm/futex.h>
#include "locking/rtmutex_common.h"
+#include "locking/mcs_spinlock.h"
/*
* READ this before attempting to hack on futexes!
@@ -2995,30 +2996,51 @@ void exit_robust_list(struct task_struct *curr)
#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
+/*
+ * Bit usage of the locker count:
+ * bit 0-23: number of lockers (spinners + waiters)
+ * bit 24-30: number of spinners
+ */
+#define FUTEX_SPINCNT_MAX 64 /* Maximum # of spinners */
+#define FUTEX_SPINCNT_SHIFT 24
+#define FUTEX_SPINCNT_BIAS (1U << FUTEX_SPINCNT_SHIFT)
+#define FUTEX_LOCKCNT_MASK (FUTEX_SPINCNT_BIAS - 1)
+#define FUTEX_LOCKCNT(qh) (atomic_read(&(qh)->lcnt) & FUTEX_LOCKCNT_MASK)
+#define FUTEX_SPINCNT(qh) (atomic_read(&(qh)->lcnt)>>FUTEX_SPINCNT_SHIFT)
+
/**
* struct futex_q_head - head of the optspin futex queue, one per unique key
* @hnode: list entry from the hash bucket
* @waitq: a list of waiting tasks
* @key: the key the futex is hashed on
+ * @osq: pointer to optimisitic spinning queue
+ * @owner: task_struct pointer of the futex owner
+ * @otid: TID of the futex owner
* @wlock: spinlock for managing wait queue
- * @lcnt: locker count
+ * @lcnt: locker count (spinners + waiters)
*
* Locking sequence
* ----------------
* 1) Lock hash bucket spinlock, locate the futex queue head
* 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
- * 3) For waiter:
+ * 3) For spinner:
+ * - enqueue into the spinner queue and wait for its turn.
+ * 4) For waiter:
* - lock futex queue head spinlock
* - enqueue into the wait queue
* - release the lock & sleep
- * 4) For unlocker:
+ * 5) For unlocker:
* - locate the queue head just like a locker does
- * - Take the queue head lock and wake up the first waiter there.
+ * - clear the owner field if it is the current owner
+ * - if the locker count is not 0 & osq is empty, take the queue head lock
+ * and wake up the first waiter.
*/
struct futex_q_head {
struct list_head hnode;
struct list_head waitq;
union futex_key key;
+ struct optimistic_spin_queue *osq;
+ struct task_struct *owner;
pid_t otid;
spinlock_t wlock;
atomic_t lcnt;
@@ -3034,6 +3056,13 @@ struct futex_q_node {
struct task_struct *task;
};
+/*
+ * The maximum number of tasks that can be a futex spin queue
+ *
+ * It is set to the lesser of half of the total number of CPUs and
+ * FUTEX_SPINCNT_MAX to avoid locking up all the CPUs in spinning.
+ */
+static int __read_mostly futex_spincnt_max;
/*
* find_qhead - Find a queue head structure with the matching key
@@ -3061,7 +3090,7 @@ find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
* contention with no hash bucket collision.
*/
static inline struct futex_q_head *
-qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
+qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
{
struct futex_q_head *qh = NULL;
static const struct futex_q_head qh0 = { { 0 } };
@@ -3073,10 +3102,16 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
/*
* Initialize the queue head structure
+ * The lock owner field may be NULL if the task has released the lock
+ * and exit.
*/
if (qh) {
- *qh = qh0;
- qh->key = *key;
+ *qh = qh0;
+ qh->key = *key;
+ qh->otid = FUTEX_TID(uval);
+ qh->owner = futex_find_get_task(qh->otid);
+ if (unlikely(!qh->owner))
+ qh->otid = 0;
atomic_set(&qh->lcnt, 1);
INIT_LIST_HEAD(&qh->waitq);
spin_lock_init(&qh->wlock);
@@ -3120,9 +3155,11 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
/*
* Free the queue head structure
*/
- BUG_ON(!list_empty(&qh->waitq));
+ BUG_ON(!list_empty(&qh->waitq) || qh->osq);
list_del(&qh->hnode);
spin_unlock(&hb->lock);
+ if (qh->owner)
+ put_task_struct(qh->owner);
if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
return;
@@ -3134,14 +3171,19 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
* Return: 1 if successful or an error happen
* 0 otherwise
*
+ * Optimistic spinning is done without holding lock, but with page fault
+ * explicitly disabled. So different functions need to be used to access
+ * the userspace futex value.
+ *
* Side effect: The uval and ret will be updated.
*/
static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
- int *pret, u32 vpid)
+ int *pret, u32 vpid, bool spinning)
{
- u32 old;
+ u32 old;
- *pret = get_futex_value_locked(puval, uaddr);
+ *pret = spinning ? __copy_from_user_inatomic(puval, uaddr, sizeof(u32))
+ : get_futex_value_locked(puval, uaddr);
if (*pret)
return 1;
@@ -3150,18 +3192,102 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
old = *puval;
- *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+ *pret = spinning
+ ? futex_atomic_cmpxchg_inatomic(puval, uaddr, old, vpid)
+ : cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+
if (*pret)
return 1;
if (*puval == old) {
/* Adjust uval to reflect current value */
- *puval = vpid | old;
+ *puval = spinning ? vpid : (vpid | old);
return 1;
}
return 0;
}
/*
+ * futex_optspin - optimistic spinning loop
+ * Return: 1 if lock successfully acquired
+ * 0 if need to fall back to waiting
+ *
+ * Page fault and preemption are disabled in the optimistic spinning
+ * loop. Preemption should have been disabled before calling this function.
+ *
+ * The number of spinners may temporarily exceed the threshold due to
+ * racing (the spin count check and add aren't atomic), but that shouldn't
+ * be a big problem.
+ */
+static inline int futex_optspin(struct futex_q_head *qh,
+ struct futex_q_node *qn,
+ u32 __user *uaddr,
+ u32 vpid)
+{
+ u32 uval;
+ int ret, gotlock = false;
+ struct task_struct *owner;
+
+ /*
+ * Increment the spinner count
+ */
+ atomic_add(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+ if (!osq_lock(&qh->osq)) {
+ atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+ return gotlock;
+ }
+ pagefault_disable();
+ for (;; cpu_relax()) {
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid, true)) {
+ /*
+ * Fall back to waiting if an error happen
+ */
+ if (ret)
+ break;
+ qh->otid = vpid;
+ owner = xchg(&qh->owner, qn->task);
+ get_task_struct(qn->task);
+ if (owner)
+ put_task_struct(owner);
+ gotlock = true;
+ break;
+ } else if (unlikely(FUTEX_HAS_WAITERS(uval))) {
+ /*
+ * Try to turn off the waiter bit as it now has a
+ * spinner. It doesn't matter if it fails as it will
+ * try again in the next iteration.
+ */
+ (void)futex_atomic_cmpxchg_inatomic
+ (&uval, uaddr, uval, uval & ~FUTEX_WAITERS);
+ }
+
+ if (unlikely(FUTEX_TID(uval) != qh->otid)) {
+ /*
+ * Owner has changed
+ */
+ qh->otid = FUTEX_TID(uval);
+ owner = xchg(&qh->owner, futex_find_get_task(qh->otid));
+ if (owner)
+ put_task_struct(owner);
+ }
+ owner = ACCESS_ONCE(qh->owner);
+ if ((owner && !ACCESS_ONCE(owner->on_cpu)) || need_resched())
+ break;
+ }
+ pagefault_enable();
+ osq_unlock(&qh->osq);
+ atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+
+ /*
+ * If we fell out of the spin path because of need_resched(),
+ * reschedule now.
+ */
+ if (!gotlock && need_resched())
+ schedule_preempt_disabled();
+
+ return gotlock;
+}
+
+/*
* futex_spin_lock
*/
static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
@@ -3170,6 +3296,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
struct futex_q_head *qh = NULL;
struct futex_q_node qnode;
union futex_key key;
+ struct task_struct *owner;
bool gotlock;
int ret, cnt;
u32 uval, vpid, old;
@@ -3193,7 +3320,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
* Check the futex value under the hash bucket lock as it might
* be changed.
*/
- if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid, false))
goto hbunlock_out;
if (!qh) {
@@ -3201,7 +3328,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
* First waiter:
* Allocate a queue head structure & initialize it
*/
- qh = qhead_alloc_init(hb, &key);
+ qh = qhead_alloc_init(hb, &key, uval);
if (unlikely(!qh)) {
ret = -ENOMEM;
goto hbunlock_out;
@@ -3212,9 +3339,18 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
spin_unlock(&hb->lock);
/*
- * Put the task into the wait queue and sleep
+ * Perform optimisitic spinning if the owner is running.
*/
preempt_disable();
+ owner = ACCESS_ONCE(qh->owner);
+ if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
+ (!owner || owner->on_cpu))
+ if (futex_optspin(qh, &qnode, uaddr, vpid))
+ goto penable_out;
+
+ /*
+ * Put the task into the wait queue and sleep
+ */
get_task_struct(qnode.task);
spin_lock(&qh->wlock);
list_add_tail(&qnode.wnode, &qh->waitq);
@@ -3238,6 +3374,11 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
goto dequeue;
} else if (uval == old) {
gotlock = true;
+ qh->otid = vpid;
+ owner = xchg(&qh->owner, qnode.task);
+ get_task_struct(qnode.task);
+ if (owner)
+ put_task_struct(owner);
goto dequeue;
}
continue;
@@ -3286,15 +3427,17 @@ dequeue:
}
}
spin_unlock(&qh->wlock);
+
+penable_out:
preempt_enable();
cnt = atomic_dec_return(&qh->lcnt);
if (cnt == 0)
qhead_free(qh, hb);
/*
- * Need to set the waiters bit there are still waiters
+ * Need to set the waiters bit there no spinner running
*/
- else if (!ret)
+ else if (!ret && ((cnt >> FUTEX_SPINCNT_SHIFT) == 0))
ret = put_user(vpid | FUTEX_WAITERS, uaddr);
out:
put_futex_key(&key);
@@ -3356,6 +3499,13 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
}
/*
+ * Clear the owner field
+ */
+ if ((qh->owner == current) &&
+ (cmpxchg(&qh->owner, current, NULL) == current))
+ put_task_struct(current);
+
+ /*
* The hash bucket lock is being hold while retrieving the task
* structure from the queue head to make sure that the qh structure
* won't go away under the hood.
@@ -3520,6 +3670,10 @@ static int __init futex_init(void)
futex_detect_cmpxchg();
+ futex_spincnt_max = num_possible_cpus()/2;
+ if (futex_spincnt_max > FUTEX_SPINCNT_MAX)
+ futex_spincnt_max = FUTEX_SPINCNT_MAX;
+
for (i = 0; i < futex_hashsize; i++) {
atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
--
1.7.1
On 07/21/2014 08:24 AM, Waiman Long wrote:
> This patch adds a new document file on how to use the spinning futexes.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
> 1 files changed, 109 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/spinning-futex.txt
>
> diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
> new file mode 100644
> index 0000000..e3cb5a2
> --- /dev/null
> +++ b/Documentation/spinning-futex.txt
> @@ -0,0 +1,109 @@
> +Started by: Waiman Long <[email protected]>
> +
> +Spinning Futex
> +--------------
> +
> +There are two main problems for a wait-wake futex (FUTEX_WAIT and
> +FUTEX_WAKE) when used for creating user-space lock primitives:
> +
> + 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
> + in the futex queue to be woken up by the lock owner when it is done
> + with the lock. Waking up a sleeping task, however, introduces some
> + additional latency which can be large especially if the critical
> + section protected by the lock is relatively short. This may cause
> + a performance bottleneck on large systems with many CPUs running
> + applications that need a lot of inter-thread synchronization.
> +
> + 2) The performance of the wait-wake futex is currently
> + spinlock-constrained. When many threads are contending for a
> + futex in a large system with many CPUs, it is not unusual to have
> + spinlock contention accounting for more than 90% of the total
> + CPU cycles consumed at various points in time.
> +
> +Spinning futex is a solution to both the wakeup latency and spinlock
> +contention problems by optimistically spinning on a locked futex
> +when the lock owner is running within the kernel until the lock is
> +free. This is the same optimistic spinning mechanism used by the kernel
> +mutex and rw semaphore implementations to improve performance. The
> +optimistic spinning was done without taking any lock.
is done
> +
> +Implementation
> +--------------
> +
> +Like the PI and robust futexes, a lock acquirer has to atomically
> +put its thread ID (TID) into the lower 30 bits of the 32-bit futex
> +which should has an original value of 0. If it succeeds, it will be
have
> +the owner of the futex. Otherwise, it has to call into the kernel
> +using the new FUTEX_SPIN_LOCK futex(2) syscall.
> +
> +The kernel will use the setting of the most significant bit
> +(FUTEX_WAITERS) in the futex value to indicate one or more waiters
> +are sleeping and need to be woken up later on.
> +
> +When it is time to unlock, the lock owner has to atomically clear
> +the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
> +it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
> +sleeping task.
> +
> +A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
> +indicates a task has been woken up. The syscall returns 0 if no
> +sleeping task is found or spinners are present to take the lock.
> +
> +The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
> +futex can be used to decide if the spinning futex functionality is
> +implemented in the kernel. If it is present, the returned error number
> +should be ESRCH. Otherwise it will be ENOSYS.
> +
> +Currently, only the first and the second arguments (the futex address
> +and the opcode) of the futex(2) syscall is used. All the other
are used.
> +arguments must be set to 0 or NULL to avoid forward compatibility
> +problem.
> +
> +The spinning futex requires the kernel to have support for the cmpxchg
> +functionality. For architectures that don't support cmpxchg, spinning
> +futex will not be supported as well.
> +
> +Usage Scenario
> +--------------
> +
> +A spinning futex can be used as an exclusive lock to guard a critical
> +section which are unlikely to go to sleep in the kernel. The spinners
is
> +in a spinning futex, however, will fall back to sleep in a wait queue
> +if the lock owner isn't running. Therefore, it can also be used when
> +the critical section is long and prone to sleeping. However, it may
> +not have the performance benefit when compared with a wait-wake futex
> +in this case.
> +
> +Sample Code
> +-----------
> +
> +The following are sample code to implement a simple lock and unlock
is
> +function.
> +
> +__thread int tid; /* Thread ID */
> +
> +void mutex_lock(int *faddr)
> +{
> + if (cmpxchg(faddr, 0, tid) == 0)
> + return;
> + for (;;)
> + if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
> + break;
> +}
> +
> +void mutex_unlock(int *faddr)
> +{
> + int old, fval;
> +
> + if ((fval = cmpxchg(faddr, tid, 0)) == tid)
> + return;
> + /* Clear only the TID portion of the futex */
> + for (;;) {
> + old = fval;
> + fval = cmpxchg(faddr, old, old & ~FUTEX_TID_MASK);
> + if (fval == old)
> + break;
> + }
> + if (fval & FUTEX_WAITERS)
> + futex(faddr, FUTEX_SPIN_UNLOCK, ...);
> +}
>
--
~Randy
On Mon, 21 Jul 2014, Waiman Long wrote:
> +#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
> +#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
You love ugly macros, right?
> +/*
> + * futex_spin_trylock - attempt to take the lock
> + * Return: 1 if successful or an error happen
> + * 0 otherwise
> + *
> + * Side effect: The uval and ret will be updated.
> + */
> +static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
> + int *pret, u32 vpid)
> +{
> + u32 old;
> +
> + *pret = get_futex_value_locked(puval, uaddr);
> + if (*pret)
> + return 1;
> +
> + if (FUTEX_TID(*puval))
> + return 0; /* The mutex is not free */
> +
> + old = *puval;
> +
> + *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> + if (*pret)
> + return 1;
> + if (*puval == old) {
> + /* Adjust uval to reflect current value */
> + *puval = vpid | old;
> + return 1;
> + }
> + return 0;
What's the point of all of this?
A simple cmpxchg_futex_value_locked() does all of this, just less ugly
and without all these extra indirections and totally incomprehensible
conditionals.
> +}
> +
> +/*
> + * futex_spin_lock
> + */
> +static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> +{
So this lacks a timeout. If we provide this, then we need to have the
timeout supported as well.
> + struct futex_hash_bucket *hb;
> + struct futex_q_head *qh = NULL;
> + struct futex_q_node qnode;
> + union futex_key key;
> + bool gotlock;
> + int ret, cnt;
> + u32 uval, vpid, old;
> +
> + qnode.task = current;
> + vpid = task_pid_vnr(qnode.task);
> +
> + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> + if (unlikely(ret))
Stop sprinkling the code with unlikelys
> + return ret;
> +
> + hb = hash_futex(&key);
> + spin_lock(&hb->lock);
> +
> + /*
> + * Locate the queue head for the given key
> + */
Brilliant comment. If you'd comment the stuff which really matters and
leave out the obvious, then your code might be readable some day.
> + qh = find_qhead(hb, &key);
> +
> + /*
> + * Check the futex value under the hash bucket lock as it might
> + * be changed.
> + */
What might have changed? You enter the function with uaddr, but no
uval. So what changed?
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
> + goto hbunlock_out;
> +
> + if (!qh) {
> + /*
> + * First waiter:
> + * Allocate a queue head structure & initialize it
> + */
> + qh = qhead_alloc_init(hb, &key);
> + if (unlikely(!qh)) {
> + ret = -ENOMEM;
> + goto hbunlock_out;
> + }
> + } else {
> + atomic_inc(&qh->lcnt);
> + }
> + spin_unlock(&hb->lock);
> +
> + /*
> + * Put the task into the wait queue and sleep
> + */
> + preempt_disable();
Why?
> + get_task_struct(qnode.task);
So you get a task reference on current? What the heck is this for?
> + spin_lock(&qh->wlock);
> + list_add_tail(&qnode.wnode, &qh->waitq);
> + __set_current_state(TASK_INTERRUPTIBLE);
> + spin_unlock(&qh->wlock);
> + gotlock = false;
> + for (;;) {
> + ret = get_user(uval, uaddr);
> + if (ret)
> + break;
So you let user space handle EFAULT?
> +dequeue:
> + __set_current_state(TASK_RUNNING);
> + /*
> + * Remove itself from the wait queue and go back to optimistic
> + * spinning if it hasn't got the lock yet.
> + */
> + put_task_struct(qnode.task);
> + spin_lock(&qh->wlock);
> + list_del(&qnode.wnode);
> +
> + /*
> + * Try to clear the waiter bit if the wait queue is empty
> + */
> + if (list_empty(&qh->waitq)) {
> + int retval = get_futex_value_locked(&uval, uaddr);
> +
> + if (!retval && FUTEX_HAS_WAITERS(uval)) {
> + old = uval;
> + uval &= ~FUTEX_WAITERS;
> + (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
> + uval);
> + }
> + }
> + spin_unlock(&qh->wlock);
> + preempt_enable();
> +
> + cnt = atomic_dec_return(&qh->lcnt);
> + if (cnt == 0)
> + qhead_free(qh, hb);
> + /*
> + * Need to set the waiters bit there are still waiters
> + */
> + else if (!ret)
> + ret = put_user(vpid | FUTEX_WAITERS, uaddr);
WTF? You fiddle with the uaddr completely unprotected.
> +out:
> + put_futex_key(&key);
> + return ret;
> +
> +hbunlock_out:
> + spin_unlock(&hb->lock);
> + goto out;
> +}
> +
> +/*
> + * futex_spin_unlock
> + */
> +static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
> +{
> + struct futex_hash_bucket *hb;
> + struct futex_q_head *qh;
> + union futex_key key;
> + struct task_struct *wtask; /* Task to be woken */
> + int ret, lcnt;
> + u32 uval, old, vpid = task_pid_vnr(current);
> +
> + ret = get_user(uval, uaddr);
> + if (ret)
> + return ret;
> +
> + /*
> + * The unlocker may have cleared the TID value and another task may
> + * steal it. However, if its TID is still set, we need to clear
> + * it as well as the FUTEX_WAITERS bit.
No, that's complete and utter crap. The unlocker is current and it may
not have cleared anything.
Your design or the lack thereof is a complete disaster.
Sit down first and define the exact semantics of the new opcode. That
includes user and kernel space and the interaction with robust list,
which you happily ignored.
What are the semantics of uval? When can it be changed in kernel and
in user space? How do we deal with corruption of the user space value?
How does that new opcode provide robustness?
How are faults handled?
....
Until you provide a coherent description of that, none of this is
going to happen. And even after that, it's definitely not going to
look like the hackery you delivered now.
Thanks,
tglx
Waiman Long <[email protected]> writes:
> This patch series introduces two new futex command codes to support
> a new optimistic spinning futex for implementing an exclusive lock
> primitive that should perform better than the same primitive using
> the wait-wake futex in cases where the lock owner is actively working
> instead of waiting for I/O completion.
How would you distinguish those two cases automatically?
>
> This patch series improves futex performance on two different fronts:
> 1) Reducing the amount of the futex spinlock contention by using 2
> different spinlocks instead of just one for the wait-wake futex.
> 2) Eliminating the context switching overhead and latency due to the
> sleeping and the waking of the waiting tasks.
FWIW the main problem is currently that switch-through-idle is so
slow. I think improving that would give a boost to far more
situations.
> Patch 4 changes the sleeping queue from a simple FIFO list to an rbtree
> sorted by process priority as well as the sequence in which the tasks
> enter the kernel.
This seems to mix new functionality with performance improvements?
Generally, adding new ordering in a user interface is risky because
user programs have often been tuned for the specific old ordering.
-Andi
--
[email protected] -- Speaking for myself only
Andi Kleen <[email protected]> writes:
> Waiman Long <[email protected]> writes:
>
>> This patch series introduces two new futex command codes to support
>> a new optimistic spinning futex for implementing an exclusive lock
>> primitive that should perform better than the same primitive using
>> the wait-wake futex in cases where the lock owner is actively working
>> instead of waiting for I/O completion.
>
> How would you distinguish those two cases automatically?
Also BTW, traditionally the spinning was just done in user space.
This would always be even more efficient, because it would
even avoid the syscall entry path.
Granted, the glibc adaptive mutexes are not very good, but I'm sure
they could be improved.
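For illustration, such a user-space adaptive lock (spin a bounded
number of times on the lock word, then fall back to FUTEX_WAIT) might
look roughly like the sketch below, built only on the existing
wait-wake commands. The spin limit and the x86 pause hint are
arbitrary illustrative choices:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define SPIN_LIMIT	100	/* arbitrary bound on user-space spinning */

/* Lock states: 0 = free, 1 = locked, 2 = locked with possible waiters */
static void adaptive_lock(int *lock)
{
	int i, expected;

	for (i = 0; i < SPIN_LIMIT; i++) {
		expected = 0;
		if (__atomic_compare_exchange_n(lock, &expected, 1, 0,
				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return;			/* acquired while spinning */
		__builtin_ia32_pause();		/* x86-specific spin hint */
	}
	/* Fall back to sleeping in the kernel */
	while (__atomic_exchange_n(lock, 2, __ATOMIC_ACQUIRE) != 0)
		syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void adaptive_unlock(int *lock)
{
	if (__atomic_exchange_n(lock, 0, __ATOMIC_RELEASE) == 2)
		syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}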
-Andi
--
[email protected] -- Speaking for myself only
On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
> primitive on the futex value when the lock owner is running. It is
> the same optimistic spinning technique that is done in the mutex and
> rw semaphore code to improve their performance especially on large
> systems with large number of CPUs. When the lock owner is not running,
> the spinning tasks will go to sleep.
>
> There is 2 major advantages of doing optimistic spinning here:
> 1) It eliminates the context switching latency and overhead (at
> least a few us) associated with sleeping and wakeup.
> 2) It eliminates most of the need to call futex(2) with
> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
> the FUTEX_WAITERS bit.
I think this belongs with Patch 1: the optimistic spinning feature should
be in the same patch that adds the new futex commands.
> Active spinning, however, does consume time in the current quantum of
> time slice, making it a bit more likely to be preempted while running
> in the critical section due to time slice expiration. The heavy spinlock
> contention of a wait-wake futex has the same effect, so it is not
> specific to this new primitive.
>
> With the addition of optimistic spinning, it can significantly speed
> up the amount of mutex operations that can be done in a certain unit
> of time. With a userspace mutex microbenchmark running 10 million
> mutex operations with 256 threads on a 4-socket 40-core server, the
> spinning futex can achieve a rate of about 4.9 Mops/s with a critical
> section load of 10 pause instructions. Whereas the wait-wake futex can
> only achieve a rate of 3.7 Mops/s. When increasing the load to 100,
> the corresponding rates become 2.8 Mops/s and 0.8 Mops/s respectively.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/futex.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 172 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index ec9b6ee..ddc24d1 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -71,6 +71,7 @@
> #include <asm/futex.h>
>
> #include "locking/rtmutex_common.h"
> +#include "locking/mcs_spinlock.h"
>
> /*
> * READ this before attempting to hack on futexes!
> @@ -2995,30 +2996,51 @@ void exit_robust_list(struct task_struct *curr)
> #define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
> #define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
>
> +/*
> + * Bit usage of the locker count:
> + * bit 0-23: number of lockers (spinners + waiters)
> + * bit 24-30: number of spinners
> + */
> +#define FUTEX_SPINCNT_MAX 64 /* Maximum # of spinners */
> +#define FUTEX_SPINCNT_SHIFT 24
> +#define FUTEX_SPINCNT_BIAS (1U << FUTEX_SPINCNT_SHIFT)
> +#define FUTEX_LOCKCNT_MASK (FUTEX_SPINCNT_BIAS - 1)
> +#define FUTEX_LOCKCNT(qh) (atomic_read(&(qh)->lcnt) & FUTEX_LOCKCNT_MASK)
> +#define FUTEX_SPINCNT(qh) (atomic_read(&(qh)->lcnt)>>FUTEX_SPINCNT_SHIFT)
Both FUTEX_LOCKCNT and FUTEX_SPINCNT should be static inline functions.
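Something along these lines would do (a sketch only, not from the posted
patch):

static inline u32 futex_lockcnt(struct futex_q_head *qh)
{
	return atomic_read(&qh->lcnt) & FUTEX_LOCKCNT_MASK;
}

static inline u32 futex_spincnt(struct futex_q_head *qh)
{
	return atomic_read(&qh->lcnt) >> FUTEX_SPINCNT_SHIFT;
}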
> /**
> * struct futex_q_head - head of the optspin futex queue, one per unique key
> * @hnode: list entry from the hash bucket
> * @waitq: a list of waiting tasks
> * @key: the key the futex is hashed on
> + * @osq: pointer to optimisitic spinning queue
> + * @owner: task_struct pointer of the futex owner
> + * @otid: TID of the futex owner
> * @wlock: spinlock for managing wait queue
> - * @lcnt: locker count
> + * @lcnt: locker count (spinners + waiters)
> *
> * Locking sequence
> * ----------------
> * 1) Lock hash bucket spinlock, locate the futex queue head
> * 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
> - * 3) For waiter:
> + * 3) For spinner:
> + * - enqueue into the spinner queue and wait for its turn.
> + * 4) For waiter:
> * - lock futex queue head spinlock
> * - enqueue into the wait queue
> * - release the lock & sleep
> - * 4) For unlocker:
> + * 5) For unlocker:
> * - locate the queue head just like a locker does
> - * - Take the queue head lock and wake up the first waiter there.
> + * - clear the owner field if it is the current owner
> + * - if the locker count is not 0 & osq is empty, take the queue head lock
> + * and wake up the first waiter.
> */
> struct futex_q_head {
> struct list_head hnode;
> struct list_head waitq;
> union futex_key key;
> + struct optimistic_spin_queue *osq;
> + struct task_struct *owner;
> pid_t otid;
> spinlock_t wlock;
> atomic_t lcnt;
> @@ -3034,6 +3056,13 @@ struct futex_q_node {
> struct task_struct *task;
> };
>
> +/*
> + * The maximum number of tasks that can be a futex spin queue
> + *
> + * It is set to the lesser of half of the total number of CPUs and
> + * FUTEX_SPINCNT_MAX to avoid locking up all the CPUs in spinning.
> + */
> +static int __read_mostly futex_spincnt_max;
>
> /*
> * find_qhead - Find a queue head structure with the matching key
> @@ -3061,7 +3090,7 @@ find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
> * contention with no hash bucket collision.
> */
> static inline struct futex_q_head *
> -qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
> +qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
> {
> struct futex_q_head *qh = NULL;
> static const struct futex_q_head qh0 = { { 0 } };
> @@ -3073,10 +3102,16 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
>
> /*
> * Initialize the queue head structure
> + * The lock owner field may be NULL if the task has released the lock
> + * and exit.
> */
> if (qh) {
> - *qh = qh0;
> - qh->key = *key;
> + *qh = qh0;
> + qh->key = *key;
> + qh->otid = FUTEX_TID(uval);
> + qh->owner = futex_find_get_task(qh->otid);
> + if (unlikely(!qh->owner))
> + qh->otid = 0;
> atomic_set(&qh->lcnt, 1);
> INIT_LIST_HEAD(&qh->waitq);
> spin_lock_init(&qh->wlock);
All this can be a single qh setup function.
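E.g. a helper sketched from the code above (name and placement are
illustrative):

static void qhead_init(struct futex_q_head *qh, union futex_key *key, u32 uval)
{
	memset(qh, 0, sizeof(*qh));
	qh->key   = *key;
	qh->otid  = FUTEX_TID(uval);
	/* The owner may already have released the lock and exited */
	qh->owner = futex_find_get_task(qh->otid);
	if (!qh->owner)
		qh->otid = 0;
	atomic_set(&qh->lcnt, 1);
	INIT_LIST_HEAD(&qh->waitq);
	spin_lock_init(&qh->wlock);
}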
> @@ -3120,9 +3155,11 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
> /*
> * Free the queue head structure
> */
> - BUG_ON(!list_empty(&qh->waitq));
> + BUG_ON(!list_empty(&qh->waitq) || qh->osq);
> list_del(&qh->hnode);
> spin_unlock(&hb->lock);
> + if (qh->owner)
> + put_task_struct(qh->owner);
>
> if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
> return;
> @@ -3134,14 +3171,19 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
> * Return: 1 if successful or an error happen
> * 0 otherwise
> *
> + * Optimistic spinning is done without holding lock, but with page fault
> + * explicitly disabled. So different functions need to be used to access
> + * the userspace futex value.
> + *
> * Side effect: The uval and ret will be updated.
> */
> static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
> - int *pret, u32 vpid)
> + int *pret, u32 vpid, bool spinning)
> {
> - u32 old;
> + u32 old;
>
> - *pret = get_futex_value_locked(puval, uaddr);
> + *pret = spinning ? __copy_from_user_inatomic(puval, uaddr, sizeof(u32))
> + : get_futex_value_locked(puval, uaddr);
> if (*pret)
> return 1;
>
> @@ -3150,18 +3192,102 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>
> old = *puval;
>
> - *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> + *pret = spinning
> + ? futex_atomic_cmpxchg_inatomic(puval, uaddr, old, vpid)
> + : cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> +
> if (*pret)
> return 1;
> if (*puval == old) {
> /* Adjust uval to reflect current value */
> - *puval = vpid | old;
> + *puval = spinning ? vpid : (vpid | old);
> return 1;
> }
> return 0;
> }
>
> /*
> + * futex_optspin - optimistic spinning loop
> + * Return: 1 if lock successfully acquired
> + * 0 if need to fall back to waiting
> + *
> + * Page fault and preemption are disabled in the optimistic spinning
> + * loop. Preemption should have been disabled before calling this function.
> + *
> + * The number of spinners may temporarily exceed the threshold due to
> + * racing (the spin count check and add aren't atomic), but that shouldn't
> + * be a big problem.
> + */
> +static inline int futex_optspin(struct futex_q_head *qh,
> + struct futex_q_node *qn,
> + u32 __user *uaddr,
> + u32 vpid)
> +{
> + u32 uval;
> + int ret, gotlock = false;
> + struct task_struct *owner;
> +
> + /*
> + * Increment the spinner count
> + */
> + atomic_add(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> + if (!osq_lock(&qh->osq)) {
> + atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> + return gotlock;
> + }
> + pagefault_disable();
How about a comment as to why pf is disabled?
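For instance (suggested wording only):

	/*
	 * Page faults are disabled so that the *_inatomic() accesses to
	 * the user space futex word fail with -EFAULT instead of sleeping
	 * in the fault handler while we sit in the OSQ; a fault here just
	 * makes us fall back to the wait path.
	 */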
> + for (;; cpu_relax()) {
while(true) {
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid, true)) {
> + /*
> + * Fall back to waiting if an error happen
> + */
> + if (ret)
> + break;
> + qh->otid = vpid;
> + owner = xchg(&qh->owner, qn->task);
> + get_task_struct(qn->task);
> + if (owner)
> + put_task_struct(owner);
> + gotlock = true;
> + break;
> + } else if (unlikely(FUTEX_HAS_WAITERS(uval))) {
Branch prediction hints have a time and place; please do not use
likely/unlikely for just anything.
> + /*
> + * Try to turn off the waiter bit as it now has a
> + * spinner. It doesn't matter if it fails as it will
> + * try again in the next iteration.
> + */
> + (void)futex_atomic_cmpxchg_inatomic
> + (&uval, uaddr, uval, uval & ~FUTEX_WAITERS);
> + }
> +
> + if (unlikely(FUTEX_TID(uval) != qh->otid)) {
> + /*
> + * Owner has changed
> + */
> + qh->otid = FUTEX_TID(uval);
> + owner = xchg(&qh->owner, futex_find_get_task(qh->otid));
> + if (owner)
> + put_task_struct(owner);
> + }
> + owner = ACCESS_ONCE(qh->owner);
> + if ((owner && !ACCESS_ONCE(owner->on_cpu)) || need_resched())
> + break;
> + }
> + pagefault_enable();
> + osq_unlock(&qh->osq);
> + atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> +
> + /*
> + * If we fell out of the spin path because of need_resched(),
> + * reschedule now.
> + */
> + if (!gotlock && need_resched())
> + schedule_preempt_disabled();
> +
> + return gotlock;
> +}
> +
> +/*
> * futex_spin_lock
> */
> static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> @@ -3170,6 +3296,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> struct futex_q_head *qh = NULL;
> struct futex_q_node qnode;
> union futex_key key;
> + struct task_struct *owner;
> bool gotlock;
> int ret, cnt;
> u32 uval, vpid, old;
> @@ -3193,7 +3320,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> * Check the futex value under the hash bucket lock as it might
> * be changed.
> */
> - if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid, false))
> goto hbunlock_out;
>
> if (!qh) {
> @@ -3201,7 +3328,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> * First waiter:
> * Allocate a queue head structure & initialize it
> */
> - qh = qhead_alloc_init(hb, &key);
> + qh = qhead_alloc_init(hb, &key, uval);
> if (unlikely(!qh)) {
> ret = -ENOMEM;
> goto hbunlock_out;
> @@ -3212,9 +3339,18 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> spin_unlock(&hb->lock);
>
> /*
> - * Put the task into the wait queue and sleep
> + * Perform optimisitic spinning if the owner is running.
> */
> preempt_disable();
> + owner = ACCESS_ONCE(qh->owner);
> + if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
> + (!owner || owner->on_cpu))
> + if (futex_optspin(qh, &qnode, uaddr, vpid))
> + goto penable_out;
> +
> + /*
> + * Put the task into the wait queue and sleep
> + */
> get_task_struct(qnode.task);
> spin_lock(&qh->wlock);
> list_add_tail(&qnode.wnode, &qh->waitq);
> @@ -3238,6 +3374,11 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> goto dequeue;
> } else if (uval == old) {
> gotlock = true;
> + qh->otid = vpid;
> + owner = xchg(&qh->owner, qnode.task);
> + get_task_struct(qnode.task);
> + if (owner)
> + put_task_struct(owner);
> goto dequeue;
> }
> continue;
> @@ -3286,15 +3427,17 @@ dequeue:
> }
> }
> spin_unlock(&qh->wlock);
> +
> +penable_out:
> preempt_enable();
>
> cnt = atomic_dec_return(&qh->lcnt);
> if (cnt == 0)
> qhead_free(qh, hb);
> /*
> - * Need to set the waiters bit there are still waiters
> + * Need to set the waiters bit there no spinner running
> */
> - else if (!ret)
> + else if (!ret && ((cnt >> FUTEX_SPINCNT_SHIFT) == 0))
> ret = put_user(vpid | FUTEX_WAITERS, uaddr);
> out:
> put_futex_key(&key);
> @@ -3356,6 +3499,13 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
> }
>
> /*
> + * Clear the owner field
> + */
> + if ((qh->owner == current) &&
> + (cmpxchg(&qh->owner, current, NULL) == current))
> + put_task_struct(current);
> +
> + /*
> * The hash bucket lock is being hold while retrieving the task
> * structure from the queue head to make sure that the qh structure
> * won't go away under the hood.
> @@ -3520,6 +3670,10 @@ static int __init futex_init(void)
>
> futex_detect_cmpxchg();
>
> + futex_spincnt_max = num_possible_cpus()/2;
> + if (futex_spincnt_max > FUTEX_SPINCNT_MAX)
> + futex_spincnt_max = FUTEX_SPINCNT_MAX;
This threshold needs commenting as well.
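Something like the following would capture the intent already stated in
the comment above futex_spincnt_max (suggested wording only):

	/*
	 * Cap the number of concurrent spinners at half of the possible
	 * CPUs, and never more than FUTEX_SPINCNT_MAX, so that spinning
	 * waiters cannot tie up every CPU in the system.
	 */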
Thanks,
Davidlohr
On 7/21/14, 9:45, "Andi Kleen" <[email protected]> wrote:
>Andi Kleen <[email protected]> writes:
>
>> Waiman Long <[email protected]> writes:
>>
>>> This patch series introduces two new futex command codes to support
>>> a new optimistic spinning futex for implementing an exclusive lock
>>> primitive that should perform better than the same primitive using
>>> the wait-wake futex in cases where the lock owner is actively working
>>> instead of waiting for I/O completion.
>>
>> How would you distinguish those two cases automatically?
>
>Also BTW traditionally the spinning was just done in user space.
>
>This would be always even more efficient, because it would
>even avoid the syscall entry path.
>
>Given the glibc adaptive mutexes are not very good, but I'm sure those
>could be improved.
I presented on something along these lines a few years back, and various
people have asked for the sample code over the years. Andi is right, doing
this from user-space has the potential to be more efficient, the challenge
is getting access to privileged information, such as the state of the
mutex owner. You don't want to spin if the owner is sleeping. I did this
by adding a tidrunning call to the vdso. My somewhat hacked solution is
still available here:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/tidrunning/dev
I abandoned the spinning in the kernel thing due to the overhead of the
system call if I remember correctly. Also available here:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock-adaptive
--
Darren Hart Open Source Technology Center
[email protected] Intel Corporation
On 7/21/14, 10:20, "Darren Hart" <[email protected]> wrote:
>On 7/21/14, 9:45, "Andi Kleen" <[email protected]> wrote:
>
>>Andi Kleen <[email protected]> writes:
>>
>>> Waiman Long <[email protected]> writes:
>>>
>>>> This patch series introduces two new futex command codes to support
>>>> a new optimistic spinning futex for implementing an exclusive lock
>>>> primitive that should perform better than the same primitive using
>>>> the wait-wake futex in cases where the lock owner is actively working
>>>> instead of waiting for I/O completion.
>>>
>>> How would you distinguish those two cases automatically?
>>
>>Also BTW traditionally the spinning was just done in user space.
>>
>>This would be always even more efficient, because it would
>>even avoid the syscall entry path.
>>
>>Given the glibc adaptive mutexes are not very good, but I'm sure those
>>could be improved.
>
>I presented on something along these lines a few years back, and various
>people have asked for the sample code over the years. Andi is right,
>doing
>this from user-space has the potential to be more efficient, the
>challenge
>is getting access to privileged information, such as the state of the
>mutex owner. You don't want to spin if the owner is sleeping. I did this
>by adding a tidrunning call to the vdso. My somewhat hacked solution is
>still available here:
>
>http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/tidrunning/dev
>
>
>I abandoned the spinning in the kernel thing due to the overhead of the
>system call if I remember correctly. Also available here:
>
>http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock-adaptive
>
And the associated Linux Plumbers 2010 presentation which I can't seem to
find archived anywhere else:
http://dvhart.com/darren/files/kernel-assisted-userspace-adaptive-mutexes.pdf
We observed some significant improvements under some very specific use
cases, but a more thorough dive into performance impact in the other cases
as well as security implications with the vdso is still wanting.
--
Darren Hart Open Source Technology Center
[email protected] Intel Corporation
On Mon, 21 Jul 2014, Andi Kleen wrote:
> Andi Kleen <[email protected]> writes:
>
> > Waiman Long <[email protected]> writes:
> >
> >> This patch series introduces two new futex command codes to support
> >> a new optimistic spinning futex for implementing an exclusive lock
> >> primitive that should perform better than the same primitive using
> >> the wait-wake futex in cases where the lock owner is actively working
> >> instead of waiting for I/O completion.
> >
> > How would you distinguish those two cases automatically?
>
> Also BTW traditionally the spinning was just done in user space.
And traditionally that is based on heuristics, because user space
cannot figure out whether the lock owner is on the CPU or not.
And heuristics suck no matter what.
There have been attempts to expose that information to user space, but
that sucks as well, as it requires updates of that information from
the guts of the scheduler, does not scale at all and what's worse, it
has security implications.
Thanks,
tglx
On Mon, 21 Jul 2014, Darren Hart wrote:
> We observed some significant improvements under some very specific use
> cases, but a more thorough dive into performance impact in the other cases
> as well as security implications with the vdso is still wanting.
The security implication is that the feature can only be available for
process private futexes. There is no way to expose information which
crosses the process spaces.
But the way worse issue is storage.
While you can cache the namespace specific TID of a thread in the
task_struct, you still need a O(1) zero overhead mechanism to update
the thread state (only on/off cpu is interesting) in a per process
shared data structure from the guts of schedule()
For that you have basically two choices:
1) cpu_thread_id[NR_CPUS]
Simple to update from the scheduler, and a halfways moderate
storage size (NR_CPUS * 4 bytes) in the worst case, i.e. 16k
today. Set to 0 on scheduling out and to the namespace specific TID
on scheduling in.
But that requires a linear search in the user space spin loop. And
that's required for every iteration of the loop. Can you imagine
how well that works performance wise?
2) Bitmap threads_on_cpu
Again, simple to update from the scheduler, cache line bouncing
issues aside. Clear the bit on schedule out and set it on schedule
in.
But the bitmap needs the size of PID_MAX_LIMIT, which is a whopping
512k per process in the worst case.
Anything else would involve search/lookup schemes which are just
overkill in both the scheduler and the user space loop.
Now for enhanced fun you need immutable pages for that storage, as you
can't have pagefaults in the guts of schedule().
So once you find a way to make that opt-in, as you don't want to inflict
any of this on all processes by default, it might be a worthwhile
optimization. So the probably tolerable impact on schedule() would be
schedule_out()
if (curr->threads_on_cpu)
clear_bit(curr->ns_tid, curr->threads_on_cpu);
and
schedule_in()
if (curr->threads_on_cpu)
set_bit(curr->ns_tid, curr->threads_on_cpu);
Anything more complex is just going to defeat the whole purpose.
Thanks,
tglx
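To make the cost of option 2 concrete: the user-space spin loop would
have to test the owner's bit on every iteration, roughly like this
(purely illustrative; threads_on_cpu is the hypothetical per-process
bitmap described above, mapped read-only into user space):

#include <linux/futex.h>	/* FUTEX_TID_MASK */
#include <stdatomic.h>
#include <sys/types.h>

extern const unsigned long *threads_on_cpu;	/* updated by schedule() */

static int tid_on_cpu(pid_t tid)
{
	return (threads_on_cpu[tid / (8 * sizeof(long))] >>
		(tid % (8 * sizeof(long)))) & 1;
}

/* Spin only while the lock owner is on a CPU */
static int spin_for_lock(atomic_int *futex, pid_t self)
{
	int val;

	for (;;) {
		val = 0;
		if (atomic_compare_exchange_strong(futex, &val, self))
			return 1;	/* acquired */
		/* on failure the cmpxchg leaves the current value in val */
		if (!tid_on_cpu(val & FUTEX_TID_MASK))
			return 0;	/* owner off CPU: fall back to futex(2) */
	}
}

Every failed iteration pulls in a bitmap cacheline that the scheduler
keeps dirtying from other CPUs, which is the overhead Peter points out
below.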
On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
> primitive on the futex value when the lock owner is running. It is
> the same optimistic spinning technique that is done in the mutex and
> rw semaphore code to improve their performance especially on large
> systems with large number of CPUs. When the lock owner is not running,
> the spinning tasks will go to sleep.
Perhaps we could introduce a "CONFIG_FUTEX_SPIN_ON_OWNER" that depends
on SMP and ARCH_SUPPORTS_ATOMIC_RMW?
> There is 2 major advantages of doing optimistic spinning here:
> 1) It eliminates the context switching latency and overhead (at
> least a few us) associated with sleeping and wakeup.
> 2) It eliminates most of the need to call futex(2) with
> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
> the FUTEX_WAITERS bit.
> struct futex_q_head {
> struct list_head hnode;
> struct list_head waitq;
> union futex_key key;
> + struct optimistic_spin_queue *osq;
And this would have to be updated to
+ struct optimistic_spin_queue osq;
given the recent changes to the osq lock.
* Waiman Long <[email protected]> wrote:
> Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
> showed the following performance data (average kops/s) with various
> load factor (number of pause instructions) used in the critical
> section using an userspace mutex microbenchmark.
>
> Threads Load Waiting Futex Spinning Futex %Change
> ------- ---- ------------- -------------- -------
> 256 1 6894 8883 +29%
> 256 10 3656 4912 +34%
> 256 50 1332 4358 +227%
> 256 100 792 2753 +248%
> 10 1 6382 4838 -24%
> 10 10 3614 4748 +31%
> 10 50 1319 3900 +196%
> 10 100 782 2459 +214%
> 2 1 7905 7194 -9.0%
> 2 10 4556 4717 +3.5%
> 2 50 2191 4167 +90%
> 2 100 1767 2407 +36%
So the numbers look interesting - but it would be _really_ important
to provide noise/stddev figures in a sixth column as well (denoted in
percentage units, not in benchmark units), so that we know how
significant a particular speedup (or slowdown) is.
Thanks,
Ingo
On Mon, Jul 21, 2014 at 10:16:37PM +0200, Thomas Gleixner wrote:
> On Mon, 21 Jul 2014, Darren Hart wrote:
> > We observed some significant improvements under some very specific use
> > cases, but a more thorough dive into performance impact in the other cases
> > as well as security implications with the vdso is still wanting.
>
> The security implication is that the feature can only be available for
> process private futexes. There is no way to expose information which
> crosses the process spaces.
>
> But the way worse issue is storage.
>
> While you can cache the namespace specific TID of a thread in the
> task_struct, you still need a O(1) zero overhead mechanism to update
> the thread state (only on/off cpu is interesting) in a per process
> shared data structure from the guts of schedule()
>
> For that you have basically two choices:
>
> 1) cpu_thread_id[NR_CPUS]
>
> Simple to update from the scheduler, and a halfways moderate
> storage size (NR_CPUS * 4 bytes) in the worst case, i.e. 16k
> today. Set to 0 on scheduling out and to the namespace specific TID
> on scheduling in.
>
> But that requires a linear search in the user space spin loop. And
> that's required for every iteration of the loop. Can you imagine
> how well that works performance wise?
>
> 2) Bitmap threads_on_cpu
>
> Again, simple to update from the scheduler, cache line bouncing
> issues aside. Clear the bit on schedule out and set it on schedule
> in.
>
> But the bitmap needs the size of PID_MAX_LIMIT, which is a whopping
> 512k per process in the worst case.
>
> Anything else would involve search/lookup schemes which are just
> overkill in both the scheduler and the user space loop.
>
> Now for enhanced fun you need immutable pages for that storage, as you
> can't have pagefaults in the guts of schedule().
>
> So once you found a way to make that opt-in as you don't want inflict
> any of this to all processes by default, it might be a worthwhile
> optimization. So the probably tolerable impact on schedule() would be
>
> schedule_out()
> if (curr->threads_on_cpu)
> clear_bit(curr->ns_tid, curr->threads_on_cpu);
> and
>
> schedule_in()
> if (curr->threads_on_cpu)
> clear_bit(curr->ns_tid, curr->threads_on_cpu);
>
> Anything more complex is just going to defeat the whole purpose.
All this is predicated on the fact that syscalls are 'expensive'.
Weren't syscalls only 100s of cycles? All this bitmap mucking is far
more expensive due to cacheline misses, which due to the size of the
things is almost guaranteed.
On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jul 21, 2014 at 10:16:37PM +0200, Thomas Gleixner wrote:
>> On Mon, 21 Jul 2014, Darren Hart wrote:
>> > We observed some significant improvements under some very specific use
>> > cases, but a more thorough dive into performance impact in the other cases
>> > as well as security implications with the vdso is still wanting.
>>
>> The security implication is that the feature can only be available for
>> process private futexes. There is no way to expose information which
>> crosses the process spaces.
>>
>> But the way worse issue is storage.
>>
>> While you can cache the namespace specific TID of a thread in the
>> task_struct, you still need a O(1) zero overhead mechanism to update
>> the thread state (only on/off cpu is interesting) in a per process
>> shared data structure from the guts of schedule()
>>
>> For that you have basically two choices:
>>
>> 1) cpu_thread_id[NR_CPUS]
>>
>> Simple to update from the scheduler, and a halfways moderate
>> storage size (NR_CPUS * 4 bytes) in the worst case, i.e. 16k
>> today. Set to 0 on scheduling out and to the namespace specific TID
>> on scheduling in.
>>
>> But that requires a linear search in the user space spin loop. And
>> that's required for every iteration of the loop. Can you imagine
>> how well that works performance wise?
>>
>> 2) Bitmap threads_on_cpu
>>
>> Again, simple to update from the scheduler, cache line bouncing
>> issues aside. Clear the bit on schedule out and set it on schedule
>> in.
>>
>> But the bitmap needs the size of PID_MAX_LIMIT, which is a whopping
>> 512k per process in the worst case.
>>
>> Anything else would involve search/lookup schemes which are just
>> overkill in both the scheduler and the user space loop.
>>
>> Now for enhanced fun you need immutable pages for that storage, as you
>> can't have pagefaults in the guts of schedule().
>>
>> So once you found a way to make that opt-in as you don't want inflict
>> any of this to all processes by default, it might be a worthwhile
>> optimization. So the probably tolerable impact on schedule() would be
>>
>> schedule_out()
>> if (curr->threads_on_cpu)
>> clear_bit(curr->ns_tid, curr->threads_on_cpu);
>> and
>>
>> schedule_in()
>> if (curr->threads_on_cpu)
>> clear_bit(curr->ns_tid, curr->threads_on_cpu);
>>
>> Anything more complex is just going to defeat the whole purpose.
>
> All this is predicated on the fact that syscalls are 'expensive'.
> Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> more expensive due to cacheline misses, which due to the size of the
> things is almost guaranteed.
120 - 300 cycles for me, unless tracing happens, and I'm working on
reducing the incidence of tracing.
--Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andy Lutomirski
AMA Capital Management, LLC
On Mon, 21 Jul 2014, Ingo Molnar wrote:
> * Waiman Long <[email protected]> wrote:
>
> > Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
> > showed the following performance data (average kops/s) with various
> > load factor (number of pause instructions) used in the critical
> > section using an userspace mutex microbenchmark.
> >
> > Threads Load Waiting Futex Spinning Futex %Change
> > ------- ---- ------------- -------------- -------
> > 256 1 6894 8883 +29%
> > 256 10 3656 4912 +34%
> > 256 50 1332 4358 +227%
> > 256 100 792 2753 +248%
> > 10 1 6382 4838 -24%
> > 10 10 3614 4748 +31%
> > 10 50 1319 3900 +196%
> > 10 100 782 2459 +214%
> > 2 1 7905 7194 -9.0%
> > 2 10 4556 4717 +3.5%
> > 2 50 2191 4167 +90%
> > 2 100 1767 2407 +36%
>
> So the numbers look interesting - but it would be _really_ important
> to provide noise/sttdev figures in a sixth column as well (denoted in
> percentage units, not in benchmark units), so that we know how
> significant a particular speedup (or slowdown) is.
We care about that, once we have something which
- Has a proper design
- Covers all the corner cases of futexes
- Does not introduce gazillions of new lines of code in futex.c
- Does not create another set of subtle security issues. I'm so not
interested to do the same exercise again - my brain still suffers.
The numbers provided are completely irrelevant as the implementation
is just the most idiotic approach to avoid all corner cases of
futexes, error handling and the proper treatment of detached kernel
state for the price of adding a completely unreviewable clusterfuck to
futex.c.
So, no. Don't encourage that number wankery any further. It's going
nowhere, period.
Thanks,
tglx
On Mon, 21 Jul 2014, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 10:16:37PM +0200, Thomas Gleixner wrote:
> > On Mon, 21 Jul 2014, Darren Hart wrote:
> > > We observed some significant improvements under some very specific use
> > > cases, but a more thorough dive into performance impact in the other cases
> > > as well as security implications with the vdso is still wanting.
> >
> > The security implication is that the feature can only be available for
> > process private futexes. There is no way to expose information which
> > crosses the process spaces.
> >
> > But the way worse issue is storage.
> >
> > While you can cache the namespace specific TID of a thread in the
> > task_struct, you still need a O(1) zero overhead mechanism to update
> > the thread state (only on/off cpu is interesting) in a per process
> > shared data structure from the guts of schedule()
> >
> > For that you have basically two choices:
> >
> > 1) cpu_thread_id[NR_CPUS]
> >
> > Simple to update from the scheduler, and a halfways moderate
> > storage size (NR_CPUS * 4 bytes) in the worst case, i.e. 16k
> > today. Set to 0 on scheduling out and to the namespace specific TID
> > on scheduling in.
> >
> > But that requires a linear search in the user space spin loop. And
> > that's required for every iteration of the loop. Can you imagine
> > how well that works performance wise?
> >
> > 2) Bitmap threads_on_cpu
> >
> > Again, simple to update from the scheduler, cache line bouncing
> > issues aside. Clear the bit on schedule out and set it on schedule
> > in.
> >
> > But the bitmap needs the size of PID_MAX_LIMIT, which is a whopping
> > 512k per process in the worst case.
> >
> > Anything else would involve search/lookup schemes which are just
> > overkill in both the scheduler and the user space loop.
> >
> > Now for enhanced fun you need immutable pages for that storage, as you
> > can't have pagefaults in the guts of schedule().
> >
> > So once you found a way to make that opt-in as you don't want inflict
> > any of this to all processes by default, it might be a worthwhile
> > optimization. So the probably tolerable impact on schedule() would be
> >
> > schedule_out()
> > if (curr->threads_on_cpu)
> > clear_bit(curr->ns_tid, curr->threads_on_cpu);
> > and
> >
> > schedule_in()
> > if (curr->threads_on_cpu)
> > clear_bit(curr->ns_tid, curr->threads_on_cpu);
> >
> > Anything more complex is just going to defeat the whole purpose.
>
> All this is predicated on the fact that syscalls are 'expensive'.
> Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> more expensive due to cacheline misses, which due to the size of the
> things is almost guaranteed.
I completely agree.
This wants to backed by proper numbers taken from a proper
implementation and not from some randomly cobbled together works for
me hackery.
As I said: It might be a worthwhile optimization....
Thanks,
tglx
On Mon, 21 Jul 2014, Andy Lutomirski wrote:
> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]> wrote:
> > All this is predicated on the fact that syscalls are 'expensive'.
> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> > more expensive due to cacheline misses, which due to the size of the
> > things is almost guaranteed.
>
> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> reducing the incidence of tracing.
So it's a non issue indeed and definitely not worth the trouble of
that extra storage, the scheduler overhead, etc.
Summary: Once you can't take it atomically in user space, you've lost
anyway. And we are better off to do the magic spinning in
kernel where we have all the information accessible already.
Thanks,
tglx
On 7/21/14, 14:47, "Thomas Gleixner" <[email protected]> wrote:
>On Mon, 21 Jul 2014, Andy Lutomirski wrote:
>> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]>
>>wrote:
>> > All this is predicated on the fact that syscalls are 'expensive'.
>> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
>> > more expensive due to cacheline misses, which due to the size of the
>> > things is almost guaranteed.
>>
>> 120 - 300 cycles for me, unless tracing happens, and I'm working on
>> reducing the incidence of tracing.
>
>So it's a non issue indeed and definitely not worth the trouble of
>that extra storage, the scheduler overhead, etc.
>
>Summary: Once you can't take it atomically in user space, you've lost
> anyway. And we are better off to do the magic spinning in
> kernel where we have all the information accessible already.
And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
your and other's comments:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock/v7
I can work on forward porting this series to current mainline (post recent
security fixes) and cleaning up the commentary and such if people are
interested in seeing this implementation (based on Peter Z's spinning
mutex work iirc) resurrected...
--
Darren Hart Open Source Technology Center
[email protected] Intel Corporation
On Mon, 2014-07-21 at 14:31 -0700, Andy Lutomirski wrote:
> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]> wrote:
> > All this is predicated on the fact that syscalls are 'expensive'.
> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> > more expensive due to cacheline misses, which due to the size of the
> > things is almost guaranteed.
>
> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> reducing the incidence of tracing.
fwiw here's what lmbench's lat_ctx says on my system. For 'accuracy', I
kept the runs short.
http://www.stgolabs.net/lat_ctx.png
On Mon, 21 Jul 2014, Darren Hart wrote:
> On 7/21/14, 14:47, "Thomas Gleixner" <[email protected]> wrote:
>
> >On Mon, 21 Jul 2014, Andy Lutomirski wrote:
> >> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]>
> >>wrote:
> >> > All this is predicated on the fact that syscalls are 'expensive'.
> >> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> >> > more expensive due to cacheline misses, which due to the size of the
> >> > things is almost guaranteed.
> >>
> >> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> >> reducing the incidence of tracing.
> >
> >So it's a non issue indeed and definitely not worth the trouble of
> >that extra storage, the scheduler overhead, etc.
> >
> >Summary: Once you can't take it atomically in user space, you've lost
> > anyway. And we are better off to do the magic spinning in
> > kernel where we have all the information accessible already.
>
> And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
> discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
> your and other's comments:
>
> http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock/v7
>
>
> I can work on forward porting this series to current mainline (post recent
> security fixes) and cleaning up the commentary and such if people are
> interested in seeing this implementation (based on Peter Z's spinning
> mutex work iirc) resurrected...
No. We really want to avoid more magic hackery in the futex code.
Let's sit down and think about what we need:
1) Support for all existing features
2) More fine grained locking
3) Optimistic spinning
@1: No discussion about that. Period.
We are not going to introduce some new futex opcode, which is
optimized for a microbenchmark and ignoring all of the pain we
went through in the last 10 years. No way, especially after the
recent security disaster.
@2 and @3: Yes, we want that.
But again, we don't add fine grained locking just for some half
baked new opcode. No, we are adding it where it makes the most
sense and lets us reuse most of the code.
I can predict your question, how that should work :)
If we want to have fine grained locking beyond the futex hash
buckets and that's something we discussed before, you need a state
representation of the user space futex in the kernel. That's what
Waiman added as well, just in a way that's beyond repair.
The charm of the futex hash buckets is that they do not create
kernel state because the futex_q which is enqueued into the bucket
is on the task stack of the task which is blocked on the futex.
So much for the theory, because that stopped being true when we
added support for PI futexes. They already create kernel internal
state which is not magically removed when the thread exits the
syscall. It's called pi_state.
Now the question is how is this related to the non PI magic
spinning optimization? Very much so.
Simply because the handling of pi_state already covers ALL the
corner cases which come with the extra kernel internal state and
they provide the full function set of futexes required by the
various (ab)use cases including the requeue functionality which is
required for condvars. It even supports caching of the pi_state
struct in a way which makes sense, rather than blindly slapping a
cached entry onto each hash bucket.
So it's not a really mind boggling thought experiment to think
about this magic new feature as a subset of the already existing
PI futex code. It is a subset, just minus the PI portion plus the
more fine grained locking.
So the right solution is to rework the existing code.
1) Make pi_state the entity which gets enqueued into the hash bucket
and manage the waiters in a pi_state internal queue.
Straight forward problem as it is still protected by the hash
bucket lock.
2) Add private locking to the pi_state to reduce the lock contention
on the hash bucket lock.
Straight forward conversion as well
3) Make mutex a subset of rt_mutex
That makes a lot of sense, simply because a mutex is the same as a
rtmutex just without the PI part.
Just for the record before anyone cries murder: The original mutex
implementation was modeled after the rtmutex one in the first place
and the optimistic spinning was first introduced on rtmutexes by
Gregory Haskins et. al. in the context of the RT tree. Back then it
was too early or we were simply too stupid to see the similarities
and opted for a complete separate implementation.
Yes, I'm aware that this is a non-trivial problem, but if you look
at the details, then the non-trivial part is in the slow path where
stuff actually goes to sleep. The fast path of mutex and rt_mutex
is simply the same. So there is no reason to keep them separate. An
extra conditional in the slow path is not going to hurt at all.
Surely, you might say, that I'm an egoistic bastard, who just wants
to have the benefit of MCS for rtmutex for free. Right you are, but
there is a whole bunch of reasons why this makes tons of sense:
- As I said above, it's almost the same, except for the slow path
details, where a few extra conditionals do not matter
- Sharing the code has the obvious benefits. And it's not only
sharing between rtmutex and mutex, it's also sharing the
underlying optimizations of MCS for the kernel internal users and
the user space exposed ones, i.e. futexes.
- Once unified the desired futex functionality just works with a
minimal impact on the futex code itself.
The futex code does not care whether the underlying primitive has
PI semantics or not, it just cares that all the required features
like proxy locking etc. are in place. And that everything like
timeouts, requeue etc. is just supported.
- The semantics of the new futex functionality does not have to be
defined again. We can simply reuse the rather well defined and
well tested straight forward semantics of the existing PI
opcodes.
- That minimizes the risk massively, because all state and error
handling is common code.
- Any update to the underlying primitives will benefit ALL usage
sites and reduces the copy and paste induced hell of MCS code
sprinkled all over the kernel.
4) Enable user space to make use of it.
Whether this will be a separate opcode or just a new flag to the
existing PI opcodes is going to be just a bikeshedding question at
that point. :)
The important point is, that we do NOT grow any new unmaintainable
warts to the futex code. And the 650+ new lines of hackery which come
with the proposed patch set are in no way acceptable, especially as
they only cover the minimalistic functionality set of futexes. Not to
talk about the other issues I've observed on the first glance.
So before anyone comes up with a "solution" for all of this tomorrow
afternoon in the form of another half-baked patch, please sit back, mull it
in your head and let's have a proper discussion about the approach
first.
Thanks,
tglx
On Tue, 22 Jul 2014 03:01:00 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> On Mon, 21 Jul 2014, Darren Hart wrote:
> > On 7/21/14, 14:47, "Thomas Gleixner" <[email protected]> wrote:
> >
> > >On Mon, 21 Jul 2014, Andy Lutomirski wrote:
> > >> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]>
> > >>wrote:
> > >> > All this is predicated on the fact that syscalls are 'expensive'.
> > >> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> > >> > more expensive due to cacheline misses, which due to the size of the
> > >> > things is almost guaranteed.
> > >>
> > >> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> > >> reducing the incidence of tracing.
> > >
> > >So it's a non issue indeed and definitely not worth the trouble of
> > >that extra storage, the scheduler overhead, etc.
> > >
> > >Summary: Once you can't take it atomically in user space, you've lost
> > > anyway. And we are better off to do the magic spinning in
> > > kernel where we have all the information accessible already.
I just want to point out that I was having a very nice conversation
with Robert Haas (Cc'd) in Napa Valley at Linux Collaboration about
this very topic. Robert is a PostgreSQL developer who told me that they
implement their spin locks completely in userspace (no futex, just raw
spinning on shared memory). This is because the sleep on contention of a
futex has been shown to be very expensive in their benchmarks. His work is
not a micro benchmark but for a very popular database where locking is
crucial.
I was telling Robert that if futexes get optimistic spinning, he should
reconsider their use of userspace spinlocks in favor of this, because
I'm pretty sure that they will see a great improvement.
Now Robert will be the best one to answer if the system call is indeed
more expensive than doing full spins in userspace. If the spin is done
in the kernel and they still get better performance by just spinning
blindly in userspace even if the owner is asleep, I think we will have
our answer.
Note, I believe they only care about shared threads, and this
optimistic spinning does not need to be something done between
processes.
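For reference, a PostgreSQL-style user-space spinlock is essentially a
bare test-and-set loop with some backoff and no kernel involvement at
all; a simplified sketch (not PostgreSQL's actual s_lock code):

#include <sched.h>
#include <stdatomic.h>

static void raw_spin_lock(atomic_flag *lock)
{
	int spins = 0;

	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
		if (++spins > 1000) {		/* arbitrary backoff threshold */
			sched_yield();		/* real code may usleep() instead */
			spins = 0;
		}
	}
}

static void raw_spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

The failure mode Mike describes below is exactly this loop burning
cycles after the owner has been preempted, because nothing tells the
spinners that the owner is off CPU.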
> >
> > And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
> > discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
> > your and other's comments:
> >
> > http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/futex-lock/v7
> >
> >
> > I can work on forward porting this series to current mainline (post recent
> > security fixes) and cleaning up the commentary and such if people are
> > interested in seeing this implementation (based on Peter Z's spinning
> > mutex work iirc) resurrected...
>
> No. We really want to avoid more magic hackery in the futex code.
>
> Lets sit down and think about what we need:
>
> 1) Support for all existing features
>
> 2) More fine grained locking
>
> 3) Optimistic spinning
>
> @1: No discussion about that. Period.
>
> We are not going to introduce some new futex opcode, which is
> optimized for a microbenchmark and ignoring all of the pain we
> went through in the last 10 years. No way, especially after the
> recent security disaster.
>
> @2 and @3: Yes, we want that.
>
> But again, we don't add fine grained locking just for some half
> baken new opcode. No, we are adding it where it makes the most
> sense and lets us reuse most of the code.
>
> I can predict your question, how that should work :)
>
> If we want to have fine grained locking beyond the futex hash
> buckets and that's something we discussed before, you need a state
> representation of the user space futex in the kernel. That's what
> Waiman added as well, just in a way that's beyond repair.
>
> The charm of the futex hash buckets is that they do not create
> kernel state because the futex_q which is enqueued into the bucket
> is on the task stack of the task which is blocked on the futex.
>
> So much for the theory, because that stopped to be true when we
> added support for PI futexes. They already create kernel internal
> state which is not magically removed when the thread exits the
> syscall. It's called pi_state.
>
> Now the question is how is this related to the non PI magic
> spinning optimization? Very much so.
>
> Simply because the handling of pi_state already covers ALL the
> corner cases which come with the extra kernel internal state and
> they provide the full function set of futexes required by the
> various (ab)use cases including the requeue functionality which is
> required for condvars. It even supports caching of the pi_state
> struct in a way which makes sense, rather than blindly slapping a
> cached entry onto each hash bucket.
>
> So it's not a really mind boggling thought experiment to think
> about this magic new feature as a subset of the already existing
> PI futex code. It is a subset, just minus the PI portion plus the
> more fine grained locking.
>
> So the right solution is to rework the existing code.
>
> 1) Make pi_state the entity which gets enqueued into the hash bucket
> and manage the waiters in a pi_state internal queue.
>
> Straight forward problem as it is still protected by the hash
> bucket lock.
>
> 2) Add private locking to the pi_state to reduce the lock contention
> on the hash bucket lock.
>
> Straight forward conversion as well
>
> 3) Make mutex a subset of rt_mutex
+1
>
> That makes a lot of sense, simply because a mutex is the same as a
> rtmutex just without the PI part.
>
> Just for the record before anyone cries murder: The original mutex
> implemention was modeled after the rtmutex one in the first place
> and the optimistic spinning was first introduced on rtmutexes by
> Gregory Haskins et. al. in the context of the RT tree. Back then it
> was too early or we were simply too stupid to see the similarities
> and opted for a complete separate implementation.
I believe it was more about politics as to why it was a separate solution.
Technically, I think they should have been one.
>
> Yes, I'm aware that this is a non-trivial problem, but if you look
> at the details, then the non-trivial part is in the slow path were
> stuff actually goes to sleep. The fast path of mutex and rt_mutex
> is simply the same. So there is no reason to keep them separate. An
> extra conditional in the slow path is not going to hurt at all.
>
> Surely, you might say, that I'm an egoistic bastard, who just wants
> to have the benefit of MCS for rtmutex for free. Right you are, but
> there is a whole bunch of reasons why this makes tons of sense:
>
> - As I said above, it's almost the same, except for the slow path
> details, where a few extra conditionals do not matter
>
> - Sharing the code has the obvious benefits. And it's not only
> sharing between rtmutex and mutex, it's also sharing the
> underlying optimizations of MCS for the kernel internal users and
> the user space exposed ones, i.e. futexes.
>
> - Once unified the desired futex functionality just works with a
> minimal impact on the futex code itself.
>
> The futex code does not care whether the underlying primitive has
> PI semantics or not, it just cares that all the required features
> like proxy locking etc. are in place. And that everything like
> timeouts, requeue etc. is just supported.
>
> - The semantics of the new futex functionality has not to be
> defined again. We can simply reuse the rather well defined and
> well tested straight forward semantics of the existing PI
> opcodes.
>
> - That minimizes the risk massively, because all state and error
> handling is common code.
>
> - Any update to the underlying primitives will benefit ALL usage
> sites and reduces the copy and paste induced hell of MCS code
> sprinkled all over the kernel.
>
> 4) Enable user space to make use of it.
>
> Whether this will be a separate opcode or just a new flag to the
> existing PI opcodes is going to be just a bikeshedding question at
> that point. :)
>
> The important point is, that we do NOT grow any new unmaintainable
> warts to the futex code. And the 650+ new lines of hackery which come
> with the proposed patch set are in no way acceptable, especially as
> they only cover the minimalistic functionality set of futexes. Not to
> talk about the other issues I've observed on the first glance.
>
> So before anyone comes up with a "solution" for all of this tomorrow
> afternoon in form of another half baken patch, please sit back mull it
> in your head and lets have a proper discussion about the approach
> first.
>
If we can have that schedule_out()/schedule_in() threads_on_cpu update,
it would be a helpful way to see whether the system call is indeed
expensive. Sounds like we could have both an in-kernel solution and a
userspace one as well.
-- Steve
On Mon, 2014-07-21 at 21:34 -0400, Steven Rostedt wrote:
> I was telling Robert that if futexes get optimistic spinning, he should
> reconsider their use of userspace spinlocks in favor of this, because
> I'm pretty sure that they will see a great improvement.
My (dated) experience with pgsql says you're likely right.
Once upon a time, preempting a userspace spinlock owner caused pgsql to
collapse into a quivering heap. The scheduler trying to hand the CPU
back to a preempted task instead of selecting what was strictly speaking
the most deserving task afterward (LAST_BUDDY) let pgsql+oltp scale nice
and flat instead of self destructing, but the root cause of collapse was
spinners lacking knowledge. With spin done in kernel, you can know
when spinning is a waste of perfectly good cycles.
-Mike
On Mon, 2014-07-21 at 21:34 -0400, Steven Rostedt wrote:
> On Tue, 22 Jul 2014 03:01:00 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
>
> > On Mon, 21 Jul 2014, Darren Hart wrote:
> > > On 7/21/14, 14:47, "Thomas Gleixner" <[email protected]> wrote:
> > >
> > > >On Mon, 21 Jul 2014, Andy Lutomirski wrote:
> > > >> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]>
> > > >>wrote:
> > > >> > All this is predicated on the fact that syscalls are 'expensive'.
> > > >> > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> > > >> > more expensive due to cacheline misses, which due to the size of the
> > > >> > things is almost guaranteed.
> > > >>
> > > >> 120 - 300 cycles for me, unless tracing happens, and I'm working on
> > > >> reducing the incidence of tracing.
> > > >
> > > >So it's a non issue indeed and definitely not worth the trouble of
> > > >that extra storage, the scheduler overhead, etc.
> > > >
> > > >Summary: Once you can't take it atomically in user space, you've lost
> > > > anyway. And we are better off to do the magic spinning in
> > > > kernel where we have all the information accessible already.
>
> I just want to point out that I was having a very nice conversation
> with Robert Haas (Cc'd) in Napa Valley at Linux Collaboration about
> this very topic. Robert is a PostgeSQL developer who told me that they
> implement their spin locks completely in userspace (no futex, just raw
> spinning on shared memory). This is because the sleep on contention of a
> futex has shown to be very expensive in their benchmarks.
Well sure, hashing, plist handling, wakeups etc. (not to mention
dirtying cachelines). The "fast" in futex is only due to the fact that
there is no jump to kernel space when the lock is uncontended after say,
a successful cmpxchg in user space. Once futex(2) is actually called,
performance is quite a different story.
> His work is
> not a micro benchmark but for a very popular database where locking is
> crucial.
>
> I was telling Robert that if futexes get optimistic spinning, he should
> reconsider their use of userspace spinlocks in favor of this, because
> I'm pretty sure that they will see a great improvement.
Agreed, with the critical regions being correctly sized for spinning
locks.
> Now Robert will be the best one to answer if the system call is indeed
> more expensive than doing full spins in userspace. If the spin is done
> in the kernel and they still get better performance by just spinning
> blindly in userspace even if the owner is asleep, I think we will have
> our answer.
>
> Note, I believe they only care about shared threads, and this
> optimistic spinning does not need to be something done between
> processes.
With the pi_state there would be no such distinction.
Thanks,
Davidlohr
On 07/21/2014 11:45 AM, Randy Dunlap wrote:
> On 07/21/2014 08:24 AM, Waiman Long wrote:
>> This patch adds a new document file on how to use the spinning futexes.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 109 insertions(+), 0 deletions(-)
>> create mode 100644 Documentation/spinning-futex.txt
>>
>> diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
>> new file mode 100644
>> index 0000000..e3cb5a2
>> --- /dev/null
>> +++ b/Documentation/spinning-futex.txt
>> @@ -0,0 +1,109 @@
>> +Started by: Waiman Long <[email protected]>
>> +
>> +Spinning Futex
>> +--------------
>> +
>> +There are two main problems for a wait-wake futex (FUTEX_WAIT and
>> +FUTEX_WAKE) when used for creating user-space lock primitives:
>> +
>> + 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
>> + in the futex queue to be woken up by the lock owner when it is done
>> + with the lock. Waking up a sleeping task, however, introduces some
>> + additional latency which can be large especially if the critical
>> + section protected by the lock is relatively short. This may cause
>> + a performance bottleneck on large systems with many CPUs running
>> + applications that need a lot of inter-thread synchronization.
>> +
>> + 2) The performance of the wait-wake futex is currently
>> + spinlock-constrained. When many threads are contending for a
>> + futex in a large system with many CPUs, it is not unusual to have
>> + spinlock contention accounting for more than 90% of the total
>> + CPU cycles consumed at various points in time.
>> +
>> +Spinning futex is a solution to both the wakeup latency and spinlock
>> +contention problems by optimistically spinning on a locked futex
>> +when the lock owner is running within the kernel until the lock is
>> +free. This is the same optimistic spinning mechanism used by the kernel
>> +mutex and rw semaphore implementations to improve performance. The
>> +optimistic spinning was done without taking any lock.
> is done
>
>> +
>> +Implementation
>> +--------------
>> +
>> +Like the PI and robust futexes, a lock acquirer has to atomically
>> +put its thread ID (TID) into the lower 30 bits of the 32-bit futex
>> +which should has an original value of 0. If it succeeds, it will be
> have
>
>> +the owner of the futex. Otherwise, it has to call into the kernel
>> +using the new FUTEX_SPIN_LOCK futex(2) syscall.
>> +
>> +The kernel will use the setting of the most significant bit
>> +(FUTEX_WAITERS) in the futex value to indicate one or more waiters
>> +are sleeping and need to be woken up later on.
>> +
>> +When it is time to unlock, the lock owner has to atomically clear
>> +the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
>> +it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
>> +sleeping task.
>> +
>> +A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
>> +indicates a task has been woken up. The syscall returns 0 if no
>> +sleeping task is found or spinners are present to take the lock.
>> +
>> +The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
>> +futex can be used to decide if the spinning futex functionality is
>> +implemented in the kernel. If it is present, the returned error number
>> +should be ESRCH. Otherwise it will be ENOSYS.
>> +
>> +Currently, only the first and the second arguments (the futex address
>> +and the opcode) of the futex(2) syscall is used. All the other
> are used.
>
>> +arguments must be set to 0 or NULL to avoid forward compatibility
>> +problem.
>> +
>> +The spinning futex requires the kernel to have support for the cmpxchg
>> +functionality. For architectures that don't support cmpxchg, spinning
>> +futex will not be supported as well.
>> +
>> +Usage Scenario
>> +--------------
>> +
>> +A spinning futex can be used as an exclusive lock to guard a critical
>> +section which are unlikely to go to sleep in the kernel. The spinners
> is
>
>> +in a spinning futex, however, will fall back to sleep in a wait queue
>> +if the lock owner isn't running. Therefore, it can also be used when
>> +the critical section is long and prone to sleeping. However, it may
>> +not have the performance benefit when compared with a wait-wake futex
>> +in this case.
>> +
>> +Sample Code
>> +-----------
>> +
>> +The following are sample code to implement a simple lock and unlock
> is
>
>> +function.
>> +
>> +__thread int tid; /* Thread ID */
>> +
>> +void mutex_lock(int *faddr)
>> +{
>> + if (cmpxchg(faddr, 0, tid) == 0)
>> + return;
>> + for (;;)
>> + if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
>> + break;
>> +}
>> +
>> +void mutex_unlock(int *faddr)
>> +{
>> + int old, fval;
>> +
>> + if ((fval = cmpxchg(faddr, tid, 0)) == tid)
>> + return;
>> + /* Clear only the TID portion of the futex */
>> + for (;;) {
>> + old = fval;
>> + fval = cmpxchg(faddr, old, old& ~FUTEX_TID_MASK);
>> + if (fval == old)
>> + break;
>> + }
>> + if (fval& FUTEX_WAITERS)
>> + futex(faddr, FUTEX_SPIN_UNLOCK, ...);
>> +}
>>
>
Thanks for the grammar corrections. Will apply those to the document.
-Longman
On Mon, Jul 21, 2014 at 05:32:55PM -0700, Davidlohr Bueso wrote:
> On Mon, 2014-07-21 at 14:31 -0700, Andy Lutomirski wrote:
> > On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra <[email protected]> wrote:
> > > All this is predicated on the fact that syscalls are 'expensive'.
> > > Weren't syscalls only 100s of cycles? All this bitmap mucking is far
> > > more expensive due to cacheline misses, which due to the size of the
> > > things is almost guaranteed.
> >
> > 120 - 300 cycles for me, unless tracing happens, and I'm working on
> > reducing the incidence of tracing.
>
> fwiw here's what lmbench's lat_ctx says on my system. For 'accuracy', I
> kept the runs short.
>
> http://www.stgolabs.net/lat_ctx.png
Create a syscall without 'work body', so say something like the below.
Context switches do quite a lot of work (unfortunately).
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..daf27ec10e29 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common nop sys_nop
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751ebf9d9..1b86614bb551 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2108,6 +2108,11 @@ SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
return 0;
}
+SYSCALL_DEFINE0(nop)
+{
+ return 0;
+}
+
#ifdef CONFIG_COMPAT
struct compat_sysinfo {
s32 uptime;
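For completeness, a userspace loop to time such a nop syscall could look like the sketch below; the syscall number 317 and sys_nop come only from the example diff above and are not a real ABI.

/* Sketch only: __NR_nop (317) is hypothetical, taken from the diff above. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_nop 317

int main(void)
{
	struct timespec t0, t1;
	const long iters = 10 * 1000 * 1000;
	long i;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		syscall(__NR_nop);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per syscall\n", ns / iters);
	return 0;
}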
On Mon, Jul 21, 2014 at 09:34:57PM -0400, Steven Rostedt wrote:
> I just want to point out that I was having a very nice conversation
> with Robert Haas (Cc'd) in Napa Valley at Linux Collaboration about
> this very topic. Robert is a PostgreSQL developer who told me that they
> implement their spin locks completely in userspace (no futex, just raw
> spinning on shared memory). This is because the sleep on contention of a
> futex has shown to be very expensive in their benchmarks. His work is
> not a micro benchmark but for a very popular database where locking is
> crucial.
Userspace spinlocks are a clusterfuck. It's impossible to solve the
priority inversion trainwrecks they cause, _ever_.
We've had -- as I think Mike already pointed out -- tons of 'fun' with
psql exactly because it's doing this :-(
> I was telling Robert that if futexes get optimistic spinning, he should
> reconsider their use of userspace spinlocks in favor of this, because
> I'm pretty sure that they will see a great improvement.
>
> Now Robert will be the best one to answer if the system call is indeed
> more expensive than doing full spins in userspace. If the spin is done
> in the kernel and they still get better performance by just spinning
> blindly in userspace even if the owner is asleep, I think we will have
> our answer.
No, the best way is to measure the exact syscall cost. If he still gets
better performance, we need to analyze why; there might be something else
hiding there.
> Note, I believe they only care about shared threads, and this
> optimistic spinning does not need to be something done between
> processes.
There's no reason not to provide it for shared futexes, in fact I
suspect not doing it for shared futexes is going to make the code
uglier.
Anyway, there is one big fail in the entire futex stack that we 'need'
to sort some day and that is NUMA. Some people (again database people)
explicitly do not use futexes and instead use sysvsem because of this.
The problem with numa futexes is that because they're vaddr based there
is no (persistent) node information. You always end up having to fall
back to looking in all nodes before you can guarantee there is no
matching futex.
One way to achieve it is by extending the futex value to include a node
number, but that's obviously a complete ABI break. Then again, it should
be pretty straight fwd, since the node number doesn't need to be part of
the actual atomic update part, just part of the userspace storage.
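Purely as an illustration of that last point (not an ABI proposal; the field names are made up), the userspace storage could look like:

#include <stdint.h>

/*
 * Illustrative only: keep the node hint next to, not inside, the 32-bit
 * word that is updated atomically.  Only 'val' would ever be touched by
 * atomic ops and by the kernel fast path; 'node' is plain userspace
 * storage a NUMA-aware kernel could read to pick the right per-node hash.
 */
struct numa_futex {
	uint32_t val;	/* the atomically updated futex word */
	uint32_t node;	/* preferred NUMA node, set once at init */
};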
On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> Anyway, there is one big fail in the entire futex stack that we 'need'
> to sort some day and that is NUMA. Some people (again database people)
> explicitly do not use futexes and instead use sysvsem because of this.
>
> The problem with numa futexes is that because they're vaddr based there
> is no (persistent) node information. You always end up having to fall
> back to looking in all nodes before you can guarantee there is no
> matching futex.
>
> One way to achieve it is by extending the futex value to include a node
> number, but that's obviously a complete ABI break. Then again, it should
> be pretty straight fwd, since the node number doesn't need to be part of
> the actual atomic update part, just part of the userspace storage.
So you want per node hash buckets, right? Fair enough, but how do you
make sure, that no thread/process on a different node is fiddling with
that "node bound" futex as well?
Thanks,
tglx
On Tue, Jul 22, 2014 at 10:39:17AM +0200, Thomas Gleixner wrote:
> On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> > Anyway, there is one big fail in the entire futex stack that we 'need'
> > to sort some day and that is NUMA. Some people (again database people)
> > explicitly do not use futexes and instead use sysvsem because of this.
> >
> > The problem with numa futexes is that because they're vaddr based there
> > is no (persistent) node information. You always end up having to fall
> > back to looking in all nodes before you can guarantee there is no
> > matching futex.
> >
> > One way to achieve it is by extending the futex value to include a node
> > number, but that's obviously a complete ABI break. Then again, it should
> > be pretty straight fwd, since the node number doesn't need to be part of
> > the actual atomic update part, just part of the userspace storage.
>
> So you want per node hash buckets, right? Fair enough, but how do you
> make sure, that no thread/process on a different node is fiddling with
> that "node bound" futex as well?
You don't and that should work just as well, just slower. But since the
node id is in the futex 'value' we'll always end up in the right
node-hash, even if it's a remote one.
So yes, per node hashes, and a persistent futex->node map.
And before people start talking about mempol and using that to bind
memory to nodes and such, remember that private futexes do not have a
vma lookup and therefore mempols are impossible to use.
On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 10:39:17AM +0200, Thomas Gleixner wrote:
> > On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> > > Anyway, there is one big fail in the entire futex stack that we 'need'
> > > to sort some day and that is NUMA. Some people (again database people)
> > > explicitly do not use futexes and instead use sysvsem because of this.
> > >
> > > The problem with numa futexes is that because they're vaddr based there
> > > is no (persistent) node information. You always end up having to fall
> > > back to looking in all nodes before you can guarantee there is no
> > > matching futex.
> > >
> > > One way to achieve it is by extending the futex value to include a node
> > > number, but that's obviously a complete ABI break. Then again, it should
> > > be pretty straight fwd, since the node number doesn't need to be part of
> > > the actual atomic update part, just part of the userspace storage.
> >
> > So you want per node hash buckets, right? Fair enough, but how do you
> > make sure, that no thread/process on a different node is fiddling with
> > that "node bound" futex as well?
>
> You don't and that should work just as well, just slower. But since the
> node id is in the futex 'value' we'll always end up in the right
> node-hash, even if its a remote one.
>
> So yes, per node hashes, and a persistent futex->node map.
Which works fine as long as you only have the futex_q on the stack of
the blocked task. If user space is lying to you, then you just end up
with a bunch of threads sleeping forever. Who cares?
But if you create independent kernel state, which we have with
pi_state and which you need for finegrained locking and further
spinning fun, you open up another can of worms. Simply because this
would enable rogue user space to create multiple instances of the
kernel internal state. I can predict the CVEs resulting from that
even without using a crystal ball.
Thanks,
tglx
On 07/21/2014 12:42 PM, Thomas Gleixner wrote:
> On Mon, 21 Jul 2014, Waiman Long wrote:
>
>> +#define FUTEX_TID(u) (pid_t)((u)& FUTEX_TID_MASK)
>> +#define FUTEX_HAS_WAITERS(u) ((u)& FUTEX_WAITERS)
> You love ugly macros, right?
>
Not really, I just have a tendency to overuse them sometimes. I could take
those macros out.
>> +/*
>> + * futex_spin_trylock - attempt to take the lock
>> + * Return: 1 if successful or an error happen
>> + * 0 otherwise
>> + *
>> + * Side effect: The uval and ret will be updated.
>> + */
>> +static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>> + int *pret, u32 vpid)
>> +{
>> + u32 old;
>> +
>> + *pret = get_futex_value_locked(puval, uaddr);
>> + if (*pret)
>> + return 1;
>> +
>> + if (FUTEX_TID(*puval))
>> + return 0; /* The mutex is not free */
>> +
>> + old = *puval;
>> +
>> + *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
>> + if (*pret)
>> + return 1;
>> + if (*puval == old) {
>> + /* Adjust uval to reflect current value */
>> + *puval = vpid | old;
>> + return 1;
>> + }
>> + return 0;
> What's the point if all of this?
>
> A simple cmpxchg_futex_value_locked() does all of this, just less ugly
> and without all these extra indirections and totally uncomprehensible
> conditionals.
>
Yes, the trylock function is somewhat unwieldy. Will integrate it back
into its caller. As a trylock, we usually do a read first to make sure
the lock is free before doing the cmpxchg. Blindly doing a cmpxchg
unconditionally may hinder performance.
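In other words, the usual test-and-test-and-set style of trylock. A plain userspace illustration of the pattern, using GCC atomic builtins (the mask value mirrors the uapi header; the function itself is just a sketch):

#include <stdbool.h>
#include <stdint.h>

#define FUTEX_TID_MASK	0x3fffffff

/*
 * Read the word first and only attempt the cmpxchg when the TID portion
 * is 0, so heavily contended locks don't generate a stream of failing
 * atomic writes.
 */
static bool spin_trylock_sketch(uint32_t *futex, uint32_t tid)
{
	uint32_t val = __atomic_load_n(futex, __ATOMIC_RELAXED);

	if (val & FUTEX_TID_MASK)
		return false;		/* owned; skip the cmpxchg */

	return __atomic_compare_exchange_n(futex, &val, val | tid,
					   false, __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}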
>> +}
>> +
>> +/*
>> + * futex_spin_lock
>> + */
>> +static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> +{
> So this lacks a timeout. If we provide this, then we need to have the
> timeout supported as well.
Yes, a timeout isn't supported yet. This is an RFC and I want to get a
sense of how important a timeout will be before adding one. I could
certainly do so if people think it is an important feature to have.
>> + struct futex_hash_bucket *hb;
>> + struct futex_q_head *qh = NULL;
>> + struct futex_q_node qnode;
>> + union futex_key key;
>> + bool gotlock;
>> + int ret, cnt;
>> + u32 uval, vpid, old;
>> +
>> + qnode.task = current;
>> + vpid = task_pid_vnr(qnode.task);
>> +
>> + ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
>> + if (unlikely(ret))
> Stop sprinkling the code with unlikelys
Sure. Will remove those unlikely() annotations.
>> + return ret;
>> +
>> + hb = hash_futex(&key);
>> + spin_lock(&hb->lock);
>> +
>> + /*
>> + * Locate the queue head for the given key
>> + */
> Brilliant comment. If you'd comment the stuff which really matters and
> leave out the obvious, then your code might be readable some day.
That comment was before I extracted the code out into a separate
function. Will remove it.
>> + qh = find_qhead(hb,&key);
>> +
>> + /*
>> + * Check the futex value under the hash bucket lock as it might
>> + * be changed.
>> + */
> What might have changed? You enter the function with uaddr, but no
> uval. So what changed?
If there is contention, the spin_lock() call may take a while. Unlike a
wait-wake futex, the only value of interest here is one whose TID portion
is 0, so we don't really need to pass in a uval. The futex value is not 0
when the lock function is called, but the lock owner may have released
the lock by the time we recheck the value here, before we go into
spinning or waiting.
>
>
>> + if (futex_spin_trylock(uaddr,&uval,&ret, vpid))
>> + goto hbunlock_out;
>> +
>> + if (!qh) {
>> + /*
>> + * First waiter:
>> + * Allocate a queue head structure& initialize it
>> + */
>> + qh = qhead_alloc_init(hb,&key);
>> + if (unlikely(!qh)) {
>> + ret = -ENOMEM;
>> + goto hbunlock_out;
>> + }
>> + } else {
>> + atomic_inc(&qh->lcnt);
>> + }
>> + spin_unlock(&hb->lock);
>> +
>> + /*
>> + * Put the task into the wait queue and sleep
>> + */
>> + preempt_disable();
> Why?
I just followed what is done in the mutex code, where preemption is
disabled even in the sleeping loop.
>
>> + get_task_struct(qnode.task);
> So you get a task reference on current? What the heck is this for?
Because the task is going to sleep and a queue node with the task
pointer is going to be enqueued into the wait queue.
>> + spin_lock(&qh->wlock);
>> + list_add_tail(&qnode.wnode,&qh->waitq);
>> + __set_current_state(TASK_INTERRUPTIBLE);
>> + spin_unlock(&qh->wlock);
>> + gotlock = false;
>> + for (;;) {
>> + ret = get_user(uval, uaddr);
>> + if (ret)
>> + break;
> So you let user space handle EFAULT?
This is a good question. Do you have any suggestion on how to better
handle the error when get_user() fails?
>
>> +dequeue:
>> + __set_current_state(TASK_RUNNING);
>> + /*
>> + * Remove itself from the wait queue and go back to optimistic
>> + * spinning if it hasn't got the lock yet.
>> + */
>> + put_task_struct(qnode.task);
>> + spin_lock(&qh->wlock);
>> + list_del(&qnode.wnode);
>> +
>> + /*
>> + * Try to clear the waiter bit if the wait queue is empty
>> + */
>> + if (list_empty(&qh->waitq)) {
>> + int retval = get_futex_value_locked(&uval, uaddr);
>> +
>> + if (!retval&& FUTEX_HAS_WAITERS(uval)) {
>> + old = uval;
>> + uval&= ~FUTEX_WAITERS;
>> + (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
>> + uval);
>> + }
>> + }
>> + spin_unlock(&qh->wlock);
>> + preempt_enable();
>> +
>> + cnt = atomic_dec_return(&qh->lcnt);
>> + if (cnt == 0)
>> + qhead_free(qh, hb);
>> + /*
>> + * Need to set the waiters bit there are still waiters
>> + */
>> + else if (!ret)
>> + ret = put_user(vpid | FUTEX_WAITERS, uaddr);
> WTF? You fiddle with the uaddr completely unprotected.
The get_futex_key(...., VERIFY_WRITE) call checks that the location is
writeable, and the earlier get_user() call succeeded without error. What
additional protection do you think we need here?
>> +out:
>> + put_futex_key(&key);
>> + return ret;
>> +
>> +hbunlock_out:
>> + spin_unlock(&hb->lock);
>> + goto out;
>> +}
>> +
>> +/*
>> + * futex_spin_unlock
>> + */
>> +static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
>> +{
>> + struct futex_hash_bucket *hb;
>> + struct futex_q_head *qh;
>> + union futex_key key;
>> + struct task_struct *wtask; /* Task to be woken */
>> + int ret, lcnt;
>> + u32 uval, old, vpid = task_pid_vnr(current);
>> +
>> + ret = get_user(uval, uaddr);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * The unlocker may have cleared the TID value and another task may
>> + * steal it. However, if its TID is still set, we need to clear
>> + * it as well as the FUTEX_WAITERS bit.
> No, that's complete and utter crap. The unlocker is current and it may
> not have cleared anything.
>
> Your design or the lack thereof is a complete disaster.
In patch 5, the documentation and the sample unlock code do clear the
TID before calling into the kernel. The code here is just a safety
measure in case the unlocker doesn't follow the recommendation.
> Sit down first and define the exact semantics of the new opcode. That
> includes user and kernel space and the interaction with robust list,
> which you happily ignored.
>
> What are the semantics of uval? When can it be changed in kernel and
> in user space? How do we deal with corruption of the user space value?
The semantics of the uval are the same as those of the PI and robust
futexes, where the TID portion contains the thread ID of the lock owner.
It is my intention to make it work with the robust futex mechanism before
it can be merged. This RFC patch series is for soliciting feedback and
making the changes needed to get the patch into acceptable shape before I
go deep into making it work with robust futexes.
>
> How does that new opcode provide robustness?
>
> How are faults handled?
As you have a lot more experience working with futexes than I do, any
suggestions on what kinds of faults can happen and on the best practices
for handling them will be highly appreciated.
-Longman
On 07/21/2014 12:42 PM, Andi Kleen wrote:
> Waiman Long<[email protected]> writes:
>
>> This patch series introduces two new futex command codes to support
>> a new optimistic spinning futex for implementing an exclusive lock
>> primitive that should perform better than the same primitive using
>> the wait-wake futex in cases where the lock owner is actively working
>> instead of waiting for I/O completion.
> How would you distinguish those two cases automatically?
I don't really distinguish between these two. The purpose of this
paragraph is to show that the best use case for spinning futexes is when
the lock owner is unlikely to sleep while in the critical section.
>> This patch series improves futex performance on two different fronts:
>> 1) Reducing the amount of the futex spinlock contention by using 2
>> different spinlocks instead of just one for the wait-wake futex.
>> 2) Eliminating the context switching overhead and latency due to the
>> sleeping and the waking of the waiting tasks.
> FWIW the main problem is currently that switch-through-idle is so
> slow. I think improving that would give a boost to far more
> situations.
If we can reduce the context switching overhead, that can certainly
help a lot of applications.
>
>> Patch 4 changes sleeping queue from a simple FIFO list to an rbtree
>> sorted by process priority as well as the sequeunce the tasks enter
>> the kernel.
> This seems to mix new functionality with performance improvements?
>
> Generally adding new ordering in an user interface is risky because
> often user programs have been tuned for specific old ordering.
Not really, this patch emulates the current behavior of the wait-wake
futex, whose waiters are kept in a priority queue. Because of spinning,
tasks may go to sleep in a different order than they entered the kernel.
That is why I keep a sequence number to track which task entered the
kernel first, roughly as sketched below.
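Something like this (field names are illustrative, not the actual patch):

/*
 * Kernel-style sketch of the ordering for the sleeping queue: higher
 * priority tasks come first, tasks of equal priority are ordered by
 * the sequence in which they entered the kernel.
 */
struct futex_wait_node {
	int		prio;	/* lower value = higher priority */
	unsigned long	seq;	/* kernel entry sequence number */
};

/* Returns true if a should be queued before b in the rbtree. */
static inline bool wait_node_before(struct futex_wait_node *a,
				    struct futex_wait_node *b)
{
	if (a->prio != b->prio)
		return a->prio < b->prio;
	return a->seq < b->seq;
}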
-Longman
On 07/21/2014 12:45 PM, Andi Kleen wrote:
> Andi Kleen<[email protected]> writes:
>
>> Waiman Long<[email protected]> writes:
>>
>>> This patch series introduces two new futex command codes to support
>>> a new optimistic spinning futex for implementing an exclusive lock
>>> primitive that should perform better than the same primitive using
>>> the wait-wake futex in cases where the lock owner is actively working
>>> instead of waiting for I/O completion.
>> How would you distinguish those two cases automatically?
> Also BTW traditionally the spinning was just done in user space.
>
> This would be always even more efficient, because it would
> even avoid the syscall entry path.
Spinning directly in userspace is useful if the critical section is
really short and the chance of heavy contention is small. However, if
the lock can be heavily contended and the critical section is not short,
those are the conditions where the spinning futex proposed by this patch
series can help.
-Longman
On 07/21/2014 01:15 PM, Davidlohr Bueso wrote:
> On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
>> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
>> primitive on the futex value when the lock owner is running. It is
>> the same optimistic spinning technique that is done in the mutex and
>> rw semaphore code to improve their performance especially on large
>> systems with large number of CPUs. When the lock owner is not running,
>> the spinning tasks will go to sleep.
>>
>> There is 2 major advantages of doing optimistic spinning here:
>> 1) It eliminates the context switching latency and overhead (at
>> least a few us) associated with sleeping and wakeup.
>> 2) It eliminates most of the need to call futex(2) with
>> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
>> the FUTEX_WAITERS bit.
> I think this belongs with Patch 1: optimistic spinning feature should be
> in the same patch when you add the new futex commands.
I broke the spinning code out in patch 2 in order to make patch 1
smaller and easier to review.
>> Active spinning, however, does consume time in the current quantum of
>> time slice, make it a bit more likely to be preempted while running
>> in the critcal section due to time slice expiration. The heavy spinlock
>> contention of a wait-wake futex has the same effect. So it is not
>> specific
>> to this new primitive.
>>
>> With the addition of optimistic spinning, it can significantly speed
>> up the amount of mutex operations that can be done in a certain unit
>> of time. With a userspace mutex microbenchmark running 10 million
>> mutex operations with 256 threads on a 4-socket 40-core server, the
>> spinning futex can achieve a rate of about 4.9 Mops/s with a critical
>> section load of 10 pause instructions. Whereas the wait-wake futex can
>> only achieve a rate of 3.7 Mops/s. When increasing the load to 100,
>> the corresponding rates become 2.8 Mops/s and 0.8 Mops/s respectively.
>>
>> Signed-off-by: Waiman Long<[email protected]>
>> ---
>> kernel/futex.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>> 1 files changed, 172 insertions(+), 18 deletions(-)
>>
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index ec9b6ee..ddc24d1 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -71,6 +71,7 @@
>> #include<asm/futex.h>
>>
>> #include "locking/rtmutex_common.h"
>> +#include "locking/mcs_spinlock.h"
>>
>> /*
>> * READ this before attempting to hack on futexes!
>> @@ -2995,30 +2996,51 @@ void exit_robust_list(struct task_struct *curr)
>> #define FUTEX_TID(u) (pid_t)((u)& FUTEX_TID_MASK)
>> #define FUTEX_HAS_WAITERS(u) ((u)& FUTEX_WAITERS)
>>
>> +/*
>> + * Bit usage of the locker count:
>> + * bit 0-23: number of lockers (spinners + waiters)
>> + * bit 24-30: number of spinners
>> + */
>> +#define FUTEX_SPINCNT_MAX 64 /* Maximum # of spinners */
>> +#define FUTEX_SPINCNT_SHIFT 24
>> +#define FUTEX_SPINCNT_BIAS (1U<< FUTEX_SPINCNT_SHIFT)
>> +#define FUTEX_LOCKCNT_MASK (FUTEX_SPINCNT_BIAS - 1)
>> +#define FUTEX_LOCKCNT(qh) (atomic_read(&(qh)->lcnt)& FUTEX_LOCKCNT_MASK)
>> +#define FUTEX_SPINCNT(qh) (atomic_read(&(qh)->lcnt)>>FUTEX_SPINCNT_SHIFT)
> Both FUTEX_LOCKCNT and FUTEX_SPINCNT should be static inline functions.
I will change them into static inline functions.
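Something along these lines, assuming the lcnt bit layout quoted above (the function names are just placeholders):

static inline int futex_lock_count(struct futex_q_head *qh)
{
	return atomic_read(&qh->lcnt) & FUTEX_LOCKCNT_MASK;
}

static inline int futex_spin_count(struct futex_q_head *qh)
{
	return atomic_read(&qh->lcnt) >> FUTEX_SPINCNT_SHIFT;
}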
>
>> /**
>> * struct futex_q_head - head of the optspin futex queue, one per unique key
>> * @hnode: list entry from the hash bucket
>> * @waitq: a list of waiting tasks
>> * @key: the key the futex is hashed on
>> + * @osq: pointer to optimisitic spinning queue
>> + * @owner: task_struct pointer of the futex owner
>> + * @otid: TID of the futex owner
>> * @wlock: spinlock for managing wait queue
>> - * @lcnt: locker count
>> + * @lcnt: locker count (spinners + waiters)
>> *
>> * Locking sequence
>> * ----------------
>> * 1) Lock hash bucket spinlock, locate the futex queue head
>> * 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
>> - * 3) For waiter:
>> + * 3) For spinner:
>> + * - enqueue into the spinner queue and wait for its turn.
>> + * 4) For waiter:
>> * - lock futex queue head spinlock
>> * - enqueue into the wait queue
>> * - release the lock& sleep
>> - * 4) For unlocker:
>> + * 5) For unlocker:
>> * - locate the queue head just like a locker does
>> - * - Take the queue head lock and wake up the first waiter there.
>> + * - clear the owner field if it is the current owner
>> + * - if the locker count is not 0& osq is empty, take the queue head lock
>> + * and wake up the first waiter.
>> */
>> struct futex_q_head {
>> struct list_head hnode;
>> struct list_head waitq;
>> union futex_key key;
>> + struct optimistic_spin_queue *osq;
>> + struct task_struct *owner;
>> pid_t otid;
>> spinlock_t wlock;
>> atomic_t lcnt;
>> @@ -3034,6 +3056,13 @@ struct futex_q_node {
>> struct task_struct *task;
>> };
>>
>> +/*
>> + * The maximum number of tasks that can be a futex spin queue
>> + *
>> + * It is set to the lesser of half of the total number of CPUs and
>> + * FUTEX_SPINCNT_MAX to avoid locking up all the CPUs in spinning.
>> + */
>> +static int __read_mostly futex_spincnt_max;
>>
>> /*
>> * find_qhead - Find a queue head structure with the matching key
>> @@ -3061,7 +3090,7 @@ find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
>> * contention with no hash bucket collision.
>> */
>> static inline struct futex_q_head *
>> -qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
>> +qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
>> {
>> struct futex_q_head *qh = NULL;
>> static const struct futex_q_head qh0 = { { 0 } };
>> @@ -3073,10 +3102,16 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
>>
>> /*
>> * Initialize the queue head structure
>> + * The lock owner field may be NULL if the task has released the lock
>> + * and exit.
>> */
>> if (qh) {
>> - *qh = qh0;
>> - qh->key = *key;
>> + *qh = qh0;
>> + qh->key = *key;
>> + qh->otid = FUTEX_TID(uval);
>> + qh->owner = futex_find_get_task(qh->otid);
>> + if (unlikely(!qh->owner))
>> + qh->otid = 0;
>> atomic_set(&qh->lcnt, 1);
>> INIT_LIST_HEAD(&qh->waitq);
>> spin_lock_init(&qh->wlock);
> All this can be a single qh setup function.
This code is already in a separate allocation and initialization
function. I don't see a big advantage in further breaking it up into two
unless there are cases where each part can be called independently of
the other.
>> @@ -3120,9 +3155,11 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
>> /*
>> * Free the queue head structure
>> */
>> - BUG_ON(!list_empty(&qh->waitq));
>> + BUG_ON(!list_empty(&qh->waitq) || qh->osq);
>> list_del(&qh->hnode);
>> spin_unlock(&hb->lock);
>> + if (qh->owner)
>> + put_task_struct(qh->owner);
>>
>> if (!hb->qhcache&& (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
>> return;
>> @@ -3134,14 +3171,19 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
>> * Return: 1 if successful or an error happen
>> * 0 otherwise
>> *
>> + * Optimistic spinning is done without holding lock, but with page fault
>> + * explicitly disabled. So different functions need to be used to access
>> + * the userspace futex value.
>> + *
>> * Side effect: The uval and ret will be updated.
>> */
>> static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>> - int *pret, u32 vpid)
>> + int *pret, u32 vpid, bool spinning)
>> {
>> - u32 old;
>> + u32 old;
>>
>> - *pret = get_futex_value_locked(puval, uaddr);
>> + *pret = spinning ? __copy_from_user_inatomic(puval, uaddr, sizeof(u32))
>> + : get_futex_value_locked(puval, uaddr);
>> if (*pret)
>> return 1;
>>
>> @@ -3150,18 +3192,102 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>>
>> old = *puval;
>>
>> - *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
>> + *pret = spinning
>> + ? futex_atomic_cmpxchg_inatomic(puval, uaddr, old, vpid)
>> + : cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
>> +
>> if (*pret)
>> return 1;
>> if (*puval == old) {
>> /* Adjust uval to reflect current value */
>> - *puval = vpid | old;
>> + *puval = spinning ? vpid : (vpid | old);
>> return 1;
>> }
>> return 0;
>> }
>>
>> /*
>> + * futex_optspin - optimistic spinning loop
>> + * Return: 1 if lock successfully acquired
>> + * 0 if need to fall back to waiting
>> + *
>> + * Page fault and preemption are disabled in the optimistic spinning
>> + * loop. Preemption should have been disabled before calling this function.
>> + *
>> + * The number of spinners may temporarily exceed the threshold due to
>> + * racing (the spin count check and add aren't atomic), but that shouldn't
>> + * be a big problem.
>> + */
>> +static inline int futex_optspin(struct futex_q_head *qh,
>> + struct futex_q_node *qn,
>> + u32 __user *uaddr,
>> + u32 vpid)
>> +{
>> + u32 uval;
>> + int ret, gotlock = false;
>> + struct task_struct *owner;
>> +
>> + /*
>> + * Increment the spinner count
>> + */
>> + atomic_add(FUTEX_SPINCNT_BIAS,&qh->lcnt);
>> + if (!osq_lock(&qh->osq)) {
>> + atomic_sub(FUTEX_SPINCNT_BIAS,&qh->lcnt);
>> + return gotlock;
>> + }
>> + pagefault_disable();
> How about a comment to why pf is disabled.
When a page fault happens, there is a chance that the task may end up on
a different CPU, so all the OSQ magic fails to work even with
preempt_disable(). This was a bug that took me a day or so to figure
out. I will add a comment to document that.
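A sketch of such a comment, following the explanation above:

	/*
	 * Disable page faults as well as preemption: handling a fault
	 * here could move the task to another CPU and break the OSQ
	 * spinning logic. With page faults disabled, the inatomic
	 * futex accesses simply fail and we fall back to the wait
	 * queue instead.
	 */
	pagefault_disable();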
>> + for (;; cpu_relax()) {
> while(true) {
>
>> + if (futex_spin_trylock(uaddr,&uval,&ret, vpid, true)) {
>> + /*
>> + * Fall back to waiting if an error happen
>> + */
>> + if (ret)
>> + break;
>> + qh->otid = vpid;
>> + owner = xchg(&qh->owner, qn->task);
>> + get_task_struct(qn->task);
>> + if (owner)
>> + put_task_struct(owner);
>> + gotlock = true;
>> + break;
>> + } else if (unlikely(FUTEX_HAS_WAITERS(uval))) {
> Branch predictions have a time and place, please do not use
> likely/unlikely just for anything.
Sure. I may have overused them.
>
>> + /*
>> + * Try to turn off the waiter bit as it now has a
>> + * spinner. It doesn't matter if it fails as it will
>> + * try again in the next iteration.
>> + */
>> + (void)futex_atomic_cmpxchg_inatomic
>> + (&uval, uaddr, uval, uval& ~FUTEX_WAITERS);
>> + }
>> +
>> + if (unlikely(FUTEX_TID(uval) != qh->otid)) {
>> + /*
>> + * Owner has changed
>> + */
>> + qh->otid = FUTEX_TID(uval);
>> + owner = xchg(&qh->owner, futex_find_get_task(qh->otid));
>> + if (owner)
>> + put_task_struct(owner);
>> + }
>> + owner = ACCESS_ONCE(qh->owner);
>> + if ((owner&& !ACCESS_ONCE(owner->on_cpu)) || need_resched())
>> + break;
>> + }
>> + pagefault_enable();
>> + osq_unlock(&qh->osq);
>> + atomic_sub(FUTEX_SPINCNT_BIAS,&qh->lcnt);
>> +
>> + /*
>> + * If we fell out of the spin path because of need_resched(),
>> + * reschedule now.
>> + */
>> + if (!gotlock&& need_resched())
>> + schedule_preempt_disabled();
>> +
>> + return gotlock;
>> +}
>> +
>> +/*
>> * futex_spin_lock
>> */
>> static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> @@ -3170,6 +3296,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> struct futex_q_head *qh = NULL;
>> struct futex_q_node qnode;
>> union futex_key key;
>> + struct task_struct *owner;
>> bool gotlock;
>> int ret, cnt;
>> u32 uval, vpid, old;
>> @@ -3193,7 +3320,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> * Check the futex value under the hash bucket lock as it might
>> * be changed.
>> */
>> - if (futex_spin_trylock(uaddr,&uval,&ret, vpid))
>> + if (futex_spin_trylock(uaddr,&uval,&ret, vpid, false))
>> goto hbunlock_out;
>>
>> if (!qh) {
>> @@ -3201,7 +3328,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> * First waiter:
>> * Allocate a queue head structure& initialize it
>> */
>> - qh = qhead_alloc_init(hb,&key);
>> + qh = qhead_alloc_init(hb,&key, uval);
>> if (unlikely(!qh)) {
>> ret = -ENOMEM;
>> goto hbunlock_out;
>> @@ -3212,9 +3339,18 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> spin_unlock(&hb->lock);
>>
>> /*
>> - * Put the task into the wait queue and sleep
>> + * Perform optimisitic spinning if the owner is running.
>> */
>> preempt_disable();
>> + owner = ACCESS_ONCE(qh->owner);
>> + if ((FUTEX_SPINCNT(qh)< futex_spincnt_max)&&
>> + (!owner || owner->on_cpu))
>> + if (futex_optspin(qh,&qnode, uaddr, vpid))
>> + goto penable_out;
>> +
>> + /*
>> + * Put the task into the wait queue and sleep
>> + */
>> get_task_struct(qnode.task);
>> spin_lock(&qh->wlock);
>> list_add_tail(&qnode.wnode,&qh->waitq);
>> @@ -3238,6 +3374,11 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> goto dequeue;
>> } else if (uval == old) {
>> gotlock = true;
>> + qh->otid = vpid;
>> + owner = xchg(&qh->owner, qnode.task);
>> + get_task_struct(qnode.task);
>> + if (owner)
>> + put_task_struct(owner);
>> goto dequeue;
>> }
>> continue;
>> @@ -3286,15 +3427,17 @@ dequeue:
>> }
>> }
>> spin_unlock(&qh->wlock);
>> +
>> +penable_out:
>> preempt_enable();
>>
>> cnt = atomic_dec_return(&qh->lcnt);
>> if (cnt == 0)
>> qhead_free(qh, hb);
>> /*
>> - * Need to set the waiters bit there are still waiters
>> + * Need to set the waiters bit there no spinner running
>> */
>> - else if (!ret)
>> + else if (!ret&& ((cnt>> FUTEX_SPINCNT_SHIFT) == 0))
>> ret = put_user(vpid | FUTEX_WAITERS, uaddr);
>> out:
>> put_futex_key(&key);
>> @@ -3356,6 +3499,13 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
>> }
>>
>> /*
>> + * Clear the owner field
>> + */
>> + if ((qh->owner == current)&&
>> + (cmpxchg(&qh->owner, current, NULL) == current))
>> + put_task_struct(current);
>> +
>> + /*
>> * The hash bucket lock is being hold while retrieving the task
>> * structure from the queue head to make sure that the qh structure
>> * won't go away under the hood.
>> @@ -3520,6 +3670,10 @@ static int __init futex_init(void)
>>
>> futex_detect_cmpxchg();
>>
>> + futex_spincnt_max = num_possible_cpus()/2;
>> + if (futex_spincnt_max> FUTEX_SPINCNT_MAX)
>> + futex_spincnt_max = FUTEX_SPINCNT_MAX;
> This threshold needs commenting as well.
>
There is a comment above where FUTEX_SPINCNT_MAX is defined. Will add a
comment here as well.
-Longman
On 07/21/2014 04:17 PM, Jason Low wrote:
> On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
>> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
>> primitive on the futex value when the lock owner is running. It is
>> the same optimistic spinning technique that is done in the mutex and
>> rw semaphore code to improve their performance especially on large
>> systems with large number of CPUs. When the lock owner is not running,
>> the spinning tasks will go to sleep.
> Perhaps we could introduce a "CONFIG_FUTEX_SPIN_ON_OWNER" that depends
> on SMP and ARCH_SUPPORTS_ATOMIC_RMW?
The new futex opcode depends on the ability to do cmpxchg() in the futex
context. The code will be disabled if futex cmpxchg is not supported. I
guess that should be enough to limit it to just a handful of architectures.
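For reference, the PI and requeue-PI opcodes are already gated on futex_cmpxchg_enabled in do_futex(); presumably the new opcodes would just be added to that same check, along these lines:

	switch (cmd) {
	case FUTEX_LOCK_PI:
	case FUTEX_UNLOCK_PI:
	case FUTEX_TRYLOCK_PI:
	case FUTEX_WAIT_REQUEUE_PI:
	case FUTEX_CMP_REQUEUE_PI:
	case FUTEX_SPIN_LOCK:		/* new */
	case FUTEX_SPIN_UNLOCK:		/* new */
		/* No futex cmpxchg support, refuse the new opcodes too */
		if (!futex_cmpxchg_enabled)
			return -ENOSYS;
	}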
>> There is 2 major advantages of doing optimistic spinning here:
>> 1) It eliminates the context switching latency and overhead (at
>> least a few us) associated with sleeping and wakeup.
>> 2) It eliminates most of the need to call futex(2) with
>> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
>> the FUTEX_WAITERS bit.
>> struct futex_q_head {
>> struct list_head hnode;
>> struct list_head waitq;
>> union futex_key key;
>> + struct optimistic_spin_queue *osq;
> And this would have to be updated to
>
> + struct optimistic_spin_queue osq;
>
> given the recent changes to the osq lock.
Yes, I will make the change in the next iteration of the patch.
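Presumably that amounts to embedding the structure by value and initializing it with osq_lock_init(), something like:

struct futex_q_head {
	struct list_head		hnode;
	struct list_head		waitq;
	union futex_key			key;
	struct optimistic_spin_queue	osq;	/* embedded by value now */
	struct task_struct		*owner;
	pid_t				otid;
	spinlock_t			wlock;
	atomic_t			lcnt;
};

	/* ... and in the queue head initialization: */
	osq_lock_init(&qh->osq);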
-Longman
On 07/21/2014 05:18 PM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
>> showed the following performance data (average kops/s) with various
>> load factor (number of pause instructions) used in the critical
>> section using an userspace mutex microbenchmark.
>>
>> Threads Load Waiting Futex Spinning Futex %Change
>> ------- ---- ------------- -------------- -------
>> 256 1 6894 8883 +29%
>> 256 10 3656 4912 +34%
>> 256 50 1332 4358 +227%
>> 256 100 792 2753 +248%
>> 10 1 6382 4838 -24%
>> 10 10 3614 4748 +31%
>> 10 50 1319 3900 +196%
>> 10 100 782 2459 +214%
>> 2 1 7905 7194 -9.0%
>> 2 10 4556 4717 +3.5%
>> 2 50 2191 4167 +90%
>> 2 100 1767 2407 +36%
> So the numbers look interesting - but it would be _really_ important
> to provide noise/sttdev figures in a sixth column as well (denoted in
> percentage units, not in benchmark units), so that we know how
> significant a particular speedup (or slowdown) is.
>
> Thanks,
>
> Ingo
The performance can vary quite a bit depending on what other processes
are running at test execution time. I will include stddev data in
the next iteration of the patch.
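For the record, the extra column would just be the relative standard deviation of the per-run throughput numbers, computed along these lines (sketch):

#include <math.h>

/* Relative standard deviation (in percent) of n per-run kops/s samples. */
static double rel_stddev_pct(const double *samples, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += samples[i];
	mean /= n;

	for (i = 0; i < n; i++)
		var += (samples[i] - mean) * (samples[i] - mean);
	var /= n - 1;			/* sample variance */

	return 100.0 * sqrt(var) / mean;
}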
-Longman
On 07/21/2014 09:01 PM, Thomas Gleixner wrote:
> On Mon, 21 Jul 2014, Darren Hart wrote:
>> On 7/21/14, 14:47, "Thomas Gleixner"<[email protected]> wrote:
>>
>>> On Mon, 21 Jul 2014, Andy Lutomirski wrote:
>>>> On Mon, Jul 21, 2014 at 2:27 PM, Peter Zijlstra<[email protected]>
>>>> wrote:
>>>>> All this is predicated on the fact that syscalls are 'expensive'.
>>>>> Weren't syscalls only 100s of cycles? All this bitmap mucking is far
>>>>> more expensive due to cacheline misses, which due to the size of the
>>>>> things is almost guaranteed.
>>>> 120 - 300 cycles for me, unless tracing happens, and I'm working on
>>>> reducing the incidence of tracing.
>>> So it's a non issue indeed and definitely not worth the trouble of
>>> that extra storage, the scheduler overhead, etc.
>>>
>>> Summary: Once you can't take it atomically in user space, you've lost
>>> anyway. And we are better off to do the magic spinning in
>>> kernel where we have all the information accessible already.
>> And we have such an implementation with the FUTEX_LOCK_ADAPTIVE code we
>> discussed back in Oct 2010 (purely kernel, no VDSO), updated with some of
>> your and other's comments:
>>
>> http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/f
>> utex-lock/v7
>>
>>
>> I can work on forward porting this series to current mainline (post recent
>> security fixes) and cleaning up the commentary and such if people are
>> interested in seeing this implementation (based on Peter Z's spinning
>> mutex work iirc) resurrected...
> No. We really want to avoid more magic hackery in the futex code.
>
> Lets sit down and think about what we need:
>
> 1) Support for all existing features
>
> 2) More fine grained locking
>
> 3) Optimistic spinning
>
> @1: No discussion about that. Period.
>
> We are not going to introduce some new futex opcode, which is
> optimized for a microbenchmark and ignoring all of the pain we
> went through in the last 10 years. No way, especially after the
> recent security disaster.
>
> @2 and @3: Yes, we want that.
>
> But again, we don't add fine grained locking just for some half
> baken new opcode. No, we are adding it where it makes the most
> sense and lets us reuse most of the code.
>
> I can predict your question, how that should work :)
>
> If we want to have fine grained locking beyond the futex hash
> buckets and that's something we discussed before, you need a state
> representation of the user space futex in the kernel. That's what
> Waiman added as well, just in a way that's beyond repair.
>
> The charm of the futex hash buckets is that they do not create
> kernel state because the futex_q which is enqueued into the bucket
> is on the task stack of the task which is blocked on the futex.
>
> So much for the theory, because that stopped to be true when we
> added support for PI futexes. They already create kernel internal
> state which is not magically removed when the thread exits the
> syscall. It's called pi_state.
>
> Now the question is how is this related to the non PI magic
> spinning optimization? Very much so.
>
> Simply because the handling of pi_state already covers ALL the
> corner cases which come with the extra kernel internal state and
> they provide the full function set of futexes required by the
> various (ab)use cases including the requeue functionality which is
> required for condvars. It even supports caching of the pi_state
> struct in a way which makes sense, rather than blindly slapping a
> cached entry onto each hash bucket.
>
> So it's not a really mind boggling thought experiment to think
> about this magic new feature as a subset of the already existing
> PI futex code. It is a subset, just minus the PI portion plus the
> more fine grained locking.
>
> So the right solution is to rework the existing code.
>
> 1) Make pi_state the entity which gets enqueued into the hash bucket
> and manage the waiters in a pi_state internal queue.
>
> Straight forward problem as it is still protected by the hash
> bucket lock.
>
> 2) Add private locking to the pi_state to reduce the lock contention
> on the hash bucket lock.
>
> Straight forward conversion as well
>
> 3) Make mutex a subset of rt_mutex
>
> That makes a lot of sense, simply because a mutex is the same as a
> rtmutex just without the PI part.
>
> Just for the record before anyone cries murder: The original mutex
> implemention was modeled after the rtmutex one in the first place
> and the optimistic spinning was first introduced on rtmutexes by
> Gregory Haskins et. al. in the context of the RT tree. Back then it
> was too early or we were simply too stupid to see the similarities
> and opted for a complete separate implementation.
>
> Yes, I'm aware that this is a non-trivial problem, but if you look
> at the details, then the non-trivial part is in the slow path were
> stuff actually goes to sleep. The fast path of mutex and rt_mutex
> is simply the same. So there is no reason to keep them separate. An
> extra conditional in the slow path is not going to hurt at all.
>
> Surely, you might say, that I'm an egoistic bastard, who just wants
> to have the benefit of MCS for rtmutex for free. Right you are, but
> there is a whole bunch of reasons why this makes tons of sense:
>
> - As I said above, it's almost the same, except for the slow path
> details, where a few extra conditionals do not matter
>
> - Sharing the code has the obvious benefits. And it's not only
> sharing between rtmutex and mutex, it's also sharing the
> underlying optimizations of MCS for the kernel internal users and
> the user space exposed ones, i.e. futexes.
>
> - Once unified the desired futex functionality just works with a
> minimal impact on the futex code itself.
>
> The futex code does not care whether the underlying primitive has
> PI semantics or not, it just cares that all the required features
> like proxy locking etc. are in place. And that everything like
> timeouts, requeue etc. is just supported.
>
> - The semantics of the new futex functionality has not to be
> defined again. We can simply reuse the rather well defined and
> well tested straight forward semantics of the existing PI
> opcodes.
>
> - That minimizes the risk massively, because all state and error
> handling is common code.
>
> - Any update to the underlying primitives will benefit ALL usage
> sites and reduces the copy and paste induced hell of MCS code
> sprinkled all over the kernel.
>
> 4) Enable user space to make use of it.
>
> Whether this will be a separate opcode or just a new flag to the
> existing PI opcodes is going to be just a bikeshedding question at
> that point. :)
>
> The important point is, that we do NOT grow any new unmaintainable
> warts to the futex code. And the 650+ new lines of hackery which come
> with the proposed patch set are in no way acceptable, especially as
> they only cover the minimalistic functionality set of futexes. Not to
> talk about the other issues I've observed on the first glance.
>
> So before anyone comes up with a "solution" for all of this tomorrow
> afternoon in form of another half baken patch, please sit back mull it
> in your head and lets have a proper discussion about the approach
> first.
>
> Thanks,
>
> tglx
Thanks for your thorough analysis and suggestions on what to do to
support spinning futexes. You certainly know more about the internal
workings of futexes than most of us. I can live with what you have
suggested. My patch is just a proof-of-concept piece to demonstrate that
optimistic spinning on futexes is something worthwhile to do. I think I
have achieved my goal of stirring up interest in this area. My next step
will be to look in the direction you have suggested and figure out what
actual code changes will be needed.
Thanks again for your help.
-Longman
On 07/22/2014 05:59 AM, Thomas Gleixner wrote:
> On Tue, 22 Jul 2014, Peter Zijlstra wrote:
>> On Tue, Jul 22, 2014 at 10:39:17AM +0200, Thomas Gleixner wrote:
>>> On Tue, 22 Jul 2014, Peter Zijlstra wrote:
>>>> Anyway, there is one big fail in the entire futex stack that we 'need'
>>>> to sort some day and that is NUMA. Some people (again database people)
>>>> explicitly do not use futexes and instead use sysvsem because of this.
>>>>
>>>> The problem with numa futexes is that because they're vaddr based there
>>>> is no (persistent) node information. You always end up having to fall
>>>> back to looking in all nodes before you can guarantee there is no
>>>> matching futex.
>>>>
>>>> One way to achieve it is by extending the futex value to include a node
>>>> number, but that's obviously a complete ABI break. Then again, it should
>>>> be pretty straight fwd, since the node number doesn't need to be part of
>>>> the actual atomic update part, just part of the userspace storage.
>>> So you want per node hash buckets, right? Fair enough, but how do you
>>> make sure, that no thread/process on a different node is fiddling with
>>> that "node bound" futex as well?
>> You don't and that should work just as well, just slower. But since the
>> node id is in the futex 'value' we'll always end up in the right
>> node-hash, even if its a remote one.
>>
>> So yes, per node hashes, and a persistent futex->node map.
> Which works fine as long as you only have the futex_q on the stack of
> the blocked task. If user space is lying to you, then you just end up
> with a bunch of threads sleeping forever. Who cares?
>
> But if you create independent kernel state, which we have with
> pi_state and which you need for finegrained locking and further
> spinning fun, you open up another can of worms. Simply because this
> would enable rogue user space to create multiple instances of the
> kernel internal state. I can predict the CVEs resulting from that
> even without using a crystal ball.
>
> Thanks,
>
> tglx
I think the NUMA futex, if implemented, is a completely independent
piece that has no direct relationship with the optimistic spinning futex.
It should be a separate patch and not be mixed with the optimistic
spinning patch, which would only make the latter more complicated.
-Longman
On Tue, 22 Jul 2014, Waiman Long wrote:
> On 07/22/2014 05:59 AM, Thomas Gleixner wrote:
> > On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> > > On Tue, Jul 22, 2014 at 10:39:17AM +0200, Thomas Gleixner wrote:
> > > > On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> > > > > Anyway, there is one big fail in the entire futex stack that we 'need'
> > > > > to sort some day and that is NUMA. Some people (again database people)
> > > > > explicitly do not use futexes and instead use sysvsem because of this.
> > > > >
> > > > > The problem with numa futexes is that because they're vaddr based
> > > > > there
> > > > > is no (persistent) node information. You always end up having to fall
> > > > > back to looking in all nodes before you can guarantee there is no
> > > > > matching futex.
> > > > >
> > > > > One way to achieve it is by extending the futex value to include a
> > > > > node
> > > > > number, but that's obviously a complete ABI break. Then again, it
> > > > > should
> > > > > be pretty straight fwd, since the node number doesn't need to be part
> > > > > of
> > > > > the actual atomic update part, just part of the userspace storage.
> > > > So you want per node hash buckets, right? Fair enough, but how do you
> > > > make sure, that no thread/process on a different node is fiddling with
> > > > that "node bound" futex as well?
> > > You don't and that should work just as well, just slower. But since the
> > > node id is in the futex 'value' we'll always end up in the right
> > > node-hash, even if its a remote one.
> > >
> > > So yes, per node hashes, and a persistent futex->node map.
> > Which works fine as long as you only have the futex_q on the stack of
> > the blocked task. If user space is lying to you, then you just end up
> > with a bunch of threads sleeping forever. Who cares?
> >
> > But if you create independent kernel state, which we have with
> > pi_state and which you need for finegrained locking and further
> > spinning fun, you open up another can of worms. Simply because this
> > would enable rogue user space to create multiple instances of the
> > kernel internal state. I can predict the CVEs resulting from that
> > even without using a crystal ball.
> >
> > Thanks,
> >
> > tglx
>
> I think NUMA futex, if implemented, is a completely independent piece that
> have no direct relationship with optimistic spinning futex. It should be a
> separate patch and not mixing with optimistic spinning patch which will only
> make the latter one more complicated.
Bullshit. Of course it handles separate issues, but Peter is
completely right, that the NUMA aspect is a far bigger issue than the
optimistic spinning stuff. Do you have an idea what the costs of cross
node memory access and cacheline bouncing are? Obviously not, as your
only interest seems to be to slap optimistic spinning onto every place
which deals with locking.
And if you had tried to read _AND_ understand the discussion above,
you might have noticed that providing NUMA awareness requires a lot of
the functionality which is needed for optimistic spinning as well.
But no, you did not even take the time to think about it, you just
claim that it makes your optimistic stuff more complicated. Just get
it, there is a world outside of optimistic spinning.
Thanks,
tglx
On Tue, 22 Jul 2014, Waiman Long wrote:
> On 07/21/2014 12:42 PM, Thomas Gleixner wrote:
> > > + /*
> > > + * The unlocker may have cleared the TID value and another task may
> > > + * steal it. However, if its TID is still set, we need to clear
> > > + * it as well as the FUTEX_WAITERS bit.
> >
> > No, that's complete and utter crap. The unlocker is current and it may
> > not have cleared anything.
> >
> > Your design or the lack thereof is a complete disaster.
>
> In patch 5, the documentation and the sample unlock does clear the TID before
> going in. The code here is just a safety measure in case the unlocker doesn't
> follow the recommendation.
I don't care about patch 5 at all. I'm already fed up reading 1/5. The
code is no safety measure; it's a completely disastrous workaround.
We do not care whether user space follows recommendations. We set
rules and if the rules are violated then we act accordingly. Did you
even take the time to read the git history of futex.c? Did you notice
that we had a major security incident related to that code which we
fixed not long ago?
No, you obviously did not, otherwise you would not come up with
hackery which is going to explode in your face if exposed to a simple
syscall fuzzer, not to talk about a competent hacker. Without even
looking too deep I found two simple ways to leak kernel state and one
to corrupt kernel state. Brilliant stuff, really!
> > Sit down first and define the exact semantics of the new opcode. That
> > includes user and kernel space and the interaction with robust list,
> > which you happily ignored.
> >
> > What are the semantics of uval? When can it be changed in kernel and
> > in user space? How do we deal with corruption of the user space value?
>
> The semantics of the uval is the same as that of PI and robust futex where the
> TID portion contains the thread ID of the lock owner. It is my intention to
> make it works with the robust futex mechanism before it can be merged. This
> RPC patch series is for soliciting feedbacks and make the necessary changes
> that make the patch acceptable before I go deep into making it works with
> robust futex.
No, the semantics are not the same. PI and robust futexes have
different semantics, but you did not notice that at all.
Your so-called semantics are really well thought out, as you have
proven with completely incomprehensible hackeries, which you call
"safety measures".
And I do not care about your intention to make it work with robustness
etc. If you do not design it upfront to do so, then this code is going
to be a complete disaster. But to do that you need to sit down and
provide a proper design document and that wants to be patch 1/x not
the last one. And the code actually needs to follow that design.
> > How are faults handled?
>
> As you have a lot more experience working with futexes than me, any
> suggestions on what kind of faults will happen and what are the best practices
> to handle them will be highly appreciated.
So shall I fly over and read you the source code of futex.c as a
bedtime story?
You did not even try to understand how futexes work and what corner
cases they have by studying the existing code and reading the git
history.
No, you simply cobbled something together and created random
performance numbers and because they are so wonderful, you expect that
I and the other people who worked hard on that code do your homework.
No, that's not how it works.
Your numbers are completely useless because they just measure the fast
path by omitting all the required security measures and corner case
handling.
Darren had his version of spinning done years ago, but we had to
ground it due to unresolvable issues at that point. I'm quite sure
that you did not even try to figure out whether people had looked into
that issue before and tried to understand why it failed, otherwise you
would have been clever enough to provide a reference and explain why
your approach is better or solves the underlying problems.
So your RFC is not impressive at all. It's just an inferior
implementation of something we have been aware of for a very long time.
I'm already tired of this, really. Unless you start to understand the
problems of futexes in the first place -- and you should ask your coworker
how mind-boggling they can be -- do not even bother to send another
half-arsed patch. Spare the electrons and the time of everyone involved.
Thanks,
tglx
On Tue, 22 Jul 2014, Waiman Long wrote:
> On 07/21/2014 09:01 PM, Thomas Gleixner wrote:
> > So before anyone comes up with a "solution" for all of this tomorrow
> > afternoon in the form of another half-baked patch, please sit back, mull it
> > over in your head, and let's have a proper discussion about the approach
> > first.
> >
> > Thanks,
> >
> > tglx
>
> Thanks for your thorough analysis and suggestions on what to do to support
> spinning futexes. You certainly know more about the internal workings of
> futexes than most of us. I can live with what you have suggested. My patch is
> just a proof-of-concept piece to demonstrate that optimistic spinning on
> futexes is something worthwhile to do. I think I have achieved my goal of stirring
We knew that already, as Darren has proven that optimistic spinning in
a simpler form provides a huge benefit. So you have not achieved
anything except annoying people with your sloppiness.
> interest in this area. My next step will be to look into what you have
> suggested and figure out what actual code changes will be needed.
You really can do what you want with your time. You can either reread
AND understand the paragraph, which I left as a quote from my previous
mail, AND act accordingly, or just stay away from futex.c.
Thanks,
tglx
On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
> FWIW the main problem is currently that switch-through-idle is so
> slow. I think improving that would give a boost to far more
> situations.
Two high frequency idle enter/exit suckage spots:
1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
throttle it or something.
2) ondemand governor - tweak silly default settings to reflect the
reality that we routinely schedule communicating threads cross core.
(3. seek/destroy fastpath cycles, damn things multiply over time)
-Mike
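To make the second point above concrete: the ondemand tunables of that era
live under /sys/devices/system/cpu/cpufreq/ondemand/, and the kind of tweak
being alluded to is lowering up_threshold and/or raising sampling_down_factor
so the governor ramps up earlier and backs off more slowly after a burst. A
rough sketch, with purely illustrative values rather than recommended ones:

/*
 * Illustrative only: adjust the ondemand governor's global tunables so
 * short, bursty communicating threads are less likely to land on a core
 * still sitting at its lowest frequency.  The knob names are the existing
 * ondemand sysfs files; the values are placeholders, not tuning advice.
 */
#include <stdio.h>

static int write_knob(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpufreq/ondemand/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	write_knob("up_threshold", "80");	  /* ramp up at lower load */
	write_knob("sampling_down_factor", "10"); /* back off more slowly */
	return 0;
}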
On Wed, Jul 23, 2014 at 06:55:03AM +0200, Mike Galbraith wrote:
> On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
>
> > FWIW the main problem is currently that switch-through-idle is so
> > slow. I think improving that would give a boost to far more
> > situations.
>
> Two high frequency idle enter/exit suckage spots:
>
> 1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
> throttle it or something.
Yeah, so the idea was to use the cpuidle idle guestimator to control
this, and now that we've moved it somewhat closer to the scheduler that
might become possible.
> 2) ondemand governor - tweak silly default settings to reflect the
> reality that we routinely schedule communicating threads cross core.
Yeah, so the plan is to shoot cpufreq in the head and base the
replacement on smp-aware metrics ;-) It's on a todo list somewhere..
On Wed, 2014-07-23 at 08:57 +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 06:55:03AM +0200, Mike Galbraith wrote:
> > On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
> >
> > > FWIW the main problem is currently that switch-through-idle is so
> > > slow. I think improving that would give a boost to far more
> > > situations.
> >
> > Two high frequency idle enter/exit suckage spots:
> >
> > 1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
> > throttle it or something.
>
> Yeah, so the idea was to use the cpuidle idle guestimator to control
> this, and now that we've moved it somewhat closer to the scheduler that
> might become possible.
>
> > 2) ondemand governor - tweak silly default settings to reflect the
> > reality that we routinely schedule communicating threads cross core.
>
> Yeah, so the plan is to shoot cpufreq in the head and base the
> replacement on smp-aware metrics ;-) It's on a todo list somewhere..
It never ceases to amaze me that people aren't screaming bloody murder
about those two spots. Watching performance of lightly loaded boxen is
enough to make a grown man cry.
SUSE (and I in all of my many regression testing trees) puts tourniquets
on both of these blood spurting gashes, laptops be damned.
I also resurrect mwait_idle(), as while you may consider it obsolete, I
still love my lovely little Q6600 box (power sucking pig) dearly :)
-Mike
On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> I also resurrect mwait_idle(), as while you may consider it obsolete, I
> still love my lovely little Q6600 box (power sucking pig) dearly :)
Yeah, I keep forgetting about that one.. we should get that fixed.
On Wed, 2014-07-23 at 09:35 +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> > I also resurrect mwait_idle(), as while you may consider it obsolete, I
> > still love my lovely little Q6600 box (power sucking pig) dearly :)
>
> Yeah, I keep forgetting about that one.. we should get that fixed.
(it is an obsolete pig, and nobody else seems to mind, so I don't mind)
On Wed, Jul 23, 2014 at 09:39:03AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-23 at 09:35 +0200, Peter Zijlstra wrote:
> > On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> > > I also resurrect mwait_idle(), as while you may consider it obsolete, I
> > > still love my lovely little Q6600 box (power sucking pig) dearly :)
> >
> > Yeah, I keep forgetting about that one.. we should get that fixed.
>
> (it is an obsolete pig, and nobody else seems to mind, so I don't mind)
Hey, I fixed early P4 topology setup yesterday, core2 is like brand
spanking new compared :-)