2002-03-05 06:58:32

by Rusty Russell

Subject: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
2) Fix for the "decrement wraparound" problem (Paul Mackerras)
3) x86 fixes: tested on dual x86 box.

Example userspace lib attached,
Rusty.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/futex.h working-2.5.6-pre2-futex/include/linux/futex.h
--- linux-2.5.6-pre2/include/linux/futex.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/futex.h Tue Mar 5 13:53:33 2002
@@ -0,0 +1,8 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_UP (1)
+#define FUTEX_DOWN (-1)
+
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/futex.c working-2.5.6-pre2-futex/kernel/futex.c
--- linux-2.5.6-pre2/kernel/futex.c Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/kernel/futex.c Tue Mar 5 13:53:33 2002
@@ -0,0 +1,208 @@
+/*
+ * Fast Userspace Mutexes (which I call "Futexes!").
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ * enough at me, Linus for the original (flawed) idea, Matthew
+ * Kirkwood for proof-of-concept implementation.
+ *
+ * "The futexes are also cursed."
+ * "But they come in a choice of three flavours!"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+ decrements from 1 to 0. The counter starts at 1 when the lock is
+ free. A value other than 0 or 1 means someone may be sleeping.
+ This is simple enough to work on all architectures, but has the
+ problem that if we never "up" the semaphore it could eventually
+ wrap around. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+ the relevant ones (hashed queues may be shared) */
+struct futex_q {
+ struct list_head list;
+ struct task_struct *task;
+ atomic_t *count;
+};
+
+/* The key for the hash is the address + index + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+static inline struct list_head *hash_futex(struct page *page,
+ unsigned long offset)
+{
+ unsigned long h;
+
+ /* If someone is sleeping, page is pinned. ie. page_address
+ is a constant when we care about it. */
+ h = (unsigned long)page_address(page) + offset;
+ return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+
+ spin_lock(&futex_lock);
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ wake_up_process(this->task);
+ break;
+ }
+ }
+ spin_unlock(&futex_lock);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct list_head *head,
+ struct futex_q *q,
+ atomic_t *count)
+{
+ q->task = current;
+ q->count = count;
+
+ spin_lock(&futex_lock);
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+}
+
+static inline void unqueue_me(struct futex_q *q)
+{
+ spin_lock(&futex_lock);
+ list_del(&q->list);
+ spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+ struct mm_struct *mm = current->mm;
+ struct page *page;
+ int err;
+
+ down_read(&mm->mmap_sem);
+ err = get_user_pages(current, current->mm, page_start,
+ 1 /* one page */,
+ 1 /* writable */,
+ 0 /* don't force */,
+ &page,
+ NULL /* don't return vmas */);
+ up_read(&mm->mmap_sem);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+ int retval = 0;
+ struct futex_q q;
+
+ current->state = TASK_INTERRUPTIBLE;
+ queue_me(head, &q, count);
+
+ /* If we take the semaphore from 1 to 0, it's ours. But don't
+ bother decrementing if it's already negative. */
+ while (atomic_read(count) < 0 || !atomic_dec_and_test(count)) {
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ break;
+ }
+ schedule();
+ current->state = TASK_INTERRUPTIBLE;
+ }
+ current->state = TASK_RUNNING;
+ unqueue_me(&q);
+ /* If we were signalled, we might have just been woken: we
+ must wake another one. Otherwise we need to wake someone
+ else (if they are waiting) so they drop the count below 0,
+ and when we "up" in userspace, we know there is a
+ waiter. */
+ wake_one_waiter(head, count);
+ return retval;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+ atomic_set(count, 1);
+ smp_wmb();
+ wake_one_waiter(head, count);
+ return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+ int ret;
+ unsigned long pos_in_page;
+ struct list_head *head;
+ struct page *page;
+
+ pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+ /* Must be "naturally" aligned, and not on page boundary. */
+ if ((pos_in_page % __alignof__(atomic_t)) != 0
+ || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Simpler if it doesn't vanish underneath us. */
+ page = pin_page((unsigned long)uaddr - pos_in_page);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ head = hash_futex(page, pos_in_page);
+ switch (op) {
+ case FUTEX_UP:
+ ret = futex_up(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_DOWN:
+ ret = futex_down(head, page_address(page) + pos_in_page);
+ break;
+ /* Add other lock types here... */
+ default:
+ ret = -EINVAL;
+ }
+ put_page(page);
+
+ return ret;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+ INIT_LIST_HEAD(&futex_queues[i]);
+ return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/i386/kernel/entry.S working-2.5.6-pre2-futex/arch/i386/kernel/entry.S
--- linux-2.5.6-pre2/arch/i386/kernel/entry.S Wed Feb 20 17:56:59 2002
+++ working-2.5.6-pre2-futex/arch/i386/kernel/entry.S Tue Mar 5 13:53:33 2002
@@ -716,6 +716,7 @@
.long SYMBOL_NAME(sys_lremovexattr)
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
+ .long SYMBOL_NAME(sys_futex)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/ppc/kernel/misc.S working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S
--- linux-2.5.6-pre2/arch/ppc/kernel/misc.S Wed Feb 20 17:57:04 2002
+++ working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S Tue Mar 5 13:53:33 2002
@@ -1246,6 +1246,7 @@
.long sys_removexattr
.long sys_lremovexattr
.long sys_fremovexattr /* 220 */
+ .long sys_futex
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/mman.h working-2.5.6-pre2-futex/include/asm-i386/mman.h
--- linux-2.5.6-pre2/include/asm-i386/mman.h Wed Mar 15 12:45:20 2000
+++ working-2.5.6-pre2-futex/include/asm-i386/mman.h Tue Mar 5 13:53:33 2002
@@ -4,6 +4,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/unistd.h working-2.5.6-pre2-futex/include/asm-i386/unistd.h
--- linux-2.5.6-pre2/include/asm-i386/unistd.h Wed Feb 20 17:56:40 2002
+++ working-2.5.6-pre2-futex/include/asm-i386/unistd.h Tue Mar 5 13:53:33 2002
@@ -243,6 +243,7 @@
#define __NR_lremovexattr 236
#define __NR_fremovexattr 237
#define __NR_tkill 238
+#define __NR_futex 239

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/mman.h working-2.5.6-pre2-futex/include/asm-ppc/mman.h
--- linux-2.5.6-pre2/include/asm-ppc/mman.h Tue May 22 08:02:06 2001
+++ working-2.5.6-pre2-futex/include/asm-ppc/mman.h Tue Mar 5 13:53:33 2002
@@ -7,6 +7,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/unistd.h working-2.5.6-pre2-futex/include/asm-ppc/unistd.h
--- linux-2.5.6-pre2/include/asm-ppc/unistd.h Wed Feb 20 17:57:18 2002
+++ working-2.5.6-pre2-futex/include/asm-ppc/unistd.h Tue Mar 5 13:53:33 2002
@@ -228,6 +228,7 @@
#define __NR_removexattr 218
#define __NR_lremovexattr 219
#define __NR_fremovexattr 220
+#define __NR_futex 221

#define __NR(n) #n

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/Makefile working-2.5.6-pre2-futex/kernel/Makefile
--- linux-2.5.6-pre2/kernel/Makefile Wed Feb 20 17:56:17 2002
+++ working-2.5.6-pre2-futex/kernel/Makefile Tue Mar 5 13:53:33 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o futex.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/hash.h working-2.5.6-pre2-futex/include/linux/hash.h
--- linux-2.5.6-pre2/include/linux/hash.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/hash.h Tue Mar 5 13:53:33 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+ (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+ unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+ /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+ unsigned long n = hash;
+ n <<= 18;
+ hash -= n;
+ n <<= 33;
+ hash -= n;
+ n <<= 3;
+ hash += n;
+ n <<= 3;
+ hash -= n;
+ n <<= 4;
+ hash += n;
+ n <<= 2;
+ hash += n;
+#else
+ /* On some cpus multiply is faster, on others gcc will do shifts */
+ hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+ /* High bits are more random, so use them. */
+ return hash >> (BITS_PER_LONG - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+ return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/mmzone.h working-2.5.6-pre2-futex/include/linux/mmzone.h
--- linux-2.5.6-pre2/include/linux/mmzone.h Fri Mar 1 22:58:34 2002
+++ working-2.5.6-pre2-futex/include/linux/mmzone.h Tue Mar 5 14:03:15 2002
@@ -51,8 +51,7 @@
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
- * wait_table_shift -- wait_table_size
- * == BITS_PER_LONG (1 << wait_table_bits)
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
@@ -75,7 +74,7 @@
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
- unsigned long wait_table_shift;
+ unsigned long wait_table_bits;

/*
* Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/filemap.c working-2.5.6-pre2-futex/mm/filemap.c
--- linux-2.5.6-pre2/mm/filemap.c Fri Mar 1 23:27:15 2002
+++ working-2.5.6-pre2-futex/mm/filemap.c Tue Mar 5 13:53:33 2002
@@ -25,6 +25,7 @@
#include <linux/iobuf.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/hash.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -773,32 +774,8 @@
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
const zone_t *zone = page_zone(page);
- wait_queue_head_t *wait = zone->wait_table;
- unsigned long hash = (unsigned long)page;

-#if BITS_PER_LONG == 64
- /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
- unsigned long n = hash;
- n <<= 18;
- hash -= n;
- n <<= 33;
- hash -= n;
- n <<= 3;
- hash += n;
- n <<= 3;
- hash -= n;
- n <<= 4;
- hash += n;
- n <<= 2;
- hash += n;
-#else
- /* On some cpus multiply is faster, on others gcc will do shifts */
- hash *= GOLDEN_RATIO_PRIME;
-#endif
-
- hash >>= zone->wait_table_shift;
-
- return &wait[hash];
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/*
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/mprotect.c working-2.5.6-pre2-futex/mm/mprotect.c
--- linux-2.5.6-pre2/mm/mprotect.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/mprotect.c Tue Mar 5 13:53:33 2002
@@ -280,7 +280,7 @@
end = start + len;
if (end < start)
return -EINVAL;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
return -EINVAL;
if (end == start)
return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/page_alloc.c working-2.5.6-pre2-futex/mm/page_alloc.c
--- linux-2.5.6-pre2/mm/page_alloc.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/page_alloc.c Tue Mar 5 13:53:33 2002
@@ -776,8 +776,8 @@
* per zone.
*/
zone->wait_table_size = wait_table_size(size);
- zone->wait_table_shift =
- BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+ zone->wait_table_bits =
+ wait_table_bits(zone->wait_table_size);
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


Attachments:
futex-1.0.tar.bz2 (2.79 kB)
Futex example library
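The example library itself is only in the attachment; a minimal sketch of the userspace fast path it implements looks roughly like this (the protocol is inferred from the kernel comments in the patch above; the __sync builtins and the i386 syscall number 239 added by this patch are assumptions for illustration, not the 2002 library's actual code):

/* Sketch only: the counter starts at 1; whoever atomically takes it
   from 1 to 0 owns the lock, anything else falls back to the kernel. */
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_futex 239 /* i386 syscall number from this patch */
#define FUTEX_UP 1
#define FUTEX_DOWN (-1)

static void futex_down_user(volatile int *count)
{
	/* Fast path: an uncontended 1 -> 0 transition takes the lock. */
	if (__sync_fetch_and_add(count, -1) != 1)
		syscall(__NR_futex, (void *)count, FUTEX_DOWN);
}

static void futex_up_user(volatile int *count)
{
	/* Fast path: 0 -> 1 with no waiters needs no syscall; a
	   negative count means sleepers, so let the kernel reset
	   the count to 1 and wake one of them. */
	if (__sync_fetch_and_add(count, 1) != 0)
		syscall(__NR_futex, (void *)count, FUTEX_UP);
}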

2002-03-05 21:22:51

by Hubertus Franke

Subject: Futexes III : performance numbers

On Tuesday 05 March 2002 02:01 am, Rusty Russell wrote:
> 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> 3) x86 fixes: tested on dual x86 box.
>
> Example userspace lib attached,
> Rusty.


I did a quick hack to enable ulockflex to run on the latest interface that
Rusty posted.

The command run was
./ulockflex -c 2 -a 1 -t 2 -o 5 -m -1 -R 499 -r 0 -x 0 -L f

that is, 2 children, 1 lock, 2 seconds in a tight loop contending for the lock.
Tue Mar 5 15:12:56 EST 2002: calibrated at 496
base arguments are <-t 10 -o 5 -m 8 -R 496 -P>
c #number of children
a #number of locks (always one in this case)
t #seconds of each run
i #number of iterations
L locktype (capital letter means spinning version)
p patience of spinning, if any
r mean runtime usecs without holding a lock
x mean runtime usecs with lock held
WC lock contention on write lock
RC read contention on read lock (for shared lock)
R percentage of failed lock attempts resolved by spinning OR
average number of times in the CA lock a failed attempt needed to retry
COV coefficient of variation between children (fairness)
ThPut throughput achieved per second

Lock types:
e: empty locks no lock/unlock performed
k: sysv semaphores
f: usema (Hubertus Franke)
m: mutex (Rusty Russell)
c: convoy avoidance lock (Hubertus Franke)

RAW OVERHEAD
----------------------------

First, the raw overhead numbers for 1 process:
c a t i L p r x WC RC R COV ThPut
1 1 2 1 e 0 0 0 0.00 0.00 0.00 0.000000 2316611
1 1 2 1 k 0 0 0 0.00 0.00 0.00 0.000000 539468
1 1 2 1 m 0 0 0 0.00 0.00 0.00 0.000000 2022462
1 1 2 1 f 0 0 0 0.00 0.00 0.00 0.000000 1979692
1 1 2 1 c 0 0 0 0.00 0.00 0.00 0.000000 1936554

User locks are about 4 times faster than kernel locking techniques.

Let's look at the contention issue with 2 processes:
2 1 5 4 k 0 0 0 99.77 0.00 0.00 0.003608 289920
2 1 5 4 m 0 0 0 9.67 0.00 0.00 0.086198 843918
2 1 5 4 f 0 0 0 97.90 0.00 0.00 0.028389 328755
2 1 5 4 c 0 0 0 6.80 0.00 0.06 0.001428 933587

This analysis reveals a few interesting points. First, again, user locks
are much faster.
RR's version does not provide a FIFO locking order!
This can be seen from the fact that the kernel was called in only 9.67%
of the lock requests. It also has a high COV of ~8%.
As a result, RR's mutexes are about 2.5 times faster than my semaphores,
which process in strict FIFO order. This is due to the fact that
I separate user state and kernel state, as described in various previous
messages.

We need to know whether FIFO processing is required.
I have provided locks similar to RR's mutexes, which I call convoy avoidance
locks, and you can see they are about 10% better than RR's locks.

Now 3 processes
3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175

Interesting... the strict FIFO ordering of my fast semaphores limits
performance, as seen by the 99.71% contention, so we always ditch
into the kernel. Convoy avoidance locks are 2.5 times better.
Whoa, futexes rock, BUT... with 0.29% contention it basically tells
me that we are exhausting our entire quantum getting the lock
without contention. So there is some serious fairness issue here,
at least for the tightly scheduled locks. Compare the M numbers
for 2 and 3 children.

Let's rock again.... 100 processes
100 1 10 4 k 0 0 0 100.00 0.00 0.00 0.422294 17776
100 1 10 4 m 0 0 0 0.58 0.00 0.00 0.278172 1793520
100 1 10 4 f 0 0 0 99.50 0.00 0.00 0.478905 52834
100 1 10 4 c 0 0 0 8.05 0.00 12.02 0.522363 563151

Same as above: futexes rock, but at very tight arrival rates they are highly
unfair, though they avoid the so-called convoy phenomenon.

Linus, or for that matter anybody else, what's your take: is some level of
starvation acceptable when deploying fast user locks?

REALISTIC ARRIVAL RATE
----------------------------
Let's switch to something more realistic: 1 usec mean lock hold time and
10 usecs mean non-lock time.

3 1 10 4 k 0 10 1 98.70 0.00 0.00 0.005531 129429
3 1 10 4 m 0 10 1 7.29 0.00 0.00 0.113565 122602
3 1 10 4 f 0 10 1 98.19 0.00 0.00 0.009873 138952
3 1 10 4 c 0 10 1 10.61 0.00 8.45 0.325935 139446

Once the convoy disappears, my locks actually seem to perform better.
The reason is likely that the in-kernel hashing is faster (a guess).


100 1 10 4 k 0 10 1 99.96 0.00 0.00 0.003876 12811
100 1 10 4 m 0 10 1 9.36 0.00 0.00 0.051076 139238
100 1 10 4 f 0 10 1 100.00 0.00 0.00 0.162686 36601
100 1 10 4 c 0 10 1 10.58 0.00 9.33 0.326738 138602

Again, in the relevant cases convoy avoidance locks do roughly what
RR's futexes do. Fair locks don't do as well, still due to the FIFO order.

Now two takes are possible:

(a) we need FIFO ordering even for fast user-level locks
(b) I don't give a rat's heinie; if you want FIFO, use SysV locks.

Comments, stands?

Again, we'd better settle this one soon.

-- Hubertus

2002-03-05 22:37:55

by Davide Libenzi

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tue, 5 Mar 2002, Rusty Russell wrote:

> + pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> +
> + /* Must be "naturally" aligned, and not on page boundary. */
> + if ((pos_in_page % __alignof__(atomic_t)) != 0
> + || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> + return -EINVAL;

How can this:

(pos_in_page % __alignof__(atomic_t)) != 0

be false, while at the same time this:

pos_in_page + sizeof(atomic_t) > PAGE_SIZE

is true?
This is enough:

if ((pos_in_page % __alignof__(atomic_t)) != 0)




- Davide


2002-03-05 23:18:14

by Hubertus Franke

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tuesday 05 March 2002 05:39 pm, Davide Libenzi wrote:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> > + pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > +
> > + /* Must be "naturally" aligned, and not on page boundary. */
> > + if ((pos_in_page % __alignof__(atomic_t)) != 0
> > + || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > + return -EINVAL;
>
> How can this:
>
> (pos_in_page % __alignof__(atomic_t)) != 0
>
> be false, while at the same time this:
>
> pos_in_page + sizeof(atomic_t) > PAGE_SIZE
>
> is true?
> This is enough:
>
> if ((pos_in_page % __alignof__(atomic_t)) != 0)
>
>

I believe not all machines have alignof == sizeof.

--
-- Hubertus Franke ([email protected])

2002-03-05 23:23:24

by Davide Libenzi

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tue, 5 Mar 2002, Hubertus Franke wrote:

> On Tuesday 05 March 2002 05:39 pm, Davide Libenzi wrote:
> > On Tue, 5 Mar 2002, Rusty Russell wrote:
> > > + pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > > +
> > > + /* Must be "naturally" aligned, and not on page boundary. */
> > > + if ((pos_in_page % __alignof__(atomic_t)) != 0
> > > + || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > > + return -EINVAL;
> >
> > How can this:
> >
> > (pos_in_page % __alignof__(atomic_t)) != 0
> >
> > be false, while at the same time this:
> >
> > pos_in_page + sizeof(atomic_t) > PAGE_SIZE
> >
> > is true?
> > This is enough:
> >
> > if ((pos_in_page % __alignof__(atomic_t)) != 0)
> >
> >
>
> I believe not all machines have alignof == sizeof.

Yes, but this is always true: alignof >= sizeof.




- Davide


2002-03-05 23:38:25

by Peter Svensson

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tue, 5 Mar 2002, Davide Libenzi wrote:

> > I believe not all machines have alignof == sizeof.
>
> Yes, but this is always true: alignof >= sizeof.

No, this is not true. As the gcc info pages say:
For example, if the target machine requires a `double' value to be
aligned on an 8-byte boundary, then `__alignof__ (double)' is 8. This
is true on many RISC machines. On more traditional machine designs,
`__alignof__ (double)' is 4 or even 2.
A later example shows situations where alignof > sizeof.
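A quick way to see the relationship on a given machine (purely illustrative; the output is ABI-dependent):

#include <stdio.h>

int main(void)
{
	/* On i386 with gcc, sizeof(double) is 8 while __alignof__(double)
	   is typically 4, i.e. alignof < sizeof; many RISC ABIs give 8. */
	printf("sizeof(double)=%zu __alignof__(double)=%zu\n",
	       sizeof(double), (size_t)__alignof__(double));
	return 0;
}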


Peter
--
Peter Svensson ! Pgp key available by finger, fingerprint:
<[email protected]> ! 8A E9 20 98 C1 FF 43 E3 07 FD B9 0A 80 72 70 AF
------------------------------------------------------------------------
Remember, Luke, your source will be with you... always...



2002-03-05 23:47:35

by Davide Libenzi

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Wed, 6 Mar 2002, Peter Svensson wrote:

> On Tue, 5 Mar 2002, Davide Libenzi wrote:
>
> > > I believe not all machines have alignof == sizeof.
> >
> > Yes, but this is always true: alignof >= sizeof.
>
> No, this is not true. As the gcc info pages say:
> For example, if the target machine requires a `double' value to be
> aligned on an 8-byte boundary, then `__alignof__ (double)' is 8. This
> is true on many RISC machines. On more traditional machine designs,
> `__alignof__ (double)' is 4 or even 2.
> A later example shows situations where alignof > sizeof.

Yes, it's true.



- Davide



2002-03-06 01:43:28

by Rusty Russell

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
>
> > + pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > +
> > + /* Must be "naturally" aligned, and not on page boundary. */
> > + if ((pos_in_page % __alignof__(atomic_t)) != 0
> > + || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > + return -EINVAL;
>
> How can this:
>
> (pos_in_page % __alignof__(atomic_t)) != 0
>
> be false, while at the same time this:
>
> pos_in_page + sizeof(atomic_t) > PAGE_SIZE
>
> is true?

You're assuming that __alignof__(atomic_t) = N * sizeof(atomic_t),
where N is an integer.

If alignof == 1, and sizeof == 4, you lose. I prefer to be
future-proof.

This means I should clarify the comment...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-06 02:01:39

by Davide Libenzi

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Wed, 6 Mar 2002, Rusty Russell wrote:

> In message <[email protected]> you write:
> > On Tue, 5 Mar 2002, Rusty Russell wrote:
> >
> > > + pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > > +
> > > + /* Must be "naturally" aligned, and not on page boundary. */
> > > + if ((pos_in_page % __alignof__(atomic_t)) != 0
> > > + || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > > + return -EINVAL;
> >
> > How can this:
> >
> > (pos_in_page % __alignof__(atomic_t)) != 0
> >
> > be false, while at the same time this:
> >
> > pos_in_page + sizeof(atomic_t) > PAGE_SIZE
> >
> > is true?
>
> You're assuming that __alignof__(atomic_t) = N * sizeof(atomic_t),
> where N is an integer.
>
> If alignof == 1, and sizeof == 4, you lose. I prefer to be
> future-proof.
>
> This means I should clarify the comment...

No, I should do fewer things at a time :-(



- Davide


2002-03-06 02:39:50

by Rusty Russell

Subject: Re: Futexes III : performance numbers

In message <[email protected]> you write:
> On Tuesday 05 March 2002 02:01 am, Rusty Russell wrote:
> > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > 3) x86 fixes: tested on dual x86 box.
> >
> > Example userspace lib attached,
> > Rusty.
>
>
> I did a quick hack to enable ulockflex to run on the latest interface that
> Rusty posted.

Cool... is this 8-way or some such "serious" SMP? How about the
below microoptimization (untested, but you get the idea).

> Now 3 processes
> 3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
> 3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
> 3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
> 3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175
>
> Interesting... the strict FIFO ordering of my fast semaphores limits
> performance, as seen by the 99.71% contention, so we always ditch
> into the kernel. Convoy avoidance locks are 2.5 times better.

Hmmm... actually I'm limited FIFO, in that I queue on the tail and do
wake one. Of course, someone can come in userspace and grab the lock
while the guy in the kernel is waking up, and this is clearly
happening here.

This can be fixed, I think, by saying to the one we wake up "you have
the lock" and never actually changing the value to 1. This might cost
us very little: I'll send another patch this afternoon.

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

--- tmp/kernel/futex.c Wed Mar 6 13:03:08 2002
+++ working-2.5.6-pre1-futex/kernel/futex.c Wed Mar 6 13:01:48 2002
@@ -51,12 +51,17 @@
atomic_t *count;
};

-/* The key for the hash is the address + index + offset within page */
-static struct list_head futex_queues[1<<FUTEX_HASHBITS];
-static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+/* The key for the hash is the address + offset within page */
+struct futex_head
+{
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned;

-static inline struct list_head *hash_futex(struct page *page,
- unsigned long offset)
+static struct futex_head futex_queues[1<<FUTEX_HASHBITS] __cacheline_aligned;
+
+static inline struct futex_head *hash_futex(struct page *page,
+ unsigned long offset)
{
unsigned long h;

@@ -66,12 +71,12 @@
return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
}

-static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+static inline void wake_one_waiter(struct futex_head *head, atomic_t *count)
{
struct list_head *i;

- spin_lock(&futex_lock);
- list_for_each(i, head) {
+ spin_lock(&head->lock);
+ list_for_each(i, &head->list) {
struct futex_q *this = list_entry(i, struct futex_q, list);

if (this->count == count) {
@@ -79,27 +84,27 @@
break;
}
}
- spin_unlock(&futex_lock);
+ spin_unlock(&head->lock);
}

/* Add at end to avoid starvation */
-static inline void queue_me(struct list_head *head,
+static inline void queue_me(struct futex_head *head,
struct futex_q *q,
atomic_t *count)
{
q->task = current;
q->count = count;

- spin_lock(&futex_lock);
- list_add_tail(&q->list, head);
- spin_unlock(&futex_lock);
+ spin_lock(&head->lock);
+ list_add_tail(&q->list, &head->list);
+ spin_unlock(&head->lock);
}

-static inline void unqueue_me(struct futex_q *q)
+static inline void unqueue_me(struct futex_head *head, struct futex_q *q)
{
- spin_lock(&futex_lock);
+ spin_lock(&head->lock);
list_del(&q->list);
- spin_unlock(&futex_lock);
+ spin_unlock(&head->lock);
}

/* Get kernel address of the user page and pin it. */
@@ -124,7 +129,7 @@
}

/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
-static int futex_down(struct list_head *head, atomic_t *count)
+static int futex_down(struct futex_head *head, atomic_t *count)
{
int retval = 0;
struct futex_q q;
@@ -143,7 +148,7 @@
current->state = TASK_INTERRUPTIBLE;
}
current->state = TASK_RUNNING;
- unqueue_me(&q);
+ unqueue_me(head, &q);
/* If we were signalled, we might have just been woken: we
must wake another one. Otherwise we need to wake someone
else (if they are waiting) so they drop the count below 0,
@@ -153,7 +158,7 @@
return retval;
}

-static int futex_up(struct list_head *head, atomic_t *count)
+static int futex_up(struct futex_head *head, atomic_t *count)
{
atomic_set(count, 1);
smp_wmb();
@@ -165,7 +170,7 @@
{
int ret;
unsigned long pos_in_page;
- struct list_head *head;
+ struct futex_head *head;
struct page *page;

pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
@@ -201,8 +206,10 @@
{
unsigned int i;

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
- INIT_LIST_HEAD(&futex_queues[i]);
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ INIT_LIST_HEAD(&futex_queues[i].list);
+ spin_lock_init(&futex_queues[i].lock);
+ }
return 0;
}
__initcall(init);

2002-03-06 07:51:33

by Rusty Russell

Subject: Re: Futexes III : performance numbers

On Tue, 5 Mar 2002 16:23:14 -0500
Hubertus Franke <[email protected]> wrote:
> Interesting... the strict FIFO ordering of my fast semaphores limits
> performance, as seen by the 99.71% contention, so we always ditch
> into the kernel. Convoy avoidance locks are 2.5 times better.
> Whoa, futexes rock, BUT... with 0.29% contention it basically tells
> me that we are exhausting our entire quantum getting the lock
> without contention. So there is some serious fairness issue here,
> at least for the tightly scheduled locks. Compare the M numbers
> for 2 and 3 children.

Fairness <sigh>. This patch should be much more FIFO: it works by handing
the mutex straight to the first one on the queue if there is one, and only
actually "freeing" it if there's no one waiting.

Unfortunately, it seems to hurt performance by 50% on tdbtorture (although
there are weird scheduler things happening too).

Here's the "fair" patch:
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/i386/kernel/entry.S working-2.5.6-pre2-futex/arch/i386/kernel/entry.S
--- linux-2.5.6-pre2/arch/i386/kernel/entry.S Wed Feb 20 17:56:59 2002
+++ working-2.5.6-pre2-futex/arch/i386/kernel/entry.S Tue Mar 5 13:53:33 2002
@@ -716,6 +716,7 @@
.long SYMBOL_NAME(sys_lremovexattr)
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
+ .long SYMBOL_NAME(sys_futex)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/ppc/kernel/misc.S working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S
--- linux-2.5.6-pre2/arch/ppc/kernel/misc.S Wed Feb 20 17:57:04 2002
+++ working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S Tue Mar 5 13:53:33 2002
@@ -1246,6 +1246,7 @@
.long sys_removexattr
.long sys_lremovexattr
.long sys_fremovexattr /* 220 */
+ .long sys_futex
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/atomic.h working-2.5.6-pre2-futex/include/asm-i386/atomic.h
--- linux-2.5.6-pre2/include/asm-i386/atomic.h Fri Nov 23 06:46:18 2001
+++ working-2.5.6-pre2-futex/include/asm-i386/atomic.h Tue Mar 5 14:03:15 2002
@@ -2,6 +2,7 @@
#define __ARCH_I386_ATOMIC__

#include <linux/config.h>
+#include <asm/system.h>

/*
* Atomic operations that C can't guarantee us. Useful for
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/mman.h working-2.5.6-pre2-futex/include/asm-i386/mman.h
--- linux-2.5.6-pre2/include/asm-i386/mman.h Wed Mar 15 12:45:20 2000
+++ working-2.5.6-pre2-futex/include/asm-i386/mman.h Tue Mar 5 13:53:33 2002
@@ -4,6 +4,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/unistd.h working-2.5.6-pre2-futex/include/asm-i386/unistd.h
--- linux-2.5.6-pre2/include/asm-i386/unistd.h Wed Feb 20 17:56:40 2002
+++ working-2.5.6-pre2-futex/include/asm-i386/unistd.h Tue Mar 5 13:53:33 2002
@@ -243,6 +243,7 @@
#define __NR_lremovexattr 236
#define __NR_fremovexattr 237
#define __NR_tkill 238
+#define __NR_futex 239

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/mman.h working-2.5.6-pre2-futex/include/asm-ppc/mman.h
--- linux-2.5.6-pre2/include/asm-ppc/mman.h Tue May 22 08:02:06 2001
+++ working-2.5.6-pre2-futex/include/asm-ppc/mman.h Tue Mar 5 13:53:33 2002
@@ -7,6 +7,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/unistd.h working-2.5.6-pre2-futex/include/asm-ppc/unistd.h
--- linux-2.5.6-pre2/include/asm-ppc/unistd.h Wed Feb 20 17:57:18 2002
+++ working-2.5.6-pre2-futex/include/asm-ppc/unistd.h Tue Mar 5 13:53:33 2002
@@ -228,6 +228,7 @@
#define __NR_removexattr 218
#define __NR_lremovexattr 219
#define __NR_fremovexattr 220
+#define __NR_futex 221

#define __NR(n) #n

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/futex.h working-2.5.6-pre2-futex/include/linux/futex.h
--- linux-2.5.6-pre2/include/linux/futex.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/futex.h Tue Mar 5 13:53:33 2002
@@ -0,0 +1,8 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_UP (1)
+#define FUTEX_DOWN (-1)
+
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/hash.h working-2.5.6-pre2-futex/include/linux/hash.h
--- linux-2.5.6-pre2/include/linux/hash.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/hash.h Tue Mar 5 13:53:33 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+ (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+ unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+ /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+ unsigned long n = hash;
+ n <<= 18;
+ hash -= n;
+ n <<= 33;
+ hash -= n;
+ n <<= 3;
+ hash += n;
+ n <<= 3;
+ hash -= n;
+ n <<= 4;
+ hash += n;
+ n <<= 2;
+ hash += n;
+#else
+ /* On some cpus multiply is faster, on others gcc will do shifts */
+ hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+ /* High bits are more random, so use them. */
+ return hash >> (BITS_PER_LONG - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+ return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/mmzone.h working-2.5.6-pre2-futex/include/linux/mmzone.h
--- linux-2.5.6-pre2/include/linux/mmzone.h Fri Mar 1 22:58:34 2002
+++ working-2.5.6-pre2-futex/include/linux/mmzone.h Tue Mar 5 14:03:15 2002
@@ -51,8 +51,7 @@
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
- * wait_table_shift -- wait_table_size
- * == BITS_PER_LONG (1 << wait_table_bits)
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
@@ -75,7 +74,7 @@
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
- unsigned long wait_table_shift;
+ unsigned long wait_table_bits;

/*
* Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/Makefile working-2.5.6-pre2-futex/kernel/Makefile
--- linux-2.5.6-pre2/kernel/Makefile Wed Feb 20 17:56:17 2002
+++ working-2.5.6-pre2-futex/kernel/Makefile Tue Mar 5 13:53:33 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o futex.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/futex.c working-2.5.6-pre2-futex/kernel/futex.c
--- linux-2.5.6-pre2/kernel/futex.c Thu Jan 1 10:00:00 1970
+++ working-2.5.6-pre2-futex/kernel/futex.c Wed Mar 6 17:55:09 2002
@@ -0,0 +1,229 @@
+/*
+ * Fast Userspace Mutexes (which I call "Futexes!").
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ * enough at me, Linus for the original (flawed) idea, Matthew
+ * Kirkwood for proof-of-concept implementation.
+ *
+ * "The futexes are also cursed."
+ * "But they come in a choice of three flavours!"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+ decrements from 1 to 0. 1 == free. 0 == no one sleeping.
+
+ This is simple enough to work on all architectures. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a wait_queue_t, so we can wake only the
+ relevant ones (hashed queues may be shared) */
+struct futex_q {
+ struct list_head list;
+ struct task_struct *task;
+ atomic_t *count;
+};
+
+/* The key for the hash is the address + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+#define FUTEX_PASSED ((void *)-1)
+
+/* Try to find someone else to pass futex to. */
+static int pass_futex(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+ struct futex_q *recipient = NULL;
+ int more_candidates = 0;
+
+ /* Find first, and keep looking to see if there are others. */
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ if (!recipient) recipient = this;
+ else {
+ /* Someone else waiting, too. */
+ more_candidates = 1;
+ break;
+ }
+ }
+ }
+
+ /* Nobody wants it. */
+ if (!recipient)
+ return 0;
+
+ /* Fixup to avoid wasted wakeup when we up() later. */
+ if (!more_candidates)
+ atomic_set(count, 0);
+
+ /* Pass directly to them. */
+ recipient->count = FUTEX_PASSED;
+ smp_wmb();
+ wake_up_process(recipient->task);
+ return 1;
+}
+
+static inline struct list_head *hash_futex(struct page *page,
+ unsigned long offset)
+{
+ unsigned long h;
+
+ /* If someone is sleeping, page is pinned. ie. page_address
+ is a constant when we care about it. */
+ h = (unsigned long)page_address(page) + offset;
+ return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+/* Add at end to make it FIFO. */
+static inline void queue_me(struct list_head *head, struct futex_q *q)
+{
+ q->task = current;
+
+ spin_lock(&futex_lock);
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+}
+
+/* Remove ourselves from the hash queue. */
+static inline void unqueue_me(struct futex_q *q)
+{
+ spin_lock(&futex_lock);
+ list_del(&q->list);
+ spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+ struct mm_struct *mm = current->mm;
+ struct page *page;
+ int err;
+
+ down_read(&mm->mmap_sem);
+ err = get_user_pages(current, current->mm, page_start,
+ 1 /* one page */,
+ 1 /* writable */,
+ 0 /* don't force */,
+ &page,
+ NULL /* don't return vmas */);
+ up_read(&mm->mmap_sem);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+ struct futex_q q;
+
+ current->state = TASK_INTERRUPTIBLE;
+ q.count = count;
+ queue_me(head, &q);
+
+ /* It may have become available while we were adding ourselves
+ to the queue. Also, make sure it's negative, so userspace knows
+ there's someone waiting. */
+ while ((atomic_read(count) < 0 || !atomic_dec_and_test(count))
+ && q.count != FUTEX_PASSED) {
+ schedule();
+ current->state = TASK_INTERRUPTIBLE;
+
+ if (signal_pending(current)) {
+ unqueue_me(&q);
+
+ /* We might have been passed futex anyway. */
+ return (q.count == FUTEX_PASSED) ? 0 : -EINTR;
+ }
+ }
+
+ /* We got the futex! */
+ current->state = TASK_RUNNING;
+ unqueue_me(&q);
+ return 0;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+ spin_lock(&futex_lock);
+ if (!pass_futex(head, count))
+ /* No one to receive: set to one and leave it free. */
+ atomic_set(count, 1);
+ spin_unlock(&futex_lock);
+ return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+ int ret;
+ unsigned long pos_in_page;
+ struct list_head *head;
+ struct page *page;
+
+ pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+ /* Must be "naturally" aligned, and not on page boundary. */
+ if ((pos_in_page % __alignof__(atomic_t)) != 0
+ || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Simpler if it doesn't vanish underneath us. */
+ page = pin_page((unsigned long)uaddr - pos_in_page);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ head = hash_futex(page, pos_in_page);
+ switch (op) {
+ case FUTEX_UP:
+ ret = futex_up(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_DOWN:
+ ret = futex_down(head, page_address(page) + pos_in_page);
+ break;
+ /* Add other lock types here... */
+ default:
+ ret = -EINVAL;
+ }
+ put_page(page);
+
+ return ret;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+ INIT_LIST_HEAD(&futex_queues[i]);
+ return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/filemap.c working-2.5.6-pre2-futex/mm/filemap.c
--- linux-2.5.6-pre2/mm/filemap.c Fri Mar 1 23:27:15 2002
+++ working-2.5.6-pre2-futex/mm/filemap.c Tue Mar 5 13:53:33 2002
@@ -25,6 +25,7 @@
#include <linux/iobuf.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/hash.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -773,32 +774,8 @@
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
const zone_t *zone = page_zone(page);
- wait_queue_head_t *wait = zone->wait_table;
- unsigned long hash = (unsigned long)page;

-#if BITS_PER_LONG == 64
- /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
- unsigned long n = hash;
- n <<= 18;
- hash -= n;
- n <<= 33;
- hash -= n;
- n <<= 3;
- hash += n;
- n <<= 3;
- hash -= n;
- n <<= 4;
- hash += n;
- n <<= 2;
- hash += n;
-#else
- /* On some cpus multiply is faster, on others gcc will do shifts */
- hash *= GOLDEN_RATIO_PRIME;
-#endif
-
- hash >>= zone->wait_table_shift;
-
- return &wait[hash];
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/*
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/mprotect.c working-2.5.6-pre2-futex/mm/mprotect.c
--- linux-2.5.6-pre2/mm/mprotect.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/mprotect.c Tue Mar 5 13:53:33 2002
@@ -280,7 +280,7 @@
end = start + len;
if (end < start)
return -EINVAL;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
return -EINVAL;
if (end == start)
return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/page_alloc.c working-2.5.6-pre2-futex/mm/page_alloc.c
--- linux-2.5.6-pre2/mm/page_alloc.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/page_alloc.c Tue Mar 5 13:53:33 2002
@@ -776,8 +776,8 @@
* per zone.
*/
zone->wait_table_size = wait_table_size(size);
- zone->wait_table_shift =
- BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+ zone->wait_table_bits =
+ wait_table_bits(zone->wait_table_size);
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));

2002-03-06 14:28:10

by Hubertus Franke

Subject: Re: Futexes III : performance numbers

On Tuesday 05 March 2002 09:08 pm, Rusty Russell wrote:
> In message <[email protected]> you write:
> > On Tuesday 05 March 2002 02:01 am, Rusty Russell wrote:
> > > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > > 3) x86 fixes: tested on dual x86 box.
> > >
> > > Example userspace lib attached,
> > > Rusty.
> >
> > I did a quick hack to enable ulockflex to run on the latest interface
> > that Rusty posted.
>
> Cool... is this 8-way or some such "serious" SMP? How about the
> below microoptimization (untested, but you get the idea).
>
> > Now 3 processes
> > 3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
> > 3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
> > 3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
> > 3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175
> >
> > Interesting... the strict FIFO ordering of my fast semaphores limits
> > performance, as seen by the 99.71% contention, so we always ditch
> > into the kernel. Convoy avoidance locks are 2.5 times better.
>
> Hmmm... actually I'm limited FIFO, in that I queue on the tail and do
> wake one. Of course, someone can come in userspace and grab the lock
> while the guy in the kernel is waking up, and this is clearly
> happening here.
>
> This can be fixed, I think, by saying to the one we wake up "you have
> the lock" and never actually changing the value to 1. This might cost
> us very little: I'll send another patch this afternoon.
>

Well, yes, it can be fixed as I did in my package, but it comes at a
substantial cost, as seen above. The question is whether to simply
ignore strict FIFO requirements.
Doing the FIFO leads to the convoy problem, namely your lock arrival
order becomes the scheduling queue.
As seen above from the nonexistent contention, mootexes completely
exhaust their scheduling quantum before allowing anybody else to grab
the lock. This is desired behavior, particularly for high-traffic,
low-lock-hold-time locks, but not for others.

In this case you simply hand over the lock and won't allow anybody
in user space to grab it during the time window in which one is woken up in
the kernel.
Also, from my own experience doing a spinning lock that way...

Another issue: at the cost of a few more operations, it would be nice to be
able to wake up all waiting processes and have them re-contend.
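
As a sketch of what that could look like (hypothetical, not part of any posted patch), a wake-all would mirror wake_one_waiter() from Rusty's Futexes IV patch, just without the early break, so every waiter gets woken and re-contends in userspace:

/* Hypothetical wake-all for re-contention: walk the hash queue and
   wake every waiter on this futex, rather than stopping at the first. */
static void wake_all_waiters(struct list_head *head, atomic_t *count)
{
	struct list_head *i;

	spin_lock(&futex_lock);
	list_for_each(i, head) {
		struct futex_q *this = list_entry(i, struct futex_q, list);

		if (this->count == count)
			wake_up_process(this->task);
	}
	spin_unlock(&futex_lock);
}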

> Cheers!
> Rusty.

--
-- Hubertus Franke ([email protected])

2002-03-06 14:46:23

by Hubertus Franke

Subject: Re: Futexes III : performance numbers

On Wednesday 06 March 2002 02:54 am, Rusty Russell wrote:
> On Tue, 5 Mar 2002 16:23:14 -0500
>
> Hubertus Franke <[email protected]> wrote:
> > Interesting... the strict FIFO ordering of my fast semaphores limits
> > performance, as seen by the 99.71% contention, so we always ditch
> > into the kernel. Convoy avoidance locks are 2.5 times better.
> > Whoa, futexes rock, BUT... with 0.29% contention it basically tells
> > me that we are exhausting our entire quantum getting the lock
> > without contention. So there is some serious fairness issue here,
> > at least for the tightly scheduled locks. Compare the M numbers
> > for 2 and 3 children.
>
> Fairness <sigh>. This patch should be much more FIFO: it works by handing
> the mutex straight to the first one on the queue if there is one, and only
> actually "freeing" it if there's no one waiting.
>
> Unfortunately, it seems to hurt performance by 50% on tdbtorture (although
> there are weird scheduler things happening too).
>
> Here's the "fair" patch:
> Rusty.

Thanks, Rusty, man you are a coding machine :-)

Now you are experiencing all the issues that I went through as well.
The point I was trying to make is that there is no cookie-cutter solution.
One must provide the various options to the higher level and let
the application choose what mootex semantics it wants.

There is applicability for fair futexes and for convoy avoidance futexes.

So let's put both in, and later expand it to read/write stuff.


--
-- Hubertus Franke ([email protected])

2002-03-06 16:13:13

by Hubertus Franke

Subject: Re: Futexes III : performance numbers

On Wednesday 06 March 2002 02:54 am, Rusty Russell wrote:
> On Tue, 5 Mar 2002 16:23:14 -0500
>
> Hubertus Franke <[email protected]> wrote:
> > Interesting... the strict FIFO ordering of my fast semaphores limits
> > performance, as seen by the 99.71% contention, so we always ditch
> > into the kernel. Convoy avoidance locks are 2.5 times better.
> > Whoa, futexes rock, BUT... with 0.29% contention it basically tells
> > me that we are exhausting our entire quantum getting the lock
> > without contention. So there is some serious fairness issue here,
> > at least for the tightly scheduled locks. Compare the M numbers
> > for 2 and 3 children.
>
> Fairness <sigh>. This patch should be much more FIFO: it works by handing
> the mutex straight to the first one on the queue if there is one, and only
> actually "freeing" it if there's no one waiting.
>
> Unfortunately, it seems to hurt performance by 50% on tdbtorture (although
> there are weird scheduler things happening too).
>
> Here's the "fair" patch:
> Rusty.

Rusty, why not provide something along the following lines <snipped all over the place>.

#define FUTEX_DOWN (0)
#define FUTEX_UP (1)
#define FUTEX_FAIR_UP (2)

static int futex_up(struct list_head *head, atomic_t *count)
{
atomic_set(count, 1);
smp_wmb();
wake_one_waiter(head, count);
return 0;
}

static int futex_fair_up(struct list_head *head, atomic_t *count)
{
spin_lock(&futex_lock);
if (!pass_futex(head, count))
/* No one to receive: set to one and leave it free. */
atomic_set(count, 1);
spin_unlock(&futex_lock);
return 0;
}


asmlinkage int sys_futex(void *uaddr, int op)
{
<..... snip ....>

head = hash_futex(page, pos_in_page);
switch (op) {

case FUTEX_DOWN:
ret = futex_down(head, page_address(page) + pos_in_page);
break;

case FUTEX_UP:
ret = futex_up(head, page_address(page) + pos_in_page);
break;

case FUTEX_FAIR_UP:
ret = futex_fair_up(head, page_address(page) + pos_in_page);
break;

default :

<..... snip ....>
}


This would satisfy the fair vs. CA lock issue: you let the app decide what to
use. Best of all, it seems to me you can even mix them.
Imagine: if a process knows it will soon reacquire the lock, it would
use FUTEX_UP to avoid being tagged back to the end of the wait queue,
avoiding costly scheduling events. At the same time, if the process knows
that it's done for a while with that lock, it issues FUTEX_FAIR_UP.
The best of both worlds.
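
In userspace, such mixed use could look like the following sketch (the helper names, the linear op values proposed above, and the i386 syscall number 239 from Rusty's patch are assumptions; the exact counter protocol for the fair handoff isn't spelled out in the thread, so this reuses the unfair fast path):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_futex 239
#define FUTEX_DOWN 0
#define FUTEX_UP 1
#define FUTEX_FAIR_UP 2

static void futex_lock(volatile int *f)
{
	/* Uncontended 1 -> 0 takes the lock; otherwise sleep in the kernel. */
	if (__sync_fetch_and_add(f, -1) != 1)
		syscall(__NR_futex, (void *)f, FUTEX_DOWN);
}

static void futex_unlock(volatile int *f, int how)
{
	/* The caller picks the wakeup policy per call site:
	   FUTEX_UP when it expects to retake the lock soon,
	   FUTEX_FAIR_UP when it is done with the lock for a while. */
	if (__sync_fetch_and_add(f, 1) != 0)
		syscall(__NR_futex, (void *)f, how);
}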

It also shows how cleanly the code can be expanded in the future.
The more I look at the hash queues, the more I like them.
Will look at the rwlock now. Let me know what you think.

--
-- Hubertus Franke ([email protected])

2002-03-06 17:25:32

by George Anzinger

Subject: Re: [Lse-tech] Re: Futexes III : performance numbers

Hubertus Franke wrote:
>
> On Tuesday 05 March 2002 09:08 pm, Rusty Russell wrote:
> > In message <[email protected]> you write:
> > > On Tuesday 05 March 2002 02:01 am, Rusty Russell wrote:
> > > > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > > > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > > > 3) x86 fixes: tested on dual x86 box.
> > > >
> > > > Example userspace lib attached,
> > > > Rusty.
> > >
> > > I did a quick hack to enable ulockflex to run on the latest interface
> > > that Rusty posted.
> >
> > Cool... is this 8-way or some such "serious" SMP? How about the
> > below microoptimization (untested, but you get the idea).
> >
> > > Now 3 processes
> > > 3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
> > > 3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
> > > 3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
> > > 3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175
> > >
> > > Interesting... the strict FIFO ordering of my fast semaphores limits
> > > performance as seen by 99.71% contention, so we always ditch
> > > into the kernel. Convoy Avoidance locks 2.5 times better.
> >
> > Hmmm... actually I'm limited FIFO, in that I queue on the tail and do
> > wake one. Of course, someone can come in userspace and grab the lock
> > while the guy in the kernel is waking up, and this is clearly
> > happening here.
> >
> > This can be fixed, I think, by saying to the one we wake up "you have
> > the lock" and never actually changing the value to 1. This might cost
> > us very little: I'll send another patch this afternoon.
> >
>
> Well, yes it can be fixed as I did in my package, but it comes at a
> substantial cost as seen above. The question is whether to simply
> ignore strict FIFO requirements?
> Doing the FIFO leads to the convoy problem, namely your lock arrival
> becomes the scheduling queue.
> As seen above from the nonexistent contention, mootexes completely
> exhaust their scheduling quantum before allowing anybody else to grab
> the lock. This is desired behavior, particularly for high-traffic locks
> with low lock-hold times, but not for others.
>
> In this case you simply hand over the lock and won't allow anybody
> in user space to grab it during the time window one is woken up in
> the kernel.
> Also, from my own experience doing a spinning lock that way
>
> Another issue, at the cost of a few more operations: what would be nice is
> to be able to wake up all waiting processes and have them recontend?

Unless you are ready to go to priority queues (IMHO the preferred
approach) this is the only way to avoid priority inversion, especially
in a real time environment.

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-03-06 20:35:53

by Hubertus Franke

[permalink] [raw]
Subject: Futexes V :

On Wednesday 06 March 2002 11:13 am, Hubertus Franke wrote:

I cut a new version with what I was previously discussing.
Now we have two kinds of wakeup mechanisms:

(a) regular wakeup (as before), which basically gives you convoy avoidance
(b) fair wakeup (will first wake up a waiting process .. FIFO)

Basically, this integrates two prior patches of Rusty's.

Also changed the FUTEX_DOWN, FUTEX_UP and FUTEX_UP_FAIR
operands to be linear (0, 1, 2), which should make the case statement faster,
particularly when we get more operands.


frankeh:1019:~/futex/ulockflex>
./ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 0 -x 0 -L f

SysV: 3 1 2 1 k 0 0 0 99.42 0.00 0.00 0.141423 243718
FAIR: 3 1 2 1 f 0 0 0 93.67 0.00 0.00 0.049231 217001
CA: 3 1 2 1 f 0 0 0 0.01 0.00 0.00 0.154154 2002852


Yesterday's numbers were:

3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175

Indicates that fair locking still needs some work, but what Rusty provided
on top of the current...
--
-- Hubertus Franke ([email protected])

---------------------------------------------------------------------------------------------
diff -urbN linux-2.5.5/arch/i386/kernel/entry.S
linux-2.5.5-futex/arch/i386/kernel/entry.S
--- linux-2.5.5/arch/i386/kernel/entry.S Tue Feb 19 21:10:58 2002
+++ linux-2.5.5-futex/arch/i386/kernel/entry.S Wed Mar 6 11:40:12 2002
@@ -716,6 +716,7 @@
.long SYMBOL_NAME(sys_lremovexattr)
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
+ .long SYMBOL_NAME(sys_futex)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urbN linux-2.5.5/arch/ppc/kernel/misc.S
linux-2.5.5-futex/arch/ppc/kernel/misc.S
--- linux-2.5.5/arch/ppc/kernel/misc.S Tue Feb 19 21:11:00 2002
+++ linux-2.5.5-futex/arch/ppc/kernel/misc.S Wed Mar 6 11:40:12 2002
@@ -1246,6 +1246,7 @@
.long sys_removexattr
.long sys_lremovexattr
.long sys_fremovexattr /* 220 */
+ .long sys_futex
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
.endr
diff -urbN linux-2.5.5/include/asm-i386/atomic.h
linux-2.5.5-futex/include/asm-i386/atomic.h
--- linux-2.5.5/include/asm-i386/atomic.h Tue Feb 19 21:10:58 2002
+++ linux-2.5.5-futex/include/asm-i386/atomic.h Wed Mar 6 11:42:35 2002
@@ -2,6 +2,7 @@
#define __ARCH_I386_ATOMIC__

#include <linux/config.h>
+#include <asm/system.h>

/*
* Atomic operations that C can't guarantee us. Useful for
diff -urbN linux-2.5.5/include/asm-i386/mman.h
linux-2.5.5-futex/include/asm-i386/mman.h
--- linux-2.5.5/include/asm-i386/mman.h Tue Feb 19 21:10:56 2002
+++ linux-2.5.5-futex/include/asm-i386/mman.h Wed Mar 6 11:40:12 2002
@@ -4,6 +4,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urbN linux-2.5.5/include/asm-i386/unistd.h
linux-2.5.5-futex/include/asm-i386/unistd.h
--- linux-2.5.5/include/asm-i386/unistd.h Tue Feb 19 21:11:04 2002
+++ linux-2.5.5-futex/include/asm-i386/unistd.h Wed Mar 6 11:40:12 2002
@@ -243,6 +243,7 @@
#define __NR_lremovexattr 236
#define __NR_fremovexattr 237
#define __NR_tkill 238
+#define __NR_futex 239

/* user-visible error numbers are in the range -1 - -124: see
<asm-i386/errno.h> */

diff -urbN linux-2.5.5/include/asm-ppc/mman.h
linux-2.5.5-futex/include/asm-ppc/mman.h
--- linux-2.5.5/include/asm-ppc/mman.h Tue Feb 19 21:11:03 2002
+++ linux-2.5.5-futex/include/asm-ppc/mman.h Wed Mar 6 11:40:12 2002
@@ -7,6 +7,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urbN linux-2.5.5/include/asm-ppc/unistd.h
linux-2.5.5-futex/include/asm-ppc/unistd.h
--- linux-2.5.5/include/asm-ppc/unistd.h Tue Feb 19 21:10:57 2002
+++ linux-2.5.5-futex/include/asm-ppc/unistd.h Wed Mar 6 11:40:12 2002
@@ -228,6 +228,7 @@
#define __NR_removexattr 218
#define __NR_lremovexattr 219
#define __NR_fremovexattr 220
+#define __NR_futex 221

#define __NR(n) #n

diff -urbN linux-2.5.5/include/linux/futex.h
linux-2.5.5-futex/include/linux/futex.h
--- linux-2.5.5/include/linux/futex.h Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/include/linux/futex.h Wed Mar 6 13:58:21 2002
@@ -0,0 +1,9 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_DOWN (0)
+#define FUTEX_UP (1)
+#define FUTEX_UP_FAIR (2)
+
+#endif
diff -urbN linux-2.5.5/include/linux/hash.h
linux-2.5.5-futex/include/linux/hash.h
--- linux-2.5.5/include/linux/hash.h Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/include/linux/hash.h Wed Mar 6 11:40:12 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+ (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+ unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+ /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+ unsigned long n = hash;
+ n <<= 18;
+ hash -= n;
+ n <<= 33;
+ hash -= n;
+ n <<= 3;
+ hash += n;
+ n <<= 3;
+ hash -= n;
+ n <<= 4;
+ hash += n;
+ n <<= 2;
+ hash += n;
+#else
+ /* On some cpus multiply is faster, on others gcc will do shifts */
+ hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+ /* High bits are more random, so use them. */
+ return hash >> (BITS_PER_LONG - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+ return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urbN linux-2.5.5/include/linux/mmzone.h
linux-2.5.5-futex/include/linux/mmzone.h
--- linux-2.5.5/include/linux/mmzone.h Tue Feb 19 21:10:53 2002
+++ linux-2.5.5-futex/include/linux/mmzone.h Wed Mar 6 11:42:35 2002
@@ -51,8 +51,7 @@
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
- * wait_table_shift -- wait_table_size
- * == BITS_PER_LONG (1 << wait_table_bits)
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
@@ -75,7 +74,7 @@
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
- unsigned long wait_table_shift;
+ unsigned long wait_table_bits;

/*
* Discontig memory support fields.
diff -urbN linux-2.5.5/kernel/Makefile linux-2.5.5-futex/kernel/Makefile
--- linux-2.5.5/kernel/Makefile Tue Feb 19 21:10:57 2002
+++ linux-2.5.5-futex/kernel/Makefile Wed Mar 6 11:40:12 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o futex.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urbN linux-2.5.5/kernel/futex.c linux-2.5.5-futex/kernel/futex.c
--- linux-2.5.5/kernel/futex.c Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/kernel/futex.c Wed Mar 6 13:59:01 2002
@@ -0,0 +1,255 @@
+/*
+ * Fast Userspace Mutexes (which I call "Futexes!").
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ * enough at me, Linus for the original (flawed) idea, Matthew
+ * Kirkwood for proof-of-concept implementation.
+ *
+ * "The futexes are also cursed."
+ * "But they come in a choice of three flavours!"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+ decrements from 1 to 0. 1 == free. 0 == noone sleeping.
+
+ This is simple enough to work on all architectures. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a wait_queue_t, so we can wake only the
+ relevent ones (hashed queues may be shared) */
+struct futex_q {
+ struct list_head list;
+ struct task_struct *task;
+ atomic_t *count;
+};
+
+/* The key for the hash is the address + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+#define FUTEX_PASSED ((void *)-1)
+
+/* Try to find someone else to pass futex to. */
+static int pass_futex(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+ struct futex_q *recipient = NULL;
+ int more_candidates = 0;
+
+ /* Find first, and keep looking to see if there are others. */
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ if (!recipient) recipient = this;
+ else {
+ /* Someone else waiting, too. */
+ more_candidates = 1;
+ break;
+ }
+ }
+ }
+
+ /* Nobody wants it. */
+ if (!recipient)
+ return 0;
+
+ /* Fixup to avoid wasted wakeup when we up() later. */
+ if (!more_candidates)
+ atomic_set(count, 0);
+
+ /* Pass directly to them. */
+ recipient->count = FUTEX_PASSED;
+ smp_wmb();
+ wake_up_process(recipient->task);
+ return 1;
+}
+
+static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+
+ spin_lock(&futex_lock);
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ wake_up_process(this->task);
+ break;
+ }
+ }
+ spin_unlock(&futex_lock);
+}
+
+static inline struct list_head *hash_futex(struct page *page,
+ unsigned long offset)
+{
+ unsigned long h;
+
+ /* If someone is sleeping, page is pinned. ie. page_address
+ is a constant when we care about it. */
+ h = (unsigned long)page_address(page) + offset;
+ return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+/* Add at end to make it FIFO. */
+static inline void queue_me(struct list_head *head, struct futex_q *q)
+{
+ q->task = current;
+
+ spin_lock(&futex_lock);
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+}
+
+/* Return true if there is someone else waiting as well */
+static inline void unqueue_me(struct futex_q *q)
+{
+ spin_lock(&futex_lock);
+ list_del(&q->list);
+ spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+ struct mm_struct *mm = current->mm;
+ struct page *page;
+ int err;
+
+ down_read(&mm->mmap_sem);
+ err = get_user_pages(current, current->mm, page_start,
+ 1 /* one page */,
+ 1 /* writable */,
+ 0 /* don't force */,
+ &page,
+ NULL /* don't return vmas */);
+ up_read(&mm->mmap_sem);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+ struct futex_q q;
+
+ current->state = TASK_INTERRUPTIBLE;
+ q.count = count;
+ queue_me(head, &q);
+
+ /* It may have become available while we were adding ourselves
+ to queue? Also, make sure it's -ve so userspace knows
+ there's someone waiting. */
+ while ((atomic_read(count) < 0 || !atomic_dec_and_test(count))
+ && q.count != FUTEX_PASSED) {
+ schedule();
+ current->state = TASK_INTERRUPTIBLE;
+
+ if (signal_pending(current)) {
+ unqueue_me(&q);
+
+ /* We might have been passed futex anyway. */
+ return (q.count == FUTEX_PASSED) ? 0 : -EINTR;
+ }
+ }
+
+ /* We got the futex! */
+ current->state = TASK_RUNNING;
+ unqueue_me(&q);
+ return 0;
+}
+
+static int futex_fair_up(struct list_head *head, atomic_t *count)
+{
+ spin_lock(&futex_lock);
+ if (!pass_futex(head, count))
+ /* Noone to receive: set to one and leave it free. */
+ atomic_set(count, 1);
+ spin_unlock(&futex_lock);
+ return 0;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+ atomic_set(count, 1);
+ smp_wmb();
+ wake_one_waiter(head, count);
+ return 0;
+}
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+ int ret;
+ unsigned long pos_in_page;
+ struct list_head *head;
+ struct page *page;
+
+ pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+ /* Must be "naturally" aligned, and not on page boundary. */
+ if ((pos_in_page % __alignof__(atomic_t)) != 0
+ || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Simpler if it doesn't vanish underneath us. */
+ page = pin_page((unsigned long)uaddr - pos_in_page);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ head = hash_futex(page, pos_in_page);
+ switch (op) {
+ case FUTEX_DOWN:
+ ret = futex_down(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_UP:
+ ret = futex_up(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_UP_FAIR:
+ ret = futex_fair_up(head, page_address(page) + pos_in_page);
+ break;
+ /* Add other lock types here... */
+ default:
+ ret = -EINVAL;
+ }
+ put_page(page);
+
+ return ret;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+ INIT_LIST_HEAD(&futex_queues[i]);
+ return 0;
+}
+__initcall(init);
diff -urbN linux-2.5.5/mm/filemap.c linux-2.5.5-futex/mm/filemap.c
--- linux-2.5.5/mm/filemap.c Wed Mar 6 15:10:09 2002
+++ linux-2.5.5-futex/mm/filemap.c Wed Mar 6 11:40:12 2002
@@ -25,6 +25,7 @@
#include <linux/iobuf.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/hash.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -773,32 +774,8 @@
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
const zone_t *zone = page_zone(page);
- wait_queue_head_t *wait = zone->wait_table;
- unsigned long hash = (unsigned long)page;

-#if BITS_PER_LONG == 64
- /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
- unsigned long n = hash;
- n <<= 18;
- hash -= n;
- n <<= 33;
- hash -= n;
- n <<= 3;
- hash += n;
- n <<= 3;
- hash -= n;
- n <<= 4;
- hash += n;
- n <<= 2;
- hash += n;
-#else
- /* On some cpus multiply is faster, on others gcc will do shifts */
- hash *= GOLDEN_RATIO_PRIME;
-#endif
-
- hash >>= zone->wait_table_shift;
-
- return &wait[hash];
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/*
diff -urbN linux-2.5.5/mm/mprotect.c linux-2.5.5-futex/mm/mprotect.c
--- linux-2.5.5/mm/mprotect.c Tue Feb 19 21:11:01 2002
+++ linux-2.5.5-futex/mm/mprotect.c Wed Mar 6 11:40:12 2002
@@ -280,7 +280,7 @@
end = start + len;
if (end < start)
return -EINVAL;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
return -EINVAL;
if (end == start)
return 0;
diff -urbN linux-2.5.5/mm/page_alloc.c linux-2.5.5-futex/mm/page_alloc.c
--- linux-2.5.5/mm/page_alloc.c Tue Feb 19 21:10:55 2002
+++ linux-2.5.5-futex/mm/page_alloc.c Wed Mar 6 11:40:12 2002
@@ -776,8 +776,8 @@
* per zone.
*/
zone->wait_table_size = wait_table_size(size);
- zone->wait_table_shift =
- BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+ zone->wait_table_bits =
+ wait_table_bits(zone->wait_table_size);
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));

2002-03-07 00:34:33

by Hubertus Franke

[permalink] [raw]
Subject: Re: Futexes III : performance numbers

On Tuesday 05 March 2002 09:08 pm, Rusty Russell wrote:
> In message <[email protected]> you write:

More on fairness. I hacked ulockflex to keep a history of lock acquisitions
and print it out after the run, so this doesn't create any overhead; it
is recorded while the lock is held (the history buffer is pretouched).

Read as follows:
lock-acquisition [ how many times in a row for the same process ] : process id
(the bracketed count is left out if it is only 1)

First the FUTEX_UP, later the FUTEX_UP_FAIR.
Two cases: (-r2 -x1) and (-r0 -x0).

Summary: it's clearly seen how fairness can be affected.
It also shows the efficacy of FUTEX_UP and FUTEX_UP_FAIR.

Comments?

./ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 2 -x 1 -L f -H 2

===========================================================

(i.e. 2 usecs non-lockholdtime and 1 usec lockhold time)

----------------- UNFAIR LOCKS == FUTEX_UP ------------------
<..snip...>
1067602 [ 4065 ]: 1
1071667 [ 4522 ]: 0
1076189 : 2
1076190 [ 953 ]: 1
1077143 : 2
1077144 [ 2875 ]: 0
1080019 [ 4800 ]: 2
1084819 : 0
1084820 [ 968 ]: 1
1085788 : 0
1085789 : 1
1085790 : 0
1085791 : 1
1085792 : 0
1085793 : 1
1085794 : 0
1085795 : 1
1085796 : 0
1085797 : 1
1085798 : 0
1085799 : 1
1085800 : 0
1085801 : 1
1085802 : 0
1085803 : 1
1085804 : 0
1085805 : 1
1085806 : 0
1085807 : 1
1085808 : 0
1085809 : 1
1085810 : 0
1085811 : 1
1085812 : 0
1085813 : 2
1085814 [ 2 ]: 0
1085816 : 2
1085817 [ 2861 ]: 1
1088678 : 2
1088679 [ 4868 ]: 0
1093547 : 2
1093548 [ 914 ]: 1
1094462 : 2
1094463 [ 4829 ]: 0
1099292 : 2
1099293 [ 963 ]: 1
1100256 : 2
1100257 [ 4789 ]: 0
1105046 : 2
1105047 [ 966 ]: 1
1106013 : 2
1106014 [ 4800 ]: 0
1110814 : 2
1110815 [ 961 ]: 1
1111776 : 2
1111777 [ 2013 ]: 0
1113790 : 2
1113791 : 0
1113792 : 2
1113793 [ 3768 ]: 1
1117561 : 2
1117562 [ 4832 ]: 0
1122394 : 2
1122395 [ 955 ]: 1
1123350 : 2
1123351 [ 4813 ]: 0
1128164 : 2
1128165 [ 982 ]: 1
1129147 : 2
1129148 : 1
1129149 : 2
1129150 [ 4789 ]: 0
1133939 : 2
1133940 : 0
1133941 : 2
1133942 : 0
1133943 : 2
1133944 [ 969 ]: 1
1134913 : 2
1134914 [ 4841 ]: 0
1139755 : 2
1139756 [ 967 ]: 1
1140723 : 2
1140724 [ 4820 ]: 0
1145544 : 2
1145545 [ 969 ]: 1
1146514 : 2
1146515 [ 5007 ]: 0
1151522 : 2
1151523 [ 3678 ]: 1
1155201 : 2
1155202 [ 4756 ]: 0
1159958 : 2
1159959 [ 978 ]: 1

-------------------------- FAIR LOCKS == FUTEX_UP_FAIR ----------------
<... snip ...>
558617 : 0
558618 : 1
558619 : 2
558620 : 1
558621 : 0
558622 : 1
558623 : 2
558624 : 1
558625 : 0
558626 : 1
558627 : 2
558628 : 1
558629 : 0
558630 : 1
558631 : 2
<... and so on ....>

=================================================================
./ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 0 -x 0 -L f -H 2

===========================================================

(i.e. 0 usecs non-lockholdtime and 0 usec lockhold time)

----------------- UNFAIR LOCKS == FUTEX_UP ------------------
<..snip...>
7682404 [ 4593 ]: 1
7686997 : 2
7686998 [ 16336 ]: 0
7703334 : 2
7703335 [ 23875 ]: 1
7727210 [ 4 ]: 2
7727214 [ 20110 ]: 0
7747324 [ 13298 ]: 1
7760622 [ 11612 ]: 2
7772234 [ 8340 ]: 0
7780574 [ 6732 ]: 1
7787306 [ 13388 ]: 2
7800694 [ 3006 ]: 0
7803700 [ 17121 ]: 1
7820821 [ 6726 ]: 2
7827547 [ 13396 ]: 0
7840943 [ 6760 ]: 1
7847703 [ 13375 ]: 2
7861078 [ 4443 ]: 0
7865521 [ 15566 ]: 1
7881087 [ 6730 ]: 2
7887817 [ 13421 ]: 0
7901238 [ 3013 ]: 1
7904251 [ 16995 ]: 2
7921246 [ 6715 ]: 0
7927961 [ 13397 ]: 1
7941358 [ 6716 ]: 2
7948074 [ 13407 ]: 0
7961481 [ 6743 ]: 1
7968224 [ 13309 ]: 2
7981533 [ 6708 ]: 0
7988241 [ 13374 ]: 1
8001615 [ 3411 ]: 2
8005026 [ 16574 ]: 0
8021600 [ 7016 ]: 1

-------------------------- FAIR LOCKS == FUTEX_UP_FAIR ----------------
<... snip ...> same as for -r 2 -x 1




--
-- Hubertus Franke ([email protected])

2002-03-07 04:18:25

by Rusty Russell

[permalink] [raw]
Subject: Re: Futexes V :

On Wed, 6 Mar 2002 15:36:25 -0500
Hubertus Franke <[email protected]> wrote:

> On Wednesday 06 March 2002 11:13 am, Hubertus Franke wrote:
>
> I cut a new version with what I was previously discussing.
> Now we have two kind of wakeup mechanism
>
> (a) regular wakeup (as was) which basically gives you convoy avoidance
> (b) fair wakeup (will first wake a waiting process up .. FIFO)
>
> Basically integrated 2 prior patches of Rusty

I like your numbers. Since I think fairness is nice, but lack of starvation
is vital, I've been trying to starve a process.

So far, I've not managed it. Please hack on the below code, and see if
you can manage it. If not, I think we can say "not a problem in real life",
and just stick with the fastest implementation.

Thanks!
Rusty.

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/time.h>
#include <time.h>
#include <signal.h>
#include "usersem.h"

#ifndef PROT_SEM
#define PROT_SEM 0x8
#endif


static void spinner(struct futex *sem, int hold)
{
	while (1) {
		futex_down(sem);
		if (hold)
			sleep(1);
		futex_up(sem);
	}
}

/* Test maximum time to lock given furious spinners. */
int main(int argc, char *argv[])
{
	struct futex *sem;
	unsigned int i;
	unsigned long maxtime = 0;
	pid_t children[100];	/* note: room for at most 100 spinners */

	if (argc != 3) {
		fprintf(stderr, "Usage: starve <numspinners> <iterations>\n");
		exit(1);
	}

	sem = malloc(sizeof(*sem));
	futex_region(sem, sizeof(*sem));
	futex_init(sem);
	for (i = 0; i < atoi(argv[1]); i++) {
		children[i] = fork();
		if (children[i] == 0)
			spinner(sem, i < atoi(argv[1])/2);
	}

	for (i = 0; i < atoi(argv[2]); i++) {
		struct timeval start, end, diff;

		sleep(1);
		gettimeofday(&start, NULL);
		futex_down(sem);
		gettimeofday(&end, NULL);
		futex_up(sem);
		timersub(&end, &start, &diff);
		printf("Wait time: %lu.%06lu\n", diff.tv_sec, diff.tv_usec);
		if (diff.tv_sec * 1000000 + diff.tv_usec > maxtime)
			maxtime = diff.tv_sec * 1000000 + diff.tv_usec;
	}

	/* Kill children */
	for (i = 0; i < atoi(argv[1]); i++)
		kill(children[i], SIGTERM);

	printf("Worst case: %lu\n", maxtime);
	exit(0);
}

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-08 00:07:54

by Richard Henderson

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tue, Mar 05, 2002 at 03:26:31PM -0800, Davide Libenzi wrote:
> Yes but this is always true alignof >= sizeof

No. m68k sets alignof to 2 for all types with sizeof >= 2.


r~
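
To make the consequence concrete (a hedged note, reusing the alignment
check from the sys_futex() code in the patch above): with
__alignof__(atomic_t) == 2 on m68k, this validation accepts counters
that are only 2-byte aligned, even though sizeof(atomic_t) is 4:

	/* "Natural" alignment is whatever the compiler reports, so on
	   m68k a pos_in_page of 2 passes this test. */
	if ((pos_in_page % __alignof__(atomic_t)) != 0
	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
		return -EINVAL;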

2002-03-08 18:08:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Tue, 5 Mar 2002, Rusty Russell wrote:
>
> 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> 3) x86 fixes: tested on dual x86 box.

This doesn't work on highmem machines - doing the conversion from "<struct
page, offset>" to "page_address(page)+offset" is simply not legal (not
even for pure hashing purposes - page_address() changes as you kmap it).

You need to keep the <struct page,offset> tuple in that format, and no
other. And when you actually touch the page, you need to do the
kmap()/kunmap() (and you must not keep it mapped while you sleep, because
that might trivially make the kernel run out of virtual mappings).

Linus
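
A rough sketch of the rule Linus describes, hedged: hash_futex here
mirrors the patch's helper but hashes the <struct page, offset> tuple
itself, and futex_dec_test is an invented name with error handling
elided.

#include <linux/highmem.h>

/* Illustrative only: hash on the <struct page, offset> tuple, which is
   stable even with highmem, instead of on page_address(). */
static inline struct list_head *hash_futex(struct page *page,
					   unsigned long offset)
{
	return &futex_queues[hash_long((unsigned long)page + offset,
				       FUTEX_HASHBITS)];
}

/* Map the page only around the atomic access, never across a sleep. */
static int futex_dec_test(struct page *page, unsigned long offset)
{
	atomic_t *count = (atomic_t *)((char *)kmap(page) + offset);
	int ret = atomic_dec_and_test(count);

	kunmap(page);
	return ret;
}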

2002-03-08 19:02:59

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 01:07 pm, Linus Torvalds wrote:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > 3) x86 fixes: tested on dual x86 box.
>
> This doesn't work on highmem machines - doing the conversion from "<struct
> page, offset>" to "page_address(page)+offset" is simply not legal (not
> even for pure hashing purposes - page_address() changes as you kmap it).
>
> You need to keep the <struct page,offset> tuple in that format, and no
> other. And when you actually touch the page, you need to do the
> kmap()/kunmap() (and you must not keep it mapped while you sleep, because
> that might trivially make the kernel run out of virtual mappings).
>
> Linus
>

Good point. My package doesn't have that problem, but your suggestion
should fit nicely into Rusty's patch, though it seems to increase the
overhead now.

Could you also comment on the functionality that has been discussed?


(I) the fairness issues that have been raised:
do you support two wakeup mechanisms, FUTEX_UP and FUTEX_UP_FAIR,
or do you not care about fairness and starvation?
(II) the rwlock issues:
do you support a rich set of functions/ops such as the ones below?

(a) writer preference
if any writer is waiting, then wake that one up.
(b) reader preference
if any reader is waiting, wake up all the readers in the queue
(c) fifo preference
if the first waiter is a writer, wake it up; otherwise wake up all
readers
(d) fifo-fair preference
like (c), but only wake up readers until the next writer is
encountered (see the sketch after this list)

(a) - (c) can be implemented with Rusty's two-user-queue approach as long
as the wakeup type is always the same. The last one can't (?).

In the kernel this is easy to implement, but the trouble is the status
word in user space; still thinking about it.
It also requires compare-and-swap.
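
As a hedged sketch only: the fifo-fair wakeup (d) might look roughly like
this inside the kernel, reusing the futex_q list from Rusty's patch but
assuming a hypothetical "writer" flag on each waiter that no posted patch
actually has.

/* Illustrative only: wake waiters in FIFO order; wake readers until
   the first writer, or wake just that writer if it is at the head.
   Caller holds futex_lock. */
static void wake_fifo_fair(struct list_head *head, atomic_t *count)
{
	struct list_head *i;
	int woke_reader = 0;

	list_for_each(i, head) {
		struct futex_q *this = list_entry(i, struct futex_q, list);

		if (this->count != count)
			continue;
		if (this->writer) {
			/* Wake a lone writer only if no reader ran first. */
			if (!woke_reader)
				wake_up_process(this->task);
			break;
		}
		wake_up_process(this->task);
		woke_reader = 1;
	}
}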

Thanks for your time and pointing this out.

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
-- Hubertus Franke ([email protected])

2002-03-08 19:23:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, Hubertus Franke wrote:
>
> Could you also comment on the functionality that has been discussed.

First off, I have to say that I really like the current patch by Rusty.
The hashing approach is very clean, and it all seems quite good. As to
specific points:

> (I) the fairness issues that have been raised.
> do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> or you don't care about fairness and starvation

I don't think fairness and starvation is that big of a deal for
semaphores, usually being unfair in these things tends to just improve
performance through better cache locality with no real downside. That
said, I think the option should be open (which it does seem to be).

For rwlocks, my personal preference is the fifo-fair-preference (unlike
semaphore fairness, I have actually seen loads where read- vs
write-preference really is unacceptable). This might be a point where we
give users the choice.

I do think we should make the lock bigger - I worry that atomic_t simply
won't be enough for things like fair rwlocks, which might want a
"cmpxchg8b" on x86.

So I would suggest making the size (and thus alignment check) of locks at
least 8 bytes (and preferably 16). That makes it slightly harder to put
locks on the stack, but gcc does support stack alignment, even if the code
sucks right now.

Linus
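
A hedged sketch of what such a widened userspace lock might look like
(the layout, the field names, and the GCC aligned attribute are
illustrative, not from any posted patch):

/* Illustrative only: 8 aligned bytes per lock, so a future fair rwlock
   can update a <count, state> pair atomically (cmpxchg8b on x86). */
struct futex {
	int count;	/* 1 == free, as in the patch */
	int state;	/* spare word for rwlock/fairness state */
} __attribute__((aligned(8)));

The kernel-side alignment check in sys_futex() would then test against
8 bytes instead of __alignof__(atomic_t).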


2002-03-08 20:26:03

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.

Can we go to cache line alignment? For an array of locks that's clearly
advantageous.

2002-03-08 20:42:23

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 02:22 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > Could you also comment on the functionality that has been discussed.
>
> First off, I have to say that I really like the current patch by Rusty.
> The hashing approach is very clean, and it all seems quite good. As to
>
Agreed, Rusty did a marvelous job pulling all the right things together
and adding the hashing.

> specific points:
> > (I) the fairness issues that have been raised.
> > do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > or you don't care about fairness and starvation
>
> I don't think fairness and starvation is that big of a deal for
> semaphores, usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).
>
Yip, I'd prefer to leave it up to the user as to what exactly one wants:
fair or non-fair wakeup.

> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we
> give users the choice.
>
> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86.
>

But what about compatibility with i386, which has no cmpxchg or cmpxchg8b?
Can't we have two types and infer from the op in the kernel what
the correct size in user space is?

> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.
>
> Linus

Keeping it small allows for the common case of mutex integration into data
objects, though it's not a big deal.

--
-- Hubertus Franke ([email protected])

2002-03-08 20:48:14

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Linus Torvalds wrote:
>
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > Could you also comment on the functionality that has been discussed.
>
> First off, I have to say that I really like the current patch by Rusty.
> The hashing approach is very clean, and it all seems quite good. As to
> specific points:
>
> > (I) the fairness issues that have been raised.
> > do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > or you don't care about fairness and starvation
>
> I don't think fairness and starvation is that big of a deal for
> semaphores, usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).
>
> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we
> give users the choice.
>
> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86.
>
> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.
>
I think this is needed if we want to address the "task dies while
holding a lock" issue. In this case we need to know who holds the lock.

-g

> Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-03-08 20:49:33

by Matthew Kirkwood

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Fri, 8 Mar 2002, Hubertus Franke wrote:

> > So I would suggest making the size (and thus alignment check) of locks
> > at least 8 bytes (and preferably 16). That makes it slightly harder to
> > put locks on the stack, but gcc does support stack alignment, even if
> > the code sucks right now.

> Keeping it small, allows for the common case of mutex integration into
> data objects, though its not a big deal.

I guess we need to know what the requirements are of the fabled
"architectures which need special handling of PROT_SEM" before
we can tell whether this is a good idea or not.

Matthew.

2002-03-08 20:58:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, Alan Cox wrote:
>
> Can we go to cache line alignment - for an array of locks thats clearly
> advantageous

I disagree about the "clearly". Firstly, the cacheline alignment is CPU
dependent, so on some CPU's it's 32 bytes (or even 16), on others it is
128 bytes.

Secondly, a lot of locking is actually done inside a single thread, and
false sharing doesn't happen much - so keeping the locks dense can be
quite advantageous.

The cases where false sharing _does_ happen and are a problem should be
for the application writer to worry about, not for the kernel to force.

So I think 8 bytes is plenty fine enough - with 16 bytes a remote
possibility (I don't think it is needed, but it gives you some padding for
future expansion). And people who have arrays and find false sharing to be
a problem can fix it themselves.

I personally don't find arrays of locks very common. It's much more common
to have arrays of data structures that _contain_ locks (eg things like
having hash tables etc with a per-hashchain lock) and then those container
structures may want to be cacheline aligned, but the locks themselves
should not need to be.

Linus

2002-03-08 21:03:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, Hubertus Franke wrote:
>
> But what about compatibility with i386, which has no cmpxchg or cmpxchg8b?
> Can't we have two types and infer from the op in the kernel what
> the correct size in user space is?

I think the next step should be to map in one page of kernel code in a
user-readable location, and just do it there.

It's not just 386 vs later due to cmpxchg. It's also the simple issue of
UP vs SMP - a UP system still wants to do locking, but it doesn't need the
lock prefix. And that lock prefix makes a _huge_ difference
performance-wise.

So my suggestion is: ignore i386 for now (no _relevant_ SMP boxes exist
anyway), and plan on solving the problem with a separate library page
before 2.6.x gets released.

Linus
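
To illustrate the UP/SMP point (hedged; this mirrors i386's
atomic_dec_and_test with the lock prefix conditionally dropped, and is
not from any posted patch):

/* Decrement *count and return true if it hit zero. On SMP the lock
   prefix keeps the read-modify-write atomic across CPUs; on UP it is
   unnecessary, and dropping it avoids the large penalty Linus cites. */
static inline int futex_dec_and_test(int *count)
{
	unsigned char c;

#ifdef CONFIG_SMP
	__asm__ __volatile__("lock; decl %0; sete %1"
			     : "+m" (*count), "=qm" (c) : : "memory");
#else
	__asm__ __volatile__("decl %0; sete %1"
			     : "+m" (*count), "=qm" (c) : : "memory");
#endif
	return c;
}

A kernel-provided code page could export whichever variant matches the
running system, so the same binary works on both.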

2002-03-08 22:54:59

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 03:40 pm, Alan Cox wrote:
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the
> > code sucks right now.
>
> Can we go to cache line alignment - for an array of locks thats clearly
> advantageous

NO, and let me explain.

I want to be able to integrate the lock with the data.
This is much more cache friendly than putting the lock on a different
cacheline.

If you want an array, you need to pad each element.
That's easy enough to do (a sketch follows below)....
You can't shrink a data structure, on the other hand :-)
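
A minimal sketch of that padding, assuming (purely for illustration) a
64-byte cache line and the atomic_t-based futex from Rusty's patch:

/* Illustrative only: pad each element of a lock array out to a cache
   line to avoid false sharing; locks embedded in data stay unpadded. */
struct padded_futex {
	atomic_t count;
	char pad[64 - sizeof(atomic_t)];	/* 64-byte line assumed */
};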

--
-- Hubertus Franke ([email protected])

2002-03-08 23:02:20

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 03:47 pm, george anzinger wrote:
> Linus Torvalds wrote:
> > On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > Could you also comment on the functionality that has been discussed.
> >
> > First off, I have to say that I really like the current patch by Rusty.
> > The hashing approach is very clean, and it all seems quite good. As to
> >
> > specific points:
> > > (I) the fairness issues that have been raised.
> > > do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > > or you don't care about fairness and starvation
> >
> > I don't think fairness and starvation is that big of a deal for
> > semaphores, usually being unfair in these things tends to just improve
> > performance through better cache locality with no real downside. That
> > said, I think the option should be open (which it does seem to be).
> >
> > For rwlocks, my personal preference is the fifo-fair-preference (unlike
> > semaphore fairness, I have actually seen loads where read- vs
> > write-preference really is unacceptable). This might be a point where we
> > give users the choice.
> >
> > I do think we should make the lock bigger - I worry that atomic_t simply
> > won't be enough for things like fair rwlocks, which might want a
> > "cmpxchg8b" on x86.
> >
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the
> > code sucks right now.
>
> I think this is needed if we want to address the "task dies while
> holding a lock" issue. In this case we need to know who holds the lock.
>
> -g
>

George, while desirable, it's very tricky, if possible at all.

You need to stick your pid or so into the lock and do it
atomically. So let's assume we only stick with architectures that can do
a doubleword cmpxchg; still, it's not foolproof.
First, the app could still corrupt the count or pid field of the lock.
In that case the whole logic gets screwed up.
There is no guarantee that you ever know who holds the locks!

Secondly, what guarantee do you have that your data is kosher?
I tend to agree with the masses that hand waving might be the best
first approximation.

--
-- Hubertus Franke ([email protected])

2002-03-08 23:14:39

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 04:02 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > But what about compatibility with i386, which has no cmpxchg or cmpxchg8b?
> > Can't we have two types and infer from the op in the kernel what
> > the correct size in user space is?
>
> I think the next step should be to map in one page of kernel code in a
> user-readable location, and just do it there.
>

You're kidding.....
Seriously, how can we guarantee that we correctly determine the
lock holder, given possible memory corruption? If we can't do
it correctly all the time, why do it at all?

> It's not just 386 vs later due to cmpxchg. It's also the simple issue of
> UP vs SMP - a UP system still wants to do locking, but it doesn't need the
> lock prefix. And that lock prefix makes a _huge_ difference
> performance-wise.

I fail to see why that matters. User-level locking is mostly beneficial on SMPs.
So, you lock the bus for the atomic update. This is UP; nothing's going on
on the bus anyway.
In your experience, what is the overhead in cycles for "incl" vs. "lock; incl"?
Even if it's a few more cycles, it still beats the heck out of using other
heavyweight kernel APIs.

> So my suggestion is: ignore i386 for now (no _relevant_ SMP boxes exist
> anyway), and plan on solving the problem with a separate library page
> before 2.6.x gets released.
>

Any rough design?
> Linus

--
-- Hubertus Franke ([email protected])

2002-03-08 23:21:11

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> > It's not just 386 vs later due to cmpxchg. It's also the simple issue of
> > UP vs SMP - a UP system still wants to do locking, but it doesn't need the
> > lock prefix. And that lock prefix makes a _huge_ difference
> > performance-wise.
>
> Fail to see why that matters. User level locking is mostly beneficial on SMPs.
> So, you lock the bus for the atomic update. This is UP, nothing's going on
> on the bus anyway.

Lots of older x86 is too stupid to optimise exclusive cache-line locked
operations. After all, the bus is still shared: PCI bus masters, for one.

Alan

2002-03-08 23:23:54

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> NO and let me explain.
>
> I want to be able to integrate the lock with the data.
> This is much more cache friendly than putting the lock on a different
> cacheline.

Yep - I agree

2002-03-08 23:42:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > I think the next step should be to map in one page of kernel code in a
> > user-readable location, and just do it there.
>
> Your kidding .....
> Seriously, how can we guarantee that we correctly determine the
> lock holder, due to memory corruption problems. If we can't do
> it correctly all the times, why do it at all ?

You don't understand. This has nothing to do with lock holders, or
anything else.

I'm saying that we map in a page at a magic offset (just above the stack),
and that page contains the locking code.

For 386 CPU's (where only UP matters), we can trivially come up with a
lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
into the kernel every time if it has to - ie it _works_, it just isn't
optimal.

> Fail to see why that matters. User level locking is mostly beneficial on SMPs.

That's not the issue AT ALL.

Semaphores are absolutely required on UP too, with threads. There is
_zero_ difference between UP and SMP from a locking perspective in user
space due to the fact that we can be preempted at any time - except from
the cache coherency issue.

> So, you lock the bus for the atomic update. This is UP, nothing's going on
> on the bus anyway.

That's not the point. Nobody has locked the bus in the last ten years: the
cache coherency is done on a cacheline basis, not on the bus.

The point being that the difference between a "decl" and a "lock ; decl"
is about 1:12 or so in performance.

> Even if its a few more cycles, still beats the heck out of using other
> heavyweight kernel APIs

Sure it does. But if the speed of locking matters enough for user-level
locks to matter, don't you think the 1:12 difference matters as well?

Linus

2002-03-08 23:44:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Followup to: <[email protected]>
By author: Linus Torvalds <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Fri, 8 Mar 2002, Alan Cox wrote:
> >
> > Can we go to cache line alignment - for an array of locks thats clearly
> > advantageous
>
> I disagree about the "clearly". Firstly, the cacheline alignment is CPU
> dependent, so on some CPU's it's 32 bytes (or even 16), on others it is
> 128 bytes.
>
> Secondly, a lot of locking is actually done inside a single thread, and
> false sharing doesn't happen much - so keeping the locks dense can be
> quite advantageous.
>
> The cases where false sharing _does_ happen and are a problem should be
> for the application writer to worry about, not for the kernel to force.
>
> So I think 8 bytes is plenty fine enough - with 16 bytes a remote
> possibility (I don't think it is needed, but it gives you some padding for
> future expansion). And people who have arrays and find false sharing to be
> a problem can fix it themselves.
>
> I personally don't find arrays of locks very common. It's much more common
> to have arrays of data structures that _contain_ locks (eg things like
> having hash tables etc with a per-hashchain lock) and then those container
> structures may want to be cacheline aligned, but the locks themselves
> should not need to be.
>

Also, on UP this is all a waste.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-03-08 23:45:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Followup to: <[email protected]>
By author: Hubertus Franke <[email protected]>
In newsgroup: linux.dev.kernel
> >
> > Can we go to cache line alignment - for an array of locks thats clearly
> > advantageous
>
> NO and let me explain.
>
> I want to be able to integrate the lock with the data.
> This is much more cache friendly than putting the lock on a different
> cacheline.
>

Not just cache, but programmer-friendly as well. Data structures
containing locks (sometimes multiple and related) are really the
common case.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-03-08 23:48:35

by George Anzinger

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Hubertus Franke wrote:
>
> On Friday 08 March 2002 03:47 pm, george anzinger wrote:
> > Linus Torvalds wrote:
> > > On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > > Could you also comment on the functionality that has been discussed.
> > >
> > > First off, I have to say that I really like the current patch by Rusty.
> > > The hashing approach is very clean, and it all seems quite good. As to
> > >
> > > specific points:
> > > > (I) the fairness issues that have been raised.
> > > > do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > > > or you don't care about fairness and starvation
> > >
> > > I don't think fairness and starvation is that big of a deal for
> > > semaphores, usually being unfair in these things tends to just improve
> > > performance through better cache locality with no real downside. That
> > > said, I think the option should be open (which it does seem to be).
> > >
> > > For rwlocks, my personal preference is the fifo-fair-preference (unlike
> > > semaphore fairness, I have actually seen loads where read- vs
> > > write-preference really is unacceptable). This might be a point where we
> > > give users the choice.
> > >
> > > I do think we should make the lock bigger - I worry that atomic_t simply
> > > won't be enough for things like fair rwlocks, which might want a
> > > "cmpxchg8b" on x86.
> > >
> > > So I would suggest making the size (and thus alignment check) of locks at
> > > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > > locks on the stack, but gcc does support stack alignment, even if the
> > > code sucks right now.
> >
> > I think this is needed if we want to address the "task dies while
> > holding a lock" issue. In this case we need to know who holds the lock.
> >
> > -g
> >
>
> George, while desirable its very tricky if possible at all.
>
> You need to stick your pid or so into the lock and do it
> atomically. So let's assume we only stick with architectures that can do
> cmpxchg-doubleword, still its not fool proof.

Uh, just the pid would do. Maybe reserve a bit to indicate contention,
but surely one word would be enough.

> First, the app could still corrupt that count or pid field of the lock
> In that case the whole logic get'ss crewed up.
> There is no guarantee that you ever know who holds the locks !!!

At that point you are most likely down for the count. The most use for
this would be development, where programs are dying like flies. If the
sem area is "registered" with the kernel, it could do the right thing
in the exit code (a sketch follows below). Except for using the cmpxchg,
I don't think it adds to the overhead.
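
A hedged sketch of that owner-tracking lock (my illustration, not from
this thread's patches; the contention bit and layout are assumptions,
and the GCC __sync builtins stand in for the cmpxchg instruction
discussed here):

#include <unistd.h>

#define LOCK_CONTENDED 0x80000000u	/* assumed: top bit flags waiters */

/* Illustrative only: the lock word holds the owner's pid (0 == free),
   so on exit the kernel could scan a registered sem area for words
   whose pid field matches the dead task. */
static int try_lock(volatile unsigned int *word)
{
	return __sync_val_compare_and_swap(word, 0,
					   (unsigned int)getpid()) == 0;
}

static void unlock(volatile unsigned int *word)
{
	/* Real code must notify waiters if LOCK_CONTENDED was set. */
	__sync_lock_release(word);	/* release-store of 0 */
}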
>
> Secondly, what guarantee do you have that your data is kosher ?

Well, none, but you don't in either case. This is a non-argument.

> I tend to agree with the masses, that hand waving might be the best
> first approximation

That is your privilege :-)
>
> --
> -- Hubertus Franke ([email protected])

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-03-08 23:55:37

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 06:41 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > I think the next step should be to map in one page of kernel code in a
> > > user-readable location, and just do it there.
> >
> > Your kidding .....
> > Seriously, how can we guarantee that we correctly determine the
> > lock holder, due to memory corruption problems. If we can't do
> > it correctly all the times, why do it at all ?
>
> You don't understand. This has nothing to do with lock holders, or
> anything else.
>
Sorry misunderstood your answer with respect to another question.

> I'm saying that we map in a page at a magic offset (just above the stack),
> and that page contains the locking code.
>
> For 386 CPU's (where only UP matters), we can trivially come up with a
> lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
> into the kernel every time if it has to - ie it _works_, it just isn't
> optimal.
>

Ahhhh, in this context, now "I see the light" (actually it's dark on the east
coast).
So do you envision this to go through some named "section", or do you
want to go through the futex_region() library call, which identifies whether
the code page has been mapped? If not, the kernel will then provide the
locking code in that page depending on the architecture (UP or SMP).
Fair enough.

> > Fail to see why that matters. User level locking is mostly beneficial on
> > SMPs.
>
> That's not the issue AT ALL.
>
> Semaphores are absolutely required on UP too, with threads. There is
> _zero_ difference between UP and SMP from a locking perspective in user
> space due to the fact that we can be preempted at any time - except from
> the cache coherency issue.
>

Agreed, my point was wrt providing the functionality only. The only difference
between UP and SMP would be that a spinning version would default to the
standard version (no spinning) under UP.

> > So, you lock the bus for the atomic update. This is UP, nothing's going
> > on on the bus anyway.
>
> That's not the point. Nobody has locked the bus in the last ten years: the
> cache coherency is done on a cacheline basis, not on the bus.
>
> The point being that the difference between a "decl" and a "lock ; decl"
> is about 1:12 or so in performance.
>

I am no expert in architecture, but if it's done through the cache coherency
mechanism, the overhead shouldn't be 12:1. You simply mark the cache line as
part of your instruction to avoid a cache line transfer. How can that be 12
times slower? ... Ready to be educated...

> > Even if its a few more cycles, still beats the heck out of using other
> > heavyweight kernel APIs
>
> Sure it does. But if the speed of locking matters enough for user-level
> locks to matter, don't you think the 1:12 difference matters as well?
>
> Linus

Yipp, I buy that argument.

Overall, it just seems to me the user locking subsystem is becoming quickly
again a complicated beast.


Anyway, time to go home and play with the kids :-)

--
-- Hubertus Franke ([email protected])

2002-03-09 00:04:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Followup to: <[email protected]>
By author: Linus Torvalds <[email protected]>
In newsgroup: linux.dev.kernel
>
> You don't understand. This has nothing to do with lock holders, or
> anything else.
>
> I'm saying that we map in a page at a magic offset (just above the stack),
> and that page contains the locking code.
>
> For 386 CPU's (where only UP matters), we can trivially come up with a
> lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
> into the kernel every time if it has to - ie it _works_, it just isn't
> optimal.
>

Okay, I'll say it and be unpopular...

Perhaps it's time to drop i386 support?

It seems to me that the i386 support has been around mostly on a
"until we have a reason to do otherwise" basis, but perhaps this is
the reason?

There certainly are enough little, nagging reasons... CMPXCHG, BSWAP,
and especially WP...

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-03-09 00:57:13

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> > cmpxchg-doubleword, still its not fool proof.
>
> Uh, just the pid would do. Maybe reserve a bit to indicate contention,
> but surely one word would be enough.

Make it a dword; the 16-bit pid is beginning to look strained on big
boxes.

2002-03-09 01:00:43

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> It seems to me that the i386 support has been around mostly on a
> "until we have a reason to do otherwise" basis, but perhaps this is
> the reason?
>
> There certainly are enough little, nagging reasons... CMPXCHG, BSWAP,
> and especially WP...

They don't really arise in most normal situations. Bswap is trivial to
the extreme. Cmpxchg only comes up on SMP boxes. Right now the one big
hit is cmpxchg8 if you use direct rendering. And quite frankly if you
use the direct rendering infrastructure on a 386 it's going to be a teeny
bit slow anyway 8)

So if anything it's just not worth the effort of breaking the 386 setup
either 8). 386 SMP is a different issue, but I don't see any lunatics doing
a 386-based Sequent port, thankfully.

Alan

2002-03-09 01:20:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, george anzinger wrote:
>
> Uh, just the pid would do. Maybe reserve a bit to indicate
> contention, but surely one word would be enough.

Not really.

The pid would mean that anybody who gets a lock would have to have its pid
available (remember the fast-path is what we really care about), but
there's also a fundamental race between getting the lock and writing the
pid to the second word of the lock that you just won't avoid.

And that's assuming you only use the semaphores for pure mutual exclusion.
That is the normal behaviour, but some people use semaphores for other
things (ie "N people can be active inside this region" where N != 1).

And then you have to realize that doing the same for readers in a rwlock
is even worse.

In short, it just cannot be done quickly and simply, and for many cases it
cannot be done at all.

Linus
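
To make the race concrete: a hypothetical two-word lock (layout and names
invented for this sketch, not from any posted patch; a modern GCC builtin
stands in for hand-rolled atomics). The owner field can only be written
after the atomic acquisition has already succeeded, so there is always a
window in which the lock is held but no valid owner is recorded:

        #include <unistd.h>
        #include <sys/types.h>

        struct ulock {
                int count;      /* 1 = free, 0 = locked (hypothetical) */
                pid_t owner;    /* meaningful only while the lock is held */
        };

        static void ulock_down(struct ulock *l)
        {
                while (!__sync_bool_compare_and_swap(&l->count, 1, 0))
                        ;       /* real code would sleep via sys_futex here */
                /* WINDOW: a crash or preemption here leaves count == 0
                   with owner still naming the previous holder. */
                l->owner = getpid();
        }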

2002-03-09 02:13:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > The point being that the difference between a "decl" and a "lock ; decl"
> > is about 1:12 or so in performance.
>
> I am no expert in architecture, but if it's done through the cache coherency
> mechanism, the overhead shouldn't be 12:1. You simply mark the cache line as
> part of your instruction to avoid a cache line transfer. How can that be 12
> times slower? ... Ready to be educated...

A lock in a SMP system also needs to synchronize the instruction stream,
and not let stores move "out" from the locked region.

On a UP system, this all happens automatically (well, getting it to happen
right is obviously one of the big issues in an out-of-order CPU core, but
it's a very fundamental part of the core, so it's "free" in the sense that
if it isn't done, the CPU simply doesn't work).

On SMP, it's a memory barrier. This is why a "lock ; decl" is more
expensive than a "decl" - it's the implied memory ordering constraints (on
other architectures they are explicit). On an intel CPU, this basically
means that the pipeline is drained, so a locked instruction takes roughly
12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
about this one). I haven't timed a P4 lately, I think it's worse.

Other architectures do the memory ordering explicitly, and some are
better, some are worse. But it always costs you _something_.

Linus
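
A rough way to see the ratio Linus quotes is a userspace microbenchmark
(a sketch for x86 with GCC inline assembly; absolute numbers vary widely
by core, and serializing details such as cpuid fencing around rdtsc are
omitted for brevity):

        #include <stdio.h>
        #include <stdint.h>

        static inline uint64_t rdtsc(void)
        {
                uint32_t lo, hi;
                __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
                return ((uint64_t)hi << 32) | lo;
        }

        int main(void)
        {
                volatile int x = 0;
                enum { N = 100000000 };
                uint64_t t0, t1, t2;
                int i;

                t0 = rdtsc();
                for (i = 0; i < N; i++)
                        __asm__ __volatile__("decl %0" : "+m"(x));
                t1 = rdtsc();
                for (i = 0; i < N; i++)
                        __asm__ __volatile__("lock; decl %0" : "+m"(x));
                t2 = rdtsc();

                printf("plain decl:  %.1f cycles/op\n", (double)(t1 - t0) / N);
                printf("locked decl: %.1f cycles/op\n", (double)(t2 - t1) / N);
                return 0;
        }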

2002-03-09 07:14:51

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Fri, 8 Mar 2002 11:22:20 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> First off, I have to say that I really like the current patch by Rusty.
> The hashing approach is very clean, and it all seems quite good. As to
> specific points:

Thanks. [Aside: VIRO PLEASE TAKE NOTE.]

> > (I) the fairness issues that have been raised.
> > do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > or you don't care about fairness and starvation
>
> I don't think fairness and starvation is that big of a deal for
> semaphores; usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).

1) Unfairness definitely helps performance (~30% faster on tdbtorture and
IIRC on Hubertus' benchmark too).
2) Absolute fairness depends on hardware anyway.
3) I was not able to produce any evidence of STARVATION (which I think
we all agree *is* an issue).

So I'd say stick with the minimalistic, fast, unfair solution.

> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we
> give users the choice.

Yes. See post on "furwocks": fair-preference rw locks implemented in
userspace on top of the futexes.

> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86.
>
> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.

Actually, I disagree.

1) We've left wiggle room in the second arg to sys_futex() to add rwsems
later if required.
2) Someone needs to implement them and prove they are superior to the
pure userspace solution.

The most gain will be from a very briefly held lock that is 99.99% read.
But if it's 10%, it's not worth it: we need numbers.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-09 07:14:51

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Fri, 8 Mar 2002 10:07:51 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> This doesn't work on highmem machines - doing the conversion from "<struct
> page, offset>" to "page_address(page)+offset" is simply not legal (not
> even for pure hashing purposes - page_address() changes as you kmap it).

Thanks for the catch. Am not blessed (cursed?) with highmem here.

> You need to keep the <struct page,offset> tuple in that format, and no
> other. And when you actually touch the page, you need to do the
> kmap()/kunmap() (and you must not keep it mapped while you sleep, because
> that might trivially make the kernel run out of virtual mappings).

Ick: not allowing for one virtual mapping per process is pretty horrible.
Still, pretty easy to fix.

Also updated syscall numbers for 2.5.6, and changed FUTEX_UP/DOWN definitions
to be more logical for future expansions (eg. r/w).

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/kernel/entry.S working-2.5.6-futex-6/arch/i386/kernel/entry.S
--- linux-2.5.6/arch/i386/kernel/entry.S Fri Mar 8 14:49:11 2002
+++ working-2.5.6-futex-6/arch/i386/kernel/entry.S Sat Mar 9 14:09:10 2002
@@ -717,6 +717,7 @@
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
.long SYMBOL_NAME(sys_sendfile64)
+ .long SYMBOL_NAME(sys_futex) /* 240 */

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/ppc/kernel/misc.S working-2.5.6-futex-6/arch/ppc/kernel/misc.S
--- linux-2.5.6/arch/ppc/kernel/misc.S Fri Mar 8 14:49:11 2002
+++ working-2.5.6-futex-6/arch/ppc/kernel/misc.S Sat Mar 9 14:08:24 2002
@@ -1289,6 +1289,7 @@
.long sys_removexattr
.long sys_lremovexattr
.long sys_fremovexattr /* 220 */
+ .long sys_futex
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/mman.h working-2.5.6-futex-6/include/asm-i386/mman.h
--- linux-2.5.6/include/asm-i386/mman.h Wed Mar 15 12:45:20 2000
+++ working-2.5.6-futex-6/include/asm-i386/mman.h Sat Mar 9 14:32:43 2002
@@ -4,6 +4,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/unistd.h working-2.5.6-futex-6/include/asm-i386/unistd.h
--- linux-2.5.6/include/asm-i386/unistd.h Fri Mar 8 14:49:28 2002
+++ working-2.5.6-futex-6/include/asm-i386/unistd.h Sat Mar 9 14:09:31 2002
@@ -244,6 +244,7 @@
#define __NR_fremovexattr 237
#define __NR_tkill 238
#define __NR_sendfile64 239
+#define __NR_futex 240

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-ppc/mman.h working-2.5.6-futex-6/include/asm-ppc/mman.h
--- linux-2.5.6/include/asm-ppc/mman.h Tue May 22 08:02:06 2001
+++ working-2.5.6-futex-6/include/asm-ppc/mman.h Sat Mar 9 14:32:39 2002
@@ -7,6 +7,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-ppc/unistd.h working-2.5.6-futex-6/include/asm-ppc/unistd.h
--- linux-2.5.6/include/asm-ppc/unistd.h Wed Feb 20 17:57:18 2002
+++ working-2.5.6-futex-6/include/asm-ppc/unistd.h Sat Mar 9 14:08:27 2002
@@ -228,6 +228,7 @@
#define __NR_removexattr 218
#define __NR_lremovexattr 219
#define __NR_fremovexattr 220
+#define __NR_futex 221

#define __NR(n) #n

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/futex.h working-2.5.6-futex-6/include/linux/futex.h
--- linux-2.5.6/include/linux/futex.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-futex-6/include/linux/futex.h Sat Mar 9 14:08:24 2002
@@ -0,0 +1,8 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_UP (0)
+#define FUTEX_DOWN (1)
+
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/hash.h working-2.5.6-futex-6/include/linux/hash.h
--- linux-2.5.6/include/linux/hash.h Thu Jan 1 10:00:00 1970
+++ working-2.5.6-futex-6/include/linux/hash.h Sat Mar 9 14:08:27 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+ (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+ unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+ /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+ unsigned long n = hash;
+ n <<= 18;
+ hash -= n;
+ n <<= 33;
+ hash -= n;
+ n <<= 3;
+ hash += n;
+ n <<= 3;
+ hash -= n;
+ n <<= 4;
+ hash += n;
+ n <<= 2;
+ hash += n;
+#else
+ /* On some cpus multiply is faster, on others gcc will do shifts */
+ hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+ /* High bits are more random, so use them. */
+ return hash >> (BITS_PER_LONG - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+ return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/mmzone.h working-2.5.6-futex-6/include/linux/mmzone.h
--- linux-2.5.6/include/linux/mmzone.h Fri Mar 1 22:58:34 2002
+++ working-2.5.6-futex-6/include/linux/mmzone.h Sat Mar 9 14:52:15 2002
@@ -51,8 +51,7 @@
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
- * wait_table_shift -- wait_table_size
- * == BITS_PER_LONG (1 << wait_table_bits)
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
@@ -75,7 +74,7 @@
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
- unsigned long wait_table_shift;
+ unsigned long wait_table_bits;

/*
* Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/Makefile working-2.5.6-futex-6/kernel/Makefile
--- linux-2.5.6/kernel/Makefile Wed Feb 20 17:56:17 2002
+++ working-2.5.6-futex-6/kernel/Makefile Sat Mar 9 14:08:27 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o futex.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/futex.c working-2.5.6-futex-6/kernel/futex.c
--- linux-2.5.6/kernel/futex.c Thu Jan 1 10:00:00 1970
+++ working-2.5.6-futex-6/kernel/futex.c Sat Mar 9 15:03:13 2002
@@ -0,0 +1,232 @@
+/*
+ * Fast Userspace Mutexes (which I call "Futexes!").
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ * enough at me, Linus for the original (flawed) idea, Matthew
+ * Kirkwood for proof-of-concept implementation.
+ *
+ * "The futexes are also cursed."
+ * "But they come in a choice of three flavours!"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <linux/highmem.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+ decrements from 1 to 0. The counter starts at 1 when the lock is
+ free. A value other than 0 or 1 means someone may be sleeping.
+ This is simple enough to work on all architectures, but has the
+ problem that if we never "up" the semaphore it could eventually
+ wrap around. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+ the relevant ones (hashed queues may be shared) */
+struct futex_q {
+ struct list_head list;
+ struct task_struct *task;
+ /* Page struct and offset within it. */
+ struct page *page;
+ unsigned int offset;
+};
+
+/* The key for the hash is the address + index + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+static inline struct list_head *hash_futex(struct page *page,
+ unsigned long offset)
+{
+ unsigned long h;
+
+ /* struct page is shared, so we can hash on its address */
+ h = (unsigned long)page + offset;
+ return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+static inline void wake_one_waiter(struct list_head *head,
+ struct page *page,
+ unsigned int offset)
+{
+ struct list_head *i;
+
+ spin_lock(&futex_lock);
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->page == page && this->offset == offset) {
+ wake_up_process(this->task);
+ break;
+ }
+ }
+ spin_unlock(&futex_lock);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct list_head *head,
+ struct futex_q *q,
+ struct page *page,
+ unsigned int offset)
+{
+ q->task = current;
+ q->page = page;
+ q->offset = offset;
+
+ spin_lock(&futex_lock);
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+}
+
+static inline void unqueue_me(struct futex_q *q)
+{
+ spin_lock(&futex_lock);
+ list_del(&q->list);
+ spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+ struct mm_struct *mm = current->mm;
+ struct page *page;
+ int err;
+
+ down_read(&mm->mmap_sem);
+ err = get_user_pages(current, current->mm, page_start,
+ 1 /* one page */,
+ 1 /* writable */,
+ 0 /* don't force */,
+ &page,
+ NULL /* don't return vmas */);
+ up_read(&mm->mmap_sem);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ return page;
+}
+
+/* Try to decrement the user count to zero. */
+static int decrement_to_zero(struct page *page, unsigned int offset)
+{
+ atomic_t *count;
+ int ret = 0;
+
+ count = kmap(page) + offset;
+ /* If we take the semaphore from 1 to 0, it's ours. If it's
+ zero, decrement anyway, to indicate we are waiting. If
+ it's negative, don't decrement so we don't wrap... */
+ if (atomic_read(count) >= 0 && atomic_dec_and_test(count))
+ ret = 1;
+ kunmap(page);
+ return ret;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, struct page *page, int offset)
+{
+ int retval = 0;
+ struct futex_q q;
+
+ current->state = TASK_INTERRUPTIBLE;
+ queue_me(head, &q, page, offset);
+
+ while (!decrement_to_zero(page, offset)) {
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ break;
+ }
+ schedule();
+ current->state = TASK_INTERRUPTIBLE;
+ }
+ current->state = TASK_RUNNING;
+ unqueue_me(&q);
+ /* If we were signalled, we might have just been woken: we
+ must wake another one. Otherwise we need to wake someone
+ else (if they are waiting) so they drop the count below 0,
+ and when we "up" in userspace, we know there is a
+ waiter. */
+ wake_one_waiter(head, page, offset);
+ return retval;
+}
+
+static int futex_up(struct list_head *head, struct page *page, int offset)
+{
+ atomic_t *count;
+
+ count = kmap(page) + offset;
+ atomic_set(count, 1);
+ smp_wmb();
+ kunmap(page);
+ wake_one_waiter(head, page, offset);
+ return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+ int ret;
+ unsigned long pos_in_page;
+ struct list_head *head;
+ struct page *page;
+
+ pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+ /* Must be "naturally" aligned, and not on page boundary. */
+ if ((pos_in_page % __alignof__(atomic_t)) != 0
+ || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Simpler if it doesn't vanish underneath us. */
+ page = pin_page((unsigned long)uaddr - pos_in_page);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ head = hash_futex(page, pos_in_page);
+ switch (op) {
+ case FUTEX_UP:
+ ret = futex_up(head, page, pos_in_page);
+ break;
+ case FUTEX_DOWN:
+ ret = futex_down(head, page, pos_in_page);
+ break;
+ /* Add other lock types here... */
+ default:
+ ret = -EINVAL;
+ }
+ put_page(page);
+
+ return ret;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+ INIT_LIST_HEAD(&futex_queues[i]);
+ return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/filemap.c working-2.5.6-futex-6/mm/filemap.c
--- linux-2.5.6/mm/filemap.c Fri Mar 8 14:49:30 2002
+++ working-2.5.6-futex-6/mm/filemap.c Sat Mar 9 14:08:27 2002
@@ -25,6 +25,7 @@
#include <linux/iobuf.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/hash.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -773,32 +774,8 @@
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
const zone_t *zone = page_zone(page);
- wait_queue_head_t *wait = zone->wait_table;
- unsigned long hash = (unsigned long)page;

-#if BITS_PER_LONG == 64
- /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
- unsigned long n = hash;
- n <<= 18;
- hash -= n;
- n <<= 33;
- hash -= n;
- n <<= 3;
- hash += n;
- n <<= 3;
- hash -= n;
- n <<= 4;
- hash += n;
- n <<= 2;
- hash += n;
-#else
- /* On some cpus multiply is faster, on others gcc will do shifts */
- hash *= GOLDEN_RATIO_PRIME;
-#endif
-
- hash >>= zone->wait_table_shift;
-
- return &wait[hash];
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/*
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/mprotect.c working-2.5.6-futex-6/mm/mprotect.c
--- linux-2.5.6/mm/mprotect.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-futex-6/mm/mprotect.c Sat Mar 9 14:32:32 2002
@@ -280,7 +280,7 @@
end = start + len;
if (end < start)
return -EINVAL;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
return -EINVAL;
if (end == start)
return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/page_alloc.c working-2.5.6-futex-6/mm/page_alloc.c
--- linux-2.5.6/mm/page_alloc.c Fri Mar 8 14:49:30 2002
+++ working-2.5.6-futex-6/mm/page_alloc.c Sat Mar 9 14:08:27 2002
@@ -776,8 +776,8 @@
* per zone.
*/
zone->wait_table_size = wait_table_size(size);
- zone->wait_table_shift =
- BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+ zone->wait_table_bits =
+ wait_table_bits(zone->wait_table_size);
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));
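
For reference, the userspace fast path implied by the counter protocol in
kernel/futex.c above looks roughly like this (a sketch only: the syscall
number is the i386 one from this patch, and modern GCC atomic builtins
stand in for the hand-rolled assembly in the example library attached
earlier, which is not reproduced here):

        #include <unistd.h>
        #include <sys/syscall.h>

        #define __NR_futex  240         /* i386 number from this patch */
        #define FUTEX_UP    0
        #define FUTEX_DOWN  1

        /* counter: 1 = free, 0 = locked, negative = someone may be waiting */
        static void futex_down_user(int *f)
        {
                /* fast path: take the lock 1 -> 0 with no syscall */
                if (!__sync_bool_compare_and_swap(f, 1, 0))
                        syscall(__NR_futex, f, FUTEX_DOWN);     /* sleep */
        }

        static void futex_up_user(int *f)
        {
                /* fast path: release 0 -> 1; any other value means the
                   kernel must reset the counter and wake a waiter */
                if (!__sync_bool_compare_and_swap(f, 0, 1))
                        syscall(__NR_futex, f, FUTEX_UP);
        }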

2002-03-10 19:42:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>So if anything its just not worth the effort of breaking the 386 setup
>either 8). 386 SMP is a different issue but I don't see any lunatics doing
>a 386 based sequent port thankfully.

Since the only person that comes to mind that would be crazy enough to
even _try_ to use Linux on 386-SMP is you, Alan, I'm really relieved to
hear you say that ;)

And no, it's not worth discontinuing i386 support. It just isn't
painful enough to maintain.

Note that the i386 has _long_ been a "stepchild", though: because of the
lack of WP, the kernel simply doesn't do threaded MM correctly on a 386.
Never has, and never will.

However, the known "incorrect" case is so obscure that it's not even an
issue - although I suspect that it means that you should not let
untrusted users run on an i386 server machine that contains any sensitive
data. I could certainly come up with exploits that would work at least
in theory (whether they are workable in practice I don't know).

Using i386's for network servers is fine, of course. Just don't use
them for cpu farms (not that I think anybody is - it takes quite a big
farm of i386 machines to equal even _one_ PII ;)

Linus


2002-03-10 19:58:06

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> So if anything its just not worth the effort of breaking the
> 386 setup either 8). 386 SMP is a different issue but I don't
> see any lunatics doing a 386 based sequent port thankfully.

Hey, don't count it out ... someone was emailing me a week or
two ago, asking what the internal structure of a Sequent Symmetry
was, so that they could get Linux running on it. OK, so they gave
up when I gave them an outline of what was in there, but ... ;-)

M.

PS. No I'm not suggesting we should support 386 SMP ;-)

2002-03-10 20:25:15

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> two ago, asking what the internal structure of a Sequent Symmetry
> was, so that they could get Linux running on it. OK, so they gave
> up when I gave them an outline of what was in there, but ... ;-)

I feel safe. A long time ago I looked at a Symmetry, but 3-phase power was
harder to arrange 8)

2002-03-10 20:28:15

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

>> two ago, asking what the internal structure of a Sequent Symmetry
>> was, so that they could get Linux running on it. OK, so they gave
>> up when I gave them an outline of what was in there, but ... ;-)
>
> I feel safe. A long time ago I looked at a symettry but 3 phase
> power was harder to arrange 8)

IIRC the half-height cabinets run off single-phase 240V. I'm
sure I can arrange to have a 386 half height system shipped to
you ;-) ;-)

M.

PS. Have fun with that VME bus ...

2002-03-10 20:51:01

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> IIRC the half height cabinets run of single phase, 240V. I'm
> sure I can arrange to have a 386 half height system shipped to
> you ;-) ;-)
>
> PS. Have fun with that VME bus ...

Actually we have VME code for Linux, including a PCI/VME bridge. It's not
in the base distro, as the author never needed to tidy it up to make it
work for arbitrary PCI/VME bridges.

I think you should ship the machine to hpa though, then he can run
kernel.org on it 8)

2002-03-11 14:13:49

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 08 March 2002 09:12 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > The point being that the difference between a "decl" and a "lock ;
> > > decl" is about 1:12 or so in performance.
> >
> > I am no expert in architecture, but if it's done through the cache
> > coherency mechanism, the overhead shouldn't be 12:1. You simply mark the
> > cache line as part of your instruction to avoid a cache line transfer. How
> > can that be 12 times slower? ... Ready to be educated...
>
> A lock in a SMP system also needs to synchronize the instruction stream,
> and not let stores move "out" from the locked region.
>
> On a UP system, this all happens automatically (well, getting it to happen
> right is obviously one of the big issues in an out-of-order CPU core, but
> it's a very fundamental part of the core, so it's "free" in the sense that
> if it isn't done, the CPU simply doesn't work).
>
> On SMP, it's a memory barrier. This is why a "lock ; decl" is more
> expensive than a "decl" - it's the implied memory ordering constraints (on
> other architectures they are explicit). On an intel CPU, this basically
> means that the pipeline is drained, so a locked instruction takes roughly
> 12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
> about this one). I haven't timed a P4 lately, I think it's worse.
>
> Other architectures do the memory ordering explicitly, and some are
> better, some are worse. But it always costs you _something_.
>
> Linus


Sure, not contending that. Right now I think our focus should be to get the
right functionality out and address people's concerns.
Improvements, as you suggested, are orthogonal and can always be put
in later.

--
-- Hubertus Franke ([email protected])

2002-03-11 22:47:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Sat, 9 Mar 2002, Rusty Russell wrote:
> >
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the code
> > sucks right now.
>
> Actually, I disagree.
>
> 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
> later if required.
> 2) Someone needs to implement them and prove they are superior to the
> pure userspace solution.

You've convinced me.

Considering how long people argued about dubious cycle measurements on the
rwsem implementation, and where the current one actually uses a spinlock
for exclusion on the fast path, the kernel lock really probably doesn't
need to be expanded, and as there is provable overhead to the expansion,
I'll just agree with you.

Applied.

Linus

2002-03-11 23:12:35

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Monday 11 March 2002 05:45 pm, Linus Torvalds wrote:
> On Sat, 9 Mar 2002, Rusty Russell wrote:
> > > So I would suggest making the size (and thus alignment check) of locks
> > > at least 8 bytes (and preferably 16). That makes it slightly harder to
> > > put locks on the stack, but gcc does support stack alignment, even if
> > > the code sucks right now.
> >
> > Actually, I disagree.
> >
> > 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
> > later if required.
> > 2) Someone needs to implement them and prove they are superior to the
> > pure userspace solution.
>
> You've convinced me.
>
> Considering how long people argued about dubious cycle measurements on the
> rwsem implementation, and where the current one actually uses a spinlock
> for exclusion on the fast path, the kernel lock really probably doesn't
> need to be expanded, and as there is provable overhead to the expansion,
> I'll just agree with you.
>
> Applied.
>
> Linus

Great. Now that this is settled, let's go back to the compare-and-swap
issues.

As far as I can tell, we need compare-and-swap for a single-queue
implementation of rwsems. Again, most architectures provide something like
that today. As for those which don't, why not provide either of the following
approaches:

(a) a spinlock around the update code in the kernel. We could provide multiple
spinlocks to avoid potential collisions.
(b) only provide the futex in the kernel, and a user-library approach using 2
queues as shown by Rusty. The lack of cmpxchg support would be exported by the
futex_region call.

--
-- Hubertus Franke ([email protected])
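
For option (a), a sketch of what the in-kernel fallback might look like
(hypothetical code, not from any posted patch; it reuses hash_ptr() from
the hash.h merged above, and it is only safe if every update to the word
goes through this same path):

        #include <linux/spinlock.h>
        #include <linux/hash.h>
        #include <asm/atomic.h>

        #define EMU_HASHBITS 4
        /* each lock set up with spin_lock_init() at boot (init omitted) */
        static spinlock_t emu_locks[1 << EMU_HASHBITS];

        /* Compare-and-swap emulated under a hashed spinlock, for
           architectures with no native cmpxchg. Returns 1 on success. */
        static int emu_cmpxchg(atomic_t *v, int old, int new)
        {
                spinlock_t *l = &emu_locks[hash_ptr(v, EMU_HASHBITS)];
                int ret = 0;

                spin_lock(l);
                if (atomic_read(v) == old) {
                        atomic_set(v, new);
                        ret = 1;
                }
                spin_unlock(l);
                return ret;
        }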

2002-03-12 09:36:52

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

"H. Peter Anvin" wrote:

>
> Okay, I'll say it and be unpopular...
>
> Perhaps it's time to drop i386 support?
>
Wouldn't it be better to just separate it out? I.e. make i386
an arch of its own, while most PC people use a "486 and up" arch?

The few who actually want 386 code won't lose it, and other developers
won't have to bother with 386 issues. Then drop the 386 arch when it
dies from lack of maintenance...

Helge Hafting

2002-03-12 09:55:05

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you
write:
>
> On Sat, 9 Mar 2002, Rusty Russell wrote:
> > >
> > > So I would suggest making the size (and thus alignment check) of locks at
> > > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > > locks on the stack, but gcc does support stack alignment, even if the code
> > > sucks right now.
> >
> > Actually, I disagree.
> >
> > 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
> > later if required.
> > 2) Someone needs to implement them and prove they are superior to the
> > pure userspace solution.
>
> You've convinced me.

Damn. Because now I've been playing with a different approach.

If we basically export "add_to_waitqueue", "del_from_waitqueue",
"wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
powerful interface: the kernel need not touch userspace addresses at
all (no kmap/kunmap, no worries about spinlocks vs. rwlocks).

The problem is that this fundamentally requires at least two syscalls
in the slow path (add_to_waitqueue, try for lock, wait_for_waitqueue).
My tests here show it's about 6% slower than the solution you accepted
for tdbtorture (which means the slow path is significantly slower). I
can't imagine shaving that much more off it.

There are variations on this: the cookie could be replaced by the page
struct and the offset, a la futexes.

Thoughts?
Rusty.

PS. Kudos: it was Ben LaHaise's idea to export waitqueues, but I
didn't see how to do it until Paul M made a bad joke about two
syscalls....
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/kernel/entry.S working-2.5.6-hwq/arch/i386/kernel/entry.S
--- linux-2.5.6/arch/i386/kernel/entry.S Fri Mar 8 14:49:11 2002
+++ working-2.5.6-hwq/arch/i386/kernel/entry.S Mon Mar 11 15:50:20 2002
@@ -717,6 +717,9 @@
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
.long SYMBOL_NAME(sys_sendfile64)
+ .long SYMBOL_NAME(sys_uwaitq_add) /* 240 */
+ .long SYMBOL_NAME(sys_uwaitq_wait)
+ .long SYMBOL_NAME(sys_uwaitq_wake)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/mman.h working-2.5.6-hwq/include/asm-i386/mman.h
--- linux-2.5.6/include/asm-i386/mman.h Wed Mar 15 12:45:20 2000
+++ working-2.5.6-hwq/include/asm-i386/mman.h Mon Mar 11 15:58:59 2002
@@ -5,6 +5,7 @@
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
#define PROT_NONE 0x0 /* page can not be accessed */
+#define PROT_SEM 0x8 /* page can contain semaphores */

#define MAP_SHARED 0x01 /* Share changes */
#define MAP_PRIVATE 0x02 /* Changes are private */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/unistd.h working-2.5.6-hwq/include/asm-i386/unistd.h
--- linux-2.5.6/include/asm-i386/unistd.h Fri Mar 8 14:49:28 2002
+++ working-2.5.6-hwq/include/asm-i386/unistd.h Mon Mar 11 14:52:07 2002
@@ -244,6 +244,9 @@
#define __NR_fremovexattr 237
#define __NR_tkill 238
#define __NR_sendfile64 239
+#define __NR_uwaitq_add 240
+#define __NR_uwaitq_wait 241
+#define __NR_uwaitq_wake 242

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/sched.h working-2.5.6-hwq/include/linux/sched.h
--- linux-2.5.6/include/linux/sched.h Sat Mar 9 14:52:15 2002
+++ working-2.5.6-hwq/include/linux/sched.h Tue Mar 12 13:54:15 2002
@@ -230,6 +230,11 @@

typedef struct prio_array prio_array_t;

+struct uwaitq {
+ struct list_head list;
+ unsigned long /*long*/ cookie;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -344,6 +349,8 @@

/* journalling filesystem info */
void *journal_info;
+/* user wait queue info */
+ struct uwaitq uwaitq;
};

extern void __put_task_struct(struct task_struct *tsk);
@@ -508,6 +515,13 @@
extern int kill_proc(pid_t, int, int);
extern int do_sigaction(int, const struct k_sigaction *, struct k_sigaction *);
extern int do_sigaltstack(const stack_t *, stack_t *, unsigned long);
+
+extern void __uwaitq_unqueue(struct uwaitq *q);
+static inline void uwaitq_unqueue(struct uwaitq *q)
+{
+ if (q->cookie)
+ __uwaitq_unqueue(q);
+}

/*
* Re-calculate pending state from the set of locally pending
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/Makefile working-2.5.6-hwq/kernel/Makefile
--- linux-2.5.6/kernel/Makefile Wed Feb 20 17:56:17 2002
+++ working-2.5.6-hwq/kernel/Makefile Mon Mar 11 14:50:49 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o uwaitq.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/exit.c working-2.5.6-hwq/kernel/exit.c
--- linux-2.5.6/kernel/exit.c Wed Feb 20 17:57:21 2002
+++ working-2.5.6-hwq/kernel/exit.c Tue Mar 12 12:57:09 2002
@@ -502,7 +502,7 @@
acct_process(code);
#endif
__exit_mm(tsk);
-
+ uwaitq_unqueue(&tsk->uwaitq);
lock_kernel();
sem_exit();
__exit_files(tsk);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/fork.c working-2.5.6-hwq/kernel/fork.c
--- linux-2.5.6/kernel/fork.c Fri Mar 8 14:49:30 2002
+++ working-2.5.6-hwq/kernel/fork.c Tue Mar 12 12:52:27 2002
@@ -720,6 +720,8 @@
if (retval)
goto bad_fork_cleanup_namespace;
p->semundo = NULL;
+ INIT_LIST_HEAD(&p->uwaitq.list);
+ p->uwaitq.cookie = 0;

/* Our parent execution domain becomes current domain
These must match for thread signalling to apply */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/uwaitq.c working-2.5.6-hwq/kernel/uwaitq.c
--- linux-2.5.6/kernel/uwaitq.c Thu Jan 1 10:00:00 1970
+++ working-2.5.6-hwq/kernel/uwaitq.c Tue Mar 12 14:06:58 2002
@@ -0,0 +1,153 @@
+/*
+ * User-exported wait queues.
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues", and Paul
+ * Mackerras for suggesting breaking it into multiple syscalls.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <asm/uaccess.h>
+
+/* Theory is simple:
+ 1) If we are on a hash list, we are waiting to be woken up.
+ 2) Otherwise, if cookie is not zero, we have just been woken up.
+ 3) Otherwise, we are not involved in any userspace wait queues.
+*/
+/* FIXME: This may be way too small. --RR */
+#define UWAITQ_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+ the relevant ones (hashed queues may be shared) */
+struct uwaitq_head
+{
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned;
+static struct uwaitq_head uwait_queues[1<<UWAITQ_HASHBITS] __cacheline_aligned;
+
+static inline struct uwaitq_head *hash_uwaitq(unsigned long /*long*/ cookie)
+{
+ return &uwait_queues[cookie & ((1<<UWAITQ_HASHBITS)-1)];
+}
+
+/* struct uwaitq is always embedded in a task struct. */
+static inline struct task_struct *uwaitq_to_task(struct uwaitq *q)
+{
+ return (void *)q - offsetof(struct task_struct, uwaitq);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct uwaitq_head *head)
+{
+ spin_lock(&head->lock);
+ list_add_tail(&current->uwaitq.list, &head->list);
+ spin_unlock(&head->lock);
+}
+
+/* Must be holding spinlock */
+static void wake_by_cookie(struct uwaitq_head *head, unsigned long /*long*/ cookie)
+{
+ struct list_head *i;
+
+ list_for_each(i, &head->list) {
+ struct uwaitq *this = list_entry(i, struct uwaitq, list);
+
+ if (cookie == this->cookie) {
+ list_del_init(&this->list);
+ wmb();
+ wake_up_process(uwaitq_to_task(this));
+ return;
+ }
+ }
+}
+
+void __uwaitq_unqueue(struct uwaitq *q)
+{
+ struct uwaitq_head *head;
+
+ head = hash_uwaitq(q->cookie);
+ spin_lock(&head->lock);
+ /* If we have been woken, we must wake someone else since we
+ are no longer interested. */
+ if (list_empty(&q->list))
+ wake_by_cookie(head, q->cookie);
+ else
+ list_del_init(&q->list);
+ spin_unlock(&head->lock);
+}
+
+asmlinkage int sys_uwaitq_wake(unsigned long /*long*/ cookie)
+{
+ struct uwaitq_head *head;
+
+ head = hash_uwaitq(cookie);
+ spin_lock(&head->lock);
+ wake_by_cookie(head, cookie);
+ spin_unlock(&head->lock);
+ return 0;
+}
+
+/* Add to the wait queue: 0 cookie means delete. */
+asmlinkage int sys_uwaitq_add(unsigned long /*long*/ cookie)
+{
+ /* Unqueue if is/was queued. */
+ uwaitq_unqueue(&current->uwaitq);
+
+ /* Set cookie. */
+ current->uwaitq.cookie = cookie;
+
+ /* If non-zero, requeue */
+ if (cookie)
+ queue_me(hash_uwaitq(cookie));
+ return 0;
+}
+
+/* Wait to be chucked off the queue. */
+asmlinkage int sys_uwaitq_wait(void)
+{
+ int retval = 0;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ while (!list_empty(&current->uwaitq.list)) {
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ __uwaitq_unqueue(&current->uwaitq);
+ goto out;
+ }
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+ /* Mark the fact that we were woken up. */
+ current->uwaitq.cookie = 0;
+ out:
+ set_current_state(TASK_RUNNING);
+ return retval;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(uwait_queues); i++) {
+ INIT_LIST_HEAD(&uwait_queues[i].list);
+ spin_lock_init(&uwait_queues[i].lock);
+ }
+ return 0;
+}
+__initcall(init);

2002-03-12 14:27:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Hi!

> >So if anything its just not worth the effort of breaking the 386 setup
> >either 8). 386 SMP is a different issue but I don't see any lunatics doing
> >a 386 based sequent port thankfully.
>
> Since the only person that comes to mind that would be crazy enough to
> even _try_ to use Linux on 386-SMP is you, Alan, I'm really relieved to
> hear you say that ;)
>
> And no, it's not worth discontinuing i386 support. It just isn't
> painful enough to maintain.
>
> Note that the i386 has _long_ been a "stepchild", though: because of the
> lack of WP, the kernel simply doesn't do threaded MM correctly on a 386.
> Never has, and never will.
>
> However, the known "incorrect" case is so obscure that it's not even an
> issue - although I suspect that it means that you should not let
> untrusted users run on a i386 server machine that contains any sensitive
> data. I could cerrtainly come up with exploits that would work at least
> in theory (whether they are workable in practice I don't know).

That should mean at least a warning during bootup. The "Checking for WP
bit..." message does not suggest that a missing WP bit means a security
hole.

--- clean.2.5//arch/i386/mm/init.c Sun Mar 10 20:06:31 2002
+++ linux/arch/i386/mm/init.c Mon Mar 11 21:49:14 2002
@@ -383,7 +383,7 @@
local_flush_tlb();

if (!boot_cpu_data.wp_works_ok) {
- printk("No.\n");
+ printk("No (that's a security hole).\n");
#ifdef CONFIG_X86_WP_WORKS_OK
panic("This kernel doesn't support CPU's with broken WP. Recompile it for a 386!");
#endif

Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-03-12 14:56:05

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tuesday 12 March 2002 02:20 am, Rusty Russell wrote:
> In message <[email protected]>
> you write:
> > On Sat, 9 Mar 2002, Rusty Russell wrote:
> > > > So I would suggest making the size (and thus alignment check) of
> > > > locks at least 8 bytes (and preferably 16). That makes it slightly
> > > > harder to put locks on the stack, but gcc does support stack
> > > > alignment, even if the code
> > > > sucks right now.
> > >
> > > Actually, I disagree.
> > >
> > > 1) We've left wiggle room in the second arg to sys_futex() to add
> > > rwsems later if required.
> > > 2) Someone needs to implement them and prove they are superior to the
> > > pure userspace solution.
> >
> > You've convinced me.
>
> Damn. Because now I've been playing with a different approach.
>
> If we basically export "add_to_waitqueue", "del_from_waitqueue",
> "wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
> powerful interface: the kernel need not touch userspace addresses at
> all (no kmap/kunmap, no worries about spinlocks vs. rwlocks).
>
> The problem is that this fundamentally requires at least two syscalls
> in the slow path (add_to_waitqueue, try for lock, wait_for_waitqueue).
> My tests here show it's about 6% slower than the solution you accepted
> for tdbtorture (which means the slow path is significantly slower). I
> can't imagine shaving that much more off it.
>
> There are variations on this: the cookie could be replaced by the page struct
> and the offset, a la futexes.
>
> Thoughts?
> Rusty.
>
> PS. Kudos: it was Ben LaHaise's idea to export waitqueues, but I
> didn't see how to do it until Paul M made a bad joke about two
> syscalls....

Rusty, aren't you now going back to the design that I implemented after Ben's
comments?
From the get-go, I never touched the user address in the kernel, as I thought
it would require detailed knowledge of the user-level locking strategy.

Could you explain why you need add_to_waitqueue and wait_for_waitqueue as
separate calls? Is it for resolving race conditions?

One comment with respect to multiple wait queues and rwsems:
Again, it will allow you to do reader-pref and/or writer-pref, but not
something like FIFO, i.e. wake up a writer if it is the first waiter, or wake
up all readers if a reader is first, and so on.
I don't know whether the latter is terribly important ...

--
-- Hubertus Franke ([email protected])

2002-03-12 17:19:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


On Tue, 12 Mar 2002, Rusty Russell wrote:
> >
> > You've convinced me.
>
> Damn. Because now I've been playing with a different approach.

I don't think your current patch is very useful.

It's obviously slower, and while it is an interesting approach for not
just lock generation but also for synchronization points, it doesn't seem
to actually _buy_ you anything. And since the cookie isn't guaranteed to
be unique, you can't actually use it as a synchronization point on its
own, but must always still have some shared memory location as a
confirmation for whatever the synchronization was.

Finally, waitqueue's (to me) always were about two big points:

- natural race condition avoidance through ordering:

        current->state = sleeping;
        add_wait_queue();
        if (test)
                schedule();

which you basically emulate with the "zero cookie" thing.

- ability to wait on multiple events concurrently

which you don't take any advantage of at all.

So you kind of missed the second big point of waitqueues, so the end
result really isn't any more fundamentally powerful than the (faster)
specialized semaphore system call as far as I can tell.

In short, I would argue that this approach, while interesting, doesn't
actually _buy_ you anything.

Now, if you want to wake on any of N events, then a "add_wait_queue*N +
wait" approach actually makes sense. But quite frankly, once you are there
you should really instead do full events, and go away and work together
with Ben on the aio stuff instead of this.

So: interesting approach, but in its current form pointless as far as I
can see.

Linus

2002-03-13 02:57:38

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you
write:
>
> On Tue, 12 Mar 2002, Rusty Russell wrote:
> > >
> > > You've convinced me.
> >
> > Damn. Because now I've been playing with a different approach.
>
> I don't think your current patch is very useful.

I agree. But your "Applied" email rushed me into posting it.

> It's obviously slower, and while it is an interesting approach for not
> just lock generation but also for synchronization points, it doesn't seem
> to actually _buy_ you anything. And since the cookie isn't guaranteed to
> be unique, you can't actually use it as a synchronization point on its
> own, but must always still have some shared memory location as a
> confirmation for whatever the synchronization was.

My original cookie was 128 bits, i.e. unique.

> So: interesting approach, but in its current form pointless as far as I
> can see.

Yeah, I'm not sure what it's useful for either. But the code is out
there if someone gets inspired...

Thanks,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-13 04:02:22

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> > If we basically export "add_to_waitqueue", "del_from_waitqueue",
> > "wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
> > powerful interface: the kernel need not touch userspace addresses at
> > all (no kmap/kunmap, no worried about spinlocks vs. rwlocks).
>
> Rusty, aren't you now going back to the design that I implemented after Ben's
> comments?

> From the get-go, I never touched the user address in the kernel, as
> I thought it would require detailed knowledge of the user level
> locking strategy.

Yes, as with my initial patch (like you, I had a semaphore in the
kernel). However, with two separate syscalls, you don't need to
allocate anything in the kernel, and still know nothing about the
userspace locking.

> Could you explain, why you need add_to_waitqueue and wait_for_waitqueue as
> separate calls ? Is it for resolving a race conditions ?

Yep. Exactly analogous to the kernel idiom (add to queue, check, sleep).
(Except the "waker dequeues us" microoptimization).

> One comment with respect to multiple wait queues and rwsems:
> Again, it will allow you to do reader-pref and/or writer-pref, but not
> something like FIFO, i.e. wake up a writer if it is the first waiter, or wake
> up all readers if a reader is first, and so on.
> I don't know whether the latter is terribly important ...

(Aside: I'm still reluctant to implement strict FIFO locks until
someone shows a starvation case in the current locks).

I thought about this a little. If we add a flags arg to the
sys_uwaitq_wake() and make sys_uwaitq_wait() return the flags used by
the waker, and sys_uwaitq_wake return 1 if it woke someone...

#define UWAITQ_PASSING 1
up:
        ret = sys_uwaitq_wake(cookie, UWAITQ_PASSING);
        if (ret < 0) return -1;
        /* No waiters actually waiting? */
        if (ret == 0) {
                lock->counter = 1;
                sys_uwaitq_wake(cookie, 0);
        }

down:
        sys_uwaitq_add(cookie);
        if (atomic dec counter to 0) {
                sys_uwaitq_add(0); /* unqueue */
                return 0;
        }
        ret = sys_uwaitq_wait(cookie);
        if (ret < 0)
                return -1;
        if (ret == UWAITQ_PASSING)
                return 0;
        goto down; /* spin again */

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-13 07:37:34

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Sun, 10 Mar 2002 19:41:36 +0000 (UTC)
[email protected] (Linus Torvalds) wrote:

> And no, it's not worth discontinuing i386 support. It just isn't
> painful enough to maintain.

How about just dropping 386 + SMP support?

Then it would be a no-brainer to make cmpxchg a generic operation, AND
export it to userspace in the proposed "kernel routine page".

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/config.in working-2.5.6-futex-6/arch/i386/config.in
--- linux-2.5.6/arch/i386/config.in Wed Feb 20 17:56:59 2002
+++ working-2.5.6-futex-6/arch/i386/config.in Sat Mar 9 15:17:18 2002
@@ -177,7 +177,9 @@

bool 'Math emulation' CONFIG_MATH_EMULATION
bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
-bool 'Symmetric multi-processing support' CONFIG_SMP
+if [ "$CONFIG_M386" != "y" ]; then
+ bool 'Symmetric multi-processing support' CONFIG_SMP
+fi
bool 'Preemptible Kernel' CONFIG_PREEMPT
if [ "$CONFIG_SMP" != "y" ]; then
bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC

Thanks!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-13 09:13:43

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> > > On Tue, 12 Mar 2002, Rusty Russell wrote:
> > > > > > > You've convinced me.
> > > > > Damn. Because now I've been playing with a different approach.
> > > I don't think your current patch is very useful.

> I agree. But your "Applied" EMail rushed me into posting it.


The normal way to use multithreading under UNIX is the pthread
library. Here the condition variables are the equivalent of kernel
wait queues. So I think to really implement a fast pthread lib based
on futexes, one needs some means to implement condition variables
(with synchronous futex release, to implement pthread_cond_wait()!).

This could either be done with the exported-waitqueue approach, or a bit
more easily (but less generally) by associating a second hashed waitqueue
with each futex (maybe keyed by the odd offset+1?). Then we would have
two additional variants of sys_futex (with parameters FUTEX_WAIT,
FUTEX_SIGNAL).

In principle, the implementation is:

FUTEX_WAIT:
        add_to_cond_queue
        current->state = INTERRUPTIBLE
        futex_up
        schedule
        remove_from_cond_queue
        futex_down

FUTEX_SIGNAL:
        wake_up_all on cond_queue


Later we may also want FUTEX_SIGNAL_ONE and FUTEX_WAIT_TIMEOUT.

The userspace code for pthread_cond_wait then of course needs to chain
the protecting pthread_mutex with the futex used as the condition
variable.
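
Roughly, the userspace side might then look like this (a sketch only,
glossing over the userspace fast path and error handling; FUTEX_WAIT
and FUTEX_SIGNAL are the ops proposed above, and the struct layouts
are purely illustrative):

struct futex { int count; };
typedef struct { struct futex futex; } pthread_mutex_t;
typedef struct { struct futex futex; } pthread_cond_t; /* futex as condvar */

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
        sys_futex(&cond->futex, FUTEX_DOWN);   /* protect condvar state */
        sys_futex(&mutex->futex, FUTEX_UP);    /* release caller's mutex */
        /* Kernel, in one call: enqueue on the cond queue, up the
           condvar futex, sleep, dequeue, re-down the condvar futex. */
        sys_futex(&cond->futex, FUTEX_WAIT);
        sys_futex(&cond->futex, FUTEX_UP);     /* drop condvar futex */
        return sys_futex(&mutex->futex, FUTEX_DOWN); /* re-acquire mutex */
}

int pthread_cond_signal(pthread_cond_t *cond)
{
        return sys_futex(&cond->futex, FUTEX_SIGNAL); /* wake cond queue */
}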


Martin

2002-03-13 16:22:52

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> > And no, it's not worth discontinuing i386 support. It just isn't
> > painful enough to maintain.
>
> How about just dropping 386 + SMP support?

We don't support SMP 386 boxes anyway, so there's nothing to drop. We've
never supported anything earlier than 486 SMP, like the early MP 1.1
compliant IBM boards (and briefly the Compaq non-MP-1.1-compliant stuff;
Thomas Radke - the forgotten man in the creation of SMP Linux - did the
2/4-way Compaq support).

2002-03-13 19:44:11

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Wed, 13 Mar 2002, Martin Wirth wrote:

> The normal way to use multithreading under UNIX is the pthread
> library. Here the condition variables are the equivalent to kernel
> wait queues. So I think to really implement a fast pthread lib based
> on futexes one needs some means to implement condition variables
> (with synchronous futex release to implement pthread_cond_wait(..)!).

Let me mention this again... The IBM release of NGPT states that Linus has
approved the inclusion of the NGPT patches in the mainline kernel. Will
this be in the 2.4.19 release? I've been running 2.4.17 for NGPT; I haven't
tried 2.4.19, other than to see that the patch didn't apply.

(NGPT = Next Generation Pthreads, a cleaner and faster POSIX threads
implementation)

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-03-13 19:53:22

by Dave McCracken

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


--On Wednesday, March 13, 2002 02:41:52 PM -0500 Bill Davidsen
<[email protected]> wrote:

> Let me mention this again... The IBM release of NGPT states that Linus has
> approved the inclusion of the NGPT patches in the mainline kernel. Will
> this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> tried 2.4.19 other than to see the patch didn't apply).

The 2.4 patch needed for NGPT was accepted by Marcelo and is in 2.4.19-pre3.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2002-03-13 19:55:02

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

> Let me mention this again... The IBM release of NGPT states that Linus has
> approved the inclusion of the NGPT patches in the mainline kernel. Will
> this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> tried 2.4.19 other than to see the patch didn't apply).

2.5, but hopefully it will get backported once it proves solid and the
API is fixed.

2002-03-13 22:19:06

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Wed, 13 Mar 2002, Dave McCracken wrote:

>
> --On Wednesday, March 13, 2002 02:41:52 PM -0500 Bill Davidsen
> <[email protected]> wrote:
>
> > Let me mention this again... The IBM release of NGPT states that Linus has
> > approved the inclusion of the NGPT patches in the mainline kernel. Will
> > this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> > tried 2.4.19 other than to see the patch didn't apply).
>
> The 2.4 patch needed for NGPT was accepted by Marcelo and is in 2.4.19-pre3.

Good info, thanks! I hand-edited the 2.4.17 patch for 2.4.18, but 19-pre2
didn't apply and I ran out of time about 1am this morning ;-) When pre3
settles down a bit I'll use that as a base.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-03-15 07:29:19

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> > > > On Tue, 12 Mar 2002, Rusty Russell wrote:
> > > > > > > > You've convinced me.
> > > > > > Damn. Because now I've been playing with a different approach.
> > > > I don't think your current patch is very useful.
>
> > I agree. But your "Applied" EMail rushed me into posting it.
>
>
> The normal way to use multithreading under UNIX is the pthread
> library. Here the condition variables are the equivalent to kernel
> wait queues. So I think to really implement a fast pthread lib based
> on futexes one needs some means to implement condition variables
> (with synchronous futex release to implement pthread_cond_wait(..)!).

Discussions with Ulrich have reaffirmed my opinion that pthreads are
crap. Hence I'm not all that tempted to warp the (nice, clean,
usable) futex code too far to meet pthreads' weird needs.

However, it's not too hard to implement condition variables using an
unavailable mutex, if we go for "full" semaphores: ie. not just
mutexes. It requires a bit more of a stretch for kernel atomic ops...

Yet another untested patch,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.7-pre1/kernel/futex.c working-2.5.7-pre1-sem/kernel/futex.c
--- linux-2.5.7-pre1/kernel/futex.c Wed Mar 13 13:30:39 2002
+++ working-2.5.7-pre1-sem/kernel/futex.c Fri Mar 15 18:18:37 2002
@@ -129,17 +129,18 @@
return page;
}

-/* Try to decrement the user count to zero. */
-static int decrement_to_zero(struct page *page, unsigned int offset)
+/* Try to decrement the user count to >= zero. */
+static int decrement_futex(struct page *page, unsigned int offset)
{
atomic_t *count;
int ret = 0;

count = kmap(page) + offset;
- /* If we take the semaphore from 1 to 0, it's ours. If it's
- zero, decrement anyway, to indicate we are waiting. If
- it's negative, don't decrement so we don't wrap... */
- if (atomic_read(count) >= 0 && atomic_dec_and_test(count))
+ /* If we take one from the sem, and it's still >= 0, it's
+ ours. If it's zero, we decrement anyway to indicate we are
+ waiting. If it's negative, don't decrement so we don't
+ wrap... */
+ if (atomic_read(count) >= 0 && !atomic_add_negative(-1, count))
ret = 1;
kunmap(page);
return ret;
@@ -173,12 +174,36 @@
return retval;
}

+/* Atomically: Add 1 if already positive, otherwise set to 1. */
+static void futex_make_positive(atomic_t *count)
+{
+ int old_count, new_count;
+
+#ifndef CONFIG_SMP
+ preempt_disable();
+ old_count = atomic_read(count);
+ new_count = old_count > 0 ? old_count+1 : 1;
+ atomic_set(count, new_count);
+ preempt_enable();
+#elif defined(__HAVE_ARCH_CMPXCHG)
+ do {
+ old_count = atomic_read(count);
+ new_count = old_count > 0 ? old_count+1 : 1;
+ } while (cmpxchg(count, old_count, new_count) != old_count);
+#else
+ /* Do this one at a time. You will need to implement
+ atomic_add_positive(). */
+ while (!atomic_add_positive(count, 1));
+#endif
+}
+
static int futex_up(struct list_head *head, struct page *page, int offset)
{
atomic_t *count;

count = kmap(page) + offset;
- atomic_set(count, 1);
+ /* set to MAX(count, 0) + 1 */
+ futex_make_positive(count);
smp_wmb();
kunmap(page);
wake_one_waiter(head, page, offset);

2002-03-15 08:43:00

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


Rusty Russell wrote:

>
>Discussions with Ulrich have reaffirmed my opinion that pthreads are
>crap. Hence I'm not all that tempted to warp the (nice, clean,
>usable) futex code too far to meet pthreads' wierd needs.
>
Crap or not, there are tons of software based on pthreads, and at least
the NGPT team says that Linus agreed to implement the necessary kernel
infrastructure for a full, fast pthread implementation.

Now, if you want to implement mutexes and condition variables with the
PTHREAD_PROCESS_SHARED attribute, then you need some functionality like
the futexes. Otherwise NGPT will add its own syscalls to handle these
things, which is simply unnecessary duplicated functionality.

>
>However, it's not too hard to implement condition variables using an
>unavailable mutex, if we go for "full" semaphores: ie. not just
>mutexes. It requires a bit more of a stretch for kernel atomic ops...
>
A full semaphore is nice, but not a full replacement for a waitqueue (or
a pthread condition variable, brr..). For the semaphore you always have
to ensure that the ups and downs are balanced, which is not the case
for the condition variable.

Martin


2002-03-15 15:29:25

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Friday 15 March 2002 03:41 am, Martin Wirth wrote:
> Rusty Russell wrote:
> >Discussions with Ulrich have reaffirmed my opinion that pthreads are
> >crap. Hence I'm not all that tempted to warp the (nice, clean,
> >usable) futex code too far to meet pthreads' wierd needs.
>
> Crap or not, there are tons of software based on pthreads and at least
> the NGPT team says that Linus
> agreed to implement for necessary kernel-infrastructure for a full, fast
> pthread implementation.
>
> Now, if you want to implement mutexes and condition variables with the
> attribute PTHREAD_PROCESS_SHARED then you need some functionality like the
> futexes. Or NGPT will add his own syscalls to handle these things, which is
> simply unnecessary double functionality.
>
> >However, it's not too hard to implement condition variables using an
> >unavailable mutex, if we go for "full" semaphores: ie. not just
> >mutexes. It requires a bit more of a stretch for kernel atomic ops...
>
> A full semaphore is nice, but not a full replacement for a waitqueue (or
> a pthread condition variable brr..).
> For the semaphore you always have to assure that the ups and downs are
> balanced, what is not the case
> for the condition variable.
>
> Martin
>

Folks, it's not as simple as "use futex for PTHREAD_PROCESS_SHARED".
First, you must realize that, conceptually, the N kernel threads
utilized in an M:N thread model like NGPT are virtual processors.
Hence you can't simply wait in the kernel, as you would block your
v-proc. Hence the current futex interface of up and down kernel calls
is not sufficient.

What is required is an asynchronous mechanism that lets a v-proc
leave a notification object <nobj> in the kernel that gets enqueued just
like every other waiting task. <nobj> ::= <v-proc, struct *futex>
When the futex is signaled and <nobj> is woken up, a scheduling event is
sent to the <v-proc> or its task or its process (this has to be thought
through). This can be done through a signal, or through a shared event
queue, or a combination of such.

Under no circumstances can you block the <v-proc> == task on a futex !!!
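
One possible shape for such a notification object, purely illustrative -
nothing like this exists in the patches under discussion:

struct futex_nobj {
        struct list_head list;      /* enqueued like any other waiter */
        struct task_struct *vproc;  /* kernel thread acting as the v-proc */
        struct futex *futex;        /* the user futex being waited on */
        /* wakeup delivery: a signal, a shared event queue, or both */
};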

I talked with Bill Abt of the NGPT team about it and he conceptually
agrees with this approach, but since the regular interface is still not
hammered out and stable, there's no point going after more sophisticated
stuff yet.

--
-- Hubertus Franke ([email protected])

2002-03-15 16:24:22

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Martin Wirth wrote:

>
> Rusty Russell wrote:
>
>>
>> Discussions with Ulrich have reaffirmed my opinion that pthreads are
>> crap. Hence I'm not all that tempted to warp the (nice, clean,
>> usable) futex code too far to meet pthreads' wierd needs.
>>
> Crap or not, there are tons of software based on pthreads and at least
> the NGPT team says that Linus
> agreed to implement for necessary kernel-infrastructure for a full, fast
> pthread implementation.
>
> Now, if you want to implement mutexes and condition variables with the
> attribute
> PTHREAD_PROCESS_SHARED then you need some functionality like the futexes.
> Or NGPT will add his own syscalls to handle these things, which is simply
> unnecessary double functionality.
>


I think the "crap" refers to current missing meatures of linuxthreads
(most notable: PTHREAD_PROCESS_SHARED on cond and mutex, don't know about sema)

BTW, NGPT introduces two new syscalls: gettid and tkill

>>
>> However, it's not too hard to implement condition variables using an
>> unavailable mutex, if we go for "full" semaphores: ie. not just
>> mutexes. It requires a bit more of a stretch for kernel atomic ops...
>>
> A full semaphore is nice, but not a full replacement for a waitqueue (or
> a pthread condition variable brr..).
> For the semaphore you always have to assure that the ups and downs are
> balanced, what is not the case
> for the condition variable.
>

Also remember pthread_cond_broadcast - waking up _all_ waiting threads.
Whether the woken-up threads check their condition and go to sleep again
is up to them (read: the standard mandates that _all_ get woken up).

pthread_cond_signal notifies _one_ thread - which one depends on the
implementation (I would like to see a priority-based decision).

2002-03-16 00:11:40

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> Martin Wirth wrote:
> > Rusty Russell wrote:
> >
> >>
> >> Discussions with Ulrich have reaffirmed my opinion that pthreads are
> >> crap. Hence I'm not all that tempted to warp the (nice, clean,
> >> usable) futex code too far to meet pthreads' wierd needs.
> >>
> > Crap or not, there are tons of software based on pthreads and at least
> > the NGPT team says that Linus
> > agreed to implement for necessary kernel-infrastructure for a full, fast
> > pthread implementation.

Let me clarify my "pthreads are crap" statement.

I firmly believe that there is a place for clone & futex threaded
programs, which is not met by pthreads for cleanliness and performance
reasons, and that such programs will become increasingly important.

Therefore I refuse to penalise such progressive programs so we can
have standards compliance. Hence my insistance on a clean, minimal,
useful interface.

> >> However, it's not too hard to implement condition variables using an
> >> unavailable mutex, if we go for "full" semaphores: ie. not just
> >> mutexes. It requires a bit more of a stretch for kernel atomic ops...
> >>
> > A full semaphore is nice, but not a full replacement for a waitqueue (or
> > a pthread condition variable brr..).
> > For the semaphore you always have to assure that the ups and downs are
> > balanced, what is not the case
> > for the condition variable.
> >
>
> also remember pthread_cond_broadcast - waking up _all_ waiting threads.
> If the woken up threads check their condition and go to sleep again, is
> up to them ( read: the standard mandates that _all_ get woken up)
>
> pthread_cond_signal notifies _one_ thread - which one depends on
> implementation ( I would like to see a priority based decision )

The solution I was referring to before, using full semaphores, would
look like so:

typedef struct
{
        int num_waiting;
        struct futex wait, ack;
} pthread_cond_t;

#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }

int pthread_cond_signal(pthread_cond_t *cond)
{
        if (cond->num_waiting)
                return futex_up(&cond->wait, 1);
        return 0;
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
        unsigned int waiters = cond->num_waiting;

        if (waiters) {
                futex_up(&cond->wait, waiters);
                /* Wait for ack before returning. */
                futex_down(&cond->ack);
        }
        return 0;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
        int ret;

        /* Increment first so broadcaster knows we are waiting. */
        atomic_inc(&cond->num_waiting);
        futex_up(&mutex->futex, 1);
        ret = futex_down(&cond->wait);
        if (atomic_dec_and_test(&cond->num_waiting))
                futex_up(&cond->ack, 1);
        futex_down(&mutex->futex);
        return ret;
}

Hope that clarifies,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-16 11:25:18

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)



Rusty Russell wrote:

>
>The solution I was referring to before, using full semaphores, would
>look like so:
>
>struct pthread_cond_t
>{
> int num_waiting;
> struct futex wait, ack;
>};
>
>#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }
>
>int pthread_cond_signal(pthread_cond_t *cond)
>{
> if (cond->num_waiters)
> return futex_up(&cond->futex, 1);
> return 0;
>}
>
>int pthread_cond_broadcast(pthread_cond_t *cond)
>{
> unsigned int waiters = cond->num_waiting;
>
> if (waiters) {
> futex_up(&cond->futex, waiters);
> /* Wait for ack before returning. */
> futex_down(&cond->ack);
> }
> return 0;
>}
>
>int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
>{
> int ret;
>
> /* Increment first so broadcaster knows we are waiting. */
> atomic_inc(cond->num_waiting);
> futex_up(&mutex, 1);
> ret = futex_down(&cond);
> if (atomic_dec_and_test(cond->num_waiting))
> futex_up(&cond->ack);
> futex_down(&mutex->futex);
> return ret;
>}
>
In principle that works. But one of the things that's less nice with
pthread_cond_wait is that you sometimes have a (most of the time
unnecessary) schedule ping-pong, and with the approach above you always
have this (due to the ack). And secondly, if futex_up(&f, N) for N > 1
relies on the chained wakeup in the kernel's futex_up routine, the
broadcast may take a while to complete (the lowest-priority waiter
penalizes all others queued behind him). A semaphore simply is no full
replacement for a waitqueue with wake_all.

Martin

P.S. With respect to pthreads I was not thinking of a bloated N:M
library, but of a simple, fast, pthread-semantics-compatible wrapper
for _clone etc.

2002-03-16 19:55:51

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:
>
> In message <[email protected]> you write:
> > Martin Wirth wrote:
> > > Rusty Russell wrote:
> > >
> > >>
> > >> Discussions with Ulrich have reaffirmed my opinion that pthreads are
> > >> crap. Hence I'm not all that tempted to warp the (nice, clean,
> > >> usable) futex code too far to meet pthreads' wierd needs.
> > >>
> > > Crap or not, there are tons of software based on pthreads and at least
> > > the NGPT team says that Linus
> > > agreed to implement for necessary kernel-infrastructure for a full, fast
> > > pthread implementation.
>
> Let me clarify my "pthreads are crap" statement.
>
> I firmly believe that there is a place for clone & futex threaded
> programs, which is not met by pthreads for cleanliness and performance
> reasons, and that such programs will become increasingly important.
>
> Therefore I refuse to penalise such progressive programs so we can
> have standards compliance. Hence my insistance on a clean, minimal,
> useful interface.
>
> > >> However, it's not too hard to implement condition variables using an
> > >> unavailable mutex, if we go for "full" semaphores: ie. not just
> > >> mutexes. It requires a bit more of a stretch for kernel atomic ops...
> > >>
> > > A full semaphore is nice, but not a full replacement for a waitqueue (or
> > > a pthread condition variable brr..).
> > > For the semaphore you always have to assure that the ups and downs are
> > > balanced, what is not the case
> > > for the condition variable.
> > >
> >
> > also remember pthread_cond_broadcast - waking up _all_ waiting threads.
> > If the woken up threads check their condition and go to sleep again, is
> > up to them ( read: the standard mandates that _all_ get woken up)
> >
> > pthread_cond_signal notifies _one_ thread - which one depends on
> > implementation ( I would like to see a priority based decision )
>

I have to correct myself. Here is SUSv2's description of
pthread_cond_signal/broadcast:

---snip---
These two functions are used to unblock threads blocked on a
condition variable.

The pthread_cond_signal() call unblocks at least one of the threads
that are blocked on the specified condition variable cond (if any
threads are blocked on cond).

The pthread_cond_broadcast() call unblocks all threads currently
blocked on the specified condition variable cond.

If more than one thread is blocked on a condition variable, the
scheduling policy determines the order in which threads are unblocked.
When each thread unblocked as a result of a pthread_cond_signal() or
pthread_cond_broadcast() returns from its call to pthread_cond_wait()
or pthread_cond_timedwait(), the thread owns the mutex with which it
called pthread_cond_wait() or pthread_cond_timedwait(). The thread(s)
that are unblocked contend for the mutex according to the scheduling
policy (if applicable), and as if each had called pthread_mutex_lock().

The pthread_cond_signal() or pthread_cond_broadcast() functions may
be called by a thread whether or not it currently owns the mutex that
threads calling pthread_cond_wait() or pthread_cond_timedwait() have
associated with the condition variable during their waits; however, if
predictable scheduling behaviour is required, then that mutex is locked
by the thread calling pthread_cond_signal() or pthread_cond_broadcast().

The pthread_cond_signal() and pthread_cond_broadcast() functions have
no effect if there are no threads currently blocked on cond.

RETURN VALUE

If successful, the pthread_cond_signal() and pthread_cond_broadcast()
functions return zero. Otherwise, an error number is returned to
indicate the error.

ERRORS

The pthread_cond_signal() and pthread_cond_broadcast() function
may fail if:

[EINVAL]
The value cond does not refer to an initialised condition variable.

These functions will not return an error code of [EINTR].
--snip---

So the semantics of a condvar imply that, when returning from a
"successful" wait [i.e. not ETIMEDOUT], the thread owns the mutex.
Therefore the scheduler _should_ only wake up the highest-priority
waiting thread - it does not matter if we signal or broadcast!
It's the same operation in effect. Perhaps broadcast is there
for implementations that wouldn't queue up (or wake up) the waiters
in priority order?

For this, I think, kernel support is best, since the waiters get
woken up in priority order giving wake_one semantics.

For pthread_cond_timedwait() a kernel timer is necessary.
So making the signal/broadcast a syscall that does NOT lead to
a context switch would be beneficial. At the next scheduling point
the kernel decides whom to wake up, also checking for timed waiters
to return with ETIMEDOUT.

Then there is the issue of a crashing process holding locks.
I think on IRIX the waiters get a trap causing them to die - what
else could one do?

[after writing and deleting some pseudo code]

I think the condvars are best implemented in shmem + kernel semaphores.
The only issue is pthread_cond_timedwait - but a semaphore op IS
interruptible. Besides alarm(2)/setitimer(2), what other timeout
mechanisms are there in Linux?

In QNX Neutrino you have TimerTimeout() to arm a kernel timeout
for any blocking state (to avoid the window in alarm/timer_settime
and the blocking function call)

Without this you can hardly implement a reliable timeout (with
sub-second resolution) for pthread_cond_timedwait or sigtimedwait.

So for PTHREAD_PROCESS_PRIVATE one could use futexes - for
PTHREAD_PROCESS_SHARED it has to reside in the kernel anyway, and you
naturally have to live with context switches and a performance hit.

Now the problem with the N:M threading model: here it's necessary to
prevent the kernel from blocking the whole process.
Well, what is pthread_setconcurrency(int new_level) for?


And, what is so bad about condvars? How would you implement a
typical consumer/producer problem?

To cite Stevens (Vol II, page 165):
... "mutexes are for locking and cannot be used for waiting."
page 167: "A mutex is for locking and a condition variable is for
waiting. ... and both are needed"
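
For concreteness, the textbook consumer/producer shape - standard
pthreads, nothing futex-specific, with a bare counter standing in for
a real queue:

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static int items; /* protected by lock */

void produce(void)
{
        pthread_mutex_lock(&lock);
        items++;
        pthread_cond_signal(&nonempty);  /* waiter re-checks the predicate */
        pthread_mutex_unlock(&lock);
}

void consume(void)
{
        pthread_mutex_lock(&lock);
        while (items == 0)               /* loop guards against spurious wakeups */
                pthread_cond_wait(&nonempty, &lock);
        items--;
        pthread_mutex_unlock(&lock);
}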

--
-----------------------------------------------------------------------
Peter Waechtler

2002-03-18 00:33:32

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Sat, 16 Mar 2002 12:23:26 +0100
Martin Wirth <[email protected]> wrote:
> Rusty Russell wrote:
> >The solution I was referring to before, using full semaphores, would
> >look like so:

[snip]

> In principle that works. But one of things that's less nice with
> pthread_cond_wait is
> that you sometimes have a (most of the time) unnecessary schedule
> ping-pong, and with the
> approach above you always have this (due to ack).

Only vs. pthread_cond_broadcast. And if you're using that you probably
have some other performance issues anyway?

> And secondly if
> futex_up(&f, N) for N > 1
> relies on the chained wakeup in the kernels futex_up routine the
> broadcast may take a while to
> complete (the lowest priority waiter penalizes all others queued behind
> him). A semaphore simply is no full replacement for a waitqueue with
> wake_all.

Yes, we could have a "wake N" variant, which would be more efficient here.
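
Sketched on the kernel side, such a variant might look like this (a
sketch only: it assumes each queued futex_q remembers the page and
offset it hashed under, as the matching in wake_one_waiter() requires;
those two members are written out here purely for illustration):

static int wake_n_waiters(struct list_head *head, struct page *page,
                          int offset, int n)
{
        struct list_head *i;
        int woken = 0;

        spin_lock(&futex_lock);
        list_for_each(i, head) {
                struct futex_q *this = list_entry(i, struct futex_q, list);

                if (this->page == page && this->offset == offset) {
                        wake_up_process(this->task);
                        if (++woken == n)
                                break;
                }
        }
        spin_unlock(&futex_lock);
        return woken;
}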

Hope that clarifies,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-19 03:26:55

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On 17 Mar 2002 16:52:00 -0800
Ulrich Drepper <[email protected]> wrote:

> On Sat, 2002-03-16 at 22:50, Rusty Russell wrote:
>
> > Only vs. pthread_cond_broadcast.
>
> No. pthread_barrier_wait has the same problem. It has to wake up lots
> of thread.

Hmmm....

What do you WANT in a kernel primitive then? Given that we now have mutexes,
what else do we need to make pthreads relatively painless?

> > And if you're using that you probably
> > have some other performance issues anyway?
>
> Why? Conditional variables are of use in situations with loosely
> coupled threads.

I meant vs. pthread_cond_signal.

Look, here is an example implementation. Please suggest:
1) Where this is flawed,
2) Where this is suboptimal,
3) What kernel primitive would help to resolve these?

Thanks,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

/* Assume we have the following semaphore operations:

   futex_down(futex);
   futex_down_time(futex, relative timeout);
   futex_up(futex, count);
*/
typedef struct
{
        struct futex futex;
} pthread_mutex_t;

typedef struct
{
        int num_waiting;
        struct futex wait, ack;
} pthread_cond_t;

typedef struct
{
        unsigned int num_left;
        struct futex wait;
        unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER { { 1 } }
#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }

int pthread_barrier_init(pthread_barrier_t *barrier,
                         void *addr,
                         unsigned int count)
{
        barrier->num_left = barrier->initial_count = count;
        barrier->wait.count = 0;
        return 0;
}

int pthread_barrier_wait(pthread_barrier_t *barrier)
{
        if (atomic_dec_and_test(&barrier->num_left)) {
                /* Restore barrier. */
                barrier->num_left = barrier->initial_count;
                /* Wake the other threads */
                futex_up(&barrier->wait, barrier->initial_count-1);
                return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
        }
        while (futex_down(&barrier->wait) < 0 && errno == EINTR);
        return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
        if (cond->num_waiting)
                return futex_up(&cond->wait, 1);
        return 0;
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
        unsigned int waiters = cond->num_waiting;

        if (waiters) {
                /* Re-initialize ACK. Could have been upped by
                   pthread_cond_signal and pthread_cond_wait. */
                cond->ack.count = 0;
                futex_up(&cond->wait, waiters);
                /* Wait for ack before returning. */
                futex_down(&cond->ack);
        }
        return 0;
}

static int __pthread_cond_wait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *reltime)
{
        int ret;

        /* Increment first so broadcaster knows we are waiting. */
        atomic_inc(&cond->num_waiting);
        futex_up(&mutex->futex, 1);
        do {
                ret = futex_down_time(&cond->wait, reltime);
        } while (ret < 0 && errno == EINTR);
        if (atomic_dec_and_test(&cond->num_waiting))
                futex_up(&cond->ack, 1);
        futex_down(&mutex->futex);
        return ret;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
        return __pthread_cond_wait(cond, mutex, NULL);
}

int pthread_cond_timedwait(pthread_cond_t *cond,
                           pthread_mutex_t *mutex,
                           const struct timespec *abstime)
{
        struct timeval _now;
        struct timespec now, rel;

        /* Absolute to relative */
        gettimeofday(&_now, NULL);
        TIMEVAL_TO_TIMESPEC(&_now, &now);
        if (now.tv_sec > abstime->tv_sec
            || (now.tv_sec == abstime->tv_sec
                && now.tv_nsec > abstime->tv_nsec))
                return ETIMEDOUT;

        rel.tv_sec = abstime->tv_sec - now.tv_sec;
        rel.tv_nsec = abstime->tv_nsec - now.tv_nsec;
        if (rel.tv_nsec < 0) {
                --rel.tv_sec;
                rel.tv_nsec += 1000000000;
        }
        return __pthread_cond_wait(cond, mutex, &rel);
}

2002-03-19 04:05:51

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:

> What do you WANT in a kernel primitive then? Given that we now have mutexes,
> what else do we need to make pthreads relatively painless?

I think w.r.t. the mutexes only wake-all is missing. I don't think
that semaphore semantics are needed in the kernel.


> Look, here is an example implementation. Please suggest:
> 1) Where this is flawed,
> 2) Where this is suboptimal,
> 3) What kernel primitive would help to resolve these?

I'll look at this a bit later.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------



2002-03-19 08:36:43

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:


> 1) Where this is flawed,


I. There is a race in __pthread_cond_wait between timeout and a
cond_signal or broadcast. If the signal comes in between these two lines:

        } while (ret < 0 && errno == EINTR);
        >>>>> we leave with errno==ETIMEDOUT, and signal or broadcast
        >>>>> gets called right here
        if (atomic_dec_and_test(&cond->num_waiting))

then you up cond->wait one time too often, leaving it in an invalid state.

II. Your implementation relies on the fact that the signal or broadcast
caller owns the mutex used in cond_wait. According to the POSIX spec
this need not be the case. The only thing that may then happen is that
you miss a wakeup; but it is not allowed to screw up the internal state
of the condition variable, which might well happen in your
implementation. (Note: calling cond_signal without holding the mutex is
not necessarily flawed software. Think of a periodically occurring
new_data or data_changed flag where it is not really important to sleep
race-free.)

III. Minor nit: you should also clear cond->ack.count
in cond_signal, otherwise it may wrap around soon (at least for a
24-bit atomic variable) if you mostly use cond_signal.


> 2) Where this is suboptimal,


As said in a previous e-mail, you need a futex_up(.., n) that
really wakes up n threads at once.




> 3) What kernel primitive would help to resolve these?

Your exported waitqueues or my suggestion for a second waitqueue
associated with a futex.


Martin

2002-03-20 06:18:01

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On 18 Mar 2002 20:05:22 -0800
Ulrich Drepper <[email protected]> wrote:

> On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:
>
> > What do you WANT in a kernel primitive then? Given that we now have mutexes,
> > what else do we need to make pthreads relatively painless?
>
> I think wrt to the mutexes only wake-all is missing. I don't think that
> semaphore semantic is needed in the kernel.

And have all the waiters exit with errno = EAGAIN? ("You didn't get it,
but someone wanted you all woken.")

It can be done; I'd have to see if it is sufficient to implement pthreads.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-20 10:43:40

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:

> On 18 Mar 2002 20:05:22 -0800
> Ulrich Drepper <[email protected]> wrote:
>
>
>>On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:
>>
>>
>>>What do you WANT in a kernel primitive then? Given that we now have mutexes,
>>>what else do we need to make pthreads relatively painless?
>>>
>>I think wrt to the mutexes only wake-all is missing. I don't think that
>>semaphore semantic is needed in the kernel.
>>
>
> And have all the waiters exit with errno = EAGAIN? ("You didn't get it but
> someone wanted you all woken").
>

That's a joke, isn't it? ;-)

We can return -1 with errno=ELOCKBREAK if the process holding the lock
dies.

Why don't we need a semaphore in the kernel?
There is sem_open() for POSIX named semaphores. The SysV interface
is there - and it's far more ugly.

in linuxthreads we have:

sem_t *sem_open(const char *name, int oflag, ...)
{
        __set_errno (ENOSYS);
        return SEM_FAILED;
}

Well, mutexes are not semaphores. Why not have a fusema or fuma?
They are good for thread pools where you want several workers.


> It can be done, I'd have to see if is sufficient to implement pthreads.
>

I don't see the necessity for wake_all except for the barrier case.

A condvar implies that, when successfully returning, the thread
owns the mutex. So even in the case of cond_broadcast, only one thread
will get the mutex - the others will get blocked on it.

The only reason for cond_broadcast I can think of is implementations
that wouldn't wake up the highest-priority waiting thread. So with
a broadcast all will be woken up and try to acquire the lock.

The waiters will NOT return without the mutex locked, except on ETIMEDOUT
(when the thread gets cancelled, the cancellation handler will be called
and the thread will die).

It's also written in the spec that when you want "predictable scheduling",
the caller of signal/broadcast _has_to_ hold the mutex. On release of
the mutex all other threads will race for it.

2002-03-20 06:42:40

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> Rusty Russell wrote:
>
>
> > 1) Where this is flawed,
>
>
> I. There is a race in __pthread_cond_wait between timeout and a
> cond_signal or broadcast. If the signal comes in
>
> } while (ret < 0 && errno == EINTR);
> >>>>> we leave with errno==ETIMEDOUT and get signal or broadcast called
> here
> if (atomic_dec_and_test(cond->num_waiting))
>
> then you up cond->wait one time to often, leaving it in an invalid state.

Hmmm... this is true.

> II. Your implementation relies on the fact that the signal or broadcast
> caller owns the mutex used in cond_wait. According to the POSIX spec
> this need not be the case. The only thing that may happen is that you
> miss a wakeup. But it is not allowed to screw up the internal state of
> of the condition variable, which might well happen in your
> implementation. (Note: Calling cond_signal without holding the mutex is
> not necessarily flawed software. Think of a periodically occurring
> new_data or data_changed flag where it is not really important to sleep
> race free)

I hadn't appreciated this. That makes it harder. I think I have to
abandon the atomics and use a mutex inside the condition variable.

> III. Minor nit: You should also clear cond->ack.count
> in cond_signal otherwise it may wrap around soon (at least for a
> 24-bit atomic variable) if you mostly use cond_signal.

Yep.

> > 2) Where this is suboptimal,
>
>
> As said in a previous e-mail, you need an futex_up(..,n) that
> really wakes_up n thread at once.

OK, we could read the value in the kernel's up() and wake that many.

> > 3) What kernel primitive would help to resolve these?
>
> Your exported waitqueues or my suggestion for a second waitqueue
> associated with a futex.

Any chance of a rough patch (to the code below, at least)?

Thanks!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

--- non-pthreads.c.19-March-2002 Wed Mar 20 17:37:17 2002
+++ non-pthreads.c Wed Mar 20 17:43:42 2002
@@ -11,6 +11,7 @@

typedef struct
{
+ struct futex lock;
int num_waiting;
struct futex wait, ack;
} pthread_cond_t;
@@ -48,23 +49,29 @@

int pthread_cond_signal(pthread_cond_t *cond)
{
+ futex_down(&cond->lock);
+ /* Reset this so it doesn't overflow */
+ cond->ack.count = 0;
if (cond->num_waiters)
return futex_up(&cond->futex, 1);
+ futex_up(&cond->lock, 1);
return 0;
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
- unsigned int waiters = cond->num_waiting;
-
- if (waiters) {
- /* Re-initialize ACK. Could have been upped by
- pthread_cond_signal and pthread_cond_wait. */
+ futex_down(&cond->lock);
+ if (cond->num_waiting) {
cond->ack.count = 0;
+ /* Release the waiters. */
futex_up(&cond->futex, waiters);
/* Wait for ack before returning. */
futex_down(&cond->ack);
+ /* Reset wait, in case someone who was waiting timed
+ out and didn't decrement. */
+ cond->wait.count = 0;
}
+ futex_up(&cond->lock);
return 0;
}

@@ -75,8 +82,10 @@
int ret;

/* Increment first so broadcaster knows we are waiting. */
+ futex_down(&cond->lock);
atomic_inc(cond->num_waiting);
futex_up(&mutex, 1);
+ futex_up(&cond->lock, 1);
do {
ret = futex_down_time(&cond, reltime);
} while (ret < 0 && errno == EINTR);

2002-03-20 17:21:24

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Wed, 2002-03-20 at 02:42, Peter Wächtler wrote:

> Well, mutexes are not semaphores. Why not have fusema or fuma?
> They are good for thread pools where you want several workers.

I have absolutely no objections whatsoever to get semaphore support in
the kernel. It would be much more efficient in some situations and the
basic infrastructures is already there. If you can convince the powers
there are to accept such a patch I'm all for it.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------



2002-03-21 06:49:49

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:


>2) Where this is suboptimal,

Up to now I was too focused on the wait functions, but there is
also a problem with cond_broadcast (and the mutex-protected version of
cond_signal): since they may block (on ack or lock), this opens up
chances for priority-inversion-like problems. I think to be really
useful, cond_broadcast and cond_signal should have non-blocking
behaviour with predictable runtime.

Just to convince you that this is a real-world problem, here is a
description of one of my data-acquisition programs:

A 'producer' thread waits for the trigger of a transient recorder at a
500 Hz IRQ rate, reads out 64k on each event into a large circular
buffer, calls cond_broadcast (every 5th IRQ) without holding a mutex,
and goes to sleep to wait for the next IRQ. (This thread is SCHED_FIFO.)

Then there are three (SCHED_OTHER) 'consumer' threads which work on the
same data doing different things of different importance (group the
events according to some hardware parameter and store them into
different files, calculate averaged power spectra, select pieces for an
online scope-like display, etc.).

If in this scenario the producer had to wait in cond_broadcast
until the lowest-priority consumer had acknowledged (which may take a
timer tick or longer), it would lose several IRQs each time.

So for my applications a cond_broadcast blocking for the waiters is
simply not acceptable.

Martin

2002-03-24 18:26:11

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Martin Wirth wrote:
>
> Rusty Russell wrote:
>
> >2) Where this is suboptimal,
>
> Up to know I was too focused on the wait functions, but there is
> also a problem with cond_broadcast (and the mutex-protected version of
> cond_signal): since they may block (on ack or lock) this opens up
> chances for priority inversion like problems. I think to be really
> usefull cond_broacast and cond_signal should have a non-blocking
> behaviour with predictible runtime.
>
[real world example deleted]
> So for my applications a cond_broadcast blocking for the waiters is
> simply not acceptable.
>
Exactly.
I can't see a reason why the ack-futex is needed. I think we can simply
delete it.
When deleted, the broadcast wouldn't block on ack (also preventing
schedule ping-pong). With the cond->lock it's safe to have several
broadcasters. That's fine.
But:
static int __pthread_cond_wait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *reltime)
{
        int ret;

        /* Increment first so broadcaster knows we are waiting. */
        futex_down(&cond->lock);
        atomic_inc(&cond->num_waiting);
(*)     futex_up(&mutex->futex, 1);
a)      futex_up(&cond->lock, 1);              [move into syscall]
        do {
b)              ret = futex_down_time(&cond->wait, ABSTIME);  [cond_timed_wait]
        } while (ret < 0 && errno == EINTR);
        [futex_up(&cond->lock, 1); /* release condvar */]

        futex_down(&mutex->futex);
        return ret;
}

With the original code, we have a "signal/broadcast lost window (a->b)"
that shouldn't be there:

SUSV2 on pthread_cond_[timed]wait:
These functions atomically release mutex(*) and cause the calling
thread to block on the condition variable cond; atomically here means
"atomically with respect to access by another thread to the mutex and
then the condition variable". That is, if another thread is able to
acquire the mutex after the about-to-block thread has released it, then
a subsequent call to pthread_cond_signal() or pthread_cond_broadcast()
in that thread behaves as if it were issued after the about-to-block
thread has blocked.


So we would need to enhance the futex_down_timed() call to
atomically release the cond->lock on entry, re-acquiring it on exit
(because of the loop).
This boils down to a cond_var syscall, to me (wouldn't sys_ulock(,,OP)
be a better name? with OPs like MUTEX_UP, MUTEX_DOWN, SEMA_UPn, SEMA_DOWNn,
COND_WAIT, COND_TIMED_WAIT, COND_SIGNAL, COND_BROADCAST, RWLOCK_WRLOCK,
RWLOCK_RDLOCK, RWLOCK_UNLOCK).
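
Only to illustrate the shape such a multiplexed call might take (none
of these names exist anywhere yet):

enum ulock_op {
        MUTEX_UP, MUTEX_DOWN,
        SEMA_UPn, SEMA_DOWNn,
        COND_WAIT, COND_TIMED_WAIT, COND_SIGNAL, COND_BROADCAST,
        RWLOCK_WRLOCK, RWLOCK_RDLOCK, RWLOCK_UNLOCK,
};

long sys_ulock(void *uaddr, enum ulock_op op, const struct timespec *timeout);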

Also note that we have to recalculate the relative time to sleep when
signalled - or just use an absolute time stamp.
If the syscall is interruptible, we open the "signal/broadcast lost
window (a->b)" again... hmh. (Here queued-up RT signals are much better
for handling the wakeup, because you can block them, and they don't get
lost.)

Alternatively, when using the uwaitq: it could use a lock to serialize
an add/wait and a possibly parallel wake operation (but with the above
locks you can achieve exactly this).

--
-----------------------------------------------------------------------
Peter Waechtler

2002-03-25 02:25:57

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> I can't see a reason why the ack-futex is needed. I think we can simply
> delete it.
> When deleted, the broadcast wouldn't block on ack (also preventing
> schedule ping-pong). With the cond->lock it's save to have several
> broadcasters. That's fine.

No, you might end up waking someone who did the pthread_cond_wait()
after you did the pthread_cond_broadcast in place of one of the
existing pthread_cond_wait() threads.

I don't believe this is allowed.

> But:
> static int __pthread_cond_wait(pthread_cond_t *cond,
> pthread_mutex_t *mutex,
> const struct timespec *reltime)
> {
> int ret;
>
> /* Increment first so broadcaster knows we are waiting. */
> futex_down(&cond->lock);
> atomic_inc(cond->num_waiting);
> (*) futex_up(&mutex, 1);
> a) futex_up(&cond->lock, 1); [move into syscall]
> do {
> b) ret = futex_down_time(&cond, ABSTIME); [cond_timed_wait]
> } while (ret < 0 && errno == EINTR);
> [futex_up(&cond->lock, 1); /* release condvar */]
>
> futex_down(&mutex->futex);
> return ret;
> }
>
> With the original code, we have a "signal/broadcast lost window (a->b)"
> that shouldn't be there:

Where? Having done the inc, the futex_up at (a) will fall through,
giving the "thread behaves as if it [signal or broadcast] were issued
after the about-to-block thread has blocked."

> So we would need to enhance the futex_down_timed() call, to
> atomically release the cond->lock on entry, re-aquiring on exit (because
> of the loop).
> This boils down to a cond_var syscall to me (wouldn't sys_ulock(,,OP)
> a better name ? with OPs like MUTEX_UP,MUTEX_DOWN, SEMA_UPn, SEMA_DOWNn,
> COND_WAIT, COND_TIMED_WAIT, COND_SIGNAL, COND_BROADCAST, RWLOCK_WRLOCK,
> RWLOCK_RDLOCK,RWLOCK_UNLOCK)

You're talking about a completely different beast.

So the summary is: futexes not sufficient to implement pthreads
locking. That's fine; I can drop all the "extension for pthreads"
futex patches and leave the code as-is ('cept the UP_FAIR patch, which
is independent of this debate).

Thanks!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-25 04:43:59

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Mon, 25 Mar 2002 13:28:44 +1100
Rusty Russell <[email protected]> wrote:
> So the summary is: futexes not sufficient to implement pthreads
> locking.

So let's go back to the more generic "exporting waitqueues to userspace"
idea, with a twist: we use a userspace address as the identifier on the
waitq, which gives us a unique identifier, but the kernel never actually
derefs the pointer. (And in my prior kernel code I optimized it so that
waking did an implicit remove; I'm not sure it's a win, so assume that's
removed here.)

This gives code as below (Peter, Martin, please check):

/* Assume we have the following operations:

uwaitq_add(void *uaddr);
uwaitq_remove(void *uaddr);
uwaitq_wake(void *uaddr, int wake_all_flag);
uwaitq_wait(relative timeout);
*/
typedef struct
{
        int counter;
} pthread_mutex_t;

typedef struct
{
        int condition;
} pthread_cond_t;

typedef struct
{
        unsigned int num_left;
        unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER { 1 }
#define PTHREAD_COND_INITIALIZER { 0 }

int pthread_barrier_init(pthread_barrier_t *barrier,
                         void *addr,
                         unsigned int count)
{
        barrier->num_left = barrier->initial_count = count;
        return 0;
}

int pthread_barrier_wait(pthread_barrier_t *barrier)
{
        /* Use barrier address as uwaitq id. */
        uwaitq_add(barrier);
        if (atomic_dec_and_test(&barrier->num_left)) {
                /* Restore barrier. */
                barrier->num_left = barrier->initial_count;
                /* Wake the other threads */
                uwaitq_wake(barrier, 1 /* WAKE_ALL */);
                uwaitq_remove(barrier);
                return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
        }
        while (uwaitq_wait(NULL) < 0 && errno == EINTR);
        uwaitq_remove(barrier);
        return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
        return uwaitq_wake(cond, 0 /* WAKE_ONE */);
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
        return uwaitq_wake(cond, 1 /* WAKE_ALL */);
}

static int __pthread_cond_wait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *reltime)
{
        int ret;

        uwaitq_add(cond);
        futex_up(mutex, 1);
        while ((ret = uwaitq_wait(reltime)) < 0 && errno == EINTR);
        uwaitq_remove(cond);
        futex_down(mutex, NULL);
        return ret;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
        return __pthread_cond_wait(cond, mutex, NULL);
}

int pthread_cond_timedwait(pthread_cond_t *cond,
                           pthread_mutex_t *mutex,
                           const struct timespec *abstime)
{
        struct timeval _now;
        struct timespec now, rel;

        /* Absolute to relative */
        gettimeofday(&_now, NULL);
        TIMEVAL_TO_TIMESPEC(&_now, &now);
        if (now.tv_sec > abstime->tv_sec
            || (now.tv_sec == abstime->tv_sec
                && now.tv_nsec > abstime->tv_nsec))
                return ETIMEDOUT;

        rel.tv_sec = abstime->tv_sec - now.tv_sec;
        rel.tv_nsec = abstime->tv_nsec - now.tv_nsec;
        if (rel.tv_nsec < 0) {
                --rel.tv_sec;
                rel.tv_nsec += 1000000000;
        }
        return __pthread_cond_wait(cond, mutex, &rel);
}

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-25 09:48:29

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:

> In message <[email protected]> you write:
>
>>I can't see a reason why the ack-futex is needed. I think we can simply
>>delete it.
>>When deleted, the broadcast wouldn't block on ack (also preventing
>>schedule ping-pong). With the cond->lock it's save to have several
>>broadcasters. That's fine.
>>
>
> No, you might end up waking someone who did the pthread_cond_wait()
> after you did the pthread_cond_broadcast in place of one of the
> existing pthread_cond_wait() threads.
>
> I don't believe this is allowed.
>


Indeed, I suspect that this isn't wanted.
With the cond->lock you almost prevent this: an ongoing broadcast
can't be intermixed with newly incoming waiters (they will block
on futex_down(&cond->lock))
But there is the window between a->b...


>
>>But:
>>static int __pthread_cond_wait(pthread_cond_t *cond,
>> pthread_mutex_t *mutex,
>> const struct timespec *reltime)
>>{
>> int ret;
>>
>> /* Increment first so broadcaster knows we are waiting. */
>> futex_down(&cond->lock);
>> atomic_inc(cond->num_waiting);
>>(*) futex_up(&mutex, 1);
>>a) futex_up(&cond->lock, 1); [move into syscall]
>> do {
>>b) ret = futex_down_time(&cond, ABSTIME); [cond_timed_wait]
>> } while (ret < 0 && errno == EINTR);
>> [futex_up(&cond->lock, 1); /* release condvar */]
>>
>> futex_down(&mutex->futex);
>> return ret;
>>}
>>
>>With the original code, we have a "signal/broadcast lost window (a->b)"
>>that shouldn't be there:
>>
>
> Where? Having done the inc, the futex_up at (a) will fall through,
> giving the "thread behaves as if it [signal or broadcast] were issued
> after the about-to-block thread has blocked."
>
Right after (a), another thread gets scheduled, issuing a signal/broadcast.

Ah, and then the futex_down_timed() wouldn't block, OK ;-)
But this way you have to use the ack->lock.



I strongly believe that the implementation of a condvar needs a lock
to prevent intermixed calls. You will see my comment on your
implementation with uwaitq. ;-)

2002-03-25 11:56:52

by Peter Wächtler

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

Rusty Russell wrote:

> On Mon, 25 Mar 2002 13:28:44 +1100
> Rusty Russell <[email protected]> wrote:
>
>>So the summary is: futexes not sufficient to implement pthreads
>>locking.
>>
>
> So let's go back to the more generic "exporting waitqueues to userspace" idea,
> with a twist: we use a userspace address as the identifier on the waitq, which
> gives us a unique identifier, but the kernel never actually derefs the
> pointer. (And in my prior kernel code I optimized it so that waking did an
> implicit remove; not sure it's a win, so assumed that was removed here).
>
> This gives code as below (Peter, Martin, please check):
>
> /* Assume we have the following operations:
>
> uwaitq_add(void *uaddr);
> uwaitq_remove(void *uaddr);
> uwaitq_wake(void *uaddr, int wake_all_flag);
> uwaitq_wait(relative timeout);
> */
> typedef struct
> {
> int counter;
> } pthread_mutex_t;
>
> typedef struct
> {
> int condition;
> } pthread_cond_t;
>

> int pthread_cond_signal(pthread_cond_t *cond)
> {
> return uwaitq_wake(cond, 0 /* WAKE_ONE */);
> }
>
> int pthread_cond_broadcast(pthread_cond_t *cond)
> {
> return uwaitq_wake(cond, 1 /* WAKE_ALL */);
> }
>
> static int __pthread_cond_wait(pthread_cond_t *cond,
> pthread_mutex_t *mutex,
> const struct timespec *reltime)
> {
> int ret;
>
> uwaitq_add(cond);
> futex_up(&mutex, 1);


Here another thread gets scheduled, calling signal/broadcast.
Since the former thread is already on the queue -> well done ;-)


> while ((ret = uwaitq_wait(reltime)) == 0 || errno == EINTR);
> uwaitq_remove(cond);
> futex_down(&mutex, NULL);
> return ret;
> }


I assume that uwaitq_wait() will modify the reltime (which is legal)
if signalled. Otherwise we would wait 2*reltime, and so on.

Then we have to be careful about errno and the return values:

static int __pthread_cond_wait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *reltime)
{
        int ret;

!       if (uwaitq_add(cond) == -1)
+               return -1;
!       if (futex_up(&mutex, 1) == -1){
+               uwaitq_remove(cond);
+               return -1;
+       }
!       while ((ret = uwaitq_wait(cond,reltime)) == -1 && errno == EINTR);
+       saverrno = errno;
        uwaitq_remove(cond);
        futex_down(&mutex, NULL);
+       if (ret == -1){
+               if (saverrno == ENOENT)
+                       return 0; /* there was a sig/broadc before we went to sleep */
+               errno = saverrno;
+               return -1;
+       }
        return 0;
}

I assume that uwaitq_wait() will return -1 and errno==ENOENT or similar
if we are not on the queue (anymore), -1 and ETIMEDOUT on timeout,
-1 and EINVAL on an illegal cond or reltime, and zero on wakeup?


2002-03-26 00:59:20

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> > while ((ret = uwaitq_wait(reltime)) == 0 || errno == EINTR);
> > uwaitq_remove(cond);
> > futex_down(&mutex, NULL);
> > return ret;
> > }
>
>
> I assume that uwaitq_wait() will modify the reltime (which is legal)
> if signalled. Otherwise we would wait 2*retime, and so on

Oh yeah, I'll have to split that out and handle it. Not worth having
the kernel do it as it's hardly going to be a fast path...

> Then we have to be careful about errno and the return values:

Sigh. Yep. I was pretty lax...

> I assume that uwaitq_wait() will return -1 and errno==ENOENT or similar
> if we are not on the queue (anymore), -1 and ETIMEDOUT on timeout,
> -1 and EINVAL on illegal cond or reltime ,zero on wakeup?

A uwaitq_wait() which does not follow a uwaitq_add() would be some kind
of error, yes. ETIMEDOUT on timeout, yes. EFAULT on an illegal cond
(that's all it can tell: that the address isn't in the user's range),
and 0 on success.

OK, new tidier version...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

/* Assume we have the following operations:

uwaitq_add(void *uaddr);
uwaitq_remove(void *uaddr);
uwaitq_wake(void *uaddr, int wake_all_flag);
uwaitq_wait(relative timeout);

And on top of them:
futex_down(struct futex *);
futex_up(struct futex *);
*/
typedef struct futex pthread_mutex_t;

typedef struct
{
        int dummy; /* This could be an empty struct for gcc */
} pthread_cond_t;

typedef struct
{
        unsigned int num_left;
        unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER FUTEX_INITIALIZER
#define PTHREAD_COND_INITIALIZER { 0 }

int pthread_barrier_init(pthread_barrier_t *barrier,
                         void *addr,
                         unsigned int count)
{
        barrier->num_left = barrier->initial_count = count;
        return 0;
}

int pthread_barrier_wait(struct pthread_barrier_t *barrier)
{
int ret;

/* Use barrier address as uwaitq id. */
ret = uwaitq_add(barrier);
if (ret < 0)
return ret;

if (atomic_dec_and_test(&barrier->num_left)) {
/* Restore barrier. */
barrier->num_left = barrier->initial_count;
/* Wake the other threads */
uwaitq_wake(barrier, 1 /* WAKE_ALL */);
uwaitq_remove(barrier);
return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
}
while (uwaitq_wait(NULL) == 0 || errno == EINTR);
uwaitq_remove(barrier);
return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
        return uwaitq_wake(cond, 0 /* WAKE_ONE */);
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
        return uwaitq_wake(cond, 1 /* WAKE_ALL */);
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
        int ret, saved_errno;

        uwaitq_add(cond);
        futex_up(mutex);
        while ((ret = uwaitq_wait(NULL)) == 0 || errno == EINTR);
        saved_errno = errno;
        uwaitq_remove(cond);
        futex_down(mutex);
        errno = saved_errno;
        return ret;
}

int pthread_cond_timedwait(pthread_cond_t *cond,
                           pthread_mutex_t *mutex,
                           const struct timespec *abstime)
{
        struct timeval _now;
        struct timespec now, rel;
        int saved_errno, ret;

        /* Absolute to relative */
again:
        gettimeofday(&_now, NULL);
        TIMEVAL_TO_TIMESPEC(&_now, &now);
        if (now.tv_sec > abstime->tv_sec
            || (now.tv_sec == abstime->tv_sec
                && now.tv_nsec > abstime->tv_nsec))
                return ETIMEDOUT;

        rel.tv_sec = abstime->tv_sec - now.tv_sec;
        rel.tv_nsec = abstime->tv_nsec - now.tv_nsec;
        if (rel.tv_nsec < 0) {
                --rel.tv_sec;
                rel.tv_nsec += 1000000000;
        }

        uwaitq_add(cond);
        futex_up(mutex);
        ret = uwaitq_wait(&rel);
        if (ret < 0 && errno == EINTR)
                goto again;

        saved_errno = errno;
        uwaitq_remove(cond);
        futex_down(mutex);
        errno = saved_errno;

        return ret;
}
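
A minimal usage sketch of the wrappers above, assuming the types and
initializers from this listing; producer(), consumer() and the ready
flag are hypothetical, not part of the posted code:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready; /* protected by lock */

void consumer(void)
{
        futex_down(&lock);
        while (!ready)
                pthread_cond_wait(&cond, &lock);
        futex_up(&lock);
}

void producer(void)
{
        futex_down(&lock);
        ready = 1;
        pthread_cond_signal(&cond);
        futex_up(&lock);
}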

2002-03-26 08:19:25

by Martin Wirth

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)


>
> And on top of them:
> futex_down(struct futex *);
> futex_up(struct futex *);
>
Why not keep the simple one-syscall interface for the futexes? The
code is so small that it is not worth deleting.

>
>
>int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
>{
> int ret, saved_errno;
>
> uwaitq_add(cond);
> futex_up(mutex);
> while ((ret = uwaitq_wait(NULL)) == 0 || errno == EINTR);
> saved_errno = errno;
> uwaitq_remove(cond);
> futex_down(mutex);
>
You should loop here in order to catch signals:

while (futex_down(mutex) < 0 && errno == EINTR);

>
> if (ret < 0 && errno == EINTR)
> goto again;
>
This assumes that you are allowed to do a double uwaitq_add.

>
> saved_errno = errno;
> uwaitq_remove(cond);
> futex_down(mutex);
>
Also loop here

>
> errno = saved_errno;
>
> return ret;
>}
>
Now what's interesting is the kernel part. I must admit that I haven't
fully understood all effects of the double use of the cookie in your
first implementation. But if you use a memory location as the
identifier, you have to keep a separate flag within uwaitq_head that is
zeroed before you add to the waitqueue and set by the signal functions;
then uwaitq_wait has to check for it. This is necessary in order not to
lose a wakeup while you are on the queue but not yet sleeping.
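
A minimal sketch of that flag idea against 2.5-era kernel primitives;
struct uwaitq_q, the woken field and uwaitq_wait_sketch() are all
hypothetical names, not code from any posted patch:

#include <linux/sched.h>
#include <linux/list.h>

struct uwaitq_q {
        struct list_head list;
        struct task_struct *task;
        int woken;              /* cleared by uwaitq_add, set by wakers */
};

/* Waker side, under the queue lock:
        q->woken = 1;
        wake_up_process(q->task); */

static long uwaitq_wait_sketch(struct uwaitq_q *q, long timeout)
{
        set_current_state(TASK_INTERRUPTIBLE);
        /* Check the flag only after marking ourselves sleepy: a wake
           that ran between uwaitq_add() and here has already set
           q->woken, and one that runs after this point will also set
           us back to TASK_RUNNING, so nothing is lost. */
        if (!q->woken)
                timeout = schedule_timeout(timeout);
        set_current_state(TASK_RUNNING);
        return timeout;
}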

uwaitq_remove also takes an argument; are you heading for waiting on
multiple events?

Since you need to pin down the page between uwaitq_add and
uwaitq_remove, you will have to limit the number of simultaneous add
calls. Should this be configurable?

Cheers,
Martin


2002-03-26 23:07:29

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
>
> >
> > And on top of them:
> > futex_down(struct futex *);
> > futex_up(struct futex *);
> >
> Why not keep the simple one-syscall interface for the futexes? The
> code is so small that it is not worth deleting.

Because it's a premature optimization. We can add it later if someone
proves it's a major hotspot. But it may turn out some other primitive
is more important: I'm not smart enough to know.

> >int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
> >{
...
> > futex_down(mutex);
> >
> You should loop here in order to catch signals:
>
> while (futex_down(mutex) < 0 && errno == EINTR);

Yep.

> > if (ret < 0 && errno == EINTR)
> > goto again;
> >
> This assumes that you are allowed to do a double uwaitq_add.

Oops.... Fixed by moving the uwaitq_add above the again: label.

> >
> > saved_errno = errno;
> > uwaitq_remove(cond);
> > futex_down(mutex);
> >
> Also loop here

Yep.

> Now what's interesting is the kernel part. I must admit that I haven't
> fully understood all effects of the double use of the cookie in your
> first implementation. But if you use a memory location as the
> identifier, you have to keep a separate flag within uwaitq_head that
> is zeroed before you add to the waitqueue and set by the signal
> functions; then uwaitq_wait has to check for it.

Yes. My prior implementation actually deleted the sleeper from the
queue on wakeup, making it easy to detect and also saving a syscall.
I'm leaning towards the "flag" idea now though.

> uwaitq_remove also takes an argument; are you heading for waiting on
> multiple events?

Yes. The problem is: how many? Each one takes memory and (as you
point out) pinned pages (although this could be changed with some loss
of simplicity). An arbitrary limit smacks of SysV IPC: a sure sign of
bad taste. So for the moment, the limit is 1.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-27 21:05:50

by Hubertus Franke

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Tuesday 26 March 2002 06:10 pm, Rusty Russell wrote:
> In message <[email protected]> you write:
> > > And on top of them:
> > > futex_down(struct futex *);
> > > futex_up(struct futex *);
> >
> > Why not keep the simple one-sys-call interface for the fuxtexes. The
> > code is so small that it is
> > not worth to delete it.
>
>

Rusty, you lost me in all these discussions now.
Is the current position to export wait queues and drop the futex interface?
I would recommend against that. If we need two syscalls to implement
the futex behavior, that will certainly create quite some overhead.

From my own implementation, I exported the wait queues and I didn't need the
add/wait sequence. This, as you know, is/was due to the fact that I used
semaphores in the kernel. While that created some allocation problems and
won't allow for usage of the wait queues, it seems more compact.
Any chance to move the semaphore behavior into the futexes?



--
-- Hubertus Franke ([email protected])

2002-03-27 23:51:04

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

In message <[email protected]> you write:
> On Tuesday 26 March 2002 06:10 pm, Rusty Russell wrote:
> > In message <[email protected]> you write:
> > > > And on top of them:
> > > > futex_down(struct futex *);
> > > > futex_up(struct futex *);
> > >
> > > Why not keep the simple one-sys-call interface for the fuxtexes. The
> > > code is so small that it is
> > > not worth to delete it.
> >
> >
>
> Rusty, you lost me in all these discussions now.

I know the feeling 8)

> Is the current position to export wait queues and drop the futex interface?
> I would recommend against that. If we need two syscalls to implement
> the futex behavior, that will certainly create quite some overhead.

I'm still playing with the options. Two system calls in the slow path
is definitely slower, but it's not insanely slow, and it's becoming
fairly clear that the wider range of primitives is worthwhile.

Both approaches can coexist, but I would consider the sys_futex call a
premature optimization if uwaitqs go in: this comes down to the
numbers. I can supply a uwaitq implementation for benchmarking if you
want?
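
To make the cost comparison concrete, here is a minimal sketch of
futex_down() layered on the uwaitq calls; futex_down_sketch() and
futex_trylock() are hypothetical names, the latter standing in for
whatever atomic "take the counter when it is 1" primitive the real
userspace library provides:

static int futex_down_sketch(struct futex *f)
{
        if (futex_trylock(f))           /* fast path: no syscall */
                return 0;

        /* Contended slow path: at least two system calls (add, then
           wait), plus the remove, where sys_futex(FUTEX_DOWN) needs
           only one. */
        if (uwaitq_add(f) == -1)
                return -1;
        /* Re-check the lock after queueing, so an up() that ran
           before the add cannot leave us asleep forever; a wake that
           lands between add and wait is covered by the kernel-side
           "woken" flag discussed above. */
        while (!futex_trylock(f))
                uwaitq_wait(NULL);
        uwaitq_remove(f);
        return 0;
}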

> From my own implementation, I exported the wait queues and I didn't need the
> add/wait sequence. This, as you know, is/was due to the fact that I used
> semaphores in the kernel. While that created some allocation problems and
> won't allow for usage of the wait queues, it seems more compact.
> Any chance to move the semaphore behavior into the futexes?

The allocation element is the one I don't like: there's no really good
way of limiting it.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-18 00:52:29

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Sat, 2002-03-16 at 22:50, Rusty Russell wrote:

> Only vs. pthread_cond_broadcast.

No. pthread_barrier_wait has the same problem. It has to wake up lots
of threads.

> And if you're using that you probably
> have some other performance issues anyway?

Why? Condition variables are of use in situations with loosely
coupled threads.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

