LinuxLists.cc - [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

2002-03-06 14:46:23

by Hubertus Franke

[permalink] [raw]

Subject: Re: Futexes III : performance numbers

2002-03-06 16:13:13

by Hubertus Franke

[permalink] [raw]

Subject: Re: Futexes III : performance numbers

On Wednesday 06 March 2002 02:54 am, Rusty Russell wrote:
> On Tue, 5 Mar 2002 16:23:14 -0500
>
> Hubertus Franke <[email protected]> wrote:
> > Interesting... the strict FIFO ordering of my fast semaphores limits
> > performance as seen by 99.71% contention, so we always ditch
> > into the kernel. Convoy Avoidance locks 2.5 times better.
> > Wohh futex rock, BUT... with 0.29% contention it basically tells
> > me that we are exhausting our entire quantum getting the lock
> > without contention. So their is some serious fairness issue here
> > at least for the tightly scheduled locks. Compare the M numbers
> > for 2 and 3 children.
>
> Fairness <sigh>. This patch should be much more FIFO: it works by handing
> the mutex straight to the first one on the queue if there is one, and only
> actually "freeing" it if there's noone waiting.
>
> Unfortunately, it seems to hurt performance by 50% on tdbtorture (although
> there are weird scheduler things happening too).
>
> Here's the "fair" patch:
> Rusty.

Rusty why not provide something along the line <snipped all over the place>.

#define FUTEX_DOWN (0)
#define FUTEX_UP (1)
#define FUTEX_FAIR_UP (2)

static int futex_up(struct list_head *head, atomic_t *count)
{
atomic_set(count, 1);
smp_wmb();
wake_one_waiter(head, count);
return 0;
}

static int futex_fair_up(truct list_head *head, atomic_t *count)
{
spin_lock(&futex_lock);
if (!pass_futex(head, count))
/* Noone to receive: set to one and leave it free. */
atomic_set(count, 1);
spin_unlock(&futex_lock);
return 0;
}

asmlinkage int sys_futex(void *uaddr, int op)
{
<..... snip ....>

head = hash_futex(page, pos_in_page);
switch (op) {

case FUTEX_DOWN:
ret = futex_down(head, page_address(page) + pos_in_page);
break;

case FUTEX_UP:
ret = futex_up(head, page_address(page) + pos_in_page);
break;

case FUTEX_FAIR_UP:
ret = futex_fair_up(head, page_address(page) + pos_in_page);
break;

default :

<..... snip ....>
}

This would satisfy the fair vs. calock issue, you let the app decide what to
use. Best of all, it seems to me you can even mix it.
Imagine, if a process knows it will soon reaquire the lock, then it would
use FUTEX_UP to avoid being tagged back to the end of wait queue
avoiding costly scheduling events. At the same time, if the process knows
that its done for a while with that lock, then it issues FUTEX_FAIR_UP.
The best of two worlds..

It also shows how cleanly the code can be expanded in the future.
The more I look at the hash queues the more I like it.
Will look at the rwlock now. Let me know what you think.

--
-- Hubertus Franke ([email protected])

2002-03-06 17:25:32

by George Anzinger

[permalink] [raw]

Subject: Re: [Lse-tech] Re: Futexes III : performance numbers

Hubertus Franke wrote:
>
> On Tuesday 05 March 2002 09:08 pm, Rusty Russell wrote:
> > In message <[email protected]> you write:
> > > On Tuesday 05 March 2002 02:01 am, Rusty Russell wrote:
> > > > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > > > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > > > 3) x86 fixes: tested on dual x86 box.
> > > >
> > > > Example userspace lib attached,
> > > > Rusty.
> > >
> > > I did a quick hack to enable ulockflex to run on the latest interface
> > > that Rusty posted.
> >
> > Cool... is this 8-way or some such "serious" SMP? How about the
> > below microoptimization (untested, but you get the idea).
> >
> > > Now 3 processes
> > > 3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
> > > 3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
> > > 3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
> > > 3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175
> > >
> > > Interesting... the strict FIFO ordering of my fast semaphores limits
> > > performance as seen by 99.71% contention, so we always ditch
> > > into the kernel. Convoy Avoidance locks 2.5 times better.
> >
> > Hmmm... actually I'm limited FIFO, in that I queue on the tail and do
> > wake one. Of course, someone can come in userspace and grab the lock
> > while the guy in the kernel is waking up, and this is clearly
> > happening here.
> >
> > This can be fixed, I think, by saying to the one we wake up "you have
> > the lock" and never actually changing the value to 1. This might cost
> > us very little: I'll send another patch this afternoon.
> >
>
> Well, yes it can be fixed as I did in my package, but it comes at a
> substantial cost as seen above. The question is whether to simply
> ignore strict FIFO requirements ?
> Doing the FIFO leads to the convoy problem, namely your lock arrival
> becomes the scheduling queue.
> As seen above from the nonexisting contention, mootexes completely
> exhaust their scheduling quantum before allowing anybocy else to grap
> the lock. This is desired behavior particular for high traffic, low lockhold
> time locks, but not for others.
>
> In this case you simply hand over the lock and won't allow anybody
> in user space to grap it during the time window one is woken up in
> the kernel.
> Also, from my own experience doing a spinning lock that way
>
> Another issue, is a few more operations, what would be nice is to be
> able to wake up all waiting processes and have them recontent?

Unless you are ready to go to priority queues (IMHO the preferred
approach) this is the only way to avoid priority inversion, especially
in a real time environment.

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-03-06 20:35:53

by Hubertus Franke

[permalink] [raw]

Subject: Futexes V :

On Wednesday 06 March 2002 11:13 am, Hubertus Franke wrote:

I cut a new version with what I was previously discussing.
Now we have two kind of wakeup mechanism

(a) regular wakeup (as was) which basically gives you convoy avoidance
(b) fair wakeup (will first wake a waiting process up .. FIFO)

Basically integrated 2 prior patches of Rusty

Also changed FUTEX_DOWN, FUTEX_UP and FUTEX_UP_FAIR
operands to be linear (0,1,2), should make the case statement faster,
particularly when we get more operands.

frankeh:1019:~/futex/ulockflex>
./ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 0 -x 0 -L f

SysV: 3 1 2 1 k 0 0 0 99.42 0.00 0.00 0.141423 243718
FAIR: 3 1 2 1 f 0 0 0 93.67 0.00 0.00 0.049231 217001
CA: 3 1 2 1 f 0 0 0 0.01 0.00 0.00 0.154154 2002852

Yesterdays numbers where

3 1 5 4 k 0 0 0 99.98 0.00 0.00 0.033284 242040
3 1 5 4 m 0 0 0 0.29 0.00 0.00 0.018406 1979992
3 1 5 4 f 0 0 0 99.71 0.00 0.00 0.028083 306140
3 1 5 4 c 0 0 0 7.79 0.00 4.00 0.437084 774175

Indicates that fair locking still needs some work, but what Rusty provided
on top the current
--
-- Hubertus Franke ([email protected])

---------------------------------------------------------------------------------------------
diff -urbN linux-2.5.5/arch/i386/kernel/entry.S
linux-2.5.5-futex/arch/i386/kernel/entry.S
--- linux-2.5.5/arch/i386/kernel/entry.S Tue Feb 19 21:10:58 2002
+++ linux-2.5.5-futex/arch/i386/kernel/entry.S Wed Mar 6 11:40:12 2002
@@ -716,6 +716,7 @@
.long SYMBOL_NAME(sys_lremovexattr)
.long SYMBOL_NAME(sys_fremovexattr)
.long SYMBOL_NAME(sys_tkill)
+ .long SYMBOL_NAME(sys_futex)

.rept NR_syscalls-(.-sys_call_table)/4
.long SYMBOL_NAME(sys_ni_syscall)
diff -urbN linux-2.5.5/arch/ppc/kernel/misc.S
linux-2.5.5-futex/arch/ppc/kernel/misc.S
--- linux-2.5.5/arch/ppc/kernel/misc.S Tue Feb 19 21:11:00 2002
+++ linux-2.5.5-futex/arch/ppc/kernel/misc.S Wed Mar 6 11:40:12 2002
@@ -1246,6 +1246,7 @@
.long sys_removexattr
.long sys_lremovexattr
.long sys_fremovexattr /* 220 */
+ .long sys_futex
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
.endr
diff -urbN linux-2.5.5/include/asm-i386/atomic.h
linux-2.5.5-futex/include/asm-i386/atomic.h
--- linux-2.5.5/include/asm-i386/atomic.h Tue Feb 19 21:10:58 2002
+++ linux-2.5.5-futex/include/asm-i386/atomic.h Wed Mar 6 11:42:35 2002
@@ -2,6 +2,7 @@
#define __ARCH_I386_ATOMIC__

#include <linux/config.h>
+#include <asm/system.h>

/*
* Atomic operations that C can't guarantee us. Useful for
diff -urbN linux-2.5.5/include/asm-i386/mman.h
linux-2.5.5-futex/include/asm-i386/mman.h
--- linux-2.5.5/include/asm-i386/mman.h Tue Feb 19 21:10:56 2002
+++ linux-2.5.5-futex/include/asm-i386/mman.h Wed Mar 6 11:40:12 2002
@@ -4,6 +4,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urbN linux-2.5.5/include/asm-i386/unistd.h
linux-2.5.5-futex/include/asm-i386/unistd.h
--- linux-2.5.5/include/asm-i386/unistd.h Tue Feb 19 21:11:04 2002
+++ linux-2.5.5-futex/include/asm-i386/unistd.h Wed Mar 6 11:40:12 2002
@@ -243,6 +243,7 @@
#define __NR_lremovexattr 236
#define __NR_fremovexattr 237
#define __NR_tkill 238
+#define __NR_futex 239

/* user-visible error numbers are in the range -1 - -124: see
<asm-i386/errno.h> */

diff -urbN linux-2.5.5/include/asm-ppc/mman.h
linux-2.5.5-futex/include/asm-ppc/mman.h
--- linux-2.5.5/include/asm-ppc/mman.h Tue Feb 19 21:11:03 2002
+++ linux-2.5.5-futex/include/asm-ppc/mman.h Wed Mar 6 11:40:12 2002
@@ -7,6 +7,7 @@
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
+#define PROT_SEM 0x8 /* page may be used for atomic ops */
#define PROT_NONE 0x0 /* page can not be accessed */

#define MAP_SHARED 0x01 /* Share changes */
diff -urbN linux-2.5.5/include/asm-ppc/unistd.h
linux-2.5.5-futex/include/asm-ppc/unistd.h
--- linux-2.5.5/include/asm-ppc/unistd.h Tue Feb 19 21:10:57 2002
+++ linux-2.5.5-futex/include/asm-ppc/unistd.h Wed Mar 6 11:40:12 2002
@@ -228,6 +228,7 @@
#define __NR_removexattr 218
#define __NR_lremovexattr 219
#define __NR_fremovexattr 220
+#define __NR_futex 221

#define __NR(n) #n

diff -urbN linux-2.5.5/include/linux/futex.h
linux-2.5.5-futex/include/linux/futex.h
--- linux-2.5.5/include/linux/futex.h Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/include/linux/futex.h Wed Mar 6 13:58:21 2002
@@ -0,0 +1,9 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_DOWN (0)
+#define FUTEX_UP (1)
+#define FUTEX_UP_FAIR (2)
+
+#endif
diff -urbN linux-2.5.5/include/linux/hash.h
linux-2.5.5-futex/include/linux/hash.h
--- linux-2.5.5/include/linux/hash.h Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/include/linux/hash.h Wed Mar 6 11:40:12 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+ (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+ unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+ /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+ unsigned long n = hash;
+ n <<= 18;
+ hash -= n;
+ n <<= 33;
+ hash -= n;
+ n <<= 3;
+ hash += n;
+ n <<= 3;
+ hash -= n;
+ n <<= 4;
+ hash += n;
+ n <<= 2;
+ hash += n;
+#else
+ /* On some cpus multiply is faster, on others gcc will do shifts */
+ hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+ /* High bits are more random, so use them. */
+ return hash >> (BITS_PER_LONG - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+ return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urbN linux-2.5.5/include/linux/mmzone.h
linux-2.5.5-futex/include/linux/mmzone.h
--- linux-2.5.5/include/linux/mmzone.h Tue Feb 19 21:10:53 2002
+++ linux-2.5.5-futex/include/linux/mmzone.h Wed Mar 6 11:42:35 2002
@@ -51,8 +51,7 @@
/*
* wait_table -- the array holding the hash table
* wait_table_size -- the size of the hash table array
- * wait_table_shift -- wait_table_size
- * == BITS_PER_LONG (1 << wait_table_bits)
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
@@ -75,7 +74,7 @@
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_size;
- unsigned long wait_table_shift;
+ unsigned long wait_table_bits;

/*
* Discontig memory support fields.
diff -urbN linux-2.5.5/kernel/Makefile linux-2.5.5-futex/kernel/Makefile
--- linux-2.5.5/kernel/Makefile Tue Feb 19 21:10:57 2002
+++ linux-2.5.5-futex/kernel/Makefile Wed Mar 6 11:40:12 2002
@@ -15,7 +15,7 @@
obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o info.o time.o softirq.o resource.o \
sysctl.o acct.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o
+ signal.o sys.o kmod.o context.o futex.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urbN linux-2.5.5/kernel/futex.c linux-2.5.5-futex/kernel/futex.c
--- linux-2.5.5/kernel/futex.c Wed Dec 31 19:00:00 1969
+++ linux-2.5.5-futex/kernel/futex.c Wed Mar 6 13:59:01 2002
@@ -0,0 +1,255 @@
+/*
+ * Fast Userspace Mutexes (which I call "Futexes!").
+ * (C) Rusty Russell, IBM 2002
+ *
+ * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ * enough at me, Linus for the original (flawed) idea, Matthew
+ * Kirkwood for proof-of-concept implementation.
+ *
+ * "The futexes are also cursed."
+ * "But they come in a choice of three flavours!"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+ decrements from 1 to 0. 1 == free. 0 == noone sleeping.
+
+ This is simple enough to work on all architectures. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a wait_queue_t, so we can wake only the
+ relevent ones (hashed queues may be shared) */
+struct futex_q {
+ struct list_head list;
+ struct task_struct *task;
+ atomic_t *count;
+};
+
+/* The key for the hash is the address + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+#define FUTEX_PASSED ((void *)-1)
+
+/* Try to find someone else to pass futex to. */
+static int pass_futex(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+ struct futex_q *recipient = NULL;
+ int more_candidates = 0;
+
+ /* Find first, and keep looking to see if there are others. */
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ if (!recipient) recipient = this;
+ else {
+ /* Someone else waiting, too. */
+ more_candidates = 1;
+ break;
+ }
+ }
+ }
+
+ /* Nobody wants it. */
+ if (!recipient)
+ return 0;
+
+ /* Fixup to avoid wasted wakeup when we up() later. */
+ if (!more_candidates)
+ atomic_set(count, 0);
+
+ /* Pass directly to them. */
+ recipient->count = FUTEX_PASSED;
+ smp_wmb();
+ wake_up_process(recipient->task);
+ return 1;
+}
+
+static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+{
+ struct list_head *i;
+
+ spin_lock(&futex_lock);
+ list_for_each(i, head) {
+ struct futex_q *this = list_entry(i, struct futex_q, list);
+
+ if (this->count == count) {
+ wake_up_process(this->task);
+ break;
+ }
+ }
+ spin_unlock(&futex_lock);
+}
+
+static inline struct list_head *hash_futex(struct page *page,
+ unsigned long offset)
+{
+ unsigned long h;
+
+ /* If someone is sleeping, page is pinned. ie. page_address
+ is a constant when we care about it. */
+ h = (unsigned long)page_address(page) + offset;
+ return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+/* Add at end to make it FIFO. */
+static inline void queue_me(struct list_head *head, struct futex_q *q)
+{
+ q->task = current;
+
+ spin_lock(&futex_lock);
+ list_add_tail(&q->list, head);
+ spin_unlock(&futex_lock);
+}
+
+/* Return true if there is someone else waiting as well */
+static inline void unqueue_me(struct futex_q *q)
+{
+ spin_lock(&futex_lock);
+ list_del(&q->list);
+ spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+ struct mm_struct *mm = current->mm;
+ struct page *page;
+ int err;
+
+ down_read(&mm->mmap_sem);
+ err = get_user_pages(current, current->mm, page_start,
+ 1 /* one page */,
+ 1 /* writable */,
+ 0 /* don't force */,
+ &page,
+ NULL /* don't return vmas */);
+ up_read(&mm->mmap_sem);
+
+ if (err < 0)
+ return ERR_PTR(err);
+ return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+ struct futex_q q;
+
+ current->state = TASK_INTERRUPTIBLE;
+ q.count = count;
+ queue_me(head, &q);
+
+ /* It may have become available while we were adding ourselves
+ to queue? Also, make sure it's -ve so userspace knows
+ there's someone waiting. */
+ while ((atomic_read(count) < 0 || !atomic_dec_and_test(count))
+ && q.count != FUTEX_PASSED) {
+ schedule();
+ current->state = TASK_INTERRUPTIBLE;
+
+ if (signal_pending(current)) {
+ unqueue_me(&q);
+
+ /* We might have been passed futex anyway. */
+ return (q.count == FUTEX_PASSED) ? 0 : -EINTR;
+ }
+ }
+
+ /* We got the futex! */
+ current->state = TASK_RUNNING;
+ unqueue_me(&q);
+ return 0;
+}
+
+static int futex_fair_up(struct list_head *head, atomic_t *count)
+{
+ spin_lock(&futex_lock);
+ if (!pass_futex(head, count))
+ /* Noone to receive: set to one and leave it free. */
+ atomic_set(count, 1);
+ spin_unlock(&futex_lock);
+ return 0;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+ atomic_set(count, 1);
+ smp_wmb();
+ wake_one_waiter(head, count);
+ return 0;
+}
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+ int ret;
+ unsigned long pos_in_page;
+ struct list_head *head;
+ struct page *page;
+
+ pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+ /* Must be "naturally" aligned, and not on page boundary. */
+ if ((pos_in_page % __alignof__(atomic_t)) != 0
+ || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Simpler if it doesn't vanish underneath us. */
+ page = pin_page((unsigned long)uaddr - pos_in_page);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ head = hash_futex(page, pos_in_page);
+ switch (op) {
+ case FUTEX_DOWN:
+ ret = futex_down(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_UP:
+ ret = futex_up(head, page_address(page) + pos_in_page);
+ break;
+ case FUTEX_UP_FAIR:
+ ret = futex_fair_up(head, page_address(page) + pos_in_page);
+ break;
+ /* Add other lock types here... */
+ default:
+ ret = -EINVAL;
+ }
+ put_page(page);
+
+ return ret;
+}
+
+static int __init init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+ INIT_LIST_HEAD(&futex_queues[i]);
+ return 0;
+}
+__initcall(init);
diff -urbN linux-2.5.5/mm/filemap.c linux-2.5.5-futex/mm/filemap.c
--- linux-2.5.5/mm/filemap.c Wed Mar 6 15:10:09 2002
+++ linux-2.5.5-futex/mm/filemap.c Wed Mar 6 11:40:12 2002
@@ -25,6 +25,7 @@
#include <linux/iobuf.h>
#include <linux/compiler.h>
#include <linux/fs.h>
+#include <linux/hash.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -773,32 +774,8 @@
static inline wait_queue_head_t *page_waitqueue(struct page *page)
{
const zone_t *zone = page_zone(page);
- wait_queue_head_t *wait = zone->wait_table;
- unsigned long hash = (unsigned long)page;

-#if BITS_PER_LONG == 64
- /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
- unsigned long n = hash;
- n <<= 18;
- hash -= n;
- n <<= 33;
- hash -= n;
- n <<= 3;
- hash += n;
- n <<= 3;
- hash -= n;
- n <<= 4;
- hash += n;
- n <<= 2;
- hash += n;
-#else
- /* On some cpus multiply is faster, on others gcc will do shifts */
- hash *= GOLDEN_RATIO_PRIME;
-#endif
-
- hash >>= zone->wait_table_shift;
-
- return &wait[hash];
+ return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}

/*
diff -urbN linux-2.5.5/mm/mprotect.c linux-2.5.5-futex/mm/mprotect.c
--- linux-2.5.5/mm/mprotect.c Tue Feb 19 21:11:01 2002
+++ linux-2.5.5-futex/mm/mprotect.c Wed Mar 6 11:40:12 2002
@@ -280,7 +280,7 @@
end = start + len;
if (end < start)
return -EINVAL;
- if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+ if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
return -EINVAL;
if (end == start)
return 0;
diff -urbN linux-2.5.5/mm/page_alloc.c linux-2.5.5-futex/mm/page_alloc.c
--- linux-2.5.5/mm/page_alloc.c Tue Feb 19 21:10:55 2002
+++ linux-2.5.5-futex/mm/page_alloc.c Wed Mar 6 11:40:12 2002
@@ -776,8 +776,8 @@
* per zone.
*/
zone->wait_table_size = wait_table_size(size);
- zone->wait_table_shift =
- BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+ zone->wait_table_bits =
+ wait_table_bits(zone->wait_table_size);
zone->wait_table = (wait_queue_head_t *)
alloc_bootmem_node(pgdat, zone->wait_table_size
* sizeof(wait_queue_head_t));

2002-03-07 00:34:33

by Hubertus Franke

[permalink] [raw]

Subject: Re: Futexes III : performance numbers

On Tuesday 05 March 2002 09:08 pm, Rusty Russell wrote:
> In message <[email protected]> you write:

More on fairness. I hacked ulockflex to keep the history of lock acquisition
and print it out after the run, so this doesn't create any overhead and
is recorded while the lock is held (history buffer is pretouched)

Read as follows
lock-aquisition [ how often for the same process ] : process id
left out if only 1

First the FUTEX_UP later the FUTEX_UP_FAIR.
two cases (-r2 -x2) and (-r0 -x0)

Summary, its clearly seen how the fairness can be effected.
It also show the efficacy of FUTEX_UP and FUTEX_UP_FAIR.

Comments.

./ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 2 -x 1 -L f -H 2

===========================================================

(i.e. 2 usecs non-lockholdtime and 1 usec lockhold time)

----------------- UNFAIR LOCKS == FUTEX_UP ------------------
<..snip...>
1067602 [ 4065 ]: 1
1071667 [ 4522 ]: 0
1076189 : 2
1076190 [ 953 ]: 1
1077143 : 2
1077144 [ 2875 ]: 0
1080019 [ 4800 ]: 2
1084819 : 0
1084820 [ 968 ]: 1
1085788 : 0
1085789 : 1
1085790 : 0
1085791 : 1
1085792 : 0
1085793 : 1
1085794 : 0
1085795 : 1
1085796 : 0
1085797 : 1
1085798 : 0
1085799 : 1
1085800 : 0
1085801 : 1
1085802 : 0
1085803 : 1
1085804 : 0
1085805 : 1
1085806 : 0
1085807 : 1
1085808 : 0
1085809 : 1
1085810 : 0
1085811 : 1
1085812 : 0
1085813 : 2
1085814 [ 2 ]: 0
1085816 : 2
1085817 [ 2861 ]: 1
1088678 : 2
1088679 [ 4868 ]: 0
1093547 : 2
1093548 [ 914 ]: 1
1094462 : 2
1094463 [ 4829 ]: 0
1099292 : 2
1099293 [ 963 ]: 1
1100256 : 2
1100257 [ 4789 ]: 0
1105046 : 2
1105047 [ 966 ]: 1
1106013 : 2
1106014 [ 4800 ]: 0
1110814 : 2
1110815 [ 961 ]: 1
1111776 : 2
1111777 [ 2013 ]: 0
1113790 : 2
1113791 : 0
1113792 : 2
1113793 [ 3768 ]: 1
1117561 : 2
1117562 [ 4832 ]: 0
1122394 : 2
1122395 [ 955 ]: 1
1123350 : 2
1123351 [ 4813 ]: 0
1128164 : 2
1128165 [ 982 ]: 1
1129147 : 2
1129148 : 1
1129149 : 2
1129150 [ 4789 ]: 0
1133939 : 2
1133940 : 0
1133941 : 2
1133942 : 0
1133943 : 2
1133944 [ 969 ]: 1
1134913 : 2
1134914 [ 4841 ]: 0
1139755 : 2
1139756 [ 967 ]: 1
1140723 : 2
1140724 [ 4820 ]: 0
1145544 : 2
1145545 [ 969 ]: 1
1146514 : 2
1146515 [ 5007 ]: 0
1151522 : 2
1151523 [ 3678 ]: 1
1155201 : 2
1155202 [ 4756 ]: 0
1159958 : 2
1159959 [ 978 ]: 1

-------------------------- FAIR LOCKS == FUTEX_UP_FAIR ----------------
<... snip ...>
558617 : 0
558618 : 1
558619 : 2
558620 : 1
558621 : 0
558622 : 1
558623 : 2
558624 : 1
558625 : 0
558626 : 1
558627 : 2
558628 : 1
558629 : 0
558630 : 1
558631 : 2
<... and so on ....>

=================================================================.
/ulockflex -c 3 -a 1 -t 2 -o 5 -m 2 -R 499 -r 0 -x 0 -L f -H 2

===========================================================

(i.e. 0 usecs non-lockholdtime and 0 usec lockhold time)

----------------- UNFAIR LOCKS == FUTEX_UP ------------------
<..snip...>
7682404 [ 4593 ]: 1
7686997 : 2
7686998 [ 16336 ]: 0
7703334 : 2
7703335 [ 23875 ]: 1
7727210 [ 4 ]: 2
7727214 [ 20110 ]: 0
7747324 [ 13298 ]: 1
7760622 [ 11612 ]: 2
7772234 [ 8340 ]: 0
7780574 [ 6732 ]: 1
7787306 [ 13388 ]: 2
7800694 [ 3006 ]: 0
7803700 [ 17121 ]: 1
7820821 [ 6726 ]: 2
7827547 [ 13396 ]: 0
7840943 [ 6760 ]: 1
7847703 [ 13375 ]: 2
7861078 [ 4443 ]: 0
7865521 [ 15566 ]: 1
7881087 [ 6730 ]: 2
7887817 [ 13421 ]: 0
7901238 [ 3013 ]: 1
7904251 [ 16995 ]: 2
7921246 [ 6715 ]: 0
7927961 [ 13397 ]: 1
7941358 [ 6716 ]: 2
7948074 [ 13407 ]: 0
7961481 [ 6743 ]: 1
7968224 [ 13309 ]: 2
7981533 [ 6708 ]: 0
7988241 [ 13374 ]: 1
8001615 [ 3411 ]: 2
8005026 [ 16574 ]: 0
8021600 [ 7016 ]: 1

-------------------------- FAIR LOCKS == FUTEX_UP_FAIR ----------------
<... snip ...> same as for -r 2 -x 1

--
-- Hubertus Franke ([email protected])

2002-03-07 00:34:33

by Hubertus Franke

[permalink] [raw]

In message <[email protected]> you write:
> On Tuesday 26 March 2002 06:10 pm, Rusty Russell wrote:
> > In message <[email protected]> you write:
> > > > And on top of them:
> > > > futex_down(struct futex *);
> > > > futex_up(struct futex *);
> > >
> > > Why not keep the simple one-sys-call interface for the fuxtexes. The
> > > code is so small that it is
> > > not worth to delete it.
> >
> >
>
> Rusty, you lost me in all these discussions now.

I know the feeling 8)

> Is the current position to export wait queues and drop the futex interface ?
> I would recommend against that. If we need 2 syscalls to implement
> the futex behavior that certainly will create quite some overhead.

I'm still playing with the options. Two system calls in the slow path
is definitely slower, but it's not insanely slow, and it's becoming
fairly clear that the wider range of primitives is worthwhile.

Both approaches can coexist, but I would consider the sys_futex call a
premature optimization if uwaitqs go in: this comes down to the
numbers. I can supply a uwaitq implementation for benchmarking if you
want?

> >From my own implementation, I exported the wait queues and I didn't need the
> add/wait sequence. This as you know is/was due to the fact that I used
> semaphores in the kernel. While that created some allocation problems and
> won't allow for usage of the wait queues, it seems more compact.
> Any chance to move the semaphore behavior into the futexes.

The allocation element is the one I don't like: there's no really good
way of limiting it.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-03-18 00:52:29

by Ulrich Drepper

[permalink] [raw]

Subject: Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

On Sat, 2002-03-16 at 22:50, Rusty Russell wrote:

> Only vs. pthread_cond_broadcast.

No. pthread_barrier_wait has the same problem. It has to wake up lots
of thread.

> And if you're using that you probably
> have some other performance issues anyway?

Why? Conditional variables are of use in situations with loosely
coupled threads.

--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------

Attachments:

signature.asc (232.00 B)
This is a digitally signed message part