2006-08-08 07:05:26

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: [RFC] NUMA futex hashing

Current futex hash scheme is not the best for NUMA. The futex hash table is
an array of struct futex_hash_bucket, which is just a spinlock and a
list_head -- this means multiple spinlocks on the same cacheline and on NUMA
machines, on the same internode cacheline. If futexes of two unrelated
threads running on two different nodes happen to hash onto adjacent hash
buckets, or buckets on the same internode cacheline, then we have the
internode cacheline bouncing between nodes.

Here is a simple scheme which maintains per-node hash tables for futexes.

In this scheme, a private futex is assigned to the node id of the futex's KVA.
The reasoning is, the futex KVA is allocated from the node as indicated
by memory policy set by the process, and that should be a good 'home node'
for that futex. Of course this helps workloads where all the threads of a
process are bound to the same node, but it seems reasonable to run all
threads of a process on the same node.

A shared futex is assigned a home node based on jhash2 itself. Since inode
and offset are used as the key, the same inode and offset always arrive at
the same home node for a shared futex. This distributes shared futexes
across all nodes.

Comments? Suggestions? Particularly regarding shared futexes. Any policy
suggestions?

Thanks,
Kiran

Note: This patch needs to have kvaddr_to_nid() reintroduced. This was taken
out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8

Index: linux-2.6.18-rc3/kernel/futex.c
===================================================================
--- linux-2.6.18-rc3.orig/kernel/futex.c 2006-08-02 12:11:34.000000000 -0700
+++ linux-2.6.18-rc3/kernel/futex.c 2006-08-02 16:48:47.000000000 -0700
@@ -137,20 +137,35 @@ struct futex_hash_bucket {
struct list_head chain;
};

-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly;

/* Futex-fs vfsmount entry: */
static struct vfsmount *futex_mnt;

/*
* We hash on the keys returned from get_futex_key (see below).
+ * With NUMA aware futex hashing, we have per-node hash tables.
+ * We determine the home node of a futex based on the KVA -- if the futex
+ * is a private futex. For shared futexes, we use jhash2 itself on the
+ * futex_key to arrive at a home node.
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
+ int nodeid;
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ if (key->both.offset & 0x1) {
+ /*
+ * Shared futex: Use any of the 'possible' nodes as home node.
+ */
+ nodeid = hash & (MAX_NUMNODES -1);
+ BUG_ON(!node_possible(nodeid));
+ } else
+ /* Private futex */
+ nodeid = kvaddr_to_nid(key->both.ptr);
+
+ return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)];
}

/*
@@ -1909,13 +1924,25 @@ static int __init init(void)
{
unsigned int i;

+ int nid;
+
+ for_each_node(nid)
+ {
+ futex_queues[nid] = kmalloc_node(
+ (sizeof(struct futex_hash_bucket) *
+ (1 << FUTEX_HASHBITS)),
+ GFP_KERNEL, nid);
+ if (!futex_queues[nid])
+ panic("futex_init: Allocation of multi-node futex_queues failed");
+ for (i = 0; i < (1 << FUTEX_HASHBITS); i++) {
+ INIT_LIST_HEAD(&futex_queues[nid][i].chain);
+ spin_lock_init(&futex_queues[nid][i].lock);
+ }
+ }
+
register_filesystem(&futex_fs_type);
futex_mnt = kern_mount(&futex_fs_type);

- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
- INIT_LIST_HEAD(&futex_queues[i].chain);
- spin_lock_init(&futex_queues[i].lock);
- }
return 0;
}
__initcall(init);


2006-08-08 09:14:53

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote:
> Current futex hash scheme is not the best for NUMA. The futex hash table
> is an array of struct futex_hash_bucket, which is just a spinlock and a
> list_head -- this means multiple spinlocks on the same cacheline and on
> NUMA machines, on the same internode cacheline. If futexes of two
> unrelated threads running on two different nodes happen to hash onto
> adjacent hash buckets, or buckets on the same internode cacheline, then we
> have the internode cacheline bouncing between nodes.
>
> Here is a simple scheme which maintains per-node hash tables for futexes.
>
> In this scheme, a private futex is assigned to the node id of the futex's
> KVA. The reasoning is, the futex KVA is allocated from the node as
> indicated by memory policy set by the process, and that should be a good
> 'home node' for that futex. Of course this helps workloads where all the
> threads of a process are bound to the same node, but it seems reasonable to
> run all threads of a process on the same node.
>
> A shared futex is assigned a home node based on jhash2 itself. Since inode
> and offset are used as the key, the same inode offset is used to arrive at
> the home node of a shared futex. This distributes shared futexes across
> all nodes.
>
> Comments? Suggestions? Particularly regarding shared futexes. Any policy
> suggestions?
>

Your patch seems fine, but I have one comment.

For a non-NUMA machine, we would have one useless indirection to get the
futex_queues pointer:

static struct futex_hash_bucket *futex_queues[1];

I think it is worth redesigning your patch so that this extra indirection is
needed only on NUMA machines.

#if defined(CONFIG_NUMA)
static struct futex_hash_bucket *futex_queues[MAX_NUMNODES];
#define FUTEX_QUEUES(nodeid, hash) \
	(&futex_queues[nodeid][(hash) & ((1 << FUTEX_HASHBITS) - 1)])
#else
static struct futex_hash_bucket futex_queues[1 << FUTEX_HASHBITS];
#define FUTEX_QUEUES(nodeid, hash) \
	(&futex_queues[(hash) & ((1 << FUTEX_HASHBITS) - 1)])
#endif

Thank you


2006-08-08 09:37:25

by Jes Sorensen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

>>>>> "Ravikiran" == Ravikiran G Thirumalai <[email protected]> writes:

Ravikiran> Current futex hash scheme is not the best for NUMA. The
Ravikiran> futex hash table is an array of struct futex_hash_bucket,
Ravikiran> which is just a spinlock and a list_head -- this means
Ravikiran> multiple spinlocks on the same cacheline and on NUMA
Ravikiran> machines, on the same internode cacheline. If futexes of
Ravikiran> two unrelated threads running on two different nodes happen
Ravikiran> to hash onto adjacent hash buckets, or buckets on the same
Ravikiran> internode cacheline, then we have the internode cacheline
Ravikiran> bouncing between nodes.

Ravikiran,

Using that argument, all you need to do is add the alignment
____cacheline_aligned_in_smp to the definition of
struct futex_hash_bucket and the problem is solved, given that the
internode cacheline in a NUMA system is defined to be the same as the
SMP cacheline size.

Ravikiran> Here is a simple scheme which maintains per-node hash
Ravikiran> tables for futexes.

Ravikiran> In this scheme, a private futex is assigned to the node id
Ravikiran> of the futex's KVA. The reasoning is, the futex KVA is
Ravikiran> allocated from the node as indicated by memory policy set
Ravikiran> by the process, and that should be a good 'home node' for
Ravikiran> that futex. Of course this helps workloads where all the
Ravikiran> threads of a process are bound to the same node, but it
Ravikiran> seems reasonable to run all threads of a process on the
Ravikiran> same node.

You can't make that assumption at all. In many NUMA workloads it is
not common to have all threads of a process run on the same node. You
often see a case where one thread spawns a number of threads that are
then grouped onto the various nodes.

If we want to make the futexes really NUMA aware, having them
explicitly allocated on a given node would be more useful or
alternatively have them allocated on a first touch basis.

But to be honest, I doubt it matters too much since the futex
cacheline is most likely to end up in cache on the node where it's
being used and as long as the other nodes don't try and touch the same
futex this becomes a non-issue with the proper alignment.

I don't think your patch is harmful, but it looks awfully complex for
something that could be solved just as well by a simple alignment
statement.

Cheers,
Jes

2006-08-08 09:58:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Ravikiran G Thirumalai <[email protected]> writes:

> Current futex hash scheme is not the best for NUMA. The futex hash table is
> an array of struct futex_hash_bucket, which is just a spinlock and a
> list_head -- this means multiple spinlocks on the same cacheline and on NUMA
> machines, on the same internode cacheline. If futexes of two unrelated
> threads running on two different nodes happen to hash onto adjacent hash
> buckets, or buckets on the same internode cacheline, then we have the
> internode cacheline bouncing between nodes.

When I did some testing with an (arguably far too lock-intensive) benchmark
on a bigger box, I got most bouncing cycles not in the futex locks
themselves, but in the down_read on the mm semaphore.

I guess that is not addressed?

-Andi

2006-08-08 09:58:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Jes Sorensen <[email protected]> writes:
>
> Using that argument, all you need to do is to add the alignment
> ____cacheline_aligned_in_smp to the definition of
> struct futex_hash_bucket and the problem is solved, given that the
> internode cacheline in a NUMA system is defined to be the same as the
> SMP cacheline size.

Yes but it would waste quite a lot of memory and cache.
Wasted cache = slow.

-Andi

2006-08-08 10:07:50

by Jes Sorensen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

>>>>> "Andi" == Andi Kleen <[email protected]> writes:

Andi> Jes Sorensen <[email protected]> writes:
>> Using that argument, all you need to do is to add the alignment
>> ____cacheline_aligned_in_smp to the definition of struct
>> futex_hash_bucket and the problem is solved, given that the
>> internode cacheline in a NUMA system is defined to be the same as
>> the SMP cacheline size.

Andi> Yes but it would waste quite a lot of memory and cache. Wasted
Andi> cache = slow.

Compared to the extra level of indirection, I doubt it would be
measurable. The cache space is barely wasted, we're talking
approximately half a cacheline per futex hash bucket in use.

Jes

2006-08-08 10:10:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> Ravikiran G Thirumalai <[email protected]> writes:
> > Current futex hash scheme is not the best for NUMA. The futex hash
> > table is an array of struct futex_hash_bucket, which is just a spinlock
> > and a list_head -- this means multiple spinlocks on the same cacheline
> > and on NUMA machines, on the same internode cacheline. If futexes of two
> > unrelated threads running on two different nodes happen to hash onto
> > adjacent hash buckets, or buckets on the same internode cacheline, then
> > we have the internode cacheline bouncing between nodes.
>
> When I did some testing with a (arguably far too lock intensive) benchmark
> on a bigger box I got most bouncing cycles not in the futex locks itself,
> but in the down_read on the mm semaphore.

This is true, even with a normal application (not a biased benchmark) and
using oprofile. mmap_sem is the killer.

We may have a special case for PRIVATE futexes (they don't need to be
chained in a global table, but in a process-private table).

The POSIX thread API already lets the application tell glibc/the kernel
that a mutex/futex has process scope.

For these private futexes, I think we would not need to down_read(mmap_sem)
at all (only a lock or locks protecting the process-private table).

Eric

2006-08-08 10:36:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

>
> We may have special case for PRIVATE futexes (they dont need to be chained in
> a global table, but a process private table)

What do you mean with PRIVATE futex?

Even if the futex mapping is only visible by a single MM mmap_sem is still needed
to protect against other threads doing mmap.

-Andi

2006-08-08 12:29:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > We may have special case for PRIVATE futexes (they dont need to be
> > chained in a global table, but a process private table)
>
> What do you mean with PRIVATE futex?
>
> Even if the futex mapping is only visible by a single MM mmap_sem is still
> needed to protect against other threads doing mmap.

Hum... I would call that a user error.

If a thread is munmap()ing the vma that contains active futexes, result is
undefined. Same as today I think (a thread blocked in a FUTEX_WAIT should
stay blocked)

The point is that private futexes could be managed using virtual addresses,
and no call to find_extend_vma(), hence no mmap_sem contention.

There could be a problem if the same futex (32-bit integer) could be mapped
at different virtual addresses in the same process.

Eric

2006-08-08 12:48:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > We may have special case for PRIVATE futexes (they dont need to be
> > > chained in a global table, but a process private table)
> >
> > What do you mean with PRIVATE futex?
> >
> > Even if the futex mapping is only visible by a single MM mmap_sem is still
> > needed to protect against other threads doing mmap.
>
> Hum... I would call that a user error.
>
> If a thread is munmap()ing the vma that contains active futexes, result is
> undefined.

We can't allow anything that could crash the kernel, corrupt a kernel
data structure, allow writing to freed memory, etc. No matter how
defined the behavior is or not. Working with a vma that doesn't have
an existence guarantee would be just that.

-Andi

2006-08-08 12:57:15

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 14:47, Andi Kleen wrote:
> On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> > On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > > We may have special case for PRIVATE futexes (they dont need to be
> > > > chained in a global table, but a process private table)
> > >
> > > What do you mean with PRIVATE futex?
> > >
> > > Even if the futex mapping is only visible by a single MM mmap_sem is
> > > still needed to protect against other threads doing mmap.
> >
> > Hum... I would call that a user error.
> >
> > If a thread is munmap()ing the vma that contains active futexes, result
> > is undefined.
>
> We can't allow anything that could crash the kernel, corrupt a kernel,
> data structure, allow writing to freed memory etc. No matter how
> defined it is or not. Working with a vma that doesn't have
> an existence guarantee would be just that.

As I said, we do not walk the vmas anymore, so no crashes are ever possible.

Just keep a process-private list of 'private futexes', indexed by their
virtual address. This list can of course be stored in an efficient data
structure: an AVL or RB tree, or a hash table.

The validity of the virtual address is still tested by a normal get_user()
call. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.

Eric

2006-08-08 14:40:03

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On 8/8/06, Eric Dumazet <[email protected]> wrote:
> The validity of the virtual address is still tested by normal get_user()
> call.. If the memory was freed by a thread, then a normal EFAULT error will
> be reported... eventually.

This is indeed what should be done. Private futexes are by far the more
frequent case and I bet you'd see improvements when avoiding the mm mutex
even on normal machines, since futexes really are everywhere.
For shared mutexes you end up doing two lookups, and that's fine IMO
as long as the first lookup is fast.

As for the NUMA case, I would oppose any change which has the
slightest impact on non-NUMA machines. It cannot be allowed that the
majority of systems is slowed down significantly just because of NUMA,
especially since the effects of NUMA beside cacheline transfer
penalties are IMO probably negligible. The in-kernel futex
representation only exists when there are waiters, and so the memory
needed is only allocated when we are waiting. In this case it would be
easy enough to use local memory. But this is unlikely to help much,
since the waker thread is ideally not on the same processor, maybe not
even on the same node. So there will be cacheline transfers in most
cases, and any possible improvement will be minimal and maybe not even
generally measurable.

If you want to do anything in this area, first remove the global
mutex. Then really measure with real-world applications. And I don't
mean specially designed HPC apps which assign threads/processes to
processors or nodes. Those are a special case of a special case.

2006-08-08 15:28:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Ulrich Drepper wrote:
> On 8/8/06, Eric Dumazet <[email protected]> wrote:
>
>> The validity of the virtual address is still tested by normal get_user()
>> call.. If the memory was freed by a thread, then a normal EFAULT error
>> will
>> be reported... eventually.
>
>
> This is indeed what should be done. Private futexes are the by far
> more frequent case and I bet you'd see improvements when avoiding the
> mm mutex even for normal machines since futexes really are everywhere.
> For shared mutexes you end up doing two lookups and that's fine IMO
> as long as the first lookup is fast.

The private futex's namespace is its virtual address, so I don't see
how you can decouple that from the management of virtual addresses.

Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree. Then to wake
up a process and remove the futex, you need to take the rbtree lock. Or
to unmap any memory you also need to take the rbtree lock and ensure
there are no futexes there.

So you just add another lock for no reason, or have I got a few screws
loose myself? I don't see how you can significantly reduce lock
cacheline bouncing in a futex heavy workload if you're just going to
add another shared data structure. But if you can, sweet ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-08-08 15:36:26

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On 8/8/06, Nick Piggin <[email protected]> wrote:
> Let me get this straight: to insert a contended futex into your rbtree,
> you need to hold the mmap sem to ensure that address remains valid,
> then you need to take a lock which protects your rbtree.

Why does it have to remain valid? As long as the kernel doesn't crash
on any of the operations associated with the futex syscalls let the
address space region explode, implode, whatever. It's a bug in the
program if the address region is changed while a futex is placed
there. If the futex syscall hangs forever or returns with a bogus
state (error or even success) this is perfectly acceptable. We
shouldn't slow down correct uses just to make it possible for broken
programs to receive a more detailed error description.

2006-08-08 16:08:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 17:11, Nick Piggin wrote:
> Ulrich Drepper wrote:
> > On 8/8/06, Eric Dumazet <[email protected]> wrote:
> >> The validity of the virtual address is still tested by normal get_user()
> >> call.. If the memory was freed by a thread, then a normal EFAULT error
> >> will
> >> be reported... eventually.
> >
> > This is indeed what should be done. Private futexes are the by far
> > more frequent case and I bet you'd see improvements when avoiding the
> > mm mutex even for normal machines since futexes really are everywhere.
> > For shared mutexes you end up doing two lookups and that's fine IMO
> > as long as the first lookup is fast.
>
> The private futex's namespace is its virtual address, so I don't see
> how you can decouple that from the management of virtual addresses.
>
> Let me get this straight: to insert a contended futex into your rbtree,
> you need to hold the mmap sem to ensure that address remains valid,
> then you need to take a lock which protects your rbtree. Then to wake
> up a process and remove the futex, you need to take the rbtree lock. Or
> to unmap any memory you also need to take the rbtree lock and ensure
> there are no futexes there.
>
> So you just add another lock for no reason, or have I got a few screws
> loose myself? I don't see how you can significantly reduce lock
> cacheline bouncing in a futex heavy workload if you're just going to
> add another shared data structure. But if you can, sweet ;)

We certainly can. But if you insist on using the mmap sem at all, then we
have a problem.

An rbtree would not reduce cacheline bouncing, so:

We could use a hashtable (allocated on demand) of size N, N depending on
NR_CPUS for example, each chain protected by a private spinlock. If N is
well chosen, we might reduce lock cacheline bouncing (different threads
fighting over different private futexes would have a good chance of
getting different cachelines in this hashtable).

As soon as a process enters the 'private futex' code, the futex code
allocates this hashtable if the process has a NULL hash table (set to NULL
at exec() time, or maybe re-allocated, because we want to be sure the futex
syscall always succeeds (no ENOMEM)).

So we really can... but only for 'private futexes', which are the vast
majority of futexes needed by a typical program (using the POSIX pshared
thread mutex attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL
glibc).

Of course we would need a new syscall, and to change glibc to actually use
this new private_futex syscall.

Probably a lot of work, still, but it could help heavily threaded programs
avoid touching mmap_sem.

We might have a refcounting problem on this 'hashtable' since several
threads share this structure, but only at thread creation/destruction, not
in the futex call (i.e. no cacheline bouncing on the refcount).

Eric

2006-08-08 16:22:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Ulrich Drepper wrote:
> On 8/8/06, Nick Piggin <[email protected]> wrote:
>
>> Let me get this straight: to insert a contended futex into your rbtree,
>> you need to hold the mmap sem to ensure that address remains valid,
>> then you need to take a lock which protects your rbtree.
>
>
> Why does it have to remain valid? As long as the kernel doesn't crash
> on any of the operations associated with the futex syscalls let the
> address space region explode, implode, whatever. It's a bug in the
> program if the address region is changed while a futex is placed
> there. If the futex syscall hangs forever or returns with a bogus
> state (error or even success) this is perfectly acceptable. We

I was thinking of mremap (no, that's already kind of messed up); or
even just getting consistency in failures (e.g. so you don't have
the situation that a futex op can succeed on a previously
unmapped region).

If you're not worried about the latter, then it might work...

I didn't initially click that the private futex API operates
purely on tokens rather than virtual memory... comments in
futex.c talk about futexes being hashed to a particular
physical page (which is the case for shared). That's whacked.

So actually you would change semantics in some weird corner
cases, like mremapping a shared futex over a private futex's
address. Arguably that's broken, though ;)

> shouldn't slow down correct uses just to make it possible for broken
> programs to receive a more detailed error description.
>

No we shouldn't slow them down. I'd be interested to see whether
locking is significantly sped up with this new data structure,
though.

You might also slow down due to the fact that you'd have to do the
locking and unconditionally traverse the private futexes even for
shared futexes.

--
SUSE Labs, Novell Inc.

2006-08-08 16:26:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Nick Piggin wrote:

> No we shouldn't slow them down. I'd be interested to see whether
> locking is significantly sped up with this new data structure,
> though.

OTOH, maybe you don't need a new data structure. Maybe you could
use the hash and check that for a match on a private futex before
trying to find a possible shared futex.

Locking I guess becomes no more of a problem than now, and in some
cases maybe much less. So OK, I stand corrected.

--
SUSE Labs, Novell Inc.

2006-08-08 16:34:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

Eric Dumazet wrote:

> We certainly can. But if you insist of using mmap sem at all, then we have a
> problem.
>
> rbtree would not reduce cacheline bouncing, so :
>
> We could use a hashtable (allocated on demand) of size N, N depending on
> NR_CPUS for example. each chain protected by a private spinlock. If N is well
> chosen, we might reduce lock cacheline bouncing. (different threads fighting
> on different private futexes would have a good chance to get different
> cachelines in this hashtable)

See other mail. We already have a hash table ;)

--
SUSE Labs, Novell Inc.

2006-08-08 16:49:31

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
> Eric Dumazet wrote:
> > We certainly can. But if you insist of using mmap sem at all, then we
> > have a problem.
> >
> > rbtree would not reduce cacheline bouncing, so :
> >
> > We could use a hashtable (allocated on demand) of size N, N depending on
> > NR_CPUS for example. each chain protected by a private spinlock. If N is
> > well chosen, we might reduce lock cacheline bouncing. (different threads
> > fighting on different private futexes would have a good chance to get
> > different cachelines in this hashtable)
>
> See other mail. We already have a hash table ;)

Yes, but you still want, at FUTEX_WAIT time, to tell the kernel the futex
is private to this process.

Giving the same info at FUTEX_WAKE time could save the kernel from making
the second pass (using only a private futex lookup), again avoiding the
mmap_sem touch in case no threads are waiting anymore on this futex.

Eric

2006-08-08 16:49:57

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On 8/8/06, Nick Piggin <[email protected]> wrote:
> I thought mremap (no, that's already kind of messed up); or
> even just getting consistency in failures (eg. so you don't have
> the situation that a futex op can succeed on a previously
> unmapped region).
>
> If you're not worried about the latter, then it might work...

I'm not the least bit worried about this. It's 100% the application's
fault. You cannot touch an address space region while it's in use, e.g.,
for mutexes.


> I didn't initially click that the private futex API operates
> purely on tokens rather than virtual memory...

I haven't looked at the code in some time, but I thought this got
clarified in the comments. For waiting on private mutexes we need
nothing but the address value itself. There is the FUTEX_WAKE_OP
operation which will also write to memory, but this is only on the waker
side, and if the memory mapping is gone, just flag an error. It's
another program error which shouldn't in any way slow down normal
operations.

2006-08-08 16:58:12

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [RFC] NUMA futex hashing

On 8/8/06, Eric Dumazet <[email protected]> wrote:
> So we really can... but for 'private futexes' which are the vast majority of
> futexes needed by typical program (using POSIX pshared thread mutex attribute
> PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)

Nonsense. Mutexes are by default always private. They explicitly
have to be marked as sharable. This happens using the
pthread_mutexattr_setpshared function which takes
PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
parameter. So the former _is_ clearly used.


> Of course we would need a new syscall, and to change glibc to be able to
> actually use this new private_futex syscall.

No, why? The kernel already recognizes private mutexes. It just
checks whether the pages used to store them are private or mapped. This
requires some interaction with the memory subsystem, but as long as no
crashes happen the data can change underneath. It's the program's
fault if it does.

On the waker side you would search the local futex hash table/tree
first and if this doesn't yield a match, search the global table.
Wakeup calls without any waiters are usually rare.

2006-08-08 16:59:08

by Eric Dumazet

Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 18:49, Eric Dumazet wrote:
> On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
> > Eric Dumazet wrote:
> > > We certainly can. But if you insist on using mmap sem at all, then we
> > > have a problem.
> > >
> > > rbtree would not reduce cacheline bouncing, so :
> > >
> > > We could use a hashtable (allocated on demand) of size N, N depending
> > > on NR_CPUS for example. each chain protected by a private spinlock. If
> > > N is well chosen, we might reduce lock cacheline bouncing. (different
> > > threads fighting on different private futexes would have a good chance
> > > to get different cachelines in this hashtable)
> >
> > See other mail. We already have a hash table ;)
>
> Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
> private to this process.
>
> Giving the same info at FUTEX_WAKE time would let the kernel skip the
> second pass (doing only a private futex lookup), again avoiding the
> mmap_sem touch when no threads are waiting on this futex anymore.

After looking at kernel/futex.c, I realize we can also avoid the atomic ops
(and another cacheline bounce) done in get_key_refs()/drop_key_refs(), which
touch the inode i_count or mm_count refcounters.

Eric

2006-08-08 17:08:21

by Eric Dumazet

Subject: Re: [RFC] NUMA futex hashing

On Tuesday 08 August 2006 18:58, Ulrich Drepper wrote:
> On 8/8/06, Eric Dumazet <[email protected]> wrote:
> > So we really can... but for 'private futexes' which are the vast majority
> > of futexes needed by typical program (using POSIX pshared thread mutex
> > attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)
>
> Nonsense. Mutexes are by default always private. They explicitly
> have to be marked as sharable. This happens using the
> pthread_mutexattr_setpshared function which takes
> PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
> parameter. So the former _is_ clearly used.
>

I was saying that the PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED info
is not provided to the kernel (because the futex API/implementation doesn't
need it). It was not an attack on glibc.

> > Of course we would need a new syscall, and to change glibc to be able to
> > actually use this new private_futex syscall.
>
> No, why? The kernel already does recognize private mutexes. It just
> checks whether the pages used to store it are private or mapped. This
> requires some interaction with the memory subsystem but as long as no
> crashes happen the data can change underneath. It's the program's
> fault if it does.

But if you let the futex code do the VMA walk to check the private/shared
status, you still need the mmap_sem locking.

Moreover, a program can mmap() a file (shared in terms of VMA) and continue
to use a PTHREAD_PROCESS_PRIVATE mutex lying in this shared zone (for
example, a shmem or hugetlb mapping, whose API might always give a 'shared'
vma).

>
> On the waker side you would search the local futex hash table/tree
> first and if this doesn't yield a match, search the global table.
> Wakeup calls without any waiters are usually rare.

If the two searches touch two different cache lines in the hash table, we
might have a performance regression. Of course we might choose a hash
function so that the same slot is accessed.

Eric

2006-08-08 20:29:39

by Ravikiran G Thirumalai

Subject: Re: [RFC] NUMA futex hashing

On Tue, Aug 08, 2006 at 11:14:49AM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote:
> >
>
> Your patch seems fine, but I have one comment.
>
> For a non-NUMA machine, we would have one useless indirection to get the
> futex_queues pointer.
>
> static struct futex_hash_bucket *futex_queues[1];
>
> I think it is worth redesigning your patch so that this extra indirection
> is needed only on NUMA machines.

Yes. Will do in the next iteration.

Thanks,
Kiran

2006-08-09 00:11:25

by Ravikiran G Thirumalai

Subject: Re: [RFC] NUMA futex hashing

On Tue, Aug 08, 2006 at 12:10:39PM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> > Ravikiran G Thirumalai <[email protected]> writes:
> > > Current futex hash scheme is not the best for NUMA. The futex hash
> > > table is an array of struct futex_hash_bucket, which is just a spinlock
> > > and a list_head -- this means multiple spinlocks on the same cacheline
> > > and on NUMA machines, on the same internode cacheline. If futexes of two
> > > unrelated threads running on two different nodes happen to hash onto
> > > adjacent hash buckets, or buckets on the same internode cacheline, then
> > > we have the internode cacheline bouncing between nodes.
> >
> > When I did some testing with a (arguably far too lock intensive) benchmark
> > on a bigger box I got most bouncing cycles not in the futex locks itself,
> > but in the down_read on the mm semaphore.
>
> This is true, even with a normal application (not a biased benchmark) and
> using oprofile. mmap_sem is the killer.

Not if two threads of two different processes (so no common mmap_sem) hash
onto futexes on the same cacheline. But agreed, mmap_sem needs to be fixed
too. If everyone agrees on a per-process hash table for private futexes,
then we will work on that approach.

Thanks,
Kiran

2006-08-09 01:56:32

by Nick Piggin

Subject: Re: [RFC] NUMA futex hashing

Eric Dumazet wrote:

>On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
>
>>Eric Dumazet wrote:
>>
>>>We certainly can. But if you insist on using mmap sem at all, then we
>>>have a problem.
>>>
>>>rbtree would not reduce cacheline bouncing, so :
>>>
>>>We could use a hashtable (allocated on demand) of size N, N depending on
>>>NR_CPUS for example. each chain protected by a private spinlock. If N is
>>>well chosen, we might reduce lock cacheline bouncing. (different threads
>>>fighting on different private futexes would have a good chance to get
>>>different cachelines in this hashtable)
>>>
>>See other mail. We already have a hash table ;)
>>
>
>Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
>private to this process.
>

Yes, but I'm saying we already have a hash table. The hash table.

I'm *not* saying you *don't* also want a private directive from userspace.
--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-08-09 01:58:31

by Nick Piggin

Subject: Re: [RFC] NUMA futex hashing

Ulrich Drepper wrote:

>> Of course we would need a new syscall, and to change glibc to be able to
>> actually use this new private_futex syscall.
>
>
> No, why? The kernel already does recognize private mutexes. It just
> checks whether the pages used to store it are private or mapped. This
> requires some interaction with the memory subsystem but as long as no
> crashes happen the data can change underneath. It's the program's
> fault if it does.


Because that requires taking mmap_sem. Avoiding that is the whole
purpose, isn't it?


2006-08-09 06:26:33

by Eric Dumazet

Subject: Re: [RFC] NUMA futex hashing

Based on various discussions and feedback, I cooked a patch that implements
the notion of private futexes (private to a process, in the spirit of the
POSIX pshared attribute PTHREAD_PROCESS_PRIVATE).

[PATCH] futex : Add new PRIVATE futex primitives for performance improvements

When a futex is privately used by a process, we don't really need to look up
the list of VMAs of the process in order to discover whether the futex is
backed by an inode or by the mm struct. We don't really need to keep a
refcount on the inode or mm.

This patch introduces new futex calls, which could be used by userland (glibc
of course) when private futexes are used.

Avoiding the VMA lookup means avoiding taking the mmap_sem (and forcing
cacheline bouncing).

Avoiding refcounting on the underlying inode or mm struct also avoids
cacheline bouncing.

That's two cacheline bounces avoided per futex syscall.

glibc could use the new futex primitives introduced here (in particular for
the PTHREAD_PROCESS_PRIVATE semantic), and fall back to the old ones when
running on an older kernel. The fallback could set a global variable on the
first failed syscall so that only one syscall fails in the process lifetime.

Note: compatibility should be maintained by this patch, as old applications
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <[email protected]>


Attachments:
(No filename) (1.32 kB)
futex_priv1.patch (9.37 kB)

2006-08-09 06:43:58

by Eric Dumazet

Subject: Re: [RFC] NUMA futex hashing

(patch inlined this time)
Based on various discussions and feedback, I cooked a patch that implements
the notion of private futexes (private to a process, in the spirit of the
POSIX pshared attribute PTHREAD_PROCESS_PRIVATE).

[PATCH] futex : Add new PRIVATE futex primitives for performance improvements

When a futex is privately used by a process, we don't really need to look up
the list of VMAs of the process in order to discover whether the futex is
backed by an inode or by the mm struct. We don't really need to keep a
refcount on the inode or mm.

This patch introduces new futex calls, which could be used by userland (glibc
of course) when private futexes are used.

Avoiding the VMA lookup means avoiding taking the mmap_sem (and forcing
cacheline bouncing).

Avoiding refcounting on the underlying inode or mm struct also avoids
cacheline bouncing.

That's two cacheline bounces avoided per futex syscall.

glibc could use the new futex primitives introduced here (in particular for
the PTHREAD_PROCESS_PRIVATE semantic), and fall back to the old ones when
running on an older kernel. The fallback could set a global variable on the
first failed syscall so that only one syscall fails in the process lifetime.

Note: compatibility should be maintained by this patch, as old applications
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <[email protected]>
--- linux-2.6.18-rc4/include/linux/futex.h 2006-08-08 22:46:13.000000000 +0200
+++ linux-2.6.18-rc4-ed/include/linux/futex.h 2006-08-08 23:23:13.000000000 +0200
@@ -15,6 +15,11 @@
#define FUTEX_LOCK_PI 6
#define FUTEX_UNLOCK_PI 7
#define FUTEX_TRYLOCK_PI 8
+#define FUTEX_WAIT_PRIVATE 9
+#define FUTEX_WAKE_PRIVATE 10
+#define FUTEX_REQUEUE_PRIVATE 11
+#define FUTEX_CMP_REQUEUE_PRIVATE 12
+#define FUTEX_WAKE_OP_PRIVATE 13

/*
* Support for robust futexes: the kernel cleans up held futexes at
--- linux-2.6.18-rc4/kernel/futex.c 2006-08-08 22:45:46.000000000 +0200
+++ linux-2.6.18-rc4-ed/kernel/futex.c 2006-08-09 07:26:19.000000000 +0200
@@ -60,8 +60,9 @@
* Don't rearrange members without looking at hash_futex().
*
* offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
*/
+#define OFF_INODE 1 /* We set bit 0 if it has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */
union futex_key {
struct {
unsigned long pgoff;
@@ -79,6 +80,8 @@
int offset;
} both;
};
+#define FUT_SHARED 1 /* we should walk vmas */
+#define FUT_PRIVATE 0 /* private futex: no need to walk vmas*/

/*
* Priority Inheritance state:
@@ -140,7 +143,7 @@
static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];

/* Futex-fs vfsmount entry: */
-static struct vfsmount *futex_mnt;
+static struct vfsmount *futex_mnt __read_mostly;

/*
* We hash on the keys returned from get_futex_key (see below).
@@ -175,7 +178,7 @@
*
* Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared)
{
unsigned long address = (unsigned long)uaddr;
struct mm_struct *mm = current->mm;
@@ -191,6 +194,11 @@
return -EINVAL;
address -= key->both.offset;

+ if (shared == FUT_PRIVATE) {
+ key->private.mm = mm;
+ key->private.address = address;
+ return 0;
+ }
/*
* The futex is hashed differently depending on whether
* it's in a shared or private mapping. So check vma first.
@@ -215,6 +223,7 @@
* mappings of _writable_ handles.
*/
if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+ key->both.offset += OFF_MMSHARED; /* reference taken on mm */
key->private.mm = mm;
key->private.address = address;
return 0;
@@ -224,7 +233,7 @@
* Linear file mappings are also simple.
*/
key->shared.inode = vma->vm_file->f_dentry->d_inode;
- key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ key->both.offset += OFF_INODE; /* reference taken on inode */
if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
+ vma->vm_pgoff);
@@ -256,8 +265,8 @@
*/
static inline void get_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+ if (key->both.offset & OFF_INODE)
atomic_inc(&key->shared.inode->i_count);
else
atomic_inc(&key->private.mm->mm_count);
@@ -270,8 +279,8 @@
*/
static void drop_key_refs(union futex_key *key)
{
- if (key->both.ptr != 0) {
- if (key->both.offset & 1)
+ if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+ if (key->both.offset & OFF_INODE)
iput(key->shared.inode);
else
mmdrop(key->private.mm);
@@ -650,7 +659,7 @@
* Wake up all waiters hashed on the physical page that is mapped
* to this virtual address:
*/
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake, int shared)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
@@ -658,9 +667,10 @@
union futex_key key;
int ret;

- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, &key, shared);
if (unlikely(ret != 0))
goto out;

@@ -682,7 +692,8 @@

spin_unlock(&hb->lock);
out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);
return ret;
}

@@ -692,7 +703,7 @@
*/
static int
futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
- int nr_wake, int nr_wake2, int op)
+ int nr_wake, int nr_wake2, int op, int shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -703,10 +714,10 @@
retryfull:
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret != 0))
goto out;

@@ -802,7 +813,7 @@
* physical page.
*/
static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
- int nr_wake, int nr_requeue, u32 *cmpval)
+ int nr_wake, int nr_requeue, u32 *cmpval, int shared)
{
union futex_key key1, key2;
struct futex_hash_bucket *hb1, *hb2;
@@ -811,12 +822,13 @@
int ret, drop_count = 0;

retry:
- down_read(&current->mm->mmap_sem);
+ if (shared)
+ down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr1, &key1);
+ ret = get_futex_key(uaddr1, &key1, shared);
if (unlikely(ret != 0))
goto out;
- ret = get_futex_key(uaddr2, &key2);
+ ret = get_futex_key(uaddr2, &key2, shared);
if (unlikely(ret != 0))
goto out;

@@ -839,7 +851,8 @@
* If we would have faulted, release mmap_sem, fault
* it in and start all over again.
*/
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);

ret = get_user(curval, uaddr1);

@@ -888,7 +901,8 @@
drop_key_refs(&key1);

out:
- up_read(&current->mm->mmap_sem);
+ if (shared)
+ up_read(&current->mm->mmap_sem);
return ret;
}

@@ -999,7 +1013,7 @@
drop_key_refs(&q->key);
}

-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int shared)
{
struct task_struct *curr = current;
DECLARE_WAITQUEUE(wait, curr);
@@ -1010,9 +1024,10 @@

q.pi_state = NULL;
retry:
- down_read(&curr->mm->mmap_sem);
+ if (shared)
+ down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, &q.key, shared);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1047,7 +1062,8 @@
* If we would have faulted, release mmap_sem, fault it in and
* start all over again.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);

ret = get_user(uval, uaddr);

@@ -1066,7 +1082,8 @@
* Now the futex is queued and we have checked the data, we
* don't want to hold mmap_sem while we sleep.
*/
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);

/*
* There might have been scheduling since the queue_me(), as we
@@ -1108,7 +1125,8 @@
queue_unlock(&q, hb);

out_release_sem:
- up_read(&curr->mm->mmap_sem);
+ if (shared)
+ up_read(&curr->mm->mmap_sem);
return ret;
}

@@ -1134,7 +1152,7 @@
retry:
down_read(&curr->mm->mmap_sem);

- ret = get_futex_key(uaddr, &q.key);
+ ret = get_futex_key(uaddr, &q.key, FUT_SHARED);
if (unlikely(ret != 0))
goto out_release_sem;

@@ -1435,7 +1453,7 @@
*/
down_read(&current->mm->mmap_sem);

- ret = get_futex_key(uaddr, &key);
+ ret = get_futex_key(uaddr, &key, FUT_SHARED);
if (unlikely(ret != 0))
goto out;

@@ -1551,7 +1569,7 @@
return ret;
}

-static struct file_operations futex_fops = {
+static const struct file_operations futex_fops = {
.release = futex_close,
.poll = futex_poll,
};
@@ -1600,7 +1618,7 @@
q->pi_state = NULL;

down_read(&current->mm->mmap_sem);
- err = get_futex_key(uaddr, &q->key);
+ err = get_futex_key(uaddr, &q->key, FUT_SHARED);

if (unlikely(err != 0)) {
up_read(&current->mm->mmap_sem);
@@ -1742,7 +1760,7 @@
*/
if (!pi) {
if (uval & FUTEX_WAITERS)
- futex_wake(uaddr, 1);
+ futex_wake(uaddr, 1, FUT_SHARED);
}
}
return 0;
@@ -1830,23 +1848,38 @@

switch (op) {
case FUTEX_WAIT:
- ret = futex_wait(uaddr, val, timeout);
+ ret = futex_wait(uaddr, val, timeout, FUT_SHARED);
+ break;
+ case FUTEX_WAIT_PRIVATE:
+ ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE);
break;
case FUTEX_WAKE:
- ret = futex_wake(uaddr, val);
+ ret = futex_wake(uaddr, val, FUT_SHARED);
+ break;
+ case FUTEX_WAKE_PRIVATE:
+ ret = futex_wake(uaddr, val, FUT_PRIVATE);
break;
case FUTEX_FD:
/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
ret = futex_fd(uaddr, val);
break;
case FUTEX_REQUEUE:
- ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+ ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED);
+ break;
+ case FUTEX_REQUEUE_PRIVATE:
+ ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE);
break;
case FUTEX_CMP_REQUEUE:
- ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+ ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED);
+ break;
+ case FUTEX_CMP_REQUEUE_PRIVATE:
+ ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE);
break;
case FUTEX_WAKE_OP:
- ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+ ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED);
+ break;
+ case FUTEX_WAKE_OP_PRIVATE:
+ ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE);
break;
case FUTEX_LOCK_PI:
ret = futex_lock_pi(uaddr, val, timeout, val2, 0);