Patch to have a process private hash table for 'PRIVATE' futexes.
On large core-count systems, running multiple threaded processes causes
false sharing on the global futex hash table. The global futex hash
table is an array of struct futex_hash_bucket, which is defined as:
struct futex_hash_bucket {
spinlock_t lock;
struct plist_head chain;
};
static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
Needless to say, this causes multiple spinlocks to reside on the
same cacheline, which is very bad when multiple unrelated processes
hash onto adjacent hash buckets. The probability of unrelated futexes
ending up on adjacent hash buckets increases with the number of cores in
the system (more cores available translates to more processes/threads
being run on a system). The effects of false sharing are tangible on
machines with more than 32 cores. We have noticed this with the workloads
of certain multithreaded FEA (Finite Element Analysis) solvers.
We reported this problem a couple of years ago, which eventually resulted
in a new API for private futexes that avoids mmap_sem. Fixing the false
sharing on the global futex hash was put off pending glibc changes to
accommodate the private futex API. Now that the glibc changes are in, and
multicore is more prevalent, maybe it is time to fix this problem.
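(For readers unfamiliar with that API, here is a minimal sketch -- an
illustration, not part of the patch -- of how userspace reaches the
private-futex paths. FUTEX_WAIT_PRIVATE/FUTEX_WAKE_PRIVATE are simply the
classic ops with FUTEX_PRIVATE_FLAG or'ed in, telling the kernel the futex
is never shared across address spaces:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wait_private(int *uaddr, int val)
{
	/* Block while *uaddr == val; the kernel can skip the shared-mapping work */
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val,
		       NULL, NULL, 0);
}

static long futex_wake_private(int *uaddr, int nr)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE_PRIVATE, nr, NULL, NULL, 0);
}

NPTL uses these ops for process-local pthread mutexes and condition
variables, which is why thread-heavy processes hammer the futex hash.)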
The root cause of the problem is the use of a global futex hash table even
for process-private futexes. Process-private futexes can be hashed on
process-private hash tables, avoiding the global hash and a longer hash
table walk when there are a lot more futexes in the workload. However, this
adds one extra pointer to the mm_struct. Hence, this implementation of a
process private hash table is gated behind a config option, which can be
turned off for smaller core-count systems. Furthermore, a subsequent patch
will introduce a sysctl to dynamically turn on private futex hash tables.
We found this patch to improve the runtime of a certain FEA solver by about
15% on a 32-core vSMP system.
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: linux-2.6.28.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/mm_types.h 2009-03-11 16:52:06.000000000 -0800
+++ linux-2.6.28.6/include/linux/mm_types.h 2009-03-11 16:52:23.000000000 -0800
@@ -256,6 +256,10 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+ /* Process private futex hash table */
+ struct futex_hash_bucket *htb;
+#endif
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6.28.6/init/Kconfig
===================================================================
--- linux-2.6.28.6.orig/init/Kconfig 2009-03-11 16:52:06.000000000 -0800
+++ linux-2.6.28.6/init/Kconfig 2009-03-18 17:06:23.000000000 -0800
@@ -672,6 +672,14 @@ config FUTEX
support for "fast userspace mutexes". The resulting kernel may not
run glibc-based applications correctly.
+config PROCESS_PRIVATE_FUTEX
+ bool "Process private futexes" if FUTEX
+ default n
+ help
+ This option enables the ability to use per-process hash tables for
+ private futexes. This makes sense on large core-count systems (more
+ than 32 cores).
+
config ANON_INODES
bool
Index: linux-2.6.28.6/kernel/fork.c
===================================================================
--- linux-2.6.28.6.orig/kernel/fork.c 2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/kernel/fork.c 2009-03-12 17:12:40.000000000 -0800
@@ -424,6 +424,7 @@ static struct mm_struct * mm_init(struct
return mm;
}
+ free_futex_htb(mm);
free_mm(mm);
return NULL;
}
Index: linux-2.6.28.6/kernel/futex.c
===================================================================
--- linux-2.6.28.6.orig/kernel/futex.c 2009-03-11 16:52:13.000000000 -0800
+++ linux-2.6.28.6/kernel/futex.c 2009-03-18 17:36:04.000000000 -0800
@@ -140,15 +140,84 @@ static inline void futex_unlock_mm(struc
up_read(fshared);
}
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+static void free_htb(struct futex_hash_bucket *htb)
+{
+ if (htb != futex_queues)
+ kfree(htb);
+}
+
+void free_futex_htb(struct mm_struct *mm)
+{
+ free_htb(mm->htb);
+}
+
+static void alloc_htb(struct mm_struct *mm)
+{
+ struct futex_hash_bucket *htb;
+ int i;
+ /*
+ * Allocate and install a private hash table of the
+ * same size as the global hash table. We fall
+ * back onto the global hash on allocation failure
+ */
+ htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
+ if (!htb)
+ htb = futex_queues;
+ else {
+ for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ plist_head_init(&htb[i].chain, &htb[i].lock);
+ spin_lock_init(&htb[i].lock);
+ }
+ }
+ /* Install the hash table */
+ spin_lock(&mm->page_table_lock);
+ if (mm->htb) {
+ /* Another thread installed the hash table */
+ spin_unlock(&mm->page_table_lock);
+ free_htb(htb);
+ } else {
+ mm->htb = htb;
+ spin_unlock(&mm->page_table_lock);
+ }
+
+}
+
+static struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
+{
+ struct mm_struct *mm;
+ if (key->both.offset & FUT_OFF_INODE)
+ /* Shared inode based mapping uses global hash */
+ return futex_queues;
+ /*
+ * Private futexes -- This covers both FUTEX_PRIVATE_FLAG
+ * and 'mm' only private futexes
+ */
+
+ mm = current->mm;
+ if (unlikely(!mm->htb))
+ alloc_htb(mm);
+ return mm->htb;
+}
+#else
+static inline
+struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
+{
+ return futex_queues;
+}
+#endif
/*
* We hash on the keys returned from get_futex_key (see below).
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
- u32 hash = jhash2((u32*)&key->both.word,
+ struct futex_hash_bucket *htb;
+ u32 hash;
+ htb = get_futex_hashtable(key);
+ hash = jhash2((u32 *)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &htb[hash & ((1 << FUTEX_HASHBITS)-1)];
}
/*
Index: linux-2.6.28.6/include/linux/futex.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/futex.h 2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/include/linux/futex.h 2009-03-18 16:59:27.000000000 -0800
@@ -176,6 +176,15 @@ static inline void exit_pi_state_list(st
{
}
#endif
+
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+extern void free_futex_htb(struct mm_struct *mm);
+#else
+static inline void free_futex_htb(struct mm_struct *mm)
+{
+ return;
+}
+#endif
#endif /* __KERNEL__ */
#define FUTEX_OP_SET 0 /* *(int *)UADDR2 = OPARG; */
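As an aside, the false sharing is easy to provoke without an FEA solver.
The sketch below (a hypothetical reproducer of ours, not the workload from
the measurements) runs several unrelated processes, each with threads
contending on a process-local pthread mutex, so the futex_wait()/futex_wake()
traffic of all processes lands in the shared global hash:

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define THREADS 8
#define ITERS   1000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
	for (long i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&lock);	/* futex_wait() under contention */
		counter++;
		pthread_mutex_unlock(&lock);	/* futex_wake() under contention */
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 4;

	for (int p = 0; p < nproc; p++) {
		if (fork() == 0) {	/* each child is an unrelated process */
			pthread_t t[THREADS];
			for (int i = 0; i < THREADS; i++)
				pthread_create(&t[i], NULL, worker, NULL);
			for (int i = 0; i < THREADS; i++)
				pthread_join(t[i], NULL);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;	/* reap the children */
	return 0;
}

Built with -lpthread and run with enough processes to cover the machine, the
per-bucket spinlocks show up in profiles of futex_wake()/futex_wait(); with
CONFIG_PROCESS_PRIVATE_FUTEX each process gets its own buckets instead.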
The following patch introduces a sysctl to control whether a process-private
hash table will be used for process-private futexes.
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: linux-2.6.28.6/include/linux/futex.h
===================================================================
--- linux-2.6.28.6.orig/include/linux/futex.h 2009-03-18 16:59:27.000000000 -0800
+++ linux-2.6.28.6/include/linux/futex.h 2009-03-18 17:49:02.000000000 -0800
@@ -179,6 +179,7 @@ static inline void exit_pi_state_list(st
#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
extern void free_futex_htb(struct mm_struct *mm);
+extern int sysctl_private_hash;
#else
static inline void free_futex_htb(struct mm_struct *mm)
{
Index: linux-2.6.28.6/kernel/futex.c
===================================================================
--- linux-2.6.28.6.orig/kernel/futex.c 2009-03-18 17:36:04.000000000 -0800
+++ linux-2.6.28.6/kernel/futex.c 2009-03-18 17:57:13.000000000 -0800
@@ -154,14 +154,16 @@ void free_futex_htb(struct mm_struct *mm
static void alloc_htb(struct mm_struct *mm)
{
- struct futex_hash_bucket *htb;
int i;
+ struct futex_hash_bucket *htb = NULL;
/*
* Allocate and install a private hash table of the
* same size as the global hash table. We fall
- * back onto the global hash on allocation failure
+ * back onto the global hash on allocation failure, or
+ * if private futexes are disabled.
*/
- htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
+ if (sysctl_private_hash)
+ htb = kmalloc(sizeof(futex_queues), GFP_KERNEL);
if (!htb)
htb = futex_queues;
else {
@@ -183,6 +185,8 @@ static void alloc_htb(struct mm_struct *
}
+int sysctl_private_hash;
+
static struct futex_hash_bucket *get_futex_hashtable(union futex_key *key)
{
struct mm_struct *mm;
Index: linux-2.6.28.6/kernel/sysctl.c
===================================================================
--- linux-2.6.28.6.orig/kernel/sysctl.c 2009-03-18 16:59:27.000000000 -0800
+++ linux-2.6.28.6/kernel/sysctl.c 2009-03-18 17:49:02.000000000 -0800
@@ -48,6 +48,7 @@
#include <linux/acpi.h>
#include <linux/reboot.h>
#include <linux/ftrace.h>
+#include <linux/futex.h>
#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -799,6 +800,19 @@ static struct ctl_table kern_table[] = {
.strategy = &sysctl_intvec,
},
#endif
+#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "private_futex_hashtable",
+ .data = &sysctl_private_hash,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
#ifdef CONFIG_COMPAT
{
.ctl_name = KERN_COMPAT_LOG,
Index: linux-2.6.28.6/Documentation/sysctl/kernel.txt
===================================================================
--- linux-2.6.28.6.orig/Documentation/sysctl/kernel.txt 2009-02-17 09:29:27.000000000 -0800
+++ linux-2.6.28.6/Documentation/sysctl/kernel.txt 2009-03-20 12:14:15.000000000 -0800
@@ -281,6 +281,25 @@ send before ratelimiting kicks in.
==============================================================
+private_futex_hashtable:
+
+This sysctl can be used to enable processes to use a per-process
+private hash table for private futexes. Private futexes
+are typically used in threaded workloads. When this option is
+off, which is the default, threads waiting on futexes get
+hashed onto a global hash table. On large core count machines
+this turns out to be bad for performance if the workload
+consists of multiple unrelated threaded processes. Turning
+this option on will enable processes to use a private hash
+for private futexes, improving performance.
+
+Valid options are 0 and 1. A value of 0 turns off process private
+futex hash tables and a value of 1 enables private futex hash
+tables. This option will only affect processes that get created
+after the option gets changed.
+
+==============================================================
+
randomize-va-space:
This option can be used to select the type of process address
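For completeness, toggling the proposed sysctl from a program would look
like this (a sketch that assumes the patch above is applied; the path follows
from the procname in the ctl table):

#include <fcntl.h>
#include <unistd.h>

/* Enable private futex hash tables for processes created afterwards */
static int enable_private_futex_hash(void)
{
	int fd = open("/proc/sys/kernel/private_futex_hashtable", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "1", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

From a shell, echo 1 > /proc/sys/kernel/private_futex_hashtable does the
same thing.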
Ravikiran G Thirumalai wrote:
> Patch to have a process private hash table for 'PRIVATE' futexes.
>
> [...]
>
First incantation of PRIVATE_FUTEXES had process private hash table
http://lkml.org/lkml/2007/3/15/230
I don't remember the objections at that time; maybe it was that it would slow
down small users of these PRIVATE_FUTEXES, i.e. processes that may use only
one futex_wait() in their lifetime, because they'll have to allocate and
populate their private hash table.
So I dropped parts about NUMA and private hash tables to get PRIVATE_FUTEXES into mainline.
http://lwn.net/Articles/229668/
Did you try changing FUTEX_HASHBITS instead, since the current value is
really, really ridiculous?
You could also try to adapt this patch to current kernels:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-03/msg06504.html
[PATCH 3/3] FUTEX : NUMA friendly global hashtable
On NUMA machines, we should get better performance using a big futex
hashtable, allocated with vmalloc() so that it is spread across several nodes.
I chose a static size of four pages. (Very big NUMA machines have 64k page
size.)
On Fri, 20 Mar 2009 21:46:37 -0700 Ravikiran G Thirumalai <[email protected]> wrote:
> Patch to have a process private hash table for 'PRIVATE' futexes.
>
> [...]
>
> Index: linux-2.6.28.6/include/linux/mm_types.h
> ===================================================================
> --- linux-2.6.28.6.orig/include/linux/mm_types.h 2009-03-11 16:52:06.000000000 -0800
> +++ linux-2.6.28.6/include/linux/mm_types.h 2009-03-11 16:52:23.000000000 -0800
> @@ -256,6 +256,10 @@ struct mm_struct {
> #ifdef CONFIG_MMU_NOTIFIER
> struct mmu_notifier_mm *mmu_notifier_mm;
> #endif
> +#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
> + /* Process private futex hash table */
> + struct futex_hash_bucket *htb;
> +#endif
So we're effectively improving the hashing operation by splitting the
single hash table into multiple ones.
But was that the best way of speeding up the hashing operation? I'd have
thought that for some workloads, there will still be tremendous amounts of
contention for the per-mm hashtable? In which case it is but a partial fix
for certain workloads.
Whereas a more general hashing optimisation (if we can come up with it)
would benefit both types of workload?
Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
>> Patch to have a process private hash table for 'PRIVATE' futexes.
>>
>> [...]
>>
>
> First incantation of PRIVATE_FUTEXES had process private hash table
>
> http://lkml.org/lkml/2007/3/15/230
>
> [...]
>
Here is my suggested patch.
It should help both private and shared futex usage on large machines.
[PATCH] futex: Dynamically size futexes hash table
As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futex hash
table is too small for current hardware.
With 256 slots, the probability of false sharing is too high.
This patch makes its size computed at boot time instead of fixed at compile
time, depending on various factors, like the number of possible cpus.
Using vmalloc() also has better NUMA properties, since at boot time
this hash table will be spread across many nodes, instead of only one.
Signed-off-by: Eric Dumazet <[email protected]>
---
kernel/futex.c | 34 ++++++++++++++++++++++++++++++----
1 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..f636e5c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -62,7 +62,6 @@
int __read_mostly futex_cmpxchg_enabled;
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
/*
* Priority Inheritance state:
@@ -121,7 +120,13 @@ struct futex_hash_bucket {
struct plist_head chain;
};
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if defined(CONFIG_BASE_SMALL)
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif
/*
* We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +136,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & futex_hash_mask];
}
/*
@@ -2016,7 +2021,28 @@ static int __init futex_init(void)
{
u32 curval;
int i;
+#if !defined(CONFIG_BASE_SMALL)
+#if defined(CONFIG_PROVE_LOCKING)
+ unsigned int nb_slots = 256;
+#else
+ unsigned int nb_slots = roundup_pow_of_two(num_possible_cpus()) * 256;
+#endif
+ size_t sz;
+retry:
+ sz = nb_slots * sizeof(struct futex_hash_bucket);
+ if (sz > PAGE_SIZE)
+ futex_queues = vmalloc(sz);
+ else
+ futex_queues = kmalloc(sz, GFP_KERNEL);
+ if (!futex_queues) {
+ nb_slots >>= 1;
+ if (!nb_slots)
+ panic("Could not allocate futex hash table");
+ goto retry;
+ }
+ futex_hash_mask = nb_slots - 1;
+#endif
/*
* This will fail and we want it. Some arch implementations do
* runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2057,7 @@ static int __init futex_init(void)
if (curval == -EFAULT)
futex_cmpxchg_enabled = 1;
- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i <= futex_hash_mask; i++) {
plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
spin_lock_init(&futex_queues[i].lock);
}
* Eric Dumazet <[email protected]> wrote:
> @@ -2016,7 +2021,28 @@ static int __init futex_init(void)
> {
> u32 curval;
> int i;
> +#if !defined(CONFIG_BASE_SMALL)
> +#if defined(CONFIG_PROVE_LOCKING)
> + unsigned int nb_slots = 256;
> +#else
> + unsigned int nb_slots = roundup_pow_of_two(num_possible_cpus()) * 256;
> +#endif
> + size_t sz;
>
> +retry:
> + sz = nb_slots * sizeof(struct futex_hash_bucket);
> + if (sz > PAGE_SIZE)
> + futex_queues = vmalloc(sz);
> + else
> + futex_queues = kmalloc(sz, GFP_KERNEL);
> + if (!futex_queues) {
> + nb_slots >>= 1;
> + if (!nb_slots)
> + panic("Could not allocate futex hash table");
> + goto retry;
> + }
> + futex_hash_mask = nb_slots - 1;
> +#endif
no fundamental objections, and the hash sizing problem is causing
real problems and needs to be fixed, but this code sequence is too
ugly to live.
Is there really no big-hash-alloc framework in existence? I have
some vague memories of networking having met such problems in the
past - maybe they have some goodness there we could reuse
shamelessly in the futex code?
Ingo
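(For reference: the helper Ingo is alluding to does exist --
alloc_large_system_hash() in mm/page_alloc.c, already used for the inode,
dentry and TCP hashes. Its signature around 2.6.28, for comparison with the
patch Eric posts further down:

void *alloc_large_system_hash(const char *tablename,
			      unsigned long bucketsize,
			      unsigned long numentries,
			      int scale,
			      int flags,
			      unsigned int *_hash_shift,
			      unsigned int *_hash_mask,
			      unsigned long limit);

It prints the table geometry at boot and can spread the allocation across
nodes, which is exactly what is wanted here.)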
On Sat, Mar 21, 2009 at 04:35:14AM -0700, Andrew Morton wrote:
>On Fri, 20 Mar 2009 21:46:37 -0700 Ravikiran G Thirumalai <[email protected]> wrote:
>
>>
>> Index: linux-2.6.28.6/include/linux/mm_types.h
>> ===================================================================
>> --- linux-2.6.28.6.orig/include/linux/mm_types.h 2009-03-11 16:52:06.000000000 -0800
>> +++ linux-2.6.28.6/include/linux/mm_types.h 2009-03-11 16:52:23.000000000 -0800
>> @@ -256,6 +256,10 @@ struct mm_struct {
>> #ifdef CONFIG_MMU_NOTIFIER
>> struct mmu_notifier_mm *mmu_notifier_mm;
>> #endif
>> +#ifdef CONFIG_PROCESS_PRIVATE_FUTEX
>> + /* Process private futex hash table */
>> + struct futex_hash_bucket *htb;
>> +#endif
>
>So we're effectively improving the hashing operation by splitting the
>single hash table into multiple ones.
>
>But was that the best way of speeding up the hashing operation? I'd have
>thought that for some workloads, there will still be tremendous amounts of
>contention for the per-mm hashtable? In which case it is but a partial fix
>for certain workloads.
If there is tremendous contention on the per-mm hashtable, then the
workload suffers from userspace lock contention to begin with, and the right
approach would be to fix the lock contention in userspace/workload, no?
True, if a workload happens to be one process on a large core count machine
with a zillion threads, using a zillion futexes, hashing might still be bad,
but such a workload has bigger hurdles like the mmap_sem which is still
a bigger lock than the per-bucket locks of the private hash table. (Even
with use of the FUTEX_PRIVATE_FLAG, mmap_sem gets contended for page faults
and the like).
The fundamental problem we are facing right now is the private futexes
hashing onto a global hash table and then the futex subsystem
wading through the table that contains private futexes from other
*unrelated* processes. Even with better hashing/larger hash table, we do not
eliminate the possibility of private futexes from two unrelated processes
hashing onto hash buckets close enough to be on the same cache line -- if
we use one global hash table for private futexes.
(I remember doing some tests and instrumentation in the past with a
different hash function as well as more hash slots.)
Hence, it seems like a private hash table for private futexes is the right
solution. Perhaps we can reduce the hash slots for the private hash table and
increase the hash slots on the global hash (to help non-private futex
hashing)?
Thanks,
Kiran
On Sat, Mar 21, 2009 at 10:07:48AM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>>
>> We found this patch to improve the runtime of a certain FEA solver by about
>> 15% on a 32 core vSMP system.
>>
>> Signed-off-by: Ravikiran Thirumalai <[email protected]>
>> Signed-off-by: Shai Fultheim <[email protected]>
>>
>
>First incantation of PRIVATE_FUTEXES had process private hash table
>
>http://lkml.org/lkml/2007/3/15/230
>
>I don't remember the objections at that time; maybe it was that it would slow
>down small users of these PRIVATE_FUTEXES, i.e. processes that may use only
>one futex_wait() in their lifetime, because they'll have to allocate and
>populate their private hash table.
>
With the current proposal, we can still use the global futex hashes for such
workloads (with the sysctl setting).
>So I dropped parts about NUMA and private hash tables to get PRIVATE_FUTEXES into mainline
>
>http://lwn.net/Articles/229668/
>
>Did you try changing FUTEX_HASHBITS instead, since the current value is
>really, really ridiculous?
We tried that in the past, and I remember that on a 16-core machine we had
to use 32k hash slots to avoid false sharing.
>
>You could also try to adapt this patch to current kernels:
>
>http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-03/msg06504.html
>
>[PATCH 3/3] FUTEX : NUMA friendly global hashtable
>
>On NUMA machines, we should get better performance using a big futex
>hashtable, allocated with vmalloc() so that it is spread across several nodes.
>
>I chose a static size of four pages. (Very big NUMA machines have 64k page
>size.)
Yes, dynamically sizing the hash table is better (looking at the patch you
have posted), but there are still no locality guarantees here. A process
pinned to node X may still end up accessing remote memory locations while
accessing the hash table. A process private table, on the other hand, should
not have this problem. I think using a global hash for entirely process-local
objects is bad design here.
Ravikiran G Thirumalai wrote:
>>
>> Did you try changing FUTEX_HASHBITS instead, since the current value is
>> really, really ridiculous?
>
> We tried that in the past, and I remember that on a 16-core machine we had
> to use 32k hash slots to avoid false sharing.
>
>
> Yes, dynamically sizing the hash table is better (looking at the patch you
> have posted), but there are still no locality guarantees here. A process
> pinned to node X may still end up accessing remote memory locations while
> accessing the hash table. A process private table, on the other hand, should
> not have this problem. I think using a global hash for entirely process-local
> objects is bad design here.
>
>
Bad design, or bad luck... considering the kernel already uses such global
tables everywhere (dentries, inodes, tcp, ip route cache, ...).
The problem is sizing this hash table, be it private or not. You said you had
to have 32768 slots to avoid false sharing on a 16-core machine. This seems
strange to me, given we use jhash. What is the size of the cache line on your
platforms?
Say we have 32768 slots for the global hash table, and 256 slots for a private one,
you probably can have a program running slowly with this private 256 slots table,
if this program uses all available cores.
If we use a large private hash table, the setup cost is higher (we need to
initialize all the spinlocks and plist heads at each program startup), unless
we use a dedicated kmem_cache to keep a pool of preinitialized private hash
tables...
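Eric's pool idea could look roughly like the sketch below (our illustration
only, never posted as a patch; PRIV_HASH_SLOTS is a made-up constant). A slab
constructor runs once per object when a slab page is populated, and objects
return to the cache with their state intact, so the locks and plist heads
stay initialized across alloc/free cycles:

#define PRIV_HASH_SLOTS 256	/* hypothetical private-table size */

static struct kmem_cache *futex_htb_cachep;

/* Runs once per object as slab pages are created, not on every allocation */
static void futex_htb_ctor(void *addr)
{
	struct futex_hash_bucket *htb = addr;
	int i;

	for (i = 0; i < PRIV_HASH_SLOTS; i++) {
		plist_head_init(&htb[i].chain, &htb[i].lock);
		spin_lock_init(&htb[i].lock);
	}
}

static int __init futex_htb_cache_init(void)
{
	futex_htb_cachep = kmem_cache_create("futex_htb",
			PRIV_HASH_SLOTS * sizeof(struct futex_hash_bucket),
			0, SLAB_HWCACHE_ALIGN, futex_htb_ctor);
	return futex_htb_cachep ? 0 : -ENOMEM;
}

Allocation then becomes kmem_cache_alloc(futex_htb_cachep, GFP_KERNEL) with
no per-startup initialization loop, at the cost of requiring tables to be
returned quiesced; note that at 256 slots the object is around 10KB, so this
relies on higher-order slabs.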
On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>>>
>>> Did you try changing FUTEX_HASHBITS instead, since the current value is
>>> really, really ridiculous?
>>
>> We tried that in the past, and I remember that on a 16-core machine we had
>> to use 32k hash slots to avoid false sharing.
>>
>>
>> Yes, dynamically sizing the hash table is better (looking at the patch you
>> have posted), but there are still no locality guarantees here. A process
>> pinned to node X may still end up accessing remote memory locations while
>> accessing the hash table. A process private table, on the other hand, should
>> not have this problem. I think using a global hash for entirely process-local
>> objects is bad design here.
>>
>>
>
>
>Bad design, or bad luck... considering all kernel already use such global tables
>(dentries, inodes, tcp, ip route cache, ...).
Not necessarily. The dentry/inode/route caches need to be shared by
processes, so a global cache makes sense there -- private futexes need only
be shared between the threads of a process rather than the entire world.
>
>The problem is sizing this hash table, be it private or not. You said you had
>to have 32768 slots to avoid false sharing on a 16-core machine. This seems
>strange to me, given we use jhash. What is the size of the cache line on your
>platforms?
It is large, and it is true these bad effects get magnified with larger cache
lines. However, this does forewarn other architectures of problems such as
these. Accesses to the virtual addresses below were seen to cause cache
thrashing between nodes on a vSMP system. The eip corresponds to the
spin_lock in 'futex_wake' on a 2.6.27 kernel (see end of email).
Obviously these addresses correspond to the spinlocks on the hash buckets,
and this was a threaded FEA solver workload on a 32-core machine.
As can be seen, this is a problem even on a machine with a 64-byte cacheline.
>
>Say we have 32768 slots for the global hash table, and 256 slots for a private one,
>you probably can have a program running slowly with this private 256 slots table,
>if this program uses all available cores.
True, as I replied to akpm in this thread, if a workload happens to be one
multi-threaded process with a zillion threads, the workload will have bigger
overheads due to the sharing of the process address space and mmap_sem. At
least that has been our experience so far. Private futex hashes solve the
problem at hand.
>
>If we use a large private hash table, the setup cost is higher (we need to
>initialize all the spinlocks and plist heads at each program startup), unless
>we use a dedicated kmem_cache to keep a pool of preinitialized private hash
>tables...
>
Hmm! How about
a) Reducing the hash table size for the private futex hash and increasing the
   hash table size for the global hash?
OR, better,
b) Since it is multiple spinlocks on the same cacheline that are the PITA
   here, how about keeping the global table, but adding a dereference
   to each hash slot, and interleaving the adjacent hash buckets between
   nodes/cpus? That way, even without losing memory to padding,
   we avoid false sharing on cachelines due to unrelated futexes hashing
   onto adjacent hash buckets?
Thanks,
Kiran
Cache misses at futex_wake due to access to the following addresses:
-------------------------------------------------------------------
fffff819cc180
fffff819cc1d0
fffff819cc248
fffff819cc310
fffff819cc3b0
fffff819cc400
fffff819cc568
fffff819cc5b8
fffff819cc658
fffff819cc770
fffff819cc798
fffff819cc838
fffff819cc8d8
fffff819cc9c8
fffff819cc9f0
fffff819cca90
fffff819ccae0
fffff819ccb08
fffff819ccd38
fffff819ccd88
fffff819ccdb0
fffff819cce78
fffff819ccf18
fffff819ccfb8
fffff819cd030
fffff819cd058
fffff819cd148
fffff819cd210
fffff819cd260
fffff819cd288
fffff819cd2b0
fffff819cd300
fffff819cd350
fffff819cd3f0
fffff819cd440
fffff819cd558
fffff819cd580
fffff819cd620
fffff819cd738
fffff819cd7b0
fffff819cd7d8
fffff819cd828
fffff819cd8c8
fffff819cd8f0
fffff819cd968
fffff819cd9b8
fffff819cd9e0
fffff819cda08
fffff819cda58
fffff819cdad0
fffff819cdbc0
fffff819cdc10
fffff819cdc60
fffff819cddf0
fffff819cde68
fffff819cdfa8
fffff819cdfd0
fffff819ce020
fffff819ce048
fffff819ce070
fffff819ce098
fffff819ce0c0
fffff819ce0e8
fffff819ce110
fffff819ce1d8
fffff819ce200
fffff819ce228
fffff819ce250
fffff819ce3b8
fffff819ce430
fffff819ce480
fffff819ce5e8
fffff819ce660
fffff819ce728
fffff819ce750
fffff819ce868
Ravikiran G Thirumalai wrote:
> On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote:
> [...]
> Hmm! How about
> a) Reducing the hash table size for the private futex hash and increasing the
>    hash table size for the global hash?
>
> OR, better,
>
> b) Since it is multiple spinlocks on the same cacheline that are the PITA
>    here, how about keeping the global table, but adding a dereference
>    to each hash slot, and interleaving the adjacent hash buckets between
>    nodes/cpus? That way, even without losing memory to padding,
>    we avoid false sharing on cachelines due to unrelated futexes hashing
>    onto adjacent hash buckets?
>
Because of jhash(), futex slots are almost random. No need to try to
interleave them. Since you have a "cache line" of 4096 bytes, you need
almost 4 pages per cpu to statistically avoid false sharing.
Unfortunately, a small private futex hash will have the same scalability
problem, so it's not a general solution.
Could you try this patch, please?
[PATCH] futex: Dynamically size futexes hash table
As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futex hash
table is too small for current hardware.
With 256 slots, the probability of false sharing is too high.
This patch makes its size computed at boot time instead of fixed at compile
time, depending on various factors, like the number of possible cpus and
VSMP settings.
Using alloc_large_system_hash() also has better NUMA properties, since at
boot time the hash table will be spread across many nodes, instead of only one.
Signed-off-by: Eric Dumazet <[email protected]>
---
diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..bd9a052 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -62,7 +62,6 @@
int __read_mostly futex_cmpxchg_enabled;
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
/*
* Priority Inheritance state:
@@ -121,7 +120,13 @@ struct futex_hash_bucket {
struct plist_head chain;
};
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if defined(CONFIG_BASE_SMALL)
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif
/*
* We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +136,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & futex_hash_mask];
}
/*
@@ -2016,7 +2021,25 @@ static int __init futex_init(void)
{
u32 curval;
int i;
+#if !defined(CONFIG_BASE_SMALL)
+#if defined(CONFIG_PROVE_LOCKING)
+ unsigned int nr_slots = 256;
+#else
+ /* 4 nodes per cpu on VSMP, or 256 slots per cpu */
+ unsigned int nr_slots = max(256, (4 << INTERNODE_CACHE_SHIFT) /
+ sizeof(struct futex_hash_bucket));
+ nr_slots = roundup_pow_of_two(num_possible_cpus() * nr_slots);
+#endif
+ futex_queues = alloc_large_system_hash("futexes",
+ sizeof(struct futex_hash_bucket),
+ nr_slots,
+ 0, /* scale : unused */
+ 0, /* flags */
+ NULL, /* shift */
+ &futex_hash_mask,
+ nr_slots);
+#endif
/*
* This will fail and we want it. Some arch implementations do
* runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2054,7 @@ static int __init futex_init(void)
if (curval == -EFAULT)
futex_cmpxchg_enabled = 1;
- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i <= futex_hash_mask; i++) {
plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
spin_lock_init(&futex_queues[i].lock);
}
On Mon, Mar 23, 2009 at 10:57:55PM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>> [...]
>
>Because of jhash(), futex slots are almost random. No need to try to
>interleave them. Since you have a "cache line" of 4096 bytes, you need
>almost 4 pages per cpu to statistically avoid false sharing.
How did you come up with that number? So there is no way adjacent
cachelines will ever be used in the global hash?
On Mon, Mar 23, 2009 at 08:19:25PM -0700, Ravikiran G Thirumalai wrote:
>>Ravikiran G Thirumalai wrote:
>>>
[...]
>
>How did you come up with that number? So there is no way adjacent
>cachelines will ever be used in the global hash?
s/cachelines/buckets
Ravikiran G Thirumalai wrote:
> On Mon, Mar 23, 2009 at 10:57:55PM +0100, Eric Dumazet wrote:
>> Ravikiran G Thirumalai wrote:
> [...]
>>> [...]
>>>
>> Because of jhash(), futex slots are almost random. No need to try to
>> interleave them. Since you have a "cache line" of 4096 bytes, you need
>> almost 4 pages per cpu to statistically avoid false sharing.
>
> How did you come up with that number? So there is no way adjacent
> cachelines will ever be used in the global hash?
>
>
There is no way, unless you use one chain per 4096-byte block, and that
would be a waste of memory.
It's about statistics (assuming jhash gives us normally distributed values),
not hard, guaranteed constraints.
http://en.wikipedia.org/wiki/Standard_deviation
For a particular process, you can carefully choose (knowing the hash function
in use by the kernel) userland futexes that will *all* be hashed into a
*single* hash chain, protected by a single spinlock.
This process will then suffer from contention, using private futexes
and a private or shared hash table.
Even without knowing the hash function, there is still a small chance that
this can happen.
Please note that the hash table size has several purposes, the first being to
store potentially many futexes (say 30000 threads are waiting on different
futexes), and to avoid cache line ping-pongs or misses. This is why I chose
at least 256 slots per cpu, and not 4 cache lines per cpu.
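To put rough numbers on the statistical argument (our estimate, not from the
thread): if k "hot" futexes hash uniformly into a table spanning L distinct
cache lines, the usual birthday approximation for the chance that some line
carries two of them is

	P(collision) ~= 1 - exp(-k * (k - 1) / (2 * L))

so for k = 32 concurrently hot futexes one needs L on the order of 10^4
lines to keep that probability below about 5%, which is consistent with
sizing in pages per cpu rather than a handful of buckets.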
Eric Dumazet wrote:
> [...]
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 438701a..bd9a052 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -62,7 +62,6 @@
>
> int __read_mostly futex_cmpxchg_enabled;
>
> -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
>
> /*
> * Priority Inheritance state:
> @@ -121,7 +120,13 @@ struct futex_hash_bucket {
> struct plist_head chain;
> };
>
> -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
> +#if defined(CONFIG_BASE_SMALL)
Oops, should be
#if CONFIG_BASE_SMALL
(CONFIG_BASE_SMALL being defined to 0 or !0)
I tested this latest patch on a NUMA machine (8 cores) and checked the dmesg output:
futexes hash table entries: 2048 (order: 4, 90112 bytes)
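(Those numbers are internally consistent, for what it's worth: 8 possible
cpus * 256 slots, rounded up to a power of two, gives 2048 entries, and
90112 / 2048 = 44 bytes per struct futex_hash_bucket on that build.)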
[PATCH] futex: Dynamically size futexes hash table
As noticed by Ravikiran Thirumalai and Shai Fultheim, the global futex hash
table is too small for current hardware.
With 256 slots, the probability of false sharing is too high.
This patch makes its size computed at boot time instead of fixed at compile
time, depending on various factors, like the number of possible cpus and
VSMP settings.
Using alloc_large_system_hash() also has better NUMA properties, since at
boot time the hash table will be spread across many nodes, instead of only one.
Signed-off-by: Eric Dumazet <[email protected]>
---
diff --git a/kernel/futex.c b/kernel/futex.c
index 438701a..f8bd4c0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -55,6 +55,7 @@
#include <linux/magic.h>
#include <linux/pid.h>
#include <linux/nsproxy.h>
+#include <linux/bootmem.h>
#include <asm/futex.h>
@@ -62,7 +63,6 @@
int __read_mostly futex_cmpxchg_enabled;
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
/*
* Priority Inheritance state:
@@ -121,7 +121,13 @@ struct futex_hash_bucket {
struct plist_head chain;
};
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+#if CONFIG_BASE_SMALL
+const unsigned int futex_hash_mask = 15;
+static struct futex_hash_bucket futex_queues[16];
+#else
+unsigned int futex_hash_mask __read_mostly;
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#endif
/*
* We hash on the keys returned from get_futex_key (see below).
@@ -131,7 +137,7 @@ static struct futex_hash_bucket *hash_futex(union futex_key *key)
u32 hash = jhash2((u32*)&key->both.word,
(sizeof(key->both.word)+sizeof(key->both.ptr))/4,
key->both.offset);
- return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+ return &futex_queues[hash & futex_hash_mask];
}
/*
@@ -2016,7 +2022,25 @@ static int __init futex_init(void)
{
u32 curval;
int i;
+#if !CONFIG_BASE_SMALL
+#if defined(CONFIG_PROVE_LOCKING)
+ unsigned int nr_slots = 256;
+#else
+ /* 4 nodes per cpu on VSMP, or 256 slots per cpu */
+ unsigned int nr_slots = max(256U, (4U << INTERNODE_CACHE_SHIFT) /
+ sizeof(struct futex_hash_bucket));
+ nr_slots = roundup_pow_of_two(num_possible_cpus() * nr_slots);
+#endif
+ futex_queues = alloc_large_system_hash("futexes",
+ sizeof(struct futex_hash_bucket),
+ nr_slots,
+ 0, /* scale : unused */
+ 0, /* flags */
+ NULL, /* shift */
+ &futex_hash_mask,
+ nr_slots);
+#endif
/*
* This will fail and we want it. Some arch implementations do
* runtime detection of the futex_atomic_cmpxchg_inatomic()
@@ -2031,7 +2055,7 @@ static int __init futex_init(void)
if (curval == -EFAULT)
futex_cmpxchg_enabled = 1;
- for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+ for (i = 0; i <= futex_hash_mask; i++) {
plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
spin_lock_init(&futex_queues[i].lock);
}
Eric Dumazet wrote:
>
> + futex_queues = alloc_large_system_hash("futexes",
> + sizeof(struct futex_hash_bucket),
> + nr_slots,
> + 0, /* scale : unused */
> + 0, /* flags */
> + NULL, /* shift */
> + &futex_hash_mask,
> + nr_slots);
OK, so I'm a little late to the party, apologies.
Minor nit (from someone who has been spending a lot of time in futex.c
trying to clean things up recently :-). Let's not comment each argument
individually. It needlessly lengthens the code. The interested user
can easily look up alloc_large_system_hash. If you feel a comment is
really needed, consider a single line before the call... I gave this
some thought and couldn't think of anything you could put there that
wouldn't still send the interested reader over to page_alloc.c.
/* Allocate futex hashtable with... <whatever is special about this> */
futex_queues = alloc_large_system_hash("futexes",
sizeof(struct futex_hash_bucket),
nr_slots, 0, 0, NULL, &futex_hash_mask,
nr_slots);
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team