2013-07-26 14:27:50

by Davidlohr Bueso

Subject: [PATCH 0/2] hugepage: optimize page fault path locking

This patchset attempts to reduce the amount of contention we impose
on the hugetlb_instantiation_mutex by replacing the global mutex with
a table of mutexes, selected based on a hash. The original discussion can
be found here: http://lkml.org/lkml/2013/7/12/428
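
Roughly, the pattern looks like the following userspace sketch, with
pthread mutexes standing in for kernel mutexes. The table size, hash
function and key layout here are illustrative stand-ins only; the
actual key selection and Jenkins hash are in patch 2.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64   /* power of two, so we can mask instead of mod */

static pthread_mutex_t fault_mutex_table[TABLE_SIZE];

static void fault_table_init(void)
{
        size_t i;

        for (i = 0; i < TABLE_SIZE; i++)
                pthread_mutex_init(&fault_mutex_table[i], NULL);
}

/* Illustrative stand-in hash; the real patch uses jhash2(). */
static uint32_t hash_key(uintptr_t mapping, size_t idx)
{
        uint64_t h = (uint64_t)mapping ^ ((uint64_t)idx * 0x9e3779b97f4a7c15ULL);

        h ^= h >> 33;
        return (uint32_t)h & (TABLE_SIZE - 1);
}

/* Instead of one global lock, each fault takes the lock for its page:
 * racing faults on the same page still serialize, while faults on
 * different pages proceed in parallel. */
static void fault_one_page(uintptr_t mapping, size_t idx)
{
        pthread_mutex_t *m = &fault_mutex_table[hash_key(mapping, idx)];

        pthread_mutex_lock(m);
        /* ... allocate, clear and map the hugepage ... */
        pthread_mutex_unlock(m);
}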

Patch 1: Allows the file region tracking list to be serialized by its own rwsem.
This is necessary because the next patch allows concurrent hugepage fault paths,
getting rid of the hugetlb_instantiation_mutex - which protects chains of struct
file_region in inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions
(!VM_MAYSHARE).

Patch 2: From David Gibson; for some reason it never made it into the kernel.
Further cleanups and enhancements from Anton Blanchard and myself.
Details of how the hash key is selected are in the patch.

Davidlohr Bueso (2):
hugepage: protect file regions with rwsem
hugepage: allow parallelization of the hugepage fault path

mm/hugetlb.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 106 insertions(+), 28 deletions(-)

--
1.7.11.7


2013-07-26 14:27:58

by Davidlohr Bueso

Subject: [PATCH 1/2] hugepage: protect file regions with rwsem

The next patch will introduce parallel hugepage fault paths, replacing
the global hugetlb_instantiation_mutex with a table of hashed mutexes.
In order to prevent races, introduce a rw-semaphore that exclusively
serializes access to the file region tracking structures.
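
As a rough userspace analogue of that discipline - POSIX rwlocks
standing in for the kernel rwsem, with a deliberately simplified
region list; this is a sketch, not the patch code below:

#include <pthread.h>
#include <stdlib.h>

struct file_region {
        struct file_region *next;
        long from, to;
};

static pthread_rwlock_t region_rwsem = PTHREAD_RWLOCK_INITIALIZER;

/* Mutators (region_add/region_chg/region_truncate) take the semaphore
 * exclusive. The real code coalesces overlapping regions; this sketch
 * just prepends. */
static void region_add_sketch(struct file_region **head, long f, long t)
{
        struct file_region *nrg = malloc(sizeof(*nrg));

        if (!nrg)
                return;
        nrg->from = f;
        nrg->to = t;

        pthread_rwlock_wrlock(&region_rwsem);
        nrg->next = *head;
        *head = nrg;
        pthread_rwlock_unlock(&region_rwsem);
}

/* Pure readers (region_count) only take it shared, so concurrent
 * faults that merely inspect the reservations do not serialize. */
static long region_count_sketch(struct file_region *head, long f, long t)
{
        struct file_region *rg;
        long chg = 0;

        pthread_rwlock_rdlock(&region_rwsem);
        for (rg = head; rg; rg = rg->next) {
                long from = rg->from > f ? rg->from : f;
                long to = rg->to < t ? rg->to : t;

                if (from < to)
                        chg += to - from;
        }
        pthread_rwlock_unlock(&region_rwsem);
        return chg;
}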

Thanks to Konstantin Khlebnikov and Joonsoo Kim for pointing this issue out.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 47 +++++++++++++++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..4c3f4f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -134,16 +134,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
*
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantion_mutex. To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation mutex:
- *
- * down_write(&mm->mmap_sem);
- * or
- * down_read(&mm->mmap_sem);
- * mutex_lock(&hugetlb_instantiation_mutex);
+ * With the parallelization of the hugepage fault path, the region_rwsem replaces
+ * the original hugetlb_instantiation_mutex, serializing access to the chains of
+ * file regions.
*/
+DECLARE_RWSEM(region_rwsem);
+
struct file_region {
struct list_head link;
long from;
@@ -154,6 +150,8 @@ static long region_add(struct list_head *head, long f, long t)
{
struct file_region *rg, *nrg, *trg;

+ down_write(&region_rwsem);
+
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -183,6 +181,8 @@ static long region_add(struct list_head *head, long f, long t)
}
nrg->from = f;
nrg->to = t;
+
+ up_write(&region_rwsem);
return 0;
}

@@ -191,6 +191,8 @@ static long region_chg(struct list_head *head, long f, long t)
struct file_region *rg, *nrg;
long chg = 0;

+ down_write(&region_rwsem);
+
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -201,17 +203,21 @@ static long region_chg(struct list_head *head, long f, long t)
* size such that we can guarantee to record the reservation. */
if (&rg->link == head || t < rg->from) {
nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
- if (!nrg)
- return -ENOMEM;
+ if (!nrg) {
+ chg = -ENOMEM;
+ goto done_write;
+ }
nrg->from = f;
nrg->to = f;
INIT_LIST_HEAD(&nrg->link);
list_add(&nrg->link, rg->link.prev);

- return t - f;
+ chg = t - f;
+ goto done_write;
}

/* Round our left edge to the current segment if it encloses us. */
+ downgrade_write(&region_rwsem);
if (f > rg->from)
f = rg->from;
chg = t - f;
@@ -221,7 +227,7 @@ static long region_chg(struct list_head *head, long f, long t)
if (&rg->link == head)
break;
if (rg->from > t)
- return chg;
+ break;

/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
@@ -232,6 +238,11 @@ static long region_chg(struct list_head *head, long f, long t)
}
chg -= rg->to - rg->from;
}
+
+ up_read(&region_rwsem);
+ return chg;
+done_write:
+ up_write(&region_rwsem);
return chg;
}

@@ -240,12 +251,14 @@ static long region_truncate(struct list_head *head, long end)
struct file_region *rg, *trg;
long chg = 0;

+ down_write(&region_rwsem);
+
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (end <= rg->to)
break;
if (&rg->link == head)
- return 0;
+ goto done;

/* If we are in the middle of a region then adjust it. */
if (end > rg->from) {
@@ -262,6 +275,9 @@ static long region_truncate(struct list_head *head, long end)
list_del(&rg->link);
kfree(rg);
}
+
+done:
+ up_write(&region_rwsem);
return chg;
}

@@ -270,6 +286,8 @@ static long region_count(struct list_head *head, long f, long t)
struct file_region *rg;
long chg = 0;

+ down_read(&region_rwsem);
+
/* Locate each segment we overlap with, and count that overlap. */
list_for_each_entry(rg, head, link) {
long seg_from;
@@ -286,6 +304,7 @@ static long region_count(struct list_head *head, long f, long t)
chg += seg_to - seg_from;
}

+ up_read(&region_rwsem);
return chg;
}

--
1.7.11.7

2013-07-26 14:28:03

by Davidlohr Bueso

Subject: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path

From: David Gibson <[email protected]>

At present, the page fault path for hugepages is serialized by a
single mutex. This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM). This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash, which allows us to know
which page in the file we're instantiating. For shared mappings, the
hash key is selected based on the address space and file offset being faulted.
Similarly, for private mappings, the mm and virtual address are used.

From: Anton Blanchard <[email protected]>
[https://lkml.org/lkml/2011/7/15/31]
Forward ported and made a few changes:

- Use the Jenkins hash to scatter the hash values; this is better than
using just the low bits.

- Always round num_fault_mutexes to a power of two to avoid an
expensive modulus in the hash calculation.
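
To make the last point concrete: when the table size n is a power of
two, 'hash & (n - 1)' picks the same bucket as the more expensive
'hash % n'. A standalone check (illustrative userspace code, with a
stand-in for the kernel's roundup_pow_of_two()):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static uint32_t roundup_pow2(uint32_t v)
{
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        return v + 1;
}

int main(void)
{
        /* e.g. 24 possible CPUs: 2 * 24 = 48, rounded up to 64 */
        uint32_t hash, n = roundup_pow2(2 * 24);

        for (hash = 0; hash < 1000000; hash++)
                assert((hash & (n - 1)) == (hash % n));

        printf("table size %u, mask 0x%x\n", n, n - 1);
        return 0;
}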

I also tested this patch on a large POWER7 box using a simple parallel
fault testcase:

http://ozlabs.org/~anton/junkcode/parallel_fault.c

Command line options:

parallel_fault <nr_threads> <size in kB> <skip in kB>

First the time taken to fault 128GB of 16MB hugepages:

40.68 seconds

Now the same test with 64 concurrent threads:
39.34 seconds

Hardly any speedup. Finally the 64 concurrent threads test with
this patch applied:
0.85 seconds

We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x.

This was tested with the libhugetlbfs test suite, and the PASS/FAIL
count was the same before and after this patch.

From: Davidlohr Bueso <[email protected]>
- Cleaned up and forward ported to Linus' latest.
- Cache aligned mutexes.
- Keep non-SMP systems using a single mutex.

It was found that this mutex can become quite contended during the
early phases of large databases which make use of huge pages - for
instance, startup and initial runs. One clear example is a 1.5GB Oracle
database, where lockstat reports that this mutex can be one of the top
5 most contended locks in the kernel during the first few minutes:

hugetlb_instantiation_mutex: 10678 10678
---------------------------
hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
---------------------------
hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340

contentions: 10678
acquisitions: 99476
waittime-total: 76888911.01 us

With this patch we see much less contention and wait time:

&htlb_fault_mutex_table[i]: 383
--------------------------
&htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
--------------------------
&htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440

contentions: 383
acquisitions: 120546
waittime-total: 1381.72 us

Signed-off-by: David Gibson <[email protected]>
Signed-off-by: Anton Blanchard <[email protected]>
Tested-by: Eric B Munson <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 73 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4c3f4f0..1426e44 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -21,6 +21,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/jhash.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -52,6 +53,13 @@ static unsigned long __initdata default_hstate_size;
*/
DEFINE_SPINLOCK(hugetlb_lock);

+/*
+ * Serializes faults on the same logical page. This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1915,13 +1923,15 @@ static void __exit hugetlb_exit(void)
for_each_hstate(h) {
kobject_put(hstate_kobjs[hstate_index(h)]);
}
-
+ kfree(htlb_fault_mutex_table);
kobject_put(hugepages_kobj);
}
module_exit(hugetlb_exit);

static int __init hugetlb_init(void)
{
+ int i;
+
/* Some platform decide whether they support huge pages at boot
* time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
* there is no such support
@@ -1946,6 +1956,19 @@ static int __init hugetlb_init(void)
hugetlb_register_all_nodes();
hugetlb_cgroup_file_init();

+#ifdef CONFIG_SMP
+ num_fault_mutexes = roundup_pow_of_two(2 * num_possible_cpus());
+#else
+ num_fault_mutexes = 1;
+#endif
+ htlb_fault_mutex_table =
+ kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+ if (!htlb_fault_mutex_table)
+ return -ENOMEM;
+
+ for (i = 0; i < num_fault_mutexes; i++)
+ mutex_init(&htlb_fault_mutex_table[i]);
+
return 0;
}
module_init(hugetlb_init);
@@ -2728,15 +2751,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
}

static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ struct address_space *mapping, pgoff_t idx,
+ unsigned long address, pte_t *ptep, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
int anon_rmap = 0;
- pgoff_t idx;
unsigned long size;
struct page *page;
- struct address_space *mapping;
pte_t new_pte;

/*
@@ -2750,9 +2772,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

- mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, address);
-
/*
* Use page lock to guard against racing truncation
* before we get page_table_lock.
@@ -2858,15 +2877,51 @@ backout_unlocked:
goto out;
}

+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ unsigned long key[2];
+ u32 hash;
+
+ if (vma->vm_flags & VM_SHARED) {
+ key[0] = (unsigned long)mapping;
+ key[1] = idx;
+ } else {
+ key[0] = (unsigned long)mm;
+ key[1] = address >> huge_page_shift(h);
+ }
+
+ hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+ return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocessor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ return 0;
+}
+#endif
+
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
- pte_t *ptep;
- pte_t entry;
+ pgoff_t idx;
int ret;
+ u32 hash;
+ pte_t *ptep, entry;
struct page *page = NULL;
+ struct address_space *mapping;
struct page *pagecache_page = NULL;
- static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);

address &= huge_page_mask(h);
@@ -2886,15 +2941,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!ptep)
return VM_FAULT_OOM;

+ mapping = vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, vma, address);
+
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- mutex_lock(&hugetlb_instantiation_mutex);
+ hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ mutex_lock(&htlb_fault_mutex_table[hash]);
+
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
goto out_mutex;
}

@@ -2962,8 +3022,7 @@ out_page_table_lock:
put_page(page);

out_mutex:
- mutex_unlock(&hugetlb_instantiation_mutex);
-
+ mutex_unlock(&htlb_fault_mutex_table[hash]);
return ret;
}

--
1.7.11.7

2013-07-28 06:00:56

by Hillf Danton

Subject: Re: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path

On Fri, Jul 26, 2013 at 10:27 PM, Davidlohr Bueso
<[email protected]> wrote:
> From: David Gibson <[email protected]>
>
> At present, the page fault path for hugepages is serialized by a
> single mutex. This is used to avoid spurious out-of-memory conditions
> when the hugepage pool is fully utilized (two processes or threads can
> race to instantiate the same mapping with the last hugepage from the
> pool, the race loser returning VM_FAULT_OOM). This problem is
> specific to hugepages, because it is normal to want to use every
> single hugepage in the system - with normal pages we simply assume
> there will always be a few spare pages which can be used temporarily
> until the race is resolved.
>
> Unfortunately this serialization also means that clearing of hugepages
> cannot be parallelized across multiple CPUs, which can lead to very
> long process startup times when using large numbers of hugepages.
>
> This patch improves the situation by replacing the single mutex with a
> table of mutexes, selected based on a hash, which allows us to know
> which page in the file we're instantiating. For shared mappings, the
> hash key is selected based on the address space and file offset being faulted.
> Similarly, for private mappings, the mm and virtual address are used.
>
> From: Anton Blanchard <[email protected]>
> [https://lkml.org/lkml/2011/7/15/31]
> Forward ported and made a few changes:
>
> - Use the Jenkins hash to scatter the hash, better than using just the
> low bits.
>
> - Always round num_fault_mutexes to a power of two to avoid an
> expensive modulus in the hash calculation.
>
> I also tested this patch on a large POWER7 box using a simple parallel
> fault testcase:
>
> http://ozlabs.org/~anton/junkcode/parallel_fault.c
>
> Command line options:
>
> parallel_fault <nr_threads> <size in kB> <skip in kB>
>
> First the time taken to fault 128GB of 16MB hugepages:
>
> 40.68 seconds
>
> Now the same test with 64 concurrent threads:
> 39.34 seconds
>
> Hardly any speedup. Finally the 64 concurrent threads test with
> this patch applied:
> 0.85 seconds
>
> We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
>
> This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> count was the same before and after this patch.
>
> From: Davidlohr Bueso <[email protected]>
> - Cleaned up and forward ported to Linus' latest.
> - Cache aligned mutexes.
> - Keep non SMP systems using a single mutex.
>
> It was found that this mutex can become quite contended
> during the early phases of large databases which make use of huge pages - for instance
> startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> reports that this mutex can be one of the top 5 most contended locks in the kernel during
> the first few minutes:
>
> hugetlb_instantiation_mutex: 10678 10678
> ---------------------------
> hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> ---------------------------
> hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
>
> contentions: 10678
> acquisitions: 99476
> waittime-total: 76888911.01 us
>
> With this patch we see much less contention and wait time:
>
> &htlb_fault_mutex_table[i]: 383
> --------------------------
> &htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> --------------------------
> &htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
>
> contentions: 383
> acquisitions: 120546
> waittime-total: 1381.72 us
>
I see the same figures in the message of Jul 18,
contentions: 10678
acquisitions: 99476
waittime-total: 76888911.01 us
and
contentions: 383
acquisitions: 120546
waittime-total: 1381.72 us
if I copy and paste correctly.

Were they measured with the global semaphore introduced in 1/8 for
serializing changes in file regions?

2013-07-29 06:18:23

by Joonsoo Kim

Subject: Re: [PATCH 0/2] hugepage: optimize page fault path locking

On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> This patchset attempts to reduce the amount of contention we impose
> on the hugetlb_instantiation_mutex by replacing the global mutex with
> a table of mutexes, selected based on a hash. The original discussion can
> be found here: http://lkml.org/lkml/2013/7/12/428

Hello, Davidlohr.

I recently sent a patchset which removes the hugetlb_instantiation_mutex
entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
This patchset can be found here: https://lkml.org/lkml/2013/7/29/54

If possible, could you review it and test whether your problem
disappears with it or not?

Thanks.

>
> Patch 1: Allows the file region tracking list to be serialized by its own rwsem.
> This is necessary because the next patch allows concurrent hugepage fault paths,
> getting rid of the hugetlb_instantiation_mutex - which protects chains of struct
> file_region in inode->i_mapping->private_list (VM_MAYSHARE) or vma_resv_map(vma)->regions
> (!VM_MAYSHARE).
>
> Patch 2: From David Gibson, for some reason never made it into the kernel.
> Further cleanups and enhancements from Anton Blanchard and myself.
> Details of how the hash key is selected is in the patch.
>
> Davidlohr Bueso (2):
> hugepage: protect file regions with rwsem
> hugepage: allow parallelization of the hugepage fault path
>
> mm/hugetlb.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 106 insertions(+), 28 deletions(-)
>
> --
> 1.7.11.7
>

2013-07-29 19:16:51

by Davidlohr Bueso

Subject: Re: [PATCH 2/2] hugepage: allow parallelization of the hugepage fault path

On Sun, 2013-07-28 at 14:00 +0800, Hillf Danton wrote:
> On Fri, Jul 26, 2013 at 10:27 PM, Davidlohr Bueso
> <[email protected]> wrote:
> > From: David Gibson <[email protected]>
> >
> > At present, the page fault path for hugepages is serialized by a
> > single mutex. This is used to avoid spurious out-of-memory conditions
> > when the hugepage pool is fully utilized (two processes or threads can
> > race to instantiate the same mapping with the last hugepage from the
> > pool, the race loser returning VM_FAULT_OOM). This problem is
> > specific to hugepages, because it is normal to want to use every
> > single hugepage in the system - with normal pages we simply assume
> > there will always be a few spare pages which can be used temporarily
> > until the race is resolved.
> >
> > Unfortunately this serialization also means that clearing of hugepages
> > cannot be parallelized across multiple CPUs, which can lead to very
> > long process startup times when using large numbers of hugepages.
> >
> > This patch improves the situation by replacing the single mutex with a
> > table of mutexes, selected based on a hash, which allows us to know
> > which page in the file we're instantiating. For shared mappings, the
> > hash key is selected based on the address space and file offset being faulted.
> > Similarly, for private mappings, the mm and virtual address are used.
> >
> > From: Anton Blanchard <[email protected]>
> > [https://lkml.org/lkml/2011/7/15/31]
> > Forward ported and made a few changes:
> >
> > - Use the Jenkins hash to scatter the hash, better than using just the
> > low bits.
> >
> > - Always round num_fault_mutexes to a power of two to avoid an
> > expensive modulus in the hash calculation.
> >
> > I also tested this patch on a large POWER7 box using a simple parallel
> > fault testcase:
> >
> > http://ozlabs.org/~anton/junkcode/parallel_fault.c
> >
> > Command line options:
> >
> > parallel_fault <nr_threads> <size in kB> <skip in kB>
> >
> > First the time taken to fault 128GB of 16MB hugepages:
> >
> > 40.68 seconds
> >
> > Now the same test with 64 concurrent threads:
> > 39.34 seconds
> >
> > Hardly any speedup. Finally the 64 concurrent threads test with
> > this patch applied:
> > 0.85 seconds
> >
> > We go from 40.68 seconds to 0.85 seconds, an improvement of 47.9x
> >
> > This was tested with the libhugetlbfs test suite, and the PASS/FAIL
> > count was the same before and after this patch.
> >
> > From: Davidlohr Bueso <[email protected]>
> > - Cleaned up and forward ported to Linus' latest.
> > - Cache aligned mutexes.
> > - Keep non SMP systems using a single mutex.
> >
> > It was found that this mutex can become quite contended
> > during the early phases of large databases which make use of huge pages - for instance
> > startup and initial runs. One clear example is a 1.5Gb Oracle database, where lockstat
> > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > the first few minutes:
> >
> > hugetlb_instantiation_mutex: 10678 10678
> > ---------------------------
> > hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > ---------------------------
> > hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> >
> > contentions: 10678
> > acquisitions: 99476
> > waittime-total: 76888911.01 us
> >
> > With this patch we see much less contention and wait time:
> >
> > &htlb_fault_mutex_table[i]: 383
> > --------------------------
> > &htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> > --------------------------
> > &htlb_fault_mutex_table[i] 383 [<ffffffff8115e27b>] hugetlb_fault+0x1eb/0x440
> >
> > contentions: 383
> > acquisitions: 120546
> > waittime-total: 1381.72 us
> >
> I see the same figures in the message of Jul 18,
> contentions: 10678
> acquisitions: 99476
> waittime-total: 76888911.01 us
> and
> contentions: 383
> acquisitions: 120546
> waittime-total: 1381.72 us
> if I copy and paste correctly.
>
> Were they measured with the global semaphore introduced in 1/8 for
> serializing changes in file regions?

They were, but I copied the wrong text:

for the htlb mutex:

contentions: 453
acquisitions: 154786
waittime-total: 117765.59 us

For the new lock, this particular workload only uses region_add() and
region_chg() calls:

region_rwsem-W:
contentions: 4
acquisitions: 20077
waittime-total: 2244.64 us

region_rwsem-R:
contentions: 0
acquisitions: 2
waittime-total: 0 us

2013-08-07 00:08:13

by Davidlohr Bueso

Subject: Re: [PATCH 0/2] hugepage: optimize page fault path locking

On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote:
> On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> > This patchset attempts to reduce the amount of contention we impose
> > on the hugetlb_instantiation_mutex by replacing the global mutex with
> > a table of mutexes, selected based on a hash. The original discussion can
> > be found here: http://lkml.org/lkml/2013/7/12/428
>
> Hello, Davidlohr.
>
> I recently sent a patchset which removes the hugetlb_instantiation_mutex
> entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
> This patchset can be found here: https://lkml.org/lkml/2013/7/29/54
>
> If possible, could you review it and test whether your problem
> disappears with it or not?

This patchset applies on top of https://lkml.org/lkml/2013/7/22/96
"[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix", right?

AFAIK those changes are the ones Andrew picked up a few weeks ago and
are now in linux-next, right? I was able to apply those just fine, but
couldn't apply your 'remove a hugetlb_instantiation_mutex' series (IIRC
patch 1/18 failed). I guess you'll send out a v2 anyway, so I'll wait
until then.

In any case I'm not seeing an actual performance issue with the
hugetlb_instantiation_mutex; all I noticed was that under large DB
workloads that make use of hugepages, such as Oracle, this lock becomes
quite hot during the first few minutes of startup, which makes sense
since it is contended in the fault path. So I'll try out your patches,
but in this particular case I just cannot compare the with-lock vs
without-lock situations.

Thanks,
Davidlohr

2013-08-07 09:21:27

by Joonsoo Kim

Subject: Re: [PATCH 0/2] hugepage: optimize page fault path locking

On Tue, Aug 06, 2013 at 05:08:04PM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-29 at 15:18 +0900, Joonsoo Kim wrote:
> > On Fri, Jul 26, 2013 at 07:27:23AM -0700, Davidlohr Bueso wrote:
> > > This patchset attempts to reduce the amount of contention we impose
> > > on the hugetlb_instantiation_mutex by replacing the global mutex with
> > > a table of mutexes, selected based on a hash. The original discussion can
> > > be found here: http://lkml.org/lkml/2013/7/12/428
> >
> > Hello, Davidlohr.
> >
> > I recently sent a patchset which removes the hugetlb_instantiation_mutex
> > entirely ('mm, hugetlb: remove a hugetlb_instantiation_mutex').
> > This patchset can be found here: https://lkml.org/lkml/2013/7/29/54
> >
> > If possible, could you review it and test whether your problem
> > disappears with it or not?
>
> This patchset applies on top of https://lkml.org/lkml/2013/7/22/96
> "[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix", right?
>
> AFAIK those changes are the ones Andrew picked up a few weeks ago and
> are now in linux-next, right? I was able to apply those just fine, but
> couldn't apply your 'remove a hugetlb_instantiation_mutex series' (IIRC
> pach 1/18 failed). I guess you'll send out a v2 anyway so I'll wait
> until then.
>
> In any case I'm not seeing an actual performance issue with the
> hugetlb_instantiation_mutex, all I noticed was that under large DB
> workloads that make use of hugepages, such as Oracle, this lock becomes
> quite hot during the first few minutes of startup, which makes sense in
> the fault path it is contended. So I'll try out your patches, but, in
> this particular case, I just cannot compare with the lock vs without the
> lock situations.

Okay. I just want to know whether lock contention is reduced by my patches
in the first few minutes of startup. I will send v2 soon.

Thanks.