This patchset resumes the work to improve the scalability of the hugepage
fault path. Previous efforts can be found here:
https://lkml.org/lkml/2013/7/26/299
https://lkml.org/lkml/2013/12/18/50
The latest attempt to address the big-fat hugetlb instantiation mutex, by
removing the need for it altogether, ended up carrying too much overhead to
be a viable way of improving scalability. The discussion can be found at:
https://lkml.org/lkml/2014/1/3/244
This patchset is divided into three parts. The first seven patches, from
Joonsoo, have been included and reviewed in previous patchsets; the last
patch is the actual performance change.
Part 1. (1-3) Introduce a new protection method for the region tracking
data structure, replacing the hugetlb_instantiation_mutex. There is a
race condition when we map the same hugetlbfs file into two different
processes; to prevent it, we need a new protection method such as the
one introduced in this patchset.
Part 2. (4-7) Clean-ups.
These make the code simpler, so they are worth going into
mainline separately.
Part 3. (8) Use a table of mutexes instead of a single one, and allow
faults to be handled in parallel. Benefits and caveats of this
approach are described in the patch.
All changes have passed the libhugetlbfs test cases.
This patchset applies on top of Linus' current tree (3.13-77d143de)
Thanks!
mm, hugetlb: unify region structure handling
mm, hugetlb: region manipulation functions take resv_map rather than
list_head
mm, hugetlb: fix race in region tracking
mm, hugetlb: remove resv_map_put
mm, hugetlb: use vma_resv_map() map types
mm, hugetlb: remove vma_has_reserves
mm, hugetlb: unify chg and avoid_reserve to use_reserve
mm, hugetlb: improve page-fault scalability
fs/hugetlbfs/inode.c | 17 ++-
include/linux/hugetlb.h | 10 ++
mm/hugetlb.c | 323 +++++++++++++++++++++++++++---------------------
3 files changed, 206 insertions(+), 144 deletions(-)
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
To change the protection method for region tracking to a fine-grained one,
we pass the resv_map, instead of the list_head, to the region manipulation
functions. This doesn't introduce any functional change; it only prepares
for the next step.
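For reference, the interface change can be sketched as follows (a condensed
sketch based on the diff below; only region_add() is shown, the other
region_*() helpers change the same way):

    /* before: callers passed the raw region list */
    static long region_add(struct list_head *head, long f, long t);

    /* after: callers pass the owning resv_map; the helper derives the
     * list itself via &resv->regions
     */
    static long region_add(struct resv_map *resv, long f, long t);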
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fbed1b7..572866d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -151,8 +151,9 @@ struct file_region {
long to;
};
-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *nrg, *trg;
/* Locate the region we are either in or before. */
@@ -187,8 +188,9 @@ static long region_add(struct list_head *head, long f, long t)
return 0;
}
-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *nrg;
long chg = 0;
@@ -236,8 +238,9 @@ static long region_chg(struct list_head *head, long f, long t)
return chg;
}
-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *trg;
long chg = 0;
@@ -266,8 +269,9 @@ static long region_truncate(struct list_head *head, long end)
return chg;
}
-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg;
long chg = 0;
@@ -393,7 +397,7 @@ void resv_map_release(struct kref *ref)
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
/* Clear out any active regions before we release the map. */
- region_truncate(&resv_map->regions, 0);
+ region_truncate(resv_map, 0);
kfree(resv_map);
}
@@ -1152,7 +1156,7 @@ static long vma_needs_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = inode->i_mapping->private_data;
- return region_chg(&resv->regions, idx, idx + 1);
+ return region_chg(resv, idx, idx + 1);
} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -1162,7 +1166,7 @@ static long vma_needs_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = vma_resv_map(vma);
- err = region_chg(&resv->regions, idx, idx + 1);
+ err = region_chg(resv, idx, idx + 1);
if (err < 0)
return err;
return 0;
@@ -1178,14 +1182,14 @@ static void vma_commit_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = inode->i_mapping->private_data;
- region_add(&resv->regions, idx, idx + 1);
+ region_add(resv, idx, idx + 1);
} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = vma_resv_map(vma);
/* Mark this page used in the map. */
- region_add(&resv->regions, idx, idx + 1);
+ region_add(resv, idx, idx + 1);
}
}
@@ -2276,7 +2280,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
end = vma_hugecache_offset(h, vma, vma->vm_end);
reserve = (end - start) -
- region_count(&resv->regions, start, end);
+ region_count(resv, start, end);
resv_map_put(vma);
@@ -3178,7 +3182,7 @@ int hugetlb_reserve_pages(struct inode *inode,
if (!vma || vma->vm_flags & VM_MAYSHARE) {
resv_map = inode->i_mapping->private_data;
- chg = region_chg(&resv_map->regions, from, to);
+ chg = region_chg(resv_map, from, to);
} else {
resv_map = resv_map_alloc();
@@ -3224,7 +3228,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&resv_map->regions, from, to);
+ region_add(resv_map, from, to);
return 0;
out_err:
if (vma)
@@ -3240,7 +3244,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
struct hugepage_subpool *spool = subpool_inode(inode);
if (resv_map)
- chg = region_truncate(&resv_map->regions, offset);
+ chg = region_truncate(resv_map, offset);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
This is a preparation patch to unify the use of vma_resv_map() regardless
of the map type. It removes resv_map_put(), which only works for
HPAGE_RESV_OWNER's resv_map, not for all resv_maps.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
[Updated changelog]
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6b40d7e..13edf17 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2273,15 +2273,6 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
kref_get(&resv->refs);
}
-static void resv_map_put(struct vm_area_struct *vma)
-{
- struct resv_map *resv = vma_resv_map(vma);
-
- if (!resv)
- return;
- kref_put(&resv->refs, resv_map_release);
-}
-
static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
@@ -2298,7 +2289,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
reserve = (end - start) -
region_count(resv, start, end);
- resv_map_put(vma);
+ kref_put(&resv->refs, resv_map_release);
if (reserve) {
hugetlb_acct_memory(h, -reserve);
@@ -3247,8 +3238,8 @@ int hugetlb_reserve_pages(struct inode *inode,
region_add(resv_map, from, to);
return 0;
out_err:
- if (vma)
- resv_map_put(vma);
+ if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+ kref_put(&resv_map->refs, resv_map_release);
return ret;
}
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
vma_has_reserves() can be substituted by the return value of
vma_needs_reservation(). If the chg returned by vma_needs_reservation()
is 0, the vma has reserves. Otherwise, the vma has no reserves and needs
a hugepage from outside the reserve pool. This is exactly the same
definition as vma_has_reserves(), so remove vma_has_reserves().
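As a condensed sketch of the callers after this patch (taken from the diff
below), the chg value is used directly where vma_has_reserves() was used
before:

    /* chg != 0 means the vma has no reserve for this address, so fail
     * if the pool has no free pages left outside the reserves
     */
    if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
            goto err;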
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 46 +++++++++-------------------------------------
1 file changed, 9 insertions(+), 37 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 541cceb..83bc161 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -469,39 +469,6 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
vma->vm_private_data = (void *)0;
}
-/* Returns true if the VMA has associated reserve pages */
-static int vma_has_reserves(struct vm_area_struct *vma, long chg)
-{
- if (vma->vm_flags & VM_NORESERVE) {
- /*
- * This address is already reserved by other process(chg == 0),
- * so, we should decrement reserved count. Without decrementing,
- * reserve count remains after releasing inode, because this
- * allocated page will go into page cache and is regarded as
- * coming from reserved pool in releasing step. Currently, we
- * don't have any other solution to deal with this situation
- * properly, so add work-around here.
- */
- if (vma->vm_flags & VM_MAYSHARE && chg == 0)
- return 1;
- else
- return 0;
- }
-
- /* Shared mappings always use reserves */
- if (vma->vm_flags & VM_MAYSHARE)
- return 1;
-
- /*
- * Only the process that called mmap() has reserves for
- * private mappings.
- */
- if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
- return 1;
-
- return 0;
-}
-
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
@@ -555,10 +522,11 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
/*
* A child process with MAP_PRIVATE mappings created by their parent
* have no page reserves. This check ensures that reservations are
- * not "stolen". The child may still get SIGKILLed
+ * not "stolen". The child may still get SIGKILLed.
+ * chg represents whether current user has a reserved hugepages or not,
+ * so that we can use it to ensure that reservations are not "stolen".
*/
- if (!vma_has_reserves(vma, chg) &&
- h->free_huge_pages - h->resv_huge_pages == 0)
+ if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
goto err;
/* If reserves cannot be used, ensure enough pages are in the pool */
@@ -577,7 +545,11 @@ retry_cpuset:
if (page) {
if (avoid_reserve)
break;
- if (!vma_has_reserves(vma, chg))
+ /*
+ * chg means whether current user allocates
+ * a hugepage on the reserved pool or not
+ */
+ if (chg)
break;
SetPagePrivate(page);
--
1.8.1.4
The kernel can currently only handle a single hugetlb page fault at a time.
This is due to a single mutex that serializes the entire path. This lock
protects from spurious OOM errors under conditions of low availability
of free hugepages. This problem is specific to hugepages, because it is
normal to want to use every single hugepage in the system - with normal pages
we simply assume there will always be a few spare pages which can be used
temporarily until the race is resolved.
Address this problem by using a table of mutexes, allowing a better chance of
parallelization, where each hugepage is individually serialized. The hash key
is selected depending on the mapping type. For shared ones it consists of the
address space and file offset being faulted; while for private ones the mm and
virtual address are used. The size of the table is selected based on a
compromise between collisions and memory footprint across a series of
database workloads.
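The key selection can be summarized as follows (condensed from
fault_mutex_hash() in the diff below):

    if (vma->vm_flags & VM_SHARED) {
            key[0] = (unsigned long) mapping;       /* address space */
            key[1] = idx;                           /* file offset */
    } else {
            key[0] = (unsigned long) mm;
            key[1] = address >> huge_page_shift(h);
    }
    hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
    /* the table size is a power of two, so masking selects the bucket */
    return hash & (num_fault_mutexes - 1);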
Large database workloads that make heavy use of hugepages can be particularly
exposed to this issue, causing start-up times to be painfully slow. This patch
reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
to 25.7 secs. Larger workloads will naturally benefit even more.
NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a small race
is possible in private mappings: A child process (with its own mm, after cow)
can instantiate a page that is already being handled by the parent in a cow
fault. When low on pages, this can trigger spurious OOMs. I have not been able
to think of an efficient way of handling this... but do we really care about
such a tiny window? We already maintain another theoretical race with normal
pages. If not, one possible way is to maintain the single hash for private
mappings
-- any workloads that *really* suffer from this scaling problem should already
use shared mappings.
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 73 insertions(+), 13 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f3efa5..ec04e84 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -22,6 +22,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/page-isolation.h>
+#include <linux/jhash.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -53,6 +54,13 @@ static unsigned long __initdata default_hstate_size;
*/
DEFINE_SPINLOCK(hugetlb_lock);
+/*
+ * Serializes faults on the same logical page. This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int __read_mostly num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1922,11 +1930,14 @@ static void __exit hugetlb_exit(void)
}
kobject_put(hugepages_kobj);
+ kfree(htlb_fault_mutex_table);
}
module_exit(hugetlb_exit);
static int __init hugetlb_init(void)
{
+ int i;
+
/* Some platform decide whether they support huge pages at boot
* time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
* there is no such support
@@ -1951,6 +1962,18 @@ static int __init hugetlb_init(void)
hugetlb_register_all_nodes();
hugetlb_cgroup_file_init();
+#ifdef CONFIG_SMP
+ num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
+#else
+ num_fault_mutexes = 1;
+#endif
+ htlb_fault_mutex_table =
+ kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+ if (!htlb_fault_mutex_table)
+ return -ENOMEM;
+
+ for (i = 0; i < num_fault_mutexes; i++)
+ mutex_init(&htlb_fault_mutex_table[i]);
return 0;
}
module_init(hugetlb_init);
@@ -2733,15 +2756,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
}
static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ struct address_space *mapping, pgoff_t idx,
+ unsigned long address, pte_t *ptep, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
int anon_rmap = 0;
- pgoff_t idx;
unsigned long size;
struct page *page;
- struct address_space *mapping;
pte_t new_pte;
spinlock_t *ptl;
@@ -2756,9 +2778,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}
- mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, address);
-
/*
* Use page lock to guard against racing truncation
* before we get page_table_lock.
@@ -2868,17 +2887,53 @@ backout_unlocked:
goto out;
}
+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ unsigned long key[2];
+ u32 hash;
+
+ if (vma->vm_flags & VM_SHARED) {
+ key[0] = (unsigned long) mapping;
+ key[1] = idx;
+ } else {
+ key[0] = (unsigned long) mm;
+ key[1] = address >> huge_page_shift(h);
+ }
+
+ hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+ return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocessor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ return 0;
+}
+#endif
+
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
- pte_t *ptep;
- pte_t entry;
+ pte_t *ptep, entry;
spinlock_t *ptl;
int ret;
+ u32 hash, parent_hash;
+ pgoff_t idx;
struct page *page = NULL;
struct page *pagecache_page = NULL;
- static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);
+ struct address_space *mapping;
address &= huge_page_mask(h);
@@ -2897,15 +2952,21 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!ptep)
return VM_FAULT_OOM;
+ mapping = vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, vma, address);
+
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- mutex_lock(&hugetlb_instantiation_mutex);
+ parent_hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ mutex_lock(&htlb_fault_mutex_table[hash]);
+
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
goto out_mutex;
}
@@ -2974,8 +3035,7 @@ out_ptl:
put_page(page);
out_mutex:
- mutex_unlock(&hugetlb_instantiation_mutex);
-
+ mutex_unlock(&htlb_fault_mutex_table[hash]);
return ret;
}
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
Currently, we have two variables, chg and avoid_reserve, to represent
whether we can use a reserved page or not. By aggregating these into one,
the code becomes cleaner. This makes no functional difference.
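The aggregation boils down to the following (condensed from the diff below):

    /* a reserve may be used only if one exists (chg == 0) and the
     * caller did not ask to avoid it
     */
    use_reserve = (!chg && !avoid_reserve);
    page = dequeue_huge_page_vma(h, vma, addr, use_reserve);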
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 31 +++++++++++--------------------
1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83bc161..5f3efa5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -508,8 +508,7 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve,
- long chg)
+ unsigned long address, bool use_reserve)
{
struct page *page = NULL;
struct mempolicy *mpol;
@@ -523,14 +522,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
* A child process with MAP_PRIVATE mappings created by their parent
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed.
- * chg represents whether current user has a reserved hugepages or not,
- * so that we can use it to ensure that reservations are not "stolen".
+ * Or, when parent process do COW, we cannot use reserved page.
+ * In this case, ensure enough pages are in the pool.
*/
- if (chg && h->free_huge_pages - h->resv_huge_pages == 0)
- goto err;
-
- /* If reserves cannot be used, ensure enough pages are in the pool */
- if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
+ if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
goto err;
retry_cpuset:
@@ -543,13 +538,7 @@ retry_cpuset:
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask(h))) {
page = dequeue_huge_page_node(h, zone_to_nid(zone));
if (page) {
- if (avoid_reserve)
- break;
- /*
- * chg means whether current user allocates
- * a hugepage on the reserved pool or not
- */
- if (chg)
+ if (!use_reserve)
break;
SetPagePrivate(page);
@@ -1185,6 +1174,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *page;
long chg;
+ bool use_reserve;
int ret, idx;
struct hugetlb_cgroup *h_cg;
@@ -1200,18 +1190,19 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
chg = vma_needs_reservation(h, vma, addr);
if (chg < 0)
return ERR_PTR(-ENOMEM);
- if (chg || avoid_reserve)
+ use_reserve = (!chg && !avoid_reserve);
+ if (!use_reserve)
if (hugepage_subpool_get_pages(spool, 1))
return ERR_PTR(-ENOSPC);
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret) {
- if (chg || avoid_reserve)
+ if (!use_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+ page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1219,7 +1210,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup(idx,
pages_per_huge_page(h),
h_cg);
- if (chg || avoid_reserve)
+ if (!use_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
Until now, we get a resv_map in two different ways, depending on the
mapping type. This makes the code messy and hard to read, so unify it.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++------------------------------
1 file changed, 38 insertions(+), 38 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 13edf17..541cceb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -417,13 +417,24 @@ void resv_map_release(struct kref *ref)
kfree(resv_map);
}
+static inline struct resv_map *inode_resv_map(struct inode *inode)
+{
+ return inode->i_mapping->private_data;
+}
+
static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
VM_BUG_ON(!is_vm_hugetlb_page(vma));
- if (!(vma->vm_flags & VM_MAYSHARE))
+ if (vma->vm_flags & VM_MAYSHARE) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct inode *inode = mapping->host;
+
+ return inode_resv_map(inode);
+
+ } else {
return (struct resv_map *)(get_vma_private_data(vma) &
~HPAGE_RESV_MASK);
- return NULL;
+ }
}
static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
@@ -1165,48 +1176,34 @@ static void return_unused_surplus_pages(struct hstate *h,
static long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- struct address_space *mapping = vma->vm_file->f_mapping;
- struct inode *inode = mapping->host;
-
- if (vma->vm_flags & VM_MAYSHARE) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = inode->i_mapping->private_data;
-
- return region_chg(resv, idx, idx + 1);
+ struct resv_map *resv;
+ pgoff_t idx;
+ long chg;
- } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ resv = vma_resv_map(vma);
+ if (!resv)
return 1;
- } else {
- long err;
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = vma_resv_map(vma);
+ idx = vma_hugecache_offset(h, vma, addr);
+ chg = region_chg(resv, idx, idx + 1);
- err = region_chg(resv, idx, idx + 1);
- if (err < 0)
- return err;
- return 0;
- }
+ if (vma->vm_flags & VM_MAYSHARE)
+ return chg;
+ else
+ return chg < 0 ? chg : 0;
}
static void vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- struct address_space *mapping = vma->vm_file->f_mapping;
- struct inode *inode = mapping->host;
-
- if (vma->vm_flags & VM_MAYSHARE) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = inode->i_mapping->private_data;
-
- region_add(resv, idx, idx + 1);
+ struct resv_map *resv;
+ pgoff_t idx;
- } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = vma_resv_map(vma);
+ resv = vma_resv_map(vma);
+ if (!resv)
+ return;
- /* Mark this page used in the map. */
- region_add(resv, idx, idx + 1);
- }
+ idx = vma_hugecache_offset(h, vma, addr);
+ region_add(resv, idx, idx + 1);
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -2269,7 +2266,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
* after this open call completes. It is therefore safe to take a
* new reference here without additional locking.
*/
- if (resv)
+ if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
kref_get(&resv->refs);
}
@@ -2282,7 +2279,10 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
unsigned long start;
unsigned long end;
- if (resv) {
+ if (!resv)
+ return;
+
+ if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
start = vma_hugecache_offset(h, vma, vma->vm_start);
end = vma_hugecache_offset(h, vma, vma->vm_end);
@@ -3187,7 +3187,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
if (!vma || vma->vm_flags & VM_MAYSHARE) {
- resv_map = inode->i_mapping->private_data;
+ resv_map = inode_resv_map(inode);
chg = region_chg(resv_map, from, to);
@@ -3246,7 +3246,7 @@ out_err:
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
struct hstate *h = hstate_inode(inode);
- struct resv_map *resv_map = inode->i_mapping->private_data;
+ struct resv_map *resv_map = inode_resv_map(inode);
long chg = 0;
struct hugepage_subpool *spool = subpool_inode(inode);
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
There is a race condition if we map the same file in different processes.
Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
region structure, so it can be modified by two processes concurrently.
To solve this, introduce a spinlock in resv_map and make the region
manipulation functions grab it before they do the actual work.
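The resulting pattern in each region_*() helper looks like this (a condensed
sketch; region_count() in the diff below is the simplest instance):

    static long region_count(struct resv_map *resv, long f, long t)
    {
            long chg = 0;

            spin_lock(&resv->lock);
            /* walk resv->regions and accumulate the overlap into chg */
            spin_unlock(&resv->lock);
            return chg;
    }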
Acked-by: David Gibson <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
[Updated changelog]
Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 48 ++++++++++++++++++++++++++++++++----------------
2 files changed, 33 insertions(+), 16 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 29c9371..db556f3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,7 @@ struct hugepage_subpool {
struct resv_map {
struct kref refs;
+ spinlock_t lock;
struct list_head regions;
};
extern struct resv_map *resv_map_alloc(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 572866d..6b40d7e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -135,15 +135,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
*
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantiation_mutex. To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation_mutex:
- *
- * down_write(&mm->mmap_sem);
- * or
- * down_read(&mm->mmap_sem);
- * mutex_lock(&hugetlb_instantiation_mutex);
+ * The region data structures are embedded into a resv_map and
+ * protected by a resv_map's lock
*/
struct file_region {
struct list_head link;
@@ -156,6 +149,7 @@ static long region_add(struct resv_map *resv, long f, long t)
struct list_head *head = &resv->regions;
struct file_region *rg, *nrg, *trg;
+ spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -185,15 +179,18 @@ static long region_add(struct resv_map *resv, long f, long t)
}
nrg->from = f;
nrg->to = t;
+ spin_unlock(&resv->lock);
return 0;
}
static long region_chg(struct resv_map *resv, long f, long t)
{
struct list_head *head = &resv->regions;
- struct file_region *rg, *nrg;
+ struct file_region *rg, *nrg = NULL;
long chg = 0;
+retry:
+ spin_lock(&resv->lock);
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
* Subtle, allocate a new region at the position but make it zero
* size such that we can guarantee to record the reservation. */
if (&rg->link == head || t < rg->from) {
- nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
- if (!nrg)
- return -ENOMEM;
+ if (!nrg) {
+ spin_unlock(&resv->lock);
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+
+ goto retry;
+ }
+
nrg->from = f;
nrg->to = f;
INIT_LIST_HEAD(&nrg->link);
list_add(&nrg->link, rg->link.prev);
+ nrg = NULL;
- return t - f;
+ chg = t - f;
+ goto out_locked;
}
/* Round our left edge to the current segment if it encloses us. */
@@ -224,7 +229,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
if (&rg->link == head)
break;
if (rg->from > t)
- return chg;
+ goto out_locked;
/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
@@ -235,6 +240,10 @@ static long region_chg(struct resv_map *resv, long f, long t)
}
chg -= rg->to - rg->from;
}
+
+out_locked:
+ spin_unlock(&resv->lock);
+ kfree(nrg);
return chg;
}
@@ -244,12 +253,13 @@ static long region_truncate(struct resv_map *resv, long end)
struct file_region *rg, *trg;
long chg = 0;
+ spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (end <= rg->to)
break;
if (&rg->link == head)
- return 0;
+ goto out;
/* If we are in the middle of a region then adjust it. */
if (end > rg->from) {
@@ -266,6 +276,9 @@ static long region_truncate(struct resv_map *resv, long end)
list_del(&rg->link);
kfree(rg);
}
+
+out:
+ spin_unlock(&resv->lock);
return chg;
}
@@ -275,6 +288,7 @@ static long region_count(struct resv_map *resv, long f, long t)
struct file_region *rg;
long chg = 0;
+ spin_lock(&resv->lock);
/* Locate each segment we overlap with, and count that overlap. */
list_for_each_entry(rg, head, link) {
long seg_from;
@@ -290,6 +304,7 @@ static long region_count(struct resv_map *resv, long f, long t)
chg += seg_to - seg_from;
}
+ spin_unlock(&resv->lock);
return chg;
}
@@ -387,6 +402,7 @@ struct resv_map *resv_map_alloc(void)
return NULL;
kref_init(&resv_map->refs);
+ spin_lock_init(&resv_map->lock);
INIT_LIST_HEAD(&resv_map->regions);
return resv_map;
--
1.8.1.4
From: Joonsoo Kim <[email protected]>
Currently, to track reserved and allocated regions, we use two different
ways, depending on the mapping. For MAP_SHARED, we use the address_space's
private_list, while for MAP_PRIVATE, we use a resv_map.
Now, we are preparing to change the coarse grained lock which protects the
region structure to a fine grained lock, and this difference hinders that.
So, before changing it, unify the region structure handling, consistently
using a resv_map regardless of the kind of mapping.
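Concretely, for shared mappings the region bookkeeping moves from the
mapping's private_list to a resv_map hung off the inode (condensed from the
diff below):

    /* before */
    chg = region_chg(&inode->i_mapping->private_list, from, to);

    /* after: hugetlbfs_get_inode() allocates a resv_map and stores it in
     * inode->i_mapping->private_data, so both mapping types use a resv_map
     */
    struct resv_map *resv_map = inode->i_mapping->private_data;
    chg = region_chg(&resv_map->regions, from, to);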
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
[Updated changelog]
Signed-off-by: Davidlohr Bueso <[email protected]>
---
fs/hugetlbfs/inode.c | 17 +++++++++++++++--
include/linux/hugetlb.h | 9 +++++++++
mm/hugetlb.c | 37 +++++++++++++++++++++----------------
3 files changed, 45 insertions(+), 18 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d19b30a..2040275 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -366,7 +366,13 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct resv_map *resv_map;
+
truncate_hugepages(inode, 0);
+ resv_map = (struct resv_map *)inode->i_mapping->private_data;
+ /* root inode doesn't have the resv_map, so we should check it */
+ if (resv_map)
+ resv_map_release(&resv_map->refs);
clear_inode(inode);
}
@@ -476,6 +482,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
umode_t mode, dev_t dev)
{
struct inode *inode;
+ struct resv_map *resv_map;
+
+ resv_map = resv_map_alloc();
+ if (!resv_map)
+ return NULL;
inode = new_inode(sb);
if (inode) {
@@ -487,7 +498,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
- INIT_LIST_HEAD(&inode->i_mapping->private_list);
+ inode->i_mapping->private_data = resv_map;
info = HUGETLBFS_I(inode);
/*
* The policy is initialized here even if we are creating a
@@ -517,7 +528,9 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
break;
}
lockdep_annotate_inode_mutex_key(inode);
- }
+ } else
+ kref_put(&resv_map->refs, resv_map_release);
+
return inode;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d01cc97..29c9371 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -5,6 +5,8 @@
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/kref.h>
struct ctl_table;
struct user_struct;
@@ -22,6 +24,13 @@ struct hugepage_subpool {
long max_hpages, used_hpages;
};
+struct resv_map {
+ struct kref refs;
+ struct list_head regions;
+};
+extern struct resv_map *resv_map_alloc(void);
+void resv_map_release(struct kref *ref);
+
extern spinlock_t hugetlb_lock;
extern int hugetlb_max_hstate __read_mostly;
#define for_each_hstate(h) \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 04306b9..fbed1b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -376,12 +376,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
vma->vm_private_data = (void *)value;
}
-struct resv_map {
- struct kref refs;
- struct list_head regions;
-};
-
-static struct resv_map *resv_map_alloc(void)
+struct resv_map *resv_map_alloc(void)
{
struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
if (!resv_map)
@@ -393,7 +388,7 @@ static struct resv_map *resv_map_alloc(void)
return resv_map;
}
-static void resv_map_release(struct kref *ref)
+void resv_map_release(struct kref *ref)
{
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
@@ -1155,8 +1150,9 @@ static long vma_needs_reservation(struct hstate *h,
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- return region_chg(&inode->i_mapping->private_list,
- idx, idx + 1);
+ struct resv_map *resv = inode->i_mapping->private_data;
+
+ return region_chg(&resv->regions, idx, idx + 1);
} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -1180,7 +1176,9 @@ static void vma_commit_reservation(struct hstate *h,
if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- region_add(&inode->i_mapping->private_list, idx, idx + 1);
+ struct resv_map *resv = inode->i_mapping->private_data;
+
+ region_add(&resv->regions, idx, idx + 1);
} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
@@ -3161,6 +3159,7 @@ int hugetlb_reserve_pages(struct inode *inode,
long ret, chg;
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
+ struct resv_map *resv_map;
/*
* Only apply hugepage reservation if asked. At fault time, an
@@ -3176,10 +3175,13 @@ int hugetlb_reserve_pages(struct inode *inode,
* to reserve the full area even if read-only as mprotect() may be
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
- if (!vma || vma->vm_flags & VM_MAYSHARE)
- chg = region_chg(&inode->i_mapping->private_list, from, to);
- else {
- struct resv_map *resv_map = resv_map_alloc();
+ if (!vma || vma->vm_flags & VM_MAYSHARE) {
+ resv_map = inode->i_mapping->private_data;
+
+ chg = region_chg(&resv_map->regions, from, to);
+
+ } else {
+ resv_map = resv_map_alloc();
if (!resv_map)
return -ENOMEM;
@@ -3222,7 +3224,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&inode->i_mapping->private_list, from, to);
+ region_add(&resv_map->regions, from, to);
return 0;
out_err:
if (vma)
@@ -3233,9 +3235,12 @@ out_err:
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
struct hstate *h = hstate_inode(inode);
- long chg = region_truncate(&inode->i_mapping->private_list, offset);
+ struct resv_map *resv_map = inode->i_mapping->private_data;
+ long chg = 0;
struct hugepage_subpool *spool = subpool_inode(inode);
+ if (resv_map)
+ chg = region_truncate(&resv_map->regions, offset);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
1.8.1.4
Sigh, I sent the wrong patch; this one has some bogus leftovers from some
other things. Please ignore it, I'm sending v2.
On Sun, 2014-01-26 at 19:52 -0800, Davidlohr Bueso wrote:
> The kernel can currently only handle a single hugetlb page fault at a time.
> This is due to a single mutex that serializes the entire path. This lock
> protects from spurious OOM errors under conditions of low of low availability
> of free hugepages. This problem is specific to hugepages, because it is
> normal to want to use every single hugepage in the system - with normal pages
> we simply assume there will always be a few spare pages which can be used
> temporarily until the race is resolved.
>
> Address this problem by using a table of mutexes, allowing a better chance of
> parallelization, where each hugepage is individually serialized. The hash key
> is selected depending on the mapping type. For shared ones it consists of the
> address space and file offset being faulted; while for private ones the mm and
> virtual address are used. The size of the table is selected based on a compromise
> of collisions and memory footprint of a series of database workloads.
>
> Large database workloads that make heavy use of hugepages can be particularly
> exposed to this issue, causing start-up times to be painfully slow. This patch
> reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
> to 25.7 secs. Larger workloads will naturally benefit even more.
>
> NOTE:
> The only downside to this patch, detected by Joonsoo Kim, is that a small race
> is possible in private mappings: A child process (with its own mm, after cow)
> can instantiate a page that is already being handled by the parent in a cow
> fault. When low on pages, can trigger spurious OOMs. I have not been able to
> think of a efficient way of handling this... but do we really care about such
> a tiny window? We already maintain another theoretical race with normal pages.
> If not, one possible way to is to maintain the single hash for private mappings
> -- any workloads that *really* suffer from this scaling problem should already
> use shared mappings.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> mm/hugetlb.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 73 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f3efa5..ec04e84 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -22,6 +22,7 @@
> #include <linux/swap.h>
> #include <linux/swapops.h>
> #include <linux/page-isolation.h>
> +#include <linux/jhash.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -53,6 +54,13 @@ static unsigned long __initdata default_hstate_size;
> */
> DEFINE_SPINLOCK(hugetlb_lock);
>
> +/*
> ++ * Serializes faults on the same logical page. This is used to
> ++ * prevent spurious OOMs when the hugepage pool is fully utilized.
> ++ */
> +static int __read_mostly num_fault_mutexes;
> +static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
> +
> static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> {
> bool free = (spool->count == 0) && (spool->used_hpages == 0);
> @@ -1922,11 +1930,14 @@ static void __exit hugetlb_exit(void)
> }
>
> kobject_put(hugepages_kobj);
> + kfree(htlb_fault_mutex_table);
> }
> module_exit(hugetlb_exit);
>
> static int __init hugetlb_init(void)
> {
> + int i;
> +
> /* Some platform decide whether they support huge pages at boot
> * time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
> * there is no such support
> @@ -1951,6 +1962,18 @@ static int __init hugetlb_init(void)
> hugetlb_register_all_nodes();
> hugetlb_cgroup_file_init();
>
> +#ifdef CONFIG_SMP
> + num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
> +#else
> + num_fault_mutexes = 1;
> +#endif
> + htlb_fault_mutex_table =
> + kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
> + if (!htlb_fault_mutex_table)
> + return -ENOMEM;
> +
> + for (i = 0; i < num_fault_mutexes; i++)
> + mutex_init(&htlb_fault_mutex_table[i]);
> return 0;
> }
> module_init(hugetlb_init);
> @@ -2733,15 +2756,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
> }
>
> static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long address, pte_t *ptep, unsigned int flags)
> + struct address_space *mapping, pgoff_t idx,
> + unsigned long address, pte_t *ptep, unsigned int flags)
> {
> struct hstate *h = hstate_vma(vma);
> int ret = VM_FAULT_SIGBUS;
> int anon_rmap = 0;
> - pgoff_t idx;
> unsigned long size;
> struct page *page;
> - struct address_space *mapping;
> pte_t new_pte;
> spinlock_t *ptl;
>
> @@ -2756,9 +2778,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> return ret;
> }
>
> - mapping = vma->vm_file->f_mapping;
> - idx = vma_hugecache_offset(h, vma, address);
> -
> /*
> * Use page lock to guard against racing truncation
> * before we get page_table_lock.
> @@ -2868,17 +2887,53 @@ backout_unlocked:
> goto out;
> }
>
> +#ifdef CONFIG_SMP
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + struct address_space *mapping,
> + pgoff_t idx, unsigned long address)
> +{
> + unsigned long key[2];
> + u32 hash;
> +
> + if (vma->vm_flags & VM_SHARED) {
> + key[0] = (unsigned long) mapping;
> + key[1] = idx;
> + } else {
> + key[0] = (unsigned long) mm;
> + key[1] = address >> huge_page_shift(h);
> + }
> +
> + hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
> +
> + return hash & (num_fault_mutexes - 1);
> +}
> +#else
> +/*
> + * For uniprocesor systems we always use a single mutex, so just
> + * return 0 and avoid the hashing overhead.
> + */
> +static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + struct address_space *mapping,
> + pgoff_t idx, unsigned long address)
> +{
> + return 0;
> +}
> +#endif
> +
> int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, unsigned int flags)
> {
> - pte_t *ptep;
> - pte_t entry;
> + pte_t *ptep, entry;
> spinlock_t *ptl;
> int ret;
> + u32 hash, parent_hash;
> + pgoff_t idx;
> struct page *page = NULL;
> struct page *pagecache_page = NULL;
> - static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> struct hstate *h = hstate_vma(vma);
> + struct address_space *mapping;
>
> address &= huge_page_mask(h);
>
> @@ -2897,15 +2952,21 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!ptep)
> return VM_FAULT_OOM;
>
> + mapping = vma->vm_file->f_mapping;
> + idx = vma_hugecache_offset(h, vma, address);
> +
> /*
> * Serialize hugepage allocation and instantiation, so that we don't
> * get spurious allocation failures if two CPUs race to instantiate
> * the same page in the page cache.
> */
> - mutex_lock(&hugetlb_instantiation_mutex);
> + parent_hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> + hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
> + mutex_lock(&htlb_fault_mutex_table[hash]);
> +
> entry = huge_ptep_get(ptep);
> if (huge_pte_none(entry)) {
> - ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> + ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
> goto out_mutex;
> }
>
> @@ -2974,8 +3035,7 @@ out_ptl:
> put_page(page);
>
> out_mutex:
> - mutex_unlock(&hugetlb_instantiation_mutex);
> -
> + mutex_unlock(&htlb_fault_mutex_table[hash]);
> return ret;
> }
>
The kernel can currently only handle a single hugetlb page fault at a time.
This is due to a single mutex that serializes the entire path. This lock
protects from spurious OOM errors under conditions of low availability
of free hugepages. This problem is specific to hugepages, because it is
normal to want to use every single hugepage in the system - with normal pages
we simply assume there will always be a few spare pages which can be used
temporarily until the race is resolved.
Address this problem by using a table of mutexes, allowing a better chance of
parallelization, where each hugepage is individually serialized. The hash key
is selected depending on the mapping type. For shared ones it consists of the
address space and file offset being faulted; while for private ones the mm and
virtual address are used. The size of the table is selected based on a
compromise between collisions and memory footprint across a series of
database workloads.
Large database workloads that make heavy use of hugepages can be particularly
exposed to this issue, causing start-up times to be painfully slow. This patch
reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs
to 25.7 secs. Larger workloads will naturally benefit even more.
NOTE:
The only downside to this patch, detected by Joonsoo Kim, is that a small race
is possible in private mappings: A child process (with its own mm, after cow)
can instantiate a page that is already being handled by the parent in a cow
fault. When low on pages, this can trigger spurious OOMs. I have not been able
to think of an efficient way of handling this... but do we really care about
such a tiny window? We already maintain another theoretical race with normal
pages. If not, one possible way is to maintain the single hash for private
mappings
-- any workloads that *really* suffer from this scaling problem should already
use shared mappings.
Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/hugetlb.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 72 insertions(+), 13 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f3efa5..cffd105 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -22,6 +22,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/page-isolation.h>
+#include <linux/jhash.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -53,6 +54,13 @@ static unsigned long __initdata default_hstate_size;
*/
DEFINE_SPINLOCK(hugetlb_lock);
+/*
+ * Serializes faults on the same logical page. This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -1922,11 +1930,14 @@ static void __exit hugetlb_exit(void)
}
kobject_put(hugepages_kobj);
+ kfree(htlb_fault_mutex_table);
}
module_exit(hugetlb_exit);
static int __init hugetlb_init(void)
{
+ int i;
+
/* Some platform decide whether they support huge pages at boot
* time. On these, such as powerpc, HPAGE_SHIFT is set to 0 when
* there is no such support
@@ -1951,6 +1962,18 @@ static int __init hugetlb_init(void)
hugetlb_register_all_nodes();
hugetlb_cgroup_file_init();
+#ifdef CONFIG_SMP
+ num_fault_mutexes = roundup_pow_of_two(8 * num_possible_cpus());
+#else
+ num_fault_mutexes = 1;
+#endif
+ htlb_fault_mutex_table =
+ kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+ if (!htlb_fault_mutex_table)
+ return -ENOMEM;
+
+ for (i = 0; i < num_fault_mutexes; i++)
+ mutex_init(&htlb_fault_mutex_table[i]);
return 0;
}
module_init(hugetlb_init);
@@ -2733,15 +2756,14 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
}
static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ struct address_space *mapping, pgoff_t idx,
+ unsigned long address, pte_t *ptep, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
int ret = VM_FAULT_SIGBUS;
int anon_rmap = 0;
- pgoff_t idx;
unsigned long size;
struct page *page;
- struct address_space *mapping;
pte_t new_pte;
spinlock_t *ptl;
@@ -2756,9 +2778,6 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}
- mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, address);
-
/*
* Use page lock to guard against racing truncation
* before we get page_table_lock.
@@ -2868,17 +2887,53 @@ backout_unlocked:
goto out;
}
+#ifdef CONFIG_SMP
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ unsigned long key[2];
+ u32 hash;
+
+ if (vma->vm_flags & VM_SHARED) {
+ key[0] = (unsigned long) mapping;
+ key[1] = idx;
+ } else {
+ key[0] = (unsigned long) mm;
+ key[1] = address >> huge_page_shift(h);
+ }
+
+ hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+
+ return hash & (num_fault_mutexes - 1);
+}
+#else
+/*
+ * For uniprocessor systems we always use a single mutex, so just
+ * return 0 and avoid the hashing overhead.
+ */
+static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address)
+{
+ return 0;
+}
+#endif
+
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
- pte_t *ptep;
- pte_t entry;
+ pte_t *ptep, entry;
spinlock_t *ptl;
int ret;
+ u32 hash;
+ pgoff_t idx;
struct page *page = NULL;
struct page *pagecache_page = NULL;
- static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);
+ struct address_space *mapping;
address &= huge_page_mask(h);
@@ -2897,15 +2952,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!ptep)
return VM_FAULT_OOM;
+ mapping = vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, vma, address);
+
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- mutex_lock(&hugetlb_instantiation_mutex);
+ hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ mutex_lock(&htlb_fault_mutex_table[hash]);
+
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
goto out_mutex;
}
@@ -2974,8 +3034,7 @@ out_ptl:
put_page(page);
out_mutex:
- mutex_unlock(&hugetlb_instantiation_mutex);
-
+ mutex_unlock(&htlb_fault_mutex_table[hash]);
return ret;
}
--
1.8.1.4
On Sun, Jan 26, 2014 at 07:52:20PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> To change a protection method for region tracking to find grained one,
> we pass the resv_map, instead of list_head, to region manipulation
> functions. This doesn't introduce any functional change, and it is just
> for preparing a next step.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
On Sun, Jan 26, 2014 at 07:52:19PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> Currently, to track reserved and allocated regions, we use two different
> ways, depending on the mapping. For MAP_SHARED, we use address_mapping's
> private_list and, while for MAP_PRIVATE, we use a resv_map.
>
> Now, we are preparing to change a coarse grained lock which protect a
> region structure to fine grained lock, and this difference hinder it.
> So, before changing it, unify region structure handling, consistently
> using a resv_map regardless of the kind of mapping.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> [Updated changelog]
> Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
On Sun, Jan 26, 2014 at 07:52:21PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> There is a race condition if we map a same file on different processes.
> Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> When we do mmap, we don't grab a hugetlb_instantiation_mutex, but only the,
> mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
> region structure, so it can be modified by two processes concurrently.
>
> To solve this, introduce a spinlock to resv_map and make region manipulation
> function grab it before they do actual work.
>
> Acked-by: David Gibson <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> [Updated changelog]
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
...
> @@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
> * Subtle, allocate a new region at the position but make it zero
> * size such that we can guarantee to record the reservation. */
> if (&rg->link == head || t < rg->from) {
> - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> - if (!nrg)
> - return -ENOMEM;
> + if (!nrg) {
> + spin_unlock(&resv->lock);
I think that doing kmalloc() inside the lock is simpler.
Why do you unlock and retry here?
Thanks,
Naoya Horiguchi
> + nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> + if (!nrg)
> + return -ENOMEM;
> +
> + goto retry;
> + }
> +
> nrg->from = f;
> nrg->to = f;
> INIT_LIST_HEAD(&nrg->link);
> list_add(&nrg->link, rg->link.prev);
> + nrg = NULL;
>
> - return t - f;
> + chg = t - f;
> + goto out_locked;
> }
>
> /* Round our left edge to the current segment if it encloses us. */
On Sun, Jan 26, 2014 at 07:52:22PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> This is a preparation patch to unify the use of vma_resv_map() regardless
> of the map type. This patch prepares it by removing resv_map_put(), which
> only works for HPAGE_RESV_OWNER's resv_map, not for all resv_maps.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> [Updated changelog]
> Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
On Sun, Jan 26, 2014 at 07:52:23PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> Util now, we get a resv_map by two ways according to each mapping type.
> This makes code dirty and unreadable. Unify it.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
There are a few small nitpicks below ...
> ---
> mm/hugetlb.c | 76 ++++++++++++++++++++++++++++++------------------------------
> 1 file changed, 38 insertions(+), 38 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 13edf17..541cceb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -417,13 +417,24 @@ void resv_map_release(struct kref *ref)
> kfree(resv_map);
> }
>
> +static inline struct resv_map *inode_resv_map(struct inode *inode)
> +{
> + return inode->i_mapping->private_data;
> +}
> +
> static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
> {
> VM_BUG_ON(!is_vm_hugetlb_page(vma));
> - if (!(vma->vm_flags & VM_MAYSHARE))
> + if (vma->vm_flags & VM_MAYSHARE) {
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct inode *inode = mapping->host;
You don't have to declare mapping.
> +
> + return inode_resv_map(inode);
> +
> + } else {
> return (struct resv_map *)(get_vma_private_data(vma) &
> ~HPAGE_RESV_MASK);
> - return NULL;
> + }
> }
>
> static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
> @@ -1165,48 +1176,34 @@ static void return_unused_surplus_pages(struct hstate *h,
> static long vma_needs_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> -
> - if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = inode->i_mapping->private_data;
> -
> - return region_chg(resv, idx, idx + 1);
> + struct resv_map *resv;
> + pgoff_t idx;
> + long chg;
>
> - } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> + resv = vma_resv_map(vma);
> + if (!resv)
> return 1;
>
> - } else {
> - long err;
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = vma_resv_map(vma);
> + idx = vma_hugecache_offset(h, vma, addr);
> + chg = region_chg(resv, idx, idx + 1);
>
> - err = region_chg(resv, idx, idx + 1);
> - if (err < 0)
> - return err;
> - return 0;
> - }
> + if (vma->vm_flags & VM_MAYSHARE)
> + return chg;
> + else
> + return chg < 0 ? chg : 0;
> }
> static void vma_commit_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> -
> - if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = inode->i_mapping->private_data;
> -
> - region_add(resv, idx, idx + 1);
> + struct resv_map *resv;
> + pgoff_t idx;
>
> - } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = vma_resv_map(vma);
> + resv = vma_resv_map(vma);
> + if (!resv)
> + return;
>
> - /* Mark this page used in the map. */
> - region_add(resv, idx, idx + 1);
> - }
> + idx = vma_hugecache_offset(h, vma, addr);
> + region_add(resv, idx, idx + 1);
> }
>
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> @@ -2269,7 +2266,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> * after this open call completes. It is therefore safe to take a
> * new reference here without additional locking.
> */
> - if (resv)
> + if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> kref_get(&resv->refs);
> }
>
> @@ -2282,7 +2279,10 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> unsigned long start;
> unsigned long end;
>
> - if (resv) {
> + if (!resv)
> + return;
> +
> + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
looks simpler.
Thanks,
Naoya Horiguchi
> start = vma_hugecache_offset(h, vma, vma->vm_start);
> end = vma_hugecache_offset(h, vma, vma->vm_end);
>
> @@ -3187,7 +3187,7 @@ int hugetlb_reserve_pages(struct inode *inode,
> * called to make the mapping read-write. Assume !vma is a shm mapping
> */
> if (!vma || vma->vm_flags & VM_MAYSHARE) {
> - resv_map = inode->i_mapping->private_data;
> + resv_map = inode_resv_map(inode);
>
> chg = region_chg(resv_map, from, to);
>
> @@ -3246,7 +3246,7 @@ out_err:
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> struct hstate *h = hstate_inode(inode);
> - struct resv_map *resv_map = inode->i_mapping->private_data;
> + struct resv_map *resv_map = inode_resv_map(inode);
> long chg = 0;
> struct hugepage_subpool *spool = subpool_inode(inode);
>
> --
> 1.8.1.4
>
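On the nitpick above about the redundant local variable: a minimal sketch of how the unified vma_resv_map() could read without it, using only the identifiers visible in the hunk (illustrative, not necessarily the exact respin):

static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
{
	VM_BUG_ON(!is_vm_hugetlb_page(vma));

	if (vma->vm_flags & VM_MAYSHARE)
		/* Shared mappings keep their resv_map on the inode. */
		return inode_resv_map(vma->vm_file->f_mapping->host);

	/* Private mappings stash it in the vma's private data. */
	return (struct resv_map *)(get_vma_private_data(vma) &
				   ~HPAGE_RESV_MASK);
}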
On Sun, Jan 26, 2014 at 07:52:24PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> vma_has_reserves() can be substituted by using the return value of
> vma_needs_reservation(). If the chg returned by vma_needs_reservation()
> is 0, it means that the vma has reserves. Otherwise, it means that the vma
> doesn't have reserves and needs a hugepage from outside the reserve pool.
> This definition is exactly the same as vma_has_reserves(), so remove
> vma_has_reserves().
I'm concerned that this patch doesn't work when VM_NORESERVE is set.
vma_needs_reservation() doesn't check VM_NORESERVE and this patch changes
dequeue_huge_page_vma() not to check it. So no one seems to check it any more.
It would be nice if you added a comment about the justification in the changelog.
Does the testcase attached in commit af0ed73e699b still pass with this patch?
(I'm not sure it's covered in the libhugetlbfs test suite.)
Thanks,
Naoya Horiguchi
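For context on the VM_NORESERVE concern: the helper being removed short-circuits on VM_NORESERVE before ever looking at the reservation map, whereas vma_needs_reservation() has no equivalent branch. Roughly, and only as an illustration of what would no longer be checked (paraphrased, not the exact source):

/* Paraphrase of the removed helper -- illustrative only */
static int vma_has_reserves(struct vm_area_struct *vma, long chg)
{
	if (vma->vm_flags & VM_NORESERVE) {
		/*
		 * A VM_NORESERVE mapping should not consume the reserve
		 * pool, except for a shared mapping whose page was already
		 * reserved by someone else (chg == 0).
		 */
		return (vma->vm_flags & VM_MAYSHARE) && chg == 0;
	}

	/* Shared mappings always use reserves. */
	if (vma->vm_flags & VM_MAYSHARE)
		return 1;

	/* Only the task that called mmap() owns reserves for private mappings. */
	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
		return 1;

	return 0;
}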
On Sun, Jan 26, 2014 at 07:52:25PM -0800, Davidlohr Bueso wrote:
> From: Joonsoo Kim <[email protected]>
>
> Currently, we have two variables to represent whether we can use a reserved
> page or not: chg and avoid_reserve. By aggregating these, we can have
> cleaner code. This makes no functional difference.
>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
On Mon, 2014-01-27 at 16:02 -0500, Naoya Horiguchi wrote:
> On Sun, Jan 26, 2014 at 07:52:21PM -0800, Davidlohr Bueso wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > There is a race condition if we map the same file in different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
> > mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
> > region structure, so it can be modified by two processes concurrently.
> >
> > To solve this, introduce a spinlock in resv_map and make the region
> > manipulation functions grab it before they do the actual work.
> >
> > Acked-by: David Gibson <[email protected]>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > [Updated changelog]
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > ---
> ...
> > @@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > * Subtle, allocate a new region at the position but make it zero
> > * size such that we can guarantee to record the reservation. */
> > if (&rg->link == head || t < rg->from) {
> > - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > - if (!nrg)
> > - return -ENOMEM;
> > + if (!nrg) {
> > + spin_unlock(&resv->lock);
>
> I think that doing kmalloc() inside the lock is simpler.
> Why do you unlock and retry here?
This is a spinlock, no can do -- we've previously debated this and since
the critical region is quite small, a non blocking lock is better suited
here. We do the retry so we don't race once the new region is allocated
after the lock is dropped.
Thanks,
Davidlohr
Hi Davidlohr,
On Mon, Jan 27, 2014 at 01:44:02PM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-27 at 16:02 -0500, Naoya Horiguchi wrote:
> > On Sun, Jan 26, 2014 at 07:52:21PM -0800, Davidlohr Bueso wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > There is a race condition if we map the same file in different processes.
> > > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > > When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
> > > mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
> > > region structure, so it can be modified by two processes concurrently.
> > >
> > > To solve this, introduce a spinlock in resv_map and make the region
> > > manipulation functions grab it before they do the actual work.
> > >
> > > Acked-by: David Gibson <[email protected]>
> > > Signed-off-by: Joonsoo Kim <[email protected]>
> > > [Updated changelog]
> > > Signed-off-by: Davidlohr Bueso <[email protected]>
> > > ---
> > ...
> > > @@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > > * Subtle, allocate a new region at the position but make it zero
> > > * size such that we can guarantee to record the reservation. */
> > > if (&rg->link == head || t < rg->from) {
> > > - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > > - if (!nrg)
> > > - return -ENOMEM;
> > > + if (!nrg) {
> > > + spin_unlock(&resv->lock);
> >
> > I think that doing kmalloc() inside the lock is simpler.
> > Why do you unlock and retry here?
>
> This is a spinlock, no can do -- we've previously debated this and since
> the critical region is quite small, a non blocking lock is better suited
> here. We do the retry so we don't race once the new region is allocated
> after the lock is dropped.
Using spinlock instead of rw_sem makes sense.
But I'm not sure how the retry is essential to fix the race.
(Sorry I can't find the discussion log about this.)
As you did in your ver.1 (https://lkml.org/lkml/2013/7/26/296),
simply doing like below seems to be fine to me, is it right?
if (&rg->link == head || t < rg->from) {
	nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
	if (!nrg) {
		chg = -ENOMEM;
		goto out_locked;
	}
	nrg->from = f;
	...
}
In the current version nrg is initialized to NULL, so we always do retry
once when adding new file_region. That's not optimal to me.
If this retry is really essential for the fix, please comment the reason
both in patch description and inline comment. It's very important for
future code maintenance.
And I noticed another point. I don't think the name of new goto label
'out_locked' is a good one. 'out_unlock' or 'unlock' is better.
Thanks,
Naoya Horiguchi
On Mon, 2014-01-27 at 20:53 -0500, Naoya Horiguchi wrote:
> Hi Davidlohr,
>
> On Mon, Jan 27, 2014 at 01:44:02PM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-01-27 at 16:02 -0500, Naoya Horiguchi wrote:
> > > On Sun, Jan 26, 2014 at 07:52:21PM -0800, Davidlohr Bueso wrote:
> > > > From: Joonsoo Kim <[email protected]>
> > > >
> > > > There is a race condition if we map the same file in different processes.
> > > > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > > > When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
> > > > mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
> > > > region structure, so it can be modified by two processes concurrently.
> > > >
> > > > To solve this, introduce a spinlock in resv_map and make the region
> > > > manipulation functions grab it before they do the actual work.
> > > >
> > > > Acked-by: David Gibson <[email protected]>
> > > > Signed-off-by: Joonsoo Kim <[email protected]>
> > > > [Updated changelog]
> > > > Signed-off-by: Davidlohr Bueso <[email protected]>
> > > > ---
> > > ...
> > > > @@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > > > * Subtle, allocate a new region at the position but make it zero
> > > > * size such that we can guarantee to record the reservation. */
> > > > if (&rg->link == head || t < rg->from) {
> > > > - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > > > - if (!nrg)
> > > > - return -ENOMEM;
> > > > + if (!nrg) {
> > > > + spin_unlock(&resv->lock);
> > >
> > > I think that doing kmalloc() inside the lock is simpler.
> > > Why do you unlock and retry here?
> >
> > This is a spinlock, no can do -- we've previously debated this and since
> > the critical region is quite small, a non blocking lock is better suited
> > here. We do the retry so we don't race once the new region is allocated
> > after the lock is dropped.
>
> Using spinlock instead of rw_sem makes sense.
> But I'm not sure how the retry is essential to fix the race.
> (Sorry I can't find the discussion log about this.)
> As you did in your ver.1 (https://lkml.org/lkml/2013/7/26/296),
> simply doing like below seems to be fine to me, is it right?
>
> if (&rg->link == head || t < rg->from) {
> 	nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> 	if (!nrg) {
> 		chg = -ENOMEM;
> 		goto out_locked;
> 	}
> 	nrg->from = f;
> 	...
> }
That was nice and simple back when we were using the rwsem version.
>
> In the current version nrg is initialized to NULL, so we always do retry
> once when adding new file_region. That's not optimal to me.
Right, the retry can only occur once.
>
> If this retry is really essential for the fix, please comment the reason
> both in patch description and inline comment. It's very important for
> future code maintenance.
So we locate the corresponding region in the reserve map, and if we are
below the current region, then we allocate a new one. Since we dropped
the lock to allocate memory, we have to make sure that we still need the
new region and that we don't race with the new status of the reservation
map. This is the whole point of the retry, and I don't see it being
suboptimal.
We just cannot retake the lock after we get the new region and just add
it to the list.
>
> And I noticed another point. I don't think the name of new goto label
> 'out_locked' is a good one. 'out_unlock' or 'unlock' is better.
What worries me more is that we're actually freeing a valid new region
(nrg) upon exit. We certainly don't do so in the current code, and it
doesn't seem to be a leak. Instead, we should be doing:
if (&rg->link == head || t < rg->from) {
	if (!nrg) {
		spin_unlock(&resv->lock);
		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
		if (!nrg)
			return -ENOMEM;
		goto retry;
	}
	nrg->from = f;
	nrg->to = f;
	INIT_LIST_HEAD(&nrg->link);
	list_add(&nrg->link, rg->link.prev);
	chg = t - f;
	goto out;
}
...
out:
	spin_unlock(&resv->lock);
	return chg;
Thanks,
Davidlohr
On Mon, 2014-01-27 at 16:03 -0500, Naoya Horiguchi wrote:
> On Sun, Jan 26, 2014 at 07:52:23PM -0800, Davidlohr Bueso wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Until now, we get a resv_map in two different ways depending on the mapping
> > type. This makes the code dirty and unreadable. Unify it.
> >
> > Reviewed-by: Aneesh Kumar K.V <[email protected]>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > Signed-off-by: Davidlohr Bueso <[email protected]>
>
> Reviewed-by: Naoya Horiguchi <[email protected]>
>
> There are a few small nitpicking below ...
Will update, thanks!
On Mon, Jan 27, 2014 at 06:34:17PM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-27 at 20:53 -0500, Naoya Horiguchi wrote:
> > Hi Davidlohr,
> >
> > On Mon, Jan 27, 2014 at 01:44:02PM -0800, Davidlohr Bueso wrote:
> > > On Mon, 2014-01-27 at 16:02 -0500, Naoya Horiguchi wrote:
> > > > On Sun, Jan 26, 2014 at 07:52:21PM -0800, Davidlohr Bueso wrote:
> > > > > From: Joonsoo Kim <[email protected]>
> > > > >
> > > > > There is a race condition if we map the same file in different processes.
> > > > > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > > > > When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only the
> > > > > mmap_sem (exclusively). This doesn't prevent other tasks from modifying the
> > > > > region structure, so it can be modified by two processes concurrently.
> > > > >
> > > > > To solve this, introduce a spinlock in resv_map and make the region
> > > > > manipulation functions grab it before they do the actual work.
> > > > >
> > > > > Acked-by: David Gibson <[email protected]>
> > > > > Signed-off-by: Joonsoo Kim <[email protected]>
> > > > > [Updated changelog]
> > > > > Signed-off-by: Davidlohr Bueso <[email protected]>
> > > > > ---
> > > > ...
> > > > > @@ -203,15 +200,23 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > > > > * Subtle, allocate a new region at the position but make it zero
> > > > > * size such that we can guarantee to record the reservation. */
> > > > > if (&rg->link == head || t < rg->from) {
> > > > > - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > > > > - if (!nrg)
> > > > > - return -ENOMEM;
> > > > > + if (!nrg) {
> > > > > + spin_unlock(&resv->lock);
> > > >
> > > > I think that doing kmalloc() inside the lock is simpler.
> > > > Why do you unlock and retry here?
> > >
> > > This is a spinlock, no can do -- we've previously debated this and since
> > > the critical region is quite small, a non blocking lock is better suited
> > > here. We do the retry so we don't race once the new region is allocated
> > > after the lock is dropped.
> >
> > Using spinlock instead of rw_sem makes sense.
> > But I'm not sure how the retry is essential to fix the race.
> > (Sorry I can't find the discussion log about this.)
> > As you did in your ver.1 (https://lkml.org/lkml/2013/7/26/296),
> > simply doing like below seems to be fine to me, is it right?
> >
> > if (&rg->link == head || t < rg->from) {
> > 	nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > 	if (!nrg) {
> > 		chg = -ENOMEM;
> > 		goto out_locked;
> > 	}
> > 	nrg->from = f;
> > 	...
> > }
>
> That's nice and simple because we were using the rwsem version.
>
> >
> > In the current version nrg is initialized to NULL, so we always do retry
> > once when adding new file_region. That's not optimal to me.
>
> Right, the retry can only occur once.
>
> >
> > If this retry is really essential for the fix, please comment the reason
> > both in patch description and inline comment. It's very important for
> > future code maintenance.
>
> So we locate the corresponding region in the reserve map, and if we are
> below the current region, then we allocate a new one. Since we dropped
> the lock to allocate memory, we have to make sure that we still need the
> new region and that we don't race with the new status of the reservation
> map. This is the whole point of the retry, and I don't see it being
> suboptimal.
I'm afraid that you don't explain why you need to drop the lock for memory
allocation. Are you saying that this unlocking comes from the difference
between the rwsem and the spinlock?
I think if we call kmalloc() with the lock held we don't have to check
that "we still need the new region" because resv->lock guarantees that
no other thread changes the reservation map, right?
> We just cannot retake the lock after we get the new region and just add
> it to the list.
>
> >
> > And I noticed another point. I don't think the name of new goto label
> > 'out_locked' is a good one. 'out_unlock' or 'unlock' is better.
>
> What worries me more is that we're actually freeing a valid new region
> (nrg) upon exit. We certainly don't do so in the current code, and it
> doesn't seem to be a leak. Instead, we should be doing:
You're right. There is another goto in region_chg() where we never do
the kmalloc, so calling kfree is a bug.
Thanks,
Naoya Horiguchi
> if (&rg->link == head || t < rg->from) {
> 	if (!nrg) {
> 		spin_unlock(&resv->lock);
> 		nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> 		if (!nrg)
> 			return -ENOMEM;
>
> 		goto retry;
> 	}
>
> 	nrg->from = f;
> 	nrg->to = f;
> 	INIT_LIST_HEAD(&nrg->link);
> 	list_add(&nrg->link, rg->link.prev);
>
> 	chg = t - f;
> 	goto out;
> }
> ...
> out:
> 	spin_unlock(&resv->lock);
> 	return chg;
>
>
> Thanks,
> Davidlohr
>
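Putting the two points together -- never kfree() a region that has just been linked into the list, and give the exit label a clearer name -- the tail of region_chg() could be arranged roughly as below, with a separate exit for the freshly linked region instead of nulling nrg and sharing a single label. A sketch only; the label names and the exact split are illustrative:

		/* inside the "new region required" branch ... */
		nrg->from = f;
		nrg->to = f;
		INIT_LIST_HEAD(&nrg->link);
		list_add(&nrg->link, rg->link.prev);

		chg = t - f;
		goto out_nrg;	/* nrg is now on the list: must not be freed */
	}

	/* ... overlap accounting, which may bail out via "out" ... */

out:
	spin_unlock(&resv->lock);
	kfree(nrg);	/* drop a pre-allocated region we raced against and never linked */
	return chg;

out_nrg:
	spin_unlock(&resv->lock);
	return chg;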
On Tue, 2014-01-28 at 19:36 -0500, Naoya Horiguchi wrote:
> On Mon, Jan 27, 2014 at 06:34:17PM -0800, Davidlohr Bueso wrote:
[...]
> > > If this retry is really essential for the fix, please comment the reason
> > > both in patch description and inline comment. It's very important for
> > > future code maintenance.
> >
> > So we locate the corresponding region in the reserve map, and if we are
> > below the current region, then we allocate a new one. Since we dropped
> > the lock to allocate memory, we have to make sure that we still need the
> > new region and that we don't race with the new status of the reservation
> > map. This is the whole point of the retry, and I don't see it being
> > suboptimal.
>
> I'm afraid that you don't explain why you need to drop the lock for memory
> allocation. Are you saying that this unlocking comes from the difference
> between the rwsem and the spinlock?
Because you cannot go to sleep while holding a spinlock, which is
exactly what kmalloc(GFP_KERNEL) can do. We *might* get away with
GFP_ATOMIC, I dunno, but I certainly prefer this approach.
Thanks,
Davidlohr
On Mon, 2014-01-27 at 16:04 -0500, Naoya Horiguchi wrote:
> On Sun, Jan 26, 2014 at 07:52:24PM -0800, Davidlohr Bueso wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > vma_has_reserves() can be substituted by using the return value of
> > vma_needs_reservation(). If the chg returned by vma_needs_reservation()
> > is 0, it means that the vma has reserves. Otherwise, it means that the vma
> > doesn't have reserves and needs a hugepage from outside the reserve pool.
> > This definition is exactly the same as vma_has_reserves(), so remove
> > vma_has_reserves().
>
> I'm concerned that this patch doesn't work when VM_NORESERVE is set.
> vma_needs_reservation() doesn't check VM_NORESERVE and this patch changes
> dequeue_huge_page_vma() not to check it. So no one seems to check it any more.
Good catch. I agree, this is new behavior and quite frankly not worth
changing just for a cleanup - the code is subtle enough as it is. I'm
dropping this patch and #7, which depends on this one; if Joonsoo wants
to pursue this later, he can.
Thanks,
Davidlohr
On Tue, 28 Jan 2014 17:19:38 -0800 Davidlohr Bueso <[email protected]> wrote:
> On Tue, 2014-01-28 at 19:36 -0500, Naoya Horiguchi wrote:
> > On Mon, Jan 27, 2014 at 06:34:17PM -0800, Davidlohr Bueso wrote:
> [...]
> > > > If this retry is really essential for the fix, please comment the reason
> > > > both in patch description and inline comment. It's very important for
> > > > future code maintenance.
> > >
> > > So we locate the corresponding region in the reserve map, and if we are
> > > below the current region, then we allocate a new one. Since we dropped
> > > the lock to allocate memory, we have to make sure that we still need the
> > > new region and that we don't race with the new status of the reservation
> > > map. This is the whole point of the retry, and I don't see it being
> > > suboptimal.
> >
> > I'm afraid that you don't explain why you need to drop the lock for memory
> > allocation. Are you saying that this unlocking comes from the difference
> > between the rwsem and the spinlock?
>
> Because you cannot go to sleep while holding a spinlock, which is
> exactly what kmalloc(GFP_KERNEL) can do. We *might* get away with
> GFP_ATOMIC, I dunno, but I certainly prefer this approach.
yup. You could do
foo = kmalloc(size, GFP_NOWAIT);
if (!foo) {
	spin_unlock(...);
	foo = kmalloc(size, GFP_KERNEL);
	if (!foo)
		...
	spin_lock(...);
}
That avoids an unlock/relock cycle per allocation. But it also increases
the lock's average hold times...
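Fleshed out inside region_chg(), the opportunistic allocation above would look something like the following. This is only a sketch of the suggestion, assuming the retry/label structure of the posted patch; it is not code from the thread:

	if (&rg->link == head || t < rg->from) {
		if (!nrg) {
			/* Opportunistic, non-sleeping attempt under the lock. */
			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
			if (!nrg) {
				/*
				 * Fall back to a sleeping allocation: drop the
				 * lock, allocate, and redo the lookup since the
				 * reservation map may have changed meanwhile.
				 */
				spin_unlock(&resv->lock);
				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
				if (!nrg)
					return -ENOMEM;
				goto retry;
			}
		}
		/* ... link nrg and account for the charge as in the patch ... */
	}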