2013-07-29 05:32:28

by Joonsoo Kim

Subject: [PATCH 00/18] mm, hugetlb: remove a hugetlb_instantiation_mutex

Without the hugetlb_instantiation_mutex, parallel faults can fail to
allocate a hugepage, because many threads dequeue a hugepage to handle
a fault on the same address. This makes the reserved pool run short for
a little while, and a faulting thread that is guaranteed to have enough
reserved hugepages can get a SIGBUS signal.

To solve this problem, we already have a nice solution, that is, the
hugetlb_instantiation_mutex. It blocks other threads from diving into
the fault handler. This solves the problem cleanly, but it introduces
performance degradation, because it serializes all fault handling.

Now, I try to remove the hugetlb_instantiation_mutex to get rid of
the performance problem reported by Davidlohr Bueso [1].

It is implemented in the following 3 steps.

Step 1. Protect region tracking via a per-resv_map spinlock.

Currently, region tracking is protected by the
hugetlb_instantiation_mutex, so before removing the mutex we should
replace it with another solution.
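
As a rough sketch of this step (the concrete change is in patches 3-5
below, whose structure this mirrors), the idea is to embed a spinlock
in the reservation map and take it inside each region_*() helper:

struct resv_map {
	struct kref refs;
	spinlock_t lock;		/* protects the regions list */
	struct list_head regions;
};

struct resv_map *resv_map_alloc(void)
{
	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);

	if (!resv_map)
		return NULL;

	kref_init(&resv_map->refs);
	spin_lock_init(&resv_map->lock);
	INIT_LIST_HEAD(&resv_map->regions);

	return resv_map;
}

/* Each region_add/chg/truncate/count helper then brackets its list
 * walk with spin_lock(&resv->lock) / spin_unlock(&resv->lock). */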

Step 2. Decide whether to use the reserved page pool or not in a uniform way.

We need graceful failure handling if there is no lock like the
hugetlb_instantiation_mutex. To decide whether we need to handle
a failure or not, we need to know the current reservation status properly.
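
Roughly, the uniform decision ends up as a single use_reserve flag
computed before the allocation. This is only a sketch of what
patches 6-11 below converge on, not the exact code:

	/* Decide once whether this allocation may consume a reserve. */
	use_reserve = vma_has_reserves(h, vma, address);
	if (use_reserve < 0)
		return VM_FAULT_OOM;	/* region bookkeeping failed */
	use_reserve = use_reserve && !avoid_reserve;

	page = alloc_huge_page(vma, address, use_reserve);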

Step 3. Graceful failure handling if we fail with a reserved page or
fail to allocate with use_reserve.

Failure handling consists of two cases. One is that if we fail while
holding a reserved page, we return it back to the reserved pool
properly. The current code doesn't recover the reserve count properly,
so we need to fix it. The other is that if we fail to allocate a new
huge page with the use_reserve indicator, we return 0 to the fault
handler instead of SIGBUS. This makes the thread retry the fault
handling. With the above handling, we can handle a fault in any
situation without the hugetlb_instantiation_mutex.
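
In fault-handler terms, the retry half of step 3 looks roughly like
the following sketch of what patch 17 below implements:

	page = alloc_huge_page(vma, address, use_reserve);
	if (IS_ERR(page)) {
		/*
		 * A use_reserve user can only fail due to a transient
		 * shortage caused by parallel faults, so return 0 and
		 * let the caller retry the fault instead of raising
		 * SIGBUS.
		 */
		ret = use_reserve ? 0 : VM_FAULT_SIGBUS;
		goto out;
	}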

Patch 1: Fix a minor problem
Patch 2-5: Implement Step 1.
Patch 6-11: Implement Step 2.
Patch 12-18: Implement Step 3.

These patches are based on my previous patchset [2], which is itself
based on v3.10.

With these applied, I cleanly passed the libhugetlbfs test suite,
which includes allocation-instantiation race test cases.

If there is something I should consider, please let me know!
Thanks.

[1] http://lwn.net/Articles/558863/
"[PATCH] mm/hugetlb: per-vma instantiation mutexes"
[2] https://lkml.org/lkml/2013/7/22/96
"[PATCH v2 00/10] mm, hugetlb: clean-up and possible bug fix"


Joonsoo Kim (18):
mm, hugetlb: protect reserved pages when softofflining requests the
pages
mm, hugetlb: change variable name reservations to resv
mm, hugetlb: unify region structure handling
mm, hugetlb: region manipulation functions take resv_map rather than
list_head
mm, hugetlb: protect region tracking via newly introduced resv_map
lock
mm, hugetlb: remove vma_need_reservation()
mm, hugetlb: pass has_reserve to dequeue_huge_page_vma()
mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve
mm, hugetlb: unify has_reserve and avoid_reserve to use_reserve
mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()
mm, hugetlb: move down outside_reserve check
mm, hugetlb: remove a check for return value of alloc_huge_page()
mm, hugetlb: grab a page_table_lock after page_cache_release
mm, hugetlb: clean-up error handling in hugetlb_cow()
mm, hugetlb: move up anon_vma_prepare()
mm, hugetlb: return a reserved page to a reserved pool if failed
mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve
mm, hugetlb: remove a hugetlb_instantiation_mutex

fs/hugetlbfs/inode.c | 12 +-
include/linux/hugetlb.h | 10 ++
mm/hugetlb.c | 361 +++++++++++++++++++++++++----------------------
3 files changed, 217 insertions(+), 166 deletions(-)

--
1.7.9.5


2013-07-29 05:32:30

by Joonsoo Kim

Subject: [PATCH 04/18] mm, hugetlb: region manipulation functions take resv_map rather than list_head

To change the protection method for region tracking to a fine-grained
one, we pass the resv_map, instead of the list_head, to the region
manipulation functions. This doesn't introduce any functional change;
it just prepares for the next step.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 35f6b12..24c0111 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -150,8 +150,9 @@ struct file_region {
long to;
};

-static long region_add(struct list_head *head, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *nrg, *trg;

/* Locate the region we are either in or before. */
@@ -186,8 +187,9 @@ static long region_add(struct list_head *head, long f, long t)
return 0;
}

-static long region_chg(struct list_head *head, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *nrg;
long chg = 0;

@@ -235,8 +237,9 @@ static long region_chg(struct list_head *head, long f, long t)
return chg;
}

-static long region_truncate(struct list_head *head, long end)
+static long region_truncate(struct resv_map *resv, long end)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg, *trg;
long chg = 0;

@@ -265,8 +268,9 @@ static long region_truncate(struct list_head *head, long end)
return chg;
}

-static long region_count(struct list_head *head, long f, long t)
+static long region_count(struct resv_map *resv, long f, long t)
{
+ struct list_head *head = &resv->regions;
struct file_region *rg;
long chg = 0;

@@ -392,7 +396,7 @@ void resv_map_release(struct kref *ref)
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);

/* Clear out any active regions before we release the map. */
- region_truncate(&resv_map->regions, 0);
+ region_truncate(resv_map, 0);
kfree(resv_map);
}

@@ -1083,7 +1087,7 @@ static long vma_needs_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = inode->i_mapping->private_data;

- return region_chg(&resv->regions, idx, idx + 1);
+ return region_chg(resv, idx, idx + 1);

} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -1093,7 +1097,7 @@ static long vma_needs_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = vma_resv_map(vma);

- err = region_chg(&resv->regions, idx, idx + 1);
+ err = region_chg(resv, idx, idx + 1);
if (err < 0)
return err;
return 0;
@@ -1109,14 +1113,14 @@ static void vma_commit_reservation(struct hstate *h,
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = inode->i_mapping->private_data;

- region_add(&resv->regions, idx, idx + 1);
+ region_add(resv, idx, idx + 1);

} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
struct resv_map *resv = vma_resv_map(vma);

/* Mark this page used in the map. */
- region_add(&resv->regions, idx, idx + 1);
+ region_add(resv, idx, idx + 1);
}
}

@@ -2203,7 +2207,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
end = vma_hugecache_offset(h, vma, vma->vm_end);

reserve = (end - start) -
- region_count(&resv->regions, start, end);
+ region_count(resv, start, end);

resv_map_put(vma);

@@ -3078,7 +3082,7 @@ int hugetlb_reserve_pages(struct inode *inode,
if (!vma || vma->vm_flags & VM_MAYSHARE) {
resv_map = inode->i_mapping->private_data;

- chg = region_chg(&resv_map->regions, from, to);
+ chg = region_chg(resv_map, from, to);

} else {
resv_map = resv_map_alloc();
@@ -3124,7 +3128,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&resv_map->regions, from, to);
+ region_add(resv_map, from, to);
return 0;
out_err:
if (vma)
@@ -3140,7 +3144,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
struct hugepage_subpool *spool = subpool_inode(inode);

if (resv_map)
- chg = region_truncate(&resv_map->regions, offset);
+ chg = region_truncate(resv_map, offset);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
1.7.9.5

2013-07-29 05:32:32

by Joonsoo Kim

Subject: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

There is a race condition if we map the same file in different processes.
Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only
mmap_sem. This doesn't prevent another process from modifying the region
structure, so it can be modified by two processes concurrently.

To solve this, I introduce a lock to resv_map and make the region
manipulation functions grab the lock before they do the actual work.
This makes region tracking safe.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2677c07..e29e28f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -26,6 +26,7 @@ struct hugepage_subpool {

struct resv_map {
struct kref refs;
+ spinlock_t lock;
struct list_head regions;
};
extern struct resv_map *resv_map_alloc(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 24c0111..bf2ee11 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
* Region tracking -- allows tracking of reservations and instantiated pages
* across the pages in a mapping.
*
- * The region data structures are protected by a combination of the mmap_sem
- * and the hugetlb_instantiation_mutex. To access or modify a region the caller
- * must either hold the mmap_sem for write, or the mmap_sem for read and
- * the hugetlb_instantiation_mutex:
- *
- * down_write(&mm->mmap_sem);
- * or
- * down_read(&mm->mmap_sem);
- * mutex_lock(&hugetlb_instantiation_mutex);
+ * The region data structures are embedded into a resv_map and
+ * protected by a resv_map's lock
*/
struct file_region {
struct list_head link;
@@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
struct list_head *head = &resv->regions;
struct file_region *rg, *nrg, *trg;

+ spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, long t)
}
nrg->from = f;
nrg->to = t;
+ spin_unlock(&resv->lock);
return 0;
}

@@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
struct file_region *rg, *nrg;
long chg = 0;

+ spin_lock(&resv->lock);
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t)
* size such that we can guarantee to record the reservation. */
if (&rg->link == head || t < rg->from) {
nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
- if (!nrg)
- return -ENOMEM;
+ if (!nrg) {
+ chg = -ENOMEM;
+ goto out;
+ }
+
nrg->from = f;
nrg->to = f;
INIT_LIST_HEAD(&nrg->link);
list_add(&nrg->link, rg->link.prev);

- return t - f;
+ chg = t - f;
+ goto out;
}

/* Round our left edge to the current segment if it encloses us. */
@@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
if (&rg->link == head)
break;
if (rg->from > t)
- return chg;
+ goto out;

/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
@@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, long t)
}
chg -= rg->to - rg->from;
}
+
+out:
+ spin_unlock(&resv->lock);
return chg;
}

@@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, long end)
struct file_region *rg, *trg;
long chg = 0;

+ spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (end <= rg->to)
break;
if (&rg->link == head)
- return 0;
+ goto out;

/* If we are in the middle of a region then adjust it. */
if (end > rg->from) {
@@ -265,6 +269,9 @@ static long region_truncate(struct resv_map *resv, long end)
list_del(&rg->link);
kfree(rg);
}
+
+out:
+ spin_unlock(&resv->lock);
return chg;
}

@@ -274,6 +281,7 @@ static long region_count(struct resv_map *resv, long f, long t)
struct file_region *rg;
long chg = 0;

+ spin_lock(&resv->lock);
/* Locate each segment we overlap with, and count that overlap. */
list_for_each_entry(rg, head, link) {
long seg_from;
@@ -289,6 +297,7 @@ static long region_count(struct resv_map *resv, long f, long t)

chg += seg_to - seg_from;
}
+ spin_unlock(&resv->lock);

return chg;
}
@@ -386,6 +395,7 @@ struct resv_map *resv_map_alloc(void)
return NULL;

kref_init(&resv_map->refs);
+ spin_lock_init(&resv_map->lock);
INIT_LIST_HEAD(&resv_map->regions);

return resv_map;
--
1.7.9.5

2013-07-29 05:32:26

by Joonsoo Kim

Subject: [PATCH 02/18] mm, hugetlb: change variable name reservations to resv

'reservations' is too long a name for a variable, and we use 'resv_map'
to represent 'struct resv_map' elsewhere. To reduce confusion and
improve readability, change it.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d971233..12b6581 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h,
} else {
long err;
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *reservations = vma_resv_map(vma);
+ struct resv_map *resv = vma_resv_map(vma);

- err = region_chg(&reservations->regions, idx, idx + 1);
+ err = region_chg(&resv->regions, idx, idx + 1);
if (err < 0)
return err;
return 0;
@@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h,

} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *reservations = vma_resv_map(vma);
+ struct resv_map *resv = vma_resv_map(vma);

/* Mark this page used in the map. */
- region_add(&reservations->regions, idx, idx + 1);
+ region_add(&resv->regions, idx, idx + 1);
}
}

@@ -2168,7 +2168,7 @@ out:

static void hugetlb_vm_op_open(struct vm_area_struct *vma)
{
- struct resv_map *reservations = vma_resv_map(vma);
+ struct resv_map *resv = vma_resv_map(vma);

/*
* This new VMA should share its siblings reservation map if present.
@@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
* after this open call completes. It is therefore safe to take a
* new reference here without additional locking.
*/
- if (reservations)
- kref_get(&reservations->refs);
+ if (resv)
+ kref_get(&resv->refs);
}

static void resv_map_put(struct vm_area_struct *vma)
{
- struct resv_map *reservations = vma_resv_map(vma);
+ struct resv_map *resv = vma_resv_map(vma);

- if (!reservations)
+ if (!resv)
return;
- kref_put(&reservations->refs, resv_map_release);
+ kref_put(&resv->refs, resv_map_release);
}

static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
- struct resv_map *reservations = vma_resv_map(vma);
+ struct resv_map *resv = vma_resv_map(vma);
struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve;
unsigned long start;
unsigned long end;

- if (reservations) {
+ if (resv) {
start = vma_hugecache_offset(h, vma, vma->vm_start);
end = vma_hugecache_offset(h, vma, vma->vm_end);

reserve = (end - start) -
- region_count(&reservations->regions, start, end);
+ region_count(&resv->regions, start, end);

resv_map_put(vma);

--
1.7.9.5

2013-07-29 05:33:24

by Joonsoo Kim

Subject: [PATCH 18/18] mm, hugetlb: remove a hugetlb_instantiation_mutex

Now we have prepared the infrastructure needed to remove this awkward
mutex which serializes all faulting tasks, so remove it.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 909075b..4fab047 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2533,9 +2533,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,

/*
* Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
- * cannot race with other handlers or page migration.
- * Keep the pte_same checks anyway to make transition from the mutex easier.
+ * Called with pte_page locked so we cannot race with page migration.
*/
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pte_t pte,
@@ -2844,7 +2842,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
- static DEFINE_MUTEX(hugetlb_instantiation_mutex);
struct hstate *h = hstate_vma(vma);

address &= huge_page_mask(h);
@@ -2864,17 +2861,9 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!ptep)
return VM_FAULT_OOM;

- /*
- * Serialize hugepage allocation and instantiation, so that we don't
- * get spurious allocation failures if two CPUs race to instantiate
- * the same page in the page cache.
- */
- mutex_lock(&hugetlb_instantiation_mutex);
entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, address, ptep, flags);
- goto out_mutex;
- }
+ if (huge_pte_none(entry))
+ return hugetlb_no_page(mm, vma, address, ptep, flags);

ret = 0;

@@ -2887,10 +2876,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* consumed.
*/
if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
- if (vma_has_reserves(h, vma, address) < 0) {
- ret = VM_FAULT_OOM;
- goto out_mutex;
- }
+ if (vma_has_reserves(h, vma, address) < 0)
+ return VM_FAULT_OOM;

if (!(vma->vm_flags & VM_MAYSHARE))
pagecache_page = hugetlbfs_pagecache_page(h,
@@ -2939,9 +2926,6 @@ out_page_table_lock:
unlock_page(page);
put_page(page);

-out_mutex:
- mutex_unlock(&hugetlb_instantiation_mutex);
-
return ret;
}

--
1.7.9.5

2013-07-29 05:32:25

by Joonsoo Kim

Subject: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

alloc_huge_page_node() uses dequeue_huge_page_node() without any
validation check, so it can steal a reserved page unconditionally.
To fix it, check the number of free huge pages in alloc_huge_page_node().

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6782b41..d971233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
*/
struct page *alloc_huge_page_node(struct hstate *h, int nid)
{
- struct page *page;
+ struct page *page = NULL;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_node(h, nid);
+ if (h->free_huge_pages - h->resv_huge_pages > 0)
+ page = dequeue_huge_page_node(h, nid);
spin_unlock(&hugetlb_lock);

if (!page)
--
1.7.9.5

2013-07-29 05:33:41

by Joonsoo Kim

Subject: [PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed

If we fail with a reserved page, just calling put_page() is not
sufficient, because put_page() invokes free_huge_page() as its last
step, which doesn't know whether the page came from the reserved pool
or not, so it does nothing related to the reserve count. This leaves
the reserve count lower than it should be, because the reserve count
was already decreased in dequeue_huge_page_vma(). This patch fixes
that situation.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb8a45f..6a9ec69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
}

+static void put_huge_page(struct page *page, int use_reserve)
+{
+ struct hstate *h = page_hstate(page);
+ struct hugepage_subpool *spool =
+ (struct hugepage_subpool *)page_private(page);
+
+ if (!use_reserve) {
+ put_page(page);
+ return;
+ }
+
+ if (!put_page_testzero(page))
+ return;
+
+ set_page_private(page, 0);
+ page->mapping = NULL;
+ BUG_ON(page_count(page));
+ BUG_ON(page_mapcount(page));
+
+ spin_lock(&hugetlb_lock);
+ hugetlb_cgroup_uncharge_page(hstate_index(h),
+ pages_per_huge_page(h), page);
+ enqueue_huge_page(h, page);
+ h->resv_huge_pages++;
+ spin_unlock(&hugetlb_lock);
+ hugepage_subpool_put_pages(spool, 1);
+}
+
static void free_huge_page(struct page *page)
{
/*
@@ -2625,7 +2653,7 @@ retry_avoidcopy:
spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

- page_cache_release(new_page);
+ put_huge_page(new_page, use_reserve);
out_old_page:
page_cache_release(old_page);
out_lock:
@@ -2725,7 +2753,7 @@ retry:

err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
if (err) {
- put_page(page);
+ put_huge_page(page, use_reserve);
if (err == -EEXIST)
goto retry;
goto out;
@@ -2798,7 +2826,7 @@ backout:
spin_unlock(&mm->page_table_lock);
backout_unlocked:
unlock_page(page);
- put_page(page);
+ put_huge_page(page, use_reserve);
goto out;
}

--
1.7.9.5

2013-07-29 05:33:39

by Joonsoo Kim

Subject: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

If parallel faults occur, we can fail to allocate a hugepage, because
many threads dequeue a hugepage to handle a fault on the same address.
This makes the reserved pool run short for a little while, and a
faulting thread that is guaranteed to have enough reserved hugepages
can get a SIGBUS signal.

To solve this problem, we already have a nice solution, that is, the
hugetlb_instantiation_mutex. It blocks other threads from diving into
the fault handler. This solves the problem cleanly, but it introduces
performance degradation, because it serializes all fault handling.

Now, I try to remove the hugetlb_instantiation_mutex to get rid of the
performance degradation. A prerequisite is that other threads should
not get a SIGBUS if they are guaranteed to have enough reserved pages.

For this purpose, if we fail to allocate a new hugepage with
use_reserve, we return just 0 instead of VM_FAULT_SIGBUS. use_reserve
indicates that this user is a legitimate one who is guaranteed to have
enough reserved pages. This prevents these threads from getting a
SIGBUS signal and makes them retry the fault handling.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a9ec69..909075b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2623,7 +2623,10 @@ retry_avoidcopy:
WARN_ON_ONCE(1);
}

- ret = VM_FAULT_SIGBUS;
+ if (use_reserve)
+ ret = 0;
+ else
+ ret = VM_FAULT_SIGBUS;
goto out_lock;
}

@@ -2741,7 +2744,10 @@ retry:

page = alloc_huge_page(vma, address, use_reserve);
if (IS_ERR(page)) {
- ret = VM_FAULT_SIGBUS;
+ if (use_reserve)
+ ret = 0;
+ else
+ ret = VM_FAULT_SIGBUS;
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
--
1.7.9.5

2013-07-29 05:34:09

by Joonsoo Kim

Subject: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

If we fail after having allocated a hugepage, it is hard to recover
properly. One such example is the reserve count; we don't have any
method to recover it. Although I will introduce a function to recover
the reserve count in a following patch, it is better to avoid
allocating a hugepage as much as possible. So move up
anon_vma_prepare(), which can fail in an OOM situation.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 683fd38..bb8a45f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2536,6 +2536,15 @@ retry_avoidcopy:
/* Drop page_table_lock as buddy allocator may be called */
spin_unlock(&mm->page_table_lock);

+ /*
+ * When the original hugepage is shared one, it does not have
+ * anon_vma prepared.
+ */
+ if (unlikely(anon_vma_prepare(vma))) {
+ ret = VM_FAULT_OOM;
+ goto out_old_page;
+ }
+
use_reserve = vma_has_reserves(h, vma, address);
if (use_reserve == -ENOMEM) {
ret = VM_FAULT_OOM;
@@ -2590,15 +2599,6 @@ retry_avoidcopy:
goto out_lock;
}

- /*
- * When the original hugepage is shared one, it does not have
- * anon_vma prepared.
- */
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
- goto out_new_page;
- }
-
copy_user_huge_page(new_page, old_page, address, vma,
pages_per_huge_page(h));
__SetPageUptodate(new_page);
@@ -2625,7 +2625,6 @@ retry_avoidcopy:
spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

-out_new_page:
page_cache_release(new_page);
out_old_page:
page_cache_release(old_page);
--
1.7.9.5

2013-07-29 05:34:30

by Joonsoo Kim

Subject: [PATCH 14/18] mm, hugetlb: clean-up error handling in hugetlb_cow()

The current code includes 'Caller expects lock to be held' in every
error path. We can clean this up by doing the error handling in one place.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 255bd9e..683fd38 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2516,6 +2516,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
int use_reserve, outside_reserve = 0;
+ int ret = 0;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

@@ -2537,11 +2538,8 @@ retry_avoidcopy:

use_reserve = vma_has_reserves(h, vma, address);
if (use_reserve == -ENOMEM) {
- page_cache_release(old_page);
-
- /* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ ret = VM_FAULT_OOM;
+ goto out_old_page;
}

/*
@@ -2588,9 +2586,8 @@ retry_avoidcopy:
WARN_ON_ONCE(1);
}

- /* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
- return VM_FAULT_SIGBUS;
+ ret = VM_FAULT_SIGBUS;
+ goto out_lock;
}

/*
@@ -2598,11 +2595,8 @@ retry_avoidcopy:
* anon_vma prepared.
*/
if (unlikely(anon_vma_prepare(vma))) {
- page_cache_release(new_page);
- page_cache_release(old_page);
- /* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
- return VM_FAULT_OOM;
+ ret = VM_FAULT_OOM;
+ goto out_new_page;
}

copy_user_huge_page(new_page, old_page, address, vma,
@@ -2630,12 +2624,15 @@ retry_avoidcopy:
}
spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+out_new_page:
page_cache_release(new_page);
+out_old_page:
page_cache_release(old_page);
-
+out_lock:
/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
- return 0;
+ return ret;
}

/* Return the pagecache page at a given address within a VMA */
--
1.7.9.5

2013-07-29 05:34:44

by Joonsoo Kim

Subject: [PATCH 13/18] mm, hugetlb: grab a page_table_lock after page_cache_release

We don't need to hold the page_table_lock when we try to release a page.
So, defer grabbing the page_table_lock.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 35ccdad..255bd9e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2630,10 +2630,11 @@ retry_avoidcopy:
}
spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- /* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
page_cache_release(new_page);
page_cache_release(old_page);
+
+ /* Caller expects lock to be held */
+ spin_lock(&mm->page_table_lock);
return 0;
}

--
1.7.9.5

2013-07-29 05:35:00

by Joonsoo Kim

Subject: [PATCH 11/18] mm, hugetlb: move down outside_reserve check

Just move down the outside_reserve check.
This makes the code more readable.

There is no functional change.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5f31ca5..94173e0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2530,20 +2530,6 @@ retry_avoidcopy:
return 0;
}

- /*
- * If the process that created a MAP_PRIVATE mapping is about to
- * perform a COW due to a shared page count, attempt to satisfy
- * the allocation without using the existing reserves. The pagecache
- * page is used to determine if the reserve at this address was
- * consumed or not. If reserves were used, a partial faulted mapping
- * at the time of fork() could consume its reserves on COW instead
- * of the full address range.
- */
- if (!(vma->vm_flags & VM_MAYSHARE) &&
- is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
- old_page != pagecache_page)
- outside_reserve = 1;
-
page_cache_get(old_page);

/* Drop page_table_lock as buddy allocator may be called */
@@ -2557,6 +2543,20 @@ retry_avoidcopy:
spin_lock(&mm->page_table_lock);
return VM_FAULT_OOM;
}
+
+ /*
+ * If the process that created a MAP_PRIVATE mapping is about to
+ * perform a COW due to a shared page count, attempt to satisfy
+ * the allocation without using the existing reserves. The pagecache
+ * page is used to determine if the reserve at this address was
+ * consumed or not. If reserves were used, a partial faulted mapping
+ * at the time of fork() could consume its reserves on COW instead
+ * of the full address range.
+ */
+ if (!(vma->vm_flags & VM_MAYSHARE) &&
+ is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+ old_page != pagecache_page)
+ outside_reserve = 1;
use_reserve = use_reserve && !outside_reserve;

new_page = alloc_huge_page(vma, address, use_reserve);
--
1.7.9.5

2013-07-29 05:34:58

by Joonsoo Kim

Subject: [PATCH 12/18] mm, hugetlb: remove a check for return value of alloc_huge_page()

Now, alloc_huge_page() only returns -ENOSPC if it fails,
so we don't need to worry about other return values.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 94173e0..35ccdad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2562,7 +2562,6 @@ retry_avoidcopy:
new_page = alloc_huge_page(vma, address, use_reserve);

if (IS_ERR(new_page)) {
- long err = PTR_ERR(new_page);
page_cache_release(old_page);

/*
@@ -2591,10 +2590,7 @@ retry_avoidcopy:

/* Caller expects lock to be held */
spin_lock(&mm->page_table_lock);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- else
- return VM_FAULT_SIGBUS;
+ return VM_FAULT_SIGBUS;
}

/*
@@ -2720,11 +2716,7 @@ retry:

page = alloc_huge_page(vma, address, use_reserve);
if (IS_ERR(page)) {
- ret = PTR_ERR(page);
- if (ret == -ENOMEM)
- ret = VM_FAULT_OOM;
- else
- ret = VM_FAULT_SIGBUS;
+ ret = VM_FAULT_SIGBUS;
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
--
1.7.9.5

2013-07-29 05:35:43

by Joonsoo Kim

Subject: [PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()

To implement graceful failure handling, we need to know whether an
allocation request is for the reserved pool or not at a higher level.
In this patch, we just move vma_has_reserves() up into the caller
in order to know it. There is no functional change.

Following patches implement the graceful failure handling and remove
the hugetlb_instantiation_mutex.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a66226e..5f31ca5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h,
}

static struct page *alloc_huge_page(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
+ unsigned long addr, int use_reserve)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- int ret, idx, use_reserve;
+ int ret, idx;
struct hugetlb_cgroup *h_cg;

idx = hstate_index(h);
@@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
* need pages and subpool limit allocated allocated if no reserve
* mapping overlaps.
*/
- use_reserve = vma_has_reserves(h, vma, addr);
- if (use_reserve < 0)
- return ERR_PTR(-ENOMEM);
-
- use_reserve = use_reserve && !avoid_reserve;
if (!use_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
return ERR_PTR(-ENOSPC);

@@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
- int outside_reserve = 0;
+ int use_reserve, outside_reserve = 0;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

@@ -2553,7 +2548,18 @@ retry_avoidcopy:

/* Drop page_table_lock as buddy allocator may be called */
spin_unlock(&mm->page_table_lock);
- new_page = alloc_huge_page(vma, address, outside_reserve);
+
+ use_reserve = vma_has_reserves(h, vma, address);
+ if (use_reserve == -ENOMEM) {
+ page_cache_release(old_page);
+
+ /* Caller expects lock to be held */
+ spin_lock(&mm->page_table_lock);
+ return VM_FAULT_OOM;
+ }
+ use_reserve = use_reserve && !outside_reserve;
+
+ new_page = alloc_huge_page(vma, address, use_reserve);

if (IS_ERR(new_page)) {
long err = PTR_ERR(new_page);
@@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ int use_reserve = 0;

/*
* Currently, we are forced to kill the process in the event the
@@ -2704,7 +2711,14 @@ retry:
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
- page = alloc_huge_page(vma, address, 0);
+
+ use_reserve = vma_has_reserves(h, vma, address);
+ if (use_reserve == -ENOMEM) {
+ ret = VM_FAULT_OOM;
+ goto out;
+ }
+
+ page = alloc_huge_page(vma, address, use_reserve);
if (IS_ERR(page)) {
ret = PTR_ERR(page);
if (ret == -ENOMEM)
--
1.7.9.5

2013-07-29 05:35:56

by Joonsoo Kim

Subject: [PATCH 09/18] mm, hugetlb: unify has_reserve and avoid_reserve to use_reserve

Currently, we have two variables to represent whether we can use a
reserved page or not: has_reserve and avoid_reserve.
These carry the same meaning, so we can unify them into use_reserve.
This makes no functional difference; it is just a clean-up.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 749629e..a66226e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -572,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)

static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address,
- int has_reserve, int avoid_reserve)
+ unsigned long address, int use_reserve)
{
struct page *page = NULL;
struct mempolicy *mpol;
@@ -586,13 +585,11 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
/*
* A child process with MAP_PRIVATE mappings created by their parent
* have no page reserves. This check ensures that reservations are
- * not "stolen". The child may still get SIGKILLed
+ * not "stolen". The child may still get SIGKILLed.
+ * Or, when parent process do COW, we cannot use reserved page.
+ * In this case, ensure enough pages are in the pool.
*/
- if (!has_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
- return NULL;
-
- /* If reserves cannot be used, ensure enough pages are in the pool */
- if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
+ if (!use_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
return NULL;

retry_cpuset:
@@ -605,9 +602,7 @@ retry_cpuset:
if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask)) {
page = dequeue_huge_page_node(h, zone_to_nid(zone));
if (page) {
- if (avoid_reserve)
- break;
- if (!has_reserve)
+ if (!use_reserve)
break;

h->resv_huge_pages--;
@@ -1133,7 +1128,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- int ret, idx, has_reserve;
+ int ret, idx, use_reserve;
struct hugetlb_cgroup *h_cg;

idx = hstate_index(h);
@@ -1145,22 +1140,22 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
* need pages and subpool limit allocated allocated if no reserve
* mapping overlaps.
*/
- has_reserve = vma_has_reserves(h, vma, addr);
- if (has_reserve < 0)
+ use_reserve = vma_has_reserves(h, vma, addr);
+ if (use_reserve < 0)
return ERR_PTR(-ENOMEM);

- if ((!has_reserve || avoid_reserve)
- && (hugepage_subpool_get_pages(spool, 1) < 0))
+ use_reserve = use_reserve && !avoid_reserve;
+ if (!use_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
return ERR_PTR(-ENOSPC);

ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret) {
- if (!has_reserve || avoid_reserve)
+ if (!use_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, has_reserve, avoid_reserve);
+ page = dequeue_huge_page_vma(h, vma, addr, use_reserve);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1168,7 +1163,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup(idx,
pages_per_huge_page(h),
h_cg);
- if (!has_reserve || avoid_reserve)
+ if (!use_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
--
1.7.9.5

2013-07-29 05:36:18

by Joonsoo Kim

Subject: [PATCH 03/18] mm, hugetlb: unify region structure handling

Currently, to track reserved and allocated regions, we use two different
ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use the
address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
Now we are preparing to change the coarse-grained lock which protects
the region structure to a fine-grained one, and this difference hinders
it. So, before changing it, unify the region structure handling.
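
As an illustration of where the resv_map lives after this patch (this
helper is hypothetical; the callers in the diff below fetch it inline):

static struct resv_map *lookup_resv_map(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_MAYSHARE) {
		/* Shared mappings: resv_map hangs off the inode mapping. */
		struct address_space *mapping = vma->vm_file->f_mapping;

		return mapping->private_data;
	}

	/* Private mappings: resv_map hangs off vma->vm_private_data. */
	return vma_resv_map(vma);
}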

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..a1ae3ada 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)

static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct resv_map *resv_map;
+
truncate_hugepages(inode, 0);
+ resv_map = (struct resv_map *)inode->i_mapping->private_data;
+ if (resv_map)
+ kref_put(&resv_map->refs, resv_map_release);
clear_inode(inode);
}

@@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
umode_t mode, dev_t dev)
{
struct inode *inode;
+ struct resv_map *resv_map;
+
+ resv_map = resv_map_alloc();
+ if (!resv_map)
+ return NULL;

inode = new_inode(sb);
if (inode) {
@@ -477,7 +487,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
- INIT_LIST_HEAD(&inode->i_mapping->private_list);
+ inode->i_mapping->private_data = resv_map;
info = HUGETLBFS_I(inode);
/*
* The policy is initialized here even if we are creating a
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6b4890f..2677c07 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -5,6 +5,8 @@
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/kref.h>

struct ctl_table;
struct user_struct;
@@ -22,6 +24,13 @@ struct hugepage_subpool {
long max_hpages, used_hpages;
};

+struct resv_map {
+ struct kref refs;
+ struct list_head regions;
+};
+extern struct resv_map *resv_map_alloc(void);
+void resv_map_release(struct kref *ref);
+
extern spinlock_t hugetlb_lock;
extern int hugetlb_max_hstate __read_mostly;
#define for_each_hstate(h) \
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 12b6581..35f6b12 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -375,12 +375,7 @@ static void set_vma_private_data(struct vm_area_struct *vma,
vma->vm_private_data = (void *)value;
}

-struct resv_map {
- struct kref refs;
- struct list_head regions;
-};
-
-static struct resv_map *resv_map_alloc(void)
+struct resv_map *resv_map_alloc(void)
{
struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
if (!resv_map)
@@ -392,7 +387,7 @@ static struct resv_map *resv_map_alloc(void)
return resv_map;
}

-static void resv_map_release(struct kref *ref)
+void resv_map_release(struct kref *ref)
{
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);

@@ -1086,8 +1081,9 @@ static long vma_needs_reservation(struct hstate *h,

if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- return region_chg(&inode->i_mapping->private_list,
- idx, idx + 1);
+ struct resv_map *resv = inode->i_mapping->private_data;
+
+ return region_chg(&resv->regions, idx, idx + 1);

} else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
return 1;
@@ -1111,7 +1107,9 @@ static void vma_commit_reservation(struct hstate *h,

if (vma->vm_flags & VM_MAYSHARE) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- region_add(&inode->i_mapping->private_list, idx, idx + 1);
+ struct resv_map *resv = inode->i_mapping->private_data;
+
+ region_add(&resv->regions, idx, idx + 1);

} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
pgoff_t idx = vma_hugecache_offset(h, vma, addr);
@@ -3061,6 +3059,7 @@ int hugetlb_reserve_pages(struct inode *inode,
long ret, chg;
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
+ struct resv_map *resv_map;

/*
* Only apply hugepage reservation if asked. At fault time, an
@@ -3076,10 +3075,13 @@ int hugetlb_reserve_pages(struct inode *inode,
* to reserve the full area even if read-only as mprotect() may be
* called to make the mapping read-write. Assume !vma is a shm mapping
*/
- if (!vma || vma->vm_flags & VM_MAYSHARE)
- chg = region_chg(&inode->i_mapping->private_list, from, to);
- else {
- struct resv_map *resv_map = resv_map_alloc();
+ if (!vma || vma->vm_flags & VM_MAYSHARE) {
+ resv_map = inode->i_mapping->private_data;
+
+ chg = region_chg(&resv_map->regions, from, to);
+
+ } else {
+ resv_map = resv_map_alloc();
if (!resv_map)
return -ENOMEM;

@@ -3122,7 +3124,7 @@ int hugetlb_reserve_pages(struct inode *inode,
* else has to be done for private mappings here
*/
if (!vma || vma->vm_flags & VM_MAYSHARE)
- region_add(&inode->i_mapping->private_list, from, to);
+ region_add(&resv_map->regions, from, to);
return 0;
out_err:
if (vma)
@@ -3133,9 +3135,12 @@ out_err:
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
struct hstate *h = hstate_inode(inode);
- long chg = region_truncate(&inode->i_mapping->private_list, offset);
+ struct resv_map *resv_map = inode->i_mapping->private_data;
+ long chg = 0;
struct hugepage_subpool *spool = subpool_inode(inode);

+ if (resv_map)
+ chg = region_truncate(&resv_map->regions, offset);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
1.7.9.5

2013-07-29 05:36:16

by Joonsoo Kim

Subject: [PATCH 07/18] mm, hugetlb: pass has_reserve to dequeue_huge_page_vma()

We don't have to call vma_has_reserves() each time we need this
information. Passing has_reserve down avoids the repeated calls.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ff46a2c..1426c03 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -572,7 +572,8 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)

static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve)
+ unsigned long address,
+ int has_reserve, int avoid_reserve)
{
struct page *page = NULL;
struct mempolicy *mpol;
@@ -587,8 +588,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
- if (!vma_has_reserves(h, vma, address) &&
- h->free_huge_pages - h->resv_huge_pages == 0)
+ if (!has_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
return NULL;

/* If reserves cannot be used, ensure enough pages are in the pool */
@@ -607,7 +607,7 @@ retry_cpuset:
if (page) {
if (avoid_reserve)
break;
- if (!vma_has_reserves(h, vma, address))
+ if (!has_reserve)
break;

h->resv_huge_pages--;
@@ -1159,7 +1159,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
+ page = dequeue_huge_page_vma(h, vma, addr, has_reserve, avoid_reserve);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
--
1.7.9.5

2013-07-29 05:36:14

by Joonsoo Kim

Subject: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

vma_needs_reservation() can be substituted by vma_has_reserves()
with minor changes. These functions do almost the same thing,
so unifying them is better for maintenance.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bf2ee11..ff46a2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
vma->vm_private_data = (void *)0;
}

-/* Returns true if the VMA has associated reserve pages */
-static int vma_has_reserves(struct vm_area_struct *vma, long chg)
+/*
+ * Determine if the huge page at addr within the vma has an associated
+ * reservation. Where it does not we will need to logically increase
+ * reservation and actually increase subpool usage before an allocation
+ * can occur. Where any new reservation would be required the
+ * reservation change is prepared, but not committed. Once the page
+ * has been allocated from the subpool and instantiated the change should
+ * be committed via vma_commit_reservation. No action is required on
+ * failure.
+ */
+static int vma_has_reserves(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long addr)
{
if (vma->vm_flags & VM_NORESERVE) {
/*
@@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
* step. Currently, we don't have any other solution to deal
* with this situation properly, so add work-around here.
*/
- if (vma->vm_flags & VM_MAYSHARE && chg == 0)
- return 1;
- else
- return 0;
+ if (vma->vm_flags & VM_MAYSHARE) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct inode *inode = mapping->host;
+ pgoff_t idx = vma_hugecache_offset(h, vma, addr);
+ struct resv_map *resv = inode->i_mapping->private_data;
+ long chg;
+
+ chg = region_chg(resv, idx, idx + 1);
+ if (chg < 0)
+ return -ENOMEM;
+
+ if (chg == 0)
+ return 1;
+ }
+
+ return 0;
}

/* Shared mappings always use reserves */
@@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
* Only the process that called mmap() has reserves for
* private mappings.
*/
- if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+ if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
+ pgoff_t idx = vma_hugecache_offset(h, vma, addr);
+ struct resv_map *resv = vma_resv_map(vma);
+
+ /* Just for allocating region structure */
+ if (region_chg(resv, idx, idx + 1) < 0)
+ return -ENOMEM;
+
return 1;
+ }

return 0;
}
@@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)

static struct page *dequeue_huge_page_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve,
- long chg)
+ unsigned long address, int avoid_reserve)
{
struct page *page = NULL;
struct mempolicy *mpol;
@@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
- if (!vma_has_reserves(vma, chg) &&
+ if (!vma_has_reserves(h, vma, address) &&
h->free_huge_pages - h->resv_huge_pages == 0)
return NULL;

@@ -578,7 +607,7 @@ retry_cpuset:
if (page) {
if (avoid_reserve)
break;
- if (!vma_has_reserves(vma, chg))
+ if (!vma_has_reserves(h, vma, address))
break;

h->resv_huge_pages--;
@@ -1077,42 +1106,6 @@ static void return_unused_surplus_pages(struct hstate *h,
}
}

-/*
- * Determine if the huge page at addr within the vma has an associated
- * reservation. Where it does not we will need to logically increase
- * reservation and actually increase subpool usage before an allocation
- * can occur. Where any new reservation would be required the
- * reservation change is prepared, but not committed. Once the page
- * has been allocated from the subpool and instantiated the change should
- * be committed via vma_commit_reservation. No action is required on
- * failure.
- */
-static long vma_needs_reservation(struct hstate *h,
- struct vm_area_struct *vma, unsigned long addr)
-{
- struct address_space *mapping = vma->vm_file->f_mapping;
- struct inode *inode = mapping->host;
-
- if (vma->vm_flags & VM_MAYSHARE) {
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = inode->i_mapping->private_data;
-
- return region_chg(resv, idx, idx + 1);
-
- } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- return 1;
-
- } else {
- long err;
- pgoff_t idx = vma_hugecache_offset(h, vma, addr);
- struct resv_map *resv = vma_resv_map(vma);
-
- err = region_chg(resv, idx, idx + 1);
- if (err < 0)
- return err;
- return 0;
- }
-}
static void vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
@@ -1140,8 +1133,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- long chg;
- int ret, idx;
+ int ret, idx, has_reserve;
struct hugetlb_cgroup *h_cg;

idx = hstate_index(h);
@@ -1153,20 +1145,21 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
* need pages and subpool limit allocated allocated if no reserve
* mapping overlaps.
*/
- chg = vma_needs_reservation(h, vma, addr);
- if (chg < 0)
+ has_reserve = vma_has_reserves(h, vma, addr);
+ if (has_reserve < 0)
return ERR_PTR(-ENOMEM);
- if (chg)
- if (hugepage_subpool_get_pages(spool, chg))
+
+ if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
return ERR_PTR(-ENOSPC);

ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret) {
- hugepage_subpool_put_pages(spool, chg);
+ if (!has_reserve)
+ hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+ page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1174,7 +1167,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup(idx,
pages_per_huge_page(h),
h_cg);
- hugepage_subpool_put_pages(spool, chg);
+ if (!has_reserve)
+ hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
spin_lock(&hugetlb_lock);
@@ -2769,7 +2763,7 @@ retry:
* the spinlock.
*/
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
- if (vma_needs_reservation(h, vma, address) < 0) {
+ if (vma_has_reserves(h, vma, address) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
@@ -2860,7 +2854,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* consumed.
*/
if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
- if (vma_needs_reservation(h, vma, address) < 0) {
+ if (vma_has_reserves(h, vma, address) < 0) {
ret = VM_FAULT_OOM;
goto out_mutex;
}
--
1.7.9.5

2013-07-29 05:36:13

by Joonsoo Kim

Subject: [PATCH 08/18] mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve

When we try to get a huge page with avoid_reserve, we don't consume
a reserved page, so it is treated like the non-reserve case.

Signed-off-by: Joonsoo Kim <[email protected]>

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1426c03..749629e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1149,12 +1149,13 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (has_reserve < 0)
return ERR_PTR(-ENOMEM);

- if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
+ if ((!has_reserve || avoid_reserve)
+ && (hugepage_subpool_get_pages(spool, 1) < 0))
return ERR_PTR(-ENOSPC);

ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret) {
- if (!has_reserve)
+ if (!has_reserve || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
@@ -1167,7 +1168,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup(idx,
pages_per_huge_page(h),
h_cg);
- if (!has_reserve)
+ if (!has_reserve || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
return ERR_PTR(-ENOSPC);
}
--
1.7.9.5

2013-07-29 07:24:48

by Hillf Danton

Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> alloc_huge_page_node() uses dequeue_huge_page_node() without any
> validation check, so it can steal a reserved page unconditionally.

Well, why is it illegal to use a reserved page here?

> To fix it, check the number of free huge pages in alloc_huge_page_node().
>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6782b41..d971233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> */
> struct page *alloc_huge_page_node(struct hstate *h, int nid)
> {
> - struct page *page;
> + struct page *page = NULL;
>
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page_node(h, nid);
> + if (h->free_huge_pages - h->resv_huge_pages > 0)
> + page = dequeue_huge_page_node(h, nid);
> spin_unlock(&hugetlb_lock);
>
> if (!page)
> --
> 1.7.9.5
>

2013-07-29 08:58:59

by Hillf Danton

Subject: Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> There is a race condition if we map the same file in different processes.
> Region tracking is protected by mmap_sem and the hugetlb_instantiation_mutex.
> When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only
> mmap_sem. This doesn't prevent another process from modifying the region
> structure, so it can be modified by two processes concurrently.
>
> To solve this, I introduce a lock to resv_map and make the region
> manipulation functions grab the lock before they do the actual work.
> This makes region tracking safe.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2677c07..e29e28f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,6 +26,7 @@ struct hugepage_subpool {
>
> struct resv_map {
> struct kref refs;
> + spinlock_t lock;
> struct list_head regions;
> };
> extern struct resv_map *resv_map_alloc(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 24c0111..bf2ee11 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
[...]
> @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> struct file_region *rg, *nrg;
> long chg = 0;
>
> + spin_lock(&resv->lock);
> /* Locate the region we are before or in. */
> list_for_each_entry(rg, head, link)
> if (f <= rg->to)
> @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t)
> * size such that we can guarantee to record the reservation. */
> if (&rg->link == head || t < rg->from) {
> nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);

Hm, you are allocating a piece of memory with the spin lock held.
How about replacing that spin lock with a mutex?

> - if (!nrg)
> - return -ENOMEM;
> + if (!nrg) {
> + chg = -ENOMEM;
> + goto out;
> + }
> +

2013-07-29 09:10:20

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Mon, Jul 29, 2013 at 02:32:08PM +0900, Joonsoo Kim wrote:
> If parallel faults occur, we can fail to allocate a hugepage, because
> many threads dequeue a hugepage to handle a fault on the same address.
> This makes the reserved pool run short for a little while, and a
> faulting thread that is guaranteed to have enough reserved hugepages
> can get a SIGBUS signal.

It's not just about reserved pages. The same race can happen
perfectly well when you're really, truly allocating the last hugepage
in the system.

>
> To solve this problem, we already have a nice solution, that is, the
> hugetlb_instantiation_mutex. It blocks other threads from diving into
> the fault handler. This solves the problem cleanly, but it introduces
> performance degradation, because it serializes all fault handling.
>
> Now, I try to remove the hugetlb_instantiation_mutex to get rid of the
> performance degradation. A prerequisite is that other threads should
> not get a SIGBUS if they are guaranteed to have enough reserved pages.
>
> For this purpose, if we fail to allocate a new hugepage with
> use_reserve, we return just 0 instead of VM_FAULT_SIGBUS. use_reserve
> indicates that this user is a legitimate one who is guaranteed to have
> enough reserved pages. This prevents these threads from getting a
> SIGBUS signal and makes them retry the fault handling.

Not sufficient, since it can happen without reserved pages.

Also, I think there are edge cases where even reserved mappings can
run out, in particular with the interaction between MAP_PRIVATE,
fork() and reservations. In this case, when you have a genuine out of
memory condition, you will spin forever on the fault.

>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6a9ec69..909075b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2623,7 +2623,10 @@ retry_avoidcopy:
> WARN_ON_ONCE(1);
> }
>
> - ret = VM_FAULT_SIGBUS;
> + if (use_reserve)
> + ret = 0;
> + else
> + ret = VM_FAULT_SIGBUS;
> goto out_lock;
> }
>
> @@ -2741,7 +2744,10 @@ retry:
>
> page = alloc_huge_page(vma, address, use_reserve);
> if (IS_ERR(page)) {
> - ret = VM_FAULT_SIGBUS;
> + if (use_reserve)
> + ret = 0;
> + else
> + ret = VM_FAULT_SIGBUS;
> goto out;
> }
> clear_huge_page(page, address, pages_per_huge_page(h));

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2013-07-29 17:53:18

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

Hi,

On Mon, Jul 29, 2013 at 02:31:57PM +0900, Joonsoo Kim wrote:
> vma_need_reservation() can be substituted by vma_has_reserves()
> with minor change. These function do almost same thing,
> so unifying them is better to maintain.

I think it could, but not with minor changes.
vma_has_reserves() is mostly the negation of vma_needs_reservation(),
but not exactly (consider that a vma with VM_NORESERVE neither has nor
needs any reservation).

> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf2ee11..ff46a2c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> vma->vm_private_data = (void *)0;
> }
>
> -/* Returns true if the VMA has associated reserve pages */
> -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> +/*
> + * Determine if the huge page at addr within the vma has an associated
> + * reservation. Where it does not we will need to logically increase
> + * reservation and actually increase subpool usage before an allocation
> + * can occur. Where any new reservation would be required the
> + * reservation change is prepared, but not committed. Once the page
> + * has been allocated from the subpool and instantiated the change should
> + * be committed via vma_commit_reservation. No action is required on
> + * failure.
> + */
> +static int vma_has_reserves(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long addr)
> {
> if (vma->vm_flags & VM_NORESERVE) {
> /*

Could you add a more detailed comment on top of the function for better
maintainability? What is the expected behavior, what are the return
values in each case, and why?
Also, this function is no longer a simple reserve checker; it can also
change reservation maps, so it would be nice to give it a more suitable
name, though I don't have any good suggestions.

> @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> * step. Currently, we don't have any other solution to deal
> * with this situation properly, so add work-around here.
> */
> - if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> - return 1;
> - else
> - return 0;
> + if (vma->vm_flags & VM_MAYSHARE) {
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct inode *inode = mapping->host;
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> + struct resv_map *resv = inode->i_mapping->private_data;

It would be nice if we could get the resv_map from the vma in a single
function for VM_MAYSHARE, so could you add that to vma_resv_map()?

> + long chg;
> +
> + chg = region_chg(resv, idx, idx + 1);
> + if (chg < 0)
> + return -ENOMEM;
> +
> + if (chg == 0)
> + return 1;
> + }
> +
> + return 0;
> }
>
> /* Shared mappings always use reserves */
> @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> * Only the process that called mmap() has reserves for
> * private mappings.
> */
> - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> + struct resv_map *resv = vma_resv_map(vma);
> +
> + /* Just for allocating region structure */
> + if (region_chg(resv, idx, idx + 1) < 0)
> + return -ENOMEM;
> +
> return 1;
> + }
>
> return 0;
> }
> @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
>
> static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct vm_area_struct *vma,
> - unsigned long address, int avoid_reserve,
> - long chg)
> + unsigned long address, int avoid_reserve)
> {
> struct page *page = NULL;
> struct mempolicy *mpol;
> @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> * have no page reserves. This check ensures that reservations are
> * not "stolen". The child may still get SIGKILLed
> */
> - if (!vma_has_reserves(vma, chg) &&
> + if (!vma_has_reserves(h, vma, address) &&
> h->free_huge_pages - h->resv_huge_pages == 0)
> return NULL;
>
> @@ -578,7 +607,7 @@ retry_cpuset:
> if (page) {
> if (avoid_reserve)
> break;
> - if (!vma_has_reserves(vma, chg))
> + if (!vma_has_reserves(h, vma, address))
> break;
>
> h->resv_huge_pages--;
> @@ -1077,42 +1106,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> }
> }
>
> -/*
> - * Determine if the huge page at addr within the vma has an associated
> - * reservation. Where it does not we will need to logically increase
> - * reservation and actually increase subpool usage before an allocation
> - * can occur. Where any new reservation would be required the
> - * reservation change is prepared, but not committed. Once the page
> - * has been allocated from the subpool and instantiated the change should
> - * be committed via vma_commit_reservation. No action is required on
> - * failure.
> - */
> -static long vma_needs_reservation(struct hstate *h,
> - struct vm_area_struct *vma, unsigned long addr)
> -{
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> -
> - if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = inode->i_mapping->private_data;
> -
> - return region_chg(resv, idx, idx + 1);
> -
> - } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - return 1;
> -
> - } else {
> - long err;
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = vma_resv_map(vma);
> -
> - err = region_chg(resv, idx, idx + 1);
> - if (err < 0)
> - return err;
> - return 0;
> - }
> -}
> static void vma_commit_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> @@ -1140,8 +1133,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> - long chg;
> - int ret, idx;
> + int ret, idx, has_reserve;
> struct hugetlb_cgroup *h_cg;
>
> idx = hstate_index(h);
> @@ -1153,20 +1145,21 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> * need pages and subpool limit allocated allocated if no reserve
> * mapping overlaps.
> */
> - chg = vma_needs_reservation(h, vma, addr);
> - if (chg < 0)
> + has_reserve = vma_has_reserves(h, vma, addr);
> + if (has_reserve < 0)
> return ERR_PTR(-ENOMEM);
> - if (chg)
> - if (hugepage_subpool_get_pages(spool, chg))
> +
> + if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> return ERR_PTR(-ENOSPC);
> ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> if (ret) {
> - hugepage_subpool_put_pages(spool, chg);
> + if (!has_reserve)
> + hugepage_subpool_put_pages(spool, 1);

It seems that this fixes the subpool accounting, so it is worth
separating into another patch or mentioning in the description.

Thanks,
Naoya Horiguchi

> return ERR_PTR(-ENOSPC);
> }
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> if (!page) {
> spin_unlock(&hugetlb_lock);
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> @@ -1174,7 +1167,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_cgroup(idx,
> pages_per_huge_page(h),
> h_cg);
> - hugepage_subpool_put_pages(spool, chg);
> + if (!has_reserve)
> + hugepage_subpool_put_pages(spool, 1);
> return ERR_PTR(-ENOSPC);
> }
> spin_lock(&hugetlb_lock);
> @@ -2769,7 +2763,7 @@ retry:
> * the spinlock.
> */
> if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> - if (vma_needs_reservation(h, vma, address) < 0) {
> + if (vma_has_reserves(h, vma, address) < 0) {
> ret = VM_FAULT_OOM;
> goto backout_unlocked;
> }
> @@ -2860,7 +2854,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * consumed.
> */
> if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> - if (vma_needs_reservation(h, vma, address) < 0) {
> + if (vma_has_reserves(h, vma, address) < 0) {
> ret = VM_FAULT_OOM;
> goto out_mutex;
> }
> --
> 1.7.9.5
>

2013-07-29 18:06:15

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 08/18] mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve

On Mon, Jul 29, 2013 at 02:31:59PM +0900, Joonsoo Kim wrote:
> When we try to get a huge page with avoid_reserve, we don't consume
> a reserved page. So it is treated like as non-reserve case.

This patch will be completely overwritten by 9/18.
So is this patch necessary?

Naoya Horiguchi

> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1426c03..749629e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1149,12 +1149,13 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> if (has_reserve < 0)
> return ERR_PTR(-ENOMEM);
>
> - if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> + if ((!has_reserve || avoid_reserve)
> + && (hugepage_subpool_get_pages(spool, 1) < 0))
> return ERR_PTR(-ENOSPC);
>
> ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> if (ret) {
> - if (!has_reserve)
> + if (!has_reserve || avoid_reserve)
> hugepage_subpool_put_pages(spool, 1);
> return ERR_PTR(-ENOSPC);
> }
> @@ -1167,7 +1168,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_cgroup(idx,
> pages_per_huge_page(h),
> h_cg);
> - if (!has_reserve)
> + if (!has_reserve || avoid_reserve)
> hugepage_subpool_put_pages(spool, 1);
> return ERR_PTR(-ENOSPC);
> }
> --
> 1.7.9.5
>

2013-07-29 18:28:20

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()

On Mon, Jul 29, 2013 at 02:32:01PM +0900, Joonsoo Kim wrote:
> To implement a graceful failure handling, we need to know whether
> allocation request is for reserved pool or not, on higher level.
> In this patch, we just move up vma_has_reseve() to caller function
> in order to know it. There is no functional change.
>
> Following patches implement a grace failure handling and remove
> a hugetlb_instantiation_mutex.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a66226e..5f31ca5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h,
> }
>
> static struct page *alloc_huge_page(struct vm_area_struct *vma,
> - unsigned long addr, int avoid_reserve)
> + unsigned long addr, int use_reserve)
> {
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> - int ret, idx, use_reserve;
> + int ret, idx;
> struct hugetlb_cgroup *h_cg;
>
> idx = hstate_index(h);
> @@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> * need pages and subpool limit allocated allocated if no reserve
> * mapping overlaps.
> */
> - use_reserve = vma_has_reserves(h, vma, addr);
> - if (use_reserve < 0)
> - return ERR_PTR(-ENOMEM);
> -
> - use_reserve = use_reserve && !avoid_reserve;
> if (!use_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> return ERR_PTR(-ENOSPC);
>
> @@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> {
> struct hstate *h = hstate_vma(vma);
> struct page *old_page, *new_page;
> - int outside_reserve = 0;
> + int use_reserve, outside_reserve = 0;
> unsigned long mmun_start; /* For mmu_notifiers */
> unsigned long mmun_end; /* For mmu_notifiers */
>
> @@ -2553,7 +2548,18 @@ retry_avoidcopy:
>
> /* Drop page_table_lock as buddy allocator may be called */
> spin_unlock(&mm->page_table_lock);
> - new_page = alloc_huge_page(vma, address, outside_reserve);
> +
> + use_reserve = vma_has_reserves(h, vma, address);
> + if (use_reserve == -ENOMEM) {
> + page_cache_release(old_page);
> +
> + /* Caller expects lock to be held */
> + spin_lock(&mm->page_table_lock);
> + return VM_FAULT_OOM;
> + }
> + use_reserve = use_reserve && !outside_reserve;

When outside_reserve is true, we don't have to call vma_has_reserves
because then use_reserve is always false. So something like:

	use_reserve = 0;
	if (!outside_reserve) {
		use_reserve = vma_has_reserves(...);
		...
	}

looks better to me.
Or if you expect vma_has_reserves to change resv_map implicitly,
could you add a comment about it.

Thanks,
Naoya Horiguchi

> +
> + new_page = alloc_huge_page(vma, address, use_reserve);
>
> if (IS_ERR(new_page)) {
> long err = PTR_ERR(new_page);
> @@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page *page;
> struct address_space *mapping;
> pte_t new_pte;
> + int use_reserve = 0;
>
> /*
> * Currently, we are forced to kill the process in the event the
> @@ -2704,7 +2711,14 @@ retry:
> size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto out;
> - page = alloc_huge_page(vma, address, 0);
> +
> + use_reserve = vma_has_reserves(h, vma, address);
> + if (use_reserve == -ENOMEM) {
> + ret = VM_FAULT_OOM;
> + goto out;
> + }
> +
> + page = alloc_huge_page(vma, address, use_reserve);
> if (IS_ERR(page)) {
> ret = PTR_ERR(page);
> if (ret == -ENOMEM)
> --
> 1.7.9.5
>

2013-07-29 18:39:54

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 11/18] mm, hugetlb: move down outside_reserve check

On Mon, Jul 29, 2013 at 02:32:02PM +0900, Joonsoo Kim wrote:
> Just move down outsider_reserve check.
> This makes code more readable.
>
> There is no functional change.

Why don't you do this in 10/18?

Thanks,
Naoya Horiguchi

> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f31ca5..94173e0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2530,20 +2530,6 @@ retry_avoidcopy:
> return 0;
> }
>
> - /*
> - * If the process that created a MAP_PRIVATE mapping is about to
> - * perform a COW due to a shared page count, attempt to satisfy
> - * the allocation without using the existing reserves. The pagecache
> - * page is used to determine if the reserve at this address was
> - * consumed or not. If reserves were used, a partial faulted mapping
> - * at the time of fork() could consume its reserves on COW instead
> - * of the full address range.
> - */
> - if (!(vma->vm_flags & VM_MAYSHARE) &&
> - is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> - old_page != pagecache_page)
> - outside_reserve = 1;
> -
> page_cache_get(old_page);
>
> /* Drop page_table_lock as buddy allocator may be called */
> @@ -2557,6 +2543,20 @@ retry_avoidcopy:
> spin_lock(&mm->page_table_lock);
> return VM_FAULT_OOM;
> }
> +
> + /*
> + * If the process that created a MAP_PRIVATE mapping is about to
> + * perform a COW due to a shared page count, attempt to satisfy
> + * the allocation without using the existing reserves. The pagecache
> + * page is used to determine if the reserve at this address was
> + * consumed or not. If reserves were used, a partial faulted mapping
> + * at the time of fork() could consume its reserves on COW instead
> + * of the full address range.
> + */
> + if (!(vma->vm_flags & VM_MAYSHARE) &&
> + is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> + old_page != pagecache_page)
> + outside_reserve = 1;
> use_reserve = use_reserve && !outside_reserve;
>
> new_page = alloc_huge_page(vma, address, use_reserve);
> --
> 1.7.9.5
>

2013-07-29 18:50:56

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 13/18] mm, hugetlb: grab a page_table_lock after page_cache_release

On Mon, Jul 29, 2013 at 02:32:04PM +0900, Joonsoo Kim wrote:
> We don't need to grab a page_table_lock when we try to release a page.
> So, defer to grab a page_table_lock.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 35ccdad..255bd9e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2630,10 +2630,11 @@ retry_avoidcopy:
> }
> spin_unlock(&mm->page_table_lock);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> - /* Caller expects lock to be held */
> - spin_lock(&mm->page_table_lock);
> page_cache_release(new_page);
> page_cache_release(old_page);
> +
> + /* Caller expects lock to be held */
> + spin_lock(&mm->page_table_lock);
> return 0;
> }
>
> --
> 1.7.9.5
>

2013-07-29 18:53:11

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

On Mon, 2013-07-29 at 14:31 +0900, Joonsoo Kim wrote:
> There is a race condition if we map a same file on different processes.
> Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> grab a mmap_sem. This doesn't prevent other process to modify region
> structure, so it can be modified by two processes concurrently.
>
> To solve this, I introduce a lock to resv_map and make region manipulation
> function grab a lock before they do actual work. This makes region
> tracking safe.

I think we can use a rwsem here instead of a spinlock, similar to my
previous 1/2 approach.

>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2677c07..e29e28f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -26,6 +26,7 @@ struct hugepage_subpool {
>
> struct resv_map {
> struct kref refs;
> + spinlock_t lock;
> struct list_head regions;
> };
> extern struct resv_map *resv_map_alloc(void);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 24c0111..bf2ee11 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> * Region tracking -- allows tracking of reservations and instantiated pages
> * across the pages in a mapping.
> *
> - * The region data structures are protected by a combination of the mmap_sem
> - * and the hugetlb_instantiation_mutex. To access or modify a region the caller
> - * must either hold the mmap_sem for write, or the mmap_sem for read and
> - * the hugetlb_instantiation_mutex:
> - *
> - * down_write(&mm->mmap_sem);
> - * or
> - * down_read(&mm->mmap_sem);
> - * mutex_lock(&hugetlb_instantiation_mutex);
> + * The region data structures are embedded into a resv_map and
> + * protected by a resv_map's lock
> */
> struct file_region {
> struct list_head link;
> @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
> struct list_head *head = &resv->regions;
> struct file_region *rg, *nrg, *trg;
>
> + spin_lock(&resv->lock);
> /* Locate the region we are either in or before. */
> list_for_each_entry(rg, head, link)
> if (f <= rg->to)
> @@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, long t)
> }
> nrg->from = f;
> nrg->to = t;
> + spin_unlock(&resv->lock);
> return 0;
> }
>
> @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> struct file_region *rg, *nrg;
> long chg = 0;
>
> + spin_lock(&resv->lock);
> /* Locate the region we are before or in. */
> list_for_each_entry(rg, head, link)
> if (f <= rg->to)
> @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t)
> * size such that we can guarantee to record the reservation. */
> if (&rg->link == head || t < rg->from) {
> nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> - if (!nrg)
> - return -ENOMEM;
> + if (!nrg) {
> + chg = -ENOMEM;
> + goto out;
> + }
> +
> nrg->from = f;
> nrg->to = f;
> INIT_LIST_HEAD(&nrg->link);
> list_add(&nrg->link, rg->link.prev);
>
> - return t - f;
> + chg = t - f;
> + goto out;
> }
>
> /* Round our left edge to the current segment if it encloses us. */
> @@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> if (&rg->link == head)
> break;
> if (rg->from > t)
> - return chg;
> + goto out;
>
> /* We overlap with this area, if it extends further than
> * us then we must extend ourselves. Account for its
> @@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, long t)
> }
> chg -= rg->to - rg->from;
> }
> +
> +out:
> + spin_unlock(&resv->lock);
> return chg;
> }
>
> @@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, long end)
> struct file_region *rg, *trg;
> long chg = 0;
>
> + spin_lock(&resv->lock);
> /* Locate the region we are either in or before. */
> list_for_each_entry(rg, head, link)
> if (end <= rg->to)
> break;
> if (&rg->link == head)
> - return 0;
> + goto out;
>
> /* If we are in the middle of a region then adjust it. */
> if (end > rg->from) {
> @@ -265,6 +269,9 @@ static long region_truncate(struct resv_map *resv, long end)
> list_del(&rg->link);
> kfree(rg);
> }
> +
> +out:
> + spin_unlock(&resv->lock);
> return chg;
> }
>
> @@ -274,6 +281,7 @@ static long region_count(struct resv_map *resv, long f, long t)
> struct file_region *rg;
> long chg = 0;
>
> + spin_lock(&resv->lock);
> /* Locate each segment we overlap with, and count that overlap. */
> list_for_each_entry(rg, head, link) {
> long seg_from;
> @@ -289,6 +297,7 @@ static long region_count(struct resv_map *resv, long f, long t)
>
> chg += seg_to - seg_from;
> }
> + spin_unlock(&resv->lock);
>
> return chg;
> }
> @@ -386,6 +395,7 @@ struct resv_map *resv_map_alloc(void)
> return NULL;
>
> kref_init(&resv_map->refs);
> + spin_lock_init(&resv_map->lock);
> INIT_LIST_HEAD(&resv_map->regions);
>
> return resv_map;

2013-07-29 19:05:58

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

On Mon, Jul 29, 2013 at 02:32:06PM +0900, Joonsoo Kim wrote:
> If we fail with a allocated hugepage, it is hard to recover properly.
> One such example is reserve count. We don't have any method to recover
> reserve count. Although, I will introduce a function to recover reserve
> count in following patch, it is better not to allocate a hugepage
> as much as possible. So move up anon_vma_prepare() which can be failed
> in OOM situation.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 683fd38..bb8a45f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2536,6 +2536,15 @@ retry_avoidcopy:
> /* Drop page_table_lock as buddy allocator may be called */
> spin_unlock(&mm->page_table_lock);
>
> + /*
> + * When the original hugepage is shared one, it does not have
> + * anon_vma prepared.
> + */
> + if (unlikely(anon_vma_prepare(vma))) {
> + ret = VM_FAULT_OOM;
> + goto out_old_page;
> + }
> +
> use_reserve = vma_has_reserves(h, vma, address);
> if (use_reserve == -ENOMEM) {
> ret = VM_FAULT_OOM;
> @@ -2590,15 +2599,6 @@ retry_avoidcopy:
> goto out_lock;
> }
>
> - /*
> - * When the original hugepage is shared one, it does not have
> - * anon_vma prepared.
> - */
> - if (unlikely(anon_vma_prepare(vma))) {
> - ret = VM_FAULT_OOM;
> - goto out_new_page;
> - }
> -
> copy_user_huge_page(new_page, old_page, address, vma,
> pages_per_huge_page(h));
> __SetPageUptodate(new_page);
> @@ -2625,7 +2625,6 @@ retry_avoidcopy:
> spin_unlock(&mm->page_table_lock);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>
> -out_new_page:
> page_cache_release(new_page);
> out_old_page:
> page_cache_release(old_page);
> --
> 1.7.9.5
>

2013-07-29 19:19:35

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

On Mon, Jul 29, 2013 at 03:05:37PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:06PM +0900, Joonsoo Kim wrote:
> > If we fail with a allocated hugepage, it is hard to recover properly.
> > One such example is reserve count. We don't have any method to recover
> > reserve count. Although, I will introduce a function to recover reserve
> > count in following patch, it is better not to allocate a hugepage
> > as much as possible. So move up anon_vma_prepare() which can be failed
> > in OOM situation.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Reviewed-by: Naoya Horiguchi <[email protected]>

Sorry, let me suspend this Reviewed-by for a question.
If alloc_huge_page() fails after anon_vma_prepare() has succeeded,
are the allocated anon_vma_chain and/or anon_vma safely freed?
Or do we not have to free them?

Thanks,
Naoya Horiguchi

> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 683fd38..bb8a45f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2536,6 +2536,15 @@ retry_avoidcopy:
> > /* Drop page_table_lock as buddy allocator may be called */
> > spin_unlock(&mm->page_table_lock);
> >
> > + /*
> > + * When the original hugepage is shared one, it does not have
> > + * anon_vma prepared.
> > + */
> > + if (unlikely(anon_vma_prepare(vma))) {
> > + ret = VM_FAULT_OOM;
> > + goto out_old_page;
> > + }
> > +
> > use_reserve = vma_has_reserves(h, vma, address);
> > if (use_reserve == -ENOMEM) {
> > ret = VM_FAULT_OOM;
> > @@ -2590,15 +2599,6 @@ retry_avoidcopy:
> > goto out_lock;
> > }
> >
> > - /*
> > - * When the original hugepage is shared one, it does not have
> > - * anon_vma prepared.
> > - */
> > - if (unlikely(anon_vma_prepare(vma))) {
> > - ret = VM_FAULT_OOM;
> > - goto out_new_page;
> > - }
> > -
> > copy_user_huge_page(new_page, old_page, address, vma,
> > pages_per_huge_page(h));
> > __SetPageUptodate(new_page);
> > @@ -2625,7 +2625,6 @@ retry_avoidcopy:
> > spin_unlock(&mm->page_table_lock);
> > mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >
> > -out_new_page:
> > page_cache_release(new_page);
> > out_old_page:
> > page_cache_release(old_page);
> > --
> > 1.7.9.5
> >

2013-07-29 20:19:36

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed

On Mon, Jul 29, 2013 at 02:32:07PM +0900, Joonsoo Kim wrote:
> If we fail with a reserved page, just calling put_page() is not sufficient,
> because put_page() invoke free_huge_page() at last step and it doesn't
> know whether a page comes from a reserved pool or not. So it doesn't do
> anything related to reserved count. This makes reserve count lower
> than how we need, because reserve count already decrease in
> dequeue_huge_page_vma(). This patch fix this situation.

I think we could use a page flag (for example PG_reserve) on a hugepage
in order to record that the hugepage comes from the reserved pool.
Furthermore, the reserve flag would be set when dequeueing a free
hugepage, and cleared when hugetlb_fault returns, whether it fails or
not. I think it's simpler than the put_page variant approach; wouldn't
it solve your problem?
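
As a rough illustration only (PG_HugeReserve and the SetPage/ClearPage
helpers below are hypothetical names, chosen to avoid confusion with the
existing PG_reserved; resv_huge_pages updates are assumed to happen
under hugetlb_lock):

	/* dequeue path: the page is taken from the reserved pool */
	SetPageHugeReserve(page);
	h->resv_huge_pages--;

	/*
	 * free_huge_page(): the fault never completed, so hand the
	 * reservation back along with the page.
	 */
	if (PageHugeReserve(page)) {
		ClearPageHugeReserve(page);
		h->resv_huge_pages++;
	}

	/*
	 * successful fault return path: the reservation is consumed
	 * for good, so just clear the flag on the instantiated page.
	 */
	ClearPageHugeReserve(page);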

Thanks,
Naoya Horiguchi

> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bb8a45f..6a9ec69 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
> return NULL;
> }
>
> +static void put_huge_page(struct page *page, int use_reserve)
> +{
> + struct hstate *h = page_hstate(page);
> + struct hugepage_subpool *spool =
> + (struct hugepage_subpool *)page_private(page);
> +
> + if (!use_reserve) {
> + put_page(page);
> + return;
> + }
> +
> + if (!put_page_testzero(page))
> + return;
> +
> + set_page_private(page, 0);
> + page->mapping = NULL;
> + BUG_ON(page_count(page));
> + BUG_ON(page_mapcount(page));
> +
> + spin_lock(&hugetlb_lock);
> + hugetlb_cgroup_uncharge_page(hstate_index(h),
> + pages_per_huge_page(h), page);
> + enqueue_huge_page(h, page);
> + h->resv_huge_pages++;
> + spin_unlock(&hugetlb_lock);
> + hugepage_subpool_put_pages(spool, 1);
> +}
> +
> static void free_huge_page(struct page *page)
> {
> /*
> @@ -2625,7 +2653,7 @@ retry_avoidcopy:
> spin_unlock(&mm->page_table_lock);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>
> - page_cache_release(new_page);
> + put_huge_page(new_page, use_reserve);
> out_old_page:
> page_cache_release(old_page);
> out_lock:
> @@ -2725,7 +2753,7 @@ retry:
>
> err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
> if (err) {
> - put_page(page);
> + put_huge_page(page, use_reserve);
> if (err == -EEXIST)
> goto retry;
> goto out;
> @@ -2798,7 +2826,7 @@ backout:
> spin_unlock(&mm->page_table_lock);
> backout_unlocked:
> unlock_page(page);
> - put_page(page);
> + put_huge_page(page, use_reserve);
> goto out;
> }
>
> --
> 1.7.9.5
>

2013-07-30 16:49:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

Joonsoo Kim <[email protected]> writes:

> alloc_huge_page_node() use dequeue_huge_page_node() without
> any validation check, so it can steal reserved page unconditionally.
> To fix it, check the number of free_huge_page in
> alloc_huge_page_node().


Maybe we should say: don't use the reserve pool when soft offlining a
huge page, and check that we have free pages outside the reserve pool
before we dequeue the huge page.

Reviewed-by: Aneesh Kumar <[email protected]>


>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6782b41..d971233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> */
> struct page *alloc_huge_page_node(struct hstate *h, int nid)
> {
> - struct page *page;
> + struct page *page = NULL;
>
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page_node(h, nid);
> + if (h->free_huge_pages - h->resv_huge_pages > 0)
> + page = dequeue_huge_page_node(h, nid);
> spin_unlock(&hugetlb_lock);
>
> if (!page)
> --
> 1.7.9.5

2013-07-30 16:51:01

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm, hugetlb: change variable name reservations to resv

Joonsoo Kim <[email protected]> writes:

> 'reservations' is so long name as a variable and we use 'resv_map'
> to represent 'struct resv_map' in other place. To reduce confusion and
> unreadability, change it.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Reviewed-by: Aneesh Kumar <[email protected]>

>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d971233..12b6581 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1095,9 +1095,9 @@ static long vma_needs_reservation(struct hstate *h,
> } else {
> long err;
> pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *reservations = vma_resv_map(vma);
> + struct resv_map *resv = vma_resv_map(vma);
>
> - err = region_chg(&reservations->regions, idx, idx + 1);
> + err = region_chg(&resv->regions, idx, idx + 1);
> if (err < 0)
> return err;
> return 0;
> @@ -1115,10 +1115,10 @@ static void vma_commit_reservation(struct hstate *h,
>
> } else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *reservations = vma_resv_map(vma);
> + struct resv_map *resv = vma_resv_map(vma);
>
> /* Mark this page used in the map. */
> - region_add(&reservations->regions, idx, idx + 1);
> + region_add(&resv->regions, idx, idx + 1);
> }
> }
>
> @@ -2168,7 +2168,7 @@ out:
>
> static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> {
> - struct resv_map *reservations = vma_resv_map(vma);
> + struct resv_map *resv = vma_resv_map(vma);
>
> /*
> * This new VMA should share its siblings reservation map if present.
> @@ -2178,34 +2178,34 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
> * after this open call completes. It is therefore safe to take a
> * new reference here without additional locking.
> */
> - if (reservations)
> - kref_get(&reservations->refs);
> + if (resv)
> + kref_get(&resv->refs);
> }
>
> static void resv_map_put(struct vm_area_struct *vma)
> {
> - struct resv_map *reservations = vma_resv_map(vma);
> + struct resv_map *resv = vma_resv_map(vma);
>
> - if (!reservations)
> + if (!resv)
> return;
> - kref_put(&reservations->refs, resv_map_release);
> + kref_put(&resv->refs, resv_map_release);
> }
>
> static void hugetlb_vm_op_close(struct vm_area_struct *vma)
> {
> struct hstate *h = hstate_vma(vma);
> - struct resv_map *reservations = vma_resv_map(vma);
> + struct resv_map *resv = vma_resv_map(vma);
> struct hugepage_subpool *spool = subpool_vma(vma);
> unsigned long reserve;
> unsigned long start;
> unsigned long end;
>
> - if (reservations) {
> + if (resv) {
> start = vma_hugecache_offset(h, vma, vma->vm_start);
> end = vma_hugecache_offset(h, vma, vma->vm_end);
>
> reserve = (end - start) -
> - region_count(&reservations->regions, start, end);
> + region_count(&resv->regions, start, end);
>
> resv_map_put(vma);
>
> --
> 1.7.9.5

2013-07-30 17:27:49

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 03/18] mm, hugetlb: unify region structure handling

Joonsoo Kim <[email protected]> writes:

> Currently, to track a reserved and allocated region, we use two different
> ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use
> address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> Now, we are preparing to change a coarse grained lock which protect
> a region structure to fine grained lock, and this difference hinder it.
> So, before changing it, unify region structure handling.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..a1ae3ada 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
>
> static void hugetlbfs_evict_inode(struct inode *inode)
> {
> + struct resv_map *resv_map;
> +
> truncate_hugepages(inode, 0);
> + resv_map = (struct resv_map *)inode->i_mapping->private_data;
> + if (resv_map)

can resv_map == NULL ?


> + kref_put(&resv_map->refs, resv_map_release);

Also, the kref_put is confusing. For a shared mapping we don't have the
ref count incremented, right? So maybe you can call resv_map_release
directly, or add a comment explaining this more?


> clear_inode(inode);
> }
>
> @@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> umode_t mode, dev_t dev)
> {
> struct inode *inode;
> + struct resv_map *resv_map;
> +
> + resv_map = resv_map_alloc();
> + if (!resv_map)
> + return NULL;

-aneesh

2013-07-30 17:50:09

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

Joonsoo Kim <[email protected]> writes:

> vma_need_reservation() can be substituted by vma_has_reserves()
> with minor change. These function do almost same thing,
> so unifying them is better to maintain.

I found the resulting code confusing and complex. I am sure there is
more to it than what is explained in the commit message. If you are just
doing this for cleanup, maybe we should avoid doing it?


>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bf2ee11..ff46a2c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> vma->vm_private_data = (void *)0;
> }
>
> -/* Returns true if the VMA has associated reserve pages */
> -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> +/*
> + * Determine if the huge page at addr within the vma has an associated
> + * reservation. Where it does not we will need to logically increase
> + * reservation and actually increase subpool usage before an allocation
> + * can occur. Where any new reservation would be required the
> + * reservation change is prepared, but not committed. Once the page
> + * has been allocated from the subpool and instantiated the change should
> + * be committed via vma_commit_reservation. No action is required on
> + * failure.
> + */
> +static int vma_has_reserves(struct hstate *h,
> + struct vm_area_struct *vma, unsigned long addr)
> {
> if (vma->vm_flags & VM_NORESERVE) {
> /*
> @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> * step. Currently, we don't have any other solution to deal
> * with this situation properly, so add work-around here.
> */
> - if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> - return 1;
> - else
> - return 0;
> + if (vma->vm_flags & VM_MAYSHARE) {
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct inode *inode = mapping->host;
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> + struct resv_map *resv = inode->i_mapping->private_data;
> + long chg;
> +
> + chg = region_chg(resv, idx, idx + 1);
> + if (chg < 0)
> + return -ENOMEM;
> +
> + if (chg == 0)
> + return 1;
> + }
> +
> + return 0;
> }
>
> /* Shared mappings always use reserves */
> @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> * Only the process that called mmap() has reserves for
> * private mappings.
> */
> - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> + struct resv_map *resv = vma_resv_map(vma);
> +
> + /* Just for allocating region structure */
> + if (region_chg(resv, idx, idx + 1) < 0)
> + return -ENOMEM;
> +
> return 1;
> + }
>
> return 0;
> }
> @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
>
> static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct vm_area_struct *vma,
> - unsigned long address, int avoid_reserve,
> - long chg)
> + unsigned long address, int avoid_reserve)
> {
> struct page *page = NULL;
> struct mempolicy *mpol;
> @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> * have no page reserves. This check ensures that reservations are
> * not "stolen". The child may still get SIGKILLed
> */
> - if (!vma_has_reserves(vma, chg) &&
> + if (!vma_has_reserves(h, vma, address) &&
> h->free_huge_pages - h->resv_huge_pages == 0)
> return NULL;
>
> @@ -578,7 +607,7 @@ retry_cpuset:
> if (page) {
> if (avoid_reserve)
> break;
> - if (!vma_has_reserves(vma, chg))
> + if (!vma_has_reserves(h, vma, address))
> break;
>
> h->resv_huge_pages--;
> @@ -1077,42 +1106,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> }
> }
>
> -/*
> - * Determine if the huge page at addr within the vma has an associated
> - * reservation. Where it does not we will need to logically increase
> - * reservation and actually increase subpool usage before an allocation
> - * can occur. Where any new reservation would be required the
> - * reservation change is prepared, but not committed. Once the page
> - * has been allocated from the subpool and instantiated the change should
> - * be committed via vma_commit_reservation. No action is required on
> - * failure.
> - */
> -static long vma_needs_reservation(struct hstate *h,
> - struct vm_area_struct *vma, unsigned long addr)
> -{
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - struct inode *inode = mapping->host;
> -
> - if (vma->vm_flags & VM_MAYSHARE) {
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = inode->i_mapping->private_data;
> -
> - return region_chg(resv, idx, idx + 1);
> -
> - } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> - return 1;
> -
> - } else {
> - long err;
> - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> - struct resv_map *resv = vma_resv_map(vma);
> -
> - err = region_chg(resv, idx, idx + 1);
> - if (err < 0)
> - return err;
> - return 0;
> - }
> -}
> static void vma_commit_reservation(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> @@ -1140,8 +1133,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct page *page;
> - long chg;
> - int ret, idx;
> + int ret, idx, has_reserve;
> struct hugetlb_cgroup *h_cg;
>
> idx = hstate_index(h);
> @@ -1153,20 +1145,21 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> * need pages and subpool limit allocated allocated if no reserve
> * mapping overlaps.
> */
> - chg = vma_needs_reservation(h, vma, addr);
> - if (chg < 0)
> + has_reserve = vma_has_reserves(h, vma, addr);
> + if (has_reserve < 0)
> return ERR_PTR(-ENOMEM);
> - if (chg)
> - if (hugepage_subpool_get_pages(spool, chg))
> +
> + if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> return ERR_PTR(-ENOSPC);
>
> ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> if (ret) {
> - hugepage_subpool_put_pages(spool, chg);
> + if (!has_reserve)
> + hugepage_subpool_put_pages(spool, 1);
> return ERR_PTR(-ENOSPC);
> }
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> if (!page) {
> spin_unlock(&hugetlb_lock);
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> @@ -1174,7 +1167,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> hugetlb_cgroup_uncharge_cgroup(idx,
> pages_per_huge_page(h),
> h_cg);
> - hugepage_subpool_put_pages(spool, chg);
> + if (!has_reserve)
> + hugepage_subpool_put_pages(spool, 1);
> return ERR_PTR(-ENOSPC);
> }
> spin_lock(&hugetlb_lock);
> @@ -2769,7 +2763,7 @@ retry:
> * the spinlock.
> */
> if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> - if (vma_needs_reservation(h, vma, address) < 0) {
> + if (vma_has_reserves(h, vma, address) < 0) {
> ret = VM_FAULT_OOM;
> goto backout_unlocked;
> }
> @@ -2860,7 +2854,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * consumed.
> */
> if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> - if (vma_needs_reservation(h, vma, address) < 0) {
> + if (vma_has_reserves(h, vma, address) < 0) {
> ret = VM_FAULT_OOM;
> goto out_mutex;
> }
> --
> 1.7.9.5

2013-07-31 02:27:54

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> > alloc_huge_page_node() use dequeue_huge_page_node() without
> > any validation check, so it can steal reserved page unconditionally.
>
> Well, why is it illegal to use reserved page here?

Hello, Hillf.

If we use a reserved page here, other processes which are promised
enough hugepages may not be able to get them and can die. This is an
unexpected result for them.

Thanks.

>
> > To fix it, check the number of free_huge_page in alloc_huge_page_node().
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6782b41..d971233 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -935,10 +935,11 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> > */
> > struct page *alloc_huge_page_node(struct hstate *h, int nid)
> > {
> > - struct page *page;
> > + struct page *page = NULL;
> >
> > spin_lock(&hugetlb_lock);
> > - page = dequeue_huge_page_node(h, nid);
> > + if (h->free_huge_pages - h->resv_huge_pages > 0)
> > + page = dequeue_huge_page_node(h, nid);
> > spin_unlock(&hugetlb_lock);
> >
> > if (!page)
> > --
> > 1.7.9.5
> >
>

2013-07-31 02:36:16

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 03/18] mm, hugetlb: unify region structure handling

On Tue, Jul 30, 2013 at 10:57:37PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <[email protected]> writes:
>
> > Currently, to track a reserved and allocated region, we use two different
> > ways for MAP_SHARED and MAP_PRIVATE. For MAP_SHARED, we use
> > address_mapping's private_list and, for MAP_PRIVATE, we use a resv_map.
> > Now, we are preparing to change a coarse grained lock which protect
> > a region structure to fine grained lock, and this difference hinder it.
> > So, before changing it, unify region structure handling.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index a3f868a..a1ae3ada 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -366,7 +366,12 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
> >
> > static void hugetlbfs_evict_inode(struct inode *inode)
> > {
> > + struct resv_map *resv_map;
> > +
> > truncate_hugepages(inode, 0);
> > + resv_map = (struct resv_map *)inode->i_mapping->private_data;
> > + if (resv_map)
>
> can resv_map == NULL ?

Hello, Aneesh.

Yes, it can.
The root inode, which is allocated in hugetlbfs_get_root(), doesn't
allocate a resv_map.

>
>
> > + kref_put(&resv_map->refs, resv_map_release);
>
> Also, the kref_put is confusing. For a shared mapping we don't have the
> ref count incremented, right? So maybe you can call resv_map_release
> directly, or add a comment explaining this more?

Yes, I can call resv_map_release() directly, but I think that releasing
it via reference management is better.

Thanks.

>
>
> > clear_inode(inode);
> > }
> >
> > @@ -468,6 +473,11 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> > umode_t mode, dev_t dev)
> > {
> > struct inode *inode;
> > + struct resv_map *resv_map;
> > +
> > + resv_map = resv_map_alloc();
> > + if (!resv_map)
> > + return NULL;
>
> -aneesh
>

2013-07-31 02:41:27

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

On Mon, Jul 29, 2013 at 04:58:57PM +0800, Hillf Danton wrote:
> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> > There is a race condition if we map a same file on different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> > grab a mmap_sem. This doesn't prevent other process to modify region
> > structure, so it can be modified by two processes concurrently.
> >
> > To solve this, I introduce a lock to resv_map and make region manipulation
> > function grab a lock before they do actual work. This makes region
> > tracking safe.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 2677c07..e29e28f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -26,6 +26,7 @@ struct hugepage_subpool {
> >
> > struct resv_map {
> > struct kref refs;
> > + spinlock_t lock;
> > struct list_head regions;
> > };
> > extern struct resv_map *resv_map_alloc(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 24c0111..bf2ee11 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> [...]
> > @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > struct file_region *rg, *nrg;
> > long chg = 0;
> >
> > + spin_lock(&resv->lock);
> > /* Locate the region we are before or in. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > * size such that we can guarantee to record the reservation. */
> > if (&rg->link == head || t < rg->from) {
> > nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
>
> Hm, you are allocating a piece of memory with spin lock held.
> How about replacing that spin lock with a mutex?

I think that the lock hold period here is very short, so a mutex is not
appropriate. How about trying the allocation with GFP_NOWAIT? If that
fails, release the lock, allocate with GFP_KERNEL, and retry from the
beginning, as in the sketch below.
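
Roughly (an illustrative sketch only, not the posted patch; the overlap
accounting at the end of the function is elided and unchanged):

static long region_chg(struct resv_map *resv, long f, long t)
{
	struct list_head *head = &resv->regions;
	struct file_region *rg, *nrg = NULL;
	long chg = 0;

retry:
	spin_lock(&resv->lock);
	/* Locate the region we are before or in. */
	list_for_each_entry(rg, head, link)
		if (f <= rg->to)
			break;

	/* If we are below an existing region, a new entry is required. */
	if (&rg->link == head || t < rg->from) {
		if (!nrg) {
			/* Try a non-sleeping allocation under the lock. */
			nrg = kmalloc(sizeof(*nrg), GFP_NOWAIT);
			if (!nrg) {
				/* Drop the lock, sleep, redo the lookup. */
				spin_unlock(&resv->lock);
				nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
				if (!nrg)
					return -ENOMEM;
				goto retry;
			}
		}
		nrg->from = f;
		nrg->to = f;
		INIT_LIST_HEAD(&nrg->link);
		list_add(&nrg->link, rg->link.prev);
		nrg = NULL;

		chg = t - f;
		goto out;
	}

	/* ... existing overlap accounting, unchanged ... */
out:
	spin_unlock(&resv->lock);
	kfree(nrg);	/* spare entry if the retry found an overlap */
	return chg;
}

This way the spin lock is never held across a sleeping allocation, and
the GFP_NOWAIT attempt keeps the common case to a single pass.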

Thanks.

>
> > - if (!nrg)
> > - return -ENOMEM;
> > + if (!nrg) {
> > + chg = -ENOMEM;
> > + goto out;
> > + }
> > +
>

2013-07-31 02:43:09

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm, hugetlb: protect region tracking via newly introduced resv_map lock

On Mon, Jul 29, 2013 at 11:53:05AM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-07-29 at 14:31 +0900, Joonsoo Kim wrote:
> > There is a race condition if we map a same file on different processes.
> > Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
> > When we do mmap, we don't grab a hugetlb_instantiation_mutex, but,
> > grab a mmap_sem. This doesn't prevent other process to modify region
> > structure, so it can be modified by two processes concurrently.
> >
> > To solve this, I introduce a lock to resv_map and make region manipulation
> > function grab a lock before they do actual work. This makes region
> > tracking safe.
>
> I think we can use a rwsem here instead of a spinlock, similar to my
> previous 1/2 approach.

Hello, Davidlohr.

As I said in my reply to Hillf, I prefer to use a spinlock.
Please refer to that reply.

Thanks.

>
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 2677c07..e29e28f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -26,6 +26,7 @@ struct hugepage_subpool {
> >
> > struct resv_map {
> > struct kref refs;
> > + spinlock_t lock;
> > struct list_head regions;
> > };
> > extern struct resv_map *resv_map_alloc(void);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 24c0111..bf2ee11 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -134,15 +134,8 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > * Region tracking -- allows tracking of reservations and instantiated pages
> > * across the pages in a mapping.
> > *
> > - * The region data structures are protected by a combination of the mmap_sem
> > - * and the hugetlb_instantiation_mutex. To access or modify a region the caller
> > - * must either hold the mmap_sem for write, or the mmap_sem for read and
> > - * the hugetlb_instantiation_mutex:
> > - *
> > - * down_write(&mm->mmap_sem);
> > - * or
> > - * down_read(&mm->mmap_sem);
> > - * mutex_lock(&hugetlb_instantiation_mutex);
> > + * The region data structures are embedded into a resv_map and
> > + * protected by a resv_map's lock
> > */
> > struct file_region {
> > struct list_head link;
> > @@ -155,6 +148,7 @@ static long region_add(struct resv_map *resv, long f, long t)
> > struct list_head *head = &resv->regions;
> > struct file_region *rg, *nrg, *trg;
> >
> > + spin_lock(&resv->lock);
> > /* Locate the region we are either in or before. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -184,6 +178,7 @@ static long region_add(struct resv_map *resv, long f, long t)
> > }
> > nrg->from = f;
> > nrg->to = t;
> > + spin_unlock(&resv->lock);
> > return 0;
> > }
> >
> > @@ -193,6 +188,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > struct file_region *rg, *nrg;
> > long chg = 0;
> >
> > + spin_lock(&resv->lock);
> > /* Locate the region we are before or in. */
> > list_for_each_entry(rg, head, link)
> > if (f <= rg->to)
> > @@ -203,14 +199,18 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > * size such that we can guarantee to record the reservation. */
> > if (&rg->link == head || t < rg->from) {
> > nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
> > - if (!nrg)
> > - return -ENOMEM;
> > + if (!nrg) {
> > + chg = -ENOMEM;
> > + goto out;
> > + }
> > +
> > nrg->from = f;
> > nrg->to = f;
> > INIT_LIST_HEAD(&nrg->link);
> > list_add(&nrg->link, rg->link.prev);
> >
> > - return t - f;
> > + chg = t - f;
> > + goto out;
> > }
> >
> > /* Round our left edge to the current segment if it encloses us. */
> > @@ -223,7 +223,7 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > if (&rg->link == head)
> > break;
> > if (rg->from > t)
> > - return chg;
> > + goto out;
> >
> > /* We overlap with this area, if it extends further than
> > * us then we must extend ourselves. Account for its
> > @@ -234,6 +234,9 @@ static long region_chg(struct resv_map *resv, long f, long t)
> > }
> > chg -= rg->to - rg->from;
> > }
> > +
> > +out:
> > + spin_unlock(&resv->lock);
> > return chg;
> > }
> >
> > @@ -243,12 +246,13 @@ static long region_truncate(struct resv_map *resv, long end)
> > struct file_region *rg, *trg;
> > long chg = 0;
> >
> > + spin_lock(&resv->lock);
> > /* Locate the region we are either in or before. */
> > list_for_each_entry(rg, head, link)
> > if (end <= rg->to)
> > break;
> > if (&rg->link == head)
> > - return 0;
> > + goto out;
> >
> > /* If we are in the middle of a region then adjust it. */
> > if (end > rg->from) {
> > @@ -265,6 +269,9 @@ static long region_truncate(struct resv_map *resv, long end)
> > list_del(&rg->link);
> > kfree(rg);
> > }
> > +
> > +out:
> > + spin_unlock(&resv->lock);
> > return chg;
> > }
> >
> > @@ -274,6 +281,7 @@ static long region_count(struct resv_map *resv, long f, long t)
> > struct file_region *rg;
> > long chg = 0;
> >
> > + spin_lock(&resv->lock);
> > /* Locate each segment we overlap with, and count that overlap. */
> > list_for_each_entry(rg, head, link) {
> > long seg_from;
> > @@ -289,6 +297,7 @@ static long region_count(struct resv_map *resv, long f, long t)
> >
> > chg += seg_to - seg_from;
> > }
> > + spin_unlock(&resv->lock);
> >
> > return chg;
> > }
> > @@ -386,6 +395,7 @@ struct resv_map *resv_map_alloc(void)
> > return NULL;
> >
> > kref_init(&resv_map->refs);
> > + spin_lock_init(&resv_map->lock);
> > INIT_LIST_HEAD(&resv_map->regions);
> >
> > return resv_map;
>
>

2013-07-31 02:49:26

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
> On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
>> > alloc_huge_page_node() use dequeue_huge_page_node() without
>> > any validation check, so it can steal reserved page unconditionally.
>>
>> Well, why is it illegal to use reserved page here?
>
> If we use reserved page here, other processes which are promised to use
> enough hugepages cannot get enough hugepages and can die. This is
> unexpected result to them.
>
But, how do you determine that a huge page is requested by a process
that is not allowed to use reserved pages?

2013-07-31 04:41:05

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
> >> > any validation check, so it can steal reserved page unconditionally.
> >>
> >> Well, why is it illegal to use reserved page here?
> >
> > If we use reserved page here, other processes which are promised to use
> > enough hugepages cannot get enough hugepages and can die. This is
> > unexpected result to them.
> >
> But, how do you determine that a huge page is requested by a process
> that is not allowed to use reserved pages?

A reserved page exists only once per address or file offset. If we need to
move this page, the mapping has already consumed its own reserved page; this
page is it. So we should not consume another reserved page just to move it.
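
Concretely, the check I have in mind for alloc_huge_page_node() is roughly
the following (sketch only, reusing the existing hstate counters; not the
exact patch): dequeue a free hugepage only when doing so cannot eat into the
reserved pool, and otherwise fall back to the buddy allocator.

	struct page *page = NULL;

	spin_lock(&hugetlb_lock);
	/* Dequeue a free hugepage only if it cannot eat into the reserves */
	if (h->free_huge_pages - h->resv_huge_pages > 0)
		page = dequeue_huge_page_node(h, nid);
	spin_unlock(&hugetlb_lock);

	/* Otherwise allocate a fresh hugepage from the buddy allocator */
	if (!page)
		page = alloc_buddy_huge_page(h, nid);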

Thanks.

>

2013-07-31 04:53:27

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

Hello, Naoya.

On Mon, Jul 29, 2013 at 01:52:52PM -0400, Naoya Horiguchi wrote:
> Hi,
>
> On Mon, Jul 29, 2013 at 02:31:57PM +0900, Joonsoo Kim wrote:
> > vma_need_reservation() can be substituted by vma_has_reserves()
> > with minor change. These function do almost same thing,
> > so unifying them is better to maintain.
>
> I think it could, but not with minor changes.
> vma_has_reserves is mostly the negation of the vma_needs_reservation,
> but not exactly (consider that vma(VM_NORESERVE) does not have nor
> need any reservation.)

Yes, they are not exactly the same at the moment, but we can still merge them.
I think the confusion comes from the choice of base function. If I instead
merge vma_has_reserves() into vma_needs_reservation(), the result may be more
obvious.

>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bf2ee11..ff46a2c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> > vma->vm_private_data = (void *)0;
> > }
> >
> > -/* Returns true if the VMA has associated reserve pages */
> > -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > +/*
> > + * Determine if the huge page at addr within the vma has an associated
> > + * reservation. Where it does not we will need to logically increase
> > + * reservation and actually increase subpool usage before an allocation
> > + * can occur. Where any new reservation would be required the
> > + * reservation change is prepared, but not committed. Once the page
> > + * has been allocated from the subpool and instantiated the change should
> > + * be committed via vma_commit_reservation. No action is required on
> > + * failure.
> > + */
> > +static int vma_has_reserves(struct hstate *h,
> > + struct vm_area_struct *vma, unsigned long addr)
> > {
> > if (vma->vm_flags & VM_NORESERVE) {
> > /*
>
> Could you add more detailed comment on top of the function for better
> maintenability? What is the expected behavior, or what is the returned
> values for each case with what reasons?

Okay. I should do it.
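
Roughly, I am thinking of documenting the return values like this (exact
wording to be polished in the next spin):

	/*
	 * vma_has_reserves() checks whether the huge page at @addr within
	 * @vma is backed by a reservation, and prepares (but does not
	 * commit) any region entry needed to track it.
	 *
	 * Returns:
	 *  1       - the page will be accounted against the reserved pool
	 *  0       - no reservation; the caller must charge the subpool
	 *  -ENOMEM - allocating a region structure failed
	 */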

> And this function is not a simple reserve checker any more, but it can
> change reservation maps, so it would be nice to give more suitable name,
> though I don't have any good suggestions.

Okay.

>
> > @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > * step. Currently, we don't have any other solution to deal
> > * with this situation properly, so add work-around here.
> > */
> > - if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> > - return 1;
> > - else
> > - return 0;
> > + if (vma->vm_flags & VM_MAYSHARE) {
> > + struct address_space *mapping = vma->vm_file->f_mapping;
> > + struct inode *inode = mapping->host;
> > + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > + struct resv_map *resv = inode->i_mapping->private_data;
>
> It would be nice if we can get resv_map from vma in a single function for
> VM_MAYSHARE, so can you add it on vma_resv_map?

Yes, that's a good idea. I will consider it.

>
> > + long chg;
> > +
> > + chg = region_chg(resv, idx, idx + 1);
> > + if (chg < 0)
> > + return -ENOMEM;
> > +
> > + if (chg == 0)
> > + return 1;
> > + }
> > +
> > + return 0;
> > }
> >
> > /* Shared mappings always use reserves */
> > @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > * Only the process that called mmap() has reserves for
> > * private mappings.
> > */
> > - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > + struct resv_map *resv = vma_resv_map(vma);
> > +
> > + /* Just for allocating region structure */
> > + if (region_chg(resv, idx, idx + 1) < 0)
> > + return -ENOMEM;
> > +
> > return 1;
> > + }
> >
> > return 0;
> > }
> > @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> >
> > static struct page *dequeue_huge_page_vma(struct hstate *h,
> > struct vm_area_struct *vma,
> > - unsigned long address, int avoid_reserve,
> > - long chg)
> > + unsigned long address, int avoid_reserve)
> > {
> > struct page *page = NULL;
> > struct mempolicy *mpol;
> > @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> > * have no page reserves. This check ensures that reservations are
> > * not "stolen". The child may still get SIGKILLed
> > */
> > - if (!vma_has_reserves(vma, chg) &&
> > + if (!vma_has_reserves(h, vma, address) &&
> > h->free_huge_pages - h->resv_huge_pages == 0)
> > return NULL;
> >
> > @@ -578,7 +607,7 @@ retry_cpuset:
> > if (page) {
> > if (avoid_reserve)
> > break;
> > - if (!vma_has_reserves(vma, chg))
> > + if (!vma_has_reserves(h, vma, address))
> > break;
> >
> > h->resv_huge_pages--;
> > @@ -1077,42 +1106,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> > }
> > }
> >
> > -/*
> > - * Determine if the huge page at addr within the vma has an associated
> > - * reservation. Where it does not we will need to logically increase
> > - * reservation and actually increase subpool usage before an allocation
> > - * can occur. Where any new reservation would be required the
> > - * reservation change is prepared, but not committed. Once the page
> > - * has been allocated from the subpool and instantiated the change should
> > - * be committed via vma_commit_reservation. No action is required on
> > - * failure.
> > - */
> > -static long vma_needs_reservation(struct hstate *h,
> > - struct vm_area_struct *vma, unsigned long addr)
> > -{
> > - struct address_space *mapping = vma->vm_file->f_mapping;
> > - struct inode *inode = mapping->host;
> > -
> > - if (vma->vm_flags & VM_MAYSHARE) {
> > - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > - struct resv_map *resv = inode->i_mapping->private_data;
> > -
> > - return region_chg(resv, idx, idx + 1);
> > -
> > - } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > - return 1;
> > -
> > - } else {
> > - long err;
> > - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > - struct resv_map *resv = vma_resv_map(vma);
> > -
> > - err = region_chg(resv, idx, idx + 1);
> > - if (err < 0)
> > - return err;
> > - return 0;
> > - }
> > -}
> > static void vma_commit_reservation(struct hstate *h,
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > @@ -1140,8 +1133,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > struct hugepage_subpool *spool = subpool_vma(vma);
> > struct hstate *h = hstate_vma(vma);
> > struct page *page;
> > - long chg;
> > - int ret, idx;
> > + int ret, idx, has_reserve;
> > struct hugetlb_cgroup *h_cg;
> >
> > idx = hstate_index(h);
> > @@ -1153,20 +1145,21 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > * need pages and subpool limit allocated allocated if no reserve
> > * mapping overlaps.
> > */
> > - chg = vma_needs_reservation(h, vma, addr);
> > - if (chg < 0)
> > + has_reserve = vma_has_reserves(h, vma, addr);
> > + if (has_reserve < 0)
> > return ERR_PTR(-ENOMEM);
> > - if (chg)
> > - if (hugepage_subpool_get_pages(spool, chg))
> > +
> > + if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> > ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> > if (ret) {
> > - hugepage_subpool_put_pages(spool, chg);
> > + if (!has_reserve)
> > + hugepage_subpool_put_pages(spool, 1);
>
> It seems that this fixes the subpool accounting, so worth separating to
> another patch or writing on the description.

Okay!

Thanks.

>
> Thanks,
> Naoya Horiguchi
>
> > return ERR_PTR(-ENOSPC);
> > }
> > spin_lock(&hugetlb_lock);
> > - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> > + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> > if (!page) {
> > spin_unlock(&hugetlb_lock);
> > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > @@ -1174,7 +1167,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > hugetlb_cgroup_uncharge_cgroup(idx,
> > pages_per_huge_page(h),
> > h_cg);
> > - hugepage_subpool_put_pages(spool, chg);
> > + if (!has_reserve)
> > + hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > spin_lock(&hugetlb_lock);
> > @@ -2769,7 +2763,7 @@ retry:
> > * the spinlock.
> > */
> > if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > - if (vma_needs_reservation(h, vma, address) < 0) {
> > + if (vma_has_reserves(h, vma, address) < 0) {
> > ret = VM_FAULT_OOM;
> > goto backout_unlocked;
> > }
> > @@ -2860,7 +2854,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > * consumed.
> > */
> > if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> > - if (vma_needs_reservation(h, vma, address) < 0) {
> > + if (vma_has_reserves(h, vma, address) < 0) {
> > ret = VM_FAULT_OOM;
> > goto out_mutex;
> > }
> > --
> > 1.7.9.5
> >
>

2013-07-31 04:56:32

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm, hugetlb: remove vma_need_reservation()

On Tue, Jul 30, 2013 at 11:19:58PM +0530, Aneesh Kumar K.V wrote:
> Joonsoo Kim <[email protected]> writes:
>
> > vma_need_reservation() can be substituted by vma_has_reserves()
> > with minor change. These function do almost same thing,
> > so unifying them is better to maintain.
>
> I found the resulting code confusing and complex. I am sure there is
> more that what is explained in the commit message. If you are just doing
> this for cleanup, may be we should avoid doing this ?

I need this cleanup because I want to decide more clearly whether a given page
comes from the reserved page pool or not. Without it, that decision is harder
to make.
Anyway, I should describe the purpose of this patch in more detail.

Thanks.

>
>
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bf2ee11..ff46a2c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -451,8 +451,18 @@ void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> > vma->vm_private_data = (void *)0;
> > }
> >
> > -/* Returns true if the VMA has associated reserve pages */
> > -static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > +/*
> > + * Determine if the huge page at addr within the vma has an associated
> > + * reservation. Where it does not we will need to logically increase
> > + * reservation and actually increase subpool usage before an allocation
> > + * can occur. Where any new reservation would be required the
> > + * reservation change is prepared, but not committed. Once the page
> > + * has been allocated from the subpool and instantiated the change should
> > + * be committed via vma_commit_reservation. No action is required on
> > + * failure.
> > + */
> > +static int vma_has_reserves(struct hstate *h,
> > + struct vm_area_struct *vma, unsigned long addr)
> > {
> > if (vma->vm_flags & VM_NORESERVE) {
> > /*
> > @@ -464,10 +474,22 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > * step. Currently, we don't have any other solution to deal
> > * with this situation properly, so add work-around here.
> > */
> > - if (vma->vm_flags & VM_MAYSHARE && chg == 0)
> > - return 1;
> > - else
> > - return 0;
> > + if (vma->vm_flags & VM_MAYSHARE) {
> > + struct address_space *mapping = vma->vm_file->f_mapping;
> > + struct inode *inode = mapping->host;
> > + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > + struct resv_map *resv = inode->i_mapping->private_data;
> > + long chg;
> > +
> > + chg = region_chg(resv, idx, idx + 1);
> > + if (chg < 0)
> > + return -ENOMEM;
> > +
> > + if (chg == 0)
> > + return 1;
> > + }
> > +
> > + return 0;
> > }
> >
> > /* Shared mappings always use reserves */
> > @@ -478,8 +500,16 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
> > * Only the process that called mmap() has reserves for
> > * private mappings.
> > */
> > - if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > + if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > + pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > + struct resv_map *resv = vma_resv_map(vma);
> > +
> > + /* Just for allocating region structure */
> > + if (region_chg(resv, idx, idx + 1) < 0)
> > + return -ENOMEM;
> > +
> > return 1;
> > + }
> >
> > return 0;
> > }
> > @@ -542,8 +572,7 @@ static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
> >
> > static struct page *dequeue_huge_page_vma(struct hstate *h,
> > struct vm_area_struct *vma,
> > - unsigned long address, int avoid_reserve,
> > - long chg)
> > + unsigned long address, int avoid_reserve)
> > {
> > struct page *page = NULL;
> > struct mempolicy *mpol;
> > @@ -558,7 +587,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> > * have no page reserves. This check ensures that reservations are
> > * not "stolen". The child may still get SIGKILLed
> > */
> > - if (!vma_has_reserves(vma, chg) &&
> > + if (!vma_has_reserves(h, vma, address) &&
> > h->free_huge_pages - h->resv_huge_pages == 0)
> > return NULL;
> >
> > @@ -578,7 +607,7 @@ retry_cpuset:
> > if (page) {
> > if (avoid_reserve)
> > break;
> > - if (!vma_has_reserves(vma, chg))
> > + if (!vma_has_reserves(h, vma, address))
> > break;
> >
> > h->resv_huge_pages--;
> > @@ -1077,42 +1106,6 @@ static void return_unused_surplus_pages(struct hstate *h,
> > }
> > }
> >
> > -/*
> > - * Determine if the huge page at addr within the vma has an associated
> > - * reservation. Where it does not we will need to logically increase
> > - * reservation and actually increase subpool usage before an allocation
> > - * can occur. Where any new reservation would be required the
> > - * reservation change is prepared, but not committed. Once the page
> > - * has been allocated from the subpool and instantiated the change should
> > - * be committed via vma_commit_reservation. No action is required on
> > - * failure.
> > - */
> > -static long vma_needs_reservation(struct hstate *h,
> > - struct vm_area_struct *vma, unsigned long addr)
> > -{
> > - struct address_space *mapping = vma->vm_file->f_mapping;
> > - struct inode *inode = mapping->host;
> > -
> > - if (vma->vm_flags & VM_MAYSHARE) {
> > - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > - struct resv_map *resv = inode->i_mapping->private_data;
> > -
> > - return region_chg(resv, idx, idx + 1);
> > -
> > - } else if (!is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
> > - return 1;
> > -
> > - } else {
> > - long err;
> > - pgoff_t idx = vma_hugecache_offset(h, vma, addr);
> > - struct resv_map *resv = vma_resv_map(vma);
> > -
> > - err = region_chg(resv, idx, idx + 1);
> > - if (err < 0)
> > - return err;
> > - return 0;
> > - }
> > -}
> > static void vma_commit_reservation(struct hstate *h,
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > @@ -1140,8 +1133,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > struct hugepage_subpool *spool = subpool_vma(vma);
> > struct hstate *h = hstate_vma(vma);
> > struct page *page;
> > - long chg;
> > - int ret, idx;
> > + int ret, idx, has_reserve;
> > struct hugetlb_cgroup *h_cg;
> >
> > idx = hstate_index(h);
> > @@ -1153,20 +1145,21 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > * need pages and subpool limit allocated allocated if no reserve
> > * mapping overlaps.
> > */
> > - chg = vma_needs_reservation(h, vma, addr);
> > - if (chg < 0)
> > + has_reserve = vma_has_reserves(h, vma, addr);
> > + if (has_reserve < 0)
> > return ERR_PTR(-ENOMEM);
> > - if (chg)
> > - if (hugepage_subpool_get_pages(spool, chg))
> > +
> > + if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> >
> > ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> > if (ret) {
> > - hugepage_subpool_put_pages(spool, chg);
> > + if (!has_reserve)
> > + hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > spin_lock(&hugetlb_lock);
> > - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
> > + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
> > if (!page) {
> > spin_unlock(&hugetlb_lock);
> > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> > @@ -1174,7 +1167,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > hugetlb_cgroup_uncharge_cgroup(idx,
> > pages_per_huge_page(h),
> > h_cg);
> > - hugepage_subpool_put_pages(spool, chg);
> > + if (!has_reserve)
> > + hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > spin_lock(&hugetlb_lock);
> > @@ -2769,7 +2763,7 @@ retry:
> > * the spinlock.
> > */
> > if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > - if (vma_needs_reservation(h, vma, address) < 0) {
> > + if (vma_has_reserves(h, vma, address) < 0) {
> > ret = VM_FAULT_OOM;
> > goto backout_unlocked;
> > }
> > @@ -2860,7 +2854,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > * consumed.
> > */
> > if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
> > - if (vma_needs_reservation(h, vma, address) < 0) {
> > + if (vma_has_reserves(h, vma, address) < 0) {
> > ret = VM_FAULT_OOM;
> > goto out_mutex;
> > }
> > --
> > 1.7.9.5
>

2013-07-31 05:02:53

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/18] mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve

On Mon, Jul 29, 2013 at 02:05:51PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:31:59PM +0900, Joonsoo Kim wrote:
> > When we try to get a huge page with avoid_reserve, we don't consume
> > a reserved page. So it is treated like as non-reserve case.
>
> This patch will be completely overwritten with 9/18.
> So is this patch necessary?

Yes. This is a bug fix, so it should stay separate.
When we try to allocate with avoid_reserve, we don't use the reserved page
pool. So hugepage_subpool_get_pages() should be called, and we should return
an error if it fails.

If we merge these into one patch, it is no longer visible that a bug existed.

Thanks.

>
> Naoya Horiguchi
>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 1426c03..749629e 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1149,12 +1149,13 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > if (has_reserve < 0)
> > return ERR_PTR(-ENOMEM);
> >
> > - if (!has_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > + if ((!has_reserve || avoid_reserve)
> > + && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> >
> > ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> > if (ret) {
> > - if (!has_reserve)
> > + if (!has_reserve || avoid_reserve)
> > hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > @@ -1167,7 +1168,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > hugetlb_cgroup_uncharge_cgroup(idx,
> > pages_per_huge_page(h),
> > h_cg);
> > - if (!has_reserve)
> > + if (!has_reserve || avoid_reserve)
> > hugepage_subpool_put_pages(spool, 1);
> > return ERR_PTR(-ENOSPC);
> > }
> > --
> > 1.7.9.5
> >
>

2013-07-31 05:06:06

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 10/18] mm, hugetlb: call vma_has_reserve() before entering alloc_huge_page()

On Mon, Jul 29, 2013 at 02:27:54PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:01PM +0900, Joonsoo Kim wrote:
> > To implement a graceful failure handling, we need to know whether
> > allocation request is for reserved pool or not, on higher level.
> > In this patch, we just move up vma_has_reseve() to caller function
> > in order to know it. There is no functional change.
> >
> > Following patches implement a grace failure handling and remove
> > a hugetlb_instantiation_mutex.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a66226e..5f31ca5 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1123,12 +1123,12 @@ static void vma_commit_reservation(struct hstate *h,
> > }
> >
> > static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > - unsigned long addr, int avoid_reserve)
> > + unsigned long addr, int use_reserve)
> > {
> > struct hugepage_subpool *spool = subpool_vma(vma);
> > struct hstate *h = hstate_vma(vma);
> > struct page *page;
> > - int ret, idx, use_reserve;
> > + int ret, idx;
> > struct hugetlb_cgroup *h_cg;
> >
> > idx = hstate_index(h);
> > @@ -1140,11 +1140,6 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
> > * need pages and subpool limit allocated allocated if no reserve
> > * mapping overlaps.
> > */
> > - use_reserve = vma_has_reserves(h, vma, addr);
> > - if (use_reserve < 0)
> > - return ERR_PTR(-ENOMEM);
> > -
> > - use_reserve = use_reserve && !avoid_reserve;
> > if (!use_reserve && (hugepage_subpool_get_pages(spool, 1) < 0))
> > return ERR_PTR(-ENOSPC);
> >
> > @@ -2520,7 +2515,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> > {
> > struct hstate *h = hstate_vma(vma);
> > struct page *old_page, *new_page;
> > - int outside_reserve = 0;
> > + int use_reserve, outside_reserve = 0;
> > unsigned long mmun_start; /* For mmu_notifiers */
> > unsigned long mmun_end; /* For mmu_notifiers */
> >
> > @@ -2553,7 +2548,18 @@ retry_avoidcopy:
> >
> > /* Drop page_table_lock as buddy allocator may be called */
> > spin_unlock(&mm->page_table_lock);
> > - new_page = alloc_huge_page(vma, address, outside_reserve);
> > +
> > + use_reserve = vma_has_reserves(h, vma, address);
> > + if (use_reserve == -ENOMEM) {
> > + page_cache_release(old_page);
> > +
> > + /* Caller expects lock to be held */
> > + spin_lock(&mm->page_table_lock);
> > + return VM_FAULT_OOM;
> > + }
> > + use_reserve = use_reserve && !outside_reserve;
>
> When outside_reserve is true, we don't have to call vma_has_reserves
> because then use_reserve is always false. So something like:
>
> use_reserve = 0;
> if (!outside_reserve) {
> use_reserve = vma_has_reserves(...);
> ...
> }
>
> looks better to me.
> Or if you expect vma_has_reserves to change resv_map implicitly,
> could you add a comment about it.

Yes, you are right.
I will change it.
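
Maybe something like this (untested sketch of the reworked hugetlb_cow()
snippet, with your suggestion folded in):

	use_reserve = 0;
	if (!outside_reserve) {
		use_reserve = vma_has_reserves(h, vma, address);
		if (use_reserve == -ENOMEM) {
			page_cache_release(old_page);
			/* Caller expects page_table_lock to be held */
			spin_lock(&mm->page_table_lock);
			return VM_FAULT_OOM;
		}
	}

	new_page = alloc_huge_page(vma, address, use_reserve);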

Thanks.

>
> Thanks,
> Naoya Horiguchi
>
> > +
> > + new_page = alloc_huge_page(vma, address, use_reserve);
> >
> > if (IS_ERR(new_page)) {
> > long err = PTR_ERR(new_page);
> > @@ -2679,6 +2685,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct page *page;
> > struct address_space *mapping;
> > pte_t new_pte;
> > + int use_reserve = 0;
> >
> > /*
> > * Currently, we are forced to kill the process in the event the
> > @@ -2704,7 +2711,14 @@ retry:
> > size = i_size_read(mapping->host) >> huge_page_shift(h);
> > if (idx >= size)
> > goto out;
> > - page = alloc_huge_page(vma, address, 0);
> > +
> > + use_reserve = vma_has_reserves(h, vma, address);
> > + if (use_reserve == -ENOMEM) {
> > + ret = VM_FAULT_OOM;
> > + goto out;
> > + }
> > +
> > + page = alloc_huge_page(vma, address, use_reserve);
> > if (IS_ERR(page)) {
> > ret = PTR_ERR(page);
> > if (ret == -ENOMEM)
> > --
> > 1.7.9.5
> >
>

2013-07-31 05:08:26

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 11/18] mm, hugetlb: move down outside_reserve check

On Mon, Jul 29, 2013 at 02:39:30PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:02PM +0900, Joonsoo Kim wrote:
> > Just move down outsider_reserve check.
> > This makes code more readable.
> >
> > There is no functional change.
>
> Why don't you do this in 10/18?

Just to help review :)
A step-by-step approach may make review easier, so I decided to keep it
separate. If you would rather not have it, I will merge it in the next spin.

Thanks.

>
> Thanks,
> Naoya Horiguchi
>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5f31ca5..94173e0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2530,20 +2530,6 @@ retry_avoidcopy:
> > return 0;
> > }
> >
> > - /*
> > - * If the process that created a MAP_PRIVATE mapping is about to
> > - * perform a COW due to a shared page count, attempt to satisfy
> > - * the allocation without using the existing reserves. The pagecache
> > - * page is used to determine if the reserve at this address was
> > - * consumed or not. If reserves were used, a partial faulted mapping
> > - * at the time of fork() could consume its reserves on COW instead
> > - * of the full address range.
> > - */
> > - if (!(vma->vm_flags & VM_MAYSHARE) &&
> > - is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > - old_page != pagecache_page)
> > - outside_reserve = 1;
> > -
> > page_cache_get(old_page);
> >
> > /* Drop page_table_lock as buddy allocator may be called */
> > @@ -2557,6 +2543,20 @@ retry_avoidcopy:
> > spin_lock(&mm->page_table_lock);
> > return VM_FAULT_OOM;
> > }
> > +
> > + /*
> > + * If the process that created a MAP_PRIVATE mapping is about to
> > + * perform a COW due to a shared page count, attempt to satisfy
> > + * the allocation without using the existing reserves. The pagecache
> > + * page is used to determine if the reserve at this address was
> > + * consumed or not. If reserves were used, a partial faulted mapping
> > + * at the time of fork() could consume its reserves on COW instead
> > + * of the full address range.
> > + */
> > + if (!(vma->vm_flags & VM_MAYSHARE) &&
> > + is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > + old_page != pagecache_page)
> > + outside_reserve = 1;
> > use_reserve = use_reserve && !outside_reserve;
> >
> > new_page = alloc_huge_page(vma, address, use_reserve);
> > --
> > 1.7.9.5
> >

2013-07-31 05:12:23

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

On Mon, Jul 29, 2013 at 03:19:15PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 03:05:37PM -0400, Naoya Horiguchi wrote:
> > On Mon, Jul 29, 2013 at 02:32:06PM +0900, Joonsoo Kim wrote:
> > > If we fail with a allocated hugepage, it is hard to recover properly.
> > > One such example is reserve count. We don't have any method to recover
> > > reserve count. Although, I will introduce a function to recover reserve
> > > count in following patch, it is better not to allocate a hugepage
> > > as much as possible. So move up anon_vma_prepare() which can be failed
> > > in OOM situation.
> > >
> > > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > Reviewed-by: Naoya Horiguchi <[email protected]>
>
> Sorry, let me suspend this Reviewed for a question.
> If alloc_huge_page failed after we succeeded anon_vma_parepare,
> the allocated anon_vma_chain and/or anon_vma are safely freed?
> Or don't we have to free them?

Yes, they will be freed by free_pgtables() and then unlink_anon_vmas() when
the task terminates, so we don't have to free them here.

Thanks.

>
> Thanks,
> Naoya Horiguchi
>
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 683fd38..bb8a45f 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2536,6 +2536,15 @@ retry_avoidcopy:
> > > /* Drop page_table_lock as buddy allocator may be called */
> > > spin_unlock(&mm->page_table_lock);
> > >
> > > + /*
> > > + * When the original hugepage is shared one, it does not have
> > > + * anon_vma prepared.
> > > + */
> > > + if (unlikely(anon_vma_prepare(vma))) {
> > > + ret = VM_FAULT_OOM;
> > > + goto out_old_page;
> > > + }
> > > +
> > > use_reserve = vma_has_reserves(h, vma, address);
> > > if (use_reserve == -ENOMEM) {
> > > ret = VM_FAULT_OOM;
> > > @@ -2590,15 +2599,6 @@ retry_avoidcopy:
> > > goto out_lock;
> > > }
> > >
> > > - /*
> > > - * When the original hugepage is shared one, it does not have
> > > - * anon_vma prepared.
> > > - */
> > > - if (unlikely(anon_vma_prepare(vma))) {
> > > - ret = VM_FAULT_OOM;
> > > - goto out_new_page;
> > > - }
> > > -
> > > copy_user_huge_page(new_page, old_page, address, vma,
> > > pages_per_huge_page(h));
> > > __SetPageUptodate(new_page);
> > > @@ -2625,7 +2625,6 @@ retry_avoidcopy:
> > > spin_unlock(&mm->page_table_lock);
> > > mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> > >
> > > -out_new_page:
> > > page_cache_release(new_page);
> > > out_old_page:
> > > page_cache_release(old_page);
> > > --
> > > 1.7.9.5
> > >

2013-07-31 05:21:30

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm, hugetlb: return a reserved page to a reserved pool if failed

On Mon, Jul 29, 2013 at 04:19:10PM -0400, Naoya Horiguchi wrote:
> On Mon, Jul 29, 2013 at 02:32:07PM +0900, Joonsoo Kim wrote:
> > If we fail with a reserved page, just calling put_page() is not sufficient,
> > because put_page() invoke free_huge_page() at last step and it doesn't
> > know whether a page comes from a reserved pool or not. So it doesn't do
> > anything related to reserved count. This makes reserve count lower
> > than how we need, because reserve count already decrease in
> > dequeue_huge_page_vma(). This patch fix this situation.
>
> I think we could use a page flag (for example PG_reserve) on a hugepage
> in order to record that the hugepage comes from the reserved pool.
> Furthermore, the reserve flag would be set when dequeueing a free hugepage,
> and cleared when hugepage_fault returns, whether it fails or not.
> I think it's simpler than put_page variant approach, but doesn't it work
> to solve your problem?

Yes, that's a good idea.
I had considered this approach before, but didn't implement it that way
because I was worried it would make the patchset larger and more complex.
Implementing it that way may well be better, though.
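
Roughly, I imagine something like the following; the PageHugeReserve()
accessors are just placeholders here, not existing helpers:

	/* When dequeueing a hugepage from the reserved pool: */
	SetPageHugeReserve(page);		/* placeholder name */
	h->resv_huge_pages--;

	/* In free_huge_page(), whether the fault succeeded or not: */
	if (PageHugeReserve(page)) {		/* placeholder name */
		ClearPageHugeReserve(page);
		h->resv_huge_pages++;
	}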

Thanks.

>
> Thanks,
> Naoya Horiguchi
>
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb8a45f..6a9ec69 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -649,6 +649,34 @@ struct hstate *size_to_hstate(unsigned long size)
> > return NULL;
> > }
> >
> > +static void put_huge_page(struct page *page, int use_reserve)
> > +{
> > + struct hstate *h = page_hstate(page);
> > + struct hugepage_subpool *spool =
> > + (struct hugepage_subpool *)page_private(page);
> > +
> > + if (!use_reserve) {
> > + put_page(page);
> > + return;
> > + }
> > +
> > + if (!put_page_testzero(page))
> > + return;
> > +
> > + set_page_private(page, 0);
> > + page->mapping = NULL;
> > + BUG_ON(page_count(page));
> > + BUG_ON(page_mapcount(page));
> > +
> > + spin_lock(&hugetlb_lock);
> > + hugetlb_cgroup_uncharge_page(hstate_index(h),
> > + pages_per_huge_page(h), page);
> > + enqueue_huge_page(h, page);
> > + h->resv_huge_pages++;
> > + spin_unlock(&hugetlb_lock);
> > + hugepage_subpool_put_pages(spool, 1);
> > +}
> > +
> > static void free_huge_page(struct page *page)
> > {
> > /*
> > @@ -2625,7 +2653,7 @@ retry_avoidcopy:
> > spin_unlock(&mm->page_table_lock);
> > mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >
> > - page_cache_release(new_page);
> > + put_huge_page(new_page, use_reserve);
> > out_old_page:
> > page_cache_release(old_page);
> > out_lock:
> > @@ -2725,7 +2753,7 @@ retry:
> >
> > err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
> > if (err) {
> > - put_page(page);
> > + put_huge_page(page, use_reserve);
> > if (err == -EEXIST)
> > goto retry;
> > goto out;
> > @@ -2798,7 +2826,7 @@ backout:
> > spin_unlock(&mm->page_table_lock);
> > backout_unlocked:
> > unlock_page(page);
> > - put_page(page);
> > + put_huge_page(page, use_reserve);
> > goto out;
> > }
> >
> > --
> > 1.7.9.5
> >

2013-07-31 05:37:56

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

Hello, David.

On Mon, Jul 29, 2013 at 05:28:23PM +1000, David Gibson wrote:
> On Mon, Jul 29, 2013 at 02:32:08PM +0900, Joonsoo Kim wrote:
> > If parallel fault occur, we can fail to allocate a hugepage,
> > because many threads dequeue a hugepage to handle a fault of same address.
> > This makes reserved pool shortage just for a little while and this cause
> > faulting thread who is ensured to have enough reserved hugepages
> > to get a SIGBUS signal.
>
> It's not just about reserved pages. The same race can happen
> perfectly well when you're really, truly allocating the last hugepage
> in the system.

Yes, you are right.
This is a critical comment for this patchset :(

IIUC, the case you mention is about tasks that have a mapping with
MAP_NORESERVE. Should we guarantee that they can allocate the last hugepage?
They mapped the region with MAP_NORESERVE, so they should not assume that
their requests always succeed.

>
> >
> > To solve this problem, we already have a nice solution, that is,
> > a hugetlb_instantiation_mutex. This blocks other threads to dive into
> > a fault handler. This solve the problem clearly, but it introduce
> > performance degradation, because it serialize all fault handling.
> >
> > Now, I try to remove a hugetlb_instantiation_mutex to get rid of
> > performance degradation. A prerequisite is that other thread should
> > not get a SIGBUS if they are ensured to have enough reserved pages.
> >
> > For this purpose, if we fail to allocate a new hugepage with use_reserve,
> > we return just 0, instead of VM_FAULT_SIGBUS. use_reserve
> > represent that this user is legimate one who are ensured to have enough
> > reserved pages. This prevent these thread not to get a SIGBUS signal and
> > make these thread retrying fault handling.
>
> Not sufficient, since it can happen without reserved pages.

Ditto.

>
> Also, I think there are edge cases where even reserved mappings can
> run out, in particular with the interaction between MAP_PRIVATE,
> fork() and reservations. In this case, when you have a genuine out of
> memory condition, you will spin forever on the fault.

If there are edge cases, we can fix them; that is not a fundamental problem.
If you find one, please describe it in more detail.

Thanks.

>
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6a9ec69..909075b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2623,7 +2623,10 @@ retry_avoidcopy:
> > WARN_ON_ONCE(1);
> > }
> >
> > - ret = VM_FAULT_SIGBUS;
> > + if (use_reserve)
> > + ret = 0;
> > + else
> > + ret = VM_FAULT_SIGBUS;
> > goto out_lock;
> > }
> >
> > @@ -2741,7 +2744,10 @@ retry:
> >
> > page = alloc_huge_page(vma, address, use_reserve);
> > if (IS_ERR(page)) {
> > - ret = VM_FAULT_SIGBUS;
> > + if (use_reserve)
> > + ret = 0;
> > + else
> > + ret = VM_FAULT_SIGBUS;
> > goto out;
> > }
> > clear_huge_page(page, address, pages_per_huge_page(h));
>
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson

2013-07-31 06:21:40

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
> On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
>> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
>> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
>> >> > any validation check, so it can steal reserved page unconditionally.
>> >>
>> >> Well, why is it illegal to use reserved page here?
>> >
>> > If we use reserved page here, other processes which are promised to use
>> > enough hugepages cannot get enough hugepages and can die. This is
>> > unexpected result to them.
>> >
>> But, how do you determine that a huge page is requested by a process
>> that is not allowed to use reserved pages?
>
> Reserved page is just one for each address or file offset. If we need to
> move this page, this means that it already use it's own reserved page, this
> page is it. So we should not use other reserved page for moving this page.
>
Hm, how do you determine "this page" is not buddy?

2013-07-31 06:37:42

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
> >> >> > any validation check, so it can steal reserved page unconditionally.
> >> >>
> >> >> Well, why is it illegal to use reserved page here?
> >> >
> >> > If we use reserved page here, other processes which are promised to use
> >> > enough hugepages cannot get enough hugepages and can die. This is
> >> > unexpected result to them.
> >> >
> >> But, how do you determine that a huge page is requested by a process
> >> that is not allowed to use reserved pages?
> >
> > Reserved page is just one for each address or file offset. If we need to
> > move this page, this means that it already use it's own reserved page, this
> > page is it. So we should not use other reserved page for moving this page.
> >
> Hm, how do you determine "this page" is not buddy?

If this page comes from the buddy allocator, it doesn't matter. That implies
this mapping cannot use the reserved page pool, because we always allocate
from the reserved page pool first.

Thanks.

>

2013-07-31 15:25:08

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim <[email protected]> wrote:
> On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
>> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
>> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
>> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
>> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
>> >> >> > any validation check, so it can steal reserved page unconditionally.
>> >> >>
>> >> >> Well, why is it illegal to use reserved page here?
>> >> >
>> >> > If we use reserved page here, other processes which are promised to use
>> >> > enough hugepages cannot get enough hugepages and can die. This is
>> >> > unexpected result to them.
>> >> >
>> >> But, how do you determine that a huge page is requested by a process
>> >> that is not allowed to use reserved pages?
>> >
>> > Reserved page is just one for each address or file offset. If we need to
>> > move this page, this means that it already use it's own reserved page, this
>> > page is it. So we should not use other reserved page for moving this page.
>> >
>> Hm, how do you determine "this page" is not buddy?
>
> If this page comes from the buddy, it doesn't matter. It imply that
> this mapping cannot use reserved page pool, because we always allocate
> a page from reserved page pool first.
>
A buddy page also implies, if the mapping can use reserved pages, that no
reserved page was available when requested. Now we can try reserved
page again.

2013-07-31 16:44:20

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm, hugetlb: move up anon_vma_prepare()

On Wed, Jul 31, 2013 at 02:12:21PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 29, 2013 at 03:19:15PM -0400, Naoya Horiguchi wrote:
> > On Mon, Jul 29, 2013 at 03:05:37PM -0400, Naoya Horiguchi wrote:
> > > On Mon, Jul 29, 2013 at 02:32:06PM +0900, Joonsoo Kim wrote:
> > > > If we fail with a allocated hugepage, it is hard to recover properly.
> > > > One such example is reserve count. We don't have any method to recover
> > > > reserve count. Although, I will introduce a function to recover reserve
> > > > count in following patch, it is better not to allocate a hugepage
> > > > as much as possible. So move up anon_vma_prepare() which can be failed
> > > > in OOM situation.
> > > >
> > > > Signed-off-by: Joonsoo Kim <[email protected]>
> > >
> > > Reviewed-by: Naoya Horiguchi <[email protected]>
> >
> > Sorry, let me suspend this Reviewed for a question.
> > If alloc_huge_page failed after we succeeded anon_vma_parepare,
> > the allocated anon_vma_chain and/or anon_vma are safely freed?
> > Or don't we have to free them?
>
> Yes, it will be freed by free_pgtables() and then unlink_anon_vmas()
> when a task terminate. So, we don't have to free them.

OK, thanks for clarification.

Reviewed-by: Naoya Horiguchi <[email protected]>

2013-07-31 20:46:49

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 11/18] mm, hugetlb: move down outside_reserve check

On Wed, Jul 31, 2013 at 02:08:24PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 29, 2013 at 02:39:30PM -0400, Naoya Horiguchi wrote:
> > On Mon, Jul 29, 2013 at 02:32:02PM +0900, Joonsoo Kim wrote:
> > > Just move down outsider_reserve check.
> > > This makes code more readable.
> > >
> > > There is no functional change.
> >
> > Why don't you do this in 10/18?
>
> Just help to review :)
> Step-by-step approach may help to review, so I decide to be separate it.
> If you don't want it, I will merge it in next spin.

OK, it's up to you actually.

2013-07-31 20:56:26

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH 08/18] mm, hugetlb: do hugepage_subpool_get_pages() when avoid_reserve

On Wed, Jul 31, 2013 at 02:02:50PM +0900, Joonsoo Kim wrote:
> On Mon, Jul 29, 2013 at 02:05:51PM -0400, Naoya Horiguchi wrote:
> > On Mon, Jul 29, 2013 at 02:31:59PM +0900, Joonsoo Kim wrote:
> > > When we try to get a huge page with avoid_reserve, we don't consume
> > > a reserved page. So it is treated like as non-reserve case.
> >
> > This patch will be completely overwritten with 9/18.
> > So is this patch necessary?
>
> Yes. This is a bug fix, so should be separate.
> When we try to allocate with avoid_reserve, we don't use reserved page pool.
> So, hugepage_subpool_get_pages() should be called and returned if failed.
>
> If we merge these into one, we cannot know that there exists a bug.

OK, so you can merge this with the subpool accounting fix in 6/18
as one fix.

Thanks,
Naoya

2013-08-01 06:07:44

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Wed, Jul 31, 2013 at 11:25:04PM +0800, Hillf Danton wrote:
> On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim <[email protected]> wrote:
> > On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
> >> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
> >> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
> >> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
> >> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
> >> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
> >> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
> >> >> >> > any validation check, so it can steal reserved page unconditionally.
> >> >> >>
> >> >> >> Well, why is it illegal to use reserved page here?
> >> >> >
> >> >> > If we use reserved page here, other processes which are promised to use
> >> >> > enough hugepages cannot get enough hugepages and can die. This is
> >> >> > unexpected result to them.
> >> >> >
> >> >> But, how do you determine that a huge page is requested by a process
> >> >> that is not allowed to use reserved pages?
> >> >
> >> > Reserved page is just one for each address or file offset. If we need to
> >> > move this page, this means that it already use it's own reserved page, this
> >> > page is it. So we should not use other reserved page for moving this page.
> >> >
> >> Hm, how do you determine "this page" is not buddy?
> >
> > If this page comes from the buddy, it doesn't matter. It imply that
> > this mapping cannot use reserved page pool, because we always allocate
> > a page from reserved page pool first.
> >
> A buddy page also implies, if the mapping can use reserved pages, that no
> reserved page was available when requested. Now we can try reserved
> page again.

If the mapping can use reserved pages, a reserved page is always allocated
before we fall back to the buddy allocator. So the fact that this page exists
implies that its reservation has already been consumed. Hmm, am I missing
something?
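
Put differently, the ordering in alloc_huge_page() (simplified from the hunks
quoted earlier in this thread) is:

	spin_lock(&hugetlb_lock);
	/* Consumes this address's reservation, if one exists */
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
	if (!page) {
		spin_unlock(&hugetlb_lock);
		/* Only reached when no suitable free hugepage was found */
		page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
	}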

Thanks.

>

2013-08-01 16:17:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

Hillf Danton <[email protected]> writes:

> On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim <[email protected]> wrote:
>> On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
>>> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
>>> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>>> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
>>> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>>> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
>>> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
>>> >> >> > any validation check, so it can steal reserved page unconditionally.
>>> >> >>
>>> >> >> Well, why is it illegal to use reserved page here?
>>> >> >
>>> >> > If we use reserved page here, other processes which are promised to use
>>> >> > enough hugepages cannot get enough hugepages and can die. This is
>>> >> > unexpected result to them.
>>> >> >
>>> >> But, how do you determine that a huge page is requested by a process
>>> >> that is not allowed to use reserved pages?
>>> >
>>> > Reserved page is just one for each address or file offset. If we need to
>>> > move this page, this means that it already use it's own reserved page, this
>>> > page is it. So we should not use other reserved page for moving this page.
>>> >
>>> Hm, how do you determine "this page" is not buddy?
>>
>> If this page comes from the buddy, it doesn't matter. It imply that
>> this mapping cannot use reserved page pool, because we always allocate
>> a page from reserved page pool first.
>>
> A buddy page also implies, if the mapping can use reserved pages, that no
> reserved page was available when requested. Now we can try reserved
> page again.

I didn't quite get that. My understanding is that the new page we are
allocating here for soft offline should not come from the reserve pool. If it
did, we might later see an allocation failure for a request that had already
made a page reservation. To avoid that, we make sure there are enough free
pages outside the reserve pool before we dequeue the huge page; if there are
not, we use the buddy allocator to allocate the hugepage.

-aneesh

2013-08-04 05:10:09

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

On Fri, Aug 2, 2013 at 12:17 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> Hillf Danton <[email protected]> writes:
>
>> On Wed, Jul 31, 2013 at 2:37 PM, Joonsoo Kim <[email protected]> wrote:
>>> On Wed, Jul 31, 2013 at 02:21:38PM +0800, Hillf Danton wrote:
>>>> On Wed, Jul 31, 2013 at 12:41 PM, Joonsoo Kim <[email protected]> wrote:
>>>> > On Wed, Jul 31, 2013 at 10:49:24AM +0800, Hillf Danton wrote:
>>>> >> On Wed, Jul 31, 2013 at 10:27 AM, Joonsoo Kim <[email protected]> wrote:
>>>> >> > On Mon, Jul 29, 2013 at 03:24:46PM +0800, Hillf Danton wrote:
>>>> >> >> On Mon, Jul 29, 2013 at 1:31 PM, Joonsoo Kim <[email protected]> wrote:
>>>> >> >> > alloc_huge_page_node() use dequeue_huge_page_node() without
>>>> >> >> > any validation check, so it can steal reserved page unconditionally.
>>>> >> >>
>>>> >> >> Well, why is it illegal to use reserved page here?
>>>> >> >
>>>> >> > If we use reserved page here, other processes which are promised to use
>>>> >> > enough hugepages cannot get enough hugepages and can die. This is
>>>> >> > unexpected result to them.
>>>> >> >
>>>> >> But, how do you determine that a huge page is requested by a process
>>>> >> that is not allowed to use reserved pages?
>>>> >
>>>> > Reserved page is just one for each address or file offset. If we need to
>>>> > move this page, this means that it already use it's own reserved page, this
>>>> > page is it. So we should not use other reserved page for moving this page.
>>>> >
>>>> Hm, how do you determine "this page" is not buddy?
>>>
>>> If this page comes from the buddy, it doesn't matter. It imply that
>>> this mapping cannot use reserved page pool, because we always allocate
>>> a page from reserved page pool first.
>>>
>> A buddy page also implies, if the mapping can use reserved pages, that no
>> reserved page was available when requested. Now we can try reserved
>> page again.
>
> I didn't quiet get that. My understanding is, the new page we are

Neither did I ;)

> allocating here for soft offline should not be allocated from the
> reserve pool. If we do that we may possibly have allocation failure
> later for a request that had done page reservation. Now to
> avoid that we make sure we have enough free pages outside reserve pool
> so that we can dequeue the huge page. If not we use buddy to allocate
> the hugepage.
>
What if it is a mapping with HPAGE_RESV_OWNER set?
Or can we prevent the owner from using any page available here?

2013-08-04 09:48:39

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Wed, Jul 31, 2013 at 02:37:53PM +0900, Joonsoo Kim wrote:
> Hello, David.
>
> On Mon, Jul 29, 2013 at 05:28:23PM +1000, David Gibson wrote:
> > On Mon, Jul 29, 2013 at 02:32:08PM +0900, Joonsoo Kim wrote:
> > > If parallel fault occur, we can fail to allocate a hugepage,
> > > because many threads dequeue a hugepage to handle a fault of same address.
> > > This makes reserved pool shortage just for a little while and this cause
> > > faulting thread who is ensured to have enough reserved hugepages
> > > to get a SIGBUS signal.
> >
> > It's not just about reserved pages. The same race can happen
> > perfectly well when you're really, truly allocating the last hugepage
> > in the system.
>
> Yes, you are right.
> This is a critical comment to this patchset :(
>
> IIUC, the case you mentioned is about tasks which have a mapping
> with MAP_NORESERVE.

Any mapping that doesn't use the reserved pool, not just
MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
then fork()s then the mapping is instantiated in the child, that will
not draw from the reserved pool.
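
As a concrete illustration of that case, here is a minimal userspace
sketch (assuming 2MB hugepages are configured and MAP_HUGETLB is
available; illustrative only, not part of any patch):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB hugepages */

int main(void)
{
	/* MAP_PRIVATE hugepage mapping: the reservation made here
	 * belongs to this process (the owner). */
	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (fork() == 0) {
		/*
		 * The mapping is instantiated in the child.  The child is
		 * not the reservation owner, so this fault does not draw
		 * from the reserved pool and may SIGBUS if no free
		 * hugepage is available.
		 */
		p[0] = 1;
		_exit(0);
	}
	wait(NULL);
	return 0;
}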

> Should we ensure them to allocate the last hugepage?
> They map a region with MAP_NORESERVE, so don't assume that their requests
> always succeed.

If the pages are available, people get cranky if it fails for no
apparent reason, MAP_NORESERVE or not. They get especially cranky if
it sometimes fails and sometimes doesn't due to a race condition.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2013-08-05 05:17:18

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm, hugetlb: protect reserved pages when softofflining requests the pages

Hillf Danton <[email protected]> writes:

> On Fri, Aug 2, 2013 at 12:17 AM, Aneesh Kumar K.V
> <[email protected]> wrote:
>> Hillf Danton <[email protected]> writes:
>>
...

>>>>> >> >> Well, why is it illegal to use reserved page here?
>>>>> >> >
>>>>> >> > If we use a reserved page here, other processes which are promised
>>>>> >> > enough hugepages cannot get them and can die. This is an
>>>>> >> > unexpected result for them.
>>>>> >> >
>>>>> >> But, how do you determine that a huge page is requested by a process
>>>>> >> that is not allowed to use reserved pages?
>>>>> >
>>>>> > A reserved page is just one for each address or file offset. If we need
>>>>> > to move this page, it means that the mapping has already used its own
>>>>> > reserved page, and this page is it. So we should not use another
>>>>> > reserved page for moving this page.
>>>>> >
>>>>> Hm, how do you determine "this page" is not buddy?
>>>>
>>>> If this page comes from the buddy, it doesn't matter. It implies that
>>>> this mapping cannot use the reserved page pool, because we always allocate
>>>> a page from the reserved page pool first.
>>>>
>>> A buddy page also implies, if the mapping can use reserved pages, that no
>>> reserved page was available when requested. Now we can try reserved
>>> page again.
>>
>> I didn't quite get that. My understanding is, the new page we are
>
> Neither did I ;)
>
>> allocating here for soft offline should not be allocated from the
>> reserve pool. If we do that, we may have an allocation failure
>> later for a request that had done page reservation. Now, to
>> avoid that, we make sure we have enough free pages outside the reserve
>> pool so that we can dequeue the huge page. If not, we use the buddy
>> allocator for the hugepage.
>>
> What if it is a mapping with HPAGE_RESV_OWNER set?
> Or can we block owner from using any page available here?

We are allocating a new hugepage to replace the already allocated
hugepage. We are in soft offline. That means the existing reservation,
if we have any, is already consumed. The request for the new page should
not come from the reserve pool.
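
For illustration, here is a minimal sketch of that check against v3.10's
mm/hugetlb.c. It is close in spirit to what patch 01 does, but it is a
reconstruction for this discussion, not text quoted from the patch:

struct page *alloc_huge_page_node(struct hstate *h, int nid)
{
	struct page *page = NULL;

	spin_lock(&hugetlb_lock);
	/*
	 * Only dequeue a pre-allocated hugepage for soft offline if that
	 * cannot eat into pages some other mapping has already reserved.
	 */
	if (h->free_huge_pages - h->resv_huge_pages > 0)
		page = dequeue_huge_page_node(h, nid);
	spin_unlock(&hugetlb_lock);

	/* Otherwise fall back to a freshly allocated (buddy) hugepage. */
	if (!page)
		page = alloc_buddy_huge_page(h, nid);

	return page;
}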

-aneesh

2013-08-05 07:36:39

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

> Any mapping that doesn't use the reserved pool, not just
> MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> then fork()s then the mapping is instantiated in the child, that will
> not draw from the reserved pool.
>
> > Should we ensure them to allocate the last hugepage?
> > They map a region with MAP_NORESERVE, so don't assume that their requests
> > always succeed.
>
> If the pages are available, people get cranky if it fails for no
> apparent reason, MAP_NORESERVE or not. They get especially cranky if
> it sometimes fails and sometimes doesn't due to a race condition.

Hello,

Hmm... Okay. I will try to implement another way to protect against the
race condition. Maybe it is best to use a table mutex :)
Anyway, please give me some time, guys.
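
For reference, a rough sketch of what such a mutex table could look like
(in the spirit of Davidlohr's patchset; the identifier names, key choice
and table size below are illustrative assumptions, not taken from any
posted patch):

#include <linux/hash.h>
#include <linux/hugetlb.h>
#include <linux/init.h>
#include <linux/log2.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define NUM_FAULT_MUTEXES	256

static struct mutex htlb_fault_mutex_table[NUM_FAULT_MUTEXES];

static void __init htlb_fault_mutex_init(void)
{
	int i;

	for (i = 0; i < NUM_FAULT_MUTEXES; i++)
		mutex_init(&htlb_fault_mutex_table[i]);
}

/*
 * Pick a mutex based on what identifies the faulting page: mapping and
 * file offset for shared mappings, mm and address for private ones, so
 * that only faults which can actually race serialize on the same lock.
 */
static u32 htlb_fault_mutex_hash(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 struct address_space *mapping,
				 pgoff_t idx, unsigned long address)
{
	unsigned long key;

	if (vma->vm_flags & VM_SHARED)
		key = (unsigned long)mapping ^ idx;
	else
		key = (unsigned long)mm ^ (address >> HPAGE_SHIFT);

	return hash_long(key, ilog2(NUM_FAULT_MUTEXES));
}

hugetlb_fault() would then lock htlb_fault_mutex_table[hash] around the
page lookup/allocation path instead of the single
hugetlb_instantiation_mutex. Note that a private mapping keys on the mm,
so after fork() the parent and the child take different mutexes.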

Thank you very much for pointing that out.

2013-08-07 00:18:50

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote:
> > Any mapping that doesn't use the reserved pool, not just
> > MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> > then fork()s then the mapping is instantiated in the child, that will
> > not draw from the reserved pool.
> >
> > > Should we ensure them to allocate the last hugepage?
> > > They map a region with MAP_NORESERVE, so don't assume that their requests
> > > always succeed.
> >
> > If the pages are available, people get cranky if it fails for no
> > apparent reason, MAP_NORESERVE or not. They get especially cranky if
> > it sometimes fails and sometimes doesn't due to a race condition.
>
> Hello,
>
> Hmm... Okay. I will try to implement another way to protect against the
> race condition. Maybe it is best to use a table mutex :)
> Anyway, please give me some time, guys.

So another option is to take the mutex table patchset for now as it
*does* improve things a great deal, then, when ready, get rid of the
instantiation lock all together.

Thanks,
Davidlohr

2013-08-07 01:32:59

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote:
> On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote:
> > > Any mapping that doesn't use the reserved pool, not just
> > > MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> > > then fork()s then the mapping is instantiated in the child, that will
> > > not draw from the reserved pool.
> > >
> > > > Should we ensure them to allocate the last hugepage?
> > > > They map a region with MAP_NORESERVE, so don't assume that their requests
> > > > always succeed.
> > >
> > > If the pages are available, people get cranky if it fails for no
> > > apparent reason, MAP_NORESERVE or not. They get especially cranky if
> > > it sometimes fails and sometimes doesn't due to a race condition.
> >
> > Hello,
> >
> > Hmm... Okay. I will try to implement another way to protect against the
> > race condition. Maybe it is best to use a table mutex :)
> > Anyway, please give me some time, guys.
>
> So another option is to take the mutex table patchset for now as it
> *does* improve things a great deal, then, when ready, get rid of the
> instantiation lock all together.

We still don't have a solid proposal for doing that. Joonsoo Kim's
patchset misses cases (non reserved mappings). I'm also not certain
there aren't a few edge cases which can lead to even reserved mappings
failing, and if that happens the patches will lead to livelock.

Getting rid of the instantiation mutex is a lot harder than it appears.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2013-08-07 01:38:54

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Wed, 2013-08-07 at 11:03 +1000, David Gibson wrote:
> On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote:
> > On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote:
> > > > Any mapping that doesn't use the reserved pool, not just
> > > > MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> > > > then fork()s then the mapping is instantiated in the child, that will
> > > > not draw from the reserved pool.
> > > >
> > > > > Should we ensure them to allocate the last hugepage?
> > > > > They map a region with MAP_NORESERVE, so don't assume that their requests
> > > > > always succeed.
> > > >
> > > > If the pages are available, people get cranky if it fails for no
> > > > apparent reason, MAP_NORESERVE or not. They get especially cranky if
> > > > it sometimes fails and sometimes doesn't due to a race condition.
> > >
> > > Hello,
> > >
> > > Hmm... Okay. I will try to implement another way to protect against the
> > > race condition. Maybe it is best to use a table mutex :)
> > > Anyway, please give me some time, guys.
> >
> > So another option is to take the mutex table patchset for now as it
> > *does* improve things a great deal, then, when ready, get rid of the
> > instantiation lock all together.
>
> We still don't have a solid proposal for doing that. Joonsoo Kim's
> patchset misses cases (non reserved mappings). I'm also not certain
> there aren't a few edge cases which can lead to even reserved mappings
> failing, and if that happens the patches will lead to livelock.
>

Exactly, which is why I suggest minimizing the lock contention until we
do have such a proposal.

> Getting rid of the instantiation mutex is a lot harder than it appears.
>

2013-08-07 09:18:31

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Tue, Aug 06, 2013 at 06:38:49PM -0700, Davidlohr Bueso wrote:
> On Wed, 2013-08-07 at 11:03 +1000, David Gibson wrote:
> > On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote:
> > > On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote:
> > > > > Any mapping that doesn't use the reserved pool, not just
> > > > > MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> > > > > then fork()s then the mapping is instantiated in the child, that will
> > > > > not draw from the reserved pool.
> > > > >
> > > > > > Should we ensure them to allocate the last hugepage?
> > > > > > They map a region with MAP_NORESERVE, so don't assume that their requests
> > > > > > always succeed.
> > > > >
> > > > > If the pages are available, people get cranky if it fails for no
> > > > > apparent reason, MAP_NORESERVE or not. They get especially cranky if
> > > > > it sometimes fails and sometimes doesn't due to a race condition.
> > > >
> > > > Hello,
> > > >
> > > > Hmm... Okay. I will try to implement another way to protect against the
> > > > race condition. Maybe it is best to use a table mutex :)
> > > > Anyway, please give me some time, guys.
> > >
> > > So another option is to take the mutex table patchset for now as it
> > > *does* improve things a great deal, then, when ready, get rid of the
> > > instantiation lock all together.
> >
> > We still don't have a solid proposal for doing that. Joonsoo Kim's
> > patchset misses cases (non reserved mappings). I'm also not certain
> > there aren't a few edge cases which can lead to even reserved mappings
> > failing, and if that happens the patches will lead to livelock.
> >
>
> Exactly, which is why I suggest minimizing the lock contention until we
> do have such a proposal.

Okay. My proposal is not complete and may still need much time.
And I'm not yet sure that my *retry* approach can eventually cover all
the race situations.

If you are in a hurry, I don't have a strong objection to your patches,
but, IMHO, we should go slowly, because this is not just a trivial change.
Hugetlb code is very subtle, so it is hard to confirm its solidness.
The following is the race problem that I found with those patches.

I assume that nr_free_hugepage is 2.

1. The parent process maps a 1-hugepage-sized region with MAP_PRIVATE.
2. The parent process writes something to this region, so a fault occurs.
3. The fault is handled.
4. fork()
5. The parent process writes something to this hugepage, so a cow-fault occurs.
6. While the parent allocates a new page and does copy_user_huge_page()
in the fault handler, the child process writes something to this hugepage,
so a cow-fault occurs. This access is not protected by the table mutex,
because the mm is different.
7. The child process dies, because there is no free hugepage.

If we had no race, the child process would not die,
because all we need is only 2 hugepages, one for the parent
and the other for the child.

Thanks.

2013-08-09 00:27:27

by David Gibson

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

On Wed, Aug 07, 2013 at 06:18:32PM +0900, Joonsoo Kim wrote:
> On Tue, Aug 06, 2013 at 06:38:49PM -0700, Davidlohr Bueso wrote:
> > On Wed, 2013-08-07 at 11:03 +1000, David Gibson wrote:
> > > On Tue, Aug 06, 2013 at 05:18:44PM -0700, Davidlohr Bueso wrote:
> > > > On Mon, 2013-08-05 at 16:36 +0900, Joonsoo Kim wrote:
> > > > > > Any mapping that doesn't use the reserved pool, not just
> > > > > > MAP_NORESERVE. For example, if a process makes a MAP_PRIVATE mapping,
> > > > > > then fork()s then the mapping is instantiated in the child, that will
> > > > > > not draw from the reserved pool.
> > > > > >
> > > > > > > Should we ensure them to allocate the last hugepage?
> > > > > > > They map a region with MAP_NORESERVE, so don't assume that their requests
> > > > > > > always succeed.
> > > > > >
> > > > > > If the pages are available, people get cranky if it fails for no
> > > > > > apparent reason, MAP_NORESERVE or not. They get especially cranky if
> > > > > > it sometimes fails and sometimes doesn't due to a race condition.
> > > > >
> > > > > Hello,
> > > > >
> > > > > Hmm... Okay. I will try to implement another way to protect against the
> > > > > race condition. Maybe it is best to use a table mutex :)
> > > > > Anyway, please give me some time, guys.
> > > >
> > > > So another option is to take the mutex table patchset for now as it
> > > > *does* improve things a great deal, then, when ready, get rid of the
> > > > instantiation lock all together.
> > >
> > > We still don't have a solid proposal for doing that. Joonsoo Kim's
> > > patchset misses cases (non reserved mappings). I'm also not certain
> > > there aren't a few edge cases which can lead to even reserved mappings
> > > failing, and if that happens the patches will lead to livelock.
> > >
> >
> > Exactly, which is why I suggest minimizing the lock contention until we
> > do have such a proposal.
>
> Okay. My proposal is not complete and may still need much time.
> And I'm not yet sure that my *retry* approach can eventually cover all
> the race situations.

Yes. The difficulty with retrying is knowing when it's safe to do
so. If you don't retry enough, you get SIGBUS when you should be able
to allocate; if you retry too much, you freeze up trying to find a
page that isn't there.

I once attempted an approach involving an atomic counter of the number
of "in flight" hugepages, only retrying when it's non zero. Working
out a safe ordering for all the updates to get all the cases right
made my brain melt though, and I never got it working.
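
For what it's worth, a hedged sketch of that counter idea (the names are
made up for illustration; this is not code from any posted patch, and it
glosses over exactly the update-ordering problems mentioned above):

#include <linux/atomic.h>
#include <linux/types.h>

/* Hugepages dequeued by some fault handler but not yet mapped. */
static atomic_t hugetlb_in_flight = ATOMIC_INIT(0);

/* Call right after a hugepage is taken from the pool for a fault. */
static inline void hugetlb_page_in_flight(void)
{
	atomic_inc(&hugetlb_in_flight);
}

/* Call once that page has been mapped, or returned to the pool. */
static inline void hugetlb_page_landed(void)
{
	atomic_dec(&hugetlb_in_flight);
}

/*
 * Call when a fault fails to get a page: retry only while some other
 * fault still holds a page in flight (that page may come back to the
 * pool); otherwise the pool really is empty and SIGBUS is the right
 * answer.
 */
static inline bool hugetlb_should_retry_fault(void)
{
	return atomic_read(&hugetlb_in_flight) != 0;
}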

> If you are in a hurry, I don't have a strong objection to your patches,
> but, IMHO, we should go slowly, because this is not just a trivial change.
> Hugetlb code is very subtle, so it is hard to confirm its solidness.
> The following is the race problem that I found with those patches.
>
> I assume that nr_free_hugepage is 2.
>
> 1. The parent process maps a 1-hugepage-sized region with MAP_PRIVATE.
> 2. The parent process writes something to this region, so a fault occurs.
> 3. The fault is handled.
> 4. fork()
> 5. The parent process writes something to this hugepage, so a cow-fault occurs.
> 6. While the parent allocates a new page and does copy_user_huge_page()
> in the fault handler, the child process writes something to this hugepage,
> so a cow-fault occurs. This access is not protected by the table mutex,
> because the mm is different.
> 7. The child process dies, because there is no free hugepage.
>
> If we had no race, the child process would not die,
> because all we need is only 2 hugepages, one for the parent
> and the other for the child.

Ouch, good catch. Unlike the existing form of the race, I doubt this
one has been encountered in the wild, but it shows how subtle this is.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2013-08-09 09:37:30

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm, hugetlb: retry if we fail to allocate a hugepage with use_reserve

> I once attempted an approach involving an atomic counter of the number
> of "in flight" hugepages, only retrying when it's non zero. Working
> out a safe ordering for all the updates to get all the cases right
> made my brain melt though, and I never got it working.

I sent v2 a few seconds ago. My new approach is similar to yours.
Could you review my patches and save my brain from melting? :)

Thanks.