Only change in this revision is the fix to the self-discovered
issue in region_chg(). Functional and stress tests passing.
Full changelog below.
As suggested during the RFC process, tests have been proposed to
libhugetlbfs as described at:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
fallocate(2) man page modifications are also necessary to specify
that fallocate for hugetlbfs only operates on whole pages. This
change will be submitted once the code has stabilized and been
proposed for merging.
hugetlbfs is used today by applications that want a high degree of
control over huge page usage. Often, large hugetlbfs files are used
to map a large number of huge pages into the application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses. The fallocate() system
call provides an interface for preallocation and hole punching within
files. This patch set adds fallocate functionality to hugetlbfs.
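To make the interface concrete, here is a minimal userspace sketch of
both operations; the /dev/hugepages mount point, the file name, and the
2MB huge page size are assumptions for illustration, not part of this
series:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed default huge page size */

int main(void)
{
        int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Preallocate four huge pages; offset/len must cover whole pages. */
        if (fallocate(fd, 0, 0, 4 * HPAGE_SIZE))
                perror("preallocate");

        /* Punch out the middle two pages; the freed pages return to the
         * subpool or global pools. PUNCH_HOLE requires KEEP_SIZE. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      HPAGE_SIZE, 2 * HPAGE_SIZE))
                perror("hole punch");

        close(fd);
        unlink("/dev/hugepages/demo");
        return 0;
}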
v3:
Fixed issue with region_chg to recheck whether there are sufficient
entries in the cache after acquiring the lock.
v2:
Fixed leak in resv_map_release discovered by Hillf Danton.
Used LONG_MAX as indicator of truncate function for region_del.
v1:
Add a cache of region descriptors to the resv_map for use by
region_add in case hole punch deletes entries necessary for
a successful operation.
RFC v4:
Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
in mmotm
Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
existing code changed to use new interfaces as suggested by Naoya
fallocate preallocation code cleaned up and made simpler
Modified alloc_huge_page to handle special case where allocation is
for a hole punched area with spool reserves
RFC v3:
Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
in current code
fallocate allocation and hole punch is synchronized with page
faults via existing mutex table
hole punch uses existing hugetlb_vmtruncate_list instead of more
generic unmap_mapping_range for unmapping
Error handling for the case when region_del() fails
RFC v2:
Addressed alignment and error handling issues noticed by Hillf Danton
New region_del() routine for region tracking/resv_map of ranges
Fixed several issues found during more extensive testing
Error handling in region_del() when kmalloc() fails still needs
to be addressed
madvise remove support remains
Mike Kravetz (10):
mm/hugetlb: add cache of descriptors to resv_map for region_add
mm/hugetlb: add region_del() to delete a specific range of entries
mm/hugetlb: expose hugetlb fault mutex for use by fallocate
hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
hugetlbfs: truncate_hugepages() takes a range of pages
mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
hugetlbfs: New huge_add_to_page_cache helper routine
hugetlbfs: add hugetlbfs_fallocate()
mm: madvise allow remove operation for hugetlbfs
fs/hugetlbfs/inode.c | 281 +++++++++++++++++++++++++++++---
include/linux/hugetlb.h | 17 +-
mm/hugetlb.c | 423 ++++++++++++++++++++++++++++++++++++++----------
mm/madvise.c | 2 +-
4 files changed, 619 insertions(+), 104 deletions(-)
--
2.1.0
fallocate hole punch will want to remove a specific range of
pages. When pages are removed, their associated entries in
the region/reserve map will also be removed. This will break
an assumption in the region_chg/region_add calling sequence.
If a new region descriptor must be allocated, it is done as
part of the region_chg processing. In this way, region_add
can not fail because it does not need to attempt an allocation.
To prepare for fallocate hole punch, create a "cache" of
descriptors that can be used by region_add if necessary.
region_chg will ensure there are sufficient entries in the
cache. It will be necessary to track the number of in progress
add operations to know a sufficient number of descriptors
reside in the cache. A new routine region_abort is added to
adjust this in progress count when add operations are aborted.
vma_abort_reservation is also added for callers creating
reservations with vma_needs_reservation/vma_commit_reservation.
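The invariant is that the cache always holds at least one descriptor
per in-progress add. A small userspace model of the bookkeeping
follows; the names mirror the patch, but this is an illustration only,
not kernel code:

#include <stdio.h>

struct resv_model {
        long adds_in_progress;
        long rgn_cache_count;
};

/* region_chg: the only phase allowed to fail, so it tops up the
 * cache until it covers every in-progress add operation. */
static void model_chg(struct resv_model *m)
{
        m->adds_in_progress++;
        while (m->rgn_cache_count < m->adds_in_progress)
                m->rgn_cache_count++;   /* stands in for kmalloc() */
}

/* region_add: must not fail; consumes a cached descriptor only if a
 * racing region_del() removed the region it expected to expand. */
static void model_add(struct resv_model *m, int need_descriptor)
{
        if (need_descriptor)
                m->rgn_cache_count--;
        m->adds_in_progress--;
}

/* region_abort: the operation was abandoned after region_chg. */
static void model_abort(struct resv_model *m)
{
        m->adds_in_progress--;
}

int main(void)
{
        struct resv_model m = { 0, 1 };

        model_chg(&m);          /* reserve: guarantees a descriptor */
        model_add(&m, 1);       /* commit after a hole punch race */
        model_chg(&m);          /* reserve */
        model_abort(&m);        /* e.g. subpool limit hit, no commit */
        printf("in progress %ld, cached %ld\n",
               m.adds_in_progress, m.rgn_cache_count);
        return 0;
}

The model shows why region_add cannot fail: region_chg, the only
phase allowed to fail, replenished the cache first.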
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 3 +
mm/hugetlb.c | 169 ++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 153 insertions(+), 19 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f94..667cf44 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,9 @@ struct resv_map {
struct kref refs;
spinlock_t lock;
struct list_head regions;
+ long adds_in_progress;
+ struct list_head rgn_cache;
+ long rgn_cache_count;
};
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..241d16d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -240,11 +240,14 @@ struct file_region {
/*
* Add the huge page range represented by [f, t) to the reserve
- * map. Existing regions will be expanded to accommodate the
- * specified range. We know only existing regions need to be
- * expanded, because region_add is only called after region_chg
- * with the same range. If a new file_region structure must
- * be allocated, it is done in region_chg.
+ * map. In the normal case, existing regions will be expanded
+ * to accommodate the specified range. Sufficient regions should
+ * exist for expansion due to the previous call to region_chg
+ * with the same range. However, it is possible that region_del
+ * could have been called after region_chg and modified the map
+ * in such a way that no region exists to be expanded. In this
+ * case, pull a region descriptor from the cache associated with
+ * the map and use that for the new range.
*
* Return the number of new huge pages added to the map. This
* number is greater than or equal to zero.
@@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
if (f <= rg->to)
break;
+ if (&rg->link == head || t < rg->from) {
+ /*
+ * No region exists which can be expanded to include the
+ * specified range. Pull a region descriptor from the
+ * cache, and use it for this range.
+ */
+ VM_BUG_ON(!resv->rgn_cache_count);
+
+ resv->rgn_cache_count--;
+ nrg = list_first_entry(&resv->rgn_cache, struct file_region,
+ link);
+ list_del(&nrg->link);
+
+ nrg->from = f;
+ nrg->to = t;
+ list_add(&nrg->link, rg->link.prev);
+
+ add += t - f;
+ goto out_locked;
+ }
+
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
@@ -294,6 +318,8 @@ static long region_add(struct resv_map *resv, long f, long t)
add += t - nrg->to; /* Added to end of region */
nrg->to = t;
+out_locked:
+ resv->adds_in_progress--;
spin_unlock(&resv->lock);
VM_BUG_ON(add < 0);
return add;
@@ -312,11 +338,16 @@ static long region_add(struct resv_map *resv, long f, long t)
* so that the subsequent region_add call will have all the
* regions it needs and will not fail.
*
+ * Upon entry, region_chg will also examine the cache of
+ * region descriptors associated with the map. If there are
+ * not enough descriptors cached, one will be allocated
+ * for the in progress add operation.
+ *
* Returns the number of huge pages that need to be added
* to the existing reservation map for the range [f, t).
* This number is greater or equal to zero. -ENOMEM is
- * returned if a new file_region structure is needed and can
- * not be allocated.
+ * returned if a new file_region structure or cache entry
+ * is needed and can not be allocated.
*/
static long region_chg(struct resv_map *resv, long f, long t)
{
@@ -326,6 +357,31 @@ static long region_chg(struct resv_map *resv, long f, long t)
retry:
spin_lock(&resv->lock);
+retry_locked:
+ resv->adds_in_progress++;
+
+ /*
+ * Check for sufficient descriptors in the cache to accommodate
+ * the number of in progress add operations.
+ */
+ if (resv->adds_in_progress > resv->rgn_cache_count) {
+ struct file_region *trg;
+
+ VM_BUG_ON(resv->adds_in_progress - resv->rgn_cache_count > 1);
+ /* Must drop lock to allocate a new descriptor. */
+ resv->adds_in_progress--;
+ spin_unlock(&resv->lock);
+
+ trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+ if (!trg)
+ return -ENOMEM;
+
+ spin_lock(&resv->lock);
+ list_add(&trg->link, &resv->rgn_cache);
+ resv->rgn_cache_count++;
+ goto retry_locked;
+ }
+
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -336,6 +392,7 @@ retry:
* size such that we can guarantee to record the reservation. */
if (&rg->link == head || t < rg->from) {
if (!nrg) {
+ resv->adds_in_progress--;
spin_unlock(&resv->lock);
nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
if (!nrg)
@@ -385,6 +442,25 @@ out_nrg:
}
/*
+ * Abort the in progress add operation. The adds_in_progress field
+ * of the resv_map keeps track of the operations in progress between
+ * calls to region_chg and region_add. Operations are sometimes
+ * aborted after the call to region_chg. In such cases, region_abort
+ * is called to decrement the adds_in_progress counter.
+ *
+ * NOTE: The range arguments [f, t) are not needed or used in this
+ * routine. They are kept to make reading the calling code easier as
+ * arguments will match the associated region_chg call.
+ */
+static void region_abort(struct resv_map *resv, long f, long t)
+{
+ spin_lock(&resv->lock);
+ VM_BUG_ON(!resv->rgn_cache_count);
+ resv->adds_in_progress--;
+ spin_unlock(&resv->lock);
+}
+
+/*
* Truncate the reserve map at index 'end'. Modify/truncate any
* region which contains end. Delete any regions past end.
* Return the number of huge pages removed from the map.
@@ -544,22 +620,44 @@ static void set_vma_private_data(struct vm_area_struct *vma,
struct resv_map *resv_map_alloc(void)
{
struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
- if (!resv_map)
+ struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL);
+
+ if (!resv_map || !rg) {
+ kfree(resv_map);
+ kfree(rg);
return NULL;
+ }
kref_init(&resv_map->refs);
spin_lock_init(&resv_map->lock);
INIT_LIST_HEAD(&resv_map->regions);
+ resv_map->adds_in_progress = 0;
+
+ INIT_LIST_HEAD(&resv_map->rgn_cache);
+ list_add(&rg->link, &resv_map->rgn_cache);
+ resv_map->rgn_cache_count = 1;
+
return resv_map;
}
void resv_map_release(struct kref *ref)
{
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
+ struct list_head *head = &resv_map->rgn_cache;
+ struct file_region *rg, *trg;
/* Clear out any active regions before we release the map. */
region_truncate(resv_map, 0);
+
+ /* ... and any entries left in the cache */
+ list_for_each_entry_safe(rg, trg, head, link) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+
+ VM_BUG_ON(resv_map->adds_in_progress);
+
kfree(resv_map);
}
@@ -1473,16 +1571,18 @@ static void return_unused_surplus_pages(struct hstate *h,
}
}
+
/*
- * vma_needs_reservation and vma_commit_reservation are used by the huge
- * page allocation routines to manage reservations.
+ * vma_needs_reservation, vma_commit_reservation and vma_abort_reservation
+ * are used by the huge page allocation routines to manage reservations.
*
* vma_needs_reservation is called to determine if the huge page at addr
* within the vma has an associated reservation. If a reservation is
* needed, the value 1 is returned. The caller is then responsible for
* managing the global reservation and subpool usage counts. After
* the huge page has been allocated, vma_commit_reservation is called
- * to add the page to the reservation map.
+ * to add the page to the reservation map. If the reservation must be
+ * aborted instead of committed, vma_abort_reservation is called.
*
* In the normal case, vma_commit_reservation returns the same value
* as the preceding vma_needs_reservation call. The only time this
@@ -1490,9 +1590,14 @@ static void return_unused_surplus_pages(struct hstate *h,
* is the responsibility of the caller to notice the difference and
* take appropriate action.
*/
+enum vma_resv_mode {
+ VMA_NEEDS_RESV,
+ VMA_COMMIT_RESV,
+ VMA_ABORT_RESV,
+};
static long __vma_reservation_common(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr,
- bool commit)
+ enum vma_resv_mode mode)
{
struct resv_map *resv;
pgoff_t idx;
@@ -1503,10 +1608,20 @@ static long __vma_reservation_common(struct hstate *h,
return 1;
idx = vma_hugecache_offset(h, vma, addr);
- if (commit)
- ret = region_add(resv, idx, idx + 1);
- else
+ switch (mode) {
+ case VMA_NEEDS_RESV:
ret = region_chg(resv, idx, idx + 1);
+ break;
+ case VMA_COMMIT_RESV:
+ ret = region_add(resv, idx, idx + 1);
+ break;
+ case VMA_ABORT_RESV:
+ region_abort(resv, idx, idx + 1);
+ ret = 0;
+ break;
+ default:
+ BUG();
+ }
if (vma->vm_flags & VM_MAYSHARE)
return ret;
@@ -1517,13 +1632,19 @@ static long __vma_reservation_common(struct hstate *h,
static long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- return __vma_reservation_common(h, vma, addr, false);
+ return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV);
}
static long vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- return __vma_reservation_common(h, vma, addr, true);
+ return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV);
+}
+
+static void vma_abort_reservation(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ (void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
}
static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -1549,8 +1670,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (chg < 0)
return ERR_PTR(-ENOMEM);
if (chg || avoid_reserve)
- if (hugepage_subpool_get_pages(spool, 1) < 0)
+ if (hugepage_subpool_get_pages(spool, 1) < 0) {
+ vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
+ }
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -1596,6 +1719,7 @@ out_uncharge_cgroup:
out_subpool_put:
if (chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
+ vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
@@ -3236,11 +3360,14 @@ retry:
* any allocations necessary to record that reservation occur outside
* the spinlock.
*/
- if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
if (vma_needs_reservation(h, vma, address) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
+ /* Just decrements count, does not deallocate */
+ vma_abort_reservation(h, vma, address);
+ }
ptl = huge_pte_lockptr(h, mm, ptep);
spin_lock(ptl);
@@ -3387,6 +3514,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out_mutex;
}
+ /* Just decrements count, does not deallocate */
+ vma_abort_reservation(h, vma, address);
if (!(vma->vm_flags & VM_MAYSHARE))
pagecache_page = hugetlbfs_pagecache_page(h,
@@ -3726,6 +3855,8 @@ int hugetlb_reserve_pages(struct inode *inode,
}
return 0;
out_err:
+ if (!vma || vma->vm_flags & VM_MAYSHARE)
+ region_abort(resv_map, from, to);
if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
kref_put(&resv_map->refs, resv_map_release);
return ret;
--
2.1.0
fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset. region_del() will provide
this same functionality if the end of region is specified as LONG_MAX.
Hence, region_del() can replace region_truncate().
Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing LONG_MAX as end of range will never experience
this error and do not need to deal with error handling. Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.
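A userspace sketch of the four cases region_del() must consider when
deleting [f, t) from an existing region [from, to); only the split case
needs a new descriptor, which is why t == LONG_MAX (no split possible)
can never return -ENOMEM. Illustration only, not kernel code:

#include <stdio.h>

static void classify(long f, long t, long from, long to)
{
        if (to <= f || from >= t)
                printf("[%ld,%ld) untouched\n", from, to);
        else if (f > from && t < to)    /* only case needing a descriptor */
                printf("[%ld,%ld) split into [%ld,%ld) and [%ld,%ld)\n",
                       from, to, from, f, t, to);
        else if (f <= from && t >= to)
                printf("[%ld,%ld) removed entirely\n", from, to);
        else if (f <= from)
                printf("[%ld,%ld) trimmed to [%ld,%ld)\n", from, to, t, to);
        else
                printf("[%ld,%ld) trimmed to [%ld,%ld)\n", from, to, from, f);
}

int main(void)
{
        classify(3, 5, 0, 10);  /* split */
        classify(0, 10, 2, 8);  /* remove entire region */
        classify(0, 5, 3, 10);  /* trim beginning */
        classify(5, 20, 0, 10); /* trim end */
        return 0;
}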
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 73 insertions(+), 26 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 241d16d..a5c8b3c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -461,43 +461,90 @@ static void region_abort(struct resv_map *resv, long f, long t)
}
/*
- * Truncate the reserve map at index 'end'. Modify/truncate any
- * region which contains end. Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map. If the
+ * t parameter is LONG_MAX, this indicates that ALL regions after f
+ * should be deleted. Locate the regions which intersect [f, t)
+ * and either trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more. In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated. If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == LONG_MAX, then we will never split
+ * a region and possibly return -ENOMEM. Callers specifying
+ * t == LONG_MAX do not need to check for -ENOMEM error.
*/
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
{
struct list_head *head = &resv->regions;
struct file_region *rg, *trg;
- long chg = 0;
+ struct file_region *nrg = NULL;
+ long del = 0;
+retry:
spin_lock(&resv->lock);
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (end <= rg->to)
+ list_for_each_entry_safe(rg, trg, head, link) {
+ if (rg->to <= f)
+ continue;
+ if (rg->from >= t)
break;
- if (&rg->link == head)
- goto out;
- /* If we are in the middle of a region then adjust it. */
- if (end > rg->from) {
- chg = rg->to - end;
- rg->to = end;
- rg = list_entry(rg->link.next, typeof(*rg), link);
- }
+ if (f > rg->from && t < rg->to) { /* Must split region */
+ /*
+ * Check for an entry in the cache before dropping
+ * lock and attempting allocation.
+ */
+ if (!nrg &&
+ resv->rgn_cache_count > resv->adds_in_progress) {
+ nrg = list_first_entry(&resv->rgn_cache,
+ struct file_region,
+ link);
+ list_del(&nrg->link);
+ resv->rgn_cache_count--;
+ }
- /* Drop any remaining regions. */
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
+ if (!nrg) {
+ spin_unlock(&resv->lock);
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ goto retry;
+ }
+
+ del += t - f;
+
+ /* New entry for end of split region */
+ nrg->from = t;
+ nrg->to = rg->to;
+ INIT_LIST_HEAD(&nrg->link);
+
+ /* Original entry is trimmed */
+ rg->to = f;
+
+ list_add(&nrg->link, &rg->link);
+ nrg = NULL;
break;
- chg += rg->to - rg->from;
- list_del(&rg->link);
- kfree(rg);
+ }
+
+ if (f <= rg->from && t >= rg->to) { /* Remove entire region */
+ del += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ continue;
+ }
+
+ if (f <= rg->from) { /* Trim beginning of region */
+ del += t - rg->from;
+ rg->from = t;
+ } else { /* Trim end of region */
+ del += rg->to - f;
+ rg->to = f;
+ }
}
-out:
spin_unlock(&resv->lock);
- return chg;
+ kfree(nrg);
+ return del;
}
/*
@@ -648,7 +695,7 @@ void resv_map_release(struct kref *ref)
struct file_region *rg, *trg;
/* Clear out any active regions before we release the map. */
- region_truncate(resv_map, 0);
+ region_del(resv_map, 0, LONG_MAX);
/* ... and any entries left in the cache */
list_for_each_entry_safe(rg, trg, head, link) {
@@ -3871,7 +3918,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
long gbl_reserve;
if (resv_map)
- chg = region_truncate(resv_map, offset);
+ chg = region_del(resv_map, offset, LONG_MAX);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
2.1.0
hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table). fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages. Expose interfaces so that fallocate operations
can be synchronized with page faults. Minor name changes to
be more consistent with other global hugetlb symbols.
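The scheme is a plain hashed lock table. A userspace sketch of the
pattern follows; the hash here is deliberately simplified, whereas the
kernel routine mixes in mm, vma, mapping, index, and address:

#include <pthread.h>
#include <stdio.h>

#define NUM_FAULT_MUTEXES 8     /* the kernel sizes this from CPU count */

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

/* Hash to one lock per (mapping, index) so a fault, a fallocate
 * allocation, and a hole punch on the same huge page serialize,
 * while operations on different pages proceed in parallel. */
static unsigned int fault_mutex_hash(const void *mapping, unsigned long idx)
{
        return ((unsigned long)mapping ^ idx) % NUM_FAULT_MUTEXES;
}

int main(void)
{
        int dummy_mapping;      /* stands in for struct address_space * */
        unsigned int hash;
        int i;

        for (i = 0; i < NUM_FAULT_MUTEXES; i++)
                pthread_mutex_init(&fault_mutex_table[i], NULL);

        hash = fault_mutex_hash(&dummy_mapping, 42);
        pthread_mutex_lock(&fault_mutex_table[hash]);
        /* ... instantiate or remove page index 42 here ... */
        pthread_mutex_unlock(&fault_mutex_table[hash]);
        printf("page 42 hashed to lock %u\n", hash);
        return 0;
}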
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 20 ++++++++++----------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 667cf44..933da39 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -88,6 +88,11 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void free_huge_page(struct page *page);
+extern struct mutex *hugetlb_fault_mutex_table;
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address);
#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a5c8b3c..52c2801 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
* prevent spurious OOMs when the hugepage pool is fully utilized.
*/
static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -2482,7 +2482,7 @@ static void __exit hugetlb_exit(void)
}
kobject_put(hugepages_kobj);
- kfree(htlb_fault_mutex_table);
+ kfree(hugetlb_fault_mutex_table);
}
module_exit(hugetlb_exit);
@@ -2515,12 +2515,12 @@ static int __init hugetlb_init(void)
#else
num_fault_mutexes = 1;
#endif
- htlb_fault_mutex_table =
+ hugetlb_fault_mutex_table =
kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
- BUG_ON(!htlb_fault_mutex_table);
+ BUG_ON(!hugetlb_fault_mutex_table);
for (i = 0; i < num_fault_mutexes; i++)
- mutex_init(&htlb_fault_mutex_table[i]);
+ mutex_init(&hugetlb_fault_mutex_table[i]);
return 0;
}
module_init(hugetlb_init);
@@ -3454,7 +3454,7 @@ backout_unlocked:
}
#ifdef CONFIG_SMP
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3479,7 +3479,7 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
* For uniprocesor systems we always use a single mutex, so just
* return 0 and avoid the hashing overhead.
*/
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3527,8 +3527,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
- mutex_lock(&htlb_fault_mutex_table[hash]);
+ hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
@@ -3613,7 +3613,7 @@ out_ptl:
put_page(pagecache_page);
}
out_mutex:
- mutex_unlock(&htlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
--
2.1.0
fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range. If end is 0, this indicates all pages after start
should be unmapped. This is the same as the existing truncate
functionality. Modify existing callers to add 0 as end of range.
Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().
Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0cf74df..ed40f56 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -349,11 +349,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
}
static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
{
struct vm_area_struct *vma;
- vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+ /*
+ * end == 0 indicates that the entire range after
+ * start should be unmapped.
+ */
+ vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
unsigned long v_offset;
/*
@@ -362,13 +366,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
* which overlap the truncated area starting at pgoff,
* and no vma on a 32-bit arch can span beyond the 4GB.
*/
- if (vma->vm_pgoff < pgoff)
- v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+ if (vma->vm_pgoff < start)
+ v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
else
v_offset = 0;
- unmap_hugepage_range(vma, vma->vm_start + v_offset,
- vma->vm_end, NULL);
+ if (end) {
+ end = ((end - start) << PAGE_SHIFT) +
+ vma->vm_start + v_offset;
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+ } else
+ end = vma->vm_end;
+
+ unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
}
}
@@ -384,7 +395,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_size_write(inode, offset);
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
- hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
truncate_hugepages(inode, offset);
return 0;
--
2.1.0
Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of LLONG_MAX is passed,
the current "truncate" functionality is maintained. Existing
callers are modified to pass LLONG_MAX as end of range. By keying
off end == LLONG_MAX, the routine behaves differently for truncate
and hole punch. Page removal is now synchronized with page
allocation via faults by using the fault mutex table. The hole
punch case can experience the rare region_del error and must
handle it accordingly.
Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.
Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages(). To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().
Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.
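A small sketch of how one routine serves both operations by keying
off the end value (illustration only, not the actual kernel code):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

static void remove_range(long long lstart, long long lend)
{
        bool truncate_op = (lend == LLONG_MAX);

        if (truncate_op)
                printf("truncate: remove all pages from %lld on\n", lstart);
        else
                printf("hole punch: remove pages in [%lld, %lld)\n",
                       lstart, lend);
}

int main(void)
{
        remove_range(4096, LLONG_MAX);  /* truncate path */
        remove_range(4096, 8192);       /* hole punch path */
        return 0;
}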
Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 98 ++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/hugetlb.h | 4 +-
mm/hugetlb.c | 40 ++++++++++++++++++--
3 files changed, 128 insertions(+), 14 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed40f56..a974e4b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -293,26 +293,61 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
return -EINVAL;
}
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
{
ClearPageDirty(page);
ClearPageUptodate(page);
delete_from_page_cache(page);
}
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch. There are subtle differences in operation for each case.
+ *
+ * truncation is indicated by end of range being LLONG_MAX
+ * In this case, we first scan the range and release found pages.
+ * After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
+ * maps and global counts.
+ * hole punch is indicated if end is not LLONG_MAX
+ * In the hole punch case we scan the range and release found pages.
+ * Only when releasing a page is the associated region/reserve map
+ * deleted. The region/reserve maps for ranges without associated
+ * pages are not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not LLONG_MAX, this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+ loff_t lend)
{
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
+ const pgoff_t end = lend >> huge_page_shift(h);
+ struct vm_area_struct pseudo_vma;
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
+ long lookup_nr = PAGEVEC_SIZE;
+ bool truncate_op = (lend == LLONG_MAX);
+ memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+ pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pagevec_init(&pvec, 0);
next = start;
- while (1) {
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ while (next < end) {
+ /*
+ * Make sure to never grab more pages than we
+ * might possibly need.
+ */
+ if (end - next < lookup_nr)
+ lookup_nr = end - next;
+
+ /*
+ * This pagevec_lookup() may return pages past 'end',
+ * so we must check for page->index > end.
+ */
+ if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
if (next == start)
break;
next = start;
@@ -321,26 +356,69 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
+ u32 hash;
+
+ hash = hugetlb_fault_mutex_hash(h, current->mm,
+ &pseudo_vma,
+ mapping, next, 0);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
lock_page(page);
+ if (page->index >= end) {
+ unlock_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ next = end; /* we are done */
+ break;
+ }
+
+ /*
+ * If page is mapped, it was faulted in after being
+ * unmapped. Do nothing in this race case. In the
+ * normal case page is not mapped.
+ */
+ if (!page_mapped(page)) {
+ bool rsv_on_error = !PagePrivate(page);
+ /*
+ * We must free the huge page and remove
+ * from page cache (remove_huge_page) BEFORE
+ * removing the region/reserve map
+ * (hugetlb_unreserve_pages). In rare out
+ * of memory conditions, removal of the
+ * region/reserve map could fail. Before
+ * free'ing the page, note PagePrivate which
+ * is used in case of error.
+ */
+ remove_huge_page(page);
+ freed++;
+ if (!truncate_op) {
+ if (unlikely(hugetlb_unreserve_pages(
+ inode, next,
+ next + 1, 1)))
+ hugetlb_fix_reserve_counts(
+ inode, rsv_on_error);
+ }
+ }
+
if (page->index > next)
next = page->index;
+
++next;
- truncate_huge_page(page);
unlock_page(page);
- freed++;
+
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
}
- BUG_ON(!lstart && mapping->nrpages);
- hugetlb_unreserve_pages(inode, start, freed);
+
+ if (truncate_op)
+ (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
}
static void hugetlbfs_evict_inode(struct inode *inode)
{
struct resv_map *resv_map;
- truncate_hugepages(inode, 0);
+ remove_inode_hugepages(inode, 0, LLONG_MAX);
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -397,7 +475,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
- truncate_hugepages(inode, offset);
+ remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 933da39..e7825c9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -83,11 +83,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+ long freed);
int dequeue_hwpoisoned_huge_page(struct page *page);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
extern struct mutex *hugetlb_fault_mutex_table;
u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 52c2801..def39e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -548,6 +548,28 @@ retry:
}
/*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page. The huge page itself was free'ed
+ * and removed from the page cache. This routine will adjust the subpool
+ * usage count, and the global reserve count if needed. By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+ struct hugepage_subpool *spool = subpool_inode(inode);
+ long rsv_adjust;
+
+ rsv_adjust = hugepage_subpool_get_pages(spool, 1);
+ if (restore_reserve && rsv_adjust) {
+ struct hstate *h = hstate_inode(inode);
+
+ hugetlb_acct_memory(h, 1);
+ }
+}
+
+/*
* Count and return the number of huge pages in the reserve map
* that intersect with the range [f, t).
*/
@@ -3909,7 +3931,8 @@ out_err:
return ret;
}
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+ long freed)
{
struct hstate *h = hstate_inode(inode);
struct resv_map *resv_map = inode_resv_map(inode);
@@ -3917,8 +3940,17 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
struct hugepage_subpool *spool = subpool_inode(inode);
long gbl_reserve;
- if (resv_map)
- chg = region_del(resv_map, offset, LONG_MAX);
+ if (resv_map) {
+ chg = region_del(resv_map, start, end);
+ /*
+ * region_del() can fail in the rare case where a region
+ * must be split and another region descriptor can not be
+ * allocated. If end == LONG_MAX, it will not fail.
+ */
+ if (chg < 0)
+ return chg;
+ }
+
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
@@ -3929,6 +3961,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
*/
gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
hugetlb_acct_memory(h, -gbl_reserve);
+
+ return 0;
}
#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
--
2.1.0
In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings. However, this will not be the
case with fallocate hole punch. When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation). vma_has_reserves is passed "chg" which
indicates whether or not a region/reserve map is present. Use
this to determine if reserves are actually present or were removed
via hole punch.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index def39e3..f72cb96 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -801,9 +801,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, long chg)
return 0;
}
- /* Shared mappings always use reserves */
- if (vma->vm_flags & VM_MAYSHARE)
- return 1;
+ if (vma->vm_flags & VM_MAYSHARE) {
+ /*
+ * We know VM_NORESERVE is not set. Therefore, there SHOULD
+ * be a region map for all pages. The only situation where
+ * there is no region map is if a hole was punched via
+ * fallocate. In this case, there really are no reserves to
+ * use. This situation is indicated if chg != 0.
+ */
+ if (chg)
+ return 0;
+ else
+ return 1;
+ }
/*
* Only the process that called mmap() has reserves for
--
2.1.0
Areas hole punched by fallocate will not have entries in the
region/reserve map. However, shared mappings with min_size subpool
reservations may still have reserved pages. alloc_huge_page needs
to handle this special case and do the proper accounting.
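A userspace model of the resulting two-level check, where map_chg
reflects the region/reserve map and gbl_chg says whether the global
pool must be charged; the subpool is reduced to a flag here, so this
is an illustration only:

#include <stdio.h>

/* map_chg: 0 if the reserve map holds a reservation, 1 otherwise.
 * subpool_get: 0 if the subpool covers the page from its minimum-size
 * reserves, 1 if the global pool must be charged. */
static int gbl_charge(int map_chg, int subpool_get, int avoid_reserve)
{
        int gbl_chg = map_chg;

        if (map_chg || avoid_reserve) {
                gbl_chg = subpool_get;
                if (avoid_reserve)
                        gbl_chg = 1;    /* skip even subpool reserves */
        }
        return gbl_chg;
}

int main(void)
{
        /* hole punched area (no map entry), but subpool reserves cover it */
        printf("punched + spool reserves -> gbl_chg %d\n", gbl_charge(1, 0, 0));
        /* existing reservation: nothing to charge */
        printf("reserved                 -> gbl_chg %d\n", gbl_charge(0, 0, 0));
        /* no reserves anywhere: take from the global free pool */
        printf("no reserves              -> gbl_chg %d\n", gbl_charge(1, 1, 0));
        return 0;
}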
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 15 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f72cb96..5a2ee06 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1732,34 +1732,58 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- long chg, commit;
+ long map_chg, map_commit;
+ long gbl_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg;
idx = hstate_index(h);
/*
- * Processes that did not create the mapping will have no
- * reserves and will not have accounted against subpool
- * limit. Check that the subpool limit can be made before
- * satisfying the allocation MAP_NORESERVE mappings may also
- * need pages and subpool limit allocated allocated if no reserve
- * mapping overlaps.
+ * Examine the region/reserve map to determine if the process
+ * has a reservation for the page to be allocated. A return
+ * code of zero indicates a reservation exists (no change).
*/
- chg = vma_needs_reservation(h, vma, addr);
- if (chg < 0)
+ map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+ if (map_chg < 0)
return ERR_PTR(-ENOMEM);
- if (chg || avoid_reserve)
- if (hugepage_subpool_get_pages(spool, 1) < 0) {
+
+ /*
+ * Processes that did not create the mapping will have no
+ * reserves as indicated by the region/reserve map. Check
+ * that the allocation will not exceed the subpool limit.
+ * Allocations for MAP_NORESERVE mappings also need to be
+ * checked against any subpool limit.
+ */
+ if (map_chg || avoid_reserve) {
+ gbl_chg = hugepage_subpool_get_pages(spool, 1);
+ if (gbl_chg < 0) {
vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
+ /*
+ * Even though there was no reservation in the region/reserve
+ * map, there could be reservations associated with the
+ * subpool that can be used. This would be indicated if the
+ * return value of hugepage_subpool_get_pages() is zero.
+ * However, if avoid_reserve is specified we still avoid even
+ * the subpool reservations.
+ */
+ if (avoid_reserve)
+ gbl_chg = 1;
+ }
+
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret)
goto out_subpool_put;
spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+ /*
+ * gbl_chg is passed to indicate whether or not a page must be taken
+ * from the global free pool (global change). gbl_chg == 0 indicates
+ * a reservation exists for the allocation.
+ */
+ page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1775,8 +1799,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
set_page_private(page, (unsigned long)spool);
- commit = vma_commit_reservation(h, vma, addr);
- if (unlikely(chg > commit)) {
+ map_commit = vma_commit_reservation(h, vma, addr);
+ if (unlikely(map_chg > map_commit)) {
/*
* The page was added to the reservation map between
* vma_needs_reservation and vma_commit_reservation.
@@ -1796,7 +1820,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_subpool_put:
- if (chg || avoid_reserve)
+ if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
--
2.1.0
Currently, there is only a single place where hugetlbfs pages are
added to the page cache. The new fallocate code will be adding a second
one, so break the functionality out into its own helper.
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 2 ++
mm/hugetlb.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 9 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e7825c9..657ef26 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -333,6 +333,8 @@ struct huge_bootmem_page {
struct page *alloc_huge_page_node(struct hstate *h, int nid);
struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t idx);
/* arch callback */
int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a2ee06..aedc5e7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3374,6 +3374,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
return page != NULL;
}
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t idx)
+{
+ struct inode *inode = mapping->host;
+ struct hstate *h = hstate_inode(inode);
+ int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+ if (err)
+ return err;
+ ClearPagePrivate(page);
+
+ spin_lock(&inode->i_lock);
+ inode->i_blocks += blocks_per_huge_page(h);
+ spin_unlock(&inode->i_lock);
+ return 0;
+}
+
static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3421,21 +3438,13 @@ retry:
set_page_huge_active(page);
if (vma->vm_flags & VM_MAYSHARE) {
- int err;
- struct inode *inode = mapping->host;
-
- err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+ int err = huge_add_to_page_cache(page, mapping, idx);
if (err) {
put_page(page);
if (err == -EEXIST)
goto retry;
goto out;
}
- ClearPagePrivate(page);
-
- spin_lock(&inode->i_lock);
- inode->i_blocks += blocks_per_huge_page(h);
- spin_unlock(&inode->i_lock);
} else {
lock_page(page);
if (unlikely(anon_vma_prepare(vma))) {
--
2.1.0
This is based on the shmem version, but it has diverged quite
a bit. We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.
What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped. This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate(). MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.
hugetlbfs fallocate only operates on whole huge pages.
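Because only whole huge pages are handled, hole punch rounds the hole
inward while preallocation rounds the range outward. A small sketch of
the arithmetic, assuming a 2MB huge page size:

#include <stdio.h>

#define HPAGE_SIZE (2UL << 20)  /* assumed 2MB huge page size */

static unsigned long round_up_h(unsigned long x)
{
        return (x + HPAGE_SIZE - 1) / HPAGE_SIZE * HPAGE_SIZE;
}

static unsigned long round_down_h(unsigned long x)
{
        return x / HPAGE_SIZE * HPAGE_SIZE;
}

int main(void)
{
        unsigned long offset = 3UL << 20, len = 7UL << 20; /* 3MB..10MB */

        /* hole punch: round start up, end down -> [4MB, 10MB); the
         * partial page between 3MB and 4MB is left intact */
        printf("punch    [%lu, %lu)\n",
               round_up_h(offset), round_down_h(offset + len));

        /* preallocate: round start down, end up -> [2MB, 10MB) */
        printf("prealloc [%lu, %lu)\n",
               round_down_h(offset), round_up_h(offset + len));
        return 0;
}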
Based-on code-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/hugetlb.h | 3 +
mm/hugetlb.c | 2 +-
3 files changed, 161 insertions(+), 2 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a974e4b..6e565a4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
#include <linux/thread_info.h>
#include <asm/current.h>
#include <linux/sched.h> /* remove ASAP */
+#include <linux/falloc.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/file.h>
@@ -479,6 +480,160 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
return 0;
}
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+ struct hstate *h = hstate_inode(inode);
+ loff_t hpage_size = huge_page_size(h);
+ loff_t hole_start, hole_end;
+
+ /*
+ * For hole punch round up the beginning offset of the hole and
+ * round down the end.
+ */
+ hole_start = round_up(offset, hpage_size);
+ hole_end = round_down(offset + len, hpage_size);
+
+ if (hole_end > hole_start) {
+ struct address_space *mapping = inode->i_mapping;
+
+ mutex_lock(&inode->i_mutex);
+ i_mmap_lock_write(mapping);
+ if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+ hugetlb_vmdelete_list(&mapping->i_mmap,
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT);
+ i_mmap_unlock_write(mapping);
+ remove_inode_hugepages(inode, hole_start, hole_end);
+ mutex_unlock(&inode->i_mutex);
+ }
+
+ return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t len)
+{
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);
+ struct vm_area_struct pseudo_vma;
+ struct mm_struct *mm = current->mm;
+ loff_t hpage_size = huge_page_size(h);
+ unsigned long hpage_shift = huge_page_shift(h);
+ pgoff_t start, index, end;
+ int error;
+ u32 hash;
+
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ return -EOPNOTSUPP;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE)
+ return hugetlbfs_punch_hole(inode, offset, len);
+
+ /*
+ * Default preallocate case.
+ * For this range, start is rounded down and end is rounded up
+ * as well as being converted to page offsets.
+ */
+ start = offset >> hpage_shift;
+ end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+ mutex_lock(&inode->i_mutex);
+
+ /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+ error = inode_newsize_ok(inode, offset + len);
+ if (error)
+ goto out;
+
+ /*
+ * Initialize a pseudo vma that just contains the policy used
+ * when allocating the huge pages. The actual policy field
+ * (vm_policy) is determined based on the index in the loop below.
+ */
+ memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+ pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+ pseudo_vma.vm_file = file;
+
+ for (index = start; index < end; index++) {
+ /*
+ * This is supposed to be the vaddr where the page is being
+ * faulted in, but we have no vaddr here.
+ */
+ struct page *page;
+ unsigned long addr;
+ int avoid_reserve = 0;
+
+ cond_resched();
+
+ /*
+ * fallocate(2) manpage permits EINTR; we may have been
+ * interrupted because we are using up too much memory.
+ */
+ if (signal_pending(current)) {
+ error = -EINTR;
+ break;
+ }
+
+ /* Get policy based on index */
+ pseudo_vma.vm_policy =
+ mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+ index);
+
+ /* addr is the offset within the file (zero based) */
+ addr = index * hpage_size;
+
+ /* mutex taken here, fault path and hole punch */
+ hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
+ index, addr);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+ /* See if already present in mapping to avoid alloc/free */
+ page = find_get_page(mapping, index);
+ if (page) {
+ put_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ mpol_cond_put(pseudo_vma.vm_policy);
+ continue;
+ }
+
+ /* Allocate page and add to page cache */
+ page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+ mpol_cond_put(pseudo_vma.vm_policy);
+ if (IS_ERR(page)) {
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ error = PTR_ERR(page);
+ goto out;
+ }
+ clear_huge_page(page, addr, pages_per_huge_page(h));
+ __SetPageUptodate(page);
+ error = huge_add_to_page_cache(page, mapping, index);
+ if (unlikely(error)) {
+ put_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ goto out;
+ }
+
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+ /*
+ * unlock_page because locked by add_to_page_cache();
+ * put_page due to the reference from alloc_huge_page()
+ */
+ unlock_page(page);
+ put_page(page);
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+ i_size_write(inode, offset + len);
+ inode->i_ctime = CURRENT_TIME;
+ spin_lock(&inode->i_lock);
+ inode->i_private = NULL;
+ spin_unlock(&inode->i_lock);
+out:
+ mutex_unlock(&inode->i_mutex);
+ return error;
+}
+
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = d_inode(dentry);
@@ -790,7 +945,8 @@ const struct file_operations hugetlbfs_file_operations = {
.mmap = hugetlbfs_file_mmap,
.fsync = noop_fsync,
.get_unmapped_area = hugetlb_get_unmapped_area,
- .llseek = default_llseek,
+ .llseek = default_llseek,
+ .fallocate = hugetlbfs_fallocate,
};
static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 657ef26..386dcbf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -330,6 +330,8 @@ struct huge_bootmem_page {
#endif
};
+struct page *alloc_huge_page(struct vm_area_struct *vma,
+ unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_node(struct hstate *h, int nid);
struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
@@ -483,6 +485,7 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
#define alloc_huge_page_node(h, nid) NULL
#define alloc_huge_page_noerr(v, a, r) NULL
#define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aedc5e7..a775a65 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1726,7 +1726,7 @@ static void vma_abort_reservation(struct hstate *h,
(void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
}
-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
struct hugepage_subpool *spool = subpool_vma(vma);
--
2.1.0
Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.
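A minimal userspace sketch of the newly permitted call; the
/dev/hugepages mount point and the 2MB huge page size are assumptions
for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed huge page size */

int main(void)
{
        int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0644);
        char *p;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ftruncate(fd, 4 * HPAGE_SIZE)) {
                perror("ftruncate");
                return 1;
        }
        p = mmap(NULL, 4 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        p[0] = 1;
        p[2 * HPAGE_SIZE] = 1;  /* fault in pages 0 and 2 */

        /* Previously -EINVAL for hugetlbfs; now punches the backing
         * pages out of the middle of the mapping via fallocate(). */
        if (madvise(p + HPAGE_SIZE, 2 * HPAGE_SIZE, MADV_REMOVE))
                perror("madvise(MADV_REMOVE)");

        munmap(p, 4 * HPAGE_SIZE);
        close(fd);
        unlink("/dev/hugepages/demo");
        return 0;
}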
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 70ce0d4..a235367 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -468,7 +468,7 @@ static long madvise_remove(struct vm_area_struct *vma,
*prev = NULL; /* tell sys_madvise we drop mmap_sem */
- if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+ if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
f = vma->vm_file;
--
2.1.0
On Sun, Jul 12, 2015 at 09:20:58PM -0700, Mike Kravetz wrote:
> Only change in this revision is the fix to the self-discovered
> issue in region_chg(). Functional and stress tests passing.
> Full changelog below.
>
> As suggested during the RFC process, tests have been proposed to
> libhugetlbfs as described at:
> http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
> fallocate(2) man page modifications are also necessary to specify
> that fallocate for hugetlbfs only operates on whole pages. This
> change will be submitted once the code has stabilized and been
> proposed for merging.
>
> hugetlbfs is used today by applications that want a high degree of
> control over huge page usage. Often, large hugetlbfs files are used
> to map a large number of huge pages into the application processes.
> The applications know when page ranges within these large files will
> no longer be used, and ideally would like to release them back to
> the subpool or global pools for other uses. The fallocate() system
> call provides an interface for preallocation and hole punching within
> files. This patch set adds fallocate functionality to hugetlbfs.
>
> v3:
> Fixed issue with region_chg to recheck whether there are sufficient
> entries in the cache after acquiring the lock.
> v2:
> Fixed leak in resv_map_release discovered by Hillf Danton.
> Used LONG_MAX as indicator of truncate function for region_del.
> v1:
> Add a cache of region descriptors to the resv_map for use by
> region_add in case hole punch deletes entries necessary for
> a successful operation.
> RFC v4:
> Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
> in mmotm
> Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
> Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
> existing code changed to use new interfaces as suggested by Naoya
> fallocate preallocation code cleaned up and made simpler
> Modified alloc_huge_page to handle special case where allocation is
> for a hole punched area with spool reserves
> RFC v3:
> Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
> in current code
> fallocate allocation and hole punch is synchronized with page
> faults via existing mutex table
> hole punch uses existing hugetlb_vmtruncate_list instead of more
> generic unmap_mapping_range for unmapping
> Error handling for the case when region_del() fails
> RFC v2:
> Addressed alignment and error handling issues noticed by Hillf Danton
> New region_del() routine for region tracking/resv_map of ranges
> Fixed several issues found during more extensive testing
> Error handling in region_del() when kmalloc() fails still needs
> to be addressed
> madvise remove support remains
>
> Mike Kravetz (10):
> mm/hugetlb: add cache of descriptors to resv_map for region_add
> mm/hugetlb: add region_del() to delete a specific range of entries
> mm/hugetlb: expose hugetlb fault mutex for use by fallocate
> hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
> hugetlbfs: truncate_hugepages() takes a range of pages
> mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
> mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
> hugetlbfs: New huge_add_to_page_cache helper routine
> hugetlbfs: add hugetlbfs_fallocate()
> mm: madvise allow remove operation for hugetlbfs
>
> fs/hugetlbfs/inode.c | 281 +++++++++++++++++++++++++++++---
> include/linux/hugetlb.h | 17 +-
> mm/hugetlb.c | 423 ++++++++++++++++++++++++++++++++++++++----------
> mm/madvise.c | 2 +-
> 4 files changed, 619 insertions(+), 104 deletions(-)
I've read through this series and it looks good to me.
I'll send a comment later for 1/10, but it's kind of nitpicks.
Reviewed-by: Naoya Horiguchi <[email protected]>
Thanks,
Naoya Horiguchi
On Sun, Jul 12, 2015 at 09:20:59PM -0700, Mike Kravetz wrote:
> fallocate hole punch will want to remove a specific range of
> pages. When pages are removed, their associated entries in
> the region/reserve map will also be removed. This will break
> an assumption in the region_chg/region_add calling sequence.
> If a new region descriptor must be allocated, it is done as
> part of the region_chg processing. In this way, region_add
> can not fail because it does not need to attempt an allocation.
>
> To prepare for fallocate hole punch, create a "cache" of
> descriptors that can be used by region_add if necessary.
> region_chg will ensure there are sufficient entries in the
> cache. It will be necessary to track the number of in progress
> add operations to know a sufficient number of descriptors
> reside in the cache. A new routine region_abort is added to
> adjust this in progress count when add operations are aborted.
> vma_abort_reservation is also added for callers creating
> reservations with vma_needs_reservation/vma_commit_reservation.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> include/linux/hugetlb.h | 3 +
> mm/hugetlb.c | 169 ++++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 153 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d891f94..667cf44 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -35,6 +35,9 @@ struct resv_map {
> struct kref refs;
> spinlock_t lock;
> struct list_head regions;
> + long adds_in_progress;
> + struct list_head rgn_cache;
> + long rgn_cache_count;
> };
> extern struct resv_map *resv_map_alloc(void);
> void resv_map_release(struct kref *ref);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a8c3087..241d16d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -240,11 +240,14 @@ struct file_region {
>
> /*
> * Add the huge page range represented by [f, t) to the reserve
> - * map. Existing regions will be expanded to accommodate the
> - * specified range. We know only existing regions need to be
> - * expanded, because region_add is only called after region_chg
> - * with the same range. If a new file_region structure must
> - * be allocated, it is done in region_chg.
> + * map. In the normal case, existing regions will be expanded
> + * to accommodate the specified range. Sufficient regions should
> + * exist for expansion due to the previous call to region_chg
> + * with the same range. However, it is possible that region_del
> + * could have been called after region_chg and modified the map
> + * in such a way that no region exists to be expanded. In this
> + * case, pull a region descriptor from the cache associated with
> + * the map and use that for the new range.
> *
> * Return the number of new huge pages added to the map. This
> * number is greater than or equal to zero.
> @@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
> if (f <= rg->to)
> break;
>
> + if (&rg->link == head || t < rg->from) {
> + /*
> + * No region exists which can be expanded to include the
> + * specified range. Pull a region descriptor from the
> + * cache, and use it for this range.
> + */
This comment mentions this if-block, not the VM_BUG_ON below, so it had
better be put above the if-line.
> + VM_BUG_ON(!resv->rgn_cache_count);
resv->rgn_cache_count <= 0 might be safer.
...
> @@ -3236,11 +3360,14 @@ retry:
> * any allocations necessary to record that reservation occur outside
> * the spinlock.
> */
> - if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> if (vma_needs_reservation(h, vma, address) < 0) {
> ret = VM_FAULT_OOM;
> goto backout_unlocked;
> }
> + /* Just decrements count, does not deallocate */
> + vma_abort_reservation(h, vma, address);
> + }
This is not an "abort reservation" operation, but you use the "abort
reservation" routine, which might be confusing and make future maintenance
hard. I think this should be done in a simplified variant of
vma_commit_reservation() (maybe just an alias of your
vma_abort_reservation()) or a fast path in vma_commit_reservation().
Thanks,
Naoya Horiguchi
>
> ptl = huge_pte_lockptr(h, mm, ptep);
> spin_lock(ptl);
> @@ -3387,6 +3514,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> ret = VM_FAULT_OOM;
> goto out_mutex;
> }
> + /* Just decrements count, does not deallocate */
> + vma_abort_reservation(h, vma, address);
>
> if (!(vma->vm_flags & VM_MAYSHARE))
> pagecache_page = hugetlbfs_pagecache_page(h,
> @@ -3726,6 +3855,8 @@ int hugetlb_reserve_pages(struct inode *inode,
> }
> return 0;
> out_err:
> + if (!vma || vma->vm_flags & VM_MAYSHARE)
> + region_abort(resv_map, from, to);
> if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> kref_put(&resv_map->refs, resv_map_release);
> return ret;
> --
> 2.1.0
On 07/17/2015 02:02 AM, Naoya Horiguchi wrote:
> On Sun, Jul 12, 2015 at 09:20:59PM -0700, Mike Kravetz wrote:
>> fallocate hole punch will want to remove a specific range of
>> pages. When pages are removed, their associated entries in
>> the region/reserve map will also be removed. This will break
>> an assumption in the region_chg/region_add calling sequence.
>> If a new region descriptor must be allocated, it is done as
>> part of the region_chg processing. In this way, region_add
>> can not fail because it does not need to attempt an allocation.
>>
>> To prepare for fallocate hole punch, create a "cache" of
>> descriptors that can be used by region_add if necessary.
>> region_chg will ensure there are sufficient entries in the
>> cache. It will be necessary to track the number of in progress
>> add operations to know a sufficient number of descriptors
>> reside in the cache. A new routine region_abort is added to
>> adjust this in progress count when add operations are aborted.
>> vma_abort_reservation is also added for callers creating
>> reservations with vma_needs_reservation/vma_commit_reservation.
>>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>> include/linux/hugetlb.h | 3 +
>> mm/hugetlb.c | 169 ++++++++++++++++++++++++++++++++++++++++++------
>> 2 files changed, 153 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index d891f94..667cf44 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -35,6 +35,9 @@ struct resv_map {
>> struct kref refs;
>> spinlock_t lock;
>> struct list_head regions;
>> + long adds_in_progress;
>> + struct list_head rgn_cache;
>> + long rgn_cache_count;
>> };
>> extern struct resv_map *resv_map_alloc(void);
>> void resv_map_release(struct kref *ref);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index a8c3087..241d16d 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -240,11 +240,14 @@ struct file_region {
>>
>> /*
>> * Add the huge page range represented by [f, t) to the reserve
>> - * map. Existing regions will be expanded to accommodate the
>> - * specified range. We know only existing regions need to be
>> - * expanded, because region_add is only called after region_chg
>> - * with the same range. If a new file_region structure must
>> - * be allocated, it is done in region_chg.
>> + * map. In the normal case, existing regions will be expanded
>> + * to accommodate the specified range. Sufficient regions should
>> + * exist for expansion due to the previous call to region_chg
>> + * with the same range. However, it is possible that region_del
>> + * could have been called after region_chg and modified the map
>> + * in such a way that no region exists to be expanded. In this
>> + * case, pull a region descriptor from the cache associated with
>> + * the map and use that for the new range.
>> *
>> * Return the number of new huge pages added to the map. This
>> * number is greater than or equal to zero.
>> @@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long t)
>> if (f <= rg->to)
>> break;
>>
>> + if (&rg->link == head || t < rg->from) {
>> + /*
>> + * No region exists which can be expanded to include the
>> + * specified range. Pull a region descriptor from the
>> + * cache, and use it for this range.
>> + */
>
> This comment mentions this if-block, not the VM_BUG_ON below, so it had
> better be put above the if-line.
OK, I will move and make a minor modification to the comment.
>
>> + VM_BUG_ON(!resv->rgn_cache_count);
>
> resv->rgn_cache_count <= 0 might be safer.
Sure.
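Combining both suggestions, the block would end up looking roughly
like this (a sketch, not the final patch):

	/*
	 * If no region exists which can be expanded to include the
	 * specified range, pull a region descriptor from the cache
	 * and use it for this range.
	 */
	if (&rg->link == head || t < rg->from) {
		VM_BUG_ON(resv->rgn_cache_count <= 0);
		/* ... take a descriptor off resv->rgn_cache ... */
	}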
> ...
>> @@ -3236,11 +3360,14 @@ retry:
>> * any allocations necessary to record that reservation occur outside
>> * the spinlock.
>> */
>> - if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
>> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>> if (vma_needs_reservation(h, vma, address) < 0) {
>> ret = VM_FAULT_OOM;
>> goto backout_unlocked;
>> }
>> + /* Just decrements count, does not deallocate */
>> + vma_abort_reservation(h, vma, address);
>> + }
>
> This is not an "abort reservation" operation, but you use the "abort
> reservation" routine, which might be confusing and make future maintenance
> hard. I think this should be done in a simplified variant of
> vma_commit_reservation() (maybe just an alias of your
> vma_abort_reservation()) or a fast path in vma_commit_reservation().
I am struggling a bit with the names of these routines. The
routines in question are:
vma_needs_reservation - This is a wrapper for region_chg(), so the
return value is the number of regions needed for the page.
Since there is only one page, the routine effectively
becomes a boolean. Hence the name "needs".
vma_commit_reservation - This is a wrapper for region_add(). It
must be called after a prior call to vma_needs_reservation
and after actual allocation of the page.
We need a way to handle the case where vma_needs_reservation has
been called, but the page allocation is not successful. I chose
the name vma_abort_reservation, but as noted (even in my comments)
it is not an actual abort.
I am not sure if you are suggesting vma_commit_reservation() should
handle this as a special case. I think a separately named routine which
indicates the end of the reservation/allocation process would be
easier to understand.
What about changing the name vma_abort_reservation() to
vma_end_reservation()? This would indicate that the reservation/
allocation process is ended.
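Conceptually, the pairing would then read as a begin/finish protocol,
something like the following simplified sketch (the real call sites
are spread across the fault paths and alloc_huge_page(), so this only
illustrates the intended lifetime of a reservation attempt):

	/* begin: wrapper for region_chg(), guarantees a descriptor */
	if (vma_needs_reservation(h, vma, address) < 0)
		return VM_FAULT_OOM;

	page = alloc_huge_page(vma, address, 0);
	if (IS_ERR(page)) {
		/* no page was instantiated: just end the sequence */
		vma_end_reservation(h, vma, address);
		return VM_FAULT_OOM;
	}

	/* success: wrapper for region_add(), records the reservation */
	vma_commit_reservation(h, vma, address);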
> Thanks,
> Naoya Horiguchi
Thank you for your reviews.
--
Mike Kravetz
>
>>
>> ptl = huge_pte_lockptr(h, mm, ptep);
>> spin_lock(ptl);
>> @@ -3387,6 +3514,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>> ret = VM_FAULT_OOM;
>> goto out_mutex;
>> }
>> + /* Just decrements count, does not deallocate */
>> + vma_abort_reservation(h, vma, address);
>>
>> if (!(vma->vm_flags & VM_MAYSHARE))
>> pagecache_page = hugetlbfs_pagecache_page(h,
>> @@ -3726,6 +3855,8 @@ int hugetlb_reserve_pages(struct inode *inode,
>> }
>> return 0;
>> out_err:
>> + if (!vma || vma->vm_flags & VM_MAYSHARE)
>> + region_abort(resv_map, from, to);
>> if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>> kref_put(&resv_map->refs, resv_map_release);
>> return ret;
>> --
>> 2.1.0
>
> Only change in this revision is the fix to the self-discovered
> issue in region_chg(). Functional and stress tests passing.
> Full changelog below.
>
> As suggested during the RFC process, tests have been proposed to
> libhugetlbfs as described at:
> http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
> fallocate(2) man page modifications are also necessary to specify
> that fallocate for hugetlbfs only operates on whole pages. This
> change will be submitted once the code has stabilized and been
> proposed for merging.
>
> hugetlbfs is used today by applications that want a high degree of
> control over huge page usage. Often, large hugetlbfs files are used
> to map a large number of huge pages into the application processes.
> The applications know when page ranges within these large files will
> no longer be used, and ideally would like to release them back to
> the subpool or global pools for other uses. The fallocate() system
> call provides an interface for preallocation and hole punching within
> files. This patch set adds fallocate functionality to hugetlbfs.
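As an illustration, a typical application sequence would look like the
following (example only; the mount point, file name, and 2MB huge page
size are assumptions):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <linux/falloc.h>	/* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */

	#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB huge pages */

	int main(void)
	{
		/* assumes a hugetlbfs mount at /mnt/huge */
		int fd = open("/mnt/huge/region", O_CREAT | O_RDWR, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* preallocate 16 huge pages from the pool */
		if (fallocate(fd, 0, 0, 16 * HPAGE_SIZE))
			perror("preallocate");

		/*
		 * Pages 4-7 are no longer needed: punch a hole to return
		 * them to the subpool or global pool.  Offset and length
		 * must be huge page aligned.
		 */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      4 * HPAGE_SIZE, 4 * HPAGE_SIZE))
			perror("hole punch");

		close(fd);
		return 0;
	}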
>
> v3:
> Fixed issue with region_chg to recheck if there are sufficient
> entries in the cache after acquiring lock.
> v2:
> Fixed leak in resv_map_release discovered by Hillf Danton.
> Used LONG_MAX as indicator of truncate function for region_del.
> v1:
> Add a cache of region descriptors to the resv_map for use by
> region_add in case hole punch deletes entries necessary for
> a successful operation.
> RFC v4:
> Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
> in mmotm
> Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
> Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
> existing code changed to use new interfaces as suggested by Naoya
> fallocate preallocation code cleaned up and made simpler
> Modified alloc_huge_page to handle special case where allocation is
> for a hole punched area with spool reserves
> RFC v3:
> Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
> in current code
> fallocate allocation and hole punch are synchronized with page
> faults via the existing mutex table
> hole punch uses the existing hugetlb_vmtruncate_list instead of the
> more generic unmap_mapping_range for unmapping
> Error handling for the case when region_del() fails
> RFC v2:
> Addressed alignment and error handling issues noticed by Hillf Danton
> New region_del() routine for region tracking/resv_map of ranges
> Fixed several issues found during more extensive testing
> Error handling in region_del() when kmalloc() fails still needs
> to be addressed
> madvise remove support remains
>
> Mike Kravetz (10):
> mm/hugetlb: add cache of descriptors to resv_map for region_add
> mm/hugetlb: add region_del() to delete a specific range of entries
> mm/hugetlb: expose hugetlb fault mutex for use by fallocate
> hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
> hugetlbfs: truncate_hugepages() takes a range of pages
> mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
> mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
> hugetlbfs: New huge_add_to_page_cache helper routine
> hugetlbfs: add hugetlbfs_fallocate()
> mm: madvise allow remove operation for hugetlbfs
>
> fs/hugetlbfs/inode.c | 281 +++++++++++++++++++++++++++++---
> include/linux/hugetlb.h | 17 +-
> mm/hugetlb.c | 423 ++++++++++++++++++++++++++++++++++++++----------
> mm/madvise.c | 2 +-
> 4 files changed, 619 insertions(+), 104 deletions(-)
>
Acked-by: Hillf Danton <[email protected]>
On Mon, Jul 20, 2015 at 10:50:12AM -0700, Mike Kravetz wrote:
...
> > ...
> >> @@ -3236,11 +3360,14 @@ retry:
> >> * any allocations necessary to record that reservation occur outside
> >> * the spinlock.
> >> */
> >> - if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> >> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> >> if (vma_needs_reservation(h, vma, address) < 0) {
> >> ret = VM_FAULT_OOM;
> >> goto backout_unlocked;
> >> }
> >> + /* Just decrements count, does not deallocate */
> >> + vma_abort_reservation(h, vma, address);
> >> + }
> >
> > This is not an "abort reservation" operation, but you use the "abort
> > reservation" routine, which might be confusing and make future maintenance
> > hard. I think this should be done in a simplified variant of
> > vma_commit_reservation() (maybe just an alias of your
> > vma_abort_reservation()) or a fast path in vma_commit_reservation().
>
> I am struggling a bit with the names of these routines. The
> routines in question are:
>
> vma_needs_reservation - This is a wrapper for region_chg(), so the
> return value is the number of regions needed for the page.
> Since there is only one page, the routine effectively
> becomes a boolean. Hence the name "needs".
>
> vma_commit_reservation - This is a wrapper for region_add(). It
> must be called after a prior call to vma_needs_reservation
> and after actual allocation of the page.
>
> We need a way to handle the case where vma_needs_reservation has
> been called, but the page allocation is not successful. I chose
> the name vma_abort_reservation, but as noted (even in my comments)
> it is not an actual abort.
>
> I am not sure if you are suggesting vma_commit_reservation() should
> handle this as a special case. I think a separately named routine which
> indicates the end of the reservation/allocation process would be
> easier to understand.
>
> What about changing the name vma_abort_reservation() to
> vma_end_reservation()? This would indicate that the reservation/
> allocation process is ended.
OK, vma_end_reservation() sounds nice to me.
> > Thanks,
> > Naoya Horiguchi
>
> Thank you for your reviews.
You're welcome :)
Naoya Horiguchi