2015-07-21 18:10:39

by Mike Kravetz

Subject: [PATCH v4 00/10] hugetlbfs: add fallocate support

Changes in this revision address the minor comment and function name
issues brought up by Naoya Horiguchi. Patch set is also rebased on
current "mmotm/since-4.1". This revision does not introduce any
functional changes.

As suggested during the RFC process, tests have been proposed to
libhugetlbfs as described at:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
fallocate(2) man page modifications are also necessary to specify
that fallocate for hugetlbfs only operates on whole pages. This
change will be submitted once the code has stabilized and been
proposed for merging.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage. Often, large hugetlbfs files are used
to map a large number of huge pages into the application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses. The fallocate() system
call provides an interface for preallocation and hole punching within
files. This patch set adds fallocate functionality to hugetlbfs.

v4:
Renamed vma_abort_reservation() to vma_end_reservation().
v3:
Fixed issue with region_chg to recheck if there are sufficient
entries in the cache after acquiring lock.
v2:
Fixed leak in resv_map_release discovered by Hillf Danton.
Used LONG_MAX as indicator of truncate function for region_del.
v1:
Add a cache of region descriptors to the resv_map for use by
region_add in case hole punch deletes entries necessary for
a successful operation.
RFC v4:
Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
in mmotm
Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
existing code changed to use new interfaces as suggested by Naoya
fallocate preallocation code cleaned up and made simpler
Modified alloc_huge_page to handle special case where allocation is
for a hole punched area with spool reserves
RFC v3:
Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
in current code
fallocate allocation and hole punch is synchronized with page
faults via existing mutex table
hole punch uses existing hugetlb_vmtruncate_list instead of more
generic unmap_mapping_range for unmapping
Error handling for the case when region_del() fails
RFC v2:
Addressed alignment and error handling issues noticed by Hillf Danton
New region_del() routine for region tracking/resv_map of ranges
Fixed several issues found during more extensive testing
Error handling in region_del() when kmalloc() fails stills needs
to be addressed
madvise remove support remains

v3 patch series:
Reviewed-by: Naoya Horiguchi <[email protected]>
Acked-by: Hillf Danton <[email protected]>

Mike Kravetz (10):
mm/hugetlb: add cache of descriptors to resv_map for region_add
mm/hugetlb: add region_del() to delete a specific range of entries
mm/hugetlb: expose hugetlb fault mutex for use by fallocate
hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
hugetlbfs: truncate_hugepages() takes a range of pages
mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
hugetlbfs: New huge_add_to_page_cache helper routine
hugetlbfs: add hugetlbfs_fallocate()
mm: madvise allow remove operation for hugetlbfs

fs/hugetlbfs/inode.c | 281 ++++++++++++++++++++++++++++++--
include/linux/hugetlb.h | 17 +-
mm/hugetlb.c | 424 ++++++++++++++++++++++++++++++++++++++----------
mm/madvise.c | 2 +-
4 files changed, 621 insertions(+), 103 deletions(-)

--
2.1.0


2015-07-21 18:11:08

by Mike Kravetz

Subject: [PATCH v4 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add

fallocate hole punch will want to remove a specific range of
pages. When pages are removed, their associated entries in
the region/reserve map will also be removed. This will break
an assumption in the region_chg/region_add calling sequence.
If a new region descriptor must be allocated, it is done as
part of the region_chg processing. In this way, region_add
can not fail because it does not need to attempt an allocation.

To prepare for fallocate hole punch, create a "cache" of
descriptors that can be used by region_add if necessary.
region_chg will ensure there are sufficient entries in the
cache. The number of in-progress add operations is tracked so
that it is known whether a sufficient number of descriptors
reside in the cache. A new routine, region_abort, is added to
adjust this in-progress count when add operations are aborted.
vma_abort_reservation is also added for callers creating
reservations with vma_needs_reservation/vma_commit_reservation.

Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 3 +
mm/hugetlb.c | 170 ++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 154 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f94..667cf44 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,9 @@ struct resv_map {
struct kref refs;
spinlock_t lock;
struct list_head regions;
+ long adds_in_progress;
+ struct list_head rgn_cache;
+ long rgn_cache_count;
};
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51ae41d..c3923a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -240,11 +240,14 @@ struct file_region {

/*
* Add the huge page range represented by [f, t) to the reserve
- * map. Existing regions will be expanded to accommodate the
- * specified range. We know only existing regions need to be
- * expanded, because region_add is only called after region_chg
- * with the same range. If a new file_region structure must
- * be allocated, it is done in region_chg.
+ * map. In the normal case, existing regions will be expanded
+ * to accommodate the specified range. Sufficient regions should
+ * exist for expansion due to the previous call to region_chg
+ * with the same range. However, it is possible that region_del
+ * could have been called after region_chg and modified the map
+ * in such a way that no region exists to be expanded. In this
+ * case, pull a region descriptor from the cache associated with
+ * the map and use that for the new range.
*
* Return the number of new huge pages added to the map. This
* number is greater than or equal to zero.
@@ -261,6 +264,28 @@ static long region_add(struct resv_map *resv, long f, long t)
if (f <= rg->to)
break;

+ /*
+ * If no region exists which can be expanded to include the
+ * specified range, the list must have been modified by an
+ * interleaving call to region_del(). Pull a region descriptor
+ * from the cache and use it for this range.
+ */
+ if (&rg->link == head || t < rg->from) {
+ VM_BUG_ON(resv->rgn_cache_count <= 0);
+
+ resv->rgn_cache_count--;
+ nrg = list_first_entry(&resv->rgn_cache, struct file_region,
+ link);
+ list_del(&nrg->link);
+
+ nrg->from = f;
+ nrg->to = t;
+ list_add(&nrg->link, rg->link.prev);
+
+ add += t - f;
+ goto out_locked;
+ }
+
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
@@ -294,6 +319,8 @@ static long region_add(struct resv_map *resv, long f, long t)
add += t - nrg->to; /* Added to end of region */
nrg->to = t;

+out_locked:
+ resv->adds_in_progress--;
spin_unlock(&resv->lock);
VM_BUG_ON(add < 0);
return add;
@@ -312,11 +339,16 @@ static long region_add(struct resv_map *resv, long f, long t)
* so that the subsequent region_add call will have all the
* regions it needs and will not fail.
*
+ * Upon entry, region_chg will also examine the cache of
+ * region descriptors associated with the map. If there are
+ * not enough descriptors cached, one will be allocated
+ * for the in progress add operation.
+ *
* Returns the number of huge pages that need to be added
* to the existing reservation map for the range [f, t).
* This number is greater or equal to zero. -ENOMEM is
- * returned if a new file_region structure is needed and can
- * not be allocated.
+ * returned if a new file_region structure or cache entry
+ * is needed and can not be allocated.
*/
static long region_chg(struct resv_map *resv, long f, long t)
{
@@ -326,6 +358,31 @@ static long region_chg(struct resv_map *resv, long f, long t)

retry:
spin_lock(&resv->lock);
+retry_locked:
+ resv->adds_in_progress++;
+
+ /*
+ * Check for sufficient descriptors in the cache to accommodate
+ * the number of in progress add operations.
+ */
+ if (resv->adds_in_progress > resv->rgn_cache_count) {
+ struct file_region *trg;
+
+ VM_BUG_ON(resv->adds_in_progress - resv->rgn_cache_count > 1);
+ /* Must drop lock to allocate a new descriptor. */
+ resv->adds_in_progress--;
+ spin_unlock(&resv->lock);
+
+ trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+ if (!trg)
+ return -ENOMEM;
+
+ spin_lock(&resv->lock);
+ list_add(&trg->link, &resv->rgn_cache);
+ resv->rgn_cache_count++;
+ goto retry_locked;
+ }
+
/* Locate the region we are before or in. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
@@ -336,6 +393,7 @@ retry:
* size such that we can guarantee to record the reservation. */
if (&rg->link == head || t < rg->from) {
if (!nrg) {
+ resv->adds_in_progress--;
spin_unlock(&resv->lock);
nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
if (!nrg)
@@ -385,6 +443,25 @@ out_nrg:
}

/*
+ * Abort the in progress add operation. The adds_in_progress field
+ * of the resv_map keeps track of the operations in progress between
+ * calls to region_chg and region_add. Operations are sometimes
+ * aborted after the call to region_chg. In such cases, region_abort
+ * is called to decrement the adds_in_progress counter.
+ *
+ * NOTE: The range arguments [f, t) are not needed or used in this
+ * routine. They are kept to make reading the calling code easier as
+ * arguments will match the associated region_chg call.
+ */
+static void region_abort(struct resv_map *resv, long f, long t)
+{
+ spin_lock(&resv->lock);
+ VM_BUG_ON(!resv->rgn_cache_count);
+ resv->adds_in_progress--;
+ spin_unlock(&resv->lock);
+}
+
+/*
* Truncate the reserve map at index 'end'. Modify/truncate any
* region which contains end. Delete any regions past end.
* Return the number of huge pages removed from the map.
@@ -544,22 +621,44 @@ static void set_vma_private_data(struct vm_area_struct *vma,
struct resv_map *resv_map_alloc(void)
{
struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
- if (!resv_map)
+ struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL);
+
+ if (!resv_map || !rg) {
+ kfree(resv_map);
+ kfree(rg);
return NULL;
+ }

kref_init(&resv_map->refs);
spin_lock_init(&resv_map->lock);
INIT_LIST_HEAD(&resv_map->regions);

+ resv_map->adds_in_progress = 0;
+
+ INIT_LIST_HEAD(&resv_map->rgn_cache);
+ list_add(&rg->link, &resv_map->rgn_cache);
+ resv_map->rgn_cache_count = 1;
+
return resv_map;
}

void resv_map_release(struct kref *ref)
{
struct resv_map *resv_map = container_of(ref, struct resv_map, refs);
+ struct list_head *head = &resv_map->rgn_cache;
+ struct file_region *rg, *trg;

/* Clear out any active regions before we release the map. */
region_truncate(resv_map, 0);
+
+ /* ... and any entries left in the cache */
+ list_for_each_entry_safe(rg, trg, head, link) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+
+ VM_BUG_ON(resv_map->adds_in_progress);
+
kfree(resv_map);
}

@@ -1473,16 +1572,18 @@ static void return_unused_surplus_pages(struct hstate *h,
}
}

+
/*
- * vma_needs_reservation and vma_commit_reservation are used by the huge
- * page allocation routines to manage reservations.
+ * vma_needs_reservation, vma_commit_reservation and vma_abort_reservation
+ * are used by the huge page allocation routines to manage reservations.
*
* vma_needs_reservation is called to determine if the huge page at addr
* within the vma has an associated reservation. If a reservation is
* needed, the value 1 is returned. The caller is then responsible for
* managing the global reservation and subpool usage counts. After
* the huge page has been allocated, vma_commit_reservation is called
- * to add the page to the reservation map.
+ * to add the page to the reservation map. If the reservation must be
+ * aborted instead of committed, vma_abort_reservation is called.
*
* In the normal case, vma_commit_reservation returns the same value
* as the preceding vma_needs_reservation call. The only time this
@@ -1490,9 +1591,14 @@ static void return_unused_surplus_pages(struct hstate *h,
* is the responsibility of the caller to notice the difference and
* take appropriate action.
*/
+enum vma_resv_mode {
+ VMA_NEEDS_RESV,
+ VMA_COMMIT_RESV,
+ VMA_ABORT_RESV,
+};
static long __vma_reservation_common(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr,
- bool commit)
+ enum vma_resv_mode mode)
{
struct resv_map *resv;
pgoff_t idx;
@@ -1503,10 +1609,20 @@ static long __vma_reservation_common(struct hstate *h,
return 1;

idx = vma_hugecache_offset(h, vma, addr);
- if (commit)
- ret = region_add(resv, idx, idx + 1);
- else
+ switch (mode) {
+ case VMA_NEEDS_RESV:
ret = region_chg(resv, idx, idx + 1);
+ break;
+ case VMA_COMMIT_RESV:
+ ret = region_add(resv, idx, idx + 1);
+ break;
+ case VMA_ABORT_RESV:
+ region_abort(resv, idx, idx + 1);
+ ret = 0;
+ break;
+ default:
+ BUG();
+ }

if (vma->vm_flags & VM_MAYSHARE)
return ret;
@@ -1517,13 +1633,19 @@ static long __vma_reservation_common(struct hstate *h,
static long vma_needs_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- return __vma_reservation_common(h, vma, addr, false);
+ return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV);
}

static long vma_commit_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- return __vma_reservation_common(h, vma, addr, true);
+ return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV);
+}
+
+static void vma_abort_reservation(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ (void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
}

static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -1549,8 +1671,10 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
if (chg < 0)
return ERR_PTR(-ENOMEM);
if (chg || avoid_reserve)
- if (hugepage_subpool_get_pages(spool, 1) < 0)
+ if (hugepage_subpool_get_pages(spool, 1) < 0) {
+ vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
+ }

ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -1596,6 +1720,7 @@ out_uncharge_cgroup:
out_subpool_put:
if (chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
+ vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}

@@ -3236,11 +3361,14 @@ retry:
* any allocations necessary to record that reservation occur outside
* the spinlock.
*/
- if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
if (vma_needs_reservation(h, vma, address) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
+ /* Just decrements count, does not deallocate */
+ vma_abort_reservation(h, vma, address);
+ }

ptl = huge_pte_lockptr(h, mm, ptep);
spin_lock(ptl);
@@ -3387,6 +3515,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out_mutex;
}
+ /* Just decrements count, does not deallocate */
+ vma_abort_reservation(h, vma, address);

if (!(vma->vm_flags & VM_MAYSHARE))
pagecache_page = hugetlbfs_pagecache_page(h,
@@ -3726,6 +3856,8 @@ int hugetlb_reserve_pages(struct inode *inode,
}
return 0;
out_err:
+ if (!vma || vma->vm_flags & VM_MAYSHARE)
+ region_abort(resv_map, from, to);
if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
kref_put(&resv_map->refs, resv_map_release);
return ret;
--
2.1.0

2015-07-21 18:10:57

by Mike Kravetz

Subject: [PATCH v4 02/10] mm/hugetlb: add region_del() to delete a specific range of entries

fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset. region_del() will provide
this same functionality if the end of region is specified as LONG_MAX.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing LONG_MAX as end of range will never experience
this error and do not need to deal with error handling. Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 85 insertions(+), 37 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3923a1..a573396 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -462,43 +462,90 @@ static void region_abort(struct resv_map *resv, long f, long t)
}

/*
- * Truncate the reserve map at index 'end'. Modify/truncate any
- * region which contains end. Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map. If the
+ * t parameter is LONG_MAX, this indicates that ALL regions after f
+ * should be deleted. Locate the regions which intersect [f, t)
+ * and either trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more. In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated. If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == LONG_MAX, then we will never split
+ * a region and possibly return -ENOMEM. Callers specifying
+ * t == LONG_MAX do not need to check for -ENOMEM error.
*/
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
{
struct list_head *head = &resv->regions;
struct file_region *rg, *trg;
- long chg = 0;
+ struct file_region *nrg = NULL;
+ long del = 0;

+retry:
spin_lock(&resv->lock);
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
- if (end <= rg->to)
+ list_for_each_entry_safe(rg, trg, head, link) {
+ if (rg->to <= f)
+ continue;
+ if (rg->from >= t)
break;
- if (&rg->link == head)
- goto out;

- /* If we are in the middle of a region then adjust it. */
- if (end > rg->from) {
- chg = rg->to - end;
- rg->to = end;
- rg = list_entry(rg->link.next, typeof(*rg), link);
- }
+ if (f > rg->from && t < rg->to) { /* Must split region */
+ /*
+ * Check for an entry in the cache before dropping
+ * lock and attempting allocation.
+ */
+ if (!nrg &&
+ resv->rgn_cache_count > resv->adds_in_progress) {
+ nrg = list_first_entry(&resv->rgn_cache,
+ struct file_region,
+ link);
+ list_del(&nrg->link);
+ resv->rgn_cache_count--;
+ }

- /* Drop any remaining regions. */
- list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
- if (&rg->link == head)
+ if (!nrg) {
+ spin_unlock(&resv->lock);
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ goto retry;
+ }
+
+ del += t - f;
+
+ /* New entry for end of split region */
+ nrg->from = t;
+ nrg->to = rg->to;
+ INIT_LIST_HEAD(&nrg->link);
+
+ /* Original entry is trimmed */
+ rg->to = f;
+
+ list_add(&nrg->link, &rg->link);
+ nrg = NULL;
break;
- chg += rg->to - rg->from;
- list_del(&rg->link);
- kfree(rg);
+ }
+
+ if (f <= rg->from && t >= rg->to) { /* Remove entire region */
+ del += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ continue;
+ }
+
+ if (f <= rg->from) { /* Trim beginning of region */
+ del += t - rg->from;
+ rg->from = t;
+ } else { /* Trim end of region */
+ del += rg->to - f;
+ rg->to = f;
+ }
}

-out:
spin_unlock(&resv->lock);
- return chg;
+ kfree(nrg);
+ return del;
}

/*
@@ -649,7 +696,7 @@ void resv_map_release(struct kref *ref)
struct file_region *rg, *trg;

/* Clear out any active regions before we release the map. */
- region_truncate(resv_map, 0);
+ region_del(resv_map, 0, LONG_MAX);

/* ... and any entries left in the cache */
list_for_each_entry_safe(rg, trg, head, link) {
@@ -1574,7 +1621,7 @@ static void return_unused_surplus_pages(struct hstate *h,


/*
- * vma_needs_reservation, vma_commit_reservation and vma_abort_reservation
+ * vma_needs_reservation, vma_commit_reservation and vma_end_reservation
* are used by the huge page allocation routines to manage reservations.
*
* vma_needs_reservation is called to determine if the huge page at addr
@@ -1582,8 +1629,9 @@ static void return_unused_surplus_pages(struct hstate *h,
* needed, the value 1 is returned. The caller is then responsible for
* managing the global reservation and subpool usage counts. After
* the huge page has been allocated, vma_commit_reservation is called
- * to add the page to the reservation map. If the reservation must be
- * aborted instead of committed, vma_abort_reservation is called.
+ * to add the page to the reservation map. If the page allocation fails,
+ * the reservation must be ended instead of committed. vma_end_reservation
+ * is called in such cases.
*
* In the normal case, vma_commit_reservation returns the same value
* as the preceding vma_needs_reservation call. The only time this
@@ -1594,7 +1642,7 @@ static void return_unused_surplus_pages(struct hstate *h,
enum vma_resv_mode {
VMA_NEEDS_RESV,
VMA_COMMIT_RESV,
- VMA_ABORT_RESV,
+ VMA_END_RESV,
};
static long __vma_reservation_common(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr,
@@ -1616,7 +1664,7 @@ static long __vma_reservation_common(struct hstate *h,
case VMA_COMMIT_RESV:
ret = region_add(resv, idx, idx + 1);
break;
- case VMA_ABORT_RESV:
+ case VMA_END_RESV:
region_abort(resv, idx, idx + 1);
ret = 0;
break;
@@ -1642,10 +1690,10 @@ static long vma_commit_reservation(struct hstate *h,
return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV);
}

-static void vma_abort_reservation(struct hstate *h,
+static void vma_end_reservation(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
- (void)__vma_reservation_common(h, vma, addr, VMA_ABORT_RESV);
+ (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
}

static struct page *alloc_huge_page(struct vm_area_struct *vma,
@@ -1672,7 +1720,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOMEM);
if (chg || avoid_reserve)
if (hugepage_subpool_get_pages(spool, 1) < 0) {
- vma_abort_reservation(h, vma, addr);
+ vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}

@@ -1720,7 +1768,7 @@ out_uncharge_cgroup:
out_subpool_put:
if (chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
- vma_abort_reservation(h, vma, addr);
+ vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}

@@ -3367,7 +3415,7 @@ retry:
goto backout_unlocked;
}
/* Just decrements count, does not deallocate */
- vma_abort_reservation(h, vma, address);
+ vma_end_reservation(h, vma, address);
}

ptl = huge_pte_lockptr(h, mm, ptep);
@@ -3516,7 +3564,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;
}
/* Just decrements count, does not deallocate */
- vma_abort_reservation(h, vma, address);
+ vma_end_reservation(h, vma, address);

if (!(vma->vm_flags & VM_MAYSHARE))
pagecache_page = hugetlbfs_pagecache_page(h,
@@ -3872,7 +3920,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
long gbl_reserve;

if (resv_map)
- chg = region_truncate(resv_map, offset);
+ chg = region_del(resv_map, offset, LONG_MAX);
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
--
2.1.0

2015-07-21 18:11:40

by Mike Kravetz

Subject: [PATCH v4 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate

hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table). fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages. Expose interfaces so that fallocate operations
can be synchronized with page faults. Minor name changes to
be more consistent with other global hugetlb symbols.

Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 20 ++++++++++----------
2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 667cf44..933da39 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -88,6 +88,11 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void free_huge_page(struct page *page);
+extern struct mutex *hugetlb_fault_mutex_table;
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ struct address_space *mapping,
+ pgoff_t idx, unsigned long address);

#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a573396..db19f4e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
* prevent spurious OOMs when the hugepage pool is fully utilized.
*/
static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp;
+struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;

/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -2484,7 +2484,7 @@ static void __exit hugetlb_exit(void)
}

kobject_put(hugepages_kobj);
- kfree(htlb_fault_mutex_table);
+ kfree(hugetlb_fault_mutex_table);
}
module_exit(hugetlb_exit);

@@ -2517,12 +2517,12 @@ static int __init hugetlb_init(void)
#else
num_fault_mutexes = 1;
#endif
- htlb_fault_mutex_table =
+ hugetlb_fault_mutex_table =
kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
- BUG_ON(!htlb_fault_mutex_table);
+ BUG_ON(!hugetlb_fault_mutex_table);

for (i = 0; i < num_fault_mutexes; i++)
- mutex_init(&htlb_fault_mutex_table[i]);
+ mutex_init(&hugetlb_fault_mutex_table[i]);
return 0;
}
module_init(hugetlb_init);
@@ -3456,7 +3456,7 @@ backout_unlocked:
}

#ifdef CONFIG_SMP
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3481,7 +3481,7 @@ static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
* For uniprocessor systems we always use a single mutex, so just
* return 0 and avoid the hashing overhead.
*/
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3529,8 +3529,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
- mutex_lock(&htlb_fault_mutex_table[hash]);
+ hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);

entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
@@ -3615,7 +3615,7 @@ out_ptl:
put_page(pagecache_page);
}
out_mutex:
- mutex_unlock(&htlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
--
2.1.0

2015-07-21 18:11:17

by Mike Kravetz

Subject: [PATCH v4 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete

fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range. If end is 0, this indicates all pages after start
should be unmapped. This is the same as the existing truncate
functionality. Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0cf74df..ed40f56 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -349,11 +349,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
}

static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
{
struct vm_area_struct *vma;

- vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+ /*
+ * end == 0 indicates that the entire range after
+ * start should be unmapped.
+ */
+ vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
unsigned long v_offset;

/*
@@ -362,13 +366,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
* which overlap the truncated area starting at pgoff,
* and no vma on a 32-bit arch can span beyond the 4GB.
*/
- if (vma->vm_pgoff < pgoff)
- v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+ if (vma->vm_pgoff < start)
+ v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
else
v_offset = 0;

- unmap_hugepage_range(vma, vma->vm_start + v_offset,
- vma->vm_end, NULL);
+ if (end) {
+ end = ((end - start) << PAGE_SHIFT) +
+ vma->vm_start + v_offset;
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+ } else
+ end = vma->vm_end;
+
+ unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
}
}

@@ -384,7 +395,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_size_write(inode, offset);
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
- hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
truncate_hugepages(inode, offset);
return 0;
--
2.1.0

2015-07-21 18:11:30

by Mike Kravetz

Subject: [PATCH v4 05/10] hugetlbfs: truncate_hugepages() takes a range of pages

Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of LLONG_MAX is passed,
the current "truncate" functionality is maintained. Existing
callers are modified to pass LLONG_MAX as end of range. By keying
off end == LLONG_MAX, the routine behaves differently for truncate
and hole punch. Page removal is now synchronized with page
allocation via faults by using the fault mutex table. The hole
punch case can experience the rare region_del error and must
handle accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages(). To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 98 ++++++++++++++++++++++++++++++++++++++++++++-----
include/linux/hugetlb.h | 4 +-
mm/hugetlb.c | 40 ++++++++++++++++++--
3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed40f56..a974e4b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -293,26 +293,61 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
return -EINVAL;
}

-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
{
ClearPageDirty(page);
ClearPageUptodate(page);
delete_from_page_cache(page);
}

-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch. There are subtle differences in operation for each case.
+ *
+ * truncation is indicated by end of range being LLONG_MAX
+ * In this case, we first scan the range and release found pages.
+ * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ * maps and global counts.
+ * hole punch is indicated if end is not LLONG_MAX
+ * In the hole punch case we scan the range and release found pages.
+ * Only when releasing a page is the associated region/reserv map
+ * deleted. The region/reserv map for ranges without associated
+ * pages are not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not LLONG_MAX, this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+ loff_t lend)
{
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
+ const pgoff_t end = lend >> huge_page_shift(h);
+ struct vm_area_struct pseudo_vma;
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
+ long lookup_nr = PAGEVEC_SIZE;
+ bool truncate_op = (lend == LLONG_MAX);

+ memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+ pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pagevec_init(&pvec, 0);
next = start;
- while (1) {
- if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ while (next < end) {
+ /*
+ * Make sure to never grab more pages than we
+ * might possibly need.
+ */
+ if (end - next < lookup_nr)
+ lookup_nr = end - next;
+
+ /*
+ * This pagevec_lookup() may return pages past 'end',
+ * so we must check for page->index >= end.
+ */
+ if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
if (next == start)
break;
next = start;
@@ -321,26 +356,69 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)

for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
+ u32 hash;
+
+ hash = hugetlb_fault_mutex_hash(h, current->mm,
+ &pseudo_vma,
+ mapping, next, 0);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);

lock_page(page);
+ if (page->index >= end) {
+ unlock_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ next = end; /* we are done */
+ break;
+ }
+
+ /*
+ * If page is mapped, it was faulted in after being
+ * unmapped. Do nothing in this race case. In the
+ * normal case page is not mapped.
+ */
+ if (!page_mapped(page)) {
+ bool rsv_on_error = !PagePrivate(page);
+ /*
+ * We must free the huge page and remove
+ * from page cache (remove_huge_page) BEFORE
+ * removing the region/reserve map
+ * (hugetlb_unreserve_pages). In rare out
+ * of memory conditions, removal of the
+ * region/reserve map could fail. Before
+ * free'ing the page, note PagePrivate which
+ * is used in case of error.
+ */
+ remove_huge_page(page);
+ freed++;
+ if (!truncate_op) {
+ if (unlikely(hugetlb_unreserve_pages(
+ inode, next,
+ next + 1, 1)))
+ hugetlb_fix_reserve_counts(
+ inode, rsv_on_error);
+ }
+ }
+
if (page->index > next)
next = page->index;
+
++next;
- truncate_huge_page(page);
unlock_page(page);
- freed++;
+
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
}
- BUG_ON(!lstart && mapping->nrpages);
- hugetlb_unreserve_pages(inode, start, freed);
+
+ if (truncate_op)
+ (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
}

static void hugetlbfs_evict_inode(struct inode *inode)
{
struct resv_map *resv_map;

- truncate_hugepages(inode, 0);
+ remove_inode_hugepages(inode, 0, LLONG_MAX);
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -397,7 +475,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
- truncate_hugepages(inode, offset);
+ remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
}

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 933da39..e7825c9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -83,11 +83,13 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
vm_flags_t vm_flags);
-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+ long freed);
int dequeue_hwpoisoned_huge_page(struct page *page);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void free_huge_page(struct page *page);
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve);
extern struct mutex *hugetlb_fault_mutex_table;
u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db19f4e..db9caea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -549,6 +549,28 @@ retry:
}

/*
+ * A rare out of memory error was encountered which prevented removal of
+ * the reserve map region for a page. The huge page itself was freed
+ * and removed from the page cache. This routine will adjust the subpool
+ * usage count, and the global reserve count if needed. By incrementing
+ * these counts, the reserve map entry which could not be deleted will
+ * appear as a "reserved" entry instead of simply dangling with incorrect
+ * counts.
+ */
+void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve)
+{
+ struct hugepage_subpool *spool = subpool_inode(inode);
+ long rsv_adjust;
+
+ rsv_adjust = hugepage_subpool_get_pages(spool, 1);
+ if (restore_reserve && rsv_adjust) {
+ struct hstate *h = hstate_inode(inode);
+
+ hugetlb_acct_memory(h, 1);
+ }
+}
+
+/*
* Count and return the number of huge pages in the reserve map
* that intersect with the range [f, t).
*/
@@ -3911,7 +3933,8 @@ out_err:
return ret;
}

-void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
+long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
+ long freed)
{
struct hstate *h = hstate_inode(inode);
struct resv_map *resv_map = inode_resv_map(inode);
@@ -3919,8 +3942,17 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
struct hugepage_subpool *spool = subpool_inode(inode);
long gbl_reserve;

- if (resv_map)
- chg = region_del(resv_map, offset, LONG_MAX);
+ if (resv_map) {
+ chg = region_del(resv_map, start, end);
+ /*
+ * region_del() can fail in the rare case where a region
+ * must be split and another region descriptor can not be
+ * allocated. If end == LONG_MAX, it will not fail.
+ */
+ if (chg < 0)
+ return chg;
+ }
+
spin_lock(&inode->i_lock);
inode->i_blocks -= (blocks_per_huge_page(h) * freed);
spin_unlock(&inode->i_lock);
@@ -3931,6 +3963,8 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
*/
gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
hugetlb_acct_memory(h, -gbl_reserve);
+
+ return 0;
}

#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
--
2.1.0

2015-07-21 18:11:35

by Mike Kravetz

Subject: [PATCH v4 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch

In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings. However, this will not be the
case with fallocate hole punch. When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation). vma_has_reserves is passed "chg" which
indicates whether or not a region/reserve map is present. Use
this to determine if reserves are actually present or were removed
via hole punch.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db9caea..f9d3faf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -803,8 +803,19 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
}

/* Shared mappings always use reserves */
- if (vma->vm_flags & VM_MAYSHARE)
- return true;
+ if (vma->vm_flags & VM_MAYSHARE) {
+ /*
+ * We know VM_NORESERVE is not set. Therefore, there SHOULD
+ * be a region map for all pages. The only situation where
+ * there is no region map is if a hole was punched via
+ * fallocate. In this case, there really are no reserves to
+ * use. This situation is indicated if chg != 0.
+ */
+ if (chg)
+ return false;
+ else
+ return true;
+ }

/*
* Only the process that called mmap() has reserves for
--
2.1.0

2015-07-21 18:11:19

by Mike Kravetz

Subject: [PATCH v4 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate

Areas hole punched by fallocate will not have entries in the
region/reserve map. However, shared mappings with min_size subpool
reservations may still have reserved pages. alloc_huge_page needs
to handle this special case and do the proper accounting.
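
A minimal userspace sketch of this two-level decision (illustrative
only; the boolean subpool_has_reserves stands in for
hugepage_subpool_get_pages() returning zero, and global_charge() is a
name invented for the sketch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a page must come from the global free pool.
 * map_chg != 0 means no reservation exists in the region/reserve map
 * (e.g. the range was hole punched).
 * Returns 0 when some reservation (map or subpool) covers the page.
 */
static int global_charge(long map_chg, bool subpool_has_reserves,
			 int avoid_reserve)
{
	long gbl_chg = map_chg;

	if (map_chg || avoid_reserve) {
		/* No map reservation: a min_size subpool may still
		 * hold reserved pages for this file. */
		gbl_chg = subpool_has_reserves ? 0 : 1;

		/* avoid_reserve skips even subpool reservations. */
		if (avoid_reserve)
			gbl_chg = 1;
	}
	return gbl_chg ? 1 : 0;
}
```

The second test case below is the new situation the patch handles: no
region/reserve map entry (hole punched), yet a subpool reservation
still covers the allocation.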

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 39 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9d3faf..efcd58c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1735,34 +1735,58 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
- long chg, commit;
+ long map_chg, map_commit;
+ long gbl_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg;

idx = hstate_index(h);
/*
- * Processes that did not create the mapping will have no
- * reserves and will not have accounted against subpool
- * limit. Check that the subpool limit can be made before
- * satisfying the allocation MAP_NORESERVE mappings may also
- * need pages and subpool limit allocated allocated if no reserve
- * mapping overlaps.
+ * Examine the region/reserve map to determine if the process
+ * has a reservation for the page to be allocated. A return
+ * code of zero indicates a reservation exists (no change).
*/
- chg = vma_needs_reservation(h, vma, addr);
- if (chg < 0)
+ map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+ if (map_chg < 0)
return ERR_PTR(-ENOMEM);
- if (chg || avoid_reserve)
- if (hugepage_subpool_get_pages(spool, 1) < 0) {
+
+ /*
+ * Processes that did not create the mapping will have no
+ * reserves as indicated by the region/reserve map. Check
+ * that the allocation will not exceed the subpool limit.
+ * Allocations for MAP_NORESERVE mappings also need to be
+ * checked against any subpool limit.
+ */
+ if (map_chg || avoid_reserve) {
+ gbl_chg = hugepage_subpool_get_pages(spool, 1);
+ if (gbl_chg < 0) {
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}

+ /*
+ * Even though there was no reservation in the region/reserve
+ * map, there could be reservations associated with the
+ * subpool that can be used. This would be indicated if the
+ * return value of hugepage_subpool_get_pages() is zero.
+ * However, if avoid_reserve is specified we still avoid even
+ * the subpool reservations.
+ */
+ if (avoid_reserve)
+ gbl_chg = 1;
+ }
+
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret)
goto out_subpool_put;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+ /*
+ * gbl_chg is passed to indicate whether or not a page must be taken
+ * from the global free pool (global change). gbl_chg == 0 indicates
+ * a reservation exists for the allocation.
+ */
+ page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1778,8 +1802,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,

set_page_private(page, (unsigned long)spool);

- commit = vma_commit_reservation(h, vma, addr);
- if (unlikely(chg > commit)) {
+ map_commit = vma_commit_reservation(h, vma, addr);
+ if (unlikely(map_chg > map_commit)) {
/*
* The page was added to the reservation map between
* vma_needs_reservation and vma_commit_reservation.
@@ -1799,7 +1823,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_subpool_put:
- if (chg || avoid_reserve)
+ if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
--
2.1.0

2015-07-21 18:11:37

by Mike Kravetz

Subject: [PATCH v4 08/10] hugetlbfs: New huge_add_to_page_cache helper routine

Currently, there is only a single place where hugetlbfs pages are
added to the page cache. The new fallocate code will be adding a second
one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 2 ++
mm/hugetlb.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e7825c9..657ef26 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -333,6 +333,8 @@ struct huge_bootmem_page {
struct page *alloc_huge_page_node(struct hstate *h, int nid);
struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t idx);

/* arch callback */
int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index efcd58c..9812785 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3377,6 +3377,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
return page != NULL;
}

+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t idx)
+{
+ struct inode *inode = mapping->host;
+ struct hstate *h = hstate_inode(inode);
+ int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+ if (err)
+ return err;
+ ClearPagePrivate(page);
+
+ spin_lock(&inode->i_lock);
+ inode->i_blocks += blocks_per_huge_page(h);
+ spin_unlock(&inode->i_lock);
+ return 0;
+}
+
static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
unsigned long address, pte_t *ptep, unsigned int flags)
@@ -3424,21 +3441,13 @@ retry:
set_page_huge_active(page);

if (vma->vm_flags & VM_MAYSHARE) {
- int err;
- struct inode *inode = mapping->host;
-
- err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+ int err = huge_add_to_page_cache(page, mapping, idx);
if (err) {
put_page(page);
if (err == -EEXIST)
goto retry;
goto out;
}
- ClearPagePrivate(page);
-
- spin_lock(&inode->i_lock);
- inode->i_blocks += blocks_per_huge_page(h);
- spin_unlock(&inode->i_lock);
} else {
lock_page(page);
if (unlikely(anon_vma_prepare(vma))) {
--
2.1.0

2015-07-21 18:13:27

by Mike Kravetz

Subject: [PATCH v4 09/10] hugetlbfs: add hugetlbfs_fallocate()

This is based on the shmem version, but it has diverged quite
a bit. We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped. This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate(). MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.
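
The whole-page rounding can be sketched as follows (userspace sketch;
the helper names are invented here, and hpage_size is assumed to be a
power of two, which holds for all hstates):

```c
#include <assert.h>

static long long round_up_hp(long long x, long long sz)
{
	return (x + sz - 1) & ~(sz - 1);
}

static long long round_down_hp(long long x, long long sz)
{
	return x & ~(sz - 1);
}

/* Hole punch shrinks the range inward: only huge pages lying fully
 * inside [offset, offset + len) are punched. An empty result
 * (end <= start) means there is nothing to do. */
static void punch_range(long long off, long long len, long long sz,
			long long *start, long long *end)
{
	*start = round_up_hp(off, sz);
	*end = round_down_hp(off + len, sz);
}

/* Preallocate grows the range outward and converts to huge page
 * indices, so every byte in [offset, offset + len) gets backing. */
static void prealloc_range(long long off, long long len, long long sz,
			   long long *start, long long *end)
{
	*start = round_down_hp(off, sz) / sz;
	*end = round_up_hp(off + len, sz) / sz;
}
```

With 2 MB huge pages, punching [1 MB, 5 MB) removes only the pages at
[2 MB, 4 MB), while preallocating the same byte range allocates page
indices 0 through 2.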

Based-on code-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/hugetlb.h | 3 +
mm/hugetlb.c | 2 +-
3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a974e4b..6e565a4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
#include <linux/thread_info.h>
#include <asm/current.h>
#include <linux/sched.h> /* remove ASAP */
+#include <linux/falloc.h>
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/file.h>
@@ -479,6 +480,160 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
return 0;
}

+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+ struct hstate *h = hstate_inode(inode);
+ loff_t hpage_size = huge_page_size(h);
+ loff_t hole_start, hole_end;
+
+ /*
+ * For hole punch round up the beginning offset of the hole and
+ * round down the end.
+ */
+ hole_start = round_up(offset, hpage_size);
+ hole_end = round_down(offset + len, hpage_size);
+
+ if (hole_end > hole_start) {
+ struct address_space *mapping = inode->i_mapping;
+
+ mutex_lock(&inode->i_mutex);
+ i_mmap_lock_write(mapping);
+ if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+ hugetlb_vmdelete_list(&mapping->i_mmap,
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT);
+ i_mmap_unlock_write(mapping);
+ remove_inode_hugepages(inode, hole_start, hole_end);
+ mutex_unlock(&inode->i_mutex);
+ }
+
+ return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t len)
+{
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);
+ struct vm_area_struct pseudo_vma;
+ struct mm_struct *mm = current->mm;
+ loff_t hpage_size = huge_page_size(h);
+ unsigned long hpage_shift = huge_page_shift(h);
+ pgoff_t start, index, end;
+ int error;
+ u32 hash;
+
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ return -EOPNOTSUPP;
+
+ if (mode & FALLOC_FL_PUNCH_HOLE)
+ return hugetlbfs_punch_hole(inode, offset, len);
+
+ /*
+ * Default preallocate case.
+ * For this range, start is rounded down and end is rounded up
+ * as well as being converted to page offsets.
+ */
+ start = offset >> hpage_shift;
+ end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+ mutex_lock(&inode->i_mutex);
+
+ /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+ error = inode_newsize_ok(inode, offset + len);
+ if (error)
+ goto out;
+
+ /*
+ * Initialize a pseudo vma that just contains the policy used
+ * when allocating the huge pages. The actual policy field
+ * (vm_policy) is determined based on the index in the loop below.
+ */
+ memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+ pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+ pseudo_vma.vm_file = file;
+
+ for (index = start; index < end; index++) {
+ /*
+ * This is supposed to be the vaddr where the page is being
+ * faulted in, but we have no vaddr here.
+ */
+ struct page *page;
+ unsigned long addr;
+ int avoid_reserve = 0;
+
+ cond_resched();
+
+ /*
+ * fallocate(2) manpage permits EINTR; we may have been
+ * interrupted because we are using up too much memory.
+ */
+ if (signal_pending(current)) {
+ error = -EINTR;
+ break;
+ }
+
+ /* Get policy based on index */
+ pseudo_vma.vm_policy =
+ mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
+ index);
+
+ /* addr is the offset within the file (zero based) */
+ addr = index * hpage_size;
+
+ /* mutex taken here, fault path and hole punch */
+ hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
+ index, addr);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+ /* See if already present in mapping to avoid alloc/free */
+ page = find_get_page(mapping, index);
+ if (page) {
+ put_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ mpol_cond_put(pseudo_vma.vm_policy);
+ continue;
+ }
+
+ /* Allocate page and add to page cache */
+ page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
+ mpol_cond_put(pseudo_vma.vm_policy);
+ if (IS_ERR(page)) {
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ error = PTR_ERR(page);
+ goto out;
+ }
+ clear_huge_page(page, addr, pages_per_huge_page(h));
+ __SetPageUptodate(page);
+ error = huge_add_to_page_cache(page, mapping, index);
+ if (unlikely(error)) {
+ put_page(page);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ goto out;
+ }
+
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+ /*
+ * page_put due to reference from alloc_huge_page()
+ * unlock_page because locked by add_to_page_cache()
+ */
+ put_page(page);
+ unlock_page(page);
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+ i_size_write(inode, offset + len);
+ inode->i_ctime = CURRENT_TIME;
+ spin_lock(&inode->i_lock);
+ inode->i_private = NULL;
+ spin_unlock(&inode->i_lock);
+out:
+ mutex_unlock(&inode->i_mutex);
+ return error;
+}
+
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = d_inode(dentry);
@@ -790,7 +945,8 @@ const struct file_operations hugetlbfs_file_operations = {
.mmap = hugetlbfs_file_mmap,
.fsync = noop_fsync,
.get_unmapped_area = hugetlb_get_unmapped_area,
- .llseek = default_llseek,
+ .llseek = default_llseek,
+ .fallocate = hugetlbfs_fallocate,
};

static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 657ef26..386dcbf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -330,6 +330,8 @@ struct huge_bootmem_page {
#endif
};

+struct page *alloc_huge_page(struct vm_area_struct *vma,
+ unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_node(struct hstate *h, int nid);
struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
@@ -483,6 +485,7 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,

#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
+#define alloc_huge_page(v, a, r) NULL
#define alloc_huge_page_node(h, nid) NULL
#define alloc_huge_page_noerr(v, a, r) NULL
#define alloc_bootmem_huge_page(h) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9812785..e9acf24 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1729,7 +1729,7 @@ static void vma_end_reservation(struct hstate *h,
(void)__vma_reservation_common(h, vma, addr, VMA_END_RESV);
}

-static struct page *alloc_huge_page(struct vm_area_struct *vma,
+struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
struct hugepage_subpool *spool = subpool_vma(vma);
--
2.1.0

2015-07-21 18:14:29

by Mike Kravetz

Subject: [PATCH v4 10/10] mm: madvise allow remove operation for hugetlbfs

Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 67d5fe7..fa6479a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -468,7 +468,7 @@ static long madvise_remove(struct vm_area_struct *vma,

*prev = NULL; /* tell sys_madvise we drop mmap_sem */

- if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+ if (vma->vm_flags & VM_LOCKED)
return -EINVAL;

f = vma->vm_file;
--
2.1.0

2015-07-22 22:04:42

by Andrew Morton

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Tue, 21 Jul 2015 11:09:34 -0700 Mike Kravetz <[email protected]> wrote:

> Changes in this revision address the minor comment and function name
> issues brought up by Naoya Horiguchi. Patch set is also rebased on
> current "mmotm/since-4.1". This revision does not introduce any
> functional changes.
>
> As suggested during the RFC process, tests have been proposed to
> libhugetlbfs as described at:
> http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
> fallocate(2) man page modifications are also necessary to specify
> that fallocate for hugetlbfs only operates on whole pages. This
> change will be submitted once the code has stabilized and been
> proposed for merging.
>
> hugetlbfs is used today by applications that want a high degree of
> control over huge page usage. Often, large hugetlbfs files are used
> to map a large number of huge pages into the application processes.
> The applications know when page ranges within these large files will
> no longer be used, and ideally would like to release them back to
> the subpool or global pools for other uses. The fallocate() system
> call provides an interface for preallocation and hole punching within
> files. This patch set adds fallocate functionality to hugetlbfs.
>
> v4:
> Renamed vma_abort_reservation() to vma_end_reservation().
>
> ...
>
> v3 patch series:
> Reviewed-by: Naoya Horiguchi <[email protected]>
> Acked-by: Hillf Danton <[email protected]>

eek. Please don't hide the reviews and acks in [0/n] like this. I
don't expect to see them there, which causes me to park the patch
series until I'm feeling awake enough to undertake the patchset's first
ever review.

2015-07-22 22:03:32

by Andrew Morton

Subject: Re: [PATCH v4 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add

On Tue, 21 Jul 2015 11:09:35 -0700 Mike Kravetz <[email protected]> wrote:

> fallocate hole punch will want to remove a specific range of
> pages. When pages are removed, their associated entries in
> the region/reserve map will also be removed. This will break
> an assumption in the region_chg/region_add calling sequence.
> If a new region descriptor must be allocated, it is done as
> part of the region_chg processing. In this way, region_add
> can not fail because it does not need to attempt an allocation.
>
> To prepare for fallocate hole punch, create a "cache" of
> descriptors that can be used by region_add if necessary.
> region_chg will ensure there are sufficient entries in the
> cache. It will be necessary to track the number of in progress
> add operations to know a sufficient number of descriptors
> reside in the cache. A new routine region_abort is added to
> adjust this in progress count when add operations are aborted.
> vma_abort_reservation is also added for callers creating
> reservations with vma_needs_reservation/vma_commit_reservation.
>
> ...
>
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -35,6 +35,9 @@ struct resv_map {
> struct kref refs;
> spinlock_t lock;
> struct list_head regions;
> + long adds_in_progress;
> + struct list_head rgn_cache;
> + long rgn_cache_count;
> };

Linux style is to spell words out fully: rgn->region. One advantage of
doing this is that we get consistency: when there's no doubt about how
a thing is spelled we don't end up with various different terms for the
same thing. Note that we already have resv_map.regions and struct
file_region.

To avoid a respin I think I'll just do a s/rgn/region/g on the patches
then take a look for 80 col overflow - shout at me if you dislike that.

And yes, resv_map should have been called reservation_map. I suck.

And there's region_chg. Who wrote this stuff.

As an off-topic cleanup, resv_map and its associated stuff could be
moved out of include/linux/hugetlb.h and made private to hugetlb.c with
some minor changes.

Finally, the data structure definition site is a great place at which
to document the overall design. What the fields do, how they interact,
locking design, etc. eg, what data structure is at
resv_map.region_cache, how is it managed, what purpose does it serve
etc.

And adds_in_progress is interesting. I see the comment over
region_abort() (rgn_abort?!) but haven't quite soaked that in yet.

>
> ...
>
> @@ -312,11 +339,16 @@ static long region_add(struct resv_map *resv, long f, long t)
> * so that the subsequent region_add call will have all the
> * regions it needs and will not fail.
> *
> + * Upon entry, region_chg will also examine the cache of
> + * region descriptors associated with the map. If there

"are"

> + * not enough descriptors cached, one will be allocated
> + * for the in progress add operation.
> + *
> * Returns the number of huge pages that need to be added
> * to the existing reservation map for the range [f, t).
> * This number is greater or equal to zero. -ENOMEM is
> - * returned if a new file_region structure is needed and can
> - * not be allocated.
> + * returned if a new file_region structure or cache entry
> + * is needed and can not be allocated.
> */
> static long region_chg(struct resv_map *resv, long f, long t)
> {
>
> ...
>

2015-07-22 22:03:36

by Andrew Morton

Subject: Re: [PATCH v4 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate

On Tue, 21 Jul 2015 11:09:37 -0700 Mike Kravetz <[email protected]> wrote:

> Minor name changes to
> be more consistent with other global hugetlb symbols.
>
> ...
>
> - kfree(htlb_fault_mutex_table);
> + kfree(hugetlb_fault_mutex_table);

yay ;)

2015-07-22 22:03:49

by Andrew Morton

Subject: Re: [PATCH v4 09/10] hugetlbfs: add hugetlbfs_fallocate()

On Tue, 21 Jul 2015 11:09:43 -0700 Mike Kravetz <[email protected]> wrote:

> This is based on the shmem version, but it has diverged quite
> a bit. We have no swap to worry about, nor the new file sealing.
> Add synchronization via the fault mutex table to coordinate
> page faults, fallocate allocation and fallocate hole punch.
>
> What this allows us to do is move physical memory in and out of
> a hugetlbfs file without having it mapped. This also gives us
> the ability to support MADV_REMOVE since it is currently
> implemented using fallocate(). MADV_REMOVE lets madvise() remove
> pages from the middle of a hugetlbfs file, which wasn't possible
> before.
>
> hugetlbfs fallocate only operates on whole huge pages.
>
> ...
>
> +static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> + loff_t len)
> +{
> + struct inode *inode = file_inode(file);
> + struct address_space *mapping = inode->i_mapping;
> + struct hstate *h = hstate_inode(inode);
> + struct vm_area_struct pseudo_vma;
> + struct mm_struct *mm = current->mm;
> + loff_t hpage_size = huge_page_size(h);
> + unsigned long hpage_shift = huge_page_shift(h);
> + pgoff_t start, index, end;
> + int error;
> + u32 hash;
> +
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + return -EOPNOTSUPP;

EOPNOTSUPP is a networking thing. It's inappropriate here.

The problem is that if this error is ever returned to userspace, the
user will be sitting looking at "Operation not supported on transport
endpoint" and wondering what went wrong in the networking stack.

> + if (mode & FALLOC_FL_PUNCH_HOLE)
> + return hugetlbfs_punch_hole(inode, offset, len);
> +
> + /*
> + * Default preallocate case.
> + * For this range, start is rounded down and end is rounded up
> + * as well as being converted to page offsets.
> + */
> + start = offset >> hpage_shift;
> + end = (offset + len + hpage_size - 1) >> hpage_shift;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
> + error = inode_newsize_ok(inode, offset + len);
> + if (error)
> + goto out;
> +
> + /*
> + * Initialize a pseudo vma that just contains the policy used
> + * when allocating the huge pages. The actual policy field
> + * (vm_policy) is determined based on the index in the loop below.
> + */
> + memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
> + pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
> + pseudo_vma.vm_file = file;

trivia: we could have just done

	struct vm_area_struct pseudo_vma = {
		.vm_flags = ...,
		.vm_file = file,
	};
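
The suggestion relies on the C99 rule that members omitted from a designated initializer are zero-initialized, which is what makes the explicit memset() redundant. A minimal sketch with a stand-in struct (not the kernel's vm_area_struct; the VM_* values match the kernel headers):

```c
/* Members not named in the initializer are guaranteed zero/NULL. */
struct pseudo_vma {
	unsigned long vm_flags;
	void *vm_file;
	void *vm_policy;
	unsigned long vm_start;
};

#define VM_SHARED   0x00000008UL
#define VM_MAYSHARE 0x00000080UL
#define VM_HUGETLB  0x00400000UL

struct pseudo_vma make_pseudo_vma(void *file)
{
	struct pseudo_vma v = {
		.vm_flags = VM_HUGETLB | VM_MAYSHARE | VM_SHARED,
		.vm_file  = file,
	};
	/* v.vm_policy and v.vm_start are zero here without any memset. */
	return v;
}
```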

> + for (index = start; index < end; index++) {
> + /*
> + * This is supposed to be the vaddr where the page is being
> + * faulted in, but we have no vaddr here.
> + */
> + struct page *page;
> + unsigned long addr;
> + int avoid_reserve = 0;
> +
> + cond_resched();
> +
> + /*
> + * fallocate(2) manpage permits EINTR; we may have been
> + * interrupted because we are using up too much memory.
> + */
> + if (signal_pending(current)) {
> + error = -EINTR;
> + break;
> + }
> +
> + /* Get policy based on index */
> + pseudo_vma.vm_policy =
> + mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
> + index);
> +
> + /* addr is the offset within the file (zero based) */

So use loff_t?
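
The concern is that `unsigned long` is 32 bits on 32-bit kernels, so `index * hpage_size` wraps for file offsets past 4GB, while loff_t is 64-bit everywhere. A sketch of the wraparound, assuming 2MB huge pages:

```c
#include <stdint.h>

#define HPAGE_SIZE_2MB 2097152u		/* assume 2MB huge pages */

/* Models 32-bit 'unsigned long' arithmetic: wraps past 4GB. */
uint32_t addr_as_u32(uint32_t index)
{
	return index * HPAGE_SIZE_2MB;
}

/* Models loff_t, which is 64-bit on every architecture. */
int64_t addr_as_loff(uint32_t index)
{
	return (int64_t)index * HPAGE_SIZE_2MB;
}
```

Index 3000 (a 6GB offset) already produces different results in the two types.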

> + addr = index * hpage_size;
> +
> + /* mutex taken here, fault path and hole punch */
> + hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
> + index, addr);
> + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +
> + /* See if already present in mapping to avoid alloc/free */
> + page = find_get_page(mapping, index);
> + if (page) {
> + put_page(page);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + mpol_cond_put(pseudo_vma.vm_policy);
> + continue;
> + }
> +
> + /* Allocate page and add to page cache */
> + page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> + mpol_cond_put(pseudo_vma.vm_policy);
> + if (IS_ERR(page)) {
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + error = PTR_ERR(page);
> + goto out;
> + }
> + clear_huge_page(page, addr, pages_per_huge_page(h));
> + __SetPageUptodate(page);
> + error = huge_add_to_page_cache(page, mapping, index);
> + if (unlikely(error)) {
> + put_page(page);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + goto out;
> + }
> +
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
> + /*
> + * unlock_page because locked by add_to_page_cache()
> + * put_page due to reference from alloc_huge_page()
> + */
> + unlock_page(page);
> + put_page(page);
> + }
> +
> + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> + i_size_write(inode, offset + len);
> + inode->i_ctime = CURRENT_TIME;
> + spin_lock(&inode->i_lock);
> + inode->i_private = NULL;
> + spin_unlock(&inode->i_lock);
> +out:
> + mutex_unlock(&inode->i_mutex);
> + return error;
>
> ...
>
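
The rounding in the preallocate path above can be sketched in isolation: start is the offset rounded down and end rounded up to huge page boundaries, both expressed as page indices (assuming 2MB huge pages, shift 21):

```c
#include <stdint.h>

#define HPAGE_SHIFT 21			/* assume 2MB huge pages */
#define HPAGE_SIZE  (1ULL << HPAGE_SHIFT)

/* start = offset rounded down, end = offset+len rounded up,
 * both converted to huge page indices. */
void hpage_range(uint64_t offset, uint64_t len,
		 uint64_t *start, uint64_t *end)
{
	*start = offset >> HPAGE_SHIFT;
	*end = (offset + len + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
}
```

So a request for bytes 3MB..4MB maps to the single 2MB page at index 1 (covering 2MB..4MB).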

2015-07-22 22:06:50

by Andrew Morton

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Tue, 21 Jul 2015 11:09:34 -0700 Mike Kravetz <[email protected]> wrote:

> As suggested during the RFC process, tests have been proposed to
> libhugetlbfs as described at:
> http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/

I didn't know that libhugetlbfs has tests. I wonder if that makes
tools/testing/selftests/vm's hugetlbfstest harmful?

2015-07-22 22:20:15

by Davidlohr Bueso

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 2015-07-22 at 15:06 -0700, Andrew Morton wrote:
> On Tue, 21 Jul 2015 11:09:34 -0700 Mike Kravetz <[email protected]> wrote:
>
> > As suggested during the RFC process, tests have been proposed to
> > libhugetlbfs as described at:
> > http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/

Great!

>
> I didn't know that libhugetlbfs has tests. I wonder if that makes
> tools/testing/selftests/vm's hugetlbfstest harmful?

Why harmful? Redundant, maybe(?). Does anyone even use selftests for
hugetlbfs regression testing? Let's see, we also have these:

- hugepage-{mmap,shm}.c
- map_hugetlb.c

There's probably a lot of room for improvement here.

2015-07-22 22:24:28

by Mike Kravetz

Subject: Re: [PATCH v4 09/10] hugetlbfs: add hugetlbfs_fallocate()

On 07/22/2015 03:03 PM, Andrew Morton wrote:
> On Tue, 21 Jul 2015 11:09:43 -0700 Mike Kravetz <[email protected]> wrote:
...
>> +
>> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>> + return -EOPNOTSUPP;
>
> EOPNOTSUPP is a networking thing. It's inappropriate here.
>
> The problem is that if this error is ever returned to userspace, the
> user will be sitting looking at "Operation not supported on transport
> endpoint" and wondering what went wrong in the networking stack.

Trying to follow FALLOCATE(2) man page:

"EOPNOTSUPP
The filesystem containing the file referred to by fd does not
support this operation; or the mode is not supported by the
filesystem containing the file referred to by fd."

--
Mike Kravetz

2015-07-22 22:30:26

by Andrew Morton

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:

> >
> > I didn't know that libhugetlbfs has tests. I wonder if that makes
> > tools/testing/selftests/vm's hugetlbfstest harmful?
>
> Why harmful? Redundant, maybe(?).

The presence of the in-kernel tests will cause people to add stuff to
them when it would be better if they were to apply that effort to
making libhugetlbfs better. Or vice versa.

Mike's work is an example. Someone later makes a change to hugetlbfs, runs
the kernel selftest and says "yay, everything works", unaware that they
just broke fallocate support.

> Does anyone even use selftests for
> hugetlbfs regression testing? Lets see, we also have these:
>
> - hugepage-{mmap,shm}.c
> - map_hugetlb.c
>
> There's probably a lot of room for improvement here.

selftests is a pretty scrappy place. It's partly a dumping ground for
things so useful test code doesn't just get lost and bitrotted. Partly
a framework so people who add features can easily test them. Partly to
provide tools to architecture maintainers when they wire up new
syscalls and the like.

Unless there's some good reason to retain the hugetlb part of
selftests, I'm thinking we should just remove it to avoid
distracting/misleading people. Or possibly move the libhugetlbfs test
code into the kernel tree and maintain it there.

2015-07-22 22:30:58

by Andrew Morton

Subject: Re: [PATCH v4 09/10] hugetlbfs: add hugetlbfs_fallocate()

On Wed, 22 Jul 2015 15:23:42 -0700 Mike Kravetz <[email protected]> wrote:

> On 07/22/2015 03:03 PM, Andrew Morton wrote:
> > On Tue, 21 Jul 2015 11:09:43 -0700 Mike Kravetz <[email protected]> wrote:
> ...
> >> +
> >> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> >> + return -EOPNOTSUPP;
> >
> > EOPNOTSUPP is a networking thing. It's inappropriate here.
> >
> > The problem is that if this error is ever returned to userspace, the
> > user will be sitting looking at "Operation not supported on transport
> > endpoint" and wondering what went wrong in the networking stack.
>
> Trying to follow FALLOCATE(2) man page:
>
> "EOPNOTSUPP
> The filesystem containing the file referred to by fd does not
> support this operation; or the mode is not supported by the
> filesystem containing the file referred to by fd."
>

Sigh.

2015-07-22 22:34:47

by Davidlohr Bueso

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 2015-07-22 at 15:30 -0700, Andrew Morton wrote:
> selftests is a pretty scrappy place. It's partly a dumping ground for
> things so useful test code doesn't just get lost and bitrotted. Partly
> a framework so people who add features can easily test them. Partly to
> provide tools to architecture maintainers when they wire up new
> syscalls and the like.

Yeah, ipc, for instance, also sucks _badly_ in selftests.

2015-07-22 22:37:56

by Andrew Morton

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 22 Jul 2015 15:34:34 -0700 Davidlohr Bueso <[email protected]> wrote:

> On Wed, 2015-07-22 at 15:30 -0700, Andrew Morton wrote:
> > selftests is a pretty scrappy place. It's partly a dumping ground for
> > things so useful test code doesn't just get lost and bitrotted. Partly
> > a framework so people who add features can easily test them. Partly to
> > provide tools to architecture maintainers when they wire up new
> > syscalls and the like.
>
> Yeah, ipc, for instance, also sucks _badly_ in selftests.

What testsuite should people be using for IPC?

2015-07-22 22:51:13

by Davidlohr Bueso

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 2015-07-22 at 15:37 -0700, Andrew Morton wrote:
> On Wed, 22 Jul 2015 15:34:34 -0700 Davidlohr Bueso <[email protected]> wrote:
>
> > On Wed, 2015-07-22 at 15:30 -0700, Andrew Morton wrote:
> > > selftests is a pretty scrappy place. It's partly a dumping ground for
> > > things so useful test code doesn't just get lost and bitrotted. Partly
> > > a framework so people who add features can easily test them. Partly to
> > > provide tools to architecture maintainers when they wire up new
> > > syscalls and the like.
> >
> > Yeah, ipc, for instance, also sucks _badly_ in selftests.
>
> What testsuite should people be using for IPC?

The best I've found is using the ipc parts of LTP. It's caught a lot of
bugs in the past. Unsurprisingly, I believe Fengguang has this
automated.

Manfred also has a few for more specific purposes -- which also serve
for performance testing:
https://github.com/manfred-colorfu/ipcsemtest
https://github.com/manfred-colorfu/ipcscale

iirc Dave Hansen also had written some shm-specific tests.

2015-07-22 22:54:14

by Davidlohr Bueso

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 2015-07-22 at 15:50 -0700, Davidlohr Bueso wrote:
> On Wed, 2015-07-22 at 15:37 -0700, Andrew Morton wrote:
> > On Wed, 22 Jul 2015 15:34:34 -0700 Davidlohr Bueso <[email protected]> wrote:
> >
> > > On Wed, 2015-07-22 at 15:30 -0700, Andrew Morton wrote:
> > > > selftests is a pretty scrappy place. It's partly a dumping ground for
> > > > things so useful test code doesn't just get lost and bitrotted. Partly
> > > > a framework so people who add features can easily test them. Partly to
> > > > provide tools to architecture maintainers when they wire up new
> > > > syscalls and the like.
> > >
> > > Yeah, ipc, for instance, also sucks _badly_ in selftests.
> >
> > What testsuite should people be using for IPC?
>
> The best I've found is using the ipc parts of LTP. It's caught a lot of
> bugs in the past. Unsurprisingly, I believe Fengguang has this
> automated.
>
> Manfred also has a few for more specific purposes -- which also serve
> for performance testing:
> https://github.com/manfred-colorfu/ipcsemtest
> https://github.com/manfred-colorfu/ipcscale
>
> iirc Dave Hansen also had written some shm-specific tests.

I also used to rely on multiple Oracle 11g RDBMS workloads and
benchmarks, particularly for dealing with sysv sems. Unfortunately I no
longer have the required licenses :(

2015-07-22 23:20:05

by Mike Kravetz

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On 07/22/2015 03:30 PM, Andrew Morton wrote:
> On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
>
>>>
>>> I didn't know that libhugetlbfs has tests. I wonder if that makes
>>> tools/testing/selftests/vm's hugetlbfstest harmful?
>>
>> Why harmful? Redundant, maybe(?).
>
> The presence of the in-kernel tests will cause people to add stuff to
> them when it would be better if they were to apply that effort to
> making libhugetlbfs better. Or vice versa.
>
> Mike's work is an example. Someone later makes a change to hugetlbfs, runs
> the kernel selftest and says "yay, everything works", unaware that they
> just broke fallocate support.
>
>> Does anyone even use selftests for
>> hugetlbfs regression testing? Lets see, we also have these:
>>
>> - hugepage-{mmap,shm}.c
>> - map_hugetlb.c
>>
>> There's probably a lot of room for improvement here.
>
> selftests is a pretty scrappy place. It's partly a dumping ground for
> things so useful test code doesn't just get lost and bitrotted. Partly
> a framework so people who add features can easily test them. Partly to
> provide tools to architecture maintainers when they wire up new
> syscalls and the like.
>
> Unless there's some good reason to retain the hugetlb part of
> selftests, I'm thinking we should just remove it to avoid
> distracting/misleading people. Or possibly move the libhugetlbfs test
> code into the kernel tree and maintain it there.

Adding Eric as he is the libhugetlbfs maintainer.

I think removing the hugetlb selftests in the kernel and pointing
people to libhugetlbfs is the way to go. From a very quick scan
of the selftests, I would guess libhugetlbfs covers everything
in those tests.

I'm willing to verify the testing provided by selftests is included
in libhugetlbfs, and remove selftests if that is the direction we
want to take.

--
Mike Kravetz

2015-07-22 23:54:17

by Davidlohr Bueso

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 2015-07-22 at 16:18 -0700, Mike Kravetz wrote:
> On 07/22/2015 03:30 PM, Andrew Morton wrote:
> > On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
> >
> >>>
> >>> I didn't know that libhugetlbfs has tests. I wonder if that makes
> >>> tools/testing/selftests/vm's hugetlbfstest harmful?
> >>
> >> Why harmful? Redundant, maybe(?).
> >
> > The presence of the in-kernel tests will cause people to add stuff to
> > them when it would be better if they were to apply that effort to
> > making libhugetlbfs better. Or vice versa.
> >
> > Mike's work is an example. Someone later makes a change to hugetlbfs, runs
> > the kernel selftest and says "yay, everything works", unaware that they
> > just broke fallocate support.
> >
> >> Does anyone even use selftests for
> >> hugetlbfs regression testing? Lets see, we also have these:
> >>
> >> - hugepage-{mmap,shm}.c
> >> - map_hugetlb.c
> >>
> >> There's probably a lot of room for improvement here.
> >
> > selftests is a pretty scrappy place. It's partly a dumping ground for
> > things so useful test code doesn't just get lost and bitrotted. Partly
> > a framework so people who add features can easily test them. Partly to
> > provide tools to architecture maintainers when they wire up new
> > syscalls and the like.
> >
> > Unless there's some good reason to retain the hugetlb part of
> > selftests, I'm thinking we should just remove it to avoid
> > distracting/misleading people. Or possibly move the libhugetlbfs test
> > code into the kernel tree and maintain it there.
>
> Adding Eric as he is the libhugetlbfs maintainer.
>
> I think removing the hugetlb selftests in the kernel and pointing
> people to libhugetlbfs is the way to go. From a very quick scan
> of the selftests, I would guess libhugetlbfs covers everything
> in those tests.

fwiw, I've been trying to push people towards this for a while. Ie:

commit 15610c86fa83ff778eb80d3cfaa71d6acceb628a
Author: Davidlohr Bueso <[email protected]>
Date: Wed Sep 11 14:21:48 2013 -0700

hugepage: mention libhugetlbfs in doc

Explicitly mention/recommend using the libhugetlbfs test cases when
changing related kernel code. Developers that are unaware of the project
can easily miss this and introduce potential regressions that may or may
not be caught by community review.

Also do some cleanups that make the document visually easier to view at a
first glance.

Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

But generally speaking, I doubt this doc is read much.

>
> I'm willing to verify the testing provided by selftests is included
> in libhugetlbfs, and remove selftests if that is the direction we
> want to take.

Ack to this idea and thanks for volunteering.

2015-07-23 00:05:25

by David Rientjes

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 22 Jul 2015, Davidlohr Bueso wrote:

> fwiw, I've been trying to push people towards this for a while. Ie:
>
> commit 15610c86fa83ff778eb80d3cfaa71d6acceb628a
> Author: Davidlohr Bueso <[email protected]>
> Date: Wed Sep 11 14:21:48 2013 -0700
>
> hugepage: mention libhugetlbfs in doc
>
> Explicitly mention/recommend using the libhugetlbfs test cases when
> changing related kernel code. Developers that are unaware of the project
> can easily miss this and introduce potential regressions that may or may
> not be caught by community review.
>
> Also do some cleanups that make the document visually easier to view at a
> first glance.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> But generally speaking, I doubt this doc is read much.
>

The mmap(2) man page cites it specifically as the source of information on
MAP_HUGETLB and it, in turn, directs people to
tools/testing/selftests/vm/map_hugetlb.c. It also mentions libhugetlbfs
as a result of your patch, so perhaps change the man page to point people
directly there?

2015-07-23 15:25:14

by Eric B Munson

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Wed, 22 Jul 2015, Mike Kravetz wrote:

> On 07/22/2015 03:30 PM, Andrew Morton wrote:
> >On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
> >
> >>>
> >>>I didn't know that libhugetlbfs has tests. I wonder if that makes
> >>>tools/testing/selftests/vm's hugetlbfstest harmful?
> >>
> >>Why harmful? Redundant, maybe(?).
> >
> >The presence of the in-kernel tests will cause people to add stuff to
> >them when it would be better if they were to apply that effort to
> >making libhugetlbfs better. Or vice versa.
> >
> >Mike's work is an example. Someone later makes a change to hugetlbfs, runs
> >the kernel selftest and says "yay, everything works", unaware that they
> >just broke fallocate support.
> >
> >>Does anyone even use selftests for
> >>hugetlbfs regression testing? Lets see, we also have these:
> >>
> >>- hugepage-{mmap,shm}.c
> >>- map_hugetlb.c
> >>
> >>There's probably a lot of room for improvement here.
> >
> >selftests is a pretty scrappy place. It's partly a dumping ground for
> >things so useful test code doesn't just get lost and bitrotted. Partly
> >a framework so people who add features can easily test them. Partly to
> >provide tools to architecture maintainers when they wire up new
> >syscalls and the like.
> >
> >Unless there's some good reason to retain the hugetlb part of
> >selftests, I'm thinking we should just remove it to avoid
> >distracting/misleading people. Or possibly move the libhugetlbfs test
> >code into the kernel tree and maintain it there.
>
> Adding Eric as he is the libhugetlbfs maintainer.
>
> I think removing the hugetlb selftests in the kernel and pointing
> people to libhugetlbfs is the way to go. From a very quick scan
> of the selftests, I would guess libhugetlbfs covers everything
> in those tests.
>
> I'm willing to verify the testing provided by selftests is included
> in libhugetlbfs, and remove selftests if that is the direction we
> want to take.

I would rather see the test suite stay in the library; there are a
number of tests that rely on infrastructure in the library that is not
available in selftests.

I am happy to help with any tests that need to be added/modified in the
library to cover them.



2015-07-23 17:06:56

by Mike Kravetz

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On 07/23/2015 08:17 AM, Eric B Munson wrote:
> On Wed, 22 Jul 2015, Mike Kravetz wrote:
>
>> On 07/22/2015 03:30 PM, Andrew Morton wrote:
>>> On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
>>>
>>>>>
>>>>> I didn't know that libhugetlbfs has tests. I wonder if that makes
>>>>> tools/testing/selftests/vm's hugetlbfstest harmful?
>>>>
>>>> Why harmful? Redundant, maybe(?).
>>>
>>> The presence of the in-kernel tests will cause people to add stuff to
>>> them when it would be better if they were to apply that effort to
>>> making libhugetlbfs better. Or vice versa.
>>>
>>> Mike's work is an example. Someone later makes a change to hugetlbfs, runs
>>> the kernel selftest and says "yay, everything works", unaware that they
>>> just broke fallocate support.
>>>
>>>> Does anyone even use selftests for
>>>> hugetlbfs regression testing? Lets see, we also have these:
>>>>
>>>> - hugepage-{mmap,shm}.c
>>>> - map_hugetlb.c
>>>>
>>>> There's probably a lot of room for improvement here.
>>>
>>> selftests is a pretty scrappy place. It's partly a dumping ground for
>>> things so useful test code doesn't just get lost and bitrotted. Partly
>>> a framework so people who add features can easily test them. Partly to
>>> provide tools to architecture maintainers when they wire up new
>>> syscalls and the like.
>>>
>>> Unless there's some good reason to retain the hugetlb part of
>>> selftests, I'm thinking we should just remove it to avoid
>>> distracting/misleading people. Or possibly move the libhugetlbfs test
>>> code into the kernel tree and maintain it there.
>>
>> Adding Eric as he is the libhugetlbfs maintainer.
>>
>> I think removing the hugetlb selftests in the kernel and pointing
>> people to libhugetlbfs is the way to go. From a very quick scan
>> of the selftests, I would guess libhugetlbfs covers everything
>> in those tests.
>>
>> I'm willing to verify the testing provided by selftests is included
>> in libhugetlbfs, and remove selftests if that is the direction we
>> want to take.
>
> I would rather see the test suite stay in the library, there are a
> number of tests that rely on infrastructure in the library that is not
> available in selftests.
>
> I am happy to help with any tests that need to be added/modified in the
> library to cover.

I thought about this some more and think there are two distinct
groups of users that should be considered.
1) Application developers who simply want to use hugetlb
2) Kernel developers who are modifying hugetlb related code

The application developers will mostly want the information in the
man pages and hugetlbpage.txt, plus the hugetlb selftest programs to
use as sample code to get started. They can also use the libhugetlbfs
man pages/library if they desire. Because of this, I do not
really want to remove the hugetlb selftest programs. There are
no equivalent simple stand-alone programs in libhugetlbfs.

Kernel developers would be more concerned about introducing
regressions. The selftest programs are of limited use for this
purpose. The libhugetlbfs test suite is much more suited for
regression testing.

With this in mind, I suggest:
- Keep the mmap man page reference to Documentation/vm/hugetlbpage.txt
- Small modification to hugetlbpage.txt saying the selftest code is
good for application development examples. And, kernel developers
should use libhugetlbfs test suite for regression testing. In any
case, the sourceforge URL for libhugetlbfs is no longer valid and
needs to be updated.
- Modify the run_vmtests selftest script to print out a message saying
libhugetlbfs should be used for hugetlb regression testing. This
would help catch people who might think the few selftests are
sufficient.

Thoughts?
--
Mike Kravetz

2015-07-23 17:17:32

by Eric B Munson

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On Thu, 23 Jul 2015, Mike Kravetz wrote:

> On 07/23/2015 08:17 AM, Eric B Munson wrote:
> >On Wed, 22 Jul 2015, Mike Kravetz wrote:
> >
> >>On 07/22/2015 03:30 PM, Andrew Morton wrote:
> >>>On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
> >>>
> >>>>>
> >>>>>I didn't know that libhugetlbfs has tests. I wonder if that makes
> >>>>>tools/testing/selftests/vm's hugetlbfstest harmful?
> >>>>
> >>>>Why harmful? Redundant, maybe(?).
> >>>
> >>>The presence of the in-kernel tests will cause people to add stuff to
> >>>them when it would be better if they were to apply that effort to
> >>>making libhugetlbfs better. Or vice versa.
> >>>
> >>>Mike's work is an example. Someone later makes a change to hugetlbfs, runs
> >>>the kernel selftest and says "yay, everything works", unaware that they
> >>>just broke fallocate support.
> >>>
> >>>>Does anyone even use selftests for
> >>>>hugetlbfs regression testing? Lets see, we also have these:
> >>>>
> >>>>- hugepage-{mmap,shm}.c
> >>>>- map_hugetlb.c
> >>>>
> >>>>There's probably a lot of room for improvement here.
> >>>
> >>>selftests is a pretty scrappy place. It's partly a dumping ground for
> >>>things so useful test code doesn't just get lost and bitrotted. Partly
> >>>a framework so people who add features can easily test them. Partly to
> >>>provide tools to architecture maintainers when they wire up new
> >>>syscalls and the like.
> >>>
> >>>Unless there's some good reason to retain the hugetlb part of
> >>>selftests, I'm thinking we should just remove it to avoid
> >>>distracting/misleading people. Or possibly move the libhugetlbfs test
> >>>code into the kernel tree and maintain it there.
> >>
> >>Adding Eric as he is the libhugetlbfs maintainer.
> >>
> >>I think removing the hugetlb selftests in the kernel and pointing
> >>people to libhugetlbfs is the way to go. From a very quick scan
> >>of the selftests, I would guess libhugetlbfs covers everything
> >>in those tests.
> >>
> >>I'm willing to verify the testing provided by selftests is included
> >>in libhugetlbfs, and remove selftests if that is the direction we
> >>want to take.
> >
> >I would rather see the test suite stay in the library, there are a
> >number of tests that rely on infrastructure in the library that is not
> >available in selftests.
> >
> >I am happy to help with any tests that need to be added/modified in the
> >library to cover.
>
> I thought about this some more and think there are two distinct
> groups of users that should be considered.
> 1) Application developers who simply want to use hugetlb
> 2) Kernel developers who are modifying hugetlb related code
>
> The application developers will mostly want information in the
> man pages, hugetlbpage.txt and hugetlb selftest programs to use
> as sample code to get started. They can also use libhugetlbfs
> man pages/library if they desire. Because of this, I do not
> really want to remove the hugetlb selftest programs. There are
> no equivalent simple stand alone programs in libhugetlbfs.
>
> Kernel developers would be more concerned about introducing
> regressions. The selftest programs are of limited use for this
> purpose. The libhugetlbfs test suite is much more suited for
> regression testing.
>
> With this in mind, I suggest:
> - Keep the mmap man page reference to Documentation/vm/hugetlbpage.txt
> - Small modification to hugetlbpage.txt saying the selftest code is
> good for application development examples. And, kernel developers
> should use libhugetlbfs test suite for regression testing. In any
> case, the sourceforge URL for libhugetlbfs is no longer valid and
> needs to be updated.
> - Modify the run_vmtests selftest script to print out a message saying
> libhugetlbfs should be used for hugetlb regression testing. This
> would help catch people who might think the few selftests are
> sufficient.
>
> Thoughts?

There are a number of tests in the libhugetlbfs suite that cover kernel
problems; are you suggesting that we move all these tests out of
libhugetlbfs and into selftests? I don't think we should separate the
responsibility for testing kernel regressions, so wherever they end up,
they should all be together. The libhugetlbfs suite has some nice
features for setting up the test environment (consider that a plug to
move tests in that direction).



2015-07-23 17:30:38

by Mike Kravetz

Subject: Re: [PATCH v4 00/10] hugetlbfs: add fallocate support

On 07/23/2015 10:17 AM, Eric B Munson wrote:
> On Thu, 23 Jul 2015, Mike Kravetz wrote:
>
>> On 07/23/2015 08:17 AM, Eric B Munson wrote:
>>> On Wed, 22 Jul 2015, Mike Kravetz wrote:
>>>
>>>> On 07/22/2015 03:30 PM, Andrew Morton wrote:
>>>>> On Wed, 22 Jul 2015 15:19:54 -0700 Davidlohr Bueso <[email protected]> wrote:
>>>>>
>>>>>>>
>>>>>>> I didn't know that libhugetlbfs has tests. I wonder if that makes
>>>>>>> tools/testing/selftests/vm's hugetlbfstest harmful?
>>>>>>
>>>>>> Why harmful? Redundant, maybe(?).
>>>>>
>>>>> The presence of the in-kernel tests will cause people to add stuff to
>>>>> them when it would be better if they were to apply that effort to
>>>>> making libhugetlbfs better. Or vice versa.
>>>>>
>>>>> Mike's work is an example. Someone later makes a change to hugetlbfs, runs
>>>>> the kernel selftest and says "yay, everything works", unaware that they
>>>>> just broke fallocate support.
>>>>>
>>>>>> Does anyone even use selftests for
>>>>>> hugetlbfs regression testing? Lets see, we also have these:
>>>>>>
>>>>>> - hugepage-{mmap,shm}.c
>>>>>> - map_hugetlb.c
>>>>>>
>>>>>> There's probably a lot of room for improvement here.
>>>>>
>>>>> selftests is a pretty scrappy place. It's partly a dumping ground for
>>>>> things so useful test code doesn't just get lost and bitrotted. Partly
>>>>> a framework so people who add features can easily test them. Partly to
>>>>> provide tools to architecture maintainers when they wire up new
>>>>> syscalls and the like.
>>>>>
>>>>> Unless there's some good reason to retain the hugetlb part of
>>>>> selftests, I'm thinking we should just remove it to avoid
>>>>> distracting/misleading people. Or possibly move the libhugetlbfs test
>>>>> code into the kernel tree and maintain it there.
>>>>
>>>> Adding Eric as he is the libhugetlbfs maintainer.
>>>>
>>>> I think removing the hugetlb selftests in the kernel and pointing
>>>> people to libhugetlbfs is the way to go. From a very quick scan
>>>> of the selftests, I would guess libhugetlbfs covers everything
>>>> in those tests.
>>>>
>>>> I'm willing to verify the testing provided by selftests is included
>>>> in libhugetlbfs, and remove selftests if that is the direction we
>>>> want to take.
>>>
>>> I would rather see the test suite stay in the library, there are a
>>> number of tests that rely on infrastructure in the library that is not
>>> available in selftests.
>>>
>>> I am happy to help with any tests that need to be added/modified in the
>>> library to cover.
>>
>> I thought about this some more and think there are two distinct
>> groups of users that should be considered.
>> 1) Application developers who simply want to use hugetlb
>> 2) Kernel developers who are modifying hugetlb related code
>>
>> The application developers will mostly want information in the
>> man pages, hugetlbpage.txt and hugetlb selftest programs to use
>> as sample code to get started. They can also use libhugetlbfs
>> man pages/library if they desire. Because of this, I do not
>> really want to remove the hugetlb selftest programs. There are
>> no equivalent simple stand alone programs in libhugetlbfs.
>>
>> Kernel developers would be more concerned about introducing
>> regressions. The selftest programs are of limited use for this
>> purpose. The libhugetlbfs test suite is much more suited for
>> regression testing.
>>
>> With this in mind, I suggest:
>> - Keep the mmap man page reference to Documentation/vm/hugetlbpage.txt
>> - Small modification to hugetlbpage.txt saying the selftest code is
>> good for application development examples. And, kernel developers
>> should use libhugetlbfs test suite for regression testing. In any
>> case, the sourceforge URL for libhugetlbfs is no longer valid and
>> needs to be updated.
>> - Modify the run_vmtests selftest script to print out a message saying
>> libhugetlbfs should be used for hugetlb regression testing. This
>> would help catch people who might think the few selftests are
>> sufficient.
>>
>> Thoughts?
>
> There are a number of tests in the libhugetlbfs suite that cover kernel
> problems, are you suggesting that we move all these tests out of
> libhugetlbfs and into selftests? I don't think we should separate the
> responsibility for testing kernel regressions so where ever they end up,
> they should all be together. The libhugetlbfs suite has some nice
> features for setting up the test environment (consider that a plug to
> move tests in that direction).

No, not suggesting we move anything out of libhugetlbfs. I believe
that should be the primary test suite for hugetlb.

However, the few programs in selftest do provide some value IMO.
They are examples of hugetlb usage without any of the libhugetlbfs
infrastructure present. Ideally, there would be some place to put
this sample code. I cannot think of an ideal location.

--
Mike Kravetz