2018-01-03 09:32:30

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 0/6] mm, hugetlb: allocation API and migration improvements

Hi,
I've posted this as an RFC [1] and both Mike and Naoya seem to be OK
both with patches and the approach. I have rebased this on top of [2]
because there is a small conflict in mm/mempolicy.c. I know it is late
in the release cycle but similarly to [2] I would really like to see
this in linux-next for a longer time for a wider testing exposure.

Motivation:
this is a follow up for [3] for the allocation API and [4] for the
hugetlb migration. It wasn't really easy to split those into two
separate patch series as they share some code.

My primary motivation to touch this code is to make the gigantic pages
migration working. The giga pages allocation code is just too fragile
and hacked into the hugetlb code now. This series tries to move giga
pages closer to the first class citizen. We are not there yet but having
5 patches is quite a lot already and it will already make the code much
easier to follow. I will come with other changes on top after this sees
some review.

The first two patches should be trivial to review. The third patch
changes the way how we migrate huge pages. Newly allocated pages are a
subject of the overcommit check and they participate surplus accounting
which is quite unfortunate as the changelog explains. This patch doesn't
change anything wrt. giga pages.
Patch #4 removes the surplus accounting hack from
__alloc_surplus_huge_page. I hope I didn't miss anything there and a
deeper review is really due there.
Patch #5 finally unifies allocation paths and giga pages shouldn't be
any special anymore. There is also some renaming going on as well.

Shortlog
Michal Hocko (6):
mm, hugetlb: unify core page allocation accounting and initialization
mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
mm, hugetlb: do not rely on overcommit limit during migration
mm, hugetlb: get rid of surplus page accounting tricks
mm, hugetlb: further simplify hugetlb allocation API
hugetlb, mempolicy: fix the mbind hugetlb migration

Diffstat:
include/linux/hugetlb.h | 8 +-
mm/hugetlb.c | 338 +++++++++++++++++++++++++++---------------------
mm/mempolicy.c | 3 +-
mm/migrate.c | 3 +-
4 files changed, 198 insertions(+), 154 deletions(-)


[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/[email protected]
[4] http://lkml.kernel.org/r/[email protected]


2018-01-03 09:32:35

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/6] mm, hugetlb: integrate giga hugetlb more naturally to the allocation path

From: Michal Hocko <[email protected]>

Gigantic hugetlb pages were ingrown to the hugetlb code as an alien
specie with a lot of special casing. The allocation path is not an
exception. Unnecessarily so to be honest. It is true that the underlying
allocator is different but that is an implementation detail.

This patch unifies the hugetlb allocation path that a prepares fresh
pool pages. alloc_fresh_gigantic_page basically copies alloc_fresh_huge_page
logic so we can move everything there. This will simplify set_max_huge_pages
which doesn't have to care about what kind of huge page we allocate.

Changes since RFC
- compile fix for !CONFIG_ARCH_HAS_GIGANTIC_PAGE

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/hugetlb.c | 55 ++++++++++++++-----------------------------------------
1 file changed, 14 insertions(+), 41 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8959667f539..360765156c7c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1106,7 +1106,8 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
}

-static struct page *alloc_gigantic_page(int nid, struct hstate *h)
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
{
unsigned int order = huge_page_order(h);
unsigned long nr_pages = 1 << order;
@@ -1114,11 +1115,9 @@ static struct page *alloc_gigantic_page(int nid, struct hstate *h)
struct zonelist *zonelist;
struct zone *zone;
struct zoneref *z;
- gfp_t gfp_mask;

- gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
zonelist = node_zonelist(nid, gfp_mask);
- for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), NULL) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) {
spin_lock_irqsave(&zone->lock, flags);

pfn = ALIGN(zone->zone_start_pfn, nr_pages);
@@ -1149,42 +1148,13 @@ static struct page *alloc_gigantic_page(int nid, struct hstate *h)
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
static void prep_compound_gigantic_page(struct page *page, unsigned int order);

-static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
-{
- struct page *page;
-
- page = alloc_gigantic_page(nid, h);
- if (page) {
- prep_compound_gigantic_page(page, huge_page_order(h));
- prep_new_huge_page(h, page, nid);
- put_page(page); /* free it into the hugepage allocator */
- }
-
- return page;
-}
-
-static int alloc_fresh_gigantic_page(struct hstate *h,
- nodemask_t *nodes_allowed)
-{
- struct page *page = NULL;
- int nr_nodes, node;
-
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- page = alloc_fresh_gigantic_page_node(h, node);
- if (page)
- return 1;
- }
-
- return 0;
-}
-
#else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
static inline bool gigantic_page_supported(void) { return false; }
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask) { return NULL; }
static inline void free_gigantic_page(struct page *page, unsigned int order) { }
static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
-static inline int alloc_fresh_gigantic_page(struct hstate *h,
- nodemask_t *nodes_allowed) { return 0; }
#endif

static void update_and_free_page(struct hstate *h, struct page *page)
@@ -1410,8 +1380,12 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
- node, nodes_allowed);
+ if (hstate_is_gigantic(h))
+ page = alloc_gigantic_page(h, gfp_mask,
+ node, nodes_allowed);
+ else
+ page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
+ node, nodes_allowed);
if (page)
break;

@@ -1420,6 +1394,8 @@ static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
if (!page)
return 0;

+ if (hstate_is_gigantic(h))
+ prep_compound_gigantic_page(page, huge_page_order(h));
prep_new_huge_page(h, page, page_to_nid(page));
put_page(page); /* free it into the hugepage allocator */

@@ -2307,10 +2283,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
/* yield cpu to avoid soft lockup */
cond_resched();

- if (hstate_is_gigantic(h))
- ret = alloc_fresh_gigantic_page(h, nodes_allowed);
- else
- ret = alloc_fresh_huge_page(h, nodes_allowed);
+ ret = alloc_fresh_huge_page(h, nodes_allowed);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
--
2.15.1

2018-01-03 09:32:43

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 4/6] mm, hugetlb: get rid of surplus page accounting tricks

From: Michal Hocko <[email protected]>

alloc_surplus_huge_page increases the pool size and the number of
surplus pages opportunistically to prevent from races with the pool size
change. See d1c3fb1f8f29 ("hugetlb: introduce nr_overcommit_hugepages
sysctl") for more details.

The resulting code is unnecessarily hairy, cause code duplication and
doesn't allow to share the allocation paths. Moreover pool size changes
tend to be very seldom so optimizing for them is not really reasonable.
Simplify the code and allow to allocate a fresh surplus page as long as
we are under the overcommit limit and then recheck the condition after
the allocation and drop the new page if the situation has changed. This
should provide a reasonable guarantee that an abrupt allocation requests
will not go way off the limit.

If we consider races with the pool shrinking and enlarging then we
should be reasonably safe as well. In the first case we are off by one
in the worst case and the second case should work OK because the page is
not yet visible. We can waste CPU cycles for the allocation but that
should be acceptable for a relatively rare condition.

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/hugetlb.c | 62 ++++++++++++++++++++++--------------------------------------
1 file changed, 23 insertions(+), 39 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f260ffa26363..7dc80cbe8e89 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1540,62 +1540,46 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
- struct page *page;
- unsigned int r_nid;
+ struct page *page = NULL;

if (hstate_is_gigantic(h))
return NULL;

- /*
- * Assume we will successfully allocate the surplus page to
- * prevent racing processes from causing the surplus to exceed
- * overcommit
- *
- * This however introduces a different race, where a process B
- * tries to grow the static hugepage pool while alloc_pages() is
- * called by process A. B will only examine the per-node
- * counters in determining if surplus huge pages can be
- * converted to normal huge pages in adjust_pool_surplus(). A
- * won't be able to increment the per-node counter, until the
- * lock is dropped by B, but B doesn't drop hugetlb_lock until
- * no more huge pages can be converted from surplus to normal
- * state (and doesn't try to convert again). Thus, we have a
- * case where a surplus huge page exists, the pool is grown, and
- * the surplus huge page still exists after, even though it
- * should just have been converted to a normal huge page. This
- * does not leak memory, though, as the hugepage will be freed
- * once it is out of use. It also does not allow the counters to
- * go out of whack in adjust_pool_surplus() as we don't modify
- * the node values until we've gotten the hugepage and only the
- * per-node value is checked there.
- */
spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
- spin_unlock(&hugetlb_lock);
- return NULL;
- } else {
- h->nr_huge_pages++;
- h->surplus_huge_pages++;
- }
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages)
+ goto out_unlock;
spin_unlock(&hugetlb_lock);

page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
+ if (!page)
+ goto out_unlock;

spin_lock(&hugetlb_lock);
- if (page) {
+ /*
+ * We could have raced with the pool size change.
+ * Double check that and simply deallocate the new page
+ * if we would end up overcommiting the surpluses. Abuse
+ * temporary page to workaround the nasty free_huge_page
+ * codeflow
+ */
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
+ SetPageHugeTemporary(page);
+ put_page(page);
+ page = NULL;
+ } else {
+ int r_nid;
+
+ h->surplus_huge_pages++;
+ h->nr_huge_pages++;
INIT_LIST_HEAD(&page->lru);
r_nid = page_to_nid(page);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
set_hugetlb_cgroup(page, NULL);
- /*
- * We incremented the global counters already
- */
h->nr_huge_pages_node[r_nid]++;
h->surplus_huge_pages_node[r_nid]++;
- } else {
- h->nr_huge_pages--;
- h->surplus_huge_pages--;
}
+
+out_unlock:
spin_unlock(&hugetlb_lock);

return page;
--
2.15.1

2018-01-03 09:32:42

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

From: Michal Hocko <[email protected]>

Hugetlb allocator has several layer of allocation functions depending
and the purpose of the allocation. There are two allocators depending
on whether the page can be allocated from the page allocator or we need
a contiguous allocator. This is currently opencoded in alloc_fresh_huge_page
which is the only path that might allocate giga pages which require the
later allocator. Create alloc_fresh_huge_page which hides this
implementation detail and use it in all callers which hardcoded the
buddy allocator path (__hugetlb_alloc_buddy_huge_page). This shouldn't
introduce any funtional change because both migration and surplus
allocators exlude giga pages explicitly.

While we are at it let's do some renaming. The current scheme is not
consistent and overly painfull to read and understand. Get rid of prefix
underscores from most functions. There is no real reason to make names
longer.
* alloc_fresh_huge_page is the new layer to abstract underlying
allocator
* __hugetlb_alloc_buddy_huge_page becomes shorter and neater
alloc_buddy_huge_page.
* Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
the new page directly to the pool
* alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
as it uses alloc_fresh_huge_page now
* others lose their excessive prefix underscores to make names shorter

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/hugetlb.c | 78 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 42 insertions(+), 36 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7dc80cbe8e89..60acd3e93a95 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1378,7 +1378,7 @@ pgoff_t __basepage_index(struct page *page)
return (index << compound_order(page_head)) + compound_idx;
}

-static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
+static struct page *alloc_buddy_huge_page(struct hstate *h,
gfp_t gfp_mask, int nid, nodemask_t *nmask)
{
int order = huge_page_order(h);
@@ -1396,34 +1396,49 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
return page;
}

+/*
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ */
+static struct page *alloc_fresh_huge_page(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask)
+{
+ struct page *page;
+
+ if (hstate_is_gigantic(h))
+ page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
+ else
+ page = alloc_buddy_huge_page(h, gfp_mask,
+ nid, nmask);
+ if (!page)
+ return NULL;
+
+ if (hstate_is_gigantic(h))
+ prep_compound_gigantic_page(page, huge_page_order(h));
+ prep_new_huge_page(h, page, page_to_nid(page));
+
+ return page;
+}
+
/*
* Allocates a fresh page to the hugetlb allocator pool in the node interleaved
* manner.
*/
-static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
struct page *page;
int nr_nodes, node;
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- if (hstate_is_gigantic(h))
- page = alloc_gigantic_page(h, gfp_mask,
- node, nodes_allowed);
- else
- page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
- node, nodes_allowed);
+ page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
if (page)
break;
-
}

if (!page)
return 0;

- if (hstate_is_gigantic(h))
- prep_compound_gigantic_page(page, huge_page_order(h));
- prep_new_huge_page(h, page, page_to_nid(page));
put_page(page); /* free it into the hugepage allocator */

return 1;
@@ -1537,7 +1552,7 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
/*
* Allocates a fresh surplus page from the page allocator.
*/
-static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
+static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
struct page *page = NULL;
@@ -1550,7 +1565,7 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
goto out_unlock;
spin_unlock(&hugetlb_lock);

- page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
+ page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
if (!page)
goto out_unlock;

@@ -1567,16 +1582,8 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
put_page(page);
page = NULL;
} else {
- int r_nid;
-
h->surplus_huge_pages++;
- h->nr_huge_pages++;
- INIT_LIST_HEAD(&page->lru);
- r_nid = page_to_nid(page);
- set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
- set_hugetlb_cgroup(page, NULL);
- h->nr_huge_pages_node[r_nid]++;
- h->surplus_huge_pages_node[r_nid]++;
+ h->nr_huge_pages_node[page_to_nid(page)]++;
}

out_unlock:
@@ -1585,7 +1592,7 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
return page;
}

-static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
struct page *page;
@@ -1593,7 +1600,7 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
if (hstate_is_gigantic(h))
return NULL;

- page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
+ page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
if (!page)
return NULL;

@@ -1601,7 +1608,6 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
* We do not account these pages as surplus because they are only
* temporary and will be released properly on the last reference
*/
- prep_new_huge_page(h, page, page_to_nid(page));
SetPageHugeTemporary(page);

return page;
@@ -1611,7 +1617,7 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
* Use the VMA's mpolicy to allocate a huge page from the buddy.
*/
static
-struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
+struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
@@ -1621,7 +1627,7 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;

nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
- page = __alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
+ page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
mpol_cond_put(mpol);

return page;
@@ -1642,7 +1648,7 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
spin_unlock(&hugetlb_lock);

if (!page)
- page = __alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+ page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL);

return page;
}
@@ -1665,7 +1671,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
}
spin_unlock(&hugetlb_lock);

- return __alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
+ return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}

/*
@@ -1693,7 +1699,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h),
+ page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
NUMA_NO_NODE, NULL);
if (!page) {
alloc_ok = false;
@@ -2030,7 +2036,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
if (!page) {
spin_unlock(&hugetlb_lock);
- page = __alloc_buddy_huge_page_with_mpol(h, vma, addr);
+ page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
if (!page)
goto out_uncharge_cgroup;
if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
@@ -2170,7 +2176,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
if (hstate_is_gigantic(h)) {
if (!alloc_bootmem_huge_page(h))
break;
- } else if (!alloc_fresh_huge_page(h,
+ } else if (!alloc_pool_huge_page(h,
&node_states[N_MEMORY]))
break;
cond_resched();
@@ -2290,7 +2296,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
* First take pages out of surplus state. Then make up the
* remaining difference by allocating fresh huge pages.
*
- * We might race with __alloc_surplus_huge_page() here and be unable
+ * We might race with alloc_surplus_huge_page() here and be unable
* to convert a surplus huge page to a normal huge page. That is
* not critical, though, it just means the overall size of the
* pool might be one hugepage larger than it needs to be, but
@@ -2313,7 +2319,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
/* yield cpu to avoid soft lockup */
cond_resched();

- ret = alloc_fresh_huge_page(h, nodes_allowed);
+ ret = alloc_pool_huge_page(h, nodes_allowed);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -2333,7 +2339,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
* By placing pages into the surplus state independent of the
* overcommit value, we are allowing the surplus pool size to
* exceed overcommit. There are few sane options here. Since
- * __alloc_surplus_huge_page() is checking the global counter,
+ * alloc_surplus_huge_page() is checking the global counter,
* though, we'll note that we're not allowed to exceed surplus
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
--
2.15.1

2018-01-03 09:33:21

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 6/6] hugetlb, mempolicy: fix the mbind hugetlb migration

From: Michal Hocko <[email protected]>

do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
pages. alloc_huge_page_noerr uses alloc_huge_page which is a highlevel
allocation function which has to take care of reserves, overcommit or
hugetlb cgroup accounting. None of that is really required for the page
migration because the new page is only temporal and either will replace
the original page or it will be dropped. This is essentially as for
other migration call paths and there shouldn't be any reason to handle
mbind in a special way.

The current implementation is even suboptimal because the migration
might fail just because the hugetlb cgroup limit is reached, or the
overcommit is saturated.

Fix this by making mbind like other hugetlb migration paths. Add
a new migration helper alloc_huge_page_vma as a wrapper around
alloc_huge_page_nodemask with additional mempolicy handling.

alloc_huge_page_noerr has no more users and it can go.

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/hugetlb.h | 5 ++---
mm/hugetlb.c | 33 +++++++++++++++++++--------------
mm/mempolicy.c | 3 +--
3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 66992348531e..612a29b7f6c6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -356,10 +356,9 @@ struct huge_bootmem_page {
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_node(struct hstate *h, int nid);
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask);
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address);
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);

@@ -537,7 +536,7 @@ struct hstate {};
#define alloc_huge_page(v, a, r) NULL
#define alloc_huge_page_node(h, nid) NULL
#define alloc_huge_page_nodemask(h, preferred_nid, nmask) NULL
-#define alloc_huge_page_noerr(v, a, r) NULL
+#define alloc_huge_page_vma(vma, address) NULL
#define alloc_bootmem_huge_page(h) NULL
#define hstate_file(f) NULL
#define hstate_sizelog(s) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 60acd3e93a95..ffcae114ceed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1674,6 +1674,25 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}

+/* mempolicy aware migration callback */
+struct page *alloc_huge_page_vma(struct vm_area_struct *vma, unsigned long address)
+{
+ struct mempolicy *mpol;
+ nodemask_t *nodemask;
+ struct page *page;
+ struct hstate *h;
+ gfp_t gfp_mask;
+ int node;
+
+ h = hstate_vma(vma);
+ gfp_mask = htlb_alloc_mask(h);
+ node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+ page = alloc_huge_page_nodemask(h, node, nodemask);
+ mpol_cond_put(mpol);
+
+ return page;
+}
+
/*
* Increase the hugetlb pool such that it can accommodate a reservation
* of size 'delta'.
@@ -2079,20 +2098,6 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
return ERR_PTR(-ENOSPC);
}

-/*
- * alloc_huge_page()'s wrapper which simply returns the page if allocation
- * succeeds, otherwise NULL. This function is called from new_vma_page(),
- * where no ERR_VALUE is expected to be returned.
- */
-struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
-{
- struct page *page = alloc_huge_page(vma, addr, avoid_reserve);
- if (IS_ERR(page))
- page = NULL;
- return page;
-}
-
int alloc_bootmem_huge_page(struct hstate *h)
__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
int __alloc_bootmem_huge_page(struct hstate *h)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b6f4fcf9df64..30e68da64873 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1097,8 +1097,7 @@ static struct page *new_page(struct page *page, unsigned long start)
}

if (PageHuge(page)) {
- BUG_ON(!vma);
- return alloc_huge_page_noerr(vma, address, 1);
+ return alloc_huge_page_vma(vma, address);
} else if (PageTransHuge(page)) {
struct page *thp;

--
2.15.1

2018-01-03 09:33:50

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 3/6] mm, hugetlb: do not rely on overcommit limit during migration

From: Michal Hocko <[email protected]>

hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.
1) it doesn't allow to migrate any huge page if the pool is used
completely which is not an exceptional case as the pool is static and
unused memory is just wasted.
2) it leads to a weird semantic when migration between two numa nodes
might increase the pool size of the destination NUMA node while the page
is in use. The issue is caused by per NUMA node surplus pages tracking
(see free_huge_page).

Address both issues by changing the way how we allocate and account
pages allocated for migration. Those should temporal by definition.
So we mark them that way (we will abuse page flags in the 3rd page)
and update free_huge_page to free such pages to the page allocator.
Page migration path then just transfers the temporal status from the
new page to the old one which will be freed on the last reference.
The global surplus count will never change during this path but we still
have to be careful when migrating a per-node suprlus page. This is now
handled in move_hugetlb_state which is called from the migration path
and it copies the hugetlb specific page state and fixes up the
accounting when needed

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose. The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user visible effect of this patch is that migrated pages are really
temporal and they travel between NUMA nodes as per the migration
request:
Before migration
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After

/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

with the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/hugetlb.h | 3 ++
mm/hugetlb.c | 111 +++++++++++++++++++++++++++++++++++++++++-------
mm/migrate.c | 3 +-
3 files changed, 99 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 944e6e8bd572..66992348531e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -119,6 +119,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
+void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
void hugetlb_fix_reserve_counts(struct inode *inode);
extern struct mutex *hugetlb_fault_mutex_table;
@@ -157,6 +158,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

bool is_hugetlb_entry_migration(pte_t pte);
+
#else /* !CONFIG_HUGETLB_PAGE */

static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
@@ -197,6 +199,7 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
return false;
}
#define putback_active_hugepage(p) do {} while (0)
+#define move_hugetlb_state(old, new, reason) do {} while (0)

static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 360765156c7c..f260ffa26363 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
#include <linux/hugetlb_cgroup.h>
#include <linux/node.h>
#include <linux/userfaultfd_k.h>
+#include <linux/page_owner.h>
#include "internal.h"

int hugetlb_max_hstate __read_mostly;
@@ -1219,6 +1220,28 @@ static void clear_page_huge_active(struct page *page)
ClearPagePrivate(&page[1]);
}

+/*
+ * Internal hugetlb specific page flag. Do not use outside of the hugetlb
+ * code
+ */
+static inline bool PageHugeTemporary(struct page *page)
+{
+ if (!PageHuge(page))
+ return false;
+
+ return (unsigned long)page[2].mapping == -1U;
+}
+
+static inline void SetPageHugeTemporary(struct page *page)
+{
+ page[2].mapping = (void *)-1U;
+}
+
+static inline void ClearPageHugeTemporary(struct page *page)
+{
+ page[2].mapping = NULL;
+}
+
void free_huge_page(struct page *page)
{
/*
@@ -1253,7 +1276,11 @@ void free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;

- if (h->surplus_huge_pages_node[nid]) {
+ if (PageHugeTemporary(page)) {
+ list_del(&page->lru);
+ ClearPageHugeTemporary(page);
+ update_and_free_page(h, page);
+ } else if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
list_del(&page->lru);
update_and_free_page(h, page);
@@ -1507,7 +1534,10 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
return rc;
}

-static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
+/*
+ * Allocates a fresh surplus page from the page allocator.
+ */
+static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
struct page *page;
@@ -1571,6 +1601,28 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
return page;
}

+static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+ int nid, nodemask_t *nmask)
+{
+ struct page *page;
+
+ if (hstate_is_gigantic(h))
+ return NULL;
+
+ page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
+ if (!page)
+ return NULL;
+
+ /*
+ * We do not account these pages as surplus because they are only
+ * temporary and will be released properly on the last reference
+ */
+ prep_new_huge_page(h, page, page_to_nid(page));
+ SetPageHugeTemporary(page);
+
+ return page;
+}
+
/*
* Use the VMA's mpolicy to allocate a huge page from the buddy.
*/
@@ -1585,17 +1637,13 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
nodemask_t *nodemask;

nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
- page = __alloc_buddy_huge_page(h, gfp_mask, nid, nodemask);
+ page = __alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
mpol_cond_put(mpol);

return page;
}

-/*
- * This allocation function is useful in the context where vma is irrelevant.
- * E.g. soft-offlining uses this function because it only cares physical
- * address of error page.
- */
+/* page migration callback function */
struct page *alloc_huge_page_node(struct hstate *h, int nid)
{
gfp_t gfp_mask = htlb_alloc_mask(h);
@@ -1610,12 +1658,12 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
spin_unlock(&hugetlb_lock);

if (!page)
- page = __alloc_buddy_huge_page(h, gfp_mask, nid, NULL);
+ page = __alloc_migrate_huge_page(h, gfp_mask, nid, NULL);

return page;
}

-
+/* page migration callback function */
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask)
{
@@ -1633,9 +1681,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
}
spin_unlock(&hugetlb_lock);

- /* No reservations, try to overcommit */
-
- return __alloc_buddy_huge_page(h, gfp_mask, preferred_nid, nmask);
+ return __alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}

/*
@@ -1663,7 +1709,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = __alloc_buddy_huge_page(h, htlb_alloc_mask(h),
+ page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h),
NUMA_NO_NODE, NULL);
if (!page) {
alloc_ok = false;
@@ -2260,7 +2306,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
* First take pages out of surplus state. Then make up the
* remaining difference by allocating fresh huge pages.
*
- * We might race with __alloc_buddy_huge_page() here and be unable
+ * We might race with __alloc_surplus_huge_page() here and be unable
* to convert a surplus huge page to a normal huge page. That is
* not critical, though, it just means the overall size of the
* pool might be one hugepage larger than it needs to be, but
@@ -2303,7 +2349,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
* By placing pages into the surplus state independent of the
* overcommit value, we are allowing the surplus pool size to
* exceed overcommit. There are few sane options here. Since
- * __alloc_buddy_huge_page() is checking the global counter,
+ * __alloc_surplus_huge_page() is checking the global counter,
* though, we'll note that we're not allowed to exceed surplus
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
@@ -4779,3 +4825,36 @@ void putback_active_hugepage(struct page *page)
spin_unlock(&hugetlb_lock);
put_page(page);
}
+
+void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
+{
+ struct hstate *h = page_hstate(oldpage);
+
+ hugetlb_cgroup_migrate(oldpage, newpage);
+ set_page_owner_migrate_reason(newpage, reason);
+
+ /*
+ * transfer temporary state of the new huge page. This is
+ * reverse to other transitions because the newpage is going to
+ * be final while the old one will be freed so it takes over
+ * the temporary status.
+ *
+ * Also note that we have to transfer the per-node surplus state
+ * here as well otherwise the global surplus count will not match
+ * the per-node's.
+ */
+ if (PageHugeTemporary(newpage)) {
+ int old_nid = page_to_nid(oldpage);
+ int new_nid = page_to_nid(newpage);
+
+ SetPageHugeTemporary(oldpage);
+ ClearPageHugeTemporary(newpage);
+
+ spin_lock(&hugetlb_lock);
+ if (h->surplus_huge_pages_node[old_nid]) {
+ h->surplus_huge_pages_node[old_nid]--;
+ h->surplus_huge_pages_node[new_nid]++;
+ }
+ spin_unlock(&hugetlb_lock);
+ }
+}
diff --git a/mm/migrate.c b/mm/migrate.c
index feba2e63e165..3cb0f5955b41 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1310,9 +1310,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
put_anon_vma(anon_vma);

if (rc == MIGRATEPAGE_SUCCESS) {
- hugetlb_cgroup_migrate(hpage, new_hpage);
+ move_hugetlb_state(hpage, new_hpage, reason);
put_new_page = NULL;
- set_page_owner_migrate_reason(new_hpage, reason);
}

unlock_page(hpage);
--
2.15.1

2018-01-03 09:36:04

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1/6] mm, hugetlb: unify core page allocation accounting and initialization

From: Michal Hocko <[email protected]>

hugetlb allocator has two entry points to the page allocator
- alloc_fresh_huge_page_node
- __hugetlb_alloc_buddy_huge_page

The two differ very subtly in two aspects. The first one doesn't care
about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
prep_new_huge_page is not used because it not only initializes hugetlb
specific stuff but because it also put_page and releases the page to
the hugetlb pool which is not what is required in some contexts. This
makes things more complicated than necessary.

Simplify things by a) removing the page allocator entry point duplicity
and only keep __hugetlb_alloc_buddy_huge_page and b) make
prep_new_huge_page more reusable by removing the put_page which moves
the page to the allocator pool. All current callers are updated to call
put_page explicitly. Later patches will add new callers which won't
need it.

This patch shouldn't introduce any functional change.

Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/hugetlb.c | 61 +++++++++++++++++++++++++++++-------------------------------
1 file changed, 29 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4137fb67cd79..a8959667f539 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1157,6 +1157,7 @@ static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
if (page) {
prep_compound_gigantic_page(page, huge_page_order(h));
prep_new_huge_page(h, page, nid);
+ put_page(page); /* free it into the hugepage allocator */
}

return page;
@@ -1304,7 +1305,6 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
}

static void prep_compound_gigantic_page(struct page *page, unsigned int order)
@@ -1381,41 +1381,49 @@ pgoff_t __basepage_index(struct page *page)
return (index << compound_order(page_head)) + compound_idx;
}

-static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
+static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask)
{
+ int order = huge_page_order(h);
struct page *page;

- page = __alloc_pages_node(nid,
- htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
- __GFP_RETRY_MAYFAIL|__GFP_NOWARN,
- huge_page_order(h));
- if (page) {
- prep_new_huge_page(h, page, nid);
- }
+ gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+ if (nid == NUMA_NO_NODE)
+ nid = numa_mem_id();
+ page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
+ if (page)
+ __count_vm_event(HTLB_BUDDY_PGALLOC);
+ else
+ __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);

return page;
}

+/*
+ * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
+ * manner.
+ */
static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
struct page *page;
int nr_nodes, node;
- int ret = 0;
+ gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- page = alloc_fresh_huge_page_node(h, node);
- if (page) {
- ret = 1;
+ page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
+ node, nodes_allowed);
+ if (page)
break;
- }
+
}

- if (ret)
- count_vm_event(HTLB_BUDDY_PGALLOC);
- else
- count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+ if (!page)
+ return 0;

- return ret;
+ prep_new_huge_page(h, page, page_to_nid(page));
+ put_page(page); /* free it into the hugepage allocator */
+
+ return 1;
}

/*
@@ -1523,17 +1531,6 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
return rc;
}

-static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask)
-{
- int order = huge_page_order(h);
-
- gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
- if (nid == NUMA_NO_NODE)
- nid = numa_mem_id();
- return __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
-}
-
static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nmask)
{
@@ -1589,11 +1586,9 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h, gfp_t gfp_mask,
*/
h->nr_huge_pages_node[r_nid]++;
h->surplus_huge_pages_node[r_nid]++;
- __count_vm_event(HTLB_BUDDY_PGALLOC);
} else {
h->nr_huge_pages--;
h->surplus_huge_pages--;
- __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
}
spin_unlock(&hugetlb_lock);

@@ -2148,6 +2143,8 @@ static void __init gather_bootmem_prealloc(void)
prep_compound_huge_page(page, h->order);
WARN_ON(PageReserved(page));
prep_new_huge_page(h, page, page_to_nid(page));
+ put_page(page); /* free it into the hugepage allocator */
+
/*
* If we had gigantic hugepages allocated at boot time, we need
* to restore the 'stolen' pages to totalram_pages in order to
--
2.15.1

2018-01-04 00:05:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/6] mm, hugetlb: allocation API and migration improvements

On Wed, 3 Jan 2018 10:32:07 +0100 Michal Hocko <[email protected]> wrote:

> I've posted this as an RFC [1] and both Mike and Naoya seem to be OK
> both with patches and the approach. I have rebased this on top of [2]
> because there is a small conflict in mm/mempolicy.c. I know it is late
> in the release cycle but similarly to [2] I would really like to see
> this in linux-next for a longer time for a wider testing exposure.

I'm interpreting this to mean "hold for 4.17-rc1"?

2018-01-04 07:33:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/6] mm, hugetlb: allocation API and migration improvements

On Wed 03-01-18 16:05:23, Andrew Morton wrote:
> On Wed, 3 Jan 2018 10:32:07 +0100 Michal Hocko <[email protected]> wrote:
>
> > I've posted this as an RFC [1] and both Mike and Naoya seem to be OK
> > both with patches and the approach. I have rebased this on top of [2]
> > because there is a small conflict in mm/mempolicy.c. I know it is late
> > in the release cycle but similarly to [2] I would really like to see
> > this in linux-next for a longer time for a wider testing exposure.
>
> I'm interpreting this to mean "hold for 4.17-rc1"?

Yeah, that should be good enough. There shouldn't be any reason to rush
this through. I will build more changes on top but that is not critical
either. The longer this will be in linux-next, the better.

Thanks!
--
Michal Hocko
SUSE Labs

2018-02-21 04:41:58

by Dan Rue

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed, Jan 03, 2018 at 10:32:12AM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Hugetlb allocator has several layer of allocation functions depending
> and the purpose of the allocation. There are two allocators depending
> on whether the page can be allocated from the page allocator or we need
> a contiguous allocator. This is currently opencoded in alloc_fresh_huge_page
> which is the only path that might allocate giga pages which require the
> later allocator. Create alloc_fresh_huge_page which hides this
> implementation detail and use it in all callers which hardcoded the
> buddy allocator path (__hugetlb_alloc_buddy_huge_page). This shouldn't
> introduce any funtional change because both migration and surplus
> allocators exlude giga pages explicitly.
>
> While we are at it let's do some renaming. The current scheme is not
> consistent and overly painfull to read and understand. Get rid of prefix
> underscores from most functions. There is no real reason to make names
> longer.
> * alloc_fresh_huge_page is the new layer to abstract underlying
> allocator
> * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
> alloc_buddy_huge_page.
> * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
> the new page directly to the pool
> * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
> as it uses alloc_fresh_huge_page now
> * others lose their excessive prefix underscores to make names shorter

Hi Michal -

We (Linaro) run the libhugetlbfs test suite continuously against
mainline and recently (Feb 1), the 'counters' test started failing on
with the following error:

root@localhost:~# mount_point="/mnt/hugetlb/"
root@localhost:~# echo 200 > /proc/sys/vm/nr_hugepages
root@localhost:~# mkdir -p "${mount_point}"
root@localhost:~# mount -t hugetlbfs hugetlbfs "${mount_point}"
root@localhost:~# export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
root@localhost:~# /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
Base pool size: 0
Clean...
FAIL Line 326: Bad HugePages_Total: expected 0, actual 1

Line 326 refers to the test source @
https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/counters.c#L326

I bisected the failure to this commit. The problem is seen on multiple
architectures (tested x86-64 and arm64).

Thanks,
Dan

>
> Reviewed-by: Mike Kravetz <[email protected]>
> Reviewed-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/hugetlb.c | 78 ++++++++++++++++++++++++++++++++----------------------------
> 1 file changed, 42 insertions(+), 36 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7dc80cbe8e89..60acd3e93a95 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1378,7 +1378,7 @@ pgoff_t __basepage_index(struct page *page)
> return (index << compound_order(page_head)) + compound_idx;
> }
>
> -static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> +static struct page *alloc_buddy_huge_page(struct hstate *h,
> gfp_t gfp_mask, int nid, nodemask_t *nmask)
> {
> int order = huge_page_order(h);
> @@ -1396,34 +1396,49 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
> return page;
> }
>
> +/*
> + * Common helper to allocate a fresh hugetlb page. All specific allocators
> + * should use this function to get new hugetlb pages
> + */
> +static struct page *alloc_fresh_huge_page(struct hstate *h,
> + gfp_t gfp_mask, int nid, nodemask_t *nmask)
> +{
> + struct page *page;
> +
> + if (hstate_is_gigantic(h))
> + page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
> + else
> + page = alloc_buddy_huge_page(h, gfp_mask,
> + nid, nmask);
> + if (!page)
> + return NULL;
> +
> + if (hstate_is_gigantic(h))
> + prep_compound_gigantic_page(page, huge_page_order(h));
> + prep_new_huge_page(h, page, page_to_nid(page));
> +
> + return page;
> +}
> +
> /*
> * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
> * manner.
> */
> -static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> +static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> {
> struct page *page;
> int nr_nodes, node;
> gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>
> for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> - if (hstate_is_gigantic(h))
> - page = alloc_gigantic_page(h, gfp_mask,
> - node, nodes_allowed);
> - else
> - page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask,
> - node, nodes_allowed);
> + page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
> if (page)
> break;
> -
> }
>
> if (!page)
> return 0;
>
> - if (hstate_is_gigantic(h))
> - prep_compound_gigantic_page(page, huge_page_order(h));
> - prep_new_huge_page(h, page, page_to_nid(page));
> put_page(page); /* free it into the hugepage allocator */
>
> return 1;
> @@ -1537,7 +1552,7 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> /*
> * Allocates a fresh surplus page from the page allocator.
> */
> -static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> +static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> int nid, nodemask_t *nmask)
> {
> struct page *page = NULL;
> @@ -1550,7 +1565,7 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> goto out_unlock;
> spin_unlock(&hugetlb_lock);
>
> - page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
> + page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
> if (!page)
> goto out_unlock;
>
> @@ -1567,16 +1582,8 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> put_page(page);
> page = NULL;
> } else {
> - int r_nid;
> -
> h->surplus_huge_pages++;
> - h->nr_huge_pages++;
> - INIT_LIST_HEAD(&page->lru);
> - r_nid = page_to_nid(page);
> - set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> - set_hugetlb_cgroup(page, NULL);
> - h->nr_huge_pages_node[r_nid]++;
> - h->surplus_huge_pages_node[r_nid]++;
> + h->nr_huge_pages_node[page_to_nid(page)]++;
> }
>
> out_unlock:
> @@ -1585,7 +1592,7 @@ static struct page *__alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> return page;
> }
>
> -static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> +static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> int nid, nodemask_t *nmask)
> {
> struct page *page;
> @@ -1593,7 +1600,7 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> if (hstate_is_gigantic(h))
> return NULL;
>
> - page = __hugetlb_alloc_buddy_huge_page(h, gfp_mask, nid, nmask);
> + page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
> if (!page)
> return NULL;
>
> @@ -1601,7 +1608,6 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> * We do not account these pages as surplus because they are only
> * temporary and will be released properly on the last reference
> */
> - prep_new_huge_page(h, page, page_to_nid(page));
> SetPageHugeTemporary(page);
>
> return page;
> @@ -1611,7 +1617,7 @@ static struct page *__alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> * Use the VMA's mpolicy to allocate a huge page from the buddy.
> */
> static
> -struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
> +struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct page *page;
> @@ -1621,7 +1627,7 @@ struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h,
> nodemask_t *nodemask;
>
> nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> - page = __alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> mpol_cond_put(mpol);
>
> return page;
> @@ -1642,7 +1648,7 @@ struct page *alloc_huge_page_node(struct hstate *h, int nid)
> spin_unlock(&hugetlb_lock);
>
> if (!page)
> - page = __alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> + page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
>
> return page;
> }
> @@ -1665,7 +1671,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> }
> spin_unlock(&hugetlb_lock);
>
> - return __alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
> + return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
> }
>
> /*
> @@ -1693,7 +1699,7 @@ static int gather_surplus_pages(struct hstate *h, int delta)
> retry:
> spin_unlock(&hugetlb_lock);
> for (i = 0; i < needed; i++) {
> - page = __alloc_surplus_huge_page(h, htlb_alloc_mask(h),
> + page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
> NUMA_NO_NODE, NULL);
> if (!page) {
> alloc_ok = false;
> @@ -2030,7 +2036,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
> page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
> if (!page) {
> spin_unlock(&hugetlb_lock);
> - page = __alloc_buddy_huge_page_with_mpol(h, vma, addr);
> + page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
> if (!page)
> goto out_uncharge_cgroup;
> if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
> @@ -2170,7 +2176,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> if (hstate_is_gigantic(h)) {
> if (!alloc_bootmem_huge_page(h))
> break;
> - } else if (!alloc_fresh_huge_page(h,
> + } else if (!alloc_pool_huge_page(h,
> &node_states[N_MEMORY]))
> break;
> cond_resched();
> @@ -2290,7 +2296,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> * First take pages out of surplus state. Then make up the
> * remaining difference by allocating fresh huge pages.
> *
> - * We might race with __alloc_surplus_huge_page() here and be unable
> + * We might race with alloc_surplus_huge_page() here and be unable
> * to convert a surplus huge page to a normal huge page. That is
> * not critical, though, it just means the overall size of the
> * pool might be one hugepage larger than it needs to be, but
> @@ -2313,7 +2319,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> /* yield cpu to avoid soft lockup */
> cond_resched();
>
> - ret = alloc_fresh_huge_page(h, nodes_allowed);
> + ret = alloc_pool_huge_page(h, nodes_allowed);
> spin_lock(&hugetlb_lock);
> if (!ret)
> goto out;
> @@ -2333,7 +2339,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> * By placing pages into the surplus state independent of the
> * overcommit value, we are allowing the surplus pool size to
> * exceed overcommit. There are few sane options here. Since
> - * __alloc_surplus_huge_page() is checking the global counter,
> + * alloc_surplus_huge_page() is checking the global counter,
> * though, we'll note that we're not allowed to exceed surplus
> * and won't grow the pool anywhere else. Not until one of the
> * sysctls are changed, or the surplus pages go out of use.
> --
> 2.15.1
>

2018-02-21 12:54:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Tue 20-02-18 22:24:57, Dan Rue wrote:
> On Wed, Jan 03, 2018 at 10:32:12AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Hugetlb allocator has several layer of allocation functions depending
> > and the purpose of the allocation. There are two allocators depending
> > on whether the page can be allocated from the page allocator or we need
> > a contiguous allocator. This is currently opencoded in alloc_fresh_huge_page
> > which is the only path that might allocate giga pages which require the
> > later allocator. Create alloc_fresh_huge_page which hides this
> > implementation detail and use it in all callers which hardcoded the
> > buddy allocator path (__hugetlb_alloc_buddy_huge_page). This shouldn't
> > introduce any funtional change because both migration and surplus
> > allocators exlude giga pages explicitly.
> >
> > While we are at it let's do some renaming. The current scheme is not
> > consistent and overly painfull to read and understand. Get rid of prefix
> > underscores from most functions. There is no real reason to make names
> > longer.
> > * alloc_fresh_huge_page is the new layer to abstract underlying
> > allocator
> > * __hugetlb_alloc_buddy_huge_page becomes shorter and neater
> > alloc_buddy_huge_page.
> > * Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
> > the new page directly to the pool
> > * alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
> > as it uses alloc_fresh_huge_page now
> > * others lose their excessive prefix underscores to make names shorter
>
> Hi Michal -
>
> We (Linaro) run the libhugetlbfs test suite continuously against
> mainline and recently (Feb 1), the 'counters' test started failing on
> with the following error:
>
> root@localhost:~# mount_point="/mnt/hugetlb/"
> root@localhost:~# echo 200 > /proc/sys/vm/nr_hugepages
> root@localhost:~# mkdir -p "${mount_point}"
> root@localhost:~# mount -t hugetlbfs hugetlbfs "${mount_point}"
> root@localhost:~# export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
> root@localhost:~# /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
> Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
> Base pool size: 0
> Clean...
> FAIL Line 326: Bad HugePages_Total: expected 0, actual 1
>
> Line 326 refers to the test source @
> https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/counters.c#L326

Thanks for the report. I am fighting to get hugetlb tests working. My
previous deployment is gone and the new git snapshot fails to build. I
will look into it further but ...

> I bisected the failure to this commit. The problem is seen on multiple
> architectures (tested x86-64 and arm64).

The patch shouldn't have introduced any functional changes IIRC. But let
me have a look
--
Michal Hocko
SUSE Labs

2018-02-21 16:09:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> On Tue 20-02-18 22:24:57, Dan Rue wrote:
[...]
> > I bisected the failure to this commit. The problem is seen on multiple
> > architectures (tested x86-64 and arm64).
>
> The patch shouldn't have introduced any functional changes IIRC. But let
> me have a look

Hmm, I guess I can see it. Does the following help?
---
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c204e3d132b..a963f2034dfc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
page = NULL;
} else {
h->surplus_huge_pages++;
- h->nr_huge_pages_node[page_to_nid(page)]++;
+ h->surplus_huge_pages_node[page_to_nid(page)]++;
}

out_unlock:
--
Michal Hocko
SUSE Labs

2018-02-21 19:02:00

by Dan Rue

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed, Feb 21, 2018 at 11:01:07AM +0100, Michal Hocko wrote:
> On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> > On Tue 20-02-18 22:24:57, Dan Rue wrote:
> [...]
> > > I bisected the failure to this commit. The problem is seen on multiple
> > > architectures (tested x86-64 and arm64).
> >
> > The patch shouldn't have introduced any functional changes IIRC. But let
> > me have a look
>
> Hmm, I guess I can see it. Does the following help?
> ---
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7c204e3d132b..a963f2034dfc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> page = NULL;
> } else {
> h->surplus_huge_pages++;
> - h->nr_huge_pages_node[page_to_nid(page)]++;
> + h->surplus_huge_pages_node[page_to_nid(page)]++;
> }
>
> out_unlock:

That did the trick. Confirmed fixed on v4.15-3389-g0c397daea1d4 and
v4.16-rc2 with the above patch.

Dan

> --
> Michal Hocko
> SUSE Labs

2018-02-21 19:14:23

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On 02/21/2018 02:01 AM, Michal Hocko wrote:
> On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> Hmm, I guess I can see it. Does the following help?
> ---
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7c204e3d132b..a963f2034dfc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> page = NULL;
> } else {
> h->surplus_huge_pages++;
> - h->nr_huge_pages_node[page_to_nid(page)]++;
> + h->surplus_huge_pages_node[page_to_nid(page)]++;
> }
>
> out_unlock:

I thought we had this corrected in a previous version of the patch.
My apologies for not looking more closely at this version.

FWIW,
Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2018-02-21 19:15:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed 21-02-18 10:19:14, Dan Rue wrote:
> On Wed, Feb 21, 2018 at 11:01:07AM +0100, Michal Hocko wrote:
> > On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> > > On Tue 20-02-18 22:24:57, Dan Rue wrote:
> > [...]
> > > > I bisected the failure to this commit. The problem is seen on multiple
> > > > architectures (tested x86-64 and arm64).
> > >
> > > The patch shouldn't have introduced any functional changes IIRC. But let
> > > me have a look
> >
> > Hmm, I guess I can see it. Does the following help?
> > ---
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7c204e3d132b..a963f2034dfc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > page = NULL;
> > } else {
> > h->surplus_huge_pages++;
> > - h->nr_huge_pages_node[page_to_nid(page)]++;
> > + h->surplus_huge_pages_node[page_to_nid(page)]++;
> > }
> >
> > out_unlock:
>
> That did the trick. Confirmed fixed on v4.15-3389-g0c397daea1d4 and
> v4.16-rc2 with the above patch.

Thanks a lot for re-testing! Can I assume your Tested-by?

--
Michal Hocko
SUSE Labs

2018-02-21 19:16:01

by Dan Rue

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed, Feb 21, 2018 at 07:52:52PM +0100, Michal Hocko wrote:
> On Wed 21-02-18 10:19:14, Dan Rue wrote:
> > On Wed, Feb 21, 2018 at 11:01:07AM +0100, Michal Hocko wrote:
> > > On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> > > > On Tue 20-02-18 22:24:57, Dan Rue wrote:
> > > [...]
> > > > > I bisected the failure to this commit. The problem is seen on multiple
> > > > > architectures (tested x86-64 and arm64).
> > > >
> > > > The patch shouldn't have introduced any functional changes IIRC. But let
> > > > me have a look
> > >
> > > Hmm, I guess I can see it. Does the following help?
> > > ---
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 7c204e3d132b..a963f2034dfc 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > > page = NULL;
> > > } else {
> > > h->surplus_huge_pages++;
> > > - h->nr_huge_pages_node[page_to_nid(page)]++;
> > > + h->surplus_huge_pages_node[page_to_nid(page)]++;
> > > }
> > >
> > > out_unlock:
> >
> > That did the trick. Confirmed fixed on v4.15-3389-g0c397daea1d4 and
> > v4.16-rc2 with the above patch.
>
> Thanks a lot for re-testing! Can I assume your Tested-by?

Tested-by: Dan Rue <[email protected]>

>
> --
> Michal Hocko
> SUSE Labs

2018-02-21 19:16:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm, hugetlb: further simplify hugetlb allocation API

On Wed 21-02-18 09:59:40, Mike Kravetz wrote:
> On 02/21/2018 02:01 AM, Michal Hocko wrote:
> > On Wed 21-02-18 10:55:26, Michal Hocko wrote:
> > Hmm, I guess I can see it. Does the following help?
> > ---
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7c204e3d132b..a963f2034dfc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
> > page = NULL;
> > } else {
> > h->surplus_huge_pages++;
> > - h->nr_huge_pages_node[page_to_nid(page)]++;
> > + h->surplus_huge_pages_node[page_to_nid(page)]++;
> > }
> >
> > out_unlock:
>
> I thought we had this corrected in a previous version of the patch.
> My apologies for not looking more closely at this version.

I must have screwed up when rebasing. I remember I was splitting this
patch...

> FWIW,
> Reviewed-by: Mike Kravetz <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs

2018-02-21 19:16:22

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] hugetlb: fix surplus pages accounting

OK, so here we go with the fix.

From bc55c70ca2325f3305a80cfca5731f9550205589 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 21 Feb 2018 19:47:33 +0100
Subject: [PATCH] hugetlb: fix surplus pages accounting

Dan Rue has noticed that libhugetlbfs test suite fails counter test:

root@localhost:~# mount_point="/mnt/hugetlb/"
root@localhost:~# echo 200 > /proc/sys/vm/nr_hugepages
root@localhost:~# mkdir -p "${mount_point}"
root@localhost:~# mount -t hugetlbfs hugetlbfs "${mount_point}"
root@localhost:~# export LD_LIBRARY_PATH=/root/libhugetlbfs/libhugetlbfs-2.20/obj64
root@localhost:~# /root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters
Starting testcase "/root/libhugetlbfs/libhugetlbfs-2.20/tests/obj64/counters", pid 3319
Base pool size: 0
Clean...
FAIL Line 326: Bad HugePages_Total: expected 0, actual 1

The bug was bisected to 0c397daea1d4 ("mm, hugetlb: further simplify
hugetlb allocation API"). The reason is that alloc_surplus_huge_page misaccounts
per node surplus pages. We should increase surplus_huge_pages_node rather than
nr_huge_pages_node which is already handled by alloc_fresh_huge_page.

Fixes: 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API")
Reported-by: Dan Rue <[email protected]>
Tested-by: Dan Rue <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c204e3d132b..a963f2034dfc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1583,7 +1583,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
page = NULL;
} else {
h->surplus_huge_pages++;
- h->nr_huge_pages_node[page_to_nid(page)]++;
+ h->surplus_huge_pages_node[page_to_nid(page)]++;
}

out_unlock:
--
2.16.1

--
Michal Hocko
SUSE Labs