2014-04-02 18:09:17

by Luiz Capitulino

Subject: [PATCH 0/4] hugetlb: add support for gigantic page allocation at runtime

The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
runtime. This means that runtime hugepage allocation is limited to orders
below MAX_ORDER. For archs supporting gigantic pages (that is, pages of
order MAX_ORDER or higher), this in turn means that those pages can't be
allocated at runtime.

HugeTLB supports gigantic page allocation during boottime, via the boot
allocator. To this end the kernel provides the command-line options
hugepagesz= and hugepages=, which can be used to instruct the kernel to
allocate N gigantic pages during boot.

For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
this has to be done at boot via the hugepagesz= and hugepages= command-line
options.
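
For instance, to get four 1G gigantic pages at boot on x86_64 one can append
something like the following to the kernel command line (the count here is
just an example, and the CPU has to support 1GB pages):

hugepagesz=1G hugepages=4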

Now, gigantic page allocation at boottime has two serious problems:

1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
evenly distributes boottime allocated hugepages among nodes.

For example, suppose you have a four-node NUMA machine and want
to allocate four 1G gigantic pages at boottime. The kernel will
allocate one gigantic page per node.

On the other hand, we do have users who want to be able to specify
which NUMA node gigantic pages should be allocated from, so that they
can place virtual machines on a specific NUMA node.

2. Gigantic pages allocated at boottime can't be freed

At this point it's important to observe that regular hugepages allocated
at runtime don't have those problems: the HugeTLB sysfs interface for
runtime allocation is NUMA aware, and runtime-allocated pages can be freed
just fine via the buddy allocator.

This series adds support for allocating gigantic pages at runtime. It does
so by allocating gigantic pages via CMA instead of the buddy allocator.
Releasing gigantic pages is also supported via CMA. As this series builds
on top of the existing HugeTLB interface, gigantic page allocation and
release work just like they do for regular sized hugepages. This also
means that NUMA support just works.

For example, to allocate two 1G gigantic pages on node 1, one can do:

# echo 2 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And, to release all gigantic pages on the same node:

# echo 0 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Please refer to patch 4/4 for full technical details.

Finally, please note that this series is a follow-up to a previous series
that tried to extend the set of command-line options to be NUMA aware:

http://marc.info/?l=linux-mm&m=139593335312191&w=2

During the discussion of that series it was agreed that having runtime
allocation support for gigantic pages was a better solution.

Luiz Capitulino (4):
hugetlb: add hstate_is_gigantic()
hugetlb: update_and_free_page(): don't clear PG_reserved bit
hugetlb: move helpers up in the file
hugetlb: add support for gigantic page allocation at runtime

arch/x86/include/asm/hugetlb.h | 10 ++
include/linux/hugetlb.h | 5 +
mm/hugetlb.c | 344 ++++++++++++++++++++++++++++++-----------
3 files changed, 265 insertions(+), 94 deletions(-)

--
1.8.1.4


2014-04-02 18:09:20

by Luiz Capitulino

Subject: [PATCH 1/4] hugetlb: add hstate_is_gigantic()

Signed-off-by: Luiz Capitulino <[email protected]>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 28 ++++++++++++++--------------
2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc4..8590134 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -333,6 +333,11 @@ static inline unsigned huge_page_shift(struct hstate *h)
return h->order + PAGE_SHIFT;
}

+static inline bool hstate_is_gigantic(struct hstate *h)
+{
+ return huge_page_order(h) >= MAX_ORDER;
+}
+
static inline unsigned int pages_per_huge_page(struct hstate *h)
{
return 1 << h->order;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9f..8c50547 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -574,7 +574,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;

- VM_BUG_ON(h->order >= MAX_ORDER);
+ VM_BUG_ON(hstate_is_gigantic(h));

h->nr_huge_pages--;
h->nr_huge_pages_node[page_to_nid(page)]--;
@@ -627,7 +627,7 @@ static void free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;

- if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
+ if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
/* remove the page from active list */
list_del(&page->lru);
update_and_free_page(h, page);
@@ -731,7 +731,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;

- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return NULL;

page = alloc_pages_exact_node(nid,
@@ -925,7 +925,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
struct page *page;
unsigned int r_nid;

- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return NULL;

/*
@@ -1118,7 +1118,7 @@ static void return_unused_surplus_pages(struct hstate *h,
h->resv_huge_pages -= unused_resv_pages;

/* Cannot return gigantic pages currently */
- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return;

nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
@@ -1328,7 +1328,7 @@ static void __init gather_bootmem_prealloc(void)
* fix confusing memory reports from free(1) and another
* side-effects, like CommitLimit going negative.
*/
- if (h->order > (MAX_ORDER - 1))
+ if (hstate_is_gigantic(h))
adjust_managed_page_count(page, 1 << h->order);
}
}
@@ -1338,7 +1338,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
unsigned long i;

for (i = 0; i < h->max_huge_pages; ++i) {
- if (h->order >= MAX_ORDER) {
+ if (hstate_is_gigantic(h)) {
if (!alloc_bootmem_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h,
@@ -1354,7 +1354,7 @@ static void __init hugetlb_init_hstates(void)

for_each_hstate(h) {
/* oversize hugepages were init'ed in early boot */
- if (h->order < MAX_ORDER)
+ if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
}
}
@@ -1388,7 +1388,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
{
int i;

- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return;

for_each_node_mask(i, *nodes_allowed) {
@@ -1451,7 +1451,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
{
unsigned long min_count, ret;

- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return h->max_huge_pages;

/*
@@ -1577,7 +1577,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
goto out;

h = kobj_to_hstate(kobj, &nid);
- if (h->order >= MAX_ORDER) {
+ if (hstate_is_gigantic(h)) {
err = -EINVAL;
goto out;
}
@@ -1660,7 +1660,7 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
unsigned long input;
struct hstate *h = kobj_to_hstate(kobj, NULL);

- if (h->order >= MAX_ORDER)
+ if (hstate_is_gigantic(h))
return -EINVAL;

err = kstrtoul(buf, 10, &input);
@@ -2071,7 +2071,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,

tmp = h->max_huge_pages;

- if (write && h->order >= MAX_ORDER)
+ if (write && hstate_is_gigantic(h))
return -EINVAL;

table->data = &tmp;
@@ -2124,7 +2124,7 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,

tmp = h->nr_overcommit_huge_pages;

- if (write && h->order >= MAX_ORDER)
+ if (write && hstate_is_gigantic(h))
return -EINVAL;

table->data = &tmp;
--
1.8.1.4

2014-04-02 18:09:25

by Luiz Capitulino

Subject: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

HugeTLB is limited to allocating hugepages whose order is less than
MAX_ORDER. This is because HugeTLB allocates hugepages via the buddy
allocator. Gigantic pages (that is, pages whose order is greater than or
equal to MAX_ORDER) have to be allocated at boottime.

However, boottime allocation has at least two serious problems. First,
it doesn't support NUMA and second, gigantic pages allocated at
boottime can't be freed.

This commit solves both issues by adding support for allocating gigantic
pages during runtime. It works just like regular sized hugepages,
meaning that the interface in sysfs is the same, it supports NUMA,
and gigantic pages can be freed.

For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
gigantic pages on node 1, one can do:

# echo 2 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

And to free them later:

# echo 0 > \
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

The one problem with gigantic page allocation at runtime is that it
can't be serviced by the buddy allocator. To overcome that problem, this
series scans all zones of a node looking for a large enough contiguous
region. When one is found, it's allocated by using CMA, that is, we call
alloc_contig_range() to do the actual allocation. For example, on x86_64
we scan all zones looking for a 1GB contiguous region; when one is found,
it's allocated by alloc_contig_range().
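
For orientation, the allocation path added by this patch is roughly the
following (the top two functions are the existing sysfs path, the rest are
the new helpers introduced in the diff below):

    nr_hugepages_store_common()                 /* sysfs write */
      set_max_huge_pages()
        alloc_fresh_gigantic_page()             /* gigantic hstates only */
          alloc_fresh_gigantic_page_node()
            alloc_gigantic_page()               /* per-zone PFN scan */
              __alloc_gigantic_page()           /* alloc_contig_range() */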

One expected issue with that approach is that such gigantic contiguous
regions tend to vanish as time goes by. The best way to avoid this for
now is to make gigantic page allocations very early during boot, say
from an init script. Other possible optimizations include using compaction,
which is supported by CMA but is not explicitly used by this commit.

It's also important to note the following:

1. My target systems are x86_64 machines, so I have only tested 1GB
page allocation/release. I did try to make this arch independent
and expect it to work on other archs, but didn't try it myself

2. I didn't add support for hugepage overcommit, that is, allocating
a gigantic page on demand when
/proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
think it's reasonable to do the hard and long work required for
allocating a gigantic page at fault time. But it should be simple
to add this if wanted

Signed-off-by: Luiz Capitulino <[email protected]>
---
arch/x86/include/asm/hugetlb.h | 10 +++
mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 176 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
index a809121..2b262f7 100644
--- a/arch/x86/include/asm/hugetlb.h
+++ b/arch/x86/include/asm/hugetlb.h
@@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
{
}

+static inline int arch_prepare_gigantic_page(struct page *page)
+{
+ return 0;
+}
+
+static inline void arch_release_gigantic_page(struct page *page)
+{
+}
+
+
static inline void arch_clear_hugepage_flags(struct page *page)
{
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c7a44a..c68515e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -643,11 +643,159 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)

+#ifdef CONFIG_CMA
+static void destroy_compound_gigantic_page(struct page *page,
+ unsigned long order)
+{
+ int i;
+ int nr_pages = 1 << order;
+ struct page *p = page + 1;
+
+ for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+ __ClearPageTail(p);
+ set_page_refcounted(p);
+ p->first_page = NULL;
+ }
+
+ set_compound_order(page, 0);
+ __ClearPageHead(page);
+}
+
+static void free_gigantic_page(struct page *page, unsigned order)
+{
+ free_contig_range(page_to_pfn(page), 1 << order);
+}
+
+static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
+{
+ unsigned long end_pfn = start_pfn + count;
+ return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+}
+
+static bool pfn_valid_gigantic(unsigned long pfn)
+{
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ return false;
+
+ page = pfn_to_page(pfn);
+
+ if (PageReserved(page))
+ return false;
+
+ if (page_count(page) > 0)
+ return false;
+
+ return true;
+}
+
+static inline bool pfn_aligned_gigantic(unsigned long pfn, unsigned order)
+{
+ return IS_ALIGNED((phys_addr_t) pfn << PAGE_SHIFT, PAGE_SIZE << order);
+}
+
+static struct page *alloc_gigantic_page(int nid, unsigned order)
+{
+ unsigned long ret, i, count, start_pfn, flags;
+ unsigned long nr_pages = 1 << order;
+ struct zone *z;
+
+ z = NODE_DATA(nid)->node_zones;
+ for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
+ spin_lock_irqsave(&z->lock, flags);
+ if (z->spanned_pages < nr_pages) {
+ spin_unlock_irqrestore(&z->lock, flags);
+ continue;
+ }
+
+ /* scan zone 'z' looking for a contiguous 'nr_pages' range */
+ count = 0;
+ start_pfn = z->zone_start_pfn; /* to silence gcc */
+ for (i = z->zone_start_pfn; i < zone_end_pfn(z); i++) {
+ if (!pfn_valid_gigantic(i)) {
+ count = 0;
+ continue;
+ }
+ if (!count) {
+ if (!pfn_aligned_gigantic(i, order))
+ continue;
+ start_pfn = i;
+ }
+ if (++count == nr_pages) {
+ /*
+ * We release the zone lock here because
+ * alloc_contig_range() will also lock the zone
+ * at some point. If there's an allocation
+ * spinning on this lock, it may win the race
+ * and cause alloc_contig_range() to fail...
+ */
+ spin_unlock_irqrestore(&z->lock, flags);
+ ret = __alloc_gigantic_page(start_pfn, count);
+ if (!ret)
+ return pfn_to_page(start_pfn);
+ count = 0;
+ spin_lock_irqsave(&z->lock, flags);
+ }
+ }
+
+ spin_unlock_irqrestore(&z->lock, flags);
+ }
+
+ return NULL;
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
+static void prep_compound_gigantic_page(struct page *page, unsigned long order);
+
+static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
+{
+ struct page *page;
+
+ page = alloc_gigantic_page(nid, huge_page_order(h));
+ if (page) {
+ if (arch_prepare_gigantic_page(page)) {
+ free_gigantic_page(page, huge_page_order(h));
+ return NULL;
+ }
+ prep_compound_gigantic_page(page, huge_page_order(h));
+ prep_new_huge_page(h, page, nid);
+ }
+
+ return page;
+}
+
+static int alloc_fresh_gigantic_page(struct hstate *h,
+ nodemask_t *nodes_allowed)
+{
+ struct page *page = NULL;
+ int nr_nodes, node;
+
+ for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ page = alloc_fresh_gigantic_page_node(h, node);
+ if (page)
+ return 1;
+ }
+
+ return 0;
+}
+
+static inline bool gigantic_page_supported(void) { return true; }
+#else /* !CONFIG_CMA */
+static inline bool gigantic_page_supported(void) { return false; }
+static inline void free_gigantic_page(struct page *page, unsigned order) { }
+static inline void destroy_compound_gigantic_page(struct page *page,
+ unsigned long order) { }
+static inline int alloc_fresh_gigantic_page(struct hstate *h,
+ nodemask_t *nodes_allowed) { return 0; }
+#endif /* CONFIG_CMA */
+
static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;

- VM_BUG_ON(hstate_is_gigantic(h));
+ if (hstate_is_gigantic(h) && !gigantic_page_supported())
+ return;

h->nr_huge_pages--;
h->nr_huge_pages_node[page_to_nid(page)]--;
@@ -661,8 +809,14 @@ static void update_and_free_page(struct hstate *h, struct page *page)
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
- arch_release_hugepage(page);
- __free_pages(page, huge_page_order(h));
+ if (hstate_is_gigantic(h)) {
+ arch_release_gigantic_page(page);
+ destroy_compound_gigantic_page(page, huge_page_order(h));
+ free_gigantic_page(page, huge_page_order(h));
+ } else {
+ arch_release_hugepage(page);
+ __free_pages(page, huge_page_order(h));
+ }
}

struct hstate *size_to_hstate(unsigned long size)
@@ -701,7 +855,7 @@ static void free_huge_page(struct page *page)
if (restore_reserve)
h->resv_huge_pages++;

- if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
+ if (h->surplus_huge_pages_node[nid]) {
/* remove the page from active list */
list_del(&page->lru);
update_and_free_page(h, page);
@@ -805,9 +959,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;

- if (hstate_is_gigantic(h))
- return NULL;
-
page = alloc_pages_exact_node(nid,
htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
@@ -1452,7 +1603,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
{
unsigned long min_count, ret;

- if (hstate_is_gigantic(h))
+ if (hstate_is_gigantic(h) && !gigantic_page_supported())
return h->max_huge_pages;

/*
@@ -1479,7 +1630,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
* and reducing the surplus.
*/
spin_unlock(&hugetlb_lock);
- ret = alloc_fresh_huge_page(h, nodes_allowed);
+ if (hstate_is_gigantic(h)) {
+ ret = alloc_fresh_gigantic_page(h, nodes_allowed);
+ } else {
+ ret = alloc_fresh_huge_page(h, nodes_allowed);
+ }
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -1578,7 +1733,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
goto out;

h = kobj_to_hstate(kobj, &nid);
- if (hstate_is_gigantic(h)) {
+ if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
err = -EINVAL;
goto out;
}
@@ -2072,7 +2227,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,

tmp = h->max_huge_pages;

- if (write && hstate_is_gigantic(h))
+ if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
return -EINVAL;

table->data = &tmp;
--
1.8.1.4

2014-04-02 18:10:51

by Luiz Capitulino

Subject: [PATCH 3/4] hugetlb: move helpers up in the file

The next commit will add new code that wants to call the
for_each_node_mask_to_alloc() macro. Move it, its buddy
for_each_node_mask_to_free(), and their dependencies up in the file so
the new code can use them. This is just code movement, no logic change.
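
For reference, the next commit ends up using the allocation-side macro like
this (round-robin over the allowed nodes, stopping at the first node where
the allocation succeeds):

    for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
        page = alloc_fresh_gigantic_page_node(h, node);
        if (page)
            return 1;   /* got a page on this node */
    }

    return 0;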

Signed-off-by: Luiz Capitulino <[email protected]>
---
mm/hugetlb.c | 146 +++++++++++++++++++++++++++++------------------------------
1 file changed, 73 insertions(+), 73 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e07e47..2c7a44a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -570,6 +570,79 @@ err:
return NULL;
}

+/*
+ * common helper functions for hstate_next_node_to_{alloc|free}.
+ * We may have allocated or freed a huge page based on a different
+ * nodes_allowed previously, so h->next_node_to_{alloc|free} might
+ * be outside of *nodes_allowed. Ensure that we use an allowed
+ * node for alloc or free.
+ */
+static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+ nid = next_node(nid, *nodes_allowed);
+ if (nid == MAX_NUMNODES)
+ nid = first_node(*nodes_allowed);
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+
+ return nid;
+}
+
+static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
+{
+ if (!node_isset(nid, *nodes_allowed))
+ nid = next_node_allowed(nid, nodes_allowed);
+ return nid;
+}
+
+/*
+ * returns the previously saved node ["this node"] from which to
+ * allocate a persistent huge page for the pool and advance the
+ * next node from which to allocate, handling wrap at end of node
+ * mask.
+ */
+static int hstate_next_node_to_alloc(struct hstate *h,
+ nodemask_t *nodes_allowed)
+{
+ int nid;
+
+ VM_BUG_ON(!nodes_allowed);
+
+ nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
+ h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+
+ return nid;
+}
+
+/*
+ * helper for free_pool_huge_page() - return the previously saved
+ * node ["this node"] from which to free a huge page. Advance the
+ * next node id whether or not we find a free huge page to free so
+ * that the next attempt to free addresses the next node.
+ */
+static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
+{
+ int nid;
+
+ VM_BUG_ON(!nodes_allowed);
+
+ nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
+ h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
+
+ return nid;
+}
+
+#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
+ for (nr_nodes = nodes_weight(*mask); \
+ nr_nodes > 0 && \
+ ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
+ nr_nodes--)
+
+#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
+ for (nr_nodes = nodes_weight(*mask); \
+ nr_nodes > 0 && \
+ ((node = hstate_next_node_to_free(hs, mask)) || 1); \
+ nr_nodes--)
+
static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;
@@ -750,79 +823,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
return page;
}

-/*
- * common helper functions for hstate_next_node_to_{alloc|free}.
- * We may have allocated or freed a huge page based on a different
- * nodes_allowed previously, so h->next_node_to_{alloc|free} might
- * be outside of *nodes_allowed. Ensure that we use an allowed
- * node for alloc or free.
- */
-static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
-{
- nid = next_node(nid, *nodes_allowed);
- if (nid == MAX_NUMNODES)
- nid = first_node(*nodes_allowed);
- VM_BUG_ON(nid >= MAX_NUMNODES);
-
- return nid;
-}
-
-static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
-{
- if (!node_isset(nid, *nodes_allowed))
- nid = next_node_allowed(nid, nodes_allowed);
- return nid;
-}
-
-/*
- * returns the previously saved node ["this node"] from which to
- * allocate a persistent huge page for the pool and advance the
- * next node from which to allocate, handling wrap at end of node
- * mask.
- */
-static int hstate_next_node_to_alloc(struct hstate *h,
- nodemask_t *nodes_allowed)
-{
- int nid;
-
- VM_BUG_ON(!nodes_allowed);
-
- nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
- h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
-
- return nid;
-}
-
-/*
- * helper for free_pool_huge_page() - return the previously saved
- * node ["this node"] from which to free a huge page. Advance the
- * next node id whether or not we find a free huge page to free so
- * that the next attempt to free addresses the next node.
- */
-static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
-{
- int nid;
-
- VM_BUG_ON(!nodes_allowed);
-
- nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
- h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
-
- return nid;
-}
-
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
- for (nr_nodes = nodes_weight(*mask); \
- nr_nodes > 0 && \
- ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
- nr_nodes--)
-
-#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
- for (nr_nodes = nodes_weight(*mask); \
- nr_nodes > 0 && \
- ((node = hstate_next_node_to_free(hs, mask)) || 1); \
- nr_nodes--)
-
static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
struct page *page;
--
1.8.1.4

2014-04-02 18:09:18

by Luiz Capitulino

Subject: [PATCH 2/4] hugetlb: update_and_free_page(): don't clear PG_reserved bit

Hugepages never get the PG_reserved bit set, so don't clear it. But
add a warning just in case.

Signed-off-by: Luiz Capitulino <[email protected]>
---
mm/hugetlb.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8c50547..7e07e47 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -581,8 +581,9 @@ static void update_and_free_page(struct hstate *h, struct page *page)
for (i = 0; i < pages_per_huge_page(h); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
1 << PG_referenced | 1 << PG_dirty |
- 1 << PG_active | 1 << PG_reserved |
- 1 << PG_private | 1 << PG_writeback);
+ 1 << PG_active | 1 << PG_private |
+ 1 << PG_writeback);
+ WARN_ON(PageReserved(&page[i]));
}
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
set_compound_page_dtor(page, NULL);
--
1.8.1.4

2014-04-03 15:34:01

by Andrea Arcangeli

Subject: Re: [PATCH 0/4] hugetlb: add support for gigantic page allocation at runtime

On Wed, Apr 02, 2014 at 02:08:44PM -0400, Luiz Capitulino wrote:
> Luiz Capitulino (4):
> hugetlb: add hstate_is_gigantic()
> hugetlb: update_and_free_page(): don't clear PG_reserved bit
> hugetlb: move helpers up in the file
> hugetlb: add support for gigantic page allocation at runtime

Reviewed-by: Andrea Arcangeli <[email protected]>

2014-04-04 03:06:06

by Yasuaki Ishimatsu

Subject: Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

(2014/04/03 3:08), Luiz Capitulino wrote:
> HugeTLB is limited to allocating hugepages whose size are less than
> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> the buddy allocator. Gigantic pages (that is, pages whose size is
> greater than MAX_ORDER order) have to be allocated at boottime.
>
> However, boottime allocation has at least two serious problems. First,
> it doesn't support NUMA and second, gigantic pages allocated at
> boottime can't be freed.
>
> This commit solves both issues by adding support for allocating gigantic
> pages during runtime. It works just like regular sized hugepages,
> meaning that the interface in sysfs is the same, it supports NUMA,
> and gigantic pages can be freed.
>
> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> gigantic pages on node 1, one can do:
>
> # echo 2 > \
> /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>
> And to free them later:
>
> # echo 0 > \
> /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>
> The one problem with gigantic page allocation at runtime is that it
> can't be serviced by the buddy allocator. To overcome that problem, this
> series scans all zones from a node looking for a large enough contiguous
> region. When one is found, it's allocated by using CMA, that is, we call
> alloc_contig_range() to do the actual allocation. For example, on x86_64
> we scan all zones looking for a 1GB contiguous region. When one is found
> it's allocated by alloc_contig_range().
>
> One expected issue with that approach is that such gigantic contiguous
> regions tend to vanish as time goes by. The best way to avoid this for
> now is to make gigantic page allocations very early during boot, say
> from a init script. Other possible optimization include using compaction,
> which is supported by CMA but is not explicitly used by this commit.
>
> It's also important to note the following:
>
> 1. My target systems are x86_64 machines, so I have only tested 1GB
> pages allocation/release. I did try to make this arch indepedent
> and expect it to work on other archs but didn't try it myself
>
> 2. I didn't add support for hugepage overcommit, that is allocating
> a gigantic page on demand when
> /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
> think it's reasonable to do the hard and long work required for
> allocating a gigantic page at fault time. But it should be simple
> to add this if wanted
>
> Signed-off-by: Luiz Capitulino <[email protected]>
> ---
> arch/x86/include/asm/hugetlb.h | 10 +++
> mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 176 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> index a809121..2b262f7 100644
> --- a/arch/x86/include/asm/hugetlb.h
> +++ b/arch/x86/include/asm/hugetlb.h
> @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
> {
> }
>
> +static inline int arch_prepare_gigantic_page(struct page *page)
> +{
> + return 0;
> +}
> +
> +static inline void arch_release_gigantic_page(struct page *page)
> +{
> +}
> +
> +
> static inline void arch_clear_hugepage_flags(struct page *page)
> {
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2c7a44a..c68515e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -643,11 +643,159 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> nr_nodes--)
>
> +#ifdef CONFIG_CMA
> +static void destroy_compound_gigantic_page(struct page *page,
> + unsigned long order)
> +{
> + int i;
> + int nr_pages = 1 << order;
> + struct page *p = page + 1;
> +
> + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> + __ClearPageTail(p);
> + set_page_refcounted(p);
> + p->first_page = NULL;
> + }
> +
> + set_compound_order(page, 0);
> + __ClearPageHead(page);
> +}
> +
> +static void free_gigantic_page(struct page *page, unsigned order)
> +{
> + free_contig_range(page_to_pfn(page), 1 << order);
> +}
> +
> +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
> +{
> + unsigned long end_pfn = start_pfn + count;
> + return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> +}
> +
> +static bool pfn_valid_gigantic(unsigned long pfn)
> +{
> + struct page *page;
> +
> + if (!pfn_valid(pfn))
> + return false;
> +
> + page = pfn_to_page(pfn);
> +
> + if (PageReserved(page))
> + return false;
> +
> + if (page_count(page) > 0)
> + return false;
> +
> + return true;
> +}
> +
> +static inline bool pfn_aligned_gigantic(unsigned long pfn, unsigned order)
> +{
> + return IS_ALIGNED((phys_addr_t) pfn << PAGE_SHIFT, PAGE_SIZE << order);
> +}
> +
> +static struct page *alloc_gigantic_page(int nid, unsigned order)
> +{
> + unsigned long ret, i, count, start_pfn, flags;
> + unsigned long nr_pages = 1 << order;
> + struct zone *z;
> +
> + z = NODE_DATA(nid)->node_zones;
> + for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
> + spin_lock_irqsave(&z->lock, flags);
> + if (z->spanned_pages < nr_pages) {
> + spin_unlock_irqrestore(&z->lock, flags);
> + continue;
> + }
> +
> + /* scan zone 'z' looking for a contiguous 'nr_pages' range */
> + count = 0;

> + start_pfn = z->zone_start_pfn; /* to silence gcc */
> + for (i = z->zone_start_pfn; i < zone_end_pfn(z); i++) {

This loop is not smart. On our system, one node has several TBytes.
So the maximum loop count is "TBytes / page size".

The first page of a gigantic page must be aligned.
So how about this:

start_pfn = zone_start_pfn aligned gigantic page
for (i = start_pfn; i < zone_end_pfn; i += size of gigantic page) {
if (!pfn_valid_gigantic(i)) {
count = 0;
continue;
}

...
}

Thanks,
Yasuaki Ishimatsu

> + if (!pfn_valid_gigantic(i)) {
> + count = 0;
> + continue;
> + }
> + if (!count) {
> + if (!pfn_aligned_gigantic(i, order))
> + continue;
> + start_pfn = i;
> + }
> + if (++count == nr_pages) {
> + /*
> + * We release the zone lock here because
> + * alloc_contig_range() will also lock the zone
> + * at some point. If there's an allocation
> + * spinning on this lock, it may win the race
> + * and cause alloc_contig_range() to fail...
> + */
> + spin_unlock_irqrestore(&z->lock, flags);
> + ret = __alloc_gigantic_page(start_pfn, count);
> + if (!ret)
> + return pfn_to_page(start_pfn);
> + count = 0;
> + spin_lock_irqsave(&z->lock, flags);
> + }
> + }
> +
> + spin_unlock_irqrestore(&z->lock, flags);
> + }
> +
> + return NULL;
> +}
> +
> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> +static void prep_compound_gigantic_page(struct page *page, unsigned long order);
> +
> +static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
> +{
> + struct page *page;
> +
> + page = alloc_gigantic_page(nid, huge_page_order(h));
> + if (page) {
> + if (arch_prepare_gigantic_page(page)) {
> + free_gigantic_page(page, huge_page_order(h));
> + return NULL;
> + }
> + prep_compound_gigantic_page(page, huge_page_order(h));
> + prep_new_huge_page(h, page, nid);
> + }
> +
> + return page;
> +}
> +
> +static int alloc_fresh_gigantic_page(struct hstate *h,
> + nodemask_t *nodes_allowed)
> +{
> + struct page *page = NULL;
> + int nr_nodes, node;
> +
> + for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + page = alloc_fresh_gigantic_page_node(h, node);
> + if (page)
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static inline bool gigantic_page_supported(void) { return true; }
> +#else /* !CONFIG_CMA */
> +static inline bool gigantic_page_supported(void) { return false; }
> +static inline void free_gigantic_page(struct page *page, unsigned order) { }
> +static inline void destroy_compound_gigantic_page(struct page *page,
> + unsigned long order) { }
> +static inline int alloc_fresh_gigantic_page(struct hstate *h,
> + nodemask_t *nodes_allowed) { return 0; }
> +#endif /* CONFIG_CMA */
> +
> static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
>
> - VM_BUG_ON(hstate_is_gigantic(h));
> + if (hstate_is_gigantic(h) && !gigantic_page_supported())
> + return;
>
> h->nr_huge_pages--;
> h->nr_huge_pages_node[page_to_nid(page)]--;
> @@ -661,8 +809,14 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> - arch_release_hugepage(page);
> - __free_pages(page, huge_page_order(h));
> + if (hstate_is_gigantic(h)) {
> + arch_release_gigantic_page(page);
> + destroy_compound_gigantic_page(page, huge_page_order(h));
> + free_gigantic_page(page, huge_page_order(h));
> + } else {
> + arch_release_hugepage(page);
> + __free_pages(page, huge_page_order(h));
> + }
> }
>
> struct hstate *size_to_hstate(unsigned long size)
> @@ -701,7 +855,7 @@ static void free_huge_page(struct page *page)
> if (restore_reserve)
> h->resv_huge_pages++;
>
> - if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
> + if (h->surplus_huge_pages_node[nid]) {
> /* remove the page from active list */
> list_del(&page->lru);
> update_and_free_page(h, page);
> @@ -805,9 +959,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
>
> - if (hstate_is_gigantic(h))
> - return NULL;
> -
> page = alloc_pages_exact_node(nid,
> htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
> __GFP_REPEAT|__GFP_NOWARN,
> @@ -1452,7 +1603,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> {
> unsigned long min_count, ret;
>
> - if (hstate_is_gigantic(h))
> + if (hstate_is_gigantic(h) && !gigantic_page_supported())
> return h->max_huge_pages;
>
> /*
> @@ -1479,7 +1630,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> * and reducing the surplus.
> */
> spin_unlock(&hugetlb_lock);
> - ret = alloc_fresh_huge_page(h, nodes_allowed);
> + if (hstate_is_gigantic(h)) {
> + ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> + } else {
> + ret = alloc_fresh_huge_page(h, nodes_allowed);
> + }
> spin_lock(&hugetlb_lock);
> if (!ret)
> goto out;
> @@ -1578,7 +1733,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> goto out;
>
> h = kobj_to_hstate(kobj, &nid);
> - if (hstate_is_gigantic(h)) {
> + if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
> err = -EINVAL;
> goto out;
> }
> @@ -2072,7 +2227,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>
> tmp = h->max_huge_pages;
>
> - if (write && hstate_is_gigantic(h))
> + if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
> return -EINVAL;
>
> table->data = &tmp;
>

2014-04-04 13:31:20

by Luiz Capitulino

Subject: Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

On Fri, 4 Apr 2014 12:05:17 +0900
Yasuaki Ishimatsu <[email protected]> wrote:

> (2014/04/03 3:08), Luiz Capitulino wrote:
> > HugeTLB is limited to allocating hugepages whose size are less than
> > MAX_ORDER order. This is so because HugeTLB allocates hugepages via
> > the buddy allocator. Gigantic pages (that is, pages whose size is
> > greater than MAX_ORDER order) have to be allocated at boottime.
> >
> > However, boottime allocation has at least two serious problems. First,
> > it doesn't support NUMA and second, gigantic pages allocated at
> > boottime can't be freed.
> >
> > This commit solves both issues by adding support for allocating gigantic
> > pages during runtime. It works just like regular sized hugepages,
> > meaning that the interface in sysfs is the same, it supports NUMA,
> > and gigantic pages can be freed.
> >
> > For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
> > gigantic pages on node 1, one can do:
> >
> > # echo 2 > \
> > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > And to free them later:
> >
> > # echo 0 > \
> > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> >
> > The one problem with gigantic page allocation at runtime is that it
> > can't be serviced by the buddy allocator. To overcome that problem, this
> > series scans all zones from a node looking for a large enough contiguous
> > region. When one is found, it's allocated by using CMA, that is, we call
> > alloc_contig_range() to do the actual allocation. For example, on x86_64
> > we scan all zones looking for a 1GB contiguous region. When one is found
> > it's allocated by alloc_contig_range().
> >
> > One expected issue with that approach is that such gigantic contiguous
> > regions tend to vanish as time goes by. The best way to avoid this for
> > now is to make gigantic page allocations very early during boot, say
> > from a init script. Other possible optimization include using compaction,
> > which is supported by CMA but is not explicitly used by this commit.
> >
> > It's also important to note the following:
> >
> > 1. My target systems are x86_64 machines, so I have only tested 1GB
> > pages allocation/release. I did try to make this arch indepedent
> > and expect it to work on other archs but didn't try it myself
> >
> > 2. I didn't add support for hugepage overcommit, that is allocating
> > a gigantic page on demand when
> > /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
> > think it's reasonable to do the hard and long work required for
> > allocating a gigantic page at fault time. But it should be simple
> > to add this if wanted
> >
> > Signed-off-by: Luiz Capitulino <[email protected]>
> > ---
> > arch/x86/include/asm/hugetlb.h | 10 +++
> > mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
> > 2 files changed, 176 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> > index a809121..2b262f7 100644
> > --- a/arch/x86/include/asm/hugetlb.h
> > +++ b/arch/x86/include/asm/hugetlb.h
> > @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
> > {
> > }
> >
> > +static inline int arch_prepare_gigantic_page(struct page *page)
> > +{
> > + return 0;
> > +}
> > +
> > +static inline void arch_release_gigantic_page(struct page *page)
> > +{
> > +}
> > +
> > +
> > static inline void arch_clear_hugepage_flags(struct page *page)
> > {
> > }
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 2c7a44a..c68515e 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -643,11 +643,159 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> > ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> > nr_nodes--)
> >
> > +#ifdef CONFIG_CMA
> > +static void destroy_compound_gigantic_page(struct page *page,
> > + unsigned long order)
> > +{
> > + int i;
> > + int nr_pages = 1 << order;
> > + struct page *p = page + 1;
> > +
> > + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> > + __ClearPageTail(p);
> > + set_page_refcounted(p);
> > + p->first_page = NULL;
> > + }
> > +
> > + set_compound_order(page, 0);
> > + __ClearPageHead(page);
> > +}
> > +
> > +static void free_gigantic_page(struct page *page, unsigned order)
> > +{
> > + free_contig_range(page_to_pfn(page), 1 << order);
> > +}
> > +
> > +static int __alloc_gigantic_page(unsigned long start_pfn, unsigned long count)
> > +{
> > + unsigned long end_pfn = start_pfn + count;
> > + return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> > +}
> > +
> > +static bool pfn_valid_gigantic(unsigned long pfn)
> > +{
> > + struct page *page;
> > +
> > + if (!pfn_valid(pfn))
> > + return false;
> > +
> > + page = pfn_to_page(pfn);
> > +
> > + if (PageReserved(page))
> > + return false;
> > +
> > + if (page_count(page) > 0)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > +static inline bool pfn_aligned_gigantic(unsigned long pfn, unsigned order)
> > +{
> > + return IS_ALIGNED((phys_addr_t) pfn << PAGE_SHIFT, PAGE_SIZE << order);
> > +}
> > +
> > +static struct page *alloc_gigantic_page(int nid, unsigned order)
> > +{
> > + unsigned long ret, i, count, start_pfn, flags;
> > + unsigned long nr_pages = 1 << order;
> > + struct zone *z;
> > +
> > + z = NODE_DATA(nid)->node_zones;
> > + for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) {
> > + spin_lock_irqsave(&z->lock, flags);
> > + if (z->spanned_pages < nr_pages) {
> > + spin_unlock_irqrestore(&z->lock, flags);
> > + continue;
> > + }
> > +
> > + /* scan zone 'z' looking for a contiguous 'nr_pages' range */
> > + count = 0;
>
> > + start_pfn = z->zone_start_pfn; /* to silence gcc */
> > + for (i = z->zone_start_pfn; i < zone_end_pfn(z); i++) {
>
> This loop is not smart. On our system, one node has serveral TBytes.
> So the maximum loop count is "TBytes/Page size".

Interesting. Would you be willing to test this series on such a
machine?

> First page of gigantic page must be aligned.
> So how about it:
>
> start_pfn = zone_start_pfn aligned gigantic page
> for (i = start_pfn; i < zone_end_pfn; i += size of gigantic page) {
> if (!pfn_valid_gigantic(i)) {
> count = 0;
> continue;
> }
>
> ...
> }

I'm not sure that exact loop will work, because pfn_valid_gigantic() checks
a single PFN today, but we do have to scan every single PFN in a gigantic
page range.

On the other hand, I think I got what you're suggesting. When an unsuitable
PFN is found, we should just skip to the next aligned PFN instead of
keeping on scanning for nothing (which is what my loop does today). Maybe
you're suggesting pfn_valid_gigantic() should do that?
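
Something along these lines, maybe (untested sketch, just to illustrate the
direction; it keeps pfn_valid_gigantic() checking a single PFN but walks the
zone in gigantic-sized, aligned blocks and gives up on a block as soon as one
PFN in it is unsuitable):

    /*
     * Untested sketch: scan gigantic-page-aligned blocks; on the first
     * unsuitable PFN, jump straight to the next block instead of
     * re-checking every PFN up to the next boundary.
     */
    for (start_pfn = ALIGN(z->zone_start_pfn, nr_pages);
         start_pfn + nr_pages <= zone_end_pfn(z);
         start_pfn += nr_pages) {
        for (i = start_pfn; i < start_pfn + nr_pages; i++)
            if (!pfn_valid_gigantic(i))
                break;
        if (i == start_pfn + nr_pages) {
            /* whole block looks usable, try to actually grab it */
            spin_unlock_irqrestore(&z->lock, flags);
            if (!__alloc_gigantic_page(start_pfn, nr_pages))
                return pfn_to_page(start_pfn);
            spin_lock_irqsave(&z->lock, flags);
        }
    }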

Anyway, I'll make that change, thank you very much for your review!

>
> Thanks,
> Yasuaki Ishimatsu
>
> > + if (!pfn_valid_gigantic(i)) {
> > + count = 0;
> > + continue;
> > + }
> > + if (!count) {
> > + if (!pfn_aligned_gigantic(i, order))
> > + continue;
> > + start_pfn = i;
> > + }
> > + if (++count == nr_pages) {
> > + /*
> > + * We release the zone lock here because
> > + * alloc_contig_range() will also lock the zone
> > + * at some point. If there's an allocation
> > + * spinning on this lock, it may win the race
> > + * and cause alloc_contig_range() to fail...
> > + */
> > + spin_unlock_irqrestore(&z->lock, flags);
> > + ret = __alloc_gigantic_page(start_pfn, count);
> > + if (!ret)
> > + return pfn_to_page(start_pfn);
> > + count = 0;
> > + spin_lock_irqsave(&z->lock, flags);
> > + }
> > + }
> > +
> > + spin_unlock_irqrestore(&z->lock, flags);
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
> > +static void prep_compound_gigantic_page(struct page *page, unsigned long order);
> > +
> > +static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
> > +{
> > + struct page *page;
> > +
> > + page = alloc_gigantic_page(nid, huge_page_order(h));
> > + if (page) {
> > + if (arch_prepare_gigantic_page(page)) {
> > + free_gigantic_page(page, huge_page_order(h));
> > + return NULL;
> > + }
> > + prep_compound_gigantic_page(page, huge_page_order(h));
> > + prep_new_huge_page(h, page, nid);
> > + }
> > +
> > + return page;
> > +}
> > +
> > +static int alloc_fresh_gigantic_page(struct hstate *h,
> > + nodemask_t *nodes_allowed)
> > +{
> > + struct page *page = NULL;
> > + int nr_nodes, node;
> > +
> > + for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> > + page = alloc_fresh_gigantic_page_node(h, node);
> > + if (page)
> > + return 1;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static inline bool gigantic_page_supported(void) { return true; }
> > +#else /* !CONFIG_CMA */
> > +static inline bool gigantic_page_supported(void) { return false; }
> > +static inline void free_gigantic_page(struct page *page, unsigned order) { }
> > +static inline void destroy_compound_gigantic_page(struct page *page,
> > + unsigned long order) { }
> > +static inline int alloc_fresh_gigantic_page(struct hstate *h,
> > + nodemask_t *nodes_allowed) { return 0; }
> > +#endif /* CONFIG_CMA */
> > +
> > static void update_and_free_page(struct hstate *h, struct page *page)
> > {
> > int i;
> >
> > - VM_BUG_ON(hstate_is_gigantic(h));
> > + if (hstate_is_gigantic(h) && !gigantic_page_supported())
> > + return;
> >
> > h->nr_huge_pages--;
> > h->nr_huge_pages_node[page_to_nid(page)]--;
> > @@ -661,8 +809,14 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> > VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> > set_compound_page_dtor(page, NULL);
> > set_page_refcounted(page);
> > - arch_release_hugepage(page);
> > - __free_pages(page, huge_page_order(h));
> > + if (hstate_is_gigantic(h)) {
> > + arch_release_gigantic_page(page);
> > + destroy_compound_gigantic_page(page, huge_page_order(h));
> > + free_gigantic_page(page, huge_page_order(h));
> > + } else {
> > + arch_release_hugepage(page);
> > + __free_pages(page, huge_page_order(h));
> > + }
> > }
> >
> > struct hstate *size_to_hstate(unsigned long size)
> > @@ -701,7 +855,7 @@ static void free_huge_page(struct page *page)
> > if (restore_reserve)
> > h->resv_huge_pages++;
> >
> > - if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
> > + if (h->surplus_huge_pages_node[nid]) {
> > /* remove the page from active list */
> > list_del(&page->lru);
> > update_and_free_page(h, page);
> > @@ -805,9 +959,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> > {
> > struct page *page;
> >
> > - if (hstate_is_gigantic(h))
> > - return NULL;
> > -
> > page = alloc_pages_exact_node(nid,
> > htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
> > __GFP_REPEAT|__GFP_NOWARN,
> > @@ -1452,7 +1603,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > {
> > unsigned long min_count, ret;
> >
> > - if (hstate_is_gigantic(h))
> > + if (hstate_is_gigantic(h) && !gigantic_page_supported())
> > return h->max_huge_pages;
> >
> > /*
> > @@ -1479,7 +1630,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > * and reducing the surplus.
> > */
> > spin_unlock(&hugetlb_lock);
> > - ret = alloc_fresh_huge_page(h, nodes_allowed);
> > + if (hstate_is_gigantic(h)) {
> > + ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> > + } else {
> > + ret = alloc_fresh_huge_page(h, nodes_allowed);
> > + }
> > spin_lock(&hugetlb_lock);
> > if (!ret)
> > goto out;
> > @@ -1578,7 +1733,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> > goto out;
> >
> > h = kobj_to_hstate(kobj, &nid);
> > - if (hstate_is_gigantic(h)) {
> > + if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
> > err = -EINVAL;
> > goto out;
> > }
> > @@ -2072,7 +2227,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> >
> > tmp = h->max_huge_pages;
> >
> > - if (write && hstate_is_gigantic(h))
> > + if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
> > return -EINVAL;
> >
> > table->data = &tmp;
> >
>
>

2014-04-08 02:00:24

by Yasuaki Ishimatsu

Subject: Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

(2014/04/04 22:30), Luiz Capitulino wrote:
> On Fri, 4 Apr 2014 12:05:17 +0900
> Yasuaki Ishimatsu <[email protected]> wrote:
>
>> (2014/04/03 3:08), Luiz Capitulino wrote:
>>> HugeTLB is limited to allocating hugepages whose size are less than
>>> MAX_ORDER order. This is so because HugeTLB allocates hugepages via
>>> the buddy allocator. Gigantic pages (that is, pages whose size is
>>> greater than MAX_ORDER order) have to be allocated at boottime.
>>>
>>> However, boottime allocation has at least two serious problems. First,
>>> it doesn't support NUMA and second, gigantic pages allocated at
>>> boottime can't be freed.
>>>
>>> This commit solves both issues by adding support for allocating gigantic
>>> pages during runtime. It works just like regular sized hugepages,
>>> meaning that the interface in sysfs is the same, it supports NUMA,
>>> and gigantic pages can be freed.
>>>
>>> For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
>>> gigantic pages on node 1, one can do:
>>>
>>> # echo 2 > \
>>> /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> And to free them later:
>>>
>>> # echo 0 > \
>>> /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
>>>
>>> The one problem with gigantic page allocation at runtime is that it
>>> can't be serviced by the buddy allocator. To overcome that problem, this
>>> series scans all zones from a node looking for a large enough contiguous
>>> region. When one is found, it's allocated by using CMA, that is, we call
>>> alloc_contig_range() to do the actual allocation. For example, on x86_64
>>> we scan all zones looking for a 1GB contiguous region. When one is found
>>> it's allocated by alloc_contig_range().
>>>
>>> One expected issue with that approach is that such gigantic contiguous
>>> regions tend to vanish as time goes by. The best way to avoid this for
>>> now is to make gigantic page allocations very early during boot, say
>>> from a init script. Other possible optimization include using compaction,
>>> which is supported by CMA but is not explicitly used by this commit.
>>>
>>> It's also important to note the following:
>>>
>>> 1. My target systems are x86_64 machines, so I have only tested 1GB
>>> pages allocation/release. I did try to make this arch indepedent
>>> and expect it to work on other archs but didn't try it myself
>>>
>>> 2. I didn't add support for hugepage overcommit, that is allocating
>>> a gigantic page on demand when
>>> /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
>>> think it's reasonable to do the hard and long work required for
>>> allocating a gigantic page at fault time. But it should be simple
>>> to add this if wanted
>>>
>>> Signed-off-by: Luiz Capitulino <[email protected]>
>>> ---
>>> arch/x86/include/asm/hugetlb.h | 10 +++
>>> mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
>>> 2 files changed, 176 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
>>> index a809121..2b262f7 100644
>>> --- a/arch/x86/include/asm/hugetlb.h
>>> +++ b/arch/x86/include/asm/hugetlb.h
>>> @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)

<snip>

>>
>>> + start_pfn = z->zone_start_pfn; /* to silence gcc */
>>> + for (i = z->zone_start_pfn; i < zone_end_pfn(z); i++) {
>>
>> This loop is not smart. On our system, one node has serveral TBytes.
>> So the maximum loop count is "TBytes/Page size".
>
> Interesting. Would you be willing to test this series on such a
> machine?
>
>> First page of gigantic page must be aligned.
>> So how about it:
>>
>> start_pfn = zone_start_pfn aligned gigantic page
>> for (i = start_pfn; i < zone_end_pfn; i += size of gigantic page) {
>> if (!pfn_valid_gigantic(i)) {
>> count = 0;
>> continue;
>> }
>>
>> ...
>> }
>
> I'm not sure that very loop will work because pfn_valid_gigantic() checks
> a single PFN today, but we do have to scan every single PFN on a gigantic
> page range.
>

> On the other hand, I think got what you're suggesting. When an unsuitable
> PFN is found, we should just skip to the next aligned PFN instead of
> keep scanning for nothing (which is what my loop does today). Maybe you're
> suggesting pfn_valid_gigantic() should do that?

That's right.

Thanks,
Yasuaki Ishimatsu

>
> Anyway, I'll make that change, thank you very much for you review!
>
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>> + if (!pfn_valid_gigantic(i)) {
>>> + count = 0;
>>> + continue;
>>> + }
>>> + if (!count) {
>>> + if (!pfn_aligned_gigantic(i, order))
>>> + continue;
>>> + start_pfn = i;
>>> + }
>>> + if (++count == nr_pages) {
>>> + /*
>>> + * We release the zone lock here because
>>> + * alloc_contig_range() will also lock the zone
>>> + * at some point. If there's an allocation
>>> + * spinning on this lock, it may win the race
>>> + * and cause alloc_contig_range() to fail...
>>> + */
>>> + spin_unlock_irqrestore(&z->lock, flags);
>>> + ret = __alloc_gigantic_page(start_pfn, count);
>>> + if (!ret)
>>> + return pfn_to_page(start_pfn);
>>> + count = 0;
>>> + spin_lock_irqsave(&z->lock, flags);
>>> + }
>>> + }
>>> +
>>> + spin_unlock_irqrestore(&z->lock, flags);
>>> + }
>>> +
>>> + return NULL;
>>> +}
>>> +
>>> +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
>>> +static void prep_compound_gigantic_page(struct page *page, unsigned long order);
>>> +
>>> +static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
>>> +{
>>> + struct page *page;
>>> +
>>> + page = alloc_gigantic_page(nid, huge_page_order(h));
>>> + if (page) {
>>> + if (arch_prepare_gigantic_page(page)) {
>>> + free_gigantic_page(page, huge_page_order(h));
>>> + return NULL;
>>> + }
>>> + prep_compound_gigantic_page(page, huge_page_order(h));
>>> + prep_new_huge_page(h, page, nid);
>>> + }
>>> +
>>> + return page;
>>> +}
>>> +
>>> +static int alloc_fresh_gigantic_page(struct hstate *h,
>>> + nodemask_t *nodes_allowed)
>>> +{
>>> + struct page *page = NULL;
>>> + int nr_nodes, node;
>>> +
>>> + for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>>> + page = alloc_fresh_gigantic_page_node(h, node);
>>> + if (page)
>>> + return 1;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static inline bool gigantic_page_supported(void) { return true; }
>>> +#else /* !CONFIG_CMA */
>>> +static inline bool gigantic_page_supported(void) { return false; }
>>> +static inline void free_gigantic_page(struct page *page, unsigned order) { }
>>> +static inline void destroy_compound_gigantic_page(struct page *page,
>>> + unsigned long order) { }
>>> +static inline int alloc_fresh_gigantic_page(struct hstate *h,
>>> + nodemask_t *nodes_allowed) { return 0; }
>>> +#endif /* CONFIG_CMA */
>>> +
>>> static void update_and_free_page(struct hstate *h, struct page *page)
>>> {
>>> int i;
>>>
>>> - VM_BUG_ON(hstate_is_gigantic(h));
>>> + if (hstate_is_gigantic(h) && !gigantic_page_supported())
>>> + return;
>>>
>>> h->nr_huge_pages--;
>>> h->nr_huge_pages_node[page_to_nid(page)]--;
>>> @@ -661,8 +809,14 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>>> VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
>>> set_compound_page_dtor(page, NULL);
>>> set_page_refcounted(page);
>>> - arch_release_hugepage(page);
>>> - __free_pages(page, huge_page_order(h));
>>> + if (hstate_is_gigantic(h)) {
>>> + arch_release_gigantic_page(page);
>>> + destroy_compound_gigantic_page(page, huge_page_order(h));
>>> + free_gigantic_page(page, huge_page_order(h));
>>> + } else {
>>> + arch_release_hugepage(page);
>>> + __free_pages(page, huge_page_order(h));
>>> + }
>>> }
>>>
>>> struct hstate *size_to_hstate(unsigned long size)
>>> @@ -701,7 +855,7 @@ static void free_huge_page(struct page *page)
>>> if (restore_reserve)
>>> h->resv_huge_pages++;
>>>
>>> - if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
>>> + if (h->surplus_huge_pages_node[nid]) {
>>> /* remove the page from active list */
>>> list_del(&page->lru);
>>> update_and_free_page(h, page);
>>> @@ -805,9 +959,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
>>> {
>>> struct page *page;
>>>
>>> - if (hstate_is_gigantic(h))
>>> - return NULL;
>>> -
>>> page = alloc_pages_exact_node(nid,
>>> htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
>>> __GFP_REPEAT|__GFP_NOWARN,
>>> @@ -1452,7 +1603,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>>> {
>>> unsigned long min_count, ret;
>>>
>>> - if (hstate_is_gigantic(h))
>>> + if (hstate_is_gigantic(h) && !gigantic_page_supported())
>>> return h->max_huge_pages;
>>>
>>> /*
>>> @@ -1479,7 +1630,11 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>>> * and reducing the surplus.
>>> */
>>> spin_unlock(&hugetlb_lock);
>>> - ret = alloc_fresh_huge_page(h, nodes_allowed);
>>> + if (hstate_is_gigantic(h)) {
>>> + ret = alloc_fresh_gigantic_page(h, nodes_allowed);
>>> + } else {
>>> + ret = alloc_fresh_huge_page(h, nodes_allowed);
>>> + }
>>> spin_lock(&hugetlb_lock);
>>> if (!ret)
>>> goto out;
>>> @@ -1578,7 +1733,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
>>> goto out;
>>>
>>> h = kobj_to_hstate(kobj, &nid);
>>> - if (hstate_is_gigantic(h)) {
>>> + if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
>>> err = -EINVAL;
>>> goto out;
>>> }
>>> @@ -2072,7 +2227,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>>>
>>> tmp = h->max_huge_pages;
>>>
>>> - if (write && hstate_is_gigantic(h))
>>> + if (write && hstate_is_gigantic(h) && !gigantic_page_supported())
>>> return -EINVAL;
>>>
>>> table->data = &tmp;
>>>
>>
>>
>
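(The CMA-backed helpers the hunk above calls, such as alloc_fresh_gigantic_page_node(), free_gigantic_page() and destroy_compound_gigantic_page(), are defined earlier in patch 4/4 and are not quoted in this excerpt. As a rough orientation only, the sketch below shows how runtime allocation of a gigantic page can be built on the alloc_contig_range()/free_contig_range() primitives that CONFIG_CMA provides; the names example_alloc_gigantic()/example_free_gigantic(), the MIGRATE_MOVABLE migratetype and the assumption that start_pfn is already a valid, order-aligned pfn are illustrative assumptions, not code taken from the series.)

/*
 * Illustrative sketch, not the patch's implementation: a gigantic
 * page is 1 << order physically contiguous, naturally aligned base
 * pages.  alloc_contig_range()/free_contig_range() provide exactly
 * that primitive.  The pfn-range scan over zones that a real
 * implementation needs is omitted; start_pfn is assumed to be a
 * valid pfn aligned to 1 << order.
 */
static struct page *example_alloc_gigantic(unsigned long start_pfn,
					    unsigned order)
{
	unsigned long nr_pages = 1UL << order;

	/* returns 0 on success, -errno if the range cannot be isolated */
	if (alloc_contig_range(start_pfn, start_pfn + nr_pages,
			       MIGRATE_MOVABLE))
		return NULL;

	return pfn_to_page(start_pfn);
}

static void example_free_gigantic(struct page *page, unsigned order)
{
	/* hand the contiguous range back to the page allocator */
	free_contig_range(page_to_pfn(page), 1UL << order);
}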

2014-04-08 02:01:36

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 1/4] hugetlb: add hstate_is_gigantic()

(2014/04/03 3:08), Luiz Capitulino wrote:
> Signed-off-by: Luiz Capitulino <[email protected]>
> ---

Reviewed-by: Yasuaki Ishimatsu <[email protected]>

Thanks,
Yasuaki Ishimatsu

> include/linux/hugetlb.h | 5 +++++
> mm/hugetlb.c | 28 ++++++++++++++--------------
> 2 files changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8c43cc4..8590134 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -333,6 +333,11 @@ static inline unsigned huge_page_shift(struct hstate *h)
> return h->order + PAGE_SHIFT;
> }
>
> +static inline bool hstate_is_gigantic(struct hstate *h)
> +{
> + return huge_page_order(h) >= MAX_ORDER;
> +}
> +
> static inline unsigned int pages_per_huge_page(struct hstate *h)
> {
> return 1 << h->order;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c01cb9f..8c50547 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -574,7 +574,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
>
> - VM_BUG_ON(h->order >= MAX_ORDER);
> + VM_BUG_ON(hstate_is_gigantic(h));
>
> h->nr_huge_pages--;
> h->nr_huge_pages_node[page_to_nid(page)]--;
> @@ -627,7 +627,7 @@ static void free_huge_page(struct page *page)
> if (restore_reserve)
> h->resv_huge_pages++;
>
> - if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) {
> + if (h->surplus_huge_pages_node[nid] && !hstate_is_gigantic(h)) {
> /* remove the page from active list */
> list_del(&page->lru);
> update_and_free_page(h, page);
> @@ -731,7 +731,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
>
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return NULL;
>
> page = alloc_pages_exact_node(nid,
> @@ -925,7 +925,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> struct page *page;
> unsigned int r_nid;
>
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return NULL;
>
> /*
> @@ -1118,7 +1118,7 @@ static void return_unused_surplus_pages(struct hstate *h,
> h->resv_huge_pages -= unused_resv_pages;
>
> /* Cannot return gigantic pages currently */
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return;
>
> nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
> @@ -1328,7 +1328,7 @@ static void __init gather_bootmem_prealloc(void)
> * fix confusing memory reports from free(1) and another
> * side-effects, like CommitLimit going negative.
> */
> - if (h->order > (MAX_ORDER - 1))
> + if (hstate_is_gigantic(h))
> adjust_managed_page_count(page, 1 << h->order);
> }
> }
> @@ -1338,7 +1338,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> unsigned long i;
>
> for (i = 0; i < h->max_huge_pages; ++i) {
> - if (h->order >= MAX_ORDER) {
> + if (hstate_is_gigantic(h)) {
> if (!alloc_bootmem_huge_page(h))
> break;
> } else if (!alloc_fresh_huge_page(h,
> @@ -1354,7 +1354,7 @@ static void __init hugetlb_init_hstates(void)
>
> for_each_hstate(h) {
> /* oversize hugepages were init'ed in early boot */
> - if (h->order < MAX_ORDER)
> + if (!hstate_is_gigantic(h))
> hugetlb_hstate_alloc_pages(h);
> }
> }
> @@ -1388,7 +1388,7 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
> {
> int i;
>
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return;
>
> for_each_node_mask(i, *nodes_allowed) {
> @@ -1451,7 +1451,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> {
> unsigned long min_count, ret;
>
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return h->max_huge_pages;
>
> /*
> @@ -1577,7 +1577,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> goto out;
>
> h = kobj_to_hstate(kobj, &nid);
> - if (h->order >= MAX_ORDER) {
> + if (hstate_is_gigantic(h)) {
> err = -EINVAL;
> goto out;
> }
> @@ -1660,7 +1660,7 @@ static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> unsigned long input;
> struct hstate *h = kobj_to_hstate(kobj, NULL);
>
> - if (h->order >= MAX_ORDER)
> + if (hstate_is_gigantic(h))
> return -EINVAL;
>
> err = kstrtoul(buf, 10, &input);
> @@ -2071,7 +2071,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>
> tmp = h->max_huge_pages;
>
> - if (write && h->order >= MAX_ORDER)
> + if (write && hstate_is_gigantic(h))
> return -EINVAL;
>
> table->data = &tmp;
> @@ -2124,7 +2124,7 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>
> tmp = h->nr_overcommit_huge_pages;
>
> - if (write && h->order >= MAX_ORDER)
> + if (write && hstate_is_gigantic(h))
> return -EINVAL;
>
> table->data = &tmp;
>

2014-04-08 02:01:46

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 2/4] hugetlb: update_and_free_page(): don't clear PG_reserved bit

(2014/04/03 3:08), Luiz Capitulino wrote:
> Huge pages never get the PG_reserved bit set, so don't clear it. But
> add a warning just in case.
>
> Signed-off-by: Luiz Capitulino <[email protected]>
> ---

Reviewed-by: Yasuaki Ishimatsu <[email protected]>

Thanks,
Yasuaki Ishimatsu

> mm/hugetlb.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8c50547..7e07e47 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -581,8 +581,9 @@ static void update_and_free_page(struct hstate *h, struct page *page)
> for (i = 0; i < pages_per_huge_page(h); i++) {
> page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
> 1 << PG_referenced | 1 << PG_dirty |
> - 1 << PG_active | 1 << PG_reserved |
> - 1 << PG_private | 1 << PG_writeback);
> + 1 << PG_active | 1 << PG_private |
> + 1 << PG_writeback);
> + WARN_ON(PageReserved(&page[i]));
> }
> VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
> set_compound_page_dtor(page, NULL);
>

2014-04-08 02:03:42

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 3/4] hugetlb: move helpers up in the file

(2014/04/03 3:08), Luiz Capitulino wrote:
> Next commit will add new code which will want to call the
> for_each_node_mask_to_alloc() macro. Move it, its buddy
> for_each_node_mask_to_free() and their dependencies up in the file so
> the new code can use them. This is just code movement, no logic change.
>
> Signed-off-by: Luiz Capitulino <[email protected]>
> ---

Reviewed-by: Yasuaki Ishimatsu <[email protected]>

Thanks,
Yasuaki Ishimatsu

> mm/hugetlb.c | 146 +++++++++++++++++++++++++++++------------------------------
> 1 file changed, 73 insertions(+), 73 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7e07e47..2c7a44a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -570,6 +570,79 @@ err:
> return NULL;
> }
>
> +/*
> + * common helper functions for hstate_next_node_to_{alloc|free}.
> + * We may have allocated or freed a huge page based on a different
> + * nodes_allowed previously, so h->next_node_to_{alloc|free} might
> + * be outside of *nodes_allowed. Ensure that we use an allowed
> + * node for alloc or free.
> + */
> +static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> +{
> + nid = next_node(nid, *nodes_allowed);
> + if (nid == MAX_NUMNODES)
> + nid = first_node(*nodes_allowed);
> + VM_BUG_ON(nid >= MAX_NUMNODES);
> +
> + return nid;
> +}
> +
> +static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
> +{
> + if (!node_isset(nid, *nodes_allowed))
> + nid = next_node_allowed(nid, nodes_allowed);
> + return nid;
> +}
> +
> +/*
> + * returns the previously saved node ["this node"] from which to
> + * allocate a persistent huge page for the pool and advance the
> + * next node from which to allocate, handling wrap at end of node
> + * mask.
> + */
> +static int hstate_next_node_to_alloc(struct hstate *h,
> + nodemask_t *nodes_allowed)
> +{
> + int nid;
> +
> + VM_BUG_ON(!nodes_allowed);
> +
> + nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> + h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> +
> + return nid;
> +}
> +
> +/*
> + * helper for free_pool_huge_page() - return the previously saved
> + * node ["this node"] from which to free a huge page. Advance the
> + * next node id whether or not we find a free huge page to free so
> + * that the next attempt to free addresses the next node.
> + */
> +static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> +{
> + int nid;
> +
> + VM_BUG_ON(!nodes_allowed);
> +
> + nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
> + h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
> +
> + return nid;
> +}
> +
> +#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
> + for (nr_nodes = nodes_weight(*mask); \
> + nr_nodes > 0 && \
> + ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
> + nr_nodes--)
> +
> +#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
> + for (nr_nodes = nodes_weight(*mask); \
> + nr_nodes > 0 && \
> + ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> + nr_nodes--)
> +
> static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
> @@ -750,79 +823,6 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> return page;
> }
>
> -/*
> - * common helper functions for hstate_next_node_to_{alloc|free}.
> - * We may have allocated or freed a huge page based on a different
> - * nodes_allowed previously, so h->next_node_to_{alloc|free} might
> - * be outside of *nodes_allowed. Ensure that we use an allowed
> - * node for alloc or free.
> - */
> -static int next_node_allowed(int nid, nodemask_t *nodes_allowed)
> -{
> - nid = next_node(nid, *nodes_allowed);
> - if (nid == MAX_NUMNODES)
> - nid = first_node(*nodes_allowed);
> - VM_BUG_ON(nid >= MAX_NUMNODES);
> -
> - return nid;
> -}
> -
> -static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
> -{
> - if (!node_isset(nid, *nodes_allowed))
> - nid = next_node_allowed(nid, nodes_allowed);
> - return nid;
> -}
> -
> -/*
> - * returns the previously saved node ["this node"] from which to
> - * allocate a persistent huge page for the pool and advance the
> - * next node from which to allocate, handling wrap at end of node
> - * mask.
> - */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> - nodemask_t *nodes_allowed)
> -{
> - int nid;
> -
> - VM_BUG_ON(!nodes_allowed);
> -
> - nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> - h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> -
> - return nid;
> -}
> -
> -/*
> - * helper for free_pool_huge_page() - return the previously saved
> - * node ["this node"] from which to free a huge page. Advance the
> - * next node id whether or not we find a free huge page to free so
> - * that the next attempt to free addresses the next node.
> - */
> -static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> -{
> - int nid;
> -
> - VM_BUG_ON(!nodes_allowed);
> -
> - nid = get_valid_node_allowed(h->next_nid_to_free, nodes_allowed);
> - h->next_nid_to_free = next_node_allowed(nid, nodes_allowed);
> -
> - return nid;
> -}
> -
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
> - for (nr_nodes = nodes_weight(*mask); \
> - nr_nodes > 0 && \
> - ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
> - nr_nodes--)
> -
> -#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
> - for (nr_nodes = nodes_weight(*mask); \
> - nr_nodes > 0 && \
> - ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> - nr_nodes--)
> -
> static int alloc_fresh_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> {
> struct page *page;
>

2014-04-08 22:51:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

On Mon, 7 Apr 2014 14:49:35 -0400 Luiz Capitulino <[email protected]> wrote:

> > > ---
> > > arch/x86/include/asm/hugetlb.h | 10 +++
> > > mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
> > > 2 files changed, 176 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> > > index a809121..2b262f7 100644
> > > --- a/arch/x86/include/asm/hugetlb.h
> > > +++ b/arch/x86/include/asm/hugetlb.h
> > > @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
> > > {
> > > }
> > >
> > > +static inline int arch_prepare_gigantic_page(struct page *page)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > +static inline void arch_release_gigantic_page(struct page *page)
> > > +{
> > > +}
> > > +
> > > +
> > > static inline void arch_clear_hugepage_flags(struct page *page)
> > > {
> > > }
> >
> > These are defined only on arch/x86, but called in generic code.
> > Does it cause build failure on other archs?
>
> Hmm, probably. The problem here is that I'm unable to test this
> code on other archs. So I think the best solution for the first
> merge is to make the build of this feature conditional on x86_64.
> Then the first person interested in making this work on other
> archs can add the generic code. Sounds reasonable?

These functions don't actually do anything so if and when other
architectures come along to implement this feature, their developers
won't know what you were thinking when you added them. So how about
some code comments to explain their roles and responsibilities?

Or just delete them altogether and let people add them (or something
similar) if and when the need arises. It's hard to tell when one lacks
telepathic powers, sigh.
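(For illustration of the first option -- comments only, no behavioral change -- the x86 stubs quoted above could carry comments spelling out the intended contract. The wording below is a sketch of what such comments might say, not text from the series; the "return non-zero to reject" semantics are assumed by analogy with arch_prepare_hugepage().)

/*
 * arch_prepare_gigantic_page - arch hook, called after a gigantic
 * page has been allocated but before it is added to the hugetlb
 * pool.  Lets an architecture set up any per-page state it needs
 * for pages above MAX_ORDER; return non-zero to reject the page.
 */
static inline int arch_prepare_gigantic_page(struct page *page)
{
	return 0;
}

/*
 * arch_release_gigantic_page - arch hook, called when a gigantic
 * page is removed from the hugetlb pool and is about to be freed.
 * Must undo whatever arch_prepare_gigantic_page() set up.
 */
static inline void arch_release_gigantic_page(struct page *page)
{
}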

2014-04-09 00:29:49

by Luiz Capitulino

[permalink] [raw]
Subject: Re: [PATCH 4/4] hugetlb: add support for gigantic page allocation at runtime

On Tue, 8 Apr 2014 15:51:02 -0700
Andrew Morton <[email protected]> wrote:

> On Mon, 7 Apr 2014 14:49:35 -0400 Luiz Capitulino <[email protected]> wrote:
>
> > > > ---
> > > > arch/x86/include/asm/hugetlb.h | 10 +++
> > > > mm/hugetlb.c | 177 ++++++++++++++++++++++++++++++++++++++---
> > > > 2 files changed, 176 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> > > > index a809121..2b262f7 100644
> > > > --- a/arch/x86/include/asm/hugetlb.h
> > > > +++ b/arch/x86/include/asm/hugetlb.h
> > > > @@ -91,6 +91,16 @@ static inline void arch_release_hugepage(struct page *page)
> > > > {
> > > > }
> > > >
> > > > +static inline int arch_prepare_gigantic_page(struct page *page)
> > > > +{
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static inline void arch_release_gigantic_page(struct page *page)
> > > > +{
> > > > +}
> > > > +
> > > > +
> > > > static inline void arch_clear_hugepage_flags(struct page *page)
> > > > {
> > > > }
> > >
> > > These are defined only on arch/x86, but called in generic code.
> > > Does it cause build failure on other archs?
> >
> > Hmm, probably. The problem here is that I'm unable to test this
> > code on other archs. So I think the best solution for the first
> > merge is to make the build of this feature conditional on x86_64.
> > Then the first person interested in making this work on other
> > archs can add the generic code. Sounds reasonable?
>
> These functions don't actually do anything so if and when other
> architectures come along to implement this feature, their developers
> won't know what you were thinking when you added them. So how about
> some code comments to explain their roles and responsibilities?
>
> Or just delete them altogether and let people add them (or something
> similar) if and when the need arises. It's hard to tell when one lacks
> telepathic powers, sigh.

That's exactly what I did for v2 (already posted).
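(As orientation only: one way to express "conditional on x86_64" is to tighten the preprocessor guard that the quoted hunk already uses for CONFIG_CMA, so the CMA-backed path is only compiled where the arch hooks exist. This is a sketch of the general pattern, not necessarily what v2 does.)

#if defined(CONFIG_CMA) && defined(CONFIG_X86_64)
/* CMA-backed gigantic page helpers, as in the quoted hunk, go here. */
static inline bool gigantic_page_supported(void) { return true; }
#else
/* Everything else falls back to the existing "gigantic pages are
 * boot-time only" behaviour via these no-op stubs. */
static inline bool gigantic_page_supported(void) { return false; }
static inline void free_gigantic_page(struct page *page, unsigned order) { }
static inline void destroy_compound_gigantic_page(struct page *page,
						unsigned long order) { }
static inline int alloc_fresh_gigantic_page(struct hstate *h,
						nodemask_t *nodes_allowed) { return 0; }
#endif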