The concurrent use of multiple hugetlb page sizes on a single system
is becoming more common. One of the reasons is better TLB support for
gigantic page sizes on x86 hardware. In addition, hugetlb pages are
being used to back VMs in hosting environments.
When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools. This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup. In addition, preallocating
huge pages minimizes the issue of memory fragmentation that increases
the longer the system is up and running.
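For illustration, such a preallocation might look roughly as follows on x86;
the 1GB/2MB sizes and the pool counts below are only examples, and paths are
abbreviated:
echo 16 > .../hugepages-1048576kB/nr_hugepages
echo 8192 > .../hugepages-2048kB/nr_hugepages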
In such environments, a combination of larger and smaller hugetlb pages
is preallocated in anticipation of backing VMs of various sizes. Over
time, the preallocated pool of smaller hugetlb pages may become
depleted while larger hugetlb pages still remain. In such situations,
it is desirable to convert larger hugetlb pages to smaller hugetlb pages.
Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages. For example, to convert fifty 1GB pages to 2MB pages on x86:
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
On an idle system this operation is fairly reliable: the expected number
of 2MB pages is created and the operation completes in a second or two.
However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
will match the size of the larger page which was freed. This is
because the area freed by the larger page could quickly be
fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:
Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real 0m42.092s
user 0m0.008s
sys 0m41.467s
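For reference, output like the above can be produced by timing the conversion
and comparing the 2MB pool count before and after; a rough sketch, using the
variables from the earlier example:
m2_before=`cat .../hugepages-2048kB/nr_hugepages`
time {
	echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
	echo $(($m2_before + 25600)) > .../hugepages-2048kB/nr_hugepages
}
m2_after=`cat .../hugepages-2048kB/nr_hugepages`
if [ $(($m2_after - $m2_before)) -ne 25600 ]; then
	echo "Unexpected number of 2MB pages allocated: Expected 25600, have $(($m2_after - $m2_before))"
fi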
To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size. This avoids freeing pages to buddy and then
trying to allocate from buddy.
Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size Target page size for demotion, a smaller huge page size.
File can be written to choose a smaller huge page size if
multiple are available.
- demote Number of hugetlb pages to demote (write-only)
To demote fifty 1GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote
Only hugetlb pages which are free at the time of the request can be demoted.
Demotion does not add to the complexity of surplus pages and honors reserved
huge pages. Therefore, when a value is written to the sysfs demote file,
that value is only the maximum number of pages which will be demoted. It
is possible fewer will actually be demoted. The recently introduced
per-hstate mutex is used to synchronize demote operations with other
operations that modify hugetlb pools.
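Because fewer pages than requested may be demoted, one way to check the
result is to compare pool counts before and after the write. A small sketch,
with an arbitrary request of 10 pages:
gb_before=`cat .../hugepages-1048576kB/nr_hugepages`
echo 10 > .../hugepages-1048576kB/demote
gb_after=`cat .../hugepages-1048576kB/nr_hugepages`
echo "1GB pages actually demoted: $(($gb_before - $gb_after))"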
Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages are
used to back VMs on x86. Both issues of long allocation times and not
necessarily getting the expected number of smaller huge pages after a free
and allocate cycle have been experienced. The occurrence of these issues
is dependent on other activity within the host and can not be predicted.
v3 -> v4
- Fix dead store in demote_size_show and rewrite to look better
- Fix documentation typo
- Make code setting up per-hstate demote orders more clear
- Add warning if demote_pool_huge_page passed null demote order
- Acquired hugetlb_lock later in demote_store
- Restored cma_release debug message
- Updated commit message to explain the need for using HUGETLB_PAGE_ORDER
- Renamed function to destroy hugetlb pages for demote
- Made sure error codes are passed all the way back to user
v2 -> v3
- Require gigantic_page_runtime_supported for demote
- Simplify code in demote_store and update comment
- Remove hugetlb specific cma flag, add cma_pages_valid interface
- Retain error return code in demote_free_huge_page
RESEND -> v2
- Removed optimizations for vmemmap optimized pages
- Make demote_size writable
- Removed demote interfaces for smallest huge page size
- Updated documentation and commit messages
- Fixed build break for !CONFIG_ARCH_HAS_GIGANTIC_PAGE
v1 -> RESEND
- Rebase on next-20210816
- Fix a few typos in commit messages
RFC -> v1
- Provides basic support for vmemmap optimized pages
- Takes speculative page references into account
- Updated Documentation file
- Added optimizations for vmemmap optimized pages
Mike Kravetz (5):
hugetlb: add demote hugetlb page sysfs interfaces
mm/cma: add cma_pages_valid to determine if pages are in CMA
hugetlb: be sure to free demoted CMA pages to CMA
hugetlb: add demote bool to gigantic page routines
hugetlb: add hugetlb demote page support
Documentation/admin-guide/mm/hugetlbpage.rst | 30 +-
include/linux/cma.h | 1 +
include/linux/hugetlb.h | 1 +
mm/cma.c | 24 +-
mm/hugetlb.c | 318 ++++++++++++++++++-
5 files changed, 353 insertions(+), 21 deletions(-)
--
2.31.1
The routines remove_hugetlb_page and destroy_compound_gigantic_page
will remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator. In the process of doing this, they
make all base pages reference counted.
The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages. It assumes that all these base pages are reference
counted.
During demotion, a gigantic page will be split into huge pages of a
smaller size. This logically involves use of the routines
remove_hugetlb_page and destroy_compound_gigantic_page, followed by
prep_compound*_page for each smaller huge page.
When pages are reference counted (ref count >= 0), additional
speculative ref counts could be taken. This could result in errors
while demoting a huge page. Quite a bit of code would need to be
created to handle all possible issues.
Instead of dealing with the possibility of speculative ref counts, avoid
the possibility by keeping ref counts at zero during the demote process.
Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page. If the
boolean is set, the remove and destroy routines will not reference count
pages and the prep routine will not expect reference counted pages.
'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 43 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 563338f4dbc4..794e0c4c1b3c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1271,8 +1271,8 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
nr_nodes--)
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
- unsigned int order)
+static void __destroy_compound_gigantic_page(struct page *page,
+ unsigned int order, bool demote)
{
int i;
int nr_pages = 1 << order;
@@ -1284,7 +1284,8 @@ static void destroy_compound_gigantic_page(struct page *page,
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
p->mapping = NULL;
clear_compound_head(p);
- set_page_refcounted(p);
+ if (!demote)
+ set_page_refcounted(p);
}
set_compound_order(page, 0);
@@ -1292,6 +1293,12 @@ static void destroy_compound_gigantic_page(struct page *page,
__ClearPageHead(page);
}
+static void destroy_compound_gigantic_page(struct page *page,
+ unsigned int order)
+{
+ __destroy_compound_gigantic_page(page, order, false);
+}
+
static void free_gigantic_page(struct page *page, unsigned int order)
{
/*
@@ -1364,12 +1371,15 @@ static inline void destroy_compound_gigantic_page(struct page *page,
/*
* Remove hugetlb page from lists, and update dtor so that page appears
- * as just a compound page. A reference is held on the page.
+ * as just a compound page.
+ *
+ * A reference is held on the page, except in the case of demote.
*
* Must be called with hugetlb lock held.
*/
-static void remove_hugetlb_page(struct hstate *h, struct page *page,
- bool adjust_surplus)
+static void __remove_hugetlb_page(struct hstate *h, struct page *page,
+ bool adjust_surplus,
+ bool demote)
{
int nid = page_to_nid(page);
@@ -1407,8 +1417,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
*
* This handles the case where more than one ref is held when and
* after update_and_free_page is called.
+ *
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
- set_page_refcounted(page);
+ if (!demote)
+ set_page_refcounted(page);
if (hstate_is_gigantic(h))
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
else
@@ -1418,6 +1432,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
h->nr_huge_pages_node[nid]--;
}
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+ bool adjust_surplus)
+{
+ __remove_hugetlb_page(h, page, adjust_surplus, false);
+}
+
static void add_hugetlb_page(struct hstate *h, struct page *page,
bool adjust_surplus)
{
@@ -1681,7 +1701,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
spin_unlock_irq(&hugetlb_lock);
}
-static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
+ bool demote)
{
int i, j;
int nr_pages = 1 << order;
@@ -1719,10 +1740,16 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
* the set of pages can not be converted to a gigantic page.
* The caller who allocated the pages should then discard the
* pages using the appropriate free interface.
+ *
+ * In the case of demote, the ref count will be zero.
*/
- if (!page_ref_freeze(p, 1)) {
- pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
- goto out_error;
+ if (!demote) {
+ if (!page_ref_freeze(p, 1)) {
+ pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+ goto out_error;
+ }
+ } else {
+ VM_BUG_ON_PAGE(page_count(p), p);
}
set_page_count(p, 0);
set_compound_head(p, page);
@@ -1747,6 +1774,11 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
return false;
}
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+{
+ return __prep_compound_gigantic_page(page, order, false);
+}
+
/*
* PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
--
2.31.1
Two new sysfs files are added to demote hugetlb pages. These files are
both per-hugetlb page size and per node. Files are:
demote_size - The size in KB that pages are demoted to. (read-write)
demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smaller huge page size. Any valid
huge page size smaller than the current huge page size may be written
to this file. When huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
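As a sketch of the expected usage on x86 (paths abbreviated; the sizes and
counts below are only examples):
cat .../hugepages-1048576kB/demote_size          # default target, e.g. 2048kB
echo 2M > .../hugepages-1048576kB/demote_size    # optionally select a valid smaller size
echo 2 > .../hugepages-1048576kB/demote          # attempt to demote two 1GB pages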
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size. For example, on x86 1GB huge
pages will have demote interfaces. 2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality. It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
Signed-off-by: Mike Kravetz <[email protected]>
---
Documentation/admin-guide/mm/hugetlbpage.rst | 30 +++-
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 155 ++++++++++++++++++-
3 files changed, 183 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..bb90de3885d1 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
hugepages-${size}kB
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist. In addition, two additional interfaces for demoting huge
+pages may exist::
+ demote
+ demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages. For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes. A 1GB huge page can be split into 512
+2MB huge pages. Demote interfaces are not available for the smallest
+huge page size. The demote interfaces are:
+
+demote_size
+ is the size of demoted pages. When a page is demoted a corresponding
+ number of huge pages of demote_size will be created. By default,
+ demote_size is set to the next smaller huge page size. If there are
+ multiple smaller huge page sizes, demote_size can be set to any of
+ these smaller sizes. Only huge page sizes less than the current huge
+ pages size are allowed.
+
+demote
+ is used to demote a number of huge pages. A user with root privileges
+ can write to this file. It may not be possible to demote the
+ requested number of huge pages. To determine how many pages were
+ actually demoted, compare the value of nr_hugepages before and after
+ writing to the demote interface. demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..f2c3979efd69 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@ struct hstate {
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
+ unsigned int demote_order;
unsigned long mask;
unsigned long max_huge_pages;
unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..44d3c9477635 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
static void __init hugetlb_init_hstates(void)
{
- struct hstate *h;
+ struct hstate *h, *h2;
for_each_hstate(h) {
if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
/* oversize hugepages were init'ed in early boot */
if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
+
+ /*
+ * Set demote order for each hstate. Note that
+ * h->demote_order is initially 0.
+ * - We can not demote gigantic pages if runtime freeing
+ * is not supported, so skip this.
+ */
+ if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+ continue;
+ for_each_hstate(h2) {
+ if (h2 == h)
+ continue;
+ if (h2->order < h->order &&
+ h2->order > h->demote_order)
+ h->demote_order = h2->order;
+ }
}
VM_BUG_ON(minimum_order == UINT_MAX);
}
@@ -3235,9 +3251,31 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
return 0;
}
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+ __must_hold(&hugetlb_lock)
+{
+ int rc = 0;
+
+ lockdep_assert_held(&hugetlb_lock);
+
+ /* We should never get here if no demote order */
+ if (!h->demote_order) {
+ pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
+ return -EINVAL; /* internal error */
+ }
+
+ /*
+ * TODO - demote fucntionality will be added in subsequent patch
+ */
+ return rc;
+}
+
#define HSTATE_ATTR_RO(_name) \
static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+#define HSTATE_ATTR_WO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
#define HSTATE_ATTR(_name) \
static struct kobj_attribute _name##_attr = \
__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
}
HSTATE_ATTR_RO(surplus_hugepages);
+static ssize_t demote_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t len)
+{
+ unsigned long nr_demote;
+ unsigned long nr_available;
+ nodemask_t nodes_allowed, *n_mask;
+ struct hstate *h;
+ int err = 0;
+ int nid;
+
+ err = kstrtoul(buf, 10, &nr_demote);
+ if (err)
+ return err;
+ h = kobj_to_hstate(kobj, &nid);
+
+ /* Synchronize with other sysfs operations modifying huge pages */
+ mutex_lock(&h->resize_lock);
+
+ if (nid != NUMA_NO_NODE) {
+ init_nodemask_of_node(&nodes_allowed, nid);
+ n_mask = &nodes_allowed;
+ } else {
+ n_mask = &node_states[N_MEMORY];
+ }
+
+ spin_lock_irq(&hugetlb_lock);
+ while (nr_demote) {
+ /*
+ * Check for available pages to demote each time thorough the
+ * loop as demote_pool_huge_page will drop hugetlb_lock.
+ *
+ * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
+ * but will when full demote functionality is added in a later
+ * patch.
+ */
+ if (nid != NUMA_NO_NODE)
+ nr_available = h->free_huge_pages_node[nid];
+ else
+ nr_available = h->free_huge_pages;
+ nr_available -= h->resv_huge_pages;
+ if (!nr_available)
+ break;
+
+ err = demote_pool_huge_page(h, n_mask);
+ if (err)
+ break;
+
+ nr_demote--;
+ }
+
+ spin_unlock_irq(&hugetlb_lock);
+ mutex_unlock(&h->resize_lock);
+
+ if (err)
+ return err;
+ return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int nid;
+ struct hstate *h = kobj_to_hstate(kobj, &nid);
+ unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
+
+ return sysfs_emit(buf, "%lukB\n", demote_size);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct hstate *h, *demote_hstate;
+ unsigned long demote_size;
+ unsigned int demote_order;
+ int nid;
+
+ demote_size = (unsigned long)memparse(buf, NULL);
+
+ demote_hstate = size_to_hstate(demote_size);
+ if (!demote_hstate)
+ return -EINVAL;
+ demote_order = demote_hstate->order;
+
+ /* demote order must be smaller than hstate order */
+ h = kobj_to_hstate(kobj, &nid);
+ if (demote_order >= h->order)
+ return -EINVAL;
+
+ /* resize_lock synchronizes access to demote size and writes */
+ mutex_lock(&h->resize_lock);
+ h->demote_order = demote_order;
+ mutex_unlock(&h->resize_lock);
+
+ return count;
+}
+HSTATE_ATTR(demote_size);
+
static struct attribute *hstate_attrs[] = {
&nr_hugepages_attr.attr,
&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
.attrs = hstate_attrs,
};
+static struct attribute *hstate_demote_attrs[] = {
+ &demote_size_attr.attr,
+ &demote_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+ .attrs = hstate_demote_attrs,
+};
+
static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct kobject **hstate_kobjs,
const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
hstate_kobjs[hi] = NULL;
}
+ if (h->demote_order) {
+ if (sysfs_create_group(hstate_kobjs[hi],
+ &hstate_demote_attr_group))
+ pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+ }
+
return retval;
}
--
2.31.1
Demote page functionality will split a huge page into a number of huge
pages of a smaller size. For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages. Demotion is done 'in place' by simply
splitting the huge page.
Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page for use
by demote code.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 74 insertions(+), 8 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 794e0c4c1b3c..22cb6126be7e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1270,7 +1270,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+/* used to demote non-gigantic_huge pages as well */
static void __destroy_compound_gigantic_page(struct page *page,
unsigned int order, bool demote)
{
@@ -1293,6 +1293,13 @@ static void __destroy_compound_gigantic_page(struct page *page,
__ClearPageHead(page);
}
+static void destroy_compound_hugetlb_page_for_demote(struct page *page,
+ unsigned int order)
+{
+ __destroy_compound_gigantic_page(page, order, true);
+}
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
static void destroy_compound_gigantic_page(struct page *page,
unsigned int order)
{
@@ -1438,6 +1445,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
__remove_hugetlb_page(h, page, adjust_surplus, false);
}
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+ bool adjust_surplus)
+{
+ __remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
static void add_hugetlb_page(struct hstate *h, struct page *page,
bool adjust_surplus)
{
@@ -1779,6 +1792,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
return __prep_compound_gigantic_page(page, order, false);
}
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+ unsigned int order)
+{
+ return __prep_compound_gigantic_page(page, order, true);
+}
+
/*
* PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
@@ -3304,9 +3323,54 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
return 0;
}
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+ int i, nid = page_to_nid(page);
+ struct hstate *target_hstate;
+ int rc = 0;
+
+ target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+ remove_hugetlb_page_for_demote(h, page, false);
+ spin_unlock_irq(&hugetlb_lock);
+
+ rc = alloc_huge_page_vmemmap(h, page);
+ if (rc) {
+ /* Allocation of vmemmmap failed, we can not demote page */
+ spin_lock_irq(&hugetlb_lock);
+ set_page_refcounted(page);
+ add_hugetlb_page(h, page, false);
+ return rc;
+ }
+
+ /*
+ * Use destroy_compound_hugetlb_page_for_demote for all huge page
+ * sizes as it will not ref count pages.
+ */
+ destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
+
+ for (i = 0; i < pages_per_huge_page(h);
+ i += pages_per_huge_page(target_hstate)) {
+ if (hstate_is_gigantic(target_hstate))
+ prep_compound_gigantic_page_for_demote(page + i,
+ target_hstate->order);
+ else
+ prep_compound_page(page + i, target_hstate->order);
+ set_page_private(page + i, 0);
+ set_page_refcounted(page + i);
+ prep_new_huge_page(target_hstate, page + i, nid);
+ put_page(page + i);
+ }
+
+ spin_lock_irq(&hugetlb_lock);
+ return rc;
+}
+
static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
__must_hold(&hugetlb_lock)
{
+ int nr_nodes, node;
+ struct page *page;
int rc = 0;
lockdep_assert_held(&hugetlb_lock);
@@ -3317,9 +3381,15 @@ static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
return -EINVAL; /* internal error */
}
- /*
- * TODO - demote fucntionality will be added in subsequent patch
- */
+ for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+ if (!list_empty(&h->hugepage_freelists[node])) {
+ page = list_entry(h->hugepage_freelists[node].next,
+ struct page, lru);
+ rc = demote_free_huge_page(h, page);
+ break;
+ }
+ }
+
return rc;
}
@@ -3554,10 +3624,6 @@ static ssize_t demote_store(struct kobject *kobj,
/*
* Check for available pages to demote each time thorough the
* loop as demote_pool_huge_page will drop hugetlb_lock.
- *
- * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
- * but will when full demote functionality is added in a later
- * patch.
*/
if (nid != NUMA_NO_NODE)
nr_available = h->free_huge_pages_node[nid];
--
2.31.1
When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size. For example, on x86 a 1G page
can be demoted to 512 2M pages. However, gigantic pages can potentially
be allocated from CMA. If a gigantic page which was allocated from CMA
is demoted, the corresponding demoted pages need to be returned to CMA.
Use the new interface cma_pages_valid() to determine if a non-gigantic
hugetlb page should be freed to CMA. Also, clear mapping field of these
pages as expected by cma_release.
This also requires a change to CMA region creation for gigantic pages.
CMA uses a per-region bit map to track allocations. When setting up the
region, you specify how many pages each bit represents. Currently,
only gigantic pages are allocated/freed from CMA so the region is set up
such that one bit represents a gigantic page size allocation.
With demote, a gigantic page (allocation) could be split into smaller
size pages, and these smaller size pages will be freed to CMA. Since the
per-region bit map must represent the smallest allocation/free size, it
now needs to correspond to the smallest huge page size which can be freed
to CMA (on x86, for example, one bit per 2MB rather than one bit per 1GB).
Unfortunately, we set up the CMA region for huge pages before we set up
huge pages sizes (hstates). So, technically we do not know the smallest
huge page size as this can change via command line options and
architecture specific code. Therefore, at region setup time we use
HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be
given back to CMA. It is possible that this value is sub-optimal for
some architectures/config options. If needed, this can be addressed in
follow on work.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 41 +++++++++++++++++++++++++++++++++++++++--
1 file changed, 39 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 44d3c9477635..563338f4dbc4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -50,6 +50,16 @@ struct hstate hstates[HUGE_MAX_HSTATE];
#ifdef CONFIG_CMA
static struct cma *hugetlb_cma[MAX_NUMNODES];
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+ return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
+ 1 << order);
+}
+#else
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+ return false;
+}
#endif
static unsigned long hugetlb_cma_size __initdata;
@@ -1272,6 +1282,7 @@ static void destroy_compound_gigantic_page(struct page *page,
atomic_set(compound_pincount_ptr(page), 0);
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+ p->mapping = NULL;
clear_compound_head(p);
set_page_refcounted(p);
}
@@ -1476,7 +1487,13 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
- if (hstate_is_gigantic(h)) {
+
+ /*
+ * Non-gigantic pages demoted from CMA allocated gigantic pages
+ * need to be given back to CMA in free_gigantic_page.
+ */
+ if (hstate_is_gigantic(h) ||
+ hugetlb_cma_page(page, huge_page_order(h))) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
} else {
@@ -3001,9 +3018,13 @@ static void __init hugetlb_init_hstates(void)
* h->demote_order is initially 0.
* - We can not demote gigantic pages if runtime freeing
* is not supported, so skip this.
+ * - If CMA allocation is possible, we can not demote
+ * HUGETLB_PAGE_ORDER or smaller size pages.
*/
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
continue;
+ if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
+ continue;
for_each_hstate(h2) {
if (h2 == h)
continue;
@@ -3555,6 +3576,8 @@ static ssize_t demote_size_store(struct kobject *kobj,
if (!demote_hstate)
return -EINVAL;
demote_order = demote_hstate->order;
+ if (demote_order < HUGETLB_PAGE_ORDER)
+ return -EINVAL;
/* demote order must be smaller than hstate order */
h = kobj_to_hstate(kobj, &nid);
@@ -6543,6 +6566,7 @@ void __init hugetlb_cma_reserve(int order)
if (hugetlb_cma_size < (PAGE_SIZE << order)) {
pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
(PAGE_SIZE << order) / SZ_1M);
+ hugetlb_cma_size = 0;
return;
}
@@ -6563,7 +6587,13 @@ void __init hugetlb_cma_reserve(int order)
size = round_up(size, PAGE_SIZE << order);
snprintf(name, sizeof(name), "hugetlb%d", nid);
- res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+ /*
+ * Note that 'order per bit' is based on smallest size that
+ * may be returned to CMA allocator in the case of
+ * huge page demotion.
+ */
+ res = cma_declare_contiguous_nid(0, size, 0,
+ PAGE_SIZE << HUGETLB_PAGE_ORDER,
0, false, name,
&hugetlb_cma[nid], nid);
if (res) {
@@ -6579,6 +6609,13 @@ void __init hugetlb_cma_reserve(int order)
if (reserved >= hugetlb_cma_size)
break;
}
+
+ if (!reserved)
+ /*
+ * hugetlb_cma_size is used to determine if allocations from
+ * cma are possible. Set to zero if no cma regions are set up.
+ */
+ hugetlb_cma_size = 0;
}
void __init hugetlb_cma_check(void)
--
2.31.1
Add new interface cma_pages_valid() which indicates if the specified
pages are part of a CMA region. This interface will be used in a
subsequent patch by hugetlb code.
In order to keep the same amount of DEBUG information, a pr_debug() call
was added to cma_pages_valid(). In the case where the page passed to
cma_release is not in cma region, the debug message will be printed from
cma_pages_valid as opposed to cma_release.
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
---
include/linux/cma.h | 1 +
mm/cma.c | 24 ++++++++++++++++++++----
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 53fd8c3cdbd0..bd801023504b 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -46,6 +46,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
bool no_warn);
+extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
diff --git a/mm/cma.c b/mm/cma.c
index 995e15480937..11152c3fb23c 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -524,6 +524,25 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
return page;
}
+bool cma_pages_valid(struct cma *cma, const struct page *pages,
+ unsigned long count)
+{
+ unsigned long pfn;
+
+ if (!cma || !pages)
+ return false;
+
+ pfn = page_to_pfn(pages);
+
+ if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count) {
+ pr_debug("%s(page %p, count %lu)\n", __func__,
+ (void *)pages, count);
+ return false;
+ }
+
+ return true;
+}
+
/**
* cma_release() - release allocated pages
* @cma: Contiguous memory region for which the allocation is performed.
@@ -539,16 +558,13 @@ bool cma_release(struct cma *cma, const struct page *pages,
{
unsigned long pfn;
- if (!cma || !pages)
+ if (!cma_pages_valid(cma, pages, count))
return false;
pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
pfn = page_to_pfn(pages);
- if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
- return false;
-
VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
free_contig_range(pfn, count);
--
2.31.1
On Thu, Oct 07, 2021 at 11:19:14AM -0700, Mike Kravetz wrote:
> +static ssize_t demote_store(struct kobject *kobj,
> + struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> + unsigned long nr_demote;
> + unsigned long nr_available;
> + nodemask_t nodes_allowed, *n_mask;
> + struct hstate *h;
> + int err = 0;
> + int nid;
> +
> + err = kstrtoul(buf, 10, &nr_demote);
> + if (err)
> + return err;
> + h = kobj_to_hstate(kobj, &nid);
> +
> + /* Synchronize with other sysfs operations modifying huge pages */
> + mutex_lock(&h->resize_lock);
> +
> + if (nid != NUMA_NO_NODE) {
> + init_nodemask_of_node(&nodes_allowed, nid);
> + n_mask = &nodes_allowed;
> + } else {
> + n_mask = &node_states[N_MEMORY];
> + }
Why this needs to be protected by the resize_lock? I do not understand
what are we really protecting here and from what.
Besides that, I did not spot anything wrong.
--
Oscar Salvador
SUSE Labs
On Thu, Oct 07, 2021 at 11:19:15AM -0700, Mike Kravetz wrote:
> Add new interface cma_pages_valid() which indicates if the specified
> pages are part of a CMA region. This interface will be used in a
> subsequent patch by hugetlb code.
>
> In order to keep the same amount of DEBUG information, a pr_debug() call
> was added to cma_pages_valid(). In the case where the page passed to
> cma_release is not in cma region, the debug message will be printed from
> cma_pages_valid as opposed to cma_release.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: OScar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Fri, Oct 08, 2021 at 09:53:54AM +0200, Oscar Salvador wrote:
> On Thu, Oct 07, 2021 at 11:19:15AM -0700, Mike Kravetz wrote:
> > Add new interface cma_pages_valid() which indicates if the specified
> > pages are part of a CMA region. This interface will be used in a
> > subsequent patch by hugetlb code.
> >
> > In order to keep the same amount of DEBUG information, a pr_debug() call
> > was added to cma_pages_valid(). In the case where the page passed to
> > cma_release is not in cma region, the debug message will be printed from
> > cma_pages_valid as opposed to cma_release.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > Acked-by: David Hildenbrand <[email protected]>
>
> Reviewed-by: OScar Salvador <[email protected]>
Fat fingers: s/OScar/Oscar
--
Oscar Salvador
SUSE Labs
On 10/8/21 12:51 AM, Oscar Salvador wrote:
> On Thu, Oct 07, 2021 at 11:19:14AM -0700, Mike Kravetz wrote:
>> +static ssize_t demote_store(struct kobject *kobj,
>> + struct kobj_attribute *attr, const char *buf, size_t len)
>> +{
>> + unsigned long nr_demote;
>> + unsigned long nr_available;
>> + nodemask_t nodes_allowed, *n_mask;
>> + struct hstate *h;
>> + int err = 0;
>> + int nid;
>> +
>> + err = kstrtoul(buf, 10, &nr_demote);
>> + if (err)
>> + return err;
>> + h = kobj_to_hstate(kobj, &nid);
>> +
>> + /* Synchronize with other sysfs operations modifying huge pages */
>> + mutex_lock(&h->resize_lock);
>> +
>> + if (nid != NUMA_NO_NODE) {
>> + init_nodemask_of_node(&nodes_allowed, nid);
>> + n_mask = &nodes_allowed;
>> + } else {
>> + n_mask = &node_states[N_MEMORY];
>> + }
>
> Why this needs to be protected by the resize_lock? I do not understand
> what are we really protecting here and from what.
In general, the resize_lock prevents unexpected consequences when
multiple users are modifying the number of pages in a pool concurrently
from the proc/sysfs interfaces. The mutex is acquired here because we
are modifying (decreasing) the pool size.
When the mutex was added, Michal asked about the need. Theoretically,
all code making pool adjustment should be safe because at a low level
the hugetlb_lock is taken when the hstate is modified. However, I did
point out that there is a hstate variable 'next_nid_to_alloc' which is
used outside the lock and could result in pages being allocated from
the wrong node. One could argue that if two (root) sysadmin users are
modifying the pool concurrently, they should not be surprised by such
consequences. The mutex seemed like a small price to avoid any such
potential issues. It is taken here to be consistent with this model.
Coincidentally, I was running some stress testing with this series last
night and noticed some unexpected behavior. As a result, we will also
need to take the resize_lock mutex of the 'target_hstate'. This is all in
patch 5 of the series. I will add details there.
--
Mike Kravetz
On 10/7/21 11:19 AM, Mike Kravetz wrote:
> +static int demote_free_huge_page(struct hstate *h, struct page *page)
> +{
> + int i, nid = page_to_nid(page);
> + struct hstate *target_hstate;
> + int rc = 0;
> +
> + target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
> +
> + remove_hugetlb_page_for_demote(h, page, false);
> + spin_unlock_irq(&hugetlb_lock);
> +
> + rc = alloc_huge_page_vmemmap(h, page);
> + if (rc) {
> + /* Allocation of vmemmmap failed, we can not demote page */
> + spin_lock_irq(&hugetlb_lock);
> + set_page_refcounted(page);
> + add_hugetlb_page(h, page, false);
> + return rc;
> + }
> +
> + /*
> + * Use destroy_compound_hugetlb_page_for_demote for all huge page
> + * sizes as it will not ref count pages.
> + */
> + destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
> +
> + for (i = 0; i < pages_per_huge_page(h);
> + i += pages_per_huge_page(target_hstate)) {
> + if (hstate_is_gigantic(target_hstate))
> + prep_compound_gigantic_page_for_demote(page + i,
> + target_hstate->order);
> + else
> + prep_compound_page(page + i, target_hstate->order);
> + set_page_private(page + i, 0);
> + set_page_refcounted(page + i);
> + prep_new_huge_page(target_hstate, page + i, nid);
> + put_page(page + i);
> + }
I was doing some stress testing with multiple concurrent writers to
sysfs/.../nr_hugepages and sysfs/.../demote. On occasion, I would see
unexpected surplus pages of the smaller huge page size (2M on x86).
Here is what was happening. One task was decrementing the number of
2M huge pages with "echo 0 > nr_hugepages". It proceeded to the routine
set_max_huge_pages and was executing the following:
/*
* Decrease the pool size
* First return free pages to the buddy allocator (being careful
* to keep enough around to satisfy reservations). Then place
* pages into surplus state as needed so the pool will shrink
* to the desired size as pages become free.
*
* By placing pages into the surplus state independent of the
* overcommit value, we are allowing the surplus pool size to
* exceed overcommit. There are few sane options here. Since
* alloc_surplus_huge_page() is checking the global counter,
* though, we'll note that we're not allowed to exceed surplus
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
*/
min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
try_to_free_low(h, min_count, nodes_allowed);
/*
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
page = remove_pool_huge_page(h, nodes_allowed, 0);
if (!page)
break;
list_add(&page->lru, &page_list);
}
/* free the pages after dropping lock */
spin_unlock_irq(&hugetlb_lock);
update_and_free_pages_bulk(h, &page_list);
flush_free_hpage_work(h);
Now, while the lock was dropped the routine demote_free_huge_page above
added 512 huge pages to the 2M pool.
spin_lock_irq(&hugetlb_lock);
Then after acquiring the lock we make these 512 pages surplus.
while (count < persistent_huge_pages(h)) {
if (!adjust_pool_surplus(h, nodes_allowed, 1))
break;
}
To prevent this race from happening in general, the hstate specific mutex
resize_lock is held for the duration of set_max_huge_pages. Since the
demote code is also adjusting pool sizes, it should also take the mutex.
The routine demote_store already takes the mutex of the hstate of the
page size being demoted (1G in this case). That is because the 1G pool
size will be decreased. We also need to take the resize mutex of the 2M
pool as this pool will be increased. To prevent deadlocks, we use the
convention of always taking the resize mutex of the larger hstate first.
An updated version of this patch below adds taking the 'target hstate'
mutex in demote_free_huge_page. Although unnecessary, it also updates
max_huge_pages of both hstates for consistency.
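For reference, the concurrent writers in the stress test look roughly like
the following sketch (paths abbreviated; the counts and sizes are arbitrary
and not the exact test):
# writer 1: repeatedly grow and shrink the 2MB pool
while true; do
	echo 2048 > .../hugepages-2048kB/nr_hugepages
	echo 0 > .../hugepages-2048kB/nr_hugepages
done &
# writer 2: repeatedly demote free 1GB pages
while true; do
	echo 4 > .../hugepages-1048576kB/demote
done &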
From 25e4dac59f4d203f3a7e86d3591d70c1e956d11c Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Fri, 8 Oct 2021 13:21:21 -0700
Subject: [PATCH v4 5/5] hugetlb: add hugetlb demote page support
Demote page functionality will split a huge page into a number of huge
pages of a smaller size. For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages. Demotion is done 'in place' by simply
splitting the huge page.
Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_hugetlb_page and prep_compound_gigantic_page for use
by demote code.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 92 insertions(+), 8 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 794e0c4c1b3c..e1883510309a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1270,7 +1270,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
((node = hstate_next_node_to_free(hs, mask)) || 1); \
nr_nodes--)
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+/* used to demote non-gigantic_huge pages as well */
static void __destroy_compound_gigantic_page(struct page *page,
unsigned int order, bool demote)
{
@@ -1293,6 +1293,13 @@ static void __destroy_compound_gigantic_page(struct page *page,
__ClearPageHead(page);
}
+static void destroy_compound_hugetlb_page_for_demote(struct page *page,
+ unsigned int order)
+{
+ __destroy_compound_gigantic_page(page, order, true);
+}
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
static void destroy_compound_gigantic_page(struct page *page,
unsigned int order)
{
@@ -1438,6 +1445,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
__remove_hugetlb_page(h, page, adjust_surplus, false);
}
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+ bool adjust_surplus)
+{
+ __remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
static void add_hugetlb_page(struct hstate *h, struct page *page,
bool adjust_surplus)
{
@@ -1779,6 +1792,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
return __prep_compound_gigantic_page(page, order, false);
}
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+ unsigned int order)
+{
+ return __prep_compound_gigantic_page(page, order, true);
+}
+
/*
* PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
@@ -3304,9 +3323,72 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
return 0;
}
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+ int i, nid = page_to_nid(page);
+ struct hstate *target_hstate;
+ int rc = 0;
+
+ target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+ remove_hugetlb_page_for_demote(h, page, false);
+ spin_unlock_irq(&hugetlb_lock);
+
+ rc = alloc_huge_page_vmemmap(h, page);
+ if (rc) {
+ /* Allocation of vmemmmap failed, we can not demote page */
+ spin_lock_irq(&hugetlb_lock);
+ set_page_refcounted(page);
+ add_hugetlb_page(h, page, false);
+ return rc;
+ }
+
+ /*
+ * Use destroy_compound_hugetlb_page_for_demote for all huge page
+ * sizes as it will not ref count pages.
+ */
+ destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
+
+ /*
+ * Taking target hstate mutex synchronizes with set_max_huge_pages.
+ * Without the mutex, pages added to target hstate could be marked
+ * as surplus.
+ *
+ * Note that we already hold h->resize_lock. To prevent deadlock,
+ * use the convention of always taking larger size hstate mutex first.
+ */
+ mutex_lock(&target_hstate->resize_lock);
+ for (i = 0; i < pages_per_huge_page(h);
+ i += pages_per_huge_page(target_hstate)) {
+ if (hstate_is_gigantic(target_hstate))
+ prep_compound_gigantic_page_for_demote(page + i,
+ target_hstate->order);
+ else
+ prep_compound_page(page + i, target_hstate->order);
+ set_page_private(page + i, 0);
+ set_page_refcounted(page + i);
+ prep_new_huge_page(target_hstate, page + i, nid);
+ put_page(page + i);
+ }
+ mutex_unlock(&target_hstate->resize_lock);
+
+ spin_lock_irq(&hugetlb_lock);
+
+ /*
+ * Not absolutely necessary, but for consistency update max_huge_pages
+ * based on pool changes for the demoted page.
+ */
+ h->max_huge_pages--;
+ target_hstate->max_huge_pages += pages_per_huge_page(h);
+
+ return rc;
+}
+
static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
__must_hold(&hugetlb_lock)
{
+ int nr_nodes, node;
+ struct page *page;
int rc = 0;
lockdep_assert_held(&hugetlb_lock);
@@ -3317,9 +3399,15 @@ static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
return -EINVAL; /* internal error */
}
- /*
- * TODO - demote fucntionality will be added in subsequent patch
- */
+ for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+ if (!list_empty(&h->hugepage_freelists[node])) {
+ page = list_entry(h->hugepage_freelists[node].next,
+ struct page, lru);
+ rc = demote_free_huge_page(h, page);
+ break;
+ }
+ }
+
return rc;
}
@@ -3554,10 +3642,6 @@ static ssize_t demote_store(struct kobject *kobj,
/*
* Check for available pages to demote each time thorough the
* loop as demote_pool_huge_page will drop hugetlb_lock.
- *
- * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
- * but will when full demote functionality is added in a later
- * patch.
*/
if (nid != NUMA_NO_NODE)
nr_available = h->free_huge_pages_node[nid];
--
2.31.1
On Fri, Oct 08, 2021 at 01:24:28PM -0700, Mike Kravetz wrote:
> In general, the resize_lock prevents unexpected consequences when
> multiple users are modifying the number of pages in a pool concurrently
> from the proc/sysfs interfaces. The mutex is acquired here because we
> are modifying (decreasing) the pool size.
Yes, I got that. My question was wrt. n_mask initialization:
+ if (nid != NUMA_NO_NODE) {
+ init_nodemask_of_node(&nodes_allowed, nid);
+ n_mask = &nodes_allowed;
+ } else {
+ n_mask = &node_states[N_MEMORY];
+ }
AFAICS, this does not need to be protected.
with that addressed:
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs
On Thu, Oct 07, 2021 at 11:19:17AM -0700, Mike Kravetz wrote:
> The routines remove_hugetlb_page and destroy_compound_gigantic_page
> will remove a gigantic page and make the set of base pages ready to be
> returned to a lower level allocator. In the process of doing this, they
> make all base pages reference counted.
>
> The routine prep_compound_gigantic_page creates a gigantic page from a
> set of base pages. It assumes that all these base pages are reference
> counted.
>
> During demotion, a gigantic page will be split into huge pages of a
> smaller size. This logically involves use of the routines,
> remove_hugetlb_page, and destroy_compound_gigantic_page followed by
> prep_compound*_page for each smaller huge page.
>
> When pages are reference counted (ref count >= 0), additional
> speculative ref counts could be taken. This could result in errors
It would be great to learn about those cases involving speculative ref counts.
> while demoting a huge page. Quite a bit of code would need to be
> created to handle all possible issues.
>
> Instead of dealing with the possibility of speculative ref counts, avoid
> the possibility by keeping ref counts at zero during the demote process.
> Add a boolean 'demote' to the routines remove_hugetlb_page,
> destroy_compound_gigantic_page and prep_compound_gigantic_page. If the
> boolean is set, the remove and destroy routines will not reference count
> pages and the prep routine will not expect reference counted pages.
>
> '*_for_demote' wrappers of the routines will be added in a subsequent
> patch where this functionality is used.
>
> Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
> ---
> mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 43 insertions(+), 11 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 563338f4dbc4..794e0c4c1b3c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1271,8 +1271,8 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> nr_nodes--)
>
> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> - unsigned int order)
> +static void __destroy_compound_gigantic_page(struct page *page,
> + unsigned int order, bool demote)
> {
> int i;
> int nr_pages = 1 << order;
> @@ -1284,7 +1284,8 @@ static void destroy_compound_gigantic_page(struct page *page,
> for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> p->mapping = NULL;
> clear_compound_head(p);
> - set_page_refcounted(p);
> + if (!demote)
> + set_page_refcounted(p);
> }
>
> set_compound_order(page, 0);
> @@ -1292,6 +1293,12 @@ static void destroy_compound_gigantic_page(struct page *page,
> __ClearPageHead(page);
> }
>
> +static void destroy_compound_gigantic_page(struct page *page,
> + unsigned int order)
> +{
> + __destroy_compound_gigantic_page(page, order, false);
> +}
> +
> static void free_gigantic_page(struct page *page, unsigned int order)
> {
> /*
> @@ -1364,12 +1371,15 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>
> /*
> * Remove hugetlb page from lists, and update dtor so that page appears
> - * as just a compound page. A reference is held on the page.
> + * as just a compound page.
> + *
> + * A reference is held on the page, except in the case of demote.
> *
> * Must be called with hugetlb lock held.
> */
> -static void remove_hugetlb_page(struct hstate *h, struct page *page,
> - bool adjust_surplus)
> +static void __remove_hugetlb_page(struct hstate *h, struct page *page,
> + bool adjust_surplus,
> + bool demote)
> {
> int nid = page_to_nid(page);
>
> @@ -1407,8 +1417,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
> *
> * This handles the case where more than one ref is held when and
> * after update_and_free_page is called.
> + *
> + * In the case of demote we do not ref count the page as it will soon
> + * be turned into a page of smaller size.
> */
> - set_page_refcounted(page);
> + if (!demote)
> + set_page_refcounted(page);
> if (hstate_is_gigantic(h))
> set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> else
> @@ -1418,6 +1432,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
> h->nr_huge_pages_node[nid]--;
> }
>
> +static void remove_hugetlb_page(struct hstate *h, struct page *page,
> + bool adjust_surplus)
> +{
> + __remove_hugetlb_page(h, page, adjust_surplus, false);
> +}
> +
> static void add_hugetlb_page(struct hstate *h, struct page *page,
> bool adjust_surplus)
> {
> @@ -1681,7 +1701,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> spin_unlock_irq(&hugetlb_lock);
> }
>
> -static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> +static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
> + bool demote)
> {
> int i, j;
> int nr_pages = 1 << order;
> @@ -1719,10 +1740,16 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> * the set of pages can not be converted to a gigantic page.
> * The caller who allocated the pages should then discard the
> * pages using the appropriate free interface.
> + *
> + * In the case of demote, the ref count will be zero.
> */
> - if (!page_ref_freeze(p, 1)) {
> - pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> - goto out_error;
> + if (!demote) {
> + if (!page_ref_freeze(p, 1)) {
> + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> + goto out_error;
> + }
> + } else {
> + VM_BUG_ON_PAGE(page_count(p), p);
> }
> set_page_count(p, 0);
> set_compound_head(p, page);
> @@ -1747,6 +1774,11 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> return false;
> }
>
> +static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> +{
> + return __prep_compound_gigantic_page(page, order, false);
> +}
> +
> /*
> * PageHuge() only returns true for hugetlbfs pages, but not for normal or
> * transparent huge pages. See the PageTransHuge() documentation for more
> --
> 2.31.1
>
--
Oscar Salvador
SUSE Labs
On Fri, Oct 08, 2021 at 01:57:48PM -0700, Mike Kravetz wrote:
> From 25e4dac59f4d203f3a7e86d3591d70c1e956d11c Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Fri, 8 Oct 2021 13:21:21 -0700
> Subject: [PATCH v4 5/5] hugetlb: add hugetlb demote page support
>
> Demote page functionality will split a huge page into a number of huge
> pages of a smaller size. For example, on x86 a 1GB huge page can be
> demoted into 512 2M huge pages. Demotion is done 'in place' by simply
> splitting the huge page.
>
> Added '*_for_demote' wrappers for remove_hugetlb_page,
> destroy_compound_hugetlb_page and prep_compound_gigantic_page for use
> by demote code.
>
> Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
> ---
> mm/hugetlb.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 92 insertions(+), 8 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 794e0c4c1b3c..e1883510309a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1270,7 +1270,7 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> ((node = hstate_next_node_to_free(hs, mask)) || 1); \
> nr_nodes--)
>
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> +/* used to demote non-gigantic_huge pages as well */
> static void __destroy_compound_gigantic_page(struct page *page,
> unsigned int order, bool demote)
> {
> @@ -1293,6 +1293,13 @@ static void __destroy_compound_gigantic_page(struct page *page,
> __ClearPageHead(page);
> }
>
> +static void destroy_compound_hugetlb_page_for_demote(struct page *page,
> + unsigned int order)
> +{
> + __destroy_compound_gigantic_page(page, order, true);
> +}
> +
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> static void destroy_compound_gigantic_page(struct page *page,
> unsigned int order)
> {
> @@ -1438,6 +1445,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
> __remove_hugetlb_page(h, page, adjust_surplus, false);
> }
>
> +static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
> + bool adjust_surplus)
> +{
> + __remove_hugetlb_page(h, page, adjust_surplus, true);
> +}
> +
> static void add_hugetlb_page(struct hstate *h, struct page *page,
> bool adjust_surplus)
> {
> @@ -1779,6 +1792,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> return __prep_compound_gigantic_page(page, order, false);
> }
>
> +static bool prep_compound_gigantic_page_for_demote(struct page *page,
> + unsigned int order)
> +{
> + return __prep_compound_gigantic_page(page, order, true);
> +}
> +
> /*
> * PageHuge() only returns true for hugetlbfs pages, but not for normal or
> * transparent huge pages. See the PageTransHuge() documentation for more
> @@ -3304,9 +3323,72 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> return 0;
> }
>
> +static int demote_free_huge_page(struct hstate *h, struct page *page)
> +{
> + int i, nid = page_to_nid(page);
> + struct hstate *target_hstate;
> + int rc = 0;
> +
> + target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
> +
> + remove_hugetlb_page_for_demote(h, page, false);
> + spin_unlock_irq(&hugetlb_lock);
> +
> + rc = alloc_huge_page_vmemmap(h, page);
> + if (rc) {
> + /* Allocation of vmemmmap failed, we can not demote page */
> + spin_lock_irq(&hugetlb_lock);
> + set_page_refcounted(page);
> + add_hugetlb_page(h, page, false);
> + return rc;
> + }
> +
> + /*
> + * Use destroy_compound_hugetlb_page_for_demote for all huge page
> + * sizes as it will not ref count pages.
> + */
> + destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
> +
> + /*
> + * Taking target hstate mutex synchronizes with set_max_huge_pages.
> + * Without the mutex, pages added to target hstate could be marked
> + * as surplus.
> + *
> + * Note that we already hold h->resize_lock. To prevent deadlock,
> + * use the convention of always taking larger size hstate mutex first.
> + */
> + mutex_lock(&target_hstate->resize_lock);
> + for (i = 0; i < pages_per_huge_page(h);
> + i += pages_per_huge_page(target_hstate)) {
> + if (hstate_is_gigantic(target_hstate))
> + prep_compound_gigantic_page_for_demote(page + i,
> + target_hstate->order);
> + else
> + prep_compound_page(page + i, target_hstate->order);
> + set_page_private(page + i, 0);
> + set_page_refcounted(page + i);
> + prep_new_huge_page(target_hstate, page + i, nid);
> + put_page(page + i);
> + }
> + mutex_unlock(&target_hstate->resize_lock);
> +
> + spin_lock_irq(&hugetlb_lock);
> +
> + /*
> + * Not absolutely necessary, but for consistency update max_huge_pages
> + * based on pool changes for the demoted page.
> + */
> + h->max_huge_pages--;
> + target_hstate->max_huge_pages += pages_per_huge_page(h);
> +
> + return rc;
> +}
> +
> static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> __must_hold(&hugetlb_lock)
> {
> + int nr_nodes, node;
> + struct page *page;
> int rc = 0;
>
> lockdep_assert_held(&hugetlb_lock);
> @@ -3317,9 +3399,15 @@ static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> return -EINVAL; /* internal error */
> }
>
> - /*
> - * TODO - demote fucntionality will be added in subsequent patch
> - */
> + for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
> + if (!list_empty(&h->hugepage_freelists[node])) {
> + page = list_entry(h->hugepage_freelists[node].next,
> + struct page, lru);
> + rc = demote_free_huge_page(h, page);
> + break;
> + }
> + }
> +
> return rc;
> }
>
> @@ -3554,10 +3642,6 @@ static ssize_t demote_store(struct kobject *kobj,
> /*
> * Check for available pages to demote each time thorough the
> * loop as demote_pool_huge_page will drop hugetlb_lock.
> - *
> - * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
> - * but will when full demote functionality is added in a later
> - * patch.
> */
> if (nid != NUMA_NO_NODE)
> nr_available = h->free_huge_pages_node[nid];
> --
> 2.31.1
>
>
--
Oscar Salvador
SUSE Labs
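For readers skimming the thread, the in-place split performed by
demote_free_huge_page() in the patch above boils down to roughly the
following. This is an illustrative sketch only: the vmemmap
reallocation, locking, error handling and the gigantic-target case of
the real function are all elided.

/*
 * Sketch: split one free huge page of hstate 'h' into pages of the
 * configured demote_order and hand them to the target pool.
 */
static void demote_sketch(struct hstate *h, struct page *page)
{
	struct hstate *target = size_to_hstate(PAGE_SIZE << h->demote_order);
	int nid = page_to_nid(page);
	int i;

	/* detach from the source pool and tear down the compound page */
	remove_hugetlb_page_for_demote(h, page, false);
	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));

	/* rebuild each demote_order chunk and free it into the target pool */
	for (i = 0; i < pages_per_huge_page(h);
			i += pages_per_huge_page(target)) {
		prep_compound_page(page + i, target->order);
		set_page_refcounted(page + i);
		prep_new_huge_page(target, page + i, nid);
		put_page(page + i);	/* lands on the target free list */
	}
}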
On 10/18/21 12:35 AM, Oscar Salvador wrote:
> On Fri, Oct 08, 2021 at 01:24:28PM -0700, Mike Kravetz wrote:
>> In general, the resize_lock prevents unexpected consequences when
>> multiple users are modifying the number of pages in a pool concurrently
>> from the proc/sysfs interfaces. The mutex is acquired here because we
>> are modifying (decreasing) the pool size.
>
> Yes, I got that. My question was wrt. n_mask initialization:
>
> + if (nid != NUMA_NO_NODE) {
> + init_nodemask_of_node(&nodes_allowed, nid);
> + n_mask = &nodes_allowed;
> + } else {
> + n_mask = &node_states[N_MEMORY];
> + }
>
> AFAICS, this does not need to be protected.
>
Sorry that I misunderstood your question!
You are correct, the n_mask initialization does not need to be protected
by the mutex. Thanks for pointing that out.
The updated patch below simply moves the mutex acquisition to after this
initialization code.
Andrew, please let me know if you want something else to make this
update simpler for you.
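Concretely, the reordered section of demote_store() looks like this
(sketch of the relevant lines only; the full updated patch follows):

	if (nid != NUMA_NO_NODE) {
		/* node-specific request: restrict the mask to that node */
		init_nodemask_of_node(&nodes_allowed, nid);
		n_mask = &nodes_allowed;
	} else {
		n_mask = &node_states[N_MEMORY];
	}

	/* only now synchronize with other operations modifying the pools */
	mutex_lock(&h->resize_lock);
	spin_lock_irq(&hugetlb_lock);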
From f9c401323fee234667787a118c74d93aa185fcf6 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Fri, 22 Oct 2021 11:40:57 -0700
Subject: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces
Two new sysfs files are added to demote hugetlb pages. These files are
both per-hugetlb page size and per node. Files are:
demote_size - The size in Kb that pages are demoted to. (read-write)
demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smaller huge page size. Valid huge
page sizes smaller than the current huge page size may be written to this
file. When huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size. For example, on x86 1GB huge
pages will have demote interfaces. 2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality. It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
Signed-off-by: Mike Kravetz <[email protected]>
---
Documentation/admin-guide/mm/hugetlbpage.rst | 30 +++-
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 155 ++++++++++++++++++-
3 files changed, 183 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..bb90de3885d1 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
hugepages-${size}kB
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist. In addition, two additional interfaces for demoting huge
+pages may exist::
+ demote
+ demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages. For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes. A 1GB huge page can be split into 512
+2MB huge pages. Demote interfaces are not available for the smallest
+huge page size. The demote interfaces are:
+
+demote_size
+ is the size of demoted pages. When a page is demoted a corresponding
+ number of huge pages of demote_size will be created. By default,
+ demote_size is set to the next smaller huge page size. If there are
+ multiple smaller huge page sizes, demote_size can be set to any of
+ these smaller sizes. Only huge page sizes less than the current huge
+ pages size are allowed.
+
+demote
+ is used to demote a number of huge pages. A user with root privileges
+ can write to this file. It may not be possible to demote the
+ requested number of huge pages. To determine how many pages were
+ actually demoted, compare the value of nr_hugepages before and after
+ writing to the demote interface. demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..f2c3979efd69 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -596,6 +596,7 @@ struct hstate {
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
+ unsigned int demote_order;
unsigned long mask;
unsigned long max_huge_pages;
unsigned long nr_huge_pages;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..d2262ad4b3ed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
static void __init hugetlb_init_hstates(void)
{
- struct hstate *h;
+ struct hstate *h, *h2;
for_each_hstate(h) {
if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
/* oversize hugepages were init'ed in early boot */
if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
+
+ /*
+ * Set demote order for each hstate. Note that
+ * h->demote_order is initially 0.
+ * - We can not demote gigantic pages if runtime freeing
+ * is not supported, so skip this.
+ */
+ if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+ continue;
+ for_each_hstate(h2) {
+ if (h2 == h)
+ continue;
+ if (h2->order < h->order &&
+ h2->order > h->demote_order)
+ h->demote_order = h2->order;
+ }
}
VM_BUG_ON(minimum_order == UINT_MAX);
}
@@ -3235,9 +3251,31 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
return 0;
}
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+ __must_hold(&hugetlb_lock)
+{
+ int rc = 0;
+
+ lockdep_assert_held(&hugetlb_lock);
+
+ /* We should never get here if no demote order */
+ if (!h->demote_order) {
+ pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
+ return -EINVAL; /* internal error */
+ }
+
+ /*
+ * TODO - demote fucntionality will be added in subsequent patch
+ */
+ return rc;
+}
+
#define HSTATE_ATTR_RO(_name) \
static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+#define HSTATE_ATTR_WO(_name) \
+ static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
#define HSTATE_ATTR(_name) \
static struct kobj_attribute _name##_attr = \
__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
}
HSTATE_ATTR_RO(surplus_hugepages);
+static ssize_t demote_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t len)
+{
+ unsigned long nr_demote;
+ unsigned long nr_available;
+ nodemask_t nodes_allowed, *n_mask;
+ struct hstate *h;
+ int err = 0;
+ int nid;
+
+ err = kstrtoul(buf, 10, &nr_demote);
+ if (err)
+ return err;
+ h = kobj_to_hstate(kobj, &nid);
+
+ if (nid != NUMA_NO_NODE) {
+ init_nodemask_of_node(&nodes_allowed, nid);
+ n_mask = &nodes_allowed;
+ } else {
+ n_mask = &node_states[N_MEMORY];
+ }
+
+ /* Synchronize with other sysfs operations modifying huge pages */
+ mutex_lock(&h->resize_lock);
+ spin_lock_irq(&hugetlb_lock);
+
+ while (nr_demote) {
+ /*
+ * Check for available pages to demote each time thorough the
+ * loop as demote_pool_huge_page will drop hugetlb_lock.
+ *
+ * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
+ * but will when full demote functionality is added in a later
+ * patch.
+ */
+ if (nid != NUMA_NO_NODE)
+ nr_available = h->free_huge_pages_node[nid];
+ else
+ nr_available = h->free_huge_pages;
+ nr_available -= h->resv_huge_pages;
+ if (!nr_available)
+ break;
+
+ err = demote_pool_huge_page(h, n_mask);
+ if (err)
+ break;
+
+ nr_demote--;
+ }
+
+ spin_unlock_irq(&hugetlb_lock);
+ mutex_unlock(&h->resize_lock);
+
+ if (err)
+ return err;
+ return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int nid;
+ struct hstate *h = kobj_to_hstate(kobj, &nid);
+ unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
+
+ return sysfs_emit(buf, "%lukB\n", demote_size);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct hstate *h, *demote_hstate;
+ unsigned long demote_size;
+ unsigned int demote_order;
+ int nid;
+
+ demote_size = (unsigned long)memparse(buf, NULL);
+
+ demote_hstate = size_to_hstate(demote_size);
+ if (!demote_hstate)
+ return -EINVAL;
+ demote_order = demote_hstate->order;
+
+ /* demote order must be smaller than hstate order */
+ h = kobj_to_hstate(kobj, &nid);
+ if (demote_order >= h->order)
+ return -EINVAL;
+
+ /* resize_lock synchronizes access to demote size and writes */
+ mutex_lock(&h->resize_lock);
+ h->demote_order = demote_order;
+ mutex_unlock(&h->resize_lock);
+
+ return count;
+}
+HSTATE_ATTR(demote_size);
+
static struct attribute *hstate_attrs[] = {
&nr_hugepages_attr.attr,
&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
.attrs = hstate_attrs,
};
+static struct attribute *hstate_demote_attrs[] = {
+ &demote_size_attr.attr,
+ &demote_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+ .attrs = hstate_demote_attrs,
+};
+
static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct kobject **hstate_kobjs,
const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
hstate_kobjs[hi] = NULL;
}
+ if (h->demote_order) {
+ if (sysfs_create_group(hstate_kobjs[hi],
+ &hstate_demote_attr_group))
+ pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+ }
+
return retval;
}
--
2.31.1
On 10/18/21 12:58 AM, Oscar Salvador wrote:
> On Thu, Oct 07, 2021 at 11:19:17AM -0700, Mike Kravetz wrote:
>> The routines remove_hugetlb_page and destroy_compound_gigantic_page
>> will remove a gigantic page and make the set of base pages ready to be
>> returned to a lower level allocator. In the process of doing this, they
>> make all base pages reference counted.
>>
>> The routine prep_compound_gigantic_page creates a gigantic page from a
>> set of base pages. It assumes that all these base pages are reference
>> counted.
>>
>> During demotion, a gigantic page will be split into huge pages of a
>> smaller size. This logically involves use of the routines,
>> remove_hugetlb_page, and destroy_compound_gigantic_page followed by
>> prep_compound*_page for each smaller huge page.
>>
>> When pages are reference counted (ref count >= 0), additional
>> speculative ref counts could be taken. This could result in errors
>
> It would be great to learn about those cases involving speculative ref counts.
>
How about if this commit message provides links to previous commits
describing these issues? There are pretty extensive descriptions in
those previous commits, so no need to repeat here IMO.
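As background (this sketch is not part of the patch, and the function
name is made up for illustration), the speculative references in
question follow the usual tryget pattern, which fails when the ref
count is zero:

/*
 * Illustration only: the general shape of a speculative reference,
 * e.g. from a pagecache or GUP fast-path lookup.  Because the pages
 * are kept at ref count zero while being demoted, the tryget fails
 * and the speculative user backs off, so the demote path never has
 * to cope with an unexpectedly raised ref count.
 */
static bool speculative_ref_sketch(struct page *page)
{
	if (!get_page_unless_zero(page))
		return false;	/* ref count was zero: back off */
	/* ... use the page ... */
	put_page(page);
	return true;
}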
The patch with an updated commit message is below.
From 10fcff70c809402901a93ea507d5506c87a8227d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Fri, 22 Oct 2021 11:50:31 -0700
Subject: [PATCH v4 4/5] hugetlb: add demote bool to gigantic page routines
The routines remove_hugetlb_page and destroy_compound_gigantic_page
will remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator. In the process of doing this, they
make all base pages reference counted.
The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages. It assumes that all these base pages are reference
counted.
During demotion, a gigantic page will be split into huge pages of a
smaller size. This logically involves use of the routines,
remove_hugetlb_page, and destroy_compound_gigantic_page followed by
prep_compound*_page for each smaller huge page.
When pages are reference counted (ref count >= 0), additional
speculative ref counts could be taken as described in previous
commits [1] and [2]. This could result in errors while demoting
a huge page. Quite a bit of code would need to be created to
handle all possible issues.
Instead of dealing with the possibility of speculative ref counts, avoid
the possibility by keeping ref counts at zero during the demote process.
Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page. If the
boolean is set, the remove and destroy routines will not reference count
pages and the prep routine will not expect reference counted pages.
'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 54 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 43 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 21a9c353c2ae..bb724a393864 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1271,8 +1271,8 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
nr_nodes--)
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
- unsigned int order)
+static void __destroy_compound_gigantic_page(struct page *page,
+ unsigned int order, bool demote)
{
int i;
int nr_pages = 1 << order;
@@ -1284,7 +1284,8 @@ static void destroy_compound_gigantic_page(struct page *page,
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
p->mapping = NULL;
clear_compound_head(p);
- set_page_refcounted(p);
+ if (!demote)
+ set_page_refcounted(p);
}
set_compound_order(page, 0);
@@ -1292,6 +1293,12 @@ static void destroy_compound_gigantic_page(struct page *page,
__ClearPageHead(page);
}
+static void destroy_compound_gigantic_page(struct page *page,
+ unsigned int order)
+{
+ __destroy_compound_gigantic_page(page, order, false);
+}
+
static void free_gigantic_page(struct page *page, unsigned int order)
{
/*
@@ -1364,12 +1371,15 @@ static inline void destroy_compound_gigantic_page(struct page *page,
/*
* Remove hugetlb page from lists, and update dtor so that page appears
- * as just a compound page. A reference is held on the page.
+ * as just a compound page.
+ *
+ * A reference is held on the page, except in the case of demote.
*
* Must be called with hugetlb lock held.
*/
-static void remove_hugetlb_page(struct hstate *h, struct page *page,
- bool adjust_surplus)
+static void __remove_hugetlb_page(struct hstate *h, struct page *page,
+ bool adjust_surplus,
+ bool demote)
{
int nid = page_to_nid(page);
@@ -1407,8 +1417,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
*
* This handles the case where more than one ref is held when and
* after update_and_free_page is called.
+ *
+ * In the case of demote we do not ref count the page as it will soon
+ * be turned into a page of smaller size.
*/
- set_page_refcounted(page);
+ if (!demote)
+ set_page_refcounted(page);
if (hstate_is_gigantic(h))
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
else
@@ -1418,6 +1432,12 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
h->nr_huge_pages_node[nid]--;
}
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+ bool adjust_surplus)
+{
+ __remove_hugetlb_page(h, page, adjust_surplus, false);
+}
+
static void add_hugetlb_page(struct hstate *h, struct page *page,
bool adjust_surplus)
{
@@ -1681,7 +1701,8 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
spin_unlock_irq(&hugetlb_lock);
}
-static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
+ bool demote)
{
int i, j;
int nr_pages = 1 << order;
@@ -1719,10 +1740,16 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
* the set of pages can not be converted to a gigantic page.
* The caller who allocated the pages should then discard the
* pages using the appropriate free interface.
+ *
+ * In the case of demote, the ref count will be zero.
*/
- if (!page_ref_freeze(p, 1)) {
- pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
- goto out_error;
+ if (!demote) {
+ if (!page_ref_freeze(p, 1)) {
+ pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+ goto out_error;
+ }
+ } else {
+ VM_BUG_ON_PAGE(page_count(p), p);
}
set_page_count(p, 0);
set_compound_head(p, page);
@@ -1747,6 +1774,11 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
return false;
}
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+{
+ return __prep_compound_gigantic_page(page, order, false);
+}
+
/*
* PageHuge() only returns true for hugetlbfs pages, but not for normal or
* transparent huge pages. See the PageTransHuge() documentation for more
--
2.31.1
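For reference, the '*_for_demote' wrappers added by the follow-up patch
are thin shims of this shape (taken from patch 5/5 quoted earlier in
the thread):

static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
					   bool adjust_surplus)
{
	__remove_hugetlb_page(h, page, adjust_surplus, true);
}

static bool prep_compound_gigantic_page_for_demote(struct page *page,
						   unsigned int order)
{
	return __prep_compound_gigantic_page(page, order, true);
}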
On Fri, Oct 22, 2021 at 11:58:42AM -0700, Mike Kravetz wrote:
> From f9c401323fee234667787a118c74d93aa185fcf6 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Fri, 22 Oct 2021 11:40:57 -0700
> Subject: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces
>
> Two new sysfs files are added to demote hugetlb pages. These files are
> both per-hugetlb page size and per node. Files are:
> demote_size - The size in Kb that pages are demoted to. (read-write)
> demote - The number of huge pages to demote. (write-only)
>
> By default, demote_size is the next smaller huge page size. Valid huge
> page sizes smaller than the current huge page size may be written to this
> file. When huge pages are demoted, they are demoted to this size.
>
> Writing a value to demote will result in an attempt to demote that
> number of hugetlb pages to an appropriate number of demote_size pages.
>
> NOTE: Demote interfaces are only provided for huge page sizes if there
> is a smaller target demote huge page size. For example, on x86 1GB huge
> pages will have demote interfaces. 2MB huge pages will not have demote
> interfaces.
>
> This patch does not provide full demote functionality. It only provides
> the sysfs interfaces.
>
> It also provides documentation for the new interfaces.
>
> Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
> ---
> Documentation/admin-guide/mm/hugetlbpage.rst | 30 +++-
> include/linux/hugetlb.h | 1 +
> mm/hugetlb.c | 155 ++++++++++++++++++-
> 3 files changed, 183 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144e44..bb90de3885d1 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -234,8 +234,12 @@ will exist, of the form::
>
> hugepages-${size}kB
>
> -Inside each of these directories, the same set of files will exist::
> +Inside each of these directories, the set of files contained in ``/proc``
> +will exist. In addition, two additional interfaces for demoting huge
> +pages may exist::
>
> + demote
> + demote_size
> nr_hugepages
> nr_hugepages_mempolicy
> nr_overcommit_hugepages
> @@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
> resv_hugepages
> surplus_hugepages
>
> -which function as described above for the default huge page-sized case.
> +The demote interfaces provide the ability to split a huge page into
> +smaller huge pages. For example, the x86 architecture supports both
> +1GB and 2MB huge pages sizes. A 1GB huge page can be split into 512
> +2MB huge pages. Demote interfaces are not available for the smallest
> +huge page size. The demote interfaces are:
> +
> +demote_size
> + is the size of demoted pages. When a page is demoted a corresponding
> + number of huge pages of demote_size will be created. By default,
> + demote_size is set to the next smaller huge page size. If there are
> + multiple smaller huge page sizes, demote_size can be set to any of
> + these smaller sizes. Only huge page sizes less than the current huge
> + pages size are allowed.
> +
> +demote
> + is used to demote a number of huge pages. A user with root privileges
> + can write to this file. It may not be possible to demote the
> + requested number of huge pages. To determine how many pages were
> + actually demoted, compare the value of nr_hugepages before and after
> + writing to the demote interface. demote is a write only interface.
> +
> +The interfaces which are the same as in ``/proc`` (all except demote and
> +demote_size) function as described above for the default huge page-sized case.
>
> .. _mem_policy_and_hp_alloc:
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..f2c3979efd69 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -596,6 +596,7 @@ struct hstate {
> int next_nid_to_alloc;
> int next_nid_to_free;
> unsigned int order;
> + unsigned int demote_order;
> unsigned long mask;
> unsigned long max_huge_pages;
> unsigned long nr_huge_pages;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 95dc7b83381f..d2262ad4b3ed 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2986,7 +2986,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>
> static void __init hugetlb_init_hstates(void)
> {
> - struct hstate *h;
> + struct hstate *h, *h2;
>
> for_each_hstate(h) {
> if (minimum_order > huge_page_order(h))
> @@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
> /* oversize hugepages were init'ed in early boot */
> if (!hstate_is_gigantic(h))
> hugetlb_hstate_alloc_pages(h);
> +
> + /*
> + * Set demote order for each hstate. Note that
> + * h->demote_order is initially 0.
> + * - We can not demote gigantic pages if runtime freeing
> + * is not supported, so skip this.
> + */
> + if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> + continue;
> + for_each_hstate(h2) {
> + if (h2 == h)
> + continue;
> + if (h2->order < h->order &&
> + h2->order > h->demote_order)
> + h->demote_order = h2->order;
> + }
> }
> VM_BUG_ON(minimum_order == UINT_MAX);
> }
> @@ -3235,9 +3251,31 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> return 0;
> }
>
> +static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
> + __must_hold(&hugetlb_lock)
> +{
> + int rc = 0;
> +
> + lockdep_assert_held(&hugetlb_lock);
> +
> + /* We should never get here if no demote order */
> + if (!h->demote_order) {
> + pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
> + return -EINVAL; /* internal error */
> + }
> +
> + /*
> + * TODO - demote fucntionality will be added in subsequent patch
> + */
> + return rc;
> +}
> +
> #define HSTATE_ATTR_RO(_name) \
> static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
>
> +#define HSTATE_ATTR_WO(_name) \
> + static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
> +
> #define HSTATE_ATTR(_name) \
> static struct kobj_attribute _name##_attr = \
> __ATTR(_name, 0644, _name##_show, _name##_store)
> @@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
> }
> HSTATE_ATTR_RO(surplus_hugepages);
>
> +static ssize_t demote_store(struct kobject *kobj,
> + struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> + unsigned long nr_demote;
> + unsigned long nr_available;
> + nodemask_t nodes_allowed, *n_mask;
> + struct hstate *h;
> + int err = 0;
> + int nid;
> +
> + err = kstrtoul(buf, 10, &nr_demote);
> + if (err)
> + return err;
> + h = kobj_to_hstate(kobj, &nid);
> +
> + if (nid != NUMA_NO_NODE) {
> + init_nodemask_of_node(&nodes_allowed, nid);
> + n_mask = &nodes_allowed;
> + } else {
> + n_mask = &node_states[N_MEMORY];
> + }
> +
> + /* Synchronize with other sysfs operations modifying huge pages */
> + mutex_lock(&h->resize_lock);
> + spin_lock_irq(&hugetlb_lock);
> +
> + while (nr_demote) {
> + /*
> + * Check for available pages to demote each time thorough the
> + * loop as demote_pool_huge_page will drop hugetlb_lock.
> + *
> + * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
> + * but will when full demote functionality is added in a later
> + * patch.
> + */
> + if (nid != NUMA_NO_NODE)
> + nr_available = h->free_huge_pages_node[nid];
> + else
> + nr_available = h->free_huge_pages;
> + nr_available -= h->resv_huge_pages;
> + if (!nr_available)
> + break;
> +
> + err = demote_pool_huge_page(h, n_mask);
> + if (err)
> + break;
> +
> + nr_demote--;
> + }
> +
> + spin_unlock_irq(&hugetlb_lock);
> + mutex_unlock(&h->resize_lock);
> +
> + if (err)
> + return err;
> + return len;
> +}
> +HSTATE_ATTR_WO(demote);
> +
> +static ssize_t demote_size_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + int nid;
> + struct hstate *h = kobj_to_hstate(kobj, &nid);
> + unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
> +
> + return sysfs_emit(buf, "%lukB\n", demote_size);
> +}
> +
> +static ssize_t demote_size_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct hstate *h, *demote_hstate;
> + unsigned long demote_size;
> + unsigned int demote_order;
> + int nid;
> +
> + demote_size = (unsigned long)memparse(buf, NULL);
> +
> + demote_hstate = size_to_hstate(demote_size);
> + if (!demote_hstate)
> + return -EINVAL;
> + demote_order = demote_hstate->order;
> +
> + /* demote order must be smaller than hstate order */
> + h = kobj_to_hstate(kobj, &nid);
> + if (demote_order >= h->order)
> + return -EINVAL;
> +
> + /* resize_lock synchronizes access to demote size and writes */
> + mutex_lock(&h->resize_lock);
> + h->demote_order = demote_order;
> + mutex_unlock(&h->resize_lock);
> +
> + return count;
> +}
> +HSTATE_ATTR(demote_size);
> +
> static struct attribute *hstate_attrs[] = {
> &nr_hugepages_attr.attr,
> &nr_overcommit_hugepages_attr.attr,
> @@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
> .attrs = hstate_attrs,
> };
>
> +static struct attribute *hstate_demote_attrs[] = {
> + &demote_size_attr.attr,
> + &demote_attr.attr,
> + NULL,
> +};
> +
> +static const struct attribute_group hstate_demote_attr_group = {
> + .attrs = hstate_demote_attrs,
> +};
> +
> static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
> struct kobject **hstate_kobjs,
> const struct attribute_group *hstate_attr_group)
> @@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
> hstate_kobjs[hi] = NULL;
> }
>
> + if (h->demote_order) {
> + if (sysfs_create_group(hstate_kobjs[hi],
> + &hstate_demote_attr_group))
> + pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
> + }
> +
> return retval;
> }
>
> --
> 2.31.1
>
--
Oscar Salvador
SUSE Labs
On Fri, Oct 22, 2021 at 12:05:47PM -0700, Mike Kravetz wrote:
> How about if this commit message provides links to previous commits
> describing these issues? There are pretty extensive descriptions in
> those previous commits, so no need to repeat here IMO.
Fine by me.
> The patch with an updated commit message is below.
>
> From 10fcff70c809402901a93ea507d5506c87a8227d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Fri, 22 Oct 2021 11:50:31 -0700
> Subject: [PATCH v4 4/5] hugetlb: add demote bool to gigantic page routines
>
> The routines remove_hugetlb_page and destroy_compound_gigantic_page
> will remove a gigantic page and make the set of base pages ready to be
> returned to a lower level allocator. In the process of doing this, they
> make all base pages reference counted.
>
> The routine prep_compound_gigantic_page creates a gigantic page from a
> set of base pages. It assumes that all these base pages are reference
> counted.
>
> During demotion, a gigantic page will be split into huge pages of a
> smaller size. This logically involves use of the routines,
> remove_hugetlb_page, and destroy_compound_gigantic_page followed by
> prep_compound*_page for each smaller huge page.
>
> When pages are reference counted (ref count >= 0), additional
> speculative ref counts could be taken as described in previous
> commits [1] and [2]. This could result in errors while demoting
> a huge page. Quite a bit of code would need to be created to
> handle all possible issues.
>
> Instead of dealing with the possibility of speculative ref counts, avoid
> the possibility by keeping ref counts at zero during the demote process.
> Add a boolean 'demote' to the routines remove_hugetlb_page,
> destroy_compound_gigantic_page and prep_compound_gigantic_page. If the
> boolean is set, the remove and destroy routines will not reference count
> pages and the prep routine will not expect reference counted pages.
>
> '*_for_demote' wrappers of the routines will be added in a subsequent
> patch where this functionality is used.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
--
Oscar Salvador
SUSE Labs