When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size. For example, on x86 a 1G page
can be demoted to 512 2M pages. However, gigantic pages can potentially
be allocated from CMA. If a gigantic page which was allocated from CMA
is demoted, the corresponding demoted pages need to be returned to CMA.

Use the new interface cma_pages_valid() to determine if a non-gigantic
hugetlb page should be freed to CMA. Also, clear the mapping field of
these pages, as expected by cma_release.

This also requires a change to CMA reservations for gigantic pages.
Currently, 'order_per_bit' is set to the gigantic page size. However,
if gigantic pages can be demoted, this needs to be set to the order of
the smallest huge page. At CMA reservation time we do not know the
smallest huge page size, so use HUGETLB_PAGE_ORDER. Also, prohibit
demotion to huge page sizes smaller than HUGETLB_PAGE_ORDER.
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f4dad1ab12d8..a15f6763e8f4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -50,6 +50,16 @@ struct hstate hstates[HUGE_MAX_HSTATE];
#ifdef CONFIG_CMA
static struct cma *hugetlb_cma[MAX_NUMNODES];
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+ return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
+ 1 << order);
+}
+#else
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+ return false;
+}
#endif
static unsigned long hugetlb_cma_size __initdata;
@@ -1272,6 +1282,7 @@ static void destroy_compound_gigantic_page(struct page *page,
atomic_set(compound_pincount_ptr(page), 0);
for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+ p->mapping = NULL;
clear_compound_head(p);
set_page_refcounted(p);
}
@@ -1476,7 +1487,13 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
1 << PG_active | 1 << PG_private |
1 << PG_writeback);
}
- if (hstate_is_gigantic(h)) {
+
+ /*
+ * Non-gigantic pages demoted from CMA allocated gigantic pages
+ * need to be given back to CMA in free_gigantic_page.
+ */
+ if (hstate_is_gigantic(h) ||
+ hugetlb_cma_page(page, huge_page_order(h))) {
destroy_compound_gigantic_page(page, huge_page_order(h));
free_gigantic_page(page, huge_page_order(h));
} else {
@@ -3003,7 +3020,8 @@ static void __init hugetlb_init_hstates(void)
* is not supported.
*/
if (!hstate_is_gigantic(h) ||
- gigantic_page_runtime_supported()) {
+ gigantic_page_runtime_supported() ||
+ !hugetlb_cma_size || !(h->order <= HUGETLB_PAGE_ORDER)) {
for_each_hstate(h2) {
if (h2 == h)
continue;
@@ -3555,6 +3573,8 @@ static ssize_t demote_size_store(struct kobject *kobj,
if (!t_hstate)
return -EINVAL;
demote_order = t_hstate->order;
+ if (demote_order < HUGETLB_PAGE_ORDER)
+ return -EINVAL;
/* demote order must be smaller hstate order */
h = kobj_to_hstate(kobj, &nid);
@@ -6563,7 +6583,13 @@ void __init hugetlb_cma_reserve(int order)
size = round_up(size, PAGE_SIZE << order);
snprintf(name, sizeof(name), "hugetlb%d", nid);
- res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+ /*
+ * Note that 'order per bit' is based on smallest size that
+ * may be returned to CMA allocator in the case of
+ * huge page demotion.
+ */
+ res = cma_declare_contiguous_nid(0, size, 0,
+ PAGE_SIZE << HUGETLB_PAGE_ORDER,
0, false, name,
&hugetlb_cma[nid], nid);
if (res) {
--
2.31.1
On Fri, Oct 01, 2021 at 10:52:08AM -0700, Mike Kravetz wrote:
> When huge page demotion is fully implemented, gigantic pages can be
> demoted to a smaller huge page size. For example, on x86 a 1G page
> can be demoted to 512 2M pages. However, gigantic pages can potentially
> be allocated from CMA. If a gigantic page which was allocated from CMA
> is demoted, the corresponding demoted pages needs to be returned to CMA.
>
> Use the new interface cma_pages_valid() to determine if a non-gigantic
> hugetlb page should be freed to CMA. Also, clear mapping field of these
> pages as expected by cma_release.
>
> This also requires a change to CMA reservations for gigantic pages.
> Currently, the 'order_per_bit' is set to the gigantic page size.
> However, if gigantic pages can be demoted this needs to be set to the
> order of the smallest huge page. At CMA reservation time we do not know
to the smallest, or to the next smaller? Would you mind elaborating why?
> @@ -3003,7 +3020,8 @@ static void __init hugetlb_init_hstates(void)
> * is not supported.
> */
> if (!hstate_is_gigantic(h) ||
> - gigantic_page_runtime_supported()) {
> + gigantic_page_runtime_supported() ||
> + !hugetlb_cma_size || !(h->order <= HUGETLB_PAGE_ORDER)) {
I am a bit lost in the CMA area, so bear with me.
We do not allow demotion if we specify we want hugetlb pages from CMA?
Also, can h->order be smaller than HUGETLB_PAGE_ORDER? I thought
HUGETLB_PAGE_ORDER was the smallest one.
The check for HUGETLB_PAGE_ORDER can probably be squashed into patch#1.
> for_each_hstate(h2) {
> if (h2 == h)
> continue;
> @@ -3555,6 +3573,8 @@ static ssize_t demote_size_store(struct kobject *kobj,
> if (!t_hstate)
> return -EINVAL;
> demote_order = t_hstate->order;
> + if (demote_order < HUGETLB_PAGE_ORDER)
> + return -EINVAL;
This could probably go in the first patch.
--
Oscar Salvador
SUSE Labs
On 10/5/21 2:33 AM, Oscar Salvador wrote:
> On Fri, Oct 01, 2021 at 10:52:08AM -0700, Mike Kravetz wrote:
>> When huge page demotion is fully implemented, gigantic pages can be
>> demoted to a smaller huge page size. For example, on x86 a 1G page
>> can be demoted to 512 2M pages. However, gigantic pages can potentially
>> be allocated from CMA. If a gigantic page which was allocated from CMA
>> is demoted, the corresponding demoted pages needs to be returned to CMA.
>>
>> Use the new interface cma_pages_valid() to determine if a non-gigantic
>> hugetlb page should be freed to CMA. Also, clear mapping field of these
>> pages as expected by cma_release.
>>
>> This also requires a change to CMA reservations for gigantic pages.
>> Currently, the 'order_per_bit' is set to the gigantic page size.
>> However, if gigantic pages can be demoted this needs to be set to the
>> order of the smallest huge page. At CMA reservation time we do not know
>
> to the smallest, or to the next smaller? Would you mind elaborating why?
>
It is the smallest.
CMA uses a per-region bit map to track allocations. When setting up the
region, you specify how many pages each bit represents. Currently,
only gigantic pages are allocated/freed from CMA so the region is set up
such that one bit represents a gigantic page size allocation.
With demote, a gigantic page (allocation) could be split into smaller
size pages. And, these smaller size pages will be freed to CMA. So,
since the per-region bit map needs to be set up to represent the smallest
allocation/free size, it now needs to be set to the smallest huge page
size which can be freed to CMA.
Unfortunately, we set up the CMA region for huge pages before we set up
huge page sizes (hstates). So, technically we do not know the smallest
huge page size as this can change via command line options and
architecture specific code. Therefore, at region setup time we need some
constant value for smallest possible huge page size. That is why
HUGETLB_PAGE_ORDER is used.
I should probably add all that to the changelog for clarity?
>> @@ -3003,7 +3020,8 @@ static void __init hugetlb_init_hstates(void)
>> * is not supported.
>> */
>> if (!hstate_is_gigantic(h) ||
>> - gigantic_page_runtime_supported()) {
>> + gigantic_page_runtime_supported() ||
>> + !hugetlb_cma_size || !(h->order <= HUGETLB_PAGE_ORDER)) {
>
> I am bit lost in the CMA area, so bear with me.
> We do not allow to demote if we specify we want hugetlb pages from the CMA?
We limit the size of the order which can be demoted if hugetlb pages can
be allocated from CMA.
> Also, can h->order be smaller than HUGETLB_PAGE_ORDER? I though
> HUGETLB_PAGE_ORDER was the smallest one.
Nope, arm64 with 64K PAGE_SIZE is one example. Huge page sizes/orders are:
CONT_PMD_SHIFT 34 16.0 GiB
PMD_SHIFT 29 512 MiB
CONT_PTE_SHIFT 21 2.00 MiB
#define HPAGE_SHIFT PMD_SHIFT
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
So, HUGETLB_PAGE_ORDER is associated with the 512 MiB huge page size,
but there is also a (smaller) 2.00 MiB huge page size.
After your comment yesterday about rewriting this code for clarity, this
now becomes:
/*
* Set demote order for each hstate. Note that
* h->demote_order is initially 0.
* - We can not demote gigantic pages if runtime freeing
* is not supported, so skip this.
* - If CMA allocation is possible, we can not demote
* HUGETLB_PAGE_ORDER or smaller size pages.
*/
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
continue;
if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
continue;
for_each_hstate(h2) {
if (h2 == h)
continue;
if (h2->order < h->order &&
h2->order > h->demote_order)
h->demote_order = h2->order;
}
Hopefully, that is more clear.
>
> The check for HUGETLB_PAGE_ORDER can probably be squashed into patch#1.
>
>
>> for_each_hstate(h2) {
>> if (h2 == h)
>> continue;
>> @@ -3555,6 +3573,8 @@ static ssize_t demote_size_store(struct kobject *kobj,
>> if (!t_hstate)
>> return -EINVAL;
>> demote_order = t_hstate->order;
>> + if (demote_order < HUGETLB_PAGE_ORDER)
>> + return -EINVAL;
>
> This could probably go in the first patch.
>
>
Both of the above HUGETLB_PAGE_ORDER checks 'could' go into the first
patch. However, the code which actually makes them necessary is in this
patch. I would prefer to leave them together here.
--
Mike Kravetz
On Tue, Oct 05, 2021 at 11:57:54AM -0700, Mike Kravetz wrote:
> It is the smallest.
>
> CMA uses a per-region bit map to track allocations. When setting up the
> region, you specify how many pages each bit represents. Currently,
> only gigantic pages are allocated/freed from CMA so the region is set up
> such that one bit represents a gigantic page size allocation.
>
> With demote, a gigantic page (allocation) could be split into smaller
> size pages. And, these smaller size pages will be freed to CMA. So,
> since the per-region bit map needs to be set up to represent the smallest
> allocation/free size, it now needs to be set to the smallest huge page
> size which can be freed to CMA.
>
> Unfortunately, we set up the CMA region for huge pages before we set up
> huge pages sizes (hstates). So, technically we do not know the smallest
> huge page size as this can change via command line options and
> architecture specific code. Therefore, at region setup time we need some
> constant value for smallest possible huge page size. That is why
> HUGETLB_PAGE_ORDER is used.
Do you know if that is done for a reason? Setting up CMA for hugetlb before
initializing hugetlb itself? Would it not make more sense to do it the other
way around?
The way I see it, it is a bit unfortunate that we cannot demote only
gigantic pages per se, so 1GB on x86_64 and 16G on arm64 with 64k page size.
I guess it is nothing to be worried about now as this is an early stage, but
maybe something to think about in the future in case we want to allow for
more flexibility.
> I should probably add all that to the changelog for clarity?
Yes, I think it would be great to have that as a context.
> After your comment yesterday about rewriting this code for clarity, this
> now becomes:
>
> /*
> * Set demote order for each hstate. Note that
> * h->demote_order is initially 0.
> * - We can not demote gigantic pages if runtime freeing
> * is not supported, so skip this.
> * - If CMA allocation is possible, we can not demote
> * HUGETLB_PAGE_ORDER or smaller size pages.
> */
> if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> continue;
> if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
> continue;
> for_each_hstate(h2) {
> if (h2 == h)
> continue;
> if (h2->order < h->order &&
> h2->order > h->demote_order)
> h->demote_order = h2->order;
> }
>
> Hopefully, that is more clear.
Definitely, this looks better to me.
--
Oscar Salvador
SUSE Labs
On 10/6/21 12:54 AM, Oscar Salvador wrote:
> On Tue, Oct 05, 2021 at 11:57:54AM -0700, Mike Kravetz wrote:
>> It is the smallest.
>>
>> CMA uses a per-region bit map to track allocations. When setting up the
>> region, you specify how many pages each bit represents. Currently,
>> only gigantic pages are allocated/freed from CMA so the region is set up
>> such that one bit represents a gigantic page size allocation.
>>
>> With demote, a gigantic page (allocation) could be split into smaller
>> size pages. And, these smaller size pages will be freed to CMA. So,
>> since the per-region bit map needs to be set up to represent the smallest
>> allocation/free size, it now needs to be set to the smallest huge page
>> size which can be freed to CMA.
>>
>> Unfortunately, we set up the CMA region for huge pages before we set up
>> huge pages sizes (hstates). So, technically we do not know the smallest
>> huge page size as this can change via command line options and
>> architecture specific code. Therefore, at region setup time we need some
>> constant value for smallest possible huge page size. That is why
>> HUGETLB_PAGE_ORDER is used.
>
> Do you know if that is done for a reason? Setting up CMA for hugetlb before
> initialiting hugetlb itself? Would not make more sense to do it the other way
> around?
>
One reason is that the initialization sequence is a bit messy. In most
cases, arch specific code sets up huge pages. So, we would need to make
sure this is done before the cma initialization. This might be
possible, but I am not confident in my abilities to understand/modify
and test early init code for all architectures supporting hugetlb cma
allocations.
In addition, not all architectures initialize their huge page sizes. It
is possible for architectures to only set up huge pages that have been
requested on the command line. In such cases, it would require some
fancy command line parsing to look for and process a hugetlb_cma argument
before any other hugetlb argument. Not even sure if this is possible.
The most reasonable way to address this would be to add an arch specific
callback asking for the smallest supported huge page size. I did not do
this here, as I am not sure this is really going to be an issue. In
the use case (and architecture) I know of, this is not an issue. As you
mention, this or something else could be added if the need arises.
--
Mike Kravetz
> The way I see it is that it is a bit unfortunate that we cannot only demote
> gigantic pages per se, so 1GB on x86_64 and 16G on arm64 with 64k page size.
>
> I guess nothing to be worried about now as this is an early stage, but maybe
> something to think about in the future in we case we want to allow for more
> flexibility.
>
>> I should probably add all that to the changelog for clarity?
>
> Yes, I think it would be great to have that as a context.
>
>> After your comment yesterday about rewriting this code for clarity, this
>> now becomes:
>>
>> /*
>> * Set demote order for each hstate. Note that
>> * h->demote_order is initially 0.
>> * - We can not demote gigantic pages if runtime freeing
>> * is not supported, so skip this.
>> * - If CMA allocation is possible, we can not demote
>> * HUGETLB_PAGE_ORDER or smaller size pages.
>> */
>> if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>> continue;
>> if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
>> continue;
>> for_each_hstate(h2) {
>> if (h2 == h)
>> continue;
>> if (h2->order < h->order &&
>> h2->order > h->demote_order)
>> h->demote_order = h2->order;
>> }
>>
>> Hopefully, that is more clear.
>
> Defintiely, this looks better to me.
>