This patch set was included as patches [04/21-06/21] in my
previous patch set to introduce Designated Movable Blocks [1].
They were included there for context, but they have value on
their own and are being resubmitted here for consideration on
their own merits.
The alloc_contig_range() function attempts to isolate the free
pages within the range and migrate the data from non-free pages
within the range to allow those pages to become isolated free
pages. If all of the pages in the range are on isolated free
page lists they can be allocated to the caller.
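For context, a caller of alloc_contig_range() looks roughly like
the sketch below; the PFN bounds and the MIGRATE_CMA/GFP_KERNEL
arguments are only illustrative:

	int ret;

	/* try to claim the physically contiguous range [start_pfn, end_pfn) */
	ret = alloc_contig_range(start_pfn, end_pfn, MIGRATE_CMA, GFP_KERNEL);
	if (ret)
		return ret;	/* some page could not be isolated or migrated */

	/* ... use the range, then return it to the page allocator ... */
	free_contig_range(start_pfn, end_pfn - start_pfn);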
When free hugepages are encountered in the range an attempt is
made to allocate a new compound page to become a replacement
hugepage and to dissolve the free hugepage so that its pages
within isolated pageblocks can be added to the isolated free
page lists. Hugepages that are not free and encountered within
the range must be migrated out of the range and dissolved to
allow the underlying pages to be added to the isolated free
page lists.
Moving the data from hugepages within the range and freeing the
hugepages is not sufficient: the allocation of migration target
hugepages is allowed to come from free hugepages that may contain
isolated pageblocks, and the freed hugepages will not be on
isolated free page lists, so alloc_contig_range() will fail.
To address these shortcomings, the HPG_dissolve hugepage flag is
introduced to tag hugepages that must be dissolved when they are
freed so that their underlying pages can be moved to the page
allocator's free lists. This prevents hugepages that have had
their data migrated to new hugepages from being made available
to subsequent hugepage allocations and allows the isolated free
page test of alloc_contig_range() to succeed.
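In outline, the tagging added by patch 2 below looks like this
(condensed from the patch; error handling trimmed):

	/* isolate_or_dissolve_huge_page(): in-use hugepage in the range */
	if (page_count(head) && !isolate_hugetlb(head, list)) {
		/* dissolve the migration source when it is finally freed */
		SetHPageDissolve(head);
		ret = 0;
	}

	/* putback_movable_pages(): migration failed, keep the hugepage */
	if (unlikely(PageHuge(page))) {
		ClearHPageDissolve(page);
		putback_active_hugepage(page);
		continue;
	}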
Dissolved hugepages must be replaced with new hugepages to keep
the hugetlb pool balanced. To support blocking allocation, a new
workqueue is introduced that is analogous to the workqueue
introduced to support the hugetlb vmemmap optimization. This new
workqueue allows the allocation and dissolution of the hugepage
to be offloaded to a separate context from the freeing of the
hugepage. The sync_hugetlb_dissolve() function is introduced to
allow outstanding dissolution of hugepages to complete before
the isolated free page check is made by alloc_contig_range().
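Condensed from patch 2 below, the deferral machinery is roughly:

	static LLIST_HEAD(hpage_dissolvelist);
	static DECLARE_WORK(dissolve_hpage_work, dissolve_hpage_workfn);

	/* free_huge_page(): defer dissolution instead of re-pooling the page */
	} else if (dissolve) {
		spin_unlock_irqrestore(&hugetlb_lock, flags);
		if (llist_add((struct llist_node *)&page->mapping,
			      &hpage_dissolvelist))
			schedule_work(&dissolve_hpage_work);
	}

	/* alloc_contig_range(): wait for outstanding dissolutions to finish */
	void sync_hugetlb_dissolve(void)
	{
		flush_work(&free_hpage_work);
		flush_work(&dissolve_hpage_work);
	}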
In addition, a check is added to hugepage allocation to prevent
free hugepages within an isolated pageblock range from being
used to satisfy migration target allocations, preventing circular
migrations.
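That check (patch 3 below) amounts to one extra test when
dequeueing a free hugepage in dequeue_huge_page_node_exact():

	/* skip free hugepages whose head pageblock is isolated */
	if (is_migrate_isolate_page(page))
		continue;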
[1] https://lore.kernel.org/linux-mm/[email protected]/
Doug Berger (3):
mm/hugetlb: refactor alloc_and_dissolve_huge_page
mm/hugetlb: allow migrated hugepage to dissolve when freed
mm/hugetlb: add hugepage isolation support
include/linux/hugetlb.h | 5 ++
mm/hugetlb.c | 161 +++++++++++++++++++++++++++++++---------
mm/migrate.c | 1 +
mm/page_alloc.c | 1 +
4 files changed, 131 insertions(+), 37 deletions(-)
base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
--
2.25.1
The alloc_replacement_page() and replace_hugepage() functions are
created from code in the alloc_and_dissolve_huge_page() function
to allow their reuse by the next commit.
Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 84 +++++++++++++++++++++++++++++++---------------------
1 file changed, 51 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e070b8593b37..2b60de78007c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2709,32 +2709,22 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
}
/*
- * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
- * @h: struct hstate old page belongs to
- * @old_page: Old page to dissolve
- * @list: List to isolate the page in case we need to
- * Returns 0 on success, otherwise negated error.
+ * Before dissolving the page, we need to allocate a new one for the
+ * pool to remain stable. Here, we allocate the page and 'prep' it
+ * by doing everything but actually updating counters and adding to
+ * the pool. This simplifies and let us do most of the processing
+ * under the lock.
*/
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
- struct list_head *list)
+static struct page *alloc_replacement_page(struct hstate *h, int nid)
{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
- int nid = page_to_nid(old_page);
bool alloc_retry = false;
struct page *new_page;
- int ret = 0;
- /*
- * Before dissolving the page, we need to allocate a new one for the
- * pool to remain stable. Here, we allocate the page and 'prep' it
- * by doing everything but actually updating counters and adding to
- * the pool. This simplifies and let us do most of the processing
- * under the lock.
- */
alloc_retry:
new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
if (!new_page)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
/*
* If all goes well, this page will be directly added to the free
* list in the pool. For this the ref count needs to be zero.
@@ -2748,7 +2738,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
SetHPageTemporary(new_page);
if (!put_page_testzero(new_page)) {
if (alloc_retry)
- return -EBUSY;
+ return ERR_PTR(-EBUSY);
alloc_retry = true;
goto alloc_retry;
@@ -2757,6 +2747,48 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
__prep_new_huge_page(h, new_page);
+ return new_page;
+}
+
+static void replace_hugepage(struct hstate *h, int nid, struct page *old_page,
+ struct page *new_page)
+{
+ lockdep_assert_held(&hugetlb_lock);
+ /*
+ * Ok, old_page is still a genuine free hugepage. Remove it from
+ * the freelist and decrease the counters. These will be
+ * incremented again when calling __prep_account_new_huge_page()
+ * and enqueue_huge_page() for new_page. The counters will remain
+ * stable since this happens under the lock.
+ */
+ remove_hugetlb_page(h, old_page, false);
+
+ /*
+ * Ref count on new page is already zero as it was dropped
+ * earlier. It can be directly added to the pool free list.
+ */
+ __prep_account_new_huge_page(h, nid);
+ enqueue_huge_page(h, new_page);
+}
+
+/*
+ * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
+ * @h: struct hstate old page belongs to
+ * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
+ * Returns 0 on success, otherwise negated error.
+ */
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
+ struct list_head *list)
+{
+ int nid = page_to_nid(old_page);
+ struct page *new_page;
+ int ret = 0;
+
+ new_page = alloc_replacement_page(h, nid);
+ if (IS_ERR(new_page))
+ return PTR_ERR(new_page);
+
retry:
spin_lock_irq(&hugetlb_lock);
if (!PageHuge(old_page)) {
@@ -2783,21 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
cond_resched();
goto retry;
} else {
- /*
- * Ok, old_page is still a genuine free hugepage. Remove it from
- * the freelist and decrease the counters. These will be
- * incremented again when calling __prep_account_new_huge_page()
- * and enqueue_huge_page() for new_page. The counters will remain
- * stable since this happens under the lock.
- */
- remove_hugetlb_page(h, old_page, false);
-
- /*
- * Ref count on new page is already zero as it was dropped
- * earlier. It can be directly added to the pool free list.
- */
- __prep_account_new_huge_page(h, nid);
- enqueue_huge_page(h, new_page);
+ replace_hugepage(h, nid, old_page, new_page);
/*
* Pages have been replaced, we can safely free the old one.
--
2.25.1
There is no isolation mechanism for hugepages, so a hugepage that
is migrated is returned to its hugepage freelist. This creates
problems for alloc_contig_range() because migrated hugepages can
be allocated as migration targets for subsequent hugepage
migration attempts.
Even if the migration succeeds the alloc_contig_range() attempt
will fail because test_pages_isolated() will find the now free
hugepages haven't been dissolved.
A subsequent attempt by alloc_contig_range() is necessary for the
isolate_migratepages_range() function to find the freed hugepage
and dissolve it (assuming it has not been reallocated).
A workqueue is introduced to perform the equivalent functionality
of alloc_and_dissolve_huge_page() for a migrated hugepage when it
is freed, so that its pages can be released to the buddy
allocator's isolated free page lists, allowing the
alloc_contig_range() attempt to succeed.
The HPG_dissolve hugepage flag is introduced to allow tagging
migratable hugepages that should be dissolved when freed.
Signed-off-by: Doug Berger <[email protected]>
---
include/linux/hugetlb.h | 5 +++
mm/hugetlb.c | 72 ++++++++++++++++++++++++++++++++++++++---
mm/migrate.c | 1 +
mm/page_alloc.c | 1 +
4 files changed, 75 insertions(+), 4 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..0e6e21805e51 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -222,6 +222,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
+void sync_hugetlb_dissolve(void);
#else /* !CONFIG_HUGETLB_PAGE */
@@ -430,6 +431,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
+static inline void sync_hugetlb_dissolve(void) { }
+
#endif /* !CONFIG_HUGETLB_PAGE */
/*
* hugepages at page global directory. If arch support
@@ -574,6 +577,7 @@ enum hugetlb_page_flags {
HPG_freed,
HPG_vmemmap_optimized,
HPG_raw_hwp_unreliable,
+ HPG_dissolve,
__NR_HPAGEFLAGS,
};
@@ -621,6 +625,7 @@ HPAGEFLAG(Temporary, temporary)
HPAGEFLAG(Freed, freed)
HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
+HPAGEFLAG(Dissolve, dissolve)
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2b60de78007c..eab812760fbe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1582,6 +1582,10 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
}
}
+static LLIST_HEAD(hpage_dissolvelist);
+static void dissolve_hpage_workfn(struct work_struct *work);
+static DECLARE_WORK(dissolve_hpage_work, dissolve_hpage_workfn);
+
/*
* As update_and_free_page() can be called under any context, so we cannot
* use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
@@ -1628,6 +1632,8 @@ static inline void flush_free_hpage_work(struct hstate *h)
{
if (hugetlb_vmemmap_optimizable(h))
flush_work(&free_hpage_work);
+ if (!hstate_is_gigantic(h))
+ flush_work(&dissolve_hpage_work);
}
static void update_and_free_page(struct hstate *h, struct page *page,
@@ -1679,7 +1685,7 @@ void free_huge_page(struct page *page)
struct hstate *h = page_hstate(page);
int nid = page_to_nid(page);
struct hugepage_subpool *spool = hugetlb_page_subpool(page);
- bool restore_reserve;
+ bool restore_reserve, dissolve;
unsigned long flags;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1691,6 +1697,8 @@ void free_huge_page(struct page *page)
page->mapping = NULL;
restore_reserve = HPageRestoreReserve(page);
ClearHPageRestoreReserve(page);
+ dissolve = HPageDissolve(page);
+ ClearHPageDissolve(page);
/*
* If HPageRestoreReserve was set on page, page allocation consumed a
@@ -1729,6 +1737,11 @@ void free_huge_page(struct page *page)
remove_hugetlb_page(h, page, true);
spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_page(h, page, true);
+ } else if (dissolve) {
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+ if (llist_add((struct llist_node *)&page->mapping,
+ &hpage_dissolvelist))
+ schedule_work(&dissolve_hpage_work);
} else {
arch_clear_hugepage_flags(page);
enqueue_huge_page(h, page);
@@ -2771,6 +2784,49 @@ static void replace_hugepage(struct hstate *h, int nid, struct page *old_page,
enqueue_huge_page(h, new_page);
}
+static void dissolve_hpage_workfn(struct work_struct *work)
+{
+ struct llist_node *node;
+
+ node = llist_del_all(&hpage_dissolvelist);
+
+ while (node) {
+ struct page *oldpage, *newpage;
+ struct hstate *h;
+ int nid;
+
+ oldpage = container_of((struct address_space **)node,
+ struct page, mapping);
+ node = node->next;
+ oldpage->mapping = NULL;
+
+ h = page_hstate(oldpage);
+ nid = page_to_nid(oldpage);
+
+ newpage = alloc_replacement_page(h, nid);
+
+ spin_lock_irq(&hugetlb_lock);
+ /* finish freeing oldpage */
+ arch_clear_hugepage_flags(oldpage);
+ enqueue_huge_page(h, oldpage);
+ if (IS_ERR(newpage)) {
+ /* cannot dissolve so just leave free */
+ spin_unlock_irq(&hugetlb_lock);
+ goto next;
+ }
+
+ replace_hugepage(h, nid, oldpage, newpage);
+
+ /*
+ * Pages have been replaced, we can safely free the old one.
+ */
+ spin_unlock_irq(&hugetlb_lock);
+ __update_and_free_page(h, oldpage);
+next:
+ cond_resched();
+ }
+}
+
/*
* alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
* @h: struct hstate old page belongs to
@@ -2803,6 +2859,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
*/
spin_unlock_irq(&hugetlb_lock);
ret = isolate_hugetlb(old_page, list);
+ SetHPageDissolve(old_page);
spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
@@ -2864,14 +2921,21 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
if (hstate_is_gigantic(h))
return -ENOMEM;
- if (page_count(head) && !isolate_hugetlb(head, list))
+ if (page_count(head) && !isolate_hugetlb(head, list)) {
+ SetHPageDissolve(head);
ret = 0;
- else if (!page_count(head))
+ } else if (!page_count(head)) {
ret = alloc_and_dissolve_huge_page(h, head, list);
-
+ }
return ret;
}
+void sync_hugetlb_dissolve(void)
+{
+ flush_work(&free_hpage_work);
+ flush_work(&dissolve_hpage_work);
+}
+
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..b6c6123e614c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -141,6 +141,7 @@ void putback_movable_pages(struct list_head *l)
list_for_each_entry_safe(page, page2, l, lru) {
if (unlikely(PageHuge(page))) {
+ ClearHPageDissolve(page);
putback_active_hugepage(page);
continue;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..6bf76bbc0308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9235,6 +9235,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
if (ret && ret != -EBUSY)
goto done;
ret = 0;
+ sync_hugetlb_dissolve();
/*
* Pages from [start, end) are within a pageblock_nr_pages
--
2.25.1
When a range of pageblocks is isolated, there is at most one
hugepage that has only tail pages overlapping that range (i.e.
a hugepage that overlaps the beginning of the range).
However, that hugepage is the first migration target for an
alloc_contig_range() attempt, so it already receives special
attention.
Checking whether the pageblock containing the head of a hugepage
is isolated is an inexpensive way to avoid hugepage allocations
from isolated pageblocks, which makes alloc_contig_range() more
efficient.
Signed-off-by: Doug Berger <[email protected]>
---
mm/hugetlb.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eab812760fbe..3a2f0b55059d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -33,6 +33,7 @@
#include <linux/migrate.h>
#include <linux/nospec.h>
#include <linux/delayacct.h>
+#include <linux/page-isolation.h>
#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -1135,6 +1136,10 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
if (PageHWPoison(page))
continue;
+ /* Check head pageblock isolation */
+ if (is_migrate_isolate_page(page))
+ continue;
+
list_move(&page->lru, &h->hugepage_activelist);
set_page_refcounted(page);
ClearHPageFreed(page);
--
2.25.1
On 09/21/22 15:36, Doug Berger wrote:
> This patch set was included as patches [04/21-06/21] in my
> previous patch set to introduce Designated Movable Blocks [1].
> They were included there for context, but they have value on
> their own and are being resubmitted here for consideration on
> their own merits.
>
> The alloc_contig_range() function attempts to isolate the free
> pages within the range and migrate the data from non-free pages
> within the range to allow those pages to become isolated free
> pages. If all of the pages in the range are on isolated free
> page lists they can be allocated to the caller.
>
> When free hugepages are encountered in the range an attempt is
> made to allocate a new compound page to become a replacement
> hugepage and to dissolve the free hugepage so that its pages
> within isolated pageblocks can be added to the isolated free
> page lists. Hugepages that are not free and encountered within
> the range must be migrated out of the range and dissolved to
> allow the underlying pages to be added to the isolated free
> page lists.
>
> Moving the data from hugepages within the range and freeing the
> hugepages is not sufficient: the allocation of migration target
> hugepages is allowed to come from free hugepages that may contain
> isolated pageblocks, and the freed hugepages will not be on
> isolated free page lists, so alloc_contig_range() will fail.
Thanks for looking into this! I am adding Oscar, Michal and David on
Cc: as they have looked at similar issues in the past.
Before jumping into the details of your changes, I just want to make one
observation. When migrating (or dissolving) hugetlb pages that are in a
range operated on by alloc_contig_range(), we should not change the number
of hugetlb pages available as noted here. This includes the number of
total huge pages and the number of huge pages on the node. Therefore,
we should allocate another huge page from buddy either as the migration
target or to take the place of the dissolved page.
For free huge pages, we do this via alloc_and_dissolve_huge_page. IIUC,
there are no issues with this code path?
As noted above, for pages to be migrated we first try to use an existing
free huge page as the target. Quite some time ago, Michal added code to
allocate a new page from buddy as the target if no free huge pages were
available. This change also included a special flag to dissolve the
source huge page when it is freed. It seems like this is the exact
behavior we want here? I wonder if it might be easier just to use this
existing code?
--
Mike Kravetz
> To address these shortcomings, the HPG_dissolve hugepage flag is
> introduced to tag hugepages that must be dissolved when they are
> freed so that their underlying pages can be moved to the page
> allocator's free lists. This prevents hugepages that have had
> their data migrated to new hugepages from being made available
> to subsequent hugepage allocations and allows the isolated free
> page test of alloc_contig_range() to succeed.
>
> Dissolved hugepages must be replaced with new hugepages to keep
> the hugetlb pool balanced. To support blocking allocation, a new
> workqueue is introduced that is analogous to the workqueue
> introduced to support the hugetlb vmemmap optimization. This new
> workqueue allows the allocation and dissolution of the hugepage
> to be offloaded to a separate context from the freeing of the
> hugepage. The sync_hugetlb_dissolve() function is introduced to
> allow outstanding dissolution of hugepages to complete before
> the isolated free page check is made by alloc_contig_range().
>
> In addition, a check is added to hugepage allocation to prevent
> free hugepages within an isolated pageblock range from being
> used to satisfy migration target allocations, preventing circular
> migrations.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Doug Berger (3):
> mm/hugetlb: refactor alloc_and_dissolve_huge_page
> mm/hugetlb: allow migrated hugepage to dissolve when freed
> mm/hugetlb: add hugepage isolation support
>
> include/linux/hugetlb.h | 5 ++
> mm/hugetlb.c | 161 +++++++++++++++++++++++++++++++---------
> mm/migrate.c | 1 +
> mm/page_alloc.c | 1 +
> 4 files changed, 131 insertions(+), 37 deletions(-)
>
>
> base-commit: b90cb1053190353cc30f0fef0ef1f378ccc063c5
> --
> 2.25.1
>
On 09/22/22 13:25, Mike Kravetz wrote:
> On 09/21/22 15:36, Doug Berger wrote:
>
> As noted above, for pages to be migrated we first try to use an existing
> free huge page as the target. Quite some time ago, Michal added code to
> allocate a new page from buddy as the target if no free huge pages were
> available. This change also included a special flag to dissolve the
> source huge page when it is freed. It seems like this is the exact
> behavior we want here? I wonder if it might be easier just to use this
> existing code?
Totally untested, but I believe the patch below would accomplish this.
From aa8fc11bb67bc9e67e3b6b280fab339afce37759 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Thu, 22 Sep 2022 15:32:10 -0700
Subject: [PATCH] hugetlb: force alloc_contig_range hugetlb migrations to
allocate new pages
When migrating hugetlb pages as the result of an alloc_contig_range
operation, allocate a new page from buddy for the migration target.
This guarantees that the number of hugetlb pages is not decreased by
the operation. In addition, this will result in the special HPageTemporary
flag being set in the source page so that it will be dissolved when
freed.
Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 5 +++--
mm/hugetlb.c | 12 ++++++++++--
mm/internal.h | 1 +
mm/migrate.c | 3 ++-
mm/page_alloc.c | 1 +
5 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cfe15b32e2d4..558831bf1087 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -702,7 +702,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask);
+ nodemask_t *nmask, gfp_t gfp_mask,
+ bool temporary);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
@@ -1003,7 +1004,7 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
static inline struct page *
alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask, bool temporary)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8bcaf66defc5..19de8ae79ec8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2308,8 +2308,11 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
/* page migration callback function */
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
- nodemask_t *nmask, gfp_t gfp_mask)
+ nodemask_t *nmask, gfp_t gfp_mask, bool temporary)
{
+ if (temporary)
+ goto temporary_alloc;
+
spin_lock_irq(&hugetlb_lock);
if (h->free_huge_pages - h->resv_huge_pages > 0) {
struct page *page;
@@ -2322,6 +2325,11 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
}
spin_unlock_irq(&hugetlb_lock);
+temporary_alloc:
+ /*
+	 * Try to allocate a fresh page with special HPageTemporary
+ * characteristics
+ */
return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
}
@@ -2337,7 +2345,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
gfp_mask = htlb_alloc_mask(h);
node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
- page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask);
+ page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask, false);
mpol_cond_put(mpol);
return page;
diff --git a/mm/internal.h b/mm/internal.h
index b3002e03c28f..3ebf8885e24f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -800,6 +800,7 @@ struct migration_target_control {
int nid; /* preferred node id */
nodemask_t *nmask;
gfp_t gfp_mask;
+ bool alloc_contig; /* alloc_contig_range allocation */
};
/*
diff --git a/mm/migrate.c b/mm/migrate.c
index c228afba0963..6505ba2070d7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1610,7 +1610,8 @@ struct page *alloc_migration_target(struct page *page, unsigned long private)
struct hstate *h = page_hstate(&folio->page);
gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
- return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask);
+ return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask,
+ mtc->alloc_contig);
}
if (folio_test_large(folio)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d7b20bf09c1c..2b8a5a2b51cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9166,6 +9166,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
struct migration_target_control mtc = {
.nid = zone_to_nid(cc->zone),
.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
+ .alloc_contig = true,
};
lru_cache_disable();
--
2.37.3
On 9/22/2022 3:41 PM, Mike Kravetz wrote:
> On 09/22/22 13:25, Mike Kravetz wrote:
>> On 09/21/22 15:36, Doug Berger wrote:
>>
>> As noted above, for pages to be migrated we first try to use an existing
>> free huge page as the target. Quite some time ago, Michal added code to
>> allocate a new page from buddy as the target if no free huge pages were
>> available. This change also included a special flag to dissolve the
>> source huge page when it is freed. It seems like this is the exact
>> behavior we want here? I wonder if it might be easier just to use this
>> existing code?
>
> Totally untested, but I believe the patch below would accomplish this.
>
> From aa8fc11bb67bc9e67e3b6b280fab339afce37759 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Thu, 22 Sep 2022 15:32:10 -0700
> Subject: [PATCH] hugetlb: force alloc_contig_range hugetlb migrations to
> allocate new pages
>
> When migrating hugetlb pages as the result of an alloc_contig_range
> operation, allocate a new page from buddy for the migration target.
> This guarantees that the number of hugetlb pages is not decreased by
> the operation. In addition, this will result in the special HPageTemporary
> flag being set in the source page so that it will be dissolved when
> freed.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> include/linux/hugetlb.h | 5 +++--
> mm/hugetlb.c | 12 ++++++++++--
> mm/internal.h | 1 +
> mm/migrate.c | 3 ++-
> mm/page_alloc.c | 1 +
> 5 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index cfe15b32e2d4..558831bf1087 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -702,7 +702,8 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
> struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve);
> struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> - nodemask_t *nmask, gfp_t gfp_mask);
> + nodemask_t *nmask, gfp_t gfp_mask,
> + bool temporary);
> struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
> unsigned long address);
> int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
> @@ -1003,7 +1004,7 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
>
> static inline struct page *
> alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> - nodemask_t *nmask, gfp_t gfp_mask)
> + nodemask_t *nmask, gfp_t gfp_mask, bool temporary)
> {
> return NULL;
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8bcaf66defc5..19de8ae79ec8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2308,8 +2308,11 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>
> /* page migration callback function */
> struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> - nodemask_t *nmask, gfp_t gfp_mask)
> + nodemask_t *nmask, gfp_t gfp_mask, bool temporary)
> {
> + if (temporary)
> + goto temporary_alloc;
> +
> spin_lock_irq(&hugetlb_lock);
> if (h->free_huge_pages - h->resv_huge_pages > 0) {
> struct page *page;
> @@ -2322,6 +2325,11 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> }
> spin_unlock_irq(&hugetlb_lock);
>
> +temporary_alloc:
> + /*
> +	 * Try to allocate a fresh page with special HPageTemporary
> + * characteristics
> + */
> return alloc_migrate_huge_page(h, gfp_mask, preferred_nid, nmask);
> }
>
> @@ -2337,7 +2345,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>
> gfp_mask = htlb_alloc_mask(h);
> node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> - page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask);
> + page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask, false);
> mpol_cond_put(mpol);
>
> return page;
> diff --git a/mm/internal.h b/mm/internal.h
> index b3002e03c28f..3ebf8885e24f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -800,6 +800,7 @@ struct migration_target_control {
> int nid; /* preferred node id */
> nodemask_t *nmask;
> gfp_t gfp_mask;
> + bool alloc_contig; /* alloc_contig_range allocation */
> };
>
> /*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c228afba0963..6505ba2070d7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1610,7 +1610,8 @@ struct page *alloc_migration_target(struct page *page, unsigned long private)
> struct hstate *h = page_hstate(&folio->page);
>
> gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
> - return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask);
> + return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask,
> + mtc->alloc_contig);
> }
>
> if (folio_test_large(folio)) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d7b20bf09c1c..2b8a5a2b51cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -9166,6 +9166,7 @@ int __alloc_contig_migrate_range(struct compact_control *cc,
> struct migration_target_control mtc = {
> .nid = zone_to_nid(cc->zone),
> .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> + .alloc_contig = true,
> };
>
> lru_cache_disable();
I believe I exposed alloc_migrate_huge_page() and conditionally invoked
it from alloc_migration_target() when in alloc_contig, which is roughly
equivalent. I didn't consider modifying the mtc to pass the information
so my logic in alloc_migration_target() was a little kludgy.
Like I said, this can be made to work and I'm happy to accept an
alternative if others agree. I think the isolation test of patch 3 is
also still desirable.
Thanks again!
-Doug
On 9/22/2022 1:25 PM, Mike Kravetz wrote:
> On 09/21/22 15:36, Doug Berger wrote:
>> This patch set was included as patches [04/21-06/21] in my
>> previous patch set to introduce Designated Movable Blocks [1].
>> They were included there for context, but they have value on
>> their own and are being resubmitted here for consideration on
>> their own merits.
>>
>> The alloc_contig_range() function attempts to isolate the free
>> pages within the range and migrate the data from non-free pages
>> within the range to allow those pages to become isolated free
>> pages. If all of the pages in the range are on isolated free
>> page lists they can be allocated to the caller.
>>
>> When free hugepages are encountered in the range an attempt is
>> made to allocate a new compound page to become a replacement
>> hugepage and to dissolve the free hugepage so that its pages
>> within isolated pageblocks can be added to the isolated free
>> page lists. Hugepages that are not free and encountered within
>> the range must be migrated out of the range and dissolved to
>> allow the underlying pages to be added to the isolated free
>> page lists.
>>
>> Moving the data from hugepages within the range and freeing the
>> hugepages is not sufficient: the allocation of migration target
>> hugepages is allowed to come from free hugepages that may contain
>> isolated pageblocks, and the freed hugepages will not be on
>> isolated free page lists, so alloc_contig_range() will fail.
>
> Thanks for looking into this! I am adding Oscar, Michal and David on
> Cc: as they have looked at similar issues in the past.
Thanks for helping to get the right eyeballs looking at this.
>
> Before jumping into the details of your changes, I just want to make one
> observation. When migrating (or dissolving) hugetlb pages that are in a
> range operated on by alloc_contig_range(), we should not change the number
> of hugetlb pages available as noted here. This includes the number of
> total huge pages and the number of huge pages on the node. Therefore,
> we should allocate another huge page from buddy either as the migration
> target or to take the place of the dissolved page.
>
> For free huge pages, we do this via alloc_and_dissolve_huge_page. IIUC,
> there are no issues with this code path?
Yes, I did not observe any issue with the free hugepage handling
except that your recent improvements with freezing allocated pages
(thanks for those) will make merging patch 1 more difficult ;).
>
> As noted above, for pages to be migrated we first try to use an existing
> free huge page as the target. Quite some time ago, Michal added code to
> allocate a new page from buddy as the target if no free huge pages were
> available. This change also included a special flag to dissolve the
> source huge page when it is freed. It seems like this is the exact
> behavior we want here? I wonder if it might be easier just to use this
> existing code?
Yes, the temporary page flag can be made to work here and I did
experiment with it, but it is dependent on the current design decisions.
I decided to move to the approach suggested here because I believe it
could conceivably scale if support for migrating gigantic pages is
desired in the future. The circular dependency on alloc_contig_range
will likely keep that from happening, but that was my thinking.
Thanks for taking the time,
-Doug
On 09/22/22 16:27, Doug Berger wrote:
> On 9/22/2022 3:41 PM, Mike Kravetz wrote:
> > On 09/22/22 13:25, Mike Kravetz wrote:
> > > On 09/21/22 15:36, Doug Berger wrote:
> > >
> > > As noted above, for pages to be migrated we first try to use an existing
> > > free huge page as the target. Quite some time ago, Michal added code to
> > > allocate a new page from buddy as the target if no free huge pages were
> > > available. This change also included a special flag to dissolve the
> > > source huge page when it is freed. It seems like this is the exact
> > > behavior we want here? I wonder if it might be easier just to use this
> > > existing code?
> >
> > Totally untested, but I believe the patch below would accomplish this.
> >
> > From aa8fc11bb67bc9e67e3b6b280fab339afce37759 Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <[email protected]>
> > Date: Thu, 22 Sep 2022 15:32:10 -0700
> > Subject: [PATCH] hugetlb: force alloc_contig_range hugetlb migrations to
> > allocate new pages
> >
> > When migrating hugetlb pages as the result of an alloc_contig_range
> > operation, allocate a new page from buddy for the migration target.
> > This guarantees that the number of hugetlb pages is not decreased by
> > the operation. In addition, this will result in the special HPageTemporary
> > flag being set in the source page so that it will be dissolved when
> > freed.
> >
<snip>
> I believe I exposed alloc_migrate_huge_page() and conditionally invoked it
> from alloc_migration_target() when in alloc_contig, which is roughly
> equivalent. I didn't consider modifying the mtc to pass the information so
> my logic in alloc_migration_target() was a little kludgy.
>
> Like I said, this can be made to work and I'm happy to accept an alternative
> if others agree. I think the isolation test of patch 3 is also still
> desirable.
Yes, hoping to get some other opinions as well.
I do agree that patch 3 is still a good idea.
--
Mike Kravetz