2023-09-16 05:36:56

by Mike Kravetz

Subject: [PATCH v3 00/12] Batch hugetlb vmemmap modification operations

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower. Such overhead was deemed
an acceptable trade-off for the memory savings obtained by freeing vmemmap
pages.

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x! Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s

In the VM environment, enabling hugetlb vmemmap optimization made allocation
roughly 60x slower (2m27.609s versus 0m2.463s).
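
The slowdown factors can be checked directly from the `real` times quoted
above. The sketch below is only arithmetic on the posted timings; the helper
names real_seconds() and slowdown() are illustrative and not part of the
series.

```c
#include <assert.h>

/* Convert a `time` style "real XmY.YYYs" value to seconds. */
static double real_seconds(int minutes, double seconds)
{
	assert(minutes >= 0 && seconds >= 0.0);
	return minutes * 60.0 + seconds;
}

/* Slowdown factor: elapsed time with optimization over time without. */
static double slowdown(double optimized, double baseline)
{
	assert(baseline > 0.0);
	return optimized / baseline;
}
```

For the VM allocation case, slowdown(real_seconds(2, 27.609),
real_seconds(0, 2.463)) comes out a little under 60, in the same range as
the figure quoted above. The bare metal allocation case (28.973s against
4.119s) comes out at roughly 7.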

A quick profile showed that the vast majority of this overhead was due to
TLB flushing. Each time we modify the kernel pagetable we need to flush
the TLB. For each hugetlb page that is optimized, up to two TLB flushes
may be performed: one for the vmemmap pages associated with the hugetlb
page, and another if the vmemmap pages are mapped at the PMD level and
must be split. The TLB flushes required for the kernel pagetable result
in a broadcast IPI with each CPU having to flush a range of pages, or do
a global flush if a threshold is exceeded. So, the flush time increases
with the number of CPUs. In addition, in virtual environments the
broadcast IPI can't be accelerated by hypervisor hardware and leads to
traps that need to wakeup/IPI all vCPUs, which is very expensive. Because
of this, the slowdown in virtual environments is even worse than on bare
metal as the number of vCPUs/CPUs increases.
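
As a rough illustration of why the cost grows with both the number of pages
and the number of CPUs, here is a toy model of the unbatched case. It is not
kernel code: the function names and the idea of multiplying flushes by the
CPU count are assumptions made only to show the scaling described above.

```c
#include <assert.h>

/*
 * Toy model: number of TLB flushes when every hugetlb page is optimized
 * on its own. Each page costs one flush for the vmemmap remap, plus one
 * more if its vmemmap was mapped at the PMD level and had to be split.
 */
static unsigned long unbatched_flushes(unsigned long nr_pages,
				       unsigned long pages_needing_split)
{
	assert(pages_needing_split <= nr_pages);
	return nr_pages + pages_needing_split;
}

/*
 * Each flush is broadcast to every CPU, so the total per-CPU flush work
 * scales with the CPU count as well as the flush count.
 */
static unsigned long long per_cpu_flush_events(unsigned long flushes,
					       unsigned int nr_cpus)
{
	return (unsigned long long)flushes * nr_cpus;
}
```

With 524288 pages the model gives between 524288 and 1048576 flushes, and
each of those is multiplied by the CPU count (252 in the VM case above).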

The following series attempts to reduce the amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages. Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages. One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages. Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s

With batching, results are back in the 2-3x slowdown range.
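
The batching idea described earlier (two flushes per batch instead of one or
two per page) can be sketched with a small counting model. The batch size is
a hypothetical parameter here; the cover letter does not state one, and the
function names are illustrative only.

```c
#include <assert.h>

/* Toy model: number of batches needed to cover nr_pages, rounding up. */
static unsigned long nr_batches(unsigned long nr_pages,
				unsigned long batch_size)
{
	assert(batch_size > 0);
	return (nr_pages + batch_size - 1) / batch_size;
}

/*
 * Two flushes per batch: one after splitting PMD-mapped vmemmap, one
 * after remapping the vmemmap of every hugetlb page in the batch.
 */
static unsigned long batched_flushes(unsigned long nr_pages,
				     unsigned long batch_size)
{
	return 2 * nr_batches(nr_pages, batch_size);
}
```

If all 524288 pages were handled as a single batch, the model gives 2
flushes, compared with at least 524288 when each page is flushed on its own.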

This series is based on next-20230913.
The first 4 patches of the series are modifications currently going into the
mm tree that modify the same area, or fix BUGs hit easily when exercising
this series. They are not directly related to the batching changes.
Patch 5 (hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles)
is where batching changes begin.

Changes v2 -> v3:
- patch 5 was part of an earlier series that was not picked up. It is
included here as it helps with batching optimizations.
- patch 6 hugetlb_vmemmap_restore_folios is changed from type void to
returning an error code as well as an additional output parameter providing
the number of folios for which vmemmap was actually restored. The caller can
then be more intelligent about processing the list.
- patch 9 eliminate local list in vmemmap_restore_pte. The routine
hugetlb_vmemmap_optimize_folios checks for ENOMEM and frees accumulated
vmemmap pages while processing the list.
- patch 10 introduce flags field to struct vmemmap_remap_walk and
VMEMMAP_SPLIT_NO_TLB_FLUSH for not flushing during pass to split PMDs.
- patch 11 rename flag VMEMMAP_REMAP_NO_TLB_FLUSH and pass in from callers.

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
pages with hugetlb flag set can be passed to vmemmap routines. This
involved separating the 'prep' of hugetlb pages even further. The code
dealing with bootmem allocations was also modified so that batching is
possible. Adding a 'batch' of hugetlb pages to their respective free
lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
(Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
is being performed (Muchun). Removed unused variable and
hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
allow for future expansion (Muchun). Single flag VMEMMAP_NO_TLB_FLUSH.
Provide detailed comment about the need to keep old and new vmemmap pages
in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).


Joao Martins (2):
hugetlb: batch PMD split for bulk vmemmap dedup
hugetlb: batch TLB flushes when freeing vmemmap

Johannes Weiner (1):
mm: page_alloc: remove pcppage migratetype caching fix

Matthew Wilcox (Oracle) (3):
hugetlb: Use a folio in free_hpage_workfn()
hugetlb: Remove a few calls to page_folio()
hugetlb: Convert remove_pool_huge_page() to
remove_pool_hugetlb_folio()

Mike Kravetz (6):
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
hugetlb: restructure pool allocations
hugetlb: perform vmemmap optimization on a list of pages
hugetlb: perform vmemmap restoration on a list of pages
hugetlb: batch freeing of vmemmap pages
hugetlb: batch TLB flushes when restoring vmemmap

mm/hugetlb.c | 288 ++++++++++++++++++++++++++++++++-----------
mm/hugetlb_vmemmap.c | 255 ++++++++++++++++++++++++++++++++------
mm/hugetlb_vmemmap.h | 16 +++
mm/page_alloc.c | 3 -
4 files changed, 452 insertions(+), 110 deletions(-)

--
2.41.0


2023-09-16 07:55:44

by Mike Kravetz

Subject: [PATCH v3 04/12] hugetlb: Convert remove_pool_huge_page() to remove_pool_hugetlb_folio()

From: "Matthew Wilcox (Oracle)" <[email protected]>

Convert the callers to expect a folio and remove the unnecessary conversion
back to a struct page.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
---
mm/hugetlb.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7bbdc71fb34d..744e214c7d9b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1439,7 +1439,7 @@ static int hstate_next_node_to_alloc(struct hstate *h,
}

/*
- * helper for remove_pool_huge_page() - return the previously saved
+ * helper for remove_pool_hugetlb_folio() - return the previously saved
* node ["this node"] from which to free a huge page. Advance the
* next node id whether or not we find a free huge page to free so
* that the next attempt to free addresses the next node.
@@ -2201,9 +2201,8 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
* an additional call to free the page to low level allocators.
* Called with hugetlb_lock locked.
*/
-static struct page *remove_pool_huge_page(struct hstate *h,
- nodemask_t *nodes_allowed,
- bool acct_surplus)
+static struct folio *remove_pool_hugetlb_folio(struct hstate *h,
+ nodemask_t *nodes_allowed, bool acct_surplus)
{
int nr_nodes, node;
struct folio *folio = NULL;
@@ -2223,7 +2222,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
}
}

- return &folio->page;
+ return folio;
}

/*
@@ -2577,7 +2576,6 @@ static void return_unused_surplus_pages(struct hstate *h,
unsigned long unused_resv_pages)
{
unsigned long nr_pages;
- struct page *page;
LIST_HEAD(page_list);

lockdep_assert_held(&hugetlb_lock);
@@ -2598,15 +2596,17 @@ static void return_unused_surplus_pages(struct hstate *h,
* evenly across all nodes with memory. Iterate across these nodes
* until we can no longer free unreserved surplus pages. This occurs
* when the nodes with surplus pages have no free pages.
- * remove_pool_huge_page() will balance the freed pages across the
+ * remove_pool_hugetlb_folio() will balance the freed pages across the
* on-line nodes with memory and will handle the hstate accounting.
*/
while (nr_pages--) {
- page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1);
- if (!page)
+ struct folio *folio;
+
+ folio = remove_pool_hugetlb_folio(h, &node_states[N_MEMORY], 1);
+ if (!folio)
goto out;

- list_add(&page->lru, &page_list);
+ list_add(&folio->lru, &page_list);
}

out:
@@ -3401,7 +3401,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
unsigned long min_count, ret;
- struct page *page;
LIST_HEAD(page_list);
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);

@@ -3523,11 +3522,13 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
- page = remove_pool_huge_page(h, nodes_allowed, 0);
- if (!page)
+ struct folio *folio;
+
+ folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
+ if (!folio)
break;

- list_add(&page->lru, &page_list);
+ list_add(&folio->lru, &page_list);
}
/* free the pages after dropping lock */
spin_unlock_irq(&hugetlb_lock);
--
2.41.0

2023-09-16 09:50:53

by Mike Kravetz

Subject: [PATCH v3 12/12] hugetlb: batch TLB flushes when restoring vmemmap

Update the internal hugetlb restore vmemmap code path such that TLB
flushing can be batched. Use the existing mechanism of passing the
VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
performed for individual pages. The routine hugetlb_vmemmap_restore_folios
is the only user of this new mechanism, and it will perform a global
flush after all vmemmap is restored.
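
The pattern described above can be sketched in user space as follows. This
is a mock, not the kernel implementation: NO_TLB_FLUSH stands in for
VMEMMAP_REMAP_NO_TLB_FLUSH, mock_flush() stands in for a real TLB flush, and
an int array stands in for the list of folios. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_TLB_FLUSH 0x1UL	/* stand-in for VMEMMAP_REMAP_NO_TLB_FLUSH */

static unsigned long flush_count;	/* counts simulated TLB flushes */

static void mock_flush(void)
{
	flush_count++;
}

/* Internal helper: restore one item, flushing unless told not to. */
static int restore_one_flags(int *optimized, unsigned long flags)
{
	assert(optimized != NULL);
	*optimized = 0;
	if (!(flags & NO_TLB_FLUSH))
		mock_flush();
	return 0;
}

/* Single-item entry point keeps its old behavior by passing flags == 0. */
static int restore_one(int *optimized)
{
	return restore_one_flags(optimized, 0);
}

/* Bulk entry point: suppress per-item flushes, flush once at the end. */
static int restore_many(int *optimized, size_t n, size_t *restored)
{
	size_t i, num_restored = 0;
	int ret = 0, t_ret;

	for (i = 0; i < n; i++) {
		if (!optimized[i])
			continue;
		t_ret = restore_one_flags(&optimized[i], NO_TLB_FLUSH);
		if (t_ret)
			ret = t_ret;
		else
			num_restored++;
	}

	mock_flush();

	if (restored)
		*restored = num_restored;
	return ret;
}
```

In this mock, restoring three optimized items through restore_many() costs
one flush in total, while each call to restore_one() still costs one flush.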

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb_vmemmap.c | 39 ++++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 921f2fa7cf1b..0e9074a09afd 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -460,18 +460,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
+ * @flags: modify behavior for bulk operations
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
- unsigned long reuse)
+ unsigned long reuse, unsigned long flags)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
- .flags = 0,
+ .flags = flags,
};

/* See the comment in the vmemmap_remap_free(). */
@@ -493,17 +494,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);

-/**
- * hugetlb_vmemmap_restore - restore previously optimized (by
- * hugetlb_vmemmap_optimize()) vmemmap pages which
- * will be reallocated and remapped.
- * @h: struct hstate.
- * @head: the head page whose vmemmap pages will be restored.
- *
- * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
- * negative error code otherwise.
- */
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
{
int ret;
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -524,7 +515,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
* When a HugeTLB page is freed to the buddy allocator, previously
* discarded vmemmap pages must be allocated and remapping.
*/
- ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
+ ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
if (!ret) {
ClearHPageVmemmapOptimized(head);
static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -533,6 +524,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}

+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ * hugetlb_vmemmap_optimize()) vmemmap pages which
+ * will be reallocated and remapped.
+ * @h: struct hstate.
+ * @head: the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
+ */
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+{
+ return __hugetlb_vmemmap_restore(h, head, 0);
+}
+
/**
* hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
* @h: struct hstate.
@@ -557,7 +563,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
num_restored = 0;
list_for_each_entry(folio, folio_list, lru) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
- t_ret = hugetlb_vmemmap_restore(h, &folio->page);
+ t_ret = __hugetlb_vmemmap_restore(h, &folio->page,
+ VMEMMAP_REMAP_NO_TLB_FLUSH);
if (t_ret)
ret = t_ret;
else
@@ -565,6 +572,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
}
}

+ flush_tlb_all();
+
if (*restored)
*restored = num_restored;
return ret;
--
2.41.0

2023-09-16 21:13:39

by Mike Kravetz

Subject: [PATCH v3 03/12] hugetlb: Remove a few calls to page_folio()

From: "Matthew Wilcox (Oracle)" <[email protected]>

Anything found on a linked list threaded through ->lru is guaranteed to
be a folio as the compound_head found in a tail page overlaps the ->lru
member of struct page. So we can pull folios directly off these lists
no matter whether pages or folios were added to the list.
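
A minimal user-space sketch of pulling the containing structure directly off
a list threaded through an embedded member is shown below. It does not model
the compound_head overlap; struct mock_folio and the list helpers are
simplified stand-ins written for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Mock folio: only the member the list is threaded through matters. */
struct mock_folio {
	unsigned long flags;
	struct list_head lru;
	int id;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

/* Insert right after head, mirroring what the kernel's list_add() does. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Walk the list and recover each containing mock_folio from its lru. */
static int sum_ids(struct list_head *head)
{
	struct list_head *pos;
	int sum = 0;

	assert(head != NULL);
	for (pos = head->next; pos != head; pos = pos->next)
		sum += container_of(pos, struct mock_folio, lru)->id;
	return sum;
}
```

The walker never needs an intermediate page pointer: the list node address
alone is enough to recover the enclosing structure.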

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
---
mm/hugetlb.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c6f19cc6046..7bbdc71fb34d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1829,11 +1829,9 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,

static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
- struct page *page, *t_page;
- struct folio *folio;
+ struct folio *folio, *t_folio;

- list_for_each_entry_safe(page, t_page, list, lru) {
- folio = page_folio(page);
+ list_for_each_entry_safe(folio, t_folio, list, lru) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
}
@@ -2208,8 +2206,7 @@ static struct page *remove_pool_huge_page(struct hstate *h,
bool acct_surplus)
{
int nr_nodes, node;
- struct page *page = NULL;
- struct folio *folio;
+ struct folio *folio = NULL;

lockdep_assert_held(&hugetlb_lock);
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
@@ -2219,15 +2216,14 @@ static struct page *remove_pool_huge_page(struct hstate *h,
*/
if ((!acct_surplus || h->surplus_huge_pages_node[node]) &&
!list_empty(&h->hugepage_freelists[node])) {
- page = list_entry(h->hugepage_freelists[node].next,
- struct page, lru);
- folio = page_folio(page);
+ folio = list_entry(h->hugepage_freelists[node].next,
+ struct folio, lru);
remove_hugetlb_folio(h, folio, acct_surplus);
break;
}
}

- return page;
+ return &folio->page;
}

/*
@@ -3343,15 +3339,15 @@ static void try_to_free_low(struct hstate *h, unsigned long count,
* Collect pages to be freed on a list, and free after dropping lock
*/
for_each_node_mask(i, *nodes_allowed) {
- struct page *page, *next;
+ struct folio *folio, *next;
struct list_head *freel = &h->hugepage_freelists[i];
- list_for_each_entry_safe(page, next, freel, lru) {
+ list_for_each_entry_safe(folio, next, freel, lru) {
if (count >= h->nr_huge_pages)
goto out;
- if (PageHighMem(page))
+ if (folio_test_highmem(folio))
continue;
- remove_hugetlb_folio(h, page_folio(page), false);
- list_add(&page->lru, &page_list);
+ remove_hugetlb_folio(h, folio, false);
+ list_add(&folio->lru, &page_list);
}
}

--
2.41.0