LinuxLists.cc - [PATCH v6 0/8] Batch hugetlb vmemmap modification operations

2023-09-26 10:44:07

Subject: [PATCH v6 0/8] Batch hugetlb vmemmap modification operations

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower. Such overhead was deemed
an acceptable trade off for the memory savings obtained by freeing vmemmap
pages.

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x! Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s

In the VM environment, enabling hugetlb vmemmap optimization made
allocation about 60x slower (2m27.609s versus 2.463s).
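
As a cross-check, the slowdown factors can be recomputed from the time
output above. The following is a minimal C sketch; the helper names
to_seconds() and slowdown() are made up for illustration and are not part
of this series.

```c
#include <assert.h>

/* Convert a "XmY.YYYs" reading from time(1) into seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	assert(seconds >= 0.0 && seconds < 60.0);
	return minutes * 60.0 + seconds;
}

/* Slowdown factor of the optimized run relative to the baseline run. */
static double slowdown(double baseline_s, double optimized_s)
{
	assert(baseline_s > 0.0);
	return optimized_s / baseline_s;
}
```

For example, slowdown(to_seconds(0, 2.463), to_seconds(2, 27.609))
evaluates 147.609 / 2.463 for the VM allocation case, which is a little
under 60, and the bare metal allocation case (28.973 / 4.119) comes out
near 7.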

A quick profile showed that the vast majority of this overhead was due to
TLB flushing. Each time we modify the kernel pagetable we need to flush
the TLB. For each hugetlb page that is optimized, there could be up to
two TLB flushes performed: one for the vmemmap pages associated with the
hugetlb page, and potentially another one if the vmemmap pages are mapped
at the PMD level and must be split. The TLB flushes required for the kernel
pagetable result in a broadcast IPI with each CPU having to flush a range
of pages, or do a global flush if a threshold is exceeded. So, the flush
time increases with the number of CPUs. In addition, in virtual environments
the broadcast IPI can’t be accelerated by hypervisor hardware and leads to
traps that need to wakeup/IPI all vCPUs, which is very expensive. Because of
this, the slowdown in virtual environments is even worse than on bare metal
as the number of vCPUs/CPUs is increased.

The following series attempts to reduce the amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages. Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages. One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages. Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s

With batching, results are back in the 2-3x slowdown range.
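
To illustrate why batching helps, here is a simplified userspace model of
the flush counts described earlier. It is only a sketch: the function names
and the batch_size parameter are hypothetical, and it ignores the extra
flushes done on error paths (for example after an ENOMEM).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on TLB flushes when each hugetlb page is optimized on its
 * own: one flush for the remap, plus one more for every page whose
 * vmemmap was PMD-mapped and had to be split first.
 */
static size_t flushes_per_page(size_t nr_pages, size_t nr_pmd_splits)
{
	assert(nr_pmd_splits <= nr_pages);
	return nr_pages + nr_pmd_splits;
}

/*
 * Flushes when pages are processed in batches: one flush after the
 * split pass and one after the remap pass, for each batch.
 */
static size_t flushes_batched(size_t nr_pages, size_t batch_size)
{
	assert(batch_size > 0);
	size_t nr_batches = (nr_pages + batch_size - 1) / batch_size;

	return 2 * nr_batches;
}
```

With 524288 pages that all need a PMD split, the per-page model gives up to
1048576 flushes, while a single batch covering all pages gives 2.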

This series is based on mm-unstable (September 24)

Changes v5 -> v6:
- patch 4 in bulk_vmemmap_restore_error remove folio from list before
calling add_hugetlb_folio.
- Added Muchun RB for patches 2 and 3

Changes v4 -> v5:
- patch 3 comment style updated, unnecessary INIT_LIST_HEAD
- patch 4 updated hugetlb_vmemmap_restore_folios to pass back number of
restored folios in non-error case. In addition, routine passes back
list of folios with vmemmap. Naming more consistent.
- patch 5 removed over optimization and added Muchun RB
- patch 6 break and early return in ENOMEM case. Updated comments.
Added Muchun RB.
- patch 7 Updated comments about splitting failure. Added Muchun RB.
- patch 8 Made comments consistent.

Changes v3 -> v4:
- Rebased on mm-unstable and dropped requisite patches.
- patch 2 updated to take bootmem vmemmap initialization into account
- patch 3 more changes for bootmem hugetlb pages. added routine
prep_and_add_bootmem_folios.
- patch 5 in hugetlb_vmemmap_optimize_folios on ENOMEM check for
list_empty before freeing and retry. This is more important in
subsequent patch where we flush_tlb_all after ENOMEM.

Changes v2 -> v3:
- patch 5 was part of an earlier series that was not picked up. It is
included here as it helps with batching optimizations.
- patch 6 hugetlb_vmemmap_restore_folios is changed from type void to
returning an error code as well as an additional output parameter providing
the number of folios for which vmemmap was actually restored. The caller can
then be more intelligent about processing the list.
- patch 9 eliminate local list in vmemmap_restore_pte. The routine
hugetlb_vmemmap_optimize_folios checks for ENOMEM and frees accumulated
vmemmap pages while processing the list.
- patch 10 introduce flags field to struct vmemmap_remap_walk and
VMEMMAP_SPLIT_NO_TLB_FLUSH for not flushing during pass to split PMDs.
- patch 11 rename flag VMEMMAP_REMAP_NO_TLB_FLUSH and pass in from callers.

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
pages with hugetlb flag set can be passed to vmemmap routines. This
involved separating the 'prep' of hugetlb pages even further. The code
dealing with bootmem allocations was also modified so that batching is
possible. Adding a 'batch' of hugetlb pages to their respective free
lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
(Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
is being performed (Muchun). Removed unused variable and
hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
allow for future expansion (Muchun). Single flag VMEMMAP_NO_TLB_FLUSH.
Provide detailed comment about the need to keep old and new vmemmap pages
in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).

Joao Martins (2):
hugetlb: batch PMD split for bulk vmemmap dedup
hugetlb: batch TLB flushes when freeing vmemmap

Mike Kravetz (6):
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
hugetlb: restructure pool allocations
hugetlb: perform vmemmap optimization on a list of pages
hugetlb: perform vmemmap restoration on a list of pages
hugetlb: batch freeing of vmemmap pages
hugetlb: batch TLB flushes when restoring vmemmap

mm/hugetlb.c | 301 ++++++++++++++++++++++++++++++++++++-------
mm/hugetlb_vmemmap.c | 273 +++++++++++++++++++++++++++++++++------
mm/hugetlb_vmemmap.h | 15 +++
3 files changed, 506 insertions(+), 83 deletions(-)

--
2.41.0

2023-09-26 10:44:13

by Mike Kravetz

Subject: [PATCH v6 1/8] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles

update_and_free_pages_bulk is designed to free a list of hugetlb pages
back to their associated lower level allocators. This may require
allocating vmemmap pages associated with each hugetlb page. The
hugetlb page destructor must be changed before pages are freed to lower
level allocators. However, the destructor must be changed under the
hugetlb lock. This means there is potentially one lock cycle per page.

Minimize the number of lock cycles in update_and_free_pages_bulk by:
1) allocating necessary vmemmap for all hugetlb pages on the list
2) taking the hugetlb lock and clearing the destructor for all pages on
   the list
3) freeing all pages on the list back to low level allocators

Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index de220e3ff8be..47159b9de633 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1837,7 +1837,46 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
struct folio *folio, *t_folio;
+ bool clear_dtor = false;

+ /*
+ * First allocate required vmemmap (if necessary) for all folios on
+ * list. If vmemmap can not be allocated, we can not free folio to
+ * lower level allocator, so add back as hugetlb surplus page.
+ * add_hugetlb_folio() removes the page from THIS list.
+ * Use clear_dtor to note if vmemmap was successfully allocated for
+ * ANY page on the list.
+ */
+ list_for_each_entry_safe(folio, t_folio, list, lru) {
+ if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ if (hugetlb_vmemmap_restore(h, &folio->page)) {
+ spin_lock_irq(&hugetlb_lock);
+ add_hugetlb_folio(h, folio, true);
+ spin_unlock_irq(&hugetlb_lock);
+ } else
+ clear_dtor = true;
+ }
+ }
+
+ /*
+ * If vmemmap allocation was performed on any folio above, take lock
+ * to clear destructor of all folios on list. This avoids the need to
+ * lock/unlock for each individual folio.
+ * The assumption is vmemmap allocation was performed on all or none
+ * of the folios on the list. This is true except in VERY rare cases.
+ */
+ if (clear_dtor) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry(folio, list, lru)
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
+
+ /*
+ * Free folios back to low level allocators. vmemmap and destructors
+ * were taken care of above, so update_and_free_hugetlb_folio will
+ * not need to take hugetlb lock.
+ */
list_for_each_entry_safe(folio, t_folio, list, lru) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
--
2.41.0

2023-09-26 13:21:52

by Mike Kravetz

Subject: [PATCH v6 7/8] hugetlb: batch TLB flushes when freeing vmemmap

From: Joao Martins <[email protected]>

Now that a list of pages is deduplicated at once, the TLB
flush can be batched for all vmemmap pages that got remapped.

Expand the flags field value to pass whether to skip the TLB flush
on remap of the PTE.

The TLB flush is global because the caller does not guarantee that the
set of folios is contiguous, and composing a list of kVAs to flush
would add complexity.

Modified by Mike Kravetz to perform TLB flush on single folio if an
error is encountered.

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
---
mm/hugetlb_vmemmap.c | 49 ++++++++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 10739e4285d5..9df350372046 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -40,6 +40,8 @@ struct vmemmap_remap_walk {

/* Skip the TLB flush when we split the PMD */
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
+/* Skip the TLB flush when we remap the PTE */
+#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
unsigned long flags;
};

@@ -214,7 +216,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
return ret;
} while (pgd++, addr = next, addr != end);

- if (walk->remap_pte)
+ if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
flush_tlb_kernel_range(start, end);

return 0;
@@ -355,19 +357,21 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
* @reuse: reuse address.
* @vmemmap_pages: list to deposit vmemmap pages to be freed. It is callers
* responsibility to free pages.
+ * @flags: modifications to vmemmap_remap_walk flags
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_free(unsigned long start, unsigned long end,
unsigned long reuse,
- struct list_head *vmemmap_pages)
+ struct list_head *vmemmap_pages,
+ unsigned long flags)
{
int ret;
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
.vmemmap_pages = vmemmap_pages,
- .flags = 0,
+ .flags = flags,
};
int nid = page_to_nid((struct page *)reuse);
gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -629,7 +633,8 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h

static int __hugetlb_vmemmap_optimize(const struct hstate *h,
struct page *head,
- struct list_head *vmemmap_pages)
+ struct list_head *vmemmap_pages,
+ unsigned long flags)
{
int ret = 0;
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -640,6 +645,18 @@ static int __hugetlb_vmemmap_optimize(const struct hstate *h,
return ret;

static_branch_inc(&hugetlb_optimize_vmemmap_key);
+ /*
+ * Very Subtle
+ * If VMEMMAP_REMAP_NO_TLB_FLUSH is set, TLB flushing is not performed
+ * immediately after remapping. As a result, subsequent accesses
+ * and modifications to struct pages associated with the hugetlb
+ * page could be to the OLD struct pages. Set the vmemmap optimized
+ * flag here so that it is copied to the new head page. This keeps
+ * the old and new struct pages in sync.
+ * If there is an error during optimization, we will immediately FLUSH
+ * the TLB and clear the flag below.
+ */
+ SetHPageVmemmapOptimized(head);

vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
vmemmap_reuse = vmemmap_start;
@@ -651,11 +668,12 @@ static int __hugetlb_vmemmap_optimize(const struct hstate *h,
* mapping the range to vmemmap_pages list so that they can be freed by
* the caller.
*/
- ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages);
- if (ret)
+ ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse,
+ vmemmap_pages, flags);
+ if (ret) {
static_branch_dec(&hugetlb_optimize_vmemmap_key);
- else
- SetHPageVmemmapOptimized(head);
+ ClearHPageVmemmapOptimized(head);
+ }

return ret;
}
@@ -674,7 +692,7 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
LIST_HEAD(vmemmap_pages);

- __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+ __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages, 0);
free_vmemmap_page_list(&vmemmap_pages);
}

@@ -719,19 +737,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l

list_for_each_entry(folio, folio_list, lru) {
int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
- &vmemmap_pages);
+ &vmemmap_pages,
+ VMEMMAP_REMAP_NO_TLB_FLUSH);

/*
* Pages to be freed may have been accumulated. If we
* encounter an ENOMEM, free what we have and try again.
+ * This can occur in the case that both splitting fails
+ * halfway and head page allocation also failed. In this
+ * case __hugetlb_vmemmap_optimize() would free memory
+ * allowing more vmemmap remaps to occur.
*/
if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
+ flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
INIT_LIST_HEAD(&vmemmap_pages);
- __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+ __hugetlb_vmemmap_optimize(h, &folio->page,
+ &vmemmap_pages,
+ VMEMMAP_REMAP_NO_TLB_FLUSH);
}
}

+ flush_tlb_all();
free_vmemmap_page_list(&vmemmap_pages);
}

--
2.41.0