2023-09-19 02:15:40

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 0/8] Batch hugetlb vmemmap modification operations

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower. Such overhead was deemed
an acceptable trade off for the memory savings obtained by freeing vmemmap
pages.

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x! Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s

In the VM environment, the slowdown of enabling hugetlb vmemmap optimization
resulted in allocation times being 61x slower.

A quick profile showed that the vast majority of this overhead was due to
TLB flushing. Each time we modify the kernel pagetable we need to flush
the TLB. For each hugetlb that is optimized, there could be potentially
two TLB flushes performed. One for the vmemmap pages associated with the
hugetlb page, and potentially another one if the vmemmap pages are mapped
at the PMD level and must be split. The TLB flushes required for the kernel
pagetable, result in a broadcast IPI with each CPU having to flush a range
of pages, or do a global flush if a threshold is exceeded. So, the flush
time increases with the number of CPUs. In addition, in virtual environments
the broadcast IPI can’t be accelerated by hypervisor hardware and leads to
traps that need to wakeup/IPI all vCPUs which is very expensive. Because of
this the slowdown in virtual environments is even worse than bare metal as
the number of vCPUS/CPUs is increased.

The following series attempts to reduce amount of time spent in TLB flushing.
The idea is to batch the vmemmap modification operations for multiple hugetlb
pages. Instead of doing one or two TLB flushes for each page, we do two TLB
flushes for each batch of pages. One flush after splitting pages mapped at
the PMD level, and another after remapping vmemmap associated with all
hugetlb pages. Results of such batching are as follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s

With batching, results are back in the 2-3x slowdown range.

This series is based on mm-unstable (September 18)

Changes v3 -> v4:
- Rebased on mm-unstable and dropped requisite patches.
- patch 2 updated to take bootmem vmemmap initialization into account
- patch 3 more changes for bootmem hugetlb pages. added routine
prep_and_add_bootmem_folios.
- patch 5 in hugetlb_vmemmap_optimize_folios on ENOMEM check for
list_empty before freeing and retry. This is more important in
subsequent patch where we flush_tlb_all after ENOMEM.

Changes v2 -> v3:
- patch 5 was part of an earlier series that was not picked up. It is
included here as it helps with batching optimizations.
- patch 6 hugetlb_vmemmap_restore_folios is changed from type void to
returning an error code as well as an additional output parameter providing
the number folios for which vmemmap was actually restored. The caller can
then be more intelligent about processing the list.
- patch 9 eliminate local list in vmemmap_restore_pte. The routine
hugetlb_vmemmap_optimize_folios checks for ENOMEM and frees accumulated
vmemmap pages while processing the list.
- patch 10 introduce flags field to struct vmemmap_remap_walk and
VMEMMAP_SPLIT_NO_TLB_FLUSH for not flushing during pass to split PMDs.
- patch 11 rename flag VMEMMAP_REMAP_NO_TLB_FLUSH and pass in from callers.

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
pages with hugetlb flag set can be passed to vmemmmap routines. This
involved separating the 'prep' of hugetlb pages even further. The code
dealing with bootmem allocations was also modified so that batching is
possible. Adding a 'batch' of hugetlb pages to their respective free
lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
(Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
is being performed (Muchun). Removed unused variable and
hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
allow for future expansion (Muchun). Single flag VMEMMAP_NO_TLB_FLUSH.
Provide detailed comment about the need to keep old and new vmemmap pages
in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).

Joao Martins (2):
hugetlb: batch PMD split for bulk vmemmap dedup
hugetlb: batch TLB flushes when freeing vmemmap

Mike Kravetz (6):
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
hugetlb: restructure pool allocations
hugetlb: perform vmemmap optimization on a list of pages
hugetlb: perform vmemmap restoration on a list of pages
hugetlb: batch freeing of vmemmap pages
hugetlb: batch TLB flushes when restoring vmemmap

mm/hugetlb.c | 266 ++++++++++++++++++++++++++++++++++++-------
mm/hugetlb_vmemmap.c | 255 +++++++++++++++++++++++++++++++++++------
mm/hugetlb_vmemmap.h | 16 +++
3 files changed, 457 insertions(+), 80 deletions(-)

--
2.41.0


2023-09-19 03:20:06

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 5/8] hugetlb: batch freeing of vmemmap pages

Now that batching of hugetlb vmemmap optimization processing is possible,
batch the freeing of vmemmap pages. When freeing vmemmap pages for a
hugetlb page, we add them to a list that is freed after the entire batch
has been processed.

This enhances the ability to return contiguous ranges of memory to the
low level allocators.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb_vmemmap.c | 85 ++++++++++++++++++++++++++++++--------------
1 file changed, 59 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 463a4037ec6e..147ed15bcae4 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -222,6 +222,9 @@ static void free_vmemmap_page_list(struct list_head *list)
{
struct page *page, *next;

+ if (list_empty(list))
+ return;
+
list_for_each_entry_safe(page, next, list, lru)
free_vmemmap_page(page);
}
@@ -251,7 +254,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
}

entry = mk_pte(walk->reuse_page, pgprot);
- list_add_tail(&page->lru, walk->vmemmap_pages);
+ list_add(&page->lru, walk->vmemmap_pages);
set_pte_at(&init_mm, addr, pte, entry);
}

@@ -306,18 +309,20 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
+ * @vmemmap_pages: list to deposit vmemmap pages to be freed. It is callers
+ * responsibility to free pages.
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_free(unsigned long start, unsigned long end,
- unsigned long reuse)
+ unsigned long reuse,
+ struct list_head *vmemmap_pages)
{
int ret;
- LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
- .vmemmap_pages = &vmemmap_pages,
+ .vmemmap_pages = vmemmap_pages,
};
int nid = page_to_nid((struct page *)reuse);
gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -334,7 +339,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
if (walk.reuse_page) {
copy_page(page_to_virt(walk.reuse_page),
(void *)walk.reuse_addr);
- list_add(&walk.reuse_page->lru, &vmemmap_pages);
+ list_add(&walk.reuse_page->lru, vmemmap_pages);
}

/*
@@ -365,15 +370,13 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
walk = (struct vmemmap_remap_walk) {
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
- .vmemmap_pages = &vmemmap_pages,
+ .vmemmap_pages = vmemmap_pages,
};

vmemmap_remap_range(reuse, end, &walk);
}
mmap_read_unlock(&init_mm);

- free_vmemmap_page_list(&vmemmap_pages);
-
return ret;
}

@@ -389,7 +392,7 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
page = alloc_pages_node(nid, gfp_mask, 0);
if (!page)
goto out;
- list_add_tail(&page->lru, list);
+ list_add(&page->lru, list);
}

return 0;
@@ -576,24 +579,17 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
return true;
}

-/**
- * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
- * @h: struct hstate.
- * @head: the head page whose vmemmap pages will be optimized.
- *
- * This function only tries to optimize @head's vmemmap pages and does not
- * guarantee that the optimization will succeed after it returns. The caller
- * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
- * have been optimized.
- */
-void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_optimize(const struct hstate *h,
+ struct page *head,
+ struct list_head *vmemmap_pages)
{
+ int ret = 0;
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
unsigned long vmemmap_reuse;

VM_WARN_ON_ONCE(!PageHuge(head));
if (!vmemmap_should_optimize(h, head))
- return;
+ return ret;

static_branch_inc(&hugetlb_optimize_vmemmap_key);

@@ -603,21 +599,58 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)

/*
* Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
- * to the page which @vmemmap_reuse is mapped to, then free the pages
- * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
+ * to the page which @vmemmap_reuse is mapped to. Add pages previously
+ * mapping the range to vmemmap_pages list so that they can be freed by
+ * the caller.
*/
- if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
+ ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages);
+ if (ret)
static_branch_dec(&hugetlb_optimize_vmemmap_key);
else
SetHPageVmemmapOptimized(head);
+
+ return ret;
+}
+
+/**
+ * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
+ * @h: struct hstate.
+ * @head: the head page whose vmemmap pages will be optimized.
+ *
+ * This function only tries to optimize @head's vmemmap pages and does not
+ * guarantee that the optimization will succeed after it returns. The caller
+ * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
+ * have been optimized.
+ */
+void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
+{
+ LIST_HEAD(vmemmap_pages);
+
+ __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
+ free_vmemmap_page_list(&vmemmap_pages);
}

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
+ LIST_HEAD(vmemmap_pages);

- list_for_each_entry(folio, folio_list, lru)
- hugetlb_vmemmap_optimize(h, &folio->page);
+ list_for_each_entry(folio, folio_list, lru) {
+ int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
+ &vmemmap_pages);
+
+ /*
+ * Pages to be freed may have been accumulated. If we
+ * encounter an ENOMEM, free what we have and try again.
+ */
+ if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
+ free_vmemmap_page_list(&vmemmap_pages);
+ INIT_LIST_HEAD(&vmemmap_pages);
+ __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
+ }
+ }
+
+ free_vmemmap_page_list(&vmemmap_pages);
}

static struct ctl_table hugetlb_vmemmap_sysctls[] = {
--
2.41.0

2023-09-19 04:24:52

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 2/8] hugetlb: restructure pool allocations

Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page. This routine will allocate contiguous pages from
a low level allocator, prep the pages for usage as a hugetlb page and
then add the resulting hugetlb page to the pool.

In the 'prep' stage, optional vmemmap optimization is done. For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once. To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.

The code to allocate hugetlb pages from bootmem was also modified to
allow batching.

No functional changes, only code restructure.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 183 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 144 insertions(+), 39 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1ca0377c62ab..8624286be273 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1970,16 +1970,21 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
h->nr_huge_pages_node[nid]++;
}

-static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
{
folio_set_hugetlb(folio);
- hugetlb_vmemmap_optimize(h, &folio->page);
INIT_LIST_HEAD(&folio->lru);
hugetlb_set_folio_subpool(folio, NULL);
set_hugetlb_cgroup(folio, NULL);
set_hugetlb_cgroup_rsvd(folio, NULL);
}

+static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+{
+ init_new_hugetlb_folio(h, folio);
+ hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid)
{
__prep_new_hugetlb_folio(h, folio);
@@ -2190,16 +2195,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
return page_folio(page);
}

-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen': ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
bool retry = false;
@@ -2212,6 +2210,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
nid, nmask, node_alloc_noretry);
if (!folio)
return NULL;
+
if (hstate_is_gigantic(h)) {
if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
/*
@@ -2226,32 +2225,84 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
return NULL;
}
}
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));

return folio;
}

+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
+{
+ struct folio *folio;
+
+ folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+ if (folio)
+ init_new_hugetlb_folio(h, folio);
+ return folio;
+}
+
/*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen': ref count of head page and all tail
+ * pages is zero.
*/
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
- int nr_nodes, node;
+
+ folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+ if (!folio)
+ return NULL;
+
+ prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+ return folio;
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+ struct list_head *folio_list)
+{
+ struct folio *folio, *tmp_f;
+
+ /*
+ * Add all new pool pages to free lists in one lock cycle
+ */
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+ __prep_account_new_huge_page(h, folio_nid(folio));
+ enqueue_hugetlb_folio(h, folio);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+
+ INIT_LIST_HEAD(folio_list);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner. The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+ nodemask_t *nodes_allowed,
+ nodemask_t *node_alloc_noretry)
+{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+ int nr_nodes, node;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+ struct folio *folio;
+
+ folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
nodes_allowed, node_alloc_noretry);
- if (folio) {
- free_huge_folio(folio); /* free it into the hugepage allocator */
- return 1;
- }
+ if (folio)
+ return folio;
}

- return 0;
+ return NULL;
}

/*
@@ -3264,25 +3315,35 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
*/
static void __init gather_bootmem_prealloc(void)
{
+ LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
+ struct hstate *h, *prev_h = NULL;

list_for_each_entry(m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
- struct hstate *h = m->hstate;
+
+ h = m->hstate;
+ /*
+ * It is possible to have multiple huge page sizes (hstates)
+ * in this list. If so, process each size separately.
+ */
+ if (h != prev_h && prev_h != NULL)
+ prep_and_add_allocated_folios(prev_h, &folio_list);
+ prev_h = h;

VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);

hugetlb_folio_init_vmemmap(folio, h,
HUGETLB_VMEMMAP_RESERVE_PAGES);
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+ __prep_new_hugetlb_folio(h, folio);
/* If HVO fails, initialize all tail struct pages */
if (!HPageVmemmapOptimized(&folio->page))
hugetlb_folio_init_tail_vmemmap(folio,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
- free_huge_folio(folio); /* add to the hugepage allocator */
+ list_add(&folio->lru, &folio_list);

/*
* We need to restore the 'stolen' pages to totalram_pages
@@ -3292,6 +3353,8 @@ static void __init gather_bootmem_prealloc(void)
adjust_managed_page_count(page, pages_per_huge_page(h));
cond_resched();
}
+
+ prep_and_add_allocated_folios(h, &folio_list);
}

static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3325,9 +3388,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}

+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ * pages are allocated from memblock allocated or something similar.
+ * Gigantic pages are actually added to pools later with the routine
+ * gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ * all of mm is up and functional. Pages are allocated from buddy and
+ * then added to hugetlb pools.
+ */
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
bool node_specific_alloc = false;

@@ -3369,14 +3445,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)

for (i = 0; i < h->max_huge_pages; ++i) {
if (hstate_is_gigantic(h)) {
+ /*
+ * gigantic pages not added to list as they are not
+ * added to pools now.
+ */
if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
break;
- } else if (!alloc_pool_huge_page(h,
- &node_states[N_MEMORY],
- node_alloc_noretry))
- break;
+ } else {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ }
cond_resched();
}
+
+ /* list will be empty if hstate_is_gigantic */
+ prep_and_add_allocated_folios(h, &folio_list);
+
if (i < h->max_huge_pages) {
char buf[32];

@@ -3510,7 +3597,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
- unsigned long min_count, ret;
+ unsigned long min_count;
+ unsigned long allocated;
+ struct folio *folio;
LIST_HEAD(page_list);
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);

@@ -3587,7 +3676,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
break;
}

- while (count > persistent_huge_pages(h)) {
+ allocated = 0;
+ while (count > (persistent_huge_pages(h) + allocated)) {
/*
* If this allocation races such that we no longer need the
* page, free_huge_folio will handle it by freeing the page
@@ -3598,15 +3688,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
/* yield cpu to avoid soft lockup */
cond_resched();

- ret = alloc_pool_huge_page(h, nodes_allowed,
+ folio = alloc_pool_huge_folio(h, nodes_allowed,
node_alloc_noretry);
- spin_lock_irq(&hugetlb_lock);
- if (!ret)
+ if (!folio) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ list_add(&folio->lru, &page_list);
+ allocated++;

/* Bail for signals. Probably ctrl-c from user */
- if (signal_pending(current))
+ if (signal_pending(current)) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ spin_lock_irq(&hugetlb_lock);
+ }
+
+ /* Add allocated pages to the pool */
+ if (!list_empty(&page_list)) {
+ spin_unlock_irq(&hugetlb_lock);
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
}

/*
@@ -3632,8 +3739,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
- struct folio *folio;
-
folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
if (!folio)
break;
--
2.41.0

2023-09-19 06:30:40

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 8/8] hugetlb: batch TLB flushes when restoring vmemmap

Update the internal hugetlb restore vmemmap code path such that TLB
flushing can be batched. Use the existing mechanism of passing the
VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
performed for individual pages. The routine hugetlb_vmemmap_restore_folios
is the only user of this new mechanism, and it will perform a global
flush after all vmemmap is restored.

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb_vmemmap.c | 39 ++++++++++++++++++++++++---------------
1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a6c356acb1fc..ae2229f19158 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -460,18 +460,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
* @end: end address of the vmemmap virtual address range that we want to
* remap.
* @reuse: reuse address.
+ * @flags: modify behavior for bulk operations
*
* Return: %0 on success, negative error code otherwise.
*/
static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
- unsigned long reuse)
+ unsigned long reuse, unsigned long flags)
{
LIST_HEAD(vmemmap_pages);
struct vmemmap_remap_walk walk = {
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
- .flags = 0,
+ .flags = flags,
};

/* See the comment in the vmemmap_remap_free(). */
@@ -493,17 +494,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);

-/**
- * hugetlb_vmemmap_restore - restore previously optimized (by
- * hugetlb_vmemmap_optimize()) vmemmap pages which
- * will be reallocated and remapped.
- * @h: struct hstate.
- * @head: the head page whose vmemmap pages will be restored.
- *
- * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
- * negative error code otherwise.
- */
-int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
{
int ret;
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
@@ -524,7 +515,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
* When a HugeTLB page is freed to the buddy allocator, previously
* discarded vmemmap pages must be allocated and remapping.
*/
- ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
+ ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
if (!ret) {
ClearHPageVmemmapOptimized(head);
static_branch_dec(&hugetlb_optimize_vmemmap_key);
@@ -533,6 +524,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}

+/**
+ * hugetlb_vmemmap_restore - restore previously optimized (by
+ * hugetlb_vmemmap_optimize()) vmemmap pages which
+ * will be reallocated and remapped.
+ * @h: struct hstate.
+ * @head: the head page whose vmemmap pages will be restored.
+ *
+ * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
+ * negative error code otherwise.
+ */
+int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
+{
+ return __hugetlb_vmemmap_restore(h, head, 0);
+}
+
/**
* hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
* @h: struct hstate.
@@ -557,7 +563,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
num_restored = 0;
list_for_each_entry(folio, folio_list, lru) {
if (folio_test_hugetlb_vmemmap_optimized(folio)) {
- t_ret = hugetlb_vmemmap_restore(h, &folio->page);
+ t_ret = __hugetlb_vmemmap_restore(h, &folio->page,
+ VMEMMAP_REMAP_NO_TLB_FLUSH);
if (t_ret)
ret = t_ret;
else
@@ -565,6 +572,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
}
}

+ flush_tlb_all();
+
if (*restored)
*restored = num_restored;
return ret;
--
2.41.0

2023-09-19 07:40:28

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup

From: Joao Martins <[email protected]>

In an effort to minimize amount of TLB flushes, batch all PMD splits
belonging to a range of pages in order to perform only 1 (global) TLB
flush.

Add a flags field to the walker and pass whether it's a bulk allocation
or just a single page to decide to remap. First value
(VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
flush when we split the PMD.

Rebased and updated by Mike Kravetz

Signed-off-by: Joao Martins <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 75 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 147ed15bcae4..e8bc2f7567db 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -27,6 +27,7 @@
* @reuse_addr: the virtual address of the @reuse_page page.
* @vmemmap_pages: the list head of the vmemmap pages that can be freed
* or is mapped from.
+ * @flags: used to modify behavior in bulk operations
*/
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -35,9 +36,11 @@ struct vmemmap_remap_walk {
struct page *reuse_page;
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
+#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
+ unsigned long flags;
};

-static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
{
pmd_t __pmd;
int i;
@@ -80,7 +83,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
pmd_populate_kernel(&init_mm, pmd, pgtable);
- flush_tlb_kernel_range(start, start + PMD_SIZE);
+ if (flush)
+ flush_tlb_kernel_range(start, start + PMD_SIZE);
} else {
pte_free_kernel(&init_mm, pgtable);
}
@@ -127,11 +131,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
do {
int ret;

- ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
+ ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
+ walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
if (ret)
return ret;

next = pmd_addr_end(addr, end);
+
+ /*
+ * We are only splitting, not remapping the hugetlb vmemmap
+ * pages.
+ */
+ if (!walk->remap_pte)
+ continue;
+
vmemmap_pte_range(pmd, addr, next, walk);
} while (pmd++, addr = next, addr != end);

@@ -198,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
return ret;
} while (pgd++, addr = next, addr != end);

- flush_tlb_kernel_range(start, end);
+ if (walk->remap_pte)
+ flush_tlb_kernel_range(start, end);

return 0;
}
@@ -300,6 +314,36 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
}

+/**
+ * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
+ * backing PMDs of the directmap into PTEs
+ * @start: start address of the vmemmap virtual address range that we want
+ * to remap.
+ * @end: end address of the vmemmap virtual address range that we want to
+ * remap.
+ * @reuse: reuse address.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+static int vmemmap_remap_split(unsigned long start, unsigned long end,
+ unsigned long reuse)
+{
+ int ret;
+ struct vmemmap_remap_walk walk = {
+ .remap_pte = NULL,
+ .flags = VMEMMAP_SPLIT_NO_TLB_FLUSH,
+ };
+
+ /* See the comment in the vmemmap_remap_free(). */
+ BUG_ON(start - reuse != PAGE_SIZE);
+
+ mmap_read_lock(&init_mm);
+ ret = vmemmap_remap_range(reuse, end, &walk);
+ mmap_read_unlock(&init_mm);
+
+ return ret;
+}
+
/**
* vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
* to the page which @reuse is mapped to, then free vmemmap
@@ -323,6 +367,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
.remap_pte = vmemmap_remap_pte,
.reuse_addr = reuse,
.vmemmap_pages = vmemmap_pages,
+ .flags = 0,
};
int nid = page_to_nid((struct page *)reuse);
gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
@@ -371,6 +416,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = vmemmap_pages,
+ .flags = 0,
};

vmemmap_remap_range(reuse, end, &walk);
@@ -422,6 +468,7 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
.remap_pte = vmemmap_restore_pte,
.reuse_addr = reuse,
.vmemmap_pages = &vmemmap_pages,
+ .flags = 0,
};

/* See the comment in the vmemmap_remap_free(). */
@@ -630,11 +677,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
free_vmemmap_page_list(&vmemmap_pages);
}

+static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
+{
+ unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
+ unsigned long vmemmap_reuse;
+
+ if (!vmemmap_should_optimize(h, head))
+ return;
+
+ vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
+ vmemmap_reuse = vmemmap_start;
+ vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
+
+ /*
+ * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
+ * @vmemmap_end]
+ */
+ vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+}
+
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
LIST_HEAD(vmemmap_pages);

+ list_for_each_entry(folio, folio_list, lru)
+ hugetlb_vmemmap_split(h, &folio->page);
+
+ flush_tlb_all();
+
list_for_each_entry(folio, folio_list, lru) {
int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
&vmemmap_pages);
--
2.41.0

2023-09-19 08:01:34

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup



On 2023/9/19 07:01, Mike Kravetz wrote:
> From: Joao Martins <[email protected]>
>
> In an effort to minimize amount of TLB flushes, batch all PMD splits
> belonging to a range of pages in order to perform only 1 (global) TLB
> flush.
>
> Add a flags field to the walker and pass whether it's a bulk allocation
> or just a single page to decide to remap. First value
> (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
> flush when we split the PMD.
>
> Rebased and updated by Mike Kravetz
>
> Signed-off-by: Joao Martins <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 75 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 147ed15bcae4..e8bc2f7567db 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -27,6 +27,7 @@
> * @reuse_addr: the virtual address of the @reuse_page page.
> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> * or is mapped from.
> + * @flags: used to modify behavior in bulk operations

Better to describe it as "used to modify behavior in vmemmap page table
walking operations"

> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -35,9 +36,11 @@ struct vmemmap_remap_walk {
> struct page *reuse_page;
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> +#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
> + unsigned long flags;
> };
>
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool flush)
> {
> pmd_t __pmd;
> int i;
> @@ -80,7 +83,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> /* Make pte visible before pmd. See comment in pmd_install(). */
> smp_wmb();
> pmd_populate_kernel(&init_mm, pmd, pgtable);
> - flush_tlb_kernel_range(start, start + PMD_SIZE);
> + if (flush)
> + flush_tlb_kernel_range(start, start + PMD_SIZE);
> } else {
> pte_free_kernel(&init_mm, pgtable);
> }
> @@ -127,11 +131,20 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> do {
> int ret;
>
> - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
> + walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
> if (ret)
> return ret;
>
> next = pmd_addr_end(addr, end);
> +
> + /*
> + * We are only splitting, not remapping the hugetlb vmemmap
> + * pages.
> + */
> + if (!walk->remap_pte)
> + continue;
> +
> vmemmap_pte_range(pmd, addr, next, walk);
> } while (pmd++, addr = next, addr != end);
>
> @@ -198,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> return ret;
> } while (pgd++, addr = next, addr != end);
>
> - flush_tlb_kernel_range(start, end);
> + if (walk->remap_pte)
> + flush_tlb_kernel_range(start, end);
>
> return 0;
> }
> @@ -300,6 +314,36 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> }
>
> +/**
> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> + * backing PMDs of the directmap into PTEs
> + * @start: start address of the vmemmap virtual address range that we want
> + * to remap.
> + * @end: end address of the vmemmap virtual address range that we want to
> + * remap.
> + * @reuse: reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + int ret;
> + struct vmemmap_remap_walk walk = {
> + .remap_pte = NULL,
> + .flags = VMEMMAP_SPLIT_NO_TLB_FLUSH,
> + };
> +
> + /* See the comment in the vmemmap_remap_free(). */
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + mmap_read_lock(&init_mm);
> + ret = vmemmap_remap_range(reuse, end, &walk);
> + mmap_read_unlock(&init_mm);
> +
> + return ret;
> +}
> +
> /**
> * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> * to the page which @reuse is mapped to, then free vmemmap
> @@ -323,6 +367,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_remap_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = vmemmap_pages,
> + .flags = 0,
> };
> int nid = page_to_nid((struct page *)reuse);
> gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> @@ -371,6 +416,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = vmemmap_pages,
> + .flags = 0,
> };
>
> vmemmap_remap_range(reuse, end, &walk);
> @@ -422,6 +468,7 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> + .flags = 0,
> };
>
> /* See the comment in the vmemmap_remap_free(). */
> @@ -630,11 +677,35 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> free_vmemmap_page_list(&vmemmap_pages);
> }
>
> +static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> + unsigned long vmemmap_reuse;
> +
> + if (!vmemmap_should_optimize(h, head))
> + return;
> +
> + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> + vmemmap_reuse = vmemmap_start;
> + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> +
> + /*
> + * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> + * @vmemmap_end]
> + */
> + vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> +}
> +
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
> LIST_HEAD(vmemmap_pages);
>
> + list_for_each_entry(folio, folio_list, lru)
> + hugetlb_vmemmap_split(h, &folio->page);
> +
> + flush_tlb_all();
> +
> list_for_each_entry(folio, folio_list, lru) {
> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
> &vmemmap_pages);

This is unlikely to be failed since the page table allocation
is moved to the above (Note that the head vmemmap page allocation
is not mandatory). So we should handle the error case in the above
splitting operation.

Thanks.


2023-09-19 08:04:44

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] hugetlb: batch TLB flushes when restoring vmemmap



On 2023/9/19 07:02, Mike Kravetz wrote:
> Update the internal hugetlb restore vmemmap code path such that TLB
> flushing can be batched. Use the existing mechanism of passing the
> VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
> performed for individual pages. The routine hugetlb_vmemmap_restore_folios
> is the only user of this new mechanism, and it will perform a global
> flush after all vmemmap is restored.
>
> Signed-off-by: Joao Martins <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb_vmemmap.c | 39 ++++++++++++++++++++++++---------------
> 1 file changed, 24 insertions(+), 15 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index a6c356acb1fc..ae2229f19158 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -460,18 +460,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> * @end: end address of the vmemmap virtual address range that we want to
> * remap.
> * @reuse: reuse address.
> + * @flags: modify behavior for bulk operations

Please keep the comment consistent with vmemmap_remap_split(), which says:
"@flags:    modifications to vmemmap_remap_walk flags".

Thanks.

> *
> * Return: %0 on success, negative error code otherwise.
> */
> static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
> - unsigned long reuse)
> + unsigned long reuse, unsigned long flags)
> {
> LIST_HEAD(vmemmap_pages);
> struct vmemmap_remap_walk walk = {
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> - .flags = 0,
> + .flags = flags,
> };
>
> /* See the comment in the vmemmap_remap_free(). */
> @@ -493,17 +494,7 @@ EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
> static bool vmemmap_optimize_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON);
> core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0);
>
> -/**
> - * hugetlb_vmemmap_restore - restore previously optimized (by
> - * hugetlb_vmemmap_optimize()) vmemmap pages which
> - * will be reallocated and remapped.
> - * @h: struct hstate.
> - * @head: the head page whose vmemmap pages will be restored.
> - *
> - * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> - * negative error code otherwise.
> - */
> -int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +static int __hugetlb_vmemmap_restore(const struct hstate *h, struct page *head, unsigned long flags)
> {
> int ret;
> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> @@ -524,7 +515,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> * When a HugeTLB page is freed to the buddy allocator, previously
> * discarded vmemmap pages must be allocated and remapping.
> */
> - ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
> + ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse, flags);
> if (!ret) {
> ClearHPageVmemmapOptimized(head);
> static_branch_dec(&hugetlb_optimize_vmemmap_key);
> @@ -533,6 +524,21 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
> +/**
> + * hugetlb_vmemmap_restore - restore previously optimized (by
> + * hugetlb_vmemmap_optimize()) vmemmap pages which
> + * will be reallocated and remapped.
> + * @h: struct hstate.
> + * @head: the head page whose vmemmap pages will be restored.
> + *
> + * Return: %0 if @head's vmemmap pages have been reallocated and remapped,
> + * negative error code otherwise.
> + */
> +int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> +{
> + return __hugetlb_vmemmap_restore(h, head, 0);
> +}
> +
> /**
> * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> * @h: struct hstate.
> @@ -557,7 +563,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> num_restored = 0;
> list_for_each_entry(folio, folio_list, lru) {
> if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> - t_ret = hugetlb_vmemmap_restore(h, &folio->page);
> + t_ret = __hugetlb_vmemmap_restore(h, &folio->page,
> + VMEMMAP_REMAP_NO_TLB_FLUSH);
> if (t_ret)
> ret = t_ret;
> else
> @@ -565,6 +572,8 @@ int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> }
> }
>
> + flush_tlb_all();
> +
> if (*restored)
> *restored = num_restored;
> return ret;

2023-09-19 08:11:32

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] hugetlb: batch freeing of vmemmap pages



On 2023/9/19 07:01, Mike Kravetz wrote:
> Now that batching of hugetlb vmemmap optimization processing is possible,
> batch the freeing of vmemmap pages. When freeing vmemmap pages for a
> hugetlb page, we add them to a list that is freed after the entire batch
> has been processed.
>
> This enhances the ability to return contiguous ranges of memory to the
> low level allocators.
>
> Signed-off-by: Mike Kravetz <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

One nit bellow.

> ---
> mm/hugetlb_vmemmap.c | 85 ++++++++++++++++++++++++++++++--------------
> 1 file changed, 59 insertions(+), 26 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 463a4037ec6e..147ed15bcae4 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -222,6 +222,9 @@ static void free_vmemmap_page_list(struct list_head *list)
> {
> struct page *page, *next;
>
> + if (list_empty(list))
> + return;

It seems unnecessary since the following "list_for_each_entry_safe"
could handle empty-list case. Right?

> +
> list_for_each_entry_safe(page, next, list, lru)
> free_vmemmap_page(page);
> }
> @@ -251,7 +254,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> }
>
> entry = mk_pte(walk->reuse_page, pgprot);
> - list_add_tail(&page->lru, walk->vmemmap_pages);
> + list_add(&page->lru, walk->vmemmap_pages);
> set_pte_at(&init_mm, addr, pte, entry);
> }
>
> @@ -306,18 +309,20 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> * @end: end address of the vmemmap virtual address range that we want to
> * remap.
> * @reuse: reuse address.
> + * @vmemmap_pages: list to deposit vmemmap pages to be freed. It is callers
> + * responsibility to free pages.
> *
> * Return: %0 on success, negative error code otherwise.
> */
> static int vmemmap_remap_free(unsigned long start, unsigned long end,
> - unsigned long reuse)
> + unsigned long reuse,
> + struct list_head *vmemmap_pages)
> {
> int ret;
> - LIST_HEAD(vmemmap_pages);
> struct vmemmap_remap_walk walk = {
> .remap_pte = vmemmap_remap_pte,
> .reuse_addr = reuse,
> - .vmemmap_pages = &vmemmap_pages,
> + .vmemmap_pages = vmemmap_pages,
> };
> int nid = page_to_nid((struct page *)reuse);
> gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> @@ -334,7 +339,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> if (walk.reuse_page) {
> copy_page(page_to_virt(walk.reuse_page),
> (void *)walk.reuse_addr);
> - list_add(&walk.reuse_page->lru, &vmemmap_pages);
> + list_add(&walk.reuse_page->lru, vmemmap_pages);
> }
>
> /*
> @@ -365,15 +370,13 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> walk = (struct vmemmap_remap_walk) {
> .remap_pte = vmemmap_restore_pte,
> .reuse_addr = reuse,
> - .vmemmap_pages = &vmemmap_pages,
> + .vmemmap_pages = vmemmap_pages,
> };
>
> vmemmap_remap_range(reuse, end, &walk);
> }
> mmap_read_unlock(&init_mm);
>
> - free_vmemmap_page_list(&vmemmap_pages);
> -
> return ret;
> }
>
> @@ -389,7 +392,7 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> page = alloc_pages_node(nid, gfp_mask, 0);
> if (!page)
> goto out;
> - list_add_tail(&page->lru, list);
> + list_add(&page->lru, list);
> }
>
> return 0;
> @@ -576,24 +579,17 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> return true;
> }
>
> -/**
> - * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> - * @h: struct hstate.
> - * @head: the head page whose vmemmap pages will be optimized.
> - *
> - * This function only tries to optimize @head's vmemmap pages and does not
> - * guarantee that the optimization will succeed after it returns. The caller
> - * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> - * have been optimized.
> - */
> -void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +static int __hugetlb_vmemmap_optimize(const struct hstate *h,
> + struct page *head,
> + struct list_head *vmemmap_pages)
> {
> + int ret = 0;
> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> unsigned long vmemmap_reuse;
>
> VM_WARN_ON_ONCE(!PageHuge(head));
> if (!vmemmap_should_optimize(h, head))
> - return;
> + return ret;
>
> static_branch_inc(&hugetlb_optimize_vmemmap_key);
>
> @@ -603,21 +599,58 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
>
> /*
> * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> - * to the page which @vmemmap_reuse is mapped to, then free the pages
> - * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> + * to the page which @vmemmap_reuse is mapped to. Add pages previously
> + * mapping the range to vmemmap_pages list so that they can be freed by
> + * the caller.
> */
> - if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
> + ret = vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse, vmemmap_pages);
> + if (ret)
> static_branch_dec(&hugetlb_optimize_vmemmap_key);
> else
> SetHPageVmemmapOptimized(head);
> +
> + return ret;
> +}
> +
> +/**
> + * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> + * @h: struct hstate.
> + * @head: the head page whose vmemmap pages will be optimized.
> + *
> + * This function only tries to optimize @head's vmemmap pages and does not
> + * guarantee that the optimization will succeed after it returns. The caller
> + * can use HPageVmemmapOptimized(@head) to detect if @head's vmemmap pages
> + * have been optimized.
> + */
> +void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> +{
> + LIST_HEAD(vmemmap_pages);
> +
> + __hugetlb_vmemmap_optimize(h, head, &vmemmap_pages);
> + free_vmemmap_page_list(&vmemmap_pages);
> }
>
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
> + LIST_HEAD(vmemmap_pages);
>
> - list_for_each_entry(folio, folio_list, lru)
> - hugetlb_vmemmap_optimize(h, &folio->page);
> + list_for_each_entry(folio, folio_list, lru) {
> + int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
> + &vmemmap_pages);
> +
> + /*
> + * Pages to be freed may have been accumulated. If we
> + * encounter an ENOMEM, free what we have and try again.
> + */
> + if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
> + free_vmemmap_page_list(&vmemmap_pages);
> + INIT_LIST_HEAD(&vmemmap_pages);
> + __hugetlb_vmemmap_optimize(h, &folio->page, &vmemmap_pages);
> + }
> + }
> +
> + free_vmemmap_page_list(&vmemmap_pages);
> }
>
> static struct ctl_table hugetlb_vmemmap_sysctls[] = {

2023-09-19 09:35:53

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

The routine update_and_free_pages_bulk already performs vmemmap
restoration on the list of hugetlb pages in a separate step. In
preparation for more functionality to be added in this step, create a
new routine hugetlb_vmemmap_restore_folios() that will restore
vmemmap for a list of folios.

This new routine must provide sufficient feedback about errors and
actual restoration performed so that update_and_free_pages_bulk can
perform optimally.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 36 ++++++++++++++++++------------------
mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++++++++++++++++
mm/hugetlb_vmemmap.h | 11 +++++++++++
3 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d6f3db3c1313..814bb1982274 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1836,36 +1836,36 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,

static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
{
+ int ret;
+ unsigned long restored;
struct folio *folio, *t_folio;
- bool clear_dtor = false;

/*
- * First allocate required vmemmmap (if necessary) for all folios on
- * list. If vmemmap can not be allocated, we can not free folio to
- * lower level allocator, so add back as hugetlb surplus page.
- * add_hugetlb_folio() removes the page from THIS list.
- * Use clear_dtor to note if vmemmap was successfully allocated for
- * ANY page on the list.
+ * First allocate required vmemmmap (if necessary) for all folios.
*/
- list_for_each_entry_safe(folio, t_folio, list, lru) {
- if (folio_test_hugetlb_vmemmap_optimized(folio)) {
- if (hugetlb_vmemmap_restore(h, &folio->page)) {
- spin_lock_irq(&hugetlb_lock);
+ ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
+
+ /*
+ * If there was an error restoring vmemmap for ANY folios on the list,
+ * add them back as surplus hugetlb pages. add_hugetlb_folio() removes
+ * the folio from THIS list.
+ */
+ if (ret < 0) {
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry_safe(folio, t_folio, list, lru)
+ if (folio_test_hugetlb_vmemmap_optimized(folio))
add_hugetlb_folio(h, folio, true);
- spin_unlock_irq(&hugetlb_lock);
- } else
- clear_dtor = true;
- }
+ spin_unlock_irq(&hugetlb_lock);
}

/*
- * If vmemmmap allocation was performed on any folio above, take lock
- * to clear destructor of all folios on list. This avoids the need to
+ * If vmemmmap allocation was performed on ANY folio , take lock to
+ * clear destructor of all folios on list. This avoids the need to
* lock/unlock for each individual folio.
* The assumption is vmemmap allocation was performed on all or none
* of the folios on the list. This is true expect in VERY rare cases.
*/
- if (clear_dtor) {
+ if (restored) {
spin_lock_irq(&hugetlb_lock);
list_for_each_entry(folio, list, lru)
__clear_hugetlb_destructor(h, folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4558b814ffab..463a4037ec6e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -480,6 +480,43 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}

+/**
+ * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
+ * @h: struct hstate.
+ * @folio_list: list of folios.
+ * @restored: Set to number of folios for which vmemmap was restored
+ * successfully if caller passes a non-NULL pointer.
+ *
+ * Return: %0 if vmemmap exists for all folios on the list. If an error is
+ * encountered restoring vmemmap for ANY folio, an error code
+ * will be returned to the caller. It is then the responsibility
+ * of the caller to check the hugetlb vmemmap optimized flag of
+ * each folio to determine if vmemmap was actually restored.
+ */
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ unsigned long *restored)
+{
+ unsigned long num_restored;
+ struct folio *folio;
+ int ret = 0, t_ret;
+
+ num_restored = 0;
+ list_for_each_entry(folio, folio_list, lru) {
+ if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ t_ret = hugetlb_vmemmap_restore(h, &folio->page);
+ if (t_ret)
+ ret = t_ret;
+ else
+ num_restored++;
+ }
+ }
+
+ if (*restored)
+ *restored = num_restored;
+ return ret;
+}
+
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index c512e388dbb4..bb58453c3cc0 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -19,6 +19,8 @@

#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list, unsigned long *restored);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);

@@ -45,6 +47,15 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
return 0;
}

+static inline int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ unsigned long *restored)
+{
+ if (restored)
+ *restored = 0;
+ return 0;
+}
+
static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
}
--
2.41.0

2023-09-19 09:59:45

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup

On 19/09/2023 09:41, Muchun Song wrote:
>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>> On 19/09/2023 07:42, Muchun Song wrote:
>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>> From: Joao Martins <[email protected]>
>>>>
>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>> flush.
>>>>
>>>> Add a flags field to the walker and pass whether it's a bulk allocation
>>>> or just a single page to decide to remap. First value
>>>> (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
>>>> flush when we split the PMD.
>>>>
>>>> Rebased and updated by Mike Kravetz
>>>>
>>>> Signed-off-by: Joao Martins <[email protected]>
>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>> ---
>>>> mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 75 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>> index 147ed15bcae4..e8bc2f7567db 100644
>>>> --- a/mm/hugetlb_vmemmap.c
>>>> +++ b/mm/hugetlb_vmemmap.c
>>>> @@ -27,6 +27,7 @@
>>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>>> * or is mapped from.
>>>> + * @flags: used to modify behavior in bulk operations
>>>
>>> Better to describe it as "used to modify behavior in vmemmap page table walking
>>> operations"
>>>
>> OK
>>
>>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>>>> *folio_list)
>>>> {
>>>> struct folio *folio;
>>>> LIST_HEAD(vmemmap_pages);
>>>> + list_for_each_entry(folio, folio_list, lru)
>>>> + hugetlb_vmemmap_split(h, &folio->page);
>>>> +
>>>> + flush_tlb_all();
>>>> +
>>>> list_for_each_entry(folio, folio_list, lru) {
>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>> &vmemmap_pages);
>>>
>>> This is unlikely to be failed since the page table allocation
>>> is moved to the above
>>
>>> (Note that the head vmemmap page allocation
>>> is not mandatory).
>>
>> Good point that I almost forgot
>>
>>> So we should handle the error case in the above
>>> splitting operation.
>>
>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>> got split, and say could allow some PTE remapping to occur and free some pages
>> back (each page allows 6 more splits worst case). Then the next
>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>> flush in this stage).
>
> Oh, yes. Maybe we could break the above traversal as early as possible
> once we enter an ENOMEM?
>

Sounds good -- no point in keep trying to split if we are failing with OOM.

Perhaps a comment in both of these clauses (the early break on split and the OOM
handling in batch optimize) could help make this clear.

>>
>> Unless this isn't something worth handling
>>
>> Joao
>
>
>

2023-09-19 10:53:00

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup



> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>
> On 19/09/2023 07:42, Muchun Song wrote:
>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>> From: Joao Martins <[email protected]>
>>>
>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>> flush.
>>>
>>> Add a flags field to the walker and pass whether it's a bulk allocation
>>> or just a single page to decide to remap. First value
>>> (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
>>> flush when we split the PMD.
>>>
>>> Rebased and updated by Mike Kravetz
>>>
>>> Signed-off-by: Joao Martins <[email protected]>
>>> Signed-off-by: Mike Kravetz <[email protected]>
>>> ---
>>> mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 75 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 147ed15bcae4..e8bc2f7567db 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -27,6 +27,7 @@
>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>> * or is mapped from.
>>> + * @flags: used to modify behavior in bulk operations
>>
>> Better to describe it as "used to modify behavior in vmemmap page table walking
>> operations"
>>
> OK
>
>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>>> *folio_list)
>>> {
>>> struct folio *folio;
>>> LIST_HEAD(vmemmap_pages);
>>> + list_for_each_entry(folio, folio_list, lru)
>>> + hugetlb_vmemmap_split(h, &folio->page);
>>> +
>>> + flush_tlb_all();
>>> +
>>> list_for_each_entry(folio, folio_list, lru) {
>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>> &vmemmap_pages);
>>
>> This is unlikely to be failed since the page table allocation
>> is moved to the above
>
>> (Note that the head vmemmap page allocation
>> is not mandatory).
>
> Good point that I almost forgot
>
>> So we should handle the error case in the above
>> splitting operation.
>
> But back to the previous discussion in v2... the thinking was that /some/ PMDs
> got split, and say could allow some PTE remapping to occur and free some pages
> back (each page allows 6 more splits worst case). Then the next
> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
> flush in this stage).

Oh, yes. Maybe we could break the above traversal as early as possible
once we enter an ENOMEM?

>
> Unless this isn't something worth handling
>
> Joao


2023-09-19 12:27:39

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



On 2023/9/19 07:01, Mike Kravetz wrote:
> The routine update_and_free_pages_bulk already performs vmemmap
> restoration on the list of hugetlb pages in a separate step. In
> preparation for more functionality to be added in this step, create a
> new routine hugetlb_vmemmap_restore_folios() that will restore
> vmemmap for a list of folios.
>
> This new routine must provide sufficient feedback about errors and
> actual restoration performed so that update_and_free_pages_bulk can
> perform optimally.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 36 ++++++++++++++++++------------------
> mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++++++++++++++++
> mm/hugetlb_vmemmap.h | 11 +++++++++++
> 3 files changed, 66 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d6f3db3c1313..814bb1982274 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1836,36 +1836,36 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>
> static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> {
> + int ret;
> + unsigned long restored;
> struct folio *folio, *t_folio;
> - bool clear_dtor = false;
>
> /*
> - * First allocate required vmemmmap (if necessary) for all folios on
> - * list. If vmemmap can not be allocated, we can not free folio to
> - * lower level allocator, so add back as hugetlb surplus page.
> - * add_hugetlb_folio() removes the page from THIS list.
> - * Use clear_dtor to note if vmemmap was successfully allocated for
> - * ANY page on the list.
> + * First allocate required vmemmmap (if necessary) for all folios.
> */
> - list_for_each_entry_safe(folio, t_folio, list, lru) {
> - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> - if (hugetlb_vmemmap_restore(h, &folio->page)) {
> - spin_lock_irq(&hugetlb_lock);
> + ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
> +
> + /*
> + * If there was an error restoring vmemmap for ANY folios on the list,
> + * add them back as surplus hugetlb pages. add_hugetlb_folio() removes
> + * the folio from THIS list.
> + */
> + if (ret < 0) {
> + spin_lock_irq(&hugetlb_lock);
> + list_for_each_entry_safe(folio, t_folio, list, lru)
> + if (folio_test_hugetlb_vmemmap_optimized(folio))
> add_hugetlb_folio(h, folio, true);
> - spin_unlock_irq(&hugetlb_lock);
> - } else
> - clear_dtor = true;
> - }
> + spin_unlock_irq(&hugetlb_lock);
> }
>
> /*
> - * If vmemmmap allocation was performed on any folio above, take lock
> - * to clear destructor of all folios on list. This avoids the need to
> + * If vmemmmap allocation was performed on ANY folio , take lock to
> + * clear destructor of all folios on list. This avoids the need to
> * lock/unlock for each individual folio.
> * The assumption is vmemmap allocation was performed on all or none
> * of the folios on the list. This is true expect in VERY rare cases.
> */
> - if (clear_dtor) {
> + if (restored) {
> spin_lock_irq(&hugetlb_lock);
> list_for_each_entry(folio, list, lru)
> __clear_hugetlb_destructor(h, folio);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4558b814ffab..463a4037ec6e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -480,6 +480,43 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
> +/**
> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> + * @h: struct hstate.
> + * @folio_list: list of folios.
> + * @restored: Set to number of folios for which vmemmap was restored
> + * successfully if caller passes a non-NULL pointer.
> + *
> + * Return: %0 if vmemmap exists for all folios on the list. If an error is
> + * encountered restoring vmemmap for ANY folio, an error code
> + * will be returned to the caller. It is then the responsibility
> + * of the caller to check the hugetlb vmemmap optimized flag of
> + * each folio to determine if vmemmap was actually restored.
> + */
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + unsigned long *restored)
> +{
> + unsigned long num_restored;
> + struct folio *folio;
> + int ret = 0, t_ret;
> +
> + num_restored = 0;
> + list_for_each_entry(folio, folio_list, lru) {
> + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> + t_ret = hugetlb_vmemmap_restore(h, &folio->page);

I still think we should free a non-optimized HugeTLB page if we
encounter an OOM situation instead of continue to restore
vemmmap pages. Restoring vmemmmap pages will only aggravate
the OOM situation. The suitable appraoch is to free a non-optimized
HugeTLB page to satisfy our allocation of vmemmap pages, what's
your opinion, Mike?

Thanks.

> + if (t_ret)
> + ret = t_ret;
> + else
> + num_restored++;
> + }
> + }
> +
> + if (*restored)
> + *restored = num_restored;
> + return ret;
> +}
> +
> /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
> static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
> {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index c512e388dbb4..bb58453c3cc0 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -19,6 +19,8 @@
>
> #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list, unsigned long *restored);
> void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>
> @@ -45,6 +47,15 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
> return 0;
> }
>
> +static inline int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + unsigned long *restored)
> +{
> + if (restored)
> + *restored = 0;
> + return 0;
> +}
> +
> static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> {
> }

2023-09-19 12:39:37

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup

On 19/09/2023 07:42, Muchun Song wrote:
> On 2023/9/19 07:01, Mike Kravetz wrote:
>> From: Joao Martins <[email protected]>
>>
>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>> belonging to a range of pages in order to perform only 1 (global) TLB
>> flush.
>>
>> Add a flags field to the walker and pass whether it's a bulk allocation
>> or just a single page to decide to remap. First value
>> (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
>> flush when we split the PMD.
>>
>> Rebased and updated by Mike Kravetz
>>
>> Signed-off-by: Joao Martins <[email protected]>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>>   mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 75 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 147ed15bcae4..e8bc2f7567db 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -27,6 +27,7 @@
>>    * @reuse_addr:        the virtual address of the @reuse_page page.
>>    * @vmemmap_pages:    the list head of the vmemmap pages that can be freed
>>    *            or is mapped from.
>> + * @flags:        used to modify behavior in bulk operations
>
> Better to describe it as "used to modify behavior in vmemmap page table walking
> operations"
>
OK

>>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>> *folio_list)
>>   {
>>       struct folio *folio;
>>       LIST_HEAD(vmemmap_pages);
>>   +    list_for_each_entry(folio, folio_list, lru)
>> +        hugetlb_vmemmap_split(h, &folio->page);
>> +
>> +    flush_tlb_all();
>> +
>>       list_for_each_entry(folio, folio_list, lru) {
>>           int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>                                   &vmemmap_pages);
>
> This is unlikely to be failed since the page table allocation
> is moved to the above

> (Note that the head vmemmap page allocation
> is not mandatory).

Good point that I almost forgot

> So we should handle the error case in the above
> splitting operation.

But back to the previous discussion in v2... the thinking was that /some/ PMDs
got split, and say could allow some PTE remapping to occur and free some pages
back (each page allows 6 more splits worst case). Then the next
__hugetlb_vmemmap_optimize() will have to split PMD pages again for those
hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
flush in this stage).

Unless this isn't something worth handling

Joao

2023-09-19 19:02:39

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup

On 19/09/2023 09:57, Muchun Song wrote:
>> On Sep 19, 2023, at 16:55, Joao Martins <[email protected]> wrote:
>> On 19/09/2023 09:41, Muchun Song wrote:
>>>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>> list_for_each_entry(folio, folio_list, lru) {
>>>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>>> &vmemmap_pages);
>>>>>
>>>>> This is unlikely to be failed since the page table allocation
>>>>> is moved to the above
>>>>
>>>>> (Note that the head vmemmap page allocation
>>>>> is not mandatory).
>>>>
>>>> Good point that I almost forgot
>>>>
>>>>> So we should handle the error case in the above
>>>>> splitting operation.
>>>>
>>>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>>>> got split, and say could allow some PTE remapping to occur and free some pages
>>>> back (each page allows 6 more splits worst case). Then the next
>>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>>>> flush in this stage).
>>>
>>> Oh, yes. Maybe we could break the above traversal as early as possible
>>> once we enter an ENOMEM?
>>>
>>
>> Sounds good -- no point in keep trying to split if we are failing with OOM.
>>
>> Perhaps a comment in both of these clauses (the early break on split and the OOM
>> handling in batch optimize) could help make this clear.
>
> Make sense.

These are the changes I have so far for this patch based on the discussion so
far. For next one it's at the end:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index e8bc2f7567db..d9c6f2cf698c 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -27,7 +27,8 @@
* @reuse_addr: the virtual address of the @reuse_page page.
* @vmemmap_pages: the list head of the vmemmap pages that can be freed
* or is mapped from.
- * @flags: used to modify behavior in bulk operations
+ * @flags: used to modify behavior in vmemmap page table walking
+ * operations.
*/
struct vmemmap_remap_walk {
void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -36,6 +37,8 @@ struct vmemmap_remap_walk {
struct page *reuse_page;
unsigned long reuse_addr;
struct list_head *vmemmap_pages;
+
+/* Skip the TLB flush when we split the PMD */
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
unsigned long flags;
};
@@ -132,7 +135,7 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
int ret;

ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
- walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
+ !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
if (ret)
return ret;

@@ -677,13 +680,13 @@ void hugetlb_vmemmap_optimize(const struct hstate *h,
struct page *head)
free_vmemmap_page_list(&vmemmap_pages);
}

-static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
+static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
{
unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
unsigned long vmemmap_reuse;

if (!vmemmap_should_optimize(h, head))
- return;
+ return 0;

vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
vmemmap_reuse = vmemmap_start;
@@ -693,7 +696,7 @@ static void hugetlb_vmemmap_split(const struct hstate *h,
struct page *head)
* Split PMDs on the vmemmap virtual address range [@vmemmap_start,
* @vmemmap_end]
*/
- vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
+ return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
}

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
*folio_list)
@@ -701,8 +704,18 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
struct list_head *folio_l
struct folio *folio;
LIST_HEAD(vmemmap_pages);

- list_for_each_entry(folio, folio_list, lru)
- hugetlb_vmemmap_split(h, &folio->page);
+ list_for_each_entry(folio, folio_list, lru) {
+ int ret = hugetlb_vmemmap_split(h, &folio->page);
+
+ /*
+ * Spliting the PMD requires allocating a page, thus lets fail
+ * early once we encounter the first OOM. No point in retrying
+ * as it can be dynamically done on remap with the memory
+ * we get back from the vmemmap deduplication.
+ */
+ if (ret == -ENOMEM)
+ break;
+ }

flush_tlb_all();

For patch 7, I only have commentary added derived from this earlier discussion
above:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index d9c6f2cf698c..f6a1020a4b6a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -40,6 +40,8 @@ struct vmemmap_remap_walk {

/* Skip the TLB flush when we split the PMD */
#define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
+/* Skip the TLB flush when we remap the PTE */
#define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
unsigned long flags;
};

@@ -721,19 +739,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
struct list_head *folio_l

list_for_each_entry(folio, folio_list, lru) {
int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
&vmemmap_pages,
VMEMMAP_REMAP_NO_TLB_FLUSH);

/*
* Pages to be freed may have been accumulated. If we
* encounter an ENOMEM, free what we have and try again.
+ * This can occur in the case that both spliting fails
+ * halfway and head page allocation also failed. In this
+ * case __hugetlb_vmemmap_optimize() would free memory
+ * allowing more vmemmap remaps to occur.
*/
if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {

2023-09-19 21:49:59

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup



> On Sep 19, 2023, at 16:55, Joao Martins <[email protected]> wrote:
>
> On 19/09/2023 09:41, Muchun Song wrote:
>>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>> From: Joao Martins <[email protected]>
>>>>>
>>>>> In an effort to minimize amount of TLB flushes, batch all PMD splits
>>>>> belonging to a range of pages in order to perform only 1 (global) TLB
>>>>> flush.
>>>>>
>>>>> Add a flags field to the walker and pass whether it's a bulk allocation
>>>>> or just a single page to decide to remap. First value
>>>>> (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB
>>>>> flush when we split the PMD.
>>>>>
>>>>> Rebased and updated by Mike Kravetz
>>>>>
>>>>> Signed-off-by: Joao Martins <[email protected]>
>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>> ---
>>>>> mm/hugetlb_vmemmap.c | 79 +++++++++++++++++++++++++++++++++++++++++---
>>>>> 1 file changed, 75 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>>> index 147ed15bcae4..e8bc2f7567db 100644
>>>>> --- a/mm/hugetlb_vmemmap.c
>>>>> +++ b/mm/hugetlb_vmemmap.c
>>>>> @@ -27,6 +27,7 @@
>>>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>>>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>>>> * or is mapped from.
>>>>> + * @flags: used to modify behavior in bulk operations
>>>>
>>>> Better to describe it as "used to modify behavior in vmemmap page table walking
>>>> operations"
>>>>
>>> OK
>>>
>>>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>>>>> *folio_list)
>>>>> {
>>>>> struct folio *folio;
>>>>> LIST_HEAD(vmemmap_pages);
>>>>> + list_for_each_entry(folio, folio_list, lru)
>>>>> + hugetlb_vmemmap_split(h, &folio->page);
>>>>> +
>>>>> + flush_tlb_all();
>>>>> +
>>>>> list_for_each_entry(folio, folio_list, lru) {
>>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>> &vmemmap_pages);
>>>>
>>>> This is unlikely to be failed since the page table allocation
>>>> is moved to the above
>>>
>>>> (Note that the head vmemmap page allocation
>>>> is not mandatory).
>>>
>>> Good point that I almost forgot
>>>
>>>> So we should handle the error case in the above
>>>> splitting operation.
>>>
>>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>>> got split, and say could allow some PTE remapping to occur and free some pages
>>> back (each page allows 6 more splits worst case). Then the next
>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>>> flush in this stage).
>>
>> Oh, yes. Maybe we could break the above traversal as early as possible
>> once we enter an ENOMEM?
>>
>
> Sounds good -- no point in keep trying to split if we are failing with OOM.
>
> Perhaps a comment in both of these clauses (the early break on split and the OOM
> handling in batch optimize) could help make this clear.

Make sense.

Thanks.

>
>>>
>>> Unless this isn't something worth handling
>>>
>>> Joao


2023-09-19 23:08:46

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] hugetlb: batch TLB flushes when restoring vmemmap

On 09/19/23 14:48, Muchun Song wrote:
>
>
> On 2023/9/19 07:02, Mike Kravetz wrote:
> > Update the internal hugetlb restore vmemmap code path such that TLB
> > flushing can be batched. Use the existing mechanism of passing the
> > VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be
> > performed for individual pages. The routine hugetlb_vmemmap_restore_folios
> > is the only user of this new mechanism, and it will perform a global
> > flush after all vmemmap is restored.
> >
> > Signed-off-by: Joao Martins <[email protected]>
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > mm/hugetlb_vmemmap.c | 39 ++++++++++++++++++++++++---------------
> > 1 file changed, 24 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index a6c356acb1fc..ae2229f19158 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -460,18 +460,19 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
> > * @end: end address of the vmemmap virtual address range that we want to
> > * remap.
> > * @reuse: reuse address.
> > + * @flags: modify behavior for bulk operations
>
> Please keep the comment consistent with vmemmap_remap_split(), which says:
> "@flags:??? modifications to vmemmap_remap_walk flags".

Thanks, will change in next version.
--
Mike Kravetz

2023-09-19 23:41:41

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

On 09/19/23 17:52, Muchun Song wrote:
>
>
> On 2023/9/19 07:01, Mike Kravetz wrote:
> > The routine update_and_free_pages_bulk already performs vmemmap
> > restoration on the list of hugetlb pages in a separate step. In
> > preparation for more functionality to be added in this step, create a
> > new routine hugetlb_vmemmap_restore_folios() that will restore
> > vmemmap for a list of folios.
> >
> > This new routine must provide sufficient feedback about errors and
> > actual restoration performed so that update_and_free_pages_bulk can
> > perform optimally.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > mm/hugetlb.c | 36 ++++++++++++++++++------------------
> > mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++++++++++++++++
> > mm/hugetlb_vmemmap.h | 11 +++++++++++
> > 3 files changed, 66 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index d6f3db3c1313..814bb1982274 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1836,36 +1836,36 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> > static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > {
> > + int ret;
> > + unsigned long restored;
> > struct folio *folio, *t_folio;
> > - bool clear_dtor = false;
> > /*
> > - * First allocate required vmemmmap (if necessary) for all folios on
> > - * list. If vmemmap can not be allocated, we can not free folio to
> > - * lower level allocator, so add back as hugetlb surplus page.
> > - * add_hugetlb_folio() removes the page from THIS list.
> > - * Use clear_dtor to note if vmemmap was successfully allocated for
> > - * ANY page on the list.
> > + * First allocate required vmemmmap (if necessary) for all folios.
> > */
> > - list_for_each_entry_safe(folio, t_folio, list, lru) {
> > - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> > - if (hugetlb_vmemmap_restore(h, &folio->page)) {
> > - spin_lock_irq(&hugetlb_lock);
> > + ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
> > +
> > + /*
> > + * If there was an error restoring vmemmap for ANY folios on the list,
> > + * add them back as surplus hugetlb pages. add_hugetlb_folio() removes
> > + * the folio from THIS list.
> > + */
> > + if (ret < 0) {
> > + spin_lock_irq(&hugetlb_lock);
> > + list_for_each_entry_safe(folio, t_folio, list, lru)
> > + if (folio_test_hugetlb_vmemmap_optimized(folio))
> > add_hugetlb_folio(h, folio, true);
> > - spin_unlock_irq(&hugetlb_lock);
> > - } else
> > - clear_dtor = true;
> > - }
> > + spin_unlock_irq(&hugetlb_lock);
> > }
> > /*
> > - * If vmemmmap allocation was performed on any folio above, take lock
> > - * to clear destructor of all folios on list. This avoids the need to
> > + * If vmemmmap allocation was performed on ANY folio , take lock to
> > + * clear destructor of all folios on list. This avoids the need to
> > * lock/unlock for each individual folio.
> > * The assumption is vmemmap allocation was performed on all or none
> > * of the folios on the list. This is true expect in VERY rare cases.
> > */
> > - if (clear_dtor) {
> > + if (restored) {
> > spin_lock_irq(&hugetlb_lock);
> > list_for_each_entry(folio, list, lru)
> > __clear_hugetlb_destructor(h, folio);
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 4558b814ffab..463a4037ec6e 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -480,6 +480,43 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > return ret;
> > }
> > +/**
> > + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> > + * @h: struct hstate.
> > + * @folio_list: list of folios.
> > + * @restored: Set to number of folios for which vmemmap was restored
> > + * successfully if caller passes a non-NULL pointer.
> > + *
> > + * Return: %0 if vmemmap exists for all folios on the list. If an error is
> > + * encountered restoring vmemmap for ANY folio, an error code
> > + * will be returned to the caller. It is then the responsibility
> > + * of the caller to check the hugetlb vmemmap optimized flag of
> > + * each folio to determine if vmemmap was actually restored.
> > + */
> > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > + struct list_head *folio_list,
> > + unsigned long *restored)
> > +{
> > + unsigned long num_restored;
> > + struct folio *folio;
> > + int ret = 0, t_ret;
> > +
> > + num_restored = 0;
> > + list_for_each_entry(folio, folio_list, lru) {
> > + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> > + t_ret = hugetlb_vmemmap_restore(h, &folio->page);
>
> I still think we should free a non-optimized HugeTLB page if we
> encounter an OOM situation instead of continue to restore
> vemmmap pages. Restoring vmemmmap pages will only aggravate
> the OOM situation. The suitable appraoch is to free a non-optimized
> HugeTLB page to satisfy our allocation of vmemmap pages, what's
> your opinion, Mike?

I agree.

As you mentioned previously, this may complicate this code path a bit.
I will rewrite to make this happen.
--
Mike Kravetz

2023-09-19 23:43:27

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] hugetlb: batch freeing of vmemmap pages

On 09/19/23 14:09, Muchun Song wrote:
>
>
> On 2023/9/19 07:01, Mike Kravetz wrote:
> > Now that batching of hugetlb vmemmap optimization processing is possible,
> > batch the freeing of vmemmap pages. When freeing vmemmap pages for a
> > hugetlb page, we add them to a list that is freed after the entire batch
> > has been processed.
> >
> > This enhances the ability to return contiguous ranges of memory to the
> > low level allocators.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
>
> Reviewed-by: Muchun Song <[email protected]>
>
> One nit bellow.
>
> > ---
> > mm/hugetlb_vmemmap.c | 85 ++++++++++++++++++++++++++++++--------------
> > 1 file changed, 59 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 463a4037ec6e..147ed15bcae4 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -222,6 +222,9 @@ static void free_vmemmap_page_list(struct list_head *list)
> > {
> > struct page *page, *next;
> > + if (list_empty(list))
> > + return;
>
> It seems unnecessary since the following "list_for_each_entry_safe"
> could handle empty-list case. Right?
>

Yes, it is an over-optimization that is not really necessary.
I will remove it.

--
Mike Kravetz

2023-09-20 02:57:46

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
>
> On 09/19/23 17:52, Muchun Song wrote:
>>
>>
>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>> The routine update_and_free_pages_bulk already performs vmemmap
>>> restoration on the list of hugetlb pages in a separate step. In
>>> preparation for more functionality to be added in this step, create a
>>> new routine hugetlb_vmemmap_restore_folios() that will restore
>>> vmemmap for a list of folios.
>>>
>>> This new routine must provide sufficient feedback about errors and
>>> actual restoration performed so that update_and_free_pages_bulk can
>>> perform optimally.
>>>
>>> Signed-off-by: Mike Kravetz <[email protected]>
>>> ---
>>> mm/hugetlb.c | 36 ++++++++++++++++++------------------
>>> mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++++++++++++++++
>>> mm/hugetlb_vmemmap.h | 11 +++++++++++
>>> 3 files changed, 66 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index d6f3db3c1313..814bb1982274 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1836,36 +1836,36 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>>> static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>> {
>>> + int ret;
>>> + unsigned long restored;
>>> struct folio *folio, *t_folio;
>>> - bool clear_dtor = false;
>>> /*
>>> - * First allocate required vmemmmap (if necessary) for all folios on
>>> - * list. If vmemmap can not be allocated, we can not free folio to
>>> - * lower level allocator, so add back as hugetlb surplus page.
>>> - * add_hugetlb_folio() removes the page from THIS list.
>>> - * Use clear_dtor to note if vmemmap was successfully allocated for
>>> - * ANY page on the list.
>>> + * First allocate required vmemmmap (if necessary) for all folios.
>>> */
>>> - list_for_each_entry_safe(folio, t_folio, list, lru) {
>>> - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>>> - if (hugetlb_vmemmap_restore(h, &folio->page)) {
>>> - spin_lock_irq(&hugetlb_lock);
>>> + ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
>>> +
>>> + /*
>>> + * If there was an error restoring vmemmap for ANY folios on the list,
>>> + * add them back as surplus hugetlb pages. add_hugetlb_folio() removes
>>> + * the folio from THIS list.
>>> + */
>>> + if (ret < 0) {
>>> + spin_lock_irq(&hugetlb_lock);
>>> + list_for_each_entry_safe(folio, t_folio, list, lru)
>>> + if (folio_test_hugetlb_vmemmap_optimized(folio))
>>> add_hugetlb_folio(h, folio, true);
>>> - spin_unlock_irq(&hugetlb_lock);
>>> - } else
>>> - clear_dtor = true;
>>> - }
>>> + spin_unlock_irq(&hugetlb_lock);
>>> }
>>> /*
>>> - * If vmemmmap allocation was performed on any folio above, take lock
>>> - * to clear destructor of all folios on list. This avoids the need to
>>> + * If vmemmmap allocation was performed on ANY folio , take lock to
>>> + * clear destructor of all folios on list. This avoids the need to
>>> * lock/unlock for each individual folio.
>>> * The assumption is vmemmap allocation was performed on all or none
>>> * of the folios on the list. This is true expect in VERY rare cases.
>>> */
>>> - if (clear_dtor) {
>>> + if (restored) {
>>> spin_lock_irq(&hugetlb_lock);
>>> list_for_each_entry(folio, list, lru)
>>> __clear_hugetlb_destructor(h, folio);
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 4558b814ffab..463a4037ec6e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -480,6 +480,43 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>> return ret;
>>> }
>>> +/**
>>> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
>>> + * @h: struct hstate.
>>> + * @folio_list: list of folios.
>>> + * @restored: Set to number of folios for which vmemmap was restored
>>> + * successfully if caller passes a non-NULL pointer.
>>> + *
>>> + * Return: %0 if vmemmap exists for all folios on the list. If an error is
>>> + * encountered restoring vmemmap for ANY folio, an error code
>>> + * will be returned to the caller. It is then the responsibility
>>> + * of the caller to check the hugetlb vmemmap optimized flag of
>>> + * each folio to determine if vmemmap was actually restored.
>>> + */
>>> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
>>> + struct list_head *folio_list,
>>> + unsigned long *restored)
>>> +{
>>> + unsigned long num_restored;
>>> + struct folio *folio;
>>> + int ret = 0, t_ret;
>>> +
>>> + num_restored = 0;
>>> + list_for_each_entry(folio, folio_list, lru) {
>>> + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>>> + t_ret = hugetlb_vmemmap_restore(h, &folio->page);
>>
>> I still think we should free a non-optimized HugeTLB page if we
>> encounter an OOM situation instead of continue to restore
>> vemmmap pages. Restoring vmemmmap pages will only aggravate
>> the OOM situation. The suitable appraoch is to free a non-optimized
>> HugeTLB page to satisfy our allocation of vmemmap pages, what's
>> your opinion, Mike?
>
> I agree.
>
> As you mentioned previously, this may complicate this code path a bit.
> I will rewrite to make this happen.

Maybe we could introduced two list passed to update_and_free_pages_bulk (this
will be easy for the callers of it), one is for non-optimized huge page,
another is optimized one. In update_and_free_pages_bulk, we could first
free those non-optimized huge page, and then restore vemmmap pages for
those optimized ones, in which case, the code could be simple.
hugetlb_vmemmap_restore_folios() dose not need to add complexity, which
still continue to restore vmemmap pages and will stop once we encounter
an OOM situation.

Thanks.

> --
> Mike Kravetz


2023-09-20 03:54:12

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup



> On Sep 19, 2023, at 23:09, Joao Martins <[email protected]> wrote:
>
> On 19/09/2023 09:57, Muchun Song wrote:
>>> On Sep 19, 2023, at 16:55, Joao Martins <[email protected]> wrote:
>>> On 19/09/2023 09:41, Muchun Song wrote:
>>>>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>>>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>>> list_for_each_entry(folio, folio_list, lru) {
>>>>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>>>> &vmemmap_pages);
>>>>>>
>>>>>> This is unlikely to be failed since the page table allocation
>>>>>> is moved to the above
>>>>>
>>>>>> (Note that the head vmemmap page allocation
>>>>>> is not mandatory).
>>>>>
>>>>> Good point that I almost forgot
>>>>>
>>>>>> So we should handle the error case in the above
>>>>>> splitting operation.
>>>>>
>>>>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>>>>> got split, and say could allow some PTE remapping to occur and free some pages
>>>>> back (each page allows 6 more splits worst case). Then the next
>>>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>>>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>>>>> flush in this stage).
>>>>
>>>> Oh, yes. Maybe we could break the above traversal as early as possible
>>>> once we enter an ENOMEM?
>>>>
>>>
>>> Sounds good -- no point in keep trying to split if we are failing with OOM.
>>>
>>> Perhaps a comment in both of these clauses (the early break on split and the OOM
>>> handling in batch optimize) could help make this clear.
>>
>> Make sense.
>
> These are the changes I have so far for this patch based on the discussion so
> far. For next one it's at the end:

Code looks good to me. One nit below.

>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index e8bc2f7567db..d9c6f2cf698c 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -27,7 +27,8 @@
> * @reuse_addr: the virtual address of the @reuse_page page.
> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> * or is mapped from.
> - * @flags: used to modify behavior in bulk operations
> + * @flags: used to modify behavior in vmemmap page table walking
> + * operations.
> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -36,6 +37,8 @@ struct vmemmap_remap_walk {
> struct page *reuse_page;
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> +
> +/* Skip the TLB flush when we split the PMD */
> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
> unsigned long flags;
> };
> @@ -132,7 +135,7 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> int ret;
>
> ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
> - walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
> + !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
> if (ret)
> return ret;
>
> @@ -677,13 +680,13 @@ void hugetlb_vmemmap_optimize(const struct hstate *h,
> struct page *head)
> free_vmemmap_page_list(&vmemmap_pages);
> }
>
> -static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> {
> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> unsigned long vmemmap_reuse;
>
> if (!vmemmap_should_optimize(h, head))
> - return;
> + return 0;
>
> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> vmemmap_reuse = vmemmap_start;
> @@ -693,7 +696,7 @@ static void hugetlb_vmemmap_split(const struct hstate *h,
> struct page *head)
> * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
> * @vmemmap_end]
> */
> - vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> + return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
> }
>
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
> *folio_list)
> @@ -701,8 +704,18 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
> struct list_head *folio_l
> struct folio *folio;
> LIST_HEAD(vmemmap_pages);
>
> - list_for_each_entry(folio, folio_list, lru)
> - hugetlb_vmemmap_split(h, &folio->page);
> + list_for_each_entry(folio, folio_list, lru) {
> + int ret = hugetlb_vmemmap_split(h, &folio->page);
> +
> + /*
> + * Spliting the PMD requires allocating a page, thus lets fail
^^^^ ^^^
Splitting page table page

I'd like to specify the functionality of the allocated page.

> + * early once we encounter the first OOM. No point in retrying
> + * as it can be dynamically done on remap with the memory
> + * we get back from the vmemmap deduplication.
> + */
> + if (ret == -ENOMEM)
> + break;
> + }
>
> flush_tlb_all();
>
> For patch 7, I only have commentary added derived from this earlier discussion
> above:
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index d9c6f2cf698c..f6a1020a4b6a 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -40,6 +40,8 @@ struct vmemmap_remap_walk {
>
> /* Skip the TLB flush when we split the PMD */
> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
> +/* Skip the TLB flush when we remap the PTE */
> #define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
> unsigned long flags;
> };
>
> @@ -721,19 +739,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
> struct list_head *folio_l
>
> list_for_each_entry(folio, folio_list, lru) {
> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
> &vmemmap_pages,
> VMEMMAP_REMAP_NO_TLB_FLUSH);
>
> /*
> * Pages to be freed may have been accumulated. If we
> * encounter an ENOMEM, free what we have and try again.
> + * This can occur in the case that both spliting fails
^^^
splitting

> + * halfway and head page allocation also failed. In this
^^^^^^^
head vmemmap page

Otherwise:

Reviewed-by: Muchun Song <[email protected]>

Thanks.

> + * case __hugetlb_vmemmap_optimize() would free memory
> + * allowing more vmemmap remaps to occur.
> */
> if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
>

2023-09-20 04:08:33

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



> On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
>
>
>
>> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
>>
>> On 09/19/23 17:52, Muchun Song wrote:
>>>
>>>
>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>> The routine update_and_free_pages_bulk already performs vmemmap
>>>> restoration on the list of hugetlb pages in a separate step. In
>>>> preparation for more functionality to be added in this step, create a
>>>> new routine hugetlb_vmemmap_restore_folios() that will restore
>>>> vmemmap for a list of folios.
>>>>
>>>> This new routine must provide sufficient feedback about errors and
>>>> actual restoration performed so that update_and_free_pages_bulk can
>>>> perform optimally.
>>>>
>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>> ---
>>>> mm/hugetlb.c | 36 ++++++++++++++++++------------------
>>>> mm/hugetlb_vmemmap.c | 37 +++++++++++++++++++++++++++++++++++++
>>>> mm/hugetlb_vmemmap.h | 11 +++++++++++
>>>> 3 files changed, 66 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index d6f3db3c1313..814bb1982274 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -1836,36 +1836,36 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>>>> static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
>>>> {
>>>> + int ret;
>>>> + unsigned long restored;
>>>> struct folio *folio, *t_folio;
>>>> - bool clear_dtor = false;
>>>> /*
>>>> - * First allocate required vmemmmap (if necessary) for all folios on
>>>> - * list. If vmemmap can not be allocated, we can not free folio to
>>>> - * lower level allocator, so add back as hugetlb surplus page.
>>>> - * add_hugetlb_folio() removes the page from THIS list.
>>>> - * Use clear_dtor to note if vmemmap was successfully allocated for
>>>> - * ANY page on the list.
>>>> + * First allocate required vmemmmap (if necessary) for all folios.
>>>> */
>>>> - list_for_each_entry_safe(folio, t_folio, list, lru) {
>>>> - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>>>> - if (hugetlb_vmemmap_restore(h, &folio->page)) {
>>>> - spin_lock_irq(&hugetlb_lock);
>>>> + ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
>>>> +
>>>> + /*
>>>> + * If there was an error restoring vmemmap for ANY folios on the list,
>>>> + * add them back as surplus hugetlb pages. add_hugetlb_folio() removes
>>>> + * the folio from THIS list.
>>>> + */
>>>> + if (ret < 0) {
>>>> + spin_lock_irq(&hugetlb_lock);
>>>> + list_for_each_entry_safe(folio, t_folio, list, lru)
>>>> + if (folio_test_hugetlb_vmemmap_optimized(folio))
>>>> add_hugetlb_folio(h, folio, true);
>>>> - spin_unlock_irq(&hugetlb_lock);
>>>> - } else
>>>> - clear_dtor = true;
>>>> - }
>>>> + spin_unlock_irq(&hugetlb_lock);
>>>> }
>>>> /*
>>>> - * If vmemmmap allocation was performed on any folio above, take lock
>>>> - * to clear destructor of all folios on list. This avoids the need to
>>>> + * If vmemmmap allocation was performed on ANY folio , take lock to
>>>> + * clear destructor of all folios on list. This avoids the need to
>>>> * lock/unlock for each individual folio.
>>>> * The assumption is vmemmap allocation was performed on all or none
>>>> * of the folios on the list. This is true expect in VERY rare cases.
>>>> */
>>>> - if (clear_dtor) {
>>>> + if (restored) {
>>>> spin_lock_irq(&hugetlb_lock);
>>>> list_for_each_entry(folio, list, lru)
>>>> __clear_hugetlb_destructor(h, folio);
>>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>>> index 4558b814ffab..463a4037ec6e 100644
>>>> --- a/mm/hugetlb_vmemmap.c
>>>> +++ b/mm/hugetlb_vmemmap.c
>>>> @@ -480,6 +480,43 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
>>>> return ret;
>>>> }
>>>> +/**
>>>> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
>>>> + * @h: struct hstate.
>>>> + * @folio_list: list of folios.
>>>> + * @restored: Set to number of folios for which vmemmap was restored
>>>> + * successfully if caller passes a non-NULL pointer.
>>>> + *
>>>> + * Return: %0 if vmemmap exists for all folios on the list. If an error is
>>>> + * encountered restoring vmemmap for ANY folio, an error code
>>>> + * will be returned to the caller. It is then the responsibility
>>>> + * of the caller to check the hugetlb vmemmap optimized flag of
>>>> + * each folio to determine if vmemmap was actually restored.
>>>> + */
>>>> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
>>>> + struct list_head *folio_list,
>>>> + unsigned long *restored)
>>>> +{
>>>> + unsigned long num_restored;
>>>> + struct folio *folio;
>>>> + int ret = 0, t_ret;
>>>> +
>>>> + num_restored = 0;
>>>> + list_for_each_entry(folio, folio_list, lru) {
>>>> + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>>>> + t_ret = hugetlb_vmemmap_restore(h, &folio->page);
>>>
>>> I still think we should free a non-optimized HugeTLB page if we
>>> encounter an OOM situation instead of continue to restore
>>> vemmmap pages. Restoring vmemmmap pages will only aggravate
>>> the OOM situation. The suitable appraoch is to free a non-optimized
>>> HugeTLB page to satisfy our allocation of vmemmap pages, what's
>>> your opinion, Mike?
>>
>> I agree.
>>
>> As you mentioned previously, this may complicate this code path a bit.
>> I will rewrite to make this happen.
>
> Maybe we could introduced two list passed to update_and_free_pages_bulk (this
> will be easy for the callers of it), one is for non-optimized huge page,
> another is optimized one. In update_and_free_pages_bulk, we could first
> free those non-optimized huge page, and then restore vemmmap pages for
> those optimized ones, in which case, the code could be simple.
> hugetlb_vmemmap_restore_folios() dose not need to add complexity, which
> still continue to restore vmemmap pages and will stop once we encounter
> an OOM situation.

BTW, maybe we should try again iff there are some non-optimized huge page
whose vmemmap pages are restored successfully previously and could be freed
first, then continue to restore the vmemmap pages of the remaining huge pages.
I think the retry code could be done in update_and_free_pages_bulk() as well.

>
> Thanks.
>
>> --
>> Mike Kravetz


2023-09-20 17:47:01

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup

On 20/09/2023 03:47, Muchun Song wrote:
>> On Sep 19, 2023, at 23:09, Joao Martins <[email protected]> wrote:
>> On 19/09/2023 09:57, Muchun Song wrote:
>>>> On Sep 19, 2023, at 16:55, Joao Martins <[email protected]> wrote:
>>>> On 19/09/2023 09:41, Muchun Song wrote:
>>>>>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>>>>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>>>> list_for_each_entry(folio, folio_list, lru) {
>>>>>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>>>>> &vmemmap_pages);
>>>>>>>
>>>>>>> This is unlikely to be failed since the page table allocation
>>>>>>> is moved to the above
>>>>>>
>>>>>>> (Note that the head vmemmap page allocation
>>>>>>> is not mandatory).
>>>>>>
>>>>>> Good point that I almost forgot
>>>>>>
>>>>>>> So we should handle the error case in the above
>>>>>>> splitting operation.
>>>>>>
>>>>>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>>>>>> got split, and say could allow some PTE remapping to occur and free some pages
>>>>>> back (each page allows 6 more splits worst case). Then the next
>>>>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>>>>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>>>>>> flush in this stage).
>>>>>
>>>>> Oh, yes. Maybe we could break the above traversal as early as possible
>>>>> once we enter an ENOMEM?
>>>>>
>>>>
>>>> Sounds good -- no point in keep trying to split if we are failing with OOM.
>>>>
>>>> Perhaps a comment in both of these clauses (the early break on split and the OOM
>>>> handling in batch optimize) could help make this clear.
>>>
>>> Make sense.
>>
>> These are the changes I have so far for this patch based on the discussion so
>> far. For next one it's at the end:
>
> Code looks good to me. One nit below.
>
Thanks

>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index e8bc2f7567db..d9c6f2cf698c 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -27,7 +27,8 @@
>> * @reuse_addr: the virtual address of the @reuse_page page.
>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>> * or is mapped from.
>> - * @flags: used to modify behavior in bulk operations
>> + * @flags: used to modify behavior in vmemmap page table walking
>> + * operations.
>> */
>> struct vmemmap_remap_walk {
>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>> @@ -36,6 +37,8 @@ struct vmemmap_remap_walk {
>> struct page *reuse_page;
>> unsigned long reuse_addr;
>> struct list_head *vmemmap_pages;
>> +
>> +/* Skip the TLB flush when we split the PMD */
>> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
>> unsigned long flags;
>> };
>> @@ -132,7 +135,7 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>> int ret;
>>
>> ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>> - walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
>> + !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
>> if (ret)
>> return ret;
>>
>> @@ -677,13 +680,13 @@ void hugetlb_vmemmap_optimize(const struct hstate *h,
>> struct page *head)
>> free_vmemmap_page_list(&vmemmap_pages);
>> }
>>
>> -static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>> +static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>> {
>> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>> unsigned long vmemmap_reuse;
>>
>> if (!vmemmap_should_optimize(h, head))
>> - return;
>> + return 0;
>>
>> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>> vmemmap_reuse = vmemmap_start;
>> @@ -693,7 +696,7 @@ static void hugetlb_vmemmap_split(const struct hstate *h,
>> struct page *head)
>> * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>> * @vmemmap_end]
>> */
>> - vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>> + return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>> }
>>
>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>> *folio_list)
>> @@ -701,8 +704,18 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
>> struct list_head *folio_l
>> struct folio *folio;
>> LIST_HEAD(vmemmap_pages);
>>
>> - list_for_each_entry(folio, folio_list, lru)
>> - hugetlb_vmemmap_split(h, &folio->page);
>> + list_for_each_entry(folio, folio_list, lru) {
>> + int ret = hugetlb_vmemmap_split(h, &folio->page);
>> +
>> + /*
>> + * Spliting the PMD requires allocating a page, thus lets fail
> ^^^^ ^^^
> Splitting page table page
>
> I'd like to specify the functionality of the allocated page.
>
OK

>> + * early once we encounter the first OOM. No point in retrying
>> + * as it can be dynamically done on remap with the memory
>> + * we get back from the vmemmap deduplication.
>> + */
>> + if (ret == -ENOMEM)
>> + break;
>> + }
>>
>> flush_tlb_all();
>>
>> For patch 7, I only have commentary added derived from this earlier discussion
>> above:
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index d9c6f2cf698c..f6a1020a4b6a 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -40,6 +40,8 @@ struct vmemmap_remap_walk {
>>
>> /* Skip the TLB flush when we split the PMD */
>> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
>> +/* Skip the TLB flush when we remap the PTE */
>> #define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
>> unsigned long flags;
>> };
>>
>> @@ -721,19 +739,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
>> struct list_head *folio_l
>>
>> list_for_each_entry(folio, folio_list, lru) {
>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>> &vmemmap_pages,
>> VMEMMAP_REMAP_NO_TLB_FLUSH);
>>
>> /*
>> * Pages to be freed may have been accumulated. If we
>> * encounter an ENOMEM, free what we have and try again.
>> + * This can occur in the case that both spliting fails
> ^^^
> splitting
>

ok

>> + * halfway and head page allocation also failed. In this
> ^^^^^^^
> head vmemmap page
>
ok

> Otherwise:
>
> Reviewed-by: Muchun Song <[email protected]>
>

Thanks, I assume that's for both patches?

> Thanks.
>
>> + * case __hugetlb_vmemmap_optimize() would free memory
>> + * allowing more vmemmap remaps to occur.
>> */
>> if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
>>
>
>

2023-09-21 05:16:39

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

On 09/20/23 11:03, Muchun Song wrote:
> > On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
> >> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
> >> On 09/19/23 17:52, Muchun Song wrote:
> >>> On 2023/9/19 07:01, Mike Kravetz wrote:
> >>>
> >>> I still think we should free a non-optimized HugeTLB page if we
> >>> encounter an OOM situation instead of continue to restore
> >>> vemmmap pages. Restoring vmemmmap pages will only aggravate
> >>> the OOM situation. The suitable appraoch is to free a non-optimized
> >>> HugeTLB page to satisfy our allocation of vmemmap pages, what's
> >>> your opinion, Mike?
> >>
> >> I agree.
> >>
> >> As you mentioned previously, this may complicate this code path a bit.
> >> I will rewrite to make this happen.
> >
> > Maybe we could introduced two list passed to update_and_free_pages_bulk (this
> > will be easy for the callers of it), one is for non-optimized huge page,
> > another is optimized one. In update_and_free_pages_bulk, we could first
> > free those non-optimized huge page, and then restore vemmmap pages for
> > those optimized ones, in which case, the code could be simple.
> > hugetlb_vmemmap_restore_folios() dose not need to add complexity, which
> > still continue to restore vmemmap pages and will stop once we encounter
> > an OOM situation.

I am not sure if passing in optimized and non-optimized lists to
update_and_free_pages_bulk will help much. IIUC, it will almost always
be the case where only one list has entries. Is that mostly accurate?

> BTW, maybe we should try again iff there are some non-optimized huge page
> whose vmemmap pages are restored successfully previously and could be freed
> first, then continue to restore the vmemmap pages of the remaining huge pages.
> I think the retry code could be done in update_and_free_pages_bulk() as well.

I came up with a new routine to handle these ENOMEM returns from
hugetlb_vmemmap_restore_folios. I 'think' it handles these situations.
Here is an updated version of this patch. Sorry, diff makes it a bit
hard to read.

From b13bdccb01730f995191944769f87d0725c289ad Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Sun, 10 Sep 2023 16:14:50 -0700
Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages

The routine update_and_free_pages_bulk already performs vmemmap
restoration on the list of hugetlb pages in a separate step. In
preparation for more functionality to be added in this step, create a
new routine hugetlb_vmemmap_restore_folios() that will restore
vmemmap for a list of folios.

This new routine must provide sufficient feedback about errors and
actual restoration performed so that update_and_free_pages_bulk can
perform optimally.

Special care must be taken when encountering a ENOMEM error from
hugetlb_vmemmap_restore_folios. We want to continue making as much
forward progress as possible. A new routine bulk_vmemmap_restore_enomem
handles this specific situation.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 83 ++++++++++++++++++++++++++++++++++----------
mm/hugetlb_vmemmap.c | 39 +++++++++++++++++++++
mm/hugetlb_vmemmap.h | 11 ++++++
3 files changed, 115 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 70fedf8682c4..52abe56cf38a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1834,38 +1834,85 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
schedule_work(&free_hpage_work);
}

-static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+static void bulk_vmemmap_restore_enomem(struct hstate *h,
+ struct list_head *list,
+ unsigned long restored)
{
struct folio *folio, *t_folio;
- bool clear_dtor = false;

- /*
- * First allocate required vmemmmap (if necessary) for all folios on
- * list. If vmemmap can not be allocated, we can not free folio to
- * lower level allocator, so add back as hugetlb surplus page.
- * add_hugetlb_folio() removes the page from THIS list.
- * Use clear_dtor to note if vmemmap was successfully allocated for
- * ANY page on the list.
- */
- list_for_each_entry_safe(folio, t_folio, list, lru) {
- if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ if (restored) {
+ /*
+ * On ENOMEM error, free any restored hugetlb pages so that
+ * restore of the entire list can be retried.
+ * The idea is that by freeing hugetlb pages with vmemmap
+ * (those previously restored) we will free up memory so that
+ * we can allocate vmemmap for more hugetlb pages.
+ * We must examine and possibly free EVERY hugetlb page on list
+ * in order to call hugetlb_vmemmap_restore_folios again.
+ * This is not optimal, but is an error case that should not
+ * happen frequently.
+ */
+ list_for_each_entry_safe(folio, t_folio, list, lru)
+ if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
+ list_del(&folio->lru);
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ update_and_free_hugetlb_folio(h, folio, false);
+ cond_resched();
+ }
+ } else {
+ /*
+ * In the case where vmemmap was not restored for ANY folios,
+ * we loop through them trying to restore individually in the
+ * hope that someone elsewhere may free enough memory.
+ * If unable to restore a page, the hugetlb page is made a
+ * surplus page and removed from the list.
+ * If are able to restore vmemmap for one hugetlb page, we free
+ * it and quit processing the list to retry the bulk operation.
+ */
+ list_for_each_entry_safe(folio, t_folio, list, lru)
if (hugetlb_vmemmap_restore(h, &folio->page)) {
spin_lock_irq(&hugetlb_lock);
add_hugetlb_folio(h, folio, true);
spin_unlock_irq(&hugetlb_lock);
- } else
- clear_dtor = true;
- }
+ } else {
+ list_del(&folio->lru);
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ update_and_free_hugetlb_folio(h, folio, false);
+ break;
+ }
}
+}
+
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+{
+ int ret;
+ unsigned long restored;
+ struct folio *folio, *t_folio;

/*
- * If vmemmmap allocation was performed on any folio above, take lock
- * to clear destructor of all folios on list. This avoids the need to
+ * First allocate required vmemmmap (if necessary) for all folios.
+ * Carefully handle ENOMEM errors and free up any available hugetlb
+ * pages in order to make forward progress.
+ */
+retry:
+ ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
+ if (ret == -ENOMEM) {
+ bulk_vmemmap_restore_enomem(h, list, restored);
+ goto retry;
+ }
+
+ /*
+ * If vmemmmap allocation was performed on ANY folio , take lock to
+ * clear destructor of all folios on list. This avoids the need to
* lock/unlock for each individual folio.
* The assumption is vmemmap allocation was performed on all or none
* of the folios on the list. This is true expect in VERY rare cases.
*/
- if (clear_dtor) {
+ if (restored) {
spin_lock_irq(&hugetlb_lock);
list_for_each_entry(folio, list, lru)
__clear_hugetlb_destructor(h, folio);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4558b814ffab..cc91edbfb68b 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -480,6 +480,45 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}

+/**
+ * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
+ * @h: struct hstate.
+ * @folio_list: list of folios.
+ * @restored: Set to number of folios for which vmemmap was restored
+ * successfully if caller passes a non-NULL pointer.
+ *
+ * Return: %0 if vmemmap exists for all folios on the list. If an error is
+ * encountered restoring vmemmap for ANY folio, an error code
+ * will be returned to the caller. It is then the responsibility
+ * of the caller to check the hugetlb vmemmap optimized flag of
+ * each folio to determine if vmemmap was actually restored.
+ * Note that processing is stopped when first error is encountered.
+ */
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ unsigned long *restored)
+{
+ unsigned long num_restored;
+ struct folio *folio;
+ int ret = 0;
+
+ num_restored = 0;
+ list_for_each_entry(folio, folio_list, lru) {
+ if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ ret = hugetlb_vmemmap_restore(h, &folio->page);
+ if (ret)
+ goto out;
+ else
+ num_restored++;
+ }
+ }
+
+out:
+ if (*restored)
+ *restored = num_restored;
+ return ret;
+}
+
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index c512e388dbb4..bb58453c3cc0 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -19,6 +19,8 @@

#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list, unsigned long *restored);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);

@@ -45,6 +47,15 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
return 0;
}

+static inline int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ unsigned long *restored)
+{
+ if (restored)
+ *restored = 0;
+ return 0;
+}
+
static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
}
--
2.41.0

2023-09-21 08:15:57

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] hugetlb: batch PMD split for bulk vmemmap dedup



> On Sep 20, 2023, at 18:39, Joao Martins <[email protected]> wrote:
>
> On 20/09/2023 03:47, Muchun Song wrote:
>>> On Sep 19, 2023, at 23:09, Joao Martins <[email protected]> wrote:
>>> On 19/09/2023 09:57, Muchun Song wrote:
>>>>> On Sep 19, 2023, at 16:55, Joao Martins <[email protected]> wrote:
>>>>> On 19/09/2023 09:41, Muchun Song wrote:
>>>>>>> On Sep 19, 2023, at 16:26, Joao Martins <[email protected]> wrote:
>>>>>>> On 19/09/2023 07:42, Muchun Song wrote:
>>>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>>>>> list_for_each_entry(folio, folio_list, lru) {
>>>>>>>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>>>>>>>> &vmemmap_pages);
>>>>>>>>
>>>>>>>> This is unlikely to be failed since the page table allocation
>>>>>>>> is moved to the above
>>>>>>>
>>>>>>>> (Note that the head vmemmap page allocation
>>>>>>>> is not mandatory).
>>>>>>>
>>>>>>> Good point that I almost forgot
>>>>>>>
>>>>>>>> So we should handle the error case in the above
>>>>>>>> splitting operation.
>>>>>>>
>>>>>>> But back to the previous discussion in v2... the thinking was that /some/ PMDs
>>>>>>> got split, and say could allow some PTE remapping to occur and free some pages
>>>>>>> back (each page allows 6 more splits worst case). Then the next
>>>>>>> __hugetlb_vmemmap_optimize() will have to split PMD pages again for those
>>>>>>> hugepages that failed the batch PMD split (as we only defer the PTE remap tlb
>>>>>>> flush in this stage).
>>>>>>
>>>>>> Oh, yes. Maybe we could break the above traversal as early as possible
>>>>>> once we enter an ENOMEM?
>>>>>>
>>>>>
>>>>> Sounds good -- no point in keep trying to split if we are failing with OOM.
>>>>>
>>>>> Perhaps a comment in both of these clauses (the early break on split and the OOM
>>>>> handling in batch optimize) could help make this clear.
>>>>
>>>> Make sense.
>>>
>>> These are the changes I have so far for this patch based on the discussion so
>>> far. For next one it's at the end:
>>
>> Code looks good to me. One nit below.
>>
> Thanks
>
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index e8bc2f7567db..d9c6f2cf698c 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -27,7 +27,8 @@
>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>> * or is mapped from.
>>> - * @flags: used to modify behavior in bulk operations
>>> + * @flags: used to modify behavior in vmemmap page table walking
>>> + * operations.
>>> */
>>> struct vmemmap_remap_walk {
>>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>>> @@ -36,6 +37,8 @@ struct vmemmap_remap_walk {
>>> struct page *reuse_page;
>>> unsigned long reuse_addr;
>>> struct list_head *vmemmap_pages;
>>> +
>>> +/* Skip the TLB flush when we split the PMD */
>>> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
>>> unsigned long flags;
>>> };
>>> @@ -132,7 +135,7 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>>> int ret;
>>>
>>> ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
>>> - walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH);
>>> + !(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH));
>>> if (ret)
>>> return ret;
>>>
>>> @@ -677,13 +680,13 @@ void hugetlb_vmemmap_optimize(const struct hstate *h,
>>> struct page *head)
>>> free_vmemmap_page_list(&vmemmap_pages);
>>> }
>>>
>>> -static void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>> +static int hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
>>> {
>>> unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
>>> unsigned long vmemmap_reuse;
>>>
>>> if (!vmemmap_should_optimize(h, head))
>>> - return;
>>> + return 0;
>>>
>>> vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
>>> vmemmap_reuse = vmemmap_start;
>>> @@ -693,7 +696,7 @@ static void hugetlb_vmemmap_split(const struct hstate *h,
>>> struct page *head)
>>> * Split PMDs on the vmemmap virtual address range [@vmemmap_start,
>>> * @vmemmap_end]
>>> */
>>> - vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>> + return vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
>>> }
>>>
>>> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head
>>> *folio_list)
>>> @@ -701,8 +704,18 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
>>> struct list_head *folio_l
>>> struct folio *folio;
>>> LIST_HEAD(vmemmap_pages);
>>>
>>> - list_for_each_entry(folio, folio_list, lru)
>>> - hugetlb_vmemmap_split(h, &folio->page);
>>> + list_for_each_entry(folio, folio_list, lru) {
>>> + int ret = hugetlb_vmemmap_split(h, &folio->page);
>>> +
>>> + /*
>>> + * Spliting the PMD requires allocating a page, thus lets fail
>> ^^^^ ^^^
>> Splitting page table page
>>
>> I'd like to specify the functionality of the allocated page.
>>
> OK
>
>>> + * early once we encounter the first OOM. No point in retrying
>>> + * as it can be dynamically done on remap with the memory
>>> + * we get back from the vmemmap deduplication.
>>> + */
>>> + if (ret == -ENOMEM)
>>> + break;
>>> + }
>>>
>>> flush_tlb_all();
>>>
>>> For patch 7, I only have commentary added derived from this earlier discussion
>>> above:
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index d9c6f2cf698c..f6a1020a4b6a 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -40,6 +40,8 @@ struct vmemmap_remap_walk {
>>>
>>> /* Skip the TLB flush when we split the PMD */
>>> #define VMEMMAP_SPLIT_NO_TLB_FLUSH BIT(0)
>>> +/* Skip the TLB flush when we remap the PTE */
>>> #define VMEMMAP_REMAP_NO_TLB_FLUSH BIT(1)
>>> unsigned long flags;
>>> };
>>>
>>> @@ -721,19 +739,28 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h,
>>> struct list_head *folio_l
>>>
>>> list_for_each_entry(folio, folio_list, lru) {
>>> int ret = __hugetlb_vmemmap_optimize(h, &folio->page,
>>> &vmemmap_pages,
>>> VMEMMAP_REMAP_NO_TLB_FLUSH);
>>>
>>> /*
>>> * Pages to be freed may have been accumulated. If we
>>> * encounter an ENOMEM, free what we have and try again.
>>> + * This can occur in the case that both spliting fails
>> ^^^
>> splitting
>>
>
> ok
>
>>> + * halfway and head page allocation also failed. In this
>> ^^^^^^^
>> head vmemmap page
>>
> ok
>
>> Otherwise:
>>
>> Reviewed-by: Muchun Song <[email protected]>
>>
>
> Thanks, I assume that's for both patches?

Yes. Thanks.

>
>> Thanks.
>>
>>> + * case __hugetlb_vmemmap_optimize() would free memory
>>> + * allowing more vmemmap remaps to occur.
>>> */
>>> if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {


2023-09-21 19:18:08

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



On 2023/9/21 09:12, Mike Kravetz wrote:
> On 09/20/23 11:03, Muchun Song wrote:
>>> On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
>>>> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
>>>> On 09/19/23 17:52, Muchun Song wrote:
>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>
>>>>> I still think we should free a non-optimized HugeTLB page if we
>>>>> encounter an OOM situation instead of continue to restore
>>>>> vemmmap pages. Restoring vmemmmap pages will only aggravate
>>>>> the OOM situation. The suitable appraoch is to free a non-optimized
>>>>> HugeTLB page to satisfy our allocation of vmemmap pages, what's
>>>>> your opinion, Mike?
>>>> I agree.
>>>>
>>>> As you mentioned previously, this may complicate this code path a bit.
>>>> I will rewrite to make this happen.
>>> Maybe we could introduced two list passed to update_and_free_pages_bulk (this
>>> will be easy for the callers of it), one is for non-optimized huge page,
>>> another is optimized one. In update_and_free_pages_bulk, we could first
>>> free those non-optimized huge page, and then restore vemmmap pages for
>>> those optimized ones, in which case, the code could be simple.
>>> hugetlb_vmemmap_restore_folios() dose not need to add complexity, which
>>> still continue to restore vmemmap pages and will stop once we encounter
>>> an OOM situation.
> I am not sure if passing in optimized and non-optimized lists to
> update_and_free_pages_bulk will help much. IIUC, it will almost always
> be the case where only one list has entries. Is that mostly accurate?

I think you are right. It will be less helpful since most of
pages will be not optimized when HVO is enabled.

>> BTW, maybe we should try again iff there are some non-optimized huge page
>> whose vmemmap pages are restored successfully previously and could be freed
>> first, then continue to restore the vmemmap pages of the remaining huge pages.
>> I think the retry code could be done in update_and_free_pages_bulk() as well.
> I came up with a new routine to handle these ENOMEM returns from
> hugetlb_vmemmap_restore_folios. I 'think' it handles these situations.
> Here is an updated version of this patch. Sorry, diff makes it a bit
> hard to read.
>
> From b13bdccb01730f995191944769f87d0725c289ad Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Sun, 10 Sep 2023 16:14:50 -0700
> Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages
>
> The routine update_and_free_pages_bulk already performs vmemmap
> restoration on the list of hugetlb pages in a separate step. In
> preparation for more functionality to be added in this step, create a
> new routine hugetlb_vmemmap_restore_folios() that will restore
> vmemmap for a list of folios.
>
> This new routine must provide sufficient feedback about errors and
> actual restoration performed so that update_and_free_pages_bulk can
> perform optimally.
>
> Special care must be taken when encountering a ENOMEM error from
> hugetlb_vmemmap_restore_folios. We want to continue making as much
> forward progress as possible. A new routine bulk_vmemmap_restore_enomem
> handles this specific situation.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 83 ++++++++++++++++++++++++++++++++++----------
> mm/hugetlb_vmemmap.c | 39 +++++++++++++++++++++
> mm/hugetlb_vmemmap.h | 11 ++++++
> 3 files changed, 115 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 70fedf8682c4..52abe56cf38a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1834,38 +1834,85 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> schedule_work(&free_hpage_work);
> }
>
> -static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +static void bulk_vmemmap_restore_enomem(struct hstate *h,
> + struct list_head *list,
> + unsigned long restored)
> {
> struct folio *folio, *t_folio;
> - bool clear_dtor = false;
>
> - /*
> - * First allocate required vmemmmap (if necessary) for all folios on
> - * list. If vmemmap can not be allocated, we can not free folio to
> - * lower level allocator, so add back as hugetlb surplus page.
> - * add_hugetlb_folio() removes the page from THIS list.
> - * Use clear_dtor to note if vmemmap was successfully allocated for
> - * ANY page on the list.
> - */
> - list_for_each_entry_safe(folio, t_folio, list, lru) {
> - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> + if (restored) {
> + /*
> + * On ENOMEM error, free any restored hugetlb pages so that
> + * restore of the entire list can be retried.
> + * The idea is that by freeing hugetlb pages with vmemmap
> + * (those previously restored) we will free up memory so that
> + * we can allocate vmemmap for more hugetlb pages.
> + * We must examine and possibly free EVERY hugetlb page on list
> + * in order to call hugetlb_vmemmap_restore_folios again.
> + * This is not optimal, but is an error case that should not
> + * happen frequently.
> + */
> + list_for_each_entry_safe(folio, t_folio, list, lru)
> + if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
> + list_del(&folio->lru);
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + update_and_free_hugetlb_folio(h, folio, false);
> + cond_resched();
> + }
> + } else {
> + /*
> + * In the case where vmemmap was not restored for ANY folios,
> + * we loop through them trying to restore individually in the
> + * hope that someone elsewhere may free enough memory.
> + * If unable to restore a page, the hugetlb page is made a
> + * surplus page and removed from the list.
> + * If are able to restore vmemmap for one hugetlb page, we free
> + * it and quit processing the list to retry the bulk operation.
> + */
> + list_for_each_entry_safe(folio, t_folio, list, lru)
> if (hugetlb_vmemmap_restore(h, &folio->page)) {
> spin_lock_irq(&hugetlb_lock);
> add_hugetlb_folio(h, folio, true);
> spin_unlock_irq(&hugetlb_lock);
> - } else
> - clear_dtor = true;
> - }
> + } else {
> + list_del(&folio->lru);
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + update_and_free_hugetlb_folio(h, folio, false);
> + break;
> + }
> }
> +}
> +
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +{
> + int ret;
> + unsigned long restored;
> + struct folio *folio, *t_folio;
>
> /*
> - * If vmemmmap allocation was performed on any folio above, take lock
> - * to clear destructor of all folios on list. This avoids the need to
> + * First allocate required vmemmmap (if necessary) for all folios.
> + * Carefully handle ENOMEM errors and free up any available hugetlb
> + * pages in order to make forward progress.
> + */
> +retry:
> + ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
> + if (ret == -ENOMEM) {
> + bulk_vmemmap_restore_enomem(h, list, restored);
> + goto retry;
> + }
> +
> + /*
> + * If vmemmmap allocation was performed on ANY folio , take lock to
> + * clear destructor of all folios on list. This avoids the need to
> * lock/unlock for each individual folio.
> * The assumption is vmemmap allocation was performed on all or none
> * of the folios on the list. This is true expect in VERY rare cases.
> */
> - if (clear_dtor) {
> + if (restored) {
> spin_lock_irq(&hugetlb_lock);
> list_for_each_entry(folio, list, lru)
> __clear_hugetlb_destructor(h, folio);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4558b814ffab..cc91edbfb68b 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -480,6 +480,45 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
> +/**
> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> + * @h: struct hstate.
> + * @folio_list: list of folios.
> + * @restored: Set to number of folios for which vmemmap was restored
> + * successfully if caller passes a non-NULL pointer.
> + *
> + * Return: %0 if vmemmap exists for all folios on the list. If an error is
> + * encountered restoring vmemmap for ANY folio, an error code
> + * will be returned to the caller. It is then the responsibility
> + * of the caller to check the hugetlb vmemmap optimized flag of
> + * each folio to determine if vmemmap was actually restored.
> + * Note that processing is stopped when first error is encountered.
> + */
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + unsigned long *restored)

How about changing parameter of @restored to a list_head type which
returns the non-optimized (previously) or vmemmap-restored-sucessful
huge pages?
In which case, the caller could traverse this returned list to free
them first like you have implemented in bulk_vmemmap_restore_enomem(),
it will be more efficient. The meaning of returned value should also
be changed accordingly since update_and_free_pages_bulk() wants to
whether there is a vmemmap-optimized huge page being restored sucessfully
to determine if it should clear hugetlb flag. So
hugetlb_vmemmap_restore_folios()
could return how many huge pages being restored successful, if a negative
number is returned meaning there is some error in the process of restoring
of vmemmap.

Thanks.

> +{
> + unsigned long num_restored;
> + struct folio *folio;
> + int ret = 0;
> +
> + num_restored = 0;
> + list_for_each_entry(folio, folio_list, lru) {
> + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> + ret = hugetlb_vmemmap_restore(h, &folio->page);
> + if (ret)
> + goto out;
> + else
> + num_restored++;
> + }
> + }
> +
> +out:
> + if (*restored)
> + *restored = num_restored;
> + return ret;
> +}
> +
> /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
> static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
> {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index c512e388dbb4..bb58453c3cc0 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -19,6 +19,8 @@
>
> #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list, unsigned long *restored);
> void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>
> @@ -45,6 +47,15 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
> return 0;
> }
>
> +static inline int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + unsigned long *restored)
> +{
> + if (restored)
> + *restored = 0;
> + return 0;
> +}
> +
> static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> {
> }

2023-09-21 19:28:50

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



On 2023/9/21 17:31, Muchun Song wrote:
>
>
> On 2023/9/21 09:12, Mike Kravetz wrote:
>> On 09/20/23 11:03, Muchun Song wrote:
>>>> On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
>>>>> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]>
>>>>> wrote:
>>>>> On 09/19/23 17:52, Muchun Song wrote:
>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>>>>>
>>>>>> I still think we should free a non-optimized HugeTLB page if we
>>>>>> encounter an OOM situation instead of continue to restore
>>>>>> vemmmap pages. Restoring vmemmmap pages will only aggravate
>>>>>> the OOM situation. The suitable appraoch is to free a non-optimized
>>>>>> HugeTLB page to satisfy our allocation of vmemmap pages, what's
>>>>>> your opinion, Mike?
>>>>> I agree.
>>>>>
>>>>> As you mentioned previously, this may complicate this code path a
>>>>> bit.
>>>>> I will rewrite to make this happen.
>>>> Maybe we could introduced two list passed to
>>>> update_and_free_pages_bulk (this
>>>> will be easy for the callers of it), one is for non-optimized huge
>>>> page,
>>>> another is optimized one. In update_and_free_pages_bulk, we could
>>>> first
>>>> free those non-optimized huge page, and then restore vemmmap pages for
>>>> those optimized ones, in which case, the code could be simple.
>>>> hugetlb_vmemmap_restore_folios() dose not need to add complexity,
>>>> which
>>>> still continue to restore vmemmap pages and will stop once we
>>>> encounter
>>>> an OOM situation.
>> I am not sure if passing in optimized and non-optimized lists to
>> update_and_free_pages_bulk will help much.  IIUC, it will almost always
>> be the case where only one list has entries.  Is that mostly accurate?
>
> I think you are right. It will be less helpful since most of
> pages will be not optimized when HVO is enabled.

Sorry, correction: **not** should be deleted.

>
>>> BTW, maybe we should try again iff there are some non-optimized huge
>>> page
>>> whose vmemmap pages are restored successfully previously and could
>>> be freed
>>> first, then continue to restore the vmemmap pages of the remaining
>>> huge pages.
>>> I think the retry code could be done in update_and_free_pages_bulk()
>>> as well.
>> I came up with a new routine to handle these ENOMEM returns from
>> hugetlb_vmemmap_restore_folios.  I 'think' it handles these situations.
>> Here is an updated version of this patch.  Sorry, diff makes it a bit
>> hard to read.
>>
>>  From b13bdccb01730f995191944769f87d0725c289ad Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <[email protected]>
>> Date: Sun, 10 Sep 2023 16:14:50 -0700
>> Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages
>>
>> The routine update_and_free_pages_bulk already performs vmemmap
>> restoration on the list of hugetlb pages in a separate step.  In
>> preparation for more functionality to be added in this step, create a
>> new routine hugetlb_vmemmap_restore_folios() that will restore
>> vmemmap for a list of folios.
>>
>> This new routine must provide sufficient feedback about errors and
>> actual restoration performed so that update_and_free_pages_bulk can
>> perform optimally.
>>
>> Special care must be taken when encountering a ENOMEM error from
>> hugetlb_vmemmap_restore_folios.  We want to continue making as much
>> forward progress as possible.  A new routine bulk_vmemmap_restore_enomem
>> handles this specific situation.
>>
>> Signed-off-by: Mike Kravetz <[email protected]>
>> ---
>>   mm/hugetlb.c         | 83 ++++++++++++++++++++++++++++++++++----------
>>   mm/hugetlb_vmemmap.c | 39 +++++++++++++++++++++
>>   mm/hugetlb_vmemmap.h | 11 ++++++
>>   3 files changed, 115 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 70fedf8682c4..52abe56cf38a 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1834,38 +1834,85 @@ static void
>> update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
>>           schedule_work(&free_hpage_work);
>>   }
>>   -static void update_and_free_pages_bulk(struct hstate *h, struct
>> list_head *list)
>> +static void bulk_vmemmap_restore_enomem(struct hstate *h,
>> +                        struct list_head *list,
>> +                        unsigned long restored)
>>   {
>>       struct folio *folio, *t_folio;
>> -    bool clear_dtor = false;
>>   -    /*
>> -     * First allocate required vmemmmap (if necessary) for all
>> folios on
>> -     * list.  If vmemmap can not be allocated, we can not free folio to
>> -     * lower level allocator, so add back as hugetlb surplus page.
>> -     * add_hugetlb_folio() removes the page from THIS list.
>> -     * Use clear_dtor to note if vmemmap was successfully allocated for
>> -     * ANY page on the list.
>> -     */
>> -    list_for_each_entry_safe(folio, t_folio, list, lru) {
>> -        if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>> +    if (restored) {
>> +        /*
>> +         * On ENOMEM error, free any restored hugetlb pages so that
>> +         * restore of the entire list can be retried.
>> +         * The idea is that by freeing hugetlb pages with vmemmap
>> +         * (those previously restored) we will free up memory so that
>> +         * we can allocate vmemmap for more hugetlb pages.
>> +         * We must examine and possibly free EVERY hugetlb page on list
>> +         * in order to call hugetlb_vmemmap_restore_folios again.
>> +         * This is not optimal, but is an error case that should not
>> +         * happen frequently.
>> +         */
>> +        list_for_each_entry_safe(folio, t_folio, list, lru)
>> +            if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>> +                list_del(&folio->lru);
>> +                spin_lock_irq(&hugetlb_lock);
>> +                __clear_hugetlb_destructor(h, folio);
>> +                spin_unlock_irq(&hugetlb_lock);
>> +                update_and_free_hugetlb_folio(h, folio, false);
>> +                cond_resched();
>> +            }
>> +    } else {
>> +        /*
>> +         * In the case where vmemmap was not restored for ANY folios,
>> +         * we loop through them trying to restore individually in the
>> +         * hope that someone elsewhere may free enough memory.
>> +         * If unable to restore a page, the hugetlb page is made a
>> +         * surplus page and removed from the list.
>> +         * If are able to restore vmemmap for one hugetlb page, we free
>> +         * it and quit processing the list to retry the bulk operation.
>> +         */
>> +        list_for_each_entry_safe(folio, t_folio, list, lru)
>>               if (hugetlb_vmemmap_restore(h, &folio->page)) {
>>                   spin_lock_irq(&hugetlb_lock);
>>                   add_hugetlb_folio(h, folio, true);
>>                   spin_unlock_irq(&hugetlb_lock);
>> -            } else
>> -                clear_dtor = true;
>> -        }
>> +            } else {
>> +                list_del(&folio->lru);
>> +                spin_lock_irq(&hugetlb_lock);
>> +                __clear_hugetlb_destructor(h, folio);
>> +                spin_unlock_irq(&hugetlb_lock);
>> +                update_and_free_hugetlb_folio(h, folio, false);
>> +                break;
>> +            }
>>       }
>> +}
>> +
>> +static void update_and_free_pages_bulk(struct hstate *h, struct
>> list_head *list)
>> +{
>> +    int ret;
>> +    unsigned long restored;
>> +    struct folio *folio, *t_folio;
>>         /*
>> -     * If vmemmmap allocation was performed on any folio above, take
>> lock
>> -     * to clear destructor of all folios on list.  This avoids the
>> need to
>> +     * First allocate required vmemmmap (if necessary) for all folios.
>> +     * Carefully handle ENOMEM errors and free up any available hugetlb
>> +     * pages in order to make forward progress.
>> +     */
>> +retry:
>> +    ret = hugetlb_vmemmap_restore_folios(h, list, &restored);
>> +    if (ret == -ENOMEM) {
>> +        bulk_vmemmap_restore_enomem(h, list, restored);
>> +        goto retry;
>> +    }
>> +
>> +    /*
>> +     * If vmemmmap allocation was performed on ANY folio , take lock to
>> +     * clear destructor of all folios on list.  This avoids the need to
>>        * lock/unlock for each individual folio.
>>        * The assumption is vmemmap allocation was performed on all or
>> none
>>        * of the folios on the list.  This is true expect in VERY rare
>> cases.
>>        */
>> -    if (clear_dtor) {
>> +    if (restored) {
>>           spin_lock_irq(&hugetlb_lock);
>>           list_for_each_entry(folio, list, lru)
>>               __clear_hugetlb_destructor(h, folio);
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 4558b814ffab..cc91edbfb68b 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -480,6 +480,45 @@ int hugetlb_vmemmap_restore(const struct hstate
>> *h, struct page *head)
>>       return ret;
>>   }
>>   +/**
>> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio
>> on the list.
>> + * @h:        struct hstate.
>> + * @folio_list:    list of folios.
>> + * @restored:    Set to number of folios for which vmemmap was restored
>> + *        successfully if caller passes a non-NULL pointer.
>> + *
>> + * Return: %0 if vmemmap exists for all folios on the list.  If an
>> error is
>> + *        encountered restoring vmemmap for ANY folio, an error code
>> + *        will be returned to the caller.  It is then the
>> responsibility
>> + *        of the caller to check the hugetlb vmemmap optimized flag of
>> + *        each folio to determine if vmemmap was actually restored.
>> + *        Note that processing is stopped when first error is
>> encountered.
>> + */
>> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
>> +                    struct list_head *folio_list,
>> +                    unsigned long *restored)
>
> How about changing parameter of @restored to a list_head type which
> returns the non-optimized (previously) or vmemmap-restored-sucessful
> huge pages?
> In which case, the caller could traverse this returned list to free
> them first like you have implemented in bulk_vmemmap_restore_enomem(),
> it will be more efficient. The meaning of returned value should also
> be changed accordingly since update_and_free_pages_bulk() wants to
> whether there is a vmemmap-optimized huge page being restored sucessfully
> to determine if it should clear hugetlb flag. So
> hugetlb_vmemmap_restore_folios()
> could return how many huge pages being restored successful, if a negative
> number is returned meaning there is some error in the process of
> restoring
> of vmemmap.
>
> Thanks.
>
>> +{
>> +    unsigned long num_restored;
>> +    struct folio *folio;
>> +    int ret = 0;
>> +
>> +    num_restored = 0;
>> +    list_for_each_entry(folio, folio_list, lru) {
>> +        if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>> +            ret = hugetlb_vmemmap_restore(h, &folio->page);
>> +            if (ret)
>> +                goto out;
>> +            else
>> +                num_restored++;
>> +        }
>> +    }
>> +
>> +out:
>> +    if (*restored)
>> +        *restored = num_restored;
>> +    return ret;
>> +}
>> +
>>   /* Return true iff a HugeTLB whose vmemmap should and can be
>> optimized. */
>>   static bool vmemmap_should_optimize(const struct hstate *h, const
>> struct page *head)
>>   {
>> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
>> index c512e388dbb4..bb58453c3cc0 100644
>> --- a/mm/hugetlb_vmemmap.h
>> +++ b/mm/hugetlb_vmemmap.h
>> @@ -19,6 +19,8 @@
>>     #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>>   int hugetlb_vmemmap_restore(const struct hstate *h, struct page
>> *head);
>> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
>> +            struct list_head *folio_list, unsigned long *restored);
>>   void hugetlb_vmemmap_optimize(const struct hstate *h, struct page
>> *head);
>>   void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct
>> list_head *folio_list);
>>   @@ -45,6 +47,15 @@ static inline int hugetlb_vmemmap_restore(const
>> struct hstate *h, struct page *h
>>       return 0;
>>   }
>>   +static inline int hugetlb_vmemmap_restore_folios(const struct
>> hstate *h,
>> +                    struct list_head *folio_list,
>> +                    unsigned long *restored)
>> +{
>> +    if (restored)
>> +        *restored = 0;
>> +    return 0;
>> +}
>> +
>>   static inline void hugetlb_vmemmap_optimize(const struct hstate *h,
>> struct page *head)
>>   {
>>   }
>

2023-09-22 01:05:39

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

On 09/21/23 17:31, Muchun Song wrote:
>
>
> On 2023/9/21 09:12, Mike Kravetz wrote:
> > On 09/20/23 11:03, Muchun Song wrote:
> > > > On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
> > > > > On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
> > > > > On 09/19/23 17:52, Muchun Song wrote:
> > > > > > On 2023/9/19 07:01, Mike Kravetz wrote:
> > +/**
> > + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> > + * @h: struct hstate.
> > + * @folio_list: list of folios.
> > + * @restored: Set to number of folios for which vmemmap was restored
> > + * successfully if caller passes a non-NULL pointer.
> > + *
> > + * Return: %0 if vmemmap exists for all folios on the list. If an error is
> > + * encountered restoring vmemmap for ANY folio, an error code
> > + * will be returned to the caller. It is then the responsibility
> > + * of the caller to check the hugetlb vmemmap optimized flag of
> > + * each folio to determine if vmemmap was actually restored.
> > + * Note that processing is stopped when first error is encountered.
> > + */
> > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > + struct list_head *folio_list,
> > + unsigned long *restored)
>
> How about changing parameter of @restored to a list_head type which
> returns the non-optimized (previously) or vmemmap-restored-sucessful huge
> pages?
> In which case, the caller could traverse this returned list to free
> them first like you have implemented in bulk_vmemmap_restore_enomem(),
> it will be more efficient. The meaning of returned value should also
> be changed accordingly since update_and_free_pages_bulk() wants to
> whether there is a vmemmap-optimized huge page being restored sucessfully
> to determine if it should clear hugetlb flag. So
> hugetlb_vmemmap_restore_folios()
> could return how many huge pages being restored successful, if a negative
> number is returned meaning there is some error in the process of restoring
> of vmemmap.
>

I had similar thoughts. An updated patch based on this approach is below.
When creating the patch, I discovered that using the function return code
for both number of vmemmap restored pages as well as error code was
unnecessary. Just checking !list_empty() of non-optimized pages tells us
if any were restored or could be freed.

From b79f6eeb7a11644830bddfc43d2219c149d26405 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Sun, 10 Sep 2023 16:14:50 -0700
Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages

The routine update_and_free_pages_bulk already performs vmemmap
restoration on the list of hugetlb pages in a separate step. In
preparation for more functionality to be added in this step, create a
new routine hugetlb_vmemmap_restore_folios() that will restore
vmemmap for a list of folios.

This new routine must provide sufficient feedback about errors and
actual restoration performed so that update_and_free_pages_bulk can
perform optimally.

Special care must be taken when encountering an error from
hugetlb_vmemmap_restore_folios. We want to continue making as much
forward progress as possible. A new routine bulk_vmemmap_restore_error
handles this specific situation.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 94 +++++++++++++++++++++++++++++++-------------
mm/hugetlb_vmemmap.c | 36 +++++++++++++++++
mm/hugetlb_vmemmap.h | 10 +++++
3 files changed, 112 insertions(+), 28 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 70fedf8682c4..11de3f885065 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1834,50 +1834,88 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
schedule_work(&free_hpage_work);
}

-static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+static void bulk_vmemmap_restore_error(struct hstate *h,
+ struct list_head *list,
+ struct list_head *non_op_folios)
{
struct folio *folio, *t_folio;
- bool clear_dtor = false;

- /*
- * First allocate required vmemmmap (if necessary) for all folios on
- * list. If vmemmap can not be allocated, we can not free folio to
- * lower level allocator, so add back as hugetlb surplus page.
- * add_hugetlb_folio() removes the page from THIS list.
- * Use clear_dtor to note if vmemmap was successfully allocated for
- * ANY page on the list.
- */
- list_for_each_entry_safe(folio, t_folio, list, lru) {
- if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ if (!list_empty(non_op_folios)) {
+ /*
+ * Free any restored hugetlb pages so that restore of the
+ * entire list can be retried.
+ * The idea is that in the common case of ENOMEM errors freeing
+ * hugetlb pages with vmemmap we will free up memory so that we
+ * can allocate vmemmap for more hugetlb pages.
+ */
+ list_for_each_entry_safe(folio, t_folio, non_op_folios, lru) {
+ list_del(&folio->lru);
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ update_and_free_hugetlb_folio(h, folio, false);
+ cond_resched();
+ }
+ } else {
+ /*
+ * In the case where vmemmap was not restored for ANY folios,
+ * we loop through them trying to restore individually in the
+ * hope that someone elsewhere may have done something to cause
+ * success (such as freeing some memory).
+ * If unable to restore a hugetlb page, the hugetlb page is
+ * made a surplus page and removed from the list.
+ * If are able to restore vmemmap for one hugetlb page, we free
+ * it and quit processing the list to retry the bulk operation.
+ */
+ list_for_each_entry_safe(folio, t_folio, list, lru)
if (hugetlb_vmemmap_restore(h, &folio->page)) {
spin_lock_irq(&hugetlb_lock);
add_hugetlb_folio(h, folio, true);
spin_unlock_irq(&hugetlb_lock);
- } else
- clear_dtor = true;
- }
+ } else {
+ list_del(&folio->lru);
+ spin_lock_irq(&hugetlb_lock);
+ __clear_hugetlb_destructor(h, folio);
+ spin_unlock_irq(&hugetlb_lock);
+ update_and_free_hugetlb_folio(h, folio, false);
+ cond_resched();
+ break;
+ }
}
+}
+
+static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
+{
+ int ret;
+ LIST_HEAD(non_op_folio);
+ struct folio *folio, *t_folio;

/*
- * If vmemmmap allocation was performed on any folio above, take lock
- * to clear destructor of all folios on list. This avoids the need to
- * lock/unlock for each individual folio.
- * The assumption is vmemmap allocation was performed on all or none
- * of the folios on the list. This is true expect in VERY rare cases.
+ * First allocate required vmemmmap (if necessary) for all folios.
+ * Carefully handle errors and free up any available hugetlb pages
+ * in an effort to make forward progress.
*/
- if (clear_dtor) {
+retry:
+ ret = hugetlb_vmemmap_restore_folios(h, list, &non_op_folio);
+ if (ret < 0) {
+ bulk_vmemmap_restore_error(h, list, &non_op_folio);
+ goto retry;
+ }
+
+ /*
+ * At this point, list should be empty, and there should only be
+ * pages on the non_op_folio list. free those entries. Do note
+ * that the non_op_folio list could be empty.
+ */
+ VM_WARN_ON(!list_empty(list));
+ if (!list_empty(&non_op_folio)) {
spin_lock_irq(&hugetlb_lock);
- list_for_each_entry(folio, list, lru)
+ list_for_each_entry(folio, &non_op_folio, lru)
__clear_hugetlb_destructor(h, folio);
spin_unlock_irq(&hugetlb_lock);
}

- /*
- * Free folios back to low level allocators. vmemmap and destructors
- * were taken care of above, so update_and_free_hugetlb_folio will
- * not need to take hugetlb lock.
- */
- list_for_each_entry_safe(folio, t_folio, list, lru) {
+ list_for_each_entry_safe(folio, t_folio, &non_op_folio, lru) {
update_and_free_hugetlb_folio(h, folio, false);
cond_resched();
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4558b814ffab..f827d4efcf8e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -480,6 +480,42 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
return ret;
}

+/**
+ * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
+ * @h: hstate.
+ * @folio_list: list of folios.
+ * @non_op_list: Output list of folio for which vmemmap exists.
+ *
+ * Return: %0 if vmemmap exists for all folios on the list and all entries have
+ * been moved to non_op_list. If an error is encountered restoring
+ * vmemmap for ANY folio, an error code will be returned to the
+ * caller. Processing en entries stops when the first error is
+ * encountered. folios processed before the error with vmemmap
+ * will reside on the non_op_list. The folio that experienced the
+ * error and all non-processed folios will remain on folio_list.
+ */
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ struct list_head *non_op_list)
+{
+ struct folio *folio, *t_folio;
+ int ret = 0;
+
+ list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
+ if (folio_test_hugetlb_vmemmap_optimized(folio)) {
+ ret = hugetlb_vmemmap_restore(h, &folio->page);
+ if (ret)
+ goto out;
+ }
+
+ /* Add non-optimized folios to output list */
+ list_move(&folio->lru, non_op_list);
+ }
+
+out:
+ return ret;
+}
+
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index c512e388dbb4..e6378ae5c5b6 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -19,6 +19,9 @@

#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
+int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ struct list_head *non_op_folios);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);

@@ -45,6 +48,13 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
return 0;
}

+static int hugetlb_vmemmap_restore_folios(const struct hstate *h,
+ struct list_head *folio_list,
+ struct list_head *non_op_folios)
+{
+ return 0;
+}
+
static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
{
}
--
2.41.0

2023-09-22 14:20:35

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages



On 2023/9/22 05:58, Mike Kravetz wrote:
> On 09/21/23 17:31, Muchun Song wrote:
>>
>> On 2023/9/21 09:12, Mike Kravetz wrote:
>>> On 09/20/23 11:03, Muchun Song wrote:
>>>>> On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
>>>>>> On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
>>>>>> On 09/19/23 17:52, Muchun Song wrote:
>>>>>>> On 2023/9/19 07:01, Mike Kravetz wrote:
>>> +/**
>>> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
>>> + * @h: struct hstate.
>>> + * @folio_list: list of folios.
>>> + * @restored: Set to number of folios for which vmemmap was restored
>>> + * successfully if caller passes a non-NULL pointer.
>>> + *
>>> + * Return: %0 if vmemmap exists for all folios on the list. If an error is
>>> + * encountered restoring vmemmap for ANY folio, an error code
>>> + * will be returned to the caller. It is then the responsibility
>>> + * of the caller to check the hugetlb vmemmap optimized flag of
>>> + * each folio to determine if vmemmap was actually restored.
>>> + * Note that processing is stopped when first error is encountered.
>>> + */
>>> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
>>> + struct list_head *folio_list,
>>> + unsigned long *restored)
>> How about changing parameter of @restored to a list_head type which
>> returns the non-optimized (previously) or vmemmap-restored-sucessful huge
>> pages?
>> In which case, the caller could traverse this returned list to free
>> them first like you have implemented in bulk_vmemmap_restore_enomem(),
>> it will be more efficient. The meaning of returned value should also
>> be changed accordingly since update_and_free_pages_bulk() wants to
>> whether there is a vmemmap-optimized huge page being restored sucessfully
>> to determine if it should clear hugetlb flag. So
>> hugetlb_vmemmap_restore_folios()
>> could return how many huge pages being restored successful, if a negative
>> number is returned meaning there is some error in the process of restoring
>> of vmemmap.
>>
> I had similar thoughts. An updated patch based on this approach is below.
> When creating the patch, I discovered that using the function return code
> for both number of vmemmap restored pages as well as error code was
> unnecessary. Just checking !list_empty() of non-optimized pages tells us
> if any were restored or could be freed.

I also thought about this. But there is a little different. If HVO
is disabled, we will always clear the hugetlb flag twice since the
list couldn't be empty, I thought it is an optimization for HVO-disabled
case.

>
> From b79f6eeb7a11644830bddfc43d2219c149d26405 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <[email protected]>
> Date: Sun, 10 Sep 2023 16:14:50 -0700
> Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages
>
> The routine update_and_free_pages_bulk already performs vmemmap
> restoration on the list of hugetlb pages in a separate step. In
> preparation for more functionality to be added in this step, create a
> new routine hugetlb_vmemmap_restore_folios() that will restore
> vmemmap for a list of folios.
>
> This new routine must provide sufficient feedback about errors and
> actual restoration performed so that update_and_free_pages_bulk can
> perform optimally.
>
> Special care must be taken when encountering an error from
> hugetlb_vmemmap_restore_folios. We want to continue making as much
> forward progress as possible. A new routine bulk_vmemmap_restore_error
> handles this specific situation.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> mm/hugetlb.c | 94 +++++++++++++++++++++++++++++++-------------
> mm/hugetlb_vmemmap.c | 36 +++++++++++++++++
> mm/hugetlb_vmemmap.h | 10 +++++
> 3 files changed, 112 insertions(+), 28 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 70fedf8682c4..11de3f885065 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1834,50 +1834,88 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> schedule_work(&free_hpage_work);
> }
>
> -static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +static void bulk_vmemmap_restore_error(struct hstate *h,
> + struct list_head *list,
> + struct list_head *non_op_folios)
> {
> struct folio *folio, *t_folio;
> - bool clear_dtor = false;
>
> - /*
> - * First allocate required vmemmmap (if necessary) for all folios on
> - * list. If vmemmap can not be allocated, we can not free folio to
> - * lower level allocator, so add back as hugetlb surplus page.
> - * add_hugetlb_folio() removes the page from THIS list.
> - * Use clear_dtor to note if vmemmap was successfully allocated for
> - * ANY page on the list.
> - */
> - list_for_each_entry_safe(folio, t_folio, list, lru) {
> - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> + if (!list_empty(non_op_folios)) {
> + /*
> + * Free any restored hugetlb pages so that restore of the
> + * entire list can be retried.
> + * The idea is that in the common case of ENOMEM errors freeing
> + * hugetlb pages with vmemmap we will free up memory so that we
> + * can allocate vmemmap for more hugetlb pages.
> + */
> + list_for_each_entry_safe(folio, t_folio, non_op_folios, lru) {
> + list_del(&folio->lru);
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + update_and_free_hugetlb_folio(h, folio, false);
> + cond_resched();
> + }
> + } else {
> + /*
> + * In the case where vmemmap was not restored for ANY folios,
> + * we loop through them trying to restore individually in the
> + * hope that someone elsewhere may have done something to cause
> + * success (such as freeing some memory).
> + * If unable to restore a hugetlb page, the hugetlb page is
> + * made a surplus page and removed from the list.
> + * If are able to restore vmemmap for one hugetlb page, we free
> + * it and quit processing the list to retry the bulk operation.
> + */
> + list_for_each_entry_safe(folio, t_folio, list, lru)
> if (hugetlb_vmemmap_restore(h, &folio->page)) {
> spin_lock_irq(&hugetlb_lock);
> add_hugetlb_folio(h, folio, true);
> spin_unlock_irq(&hugetlb_lock);
> - } else
> - clear_dtor = true;
> - }
> + } else {
> + list_del(&folio->lru);
> + spin_lock_irq(&hugetlb_lock);
> + __clear_hugetlb_destructor(h, folio);
> + spin_unlock_irq(&hugetlb_lock);
> + update_and_free_hugetlb_folio(h, folio, false);
> + cond_resched();
> + break;
> + }
> }
> +}
> +
> +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> +{
> + int ret;
> + LIST_HEAD(non_op_folio);
> + struct folio *folio, *t_folio;
>
> /*
> - * If vmemmmap allocation was performed on any folio above, take lock
> - * to clear destructor of all folios on list. This avoids the need to
> - * lock/unlock for each individual folio.
> - * The assumption is vmemmap allocation was performed on all or none
> - * of the folios on the list. This is true expect in VERY rare cases.
> + * First allocate required vmemmmap (if necessary) for all folios.
> + * Carefully handle errors and free up any available hugetlb pages
> + * in an effort to make forward progress.
> */
> - if (clear_dtor) {
> +retry:
> + ret = hugetlb_vmemmap_restore_folios(h, list, &non_op_folio);
> + if (ret < 0) {
> + bulk_vmemmap_restore_error(h, list, &non_op_folio);
> + goto retry;
> + }
> +
> + /*
> + * At this point, list should be empty, and there should only be
> + * pages on the non_op_folio list. free those entries. Do note
> + * that the non_op_folio list could be empty.
> + */
> + VM_WARN_ON(!list_empty(list));
> + if (!list_empty(&non_op_folio)) {
> spin_lock_irq(&hugetlb_lock);
> - list_for_each_entry(folio, list, lru)
> + list_for_each_entry(folio, &non_op_folio, lru)
> __clear_hugetlb_destructor(h, folio);
> spin_unlock_irq(&hugetlb_lock);
> }
>
> - /*
> - * Free folios back to low level allocators. vmemmap and destructors
> - * were taken care of above, so update_and_free_hugetlb_folio will
> - * not need to take hugetlb lock.
> - */
> - list_for_each_entry_safe(folio, t_folio, list, lru) {
> + list_for_each_entry_safe(folio, t_folio, &non_op_folio, lru) {
> update_and_free_hugetlb_folio(h, folio, false);
> cond_resched();
> }
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4558b814ffab..f827d4efcf8e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -480,6 +480,42 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> return ret;
> }
>
> +/**
> + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> + * @h: hstate.
> + * @folio_list: list of folios.
> + * @non_op_list: Output list of folio for which vmemmap exists.
> + *
> + * Return: %0 if vmemmap exists for all folios on the list and all entries have
> + * been moved to non_op_list. If an error is encountered restoring
> + * vmemmap for ANY folio, an error code will be returned to the
> + * caller. Processing en entries stops when the first error is
> + * encountered. folios processed before the error with vmemmap
> + * will reside on the non_op_list. The folio that experienced the
> + * error and all non-processed folios will remain on folio_list.
> + */
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + struct list_head *non_op_list)

non_optimized_list or vmemmap_intact_list? The abbreviation is not
straightforward.

> +{
> + struct folio *folio, *t_folio;
> + int ret = 0;
> +
> + list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
> + if (folio_test_hugetlb_vmemmap_optimized(folio)) {

hugetlb_vmemmap_restore() has this check as well, so it is unnecessary here.

> + ret = hugetlb_vmemmap_restore(h, &folio->page);
> + if (ret)
> + goto out;

Maybe we could drop the label ("out") and just breaking or returning from
here is enough.

> + }
> +
> + /* Add non-optimized folios to output list */
> + list_move(&folio->lru, non_op_list);
> + }
> +
> +out:
> + return ret;
> +}
> +
> /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
> static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
> {
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> index c512e388dbb4..e6378ae5c5b6 100644
> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -19,6 +19,9 @@
>
> #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + struct list_head *non_op_folios);

It is better to keep 3rd name (non_op_folios) consistent with where it is
defined (it is non_op_list).

> void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
>
> @@ -45,6 +48,13 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
> return 0;
> }
>
> +static int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> + struct list_head *folio_list,
> + struct list_head *non_op_folios)
> +{
> + return 0;
> +}
> +
> static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> {
> }

2023-09-22 19:53:51

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

On 09/22/23 10:01, Mike Kravetz wrote:
> On 09/22/23 16:19, Muchun Song wrote:
> > On 2023/9/22 05:58, Mike Kravetz wrote:
> > > On 09/21/23 17:31, Muchun Song wrote:
> > > > On 2023/9/21 09:12, Mike Kravetz wrote:
> > > > > On 09/20/23 11:03, Muchun Song wrote:
> > > + */
> > > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > > + struct list_head *folio_list,
> > > + struct list_head *non_op_list)
> >
> > non_optimized_list or vmemmap_intact_list? The abbreviation is not
> > straightforward.
> >
>
> Ok, I will be more specific. non_optimized_list is better.
>

I changed my mind.
The longer name caused 80 column line wrap that I didn't like. :)

I will use non_hvo_folios. The abbreviation hvo is pretty specific
in this context.
--
Mike Kravetz

2023-09-23 00:20:16

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] hugetlb: perform vmemmap restoration on a list of pages

On 09/22/23 16:19, Muchun Song wrote:
>
>
> On 2023/9/22 05:58, Mike Kravetz wrote:
> > On 09/21/23 17:31, Muchun Song wrote:
> > >
> > > On 2023/9/21 09:12, Mike Kravetz wrote:
> > > > On 09/20/23 11:03, Muchun Song wrote:
> > > > > > On Sep 20, 2023, at 10:56, Muchun Song <[email protected]> wrote:
> > > > > > > On Sep 20, 2023, at 04:57, Mike Kravetz <[email protected]> wrote:
> > > > > > > On 09/19/23 17:52, Muchun Song wrote:
> > > > > > > > On 2023/9/19 07:01, Mike Kravetz wrote:
> > > > +/**
> > > > + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> > > > + * @h: struct hstate.
> > > > + * @folio_list: list of folios.
> > > > + * @restored: Set to number of folios for which vmemmap was restored
> > > > + * successfully if caller passes a non-NULL pointer.
> > > > + *
> > > > + * Return: %0 if vmemmap exists for all folios on the list. If an error is
> > > > + * encountered restoring vmemmap for ANY folio, an error code
> > > > + * will be returned to the caller. It is then the responsibility
> > > > + * of the caller to check the hugetlb vmemmap optimized flag of
> > > > + * each folio to determine if vmemmap was actually restored.
> > > > + * Note that processing is stopped when first error is encountered.
> > > > + */
> > > > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > > > + struct list_head *folio_list,
> > > > + unsigned long *restored)
> > > How about changing parameter of @restored to a list_head type which
> > > returns the non-optimized (previously) or vmemmap-restored-sucessful huge
> > > pages?
> > > In which case, the caller could traverse this returned list to free
> > > them first like you have implemented in bulk_vmemmap_restore_enomem(),
> > > it will be more efficient. The meaning of returned value should also
> > > be changed accordingly since update_and_free_pages_bulk() wants to
> > > whether there is a vmemmap-optimized huge page being restored sucessfully
> > > to determine if it should clear hugetlb flag. So
> > > hugetlb_vmemmap_restore_folios()
> > > could return how many huge pages being restored successful, if a negative
> > > number is returned meaning there is some error in the process of restoring
> > > of vmemmap.
> > >
> > I had similar thoughts. An updated patch based on this approach is below.
> > When creating the patch, I discovered that using the function return code
> > for both number of vmemmap restored pages as well as error code was
> > unnecessary. Just checking !list_empty() of non-optimized pages tells us
> > if any were restored or could be freed.
>
> I also thought about this. But there is a little different. If HVO
> is disabled, we will always clear the hugetlb flag twice since the
> list couldn't be empty, I thought it is an optimization for HVO-disabled
> case.
>

Ah! Good point. We will clear twice with with the patch below if
HVO-disabled.
The reason I did not initially want to have return code be both number
restored and error code is that type int is not sufficient. Number of
pages is usually of type unsigned long, but we need a signed value for
error codes. type long should be sufficient for this case.

I will change it and associated logic.

> > From b79f6eeb7a11644830bddfc43d2219c149d26405 Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <[email protected]>
> > Date: Sun, 10 Sep 2023 16:14:50 -0700
> > Subject: [PATCH] hugetlb: perform vmemmap restoration on a list of pages
> >
> > The routine update_and_free_pages_bulk already performs vmemmap
> > restoration on the list of hugetlb pages in a separate step. In
> > preparation for more functionality to be added in this step, create a
> > new routine hugetlb_vmemmap_restore_folios() that will restore
> > vmemmap for a list of folios.
> >
> > This new routine must provide sufficient feedback about errors and
> > actual restoration performed so that update_and_free_pages_bulk can
> > perform optimally.
> >
> > Special care must be taken when encountering an error from
> > hugetlb_vmemmap_restore_folios. We want to continue making as much
> > forward progress as possible. A new routine bulk_vmemmap_restore_error
> > handles this specific situation.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > mm/hugetlb.c | 94 +++++++++++++++++++++++++++++++-------------
> > mm/hugetlb_vmemmap.c | 36 +++++++++++++++++
> > mm/hugetlb_vmemmap.h | 10 +++++
> > 3 files changed, 112 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 70fedf8682c4..11de3f885065 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1834,50 +1834,88 @@ static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> > schedule_work(&free_hpage_work);
> > }
> > -static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > +static void bulk_vmemmap_restore_error(struct hstate *h,
> > + struct list_head *list,
> > + struct list_head *non_op_folios)
> > {
> > struct folio *folio, *t_folio;
> > - bool clear_dtor = false;
> > - /*
> > - * First allocate required vmemmmap (if necessary) for all folios on
> > - * list. If vmemmap can not be allocated, we can not free folio to
> > - * lower level allocator, so add back as hugetlb surplus page.
> > - * add_hugetlb_folio() removes the page from THIS list.
> > - * Use clear_dtor to note if vmemmap was successfully allocated for
> > - * ANY page on the list.
> > - */
> > - list_for_each_entry_safe(folio, t_folio, list, lru) {
> > - if (folio_test_hugetlb_vmemmap_optimized(folio)) {
> > + if (!list_empty(non_op_folios)) {
> > + /*
> > + * Free any restored hugetlb pages so that restore of the
> > + * entire list can be retried.
> > + * The idea is that in the common case of ENOMEM errors freeing
> > + * hugetlb pages with vmemmap we will free up memory so that we
> > + * can allocate vmemmap for more hugetlb pages.
> > + */
> > + list_for_each_entry_safe(folio, t_folio, non_op_folios, lru) {
> > + list_del(&folio->lru);
> > + spin_lock_irq(&hugetlb_lock);
> > + __clear_hugetlb_destructor(h, folio);
> > + spin_unlock_irq(&hugetlb_lock);
> > + update_and_free_hugetlb_folio(h, folio, false);
> > + cond_resched();
> > + }
> > + } else {
> > + /*
> > + * In the case where vmemmap was not restored for ANY folios,
> > + * we loop through them trying to restore individually in the
> > + * hope that someone elsewhere may have done something to cause
> > + * success (such as freeing some memory).
> > + * If unable to restore a hugetlb page, the hugetlb page is
> > + * made a surplus page and removed from the list.
> > + * If are able to restore vmemmap for one hugetlb page, we free
> > + * it and quit processing the list to retry the bulk operation.
> > + */
> > + list_for_each_entry_safe(folio, t_folio, list, lru)
> > if (hugetlb_vmemmap_restore(h, &folio->page)) {
> > spin_lock_irq(&hugetlb_lock);
> > add_hugetlb_folio(h, folio, true);
> > spin_unlock_irq(&hugetlb_lock);
> > - } else
> > - clear_dtor = true;
> > - }
> > + } else {
> > + list_del(&folio->lru);
> > + spin_lock_irq(&hugetlb_lock);
> > + __clear_hugetlb_destructor(h, folio);
> > + spin_unlock_irq(&hugetlb_lock);
> > + update_and_free_hugetlb_folio(h, folio, false);
> > + cond_resched();
> > + break;
> > + }
> > }
> > +}
> > +
> > +static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
> > +{
> > + int ret;
> > + LIST_HEAD(non_op_folio);
> > + struct folio *folio, *t_folio;
> > /*
> > - * If vmemmmap allocation was performed on any folio above, take lock
> > - * to clear destructor of all folios on list. This avoids the need to
> > - * lock/unlock for each individual folio.
> > - * The assumption is vmemmap allocation was performed on all or none
> > - * of the folios on the list. This is true expect in VERY rare cases.
> > + * First allocate required vmemmmap (if necessary) for all folios.
> > + * Carefully handle errors and free up any available hugetlb pages
> > + * in an effort to make forward progress.
> > */
> > - if (clear_dtor) {
> > +retry:
> > + ret = hugetlb_vmemmap_restore_folios(h, list, &non_op_folio);
> > + if (ret < 0) {
> > + bulk_vmemmap_restore_error(h, list, &non_op_folio);
> > + goto retry;
> > + }
> > +
> > + /*
> > + * At this point, list should be empty, and there should only be
> > + * pages on the non_op_folio list. free those entries. Do note
> > + * that the non_op_folio list could be empty.
> > + */
> > + VM_WARN_ON(!list_empty(list));
> > + if (!list_empty(&non_op_folio)) {
> > spin_lock_irq(&hugetlb_lock);
> > - list_for_each_entry(folio, list, lru)
> > + list_for_each_entry(folio, &non_op_folio, lru)
> > __clear_hugetlb_destructor(h, folio);
> > spin_unlock_irq(&hugetlb_lock);
> > }
> > - /*
> > - * Free folios back to low level allocators. vmemmap and destructors
> > - * were taken care of above, so update_and_free_hugetlb_folio will
> > - * not need to take hugetlb lock.
> > - */
> > - list_for_each_entry_safe(folio, t_folio, list, lru) {
> > + list_for_each_entry_safe(folio, t_folio, &non_op_folio, lru) {
> > update_and_free_hugetlb_folio(h, folio, false);
> > cond_resched();
> > }
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 4558b814ffab..f827d4efcf8e 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -480,6 +480,42 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
> > return ret;
> > }
> > +/**
> > + * hugetlb_vmemmap_restore_folios - restore vmemmap for every folio on the list.
> > + * @h: hstate.
> > + * @folio_list: list of folios.
> > + * @non_op_list: Output list of folio for which vmemmap exists.
> > + *
> > + * Return: %0 if vmemmap exists for all folios on the list and all entries have
> > + * been moved to non_op_list. If an error is encountered restoring
> > + * vmemmap for ANY folio, an error code will be returned to the
> > + * caller. Processing en entries stops when the first error is
> > + * encountered. folios processed before the error with vmemmap
> > + * will reside on the non_op_list. The folio that experienced the
> > + * error and all non-processed folios will remain on folio_list.
> > + */
> > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > + struct list_head *folio_list,
> > + struct list_head *non_op_list)
>
> non_optimized_list or vmemmap_intact_list? The abbreviation is not
> straightforward.
>

Ok, I will be more specific. non_optimized_list is better.

> > +{
> > + struct folio *folio, *t_folio;
> > + int ret = 0;
> > +
> > + list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
> > + if (folio_test_hugetlb_vmemmap_optimized(folio)) {
>
> hugetlb_vmemmap_restore() has this check as well, so it is unnecessary here.
>

Not necessary in this code, but if we want to know if restore operation
was actually performed to return 'number restored' as discussed above,
this test and an additional counter will be required.

> > + ret = hugetlb_vmemmap_restore(h, &folio->page);
> > + if (ret)
> > + goto out;
>
> Maybe we could drop the label ("out") and just breaking or returning from
> here is enough.

Yes

> > + }
> > +
> > + /* Add non-optimized folios to output list */
> > + list_move(&folio->lru, non_op_list);
> > + }
> > +
> > +out:
> > + return ret;
> > +}
> > +
> > /* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
> > static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
> > {
> > diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> > index c512e388dbb4..e6378ae5c5b6 100644
> > --- a/mm/hugetlb_vmemmap.h
> > +++ b/mm/hugetlb_vmemmap.h
> > @@ -19,6 +19,9 @@
> > #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> > int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
> > +int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > + struct list_head *folio_list,
> > + struct list_head *non_op_folios);
>
> It is better to keep 3rd name (non_op_folios) consistent with where it is
> defined (it is non_op_list).
>

I think using non_optimized_folios everywhere will be consistent and
more descriptive.

Thanks for all your comments and suggestions!
--
Mike Kravetz

> > void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
> > void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
> > @@ -45,6 +48,13 @@ static inline int hugetlb_vmemmap_restore(const struct hstate *h, struct page *h
> > return 0;
> > }
> > +static int hugetlb_vmemmap_restore_folios(const struct hstate *h,
> > + struct list_head *folio_list,
> > + struct list_head *non_op_folios)
> > +{
> > + return 0;
> > +}
> > +
> > static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> > {
> > }
>