2023-10-06 03:22:53

by Mike Kravetz

Subject: [PATCH v7 0/8] Batch hugetlb vmemmap modification operations

When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit 426e5c429d16 [1]. The summary
states that allocating a hugetlb page should be ~2x slower with optimization
and freeing a hugetlb page should be ~2-3x slower. Such overhead was deemed
an acceptable trade-off for the memory savings obtained by freeing vmemmap
pages.

It was recently reported that the overhead associated with enabling vmemmap
optimization could be as high as 190x for hugetlb page allocations.
Yes, 190x! Some actual numbers from other environments are:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.119s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.477s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m28.973s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m36.748s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.463s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m2.931s

Unmodified next-20230824, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 2m27.609s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 2m29.924s

In the VM environment, the slowdown of enabling hugetlb vmemmap optimization
resulted in allocation times being 61x slower.

A quick profile showed that the vast majority of this overhead was due to
TLB flushing. Each time we modify the kernel pagetable we need to flush
the TLB. For each hugetlb page that is optimized, there could potentially be
two TLB flushes performed: one for the vmemmap pages associated with the
hugetlb page, and another if the vmemmap pages are mapped at the PMD level
and must be split. The TLB flushes required for the kernel pagetable result
in a broadcast IPI, with each CPU having to flush a range of pages, or do a
global flush if a threshold is exceeded. So, the flush time increases with
the number of CPUs. In addition, in virtual environments the broadcast IPI
cannot be accelerated by hypervisor hardware and leads to traps that need to
wake up/IPI all vCPUs, which is very expensive. Because of this, the slowdown
in virtual environments is even worse than bare metal as the number of
vCPUs/CPUs increases.

The following series attempts to reduce the amount of time spent in TLB
flushing. The idea is to batch the vmemmap modification operations for
multiple hugetlb pages. Instead of doing one or two TLB flushes for each
page, we do two TLB flushes for each batch of pages: one flush after
splitting pages mapped at the PMD level, and another after remapping the
vmemmap associated with all hugetlb pages. Results of such batching are as
follows:

Bare Metal 8 socket Intel(R) Xeon(R) CPU E7-8895
------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m4.719s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m4.245s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 500000 > .../hugepages-2048kB/nr_hugepages
real 0m7.267s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m13.199s

VM with 252 vcpus on host with 2 socket AMD EPYC 7J13 Milan
-----------------------------------------------------------
next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 0
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m2.715s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m3.186s

next-20230824 + Batching patches, vm.hugetlb_optimize_vmemmap = 1
time echo 524288 > .../hugepages-2048kB/nr_hugepages
real 0m4.799s
time echo 0 > .../hugepages-2048kB/nr_hugepages
real 0m5.273s

With batching, results are back in the 2-3x slowdown range.
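The scaling behind those numbers can be sketched with a back-of-the-envelope
model (ordinary userspace C, not kernel code; the function names are made up
for illustration): unbatched, the flush count grows with the number of pages,
while batched it is constant per list.

```c
/* Model only: count the kernel TLB flushes needed to optimize npages
 * hugetlb pages under the two schemes described above.
 *
 * Unbatched: up to one flush for the PMD split plus one for the
 * vmemmap remap, for every page. */
unsigned long flushes_per_page(unsigned long npages)
{
	return npages * 2;
}

/* Batched (this series): one flush after splitting all PMDs and one
 * after remapping the vmemmap for the whole list, regardless of how
 * many pages are in the batch. */
unsigned long flushes_batched(unsigned long npages)
{
	return npages ? 2 : 0;
}
```

For the 500000-page bare metal test above, the model gives 1000000 flushes
unbatched versus 2 batched; in practice the split flush is not always needed,
so this only illustrates the scaling, not exact counts.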

This series is based on mm-unstable (October 5).

Changes v6 -> v7:
- Fixed hugetlb_vmemmap_restore_folios stub for the
!CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP case
- Added Muchun RB for patches 4 and 8

Changes v5 -> v6:
- patch 4 in bulk_vmemmap_restore_error remove folio from list before
calling add_hugetlb_folio.
- Added Muchun RB for patches 2 and 3

Changes v4 -> v5:
- patch 3 comment style updated, removed unnecessary INIT_LIST_HEAD
- patch 4 updated hugetlb_vmemmap_restore_folios to pass back number of
restored folios in non-error case. In addition, routine passes back
list of folios with vmemmap. Naming more consistent.
- patch 5 removed over-optimization and added Muchun RB
- patch 6 break and early return in ENOMEM case. Updated comments.
Added Muchun RB.
- patch 7 Updated comments about splitting failure. Added Muchun RB.
- patch 8 Made comments consistent.

Changes v3 -> v4:
- Rebased on mm-unstable and dropped requisite patches.
- patch 2 updated to take bootmem vmemmap initialization into account
- patch 3 more changes for bootmem hugetlb pages. added routine
prep_and_add_bootmem_folios.
- patch 5 in hugetlb_vmemmap_optimize_folios on ENOMEM check for
list_empty before freeing and retry. This is more important in
subsequent patch where we flush_tlb_all after ENOMEM.

Changes v2 -> v3:
- patch 5 was part of an earlier series that was not picked up. It is
included here as it helps with batching optimizations.
- patch 6 hugetlb_vmemmap_restore_folios is changed from type void to
returning an error code as well as an additional output parameter providing
the number of folios for which vmemmap was actually restored. The caller can
then be more intelligent about processing the list.
- patch 9 eliminate local list in vmemmap_restore_pte. The routine
hugetlb_vmemmap_optimize_folios checks for ENOMEM and frees accumulated
vmemmap pages while processing the list.
- patch 10 introduce flags field to struct vmemmap_remap_walk and
VMEMMAP_SPLIT_NO_TLB_FLUSH for not flushing during pass to split PMDs.
- patch 11 rename flag VMEMMAP_REMAP_NO_TLB_FLUSH and pass in from callers.

Changes v1 -> v2:
- patch 5 now takes into account the requirement that only compound
pages with hugetlb flag set can be passed to vmemmap routines. This
involved separating the 'prep' of hugetlb pages even further. The code
dealing with bootmem allocations was also modified so that batching is
possible. Adding a 'batch' of hugetlb pages to their respective free
lists is now done in one lock cycle.
- patch 7 added description of routine hugetlb_vmemmap_restore_folios
(Muchun).
- patch 8 rename bulk_pages to vmemmap_pages and let caller be responsible
for freeing (Muchun)
- patch 9 use 'walk->remap_pte' to determine if a split only operation
is being performed (Muchun). Removed unused variable and
hugetlb_optimize_vmemmap_key (Muchun).
- Patch 10 pass 'flags variable' instead of bool to indicate behavior and
allow for future expansion (Muchun). Single flag VMEMMAP_NO_TLB_FLUSH.
Provide detailed comment about the need to keep old and new vmemmap pages
in sync (Muchun).
- Patch 11 pass flag variable as in patch 10 (Muchun).

Joao Martins (2):
hugetlb: batch PMD split for bulk vmemmap dedup
hugetlb: batch TLB flushes when freeing vmemmap

Mike Kravetz (6):
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
hugetlb: restructure pool allocations
hugetlb: perform vmemmap optimization on a list of pages
hugetlb: perform vmemmap restoration on a list of pages
hugetlb: batch freeing of vmemmap pages
hugetlb: batch TLB flushes when restoring vmemmap

mm/hugetlb.c | 301 ++++++++++++++++++++++++++++++++++++-------
mm/hugetlb_vmemmap.c | 273 +++++++++++++++++++++++++++++++++------
mm/hugetlb_vmemmap.h | 16 +++
3 files changed, 507 insertions(+), 83 deletions(-)

--
2.41.0


2023-10-06 03:22:53

by Mike Kravetz

Subject: [PATCH v7 3/8] hugetlb: perform vmemmap optimization on a list of pages

When adding hugetlb pages to the pool, we first create a list of the
allocated pages before adding to the pool. Pass this list of pages to a
new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization.

Due to significant differences in vmemmap initialization for bootmem
allocated hugetlb pages, a new routine prep_and_add_bootmem_folios
is created.

We also modify the routine vmemmap_should_optimize() to check for pages
that are already optimized. There are code paths that might request
vmemmap optimization twice and we want to make sure this is not
attempted.
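The double-optimization guard can be modeled in a few lines of ordinary C
(userspace sketch, not the kernel code; struct and function names here are
hypothetical stand-ins for the folio list walk and the
HPageVmemmapOptimized flag):

```c
#include <stddef.h>

/* Model only: walk a list of "folios", optimizing each one and
 * skipping entries already optimized, so a second pass over the same
 * list is a no-op. */
struct folio_model {
	int optimized;			/* stands in for the optimized flag */
	struct folio_model *next;
};

int optimize_folios(struct folio_model *head)
{
	int n = 0;

	for (struct folio_model *f = head; f != NULL; f = f->next) {
		if (f->optimized)
			continue;	/* already optimized: do not retry */
		f->optimized = 1;
		n++;
	}
	return n;			/* folios newly optimized this pass */
}
```

A first pass returns the list length; any repeated pass returns 0, which is
the property the vmemmap_should_optimize() check provides.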

Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
---
mm/hugetlb.c | 42 ++++++++++++++++++++++++++++++++++--------
mm/hugetlb_vmemmap.c | 11 +++++++++++
mm/hugetlb_vmemmap.h | 5 +++++
3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4ccb54824daa..2df9435afa48 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2249,6 +2249,9 @@ static void prep_and_add_allocated_folios(struct hstate *h,
{
struct folio *folio, *tmp_f;

+ /* Send list for bulk vmemmap optimization processing */
+ hugetlb_vmemmap_optimize_folios(h, folio_list);
+
/* Add all new pool pages to free lists in one lock cycle */
spin_lock_irq(&hugetlb_lock);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
@@ -3287,6 +3290,34 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
prep_compound_head((struct page *)folio, huge_page_order(h));
}

+static void __init prep_and_add_bootmem_folios(struct hstate *h,
+ struct list_head *folio_list)
+{
+ struct folio *folio, *tmp_f;
+
+ /* Send list for bulk vmemmap optimization processing */
+ hugetlb_vmemmap_optimize_folios(h, folio_list);
+
+ /* Add all new pool pages to free lists in one lock cycle */
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+ if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
+ /*
+ * If HVO fails, initialize all tail struct pages
+ * We do not worry about potential long lock hold
+ * time as this is early in boot and there should
+ * be no contention.
+ */
+ hugetlb_folio_init_tail_vmemmap(folio,
+ HUGETLB_VMEMMAP_RESERVE_PAGES,
+ pages_per_huge_page(h));
+ }
+ __prep_account_new_huge_page(h, folio_nid(folio));
+ enqueue_hugetlb_folio(h, folio);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+}
+
/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_ORDER) pages.
@@ -3307,7 +3338,7 @@ static void __init gather_bootmem_prealloc(void)
* in this list. If so, process each size separately.
*/
if (h != prev_h && prev_h != NULL)
- prep_and_add_allocated_folios(prev_h, &folio_list);
+ prep_and_add_bootmem_folios(prev_h, &folio_list);
prev_h = h;

VM_BUG_ON(!hstate_is_gigantic(h));
@@ -3315,12 +3346,7 @@ static void __init gather_bootmem_prealloc(void)

hugetlb_folio_init_vmemmap(folio, h,
HUGETLB_VMEMMAP_RESERVE_PAGES);
- __prep_new_hugetlb_folio(h, folio);
- /* If HVO fails, initialize all tail struct pages */
- if (!HPageVmemmapOptimized(&folio->page))
- hugetlb_folio_init_tail_vmemmap(folio,
- HUGETLB_VMEMMAP_RESERVE_PAGES,
- pages_per_huge_page(h));
+ init_new_hugetlb_folio(h, folio);
list_add(&folio->lru, &folio_list);

/*
@@ -3332,7 +3358,7 @@ static void __init gather_bootmem_prealloc(void)
cond_resched();
}

- prep_and_add_allocated_folios(h, &folio_list);
+ prep_and_add_bootmem_folios(h, &folio_list);
}

static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 76682d1d79a7..4558b814ffab 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -483,6 +483,9 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
/* Return true iff a HugeTLB whose vmemmap should and can be optimized. */
static bool vmemmap_should_optimize(const struct hstate *h, const struct page *head)
{
+ if (HPageVmemmapOptimized((struct page *)head))
+ return false;
+
if (!READ_ONCE(vmemmap_optimize_enabled))
return false;

@@ -572,6 +575,14 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
SetHPageVmemmapOptimized(head);
}

+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+ struct folio *folio;
+
+ list_for_each_entry(folio, folio_list, lru)
+ hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
static struct ctl_table hugetlb_vmemmap_sysctls[] = {
{
.procname = "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 4573899855d7..c512e388dbb4 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -20,6 +20,7 @@
#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head);
void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head);
+void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);

static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
{
@@ -48,6 +49,10 @@ static inline void hugetlb_vmemmap_optimize(const struct hstate *h, struct page
{
}

+static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
+{
+}
+
static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
{
return 0;
--
2.41.0

2023-10-06 03:24:03

by Mike Kravetz

Subject: [PATCH v7 2/8] hugetlb: restructure pool allocations

Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page. This routine will allocate contiguous pages from
a low level allocator, prep the pages for usage as a hugetlb page and
then add the resulting hugetlb page to the pool.

In the 'prep' stage, optional vmemmap optimization is done. For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once. To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.

The code to allocate hugetlb pages from bootmem was also modified to
allow batching.

No functional changes, only code restructure.

Signed-off-by: Mike Kravetz <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
---
mm/hugetlb.c | 179 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 140 insertions(+), 39 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d7679d37d072..4ccb54824daa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1964,16 +1964,21 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
h->nr_huge_pages_node[nid]++;
}

-static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
{
folio_set_hugetlb(folio);
- hugetlb_vmemmap_optimize(h, &folio->page);
INIT_LIST_HEAD(&folio->lru);
hugetlb_set_folio_subpool(folio, NULL);
set_hugetlb_cgroup(folio, NULL);
set_hugetlb_cgroup_rsvd(folio, NULL);
}

+static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+{
+ init_new_hugetlb_folio(h, folio);
+ hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid)
{
__prep_new_hugetlb_folio(h, folio);
@@ -2170,16 +2175,9 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
return page_folio(page);
}

-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen': ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
- gfp_t gfp_mask, int nid, nodemask_t *nmask,
- nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
bool retry = false;
@@ -2192,6 +2190,7 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
nid, nmask, node_alloc_noretry);
if (!folio)
return NULL;
+
if (hstate_is_gigantic(h)) {
if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
/*
@@ -2206,32 +2205,80 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
return NULL;
}
}
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));

return folio;
}

+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
+{
+ struct folio *folio;
+
+ folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+ if (folio)
+ init_new_hugetlb_folio(h, folio);
+ return folio;
+}
+
/*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen': ref count of head page and all tail
+ * pages is zero.
*/
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+ gfp_t gfp_mask, int nid, nodemask_t *nmask,
+ nodemask_t *node_alloc_noretry)
{
struct folio *folio;
- int nr_nodes, node;
+
+ folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+ node_alloc_noretry);
+ if (!folio)
+ return NULL;
+
+ prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+ return folio;
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+ struct list_head *folio_list)
+{
+ struct folio *folio, *tmp_f;
+
+ /* Add all new pool pages to free lists in one lock cycle */
+ spin_lock_irq(&hugetlb_lock);
+ list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+ __prep_account_new_huge_page(h, folio_nid(folio));
+ enqueue_hugetlb_folio(h, folio);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner. The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+ nodemask_t *nodes_allowed,
+ nodemask_t *node_alloc_noretry)
+{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+ int nr_nodes, node;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
- folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+ struct folio *folio;
+
+ folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
nodes_allowed, node_alloc_noretry);
- if (folio) {
- free_huge_folio(folio); /* free it into the hugepage allocator */
- return 1;
- }
+ if (folio)
+ return folio;
}

- return 0;
+ return NULL;
}

/*
@@ -3246,25 +3293,35 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
*/
static void __init gather_bootmem_prealloc(void)
{
+ LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
+ struct hstate *h, *prev_h = NULL;

list_for_each_entry(m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
- struct hstate *h = m->hstate;
+
+ h = m->hstate;
+ /*
+ * It is possible to have multiple huge page sizes (hstates)
+ * in this list. If so, process each size separately.
+ */
+ if (h != prev_h && prev_h != NULL)
+ prep_and_add_allocated_folios(prev_h, &folio_list);
+ prev_h = h;

VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);

hugetlb_folio_init_vmemmap(folio, h,
HUGETLB_VMEMMAP_RESERVE_PAGES);
- prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+ __prep_new_hugetlb_folio(h, folio);
/* If HVO fails, initialize all tail struct pages */
if (!HPageVmemmapOptimized(&folio->page))
hugetlb_folio_init_tail_vmemmap(folio,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
- free_huge_folio(folio); /* add to the hugepage allocator */
+ list_add(&folio->lru, &folio_list);

/*
* We need to restore the 'stolen' pages to totalram_pages
@@ -3274,6 +3331,8 @@ static void __init gather_bootmem_prealloc(void)
adjust_managed_page_count(page, pages_per_huge_page(h));
cond_resched();
}
+
+ prep_and_add_allocated_folios(h, &folio_list);
}

static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3307,9 +3366,22 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}

+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ * pages are allocated from memblock allocated or something similar.
+ * Gigantic pages are actually added to pools later with the routine
+ * gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ * all of mm is up and functional. Pages are allocated from buddy and
+ * then added to hugetlb pools.
+ */
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
bool node_specific_alloc = false;

@@ -3351,14 +3423,25 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)

for (i = 0; i < h->max_huge_pages; ++i) {
if (hstate_is_gigantic(h)) {
+ /*
+ * gigantic pages not added to list as they are not
+ * added to pools now.
+ */
if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
break;
- } else if (!alloc_pool_huge_page(h,
- &node_states[N_MEMORY],
- node_alloc_noretry))
- break;
+ } else {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ }
cond_resched();
}
+
+ /* list will be empty if hstate_is_gigantic */
+ prep_and_add_allocated_folios(h, &folio_list);
+
if (i < h->max_huge_pages) {
char buf[32];

@@ -3492,7 +3575,9 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
nodemask_t *nodes_allowed)
{
- unsigned long min_count, ret;
+ unsigned long min_count;
+ unsigned long allocated;
+ struct folio *folio;
LIST_HEAD(page_list);
NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);

@@ -3569,7 +3654,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
break;
}

- while (count > persistent_huge_pages(h)) {
+ allocated = 0;
+ while (count > (persistent_huge_pages(h) + allocated)) {
/*
* If this allocation races such that we no longer need the
* page, free_huge_folio will handle it by freeing the page
@@ -3580,15 +3666,32 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
/* yield cpu to avoid soft lockup */
cond_resched();

- ret = alloc_pool_huge_page(h, nodes_allowed,
+ folio = alloc_pool_huge_folio(h, nodes_allowed,
node_alloc_noretry);
- spin_lock_irq(&hugetlb_lock);
- if (!ret)
+ if (!folio) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ list_add(&folio->lru, &page_list);
+ allocated++;

/* Bail for signals. Probably ctrl-c from user */
- if (signal_pending(current))
+ if (signal_pending(current)) {
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
goto out;
+ }
+
+ spin_lock_irq(&hugetlb_lock);
+ }
+
+ /* Add allocated pages to the pool */
+ if (!list_empty(&page_list)) {
+ spin_unlock_irq(&hugetlb_lock);
+ prep_and_add_allocated_folios(h, &page_list);
+ spin_lock_irq(&hugetlb_lock);
}

/*
@@ -3614,8 +3717,6 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
* Collect pages to be removed on list without dropping lock
*/
while (min_count < persistent_huge_pages(h)) {
- struct folio *folio;
-
folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
if (!folio)
break;
--
2.41.0

2023-10-11 06:48:40

by Sergey Senozhatsky

Subject: Bisected: [PATCH v7 2/8] hugetlb: restructure pool allocations

On (23/10/05 20:20), Mike Kravetz wrote:
> Allocation of a hugetlb page for the hugetlb pool is done by the routine
> alloc_pool_huge_page. This routine will allocate contiguous pages from
> a low level allocator, prep the pages for usage as a hugetlb page and
> then add the resulting hugetlb page to the pool.
>
> In the 'prep' stage, optional vmemmap optimization is done. For
> performance reasons we want to perform vmemmap optimization on multiple
> hugetlb pages at once. To do this, restructure the hugetlb pool
> allocation code such that vmemmap optimization can be isolated and later
> batched.
>
> The code to allocate hugetlb pages from bootmem was also modified to
> allow batching.
>
> No functional changes, only code restructure.

I'm afraid there are some functional changes.

[...]
# good: [9e6c86c616f7d4b166c12f1ea9b69831f85c4a86] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
git bisect good 9e6c86c616f7d4b166c12f1ea9b69831f85c4a86
# bad: [1d50db09471e2a67272cee6e004ffed380ac571b] Merge branch 'master' of git://linuxtv.org/mchehab/media-next.git
git bisect bad 1d50db09471e2a67272cee6e004ffed380ac571b
# good: [80b14e865ea20ddc20aae61e2c106ebb03257cd3] bcachefs: Lower BCH_NAME_MAX to 512
git bisect good 80b14e865ea20ddc20aae61e2c106ebb03257cd3
# bad: [31d068de8a0de2c44168bbd8a62da21a7bc76051] Merge branch 'for-linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux.git
git bisect bad 31d068de8a0de2c44168bbd8a62da21a7bc76051
# bad: [0bb194b29fa6296a74b989e0f7f2db4fc11d8012] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git
git bisect bad 0bb194b29fa6296a74b989e0f7f2db4fc11d8012
# good: [62001c9bf6aad59dc800c16613e5440b9226c252] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan.git
git bisect good 62001c9bf6aad59dc800c16613e5440b9226c252
# good: [21d856352ab78daffb9d05296b87b570f3afac33] Merge branch 'mm-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good 21d856352ab78daffb9d05296b87b570f3afac33
# good: [7fded33c6971b6c8e87cbbf48e74536aacca2991] perf test: Add pmu-event test for "Compat" and new event_field.
git bisect good 7fded33c6971b6c8e87cbbf48e74536aacca2991
# good: [4c37b665186a5e2907bf0308474ac3f15eb847da] compiler.h: move __is_constexpr() to compiler.h
git bisect good 4c37b665186a5e2907bf0308474ac3f15eb847da
# bad: [3424f8f382bd4d30e01eaf316823321ef7bd1560] mm: delete checks for xor_unlock_is_negative_byte()
git bisect bad 3424f8f382bd4d30e01eaf316823321ef7bd1560
# bad: [b5c6a60fe5b0339ba9c54f9f871db5c4a0e47906] iomap: protect read_bytes_pending with the state_lock
git bisect bad b5c6a60fe5b0339ba9c54f9f871db5c4a0e47906
# bad: [bfae92330ddc44968c628c0085082d25061495a4] hugetlb: batch PMD split for bulk vmemmap dedup
git bisect bad bfae92330ddc44968c628c0085082d25061495a4
# bad: [fb59f2cb8956f3888d2e0b438cc503565aa3c405] hugetlb: perform vmemmap optimization on a list of pages
git bisect bad fb59f2cb8956f3888d2e0b438cc503565aa3c405
# bad: [bfb41d6b2fe148c939366957ea9cb9aa72d59c4c] hugetlb: restructure pool allocations
git bisect bad bfb41d6b2fe148c939366957ea9cb9aa72d59c4c
# first bad commit: [bfb41d6b2fe148c939366957ea9cb9aa72d59c4c] hugetlb: restructure pool allocations

bfb41d6b2fe148c939366957ea9cb9aa72d59c4c is the first bad commit

commit bfb41d6b2fe148c939366957ea9cb9aa72d59c4c
Author: Mike Kravetz
Date: Thu Oct 5 20:20:04 2023 -0700

hugetlb: restructure pool allocations


BUG on linux-next:

[ 0.827640] ------------[ cut here ]------------
[ 0.828608] kernel BUG at mm/hugetlb.c:3180!
[ 0.829812] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 0.830610] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc4+ #164
[ 0.830610] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 0.830610] RIP: 0010:gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.830610] Code: 48 89 e6 4c 89 ff e8 aa 13 83 fe 65 48 8b 04 25 28 00 00 00 48 3b 44 24 10 75 11 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e8 64 f9 04 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57
[ 0.830610] RSP: 0000:ffffc90000017b00 EFLAGS: 00010297
[ 0.830610] RAX: ffffc90000017b00 RBX: ffffea0000000000 RCX: ffffffff83847358
[ 0.830610] RDX: ffffffff83847358 RSI: fffffffffffffec8 RDI: 0000000000000000
[ 0.830610] RBP: 0000000000000000 R08: ffff8881030bc708 R09: ffff8881030bc710
[ 0.830610] R10: 0000000000000001 R11: 000077791a770248 R12: ffffffff82224228
[ 0.830610] R13: 0000000000000001 R14: ffffffff82ae6630 R15: ffffffff83847220
[ 0.830610] FS: 0000000000000000(0000) GS:ffff888661e00000(0000) knlGS:0000000000000000
[ 0.830610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.830610] CR2: ffff88867ffff000 CR3: 0000000002246001 CR4: 0000000000770ef0
[ 0.830610] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.830610] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.830610] PKRU: 55555554
[ 0.830610] Call Trace:
[ 0.830610] <TASK>
[ 0.830610] ? __die_body+0x67/0xb0
[ 0.830610] ? die+0xa0/0xc0
[ 0.830610] ? do_trap+0xa8/0x180
[ 0.830610] ? gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.830610] ? do_error_trap+0xc6/0x110
[ 0.830610] ? gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.830610] ? handle_invalid_op+0x25/0x30
[ 0.830610] ? gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.830610] ? exc_invalid_op+0x2f/0x40
[ 0.830610] ? asm_exc_invalid_op+0x16/0x20
[ 0.830610] ? gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.830610] ? __alloc_bootmem_huge_page+0x120/0x120
[ 0.830610] hugetlb_init+0x14a/0x280
[ 0.830610] ? _raw_spin_unlock_irqrestore+0x3d/0x60
[ 0.830610] ? __alloc_bootmem_huge_page+0x120/0x120
[ 0.830610] do_one_initcall+0xce/0x2d0
[ 0.830610] ? __lock_acquire+0x5db/0x2d40
[ 0.830610] ? ida_alloc_range+0xaf/0x3e0
[ 0.830610] ? proc_register+0x4e/0x1b0
[ 0.830610] ? __kmem_cache_alloc_node+0x2f/0x200
[ 0.830610] ? lock_is_held_type+0xdd/0x150
[ 0.830610] ? do_initcalls+0x1c/0x70
[ 0.830610] ? parse_args+0x16f/0x340
[ 0.830610] do_initcall_level+0xab/0x110
[ 0.830610] ? kernel_init+0x16/0x190
[ 0.830610] do_initcalls+0x3f/0x70
[ 0.830610] kernel_init_freeable+0x15c/0x1d0
[ 0.830610] ? rest_init+0x1f0/0x1f0
[ 0.830610] kernel_init+0x16/0x190
[ 0.830610] ret_from_fork+0x2f/0x40
[ 0.830610] ? rest_init+0x1f0/0x1f0
[ 0.830610] ret_from_fork_asm+0x11/0x20
[ 0.830610] </TASK>
[ 0.830610] Modules linked in:
[ 0.830614] ---[ end trace 0000000000000000 ]---
[ 0.831896] RIP: 0010:gather_bootmem_prealloc+0x1c5/0x1d0
[ 0.833386] Code: 48 89 e6 4c 89 ff e8 aa 13 83 fe 65 48 8b 04 25 28 00 00 00 48 3b 44 24 10 75 11 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b e8 64 f9 04 ff 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57
[ 0.833947] RSP: 0000:ffffc90000017b00 EFLAGS: 00010297
[ 0.835384] RAX: ffffc90000017b00 RBX: ffffea0000000000 RCX: ffffffff83847358
[ 0.837279] RDX: ffffffff83847358 RSI: fffffffffffffec8 RDI: 0000000000000000
[ 0.839237] RBP: 0000000000000000 R08: ffff8881030bc708 R09: ffff8881030bc710
[ 0.840613] R10: 0000000000000001 R11: 000077791a770248 R12: ffffffff82224228
[ 0.842567] R13: 0000000000000001 R14: ffffffff82ae6630 R15: ffffffff83847220
[ 0.843946] FS: 0000000000000000(0000) GS:ffff888661e00000(0000) knlGS:0000000000000000
[ 0.846170] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.847279] CR2: ffff88867ffff000 CR3: 0000000002246001 CR4: 0000000000770ef0
[ 0.849242] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.850613] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.852581] PKRU: 55555554
[ 0.853336] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 0.853944] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

2023-10-18 22:45:44

by Mike Kravetz

Subject: Re: Bisected: [PATCH v7 2/8] hugetlb: restructure pool allocations

On 10/11/23 15:47, Sergey Senozhatsky wrote:
> On (23/10/05 20:20), Mike Kravetz wrote:
> > Allocation of a hugetlb page for the hugetlb pool is done by the routine
> > alloc_pool_huge_page. This routine will allocate contiguous pages from
> > a low level allocator, prep the pages for usage as a hugetlb page and
> > then add the resulting hugetlb page to the pool.
> >
> > In the 'prep' stage, optional vmemmap optimization is done. For
> > performance reasons we want to perform vmemmap optimization on multiple
> > hugetlb pages at once. To do this, restructure the hugetlb pool
> > allocation code such that vmemmap optimization can be isolated and later
> > batched.
> >
> > The code to allocate hugetlb pages from bootmem was also modified to
> > allow batching.
> >
> > No functional changes, only code restructure.
>
> I'm afraid there are some functional changes.
>

Hi Sergey,

Sorry for the delay. Not ignoring your report but chasing this in
another thread.
https://lore.kernel.org/linux-mm/20231018222003.GA21776@monkey/

Quick question.
Are you using LLVM/clang in your builds?

My guess is that you are hitting the same issue. That BUG at mm/hugetlb.c:3180
should not be checked/executed unless you allocate gigantic hugetlb pages on
the kernel command line. Suspect you are not doing this, and loop code is
being run when it should not.
--
Mike Kravetz

> [...]
> # good: [9e6c86c616f7d4b166c12f1ea9b69831f85c4a86] hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
> git bisect good 9e6c86c616f7d4b166c12f1ea9b69831f85c4a86
> # bad: [1d50db09471e2a67272cee6e004ffed380ac571b] Merge branch 'master' of git://linuxtv.org/mchehab/media-next.git
> git bisect bad 1d50db09471e2a67272cee6e004ffed380ac571b
> # good: [80b14e865ea20ddc20aae61e2c106ebb03257cd3] bcachefs: Lower BCH_NAME_MAX to 512
> git bisect good 80b14e865ea20ddc20aae61e2c106ebb03257cd3
> # bad: [31d068de8a0de2c44168bbd8a62da21a7bc76051] Merge branch 'for-linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux.git
> git bisect bad 31d068de8a0de2c44168bbd8a62da21a7bc76051
> # bad: [0bb194b29fa6296a74b989e0f7f2db4fc11d8012] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git
> git bisect bad 0bb194b29fa6296a74b989e0f7f2db4fc11d8012
> # good: [62001c9bf6aad59dc800c16613e5440b9226c252] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan.git
> git bisect good 62001c9bf6aad59dc800c16613e5440b9226c252
> # good: [21d856352ab78daffb9d05296b87b570f3afac33] Merge branch 'mm-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect good 21d856352ab78daffb9d05296b87b570f3afac33
> # good: [7fded33c6971b6c8e87cbbf48e74536aacca2991] perf test: Add pmu-event test for "Compat" and new event_field.
> git bisect good 7fded33c6971b6c8e87cbbf48e74536aacca2991
> # good: [4c37b665186a5e2907bf0308474ac3f15eb847da] compiler.h: move __is_constexpr() to compiler.h
> git bisect good 4c37b665186a5e2907bf0308474ac3f15eb847da
> # bad: [3424f8f382bd4d30e01eaf316823321ef7bd1560] mm: delete checks for xor_unlock_is_negative_byte()
> git bisect bad 3424f8f382bd4d30e01eaf316823321ef7bd1560
> # bad: [b5c6a60fe5b0339ba9c54f9f871db5c4a0e47906] iomap: protect read_bytes_pending with the state_lock
> git bisect bad b5c6a60fe5b0339ba9c54f9f871db5c4a0e47906
> # bad: [bfae92330ddc44968c628c0085082d25061495a4] hugetlb: batch PMD split for bulk vmemmap dedup
> git bisect bad bfae92330ddc44968c628c0085082d25061495a4
> # bad: [fb59f2cb8956f3888d2e0b438cc503565aa3c405] hugetlb: perform vmemmap optimization on a list of pages
> git bisect bad fb59f2cb8956f3888d2e0b438cc503565aa3c405
> # bad: [bfb41d6b2fe148c939366957ea9cb9aa72d59c4c] hugetlb: restructure pool allocations
> git bisect bad bfb41d6b2fe148c939366957ea9cb9aa72d59c4c
> # first bad commit: [bfb41d6b2fe148c939366957ea9cb9aa72d59c4c] hugetlb: restructure pool allocations
>
> bfb41d6b2fe148c939366957ea9cb9aa72d59c4c is the first bad commit
>
> commit bfb41d6b2fe148c939366957ea9cb9aa72d59c4c
> Author: Mike Kravetz
> Date: Thu Oct 5 20:20:04 2023 -0700
>
> hugetlb: restructure pool allocations
>
>
> Bug on linux-next:
>
> [ 0.827640] ------------[ cut here ]------------
> [ 0.828608] kernel BUG at mm/hugetlb.c:3180!

2023-10-19 02:16:30

by Sergey Senozhatsky

Subject: Re: Bisected: [PATCH v7 2/8] hugetlb: restructure pool allocations

On (23/10/18 15:44), Mike Kravetz wrote:
> Hi Sergey,
>
> Sorry for the delay. Not ignoring your report but chasing this in
> another thread.

Hi Mike, no worries.

> https://lore.kernel.org/linux-mm/20231018222003.GA21776@monkey/

Oh, interesting.
I just read an article the other day about some of the unexpected
optimizations that clang can do [1].

Setting h to NULL seems to be fixing the problem for me.

> Quick question.
> Are you using LLVM/clang in your builds?

Yes, clang 14.0.6
The kernel compiled with gcc does not BUG_ON().

> My guess is that you are hitting the same issue. That BUG at mm/hugetlb.c:3180
> should not be checked/executed unless you allocate gigantic hugetlb pages on
> the kernel command line. Suspect you are not doing this, and loop code is
> being run when it should not.

Looks very similar indeed.

[1] https://research.swtch.com/ub

2023-10-19 02:30:48

by Mike Kravetz

Subject: Re: Bisected: [PATCH v7 2/8] hugetlb: restructure pool allocations

On 10/19/23 11:15, Sergey Senozhatsky wrote:
> On (23/10/18 15:44), Mike Kravetz wrote:
> > Hi Sergey,
> >
> > Sorry for the delay. Not ignoring your report but chasing this in
> > another thread.
>
> Hi Mike, no worries.
>
> > https://lore.kernel.org/linux-mm/20231018222003.GA21776@monkey/
>
> Oh, interesting.
> I just read an article the other day about some of the unexpected
> optimizations that clang can do [1].
>
> Setting h to NULL seems to be fixing the problem for me.
>
> > Quick question.
> > Are you using LLVM/clang in your builds?
>
> Yes, clang 14.0.6
> The kernel compiled with gcc does not BUG_ON().
>
> > My guess is that you are hitting the same issue. That BUG at mm/hugetlb.c:3180
> > should not be checked/executed unless you allocate gigantic hugetlb pages on
> > the kernel command line. Suspect you are not doing this, and loop code is
> > being run when it should not.
>
> Looks very similar indeed.
>
> [1] https://research.swtch.com/ub

Thank you for testing!
--
Mike Kravetz