2024-01-18 12:39:32

by Gang Li

Subject: [RESEND PATCH v4 0/7] hugetlb: parallelize hugetlb page init on boot

Hi all, hugetlb init parallelization has now been updated to v4.

This version is tested on next-20240112.

Update Summary:
- Make padata_do_multithreaded dispatch all jobs with a global iterator
- Revise commit message
- Rename some functions
- Collect Tested-by and Reviewed-by

# Introduction
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of a total boot time of about 10 seconds. Initializing 11,776 1GB
pages on a 12TB Intel host takes more than 1 minute[1]. This is a noteworthy
figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure for this,
padata_do_multithreaded; this series uses it to achieve effective results
with minimal modifications.
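
For reference, using it boils down to filling in a struct padata_mt_job and
letting padata split the [start, start + size) range across worker threads.
The snippet below is a condensed sketch of how patch 6 drives the 2M
allocation (the wrapper name is made up for illustration; see patch 6 for
the real code):

```c
/* Sketch only: padata runs thread_fn on chunks of at least min_chunk
 * units, using up to max_threads workers.
 */
static void __init hugetlb_alloc_node(unsigned long start, unsigned long end,
                                      void *arg)
{
        /* allocate (end - start) huge pages for the hstate passed via arg */
}

static void __init hugetlb_parallel_alloc_sketch(struct hstate *h)
{
        struct padata_mt_job job = {
                .thread_fn   = hugetlb_alloc_node,
                .fn_arg      = h,
                .start       = 0,
                .size        = h->max_huge_pages,
                .align       = 1,
                .min_chunk   = h->max_huge_pages / num_node_state(N_MEMORY) / 2,
                .max_threads = num_node_state(N_MEMORY) * 2,
                .numa_aware  = true,    /* added by patch 3 in this series */
        };

        padata_do_multithreaded(&job);
}
```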

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/

# Test result
test                 no patch(ms)   patched(ms)   saved
-------------------  -------------  ------------  --------
256c2t(4 node) 1G    4745           2024          57.34%
128c1t(2 node) 1G    3358           1712          49.02%
12t 1G               77000          18300         76.23%

256c2t(4 node) 2M    3336           1051          68.52%
128c1t(2 node) 2M    1943           716           63.15%

# Change log
Changes in v4:
- Make padata_do_multithreaded dispatch all jobs with a global iterator
- Revise commit message
- Rename some functions
- Collect Tested-by and Reviewed-by

Changes in v3:
- https://lore.kernel.org/all/[email protected]/
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Changes in v2:
- https://lore.kernel.org/all/[email protected]/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/[email protected]/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (7):
hugetlb: code clean for hugetlb_hstate_alloc_pages
hugetlb: split hugetlb_hstate_alloc_pages
padata: dispatch works on different nodes
hugetlb: pass *next_nid_to_alloc directly to
for_each_node_mask_to_alloc
hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
hugetlb: parallelize 2M hugetlb allocation and initialization
hugetlb: parallelize 1G hugetlb initialization

fs/Kconfig | 1 +
include/linux/hugetlb.h | 2 +-
include/linux/padata.h | 3 +
kernel/padata.c | 14 ++-
mm/hugetlb.c | 227 ++++++++++++++++++++++++++--------------
mm/mm_init.c | 1 +
6 files changed, 168 insertions(+), 80 deletions(-)

--
2.20.1



2024-01-18 12:39:44

by Gang Li

Subject: [PATCH v4 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages

The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning up the
code, its readability can be improved, facilitating future modifications.

This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.

- hugetlb_hstate_alloc_pages_specific_nodes() iterates through each online
  node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_errcheck() reports errors during allocation,
  and updates the value of h->max_huge_pages accordingly.

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
---
mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..b8e4a6adefd6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}

+static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h)
+{
+ int i;
+ bool node_specific_alloc = false;
+
+ for_each_online_node(i) {
+ if (h->max_huge_pages_node[i] > 0) {
+ hugetlb_hstate_alloc_pages_onenode(h, i);
+ node_specific_alloc = true;
+ }
+ }
+
+ return node_specific_alloc;
+}
+
+static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated, struct hstate *h)
+{
+ if (allocated < h->max_huge_pages) {
+ char buf[32];
+
+ string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
+ h->max_huge_pages, buf, allocated);
+ h->max_huge_pages = allocated;
+ }
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
struct folio *folio;
LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
- bool node_specific_alloc = false;

/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
}

/* do node specific alloc */
- for_each_online_node(i) {
- if (h->max_huge_pages_node[i] > 0) {
- hugetlb_hstate_alloc_pages_onenode(h, i);
- node_specific_alloc = true;
- }
- }
-
- if (node_specific_alloc)
+ if (hugetlb_hstate_alloc_pages_specific_nodes(h))
return;

/* below will do all node balanced alloc */
@@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
/* list will be empty if hstate_is_gigantic */
prep_and_add_allocated_folios(h, &folio_list);

- if (i < h->max_huge_pages) {
- char buf[32];
-
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
- pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
- h->max_huge_pages, buf, i);
- h->max_huge_pages = i;
- }
+ hugetlb_hstate_alloc_pages_errcheck(i, h);
kfree(node_alloc_noretry);
}

--
2.20.1


2024-01-18 12:39:59

by Gang Li

Subject: [PATCH v4 2/7] hugetlb: split hugetlb_hstate_alloc_pages

1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into separate gigantic and
non-gigantic paths.

This patch has no functional changes.

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
---
mm/hugetlb.c | 87 ++++++++++++++++++++++++++--------------------------
1 file changed, 43 insertions(+), 44 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8e4a6adefd6..98ae108e1fac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3509,6 +3509,43 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
}
}

+static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
+{
+ unsigned long i;
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+ break;
+ cond_resched();
+ }
+
+ return i;
+}
+
+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+{
+ unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
+ nodemask_t node_alloc_noretry;
+
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ cond_resched();
+ }
+
+ prep_and_add_allocated_folios(h, &folio_list);
+
+ return i;
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3522,10 +3559,7 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
*/
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
- unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t *node_alloc_noretry;
+ unsigned long allocated;

/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3538,47 +3572,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;

/* below will do all node balanced alloc */
- if (!hstate_is_gigantic(h)) {
- /*
- * Bit mask controlling how hard we retry per-node allocations.
- * Ignore errors as lower level routines can deal with
- * node_alloc_noretry == NULL. If this kmalloc fails at boot
- * time, we are likely in bigger trouble.
- */
- node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
- GFP_KERNEL);
- } else {
- /* allocations done at boot time */
- node_alloc_noretry = NULL;
- }
-
- /* bit mask controlling how hard we retry per-node allocations */
- if (node_alloc_noretry)
- nodes_clear(*node_alloc_noretry);
-
- for (i = 0; i < h->max_huge_pages; ++i) {
- if (hstate_is_gigantic(h)) {
- /*
- * gigantic pages not added to list as they are not
- * added to pools now.
- */
- if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
- break;
- } else {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- node_alloc_noretry);
- if (!folio)
- break;
- list_add(&folio->lru, &folio_list);
- }
- cond_resched();
- }
-
- /* list will be empty if hstate_is_gigantic */
- prep_and_add_allocated_folios(h, &folio_list);
+ if (hstate_is_gigantic(h))
+ allocated = hugetlb_gigantic_pages_alloc_boot(h);
+ else
+ allocated = hugetlb_pages_alloc_boot(h);

- hugetlb_hstate_alloc_pages_errcheck(i, h);
- kfree(node_alloc_noretry);
+ hugetlb_hstate_alloc_pages_errcheck(allocated, h);
}

static void __init hugetlb_init_hstates(void)
--
2.20.1


2024-01-18 12:40:14

by Gang Li

Subject: [PATCH v4 3/7] padata: dispatch works on different nodes

When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.

Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantages of
multi-node systems.

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
---
include/linux/padata.h | 3 +++
kernel/padata.c | 14 ++++++++++++--
mm/mm_init.c | 1 +
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d7..f79ccd50e7f4 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,8 @@ struct padata_shell {
* appropriate for one worker thread to do at once.
* @max_threads: Max threads to use for the job, actual number may be less
* depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
+ * no CPU, dispatch its jobs to a random CPU.
*/
struct padata_mt_job {
void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +148,7 @@ struct padata_mt_job {
unsigned long align;
unsigned long min_chunk;
int max_threads;
+ bool numa_aware;
};

/**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070..10eae3f59203 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
struct padata_work my_work, *pw;
struct padata_mt_job_state ps;
LIST_HEAD(works);
- int nworks;
+ int nworks, nid;
+ static atomic_t last_used_nid = ATOMIC_INIT(0);

if (job->size == 0)
return;
@@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
ps.chunk_size = roundup(ps.chunk_size, job->align);

list_for_each_entry(pw, &works, pw_list)
- queue_work(system_unbound_wq, &pw->pw_work);
+ if (job->numa_aware) {
+ int old_node = atomic_read(&last_used_nid);
+
+ do {
+ nid = next_node_in(old_node, node_states[N_CPU]);
+ } while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
+ queue_work_node(nid, system_unbound_wq, &pw->pw_work);
+ } else {
+ queue_work(system_unbound_wq, &pw->pw_work);
+ }

/* Use the current thread, which saves starting a workqueue worker. */
padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36..549e76af8f82 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
.align = PAGES_PER_SECTION,
.min_chunk = PAGES_PER_SECTION,
.max_threads = max_threads,
+ .numa_aware = false,
};

padata_do_multithreaded(&job);
--
2.20.1


2024-01-18 12:40:28

by Gang Li

Subject: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc

With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
threads allocating from a common h->next_nid_to_alloc. To address this, it is
necessary to assign a separate next_nid_to_alloc for each thread.

Consequently, hstate_next_node_to_alloc and for_each_node_mask_to_alloc
have been modified to directly accept a *next_nid_to_alloc parameter,
ensuring thread-specific allocation and avoiding concurrent access issues.

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
---
mm/hugetlb.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 98ae108e1fac..effe5539e545 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
* next node from which to allocate, handling wrap at end of node
* mask.
*/
-static int hstate_next_node_to_alloc(struct hstate *h,
+static int hstate_next_node_to_alloc(int *next_node,
nodemask_t *nodes_allowed)
{
int nid;

VM_BUG_ON(!nodes_allowed);

- nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
- h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+ nid = get_valid_node_allowed(*next_node, nodes_allowed);
+ *next_node = next_node_allowed(nid, nodes_allowed);

return nid;
}
@@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
return nid;
}

-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
+#define for_each_node_mask_to_alloc(next_node, nr_nodes, node, mask) \
for (nr_nodes = nodes_weight(*mask); \
nr_nodes > 0 && \
- ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
+ ((node = hstate_next_node_to_alloc(next_node, mask)) || 1); \
nr_nodes--)

#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
@@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
*/
static struct folio *alloc_pool_huge_folio(struct hstate *h,
nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+ nodemask_t *node_alloc_noretry,
+ int *next_node)
{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nr_nodes, node;

- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(next_node, nr_nodes, node, nodes_allowed) {
struct folio *folio;

folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
@@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
goto found;
}
/* allocate from next node when distributing huge pages */
- for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
m = memblock_alloc_try_nid_raw(
huge_page_size(h), huge_page_size(h),
0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -3679,7 +3680,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
VM_BUG_ON(delta != -1 && delta != 1);

if (delta < 0) {
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
if (h->surplus_huge_pages_node[node])
goto found;
}
@@ -3794,7 +3795,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
cond_resched();

folio = alloc_pool_huge_folio(h, nodes_allowed,
- node_alloc_noretry);
+ node_alloc_noretry,
+ &h->next_nid_to_alloc);
if (!folio) {
prep_and_add_allocated_folios(h, &page_list);
spin_lock_irq(&hugetlb_lock);
--
2.20.1


2024-01-18 12:40:41

by Gang Li

Subject: [PATCH v4 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA

Allow hugetlb to use padata_do_multithreaded for parallel initialization
by having CONFIG_HUGETLBFS select CONFIG_PADATA.

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
---
fs/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..a57d6e6c41e6 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -262,6 +262,7 @@ menuconfig HUGETLBFS
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL)
select MEMFD_CREATE
+ select PADATA
help
hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read
--
2.20.1


2024-01-18 12:41:04

by Gang Li

Subject: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb pages becomes faster,
thereby improving boot speed.

Here are some test results:
test                 no patch(ms)   patched(ms)   saved
-------------------  -------------  ------------  --------
256c2t(4 node) 2M    3336           1051          68.52%
128c1t(2 node) 2M    1943           716           63.15%

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
---
mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 52 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index effe5539e545..9b348ba418f5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
#include <linux/delayacct.h>
#include <linux/memory.h>
#include <linux/mm_inline.h>
+#include <linux/padata.h>

#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -3510,43 +3511,76 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
}
}

-static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
+static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
{
- unsigned long i;
+ struct hstate *h = (struct hstate *)arg;
+ int i, num = end - start;
+ nodemask_t node_alloc_noretry;
+ unsigned long flags;
+ int next_node = 0;

- for (i = 0; i < h->max_huge_pages; ++i) {
- if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < num; ++i) {
+ struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry, &next_node);
+ if (!folio)
break;
+ spin_lock_irqsave(&hugetlb_lock, flags);
+ __prep_account_new_huge_page(h, folio_nid(folio));
+ enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
cond_resched();
}
+}

- return i;
+static void __init hugetlb_vmemmap_optimize_node(unsigned long start, unsigned long end, void *arg)
+{
+ struct hstate *h = (struct hstate *)arg;
+ int nid = start;
+
+ hugetlb_vmemmap_optimize_folios(h, &h->hugepage_freelists[nid]);
}

-static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
{
unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t node_alloc_noretry;
-
- /* Bit mask controlling how hard we retry per-node allocations.*/
- nodes_clear(node_alloc_noretry);

for (i = 0; i < h->max_huge_pages; ++i) {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- &node_alloc_noretry);
- if (!folio)
+ if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
break;
- list_add(&folio->lru, &folio_list);
cond_resched();
}

- prep_and_add_allocated_folios(h, &folio_list);
-
return i;
}

+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+{
+ struct padata_mt_job job = {
+ .fn_arg = h,
+ .align = 1,
+ .numa_aware = true
+ };
+
+ job.thread_fn = hugetlb_alloc_node;
+ job.start = 0;
+ job.size = h->max_huge_pages;
+ job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+ job.max_threads = num_node_state(N_MEMORY) * 2;
+ padata_do_multithreaded(&job);
+
+ job.thread_fn = hugetlb_vmemmap_optimize_node;
+ job.start = 0;
+ job.size = num_node_state(N_MEMORY);
+ job.min_chunk = 1;
+ job.max_threads = num_node_state(N_MEMORY);
+ padata_do_multithreaded(&job);
+
+ return h->nr_huge_pages;
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
--
2.20.1


2024-01-18 12:42:07

by Gang Li

Subject: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization

Optimize the initialization speed of 1G huge pages through
parallelization.

1G hugetlbs are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.

Here are some test results:
test                 no patch(ms)   patched(ms)   saved
-------------------  -------------  ------------  --------
256c2t(4 node) 1G    4745           2024          57.34%
128c1t(2 node) 1G    3358           1712          49.02%
12t 1G               77000          18300         76.23%

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++--------
2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b1..77b30a8c6076 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);

extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];

/* arch callbacks */

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b348ba418f5..2f4b77630ada 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
#endif
static unsigned long hugetlb_cma_size __initdata;

-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];

/* for command line parsing */
static struct hstate * __initdata parsed_hstate;
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
struct huge_bootmem_page *m = NULL; /* initialize for clang */
- int nr_nodes, node;
+ int nr_nodes, node = nid;

/* do node specific alloc */
if (nid != NUMA_NO_NODE) {
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
huge_page_size(h) - PAGE_SIZE);
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[node]);
m->hstate = h;
return 1;
}
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);

- /* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
@@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
}
+ /* Subdivide locks to achieve better parallel performance */
+ spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
- spin_unlock_irqrestore(&hugetlb_lock, flags);
}

/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
*/
-static void __init gather_bootmem_prealloc(void)
+static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)
+
{
+ int nid = start;
LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
struct hstate *h = NULL, *prev_h = NULL;

- list_for_each_entry(m, &huge_boot_pages, list) {
+ list_for_each_entry(m, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;

@@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
prep_and_add_bootmem_folios(h, &folio_list);
}

+static void __init gather_bootmem_prealloc(void)
+{
+ struct padata_mt_job job = {
+ .thread_fn = __gather_bootmem_prealloc,
+ .fn_arg = NULL,
+ .start = 0,
+ .size = num_node_state(N_MEMORY),
+ .align = 1,
+ .min_chunk = 1,
+ .max_threads = num_node_state(N_MEMORY),
+ .numa_aware = true,
+ };
+
+ padata_do_multithreaded(&job);
+}
+
static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
{
unsigned long i;
@@ -3602,6 +3620,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;
}

+ /* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/
+ if (huge_boot_pages[0].next == NULL) {
+ int i = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ INIT_LIST_HEAD(&huge_boot_pages[i]);
+ }
+
/* do node specific alloc */
if (hugetlb_hstate_alloc_pages_specific_nodes(h))
return;
--
2.20.1


2024-01-18 14:30:20

by Kefeng Wang

Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization



On 2024/1/18 20:39, Gang Li wrote:
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2t(4 node) 1G 4745 2024 57.34%
> 128c1t(2 node) 1G 3358 1712 49.02%
> 12t 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++--------
> 2 files changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c1ee640d87b1..77b30a8c6076 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>
> extern int sysctl_hugetlb_shm_group;
> -extern struct list_head huge_boot_pages;
> +extern struct list_head huge_boot_pages[MAX_NUMNODES];

Hi, please update arch/powerpc/mm/hugetlbpage.c too, since
huge_boot_pages is also used on ppc.

>
> /* arch callbacks */
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9b348ba418f5..2f4b77630ada 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> #endif
> static unsigned long hugetlb_cma_size __initdata;
>
> -__initdata LIST_HEAD(huge_boot_pages);
> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* for command line parsing */
> static struct hstate * __initdata parsed_hstate;
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> {
> struct huge_bootmem_page *m = NULL; /* initialize for clang */
> - int nr_nodes, node;
> + int nr_nodes, node = nid;
>
> /* do node specific alloc */
> if (nid != NUMA_NO_NODE) {
> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> huge_page_size(h) - PAGE_SIZE);
> /* Put them into a private list first because mem_map is not up yet */
> INIT_LIST_HEAD(&m->list);
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[node]);
> m->hstate = h;
> return 1;
> }
> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> /* Send list for bulk vmemmap optimization processing */
> hugetlb_vmemmap_optimize_folios(h, folio_list);
>
> - /* Add all new pool pages to free lists in one lock cycle */
> - spin_lock_irqsave(&hugetlb_lock, flags);
> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
> /*
> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> HUGETLB_VMEMMAP_RESERVE_PAGES,
> pages_per_huge_page(h));
> }
> + /* Subdivide locks to achieve better parallel performance */
> + spin_lock_irqsave(&hugetlb_lock, flags);
> __prep_account_new_huge_page(h, folio_nid(folio));
> enqueue_hugetlb_folio(h, folio);
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
> - spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
>
> /*
> * Put bootmem huge pages into the standard lists after mem_map is up.
> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> */
> -static void __init gather_bootmem_prealloc(void)
> +static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)
> +
> {
> + int nid = start;
> LIST_HEAD(folio_list);
> struct huge_bootmem_page *m;
> struct hstate *h = NULL, *prev_h = NULL;
>
> - list_for_each_entry(m, &huge_boot_pages, list) {
> + list_for_each_entry(m, &huge_boot_pages[nid], list) {
> struct page *page = virt_to_page(m);
> struct folio *folio = (void *)page;
>
> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
> prep_and_add_bootmem_folios(h, &folio_list);
> }
>
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct padata_mt_job job = {
> + .thread_fn = __gather_bootmem_prealloc,
> + .fn_arg = NULL,
> + .start = 0,
> + .size = num_node_state(N_MEMORY),
> + .align = 1,
> + .min_chunk = 1,
> + .max_threads = num_node_state(N_MEMORY),
> + .numa_aware = true,
> + };
> +
> + padata_do_multithreaded(&job);
> +}
> +
> static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> {
> unsigned long i;
> @@ -3602,6 +3620,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> return;
> }
>
> + /* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/
> + if (huge_boot_pages[0].next == NULL) {
> + int i = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++)
> + INIT_LIST_HEAD(&huge_boot_pages[i]);
> + }
> +
> /* do node specific alloc */
> if (hugetlb_hstate_alloc_pages_specific_nodes(h))
> return;

2024-01-19 01:31:06

by Tim Chen

Subject: Re: [PATCH v4 3/7] padata: dispatch works on different nodes

On Thu, 2024-01-18 at 20:39 +0800, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
>
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> include/linux/padata.h | 3 +++
> kernel/padata.c | 14 ++++++++++++--
> mm/mm_init.c | 1 +
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d7..f79ccd50e7f4 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,8 @@ struct padata_shell {
> * appropriate for one worker thread to do at once.
> * @max_threads: Max threads to use for the job, actual number may be less
> * depending on task size and minimum chunk size.
> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> + * no CPU, dispatch its jobs to a random CPU.

Suggest:
Distribute jobs to different nodes with CPU in a round robin fashion.

> */
> struct padata_mt_job {


2024-01-19 02:55:43

by Muchun Song

Subject: Re: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc



> On Jan 18, 2024, at 20:39, Gang Li <[email protected]> wrote:
>
> With parallelization of hugetlb allocation across different threads, each
> thread works on a differnet node to allocate pages from, instead of all
> allocating from a common node h->next_nid_to_alloc. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
>
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks.


2024-01-19 02:59:58

by Muchun Song

Subject: Re: [PATCH v4 3/7] padata: dispatch works on different nodes



On 2024/1/18 20:39, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
>
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> include/linux/padata.h | 3 +++
> kernel/padata.c | 14 ++++++++++++--
> mm/mm_init.c | 1 +
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d7..f79ccd50e7f4 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,8 @@ struct padata_shell {
> * appropriate for one worker thread to do at once.
> * @max_threads: Max threads to use for the job, actual number may be less
> * depending on task size and minimum chunk size.
> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> + * no CPU, dispatch its jobs to a random CPU.
> */
> struct padata_mt_job {
> void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> @@ -146,6 +148,7 @@ struct padata_mt_job {
> unsigned long align;
> unsigned long min_chunk;
> int max_threads;
> + bool numa_aware;
> };
>
> /**
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 179fb1518070..10eae3f59203 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> struct padata_work my_work, *pw;
> struct padata_mt_job_state ps;
> LIST_HEAD(works);
> - int nworks;
> + int nworks, nid;
> + static atomic_t last_used_nid = ATOMIC_INIT(0);
last_used_nid is only used during boot time, so it could be marked
__initdata. Otherwise, LGTM.

Reviewed-by: Muchun Song <[email protected]>

>
> if (job->size == 0)
> return;
> @@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> ps.chunk_size = roundup(ps.chunk_size, job->align);
>
> list_for_each_entry(pw, &works, pw_list)
> - queue_work(system_unbound_wq, &pw->pw_work);
> + if (job->numa_aware) {
> + int old_node = atomic_read(&last_used_nid);
> +
> + do {
> + nid = next_node_in(old_node, node_states[N_CPU]);
> + } while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
> + queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> + } else {
> + queue_work(system_unbound_wq, &pw->pw_work);
> + }
>
> /* Use the current thread, which saves starting a workqueue worker. */
> padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36..549e76af8f82 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
> .align = PAGES_PER_SECTION,
> .min_chunk = PAGES_PER_SECTION,
> .max_threads = max_threads,
> + .numa_aware = false,
> };
>
> padata_do_multithreaded(&job);


2024-01-19 05:01:12

by Tim Chen

Subject: Re: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc

On Thu, 2024-01-18 at 20:39 +0800, Gang Li wrote:
> With parallelization of hugetlb allocation across different threads, each
> thread works on a differnet node to allocate pages from, instead of all
> allocating from a common node h->next_nid_to_alloc. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
>
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>

Reviewed-by: Tim Chen <[email protected]>

> ---


2024-01-19 14:56:21

by Gang Li

Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization



On 2024/1/18 22:22, Kefeng Wang wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>>   include/linux/hugetlb.h |  2 +-
>>   mm/hugetlb.c            | 42 +++++++++++++++++++++++++++++++++--------
>>   2 files changed, 35 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c1ee640d87b1..77b30a8c6076 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct
>> vm_area_struct *vma,
>>   struct address_space *hugetlb_page_mapping_lock_write(struct page
>> *hpage);
>>   extern int sysctl_hugetlb_shm_group;
>> -extern struct list_head huge_boot_pages;
>> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>
> Hi, please update arch/powerpc/mm/hugetlbpage.c too, since it
> was used in ppc.
>

Thanks, will do

2024-01-19 14:56:36

by Gang Li

Subject: Re: [PATCH v4 3/7] padata: dispatch works on different nodes

On 2024/1/19 10:59, Muchun Song wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
>> padata_mt_job *job)
>>       struct padata_work my_work, *pw;
>>       struct padata_mt_job_state ps;
>>       LIST_HEAD(works);
>> -    int nworks;
>> +    int nworks, nid;
>> +    static atomic_t last_used_nid = ATOMIC_INIT(0);
> last_used_nid is only used during boot time so it could be
> __init_data. Otherwise, LGTM.
>
> Reviewed-by: Muchun Song <[email protected]>
>

OK, thanks.

2024-01-19 14:57:12

by Gang Li

Subject: Re: [PATCH v4 3/7] padata: dispatch works on different nodes



On 2024/1/19 07:04, Tim Chen wrote:
> On Thu, 2024-01-18 at 20:39 +0800, Gang Li wrote:
>> When a group of tasks that access different nodes are scheduled on the
>> same node, they may encounter bandwidth bottlenecks and access latency.
>>
>> Thus, numa_aware flag is introduced here, allowing tasks to be
>> distributed across different nodes to fully utilize the advantage of
>> multi-node systems.
>>
>> Signed-off-by: Gang Li <[email protected]>
>> Tested-by: David Rientjes <[email protected]>
>> ---
>> include/linux/padata.h | 3 +++
>> kernel/padata.c | 14 ++++++++++++--
>> mm/mm_init.c | 1 +
>> 3 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>> index 495b16b6b4d7..f79ccd50e7f4 100644
>> --- a/include/linux/padata.h
>> +++ b/include/linux/padata.h
>> @@ -137,6 +137,8 @@ struct padata_shell {
>> * appropriate for one worker thread to do at once.
>> * @max_threads: Max threads to use for the job, actual number may be less
>> * depending on task size and minimum chunk size.
>> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
>> + * no CPU, dispatch its jobs to a random CPU.
>
> Suggest:
> Distribute jobs to different nodes with CPU in a round robin fashion.
>

Good idea.
Thanks.

>> */
>> struct padata_mt_job {
>

2024-01-22 03:43:36

by Muchun Song

Subject: Re: [PATCH v4 2/7] hugetlb: split hugetlb_hstate_alloc_pages



On 2024/1/18 20:39, Gang Li wrote:
> 1G and 2M huge pages have different allocation and initialization logic,
> which leads to subtle differences in parallelization. Therefore, it is
> appropriate to split hugetlb_hstate_alloc_pages into gigantic and
> non-gigantic.
>
> This patch has no functional changes.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> Reviewed-by: Tim Chen <[email protected]>
Reviewed-by: Muchun Song <[email protected]>

Thanks.


2024-01-22 06:17:17

by Muchun Song

Subject: Re: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc



On 2024/1/18 20:39, Gang Li wrote:
> With parallelization of hugetlb allocation across different threads, each
> thread works on a differnet node to allocate pages from, instead of all
> allocating from a common node h->next_nid_to_alloc. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
>
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> mm/hugetlb.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 98ae108e1fac..effe5539e545 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
> * next node from which to allocate, handling wrap at end of node
> * mask.
> */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> +static int hstate_next_node_to_alloc(int *next_node,
> nodemask_t *nodes_allowed)
> {
> int nid;
>
> VM_BUG_ON(!nodes_allowed);
>
> - nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> - h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> + nid = get_valid_node_allowed(*next_node, nodes_allowed);
> + *next_node = next_node_allowed(nid, nodes_allowed);
>
> return nid;
> }
> @@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> return nid;
> }
>
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
> +#define for_each_node_mask_to_alloc(next_node, nr_nodes, node, mask) \
> for (nr_nodes = nodes_weight(*mask); \
> nr_nodes > 0 && \
> - ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
> + ((node = hstate_next_node_to_alloc(next_node, mask)) || 1); \
> nr_nodes--)
>
> #define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
> @@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
> */
> static struct folio *alloc_pool_huge_folio(struct hstate *h,
> nodemask_t *nodes_allowed,
> - nodemask_t *node_alloc_noretry)
> + nodemask_t *node_alloc_noretry,
> + int *next_node)
> {
> gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> int nr_nodes, node;
>
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + for_each_node_mask_to_alloc(next_node, nr_nodes, node, nodes_allowed) {


A small question here, why not pass h->next_nid_to_alloc to
for_each_node_mask_to_alloc()? What's the purpose of the third
parameter of alloc_pool_huge_folio()?

Thanks.


2024-01-22 07:10:30

by Muchun Song

Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization



On 2024/1/18 20:39, Gang Li wrote:
> By distributing both the allocation and the initialization tasks across
> multiple threads, the initialization of 2M hugetlb will be faster,
> thereby improving the boot speed.
>
> Here are some test results:
> test no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2t(4 node) 2M 3336 1051 68.52%
> 128c1t(2 node) 2M 1943 716 63.15%
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 52 insertions(+), 18 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index effe5539e545..9b348ba418f5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -35,6 +35,7 @@
> #include <linux/delayacct.h>
> #include <linux/memory.h>
> #include <linux/mm_inline.h>
> +#include <linux/padata.h>
>
> #include <asm/page.h>
> #include <asm/pgalloc.h>
> @@ -3510,43 +3511,76 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
> }
> }
>
> -static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
> {
> - unsigned long i;
> + struct hstate *h = (struct hstate *)arg;
> + int i, num = end - start;
> + nodemask_t node_alloc_noretry;
> + unsigned long flags;
> + int next_node = 0;

This should be first_online_node, which may not be zero.

>
> - for (i = 0; i < h->max_huge_pages; ++i) {
> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> + /* Bit mask controlling how hard we retry per-node allocations.*/
> + nodes_clear(node_alloc_noretry);
> +
> + for (i = 0; i < num; ++i) {
> + struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> + &node_alloc_noretry, &next_node);
> + if (!folio)
> break;
> + spin_lock_irqsave(&hugetlb_lock, flags);

I suspect there will be more contention on this lock when parallelizing.
I want to know why you chose to drop the prep_and_add_allocated_folios()
call from the original hugetlb_pages_alloc_boot().

> + __prep_account_new_huge_page(h, folio_nid(folio));
> + enqueue_hugetlb_folio(h, folio);
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> cond_resched();
> }
> +}
>
> - return i;
> +static void __init hugetlb_vmemmap_optimize_node(unsigned long start, unsigned long end, void *arg)
> +{
> + struct hstate *h = (struct hstate *)arg;
> + int nid = start;
> +
> + hugetlb_vmemmap_optimize_folios(h, &h->hugepage_freelists[nid]);
> }
>
> -static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
> +static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
> {
> unsigned long i;
> - struct folio *folio;
> - LIST_HEAD(folio_list);
> - nodemask_t node_alloc_noretry;
> -
> - /* Bit mask controlling how hard we retry per-node allocations.*/
> - nodes_clear(node_alloc_noretry);
>
> for (i = 0; i < h->max_huge_pages; ++i) {
> - folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> - &node_alloc_noretry);
> - if (!folio)
> + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> break;
> - list_add(&folio->lru, &folio_list);
> cond_resched();
> }
>
> - prep_and_add_allocated_folios(h, &folio_list);
> -
> return i;
> }
>
> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
> +{
> + struct padata_mt_job job = {
> + .fn_arg = h,
> + .align = 1,
> + .numa_aware = true
> + };
> +
> + job.thread_fn = hugetlb_alloc_node;
> + job.start = 0;
> + job.size = h->max_huge_pages;
> + job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
> + job.max_threads = num_node_state(N_MEMORY) * 2;

I am curious about the magic number 2 used in the assignments of ->min_chunk
and ->max_threads; does it come from your experiments? I think it should
be explained in a comment here.

And I am also sceptical about the optimization for a small number of
hugepage allocations. Given 4 hugepages to be allocated on a UMA
system, job.min_chunk will be 2 and job.max_threads will be 2. Then, 2
workers will be scheduled, but each worker will just allocate 2 pages;
how much is the cost of the scheduling? What if we allocate 4 pages in a
single worker? Do you have any numbers on parallelism vs non-parallelism in
a small allocation case? If we cannot gain from this case, I think we should
assign a reasonable value to ->min_chunk based on experiments.

Thanks.

> + padata_do_multithreaded(&job);
> +
> + job.thread_fn = hugetlb_vmemmap_optimize_node;
> + job.start = 0;
> + job.size = num_node_state(N_MEMORY);
> + job.min_chunk = 1;
> + job.max_threads = num_node_state(N_MEMORY);
> + padata_do_multithreaded(&job);
> +
> + return h->nr_huge_pages;
> +}
> +
> /*
> * NOTE: this routine is called in different contexts for gigantic and
> * non-gigantic pages.


2024-01-22 09:38:38

by Gang Li

Subject: Re: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc

On 2024/1/22 14:16, Muchun Song wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>>   static struct folio *alloc_pool_huge_folio(struct hstate *h,
>>                       nodemask_t *nodes_allowed,
>> -                    nodemask_t *node_alloc_noretry)
>> +                    nodemask_t *node_alloc_noretry,
>> +                    int *next_node)
>>   {
>>       gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>>       int nr_nodes, node;
>> -    for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>> +    for_each_node_mask_to_alloc(next_node, nr_nodes, node,
>> nodes_allowed) {
>
>
> A small question here, why not pass h->next_nid_to_alloc to
> for_each_node_mask_to_alloc()? What's the purpose of the third
> parameter of alloc_pool_huge_folio()?
>
> Thanks.
>

In hugetlb_alloc_node->alloc_pool_huge_folio, hugetlb is initialized in
parallel at boot time, so each thread needs its own next_nid and cannot
use the global h->next_nid_to_alloc. That is why an extra parameter is
added.

And h->next_nid_to_alloc in set_max_huge_pages->alloc_pool_huge_folio
cannot be removed, because if the user calls set_max_huge_pages
frequently and only adds 1 page at a time, each request would end up on
the same node if a local variable were used.
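
As a condensed sketch (function names made up for illustration, not the
literal diff), the two call sites end up looking roughly like this:

```c
/* Boot-time worker (patch 6, condensed): each worker keeps its own cursor
 * on the stack instead of touching the shared h->next_nid_to_alloc.
 */
static void __init boot_worker_sketch(struct hstate *h, unsigned long nr,
                                      nodemask_t *noretry)
{
        int next_node = first_online_node;      /* private to this worker */
        struct folio *folio;

        while (nr--) {
                folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
                                              noretry, &next_node);
                if (!folio)
                        break;
                /* the real code accounts and enqueues folio here */
                cond_resched();
        }
}

/* Runtime path (set_max_huge_pages): keep the shared per-hstate cursor so
 * that repeated small requests keep rotating across nodes.
 */
static struct folio *runtime_alloc_sketch(struct hstate *h,
                                          nodemask_t *nodes_allowed,
                                          nodemask_t *noretry)
{
        return alloc_pool_huge_folio(h, nodes_allowed, noretry,
                                     &h->next_nid_to_alloc);
}
```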

2024-01-22 10:13:13

by Gang Li

Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

On 2024/1/22 15:10, Muchun Song wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned
>> long end, void *arg)
>>   {
>> -    unsigned long i;
>> +    struct hstate *h = (struct hstate *)arg;
>> +    int i, num = end - start;
>> +    nodemask_t node_alloc_noretry;
>> +    unsigned long flags;
>> +    int next_node = 0;
>
> This should be first_online_node which may be not zero.
>

That's right. Thanks!

>> -    for (i = 0; i < h->max_huge_pages; ++i) {
>> -        if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>> +    /* Bit mask controlling how hard we retry per-node allocations.*/
>> +    nodes_clear(node_alloc_noretry);
>> +
>> +    for (i = 0; i < num; ++i) {
>> +        struct folio *folio = alloc_pool_huge_folio(h,
>> &node_states[N_MEMORY],
>> +                        &node_alloc_noretry, &next_node);
>> +        if (!folio)
>>               break;
>> +        spin_lock_irqsave(&hugetlb_lock, flags);
> > I suspect there will more contention on this lock when parallelizing.

In the worst case, only as many threads as there are NUMA nodes are in
contention. And in my testing, it doesn't degrade performance, but
rather improves it due to the reduced lock granularity.

> I want to know why you chose to drop prep_and_add_allocated_folios()
> call in the original hugetlb_pages_alloc_boot()?

It was split so that hugetlb_vmemmap_optimize_folios can be parallelized.

>> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>> +{
>> +    struct padata_mt_job job = {
>> +        .fn_arg        = h,
>> +        .align        = 1,
>> +        .numa_aware    = true
>> +    };
>> +
>> +    job.thread_fn    = hugetlb_alloc_node;
>> +    job.start    = 0;
>> +    job.size    = h->max_huge_pages;
>> +    job.min_chunk    = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
>> +    job.max_threads    = num_node_state(N_MEMORY) * 2;
>
> I am curious the magic number of 2 used in assignments of ->min_chunk
> and ->max_threads, does it from your experiment? I thinke it should
> be a comment here.
>

These values were chosen based on testing; I can run more detailed tests
and provide the data.

> And I am also sceptical about the optimization for a small amount of
> allocation of hugepages. Given 4 hugepags needed to be allocated on UMA
> system, job.min_chunk will be 2, job.max_threads will be 2. Then, 2
> workers will be scheduled, however each worker will just allocate 2 pages,
> how much the cost of scheduling? What if allocate 4 pages in single
> worker? Do you have any numbers on parallelism vs non-parallelism in
> a small allocation case? If we cannot gain from this case, I think we shold
> assign a reasonable value to ->min_chunk based on experiment.
>
> Thanks.
>

That's a good suggestion, I'll run some tests and choose the best
values.



2024-01-22 10:22:23

by Muchun Song

Subject: Re: [PATCH v4 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc



> On Jan 22, 2024, at 17:14, Gang Li <[email protected]> wrote:
>
> On 2024/1/22 14:16, Muchun Song wrote:
>> On 2024/1/18 20:39, Gang Li wrote:
>>> static struct folio *alloc_pool_huge_folio(struct hstate *h,
>>> nodemask_t *nodes_allowed,
>>> - nodemask_t *node_alloc_noretry)
>>> + nodemask_t *node_alloc_noretry,
>>> + int *next_node)
>>> {
>>> gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>>> int nr_nodes, node;
>>> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
>>> + for_each_node_mask_to_alloc(next_node, nr_nodes, node, nodes_allowed) {
>> A small question here, why not pass h->next_nid_to_alloc to
>> for_each_node_mask_to_alloc()? What's the purpose of the third
>> parameter of alloc_pool_huge_folio()?
>> Thanks.
>
> In hugetlb_alloc_node->alloc_pool_huge_folio, hugetlb is initialized in
> parallel at boot time, then it needs each thread to have its own
> next_nid, and can't use the global h->next_nid_to_alloc. so an extra parameter is added.

Yes. When I read your patch 6, I realized this.

>
> And h->next_nid_to_alloc in set_max_huge_pages->alloc_pool_huge_folio
> can not be removed. Because if the user calls set_max_huge_pages
> frequently and only adds 1 page at a time, that would result in each
> request being made on the same node if local variables are used.


2024-01-22 11:40:49

by Muchun Song

Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization



> On Jan 22, 2024, at 18:12, Gang Li <[email protected]> wrote:
>
> On 2024/1/22 15:10, Muchun Song wrote:
>> On 2024/1/18 20:39, Gang Li wrote:
>>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>> {
>>> - unsigned long i;
>>> + struct hstate *h = (struct hstate *)arg;
>>> + int i, num = end - start;
>>> + nodemask_t node_alloc_noretry;
>>> + unsigned long flags;
>>> + int next_node = 0;
>> This should be first_online_node which may be not zero.
>
> That's right. Thanks!
>
>>> - for (i = 0; i < h->max_huge_pages; ++i) {
>>> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>>> + /* Bit mask controlling how hard we retry per-node allocations.*/
>>> + nodes_clear(node_alloc_noretry);
>>> +
>>> + for (i = 0; i < num; ++i) {
>>> + struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>>> + &node_alloc_noretry, &next_node);
>>> + if (!folio)
>>> break;
>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>> > I suspect there will more contention on this lock when parallelizing.
>
> In the worst case, there are only 'numa node number' of threads in
> contention. And in my testing, it doesn't degrade performance, but
> rather improves performance due to the reduced granularity.

So, the performance does not change if you move the lock out of the
loop?

>
>> I want to know why you chose to drop prep_and_add_allocated_folios()
>> call in the original hugetlb_pages_alloc_boot()?
>
> Splitting him to parallelize hugetlb_vmemmap_optimize_folios.

Unfortunately, HVO should be enabled before pages go to the pool list.

>
>>> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>>> +{
>>> + struct padata_mt_job job = {
>>> + .fn_arg = h,
>>> + .align = 1,
>>> + .numa_aware = true
>>> + };
>>> +
>>> + job.thread_fn = hugetlb_alloc_node;
>>> + job.start = 0;
>>> + job.size = h->max_huge_pages;
>>> + job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
>>> + job.max_threads = num_node_state(N_MEMORY) * 2;
>> I am curious about the magic number of 2 used in the assignments of ->min_chunk
>> and ->max_threads; does it come from your experiments? I think it should
>> be explained in a comment here.
>
> These values were chosen based on testing; I can run more detailed tests and provide data.
>
>> And I am also sceptical about the optimization for a small number of
>> hugepage allocations. Given 4 hugepages to be allocated on a UMA
>> system, job.min_chunk will be 2 and job.max_threads will be 2. Then 2
>> workers will be scheduled, but each worker will only allocate 2 pages;
>> how much does the scheduling cost? What about allocating all 4 pages in a
>> single worker? Do you have any numbers comparing parallelism vs
>> non-parallelism in the small allocation case? If we cannot gain from this
>> case, I think we should assign a reasonable value to ->min_chunk based on
>> experiments.
>> Thanks.
>>
>
> That's a good suggestion, I'll run some tests and choose the best
> values.
>
>
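
To make the concern above concrete, here is an illustrative back-of-the-envelope
sketch of the worker/chunk arithmetic for the quoted settings. The real split is
decided inside padata_do_multithreaded(); this is only the rough reasoning behind
Muchun's 4-page UMA example.

```
/* Illustrative only: not the actual padata scheduling logic. */
#include <stdio.h>

int main(void)
{
	unsigned long pages = 4;			/* hugepages=4 on a UMA system */
	unsigned long nr_nodes = 1;			/* num_node_state(N_MEMORY) */
	unsigned long min_chunk = pages / nr_nodes / 2;	/* 2 */
	unsigned long max_threads = nr_nodes * 2;	/* 2 */
	unsigned long works = pages / min_chunk;	/* candidate worker count: 2 */

	if (works > max_threads)
		works = max_threads;

	printf("min_chunk=%lu max_threads=%lu -> %lu workers, ~%lu pages each\n",
	       min_chunk, max_threads, works, pages / works);
	return 0;
}
```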


2024-01-23 03:06:58

by Gang Li

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

On 2024/1/22 19:30, Muchun Song wrote:
>> On Jan 22, 2024, at 18:12, Gang Li <[email protected]> wrote:
>>
>> On 2024/1/22 15:10, Muchun Song wrote:
>>> On 2024/1/18 20:39, Gang Li wrote:
>>>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>>> {
>>>> - unsigned long i;
>>>> + struct hstate *h = (struct hstate *)arg;
>>>> + int i, num = end - start;
>>>> + nodemask_t node_alloc_noretry;
>>>> + unsigned long flags;
>>>> + int next_node = 0;
>>> This should be first_online_node, which may not be zero.
>>
>> That's right. Thanks!
>>
>>>> - for (i = 0; i < h->max_huge_pages; ++i) {
>>>> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>>>> + /* Bit mask controlling how hard we retry per-node allocations.*/
>>>> + nodes_clear(node_alloc_noretry);
>>>> +
>>>> + for (i = 0; i < num; ++i) {
>>>> + struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>>>> + &node_alloc_noretry, &next_node);
>>>> + if (!folio)
>>>> break;
>>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>> I suspect there will be more contention on this lock when parallelizing.
>>
>> In the worst case, only as many threads as there are NUMA nodes contend
>> for the lock. And in my testing, it doesn't degrade performance, but
>> rather improves performance due to the finer lock granularity.
>
> So, the performance does not change if you move the lock out of the
> loop?
>

If we move the lock out of the loop, then multi-threading becomes
single-threading, which definitely reduces performance.

```
+ spin_lock_irqsave(&hugetlb_lock, flags);
for (i = 0; i < num; ++i) {
struct folio *folio = alloc_pool_huge_folio(h,
&node_states[N_MEMORY],
&node_alloc_noretry,
&next_node);
if (!folio)
break;
- spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
- spin_unlock_irqrestore(&hugetlb_lock, flags);
cond_resched();
}
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
```


2024-01-23 03:33:25

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization



> On Jan 23, 2024, at 10:12, Gang Li <[email protected]> wrote:
>
> On 2024/1/22 19:30, Muchun Song wrote:
>>> On Jan 22, 2024, at 18:12, Gang Li <[email protected]> wrote:
>>>
>>> On 2024/1/22 15:10, Muchun Song wrote:
>>>> On 2024/1/18 20:39, Gang Li wrote:
>>>>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>>>> {
>>>>> - unsigned long i;
>>>>> + struct hstate *h = (struct hstate *)arg;
>>>>> + int i, num = end - start;
>>>>> + nodemask_t node_alloc_noretry;
>>>>> + unsigned long flags;
>>>>> + int next_node = 0;
>>>> This should be first_online_node, which may not be zero.
>>>
>>> That's right. Thanks!
>>>
>>>>> - for (i = 0; i < h->max_huge_pages; ++i) {
>>>>> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>>>>> + /* Bit mask controlling how hard we retry per-node allocations.*/
>>>>> + nodes_clear(node_alloc_noretry);
>>>>> +
>>>>> + for (i = 0; i < num; ++i) {
>>>>> + struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>>>>> + &node_alloc_noretry, &next_node);
>>>>> + if (!folio)
>>>>> break;
>>>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>>> I suspect there will be more contention on this lock when parallelizing.
>>>
>>> In the worst case, only as many threads as there are NUMA nodes contend
>>> for the lock. And in my testing, it doesn't degrade performance, but
>>> rather improves performance due to the finer lock granularity.
>> So, the performance does not change if you move the lock out of the
>> loop?
>>
>
> If we move the lock out of the loop, then multi-threading becomes single-threading, which definitely reduces performance.

No. I mean batching the pages into the pool list, just like
prep_and_add_allocated_folios() does (see the sketch after the quoted hunk below).

>
> ```
> + spin_lock_irqsave(&hugetlb_lock, flags);
> for (i = 0; i < num; ++i) {
> struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> &node_alloc_noretry, &next_node);
> if (!folio)
> break;
> - spin_lock_irqsave(&hugetlb_lock, flags);
> __prep_account_new_huge_page(h, folio_nid(folio));
> enqueue_hugetlb_folio(h, folio);
> - spin_unlock_irqrestore(&hugetlb_lock, flags);
> cond_resched();
> }
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
> ```
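
If I read the suggestion correctly, the batched variant would look roughly like
this: each worker fills a private list, runs HVO on the whole list, and then takes
hugetlb_lock once, mirroring prep_and_add_allocated_folios(). This is a sketch
only, not the posted code; helper names are taken from the quoted hunks and error
handling is simplified.

```
static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
{
	struct hstate *h = arg;
	LIST_HEAD(folio_list);
	nodemask_t node_alloc_noretry;
	struct folio *folio, *tmp_f;
	unsigned long flags, i;
	int next_node = first_online_node;

	nodes_clear(node_alloc_noretry);

	/* Allocate onto a private list; no shared state touched here. */
	for (i = start; i < end; i++) {
		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
					      &node_alloc_noretry, &next_node);
		if (!folio)
			break;
		list_add(&folio->lru, &folio_list);
		cond_resched();
	}

	/* HVO runs on the whole batch before the pages reach the pool. */
	hugetlb_vmemmap_optimize_folios(h, &folio_list);

	/* One lock cycle per worker instead of one per folio. */
	spin_lock_irqsave(&hugetlb_lock, flags);
	list_for_each_entry_safe(folio, tmp_f, &folio_list, lru) {
		__prep_account_new_huge_page(h, folio_nid(folio));
		enqueue_hugetlb_folio(h, folio);
	}
	spin_unlock_irqrestore(&hugetlb_lock, flags);
}
```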



2024-01-24 09:58:30

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization



On 2024/1/18 20:39, Gang Li wrote:
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2t(4 node) 1G 4745 2024 57.34%

What does "256c2t" mean?

> 128c1t(2 node) 1G 3358 1712 49.02%
> 12t 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++--------
> 2 files changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c1ee640d87b1..77b30a8c6076 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>
> extern int sysctl_hugetlb_shm_group;
> -extern struct list_head huge_boot_pages;
> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* arch callbacks */
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9b348ba418f5..2f4b77630ada 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> #endif
> static unsigned long hugetlb_cma_size __initdata;
>
> -__initdata LIST_HEAD(huge_boot_pages);
> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* for command line parsing */
> static struct hstate * __initdata parsed_hstate;
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> {
> struct huge_bootmem_page *m = NULL; /* initialize for clang */
> - int nr_nodes, node;
> + int nr_nodes, node = nid;

Why not use nid directly in the following list_add()?

>
> /* do node specific alloc */
> if (nid != NUMA_NO_NODE) {
> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> huge_page_size(h) - PAGE_SIZE);
> /* Put them into a private list first because mem_map is not up yet */
> INIT_LIST_HEAD(&m->list);
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[node]);
> m->hstate = h;
> return 1;
> }
> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> /* Send list for bulk vmemmap optimization processing */
> hugetlb_vmemmap_optimize_folios(h, folio_list);
>
> - /* Add all new pool pages to free lists in one lock cycle */
> - spin_lock_irqsave(&hugetlb_lock, flags);
> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
> /*
> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> HUGETLB_VMEMMAP_RESERVE_PAGES,
> pages_per_huge_page(h));
> }
> + /* Subdivide locks to achieve better parallel performance */
> + spin_lock_irqsave(&hugetlb_lock, flags);
> __prep_account_new_huge_page(h, folio_nid(folio));
> enqueue_hugetlb_folio(h, folio);
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
> - spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
>
> /*
> * Put bootmem huge pages into the standard lists after mem_map is up.
> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> */
> -static void __init gather_bootmem_prealloc(void)
> +static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)

This function name could be gather_bootmem_prealloc_node.

> +
> {
> + int nid = start;
> LIST_HEAD(folio_list);
> struct huge_bootmem_page *m;
> struct hstate *h = NULL, *prev_h = NULL;
>
> - list_for_each_entry(m, &huge_boot_pages, list) {
> + list_for_each_entry(m, &huge_boot_pages[nid], list) {
> struct page *page = virt_to_page(m);
> struct folio *folio = (void *)page;
>
> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
> prep_and_add_bootmem_folios(h, &folio_list);
> }
>
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct padata_mt_job job = {
> + .thread_fn = __gather_bootmem_prealloc,
> + .fn_arg = NULL,
> + .start = 0,
> + .size = num_node_state(N_MEMORY),
> + .align = 1,
> + .min_chunk = 1,
> + .max_threads = num_node_state(N_MEMORY),
> + .numa_aware = true,
> + };
> +
> + padata_do_multithreaded(&job);
> +}
> +
> static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> {
> unsigned long i;
> @@ -3602,6 +3620,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> return;
> }
>
> + /* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/

s/init/initialize/g

And you are missing a blank right before "*/".

> + if (huge_boot_pages[0].next == NULL) {

It is not intuitive. I'd like to use an 'initialized' variable
to indicate whether it has been initialized. BTW, it can be
marked as __initdata.

> + int i = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++)
> + INIT_LIST_HEAD(&huge_boot_pages[i]);
> + }
> +
> /* do node specific alloc */
> if (hugetlb_hstate_alloc_pages_specific_nodes(h))
> return;


2024-01-24 10:53:02

by Gang Li

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization

On 2024/1/24 17:23, Muchun Song wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>> Optimizing the initialization speed of 1G huge pages through
>> parallelization.
>>
>> 1G hugetlbs are allocated from bootmem, a process that is already
>> very fast and does not currently require optimization. Therefore,
>> we focus on parallelizing only the initialization phase in
>> `gather_bootmem_prealloc`.
>>
>> Here are some test results:
>>          test          no patch(ms)   patched(ms)   saved
>>   ------------------- -------------- ------------- --------
>>    256c2t(4 node) 1G           4745          2024   57.34%
>
> What does "256c2t" mean?

A machine with 256 cores and 2T of memory.

>
>>    128c1t(2 node) 1G           3358          1712   49.02%
>>        12t        1G          77000         18300   76.23%
>>
>> Signed-off-by: Gang Li <[email protected]>
>> Tested-by: David Rientjes <[email protected]>
>> ---
>>   include/linux/hugetlb.h |  2 +-
>>   mm/hugetlb.c            | 42 +++++++++++++++++++++++++++++++++--------
>>   2 files changed, 35 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index c1ee640d87b1..77b30a8c6076 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct
>> vm_area_struct *vma,
>>   struct address_space *hugetlb_page_mapping_lock_write(struct page
>> *hpage);
>>   extern int sysctl_hugetlb_shm_group;
>> -extern struct list_head huge_boot_pages;
>> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>>   /* arch callbacks */
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 9b348ba418f5..2f4b77630ada 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio,
>> unsigned int order)
>>   #endif
>>   static unsigned long hugetlb_cma_size __initdata;
>> -__initdata LIST_HEAD(huge_boot_pages);
>> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>>   /* for command line parsing */
>>   static struct hstate * __initdata parsed_hstate;
>> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h,
>> int nid)
>>   int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>   {
>>       struct huge_bootmem_page *m = NULL; /* initialize for clang */
>> -    int nr_nodes, node;
>> +    int nr_nodes, node = nid;
>
> Why not use nid directly in the following list_add()?

`node` may be changed in `for_each_node_mask_to_alloc`.
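
For context, a rough sketch of why 'node' has to start out as 'nid'. The structure
is simplified from the quoted hunk; bootmem_alloc_from() is only a placeholder for
the real memblock allocation, and the macro arguments follow patch 4 of this series
as I read it.

```
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
	struct huge_bootmem_page *m = NULL;
	int nr_nodes, node = nid;	/* keep nid for the node-specific path */

	if (nid != NUMA_NO_NODE) {
		/* Node-specific alloc: the loop below never runs, so 'node'
		 * must already hold the value list_add() needs. */
		m = bootmem_alloc_from(h, nid);
		if (!m)
			return 0;
		goto found;
	}

	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
				    &node_states[N_MEMORY]) {
		m = bootmem_alloc_from(h, node);	/* the loop reassigns 'node' */
		if (m)
			goto found;
	}
	return 0;

found:
	INIT_LIST_HEAD(&m->list);
	list_add(&m->list, &huge_boot_pages[node]);
	m->hstate = h;
	return 1;
}
```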

>
>>       /* do node specific alloc */
>>       if (nid != NUMA_NO_NODE) {
>> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h,
>> int nid)
>>           huge_page_size(h) - PAGE_SIZE);
>>       /* Put them into a private list first because mem_map is not up
>> yet */
>>       INIT_LIST_HEAD(&m->list);
>> -    list_add(&m->list, &huge_boot_pages);
>> +    list_add(&m->list, &huge_boot_pages[node]);
>>       m->hstate = h;
>>       return 1;
>>   }
>> @@ -3390,8 +3390,6 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>>       /* Send list for bulk vmemmap optimization processing */
>>       hugetlb_vmemmap_optimize_folios(h, folio_list);
>> -    /* Add all new pool pages to free lists in one lock cycle */
>> -    spin_lock_irqsave(&hugetlb_lock, flags);
>>       list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>           if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>               /*
>> @@ -3404,23 +3402,27 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>>                       HUGETLB_VMEMMAP_RESERVE_PAGES,
>>                       pages_per_huge_page(h));
>>           }
>> +        /* Subdivide locks to achieve better parallel performance */
>> +        spin_lock_irqsave(&hugetlb_lock, flags);
>>           __prep_account_new_huge_page(h, folio_nid(folio));
>>           enqueue_hugetlb_folio(h, folio);
>> +        spin_unlock_irqrestore(&hugetlb_lock, flags);
>>       }
>> -    spin_unlock_irqrestore(&hugetlb_lock, flags);
>>   }
>>   /*
>>    * Put bootmem huge pages into the standard lists after mem_map is up.
>>    * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>    */
>> -static void __init gather_bootmem_prealloc(void)
>> +static void __init __gather_bootmem_prealloc(unsigned long start,
>> unsigned long end, void *arg)
>
> This function name could be gather_bootmem_prealloc_node.
>

LGTM.

>> +
>>   {
>> +    int nid = start;
>>       LIST_HEAD(folio_list);
>>       struct huge_bootmem_page *m;
>>       struct hstate *h = NULL, *prev_h = NULL;
>> -    list_for_each_entry(m, &huge_boot_pages, list) {
>> +    list_for_each_entry(m, &huge_boot_pages[nid], list) {
>>           struct page *page = virt_to_page(m);
>>           struct folio *folio = (void *)page;
>> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
>>       prep_and_add_bootmem_folios(h, &folio_list);
>>   }
>> +static void __init gather_bootmem_prealloc(void)
>> +{
>> +    struct padata_mt_job job = {
>> +        .thread_fn    = __gather_bootmem_prealloc,
>> +        .fn_arg        = NULL,
>> +        .start        = 0,
>> +        .size        = num_node_state(N_MEMORY),
>> +        .align        = 1,
>> +        .min_chunk    = 1,
>> +        .max_threads    = num_node_state(N_MEMORY),
>> +        .numa_aware    = true,
>> +    };
>> +
>> +    padata_do_multithreaded(&job);
>> +}
>> +
>>   static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate
>> *h, int nid)
>>   {
>>       unsigned long i;
>> @@ -3602,6 +3620,14 @@ static void __init
>> hugetlb_hstate_alloc_pages(struct hstate *h)
>>           return;
>>       }
>> +    /* hugetlb_hstate_alloc_pages will be called many times, init
>> huge_boot_pages once*/
>
> s/init/initialize/g
>
> And you are missing a blank right before "*/".

OK

>
>> +    if (huge_boot_pages[0].next == NULL) {
>
> It is not intuitive. I'd like to use an 'initialized' variable

Would it make the code look a bit redundant?

> to indicate whether it has been initialized. BTW, it can be
> marked as __initdata.
>

OK

2024-01-25 02:49:40

by Muchun Song

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization



> On Jan 24, 2024, at 18:52, Gang Li <[email protected]> wrote:
>
> On 2024/1/24 17:23, Muchun Song wrote:
>> On 2024/1/18 20:39, Gang Li wrote:
>>> Optimizing the initialization speed of 1G huge pages through
>>> parallelization.
>>>
>>> 1G hugetlbs are allocated from bootmem, a process that is already
>>> very fast and does not currently require optimization. Therefore,
>>> we focus on parallelizing only the initialization phase in
>>> `gather_bootmem_prealloc`.
>>>
>>> Here are some test results:
>>> test no patch(ms) patched(ms) saved
>>> ------------------- -------------- ------------- --------
>>> 256c2t(4 node) 1G 4745 2024 57.34%
>> What does "256c2t" mean?
>
> A machine with 256 cores and 2T of memory.

A little confusing. I first read "256c2" as a hexadecimal number.
The unit of memory should be a capital T. We should add a simple
explanation of this notation.

>
>>> 128c1t(2 node) 1G 3358 1712 49.02%
>>> 12t 1G 77000 18300 76.23%

I am curious how many NUMA nodes this system has. I suspect
it is not just one.

>>>
>>> Signed-off-by: Gang Li <[email protected]>
>>> Tested-by: David Rientjes <[email protected]>
>>> ---
>>> include/linux/hugetlb.h | 2 +-
>>> mm/hugetlb.c | 42 +++++++++++++++++++++++++++++++++--------
>>> 2 files changed, 35 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index c1ee640d87b1..77b30a8c6076 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>>> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>>> extern int sysctl_hugetlb_shm_group;
>>> -extern struct list_head huge_boot_pages;
>>> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>>> /* arch callbacks */
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 9b348ba418f5..2f4b77630ada 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
>>> #endif
>>> static unsigned long hugetlb_cma_size __initdata;
>>> -__initdata LIST_HEAD(huge_boot_pages);
>>> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>>> /* for command line parsing */
>>> static struct hstate * __initdata parsed_hstate;
>>> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
>>> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>> {
>>> struct huge_bootmem_page *m = NULL; /* initialize for clang */
>>> - int nr_nodes, node;
>>> + int nr_nodes, node = nid;
>> Why not use nid directly in the following list_add()?
>
> `node` may be changed in `for_each_node_mask_to_alloc`.

Got it.

>
>>> /* do node specific alloc */
>>> if (nid != NUMA_NO_NODE) {
>>> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>>> huge_page_size(h) - PAGE_SIZE);
>>> /* Put them into a private list first because mem_map is not up yet */
>>> INIT_LIST_HEAD(&m->list);
>>> - list_add(&m->list, &huge_boot_pages);
>>> + list_add(&m->list, &huge_boot_pages[node]);
>>> m->hstate = h;
>>> return 1;
>>> }
>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> /* Send list for bulk vmemmap optimization processing */
>>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>>> - /* Add all new pool pages to free lists in one lock cycle */
>>> - spin_lock_irqsave(&hugetlb_lock, flags);
>>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>> /*
>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>>> pages_per_huge_page(h));
>>> }
>>> + /* Subdivide locks to achieve better parallel performance */
>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>> __prep_account_new_huge_page(h, folio_nid(folio));
>>> enqueue_hugetlb_folio(h, folio);
>>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> /*
>>> * Put bootmem huge pages into the standard lists after mem_map is up.
>>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>> */
>>> -static void __init gather_bootmem_prealloc(void)
>>> +static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)
>> This function name could be gather_bootmem_prealloc_node.
>
> LGTM.
>
>>> +
>>> {
>>> + int nid = start;
>>> LIST_HEAD(folio_list);
>>> struct huge_bootmem_page *m;
>>> struct hstate *h = NULL, *prev_h = NULL;
>>> - list_for_each_entry(m, &huge_boot_pages, list) {
>>> + list_for_each_entry(m, &huge_boot_pages[nid], list) {
>>> struct page *page = virt_to_page(m);
>>> struct folio *folio = (void *)page;
>>> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
>>> prep_and_add_bootmem_folios(h, &folio_list);
>>> }
>>> +static void __init gather_bootmem_prealloc(void)
>>> +{
>>> + struct padata_mt_job job = {
>>> + .thread_fn = __gather_bootmem_prealloc,
>>> + .fn_arg = NULL,
>>> + .start = 0,
>>> + .size = num_node_state(N_MEMORY),
>>> + .align = 1,
>>> + .min_chunk = 1,
>>> + .max_threads = num_node_state(N_MEMORY),
>>> + .numa_aware = true,
>>> + };
>>> +
>>> + padata_do_multithreaded(&job);
>>> +}
>>> +
>>> static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>>> {
>>> unsigned long i;
>>> @@ -3602,6 +3620,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>>> return;
>>> }
>>> + /* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/
>> s/init/initialize/g
>> And you are missing a blank right before "*/".
>
> OK
>
>>> + if (huge_boot_pages[0].next == NULL) {
>> It is not intuitive. I'd like to use an 'initialized' variable
>
> Would it make the code look a bit redundant?

What is redundant?

>
>> to indicate whether it has been initialized. BTW, it can be
>> marked as __initdata.
>>
>
> OK



2024-01-25 03:47:41

by Gang Li

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization

On 2024/1/25 10:48, Muchun Song wrote:
>>>> 12t 1G 77000 18300 76.23%
>
> I am curious how many NUMA nodes this system has. I suspect
> it is not just one.
>

Hi David Rientjes,

I'm also curious. Can you show us some details about this machine,
such as the output of `lscpu`?

Thanks!

2024-01-25 03:56:47

by Gang Li

[permalink] [raw]
Subject: Re: [PATCH v4 7/7] hugetlb: parallelize 1G hugetlb initialization

On 2024/1/25 10:48, Muchun Song wrote:
>>>> + if (huge_boot_pages[0].next == NULL) {
>>> It is not intuitive. I'd like to use an 'initialized' variable
>>
>> Would it make the code look a bit redundant?
>
> What is redundant?
>

I was thinking of adding a global variable at first, but it's possible
to add a local static variable, which is more concise.
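
For what it's worth, the local-static variant could look roughly like this. This
is a sketch only, under the assumption that the one-time setup stays inside
hugetlb_hstate_alloc_pages(); it is not the final code.

```
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
	/* hugetlb_hstate_alloc_pages() is called once per hstate; initialize
	 * the per-node bootmem lists only on the first call. */
	static bool initialized __initdata;

	if (!initialized) {
		int i;

		for (i = 0; i < MAX_NUMNODES; i++)
			INIT_LIST_HEAD(&huge_boot_pages[i]);
		initialized = true;
	}

	/* ... rest of the function unchanged ... */
}
```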