Hi all, hugetlb init parallelization has now been updated to v3.
This series is tested on next-20240102 and cannot be applied to v6.7-rc8.
Update Summary:
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag
Thanks to David Rientjes' testing, we now know that this series reduces
hugetlb 1G initialization time from 77s to 18.3s on a 12T machine[4].
# Introduction
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
host takes more than 1 minute[1]. This is a noteworthy figure.
Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this series uses it to achieve effective results
with minimal modifications.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
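For reference, every user in this series follows the same pattern: describe
the work as a struct padata_mt_job and hand it to padata_do_multithreaded().
A minimal sketch (the worker function and sizing below are made up for
illustration; the fields are the existing ones in include/linux/padata.h):

```c
#include <linux/padata.h>
#include <linux/nodemask.h>

/* illustrative worker: initialize items in [start, end) */
static void __init demo_init_chunk(unsigned long start, unsigned long end, void *arg)
{
	/* per-chunk initialization work runs here, possibly on several CPUs */
}

static void __init demo_parallel_init(unsigned long nr_items)
{
	struct padata_mt_job job = {
		.thread_fn   = demo_init_chunk,
		.fn_arg      = NULL,
		.start       = 0,
		.size        = nr_items,
		.align       = 1,
		.min_chunk   = nr_items / num_node_state(N_MEMORY),
		.max_threads = num_node_state(N_MEMORY),
	};

	padata_do_multithreaded(&job);
}
```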
# Test result
test no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2t(4 node) 1G 4745 2024 57.34%
128c1t(2 node) 1G 3358 1712 49.02%
12t 1G 77000 18300 76.23%
256c2t(4 node) 2M 3336 1051 68.52%
128c1t(2 node) 2M 1943 716 63.15%
# Change log
Changes in v3:
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag
Changes in v2:
- https://lore.kernel.org/all/[email protected]/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb
v1:
- https://lore.kernel.org/all/[email protected]/
- parallelize 2M hugetlb initialization with workqueue
Gang Li (7):
hugetlb: code clean for hugetlb_hstate_alloc_pages
hugetlb: split hugetlb_hstate_alloc_pages
padata: dispatch works on different nodes
hugetlb: pass *next_nid_to_alloc directly to
for_each_node_mask_to_alloc
hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
hugetlb: parallelize 2M hugetlb allocation and initialization
hugetlb: parallelize 1G hugetlb initialization
fs/Kconfig | 1 +
include/linux/hugetlb.h | 2 +-
include/linux/padata.h | 3 +
kernel/padata.c | 8 +-
mm/hugetlb.c | 224 +++++++++++++++++++++++++++-------------
mm/mm_init.c | 1 +
6 files changed, 163 insertions(+), 76 deletions(-)
--
2.20.1
The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
code, its readability can be improved, facilitating future modifications.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_node_specific() iterates through each
online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report() reports errors during allocation,
and updates h->max_huge_pages accordingly.
Signed-off-by: Gang Li <[email protected]>
---
mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d42..2606135ec55e6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}
+static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)
+{
+ int i;
+ bool node_specific_alloc = false;
+
+ for_each_online_node(i) {
+ if (h->max_huge_pages_node[i] > 0) {
+ hugetlb_hstate_alloc_pages_onenode(h, i);
+ node_specific_alloc = true;
+ }
+ }
+
+ return node_specific_alloc;
+}
+
+static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
+{
+ if (allocated < h->max_huge_pages) {
+ char buf[32];
+
+ string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
+ h->max_huge_pages, buf, allocated);
+ h->max_huge_pages = allocated;
+ }
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
struct folio *folio;
LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
- bool node_specific_alloc = false;
/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
}
/* do node specific alloc */
- for_each_online_node(i) {
- if (h->max_huge_pages_node[i] > 0) {
- hugetlb_hstate_alloc_pages_onenode(h, i);
- node_specific_alloc = true;
- }
- }
-
- if (node_specific_alloc)
+ if (hugetlb_hstate_alloc_pages_node_specific(h))
return;
/* below will do all node balanced alloc */
@@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
/* list will be empty if hstate_is_gigantic */
prep_and_add_allocated_folios(h, &folio_list);
- if (i < h->max_huge_pages) {
- char buf[32];
-
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
- pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
- h->max_huge_pages, buf, i);
- h->max_huge_pages = i;
- }
+ hugetlb_hstate_alloc_pages_report(i, h);
kfree(node_alloc_noretry);
}
--
2.20.1
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.
This patch has no functional changes.
Signed-off-by: Gang Li <[email protected]>
---
mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
1 file changed, 45 insertions(+), 41 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2606135ec55e6..92448e747991d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
}
}
+static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
+{
+ unsigned long i;
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ /*
+ * gigantic pages not added to list as they are not
+ * added to pools now.
+ */
+ if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+ break;
+ cond_resched();
+ }
+
+ return i;
+}
+
+static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
+{
+ unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
+ nodemask_t node_alloc_noretry;
+
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ cond_resched();
+ }
+
+ prep_and_add_allocated_folios(h, &folio_list);
+
+ return i;
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
*/
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
- unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t *node_alloc_noretry;
+ unsigned long allocated;
/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
/* below will do all node balanced alloc */
if (!hstate_is_gigantic(h)) {
- /*
- * Bit mask controlling how hard we retry per-node allocations.
- * Ignore errors as lower level routines can deal with
- * node_alloc_noretry == NULL. If this kmalloc fails at boot
- * time, we are likely in bigger trouble.
- */
- node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
- GFP_KERNEL);
+ allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
} else {
- /* allocations done at boot time */
- node_alloc_noretry = NULL;
- }
-
- /* bit mask controlling how hard we retry per-node allocations */
- if (node_alloc_noretry)
- nodes_clear(*node_alloc_noretry);
-
- for (i = 0; i < h->max_huge_pages; ++i) {
- if (hstate_is_gigantic(h)) {
- /*
- * gigantic pages not added to list as they are not
- * added to pools now.
- */
- if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
- break;
- } else {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- node_alloc_noretry);
- if (!folio)
- break;
- list_add(&folio->lru, &folio_list);
- }
- cond_resched();
+ allocated = hugetlb_hstate_alloc_pages_gigantic(h);
}
- /* list will be empty if hstate_is_gigantic */
- prep_and_add_allocated_folios(h, &folio_list);
-
- hugetlb_hstate_alloc_pages_report(i, h);
- kfree(node_alloc_noretry);
+ hugetlb_hstate_alloc_pages_report(allocated, h);
}
static void __init hugetlb_init_hstates(void)
--
2.20.1
When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantage of
multi-node systems.
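For illustration, a caller opts in by simply setting the new flag in its
job description (paraphrased from the 2M hugetlb patch later in this
series; hugetlb_alloc_node is the per-chunk worker added there):

```c
	struct padata_mt_job job = {
		.thread_fn  = hugetlb_alloc_node,	/* per-chunk allocation worker */
		.fn_arg     = h,			/* the hstate being populated */
		.align      = 1,
		.numa_aware = true,	/* spread work items across N_MEMORY nodes */
	};

	padata_do_multithreaded(&job);
```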
Signed-off-by: Gang Li <[email protected]>
---
include/linux/padata.h | 3 +++
kernel/padata.c | 8 ++++++--
mm/mm_init.c | 1 +
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..f79ccd50e7f40 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,8 @@ struct padata_shell {
* appropriate for one worker thread to do at once.
* @max_threads: Max threads to use for the job, actual number may be less
* depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
+ * no CPU, dispatch its jobs to a random CPU.
*/
struct padata_mt_job {
void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +148,7 @@ struct padata_mt_job {
unsigned long align;
unsigned long min_chunk;
int max_threads;
+ bool numa_aware;
};
/**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..1c2b3a337479e 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
struct padata_work my_work, *pw;
struct padata_mt_job_state ps;
LIST_HEAD(works);
- int nworks;
+ int nworks, nid = 0;
if (job->size == 0)
return;
@@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
ps.chunk_size = roundup(ps.chunk_size, job->align);
list_for_each_entry(pw, &works, pw_list)
- queue_work(system_unbound_wq, &pw->pw_work);
+ if (job->numa_aware)
+ queue_work_node((++nid % num_node_state(N_MEMORY)),
+ system_unbound_wq, &pw->pw_work);
+ else
+ queue_work(system_unbound_wq, &pw->pw_work);
/* Use the current thread, which saves starting a workqueue worker. */
padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 89dc29f1e6c6f..59fcffddf65a3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2225,6 +2225,7 @@ static int __init deferred_init_memmap(void *data)
.align = PAGES_PER_SECTION,
.min_chunk = PAGES_PER_SECTION,
.max_threads = max_threads,
+ .numa_aware = false,
};
padata_do_multithreaded(&job);
--
2.20.1
The parallelization of hugetlb allocation leads to errors when sharing
h->next_nid_to_alloc across different threads. To address this, it's
necessary to assign a separate next_nid_to_alloc for each thread.
Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
have been modified to directly accept a *next_nid_to_alloc parameter,
ensuring thread-specific allocation and avoiding concurrent access issues.
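For example, a parallel worker can now keep the cursor on its own stack,
while the existing sysfs/proc paths keep passing &h->next_nid_to_alloc
(illustrative sketch only, not part of this patch; `h` is the hstate the
worker allocates for):

```c
	int next_nid_to_alloc = 0;	/* thread-local cursor: no sharing, no race */
	nodemask_t node_alloc_noretry;
	struct folio *folio;

	nodes_clear(node_alloc_noretry);
	folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
				      &node_alloc_noretry, &next_nid_to_alloc);
```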
Signed-off-by: Gang Li <[email protected]>
---
This patch seems not elegant, but I can't come up with anything better.
Any suggestions will be highly appreciated!
---
mm/hugetlb.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 92448e747991d..a71bc1622b53b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
* next node from which to allocate, handling wrap at end of node
* mask.
*/
-static int hstate_next_node_to_alloc(struct hstate *h,
+static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
nodemask_t *nodes_allowed)
{
int nid;
VM_BUG_ON(!nodes_allowed);
- nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
- h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+ nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
+ *next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
return nid;
}
@@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
return nid;
}
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
+#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask) \
for (nr_nodes = nodes_weight(*mask); \
nr_nodes > 0 && \
- ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
+ ((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1); \
nr_nodes--)
#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
@@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
*/
static struct folio *alloc_pool_huge_folio(struct hstate *h,
nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+ nodemask_t *node_alloc_noretry,
+ int *next_nid_to_alloc)
{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nr_nodes, node;
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
struct folio *folio;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
@@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
goto found;
}
/* allocate from next node when distributing huge pages */
- for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
m = memblock_alloc_try_nid_raw(
huge_page_size(h), huge_page_size(h),
0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
VM_BUG_ON(delta != -1 && delta != 1);
if (delta < 0) {
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
if (h->surplus_huge_pages_node[node])
goto found;
}
@@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
cond_resched();
folio = alloc_pool_huge_folio(h, nodes_allowed,
- node_alloc_noretry);
+ node_alloc_noretry,
+ &h->next_nid_to_alloc);
if (!folio) {
prep_and_add_allocated_folios(h, &page_list);
spin_lock_irq(&hugetlb_lock);
--
2.20.1
Now hugetlb uses padata_do_multithreaded for parallel initialization,
so select CONFIG_PADATA.
Signed-off-by: Gang Li <[email protected]>
---
fs/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075f..a57d6e6c41e6f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -262,6 +262,7 @@ menuconfig HUGETLBFS
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL)
select MEMFD_CREATE
+ select PADATA
help
hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read
--
2.20.1
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb pages becomes faster,
thereby improving boot speed.
Here are some test results:
test no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2t(4 node) 2M 3336 1051 68.52%
128c1t(2 node) 2M 1943 716 63.15%
Signed-off-by: Gang Li <[email protected]>
---
mm/hugetlb.c | 72 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 53 insertions(+), 19 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a71bc1622b53b..d1629df5f399f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
#include <linux/delayacct.h>
#include <linux/memory.h>
#include <linux/mm_inline.h>
+#include <linux/padata.h>
#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -3510,6 +3511,38 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
}
}
+static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
+{
+ struct hstate *h = (struct hstate *)arg;
+ int i, num = end - start;
+ nodemask_t node_alloc_noretry;
+ unsigned long flags;
+ int next_nid_to_alloc = 0;
+
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < num; ++i) {
+ struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry, &next_nid_to_alloc);
+ if (!folio)
+ break;
+ spin_lock_irqsave(&hugetlb_lock, flags);
+ __prep_account_new_huge_page(h, folio_nid(folio));
+ enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+ cond_resched();
+ }
+}
+
+static void __init hugetlb_vmemmap_optimize_node(unsigned long start, unsigned long end, void *arg)
+{
+ struct hstate *h = (struct hstate *)arg;
+ int nid = start;
+
+ hugetlb_vmemmap_optimize_folios(h, &h->hugepage_freelists[nid]);
+}
+
static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
{
unsigned long i;
@@ -3529,26 +3562,27 @@ static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h
static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
{
- unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t node_alloc_noretry;
-
- /* Bit mask controlling how hard we retry per-node allocations.*/
- nodes_clear(node_alloc_noretry);
-
- for (i = 0; i < h->max_huge_pages; ++i) {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- &node_alloc_noretry);
- if (!folio)
- break;
- list_add(&folio->lru, &folio_list);
- cond_resched();
- }
-
- prep_and_add_allocated_folios(h, &folio_list);
+ struct padata_mt_job job = {
+ .fn_arg = h,
+ .align = 1,
+ .numa_aware = true
+ };
- return i;
+ job.thread_fn = hugetlb_alloc_node;
+ job.start = 0;
+ job.size = h->max_huge_pages;
+ job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+ job.max_threads = num_node_state(N_MEMORY) * 2;
+ padata_do_multithreaded(&job);
+
+ job.thread_fn = hugetlb_vmemmap_optimize_node;
+ job.start = 0;
+ job.size = num_node_state(N_MEMORY);
+ job.min_chunk = 1;
+ job.max_threads = num_node_state(N_MEMORY);
+ padata_do_multithreaded(&job);
+
+ return h->nr_huge_pages;
}
/*
--
2.20.1
Optimize the initialization speed of 1G huge pages through
parallelization.
1G hugetlb pages are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.
Here are some test results:
test no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2t(4 node) 1G 4745 2024 57.34%
128c1t(2 node) 1G 3358 1712 49.02%
12t 1G 77000 18300 76.23%
Signed-off-by: Gang Li <[email protected]>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++-------
2 files changed, 34 insertions(+), 8 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b11..77b30a8c6076b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];
/* arch callbacks */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1629df5f399f..e5a55707f8814 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
#endif
static unsigned long hugetlb_cma_size __initdata;
-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
/* for command line parsing */
static struct hstate * __initdata parsed_hstate;
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
huge_page_size(h) - PAGE_SIZE);
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[node]);
m->hstate = h;
return 1;
}
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);
- /* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
@@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
}
+ /* Subdivide locks to achieve better parallel performance */
+ spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
- spin_unlock_irqrestore(&hugetlb_lock, flags);
}
/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
*/
-static void __init gather_bootmem_prealloc(void)
+static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)
+
{
+ int nid = start;
LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
struct hstate *h = NULL, *prev_h = NULL;
- list_for_each_entry(m, &huge_boot_pages, list) {
+ list_for_each_entry(m, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
@@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
prep_and_add_bootmem_folios(h, &folio_list);
}
+static void __init gather_bootmem_prealloc(void)
+{
+ struct padata_mt_job job = {
+ .thread_fn = __gather_bootmem_prealloc,
+ .fn_arg = NULL,
+ .start = 0,
+ .size = num_node_state(N_MEMORY),
+ .align = 1,
+ .min_chunk = 1,
+ .max_threads = num_node_state(N_MEMORY),
+ .numa_aware = true,
+ };
+
+ padata_do_multithreaded(&job);
+}
+
static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
{
unsigned long i;
@@ -3606,6 +3624,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;
}
+ /* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/
+ if (huge_boot_pages[0].next == NULL) {
+ int i = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ INIT_LIST_HEAD(&huge_boot_pages[i]);
+ }
+
/* do node specific alloc */
if (hugetlb_hstate_alloc_pages_node_specific(h))
return;
--
2.20.1
On Tue, 2 Jan 2024, Gang Li wrote:
> The parallelization of hugetlb allocation leads to errors when sharing
> h->next_nid_to_alloc across different threads. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
>
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> This patch seems not elegant, but I can't come up with anything better.
> Any suggestions will be highly appreciated!
Same error as v2:
mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
nr_nodes > 0 && \
^~~~~~~~~~~~
mm/hugetlb.c:3342:38: note: uninitialized use occurs here
list_add(&m->list, &huge_boot_pages[node]);
^~~~
mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
^
mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
if (!m)
^~
mm/hugetlb.c:3342:38: note: uninitialized use occurs here
list_add(&m->list, &huge_boot_pages[node]);
^~~~
mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
if (!m)
^~~~~~~
mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this warning
int nr_nodes, node;
^
= 0
2 warnings generated.
> ---
> mm/hugetlb.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 92448e747991d..a71bc1622b53b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
> * next node from which to allocate, handling wrap at end of node
> * mask.
> */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> +static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
> nodemask_t *nodes_allowed)
> {
> int nid;
>
> VM_BUG_ON(!nodes_allowed);
>
> - nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> - h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> + nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
> + *next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
>
> return nid;
> }
> @@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> return nid;
> }
>
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
> +#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask) \
> for (nr_nodes = nodes_weight(*mask); \
> nr_nodes > 0 && \
> - ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
> + ((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1); \
> nr_nodes--)
>
> #define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
> @@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
> */
> static struct folio *alloc_pool_huge_folio(struct hstate *h,
> nodemask_t *nodes_allowed,
> - nodemask_t *node_alloc_noretry)
> + nodemask_t *node_alloc_noretry,
> + int *next_nid_to_alloc)
> {
> gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> int nr_nodes, node;
>
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
> struct folio *folio;
>
> folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> @@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> goto found;
> }
> /* allocate from next node when distributing huge pages */
> - for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> + for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
> m = memblock_alloc_try_nid_raw(
> huge_page_size(h), huge_page_size(h),
> 0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> @@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> VM_BUG_ON(delta != -1 && delta != 1);
>
> if (delta < 0) {
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
> if (h->surplus_huge_pages_node[node])
> goto found;
> }
> @@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> cond_resched();
>
> folio = alloc_pool_huge_folio(h, nodes_allowed,
> - node_alloc_noretry);
> + node_alloc_noretry,
> + &h->next_nid_to_alloc);
> if (!folio) {
> prep_and_add_allocated_folios(h, &page_list);
> spin_lock_irq(&hugetlb_lock);
> --
> 2.20.1
>
>
On Tue, 2 Jan 2024, Gang Li wrote:
> Hi all, hugetlb init parallelization has now been updated to v3.
>
> This series is tested on next-20240102 and can not be applied to v6.7-rc8.
>
> Update Summary:
> - Select CONFIG_PADATA as we use padata_do_multithreaded
> - Fix a race condition in h->next_nid_to_alloc
> - Fix local variable initialization issues
> - Remove RFC tag
>
> Thanks to the testing by David Rientjes, we now know that this patch reduce
> hugetlb 1G initialization time from 77s to 18.3s on a 12T machine[4].
>
> # Introduction
> Hugetlb initialization during boot takes up a considerable amount of time.
> For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
> seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
> host takes more than 1 minute[1]. This is a noteworthy figure.
>
> Inspired by [2] and [3], hugetlb initialization can also be accelerated
> through parallelization. Kernel already has infrastructure like
> padata_do_multithreaded, this patch uses it to achieve effective results
> by minimal modifications.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
> [3] https://lore.kernel.org/all/[email protected]/
> [4] https://lore.kernel.org/all/[email protected]/
>
> # Test result
> test no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2t(4 node) 1G 4745 2024 57.34%
> 128c1t(2 node) 1G 3358 1712 49.02%
> 12t 1G 77000 18300 76.23%
>
> 256c2t(4 node) 2M 3336 1051 68.52%
> 128c1t(2 node) 2M 1943 716 63.15%
>
I tested 1GB hugetlb on a smaller AMD host with the following:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
struct huge_bootmem_page *m = NULL; /* initialize for clang */
- int nr_nodes, node;
+ int nr_nodes, node = nid;
/* do node specific alloc */
if (nid != NUMA_NO_NODE) {
After the build error is fixed, feel free to add:
Tested-by: David Rientjes <[email protected]>
to each patch. I think Andrew will probably take a build fix up as a
delta on top of patch 4 rather than sending a whole new series unless
there is other feedback that you receive.
On 2024/1/3 09:52, David Rientjes wrote:
>
> I tested 1GB hugetlb on a smaller AMD host with the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> {
> struct huge_bootmem_page *m = NULL; /* initialize for clang */
> - int nr_nodes, node;
> + int nr_nodes, node = nid;
>
> /* do node specific alloc */
> if (nid != NUMA_NO_NODE) {
>
Oh, if nid != NUMA_NO_NODE and memblock_alloc_try_nid_raw succeeds,
`node` must take the value of `nid`.
Otherwise, list_add(&m->list, &huge_boot_pages[node]) will not be
executed correctly.
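To spell out the control flow, here is a simplified sketch of
__alloc_bootmem_huge_page with that fix applied (see mm/hugetlb.c for the
real code):

```c
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
	struct huge_bootmem_page *m = NULL;	/* initialize for clang */
	int nr_nodes, node = nid;	/* start from the requested node */

	/* do node specific alloc */
	if (nid != NUMA_NO_NODE) {
		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
					       0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
		if (!m)
			return 0;
		goto found;	/* jumps over the loop, so "node" is never written there */
	}

	/* allocate from next node when distributing huge pages */
	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
				    &node_states[N_MEMORY]) {
		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
					       0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
		if (m)
			break;
	}
	if (!m)
		return 0;

found:
	/* Put them into a private list first because mem_map is not up yet */
	INIT_LIST_HEAD(&m->list);
	list_add(&m->list, &huge_boot_pages[node]);	/* needs a valid "node" */
	m->hstate = h;
	return 1;
}
```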
> After the build error is fixed, feel free to add:
>
> Tested-by: David Rientjes <[email protected]>
>
Thanks!
> to each patch. I think Andrew will probably take a build fix up as a
> delta on top of patch 4 rather than sending a whole new series unless
> there is other feedback that you receive.
On 2024/1/3 09:32, David Rientjes wrote:
> Same error as v2:
>
> mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
> for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
> nr_nodes > 0 && \
> ^~~~~~~~~~~~
> mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> list_add(&m->list, &huge_boot_pages[node]);
> ^~~~
> mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
> for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
> ^
> mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
> if (!m)
> ^~
> mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> list_add(&m->list, &huge_boot_pages[node]);
> ^~~~
> mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
> if (!m)
> ^~~~~~~
> mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this warning
> int nr_nodes, node;
> ^
> = 0
> 2 warnings generated.
>
How did you get those warnings? I got nothing in my compilation.
On Wed, 3 Jan 2024, Gang Li wrote:
> On 2024/1/3 09:32, David Rientjes wrote:
> > Same error as v2:
> >
> > mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized
> > whenever '&&' condition is false [-Wsometimes-uninitialized]
> > for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
> > &node_states[N_MEMORY]) {
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
> > nr_nodes > 0 && \
> > ^~~~~~~~~~~~
> > mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> > list_add(&m->list, &huge_boot_pages[node]);
> > ^~~~
> > mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
> > for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
> > &node_states[N_MEMORY]) {
> > ^
> > mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever
> > 'if' condition is false [-Wsometimes-uninitialized]
> > if (!m)
> > ^~
> > mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> > list_add(&m->list, &huge_boot_pages[node]);
> > ^~~~
> > mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
> > if (!m)
> > ^~~~~~~
> > mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this
> > warning
> > int nr_nodes, node;
> > ^
> > = 0
> > 2 warnings generated.
> >
>
> How did you get those warnings? I got nothing in my compilation.
>
I'm using clang.
You spotted the issue in your earlier reply about the potentially
uninitialized use of "node" when adding to the list.
On 2024/1/2 21:12, Gang Li wrote:
> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
> code, its readability can be improved, facilitating future modifications.
>
> This patch extracts two functions to reduce the complexity of
> `hugetlb_hstate_alloc_pages` and has no functional changes.
>
> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
> each online node and performs allocation if necessary.
> - hugetlb_hstate_alloc_pages_report() report error during allocation.
> And the value of h->max_huge_pages is updated accordingly.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
> 1 file changed, 29 insertions(+), 17 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d42..2606135ec55e6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> h->max_huge_pages_node[nid] = i;
> }
>
> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)
I'd like to rename this to hugetlb_hstate_alloc_pages_specific_nodes.
Otherwise, LGTM.
Reviewed-by: Muchun Song <[email protected]>
> +{
> + int i;
> + bool node_specific_alloc = false;
> +
> + for_each_online_node(i) {
> + if (h->max_huge_pages_node[i] > 0) {
> + hugetlb_hstate_alloc_pages_onenode(h, i);
> + node_specific_alloc = true;
> + }
> + }
> +
> + return node_specific_alloc;
> +}
> +
> +static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
> +{
> + if (allocated < h->max_huge_pages) {
> + char buf[32];
> +
> + string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> + pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
> + h->max_huge_pages, buf, allocated);
> + h->max_huge_pages = allocated;
> + }
> +}
> +
> /*
> * NOTE: this routine is called in different contexts for gigantic and
> * non-gigantic pages.
> @@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> struct folio *folio;
> LIST_HEAD(folio_list);
> nodemask_t *node_alloc_noretry;
> - bool node_specific_alloc = false;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> }
>
> /* do node specific alloc */
> - for_each_online_node(i) {
> - if (h->max_huge_pages_node[i] > 0) {
> - hugetlb_hstate_alloc_pages_onenode(h, i);
> - node_specific_alloc = true;
> - }
> - }
> -
> - if (node_specific_alloc)
> + if (hugetlb_hstate_alloc_pages_node_specific(h))
> return;
>
> /* below will do all node balanced alloc */
> @@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> /* list will be empty if hstate_is_gigantic */
> prep_and_add_allocated_folios(h, &folio_list);
>
> - if (i < h->max_huge_pages) {
> - char buf[32];
> -
> - string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> - pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
> - h->max_huge_pages, buf, i);
> - h->max_huge_pages = i;
> - }
> + hugetlb_hstate_alloc_pages_report(i, h);
> kfree(node_alloc_noretry);
> }
>
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
> code, its readability can be improved, facilitating future modifications.
>
> This patch extracts two functions to reduce the complexity of
> `hugetlb_hstate_alloc_pages` and has no functional changes.
>
> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
> each online node and performs allocation if necessary.
> - hugetlb_hstate_alloc_pages_report() report error during allocation.
> And the value of h->max_huge_pages is updated accordingly.
Minor nit, I think hugetlb_hstate_alloc_pages_errcheck() is more
descriptive than hugetlb_hstate_alloc_pages_report().
Otherwise
Reviewed-by: Tim Chen <[email protected]>
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
> 1 file changed, 29 insertions(+), 17 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d42..2606135ec55e6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> h->max_huge_pages_node[nid] = i;
> }
>
> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)
> +{
> + int i;
> + bool node_specific_alloc = false;
> +
> + for_each_online_node(i) {
> + if (h->max_huge_pages_node[i] > 0) {
> + hugetlb_hstate_alloc_pages_onenode(h, i);
> + node_specific_alloc = true;
> + }
> + }
> +
> + return node_specific_alloc;
> +}
> +
> +static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
> +{
> + if (allocated < h->max_huge_pages) {
> + char buf[32];
> +
> + string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> + pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
> + h->max_huge_pages, buf, allocated);
> + h->max_huge_pages = allocated;
> + }
> +}
> +
> /*
> * NOTE: this routine is called in different contexts for gigantic and
> * non-gigantic pages.
> @@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> struct folio *folio;
> LIST_HEAD(folio_list);
> nodemask_t *node_alloc_noretry;
> - bool node_specific_alloc = false;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> }
>
> /* do node specific alloc */
> - for_each_online_node(i) {
> - if (h->max_huge_pages_node[i] > 0) {
> - hugetlb_hstate_alloc_pages_onenode(h, i);
> - node_specific_alloc = true;
> - }
> - }
> -
> - if (node_specific_alloc)
> + if (hugetlb_hstate_alloc_pages_node_specific(h))
> return;
>
> /* below will do all node balanced alloc */
> @@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> /* list will be empty if hstate_is_gigantic */
> prep_and_add_allocated_folios(h, &folio_list);
>
> - if (i < h->max_huge_pages) {
> - char buf[32];
> -
> - string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> - pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
> - h->max_huge_pages, buf, i);
> - h->max_huge_pages = i;
> - }
> + hugetlb_hstate_alloc_pages_report(i, h);
> kfree(node_alloc_noretry);
> }
>
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> 1G and 2M huge pages have different allocation and initialization logic,
> which leads to subtle differences in parallelization. Therefore, it is
> appropriate to split hugetlb_hstate_alloc_pages into gigantic and
> non-gigantic.
>
> This patch has no functional changes.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
> 1 file changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2606135ec55e6..92448e747991d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
> }
> }
>
> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
> +{
> + unsigned long i;
> +
> + for (i = 0; i < h->max_huge_pages; ++i) {
> + /*
> + * gigantic pages not added to list as they are not
> + * added to pools now.
> + */
This comment is unnecessary now that we no longer mix the gigantic and
non-gigantic code that uses the folio list. And folio_list is not in this
routine. It can be removed.
Otherwise Reviewed-by: Tim Chen <[email protected]>
> + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> + break;
> + cond_resched();
> + }
> +
> + return i;
> +}
> +
> +static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
> +{
> + unsigned long i;
> + struct folio *folio;
> + LIST_HEAD(folio_list);
> + nodemask_t node_alloc_noretry;
> +
> + /* Bit mask controlling how hard we retry per-node allocations.*/
> + nodes_clear(node_alloc_noretry);
> +
> + for (i = 0; i < h->max_huge_pages; ++i) {
> + folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> + &node_alloc_noretry);
> + if (!folio)
> + break;
> + list_add(&folio->lru, &folio_list);
> + cond_resched();
> + }
> +
> + prep_and_add_allocated_folios(h, &folio_list);
> +
> + return i;
> +}
> +
> /*
> * NOTE: this routine is called in different contexts for gigantic and
> * non-gigantic pages.
> @@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
> */
> static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> {
> - unsigned long i;
> - struct folio *folio;
> - LIST_HEAD(folio_list);
> - nodemask_t *node_alloc_noretry;
> + unsigned long allocated;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>
> /* below will do all node balanced alloc */
> if (!hstate_is_gigantic(h)) {
> - /*
> - * Bit mask controlling how hard we retry per-node allocations.
> - * Ignore errors as lower level routines can deal with
> - * node_alloc_noretry == NULL. If this kmalloc fails at boot
> - * time, we are likely in bigger trouble.
> - */
> - node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
> - GFP_KERNEL);
> + allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
> } else {
> - /* allocations done at boot time */
> - node_alloc_noretry = NULL;
> - }
> -
> - /* bit mask controlling how hard we retry per-node allocations */
> - if (node_alloc_noretry)
> - nodes_clear(*node_alloc_noretry);
> -
> - for (i = 0; i < h->max_huge_pages; ++i) {
> - if (hstate_is_gigantic(h)) {
> - /*
> - * gigantic pages not added to list as they are not
> - * added to pools now.
> - */
> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> - break;
> - } else {
> - folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> - node_alloc_noretry);
> - if (!folio)
> - break;
> - list_add(&folio->lru, &folio_list);
> - }
> - cond_resched();
> + allocated = hugetlb_hstate_alloc_pages_gigantic(h);
> }
>
> - /* list will be empty if hstate_is_gigantic */
> - prep_and_add_allocated_folios(h, &folio_list);
> -
> - hugetlb_hstate_alloc_pages_report(i, h);
> - kfree(node_alloc_noretry);
> + hugetlb_hstate_alloc_pages_report(allocated, h);
> }
>
> static void __init hugetlb_init_hstates(void)
On 2024/1/10 18:19, Muchun Song wrote:
>
>
> On 2024/1/2 21:12, Gang Li wrote:
>> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
>> code, its readability can be improved, facilitating future modifications.
>>
>> This patch extracts two functions to reduce the complexity of
>> `hugetlb_hstate_alloc_pages` and has no functional changes.
>>
>> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
>> each online node and performs allocation if necessary.
>> - hugetlb_hstate_alloc_pages_report() report error during allocation.
>> And the value of h->max_huge_pages is updated accordingly.
>>
>> Signed-off-by: Gang Li <[email protected]>
>> ---
>> mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
>> 1 file changed, 29 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ed1581b670d42..2606135ec55e6 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3482,6 +3482,33 @@ static void __init
>> hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>> h->max_huge_pages_node[nid] = i;
>> }
>> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct
>> hstate *h)
>
> I'd like to rename this to hugetlb_hstate_alloc_pages_specific_nodes.
>
> Otherwise, LGTM.
>
> Reviewed-by: Muchun Song <[email protected]>
>
Thanks! I will adjust it in the next version.
On 2024/1/11 05:55, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
>> code, its readability can be improved, facilitating future modifications.
>>
>> This patch extracts two functions to reduce the complexity of
>> `hugetlb_hstate_alloc_pages` and has no functional changes.
>>
>> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
>> each online node and performs allocation if necessary.
>> - hugetlb_hstate_alloc_pages_report() report error during allocation.
>> And the value of h->max_huge_pages is updated accordingly.
>
> Minor nit, I think hugetlb_hstate_alloc_pages_errcheck() is more
> descriptive than hugetlb_hstate_alloc_pages_report().
Thanks! This looks more intuitive.
>
> Otherwise
>
> Reviewed-by: Tim Chen <[email protected]>
>
On 2024/1/11 07:12, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
>> +{
>> + unsigned long i;
>> +
>> + for (i = 0; i < h->max_huge_pages; ++i) {
>> + /*
>> + * gigantic pages not added to list as they are not
>> + * added to pools now.
>> + */
>
> This comment unnecessary as now we don't have mix gigantic and non-gigantic code,
> which uses foilio list. And folio_list is not in this routine.
>
> Can be removed.
>
> Otherwise Reviewed-by: Tim Chen <[email protected]>
>
Thanks!
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
>
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> include/linux/padata.h | 3 +++
> kernel/padata.c | 8 ++++++--
> mm/mm_init.c | 1 +
> 3 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d72..f79ccd50e7f40 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,8 @@ struct padata_shell {
> * appropriate for one worker thread to do at once.
> * @max_threads: Max threads to use for the job, actual number may be less
> * depending on task size and minimum chunk size.
> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> + * no CPU, dispatch its jobs to a random CPU.
> */
> struct padata_mt_job {
> void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> @@ -146,6 +148,7 @@ struct padata_mt_job {
> unsigned long align;
> unsigned long min_chunk;
> int max_threads;
> + bool numa_aware;
> };
>
> /**
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 179fb1518070c..1c2b3a337479e 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> struct padata_work my_work, *pw;
> struct padata_mt_job_state ps;
> LIST_HEAD(works);
> - int nworks;
> + int nworks, nid = 0;
If we always start from 0, we may be biased towards the low-numbered
nodes and not use the high-numbered nodes at all. Suggest you do
	static int nid = 0;
>
> if (job->size == 0)
> return;
> @@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> ps.chunk_size = roundup(ps.chunk_size, job->align);
>
> list_for_each_entry(pw, &works, pw_list)
> - queue_work(system_unbound_wq, &pw->pw_work);
> + if (job->numa_aware)
> + queue_work_node((++nid % num_node_state(N_MEMORY)),
> + system_unbound_wq, &pw->pw_work);
I think we should use nid = next_node(nid, node_states[N_CPU]) instead of
++nid % num_node_state(N_MEMORY). You are picking the next node with CPU
to handle the job.
Tim
> + else
> + queue_work(system_unbound_wq, &pw->pw_work);
>
> /* Use the current thread, which saves starting a workqueue worker. */
> padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 89dc29f1e6c6f..59fcffddf65a3 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2225,6 +2225,7 @@ static int __init deferred_init_memmap(void *data)
> .align = PAGES_PER_SECTION,
> .min_chunk = PAGES_PER_SECTION,
> .max_threads = max_threads,
> + .numa_aware = false,
> };
>
> padata_do_multithreaded(&job);
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> The parallelization of hugetlb allocation leads to errors when sharing
> h->next_nid_to_alloc across different threads. To address this, it's
Suggest you say
With parallelization of hugetlb allocation across different threads,
each thread works on a different node to allocate pages from, instead
of all allocating from a common node h->next_nid_to_alloc. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
>
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> This patch seems not elegant, but I can't come up with anything better.
> Any suggestions will be highly appreciated!
> ---
> mm/hugetlb.c | 22 ++++++++++++----------
> 1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 92448e747991d..a71bc1622b53b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
> * next node from which to allocate, handling wrap at end of node
> * mask.
> */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> +static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
> nodemask_t *nodes_allowed)
> {
> int nid;
>
> VM_BUG_ON(!nodes_allowed);
>
> - nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> - h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> + nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
> + *next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
>
> return nid;
> }
> @@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
> return nid;
> }
>
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
> +#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask) \
> for (nr_nodes = nodes_weight(*mask); \
> nr_nodes > 0 && \
> - ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
> + ((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1); \
> nr_nodes--)
>
> #define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
> @@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
> */
> static struct folio *alloc_pool_huge_folio(struct hstate *h,
> nodemask_t *nodes_allowed,
> - nodemask_t *node_alloc_noretry)
> + nodemask_t *node_alloc_noretry,
> + int *next_nid_to_alloc)
> {
> gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> int nr_nodes, node;
>
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
> struct folio *folio;
>
> folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> @@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> goto found;
> }
> /* allocate from next node when distributing huge pages */
> - for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> + for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
> m = memblock_alloc_try_nid_raw(
> huge_page_size(h), huge_page_size(h),
> 0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> @@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
> VM_BUG_ON(delta != -1 && delta != 1);
>
> if (delta < 0) {
> - for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> + for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
> if (h->surplus_huge_pages_node[node])
> goto found;
> }
> @@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
> cond_resched();
>
> folio = alloc_pool_huge_folio(h, nodes_allowed,
> - node_alloc_noretry);
> + node_alloc_noretry,
> + &h->next_nid_to_alloc);
> if (!folio) {
> prep_and_add_allocated_folios(h, &page_list);
> spin_lock_irq(&hugetlb_lock);
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> Now hugetlb uses padata_do_multithreaded for parallel initialization,
> so select CONFIG_PADATA.
Perhaps rephrase
Allow hugetlb to use padata_do_multithreaded for parallel initialization.
Select CONFIG_PADATA in this case.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> fs/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 89fdbefd1075f..a57d6e6c41e6f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -262,6 +262,7 @@ menuconfig HUGETLBFS
> depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
> depends on (SYSFS || SYSCTL)
> select MEMFD_CREATE
> + select PADATA
> help
> hugetlbfs is a filesystem backing for HugeTLB pages, based on
> ramfs. For architectures that support it, say Y here and read
On 2024/1/12 01:50, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> When a group of tasks that access different nodes are scheduled on the
>> same node, they may encounter bandwidth bottlenecks and access latency.
>>
>> Thus, numa_aware flag is introduced here, allowing tasks to be
>> distributed across different nodes to fully utilize the advantage of
>> multi-node systems.
>>
>> Signed-off-by: Gang Li <[email protected]>
>> ---
>> include/linux/padata.h | 3 +++
>> kernel/padata.c | 8 ++++++--
>> mm/mm_init.c | 1 +
>> 3 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>> index 495b16b6b4d72..f79ccd50e7f40 100644
>> --- a/include/linux/padata.h
>> +++ b/include/linux/padata.h
>> @@ -137,6 +137,8 @@ struct padata_shell {
>> * appropriate for one worker thread to do at once.
>> * @max_threads: Max threads to use for the job, actual number may be less
>> * depending on task size and minimum chunk size.
>> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
>> + * no CPU, dispatch its jobs to a random CPU.
>> */
>> struct padata_mt_job {
>> void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
>> @@ -146,6 +148,7 @@ struct padata_mt_job {
>> unsigned long align;
>> unsigned long min_chunk;
>> int max_threads;
>> + bool numa_aware;
>> };
>>
>> /**
>> diff --git a/kernel/padata.c b/kernel/padata.c
>> index 179fb1518070c..1c2b3a337479e 100644
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>> struct padata_work my_work, *pw;
>> struct padata_mt_job_state ps;
>> LIST_HEAD(works);
>> - int nworks;
>> + int nworks, nid = 0;
>
> If we always start from 0, we may be biased towards the low numbered node,
> and not use high numbered nodes at all. Suggest you do
> static nid = 0;
>
When we use `static`, multiple parallel calls to
`padata_do_multithreaded` may result in an uneven distribution of
tasks within each padata_do_multithreaded call.
We can make the following modifications to address this issue.
```
diff --git a/kernel/padata.c b/kernel/padata.c
index 1c2b3a337479e..925e48df6dd8d 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks, nid = 0;
+	int nworks, nid;
+	static volatile int global_nid = 0;
 
 	if (job->size == 0)
 		return;
@@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	ps.chunk_size = max(ps.chunk_size, job->min_chunk);
 	ps.chunk_size = roundup(ps.chunk_size, job->align);
 
+	nid = global_nid;
 	list_for_each_entry(pw, &works, pw_list)
-		if (job->numa_aware)
-			queue_work_node((++nid % num_node_state(N_MEMORY)),
-					system_unbound_wq, &pw->pw_work);
-		else
+		if (job->numa_aware) {
+			queue_work_node(nid, system_unbound_wq, &pw->pw_work);
+			nid = next_node(nid, node_states[N_CPU]);
+		} else
 			queue_work(system_unbound_wq, &pw->pw_work);
+	if (job->numa_aware)
+		global_nid = nid;
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
```
>>
>> if (job->size == 0)
>> return;
>> @@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>> ps.chunk_size = roundup(ps.chunk_size, job->align);
>>
>> list_for_each_entry(pw, &works, pw_list)
>> - queue_work(system_unbound_wq, &pw->pw_work);
>> + if (job->numa_aware)
>> + queue_work_node((++nid % num_node_state(N_MEMORY)),
>> + system_unbound_wq, &pw->pw_work);
>
> I think we should use nid = next_node(nid, node_states[N_CPU]) instead of
> ++nid % num_node_state(N_MEMORY). You are picking the next node with CPU
> to handle the job.
>
> Tim
>
I agree.
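To make the distinction concrete, here is a tiny hedged sketch (the helper name is hypothetical, not from the series): next_node() walks the actual nodemask, so sparse node IDs and memory-only nodes are skipped, which a plain `++nid % num_node_state(N_MEMORY)` cannot guarantee, since the count of nodes is not the same as the set of valid node IDs. The wrap-around question is picked up further down the thread.
```
/*
 * Illustrative only: advance a round-robin cursor over nodes that have
 * CPUs. next_node() returns MAX_NUMNODES past the last set bit, so the
 * caller must wrap explicitly (next_node_in() would wrap automatically).
 */
static int pick_next_cpu_node(int nid)
{
	nid = next_node(nid, node_states[N_CPU]);
	if (nid >= MAX_NUMNODES)
		nid = first_node(node_states[N_CPU]);
	return nid;
}
```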
On 2024/1/12 06:21, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> The parallelization of hugetlb allocation leads to errors when sharing
>> h->next_nid_to_alloc across different threads. To address this, it's
>
> Suggest you say
> With parallelization of hugetlb allocation across different threads,
> each thread works on a different node to allocate pages from, instead
> of all allocating from a common node h->next_nid_to_alloc. To address this, it's
>
>> necessary to assign a separate next_nid_to_alloc for each thread.
LGTM, thanks.
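For illustration, a minimal sketch (not part of the series) of how a boot-time worker could drive the reworked interface with a thread-local cursor instead of the shared h->next_nid_to_alloc; the worker function and its signature are hypothetical:
```
/*
 * Hypothetical per-thread boot allocation loop: each worker keeps its own
 * next_nid_to_alloc, so concurrent workers no longer race on the shared
 * h->next_nid_to_alloc cursor.
 */
static void __init hugetlb_alloc_worker_sketch(struct hstate *h,
					       unsigned long nr_pages)
{
	LIST_HEAD(folio_list);
	nodemask_t node_alloc_noretry;
	int next_nid_to_alloc = first_node(node_states[N_MEMORY]);
	unsigned long i;

	/* Bit mask controlling how hard we retry per-node allocations. */
	nodes_clear(node_alloc_noretry);

	for (i = 0; i < nr_pages; i++) {
		struct folio *folio;

		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
					      &node_alloc_noretry,
					      &next_nid_to_alloc);
		if (!folio)
			break;
		list_add(&folio->lru, &folio_list);
		cond_resched();
	}

	prep_and_add_allocated_folios(h, &folio_list);
}
```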
On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
> On 2024/1/12 01:50, Tim Chen wrote:
> > On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> > > When a group of tasks that access different nodes are scheduled on the
> > > same node, they may encounter bandwidth bottlenecks and access latency.
> > >
> > > Thus, numa_aware flag is introduced here, allowing tasks to be
> > > distributed across different nodes to fully utilize the advantage of
> > > multi-node systems.
> > >
> > > Signed-off-by: Gang Li <[email protected]>
> > > ---
> > > include/linux/padata.h | 3 +++
> > > kernel/padata.c | 8 ++++++--
> > > mm/mm_init.c | 1 +
> > > 3 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/padata.h b/include/linux/padata.h
> > > index 495b16b6b4d72..f79ccd50e7f40 100644
> > > --- a/include/linux/padata.h
> > > +++ b/include/linux/padata.h
> > > @@ -137,6 +137,8 @@ struct padata_shell {
> > > * appropriate for one worker thread to do at once.
> > > * @max_threads: Max threads to use for the job, actual number may be less
> > > * depending on task size and minimum chunk size.
> > > + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> > > + * no CPU, dispatch its jobs to a random CPU.
> > > */
> > > struct padata_mt_job {
> > > void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> > > @@ -146,6 +148,7 @@ struct padata_mt_job {
> > > unsigned long align;
> > > unsigned long min_chunk;
> > > int max_threads;
> > > + bool numa_aware;
> > > };
> > >
> > > /**
> > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > index 179fb1518070c..1c2b3a337479e 100644
> > > --- a/kernel/padata.c
> > > +++ b/kernel/padata.c
> > > @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > > struct padata_work my_work, *pw;
> > > struct padata_mt_job_state ps;
> > > LIST_HEAD(works);
> > > - int nworks;
> > > + int nworks, nid = 0;
> >
> > If we always start from 0, we may be biased towards the low numbered node,
> > and not use high numbered nodes at all. Suggest you do
> > static nid = 0;
> >
>
> When we use `static`, if there are multiple parallel calls to
> `padata_do_multithreaded`, it may result in an uneven distribution of
> tasks for each padata_do_multithreaded.
>
> We can make the following modifications to address this issue.
>
> ```
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 1c2b3a337479e..925e48df6dd8d 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
> padata_mt_job *job)
> struct padata_work my_work, *pw;
> struct padata_mt_job_state ps;
> LIST_HEAD(works);
> - int nworks, nid = 0;
> + int nworks, nid;
> + static volatile int global_nid = 0;
>
> if (job->size == 0)
> return;
> @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
> padata_mt_job *job)
> ps.chunk_size = max(ps.chunk_size, job->min_chunk);
> ps.chunk_size = roundup(ps.chunk_size, job->align);
>
> + nid = global_nid;
> list_for_each_entry(pw, &works, pw_list)
> - if (job->numa_aware)
> - queue_work_node((++nid % num_node_state(N_MEMORY)),
> - system_unbound_wq, &pw->pw_work);
> - else
> + if (job->numa_aware) {
> + queue_work_node(nid, system_unbound_wq,
> &pw->pw_work);
> + nid = next_node(nid, node_states[N_CPU]);
> + } else
> queue_work(system_unbound_wq, &pw->pw_work);
> + if (job->numa_aware)
> + global_nid = nid;
Thinking more about it, there could still be multiple threads working
at the same time with a stale global_nid. We should probably do a
compare-exchange that updates global_nid with the new nid only if
global_nid was unchanged. Otherwise we should advance to the next node
from the changed global_nid before we queue the job.
Tim
On 2024/1/13 02:27, Tim Chen wrote:
> On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
>> On 2024/1/12 01:50, Tim Chen wrote:
>>> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>>>> When a group of tasks that access different nodes are scheduled on the
>>>> same node, they may encounter bandwidth bottlenecks and access latency.
>>>>
>>>> Thus, numa_aware flag is introduced here, allowing tasks to be
>>>> distributed across different nodes to fully utilize the advantage of
>>>> multi-node systems.
>>>>
>>>> Signed-off-by: Gang Li <[email protected]>
>>>> ---
>>>> include/linux/padata.h | 3 +++
>>>> kernel/padata.c | 8 ++++++--
>>>> mm/mm_init.c | 1 +
>>>> 3 files changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>>>> index 495b16b6b4d72..f79ccd50e7f40 100644
>>>> --- a/include/linux/padata.h
>>>> +++ b/include/linux/padata.h
>>>> @@ -137,6 +137,8 @@ struct padata_shell {
>>>> * appropriate for one worker thread to do at once.
>>>> * @max_threads: Max threads to use for the job, actual number may be less
>>>> * depending on task size and minimum chunk size.
>>>> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
>>>> + * no CPU, dispatch its jobs to a random CPU.
>>>> */
>>>> struct padata_mt_job {
>>>> void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
>>>> @@ -146,6 +148,7 @@ struct padata_mt_job {
>>>> unsigned long align;
>>>> unsigned long min_chunk;
>>>> int max_threads;
>>>> + bool numa_aware;
>>>> };
>>>>
>>>> /**
>>>> diff --git a/kernel/padata.c b/kernel/padata.c
>>>> index 179fb1518070c..1c2b3a337479e 100644
>>>> --- a/kernel/padata.c
>>>> +++ b/kernel/padata.c
>>>> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>>> struct padata_work my_work, *pw;
>>>> struct padata_mt_job_state ps;
>>>> LIST_HEAD(works);
>>>> - int nworks;
>>>> + int nworks, nid = 0;
>>>
>>> If we always start from 0, we may be biased towards the low numbered node,
>>> and not use high numbered nodes at all. Suggest you do
>>> static nid = 0;
>>>
>>
>> When we use `static`, if there are multiple parallel calls to
>> `padata_do_multithreaded`, it may result in an uneven distribution of
>> tasks for each padata_do_multithreaded.
>>
>> We can make the following modifications to address this issue.
>>
>> ```
>> diff --git a/kernel/padata.c b/kernel/padata.c
>> index 1c2b3a337479e..925e48df6dd8d 100644
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
>> padata_mt_job *job)
>> struct padata_work my_work, *pw;
>> struct padata_mt_job_state ps;
>> LIST_HEAD(works);
>> - int nworks, nid = 0;
>> + int nworks, nid;
>> + static volatile int global_nid = 0;
>>
>> if (job->size == 0)
>> return;
>> @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
>> padata_mt_job *job)
>> ps.chunk_size = max(ps.chunk_size, job->min_chunk);
>> ps.chunk_size = roundup(ps.chunk_size, job->align);
>>
>> + nid = global_nid;
>> list_for_each_entry(pw, &works, pw_list)
>> - if (job->numa_aware)
>> - queue_work_node((++nid % num_node_state(N_MEMORY)),
>> - system_unbound_wq, &pw->pw_work);
>> - else
>> + if (job->numa_aware) {
>> + queue_work_node(nid, system_unbound_wq,
>> &pw->pw_work);
>> + nid = next_node(nid, node_states[N_CPU]);
>> + } else
>> queue_work(system_unbound_wq, &pw->pw_work);
>> + if (job->numa_aware)
>> + global_nid = nid;
>
> Thinking more about it, there could still be multiple threads working
> at the same time with stale global_nid. We should probably do a compare
> exchange of global_nid with new nid only if the global nid was unchanged.
> Otherwise we should go to the next node with the changed global nid before
> we queue the job.
>
> Tim
>
How about:
```
nid = global_nid;
list_for_each_entry(pw, &works, pw_list)
if (job->numa_aware) {
int old_node = nid;
queue_work_node(nid, system_unbound_wq, &pw->pw_work);
nid = next_node(nid, node_states[N_CPU]);
cmpxchg(&global_nid, old_node, nid);
} else
queue_work(system_unbound_wq, &pw->pw_work);
```
On 2024/1/2 21:12, Gang Li wrote:
> 1G and 2M huge pages have different allocation and initialization logic,
> which leads to subtle differences in parallelization. Therefore, it is
> appropriate to split hugetlb_hstate_alloc_pages into gigantic and
> non-gigantic.
>
> This patch has no functional changes.
>
> Signed-off-by: Gang Li <[email protected]>
> ---
> mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
> 1 file changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2606135ec55e6..92448e747991d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
> }
> }
>
> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
The name is rather long; how about hugetlb_gigantic_pages_alloc_boot?
> +{
> + unsigned long i;
> +
> + for (i = 0; i < h->max_huge_pages; ++i) {
> + /*
> + * gigantic pages not added to list as they are not
> + * added to pools now.
> + */
> + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> + break;
> + cond_resched();
> + }
> +
> + return i;
> +}
> +
> +static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
hugetlb_pages_alloc_boot?
> +{
> + unsigned long i;
> + struct folio *folio;
> + LIST_HEAD(folio_list);
> + nodemask_t node_alloc_noretry;
> +
> + /* Bit mask controlling how hard we retry per-node allocations.*/
> + nodes_clear(node_alloc_noretry);
> +
> + for (i = 0; i < h->max_huge_pages; ++i) {
> + folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> + &node_alloc_noretry);
> + if (!folio)
> + break;
> + list_add(&folio->lru, &folio_list);
> + cond_resched();
> + }
> +
> + prep_and_add_allocated_folios(h, &folio_list);
> +
> + return i;
> +}
> +
> /*
> * NOTE: this routine is called in different contexts for gigantic and
> * non-gigantic pages.
> @@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
> */
> static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> {
> - unsigned long i;
> - struct folio *folio;
> - LIST_HEAD(folio_list);
> - nodemask_t *node_alloc_noretry;
> + unsigned long allocated;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>
> /* below will do all node balanced alloc */
> if (!hstate_is_gigantic(h)) {
It is unnecessary to reverse the condition. A little simpler, like the following:
if (hstate_is_gigantic(h))
/* gigantic pages */
else
/* normal pages */
> - /*
> - * Bit mask controlling how hard we retry per-node allocations.
> - * Ignore errors as lower level routines can deal with
> - * node_alloc_noretry == NULL. If this kmalloc fails at boot
> - * time, we are likely in bigger trouble.
> - */
> - node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
> - GFP_KERNEL);
> + allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
> } else {
> - /* allocations done at boot time */
> - node_alloc_noretry = NULL;
> - }
> -
> - /* bit mask controlling how hard we retry per-node allocations */
> - if (node_alloc_noretry)
> - nodes_clear(*node_alloc_noretry);
> -
> - for (i = 0; i < h->max_huge_pages; ++i) {
> - if (hstate_is_gigantic(h)) {
> - /*
> - * gigantic pages not added to list as they are not
> - * added to pools now.
> - */
> - if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> - break;
> - } else {
> - folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> - node_alloc_noretry);
> - if (!folio)
> - break;
> - list_add(&folio->lru, &folio_list);
> - }
> - cond_resched();
> + allocated = hugetlb_hstate_alloc_pages_gigantic(h);
> }
>
> - /* list will be empty if hstate_is_gigantic */
> - prep_and_add_allocated_folios(h, &folio_list);
> -
> - hugetlb_hstate_alloc_pages_report(i, h);
> - kfree(node_alloc_noretry);
> + hugetlb_hstate_alloc_pages_report(allocated, h);
> }
>
> static void __init hugetlb_init_hstates(void)
On 2024/1/16 15:02, Muchun Song wrote:
> On 2024/1/2 21:12, Gang Li wrote:
>> hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
>
> The name is so long, how about hugetlb_gigantic_pages_alloc_boot?
>
>> hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
>
> hugetlb_pages_alloc_boot?
LGTM.
> It is unnecessary to reverse the condition. A little simpler, like the following:
>
> if (hstate_is_gigantic(h))
> /* gigantic pages */
> else
> /* normal pages */
Will take it in the next version.
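As a quick check of how the function reads with both suggestions applied, here is a sketch (not the final patch) of the dispatcher with the renamed helpers and the condition no longer reversed; the earlier hugetlb_cma and node-specific checks are elided:
```
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
	unsigned long allocated;

	/* ... hugetlb_cma and node-specific allocation handling elided ... */

	if (hstate_is_gigantic(h))
		allocated = hugetlb_gigantic_pages_alloc_boot(h);
	else
		allocated = hugetlb_pages_alloc_boot(h);

	hugetlb_hstate_alloc_pages_report(allocated, h);
}
```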
> On Jan 2, 2024, at 21:12, Gang Li <[email protected]> wrote:
>
> Now hugetlb uses padata_do_multithreaded for parallel initialization,
> so select CONFIG_PADATA.
>
> Signed-off-by: Gang Li <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
On Mon, 2024-01-15 at 16:57 +0800, Gang Li wrote:
>
> On 2024/1/13 02:27, Tim Chen wrote:
> > On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
> > > On 2024/1/12 01:50, Tim Chen wrote:
> > > > On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> > > > > When a group of tasks that access different nodes are scheduled on the
> > > > > same node, they may encounter bandwidth bottlenecks and access latency.
> > > > >
> > > > > Thus, numa_aware flag is introduced here, allowing tasks to be
> > > > > distributed across different nodes to fully utilize the advantage of
> > > > > multi-node systems.
> > > > >
> > > > > Signed-off-by: Gang Li <[email protected]>
> > > > > ---
> > > > > include/linux/padata.h | 3 +++
> > > > > kernel/padata.c | 8 ++++++--
> > > > > mm/mm_init.c | 1 +
> > > > > 3 files changed, 10 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/padata.h b/include/linux/padata.h
> > > > > index 495b16b6b4d72..f79ccd50e7f40 100644
> > > > > --- a/include/linux/padata.h
> > > > > +++ b/include/linux/padata.h
> > > > > @@ -137,6 +137,8 @@ struct padata_shell {
> > > > > * appropriate for one worker thread to do at once.
> > > > > * @max_threads: Max threads to use for the job, actual number may be less
> > > > > * depending on task size and minimum chunk size.
> > > > > + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> > > > > + * no CPU, dispatch its jobs to a random CPU.
> > > > > */
> > > > > struct padata_mt_job {
> > > > > void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> > > > > @@ -146,6 +148,7 @@ struct padata_mt_job {
> > > > > unsigned long align;
> > > > > unsigned long min_chunk;
> > > > > int max_threads;
> > > > > + bool numa_aware;
> > > > > };
> > > > >
> > > > > /**
> > > > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > > > index 179fb1518070c..1c2b3a337479e 100644
> > > > > --- a/kernel/padata.c
> > > > > +++ b/kernel/padata.c
> > > > > @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > > > > struct padata_work my_work, *pw;
> > > > > struct padata_mt_job_state ps;
> > > > > LIST_HEAD(works);
> > > > > - int nworks;
> > > > > + int nworks, nid = 0;
> > > >
> > > > If we always start from 0, we may be biased towards the low numbered node,
> > > > and not use high numbered nodes at all. Suggest you do
> > > > static nid = 0;
> > > >
> > >
> > > When we use `static`, if there are multiple parallel calls to
> > > `padata_do_multithreaded`, it may result in an uneven distribution of
> > > tasks for each padata_do_multithreaded.
> > >
> > > We can make the following modifications to address this issue.
> > >
> > > ```
> > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > index 1c2b3a337479e..925e48df6dd8d 100644
> > > --- a/kernel/padata.c
> > > +++ b/kernel/padata.c
> > > @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > > struct padata_work my_work, *pw;
> > > struct padata_mt_job_state ps;
> > > LIST_HEAD(works);
> > > - int nworks, nid = 0;
> > > + int nworks, nid;
> > > + static volatile int global_nid = 0;
> > >
> > > if (job->size == 0)
> > > return;
> > > @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > > ps.chunk_size = max(ps.chunk_size, job->min_chunk);
> > > ps.chunk_size = roundup(ps.chunk_size, job->align);
> > >
> > > + nid = global_nid;
> > > list_for_each_entry(pw, &works, pw_list)
> > > - if (job->numa_aware)
> > > - queue_work_node((++nid % num_node_state(N_MEMORY)),
> > > - system_unbound_wq, &pw->pw_work);
> > > - else
> > > + if (job->numa_aware) {
> > > + queue_work_node(nid, system_unbound_wq,
> > > &pw->pw_work);
> > > + nid = next_node(nid, node_states[N_CPU]);
> > > + } else
> > > queue_work(system_unbound_wq, &pw->pw_work);
> > > + if (job->numa_aware)
> > > + global_nid = nid;
> >
> > Thinking more about it, there could still be multiple threads working
> > at the same time with stale global_nid. We should probably do a compare
> > exchange of global_nid with new nid only if the global nid was unchanged.
> > Otherwise we should go to the next node with the changed global nid before
> > we queue the job.
> >
> > Tim
> >
> How about:
> ```
> nid = global_nid;
> list_for_each_entry(pw, &works, pw_list)
> if (job->numa_aware) {
> int old_node = nid;
> queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> nid = next_node(nid, node_states[N_CPU]);
> cmpxchg(&global_nid, old_node, nid);
> } else
> queue_work(system_unbound_wq, &pw->pw_work);
>
> ```
>
I am thinking something like
static volatile atomic_t last_used_nid;
list_for_each_entry(pw, &works, pw_list)
if (job->numa_aware) {
int old_node = atomic_read(&last_used_nid);
do {
nid = next_node_in(old_node, node_states[N_CPU]);
} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
queue_work_node(nid, system_unbound_wq, &pw->pw_work);
} else {
queue_work(system_unbound_wq, &pw->pw_work);
}
Note that we need to use next_node_in so we'll wrap around the node mask.
Tim
Hi Tim,
On 2024/1/18 06:14, Tim Chen wrote:
> On Mon, 2024-01-15 at 16:57 +0800, Gang Li wrote:
>> How about:
>> ```
>> nid = global_nid;
>> list_for_each_entry(pw, &works, pw_list)
>> if (job->numa_aware) {
>> int old_node = nid;
>> queue_work_node(nid, system_unbound_wq, &pw->pw_work);
>> nid = next_node(nid, node_states[N_CPU]);
>> cmpxchg(&global_nid, old_node, nid);
>> } else
>> queue_work(system_unbound_wq, &pw->pw_work);
>>
>> ```
>>
My original idea was to have all tasks from a single
padata_do_multithreaded call distributed across consecutive NUMA nodes,
so the task distribution would be predictable for any single
padata_do_multithreaded call.
>
> I am thinking something like
>
> static volatile atomic_t last_used_nid;
>
> list_for_each_entry(pw, &works, pw_list)
> if (job->numa_aware) {
> int old_node = atomic_read(&last_used_nid);
>
> do {
> nid = next_node_in(old_node, node_states[N_CPU]);
> } while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
However, distributing the tasks from all parallel padata_do_multithreaded
calls across NUMA nodes globally is also fine by me.
I don't have a strong preference.
> queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> } else {
> queue_work(system_unbound_wq, &pw->pw_work);
> }
>
> Note that we need to use next_node_in so we'll wrap around the node mask.
>
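To make the numa_aware knob concrete, here is a hedged sketch of how a caller such as the hugetlb boot allocation might fill in a padata_mt_job; the thread function name and the min_chunk/max_threads sizing are illustrative assumptions, not taken from the series:
```
/*
 * Hypothetical caller: spread h->max_huge_pages worth of work across
 * memory nodes, letting padata dispatch one worker per node when
 * numa_aware is set.
 */
static void __init hugetlb_pages_alloc_boot_parallel(struct hstate *h)
{
	struct padata_mt_job job = {
		.thread_fn	= hugetlb_pages_alloc_boot_node, /* assumed worker */
		.fn_arg		= h,
		.start		= 0,
		.size		= h->max_huge_pages,
		.align		= 1,
		.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY),
		.max_threads	= num_node_state(N_MEMORY) * 2,
		.numa_aware	= true,
	};

	padata_do_multithreaded(&job);
}
```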