2024-01-26 15:26:58

by Gang Li

Subject: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization

Optimize the initialization speed of 1G huge pages through
parallelization.

1G hugetlbs are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.

Here are some test results:
test case            no patch(ms)  patched(ms)  saved
-------------------  ------------  -----------  --------
256c2T(4 node)  1G   4745          2024         57.34%
128c1T(2 node)  1G   3358          1712         49.02%
12T             1G   77000         18300        76.23%

Signed-off-by: Gang Li <[email protected]>
Tested-by: David Rientjes <[email protected]>
---
arch/powerpc/mm/hugetlbpage.c | 2 +-
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 44 ++++++++++++++++++++++++++++-------
3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0a540b37aab62..a1651d5471862 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -226,7 +226,7 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
return 0;
m = phys_to_virt(gpage_freearray[--nr_gpages]);
gpage_freearray[nr_gpages] = 0;
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[0]);
m->hstate = hstate;
return 1;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b11..77b30a8c6076b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);

extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];

/* arch callbacks */

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 385840397bce5..eee0c456f6571 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
#endif
static unsigned long hugetlb_cma_size __initdata;

-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];

/* for command line parsing */
static struct hstate * __initdata parsed_hstate;
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
struct huge_bootmem_page *m = NULL; /* initialize for clang */
- int nr_nodes, node;
+ int nr_nodes, node = nid;

/* do node specific alloc */
if (nid != NUMA_NO_NODE) {
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
huge_page_size(h) - PAGE_SIZE);
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[node]);
m->hstate = h;
return 1;
}
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);

- /* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
@@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
}
+ /* Subdivide locks to achieve better parallel performance */
+ spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
- spin_unlock_irqrestore(&hugetlb_lock, flags);
}

/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
*/
-static void __init gather_bootmem_prealloc(void)
+static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
+
{
+ int nid = start;
LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
struct hstate *h = NULL, *prev_h = NULL;

- list_for_each_entry(m, &huge_boot_pages, list) {
+ list_for_each_entry(m, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;

@@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
prep_and_add_bootmem_folios(h, &folio_list);
}

+static void __init gather_bootmem_prealloc(void)
+{
+ struct padata_mt_job job = {
+ .thread_fn = gather_bootmem_prealloc_node,
+ .fn_arg = NULL,
+ .start = 0,
+ .size = num_node_state(N_MEMORY),
+ .align = 1,
+ .min_chunk = 1,
+ .max_threads = num_node_state(N_MEMORY),
+ .numa_aware = true,
+ };
+
+ padata_do_multithreaded(&job);
+}
+
static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
{
unsigned long i;
@@ -3600,6 +3618,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long allocated;
+ static bool initialized __initdata;

/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3607,6 +3626,15 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;
}

+ /* hugetlb_hstate_alloc_pages will be called many times, initialize huge_boot_pages once */
+ if (!initialized) {
+ int i = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ INIT_LIST_HEAD(&huge_boot_pages[i]);
+ initialized = true;
+ }
+
/* do node specific alloc */
if (hugetlb_hstate_alloc_pages_specific_nodes(h))
return;
--
2.20.1



2024-01-29 03:59:35

by Muchun Song

Subject: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization



> On Jan 26, 2024, at 23:24, Gang Li <[email protected]> wrote:
>
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test case no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2T(4 node) 1G 4745 2024 57.34%
> 128c1T(2 node) 1G 3358 1712 49.02%
> 12T 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks.


2024-02-05 07:28:40

by Muchun Song

Subject: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization



On 2024/1/26 23:24, Gang Li wrote:
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test case no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2T(4 node) 1G 4745 2024 57.34%
> 128c1T(2 node) 1G 3358 1712 49.02%
> 12T 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <[email protected]>
> Tested-by: David Rientjes <[email protected]>
> ---
> arch/powerpc/mm/hugetlbpage.c | 2 +-
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 44 ++++++++++++++++++++++++++++-------
> 3 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0a540b37aab62..a1651d5471862 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -226,7 +226,7 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
> return 0;
> m = phys_to_virt(gpage_freearray[--nr_gpages]);
> gpage_freearray[nr_gpages] = 0;
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[0]);
> m->hstate = hstate;
> return 1;
> }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c1ee640d87b11..77b30a8c6076b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>
> extern int sysctl_hugetlb_shm_group;
> -extern struct list_head huge_boot_pages;
> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* arch callbacks */
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 385840397bce5..eee0c456f6571 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> #endif
> static unsigned long hugetlb_cma_size __initdata;
>
> -__initdata LIST_HEAD(huge_boot_pages);
> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* for command line parsing */
> static struct hstate * __initdata parsed_hstate;
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> {
> struct huge_bootmem_page *m = NULL; /* initialize for clang */
> - int nr_nodes, node;
> + int nr_nodes, node = nid;
>
> /* do node specific alloc */
> if (nid != NUMA_NO_NODE) {
> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> huge_page_size(h) - PAGE_SIZE);
> /* Put them into a private list first because mem_map is not up yet */
> INIT_LIST_HEAD(&m->list);
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[node]);
> m->hstate = h;
> return 1;
> }
> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> /* Send list for bulk vmemmap optimization processing */
> hugetlb_vmemmap_optimize_folios(h, folio_list);
>
> - /* Add all new pool pages to free lists in one lock cycle */
> - spin_lock_irqsave(&hugetlb_lock, flags);
> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
> /*
> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> HUGETLB_VMEMMAP_RESERVE_PAGES,
> pages_per_huge_page(h));
> }
> + /* Subdivide locks to achieve better parallel performance */
> + spin_lock_irqsave(&hugetlb_lock, flags);
> __prep_account_new_huge_page(h, folio_nid(folio));
> enqueue_hugetlb_folio(h, folio);
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
> - spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
>
> /*
> * Put bootmem huge pages into the standard lists after mem_map is up.
> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> */
> -static void __init gather_bootmem_prealloc(void)
> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
> +
> {
> + int nid = start;

Sorry to notice an issue here so late. I have seen a comment in
PADATA, which says:

    @max_threads: Max threads to use for the job, actual number may be less
                  depending on task size and minimum chunk size.

PADATA does not guarantee that gather_bootmem_prealloc_node() will be called
->max_threads times (you have initialized it to the number of NUMA nodes in
gather_bootmem_prealloc()). Therefore, we should add a loop here to initialize
multiple nodes, namely (@end - @start) of them. Otherwise, we will miss
initializing some nodes.
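
Something like the following is what I have in mind (only a sketch,
untested; it assumes gather_bootmem_prealloc_node() is reworked to take
a single nid, and the wrapper name is just for illustration):

```
static void __init gather_bootmem_prealloc_parallel(unsigned long start,
						    unsigned long end, void *arg)
{
	int nid;

	/* Cover the whole [start, end) range that PADATA hands us. */
	for (nid = start; nid < end; nid++)
		gather_bootmem_prealloc_node(nid);
}
```

and then point .thread_fn at the wrapper in gather_bootmem_prealloc().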

Thanks.

> LIST_HEAD(folio_list);
> struct huge_bootmem_page *m;
> struct hstate *h = NULL, *prev_h = NULL;
>
> - list_for_each_entry(m, &huge_boot_pages, list) {
> + list_for_each_entry(m, &huge_boot_pages[nid], list) {
> struct page *page = virt_to_page(m);
> struct folio *folio = (void *)page;
>
> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
> prep_and_add_bootmem_folios(h, &folio_list);
> }
>
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct padata_mt_job job = {
> + .thread_fn = gather_bootmem_prealloc_node,
> + .fn_arg = NULL,
> + .start = 0,
> + .size = num_node_state(N_MEMORY),
> + .align = 1,
> + .min_chunk = 1,
> + .max_threads = num_node_state(N_MEMORY),
> + .numa_aware = true,
> + };
> +
> + padata_do_multithreaded(&job);
> +}
> +
> static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> {
> unsigned long i;
> @@ -3600,6 +3618,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
> static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> {
> unsigned long allocated;
> + static bool initialized __initdata;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3607,6 +3626,15 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> return;
> }
>
> + /* hugetlb_hstate_alloc_pages will be called many times, initialize huge_boot_pages once */
> + if (!initialized) {
> + int i = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++)
> + INIT_LIST_HEAD(&huge_boot_pages[i]);
> + initialized = true;
> + }
> +
> /* do node specific alloc */
> if (hugetlb_hstate_alloc_pages_specific_nodes(h))
> return;


2024-02-05 08:26:58

by Gang Li

Subject: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization



On 2024/2/5 15:28, Muchun Song wrote:
> On 2024/1/26 23:24, Gang Li wrote:
>> @@ -3390,8 +3390,6 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>>       /* Send list for bulk vmemmap optimization processing */
>>       hugetlb_vmemmap_optimize_folios(h, folio_list);
>> -    /* Add all new pool pages to free lists in one lock cycle */
>> -    spin_lock_irqsave(&hugetlb_lock, flags);
>>       list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>           if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>               /*
>> @@ -3404,23 +3402,27 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>>                       HUGETLB_VMEMMAP_RESERVE_PAGES,
>>                       pages_per_huge_page(h));
>>           }
>> +        /* Subdivide locks to achieve better parallel performance */
>> +        spin_lock_irqsave(&hugetlb_lock, flags);
>>           __prep_account_new_huge_page(h, folio_nid(folio));
>>           enqueue_hugetlb_folio(h, folio);
>> +        spin_unlock_irqrestore(&hugetlb_lock, flags);
>>       }
>> -    spin_unlock_irqrestore(&hugetlb_lock, flags);
>>   }
>>   /*
>>    * Put bootmem huge pages into the standard lists after mem_map is up.
>>    * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>    */
>> -static void __init gather_bootmem_prealloc(void)
>> +static void __init gather_bootmem_prealloc_node(unsigned long start,
>> unsigned long end, void *arg)
>> +
>>   {
>> +    int nid = start;
>
> Sorry for so late to notice an issue here. I have seen a comment from
> PADATA, which says:
>
>     @max_threads: Max threads to use for the job, actual number may be
> less
>                   depending on task size and minimum chunk size.
>
> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
> ->max_threads times (You have initialized it to the number of NUMA nodes in
> gather_bootmem_prealloc). Therefore, we should add a loop here to
> initialize
> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
> initializing some nodes.
>
> Thanks.
>
In padata_do_multithreaded:

```
/* Ensure at least one thread when size < min_chunk. */
nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
nworks = min(nworks, job->max_threads);

ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
```

So we have nworks <= max_threads, and nworks >= size / min_chunk (up to that cap).
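
For example, plugging in the values gather_bootmem_prealloc() uses on a
4-node machine (illustrative numbers only):

```
size = num_node_state(N_MEMORY) = 4, min_chunk = 1, align = 1, max_threads = 4

nworks = max(4 / max(1, 1), 1ul) = 4
nworks = min(4, 4)               = 4
```

so every node currently gets its own gather_bootmem_prealloc_node() call.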

Thanks!

2024-02-05 09:11:51

by Muchun Song

Subject: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization



> On Feb 5, 2024, at 16:26, Gang Li <[email protected]> wrote:
>
>
>
> On 2024/2/5 15:28, Muchun Song wrote:
>> On 2024/1/26 23:24, Gang Li wrote:
>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> /* Send list for bulk vmemmap optimization processing */
>>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>>> - /* Add all new pool pages to free lists in one lock cycle */
>>> - spin_lock_irqsave(&hugetlb_lock, flags);
>>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>> /*
>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>>> pages_per_huge_page(h));
>>> }
>>> + /* Subdivide locks to achieve better parallel performance */
>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>> __prep_account_new_huge_page(h, folio_nid(folio));
>>> enqueue_hugetlb_folio(h, folio);
>>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> /*
>>> * Put bootmem huge pages into the standard lists after mem_map is up.
>>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>> */
>>> -static void __init gather_bootmem_prealloc(void)
>>> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
>>> +
>>> {
>>> + int nid = start;
>> Sorry for so late to notice an issue here. I have seen a comment from
>> PADATA, which says:
>> @max_threads: Max threads to use for the job, actual number may be less
>> depending on task size and minimum chunk size.
>> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
>> ->max_threads times (You have initialized it to the number of NUMA nodes in
>> gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
>> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
>> initializing some nodes.
>> Thanks.
>>
> In padata_do_multithreaded:
>
> ```
> /* Ensure at least one thread when size < min_chunk. */
> nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
> nworks = min(nworks, job->max_threads);
>
> ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
> ```
>
> So we have works <= max_threads, but >= size/min_chunk.

Given a 4-node system, the current implementation will schedule
4 threads to call gather_bootmem_prealloc_node() respectively, and
there is no problem here. But what if PADATA schedules 2
threads and each thread is meant to handle 2 nodes? I think
that is possible for PADATA in the future, because it would not
break any semantics exposed to users. The comment about @min_chunk:

The minimum chunk size in job-specific units. This
allows the client to communicate the minimum amount
of work that's appropriate for one worker thread to
do at once.

It only defines the minimum chunk size, not a maximum, so it is
possible for each ->thread_fn call to handle multiple minimum
chunks. Right? Therefore, I am not concerned about the current
implementation of PADATA but about a future one.
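
E.g. (a purely hypothetical split, not what PADATA does today), on a
4-node system it could decide nworks = 2 and invoke:

```
thread_fn(0, 2, arg);	/* expected to gather nodes 0 and 1 */
thread_fn(2, 4, arg);	/* expected to gather nodes 2 and 3 */
```

With the current gather_bootmem_prealloc_node(), only nodes 0 and 2
would actually be gathered.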

Maybe a separate patch is acceptable, since it would be an improvement
rather than a fix (at least there is no bug currently).

Thanks.


2024-02-07 02:00:59

by Jane Chu

Subject: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization

Add Daniel Jordan.

On 2/5/2024 1:09 AM, Muchun Song wrote:
>
>> On Feb 5, 2024, at 16:26, Gang Li <[email protected]> wrote:
>>
>>
>>
>> On 2024/2/5 15:28, Muchun Song wrote:
>>> On 2024/1/26 23:24, Gang Li wrote:
>>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>> /* Send list for bulk vmemmap optimization processing */
>>>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>>>> - /* Add all new pool pages to free lists in one lock cycle */
>>>> - spin_lock_irqsave(&hugetlb_lock, flags);
>>>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>>> /*
>>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>>>> pages_per_huge_page(h));
>>>> }
>>>> + /* Subdivide locks to achieve better parallel performance */
>>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>>> __prep_account_new_huge_page(h, folio_nid(folio));
>>>> enqueue_hugetlb_folio(h, folio);
>>>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>> }
>>>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>> }
>>>> /*
>>>> * Put bootmem huge pages into the standard lists after mem_map is up.
>>>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>>> */
>>>> -static void __init gather_bootmem_prealloc(void)
>>>> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
>>>> +
>>>> {
>>>> + int nid = start;
>>> Sorry for so late to notice an issue here. I have seen a comment from
>>> PADATA, which says:
>>> @max_threads: Max threads to use for the job, actual number may be less
>>> depending on task size and minimum chunk size.
>>> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
>>> ->max_threads times (You have initialized it to the number of NUMA nodes in
>>> gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
>>> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
>>> initializing some nodes.
>>> Thanks.
>>>
>> In padata_do_multithreaded:
>>
>> ```
>> /* Ensure at least one thread when size < min_chunk. */
>> nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
>> nworks = min(nworks, job->max_threads);
>>
>> ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
>> ```
>>
>> So we have works <= max_threads, but >= size/min_chunk.
> Given a 4-node system, the current implementation will schedule
> 4 threads to call gather_bootmem_prealloc() respectively, and
> there is no problems here. But what if PADATA schedules 2
> threads and each thread aims to handle 2 nodes? I think
> it is possible for PADATA in the future, because it does not
> break any semantics exposed to users. The comment about @min_chunk:
>
> The minimum chunk size in job-specific units. This
> allows the client to communicate the minimum amount
> of work that's appropriate for one worker thread to
> do at once.
>
> It only defines the minimum chunk size but not maximum size,
> so it is possible to let each ->thread_fn handle multiple
> minimum chunk size. Right? Therefore, I am not concerned
> about the current implementation of PADATA but that of future.
>
> Maybe a separate patch is acceptable since it is an improving
> patch instead of a fix one (at least there is no bug currently).
>
> Thanks.
>
>

2024-02-09 17:49:48

by Daniel Jordan

Subject: Re: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization

On Tue, Feb 06, 2024 at 05:53:04PM -0800, Jane Chu wrote:
> Add Daniel Jordan.

Thanks, Jane.

I'm adding Steffen too, and please cc padata maintainers on future
patches. MAINTAINERS has linux-crypto too under padata, but for changes
to just padata_do_multithreaded that's probably not necessary.

> On 2/5/2024 1:09 AM, Muchun Song wrote:
> > > On Feb 5, 2024, at 16:26, Gang Li <[email protected]> wrote:
> > > On 2024/2/5 15:28, Muchun Song wrote:
> > > > On 2024/1/26 23:24, Gang Li wrote:
> > > > > -static void __init gather_bootmem_prealloc(void)
> > > > > +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
> > > > > +
> > > > > {
> > > > > + int nid = start;
> > > > Sorry for so late to notice an issue here. I have seen a comment from
> > > > PADATA, which says:
> > > > @max_threads: Max threads to use for the job, actual number may be less
> > > > depending on task size and minimum chunk size.
> > > > PADATA will not guarantee gather_bootmem_prealloc_node() will be called
> > > > ->max_threads times (You have initialized it to the number of NUMA nodes in
> > > > gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
> > > > multiple nodes, namely (@end - @start) here. Otherwise, we will miss
> > > > initializing some nodes.
> > > > Thanks.
> > > >
> > > In padata_do_multithreaded:
> > >
> > > ```
> > > /* Ensure at least one thread when size < min_chunk. */
> > > nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
> > > nworks = min(nworks, job->max_threads);
> > >
> > > ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
> > > ```
> > >
> > > So we have works <= max_threads, but >= size/min_chunk.
> > Given a 4-node system, the current implementation will schedule
> > 4 threads to call gather_bootmem_prealloc() respectively, and
> > there is no problems here. But what if PADATA schedules 2
> > threads and each thread aims to handle 2 nodes? I think
> > it is possible for PADATA in the future, because it does not
> > break any semantics exposed to users. The comment about @min_chunk:
> >
> > The minimum chunk size in job-specific units. This
> > allows the client to communicate the minimum amount
> > of work that's appropriate for one worker thread to
> > do at once.
> >
> > It only defines the minimum chunk size but not maximum size,
> > so it is possible to let each ->thread_fn handle multiple
> > minimum chunk size. Right? Therefore, I am not concerned

Right. The core issue is that gather_bootmem_prealloc_node() doesn't
look at @end, but padata expects that each call of the thread function
covers the start/end range that's passed. I understand that this
happens to work today with how padata calculates nworks, but it seems
better to honor the expectation, so I agree with Muchun's suggestion a
few messages ago to loop over the range.


I hope to look at the rest of the series and that standalone Kconfig
patch after about a week; there isn't time before that.