2021-09-28 12:04:44

by Chen Wandun

Subject: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

Eric Dumazet reported a strange numa spreading issue in [1], and found that
commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
it [2].

Digging into the difference before and after that commit, the page
allocation path differs:

before:
alloc_large_system_hash
__vmalloc
__vmalloc_node(..., NUMA_NO_NODE, ...)
__vmalloc_node_range
__vmalloc_area_node
alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
alloc_pages_current
alloc_page_interleave /* can be proved by print policy mode */

after:
alloc_large_system_hash
__vmalloc
__vmalloc_node(..., NUMA_NO_NODE, ...)
__vmalloc_node_range
__vmalloc_area_node
alloc_pages_node /* chooses nid by numa_mem_id() */
__alloc_pages_node(nid, ....)

So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
memory is allocated on the current node instead of being interleaved
across nodes.
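
The reason is that alloc_pages_node() resolves NUMA_NO_NODE to the local
node and never consults the task mempolicy; roughly (paraphrased from
include/linux/gfp.h, shown only for context):

<snip>
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					    unsigned int order)
{
	/* NUMA_NO_NODE is mapped to the local node; mempolicy is bypassed */
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
}
<snip>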

[1]
https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/

[2]
https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/

Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Chen Wandun <[email protected]>
---
mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f884706c5280..48e717626e94 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
unsigned int order, unsigned int nr_pages, struct page **pages)
{
unsigned int nr_allocated = 0;
+ struct page *page;
+ int i;

/*
* For order-0 pages we make use of bulk allocator, if
@@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
if (!order) {
while (nr_allocated < nr_pages) {
unsigned int nr, nr_pages_request;
+ page = NULL;

/*
* A maximum allowed request is hard-coded and is 100
@@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
*/
nr_pages_request = min(100U, nr_pages - nr_allocated);

- nr = alloc_pages_bulk_array_node(gfp, nid,
- nr_pages_request, pages + nr_allocated);
-
+ if (nid == NUMA_NO_NODE) {
+ for (i = 0; i < nr_pages_request; i++) {
+ page = alloc_page(gfp);
+ if (page)
+ pages[nr_allocated + i] = page;
+ else {
+ nr = i;
+ break;
+ }
+ }
+ if (i >= nr_pages_request)
+ nr = nr_pages_request;
+ } else {
+ nr = alloc_pages_bulk_array_node(gfp, nid,
+ nr_pages_request,
+ pages + nr_allocated);
+ }
nr_allocated += nr;
cond_resched();

@@ -2863,11 +2880,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
gfp |= __GFP_COMP;

/* High-order pages or fallback path if "bulk" fails. */
- while (nr_allocated < nr_pages) {
- struct page *page;
- int i;

- page = alloc_pages_node(nid, gfp, order);
+ page = NULL;
+ while (nr_allocated < nr_pages) {
+ if (nid == NUMA_NO_NODE)
+ page = alloc_pages(gfp, order);
+ else
+ page = alloc_pages_node(nid, gfp, order);
if (unlikely(!page))
break;

--
2.25.1


2021-09-28 22:36:52

by Andrew Morton

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

On Tue, 28 Sep 2021 20:10:40 +0800 Chen Wandun <[email protected]> wrote:

> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Dig into the difference before and after this patch, page allocation has
> some difference:
>
> before:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
> alloc_pages_current
> alloc_page_interleave /* can be proved by print policy mode */
>
> after:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_pages_node /* choose nid by nuam_mem_id() */
> __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> it will allocate memory in current node instead of interleaving allocate
> memory.
>
> [1]
> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2]
> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <[email protected]>
> Signed-off-by: Chen Wandun <[email protected]>

This seems like it could cause significant performance regressions in
some situations?

If "yes" then wouldn't a cc:stable be appropriate? And some (perhaps
handwavy) quantification of the slowdown would help people understand
why we're recommending a backport.

If "no" then why the heck do we have that feature in there anyway ;)

2021-10-13 21:48:45

by Shakeel Butt

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
>
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Dig into the difference before and after this patch, page allocation has
> some difference:
>
> before:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
> alloc_pages_current
> alloc_page_interleave /* can be proved by print policy mode */
>
> after:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_pages_node /* choose nid by nuam_mem_id() */
> __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> it will allocate memory in current node instead of interleaving allocate
> memory.
>
> [1]
> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2]
> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <[email protected]>
> Signed-off-by: Chen Wandun <[email protected]>
> ---
> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
> 1 file changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f884706c5280..48e717626e94 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> unsigned int order, unsigned int nr_pages, struct page **pages)
> {
> unsigned int nr_allocated = 0;
> + struct page *page;
> + int i;
>
> /*
> * For order-0 pages we make use of bulk allocator, if
> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> if (!order) {

Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
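
(For illustration only, the test would then read roughly:

	/* bulk allocator only when a specific node was requested */
	if (!order && nid != NUMA_NO_NODE) {
		while (nr_allocated < nr_pages) {
			...

so NUMA_NO_NODE requests would fall through to the alloc_pages() fallback
path below.)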

> while (nr_allocated < nr_pages) {
> unsigned int nr, nr_pages_request;
> + page = NULL;
>
> /*
> * A maximum allowed request is hard-coded and is 100
> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> */
> nr_pages_request = min(100U, nr_pages - nr_allocated);
>

Undo the following change in this if block.

> - nr = alloc_pages_bulk_array_node(gfp, nid,
> - nr_pages_request, pages + nr_allocated);
> -
> + if (nid == NUMA_NO_NODE) {
> + for (i = 0; i < nr_pages_request; i++) {
> + page = alloc_page(gfp);
> + if (page)
> + pages[nr_allocated + i] = page;
> + else {
> + nr = i;
> + break;
> + }
> + }
> + if (i >= nr_pages_request)
> + nr = nr_pages_request;
> + } else {
> + nr = alloc_pages_bulk_array_node(gfp, nid,
> + nr_pages_request,
> + pages + nr_allocated);
> + }
> nr_allocated += nr;
> cond_resched();
>
> @@ -2863,11 +2880,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,

Put the following line under "else if (order)"

> gfp |= __GFP_COMP;
>
> /* High-order pages or fallback path if "bulk" fails. */
> - while (nr_allocated < nr_pages) {

Keep the following declarations inside the while loop.

> - struct page *page;
> - int i;
>
> - page = alloc_pages_node(nid, gfp, order);
> + page = NULL;
> + while (nr_allocated < nr_pages) {
> + if (nid == NUMA_NO_NODE)
> + page = alloc_pages(gfp, order);
> + else
> + page = alloc_pages_node(nid, gfp, order);
> if (unlikely(!page))
> break;
>
> --
> 2.25.1
>

2021-10-14 08:52:02

by Chen Wandun

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables



On 2021/9/29 6:33, Andrew Morton wrote:
> On Tue, 28 Sep 2021 20:10:40 +0800 Chen Wandun <[email protected]> wrote:
>
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>>
>> Dig into the difference before and after this patch, page allocation has
>> some difference:
>>
>> before:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>> alloc_pages_current
>> alloc_page_interleave /* can be proved by print policy mode */
>>
>> after:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_pages_node /* choose nid by nuam_mem_id() */
>> __alloc_pages_node(nid, ....)
>>
>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>> it will allocate memory in current node instead of interleaving allocate
>> memory.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>
>> [2]
>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>
>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>> Reported-by: Eric Dumazet <[email protected]>
>> Signed-off-by: Chen Wandun <[email protected]>
>
> This seems like it could cause significant performance regressions in
> some situations?
Yes, it will indeed cause some performance regressions; I will send an
optimization patch based on this patch.
>
> If "yes" then wouldn't a cc:stable be appropriate? And some (perhaps
> handwavy) quantification of the slowdown would help people understand
> why we're recommending a backport.
>
> If "no" then why the heck do we have that feature in there anyway ;)
>
> .
>

2021-10-14 09:01:12

by Chen Wandun

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables



On 2021/10/14 5:46, Shakeel Butt wrote:
> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
>>
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>>
>> Dig into the difference before and after this patch, page allocation has
>> some difference:
>>
>> before:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>> alloc_pages_current
>> alloc_page_interleave /* can be proved by print policy mode */
>>
>> after:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_pages_node /* choose nid by nuam_mem_id() */
>> __alloc_pages_node(nid, ....)
>>
>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>> it will allocate memory in current node instead of interleaving allocate
>> memory.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>
>> [2]
>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>
>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>> Reported-by: Eric Dumazet <[email protected]>
>> Signed-off-by: Chen Wandun <[email protected]>
>> ---
>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
>> 1 file changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index f884706c5280..48e717626e94 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> unsigned int order, unsigned int nr_pages, struct page **pages)
>> {
>> unsigned int nr_allocated = 0;
>> + struct page *page;
>> + int i;
>>
>> /*
>> * For order-0 pages we make use of bulk allocator, if
>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> if (!order) {
>
> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
>
>> while (nr_allocated < nr_pages) {
>> unsigned int nr, nr_pages_request;
>> + page = NULL;
>>
>> /*
>> * A maximum allowed request is hard-coded and is 100
>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> */
>> nr_pages_request = min(100U, nr_pages - nr_allocated);
>>
>
> Undo the following change in this if block.

Yes, it seems simpler as you suggested, but it still has a performance
regression. I plan to change the following to consider both mempolicy
and alloc_pages_bulk.
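
Roughly the direction is what the follow-up patch later in this thread
does, using the alloc_pages_bulk_array_mempolicy() helper it introduces:

	if (nid == NUMA_NO_NODE)
		nr = alloc_pages_bulk_array_mempolicy(gfp,
				nr_pages_request, pages + nr_allocated);
	else
		nr = alloc_pages_bulk_array_node(gfp, nid,
				nr_pages_request, pages + nr_allocated);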

Thanks,
Wandun

>
>> - nr = alloc_pages_bulk_array_node(gfp, nid,
>> - nr_pages_request, pages + nr_allocated);
>> -
>> + if (nid == NUMA_NO_NODE) {
>> + for (i = 0; i < nr_pages_request; i++) {
>> + page = alloc_page(gfp);
>> + if (page)
>> + pages[nr_allocated + i] = page;
>> + else {
>> + nr = i;
>> + break;
>> + }
>> + }
>> + if (i >= nr_pages_request)
>> + nr = nr_pages_request;
>> + } else {
>> + nr = alloc_pages_bulk_array_node(gfp, nid,
>> + nr_pages_request,
>> + pages + nr_allocated);
>> + }
>> nr_allocated += nr;
>> cond_resched();
>>
>> @@ -2863,11 +2880,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>
> Put the following line under "else if (order)"
>
>> gfp |= __GFP_COMP;
>>
>> /* High-order pages or fallback path if "bulk" fails. */
>> - while (nr_allocated < nr_pages) {
>
> Keep the following declarations inside the while loop.
>
>> - struct page *page;
>> - int i;
>>
>> - page = alloc_pages_node(nid, gfp, order);
>> + page = NULL;
>> + while (nr_allocated < nr_pages) {
>> + if (nid == NUMA_NO_NODE)
>> + page = alloc_pages(gfp, order);
>> + else
>> + page = alloc_pages_node(nid, gfp, order);
>> if (unlikely(!page))
>> break;
>>
>> --
>> 2.25.1
>>
> .
>

2021-10-14 10:04:04

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

On Tue, Sep 28, 2021 at 08:10:40PM +0800, Chen Wandun wrote:
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Dig into the difference before and after this patch, page allocation has
> some difference:
>
> before:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
> alloc_pages_current
> alloc_page_interleave /* can be proved by print policy mode */
>
> after:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_pages_node /* choose nid by nuam_mem_id() */
> __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> it will allocate memory in current node instead of interleaving allocate
> memory.
>
> [1]
> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2]
> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <[email protected]>
> Signed-off-by: Chen Wandun <[email protected]>
> ---
> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
> 1 file changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f884706c5280..48e717626e94 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> unsigned int order, unsigned int nr_pages, struct page **pages)
> {
> unsigned int nr_allocated = 0;
> + struct page *page;
> + int i;
>
> /*
> * For order-0 pages we make use of bulk allocator, if
> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> if (!order) {
> while (nr_allocated < nr_pages) {
> unsigned int nr, nr_pages_request;
> + page = NULL;
>
> /*
> * A maximum allowed request is hard-coded and is 100
> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> */
> nr_pages_request = min(100U, nr_pages - nr_allocated);
>
> - nr = alloc_pages_bulk_array_node(gfp, nid,
> - nr_pages_request, pages + nr_allocated);
> -
> + if (nid == NUMA_NO_NODE) {
>
<snip>
void *vmalloc(unsigned long size)
{
return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
__builtin_return_address(0));
}
EXPORT_SYMBOL(vmalloc);
<snip>

vmalloc() uses NUMA_NO_NODE, so with this patch all vmalloc() calls fall
back to the single-page allocator, on both NUMA and non-NUMA systems. Is
it intentional to bypass the optimized bulk allocator on non-NUMA systems?
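
Concretely, with this patch even a small vmalloc() on a single-node machine
takes the one-page-at-a-time path, for example:

vmalloc(4 * PAGE_SIZE)
  __vmalloc_node(..., NUMA_NO_NODE, ...)
    __vmalloc_node_range
      __vmalloc_area_node
        vm_area_alloc_pages(gfp, NUMA_NO_NODE, 0, 4, pages)
          alloc_page(gfp)  /* called once per page; bulk path skipped */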

Thanks!

--
Vlad Rezki

2021-10-14 10:08:12

by Chen Wandun

Subject: [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

It will cause significant performance regressions in some situations
as Andrew mentioned in [1]. The main situation is vmalloc, vmalloc
will allocate pages with NUMA_NO_NODE by default, that will result
in alloc page one by one;

In order to solve this, __alloc_pages_bulk and mempolicy should be
considered at the same time.

1) If node is specified in memory allocation request, it will alloc
all pages by __alloc_pages_bulk.

2) If interleaving allocate memory, it will cauculate how many pages
should be allocated in each node, and use __alloc_pages_bulk to alloc
pages in each node.
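
A worked example for 2), with purely illustrative numbers:

	/* interleave mask covers 4 nodes, request is for 10 pages */
	nodes = 4;				/* nodes_weight(pol->nodes) */
	nr_pages_per_node = 10 / 4;		/* = 2 */
	delta = 10 - 4 * nr_pages_per_node;	/* = 2 */
	/*
	 * The first 2 interleave nodes get 3 pages each, the remaining
	 * 2 nodes get 2 pages each, all via __alloc_pages_bulk().
	 */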

[1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60

Signed-off-by: Chen Wandun <[email protected]>
----------------
based on "[PATCH] mm/vmalloc: fix numa spreading for large hash tables"
---
include/linux/gfp.h | 4 +++
mm/mempolicy.c | 76 +++++++++++++++++++++++++++++++++++++++++++++
mm/vmalloc.c | 19 +++---------
3 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 558299cb2970..b976c4177299 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -531,6 +531,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct list_head *page_list,
struct page **page_array);

+unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+ unsigned long nr_pages,
+ struct page **page_array);
+
/* Bulk allocate order-0 pages */
static inline unsigned long
alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9f8cd1457829..f456c5eb8d10 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2196,6 +2196,82 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
}
EXPORT_SYMBOL(alloc_pages);

+unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
+ struct mempolicy *pol, unsigned long nr_pages,
+ struct page **page_array)
+{
+ int nodes;
+ unsigned long nr_pages_per_node;
+ int delta;
+ int i;
+ unsigned long nr_allocated;
+ unsigned long total_allocated = 0;
+
+ nodes = nodes_weight(pol->nodes);
+ nr_pages_per_node = nr_pages / nodes;
+ delta = nr_pages - nodes * nr_pages_per_node;
+
+ for (i = 0; i < nodes; i++) {
+ if (delta) {
+ nr_allocated = __alloc_pages_bulk(gfp,
+ interleave_nodes(pol), NULL,
+ nr_pages_per_node + 1, NULL,
+ page_array);
+ delta--;
+ } else {
+ nr_allocated = __alloc_pages_bulk(gfp,
+ interleave_nodes(pol), NULL,
+ nr_pages_per_node, NULL, page_array);
+ }
+
+ page_array += nr_allocated;
+ total_allocated += nr_allocated;
+ }
+
+ return total_allocated;
+}
+
+unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
+ struct mempolicy *pol, unsigned long nr_pages,
+ struct page **page_array)
+{
+ gfp_t preferred_gfp;
+ unsigned long nr_allocated = 0;
+
+ preferred_gfp = gfp | __GFP_NOWARN;
+ preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+
+ nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+ nr_pages, NULL, page_array);
+
+ if (nr_allocated < nr_pages)
+ nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+ nr_pages - nr_allocated, NULL,
+ page_array + nr_allocated);
+ return nr_allocated;
+}
+
+unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+ unsigned long nr_pages, struct page **page_array)
+{
+ struct mempolicy *pol = &default_policy;
+
+ if (!in_interrupt() && !(gfp & __GFP_THISNODE))
+ pol = get_task_policy(current);
+
+ if (pol->mode == MPOL_INTERLEAVE)
+ return alloc_pages_bulk_array_interleave(gfp, pol,
+ nr_pages, page_array);
+
+ if (pol->mode == MPOL_PREFERRED_MANY)
+ return alloc_pages_bulk_array_preferred_many(gfp,
+ numa_node_id(), pol, nr_pages, page_array);
+
+ return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+ policy_nodemask(gfp, pol), nr_pages, NULL,
+ page_array);
+}
+
struct folio *folio_alloc(gfp_t gfp, unsigned order)
{
struct page *page = alloc_pages(gfp | __GFP_COMP, order);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b7ac4a8fe2b3..49adba793f3c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2856,23 +2856,14 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
*/
nr_pages_request = min(100U, nr_pages - nr_allocated);

- if (nid == NUMA_NO_NODE) {
- for (i = 0; i < nr_pages_request; i++) {
- page = alloc_page(gfp);
- if (page)
- pages[nr_allocated + i] = page;
- else {
- nr = i;
- break;
- }
- }
- if (i >= nr_pages_request)
- nr = nr_pages_request;
- } else {
+ if (nid == NUMA_NO_NODE)
+ nr = alloc_pages_bulk_array_mempolicy(gfp,
+ nr_pages_request,
+ pages + nr_allocated);
+ else
nr = alloc_pages_bulk_array_node(gfp, nid,
nr_pages_request,
pages + nr_allocated);
- }
nr_allocated += nr;
cond_resched();

--
2.25.1

2021-10-15 10:29:30

by Nicholas Piggin

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
>
>
> On 2021/10/14 5:46, Shakeel Butt wrote:
>> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
>>>
>>> Eric Dumazet reported a strange numa spreading info in [1], and found
>>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>>> this issue [2].
>>>
>>> Dig into the difference before and after this patch, page allocation has
>>> some difference:
>>>
>>> before:
>>> alloc_large_system_hash
>>> __vmalloc
>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>> __vmalloc_node_range
>>> __vmalloc_area_node
>>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>>> alloc_pages_current
>>> alloc_page_interleave /* can be proved by print policy mode */
>>>
>>> after:
>>> alloc_large_system_hash
>>> __vmalloc
>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>> __vmalloc_node_range
>>> __vmalloc_area_node
>>> alloc_pages_node /* choose nid by nuam_mem_id() */
>>> __alloc_pages_node(nid, ....)
>>>
>>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>>> it will allocate memory in current node instead of interleaving allocate
>>> memory.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>>
>>> [2]
>>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>>
>>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>>> Reported-by: Eric Dumazet <[email protected]>
>>> Signed-off-by: Chen Wandun <[email protected]>
>>> ---
>>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
>>> 1 file changed, 26 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index f884706c5280..48e717626e94 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>> unsigned int order, unsigned int nr_pages, struct page **pages)
>>> {
>>> unsigned int nr_allocated = 0;
>>> + struct page *page;
>>> + int i;
>>>
>>> /*
>>> * For order-0 pages we make use of bulk allocator, if
>>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>> if (!order) {
>>
>> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
>>
>>> while (nr_allocated < nr_pages) {
>>> unsigned int nr, nr_pages_request;
>>> + page = NULL;
>>>
>>> /*
>>> * A maximum allowed request is hard-coded and is 100
>>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>> */
>>> nr_pages_request = min(100U, nr_pages - nr_allocated);
>>>
>>
>> Undo the following change in this if block.
>
> Yes, It seem like more simpler as you suggested, But it still have
> performance regression, I plan to change the following to consider
> both mempolcy and alloc_pages_bulk.

Thanks for finding and debugging this. These APIs are a maze of twisty
little passages, all alike so I could be as confused as I was when I
wrote that patch, but doesn't a minimal fix look something like this?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..75ee9679f521 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2860,7 +2860,10 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
struct page *page;
int i;

- page = alloc_pages_node(nid, gfp, order);
+ if (nid == NUMA_NO_NODE)
+ page = alloc_pages(gfp, order);
+ else
+ page = alloc_pages_node(nid, gfp, order);
if (unlikely(!page))
break;


Thanks,
Nick

2021-10-15 10:55:14

by Chen Wandun

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables



On 2021/10/14 18:01, Uladzislau Rezki wrote:
> On Tue, Sep 28, 2021 at 08:10:40PM +0800, Chen Wandun wrote:
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>>
>> Dig into the difference before and after this patch, page allocation has
>> some difference:
>>
>> before:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>> alloc_pages_current
>> alloc_page_interleave /* can be proved by print policy mode */
>>
>> after:
>> alloc_large_system_hash
>> __vmalloc
>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>> __vmalloc_node_range
>> __vmalloc_area_node
>> alloc_pages_node /* choose nid by nuam_mem_id() */
>> __alloc_pages_node(nid, ....)
>>
>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>> it will allocate memory in current node instead of interleaving allocate
>> memory.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>
>> [2]
>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>
>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>> Reported-by: Eric Dumazet <[email protected]>
>> Signed-off-by: Chen Wandun <[email protected]>
>> ---
>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
>> 1 file changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index f884706c5280..48e717626e94 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> unsigned int order, unsigned int nr_pages, struct page **pages)
>> {
>> unsigned int nr_allocated = 0;
>> + struct page *page;
>> + int i;
>>
>> /*
>> * For order-0 pages we make use of bulk allocator, if
>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> if (!order) {
>> while (nr_allocated < nr_pages) {
>> unsigned int nr, nr_pages_request;
>> + page = NULL;
>>
>> /*
>> * A maximum allowed request is hard-coded and is 100
>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>> */
>> nr_pages_request = min(100U, nr_pages - nr_allocated);
>>
>> - nr = alloc_pages_bulk_array_node(gfp, nid,
>> - nr_pages_request, pages + nr_allocated);
>> -
>> + if (nid == NUMA_NO_NODE) {
>>
> <snip>
> void *vmalloc(unsigned long size)
> {
> return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
> __builtin_return_address(0));
> }
> EXPORT_SYMBOL(vmalloc);
> <snip>
>
> vmalloc() uses NUMA_NO_NODE, so all vmalloc calls will be reverted to a single
> page allocator for NUMA and non-NUMA systems. Is it intentional to bypass the
> optimized bulk allocator for non-NUMA systems?
I sent a patch; it will help to solve this.

[PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
accelerate memory allocation

Thanks,
Wandun

>
> Thanks!
>
> --
> Vlad Rezki
> .
>

2021-10-15 10:56:55

by Chen Wandun

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables



On 2021/10/15 9:34, Nicholas Piggin wrote:
> Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
>>
>>
>> On 2021/10/14 5:46, Shakeel Butt wrote:
>>> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
>>>>
>>>> Eric Dumazet reported a strange numa spreading info in [1], and found
>>>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>>>> this issue [2].
>>>>
>>>> Dig into the difference before and after this patch, page allocation has
>>>> some difference:
>>>>
>>>> before:
>>>> alloc_large_system_hash
>>>> __vmalloc
>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>>> __vmalloc_node_range
>>>> __vmalloc_area_node
>>>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>>>> alloc_pages_current
>>>> alloc_page_interleave /* can be proved by print policy mode */
>>>>
>>>> after:
>>>> alloc_large_system_hash
>>>> __vmalloc
>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>>> __vmalloc_node_range
>>>> __vmalloc_area_node
>>>> alloc_pages_node /* choose nid by nuam_mem_id() */
>>>> __alloc_pages_node(nid, ....)
>>>>
>>>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>>>> it will allocate memory in current node instead of interleaving allocate
>>>> memory.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>>>
>>>> [2]
>>>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>>>
>>>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>>>> Reported-by: Eric Dumazet <[email protected]>
>>>> Signed-off-by: Chen Wandun <[email protected]>
>>>> ---
>>>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
>>>> 1 file changed, 26 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index f884706c5280..48e717626e94 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>> unsigned int order, unsigned int nr_pages, struct page **pages)
>>>> {
>>>> unsigned int nr_allocated = 0;
>>>> + struct page *page;
>>>> + int i;
>>>>
>>>> /*
>>>> * For order-0 pages we make use of bulk allocator, if
>>>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>> if (!order) {
>>>
>>> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
>>>
>>>> while (nr_allocated < nr_pages) {
>>>> unsigned int nr, nr_pages_request;
>>>> + page = NULL;
>>>>
>>>> /*
>>>> * A maximum allowed request is hard-coded and is 100
>>>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>> */
>>>> nr_pages_request = min(100U, nr_pages - nr_allocated);
>>>>
>>>
>>> Undo the following change in this if block.
>>
>> Yes, It seem like more simpler as you suggested, But it still have
>> performance regression, I plan to change the following to consider
>> both mempolcy and alloc_pages_bulk.
>
> Thanks for finding and debugging this. These APIs are a maze of twisty
> little passages, all alike so I could be as confused as I was when I
> wrote that patch, but doesn't a minimal fix look something like this?

Yes, I sent a patch; it looks like what you show, and it also contains
some performance optimization.

[PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
accelerate memory allocation

Thanks,
Wandun

>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..75ee9679f521 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2860,7 +2860,10 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> struct page *page;
> int i;
>
> - page = alloc_pages_node(nid, gfp, order);
> + if (nid == NUMA_NO_NODE)
> + page = alloc_pages(gfp, order);
> + else
> + page = alloc_pages_node(nid, gfp, order);
> if (unlikely(!page))
> break;
>
>
> Thanks,
> Nick
> .
>

2021-10-15 13:21:50

by Nicholas Piggin

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

Excerpts from Chen Wandun's message of October 15, 2021 12:31 pm:
>
>
> On 2021/10/15 9:34, Nicholas Piggin wrote:
>> Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
>>>
>>>
>>> On 2021/10/14 5:46, Shakeel Butt wrote:
>>>> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
>>>>>
>>>>> Eric Dumazet reported a strange numa spreading info in [1], and found
>>>>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>>>>> this issue [2].
>>>>>
>>>>> Dig into the difference before and after this patch, page allocation has
>>>>> some difference:
>>>>>
>>>>> before:
>>>>> alloc_large_system_hash
>>>>> __vmalloc
>>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>>>> __vmalloc_node_range
>>>>> __vmalloc_area_node
>>>>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>>>>> alloc_pages_current
>>>>> alloc_page_interleave /* can be proved by print policy mode */
>>>>>
>>>>> after:
>>>>> alloc_large_system_hash
>>>>> __vmalloc
>>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
>>>>> __vmalloc_node_range
>>>>> __vmalloc_area_node
>>>>> alloc_pages_node /* choose nid by nuam_mem_id() */
>>>>> __alloc_pages_node(nid, ....)
>>>>>
>>>>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
>>>>> it will allocate memory in current node instead of interleaving allocate
>>>>> memory.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>>>>>
>>>>> [2]
>>>>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>>>>>
>>>>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>>>>> Reported-by: Eric Dumazet <[email protected]>
>>>>> Signed-off-by: Chen Wandun <[email protected]>
>>>>> ---
>>>>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
>>>>> 1 file changed, 26 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>>> index f884706c5280..48e717626e94 100644
>>>>> --- a/mm/vmalloc.c
>>>>> +++ b/mm/vmalloc.c
>>>>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>>> unsigned int order, unsigned int nr_pages, struct page **pages)
>>>>> {
>>>>> unsigned int nr_allocated = 0;
>>>>> + struct page *page;
>>>>> + int i;
>>>>>
>>>>> /*
>>>>> * For order-0 pages we make use of bulk allocator, if
>>>>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>>> if (!order) {
>>>>
>>>> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
>>>>
>>>>> while (nr_allocated < nr_pages) {
>>>>> unsigned int nr, nr_pages_request;
>>>>> + page = NULL;
>>>>>
>>>>> /*
>>>>> * A maximum allowed request is hard-coded and is 100
>>>>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>>> */
>>>>> nr_pages_request = min(100U, nr_pages - nr_allocated);
>>>>>
>>>>
>>>> Undo the following change in this if block.
>>>
>>> Yes, It seem like more simpler as you suggested, But it still have
>>> performance regression, I plan to change the following to consider
>>> both mempolcy and alloc_pages_bulk.
>>
>> Thanks for finding and debugging this. These APIs are a maze of twisty
>> little passages, all alike so I could be as confused as I was when I
>> wrote that patch, but doesn't a minimal fix look something like this?
>
> Yes, I sent a patch,it looks like as you show, besides it also
> contains some performance optimization.
>
> [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
> accelerate memory allocation

Okay. It would be better to do it as two patches. First the minimal fix
so it can be backported easily and have the Fixes: tag pointed at my
commit. Then the performance optimization.

Thanks,
Nick

2021-10-15 21:32:16

by Eric Dumazet

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

On Fri, Oct 15, 2021 at 12:11 AM Nicholas Piggin <[email protected]> wrote:

> Okay. It would be better to do it as two patches. First the minimal fix
> so it can be backported easily and have the Fixes: tag pointed at my
> commit. Then the performance optimization.
>

+2, we need a small fix for stable branches.

Thanks

> Thanks,
> Nick

2021-10-18 03:05:16

by Andrew Morton

Subject: Re: [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

On Thu, 14 Oct 2021 17:29:52 +0800 Chen Wandun <[email protected]> wrote:

> It will cause significant performance regressions in some situations
> as Andrew mentioned in [1]. The main situation is vmalloc, vmalloc
> will allocate pages with NUMA_NO_NODE by default, that will result
> in alloc page one by one;
>
> In order to solve this, __alloc_pages_bulk and mempolicy should be
> considered at the same time.
>
> 1) If node is specified in memory allocation request, it will alloc
> all pages by __alloc_pages_bulk.
>
> 2) If interleaving allocate memory, it will cauculate how many pages
> should be allocated in each node, and use __alloc_pages_bulk to alloc
> pages in each node.
>
> [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
>
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -531,6 +531,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> struct list_head *page_list,
> struct page **page_array);
>
> +unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> + unsigned long nr_pages,
> + struct page **page_array);
> +
> /* Bulk allocate order-0 pages */
> static inline unsigned long
> alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 9f8cd1457829..f456c5eb8d10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2196,6 +2196,82 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
> }
> EXPORT_SYMBOL(alloc_pages);
>
> +unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)

I'll make this static.

> +{
> + int nodes;
> + unsigned long nr_pages_per_node;
> + int delta;
> + int i;
> + unsigned long nr_allocated;
> + unsigned long total_allocated = 0;
> +
> + nodes = nodes_weight(pol->nodes);
> + nr_pages_per_node = nr_pages / nodes;
> + delta = nr_pages - nodes * nr_pages_per_node;
> +
> + for (i = 0; i < nodes; i++) {
> + if (delta) {
> + nr_allocated = __alloc_pages_bulk(gfp,
> + interleave_nodes(pol), NULL,
> + nr_pages_per_node + 1, NULL,
> + page_array);
> + delta--;
> + } else {
> + nr_allocated = __alloc_pages_bulk(gfp,
> + interleave_nodes(pol), NULL,
> + nr_pages_per_node, NULL, page_array);
> + }
> +
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + }
> +
> + return total_allocated;
> +}
> +
> +unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)

And this.

> +{
> + gfp_t preferred_gfp;
> + unsigned long nr_allocated = 0;
> +
> + preferred_gfp = gfp | __GFP_NOWARN;
> + preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> +
> + nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
> + nr_pages, NULL, page_array);
> +
> + if (nr_allocated < nr_pages)
> + nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
> + nr_pages - nr_allocated, NULL,
> + page_array + nr_allocated);
> + return nr_allocated;
> +}
> +
> +unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> + unsigned long nr_pages, struct page **page_array)
> +{
> + struct mempolicy *pol = &default_policy;
> +
> + if (!in_interrupt() && !(gfp & __GFP_THISNODE))
> + pol = get_task_policy(current);
> +
> + if (pol->mode == MPOL_INTERLEAVE)
> + return alloc_pages_bulk_array_interleave(gfp, pol,
> + nr_pages, page_array);
> +
> + if (pol->mode == MPOL_PREFERRED_MANY)
> + return alloc_pages_bulk_array_preferred_many(gfp,
> + numa_node_id(), pol, nr_pages, page_array);
> +
> + return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
> + policy_nodemask(gfp, pol), nr_pages, NULL,
> + page_array);
> +}

Some documentation via code comments would be nice. If only for
alloc_pages_bulk_array_mempolicy(). Mainly to explain why it exists.
I suppose the internal functions can be left uncommented if they're
sufficiently obvious?
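
For example, a kernel-doc style comment along these lines (the wording is
only a suggestion):

/**
 * alloc_pages_bulk_array_mempolicy - bulk-allocate pages while honouring
 * the calling task's NUMA mempolicy
 * @gfp: GFP flags for the allocation
 * @nr_pages: number of pages requested
 * @page_array: array to store the allocated pages in
 *
 * Like alloc_pages_bulk_array(), but spreads the pages according to the
 * current mempolicy (interleave, preferred-many, ...) instead of using a
 * single preferred node.
 *
 * Return: the number of pages actually allocated.
 */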

> struct folio *folio_alloc(gfp_t gfp, unsigned order)
> {
> struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b7ac4a8fe2b3..49adba793f3c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2856,23 +2856,14 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> */
> nr_pages_request = min(100U, nr_pages - nr_allocated);
>
> - if (nid == NUMA_NO_NODE) {
> - for (i = 0; i < nr_pages_request; i++) {
> - page = alloc_page(gfp);
> - if (page)
> - pages[nr_allocated + i] = page;
> - else {
> - nr = i;
> - break;
> - }
> - }
> - if (i >= nr_pages_request)
> - nr = nr_pages_request;
> - } else {
> + if (nid == NUMA_NO_NODE)
> + nr = alloc_pages_bulk_array_mempolicy(gfp,
> + nr_pages_request,
> + pages + nr_allocated);
> + else
> nr = alloc_pages_bulk_array_node(gfp, nid,
> nr_pages_request,
> pages + nr_allocated);
> - }
> nr_allocated += nr;
> cond_resched();
>

2021-10-18 03:37:13

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

On Thu, Oct 14, 2021 at 05:29:52PM +0800, Chen Wandun wrote:
> It will cause significant performance regressions in some situations
> as Andrew mentioned in [1]. The main situation is vmalloc, vmalloc
> will allocate pages with NUMA_NO_NODE by default, that will result
> in alloc page one by one;
>
> In order to solve this, __alloc_pages_bulk and mempolicy should be
> considered at the same time.
>
> 1) If node is specified in memory allocation request, it will alloc
> all pages by __alloc_pages_bulk.
>
> 2) If interleaving allocate memory, it will cauculate how many pages
> should be allocated in each node, and use __alloc_pages_bulk to alloc
> pages in each node.
>
> [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
>
> Signed-off-by: Chen Wandun <[email protected]>
> ----------------
> based on "[PATCH] mm/vmalloc: fix numa spreading for large hash tables"
> ---
> include/linux/gfp.h | 4 +++
> mm/mempolicy.c | 76 +++++++++++++++++++++++++++++++++++++++++++++
> mm/vmalloc.c | 19 +++---------
> 3 files changed, 85 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 558299cb2970..b976c4177299 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -531,6 +531,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> struct list_head *page_list,
> struct page **page_array);
>
> +unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> + unsigned long nr_pages,
> + struct page **page_array);
> +
> /* Bulk allocate order-0 pages */
> static inline unsigned long
> alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 9f8cd1457829..f456c5eb8d10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2196,6 +2196,82 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
> }
> EXPORT_SYMBOL(alloc_pages);
>
> +unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + int nodes;
> + unsigned long nr_pages_per_node;
> + int delta;
> + int i;
> + unsigned long nr_allocated;
> + unsigned long total_allocated = 0;
> +
> + nodes = nodes_weight(pol->nodes);
> + nr_pages_per_node = nr_pages / nodes;
> + delta = nr_pages - nodes * nr_pages_per_node;
> +
> + for (i = 0; i < nodes; i++) {
> + if (delta) {
> + nr_allocated = __alloc_pages_bulk(gfp,
> + interleave_nodes(pol), NULL,
> + nr_pages_per_node + 1, NULL,
> + page_array);
> + delta--;
> + } else {
> + nr_allocated = __alloc_pages_bulk(gfp,
> + interleave_nodes(pol), NULL,
> + nr_pages_per_node, NULL, page_array);
> + }
> +
> + page_array += nr_allocated;
> + total_allocated += nr_allocated;
> + }
> +
> + return total_allocated;
> +}
> +
> +unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
> + struct mempolicy *pol, unsigned long nr_pages,
> + struct page **page_array)
> +{
> + gfp_t preferred_gfp;
> + unsigned long nr_allocated = 0;
> +
> + preferred_gfp = gfp | __GFP_NOWARN;
> + preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> +
> + nr_allocated = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
> + nr_pages, NULL, page_array);
> +
> + if (nr_allocated < nr_pages)
> + nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
> + nr_pages - nr_allocated, NULL,
> + page_array + nr_allocated);
> + return nr_allocated;
> +}
> +
> +unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
> + unsigned long nr_pages, struct page **page_array)
> +{
> + struct mempolicy *pol = &default_policy;
> +
> + if (!in_interrupt() && !(gfp & __GFP_THISNODE))
> + pol = get_task_policy(current);
> +
> + if (pol->mode == MPOL_INTERLEAVE)
> + return alloc_pages_bulk_array_interleave(gfp, pol,
> + nr_pages, page_array);
> +
> + if (pol->mode == MPOL_PREFERRED_MANY)
> + return alloc_pages_bulk_array_preferred_many(gfp,
> + numa_node_id(), pol, nr_pages, page_array);
> +
> + return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
> + policy_nodemask(gfp, pol), nr_pages, NULL,
> + page_array);
> +}
> +
> struct folio *folio_alloc(gfp_t gfp, unsigned order)
> {
> struct page *page = alloc_pages(gfp | __GFP_COMP, order);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b7ac4a8fe2b3..49adba793f3c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2856,23 +2856,14 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> */
> nr_pages_request = min(100U, nr_pages - nr_allocated);
>
> - if (nid == NUMA_NO_NODE) {
> - for (i = 0; i < nr_pages_request; i++) {
> - page = alloc_page(gfp);
> - if (page)
> - pages[nr_allocated + i] = page;
> - else {
> - nr = i;
> - break;
> - }
> - }
> - if (i >= nr_pages_request)
> - nr = nr_pages_request;
> - } else {
> + if (nid == NUMA_NO_NODE)
> + nr = alloc_pages_bulk_array_mempolicy(gfp,
> + nr_pages_request,
> + pages + nr_allocated);
> + else
> nr = alloc_pages_bulk_array_node(gfp, nid,
> nr_pages_request,
> pages + nr_allocated);
> - }
> nr_allocated += nr;
> cond_resched();
>
> --
> 2.25.1
>
Now it looks much more correct.

Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>

--
Vlad Rezki

2021-10-18 03:37:24

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables

On Fri, Oct 15, 2021 at 05:11:25PM +1000, Nicholas Piggin wrote:
> Excerpts from Chen Wandun's message of October 15, 2021 12:31 pm:
> >
> >
> >> On 2021/10/15 9:34, Nicholas Piggin wrote:
> >> Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
> >>>
> >>>
> >>> On 2021/10/14 5:46, Shakeel Butt wrote:
> >>>> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <[email protected]> wrote:
> >>>>>
> >>>>> Eric Dumazet reported a strange numa spreading info in [1], and found
> >>>>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> >>>>> this issue [2].
> >>>>>
> >>>>> Dig into the difference before and after this patch, page allocation has
> >>>>> some difference:
> >>>>>
> >>>>> before:
> >>>>> alloc_large_system_hash
> >>>>> __vmalloc
> >>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
> >>>>> __vmalloc_node_range
> >>>>> __vmalloc_area_node
> >>>>> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
> >>>>> alloc_pages_current
> >>>>> alloc_page_interleave /* can be proved by print policy mode */
> >>>>>
> >>>>> after:
> >>>>> alloc_large_system_hash
> >>>>> __vmalloc
> >>>>> __vmalloc_node(..., NUMA_NO_NODE, ...)
> >>>>> __vmalloc_node_range
> >>>>> __vmalloc_area_node
> >>>>> alloc_pages_node /* choose nid by nuam_mem_id() */
> >>>>> __alloc_pages_node(nid, ....)
> >>>>>
> >>>>> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> >>>>> it will allocate memory in current node instead of interleaving allocate
> >>>>> memory.
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
> >>>>>
> >>>>> [2]
> >>>>> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
> >>>>>
> >>>>> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> >>>>> Reported-by: Eric Dumazet <[email protected]>
> >>>>> Signed-off-by: Chen Wandun <[email protected]>
> >>>>> ---
> >>>>> mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
> >>>>> 1 file changed, 26 insertions(+), 7 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >>>>> index f884706c5280..48e717626e94 100644
> >>>>> --- a/mm/vmalloc.c
> >>>>> +++ b/mm/vmalloc.c
> >>>>> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >>>>> unsigned int order, unsigned int nr_pages, struct page **pages)
> >>>>> {
> >>>>> unsigned int nr_allocated = 0;
> >>>>> + struct page *page;
> >>>>> + int i;
> >>>>>
> >>>>> /*
> >>>>> * For order-0 pages we make use of bulk allocator, if
> >>>>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >>>>> if (!order) {
> >>>>
> >>>> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
> >>>>
> >>>>> while (nr_allocated < nr_pages) {
> >>>>> unsigned int nr, nr_pages_request;
> >>>>> + page = NULL;
> >>>>>
> >>>>> /*
> >>>>> * A maximum allowed request is hard-coded and is 100
> >>>>> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >>>>> */
> >>>>> nr_pages_request = min(100U, nr_pages - nr_allocated);
> >>>>>
> >>>>
> >>>> Undo the following change in this if block.
> >>>
> >>> Yes, It seem like more simpler as you suggested, But it still have
> >>> performance regression, I plan to change the following to consider
> >>> both mempolcy and alloc_pages_bulk.
> >>
> >> Thanks for finding and debugging this. These APIs are a maze of twisty
> >> little passages, all alike so I could be as confused as I was when I
> >> wrote that patch, but doesn't a minimal fix look something like this?
> >
> > Yes, I sent a patch,it looks like as you show, besides it also
> > contains some performance optimization.
> >
> > [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
> > accelerate memory allocation
>
> Okay. It would be better to do it as two patches. First the minimal fix
> so it can be backported easily and have the Fixes: tag pointed at my
> commit. Then the performance optimization.
>
It is not only your commit. It also fixes mine :)

<snip>
commit 5c1f4e690eecc795b2e4d4408e87302040fceca4
Author: Uladzislau Rezki (Sony) <[email protected]>
Date: Mon Jun 28 19:40:14 2021 -0700

mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()
<snip>

I agree there should be two separate patches fixing the NUMA balancing
issue, each tagged with a "Fixes" tag. One is located here:

https://lkml.org/lkml/2021/10/15/1172

The second one should be the fix for the other place where "big" pages
are allocated, basically your patch.
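
That is, roughly the high-order hunk Nick posted earlier in the thread:

-		page = alloc_pages_node(nid, gfp, order);
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages(gfp, order);
+		else
+			page = alloc_pages_node(nid, gfp, order);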

--
Vlad Rezki

2021-10-18 08:47:56

by Chen Wandun

Subject: Re: [PATCH] mm/vmalloc: fix numa spreading for large hash tables



On 2021/10/15 19:51, Eric Dumazet wrote:
> On Fri, Oct 15, 2021 at 12:11 AM Nicholas Piggin <[email protected]> wrote:
>
>> Okay. It would be better to do it as two patches. First the minimal fix
>> so it can be backported easily and have the Fixes: tag pointed at my
>> commit. Then the performance optimization.
>>
>
> +2, we need a small fix for stable branches.
OK, I will send patch v2.

Thanks,
Wandun

>
> Thanks
>
>> Thanks,
>> Nick
> .
>