2021-10-18 12:25:16

by Chen Wandun

Subject: [PATCH v2 0/2] fix numa spreading for large hash tables

[PATCH v2 1/2] fix numa spreading problem
[PATCH v2 2/2] optimization about performance

v1 ==> v2:
1. do a minimal fix in [PATCH v2 1/2]
2. add some comments

Chen Wandun (2):
mm/vmalloc: fix numa spreading for large hash tables
mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate
memory allocation

include/linux/gfp.h | 4 +++
mm/mempolicy.c | 82 +++++++++++++++++++++++++++++++++++++++++++++
mm/vmalloc.c | 28 ++++++++++++----
3 files changed, 108 insertions(+), 6 deletions(-)

--
2.25.1


2021-10-18 12:27:35

by Chen Wandun

Subject: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

Eric Dumazet reported a strange numa spreading info in [1], and found
commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
this issue [2].

Digging into the difference before and after this patch, the page
allocation paths differ:

before:
alloc_large_system_hash
__vmalloc
__vmalloc_node(..., NUMA_NO_NODE, ...)
__vmalloc_node_range
__vmalloc_area_node
alloc_page /* NUMA_NO_NODE, so the alloc_page branch is chosen */
alloc_pages_current
alloc_page_interleave /* can be verified by printing the policy mode */

after:
alloc_large_system_hash
__vmalloc
__vmalloc_node(..., NUMA_NO_NODE, ...)
__vmalloc_node_range
__vmalloc_area_node
alloc_pages_node /* choose nid by numa_mem_id() */
__alloc_pages_node(nid, ....)

So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
memory is allocated on the current node instead of being interleaved
across nodes.

[1]
https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/

[2]
https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/

Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Chen Wandun <[email protected]>
---
mm/vmalloc.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..87552a4018aa 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
unsigned int order, unsigned int nr_pages, struct page **pages)
{
unsigned int nr_allocated = 0;
+ struct page *page;
+ int i;

/*
* For order-0 pages we make use of bulk allocator, if
@@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
* to fails, fallback to a single page allocator that is
* more permissive.
*/
- if (!order) {
+ if (!order && nid != NUMA_NO_NODE) {
while (nr_allocated < nr_pages) {
unsigned int nr, nr_pages_request;

@@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
if (nr != nr_pages_request)
break;
}
- } else
+ } else if (order)
/*
* Compound pages required for remap_vmalloc_page if
* high-order pages.
@@ -2856,11 +2858,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
gfp |= __GFP_COMP;

/* High-order pages or fallback path if "bulk" fails. */
- while (nr_allocated < nr_pages) {
- struct page *page;
- int i;

- page = alloc_pages_node(nid, gfp, order);
+ page = NULL;
+ while (nr_allocated < nr_pages) {
+ if (nid == NUMA_NO_NODE)
+ page = alloc_pages(gfp, order);
+ else
+ page = alloc_pages_node(nid, gfp, order);
if (unlikely(!page))
break;

--
2.25.1

2021-10-18 12:44:07

by Matthew Wilcox

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].

I think the root problem here is that we have two meanings for
NUMA_NO_NODE. I tend to read it as "The memory can be allocated from
any node", but here it's used to mean "The memory should be spread over
every node". Should we split those out as -1 and -2?

2021-10-18 13:02:52

by Chen Wandun

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables



On 2021/10/18 20:39, Matthew Wilcox wrote:
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE. I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node". Should we split those out as -1 and -2?
Yes, the intent of NUMA_NO_NODE is sometimes confusing.

Besides, I think NUMA_NO_NODE should respect the mempolicy in
most cases in the kernel, unless it is pointed out explicitly that
memory can be allocated without considering the mempolicy.


2021-10-18 14:19:31

by Uladzislau Rezki (Sony)

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Dig into the difference before and after this patch, page allocation has
> some difference:
>
> before:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
> alloc_pages_current
> alloc_page_interleave /* can be proved by print policy mode */
>
> after:
> alloc_large_system_hash
> __vmalloc
> __vmalloc_node(..., NUMA_NO_NODE, ...)
> __vmalloc_node_range
> __vmalloc_area_node
> alloc_pages_node /* choose nid by numa_mem_id() */
> __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> it will allocate memory in current node instead of interleaving allocate
> memory.
>
> [1]
> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2]
> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <[email protected]>
> Signed-off-by: Chen Wandun <[email protected]>
> ---
> mm/vmalloc.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..87552a4018aa 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> unsigned int order, unsigned int nr_pages, struct page **pages)
> {
> unsigned int nr_allocated = 0;
> + struct page *page;
> + int i;
>
> /*
> * For order-0 pages we make use of bulk allocator, if
> @@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> * to fails, fallback to a single page allocator that is
> * more permissive.
> */
> - if (!order) {
> + if (!order && nid != NUMA_NO_NODE) {
> while (nr_allocated < nr_pages) {
> unsigned int nr, nr_pages_request;
>
> @@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> if (nr != nr_pages_request)
> break;
> }
> - } else
> + } else if (order)
> /*
> * Compound pages required for remap_vmalloc_page if
> * high-order pages.
> @@ -2856,11 +2858,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> gfp |= __GFP_COMP;
>
> /* High-order pages or fallback path if "bulk" fails. */
> - while (nr_allocated < nr_pages) {
> - struct page *page;
> - int i;
>
> - page = alloc_pages_node(nid, gfp, order);
> + page = NULL;
>
Why do you need to set page to NULL here?

--
Vlad Rezki

2021-10-19 15:56:45

by Shakeel Butt

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > Eric Dumazet reported a strange numa spreading info in [1], and found
> > commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> > this issue [2].
>
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE. I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node". Should we split those out as -1 and -2?

I agree with Willy's suggestion to make it more explicit, but as
follow-up work. This patch needs a backport, so let's keep this one simple.

2021-10-19 16:01:12

by Shakeel Butt

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

On Mon, Oct 18, 2021 at 5:23 AM Chen Wandun <[email protected]> wrote:
>
[...]
>
> /* High-order pages or fallback path if "bulk" fails. */
> - while (nr_allocated < nr_pages) {
> - struct page *page;
> - int i;
>
> - page = alloc_pages_node(nid, gfp, order);
> + page = NULL;

No need for the above NULL assignment.

After removing this, you can add:

Reviewed-by: Shakeel Butt <[email protected]>

> + while (nr_allocated < nr_pages) {
> + if (nid == NUMA_NO_NODE)
> + page = alloc_pages(gfp, order);
> + else
> + page = alloc_pages_node(nid, gfp, order);
> if (unlikely(!page))
> break;
>
> --
> 2.25.1
>

2021-10-19 16:01:57

by Eric Dumazet

Subject: Re: [PATCH v2 1/2] mm/vmalloc: fix numa spreading for large hash tables

On Tue, Oct 19, 2021 at 8:54 AM Shakeel Butt <[email protected]> wrote:
>
> On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > > Eric Dumazet reported a strange numa spreading info in [1], and found
> > > commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> > > this issue [2].
> >
> > I think the root problem here is that we have two meanings for
> > NUMA_NO_NODE. I tend to read it as "The memory can be allocated from
> > any node", but here it's used to mean "The memory should be spread over
> > every node". Should we split those out as -1 and -2?
>
> I agree with Willy's suggestion to make it more explicit but as a
> followup work. This patch needs a backport, so keep this simple.

NUMA_NO_NODE in process context also meant:
please follow the current thread's NUMA policies.

One could hope, for instance, that whenever large BPF maps are allocated,
the current thread could set non-default NUMA policies.