2020-03-18 16:17:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> If CONFIG_NUMA isn't set, there is no need to ensure that
> the hugetlb cma area belongs to a specific numa node.
>
> min/max_low_pfn can be used for limiting the maximum size
> of the hugetlb_cma area.
>
> Also for_each_mem_pfn_range() is defined only if
> CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> other architectures) it depends on CONFIG_NUMA. This makes the
> build fail if CONFIG_NUMA isn't set.

CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
already. Is there any real reason we cannot make it unconditional?
Essentially make the functionality always enabled and drop the config?
The code below is ugly as hell. Just look at it. You have
for_each_node_state without any ifdefery but the having ifdef
CONFIG_NUMA. That just doesn't make any sense.

> Signed-off-by: Roman Gushchin <[email protected]>
> Reported-by: Andreas Schaufler <[email protected]>
> ---
> mm/hugetlb.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7a20cae7c77a..a6161239abde 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5439,16 +5439,21 @@ void __init hugetlb_cma_reserve(int order)
>
> reserved = 0;
> for_each_node_state(nid, N_ONLINE) {
> - unsigned long start_pfn, end_pfn;
> unsigned long min_pfn = 0, max_pfn = 0;
> - int res, i;
> + int res;
> +#ifdef CONFIG_NUMA
> + unsigned long start_pfn, end_pfn;
> + int i;
>
> for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> if (!min_pfn)
> min_pfn = start_pfn;
> max_pfn = end_pfn;
> }
> -
> +#else
> + min_pfn = min_low_pfn;
> + max_pfn = max_low_pfn;
> +#endif
> size = max(per_node, hugetlb_cma_size - reserved);
> size = round_up(size, PAGE_SIZE << order);
>
> --
> 2.24.1

--
Michal Hocko
SUSE Labs


2020-03-18 17:56:50

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

On Wed, Mar 18, 2020 at 05:16:25PM +0100, Michal Hocko wrote:
> On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> > If CONFIG_NUMA isn't set, there is no need to ensure that
> > the hugetlb cma area belongs to a specific numa node.
> >
> > min/max_low_pfn can be used for limiting the maximum size
> > of the hugetlb_cma area.
> >
> > Also for_each_mem_pfn_range() is defined only if
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> > other architectures) it depends on CONFIG_NUMA. This makes the
> > build fail if CONFIG_NUMA isn't set.
>
> CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
> already. Is there any real reason we cannot make it unconditional?
> Essentially make the functionality always enabled and drop the config?

It depends on CONFIG_NUMA only on arm, and I really don't know
if there is a good justification for it. It not, that will be a much
simpler fix.

> The code below is ugly as hell. Just look at it. You have
> for_each_node_state without any ifdefery but the having ifdef
> CONFIG_NUMA. That just doesn't make any sense.

I don't think it makes no sense:
it tries to reserve a cma area on each node (need for_each_node_state()),
and it uses the for_each_mem_pfn_range() to get a min and max pfn
for each node. With !CONFIG_NUMA the first part is reduced to one
iteration and the second part is not required at all.

I agree that for_each_mem_pfn_range() here looks quite ugly, but I don't know
of a better way to get min/max pfns for a node so early in the boot process.
If somebody has any ideas here, I'll appreciate a lot.

I know Rik plans some further improvements here, so the goal for now
is to fix the build. If you think that enabling CONFIG_HAVE_MEMBLOCK_NODE_MAP
unconditionally is a way to go, I'm fine with it too.

Rik also posted a different fix for the build problem, but from what I've
seen it didn't fix it completely. I'm fine with either option here.

Thanks!

>
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Reported-by: Andreas Schaufler <[email protected]>
> > ---
> > mm/hugetlb.c | 11 ++++++++---
> > 1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7a20cae7c77a..a6161239abde 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5439,16 +5439,21 @@ void __init hugetlb_cma_reserve(int order)
> >
> > reserved = 0;
> > for_each_node_state(nid, N_ONLINE) {
> > - unsigned long start_pfn, end_pfn;
> > unsigned long min_pfn = 0, max_pfn = 0;
> > - int res, i;
> > + int res;
> > +#ifdef CONFIG_NUMA
> > + unsigned long start_pfn, end_pfn;
> > + int i;
> >
> > for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > if (!min_pfn)
> > min_pfn = start_pfn;
> > max_pfn = end_pfn;
> > }
> > -
> > +#else
> > + min_pfn = min_low_pfn;
> > + max_pfn = max_low_pfn;
> > +#endif
> > size = max(per_node, hugetlb_cma_size - reserved);
> > size = round_up(size, PAGE_SIZE << order);
> >
> > --
> > 2.24.1
>
> --
> Michal Hocko
> SUSE Labs
>

2020-03-19 16:17:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

On Wed 18-03-20 10:55:29, Roman Gushchin wrote:
> On Wed, Mar 18, 2020 at 05:16:25PM +0100, Michal Hocko wrote:
> > On Wed 18-03-20 08:34:24, Roman Gushchin wrote:
> > > If CONFIG_NUMA isn't set, there is no need to ensure that
> > > the hugetlb cma area belongs to a specific numa node.
> > >
> > > min/max_low_pfn can be used for limiting the maximum size
> > > of the hugetlb_cma area.
> > >
> > > Also for_each_mem_pfn_range() is defined only if
> > > CONFIG_HAVE_MEMBLOCK_NODE_MAP is set, and on arm (unlike most
> > > other architectures) it depends on CONFIG_NUMA. This makes the
> > > build fail if CONFIG_NUMA isn't set.
> >
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP has popped out as a problem several times
> > already. Is there any real reason we cannot make it unconditional?
> > Essentially make the functionality always enabled and drop the config?
>
> It depends on CONFIG_NUMA only on arm, and I really don't know
> if there is a good justification for it. It not, that will be a much
> simpler fix.

I have checked the history and the dependency is there since NUMA was
introduced in arm64. So it would be great to double check with arch
maintainers.

> > The code below is ugly as hell. Just look at it. You have
> > for_each_node_state without any ifdefery but the having ifdef
> > CONFIG_NUMA. That just doesn't make any sense.
>
> I don't think it makes no sense:
> it tries to reserve a cma area on each node (need for_each_node_state()),
> and it uses the for_each_mem_pfn_range() to get a min and max pfn
> for each node. With !CONFIG_NUMA the first part is reduced to one
> iteration and the second part is not required at all.

Sure the resulting code logic makes sense. I meant that it doesn't make
much sense wrt readability. There is a loop over all existing numa nodes
to have ifdef for NUMA inside the loop. See?

> I agree that for_each_mem_pfn_range() here looks quite ugly, but I don't know
> of a better way to get min/max pfns for a node so early in the boot process.
> If somebody has any ideas here, I'll appreciate a lot.

The loop is ok. Maybe we have other memblock API that would be better
but I am not really aware of it from top of my head. I would stick with
it. It just sucks that this API depends on HAVE_MEMBLOCK_NODE_MAP and
that it is not generally available. This is what I am complaining about.
Just look what kind of dirty hack it made you to create ;)

> I know Rik plans some further improvements here, so the goal for now
> is to fix the build. If you think that enabling CONFIG_HAVE_MEMBLOCK_NODE_MAP
> unconditionally is a way to go, I'm fine with it too.

This is not the first time HAVE_MEMBLOCK_NODE_MAP has been problematic.
I might be missing something but I really do not get why do we really
need it these days. As for !NUMA, I suspect we can make it generate the
right thing when !NUMA.
--
Michal Hocko
SUSE Labs

2020-03-19 16:57:18

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

On Thu, 2020-03-19 at 17:16 +0100, Michal Hocko wrote:
>
> This is not the first time HAVE_MEMBLOCK_NODE_MAP has been
> problematic.
> I might be missing something but I really do not get why do we really
> need it these days. As for !NUMA, I suspect we can make it generate
> the
> right thing when !NUMA.

We're working on a different fix now.

It looks like cma_declare_contiguous calls memblock_phys_alloc_range,
which calls memblock_alloc_range_nid, which takes a NUMA node as one
of its arguments.

Aslan is looking at simply adding a cma_declare_contiguous_nid, which
also takes a NUMA node ID as an argument. At that point we can simply
leave CMA free to allocate from anywhere in each NUMA node, which by
default already happens from the top down.

That should be the nicer long term fix to this issue.

--
All Rights Reversed.


Attachments:
signature.asc (499.00 B)
This is a digitally signed message part

2020-03-19 17:02:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: fix hugetlb_cma_reserve() if CONFIG_NUMA isn't set

On Thu 19-03-20 12:56:34, Rik van Riel wrote:
> On Thu, 2020-03-19 at 17:16 +0100, Michal Hocko wrote:
> >
> > This is not the first time HAVE_MEMBLOCK_NODE_MAP has been
> > problematic.
> > I might be missing something but I really do not get why do we really
> > need it these days. As for !NUMA, I suspect we can make it generate
> > the
> > right thing when !NUMA.
>
> We're working on a different fix now.
>
> It looks like cma_declare_contiguous calls memblock_phys_alloc_range,
> which calls memblock_alloc_range_nid, which takes a NUMA node as one
> of its arguments.
>
> Aslan is looking at simply adding a cma_declare_contiguous_nid, which
> also takes a NUMA node ID as an argument. At that point we can simply
> leave CMA free to allocate from anywhere in each NUMA node, which by
> default already happens from the top down.
>
> That should be the nicer long term fix to this issue.

Yes, that sounds like a better solution. Not that I would be much
happier to have HAVE_MEMBLOCK_NODE_MAP off the table as well ;)
Ohh well, it will have to remain on my todo list for some longer.
--
Michal Hocko
SUSE Labs