2015-07-01 03:16:54

by Tang Chen

Subject: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

When parsing SRAT, all memory ranges are added to numa_meminfo.
In numa_init(), before numa_cleanup_meminfo() is entered, all possible
memory ranges are in numa_meminfo, and numa_cleanup_meminfo() removes
every range that is empty or lies above max_pfn.
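
The relevant part of numa_cleanup_meminfo() does roughly the following
(a simplified sketch; low and high are the 0..max_pfn limits converted
to addresses):

	for (i = 0; i < mi->nr_blks; i++) {
		struct numa_memblk *bi = &mi->blk[i];

		/* clamp each block to the [low, high) address range */
		bi->start = max(bi->start, low);
		bi->end = min(bi->end, high);

		/* and drop the block if it is now empty */
		if (bi->start >= bi->end)
			numa_remove_memblk_from(i--, mi);
	}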

But this only works if the nodes are contiguous. Let's look at the
following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug

On boot, only nodes 0, 1, 2 and 3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, c0000000000]
8. on node 6: [c0000000000, e0000000000]
9. on node 7: [e0000000000, 100000000000]

And numa_cleanup_meminfo() will merge blocks 1 and 2, truncate block 7
at max_pfn (a0800000000), and remove blocks 8 and 9 because their end
addresses are above max_pfn. But blocks 4 and 5 are not removed because
their end addresses are less than max_pfn. In fact, nodes 4 and 5 don't
exist.

In short, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since the memory ranges of nodes 4 and 5 stay in numa_meminfo,
numa_register_memblks() will mistakenly set nodes 4 and 5 online.

In this patch, we use memblock_overlaps_region() to check whether each
range in numa_meminfo overlaps with a range in memblock.memory. Since
memblock.memory contains all memory available at boot time, an overlap
means the range exists. If there is no overlap, remove the range from
numa_meminfo.
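
For reference, memblock_overlaps_region() returns the index of the
first region of @type that intersects [@base, @base + @size), or -1 if
there is none. Roughly (a sketch of the mm/memblock.c code this patch
exports, from memory):

	static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
					phys_addr_t base, phys_addr_t size)
	{
		unsigned long i;

		for (i = 0; i < type->cnt; i++) {
			phys_addr_t rgnbase = type->regions[i].base;
			phys_addr_t rgnsize = type->regions[i].size;

			/* memblock_addrs_overlap() tests two ranges for intersection */
			if (memblock_addrs_overlap(base, size, rgnbase, rgnsize))
				break;
		}

		return (i < type->cnt) ? i : -1;
	}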

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 6 ++++--
include/linux/memblock.h | 2 ++
mm/memblock.c | 2 +-
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
bi->start = max(bi->start, low);
bi->end = min(bi->end, high);

- /* and there's no empty block */
- if (bi->start >= bi->end)
+ /* and there's no empty or non-existent block */
+ if (bi->start >= bi->end ||
+ memblock_overlaps_region(&memblock.memory,
+ bi->start, bi->end - bi->start) == -1)
numa_remove_memblk_from(i--, mi);
}

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+ phys_addr_t base, phys_addr_t size);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
}

-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
phys_addr_t base, phys_addr_t size)
{
unsigned long i;
--
1.8.4.2


2015-07-01 06:26:47

by Xishi Qiu

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

On 2015/7/1 11:16, Tang Chen wrote:

> When parsing SRAT, all memory ranges are added to numa_meminfo.
> [...]
> Since the memory ranges of nodes 4 and 5 stay in numa_meminfo,
> numa_register_memblks() will mistakenly set nodes 4 and 5 online.
>

Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu



2015-07-01 07:55:30

by Tang Chen

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> On 2015/7/1 11:16, Tang Chen wrote:
>
>> When parsing SRAT, all memory ranges are added to numa_meminfo.
>> [...]
>> Since the memory ranges of nodes 4 and 5 stay in numa_meminfo,
>> numa_register_memblks() will mistakenly set nodes 4 and 5 online.
>>
> Hi Tang Chen,
>
> What's the impact of this problem?
>
> Command "numactl --hard" will show an empty node(no cpu and no memory,
> but pgdat is created), right?

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Nodes 2 and 3 do not exist, but they are online.

Thanks.


2015-07-01 08:56:16

by Xishi Qiu

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

On 2015/7/1 15:55, Tang Chen wrote:

>
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>> On 2015/7/1 11:16, Tang Chen wrote:
>>
>>> When parsing SRAT, all memory ranges are added to numa_meminfo.
>>> [...]
>>> Since the memory ranges of nodes 4 and 5 stay in numa_meminfo,
>>> numa_register_memblks() will mistakenly set nodes 4 and 5 online.
>>>
>> Hi Tang Chen,
>>
>> What's the impact of this problem?
>>
>> Command "numactl --hardware" will show an empty node (no cpu and no
>> memory, but pgdat is created), right?
>
> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.
>

Yes, because the flow is SRAT -> numa_meminfo -> pgdat allocation.


Thanks,
Xishi Qiu

2015-07-02 15:02:28

by YASUAKI ISHIMATSU

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

Hi Tang,

> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.

According to your patch description, nodes 4 and 5 are mistakenly
set online. Why does lscpu show the above result?

Thanks,
Yasuaki Ishimatsu


2015-07-03 01:26:06

by Tang Chen

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
>> [...]
>> Nodes 2 and 3 do not exist, but they are online.
> According to your patch description, nodes 4 and 5 are mistakenly
> set online. Why does lscpu show the above result?

Not nodes 4 and 5; it is nodes 2 and 3 that are mistakenly set online.

Well, actually it is not only lscpu that gives a strange result: under
/sys/devices/system/node, interfaces for nodes 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, nodes 2 and 3 are set online, which is incorrect.

For now, I have only found that numa_cleanup_meminfo() removes memory
above max_pfn but does not remove holes between nodes.

I don't think user-space libraries can work around this problem, since
the nodes are set online in the kernel. Seen from user space, there is
no hole.

Thanks.


2015-07-06 16:42:38

by YASUAKI ISHIMATSU

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen <[email protected]> wrote:

>
> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> > Hi Tang,
> >
> >> [...]
> >> Nodes 2 and 3 do not exist, but they are online.
> > According to your patch description, nodes 4 and 5 are mistakenly
> > set online. Why does lscpu show the above result?
>
> Not nodes 4 and 5; it is nodes 2 and 3 that are mistakenly set online.

Please add the lscpu results from before and after applying the patch
to the patch description.

Feel free to add my
Reviewed-by: Yasuaki Ishimatsu <[email protected]>

Thanks,
Yasuaki Ishimatsu


2015-07-07 08:56:46

by Tang Chen

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:
> On Fri, 3 Jul 2015 09:26:05 +0800
> Tang Chen <[email protected]> wrote:
>
>> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
>>> Hi Tang,
>>>
>>>> [...]
>>>> Nodes 2 and 3 do not exist, but they are online.
>>> According to your patch description, nodes 4 and 5 are mistakenly
>>> set online. Why does lscpu show the above result?
>> Not nodes 4 and 5; it is nodes 2 and 3 that are mistakenly set online.
> Please add the lscpu results from before and after applying the patch
> to the patch description.
>
> Feel free to add my
> Reviewed-by: Yasuaki Ishimatsu <[email protected]>

Thanks for reviewing. Will update the patch soon.

Thanks.


2015-07-15 21:20:14

by Tejun Heo

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...
> - /* and there's no empty block */
> - if (bi->start >= bi->end)
> + /* and there's no empty or non-existent block */
> + if (bi->start >= bi->end ||
> + memblock_overlaps_region(&memblock.memory,
> + bi->start, bi->end - bi->start) == -1)

Ugh.... can you please change memblock_overlaps_region() to return
bool instead?

Thanks.

--
tejun

2015-07-16 05:29:55

by Tang Chen

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
>> - /* and there's no empty block */
>> - if (bi->start >= bi->end)
>> + /* and there's no empty or non-existent block */
>> + if (bi->start >= bi->end ||
>> + memblock_overlaps_region(&memblock.memory,
>> + bi->start, bi->end - bi->start) == -1)
> Ugh.... can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() was designed to return the
index of the region that overlaps with the given range. Maybe it had
other users before.

For now, of course, it is only called by memblock_is_region_reserved().
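
For reference, that caller only checks whether the returned index is
valid, roughly like this (a sketch of the code at the time, not part of
this patch):

	/* nonzero if [base, base + size) intersects a reserved region */
	int __init_memblock memblock_is_region_reserved(phys_addr_t base,
							phys_addr_t size)
	{
		return memblock_overlaps_region(&memblock.reserved, base, size) >= 0;
	}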

It is OK to change the return value of memblock_overlaps_region() to
bool, but any caller of memblock_is_region_reserved() would also have
to be changed.

I think it is OK to leave it there.

Thanks.

>
> Thanks.
>

2015-07-16 07:21:09

by Tang Chen

Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.


On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
>> - /* and there's no empty block */
>> - if (bi->start >= bi->end)
>> + /* and there's no empty or non-existent block */
>> + if (bi->start >= bi->end ||
>> + memblock_overlaps_region(&memblock.memory,
>> + bi->start, bi->end - bi->start) == -1)
> Ugh.... can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() was designed to return the
index of the region that overlaps with the given range.

For now, of course, it is only called by memblock_is_region_reserved().

Will post a patch to do this.
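
The change might look roughly like this (a sketch of the direction
only, not the follow-up patch itself):

	/* true if [base, base + size) intersects any region of @type */
	bool __init_memblock memblock_overlaps_region(struct memblock_type *type,
						      phys_addr_t base, phys_addr_t size)
	{
		unsigned long i;

		for (i = 0; i < type->cnt; i++)
			if (memblock_addrs_overlap(base, size,
						   type->regions[i].base,
						   type->regions[i].size))
				return true;

		return false;
	}

The check in numa_cleanup_meminfo() would then test
!memblock_overlaps_region(&memblock.memory, bi->start,
bi->end - bi->start) instead of comparing the result against -1.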

Thanks.

>
> Thanks.
>