Message-ID: <559387EF.5050701@huawei.com>
Date: Wed, 1 Jul 2015 14:25:51 +0800
From: Xishi Qiu
To: Tang Chen
Subject: Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
In-Reply-To: <1435720614-16480-1-git-send-email-tangchen@cn.fujitsu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2015/7/1 11:16, Tang Chen wrote:
> When parsing SRAT, all memory ranges are added into numa_meminfo.
> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> all ranges over max_pfn or empty.
>
> But, this only works if the nodes are continuous.
> Let's have a look at the following example:
>
> We have an SRAT like this:
> SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
> SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
> SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
> SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
> SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
> SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
> SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
> SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
> SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
>
> On boot, only node 0,1,2,3 exist.
>
> And the numa_meminfo will look like this:
> numa_meminfo.nr_blks = 9
> 1. on node 0: [0, 60000000]
> 2. on node 0: [100000000, 20000000000]
> 3. on node 1: [20000000000, 40000000000]
> 4. on node 4: [40000000000, 60000000000]
> 5. on node 5: [60000000000, 80000000000]
> 6. on node 2: [80000000000, a0000000000]
> 7. on node 3: [a0000000000, a0800000000]
> 8. on node 6: [c0000000000, a0800000000]
> 9. on node 7: [e0000000000, a0800000000]
>
> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> the end address is over max_pfn, which is a0800000000. But 4 and 5
> are not removed because their end addresses are less than max_pfn.
> But in fact, node 4 and 5 don't exist.
>
> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>
> Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(),
> node 4 and 5 will be mistakenly set to online.
>
> In this patch, we use memblock_overlaps_region() to check if ranges in
> numa_meminfo overlap with ranges in memory_block. Since memory_block contains
> all available memory at boot time, if they overlap, it means the ranges
> exist. If not, then remove them from numa_meminfo.
>

Hi Tang Chen,

What's the impact of this problem?
Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu

> Signed-off-by: Tang Chen
> ---
>  arch/x86/mm/numa.c       | 6 ++++--
>  include/linux/memblock.h | 2 ++
>  mm/memblock.c            | 2 +-
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4053bb5..0c55cc5 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>  		bi->start = max(bi->start, low);
>  		bi->end = min(bi->end, high);
>
> -		/* and there's no empty block */
> -		if (bi->start >= bi->end)
> +		/* and there's no empty or non-exist block */
> +		if (bi->start >= bi->end ||
> +		    memblock_overlaps_region(&memblock.memory,
> +			bi->start, bi->end - bi->start) == -1)
>  			numa_remove_memblk_from(i--, mi);
>  	}
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 0215ffd..3bf6cc1 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>  int memblock_free(phys_addr_t base, phys_addr_t size);
>  int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
> +long memblock_overlaps_region(struct memblock_type *type,
> +			      phys_addr_t base, phys_addr_t size);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 1b444c7..55b5f9f 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>  	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>  }
>
> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>  					phys_addr_t base, phys_addr_t size)
>  {
>  	unsigned long i;
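
For anyone who wants to replay the example from the patch description
outside the kernel, here is a rough user-space sketch. It is not the
kernel code: struct blk, struct range, cleanup() and
demo_surviving_blocks() are hypothetical names, the "present" ranges
are assumed from the statement that only nodes 0,1,2,3 exist at boot,
and the merge step of numa_cleanup_meminfo() is left out. Only the
overlap predicate is taken from the quoted memblock_addrs_overlap().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct blk   { int nid; uint64_t start, end; };	/* [start, end) */
struct range { uint64_t base, size; };

/* Same predicate as memblock_addrs_overlap() in the quoted diff. */
static int addrs_overlap(uint64_t base1, uint64_t size1,
			 uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

static int overlaps_any(const struct range *mem, int cnt,
			uint64_t base, uint64_t size)
{
	for (int i = 0; i < cnt; i++)
		if (addrs_overlap(base, size, mem[i].base, mem[i].size))
			return 1;
	return 0;
}

/*
 * Clamp every block to [0, high), drop empty blocks and, when
 * check_present is set, blocks that overlap no present range.
 * Compacts b[] in place and returns the number of blocks kept.
 */
static int cleanup(struct blk *b, int n, const struct range *mem, int cnt,
		   uint64_t high, int check_present)
{
	int kept = 0;

	for (int i = 0; i < n; i++) {
		struct blk t = b[i];

		if (t.end > high)
			t.end = high;
		if (t.start >= t.end)
			continue;
		if (check_present &&
		    !overlaps_any(mem, cnt, t.start, t.end - t.start))
			continue;
		b[kept++] = t;
	}
	return kept;
}

/* SRAT ranges from the description, written with exclusive ends. */
static const struct blk srat[9] = {
	{ 0, 0x0ULL,           0x60000000ULL },
	{ 0, 0x100000000ULL,   0x20000000000ULL },
	{ 1, 0x20000000000ULL, 0x40000000000ULL },
	{ 4, 0x40000000000ULL, 0x60000000000ULL },
	{ 5, 0x60000000000ULL, 0x80000000000ULL },
	{ 2, 0x80000000000ULL, 0xa0000000000ULL },
	{ 3, 0xa0000000000ULL, 0xc0000000000ULL },
	{ 6, 0xc0000000000ULL, 0xe0000000000ULL },
	{ 7, 0xe0000000000ULL, 0x100000000000ULL },
};

/* Assumed present memory at boot: nodes 0, 1, 2 and part of node 3. */
static const struct range present[3] = {
	{ 0x0ULL,           0x60000000ULL },
	{ 0x100000000ULL,   0x40000000000ULL - 0x100000000ULL },
	{ 0x80000000000ULL, 0xa0800000000ULL - 0x80000000000ULL },
};

/* Runs the cleanup on a copy of srat[]; writes surviving node ids. */
int demo_surviving_blocks(int check_present, int *nids_out)
{
	struct blk work[9];
	int n;

	memcpy(work, srat, sizeof(work));
	n = cleanup(work, 9, present, 3, 0xa0800000000ULL, check_present);
	assert(n <= 9);
	for (int i = 0; i < n; i++)
		nids_out[i] = work[i].nid;
	return n;
}
```

Calling demo_surviving_blocks(0, nids) models the old check (clamp and
drop empty blocks only), so the blocks for node 4 and 5 stay in the
list. Calling it with check_present = 1 models the added overlap test
and drops them, under the assumed present ranges above.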