Date: Fri, 1 Sep 2006 09:33:19 +0100 (IST)
From: Mel Gorman
To: Keith Mannthey
Cc: akpm@osdl.org, tony.luck@intel.com, Linux Memory Management List, ak@suse.de, bob.picco@hp.com, Linux Kernel Mailing List, linuxppc-dev@ozlabs.org
Subject: Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes

On Thu, 31 Aug 2006, Keith Mannthey wrote:

> On 8/31/06, Mel Gorman wrote:
>> On Thu, 31 Aug 2006, Keith Mannthey wrote:
>> > On 8/31/06, Mel Gorman wrote:
>> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> >> > On 8/21/06, Mel Gorman wrote:
>> >> > >
>>
>> Can you confirm that happens by applying the patch I sent to you and
>> checking the output? When the reserve fails, it should print out what
>> range it actually checked. I want to be sure it's not checking the
>> addresses 0->0x1070000000
>
> See below
>

Perfect, thanks a lot. I should have enough to reproduce what is going on
without a test machine and to develop the required patches.
>> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> >> > >
>> >> > > 	printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> >> > > 		nd->start, nd->end);
>> >> > >+	e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> >> > >+					nd->end >> PAGE_SHIFT);
>> >> >
>> >> > A node chunk in this section of code may be a hot-pluggable zone. With
>> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >> >
>> >>
>> >> The ranges should not get registered as active memory by
>> >> e820_register_active_regions() unless they are marked E820_RAM. My
>> >> understanding is that the regions for hotadd would be marked "reserved"
>> >> in the e820 map. Is that wrong?
>> >
>> > This is wrong. In a multi-node system, the last node's add area will not
>> > be marked reserved by the e820. The e820 only defines memory <
>> > end_pfn; the last node's add area is > end_pfn.
>> >
>>
>> ok, that should still be fine. As long as the ranges are not marked
>> "usable", add_active_range() will not be called and the holes should be
>> counted correctly with the patch I sent you.
>>
>> > With RESERVE-based add-memory, you want the add areas reported by the
>> > SRAT to be set up during boot like all the other pages.
>> >
>>
>> So, do you actually expect a lot of unused mem_map to be allocated, with
>> struct pages that are inactive until memory is hot-added in an
>> x86_64-specific manner? The arch-independent code currently will not do
>> that; it sets up memmap only where memory really exists. If that is not
>> what you expect, it will hit issues at hotadd time, which is not the
>> current issue but one that can be fixed.
>
> Yes. RESERVED-based is a big waste of mem_map space.

Right, it's all very clear now. At some point in the future, I'd like to
revisit why SPARSEMEM-based hot-add is not always used, but that's a
separate issue.

> The add areas are marked as RESERVED during boot and then later onlined
> during add.
That explains the reserve_bootmem_node().

> It might be ok. I will play with it tomorrow. I might just need to
> call add_active_range() in the right spot :)
>

I'll play with this from the opposite perspective - what is required for
any arch using this API to have spare memmap for reserve-based hot-add.

>> >> > > 	if (ma->flags.hot_pluggable && reserve_hotadd(node, start, end) < 0) {
>> >> > > 		/* Ignore hotadd region. Undo damage */
>> >> >
>> >> > I have put the e820_register_active_regions as an else to this
>> >> > statement; the absent pages check fails.
>> >> >
>> >>
>> >> The patch below omits this change because I think
>> >> e820_register_active_regions() will still have been called by the time
>> >> you encounter a hotplug area.
>> >
>> > Called, but then removed in setup_arch.
>>
>> By "removed", I assume you mean the active regions removed by the call
>> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
>> called, e820_register_active_regions() will have reregistered the active
>> regions with the NUMA node id.
>
> I see e820_register_active_regions is acting as a filter against the e820.
>

Yes. A range of pfns is given to e820_register_active_regions() and it
reads the e820 for E820_RAM sections within that range.

>> >> > Also, nodes_cover_memory and a lot of these checks were based on
>> >> > comparing the SRAT data against the e820. Now all this code is
>> >> > comparing SRAT against SRAT....
>> >> >
>> >>
>> >> I don't see why. The SRAT table passes a range to
>> >> e820_register_active_regions(), so it should be comparing SRAT to e820.
>> >
>> > Let me go off and look at e820_register_active_regions() some more.
>
> Things get clear :)
>
> Should be ok.

nice.

>> > Sure thing. It is just the hot-add area. I am guessing it is an
>> > off-by-one error of some sort.
>> >
>
> See below. I do my e820_register_active_area as an else to the if
> (hotplug.....!reserve), and the printk is easy to sort out.
> yep, it should be easy to put this into a test program.
>
> I see your pfns are in base 10. Looks like it considers the last
> address to be a present page (an off-by-one thing).
>

Probably. I'll start poking now. Thanks a lot.

> Thanks,
>   Keith
>
> Output below
>
> disabling early console
> Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0 (SUSE Linux)) #6 SMP Thu Aug 31 22:06:00 EDT 2006
> Command line: root=/dev/sda3 ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2 showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0 debug numa=hotadd=100
> BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
>  BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
>  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>  BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
>  BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
>  BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
>  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
>  BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
>  BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
> end_pfn_map = 18219008
> DMI 2.3 present.
> ACPI: RSDP (v000 IBM ) @ 0x00000000000fdcf0
> ACPI: RSDT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98800
> ACPI: FADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98780
> ACPI: MADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98600
> ACPI: SRAT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff983c0
> ACPI: HPET (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98380
> ACPI: SSDT (v001 IBM VIGSSDT0 0x00001000 INTL 0x20030122) @ 0x000000007ff90780
> ACPI: SSDT (v001 IBM VIGSSDT1 0x00001000 INTL 0x20030122) @ 0x000000007ff88bc0
> ACPI: DSDT (v001 IBM EXA01ZEU 0x00001000 INTL 0x20030122) @ 0x0000000000000000
> SRAT: PXM 0 -> APIC 0 -> Node 0
> SRAT: PXM 0 -> APIC 1 -> Node 0
> SRAT: PXM 0 -> APIC 2 -> Node 0
> SRAT: PXM 0 -> APIC 3 -> Node 0
> SRAT: PXM 0 -> APIC 38 -> Node 0
> SRAT: PXM 0 -> APIC 39 -> Node 0
> SRAT: PXM 0 -> APIC 36 -> Node 0
> SRAT: PXM 0 -> APIC 37 -> Node 0
> SRAT: PXM 1 -> APIC 64 -> Node 1
> SRAT: PXM 1 -> APIC 65 -> Node 1
> SRAT: PXM 1 -> APIC 66 -> Node 1
> SRAT: PXM 1 -> APIC 67 -> Node 1
> SRAT: PXM 1 -> APIC 102 -> Node 1
> SRAT: PXM 1 -> APIC 103 -> Node 1
> SRAT: PXM 1 -> APIC 100 -> Node 1
> SRAT: PXM 1 -> APIC 101 -> Node 1
> SRAT: Node 0 PXM 0 0-80000000
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> SRAT: Node 0 PXM 0 0-470000000
> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> SRAT: Node 0 PXM 0 0-1070000000
> reserve_hotadd called with node 0 sart 470000000 end 1070000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(0, 0, 152) 3 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-1160000000
> Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-3200000000
> reserve_hotadd called with node 1 sart 1160000000 end 3200000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
> NUMA: Using 28 for the hash shift.
> Bootmem setup node 0 0000000000000000-0000001070000000
> Bootmem setup node 1 0000001070000000-0000001160000000
> Zone PFN ranges:
>   DMA             0 ->     4096
>   DMA32        4096 ->  1048576
>   Normal    1048576 -> 18219008
> early_node_map[4] active PFN ranges
>     0:        0 ->      152
>     0:      256 ->   524165
>     0:  1048576 ->  4653056
>     1: 17235968 -> 18219008
> On node 0 totalpages: 4128541
> 0 pages used for SPARSE memmap
> 1149 pages DMA reserved
> DMA zone: 2843 pages, LIFO batch:0
> 0 pages used for SPARSE memmap
> DMA32 zone: 520069 pages, LIFO batch:31
> 0 pages used for SPARSE memmap
> Normal zone: 3604480 pages, LIFO batch:31
> On node 1 totalpages: 983040
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> Normal zone: 983040 pages, LIFO batch:31

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab