This is V9 of the patchset to size zones and memory holes in an
architecture-independent manner. It booted successfully on 5 different
machines (arches were x86, x86_64, ppc64 and ia64) in a number of different
configurations and successfully built a kernel. If it fails on any machine,
booting with loglevel=8 and sending me the console log should show what went
wrong.

Changelog since V8
o Rebase to 2.6.17-rc4-mm1
o Make dma_reserve static
o Use enumerated zone numbers

Changelog since V7
o Rebase to 2.6.17-mm6
o Account for mem_map as a memory hole
o Adjust mem_map when arch independent zone-sizing is used and PFN 0 is in
  a memory hole not accounted for by ARCH_PFN_OFFSET

Changelog since V6
o MAX_ACTIVE_REGIONS is really maximum active regions, not MAX_ACTIVE_REGIONS-1
o MAX_ACTIVE_REGIONS is 256 unless the architecture specifically asks for
  a different number or MAX_NUMNODES is >= 32
o nr_nodemap_entries tracks the number of entries rather than terminating with
  end_pfn == 0
o Add a number of documentation-related comments. Functions exposed by headers
  may potentially be picked up by kerneldoc
o Changed misleading zone_present_pages_in_node() name to
  zone_spanned_pages_in_node()
o Be a bit more verbose to help debugging when things go wrong
o On x86_64, end_pfn_map now gets updated properly; otherwise ACPI tables
  get "lost"
o Signoffs added to patches 1 and 5 by Bob Picco related to contributions,
  fixes and reviews

Changelog since V5
o Add a missing #include to mm/mem_init.c
o Drop the verbose debugging part of the set
o Report active range registration when loglevel is set for KERN_DEBUG

Changelog since V4
o Rebase to 2.6.17-rc3-mm1
o Calculate holes on x86 with SRAT correctly

Changelog since V3
o Rebase to 2.6.17-rc2
o Allow the active regions to be cleared. Needed by x86_64 when it decides
  the SRAT table is bad half way through the registering of active regions
o Fix for flatmem x86_64 machines booting

Changelog since V2
o Fix a bug where holes in lower zones get double counted
o Catch the case where a new range is registered that is within an existing
  range
o Catch the case where a zone boundary is within a hole
o Use the EFI map for registering ranges on x86_64+numa
o On IA64+NUMA, add the active ranges before rounding for granules
o On x86_64, remove e820_hole_size and e820_bootmem_free and use
  arch-independent equivalents
o On x86_64, remove the map walk in e820_end_of_ram()
o Rename memory_present_with_active_regions as the name was ambiguous
o Add absent_pages_in_range() for arches to call

Changelog since V1
o Correctly convert virtual and physical addresses to PFNs on ia64
o Correctly convert physical addresses to PFN on older ppc
o When add_active_range() is called with overlapping pfn ranges, merge them
o When a zone boundary occurs within a memory hole, account correctly
o Minor whitespace damage cleanup
o Debugging patch temporarily included

At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate
zone sizes and holes in each architecture is very similar. Some of this
zone and hole sizing code is difficult to read for no good reason. This
set of patches eliminates the similar-looking architecture-specific code.

The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas
have been discovered, free_area_init_nodes() is called to initialise
the pgdat and zones. The zone sizes and holes are then calculated in an
architecture independent manner.
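
As a rough user-space illustration of the registration idea (this is not the
kernel implementation), the sketch below keeps a small table of active PFN
ranges and counts the page frames in a requested range that no registered
range covers. Every name starting with model_ and the table size are
assumptions made for this example. The merge only handles a new range that
starts exactly where an existing range on the same node ends, and the hole
count is simply "range length minus covered pages", which is simpler than
what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
#define MODEL_MAX_REGIONS 16

struct model_region {
	unsigned long start_pfn;	/* first PFN in the range */
	unsigned long end_pfn;		/* one past the last PFN */
	int nid;			/* node the range belongs to */
};

static struct model_region model_map[MODEL_MAX_REGIONS];
static size_t model_nr_entries;

/*
 * Register [start_pfn, end_pfn) on a node. If an existing range on the
 * same node ends exactly at start_pfn, extend it instead of adding a new
 * entry. Returns 0 on success or -1 if the table is full.
 */
static int model_add_active_range(int nid, unsigned long start_pfn,
				  unsigned long end_pfn)
{
	size_t i;

	for (i = 0; i < model_nr_entries; i++) {
		if (model_map[i].nid == nid &&
		    model_map[i].end_pfn == start_pfn) {
			model_map[i].end_pfn = end_pfn;
			return 0;
		}
	}

	if (model_nr_entries == MODEL_MAX_REGIONS)
		return -1;

	model_map[model_nr_entries].start_pfn = start_pfn;
	model_map[model_nr_entries].end_pfn = end_pfn;
	model_map[model_nr_entries].nid = nid;
	model_nr_entries++;
	return 0;
}

/*
 * Count the PFNs in [range_start, range_end) that are not covered by any
 * registered range. Assumes registered ranges do not overlap each other.
 */
static unsigned long model_absent_pages(unsigned long range_start,
					unsigned long range_end)
{
	unsigned long present = 0;
	size_t i;

	for (i = 0; i < model_nr_entries; i++) {
		unsigned long s = model_map[i].start_pfn;
		unsigned long e = model_map[i].end_pfn;

		/* Clip the registered range to the requested range */
		if (s < range_start)
			s = range_start;
		if (e > range_end)
			e = range_end;
		if (s < e)
			present += e - s;
	}

	return (range_end - range_start) - present;
}
```

In this model, registering [0, 100) and [100, 200) on node 0 leaves a single
entry, and adding [300, 400) afterwards leaves a 100 page hole in [0, 400).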
Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
        It adjusts the watermarks slightly
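
To give a feel for the size of the adjustment made by patch 6, the sketch
below estimates how many pages a mem_map occupies for a given number of
spanned page frames. The page size, the size of struct page and the helper
name are all assumptions chosen for illustration; the real values depend on
the architecture and the kernel configuration, and the patch itself may do
the accounting differently.

```c
#include <assert.h>

/* Assumed values for illustration only; real values are arch/config specific */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_STRUCT_PAGE_SIZE	32UL

/* Pages needed to hold one struct page per spanned page frame, rounded up */
static unsigned long sketch_memmap_pages(unsigned long spanned_pages)
{
	unsigned long bytes = spanned_pages * SKETCH_STRUCT_PAGE_SIZE;

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

With these assumed sizes, 262144 spanned pages (1GB of 4KB pages) need 2048
pages of mem_map, which is the amount that would be treated as a hole.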
Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig. Bob Picco has also tested and debugged
on IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine. These tests used release 3 of these patches applied against
2.6.17-rc1, but there have been no ia64 changes since release 3.
There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts for the kernel image and the starting mem_maps as memory
holes but the architecture-independent code accounts for the memory as present.
The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy. There should be a
greater reduction when other architectures use the same mechanisms for
zone and hole sizing but I lack the hardware to test on.
Additional credit:
    Dave Hansen for the initial suggestion and comments on early patches
    Andy Whitcroft for reviewing early versions and catching numerous
        errors
    Tony Luck for testing and debugging on IA64
    Bob Picco for fixing bugs related to pfn registration, reviewing a
        number of patch revisions, providing a number of suggestions
        on future direction and testing heavily
    Jack Steiner and Robin Holt for testing on IA64 and clarifying
        issues related to memory holes
    Yasunori for testing on IA64
    Andi Kleen for reviewing and providing feedback on x86_64
    Christian Kujau for providing valuable information related to ACPI
        problems on x86_64 and testing potential fixes
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
Size zones and holes in an architecture independent manner for Power.
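
The main interface change for callers in this patch is that the old
add_region() took a start PFN and a page count, while add_active_range()
takes a start PFN and an end PFN. The sketch below shows that conversion in
isolation as a user-space model. SKETCH_PAGE_SHIFT, the two structs and
sketch_region_to_pfns() are assumptions for this example, and the region is
assumed to be page aligned.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4KB pages */

/* A physical memory region described by base address and size in bytes */
struct sketch_region {
	unsigned long base;
	unsigned long size;
};

/* The same region as a half-open PFN range [start_pfn, end_pfn) */
struct sketch_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

static struct sketch_pfn_range sketch_region_to_pfns(struct sketch_region r)
{
	struct sketch_pfn_range out;

	out.start_pfn = r.base >> SKETCH_PAGE_SHIFT;
	/* end_pfn is start plus the page count, not a second address shift */
	out.end_pfn = out.start_pfn + (r.size >> SKETCH_PAGE_SHIFT);
	return out;
}
```

For example, a 128MB region at physical address 0x10000000 becomes the PFN
range [0x10000, 0x18000) under the assumed 4KB page size.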
powerpc/Kconfig | 7 --
powerpc/mm/mem.c | 51 +++++----------
powerpc/mm/numa.c | 159 ++++---------------------------------------------
ppc/Kconfig | 3
ppc/mm/init.c | 23 +++----
5 files changed, 52 insertions(+), 191 deletions(-)
Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/Kconfig linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/Kconfig
--- linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/Kconfig 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/Kconfig 2006-08-21 10:13:25.000000000 +0100
@@ -715,11 +715,10 @@ config ARCH_SPARSEMEM_DEFAULT
def_bool y
depends on SMP && PPC_PSERIES
-source "mm/Kconfig"
-
-config HAVE_ARCH_EARLY_PFN_TO_NID
+config ARCH_POPULATES_NODE_MAP
def_bool y
- depends on NEED_MULTIPLE_NODES
+
+source "mm/Kconfig"
config ARCH_MEMORY_PROBE
def_bool y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c
--- linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c 2006-08-21 10:13:25.000000000 +0100
@@ -256,20 +256,22 @@ void __init do_init_bootmem(void)
boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages);
+ /* Add active regions with valid PFNs */
+ for (i = 0; i < lmb.memory.cnt; i++) {
+ unsigned long start_pfn, end_pfn;
+ start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+ end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+ add_active_range(0, start_pfn, end_pfn);
+ }
+
/* Add all physical memory to the bootmem map, mark each area
* present.
*/
- for (i = 0; i < lmb.memory.cnt; i++) {
- unsigned long base = lmb.memory.region[i].base;
- unsigned long size = lmb_size_bytes(&lmb.memory, i);
#ifdef CONFIG_HIGHMEM
- if (base >= total_lowmem)
- continue;
- if (base + size > total_lowmem)
- size = total_lowmem - base;
+ free_bootmem_with_active_regions(0, total_lowmem >> PAGE_SHIFT);
+#else
+ free_bootmem_with_active_regions(0, max_pfn);
#endif
- free_bootmem(base, size);
- }
/* reserve the sections we're already using */
for (i = 0; i < lmb.reserved.cnt; i++)
@@ -277,9 +279,8 @@ void __init do_init_bootmem(void)
lmb_size_bytes(&lmb.reserved, i));
/* XXX need to clip this if using highmem? */
- for (i = 0; i < lmb.memory.cnt; i++)
- memory_present(0, lmb_start_pfn(&lmb.memory, i),
- lmb_end_pfn(&lmb.memory, i));
+ sparse_memory_present_with_active_regions(0);
+
init_bootmem_done = 1;
}
@@ -288,10 +289,9 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
unsigned long total_ram = lmb_phys_mem_size();
unsigned long top_of_ram = lmb_end_of_DRAM();
+ unsigned long max_zone_pfns[MAX_NR_ZONES];
#ifdef CONFIG_HIGHMEM
map_page(PKMAP_BASE, 0, 0); /* XXX gross */
@@ -307,26 +307,13 @@ void __init paging_init(void)
top_of_ram, total_ram);
printk(KERN_DEBUG "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
- zholes_size[ZONE_HIGHMEM] = (top_of_ram - total_ram) >> PAGE_SHIFT;
+ max_zone_pfns[0] = total_lowmem >> PAGE_SHIFT;
+ max_zone_pfns[1] = top_of_ram >> PAGE_SHIFT;
#else
- zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
- zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-#endif /* CONFIG_HIGHMEM */
-
- free_area_init_node(0, NODE_DATA(0), zones_size,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size);
+ max_zone_pfns[0] = top_of_ram >> PAGE_SHIFT;
+#endif
+ free_area_init_nodes(max_zone_pfns);
}
#endif /* ! CONFIG_NEED_MULTIPLE_NODES */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c
--- linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c 2006-08-21 10:13:25.000000000 +0100
@@ -39,96 +39,6 @@ static bootmem_data_t __initdata plat_no
static int min_common_depth;
static int n_mem_addr_cells, n_mem_size_cells;
-/*
- * We need somewhere to store start/end/node for each region until we have
- * allocated the real node_data structures.
- */
-#define MAX_REGIONS (MAX_LMB_REGIONS*2)
-static struct {
- unsigned long start_pfn;
- unsigned long end_pfn;
- int nid;
-} init_node_data[MAX_REGIONS] __initdata;
-
-int __init early_pfn_to_nid(unsigned long pfn)
-{
- unsigned int i;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start_pfn = init_node_data[i].start_pfn;
- unsigned long end_pfn = init_node_data[i].end_pfn;
-
- if ((start_pfn <= pfn) && (pfn < end_pfn))
- return init_node_data[i].nid;
- }
-
- return -1;
-}
-
-void __init add_region(unsigned int nid, unsigned long start_pfn,
- unsigned long pages)
-{
- unsigned int i;
-
- dbg("add_region nid %d start_pfn 0x%lx pages 0x%lx\n",
- nid, start_pfn, pages);
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
- if (init_node_data[i].end_pfn == start_pfn) {
- init_node_data[i].end_pfn += pages;
- return;
- }
- if (init_node_data[i].start_pfn == (start_pfn + pages)) {
- init_node_data[i].start_pfn -= pages;
- return;
- }
- }
-
- /*
- * Leave last entry NULL so we dont iterate off the end (we use
- * entry.end_pfn to terminate the walk).
- */
- if (i >= (MAX_REGIONS - 1)) {
- printk(KERN_ERR "WARNING: too many memory regions in "
- "numa code, truncating\n");
- return;
- }
-
- init_node_data[i].start_pfn = start_pfn;
- init_node_data[i].end_pfn = start_pfn + pages;
- init_node_data[i].nid = nid;
-}
-
-/* We assume init_node_data has no overlapping regions */
-void __init get_region(unsigned int nid, unsigned long *start_pfn,
- unsigned long *end_pfn, unsigned long *pages_present)
-{
- unsigned int i;
-
- *start_pfn = -1UL;
- *end_pfn = *pages_present = 0;
-
- for (i = 0; init_node_data[i].end_pfn; i++) {
- if (init_node_data[i].nid != nid)
- continue;
-
- *pages_present += init_node_data[i].end_pfn -
- init_node_data[i].start_pfn;
-
- if (init_node_data[i].start_pfn < *start_pfn)
- *start_pfn = init_node_data[i].start_pfn;
-
- if (init_node_data[i].end_pfn > *end_pfn)
- *end_pfn = init_node_data[i].end_pfn;
- }
-
- /* We didnt find a matching region, return start/end as 0 */
- if (*start_pfn == -1UL)
- *start_pfn = 0;
-}
-
static void __cpuinit map_cpu_to_node(int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
@@ -468,8 +378,8 @@ new_range:
continue;
}
- add_region(nid, start >> PAGE_SHIFT,
- size >> PAGE_SHIFT);
+ add_active_range(nid, start >> PAGE_SHIFT,
+ (start >> PAGE_SHIFT) + (size >> PAGE_SHIFT));
if (--ranges)
goto new_range;
@@ -482,6 +392,7 @@ static void __init setup_nonnuma(void)
{
unsigned long top_of_ram = lmb_end_of_DRAM();
unsigned long total_ram = lmb_phys_mem_size();
+ unsigned long start_pfn, end_pfn;
unsigned int i;
printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
@@ -489,9 +400,11 @@ static void __init setup_nonnuma(void)
printk(KERN_DEBUG "Memory hole size: %ldMB\n",
(top_of_ram - total_ram) >> 20);
- for (i = 0; i < lmb.memory.cnt; ++i)
- add_region(0, lmb.memory.region[i].base >> PAGE_SHIFT,
- lmb_size_pages(&lmb.memory, i));
+ for (i = 0; i < lmb.memory.cnt; ++i) {
+ start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+ end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+ add_active_range(0, start_pfn, end_pfn);
+ }
node_set_online(0);
}
@@ -630,11 +543,11 @@ void __init do_init_bootmem(void)
(void *)(unsigned long)boot_cpuid);
for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ unsigned long start_pfn, end_pfn;
unsigned long bootmem_paddr;
unsigned long bootmap_pages;
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
/* Allocate the node structure node local if possible */
NODE_DATA(nid) = careful_allocation(nid,
@@ -667,19 +580,7 @@ void __init do_init_bootmem(void)
init_bootmem_node(NODE_DATA(nid), bootmem_paddr >> PAGE_SHIFT,
start_pfn, end_pfn);
- /* Add free regions on this node */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn << PAGE_SHIFT;
- end = init_node_data[i].end_pfn << PAGE_SHIFT;
-
- dbg("free_bootmem %lx %lx\n", start, end - start);
- free_bootmem_node(NODE_DATA(nid), start, end - start);
- }
+ free_bootmem_with_active_regions(nid, end_pfn);
/* Mark reserved regions on this node */
for (i = 0; i < lmb.reserved.cnt; i++) {
@@ -710,44 +611,16 @@ void __init do_init_bootmem(void)
}
}
- /* Add regions into sparsemem */
- for (i = 0; init_node_data[i].end_pfn; i++) {
- unsigned long start, end;
-
- if (init_node_data[i].nid != nid)
- continue;
-
- start = init_node_data[i].start_pfn;
- end = init_node_data[i].end_pfn;
-
- memory_present(nid, start, end);
- }
+ sparse_memory_present_with_active_regions(nid);
}
}
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
- int nid;
-
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
- for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
-
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
-
- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
-
- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
- zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
- zholes_size);
- }
+ unsigned long max_zone_pfns[MAX_NR_ZONES] = {
+ lmb_end_of_DRAM() >> PAGE_SHIFT
+ };
+ free_area_init_nodes(max_zone_pfns);
}
static int __init early_numa(char *p)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/ppc/Kconfig linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/ppc/Kconfig
--- linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/ppc/Kconfig 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/ppc/Kconfig 2006-08-21 10:13:25.000000000 +0100
@@ -953,6 +953,9 @@ config NR_CPUS
config HIGHMEM
bool "High memory support"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
source kernel/Kconfig.hz
source kernel/Kconfig.preempt
source "mm/Kconfig"
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/ppc/mm/init.c linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/ppc/mm/init.c
--- linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/arch/ppc/mm/init.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/ppc/mm/init.c 2006-08-21 10:13:25.000000000 +0100
@@ -358,8 +358,8 @@ void __init do_init_bootmem(void)
*/
void __init paging_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES], i;
-
+ unsigned long start_pfn, end_pfn;
+ unsigned long max_zone_pfns[MAX_NR_ZONES];
#ifdef CONFIG_HIGHMEM
map_page(PKMAP_BASE, 0, 0); /* XXX gross */
pkmap_page_table = pte_offset_kernel(pmd_offset(pgd_offset_k
@@ -369,19 +369,18 @@ void __init paging_init(void)
(KMAP_FIX_BEGIN), KMAP_FIX_BEGIN), KMAP_FIX_BEGIN);
kmap_prot = PAGE_KERNEL;
#endif /* CONFIG_HIGHMEM */
-
- /*
- * All pages are DMA-able so we put them all in the DMA zone.
- */
- zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
- for (i = 1; i < MAX_NR_ZONES; i++)
- zones_size[i] = 0;
+ /* All pages are DMA-able so we put them all in the DMA zone. */
+ start_pfn = __pa(PAGE_OFFSET) >> PAGE_SHIFT;
+ end_pfn = start_pfn + (total_memory >> PAGE_SHIFT);
+ add_active_range(0, start_pfn, end_pfn);
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
+ max_zone_pfns[0] = total_lowmem >> PAGE_SHIFT;
+ max_zone_pfns[1] = total_memory >> PAGE_SHIFT;
+#else
+ max_zone_pfns[0] = total_memory >> PAGE_SHIFT;
#endif /* CONFIG_HIGHMEM */
-
- free_area_init(zones_size);
+ free_area_init_nodes(max_zone_pfns);
}
void __init mem_init(void)
This patch defines the structure to represent an active range of page
frames within a node in an architecture independent manner. Architectures
are expected to register active ranges of PFNs using add_active_range(nid,
start_pfn, end_pfn) and call free_area_init_nodes() passing the PFNs of
the end of each zone.
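
One way to picture what is done with the array of zone end PFNs is the
user-space sketch below. It assumes that each zone starts where the previous
one ends; the code that derives the zone start PFNs is not shown in this
description, so that part is a guess. It then clamps a zone to a node's PFN
range along the same lines as zone_spanned_pages_in_node() in this patch.
All names prefixed with sketch_ are hypothetical.

```c
#include <assert.h>

#define SKETCH_NR_ZONES 3	/* assumed number of zones for the example */

static unsigned long sketch_zone_lowest[SKETCH_NR_ZONES];
static unsigned long sketch_zone_highest[SKETCH_NR_ZONES];

/* Derive a [lowest, highest) PFN range per zone from each zone's end PFN */
static void sketch_set_zone_limits(unsigned long min_pfn,
				   const unsigned long *max_zone_pfn)
{
	int i;

	sketch_zone_lowest[0] = min_pfn;
	sketch_zone_highest[0] = max_zone_pfn[0];

	for (i = 1; i < SKETCH_NR_ZONES; i++) {
		sketch_zone_lowest[i] = sketch_zone_highest[i - 1];
		/* An empty zone ends where it starts */
		if (max_zone_pfn[i] > sketch_zone_lowest[i])
			sketch_zone_highest[i] = max_zone_pfn[i];
		else
			sketch_zone_highest[i] = sketch_zone_lowest[i];
	}
}

/* Pages of a zone that fall within a node spanning [node_start, node_end) */
static unsigned long sketch_zone_spanned(int zone, unsigned long node_start,
					 unsigned long node_end)
{
	unsigned long zone_start = sketch_zone_lowest[zone];
	unsigned long zone_end = sketch_zone_highest[zone];

	/* The node has no pages within the zone's range */
	if (zone_end < node_start || zone_start > node_end)
		return 0;

	/* Move the zone boundaries inside the node if necessary */
	if (zone_end > node_end)
		zone_end = node_end;
	if (zone_start < node_start)
		zone_start = node_start;

	return zone_end - zone_start;
}
```

With zone end PFNs of { 4096, 262144, 262144 } and two nodes spanning
[0, 131072) and [131072, 262144), the first zone lies entirely on the first
node and the second zone is split between the two nodes.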
include/linux/mm.h | 47 +++
include/linux/mmzone.h | 10
mm/page_alloc.c | 552 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 584 insertions(+), 25 deletions(-)
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Bob Picco <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-clean/include/linux/mm.h linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/include/linux/mm.h
--- linux-2.6.18-rc4-mm2-clean/include/linux/mm.h 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/include/linux/mm.h 2006-08-21 10:12:12.000000000 +0100
@@ -974,6 +974,53 @@ extern void free_area_init(unsigned long
extern void free_area_init_node(int nid, pg_data_t *pgdat,
unsigned long * zones_size, unsigned long zone_start_pfn,
unsigned long *zholes_size);
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * With CONFIG_ARCH_POPULATES_NODE_MAP set, an architecture may initialise its
+ * zones, allocate the backing mem_map and account for memory holes in a more
+ * architecture independent manner. This is a substitute for creating the
+ * zones_size[] and zholes_size[] arrays and passing them to
+ * free_area_init_node()
+ *
+ * An architecture is expected to register ranges of page frames backed by
+ * physical memory with add_active_range() before calling
+ * free_area_init_nodes() passing in the PFN each zone ends at. In basic
+ * usage, an architecture is expected to do something like
+ *
+ * unsigned long max_zone_pfns[MAX_NR_ZONES] = {max_dma, max_normal_pfn,
+ * max_highmem_pfn};
+ * for_each_valid_physical_page_range()
+ * add_active_range(node_id, start_pfn, end_pfn)
+ * free_area_init_nodes(max_zone_pfns);
+ *
+ * If the architecture guarantees that there are no holes in the ranges
+ * registered with add_active_range(), free_bootmem_with_active_regions()
+ * will call free_bootmem_node() for each registered physical page range.
+ * Similarly sparse_memory_present_with_active_regions() calls
+ * memory_present() for each range when SPARSEMEM is enabled.
+ *
+ * See mm/page_alloc.c for more information on each function exposed by
+ * CONFIG_ARCH_POPULATES_NODE_MAP
+ */
+extern void free_area_init_nodes(unsigned long *max_zone_pfn);
+extern void add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
+extern void shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+ unsigned long new_end_pfn);
+extern void remove_all_active_ranges(void);
+extern unsigned long absent_pages_in_range(unsigned long start_pfn,
+ unsigned long end_pfn);
+extern void get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn);
+extern unsigned long find_min_pfn_with_active_regions(void);
+extern unsigned long find_max_pfn_with_active_regions(void);
+extern void free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn);
+extern void sparse_memory_present_with_active_regions(int nid);
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+extern int early_pfn_to_nid(unsigned long pfn);
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
extern void setup_per_zone_pages_min(void);
extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-clean/include/linux/mmzone.h linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/include/linux/mmzone.h
--- linux-2.6.18-rc4-mm2-clean/include/linux/mmzone.h 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/include/linux/mmzone.h 2006-08-21 10:12:12.000000000 +0100
@@ -307,6 +307,13 @@ struct zonelist {
struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
};
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+struct node_active_region {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ int nid;
+};
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
@@ -519,7 +526,8 @@ extern struct zone *next_zone(struct zon
#endif
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+ !defined(CONFIG_ARCH_POPULATES_NODE_MAP)
#define early_pfn_to_nid(nid) (0UL)
#endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-clean/mm/page_alloc.c linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/mm/page_alloc.c
--- linux-2.6.18-rc4-mm2-clean/mm/page_alloc.c 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-101-add_free_area_init_nodes/mm/page_alloc.c 2006-08-21 10:12:12.000000000 +0100
@@ -37,6 +37,8 @@
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
#include <linux/stop_machine.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -103,6 +105,33 @@ int min_free_kbytes = 1024;
unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ /*
+ * MAX_ACTIVE_REGIONS determines the maximum number of distinct
+ * ranges of memory (RAM) that may be registered with add_active_range().
+ * Ranges passed to add_active_range() will be merged if possible
+ * so the number of times add_active_range() can be called is
+ * related to the number of nodes and the number of holes
+ */
+ #ifdef CONFIG_MAX_ACTIVE_REGIONS
+ /* Allow an architecture to set MAX_ACTIVE_REGIONS to save memory */
+ #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+ #else
+ #if MAX_NUMNODES >= 32
+ /* If there can be many nodes, allow up to 50 holes per node */
+ #define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50)
+ #else
+ /* By default, allow up to 256 distinct regions */
+ #define MAX_ACTIVE_REGIONS 256
+ #endif
+ #endif
+
+ struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+ int __initdata nr_nodemap_entries;
+ unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+ unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -1731,25 +1760,6 @@ static inline unsigned long wait_table_b
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
- unsigned long *zones_size, unsigned long *zholes_size)
-{
- unsigned long realtotalpages, totalpages = 0;
- enum zone_type i;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- totalpages += zones_size[i];
- pgdat->node_spanned_pages = totalpages;
-
- realtotalpages = totalpages;
- if (zholes_size)
- for (i = 0; i < MAX_NR_ZONES; i++)
- realtotalpages -= zholes_size[i];
- pgdat->node_present_pages = realtotalpages;
- printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
-}
-
-
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
@@ -2067,6 +2077,272 @@ __meminit int init_currently_empty_zone(
return 0;
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * Basic iterator support. Return the first range of PFNs for a node
+ * Note: nid == MAX_NUMNODES returns first region regardless of node
+ */
+static int __init first_active_region_index_in_nid(int nid)
+{
+ int i;
+
+ for (i = 0; i < nr_nodemap_entries; i++)
+ if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
+ return i;
+
+ return -1;
+}
+
+/*
+ * Basic iterator support. Return the next active range of PFNs for a node
+ * Note: nid == MAX_NUMNODES returns next region regardless of node
+ */
+static int __init next_active_region_index_in_nid(int index, int nid)
+{
+ for (index = index + 1; index < nr_nodemap_entries; index++)
+ if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
+ return index;
+
+ return -1;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+/*
+ * Required by SPARSEMEM. Given a PFN, return what node the PFN is on.
+ * Architectures may implement their own version but if add_active_range()
+ * was used and there are no special requirements, this is a convenient
+ * alternative
+ */
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_nodemap_entries; i++) {
+ unsigned long start_pfn = early_node_map[i].start_pfn;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if (start_pfn <= pfn && pfn < end_pfn)
+ return early_node_map[i].nid;
+ }
+
+ return 0;
+}
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+
+/* Basic iterator support to walk early_node_map[] */
+#define for_each_active_range_index_in_nid(i, nid) \
+ for (i = first_active_region_index_in_nid(nid); i != -1; \
+ i = next_active_region_index_in_nid(i, nid))
+
+/**
+ * free_bootmem_with_active_regions - Call free_bootmem_node for each active range
+ * @nid: The node to free memory on. If MAX_NUMNODES, all nodes are freed
+ * @max_low_pfn: The highest PFN that will be passed to free_bootmem_node
+ *
+ * If an architecture guarantees that all ranges registered with
+ * add_active_range() contain no holes and may be freed, this
+ * function may be used instead of calling free_bootmem() manually.
+ */
+void __init free_bootmem_with_active_regions(int nid,
+ unsigned long max_low_pfn)
+{
+ int i;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long size_pages = 0;
+ unsigned long end_pfn = early_node_map[i].end_pfn;
+
+ if (early_node_map[i].start_pfn >= max_low_pfn)
+ continue;
+
+ if (end_pfn > max_low_pfn)
+ end_pfn = max_low_pfn;
+
+ size_pages = end_pfn - early_node_map[i].start_pfn;
+ free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+ PFN_PHYS(early_node_map[i].start_pfn),
+ size_pages << PAGE_SHIFT);
+ }
+}
+
+/**
+ * sparse_memory_present_with_active_regions - Call memory_present for each active range
+ * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used
+ *
+ * If an architecture guarantees that all ranges registered with
+ * add_active_range() contain no holes and may be freed, this
+ * function may be used instead of calling memory_present() manually.
+ */
+void __init sparse_memory_present_with_active_regions(int nid)
+{
+ int i;
+
+ for_each_active_range_index_in_nid(i, nid)
+ memory_present(early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+}
+
+/**
+ * get_pfn_range_for_nid - Return the start and end page frames for a node
+ * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned
+ * @start_pfn: Passed by reference. On return, it will have the node start_pfn
+ * @end_pfn: Passed by reference. On return, it will have the node end_pfn
+ *
+ * It returns the start and end page frame of a node based on information
+ * provided by an arch calling add_active_range(). If called for a node
+ * with no available memory, a warning is printed and the start and end
+ * PFNs will be 0
+ */
+void __init get_pfn_range_for_nid(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ int i;
+ *start_pfn = -1UL;
+ *end_pfn = 0;
+
+ for_each_active_range_index_in_nid(i, nid) {
+ *start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
+ *end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
+ }
+
+ if (*start_pfn == -1UL) {
+ printk(KERN_WARNING "Node %u active with no memory\n", nid);
+ *start_pfn = 0;
+ }
+}
+
+/*
+ * Return the number of pages a zone spans in a node, including holes
+ * present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
+ */
+unsigned long __init zone_spanned_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ /* Get the start and end of the node and zone */
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+ zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+ /* Check that this node has pages within the zone's required range */
+ if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+ return 0;
+
+ /* Move the zone boundaries inside the node if necessary */
+ zone_end_pfn = min(zone_end_pfn, node_end_pfn);
+ zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+
+ /* Return the spanned pages */
+ return zone_end_pfn - zone_start_pfn;
+}
+
+/*
+ * Return the number of page frames in holes in a range on a node. If nid is
+ * MAX_NUMNODES, then all holes in the requested range will be accounted for
+ */
+unsigned long __init __absent_pages_in_range(int nid,
+ unsigned long range_start_pfn,
+ unsigned long range_end_pfn)
+{
+ int i = 0;
+ unsigned long prev_end_pfn = 0, hole_pages = 0;
+ unsigned long start_pfn;
+
+ /* Find the end_pfn of the first active range of pfns in the node */
+ i = first_active_region_index_in_nid(nid);
+ if (i == -1)
+ return 0;
+
+ prev_end_pfn = early_node_map[i].start_pfn;
+
+ /* Find all holes for the zone within the node */
+ for (; i != -1; i = next_active_region_index_in_nid(i, nid)) {
+
+ /* No need to continue if prev_end_pfn is outside the zone */
+ if (prev_end_pfn >= range_end_pfn)
+ break;
+
+ /* Make sure the end of the zone is not within the hole */
+ start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
+ prev_end_pfn = max(prev_end_pfn, range_start_pfn);
+
+ /* Update the hole size count and move on */
+ if (start_pfn > range_start_pfn) {
+ BUG_ON(prev_end_pfn > start_pfn);
+ hole_pages += start_pfn - prev_end_pfn;
+ }
+ prev_end_pfn = early_node_map[i].end_pfn;
+ }
+
+ return hole_pages;
+}
+
+/**
+ * absent_pages_in_range - Return number of page frames in holes within a range
+ * @start_pfn: The start PFN to start searching for holes
+ * @end_pfn: The end PFN to stop searching for holes
+ *
+ * It returns the number of page frames in memory holes within a range
+ */
+unsigned long __init absent_pages_in_range(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
+}
+
+/* Return the number of page frames in holes in a zone on a node */
+unsigned long __init zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *ignored)
+{
+ return __absent_pages_in_range(nid,
+ arch_zone_lowest_possible_pfn[zone_type],
+ arch_zone_highest_possible_pfn[zone_type]);
+}
+#else
+static inline unsigned long zone_spanned_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+ unsigned long zone_type,
+ unsigned long *zholes_size)
+{
+ if (!zholes_size)
+ return 0;
+
+ return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+ unsigned long *zones_size, unsigned long *zholes_size)
+{
+ unsigned long realtotalpages, totalpages = 0;
+ enum zone_type i;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ totalpages += zone_spanned_pages_in_node(pgdat->node_id, i,
+ zones_size);
+ pgdat->node_spanned_pages = totalpages;
+
+ realtotalpages = totalpages;
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ realtotalpages -=
+ zone_absent_pages_in_node(pgdat->node_id, i,
+ zholes_size);
+ pgdat->node_present_pages = realtotalpages;
+ printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+ realtotalpages);
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2090,9 +2366,9 @@ static void __meminit free_area_init_cor
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize;
- realsize = size = zones_size[j];
- if (zholes_size)
- realsize -= zholes_size[j];
+ size = zone_spanned_pages_in_node(nid, j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
+ zholes_size);
if (!is_highmem_idx(j))
nr_kernel_pages += realsize;
@@ -2162,8 +2438,13 @@ static void __init alloc_node_mem_map(st
/*
* With no DISCONTIG, the global mem_map is just set as node 0's
*/
- if (pgdat == NODE_DATA(0))
+ if (pgdat == NODE_DATA(0)) {
mem_map = NODE_DATA(0)->node_mem_map;
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+ if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
+ mem_map -= pgdat->node_start_pfn;
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+ }
#endif
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}
@@ -2174,13 +2455,236 @@ void __meminit free_area_init_node(int n
{
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+ calculate_node_totalpages(pgdat, zones_size, zholes_size);
alloc_node_mem_map(pgdat);
free_area_init_core(pgdat, zones_size, zholes_size);
}
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/**
+ * add_active_range - Register a range of PFNs backed by physical memory
+ * @nid: The node ID the range resides on
+ * @start_pfn: The start PFN of the available physical memory
+ * @end_pfn: The end PFN of the available physical memory
+ *
+ * These ranges are stored in an early_node_map[] and later used by
+ * free_area_init_nodes() to calculate zone sizes and holes. If the
+ * range spans a memory hole, it is up to the architecture to ensure
+ * the memory is not freed by the bootmem allocator. If possible
+ * the range being registered will be merged with existing ranges.
+ */
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ int i;
+
+ printk(KERN_DEBUG "Entering add_active_range(%d, %lu, %lu) "
+ "%d entries of %d used\n",
+ nid, start_pfn, end_pfn,
+ nr_nodemap_entries, MAX_ACTIVE_REGIONS);
+
+ /* Merge with existing active regions if possible */
+ for (i = 0; i < nr_nodemap_entries; i++) {
+ if (early_node_map[i].nid != nid)
+ continue;
+
+ /* Skip if an existing region covers this new one */
+ if (start_pfn >= early_node_map[i].start_pfn &&
+ end_pfn <= early_node_map[i].end_pfn)
+ return;
+
+ /* Merge forward if suitable */
+ if (start_pfn <= early_node_map[i].end_pfn &&
+ end_pfn > early_node_map[i].end_pfn) {
+ early_node_map[i].end_pfn = end_pfn;
+ return;
+ }
+
+ /* Merge backward if suitable */
+ if (start_pfn < early_node_map[i].end_pfn &&
+ end_pfn >= early_node_map[i].start_pfn) {
+ early_node_map[i].start_pfn = start_pfn;
+ return;
+ }
+ }
+
+ /* Check that early_node_map is large enough */
+ if (i >= MAX_ACTIVE_REGIONS) {
+ printk(KERN_CRIT "More than %d memory regions, truncating\n",
+ MAX_ACTIVE_REGIONS);
+ return;
+ }
+
+ early_node_map[i].nid = nid;
+ early_node_map[i].start_pfn = start_pfn;
+ early_node_map[i].end_pfn = end_pfn;
+ nr_nodemap_entries = i + 1;
+}
+
+/**
+ * shrink_active_range - Shrink an existing registered range of PFNs
+ * @nid: The node id the range is on that should be shrunk
+ * @old_end_pfn: The old end PFN of the range
+ * @new_end_pfn: The new end PFN of the range
+ *
+ * i386 with NUMA uses alloc_remap() to store a node_mem_map on a local node.
+ * The map is kept at the end of the physical page range that has already been
+ * registered with add_active_range(). This function allows an arch to shrink
+ * an existing registered range.
+ */
+void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+ unsigned long new_end_pfn)
+{
+ int i;
+
+ /* Find the old active region end and shrink */
+ for_each_active_range_index_in_nid(i, nid)
+ if (early_node_map[i].end_pfn == old_end_pfn) {
+ early_node_map[i].end_pfn = new_end_pfn;
+ break;
+ }
+}
+
+/**
+ * remove_all_active_ranges - Remove all currently registered regions
+ * During discovery, it may be found that a table like SRAT is invalid
+ * and an alternative discovery method must be used. This function removes
+ * all currently registered regions.
+ */
+void __init remove_all_active_ranges(void)
+{
+ memset(early_node_map, 0, sizeof(early_node_map));
+ nr_nodemap_entries = 0;
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+ struct node_active_region *arange = (struct node_active_region *)a;
+ struct node_active_region *brange = (struct node_active_region *)b;
+
+ /* Done this way to avoid overflows */
+ if (arange->start_pfn > brange->start_pfn)
+ return 1;
+ if (arange->start_pfn < brange->start_pfn)
+ return -1;
+
+ return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+ sort(early_node_map, (size_t)nr_nodemap_entries,
+ sizeof(struct node_active_region),
+ cmp_node_active_region, NULL);
+}
+
+/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
+unsigned long __init find_min_pfn_for_node(unsigned long nid)
+{
+ int i;
+
+ /* Assuming a sorted map, the first range found has the starting pfn */
+ for_each_active_range_index_in_nid(i, nid)
+ return early_node_map[i].start_pfn;
+
+ printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+ return 0;
+}
+
+/**
+ * find_min_pfn_with_active_regions - Find the minimum PFN registered
+ *
+ * It returns the minimum PFN based on information provided via
+ * add_active_range()
+ */
+unsigned long __init find_min_pfn_with_active_regions(void)
+{
+ return find_min_pfn_for_node(MAX_NUMNODES);
+}
+
+/**
+ * find_max_pfn_with_active_regions - Find the maximum PFN registered
+ *
+ * It returns the maximum PFN based on information provided via
+ * add_active_range()
+ */
+unsigned long __init find_max_pfn_with_active_regions(void)
+{
+ int i;
+ unsigned long max_pfn = 0;
+
+ for (i = 0; i < nr_nodemap_entries; i++)
+ max_pfn = max(max_pfn, early_node_map[i].end_pfn);
+
+ return max_pfn;
+}
+
+/**
+ * free_area_init_nodes - Initialise all pg_data_t and zone data
+ * @max_zone_pfn: Array of MAX_NR_ZONES entries, indexed by zone, giving
+ * the maximum PFN usable for each zone. For example,
+ * max_zone_pfn[ZONE_DMA] is the maximum PFN usable for ZONE_DMA and
+ * max_zone_pfn[ZONE_NORMAL] is the maximum PFN usable for ZONE_NORMAL
+ *
+ * This will call free_area_init_node() for each active node in the system.
+ * Using the page ranges provided by add_active_range(), the size of each
+ * zone in each node and their holes is calculated. If the maximum PFNs
+ * of two adjacent zones match, it is assumed that the higher zone is empty.
+ * For example, if max_zone_pfn[ZONE_DMA] == max_zone_pfn[ZONE_DMA32], it is
+ * assumed that ZONE_DMA32 has no pages. It is also assumed that a zone
+ * starts where the previous one ended. For example, ZONE_DMA32 starts
+ * at max_zone_pfn[ZONE_DMA].
+ */
+void __init free_area_init_nodes(unsigned long *max_zone_pfn)
+{
+ unsigned long nid;
+ enum zone_type i;
+
+ /* Record where the zone boundaries are */
+ memset(arch_zone_lowest_possible_pfn, 0,
+ sizeof(arch_zone_lowest_possible_pfn));
+ memset(arch_zone_highest_possible_pfn, 0,
+ sizeof(arch_zone_highest_possible_pfn));
+ arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
+ arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
+ for (i = 1; i < MAX_NR_ZONES; i++) {
+ arch_zone_lowest_possible_pfn[i] =
+ arch_zone_highest_possible_pfn[i-1];
+ arch_zone_highest_possible_pfn[i] =
+ max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
+ }
+
+ /* Regions in the early_node_map can be in any order */
+ sort_node_map();
+
+ /* Print out the zone ranges */
+ printk("Zone PFN ranges:\n");
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ printk(" %-8s %8lu -> %8lu\n",
+ zone_names[i],
+ arch_zone_lowest_possible_pfn[i],
+ arch_zone_highest_possible_pfn[i]);
+
+ /* Print out the early_node_map[] */
+ printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries);
+ for (i = 0; i < nr_nodemap_entries; i++)
+ printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid,
+ early_node_map[i].start_pfn,
+ early_node_map[i].end_pfn);
+
+ /* Initialise every node */
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ free_area_init_node(nid, pgdat, NULL,
+ find_min_pfn_for_node(nid), NULL);
+ }
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
Size zones and holes in an architecture-independent manner for x86.
Kconfig | 8 +---
kernel/setup.c | 24 ++++--------
kernel/srat.c | 97 +---------------------------------------------------
mm/discontig.c | 69 +++++++++---------------------------
4 files changed, 31 insertions(+), 167 deletions(-)
Signed-off-by: Mel Gorman <[email protected]>
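For reviewers who want to check the zone arithmetic by hand, the following
is a minimal userspace sketch of what free_area_init_nodes() and
zone_spanned_pages_in_node() do with the max_zone_pfns[] array that
zone_sizes_init() now passes in. It is not kernel code. NR_ZONES,
set_zone_bounds(), spanned_pages(), zone_lo[] and zone_hi[] are names made
up for this illustration, and the node is modelled as a single
[node_start, node_end) PFN range.

```c
#include <assert.h>

#define NR_ZONES 3	/* hypothetical: DMA, NORMAL, HIGHMEM */

static unsigned long zone_lo[NR_ZONES], zone_hi[NR_ZONES];

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Model of the boundary setup in free_area_init_nodes(): each zone starts
 * where the previous one ended, and a limit below that start collapses the
 * zone to zero pages.
 */
static void set_zone_bounds(const unsigned long *max_zone_pfn,
			    unsigned long min_pfn)
{
	int i;

	zone_lo[0] = min_pfn;
	zone_hi[0] = max_zone_pfn[0];
	for (i = 1; i < NR_ZONES; i++) {
		zone_lo[i] = zone_hi[i - 1];
		zone_hi[i] = max_ul(max_zone_pfn[i], zone_lo[i]);
	}
}

/* Model of zone_spanned_pages_in_node() for one node's PFN range */
static unsigned long spanned_pages(int zone, unsigned long node_start,
				   unsigned long node_end)
{
	unsigned long lo = zone_lo[zone], hi = zone_hi[zone];

	assert(node_start <= node_end);

	/* The node has no pages within the zone's range */
	if (hi < node_start || lo > node_end)
		return 0;

	/* Move the zone boundaries inside the node */
	hi = min_ul(hi, node_end);
	lo = max_ul(lo, node_start);
	return hi - lo;
}
```

With 4K pages, a 16MB DMA limit, 896MB of lowmem and 2GB of RAM, the array
would be { 4096, 229376, 524288 }, and a single node covering PFNs 0 to
524288 spans 4096, 225280 and 294912 pages in the three zones. A node that
starts at PFN 262144 spans nothing in the first zone and 262144 pages in
the last.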
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/Kconfig linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/Kconfig
--- linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/Kconfig 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/Kconfig 2006-08-21 10:14:43.000000000 +0100
@@ -602,12 +602,10 @@ config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on ARCH_SPARSEMEM_ENABLE
-source "mm/Kconfig"
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
-config HAVE_ARCH_EARLY_PFN_TO_NID
- bool
- default y
- depends on NUMA
+source "mm/Kconfig"
config HIGHPTE
bool "Allocate 3rd-level pagetables from highmem"
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/kernel/setup.c
--- linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/kernel/setup.c 2006-08-21 10:14:43.000000000 +0100
@@ -1107,22 +1107,16 @@ static unsigned long __init setup_memory
void __init zone_sizes_init(void)
{
- unsigned long zones_size[MAX_NR_ZONES] = { 0, };
- unsigned int max_dma, low;
-
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
- low = max_low_pfn;
-
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = highend_pfn - low;
+ unsigned long max_zone_pfns[MAX_NR_ZONES] = {
+ virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT,
+ max_low_pfn,
+ highend_pfn};
+#ifndef CONFIG_HIGHMEM
+ unsigned long highend_pfn = max_low_pfn;
#endif
- }
- free_area_init(zones_size);
+
+ add_active_range(0, 0, highend_pfn);
+ free_area_init_nodes(max_zone_pfns);
}
#else
extern unsigned long __init setup_memory(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/kernel/srat.c
--- linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/kernel/srat.c 2006-08-21 10:14:43.000000000 +0100
@@ -55,8 +55,6 @@ struct node_memory_chunk_s {
static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
static int num_memory_chunks; /* total number of memory chunks */
-static int zholes_size_init;
-static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
static u8 __initdata apicid_to_pxm[MAX_APICID];
extern void * boot_ioremap(unsigned long, unsigned long);
@@ -139,47 +137,6 @@ static void __init parse_memory_affinity
"enabled and removable" : "enabled" ) );
}
-/* Take a chunk of pages from page frame cstart to cend and count the number
- * of pages in each zone, returned via zones[].
- */
-static __init void chunk_to_zones(unsigned long cstart, unsigned long cend,
- unsigned long *zones)
-{
- unsigned long max_dma;
- extern unsigned long max_low_pfn;
-
- int z;
- unsigned long rend;
-
- /* FIXME: MAX_DMA_ADDRESS and max_low_pfn are trying to provide
- * similarly scoped information and should be handled in a consistant
- * manner.
- */
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- /* Split the hole into the zones in which it falls. Repeatedly
- * take the segment in which the remaining hole starts, round it
- * to the end of that zone.
- */
- memset(zones, 0, MAX_NR_ZONES * sizeof(long));
- while (cstart < cend) {
- if (cstart < max_dma) {
- z = ZONE_DMA;
- rend = (cend < max_dma)? cend : max_dma;
-
- } else if (cstart < max_low_pfn) {
- z = ZONE_NORMAL;
- rend = (cend < max_low_pfn)? cend : max_low_pfn;
-
- } else {
- z = ZONE_HIGHMEM;
- rend = cend;
- }
- zones[z] += rend - cstart;
- cstart = rend;
- }
-}
-
/*
* The SRAT table always lists ascending addresses, so can always
* assume that the first "start" address that you see is the real
@@ -224,7 +181,6 @@ static int __init acpi20_parse_srat(stru
memset(pxm_bitmap, 0, sizeof(pxm_bitmap)); /* init proximity domain bitmap */
memset(node_memory_chunk, 0, sizeof(node_memory_chunk));
- memset(zholes_size, 0, sizeof(zholes_size));
num_memory_chunks = 0;
while (p < end) {
@@ -291,6 +247,7 @@ static int __init acpi20_parse_srat(stru
printk("chunk %d nid %d start_pfn %08lx end_pfn %08lx\n",
j, chunk->nid, chunk->start_pfn, chunk->end_pfn);
node_read_chunk(chunk->nid, chunk);
+ add_active_range(chunk->nid, chunk->start_pfn, chunk->end_pfn);
}
for_each_online_node(nid) {
@@ -399,57 +356,7 @@ int __init get_memcfg_from_srat(void)
return acpi20_parse_srat((struct acpi_table_srat *)header);
}
out_err:
+ remove_all_active_ranges();
printk("failed to get NUMA memory information from SRAT table\n");
return 0;
}
-
-/* For each node run the memory list to determine whether there are
- * any memory holes. For each hole determine which ZONE they fall
- * into.
- *
- * NOTE#1: this requires knowledge of the zone boundries and so
- * _cannot_ be performed before those are calculated in setup_memory.
- *
- * NOTE#2: we rely on the fact that the memory chunks are ordered by
- * start pfn number during setup.
- */
-static void __init get_zholes_init(void)
-{
- int nid;
- int c;
- int first;
- unsigned long end = 0;
-
- for_each_online_node(nid) {
- first = 1;
- for (c = 0; c < num_memory_chunks; c++){
- if (node_memory_chunk[c].nid == nid) {
- if (first) {
- end = node_memory_chunk[c].end_pfn;
- first = 0;
-
- } else {
- /* Record any gap between this chunk
- * and the previous chunk on this node
- * against the zones it spans.
- */
- chunk_to_zones(end,
- node_memory_chunk[c].start_pfn,
- &zholes_size[nid * MAX_NR_ZONES]);
- }
- }
- }
- }
-}
-
-unsigned long * __init get_zholes_size(int nid)
-{
- if (!zholes_size_init) {
- zholes_size_init++;
- get_zholes_init();
- }
- if (nid >= MAX_NUMNODES || !node_online(nid))
- printk("%s: nid = %d is invalid/offline. num_online_nodes = %d",
- __FUNCTION__, nid, num_online_nodes());
- return &zholes_size[nid * MAX_NR_ZONES];
-}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/mm/discontig.c
--- linux-2.6.18-rc4-mm2-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/i386/mm/discontig.c 2006-08-21 10:14:43.000000000 +0100
@@ -157,21 +157,6 @@ static void __init find_max_pfn_node(int
BUG();
}
-/* Find the owning node for a pfn. */
-int early_pfn_to_nid(unsigned long pfn)
-{
- int nid;
-
- for_each_node(nid) {
- if (node_end_pfn[nid] == 0)
- break;
- if (node_start_pfn[nid] <= pfn && node_end_pfn[nid] >= pfn)
- return nid;
- }
-
- return 0;
-}
-
/*
* Allocate memory for the pg_data_t for this node via a crude pre-bootmem
* method. For node zero take this from the bottom of memory, for
@@ -227,6 +212,8 @@ static unsigned long calculate_numa_rema
unsigned long pfn;
for_each_online_node(nid) {
+ unsigned long old_end_pfn = node_end_pfn[nid];
+
/*
* The acpi/srat node info can show hot-add memroy zones
* where memory could be added but not currently present.
@@ -276,6 +263,7 @@ static unsigned long calculate_numa_rema
node_end_pfn[nid] -= size;
node_remap_start_pfn[nid] = node_end_pfn[nid];
+ shrink_active_range(nid, old_end_pfn, node_end_pfn[nid]);
}
printk("Reserving total of %ld pages for numa KVA remap\n",
reserve_pages);
@@ -369,45 +357,22 @@ void __init numa_kva_reserve(void)
void __init zone_sizes_init(void)
{
int nid;
-
-
- for_each_online_node(nid) {
- unsigned long zones_size[MAX_NR_ZONES] = {0, };
- unsigned long *zholes_size;
- unsigned int max_dma;
-
- unsigned long low = max_low_pfn;
- unsigned long start = node_start_pfn[nid];
- unsigned long high = node_end_pfn[nid];
-
- max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
- if (node_has_online_mem(nid)){
- if (start > low) {
-#ifdef CONFIG_HIGHMEM
- BUG_ON(start > high);
- zones_size[ZONE_HIGHMEM] = high - start;
-#endif
- } else {
- if (low < max_dma)
- zones_size[ZONE_DMA] = low;
- else {
- BUG_ON(max_dma > low);
- BUG_ON(low > high);
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = high - low;
-#endif
- }
- }
+ unsigned long max_zone_pfns[MAX_NR_ZONES] = {
+ virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT,
+ max_low_pfn,
+ highend_pfn
+ };
+
+ /* If SRAT has not registered memory, register it now */
+ if (find_max_pfn_with_active_regions() == 0) {
+ for_each_online_node(nid) {
+ if (node_has_online_mem(nid))
+ add_active_range(nid, node_start_pfn[nid],
+ node_end_pfn[nid]);
}
-
- zholes_size = get_zholes_size(nid);
-
- free_area_init_node(nid, NODE_DATA(nid), zones_size, start,
- zholes_size);
}
+
+ free_area_init_nodes(max_zone_pfns);
return;
}
Size zones and holes in an architecture-independent manner for x86_64.
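On x86_64 the active ranges come from the e820 map, clipped to the node
being registered. The following userspace sketch shows only the rounding
and clipping step that e820_register_active_regions() performs before it
calls add_active_range(). It is an illustration under assumptions: struct
pfn_range and clip_to_node() are hypothetical names, 4K pages are assumed,
the E820_RAM type check and the end_user_pfn limit are left out, and
entries that round to zero pages are skipped as well as inverted ones.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1ULL << PAGE_SHIFT)

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Convert the byte range [addr, addr + size) to whole page frames and clip
 * it to the node's PFN range [node_start, node_end). Returns 1 and fills
 * *out if any page frames remain, otherwise returns 0.
 */
static int clip_to_node(uint64_t addr, uint64_t size,
			unsigned long node_start, unsigned long node_end,
			struct pfn_range *out)
{
	/* Partially used pages are not usable: round start up, end down */
	unsigned long s = (unsigned long)((addr + PAGE_SIZE - 1) >> PAGE_SHIFT);
	unsigned long e = (unsigned long)((addr + size) >> PAGE_SHIFT);

	assert(node_start <= node_end);

	/* Entry covers less than one whole page */
	if (s >= e)
		return 0;

	/* Entry lies entirely outside the node */
	if (e <= node_start || s >= node_end)
		return 0;

	/* Trim the parts that overlap the node boundaries */
	if (s < node_start)
		s = node_start;
	if (e > node_end)
		e = node_end;

	out->start_pfn = s;
	out->end_pfn = e;
	return 1;
}
```

For example, an entry at 0x100000 of size 0x3ff00000 registered against a
node covering PFNs 0x80 to 0x20000 yields the range 0x100 -> 0x20000, while
a 4K entry at 0x1800 contains no whole page and is skipped.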
arch/x86_64/Kconfig | 3
arch/x86_64/kernel/e820.c | 125 ++++++++++++++-------------------------
arch/x86_64/kernel/setup.c | 7 +-
arch/x86_64/mm/init.c | 65 +-------------------
arch/x86_64/mm/k8topology.c | 3
arch/x86_64/mm/numa.c | 21 +++---
arch/x86_64/mm/srat.c | 11 ++-
include/asm-x86_64/e820.h | 5 -
include/asm-x86_64/proto.h | 2
9 files changed, 84 insertions(+), 158 deletions(-)
Signed-off-by: Mel Gorman <[email protected]>
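This patch replaces e820_hole_size() with absent_pages_in_range(), so it
may help to see how holes are counted from a sorted list of active ranges.
The sketch below models the loop in __absent_pages_in_range() in userspace.
It is an approximation for illustration only: struct pfn_range and
holes_in_range() are hypothetical names, the map is a plain sorted array
for one node, and assert() stands in for BUG_ON().

```c
#include <assert.h>

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Count the page frames in holes between the active ranges in map[]
 * (sorted by start_pfn, non-overlapping) that fall inside
 * [range_start, range_end).
 */
static unsigned long holes_in_range(const struct pfn_range *map, int n,
				    unsigned long range_start,
				    unsigned long range_end)
{
	unsigned long prev_end, start, holes = 0;
	int i;

	if (n == 0)
		return 0;

	prev_end = map[0].start_pfn;
	for (i = 0; i < n; i++) {
		/* Nothing left to count once we are past the range */
		if (prev_end >= range_end)
			break;

		/* Clamp the candidate hole to the requested range */
		start = min_ul(map[i].start_pfn, range_end);
		prev_end = max_ul(prev_end, range_start);

		if (start > range_start) {
			assert(prev_end <= start);
			holes += start - prev_end;
		}
		prev_end = map[i].end_pfn;
	}

	return holes;
}
```

As in the patch, only gaps between registered ranges are counted. With
active ranges 0 -> 0x9f, 0x100 -> 0x20000 and 0x30000 -> 0x40000, the
range 0 -> 0x40000 contains 0x61 + 0x10000 = 65633 absent pages, and the
range 0x100 -> 0x20000 contains none.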
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/Kconfig 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/Kconfig 2006-08-21 10:15:58.000000000 +0100
@@ -85,6 +85,9 @@ config ARCH_MAY_HAVE_PC_FDC
bool
default y
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
config DMI
bool
default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c 2006-08-21 10:15:58.000000000 +0100
@@ -162,59 +162,14 @@ unsigned long __init find_e820_area(unsi
return -1UL;
}
-/*
- * Free bootmem based on the e820 table for a node.
- */
-void __init e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end)
-{
- int i;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
- unsigned long last, addr;
-
- if (ei->type != E820_RAM ||
- ei->addr+ei->size <= start ||
- ei->addr >= end)
- continue;
-
- addr = round_up(ei->addr, PAGE_SIZE);
- if (addr < start)
- addr = start;
-
- last = round_down(ei->addr + ei->size, PAGE_SIZE);
- if (last >= end)
- last = end;
-
- if (last > addr && last-addr >= PAGE_SIZE)
- free_bootmem_node(pgdat, addr, last-addr);
- }
-}
-
/*
* Find the highest page frame number we have available
*/
unsigned long __init e820_end_of_ram(void)
{
- int i;
unsigned long end_pfn = 0;
+ end_pfn = find_max_pfn_with_active_regions();
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
- unsigned long start, end;
-
- start = round_up(ei->addr, PAGE_SIZE);
- end = round_down(ei->addr + ei->size, PAGE_SIZE);
- if (start >= end)
- continue;
- if (ei->type == E820_RAM) {
- if (end > end_pfn<<PAGE_SHIFT)
- end_pfn = end>>PAGE_SHIFT;
- } else {
- if (end > end_pfn_map<<PAGE_SHIFT)
- end_pfn_map = end>>PAGE_SHIFT;
- }
- }
-
if (end_pfn > end_pfn_map)
end_pfn_map = end_pfn;
if (end_pfn_map > MAXMEM>>PAGE_SHIFT)
@@ -224,43 +179,10 @@ unsigned long __init e820_end_of_ram(voi
if (end_pfn > end_pfn_map)
end_pfn = end_pfn_map;
+ printk("end_pfn_map = %lu\n", end_pfn_map);
return end_pfn;
}
-/*
- * Compute how much memory is missing in a range.
- * Unlike the other functions in this file the arguments are in page numbers.
- */
-unsigned long __init
-e820_hole_size(unsigned long start_pfn, unsigned long end_pfn)
-{
- unsigned long ram = 0;
- unsigned long start = start_pfn << PAGE_SHIFT;
- unsigned long end = end_pfn << PAGE_SHIFT;
- int i;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
- unsigned long last, addr;
-
- if (ei->type != E820_RAM ||
- ei->addr+ei->size <= start ||
- ei->addr >= end)
- continue;
-
- addr = round_up(ei->addr, PAGE_SIZE);
- if (addr < start)
- addr = start;
-
- last = round_down(ei->addr + ei->size, PAGE_SIZE);
- if (last >= end)
- last = end;
-
- if (last > addr)
- ram += last - addr;
- }
- return ((end - start) - ram) >> PAGE_SHIFT;
-}
-
/*
* Mark e820 reserved areas as busy for the resource manager.
*/
@@ -342,6 +264,49 @@ void __init e820_mark_nosave_regions(voi
}
}
+/* Walk the e820 map and register active regions within a node */
+void __init
+e820_register_active_regions(int nid, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ int i;
+ unsigned long ei_startpfn, ei_endpfn;
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+ ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE)
+ >> PAGE_SHIFT;
+
+ /* Skip map entries smaller than a page */
+ if (ei_startpfn > ei_endpfn)
+ continue;
+
+ /* Check if end_pfn_map should be updated */
+ if (ei->type != E820_RAM && ei_endpfn > end_pfn_map)
+ end_pfn_map = ei_endpfn;
+
+ /* Skip if map is outside the node */
+ if (ei->type != E820_RAM ||
+ ei_endpfn <= start_pfn ||
+ ei_startpfn >= end_pfn)
+ continue;
+
+ /* Check for overlaps */
+ if (ei_startpfn < start_pfn)
+ ei_startpfn = start_pfn;
+ if (ei_endpfn > end_pfn)
+ ei_endpfn = end_pfn;
+
+ /* Obey end_user_pfn to save on memmap */
+ if (ei_startpfn >= end_user_pfn)
+ continue;
+ if (ei_endpfn > end_user_pfn)
+ ei_endpfn = end_user_pfn;
+
+ add_active_range(nid, ei_startpfn, ei_endpfn);
+ }
+}
+
/*
* Add a memory region to the kernel e820 map.
*/
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c 2006-08-21 10:15:58.000000000 +0100
@@ -292,7 +292,8 @@ contig_initmem_init(unsigned long start_
if (bootmap == -1L)
panic("Cannot find bootmem map of size %ld\n",bootmap_size);
bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
- e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT);
+ e820_register_active_regions(0, start_pfn, end_pfn);
+ free_bootmem_with_active_regions(0, end_pfn);
reserve_bootmem(bootmap, bootmap_size);
}
#endif
@@ -386,6 +387,7 @@ void __init setup_arch(char **cmdline_p)
finish_e820_parsing();
+ e820_register_active_regions(0, 0, -1UL);
/*
* partially used pages are not usable - thus
* we are rounding upwards:
@@ -416,6 +418,9 @@ void __init setup_arch(char **cmdline_p)
max_pfn = end_pfn;
high_memory = (void *)__va(end_pfn * PAGE_SIZE - 1) + 1;
+ /* Remove active ranges so rediscovery with NUMA-awareness happens */
+ remove_all_active_ranges();
+
#ifdef CONFIG_ACPI_NUMA
/*
* Parse SRAT to discover nodes.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/init.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c 2006-08-21 10:15:58.000000000 +0100
@@ -404,69 +404,15 @@ void __cpuinit zap_low_mappings(int cpu)
__flush_tlb_all();
}
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn)
-{
- int i;
- unsigned long w;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- z[i] = 0;
-
- if (start_pfn < MAX_DMA_PFN)
- z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- if (start_pfn < MAX_DMA32_PFN) {
- unsigned long dma32_pfn = MAX_DMA32_PFN;
- if (dma32_pfn > end_pfn)
- dma32_pfn = end_pfn;
- z[ZONE_DMA32] = dma32_pfn - start_pfn;
- }
- z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- /* Remove lower zones from higher ones. */
- w = 0;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- if (z[i])
- z[i] -= w;
- w += z[i];
- }
-
- /* Compute holes */
- w = start_pfn;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- unsigned long s = w;
- w += z[i];
- h[i] = e820_hole_size(s, w);
- }
-
- /* Add the space pace needed for mem_map to the holes too. */
- for (i = 0; i < MAX_NR_ZONES; i++)
- h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
- /* The 16MB DMA zone has the kernel and other misc mappings.
- Account them too */
- if (h[ZONE_DMA]) {
- h[ZONE_DMA] += dma_reserve;
- if (h[ZONE_DMA] >= z[ZONE_DMA]) {
- printk(KERN_WARNING
- "Kernel too large and filling up ZONE_DMA?\n");
- h[ZONE_DMA] = z[ZONE_DMA];
- }
- }
-}
-
#ifndef CONFIG_NUMA
void __init paging_init(void)
{
- unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
+ unsigned long max_zone_pfns[MAX_NR_ZONES] = {MAX_DMA_PFN,
+ MAX_DMA32_PFN,
+ end_pfn};
memory_present(0, 0, end_pfn);
sparse_init();
- size_zones(zones, holes, 0, end_pfn);
- free_area_init_node(0, NODE_DATA(0), zones,
- __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+ free_area_init_nodes(max_zone_pfns);
}
#endif
@@ -613,7 +559,8 @@ void __init mem_init(void)
#else
totalram_pages = free_all_bootmem();
#endif
- reservedpages = end_pfn - totalram_pages - e820_hole_size(0, end_pfn);
+ reservedpages = end_pfn - totalram_pages -
+ absent_pages_in_range(0, end_pfn);
after_bootmem = 1;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c 2006-08-21 10:15:58.000000000 +0100
@@ -146,6 +146,9 @@ int __init k8_scan_nodes(unsigned long s
nodes[nodeid].start = base;
nodes[nodeid].end = limit;
+ e820_register_active_regions(nodeid,
+ nodes[nodeid].start >> PAGE_SHIFT,
+ nodes[nodeid].end >> PAGE_SHIFT);
prevbase = base;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/numa.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c 2006-08-21 10:15:58.000000000 +0100
@@ -161,7 +161,7 @@ void __init setup_node_bootmem(int nodei
bootmap_start >> PAGE_SHIFT,
start_pfn, end_pfn);
- e820_bootmem_free(NODE_DATA(nodeid), start, end);
+ free_bootmem_with_active_regions(nodeid, end);
reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size);
reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
@@ -175,13 +175,11 @@ void __init setup_node_bootmem(int nodei
void __init setup_node_zones(int nodeid)
{
unsigned long start_pfn, end_pfn, memmapsize, limit;
- unsigned long zones[MAX_NR_ZONES];
- unsigned long holes[MAX_NR_ZONES];
start_pfn = node_start_pfn(nodeid);
end_pfn = node_end_pfn(nodeid);
- Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+ Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
nodeid, start_pfn, end_pfn);
/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -195,10 +193,6 @@ void __init setup_node_zones(int nodeid)
round_down(limit - memmapsize, PAGE_SIZE),
limit);
#endif
-
- size_zones(zones, holes, start_pfn, end_pfn);
- free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
- start_pfn, holes);
}
void __init numa_init_array(void)
@@ -259,8 +253,11 @@ static int __init numa_emulation(unsigne
printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
return -1;
}
- for_each_online_node(i)
+ for_each_online_node(i) {
+ e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
+ nodes[i].end >> PAGE_SHIFT);
setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+ }
numa_init_array();
return 0;
}
@@ -299,6 +296,7 @@ void __init numa_initmem_init(unsigned l
for (i = 0; i < NR_CPUS; i++)
numa_set_node(i, 0);
node_to_cpumask[0] = cpumask_of_cpu(0);
+ e820_register_active_regions(0, start_pfn, end_pfn);
setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
}
@@ -340,12 +338,17 @@ static void __init arch_sparse_init(void
void __init paging_init(void)
{
int i;
+ unsigned long max_zone_pfns[MAX_NR_ZONES] = { MAX_DMA_PFN,
+ MAX_DMA32_PFN,
+ end_pfn};
arch_sparse_init();
for_each_online_node(i) {
setup_node_zones(i);
}
+
+ free_area_init_nodes(max_zone_pfns);
}
static __init int numa_setup(char *opt)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c 2006-08-21 10:15:58.000000000 +0100
@@ -84,6 +84,7 @@ static __init void bad_srat(void)
apicid_to_node[i] = NUMA_NO_NODE;
for (i = 0; i < MAX_NUMNODES; i++)
nodes_add[i].start = nodes[i].end = 0;
+ remove_all_active_ranges();
}
static __init inline int srat_disabled(void)
@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
if (mem < 0)
return 0;
- allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
+ allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
allowed = (allowed / 100) * hotadd_percent;
if (allocated + mem > allowed) {
unsigned long range;
@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
}
/* This check might be a bit too strict, but I'm keeping it for now. */
- if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
+ if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
return -1;
}
@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
nd->start, nd->end);
+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
+ nd->end >> PAGE_SHIFT);
if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
/* Ignore hotadd region. Undo damage */
@@ -351,12 +354,12 @@ static int nodes_cover_memory(void)
unsigned long s = nodes[i].start >> PAGE_SHIFT;
unsigned long e = nodes[i].end >> PAGE_SHIFT;
pxmram += e - s;
- pxmram -= e820_hole_size(s, e);
+ pxmram -= absent_pages_in_range(s, e);
if ((long)pxmram < 0)
pxmram = 0;
}
- e820ram = end_pfn - e820_hole_size(0, end_pfn);
+ e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
if ((long)(e820ram - pxmram) >= 1*1024*1024) {
printk(KERN_ERR
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/e820.h 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h 2006-08-21 10:15:58.000000000 +0100
@@ -51,10 +51,9 @@ extern void e820_print_map(char *who);
extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
-extern void e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end);
extern void e820_setup_gap(void);
-extern unsigned long e820_hole_size(unsigned long start_pfn,
- unsigned long end_pfn);
+extern void e820_register_active_regions(int nid,
+ unsigned long start_pfn, unsigned long end_pfn);
extern void finish_e820_parsing(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/proto.h 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h 2006-08-21 10:15:58.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
#define mtrr_bp_init() do {} while (0)
#endif
extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
- unsigned long start_pfn, unsigned long end_pfn);
extern void system_call(void);
extern int kernel_syscall(void);
The x86_64 code accounted for memmap and some portions of the DMA zone
as holes. This was because those areas would never be reclaimed and accounting
for them as memory affects min watermarks. This patch will account for the
memmap as a memory hole. Architectures may optionally use set_dma_reserve() if
they wish to account for a portion of memory in ZONE_DMA as a hole.
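As a rough illustration of the accounting described above (not code from the patch), the arithmetic can be sketched in userspace C. The 4 KiB page size and the 56-byte stand-in for sizeof(struct page) are assumptions, and memmap_pages() and zone_realsize() are hypothetical names. As in the patch, the DMA reserve is only subtracted when the zone is larger than the reserve.

```c
#include <assert.h>

#define PAGE_SHIFT       12UL  /* assumption: 4 KiB pages */
#define STRUCT_PAGE_SIZE 56UL  /* assumption: stand-in for sizeof(struct page) */

/* Pages consumed by a flat mem_map describing spanned_pages page frames. */
unsigned long memmap_pages(unsigned long spanned_pages)
{
    return (spanned_pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;
}

/*
 * Size of a zone as seen by the watermark calculations: the spanned size
 * minus holes, minus the memmap, minus the DMA reserve. Pass 0 for
 * dma_reserve on zones other than ZONE_DMA.
 */
unsigned long zone_realsize(unsigned long spanned, unsigned long absent,
                            unsigned long memmap, unsigned long dma_reserve)
{
    unsigned long realsize;

    /* The sketch assumes the holes and the memmap fit inside the zone. */
    assert(absent <= spanned);
    assert(memmap <= spanned - absent);

    realsize = spanned - absent - memmap;
    if (realsize > dma_reserve)
        realsize -= dma_reserve;
    return realsize;
}
```

Under these assumptions, a node spanning 1048576 page frames (4 GiB) needs 14336 pages (56 MiB) of memmap, which is the amount that would no longer be counted towards the watermarks.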
arch/x86_64/mm/init.c | 4 +
include/linux/mm.h | 1
mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+), 1 deletion(-)
Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.18-rc4-mm2-106-account_kernel_mmap/arch/x86_64/mm/init.c
--- linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/x86_64/mm/init.c 2006-08-21 10:15:58.000000000 +0100
+++ linux-2.6.18-rc4-mm2-106-account_kernel_mmap/arch/x86_64/mm/init.c 2006-08-21 10:18:23.000000000 +0100
@@ -660,8 +660,10 @@ void __init reserve_bootmem_generic(unsi
#else
reserve_bootmem(phys, len);
#endif
- if (phys+len <= MAX_DMA_PFN*PAGE_SIZE)
+ if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
dma_reserve += len / PAGE_SIZE;
+ set_dma_reserve(dma_reserve);
+ }
}
int kern_addr_valid(unsigned long addr)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/include/linux/mm.h linux-2.6.18-rc4-mm2-106-account_kernel_mmap/include/linux/mm.h
--- linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/include/linux/mm.h 2006-08-21 10:12:12.000000000 +0100
+++ linux-2.6.18-rc4-mm2-106-account_kernel_mmap/include/linux/mm.h 2006-08-21 10:18:23.000000000 +0100
@@ -1021,6 +1021,7 @@ extern void sparse_memory_present_with_a
extern int early_pfn_to_nid(unsigned long pfn);
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+extern void set_dma_reserve(unsigned long new_dma_reserve);
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
extern void setup_per_zone_pages_min(void);
extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/mm/page_alloc.c linux-2.6.18-rc4-mm2-106-account_kernel_mmap/mm/page_alloc.c
--- linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/mm/page_alloc.c 2006-08-21 10:12:12.000000000 +0100
+++ linux-2.6.18-rc4-mm2-106-account_kernel_mmap/mm/page_alloc.c 2006-08-21 10:18:23.000000000 +0100
@@ -104,6 +104,7 @@ int min_free_kbytes = 1024;
unsigned long __meminitdata nr_kernel_pages;
unsigned long __meminitdata nr_all_pages;
+static unsigned long __initdata dma_reserve;
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
/*
@@ -2303,6 +2304,20 @@ unsigned long __init zone_absent_pages_i
arch_zone_lowest_possible_pfn[zone_type],
arch_zone_highest_possible_pfn[zone_type]);
}
+
+/* Return the zone index a PFN is in */
+int memmap_zone_idx(struct page *lmem_map)
+{
+ int i;
+ unsigned long phys_addr = virt_to_phys(lmem_map);
+ unsigned long pfn = phys_addr >> PAGE_SHIFT;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ if (pfn < arch_zone_highest_possible_pfn[i])
+ break;
+
+ return i;
+}
#else
static inline unsigned long zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -2320,6 +2335,11 @@ static inline unsigned long zone_absent_
return zholes_size[zone_type];
}
+
+static inline int memmap_zone_idx(struct page *lmem_map)
+{
+ return MAX_NR_ZONES;
+}
#endif
static void __init calculate_node_totalpages(struct pglist_data *pgdat,
@@ -2343,6 +2363,58 @@ static void __init calculate_node_totalp
realtotalpages);
}
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+/* Account for mem_map for CONFIG_FLAT_NODE_MEM_MAP */
+unsigned long __meminit account_memmap(struct pglist_data *pgdat,
+ int zone_index)
+{
+ unsigned long pages = 0;
+ if (zone_index == memmap_zone_idx(pgdat->node_mem_map)) {
+ pages = pgdat->node_spanned_pages;
+ pages = (pages * sizeof(struct page)) >> PAGE_SHIFT;
+ printk(KERN_DEBUG "%lu pages used for memmap\n", pages);
+ }
+ return pages;
+}
+#else
+/* Account for mem_map for CONFIG_SPARSEMEM */
+unsigned long account_memmap(struct pglist_data *pgdat, int zone_index)
+{
+ unsigned long pages = 0;
+ unsigned long memmap_pfn;
+ struct page *memmap_addr;
+ int pnum;
+ unsigned long pgdat_startpfn, pgdat_endpfn;
+ struct mem_section *section;
+
+ pgdat_startpfn = pgdat->node_start_pfn;
+ pgdat_endpfn = pgdat_startpfn + pgdat->node_spanned_pages;
+
+ /* Go through valid sections looking for memmap */
+ for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+ if (!valid_section_nr(pnum))
+ continue;
+
+ section = __nr_to_section(pnum);
+ if (!section_has_mem_map(section))
+ continue;
+
+ memmap_addr = __section_mem_map_addr(section);
+ memmap_pfn = (unsigned long)memmap_addr >> PAGE_SHIFT;
+
+ if (memmap_pfn < pgdat_startpfn || memmap_pfn >= pgdat_endpfn)
+ continue;
+
+ if (zone_index == memmap_zone_idx(memmap_addr))
+ pages += (PAGES_PER_SECTION * sizeof(struct page));
+ }
+
+ pages >>= PAGE_SHIFT;
+ printk(KERN_DEBUG "%lu pages used for SPARSE memmap\n", pages);
+ return pages;
+}
+#endif
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2370,6 +2442,14 @@ static void __meminit free_area_init_cor
realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);
+ realsize -= account_memmap(pgdat, j);
+ /* Account for reserved DMA pages */
+ if (j == ZONE_DMA && realsize > dma_reserve) {
+ realsize -= dma_reserve;
+ printk(KERN_DEBUG "%lu pages DMA reserved\n",
+ dma_reserve);
+ }
+
if (!is_highmem_idx(j))
nr_kernel_pages += realsize;
nr_all_pages += realsize;
@@ -2685,6 +2765,21 @@ void __init free_area_init_nodes(unsigne
}
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+/**
+ * set_dma_reserve - Account the specified number of pages reserved in ZONE_DMA
+ * @new_dma_reserve - The number of pages to mark reserved
+ *
+ * The per-cpu batchsize and zone watermarks are determined by present_pages.
+ * In the DMA zone, a significant percentage may be consumed by kernel image
+ * and other unfreeable allocations which can skew the watermarks badly. This
+ * function may optionally be used to account for unfreeable pages in
+ * ZONE_DMA. The effect will be lower watermarks and smaller per-cpu batchsize
+ */
+void __init set_dma_reserve(unsigned long new_dma_reserve)
+{
+ dma_reserve = new_dma_reserve;
+}
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
Size zones and holes in an architecture independent manner for ia64.
arch/ia64/Kconfig | 3 ++
arch/ia64/mm/contig.c | 60 ++++++----------------------------------
arch/ia64/mm/discontig.c | 44 ++++++-----------------------
arch/ia64/mm/init.c | 12 ++++++++
include/asm-ia64/meminit.h | 1
5 files changed, 34 insertions(+), 86 deletions(-)
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Bob Picco <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/Kconfig linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/Kconfig
--- linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/Kconfig 2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/Kconfig 2006-08-21 10:17:10.000000000 +0100
@@ -352,6 +352,9 @@ config NODES_SHIFT
MAX_NUMNODES will be 2^(This value).
If in doubt, use the default.
+config ARCH_POPULATES_NODE_MAP
+ def_bool y
+
# VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent.
# VIRTUAL_MEM_MAP has been retained for historical reasons.
config VIRTUAL_MEM_MAP
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/contig.c
--- linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/contig.c 2006-08-21 10:17:10.000000000 +0100
@@ -26,7 +26,6 @@
#include <asm/mca.h>
#ifdef CONFIG_VIRTUAL_MEM_MAP
-static unsigned long num_dma_physpages;
static unsigned long max_gap;
#endif
@@ -218,18 +217,6 @@ count_pages (u64 start, u64 end, void *a
return 0;
}
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
- unsigned long *count = arg;
-
- if (start < MAX_DMA_ADDRESS)
- *count += (min(end, MAX_DMA_ADDRESS) - start) >> PAGE_SHIFT;
- return 0;
-}
-#endif
-
/*
* Set up the page tables.
*/
@@ -238,45 +225,22 @@ void __init
paging_init (void)
{
unsigned long max_dma;
- unsigned long zones_size[MAX_NR_ZONES];
-#ifdef CONFIG_VIRTUAL_MEM_MAP
- unsigned long zholes_size[MAX_NR_ZONES];
-#endif
-
- /* initialize mem_map[] */
-
- memset(zones_size, 0, sizeof(zones_size));
+ unsigned long nid = 0;
+ unsigned long max_zone_pfns[MAX_NR_ZONES];
num_physpages = 0;
efi_memmap_walk(count_pages, &num_physpages);
max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+ max_zone_pfns[ZONE_DMA] = max_dma;
+ max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
#ifdef CONFIG_VIRTUAL_MEM_MAP
- memset(zholes_size, 0, sizeof(zholes_size));
-
- num_dma_physpages = 0;
- efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
- if (max_low_pfn < max_dma) {
- zones_size[ZONE_DMA] = max_low_pfn;
- zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
- } else {
- zones_size[ZONE_DMA] = max_dma;
- zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
- if (num_physpages > num_dma_physpages) {
- zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
- zholes_size[ZONE_NORMAL] =
- ((max_low_pfn - max_dma) -
- (num_physpages - num_dma_physpages));
- }
- }
-
+ efi_memmap_walk(register_active_ranges, &nid);
efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
if (max_gap < LARGE_GAP) {
vmem_map = (struct page *) 0;
- free_area_init_node(0, NODE_DATA(0), zones_size, 0,
- zholes_size);
+ free_area_init_nodes(max_zone_pfns);
} else {
unsigned long map_size;
@@ -289,19 +253,13 @@ paging_init (void)
efi_memmap_walk(create_mem_map_page_table, NULL);
NODE_DATA(0)->node_mem_map = vmem_map;
- free_area_init_node(0, NODE_DATA(0), zones_size,
- 0, zholes_size);
+ free_area_init_nodes(max_zone_pfns);
printk("Virtual mem_map starts at 0x%p\n", mem_map);
}
#else /* !CONFIG_VIRTUAL_MEM_MAP */
- if (max_low_pfn < max_dma)
- zones_size[ZONE_DMA] = max_low_pfn;
- else {
- zones_size[ZONE_DMA] = max_dma;
- zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
- }
- free_area_init(zones_size);
+ add_active_range(0, 0, max_low_pfn);
+ free_area_init_nodes(max_zone_pfns);
#endif /* !CONFIG_VIRTUAL_MEM_MAP */
zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c
--- linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c 2006-08-21 10:17:10.000000000 +0100
@@ -654,6 +654,7 @@ static __init int count_node_pages(unsig
{
unsigned long end = start + len;
+ add_active_range(node, start >> PAGE_SHIFT, end >> PAGE_SHIFT);
mem_data[node].num_physpages += len >> PAGE_SHIFT;
if (start <= __pa(MAX_DMA_ADDRESS))
mem_data[node].num_dma_physpages +=
@@ -678,10 +679,10 @@ static __init int count_node_pages(unsig
void __init paging_init(void)
{
unsigned long max_dma;
- unsigned long zones_size[MAX_NR_ZONES];
- unsigned long zholes_size[MAX_NR_ZONES];
unsigned long pfn_offset = 0;
+ unsigned long max_pfn = 0;
int node;
+ unsigned long max_zone_pfns[MAX_NR_ZONES];
max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
@@ -698,47 +699,20 @@ void __init paging_init(void)
#endif
for_each_online_node(node) {
- memset(zones_size, 0, sizeof(zones_size));
- memset(zholes_size, 0, sizeof(zholes_size));
-
num_physpages += mem_data[node].num_physpages;
-
- if (mem_data[node].min_pfn >= max_dma) {
- /* All of this node's memory is above ZONE_DMA */
- zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- mem_data[node].min_pfn;
- zholes_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- mem_data[node].min_pfn -
- mem_data[node].num_physpages;
- } else if (mem_data[node].max_pfn < max_dma) {
- /* All of this node's memory is in ZONE_DMA */
- zones_size[ZONE_DMA] = mem_data[node].max_pfn -
- mem_data[node].min_pfn;
- zholes_size[ZONE_DMA] = mem_data[node].max_pfn -
- mem_data[node].min_pfn -
- mem_data[node].num_dma_physpages;
- } else {
- /* This node has memory in both zones */
- zones_size[ZONE_DMA] = max_dma -
- mem_data[node].min_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] -
- mem_data[node].num_dma_physpages;
- zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
- max_dma;
- zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] -
- (mem_data[node].num_physpages -
- mem_data[node].num_dma_physpages);
- }
-
pfn_offset = mem_data[node].min_pfn;
#ifdef CONFIG_VIRTUAL_MEM_MAP
NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset;
#endif
- free_area_init_node(node, NODE_DATA(node), zones_size,
- pfn_offset, zholes_size);
+ if (mem_data[node].max_pfn > max_pfn)
+ max_pfn = mem_data[node].max_pfn;
}
+ max_zone_pfns[ZONE_DMA] = max_dma;
+ max_zone_pfns[ZONE_NORMAL] = max_pfn;
+ free_area_init_nodes(max_zone_pfns);
+
zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/init.c linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/init.c
--- linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/ia64/mm/init.c 2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/arch/ia64/mm/init.c 2006-08-21 10:17:10.000000000 +0100
@@ -593,6 +593,18 @@ find_largest_hole (u64 start, u64 end, v
last_end = end;
return 0;
}
+
+int __init
+register_active_ranges(u64 start, u64 end, void *nid)
+{
+ BUG_ON(nid == NULL);
+ BUG_ON(*(unsigned long *)nid >= MAX_NUMNODES);
+
+ add_active_range(*(unsigned long *)nid,
+ __pa(start) >> PAGE_SHIFT,
+ __pa(end) >> PAGE_SHIFT);
+ return 0;
+}
#endif /* CONFIG_VIRTUAL_MEM_MAP */
static int __init
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/include/asm-ia64/meminit.h
--- linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h 2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-105-ia64_use_init_nodes/include/asm-ia64/meminit.h 2006-08-21 10:17:10.000000000 +0100
@@ -56,6 +56,7 @@ extern void efi_memmap_init(unsigned lon
extern unsigned long vmalloc_end;
extern struct page *vmem_map;
extern int find_largest_hole (u64 start, u64 end, void *arg);
+ extern int register_active_ranges (u64 start, u64 end, void *arg);
extern int create_mem_map_page_table (u64 start, u64 end, void *arg);
extern int vmemmap_find_next_valid_pfn(int, int);
#else
On 8/21/06, Mel Gorman <[email protected]> wrote:
> This is V9 of the patchset to size zones and memory holes in an
> architecture-independent manner. It booted successfully on 5 different
> machines (arches were x86, x86_64, ppc64 and ia64) in a number of different
> configurations and successfully built a kernel. If it fails on any machine,
> booting with loglevel=8 and the console log should tell me what went wrong.
>
I am wondering why this new API didn't clean up the pfn_to_nid code
path as well. Arches are still left to keep another set of
nid-start-end info around. We are sending info like
add_active_range(unsigned int nid, unsigned long start_pfn, unsigned
long end_pfn)
With this info, making a common pfn_to_nid seems to be of interest so we
don't have to keep redundant information in both generic and
arch-specific data structures.
Are you intending the hot-add memory code path to call add_active_range or ???
Thanks,
Keith
On Mon, 21 Aug 2006, Keith Mannthey wrote:
> On 8/21/06, Mel Gorman <[email protected]> wrote:
>> This is V9 of the patchset to size zones and memory holes in an
>> architecture-independent manner. It booted successfully on 5 different
>> machines (arches were x86, x86_64, ppc64 and ia64) in a number of different
>> configurations and successfully built a kernel. If it fails on any machine,
>> booting with loglevel=8 and the console log should tell me what went wrong.
>>
>
> I am wondering why this new api didn't cleanup the pfn_to_nid code
> path as well. Arches are left to still keep another set of
> nid-start-end info around. We are sending info like
>
pfn_to_nid() is used at runtime and the early_node_map[] is deleted by
then. At this step, I only want to get the initialisation correct. What
can be replaced is the architecture-specific early_pfn_to_nid() function,
which I did for power and x86.
> add_active_range(unsigned int nid, unsigned long start_pfn, unsigned
> long end_pfn)
>
> With this info, making a common pfn_to_nid seems to be of interest so we
> don't have to keep redundant information in both generic and arch
> specific data structures.
>
To implement a common pfn_to_nid(), the array would have to be
converted to a linked list at the end of boot so it could be modified by
memory hot-add. pfn_to_nid() would then walk the linked list rather than
the existing array. pfn_valid() would probably be replaced as well.
However, this is going to be slower (if more accurate in some cases) than
the existing pfn_valid() and so I would treat it as a separate issue.
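For illustration only, the linked-list idea above could look roughly like the sketch below. Nothing like this is in the patchset. struct active_range, range_list_add() and list_pfn_to_nid() are hypothetical names, and locking against concurrent hot-add is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical runtime record of one registered range [start_pfn, end_pfn). */
struct active_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
    int nid;
    struct active_range *next;
};

/* Push a range onto the front of the list, as a hot-add path might. */
struct active_range *range_list_add(struct active_range *head,
                                    struct active_range *new_range)
{
    assert(new_range != NULL);
    assert(new_range->start_pfn < new_range->end_pfn);
    new_range->next = head;
    return new_range;
}

/* Walk the list; return the owning node id, or -1 if pfn lies in a hole. */
int list_pfn_to_nid(const struct active_range *head, unsigned long pfn)
{
    const struct active_range *r;

    for (r = head; r != NULL; r = r->next)
        if (pfn >= r->start_pfn && pfn < r->end_pfn)
            return r->nid;
    return -1;
}
```

The lookup cost grows with the number of registered ranges, which is the slowdown relative to the existing per-arch lookups mentioned above.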
> Are you intending the hot-add memory code path to call add_active_range or
> ???
>
Not at this time. I want to make sure the memory initialisation is right
before dealing with additional complications.
> Thanks,
> Keith
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On 8/21/06, Mel Gorman <[email protected]> wrote:
>
> Size zones and holes in an architecture independent manner for x86_64.
Hey Mel,
I am having some trouble with the srat.c changes. I keep running into
"SRAT: Hotplug area has existing memory" so I am taking a more thorough
look at this patch.
I am working on 2.6.18-rc4-mm3 x86_64.
srat.c is doing some sanity checking against the e820 and hot-add
memory ranges. BIOS folk aren't to be trusted with the SRAT. Calling
remove_all_active_ranges before acpi_numa_init leaves nothing to fall
back onto if the SRAT is bad. (see bad_srat()). What should happen
when we discard the srat info?
i386 code may have similar fallback logic (haven't been there in a while)
also
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> --- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c 2006-08-21 09:23:50.000000000 +0100
> +++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c 2006-08-21 10:15:58.000000000 +0100
> @@ -84,6 +84,7 @@ static __init void bad_srat(void)
> apicid_to_node[i] = NUMA_NO_NODE;
> for (i = 0; i < MAX_NUMNODES; i++)
> nodes_add[i].start = nodes[i].end = 0;
> + remove_all_active_ranges();
> }
We go back to setup_arch with no active areas?
> static __init inline int srat_disabled(void)
> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>
> if (mem < 0)
> return 0;
> - allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> + allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
> allowed = (allowed / 100) * hotadd_percent;
> if (allocated + mem > allowed) {
> unsigned long range;
> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
> }
>
> /* This check might be a bit too strict, but I'm keeping it for now. */
> - if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> + if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
> printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
> return -1;
> }
We really do want to compare against the e820 map, as it contains
the memory that is really present (this info was blown away before
acpi_numa). Anyway, I fixed it up to have the current chunk added
(e820_register_active_regions) after calling this code so it logically
makes sense, but it still trips over the check. I am not sure what you
are printing out in your debug code, but it doesn't look like pfns or
phys_addresses. Maybe it can tell us why the check fails.
> @@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>
> printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> nd->start, nd->end);
> + e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> + nd->end >> PAGE_SHIFT);
A node chunk in this section of code may be a hot-pluggable zone. With
MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
> /* Ignore hotadd region. Undo damage */
I have put the e820_register_active_regions call in an else branch of
this statement, but the absent pages check still fails.
Also, nodes_cover_memory and a lot of these checks were based on
comparing the SRAT data against the e820. Now all this code is
comparing SRAT against SRAT....
I am willing to help here but we should compare the SRAT against the
e820. Table v. Table.
What do you think should be done?
Thanks,
Keith
Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0 (SUSE Linux)) #3 SMP Wed Aug 30 15:17:13 EDT 2006
Command line: root=/dev/sda3 ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2 showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0 debug numa=hotadd=100
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
end_pfn_map = 18219008
DMI 2.3 present.
ACPI: RSDP (v000 IBM ) @ 0x00000000000fdcf0
ACPI: RSDT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98800
ACPI: FADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98780
ACPI: MADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98600
ACPI: SRAT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff983c0
ACPI: HPET (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @ 0x000000007ff98380
ACPI: SSDT (v001 IBM VIGSSDT0 0x00001000 INTL 0x20030122) @ 0x000000007ff90780
ACPI: SSDT (v001 IBM VIGSSDT1 0x00001000 INTL 0x20030122) @ 0x000000007ff88bc0
ACPI: DSDT (v001 IBM EXA01ZEU 0x00001000 INTL 0x20030122) @ 0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
SRAT: Hotplug area has existing memory
Entering add_active_range(0, 0, 152) 3 entries of 3200 used
Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-1160000000
Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-3200000000
SRAT: Hotplug area has existing memory
Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000001070000000
Bootmem setup node 1 0000001070000000-0000001160000000
On (30/08/06 13:57), Keith Mannthey didst pronounce:
> On 8/21/06, Mel Gorman <[email protected]> wrote:
> >
> >Size zones and holes in an architecture independent manner for x86_64.
>
>
> Hey Mel,
Hi Keith.
> I am having some trouble with the srat.c changes. I keep running into
> "SRAT: Hotplug area has existing memory" so I am taking a more thorough
> look at this patch.
> I am working on 2.6.18-rc4-mm3 x86_64.
>
ok, great. How much physical memory is installed on the machine? I want to
determine if the "usable" entries in the e820 map contain physical memory
or not.
> srat.c is doing some sanity checking against the e820 and hot-add
> memory ranges. BIOS folk aren't to be trusted with the SRAT. Calling
> remove_all_active_ranges before acpi_numa_init leaves nothing to fall
> back onto if the SRAT is bad. (see bad_srat()). What should happen
> when we discard the srat info?
>
When the SRAT is bad, the information is discarded and discovered by an
alternative method later in the boot process.
In this case, numa_initmem_init() is called after acpi_numa_init(). It
calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
that happens, either k8_scan_nodes() will be called and the regions
discovered there or if that is not possible, it'll fall through and
e820_register_active_regions will be called without any node awareness.
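The fallback order described above can be sketched as a simple "first usable source wins" loop. This is only an illustration of the control flow, not the kernel code. The scan functions below are stand-ins with simplified signatures for acpi_scan_nodes(), k8_scan_nodes() and the plain e820 registration, and pick_numa_source() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* A discovery method returns 0 on success and -1 if its data is unusable. */
typedef int (*scan_fn)(void);

/*
 * Try each method in order. Return the index of the first method that
 * succeeds, or -1 if every method fails.
 */
int pick_numa_source(const scan_fn *methods, size_t count)
{
    size_t i;

    assert(methods != NULL || count == 0);
    for (i = 0; i < count; i++)
        if (methods[i]() == 0)
            return (int)i;
    return -1;
}

/* Stand-ins: a bad SRAT, no K8 northbridge information, a usable e820 map. */
int srat_scan_bad(void)  { return -1; }
int k8_scan_absent(void) { return -1; }
int flat_e820(void)      { return 0; }
```

With the three stand-ins tried in that order, the third (node-unaware e820 registration) is the one that ends up supplying the active ranges.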
> i386 code may have similar fallback logic (haven't been there in a while)
>
There is fallback logic in the i386 code as well.
> also
>
> >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff
> >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> >2006-08-21 09:23:50.000000000 +0100
> >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> >2006-08-21 10:15:58.000000000 +0100
> >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
> > apicid_to_node[i] = NUMA_NO_NODE;
> > for (i = 0; i < MAX_NUMNODES; i++)
> > nodes_add[i].start = nodes[i].end = 0;
> >+ remove_all_active_ranges();
> > }
>
> We go back to setup_arch with no active areas?
>
Yes, and it'll be discovered using an alternative method later. There is
no point returning to setup_arch with known bad information about active
areas.
> > static __init inline int srat_disabled(void)
> >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
> >
> > if (mem < 0)
> > return 0;
> >- allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> >+ allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
> >PAGE_SIZE;
> > allowed = (allowed / 100) * hotadd_percent;
> > if (allocated + mem > allowed) {
> > unsigned long range;
> >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
> > }
> >
> > /* This check might be a bit too strict, but I'm keeping it for
> > now. */
> >- if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> >+ if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
> > printk(KERN_ERR "SRAT: Hotplug area has existing
> > memory\n");
> > return -1;
> > }
> We really do want to compare against the e820 map as it contains
> the memory that is really present (this info was blown away before
> acpi_numa)
The information used by absent_pages_in_range() should match what was
available to e820_hole_size().
> Anyway I fixed it up to have the current chunk added
> (e820_register_active_regions) after calling this code so it logically
> makes sense but it still trips over the check.
> I am not sure what you are printing out in your debug code. It doesn't
> look like pfns or phys_addresses, but maybe it can tell us why the
> check fails.
>
My debug code for add_active_range() prints out pfns, but I spotted one
case where absent_pages_in_range() does not do what one would expect.
Let's say the ranges with physical memory were 0->1000 and 2000->3000 (in
pfns). absent_pages_in_range(0, 3000) would return 1000 as you'd expect but
absent_pages_in_range(5000, 6000) would return 0! I have a patch that might
fix this at the end of the mail but I'm not sure it's the problem you are
hitting. In the bootlog, I see:
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
SRAT: Hotplug area has existing memory
The last part (0-1070000000) is checked as a hotplug area but it's clear
that memory exists in that range. As reserve_hotadd() requires that the
whole range be a hole, I'm having trouble seeing how it ever successfully
reserved unless the ranges going into reserve_hotadd() are something other
than the pfn range for 0-1070000000. The patch later will print out the
range used by reserve_hotadd() so we can see.
> >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> >
> > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> > nd->start, nd->end);
> >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> >+ nd->end >> PAGE_SHIFT);
>
> A node chunk in this section of code may be a hot-pluggable zone. With
> MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>
The ranges should not get registered as active memory by
e820_register_active_regions() unless they are marked E820_RAM. My
understanding is that the regions for hotadd would be marked "reserved"
in the e820 map. Is that wrong?
> > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) <
> > 0) {
> > /* Ignore hotadd region. Undo damage */
>
> I have put the e820_register_active_regions() call in as an else to this
> statement, and the absent pages check fails.
>
The patch below omits this change because I think
e820_register_active_regions() will still have got called by the time
you encounter a hotplug area.
> Also nodes_cover_memory and a lot of these checks were based on
> comparing the SRAT data against the e820. Now all this code is
> comparing SRAT against SRAT....
>
I don't see why. The SRAT table passes a range to
e820_register_active_regions(), so it should be comparing SRAT to e820.
> I am willing to help here but we should compare the SRAT against the
> e820. Table v. Table.
>
> What do you think should be done?
>
Can you read through this patch and see does it address the problem in any
way? If it doesn't, can you send a complete bootlog so I can see what is
being sent to reserve_hotadd()? Thanks
Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm3-fix_x8664_hotadd/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c 2006-08-29 16:25:10.000000000 +0100
+++ linux-2.6.18-rc4-mm3-fix_x8664_hotadd/arch/x86_64/mm/srat.c 2006-08-31 16:17:26.000000000 +0100
@@ -240,7 +240,8 @@ static int reserve_hotadd(int node, unsi
/* This check might be a bit too strict, but I'm keeping it for now. */
if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
- printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
+ printk(KERN_ERR "SRAT: Hotplug area %lu -> %lu has existing memory\n",
+ s_pfn, e_pfn);
return -1;
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c linux-2.6.18-rc4-mm3-fix_x8664_hotadd/mm/page_alloc.c
--- linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c 2006-08-29 16:25:31.000000000 +0100
+++ linux-2.6.18-rc4-mm3-fix_x8664_hotadd/mm/page_alloc.c 2006-08-31 14:52:38.000000000 +0100
@@ -2280,6 +2280,10 @@ unsigned long __init __absent_pages_in_r
prev_end_pfn = early_node_map[i].end_pfn;
}
+ /* If the range is outside of physical memory, return the range */
+ if (range_start_pfn > prev_end_pfn)
+ hole_pages = range_end_pfn - range_start_pfn;
+
return hole_pages;
}
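
For reference, the semantics the page_alloc.c hunk is aiming for can be
modelled outside the kernel. What follows is only a minimal user-space
sketch, not the code in mm/page_alloc.c: the names toy_range,
toy_absent_pages() and toy_range_is_hole() are hypothetical, the active
ranges are assumed not to overlap, and node ids are ignored. It treats a pfn
as absent whenever no registered range covers it, so a query lying entirely
beyond the last registered range comes back as all hole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one active range; pfns are [start_pfn, end_pfn). */
struct toy_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Count the pfns in [range_start_pfn, range_end_pfn) that are not covered
 * by any entry of map[]. Assumes the entries of map[] do not overlap.
 */
static unsigned long toy_absent_pages(const struct toy_range *map, size_t nr,
				      unsigned long range_start_pfn,
				      unsigned long range_end_pfn)
{
	unsigned long present = 0;
	size_t i;

	assert(range_start_pfn <= range_end_pfn);

	for (i = 0; i < nr; i++) {
		unsigned long s = map[i].start_pfn;
		unsigned long e = map[i].end_pfn;

		/* Clip the active range to the queried window */
		if (s < range_start_pfn)
			s = range_start_pfn;
		if (e > range_end_pfn)
			e = range_end_pfn;
		if (s < e)
			present += e - s;
	}

	/* Whatever is not present in the window is a hole */
	return (range_end_pfn - range_start_pfn) - present;
}

/* The reserve_hotadd()-style test: is the whole window a hole? */
static int toy_range_is_hole(const struct toy_range *map, size_t nr,
			     unsigned long s_pfn, unsigned long e_pfn)
{
	return toy_absent_pages(map, nr, s_pfn, e_pfn) == e_pfn - s_pfn;
}
```

With ranges 0->1000 and 2000->3000 registered, toy_absent_pages() gives 1000
for (0, 3000) and also 1000 for (5000, 6000), which is the result one would
expect from the second case.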
>>> static __init inline int srat_disabled(void)
>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>
>>> if (mem < 0)
>>> return 0;
>>> - allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>> + allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
>>> PAGE_SIZE;
>>> allowed = (allowed / 100) * hotadd_percent;
>>> if (allocated + mem > allowed) {
>>> unsigned long range;
>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>> }
>>>
>>> /* This check might be a bit too strict, but I'm keeping it for
>>> now. */
>>> - if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>> + if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>> printk(KERN_ERR "SRAT: Hotplug area has existing
>>> memory\n");
>>> return -1;
>>> }
>>>
>> We really do want to to compare against the e820 map at it contains
>> the memory that is really present (this info was blown away before
>> acpi_numa)
>>
>
> The information used by absent_pages_in_range() should match what was
> available to e820_hole_size().
>
>
But it doesn't: all active ranges are removed before parsing the SRAT. I
think we really need to check against e820 here.
--Mika
On Thu, 31 Aug 2006, Mika Penttilä wrote:
>
>>>> static __init inline int srat_disabled(void)
>>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>>
>>>> if (mem < 0)
>>>> return 0;
>>>> - allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>>> + allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
>>>> PAGE_SIZE;
>>>> allowed = (allowed / 100) * hotadd_percent;
>>>> if (allocated + mem > allowed) {
>>>> unsigned long range;
>>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>>> }
>>>>
>>>> /* This check might be a bit too strict, but I'm keeping it for
>>>> now. */
>>>> - if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>> + if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>> printk(KERN_ERR "SRAT: Hotplug area has existing
>>>> memory\n");
>>>> return -1;
>>>> }
>>>>
>>> We really do want to to compare against the e820 map at it contains
>>> the memory that is really present (this info was blown away before
>>> acpi_numa)
>>
>> The information used by absent_pages_in_range() should match what was
>> available to e820_hole_size().
>>
>>
> But it doesn't : all active ranges are removed before parsing srat. I think
> we really need to check against e820 here.
>
What I see happening is this:
1. setup_arch() calls e820_register_active_regions(0, 0, -1UL) so that all
regions are registered as if they were on node 0 and e820_end_of_ram()
gets the right value
2. remove_all_active_ranges() is called to clear what was registered so
that rediscovery with NUMA awareness happens
3. acpi_numa_init() is called. It parses the table and a little later
calls acpi_numa_memory_affinity_init() for each range in the table so
now we're into x86_64 code
4. acpi_numa_memory_affinity_init() basically deals with an address range.
Assuming the SRAT table is not broken, it calls
e820_register_active_regions() for that range. At this point, for the
range of addresses, the active ranges are now registered
5. reserve_hotadd() is called if the range is hotpluggable. It will fail if
it finds that memory already exists there
So, when absent_pages_in_range() is being called by reserve_hotadd(), it
should be using the same information that was available in e820. What am I
missing?
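
The ordering described in the steps above can be sketched as a toy
user-space model. This is an illustration under stated assumptions rather
than the x86_64 code: every toy_* name is hypothetical, the e820 map is
reduced to pfn ranges with a type, and unlike add_active_range() the model
does not merge duplicate or overlapping registrations.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_RANGES 16

enum toy_e820_type { TOY_E820_RAM = 1, TOY_E820_RESERVED = 2 };

/* Hypothetical, simplified e820 entry expressed in pfns, end exclusive */
struct toy_e820_entry {
	unsigned long start_pfn;
	unsigned long end_pfn;
	enum toy_e820_type type;
};

struct toy_active_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;
};

static struct toy_active_range toy_map[TOY_MAX_RANGES];
static size_t toy_nr_ranges;

/* Step 2: forget everything registered so far */
static void toy_remove_all_active_ranges(void)
{
	toy_nr_ranges = 0;
}

/*
 * Steps 1 and 4: register only the usable e820 memory that falls inside
 * [start_pfn, end_pfn), tagged with node id nid.
 */
static void toy_register_active_regions(const struct toy_e820_entry *e820,
					size_t nr_entries, int nid,
					unsigned long start_pfn,
					unsigned long end_pfn)
{
	size_t i;

	for (i = 0; i < nr_entries; i++) {
		unsigned long s = e820[i].start_pfn;
		unsigned long e = e820[i].end_pfn;

		if (e820[i].type != TOY_E820_RAM)
			continue;

		/* Clip the e820 entry to the requested window */
		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s >= e)
			continue;

		assert(toy_nr_ranges < TOY_MAX_RANGES);
		toy_map[toy_nr_ranges].nid = nid;
		toy_map[toy_nr_ranges].start_pfn = s;
		toy_map[toy_nr_ranges].end_pfn = e;
		toy_nr_ranges++;
	}
}

/* Step 5 helper: how many registered pfns lie inside [start_pfn, end_pfn)? */
static unsigned long toy_present_pages(unsigned long start_pfn,
				       unsigned long end_pfn)
{
	unsigned long present = 0;
	size_t i;

	for (i = 0; i < toy_nr_ranges; i++) {
		unsigned long s = toy_map[i].start_pfn;
		unsigned long e = toy_map[i].end_pfn;

		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s < e)
			present += e - s;
	}
	return present;
}
```

In this model a hot-add window only passes the "no existing memory" test
when toy_present_pages() returns 0 for it, and that answer depends purely on
which usable e820 ranges were re-registered after the map was cleared.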
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Thu, 31 Aug 2006, Mika Penttilä wrote:
>
>>
>>>>> static __init inline int srat_disabled(void)
>>>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>>>
>>>>> if (mem < 0)
>>>>> return 0;
>>>>> - allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>>>> + allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
>>>>> PAGE_SIZE;
>>>>> allowed = (allowed / 100) * hotadd_percent;
>>>>> if (allocated + mem > allowed) {
>>>>> unsigned long range;
>>>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>>>> }
>>>>>
>>>>> /* This check might be a bit too strict, but I'm keeping it
>>>>> for now. */
>>>>> - if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>>> + if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>>> printk(KERN_ERR "SRAT: Hotplug area has existing
>>>>> memory\n");
>>>>> return -1;
>>>>> }
>>>>>
>>>> We really do want to to compare against the e820 map at it contains
>>>> the memory that is really present (this info was blown away before
>>>> acpi_numa)
>>>
>>> The information used by absent_pages_in_range() should match what was
>>> available to e820_hole_size().
>>>
>>>
>> But it doesn't : all active ranges are removed before parsing srat. I
>> think we really need to check against e820 here.
>>
>
> What I see happening is this:
>
> 1. setup_arch() calls e820_register_active_regions(0, 0, -1UL) so that all
> regions are registered as if they were on node 0 and e820_end_of_ram()
> gets the right value
> 2. remove_all_active_ranges() is called to clear what was registered so
> that rediscovery with NUMA awareness happens
> 3. acpi_numa_init() is called. It parses the table and a little later
> calls acpi_numa_memory_affinity_init() for each range in the table so
> now we're into x86_64 code
> 4. acpi_numa_memory_affinity_init() basically deals with an address range.
> Assuming the SRAT table is not broken, it calls
> e820_register_active_regions() for that range. At this point, for the
> range of addresses, the active ranges are now registered
> 5. reserve_hotadd() is called if the range is hotpluggable. It will fail if
> it finds that memory already exists there
>
> So, when absent_pages_in_range() is being called by reserve_hotadd(),
> it should be using the same information that was available in e820.
> What am I missing?
>
Ok, right, I missed the e820_register_active_regions() call in
acpi_numa_memory_affinity_init() before the reserve_hotadd() stuff. So
logically it should be working, modulo bugs.
Argh, I just looked through the reserve hotadd code and
hotadd_enough_memory() still looks broken. And why are we doing
reserve_bootmem_node() when the regions aren't present RAM anyway?
--Mika
On 8/31/06, Mel Gorman <[email protected]> wrote:
> On (30/08/06 13:57), Keith Mannthey didst pronounce:
> > On 8/21/06, Mel Gorman <[email protected]> wrote:
> > >
> ok, great. How much physical memory is installed on the machine? I want to
> determine if the "usable" entries in the e820 map contain physical memory
> or not.
Usable entries in the e820 contain memory. I have about 20-24GB
depending on config.
> When the SRAT is bad, the information is discarded and discovered by an
> alternative method later in the boot process.
>
> In this case, numa_initmem_init() is called after acpi_numa_init(). It
> calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
> that happens, either k8_scan_nodes() will be called and the regions
> discovered there or if that is not possible, it'll fall through and
> e820_register_active_regions will be called without any node awareness.
Sorry, I missed some of the logic in this patch.
I see now in numa_initmem_init() that if no NUMA setup is found it calls
e820_register_active_regions(0, start_pfn, end_pfn) again.
So if the SRAT is discarded it runs the e820 code again.
> > >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff
> > >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> > >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> > >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> > >2006-08-21 09:23:50.000000000 +0100
> > >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> > >2006-08-21 10:15:58.000000000 +0100
> > >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
> > > apicid_to_node[i] = NUMA_NO_NODE;
> > > for (i = 0; i < MAX_NUMNODES; i++)
> > > nodes_add[i].start = nodes[i].end = 0;
> > >+ remove_all_active_ranges();
> > > }
> >
> > We go back to setup_arch with no active areas?
> >
>
> Yes, and it'll be discovered using an alternative method later. There is
> no point returning to setup_arch with known bad information about active
> areas.
Totally agreed! I just didn't see the fallback path.
>
> > > static __init inline int srat_disabled(void)
> > >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
> > >
> > > if (mem < 0)
> > > return 0;
> > >- allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> > >+ allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
> > >PAGE_SIZE;
> > > allowed = (allowed / 100) * hotadd_percent;
> > > if (allocated + mem > allowed) {
> > > unsigned long range;
> > >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
> > > }
> > >
> > > /* This check might be a bit too strict, but I'm keeping it for
> > > now. */
> > >- if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> > >+ if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
> > > printk(KERN_ERR "SRAT: Hotplug area has existing
> > > memory\n");
> > > return -1;
> > > }
> > We really do want to to compare against the e820 map at it contains
> > the memory that is really present (this info was blown away before
> > acpi_numa)
>
> The information used by absent_pages_in_range() should match what was
> available to e820_hole_size().
Is absent_pages_in_range a check against the e820 or the
add_pages_to_range calls?
> > Anyway I fixed up to have the current chunk added
> > (e820_register_active_regions) after calling this code so it logicaly
> > makes sense but it still trip over the check.
> > I am not sure what you
> > are printing out in you debug code but dosen't look like pfns or
> > phys_addresses but maybe it can tell us why the check fails.
> >
>
> My debug code for add_active_range() prints out pfns, but I spotted one
> case where absent_pages_in_range() does not do what one would expect.
> Let's say the ranges with physical memory were 0->1000 and 2000->3000 (in
> pfns). absent_pages_in_range(0, 3000) would return 1000 as you'd expect but
> absent_pages_in_range(5000, 6000) would return 0! I have a patch that might
> fix this at the end of the mail but I'm not sure it's the problem you are
> hitting. In the bootlog, I see:
>
> SRAT: Node 0 PXM 0 0-80000000
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> SRAT: Node 0 PXM 0 0-470000000
> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> SRAT: Node 0 PXM 0 0-1070000000
> SRAT: Hotplug area has existing memory
>
> The last part (0-1070000000) is checked as a hotplug area but it's clear
> that memory exists in that range. As reserve_hotadd() requires that the
> whole range be a hole, I'm having trouble seeing how it ever successfully
> reserved unless the ranges going into reserve_hotadd() are something other
> than the pfn range for 0-1070000000. The patch later will print out the
> range used by reserve_hotadd() so we can see.
No, the whole node is 0-1070000000; the hot-add range is 470000000-1070000000.
reserve_hotadd() is called with start and end, not nd->start and nd->end.
470000000-1070000000 should be empty.
> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> > >
> > > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> > > nd->start, nd->end);
> > >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> > >+ nd->end >> PAGE_SHIFT);
> >
> > A node chunk in this section of code may be a hot-pluggable zone. With
> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> >
>
> The ranges should not get registered as active memory by
> e820_register_active_regions() unless they are marked E820_RAM. My
> understanding is that the regions for hotadd would be marked "reserved"
> in the e820 map. Is that wrong?
This is wrong. In a multi-node system the last node's add area will not
be marked reserved by the e820. The e820 only defines memory <
end_pfn. The last node's add area is > end_pfn.
With RESERVE-based add-memory you want the add areas reported by the
SRAT to be set up during boot like all the other pages.
> > > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) <
> > > 0) {
> > > /* Ignore hotadd region. Undo damage */
> >
> > I have but the e820_register_active_regions as a else to this
> > statment the absent pages check fails.
> >
>
> The patch below omits this change because I think
> e820_register_active_regions() will still have got called by the time
> you encounter a hotplug area.
Called, but then removed in setup_arch().
> > Also nodes_cover_memory and alot of these check were based against
> > comparing the srat data against the e820. Now all this code is
> > comparing SRAT against SRAT....
> >
>
> I don't see why. The SRAT table passes a range to
> e820_register_active_regions() so should be comparing SRAT to e820
let me go off and look at e820_register_active_regions() some more.
> > I am willing to help here but we should compare the SRAT against to
> > e820. Table v. Table.
> >
> > What to you think should be done?
> >
>
> Can you read through this patch and see does it address the problem in any
> way? If it doesn't, can you send a complete bootlog so I can see what is
> being sent to reserve_hotadd()? Thanks
Sure thing. It is just the hot-add area; I am guessing it is an off-by-one
error of some sort.
What is all this code buying us? Since this code doesn't appear to do
anything to help the arch out (it just increases its VM boot code
complexity a little), maybe instead of weaving
e820_register_active_regions() calls throughout the boot process you
should just wait until things are sorted out and do a quick scan of
the node data that has been set up at the end?
What are the future plans for this API?
Thanks,
Keith
On Thu, 31 Aug 2006, Keith Mannthey wrote:
> On 8/31/06, Mel Gorman <[email protected]> wrote:
>> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> > On 8/21/06, Mel Gorman <[email protected]> wrote:
>> > >
>
>> ok, great. How much physical memory is installed on the machine? I want to
>> determine if the "usable" entries in the e820 map contain physical memory
>> or not.
>
> Usable entries in the e820 contain memory. I have about 20-24GB
> depending on config.
>
ok, that seems to match the (usable) regions in the e820 map.
>
>> When the SRAT is bad, the information is discarded and discovered by an
>> alternative method later in the boot process.
>>
>> In this case, numa_initmem_init() is called after acpi_numa_init(). It
>> calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
>> that happens, either k8_scan_nodes() will be called and the regions
>> discovered there or if that is not possible, it'll fall through and
>> e820_register_active_regions will be called without any node awareness.
>
> sorry I have missed some of the logic in this patch.
>
> I see now in numa_initmem_init that if no numa setup is found it calls
> e820_register_active_regions(0, start_pfn, end_pfn) again.
>
right.
> So if the SRAT is discarded it runs the e820 code again.
>
yes.
>
>> > >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff
>> > >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
>> > >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
>> > >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
>> > >2006-08-21 09:23:50.000000000 +0100
>> > >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
>> > >2006-08-21 10:15:58.000000000 +0100
>> > >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
>> > > apicid_to_node[i] = NUMA_NO_NODE;
>> > > for (i = 0; i < MAX_NUMNODES; i++)
>> > > nodes_add[i].start = nodes[i].end = 0;
>> > >+ remove_all_active_ranges();
>> > > }
>> >
>> > We go back to setup_arch with no active areas?
>> >
>>
>> Yes, and it'll be discovered using an alternative method later. There is
>> no point returning to setup_arch with known bad information about active
>> areas.
>
> Totally agreed! I just didn't see the fallback path.
grand.
>>
>> > > static __init inline int srat_disabled(void)
>> > >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>> > >
>> > > if (mem < 0)
>> > > return 0;
>> > >- allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>> > >+ allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
>> > >PAGE_SIZE;
>> > > allowed = (allowed / 100) * hotadd_percent;
>> > > if (allocated + mem > allowed) {
>> > > unsigned long range;
>> > >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>> > > }
>> > >
>> > > /* This check might be a bit too strict, but I'm keeping it for
>> > > now. */
>> > >- if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>> > >+ if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>> > > printk(KERN_ERR "SRAT: Hotplug area has existing
>> > > memory\n");
>> > > return -1;
>> > > }
>> > We really do want to to compare against the e820 map at it contains
>> > the memory that is really present (this info was blown away before
>> > acpi_numa)
>>
>> The information used by absent_pages_in_range() should match what was
>> available to e820_hole_size().
>
> Is absent_pages_in_range a check against the e820 or the
> add_pages_to_range calls?
>
absent_pages_in_range() uses information provided via add_active_range()
and on x86_64, add_active_range() is called based on information in the
e820.
>> > Anyway I fixed up to have the current chunk added
>> > (e820_register_active_regions) after calling this code so it logicaly
>> > makes sense but it still trip over the check.
>> > I am not sure what you
>> > are printing out in you debug code but dosen't look like pfns or
>> > phys_addresses but maybe it can tell us why the check fails.
>> >
>>
>> My debug code for add_active_range() prints out pfns, but I spotted one
>> case where absent_pages_in_range() does not do what one would expect.
>> Let's say the ranges with physical memory were 0->1000 and 2000->3000 (in
>> pfns). absent_pages_in_range(0, 3000) would return 1000 as you'd expect but
>> absent_pages_in_range(5000, 6000) would return 0! I have a patch that might
>> fix this at the end of the mail but I'm not sure it's the problem you are
>> hitting. In the bootlog, I see:
>>
>> SRAT: Node 0 PXM 0 0-80000000
>> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
>> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
>> SRAT: Node 0 PXM 0 0-470000000
>> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
>> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
>> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
>> SRAT: Node 0 PXM 0 0-1070000000
>> SRAT: Hotplug area has existing memory
>>
>
>> The last part (0-1070000000) is checked as a hotplug area but it's clear
>> that memory exists in that range. As reserve_hotadd() requires that the
>> whole range be a hole, I'm having trouble seeing how it ever successfully
>> reserved unless the ranges going into reserve_hotadd() are something other
>> than the pfn range for 0-1070000000. The patch later will print out the
>> range used by reserve_hotadd() so we can see.
>
> No, the whole node is 0-1070000000; the hot-add range is 470000000-1070000000.
> reserve_hotadd() is called with start and end, not nd->start and nd->end.
> 470000000-1070000000 should be empty.
>
Can you confirm that happens by applying the patch I sent to you and
checking the output? When the reserve fails, it should print out what
range it actually checked. I want to be sure it's not checking the
addresses 0->0x1070000000
>
>> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> > >
>> > > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> > > nd->start, nd->end);
>> > >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> > >+ nd->end >> PAGE_SHIFT);
>> >
>> > A node chunk in this section of code may be a hot-pluggable zone. With
>> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >
>>
>> The ranges should not get registered as active memory by
>> e820_register_active_regions() unless they are marked E820_RAM. My
>> understanding is that the regions for hotadd would be marked "reserved"
>> in the e820 map. Is that wrong?
>
> This is wrong. In a multi-node system the last node's add area will not
> be marked reserved by the e820. The e820 only defines memory <
> end_pfn. The last node's add area is > end_pfn.
>
ok, that should still be fine. As long as the ranges are not marked
"usable", add_active_range() will not be called and the holes should be
counted correctly with the patch I sent you.
> With RESERVE-based add-memory you want the add areas reported by the
> SRAT to be set up during boot like all the other pages.
>
So, do you actually expect a lot of unused mem_map to be allocated with
struct pages that are inactive until memory is hot-added in an
x86_64-specific manner? The arch-independent stuff currently will not do
that. It sets up memmap for where memory really exists. If that is not
what you expect, it will hit issues at hotadd time which is not the
current issue but one that can be fixed.
>> > > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end)
>> <
>> > > 0) {
>> > > /* Ignore hotadd region. Undo damage */
>> >
>> > I have but the e820_register_active_regions as a else to this
>> > statment the absent pages check fails.
>> >
>>
>> The patch below omits this change because I think
>> e820_register_active_regions() will still have got called by the time
>> you encounter a hotplug area.
>
> Called, but then removed in setup_arch().
By "removed", I assume you mean the active regions removed by the call
to remove_all_active_ranges() in setup_arch(). Before reserve_hotadd() is
called, e820_register_active_regions() will have reregistered the active
regions with the NUMA node id.
>> > Also nodes_cover_memory and alot of these check were based against
>> > comparing the srat data against the e820. Now all this code is
>> > comparing SRAT against SRAT....
>> >
>>
>> I don't see why. The SRAT table passes a range to
>> e820_register_active_regions() so should be comparing SRAT to e820
>
> let me go off and look at e820_register_active_regions() some more.
>
Cool
>> > I am willing to help here but we should compare the SRAT against to
>> > e820. Table v. Table.
>> >
>> > What to you think should be done?
>> >
>>
>> Can you read through this patch and see does it address the problem in any
>> way? If it doesn't, can you send a complete bootlog so I can see what is
>> being sent to reserve_hotadd()? Thanks
>
> Sure thing. It is just the hot-add area; I am guessing it is an off-by-one
> error of some sort.
>
Possible, the change to reserve_hotadd() should tell me.
> What is all this code buying us?
Less architecture-specific code across a number of architectures is the
main one.
> Since this code doesn't appear to do
> anything to help the arch out (it just increases its VM boot code
> complexity a little), maybe instead of weaving
> e820_register_active_regions() calls throughout the boot process you
> should just wait until things are sorted out and do a quick scan of
> the node data that has been set up at the end?
>
That would defeat the purpose of sizing zones and holes in an architecture
independent manner.
> What are the future plans for this api?
>
In the future, I will be releasing patches that set aside a zone (similar
to the Solaris Kernel Cage) used for easily-reclaimed pages that can be
used for growing the huge page pool at runtime (it comes under the heading
of anti-fragmentation work). The same zone could also be used to give
memory hot-remove a better success rate than 0%. These patches make the
creation of the zone relatively trivial. Without them, the
architecture-specific code is really hairy.
Other possibilities are doing stuff like handling the mem= boot parameter
in an architecture-independent manner. My understanding is that some NUMA
architectures get the handling of that argument wrong.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On 8/31/06, Mel Gorman <[email protected]> wrote:
> On Thu, 31 Aug 2006, Keith Mannthey wrote:
> > On 8/31/06, Mel Gorman <[email protected]> wrote:
> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
> >> > On 8/21/06, Mel Gorman <[email protected]> wrote:
> >> > >
> Can you confirm that happens by applying the patch I sent to you and
> checking the output? When the reserve fails, it should print out what
> range it actually checked. I want to be sure it's not checking the
> addresses 0->0x1070000000
See below
> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> >> > >
> >> > > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> >> > > nd->start, nd->end);
> >> > >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> >> > >+ nd->end >> PAGE_SHIFT);
> >> >
> >> > A node chunk in this section of code may be a hot-pluggable zone. With
> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> >> >
> >>
> >> The ranges should not get registered as active memory by
> >> e820_register_active_regions() unless they are marked E820_RAM. My
> >> understanding is that the regions for hotadd would be marked "reserved"
> >> in the e820 map. Is that wrong?
> >
> > This is wrong. In a mult-node system that last node add area will not
> > be marked reserved by the e820. The e820 only defines memory <
> > end_pfn. the last node add area is > end_pfn.
> >
>
> ok, that should still be fine. As long as the ranges are not marked
> "usable", add_active_range() will not be called and the holes should be
> counted correctly with the patch I sent you.
>
> > With RESERVE based add-memory you want the add-areas repored by the
> > srat to be setup during boot like all the other pages.
> >
>
> So, do you actually expect a lot of unused mem_map to be allocated with
> struct pages that are inactive until memory is hot-added in an
> x86_64-specific manner? The arch-independent stuff currently will not do
> that. It sets up memmap for where memory really exists. If that is not
> what you expect, it will hit issues at hotadd time which is not the
> current issue but one that can be fixed.
Yes. RESERVED based is a big waste of mem_map space. The add areas
are marked as RESERVED during boot and then later onlined during add.
It might be ok. I will play with it tomorrow. I might just need to
call add_active_range() in the right spot :)
> >> > > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end)
> >> <
> >> > > 0) {
> >> > > /* Ignore hotadd region. Undo damage */
> >> >
> >> > I have but the e820_register_active_regions as a else to this
> >> > statment the absent pages check fails.
> >> >
> >>
> >> The patch below omits this change because I think
> >> e820_register_active_regions() will still have got called by the time
> >> you encounter a hotplug area.
> >
> > called but then removed in setup arch.
>
> By "removed", I assume you mean the active regions removed by the call
> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
> called, e820_register_active_regions() will have reregistered the active
> regions with the NUMA node id.
I see e820_register_active_regions is acting as a filter against the e820
> >> > Also nodes_cover_memory and a lot of these checks were based on
> >> > comparing the SRAT data against the e820. Now all this code is
> >> > comparing SRAT against SRAT....
> >> >
> >>
> >> I don't see why. The SRAT table passes a range to
> >> e820_register_active_regions() so should be comparing SRAT to e820
> >
> > let me go off and look at e820_register_active_regions() some more.
Things get clear :)
Should be ok.
> > Sure thing. It is just the hot-add area I am guessing it is an off by
> > one error of some sort.
> >
See below. I do my e820_register_active_area as an else to the if
(hotplug.....!reserve) and the printk is easy to sort out.
I see your PFNs are in base 10. Looks like it considers the last
address to be a present page (an off-by-one thing).
Thanks,
Keith
Output below
disabling early console
Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0
(SUSE Linux)) #6 SMP Thu Aug 31 22:06:00 EDT 2006
Command line: root=/dev/sda3
ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2
showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0
debug numa=hotadd=100
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
end_pfn_map = 18219008
DMI 2.3 present.
ACPI: RSDP (v000 IBM ) @ 0x00000000000fdcf0
ACPI: RSDT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
0x000000007ff98800
ACPI: FADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
0x000000007ff98780
ACPI: MADT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
0x000000007ff98600
ACPI: SRAT (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
0x000000007ff983c0
ACPI: HPET (v001 IBM EXA01ZEU 0x00001000 IBM 0x45444f43) @
0x000000007ff98380
ACPI: SSDT (v001 IBM VIGSSDT0 0x00001000 INTL 0x20030122) @
0x000000007ff90780
ACPI: SSDT (v001 IBM VIGSSDT1 0x00001000 INTL 0x20030122) @
0x000000007ff88bc0
ACPI: DSDT (v001 IBM EXA01ZEU 0x00001000 INTL 0x20030122) @
0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
reserve_hotadd called with node 0 sart 470000000 end 1070000000
SRAT: Hotplug area has existing memory
Entering add_active_range(0, 0, 152) 3 entries of 3200 used
Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-1160000000
Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-3200000000
reserve_hotadd called with node 1 sart 1160000000 end 3200000000
SRAT: Hotplug area has existing memory
Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000001070000000
Bootmem setup node 1 0000001070000000-0000001160000000
Zone PFN ranges:
DMA 0 -> 4096
DMA32 4096 -> 1048576
Normal 1048576 -> 18219008
early_node_map[4] active PFN ranges
0: 0 -> 152
0: 256 -> 524165
0: 1048576 -> 4653056
1: 17235968 -> 18219008
On node 0 totalpages: 4128541
0 pages used for SPARSE memmap
1149 pages DMA reserved
DMA zone: 2843 pages, LIFO batch:0
0 pages used for SPARSE memmap
DMA32 zone: 520069 pages, LIFO batch:31
0 pages used for SPARSE memmap
Normal zone: 3604480 pages, LIFO batch:31
On node 1 totalpages: 983040
0 pages used for SPARSE memmap
0 pages used for SPARSE memmap
0 pages used for SPARSE memmap
Normal zone: 983040 pages, LIFO batch:31
On Thu, 31 Aug 2006, Keith Mannthey wrote:
> On 8/31/06, Mel Gorman <[email protected]> wrote:
>> On Thu, 31 Aug 2006, Keith Mannthey wrote:
>> > On 8/31/06, Mel Gorman <[email protected]> wrote:
>> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> >> > On 8/21/06, Mel Gorman <[email protected]> wrote:
>> >> > >
>
>> Can you confirm that happens by applying the patch I sent to you and
>> checking the output? When the reserve fails, it should print out what
>> range it actually checked. I want to be sure it's not checking the
>> addresses 0->0x1070000000
>
> See below
>
Perfect, thanks a lot. I should have enough to reproduce without a test
machine what is going on and develop the required patches.
>> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> >> > >
>> >> > > printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> >> > > nd->start, nd->end);
>> >> > >+ e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> >> > >+ nd->end >> PAGE_SHIFT);
>> >> >
>> >> > A node chunk in this section of code may be a hot-pluggable zone. With
>> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >> >
>> >>
>> >> The ranges should not get registered as active memory by
>> >> e820_register_active_regions() unless they are marked E820_RAM. My
>> >> understanding is that the regions for hotadd would be marked "reserved"
>> >> in the e820 map. Is that wrong?
>> >
>> > This is wrong. In a multi-node system the last node's add area will not
>> > be marked reserved by the e820. The e820 only defines memory <
>> > end_pfn. The last node's add area is > end_pfn.
>> >
>>
>> ok, that should still be fine. As long as the ranges are not marked
>> "usable", add_active_range() will not be called and the holes should be
>> counted correctly with the patch I sent you.
>>
>> > With RESERVE based add-memory you want the add-areas reported by the
>> > SRAT to be set up during boot like all the other pages.
>> >
>>
>> So, do you actually expect a lot of unused mem_map to be allocated with
>> struct pages that are inactive until memory is hot-added in an
>> x86_64-specific manner? The arch-independent stuff currently will not do
>> that. It sets up memmap for where memory really exists. If that is not
>> what you expect, it will hit issues at hotadd time which is not the
>> current issue but one that can be fixed.
>
> Yes. RESERVED based is a big waste of mem_map space.
Right, it's all very clear now. At some point in the future, I'd like to
visit why SPARSEMEM-based hot-add is not always used but it's a separate
issue.
> The add areas
> are marked as RESERVED during boot and then later onlined during add.
That explains the reserve_bootmem_node()
> It might be ok. I will play with it tomorrow. I might just need to
> call add_active_range in the right spot :)
>
I'll play with this from the opposite perspective - what is required for
any arch using this API to have spare memmap for reserve-based hot-add.
>> >> > > if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
>> >> > > /* Ignore hotadd region. Undo damage */
>> >> >
>> >> > I have put the e820_register_active_regions as an else to this
>> >> > statement; the absent pages check fails.
>> >> >
>> >>
>> >> The patch below omits this change because I think
>> >> e820_register_active_regions() will still have got called by the time
>> >> you encounter a hotplug area.
>> >
>> > called but then removed in setup arch.
>>
>> By "removed", I assume you mean the active regions removed by the call
>> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
>> called, e820_register_active_regions() will have reregistered the active
>> regions with the NUMA node id.
>
> I see e820_register_active_regions is acting as a filter against the e820
>
Yes. A range of PFNs is given to e820_register_active_regions() and it
reads the e820 for E820_RAM sections within that range.
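To make the filtering concrete, here is a minimal userspace sketch of the
idea. It is not the kernel implementation: the struct layouts, the helper
name filter_e820_ram() and the rounding choice (start rounded up, end
rounded down to a page boundary) are assumptions made for illustration.
Only E820_RAM entries that overlap the requested PFN range produce output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed 4K pages, as on x86_64 */
#define E820_RAM	1
#define E820_RESERVED	2

/* Hypothetical, simplified stand-ins for the kernel structures */
struct e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Walk the e820 map and emit the parts of each E820_RAM entry that fall
 * inside [start_pfn, end_pfn). Returns the number of ranges written.
 */
static size_t filter_e820_ram(const struct e820_entry *map, size_t nr_map,
			      unsigned long start_pfn, unsigned long end_pfn,
			      struct pfn_range *out, size_t max_out)
{
	size_t used = 0;
	size_t i;

	assert(start_pfn <= end_pfn);

	for (i = 0; i < nr_map && used < max_out; i++) {
		uint64_t page_mask = ((uint64_t)1 << PAGE_SHIFT) - 1;
		unsigned long s, e;

		if (map[i].type != E820_RAM)
			continue;

		/* Only whole pages count: round start up, end down */
		s = (unsigned long)((map[i].addr + page_mask) >> PAGE_SHIFT);
		e = (unsigned long)((map[i].addr + map[i].size) >> PAGE_SHIFT);

		/* Clip to the range the caller asked about */
		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s >= e)
			continue;

		out[used].start_pfn = s;
		out[used].end_pfn = e;
		used++;
	}

	return used;
}
```

Fed the usable entries from Keith's e820 map and the SRAT range
0-80000000, this yields the PFN ranges 0 -> 152 and 256 -> 524165, which
matches the add_active_range() lines in the boot log. A hot-add range
such as 470000000-1070000000 yields nothing because no E820_RAM entry
overlaps it.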
>> >> > Also nodes_cover_memory and a lot of these checks were based on
>> >> > comparing the SRAT data against the e820. Now all this code is
>> >> > comparing SRAT against SRAT....
>> >> >
>> >>
>> >> I don't see why. The SRAT table passes a range to
>> >> e820_register_active_regions() so should be comparing SRAT to e820
>> >
>> > let me go off and look at e820_register_active_regions() some more.
> Things get clear :)
>
> Should be ok.
>
nice.
>> > Sure thing. It is just the hot-add area I am guessing it is an off by
>> > one error of some sort.
>> >
> See below. I do my e820_register_active_area as an else to the if
> (hotplug.....!reserve) and the printk is easy to sort out.
>
yep, it should be easy to put this into a test program.
> I see your PFNs are in base 10. Looks like it considers the last
> address to be a present page (an off-by-one thing).
>
Probably. I'll start poking now. Thanks a lot.
> Thanks,
> Keith
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
>
> Right, it's all very clear now. At some point in the future, I'd like
> to visit why SPARSEMEM-based hot-add is not always used but it's a
> separate issue.
>
>> The add areas
>> are marked as RESERVED during boot and then later onlined during add.
>
> That explains the reserve_bootmem_node()
>
But pages are marked reserved by default. You still have to alloc the
bootmem map for the whole node range, including reserve hot add
areas and areas beyond e820-end-of-ram. So all the areas are already
reserved, until freed.
--Mika
On (31/08/06 20:08), Keith Mannthey didst pronounce:
> >So, do you actually expect a lot of unused mem_map to be allocated with
> >struct pages that are inactive until memory is hot-added in an
> >x86_64-specific manner? The arch-independent stuff currently will not do
> >that. It sets up memmap for where memory really exists. If that is not
> >what you expect, it will hit issues at hotadd time which is not the
> >current issue but one that can be fixed.
>
> Yes. RESERVED based is a big waste of mem_map space. The add areas
> are marked as RESERVED during boot and then later onlined during add.
> It might be ok. I will play with it tomorrow. I might just need to
> call add_active_range in the right spot :)
>
Following this mail should be two patches that may address the problem
with reserved memory hot-add. One assumption made by arch-independent
zone-sizing was that the only memory holes of interest were those before the
end of physical memory. Another assumption was that mem_map should only be
allocated for memory that was physically present in the machine.
With MEMORY_HOTPLUG_RESERVE on x86_64, these assumptions do not hold. This
feature expects that mem_map is allocated at boot time and later activated
on a memory hot-add event. To determine if the region is usable for hot-add
in the future, holes are calculated beyond the end of physical memory.
The following two patches fix these two assumptions. They have been boot-tested
on a range of hardware (x86, ppc64, ia64 and x86_64) so there should be no
new regressions.
I don't have access to hardware that can use MEMORY_HOTPLUG_RESERVE so I'd
appreciate hearing if the patches work. I wrote a test program that simulated
the input from the machine the problem was reported on. It registers active
memory and simulates the check made by reserve_hotadd(). push_node_boundaries()
is called to push the end of the node out by 100 pages like what SRAT would
do for reserve hot-add and it appears to do the right thing. Output is below.
mel@arnold:~/patches/brokenout/zonesizing/driver_test$ gcc driver_test.c -o
driver_test && ./driver_test | grep -v "active with" | grep -v
account_node_boundary
Stage 1: Registering active ranges
Entering add_active_range(0, 0, 152) 0 entries of 96 used
Entering add_active_range(0, 256, 524165) 1 entries of 96 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 96 used
Entering add_active_range(1, 17235968, 18219008) 3 entries of 96 used
Dumping active map
0: 0 0 -> 152
1: 0 256 -> 524165
2: 0 1048576 -> 4653056
3: 1 17235968 -> 18219008
Entering push_node_boundaries(0, 0, 4653156)
Checking reserve-hotadd
absent_pages_in_range(4653056, 17235968) == 17235968 - 4653056 == 12582912
absent_pages_in_range(18219008, 52428800) == 52428800 - 18219008 == 34209792
Stage 2: Calculating zone sizes and holes
Stage 3: Dumping zone sizes and holes
zone_size[0][0] = 4096 zone_holes[0][0] = 104
zone_size[0][1] = 1044480 zone_holes[0][1] = 524411
zone_size[0][2] = 3604580 zone_holes[0][2] = 100
zone_size[1][2] = 983040 zone_holes[1][2] = 0
Stage 4: Printing present pages
On node 0, 4128541 pages
zone 0 present_pages = 3992
zone 1 present_pages = 520069
zone 2 present_pages = 3604480
On node 1, 983040 pages
zone 2 present_pages = 983040
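The hole figures in the output above can be cross-checked with a small
standalone sketch. This is not the kernel's __absent_pages_in_range();
it uses a simpler but equivalent definition (pages in the range minus
pages covered by an active range) and assumes the registered ranges do
not overlap each other. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an early_node_map[] entry */
struct active_range {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Count the pages in [range_start_pfn, range_end_pfn) that are not
 * covered by any active range. Anything before the first range or past
 * the last one is counted as a hole, which is what reserve_hotadd()
 * relies on.
 */
static unsigned long absent_pages(const struct active_range *map, size_t nr,
				  unsigned long range_start_pfn,
				  unsigned long range_end_pfn)
{
	unsigned long present = 0;
	size_t i;

	assert(range_start_pfn <= range_end_pfn);

	for (i = 0; i < nr; i++) {
		unsigned long s = map[i].start_pfn;
		unsigned long e = map[i].end_pfn;

		/* Clip the active range to the range being queried */
		if (s < range_start_pfn)
			s = range_start_pfn;
		if (e > range_end_pfn)
			e = range_end_pfn;
		if (s < e)
			present += e - s;
	}

	return (range_end_pfn - range_start_pfn) - present;
}
```

With the four active ranges dumped in Stage 1, the range 4653056 ->
17235968 is entirely absent (12582912 pages), so the hot-add area passes
the check, and the three zones on node 0 have 104, 524411 and 100 absent
pages when the Normal zone is pushed out to 4653156.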
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
absent_pages_in_range() made the assumption that users of the API would
not care about holes beyond the end of physical memory. This was not the
case. This patch will account for ranges outside of physical memory as
holes correctly.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm3-001_account_holes_range/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c 2006-08-28 15:05:28.000000000 +0100
+++ linux-2.6.18-rc4-mm3-001_account_holes_range/arch/x86_64/mm/srat.c 2006-09-01 13:29:25.000000000 +0100
@@ -240,7 +240,9 @@ static int reserve_hotadd(int node, unsi
/* This check might be a bit too strict, but I'm keeping it for now. */
if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
- printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
+ printk(KERN_ERR
+ "SRAT: Hotplug area %lu -> %lu has existing memory\n",
+ s_pfn, e_pfn);
return -1;
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c linux-2.6.18-rc4-mm3-001_account_holes_range/mm/page_alloc.c
--- linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c 2006-08-28 15:05:30.000000000 +0100
+++ linux-2.6.18-rc4-mm3-001_account_holes_range/mm/page_alloc.c 2006-09-01 13:29:25.000000000 +0100
@@ -2259,6 +2259,10 @@ unsigned long __init __absent_pages_in_r
if (i == -1)
return 0;
+ /* Account for ranges before physical memory on this node */
+ if (early_node_map[i].start_pfn > range_start_pfn)
+ hole_pages = early_node_map[i].start_pfn - range_start_pfn;
+
prev_end_pfn = early_node_map[i].start_pfn;
/* Find all holes for the zone within the node */
@@ -2280,6 +2284,11 @@ unsigned long __init __absent_pages_in_r
prev_end_pfn = early_node_map[i].end_pfn;
}
+ /* Account for ranges past physical memory on this node */
+ if (range_end_pfn > prev_end_pfn)
+ hole_pages += range_end_pfn -
+ max(range_start_pfn, prev_end_pfn);
+
return hole_pages;
}
@@ -2301,9 +2310,16 @@ unsigned long __init zone_absent_pages_i
unsigned long zone_type,
unsigned long *ignored)
{
- return __absent_pages_in_range(nid,
- arch_zone_lowest_possible_pfn[zone_type],
- arch_zone_highest_possible_pfn[zone_type]);
+ unsigned long node_start_pfn, node_end_pfn;
+ unsigned long zone_start_pfn, zone_end_pfn;
+
+ get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+ zone_start_pfn = max(arch_zone_lowest_possible_pfn[zone_type],
+ node_start_pfn);
+ zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type],
+ node_end_pfn);
+
+ return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
}
/* Return the zone index a PFN is in */
Arch-independent zone-sizing determines the size of a node
(pgdat->node_spanned_pages) based on the physical memory that was registered
by the architecture. However, when CONFIG_MEMORY_HOTPLUG_RESERVE is set,
the architecture expects that the spanned_pages will be much larger and that
mem_map will be allocated that is used later on memory hot-add.
This patch allows an architecture that sets CONFIG_MEMORY_HOTPLUG_RESERVE
to call push_node_boundaries() which will set the node beginning and end to
at *least* the requested boundary.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-001_account_holes_range/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm3-002_push_node_boundaries/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm3-001_account_holes_range/arch/x86_64/mm/srat.c 2006-09-01 13:29:25.000000000 +0100
+++ linux-2.6.18-rc4-mm3-002_push_node_boundaries/arch/x86_64/mm/srat.c 2006-09-01 13:30:57.000000000 +0100
@@ -334,6 +334,8 @@ acpi_numa_memory_affinity_init(struct ac
nd->start, nd->end);
e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
nd->end >> PAGE_SHIFT);
+ push_node_boundaries(node, nd->start >> PAGE_SHIFT,
+ nd->end >> PAGE_SHIFT);
if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
/* Ignore hotadd region. Undo damage */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-001_account_holes_range/include/linux/mm.h linux-2.6.18-rc4-mm3-002_push_node_boundaries/include/linux/mm.h
--- linux-2.6.18-rc4-mm3-001_account_holes_range/include/linux/mm.h 2006-08-28 15:05:30.000000000 +0100
+++ linux-2.6.18-rc4-mm3-002_push_node_boundaries/include/linux/mm.h 2006-09-01 13:30:57.000000000 +0100
@@ -1007,6 +1007,8 @@ extern void add_active_range(unsigned in
unsigned long end_pfn);
extern void shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
unsigned long new_end_pfn);
+extern void push_node_boundaries(unsigned int nid, unsigned long start_pfn,
+ unsigned long end_pfn);
extern void remove_all_active_ranges(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-001_account_holes_range/mm/page_alloc.c linux-2.6.18-rc4-mm3-002_push_node_boundaries/mm/page_alloc.c
--- linux-2.6.18-rc4-mm3-001_account_holes_range/mm/page_alloc.c 2006-09-01 13:29:25.000000000 +0100
+++ linux-2.6.18-rc4-mm3-002_push_node_boundaries/mm/page_alloc.c 2006-09-01 13:30:57.000000000 +0100
@@ -131,6 +131,10 @@ static unsigned long __initdata dma_rese
int __initdata nr_nodemap_entries;
unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
+ unsigned long __initdata node_boundary_start_pfn[MAX_NUMNODES];
+ unsigned long __initdata node_boundary_end_pfn[MAX_NUMNODES];
+#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
#ifdef CONFIG_DEBUG_VM
@@ -2186,6 +2190,62 @@ void __init sparse_memory_present_with_a
}
/**
+ * push_node_boundaries - Push node boundaries to at least the requested boundary
+ * @nid: The nid of the node to push the boundary for
+ * @start_pfn: The start pfn of the node
+ * @end_pfn: The end pfn of the node
+ *
+ * In reserve-based hot-add, mem_map is allocated that is unused until hotadd
+ * time. Specifically, on x86_64, SRAT will report ranges that can potentially
+ * be hotplugged even though no physical memory exists. This function allows
+ * an arch to push out the node boundaries so mem_map is allocated that can
+ * be used later.
+ */
+#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
+void __init push_node_boundaries(unsigned int nid,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ printk(KERN_DEBUG "Entering push_node_boundaries(%u, %lu, %lu)\n",
+ nid, start_pfn, end_pfn);
+
+ /* Initialise the boundary for this node if necessary */
+ if (node_boundary_end_pfn[nid] == 0)
+ node_boundary_start_pfn[nid] = -1UL;
+
+ /* Update the boundaries */
+ if (node_boundary_start_pfn[nid] > start_pfn)
+ node_boundary_start_pfn[nid] = start_pfn;
+ if (node_boundary_end_pfn[nid] < end_pfn)
+ node_boundary_end_pfn[nid] = end_pfn;
+}
+
+/* If necessary, push the node boundary out for reserve hotadd */
+static void __init account_node_boundary(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ printk(KERN_DEBUG "Entering account_node_boundary(%u, %lu, %lu)\n",
+ nid, *start_pfn, *end_pfn);
+
+ /* Return if boundary information has not been provided */
+ if (node_boundary_end_pfn[nid] == 0)
+ return;
+
+ /* Check the boundaries and update if necessary */
+ if (node_boundary_start_pfn[nid] < *start_pfn)
+ *start_pfn = node_boundary_start_pfn[nid];
+ if (node_boundary_end_pfn[nid] > *end_pfn)
+ *end_pfn = node_boundary_end_pfn[nid];
+}
+#else
+void __init push_node_boundaries(unsigned int nid,
+ unsigned long start_pfn, unsigned long end_pfn) {}
+
+static void __init account_node_boundary(unsigned int nid,
+ unsigned long *start_pfn, unsigned long *end_pfn) {}
+#endif
+
+
+/**
* get_pfn_range_for_nid - Return the start and end page frames for a node
* @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned
* @start_pfn: Passed by reference. On return, it will have the node start_pfn
@@ -2212,6 +2272,9 @@ void __init get_pfn_range_for_nid(unsign
printk(KERN_WARNING "Node %u active with no memory\n", nid);
*start_pfn = 0;
}
+
+ /* Push the node boundaries out if requested */
+ account_node_boundary(nid, start_pfn, end_pfn);
}
/*
@@ -2655,6 +2718,10 @@ void __init remove_all_active_ranges()
{
memset(early_node_map, 0, sizeof(early_node_map));
nr_nodemap_entries = 0;
+#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE
+ memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn));
+ memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn));
+#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
}
/* Compare two active node_active_regions */
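For anyone who wants to check the boundary logic outside the kernel, the
following is a userspace sketch that mirrors push_node_boundaries() and
account_node_boundary() from the patch above. The printk calls and the
__init annotations are dropped, and MAX_NUMNODES is set to a small
arbitrary value here; both are assumptions for illustration only.

```c
#include <assert.h>

#define MAX_NUMNODES 4	/* arbitrary small value for this sketch */

static unsigned long node_boundary_start_pfn[MAX_NUMNODES];
static unsigned long node_boundary_end_pfn[MAX_NUMNODES];

/* Record that node nid must span at least [start_pfn, end_pfn) */
static void push_node_boundaries(unsigned int nid,
				 unsigned long start_pfn,
				 unsigned long end_pfn)
{
	assert(nid < MAX_NUMNODES);

	/* An end of 0 means no boundary has been recorded for this node */
	if (node_boundary_end_pfn[nid] == 0)
		node_boundary_start_pfn[nid] = -1UL;

	/* Boundaries only ever grow outwards */
	if (node_boundary_start_pfn[nid] > start_pfn)
		node_boundary_start_pfn[nid] = start_pfn;
	if (node_boundary_end_pfn[nid] < end_pfn)
		node_boundary_end_pfn[nid] = end_pfn;
}

/* Widen a node's PFN range to any boundary that was pushed for it */
static void account_node_boundary(unsigned int nid,
				  unsigned long *start_pfn,
				  unsigned long *end_pfn)
{
	assert(nid < MAX_NUMNODES);

	/* Nothing to do if no boundary information was provided */
	if (node_boundary_end_pfn[nid] == 0)
		return;

	if (node_boundary_start_pfn[nid] < *start_pfn)
		*start_pfn = node_boundary_start_pfn[nid];
	if (node_boundary_end_pfn[nid] > *end_pfn)
		*end_pfn = node_boundary_end_pfn[nid];
}
```

Calling push_node_boundaries(0, 0, 4653156) and then passing node 0's
registered range 0 -> 4653056 through account_node_boundary() moves the
end out to 4653156, as in the test program output earlier in the thread.
A node that never had a boundary pushed keeps its registered range.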