From: Robert Picco
Date: Wed, 17 Mar 2004 14:30:34 -0500
To: "Martin J. Bligh"
Cc: Jesse Barnes, linux-kernel@vger.kernel.org, colpatch@us.ibm.com, haveblue@us.ibm.com
Subject: Re: boot time node and memory limit options

Martin J. Bligh wrote:

>>I did something like this before my posted patch, in the IA64 ACPI NUMA
>>memory initialization code.  It wasn't posted or even reviewed by peers.
>>Your patch below basically trims the NUMA node memory information before
>>the x86 discontig code calls the bootmem initialization routines.  The
>>problem with a solution at this level is that each architecture (at least
>>the ones I've looked at) handles low-level memory initialization
>>differently, and there needs to be a common way to parse the early boot
>>arguments.
>>
>>The patch I posted came about after some people suggested an
>>architecture-independent approach.  It basically allocates memory from
>>the bootmem allocator before mem_init calls free_all_bootmem_core, so it
>>is architecture independent.  If the real goal is to limit physical
>>memory before the bootmem allocator is initialized, then my current
>>patch doesn't accomplish that.
>
>Mmmm.  That does worry me somewhat, as it's possible to allocate large
>amounts of bootmem for hash tables, etc., IIRC.  I think that's too late
>to restrict things accurately.  The fact that we only have bootmem on
>node 0 on ia32 isn't going to help matters either ;-)

I agree there are sizing issues for the hash tables at boot.  I've seen
them all recover when an allocation sized from num_physpages fails, by
retrying with smaller allocations until one succeeds.  All of the primary
initialization allocations recover that way, but probably not all drivers
do.  You could hit similar failure scenarios with any boot-line parameter
implementation that reduces memory.
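
(For reference, the recovery pattern I mean looks roughly like the sketch
below.  The "example" names are made up, but fs/inode.c and fs/dcache.c do
essentially this at boot.)

/*
 * Illustrative sketch only -- the "example" names do not exist anywhere.
 * This is roughly the recovery pattern the inode/dentry hash setup uses:
 * size the request from the number of physical pages, then fall back to
 * smaller orders on failure.
 */
#include <linux/init.h>
#include <linux/list.h>
#include <linux/mm.h>

static struct hlist_head *example_hashtable;

static void __init example_hash_init(unsigned long mempages)
{
	unsigned long goal = mempages * sizeof(struct hlist_head) / 16;
	int order;

	/* First guess scales with the amount of physical memory ... */
	for (order = 0; (PAGE_SIZE << order) < goal; order++)
		;

	/* ... and each failure halves the request until one succeeds. */
	do {
		example_hashtable = (struct hlist_head *)
			__get_free_pages(GFP_ATOMIC, order);
	} while (example_hashtable == NULL && --order >= 0);

	/* A real caller would check for total failure and fill in the
	 * hash heads here. */
}
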
>Don't we have the same arch-dependent issue with the current mem= anyway?
>Can we come up with something where the arch code calls back into a generic
>function to derive limitations, and thereby at least get the parsing done
>in a common routine for consistency?  There aren't *that* many NUMA arches
>to change anyway ...

Well, this is heading in the direction Dave has proposed, and it's probably
2.7 material.  It would solve the problem differently than my proposed
patch does.
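
Just to make the common-routine idea concrete, here is the sort of thing I
picture -- a rough sketch only; none of these names exist in the tree, and
the option names simply mirror the ones in my patch:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/page.h>

/* Illustrative only: one shared place for the limits every arch could honour. */
struct numa_mem_limits {
	unsigned long max_pages;		/* from mem_limit= */
	unsigned long max_pages_per_node;	/* from node_mem_limit= */
	unsigned int max_nodes;			/* from nodes_limit= */
};

/*
 * Each arch's early command-line parser would hand individual option
 * strings here instead of open-coding the parsing per architecture.
 */
void __init parse_numa_limit_option(char *from, struct numa_mem_limits *lim)
{
	if (!memcmp(from, "mem_limit=", 10))
		lim->max_pages = memparse(from + 10, &from) >> PAGE_SHIFT;
	else if (!memcmp(from, "node_mem_limit=", 15))
		lim->max_pages_per_node =
				memparse(from + 15, &from) >> PAGE_SHIFT;
	else if (!memcmp(from, "nodes_limit=", 12))
		lim->max_nodes = simple_strtoul(from + 12, &from, 0);
}

The arch code would still decide where to apply the limits (e820, SRAT,
EFI memmap, and so on); the helper would only centralize the string
handling.
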
thanks,

Bob

>M.

>>Bob
>>Martin J. Bligh wrote:
>>
>>>--On Tuesday, March 16, 2004 09:43:29 -0800 Jesse Barnes wrote:
>>>
>>>>On Tue, Mar 16, 2004 at 12:28:10PM -0500, Robert Picco wrote:
>>>>
>>>>>This patch supports three boot line options.  mem_limit limits the
>>>>>amount of physical memory.  node_mem_limit limits the amount of
>>>>>physical memory per node on a NUMA machine.  nodes_limit reduces the
>>>>>number of NUMA nodes to the value specified.  On a NUMA machine an
>>>>>eliminated node's CPU(s) are removed from the cpu_possible_map.
>>>>>
>>>>>The patch has been tested on an IA64 NUMA machine and uniprocessor X86
>>>>>machine.
>>>>
>>>>I think this patch will be really useful.  Matt and Martin, does it look
>>>>ok to you?  Given that discontiguous support is pretty platform specific
>>>>right now, I thought it might be less code if it was done in arch/, but
>>>>a platform independent version is awfully nice...
>>>
>>>I haven't looked at your code yet, but I've had a similar patch in my
>>>tree from Dave Hansen for a while you might want to look at:
>>>
>>>diff -purN -X /home/mbligh/.diff.exclude 320-kcg/arch/i386/kernel/numaq.c 330-numa_mem_equals/arch/i386/kernel/numaq.c
>>>--- 320-kcg/arch/i386/kernel/numaq.c	2003-10-01 11:47:33.000000000 -0700
>>>+++ 330-numa_mem_equals/arch/i386/kernel/numaq.c	2004-03-14 09:54:00.000000000 -0800
>>>@@ -42,6 +42,10 @@ extern long node_start_pfn[], node_end_p
>>>  * function also increments numnodes with the number of nodes (quads)
>>>  * present.
>>>  */
>>>+extern unsigned long max_pages_per_node;
>>>+extern int limit_mem_per_node;
>>>+
>>>+#define node_size_pages(n) (node_end_pfn[n] - node_start_pfn[n])
>>> static void __init smp_dump_qct(void)
>>> {
>>> 	int node;
>>>@@ -60,6 +64,8 @@ static void __init smp_dump_qct(void)
>>> 				eq->hi_shrd_mem_start - eq->priv_mem_size);
>>> 			node_end_pfn[node] = MB_TO_PAGES(
>>> 				eq->hi_shrd_mem_start + eq->hi_shrd_mem_size);
>>>+			if (node_size_pages(node) > max_pages_per_node)
>>>+				node_end_pfn[node] = node_start_pfn[node] + max_pages_per_node;
>>> 		}
>>> 	}
>>> }
>>>diff -purN -X /home/mbligh/.diff.exclude 320-kcg/arch/i386/kernel/setup.c 330-numa_mem_equals/arch/i386/kernel/setup.c
>>>--- 320-kcg/arch/i386/kernel/setup.c	2004-03-11 14:33:36.000000000 -0800
>>>+++ 330-numa_mem_equals/arch/i386/kernel/setup.c	2004-03-14 09:54:00.000000000 -0800
>>>@@ -142,7 +142,7 @@ static void __init probe_roms(void)
>>> 	probe_extension_roms(roms);
>>> }
>>> 
>>>-static void __init limit_regions(unsigned long long size)
>>>+void __init limit_regions(unsigned long long size)
>>> {
>>> 	unsigned long long current_addr = 0;
>>> 	int i;
>>>@@ -478,6 +478,7 @@ static void __init setup_memory_region(v
>>> 	print_memory_map(who);
>>> } /* setup_memory_region */
>>> 
>>>+unsigned long max_pages_per_node = 0xFFFFFFFF;
>>> 
>>> static void __init parse_cmdline_early (char ** cmdline_p)
>>> {
>>>@@ -521,6 +522,14 @@ static void __init parse_cmdline_early (
>>> 				userdef=1;
>>> 			}
>>> 		}
>>>+
>>>+		if (c == ' ' && !memcmp(from, "memnode=", 8)) {
>>>+			unsigned long long node_size_bytes;
>>>+			if (to != command_line)
>>>+				to--;
>>>+			node_size_bytes = memparse(from+8, &from);
>>>+			max_pages_per_node = node_size_bytes >> PAGE_SHIFT;
>>>+		}
>>> 
>>> 		if (c == ' ' && !memcmp(from, "memmap=", 7)) {
>>> 			if (to != command_line)
>>>diff -purN -X /home/mbligh/.diff.exclude 320-kcg/arch/i386/kernel/srat.c 330-numa_mem_equals/arch/i386/kernel/srat.c
>>>--- 320-kcg/arch/i386/kernel/srat.c	2003-10-01 11:47:33.000000000 -0700
>>>+++ 330-numa_mem_equals/arch/i386/kernel/srat.c	2004-03-14 09:54:01.000000000 -0800
>>>@@ -53,6 +53,10 @@ struct node_memory_chunk_s {
>>> };
>>> static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
>>> 
>>>+#define chunk_start(i) (node_memory_chunk[i].start_pfn)
>>>+#define chunk_end(i) (node_memory_chunk[i].end_pfn)
>>>+#define chunk_size(i) (chunk_end(i)-chunk_start(i))
>>>+
>>> static int num_memory_chunks;		/* total number of memory chunks */
>>> static int zholes_size_init;
>>> static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
>>>@@ -198,6 +202,9 @@ static void __init initialize_physnode_m
>>> 	}
>>> }
>>> 
>>>+extern unsigned long max_pages_per_node;
>>>+extern int limit_mem_per_node;
>>>+
>>> /* Parse the ACPI Static Resource Affinity Table */
>>> static int __init acpi20_parse_srat(struct acpi_table_srat *sratp)
>>> {
>>>@@ -281,23 +288,27 @@ static int __init acpi20_parse_srat(stru
>>> 			node_memory_chunk[j].start_pfn,
>>> 			node_memory_chunk[j].end_pfn);
>>> 	}
>>>-
>>>+
>>> 	/*calculate node_start_pfn/node_end_pfn arrays*/
>>> 	for (nid = 0; nid < numnodes; nid++) {
>>>-		int been_here_before = 0;
>>>+		unsigned long node_present_pages = 0;
>>> 
>>>+		node_start_pfn[nid] = -1;
>>> 		for (j = 0; j < num_memory_chunks; j++){
>>>-			if (node_memory_chunk[j].nid == nid) {
>>>-				if (been_here_before == 0) {
>>>-					node_start_pfn[nid] = node_memory_chunk[j].start_pfn;
>>>-					node_end_pfn[nid] = node_memory_chunk[j].end_pfn;
>>>-					been_here_before = 1;
>>>-				} else { /* We've found another chunk of memory for the node */
>>>-					if (node_start_pfn[nid] < node_memory_chunk[j].start_pfn) {
>>>-						node_end_pfn[nid] = node_memory_chunk[j].end_pfn;
>>>-					}
>>>-				}
>>>-			}
>>>+			unsigned long proposed_size;
>>>+
>>>+			if (node_memory_chunk[j].nid != nid)
>>>+				continue;
>>>+
>>>+			proposed_size = node_present_pages + chunk_size(j);
>>>+			if (proposed_size > max_pages_per_node)
>>>+				chunk_end(j) = chunk_start(j) +
>>>+					max_pages_per_node - node_present_pages;
>>>+			node_present_pages += chunk_size(j);
>>>+
>>>+			if (node_start_pfn[nid] == -1)
>>>+				node_start_pfn[nid] = chunk_start(j);
>>>+			node_end_pfn[nid] = chunk_end(j);
>>> 		}
>>> 	}
>>> 	return 1;
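
P.S.  To convince myself the reworked srat.c loop above does what I expect,
I hacked up a throwaway userspace mock-up of the clamping arithmetic.  It
is purely illustrative -- the chunk layout and the cap are made-up numbers:

/* Standalone illustration of the per-node clamping in the srat.c hunk. */
#include <stdio.h>

struct chunk { unsigned long start_pfn, end_pfn; };

int main(void)
{
	/* Two made-up chunks on one node, 0x40000 pages (1GB of 4K pages) each. */
	struct chunk chunks[] = {
		{ 0x00000,  0x40000 },
		{ 0x100000, 0x140000 },
	};
	unsigned long max_pages_per_node = 0x60000;	/* 1.5GB cap */
	unsigned long present = 0, node_start = -1UL, node_end = 0;
	int i, nchunks = sizeof(chunks) / sizeof(chunks[0]);

	for (i = 0; i < nchunks; i++) {
		unsigned long size = chunks[i].end_pfn - chunks[i].start_pfn;

		/* Truncate the chunk once the running total would pass the cap. */
		if (present + size > max_pages_per_node)
			chunks[i].end_pfn = chunks[i].start_pfn +
					    max_pages_per_node - present;
		present += chunks[i].end_pfn - chunks[i].start_pfn;

		if (node_start == -1UL)
			node_start = chunks[i].start_pfn;
		node_end = chunks[i].end_pfn;
	}
	printf("node spans pfn %#lx-%#lx, %#lx pages present\n",
	       node_start, node_end, present);
	return 0;
}

With those inputs the second chunk is truncated at the cap, so the node
ends up spanning pfns 0-0x120000 with 0x60000 pages present.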