2008-01-18 18:34:39

by Mike Travis

[permalink] [raw]
Subject: [PATCH 4/5] x86: Add config variables for SMP_MAX

Adds and increases some config variables to accomodate larger SMP
configurations:

NR_CPUS: max limit now 4096
NODES_SHIFT: max limit now 9
THREAD_ORDER: max limit now 3
X86_SMP_MAX: say Y to enable possible cpus == NR_CPUS

Signed-off-by: Mike Travis <[email protected]>
---
arch/x86/Kconfig | 17 ++++++++++++++---
arch/x86/Kconfig.debug | 9 +++++++++
arch/x86/kernel/smpboot_64.c | 4 ++++
include/asm-x86/page_64.h | 4 ++++
4 files changed, 31 insertions(+), 3 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -487,19 +487,29 @@ config ARCH_SUPPORTS_KVM


config NR_CPUS
- int "Maximum number of CPUs (2-255)"
- range 2 255
+ int "Maximum number of CPUs (2-4096)"
+ range 2 4096
depends on SMP
+ default "1024" if X86_SMP_MAX
default "32" if X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000
default "8"
help
This allows you to specify the maximum number of CPUs which this
- kernel will support. The maximum supported value is 255 and the
+ kernel will support. The maximum supported value is 4096 and the
minimum value which makes sense is 2.

This is purely to save memory - each supported CPU adds
approximately eight kilobytes to the kernel image.

+config THREAD_ORDER
+ int "Kernel stack size (in page order)"
+ range 1 3
+ depends on X86_64_SMP
+ default "3" if X86_SMP_MAX
+ default "1"
+ help
+ Increases kernel stack size.
+
config SCHED_SMT
bool "SMT (Hyperthreading) scheduler support"
depends on (X86_64 && SMP) || (X86_32 && X86_HT)
@@ -882,6 +892,7 @@ config NUMA_EMU

config NODES_SHIFT
int
+ default "9" if X86_SMP_MAX
default "6" if X86_64
default "4" if X86_NUMAQ
default "3"
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -73,6 +73,15 @@ config X86_FIND_SMP_CONFIG
depends on X86_LOCAL_APIC || X86_VOYAGER
depends on X86_32

+config X86_SMP_MAX
+ bool "Enable Maximum SMP configuration"
+ def_bool n
+ depends on X86_64_SMP
+ help
+ Say Y here to enable a "large" SMP configuration for testing
+ purposes. It does this by increasing the number of possible
+ cpus to the NR_CPUS count.
+
config X86_MPPARSE
def_bool y
depends on (X86_32 && (X86_LOCAL_APIC && !X86_VISWS)) || X86_64
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -784,6 +784,10 @@ __init void prefill_possible_map(void)
possible = num_processors + additional_cpus;
if (possible > NR_CPUS)
possible = NR_CPUS;
+#ifdef CONFIG_SMP_MAX
+ if (possible < NR_CPUS)
+ possible = NR_CPUS;
+#endif

printk(KERN_INFO "SMP: Allowing %d CPUs, %d hotplug CPUs\n",
possible,
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -3,7 +3,11 @@

#define PAGETABLE_LEVELS 4

+#ifdef CONFIG_THREAD_ORDER
+#define THREAD_ORDER CONFIG_THREAD_ORDER
+#else
#define THREAD_ORDER 1
+#endif
#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER)
#define CURRENT_MASK (~(THREAD_SIZE-1))


--


2008-01-18 20:01:38

by Ingo Oeser

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Hi Mike,

On Friday 18 January 2008, [email protected] wrote:
> +config THREAD_ORDER
> + int "Kernel stack size (in page order)"
> + range 1 3
> + depends on X86_64_SMP
> + default "3" if X86_SMP_MAX
> + default "1"
> + help
> + Increases kernel stack size.
> +

Could you please elaborate, why this is needed and put more info about
this requirement into this patch description?

People worked hard to push data allocation from stack to heap to make
THREAD_ORDER of 0 and 1 possible. So why increase it again and why does this
help scalability?

Many thanks and Best Regards

Ingo Oeser, puzzled a bit :-)

2008-01-18 20:10:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

On Fri, 18 Jan 2008, Ingo Oeser wrote:

> Hi Mike,
>
> On Friday 18 January 2008, [email protected] wrote:
> > +config THREAD_ORDER
> > + int "Kernel stack size (in page order)"
> > + range 1 3

THREAD_ORDER can also be used to consolidate the stack size with the
choices available for i386. In that cases the choices are 0 and 1.

2008-01-18 20:15:18

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Oeser wrote:
> Hi Mike,
>
> On Friday 18 January 2008, [email protected] wrote:
>> +config THREAD_ORDER
>> + int "Kernel stack size (in page order)"
>> + range 1 3
>> + depends on X86_64_SMP
>> + default "3" if X86_SMP_MAX
>> + default "1"
>> + help
>> + Increases kernel stack size.
>> +
>
> Could you please elaborate, why this is needed and put more info about
> this requirement into this patch description?
>
> People worked hard to push data allocation from stack to heap to make
> THREAD_ORDER of 0 and 1 possible. So why increase it again and why does this
> help scalability?
>
> Many thanks and Best Regards
>
> Ingo Oeser, puzzled a bit :-)


The primary problem arises because of cpumask_t local variables. Until I
can deal with these, increasing NR_CPUS to a really large value increases
stack size dramatically.

Here are the top stack consumers with NR_CPUS = 4k.

16392 isolated_cpu_setup
10328 build_sched_domains
8248 numa_initmem_init
4664 cpu_attach_domain
4104 show_shared_cpu_map
3656 centrino_target
3608 powernowk8_cpu_init
3192 sched_domain_node_span
3144 acpi_cpufreq_target
2584 __svc_create_thread
2568 cpu_idle_wait
2136 netxen_nic_flash_print
2104 powernowk8_target
2088 _cpu_down
2072 cache_add_dev
2056 get_cur_freq
0 acpi_processor_ffh_cstate_probe
2056 microcode_write
0 acpi_processor_get_throttling
2048 check_supported_cpu

And I've yet to figure out how to accumulate stack sizes using
call threads.

Thanks,
Mike

2008-01-18 20:36:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


First I think you have to get rid of the THREAD_ORDER stuff -- your
goal of the whole patchkit after all is to allow distributions to
support NR_CPUS==4096 in the standard kernels and I doubt any
distribution will over chose a THREAD_ORDER > 1 in their
standard kernels because it would be too unreliable on smaller
systems.

> Here are the top stack consumers with NR_CPUS = 4k.
>
> 16392 isolated_cpu_setup
> 10328 build_sched_domains
> 8248 numa_initmem_init

These should run single threaded early at boot so you can probably just make
the cpumask_t variables static __initdata

> 4664 cpu_attach_domain
> 4104 show_shared_cpu_map

These above are the real pigs. Fortunately they are all clearly
slowpath (except perhaps show_shared_cpu_map) so just using heap
allocations or when needed bootmem for them should be fine.

> 3656 centrino_target
> 3608 powernowk8_cpu_init
> 3192 sched_domain_node_span

x86-64 always has 8k stacks and separate interrupt stack. As long
as the calls are not in some stack intensive layered context (like block
IO processing path etc.) <3k shouldn't be too big an issue.

BTW there is a trick to get more stack space on x86-64 temporarily:
run it in a softirq. They got 16k stacks by default. Just leave
enough left over for the hard irqs that might happen if you don't
have interrupts disabled.

> 3144 acpi_cpufreq_target
> 2584 __svc_create_thread
> 2568 cpu_idle_wait
> 2136 netxen_nic_flash_print
> 2104 powernowk8_target
> 2088 _cpu_down
> 2072 cache_add_dev
> 2056 get_cur_freq
> 0 acpi_processor_ffh_cstate_probe
> 2056 microcode_write
> 0 acpi_processor_get_throttling
> 2048 check_supported_cpu
>
> And I've yet to figure out how to accumulate stack sizes using
> call threads.

One way if you don't care about indirect/asm calls is to use cflow and do
some post processing that adds up the data from checkstack.pl

The other way is to use mcount, but only for situations you can reproduce
of course. I did have a 2.4 mcount based stack instrumentation patch
some time ago that I could probably dig out if it was useful.

-Andi

2008-01-18 20:46:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


* [email protected] <[email protected]> wrote:

> +config THREAD_ORDER
> + int "Kernel stack size (in page order)"
> + range 1 3
> + depends on X86_64_SMP
> + default "3" if X86_SMP_MAX
> + default "1"
> + help
> + Increases kernel stack size.

nack on kernel stack bloat. We worked hard to get the kernel stack
footprint down to 4K on x86. (and it is 4K on most distros, despite
there still being a legacy 8K stack size) No way are we going to throw
away all that now ...

Ingo

2008-01-18 20:49:05

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Andi Kleen wrote:
> First I think you have to get rid of the THREAD_ORDER stuff -- your
> goal of the whole patchkit after all is to allow distributions to
> support NR_CPUS==4096 in the standard kernels and I doubt any
> distribution will over chose a THREAD_ORDER > 1 in their
> standard kernels because it would be too unreliable on smaller
> systems.
>
>> Here are the top stack consumers with NR_CPUS = 4k.
>>
>> 16392 isolated_cpu_setup
>> 10328 build_sched_domains
>> 8248 numa_initmem_init
>
> These should run single threaded early at boot so you can probably just make
> the cpumask_t variables static __initdata
>
>> 4664 cpu_attach_domain
>> 4104 show_shared_cpu_map
>
> These above are the real pigs. Fortunately they are all clearly
> slowpath (except perhaps show_shared_cpu_map) so just using heap
> allocations or when needed bootmem for them should be fine.
>
>> 3656 centrino_target
>> 3608 powernowk8_cpu_init
>> 3192 sched_domain_node_span
>
> x86-64 always has 8k stacks and separate interrupt stack. As long
> as the calls are not in some stack intensive layered context (like block
> IO processing path etc.) <3k shouldn't be too big an issue.
>
> BTW there is a trick to get more stack space on x86-64 temporarily:
> run it in a softirq. They got 16k stacks by default. Just leave
> enough left over for the hard irqs that might happen if you don't
> have interrupts disabled.
>
>> 3144 acpi_cpufreq_target
>> 2584 __svc_create_thread
>> 2568 cpu_idle_wait
>> 2136 netxen_nic_flash_print
>> 2104 powernowk8_target
>> 2088 _cpu_down
>> 2072 cache_add_dev
>> 2056 get_cur_freq
>> 0 acpi_processor_ffh_cstate_probe
>> 2056 microcode_write
>> 0 acpi_processor_get_throttling
>> 2048 check_supported_cpu
>>
>> And I've yet to figure out how to accumulate stack sizes using
>> call threads.
>
> One way if you don't care about indirect/asm calls is to use cflow and do
> some post processing that adds up the data from checkstack.pl
>
> The other way is to use mcount, but only for situations you can reproduce
> of course. I did have a 2.4 mcount based stack instrumentation patch
> some time ago that I could probably dig out if it was useful.
>
> -Andi

Thanks for the great feedback Andi. Since cpumask changes are the next
item on my list after NR_CPUS (and friends) are dealt with, perhaps I
could move the THREAD_ORDER stuff to the "Kernel Hacking" area for the
interim?

And yes, I'm interested in any tools to help accumulate information.

Btw, there are 116 functions now that have >= 1k stack size.

Cheers,
Mike

2008-01-18 20:49:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


* Mike Travis <[email protected]> wrote:

> >> +config THREAD_ORDER
> >> + int "Kernel stack size (in page order)"
> >> + range 1 3
> >> + depends on X86_64_SMP
> >> + default "3" if X86_SMP_MAX
> >> + default "1"
> >> + help
> >> + Increases kernel stack size.
> >> +
> >
> > Could you please elaborate, why this is needed and put more info
> > about this requirement into this patch description?
> >
> > People worked hard to push data allocation from stack to heap to
> > make THREAD_ORDER of 0 and 1 possible. So why increase it again and
> > why does this help scalability?
> >
> > Many thanks and Best Regards
> >
> > Ingo Oeser, puzzled a bit :-)
>
>
> The primary problem arises because of cpumask_t local variables.
> Until I can deal with these, increasing NR_CPUS to a really large
> value increases stack size dramatically.

those should be fixed:

> Here are the top stack consumers with NR_CPUS = 4k.
>
> 16392 isolated_cpu_setup
> 10328 build_sched_domains
> 8248 numa_initmem_init
> 4664 cpu_attach_domain
> 4104 show_shared_cpu_map
> 3656 centrino_target
> 3608 powernowk8_cpu_init
> 3192 sched_domain_node_span
> 3144 acpi_cpufreq_target
> 2584 __svc_create_thread
> 2568 cpu_idle_wait
> 2136 netxen_nic_flash_print
> 2104 powernowk8_target
> 2088 _cpu_down
> 2072 cache_add_dev
> 2056 get_cur_freq
> 0 acpi_processor_ffh_cstate_probe
> 2056 microcode_write
> 0 acpi_processor_get_throttling
> 2048 check_supported_cpu

(and most of that is performance-uncritical.)

Ingo

2008-01-18 20:55:27

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Molnar wrote:
> * Mike Travis <[email protected]> wrote:
>
>>>> +config THREAD_ORDER
>>>> + int "Kernel stack size (in page order)"
>>>> + range 1 3
>>>> + depends on X86_64_SMP
>>>> + default "3" if X86_SMP_MAX
>>>> + default "1"
>>>> + help
>>>> + Increases kernel stack size.
>>>> +
>>> Could you please elaborate, why this is needed and put more info
>>> about this requirement into this patch description?
>>>
>>> People worked hard to push data allocation from stack to heap to
>>> make THREAD_ORDER of 0 and 1 possible. So why increase it again and
>>> why does this help scalability?
>>>
>>> Many thanks and Best Regards
>>>
>>> Ingo Oeser, puzzled a bit :-)
>>
>> The primary problem arises because of cpumask_t local variables.
>> Until I can deal with these, increasing NR_CPUS to a really large
>> value increases stack size dramatically.
>
> those should be fixed:
>
>> Here are the top stack consumers with NR_CPUS = 4k.
>>
>> 16392 isolated_cpu_setup
>> 10328 build_sched_domains
>> 8248 numa_initmem_init
>> 4664 cpu_attach_domain
>> 4104 show_shared_cpu_map
>> 3656 centrino_target
>> 3608 powernowk8_cpu_init
>> 3192 sched_domain_node_span
>> 3144 acpi_cpufreq_target
>> 2584 __svc_create_thread
>> 2568 cpu_idle_wait
>> 2136 netxen_nic_flash_print
>> 2104 powernowk8_target
>> 2088 _cpu_down
>> 2072 cache_add_dev
>> 2056 get_cur_freq
>> 0 acpi_processor_ffh_cstate_probe
>> 2056 microcode_write
>> 0 acpi_processor_get_throttling
>> 2048 check_supported_cpu
>
> (and most of that is performance-uncritical.)
>
> Ingo

How big is the stack during early startup?

Thanks,
Mike

2008-01-18 20:58:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


> How big is the stack during early startup?

THREAD_ORDER (runs on init_stack's stack)

early init stack could be increased in theory with some effort,
but since that is all single threaded anyways just a few strategic
static __initdata's should be simple enough.

-Andi

2008-01-18 21:02:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX II


> > Here are the top stack consumers with NR_CPUS = 4k.
> >
> > 16392 isolated_cpu_setup
> > 10328 build_sched_domains
> > 8248 numa_initmem_init
>
> These should run single threaded early at boot so you can probably just make
> the cpumask_t variables static __initdata


To correct myself: this is not true for build_sched_domains() which can
be triggered from sysfs.

-Andi

2008-01-19 14:53:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


* [email protected] <[email protected]> wrote:

> Adds and increases some config variables to accomodate larger SMP
> configurations:
>
> NR_CPUS: max limit now 4096
> NODES_SHIFT: max limit now 9
> THREAD_ORDER: max limit now 3
> X86_SMP_MAX: say Y to enable possible cpus == NR_CPUS
>
> Signed-off-by: Mike Travis <[email protected]>

i've bisected a boot failure down to this patch (sans the THREAD_ORDER
bits): it causes an instant reboot of the 64-bit kernel upon bootup.
Failing config attached.

Ingo


Attachments:
(No filename) (516.00 B)
config (70.17 kB)
Download all attachments

2008-01-19 15:15:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


* Ingo Molnar <[email protected]> wrote:

> > NR_CPUS: max limit now 4096
> > NODES_SHIFT: max limit now 9
> > THREAD_ORDER: max limit now 3
> > X86_SMP_MAX: say Y to enable possible cpus == NR_CPUS
> >
> > Signed-off-by: Mike Travis <[email protected]>
>
> i've bisected a boot failure down to this patch (sans the THREAD_ORDER
> bits): it causes an instant reboot of the 64-bit kernel upon bootup.
> Failing config attached.

and then it crashes with:

[ 0.000000] Bootmem setup node 0 0000000000000000-000000003fff0000
[ 0.000000] KERN_NOTICE cpu_to_node(0): usage too early!
PANIC: early exception 06 rip 10:ffffffff81f77f30 error 0 cr2 f06f53
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.24-rc8 #422
[ 0.000000]
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff81f76b4a>] ? setup_node_bootmem+0x1a0/0x1b8
[ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
[ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
[ 0.000000] [<ffffffff81f77103>] ? numa_initmem_init+0x343/0x471

moral: PLEASE do not use BUG() on in early init code, unless absolutely
necessary.

Ingo

---
include/asm-x86/topology.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/include/asm-x86/topology.h
===================================================================
--- linux.orig/include/asm-x86/topology.h
+++ linux/include/asm-x86/topology.h
@@ -70,10 +70,11 @@ static inline int cpu_to_node(int cpu)
if(x86_cpu_to_node_map_early_ptr) {
printk("KERN_NOTICE cpu_to_node(%d): usage too early!\n",
(int)cpu);
- BUG();
+ dump_stack();
+ return 0;
}
#endif
- if(per_cpu_offset(cpu))
+ if (per_cpu_offset(cpu))
return per_cpu(x86_cpu_to_node_map, cpu);
else
return NUMA_NO_NODE;

2008-01-19 15:24:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


* Ingo Molnar <[email protected]> wrote:

> and then it crashes with:
>
> [ 0.000000] Bootmem setup node 0 0000000000000000-000000003fff0000
> [ 0.000000] KERN_NOTICE cpu_to_node(0): usage too early!
> PANIC: early exception 06 rip 10:ffffffff81f77f30 error 0 cr2 f06f53
> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.24-rc8 #422
> [ 0.000000]
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81f76b4a>] ? setup_node_bootmem+0x1a0/0x1b8
> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
> [ 0.000000] [<ffffffff81f77103>] ? numa_initmem_init+0x343/0x471
>
> moral: PLEASE do not use BUG() on in early init code, unless
> absolutely necessary.

the right fix is below.

Ingo

---
include/asm-x86/topology.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/include/asm-x86/topology.h
===================================================================
--- linux.orig/include/asm-x86/topology.h
+++ linux/include/asm-x86/topology.h
@@ -70,10 +70,10 @@ static inline int cpu_to_node(int cpu)
if(x86_cpu_to_node_map_early_ptr) {
printk("KERN_NOTICE cpu_to_node(%d): usage too early!\n",
(int)cpu);
- BUG();
+ dump_stack();
}
#endif
- if(per_cpu_offset(cpu))
+ if (per_cpu_offset(cpu))
return per_cpu(x86_cpu_to_node_map, cpu);
else
return NUMA_NO_NODE;

2008-01-19 21:39:20

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Molnar wrote:
> * [email protected] <[email protected]> wrote:
>
>> Adds and increases some config variables to accomodate larger SMP
>> configurations:
>>
>> NR_CPUS: max limit now 4096
>> NODES_SHIFT: max limit now 9
>> THREAD_ORDER: max limit now 3
>> X86_SMP_MAX: say Y to enable possible cpus == NR_CPUS
>>
>> Signed-off-by: Mike Travis <[email protected]>
>
> i've bisected a boot failure down to this patch (sans the THREAD_ORDER
> bits): it causes an instant reboot of the 64-bit kernel upon bootup.
> Failing config attached.
>
> Ingo
>

Thanks Ingo!

I've pulled the THREAD_ORDER change from my next version of the patch.
Seems SMP_MAX is just not ready for prime time yet.

One problem that I'm having appears to be that the !NUMA config of the
current -mm version also doesn't boot so I haven't been able to verify
that. (At least it doesn't seem to make it worse... ;-)

Thanks,
Mike

2008-01-19 21:52:53

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> and then it crashes with:
>>
>> [ 0.000000] Bootmem setup node 0 0000000000000000-000000003fff0000
>> [ 0.000000] KERN_NOTICE cpu_to_node(0): usage too early!
>> PANIC: early exception 06 rip 10:ffffffff81f77f30 error 0 cr2 f06f53
>> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.24-rc8 #422
>> [ 0.000000]
>> [ 0.000000] Call Trace:
>> [ 0.000000] [<ffffffff81f76b4a>] ? setup_node_bootmem+0x1a0/0x1b8
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77103>] ? numa_initmem_init+0x343/0x471
>>
>> moral: PLEASE do not use BUG() on in early init code, unless
>> absolutely necessary.
>
> the right fix is below.
>
> Ingo
>
> ---
> include/asm-x86/topology.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux/include/asm-x86/topology.h
> ===================================================================
> --- linux.orig/include/asm-x86/topology.h
> +++ linux/include/asm-x86/topology.h
> @@ -70,10 +70,10 @@ static inline int cpu_to_node(int cpu)
> if(x86_cpu_to_node_map_early_ptr) {
> printk("KERN_NOTICE cpu_to_node(%d): usage too early!\n",
> (int)cpu);
> - BUG();
> + dump_stack();
> }
> #endif
> - if(per_cpu_offset(cpu))
> + if (per_cpu_offset(cpu))
> return per_cpu(x86_cpu_to_node_map, cpu);
> else
> return NUMA_NO_NODE;

Thanks. I had pulled this from the patch as ak suggested. But it seems
that it's finding real errors that need to be fixed.

Thanks,
Mike

2008-01-19 23:25:20

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> and then it crashes with:
>>
>> [ 0.000000] Bootmem setup node 0 0000000000000000-000000003fff0000
>> [ 0.000000] KERN_NOTICE cpu_to_node(0): usage too early!
>> PANIC: early exception 06 rip 10:ffffffff81f77f30 error 0 cr2 f06f53
>> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.24-rc8 #422
>> [ 0.000000]
>> [ 0.000000] Call Trace:
>> [ 0.000000] [<ffffffff81f76b4a>] ? setup_node_bootmem+0x1a0/0x1b8
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77103>] ? numa_initmem_init+0x343/0x471
>>
>> moral: PLEASE do not use BUG() on in early init code, unless
>> absolutely necessary.
>
> the right fix is below.
>
> Ingo
>
> ---
> include/asm-x86/topology.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux/include/asm-x86/topology.h
> ===================================================================
> --- linux.orig/include/asm-x86/topology.h
> +++ linux/include/asm-x86/topology.h
> @@ -70,10 +70,10 @@ static inline int cpu_to_node(int cpu)
> if(x86_cpu_to_node_map_early_ptr) {
> printk("KERN_NOTICE cpu_to_node(%d): usage too early!\n",
> (int)cpu);
> - BUG();
> + dump_stack();
> }
> #endif
> - if(per_cpu_offset(cpu))
> + if (per_cpu_offset(cpu))
> return per_cpu(x86_cpu_to_node_map, cpu);
> else
> return NUMA_NO_NODE;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

Except the function needs to return a valid node id if it's to continue:

--- linux.orig/include/asm-x86/topology.h 2008-01-19 15:23:04.996551867 -0800
+++ linux/include/asm-x86/topology.h 2008-01-19 15:23:05.208564537 -0800
@@ -67,10 +67,11 @@ static inline int early_cpu_to_node(int
static inline int cpu_to_node(int cpu)
{
#ifdef CONFIG_DEBUG_PER_CPU_MAPS
- if(x86_cpu_to_node_map_early_ptr) {
+ if (x86_cpu_to_node_map_early_ptr) {
printk("KERN_NOTICE cpu_to_node(%d): usage too early!\n",
(int)cpu);
dump_stack();
+ return ((numanode_t *)x86_cpu_to_node_map_early_ptr)[cpu];
}
#endif
if (per_cpu_offset(cpu))

??

Thannks,
Mike

2008-01-20 01:14:39

by Mike Travis

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> and then it crashes with:
>>
>> [ 0.000000] Bootmem setup node 0 0000000000000000-000000003fff0000
>> [ 0.000000] KERN_NOTICE cpu_to_node(0): usage too early!
>> PANIC: early exception 06 rip 10:ffffffff81f77f30 error 0 cr2 f06f53
>> [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.24-rc8 #422
>> [ 0.000000]
>> [ 0.000000] Call Trace:
>> [ 0.000000] [<ffffffff81f76b4a>] ? setup_node_bootmem+0x1a0/0x1b8
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77f30>] ? acpi_scan_nodes+0x204/0x255
>> [ 0.000000] [<ffffffff81f77103>] ? numa_initmem_init+0x343/0x471
>>

I *think* but have not been able to verify that this will fix the above panic:


--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -393,7 +393,7 @@ int __init acpi_scan_nodes(unsigned long
setup_node_bootmem(i, nodes[i].start, nodes[i].end);

for (i = 0; i < NR_CPUS; i++) {
- int node = cpu_to_node(i);
+ int node = early_cpu_to_node(i);
if (node == NUMA_NO_NODE)
continue;
if (!node_isset(node, node_possible_map))

The problem is that the x86 test tree kernel doesn't boot, even without my
changes. And the stack trace I don't think is correct.

Thanks,
Mike

2008-01-28 16:45:42

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX

Ten days ago, Mike wrote:
> The primary problem arises because of cpumask_t local variables. Until I
> can deal with these, increasing NR_CPUS to a really large value increases
> stack size dramatically.
>
> Here are the top stack consumers with NR_CPUS = 4k.
>
> 16392 isolated_cpu_setup
> 10328 build_sched_domains

The problem in kernel/sched.c:isolated_cpu_setup() is an array of
NR_CPUS integers:

static int __init isolated_cpu_setup(char *str)
{
int ints[NR_CPUS], i;

str = get_options(str, ARRAY_SIZE(ints), ints);

Since isolated_cpu_setup() is an __init routine, perhaps we could
make that ints[] array static __initdata?

The build_sched_domains() may require more thought and code rework.
See also the lkml discussion of my patches that reworked the cpuset
code implementing 'sched_load_balance' calling into build_sched_domains
() via kernel/sched.c:partition_sched_domains(). This is not performance
critical code, fortunately.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-01-28 17:06:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] x86: Add config variables for SMP_MAX


> The problem in kernel/sched.c:isolated_cpu_setup() is an array of
> NR_CPUS integers:
>
> static int __init isolated_cpu_setup(char *str)
> {
> int ints[NR_CPUS], i;
>
> str = get_options(str, ARRAY_SIZE(ints), ints);
>
> Since isolated_cpu_setup() is an __init routine, perhaps we could
> make that ints[] array static __initdata?

That or use alloc_bootmem / free_bootmem

-Andi