2008-01-13 18:36:43

by Mike Travis

Subject: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


This patchset addresses the kernel bloat that occurs when NR_CPUS is increased.
The memory numbers below are with NR_CPUS = 1024, which I've been testing with
4 and 32 real processors (the rest "possible" via the additional_cpus boot
option). These changes are all specific to the x86 architecture; arch-independent
changes will follow.

Based on 2.6.24-rc6-mm1

Signed-off-by: Mike Travis <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---

The following columns are using the default x86_64 config with no modules.
32cpus is the default NR_CPUS, 1kcpus-before has NR_CPUS = 1024, and
1kcpus-after is after applying this patch.

As you can see below, there's still plenty of room for improvement... ;-)

32cpus 1kcpus-before 1kcpus-after
228 .altinstr_repl +0 .altinstr_repl +0 .altinstr_repl
1219 .altinstructio +0 .altinstructio +0 .altinstructio
717512 .bss +1542784 .bss -147456 .bss
61374 .comment +0 .comment +0 .comment
16 .con_initcall. +0 .con_initcall. +0 .con_initcall.
425256 .data +20224 .data -1024 .data
178688 .data.cachelin +12898304 .data.cachelin +0 .data.cachelin
8192 .data.init_tas +0 .data.init_tas +0 .data.init_tas
4096 .data.page_ali +0 .data.page_ali +0 .data.page_ali
27008 .data.percpu +128768 .data.percpu +128 .data.percpu
43904 .data.read_mos +8707872 .data.read_mos -4096 .data.read_mos
4 .data_nosave +0 .data_nosave +0 .data_nosave
5141 .exit.text +9 .exit.text -1 .exit.text
138480 .init.data +992 .init.data +3616 .init.data
133 .init.ramfs +0 .init.ramfs +1 .init.ramfs
3192 .init.setup +0 .init.setup +0 .init.setup
159754 .init.text +891 .init.text +13 .init.text
2288 .initcall.init +0 .initcall.init +0 .initcall.init
8 .jiffies +0 .jiffies +0 .jiffies
4512 .pci_fixup +0 .pci_fixup +0 .pci_fixup
1314438 .rodata +1312 .rodata -552 .rodata
36552 .smp_locks +256 .smp_locks +0 .smp_locks
3971848 .text +12992 .text +1781 .text
3368 .vdso +0 .vdso +0 .vdso
4 .vgetcpu_mode +0 .vgetcpu_mode +0 .vgetcpu_mode
218 .vsyscall_0 +0 .vsyscall_0 +0 .vsyscall_0
52 .vsyscall_1 +0 .vsyscall_1 +0 .vsyscall_1
91 .vsyscall_2 +0 .vsyscall_2 +0 .vsyscall_2
8 .vsyscall_3 +0 .vsyscall_3 +0 .vsyscall_3
54 .vsyscall_fn +0 .vsyscall_fn +0 .vsyscall_fn
80 .vsyscall_gtod +0 .vsyscall_gtod +0 .vsyscall_gtod
39480 __bug_table +0 __bug_table +0 __bug_table
16320 __ex_table +0 __ex_table +0 __ex_table
9160 __param +0 __param +0 __param
7172678 Total +23314404 Total -147590 Total

--


2008-01-14 08:14:41

by Ingo Molnar

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


* [email protected] <[email protected]> wrote:

> This patchset addresses the kernel bloat that occurs when NR_CPUS is
> increased. The memory numbers below are with NR_CPUS = 1024, which I've
> been testing with 4 and 32 real processors (the rest "possible" via the
> additional_cpus boot option). These changes are all specific to the x86
> architecture; arch-independent changes will follow.

thanks, i'll try this patchset in x86.git.

> 32cpus 1kcpus-before 1kcpus-after
> 7172678 Total +23314404 Total -147590 Total

1kcpus-after means it's +23314404-147590, i.e. +23166814? (i.e. a 0.6%
reduction of the bloat?)

i.e. we've got ~22K bloat per CPU - which is not bad, but because it's a
static component, it hurts smaller boxes. For distributors to enable
CONFIG_NR_CPUS=1024 by default i guess that bloat has to drop below 1-2K
per CPU :-/ [that would still mean 1-2MB total bloat but that's much
more acceptable than 23MB]

Ingo

2008-01-14 09:00:35

by Ingo Molnar

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


* Ingo Molnar <[email protected]> wrote:

> > 32cpus 1kcpus-before 1kcpus-after
> > 7172678 Total +23314404 Total -147590 Total
>
> 1kcpus-after means it's +23314404-147590, i.e. +23166814? (i.e. a 0.6%
> reduction of the bloat?)

or if it's relative to 32cpus then that's an excellent result :)

Ingo

2008-01-14 10:04:34

by Andi Kleen

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


> i.e. we've got ~22K bloat per CPU - which is not bad, but because it's a
> static component, it hurts smaller boxes. For distributors to enable
> CONFIG_NR_CPUS=1024 by default i guess that bloat has to drop below 1-2K
> per CPU :-/ [that would still mean 1-2MB total bloat but that's much
> more acceptable than 23MB]

Even 1-2MB overhead would be too much for distributors, I think. Ideally
there should be near zero overhead for possible CPUs (and I see no reason
in principle why this is not possible). Worst case a low few hundred KBs,
but even that would be too much.

There are the cpusets which get passed around, but these are only one bit per
possible CPU.

-Andi

2008-01-14 10:11:59

by Ingo Molnar

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


* Andi Kleen <[email protected]> wrote:

> > i.e. we've got ~22K bloat per CPU - which is not bad, but because
> > it's a static component, it hurts smaller boxes. For distributors to
> > enable CONFIG_NR_CPUS=1024 by default i guess that bloat has to drop
> > below 1-2K per CPU :-/ [that would still mean 1-2MB total bloat but
> > that's much more acceptable than 23MB]
>
> Even 1-2MB overhead would be too much for distributors, I think.
> Ideally there should be near zero overhead for possible CPUs (and I see
> no reason in principle why this is not possible). Worst case a low few
> hundred KBs, but even that would be too much.

i think this patchset already gives a net win, by moving stuff from
NR_CPUS arrays into per_cpu area. (Travis please confirm that this is
indeed what the numbers show)

The (total-)size of the per-cpu area(s) grows linearly with the number
of CPUs, so we'll have the expected near-zero overhead on 4-8-16-32 CPUs
and the expected larger total overhead on 1024 CPUs.

Ingo

2008-01-14 11:31:17

by Andi Kleen

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


> i think this patchset already gives a net win, by moving stuff from
> NR_CPUS arrays into per_cpu area. (Travis please confirm that this is
> indeed what the numbers show)

Yes that is what his patchkit does, although I'm not sure he has addressed all NR_CPUS
pigs yet. The basic idea came out of some discussions we had at kernel summit on this
topic. It's definitely a step in the right direction.

Another problem is that NR_IRQS currently scales with NR_CPUS, which is wrong too
(e.g. a hyperthreaded quad core/socket does not need 8 times as many
external interrupts as a single core/socket). And there are unfortunately a few
drivers that declare NR_IRQS arrays.

In general there are more scaling problems like this (e.g. it also doesn't make
sense to scale kernel threads with each CPU thread).

At some point we might need to separate CONFIG_NR_CPUS into a
CONFIG_NR_SOCKETS / CONFIG_NR_CPUS to address this, although full dynamic
scaling without configuration is best of course.

All can just be addressed step by step of course.

-Andi

2008-01-14 17:53:00

by Mike Travis

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>>> 32cpus 1kcpus-before 1kcpus-after
>>> 7172678 Total +23314404 Total -147590 Total
>> 1kcpus-after means it's +23314404-147590, i.e. +23166814? (i.e. a 0.6%
>> reduction of the bloat?)
>
> or if it's relative to 32cpus then that's an excellent result :)
>
> Ingo

Nope, it's a cumulative thing.


> allsizes -w 72 32cpus 1kcpus-after
32cpus 1kcpus-after
228 .altinstr_replacemen +0 .altinstr_replacemen
1219 .altinstructions +0 .altinstructions
717512 .bss +1395328 .bss
61374 .comment +0 .comment
16 .con_initcall.init +0 .con_initcall.init
425256 .data +19200 .data
178688 .data.cacheline_alig +12898304 .data.cacheline_alig
8192 .data.init_task +0 .data.init_task
4096 .data.page_aligned +0 .data.page_aligned
27008 .data.percpu +128896 .data.percpu
43904 .data.read_mostly +8703776 .data.read_mostly
4 .data_nosave +0 .data_nosave
5141 .exit.text +8 .exit.text
138480 .init.data +4608 .init.data
133 .init.ramfs +1 .init.ramfs
3192 .init.setup +0 .init.setup
159754 .init.text +904 .init.text
2288 .initcall.init +0 .initcall.init
8 .jiffies +0 .jiffies
4512 .pci_fixup +0 .pci_fixup
1314438 .rodata +760 .rodata
36552 .smp_locks +256 .smp_locks
3971848 .text +14773 .text
3368 .vdso +0 .vdso
4 .vgetcpu_mode +0 .vgetcpu_mode
218 .vsyscall_0 +0 .vsyscall_0
52 .vsyscall_1 +0 .vsyscall_1
91 .vsyscall_2 +0 .vsyscall_2
8 .vsyscall_3 +0 .vsyscall_3
54 .vsyscall_fn +0 .vsyscall_fn
80 .vsyscall_gtod_data +0 .vsyscall_gtod_data
39480 __bug_table +0 __bug_table
16320 __ex_table +0 __ex_table
9160 __param +0 __param
7172678 Total +23166814 Total

My goal is to move 90+% of the wasted, unused memory to either
the percpu area or the initdata section. The 4 fronts are:
NR_CPUS arrays, cpumask_t usages, a more efficient cpu_alloc/percpu
area, and a (relatively small) redesign of the irq system. (The
node and apicid arrays are related to the NR_CPUS arrays.)

The irq structs are particularly bad because they use NR_CPUS**2
arrays and the irq vars use 22588416 bytes (74%) of the total
30339492 bytes of memory:

7172678 Total 30339492 Total

> datasizes -w 72 32cpus 1kcpus-before
32cpus 1kcpus-before
262144 BSS __log_buf 12681216 CALNDA irq_desc
163840 CALNDA irq_desc 8718336 RMDATA irq_cfg
131072 BSS entries 528384 BSS irq_lists
76800 INITDA early_node_map 396288 BSS irq_2_pin
30720 RMDATA irq_cfg 264192 BSS irq_timer_state
29440 BSS ide_hwifs 262144 BSS __log_buf
24576 BSS boot_exception_ 132168 PERCPU per_cpu__kstat
20480 BSS irq_lists 131072 BSS entries
18840 DATA ioctl_start 131072 BSS boot_pageset
16384 BSS boot_cpu_stack 131072 CALNDA boot_cpu_pda
15360 BSS irq_2_pin 98304 BSS cpu_devices
14677 DATA bnx2_CP_b06FwTe 76800 INITDA early_node_map

I'm still working on a tool to analyze runtime usage of kernel
memory.

And I'm very open to any and all suggestions... ;-)

Thanks,
Mike

2008-01-14 18:01:00

by Mike Travis

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs


Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
>
>>> i.e. we've got ~22K bloat per CPU - which is not bad, but because
>>> it's a static component, it hurts smaller boxes. For distributors to
>>> enable CONFIG_NR_CPUS=1024 by default i guess that bloat has to drop
>>> below 1-2K per CPU :-/ [that would still mean 1-2MB total bloat but
>>> that's much more acceptable than 23MB]
>> Even 1-2MB overhead would be too much for distributors, I think.
>> Ideally there should be near zero overhead for possible CPUs (and I see
>> no reason in principle why this is not possible). Worst case a low few
>> hundred KBs, but even that would be too much.
>
> i think this patchset already gives a net win, by moving stuff from
> NR_CPUS arrays into per_cpu area. (Travis please confirm that this is
> indeed what the numbers show)
>
> The (total-)size of the per-cpu area(s) grows linearly with the number
> of CPUs, so we'll have the expected near-zero overhead on 4-8-16-32 CPUs
> and the expected larger total overhead on 1024 CPUs.
>
> Ingo

Yes, and it's just the first step. Ideally, there is *no* extra memory
used by specifying NR_CPUS = <whatever>, and all the extra memory only
comes into play when CPUs are "possible/probable". This means that almost
all of the data needs to be in the percpu area (compacted as much
as possible) or in the initdata section, discarded after use.

And Andi is right: the distributors will not default NR_CPUS to a large
value unless there is zero or very little overhead. And since so much
depends on using standard configurations (certifications, etc.), we cannot
depend on a special build.

Thanks,
Mike

2008-01-16 07:35:00

by Nick Piggin

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs

On Monday 14 January 2008 22:30, Andi Kleen wrote:

> In general there are more scaling problems like this (e.g. it also doesn't
> make sense to scale kernel threads with each CPU thread).

I think in a lot of ways, per-CPU kernel threads scale OK. At least
they should mostly be dynamic, so they don't require overhead on
smaller systems. On larger systems, I don't know if there are too
many kernel problems with all those threads (except that userspace
tools sometimes don't report them well).

And I think making them per-CPU can be much easier than tuning some
arbitrary algorithm to get a mix between parallelism and footprint.

For example, I'm finding that it might actually be worthwhile to move
some per-node and dynamically-controlled thread creation over to the
basic per-CPU scheme because of differences in topologies...

Anyway, that's just an aside.

Oh, just while I remember it also, something funny is that MAX_NUMNODES
can be bigger than NR_CPUS on x86. I guess one can have CPUless nodes,
but wouldn't it make sense to have an upper bound of NR_CPUS by default?

2008-01-16 18:07:50

by Christoph Lameter

Subject: Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs

On Wed, 16 Jan 2008, Nick Piggin wrote:

> Oh, just while I remember it also, something funny is that MAX_NUMNODES
> can be bigger than NR_CPUS on x86. I guess one can have CPUless nodes,
> but wouldn't it make sense to have an upper bound of NR_CPUS by default?

There are special configurations that some customers want which involve
huge amounts of memory and just a few processors. In that case the number
of nodes becomes larger than the number of processors.