2008-06-04 15:08:18

by Mike Travis

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Andrew Morton wrote:
> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <[email protected]> wrote:
>
>> In various places the kernel maintains arrays of pointers indexed by
>> processor numbers. These are used to locate objects that need to be used
>> when executing on a specific processor. Both the slab allocator
>> and the page allocator use these arrays and there the arrays are used in
>> performance critical code. The allocpercpu functionality is a simple
>> allocator to provide these arrays.
>
> All seems reasonable to me. The obvious question is "how do we size
> the arena". We either waste memory or, much worse, run out.
>
> And running out is a real possibility, I think. Most people will only
> mount a handful of XFS filesystems. But some customer will come along
> who wants to mount 5,000, and distributors will need to cater for that,
> but how can they?
>
> I wonder if we can arrange for the default to be overridden via a
> kernel boot option?
>
>
> Another obvious question is "how much of a problem will we have with
> internal fragmentation"? This might be a drop-dead showstopper.

One problem with a variable-sized cpu_alloc area is this comment in bitmap.h:

* Note that nbits should be always a compile time evaluable constant.
* Otherwise many inlines will generate horrible code.

I'm guessing that since this will be low-use and not performance critical,
we can ignore the "horrible code"? ;-)
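
For illustration, here is a minimal user-space sketch (not the kernel code;
the names just mimic bitmap.h) of what that comment is about: with a
compile-time-constant nbits the inline collapses to a single word store,
while a runtime nbits has to take the generic memset() path.

#include <string.h>

#define BITS_PER_LONG (8 * sizeof(long))
#define BITS_TO_LONGS(nbits) (((nbits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static inline void bitmap_zero(unsigned long *dst, unsigned long nbits)
{
	if (__builtin_constant_p(nbits) && nbits <= BITS_PER_LONG)
		*dst = 0UL;				/* single store */
	else
		memset(dst, 0, BITS_TO_LONGS(nbits) * sizeof(long));
}

int main(void)
{
	unsigned long fixed[BITS_TO_LONGS(64)];
	unsigned long big[BITS_TO_LONGS(8192)];
	unsigned long runtime_bits = 8192;	/* e.g. sized from the cpu_alloc area */

	bitmap_zero(fixed, 64);			/* constant nbits: tight code */
	bitmap_zero(big, runtime_bits);		/* variable nbits: generic path */
	return 0;
}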

Thanks,
Mike


2008-06-06 05:39:22

by Eric Dumazet

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Mike Travis wrote:
> Andrew Morton wrote:
>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <[email protected]> wrote:
>>
>>> In various places the kernel maintains arrays of pointers indexed by
>>> processor numbers. These are used to locate objects that need to be used
>>> when executing on a specific processor. Both the slab allocator
>>> and the page allocator use these arrays and there the arrays are used in
>>> performance critical code. The allocpercpu functionality is a simple
>>> allocator to provide these arrays.
>> All seems reasonable to me. The obvious question is "how do we size
>> the arena". We either waste memory or, much worse, run out.
>>
>> And running out is a real possibility, I think. Most people will only
>> mount a handful of XFS filesystems. But some customer will come along
>> who wants to mount 5,000, and distributors will need to cater for that,
>> but how can they?
>>
>> I wonder if we can arrange for the default to be overridden via a
>> kernel boot option?
>>
>>
>> Another obvious question is "how much of a problem will we have with
>> internal fragmentation"? This might be a drop-dead showstopper.

Christoph & Mike,

Please forgive me if I'm beating a dead horse, but this percpu stuff
should find its way in.

I wonder why you absolutely want to have only one chunk holding
all percpu variables, static(vmlinux) & static(modules)
& dynamically allocated.

It's *not* possible to put an arbitrary limit on this global zone.
You'll always find somebody who breaks the limit. This is the point
we must solve before coding anything.

Have you considered using a list of fixed-size chunks, each chunk
having its own bitmap?

We only want fixed offsets between CPU locations. For a given variable,
we MUST be able to find the addresses for all CPUs using the same offset table.
(Then we can optimize things on x86, using the %gs/%fs register instead
of a table lookup.)

We could choose the chunk size at compile time, depending on various
parameters (32/64-bit arches, or hugepage sizes on NUMA),
with a minimum value (ABI guarantee).

On x86_64 && NUMA we could use 2 Mbyte chunks, while
on x86_32 or non-NUMA we should probably use 64 Kbytes.

At boot time, we set up the first chunk (chunk 0), copy
.data.percpu into this chunk for each possible cpu, and
build the bitmap for future dynamic/module percpu allocations.
So we still have the restriction that sizeofsection(.data.percpu)
must fit in chunk 0. Not a problem in practice.

Then, if we need to expand the percpu zone for a heavy-duty machine
and chunk 0 is already full, we can add as many 2M / 64K
chunks as we need.

This would limit a single dynamic percpu allocation to 64 Kbytes for
a given variable, so huge users should probably still use a
different allocator (like oprofile's alloc_cpu_buffers() function),
but at least we no longer limit the total size of the percpu area.
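
Roughly, the data structure I have in mind looks like this (a sketch only;
every name here is invented for illustration and is not an existing kernel
symbol):

#define NR_CPUS		64			/* illustration only */
#define CHUNK_SIZE	(64 * 1024)		/* 64 Kbyte chunks (!NUMA case) */
#define UNIT		8			/* allocation granularity */
#define UNITS		(CHUNK_SIZE / UNIT)
#define BITS_PER_LONG	(8 * sizeof(long))

struct pcpu_chunk {
	struct pcpu_chunk *next;		/* list of chunks */
	char *cpu0_base;			/* CPU 0's copy of this chunk */
	unsigned long map[UNITS / BITS_PER_LONG]; /* shared allocation bitmap */
};

/* Fixed per-cpu delta, computed once when chunk 0 is set up and then
 * reused for every later chunk. */
unsigned long offset_of_cpu[NR_CPUS];

static inline void *ptr_for_cpu(struct pcpu_chunk *chunk,
				unsigned long off, int cpu)
{
	return chunk->cpu0_base + off + offset_of_cpu[cpu];
}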

I understand you want percpu data to start at offset 0, but only for
static percpu data (the pda being included in it, to share %gs).

For dynamically allocated percpu variables (including modules'
".data.percpu"), nothing forces you to have low offsets
relative to the %gs/%fs register. Access to these variables
will be register-indirect anyway (e.g. %gs:(%rax)).


1) NUMA case

For a 64-bit NUMA arch, chunk size of 2 Mbytes.

Allocate 2MB for each possible processor (on its preferred memory
node), and compute values to set up the offset_of_cpu[NR_CPUS] array.

Chunk 0
CPU 0 : virtual address XXXXXX
CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
...
CPU n : virtual address XXXXXX + offset_of_cpu[n]
+ a shared bitmap


For the next chunks, we could use the vmalloc() zone to find
nr_possible_cpus virtual address ranges where we can map
a 2MB page per possible cpu, as long as we respect the relative
delta between each cpu's block that was computed when
chunk 0 was set up.

Chunk 1..n
CPU 0 : virtual address YYYYYYYYYYYYYY
CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
...
CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
+ a shared bitmap (32Kbytes if 8 bytes granularity in allocator)

For a variable located in chunk 0, its 'address' relative to the current
cpu's %gs will be some number in [0, 2^20-1].

For a variable located in chunk 1, its 'address' relative to the current
cpu's %gs will be some number in
[YYYYYYYYYYYYYY - XXXXXX, YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
not necessarily [2^20, 2^21 - 1].


Chunk 0 would use normal memory (no vmap TLB cost); only the next ones need vmalloc().

So the extra TLB cost would only be paid on very special NUMA setups
(only when using a lot of percpu allocations).

Also, a 2MB page granularity probably wastes about 2MB per cpu, but
that is nothing for NUMA machines :)

2) SMP && !NUMA

On non-NUMA machines, we don't need vmalloc games, since we can allocate
the chunk space as contiguous memory (size = nr_possible_cpus * 64 Kbytes):

offset_of_cpu[N] = N * CHUNK_SIZE

(On a 4-CPU x86_32 machine, allocate a 256 Kbyte block then divide it into
64 Kbyte blocks.)
If this order-6 allocation fails, then fall back to vmalloc(), but most
percpu allocations happen at boot time, when memory is not yet fragmented...
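
A sketch of that setup (user-space stand-ins only, to show the arithmetic;
malloc() stands in for the kernel page allocator):

#include <stdlib.h>

#define CHUNK_SIZE	(64 * 1024)

static char *chunk0_base;

/* !NUMA: one contiguous block, so the per-cpu delta is a fixed stride. */
static unsigned long offset_of(int cpu)
{
	return (unsigned long)cpu * CHUNK_SIZE;
}

static int chunk0_init(int nr_possible_cpus)
{
	/* 4 cpus * 64 Kbytes = 256 Kbytes, i.e. an order-6 allocation of 4K pages */
	chunk0_base = malloc((size_t)nr_possible_cpus * CHUNK_SIZE);
	return chunk0_base ? 0 : -1;	/* on failure: fall back to vmalloc() */
}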


3) UP case: fall back to standard allocators. No need for bitmaps.


NUMA special casing can be implemented later of course...

Thanks for reading

2008-06-06 13:08:26

by Mike Travis

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Eric Dumazet wrote:
> Mike Travis wrote:
>> Andrew Morton wrote:
>>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <[email protected]> wrote:
>>>
>>>> In various places the kernel maintains arrays of pointers indexed by
>>>> processor numbers. These are used to locate objects that need to be used
>>>> when executing on a specific processor. Both the slab allocator
>>>> and the page allocator use these arrays and there the arrays are used in
>>>> performance critical code. The allocpercpu functionality is a simple
>>>> allocator to provide these arrays.
>>> All seems reasonable to me. The obvious question is "how do we size
>>> the arena". We either waste memory or, much worse, run out.
>>>
>>> And running out is a real possibility, I think. Most people will only
>>> mount a handful of XFS filesystems. But some customer will come along
>>> who wants to mount 5,000, and distributors will need to cater for that,
>>> but how can they?
>>>
>>> I wonder if we can arrange for the default to be overridden via a
>>> kernel boot option?
>>>
>>>
>>> Another obvious question is "how much of a problem will we have with
>>> internal fragmentation"? This might be a drop-dead showstopper.
>
> Christoph & Mike,
>
> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.
>
> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
>
> It's *not* possible to put an arbitrary limit on this global zone.
> You'll always find somebody who breaks the limit. This is the point
> we must solve before coding anything.
>
> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?
>
> We only want fixed offsets between CPU locations. For a given variable,
> we MUST find addresses for all CPUS looking at the same offset table.
> (Then we can optimize things on x86, using %gs/%fs register, instead
> of a table lookup)
>
> We could choose the chunk size at compile time, depending on various
> parameters (32/64 bit arches, or hugepage sizes on NUMA),
> and a minimum value (ABI guarantee)
>
> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.
>
> At boot time, we setup the first chunk (chunk 0) and copy
> .data.percpu on this chunk, for each possible cpu, and we
> build the bitmap for future dynamic/module percpu allocations.
> So we still have the restriction that sizeofsection(.data.percpu)
> should fit in the chunk 0. Not a problem in practice.
>
> Then if we need to expand percpu zone for heavy duty machine,
> and chunk 0 is already filled, we can add as many 2 M/ 64K
> chunks we need.
>
> This would limit the dynamic percpu allocation to 64 kbytes for
> a given variable, so huge users should probably still use a
> different allocator (like oprofile alloc_cpu_buffers() function)
> But at least we no longer limit the total size of the percpu area.
>
> I understand you want percpu data to start at offset 0, but only for
> static percpu data (the pda being included in it, to share %gs).
>
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )
>
>
> 1) NUMA case
>
> For a 64 bit NUMA arch, chunk size of 2Mbytes
>
> Allocate 2Mb for each possible processor (on its preferred memory
> node), and compute values to set up the offset_of_cpu[NR_CPUS] array.
>
> Chunk 0
> CPU 0 : virtual address XXXXXX
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
>
>
> For next chunks, we could use vmalloc() zone to find
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
>
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
>
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
>
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarily [2^20 to 2^21 - 1]
>
>
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().
>
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
>
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)
>
> 2) SMP && !NUMA
>
> On non-NUMA machines, we don't need vmalloc games, since we can allocate
> chunk space using contiguous memory, (size = nr_possible_cpus*64Kbytes)
>
> offset_of_cpu[N] = N * CHUNK_SIZE
>
> (On a 4 CPU x86_32 machine, allocate a 256 Kbyte block then divide it into
> 64 kb blocks)
> If this order-6 allocation fails, then fall back to vmalloc(), but most
> percpu allocations happen at boot time, when memory is not yet fragmented...
>
>
> 3) UP case: fall back to standard allocators. No need for bitmaps.
>
>
> NUMA special casing can be implemented later of course...
>
> Thanks for reading
>
>

Wow! Thanks for the detail! It's extremely useful (to me at least)
to see it spelled out.

Since Christoph is still on vacation I'll try to summarize where we're
at right now. (Besides being stuck on a boot-up problem with the
%gs-based percpu variables, that is. ;-)

Yes, the problem is we need to use virtual addresses to expand the
percpu areas since each cpu needs the same fixed offset to the newly
allocated variables. This was in the prior (v2) version of cpu_alloc
so I'm looking at pulling that forward. And I also figured that the
size of the expansion allocations should be based on the system size
to minimize the effect on small systems (seems to be my life the
past 6 months... ;-)

I'm also looking at integrating more with the infrastructure that is
already present (thanks Rusty!) so there are fewer "diffs" (and less
new testing needed). And of course, there are the complexities
of submitting patches to many architectures simultaneously.

Hopefully, I'll have something for review soon.

Thanks again,
Mike

2008-06-08 06:00:40

by Rusty Russell

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

On Friday 06 June 2008 15:33:22 Eric Dumazet wrote:
> 1) NUMA case
>
> For a 64 bit NUMA arch, chunk size of 2Mbytes
>
> Allocate 2Mb for each possible processor (on its preferred memory
> node), and compute values to set up the offset_of_cpu[NR_CPUS] array.
>
> Chunk 0
> CPU 0 : virtual address XXXXXX
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
>
>
> For next chunks, we could use vmalloc() zone to find
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
>
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
>
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
>
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarily [2^20 to 2^21 - 1]
>
>
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need
> vmalloc().
>
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
>
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)

If you're prepared to have mappings for chunk 0, you can simply make it
virtually linear and creating a new chunk is simple. If not, you need to
reserve the virtual address space(s) for future mappings. Otherwise you're
unlikely to get the same layout for allocations.

This is not a show-stopper: we've lived with limited vmalloc room since
forever. It just has to be sufficient.

Otherwise, your analysis is correct, if a little verbose :)

Cheers,
Rusty.

2008-06-09 18:44:42

by Christoph Lameter

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

On Fri, 6 Jun 2008, Eric Dumazet wrote:

> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.

Definitely, and it's very complex, so any more eyes on this are appreciated.

> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
>
> It's *not* possible to put an arbitrary limit on this global zone.
> You'll always find somebody who breaks the limit. This is the point
> we must solve before coding anything.

The problem is that offsets relative to %gs or %fs are limited by the
small memory model that is chosen. We cannot have an offset larger than
2GB. So we must have a linear address range and cannot use separate chunks
of memory. If we do not use the segment register then we cannot do atomic
(wrt interrupt) cpu ops.
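
To make that concrete, the kind of access the segment register buys us looks
roughly like this (a sketch of the idea, not the actual cpu ops macros; the
variable name is made up):

/* stands in for a DEFINE_PER_CPU variable whose link-time address is a
 * small offset that fits a sign-extended 32-bit displacement */
static unsigned long my_counter;

static inline void cpu_local_inc(void)
{
	/* a single read-modify-write instruction: atomic wrt interrupts on
	 * this cpu, no cli/sti needed; the displacement limit is where the
	 * 2GB restriction of the small/kernel memory model comes from */
	asm volatile("incq %%gs:%0" : "+m" (my_counter));
}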

> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?

Mike has done so and then I had to tell him what I just told you.

> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.

Right that is what cpu_alloc v2 did. It created a virtual mapping and
populated it on demand with 2MB PMD entries.

> I understand you want percpu data to start at offset 0, but only for
> static percpu data (the pda being included in it, to share %gs).
>
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )

The relative-to-0 stuff comes in at the x86_64 level because we want to
unify pda and percpu accesses. pda accesses have been relative to 0, and in
particular the stack canary in glibc directly accesses the pda at a
certain offset. So we must be zero-based in order to preserve
compatibility with glibc.

> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().

Normal memory uses 2MB TLB entries. There is therefore no overhead from mapping
the percpu areas with 2MB TLB entries. So we do not need to be that complicated.

What v2 did was allocate an area of n * MAX_VIRT_PER_CPU_SIZE in vmalloc
space and then dynamically populate 2MB segments as needed. The MAX
size was 128MB or so.
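
Schematically, the v2 layout was something like this (sizes and names here
are only illustrative, not the actual v2 code):

#define MAX_VIRT_PER_CPU_SIZE	(128UL << 20)	/* virtual window per cpu */
#define PMD_SIZE		(2UL << 20)	/* populated in 2MB steps */

static char *cpu_area_base;	/* one n * MAX_VIRT_PER_CPU_SIZE reservation */
static unsigned long populated;	/* bytes of each cpu's window backed so far */

static inline void *cpu_alloc_ptr(unsigned long off, int cpu)
{
	return cpu_area_base + (unsigned long)cpu * MAX_VIRT_PER_CPU_SIZE + off;
}

static void grow_to(unsigned long needed)
{
	while (populated < needed) {
		/* here: allocate one 2MB page per cpu on its node and map it
		 * at cpu_area_base + cpu * MAX_VIRT_PER_CPU_SIZE + populated */
		populated += PMD_SIZE;
	}
}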

We could either do the same on i386 or use 4kb mappings (then we can
directly use the vmalloc functionality). But then there would be
additional TLB overhead.

We have similar 2MB virtual mapping tricks for the virtual memmap.
Basically we can copy the functions and customize them for the virtual per
cpu areas (Mike is hopefully listening and reading the V2 patch ....)

2008-06-09 19:12:35

by Andi Kleen

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Christoph Lameter <[email protected]> writes:

> The problem is that offsets relative to %gs or %fs are limited by the
> small memory model that is chosen.

Actually they are not. If you really want you can do
movabs $64bit,%reg ; op ...,%gs:(%reg)
It's just not very efficient compared to small (or rather kernel) model
and also older binutils didn't support large model.

-Andi

2008-06-09 20:16:39

by Eric Dumazet

Subject: Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access

Andi Kleen wrote:
> Christoph Lameter <[email protected]> writes:
>
>> The problem is that offsets relative to %gs or %fs are limited by the
>> small memory model that is chosen.
>
> Actually they are not. If you really want you can do
> movabs $64bit,%reg ; op ...,%gs:(%reg)
> It's just not very efficient compared to small (or rather kernel) model
> and also older binutils didn't support large model.
>

I am not sure Christoph was referring to actual instructions.

I was suggesting, for static percpu (vmlinux or modules):

vmlinux : (offset31 computed by the linker at vmlinux link-edit time)
incl %gs:offset31

modules : (offset31 computed at module load time by the module loader)
incl %gs:offset31

(if we make sure all this stuff is allocated in the first chunk)

And for dynamic percpu:

movq field(%rdi),%rax
incl %gs:(%rax) /* full 64bits 'offsets' */

I understood (but might be wrong again) that %gs itself could not be used with an offset > 2GB, because of
the way the %gs segment is set up. So in the 'dynamic percpu' case, %rax should not exceed 2^31.