From: Tejun Heo <tj@kernel.org>
To: linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@redhat.com,
       benh@kernel.crashing.org, davem@davemloft.net, dhowells@redhat.com,
       npiggin@suse.de, JBeulich@novell.com, cl@linux-foundation.org,
       rusty@rustcorp.com.au, hpa@zytor.com, tglx@linutronix.de,
       akpm@linux-foundation.org, x86@kernel.org, andi@firstfloor.org
Subject: [PATCHSET percpu#for-next] implement and use sparse embedding first chunk allocator
Date: Tue, 21 Jul 2009 19:25:59 +0900
Message-Id: <1248171979-29166-1-git-send-email-tj@kernel.org>

Hello, all.

This patchset teaches the percpu allocator how to manage very sparse
units and vmalloc how to allocate congruent sparse vmap areas, then
combines the two to extend the embedding allocator so that it can
embed sparse unit addresses.  This basically implements Christoph's
sparse congruent allocator.

This allows NUMA configurations to use bootmem allocated memory
directly as non-NUMA machines do with the embedding allocator.
Setting up the first chunk basically consists of allocating memory
for each cpu and then building the percpu configuration so that the
first chunk is composed of those memory areas, which means that there
can be huge holes between units and chunks may overlap each other.
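
To make the layout concrete, here is a minimal userspace sketch of how
the per-cpu allocations could be turned into a base address plus
per-unit offsets.  build_unit_offsets() and its parameters are
hypothetical names used for illustration only; they are not taken from
the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patchset: given the address of the
 * memory allocated for each cpu, pick the lowest one as the chunk base
 * and express every unit as an offset from it.  Returns the largest
 * offset, i.e. the distance the chunk has to span.
 */
static uintptr_t build_unit_offsets(const uintptr_t *unit_addr,
				    size_t nr_units,
				    uintptr_t *base_out,
				    uintptr_t *unit_off)
{
	uintptr_t base, max_off = 0;
	size_t i;

	assert(nr_units > 0);

	/* the lowest allocation becomes the base of the chunk */
	base = unit_addr[0];
	for (i = 1; i < nr_units; i++)
		if (unit_addr[i] < base)
			base = unit_addr[i];

	/* every unit is described by its distance from that base */
	for (i = 0; i < nr_units; i++) {
		unit_off[i] = unit_addr[i] - base;
		if (unit_off[i] > max_off)
			max_off = unit_off[i];
	}

	*base_out = base;
	return max_off;
}
```

With two cpus on each of two nodes, the offsets within a node stay
small while the offsets between nodes become the huge holes mentioned
above.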

When further chunks are necessary, pcpu_get_vm_areas() is called with
parameters specifying how many areas are necessary, how large each
should be and how far apart they are from each other.  The function
scans the vmalloc address space top down looking for matching holes
and returns an array of vmap areas.  As the newly allocated areas are
offset exactly the same as the first chunk, the rest is pretty
straightforward.
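
The top-down search can be sketched in userspace as follows.  This is
only an illustration of the idea under simplifying assumptions: busy
regions are given as a plain array of ranges, and struct range,
align_down() and find_congruent_base() are made-up names.  It is not
the pcpu_get_vm_areas() implementation from patch 0014.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct range {			/* half-open interval [start, end) */
	unsigned long start, end;
};

static unsigned long align_down(unsigned long x, unsigned long align)
{
	return x - (x % align);
}

/*
 * Find the highest base inside [vm_start, vm_end) such that every area
 * [base + off[i], base + off[i] + size[i]) misses all busy ranges.
 * Returns true and stores the base in *base_out on success.
 */
static bool find_congruent_base(const struct range *busy, size_t nr_busy,
				unsigned long vm_start, unsigned long vm_end,
				const unsigned long *off,
				const unsigned long *size,
				size_t nr_areas, unsigned long align,
				unsigned long *base_out)
{
	unsigned long span = 0, base;
	size_t i, j;

	assert(nr_areas > 0 && align > 0);

	/* distance from the base to the end of the furthest area */
	for (i = 0; i < nr_areas; i++)
		if (off[i] + size[i] > span)
			span = off[i] + size[i];

	if (vm_end < vm_start || vm_end - vm_start < span)
		return false;
	base = align_down(vm_end - span, align);

retry:
	if (base < vm_start)
		return false;
	for (i = 0; i < nr_areas; i++) {
		unsigned long a_start = base + off[i];
		unsigned long a_end = a_start + size[i];

		for (j = 0; j < nr_busy; j++) {
			if (a_start >= busy[j].end || a_end <= busy[j].start)
				continue;	/* no overlap */
			/* slide down until area i ends at the busy start */
			if (busy[j].start < off[i] + size[i])
				return false;
			base = align_down(busy[j].start - off[i] - size[i],
					  align);
			goto retry;
		}
	}

	*base_out = base;
	return true;
}
```

Every conflict moves the base strictly downward, and no base between
the old and the new value could have worked for the conflicting area,
so the first base that passes all checks is the highest usable one.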

This has the following benefits.

* No special remapping necessary.  Arch code doesn't need to change
  its address mapping or anything.  It just needs to tell the percpu
  allocator what the percpu areas end up looking like.  The percpu
  allocator will take any layout.

* No additional TLB pressure.  Both page and large page remapping add
  TLB pressure.  With embedding, there's no overhead.  Whatever
  translations are used for the linear mapping are used as-is.

* Removes dup-mapping.  Large page remapping ends up mapping the same
  page twice.  This causes subtle problems on x86 when page attributes
  need to be changed.  The maps need to be looked up and split into
  page mappings, which is a bit fragile.  As embedding doesn't remap
  anything, this problem doesn't exist.

The only restriction is that the vmalloc area needs to be huge - at
least orders of magnitude larger than the distances between NUMA
nodes.  For 64bit machines, this isn't a problem but on 32bit NUMA
machines address space is a scarce resource.  For x86_32 NUMA, the
page mapping allocator is used.  Page was chosen over large page
because page is far simpler and the advantage of large page isn't
very clear.

 0001-percpu-fix-pcpu_reclaim-locking.patch
 0002-percpu-improve-boot-messages.patch
 0003-percpu-rename-4k-first-chunk-allocator-to-page.patch
 0004-percpu-build-first-chunk-allocators-selectively.patch
 0005-percpu-generalize-first-chunk-allocator-selection.patch
 0006-percpu-drop-static_size-from-first-chunk-allocator.patch
 0007-percpu-make-dyn_size-mandatory-for-pcpu_setup_firs.patch
 0008-percpu-add-align-to-pcpu_fc_alloc_fn_t.patch
 0009-percpu-move-pcpu_lpage_build_unit_map-and-pcpul_l.patch
 0010-percpu-introduce-pcpu_alloc_info-and-pcpu_group_inf.patch
 0011-percpu-add-pcpu_unit_offsets.patch
 0012-percpu-add-chunk-base_addr.patch
 0013-vmalloc-separate-out-insert_vmalloc_vm.patch
 0014-vmalloc-implement-pcpu_get_vm_areas.patch
 0015-percpu-use-group-information-to-allocate-vmap-areas.patch
 0016-percpu-update-embedding-first-chunk-allocator-to-ha.patch
 0017-x86-percpu-use-embedding-for-64bit-NUMA-and-page-fo.patch
 0018-percpu-kill-lpage-first-chunk-allocator.patch
 0019-sparc64-use-embedding-percpu-first-chunk-allocator.patch
 0020-powerpc64-convert-to-dynamic-percpu-allocator.patch

0001 fixes a locking bug on the reclaim path which was introduced by
2f39e637ea240efb74cf807d31c93a71a0b89174.

0002-0007 are misc changes.  The 4k allocator is renamed to page,
messages are made prettier and more informative, unused first chunk
allocators are no longer built and so on.  Nothing really drastic,
just small cleanups to ease further changes.

0008-0009 prepare for later changes.  @align is added to
pcpu_fc_alloc_fn_t and functions are relocated.

0010 changes how first chunk configuration is passed to
pcpu_setup_first_chunk().  All information is collected into
pcpu_alloc_info struct including the unit grouping information which
used to be lost in the process.  This change gives the percpu
allocator enough information to allocate congruent vmap areas.

0011-0012 prepare percpu for sparse groups and the units in them.
Offset information is added and used to calculate addresses.
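
As a rough illustration of that address calculation, the sketch below
combines a chunk base address with a per-cpu unit offset.  struct chunk
and pcpu_addr() are simplified stand-ins made up for this example and
are not the definitions in mm/percpu.c.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a chunk; only the base address matters here. */
struct chunk {
	unsigned long base_addr;
};

/*
 * Address of the byte at offset @off inside @cpu's unit of @chunk.
 * @unit_off holds one offset per cpu and is shared by all chunks.
 */
static unsigned long pcpu_addr(const struct chunk *chunk,
			       const unsigned long *unit_off,
			       size_t nr_units, size_t cpu,
			       unsigned long off)
{
	assert(cpu < nr_units);
	return chunk->base_addr + unit_off[cpu] + off;
}
```

Because all chunks share the same unit offsets, the distance between
two cpus' copies of a variable is the same in every chunk, which is
what lets congruent vmap areas be used for later chunks.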

0013-0014 implement pcpu_get_vm_areas() which allocates congruent vmap
areas.

0015-0016 teach percpu how to use multiple vm areas to allow sparse
groups and extend the embedding allocator so that it knows how to
embed sparse areas.

0017 converts x86_64 NUMA to use the embedding allocator and x86_32
NUMA to use the page allocator.

0018 kills the now unused lpage allocator and the related page
attribute code.

0019 converts sparc64 to use embedding allocator.

0020 converts powerpc64 to the dynamic percpu allocator using the
embedding allocator.

After this series, only ia64 is left with the static allocator.  I
have the patch but don't have a machine to verify it on.  Will post
it as an RFC patch.

This patchset is on top of

 linus#master (aea1f7964ae6cba5eb419a958956deb9016b3341)
 + [1] percpu-fix-sparse-possible-cpu-map-handling patchset
 + pulled into percpu#for-next (457f82bac659745f6d5052e4c493d92d62722c9c)

and is available in the following git tree.  Please note that the
following tree is temporary and will be rebased.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review

Diffstat follows.  Only 112 lines added.  :-)

 Documentation/kernel-parameters.txt |   11 
 arch/powerpc/Kconfig                |    4 
 arch/powerpc/kernel/setup_64.c      |   61 +
 arch/sparc/Kconfig                  |    3 
 arch/sparc/kernel/smp_64.c          |  124 ---
 arch/x86/Kconfig                    |    6 
 arch/x86/kernel/setup_percpu.c      |  201 +-----
 arch/x86/mm/pageattr.c              |   20 
 include/linux/percpu.h              |  105 +--
 include/linux/vmalloc.h             |    6 
 mm/percpu.c                         | 1139 +++++++++++++++++-------------------
 mm/vmalloc.c                        |  338 ++++++++++
 12 files changed, 1065 insertions(+), 953 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/867587