2009-07-21 10:28:48

by Tejun Heo

Subject: [PATCHSET percpu#for-next] implement and use sparse embedding first chunk allocator

Hello, all.

This patchset teaches the percpu allocator how to manage very sparse
units and vmalloc how to allocate congruent sparse vmap areas, and
combines the two to extend the embedding allocator so that it can
embed units at sparse addresses. This basically implements
Christoph's sparse congruent allocator.

This allows NUMA configurations to use bootmem allocated memory
directly, as non-NUMA machines do with the embedding allocator.
Setting up the first chunk basically consists of allocating memory
for each cpu and then building the percpu configuration so that the
first chunk is composed of those memory areas. This means that there
can be huge holes between units and that the address ranges of
chunks may overlap each other.

When further chunks are necessary, pcpu_get_vm_areas() is called with
parameters specifying how many areas are needed, how large each
should be and how far apart they are from each other. The function
scans the vmalloc address space top down looking for matching holes
and returns an array of vmap areas. As the newly allocated areas are
offset exactly the same as the first chunk, the rest is pretty
straightforward.
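
To make "offset exactly the same as the first chunk" concrete, here is
a minimal user-space sketch, not kernel code. The names
pcpu_layout_sketch, unit_addr() and chunk_delta() are hypothetical
and exist only for illustration; the only thing taken from the
description above is that every chunk reuses the same per-group
offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified layout: one offset per group, fixed unit size. */
struct pcpu_layout_sketch {
	size_t nr_groups;
	const unsigned long *group_offset;	/* group start relative to chunk base */
	size_t unit_size;			/* bytes reserved per cpu */
};

/* Address of unit @idx of group @grp in a chunk whose base is @base. */
static unsigned long unit_addr(const struct pcpu_layout_sketch *l,
			       unsigned long base, size_t grp, size_t idx)
{
	assert(grp < l->nr_groups);
	return base + l->group_offset[grp] + idx * l->unit_size;
}

/*
 * Distance between the same unit in two chunks.  Because every chunk
 * reuses the same group offsets, this is always the difference of the
 * two chunk bases, whatever @grp and @idx are.
 */
static unsigned long chunk_delta(const struct pcpu_layout_sketch *l,
				 unsigned long base_a, unsigned long base_b,
				 size_t grp, size_t idx)
{
	return unit_addr(l, base_b, grp, idx) - unit_addr(l, base_a, grp, idx);
}
```

With two groups whose memory sits 1GiB apart (a made-up figure),
chunk_delta() returns the same value for every unit, which is what
lets a percpu pointer be translated with a single per-cpu offset
regardless of which chunk it lives in.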

This has the following benefits.

* No special remapping necessary. Arch code doesn't need to change
its address mapping or anything. It just needs to inform the percpu
allocator what the percpu areas end up looking like. The percpu
allocator will take any layout.

* No additional TLB pressure. Both page and large page remapping add
TLB pressure. With embedding, there's no overhead. Whatever
translations are used for the linear mapping are used as-is.

* Removes dup-mapping. Large page remapping ends up mapping the same
page twice. This causes subtle problems on x86 when page attributes
need to be changed. The maps need to be looked up and split into
page mappings, which is a bit fragile. As embedding doesn't remap
anything, this problem doesn't exist.

The only restriction is that the vmalloc area needs to be huge - at
least orders of magnitude larger than the distances between NUMA
nodes. For 64bit machines this isn't a problem, but on 32bit NUMA
machines address space is a scarce resource. For x86_32 NUMAs, the
page mapping allocator is used. Page is chosen over large page
because page is far simpler and the advantage of large page isn't
very clear.
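
A rough user-space sketch of why 32bit address space is the limiting
factor follows. The helper names areas_span() and span_fits() are
hypothetical. The check is only the bare minimum (one congruent set
must fit at all), mirroring the vmalloc_end - vmalloc_start < last_end
test in pcpu_get_vm_areas(); in practice the space needs to be far
larger than that, as noted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* End offset of the highest area: the span one congruent set covers. */
static unsigned long areas_span(const unsigned long *offsets,
				const size_t *sizes, int nr)
{
	unsigned long span = 0;
	int i;

	for (i = 0; i < nr; i++)
		if (offsets[i] + sizes[i] > span)
			span = offsets[i] + sizes[i];
	return span;
}

/* True if a single congruent set could fit in [vm_start, vm_end) at all. */
static bool span_fits(unsigned long vm_start, unsigned long vm_end,
		      unsigned long span)
{
	return vm_end > vm_start && vm_end - vm_start >= span;
}
```

For example, two groups 1GiB apart need a span of a bit over 1GiB,
which cannot fit into a 128MiB window no matter how the window is
scanned. The numbers are illustrative only.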

0001-percpu-fix-pcpu_reclaim-locking.patch
0002-percpu-improve-boot-messages.patch
0003-percpu-rename-4k-first-chunk-allocator-to-page.patch
0004-percpu-build-first-chunk-allocators-selectively.patch
0005-percpu-generalize-first-chunk-allocator-selection.patch
0006-percpu-drop-static_size-from-first-chunk-allocator.patch
0007-percpu-make-dyn_size-mandatory-for-pcpu_setup_firs.patch
0008-percpu-add-align-to-pcpu_fc_alloc_fn_t.patch
0009-percpu-move-pcpu_lpage_build_unit_map-and-pcpul_l.patch
0010-percpu-introduce-pcpu_alloc_info-and-pcpu_group_inf.patch
0011-percpu-add-pcpu_unit_offsets.patch
0012-percpu-add-chunk-base_addr.patch
0013-vmalloc-separate-out-insert_vmalloc_vm.patch
0014-vmalloc-implement-pcpu_get_vm_areas.patch
0015-percpu-use-group-information-to-allocate-vmap-areas.patch
0016-percpu-update-embedding-first-chunk-allocator-to-ha.patch
0017-x86-percpu-use-embedding-for-64bit-NUMA-and-page-fo.patch
0018-percpu-kill-lpage-first-chunk-allocator.patch
0019-sparc64-use-embedding-percpu-first-chunk-allocator.patch
0020-powerpc64-convert-to-dynamic-percpu-allocator.patch

0001 fixes a locking bug on the reclaim path which was introduced by
2f39e637ea240efb74cf807d31c93a71a0b89174.

0002-0007 are misc changes. The 4k allocator is renamed to page,
messages are made prettier and more informative, unused first chunk
allocators are no longer built and so on. Nothing really drastic,
just small cleanups to ease further changes.

0008-0009 prepare for later changes. @align is added to
pcpu_fc_alloc_fn_t and functions are relocated.

0010 changes how the first chunk configuration is passed to
pcpu_setup_first_chunk(). All information is collected into the
pcpu_alloc_info struct, including the unit grouping information which
used to be lost in the process. This change gives the percpu
allocator enough information to allocate congruent vmap areas.

0011-0012 prepare percpu for sparse groups and the units in them.
Offset information is added and used to calculate addresses.

0013-0014 implement pcpu_get_vm_areas() which allocates congruent vmap
areas.

0015-0016 teach percpu how to use multiple vm areas to allow sparse
groups and extend the embedding allocator so that it knows how to
embed sparse areas.

0017 converts x86_64 NUMA to use the embedding allocator and x86_32
NUMA to use the page allocator.

0018 kills the now unused lpage allocator and the related page
attribute code.

0019 converts sparc64 to use the embedding allocator.

0020 converts powerpc64 to the dynamic percpu allocator using the
embedding allocator.

After this series, only ia64 is left with the static allocator. I
have the patch but don't have a machine to verify it on. I will post
it as an RFC patch.

This patchset is on top of

linus#master (aea1f7964ae6cba5eb419a958956deb9016b3341)
+ [1] perpcu-fix-sparse-possible-cpu-map-handling patchset
+ pulled into percpu#for-next (457f82bac659745f6d5052e4c493d92d62722c9c)

and available in the following git tree. Please note that the
following tree is temporary and will be rebased.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review

Diffstat follows. Only 112 lines added. :-)

Documentation/kernel-parameters.txt | 11
arch/powerpc/Kconfig | 4
arch/powerpc/kernel/setup_64.c | 61 +
arch/sparc/Kconfig | 3
arch/sparc/kernel/smp_64.c | 124 ---
arch/x86/Kconfig | 6
arch/x86/kernel/setup_percpu.c | 201 +-----
arch/x86/mm/pageattr.c | 20
include/linux/percpu.h | 105 +--
include/linux/vmalloc.h | 6
mm/percpu.c | 1139 +++++++++++++++++-------------------
mm/vmalloc.c | 338 ++++++++++
12 files changed, 1065 insertions(+), 953 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/867587


2009-07-21 10:27:51

by Tejun Heo

Subject: [PATCH 16/20] percpu: update embedding first chunk allocator to handle sparse units

Now that the percpu core can handle very sparse units, given that
vmalloc space is large enough, the embedding first chunk allocator
can use any memory to build the first chunk. This patch teaches
pcpu_embed_first_chunk() about distances between cpus and makes it
use alloc/free callbacks to allocate node specific areas for each
group and use them for the first chunk.

This brings the benefits of the embedding allocator to NUMA
configurations - no extra TLB pressure, with the flexibility of the
unified dynamic allocator and no need to restructure arch code to
build a memory layout suitable for percpu. With units put into
atom_size aligned groups according to cpu distances, using large
pages for dynamic chunks is also easily possible, falling back to
regular pages if the large allocation fails.

Embedding allocator users are converted to specify NULL
cpu_distance_fn, so this patch doesn't cause any visible behavior
difference. Following patches will convert them.
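
The base address and group offset computation described above can be
sketched in user space as follows. This is an illustration, not the
kernel function: group_base_offsets() is a hypothetical helper that
mirrors the base = min(ptr, base) and base_offset = areas[group] - base
steps of the patch, taking plain addresses instead of calling
alloc_fn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the start address of each group's allocation, return the lowest
 * one and fill @base_offset with each group's distance from it.
 */
static uintptr_t group_base_offsets(const uintptr_t *areas, size_t nr_groups,
				    uintptr_t *base_offset)
{
	uintptr_t base = UINTPTR_MAX;
	size_t g;

	/* the chunk base is the lowest address among all group areas */
	for (g = 0; g < nr_groups; g++)
		if (areas[g] < base)
			base = areas[g];

	/* every group is then described by its offset from that base */
	for (g = 0; g < nr_groups; g++)
		base_offset[g] = areas[g] - base;

	return base;
}
```

The groups need not be allocated in address order; the lowest area
simply ends up with offset 0.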

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 4 +-
include/linux/percpu.h | 7 ++-
mm/percpu.c | 113 ++++++++++++++++++++++++++++++----------
3 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 477d2de..5b03d7e 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -234,7 +234,9 @@ static int __init setup_pcpu_embed(bool chosen)
return -EINVAL;

return pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
- reserve - PERCPU_FIRST_CHUNK_RESERVE);
+ reserve - PERCPU_FIRST_CHUNK_RESERVE,
+ PAGE_SIZE, NULL, pcpu_fc_alloc,
+ pcpu_fc_free);
}

/*
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index a7ec840..2535993 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -110,8 +110,11 @@ extern int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
void *base_addr);

#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern int __init pcpu_embed_first_chunk(size_t reserved_size,
- ssize_t dyn_size);
+extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+ size_t atom_size,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn);
#endif

#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
diff --git a/mm/percpu.c b/mm/percpu.c
index cc9c4c6..c2826d0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1747,15 +1747,25 @@ early_param("percpu_alloc", percpu_alloc_setup);
* pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
* @reserved_size: the size of reserved percpu area in bytes
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @atom_size: allocation atom size
+ * @cpu_distance_fn: callback to determine distance between cpus, optional
+ * @alloc_fn: function to allocate percpu page
+ * @free_fn: function to free percpu page
*
* This is a helper to ease setting up embedded first percpu chunk and
* can be called where pcpu_setup_first_chunk() is expected.
*
* If this function is used to setup the first chunk, it is allocated
- * as a contiguous area using bootmem allocator and used as-is without
- * being mapped into vmalloc area. This enables the first chunk to
- * piggy back on the linear physical mapping which often uses larger
- * page size.
+ * by calling @alloc_fn and used as-is without being mapped into
+ * vmalloc area. Allocations are always whole multiples of @atom_size
+ * aligned to @atom_size.
+ *
+ * This enables the first chunk to piggy back on the linear physical
+ * mapping which often uses larger page size. Please note that this
+ * can result in very sparse cpu->unit mapping on NUMA machines thus
+ * requiring large vmalloc address space. Don't use this allocator if
+ * vmalloc space is not orders of magnitude larger than distances
+ * between node memory addresses (ie. 32bit NUMA machines).
*
* When @dyn_size is positive, dynamic area might be larger than
* specified to fill page alignment. When @dyn_size is auto,
@@ -1763,53 +1773,88 @@ early_param("percpu_alloc", percpu_alloc_setup);
* and reserved areas.
*
* If the needed size is smaller than the minimum or specified unit
- * size, the leftover is returned to the bootmem allocator.
+ * size, the leftover is returned using @free_fn.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
+int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+ size_t atom_size,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn)
{
+ void *base = (void *)ULONG_MAX;
+ void **areas = NULL;
struct pcpu_alloc_info *ai;
- size_t size_sum, chunk_size;
- void *base;
- int unit;
- int rc;
+ size_t size_sum, areas_size;
+ int group, i, rc;

- ai = pcpu_build_alloc_info(reserved_size, dyn_size, PAGE_SIZE, NULL);
+ ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
+ cpu_distance_fn);
if (IS_ERR(ai))
return PTR_ERR(ai);
- BUG_ON(ai->nr_groups != 1);
- BUG_ON(ai->groups[0].nr_units != num_possible_cpus());

size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
- chunk_size = ai->unit_size * num_possible_cpus();
+ areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

- base = __alloc_bootmem_nopanic(chunk_size, PAGE_SIZE,
- __pa(MAX_DMA_ADDRESS));
- if (!base) {
- pr_warning("PERCPU: failed to allocate %zu bytes for "
- "embedding\n", chunk_size);
+ areas = alloc_bootmem_nopanic(areas_size);
+ if (!areas) {
rc = -ENOMEM;
- goto out_free_ai;
+ goto out_free;
}

- /* return the leftover and copy */
- for (unit = 0; unit < num_possible_cpus(); unit++) {
- void *ptr = base + unit * ai->unit_size;
+ /* allocate, copy and determine base address */
+ for (group = 0; group < ai->nr_groups; group++) {
+ struct pcpu_group_info *gi = &ai->groups[group];
+ unsigned int cpu = NR_CPUS;
+ void *ptr;
+
+ for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
+ cpu = gi->cpu_map[i];
+ BUG_ON(cpu == NR_CPUS);
+
+ /* allocate space for the whole group */
+ ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
+ if (!ptr) {
+ rc = -ENOMEM;
+ goto out_free_areas;
+ }
+ areas[group] = ptr;

- free_bootmem(__pa(ptr + size_sum), ai->unit_size - size_sum);
- memcpy(ptr, __per_cpu_load, ai->static_size);
+ base = min(ptr, base);
+
+ for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
+ if (gi->cpu_map[i] == NR_CPUS) {
+ /* unused unit, free whole */
+ free_fn(ptr, ai->unit_size);
+ continue;
+ }
+ /* copy and return the unused part */
+ memcpy(ptr, __per_cpu_load, ai->static_size);
+ free_fn(ptr + size_sum, ai->unit_size - size_sum);
+ }
}

- /* we're ready, commit */
+ /* base address is now known, determine group base offsets */
+ for (group = 0; group < ai->nr_groups; group++)
+ ai->groups[group].base_offset = areas[group] - base;
+
pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
ai->dyn_size, ai->unit_size);

rc = pcpu_setup_first_chunk(ai, base);
-out_free_ai:
+ goto out_free;
+
+out_free_areas:
+ for (group = 0; group < ai->nr_groups; group++)
+ free_fn(areas[group],
+ ai->groups[group].nr_units * ai->unit_size);
+out_free:
pcpu_free_alloc_info(ai);
+ if (areas)
+ free_bootmem(__pa(areas), areas_size);
return rc;
}
#endif /* CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK ||
@@ -2177,6 +2222,17 @@ void *pcpu_lpage_remapped(void *kaddr)
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(__per_cpu_offset);

+static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
+ size_t align)
+{
+ return __alloc_bootmem_nopanic(size, align, __pa(MAX_DMA_ADDRESS));
+}
+
+static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
+{
+ free_bootmem(__pa(ptr), size);
+}
+
void __init setup_per_cpu_areas(void)
{
unsigned long delta;
@@ -2188,7 +2244,8 @@ void __init setup_per_cpu_areas(void)
* what the legacy allocator did.
*/
rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
- PERCPU_DYNAMIC_RESERVE);
+ PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
+ pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
if (rc < 0)
panic("Failed to initialized percpu areas.");

--
1.6.0.2

2009-07-21 10:28:05

by Tejun Heo

Subject: [PATCH 06/20] percpu: drop @static_size from first chunk allocators

First chunk allocators assume percpu areas have been linked using one
of PERCPU_*() macros and depend on __per_cpu_load symbol defined by
those macros, so there isn't much point in passing in static area size
explicitly when it can be easily calculated from __per_cpu_start and
__per_cpu_end. Drop @static_size from all percpu first chunk
allocators and helpers.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 34 +++++++++++++++-------------------
include/linux/percpu.h | 18 ++++++++----------
mm/percpu.c | 29 +++++++++++++----------------
3 files changed, 36 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 7f1e09b..b0e7ac4 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -157,7 +157,7 @@ static int pcpu_lpage_cpu_distance(unsigned int from, unsigned int to)
return REMOTE_DISTANCE;
}

-static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
+static ssize_t __init setup_pcpu_lpage(bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
size_t dyn_size = reserve - PERCPU_FIRST_CHUNK_RESERVE;
@@ -184,8 +184,7 @@ static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
return -ENOMEM;
}

- ret = pcpu_lpage_build_unit_map(static_size,
- PERCPU_FIRST_CHUNK_RESERVE,
+ ret = pcpu_lpage_build_unit_map(PERCPU_FIRST_CHUNK_RESERVE,
&dyn_size, &unit_size, PMD_SIZE,
unit_map, pcpu_lpage_cpu_distance);
if (ret < 0) {
@@ -208,9 +207,8 @@ static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
}
}

- ret = pcpu_lpage_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
- dyn_size, unit_size, PMD_SIZE,
- unit_map, nr_units,
+ ret = pcpu_lpage_first_chunk(PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
+ unit_size, PMD_SIZE, unit_map, nr_units,
pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
out_free:
if (ret < 0)
@@ -218,7 +216,7 @@ out_free:
return ret;
}
#else
-static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
+static ssize_t __init setup_pcpu_lpage(bool chosen)
{
return -EINVAL;
}
@@ -232,7 +230,7 @@ static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
* mapping so that it can use PMD mapping without additional TLB
* pressure.
*/
-static ssize_t __init setup_pcpu_embed(size_t static_size, bool chosen)
+static ssize_t __init setup_pcpu_embed(bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;

@@ -244,7 +242,7 @@ static ssize_t __init setup_pcpu_embed(size_t static_size, bool chosen)
if (!chosen && (!cpu_has_pse || pcpu_need_numa()))
return -EINVAL;

- return pcpu_embed_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ return pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
reserve - PERCPU_FIRST_CHUNK_RESERVE);
}

@@ -260,9 +258,9 @@ static void __init pcpup_populate_pte(unsigned long addr)
populate_extra_pte(addr);
}

-static ssize_t __init setup_pcpu_page(size_t static_size)
+static ssize_t __init setup_pcpu_page(void)
{
- return pcpu_page_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ return pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
pcpu_fc_alloc, pcpu_fc_free,
pcpup_populate_pte);
}
@@ -282,7 +280,6 @@ static inline void setup_percpu_segment(int cpu)

void __init setup_per_cpu_areas(void)
{
- size_t static_size = __per_cpu_end - __per_cpu_start;
unsigned int cpu;
unsigned long delta;
size_t pcpu_unit_size;
@@ -300,9 +297,9 @@ void __init setup_per_cpu_areas(void)
if (pcpu_chosen_fc != PCPU_FC_AUTO) {
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
if (pcpu_chosen_fc == PCPU_FC_LPAGE)
- ret = setup_pcpu_lpage(static_size, true);
+ ret = setup_pcpu_lpage(true);
else
- ret = setup_pcpu_embed(static_size, true);
+ ret = setup_pcpu_embed(true);

if (ret < 0)
pr_warning("PERCPU: %s allocator failed (%zd), "
@@ -310,15 +307,14 @@ void __init setup_per_cpu_areas(void)
pcpu_fc_names[pcpu_chosen_fc], ret);
}
} else {
- ret = setup_pcpu_lpage(static_size, false);
+ ret = setup_pcpu_lpage(false);
if (ret < 0)
- ret = setup_pcpu_embed(static_size, false);
+ ret = setup_pcpu_embed(false);
}
if (ret < 0)
- ret = setup_pcpu_page(static_size);
+ ret = setup_pcpu_page();
if (ret < 0)
- panic("cannot allocate static percpu area (%zu bytes, err=%zd)",
- static_size, ret);
+ panic("cannot initialize percpu area (err=%zd)", ret);

pcpu_unit_size = ret;

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 9be05cb..be2fc8f 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -84,13 +84,12 @@ extern size_t __init pcpu_setup_first_chunk(

#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
extern ssize_t __init pcpu_embed_first_chunk(
- size_t static_size, size_t reserved_size,
- ssize_t dyn_size);
+ size_t reserved_size, ssize_t dyn_size);
#endif

#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
extern ssize_t __init pcpu_page_first_chunk(
- size_t static_size, size_t reserved_size,
+ size_t reserved_size,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn);
@@ -98,16 +97,15 @@ extern ssize_t __init pcpu_page_first_chunk(

#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
extern int __init pcpu_lpage_build_unit_map(
- size_t static_size, size_t reserved_size,
- ssize_t *dyn_sizep, size_t *unit_sizep,
- size_t lpage_size, int *unit_map,
+ size_t reserved_size, ssize_t *dyn_sizep,
+ size_t *unit_sizep, size_t lpage_size,
+ int *unit_map,
pcpu_fc_cpu_distance_fn_t cpu_distance_fn);

extern ssize_t __init pcpu_lpage_first_chunk(
- size_t static_size, size_t reserved_size,
- size_t dyn_size, size_t unit_size,
- size_t lpage_size, const int *unit_map,
- int nr_units,
+ size_t reserved_size, size_t dyn_size,
+ size_t unit_size, size_t lpage_size,
+ const int *unit_map, int nr_units,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn);
diff --git a/mm/percpu.c b/mm/percpu.c
index 6617020..e9f45ab 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1464,7 +1464,6 @@ static inline size_t pcpu_calc_fc_sizes(size_t static_size,
!defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
/**
* pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
- * @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
*
@@ -1489,9 +1488,9 @@ static inline size_t pcpu_calc_fc_sizes(size_t static_size,
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
-ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
- ssize_t dyn_size)
+ssize_t __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
{
+ const size_t static_size = __per_cpu_end - __per_cpu_start;
size_t size_sum, unit_size, chunk_size;
void *base;
unsigned int cpu;
@@ -1536,7 +1535,6 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
/**
* pcpu_page_first_chunk - map the first chunk using PAGE_SIZE pages
- * @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @alloc_fn: function to allocate percpu page, always called with PAGE_SIZE
* @free_fn: funtion to free percpu page, always called with PAGE_SIZE
@@ -1552,12 +1550,13 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
-ssize_t __init pcpu_page_first_chunk(size_t static_size, size_t reserved_size,
+ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct vm;
+ const size_t static_size = __per_cpu_end - __per_cpu_start;
char psize_str[16];
int unit_pages;
size_t pages_size;
@@ -1641,7 +1640,6 @@ out_free_ar:
#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
/**
* pcpu_lpage_build_unit_map - build unit_map for large page remapping
- * @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @dyn_sizep: in/out parameter for dynamic size, -1 for auto
* @unit_sizep: out parameter for unit size
@@ -1661,13 +1659,14 @@ out_free_ar:
* On success, fills in @unit_map, sets *@dyn_sizep, *@unit_sizep and
* returns the number of units to be allocated. -errno on failure.
*/
-int __init pcpu_lpage_build_unit_map(size_t static_size, size_t reserved_size,
- ssize_t *dyn_sizep, size_t *unit_sizep,
- size_t lpage_size, int *unit_map,
+int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
+ size_t *unit_sizep, size_t lpage_size,
+ int *unit_map,
pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
static int group_map[NR_CPUS] __initdata;
static int group_cnt[NR_CPUS] __initdata;
+ const size_t static_size = __per_cpu_end - __per_cpu_start;
int group_cnt_max = 0;
size_t size_sum, min_unit_size, alloc_size;
int upa, max_upa, uninitialized_var(best_upa); /* units_per_alloc */
@@ -1819,7 +1818,6 @@ static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,

/**
* pcpu_lpage_first_chunk - remap the first percpu chunk using large page
- * @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @dyn_size: free size for dynamic allocation in bytes
* @unit_size: unit size in bytes
@@ -1850,15 +1848,15 @@ static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
-ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
- size_t dyn_size, size_t unit_size,
- size_t lpage_size, const int *unit_map,
- int nr_units,
+ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
+ size_t unit_size, size_t lpage_size,
+ const int *unit_map, int nr_units,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn)
{
static struct vm_struct vm;
+ const size_t static_size = __per_cpu_end - __per_cpu_start;
size_t chunk_size = unit_size * nr_units;
size_t map_size;
unsigned int cpu;
@@ -2037,7 +2035,6 @@ EXPORT_SYMBOL(__per_cpu_offset);

void __init setup_per_cpu_areas(void)
{
- size_t static_size = __per_cpu_end - __per_cpu_start;
ssize_t unit_size;
unsigned long delta;
unsigned int cpu;
@@ -2046,7 +2043,7 @@ void __init setup_per_cpu_areas(void)
* Always reserve area for module percpu variables. That's
* what the legacy allocator did.
*/
- unit_size = pcpu_embed_first_chunk(static_size, PERCPU_MODULE_RESERVE,
+ unit_size = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
PERCPU_DYNAMIC_RESERVE);
if (unit_size < 0)
panic("Failed to initialized percpu areas.");
--
1.6.0.2

2009-07-21 10:28:11

by Tejun Heo

Subject: [PATCH 14/20] vmalloc: implement pcpu_get_vm_areas()

To directly use spread NUMA memories for percpu units, the percpu
allocator will be updated to allow sparsely mapping units in a chunk.
As the distances between units can be very large, this makes
allocating a single vmap area for each chunk undesirable. This patch
implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which
allocate and free sparse congruent vmap areas.

pcpu_get_vm_areas() takes @offsets and @sizes arrays which define the
distances between and sizes of the vmap areas. It scans down from the
top of the vmalloc area looking for the top-most address which can
accommodate all the areas. The top-down scan is to avoid interacting
with regular vmallocs, which could otherwise push these congruent
areas up little by little and end up wasting address space and page
tables.

To speed up the top-down scan, a hint for the highest possible
address is maintained. Although the scan is linear from the hint,
given the usual large holes between the memory addresses of NUMA
nodes, the scan is highly likely to finish right after finding the
first hole for the last unit, which is scanned first.
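
The idea of the top-down congruent scan can be sketched in user space
as below. This is a simplified illustration and not the code in this
patch: it walks a flat array of busy ranges instead of the vmap_area
rbtree, ignores the hint, and the names struct range, align_down() and
find_congruent_base() are hypothetical. It starts from the highest
aligned base and pulls the base down whenever any area collides with
a busy range.

```c
#include <assert.h>
#include <stdbool.h>

struct range { unsigned long start, end; };	/* half-open [start, end) */

static unsigned long align_down(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);	/* @align must be a power of two */
}

/*
 * Find the highest aligned @base such that every area
 * [base + offsets[i], base + offsets[i] + sizes[i]) lies inside
 * [vm_start, vm_end) and overlaps none of the @busy ranges.
 */
static bool find_congruent_base(const struct range *busy, int nr_busy,
				unsigned long vm_start, unsigned long vm_end,
				const unsigned long *offsets,
				const unsigned long *sizes, int nr,
				unsigned long align, unsigned long *basep)
{
	unsigned long span = 0, base;
	bool moved;
	int i, b;

	for (i = 0; i < nr; i++)
		if (offsets[i] + sizes[i] > span)
			span = offsets[i] + sizes[i];
	if (vm_end < vm_start || vm_end - vm_start < span)
		return false;

	base = align_down(vm_end - span, align);	/* highest candidate */
	do {
		moved = false;
		for (i = 0; i < nr && !moved; i++) {
			for (b = 0; b < nr_busy && !moved; b++) {
				unsigned long s = base + offsets[i];
				unsigned long e = s + sizes[i];

				if (s >= busy[b].end || e <= busy[b].start)
					continue;	/* no overlap */
				/* no room below this busy range for area i */
				if (busy[b].start < offsets[i] + sizes[i])
					return false;
				/* pull base down so area i ends at or below it */
				base = align_down(busy[b].start - offsets[i] -
						  sizes[i], align);
				moved = true;
			}
		}
		if (base < vm_start)
			return false;
	} while (moved);

	*basep = base;
	return true;
}
```

Each collision strictly lowers the base, so the loop terminates, and
no aligned base between the old and new value could have worked for
the colliding area, so the result is the highest usable base.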

Signed-off-by: Tejun Heo <[email protected]>
Cc: Nick Piggin <[email protected]>
---
include/linux/vmalloc.h | 6 +
mm/vmalloc.c | 293 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 299 insertions(+), 0 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index a43ebec..227c2a5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -115,4 +115,10 @@ extern rwlock_t vmlist_lock;
extern struct vm_struct *vmlist;
extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);

+struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
+ const size_t *sizes, int nr_vms,
+ size_t align, gfp_t gfp_mask);
+
+void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
+
#endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2eb461c..ebe3470 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -265,6 +265,7 @@ struct vmap_area {
static DEFINE_SPINLOCK(vmap_area_lock);
static struct rb_root vmap_area_root = RB_ROOT;
static LIST_HEAD(vmap_area_list);
+static unsigned long vmap_area_pcpu_hole;

static struct vmap_area *__find_vmap_area(unsigned long addr)
{
@@ -431,6 +432,15 @@ static void __free_vmap_area(struct vmap_area *va)
RB_CLEAR_NODE(&va->rb_node);
list_del_rcu(&va->list);

+ /*
+ * Track the highest possible candidate for pcpu area
+ * allocation. Areas outside of vmalloc area can be returned
+ * here too, consider only end addresses which fall inside
+ * vmalloc area proper.
+ */
+ if (va->va_end > VMALLOC_START && va->va_end <= VMALLOC_END)
+ vmap_area_pcpu_hole = max(vmap_area_pcpu_hole, va->va_end);
+
call_rcu(&va->rcu_head, rcu_free_va);
}

@@ -1038,6 +1048,9 @@ void __init vmalloc_init(void)
va->va_end = va->va_start + tmp->size;
__insert_vmap_area(va);
}
+
+ vmap_area_pcpu_hole = VMALLOC_END;
+
vmap_initialized = true;
}

@@ -1821,6 +1834,286 @@ void free_vm_area(struct vm_struct *area)
}
EXPORT_SYMBOL_GPL(free_vm_area);

+static struct vmap_area *node_to_va(struct rb_node *n)
+{
+ return n ? rb_entry(n, struct vmap_area, rb_node) : NULL;
+}
+
+/**
+ * pvm_find_next_prev - find the next and prev vmap_area surrounding @end
+ * @end: target address
+ * @pnext: out arg for the next vmap_area
+ * @pprev: out arg for the previous vmap_area
+ *
+ * Returns: %true if either or both of next and prev are found,
+ * %false if no vmap_area exists
+ *
+ * Find vmap_areas end addresses of which enclose @end. ie. if not
+ * NULL, *pnext->va_end > @end and *pprev->va_end <= @end.
+ */
+static bool pvm_find_next_prev(unsigned long end,
+ struct vmap_area **pnext,
+ struct vmap_area **pprev)
+{
+ struct rb_node *n = vmap_area_root.rb_node;
+ struct vmap_area *va = NULL;
+
+ while (n) {
+ va = rb_entry(n, struct vmap_area, rb_node);
+ if (end < va->va_end)
+ n = n->rb_left;
+ else if (end > va->va_end)
+ n = n->rb_right;
+ else
+ break;
+ }
+
+ if (!va)
+ return false;
+
+ if (va->va_end > end) {
+ *pnext = va;
+ *pprev = node_to_va(rb_prev(&(*pnext)->rb_node));
+ } else {
+ *pprev = va;
+ *pnext = node_to_va(rb_next(&(*pprev)->rb_node));
+ }
+ return true;
+}
+
+/**
+ * pvm_determine_end - find the highest aligned address between two vmap_areas
+ * @pnext: in/out arg for the next vmap_area
+ * @pprev: in/out arg for the previous vmap_area
+ * @align: alignment
+ *
+ * Returns: determined end address
+ *
+ * Find the highest aligned address between *@pnext and *@pprev below
+ * VMALLOC_END. *@pnext and *@pprev are adjusted so that the aligned
+ * down address is between the end addresses of the two vmap_areas.
+ *
+ * Please note that the address returned by this function may fall
+ * inside *@pnext vmap_area. The caller is responsible for checking
+ * that.
+ */
+static unsigned long pvm_determine_end(struct vmap_area **pnext,
+ struct vmap_area **pprev,
+ unsigned long align)
+{
+ const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+ unsigned long addr;
+
+ if (*pnext)
+ addr = min((*pnext)->va_start & ~(align - 1), vmalloc_end);
+ else
+ addr = vmalloc_end;
+
+ while (*pprev && (*pprev)->va_end > addr) {
+ *pnext = *pprev;
+ *pprev = node_to_va(rb_prev(&(*pnext)->rb_node));
+ }
+
+ return addr;
+}
+
+/**
+ * pcpu_get_vm_areas - allocate vmalloc areas for percpu allocator
+ * @offsets: array containing offset of each area
+ * @sizes: array containing size of each area
+ * @nr_vms: the number of areas to allocate
+ * @align: alignment, all entries in @offsets and @sizes must be aligned to this
+ * @gfp_mask: allocation mask
+ *
+ * Returns: kmalloc'd vm_struct pointer array pointing to allocated
+ * vm_structs on success, %NULL on failure
+ *
+ * Percpu allocator wants to use congruent vm areas so that it can
+ * maintain the offsets among percpu areas. This function allocates
+ * congruent vmalloc areas for it. These areas tend to be scattered
+ * pretty far, distance between two areas easily going up to
+ * gigabytes. To avoid interacting with regular vmallocs, these areas
+ * are allocated from top.
+ *
+ * Despite its complicated look, this allocator is rather simple. It
+ * does everything top-down and scans areas from the end looking for
+ * matching slot. While scanning, if any of the areas overlaps with
+ * existing vmap_area, the base address is pulled down to fit the
+ * area. Scanning is repeated till all the areas fit and then all
+ * necessary data structures are inserted and the result is returned.
+ */
+struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
+ const size_t *sizes, int nr_vms,
+ size_t align, gfp_t gfp_mask)
+{
+ const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
+ const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+ struct vmap_area **vas, *prev, *next;
+ struct vm_struct **vms;
+ int area, area2, last_area, term_area;
+ unsigned long base, start, end, last_end;
+ bool purged = false;
+
+ gfp_mask &= GFP_RECLAIM_MASK;
+
+ /* verify parameters and allocate data structures */
+ BUG_ON(align & ~PAGE_MASK || !is_power_of_2(align));
+ for (last_area = 0, area = 0; area < nr_vms; area++) {
+ start = offsets[area];
+ end = start + sizes[area];
+
+ /* is everything aligned properly? */
+ BUG_ON(!IS_ALIGNED(offsets[area], align));
+ BUG_ON(!IS_ALIGNED(sizes[area], align));
+
+ /* detect the area with the highest address */
+ if (start > offsets[last_area])
+ last_area = area;
+
+ for (area2 = 0; area2 < nr_vms; area2++) {
+ unsigned long start2 = offsets[area2];
+ unsigned long end2 = start2 + sizes[area2];
+
+ if (area2 == area)
+ continue;
+
+ BUG_ON(start2 >= start && start2 < end);
+ BUG_ON(end2 <= end && end2 > start);
+ }
+ }
+ last_end = offsets[last_area] + sizes[last_area];
+
+ if (vmalloc_end - vmalloc_start < last_end) {
+ WARN_ON(true);
+ return NULL;
+ }
+
+ vms = kzalloc(sizeof(vms[0]) * nr_vms, gfp_mask);
+ vas = kzalloc(sizeof(vas[0]) * nr_vms, gfp_mask);
+ if (!vas || !vms)
+ goto err_free;
+
+ for (area = 0; area < nr_vms; area++) {
+ vas[area] = kzalloc(sizeof(struct vmap_area), gfp_mask);
+ vms[area] = kzalloc(sizeof(struct vm_struct), gfp_mask);
+ if (!vas[area] || !vms[area])
+ goto err_free;
+ }
+retry:
+ spin_lock(&vmap_area_lock);
+
+ /* start scanning - we scan from the top, begin with the last area */
+ area = term_area = last_area;
+ start = offsets[area];
+ end = start + sizes[area];
+
+ if (!pvm_find_next_prev(vmap_area_pcpu_hole, &next, &prev)) {
+ base = vmalloc_end - last_end;
+ goto found;
+ }
+ base = pvm_determine_end(&next, &prev, align) - end;
+
+ while (true) {
+ BUG_ON(next && next->va_end <= base + end);
+ BUG_ON(prev && prev->va_end > base + end);
+
+ /*
+ * base might have underflowed, add last_end before
+ * comparing.
+ */
+ if (base + last_end < vmalloc_start + last_end) {
+ spin_unlock(&vmap_area_lock);
+ if (!purged) {
+ purge_vmap_area_lazy();
+ purged = true;
+ goto retry;
+ }
+ goto err_free;
+ }
+
+ /*
+ * If next overlaps, move base downwards so that it's
+ * right below next and then recheck.
+ */
+ if (next && next->va_start < base + end) {
+ base = pvm_determine_end(&next, &prev, align) - end;
+ term_area = area;
+ continue;
+ }
+
+ /*
+ * If prev overlaps, shift down next and prev and move
+ * base so that it's right below new next and then
+ * recheck.
+ */
+ if (prev && prev->va_end > base + start) {
+ next = prev;
+ prev = node_to_va(rb_prev(&next->rb_node));
+ base = pvm_determine_end(&next, &prev, align) - end;
+ term_area = area;
+ continue;
+ }
+
+ /*
+ * This area fits, move on to the previous one. If
+ * the previous one is the terminal one, we're done.
+ */
+ area = (area + nr_vms - 1) % nr_vms;
+ if (area == term_area)
+ break;
+ start = offsets[area];
+ end = start + sizes[area];
+ pvm_find_next_prev(base + end, &next, &prev);
+ }
+found:
+ /* we've found a fitting base, insert all va's */
+ for (area = 0; area < nr_vms; area++) {
+ struct vmap_area *va = vas[area];
+
+ va->va_start = base + offsets[area];
+ va->va_end = va->va_start + sizes[area];
+ __insert_vmap_area(va);
+ }
+
+ vmap_area_pcpu_hole = base + offsets[last_area];
+
+ spin_unlock(&vmap_area_lock);
+
+ /* insert all vm's */
+ for (area = 0; area < nr_vms; area++)
+ insert_vmalloc_vm(vms[area], vas[area], 0,
+ pcpu_get_vm_areas);
+
+ kfree(vas);
+ return vms;
+
+err_free:
+ for (area = 0; area < nr_vms; area++) {
+ if (vas)
+ kfree(vas[area]);
+ if (vms)
+ kfree(vms[area]);
+ }
+ kfree(vas);
+ kfree(vms);
+ return NULL;
+}
+
+/**
+ * pcpu_free_vm_areas - free vmalloc areas for percpu allocator
+ * @vms: vm_struct pointer array returned by pcpu_get_vm_areas()
+ * @nr_vms: the number of allocated areas
+ *
+ * Free vm_structs and the array allocated by pcpu_get_vm_areas().
+ */
+void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
+{
+ int i;
+
+ for (i = 0; i < nr_vms; i++)
+ free_vm_area(vms[i]);
+ kfree(vms);
+}

#ifdef CONFIG_PROC_FS
static void *s_start(struct seq_file *m, loff_t *pos)
--
1.6.0.2
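
For readers following pcpu_get_vm_areas() above, the goal of the scan
can be modelled in userspace: find the highest aligned base at which
every area, placed at base + offset, avoids all busy ranges. The
sketch below is a brute-force model and not the patch's algorithm. It
steps base down one alignment unit at a time instead of jumping below
the conflicting vmap_area through the rb-tree, and struct busy_range,
base_fits() and find_congruent_base() are hypothetical names
introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a busy vmap_area: half-open range [start, end). */
struct busy_range {
	unsigned long start;
	unsigned long end;
};

/* Round down to a power-of-two alignment, like "& ~(align - 1)" in the patch. */
static unsigned long align_down(unsigned long addr, unsigned long align)
{
	assert(align && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Round up to a power-of-two alignment, like the kernel's ALIGN(). */
static unsigned long align_up(unsigned long addr, unsigned long align)
{
	return align_down(addr + align - 1, align);
}

/* True if every area placed at base + offsets[i] avoids all busy ranges. */
static bool base_fits(unsigned long base,
		      const unsigned long *offsets, const size_t *sizes,
		      int nr_areas,
		      const struct busy_range *busy, int nr_busy)
{
	for (int i = 0; i < nr_areas; i++) {
		unsigned long start = base + offsets[i];
		unsigned long end = start + sizes[i];

		for (int j = 0; j < nr_busy; j++)
			if (start < busy[j].end && busy[j].start < end)
				return false;
	}
	return true;
}

/*
 * Brute-force model: try bases from the top of [vm_start, vm_end)
 * downwards, one alignment step at a time.  Offsets and sizes are
 * assumed to be multiples of align, and nr_areas at least 1.
 */
static bool find_congruent_base(unsigned long vm_start, unsigned long vm_end,
				unsigned long align,
				const unsigned long *offsets,
				const size_t *sizes, int nr_areas,
				const struct busy_range *busy, int nr_busy,
				unsigned long *basep)
{
	unsigned long lo = align_up(vm_start, align);
	unsigned long hi = align_down(vm_end, align);
	unsigned long last_end = 0;
	unsigned long base;

	/* end offset of the highest area, as last_end in the patch */
	for (int i = 0; i < nr_areas; i++)
		if (offsets[i] + sizes[i] > last_end)
			last_end = offsets[i] + sizes[i];

	if (hi < lo || hi - lo < last_end)
		return false;

	for (base = hi - last_end; ; base -= align) {
		if (base_fits(base, offsets, sizes, nr_areas, busy, nr_busy)) {
			*basep = base;
			return true;
		}
		/* the next step down would fall below the usable range */
		if (base - lo < align)
			return false;
	}
}
```

The model checks the lower bound before decrementing, so base never
wraps around. The patch instead allows base to underflow and
compensates by adding last_end to both sides of the comparison.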

2009-07-21 10:28:17

by Tejun Heo

Subject: [PATCH 18/20] percpu: kill lpage first chunk allocator

With x86 converted to the embedding allocator, lpage no longer has any
users. Kill it along with the cpa handling code.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jan Beulich <[email protected]>
---
Documentation/kernel-parameters.txt | 10 +-
arch/x86/mm/pageattr.c | 20 +---
include/linux/percpu.h | 16 ---
mm/percpu.c | 241 -----------------------------------
4 files changed, 6 insertions(+), 281 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 89b7568..720b758 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1916,11 +1916,11 @@ and is between 256 and 4096 characters. It is defined in the file
See arch/parisc/kernel/pdc_chassis.c

percpu_alloc= Select which percpu first chunk allocator to use.
- Currently supported values are "embed", "page" and
- "lpage". Archs may support subset or none of the
- selections. See comments in mm/percpu.c for details
- on each allocator. This parameter is primarily for
- debugging and performance comparison.
+ Currently supported values are "embed" and "page".
+ Archs may support subset or none of the selections.
+ See comments in mm/percpu.c for details on each
+ allocator. This parameter is primarily for debugging
+ and performance comparison.

pf. [PARIDE]
See Documentation/blockdev/paride.txt.
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index c106f78..b0a09ba 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -684,7 +684,7 @@ static int cpa_process_alias(struct cpa_data *cpa)
{
struct cpa_data alias_cpa;
unsigned long laddr = (unsigned long)__va(cpa->pfn << PAGE_SHIFT);
- unsigned long vaddr, remapped;
+ unsigned long vaddr;
int ret;

if (cpa->pfn >= max_pfn_mapped)
@@ -739,24 +739,6 @@ static int cpa_process_alias(struct cpa_data *cpa)
}
#endif

- /*
- * If the PMD page was partially used for per-cpu remapping,
- * the recycled area needs to be split and modified. Because
- * the area is always proper subset of a PMD page
- * cpa->numpages is guaranteed to be 1 for these areas, so
- * there's no need to loop over and check for further remaps.
- */
- remapped = (unsigned long)pcpu_lpage_remapped((void *)laddr);
- if (remapped) {
- WARN_ON(cpa->numpages > 1);
- alias_cpa = *cpa;
- alias_cpa.vaddr = &remapped;
- alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);
- ret = __change_page_attr_set_clr(&alias_cpa, 0);
- if (ret)
- return ret;
- }
-
return 0;
}

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 2535993..878836c 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -82,7 +82,6 @@ enum pcpu_fc {
PCPU_FC_AUTO,
PCPU_FC_EMBED,
PCPU_FC_PAGE,
- PCPU_FC_LPAGE,

PCPU_FC_NR,
};
@@ -95,7 +94,6 @@ typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size,
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
-typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
int nr_units);
@@ -124,20 +122,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_populate_pte_fn_t populate_pte_fn);
#endif

-#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-extern int __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_map_fn_t map_fn);
-
-extern void *pcpu_lpage_remapped(void *kaddr);
-#else
-static inline void *pcpu_lpage_remapped(void *kaddr)
-{
- return NULL;
-}
-#endif
-
/*
* Use this to get to a cpu's version of the per-cpu object
* dynamically allocated. Non-atomic access to the current CPU's
diff --git a/mm/percpu.c b/mm/percpu.c
index c2826d0..7793392 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1713,7 +1713,6 @@ const char *pcpu_fc_names[PCPU_FC_NR] __initdata = {
[PCPU_FC_AUTO] = "auto",
[PCPU_FC_EMBED] = "embed",
[PCPU_FC_PAGE] = "page",
- [PCPU_FC_LPAGE] = "lpage",
};

enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
@@ -1730,10 +1729,6 @@ static int __init percpu_alloc_setup(char *str)
else if (!strcmp(str, "page"))
pcpu_chosen_fc = PCPU_FC_PAGE;
#endif
-#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
- else if (!strcmp(str, "lpage"))
- pcpu_chosen_fc = PCPU_FC_LPAGE;
-#endif
else
pr_warning("PERCPU: unknown allocator %s specified\n", str);

@@ -1970,242 +1965,6 @@ out_free_ar:
}
#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK */

-#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-struct pcpul_ent {
- void *ptr;
- void *map_addr;
-};
-
-static size_t pcpul_size;
-static size_t pcpul_lpage_size;
-static int pcpul_nr_lpages;
-static struct pcpul_ent *pcpul_map;
-
-static bool __init pcpul_unit_to_cpu(int unit, const struct pcpu_alloc_info *ai,
- unsigned int *cpup)
-{
- int group, cunit;
-
- for (group = 0, cunit = 0; group < ai->nr_groups; group++) {
- const struct pcpu_group_info *gi = &ai->groups[group];
-
- if (unit < cunit + gi->nr_units) {
- if (cpup)
- *cpup = gi->cpu_map[unit - cunit];
- return true;
- }
- cunit += gi->nr_units;
- }
-
- return false;
-}
-
-static int __init pcpul_cpu_to_unit(int cpu, const struct pcpu_alloc_info *ai)
-{
- int group, unit, i;
-
- for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
- const struct pcpu_group_info *gi = &ai->groups[group];
-
- for (i = 0; i < gi->nr_units; i++)
- if (gi->cpu_map[i] == cpu)
- return unit + i;
- }
- BUG();
-}
-
-/**
- * pcpu_lpage_first_chunk - remap the first percpu chunk using large page
- * @ai: pcpu_alloc_info
- * @alloc_fn: function to allocate percpu lpage, always called with lpage_size
- * @free_fn: function to free percpu memory, @size <= lpage_size
- * @map_fn: function to map percpu lpage, always called with lpage_size
- *
- * This allocator uses large page to build and map the first chunk.
- * Unlike other helpers, the caller should provide fully initialized
- * @ai. This can be done using pcpu_build_alloc_info(). This two
- * stage initialization is to allow arch code to evaluate the
- * parameters before committing to it.
- *
- * Large pages are allocated as directed by @unit_map and other
- * parameters and mapped to vmalloc space. Unused holes are returned
- * to the page allocator. Note that these holes end up being actively
- * mapped twice - once to the physical mapping and to the vmalloc area
- * for the first percpu chunk. Depending on architecture, this might
- * cause problem when changing page attributes of the returned area.
- * These double mapped areas can be detected using
- * pcpu_lpage_remapped().
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_map_fn_t map_fn)
-{
- static struct vm_struct vm;
- const size_t lpage_size = ai->atom_size;
- size_t chunk_size, map_size;
- unsigned int cpu;
- int i, j, unit, nr_units, rc;
-
- nr_units = 0;
- for (i = 0; i < ai->nr_groups; i++)
- nr_units += ai->groups[i].nr_units;
-
- chunk_size = ai->unit_size * nr_units;
- BUG_ON(chunk_size % lpage_size);
-
- pcpul_size = ai->static_size + ai->reserved_size + ai->dyn_size;
- pcpul_lpage_size = lpage_size;
- pcpul_nr_lpages = chunk_size / lpage_size;
-
- /* allocate pointer array and alloc large pages */
- map_size = pcpul_nr_lpages * sizeof(pcpul_map[0]);
- pcpul_map = alloc_bootmem(map_size);
-
- /* allocate all pages */
- for (i = 0; i < pcpul_nr_lpages; i++) {
- size_t offset = i * lpage_size;
- int first_unit = offset / ai->unit_size;
- int last_unit = (offset + lpage_size - 1) / ai->unit_size;
- void *ptr;
-
- /* find out which cpu is mapped to this unit */
- for (unit = first_unit; unit <= last_unit; unit++)
- if (pcpul_unit_to_cpu(unit, ai, &cpu))
- goto found;
- continue;
- found:
- ptr = alloc_fn(cpu, lpage_size, lpage_size);
- if (!ptr) {
- pr_warning("PERCPU: failed to allocate large page "
- "for cpu%u\n", cpu);
- goto enomem;
- }
-
- pcpul_map[i].ptr = ptr;
- }
-
- /* return unused holes */
- for (unit = 0; unit < nr_units; unit++) {
- size_t start = unit * ai->unit_size;
- size_t end = start + ai->unit_size;
- size_t off, next;
-
- /* don't free used part of occupied unit */
- if (pcpul_unit_to_cpu(unit, ai, NULL))
- start += pcpul_size;
-
- /* unit can span more than one page, punch the holes */
- for (off = start; off < end; off = next) {
- void *ptr = pcpul_map[off / lpage_size].ptr;
- next = min(roundup(off + 1, lpage_size), end);
- if (ptr)
- free_fn(ptr + off % lpage_size, next - off);
- }
- }
-
- /* allocate address, map and copy */
- vm.flags = VM_ALLOC;
- vm.size = chunk_size;
- vm_area_register_early(&vm, ai->unit_size);
-
- for (i = 0; i < pcpul_nr_lpages; i++) {
- if (!pcpul_map[i].ptr)
- continue;
- pcpul_map[i].map_addr = vm.addr + i * lpage_size;
- map_fn(pcpul_map[i].ptr, lpage_size, pcpul_map[i].map_addr);
- }
-
- for_each_possible_cpu(cpu)
- memcpy(vm.addr + pcpul_cpu_to_unit(cpu, ai) * ai->unit_size,
- __per_cpu_load, ai->static_size);
-
- /* we're ready, commit */
- pr_info("PERCPU: large pages @%p s%zu r%zu d%zu u%zu\n",
- vm.addr, ai->static_size, ai->reserved_size, ai->dyn_size,
- ai->unit_size);
-
- rc = pcpu_setup_first_chunk(ai, vm.addr);
-
- /*
- * Sort pcpul_map array for pcpu_lpage_remapped(). Unmapped
- * lpages are pushed to the end and trimmed.
- */
- for (i = 0; i < pcpul_nr_lpages - 1; i++)
- for (j = i + 1; j < pcpul_nr_lpages; j++) {
- struct pcpul_ent tmp;
-
- if (!pcpul_map[j].ptr)
- continue;
- if (pcpul_map[i].ptr &&
- pcpul_map[i].ptr < pcpul_map[j].ptr)
- continue;
-
- tmp = pcpul_map[i];
- pcpul_map[i] = pcpul_map[j];
- pcpul_map[j] = tmp;
- }
-
- while (pcpul_nr_lpages && !pcpul_map[pcpul_nr_lpages - 1].ptr)
- pcpul_nr_lpages--;
-
- return rc;
-
-enomem:
- for (i = 0; i < pcpul_nr_lpages; i++)
- if (pcpul_map[i].ptr)
- free_fn(pcpul_map[i].ptr, lpage_size);
- free_bootmem(__pa(pcpul_map), map_size);
- return -ENOMEM;
-}
-
-/**
- * pcpu_lpage_remapped - determine whether a kaddr is in pcpul recycled area
- * @kaddr: the kernel address in question
- *
- * Determine whether @kaddr falls in the pcpul recycled area. This is
- * used by pageattr to detect VM aliases and break up the pcpu large
- * page mapping such that the same physical page is not mapped under
- * different attributes.
- *
- * The recycled area is always at the tail of a partially used large
- * page.
- *
- * RETURNS:
- * Address of corresponding remapped pcpu address if match is found;
- * otherwise, NULL.
- */
-void *pcpu_lpage_remapped(void *kaddr)
-{
- unsigned long lpage_mask = pcpul_lpage_size - 1;
- void *lpage_addr = (void *)((unsigned long)kaddr & ~lpage_mask);
- unsigned long offset = (unsigned long)kaddr & lpage_mask;
- int left = 0, right = pcpul_nr_lpages - 1;
- int pos;
-
- /* pcpul in use at all? */
- if (!pcpul_map)
- return NULL;
-
- /* okay, perform binary search */
- while (left <= right) {
- pos = (left + right) / 2;
-
- if (pcpul_map[pos].ptr < lpage_addr)
- left = pos + 1;
- else if (pcpul_map[pos].ptr > lpage_addr)
- right = pos - 1;
- else
- return pcpul_map[pos].map_addr + offset;
- }
-
- return NULL;
-}
-#endif /* CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK */
-
/*
* Generic percpu area setup.
*
--
1.6.0.2

2009-07-21 10:28:35

by Tejun Heo

Subject: [PATCH 05/20] percpu: generalize first chunk allocator selection

Now that all first chunk allocators are in mm/percpu.c, it makes sense
to generalize the percpu_alloc kernel parameter. Define PCPU_FC_*
and set pcpu_chosen_fc using early_param() in mm/percpu.c. Arch code
can use the set value to determine which first chunk allocator to use.

Signed-off-by: Tejun Heo <[email protected]>
---
Documentation/kernel-parameters.txt | 11 ++++++-----
arch/x86/kernel/setup_percpu.c | 24 ++++++------------------
include/linux/percpu.h | 12 ++++++++++++
mm/percpu.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 56 insertions(+), 23 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index faa7178..89b7568 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1915,11 +1915,12 @@ and is between 256 and 4096 characters. It is defined in the file
Format: { 0 | 1 }
See arch/parisc/kernel/pdc_chassis.c

- percpu_alloc= [X86] Select which percpu first chunk allocator to use.
- Allowed values are one of "lpage", "embed" and "page".
- See comments in arch/x86/kernel/setup_percpu.c for
- details on each allocator. This parameter is primarily
- for debugging and performance comparison.
+ percpu_alloc= Select which percpu first chunk allocator to use.
+ Currently supported values are "embed", "page" and
+ "lpage". Archs may support subset or none of the
+ selections. See comments in mm/percpu.c for details
+ on each allocator. This parameter is primarily for
+ debugging and performance comparison.

pf. [PARIDE]
See Documentation/blockdev/paride.txt.
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 3c943b4..7f1e09b 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -267,16 +267,6 @@ static ssize_t __init setup_pcpu_page(size_t static_size)
pcpup_populate_pte);
}

-/* for explicit first chunk allocator selection */
-static char pcpu_chosen_alloc[16] __initdata;
-
-static int __init percpu_alloc_setup(char *str)
-{
- strncpy(pcpu_chosen_alloc, str, sizeof(pcpu_chosen_alloc) - 1);
- return 0;
-}
-early_param("percpu_alloc", percpu_alloc_setup);
-
static inline void setup_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
@@ -307,19 +297,17 @@ void __init setup_per_cpu_areas(void)
* each allocator for details.
*/
ret = -EINVAL;
- if (strlen(pcpu_chosen_alloc)) {
- if (strcmp(pcpu_chosen_alloc, "page")) {
- if (!strcmp(pcpu_chosen_alloc, "lpage"))
+ if (pcpu_chosen_fc != PCPU_FC_AUTO) {
+ if (pcpu_chosen_fc != PCPU_FC_PAGE) {
+ if (pcpu_chosen_fc == PCPU_FC_LPAGE)
ret = setup_pcpu_lpage(static_size, true);
- else if (!strcmp(pcpu_chosen_alloc, "embed"))
- ret = setup_pcpu_embed(static_size, true);
else
- pr_warning("PERCPU: unknown allocator %s "
- "specified\n", pcpu_chosen_alloc);
+ ret = setup_pcpu_embed(static_size, true);
+
if (ret < 0)
pr_warning("PERCPU: %s allocator failed (%zd), "
"falling back to page\n",
- pcpu_chosen_alloc, ret);
+ pcpu_fc_names[pcpu_chosen_fc], ret);
}
} else {
ret = setup_pcpu_lpage(static_size, false);
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index e26788e..9be05cb 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -59,6 +59,18 @@
extern void *pcpu_base_addr;
extern const int *pcpu_unit_map;

+enum pcpu_fc {
+ PCPU_FC_AUTO,
+ PCPU_FC_EMBED,
+ PCPU_FC_PAGE,
+ PCPU_FC_LPAGE,
+
+ PCPU_FC_NR,
+};
+extern const char *pcpu_fc_names[PCPU_FC_NR];
+
+extern enum pcpu_fc pcpu_chosen_fc;
+
typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
diff --git a/mm/percpu.c b/mm/percpu.c
index 4d926d3..6617020 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1414,6 +1414,38 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
return pcpu_unit_size;
}

+const char *pcpu_fc_names[PCPU_FC_NR] __initdata = {
+ [PCPU_FC_AUTO] = "auto",
+ [PCPU_FC_EMBED] = "embed",
+ [PCPU_FC_PAGE] = "page",
+ [PCPU_FC_LPAGE] = "lpage",
+};
+
+enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
+
+static int __init percpu_alloc_setup(char *str)
+{
+ if (0)
+ /* nada */;
+#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
+ else if (!strcmp(str, "embed"))
+ pcpu_chosen_fc = PCPU_FC_EMBED;
+#endif
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+ else if (!strcmp(str, "page"))
+ pcpu_chosen_fc = PCPU_FC_PAGE;
+#endif
+#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
+ else if (!strcmp(str, "lpage"))
+ pcpu_chosen_fc = PCPU_FC_LPAGE;
+#endif
+ else
+ pr_warning("PERCPU: unknown allocator %s specified\n", str);
+
+ return 0;
+}
+early_param("percpu_alloc", percpu_alloc_setup);
+
static inline size_t pcpu_calc_fc_sizes(size_t static_size,
size_t reserved_size,
ssize_t *dyn_sizep)
--
1.6.0.2
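
The enum and name table introduced above can be exercised outside the
kernel. The following is a minimal userspace sketch, assuming a
table-driven lookup rather than the patch's #ifdef-guarded if/else
chain. parse_percpu_alloc() is a hypothetical helper, and the
pr_warning() for unknown values and the per-arch config guards are
deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum pcpu_fc {
	PCPU_FC_AUTO,
	PCPU_FC_EMBED,
	PCPU_FC_PAGE,
	PCPU_FC_LPAGE,

	PCPU_FC_NR,
};

static const char *pcpu_fc_names[PCPU_FC_NR] = {
	[PCPU_FC_AUTO]	= "auto",
	[PCPU_FC_EMBED]	= "embed",
	[PCPU_FC_PAGE]	= "page",
	[PCPU_FC_LPAGE]	= "lpage",
};

/*
 * Hypothetical helper: map a percpu_alloc= value to an allocator.
 * Only explicit allocators are matched; anything else, including
 * "auto", yields PCPU_FC_AUTO, which is also the default in the patch.
 */
static enum pcpu_fc parse_percpu_alloc(const char *str)
{
	assert(str);

	for (int fc = PCPU_FC_EMBED; fc < PCPU_FC_NR; fc++)
		if (!strcmp(str, pcpu_fc_names[fc]))
			return (enum pcpu_fc)fc;

	return PCPU_FC_AUTO;
}
```

Driving the lookup from pcpu_fc_names keeps the accepted strings and
the names printed in boot messages in one place, at the cost of losing
the compile-time filtering that the #ifdef chain provides.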

2009-07-21 10:29:27

by Tejun Heo

Subject: [PATCH 03/20] percpu: rename 4k first chunk allocator to page

Page size isn't always 4k; it depends on arch and configuration. Rename
the 4k first chunk allocator to page.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Howells <[email protected]>
---
Documentation/kernel-parameters.txt | 2 +-
arch/x86/kernel/setup_percpu.c | 23 ++++++++++++-----------
include/linux/percpu.h | 2 +-
mm/percpu.c | 25 ++++++++++++++-----------
4 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index dd1a6d4..faa7178 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1916,7 +1916,7 @@ and is between 256 and 4096 characters. It is defined in the file
See arch/parisc/kernel/pdc_chassis.c

percpu_alloc= [X86] Select which percpu first chunk allocator to use.
- Allowed values are one of "lpage", "embed" and "4k".
+ Allowed values are one of "lpage", "embed" and "page".
See comments in arch/x86/kernel/setup_percpu.c for
details on each allocator. This parameter is primarily
for debugging and performance comparison.
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 7501bb1..3c943b4 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -249,21 +249,22 @@ static ssize_t __init setup_pcpu_embed(size_t static_size, bool chosen)
}

/*
- * 4k allocator
+ * Page allocator
*
- * Boring fallback 4k allocator. This allocator puts more pressure on
- * PTE TLBs but other than that behaves nicely on both UMA and NUMA.
+ * Boring fallback 4k page allocator. This allocator puts more
+ * pressure on PTE TLBs but other than that behaves nicely on both UMA
+ * and NUMA.
*/
-static void __init pcpu4k_populate_pte(unsigned long addr)
+static void __init pcpup_populate_pte(unsigned long addr)
{
populate_extra_pte(addr);
}

-static ssize_t __init setup_pcpu_4k(size_t static_size)
+static ssize_t __init setup_pcpu_page(size_t static_size)
{
- return pcpu_4k_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
- pcpu_fc_alloc, pcpu_fc_free,
- pcpu4k_populate_pte);
+ return pcpu_page_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ pcpu_fc_alloc, pcpu_fc_free,
+ pcpup_populate_pte);
}

/* for explicit first chunk allocator selection */
@@ -307,7 +308,7 @@ void __init setup_per_cpu_areas(void)
*/
ret = -EINVAL;
if (strlen(pcpu_chosen_alloc)) {
- if (strcmp(pcpu_chosen_alloc, "4k")) {
+ if (strcmp(pcpu_chosen_alloc, "page")) {
if (!strcmp(pcpu_chosen_alloc, "lpage"))
ret = setup_pcpu_lpage(static_size, true);
else if (!strcmp(pcpu_chosen_alloc, "embed"))
@@ -317,7 +318,7 @@ void __init setup_per_cpu_areas(void)
"specified\n", pcpu_chosen_alloc);
if (ret < 0)
pr_warning("PERCPU: %s allocator failed (%zd), "
- "falling back to 4k\n",
+ "falling back to page\n",
pcpu_chosen_alloc, ret);
}
} else {
@@ -326,7 +327,7 @@ void __init setup_per_cpu_areas(void)
ret = setup_pcpu_embed(static_size, false);
}
if (ret < 0)
- ret = setup_pcpu_4k(static_size);
+ ret = setup_pcpu_page(static_size);
if (ret < 0)
panic("cannot allocate static percpu area (%zu bytes, err=%zd)",
static_size, ret);
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index e134c82..7989f61 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -74,7 +74,7 @@ extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
ssize_t dyn_size);

-extern ssize_t __init pcpu_4k_first_chunk(
+extern ssize_t __init pcpu_page_first_chunk(
size_t static_size, size_t reserved_size,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
diff --git a/mm/percpu.c b/mm/percpu.c
index 80317dc..27a3033 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1497,15 +1497,15 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
}

/**
- * pcpu_4k_first_chunk - map the first chunk using PAGE_SIZE pages
+ * pcpu_page_first_chunk - map the first chunk using PAGE_SIZE pages
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @alloc_fn: function to allocate percpu page, always called with PAGE_SIZE
* @free_fn: funtion to free percpu page, always called with PAGE_SIZE
* @populate_pte_fn: function to populate pte
*
- * This is a helper to ease setting up embedded first percpu chunk and
- * can be called where pcpu_setup_first_chunk() is expected.
+ * This is a helper to ease setting up page-remapped first percpu
+ * chunk and can be called where pcpu_setup_first_chunk() is expected.
*
* This is the basic allocator. Static percpu area is allocated
* page-by-page into vmalloc area.
@@ -1514,12 +1514,13 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
-ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_populate_pte_fn_t populate_pte_fn)
+ssize_t __init pcpu_page_first_chunk(size_t static_size, size_t reserved_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct vm;
+ char psize_str[16];
int unit_pages;
size_t pages_size;
struct page **pages;
@@ -1527,6 +1528,8 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
int i, j;
ssize_t ret;

+ snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
+
unit_pages = PFN_UP(max_t(size_t, static_size + reserved_size,
PCPU_MIN_UNIT_SIZE));

@@ -1542,8 +1545,8 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,

ptr = alloc_fn(cpu, PAGE_SIZE);
if (!ptr) {
- pr_warning("PERCPU: failed to allocate "
- "4k page for cpu%u\n", cpu);
+ pr_warning("PERCPU: failed to allocate %s page "
+ "for cpu%u\n", psize_str, cpu);
goto enomem;
}
pages[j++] = virt_to_page(ptr);
@@ -1580,8 +1583,8 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
}

/* we're ready, commit */
- pr_info("PERCPU: %d 4k pages/cpu @%p s%zu r%zu\n",
- unit_pages, vm.addr, static_size, reserved_size);
+ pr_info("PERCPU: %d %s pages/cpu @%p s%zu r%zu\n",
+ unit_pages, psize_str, vm.addr, static_size, reserved_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, -1,
unit_pages << PAGE_SHIFT, vm.addr, NULL);
--
1.6.0.2
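
The psize_str formatting added above can be tried in isolation. The
following is a small userspace sketch, assuming the page size is
passed in as a parameter instead of coming from the kernel's
PAGE_SIZE. format_page_size() is a hypothetical wrapper around the
same snprintf() call.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the patch's
 *   snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
 * Returns snprintf()'s result, i.e. the length of the formatted string.
 */
static int format_page_size(char *buf, size_t len, unsigned long page_size)
{
	assert(buf && len);
	return snprintf(buf, len, "%luK", page_size >> 10);
}
```

With this, the boot messages name the actual page size of the
configuration rather than a hardcoded "4k".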

2009-07-21 10:28:37

by Tejun Heo

Subject: [PATCH 02/20] percpu: improve boot messages

Improve percpu boot messages such that they're uniform and contain
more information.

Signed-off-by: Tejun Heo <[email protected]>
---
mm/percpu.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index c44a5b2..80317dc 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1488,8 +1488,9 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
}

/* we're ready, commit */
- pr_info("PERCPU: Embedded %zu pages at %p, static data %zu bytes\n",
- size_sum >> PAGE_SHIFT, base, static_size);
+ pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
+ PFN_DOWN(size_sum), base, static_size, reserved_size, dyn_size,
+ unit_size);

return pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
unit_size, base, NULL);
@@ -1579,8 +1580,8 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
}

/* we're ready, commit */
- pr_info("PERCPU: %d 4k pages per cpu, static data %zu bytes\n",
- unit_pages, static_size);
+ pr_info("PERCPU: %d 4k pages/cpu @%p s%zu r%zu\n",
+ unit_pages, vm.addr, static_size, reserved_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, -1,
unit_pages << PAGE_SHIFT, vm.addr, NULL);
@@ -1898,8 +1899,8 @@ ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
static_size);

/* we're ready, commit */
- pr_info("PERCPU: Remapped at %p with large pages, static data "
- "%zu bytes\n", vm.addr, static_size);
+ pr_info("PERCPU: large pages @%p s%zu r%zu d%zu u%zu\n",
+ vm.addr, static_size, reserved_size, dyn_size, unit_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
unit_size, vm.addr, unit_map);
--
1.6.0.2

2009-07-21 10:30:03

by Tejun Heo

Subject: [PATCH 17/20] x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA

The embedding percpu first chunk allocator can now handle very sparse
unit mappings. Use the embedding allocator instead of lpage for 64bit
NUMA. This removes extra TLB pressure and the need to do complex and
fragile dancing when changing page attributes.

For 32bit, using very sparse unit mapping isn't a good idea because
the vmalloc space is very constrained. 32bit NUMA machines aren't
exactly the focus of optimization and it isn't very clear whether
lpage performs better than page. Use page first chunk allocator for
32bit NUMAs.

As this leaves setup_pcpu_*() functions pretty much empty, fold them
into setup_per_cpu_areas().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/Kconfig | 4 -
arch/x86/kernel/setup_percpu.c | 155 +++++++--------------------------------
2 files changed, 28 insertions(+), 131 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5c0a3b4..96d2044 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -155,10 +155,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
config NEED_PER_CPU_PAGE_FIRST_CHUNK
def_bool y

-config NEED_PER_CPU_LPAGE_FIRST_CHUNK
- def_bool y
- depends on NEED_MULTIPLE_NODES
-
config HAVE_CPUMASK_OF_CPU_MAP
def_bool X86_64_SMP

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 5b03d7e..4aff7a5 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -55,6 +55,7 @@ EXPORT_SYMBOL(__per_cpu_offset);
#define PERCPU_FIRST_CHUNK_RESERVE 0
#endif

+#ifdef CONFIG_X86_32
/**
* pcpu_need_numa - determine percpu allocation needs to consider NUMA
*
@@ -83,6 +84,7 @@ static bool __init pcpu_need_numa(void)
#endif
return false;
}
+#endif

/**
* pcpu_alloc_bootmem - NUMA friendly alloc_bootmem wrapper for percpu
@@ -136,128 +138,23 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
free_bootmem(__pa(ptr), size);
}

-/*
- * Large page remapping allocator
- */
-#ifdef CONFIG_NEED_MULTIPLE_NODES
-static void __init pcpul_map(void *ptr, size_t size, void *addr)
-{
- pmd_t *pmd, pmd_v;
-
- pmd = populate_extra_pmd((unsigned long)addr);
- pmd_v = pfn_pmd(page_to_pfn(virt_to_page(ptr)), PAGE_KERNEL_LARGE);
- set_pmd(pmd, pmd_v);
-}
-
-static int pcpu_lpage_cpu_distance(unsigned int from, unsigned int to)
+static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
{
+#ifdef CONFIG_NEED_MULTIPLE_NODES
if (early_cpu_to_node(from) == early_cpu_to_node(to))
return LOCAL_DISTANCE;
else
return REMOTE_DISTANCE;
-}
-
-static int __init setup_pcpu_lpage(bool chosen)
-{
- size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
- size_t dyn_size = reserve - PERCPU_FIRST_CHUNK_RESERVE;
- struct pcpu_alloc_info *ai;
- int rc;
-
- /* on non-NUMA, embedding is better */
- if (!chosen && !pcpu_need_numa())
- return -EINVAL;
-
- /* need PSE */
- if (!cpu_has_pse) {
- pr_warning("PERCPU: lpage allocator requires PSE\n");
- return -EINVAL;
- }
-
- /* allocate and build unit_map */
- ai = pcpu_build_alloc_info(PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
- PMD_SIZE, pcpu_lpage_cpu_distance);
- if (IS_ERR(ai)) {
- pr_warning("PERCPU: failed to build unit_map (%ld)\n",
- PTR_ERR(ai));
- return PTR_ERR(ai);
- }
-
- /* do the parameters look okay? */
- if (!chosen) {
- size_t vm_size = VMALLOC_END - VMALLOC_START;
- size_t tot_size = 0;
- int group;
-
- for (group = 0; group < ai->nr_groups; group++)
- tot_size += ai->unit_size * ai->groups[group].nr_units;
-
- /* don't consume more than 20% of vmalloc area */
- if (tot_size > vm_size / 5) {
- pr_info("PERCPU: too large chunk size %zuMB for "
- "large page remap\n", tot_size >> 20);
- rc = -EINVAL;
- goto out_free;
- }
- }
-
- rc = pcpu_lpage_first_chunk(ai, pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
-out_free:
- pcpu_free_alloc_info(ai);
- return rc;
-}
#else
-static int __init setup_pcpu_lpage(bool chosen)
-{
- return -EINVAL;
-}
+ return LOCAL_DISTANCE;
#endif
-
-/*
- * Embedding allocator
- *
- * The first chunk is sized to just contain the static area plus
- * module and dynamic reserves and embedded into linear physical
- * mapping so that it can use PMD mapping without additional TLB
- * pressure.
- */
-static int __init setup_pcpu_embed(bool chosen)
-{
- size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
-
- /*
- * If large page isn't supported, there's no benefit in doing
- * this. Also, embedding allocation doesn't play well with
- * NUMA.
- */
- if (!chosen && (!cpu_has_pse || pcpu_need_numa()))
- return -EINVAL;
-
- return pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
- reserve - PERCPU_FIRST_CHUNK_RESERVE,
- PAGE_SIZE, NULL, pcpu_fc_alloc,
- pcpu_fc_free);
}

-/*
- * Page allocator
- *
- * Boring fallback 4k page allocator. This allocator puts more
- * pressure on PTE TLBs but other than that behaves nicely on both UMA
- * and NUMA.
- */
static void __init pcpup_populate_pte(unsigned long addr)
{
populate_extra_pte(addr);
}

-static int __init setup_pcpu_page(void)
-{
- return pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
- pcpu_fc_alloc, pcpu_fc_free,
- pcpup_populate_pte);
-}
-
static inline void setup_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
@@ -281,30 +178,34 @@ void __init setup_per_cpu_areas(void)
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);

/*
- * Allocate percpu area. If PSE is supported, try to make use
- * of large page mappings. Please read comments on top of
- * each allocator for details.
+ * Allocate percpu area. Embedding allocator is our favorite;
+ * however, on NUMA configurations, it can result in very
+ * sparse unit mapping and vmalloc area isn't spacious enough
+ * on 32bit. Use page in that case.
*/
+#ifdef CONFIG_X86_32
+ if (pcpu_chosen_fc == PCPU_FC_AUTO && pcpu_need_numa())
+ pcpu_chosen_fc = PCPU_FC_PAGE;
+#endif
rc = -EINVAL;
- if (pcpu_chosen_fc != PCPU_FC_AUTO) {
- if (pcpu_chosen_fc != PCPU_FC_PAGE) {
- if (pcpu_chosen_fc == PCPU_FC_LPAGE)
- rc = setup_pcpu_lpage(true);
- else
- rc = setup_pcpu_embed(true);
-
- if (rc < 0)
- pr_warning("PERCPU: %s allocator failed (%d), "
- "falling back to page\n",
- pcpu_fc_names[pcpu_chosen_fc], rc);
- }
- } else {
- rc = setup_pcpu_lpage(false);
+ if (pcpu_chosen_fc != PCPU_FC_PAGE) {
+ const size_t atom_size = cpu_has_pse ? PMD_SIZE : PAGE_SIZE;
+ const size_t dyn_size = PERCPU_MODULE_RESERVE +
+ PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
+
+ rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
+ dyn_size, atom_size,
+ pcpu_cpu_distance,
+ pcpu_fc_alloc, pcpu_fc_free);
if (rc < 0)
- rc = setup_pcpu_embed(false);
+ pr_warning("PERCPU: %s allocator failed (%d), "
+ "falling back to page\n",
+ pcpu_fc_names[pcpu_chosen_fc], rc);
}
if (rc < 0)
- rc = setup_pcpu_page();
+ rc = pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
+ pcpu_fc_alloc, pcpu_fc_free,
+ pcpup_populate_pte);
if (rc < 0)
panic("cannot initialize percpu area (err=%d)", rc);

--
1.6.0.2

2009-07-21 10:28:22

by Tejun Heo

Subject: [PATCH 15/20] percpu: use group information to allocate vmap areas sparsely

ai->groups[] describes which units must be placed consecutively and at
what offset from the chunk base address. Compile this information
into pcpu_group_offsets[] and pcpu_group_sizes[] in
pcpu_setup_first_chunk() and use them to allocate sparse vm areas
with pcpu_get_vm_areas().

This will be used to allow directly using sparse NUMA memories as
percpu areas.
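
For illustration only, here is a rough userspace sketch of the
bookkeeping described above. It is not the kernel code: struct
group_info, build_group_tables(), chunk_base() and group_addr() are
hypothetical names, and addresses are plain integers. It shows the
per-group offset/size tables and how a chunk's base address can be
recovered from the first area by subtracting that group's offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's pcpu_group_info. */
struct group_info {
	int nr_units;              /* units in this group, padding included */
	unsigned long base_offset; /* offset from the chunk base address */
};

/* Compile per-group offsets and sizes from the group descriptions. */
void build_group_tables(const struct group_info *groups, int nr_groups,
			size_t unit_size,
			unsigned long *offsets, size_t *sizes)
{
	int group;

	assert(nr_groups > 0);
	for (group = 0; group < nr_groups; group++) {
		offsets[group] = groups[group].base_offset;
		sizes[group] = (size_t)groups[group].nr_units * unit_size;
	}
}

/* Chunk base recovered from the address of the first group's area. */
unsigned long chunk_base(unsigned long area0_addr,
			 const unsigned long *offsets)
{
	return area0_addr - offsets[0];
}

/* Address at which a group must sit to keep the same relative layout. */
unsigned long group_addr(unsigned long base,
			 const unsigned long *offsets, int group)
{
	return base + offsets[group];
}
```

Note that the first group's offset need not be zero, which is why the
base is computed by subtraction rather than taken from the first area
directly.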

Signed-off-by: Tejun Heo <[email protected]>
Cc: Nick Piggin <[email protected]>
---
mm/percpu.c | 35 ++++++++++++++++++++++++++---------
1 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 7b5e194..cc9c4c6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -98,7 +98,7 @@ struct pcpu_chunk {
int map_used; /* # of map entries used */
int map_alloc; /* # of map entries allocated */
int *map; /* allocation map */
- struct vm_struct *vm; /* mapped vmalloc region */
+ struct vm_struct **vms; /* mapped vmalloc regions */
bool immutable; /* no [de]population allowed */
unsigned long populated[]; /* populated bitmap */
};
@@ -106,7 +106,7 @@ struct pcpu_chunk {
static int pcpu_unit_pages __read_mostly;
static int pcpu_unit_size __read_mostly;
static int pcpu_nr_units __read_mostly;
-static int pcpu_chunk_size __read_mostly;
+static int pcpu_atom_size __read_mostly;
static int pcpu_nr_slots __read_mostly;
static size_t pcpu_chunk_struct_size __read_mostly;

@@ -121,6 +121,11 @@ EXPORT_SYMBOL_GPL(pcpu_base_addr);
static const int *pcpu_unit_map __read_mostly; /* cpu -> unit */
const unsigned long *pcpu_unit_offsets __read_mostly; /* cpu -> unit offset */

+/* group information, used for vm allocation */
+static int pcpu_nr_groups __read_mostly;
+static const unsigned long *pcpu_group_offsets __read_mostly;
+static const size_t *pcpu_group_sizes __read_mostly;
+
/*
* The first chunk which always exists. Note that unlike other
* chunks, this one can be allocated and mapped in several different
@@ -988,8 +993,8 @@ static void free_pcpu_chunk(struct pcpu_chunk *chunk)
{
if (!chunk)
return;
- if (chunk->vm)
- free_vm_area(chunk->vm);
+ if (chunk->vms)
+ pcpu_free_vm_areas(chunk->vms, pcpu_nr_groups);
pcpu_mem_free(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]));
kfree(chunk);
}
@@ -1006,8 +1011,10 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
chunk->map[chunk->map_used++] = pcpu_unit_size;

- chunk->vm = get_vm_area(pcpu_chunk_size, GFP_KERNEL);
- if (!chunk->vm) {
+ chunk->vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
+ pcpu_nr_groups, pcpu_atom_size,
+ GFP_KERNEL);
+ if (!chunk->vms) {
free_pcpu_chunk(chunk);
return NULL;
}
@@ -1015,7 +1022,7 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
INIT_LIST_HEAD(&chunk->list);
chunk->free_size = pcpu_unit_size;
chunk->contig_hint = pcpu_unit_size;
- chunk->base_addr = chunk->vm->addr;
+ chunk->base_addr = chunk->vms[0]->addr - pcpu_group_offsets[0];

return chunk;
}
@@ -1571,6 +1578,8 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
size_t dyn_size = ai->dyn_size;
size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
struct pcpu_chunk *schunk, *dchunk = NULL;
+ unsigned long *group_offsets;
+ size_t *group_sizes;
unsigned long *unit_off;
unsigned int cpu;
int *unit_map;
@@ -1588,7 +1597,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,

pcpu_dump_alloc_info(KERN_DEBUG, ai);

- /* determine number of units and initialize unit_map and base */
+ /* process group information and build config tables accordingly */
+ group_offsets = alloc_bootmem(ai->nr_groups * sizeof(group_offsets[0]));
+ group_sizes = alloc_bootmem(ai->nr_groups * sizeof(group_sizes[0]));
unit_map = alloc_bootmem(nr_cpu_ids * sizeof(unit_map[0]));
unit_off = alloc_bootmem(nr_cpu_ids * sizeof(unit_off[0]));

@@ -1599,6 +1610,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
const struct pcpu_group_info *gi = &ai->groups[group];

+ group_offsets[group] = gi->base_offset;
+ group_sizes[group] = gi->nr_units * ai->unit_size;
+
for (i = 0; i < gi->nr_units; i++) {
cpu = gi->cpu_map[i];
if (cpu == NR_CPUS)
@@ -1620,13 +1634,16 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
for_each_possible_cpu(cpu)
BUG_ON(unit_map[cpu] == NR_CPUS);

+ pcpu_nr_groups = ai->nr_groups;
+ pcpu_group_offsets = group_offsets;
+ pcpu_group_sizes = group_sizes;
pcpu_unit_map = unit_map;
pcpu_unit_offsets = unit_off;

/* determine basic parameters */
pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
- pcpu_chunk_size = pcpu_nr_units * pcpu_unit_size;
+ pcpu_atom_size = ai->atom_size;
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

--
1.6.0.2

2009-07-21 10:28:23

by Tejun Heo

Subject: [PATCH 08/20] percpu: add @align to pcpu_fc_alloc_fn_t

pcpu_fc_alloc_fn_t is about to see more interesting usage. Add an
@align parameter so that callers can specify the alignment separately
from the size.
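
As a rough userspace illustration of the new callback shape (not the
kernel implementation), the sketch below uses hypothetical names
fc_alloc_fn_t, demo_fc_alloc() and alloc_unit(), and C11
aligned_alloc() in place of bootmem. The point is only that size and
alignment are now independent arguments.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of the three-argument allocation callback. */
typedef void *(*fc_alloc_fn_t)(unsigned int cpu, size_t size, size_t align);

/* Demo allocator: @cpu is ignored here, a real one would pick a node. */
void *demo_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
	size_t rounded;

	(void)cpu;
	/* aligned_alloc() wants a power-of-two alignment ... */
	assert(align != 0 && (align & (align - 1)) == 0);
	/* ... and a size that is a multiple of it. */
	rounded = (size + align - 1) / align * align;
	return aligned_alloc(align, rounded);
}

/* Caller side: request @size bytes aligned to @align for @cpu. */
void *alloc_unit(fc_alloc_fn_t alloc_fn, unsigned int cpu,
		 size_t size, size_t align)
{
	return alloc_fn(cpu, size, align);
}
```

With the old two-argument form the x86 helper used the size as the
alignment. With the extra argument, a three-page request can be
page-aligned instead of being aligned to its own size.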

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 4 ++--
include/linux/percpu.h | 3 ++-
mm/percpu.c | 4 ++--
3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index b0e7ac4..7cc5303 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -126,9 +126,9 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
/*
* Helpers for first chunk memory allocation
*/
-static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size)
+static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
- return pcpu_alloc_bootmem(cpu, size, size);
+ return pcpu_alloc_bootmem(cpu, size, align);
}

static void __init pcpu_fc_free(void *ptr, size_t size)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 0cfdd14..d385dbc 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -71,7 +71,8 @@ extern const char *pcpu_fc_names[PCPU_FC_NR];

extern enum pcpu_fc pcpu_chosen_fc;

-typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
+typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size,
+ size_t align);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
diff --git a/mm/percpu.c b/mm/percpu.c
index 3177cf6..cfdc03e 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1578,7 +1578,7 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
for (i = 0; i < unit_pages; i++) {
void *ptr;

- ptr = alloc_fn(cpu, PAGE_SIZE);
+ ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
if (!ptr) {
pr_warning("PERCPU: failed to allocate %s page "
"for cpu%u\n", psize_str, cpu);
@@ -1888,7 +1888,7 @@ ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
goto found;
continue;
found:
- ptr = alloc_fn(cpu, lpage_size);
+ ptr = alloc_fn(cpu, lpage_size, lpage_size);
if (!ptr) {
pr_warning("PERCPU: failed to allocate large page "
"for cpu%u\n", cpu);
--
1.6.0.2

2009-07-21 10:28:40

by Tejun Heo

Subject: [PATCH 04/20] percpu: build first chunk allocators selectively

There's no need to build unused first chunk allocators in. Define
CONFIG_NEED_PER_CPU_*_FIRST_CHUNK and let archs enable them
selectively.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/Kconfig | 10 ++++++++++
include/linux/percpu.h | 27 +++++----------------------
mm/percpu.c | 19 +++++++++++--------
3 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 001ad1d..5c0a3b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,16 @@ config ARCH_HAS_CACHE_LINE_SIZE
config HAVE_SETUP_PER_CPU_AREA
def_bool y

+config NEED_PER_CPU_EMBED_FIRST_CHUNK
+ def_bool y
+
+config NEED_PER_CPU_PAGE_FIRST_CHUNK
+ def_bool y
+
+config NEED_PER_CPU_LPAGE_FIRST_CHUNK
+ def_bool y
+ depends on NEED_MULTIPLE_NODES
+
config HAVE_CPUMASK_OF_CPU_MAP
def_bool X86_64_SMP

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 7989f61..e26788e 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -70,17 +70,21 @@ extern size_t __init pcpu_setup_first_chunk(
ssize_t dyn_size, size_t unit_size,
void *base_addr, const int *unit_map);

+#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
ssize_t dyn_size);
+#endif

+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
extern ssize_t __init pcpu_page_first_chunk(
size_t static_size, size_t reserved_size,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn);
+#endif

-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
extern int __init pcpu_lpage_build_unit_map(
size_t static_size, size_t reserved_size,
ssize_t *dyn_sizep, size_t *unit_sizep,
@@ -98,27 +102,6 @@ extern ssize_t __init pcpu_lpage_first_chunk(

extern void *pcpu_lpage_remapped(void *kaddr);
#else
-static inline int pcpu_lpage_build_unit_map(
- size_t static_size, size_t reserved_size,
- ssize_t *dyn_sizep, size_t *unit_sizep,
- size_t lpage_size, int *unit_map,
- pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
-{
- return -EINVAL;
-}
-
-static inline ssize_t __init pcpu_lpage_first_chunk(
- size_t static_size, size_t reserved_size,
- size_t dyn_size, size_t unit_size,
- size_t lpage_size, const int *unit_map,
- int nr_units,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_map_fn_t map_fn)
-{
- return -EINVAL;
-}
-
static inline void *pcpu_lpage_remapped(void *kaddr)
{
return NULL;
diff --git a/mm/percpu.c b/mm/percpu.c
index 27a3033..4d926d3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1414,8 +1414,9 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
return pcpu_unit_size;
}

-static size_t pcpu_calc_fc_sizes(size_t static_size, size_t reserved_size,
- ssize_t *dyn_sizep)
+static inline size_t pcpu_calc_fc_sizes(size_t static_size,
+ size_t reserved_size,
+ ssize_t *dyn_sizep)
{
size_t size_sum;

@@ -1427,6 +1428,8 @@ static size_t pcpu_calc_fc_sizes(size_t static_size, size_t reserved_size,
return size_sum;
}

+#if defined(CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK) || \
+ !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
/**
* pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
* @static_size: the size of static percpu area in bytes
@@ -1495,7 +1498,10 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
return pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
unit_size, base, NULL);
}
+#endif /* CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK ||
+ !CONFIG_HAVE_SETUP_PER_CPU_AREA */

+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
/**
* pcpu_page_first_chunk - map the first chunk using PAGE_SIZE pages
* @static_size: the size of static percpu area in bytes
@@ -1598,12 +1604,9 @@ out_free_ar:
free_bootmem(__pa(pages), pages_size);
return ret;
}
+#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK */

-/*
- * Large page remapping first chunk setup helper
- */
-#ifdef CONFIG_NEED_MULTIPLE_NODES
-
+#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
/**
* pcpu_lpage_build_unit_map - build unit_map for large page remapping
* @static_size: the size of static percpu area in bytes
@@ -1982,7 +1985,7 @@ void *pcpu_lpage_remapped(void *kaddr)

return NULL;
}
-#endif
+#endif /* CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK */

/*
* Generic percpu area setup.
--
1.6.0.2

2009-07-21 10:28:29

by Tejun Heo

Subject: [PATCH 10/20] percpu: introduce pcpu_alloc_info and pcpu_group_info

Till now, the non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and was used only by the lpage
allocator. Although how many units have been placed in a single
contiguous area (group) is known while building unit_map, the
information is lost when the result is recorded into the unit_map
array. For the lpage allocator, as all allocations are done by lpages
and whether two adjacent lpages are in the same group or not is
irrelevant, this didn't cause any problem. Non-linear cpu->unit
mapping will be used for sparse embedding and this grouping
information is necessary for that.

This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains an array of pcpu_group_info which describes
how units are grouped and mapped to cpus. pcpu_group_info also has a
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back, as is currently done, but it will later be
used to place groups sparsely.
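
To make the back-to-back default concrete, here is a small userspace
sketch. The names struct group_layout and place_groups_back_to_back()
are hypothetical; only the arithmetic follows the description above.
Each group's unit count is rounded up to a whole number of allocations
(upa units each), and its base_offset is the running unit count times
the unit size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one group. */
struct group_layout {
	int nr_cpus;               /* in: cpus assigned to this group */
	int nr_units;              /* out: units, rounded up to upa */
	unsigned long base_offset; /* out: offset from the chunk base */
};

/*
 * Lay groups out back-to-back.  @upa is units per allocation.
 * Returns the total number of units, padding included.
 */
int place_groups_back_to_back(struct group_layout *groups, int nr_groups,
			      int upa, size_t unit_size)
{
	int group, unit = 0;

	assert(upa > 0);
	for (group = 0; group < nr_groups; group++) {
		struct group_layout *gi = &groups[group];

		gi->base_offset = (unsigned long)unit * unit_size;
		gi->nr_units = (gi->nr_cpus + upa - 1) / upa * upa;
		unit += gi->nr_units;
	}
	return unit;
}
```

A caller that places groups sparsely would overwrite base_offset
afterwards with the real offset of each group.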

pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.

* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help deal with pcpu_alloc_info.

* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.

* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.

* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus, i.e. if cpus 0, 2 and 4 are
possible, they'll be assigned units 0, 1 and 2 instead of 0, 2 and 4.

* x86 setup_pcpu_lpage() is updated to deal with alloc_info.

* sparc64 setup_per_cpu_areas() is updated to build alloc_info.

Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
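
The following userspace sketch illustrates two points from the
description: the info structure and its cpu_map storage living in one
allocation, and the packing of sparse possible cpus into consecutive
units. All demo_* names, DEMO_NO_CPU and the map_len field are
hypothetical, and calloc() stands in for bootmem.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NO_CPU (~0U)	/* marks an empty unit slot */

struct demo_group {
	int nr_units;
	unsigned int *cpu_map;	/* unit -> cpu */
};

struct demo_alloc_info {
	int nr_groups;
	int map_len;		/* total cpu_map entries (demo only) */
	struct demo_group groups[];
};

/* One allocation holds the header, the groups and all cpu_map entries. */
struct demo_alloc_info *demo_alloc_info_new(int nr_groups, int nr_units)
{
	struct demo_alloc_info *ai;
	size_t align = _Alignof(unsigned int);
	size_t base_size;
	int unit;

	assert(nr_groups > 0);
	base_size = sizeof(*ai) + (size_t)nr_groups * sizeof(ai->groups[0]);
	base_size = (base_size + align - 1) / align * align;

	ai = calloc(1, base_size + (size_t)nr_units * sizeof(unsigned int));
	if (!ai)
		return NULL;

	ai->nr_groups = nr_groups;
	ai->map_len = nr_units;
	/* only group 0 is pointed at the map, other groups are up to the caller */
	ai->groups[0].cpu_map = (unsigned int *)((char *)ai + base_size);
	for (unit = 0; unit < nr_units; unit++)
		ai->groups[0].cpu_map[unit] = DEMO_NO_CPU;
	return ai;
}

/* Assign possible cpus to consecutive units of group 0. */
int demo_pack_cpus(struct demo_alloc_info *ai,
		   const unsigned char *possible, unsigned int nr_cpu_ids)
{
	struct demo_group *gi = &ai->groups[0];
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
		if (!possible[cpu])
			continue;
		assert(gi->nr_units < ai->map_len);
		gi->cpu_map[gi->nr_units++] = cpu;
	}
	return gi->nr_units;
}
```

With possible cpus 0, 2 and 4, the units come out as 0, 1 and 2, which
is the packing side effect mentioned above.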

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
---
arch/sparc/kernel/smp_64.c | 24 ++-
arch/x86/kernel/setup_percpu.c | 38 ++--
include/linux/percpu.h | 42 +++-
mm/percpu.c | 529 +++++++++++++++++++++++++---------------
4 files changed, 389 insertions(+), 244 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index 9856d86..a42a4a7 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1475,17 +1475,29 @@ static void __init pcpu_map_range(unsigned long start, unsigned long end,

void __init setup_per_cpu_areas(void)
{
- size_t dyn_size, static_size = __per_cpu_end - __per_cpu_start;
static struct vm_struct vm;
+ struct pcpu_alloc_info *ai;
unsigned long delta, cpu;
size_t size_sum, pcpu_unit_size;
size_t ptrs_size;
void **ptrs;

- size_sum = PFN_ALIGN(static_size + PERCPU_MODULE_RESERVE +
+ ai = pcpu_alloc_alloc_info(1, nr_cpu_ids);
+
+ ai->static_size = __per_cpu_end - __per_cpu_start;
+ ai->reserved_size = PERCPU_MODULE_RESERVE;
+
+ size_sum = PFN_ALIGN(ai->static_size + ai->reserved_size +
PERCPU_DYNAMIC_RESERVE);
- dyn_size = size_sum - static_size - PERCPU_MODULE_RESERVE;

+ ai->dyn_size = size_sum - ai->static_size - ai->reserved_size;
+ ai->unit_size = PCPU_CHUNK_SIZE;
+ ai->atom_size = PCPU_CHUNK_SIZE;
+ ai->alloc_size = PCPU_CHUNK_SIZE;
+ ai->groups[0].nr_units = nr_cpu_ids;
+
+ for_each_possible_cpu(cpu)
+ ai->groups[0].cpu_map[cpu] = cpu;

ptrs_size = PFN_ALIGN(nr_cpu_ids * sizeof(ptrs[0]));
ptrs = alloc_bootmem(ptrs_size);
@@ -1497,7 +1509,7 @@ void __init setup_per_cpu_areas(void)
free_bootmem(__pa(ptrs[cpu] + size_sum),
PCPU_CHUNK_SIZE - size_sum);

- memcpy(ptrs[cpu], __per_cpu_load, static_size);
+ memcpy(ptrs[cpu], __per_cpu_load, ai->static_size);
}

/* allocate address and map */
@@ -1514,9 +1526,7 @@ void __init setup_per_cpu_areas(void)
pcpu_map_range(start, end, virt_to_page(ptrs[cpu]));
}

- pcpu_unit_size = pcpu_setup_first_chunk(static_size,
- PERCPU_MODULE_RESERVE, dyn_size,
- PCPU_CHUNK_SIZE, vm.addr, NULL);
+ pcpu_unit_size = pcpu_setup_first_chunk(ai, vm.addr);

free_bootmem(__pa(ptrs), ptrs_size);

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 7cc5303..934f285 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -161,9 +161,7 @@ static ssize_t __init setup_pcpu_lpage(bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
size_t dyn_size = reserve - PERCPU_FIRST_CHUNK_RESERVE;
- size_t unit_map_size, unit_size;
- int *unit_map;
- int nr_units;
+ struct pcpu_alloc_info *ai;
ssize_t ret;

/* on non-NUMA, embedding is better */
@@ -177,26 +175,22 @@ static ssize_t __init setup_pcpu_lpage(bool chosen)
}

/* allocate and build unit_map */
- unit_map_size = num_possible_cpus() * sizeof(int);
- unit_map = alloc_bootmem_nopanic(unit_map_size);
- if (!unit_map) {
- pr_warning("PERCPU: failed to allocate unit_map\n");
- return -ENOMEM;
+ ai = pcpu_build_alloc_info(PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
+ PMD_SIZE, pcpu_lpage_cpu_distance);
+ if (IS_ERR(ai)) {
+ pr_warning("PERCPU: failed to build unit_map (%ld)\n",
+ PTR_ERR(ai));
+ return PTR_ERR(ai);
}

- ret = pcpu_lpage_build_unit_map(PERCPU_FIRST_CHUNK_RESERVE,
- &dyn_size, &unit_size, PMD_SIZE,
- unit_map, pcpu_lpage_cpu_distance);
- if (ret < 0) {
- pr_warning("PERCPU: failed to build unit_map\n");
- goto out_free;
- }
- nr_units = ret;
-
/* do the parameters look okay? */
if (!chosen) {
size_t vm_size = VMALLOC_END - VMALLOC_START;
- size_t tot_size = nr_units * unit_size;
+ size_t tot_size = 0;
+ int group;
+
+ for (group = 0; group < ai->nr_groups; group++)
+ tot_size += ai->unit_size * ai->groups[group].nr_units;

/* don't consume more than 20% of vmalloc area */
if (tot_size > vm_size / 5) {
@@ -207,12 +201,10 @@ static ssize_t __init setup_pcpu_lpage(bool chosen)
}
}

- ret = pcpu_lpage_first_chunk(PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
- unit_size, PMD_SIZE, unit_map, nr_units,
- pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
+ ret = pcpu_lpage_first_chunk(ai, pcpu_fc_alloc, pcpu_fc_free,
+ pcpul_map);
out_free:
- if (ret < 0)
- free_bootmem(__pa(unit_map), unit_map_size);
+ pcpu_free_alloc_info(ai);
return ret;
}
#else
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 570fb18..77b86be 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -59,6 +59,25 @@
extern void *pcpu_base_addr;
extern const int *pcpu_unit_map;

+struct pcpu_group_info {
+ int nr_units; /* aligned # of units */
+ unsigned long base_offset; /* base address offset */
+ unsigned int *cpu_map; /* unit->cpu map, empty
+ * entries contain NR_CPUS */
+};
+
+struct pcpu_alloc_info {
+ size_t static_size;
+ size_t reserved_size;
+ size_t dyn_size;
+ size_t unit_size;
+ size_t atom_size;
+ size_t alloc_size;
+ size_t __ai_size; /* internal, don't use */
+ int nr_groups; /* 0 if grouping unnecessary */
+ struct pcpu_group_info groups[];
+};
+
enum pcpu_fc {
PCPU_FC_AUTO,
PCPU_FC_EMBED,
@@ -78,18 +97,17 @@ typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

-#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-extern int __init pcpu_lpage_build_unit_map(
- size_t reserved_size, ssize_t *dyn_sizep,
- size_t *unit_sizep, size_t lpage_size,
- int *unit_map,
+extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
+ int nr_units);
+extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
+
+extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
+ size_t reserved_size, ssize_t dyn_size,
+ size_t atom_size,
pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
-#endif

-extern size_t __init pcpu_setup_first_chunk(
- size_t static_size, size_t reserved_size,
- size_t dyn_size, size_t unit_size,
- void *base_addr, const int *unit_map);
+extern size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
+ void *base_addr);

#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
extern ssize_t __init pcpu_embed_first_chunk(
@@ -106,9 +124,7 @@ extern ssize_t __init pcpu_page_first_chunk(

#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
extern ssize_t __init pcpu_lpage_first_chunk(
- size_t reserved_size, size_t dyn_size,
- size_t unit_size, size_t lpage_size,
- const int *unit_map, int nr_units,
+ const struct pcpu_alloc_info *ai,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn);
diff --git a/mm/percpu.c b/mm/percpu.c
index b3d0ca0..816cea4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -58,6 +58,7 @@

#include <linux/bitmap.h>
#include <linux/bootmem.h>
+#include <linux/err.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/mm.h>
@@ -1245,53 +1246,108 @@ static inline size_t pcpu_calc_fc_sizes(size_t static_size,
return size_sum;
}

-#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
/**
- * pcpu_lpage_build_unit_map - build unit_map for large page remapping
+ * pcpu_alloc_alloc_info - allocate percpu allocation info
+ * @nr_groups: the number of groups
+ * @nr_units: the number of units
+ *
+ * Allocate ai which is large enough for @nr_groups groups containing
+ * @nr_units units. The returned ai's groups[0].cpu_map points to the
+ * cpu_map array which is long enough for @nr_units and filled with
+ * NR_CPUS. It's the caller's responsibility to initialize cpu_map
+ * pointer of other groups.
+ *
+ * RETURNS:
+ * Pointer to the allocated pcpu_alloc_info on success, NULL on
+ * failure.
+ */
+struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
+ int nr_units)
+{
+ struct pcpu_alloc_info *ai;
+ size_t base_size, ai_size;
+ void *ptr;
+ int unit;
+
+ base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]),
+ __alignof__(ai->groups[0].cpu_map[0]));
+ ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);
+
+ ptr = alloc_bootmem_nopanic(PFN_ALIGN(ai_size));
+ if (!ptr)
+ return NULL;
+ ai = ptr;
+ ptr += base_size;
+
+ ai->groups[0].cpu_map = ptr;
+
+ for (unit = 0; unit < nr_units; unit++)
+ ai->groups[0].cpu_map[unit] = NR_CPUS;
+
+ ai->nr_groups = nr_groups;
+ ai->__ai_size = PFN_ALIGN(ai_size);
+
+ return ai;
+}
+
+/**
+ * pcpu_free_alloc_info - free percpu allocation info
+ * @ai: pcpu_alloc_info to free
+ *
+ * Free @ai which was allocated by pcpu_alloc_alloc_info().
+ */
+void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
+{
+ free_bootmem(__pa(ai), ai->__ai_size);
+}
+
+/**
+ * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
* @reserved_size: the size of reserved percpu area in bytes
- * @dyn_sizep: in/out parameter for dynamic size, -1 for auto
- * @unit_sizep: out parameter for unit size
- * @unit_map: unit_map to be filled
- * @cpu_distance_fn: callback to determine distance between cpus
+ * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @atom_size: allocation atom size
+ * @cpu_distance_fn: callback to determine distance between cpus, optional
*
- * This function builds cpu -> unit map and determine other parameters
- * considering needed percpu size, large page size and distances
- * between CPUs in NUMA.
+ * This function determines grouping of units, their mappings to cpus
+ * and other parameters considering needed percpu size, allocation
+ * atom size and distances between CPUs.
*
- * CPUs which are of LOCAL_DISTANCE both ways are grouped together and
- * may share units in the same large page. The returned configuration
- * is guaranteed to have CPUs on different nodes on different large
- * pages and >=75% usage of allocated virtual address space.
+ * Groups are always multiples of atom size and CPUs which are of
+ * LOCAL_DISTANCE both ways are grouped together and share space for
+ * units in the same group. The returned configuration is guaranteed
+ * to have CPUs on different nodes on different groups and >=75% usage
+ * of allocated virtual address space.
*
* RETURNS:
- * On success, fills in @unit_map, sets *@dyn_sizep, *@unit_sizep and
- * returns the number of units to be allocated. -errno on failure.
+ * On success, pointer to the new allocation_info is returned. On
+ * failure, ERR_PTR value is returned.
*/
-int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
- size_t *unit_sizep, size_t lpage_size,
- int *unit_map,
- pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
+struct pcpu_alloc_info * __init pcpu_build_alloc_info(
+ size_t reserved_size, ssize_t dyn_size,
+ size_t atom_size,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
static int group_map[NR_CPUS] __initdata;
static int group_cnt[NR_CPUS] __initdata;
const size_t static_size = __per_cpu_end - __per_cpu_start;
- int group_cnt_max = 0;
+ int group_cnt_max = 0, nr_groups = 1, nr_units = 0;
size_t size_sum, min_unit_size, alloc_size;
int upa, max_upa, uninitialized_var(best_upa); /* units_per_alloc */
- int last_allocs;
+ int last_allocs, group, unit;
unsigned int cpu, tcpu;
- int group, unit;
+ struct pcpu_alloc_info *ai;
+ unsigned int *cpu_map;

/*
* Determine min_unit_size, alloc_size and max_upa such that
- * alloc_size is multiple of lpage_size and is the smallest
+ * alloc_size is multiple of atom_size and is the smallest
* which can accomodate 4k aligned segments which are equal to
* or larger than min_unit_size.
*/
- size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, dyn_sizep);
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

- alloc_size = roundup(min_unit_size, lpage_size);
+ alloc_size = roundup(min_unit_size, atom_size);
upa = alloc_size / min_unit_size;
while (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
upa--;
@@ -1304,10 +1360,11 @@ int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
for_each_possible_cpu(tcpu) {
if (cpu == tcpu)
break;
- if (group_map[tcpu] == group &&
+ if (group_map[tcpu] == group && cpu_distance_fn &&
(cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
group++;
+ nr_groups = max(nr_groups, group + 1);
goto next_group;
}
}
@@ -1328,7 +1385,7 @@ int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
if (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
continue;

- for (group = 0; group_cnt[group]; group++) {
+ for (group = 0; group < nr_groups; group++) {
int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
allocs += this_allocs;
wasted += this_allocs * upa - group_cnt[group];
@@ -1348,75 +1405,122 @@ int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
last_allocs = allocs;
best_upa = upa;
}
- *unit_sizep = alloc_size / best_upa;
+ upa = best_upa;
+
+ /* allocate and fill alloc_info */
+ for (group = 0; group < nr_groups; group++)
+ nr_units += roundup(group_cnt[group], upa);
+
+ ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
+ if (!ai)
+ return ERR_PTR(-ENOMEM);
+ cpu_map = ai->groups[0].cpu_map;
+
+ for (group = 0; group < nr_groups; group++) {
+ ai->groups[group].cpu_map = cpu_map;
+ cpu_map += roundup(group_cnt[group], upa);
+ }
+
+ ai->static_size = static_size;
+ ai->reserved_size = reserved_size;
+ ai->dyn_size = dyn_size;
+ ai->unit_size = alloc_size / upa;
+ ai->atom_size = atom_size;
+ ai->alloc_size = alloc_size;
+
+ for (group = 0, unit = 0; group_cnt[group]; group++) {
+ struct pcpu_group_info *gi = &ai->groups[group];
+
+ /*
+ * Initialize base_offset as if all groups are located
+ * back-to-back. The caller should update this to
+ * reflect actual allocation.
+ */
+ gi->base_offset = unit * ai->unit_size;

- /* assign units to cpus accordingly */
- unit = 0;
- for (group = 0; group_cnt[group]; group++) {
for_each_possible_cpu(cpu)
if (group_map[cpu] == group)
- unit_map[cpu] = unit++;
- unit = roundup(unit, best_upa);
+ gi->cpu_map[gi->nr_units++] = cpu;
+ gi->nr_units = roundup(gi->nr_units, upa);
+ unit += gi->nr_units;
}
+ BUG_ON(unit != nr_units);

- return unit; /* unit contains aligned number of units */
+ return ai;
}

-static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
- unsigned int *cpup);
-
-static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
- size_t reserved_size, size_t dyn_size,
- size_t unit_size, size_t lpage_size,
- const int *unit_map, int nr_units)
+/**
+ * pcpu_dump_alloc_info - print out information about pcpu_alloc_info
+ * @lvl: loglevel
+ * @ai: allocation info to dump
+ *
+ * Print out information about @ai using loglevel @lvl.
+ */
+static void pcpu_dump_alloc_info(const char *lvl,
+ const struct pcpu_alloc_info *ai)
{
- int width = 1, v = nr_units;
+ int group_width = 1, cpu_width = 1, width;
char empty_str[] = "--------";
- int upl, lpl; /* units per lpage, lpage per line */
- unsigned int cpu;
- int lpage, unit;
+ int alloc = 0, alloc_end = 0;
+ int group, v;
+ int upa, apl; /* units per alloc, allocs per line */
+
+ v = ai->nr_groups;
+ while (v /= 10)
+ group_width++;

+ v = num_possible_cpus();
while (v /= 10)
- width++;
- empty_str[min_t(int, width, sizeof(empty_str) - 1)] = '\0';
+ cpu_width++;
+ empty_str[min_t(int, cpu_width, sizeof(empty_str) - 1)] = '\0';

- upl = max_t(int, lpage_size / unit_size, 1);
- lpl = rounddown_pow_of_two(max_t(int, 60 / (upl * (width + 1) + 2), 1));
+ upa = ai->alloc_size / ai->unit_size;
+ width = upa * (cpu_width + 1) + group_width + 3;
+ apl = rounddown_pow_of_two(max(60 / width, 1));

- printk("%spcpu-lpage: sta/res/dyn=%zu/%zu/%zu unit=%zu lpage=%zu", lvl,
- static_size, reserved_size, dyn_size, unit_size, lpage_size);
+ printk("%spcpu-alloc: s%zu r%zu d%zu u%zu alloc=%zu*%zu",
+ lvl, ai->static_size, ai->reserved_size, ai->dyn_size,
+ ai->unit_size, ai->alloc_size / ai->atom_size, ai->atom_size);

- for (lpage = 0, unit = 0; unit < nr_units; unit++) {
- if (!(unit % upl)) {
- if (!(lpage++ % lpl)) {
+ for (group = 0; group < ai->nr_groups; group++) {
+ const struct pcpu_group_info *gi = &ai->groups[group];
+ int unit = 0, unit_end = 0;
+
+ BUG_ON(gi->nr_units % upa);
+ for (alloc_end += gi->nr_units / upa;
+ alloc < alloc_end; alloc++) {
+ if (!(alloc % apl)) {
printk("\n");
- printk("%spcpu-lpage: ", lvl);
- } else
- printk("| ");
+ printk("%spcpu-alloc: ", lvl);
+ }
+ printk("[%0*d] ", group_width, group);
+
+ for (unit_end += upa; unit < unit_end; unit++)
+ if (gi->cpu_map[unit] != NR_CPUS)
+ printk("%0*d ", cpu_width,
+ gi->cpu_map[unit]);
+ else
+ printk("%s ", empty_str);
}
- if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
- printk("%0*d ", width, cpu);
- else
- printk("%s ", empty_str);
}
printk("\n");
}
-#endif

/**
* pcpu_setup_first_chunk - initialize the first percpu chunk
- * @static_size: the size of static percpu area in bytes
- * @reserved_size: the size of reserved percpu area in bytes, 0 for none
- * @dyn_size: free size for dynamic allocation in bytes
- * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE
+ * @ai: pcpu_alloc_info describing how the percpu area is shaped
* @base_addr: mapped address
- * @unit_map: cpu -> unit map, NULL for sequential mapping
*
* Initialize the first percpu chunk which contains the kernel static
* perpcu area. This function is to be called from arch percpu area
* setup path.
*
- * @reserved_size, if non-zero, specifies the amount of bytes to
+ * @ai contains all information necessary to initialize the first
+ * chunk and prime the dynamic percpu allocator.
+ *
+ * @ai->static_size is the size of static percpu area.
+ *
+ * @ai->reserved_size, if non-zero, specifies the amount of bytes to
* reserve after the static area in the first chunk. This reserves
* the first chunk such that it's available only through reserved
* percpu allocation. This is primarily used to serve module percpu
@@ -1424,13 +1528,26 @@ static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
* limited offset range for symbol relocations to guarantee module
* percpu symbols fall inside the relocatable range.
*
- * @dyn_size determines the number of bytes available for dynamic
- * allocation in the first chunk. The area between @static_size +
- * @reserved_size + @dyn_size and @unit_size is unused.
+ * @ai->dyn_size determines the number of bytes available for dynamic
+ * allocation in the first chunk. The area between @ai->static_size +
+ * @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused.
*
- * @unit_size specifies unit size and must be aligned to PAGE_SIZE and
- * equal to or larger than @static_size + @reserved_size + if
- * non-negative, @dyn_size.
+ * @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE
+ * and equal to or larger than @ai->static_size + @ai->reserved_size +
+ * @ai->dyn_size.
+ *
+ * @ai->atom_size is the allocation atom size and used as alignment
+ * for vm areas.
+ *
+ * @ai->alloc_size is the allocation size and always multiple of
+ * @ai->atom_size. This is larger than @ai->atom_size if
+ * @ai->unit_size is larger than @ai->atom_size.
+ *
+ * @ai->nr_groups and @ai->groups describe virtual memory layout of
+ * percpu areas. Units which should be colocated are put into the
+ * same group. Dynamic VM areas will be allocated according to these
+ * groupings. If @ai->nr_groups is zero, a single group containing
+ * all units is assumed.
*
* The caller should have mapped the first chunk at @base_addr and
* copied static data to each unit.
@@ -1446,70 +1563,63 @@ static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
* The determined pcpu_unit_size which can be used to initialize
* percpu access.
*/
-size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
- size_t dyn_size, size_t unit_size,
- void *base_addr, const int *unit_map)
+size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
+ void *base_addr)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
- size_t size_sum = static_size + reserved_size + dyn_size;
+ size_t dyn_size = ai->dyn_size;
+ size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
struct pcpu_chunk *schunk, *dchunk = NULL;
- unsigned int cpu, tcpu;
- int i;
+ unsigned int cpu;
+ int *unit_map;
+ int group, unit, i;

/* sanity checks */
BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
- BUG_ON(!static_size);
+ BUG_ON(ai->nr_groups <= 0);
+ BUG_ON(!ai->static_size);
BUG_ON(!base_addr);
- BUG_ON(unit_size < size_sum);
- BUG_ON(unit_size & ~PAGE_MASK);
- BUG_ON(unit_size < PCPU_MIN_UNIT_SIZE);
+ BUG_ON(ai->unit_size < size_sum);
+ BUG_ON(ai->unit_size & ~PAGE_MASK);
+ BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
+
+ pcpu_dump_alloc_info(KERN_DEBUG, ai);

/* determine number of units and verify and initialize pcpu_unit_map */
- if (unit_map) {
- int first_unit = INT_MAX, last_unit = INT_MIN;
-
- for_each_possible_cpu(cpu) {
- int unit = unit_map[cpu];
-
- BUG_ON(unit < 0);
- for_each_possible_cpu(tcpu) {
- if (tcpu == cpu)
- break;
- /* the mapping should be one-to-one */
- BUG_ON(unit_map[tcpu] == unit);
- }
+ unit_map = alloc_bootmem(nr_cpu_ids * sizeof(unit_map[0]));

- if (unit < first_unit) {
- pcpu_first_unit_cpu = cpu;
- first_unit = unit;
- }
- if (unit > last_unit) {
- pcpu_last_unit_cpu = cpu;
- last_unit = unit;
- }
- }
- pcpu_nr_units = last_unit + 1;
- pcpu_unit_map = unit_map;
- } else {
- int *identity_map;
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ unit_map[cpu] = NR_CPUS;
+ pcpu_first_unit_cpu = NR_CPUS;

- /* #units == #cpus, identity mapped */
- identity_map = alloc_bootmem(nr_cpu_ids *
- sizeof(identity_map[0]));
+ for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
+ const struct pcpu_group_info *gi = &ai->groups[group];

- for_each_possible_cpu(cpu)
- identity_map[cpu] = cpu;
+ for (i = 0; i < gi->nr_units; i++) {
+ cpu = gi->cpu_map[i];
+ if (cpu == NR_CPUS)
+ continue;

- pcpu_first_unit_cpu = 0;
- pcpu_last_unit_cpu = pcpu_nr_units - 1;
- pcpu_nr_units = nr_cpu_ids;
- pcpu_unit_map = identity_map;
+ BUG_ON(cpu >= nr_cpu_ids || !cpu_possible(cpu));
+ BUG_ON(unit_map[cpu] != NR_CPUS);
+
+ unit_map[cpu] = unit + i;
+ if (pcpu_first_unit_cpu == NR_CPUS)
+ pcpu_first_unit_cpu = cpu;
+ }
}
+ pcpu_last_unit_cpu = cpu;
+ pcpu_nr_units = unit;
+
+ for_each_possible_cpu(cpu)
+ BUG_ON(unit_map[cpu] == NR_CPUS);
+
+ pcpu_unit_map = unit_map;

/* determine basic parameters */
- pcpu_unit_pages = unit_size >> PAGE_SHIFT;
+ pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
pcpu_chunk_size = pcpu_nr_units * pcpu_unit_size;
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
@@ -1543,17 +1653,17 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
schunk->immutable = true;
bitmap_fill(schunk->populated, pcpu_unit_pages);

- if (reserved_size) {
- schunk->free_size = reserved_size;
+ if (ai->reserved_size) {
+ schunk->free_size = ai->reserved_size;
pcpu_reserved_chunk = schunk;
- pcpu_reserved_chunk_limit = static_size + reserved_size;
+ pcpu_reserved_chunk_limit = ai->static_size + ai->reserved_size;
} else {
schunk->free_size = dyn_size;
dyn_size = 0; /* dynamic area covered */
}
schunk->contig_hint = schunk->free_size;

- schunk->map[schunk->map_used++] = -static_size;
+ schunk->map[schunk->map_used++] = -ai->static_size;
if (schunk->free_size)
schunk->map[schunk->map_used++] = schunk->free_size;

@@ -1643,44 +1753,47 @@ early_param("percpu_alloc", percpu_alloc_setup);
*/
ssize_t __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
{
- const size_t static_size = __per_cpu_end - __per_cpu_start;
- size_t size_sum, unit_size, chunk_size;
+ struct pcpu_alloc_info *ai;
+ size_t size_sum, chunk_size;
void *base;
- unsigned int cpu;
+ int unit;
+ ssize_t ret;

- /* determine parameters and allocate */
- size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
+ ai = pcpu_build_alloc_info(reserved_size, dyn_size, PAGE_SIZE, NULL);
+ if (IS_ERR(ai))
+ return PTR_ERR(ai);
+ BUG_ON(ai->nr_groups != 1);
+ BUG_ON(ai->groups[0].nr_units != num_possible_cpus());

- unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
- chunk_size = unit_size * nr_cpu_ids;
+ size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
+ chunk_size = ai->unit_size * num_possible_cpus();

base = __alloc_bootmem_nopanic(chunk_size, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS));
if (!base) {
pr_warning("PERCPU: failed to allocate %zu bytes for "
"embedding\n", chunk_size);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out_free_ai;
}

/* return the leftover and copy */
- for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
- void *ptr = base + cpu * unit_size;
-
- if (cpu_possible(cpu)) {
- free_bootmem(__pa(ptr + size_sum),
- unit_size - size_sum);
- memcpy(ptr, __per_cpu_load, static_size);
- } else
- free_bootmem(__pa(ptr), unit_size);
+ for (unit = 0; unit < num_possible_cpus(); unit++) {
+ void *ptr = base + unit * ai->unit_size;
+
+ free_bootmem(__pa(ptr + size_sum), ai->unit_size - size_sum);
+ memcpy(ptr, __per_cpu_load, ai->static_size);
}

/* we're ready, commit */
pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
- PFN_DOWN(size_sum), base, static_size, reserved_size, dyn_size,
- unit_size);
+ PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
+ ai->dyn_size, ai->unit_size);

- return pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- unit_size, base, NULL);
+ ret = pcpu_setup_first_chunk(ai, base);
+out_free_ai:
+ pcpu_free_alloc_info(ai);
+ return ret;
}
#endif /* CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK ||
!CONFIG_HAVE_SETUP_PER_CPU_AREA */
@@ -1709,31 +1822,34 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct vm;
- const size_t static_size = __per_cpu_end - __per_cpu_start;
- ssize_t dyn_size = -1;
- size_t size_sum, unit_size;
+ struct pcpu_alloc_info *ai;
char psize_str[16];
int unit_pages;
size_t pages_size;
struct page **pages;
- unsigned int cpu;
- int i, j;
+ int unit, i, j;
ssize_t ret;

snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

- size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
- unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
- unit_pages = unit_size >> PAGE_SHIFT;
+ ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);
+ if (IS_ERR(ai))
+ return PTR_ERR(ai);
+ BUG_ON(ai->nr_groups != 1);
+ BUG_ON(ai->groups[0].nr_units != num_possible_cpus());
+
+ unit_pages = ai->unit_size >> PAGE_SHIFT;

/* unaligned allocations can't be freed, round up to page size */
- pages_size = PFN_ALIGN(unit_pages * nr_cpu_ids * sizeof(pages[0]));
+ pages_size = PFN_ALIGN(unit_pages * num_possible_cpus() *
+ sizeof(pages[0]));
pages = alloc_bootmem(pages_size);

/* allocate pages */
j = 0;
- for_each_possible_cpu(cpu)
+ for (unit = 0; unit < num_possible_cpus(); unit++)
for (i = 0; i < unit_pages; i++) {
+ unsigned int cpu = ai->groups[0].cpu_map[unit];
void *ptr;

ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
@@ -1747,18 +1863,18 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,

/* allocate vm area, map the pages and copy static data */
vm.flags = VM_ALLOC;
- vm.size = nr_cpu_ids * unit_size;
+ vm.size = num_possible_cpus() * ai->unit_size;
vm_area_register_early(&vm, PAGE_SIZE);

- for_each_possible_cpu(cpu) {
+ for (unit = 0; unit < num_possible_cpus(); unit++) {
unsigned long unit_addr =
- (unsigned long)vm.addr + cpu * unit_size;
+ (unsigned long)vm.addr + unit * ai->unit_size;

for (i = 0; i < unit_pages; i++)
populate_pte_fn(unit_addr + (i << PAGE_SHIFT));

/* pte already populated, the following shouldn't fail */
- ret = __pcpu_map_pages(unit_addr, &pages[cpu * unit_pages],
+ ret = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
unit_pages);
if (ret < 0)
panic("failed to map percpu area, err=%zd\n", ret);
@@ -1772,16 +1888,15 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
*/

/* copy static data */
- memcpy((void *)unit_addr, __per_cpu_load, static_size);
+ memcpy((void *)unit_addr, __per_cpu_load, ai->static_size);
}

/* we're ready, commit */
pr_info("PERCPU: %d %s pages/cpu @%p s%zu r%zu d%zu\n",
- unit_pages, psize_str, vm.addr, static_size, reserved_size,
- dyn_size);
+ unit_pages, psize_str, vm.addr, ai->static_size,
+ ai->reserved_size, ai->dyn_size);

- ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- unit_size, vm.addr, NULL);
+ ret = pcpu_setup_first_chunk(ai, vm.addr);
goto out_free_ar;

enomem:
@@ -1790,6 +1905,7 @@ enomem:
ret = -ENOMEM;
out_free_ar:
free_bootmem(__pa(pages), pages_size);
+ pcpu_free_alloc_info(ai);
return ret;
}
#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK */
@@ -1805,38 +1921,50 @@ static size_t pcpul_lpage_size;
static int pcpul_nr_lpages;
static struct pcpul_ent *pcpul_map;

-static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
+static bool __init pcpul_unit_to_cpu(int unit, const struct pcpu_alloc_info *ai,
unsigned int *cpup)
{
- unsigned int cpu;
+ int group, cunit;

- for_each_possible_cpu(cpu)
- if (unit_map[cpu] == unit) {
+ for (group = 0, cunit = 0; group < ai->nr_groups; group++) {
+ const struct pcpu_group_info *gi = &ai->groups[group];
+
+ if (unit < cunit + gi->nr_units) {
if (cpup)
- *cpup = cpu;
+ *cpup = gi->cpu_map[unit - cunit];
return true;
}
+ cunit += gi->nr_units;
+ }

return false;
}

+static int __init pcpul_cpu_to_unit(int cpu, const struct pcpu_alloc_info *ai)
+{
+ int group, unit, i;
+
+ for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
+ const struct pcpu_group_info *gi = &ai->groups[group];
+
+ for (i = 0; i < gi->nr_units; i++)
+ if (gi->cpu_map[i] == cpu)
+ return unit + i;
+ }
+ BUG();
+}
+
/**
* pcpu_lpage_first_chunk - remap the first percpu chunk using large page
- * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes
- * @unit_size: unit size in bytes
- * @lpage_size: the size of a large page
- * @unit_map: cpu -> unit mapping
- * @nr_units: the number of units
+ * @ai: pcpu_alloc_info
* @alloc_fn: function to allocate percpu lpage, always called with lpage_size
* @free_fn: function to free percpu memory, @size <= lpage_size
* @map_fn: function to map percpu lpage, always called with lpage_size
*
* This allocator uses large page to build and map the first chunk.
- * Unlike other helpers, the caller should always specify @dyn_size
- * and @unit_size. These parameters along with @unit_map and
- * @nr_units can be determined using pcpu_lpage_build_unit_map().
- * This two stage initialization is to allow arch code to evaluate the
+ * Unlike other helpers, the caller should provide fully initialized
+ * @ai. This can be done using pcpu_build_alloc_info(). This two
+ * stage initialization is to allow arch code to evaluate the
* parameters before committing to it.
*
* Large pages are allocated as directed by @unit_map and other
@@ -1852,27 +1980,26 @@ static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
-ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
- size_t unit_size, size_t lpage_size,
- const int *unit_map, int nr_units,
+ssize_t __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn)
{
static struct vm_struct vm;
- const size_t static_size = __per_cpu_end - __per_cpu_start;
- size_t chunk_size = unit_size * nr_units;
- size_t map_size;
+ const size_t lpage_size = ai->atom_size;
+ size_t chunk_size, map_size;
unsigned int cpu;
ssize_t ret;
- int i, j, unit;
+ int i, j, unit, nr_units;

- pcpul_lpage_dump_cfg(KERN_DEBUG, static_size, reserved_size, dyn_size,
- unit_size, lpage_size, unit_map, nr_units);
+ nr_units = 0;
+ for (i = 0; i < ai->nr_groups; i++)
+ nr_units += ai->groups[i].nr_units;

+ chunk_size = ai->unit_size * nr_units;
BUG_ON(chunk_size % lpage_size);

- pcpul_size = static_size + reserved_size + dyn_size;
+ pcpul_size = ai->static_size + ai->reserved_size + ai->dyn_size;
pcpul_lpage_size = lpage_size;
pcpul_nr_lpages = chunk_size / lpage_size;

@@ -1883,13 +2010,13 @@ ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
/* allocate all pages */
for (i = 0; i < pcpul_nr_lpages; i++) {
size_t offset = i * lpage_size;
- int first_unit = offset / unit_size;
- int last_unit = (offset + lpage_size - 1) / unit_size;
+ int first_unit = offset / ai->unit_size;
+ int last_unit = (offset + lpage_size - 1) / ai->unit_size;
void *ptr;

/* find out which cpu is mapped to this unit */
for (unit = first_unit; unit <= last_unit; unit++)
- if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
+ if (pcpul_unit_to_cpu(unit, ai, &cpu))
goto found;
continue;
found:
@@ -1905,12 +2032,12 @@ ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,

/* return unused holes */
for (unit = 0; unit < nr_units; unit++) {
- size_t start = unit * unit_size;
- size_t end = start + unit_size;
+ size_t start = unit * ai->unit_size;
+ size_t end = start + ai->unit_size;
size_t off, next;

/* don't free used part of occupied unit */
- if (pcpul_unit_to_cpu(unit, unit_map, NULL))
+ if (pcpul_unit_to_cpu(unit, ai, NULL))
start += pcpul_size;

/* unit can span more than one page, punch the holes */
@@ -1925,7 +2052,7 @@ ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
/* allocate address, map and copy */
vm.flags = VM_ALLOC;
vm.size = chunk_size;
- vm_area_register_early(&vm, unit_size);
+ vm_area_register_early(&vm, ai->unit_size);

for (i = 0; i < pcpul_nr_lpages; i++) {
if (!pcpul_map[i].ptr)
@@ -1935,15 +2062,15 @@ ssize_t __init pcpu_lpage_first_chunk(size_t reserved_size, size_t dyn_size,
}

for_each_possible_cpu(cpu)
- memcpy(vm.addr + unit_map[cpu] * unit_size, __per_cpu_load,
- static_size);
+ memcpy(vm.addr + pcpul_cpu_to_unit(cpu, ai) * ai->unit_size,
+ __per_cpu_load, ai->static_size);

/* we're ready, commit */
pr_info("PERCPU: large pages @%p s%zu r%zu d%zu u%zu\n",
- vm.addr, static_size, reserved_size, dyn_size, unit_size);
+ vm.addr, ai->static_size, ai->reserved_size, ai->dyn_size,
+ ai->unit_size);

- ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- unit_size, vm.addr, unit_map);
+ ret = pcpu_setup_first_chunk(ai, vm.addr);

/*
* Sort pcpul_map array for pcpu_lpage_remapped(). Unmapped
--
1.6.0.2
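
A note on the units-per-allocation ("upa") search near the top of the
hunk above: the loop starts from the largest count that could fit in
one allocation and walks down until the allocation divides evenly and
each unit is page aligned. The following is a minimal userspace sketch
of that arithmetic, not the kernel code. SKETCH_PAGE_SIZE and the
sketch_* names are assumptions made up for illustration, and the page
alignment test is written with a modulo rather than PAGE_MASK.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, illustration only */

/* Round x up to the next multiple of step (step > 0). */
static size_t sketch_roundup(size_t x, size_t step)
{
	return (x + step - 1) / step * step;
}

/*
 * Return the largest number of units per allocation such that the
 * allocation splits evenly and every unit stays page aligned.
 * Hypothetical helper mirroring the loop in the patch.
 */
static size_t sketch_max_upa(size_t min_unit_size, size_t atom_size)
{
	size_t alloc_size, upa;

	assert(min_unit_size > 0 && atom_size > 0);
	/* guarantees the loop terminates at upa == 1 at the latest */
	assert(atom_size % SKETCH_PAGE_SIZE == 0);

	alloc_size = sketch_roundup(min_unit_size, atom_size);
	upa = alloc_size / min_unit_size;
	while (alloc_size % upa || (alloc_size / upa) % SKETCH_PAGE_SIZE)
		upa--;
	return upa;
}
```

With a 48k minimum unit and a 2M atom, the search starts at 42 and
settles on 32 units of 64k each. The patch then goes on to pick the
upa that wastes the fewest units across groups, which this sketch
leaves out.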

2009-07-21 10:28:00

by Tejun Heo

Subject: [PATCH 11/20] percpu: add pcpu_unit_offsets[]

Currently units are mapped sequentially into address space. This
patch adds pcpu_unit_offsets[] which allows units to be mapped to
arbitrary offsets from the chunk base address. This is necessary to
allow sparse embedding, which may need to allocate address ranges and
memory areas which are aligned not to unit size but to allocation
atom size (page or large page size). This also simplifies things a
bit by removing the need to calculate offset from unit number.

With this change, there's no need for the arch code to know
pcpu_unit_size. Update pcpu_setup_first_chunk() and first chunk
allocators to return regular 0 or -errno return code instead of unit
size or -errno.
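
As a rough illustration of the offset table described above, the
sketch below fills a cpu -> offset array from per-group base offsets,
following the "base_offset + i * unit_size" rule the patch adds to
pcpu_setup_first_chunk(). It is a userspace approximation only:
struct sketch_group, SKETCH_NO_CPU and sketch_fill_unit_offsets() are
hypothetical stand-ins, not the kernel's types or API.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_CPU (-1)	/* hypothetical marker for an unused unit */

/* Simplified, hypothetical stand-in for struct pcpu_group_info. */
struct sketch_group {
	int nr_units;			/* units in this group */
	unsigned long base_offset;	/* group start relative to chunk base */
	const int *cpu_map;		/* unit -> cpu, SKETCH_NO_CPU for holes */
};

/*
 * Fill unit_off[cpu] for every cpu that appears in a group.
 * unit_off must have room for the highest cpu number used.
 */
static void sketch_fill_unit_offsets(const struct sketch_group *groups,
				     int nr_groups, size_t unit_size,
				     unsigned long *unit_off)
{
	for (int g = 0; g < nr_groups; g++) {
		const struct sketch_group *gi = &groups[g];

		for (int i = 0; i < gi->nr_units; i++) {
			int cpu = gi->cpu_map[i];

			if (cpu == SKETCH_NO_CPU)
				continue;	/* hole, no cpu to record */
			unit_off[cpu] = gi->base_offset +
					(unsigned long)i * unit_size;
		}
	}
}
```

Because each group carries its own base offset, groups need not sit
back-to-back. Arch code then only adds the per-cpu offset to its
delta, as the sparc and x86 hunks below do.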

Signed-off-by: Tejun Heo <[email protected]>
Cc: David S. Miller <[email protected]>
---
arch/sparc/kernel/smp_64.c | 12 +++--
arch/x86/kernel/setup_percpu.c | 51 ++++++++++------------
include/linux/percpu.h | 16 +++----
mm/percpu.c | 95 ++++++++++++++++++++--------------------
4 files changed, 84 insertions(+), 90 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index a42a4a7..b03fd36 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1478,9 +1478,10 @@ void __init setup_per_cpu_areas(void)
static struct vm_struct vm;
struct pcpu_alloc_info *ai;
unsigned long delta, cpu;
- size_t size_sum, pcpu_unit_size;
+ size_t size_sum;
size_t ptrs_size;
void **ptrs;
+ int rc;

ai = pcpu_alloc_alloc_info(1, nr_cpu_ids);

@@ -1526,14 +1527,15 @@ void __init setup_per_cpu_areas(void)
pcpu_map_range(start, end, virt_to_page(ptrs[cpu]));
}

- pcpu_unit_size = pcpu_setup_first_chunk(ai, vm.addr);
+ rc = pcpu_setup_first_chunk(ai, vm.addr);
+ if (rc)
+ panic("failed to setup percpu first chunk (%d)", rc);

free_bootmem(__pa(ptrs), ptrs_size);

delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
- for_each_possible_cpu(cpu) {
- __per_cpu_offset(cpu) = delta + cpu * pcpu_unit_size;
- }
+ for_each_possible_cpu(cpu)
+ __per_cpu_offset(cpu) = delta + pcpu_unit_offsets[cpu];

/* Setup %g5 for the boot cpu. */
__local_per_cpu_offset = __per_cpu_offset(smp_processor_id());
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 934f285..477d2de 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -157,12 +157,12 @@ static int pcpu_lpage_cpu_distance(unsigned int from, unsigned int to)
return REMOTE_DISTANCE;
}

-static ssize_t __init setup_pcpu_lpage(bool chosen)
+static int __init setup_pcpu_lpage(bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
size_t dyn_size = reserve - PERCPU_FIRST_CHUNK_RESERVE;
struct pcpu_alloc_info *ai;
- ssize_t ret;
+ int rc;

/* on non-NUMA, embedding is better */
if (!chosen && !pcpu_need_numa())
@@ -196,19 +196,18 @@ static ssize_t __init setup_pcpu_lpage(bool chosen)
if (tot_size > vm_size / 5) {
pr_info("PERCPU: too large chunk size %zuMB for "
"large page remap\n", tot_size >> 20);
- ret = -EINVAL;
+ rc = -EINVAL;
goto out_free;
}
}

- ret = pcpu_lpage_first_chunk(ai, pcpu_fc_alloc, pcpu_fc_free,
- pcpul_map);
+ rc = pcpu_lpage_first_chunk(ai, pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
out_free:
pcpu_free_alloc_info(ai);
- return ret;
+ return rc;
}
#else
-static ssize_t __init setup_pcpu_lpage(bool chosen)
+static int __init setup_pcpu_lpage(bool chosen)
{
return -EINVAL;
}
@@ -222,7 +221,7 @@ static ssize_t __init setup_pcpu_lpage(bool chosen)
* mapping so that it can use PMD mapping without additional TLB
* pressure.
*/
-static ssize_t __init setup_pcpu_embed(bool chosen)
+static int __init setup_pcpu_embed(bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;

@@ -250,7 +249,7 @@ static void __init pcpup_populate_pte(unsigned long addr)
populate_extra_pte(addr);
}

-static ssize_t __init setup_pcpu_page(void)
+static int __init setup_pcpu_page(void)
{
return pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
pcpu_fc_alloc, pcpu_fc_free,
@@ -274,8 +273,7 @@ void __init setup_per_cpu_areas(void)
{
unsigned int cpu;
unsigned long delta;
- size_t pcpu_unit_size;
- ssize_t ret;
+ int rc;

pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
@@ -285,36 +283,33 @@ void __init setup_per_cpu_areas(void)
* of large page mappings. Please read comments on top of
* each allocator for details.
*/
- ret = -EINVAL;
+ rc = -EINVAL;
if (pcpu_chosen_fc != PCPU_FC_AUTO) {
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
if (pcpu_chosen_fc == PCPU_FC_LPAGE)
- ret = setup_pcpu_lpage(true);
+ rc = setup_pcpu_lpage(true);
else
- ret = setup_pcpu_embed(true);
+ rc = setup_pcpu_embed(true);

- if (ret < 0)
- pr_warning("PERCPU: %s allocator failed (%zd), "
+ if (rc < 0)
+ pr_warning("PERCPU: %s allocator failed (%d), "
"falling back to page\n",
- pcpu_fc_names[pcpu_chosen_fc], ret);
+ pcpu_fc_names[pcpu_chosen_fc], rc);
}
} else {
- ret = setup_pcpu_lpage(false);
- if (ret < 0)
- ret = setup_pcpu_embed(false);
+ rc = setup_pcpu_lpage(false);
+ if (rc < 0)
+ rc = setup_pcpu_embed(false);
}
- if (ret < 0)
- ret = setup_pcpu_page();
- if (ret < 0)
- panic("cannot initialize percpu area (err=%zd)", ret);
-
- pcpu_unit_size = ret;
+ if (rc < 0)
+ rc = setup_pcpu_page();
+ if (rc < 0)
+ panic("cannot initialize percpu area (err=%d)", rc);

/* alrighty, percpu areas up and running */
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
- per_cpu_offset(cpu) =
- delta + pcpu_unit_map[cpu] * pcpu_unit_size;
+ per_cpu_offset(cpu) = delta + pcpu_unit_offsets[cpu];
per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
per_cpu(cpu_number, cpu) = cpu;
setup_percpu_segment(cpu);
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 77b86be..a7ec840 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -57,7 +57,7 @@
#endif

extern void *pcpu_base_addr;
-extern const int *pcpu_unit_map;
+extern const unsigned long *pcpu_unit_offsets;

struct pcpu_group_info {
int nr_units; /* aligned # of units */
@@ -106,25 +106,23 @@ extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
size_t atom_size,
pcpu_fc_cpu_distance_fn_t cpu_distance_fn);

-extern size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
- void *base_addr);
+extern int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
+ void *base_addr);

#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern ssize_t __init pcpu_embed_first_chunk(
- size_t reserved_size, ssize_t dyn_size);
+extern int __init pcpu_embed_first_chunk(size_t reserved_size,
+ ssize_t dyn_size);
#endif

#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-extern ssize_t __init pcpu_page_first_chunk(
- size_t reserved_size,
+extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn);
#endif

#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-extern ssize_t __init pcpu_lpage_first_chunk(
- const struct pcpu_alloc_info *ai,
+extern int __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn);
diff --git a/mm/percpu.c b/mm/percpu.c
index 816cea4..8167fb8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -117,8 +117,8 @@ static unsigned int pcpu_last_unit_cpu __read_mostly;
void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

-/* cpu -> unit map */
-const int *pcpu_unit_map __read_mostly;
+static const int *pcpu_unit_map __read_mostly; /* cpu -> unit */
+const unsigned long *pcpu_unit_offsets __read_mostly; /* cpu -> unit offset */

/*
* The first chunk which always exists. Note that unlike other
@@ -196,8 +196,8 @@ static int pcpu_page_idx(unsigned int cpu, int page_idx)
static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
unsigned int cpu, int page_idx)
{
- return (unsigned long)chunk->vm->addr +
- (pcpu_page_idx(cpu, page_idx) << PAGE_SHIFT);
+ return (unsigned long)chunk->vm->addr + pcpu_unit_offsets[cpu] +
+ (page_idx << PAGE_SHIFT);
}

static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
@@ -341,7 +341,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
* space. Note that any possible cpu id can be used here, so
* there's no need to worry about preemption or cpu hotplug.
*/
- addr += pcpu_unit_map[smp_processor_id()] * pcpu_unit_size;
+ addr += pcpu_unit_offsets[smp_processor_id()];
return pcpu_get_page_chunk(vmalloc_to_page(addr));
}

@@ -1560,17 +1560,17 @@ static void pcpu_dump_alloc_info(const char *lvl,
* and available for dynamic allocation like any other chunks.
*
* RETURNS:
- * The determined pcpu_unit_size which can be used to initialize
- * percpu access.
+ * 0 on success, -errno on failure.
*/
-size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
- void *base_addr)
+int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
+ void *base_addr)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
size_t dyn_size = ai->dyn_size;
size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
struct pcpu_chunk *schunk, *dchunk = NULL;
+ unsigned long *unit_off;
unsigned int cpu;
int *unit_map;
int group, unit, i;
@@ -1587,8 +1587,9 @@ size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,

pcpu_dump_alloc_info(KERN_DEBUG, ai);

- /* determine number of units and verify and initialize pcpu_unit_map */
+ /* determine number of units and initialize unit_map and base */
unit_map = alloc_bootmem(nr_cpu_ids * sizeof(unit_map[0]));
+ unit_off = alloc_bootmem(nr_cpu_ids * sizeof(unit_off[0]));

for (cpu = 0; cpu < nr_cpu_ids; cpu++)
unit_map[cpu] = NR_CPUS;
@@ -1606,6 +1607,8 @@ size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
BUG_ON(unit_map[cpu] != NR_CPUS);

unit_map[cpu] = unit + i;
+ unit_off[cpu] = gi->base_offset + i * ai->unit_size;
+
if (pcpu_first_unit_cpu == NR_CPUS)
pcpu_first_unit_cpu = cpu;
}
@@ -1617,6 +1620,7 @@ size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
BUG_ON(unit_map[cpu] == NR_CPUS);

pcpu_unit_map = unit_map;
+ pcpu_unit_offsets = unit_off;

/* determine basic parameters */
pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
@@ -1688,7 +1692,7 @@ size_t __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,

/* we're done */
pcpu_base_addr = schunk->vm->addr;
- return pcpu_unit_size;
+ return 0;
}

const char *pcpu_fc_names[PCPU_FC_NR] __initdata = {
@@ -1748,16 +1752,15 @@ early_param("percpu_alloc", percpu_alloc_setup);
* size, the leftover is returned to the bootmem allocator.
*
* RETURNS:
- * The determined pcpu_unit_size which can be used to initialize
- * percpu access on success, -errno on failure.
+ * 0 on success, -errno on failure.
*/
-ssize_t __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
+int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
{
struct pcpu_alloc_info *ai;
size_t size_sum, chunk_size;
void *base;
int unit;
- ssize_t ret;
+ int rc;

ai = pcpu_build_alloc_info(reserved_size, dyn_size, PAGE_SIZE, NULL);
if (IS_ERR(ai))
@@ -1773,7 +1776,7 @@ ssize_t __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
if (!base) {
pr_warning("PERCPU: failed to allocate %zu bytes for "
"embedding\n", chunk_size);
- ret = -ENOMEM;
+ rc = -ENOMEM;
goto out_free_ai;
}

@@ -1790,10 +1793,10 @@ ssize_t __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size)
PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
ai->dyn_size, ai->unit_size);

- ret = pcpu_setup_first_chunk(ai, base);
+ rc = pcpu_setup_first_chunk(ai, base);
out_free_ai:
pcpu_free_alloc_info(ai);
- return ret;
+ return rc;
}
#endif /* CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK ||
!CONFIG_HAVE_SETUP_PER_CPU_AREA */
@@ -1813,13 +1816,12 @@ out_free_ai:
* page-by-page into vmalloc area.
*
* RETURNS:
- * The determined pcpu_unit_size which can be used to initialize
- * percpu access on success, -errno on failure.
+ * 0 on success, -errno on failure.
*/
-ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_populate_pte_fn_t populate_pte_fn)
+int __init pcpu_page_first_chunk(size_t reserved_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct vm;
struct pcpu_alloc_info *ai;
@@ -1827,8 +1829,7 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
int unit_pages;
size_t pages_size;
struct page **pages;
- int unit, i, j;
- ssize_t ret;
+ int unit, i, j, rc;

snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

@@ -1874,10 +1875,10 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
populate_pte_fn(unit_addr + (i << PAGE_SHIFT));

/* pte already populated, the following shouldn't fail */
- ret = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
- unit_pages);
- if (ret < 0)
- panic("failed to map percpu area, err=%zd\n", ret);
+ rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
+ unit_pages);
+ if (rc < 0)
+ panic("failed to map percpu area, err=%d\n", rc);

/*
* FIXME: Archs with virtual cache should flush local
@@ -1896,17 +1897,17 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
unit_pages, psize_str, vm.addr, ai->static_size,
ai->reserved_size, ai->dyn_size);

- ret = pcpu_setup_first_chunk(ai, vm.addr);
+ rc = pcpu_setup_first_chunk(ai, vm.addr);
goto out_free_ar;

enomem:
while (--j >= 0)
free_fn(page_address(pages[j]), PAGE_SIZE);
- ret = -ENOMEM;
+ rc = -ENOMEM;
out_free_ar:
free_bootmem(__pa(pages), pages_size);
pcpu_free_alloc_info(ai);
- return ret;
+ return rc;
}
#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK */

@@ -1977,20 +1978,18 @@ static int __init pcpul_cpu_to_unit(int cpu, const struct pcpu_alloc_info *ai)
* pcpu_lpage_remapped().
*
* RETURNS:
- * The determined pcpu_unit_size which can be used to initialize
- * percpu access on success, -errno on failure.
+ * 0 on success, -errno on failure.
*/
-ssize_t __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
- pcpu_fc_alloc_fn_t alloc_fn,
- pcpu_fc_free_fn_t free_fn,
- pcpu_fc_map_fn_t map_fn)
+int __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_map_fn_t map_fn)
{
static struct vm_struct vm;
const size_t lpage_size = ai->atom_size;
size_t chunk_size, map_size;
unsigned int cpu;
- ssize_t ret;
- int i, j, unit, nr_units;
+ int i, j, unit, nr_units, rc;

nr_units = 0;
for (i = 0; i < ai->nr_groups; i++)
@@ -2070,7 +2069,7 @@ ssize_t __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
vm.addr, ai->static_size, ai->reserved_size, ai->dyn_size,
ai->unit_size);

- ret = pcpu_setup_first_chunk(ai, vm.addr);
+ rc = pcpu_setup_first_chunk(ai, vm.addr);

/*
* Sort pcpul_map array for pcpu_lpage_remapped(). Unmapped
@@ -2094,7 +2093,7 @@ ssize_t __init pcpu_lpage_first_chunk(const struct pcpu_alloc_info *ai,
while (pcpul_nr_lpages && !pcpul_map[pcpul_nr_lpages - 1].ptr)
pcpul_nr_lpages--;

- return ret;
+ return rc;

enomem:
for (i = 0; i < pcpul_nr_lpages; i++)
@@ -2166,21 +2165,21 @@ EXPORT_SYMBOL(__per_cpu_offset);

void __init setup_per_cpu_areas(void)
{
- ssize_t unit_size;
unsigned long delta;
unsigned int cpu;
+ int rc;

/*
* Always reserve area for module percpu variables. That's
* what the legacy allocator did.
*/
- unit_size = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
- PERCPU_DYNAMIC_RESERVE);
- if (unit_size < 0)
+ rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+ PERCPU_DYNAMIC_RESERVE);
+ if (rc < 0)
panic("Failed to initialized percpu areas.");

delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu)
- __per_cpu_offset[cpu] = delta + cpu * unit_size;
+ __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
}
#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
--
1.6.0.2

2009-07-21 10:30:49

by Tejun Heo

Subject: [PATCH 13/20] vmalloc: separate out insert_vmalloc_vm()

Separate out insert_vmalloc_vm() from __get_vm_area_node().
insert_vmalloc_vm() initializes vm_struct from vmap_area and inserts
it into vmlist. insert_vmalloc_vm() only initializes fields which can
be determined from @va, @flags and @caller. The rest should be
initialized by the caller. For __get_vm_area_node(), all other fields
just need to be cleared and this is done by using kzalloc instead of
kmalloc.

This will be used to implement pcpu_get_vm_areas().
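
As an illustration only, the address-ordered insertion that
insert_vmalloc_vm() performs can be sketched in plain userspace C as
below. struct area and area_insert_sorted() are hypothetical stand-ins
for struct vm_struct and the vmlist walk, and the vmlist_lock locking
is left out, so this is a sketch of the pointer-to-pointer technique
rather than the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_struct: only what the walk needs. */
struct area {
	struct area *next;
	unsigned long addr;
};

/*
 * Insert @new into the list headed by *@head, keeping ascending
 * address order.  @p always points at the link that has to be
 * rewritten, so an empty list and insertion at the head need no
 * special casing.
 */
static void area_insert_sorted(struct area **head, struct area *new)
{
	struct area **p, *tmp;

	assert(head && new);

	for (p = head; (tmp = *p) != NULL; p = &tmp->next) {
		if (tmp->addr >= new->addr)
			break;
	}
	new->next = *p;
	*p = new;
}
```

Walking with a pointer to the link instead of a pointer to the node is
what lets the function finish with the same two assignments wherever
the new entry lands.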

Signed-off-by: Tejun Heo <[email protected]>
Cc: Nick Piggin <[email protected]>
---
mm/vmalloc.c | 45 ++++++++++++++++++++++++---------------------
1 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f8189a4..2eb461c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1122,13 +1122,34 @@ EXPORT_SYMBOL_GPL(map_vm_area);
DEFINE_RWLOCK(vmlist_lock);
struct vm_struct *vmlist;

+static void insert_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
+ unsigned long flags, void *caller)
+{
+ struct vm_struct *tmp, **p;
+
+ vm->flags = flags;
+ vm->addr = (void *)va->va_start;
+ vm->size = va->va_end - va->va_start;
+ vm->caller = caller;
+ va->private = vm;
+ va->flags |= VM_VM_AREA;
+
+ write_lock(&vmlist_lock);
+ for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
+ if (tmp->addr >= vm->addr)
+ break;
+ }
+ vm->next = *p;
+ *p = vm;
+ write_unlock(&vmlist_lock);
+}
+
static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long flags, unsigned long start, unsigned long end,
int node, gfp_t gfp_mask, void *caller)
{
static struct vmap_area *va;
struct vm_struct *area;
- struct vm_struct *tmp, **p;
unsigned long align = 1;

BUG_ON(in_interrupt());
@@ -1147,7 +1168,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (unlikely(!size))
return NULL;

- area = kmalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
+ area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
return NULL;

@@ -1162,25 +1183,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
return NULL;
}

- area->flags = flags;
- area->addr = (void *)va->va_start;
- area->size = size;
- area->pages = NULL;
- area->nr_pages = 0;
- area->phys_addr = 0;
- area->caller = caller;
- va->private = area;
- va->flags |= VM_VM_AREA;
-
- write_lock(&vmlist_lock);
- for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
- if (tmp->addr >= area->addr)
- break;
- }
- area->next = *p;
- *p = area;
- write_unlock(&vmlist_lock);
-
+ insert_vmalloc_vm(area, va, flags, caller);
return area;
}

--
1.6.0.2

2009-07-21 10:31:50

by Tejun Heo

Subject: [PATCH 09/20] percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward

Unit map handling will be generalized, extended and used for
embedding a sparse first chunk and for other purposes. Relocate two
unit_map related functions upward in preparation. This patch just
moves the code without any functional change.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/percpu.h | 14 +-
mm/percpu.c | 339 ++++++++++++++++++++++++------------------------
2 files changed, 180 insertions(+), 173 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d385dbc..570fb18 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -78,6 +78,14 @@ typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

+#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
+extern int __init pcpu_lpage_build_unit_map(
+ size_t reserved_size, ssize_t *dyn_sizep,
+ size_t *unit_sizep, size_t lpage_size,
+ int *unit_map,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
+#endif
+
extern size_t __init pcpu_setup_first_chunk(
size_t static_size, size_t reserved_size,
size_t dyn_size, size_t unit_size,
@@ -97,12 +105,6 @@ extern ssize_t __init pcpu_page_first_chunk(
#endif

#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-extern int __init pcpu_lpage_build_unit_map(
- size_t reserved_size, ssize_t *dyn_sizep,
- size_t *unit_sizep, size_t lpage_size,
- int *unit_map,
- pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
-
extern ssize_t __init pcpu_lpage_first_chunk(
size_t reserved_size, size_t dyn_size,
size_t unit_size, size_t lpage_size,
diff --git a/mm/percpu.c b/mm/percpu.c
index cfdc03e..b3d0ca0 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1231,6 +1231,178 @@ void free_percpu(void *ptr)
}
EXPORT_SYMBOL_GPL(free_percpu);

+static inline size_t pcpu_calc_fc_sizes(size_t static_size,
+ size_t reserved_size,
+ ssize_t *dyn_sizep)
+{
+ size_t size_sum;
+
+ size_sum = PFN_ALIGN(static_size + reserved_size +
+ (*dyn_sizep >= 0 ? *dyn_sizep : 0));
+ if (*dyn_sizep != 0)
+ *dyn_sizep = size_sum - static_size - reserved_size;
+
+ return size_sum;
+}
+
+#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
+/**
+ * pcpu_lpage_build_unit_map - build unit_map for large page remapping
+ * @reserved_size: the size of reserved percpu area in bytes
+ * @dyn_sizep: in/out parameter for dynamic size, -1 for auto
+ * @unit_sizep: out parameter for unit size
+ * @unit_map: unit_map to be filled
+ * @cpu_distance_fn: callback to determine distance between cpus
+ *
+ * This function builds cpu -> unit map and determine other parameters
+ * considering needed percpu size, large page size and distances
+ * between CPUs in NUMA.
+ *
+ * CPUs which are of LOCAL_DISTANCE both ways are grouped together and
+ * may share units in the same large page. The returned configuration
+ * is guaranteed to have CPUs on different nodes on different large
+ * pages and >=75% usage of allocated virtual address space.
+ *
+ * RETURNS:
+ * On success, fills in @unit_map, sets *@dyn_sizep, *@unit_sizep and
+ * returns the number of units to be allocated. -errno on failure.
+ */
+int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
+ size_t *unit_sizep, size_t lpage_size,
+ int *unit_map,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
+{
+ static int group_map[NR_CPUS] __initdata;
+ static int group_cnt[NR_CPUS] __initdata;
+ const size_t static_size = __per_cpu_end - __per_cpu_start;
+ int group_cnt_max = 0;
+ size_t size_sum, min_unit_size, alloc_size;
+ int upa, max_upa, uninitialized_var(best_upa); /* units_per_alloc */
+ int last_allocs;
+ unsigned int cpu, tcpu;
+ int group, unit;
+
+ /*
+ * Determine min_unit_size, alloc_size and max_upa such that
+ * alloc_size is multiple of lpage_size and is the smallest
+ * which can accomodate 4k aligned segments which are equal to
+ * or larger than min_unit_size.
+ */
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, dyn_sizep);
+ min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+
+ alloc_size = roundup(min_unit_size, lpage_size);
+ upa = alloc_size / min_unit_size;
+ while (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+ upa--;
+ max_upa = upa;
+
+ /* group cpus according to their proximity */
+ for_each_possible_cpu(cpu) {
+ group = 0;
+ next_group:
+ for_each_possible_cpu(tcpu) {
+ if (cpu == tcpu)
+ break;
+ if (group_map[tcpu] == group &&
+ (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
+ cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
+ group++;
+ goto next_group;
+ }
+ }
+ group_map[cpu] = group;
+ group_cnt[group]++;
+ group_cnt_max = max(group_cnt_max, group_cnt[group]);
+ }
+
+ /*
+ * Expand unit size until address space usage goes over 75%
+ * and then as much as possible without using more address
+ * space.
+ */
+ last_allocs = INT_MAX;
+ for (upa = max_upa; upa; upa--) {
+ int allocs = 0, wasted = 0;
+
+ if (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+ continue;
+
+ for (group = 0; group_cnt[group]; group++) {
+ int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
+ allocs += this_allocs;
+ wasted += this_allocs * upa - group_cnt[group];
+ }
+
+ /*
+ * Don't accept if wastage is over 25%. The
+ * greater-than comparison ensures upa==1 always
+ * passes the following check.
+ */
+ if (wasted > num_possible_cpus() / 3)
+ continue;
+
+ /* and then don't consume more memory */
+ if (allocs > last_allocs)
+ break;
+ last_allocs = allocs;
+ best_upa = upa;
+ }
+ *unit_sizep = alloc_size / best_upa;
+
+ /* assign units to cpus accordingly */
+ unit = 0;
+ for (group = 0; group_cnt[group]; group++) {
+ for_each_possible_cpu(cpu)
+ if (group_map[cpu] == group)
+ unit_map[cpu] = unit++;
+ unit = roundup(unit, best_upa);
+ }
+
+ return unit; /* unit contains aligned number of units */
+}
+
+static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
+ unsigned int *cpup);
+
+static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
+ size_t reserved_size, size_t dyn_size,
+ size_t unit_size, size_t lpage_size,
+ const int *unit_map, int nr_units)
+{
+ int width = 1, v = nr_units;
+ char empty_str[] = "--------";
+ int upl, lpl; /* units per lpage, lpage per line */
+ unsigned int cpu;
+ int lpage, unit;
+
+ while (v /= 10)
+ width++;
+ empty_str[min_t(int, width, sizeof(empty_str) - 1)] = '\0';
+
+ upl = max_t(int, lpage_size / unit_size, 1);
+ lpl = rounddown_pow_of_two(max_t(int, 60 / (upl * (width + 1) + 2), 1));
+
+ printk("%spcpu-lpage: sta/res/dyn=%zu/%zu/%zu unit=%zu lpage=%zu", lvl,
+ static_size, reserved_size, dyn_size, unit_size, lpage_size);
+
+ for (lpage = 0, unit = 0; unit < nr_units; unit++) {
+ if (!(unit % upl)) {
+ if (!(lpage++ % lpl)) {
+ printk("\n");
+ printk("%spcpu-lpage: ", lvl);
+ } else
+ printk("| ");
+ }
+ if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
+ printk("%0*d ", width, cpu);
+ else
+ printk("%s ", empty_str);
+ }
+ printk("\n");
+}
+#endif
+
/**
* pcpu_setup_first_chunk - initialize the first percpu chunk
* @static_size: the size of static percpu area in bytes
@@ -1441,20 +1613,6 @@ static int __init percpu_alloc_setup(char *str)
}
early_param("percpu_alloc", percpu_alloc_setup);

-static inline size_t pcpu_calc_fc_sizes(size_t static_size,
- size_t reserved_size,
- ssize_t *dyn_sizep)
-{
- size_t size_sum;
-
- size_sum = PFN_ALIGN(static_size + reserved_size +
- (*dyn_sizep >= 0 ? *dyn_sizep : 0));
- if (*dyn_sizep != 0)
- *dyn_sizep = size_sum - static_size - reserved_size;
-
- return size_sum;
-}
-
#if defined(CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK) || \
!defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
/**
@@ -1637,122 +1795,6 @@ out_free_ar:
#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK */

#ifdef CONFIG_NEED_PER_CPU_LPAGE_FIRST_CHUNK
-/**
- * pcpu_lpage_build_unit_map - build unit_map for large page remapping
- * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_sizep: in/out parameter for dynamic size, -1 for auto
- * @unit_sizep: out parameter for unit size
- * @unit_map: unit_map to be filled
- * @cpu_distance_fn: callback to determine distance between cpus
- *
- * This function builds cpu -> unit map and determine other parameters
- * considering needed percpu size, large page size and distances
- * between CPUs in NUMA.
- *
- * CPUs which are of LOCAL_DISTANCE both ways are grouped together and
- * may share units in the same large page. The returned configuration
- * is guaranteed to have CPUs on different nodes on different large
- * pages and >=75% usage of allocated virtual address space.
- *
- * RETURNS:
- * On success, fills in @unit_map, sets *@dyn_sizep, *@unit_sizep and
- * returns the number of units to be allocated. -errno on failure.
- */
-int __init pcpu_lpage_build_unit_map(size_t reserved_size, ssize_t *dyn_sizep,
- size_t *unit_sizep, size_t lpage_size,
- int *unit_map,
- pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
-{
- static int group_map[NR_CPUS] __initdata;
- static int group_cnt[NR_CPUS] __initdata;
- const size_t static_size = __per_cpu_end - __per_cpu_start;
- int group_cnt_max = 0;
- size_t size_sum, min_unit_size, alloc_size;
- int upa, max_upa, uninitialized_var(best_upa); /* units_per_alloc */
- int last_allocs;
- unsigned int cpu, tcpu;
- int group, unit;
-
- /*
- * Determine min_unit_size, alloc_size and max_upa such that
- * alloc_size is multiple of lpage_size and is the smallest
- * which can accomodate 4k aligned segments which are equal to
- * or larger than min_unit_size.
- */
- size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, dyn_sizep);
- min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
-
- alloc_size = roundup(min_unit_size, lpage_size);
- upa = alloc_size / min_unit_size;
- while (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
- upa--;
- max_upa = upa;
-
- /* group cpus according to their proximity */
- for_each_possible_cpu(cpu) {
- group = 0;
- next_group:
- for_each_possible_cpu(tcpu) {
- if (cpu == tcpu)
- break;
- if (group_map[tcpu] == group &&
- (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
- cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
- group++;
- goto next_group;
- }
- }
- group_map[cpu] = group;
- group_cnt[group]++;
- group_cnt_max = max(group_cnt_max, group_cnt[group]);
- }
-
- /*
- * Expand unit size until address space usage goes over 75%
- * and then as much as possible without using more address
- * space.
- */
- last_allocs = INT_MAX;
- for (upa = max_upa; upa; upa--) {
- int allocs = 0, wasted = 0;
-
- if (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
- continue;
-
- for (group = 0; group_cnt[group]; group++) {
- int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
- allocs += this_allocs;
- wasted += this_allocs * upa - group_cnt[group];
- }
-
- /*
- * Don't accept if wastage is over 25%. The
- * greater-than comparison ensures upa==1 always
- * passes the following check.
- */
- if (wasted > num_possible_cpus() / 3)
- continue;
-
- /* and then don't consume more memory */
- if (allocs > last_allocs)
- break;
- last_allocs = allocs;
- best_upa = upa;
- }
- *unit_sizep = alloc_size / best_upa;
-
- /* assign units to cpus accordingly */
- unit = 0;
- for (group = 0; group_cnt[group]; group++) {
- for_each_possible_cpu(cpu)
- if (group_map[cpu] == group)
- unit_map[cpu] = unit++;
- unit = roundup(unit, best_upa);
- }
-
- return unit; /* unit contains aligned number of units */
-}
-
struct pcpul_ent {
void *ptr;
void *map_addr;
@@ -1778,43 +1820,6 @@ static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
return false;
}

-static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
- size_t reserved_size, size_t dyn_size,
- size_t unit_size, size_t lpage_size,
- const int *unit_map, int nr_units)
-{
- int width = 1, v = nr_units;
- char empty_str[] = "--------";
- int upl, lpl; /* units per lpage, lpage per line */
- unsigned int cpu;
- int lpage, unit;
-
- while (v /= 10)
- width++;
- empty_str[min_t(int, width, sizeof(empty_str) - 1)] = '\0';
-
- upl = max_t(int, lpage_size / unit_size, 1);
- lpl = rounddown_pow_of_two(max_t(int, 60 / (upl * (width + 1) + 2), 1));
-
- printk("%spcpu-lpage: sta/res/dyn=%zu/%zu/%zu unit=%zu lpage=%zu", lvl,
- static_size, reserved_size, dyn_size, unit_size, lpage_size);
-
- for (lpage = 0, unit = 0; unit < nr_units; unit++) {
- if (!(unit % upl)) {
- if (!(lpage++ % lpl)) {
- printk("\n");
- printk("%spcpu-lpage: ", lvl);
- } else
- printk("| ");
- }
- if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
- printk("%0*d ", width, cpu);
- else
- printk("%s ", empty_str);
- }
- printk("\n");
-}
-
/**
* pcpu_lpage_first_chunk - remap the first percpu chunk using large page
* @reserved_size: the size of reserved percpu area in bytes
--
1.6.0.2

2009-07-21 10:31:30

by Tejun Heo

Subject: [PATCH 12/20] percpu: add chunk->base_addr

The only thing the percpu allocator wants to know about a vmalloc area
is its base address. Instead of requiring chunk->vm, add
chunk->base_addr which contains the necessary value. This simplifies
the code a bit and makes the dummy first_vm unnecessary. This change
will also make it easier to allow a chunk to be mapped by multiple vms.
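
A rough userspace sketch of the resulting address calculation follows.
struct chunk, chunk_addr() and SKETCH_PAGE_SHIFT (4K pages assumed) are
hypothetical names used for illustration; only the arithmetic mirrors
pcpu_chunk_addr() in the diff below.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K pages, illustration only */

/* Hypothetical reduced chunk: the base address is all that is needed. */
struct chunk {
	void *base_addr;
};

/*
 * Address of page @page_idx in @cpu's unit of @chunk.  @unit_offsets
 * plays the role of pcpu_unit_offsets[]: a per-cpu byte offset from
 * the chunk base, which may be sparse.
 */
static unsigned long chunk_addr(const struct chunk *chunk,
				const unsigned long *unit_offsets,
				unsigned int cpu, int page_idx)
{
	assert(chunk && unit_offsets && page_idx >= 0);

	return (unsigned long)chunk->base_addr + unit_offsets[cpu] +
		((unsigned long)page_idx << SKETCH_PAGE_SHIFT);
}
```

Nothing in the calculation depends on how the base address was
obtained, which is why a plain pointer is sufficient here.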

Signed-off-by: Tejun Heo <[email protected]>
---
mm/percpu.c | 25 +++++++++++--------------
1 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 8167fb8..7b5e194 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -94,10 +94,11 @@ struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
int free_size; /* free bytes in the chunk */
int contig_hint; /* max contiguous size hint */
- struct vm_struct *vm; /* mapped vmalloc region */
+ void *base_addr; /* base address of this chunk */
int map_used; /* # of map entries used */
int map_alloc; /* # of map entries allocated */
int *map; /* allocation map */
+ struct vm_struct *vm; /* mapped vmalloc region */
bool immutable; /* no [de]population allowed */
unsigned long populated[]; /* populated bitmap */
};
@@ -196,7 +197,7 @@ static int pcpu_page_idx(unsigned int cpu, int page_idx)
static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
unsigned int cpu, int page_idx)
{
- return (unsigned long)chunk->vm->addr + pcpu_unit_offsets[cpu] +
+ return (unsigned long)chunk->base_addr + pcpu_unit_offsets[cpu] +
(page_idx << PAGE_SHIFT);
}

@@ -324,7 +325,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
*/
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
- void *first_start = pcpu_first_chunk->vm->addr;
+ void *first_start = pcpu_first_chunk->base_addr;

/* is it in the first chunk? */
if (addr >= first_start && addr < first_start + pcpu_unit_size) {
@@ -1014,6 +1015,7 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
INIT_LIST_HEAD(&chunk->list);
chunk->free_size = pcpu_unit_size;
chunk->contig_hint = pcpu_unit_size;
+ chunk->base_addr = chunk->vm->addr;

return chunk;
}
@@ -1103,8 +1105,8 @@ area_found:

mutex_unlock(&pcpu_alloc_mutex);

- /* return address relative to unit0 */
- return __addr_to_pcpu_ptr(chunk->vm->addr + off);
+ /* return address relative to base address */
+ return __addr_to_pcpu_ptr(chunk->base_addr + off);

fail_unlock:
spin_unlock_irq(&pcpu_lock);
@@ -1213,7 +1215,7 @@ void free_percpu(void *ptr)
spin_lock_irqsave(&pcpu_lock, flags);

chunk = pcpu_chunk_addr_search(addr);
- off = addr - chunk->vm->addr;
+ off = addr - chunk->base_addr;

pcpu_free_area(chunk, off);

@@ -1565,7 +1567,6 @@ static void pcpu_dump_alloc_info(const char *lvl,
int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
void *base_addr)
{
- static struct vm_struct first_vm;
static int smap[2], dmap[2];
size_t dyn_size = ai->dyn_size;
size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
@@ -1629,10 +1630,6 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

- first_vm.flags = VM_ALLOC;
- first_vm.size = pcpu_chunk_size;
- first_vm.addr = base_addr;
-
/*
* Allocate chunk slots. The additional last slot is for
* empty chunks.
@@ -1651,7 +1648,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
*/
schunk = alloc_bootmem(pcpu_chunk_struct_size);
INIT_LIST_HEAD(&schunk->list);
- schunk->vm = &first_vm;
+ schunk->base_addr = base_addr;
schunk->map = smap;
schunk->map_alloc = ARRAY_SIZE(smap);
schunk->immutable = true;
@@ -1675,7 +1672,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
if (dyn_size) {
dchunk = alloc_bootmem(pcpu_chunk_struct_size);
INIT_LIST_HEAD(&dchunk->list);
- dchunk->vm = &first_vm;
+ dchunk->base_addr = base_addr;
dchunk->map = dmap;
dchunk->map_alloc = ARRAY_SIZE(dmap);
dchunk->immutable = true;
@@ -1691,7 +1688,7 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_chunk_relocate(pcpu_first_chunk, -1);

/* we're done */
- pcpu_base_addr = schunk->vm->addr;
+ pcpu_base_addr = base_addr;
return 0;
}

--
1.6.0.2

2009-07-21 10:30:25

by Tejun Heo

Subject: [PATCH 19/20] sparc64: use embedding percpu first chunk allocator

sparc64 currently allocates a large page for each cpu and partially
remaps it into the vmalloc area, much like the lpage first chunk
allocator did. As a 4M page is used for each cpu, this results in a
very large unit size and also adds TLB pressure due to the double
mapping of pages in the first chunk.

This patch converts sparc64 to use the embedding percpu first chunk
allocator which now knows how to handle NUMA configurations. This
simplifies the code a lot, doesn't incur any extra TLB pressure and
results in better utilization of address space.
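
To illustrate how a distance callback like the pcpu_cpu_distance()
added below can drive NUMA-aware grouping, here is a userspace sketch
modelled on the grouping loop in pcpu_lpage_build_unit_map() from
earlier in this series. The four-cpu topology, the SKETCH_* constants
and the sketch_* helpers are assumptions made up for the example, and
whether the embedding allocator groups cpus in exactly this way is
also an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS		4
#define SKETCH_LOCAL_DISTANCE	10	/* illustrative values only */
#define SKETCH_REMOTE_DISTANCE	20

/* Hypothetical topology: cpus 0,1 on node 0 and cpus 2,3 on node 1. */
static const int sketch_cpu_node[SKETCH_NR_CPUS] = { 0, 0, 1, 1 };

static int sketch_cpu_distance(unsigned int from, unsigned int to)
{
	if (sketch_cpu_node[from] == sketch_cpu_node[to])
		return SKETCH_LOCAL_DISTANCE;
	return SKETCH_REMOTE_DISTANCE;
}

/*
 * Put each cpu into the lowest-numbered group whose existing members
 * are all at local distance in both directions.  Fills @group_map and
 * returns the number of groups used.
 */
static int sketch_group_cpus(int *group_map)
{
	unsigned int cpu, tcpu;
	int group, nr_groups = 0;

	assert(group_map);

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		group = 0;
next_group:
		for (tcpu = 0; tcpu < cpu; tcpu++) {
			if (group_map[tcpu] == group &&
			    (sketch_cpu_distance(cpu, tcpu) > SKETCH_LOCAL_DISTANCE ||
			     sketch_cpu_distance(tcpu, cpu) > SKETCH_LOCAL_DISTANCE)) {
				/* conflict with a member, try the next group */
				group++;
				goto next_group;
			}
		}
		group_map[cpu] = group;
		if (group + 1 > nr_groups)
			nr_groups = group + 1;
	}
	return nr_groups;
}
```

With the topology assumed above, cpus 0 and 1 share one group and cpus
2 and 3 share another, so each node's units can be backed by memory
local to that node.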

Signed-off-by: Tejun Heo <[email protected]>
Cc: David S. Miller <[email protected]>
---
arch/sparc/Kconfig | 3 +
arch/sparc/kernel/smp_64.c | 128 ++++++-------------------------------------
2 files changed, 21 insertions(+), 110 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 4f6ed0f..fbd1233 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -95,6 +95,9 @@ config AUDIT_ARCH
config HAVE_SETUP_PER_CPU_AREA
def_bool y if SPARC64

+config NEED_PER_CPU_EMBED_FIRST_CHUNK
+ def_bool y if SPARC64
+
config GENERIC_HARDIRQS_NO__DO_IRQ
bool
def_bool y if SPARC64
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index b03fd36..ff68373 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1389,8 +1389,8 @@ void smp_send_stop(void)
* RETURNS:
* Pointer to the allocated area on success, NULL on failure.
*/
-static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
- unsigned long align)
+static void * __init pcpu_alloc_bootmem(unsigned int cpu, size_t size,
+ size_t align)
{
const unsigned long goal = __pa(MAX_DMA_ADDRESS);
#ifdef CONFIG_NEED_MULTIPLE_NODES
@@ -1415,123 +1415,31 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
#endif
}

-#define PCPU_CHUNK_SIZE (4UL * 1024UL * 1024UL)
-
-static void __init pcpu_map_range(unsigned long start, unsigned long end,
- struct page *page)
+static void __init pcpu_free_bootmem(void *ptr, size_t size)
{
- unsigned long pfn = page_to_pfn(page);
- unsigned long pte_base;
-
- BUG_ON((pfn<<PAGE_SHIFT)&(PCPU_CHUNK_SIZE - 1UL));
-
- pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4U |
- _PAGE_CP_4U | _PAGE_CV_4U |
- _PAGE_P_4U | _PAGE_W_4U);
- if (tlb_type == hypervisor)
- pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4V |
- _PAGE_CP_4V | _PAGE_CV_4V |
- _PAGE_P_4V | _PAGE_W_4V);
-
- while (start < end) {
- pgd_t *pgd = pgd_offset_k(start);
- unsigned long this_end;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
-
- pud = pud_offset(pgd, start);
- if (pud_none(*pud)) {
- pmd_t *new;
-
- new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
- pud_populate(&init_mm, pud, new);
- }
-
- pmd = pmd_offset(pud, start);
- if (!pmd_present(*pmd)) {
- pte_t *new;
-
- new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
- pmd_populate_kernel(&init_mm, pmd, new);
- }
-
- pte = pte_offset_kernel(pmd, start);
- this_end = (start + PMD_SIZE) & PMD_MASK;
- if (this_end > end)
- this_end = end;
-
- while (start < this_end) {
- unsigned long paddr = pfn << PAGE_SHIFT;
-
- pte_val(*pte) = (paddr | pte_base);
+ free_bootmem(__pa(ptr), size);
+}

- start += PAGE_SIZE;
- pte++;
- pfn++;
- }
- }
+static int pcpu_cpu_distance(unsigned int from, unsigned int to)
+{
+ if (cpu_to_node(from) == cpu_to_node(to))
+ return LOCAL_DISTANCE;
+ else
+ return REMOTE_DISTANCE;
}

void __init setup_per_cpu_areas(void)
{
- static struct vm_struct vm;
- struct pcpu_alloc_info *ai;
- unsigned long delta, cpu;
- size_t size_sum;
- size_t ptrs_size;
- void **ptrs;
+ unsigned long delta;
+ unsigned int cpu;
int rc;

- ai = pcpu_alloc_alloc_info(1, nr_cpu_ids);
-
- ai->static_size = __per_cpu_end - __per_cpu_start;
- ai->reserved_size = PERCPU_MODULE_RESERVE;
-
- size_sum = PFN_ALIGN(ai->static_size + ai->reserved_size +
- PERCPU_DYNAMIC_RESERVE);
-
- ai->dyn_size = size_sum - ai->static_size - ai->reserved_size;
- ai->unit_size = PCPU_CHUNK_SIZE;
- ai->atom_size = PCPU_CHUNK_SIZE;
- ai->alloc_size = PCPU_CHUNK_SIZE;
- ai->groups[0].nr_units = nr_cpu_ids;
-
- for_each_possible_cpu(cpu)
- ai->groups[0].cpu_map[cpu] = cpu;
-
- ptrs_size = PFN_ALIGN(nr_cpu_ids * sizeof(ptrs[0]));
- ptrs = alloc_bootmem(ptrs_size);
-
- for_each_possible_cpu(cpu) {
- ptrs[cpu] = pcpu_alloc_bootmem(cpu, PCPU_CHUNK_SIZE,
- PCPU_CHUNK_SIZE);
-
- free_bootmem(__pa(ptrs[cpu] + size_sum),
- PCPU_CHUNK_SIZE - size_sum);
-
- memcpy(ptrs[cpu], __per_cpu_load, ai->static_size);
- }
-
- /* allocate address and map */
- vm.flags = VM_ALLOC;
- vm.size = nr_cpu_ids * PCPU_CHUNK_SIZE;
- vm_area_register_early(&vm, PCPU_CHUNK_SIZE);
-
- for_each_possible_cpu(cpu) {
- unsigned long start = (unsigned long) vm.addr;
- unsigned long end;
-
- start += cpu * PCPU_CHUNK_SIZE;
- end = start + PCPU_CHUNK_SIZE;
- pcpu_map_range(start, end, virt_to_page(ptrs[cpu]));
- }
-
- rc = pcpu_setup_first_chunk(ai, vm.addr);
+ rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+ PERCPU_DYNAMIC_RESERVE, 4 << 20,
+ pcpu_cpu_distance, pcpu_alloc_bootmem,
+ pcpu_free_bootmem);
if (rc)
- panic("failed to setup percpu first chunk (%d)", rc);
-
- free_bootmem(__pa(ptrs), ptrs_size);
+ panic("failed to initialize first chunk (%d)", rc);

delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu)
--
1.6.0.2

2009-07-21 10:30:21

by Tejun Heo

Subject: [PATCH 07/20] percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()

Now that all actual first chunk allocation and copying happen in the
first chunk allocators and helpers, there's no reason for
pcpu_setup_first_chunk() to try to determine @dyn_size automatically.
The only remaining user is the page first chunk allocator. Make it
determine dyn_size like the other allocators and make @dyn_size
mandatory for pcpu_setup_first_chunk().
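
For reference, the way a first chunk allocator can settle dyn_size
before calling pcpu_setup_first_chunk() is sketched below in userspace
C. It follows the rounding done by pcpu_calc_fc_sizes() as it appears
elsewhere in this series; SKETCH_PAGE_SIZE, sketch_page_align() and
sketch_calc_fc_sizes() are hypothetical names, and long stands in for
ssize_t.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4K pages */

/* Round @v up to the next page boundary. */
static size_t sketch_page_align(size_t v)
{
	return (v + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Compute the page-aligned sum of the static, reserved and dynamic
 * areas.  A negative *@dyn_sizep means "auto"; unless it is exactly
 * 0, the dynamic size is updated to absorb the alignment padding.
 */
static size_t sketch_calc_fc_sizes(size_t static_size, size_t reserved_size,
				   long *dyn_sizep)
{
	size_t size_sum;

	assert(dyn_sizep);

	size_sum = sketch_page_align(static_size + reserved_size +
			(*dyn_sizep >= 0 ? (size_t)*dyn_sizep : 0));
	if (*dyn_sizep != 0)
		*dyn_sizep = (long)(size_sum - static_size - reserved_size);

	return size_sum;
}
```

Once the caller has done this, the value handed to
pcpu_setup_first_chunk() is never negative, which is what allows the
parameter to become a plain size_t.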

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/percpu.h | 2 +-
mm/percpu.c | 39 +++++++++++++++++++--------------------
2 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index be2fc8f..0cfdd14 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -79,7 +79,7 @@ typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

extern size_t __init pcpu_setup_first_chunk(
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, size_t unit_size,
+ size_t dyn_size, size_t unit_size,
void *base_addr, const int *unit_map);

#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
diff --git a/mm/percpu.c b/mm/percpu.c
index e9f45ab..3177cf6 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1235,7 +1235,7 @@ EXPORT_SYMBOL_GPL(free_percpu);
* pcpu_setup_first_chunk - initialize the first percpu chunk
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes, 0 for none
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: free size for dynamic allocation in bytes
* @unit_size: unit size in bytes, must be multiple of PAGE_SIZE
* @base_addr: mapped address
* @unit_map: cpu -> unit map, NULL for sequential mapping
@@ -1252,10 +1252,9 @@ EXPORT_SYMBOL_GPL(free_percpu);
* limited offset range for symbol relocations to guarantee module
* percpu symbols fall inside the relocatable range.
*
- * @dyn_size, if non-negative, determines the number of bytes
- * available for dynamic allocation in the first chunk. Specifying
- * non-negative value makes percpu leave alone the area beyond
- * @static_size + @reserved_size + @dyn_size.
+ * @dyn_size determines the number of bytes available for dynamic
+ * allocation in the first chunk. The area between @static_size +
+ * @reserved_size + @dyn_size and @unit_size is unused.
*
* @unit_size specifies unit size and must be aligned to PAGE_SIZE and
* equal to or larger than @static_size + @reserved_size + if
@@ -1276,13 +1275,12 @@ EXPORT_SYMBOL_GPL(free_percpu);
* percpu access.
*/
size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
- ssize_t dyn_size, size_t unit_size,
+ size_t dyn_size, size_t unit_size,
void *base_addr, const int *unit_map)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
- size_t size_sum = static_size + reserved_size +
- (dyn_size >= 0 ? dyn_size : 0);
+ size_t size_sum = static_size + reserved_size + dyn_size;
struct pcpu_chunk *schunk, *dchunk = NULL;
unsigned int cpu, tcpu;
int i;
@@ -1345,9 +1343,6 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

- if (dyn_size < 0)
- dyn_size = pcpu_unit_size - static_size - reserved_size;
-
first_vm.flags = VM_ALLOC;
first_vm.size = pcpu_chunk_size;
first_vm.addr = base_addr;
@@ -1557,6 +1552,8 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
{
static struct vm_struct vm;
const size_t static_size = __per_cpu_end - __per_cpu_start;
+ ssize_t dyn_size = -1;
+ size_t size_sum, unit_size;
char psize_str[16];
int unit_pages;
size_t pages_size;
@@ -1567,8 +1564,9 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,

snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

- unit_pages = PFN_UP(max_t(size_t, static_size + reserved_size,
- PCPU_MIN_UNIT_SIZE));
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
+ unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+ unit_pages = unit_size >> PAGE_SHIFT;

/* unaligned allocations can't be freed, round up to page size */
pages_size = PFN_ALIGN(unit_pages * nr_cpu_ids * sizeof(pages[0]));
@@ -1591,12 +1589,12 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,

/* allocate vm area, map the pages and copy static data */
vm.flags = VM_ALLOC;
- vm.size = nr_cpu_ids * unit_pages << PAGE_SHIFT;
+ vm.size = nr_cpu_ids * unit_size;
vm_area_register_early(&vm, PAGE_SIZE);

for_each_possible_cpu(cpu) {
- unsigned long unit_addr = (unsigned long)vm.addr +
- (cpu * unit_pages << PAGE_SHIFT);
+ unsigned long unit_addr =
+ (unsigned long)vm.addr + cpu * unit_size;

for (i = 0; i < unit_pages; i++)
populate_pte_fn(unit_addr + (i << PAGE_SHIFT));
@@ -1620,11 +1618,12 @@ ssize_t __init pcpu_page_first_chunk(size_t reserved_size,
}

/* we're ready, commit */
- pr_info("PERCPU: %d %s pages/cpu @%p s%zu r%zu\n",
- unit_pages, psize_str, vm.addr, static_size, reserved_size);
+ pr_info("PERCPU: %d %s pages/cpu @%p s%zu r%zu d%zu\n",
+ unit_pages, psize_str, vm.addr, static_size, reserved_size,
+ dyn_size);

- ret = pcpu_setup_first_chunk(static_size, reserved_size, -1,
- unit_pages << PAGE_SHIFT, vm.addr, NULL);
+ ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
+ unit_size, vm.addr, NULL);
goto out_free_ar;

enomem:
--
1.6.0.2
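
The size accounting that patch 07/20 pushes into the caller can be sketched in userspace C. This is only an illustration under stated assumptions: the body of pcpu_calc_fc_sizes() is not shown in the patch, so the sketch assumes the sum is rounded up to a page and that the rounding slack is folded into the dynamic area. The names sketch_calc_fc_sizes, SKETCH_PAGE_SIZE and SKETCH_MIN_UNIT_SIZE are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's PAGE_SIZE and PCPU_MIN_UNIT_SIZE. */
#define SKETCH_PAGE_SIZE     ((size_t)4096)
#define SKETCH_MIN_UNIT_SIZE ((size_t)64 << 10)

struct fc_sizes {
	size_t size_sum;	/* static + reserved + dyn, page aligned */
	size_t dyn_size;	/* dynamic area, including alignment slack */
	size_t unit_size;	/* bytes reserved per cpu unit */
	size_t unit_pages;	/* unit_size expressed in pages */
};

/* Round v up to the next multiple of the (power of two) page size. */
static size_t sketch_page_align(size_t v)
{
	return (v + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Assumed behaviour, not the kernel helper: the caller fixes dyn_size up
 * front so that the setup function never has to guess it.
 */
struct fc_sizes sketch_calc_fc_sizes(size_t static_size, size_t reserved_size,
				     size_t dyn_request)
{
	struct fc_sizes s;

	s.size_sum = sketch_page_align(static_size + reserved_size + dyn_request);
	s.dyn_size = s.size_sum - static_size - reserved_size;
	s.unit_size = s.size_sum > SKETCH_MIN_UNIT_SIZE ?
		      s.size_sum : SKETCH_MIN_UNIT_SIZE;
	s.unit_pages = s.unit_size / SKETCH_PAGE_SIZE;

	/* Whatever lies between size_sum and unit_size is simply unused. */
	assert(static_size + reserved_size + s.dyn_size <= s.unit_size);
	return s;
}
```

With 4k pages, sketch_calc_fc_sizes(10000, 8192, 20480) gives a size_sum of 40960 and a dyn_size of 22768, while the unit itself is padded to the 64k minimum (16 pages).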

2009-07-21 10:31:58

by Tejun Heo

Subject: [PATCH 01/20] percpu: fix pcpu_reclaim() locking

pcpu_reclaim() calls pcpu_depopulate_chunk(), which uses the pages
array and bitmap returned by pcpu_get_pages_and_bitmap() and thus
must be called under pcpu_alloc_mutex. pcpu_reclaim() released the
mutex before calling depopulate, leading to double frees and other
strange problems caused by unexpected concurrent use of the pages
array and bitmap. Fix it by holding the mutex until depopulation is
done.

Signed-off-by: Tejun Heo <[email protected]>
---
mm/percpu.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index f993dc8..c44a5b2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1181,12 +1181,13 @@ static void pcpu_reclaim(struct work_struct *work)
}

spin_unlock_irq(&pcpu_lock);
- mutex_unlock(&pcpu_alloc_mutex);

list_for_each_entry_safe(chunk, next, &todo, list) {
pcpu_depopulate_chunk(chunk, 0, pcpu_unit_size);
free_pcpu_chunk(chunk);
}
+
+ mutex_unlock(&pcpu_alloc_mutex);
}

/**
--
1.6.0.2
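
The locking rule fixed by patch 01/20 can be illustrated with a small userspace analogue. It is a sketch only: pthread mutexes stand in for pcpu_alloc_mutex and the pcpu_lock spinlock, a static array stands in for the shared pages array and bitmap, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SKETCH_NR_PAGES 8

/* Hypothetical userspace stand-ins for pcpu_alloc_mutex and pcpu_lock. */
static pthread_mutex_t sketch_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sketch_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared scratch buffer, playing the role of the pages array and bitmap. */
static int sketch_scratch[SKETCH_NR_PAGES];

struct sketch_chunk {
	int populated[SKETCH_NR_PAGES];
	int freed;
};

/* Must be called with sketch_alloc_mutex held: it uses sketch_scratch. */
static int sketch_depopulate(struct sketch_chunk *chunk)
{
	int i, released = 0;

	for (i = 0; i < SKETCH_NR_PAGES; i++) {
		sketch_scratch[i] = chunk->populated[i];
		chunk->populated[i] = 0;
	}
	for (i = 0; i < SKETCH_NR_PAGES; i++)
		released += sketch_scratch[i];
	return released;
}

/* Returns the number of pages released from the chunks on the todo list. */
int sketch_reclaim(struct sketch_chunk **todo, size_t nr)
{
	size_t i;
	int released = 0;

	pthread_mutex_lock(&sketch_alloc_mutex);

	pthread_mutex_lock(&sketch_list_lock);
	/* ... the real code moves empty chunks onto a private list here ... */
	pthread_mutex_unlock(&sketch_list_lock);

	/* Only the inner lock is dropped; the outer mutex stays held. */
	for (i = 0; i < nr; i++) {
		released += sketch_depopulate(todo[i]);
		todo[i]->freed = 1;
	}

	pthread_mutex_unlock(&sketch_alloc_mutex);
	return released;
}
```

Dropping the outer mutex before the loop, as the old code did, would let a second thread overwrite sketch_scratch while the first is still reading it.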

2009-07-21 10:31:06

by Tejun Heo

Subject: [PATCH 20/20] powerpc64: convert to dynamic percpu allocator

Now that percpu allows arbitrary embedding of the first chunk,
powerpc64 can easily be converted to the dynamic percpu allocator.
Convert it. powerpc supports several large page sizes. Cap atom_size
at 1M. There isn't much to gain by going above that anyway.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
---
arch/powerpc/Kconfig | 4 +-
arch/powerpc/kernel/setup_64.c | 61 +++++++++++++++++++++++++++++----------
2 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 61bbffa..2c42e15 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -46,10 +46,10 @@ config GENERIC_HARDIRQS_NO__DO_IRQ
bool
default y

-config HAVE_LEGACY_PER_CPU_AREA
+config HAVE_SETUP_PER_CPU_AREA
def_bool PPC64

-config HAVE_SETUP_PER_CPU_AREA
+config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool PPC64

config IRQ_PER_CPU
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 1f68160..aa6e450 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -57,6 +57,7 @@
#include <asm/cache.h>
#include <asm/page.h>
#include <asm/mmu.h>
+#include <asm/mmu-hash64.h>
#include <asm/firmware.h>
#include <asm/xmon.h>
#include <asm/udbg.h>
@@ -569,25 +570,53 @@ void cpu_die(void)
}

#ifdef CONFIG_SMP
-void __init setup_per_cpu_areas(void)
+#define PCPU_DYN_SIZE ()
+
+static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
- int i;
- unsigned long size;
- char *ptr;
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(__per_cpu_end - __per_cpu_start, PAGE_SIZE);
-#ifdef CONFIG_MODULES
- if (size < PERCPU_ENOUGH_ROOM)
- size = PERCPU_ENOUGH_ROOM;
-#endif
+ return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
+ __pa(MAX_DMA_ADDRESS));
+}

- for_each_possible_cpu(i) {
- ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
+static void __init pcpu_fc_free(void *ptr, size_t size)
+{
+ free_bootmem(__pa(ptr), size);
+}

- paca[i].data_offset = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- }
+static int pcpu_cpu_distance(unsigned int from, unsigned int to)
+{
+ if (cpu_to_node(from) == cpu_to_node(to))
+ return LOCAL_DISTANCE;
+ else
+ return REMOTE_DISTANCE;
+}
+
+void __init setup_per_cpu_areas(void)
+{
+ const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
+ size_t atom_size;
+ unsigned long delta;
+ unsigned int cpu;
+ int rc;
+
+ /*
+ * Linear mapping is one of 4K, 1M and 16M. For 4K, no need
+ * to group units. For larger mappings, use 1M atom which
+ * should be large enough to contain a number of units.
+ */
+ if (mmu_linear_psize == MMU_PAGE_4K)
+ atom_size = PAGE_SIZE;
+ else
+ atom_size = 1 << 20;
+
+ rc = pcpu_embed_first_chunk(0, dyn_size, atom_size, pcpu_cpu_distance,
+ pcpu_fc_alloc, pcpu_fc_free);
+ if (rc < 0)
+ panic("cannot initialize percpu area (err=%d)", rc);
+
+ delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
+ for_each_possible_cpu(cpu)
+ paca[cpu].data_offset = delta + pcpu_unit_offsets[cpu];
}
#endif

--
1.6.0.2
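
The two calculations in the powerpc64 conversion, choosing atom_size and deriving each cpu's data_offset, can be tried out in isolation. The sketch below is a userspace approximation: the enum and the sketch_* functions are hypothetical, and plain unsigned long values stand in for pcpu_base_addr, __per_cpu_start and pcpu_unit_offsets[].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the linear mapping page sizes named in the patch. */
enum sketch_linear_psize { SKETCH_PSIZE_4K, SKETCH_PSIZE_1M, SKETCH_PSIZE_16M };

/* 4K linear mapping: no grouping needed. Anything larger: cap at 1M. */
size_t sketch_atom_size(enum sketch_linear_psize psize, size_t page_size)
{
	if (psize == SKETCH_PSIZE_4K)
		return page_size;
	return (size_t)1 << 20;
}

/*
 * data_offset[cpu] is what must be added to the link-time address of a
 * percpu variable to reach that cpu's copy. Unsigned arithmetic wraps,
 * so this also works when the first chunk sits below per_cpu_start.
 */
void sketch_fill_data_offsets(unsigned long base_addr,
			      unsigned long per_cpu_start,
			      const unsigned long *unit_offsets,
			      unsigned long *data_offset,
			      unsigned int nr_cpus)
{
	unsigned long delta = base_addr - per_cpu_start;
	unsigned int cpu;

	assert(unit_offsets && data_offset);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		data_offset[cpu] = delta + unit_offsets[cpu];
}
```

For a variable linked 0x40 bytes past per_cpu_start, adding data_offset[cpu] lands 0x40 bytes into that cpu's unit, wherever the embedding allocator placed the unit.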

2009-07-21 12:24:07

by Tejun Heo

Subject: [RFC PATCH] percpu: kill legacy percpu allocator

With ia64 converted, there's no arch left which still uses the legacy
percpu allocator. Kill it.

NOT_SIGNED_OFF_YET
Cc: Ingo Molnar <[email protected]>
Cc: Rusty Russell <[email protected]>
---

This patch is not ready yet. The following ia64 patch needs to be
verified and included before this one.

http://article.gmane.org/gmane.linux.kernel.cross-arch/4132

This patch is available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review-kill-legacy

After this patch gets included, we can finally proceed with
Christoph's this_cpu patches and other goodies. :-)

Thanks.

include/linux/percpu.h | 24 ------
kernel/module.c | 150 -----------------------------------------
mm/Makefile | 4 -
mm/allocpercpu.c | 177 -------------------------------------------------
mm/percpu.c | 2
5 files changed, 357 deletions(-)

Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -34,8 +34,6 @@

#ifdef CONFIG_SMP

-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
/* minimum unit size, also is the maximum supported allocation size */
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)

@@ -130,28 +128,6 @@ extern int __init pcpu_page_first_chunk(
#define per_cpu_ptr(ptr, cpu) SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))

extern void *__alloc_reserved_percpu(size_t size, size_t align);
-
-#else /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-struct percpu_data {
- void *ptrs[1];
-};
-
-/* pointer disguising messes up the kmemleak objects tracking */
-#ifndef CONFIG_DEBUG_KMEMLEAK
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-#else
-#define __percpu_disguise(pdata) (struct percpu_data *)(pdata)
-#endif
-
-#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = __percpu_disguise(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void *__pdata);

Index: work/mm/allocpercpu.c
===================================================================
--- work.orig/mm/allocpercpu.c
+++ /dev/null
@@ -1,177 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bootmem.h>
-#include <asm/sections.h>
-
-#ifndef cache_line_size
-#define cache_line_size() L1_CACHE_BYTES
-#endif
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-static void percpu_depopulate(void *__pdata, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
-
- kfree(pdata->ptrs[cpu]);
- pdata->ptrs[cpu] = NULL;
-}
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-static void __percpu_depopulate_mask(void *__pdata, const cpumask_t *mask)
-{
- int cpu;
- for_each_cpu_mask_nr(cpu, *mask)
- percpu_depopulate(__pdata, cpu);
-}
-
-#define percpu_depopulate_mask(__pdata, mask) \
- __percpu_depopulate_mask((__pdata), &(mask))
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-static void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
- int node = cpu_to_node(cpu);
-
- /*
- * We should make sure each CPU gets private memory.
- */
- size = roundup(size, cache_line_size());
-
- BUG_ON(pdata->ptrs[cpu]);
- if (node_online(node))
- pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
- else
- pdata->ptrs[cpu] = kzalloc(size, gfp);
- return pdata->ptrs[cpu];
-}
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-static int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- cpumask_t populated;
- int cpu;
-
- cpus_clear(populated);
- for_each_cpu_mask_nr(cpu, *mask)
- if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
- __percpu_depopulate_mask(__pdata, &populated);
- return -ENOMEM;
- } else
- cpu_set(cpu, populated);
- return 0;
-}
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
- __percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-
-/**
- * alloc_percpu - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @align: alignment
- *
- * Allocate dynamic percpu area. Percpu objects are populated with
- * zeroed buffers.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- /*
- * We allocate whole cache lines to avoid false sharing
- */
- size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
- void *pdata = kzalloc(sz, GFP_KERNEL);
- void *__pdata = __percpu_disguise(pdata);
-
- /*
- * Can't easily make larger alignment work with kmalloc. WARN
- * on it. Larger alignment should only be used for module
- * percpu sections on SMP for which this path isn't used.
- */
- WARN_ON_ONCE(align > SMP_CACHE_BYTES);
-
- if (unlikely(!pdata))
- return NULL;
- if (likely(!__percpu_populate_mask(__pdata, size, GFP_KERNEL,
- &cpu_possible_map)))
- return __pdata;
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * free_percpu - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void free_percpu(void *__pdata)
-{
- if (unlikely(!__pdata))
- return;
- __percpu_depopulate_mask(__pdata, cpu_possible_mask);
- kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(free_percpu);
-
-/*
- * Generic percpu area setup.
- */
-#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-
-EXPORT_SYMBOL(__per_cpu_offset);
-
-void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- unsigned long nr_possible_cpus = num_possible_cpus();
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
- ptr = alloc_bootmem_pages(size * nr_possible_cpus);
-
- for_each_possible_cpu(i) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- ptr += size;
- }
-}
-#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
Index: work/kernel/module.c
===================================================================
--- work.orig/kernel/module.c
+++ work/kernel/module.c
@@ -364,8 +364,6 @@ EXPORT_SYMBOL_GPL(find_module);

#ifdef CONFIG_SMP

-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
static void *percpu_modalloc(unsigned long size, unsigned long align,
const char *name)
{
@@ -389,154 +387,6 @@ static void percpu_modfree(void *freeme)
free_percpu(freeme);
}

-#else /* ... CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block. -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
- /* Reallocation required? */
- if (pcpu_num_used + 1 > pcpu_num_allocated) {
- int *new;
-
- new = krealloc(pcpu_size, sizeof(new[0])*pcpu_num_allocated*2,
- GFP_KERNEL);
- if (!new)
- return 0;
-
- pcpu_num_allocated *= 2;
- pcpu_size = new;
- }
-
- /* Insert a new subblock */
- memmove(&pcpu_size[i+1], &pcpu_size[i],
- sizeof(pcpu_size[0]) * (pcpu_num_used - i));
- pcpu_num_used++;
-
- pcpu_size[i+1] -= size;
- pcpu_size[i] = size;
- return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
- if (val < 0)
- return -val;
- return val;
-}
-
-static void *percpu_modalloc(unsigned long size, unsigned long align,
- const char *name)
-{
- unsigned long extra;
- unsigned int i;
- void *ptr;
- int cpu;
-
- if (align > PAGE_SIZE) {
- printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
- name, align, PAGE_SIZE);
- align = PAGE_SIZE;
- }
-
- ptr = __per_cpu_start;
- for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- /* Extra for alignment requirement. */
- extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
- BUG_ON(i == 0 && extra != 0);
-
- if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
- continue;
-
- /* Transfer extra to previous block. */
- if (pcpu_size[i-1] < 0)
- pcpu_size[i-1] -= extra;
- else
- pcpu_size[i-1] += extra;
- pcpu_size[i] -= extra;
- ptr += extra;
-
- /* Split block if warranted */
- if (pcpu_size[i] - size > sizeof(unsigned long))
- if (!split_block(i, size))
- return NULL;
-
- /* add the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_alloc(ptr + per_cpu_offset(cpu), size, 0,
- GFP_KERNEL);
-
- /* Mark allocated */
- pcpu_size[i] = -pcpu_size[i];
- return ptr;
- }
-
- printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
- size);
- return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
- unsigned int i;
- void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
- int cpu;
-
- /* First entry is core kernel percpu data. */
- for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- if (ptr == freeme) {
- pcpu_size[i] = -pcpu_size[i];
- goto free;
- }
- }
- BUG();
-
- free:
- /* remove the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_free(freeme + per_cpu_offset(cpu));
-
- /* Merge with previous? */
- if (pcpu_size[i-1] >= 0) {
- pcpu_size[i-1] += pcpu_size[i];
- pcpu_num_used--;
- memmove(&pcpu_size[i], &pcpu_size[i+1],
- (pcpu_num_used - i) * sizeof(pcpu_size[0]));
- i--;
- }
- /* Merge with next? */
- if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
- pcpu_size[i] += pcpu_size[i+1];
- pcpu_num_used--;
- memmove(&pcpu_size[i+1], &pcpu_size[i+2],
- (pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
- }
-}
-
-static int percpu_modinit(void)
-{
- pcpu_num_used = 2;
- pcpu_num_allocated = 2;
- pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
- GFP_KERNEL);
- /* Static in-kernel percpu data (used). */
- pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
- /* Free room. */
- pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
- if (pcpu_size[1] < 0) {
- printk(KERN_ERR "No per-cpu room for modules.\n");
- pcpu_num_used = 1;
- }
-
- return 0;
-}
-__initcall(percpu_modinit);
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
const char *secstrings)
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -46,8 +46,6 @@
*
* To use this allocator, arch code should do the followings.
*
- * - drop CONFIG_HAVE_LEGACY_PER_CPU_AREA
- *
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
* regular address to percpu pointer and back if they need to be
* different from the default
Index: work/mm/Makefile
===================================================================
--- work.orig/mm/Makefile
+++ work/mm/Makefile
@@ -33,11 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
obj-$(CONFIG_SMP) += percpu.o
-else
-obj-$(CONFIG_SMP) += allocpercpu.o
-endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
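
For readers comparing the two schemes, the difference between the removed legacy per_cpu_ptr() and the offset-based one that remains can be shown in plain C. This is a sketch, not kernel code: the sketch_* names are hypothetical, and the offset table stands in for per_cpu_offset().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4

/* Legacy layout: one separately allocated object per cpu. */
struct sketch_percpu_data {
	void *ptrs[SKETCH_NR_CPUS];
};

/* Bitwise NOT is its own inverse, so one helper disguises and reveals. */
void *sketch_disguise(void *p)
{
	return (void *)~(uintptr_t)p;
}

/* Legacy lookup: undo the disguise, then index the pointer table. */
void *sketch_legacy_cpu_ptr(void *handle, int cpu)
{
	struct sketch_percpu_data *pdata = sketch_disguise(handle);

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return pdata->ptrs[cpu];
}

/* Dynamic allocator lookup: the handle plus a per-cpu offset, no table. */
void *sketch_offset_cpu_ptr(void *handle, const size_t *unit_offsets, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return (char *)handle + unit_offsets[cpu];
}
```

The legacy form costs an extra memory load per access, and its handle cannot be treated as an ordinary address. The offset form is plain pointer arithmetic.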

2009-07-21 22:29:03

by Christoph Lameter

Subject: Re: [PATCH 03/20] percpu: rename 4k first chunk allocator to page


On Tue, 21 Jul 2009, Tejun Heo wrote:

> - * Boring fallback 4k allocator. This allocator puts more pressure on
> - * PTE TLBs but other than that behaves nicely on both UMA and NUMA.
> + * Boring fallback 4k page allocator. This allocator puts more

Guess this should not mention 4k anymore? Page size allocation?

> pr_warning("PERCPU: %s allocator failed (%zd), "
> - "falling back to 4k\n",
> + "falling back to page\n",

"falling back to page size"?

2009-07-21 22:59:57

by Christoph Lameter

2009-07-22 03:54:15

by David Miller

Subject: Re: [PATCH 19/20] sparc64: use embedding percpu first chunk allocator

From: Tejun Heo <[email protected]>
Date: Tue, 21 Jul 2009 19:26:18 +0900

> sparc64 currently allocates a large page for each cpu and partially
> remap them into vmalloc area much like what lpage first chunk
> allocator did. As a 4M page is used for each cpu, this results in
> very large unit size and also adds TLB pressure due to the double
> mapping of pages in the first chunk.
>
> This patch converts sparc64 to use the embedding percpu first chunk
> allocator which now knows how to handle NUMA configurations. This
> simplifies the code a lot, doesn't incur any extra TLB pressure and
> results in better utilization of address space.
>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: David S. Miller <[email protected]>

2009-07-22 04:30:28

by Rusty Russell

Subject: Re: [RFC PATCH] percpu: kill legacy percpu allocator

On Tue, 21 Jul 2009 09:52:44 pm Tejun Heo wrote:
> include/linux/percpu.h | 24 ------
> kernel/module.c | 150 -----------------------------------------

FWIW, I'm looking fwd to this. module.c doesn't shrink very often :)

Thanks,
Rusty.

2009-07-22 04:40:09

by Tejun Heo

Subject: Re: [PATCH 03/20] percpu: rename 4k first chunk allocator to page

Christoph Lameter wrote:
> On Tue, 21 Jul 2009, Tejun Heo wrote:
>
>> - * Boring fallback 4k allocator. This allocator puts more pressure on
>> - * PTE TLBs but other than that behaves nicely on both UMA and NUMA.
>> + * Boring fallback 4k page allocator. This allocator puts more
>
> Guess this should not mention 4k anymore? Page size allocation?

That's x86 specific code and PAGE_SIZE is always 4k there, so....
Also, this comment gets completely deleted by later patches anyway.

>> pr_warning("PERCPU: %s allocator failed (%zd), "
>> - "falling back to 4k\n",
>> + "falling back to page\n",
>
> "falling back to page size"?

So updated.

Thanks.

--
tejun