2009-06-24 13:31:21

by Tejun Heo

Subject: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello,

This patchset is a combination of the following two patchsets.

[1] x86,percpu: generalize 4k and lpage allocator
[2] percpu: teach lpage allocator about NUMA

Changes from the last postings are

* updated to be on top of the current percpu#for-next(bf4bb2b1)

* sparc64 was converted to the dynamic percpu allocator and uses
pcpu_setup_first_chunk(), which is changed by this patchset. sparc64
is updated accordingly.

This patchset contains the following patches.

0001-x86-make-pcpu_chunk_addr_search-matching-stricter.patch
0002-percpu-drop-unit_size-from-embed-first-chunk-alloc.patch
0003-x86-percpu-generalize-4k-first-chunk-allocator.patch
0004-percpu-make-4k-first-chunk-allocator-map-memory.patch
0005-x86-percpu-generalize-lpage-first-chunk-allocator.patch
0006-percpu-simplify-pcpu_setup_first_chunk.patch
0007-percpu-reorder-a-few-functions-in-mm-percpu.c.patch
0008-percpu-drop-pcpu_chunk-page.patch
0009-percpu-allow-non-linear-sparse-cpu-unit-mappin.patch
0010-percpu-teach-large-page-allocator-about-NUMA.patch

0001-0006 generalize the first chunk allocators. 0007-0010 improve
the lpage allocator so that NUMA is handled more intelligently.

This patchset first generalizes the first chunk allocators, makes the
percpu allocator able to use a non-linear and/or sparse cpu -> unit
mapping, and then makes the lpage allocator consider CPU topology and
group CPUs within LOCAL_DISTANCE into the same large pages. For
example, on a 4/4 NUMA machine, the original code used up 16MB for
each chunk while the new code uses only 4MB - one large page for each
NUMA node.
The grouping code is quite robust and will try to minimize space
wastage even when the CPU topology is asymmetric.
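
To illustrate the grouping criterion, here is a rough sketch only -
not code from patch 0010, which is additionally careful about how
many CPUs it packs per group to limit space wastage. CPUs whose nodes
are within LOCAL_DISTANCE of each other end up sharing a large page
group; sketch_build_groups() and group_map[] below are made-up names.

#include <linux/cpumask.h>
#include <linux/topology.h>

/* rough sketch only - returns the number of lpage groups built */
static int __init sketch_build_groups(int *group_map)
{
	int nr_groups = 0;
	unsigned int cpu, tcpu;

	for_each_possible_cpu(cpu) {
		group_map[cpu] = nr_groups;	/* tentatively a new group */
		for_each_possible_cpu(tcpu) {
			if (tcpu >= cpu)
				break;
			if (node_distance(cpu_to_node(cpu), cpu_to_node(tcpu))
			    == LOCAL_DISTANCE) {
				/* close enough, share tcpu's group/lpage */
				group_map[cpu] = group_map[tcpu];
				break;
			}
		}
		if (group_map[cpu] == nr_groups)
			nr_groups++;
	}
	return nr_groups;
}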

David, sparc64 should be able to use the lpage (renamed from remap)
allocator the same way x86_64 does. Well, at least that was my
intention; if something doesn't work or needs improvement for
sparc64, please let me know.

This patchset is available in the following git tree and will be
published in for-next if there's no major objection. It might get
rebased before going into for-next.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git lpage-numa

diffstat follows.

arch/sparc/kernel/smp_64.c | 42 -
arch/x86/include/asm/percpu.h | 9
arch/x86/kernel/setup_percpu.c | 297 ++-------
arch/x86/mm/pageattr.c | 1
include/linux/percpu.h | 68 +-
mm/percpu.c | 1276 +++++++++++++++++++++++++++++++----------
6 files changed, 1139 insertions(+), 554 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/853114
[2] http://lkml.org/lkml/2009/6/17/14


2009-06-24 13:30:57

by Tejun Heo

Subject: [PATCH 02/10] percpu: drop @unit_size from embed first chunk allocator

The only extra feature @unit_size provides is creating dead space at
the end of the first chunk, which has no valid use case. Drop the
parameter. This also increases consistency with the generalized 4k
allocator.

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 2 +-
include/linux/percpu.h | 2 +-
mm/percpu.c | 16 +++++-----------
3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 29a3eef..1472820 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -342,7 +342,7 @@ static ssize_t __init setup_pcpu_embed(size_t static_size, bool chosen)
return -EINVAL;

return pcpu_embed_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
- reserve - PERCPU_FIRST_CHUNK_RESERVE, -1);
+ reserve - PERCPU_FIRST_CHUNK_RESERVE);
}

/*
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index e500034..83bff05 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -69,7 +69,7 @@ extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,

extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, ssize_t unit_size);
+ ssize_t dyn_size);

/*
* Use this to get to a cpu's version of the per-cpu object
diff --git a/mm/percpu.c b/mm/percpu.c
index 19dd83b..fe34b6b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1207,7 +1207,6 @@ static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
- * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE, -1 for auto
*
* This is a helper to ease setting up embedded first percpu chunk and
* can be called where pcpu_setup_first_chunk() is expected.
@@ -1219,9 +1218,9 @@ static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
* page size.
*
* When @dyn_size is positive, dynamic area might be larger than
- * specified to fill page alignment. Also, when @dyn_size is auto,
- * @dyn_size does not fill the whole first chunk but only what's
- * necessary for page alignment after static and reserved areas.
+ * specified to fill page alignment. When @dyn_size is auto,
+ * @dyn_size is just big enough to fill page alignment after static
+ * and reserved areas.
*
* If the needed size is smaller than the minimum or specified unit
* size, the leftover is returned to the bootmem allocator.
@@ -1231,7 +1230,7 @@ static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
* percpu access on success, -errno on failure.
*/
ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
- ssize_t dyn_size, ssize_t unit_size)
+ ssize_t dyn_size)
{
size_t chunk_size;
unsigned int cpu;
@@ -1242,12 +1241,7 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
if (dyn_size != 0)
dyn_size = pcpue_size - static_size - reserved_size;

- if (unit_size >= 0) {
- BUG_ON(unit_size < pcpue_size);
- pcpue_unit_size = unit_size;
- } else
- pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
-
+ pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
chunk_size = pcpue_unit_size * num_possible_cpus();

pcpue_ptr = __alloc_bootmem_nopanic(chunk_size, PAGE_SIZE,
--
1.6.0.2

2009-06-24 13:31:37

by Tejun Heo

Subject: [PATCH 09/10] percpu: allow non-linear / sparse cpu -> unit mapping

Currently cpu and unit are always identity mapped. To allow more
efficient large page support on NUMA and lazy allocation for possible
but offline cpus, cpu -> unit mapping needs to be non-linear and/or
sparse. This can be easily implemented by adding a cpu -> unit
mapping array and using it whenever looking up the matching unit for a
cpu.

The only unusual conversion is in pcpu_chunk_addr_search(). The
passed-in address is unit0 based and unit0 might not be in use, so it
needs to be converted to the address of an in-use unit. This is
easily done by adding the unit offset for the current processor.
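
Just as an illustration - neither x86 nor sparc64 passes a map in
this patch, both still pass NULL - an arch wanting a non-identity
mapping could call pcpu_setup_first_chunk() roughly like below from
its setup_per_cpu_areas(); arch_cpu_to_unit() is a made-up helper and
static_size, dyn_size, unit_size and base_addr are assumed to have
been computed by the caller. The map only has to assign a unique,
non-negative unit to every possible cpu and may be sparse.

	static int unit_map[NR_CPUS] __initdata;
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		unit_map[cpu] = arch_cpu_to_unit(cpu);	/* made-up helper */

	pcpu_unit_size = pcpu_setup_first_chunk(static_size,
						PERCPU_MODULE_RESERVE,
						dyn_size, unit_size,
						base_addr, unit_map);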

[ Impact: allows non-linear/sparse cpu -> unit mapping, no visible change yet ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
---
arch/sparc/kernel/smp_64.c | 2 +-
include/linux/percpu.h | 3 +-
mm/percpu.c | 129 ++++++++++++++++++++++++++++++++------------
3 files changed, 97 insertions(+), 37 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index f2f22ee..6970333 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1516,7 +1516,7 @@ void __init setup_per_cpu_areas(void)

pcpu_unit_size = pcpu_setup_first_chunk(static_size,
PERCPU_MODULE_RESERVE, dyn_size,
- PCPU_CHUNK_SIZE, vm.addr);
+ PCPU_CHUNK_SIZE, vm.addr, NULL);

free_bootmem(__pa(ptrs), ptrs_size);

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 63c8b7a..1e0e887 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -57,6 +57,7 @@
#endif

extern void *pcpu_base_addr;
+extern const int *pcpu_unit_map;

typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
@@ -66,7 +67,7 @@ typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);
extern size_t __init pcpu_setup_first_chunk(
size_t static_size, size_t reserved_size,
ssize_t dyn_size, size_t unit_size,
- void *base_addr);
+ void *base_addr, const int *unit_map);

extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
diff --git a/mm/percpu.c b/mm/percpu.c
index 5ee712e..f0fce38 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -8,12 +8,13 @@
*
* This is percpu allocator which can handle both static and dynamic
* areas. Percpu areas are allocated in chunks in vmalloc area. Each
- * chunk is consisted of num_possible_cpus() units and the first chunk
- * is used for static percpu variables in the kernel image (special
- * boot time alloc/init handling necessary as these areas need to be
- * brought up before allocation services are running). Unit grows as
- * necessary and all units grow or shrink in unison. When a chunk is
- * filled up, another chunk is allocated. ie. in vmalloc area
+ * chunk is consisted of boot-time determined number of units and the
+ * first chunk is used for static percpu variables in the kernel image
+ * (special boot time alloc/init handling necessary as these areas
+ * need to be brought up before allocation services are running).
+ * Unit grows as necessary and all units grow or shrink in unison.
+ * When a chunk is filled up, another chunk is allocated. ie. in
+ * vmalloc area
*
* c0 c1 c2
* ------------------- ------------------- ------------
@@ -22,11 +23,13 @@
*
* Allocation is done in offset-size areas of single unit space. Ie,
* an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
- * c1:u1, c1:u2 and c1:u3. Percpu access can be done by configuring
- * percpu base registers pcpu_unit_size apart.
+ * c1:u1, c1:u2 and c1:u3. On UMA, units corresponds directly to
+ * cpus. On NUMA, the mapping can be non-linear and even sparse.
+ * Percpu access can be done by configuring percpu base registers
+ * according to cpu to unit mapping and pcpu_unit_size.
*
- * There are usually many small percpu allocations many of them as
- * small as 4 bytes. The allocator organizes chunks into lists
+ * There are usually many small percpu allocations many of them being
+ * as small as 4 bytes. The allocator organizes chunks into lists
* according to free size and tries to allocate from the fullest one.
* Each chunk keeps the maximum contiguous area size hint which is
* guaranteed to be eqaul to or larger than the maximum contiguous
@@ -99,14 +102,22 @@ struct pcpu_chunk {

static int pcpu_unit_pages __read_mostly;
static int pcpu_unit_size __read_mostly;
+static int pcpu_nr_units __read_mostly;
static int pcpu_chunk_size __read_mostly;
static int pcpu_nr_slots __read_mostly;
static size_t pcpu_chunk_struct_size __read_mostly;

+/* cpus with the lowest and highest unit numbers */
+static unsigned int pcpu_first_unit_cpu __read_mostly;
+static unsigned int pcpu_last_unit_cpu __read_mostly;
+
/* the address of the first chunk which starts with the kernel static area */
void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

+/* cpu -> unit map */
+const int *pcpu_unit_map __read_mostly;
+
/*
* The first chunk which always exists. Note that unlike other
* chunks, this one can be allocated and mapped in several different
@@ -177,7 +188,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)

static int pcpu_page_idx(unsigned int cpu, int page_idx)
{
- return cpu * pcpu_unit_pages + page_idx;
+ return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx;
}

static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
@@ -321,6 +332,14 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
return pcpu_first_chunk;
}

+ /*
+ * The address is relative to unit0 which might be unused and
+ * thus unmapped. Offset the address to the unit space of the
+ * current processor before looking it up in the vmalloc
+ * space. Note that any possible cpu id can be used here, so
+ * there's no need to worry about preemption or cpu hotplug.
+ */
+ addr += pcpu_unit_map[smp_processor_id()] * pcpu_unit_size;
return pcpu_get_page_chunk(vmalloc_to_page(addr));
}

@@ -593,8 +612,7 @@ static struct page **pcpu_get_pages_and_bitmap(struct pcpu_chunk *chunk,
{
static struct page **pages;
static unsigned long *bitmap;
- size_t pages_size = num_possible_cpus() * pcpu_unit_pages *
- sizeof(pages[0]);
+ size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
size_t bitmap_size = BITS_TO_LONGS(pcpu_unit_pages) *
sizeof(unsigned long);

@@ -692,10 +710,9 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
- unsigned int last = num_possible_cpus() - 1;
-
- flush_cache_vunmap(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
+ flush_cache_vunmap(
+ pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),
+ pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));
}

static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
@@ -756,10 +773,9 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
- unsigned int last = num_possible_cpus() - 1;
-
- flush_tlb_kernel_range(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
+ flush_tlb_kernel_range(
+ pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),
+ pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));
}

static int __pcpu_map_pages(unsigned long addr, struct page **pages,
@@ -835,11 +851,9 @@ err:
static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
int page_start, int page_end)
{
- unsigned int last = num_possible_cpus() - 1;
-
- /* flush at once, please read comments in pcpu_unmap() */
- flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
+ flush_cache_vmap(
+ pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),
+ pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));
}

/**
@@ -953,8 +967,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
bitmap_copy(chunk->populated, populated, pcpu_unit_pages);
clear:
for_each_possible_cpu(cpu)
- memset(chunk->vm->addr + cpu * pcpu_unit_size + off, 0,
- size);
+ memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
return 0;

err_unmap:
@@ -1088,6 +1101,7 @@ area_found:

mutex_unlock(&pcpu_alloc_mutex);

+ /* return address relative to unit0 */
return __addr_to_pcpu_ptr(chunk->vm->addr + off);

fail_unlock:
@@ -1222,6 +1236,7 @@ EXPORT_SYMBOL_GPL(free_percpu);
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
* @unit_size: unit size in bytes, must be multiple of PAGE_SIZE
* @base_addr: mapped address
+ * @unit_map: cpu -> unit map, NULL for sequential mapping
*
* Initialize the first percpu chunk which contains the kernel static
* perpcu area. This function is to be called from arch percpu area
@@ -1260,16 +1275,17 @@ EXPORT_SYMBOL_GPL(free_percpu);
*/
size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
ssize_t dyn_size, size_t unit_size,
- void *base_addr)
+ void *base_addr, const int *unit_map)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
size_t size_sum = static_size + reserved_size +
(dyn_size >= 0 ? dyn_size : 0);
struct pcpu_chunk *schunk, *dchunk = NULL;
+ unsigned int cpu, tcpu;
int i;

- /* santiy checks */
+ /* sanity checks */
BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
BUG_ON(!static_size);
@@ -1278,9 +1294,52 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
BUG_ON(unit_size & ~PAGE_MASK);
BUG_ON(unit_size < PCPU_MIN_UNIT_SIZE);

+ /* determine number of units and verify and initialize pcpu_unit_map */
+ if (unit_map) {
+ int first_unit = INT_MAX, last_unit = INT_MIN;
+
+ for_each_possible_cpu(cpu) {
+ int unit = unit_map[cpu];
+
+ BUG_ON(unit < 0);
+ for_each_possible_cpu(tcpu) {
+ if (tcpu == cpu)
+ break;
+ /* the mapping should be one-to-one */
+ BUG_ON(unit_map[tcpu] == unit);
+ }
+
+ if (unit < first_unit) {
+ pcpu_first_unit_cpu = cpu;
+ first_unit = unit;
+ }
+ if (unit > last_unit) {
+ pcpu_last_unit_cpu = cpu;
+ last_unit = unit;
+ }
+ }
+ pcpu_nr_units = last_unit + 1;
+ pcpu_unit_map = unit_map;
+ } else {
+ int *identity_map;
+
+ /* #units == #cpus, identity mapped */
+ identity_map = alloc_bootmem(num_possible_cpus() *
+ sizeof(identity_map[0]));
+
+ for_each_possible_cpu(cpu)
+ identity_map[cpu] = cpu;
+
+ pcpu_first_unit_cpu = 0;
+ pcpu_nr_units = num_possible_cpus();
+ pcpu_last_unit_cpu = pcpu_nr_units - 1;
+ pcpu_unit_map = identity_map;
+ }
+
+ /* determine basic parameters */
pcpu_unit_pages = unit_size >> PAGE_SHIFT;
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
- pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
+ pcpu_chunk_size = pcpu_nr_units * pcpu_unit_size;
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

@@ -1349,7 +1408,7 @@ size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
pcpu_chunk_relocate(pcpu_first_chunk, -1);

/* we're done */
- pcpu_base_addr = (void *)pcpu_chunk_addr(schunk, 0, 0);
+ pcpu_base_addr = schunk->vm->addr;
return pcpu_unit_size;
}

@@ -1427,7 +1486,7 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
size_sum >> PAGE_SHIFT, base, static_size);

return pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- unit_size, base);
+ unit_size, base, NULL);
}

/**
@@ -1519,7 +1578,7 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
unit_pages, static_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, -1,
- unit_pages << PAGE_SHIFT, vm.addr);
+ unit_pages << PAGE_SHIFT, vm.addr, NULL);
goto out_free_ar;

enomem:
@@ -1641,7 +1700,7 @@ ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
"%zu bytes\n", pcpul_vm.addr, static_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- pcpul_unit_size, pcpul_vm.addr);
+ pcpul_unit_size, pcpul_vm.addr, NULL);

/* sort pcpul_map array for pcpu_lpage_remapped() */
for (i = 0; i < num_possible_cpus() - 1; i++)
--
1.6.0.2

2009-06-24 13:31:47

by Tejun Heo

Subject: [PATCH 01/10] x86: make pcpu_chunk_addr_search() matching stricter

The @addr passed into pcpu_chunk_addr_search() is a unit0-based
address and thus should be matched inside the unit0 area. Currently,
it uses the chunk size when determining whether the address falls in
the first chunk. Addresses in unitN where N > 0 shouldn't be passed
in anyway, so this doesn't cause any malfunction, but fix it for
consistency.

[ Impact: mostly cleanup ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
mm/percpu.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index b149845..19dd83b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -290,7 +290,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
void *first_start = pcpu_first_chunk->vm->addr;

/* is it in the first chunk? */
- if (addr >= first_start && addr < first_start + pcpu_chunk_size) {
+ if (addr >= first_start && addr < first_start + pcpu_unit_size) {
/* is it in the reserved area? */
if (addr < first_start + pcpu_reserved_chunk_limit)
return pcpu_reserved_chunk;
--
1.6.0.2

2009-06-24 13:32:04

by Tejun Heo

Subject: [PATCH 05/10] x86,percpu: generalize lpage first chunk allocator

Generalize and move x86 setup_pcpu_lpage() into
pcpu_lpage_first_chunk(). setup_pcpu_lpage() now is a simple wrapper
around the generalized version. Other than taking size parameters and
using arch supplied callbacks to allocate/free/map memory,
pcpu_lpage_first_chunk() is identical to the original implementation.

This simplifies arch code and will help convert more archs to the
dynamic percpu allocator.

While at it, factor out pcpu_calc_fc_sizes() which is common to
pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/percpu.h | 9 --
arch/x86/kernel/setup_percpu.c | 169 ++------------------------------
arch/x86/mm/pageattr.c | 1 +
include/linux/percpu.h | 27 +++++
mm/percpu.c | 209 +++++++++++++++++++++++++++++++++++++++-
5 files changed, 244 insertions(+), 171 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 103f1dd..a18c038 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -156,15 +156,6 @@ do { \
/* We can use this directly for local CPU (faster). */
DECLARE_PER_CPU(unsigned long, this_cpu_off);

-#ifdef CONFIG_NEED_MULTIPLE_NODES
-void *pcpu_lpage_remapped(void *kaddr);
-#else
-static inline void *pcpu_lpage_remapped(void *kaddr)
-{
- return NULL;
-}
-#endif
-
#endif /* !__ASSEMBLY__ */

#ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index ab896b3..4f2e0ac 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -137,44 +137,21 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
}

/*
- * Large page remap allocator
- *
- * This allocator uses PMD page as unit. A PMD page is allocated for
- * each cpu and each is remapped into vmalloc area using PMD mapping.
- * As PMD page is quite large, only part of it is used for the first
- * chunk. Unused part is returned to the bootmem allocator.
- *
- * So, the PMD pages are mapped twice - once to the physical mapping
- * and to the vmalloc area for the first percpu chunk. The double
- * mapping does add one more PMD TLB entry pressure but still is much
- * better than only using 4k mappings while still being NUMA friendly.
+ * Large page remapping allocator
*/
#ifdef CONFIG_NEED_MULTIPLE_NODES
-struct pcpul_ent {
- unsigned int cpu;
- void *ptr;
-};
-
-static size_t pcpul_size;
-static struct pcpul_ent *pcpul_map;
-static struct vm_struct pcpul_vm;
-
-static struct page * __init pcpul_get_page(unsigned int cpu, int pageno)
+static void __init pcpul_map(void *ptr, size_t size, void *addr)
{
- size_t off = (size_t)pageno << PAGE_SHIFT;
+ pmd_t *pmd, pmd_v;

- if (off >= pcpul_size)
- return NULL;
-
- return virt_to_page(pcpul_map[cpu].ptr + off);
+ pmd = populate_extra_pmd((unsigned long)addr);
+ pmd_v = pfn_pmd(page_to_pfn(virt_to_page(ptr)), PAGE_KERNEL_LARGE);
+ set_pmd(pmd, pmd_v);
}

static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
{
- size_t map_size, dyn_size;
- unsigned int cpu;
- int i, j;
- ssize_t ret;
+ size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;

if (!chosen) {
size_t vm_size = VMALLOC_END - VMALLOC_START;
@@ -198,134 +175,10 @@ static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
return -EINVAL;
}

- /*
- * Currently supports only single page. Supporting multiple
- * pages won't be too difficult if it ever becomes necessary.
- */
- pcpul_size = PFN_ALIGN(static_size + PERCPU_MODULE_RESERVE +
- PERCPU_DYNAMIC_RESERVE);
- if (pcpul_size > PMD_SIZE) {
- pr_warning("PERCPU: static data is larger than large page, "
- "can't use large page\n");
- return -EINVAL;
- }
- dyn_size = pcpul_size - static_size - PERCPU_FIRST_CHUNK_RESERVE;
-
- /* allocate pointer array and alloc large pages */
- map_size = PFN_ALIGN(num_possible_cpus() * sizeof(pcpul_map[0]));
- pcpul_map = alloc_bootmem(map_size);
-
- for_each_possible_cpu(cpu) {
- pcpul_map[cpu].cpu = cpu;
- pcpul_map[cpu].ptr = pcpu_alloc_bootmem(cpu, PMD_SIZE,
- PMD_SIZE);
- if (!pcpul_map[cpu].ptr) {
- pr_warning("PERCPU: failed to allocate large page "
- "for cpu%u\n", cpu);
- goto enomem;
- }
-
- /*
- * Only use pcpul_size bytes and give back the rest.
- *
- * Ingo: The 2MB up-rounding bootmem is needed to make
- * sure the partial 2MB page is still fully RAM - it's
- * not well-specified to have a PAT-incompatible area
- * (unmapped RAM, device memory, etc.) in that hole.
- */
- free_bootmem(__pa(pcpul_map[cpu].ptr + pcpul_size),
- PMD_SIZE - pcpul_size);
-
- memcpy(pcpul_map[cpu].ptr, __per_cpu_load, static_size);
- }
-
- /* allocate address and map */
- pcpul_vm.flags = VM_ALLOC;
- pcpul_vm.size = num_possible_cpus() * PMD_SIZE;
- vm_area_register_early(&pcpul_vm, PMD_SIZE);
-
- for_each_possible_cpu(cpu) {
- pmd_t *pmd, pmd_v;
-
- pmd = populate_extra_pmd((unsigned long)pcpul_vm.addr +
- cpu * PMD_SIZE);
- pmd_v = pfn_pmd(page_to_pfn(virt_to_page(pcpul_map[cpu].ptr)),
- PAGE_KERNEL_LARGE);
- set_pmd(pmd, pmd_v);
- }
-
- /* we're ready, commit */
- pr_info("PERCPU: Remapped at %p with large pages, static data "
- "%zu bytes\n", pcpul_vm.addr, static_size);
-
- ret = pcpu_setup_first_chunk(pcpul_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
- PMD_SIZE, pcpul_vm.addr, NULL);
-
- /* sort pcpul_map array for pcpu_lpage_remapped() */
- for (i = 0; i < num_possible_cpus() - 1; i++)
- for (j = i + 1; j < num_possible_cpus(); j++)
- if (pcpul_map[i].ptr > pcpul_map[j].ptr) {
- struct pcpul_ent tmp = pcpul_map[i];
- pcpul_map[i] = pcpul_map[j];
- pcpul_map[j] = tmp;
- }
-
- return ret;
-
-enomem:
- for_each_possible_cpu(cpu)
- if (pcpul_map[cpu].ptr)
- free_bootmem(__pa(pcpul_map[cpu].ptr), pcpul_size);
- free_bootmem(__pa(pcpul_map), map_size);
- return -ENOMEM;
-}
-
-/**
- * pcpu_lpage_remapped - determine whether a kaddr is in pcpul recycled area
- * @kaddr: the kernel address in question
- *
- * Determine whether @kaddr falls in the pcpul recycled area. This is
- * used by pageattr to detect VM aliases and break up the pcpu PMD
- * mapping such that the same physical page is not mapped under
- * different attributes.
- *
- * The recycled area is always at the tail of a partially used PMD
- * page.
- *
- * RETURNS:
- * Address of corresponding remapped pcpu address if match is found;
- * otherwise, NULL.
- */
-void *pcpu_lpage_remapped(void *kaddr)
-{
- void *pmd_addr = (void *)((unsigned long)kaddr & PMD_MASK);
- unsigned long offset = (unsigned long)kaddr & ~PMD_MASK;
- int left = 0, right = num_possible_cpus() - 1;
- int pos;
-
- /* pcpul in use at all? */
- if (!pcpul_map)
- return NULL;
-
- /* okay, perform binary search */
- while (left <= right) {
- pos = (left + right) / 2;
-
- if (pcpul_map[pos].ptr < pmd_addr)
- left = pos + 1;
- else if (pcpul_map[pos].ptr > pmd_addr)
- right = pos - 1;
- else {
- /* it shouldn't be in the area for the first chunk */
- WARN_ON(offset < pcpul_size);
-
- return pcpul_vm.addr +
- pcpul_map[pos].cpu * PMD_SIZE + offset;
- }
- }
-
- return NULL;
+ return pcpu_lpage_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ reserve - PERCPU_FIRST_CHUNK_RESERVE,
+ PMD_SIZE,
+ pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
}
#else
static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 1b734d7..c106f78 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -12,6 +12,7 @@
#include <linux/seq_file.h>
#include <linux/debugfs.h>
#include <linux/pfn.h>
+#include <linux/percpu.h>

#include <asm/e820.h>
#include <asm/processor.h>
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 41b5bfa..9f6bfd7 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -62,6 +62,7 @@ typedef struct page * (*pcpu_get_page_fn_t)(unsigned int cpu, int pageno);
typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
+typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
@@ -79,6 +80,32 @@ extern ssize_t __init pcpu_4k_first_chunk(
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn);

+#ifdef CONFIG_NEED_MULTIPLE_NODES
+extern ssize_t __init pcpu_lpage_first_chunk(
+ size_t static_size, size_t reserved_size,
+ ssize_t dyn_size, size_t lpage_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_map_fn_t map_fn);
+
+extern void *pcpu_lpage_remapped(void *kaddr);
+#else
+static inline ssize_t __init pcpu_lpage_first_chunk(
+ size_t static_size, size_t reserved_size,
+ ssize_t dyn_size, size_t lpage_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_map_fn_t map_fn)
+{
+ return -EINVAL;
+}
+
+static inline void *pcpu_lpage_remapped(void *kaddr)
+{
+ return NULL;
+}
+#endif
+
/*
* Use this to get to a cpu's version of the per-cpu object
* dynamically allocated. Non-atomic access to the current CPU's
diff --git a/mm/percpu.c b/mm/percpu.c
index c173763..17dfb7c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1190,6 +1190,19 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
return pcpu_unit_size;
}

+static size_t pcpu_calc_fc_sizes(size_t static_size, size_t reserved_size,
+ ssize_t *dyn_sizep)
+{
+ size_t size_sum;
+
+ size_sum = PFN_ALIGN(static_size + reserved_size +
+ (*dyn_sizep >= 0 ? *dyn_sizep : 0));
+ if (*dyn_sizep != 0)
+ *dyn_sizep = size_sum - static_size - reserved_size;
+
+ return size_sum;
+}
+
/*
* Embedding first chunk setup helper.
*/
@@ -1241,10 +1254,7 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
unsigned int cpu;

/* determine parameters and allocate */
- pcpue_size = PFN_ALIGN(static_size + reserved_size +
- (dyn_size >= 0 ? dyn_size : 0));
- if (dyn_size != 0)
- dyn_size = pcpue_size - static_size - reserved_size;
+ pcpue_size = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);

pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
chunk_size = pcpue_unit_size * num_possible_cpus();
@@ -1391,6 +1401,197 @@ out_free_ar:
}

/*
+ * Large page remapping first chunk setup helper
+ */
+#ifdef CONFIG_NEED_MULTIPLE_NODES
+struct pcpul_ent {
+ unsigned int cpu;
+ void *ptr;
+};
+
+static size_t pcpul_size;
+static size_t pcpul_unit_size;
+static struct pcpul_ent *pcpul_map;
+static struct vm_struct pcpul_vm;
+
+static struct page * __init pcpul_get_page(unsigned int cpu, int pageno)
+{
+ size_t off = (size_t)pageno << PAGE_SHIFT;
+
+ if (off >= pcpul_size)
+ return NULL;
+
+ return virt_to_page(pcpul_map[cpu].ptr + off);
+}
+
+/**
+ * pcpu_lpage_first_chunk - remap the first percpu chunk using large page
+ * @static_size: the size of static percpu area in bytes
+ * @reserved_size: the size of reserved percpu area in bytes
+ * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @lpage_size: the size of a large page
+ * @alloc_fn: function to allocate percpu lpage, always called with lpage_size
+ * @free_fn: function to free percpu memory, @size <= lpage_size
+ * @map_fn: function to map percpu lpage, always called with lpage_size
+ *
+ * This allocator uses large page as unit. A large page is allocated
+ * for each cpu and each is remapped into vmalloc area using large
+ * page mapping. As large page can be quite large, only part of it is
+ * used for the first chunk. Unused part is returned to the bootmem
+ * allocator.
+ *
+ * So, the large pages are mapped twice - once to the physical mapping
+ * and to the vmalloc area for the first percpu chunk. The double
+ * mapping does add one more large TLB entry pressure but still is
+ * much better than only using 4k mappings while still being NUMA
+ * friendly.
+ *
+ * RETURNS:
+ * The determined pcpu_unit_size which can be used to initialize
+ * percpu access on success, -errno on failure.
+ */
+ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
+ ssize_t dyn_size, size_t lpage_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_map_fn_t map_fn)
+{
+ size_t size_sum;
+ size_t map_size;
+ unsigned int cpu;
+ int i, j;
+ ssize_t ret;
+
+ /*
+ * Currently supports only single page. Supporting multiple
+ * pages won't be too difficult if it ever becomes necessary.
+ */
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
+
+ pcpul_unit_size = lpage_size;
+ pcpul_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+ if (pcpul_size > pcpul_unit_size) {
+ pr_warning("PERCPU: static data is larger than large page, "
+ "can't use large page\n");
+ return -EINVAL;
+ }
+
+ /* allocate pointer array and alloc large pages */
+ map_size = PFN_ALIGN(num_possible_cpus() * sizeof(pcpul_map[0]));
+ pcpul_map = alloc_bootmem(map_size);
+
+ for_each_possible_cpu(cpu) {
+ void *ptr;
+
+ ptr = alloc_fn(cpu, lpage_size);
+ if (!ptr) {
+ pr_warning("PERCPU: failed to allocate large page "
+ "for cpu%u\n", cpu);
+ goto enomem;
+ }
+
+ /*
+ * Only use pcpul_size bytes and give back the rest.
+ *
+ * Ingo: The lpage_size up-rounding bootmem is needed
+ * to make sure the partial lpage is still fully RAM -
+ * it's not well-specified to have a incompatible area
+ * (unmapped RAM, device memory, etc.) in that hole.
+ */
+ free_fn(ptr + pcpul_size, lpage_size - pcpul_size);
+
+ pcpul_map[cpu].cpu = cpu;
+ pcpul_map[cpu].ptr = ptr;
+
+ memcpy(ptr, __per_cpu_load, static_size);
+ }
+
+ /* allocate address and map */
+ pcpul_vm.flags = VM_ALLOC;
+ pcpul_vm.size = num_possible_cpus() * pcpul_unit_size;
+ vm_area_register_early(&pcpul_vm, pcpul_unit_size);
+
+ for_each_possible_cpu(cpu)
+ map_fn(pcpul_map[cpu].ptr, pcpul_unit_size,
+ pcpul_vm.addr + cpu * pcpul_unit_size);
+
+ /* we're ready, commit */
+ pr_info("PERCPU: Remapped at %p with large pages, static data "
+ "%zu bytes\n", pcpul_vm.addr, static_size);
+
+ ret = pcpu_setup_first_chunk(pcpul_get_page, static_size,
+ reserved_size, dyn_size, pcpul_unit_size,
+ pcpul_vm.addr, NULL);
+
+ /* sort pcpul_map array for pcpu_lpage_remapped() */
+ for (i = 0; i < num_possible_cpus() - 1; i++)
+ for (j = i + 1; j < num_possible_cpus(); j++)
+ if (pcpul_map[i].ptr > pcpul_map[j].ptr) {
+ struct pcpul_ent tmp = pcpul_map[i];
+ pcpul_map[i] = pcpul_map[j];
+ pcpul_map[j] = tmp;
+ }
+
+ return ret;
+
+enomem:
+ for_each_possible_cpu(cpu)
+ if (pcpul_map[cpu].ptr)
+ free_fn(pcpul_map[cpu].ptr, pcpul_size);
+ free_bootmem(__pa(pcpul_map), map_size);
+ return -ENOMEM;
+}
+
+/**
+ * pcpu_lpage_remapped - determine whether a kaddr is in pcpul recycled area
+ * @kaddr: the kernel address in question
+ *
+ * Determine whether @kaddr falls in the pcpul recycled area. This is
+ * used by pageattr to detect VM aliases and break up the pcpu large
+ * page mapping such that the same physical page is not mapped under
+ * different attributes.
+ *
+ * The recycled area is always at the tail of a partially used large
+ * page.
+ *
+ * RETURNS:
+ * Address of corresponding remapped pcpu address if match is found;
+ * otherwise, NULL.
+ */
+void *pcpu_lpage_remapped(void *kaddr)
+{
+ unsigned long unit_mask = pcpul_unit_size - 1;
+ void *lpage_addr = (void *)((unsigned long)kaddr & ~unit_mask);
+ unsigned long offset = (unsigned long)kaddr & unit_mask;
+ int left = 0, right = num_possible_cpus() - 1;
+ int pos;
+
+ /* pcpul in use at all? */
+ if (!pcpul_map)
+ return NULL;
+
+ /* okay, perform binary search */
+ while (left <= right) {
+ pos = (left + right) / 2;
+
+ if (pcpul_map[pos].ptr < lpage_addr)
+ left = pos + 1;
+ else if (pcpul_map[pos].ptr > lpage_addr)
+ right = pos - 1;
+ else {
+ /* it shouldn't be in the area for the first chunk */
+ WARN_ON(offset < pcpul_size);
+
+ return pcpul_vm.addr +
+ pcpul_map[pos].cpu * pcpul_unit_size + offset;
+ }
+ }
+
+ return NULL;
+}
+#endif
+
+/*
* Generic percpu area setup.
*
* The embedding helper is used because its behavior closely resembles
--
1.6.0.2

2009-06-24 13:32:58

by Tejun Heo

Subject: [PATCH 03/10] x86,percpu: generalize 4k first chunk allocator

Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
setup_pcpu_4k() now is a simple wrapper around the generalized
version. Other than taking size parameters and using arch supplied
callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
to the original implementation.

This simplifies arch code and will help convert more archs to the
dynamic percpu allocator.

While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
consistency.

[ Impact: code reorganization and generalization ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 78 +++++++++---------------------------
include/linux/percpu.h | 12 +++++-
mm/percpu.c | 85 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 113 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 1472820..ab896b3 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -124,6 +124,19 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
}

/*
+ * Helpers for first chunk memory allocation
+ */
+static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size)
+{
+ return pcpu_alloc_bootmem(cpu, size, size);
+}
+
+static void __init pcpu_fc_free(void *ptr, size_t size)
+{
+ free_bootmem(__pa(ptr), size);
+}
+
+/*
* Large page remap allocator
*
* This allocator uses PMD page as unit. A PMD page is allocated for
@@ -346,22 +359,11 @@ static ssize_t __init setup_pcpu_embed(size_t static_size, bool chosen)
}

/*
- * 4k page allocator
+ * 4k allocator
*
- * This is the basic allocator. Static percpu area is allocated
- * page-by-page and most of initialization is done by the generic
- * setup function.
+ * Boring fallback 4k allocator. This allocator puts more pressure on
+ * PTE TLBs but other than that behaves nicely on both UMA and NUMA.
*/
-static struct page **pcpu4k_pages __initdata;
-static int pcpu4k_nr_static_pages __initdata;
-
-static struct page * __init pcpu4k_get_page(unsigned int cpu, int pageno)
-{
- if (pageno < pcpu4k_nr_static_pages)
- return pcpu4k_pages[cpu * pcpu4k_nr_static_pages + pageno];
- return NULL;
-}
-
static void __init pcpu4k_populate_pte(unsigned long addr)
{
populate_extra_pte(addr);
@@ -369,51 +371,9 @@ static void __init pcpu4k_populate_pte(unsigned long addr)

static ssize_t __init setup_pcpu_4k(size_t static_size)
{
- size_t pages_size;
- unsigned int cpu;
- int i, j;
- ssize_t ret;
-
- pcpu4k_nr_static_pages = PFN_UP(static_size);
-
- /* unaligned allocations can't be freed, round up to page size */
- pages_size = PFN_ALIGN(pcpu4k_nr_static_pages * num_possible_cpus()
- * sizeof(pcpu4k_pages[0]));
- pcpu4k_pages = alloc_bootmem(pages_size);
-
- /* allocate and copy */
- j = 0;
- for_each_possible_cpu(cpu)
- for (i = 0; i < pcpu4k_nr_static_pages; i++) {
- void *ptr;
-
- ptr = pcpu_alloc_bootmem(cpu, PAGE_SIZE, PAGE_SIZE);
- if (!ptr) {
- pr_warning("PERCPU: failed to allocate "
- "4k page for cpu%u\n", cpu);
- goto enomem;
- }
-
- memcpy(ptr, __per_cpu_load + i * PAGE_SIZE, PAGE_SIZE);
- pcpu4k_pages[j++] = virt_to_page(ptr);
- }
-
- /* we're ready, commit */
- pr_info("PERCPU: Allocated %d 4k pages, static data %zu bytes\n",
- pcpu4k_nr_static_pages, static_size);
-
- ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE, -1,
- -1, NULL, pcpu4k_populate_pte);
- goto out_free_ar;
-
-enomem:
- while (--j >= 0)
- free_bootmem(__pa(page_address(pcpu4k_pages[j])), PAGE_SIZE);
- ret = -ENOMEM;
-out_free_ar:
- free_bootmem(__pa(pcpu4k_pages), pages_size);
- return ret;
+ return pcpu_4k_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ pcpu_fc_alloc, pcpu_fc_free,
+ pcpu4k_populate_pte);
}

/* for explicit first chunk allocator selection */
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 83bff05..41b5bfa 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -59,18 +59,26 @@
extern void *pcpu_base_addr;

typedef struct page * (*pcpu_get_page_fn_t)(unsigned int cpu, int pageno);
-typedef void (*pcpu_populate_pte_fn_t)(unsigned long addr);
+typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
+typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
+typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);

extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
ssize_t dyn_size, ssize_t unit_size,
void *base_addr,
- pcpu_populate_pte_fn_t populate_pte_fn);
+ pcpu_fc_populate_pte_fn_t populate_pte_fn);

extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
ssize_t dyn_size);

+extern ssize_t __init pcpu_4k_first_chunk(
+ size_t static_size, size_t reserved_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_populate_pte_fn_t populate_pte_fn);
+
/*
* Use this to get to a cpu's version of the per-cpu object
* dynamically allocated. Non-atomic access to the current CPU's
diff --git a/mm/percpu.c b/mm/percpu.c
index fe34b6b..39f4022 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1037,7 +1037,7 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
ssize_t dyn_size, ssize_t unit_size,
void *base_addr,
- pcpu_populate_pte_fn_t populate_pte_fn)
+ pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
@@ -1271,6 +1271,89 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
}

/*
+ * 4k page first chunk setup helper.
+ */
+static struct page **pcpu4k_pages __initdata;
+static int pcpu4k_nr_static_pages __initdata;
+
+static struct page * __init pcpu4k_get_page(unsigned int cpu, int pageno)
+{
+ if (pageno < pcpu4k_nr_static_pages)
+ return pcpu4k_pages[cpu * pcpu4k_nr_static_pages + pageno];
+ return NULL;
+}
+
+/**
+ * pcpu_4k_first_chunk - map the first chunk using PAGE_SIZE pages
+ * @static_size: the size of static percpu area in bytes
+ * @reserved_size: the size of reserved percpu area in bytes
+ * @alloc_fn: function to allocate percpu page, always called with PAGE_SIZE
+ * @free_fn: funtion to free percpu page, always called with PAGE_SIZE
+ * @populate_pte_fn: function to populate pte
+ *
+ * This is a helper to ease setting up embedded first percpu chunk and
+ * can be called where pcpu_setup_first_chunk() is expected.
+ *
+ * This is the basic allocator. Static percpu area is allocated
+ * page-by-page into vmalloc area.
+ *
+ * RETURNS:
+ * The determined pcpu_unit_size which can be used to initialize
+ * percpu access on success, -errno on failure.
+ */
+ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
+ pcpu_fc_alloc_fn_t alloc_fn,
+ pcpu_fc_free_fn_t free_fn,
+ pcpu_fc_populate_pte_fn_t populate_pte_fn)
+{
+ size_t pages_size;
+ unsigned int cpu;
+ int i, j;
+ ssize_t ret;
+
+ pcpu4k_nr_static_pages = PFN_UP(static_size);
+
+ /* unaligned allocations can't be freed, round up to page size */
+ pages_size = PFN_ALIGN(pcpu4k_nr_static_pages * num_possible_cpus() *
+ sizeof(pcpu4k_pages[0]));
+ pcpu4k_pages = alloc_bootmem(pages_size);
+
+ /* allocate and copy */
+ j = 0;
+ for_each_possible_cpu(cpu)
+ for (i = 0; i < pcpu4k_nr_static_pages; i++) {
+ void *ptr;
+
+ ptr = alloc_fn(cpu, PAGE_SIZE);
+ if (!ptr) {
+ pr_warning("PERCPU: failed to allocate "
+ "4k page for cpu%u\n", cpu);
+ goto enomem;
+ }
+
+ memcpy(ptr, __per_cpu_load + i * PAGE_SIZE, PAGE_SIZE);
+ pcpu4k_pages[j++] = virt_to_page(ptr);
+ }
+
+ /* we're ready, commit */
+ pr_info("PERCPU: Allocated %d 4k pages, static data %zu bytes\n",
+ pcpu4k_nr_static_pages, static_size);
+
+ ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
+ reserved_size, -1,
+ -1, NULL, populate_pte_fn);
+ goto out_free_ar;
+
+enomem:
+ while (--j >= 0)
+ free_fn(page_address(pcpu4k_pages[j]), PAGE_SIZE);
+ ret = -ENOMEM;
+out_free_ar:
+ free_bootmem(__pa(pcpu4k_pages), pages_size);
+ return ret;
+}
+
+/*
* Generic percpu area setup.
*
* The embedding helper is used because its behavior closely resembles
--
1.6.0.2

2009-06-24 13:32:35

by Tejun Heo

Subject: [PATCH 04/10] percpu: make 4k first chunk allocator map memory

At first, the percpu first chunk was always set up page-by-page by
the generic code. To add other allocators, different parts of the
generic initialization were made optional. Now we have three
allocators - embed, remap and 4k. embed and remap fully handle
allocation and mapping of the first chunk while 4k still depends on
the generic code for those. This makes the generic alloc/map paths
specific to 4k and makes the code unnecessarily complicated with
optional generic behaviors.

This patch makes the 4k allocator allocate and map memory directly
instead of depending on the generic code. The only outside-visible
change is that the dynamic area in the first chunk is now allocated
up-front instead of on-demand. This doesn't make any meaningful
difference as the area is minimal (usually less than a page, just
enough to fill the alignment) with the 4k allocator. Plus, the
dynamic area in the first chunk usually gets fully used anyway.
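
For a feel of how small that up-front dynamic area is, here is a
worked example with made-up sizes (4k pages; the PCPU_MIN_UNIT_SIZE
clamp in the actual code is ignored as it wouldn't bite for sizes
this large):

	size_t static_size = 130000, reserved_size = 8192; /* made up */
	int unit_pages = PFN_UP(static_size + reserved_size);
					/* PFN_UP(138192) = 34 pages */
	size_t dyn_size = (unit_pages << PAGE_SHIFT)
			  - static_size - reserved_size;
					/* 139264 - 138192 = 1072 bytes */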

This will allow simplification of pcpu_setup_first_chunk() and
removal of the chunk->page array.

[ Impact: no outside visible change other than up-front allocation of dyn area ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
mm/percpu.c | 71 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 39f4022..c173763 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -632,6 +632,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int size,
pcpu_unmap(chunk, unmap_start, unmap_end, flush);
}

+static int __pcpu_map_pages(unsigned long addr, struct page **pages,
+ int nr_pages)
+{
+ return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT,
+ PAGE_KERNEL, pages);
+}
+
/**
* pcpu_map - map pages into a pcpu_chunk
* @chunk: chunk of interest
@@ -651,11 +658,9 @@ static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
WARN_ON(chunk->immutable);

for_each_possible_cpu(cpu) {
- err = map_kernel_range_noflush(
- pcpu_chunk_addr(chunk, cpu, page_start),
- (page_end - page_start) << PAGE_SHIFT,
- PAGE_KERNEL,
- pcpu_chunk_pagep(chunk, cpu, page_start));
+ err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+ pcpu_chunk_pagep(chunk, cpu, page_start),
+ page_end - page_start);
if (err < 0)
return err;
}
@@ -1274,12 +1279,12 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
* 4k page first chunk setup helper.
*/
static struct page **pcpu4k_pages __initdata;
-static int pcpu4k_nr_static_pages __initdata;
+static int pcpu4k_unit_pages __initdata;

static struct page * __init pcpu4k_get_page(unsigned int cpu, int pageno)
{
- if (pageno < pcpu4k_nr_static_pages)
- return pcpu4k_pages[cpu * pcpu4k_nr_static_pages + pageno];
+ if (pageno < pcpu4k_unit_pages)
+ return pcpu4k_pages[cpu * pcpu4k_unit_pages + pageno];
return NULL;
}

@@ -1306,22 +1311,24 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
+ static struct vm_struct vm;
size_t pages_size;
unsigned int cpu;
int i, j;
ssize_t ret;

- pcpu4k_nr_static_pages = PFN_UP(static_size);
+ pcpu4k_unit_pages = PFN_UP(max_t(size_t, static_size + reserved_size,
+ PCPU_MIN_UNIT_SIZE));

/* unaligned allocations can't be freed, round up to page size */
- pages_size = PFN_ALIGN(pcpu4k_nr_static_pages * num_possible_cpus() *
+ pages_size = PFN_ALIGN(pcpu4k_unit_pages * num_possible_cpus() *
sizeof(pcpu4k_pages[0]));
pcpu4k_pages = alloc_bootmem(pages_size);

- /* allocate and copy */
+ /* allocate pages */
j = 0;
for_each_possible_cpu(cpu)
- for (i = 0; i < pcpu4k_nr_static_pages; i++) {
+ for (i = 0; i < pcpu4k_unit_pages; i++) {
void *ptr;

ptr = alloc_fn(cpu, PAGE_SIZE);
@@ -1330,18 +1337,48 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
"4k page for cpu%u\n", cpu);
goto enomem;
}
-
- memcpy(ptr, __per_cpu_load + i * PAGE_SIZE, PAGE_SIZE);
pcpu4k_pages[j++] = virt_to_page(ptr);
}

+ /* allocate vm area, map the pages and copy static data */
+ vm.flags = VM_ALLOC;
+ vm.size = num_possible_cpus() * pcpu4k_unit_pages << PAGE_SHIFT;
+ vm_area_register_early(&vm, PAGE_SIZE);
+
+ for_each_possible_cpu(cpu) {
+ unsigned long unit_addr = (unsigned long)vm.addr +
+ (cpu * pcpu4k_unit_pages << PAGE_SHIFT);
+
+ for (i = 0; i < pcpu4k_unit_pages; i++)
+ populate_pte_fn(unit_addr + (i << PAGE_SHIFT));
+
+ /* pte already populated, the following shouldn't fail */
+ ret = __pcpu_map_pages(unit_addr,
+ &pcpu4k_pages[cpu * pcpu4k_unit_pages],
+ pcpu4k_unit_pages);
+ if (ret < 0)
+ panic("failed to map percpu area, err=%zd\n", ret);
+
+ /*
+ * FIXME: Archs with virtual cache should flush local
+ * cache for the linear mapping here - something
+ * equivalent to flush_cache_vmap() on the local cpu.
+ * flush_cache_vmap() can't be used as most supporting
+ * data structures are not set up yet.
+ */
+
+ /* copy static data */
+ memcpy((void *)unit_addr, __per_cpu_load, static_size);
+ }
+
/* we're ready, commit */
- pr_info("PERCPU: Allocated %d 4k pages, static data %zu bytes\n",
- pcpu4k_nr_static_pages, static_size);
+ pr_info("PERCPU: %d 4k pages per cpu, static data %zu bytes\n",
+ pcpu4k_unit_pages, static_size);

ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
reserved_size, -1,
- -1, NULL, populate_pte_fn);
+ pcpu4k_unit_pages << PAGE_SHIFT, vm.addr,
+ NULL);
goto out_free_ar;

enomem:
--
1.6.0.2

2009-06-24 13:33:21

by Tejun Heo

Subject: [PATCH 08/10] percpu: drop pcpu_chunk->page[]

The percpu core doesn't need to track all the allocated pages. It
needs to know whether certain pages are populated and to have a way
to reverse-map an address to its page when freeing. This patch drops
pcpu_chunk->page[] and uses a populated bitmap and vmalloc_to_page()
lookup instead. Using vmalloc_to_page() exclusively is also possible
but complicates first chunk handling, inflates cache footprint and
prevents non-standard memory allocation for percpu memory.
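
In other words - this is just a distilled sketch of the new helpers
in the patch below, pcpu_chunk_page() and pcpu_get_page_chunk() - the
reverse mapping walks the vmalloc page tables with vmalloc_to_page()
and reads the owning chunk back out of page->index, which
pcpu_set_page_chunk() stores at population time:

	struct page *page = vmalloc_to_page(
			(void *)pcpu_chunk_addr(chunk, cpu, page_idx));
	struct pcpu_chunk *owner = (struct pcpu_chunk *)page->index;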

pcpu_chunk->page[] was used to track each page's allocation and
allowed asymmetric population, which happens during the failure path;
however, with a single bitmap for all units, this is no longer
possible. Bite the bullet and rewrite the (de)populate functions so
that things are done in clearly separated steps such that asymmetric
population doesn't happen. This makes the (de)population process
much more modular and will also ease implementing non-standard memory
usage in the future (e.g. large pages).

This makes the @get_page_fn parameter to pcpu_setup_first_chunk()
unnecessary. The parameter is dropped and all first chunk helpers
are updated accordingly. Please note that despite the volume, most
changes to the first chunk helpers are symbol renames for variables
which no longer need to be referenced outside the helpers.

This change reduces memory usage and cache footprint of pcpu_chunk.
Now only #unit_pages bits are necessary per chunk.

[ Impact: reduced memory usage and cache footprint for bookkeeping ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
---
arch/sparc/kernel/smp_64.c | 42 +--
include/linux/percpu.h | 3 +-
mm/percpu.c | 604 ++++++++++++++++++++++++++++----------------
3 files changed, 400 insertions(+), 249 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index ccad7b2..f2f22ee 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1415,19 +1415,6 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
#endif
}

-static size_t pcpur_size __initdata;
-static void **pcpur_ptrs __initdata;
-
-static struct page * __init pcpur_get_page(unsigned int cpu, int pageno)
-{
- size_t off = (size_t)pageno << PAGE_SHIFT;
-
- if (off >= pcpur_size)
- return NULL;
-
- return virt_to_page(pcpur_ptrs[cpu] + off);
-}
-
#define PCPU_CHUNK_SIZE (4UL * 1024UL * 1024UL)

static void __init pcpu_map_range(unsigned long start, unsigned long end,
@@ -1491,25 +1478,26 @@ void __init setup_per_cpu_areas(void)
size_t dyn_size, static_size = __per_cpu_end - __per_cpu_start;
static struct vm_struct vm;
unsigned long delta, cpu;
- size_t pcpu_unit_size;
+ size_t size_sum, pcpu_unit_size;
size_t ptrs_size;
+ void **ptrs;

- pcpur_size = PFN_ALIGN(static_size + PERCPU_MODULE_RESERVE +
- PERCPU_DYNAMIC_RESERVE);
- dyn_size = pcpur_size - static_size - PERCPU_MODULE_RESERVE;
+ size_sum = PFN_ALIGN(static_size + PERCPU_MODULE_RESERVE +
+ PERCPU_DYNAMIC_RESERVE);
+ dyn_size = size_sum - static_size - PERCPU_MODULE_RESERVE;


- ptrs_size = PFN_ALIGN(num_possible_cpus() * sizeof(pcpur_ptrs[0]));
- pcpur_ptrs = alloc_bootmem(ptrs_size);
+ ptrs_size = PFN_ALIGN(num_possible_cpus() * sizeof(ptrs[0]));
+ ptrs = alloc_bootmem(ptrs_size);

for_each_possible_cpu(cpu) {
- pcpur_ptrs[cpu] = pcpu_alloc_bootmem(cpu, PCPU_CHUNK_SIZE,
- PCPU_CHUNK_SIZE);
+ ptrs[cpu] = pcpu_alloc_bootmem(cpu, PCPU_CHUNK_SIZE,
+ PCPU_CHUNK_SIZE);

- free_bootmem(__pa(pcpur_ptrs[cpu] + pcpur_size),
- PCPU_CHUNK_SIZE - pcpur_size);
+ free_bootmem(__pa(ptrs[cpu] + size_sum),
+ PCPU_CHUNK_SIZE - size_sum);

- memcpy(pcpur_ptrs[cpu], __per_cpu_load, static_size);
+ memcpy(ptrs[cpu], __per_cpu_load, static_size);
}

/* allocate address and map */
@@ -1523,14 +1511,14 @@ void __init setup_per_cpu_areas(void)

start += cpu * PCPU_CHUNK_SIZE;
end = start + PCPU_CHUNK_SIZE;
- pcpu_map_range(start, end, virt_to_page(pcpur_ptrs[cpu]));
+ pcpu_map_range(start, end, virt_to_page(ptrs[cpu]));
}

- pcpu_unit_size = pcpu_setup_first_chunk(pcpur_get_page, static_size,
+ pcpu_unit_size = pcpu_setup_first_chunk(static_size,
PERCPU_MODULE_RESERVE, dyn_size,
PCPU_CHUNK_SIZE, vm.addr);

- free_bootmem(__pa(pcpur_ptrs), ptrs_size);
+ free_bootmem(__pa(ptrs), ptrs_size);

delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index ec64357..63c8b7a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -58,13 +58,12 @@

extern void *pcpu_base_addr;

-typedef struct page * (*pcpu_get_page_fn_t)(unsigned int cpu, int pageno);
typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

-extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
+extern size_t __init pcpu_setup_first_chunk(
size_t static_size, size_t reserved_size,
ssize_t dyn_size, size_t unit_size,
void *base_addr);
diff --git a/mm/percpu.c b/mm/percpu.c
index 770db98..5ee712e 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -94,8 +94,7 @@ struct pcpu_chunk {
int map_alloc; /* # of map entries allocated */
int *map; /* allocation map */
bool immutable; /* no [de]population allowed */
- struct page **page; /* points to page array */
- struct page *page_ar[]; /* #cpus * UNIT_PAGES */
+ unsigned long populated[]; /* populated bitmap */
};

static int pcpu_unit_pages __read_mostly;
@@ -129,9 +128,9 @@ static int pcpu_reserved_chunk_limit;
* Synchronization rules.
*
* There are two locks - pcpu_alloc_mutex and pcpu_lock. The former
- * protects allocation/reclaim paths, chunks and chunk->page arrays.
- * The latter is a spinlock and protects the index data structures -
- * chunk slots, chunks and area maps in chunks.
+ * protects allocation/reclaim paths, chunks, populated bitmap and
+ * vmalloc mapping. The latter is a spinlock and protects the index
+ * data structures - chunk slots, chunks and area maps in chunks.
*
* During allocation, pcpu_alloc_mutex is kept locked all the time and
* pcpu_lock is grabbed and released as necessary. All actual memory
@@ -188,16 +187,13 @@ static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
(pcpu_page_idx(cpu, page_idx) << PAGE_SHIFT);
}

-static struct page **pcpu_chunk_pagep(struct pcpu_chunk *chunk,
- unsigned int cpu, int page_idx)
+static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
+ unsigned int cpu, int page_idx)
{
- return &chunk->page[pcpu_page_idx(cpu, page_idx)];
-}
+ /* must not be used on pre-mapped chunk */
+ WARN_ON(chunk->immutable);

-static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
- int page_idx)
-{
- return *pcpu_chunk_pagep(chunk, 0, page_idx) != NULL;
+ return vmalloc_to_page((void *)pcpu_chunk_addr(chunk, cpu, page_idx));
}

/* set the pointer to a chunk in a page struct */
@@ -212,6 +208,34 @@ static struct pcpu_chunk *pcpu_get_page_chunk(struct page *page)
return (struct pcpu_chunk *)page->index;
}

+static void pcpu_next_unpop(struct pcpu_chunk *chunk, int *rs, int *re, int end)
+{
+ *rs = find_next_zero_bit(chunk->populated, end, *rs);
+ *re = find_next_bit(chunk->populated, end, *rs + 1);
+}
+
+static void pcpu_next_pop(struct pcpu_chunk *chunk, int *rs, int *re, int end)
+{
+ *rs = find_next_bit(chunk->populated, end, *rs);
+ *re = find_next_zero_bit(chunk->populated, end, *rs + 1);
+}
+
+/*
+ * (Un)populated page region iterators. Iterate over (un)populated
+ * page regions between @start and @end in @chunk. @rs and @re should
+ * be integer variables and will be set to start and end page index of
+ * the current region.
+ */
+#define pcpu_for_each_unpop_region(chunk, rs, re, start, end) \
+ for ((rs) = (start), pcpu_next_unpop((chunk), &(rs), &(re), (end)); \
+ (rs) < (re); \
+ (rs) = (re) + 1, pcpu_next_unpop((chunk), &(rs), &(re), (end)))
+
+#define pcpu_for_each_pop_region(chunk, rs, re, start, end) \
+ for ((rs) = (start), pcpu_next_pop((chunk), &(rs), &(re), (end)); \
+ (rs) < (re); \
+ (rs) = (re) + 1, pcpu_next_pop((chunk), &(rs), &(re), (end)))
+
/**
* pcpu_mem_alloc - allocate memory
* @size: bytes to allocate
@@ -545,42 +569,197 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int freeme)
}

/**
- * pcpu_unmap - unmap pages out of a pcpu_chunk
+ * pcpu_get_pages_and_bitmap - get temp pages array and bitmap
+ * @chunk: chunk of interest
+ * @bitmapp: output parameter for bitmap
+ * @may_alloc: may allocate the array
+ *
+ * Returns pointer to array of pointers to struct page and bitmap,
+ * both of which can be indexed with pcpu_page_idx(). The returned
+ * array is cleared to zero and *@bitmapp is copied from
+ * @chunk->populated. Note that there is only one array and bitmap
+ * and access exclusion is the caller's responsibility.
+ *
+ * CONTEXT:
+ * pcpu_alloc_mutex and does GFP_KERNEL allocation if @may_alloc.
+ * Otherwise, don't care.
+ *
+ * RETURNS:
+ * Pointer to temp pages array on success, NULL on failure.
+ */
+static struct page **pcpu_get_pages_and_bitmap(struct pcpu_chunk *chunk,
+ unsigned long **bitmapp,
+ bool may_alloc)
+{
+ static struct page **pages;
+ static unsigned long *bitmap;
+ size_t pages_size = num_possible_cpus() * pcpu_unit_pages *
+ sizeof(pages[0]);
+ size_t bitmap_size = BITS_TO_LONGS(pcpu_unit_pages) *
+ sizeof(unsigned long);
+
+ if (!pages || !bitmap) {
+ if (may_alloc && !pages)
+ pages = pcpu_mem_alloc(pages_size);
+ if (may_alloc && !bitmap)
+ bitmap = pcpu_mem_alloc(bitmap_size);
+ if (!pages || !bitmap)
+ return NULL;
+ }
+
+ memset(pages, 0, pages_size);
+ bitmap_copy(bitmap, chunk->populated, pcpu_unit_pages);
+
+ *bitmapp = bitmap;
+ return pages;
+}
+
+/**
+ * pcpu_free_pages - free pages which were allocated for @chunk
+ * @chunk: chunk pages were allocated for
+ * @pages: array of pages to be freed, indexed by pcpu_page_idx()
+ * @populated: populated bitmap
+ * @page_start: page index of the first page to be freed
+ * @page_end: page index of the last page to be freed + 1
+ *
+ * Free pages [@page_start, @page_end) in @pages for all units.
+ * The pages were allocated for @chunk.
+ */
+static void pcpu_free_pages(struct pcpu_chunk *chunk,
+ struct page **pages, unsigned long *populated,
+ int page_start, int page_end)
+{
+ unsigned int cpu;
+ int i;
+
+ for_each_possible_cpu(cpu) {
+ for (i = page_start; i < page_end; i++) {
+ struct page *page = pages[pcpu_page_idx(cpu, i)];
+
+ if (page)
+ __free_page(page);
+ }
+ }
+}
+
+/**
+ * pcpu_alloc_pages - allocates pages for @chunk
+ * @chunk: target chunk
+ * @pages: array to put the allocated pages into, indexed by pcpu_page_idx()
+ * @populated: populated bitmap
+ * @page_start: page index of the first page to be allocated
+ * @page_end: page index of the last page to be allocated + 1
+ *
+ * Allocate pages [@page_start,@page_end) into @pages for all units.
+ * The allocation is for @chunk. Percpu core doesn't care about the
+ * content of @pages and will pass it verbatim to pcpu_map_pages().
+ */
+static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
+ struct page **pages, unsigned long *populated,
+ int page_start, int page_end)
+{
+ const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
+ unsigned int cpu;
+ int i;
+
+ for_each_possible_cpu(cpu) {
+ for (i = page_start; i < page_end; i++) {
+ struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
+
+ *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ if (!*pagep) {
+ pcpu_free_pages(chunk, pages, populated,
+ page_start, page_end);
+ return -ENOMEM;
+ }
+ }
+ }
+ return 0;
+}
+
+/**
+ * pcpu_pre_unmap_flush - flush cache prior to unmapping
+ * @chunk: chunk the regions to be flushed belongs to
+ * @page_start: page index of the first page to be flushed
+ * @page_end: page index of the last page to be flushed + 1
+ *
+ * Pages in [@page_start,@page_end) of @chunk are about to be
+ * unmapped. Flush cache. As each flushing trial can be very
+ * expensive, issue flush on the whole region at once rather than
+ * doing it for each cpu. This could be an overkill but is more
+ * scalable.
+ */
+static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk,
+ int page_start, int page_end)
+{
+ unsigned int last = num_possible_cpus() - 1;
+
+ flush_cache_vunmap(pcpu_chunk_addr(chunk, 0, page_start),
+ pcpu_chunk_addr(chunk, last, page_end));
+}
+
+static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
+{
+ unmap_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT);
+}
+
+/**
+ * pcpu_unmap_pages - unmap pages out of a pcpu_chunk
* @chunk: chunk of interest
+ * @pages: pages array which can be used to pass information to free
+ * @populated: populated bitmap
* @page_start: page index of the first page to unmap
* @page_end: page index of the last page to unmap + 1
- * @flush_tlb: whether to flush tlb or not
*
* For each cpu, unmap pages [@page_start,@page_end) out of @chunk.
- * If @flush is true, vcache is flushed before unmapping and tlb
- * after.
+ * Corresponding elements in @pages were cleared by the caller and can
+ * be used to carry information to pcpu_free_pages() which will be
+ * called after all unmaps are finished. The caller should call
+ * proper pre/post flush functions.
*/
-static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
- bool flush_tlb)
+static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
+ struct page **pages, unsigned long *populated,
+ int page_start, int page_end)
{
- unsigned int last = num_possible_cpus() - 1;
unsigned int cpu;
+ int i;

- /* unmap must not be done on immutable chunk */
- WARN_ON(chunk->immutable);
+ for_each_possible_cpu(cpu) {
+ for (i = page_start; i < page_end; i++) {
+ struct page *page;

- /*
- * Each flushing trial can be very expensive, issue flush on
- * the whole region at once rather than doing it for each cpu.
- * This could be an overkill but is more scalable.
- */
- flush_cache_vunmap(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
+ page = pcpu_chunk_page(chunk, cpu, i);
+ WARN_ON(!page);
+ pages[pcpu_page_idx(cpu, i)] = page;
+ }
+ __pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+ page_end - page_start);
+ }

- for_each_possible_cpu(cpu)
- unmap_kernel_range_noflush(
- pcpu_chunk_addr(chunk, cpu, page_start),
- (page_end - page_start) << PAGE_SHIFT);
-
- /* ditto as flush_cache_vunmap() */
- if (flush_tlb)
- flush_tlb_kernel_range(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
+ for (i = page_start; i < page_end; i++)
+ __clear_bit(i, populated);
+}
+
+/**
+ * pcpu_post_unmap_tlb_flush - flush TLB after unmapping
+ * @chunk: pcpu_chunk the regions to be flushed belong to
+ * @page_start: page index of the first page to be flushed
+ * @page_end: page index of the last page to be flushed + 1
+ *
+ * Pages [@page_start,@page_end) of @chunk have been unmapped. Flush
+ * TLB for the regions. This can be skipped if the area is to be
+ * returned to vmalloc as vmalloc will handle TLB flushing lazily.
+ *
+ * As with pcpu_pre_unmap_flush(), TLB flushing also is done at once
+ * for the whole region.
+ */
+static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
+ int page_start, int page_end)
+{
+ unsigned int last = num_possible_cpus() - 1;
+
+ flush_tlb_kernel_range(pcpu_chunk_addr(chunk, 0, page_start),
+ pcpu_chunk_addr(chunk, last, page_end));
}

static int __pcpu_map_pages(unsigned long addr, struct page **pages,
@@ -591,35 +770,76 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
}

/**
- * pcpu_map - map pages into a pcpu_chunk
+ * pcpu_map_pages - map pages into a pcpu_chunk
* @chunk: chunk of interest
+ * @pages: pages array containing pages to be mapped
+ * @populated: populated bitmap
* @page_start: page index of the first page to map
* @page_end: page index of the last page to map + 1
*
- * For each cpu, map pages [@page_start,@page_end) into @chunk.
- * vcache is flushed afterwards.
+ * For each cpu, map pages [@page_start,@page_end) into @chunk. The
+ * caller is responsible for calling pcpu_post_map_flush() after all
+ * mappings are complete.
+ *
+ * This function is responsible for setting corresponding bits in
+ * @chunk->populated bitmap and whatever is necessary for reverse
+ * lookup (addr -> chunk).
*/
-static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
+static int pcpu_map_pages(struct pcpu_chunk *chunk,
+ struct page **pages, unsigned long *populated,
+ int page_start, int page_end)
{
- unsigned int last = num_possible_cpus() - 1;
- unsigned int cpu;
- int err;
-
- /* map must not be done on immutable chunk */
- WARN_ON(chunk->immutable);
+ unsigned int cpu, tcpu;
+ int i, err;

for_each_possible_cpu(cpu) {
err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
- pcpu_chunk_pagep(chunk, cpu, page_start),
+ &pages[pcpu_page_idx(cpu, page_start)],
page_end - page_start);
if (err < 0)
- return err;
+ goto err;
}

+ /* mapping successful, link chunk and mark populated */
+ for (i = page_start; i < page_end; i++) {
+ for_each_possible_cpu(cpu)
+ pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
+ chunk);
+ __set_bit(i, populated);
+ }
+
+ return 0;
+
+err:
+ for_each_possible_cpu(tcpu) {
+ if (tcpu == cpu)
+ break;
+ __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
+ page_end - page_start);
+ }
+ return err;
+}
+
+/**
+ * pcpu_post_map_flush - flush cache after mapping
+ * @chunk: pcpu_chunk the regions to be flushed belong to
+ * @page_start: page index of the first page to be flushed
+ * @page_end: page index of the last page to be flushed + 1
+ *
+ * Pages [@page_start,@page_end) of @chunk have been mapped. Flush
+ * cache.
+ *
+ * As with pcpu_pre_unmap_flush(), cache flushing is also done at once
+ * for the whole region.
+ */
+static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
+ int page_start, int page_end)
+{
+ unsigned int last = num_possible_cpus() - 1;
+
/* flush at once, please read comments in pcpu_unmap() */
flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
pcpu_chunk_addr(chunk, last, page_end));
- return 0;
}

/**
@@ -636,39 +856,45 @@ static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
* CONTEXT:
* pcpu_alloc_mutex.
*/
-static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int size,
- bool flush)
+static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int size)
{
int page_start = PFN_DOWN(off);
int page_end = PFN_UP(off + size);
- int unmap_start = -1;
- int uninitialized_var(unmap_end);
- unsigned int cpu;
- int i;
+ struct page **pages;
+ unsigned long *populated;
+ int rs, re;
+
+ /* quick path, check whether it's empty already */
+ pcpu_for_each_unpop_region(chunk, rs, re, page_start, page_end) {
+ if (rs == page_start && re == page_end)
+ return;
+ break;
+ }

- for (i = page_start; i < page_end; i++) {
- for_each_possible_cpu(cpu) {
- struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+ /* immutable chunks can't be depopulated */
+ WARN_ON(chunk->immutable);

- if (!*pagep)
- continue;
+ /*
+ * If control reaches here, there must have been at least one
+ * successful population attempt so the temp pages array must
+ * be available now.
+ */
+ pages = pcpu_get_pages_and_bitmap(chunk, &populated, false);
+ BUG_ON(!pages);

- __free_page(*pagep);
+ /* unmap and free */
+ pcpu_pre_unmap_flush(chunk, page_start, page_end);

- /*
- * If it's partial depopulation, it might get
- * populated or depopulated again. Mark the
- * page gone.
- */
- *pagep = NULL;
+ pcpu_for_each_pop_region(chunk, rs, re, page_start, page_end)
+ pcpu_unmap_pages(chunk, pages, populated, rs, re);

- unmap_start = unmap_start < 0 ? i : unmap_start;
- unmap_end = i + 1;
- }
- }
+ /* no need to flush tlb, vmalloc will handle it lazily */
+
+ pcpu_for_each_pop_region(chunk, rs, re, page_start, page_end)
+ pcpu_free_pages(chunk, pages, populated, rs, re);

- if (unmap_start >= 0)
- pcpu_unmap(chunk, unmap_start, unmap_end, flush);
+ /* commit new bitmap */
+ bitmap_copy(chunk->populated, populated, pcpu_unit_pages);
}

/**
@@ -685,50 +911,61 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int size,
*/
static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
{
- const gfp_t alloc_mask = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
int page_start = PFN_DOWN(off);
int page_end = PFN_UP(off + size);
- int map_start = -1;
- int uninitialized_var(map_end);
+ int free_end = page_start, unmap_end = page_start;
+ struct page **pages;
+ unsigned long *populated;
unsigned int cpu;
- int i;
+ int rs, re, rc;

- for (i = page_start; i < page_end; i++) {
- if (pcpu_chunk_page_occupied(chunk, i)) {
- if (map_start >= 0) {
- if (pcpu_map(chunk, map_start, map_end))
- goto err;
- map_start = -1;
- }
- continue;
- }
+ /* quick path, check whether all pages are already there */
+ pcpu_for_each_pop_region(chunk, rs, re, page_start, page_end) {
+ if (rs == page_start && re == page_end)
+ goto clear;
+ break;
+ }

- map_start = map_start < 0 ? i : map_start;
- map_end = i + 1;
+ /* need to allocate and map pages, this chunk can't be immutable */
+ WARN_ON(chunk->immutable);

- for_each_possible_cpu(cpu) {
- struct page **pagep = pcpu_chunk_pagep(chunk, cpu, i);
+ pages = pcpu_get_pages_and_bitmap(chunk, &populated, true);
+ if (!pages)
+ return -ENOMEM;

- *pagep = alloc_pages_node(cpu_to_node(cpu),
- alloc_mask, 0);
- if (!*pagep)
- goto err;
- pcpu_set_page_chunk(*pagep, chunk);
- }
+ /* alloc and map */
+ pcpu_for_each_unpop_region(chunk, rs, re, page_start, page_end) {
+ rc = pcpu_alloc_pages(chunk, pages, populated, rs, re);
+ if (rc)
+ goto err_free;
+ free_end = re;
}

- if (map_start >= 0 && pcpu_map(chunk, map_start, map_end))
- goto err;
+ pcpu_for_each_unpop_region(chunk, rs, re, page_start, page_end) {
+ rc = pcpu_map_pages(chunk, pages, populated, rs, re);
+ if (rc)
+ goto err_unmap;
+ unmap_end = re;
+ }
+ pcpu_post_map_flush(chunk, page_start, page_end);

+ /* commit new bitmap */
+ bitmap_copy(chunk->populated, populated, pcpu_unit_pages);
+clear:
for_each_possible_cpu(cpu)
memset(chunk->vm->addr + cpu * pcpu_unit_size + off, 0,
size);
-
return 0;
-err:
- /* likely under heavy memory pressure, give memory back */
- pcpu_depopulate_chunk(chunk, off, size, true);
- return -ENOMEM;
+
+err_unmap:
+ pcpu_pre_unmap_flush(chunk, page_start, unmap_end);
+ pcpu_for_each_unpop_region(chunk, rs, re, page_start, unmap_end)
+ pcpu_unmap_pages(chunk, pages, populated, rs, re);
+ pcpu_post_unmap_tlb_flush(chunk, page_start, unmap_end);
+err_free:
+ pcpu_for_each_unpop_region(chunk, rs, re, page_start, free_end)
+ pcpu_free_pages(chunk, pages, populated, rs, re);
+ return rc;
}

static void free_pcpu_chunk(struct pcpu_chunk *chunk)
@@ -752,7 +989,6 @@ static struct pcpu_chunk *alloc_pcpu_chunk(void)
chunk->map = pcpu_mem_alloc(PCPU_DFL_MAP_ALLOC * sizeof(chunk->map[0]));
chunk->map_alloc = PCPU_DFL_MAP_ALLOC;
chunk->map[chunk->map_used++] = pcpu_unit_size;
- chunk->page = chunk->page_ar;

chunk->vm = get_vm_area(pcpu_chunk_size, GFP_KERNEL);
if (!chunk->vm) {
@@ -933,7 +1169,7 @@ static void pcpu_reclaim(struct work_struct *work)
mutex_unlock(&pcpu_alloc_mutex);

list_for_each_entry_safe(chunk, next, &todo, list) {
- pcpu_depopulate_chunk(chunk, 0, pcpu_unit_size, false);
+ pcpu_depopulate_chunk(chunk, 0, pcpu_unit_size);
free_pcpu_chunk(chunk);
}
}
@@ -981,7 +1217,6 @@ EXPORT_SYMBOL_GPL(free_percpu);

/**
* pcpu_setup_first_chunk - initialize the first percpu chunk
- * @get_page_fn: callback to fetch page pointer
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes, 0 for none
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
@@ -992,14 +1227,6 @@ EXPORT_SYMBOL_GPL(free_percpu);
* perpcu area. This function is to be called from arch percpu area
* setup path.
*
- * @get_page_fn() should return pointer to percpu page given cpu
- * number and page number. It should at least return enough pages to
- * cover the static area. The returned pages for static area should
- * have been initialized with valid data. It can also return pages
- * after the static area. NULL return indicates end of pages for the
- * cpu. Note that @get_page_fn() must return the same number of pages
- * for all cpus.
- *
* @reserved_size, if non-zero, specifies the amount of bytes to
* reserve after the static area in the first chunk. This reserves
* the first chunk such that it's available only through reserved
@@ -1031,8 +1258,7 @@ EXPORT_SYMBOL_GPL(free_percpu);
* The determined pcpu_unit_size which can be used to initialize
* percpu access.
*/
-size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
- size_t static_size, size_t reserved_size,
+size_t __init pcpu_setup_first_chunk(size_t static_size, size_t reserved_size,
ssize_t dyn_size, size_t unit_size,
void *base_addr)
{
@@ -1041,8 +1267,7 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t size_sum = static_size + reserved_size +
(dyn_size >= 0 ? dyn_size : 0);
struct pcpu_chunk *schunk, *dchunk = NULL;
- unsigned int cpu;
- int i, nr_pages;
+ int i;

/* santiy checks */
BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
@@ -1056,8 +1281,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
pcpu_unit_pages = unit_size >> PAGE_SHIFT;
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
- pcpu_chunk_struct_size = sizeof(struct pcpu_chunk)
- + num_possible_cpus() * pcpu_unit_pages * sizeof(struct page *);
+ pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
+ BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

if (dyn_size < 0)
dyn_size = pcpu_unit_size - static_size - reserved_size;
@@ -1087,8 +1312,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
schunk->vm = &first_vm;
schunk->map = smap;
schunk->map_alloc = ARRAY_SIZE(smap);
- schunk->page = schunk->page_ar;
schunk->immutable = true;
+ bitmap_fill(schunk->populated, pcpu_unit_pages);

if (reserved_size) {
schunk->free_size = reserved_size;
@@ -1106,38 +1331,19 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,

/* init dynamic chunk if necessary */
if (dyn_size) {
- dchunk = alloc_bootmem(sizeof(struct pcpu_chunk));
+ dchunk = alloc_bootmem(pcpu_chunk_struct_size);
INIT_LIST_HEAD(&dchunk->list);
dchunk->vm = &first_vm;
dchunk->map = dmap;
dchunk->map_alloc = ARRAY_SIZE(dmap);
- dchunk->page = schunk->page_ar; /* share page map with schunk */
dchunk->immutable = true;
+ bitmap_fill(dchunk->populated, pcpu_unit_pages);

dchunk->contig_hint = dchunk->free_size = dyn_size;
dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
dchunk->map[dchunk->map_used++] = dchunk->free_size;
}

- /* assign pages */
- nr_pages = -1;
- for_each_possible_cpu(cpu) {
- for (i = 0; i < pcpu_unit_pages; i++) {
- struct page *page = get_page_fn(cpu, i);
-
- if (!page)
- break;
- *pcpu_chunk_pagep(schunk, cpu, i) = page;
- }
-
- BUG_ON(i < PFN_UP(static_size));
-
- if (nr_pages < 0)
- nr_pages = i;
- else
- BUG_ON(nr_pages != i);
- }
-
/* link the first chunk in */
pcpu_first_chunk = dchunk ?: schunk;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
@@ -1160,23 +1366,6 @@ static size_t pcpu_calc_fc_sizes(size_t static_size, size_t reserved_size,
return size_sum;
}

-/*
- * Embedding first chunk setup helper.
- */
-static void *pcpue_ptr __initdata;
-static size_t pcpue_size __initdata;
-static size_t pcpue_unit_size __initdata;
-
-static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
-{
- size_t off = (size_t)pageno << PAGE_SHIFT;
-
- if (off >= pcpue_size)
- return NULL;
-
- return virt_to_page(pcpue_ptr + cpu * pcpue_unit_size + off);
-}
-
/**
* pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
* @static_size: the size of static percpu area in bytes
@@ -1207,18 +1396,19 @@ static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
ssize_t dyn_size)
{
- size_t chunk_size;
+ size_t size_sum, unit_size, chunk_size;
+ void *base;
unsigned int cpu;

/* determine parameters and allocate */
- pcpue_size = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);

- pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
- chunk_size = pcpue_unit_size * num_possible_cpus();
+ unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+ chunk_size = unit_size * num_possible_cpus();

- pcpue_ptr = __alloc_bootmem_nopanic(chunk_size, PAGE_SIZE,
- __pa(MAX_DMA_ADDRESS));
- if (!pcpue_ptr) {
+ base = __alloc_bootmem_nopanic(chunk_size, PAGE_SIZE,
+ __pa(MAX_DMA_ADDRESS));
+ if (!base) {
pr_warning("PERCPU: failed to allocate %zu bytes for "
"embedding\n", chunk_size);
return -ENOMEM;
@@ -1226,33 +1416,18 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,

/* return the leftover and copy */
for_each_possible_cpu(cpu) {
- void *ptr = pcpue_ptr + cpu * pcpue_unit_size;
+ void *ptr = base + cpu * unit_size;

- free_bootmem(__pa(ptr + pcpue_size),
- pcpue_unit_size - pcpue_size);
+ free_bootmem(__pa(ptr + size_sum), unit_size - size_sum);
memcpy(ptr, __per_cpu_load, static_size);
}

/* we're ready, commit */
pr_info("PERCPU: Embedded %zu pages at %p, static data %zu bytes\n",
- pcpue_size >> PAGE_SHIFT, pcpue_ptr, static_size);
+ size_sum >> PAGE_SHIFT, base, static_size);

- return pcpu_setup_first_chunk(pcpue_get_page, static_size,
- reserved_size, dyn_size,
- pcpue_unit_size, pcpue_ptr);
-}
-
-/*
- * 4k page first chunk setup helper.
- */
-static struct page **pcpu4k_pages __initdata;
-static int pcpu4k_unit_pages __initdata;
-
-static struct page * __init pcpu4k_get_page(unsigned int cpu, int pageno)
-{
- if (pageno < pcpu4k_unit_pages)
- return pcpu4k_pages[cpu * pcpu4k_unit_pages + pageno];
- return NULL;
+ return pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
+ unit_size, base);
}

/**
@@ -1279,23 +1454,25 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
pcpu_fc_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct vm;
+ int unit_pages;
size_t pages_size;
+ struct page **pages;
unsigned int cpu;
int i, j;
ssize_t ret;

- pcpu4k_unit_pages = PFN_UP(max_t(size_t, static_size + reserved_size,
- PCPU_MIN_UNIT_SIZE));
+ unit_pages = PFN_UP(max_t(size_t, static_size + reserved_size,
+ PCPU_MIN_UNIT_SIZE));

/* unaligned allocations can't be freed, round up to page size */
- pages_size = PFN_ALIGN(pcpu4k_unit_pages * num_possible_cpus() *
- sizeof(pcpu4k_pages[0]));
- pcpu4k_pages = alloc_bootmem(pages_size);
+ pages_size = PFN_ALIGN(unit_pages * num_possible_cpus() *
+ sizeof(pages[0]));
+ pages = alloc_bootmem(pages_size);

/* allocate pages */
j = 0;
for_each_possible_cpu(cpu)
- for (i = 0; i < pcpu4k_unit_pages; i++) {
+ for (i = 0; i < unit_pages; i++) {
void *ptr;

ptr = alloc_fn(cpu, PAGE_SIZE);
@@ -1304,25 +1481,24 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,
"4k page for cpu%u\n", cpu);
goto enomem;
}
- pcpu4k_pages[j++] = virt_to_page(ptr);
+ pages[j++] = virt_to_page(ptr);
}

/* allocate vm area, map the pages and copy static data */
vm.flags = VM_ALLOC;
- vm.size = num_possible_cpus() * pcpu4k_unit_pages << PAGE_SHIFT;
+ vm.size = num_possible_cpus() * unit_pages << PAGE_SHIFT;
vm_area_register_early(&vm, PAGE_SIZE);

for_each_possible_cpu(cpu) {
unsigned long unit_addr = (unsigned long)vm.addr +
- (cpu * pcpu4k_unit_pages << PAGE_SHIFT);
+ (cpu * unit_pages << PAGE_SHIFT);

- for (i = 0; i < pcpu4k_unit_pages; i++)
+ for (i = 0; i < unit_pages; i++)
populate_pte_fn(unit_addr + (i << PAGE_SHIFT));

/* pte already populated, the following shouldn't fail */
- ret = __pcpu_map_pages(unit_addr,
- &pcpu4k_pages[cpu * pcpu4k_unit_pages],
- pcpu4k_unit_pages);
+ ret = __pcpu_map_pages(unit_addr, &pages[cpu * unit_pages],
+ unit_pages);
if (ret < 0)
panic("failed to map percpu area, err=%zd\n", ret);

@@ -1340,19 +1516,18 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,

/* we're ready, commit */
pr_info("PERCPU: %d 4k pages per cpu, static data %zu bytes\n",
- pcpu4k_unit_pages, static_size);
+ unit_pages, static_size);

- ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
- reserved_size, -1,
- pcpu4k_unit_pages << PAGE_SHIFT, vm.addr);
+ ret = pcpu_setup_first_chunk(static_size, reserved_size, -1,
+ unit_pages << PAGE_SHIFT, vm.addr);
goto out_free_ar;

enomem:
while (--j >= 0)
- free_fn(page_address(pcpu4k_pages[j]), PAGE_SIZE);
+ free_fn(page_address(pages[j]), PAGE_SIZE);
ret = -ENOMEM;
out_free_ar:
- free_bootmem(__pa(pcpu4k_pages), pages_size);
+ free_bootmem(__pa(pages), pages_size);
return ret;
}

@@ -1370,16 +1545,6 @@ static size_t pcpul_unit_size;
static struct pcpul_ent *pcpul_map;
static struct vm_struct pcpul_vm;

-static struct page * __init pcpul_get_page(unsigned int cpu, int pageno)
-{
- size_t off = (size_t)pageno << PAGE_SHIFT;
-
- if (off >= pcpul_size)
- return NULL;
-
- return virt_to_page(pcpul_map[cpu].ptr + off);
-}
-
/**
* pcpu_lpage_first_chunk - remap the first percpu chunk using large page
* @static_size: the size of static percpu area in bytes
@@ -1475,9 +1640,8 @@ ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
pr_info("PERCPU: Remapped at %p with large pages, static data "
"%zu bytes\n", pcpul_vm.addr, static_size);

- ret = pcpu_setup_first_chunk(pcpul_get_page, static_size,
- reserved_size, dyn_size, pcpul_unit_size,
- pcpul_vm.addr);
+ ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
+ pcpul_unit_size, pcpul_vm.addr);

/* sort pcpul_map array for pcpu_lpage_remapped() */
for (i = 0; i < num_possible_cpus() - 1; i++)
--
1.6.0.2
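
Not part of the patch itself, but since the populated bitmap and the
pcpu_for_each_(un)pop_region() iterators above carry most of the new
logic, here is a minimal standalone userspace sketch of the iteration
pattern. next_bit() is a naive stand-in for the kernel's
find_next_bit()/find_next_zero_bit() and the example bitmap value is
made up; the kernel walks chunk->populated the same way.

#include <stdio.h>

#define NR_PAGES	16
#define BITS_PER_LONG	(8 * (int)sizeof(long))

/* naive stand-in for find_next_bit()/find_next_zero_bit() */
static int next_bit(const unsigned long *map, int size, int start, int want_set)
{
	int i;

	for (i = start; i < size; i++) {
		int set = !!(map[i / BITS_PER_LONG] &
			     (1UL << (i % BITS_PER_LONG)));
		if (set == want_set)
			return i;
	}
	return size;
}

int main(void)
{
	/* made-up bitmap: pages 2-5 and 8-10 populated */
	unsigned long populated[1] = { 0x73cUL };
	int rs, re;

	/* roughly what pcpu_for_each_pop_region(chunk, rs, re, 0, NR_PAGES) expands to */
	for (rs = 0; rs < NR_PAGES; rs = re + 1) {
		rs = next_bit(populated, NR_PAGES, rs, 1);	/* find_next_bit() */
		re = next_bit(populated, NR_PAGES, rs + 1, 0);	/* find_next_zero_bit() */
		if (rs >= re)
			break;
		printf("populated region: [%d, %d)\n", rs, re);
	}
	return 0;
}
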

2009-06-24 13:32:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/10] percpu: simplify pcpu_setup_first_chunk()

Now that all first chunk allocator helpers allocate and map the first
chunk themselves, there's no need for the optional default alloc/map
path in pcpu_setup_first_chunk(). Drop @populate_pte_fn, leave only
@dyn_size optional and make all other parameters mandatory.

This makes it much easier to follow what pcpu_setup_first_chunk() is
doing and what effect tweaking each parameter actually has.
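
For quick reference, the prototype change (lifted from the
include/linux/percpu.h hunk below) boils down to the following sketch;
note that @unit_size also changes from ssize_t to size_t since -1 is
no longer meaningful for it.

/*
 * before - @unit_size, @base_addr and @populate_pte_fn were optional:
 *
 *	size_t pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
 *				      size_t static_size, size_t reserved_size,
 *				      ssize_t dyn_size, ssize_t unit_size,
 *				      void *base_addr,
 *				      pcpu_fc_populate_pte_fn_t populate_pte_fn);
 */

/* after - only @dyn_size keeps its -1-for-auto special case */
extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
				size_t static_size, size_t reserved_size,
				ssize_t dyn_size, size_t unit_size,
				void *base_addr);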

[ Impact: drop unused code path ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/sparc/kernel/smp_64.c | 2 +-
include/linux/percpu.h | 5 +-
mm/percpu.c | 104 +++++++++++++-------------------------------
3 files changed, 33 insertions(+), 78 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index fa44eaf..ccad7b2 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1528,7 +1528,7 @@ void __init setup_per_cpu_areas(void)

pcpu_unit_size = pcpu_setup_first_chunk(pcpur_get_page, static_size,
PERCPU_MODULE_RESERVE, dyn_size,
- PCPU_CHUNK_SIZE, vm.addr, NULL);
+ PCPU_CHUNK_SIZE, vm.addr);

free_bootmem(__pa(pcpur_ptrs), ptrs_size);

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 9f6bfd7..ec64357 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -66,9 +66,8 @@ typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, ssize_t unit_size,
- void *base_addr,
- pcpu_fc_populate_pte_fn_t populate_pte_fn);
+ ssize_t dyn_size, size_t unit_size,
+ void *base_addr);

extern ssize_t __init pcpu_embed_first_chunk(
size_t static_size, size_t reserved_size,
diff --git a/mm/percpu.c b/mm/percpu.c
index 17dfb7c..452d3f3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -983,24 +983,22 @@ EXPORT_SYMBOL_GPL(free_percpu);
* pcpu_setup_first_chunk - initialize the first percpu chunk
* @get_page_fn: callback to fetch page pointer
* @static_size: the size of static percpu area in bytes
- * @reserved_size: the size of reserved percpu area in bytes
+ * @reserved_size: the size of reserved percpu area in bytes, 0 for none
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
- * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE, -1 for auto
- * @base_addr: mapped address, NULL for auto
- * @populate_pte_fn: callback to allocate pagetable, NULL if unnecessary
+ * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE
+ * @base_addr: mapped address
*
* Initialize the first percpu chunk which contains the kernel static
* perpcu area. This function is to be called from arch percpu area
- * setup path. The first two parameters are mandatory. The rest are
- * optional.
+ * setup path.
*
* @get_page_fn() should return pointer to percpu page given cpu
* number and page number. It should at least return enough pages to
* cover the static area. The returned pages for static area should
- * have been initialized with valid data. If @unit_size is specified,
- * it can also return pages after the static area. NULL return
- * indicates end of pages for the cpu. Note that @get_page_fn() must
- * return the same number of pages for all cpus.
+ * have been initialized with valid data. It can also return pages
+ * after the static area. NULL return indicates end of pages for the
+ * cpu. Note that @get_page_fn() must return the same number of pages
+ * for all cpus.
*
* @reserved_size, if non-zero, specifies the amount of bytes to
* reserve after the static area in the first chunk. This reserves
@@ -1015,17 +1013,12 @@ EXPORT_SYMBOL_GPL(free_percpu);
* non-negative value makes percpu leave alone the area beyond
* @static_size + @reserved_size + @dyn_size.
*
- * @unit_size, if non-negative, specifies unit size and must be
- * aligned to PAGE_SIZE and equal to or larger than @static_size +
- * @reserved_size + if non-negative, @dyn_size.
- *
- * Non-null @base_addr means that the caller already allocated virtual
- * region for the first chunk and mapped it. percpu must not mess
- * with the chunk. Note that @base_addr with 0 @unit_size or non-NULL
- * @populate_pte_fn doesn't make any sense.
+ * @unit_size specifies unit size and must be aligned to PAGE_SIZE and
+ * equal to or larger than @static_size + @reserved_size + if
+ * non-negative, @dyn_size.
*
- * @populate_pte_fn is used to populate the pagetable. NULL means the
- * caller already populated the pagetable.
+ * The caller should have mapped the first chunk at @base_addr and
+ * copied static data to each unit.
*
* If the first chunk ends up with both reserved and dynamic areas, it
* is served by two chunks - one to serve the core static and reserved
@@ -1040,9 +1033,8 @@ EXPORT_SYMBOL_GPL(free_percpu);
*/
size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, ssize_t unit_size,
- void *base_addr,
- pcpu_fc_populate_pte_fn_t populate_pte_fn)
+ ssize_t dyn_size, size_t unit_size,
+ void *base_addr)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
@@ -1050,27 +1042,18 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
(dyn_size >= 0 ? dyn_size : 0);
struct pcpu_chunk *schunk, *dchunk = NULL;
unsigned int cpu;
- int nr_pages;
- int err, i;
+ int i, nr_pages;

/* santiy checks */
BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
BUG_ON(!static_size);
- if (unit_size >= 0) {
- BUG_ON(unit_size < size_sum);
- BUG_ON(unit_size & ~PAGE_MASK);
- BUG_ON(unit_size < PCPU_MIN_UNIT_SIZE);
- } else
- BUG_ON(base_addr);
- BUG_ON(base_addr && populate_pte_fn);
-
- if (unit_size >= 0)
- pcpu_unit_pages = unit_size >> PAGE_SHIFT;
- else
- pcpu_unit_pages = max_t(int, PCPU_MIN_UNIT_SIZE >> PAGE_SHIFT,
- PFN_UP(size_sum));
+ BUG_ON(!base_addr);
+ BUG_ON(unit_size < size_sum);
+ BUG_ON(unit_size & ~PAGE_MASK);
+ BUG_ON(unit_size < PCPU_MIN_UNIT_SIZE);

+ pcpu_unit_pages = unit_size >> PAGE_SHIFT;
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk)
@@ -1079,6 +1062,10 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
if (dyn_size < 0)
dyn_size = pcpu_unit_size - static_size - reserved_size;

+ first_vm.flags = VM_ALLOC;
+ first_vm.size = pcpu_chunk_size;
+ first_vm.addr = base_addr;
+
/*
* Allocate chunk slots. The additional last slot is for
* empty chunks.
@@ -1101,6 +1088,7 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
schunk->map = smap;
schunk->map_alloc = ARRAY_SIZE(smap);
schunk->page = schunk->page_ar;
+ schunk->immutable = true;

if (reserved_size) {
schunk->free_size = reserved_size;
@@ -1124,31 +1112,13 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
dchunk->map = dmap;
dchunk->map_alloc = ARRAY_SIZE(dmap);
dchunk->page = schunk->page_ar; /* share page map with schunk */
+ dchunk->immutable = true;

dchunk->contig_hint = dchunk->free_size = dyn_size;
dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
dchunk->map[dchunk->map_used++] = dchunk->free_size;
}

- /* allocate vm address */
- first_vm.flags = VM_ALLOC;
- first_vm.size = pcpu_chunk_size;
-
- if (!base_addr)
- vm_area_register_early(&first_vm, PAGE_SIZE);
- else {
- /*
- * Pages already mapped. No need to remap into
- * vmalloc area. In this case the first chunks can't
- * be mapped or unmapped by percpu and are marked
- * immutable.
- */
- first_vm.addr = base_addr;
- schunk->immutable = true;
- if (dchunk)
- dchunk->immutable = true;
- }
-
/* assign pages */
nr_pages = -1;
for_each_possible_cpu(cpu) {
@@ -1168,19 +1138,6 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
BUG_ON(nr_pages != i);
}

- /* map them */
- if (populate_pte_fn) {
- for_each_possible_cpu(cpu)
- for (i = 0; i < nr_pages; i++)
- populate_pte_fn(pcpu_chunk_addr(schunk,
- cpu, i));
-
- err = pcpu_map(schunk, 0, nr_pages);
- if (err)
- panic("failed to setup static percpu area, err=%d\n",
- err);
- }
-
/* link the first chunk in */
pcpu_first_chunk = dchunk ?: schunk;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
@@ -1282,7 +1239,7 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,

return pcpu_setup_first_chunk(pcpue_get_page, static_size,
reserved_size, dyn_size,
- pcpue_unit_size, pcpue_ptr, NULL);
+ pcpue_unit_size, pcpue_ptr);
}

/*
@@ -1387,8 +1344,7 @@ ssize_t __init pcpu_4k_first_chunk(size_t static_size, size_t reserved_size,

ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
reserved_size, -1,
- pcpu4k_unit_pages << PAGE_SHIFT, vm.addr,
- NULL);
+ pcpu4k_unit_pages << PAGE_SHIFT, vm.addr);
goto out_free_ar;

enomem:
@@ -1521,7 +1477,7 @@ ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,

ret = pcpu_setup_first_chunk(pcpul_get_page, static_size,
reserved_size, dyn_size, pcpul_unit_size,
- pcpul_vm.addr, NULL);
+ pcpul_vm.addr);

/* sort pcpul_map array for pcpu_lpage_remapped() */
for (i = 0; i < num_possible_cpus() - 1; i++)
--
1.6.0.2

2009-06-24 13:33:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/10] percpu: teach large page allocator about NUMA

The large page first chunk allocator is primarily used on NUMA
machines; however, its NUMA handling is extremely simplistic.
Regardless of their proximity, each cpu is put into a separate large
page just to return most of the allocated space back, wasting a large
amount of vmalloc space and increasing cache footprint.

This patch teaches NUMA details to the large page allocator. Given
processor proximity information, pcpu_lpage_build_unit_map() will find
a fitting cpu -> unit mapping in which cpus within LOCAL_DISTANCE
share the same large page and not too much virtual address space is
wasted.

This greatly reduces the unit and thus chunk size and wastes much less
address space for the first chunk. For example, on a 4/4 NUMA machine,
the original code occupied 16MB of virtual space for the first chunk
while the new code uses only 4MB - one 2MB page for each node.
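
Reading the 4/4 box as two nodes of four cpus each (which is what 16MB
before vs 4MB after implies with 2MB pages), a hedged sketch of the
arch distance callback and of the mapping the new code would produce
looks like the following. The topology, the sizes and
example_cpu_distance() are assumptions made for illustration; only
LOCAL_DISTANCE/REMOTE_DISTANCE and pcpu_lpage_build_unit_map() come
from the patch.

/* assumed topology: cpus 0-3 on node 0, cpus 4-7 on node 1 */
static int example_cpu_distance(unsigned int from, unsigned int to)
{
	return (from / 4 == to / 4) ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/*
 * Assuming static + reserved + dynamic comes to a bit under 512KB,
 * pcpu_lpage_build_unit_map() rounds the per-group allocation up to a
 * single 2MB large page and settles on units_per_alloc == 4:
 *
 *	unit_size  = 2MB / 4 = 512KB
 *	unit_map[] = { 0, 1, 2, 3,	cpus 0-3 -> lpage 0 (node 0)
 *		       4, 5, 6, 7 };	cpus 4-7 -> lpage 1 (node 1)
 *
 * i.e. two 2MB pages (4MB) for the whole first chunk instead of one
 * 2MB page per cpu (16MB).
 */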

[ Impact: much better space efficiency on NUMA machines ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: David Miller <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 72 ++++++--
include/linux/percpu.h | 24 +++-
mm/percpu.c | 358 +++++++++++++++++++++++++++++++---------
3 files changed, 359 insertions(+), 95 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 4f2e0ac..7501bb1 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -149,36 +149,73 @@ static void __init pcpul_map(void *ptr, size_t size, void *addr)
set_pmd(pmd, pmd_v);
}

+static int pcpu_lpage_cpu_distance(unsigned int from, unsigned int to)
+{
+ if (early_cpu_to_node(from) == early_cpu_to_node(to))
+ return LOCAL_DISTANCE;
+ else
+ return REMOTE_DISTANCE;
+}
+
static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
{
size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
+ size_t dyn_size = reserve - PERCPU_FIRST_CHUNK_RESERVE;
+ size_t unit_map_size, unit_size;
+ int *unit_map;
+ int nr_units;
+ ssize_t ret;
+
+ /* on non-NUMA, embedding is better */
+ if (!chosen && !pcpu_need_numa())
+ return -EINVAL;
+
+ /* need PSE */
+ if (!cpu_has_pse) {
+ pr_warning("PERCPU: lpage allocator requires PSE\n");
+ return -EINVAL;
+ }

+ /* allocate and build unit_map */
+ unit_map_size = num_possible_cpus() * sizeof(int);
+ unit_map = alloc_bootmem_nopanic(unit_map_size);
+ if (!unit_map) {
+ pr_warning("PERCPU: failed to allocate unit_map\n");
+ return -ENOMEM;
+ }
+
+ ret = pcpu_lpage_build_unit_map(static_size,
+ PERCPU_FIRST_CHUNK_RESERVE,
+ &dyn_size, &unit_size, PMD_SIZE,
+ unit_map, pcpu_lpage_cpu_distance);
+ if (ret < 0) {
+ pr_warning("PERCPU: failed to build unit_map\n");
+ goto out_free;
+ }
+ nr_units = ret;
+
+ /* do the parameters look okay? */
if (!chosen) {
size_t vm_size = VMALLOC_END - VMALLOC_START;
- size_t tot_size = num_possible_cpus() * PMD_SIZE;
-
- /* on non-NUMA, embedding is better */
- if (!pcpu_need_numa())
- return -EINVAL;
+ size_t tot_size = nr_units * unit_size;

/* don't consume more than 20% of vmalloc area */
if (tot_size > vm_size / 5) {
pr_info("PERCPU: too large chunk size %zuMB for "
"large page remap\n", tot_size >> 20);
- return -EINVAL;
+ ret = -EINVAL;
+ goto out_free;
}
}

- /* need PSE */
- if (!cpu_has_pse) {
- pr_warning("PERCPU: lpage allocator requires PSE\n");
- return -EINVAL;
- }
-
- return pcpu_lpage_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
- reserve - PERCPU_FIRST_CHUNK_RESERVE,
- PMD_SIZE,
- pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
+ ret = pcpu_lpage_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ dyn_size, unit_size, PMD_SIZE,
+ unit_map, nr_units,
+ pcpu_fc_alloc, pcpu_fc_free, pcpul_map);
+out_free:
+ if (ret < 0)
+ free_bootmem(__pa(unit_map), unit_map_size);
+ return ret;
}
#else
static ssize_t __init setup_pcpu_lpage(size_t static_size, bool chosen)
@@ -299,7 +336,8 @@ void __init setup_per_cpu_areas(void)
/* alrighty, percpu areas up and running */
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
- per_cpu_offset(cpu) = delta + cpu * pcpu_unit_size;
+ per_cpu_offset(cpu) =
+ delta + pcpu_unit_map[cpu] * pcpu_unit_size;
per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
per_cpu(cpu_number, cpu) = cpu;
setup_percpu_segment(cpu);
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1e0e887..8ce91af 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -62,6 +62,7 @@ extern const int *pcpu_unit_map;
typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size);
typedef void (*pcpu_fc_free_fn_t)(void *ptr, size_t size);
typedef void (*pcpu_fc_populate_pte_fn_t)(unsigned long addr);
+typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to);
typedef void (*pcpu_fc_map_fn_t)(void *ptr, size_t size, void *addr);

extern size_t __init pcpu_setup_first_chunk(
@@ -80,18 +81,37 @@ extern ssize_t __init pcpu_4k_first_chunk(
pcpu_fc_populate_pte_fn_t populate_pte_fn);

#ifdef CONFIG_NEED_MULTIPLE_NODES
+extern int __init pcpu_lpage_build_unit_map(
+ size_t static_size, size_t reserved_size,
+ ssize_t *dyn_sizep, size_t *unit_sizep,
+ size_t lpage_size, int *unit_map,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
+
extern ssize_t __init pcpu_lpage_first_chunk(
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, size_t lpage_size,
+ size_t dyn_size, size_t unit_size,
+ size_t lpage_size, const int *unit_map,
+ int nr_units,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn);

extern void *pcpu_lpage_remapped(void *kaddr);
#else
+static inline int pcpu_lpage_build_unit_map(
+ size_t static_size, size_t reserved_size,
+ ssize_t *dyn_sizep, size_t *unit_sizep,
+ size_t lpage_size, int *unit_map,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
+{
+ return -EINVAL;
+}
+
static inline ssize_t __init pcpu_lpage_first_chunk(
size_t static_size, size_t reserved_size,
- ssize_t dyn_size, size_t lpage_size,
+ size_t dyn_size, size_t unit_size,
+ size_t lpage_size, const int *unit_map,
+ int nr_units,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn)
diff --git a/mm/percpu.c b/mm/percpu.c
index f0fce38..b11ae7a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -59,6 +59,7 @@
#include <linux/bitmap.h>
#include <linux/bootmem.h>
#include <linux/list.h>
+#include <linux/log2.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/mutex.h>
@@ -1594,75 +1595,259 @@ out_free_ar:
* Large page remapping first chunk setup helper
*/
#ifdef CONFIG_NEED_MULTIPLE_NODES
+
+/**
+ * pcpu_lpage_build_unit_map - build unit_map for large page remapping
+ * @static_size: the size of static percpu area in bytes
+ * @reserved_size: the size of reserved percpu area in bytes
+ * @dyn_sizep: in/out parameter for dynamic size, -1 for auto
+ * @unit_sizep: out parameter for unit size
+ * @unit_map: unit_map to be filled
+ * @cpu_distance_fn: callback to determine distance between cpus
+ *
+ * This function builds the cpu -> unit map and determines other parameters
+ * considering needed percpu size, large page size and distances
+ * between CPUs in NUMA.
+ *
+ * CPUs which are of LOCAL_DISTANCE both ways are grouped together and
+ * may share units in the same large page. The returned configuration
+ * is guaranteed to have CPUs on different nodes on different large
+ * pages and >=75% usage of allocated virtual address space.
+ *
+ * RETURNS:
+ * On success, fills in @unit_map, sets *@dyn_sizep, *@unit_sizep and
+ * returns the number of units to be allocated. -errno on failure.
+ */
+int __init pcpu_lpage_build_unit_map(size_t static_size, size_t reserved_size,
+ ssize_t *dyn_sizep, size_t *unit_sizep,
+ size_t lpage_size, int *unit_map,
+ pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
+{
+ static int group_map[NR_CPUS] __initdata;
+ static int group_cnt[NR_CPUS] __initdata;
+ int group_cnt_max = 0;
+ size_t size_sum, min_unit_size, alloc_size;
+ int upa, max_upa, uninitialized_var(best_upa); /* units_per_alloc */
+ int last_allocs;
+ unsigned int cpu, tcpu;
+ int group, unit;
+
+ /*
+ * Determine min_unit_size, alloc_size and max_upa such that
+ * alloc_size is multiple of lpage_size and is the smallest
+ * which can accommodate 4k aligned segments which are equal to
+ * or larger than min_unit_size.
+ */
+ size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, dyn_sizep);
+ min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
+
+ alloc_size = roundup(min_unit_size, lpage_size);
+ upa = alloc_size / min_unit_size;
+ while (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+ upa--;
+ max_upa = upa;
+
+ /* group cpus according to their proximity */
+ for_each_possible_cpu(cpu) {
+ group = 0;
+ next_group:
+ for_each_possible_cpu(tcpu) {
+ if (cpu == tcpu)
+ break;
+ if (group_map[tcpu] == group &&
+ (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
+ cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
+ group++;
+ goto next_group;
+ }
+ }
+ group_map[cpu] = group;
+ group_cnt[group]++;
+ group_cnt_max = max(group_cnt_max, group_cnt[group]);
+ }
+
+ /*
+ * Expand unit size until address space usage goes over 75%
+ * and then as much as possible without using more address
+ * space.
+ */
+ last_allocs = INT_MAX;
+ for (upa = max_upa; upa; upa--) {
+ int allocs = 0, wasted = 0;
+
+ if (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+ continue;
+
+ for (group = 0; group_cnt[group]; group++) {
+ int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
+ allocs += this_allocs;
+ wasted += this_allocs * upa - group_cnt[group];
+ }
+
+ /*
+ * Don't accept if wastage is over 25%. The
+ * greater-than comparison ensures upa==1 always
+ * passes the following check.
+ */
+ if (wasted > num_possible_cpus() / 3)
+ continue;
+
+ /* and then don't consume more memory */
+ if (allocs > last_allocs)
+ break;
+ last_allocs = allocs;
+ best_upa = upa;
+ }
+ *unit_sizep = alloc_size / best_upa;
+
+ /* assign units to cpus accordingly */
+ unit = 0;
+ for (group = 0; group_cnt[group]; group++) {
+ for_each_possible_cpu(cpu)
+ if (group_map[cpu] == group)
+ unit_map[cpu] = unit++;
+ unit = roundup(unit, best_upa);
+ }
+
+ return unit; /* unit contains aligned number of units */
+}
+
struct pcpul_ent {
- unsigned int cpu;
void *ptr;
+ void *map_addr;
};

static size_t pcpul_size;
-static size_t pcpul_unit_size;
+static size_t pcpul_lpage_size;
+static int pcpul_nr_lpages;
static struct pcpul_ent *pcpul_map;
-static struct vm_struct pcpul_vm;
+
+static bool __init pcpul_unit_to_cpu(int unit, const int *unit_map,
+ unsigned int *cpup)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu)
+ if (unit_map[cpu] == unit) {
+ if (cpup)
+ *cpup = cpu;
+ return true;
+ }
+
+ return false;
+}
+
+static void __init pcpul_lpage_dump_cfg(const char *lvl, size_t static_size,
+ size_t reserved_size, size_t dyn_size,
+ size_t unit_size, size_t lpage_size,
+ const int *unit_map, int nr_units)
+{
+ int width = 1, v = nr_units;
+ char empty_str[] = "--------";
+ int upl, lpl; /* units per lpage, lpage per line */
+ unsigned int cpu;
+ int lpage, unit;
+
+ while (v /= 10)
+ width++;
+ empty_str[min_t(int, width, sizeof(empty_str) - 1)] = '\0';
+
+ upl = max_t(int, lpage_size / unit_size, 1);
+ lpl = rounddown_pow_of_two(max_t(int, 60 / (upl * (width + 1) + 2), 1));
+
+ printk("%spcpu-lpage: sta/res/dyn=%zu/%zu/%zu unit=%zu lpage=%zu", lvl,
+ static_size, reserved_size, dyn_size, unit_size, lpage_size);
+
+ for (lpage = 0, unit = 0; unit < nr_units; unit++) {
+ if (!(unit % upl)) {
+ if (!(lpage++ % lpl)) {
+ printk("\n");
+ printk("%spcpu-lpage: ", lvl);
+ } else
+ printk("| ");
+ }
+ if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
+ printk("%0*d ", width, cpu);
+ else
+ printk("%s ", empty_str);
+ }
+ printk("\n");
+}

/**
* pcpu_lpage_first_chunk - remap the first percpu chunk using large page
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: free size for dynamic allocation in bytes
+ * @unit_size: unit size in bytes
* @lpage_size: the size of a large page
+ * @unit_map: cpu -> unit mapping
+ * @nr_units: the number of units
* @alloc_fn: function to allocate percpu lpage, always called with lpage_size
* @free_fn: function to free percpu memory, @size <= lpage_size
* @map_fn: function to map percpu lpage, always called with lpage_size
*
- * This allocator uses large page as unit. A large page is allocated
- * for each cpu and each is remapped into vmalloc area using large
- * page mapping. As large page can be quite large, only part of it is
- * used for the first chunk. Unused part is returned to the bootmem
- * allocator.
- *
- * So, the large pages are mapped twice - once to the physical mapping
- * and to the vmalloc area for the first percpu chunk. The double
- * mapping does add one more large TLB entry pressure but still is
- * much better than only using 4k mappings while still being NUMA
- * friendly.
+ * This allocator uses large page to build and map the first chunk.
+ * Unlike other helpers, the caller should always specify @dyn_size
+ * and @unit_size. These parameters along with @unit_map and
+ * @nr_units can be determined using pcpu_lpage_build_unit_map().
+ * This two-stage initialization allows arch code to evaluate the
+ * parameters before committing to them.
+ *
+ * Large pages are allocated as directed by @unit_map and other
+ * parameters and mapped to vmalloc space. Unused holes are returned
+ * to the page allocator. Note that these holes end up being actively
+ * mapped twice - once to the physical mapping and to the vmalloc area
+ * for the first percpu chunk. Depending on the architecture, this
+ * might cause problems when changing page attributes of the returned area.
+ * These double mapped areas can be detected using
+ * pcpu_lpage_remapped().
*
* RETURNS:
* The determined pcpu_unit_size which can be used to initialize
* percpu access on success, -errno on failure.
*/
ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
- ssize_t dyn_size, size_t lpage_size,
+ size_t dyn_size, size_t unit_size,
+ size_t lpage_size, const int *unit_map,
+ int nr_units,
pcpu_fc_alloc_fn_t alloc_fn,
pcpu_fc_free_fn_t free_fn,
pcpu_fc_map_fn_t map_fn)
{
- size_t size_sum;
+ static struct vm_struct vm;
+ size_t chunk_size = unit_size * nr_units;
size_t map_size;
unsigned int cpu;
- int i, j;
ssize_t ret;
+ int i, j, unit;

- /*
- * Currently supports only single page. Supporting multiple
- * pages won't be too difficult if it ever becomes necessary.
- */
- size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
+ pcpul_lpage_dump_cfg(KERN_DEBUG, static_size, reserved_size, dyn_size,
+ unit_size, lpage_size, unit_map, nr_units);

- pcpul_unit_size = lpage_size;
- pcpul_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
- if (pcpul_size > pcpul_unit_size) {
- pr_warning("PERCPU: static data is larger than large page, "
- "can't use large page\n");
- return -EINVAL;
- }
+ BUG_ON(chunk_size % lpage_size);
+
+ pcpul_size = static_size + reserved_size + dyn_size;
+ pcpul_lpage_size = lpage_size;
+ pcpul_nr_lpages = chunk_size / lpage_size;

/* allocate pointer array and alloc large pages */
- map_size = PFN_ALIGN(num_possible_cpus() * sizeof(pcpul_map[0]));
+ map_size = pcpul_nr_lpages * sizeof(pcpul_map[0]);
pcpul_map = alloc_bootmem(map_size);

- for_each_possible_cpu(cpu) {
+ /* allocate all pages */
+ for (i = 0; i < pcpul_nr_lpages; i++) {
+ size_t offset = i * lpage_size;
+ int first_unit = offset / unit_size;
+ int last_unit = (offset + lpage_size - 1) / unit_size;
void *ptr;

+ /* find out which cpu is mapped to this unit */
+ for (unit = first_unit; unit <= last_unit; unit++)
+ if (pcpul_unit_to_cpu(unit, unit_map, &cpu))
+ goto found;
+ continue;
+ found:
ptr = alloc_fn(cpu, lpage_size);
if (!ptr) {
pr_warning("PERCPU: failed to allocate large page "
@@ -1670,53 +1855,79 @@ ssize_t __init pcpu_lpage_first_chunk(size_t static_size, size_t reserved_size,
goto enomem;
}

- /*
- * Only use pcpul_size bytes and give back the rest.
- *
- * Ingo: The lpage_size up-rounding bootmem is needed
- * to make sure the partial lpage is still fully RAM -
- * it's not well-specified to have a incompatible area
- * (unmapped RAM, device memory, etc.) in that hole.
- */
- free_fn(ptr + pcpul_size, lpage_size - pcpul_size);
-
- pcpul_map[cpu].cpu = cpu;
- pcpul_map[cpu].ptr = ptr;
+ pcpul_map[i].ptr = ptr;
+ }

- memcpy(ptr, __per_cpu_load, static_size);
+ /* return unused holes */
+ for (unit = 0; unit < nr_units; unit++) {
+ size_t start = unit * unit_size;
+ size_t end = start + unit_size;
+ size_t off, next;
+
+ /* don't free used part of occupied unit */
+ if (pcpul_unit_to_cpu(unit, unit_map, NULL))
+ start += pcpul_size;
+
+ /* unit can span more than one page, punch the holes */
+ for (off = start; off < end; off = next) {
+ void *ptr = pcpul_map[off / lpage_size].ptr;
+ next = min(roundup(off + 1, lpage_size), end);
+ if (ptr)
+ free_fn(ptr + off % lpage_size, next - off);
+ }
}

- /* allocate address and map */
- pcpul_vm.flags = VM_ALLOC;
- pcpul_vm.size = num_possible_cpus() * pcpul_unit_size;
- vm_area_register_early(&pcpul_vm, pcpul_unit_size);
+ /* allocate address, map and copy */
+ vm.flags = VM_ALLOC;
+ vm.size = chunk_size;
+ vm_area_register_early(&vm, unit_size);
+
+ for (i = 0; i < pcpul_nr_lpages; i++) {
+ if (!pcpul_map[i].ptr)
+ continue;
+ pcpul_map[i].map_addr = vm.addr + i * lpage_size;
+ map_fn(pcpul_map[i].ptr, lpage_size, pcpul_map[i].map_addr);
+ }

for_each_possible_cpu(cpu)
- map_fn(pcpul_map[cpu].ptr, pcpul_unit_size,
- pcpul_vm.addr + cpu * pcpul_unit_size);
+ memcpy(vm.addr + unit_map[cpu] * unit_size, __per_cpu_load,
+ static_size);

/* we're ready, commit */
pr_info("PERCPU: Remapped at %p with large pages, static data "
- "%zu bytes\n", pcpul_vm.addr, static_size);
+ "%zu bytes\n", vm.addr, static_size);

ret = pcpu_setup_first_chunk(static_size, reserved_size, dyn_size,
- pcpul_unit_size, pcpul_vm.addr, NULL);
-
- /* sort pcpul_map array for pcpu_lpage_remapped() */
- for (i = 0; i < num_possible_cpus() - 1; i++)
- for (j = i + 1; j < num_possible_cpus(); j++)
- if (pcpul_map[i].ptr > pcpul_map[j].ptr) {
- struct pcpul_ent tmp = pcpul_map[i];
- pcpul_map[i] = pcpul_map[j];
- pcpul_map[j] = tmp;
- }
+ unit_size, vm.addr, unit_map);
+
+ /*
+ * Sort pcpul_map array for pcpu_lpage_remapped(). Unmapped
+ * lpages are pushed to the end and trimmed.
+ */
+ for (i = 0; i < pcpul_nr_lpages - 1; i++)
+ for (j = i + 1; j < pcpul_nr_lpages; j++) {
+ struct pcpul_ent tmp;
+
+ if (!pcpul_map[j].ptr)
+ continue;
+ if (pcpul_map[i].ptr &&
+ pcpul_map[i].ptr < pcpul_map[j].ptr)
+ continue;
+
+ tmp = pcpul_map[i];
+ pcpul_map[i] = pcpul_map[j];
+ pcpul_map[j] = tmp;
+ }
+
+ while (pcpul_nr_lpages && !pcpul_map[pcpul_nr_lpages - 1].ptr)
+ pcpul_nr_lpages--;

return ret;

enomem:
- for_each_possible_cpu(cpu)
- if (pcpul_map[cpu].ptr)
- free_fn(pcpul_map[cpu].ptr, pcpul_size);
+ for (i = 0; i < pcpul_nr_lpages; i++)
+ if (pcpul_map[i].ptr)
+ free_fn(pcpul_map[i].ptr, lpage_size);
free_bootmem(__pa(pcpul_map), map_size);
return -ENOMEM;
}
@@ -1739,10 +1950,10 @@ enomem:
*/
void *pcpu_lpage_remapped(void *kaddr)
{
- unsigned long unit_mask = pcpul_unit_size - 1;
- void *lpage_addr = (void *)((unsigned long)kaddr & ~unit_mask);
- unsigned long offset = (unsigned long)kaddr & unit_mask;
- int left = 0, right = num_possible_cpus() - 1;
+ unsigned long lpage_mask = pcpul_lpage_size - 1;
+ void *lpage_addr = (void *)((unsigned long)kaddr & ~lpage_mask);
+ unsigned long offset = (unsigned long)kaddr & lpage_mask;
+ int left = 0, right = pcpul_nr_lpages - 1;
int pos;

/* pcpul in use at all? */
@@ -1757,13 +1968,8 @@ void *pcpu_lpage_remapped(void *kaddr)
left = pos + 1;
else if (pcpul_map[pos].ptr > lpage_addr)
right = pos - 1;
- else {
- /* it shouldn't be in the area for the first chunk */
- WARN_ON(offset < pcpul_size);
-
- return pcpul_vm.addr +
- pcpul_map[pos].cpu * pcpul_unit_size + offset;
- }
+ else
+ return pcpul_map[pos].map_addr + offset;
}

return NULL;
--
1.6.0.2
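
One consequence of the sparse cpu -> unit mapping that is easy to miss
in the x86 hunk above: per-cpu offsets are no longer cpu * unit_size
but go through pcpu_unit_map[]. A minimal sketch of the address
computation follows; unit_addr() is a hypothetical helper, while
pcpu_unit_map and pcpu_unit_size are the real variables used by this
patch.

/*
 * Sketch only - how an address inside the first chunk is located once
 * cpu -> unit is no longer the identity mapping.  With the two-node,
 * eight-cpu example above the map could simply be { 0,1,2,3,4,5,6,7 },
 * but on an asymmetric box units may be skipped, so the explicit
 * lookup is required.
 */
static inline void *unit_addr(void *base, const int *pcpu_unit_map,
			      size_t pcpu_unit_size, unsigned int cpu)
{
	return (char *)base + pcpu_unit_map[cpu] * pcpu_unit_size;
}
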

2009-06-24 13:32:47

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/10] percpu: reorder a few functions in mm/percpu.c

(de)populate functions are about to be reimplemented to drop
pcpu_chunk->page array. Move a few functions so that the rewrite
patch doesn't have code movement making it more difficult to read.

[ Impact: code movement ]

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
mm/percpu.c | 90 +++++++++++++++++++++++++++++-----------------------------
1 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 452d3f3..770db98 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,12 +181,6 @@ static int pcpu_page_idx(unsigned int cpu, int page_idx)
return cpu * pcpu_unit_pages + page_idx;
}

-static struct page **pcpu_chunk_pagep(struct pcpu_chunk *chunk,
- unsigned int cpu, int page_idx)
-{
- return &chunk->page[pcpu_page_idx(cpu, page_idx)];
-}
-
static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
unsigned int cpu, int page_idx)
{
@@ -194,6 +188,12 @@ static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
(pcpu_page_idx(cpu, page_idx) << PAGE_SHIFT);
}

+static struct page **pcpu_chunk_pagep(struct pcpu_chunk *chunk,
+ unsigned int cpu, int page_idx)
+{
+ return &chunk->page[pcpu_page_idx(cpu, page_idx)];
+}
+
static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
int page_idx)
{
@@ -583,6 +583,45 @@ static void pcpu_unmap(struct pcpu_chunk *chunk, int page_start, int page_end,
pcpu_chunk_addr(chunk, last, page_end));
}

+static int __pcpu_map_pages(unsigned long addr, struct page **pages,
+ int nr_pages)
+{
+ return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT,
+ PAGE_KERNEL, pages);
+}
+
+/**
+ * pcpu_map - map pages into a pcpu_chunk
+ * @chunk: chunk of interest
+ * @page_start: page index of the first page to map
+ * @page_end: page index of the last page to map + 1
+ *
+ * For each cpu, map pages [@page_start,@page_end) into @chunk.
+ * vcache is flushed afterwards.
+ */
+static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
+{
+ unsigned int last = num_possible_cpus() - 1;
+ unsigned int cpu;
+ int err;
+
+ /* map must not be done on immutable chunk */
+ WARN_ON(chunk->immutable);
+
+ for_each_possible_cpu(cpu) {
+ err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
+ pcpu_chunk_pagep(chunk, cpu, page_start),
+ page_end - page_start);
+ if (err < 0)
+ return err;
+ }
+
+ /* flush at once, please read comments in pcpu_unmap() */
+ flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
+ pcpu_chunk_addr(chunk, last, page_end));
+ return 0;
+}
+
/**
* pcpu_depopulate_chunk - depopulate and unmap an area of a pcpu_chunk
* @chunk: chunk to depopulate
@@ -632,45 +671,6 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int off, int size,
pcpu_unmap(chunk, unmap_start, unmap_end, flush);
}

-static int __pcpu_map_pages(unsigned long addr, struct page **pages,
- int nr_pages)
-{
- return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT,
- PAGE_KERNEL, pages);
-}
-
-/**
- * pcpu_map - map pages into a pcpu_chunk
- * @chunk: chunk of interest
- * @page_start: page index of the first page to map
- * @page_end: page index of the last page to map + 1
- *
- * For each cpu, map pages [@page_start,@page_end) into @chunk.
- * vcache is flushed afterwards.
- */
-static int pcpu_map(struct pcpu_chunk *chunk, int page_start, int page_end)
-{
- unsigned int last = num_possible_cpus() - 1;
- unsigned int cpu;
- int err;
-
- /* map must not be done on immutable chunk */
- WARN_ON(chunk->immutable);
-
- for_each_possible_cpu(cpu) {
- err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
- pcpu_chunk_pagep(chunk, cpu, page_start),
- page_end - page_start);
- if (err < 0)
- return err;
- }
-
- /* flush at once, please read comments in pcpu_unmap() */
- flush_cache_vmap(pcpu_chunk_addr(chunk, 0, page_start),
- pcpu_chunk_addr(chunk, last, page_end));
- return 0;
-}
-
/**
* pcpu_populate_chunk - populate and map an area of a pcpu_chunk
* @chunk: chunk of interest
--
1.6.0.2

2009-06-24 23:56:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, 24 Jun 2009 22:30:06 +0900
Tejun Heo <[email protected]> wrote:

> This patchset is available in the following git tree and will be
> published in for-next if there's no major objection. It might get
> rebased before going into for-next.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git lpage-numa

<tries to read the patches>

Boy, this stuff is complicated. Does it all work?

The Impact: lines were useful :)


I assume from the tremendous number of for_each_possible_cpu()s that
CPU hotplug awareness won't be happening.

Do we have a feeling for the amount of wastage here? If

num_possible_cpus() - num_online_cpus() == N

and N is large, what did it cost?

And what are reasonable values of N?

2009-06-25 00:02:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

> I assume from the tremendous number of for_each_possible_cpu()s that
> CPU hotplug awareness won't be happening.
>
> Do we have a feeling for the amount of wastage here? If
>
> num_possible_cpus() - num_online_cpus() == N

Haven't read the new patches, but per cpu data always was sized
for all possible CPUs.

> and N is large, what did it cost?

> And what are reasonable values of N?

N should normally not be large anymore, since num_possible_cpus()
is sized based on firmware information now.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-25 00:14:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Andi Kleen wrote:
>
> Haven't read the new patches, but per cpu data always was sized
> for all possible CPUs.
>
>> and N is large, what did it cost?
>
>> And what are reasonable values of N?
>
> N should normally not be large anymore, since num_possible_cpus()
> is sized based on firmware information now.
>

*Ahem* virtual machines *ahem*...

I have discussed this with Tejun, and the plan is to allocate the percpu
information when a processor is first brought online (but not removed
when it is offlined again). It's a real problem for 32-bit VMs, so it's
more important than you'd think.

-hpa

2009-06-25 02:36:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello,

Andrew Morton wrote:
> On Wed, 24 Jun 2009 22:30:06 +0900
> Tejun Heo <[email protected]> wrote:
>
>> This patchset is available in the following git tree and will be
>> published in for-next if there's no major objection. It might get
>> rebased before going into for-next.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git lpage-numa
>
> <tries to read the patches>
>
> Boy, this stuff is complicated. Does it all work?

I sure hope so.

> The Impact: lines were useful :)

Eh.. well, it looks like it's going the way of dodo tho.

> I assume from the tremendous number of for_each_possible_cpu()s that
> CPU hotplug awareness won't be happening.
>
> Do we have a feeling for the amount of wastage here? If
>
> num_possible_cpus() - num_online_cpus() == N
>
> and N is large, what did it cost?
>
> And what are reasonable values of N?

The goal is to eventually implement a has_ever_been_online_cpus mask
(any better naming?) and allocate only for those cpus. I think I
mentioned it in one of the patch descriptions, but anyway the unit_map
and lpage improvements implemented in this patchset will be used for
that purpose. The plan is to treat possible-but-offline cpus as if
they belong to a separate group so that they don't end up sharing the
same PMD page; later, when those cpus come up, the generic 4k mapping
can kick in and map them.
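
To make that plan concrete, here is a minimal sketch of the cpu-up side,
assuming a hypothetical cpu_ever_online_mask and a hypothetical
pcpu_populate_cpu() helper (neither exists in the tree; this only
illustrates the idea and is not code from this patchset):

/* hypothetical sketch -- nothing below exists yet */
static cpumask_t cpu_ever_online_mask;

int __cpuinit pcpu_prepare_cpu(unsigned int cpu)
{
        /* first time this cpu comes up? */
        if (cpumask_test_and_set_cpu(cpu, &cpu_ever_online_mask))
                return 0;       /* percpu area already allocated, keep it */

        /*
         * Map this cpu's unit with the generic 4k page mapping so it
         * doesn't have to share a PMD page with the units that were
         * set up at boot.
         */
        return pcpu_populate_cpu(cpu);  /* hypothetical helper */
}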

Thanks.

--
tejun

2009-06-25 09:19:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, Jun 24, 2009 at 05:13:37PM -0700, H. Peter Anvin wrote:
> Andi Kleen wrote:
> >
> > Haven't read the new patches, but per cpu data always was sized
> > for all possible CPUs.
> >
> >> and N is large, what did it cost?
> >
> >> And what are reasonable values of N?
> >
> > N should normally not be large anymore, since num_possible_cpus()
> > is sized based on firmware information now.
> >
>
> *Ahem* virtual machines *ahem*...

And? Even there it's typically not that big.

The traditional problem was just for NR_CPUS=128 kernels, where nothing
was sized based on machine capacity.

Also, on large systems the VMs shouldn't be sized for full capacity.

>
> I have discussed this with Tejun, and the plan is to allocate the percpu
> information when a processor is first brought online (but not removed
> when it is offlined again.) It's a real problem for 32-bit VMs, so it's
> more important than you'd think.

You have to rewrite all code that does for_each_possible_cpu (x)
in initialization then to use callbacks. It would be a gigantic change
all over the tree.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-25 09:20:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Tejun Heo <[email protected]> wrote:

> Andrew Morton wrote:
>
> > The Impact: lines were useful :)
>
> Eh.. well, it looks like it's going the way of dodo tho.

Yes. I found them useful too but some people (including Andrew ;)
were too argumentative about it so i asked you to remove them from
this patch-set. So yes, it's all going the way of the dodo.

And yes, i attribute the excellent stability of the x86 code in this
cycle partly to the effect of impact lines. They really helped us be
very conscious with dangerous changes.

Ingo

2009-06-25 19:27:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Andi Kleen wrote:
>>>
>> *Ahem* virtual machines *ahem*...
>
> And? Even there's not that big typically.
>
> The traditional problem was just for 128 NR_CPUs kernel were nothing
> was sized based on machine capacity.
>
> Also on large systems the VMs shouldn't be sized for full capacity.
>

We have already had cases where the "possible" CPUs have eaten up the
entire vmalloc area on 32 bits. In real use. It's a real problem.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-06-25 19:54:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Thu, Jun 25, 2009 at 07:18:07AM -0700, H. Peter Anvin wrote:
> Andi Kleen wrote:
> >>>
> >> *Ahem* virtual machines *ahem*...
> >
> > And? Even there's not that big typically.
> >
> > The traditional problem was just for 128 NR_CPUs kernel were nothing
> > was sized based on machine capacity.
> >
> > Also on large systems the VMs shouldn't be sized for full capacity.
> >
>
> We have already have cases where the "possible" CPUs have eaten up the
> entire vmalloc area on 32 bits. In real use. It's a real problem.

That's hard to believe, or there's a serious bug/misconfiguration somewhere.

Each per-cpu area should be <100k (let's say 200k with some slack for
modules), so to fill vmalloc you would need hundreds of CPUs, which a
32-bit kernel doesn't really support anyway because it doesn't support
enough memory for that many CPUs.
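
For scale, a rough back-of-the-envelope calculation, assuming the default
~128MB vmalloc window on 32-bit x86:

        128MB / 200KB per cpu  ~= 650 units
        128MB / 100KB per cpu  ~= 1300 units

so with a sane per-cpu footprint, vmalloc exhaustion only becomes
plausible at cpu counts far beyond what a 32-bit kernel is realistic for.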

Perhaps you had a firmware/hypervisor that passed a gigantic, impossible
value here? If yes thy

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-25 20:17:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Andi Kleen wrote:
>
> That's hard to believe or a serious bug/misconfiguration somewhere.
>
> Each per CPU data should be <100k (let's say 200k with some slack for modules),
> so to fill vmalloc you would need hundreds of CPUs, which a 32bit
> kernel doesn't really support anyways because it doesn't support
> enough memory for that many CPUs.
>
> Perhaps you had firmware/hypervisor who passed a gigantic impossible
> value here? If yes thy
>

I don't know why, but in the configuration I saw each percpu area was
something like 1.8 MB.

-hpa

2009-06-25 20:26:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

> I don't know why, but in the configuration I saw each percpu area was
> something like 1.8 MB.

That sounds like something that needs to be investigated and fixed then.
In a sense per cpu space is precious like stack space and shouldn't
be wasted.

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-26 00:40:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Andi Kleen wrote:
>> I don't know why, but in the configuration I saw each percpu area was
>> something like 1.8 MB.
>
> That sounds like something that needs to be investigated and fixed then.
> In a sense per cpu space is precious like stack space and shouldn't
> be wasted.

The large percpu space is most likely from lockdep.

Thanks.

--
tejun

2009-06-26 02:04:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Tejun Heo wrote:
> Andi Kleen wrote:
>>> I don't know why, but in the configuration I saw each percpu area was
>>> something like 1.8 MB.
>> That sounds like something that needs to be investigated and fixed then.
>> In a sense per cpu space is precious like stack space and shouldn't
>> be wasted.
>
> The large percpu space is most likely from lockdep.
>

That makes sense.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-06-26 06:54:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Fri, Jun 26, 2009 at 09:40:10AM +0900, Tejun Heo wrote:
> Andi Kleen wrote:
> >> I don't know why, but in the configuration I saw each percpu area was
> >> something like 1.8 MB.
> >
> > That sounds like something that needs to be investigated and fixed then.
> > In a sense per cpu space is precious like stack space and shouldn't
> > be wasted.
>
> The large percpu space is most likely from lockdep.

It would be better to fix lockdep than to add weird hacks for that
elsewhere.

e.g. it could just allocate buffers in a cpu up/down callback and
only put a pointer to these buffers into percpu.
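
A minimal sketch of that indirection pattern (the struct and function
names here are made up for illustration and are not the real lockdep
data structures):

#include <linux/percpu.h>
#include <linux/slab.h>

/* stand-in for a large debug-only per-cpu blob */
struct debug_cpu_state {
        unsigned long big_buffer[64 * 1024];
};

/* only a pointer lives in the percpu area now */
static DEFINE_PER_CPU(struct debug_cpu_state *, debug_cpu_data);

/* called from a CPU_UP_PREPARE notifier */
static int debug_prepare_cpu(unsigned int cpu)
{
        struct debug_cpu_state *s;

        if (per_cpu(debug_cpu_data, cpu))
                return 0;       /* this cpu has been up before */

        s = kzalloc(sizeof(*s), GFP_KERNEL);    /* the big buffer */
        if (!s)
                return -ENOMEM;
        per_cpu(debug_cpu_data, cpu) = s;
        return 0;
}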

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-29 23:21:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, 24 Jun 2009, Andrew Morton wrote:

> I assume from the tremendous number of for_each_possible_cpu()s that
> CPU hotplug awareness won't be happening.

Per cpu areas are allocated for all possible processors. No need to handle
offlining and onlining them.

2009-06-29 23:41:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Mon, 29 Jun 2009 19:20:53 -0400 (EDT)
Christoph Lameter <[email protected]> wrote:

> On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> > I assume from the tremendous number of for_each_possible_cpu()s that
> > CPU hotplug awareness won't be happening.
>
> Per cpu areas are allocated for all possible processors. No need to handle
> offlining and onlining them.

Well yes. My point is that this is a bug-not-a-feature ;)

2009-06-30 14:26:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Mon, 29 Jun 2009, Andrew Morton wrote:

> On Mon, 29 Jun 2009 19:20:53 -0400 (EDT)
> Christoph Lameter <[email protected]> wrote:
>
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> >
> > > I assume from the tremendous number of for_each_possible_cpu()s that
> > > CPU hotplug awareness won't be happening.
> >
> > Per cpu areas are allocated for all possible processors. No need to handle
> > offlining and onlining them.
>
> Well yes. My point is that this is a bug-not-a-feature ;)

It's a feature that I would like to exploit in the future to get rid of
some of the hotplug callbacks in the allocator.

Some state also needs to be kept for offlined processors. This is done in
per cpu data.

2009-06-30 19:15:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Andrew Morton <[email protected]> wrote:

> On Mon, 29 Jun 2009 19:20:53 -0400 (EDT)
> Christoph Lameter <[email protected]> wrote:
>
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> >
> > > I assume from the tremendous number of
> > > for_each_possible_cpu()s that CPU hotplug awareness won't be
> > > happening.
> >
> > Per cpu areas are allocated for all possible processors. No need
> > to handle offlining and onlining them.
>
> Well yes. My point is that this is a bug-not-a-feature ;)

Yeah, it's a bug for something like a virtual environment which
boots generic kernels that might have 64 possible CPUs (on a true
64-way system), but which will have fewer in practice.

It's pretty basic stuff: the on-demand allocation of percpu
resources.

Ingo

2009-06-30 19:40:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Tue, 30 Jun 2009, Ingo Molnar wrote:

> Yeah, it's a bug for something like a virtual environment which
> boots generic kernels that might have 64 possible CPUs (on a true
> 64-way system), but which will have fewer in practice.

A machine (and a virtual environment) can indicate via the BIOS tables or
ACPI that there are less "possible" cpus. That is actually very common.

The difference between actual and possible cpus only has to be the number
of processors that could be brought up later. In a regular system that is
pretty much zero. In a fancy system with actual hotpluggable cpus there
would be a difference but usually the number of hotpluggable cpus is
minimal.

2009-06-30 20:37:16

by Scott Lurndal

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Tue, Jun 30, 2009 at 03:39:52PM -0400, Christoph Lameter wrote:
> On Tue, 30 Jun 2009, Ingo Molnar wrote:
>
> > Yeah, it's a bug for something like a virtual environment which
> > boots generic kernels that might have 64 possible CPUs (on a true
> > 64-way system), but which will have fewer in practice.
>
> A machine (and a virtual environment) can indicate via the BIOS tables or
> ACPI that there are less "possible" cpus. That is actually very common.
>
> The difference between actual and possible cpus only has to be the number
> of processors that could be brought up later. In a regular system that is
> pretty much zero. In a fancy system with actual hotpluggable cpus there
> would be a difference but usually the number of hotpluggable cpus is
> minimal.

A hypervisor running on a 4-socket Beckton platform will have 32
(without hyperthreading) or 64 (with hyperthreading) CPUs to
allocate in 1Q2010. A hypervisor supporting ACPI hot-plug would then be
able to hot-plug from one to 64 cores into a single Linux guest if it
wanted to (and didn't overcommit, so gang-scheduling wasn't required).

Such a hypervisor might typically pass the maximum number of
CPUs in via the guest ACPI tables when the guest boots, to provide
the maximum flexibility in being able to grow and shrink the guest,
with only a subset marked as initially present.

I've been working recently with a 128 (shanghai) / 192 (istanbul)
core shared memory system which supports hot plugging CPUs into Linux,
and the feature would be less useful if a hot-plug event could fail
because the per-cpu area couldn't be allocated at the time of the
event (which argues for pre-allocation, if a large physically contiguous
region is required).

That said, I don't see much use for 32-bit kernels in environments with
such CPU counts.

scott

2009-06-30 21:33:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Christoph Lameter <[email protected]> wrote:

> On Tue, 30 Jun 2009, Ingo Molnar wrote:
>
> > Yeah, it's a bug for something like a virtual environment which
> > boots generic kernels that might have 64 possible CPUs (on a
> > true 64-way system), but which will have fewer in practice.

i think this bit should be quoted too, because it is the crux of the
issue:

> > It's pretty basic stuff: the on-demand allocation of percpu
> > resources.

> A machine (and a virtual environment) can indicate via the BIOS
> tables or ACPI that there are less "possible" cpus. That is
> actually very common.
>
> The difference between actual and possible cpus only has to be the
> number of processors that could be brought up later. In a regular
> system that is pretty much zero. In a fancy system with actual
> hotpluggable cpus there would be a difference but usually the
> number of hotpluggable cpus is minimal.

You are arguing against the concept of the demand-allocation of
resources, and i dont think that technical argument can be won.

Sure you dont have to demand-allocate if you know the demand
beforehand and can preallocate and size accordingly.

But what if not? What if the kernel can run on up to 4096 CPUs and
runs on a big box. Why should a virtual machine have the illogical
choice between either wasting a lot of RAM preallocating stuff, or
limiting its own extendability.

In other words: your proposed change in essence reduces the utility
of CPU hotplug. It's a bad idea.

Ingo

2009-06-30 22:17:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Tue, 30 Jun 2009, Ingo Molnar wrote:

> > The difference between actual and possible cpus only has to be the
> > number of processors that could be brought up later. In a regular
> > system that is pretty much zero. In a fancy system with actual
> > hotpluggable cpus there would be a difference but usually the
> > number of hotpluggable cpus is minimal.
>
> You are arguing against the concept of the demand-allocation of
> resources, and i dont think that technical argument can be won.

I looked at allocating for online cpus only a couple of years back, but at
that point per cpu state was kept for offlined cpus in per cpu areas. There are
numerous assumptions in per cpu handling all over the kernel that a percpu
area is always available. We successfully restricted it to only possible
cpus. ACPI may be the worst offender there. If you can get all of that
addressed then we can move to pure on demand allocation. Which also would
complicate a per cpu memory allocator.

> Sure you dont have to demand-allocate if you know the demand
> beforehand and can preallocate and size accordingly.

Well you know the demand from the BIOS information.

> But what if not? What if the kernel can run on up to 4096 CPUs and
> runs on a big box. Why should a virtual machine have the illogical
> choice between either wasting a lot of RAM preallocating stuff, or
> limiting its own extendability.

The kernel may be able to run on 4096, but the machine's config information
that is available via ACPI knows how many processors the machine we are
booting on is able to add.

> In other words: your proposed change in essence reduces the utility
> of CPU hotplug. It's a bad idea.

Change? I am talking about what is the case.

2009-06-30 22:31:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Christoph Lameter <[email protected]> wrote:

> On Tue, 30 Jun 2009, Ingo Molnar wrote:
>
> > But what if not? What if the kernel can run on up to 4096 CPUs
> > and runs on a big box. Why should a virtual machine have the
> > illogical choice between either wasting a lot of RAM
> > preallocating stuff, or limiting its own extendability.
>
> The kernel may be able to run on 4096 but the machines config
> information that is available via ACPI knows how many processors
> the machine we are booting on is able to add.

I think we might be talking past each other.

The usecase i'm talking about is to boot a generic,
many-CPUs-capable kernel in a guest image.

How would you allow that guest to stay on 2 virtual CPUs but still
be able to hot-plug many other CPUs if the guest context rises above
its original CPU utilization?

Ingo

2009-06-30 22:40:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

> How would you allow that guest to stay on 2 virtual CPUs but still
> be able to hot-plug many other CPUs if the guest context rises above
> its original CPU utilization?

Unless you're planning to rewrite lots of possible-cpu users all over
the tree, the only way is to keep the percpu area small and preallocate.

As long as the per-cpu data size stays reasonable (not more than 100-200k)
that's very doable. It probably won't work with 4096 guest CPUs without
wasting too much memory, but then I don't think we have any hypervisor
that scales to that many CPUs anyway, so it's not the biggest
concern. For the 128-CPU case it works (although I might need
to enlarge the vmalloc area a bit on 32-bit).

Unfortunately a few debugging subsystems seem to currently eat
much more, but those just need to be fixed to only allocate
state for actually running CPUs, not just possible ones.

I suspect we need a scripts/percpubloat.pl

-Andi

--
[email protected] -- Speaking for myself only.

2009-06-30 22:56:46

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, 1 Jul 2009, Ingo Molnar wrote:

> The usecase i'm talking about is to boot a generic,
> many-CPUs-capable kernel in a guest image.

The many-cpu-capable kernel is not the issue. The number of potential cpus
is config information provided by the hardware. In your case the host
system provides the number of possible processors.

> How would you allow that guest to stay on 2 virtual CPUs but still
> be able to hot-plug many other CPUs if the guest context rises above
> its original CPU utilization?

The use case is a guest that needs to add a couple of thousand processors?

That will require lots of memory and special configuration of the
simulated guest environment. Yes, that is an extreme case that will
require lots of per-cpu memory.

2009-06-30 23:08:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Christoph Lameter <[email protected]> wrote:

> On Wed, 1 Jul 2009, Ingo Molnar wrote:
>
> > The usecase i'm talking about is to boot a generic,
> > many-CPUs-capable kernel in a guest image.
>
> The many-cpu-capable kernel is not the issue. The number of
> potential cpus is config information provided by the hardware. In
> your case the host system provides the number of possible
> processors.

Say it says "256 CPUs".

But i only run a 2 CPU guest system. But i want to run it with the
_on demand capability_ to scale up to 256 virtual CPUs if needed,
without having to reboot the guest and without having to allocate
all the percpu space for 256 CPUs beforehand. Ok?

It is a simple, basic OS feature and concept.

Ingo

2009-06-30 23:20:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, 1 Jul 2009, Ingo Molnar wrote:

> Say it says "256 CPUs".

Ok then you found a bug in the virtualization software ;-).

2009-06-30 23:22:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello, Christoph.

Christoph Lameter wrote:
> I looked at allocating for online cpus only a couple of years back but at
> that per cpu state was kept for offlined cpus in per cpu areas. There are
> numerous assumptions in per cpu handling all over the kernel that a percpu
> area is always available.

The plan is to allocate and keep percpu areas for cpus which have ever
been up. There'll be no taking down of percpu areas. Conversion from
possible to has_ever_been_up should be much easier than possible ->
online. State keeping will work fine too.

> We successfully restricted it to only possible
> cpus. ACPI may be the worst offender there. If you can get all of that
> addressed then we can move to pure on demand allocation. Which also would
> complicate a per cpu memory allocator.

I don't think it will be too complex. The necessary bits are already
there and they are necessary for other stuff too, so...

Thanks.

--
tejun

2009-06-30 23:30:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Christoph Lameter <[email protected]> wrote:

> On Wed, 1 Jul 2009, Ingo Molnar wrote:
>
> > Say it says "256 CPUs".
>
> Ok then you found a bug in the virtualization software ;-).

No, it runs on a 256 CPUs host system, why shouldnt it be possible
to expose that _possible_ expansion space to a guest instance?

Ingo

2009-06-30 23:31:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support


* Tejun Heo <[email protected]> wrote:

> Hello, Christoph.
>
> Christoph Lameter wrote:

> > I looked at allocating for online cpus only a couple of years
> > back but at that per cpu state was kept for offlined cpus in per
> > cpu areas. There are numerous assumptions in per cpu handling
> > all over the kernel that a percpu area is always available.
>
> The plan is to allocate and keep percpu areas for cpus which have
> ever been up. There'll be no taking down of percpu areas.
> Conversion from possible to has_ever_been_up should be much easier
> than possible -> online. State keeping will work fine too.

That sounds like a very sane plan.

Ingo

2009-06-30 23:36:31

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Ingo Molnar wrote:
> * Tejun Heo <[email protected]> wrote:
>
>> Hello, Christoph.
>>
>> Christoph Lameter wrote:
>
>>> I looked at allocating for online cpus only a couple of years
>>> back but at that per cpu state was kept for offlined cpus in per
>>> cpu areas. There are numerous assumptions in per cpu handling
>>> all over the kernel that a percpu area is always available.
>> The plan is to allocate and keep percpu areas for cpus which have
>> ever been up. There'll be no taking down of percpu areas.
>> Conversion from possible to has_ever_been_up should be much easier
>> than possible -> online. State keeping will work fine too.
>
> That sounds like a very sane plan.
>
> Ingo

Yes, percpu area dynamic *de*-allocation would almost certainly be a
nightmare.

-hpa

2009-07-01 01:20:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello, Andi.

Andi Kleen wrote:
>> How would you allow that guest to stay on 2 virtual CPUs but still
>> be able to hot-plug many other CPUs if the guest context rises above
>> its original CPU utilization?
>
> (unless you're planning to rewrite lots of possible cpu users all over
> the tree) -- the only way is to keep the percpu area small and preallocate.
>
> As long as the per cpu data size stays reasonable (not more than a 100-200k)
> that's very doable. It probably won't work with 4096 guest CPUs without
> wasting too much memory, but then I don't think we have any Hypervisor
> that scales to that many CPUs anyways, so it's not the biggest
> concern. For the 128CPU case it works (although i might need
> to enlarge vmalloc area a bit on 32bit)

I don't see much reason why we should put an artificial limit on how much
percpu memory could be used. For lockdep, that much percpu memory
is actually necessary. Another layer of indirection surely can lessen
the pressure on the generic percpu implementation but the problem can
be solved by the generic code without too much difficulty.

Thanks.

--
tejun

2009-07-01 06:35:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

> Say it says "256 CPUs".
>
> But i only run a 2 CPU guest system. But i want to run it with the
> _on demand capability_ to scale up to 256 virtual CPUs if needed,
> without having to reboot the guest and without having to allocate
> all the percpu space for 256 CPUs beforehand. Ok?

Then you would need to fix lots of subsystems: everyone doing
for_each_possible_cpu() today that doesn't have a cpu notifier callback.

git/linux-2.6% gid for_each_possible_cpu | wc -l
247
git/linux-2.6% gid CPU_UP_PREPARE | wc -l
48

Essentially you would discard the concept of possible CPUs
and replace it only with online CPUs, but possible CPUs
are widely used.

It's unclear to me whether it's worth all that work (and how to test
it properly), assuming we fix the few per-cpu pigs today that make it
very expensive to preallocate percpu in the big debug configurations.

> It is a simple, basic OS feature and concept.

.. that Linux doesn't support without some heavy lifting.

Not answering emails will not change that fact.

-Andi
--
[email protected] -- Speaking for myself only.

2009-07-01 06:42:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

> I don't think it will be too complex. The necessary bits are already
> there and they are necessary for other stuff too, so...

Are we looking at a different source base? Here's a random example
using possible per cpu data I picked in current git: icmp.c

static int __net_init icmp_sk_init(struct net *net)
{
        int i, err;

        net->ipv4.icmp_sk =
                kzalloc(nr_cpu_ids * sizeof(struct sock *), GFP_KERNEL);
        if (net->ipv4.icmp_sk == NULL)
                return -ENOMEM;

        for_each_possible_cpu(i) {
                ... allocate per cpu socket and some other setup ...
        }
}

static void __net_exit icmp_sk_exit(struct net *net)
{
        int i;

        for_each_possible_cpu(i)
                inet_ctl_sock_destroy(net->ipv4.icmp_sk[i]);
        kfree(net->ipv4.icmp_sk);
        net->ipv4.icmp_sk = NULL;
}

You would need to convert that to use a CPU notifier and callbacks
setting up the sockets. Then make sure there are no races in all of
this. And get it somehow tested (where is the user base who
tests cpu hotplug?)
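
For what it's worth, the skeleton of such a conversion would look roughly
like the sketch below; icmp_sk_create_one()/icmp_sk_destroy_one() are
hypothetical helpers, and a real conversion would also have to walk every
struct net rather than just init_net, which is a big part of the pain:

static int __cpuinit icmp_cpu_callback(struct notifier_block *nb,
                                       unsigned long action, void *hcpu)
{
        int cpu = (long)hcpu;

        switch (action) {
        case CPU_UP_PREPARE:
        case CPU_UP_PREPARE_FROZEN:
                /* allocate and register the per-cpu socket for @cpu */
                if (icmp_sk_create_one(&init_net, cpu))  /* hypothetical */
                        return NOTIFY_BAD;
                break;
        case CPU_UP_CANCELED:
        case CPU_UP_CANCELED_FROZEN:
                icmp_sk_destroy_one(&init_net, cpu);     /* hypothetical */
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block icmp_cpu_notifier __cpuinitdata = {
        .notifier_call  = icmp_cpu_callback,
};

/* somewhere in init code: register_cpu_notifier(&icmp_cpu_notifier); */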

And there is lots of similar code all over the tree

-Andi
--
[email protected] -- Speaking for myself only.

2009-07-01 10:20:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello, Andi.

Andi Kleen wrote:
>> I don't think it will be too complex. The necessary bits are already
>> there and they are necessary for other stuff too, so...
>
> Are we looking at a different source base? Here's a random example
> using possible per cpu data I picked in current git: icmp.c

I was talking about percpu allocator proper. Yeap, the major work
would be in auditing and converting for_each_possible_cpu() users.

> static int __net_init icmp_sk_init(struct net *net)
> {
>         int i, err;
>
>         net->ipv4.icmp_sk =
>                 kzalloc(nr_cpu_ids * sizeof(struct sock *), GFP_KERNEL);
>         if (net->ipv4.icmp_sk == NULL)
>                 return -ENOMEM;
>
>         for_each_possible_cpu(i) {
>                 ... allocate per cpu socket and some other setup ...
>         }
> }
>
> static void __net_exit icmp_sk_exit(struct net *net)
> {
>         int i;
>
>         for_each_possible_cpu(i)
>                 inet_ctl_sock_destroy(net->ipv4.icmp_sk[i]);
>         kfree(net->ipv4.icmp_sk);
>         net->ipv4.icmp_sk = NULL;
> }
>
> You would need to convert that to use a CPU notifier and callbacks
> setting up the sockets. Then make sure there are no races in all of
> this. And get it somehow tested (where is the user base who
> tests cpu hotplug?)

Maybe it would be better to allocate percpu sockets as proper percpu
variables. Initialization would still need a callback mechanism tho. I
was thinking about adding an @init callback to percpu_alloc(), which
would be much simpler than doing a full cpu hotplug callback.
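
Just to make that idea concrete, such an interface might look something
like the sketch below; __alloc_percpu_init() and pcpu_init_fn_t do not
exist, they're made up here to illustrate the shape of the proposal:

/* hypothetical interface, not in the tree */
typedef void (*pcpu_init_fn_t)(void *ptr, unsigned int cpu, void *data);

void *__alloc_percpu_init(size_t size, size_t align,
                          pcpu_init_fn_t init, void *data);

/*
 * The allocator would invoke @init on each cpu's copy at allocation
 * time, and again whenever a new cpu's unit gets populated later, so
 * simple users wouldn't need their own hotplug notifier just to
 * initialize their per-cpu objects.
 */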

> And there is lots of similar code all over the tree

For static percpu variables, it'll be mostly about converting
for_each_possible_cpu() to for_each_used_cpu() as both allocation and
initialization can be handled by percpu proper. For dynamic areas,
allocation can be handled by percpu proper but cpus coming online
would need more work to convert. It'll take some effort but there
aren't too many alloc_percpu() users yet and I don't think it will be
too difficult. I wouldn't know for sure before I actually try tho.

Thanks.

--
tejun

2009-07-01 12:23:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, Jul 01, 2009 at 07:21:57PM +0900, Tejun Heo wrote:
> > using possible per cpu data I picked in current git: icmp.c
>
> I was talking about percpu allocator proper. Yeap, the major work
> would be in auditing and converting for_each_possible_cpu() users.

And testing; that's the hard part. CPU hotplug is normally not well
tested. Any code change that requires lots of new code for
it will be a problem because that code will then likely bitrot.

> > kfree(net->ipv4.icmp_sk);
> > net->ipv4.icmp_sk = NULL;
> > }
> >
> > You would need to convert that to use a CPU notifier and callbacks
> > setting up the sockets. Then make sure there are no races in all of
> > this. And get it somehow tested (where is the user base who
> > tests cpu hotplug?)
>
> Maybe it would be better to allocate percpu sockets as proper percpu
> variables.

That would be like percpu dentries.

A socket is a very complex structure. Trying to pull it out
of its normal allocators would be a bad idea.

I actually had a patch some time ago to move this one over
to callbacks, but it was already difficult, and I dropped
it.

> Initialization would still need callback mechanism tho. I
> was thinking about adding @init callback to percpu_alloc(), which
> would be much simpler than doing full cpu hotplug callback.

I'm not sure it will actually be simpler. It's not just
initializing the structure, but all the other setup to make
it known to the world.

>
> > And there is lots of similar code all over the tree
>
> For static percpu variables, it'll be mostly about converting
> for_each_possible_cpu() to for_each_used_cpu() as both allocation and
> initialization can be handled by percpu proper. For dynamic areas,

That's the easy part, but you would still need all the callbacks
for the extension case.

> allocation can be handled by percpu proper but cpus coming online
> would need more work to convert. It'll take some effort but there
> aren't too many alloc_percpu() users yet and I don't think it will be

alloc_percpu()? It affects all DEFINE_PER_CPU users. My current
tree has hundreds.

> too difficult. I wouldn't know for sure before I actually try tho.

I think it's clear that you haven't tried yet :)

I wrote quite a few per-cpu callback handlers over the years and
in my experience they are all nasty code with subtle races. The problem
is that instead of having a single subset init function which
is just single-threaded and doesn't need to worry about races,
you now have multi-threaded init, which tends to be a can of worms.

I think a far saner strategy than rewriting every user of DEFINE_PER_CPU,
ending up with lots of badly tested code, is to:

- Fix the few large percpu pigs that are problematic today to
  allocate in a callback.
- Then, once the per-cpu data in all configurations is <200k (better
  <100k in the non-debug builds), just keep pre-allocating like we always did.
- Possibly adjust the vmalloc area on 32-bit based on the possible
  CPU map, at the cost of the direct mapping, to make sure there's always
  enough mapping space.

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-01 12:51:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Hello, Andi.

Andi Kleen wrote:
> On Wed, Jul 01, 2009 at 07:21:57PM +0900, Tejun Heo wrote:
>>> using possible per cpu data I picked in current git: icmp.c
>> I was talking about percpu allocator proper. Yeap, the major work
>> would be in auditing and converting for_each_possible_cpu() users.
>
> and testing. that's the hard part. cpu hotplug is normally not well
> tested. Any code change that requires lots of new code for
> it will be a problem because that code will then likely bitrot.

It would be nice to have something to test cpu on/offlining
automatically. Something which keeps bringing cpus up and down as the
system goes through stress testing.

[--snip--]
>> too difficult. I wouldn't know for sure before I actually try tho.
>
> I think it's clear that you haven't tried yet :)

No, I haven't yet. I had a pretty good idea about how to implement it
in the percpu allocator but haven't really looked at the users. So, yeap,
it's quite possible that I'm underestimating the problem. Oh well,
let's see. Thanks for the warnings.

> I wrote quite a few per cpu callback handlers over the years and
> in my experience they are all nasty code with subtle races. The problem
> is that instead of having a single subset init function which
> is just single threaded and doesn't need to worry about races
> you now have multi threaded init, which tends to be a can of worms.

I tried a couple (didn't end up sending them out) and yeah, they could
be quite painful. The bringing-up part usually isn't as painful as the
other way around tho.

> I think a far saner strategy than rewriting every user of DEFINE_PER_CPU,
> ending up with lots of badly tested code is to:

But I don't think it would be that drastic. Most users are quite
simple.

> - Fix the few large size percpu pigs that are problematic today to
> allocate in a callback.
> - Then once the per cpu data in all configurations is <200k (better
> <100 in the non debug builds) again just keep pre-allocating like we
> always did
> - Possibly adjust the vmalloc area on 32bit based on the possible
> CPU map at the cost of the direct mapping, to make sure there's
> always enough mapping space.

I think it's something we eventually need to do. There already are
cases where the lack of scalable and performant percpu allocation leads
to design restrictions, and between many-core cpus and virtualization
the requirements are becoming more varied.

Thanks.

--
tejun

2009-07-01 13:11:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, Jul 01, 2009 at 09:53:06PM +0900, Tejun Heo wrote:
> It would be nice to have something to test cpu on/offlining
> automatically. Something which keeps bringing cpus up and down as the
> system goes through stress testing.

That's a trivial shell script using echo into sysfs files. It doesn't
seem to be widely done (the last time I tried it I promptly ran
into some RCU bug). You need a large enough machine for it.
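
Something along these lines, for example; the shell version is just a
loop echoing 0/1 into /sys/devices/system/cpu/cpuN/online, and this is
an equivalent self-contained userspace sketch (the cpu range is an
assumption, adjust it for the machine):

#include <stdio.h>
#include <unistd.h>

static void set_online(int cpu, int on)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/online", cpu);
        f = fopen(path, "w");
        if (!f)
                return;         /* e.g. cpu0 usually can't be offlined */
        fprintf(f, "%d\n", on);
        fclose(f);
}

int main(void)
{
        int cpu;

        for (;;) {
                for (cpu = 1; cpu < 8; cpu++) {
                        set_online(cpu, 0);     /* take it offline */
                        sleep(1);
                        set_online(cpu, 1);     /* bring it back */
                        sleep(1);
                }
        }
        return 0;
}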

But most stress testing does not actually have good code coverage
in my experience. It just runs the same small set of core code over and
over (you can check now, kernel gcov is finally in).

The tricky part is to actually test the code you want to test.

> > ending up with lots of badly tested code is to:
>
> But I don't think it would be that drastic. Most users are quite
> simple.

But how do you test them properly? And educate
the driver writers? Also it would likely increase code sizes drastically.

Philosophically, I think code like that should be a simple
operation, and turning all the per-cpu init code into
callbacks is not simple. That makes everything more error-prone.

And it's imho unclear if that is all worth it just to avoid
wasting some memory in the "256 possible CPUs" case (which
I doubt is particularly realistic anyways, at least I don't
know of any Hypervisor today that scales to 256 CPUs)

-Andi

--
[email protected] -- Speaking for myself only.

2009-07-01 17:33:53

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

On Wed, 1 Jul 2009, Andi Kleen wrote:

> And it's imho unclear if that is all worth it just to avoid
> wasting some memory in the "256 possible CPUs" case (which
> I doubt is particularly realistic anyways, at least I don't
> know of any Hypervisor today that scales to 256 CPUs)

I basically agree. It's not worth it given the rare cases where this
matters. It will be a lot of code with callbacks in each subsystem.

One of the motivations of working on revising the percpu handling for
me was that we could get rid of these screwy callbacks that are rarely
tested and cause all sorts of other issues with locking and serialization.

2009-07-01 22:43:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Christoph Lameter wrote:
> On Wed, 1 Jul 2009, Andi Kleen wrote:
>
>> And it's imho unclear if that is all worth it just to avoid
>> wasting some memory in the "256 possible CPUs" case (which
>> I doubt is particularly realistic anyways, at least I don't
>> know of any Hypervisor today that scales to 256 CPUs)
>
> I basically agree. Its not worth it given the rare cases where this
> matters. It will be a lot of code with callbacks in each subsystem.
>
> One of the motivations of working on revising the percpu handling for
> me was that we could get rid of these screwy callbacks that are rarely
> tested and cause all sorts of other issues with locking and serialization.

Hmmm.... yeah. I have to agree that callbacks are nasty and requiring
all users to use callbacks wouldn't be very nice. Once the current
dust settles down, I'll look around and see whether this can be solved
in some reasonable way.

Thanks.

--
tejun

2009-07-03 23:15:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] percpu: generalize first chunk allocators and improve lpage NUMA support

Tejun Heo wrote:
> Hello,
>
> This patchset is combination of the following two patchsets.
>
> [1] x86,percpu: generalize 4k and lpage allocator
> [2] percpu: teach lpage allocator about NUMA

Alpha changes from linus#master have been pulled into #for-next, and this
patchset has been published in #for-next on top of that.

Thanks.

--
tejun

2009-07-13 10:21:28

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 04/10] percpu: make 4k first chunk allocator map memory

Tejun Heo <[email protected]> wrote:

> + pr_info("PERCPU: %d 4k pages per cpu, static data %zu bytes\n",
> + pcpu4k_unit_pages, static_size);

It occurs to me that this may include a bad assumption. Page size need not
be 4K. I don't know that it matters, but looking at this:

> + vm.size = num_possible_cpus() * pcpu4k_unit_pages << PAGE_SHIFT;
> + vm_area_register_early(&vm, PAGE_SIZE);

It might.

David

2009-07-15 03:18:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/10] percpu: make 4k first chunk allocator map memory

Hello,

David Howells wrote:
> Tejun Heo <[email protected]> wrote:
>
>> + pr_info("PERCPU: %d 4k pages per cpu, static data %zu bytes\n",
>> + pcpu4k_unit_pages, static_size);
>
> It occurs to me that this may include a bad assumption. Page size need not
> be 4K. I don't know that it matters, but looking at this:
>
>> + vm.size = num_possible_cpus() * pcpu4k_unit_pages << PAGE_SHIFT;
>> + vm_area_register_early(&vm, PAGE_SIZE);
>
> It might.

Yeah, I realized that after trying to convert powerpc64 and ia64 and
renamed it to page. I'll soon post the patches.

Thanks.

--
tejun