Hello, all.
This patchset converts ia64 to the dynamic percpu allocator and drops the
now unused old percpu allocator. It contains the following four patches.
0001-vmalloc-rename-local-variables-vmalloc_start-and-vma.patch
0002-ia64-allocate-percpu-area-for-cpu0-like-percpu-areas.patch
0003-ia64-convert-to-dynamic-percpu-allocator.patch
0004-percpu-kill-legacy-percpu-allocator.patch
0001 is a misc prep patch to avoid a macro / local variable collision.
0002 makes ia64 allocate the percpu area for cpu0 the same way it does for
the other cpus. 0003 converts ia64 to the dynamic percpu allocator, and
0004 drops the now unused legacy allocator.
The contig memory model was verified with the ski emulator. The discontig
and sparse models were verified on a 4-way SGI Altix machine. I've run the
percpu stress test module on that machine for quite a while.
Mike Travis, it would be great if you can test this on your machine.
I'd really like to see how it would behave on a machine with that many
NUMA nodes.
This patchset is available in the following git tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git convert-ia64
Hmmm... kernel.org seems slow to sync today. If the branch isn't
mirrored, please pull from master.
Thanks.
arch/ia64/Kconfig | 3
arch/ia64/kernel/setup.c | 12 --
arch/ia64/kernel/vmlinux.lds.S | 11 +-
arch/ia64/mm/contig.c | 87 ++++++++++++++++----
arch/ia64/mm/discontig.c | 120 +++++++++++++++++++++++++--
include/linux/percpu.h | 24 -----
kernel/module.c | 150 ----------------------------------
mm/Makefile | 4
mm/allocpercpu.c | 177 -----------------------------------------
mm/percpu.c | 2
mm/vmalloc.c | 16 +--
11 files changed, 193 insertions(+), 413 deletions(-)
--
tejun
ia64 defines a global variable vmalloc_end and VMALLOC_END as an alias to
it, so declaring a local variable named vmalloc_end and initializing it
from VMALLOC_END produces a bogus self-initialization like the following.
const unsigned long vmalloc_end = vmalloc_end & ~(align - 1);
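For illustration, the alias comes from the ia64 headers, roughly like the
following (a paraphrased sketch, not the verbatim header):

	/* ia64 (sketch): VMALLOC_END is a macro aliasing a global variable */
	extern unsigned long vmalloc_end;
	#define VMALLOC_END	vmalloc_end

so the local declaration above preprocesses into an initializer that reads
the not-yet-initialized local itself.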
Rename local variables vmalloc_start and vmalloc_end to vm_start and
vm_end to avoid the collision.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
mm/vmalloc.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 204b824..416e7fe 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1901,13 +1901,13 @@ static unsigned long pvm_determine_end(struct vmap_area **pnext,
struct vmap_area **pprev,
unsigned long align)
{
- const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+ const unsigned long vm_end = VMALLOC_END & ~(align - 1);
unsigned long addr;
if (*pnext)
- addr = min((*pnext)->va_start & ~(align - 1), vmalloc_end);
+ addr = min((*pnext)->va_start & ~(align - 1), vm_end);
else
- addr = vmalloc_end;
+ addr = vm_end;
while (*pprev && (*pprev)->va_end > addr) {
*pnext = *pprev;
@@ -1946,8 +1946,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
const size_t *sizes, int nr_vms,
size_t align, gfp_t gfp_mask)
{
- const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
- const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+ const unsigned long vm_start = ALIGN(VMALLOC_START, align);
+ const unsigned long vm_end = VMALLOC_END & ~(align - 1);
struct vmap_area **vas, *prev, *next;
struct vm_struct **vms;
int area, area2, last_area, term_area;
@@ -1983,7 +1983,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
}
last_end = offsets[last_area] + sizes[last_area];
- if (vmalloc_end - vmalloc_start < last_end) {
+ if (vm_end - vm_start < last_end) {
WARN_ON(true);
return NULL;
}
@@ -2008,7 +2008,7 @@ retry:
end = start + sizes[area];
if (!pvm_find_next_prev(vmap_area_pcpu_hole, &next, &prev)) {
- base = vmalloc_end - last_end;
+ base = vm_end - last_end;
goto found;
}
base = pvm_determine_end(&next, &prev, align) - end;
@@ -2021,7 +2021,7 @@ retry:
* base might have underflowed, add last_end before
* comparing.
*/
- if (base + last_end < vmalloc_start + last_end) {
+ if (base + last_end < vm_start + last_end) {
spin_unlock(&vmap_area_lock);
if (!purged) {
purge_vmap_area_lazy();
--
1.6.4.2
cpu0 used a special percpu area reserved by the linker, __cpu0_per_cpu,
which is set up early in boot by head.S. However, this doesn't guarantee
that the area will be on the same node as cpu0, and the percpu area for
cpu0 ends up very far away from the percpu areas for the other cpus, which
causes problems for the congruent percpu allocator.
This patch makes percpu area initialization allocate a percpu area for
cpu0 like for any other cpu and copy it from __cpu0_per_cpu, which now
resides in the __init area. This means that cpu0's percpu area is first
set up at __cpu0_per_cpu early by head.S and then moved to an area in the
linear mapping during memory initialization; taking a pointer to a percpu
variable between head.S and memory initialization is not allowed.
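For reference, the core of the move in the contig case boils down to the
following (condensed from the contig.c hunk below; the loop over the other
cpus and the bookkeeping are omitted):

	/* copy the early cpu0 area out of __init into the bootmem area */
	memcpy(cpu_data, __cpu0_per_cpu, __per_cpu_end - __per_cpu_start);
	__per_cpu_offset[0] = (char *)cpu_data - __per_cpu_start;
	per_cpu(local_per_cpu_offset, 0) = __per_cpu_offset[0];
	/* repoint ar.k3 so cpu0's virtual percpu mapping follows the move */
	ia64_set_kr(IA64_KR_PER_CPU_DATA,
		    __pa(cpu_data) - (unsigned long)__per_cpu_start);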
Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
arch/ia64/kernel/vmlinux.lds.S | 11 +++++----
arch/ia64/mm/contig.c | 47 +++++++++++++++++++++++++--------------
arch/ia64/mm/discontig.c | 35 ++++++++++++++++++++---------
3 files changed, 60 insertions(+), 33 deletions(-)
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 0a0c77b..1295ba3 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -166,6 +166,12 @@ SECTIONS
}
#endif
+#ifdef CONFIG_SMP
+ . = ALIGN(PERCPU_PAGE_SIZE);
+ __cpu0_per_cpu = .;
+ . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
+#endif
+
. = ALIGN(PAGE_SIZE);
__init_end = .;
@@ -198,11 +204,6 @@ SECTIONS
data : { } :data
.data : AT(ADDR(.data) - LOAD_OFFSET)
{
-#ifdef CONFIG_SMP
- . = ALIGN(PERCPU_PAGE_SIZE);
- __cpu0_per_cpu = .;
- . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
-#endif
INIT_TASK_DATA(PAGE_SIZE)
CACHELINE_ALIGNED_DATA(SMP_CACHE_BYTES)
READ_MOSTLY_DATA(SMP_CACHE_BYTES)
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 2f724d2..9493bbf 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -154,36 +154,49 @@ static void *cpu_data;
void * __cpuinit
per_cpu_init (void)
{
- int cpu;
- static int first_time=1;
+ static bool first_time = true;
+ void *cpu0_data = __cpu0_per_cpu;
+ unsigned int cpu;
+
+ if (!first_time)
+ goto skip;
+ first_time = false;
/*
* get_free_pages() cannot be used before cpu_init() done. BSP
* allocates "NR_CPUS" pages for all CPUs to avoid that AP calls
* get_zeroed_page().
*/
- if (first_time) {
- void *cpu0_data = __cpu0_per_cpu;
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ void *src = cpu == 0 ? cpu0_data : __phys_per_cpu_start;
- first_time=0;
+ memcpy(cpu_data, src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)cpu_data - __per_cpu_start;
+ per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
- __per_cpu_offset[0] = (char *) cpu0_data - __per_cpu_start;
- per_cpu(local_per_cpu_offset, 0) = __per_cpu_offset[0];
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA, __pa(cpu_data) -
+ (unsigned long)__per_cpu_start);
- for (cpu = 1; cpu < NR_CPUS; cpu++) {
- memcpy(cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char *) cpu_data - __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
- }
+ cpu_data += PERCPU_PAGE_SIZE;
}
+skip:
return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
}
static inline void
alloc_per_cpu_data(void)
{
- cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS-1,
+ cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}
#else
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index d85ba98..35a61ec 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -143,17 +143,30 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
int cpu;
for_each_possible_early_cpu(cpu) {
- if (cpu == 0) {
- void *cpu0_data = __cpu0_per_cpu;
- __per_cpu_offset[cpu] = (char*)cpu0_data -
- __per_cpu_start;
- } else if (node == node_cpuid[cpu].nid) {
- memcpy(__va(cpu_data), __phys_per_cpu_start,
- __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char*)__va(cpu_data) -
- __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- }
+ void *src = cpu == 0 ? __cpu0_per_cpu : __phys_per_cpu_start;
+
+ if (node != node_cpuid[cpu].nid)
+ continue;
+
+ memcpy(__va(cpu_data), src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)__va(cpu_data) -
+ __per_cpu_start;
+
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA,
+ (unsigned long)cpu_data -
+ (unsigned long)__per_cpu_start);
+
+ cpu_data += PERCPU_PAGE_SIZE;
}
#endif
return cpu_data;
--
1.6.4.2
Unlike other archs, ia64 reserves space for percpu areas during early
memory initialization. These areas occupy a contiguous region indexed by
cpu number on the contiguous memory model, or are grouped by node on the
discontiguous memory model.
As allocation and initialization are done by the arch code, all that
setup_per_cpu_areas() needs to do is communicate the determined layout to
the percpu allocator. This patch implements setup_per_cpu_areas() for
both the contig and discontig memory models and drops
HAVE_LEGACY_PER_CPU_AREA.
Please note that for the contig model, the allocation itself is modified
only to allocate for possible cpus instead of NR_CPUS. As the dynamic
percpu allocator can handle non-direct mappings, there's no reason to
allocate memory for cpus which aren't possible.
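In essence, both implementations just describe the already-initialized
layout to the allocator; the contig variant below reduces to roughly the
following (condensed sketch, error handling omitted):

	ai = pcpu_alloc_alloc_info(1, num_possible_cpus());
	gi = &ai->groups[0];
	for_each_possible_cpu(cpu)			/* consecutive units */
		gi->cpu_map[gi->nr_units++] = cpu;
	ai->static_size   = __per_cpu_end - __per_cpu_start;
	ai->reserved_size = PERCPU_MODULE_RESERVE;
	ai->dyn_size      = PERCPU_DYNAMIC_RESERVE;
	ai->unit_size     = PERCPU_PAGE_SIZE;
	ai->atom_size     = PAGE_SIZE;
	ai->alloc_size    = PERCPU_PAGE_SIZE;
	rc = pcpu_setup_first_chunk(ai, __per_cpu_start + __per_cpu_offset[0]);
	pcpu_free_alloc_info(ai);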
Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
arch/ia64/Kconfig | 3 --
arch/ia64/kernel/setup.c | 12 ------
arch/ia64/mm/contig.c | 50 ++++++++++++++++++++++++---
arch/ia64/mm/discontig.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 130 insertions(+), 20 deletions(-)
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 011a1cd..e624611 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -89,9 +89,6 @@ config GENERIC_TIME_VSYSCALL
bool
default y
-config HAVE_LEGACY_PER_CPU_AREA
- def_bool y
-
config HAVE_SETUP_PER_CPU_AREA
def_bool y
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 1de86c9..42f8a18 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -856,18 +856,6 @@ identify_cpu (struct cpuinfo_ia64 *c)
}
/*
- * In UP configuration, setup_per_cpu_areas() is defined in
- * include/linux/percpu.h
- */
-#ifdef CONFIG_SMP
-void __init
-setup_per_cpu_areas (void)
-{
- /* start_kernel() requires this... */
-}
-#endif
-
-/*
* Do the following calculations:
*
* 1. the max. cache line size.
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 9493bbf..c86ce4f 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -163,11 +163,11 @@ per_cpu_init (void)
first_time = false;
/*
- * get_free_pages() cannot be used before cpu_init() done. BSP
- * allocates "NR_CPUS" pages for all CPUs to avoid that AP calls
- * get_zeroed_page().
+ * get_free_pages() cannot be used before cpu_init() done.
+ * BSP allocates PERCPU_PAGE_SIZE bytes for all possible CPUs
+ * to avoid that AP calls get_zeroed_page().
*/
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ for_each_possible_cpu(cpu) {
void *src = cpu == 0 ? cpu0_data : __phys_per_cpu_start;
memcpy(cpu_data, src, __per_cpu_end - __per_cpu_start);
@@ -196,9 +196,49 @@ skip:
static inline void
alloc_per_cpu_data(void)
{
- cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
+ cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * num_possible_cpus(),
PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}
+
+/**
+ * setup_per_cpu_areas - setup percpu areas
+ *
+ * Arch code has already allocated and initialized percpu areas. All
+ * this function has to do is to teach the determined layout to the
+ * dynamic percpu allocator, which happens to be more complex than
+ * creating whole new ones using helpers.
+ */
+void __init
+setup_per_cpu_areas(void)
+{
+ struct pcpu_alloc_info *ai;
+ struct pcpu_group_info *gi;
+ unsigned int cpu;
+ int rc;
+
+ ai = pcpu_alloc_alloc_info(1, num_possible_cpus());
+ if (!ai)
+ panic("failed to allocate pcpu_alloc_info");
+ gi = &ai->groups[0];
+
+ /* units are assigned consecutively to possible cpus */
+ for_each_possible_cpu(cpu)
+ gi->cpu_map[gi->nr_units++] = cpu;
+
+ /* set parameters */
+ ai->static_size = __per_cpu_end - __per_cpu_start;
+ ai->reserved_size = PERCPU_MODULE_RESERVE;
+ ai->dyn_size = PERCPU_DYNAMIC_RESERVE;
+ ai->unit_size = PERCPU_PAGE_SIZE;
+ ai->atom_size = PAGE_SIZE;
+ ai->alloc_size = PERCPU_PAGE_SIZE;
+
+ rc = pcpu_setup_first_chunk(ai, __per_cpu_start + __per_cpu_offset[0]);
+ if (rc)
+ panic("failed to setup percpu area (err=%d)", rc);
+
+ pcpu_free_alloc_info(ai);
+}
#else
#define alloc_per_cpu_data() do { } while (0)
#endif /* CONFIG_SMP */
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 35a61ec..69e9e91 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -172,6 +172,91 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
return cpu_data;
}
+#ifdef CONFIG_SMP
+/**
+ * setup_per_cpu_areas - setup percpu areas
+ *
+ * Arch code has already allocated and initialized percpu areas. All
+ * this function has to do is to teach the determined layout to the
+ * dynamic percpu allocator, which happens to be more complex than
+ * creating whole new ones using helpers.
+ */
+void __init setup_per_cpu_areas(void)
+{
+ struct pcpu_alloc_info *ai;
+ struct pcpu_group_info *uninitialized_var(gi);
+ unsigned int *cpu_map;
+ void *base;
+ unsigned long base_offset;
+ unsigned int cpu;
+ ssize_t static_size, reserved_size, dyn_size;
+ int node, prev_node, unit, nr_units, rc;
+
+ ai = pcpu_alloc_alloc_info(MAX_NUMNODES, nr_cpu_ids);
+ if (!ai)
+ panic("failed to allocate pcpu_alloc_info");
+ cpu_map = ai->groups[0].cpu_map;
+
+ /* determine base */
+ base = (void *)ULONG_MAX;
+ for_each_possible_cpu(cpu)
+ base = min(base,
+ (void *)(__per_cpu_offset[cpu] + __per_cpu_start));
+ base_offset = (void *)__per_cpu_start - base;
+
+ /* build cpu_map, units are grouped by node */
+ unit = 0;
+ for_each_node(node)
+ for_each_possible_cpu(cpu)
+ if (node == node_cpuid[cpu].nid)
+ cpu_map[unit++] = cpu;
+ nr_units = unit;
+
+ /* set basic parameters */
+ static_size = __per_cpu_end - __per_cpu_start;
+ reserved_size = PERCPU_MODULE_RESERVE;
+ dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
+ if (dyn_size < 0)
+ panic("percpu area overflow static=%zd reserved=%zd\n",
+ static_size, reserved_size);
+
+ ai->static_size = static_size;
+ ai->reserved_size = reserved_size;
+ ai->dyn_size = dyn_size;
+ ai->unit_size = PERCPU_PAGE_SIZE;
+ ai->atom_size = PAGE_SIZE;
+ ai->alloc_size = PERCPU_PAGE_SIZE;
+
+ /*
+ * CPUs are put into groups according to node. Walk cpu_map
+ * and create new groups at node boundaries.
+ */
+ prev_node = -1;
+ ai->nr_groups = 0;
+ for (unit = 0; unit < nr_units; unit++) {
+ cpu = cpu_map[unit];
+ node = node_cpuid[cpu].nid;
+
+ if (node == prev_node) {
+ gi->nr_units++;
+ continue;
+ }
+ prev_node = node;
+
+ gi = &ai->groups[ai->nr_groups++];
+ gi->nr_units = 1;
+ gi->base_offset = __per_cpu_offset[cpu] + base_offset;
+ gi->cpu_map = &cpu_map[unit];
+ }
+
+ rc = pcpu_setup_first_chunk(ai, base);
+ if (rc)
+ panic("failed to setup percpu area (err=%d)", rc);
+
+ pcpu_free_alloc_info(ai);
+}
+#endif
+
/**
* fill_pernode - initialize pernode data.
* @node: the node id.
--
1.6.4.2
With ia64 converted, there's no arch left which still uses the legacy
percpu allocator. Kill it.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/percpu.h | 24 -------
kernel/module.c | 150 ----------------------------------------
mm/Makefile | 4 -
mm/allocpercpu.c | 177 ------------------------------------------------
mm/percpu.c | 2 -
5 files changed, 0 insertions(+), 357 deletions(-)
delete mode 100644 mm/allocpercpu.c
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 878836c..5baf5b8 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -34,8 +34,6 @@
#ifdef CONFIG_SMP
-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
/* minimum unit size, also is the maximum supported allocation size */
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)
@@ -130,28 +128,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
#define per_cpu_ptr(ptr, cpu) SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))
extern void *__alloc_reserved_percpu(size_t size, size_t align);
-
-#else /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-struct percpu_data {
- void *ptrs[1];
-};
-
-/* pointer disguising messes up the kmemleak objects tracking */
-#ifndef CONFIG_DEBUG_KMEMLEAK
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-#else
-#define __percpu_disguise(pdata) (struct percpu_data *)(pdata)
-#endif
-
-#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = __percpu_disguise(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void *__pdata);
diff --git a/kernel/module.c b/kernel/module.c
index b6ee424..bac3fe8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -369,8 +369,6 @@ EXPORT_SYMBOL_GPL(find_module);
#ifdef CONFIG_SMP
-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
static void *percpu_modalloc(unsigned long size, unsigned long align,
const char *name)
{
@@ -394,154 +392,6 @@ static void percpu_modfree(void *freeme)
free_percpu(freeme);
}
-#else /* ... CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block. -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
- /* Reallocation required? */
- if (pcpu_num_used + 1 > pcpu_num_allocated) {
- int *new;
-
- new = krealloc(pcpu_size, sizeof(new[0])*pcpu_num_allocated*2,
- GFP_KERNEL);
- if (!new)
- return 0;
-
- pcpu_num_allocated *= 2;
- pcpu_size = new;
- }
-
- /* Insert a new subblock */
- memmove(&pcpu_size[i+1], &pcpu_size[i],
- sizeof(pcpu_size[0]) * (pcpu_num_used - i));
- pcpu_num_used++;
-
- pcpu_size[i+1] -= size;
- pcpu_size[i] = size;
- return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
- if (val < 0)
- return -val;
- return val;
-}
-
-static void *percpu_modalloc(unsigned long size, unsigned long align,
- const char *name)
-{
- unsigned long extra;
- unsigned int i;
- void *ptr;
- int cpu;
-
- if (align > PAGE_SIZE) {
- printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
- name, align, PAGE_SIZE);
- align = PAGE_SIZE;
- }
-
- ptr = __per_cpu_start;
- for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- /* Extra for alignment requirement. */
- extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
- BUG_ON(i == 0 && extra != 0);
-
- if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
- continue;
-
- /* Transfer extra to previous block. */
- if (pcpu_size[i-1] < 0)
- pcpu_size[i-1] -= extra;
- else
- pcpu_size[i-1] += extra;
- pcpu_size[i] -= extra;
- ptr += extra;
-
- /* Split block if warranted */
- if (pcpu_size[i] - size > sizeof(unsigned long))
- if (!split_block(i, size))
- return NULL;
-
- /* add the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_alloc(ptr + per_cpu_offset(cpu), size, 0,
- GFP_KERNEL);
-
- /* Mark allocated */
- pcpu_size[i] = -pcpu_size[i];
- return ptr;
- }
-
- printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
- size);
- return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
- unsigned int i;
- void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
- int cpu;
-
- /* First entry is core kernel percpu data. */
- for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- if (ptr == freeme) {
- pcpu_size[i] = -pcpu_size[i];
- goto free;
- }
- }
- BUG();
-
- free:
- /* remove the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_free(freeme + per_cpu_offset(cpu));
-
- /* Merge with previous? */
- if (pcpu_size[i-1] >= 0) {
- pcpu_size[i-1] += pcpu_size[i];
- pcpu_num_used--;
- memmove(&pcpu_size[i], &pcpu_size[i+1],
- (pcpu_num_used - i) * sizeof(pcpu_size[0]));
- i--;
- }
- /* Merge with next? */
- if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
- pcpu_size[i] += pcpu_size[i+1];
- pcpu_num_used--;
- memmove(&pcpu_size[i+1], &pcpu_size[i+2],
- (pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
- }
-}
-
-static int percpu_modinit(void)
-{
- pcpu_num_used = 2;
- pcpu_num_allocated = 2;
- pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
- GFP_KERNEL);
- /* Static in-kernel percpu data (used). */
- pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
- /* Free room. */
- pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
- if (pcpu_size[1] < 0) {
- printk(KERN_ERR "No per-cpu room for modules.\n");
- pcpu_num_used = 1;
- }
-
- return 0;
-}
-__initcall(percpu_modinit);
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
const char *secstrings)
diff --git a/mm/Makefile b/mm/Makefile
index ea4b18b..1195920 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,11 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
obj-$(CONFIG_SMP) += percpu.o
-else
-obj-$(CONFIG_SMP) += allocpercpu.o
-endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
deleted file mode 100644
index df34cea..0000000
--- a/mm/allocpercpu.c
+++ /dev/null
@@ -1,177 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bootmem.h>
-#include <asm/sections.h>
-
-#ifndef cache_line_size
-#define cache_line_size() L1_CACHE_BYTES
-#endif
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-static void percpu_depopulate(void *__pdata, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
-
- kfree(pdata->ptrs[cpu]);
- pdata->ptrs[cpu] = NULL;
-}
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-static void __percpu_depopulate_mask(void *__pdata, const cpumask_t *mask)
-{
- int cpu;
- for_each_cpu_mask_nr(cpu, *mask)
- percpu_depopulate(__pdata, cpu);
-}
-
-#define percpu_depopulate_mask(__pdata, mask) \
- __percpu_depopulate_mask((__pdata), &(mask))
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-static void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
- int node = cpu_to_node(cpu);
-
- /*
- * We should make sure each CPU gets private memory.
- */
- size = roundup(size, cache_line_size());
-
- BUG_ON(pdata->ptrs[cpu]);
- if (node_online(node))
- pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
- else
- pdata->ptrs[cpu] = kzalloc(size, gfp);
- return pdata->ptrs[cpu];
-}
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-static int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- cpumask_t populated;
- int cpu;
-
- cpus_clear(populated);
- for_each_cpu_mask_nr(cpu, *mask)
- if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
- __percpu_depopulate_mask(__pdata, &populated);
- return -ENOMEM;
- } else
- cpu_set(cpu, populated);
- return 0;
-}
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
- __percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-
-/**
- * alloc_percpu - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @align: alignment
- *
- * Allocate dynamic percpu area. Percpu objects are populated with
- * zeroed buffers.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- /*
- * We allocate whole cache lines to avoid false sharing
- */
- size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
- void *pdata = kzalloc(sz, GFP_KERNEL);
- void *__pdata = __percpu_disguise(pdata);
-
- /*
- * Can't easily make larger alignment work with kmalloc. WARN
- * on it. Larger alignment should only be used for module
- * percpu sections on SMP for which this path isn't used.
- */
- WARN_ON_ONCE(align > SMP_CACHE_BYTES);
-
- if (unlikely(!pdata))
- return NULL;
- if (likely(!__percpu_populate_mask(__pdata, size, GFP_KERNEL,
- &cpu_possible_map)))
- return __pdata;
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * free_percpu - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void free_percpu(void *__pdata)
-{
- if (unlikely(!__pdata))
- return;
- __percpu_depopulate_mask(__pdata, cpu_possible_mask);
- kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(free_percpu);
-
-/*
- * Generic percpu area setup.
- */
-#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-
-EXPORT_SYMBOL(__per_cpu_offset);
-
-void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- unsigned long nr_possible_cpus = num_possible_cpus();
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
- ptr = alloc_bootmem_pages(size * nr_possible_cpus);
-
- for_each_possible_cpu(i) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- ptr += size;
- }
-}
-#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
diff --git a/mm/percpu.c b/mm/percpu.c
index 43d8cac..adbc5a4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -46,8 +46,6 @@
*
* To use this allocator, arch code should do the followings.
*
- * - drop CONFIG_HAVE_LEGACY_PER_CPU_AREA
- *
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
* regular address to percpu pointer and back if they need to be
* different from the default
--
1.6.4.2
* Tejun Heo <[email protected]> wrote:
> Hello, all.
>
> This patchset converts ia64 to the dynamic percpu allocator and drops the
> now unused old percpu allocator. It contains the following four patches.
>
> 0001-vmalloc-rename-local-variables-vmalloc_start-and-vma.patch
> 0002-ia64-allocate-percpu-area-for-cpu0-like-percpu-areas.patch
> 0003-ia64-convert-to-dynamic-percpu-allocator.patch
> 0004-percpu-kill-legacy-percpu-allocator.patch
>
> 0001 is a misc prep patch to avoid a macro / local variable collision.
> 0002 makes ia64 allocate the percpu area for cpu0 the same way it does for
> the other cpus. 0003 converts ia64 to the dynamic percpu allocator, and
> 0004 drops the now unused legacy allocator.
>
> The contig memory model was verified with the ski emulator. The discontig
> and sparse models were verified on a 4-way SGI Altix machine. I've run the
> percpu stress test module on that machine for quite a while.
>
> Mike Travis, it would be great if you can test this on your machine.
> I'd really like to see how it would behave on a machine with that many
> NUMA nodes.
>
> This patchset is available in the following git tree.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git convert-ia64
>
> Hmmm... kernel.org seems slow to sync today. If the branch isn't
> mirrored, please pull from master.
>
> Thanks.
>
> arch/ia64/Kconfig | 3
> arch/ia64/kernel/setup.c | 12 --
> arch/ia64/kernel/vmlinux.lds.S | 11 +-
> arch/ia64/mm/contig.c | 87 ++++++++++++++++----
> arch/ia64/mm/discontig.c | 120 +++++++++++++++++++++++++--
> include/linux/percpu.h | 24 -----
> kernel/module.c | 150 ----------------------------------
> mm/Makefile | 4
> mm/allocpercpu.c | 177 -----------------------------------------
> mm/percpu.c | 2
> mm/vmalloc.c | 16 +--
> 11 files changed, 193 insertions(+), 413 deletions(-)
Kudos, really nice stuff!
Ingo
> The contig memory model was verified with the ski emulator. The discontig
> and sparse models were verified on a 4-way SGI Altix machine. I've run the
> percpu stress test module on that machine for quite a while.
Ski must have missed something. I just tried to boot this on a
"tiger_defconfig" kernel[1] and it panic'd early in boot. I'll need
to re-connect my serial console to get the useful part of the
panic message ... what's on the VGA console isn't very helpful :-(
-Tony
[1] uses contig.c
> Ski must have missed something. I just tried to boot this on a
> "tiger_defconfig" kernel[1] and it panic'd early in boot. I'll need
> to re-connect my serial console to get the useful part of the
> panic message ... what's on the VGA console isn't very helpful :-(
Ok. Here is the tail of the console log. The instruction at the
faulting address is a "ld8 r3=[r14]" and r14 is indeed 0x0.
... nothing apparently odd leading up to here ...
ACPI: Core revision 20090521
Boot processor id 0x0/0xc618
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
migration/0[3]: Oops 8813272891392 [1]
Modules linked in:
Pid: 3, CPU 0, comm: migration/0
psr : 00001010085a2018 ifs : 800000000000050e ip : [<a00000010006a470>] Not tainted (2.6.31-tiger-smp)
ip is at __wake_up_common+0xb0/0x120
unat: 0000000000000000 pfs : 000000000000030b rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000002941
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0930ffff00090000 ssd : 0930ffff00090000
b0 : a00000010006c1a0 b6 : a000000100080f20 b7 : a00000010000bc20
f6 : 1003e0000000000000000 f7 : 1003e0000000000000002
f8 : 1003e00000000a0722fad f9 : 1003e000000000cf5bbc7
f10 : 1003e081f5d33de276e7b f11 : 1003e0000000000000000
r1 : a000000100da9de0 r2 : 00000000fffedfeb r3 : e0000001c0210230
r8 : 00000010085a6018 r9 : 0000000000000001 r10 : ffffffffffff7100
r11 : ffffffffffff7100 r12 : e0000001c021fe00 r13 : e0000001c0210000
r14 : 0000000000000000 r15 : ffffffffffffffe8 r16 : a000000100bc4708
r17 : a000000100bcbe30 r18 : 0000000000000000 r19 : e0000001c0210be4
r20 : 0000000000000001 r21 : 0000000000000001 r22 : 0000000000000000
r23 : e0000001c0210038 r24 : e0000001c0210000 r25 : e000000180007090
r26 : a000000100076000 r27 : 00000010085a6018 r28 : 00000000ffffffff
r29 : e0000001c021001c r30 : 0000000000000000 r31 : ffffffffffff7120
Call Trace:
[<a000000100015a30>] show_stack+0x50/0xa0
sp=e0000001c021f9d0 bsp=e0000001c0210f00
[<a0000001000162a0>] show_regs+0x820/0x860
sp=e0000001c021fba0 bsp=e0000001c0210ea8
[<a00000010003abc0>] die+0x1a0/0x2c0
sp=e0000001c021fba0 bsp=e0000001c0210e68
[<a0000001000645d0>] ia64_do_page_fault+0x8b0/0x9e0
sp=e0000001c021fba0 bsp=e0000001c0210e18
[<a00000010000c420>] ia64_native_leave_kernel+0x0/0x270
sp=e0000001c021fc30 bsp=e0000001c0210e18
[<a00000010006a470>] __wake_up_common+0xb0/0x120
sp=e0000001c021fe00 bsp=e0000001c0210da0
[<a00000010006c1a0>] complete+0x60/0xa0
sp=e0000001c021fe00 bsp=e0000001c0210d70
[<a0000001000815a0>] migration_thread+0x680/0x700
sp=e0000001c021fe00 bsp=e0000001c0210ca0
[<a0000001000b9630>] kthread+0x110/0x140
sp=e0000001c021fe00 bsp=e0000001c0210c68
[<a000000100013cf0>] kernel_thread_helper+0x30/0x60
sp=e0000001c021fe30 bsp=e0000001c0210c40
[<a00000010000a0c0>] start_kernel_thread+0x20/0x40
sp=e0000001c021fe30 bsp=e0000001c0210c40
I just noticed the "#for-next" in the Subject line for these
patches. Do they depend on some stuff in linux-next that is
not in Linus' tree (pulled today HEAD=7fa07729e...)? If so, then
ignore my results.
Kernel built from generic_defconfig does boot OK though, so I suspect
this is a discontig vs. contig problem.
-Tony
Hello,
Luck, Tony wrote:
> I just noticed the "#for-next" in the Subject line for these
> patches. Do they depend on some stuff in linux-next that is
> not in Linus' tree (pulled today HEAD=7fa07729e...)? If so, then
> ignore my results.
Nope, it should work on top of Linus's tree.
> Kernel built from generic_defconfig does boot OK though, so I suspect
> this is a discontig vs. contig problem.
Yeah, it's probably something broken with contig SMP configuration.
I've just found a machine with contig mem and multiple processors.
Will test on it later today.
Thanks.
--
tejun
On Tue, 22 Sep 2009, Tejun Heo wrote:
> const unsigned long vmalloc_end = vmalloc_end & ~(align - 1);
>
> Rename local variables vmalloc_start and vmalloc_end to vm_start and
> vm_end to avoid the collision.
Could you keep vmalloc_end and vmalloc_start? vm_start and vm_end may lead
to misinterpretation as the start and end of the memory area covered by the
VM.
On Tue, 22 Sep 2009, Tejun Heo wrote:
>
> +#ifdef CONFIG_SMP
> + . = ALIGN(PERCPU_PAGE_SIZE);
> + __cpu0_per_cpu = .;
__per_cpu_start?
> + . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
> +#endif
This is a statically sized per-cpu area that is used by __get_cpu_var().
Data is accessed via a cpu-specific memory mapping. How does this work when
the area grows beyond PERCPU_PAGE_SIZE? As far as I can see, it seems
that __get_cpu_var() would then cause a memory fault?
Christoph Lameter wrote:
> On Tue, 22 Sep 2009, Tejun Heo wrote:
>
>> const unsigned long vmalloc_end = vmalloc_end & ~(align - 1);
>>
>> Rename local variables vmalloc_start and vmalloc_end to vm_start and
>> vm_end to avoid the collision.
>
> Could you keep vmalloc_end and vmalloc_start? vm_start and vm_end may lead
> to misinterpretation as the start and end of the memory area covered by the
> VM.
Hmmm... yeah, the right thing to do would be to either let ia64 use
VMALLOC_END directly as the variable name or have it alias an unlikely
symbol like ____vmalloc_end; a macro which ends up expanding to a
seemingly normal symbol like vmalloc_end is just rude. I'll change the
ia64 part.
Thanks.
--
tejun
Christoph Lameter wrote:
> On Tue, 22 Sep 2009, Tejun Heo wrote:
>
>> +#ifdef CONFIG_SMP
>> + . = ALIGN(PERCPU_PAGE_SIZE);
>> + __cpu0_per_cpu = .;
>
> __per_cpu_start?
>
>> + . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
>> +#endif
>
> This is a statically sized per-cpu area that is used by __get_cpu_var().
> Data is accessed via a cpu-specific memory mapping. How does this work when
> the area grows beyond PERCPU_PAGE_SIZE? As far as I can see, it seems
> that __get_cpu_var() would then cause a memory fault?
On ia64, the first chunk is fixed at PERCPU_PAGE_SIZE. It's something
hardwired into the page fault logic and the linker script. The build will
fail if the static + reserved area goes over PERCPU_PAGE_SIZE, and in
that case ia64 will need to update the special-case page fault logic
and increase PERCPU_PAGE_SIZE. The area reserved above is an interim
per-cpu area for cpu0 which is used between head.S and proper percpu
area setup and will be ditched once initialization is complete.
Thanks.
--
tejun
On Wed, 23 Sep 2009, Tejun Heo wrote:
> On ia64, the first chunk is fixed at PERCPU_PAGE_SIZE. It's something
> hardwired into the page fault logic and the linker script. The build will
> fail if the static + reserved area goes over PERCPU_PAGE_SIZE, and in
> that case ia64 will need to update the special-case page fault logic
> and increase PERCPU_PAGE_SIZE. The area reserved above is an interim
> per-cpu area for cpu0 which is used between head.S and proper percpu
> area setup and will be ditched once initialization is complete.
You did not answer my question.
The local percpu variables are accessed via a static per cpu
virtual mapping. You cannot place per cpu variables outside of that
virtual address range of PERCPU_PAGE_SIZE.
What happens if the percpu allocator allocates more data than available in
the reserved area?
Hello,
Christoph Lameter wrote:
> You did not answer my question.
Hmmm...
> The local percpu variables are accessed via a static per cpu
> virtual mapping. You cannot place per cpu variables outside of that
> virtual address range of PERCPU_PAGE_SIZE.
>
> What happens if the percpu allocator allocates more data than available in
> the reserved area?
I still don't understand your question. Static percpu variables are
always allocated from the first chunk inside that PERCPU_PAGE_SIZE
area. Dynamic allocations can go outside of that but they don't need
any special handling. What problems are you seeing?
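To make that concrete, dynamic percpu pointers go through exactly the same
arithmetic as static ones (this is the generic SMP definition kept in
include/linux/percpu.h, visible as context in patch 0004 above), so nothing
special happens when an allocation lives outside the first chunk:

	#define per_cpu_ptr(ptr, cpu)	SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))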
--
tejun
On Wed, 23 Sep 2009, Tejun Heo wrote:
> any special handling. What problems are you seeing?
Per-cpu variable access on IA64 does not use the percpu_offset for the
calculation of the current per-cpu data area. It's using a virtual mapping.
How does the new percpu allocator support this? Does it use different
methods of access for static and dynamic percpu access?
Christoph Lameter wrote:
> On Wed, 23 Sep 2009, Tejun Heo wrote:
>
>> any special handling. What problems are you seeing?
>
> Per-cpu variable access on IA64 does not use the percpu_offset for the
> calculation of the current per-cpu data area. It's using a virtual mapping.
>
> How does the new percpu allocator support this? Does it use different
> methods of access for static and dynamic percpu access?
That's only when the __ia64_per_cpu_var() macro is used in arch code,
which always references a static percpu variable in the kernel image
which falls inside PERCPU_PAGE_SIZE. For everything else, __my_cpu_offset
is defined as __ia64_per_cpu_var(local_per_cpu_offset) and regular
pointer offsetting is used.
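Roughly, the relevant definitions look like the following (a sketch from
memory of arch/ia64/include/asm/percpu.h; the exact macro bodies may
differ):

	/* arch-internal fast path, resolved through the fixed virtual mapping */
	#define __ia64_per_cpu_var(var)	per_cpu__##var
	/* generic accessors fall back to ordinary pointer + offset arithmetic */
	#define __my_cpu_offset		__ia64_per_cpu_var(local_per_cpu_offset)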
Thanks.
--
tejun
On Thu, 24 Sep 2009, Tejun Heo wrote:
> > How does the new percpu allocator support this? Does it use different
> > methods of access for static and dynamic percpu access?
>
> That's only when the __ia64_per_cpu_var() macro is used in arch code,
> which always references a static percpu variable in the kernel image
> which falls inside PERCPU_PAGE_SIZE. For everything else, __my_cpu_offset
> is defined as __ia64_per_cpu_var(local_per_cpu_offset) and regular
> pointer offsetting is used.
So this means that address arithmetic needs to be performed for each
percpu access. The virtual mapping would allow the calculation of the
address at link time. The calculation means that a single atomic
instruction for percpu access won't be possible for ia64.
I can toss my ia64 percpu optimization patches. No point anymore.
Tony: We could then also drop the virtual per-cpu mapping. It's only useful
for arch-specific code, and an alternate method of reference exists.
Hello, Christoph.
Christoph Lameter wrote:
> On Thu, 24 Sep 2009, Tejun Heo wrote:
>
>>> How does the new percpu allocator support this? Does it use different
>>> methods of access for static and dynamic percpu access?
>> That's only when the __ia64_per_cpu_var() macro is used in arch code,
>> which always references a static percpu variable in the kernel image
>> which falls inside PERCPU_PAGE_SIZE. For everything else, __my_cpu_offset
>> is defined as __ia64_per_cpu_var(local_per_cpu_offset) and regular
>> pointer offsetting is used.
>
> So this means that address arithmetic needs to be performed for each
> percpu access. The virtual mapping would allow the calculation of the
> address at link time. The calculation means that a single atomic
> instruction for percpu access won't be possible for ia64.
>
> I can toss my ia64 percpu optimization patches. No point anymore.
>
> Tony: We could then also drop the virtual per-cpu mapping. It's only useful
> for arch-specific code, and an alternate method of reference exists.
The percpu implementation on ia64 has always been like that. The problem
with the alternate mapping is that you can't take a pointer to it, as it
would mean a different thing depending on which processor you're on, and
the overall generic percpu implementation expects unique addresses from
the percpu access macros.
ia64 has been and currently is the only arch which uses a virtual percpu
mapping. The single biggest benefit would be accesses to
local_per_cpu_offset. Whether it's beneficial enough to justify the
complexity, I frankly don't know.
Andrew once also suggested taking advantage of those overlapping
virtual mappings for local percpu accesses. If the generic code
followed such a design, ia64's virtual mappings would definitely be more
useful, but that means we would need aliased mappings for percpu areas
and addresses would be different for local and remote accesses. Also,
getting it right on machines with virtually mapped caches would be
very painful. Given that %gs/%fs offsetting is quite efficient on
x86, I don't think changing the generic mechanism is worthwhile.
So, it would be great if we could find a better way to offset addresses
on ia64. If not, nothing improves or deteriorates performance-wise
with the new implementation.
Thanks.
--
tejun
On Thu, 24 Sep 2009, Tejun Heo wrote:
> The percpu implementation on ia64 has always been like that. The problem
> with the alternate mapping is that you can't take a pointer to it, as it
> would mean a different thing depending on which processor you're on, and
> the overall generic percpu implementation expects unique addresses from
> the percpu access macros.
The cpu ops patchset uses per cpu addresses that are not relocated to a
certain processor. The relocation is implicit in these instructions and
must be implicit so these operations can be processor atomic.
> ia64 has been and currently is the only arch which uses a virtual percpu
> mapping. The single biggest benefit would be accesses to
> local_per_cpu_offset. Whether it's beneficial enough to justify the
> complexity, I frankly don't know.
It's not worth working on given the state of IA64. I talked to Tony at the
Plumbers conference. It may be beneficial to drop the virtual percpu
mapping entirely because that would increase the number of TLB entries
available.
> Andrew once also suggested taking advantage of those overlapping
> virtual mappings for local percpu accesses. If the generic code
> followed such a design, ia64's virtual mappings would definitely be more
> useful, but that means we would need aliased mappings for percpu areas
> and addresses would be different for local and remote accesses. Also,
> getting it right on machines with virtually mapped caches would be
> very painful. Given that %gs/%fs offsetting is quite efficient on
> x86, I don't think changing the generic mechanism is worthwhile.
There is no problem with using unrelocated percpu addresses as an
"address" for the cpu ops. The IA64 "virtual" addresses are a stand-in for
the segment registers on IA64.
> So, it would be great if we could find a better way to offset addresses
> on ia64. If not, nothing improves or deteriorates performance-wise
> with the new implementation.
Dropping the use of the special mapping over time may be the easiest way
to go for IA64. Percpu RMW ops like this_cpu_add() are not possible on
IA64 since no lightweight primitives exist. We could only avoid the
calculation of the per-cpu variable's address. That would allow assignment
and access to be atomic, but not the RMW instructions. So it would not be
a full per-cpu ops implementation anyway.