2009-09-23 05:06:56

by Tejun Heo

Subject: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

Hello, all.

This is the second take of convert-ia64-to-dynamic-percpu patchset.
Changes from the last take[L] are

* 0001 now updates ia64 to stop defining VMALLOC_END as a macro
aliasing vmalloc_end, rather than disallowing vmalloc_end as a
variable name, as suggested by Christoph.

* 0002 added to initialize cpu maps early. This is necessary to get
contig memory model working.

* 0004 updated so that dyn_size is calculated correctly for contig
model.

This patchset contains the following five patches.

0001-ia64-don-t-alias-VMALLOC_END-to-vmalloc_end.patch
0002-ia64-initialize-cpu-maps-early.patch
0003-ia64-allocate-percpu-area-for-cpu0-like-percpu-areas.patch
0004-ia64-convert-to-dynamic-percpu-allocator.patch
0005-percpu-kill-legacy-percpu-allocator.patch

0001 is misc prep to avoid macro / local variable collision. 0002
makes ia64 arch code initialize cpu possible and present maps before
memory initialization. 0003 makes ia64 allocate percpu area for cpu0
in the same way it does for other cpus. 0004 converts ia64 to dynamic
percpu allocator and 0005 drops now unused legacy allocator.

Contig memory model was tested on a 16p Tiger4 machine. Discontig and
sparse tested on 4-way SGI altix. ski seems to be happy with contig
up/smp.

This patchset is available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git convert-ia64

The new commit ID is dcc91f19c6662b24f1f4e5878d773244f1079724 and it's
on top of today's Linus 7fa07729e439a6184bd824746d06a49cca553f15.

Thanks.

arch/ia64/Kconfig | 3
arch/ia64/kernel/setup.c | 12 --
arch/ia64/kernel/vmlinux.lds.S | 11 +-
arch/ia64/mm/contig.c | 87 ++++++++++++++++----
arch/ia64/mm/discontig.c | 120 +++++++++++++++++++++++++--
include/linux/percpu.h | 24 -----
kernel/module.c | 150 ----------------------------------
mm/Makefile | 4
mm/allocpercpu.c | 177 -----------------------------------------
mm/percpu.c | 2
mm/vmalloc.c | 16 +--
11 files changed, 193 insertions(+), 413 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.ports.ia64/20812


2009-09-23 05:07:41

by Tejun Heo

Subject: [PATCH 1/5] ia64: don't alias VMALLOC_END to vmalloc_end

If CONFIG_VIRTUAL_MEM_MAP is enabled, ia64 defines the macro VMALLOC_END
as the unsigned long variable vmalloc_end, which is adjusted to make
room for the vmemmap. This becomes problematic if a local variable
named vmalloc_end is defined in some function (not very unlikely) and
VMALLOC_END is used in that function - the function thinks it is
referencing the global VMALLOC_END value but is actually referencing
its own local vmalloc_end variable.

There's no reason VMALLOC_END should be a macro. Just define it as an
unsigned long variable if CONFIG_VIRTUAL_MEM_MAP is set to avoid nasty
surprises.
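
To make the failure mode concrete, a minimal made-up example (the
function and the shift amount here are hypothetical, not code from the
tree):

	#define VMALLOC_END vmalloc_end		/* the old ia64 definition */
	extern unsigned long vmalloc_end;	/* the real, adjusted global */

	static int past_vmalloc_end(unsigned long addr)
	{
		unsigned long vmalloc_end = addr >> 3;	/* unrelated local */

		/*
		 * This compares against the local variable above, not the
		 * global limit the author intended, because the macro
		 * expands to the identifier "vmalloc_end".
		 */
		return addr >= VMALLOC_END;
	}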

Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
arch/ia64/include/asm/meminit.h | 2 +-
arch/ia64/include/asm/pgtable.h | 3 +--
arch/ia64/mm/contig.c | 4 ++--
arch/ia64/mm/discontig.c | 4 ++--
arch/ia64/mm/init.c | 4 ++--
5 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/include/asm/meminit.h b/arch/ia64/include/asm/meminit.h
index 688a812..61c7b17 100644
--- a/arch/ia64/include/asm/meminit.h
+++ b/arch/ia64/include/asm/meminit.h
@@ -61,7 +61,7 @@ extern int register_active_ranges(u64 start, u64 len, int nid);

#ifdef CONFIG_VIRTUAL_MEM_MAP
# define LARGE_GAP 0x40000000 /* Use virtual mem map if hole is > than this */
- extern unsigned long vmalloc_end;
+ extern unsigned long VMALLOC_END;
extern struct page *vmem_map;
extern int find_largest_hole(u64 start, u64 end, void *arg);
extern int create_mem_map_page_table(u64 start, u64 end, void *arg);
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 8840a69..69bf138 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -228,8 +228,7 @@ ia64_phys_addr_valid (unsigned long addr)
#define VMALLOC_START (RGN_BASE(RGN_GATE) + 0x200000000UL)
#ifdef CONFIG_VIRTUAL_MEM_MAP
# define VMALLOC_END_INIT (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9)))
-# define VMALLOC_END vmalloc_end
- extern unsigned long vmalloc_end;
+extern unsigned long VMALLOC_END;
#else
#if defined(CONFIG_SPARSEMEM) && defined(CONFIG_SPARSEMEM_VMEMMAP)
/* SPARSEMEM_VMEMMAP uses half of vmalloc... */
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 2f724d2..1341437 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -270,8 +270,8 @@ paging_init (void)

map_size = PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) *
sizeof(struct page));
- vmalloc_end -= map_size;
- vmem_map = (struct page *) vmalloc_end;
+ VMALLOC_END -= map_size;
+ vmem_map = (struct page *) VMALLOC_END;
efi_memmap_walk(create_mem_map_page_table, NULL);

/*
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index d85ba98..9f24b3c 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -666,9 +666,9 @@ void __init paging_init(void)
sparse_init();

#ifdef CONFIG_VIRTUAL_MEM_MAP
- vmalloc_end -= PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) *
+ VMALLOC_END -= PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) *
sizeof(struct page));
- vmem_map = (struct page *) vmalloc_end;
+ vmem_map = (struct page *) VMALLOC_END;
efi_memmap_walk(create_mem_map_page_table, NULL);
printk("Virtual mem_map starts at 0x%p\n", vmem_map);
#endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1d28624..f301071 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -44,8 +44,8 @@ extern void ia64_tlb_init (void);
unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL;

#ifdef CONFIG_VIRTUAL_MEM_MAP
-unsigned long vmalloc_end = VMALLOC_END_INIT;
-EXPORT_SYMBOL(vmalloc_end);
+unsigned long VMALLOC_END = VMALLOC_END_INIT;
+EXPORT_SYMBOL(VMALLOC_END);
struct page *vmem_map;
EXPORT_SYMBOL(vmem_map);
#endif
--
1.6.4.2

2009-09-23 05:06:56

by Tejun Heo

Subject: [PATCH 2/5] ia64: initialize cpu maps early

All the information necessary to initialize the cpu possible and
present maps is available once early_acpi_boot_init() is complete.
Reorganize setup_arch() and the acpi init functions such that:

* CPU information is printed after LAPIC entries are parsed in
early_acpi_boot_init().

* smp_build_cpu_map() is called by setup_arch() instead of by the
acpi functions.

* smp_build_cpu_map() is called once all CPU-related information is
available, before memory is initialized.

This is primarily to allow find_memory() to use cpu maps but is also a
general cleanup. Please note that with this change, the somewhat
ad-hoc early_cpu_possible_map defined and used for NUMA configurations
is probably unnecessary. Something to clean up another day.
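
For quick reference, the resulting order of the relevant calls in
setup_arch() (a condensed summary of the diff below, not a literal
excerpt):

	/*
	 * early_acpi_boot_init();	parse LAPIC entries, fill smp_boot_data,
	 *				print the "N CPUs available" line
	 * acpi_numa_init();		CONFIG_ACPI_NUMA only
	 * prefill_possible_map();	CONFIG_ACPI_HOTPLUG_CPU only
	 * per_cpu_scan_finalize();	CONFIG_ACPI_NUMA only
	 * smp_build_cpu_map();		now unconditional under CONFIG_SMP
	 * find_memory();		can rely on the cpu maps from here on
	 */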

Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
arch/ia64/kernel/acpi.c | 33 +++++++++++++++------------------
arch/ia64/kernel/setup.c | 11 +++++------
2 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index baec6f0..40574ae 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -702,11 +702,23 @@ int __init early_acpi_boot_init(void)
printk(KERN_ERR PREFIX
"Error parsing MADT - no LAPIC entries\n");

+#ifdef CONFIG_SMP
+ if (available_cpus == 0) {
+ printk(KERN_INFO "ACPI: Found 0 CPUS; assuming 1\n");
+ printk(KERN_INFO "CPU 0 (0x%04x)", hard_smp_processor_id());
+ smp_boot_data.cpu_phys_id[available_cpus] =
+ hard_smp_processor_id();
+ available_cpus = 1; /* We've got at least one of these, no? */
+ }
+ smp_boot_data.cpu_count = available_cpus;
+#endif
+ /* Make boot-up look pretty */
+ printk(KERN_INFO "%d CPUs available, %d CPUs total\n", available_cpus,
+ total_cpus);
+
return 0;
}

-
-
int __init acpi_boot_init(void)
{

@@ -769,18 +781,8 @@ int __init acpi_boot_init(void)
if (acpi_table_parse(ACPI_SIG_FADT, acpi_parse_fadt))
printk(KERN_ERR PREFIX "Can't find FADT\n");

+#ifdef CONFIG_ACPI_NUMA
#ifdef CONFIG_SMP
- if (available_cpus == 0) {
- printk(KERN_INFO "ACPI: Found 0 CPUS; assuming 1\n");
- printk(KERN_INFO "CPU 0 (0x%04x)", hard_smp_processor_id());
- smp_boot_data.cpu_phys_id[available_cpus] =
- hard_smp_processor_id();
- available_cpus = 1; /* We've got at least one of these, no? */
- }
- smp_boot_data.cpu_count = available_cpus;
-
- smp_build_cpu_map();
-# ifdef CONFIG_ACPI_NUMA
if (srat_num_cpus == 0) {
int cpu, i = 1;
for (cpu = 0; cpu < smp_boot_data.cpu_count; cpu++)
@@ -789,14 +791,9 @@ int __init acpi_boot_init(void)
node_cpuid[i++].phys_id =
smp_boot_data.cpu_phys_id[cpu];
}
-# endif
#endif
-#ifdef CONFIG_ACPI_NUMA
build_cpu_to_node_map();
#endif
- /* Make boot-up look pretty */
- printk(KERN_INFO "%d CPUs available, %d CPUs total\n", available_cpus,
- total_cpus);
return 0;
}

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 1de86c9..5d77c1e 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -566,19 +566,18 @@ setup_arch (char **cmdline_p)
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
acpi_numa_init();
-#ifdef CONFIG_ACPI_HOTPLUG_CPU
+# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
-#endif
+# endif
per_cpu_scan_finalize((cpus_weight(early_cpu_possible_map) == 0 ?
32 : cpus_weight(early_cpu_possible_map)),
additional_cpus > 0 ? additional_cpus : 0);
# endif
-#else
-# ifdef CONFIG_SMP
- smp_build_cpu_map(); /* happens, e.g., with the Ski simulator */
-# endif
#endif /* CONFIG_APCI_BOOT */

+#ifdef CONFIG_SMP
+ smp_build_cpu_map();
+#endif
find_memory();

/* process SAL system table: */
--
1.6.4.2

2009-09-23 05:06:53

by Tejun Heo

Subject: [PATCH 3/5] ia64: allocate percpu area for cpu0 like percpu areas for other cpus

cpu0 used a special percpu area reserved by the linker, __cpu0_per_cpu,
which is set up early in boot by head.S. However, this doesn't
guarantee that the area will be on the same node as cpu0, and the
percpu area for cpu0 ends up very far away from the percpu areas for
other cpus, which causes problems for the congruent percpu allocator.

This patch makes percpu area initialization allocate the percpu area
for cpu0 like that of any other cpu and copy it from __cpu0_per_cpu,
which now resides in the __init area. This means that for cpu0 the
percpu area is first set up at __cpu0_per_cpu early by head.S and then
moved to an area in the linear mapping during memory initialization,
and taking a pointer to percpu variables between head.S and memory
initialization is not allowed.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
arch/ia64/kernel/vmlinux.lds.S | 11 +++++----
arch/ia64/mm/contig.c | 47 +++++++++++++++++++++++++--------------
arch/ia64/mm/discontig.c | 35 ++++++++++++++++++++---------
3 files changed, 60 insertions(+), 33 deletions(-)

diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 0a0c77b..1295ba3 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -166,6 +166,12 @@ SECTIONS
}
#endif

+#ifdef CONFIG_SMP
+ . = ALIGN(PERCPU_PAGE_SIZE);
+ __cpu0_per_cpu = .;
+ . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
+#endif
+
. = ALIGN(PAGE_SIZE);
__init_end = .;

@@ -198,11 +204,6 @@ SECTIONS
data : { } :data
.data : AT(ADDR(.data) - LOAD_OFFSET)
{
-#ifdef CONFIG_SMP
- . = ALIGN(PERCPU_PAGE_SIZE);
- __cpu0_per_cpu = .;
- . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
-#endif
INIT_TASK_DATA(PAGE_SIZE)
CACHELINE_ALIGNED_DATA(SMP_CACHE_BYTES)
READ_MOSTLY_DATA(SMP_CACHE_BYTES)
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 1341437..351da0a 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -154,36 +154,49 @@ static void *cpu_data;
void * __cpuinit
per_cpu_init (void)
{
- int cpu;
- static int first_time=1;
+ static bool first_time = true;
+ void *cpu0_data = __cpu0_per_cpu;
+ unsigned int cpu;
+
+ if (!first_time)
+ goto skip;
+ first_time = false;

/*
* get_free_pages() cannot be used before cpu_init() done. BSP
* allocates "NR_CPUS" pages for all CPUs to avoid that AP calls
* get_zeroed_page().
*/
- if (first_time) {
- void *cpu0_data = __cpu0_per_cpu;
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ void *src = cpu == 0 ? cpu0_data : __phys_per_cpu_start;

- first_time=0;
+ memcpy(cpu_data, src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)cpu_data - __per_cpu_start;
+ per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];

- __per_cpu_offset[0] = (char *) cpu0_data - __per_cpu_start;
- per_cpu(local_per_cpu_offset, 0) = __per_cpu_offset[0];
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA, __pa(cpu_data) -
+ (unsigned long)__per_cpu_start);

- for (cpu = 1; cpu < NR_CPUS; cpu++) {
- memcpy(cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char *) cpu_data - __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
- }
+ cpu_data += PERCPU_PAGE_SIZE;
}
+skip:
return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
}

static inline void
alloc_per_cpu_data(void)
{
- cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS-1,
+ cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}
#else
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 9f24b3c..200282b 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -143,17 +143,30 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
int cpu;

for_each_possible_early_cpu(cpu) {
- if (cpu == 0) {
- void *cpu0_data = __cpu0_per_cpu;
- __per_cpu_offset[cpu] = (char*)cpu0_data -
- __per_cpu_start;
- } else if (node == node_cpuid[cpu].nid) {
- memcpy(__va(cpu_data), __phys_per_cpu_start,
- __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char*)__va(cpu_data) -
- __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- }
+ void *src = cpu == 0 ? __cpu0_per_cpu : __phys_per_cpu_start;
+
+ if (node != node_cpuid[cpu].nid)
+ continue;
+
+ memcpy(__va(cpu_data), src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)__va(cpu_data) -
+ __per_cpu_start;
+
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA,
+ (unsigned long)cpu_data -
+ (unsigned long)__per_cpu_start);
+
+ cpu_data += PERCPU_PAGE_SIZE;
}
#endif
return cpu_data;
--
1.6.4.2

2009-09-23 05:07:31

by Tejun Heo

Subject: [PATCH 4/5] ia64: convert to dynamic percpu allocator

Unlike other archs, ia64 reserves space for percpu areas during early
memory initialization. These areas occupy a contiguous region indexed
by cpu number under the contiguous memory model, or are grouped by
node under the discontiguous memory model.

As allocation and initialization are done by the arch code, all that
setup_per_cpu_areas() needs to do is communicate the determined
layout to the percpu allocator. This patch implements
setup_per_cpu_areas() for both contig and discontig memory models and
drops HAVE_LEGACY_PER_CPU_AREA.

Please note that for the contig model, the allocation itself is
modified only to allocate for possible cpus instead of NR_CPUS. As the
dynamic percpu allocator can handle non-direct mapping, there's no
reason to allocate memory for cpus which aren't possible.
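
As a worked example of the discontig grouping below (the topology here
is made up purely for illustration): with 4 possible cpus where cpus
0-1 sit on node 0 and cpus 2-3 on node 1, setup_per_cpu_areas() ends
up with

	/*
	 * cpu_map[]     = { 0, 1, 2, 3 }	units grouped by node
	 * ai->nr_groups = 2
	 * groups[0]     = { nr_units = 2, cpu_map = &cpu_map[0],
	 *		     base_offset = node 0 area - lowest area }
	 * groups[1]     = { nr_units = 2, cpu_map = &cpu_map[2],
	 *		     base_offset = node 1 area - lowest area }
	 *
	 * i.e. one group per node, with the units inside a group laid
	 * out PERCPU_PAGE_SIZE apart in that node's memory.
	 */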

Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
arch/ia64/Kconfig | 3 --
arch/ia64/kernel/setup.c | 12 ------
arch/ia64/mm/contig.c | 58 ++++++++++++++++++++++++++++---
arch/ia64/mm/discontig.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 138 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 011a1cd..e624611 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -89,9 +89,6 @@ config GENERIC_TIME_VSYSCALL
bool
default y

-config HAVE_LEGACY_PER_CPU_AREA
- def_bool y
-
config HAVE_SETUP_PER_CPU_AREA
def_bool y

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 5d77c1e..bc1ef4a 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -855,18 +855,6 @@ identify_cpu (struct cpuinfo_ia64 *c)
}

/*
- * In UP configuration, setup_per_cpu_areas() is defined in
- * include/linux/percpu.h
- */
-#ifdef CONFIG_SMP
-void __init
-setup_per_cpu_areas (void)
-{
- /* start_kernel() requires this... */
-}
-#endif
-
-/*
* Do the following calculations:
*
* 1. the max. cache line size.
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 351da0a..54bf540 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -163,11 +163,11 @@ per_cpu_init (void)
first_time = false;

/*
- * get_free_pages() cannot be used before cpu_init() done. BSP
- * allocates "NR_CPUS" pages for all CPUs to avoid that AP calls
- * get_zeroed_page().
+ * get_free_pages() cannot be used before cpu_init() done.
+ * BSP allocates PERCPU_PAGE_SIZE bytes for all possible CPUs
+ * to avoid that AP calls get_zeroed_page().
*/
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ for_each_possible_cpu(cpu) {
void *src = cpu == 0 ? cpu0_data : __phys_per_cpu_start;

memcpy(cpu_data, src, __per_cpu_end - __per_cpu_start);
@@ -196,9 +196,57 @@ skip:
static inline void
alloc_per_cpu_data(void)
{
- cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
+ cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * num_possible_cpus(),
PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}
+
+/**
+ * setup_per_cpu_areas - setup percpu areas
+ *
+ * Arch code has already allocated and initialized percpu areas. All
+ * this function has to do is to teach the determined layout to the
+ * dynamic percpu allocator, which happens to be more complex than
+ * creating whole new ones using helpers.
+ */
+void __init
+setup_per_cpu_areas(void)
+{
+ struct pcpu_alloc_info *ai;
+ struct pcpu_group_info *gi;
+ unsigned int cpu;
+ ssize_t static_size, reserved_size, dyn_size;
+ int rc;
+
+ ai = pcpu_alloc_alloc_info(1, num_possible_cpus());
+ if (!ai)
+ panic("failed to allocate pcpu_alloc_info");
+ gi = &ai->groups[0];
+
+ /* units are assigned consecutively to possible cpus */
+ for_each_possible_cpu(cpu)
+ gi->cpu_map[gi->nr_units++] = cpu;
+
+ /* set parameters */
+ static_size = __per_cpu_end - __per_cpu_start;
+ reserved_size = PERCPU_MODULE_RESERVE;
+ dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
+ if (dyn_size < 0)
+ panic("percpu area overflow static=%zd reserved=%zd\n",
+ static_size, reserved_size);
+
+ ai->static_size = static_size;
+ ai->reserved_size = reserved_size;
+ ai->dyn_size = dyn_size;
+ ai->unit_size = PERCPU_PAGE_SIZE;
+ ai->atom_size = PAGE_SIZE;
+ ai->alloc_size = PERCPU_PAGE_SIZE;
+
+ rc = pcpu_setup_first_chunk(ai, __per_cpu_start + __per_cpu_offset[0]);
+ if (rc)
+ panic("failed to setup percpu area (err=%d)", rc);
+
+ pcpu_free_alloc_info(ai);
+}
#else
#define alloc_per_cpu_data() do { } while (0)
#endif /* CONFIG_SMP */
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 200282b..40e4c1f 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -172,6 +172,91 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
return cpu_data;
}

+#ifdef CONFIG_SMP
+/**
+ * setup_per_cpu_areas - setup percpu areas
+ *
+ * Arch code has already allocated and initialized percpu areas. All
+ * this function has to do is to teach the determined layout to the
+ * dynamic percpu allocator, which happens to be more complex than
+ * creating whole new ones using helpers.
+ */
+void __init setup_per_cpu_areas(void)
+{
+ struct pcpu_alloc_info *ai;
+ struct pcpu_group_info *uninitialized_var(gi);
+ unsigned int *cpu_map;
+ void *base;
+ unsigned long base_offset;
+ unsigned int cpu;
+ ssize_t static_size, reserved_size, dyn_size;
+ int node, prev_node, unit, nr_units, rc;
+
+ ai = pcpu_alloc_alloc_info(MAX_NUMNODES, nr_cpu_ids);
+ if (!ai)
+ panic("failed to allocate pcpu_alloc_info");
+ cpu_map = ai->groups[0].cpu_map;
+
+ /* determine base */
+ base = (void *)ULONG_MAX;
+ for_each_possible_cpu(cpu)
+ base = min(base,
+ (void *)(__per_cpu_offset[cpu] + __per_cpu_start));
+ base_offset = (void *)__per_cpu_start - base;
+
+ /* build cpu_map, units are grouped by node */
+ unit = 0;
+ for_each_node(node)
+ for_each_possible_cpu(cpu)
+ if (node == node_cpuid[cpu].nid)
+ cpu_map[unit++] = cpu;
+ nr_units = unit;
+
+ /* set basic parameters */
+ static_size = __per_cpu_end - __per_cpu_start;
+ reserved_size = PERCPU_MODULE_RESERVE;
+ dyn_size = PERCPU_PAGE_SIZE - static_size - reserved_size;
+ if (dyn_size < 0)
+ panic("percpu area overflow static=%zd reserved=%zd\n",
+ static_size, reserved_size);
+
+ ai->static_size = static_size;
+ ai->reserved_size = reserved_size;
+ ai->dyn_size = dyn_size;
+ ai->unit_size = PERCPU_PAGE_SIZE;
+ ai->atom_size = PAGE_SIZE;
+ ai->alloc_size = PERCPU_PAGE_SIZE;
+
+ /*
+ * CPUs are put into groups according to node. Walk cpu_map
+ * and create new groups at node boundaries.
+ */
+ prev_node = -1;
+ ai->nr_groups = 0;
+ for (unit = 0; unit < nr_units; unit++) {
+ cpu = cpu_map[unit];
+ node = node_cpuid[cpu].nid;
+
+ if (node == prev_node) {
+ gi->nr_units++;
+ continue;
+ }
+ prev_node = node;
+
+ gi = &ai->groups[ai->nr_groups++];
+ gi->nr_units = 1;
+ gi->base_offset = __per_cpu_offset[cpu] + base_offset;
+ gi->cpu_map = &cpu_map[unit];
+ }
+
+ rc = pcpu_setup_first_chunk(ai, base);
+ if (rc)
+ panic("failed to setup percpu area (err=%d)", rc);
+
+ pcpu_free_alloc_info(ai);
+}
+#endif
+
/**
* fill_pernode - initialize pernode data.
* @node: the node id.
--
1.6.4.2

2009-09-23 05:07:16

by Tejun Heo

Subject: [PATCH 5/5] percpu: kill legacy percpu allocator

With ia64 converted, there's no arch left which still uses legacy
percpu allocator. Kill it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/percpu.h | 24 -------
kernel/module.c | 150 ----------------------------------------
mm/Makefile | 4 -
mm/allocpercpu.c | 177 ------------------------------------------------
mm/percpu.c | 2 -
5 files changed, 0 insertions(+), 357 deletions(-)
delete mode 100644 mm/allocpercpu.c

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 878836c..5baf5b8 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -34,8 +34,6 @@

#ifdef CONFIG_SMP

-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
/* minimum unit size, also is the maximum supported allocation size */
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)

@@ -130,28 +128,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
#define per_cpu_ptr(ptr, cpu) SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)))

extern void *__alloc_reserved_percpu(size_t size, size_t align);
-
-#else /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-struct percpu_data {
- void *ptrs[1];
-};
-
-/* pointer disguising messes up the kmemleak objects tracking */
-#ifndef CONFIG_DEBUG_KMEMLEAK
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-#else
-#define __percpu_disguise(pdata) (struct percpu_data *)(pdata)
-#endif
-
-#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = __percpu_disguise(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void *__pdata);

diff --git a/kernel/module.c b/kernel/module.c
index e6bc4b2..7fd81d8 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -370,8 +370,6 @@ EXPORT_SYMBOL_GPL(find_module);

#ifdef CONFIG_SMP

-#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
-
static void *percpu_modalloc(unsigned long size, unsigned long align,
const char *name)
{
@@ -395,154 +393,6 @@ static void percpu_modfree(void *freeme)
free_percpu(freeme);
}

-#else /* ... CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block. -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
- /* Reallocation required? */
- if (pcpu_num_used + 1 > pcpu_num_allocated) {
- int *new;
-
- new = krealloc(pcpu_size, sizeof(new[0])*pcpu_num_allocated*2,
- GFP_KERNEL);
- if (!new)
- return 0;
-
- pcpu_num_allocated *= 2;
- pcpu_size = new;
- }
-
- /* Insert a new subblock */
- memmove(&pcpu_size[i+1], &pcpu_size[i],
- sizeof(pcpu_size[0]) * (pcpu_num_used - i));
- pcpu_num_used++;
-
- pcpu_size[i+1] -= size;
- pcpu_size[i] = size;
- return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
- if (val < 0)
- return -val;
- return val;
-}
-
-static void *percpu_modalloc(unsigned long size, unsigned long align,
- const char *name)
-{
- unsigned long extra;
- unsigned int i;
- void *ptr;
- int cpu;
-
- if (align > PAGE_SIZE) {
- printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
- name, align, PAGE_SIZE);
- align = PAGE_SIZE;
- }
-
- ptr = __per_cpu_start;
- for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- /* Extra for alignment requirement. */
- extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
- BUG_ON(i == 0 && extra != 0);
-
- if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
- continue;
-
- /* Transfer extra to previous block. */
- if (pcpu_size[i-1] < 0)
- pcpu_size[i-1] -= extra;
- else
- pcpu_size[i-1] += extra;
- pcpu_size[i] -= extra;
- ptr += extra;
-
- /* Split block if warranted */
- if (pcpu_size[i] - size > sizeof(unsigned long))
- if (!split_block(i, size))
- return NULL;
-
- /* add the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_alloc(ptr + per_cpu_offset(cpu), size, 0,
- GFP_KERNEL);
-
- /* Mark allocated */
- pcpu_size[i] = -pcpu_size[i];
- return ptr;
- }
-
- printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
- size);
- return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
- unsigned int i;
- void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
- int cpu;
-
- /* First entry is core kernel percpu data. */
- for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- if (ptr == freeme) {
- pcpu_size[i] = -pcpu_size[i];
- goto free;
- }
- }
- BUG();
-
- free:
- /* remove the per-cpu scanning areas */
- for_each_possible_cpu(cpu)
- kmemleak_free(freeme + per_cpu_offset(cpu));
-
- /* Merge with previous? */
- if (pcpu_size[i-1] >= 0) {
- pcpu_size[i-1] += pcpu_size[i];
- pcpu_num_used--;
- memmove(&pcpu_size[i], &pcpu_size[i+1],
- (pcpu_num_used - i) * sizeof(pcpu_size[0]));
- i--;
- }
- /* Merge with next? */
- if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
- pcpu_size[i] += pcpu_size[i+1];
- pcpu_num_used--;
- memmove(&pcpu_size[i+1], &pcpu_size[i+2],
- (pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
- }
-}
-
-static int percpu_modinit(void)
-{
- pcpu_num_used = 2;
- pcpu_num_allocated = 2;
- pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
- GFP_KERNEL);
- /* Static in-kernel percpu data (used). */
- pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
- /* Free room. */
- pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
- if (pcpu_size[1] < 0) {
- printk(KERN_ERR "No per-cpu room for modules.\n");
- pcpu_num_used = 1;
- }
-
- return 0;
-}
-__initcall(percpu_modinit);
-
-#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */
-
static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
const char *secstrings)
diff --git a/mm/Makefile b/mm/Makefile
index 728a9fd..3230eb5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,11 +34,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
obj-$(CONFIG_SMP) += percpu.o
-else
-obj-$(CONFIG_SMP) += allocpercpu.o
-endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
deleted file mode 100644
index df34cea..0000000
--- a/mm/allocpercpu.c
+++ /dev/null
@@ -1,177 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bootmem.h>
-#include <asm/sections.h>
-
-#ifndef cache_line_size
-#define cache_line_size() L1_CACHE_BYTES
-#endif
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-static void percpu_depopulate(void *__pdata, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
-
- kfree(pdata->ptrs[cpu]);
- pdata->ptrs[cpu] = NULL;
-}
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-static void __percpu_depopulate_mask(void *__pdata, const cpumask_t *mask)
-{
- int cpu;
- for_each_cpu_mask_nr(cpu, *mask)
- percpu_depopulate(__pdata, cpu);
-}
-
-#define percpu_depopulate_mask(__pdata, mask) \
- __percpu_depopulate_mask((__pdata), &(mask))
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-static void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
- int node = cpu_to_node(cpu);
-
- /*
- * We should make sure each CPU gets private memory.
- */
- size = roundup(size, cache_line_size());
-
- BUG_ON(pdata->ptrs[cpu]);
- if (node_online(node))
- pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
- else
- pdata->ptrs[cpu] = kzalloc(size, gfp);
- return pdata->ptrs[cpu];
-}
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-static int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- cpumask_t populated;
- int cpu;
-
- cpus_clear(populated);
- for_each_cpu_mask_nr(cpu, *mask)
- if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
- __percpu_depopulate_mask(__pdata, &populated);
- return -ENOMEM;
- } else
- cpu_set(cpu, populated);
- return 0;
-}
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
- __percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-
-/**
- * alloc_percpu - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @align: alignment
- *
- * Allocate dynamic percpu area. Percpu objects are populated with
- * zeroed buffers.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- /*
- * We allocate whole cache lines to avoid false sharing
- */
- size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
- void *pdata = kzalloc(sz, GFP_KERNEL);
- void *__pdata = __percpu_disguise(pdata);
-
- /*
- * Can't easily make larger alignment work with kmalloc. WARN
- * on it. Larger alignment should only be used for module
- * percpu sections on SMP for which this path isn't used.
- */
- WARN_ON_ONCE(align > SMP_CACHE_BYTES);
-
- if (unlikely(!pdata))
- return NULL;
- if (likely(!__percpu_populate_mask(__pdata, size, GFP_KERNEL,
- &cpu_possible_map)))
- return __pdata;
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL_GPL(__alloc_percpu);
-
-/**
- * free_percpu - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void free_percpu(void *__pdata)
-{
- if (unlikely(!__pdata))
- return;
- __percpu_depopulate_mask(__pdata, cpu_possible_mask);
- kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(free_percpu);
-
-/*
- * Generic percpu area setup.
- */
-#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-
-EXPORT_SYMBOL(__per_cpu_offset);
-
-void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- unsigned long nr_possible_cpus = num_possible_cpus();
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
- ptr = alloc_bootmem_pages(size * nr_possible_cpus);
-
- for_each_possible_cpu(i) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- ptr += size;
- }
-}
-#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
diff --git a/mm/percpu.c b/mm/percpu.c
index 43d8cac..adbc5a4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -46,8 +46,6 @@
*
* To use this allocator, arch code should do the followings.
*
- * - drop CONFIG_HAVE_LEGACY_PER_CPU_AREA
- *
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
* regular address to percpu pointer and back if they need to be
* different from the default
--
1.6.4.2

2009-09-23 11:10:57

by Rusty Russell

Subject: Re: [PATCH 5/5] percpu: kill legacy percpu allocator

On Wed, 23 Sep 2009 02:36:22 pm Tejun Heo wrote:
> With ia64 converted, there's no arch left which still uses legacy
> percpu allocator. Kill it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Christoph Lameter <[email protected]>

Delightedly-acked-by: Rusty Russell <[email protected]>

Thanks!
Rusty.

2009-09-29 00:25:21

by Tejun Heo

Subject: Re: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

Tejun Heo wrote:
> Hello, all.
>
> This is the second take of convert-ia64-to-dynamic-percpu patchset.
> Changes from the last take[L] are
>
> * 0001 now updates ia64 to stop defining VMALLOC_END as a macro
> aliasing vmalloc_end, rather than disallowing vmalloc_end as a
> variable name, as suggested by Christoph.
>
> * 0002 added to initialize cpu maps early. This is necessary to get
> contig memory model working.
>
> * 0004 updated so that dyn_size is calculated correctly for contig
> model.
>
> This patchset contains the following five patches.
>
> 0001-ia64-don-t-alias-VMALLOC_END-to-vmalloc_end.patch
> 0002-ia64-initialize-cpu-maps-early.patch
> 0003-ia64-allocate-percpu-area-for-cpu0-like-percpu-areas.patch
> 0004-ia64-convert-to-dynamic-percpu-allocator.patch
> 0005-percpu-kill-legacy-percpu-allocator.patch
>
> 0001 is misc prep to avoid macro / local variable collision. 0002
> makes ia64 arch code initialize cpu possible and present maps before
> memory initialization. 0003 makes ia64 allocate percpu area for cpu0
> in the same way it does for other cpus. 0004 converts ia64 to dynamic
> percpu allocator and 0005 drops now unused legacy allocator.
>
> Contig memory model was tested on a 16p Tiger4 machine. Discontig and
> sparse tested on 4-way SGI altix. ski seems to be happy with contig
> up/smp.
>
> This patchset is available in the following git tree.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git convert-ia64
>
> The new commit ID is dcc91f19c6662b24f1f4e5878d773244f1079724 and it's
> on top of today's Linus 7fa07729e439a6184bd824746d06a49cca553f15.

Tony, can you please ack ia64 part?

Thanks.

--
tejun

2009-09-30 01:24:25

by Tejun Heo

Subject: [PATCH REPOST 3/5] ia64: allocate percpu area for cpu0 like percpu areas for other cpus

cpu0 used a special percpu area reserved by the linker, __cpu0_per_cpu,
which is set up early in boot by head.S. However, this doesn't
guarantee that the area will be on the same node as cpu0, and the
percpu area for cpu0 ends up very far away from the percpu areas for
other cpus, which causes problems for the congruent percpu allocator.

This patch makes percpu area initialization allocate the percpu area
for cpu0 like that of any other cpu and copy it from __cpu0_per_cpu,
which now resides in the __init area. This means that for cpu0 the
percpu area is first set up at __cpu0_per_cpu early by head.S and then
moved to an area in the linear mapping during memory initialization,
and taking a pointer to percpu variables between head.S and memory
initialization is not allowed.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: linux-ia64 <[email protected]>
---
Tony Luck didn't receive this one and couldn't find it in archives
either. Repost. Unchanged from the original posting.

arch/ia64/kernel/vmlinux.lds.S | 11 +++++----
arch/ia64/mm/contig.c | 47 +++++++++++++++++++++++++--------------
arch/ia64/mm/discontig.c | 35 ++++++++++++++++++++---------
3 files changed, 60 insertions(+), 33 deletions(-)

diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 0a0c77b..1295ba3 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -166,6 +166,12 @@ SECTIONS
}
#endif

+#ifdef CONFIG_SMP
+ . = ALIGN(PERCPU_PAGE_SIZE);
+ __cpu0_per_cpu = .;
+ . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
+#endif
+
. = ALIGN(PAGE_SIZE);
__init_end = .;

@@ -198,11 +204,6 @@ SECTIONS
data : { } :data
.data : AT(ADDR(.data) - LOAD_OFFSET)
{
-#ifdef CONFIG_SMP
- . = ALIGN(PERCPU_PAGE_SIZE);
- __cpu0_per_cpu = .;
- . = . + PERCPU_PAGE_SIZE; /* cpu0 per-cpu space */
-#endif
INIT_TASK_DATA(PAGE_SIZE)
CACHELINE_ALIGNED_DATA(SMP_CACHE_BYTES)
READ_MOSTLY_DATA(SMP_CACHE_BYTES)
diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
index 1341437..351da0a 100644
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -154,36 +154,49 @@ static void *cpu_data;
void * __cpuinit
per_cpu_init (void)
{
- int cpu;
- static int first_time=1;
+ static bool first_time = true;
+ void *cpu0_data = __cpu0_per_cpu;
+ unsigned int cpu;
+
+ if (!first_time)
+ goto skip;
+ first_time = false;

/*
* get_free_pages() cannot be used before cpu_init() done. BSP
* allocates "NR_CPUS" pages for all CPUs to avoid that AP calls
* get_zeroed_page().
*/
- if (first_time) {
- void *cpu0_data = __cpu0_per_cpu;
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ void *src = cpu == 0 ? cpu0_data : __phys_per_cpu_start;

- first_time=0;
+ memcpy(cpu_data, src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)cpu_data - __per_cpu_start;
+ per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];

- __per_cpu_offset[0] = (char *) cpu0_data - __per_cpu_start;
- per_cpu(local_per_cpu_offset, 0) = __per_cpu_offset[0];
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA, __pa(cpu_data) -
+ (unsigned long)__per_cpu_start);

- for (cpu = 1; cpu < NR_CPUS; cpu++) {
- memcpy(cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char *) cpu_data - __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
- }
+ cpu_data += PERCPU_PAGE_SIZE;
}
+skip:
return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
}

static inline void
alloc_per_cpu_data(void)
{
- cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS-1,
+ cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
PERCPU_PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}
#else
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 9f24b3c..200282b 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -143,17 +143,30 @@ static void *per_cpu_node_setup(void *cpu_data, int node)
int cpu;

for_each_possible_early_cpu(cpu) {
- if (cpu == 0) {
- void *cpu0_data = __cpu0_per_cpu;
- __per_cpu_offset[cpu] = (char*)cpu0_data -
- __per_cpu_start;
- } else if (node == node_cpuid[cpu].nid) {
- memcpy(__va(cpu_data), __phys_per_cpu_start,
- __per_cpu_end - __per_cpu_start);
- __per_cpu_offset[cpu] = (char*)__va(cpu_data) -
- __per_cpu_start;
- cpu_data += PERCPU_PAGE_SIZE;
- }
+ void *src = cpu == 0 ? __cpu0_per_cpu : __phys_per_cpu_start;
+
+ if (node != node_cpuid[cpu].nid)
+ continue;
+
+ memcpy(__va(cpu_data), src, __per_cpu_end - __per_cpu_start);
+ __per_cpu_offset[cpu] = (char *)__va(cpu_data) -
+ __per_cpu_start;
+
+ /*
+ * percpu area for cpu0 is moved from the __init area
+ * which is setup by head.S and used till this point.
+ * Update ar.k3. This move ensures that percpu
+ * area for cpu0 is on the correct node and its
+ * virtual address isn't insanely far from other
+ * percpu areas which is important for congruent
+ * percpu allocator.
+ */
+ if (cpu == 0)
+ ia64_set_kr(IA64_KR_PER_CPU_DATA,
+ (unsigned long)cpu_data -
+ (unsigned long)__per_cpu_start);
+
+ cpu_data += PERCPU_PAGE_SIZE;
}
#endif
return cpu_data;
--
1.6.4.2

2009-09-30 20:31:58

by Tony Luck

Subject: RE: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

> Tony, can you please ack ia64 part?

Tejun,

Ok. The new patchset builds and boots for both the contig.c and discontig.c
versions of the kernel.

Acked-by: Tony Luck <[email protected]>

2009-09-30 20:52:16

by Christoph Lameter

Subject: RE: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2


Great. I'll post a new rev of the per cpu ops patchset.

Tony: Could we use a global register for the per cpu address? That would
make IA64 work similar to sparc.

2009-09-30 22:05:45

by Tony Luck

Subject: RE: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

> Tony: Could we use a global register for the per cpu address? That would
> make IA64 work similar to sparc.

Would that be a useful trade of resources for convenience? We've already
hard-wired r13 for "current". Grabbing another one would require fixing it
up on entry to the kernel (since user code will have clobbered it), and
possibly re-working any existing code that already uses whatever register
you choose.

How would that compare with using [r13]?

We might have a krN free ... but kregs are a lot slower than real registers.

-Tony

2009-09-30 23:41:28

by Peter Chubb

Subject: Re: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

>>>>> "Tony" == Tony Luck <Luck> writes:

>> Tony: Could we use a global register for the per cpu address? That
>> would make IA64 work similar to sparc.

Tony> Would that be a useful trade of resources for convenience?
Tony> We've already hard-wired r13 for "current". Grabbing another one
Tony> would require fixing (since user code will have clobbered it).
Tony> Possibly re-working any existing code that already uses whatever
Tony> register you choose.

r3, r4 and r5 are currently unused by the kernel, and unused
by GCC and ICC. Only hand-written assembler and weird compilers use
those registers (and my virtual-machine monitor :-(). If you wanted to
experiment, that'd be a starting place.

I'm not sure of the advantage though -- TLB mapping is relatively
cheap, and we're no longer hard-wiring the translation register.

You'd have to do some careful benchmarking on a wide variety of
workloads and machines to get a definitive answer.


---
Dr Peter Chubb http://www.nicta.com.au peter DOT chubb AT nicta.com.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
From Imagination to Impact Imagining the (ICT) Future

2009-09-30 23:54:41

by Christoph Lameter

Subject: Re: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

On Thu, 1 Oct 2009, Peter Chubb wrote:

> r3, r4 and r5 are currently unused by the kernel, and unused
> by GCC and ICC. Only hand-written assembler and weird compilers use
> those registers(and my virtual-machine monitor :-(). If you wanted to
> experiment, that'd be a starting place.
>
> I'm not sure of the advantage though -- TLB mapping is relatively
> cheap, and we're no longer hard-wiring the translation register.

Dynamic and static per cpu variables could use relative access to that
register. This would reduce code size and avoid the use of a TLB entry.
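
(Purely as an illustration of the idea, not an existing interface - the
register choice and the my_cpu_ptr() macro below are hypothetical:)

	/* dedicate a fixed, callee-saved register to this cpu's
	 * __per_cpu_offset value */
	register unsigned long __percpu_reg asm("r4");

	/* a percpu access then becomes a plain register-relative
	 * address computation, with no dependence on the pinned
	 * percpu TLB entry */
	#define my_cpu_ptr(var) \
		((typeof(var) *)((unsigned long)&(var) + __percpu_reg))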

> You';d have to do somne careful benchmarking on a wide variety of
> workloads and machines to get a definitive answer.

I have some patches here that make heavy use of dynamic percpu allocations
in the allocators to optimize the alloc / free paths.

2009-10-02 05:11:41

by Tejun Heo

Subject: Re: [PATCHSET percpu#for-next] percpu: convert ia64 to dynamic percpu and drop the old one, take#2

Tejun Heo wrote:
> Hello, all.
>
> This is the second take of convert-ia64-to-dynamic-percpu patchset.
> Changes from the last take[L] are

Patchset pushed out for linux-next with Rusty and Tony's acks added.

Thanks.

--
tejun