2009-03-10 07:56:39

by Tejun Heo

Subject: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator


Hello, all.

This patchset converts all SMP-capable archs other than three
non-trivial ones - powerpc64, sparc64 and ia64 - to use the dynamic
percpu allocator. The affected archs are:

* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc
* m32r
* powerpc(32)
* alpha
* s390

The x86 embedding first chunk allocator is generalized and used as the
default first chunk allocator. This first chunk allocator makes the
dynamic percpu allocator behave mostly identically to the previous
generic percpu allocator. Percpu memory is allocated using
alloc_bootmem_pages() and module static percpu variables are allocated
right after core percpu variables.

The only differences are that 1. there can be more space between the
percpu areas for each cpu to accommodate the minimum allocation size
and the first chunk dynamic reserve, and 2. dynamic percpu variables
use the same address translation mechanism as static ones.

#1 shouldn't cause any trouble, and #2 shouldn't either because the
offsets for dynamic percpu variables are carried in full pointer-width
variables, so as long as the calculations don't make wrong assumptions
(they should be masked by RELOC_HIDE), it should be okay. If your arch
has addressing limitations (alpha, s390), please take a second look.
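
For reference, the address calculation in question boils down to adding
a full unsigned long offset behind RELOC_HIDE. Roughly, the generic
accessors look like the following (a simplified sketch of
include/asm-generic/percpu.h and include/linux/compiler-gcc.h of this
era; the exact per-arch overrides may differ):

/* simplified sketch, not the verbatim headers */
#define RELOC_HIDE(ptr, off)						\
  ({ unsigned long __ptr;						\
     __asm__ ("" : "=r"(__ptr) : "0"(ptr));				\
     (typeof(ptr)) (__ptr + (off)); })

#define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))

#define per_cpu(var, cpu) \
	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), per_cpu_offset(cpu)))

Dynamic percpu pointers go through the same SHIFT_PERCPU_PTR() path,
just with a pointer variable instead of &per_cpu_var(var), which is why
carrying the offset in a full unsigned long is enough.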

I really wanted to test it on an architecture which is actually
affected by this change, but my powerstation wouldn't boot a 32bit
kernel, qemu-system-sparc doesn't like SMP kernels and I don't have
access to a sparc, alpha or SMP arm, so I could only test it on
x86_32 with a patch to use the generic default allocator. It works
fine there and compiles fine for sparc(32), powerpc(32), alpha and
arm. If you have access to one of the affected archs, please test
and report the result. I'll send a test module as a reply to this
message. If you can boot and load/unload the module, it should be
working fine.

This patchset is on top of the current x86/core/percpu[1] and contains
the following patches.

0001-linker-script-define-__per_cpu_load-on-all-SMP-capa.patch
0002-percpu-make-x86-addr-pcpu-ptr-conversion-macros.patch
0003-percpu-more-flexibility-for-dyn_size-of-pcpu_setup.patch
0004-percpu-generalize-embedding-first-chunk-setup-helpe.patch
0005-percpu-use-dynamic-percpu-allocator-as-the-default.patch

0001-0004 are preps. 0005 does the conversion.

This patchset is available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git tj-percpu

Ingo, I think it would be best to delay pulling till at least some of
the archs have been confirmed.

Diffstat follows.

arch/arm/kernel/vmlinux.lds.S | 1
arch/ia64/Kconfig | 3
arch/ia64/kernel/vmlinux.lds.S | 12 --
arch/powerpc/Kconfig | 3
arch/powerpc/kernel/vmlinux.lds.S | 9 --
arch/sparc/Kconfig | 3
arch/x86/Kconfig | 3
arch/x86/include/asm/percpu.h | 8 -
arch/x86/kernel/setup_percpu.c | 63 ++------------
include/linux/percpu.h | 18 +++-
init/main.c | 24 -----
kernel/module.c | 6 -
mm/Makefile | 2
mm/percpu.c | 170 ++++++++++++++++++++++++++++++++++----
14 files changed, 195 insertions(+), 130 deletions(-)

Thanks.

--
tejun

[1] 3a450de1365d20afde406f0d9b2931a5e4a4fd6a


2009-03-10 07:56:22

by Tejun Heo

Subject: [PATCH 5/5] percpu: use dynamic percpu allocator as the default percpu allocator

Impact: use dynamic allocator for all archs w/o CONFIG_HAVE_SETUP_PER_CPU_AREA

This patch makes !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use the dynamic
percpu allocator. The first chunk is allocated using the embedding
helper and 8k is reserved for modules. This ensures that the new
allocator behaves almost identically to the original allocator as far
as static percpu variables are concerned, so it shouldn't introduce
much breakage.

One thing which can go wrong is SHIFT_PERCPU_PTR() overflowing on
dynamic percpu variables, which might end up far from __per_cpu_start
or the text if the percpu address calculation makes assumptions about
how far away they can be located. alpha and s390 already have this
problem for modules because modules can be located far from the percpu
area, and they have special provisions for it. However, I don't think
this will be an issue for dynamic percpu variables as their offsets are
carried in full pointer variables instead of depending on relocations.

The following architectures are affected by this change.

* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc
* m32r
* powerpc(32)
* alpha
* s390

As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs. These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.

* powerpc(64)
* sparc(64)
* ia64

Boot and batch alloc/free tested on x86_32 with debug code (x86_32
doesn't use the default first chunk initialization). Compile tested on
sparc(32), powerpc(32) and arm.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Russell King <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Bryan Wu <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Hirokazu Takata <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
---
arch/ia64/Kconfig | 3 +++
arch/powerpc/Kconfig | 3 +++
arch/sparc/Kconfig | 3 +++
arch/x86/Kconfig | 3 ---
include/linux/percpu.h | 12 +++++++++---
init/main.c | 24 ------------------------
kernel/module.c | 6 +++---
mm/Makefile | 2 +-
mm/percpu.c | 40 +++++++++++++++++++++++++++++++++++++++-
9 files changed, 61 insertions(+), 35 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 153e727..f333149 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -85,6 +85,9 @@ config GENERIC_TIME_VSYSCALL
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y
+
config HAVE_SETUP_PER_CPU_AREA
def_bool y

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 74cc312..892f1ad 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -42,6 +42,9 @@ config GENERIC_HARDIRQS
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool PPC64
+
config HAVE_SETUP_PER_CPU_AREA
def_bool PPC64

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index c3ea215..0125b57 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -90,6 +90,9 @@ config AUDIT_ARCH
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y if SPARC64
+
config HAVE_SETUP_PER_CPU_AREA
def_bool y if SPARC64

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f5cef3f..572329a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -138,9 +138,6 @@ config ARCH_HAS_CACHE_LINE_SIZE
config HAVE_SETUP_PER_CPU_AREA
def_bool y

-config HAVE_DYNAMIC_PER_CPU_AREA
- def_bool y
-
config HAVE_CPUMASK_OF_CPU_MAP
def_bool X86_64_SMP

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index ee5615d..1daed0c 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -78,7 +78,7 @@

#ifdef CONFIG_SMP

-#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA

/* minimum unit size, also is the maximum supported allocation size */
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)
@@ -124,7 +124,11 @@ extern ssize_t __init pcpu_embed_first_chunk(

extern void *__alloc_reserved_percpu(size_t size, size_t align);

-#else /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
+extern void __init setup_per_cpu_areas(void);
+#endif
+
+#else /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

struct percpu_data {
void *ptrs[1];
@@ -138,7 +142,7 @@ struct percpu_data {
(__typeof__(ptr))__p->ptrs[(cpu)]; \
})

-#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void *__pdata);
@@ -163,6 +167,8 @@ static inline void free_percpu(void *p)
kfree(p);
}

+static inline void __init setup_per_cpu_areas(void) { }
+
#endif /* CONFIG_SMP */

#define alloc_percpu(type) (type *)__alloc_percpu(sizeof(type), \
diff --git a/init/main.c b/init/main.c
index 6441083..42ae3c0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -354,7 +354,6 @@ static void __init smp_init(void)
#define smp_init() do { } while (0)
#endif

-static inline void setup_per_cpu_areas(void) { }
static inline void setup_nr_cpu_ids(void) { }
static inline void smp_prepare_cpus(unsigned int maxcpus) { }

@@ -375,29 +374,6 @@ static void __init setup_nr_cpu_ids(void)
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

-#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-
-EXPORT_SYMBOL(__per_cpu_offset);
-
-static void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- unsigned long nr_possible_cpus = num_possible_cpus();
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
- ptr = alloc_bootmem_pages(size * nr_possible_cpus);
-
- for_each_possible_cpu(i) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- ptr += size;
- }
-}
-#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
-
/* Called by boot processor to activate the rest. */
static void __init smp_init(void)
{
diff --git a/kernel/module.c b/kernel/module.c
index f0e04d6..f9c7b86 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -368,7 +368,7 @@ static struct module *find_module(const char *name)

#ifdef CONFIG_SMP

-#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA

static void *percpu_modalloc(unsigned long size, unsigned long align,
const char *name)
@@ -393,7 +393,7 @@ static void percpu_modfree(void *freeme)
free_percpu(freeme);
}

-#else /* ... !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#else /* ... CONFIG_HAVE_LEGACY_PER_CPU_AREA */

/* Number of blocks used and allocated. */
static unsigned int pcpu_num_used, pcpu_num_allocated;
@@ -528,7 +528,7 @@ static int percpu_modinit(void)
}
__initcall(percpu_modinit);

-#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
diff --git a/mm/Makefile b/mm/Makefile
index 818569b..f3a037b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,7 +30,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
obj-$(CONFIG_SMP) += percpu.o
else
obj-$(CONFIG_SMP) += allocpercpu.o
diff --git a/mm/percpu.c b/mm/percpu.c
index 1aa5d8f..059b57a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -43,7 +43,7 @@
*
* To use this allocator, arch code should do the followings.
*
- * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ * - drop CONFIG_HAVE_LEGACY_PER_CPU_AREA
*
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
* regular address to percpu pointer and back if they need to be
@@ -1324,3 +1324,41 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
reserved_size, dyn_size,
pcpue_unit_size, pcpue_ptr, NULL);
}
+
+/*
+ * Generic percpu area setup.
+ *
+ * The embedding helper is used because its behavior closely resembles
+ * the original non-dynamic generic percpu area setup. This is
+ * important because many archs have addressing restrictions and might
+ * fail if the percpu area is located far away from the previous
+ * location. As an added bonus, in non-NUMA cases, embedding is
+ * generally a good idea TLB-wise because percpu area can piggy back
+ * on the physical linear memory mapping which uses large page
+ * mappings on applicable archs.
+ */
+#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
+EXPORT_SYMBOL(__per_cpu_offset);
+
+void __init setup_per_cpu_areas(void)
+{
+ size_t static_size = __per_cpu_end - __per_cpu_start;
+ ssize_t unit_size;
+ unsigned long delta;
+ unsigned int cpu;
+
+ /*
+ * Always reserve area for module percpu variables. That's
+ * what the legacy allocator did.
+ */
+ unit_size = pcpu_embed_first_chunk(static_size, PERCPU_MODULE_RESERVE,
+ PERCPU_DYNAMIC_RESERVE, -1);
+ if (unit_size < 0)
+ panic("Failed to initialized percpu areas.");
+
+ delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
+ for_each_possible_cpu(cpu)
+ __per_cpu_offset[cpu] = delta + cpu * unit_size;
+}
+#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
--
1.6.0.2

2009-03-10 07:56:56

by Tejun Heo

Subject: [PATCH 3/5] percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()

Impact: cleanup, more flexibility for first chunk init

Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
This restriction stemmed from an implementation detail and made things
a bit less intuitive. This patch allows @dyn_size to be specified
regardless of @unit_size and swaps the positions of @dyn_size and
@unit_size so that the parameter order makes more sense (static,
reserved and dyn sizes followed by the enclosing unit_size).

While at it, add a @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 13 ++++++-------
include/linux/percpu.h | 2 +-
mm/percpu.c | 28 ++++++++++++++--------------
3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index efa615f..e41c51f 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -233,8 +233,8 @@ proceed:
"%zu bytes\n", vm.addr, static_size);

ret = pcpu_setup_first_chunk(pcpur_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE,
- PMD_SIZE, dyn_size, vm.addr, NULL);
+ PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
+ PMD_SIZE, vm.addr, NULL);
goto out_free_ar;

enomem:
@@ -315,9 +315,8 @@ static ssize_t __init setup_pcpu_embed(size_t static_size)
pcpue_size >> PAGE_SHIFT, pcpue_ptr, static_size);

return pcpu_setup_first_chunk(pcpue_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE,
- pcpue_unit_size, dyn_size,
- pcpue_ptr, NULL);
+ PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
+ pcpue_unit_size, pcpue_ptr, NULL);
}

/*
@@ -375,8 +374,8 @@ static ssize_t __init setup_pcpu_4k(size_t static_size)
pcpu4k_nr_static_pages, static_size);

ret = pcpu_setup_first_chunk(pcpu4k_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE, -1, -1, NULL,
- pcpu4k_populate_pte);
+ PERCPU_FIRST_CHUNK_RESERVE, -1,
+ -1, NULL, pcpu4k_populate_pte);
goto out_free_ar;

enomem:
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 54a968b..fb455dc 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -107,7 +107,7 @@ typedef void (*pcpu_populate_pte_fn_t)(unsigned long addr);

extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
- ssize_t unit_size, ssize_t dyn_size,
+ ssize_t dyn_size, ssize_t unit_size,
void *base_addr,
pcpu_populate_pte_fn_t populate_pte_fn);

diff --git a/mm/percpu.c b/mm/percpu.c
index c6f38a2..2f94661 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1027,8 +1027,8 @@ EXPORT_SYMBOL_GPL(free_percpu);
* @get_page_fn: callback to fetch page pointer
* @static_size: the size of static percpu area in bytes
* @reserved_size: the size of reserved percpu area in bytes
- * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE, -1 for auto
* @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE, -1 for auto
* @base_addr: mapped address, NULL for auto
* @populate_pte_fn: callback to allocate pagetable, NULL if unnecessary
*
@@ -1053,14 +1053,14 @@ EXPORT_SYMBOL_GPL(free_percpu);
* limited offset range for symbol relocations to guarantee module
* percpu symbols fall inside the relocatable range.
*
+ * @dyn_size, if non-negative, determines the number of bytes
+ * available for dynamic allocation in the first chunk. Specifying
+ * non-negative value makes percpu leave alone the area beyond
+ * @static_size + @reserved_size + @dyn_size.
+ *
* @unit_size, if non-negative, specifies unit size and must be
* aligned to PAGE_SIZE and equal to or larger than @static_size +
- * @reserved_size + @dyn_size.
- *
- * @dyn_size, if non-negative, limits the number of bytes available
- * for dynamic allocation in the first chunk. Specifying non-negative
- * value make percpu leave alone the area beyond @static_size +
- * @reserved_size + @dyn_size.
+ * @reserved_size + if non-negative, @dyn_size.
*
* Non-null @base_addr means that the caller already allocated virtual
* region for the first chunk and mapped it. percpu must not mess
@@ -1083,12 +1083,14 @@ EXPORT_SYMBOL_GPL(free_percpu);
*/
size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
size_t static_size, size_t reserved_size,
- ssize_t unit_size, ssize_t dyn_size,
+ ssize_t dyn_size, ssize_t unit_size,
void *base_addr,
pcpu_populate_pte_fn_t populate_pte_fn)
{
static struct vm_struct first_vm;
static int smap[2], dmap[2];
+ size_t size_sum = static_size + reserved_size +
+ (dyn_size >= 0 ? dyn_size : 0);
struct pcpu_chunk *schunk, *dchunk = NULL;
unsigned int cpu;
int nr_pages;
@@ -1099,20 +1101,18 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
BUG_ON(!static_size);
if (unit_size >= 0) {
- BUG_ON(unit_size < static_size + reserved_size +
- (dyn_size >= 0 ? dyn_size : 0));
+ BUG_ON(unit_size < size_sum);
BUG_ON(unit_size & ~PAGE_MASK);
- } else {
- BUG_ON(dyn_size >= 0);
+ BUG_ON(unit_size < PCPU_MIN_UNIT_SIZE);
+ } else
BUG_ON(base_addr);
- }
BUG_ON(base_addr && populate_pte_fn);

if (unit_size >= 0)
pcpu_unit_pages = unit_size >> PAGE_SHIFT;
else
pcpu_unit_pages = max_t(int, PCPU_MIN_UNIT_SIZE >> PAGE_SHIFT,
- PFN_UP(static_size + reserved_size));
+ PFN_UP(size_sum));

pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
pcpu_chunk_size = num_possible_cpus() * pcpu_unit_size;
--
1.6.0.2

2009-03-10 07:57:19

by Tejun Heo

Subject: [PATCH 2/5] percpu: make x86 addr <-> pcpu ptr conversion macros generic

Impact: generic addr <-> pcpu ptr conversion macros

There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
__pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start
defined, they'll do the right thing regardless of actual layout.

Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
and allow archs to override them as necessary.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/include/asm/percpu.h | 8 --------
mm/percpu.c | 16 +++++++++++++++-
2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 8f1d2fb..aee103b 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -43,14 +43,6 @@
#else /* ...!ASSEMBLY */

#include <linux/stringify.h>
-#include <asm/sections.h>
-
-#define __addr_to_pcpu_ptr(addr) \
- (void *)((unsigned long)(addr) - (unsigned long)pcpu_base_addr \
- + (unsigned long)__per_cpu_start)
-#define __pcpu_ptr_to_addr(ptr) \
- (void *)((unsigned long)(ptr) + (unsigned long)pcpu_base_addr \
- - (unsigned long)__per_cpu_start)

#ifdef CONFIG_SMP
#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
diff --git a/mm/percpu.c b/mm/percpu.c
index bfe6a3a..c6f38a2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -46,7 +46,8 @@
* - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
*
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
- * regular address to percpu pointer and back
+ * regular address to percpu pointer and back if they need to be
+ * different from the default
*
* - use pcpu_setup_first_chunk() during percpu area initialization to
* setup the first chunk containing the kernel static percpu area
@@ -67,11 +68,24 @@
#include <linux/workqueue.h>

#include <asm/cacheflush.h>
+#include <asm/sections.h>
#include <asm/tlbflush.h>

#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */
#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */

+/* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
+#ifndef __addr_to_pcpu_ptr
+#define __addr_to_pcpu_ptr(addr) \
+ (void *)((unsigned long)(addr) - (unsigned long)pcpu_base_addr \
+ + (unsigned long)__per_cpu_start)
+#endif
+#ifndef __pcpu_ptr_to_addr
+#define __pcpu_ptr_to_addr(ptr) \
+ (void *)((unsigned long)(ptr) + (unsigned long)pcpu_base_addr \
+ - (unsigned long)__per_cpu_start)
+#endif
+
struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
struct rb_node rb_node; /* key is chunk->vm->addr */
--
1.6.0.2

2009-03-10 07:57:50

by Tejun Heo

Subject: [PATCH 1/5] linker script: define __per_cpu_load on all SMP capable archs

Impact: __per_cpu_load available on all SMP capable archs

Percpu now requires three symbols to be defined - __per_cpu_load,
__per_cpu_start and __per_cpu_end. There were three archs which
didn't have all of them. Update them as follows.

* powerpc: can use generic PERCPU() macro. Compile tested for
powerpc32, compile/boot tested for powerpc64.

* ia64: can use generic PERCPU_VADDR() macro. __phys_per_cpu_start is
identical to __per_cpu_load. Compile tested and symbol table looks
identical after the change except for the additional __per_cpu_load.

* arm: added explicit __per_cpu_load definition. Currently uses
unified .init output section so can't use the generic macro. Dunno
whether the unified .init output section is required by an arch
peculiarity, so I left it alone. Please break it up and use PERCPU()
if possible.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Pat Gefre <[email protected]>
Cc: Russell King <[email protected]>
---
arch/arm/kernel/vmlinux.lds.S | 1 +
arch/ia64/kernel/vmlinux.lds.S | 12 ++----------
arch/powerpc/kernel/vmlinux.lds.S | 9 +--------
3 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 85598f7..1602373 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -64,6 +64,7 @@ SECTIONS
__initramfs_end = .;
#endif
. = ALIGN(4096);
+ __per_cpu_load = .;
__per_cpu_start = .;
*(.data.percpu.page_aligned)
*(.data.percpu)
diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index f45e4e5..3765efc 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -213,17 +213,9 @@ SECTIONS
{ *(.data.cacheline_aligned) }

/* Per-cpu data: */
- percpu : { } :percpu
. = ALIGN(PERCPU_PAGE_SIZE);
- __phys_per_cpu_start = .;
- .data.percpu PERCPU_ADDR : AT(__phys_per_cpu_start - LOAD_OFFSET)
- {
- __per_cpu_start = .;
- *(.data.percpu.page_aligned)
- *(.data.percpu)
- *(.data.percpu.shared_aligned)
- __per_cpu_end = .;
- }
+ PERCPU_VADDR(PERCPU_ADDR, :percpu)
+ __phys_per_cpu_start = __per_cpu_load;
. = __phys_per_cpu_start + PERCPU_PAGE_SIZE; /* ensure percpu data fits
* into percpu page size
*/
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 295ccc5..67f07f4 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -181,14 +181,7 @@ SECTIONS
__initramfs_end = .;
}
#endif
- . = ALIGN(PAGE_SIZE);
- .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) {
- __per_cpu_start = .;
- *(.data.percpu.page_aligned)
- *(.data.percpu)
- *(.data.percpu.shared_aligned)
- __per_cpu_end = .;
- }
+ PERCPU(PAGE_SIZE)

. = ALIGN(8);
.machine.desc : AT(ADDR(.machine.desc) - LOAD_OFFSET) {
--
1.6.0.2

2009-03-10 07:57:34

by Tejun Heo

Subject: [PATCH 4/5] percpu: generalize embedding first chunk setup helper

Impact: code reorganization

Separate out the embedding first chunk setup helper from the x86
embedding first chunk allocator and put it in mm/percpu.c. This will
be used by the default percpu first chunk allocator and possibly by
other archs.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup_percpu.c | 54 +++----------------------
include/linux/percpu.h | 4 ++
mm/percpu.c | 86 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 96 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e41c51f..400331b 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -257,31 +257,13 @@ static ssize_t __init setup_pcpu_remap(size_t static_size)
* Embedding allocator
*
* The first chunk is sized to just contain the static area plus
- * module and dynamic reserves, and allocated as a contiguous area
- * using bootmem allocator and used as-is without being mapped into
- * vmalloc area. This enables the first chunk to piggy back on the
- * linear physical PMD mapping and doesn't add any additional pressure
- * to TLB. Note that if the needed size is smaller than the minimum
- * unit size, the leftover is returned to the bootmem allocator.
+ * module and dynamic reserves and embedded into linear physical
+ * mapping so that it can use PMD mapping without additional TLB
+ * pressure.
*/
-static void *pcpue_ptr __initdata;
-static size_t pcpue_size __initdata;
-static size_t pcpue_unit_size __initdata;
-
-static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
-{
- size_t off = (size_t)pageno << PAGE_SHIFT;
-
- if (off >= pcpue_size)
- return NULL;
-
- return virt_to_page(pcpue_ptr + cpu * pcpue_unit_size + off);
-}
-
static ssize_t __init setup_pcpu_embed(size_t static_size)
{
- unsigned int cpu;
- size_t dyn_size;
+ size_t reserve = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;

/*
* If large page isn't supported, there's no benefit in doing
@@ -291,32 +273,8 @@ static ssize_t __init setup_pcpu_embed(size_t static_size)
if (!cpu_has_pse || pcpu_need_numa())
return -EINVAL;

- /* allocate and copy */
- pcpue_size = PFN_ALIGN(static_size + PERCPU_MODULE_RESERVE +
- PERCPU_DYNAMIC_RESERVE);
- pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
- dyn_size = pcpue_size - static_size - PERCPU_FIRST_CHUNK_RESERVE;
-
- pcpue_ptr = pcpu_alloc_bootmem(0, num_possible_cpus() * pcpue_unit_size,
- PAGE_SIZE);
- if (!pcpue_ptr)
- return -ENOMEM;
-
- for_each_possible_cpu(cpu) {
- void *ptr = pcpue_ptr + cpu * pcpue_unit_size;
-
- free_bootmem(__pa(ptr + pcpue_size),
- pcpue_unit_size - pcpue_size);
- memcpy(ptr, __per_cpu_load, static_size);
- }
-
- /* we're ready, commit */
- pr_info("PERCPU: Embedded %zu pages at %p, static data %zu bytes\n",
- pcpue_size >> PAGE_SHIFT, pcpue_ptr, static_size);
-
- return pcpu_setup_first_chunk(pcpue_get_page, static_size,
- PERCPU_FIRST_CHUNK_RESERVE, dyn_size,
- pcpue_unit_size, pcpue_ptr, NULL);
+ return pcpu_embed_first_chunk(static_size, PERCPU_FIRST_CHUNK_RESERVE,
+ reserve - PERCPU_FIRST_CHUNK_RESERVE, -1);
}

/*
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index fb455dc..ee5615d 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -111,6 +111,10 @@ extern size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
void *base_addr,
pcpu_populate_pte_fn_t populate_pte_fn);

+extern ssize_t __init pcpu_embed_first_chunk(
+ size_t static_size, size_t reserved_size,
+ ssize_t dyn_size, ssize_t unit_size);
+
/*
* Use this to get to a cpu's version of the per-cpu object
* dynamically allocated. Non-atomic access to the current CPU's
diff --git a/mm/percpu.c b/mm/percpu.c
index 2f94661..1aa5d8f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1238,3 +1238,89 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
pcpu_base_addr = (void *)pcpu_chunk_addr(schunk, 0, 0);
return pcpu_unit_size;
}
+
+/*
+ * Embedding first chunk setup helper.
+ */
+static void *pcpue_ptr __initdata;
+static size_t pcpue_size __initdata;
+static size_t pcpue_unit_size __initdata;
+
+static struct page * __init pcpue_get_page(unsigned int cpu, int pageno)
+{
+ size_t off = (size_t)pageno << PAGE_SHIFT;
+
+ if (off >= pcpue_size)
+ return NULL;
+
+ return virt_to_page(pcpue_ptr + cpu * pcpue_unit_size + off);
+}
+
+/**
+ * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
+ * @static_size: the size of static percpu area in bytes
+ * @reserved_size: the size of reserved percpu area in bytes
+ * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @unit_size: unit size in bytes, must be multiple of PAGE_SIZE, -1 for auto
+ *
+ * This is a helper to ease setting up embedded first percpu chunk and
+ * can be called where pcpu_setup_first_chunk() is expected.
+ *
+ * If this function is used to setup the first chunk, it is allocated
+ * as a contiguous area using bootmem allocator and used as-is without
+ * being mapped into vmalloc area. This enables the first chunk to
+ * piggy back on the linear physical mapping which often uses larger
+ * page size.
+ *
+ * When @dyn_size is positive, dynamic area might be larger than
+ * specified to fill page alignment. Also, when @dyn_size is auto,
+ * @dyn_size does not fill the whole first chunk but only what's
+ * necessary for page alignment after static and reserved areas.
+ *
+ * If the needed size is smaller than the minimum or specified unit
+ * size, the leftover is returned to the bootmem allocator.
+ *
+ * RETURNS:
+ * The determined pcpu_unit_size which can be used to initialize
+ * percpu access on success, -errno on failure.
+ */
+ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
+ ssize_t dyn_size, ssize_t unit_size)
+{
+ unsigned int cpu;
+
+ /* determine parameters and allocate */
+ pcpue_size = PFN_ALIGN(static_size + reserved_size +
+ (dyn_size >= 0 ? dyn_size : 0));
+ if (dyn_size != 0)
+ dyn_size = pcpue_size - static_size - reserved_size;
+
+ if (unit_size >= 0) {
+ BUG_ON(unit_size < pcpue_size);
+ pcpue_unit_size = unit_size;
+ } else
+ pcpue_unit_size = max_t(size_t, pcpue_size, PCPU_MIN_UNIT_SIZE);
+
+ pcpue_ptr = __alloc_bootmem_nopanic(
+ num_possible_cpus() * pcpue_unit_size,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (!pcpue_ptr)
+ return -ENOMEM;
+
+ /* return the leftover and copy */
+ for_each_possible_cpu(cpu) {
+ void *ptr = pcpue_ptr + cpu * pcpue_unit_size;
+
+ free_bootmem(__pa(ptr + pcpue_size),
+ pcpue_unit_size - pcpue_size);
+ memcpy(ptr, __per_cpu_load, static_size);
+ }
+
+ /* we're ready, commit */
+ pr_info("PERCPU: Embedded %zu pages at %p, static data %zu bytes\n",
+ pcpue_size >> PAGE_SHIFT, pcpue_ptr, static_size);
+
+ return pcpu_setup_first_chunk(pcpue_get_page, static_size,
+ reserved_size, dyn_size,
+ pcpue_unit_size, pcpue_ptr, NULL);
+}
--
1.6.0.2

2009-03-10 07:58:28

by Tejun Heo

Subject: test module to verify percpu allocator

Hello, again.

Attached is a test module for the percpu allocator. Just build it as
an external module and load/unload it to perform the test.
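
Roughly, it does something like the following - just a minimal sketch
with made-up names, not the attached test_module.c itself:

/* minimal sketch, NOT the attached test_module.c */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

/* static percpu variable in a module - exercises the module reserve */
static DEFINE_PER_CPU(unsigned long, pcpu_test_static);

/* dynamic allocation - exercises alloc_percpu()/free_percpu() */
static unsigned long *pcpu_test_dyn;

static int __init pcpu_test_init(void)
{
	unsigned int cpu;

	pcpu_test_dyn = alloc_percpu(unsigned long);
	if (!pcpu_test_dyn)
		return -ENOMEM;

	/* write a distinct value into each cpu's slot, then verify */
	for_each_possible_cpu(cpu) {
		per_cpu(pcpu_test_static, cpu) = cpu + 1;
		*per_cpu_ptr(pcpu_test_dyn, cpu) = cpu + 1;
	}

	for_each_possible_cpu(cpu) {
		WARN_ON(per_cpu(pcpu_test_static, cpu) != cpu + 1);
		WARN_ON(*per_cpu_ptr(pcpu_test_dyn, cpu) != cpu + 1);
	}

	printk(KERN_INFO "pcpu_test: passed\n");
	return 0;
}

static void __exit pcpu_test_exit(void)
{
	free_percpu(pcpu_test_dyn);
}

module_init(pcpu_test_init);
module_exit(pcpu_test_exit);
MODULE_LICENSE("GPL");

If the arch's percpu conversion is broken, the WARN_ON()s (or the
module load itself) should trip.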

Thanks.

--
tejun


Attachments:
test_module.c (5.04 kB)

2009-03-10 10:59:39

by David Miller

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

From: Tejun Heo <[email protected]>
Date: Tue, 10 Mar 2009 16:53:46 +0900

> This patchset converts all SMP-capable arches other than three
> non-trivial ones - powerpc64, sparc64 and ia64 - to use dynamic percpu
> allocator. The affected archs are
...
> The only differences are 1. there can be more space between percpu
> areas for each cpu to accomodate minimum allocation size and first
> chunk dynamic reserve and 2. dynamic percpu variables use the same
> address translation mechanism as static ones.
>
> #1 shouldn't cause any trouble and #2 shouldn't too because the
> offsets for dynamic percpu variables are carried in full pointer-width
> variables, so as long as the calculations don't make wrong assumptions
> (should be masked by RELOC_HIDE), it should be okay, but if your arch
> has addressing limitations (alpha, s390), please take a second look.

The one thing sparc64 does is calculate the base of the per-cpu
area using a base (contained in a fixed global register) and a shift.

From arch/sparc/include/asm/percpu_64.h

register unsigned long __local_per_cpu_offset asm("g5");
extern unsigned long __per_cpu_base;
extern unsigned long __per_cpu_shift;
#define __per_cpu_offset(__cpu) \
(__per_cpu_base + ((unsigned long)(__cpu) << __per_cpu_shift))
#define per_cpu_offset(x) (__per_cpu_offset(x))
#define __my_cpu_offset __local_per_cpu_offset

I hope this won't cause problems for what you're trying to accomplish.

2009-03-11 05:58:03

by Tejun Heo

Subject: [GIT PULL] pull request for safe part

Tejun Heo wrote:
> 0001-linker-script-define-__per_cpu_load-on-all-SMP-capa.patch
> 0002-percpu-make-x86-addr-pcpu-ptr-conversion-macros.patch
> 0003-percpu-more-flexibility-for-dyn_size-of-pcpu_setup.patch
> 0004-percpu-generalize-embedding-first-chunk-setup-helpe.patch

I left the above four patches, which shouldn't cause disruption on
any arch, in #tj-percpu and put the recent alignment warning fix on
top of them.

> 0005-percpu-use-dynamic-percpu-allocator-as-the-default.patch

and moved this one to #tj-percpu-exp. I'll move it into #tj-percpu
once it gets confirmed. So, pulling from #tj-percpu should be safe
now.

Thanks.

--
tejun

2009-03-11 06:04:01

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

David Miller wrote:
> The one thing sparc64 does is calculate the base of the per-cpu
> area using a base (contained in a fixed global register) a shift.
>
> From arch/sparc/include/asm/percpu_64.h
>
> register unsigned long __local_per_cpu_offset asm("g5");
> extern unsigned long __per_cpu_base;
> extern unsigned long __per_cpu_shift;
> #define __per_cpu_offset(__cpu) \
> (__per_cpu_base + ((unsigned long)(__cpu) << __per_cpu_shift))
> #define per_cpu_offset(x) (__per_cpu_offset(x))
> #define __my_cpu_offset __local_per_cpu_offset
>
> I hope this won't cause problems for what you're trying to accomplish.

It won't at all. I'll be converting the remaining three once the
default conversion is settled, and I'm getting my very first Sun
workstation this weekend, so no worries. :-)

Thanks.

--
tejun

2009-03-16 18:07:37

by Martin Schwidefsky

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hi Tejun,

On Tue, 10 Mar 2009 16:53:46 +0900
Tejun Heo <[email protected]> wrote:

> The only differences are 1. there can be more space between percpu
> areas for each cpu to accomodate minimum allocation size and first
> chunk dynamic reserve and 2. dynamic percpu variables use the same
> address translation mechanism as static ones.
>
> #1 shouldn't cause any trouble and #2 shouldn't too because the
> offsets for dynamic percpu variables are carried in full pointer-width
> variables, so as long as the calculations don't make wrong assumptions
> (should be masked by RELOC_HIDE), it should be okay, but if your arch
> has addressing limitations (alpha, s390), please take a second look.

We do have a problem with #2; the dynamic percpu patches currently
break s390. But the nice thing is that we can now get rid of the GOTENT
relocation for the percpu symbols. If the code is changed to use
RELOC_HIDE for the SHIFT_PERCPU_PTR define, everything works just fine.
Patch attached. Nice work, guys.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] s390: percpu access.

From: Martin Schwidefsky <[email protected]>

With the dynamic percpu allocator there is no need anymore to play
tricks with the GOTENT relocation for the access to the percpu
symbols. A simple RELOC_HIDE gets the job done.

Signed-off-by: Martin Schwidefsky <[email protected]>
---

arch/s390/include/asm/percpu.h | 26 +-------------------------
1 file changed, 1 insertion(+), 25 deletions(-)

diff -urpN linux-next/arch/s390/include/asm/percpu.h linux-patched/arch/s390/include/asm/percpu.h
--- linux-next/arch/s390/include/asm/percpu.h 2009-03-16 18:53:47.000000000 +0100
+++ linux-patched/arch/s390/include/asm/percpu.h 2009-03-16 18:53:35.000000000 +0100
@@ -4,31 +4,7 @@
#include <linux/compiler.h>
#include <asm/lowcore.h>

-/*
- * s390 uses its own implementation for per cpu data, the offset of
- * the cpu local data area is cached in the cpu's lowcore memory.
- * For 64 bit module code s390 forces the use of a GOT slot for the
- * address of the per cpu variable. This is needed because the module
- * may be more than 4G above the per cpu area.
- */
-#if defined(__s390x__) && defined(MODULE)
-
-#define SHIFT_PERCPU_PTR(ptr,offset) (({ \
- extern int simple_identifier_##var(void); \
- unsigned long *__ptr; \
- asm ( "larl %0, %1@GOTENT" \
- : "=a" (__ptr) : "X" (ptr) ); \
- (typeof(ptr))((*__ptr) + (offset)); }))
-
-#else
-
-#define SHIFT_PERCPU_PTR(ptr, offset) (({ \
- extern int simple_identifier_##var(void); \
- unsigned long __ptr; \
- asm ( "" : "=a" (__ptr) : "0" (ptr) ); \
- (typeof(ptr)) (__ptr + (offset)); }))
-
-#endif
+#define SHIFT_PERCPU_PTR(var, offset) RELOC_HIDE(&per_cpu_var(var), (offset))

#define __my_cpu_offset S390_lowcore.percpu_offset

2009-03-20 02:37:33

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello, Martin.

Sorry about the delay.

Martin Schwidefsky wrote:
> We do have a problem with #2, the dynamic percpu patches currently
> breaks s390. But the nice thing is that we can now get rid of the GOTENT
> relocation for the percpu symbols. If the code is changed to use
> RELOC_HIDE for the SHIFT_PERCPU_PTR define, everything works just fine.
> Patch attached. Nice works guys.
>
> Subject: [PATCH] s390: percpu access.
>
> From: Martin Schwidefsky <[email protected]>
>
> With the dynamic percpu allocator there is no need anymore to play
> tricks with the GOTENT relocation for the access to the percpu
> symbols. A simple RELOC_HIDE gets the job done.

Hmm... I don't quite get it. The GOTENT was to work around large
offsets for modules, right? Can you please explain what changed by
the dynamic percpu allocator?

> +#define SHIFT_PERCPU_PTR(var, offset) RELOC_HIDE(&per_cpu_var(var), (offset))

Hmm... @var already has per_cpu__ prefix when the above macro is
invoked, so doing per_cpu_var() on it again wouldn't work. If simple
RELOC_HIDE works, you should be able to simply drop the above
definition. The generic percpu.h will define it.

Thanks.

--
tejun

2009-03-24 15:26:15

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Tejun Heo wrote:
> Hello, Martin.
>
> Sorry about the delay.
>
> Martin Schwidefsky wrote:
>> We do have a problem with #2, the dynamic percpu patches currently
>> breaks s390. But the nice thing is that we can now get rid of the GOTENT
>> relocation for the percpu symbols. If the code is changed to use
>> RELOC_HIDE for the SHIFT_PERCPU_PTR define, everything works just fine.
>> Patch attached. Nice works guys.
>>
>> Subject: [PATCH] s390: percpu access.
>>
>> From: Martin Schwidefsky <[email protected]>
>>
>> With the dynamic percpu allocator there is no need anymore to play
>> tricks with the GOTENT relocation for the access to the percpu
>> symbols. A simple RELOC_HIDE gets the job done.
>
> Hmm... I don't quite get it. The GOTENT was to work around large
> offsets for modules, right? Can you please explain what changed by
> the dynamic percpu allocator?
>
>> +#define SHIFT_PERCPU_PTR(var, offset) RELOC_HIDE(&per_cpu_var(var), (offset))
>
> Hmm... @var already has per_cpu__ prefix when the above macro is
> invoked, so doing per_cpu_var() on it again wouldn't work. If simple
> RELOC_HIDE works, you should be able to simply drop the above
> definition. The generic percpu.h will define it.

Ping.

--
tejun

2009-03-25 11:28:01

by Martin Schwidefsky

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 25 Mar 2009 00:22:44 +0900
Tejun Heo <[email protected]> wrote:

> Tejun Heo wrote:
> > Hello, Martin.
> >
> > Sorry about the delay.

Yes, ditto..

> > Martin Schwidefsky wrote:
> >> We do have a problem with #2, the dynamic percpu patches currently
> >> breaks s390. But the nice thing is that we can now get rid of the GOTENT
> >> relocation for the percpu symbols. If the code is changed to use
> >> RELOC_HIDE for the SHIFT_PERCPU_PTR define, everything works just fine.
> >> Patch attached. Nice works guys.
> >>
> >> Subject: [PATCH] s390: percpu access.
> >>
> >> From: Martin Schwidefsky <[email protected]>
> >>
> >> With the dynamic percpu allocator there is no need anymore to play
> >> tricks with the GOTENT relocation for the access to the percpu
> >> symbols. A simple RELOC_HIDE gets the job done.
> >
> > Hmm... I don't quite get it. The GOTENT was to work around large
> > offsets for modules, right? Can you please explain what changed by
> > the dynamic percpu allocator?

Unfortunately it didn't change. The problem is still there; it's just
that with my particular configuration (and the correct patch) the
system happened to work because the problematic modules were not in
use. But in general it won't work.

The reason for the GOTENT indirection is static per-cpu variables that
are defined inside a module. The compiler considers these to be local.
For locally defined per_cpu__#var symbols the compiler uses an
instruction that is limited to the scope of a single object, which is
+-4 GB. The trick with GOTENT introduced an indirection which hid the
problem.
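
For example (made-up names), the problematic pattern is simply a
per-cpu variable with static linkage defined and used inside a module:

/* illustrative only */
static DEFINE_PER_CPU(long, mod_counter);

static void bump_counter(void)
{
	/* references the module-local symbol per_cpu__mod_counter */
	__get_cpu_var(mod_counter)++;
}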

Without the GOTENT indirection the access to a static per cpu variable
will look like this:

0000000000000000 <test_fn>:
0: e3 30 03 30 00 04 lg %r3,816
6: c0 10 00 00 00 00 larl %r1,6 <test_fn+0x6>
8: R_390_PC32DBL .data.percpu+0x2
c: e3 23 10 00 00 04 lg %r2,0(%r3,%r1)

The R_390_PC32DBL relocation will fail during module relocation if the
per-cpu area is farther than 4GB away from the vmalloc area.

With your patches and a RELOC_HIDE version that uses the GOTENT
indirection the kernel won't compile because the "X" constraint for
the GOTENT access needs a symbol and there are quite a few users that
pass a pointer. I do not see a simple solution for that problem yet.
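
For example (made-up names again), per_cpu_ptr() on memory returned by
alloc_percpu() now goes through the very same SHIFT_PERCPU_PTR(), and
there the first argument is a plain pointer variable, not a symbol:

/* illustrative only */
static unsigned long *counters;	/* assigned from alloc_percpu(unsigned long) */

static unsigned long read_counter(unsigned int cpu)
{
	/* expands to SHIFT_PERCPU_PTR(counters, per_cpu_offset(cpu));
	 * "counters" cannot satisfy the "X" (symbol) constraint */
	return *per_cpu_ptr(counters, cpu);
}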

> >> +#define SHIFT_PERCPU_PTR(var, offset) RELOC_HIDE(&per_cpu_var(var), (offset))
> >
> > Hmm... @var already has per_cpu__ prefix when the above macro is
> > invoked, so doing per_cpu_var() on it again wouldn't work. If simple
> > RELOC_HIDE works, you should be able to simply drop the above
> > definition. The generic percpu.h will define it.

That is true; in my working version of the patch the second
per_cpu_var() wasn't there.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-25 11:55:18

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello, Martin.

Martin Schwidefsky wrote:
>>>> With the dynamic percpu allocator there is no need anymore to play
>>>> tricks with the GOTENT relocation for the access to the percpu
>>>> symbols. A simple RELOC_HIDE gets the job done.
>>> Hmm... I don't quite get it. The GOTENT was to work around large
>>> offsets for modules, right? Can you please explain what changed by
>>> the dynamic percpu allocator?
>
> Unfortunately it didn't change. The problem is still there, only with
> my particular configuration (and the correct patch) the system did
> work because the problematic modules were not in use. But in general it
> won't work.
>
> The reason for the GOTENT indirection are static per-cpu variables that
> are defined inside a module. The compiler considers these to be local.
> For locally defined per_cpu__#var symbols the compiler uses an
> instruction that is limited to the scope of a single object, which is
> +-4 GB. The trick with GOTENT introduced an indirection which hid the
> problem.
>
> Without the GOTENT indirection the access to a static per cpu variable
> will look like this:
>
> 0000000000000000 <test_fn>:
> 0: e3 30 03 30 00 04 lg %r3,816
> 6: c0 10 00 00 00 00 larl %r1,6 <test_fn+0x6>
> 8: R_390_PC32DBL .data.percpu+0x2
> c: e3 23 10 00 00 04 lg %r2,0(%r3,%r1)
>
> The R_390_PC32DBL relocation in the module relocation will fail if the
> per-cpu area is farther than 4GB away from the vmalloc area.

Okay, up to this point, I understand, so nothing really changed for
symbols (core or modules) by the dynamic percpu allocator and they
still need GOTENT, right?

> With your patches and a RELOC_HIDE version that uses the GOTENT
> indirection the kernel won't compile because the "X" constraint for
> the GOTENT access needs a symbol and there are quite a few users that
> pass a pointer. I do not see a simple solution for that problem yet.

Ah... okay. Now I get it. It wasn't expecting variables there. How
about doing the following?

#define SHIFT_PERCPU_PTR(ptr, offset) (({ \
if (__builtin_constant_p(ptr)) \
do GOTENT trick; \
else \
RELOC_HIDE(); \
}))

Thanks.

--
tejun

2009-03-25 12:24:42

by Ingo Molnar

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator


* Tejun Heo <[email protected]> wrote:

> > With your patches and a RELOC_HIDE version that uses the GOTENT
> > indirection the kernel won't compile because the "X" constraint
> > for the GOTENT access needs a symbol and there are quite a few
> > users that pass a pointer. I do not see a simple solution for
> > that problem yet.
>
> Ah... okay. Now I get it. It wasn't expecting variables there. How
> about doing the following?
>
> #define SHIFT_PERCPU_PTR(ptr, offset) (({ \
> if (__builtin_constant_p(ptr)) \
> do GOTENT trick; \
> else \
> RELOC_HIDE(); \
> }))
>
> Thanks.

Btw., is there any intermediate fix/workaround i could apply
perhaps? I'd like to send the percpu bits early in the merge window
and would hate to break Martin's arch ...

Ingo

2009-03-25 12:31:26

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Ingo Molnar wrote:
> * Tejun Heo <[email protected]> wrote:
>
>>> With your patches and a RELOC_HIDE version that uses the GOTENT
>>> indirection the kernel won't compile because the "X" constraint
>>> for the GOTENT access needs a symbol and there are quite a few
>>> users that pass a pointer. I do not see a simple solution for
>>> that problem yet.
>> Ah... okay. Now I get it. It wasn't expecting variables there. How
>> about doing the following?
>>
>> #define SHIFT_PERCPU_PTR(ptr, offset) (({ \
>> if (__builtin_constant_p(ptr)) \
>> do GOTENT trick; \
>> else \
>> RELOC_HIDE(); \
>> }))
>>
>> Thanks.
>
> Btw., is there any intermediate fix/workaround i could apply
> perhaps? I'd like to send the percpu bits early in the merge window
> and would hate to break Martin's arch ...

Martin's original patch should do the trick although it would be
slower for static symbols. I'll merge it and post the tree.

BTW, I couldn't do the sparc32 test as the sparc I got is an IIe,
which can't boot 32-bit kernels. :-(

Thanks.

--
tejun

2009-03-25 12:40:54

by Ingo Molnar

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator


* Tejun Heo <[email protected]> wrote:

> > Btw., is there any intermediate fix/workaround i could apply
> > perhaps? I'd like to send the percpu bits early in the merge
> > window and would hate to break Martin's arch ...
>
> Martin's original patch should do the trick although it would be
> slower for static symbols. I'll merge it and post the tree.

thanks!

> BTW, I couldn't do the sparc32 test as the sparc I got is IIe
> which can't boot 32 bit kernels. :-(

a pity ...

Ingo

2009-03-25 13:13:49

by Martin Schwidefsky

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 25 Mar 2009 21:27:49 +0900
Tejun Heo <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Tejun Heo <[email protected]> wrote:
> >
> >>> With your patches and a RELOC_HIDE version that uses the GOTENT
> >>> indirection the kernel won't compile because the "X" constraint
> >>> for the GOTENT access needs a symbol and there are quite a few
> >>> users that pass a pointer. I do not see a simple solution for
> >>> that problem yet.
> >> Ah... okay. Now I get it. It wasn't expecting variables there. How
> >> about doing the following?
> >>
> >> #define SHIFT_PERCPU_PTR(ptr, offset) (({ \
> >> if (__builtin_constant_p(ptr)) \
> >> do GOTENT trick; \
> >> else \
> >> RELOC_HIDE(); \
> >> }))
> >>
> >> Thanks.
> >
> > Btw., is there any intermediate fix/workaround i could apply
> > perhaps? I'd like to send the percpu bits early in the merge window
> > and would hate to break Martin's arch ...
>
> Martin's original patch should do the trick although it would be
> slower for static symbols. I'll merge it and post the tree.

No, my original patch doesn't work. It will break modules that use
static per-cpu variables.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-25 13:23:35

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Martin Schwidefsky wrote:
>> Martin's original patch should do the trick although it would be
>> slower for static symbols. I'll merge it and post the tree.
>
> No, my original patch doesn't work. It will break modules that use
> static per-cpu variables.

Oops. Even with the default offset adding macros? Heh... I think it
would be best to wait for your fix then.

Thanks.

--
tejun

2009-03-25 13:26:21

by Martin Schwidefsky

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 25 Mar 2009 22:21:03 +0900
Tejun Heo <[email protected]> wrote:

> Martin Schwidefsky wrote:
> >> Martin's original patch should do the trick although it would be
> >> slower for static symbols. I'll merge it and post the tree.
> >
> > No, my original patch doesn't work. It will break modules that use
> > static per-cpu variables.
>
> Oops. Even with the default offset adding macros? Heh... I think it
> would be best to wait for your fix then.

We could use HAVE_LEGACY_PER_CPU_AREA for the time being.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-25 13:37:06

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Martin Schwidefsky wrote:
> On Wed, 25 Mar 2009 22:21:03 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Martin Schwidefsky wrote:
>>>> Martin's original patch should do the trick although it would be
>>>> slower for static symbols. I'll merge it and post the tree.
>>> No, my original patch doesn't work. It will break modules that use
>>> static per-cpu variables.
>> Oops. Even with the default offset adding macros? Heh... I think it
>> would be best to wait for your fix then.
>
> We could use HAVE_LEGACY_PER_CPU_AREA for the time being.

Eh... The thing is that the patch kills the legacy default allocator.
We can move it into arch/s390 for the time being but it would be
simpler if the constant_p thing or something else could work. :-) Do
you think figuring out how to fix it will take long?

Thanks.

--
tejun

2009-03-25 14:01:43

by Martin Schwidefsky

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 25 Mar 2009 20:51:31 +0900
Tejun Heo <[email protected]> wrote:

> > With your patches and a RELOC_HIDE version that uses the GOTENT
> > indirection the kernel won't compile because the "X" constraint for
> > the GOTENT access needs a symbol and there are quite a few users that
> > pass a pointer. I do not see a simple solution for that problem yet.
>
> Ah... okay. Now I get it. It wasn't expecting variables there. How
> about doing the following?
>
> #define SHIFT_PERCPU_PTR(ptr, offset) (({ \
> if (__builtin_constant_p(ptr)) \
> do GOTENT trick; \
> else \
> RELOC_HIDE(); \
> }))

That doesn't work because __builtin_constant_p is false for variable
symbols. I would need a __builtin_symbol_p but that doesn't exist.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-25 14:17:29

by Tejun Heo

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Martin Schwidefsky wrote:
> On Wed, 25 Mar 2009 20:51:31 +0900
> Tejun Heo <[email protected]> wrote:
>
>>> With your patches and a RELOC_HIDE version that uses the GOTENT
>>> indirection the kernel won't compile because the "X" constraint for
>>> the GOTENT access needs a symbol and there are quite a few users that
>>> pass a pointer. I do not see a simple solution for that problem yet.
>> Ah... okay. Now I get it. It wasn't expecting variables there. How
>> about doing the following?
>>
>> #define SHIFT_PERCPU_PTR(ptr, offset) (({ \
>> if (__builtin_constant_p(ptr)) \
>> do GOTENT trick; \
>> else \
>> RELOC_HIDE(); \
>> }))
>
> That doesn't work because __builtin_constant_p is false for variable
> symbols. I would need a __builtin_symbol_p but that doesn't exist.

Right, I somehow always get confused about the two. Eh... Not easy.
I think alpha might have the same problem too. I suppose we'll have
to make the legacy allocator available for alpha and s390 for the time
being. I'll get to that.

Thanks.

--
tejun

2009-03-30 10:10:13

by Tejun Heo

Subject: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Impact: use dynamic allocator for most archs w/o
CONFIG_HAVE_SETUP_PER_CPU_AREA

This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use the
dynamic percpu allocator. The first chunk is allocated using the
embedding helper and 8k is reserved for modules. This ensures that
the new allocator behaves almost identically to the original allocator
as far as static percpu variables are concerned, so it shouldn't
introduce much breakage.

s390 and alpha use a custom SHIFT_PERCPU_PTR() to work around the
addressing range limit their addressing model imposes. Unfortunately,
this breaks if the address is specified using a variable, so for now
the two archs aren't converted.

The following architectures are affected by this change.

* sh
* arm
* cris
* mips
* sparc(32)
* blackfin
* avr32
* parisc
* m32r
* powerpc(32)

As this change makes the dynamic allocator the default one,
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
archs. These archs implement their own setup_per_cpu_areas() and the
conversion is not trivial.

* powerpc(64)
* sparc(64)
* ia64
* alpha
* s390

Boot and batch alloc/free tested on x86_32 with debug code (x86_32
doesn't use the default first chunk initialization). Compile tested on
sparc(32), powerpc(32), arm and alpha.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Russell King <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Bryan Wu <[email protected]>
Cc: Kyle McMartin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Hirokazu Takata <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
---
Okay, this should keep s390 and alpha working till a proper solution is
found. Martin, can you please verify? Ingo, please feel free to push
this upstream (or -next) once Martin acks.

Thanks.

arch/alpha/Kconfig | 3 +++
arch/ia64/Kconfig | 3 +++
arch/powerpc/Kconfig | 3 +++
arch/s390/Kconfig | 3 +++
arch/sparc/Kconfig | 3 +++
arch/x86/Kconfig | 3 ---
include/linux/percpu.h | 12 +++++++++---
init/main.c | 24 ------------------------
kernel/module.c | 6 +++---
mm/Makefile | 2 +-
mm/allocpercpu.c | 28 ++++++++++++++++++++++++++++
mm/percpu.c | 40 +++++++++++++++++++++++++++++++++++++++-
12 files changed, 95 insertions(+), 35 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 9fb8aae..05d8640 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -70,6 +70,9 @@ config AUTO_IRQ_AFFINITY
depends on SMP
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y
+
source "init/Kconfig"
source "kernel/Kconfig.freezer"

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 153e727..f333149 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -85,6 +85,9 @@ config GENERIC_TIME_VSYSCALL
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y
+
config HAVE_SETUP_PER_CPU_AREA
def_bool y

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 74cc312..892f1ad 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -42,6 +42,9 @@ config GENERIC_HARDIRQS
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool PPC64
+
config HAVE_SETUP_PER_CPU_AREA
def_bool PPC64

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 6b0a353..c69264a 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -72,6 +72,9 @@ config PGSTE
config VIRT_CPU_ACCOUNTING
def_bool y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y
+
mainmenu "Linux Kernel Configuration"

config S390
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index c3ea215..0125b57 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -90,6 +90,9 @@ config AUDIT_ARCH
bool
default y

+config HAVE_LEGACY_PER_CPU_AREA
+ def_bool y if SPARC64
+
config HAVE_SETUP_PER_CPU_AREA
def_bool y if SPARC64

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f5cef3f..572329a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -138,9 +138,6 @@ config ARCH_HAS_CACHE_LINE_SIZE
config HAVE_SETUP_PER_CPU_AREA
def_bool y

-config HAVE_DYNAMIC_PER_CPU_AREA
- def_bool y
-
config HAVE_CPUMASK_OF_CPU_MAP
def_bool X86_64_SMP

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index ee5615d..817136a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -78,7 +78,7 @@

#ifdef CONFIG_SMP

-#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA

/* minimum unit size, also is the maximum supported allocation size */
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(64 << 10)
@@ -124,7 +124,7 @@ extern ssize_t __init pcpu_embed_first_chunk(

extern void *__alloc_reserved_percpu(size_t size, size_t align);

-#else /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#else /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

struct percpu_data {
void *ptrs[1];
@@ -138,11 +138,15 @@ struct percpu_data {
(__typeof__(ptr))__p->ptrs[(cpu)]; \
})

-#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void *__pdata);

+#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
+extern void __init setup_per_cpu_areas(void);
+#endif
+
#else /* CONFIG_SMP */

#define per_cpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
@@ -163,6 +167,8 @@ static inline void free_percpu(void *p)
kfree(p);
}

+static inline void __init setup_per_cpu_areas(void) { }
+
#endif /* CONFIG_SMP */

#define alloc_percpu(type) (type *)__alloc_percpu(sizeof(type), \
diff --git a/init/main.c b/init/main.c
index 6441083..42ae3c0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -354,7 +354,6 @@ static void __init smp_init(void)
#define smp_init() do { } while (0)
#endif

-static inline void setup_per_cpu_areas(void) { }
static inline void setup_nr_cpu_ids(void) { }
static inline void smp_prepare_cpus(unsigned int maxcpus) { }

@@ -375,29 +374,6 @@ static void __init setup_nr_cpu_ids(void)
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

-#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-
-EXPORT_SYMBOL(__per_cpu_offset);
-
-static void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- unsigned long nr_possible_cpus = num_possible_cpus();
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
- ptr = alloc_bootmem_pages(size * nr_possible_cpus);
-
- for_each_possible_cpu(i) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- ptr += size;
- }
-}
-#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
-
/* Called by boot processor to activate the rest. */
static void __init smp_init(void)
{
diff --git a/kernel/module.c b/kernel/module.c
index f0e04d6..f9c7b86 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -368,7 +368,7 @@ static struct module *find_module(const char *name)

#ifdef CONFIG_SMP

-#ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+#ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA

static void *percpu_modalloc(unsigned long size, unsigned long align,
const char *name)
@@ -393,7 +393,7 @@ static void percpu_modfree(void *freeme)
free_percpu(freeme);
}

-#else /* ... !CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#else /* ... CONFIG_HAVE_LEGACY_PER_CPU_AREA */

/* Number of blocks used and allocated. */
static unsigned int pcpu_num_used, pcpu_num_allocated;
@@ -528,7 +528,7 @@ static int percpu_modinit(void)
}
__initcall(percpu_modinit);

-#endif /* CONFIG_HAVE_DYNAMIC_PER_CPU_AREA */
+#endif /* CONFIG_HAVE_LEGACY_PER_CPU_AREA */

static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
diff --git a/mm/Makefile b/mm/Makefile
index 818569b..f3a037b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -30,7 +30,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-ifdef CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ifndef CONFIG_HAVE_LEGACY_PER_CPU_AREA
obj-$(CONFIG_SMP) += percpu.o
else
obj-$(CONFIG_SMP) += allocpercpu.o
diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
index 1882923..fd75eff 100644
--- a/mm/allocpercpu.c
+++ b/mm/allocpercpu.c
@@ -5,6 +5,8 @@
*/
#include <linux/mm.h>
#include <linux/module.h>
+#include <linux/bootmem.h>
+#include <asm/sections.h>

#ifndef cache_line_size
#define cache_line_size() L1_CACHE_BYTES
@@ -147,3 +149,29 @@ void free_percpu(void *__pdata)
kfree(__percpu_disguise(__pdata));
}
EXPORT_SYMBOL_GPL(free_percpu);
+
+/*
+ * Generic percpu area setup.
+ */
+#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
+
+EXPORT_SYMBOL(__per_cpu_offset);
+
+void __init setup_per_cpu_areas(void)
+{
+ unsigned long size, i;
+ char *ptr;
+ unsigned long nr_possible_cpus = num_possible_cpus();
+
+ /* Copy section for each CPU (we discard the original) */
+ size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
+ ptr = alloc_bootmem_pages(size * nr_possible_cpus);
+
+ for_each_possible_cpu(i) {
+ __per_cpu_offset[i] = ptr - __per_cpu_start;
+ memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ ptr += size;
+ }
+}
+#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
diff --git a/mm/percpu.c b/mm/percpu.c
index 1aa5d8f..059b57a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -43,7 +43,7 @@
*
* To use this allocator, arch code should do the followings.
*
- * - define CONFIG_HAVE_DYNAMIC_PER_CPU_AREA
+ * - drop CONFIG_HAVE_LEGACY_PER_CPU_AREA
*
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
* regular address to percpu pointer and back if they need to be
@@ -1324,3 +1324,41 @@ ssize_t __init pcpu_embed_first_chunk(size_t static_size, size_t reserved_size,
reserved_size, dyn_size,
pcpue_unit_size, pcpue_ptr, NULL);
}
+
+/*
+ * Generic percpu area setup.
+ *
+ * The embedding helper is used because its behavior closely resembles
+ * the original non-dynamic generic percpu area setup. This is
+ * important because many archs have addressing restrictions and might
+ * fail if the percpu area is located far away from the previous
+ * location. As an added bonus, in non-NUMA cases, embedding is
+ * generally a good idea TLB-wise because percpu area can piggy back
+ * on the physical linear memory mapping which uses large page
+ * mappings on applicable archs.
+ */
+#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
+EXPORT_SYMBOL(__per_cpu_offset);
+
+void __init setup_per_cpu_areas(void)
+{
+ size_t static_size = __per_cpu_end - __per_cpu_start;
+ ssize_t unit_size;
+ unsigned long delta;
+ unsigned int cpu;
+
+ /*
+ * Always reserve area for module percpu variables. That's
+ * what the legacy allocator did.
+ */
+ unit_size = pcpu_embed_first_chunk(static_size, PERCPU_MODULE_RESERVE,
+ PERCPU_DYNAMIC_RESERVE, -1);
+ if (unit_size < 0)
+ panic("Failed to initialized percpu areas.");
+
+ delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
+ for_each_possible_cpu(cpu)
+ __per_cpu_offset[cpu] = delta + cpu * unit_size;
+}
+#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */

2009-03-30 10:43:00

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Mon, 30 Mar 2009 19:07:44 +0900
Tejun Heo <[email protected]> wrote:

> Okay, this should keep s390 and alpha working till proper solution is
> found. Martin, can you please verify? Ingo, please feel free to push
> this upstream (or -next) once Martin acks.

Looks good, everything compiles and the static per-cpu variables are
resolved via GOTENT:

Acked-by: Martin Schwidefsky <[email protected]>

For the proper solution, the easiest fix is imho to define a
variant of SHIFT_PERCPU_PTR. The macro is currently used for dynamic
pointers and for per-cpu symbols. We only want to use the GOTENT
indirection for per-cpu symbols. So why don't we split it into
1) SHIFT_PERCPU_SYMBOL for per-cpu symbols and
2) SHIFT_PERCPU_PTR for dynamically allocated pointers?
For s390 the first would be the current SHIFT_PERCPU_PTR macro, the
second would be a simple RELOC_HIDE. The patch would be really
short ..
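
Roughly, and only as an untested sketch to make the idea concrete, the s390
module case would keep the existing GOTENT asm under the symbol-only name
while the pointer case degenerates to plain arithmetic:

/* per-cpu symbols: keep the GOTENT indirection so larl only has to
 * reach the module's GOT, not the relocated .data.percpu section */
#define SHIFT_PERCPU_SYMBOL(var, offset) (({                            \
        unsigned long *__ptr;                                           \
        asm ( "larl %0, %1@GOTENT"                                      \
              : "=a" (__ptr) : "X" (var) );                             \
        (typeof(var))((*__ptr) + (offset)); }))

/* dynamically allocated percpu pointers need no GOT indirection */
#define SHIFT_PERCPU_PTR(ptr, offset)   RELOC_HIDE((ptr), (offset))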

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-30 11:54:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Tejun Heo <[email protected]> wrote:

> Impact: use dynamic allocator for most archs w/o
> CONFIG_HAVE_SETUP_PER_CPU_AREA
>
> This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
> dynamic percpu allocator. The first chunk is allocated using
> embedding helper and 8k is reserved for modules. This ensures
> that the new allocator behaves almost identically to the original
> allocator as long as static percpu variables are concerned, so it
> shouldn't introduce much breakage.
>
> s390 and alpha use a custom SHIFT_PERCPU_PTR() to work around the
> addressing range limit their addressing models impose.
> Unfortunately, this breaks if the address is specified using a
> variable, so for now, the two archs aren't converted.
>
> The following architectures are affected by this change.
>
> * sh
> * arm
> * cris
> * mips
> * sparc(32)
> * blackfin
> * avr32
> * parisc
> * m32r
> * powerpc(32)
>
> As this change makes the dynamic allocator the default one,
> CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
> CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be
> converted archs. These archs implement their own
> setup_per_cpu_areas() and the conversion is not trivial.
>
> * powerpc(64)
> * sparc(64)
> * ia64
> * alpha
> * s390
>
> Boot and batch alloc/free tests were run on x86_32 with debug code
> (x86_32 doesn't use the default first chunk initialization). Compile
> tested on sparc(32), powerpc(32), arm and alpha.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Mikael Starvik <[email protected]>
> Cc: Ralf Baechle <[email protected]>
> Cc: David S. Miller <[email protected]>
> Cc: Bryan Wu <[email protected]>
> Cc: Kyle McMartin <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Hirokazu Takata <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Richard Henderson <[email protected]>
> Cc: Ivan Kokshaysky <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> ---

> Okay, this should keep s390 and alpha working till proper solution
> is found. Martin, can you please verify? Ingo, please feel free
> to push this upstream (or -next) once Martin acks.
>
> arch/alpha/Kconfig | 3 +++
> arch/ia64/Kconfig | 3 +++
> arch/powerpc/Kconfig | 3 +++
> arch/s390/Kconfig | 3 +++
> arch/sparc/Kconfig | 3 +++
> arch/x86/Kconfig | 3 ---
> include/linux/percpu.h | 12 +++++++++---
> init/main.c | 24 ------------------------
> kernel/module.c | 6 +++---
> mm/Makefile | 2 +-
> mm/allocpercpu.c | 28 ++++++++++++++++++++++++++++
> mm/percpu.c | 40 +++++++++++++++++++++++++++++++++++++++-
> 12 files changed, 95 insertions(+), 35 deletions(-)

We also need the Ack from Davem for Sparc32, the Ack from Martin, an
Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
bits. We also need the ack from Andrew and Rusty for the MM and
module bits. Plus a final Reviewed-by of the all-architectures
percpu allocator concepts from Christoph and Nick would be nice as
well, just in case we missed some trick or failed to consider a
complication.

We need these acks both for technical correctness and for the form
of the proposed workflow - and the acks need to be in the commit
logs as well. (i.e. we need another rebase once the dust settles)

I know this is a _lot_ of paperwork but this is the proper way to do
it: the main gravity of impact is not on x86 anymore, and that
impact is non-trivial on them, so we'll do whatever is most
convenient to the architectures and maintainers affected.

It was fine and efficient to prototype this on x86 (which was the
architecture historically most infested with legacy percpu
complications), but the work flow spreads out and slows down from
now on.

We may also need to split this up into per architecture bits if that
is requested, so that they can merge and switch this allocator on at
a pace that suits their own testing and merging schedules best. More
complications beyond the s390 one might be discovered.

Furthermore, while I don't mind pulling it into tip/core if everyone
agrees with that and can push it to -next as well, I think these
bits are best suited for -mm from this point on, as it is the tree best
set up for more complex many-architecture merges.

The bits obviously look fine from an x86 perspective as well (they
have essentially no impact there) so you can put in my ack for that
in any case.

Thanks,

Ingo

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Mon, 30 Mar 2009, Ingo Molnar wrote:

> We also need the Ack from Davem for Sparc32, the Ack from Martin, an
> Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
> bits. We also need the ack from Andrew and Rusty for the MM and
> module bits. Plus a final Reviewed-by of the all-architectures
> percpu allocator concepts from Christoph and Nick would be nice as
> well, just in case we missed some trick or failed to consider a
> complication.

I just started looking at this but I see that the patches were already
merged. Will see if I can find what needs looking after.

2009-03-31 01:35:16

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Monday 30 March 2009 22:19:38 Ingo Molnar wrote:
>
> * Tejun Heo <[email protected]> wrote:
...
> We also need the Ack from Davem for Sparc32, the Ack from Martin, an
> Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
> bits. We also need the ack from Andrew and Rusty for the MM and
> module bits.

Module bits are trivial, but Acked-by: Rusty Russell <[email protected]>
just in case :)

Thanks!
Rusty.

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Needs additional feedback and review by Tejun I would think. Just compile
tested so far. The removal of the rbtree also relaxes locking restrictions
for pcpu_chunk_addr_search() (which is not really searching anymore).


Subject: dynamic percpu allocator: Remove rbtree

The rbtree is used to determine the chunk from the virtual address. However,
we can already determine the page struct from a virtual address and there
are several fields in the page struct that are unused for vmalloc pages.
Use the index field to store a pointer to the chunk. Then there is no
need for an rbtree anymore.

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/percpu.c | 101 +++++++++++-------------------------------------------------
1 file changed, 19 insertions(+), 82 deletions(-)

Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c 2009-03-31 11:02:01.000000000 -0500
+++ linux-2.6/mm/percpu.c 2009-03-31 11:04:04.000000000 -0500
@@ -23,7 +23,7 @@
* Allocation is done in offset-size areas of single unit space. Ie,
* an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
* c1:u1, c1:u2 and c1:u3. Percpu access can be done by configuring
- * percpu base registers UNIT_SIZE apart.
+ * percpu base registers pcpu_unit_size apart.
*
* There are usually many small percpu allocations many of them as
* small as 4 bytes. The allocator organizes chunks into lists
@@ -38,8 +38,8 @@
* region and negative allocated. Allocation inside a chunk is done
* by scanning this map sequentially and serving the first matching
* entry. This is mostly copied from the percpu_modalloc() allocator.
- * Chunks are also linked into a rb tree to ease address to chunk
- * mapping during free.
+ * Chunks can be determined from the address using the index field
+ * in the page struct. The index field contains a pointer to the chunk.
*
* To use this allocator, arch code should do the followings.
*
@@ -61,7 +61,6 @@
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/pfn.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
@@ -88,7 +87,6 @@

struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
- struct rb_node rb_node; /* key is chunk->vm->addr */
int free_size; /* free bytes in the chunk */
int contig_hint; /* max contiguous size hint */
struct vm_struct *vm; /* mapped vmalloc region */
@@ -121,7 +119,7 @@ static int pcpu_reserved_chunk_limit;
* There are two locks - pcpu_alloc_mutex and pcpu_lock. The former
* protects allocation/reclaim paths, chunks and chunk->page arrays.
* The latter is a spinlock and protects the index data structures -
- * chunk slots, rbtree, chunks and area maps in chunks.
+ * chunk slots, chunks and area maps in chunks.
*
* During allocation, pcpu_alloc_mutex is kept locked all the time and
* pcpu_lock is grabbed and released as necessary. All actual memory
@@ -140,7 +138,6 @@ static DEFINE_MUTEX(pcpu_alloc_mutex); /
static DEFINE_SPINLOCK(pcpu_lock); /* protects index data structures */

static struct list_head *pcpu_slot; /* chunk list slots */
-static struct rb_root pcpu_addr_root = RB_ROOT; /* chunks by address */

/* reclaim work to release fully free chunks, scheduled from free path */
static void pcpu_reclaim(struct work_struct *work);
@@ -257,49 +254,27 @@ static void pcpu_chunk_relocate(struct p
}
}

-static struct rb_node **pcpu_chunk_rb_search(void *addr,
- struct rb_node **parentp)
+/* Set the pointer to a chunk in a page struct */
+static inline void set_chunk(struct page *page, struct pcpu_chunk *pcpu)
{
- struct rb_node **p = &pcpu_addr_root.rb_node;
- struct rb_node *parent = NULL;
- struct pcpu_chunk *chunk;
-
- while (*p) {
- parent = *p;
- chunk = rb_entry(parent, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr)
- p = &(*p)->rb_left;
- else if (addr > chunk->vm->addr)
- p = &(*p)->rb_right;
- else
- break;
- }
+ page->index = (unsigned long)pcpu;
+}

- if (parentp)
- *parentp = parent;
- return p;
+/* Obtain pointer to a chunk from a page struct */
+static inline struct pcpu_chunk *get_chunk(struct page *page)
+{
+ return (struct pcpu_chunk *)page->index;
}

/**
- * pcpu_chunk_addr_search - search for chunk containing specified address
- * @addr: address to search for
- *
- * Look for chunk which might contain @addr. More specifically, it
- * searchs for the chunk with the highest start address which isn't
- * beyond @addr.
- *
- * CONTEXT:
- * pcpu_lock.
+ * pcpu_chunk_addr_search - determine chunk containing specified address
+ * @addr: address for which the chunk needs to be determined.
*
* RETURNS:
* The address of the found chunk.
*/
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
- struct rb_node *n, *parent;
- struct pcpu_chunk *chunk;
-
/* is it in the reserved chunk? */
if (pcpu_reserved_chunk) {
void *start = pcpu_reserved_chunk->vm->addr;
@@ -308,42 +283,7 @@ static struct pcpu_chunk *pcpu_chunk_add
return pcpu_reserved_chunk;
}

- /* nah... search the regular ones */
- n = *pcpu_chunk_rb_search(addr, &parent);
- if (!n) {
- /* no exactly matching chunk, the parent is the closest */
- n = parent;
- BUG_ON(!n);
- }
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr) {
- /* the parent was the next one, look for the previous one */
- n = rb_prev(n);
- BUG_ON(!n);
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
- }
-
- return chunk;
-}
-
-/**
- * pcpu_chunk_addr_insert - insert chunk into address rb tree
- * @new: chunk to insert
- *
- * Insert @new into address rb tree.
- *
- * CONTEXT:
- * pcpu_lock.
- */
-static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
-{
- struct rb_node **p, *parent;
-
- p = pcpu_chunk_rb_search(new->vm->addr, &parent);
- BUG_ON(*p);
- rb_link_node(&new->rb_node, parent, p);
- rb_insert_color(&new->rb_node, &pcpu_addr_root);
+ return get_chunk(vmalloc_to_page(addr));
}

/**
@@ -755,6 +695,7 @@ static int pcpu_populate_chunk(struct pc
alloc_mask, 0);
if (!*pagep)
goto err;
+ set_chunk(*pagep, chunk);
}
}

@@ -879,7 +820,6 @@ restart:

spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
- pcpu_chunk_addr_insert(chunk);
goto restart;

area_found:
@@ -968,7 +908,6 @@ static void pcpu_reclaim(struct work_str
if (chunk == list_first_entry(head, struct pcpu_chunk, list))
continue;

- rb_erase(&chunk->rb_node, &pcpu_addr_root);
list_move(&chunk->list, &todo);
}

@@ -1202,6 +1141,7 @@ size_t __init pcpu_setup_first_chunk(pcp
if (!page)
break;
*pcpu_chunk_pagep(schunk, cpu, i) = page;
+ set_chunk(page, schunk);
}

BUG_ON(i < PFN_UP(static_size));
@@ -1226,13 +1166,10 @@ size_t __init pcpu_setup_first_chunk(pcp
}

/* link the first chunk in */
- if (!dchunk) {
+ if (!dchunk)
pcpu_chunk_relocate(schunk, -1);
- pcpu_chunk_addr_insert(schunk);
- } else {
+ else
pcpu_chunk_relocate(dchunk, -1);
- pcpu_chunk_addr_insert(dchunk);
- }

/* we're done */
pcpu_base_addr = (void *)pcpu_chunk_addr(schunk, 0, 0);

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

I reviewed the new per cpu allocator. Quite extensive work. Found two
issues that I would like to have addressed. But basically:

Reviewed-by: Christoph Lameter <[email protected]>


Two issues

1. Lot of unnecessary use of __read_mostly for local static variables that
are not on the hotpath. Patch follows in this message.

2. The rbtree is not necessary since we can link back through an available
field in struct page. Patch in next message.



Subject: Remove __read_mostly from percpu allocator symbols.

__read_mostly is reserved for hot code path items. The percpu variables
are used in allocation and freeing of percpu items which are typically
not hot code paths. It also reduces the cache footprint of the percpu
allocator if the variables are placed near each other. __read_mostly
would move a portion of the variable into a different linker section.

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/percpu.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c 2009-03-31 10:59:34.000000000 -0500
+++ linux-2.6/mm/percpu.c 2009-03-31 11:11:22.000000000 -0500
@@ -100,14 +100,14 @@ struct pcpu_chunk {
struct page *page_ar[]; /* #cpus * UNIT_PAGES */
};

-static int pcpu_unit_pages __read_mostly;
-static int pcpu_unit_size __read_mostly;
-static int pcpu_chunk_size __read_mostly;
-static int pcpu_nr_slots __read_mostly;
-static size_t pcpu_chunk_struct_size __read_mostly;
+static int pcpu_unit_pages;
+static int pcpu_unit_size;
+static int pcpu_chunk_size;
+static int pcpu_nr_slots;
+static size_t pcpu_chunk_struct_size;

/* the address of the first chunk which starts with the kernel static area */
-void *pcpu_base_addr __read_mostly;
+void *pcpu_base_addr;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

/* optional reserved chunk, only accessible for reserved allocations */
@@ -139,7 +139,7 @@ static int pcpu_reserved_chunk_limit;
static DEFINE_MUTEX(pcpu_alloc_mutex); /* protects whole alloc and reclaim */
static DEFINE_SPINLOCK(pcpu_lock); /* protects index data structures */

-static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
+static struct list_head *pcpu_slot; /* chunk list slots */
static struct rb_root pcpu_addr_root = RB_ROOT; /* chunks by address */

/* reclaim work to release fully free chunks, scheduled from free path */

2009-03-31 16:55:19

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 25 Mar 2009 22:34:46 +0900
Tejun Heo <[email protected]> wrote:

> Martin Schwidefsky wrote:
> > On Wed, 25 Mar 2009 22:21:03 +0900
> > Tejun Heo <[email protected]> wrote:
> >
> >> Martin Schwidefsky wrote:
> >>>> Martin's original patch should do the trick although it would be
> >>>> slower for static symbols. I'll merge it and post the tree.
> >>> No, my original patch doesn't work. It will break modules that use
> >>> static per-cpu variables.
> >> Oops. Even with the default offset adding macros? Heh... I think it
> >> would be best to wait for your fix then.
> >
> > We could use HAVE_LEGACY_PER_CPU_AREA for the time being.
>
> Eh... The thing is that the patch kills the legacy default allocator.
> We can move it into arch/s390 for the time being but it would be
> simpler if the constant_p thing or something else could work. :-) Do
> you think figuring out how to fix it will take long?

I got the dynamic cpu allocator to work with the patch below. Anybody
with an objection against the SHIFT_PERCPU_VAR macro ?

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

---
Subject: [PATCH] introduce SHIFT_PERCPU_VAR

From: Martin Schwidefsky <[email protected]>

Introduce SHIFT_PERCPU_VAR to make it possible to use the dynamic
percpu allocator on s390. The background for the new macro is the
addressing limitation on s390 for local variables in an elf object.

Every variable that the compiler considers to be local to an elf object
has to be addressable with a load-address-relative-long (LARL)
instruction, which has a range of +-4GB. A static per-cpu variable is
allocated in the .data.percpu section. If such a variable is defined in
a module the module loader will relocate this section to some place
which will be outside the +-4GB range. So far we used SHIFT_PERCPU_PTR
to add an indirection over the global offset table via a GOTENT
relocation. The GOT for the module is created by the module loader
and is located right next to the module code. That corrects the problem
with the LARL addressability but GOTENT works only for symbols, not for
dynamically allocated per-cpu memory. If the dynamic percpu allocator
is used on s390 the build breaks.

In order to fix this we need to distinguish the dynamically allocated
per-cpu objects from the per-cpu variables. For the former we can
use a simple RELOC_HIDE, for the latter we have to use our GOTENT
indirection.

Signed-off-by: Martin Schwidefsky <[email protected]>
---

arch/alpha/include/asm/percpu.h | 2 ++
arch/s390/Kconfig | 3 ---
arch/s390/include/asm/percpu.h | 12 ++++++------
include/asm-generic/percpu.h | 10 +++++++---
4 files changed, 15 insertions(+), 12 deletions(-)

diff -urpN linux-next/arch/alpha/include/asm/percpu.h linux-s390/arch/alpha/include/asm/percpu.h
--- linux-next/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:03.000000000 +0200
+++ linux-s390/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:46.000000000 +0200
@@ -47,6 +47,8 @@ extern unsigned long __per_cpu_offset[NR
: "=&r"(__ptr), "=&r"(tmp_gp)); \
(typeof(&per_cpu_var(var)))(__ptr + (offset)); })

+#define SHIFT_PERCPU_VAR(var, offset) SHIFT_PERCPU_PTR(var, offset)
+
#define PER_CPU_ATTRIBUTES __used

#endif /* MODULE */
diff -urpN linux-next/arch/s390/include/asm/percpu.h linux-s390/arch/s390/include/asm/percpu.h
--- linux-next/arch/s390/include/asm/percpu.h 2009-03-31 15:55:03.000000000 +0200
+++ linux-s390/arch/s390/include/asm/percpu.h 2009-03-31 15:55:46.000000000 +0200
@@ -13,19 +13,19 @@
*/
#if defined(__s390x__) && defined(MODULE)

-#define SHIFT_PERCPU_PTR(ptr,offset) (({ \
+#define SHIFT_PERCPU_VAR(ptr,offset) (({ \
extern int simple_identifier_##var(void); \
unsigned long *__ptr; \
- asm ( "larl %0, %1@GOTENT" \
+ asm ( "larl %0, %1@GOTENT" \
: "=a" (__ptr) : "X" (ptr) ); \
(typeof(ptr))((*__ptr) + (offset)); }))

#else

-#define SHIFT_PERCPU_PTR(ptr, offset) (({ \
- extern int simple_identifier_##var(void); \
- unsigned long __ptr; \
- asm ( "" : "=a" (__ptr) : "0" (ptr) ); \
+#define SHIFT_PERCPU_VAR(ptr, offset) (({ \
+ extern int simple_identifier_##var(void); \
+ unsigned long __ptr; \
+ asm ( "" : "=a" (__ptr) : "0" (ptr) ); \
(typeof(ptr)) (__ptr + (offset)); }))

#endif
diff -urpN linux-next/arch/s390/Kconfig linux-s390/arch/s390/Kconfig
--- linux-next/arch/s390/Kconfig 2009-03-31 15:55:10.000000000 +0200
+++ linux-s390/arch/s390/Kconfig 2009-03-31 15:55:36.000000000 +0200
@@ -72,9 +72,6 @@ config PGSTE
config VIRT_CPU_ACCOUNTING
def_bool y

-config HAVE_LEGACY_PER_CPU_AREA
- def_bool y
-
mainmenu "Linux Kernel Configuration"

config S390
diff -urpN linux-next/include/asm-generic/percpu.h linux-s390/include/asm-generic/percpu.h
--- linux-next/include/asm-generic/percpu.h 2009-03-31 15:55:03.000000000 +0200
+++ linux-s390/include/asm-generic/percpu.h 2009-03-31 15:55:46.000000000 +0200
@@ -48,17 +48,21 @@ extern unsigned long __per_cpu_offset[NR
#define SHIFT_PERCPU_PTR(__p, __offset) RELOC_HIDE((__p), (__offset))
#endif

+#ifndef SHIFT_PERCPU_VAR
+#define SHIFT_PERCPU_VAR(__p, __offset) RELOC_HIDE((__p), (__offset))
+#endif
+
/*
* A percpu variable may point to a discarded regions. The following are
* established ways to produce a usable pointer from the percpu variable
* offset.
*/
#define per_cpu(var, cpu) \
- (*SHIFT_PERCPU_PTR(&per_cpu_var(var), per_cpu_offset(cpu)))
+ (*SHIFT_PERCPU_VAR(&per_cpu_var(var), per_cpu_offset(cpu)))
#define __get_cpu_var(var) \
- (*SHIFT_PERCPU_PTR(&per_cpu_var(var), my_cpu_offset))
+ (*SHIFT_PERCPU_VAR(&per_cpu_var(var), my_cpu_offset))
#define __raw_get_cpu_var(var) \
- (*SHIFT_PERCPU_PTR(&per_cpu_var(var), __my_cpu_offset))
+ (*SHIFT_PERCPU_VAR(&per_cpu_var(var), __my_cpu_offset))


#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 31 Mar 2009, Martin Schwidefsky wrote:

> I got the dynamic cpu allocator to work with the patch below. Anybody
> with an objection against the SHIFT_PERCPU_VAR macro ?

Please include the patch inline next time.

Why is there a difference between percpu alloc and "static" percpu
variables? The "static" percpu variables can be dynamically allocated when
a module is loaded.

Isn't this more an issue that can be addressed by changing the way the
initial per cpu segment is allocated? If it's also dynamically allocated
then there should be no problem?

2009-03-31 19:45:37

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, Mar 31, 2009 at 06:54:31PM +0200, Martin Schwidefsky wrote:
> I got the dynamic cpu allocator to work with the patch below. Anybody
> with an objection against the SHIFT_PERCPU_VAR macro ?

That's fine with me.

> --- linux-next/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:03.000000000 +0200
> +++ linux-s390/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:46.000000000 +0200
> @@ -47,6 +47,8 @@ extern unsigned long __per_cpu_offset[NR
> : "=&r"(__ptr), "=&r"(tmp_gp)); \
> (typeof(&per_cpu_var(var)))(__ptr + (offset)); })
>
> +#define SHIFT_PERCPU_VAR(var, offset) SHIFT_PERCPU_PTR(var, offset)
> +

No, just do s/SHIFT_PERCPU_PTR/SHIFT_PERCPU_VAR/ in
arch/alpha/include/asm/percpu.h and the rest of your patch would
fix alpha as well.

Ivan.

2009-03-31 20:19:34

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 31 Mar 2009 13:20:42 -0400 (EDT)
Christoph Lameter <[email protected]> wrote:

> On Tue, 31 Mar 2009, Martin Schwidefsky wrote:
>
> > I got the dynamic cpu allocator to work with the patch below. Anybody
> > with an objection against the SHIFT_PERCPU_VAR macro ?
>
> Please include the patch inline next time.

What do you mean by "inline"? That the patch should not be the last
thing in the mail?

> Why is there a difference between percpu alloc and "static" percpu
> variables? The "static" percpu variables can be dynamically allocated when
> a module is loaded.

We need to hide the static per cpu symbol from the compiler, otherwise
the compiler will try to load the address of the symbol with an
instruction that is limited in the addressing range of a single elf
object. We do that with the inline assembly in SHIFT_PERCPU_PTR:

#define SHIFT_PERCPU_PTR(ptr,offset) (({ \
extern int simple_identifier_##var(void); \
unsigned long *__ptr; \
asm ( "larl %0, %1@GOTENT" \
: "=a" (__ptr) : "X" (ptr) ); \
(typeof(ptr))((*__ptr) + (offset)); }))

The "X" constraint only works if ptr is the name of a symbol. The
per_cpu_ptr macro will use SHIFT_PERCPU_PTR on a pointer, which does
not compile. For pointers we do not need to trick the compiler, we can
just use the standard RELOC_HIDE.

> Isn't this more an issue that can be addressed by changing the way the
> initial per cpu segment is allocated? If it's also dynamically allocated
> then there should be no problem?

I don't see how changes to the initial per cpu segment should help with
access to per cpu symbols.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-31 20:19:59

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 31 Mar 2009 23:17:03 +0400
Ivan Kokshaysky <[email protected]> wrote:

> On Tue, Mar 31, 2009 at 06:54:31PM +0200, Martin Schwidefsky wrote:
> > I got the dynamic cpu allocator to work with the patch below. Anybody
> > with an objection against the SHIFT_PERCPU_VAR macro ?
>
> That's fine with me.
>
> > --- linux-next/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:03.000000000 +0200
> > +++ linux-s390/arch/alpha/include/asm/percpu.h 2009-03-31 15:55:46.000000000 +0200
> > @@ -47,6 +47,8 @@ extern unsigned long __per_cpu_offset[NR
> > : "=&r"(__ptr), "=&r"(tmp_gp)); \
> > (typeof(&per_cpu_var(var)))(__ptr + (offset)); })
> >
> > +#define SHIFT_PERCPU_VAR(var, offset) SHIFT_PERCPU_PTR(var, offset)
> > +
>
> No, just do s/SHIFT_PERCPU_PTR/SHIFT_PERCPU_VAR/ in
> arch/alpha/include/asm/percpu.h and the rest of your patch would
> fix alpha as well.

Ok, cool. May I add an Acked-by?

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-03-31 20:29:45

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, Mar 31, 2009 at 10:19:37PM +0200, Martin Schwidefsky wrote:
> Ok, cool. May I add an Acked-by?

Yes, sure.

Ivan.

Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 31 Mar 2009, Martin Schwidefsky wrote:

> > Please include the patch inline next time.
>
> What do you mean by "inline"? That the patch should not be the last
> thing in the mail?

The patch needs to show up when I press reply and not vanish. I guess this
was an attachment.

> I don't see how changes to the initial per cpu segment should help with
> access to per cpu symbols.

Can you convince the linker to place the per cpu segment > 4G away from
the code? It's a virtual address, right?

2009-03-31 22:57:40

by David Miller

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

From: Rusty Russell <[email protected]>
Date: Tue, 31 Mar 2009 12:04:50 +1030

> On Monday 30 March 2009 22:19:38 Ingo Molnar wrote:
> >
> > * Tejun Heo <[email protected]> wrote:
> ...
> > We also need the Ack from Davem for Sparc32, the Ack from Martin, an
> > Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
> > bits. We also need the ack from Andrew and Rusty for the MM and
> > module bits.
>
> Module bits are trivial, but Acked-by: Rusty Russell <[email protected]>
> just in case :)

I'm fine with the sparc32 bits:

Acked-by: David S. Miller <[email protected]>

And I'll work on sparc64 support.

2009-04-01 00:10:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello, Martin.

Martin Schwidefsky wrote:
> On Wed, 25 Mar 2009 22:34:46 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Martin Schwidefsky wrote:
>>> On Wed, 25 Mar 2009 22:21:03 +0900
>>> Tejun Heo <[email protected]> wrote:
>>>
>>>> Martin Schwidefsky wrote:
>>>>>> Martin's original patch should do the trick although it would be
>>>>>> slower for static symbols. I'll merge it and post the tree.
>>>>> No, my original patch doesn't work. It will break modules that use
>>>>> static per-cpu variables.
>>>> Oops. Even with the default offset adding macros? Heh... I think it
>>>> would be best to wait for your fix then.
>>> We could use HAVE_LEGACY_PER_CPU_AREA for the time being.
>> Eh... The thing is that the patch kills the legacy default allocator.
>> We can move it into arch/s390 for the time being but it would be
>> simpler if the constant_p thing or something else could work. :-) Do
>> you think figuring out how to fix it will take long?
>
> I got the dynamic cpu allocator to work with the patch below. Anybody
> with an objection against the SHIFT_PERCPU_VAR macro ?

One of the main goals of the percpu allocator is removing the
distinction between statically and dynamically allocated percpu
variables, so that it can be treated like other normal variables. For
all archs other than s390, alpha and ia64, achieving this is easy, so
I wish we could come up with solution for the three archs too. Is it
possible to do similar stuff with pointer values with input
constraint which can take both constant and variable?

Thanks.

--
tejun

2009-04-01 00:10:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Martin Schwidefsky wrote:
> On Mon, 30 Mar 2009 19:07:44 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Okay, this should keep s390 and alpha working till proper solution is
>> found. Martin, can you please verify? Ingo, please feel free to push
>> this upstream (or -next) once Martin acks.
>
> Looks good, everything compiles and the static per-cpu variables are
> resolved via GOTENT:
>
> Acked-by: Martin Schwidefsky <[email protected]>

Thanks.

> For the proper solution, the easiest fix is imho to define a
> variant of SHIFT_PERCPU_PTR. The macro is currently used for dynamic
> pointers and for per-cpu symbols. We only want to use the GOTENT
> indirection for per-cpu symbols. So why don't we split it into
> 1) SHIFT_PERCPU_SYMBOL for per-cpu symbols and
> 2) SHIFT_PERCPU_PTR for dynamically allocated pointers?
> For s390 the first would be the current SHIFT_PERCPU_PTR macro, the
> second would be a simple RELOC_HIDE. The patch would be really
> short ..

Eh... As I wrote in the other reply, unifying the two is kind of one
of the main goals, so....

--
tejun

2009-04-01 00:18:37

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 2009-03-31 at 12:04 +1030, Rusty Russell wrote:
> On Monday 30 March 2009 22:19:38 Ingo Molnar wrote:
> >
> > * Tejun Heo <[email protected]> wrote:
> ...
> > We also need the Ack from Davem for Sparc32, the Ack from Martin, an
> > Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
> > bits. We also need the ack from Andrew and Rusty for the MM and
> > module bits.
>
> Module bits are trivial, but Acked-by: Rusty Russell <[email protected]>
> just in case :)

PPC32 bits look ok unless I missed something... oh well.

Acked-by: Benjamin Herrenschmidt <[email protected]>

2009-04-01 00:18:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Christoph Lameter wrote:
> I reviewed the new per cpu allocator. Quite extensive work. Found two
> issues that I would like to have addressed. But basically:
>
> Reviewed-by: Christoph Lameter <[email protected]>

Thanks.

> Two issues
>
> 1. Lot of unnecessary use of __read_mostly for local static variables that
> are not on the hotpath. Patch follows in this message.

Hmmm... Those are basically read-only vars once initialization is
complete. IIUC, __read_mostly just puts the tagged variables into a
separate segment so that they don't have to worry about cacheline
bouncing. Is there any reason to remove __read_mostly? Or is it just
that it's mostly superfluous?

> 2. The rbtree is not necessary since we can link back through an available
> field in struct page. Patch in next message.

Ah... nice. I was thinking about adding a ->private pointer to the
vmalloc structure, but if the page struct already has the space, all the better.

Thanks.

--
tejun

2009-04-01 00:22:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Christoph Lameter wrote:
> Needs additional feedback and review by Tejun I would think. Just compile
> tested so far. The removal of the rbtree also relaxes locking restrictions
> for pcpu_chunk_addr_search() (which is not really searching anymore).
>
> Subject: dynamic percpu allocator: Remove rbtree
>
> The rbtree is used to determine the chunk from the virtual address. However,
> we can already determine the page struct from a virtual address and there
> are several fields in the page struct that are unused for vmalloc pages.
> Use the index field to store a pointer to the chunk. Then there is no
> need for an rbtree anymore.
>
> Signed-off-by: Christoph Lameter <[email protected]>

This is great. I'll integrate the patch and test it soonish but with
LSF next week, I'll probably be slow during the next two weeks.

Thanks a lot.

--
tejun

2009-04-01 00:23:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Benjamin Herrenschmidt wrote:
> On Tue, 2009-03-31 at 12:04 +1030, Rusty Russell wrote:
>> On Monday 30 March 2009 22:19:38 Ingo Molnar wrote:
>>> * Tejun Heo <[email protected]> wrote:
>> ...
>>> We also need the Ack from Davem for Sparc32, the Ack from Martin, an
>>> Ack from Ben for the PowerPC bits and an ack from Tony for the IA64
>>> bits. We also need the ack from Andrew and Rusty for the MM and
>>> module bits.
>> Module bits are trivial, but Acked-by: Rusty Russell <[email protected]>
>> just in case :)
>
> PPC32 bits look ok unless I missed something... oh well.
>
> Acked-by: Benjamin Herrenschmidt <[email protected]>

Cool, all thread ACKs added.

Thanks.

--
tejun

2009-04-01 08:01:58

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 31 Mar 2009 17:10:40 -0400 (EDT)
Christoph Lameter <[email protected]> wrote:

> On Tue, 31 Mar 2009, Martin Schwidefsky wrote:
>
> > > Please include the patch inline next time.
> >
> > What do you mean by "inline"? That the patch should not be the last
> > thing in the mail?
>
> The patch needs to show up when I press reply and not vanish. I guess this
> was an attachment.

Hmm, interesting. The patch was part of the message body but came after
the signature. I'll make the signature the last thing in my mails in the
future.

> > I don't see how changes to the initial per cpu segment should help with
> > access to per cpu symbols.
>
> Can you convince the linker to place the per cpu segment > 4G away from
> the code? It's a virtual address, right?

Yes it is a virtual address but the linker has nothing to do with it.
It is the compiler that makes the assumption that a static variable
can be accessed within < 4G. The module loader then rips the code and
the .data.percpu section apart and places them at memory locations
that end up more than 4GB apart.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-04-01 08:11:33

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 01 Apr 2009 09:07:41 +0900
Tejun Heo <[email protected]> wrote:

> Martin Schwidefsky wrote:
> >
> > I got the dynamic cpu allocator to work with the patch below. Anybody
> > with an objection against the SHIFT_PERCPU_VAR macro ?
>
> One of the main goals of the percpu allocator is removing the
> distinction between statically and dynamically allocated percpu
> variables, so that it can be treated like other normal variables. For
> all archs other than s390, alpha and ia64, achieving this is easy, so
> I wish we could come up with solution for the three archs too. Is it
> possible to do similar stuff with pointer values with input
> constraint which can take both constant and variable?

Is the goal to use the same access macros for both dynamically and
statically allocated percpu variables? That would make the proposed
solution impractical.

The "X" constraint trick we used so far tells the compiler to pass the
argument verbatim to the assembler. The assembler knows how to deal
with symbol@GOTENT. If we pass a gcc variable or even a more general
term via an "X" constraint the assembler gets <something C-ish>@GOTENT.
It is not possible to parse this in the assembler. To do what you want
to achieve would mean to avoid using the "X" constraint. Which means
that we cannot use the GOTENT trick anymore. If we let the compiler
resolve the address for a static per cpu variable we end up with the
larl problem. Ouch!

So far the SHIFT_PERCPU_PTR/SHIFT_PERCPU_VAR is the only solution
I have found for s390 and the dynamic percpu allocator. I'll keep
looking for a better solution but I am not optimistic.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-04-01 08:20:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Martin Schwidefsky wrote:
> On Wed, 01 Apr 2009 09:07:41 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Martin Schwidefsky wrote:
>>> I got the dynamic cpu allocator to work with the patch below. Anybody
>>> with an objection against the SHIFT_PERCPU_VAR macro ?
>> One of the main goals of the percpu allocator is removing the
>> distinction between statically and dynamically allocated percpu
>> variables, so that it can be treated like other normal variables. For
>> all archs other than s390, alpha and ia64, achieving this is easy, so
>> I wish we could come up with solution for the three archs too. Is it
>> possible to do similar stuff with pointer values with input
>> constraint which can take both constant and variable?
>
> Is the goal to use the same access macros for both dynamically and
> statically allocated percpu variables? That would make the proposed
> solution impractical.

Yeah, it's one of the goals so that we don't have to have two sets of
APIs (e.g. the fast percpu_*() accessors).

> The "X" constraint trick we used so far tells the compiler to pass the
> argument verbatim to the assembler. The assembler knows how to deal
> with symbol@GOTENT. If we pass a gcc variable or even a more general
> term via an "X" constraint the assembler gets <something C-ish>@GOTENT.
> It is not possible to parse this in the assembler. To do what you want
> to achieve would mean to avoid using the "X" constraint. Which means
> that we cannot use the GOTENT trick anymore. If we let the compiler
> resolve the address for a static per cpu variable we end up with the
> larl problem. Ouch!
>
> So far the SHIFT_PERCPU_PTR/SHIFT_PERCPU_VAR is the only solution
> I have found for s390 and the dynamic percpu allocator. I'll keep
> looking for a better solution but I am not optimistic.

What does the assembler do when it gets GOTENT? GOTENT sounds like
global offset table entry, so does it make the assembler emit an entry
in GOT and then get the address indirectly?

Thanks.

--
tejun

2009-04-01 08:33:22

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 01 Apr 2009 17:17:33 +0900
Tejun Heo <[email protected]> wrote:

> Martin Schwidefsky wrote:
> > Is the goal to use the same access macros for both dynamically and
> > statically allocated percpu variables? That would make the proposed
> > solution impractical.
>
> Yeah, it's one of the goals so that we don't have to have two sets of
> APIs (e.g. the fast percpu_*() accessors).

Uh-oh..

> > The "X" constraint trick we used so far tells the compiler to pass the
> > argument verbatim to the assembler. The assembler knows how to deal
> > with symbol@GOTENT. If we pass a gcc variable or even a more general
> > term via an "X" constraint the assembler gets <something C-ish>@GOTENT.
> > It is not possible to parse this in the assembler. To do what you want
> > to achieve would mean to avoid using the "X" constraint. Which means
> > that we cannot use the GOTENT trick anymore. If we let the compiler
> > resolve the address for a static per cpu variable we end up with the
> > larl problem. Ouch!
> >
> > So far the SHIFT_PERCPU_PTR/SHIFT_PERCPU_VAR is the only solution
> > I have found for s390 and the dynamic percpu allocator. I'll keep
> > looking for a better solution but I am not optimistic.
>
> What does the assembler do when it gets GOTENT? GOTENT sounds like
> global offset table entry, so does it make the assembler emit an entry
> in GOT and then get the address indirectly?

Yes, @GOTENT is a relocation against the GOT slot that contains the
address of the symbol. It is a special version of @GOT that uses larl
to locate the got slot directly without the need of a got base pointer.

The code sequence with @GOT:

larl %r12,_GLOBAL_OFFSET_TABLE_
lg %r1,symbol@GOT(%r12)

is equivalent to:

larl %r1,symbol@GOTENT
lg %r1,0(%r1)

The advantage of the second code sequence is that it need a single
register and the size of the GOT is not limited to 4K as in the first
example (the offset in an RX format instruction is limited to 12 bits -
but that is probably something you don't want to know ;-).

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-04-01 08:37:32

by David Miller

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

From: Martin Schwidefsky <[email protected]>
Date: Wed, 1 Apr 2009 10:32:57 +0200

> The code sequence with @GOT:
>
> larl %r12,_GLOBAL_OFFSET_TABLE_
> lg %r1,symbol@GOT(%r12)
>
> is equivalent to:
>
> larl %r1,symbol@GOTENT
> lg %r1,0(%r1)
>
> The advantage of the second code sequence is that it need a single
> register and the size of the GOT is not limited to 4K as in the first
> example (the offset in an RX format instruction is limited to 12 bits -
> but that is probably something you don't want to know ;-).

If practical I think you guys should just force all of the module
address space below 4GB virtually, as we do on sparc64. It's a good
way to avoid all of these problems.

2009-04-01 08:47:55

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 01 Apr 2009 01:37:02 -0700 (PDT)
David Miller <[email protected]> wrote:

> From: Martin Schwidefsky <[email protected]>
> Date: Wed, 1 Apr 2009 10:32:57 +0200
>
> > The code sequence with @GOT:
> >
> > larl %r12,_GLOBAL_OFFSET_TABLE_
> > lg %r1,symbol@GOT(%r12)
> >
> > is equivalent to:
> >
> > larl %r1,symbol@GOTENT
> > lg %r1,0(%r1)
> >
> > The advantage of the second code sequence is that it need a single
> > register and the size of the GOT is not limited to 4K as in the first
> > example (the offset in an RX format instruction is limited to 12 bits -
> > but that is probably something you don't want to know ;-).
>
> If practical I think you guys should just force all of the module
> address space below 4GB virtually, as we do on sparc64. It's a good
> way to avoid all of these problems.

We have thought about that solution as well but it is not really a good
one. For a machine with more than 4GB of memory we would either lose
the memory that overlaps with the module area or we'd have to play
nasty remapping tricks. On s390 the kernel image is linked to address 0
(PAGE_OFFSET==0) and we have a simple 1:1 mapping for all real memory.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-04-01 08:52:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Martin Schwidefsky wrote:
> On Wed, 01 Apr 2009 01:37:02 -0700 (PDT)
> David Miller <[email protected]> wrote:
>
>> From: Martin Schwidefsky <[email protected]>
>> Date: Wed, 1 Apr 2009 10:32:57 +0200
>>
>>> The code sequence with @GOT:
>>>
>>> larl %r12,_GLOBAL_OFFSET_TABLE_
>>> lg %r1,symbol@GOT(%r12)
>>>
>>> is equivalent to:
>>>
>>> larl %r1,symbol@GOTENT
>>> lg %r1,0(%r1)
>>>
>>> The advantage of the second code sequence is that it need a single
>>> register and the size of the GOT is not limited to 4K as in the first
>>> example (the offset in an RX format instruction is limited to 12 bits -
>>> but that is probably something you don't want to know ;-).
>> If practical I think you guys should just force all of the module
>> address space below 4GB virtually, as we do on sparc64. It's a good
>> way to avoid all of these problems.
>
> We have thought about that solution as well but it is not really a good
> one. For a machine with more than 4GB of memory we would either lose
> the memory that overlaps with the module area or we'd have to play
> nasty remapping tricks. On s390 the kernel image is linked to address 0
> (PAGE_OFFSET==0) and we have a simple 1:1 mapping for all real memory.
>

Also, there is no guarantee that the offset from the dynamic allocator
will fall in the same 4G. There is a reserve mechanism for static ones
for archs which need it, but for dynamic ones the offset can be any
value.

Thanks.

--
tejun

2009-04-01 08:54:19

by David Miller

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

From: Martin Schwidefsky <[email protected]>
Date: Wed, 1 Apr 2009 10:47:37 +0200

> We have thought about that solution as well but it is not really a good
> one. For a machine with more than 4GB of memory we would either lose
> the memory that overlaps with the module area or we'd have to play
> nasty remapping tricks. On s390 the kernel image is linked to address 0
> (PAGE_OFFSET==0) and we have a simple 1:1 mapping for all real memory.

Flexibility can often cost you a few TLB entries, but that's
life :-)

2009-04-01 08:55:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello, Martin.

Martin Schwidefsky wrote:
> Yes, @GOTENT is a relocation against the GOT slot that contains the
> address of the symbol. It is a special version of @GOT that uses larl
> to locate the got slot directly without the need of a got base pointer.
>
> The code sequence with @GOT:
>
> larl %r12,_GLOBAL_OFFSET_TABLE_
> lg %r1,symbol@GOT(%r12)
>
> is equivalent to:
>
> larl %r1,symbol@GOTENT
> lg %r1,0(%r1)
>
> The advantage of the second code sequence is that it need a single
> register and the size of the GOT is not limited to 4K as in the first
> example (the offset in an RX format instruction is limited to 12 bits -
> but that is probably something you don't want to know ;-).

Maybe we can build indirection pointer manually by twiddling with
DEFINE_PER_CPU() in such a way that it doesn't have to distinguish
symbols and variables?
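
Purely as a sketch of that idea (every name here is made up, nothing
like it exists yet): DEFINE_PER_CPU() could additionally emit a
pointer holding the variable's address, and the accessors would
always load through that pointer instead of referencing the symbol
directly:

#define DEFINE_PER_CPU(type, name)					\
	__attribute__((__section__(".data.percpu")))			\
	__typeof__(type) per_cpu__##name;				\
	void *per_cpu_indir__##name = &per_cpu__##name
/* accessors would then go through per_cpu_indir__##name */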

--
tejun

2009-04-01 09:09:06

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 01 Apr 2009 17:50:45 +0900
Tejun Heo <[email protected]> wrote:

> Martin Schwidefsky wrote:
> > On Wed, 01 Apr 2009 01:37:02 -0700 (PDT)
> > David Miller <[email protected]> wrote:
> >
> >> If practical I think you guys should just force all of the module
> >> address space below 4GB virtually, as we do on sparc64. It's a good
> >> way to avoid all of these problems.
> >
> > We have thought about that solution as well but it is not really a good
> > one. For a machine with more than 4GB of memory we would either lose
> > the memory that overlaps with the module area or we'd have to play
> > nasty remapping tricks. On s390 the kernel image is linked to address 0
> > (PAGE_OFFSET==0) and we have a simple 1:1 mapping for all real memory.
> >
>
> Also, there is no guarantee that the offset from the dynamic allocator
> will fall in the same 4G. There is a reserve mechanism for static ones
> for archs which need it but for dynamic ones, the offset can be any
> value.

If we map the modules with a 4GB distance to the main kernel image then
we don't have to worry about the offsets anymore. The compiler could
just use LARL to get the address of the static per-cpu variables and
SHIFT_PERCPU_PTR could be a RELOC_HIDE.
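
For reference, that is roughly what the generic helpers already do (a
sketch along the lines of include/asm-generic/percpu.h, assuming the
existing RELOC_HIDE() and per_cpu_offset() definitions):

/* hide the offset arithmetic from gcc, so it stays free to use larl
   for the symbol itself: */
#define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))

#define per_cpu(var, cpu) \
	(*SHIFT_PERCPU_PTR(&per_cpu_var(var), per_cpu_offset(cpu)))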

It's just that the remapping we need to do is sooo icky ..

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-04-01 11:08:45

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 01 Apr 2009 17:53:42 +0900
Tejun Heo <[email protected]> wrote:

> Hello, Martin.
>
> Martin Schwidefsky wrote:
> > Yes, @GOTENT is a relocation against the GOT slot that contains the
> > address of the symbol. It is a special version of @GOT that uses larl
> > to locate the got slot directly without the need of a got base pointer.
> >
> > The code sequence with @GOT:
> >
> > larl %r12,_GLOBAL_OFFSET_TABLE_
> > lg %r1,symbol@GOT(%r12)
> >
> > is equivalent to:
> >
> > larl %r1,symbol@GOTENT
> > lg %r1,0(%r1)
> >
> > The advantage of the second code sequence is that it needs a single
> > register and the size of the GOT is not limited to 4K as in the first
> > example (the offset in an RX format instruction is limited to 12 bits -
> > but that is probably something you don't want to know ;-).
>
> Maybe we can build indirection pointer manually by twiddling with
> DEFINE_PER_CPU() in such a way that it doesn't have to distinguish
> symbols and variables?

Hmm, a provocative idea: can we ban the use of static per-cpu variables
for modules in common code? There are not many common code modules
doing this, with a visibility hack I found the following relevant
locations:

block/as-iosched.c:152: warning: 'visibility' attribute ignored
crypto/sha512_generic.c:30: warning: 'visibility' attribute ignored
block/cfq-iosched.c:51: warning: 'visibility' attribute ignored
mm/kmemleak-test.c:39: warning: 'visibility' attribute ignored
kernel/rcutorture.c:117: warning: 'visibility' attribute ignored
kernel/rcutorture.c:119: warning: 'visibility' attribute ignored
drivers/acpi/processor_core.c:663: warning: 'visibility' attribute ignored
drivers/acpi/processor_thermal.c:99: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_stats.c:46: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_userspace.c:30: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_userspace.c:31: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_userspace.c:32: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_userspace.c:33: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_userspace.c:35: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_ondemand.c:90: warning: 'visibility' attribute ignored
drivers/cpufreq/cpufreq_conservative.c:83: warning: 'visibility' attribute ignored
drivers/cpufreq/freq_table.c:177: warning: 'visibility' attribute ignored
net/ipv6/syncookies.c:77: warning: 'visibility' attribute ignored

That would "fix" the problem as well..

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 1 Apr 2009, Tejun Heo wrote:

> > 1. Lot of unnecessary use of __read_mostly for local static variables that
> > are not on the hotpath. Patch follows in this message.
>
> Hmmm... Those are basically read-only vars once initialization is
> complete. IIUC, __read_mostly just puts the tagged variables into a
> separate segment so that they don't have to worry about cacheline
> bouncing. Is there any reason to remove __read_mostly? Or is it just
that it's mostly superfluous?

Yes its better for the variables to be all together if its not on hot
cache paths. And its better to keep the __read_mostly segment reserved for
frequently accessed variables.

2009-04-01 15:54:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> On Wed, 1 Apr 2009, Tejun Heo wrote:
>
> > > 1. Lot of unnecessary use of __read_mostly for local static variables that
> > > are not on the hotpath. Patch follows in this message.
> >
> > Hmmm... Those are basically read-only vars once initialization is
> > complete. IIUC, __read_mostly just puts the tagged variables into a
> > separate segment so that they don't have to worry about cacheline
> > bouncing. Is there any reason to remove __read_mostly? Or is it just
> > that it's mostly superfluous?
>
> Yes its better for the variables to be all together if its not on hot
> cache paths. And its better to keep the __read_mostly segment reserved for
> frequently accessed variables.

We need __access_rarely i guess.
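
For what it's worth, __read_mostly is just a section attribute (see
arch/x86/include/asm/cache.h), so a hypothetical __access_rarely
could be a sibling section - sketch only, no such attribute exists:

#define __read_mostly	__attribute__((__section__(".data.read_mostly")))
/* hypothetical: */
#define __access_rarely	__attribute__((__section__(".data.access_rarely")))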

Ingo

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 1 Apr 2009, Ingo Molnar wrote:

> > Yes its better for the variables to be all together if its not on hot
> > cache paths. And its better to keep the __read_mostly segment reserved for
> > frequently accessed variables.
>
> We need __access_rarely i guess.

Better leave as is. __read_mostly is for hot code paths. __read_mostly
should be packed as tightly as possible to increase the chance that one
cacheline includes multiple of the critical variables for the hot code
paths. Too much __read_mostly defeats its purpose.

2009-04-01 19:06:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> __read_mostly should be packed as tightly as possible to increase
> the chance that one cacheline includes multiple of the critical
> variables for the hot code paths. Too much __read_mostly defeats
> its purpose.

That stance is commonly held but quite wrong and harmful IMHO.

It stifles the proper identification of read-mostly variables _AND_
it hurts the proper identification of critical write-often variables
as well. Not good.

The solution for critical write-often variables is what we always
used: to identify them explicitly and to place them consciously into
separate cachelines. (Or to per-cpu-ify or object-ify them where
possible/sensible.)

Then annotate everything that is read-mostly and accessed-frequently
with the __read_mostly attribute.

The rest (unannotated variables) is to be assumed "access-rarely" or
"we-dont-care", by default. This is actually 95% of the global
variables.

Yes, a spreading amount of annotations puts increasing pressure on
the places that are frequently accessed but not properly annotated -
but we should be happy about that: it creates the dynamics and
pressure for them to be properly annotated.

On the other hand, depending on the "put enough data bloat between
critical variables anyway, no need to care alignment" scheme is a
sloppy, fragile concept that does not lead to a reliable and
dependable end result.

It has two problems:

- Thinking that this solves false cacheline sharing reliably is
wrong: there's nothing that guarantees and enforces that slapping
a few variables between two critical variables puts them on
separate cachelines:

- Ongoing changes in code can bit-rot the
thought-to-be-large-enough distance between two critical
variables - and there's no mechanism in place.
Explicitly cacheline aligning them will preserve the
information long-term.

- There are architectures with larger cacheline sizes than
what you are developing on.

- .config variations can move variables closer or farther
apart from each other, hiding/triggering the false cacheline
sharing problem.

It is not a maintainable concept IMHO and we should not pretend
it is.

- It actually prevents true read-mostly variables from being
annotated properly. (In such a case a true read-mostly variable
bouncing around with a frequently-written variable cache line is
almost as bad in terms of MESI latencies and costs as false
cacheline sharing between two write-mostly variables.)

Architecting the layout of variables in a knowingly random and
.config sensitive way is simply not good design and we should not
pretend it is.

We might not be able to solve the problem if not enough people care
about their variables, but we should at least not be proud of a
non-solution ;-)

Ingo

2009-04-01 19:59:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator



On Wed, 1 Apr 2009, Ingo Molnar wrote:
>
> The rest (unannotated variables) is to be assumed "access-rarely" or
> "we-dont-care", by default. This is actually 95% of the global
> variables.

I'm not at all convinced that this is a good option.

The thing is, things like "read_mostly" or "access_rarely" may talk about
how we access those individual variables, but you're missing a _huge_
chunk of the puzzle if you ignore the _correlations_ of those accesses
with accesses to other variables.

The thing is, if you have variables 'a' and 'b', and they are always
accessed together, then it's probably worth it to put them in the same
cacheline.

And that is true EVEN IF 'b' is basically read-only, but 'a' is always
written to. Marking 'b' as read-only or read-mostly is actually a bad
idea, since the only thing it can ever result in is a bigger cache
footprint.

The fact that the read-only cacheline has "nicer" behavior is totally
irrelevant - it's still an extra cacheline that we're accessing. And they
aren't free, even if read-only cachelines that can be shared across CPU's
are certainly a lot cheaper than ones that bounce.

Linus

2009-04-01 20:13:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, Apr 01, 2009 at 12:39:46PM -0700, Linus Torvalds wrote:
> The thing is, things like "read_mostly" or "access_rarely" may talk about
> how we access those individual variables, but you're missing a _huge_
> chunk of the puzzle if you ignore the _correlations_ of those accesses
> with accesses to other variables.
>
> The thing is, if you have variables 'a' and 'b', and they are always
> accessed together, then it's probably worth it to put them in the same
> cacheline.

If you've got two global variables that are generally accessed together,
they should probably be `annotated' as such by putting them in a struct.
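
A hypothetical example of that kind of annotation-by-struct (the
names are made up): the write-often lock and the read-mostly limit
are always touched together, so they share a cacheline by
construction instead of by accident:

static struct {
	spinlock_t	lock;	/* written on every operation */
	unsigned long	limit;	/* effectively read-only after init */
} foo_state ____cacheline_aligned_in_smp;	/* spin_lock_init() at boot */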

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-04-01 22:35:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Linus Torvalds <[email protected]> wrote:

> On Wed, 1 Apr 2009, Ingo Molnar wrote:
> >
> > The rest (unannotated variables) is to be assumed
> > "access-rarely" or "we-dont-care", by default. This is actually
> > 95% of the global variables.
>
> I'm not at all convinced that this is a good option.
>
> The thing is, things like "read_mostly" or "access_rarely" may
> talk about how we access those individual variables, but you're
> missing a _huge_ chunk of the puzzle if you ignore the
> _correlations_ of those accesses with accesses to other variables.
>
> The thing is, if you have variables 'a' and 'b', and they are
> always accessed together, then it's probably worth it to put them
> in the same cacheline.
>
> And that is true EVEN IF 'b' is basically read-only, but 'a' is
> always written to. Marking 'b' as read-only or read-mostly is
> actually a bad idea, since the only thing it can ever result in is
> a bigger cache footprint.
>
> The fact that the read-only cacheline has "nicer" behavior is
> totally irrelevant - it's still an extra cacheline that we're
> accessing. And they aren't free, even if read-only cachelines that
> can be shared across CPU's are certainly a lot cheaper than ones
> that bounce.

Yes, i was mainly considering independent/uncorrelated variables in
my (short, hence simplistic) argument.

And you are perfectly right to point out that correlated-access
variables should not be annotated in a conflicting (or wasteful)
way.

But lets see this specific example we are discussing, demonstrated
on a live kernel.

Lets see how the cacheline usage and alignments look in
practice, and what the various effects of Christoph's patch are.
Lets check whether your and Christoph's intuition is correct that
this patch is a good change.

Christoph's patch removes __read_mostly annotations, such as
these:

static int pcpu_unit_pages __read_mostly;
static int pcpu_unit_size __read_mostly;
static int pcpu_chunk_size __read_mostly;
static int pcpu_nr_slots __read_mostly;
static size_t pcpu_chunk_struct_size __read_mostly;

I did a 64-bit x86 defconfig build to check how these symbols are
laid out, before and after the patch.

Before the patch:

ffffffff817d3e60 d shmem_backing_dev_info [ cacheline: ffffffff817d3f40 ]
ffffffff817d3f50 D sysctl_stat_interval
ffffffff817d3f54 D randomize_va_space
ffffffff817d3f58 D sysctl_max_map_count
ffffffff817d3f5c d vmap_initialized
ffffffff817d3f60 d pcpu_unit_pages *
ffffffff817d3f64 d pcpu_unit_size .
ffffffff817d3f68 d pcpu_chunk_size .
ffffffff817d3f6c d pcpu_nr_slots .
ffffffff817d3f70 d pcpu_chunk_struct_size .
ffffffff817d3f78 d pcpu_slot .
ffffffff817d3f80 D pcpu_base_addr [ cacheline: ffffffff817d3f80 ]
ffffffff817d3f90 d filp_cachep *
ffffffff817d3f90 d filp_cachep
ffffffff817d3f98 d pipe_mnt
ffffffff817d3fa0 d fasync_cache
ffffffff817d3fa8 D sysctl_vfs_cache_pressure
ffffffff817d3fb0 d dentry_cache
ffffffff817d3fb8 d d_hash_mask
ffffffff817d3fbc d d_hash_shift
ffffffff817d3fc0 d dentry_hashtable [ cacheline: ffffffff817d3fc0 ]

The .data (write-often) variables have this (before-patch) layout:

ffffffff81919480 b dmap.19506 [ cacheline: ffffffff81919480 ]
ffffffff81919488 b smap.19505 .
ffffffff81919490 b first_vm.19504 [ cacheline: ffffffff819194c0 ]
ffffffff819194d0 b pcpu_addr_root .
ffffffff819194d8 b pcpu_lock .
ffffffff819194e0 b pcpu_reserved_chunk .
ffffffff819194e8 b pcpu_reserved_chunk_limit *
ffffffff819194f0 b __key.21450
ffffffff819194f0 b old_max.21260
ffffffff81919500 B sb_lock [ cacheline: ffffffff81919500 ]

So the existing __read_mostly pcpu_* variables touch two cachelines
in the read-mostly data section, and two cachelines in .data.

After the patch we get one continuous block:

ffffffff81919480 B pcpu_base_addr [ cacheline: ffffffff81919480 ]
ffffffff81919488 b dmap.19506 .
ffffffff81919490 b smap.19505 .
ffffffff819194a0 b first_vm.19504 [ cacheline: ffffffff819194c0 ]
ffffffff819194e0 b pcpu_addr_root .
ffffffff819194e8 b pcpu_lock .
ffffffff819194ec b pcpu_unit_pages .
ffffffff819194f0 b pcpu_unit_size .
ffffffff819194f4 b pcpu_chunk_size .
ffffffff819194f8 b pcpu_nr_slots .
ffffffff81919500 b pcpu_chunk_struct_size [ cacheline: ffffffff81919500 ]
ffffffff81919508 b pcpu_reserved_chunk .
ffffffff81919510 b pcpu_reserved_chunk_limit .
ffffffff81919518 b pcpu_slot *
ffffffff81919520 b __key.21450
ffffffff81919520 b old_max.21260
ffffffff81919530 B sb_lock
ffffffff81919534 b unnamed_dev_lock
ffffffff81919540 b unnamed_dev_ida [ cacheline: ffffffff81919540 ]

All variables are in 3 cachelines.

So seemingly we have won one cacheline of cache footprint.

But the problem is, the use of the read-mostly variables that Tejun
annotated does not necessarily correlate with the use of the other
variables.

One case would be the free_percpu(NULL) fastpath.

We encourage kfree(NULL) and it triggers commonly in the kernel
today [on distro kernels we checked it can trigger once per
syscall!] - so i think we should consider free_percpu(NULL) a
possibly common pattern too. (even though today it's likely _not_
common at all.)

And free_percpu(NULL) does this:

void free_percpu(void *ptr)
{
void *addr = __pcpu_ptr_to_addr(ptr);
struct pcpu_chunk *chunk;
unsigned long flags;
int off;

if (!ptr)
return;

Look at the __pcpu_ptr_to_addr(ptr) transformation - that triggers
an access to pcpu_base_addr.

This is how it's compiled (GCC 4.3.2):

ffffffff810c26f1 <free_percpu>:
ffffffff810c26f1: 55 push %rbp
ffffffff810c26f2: 48 89 e5 mov %rsp,%rbp
ffffffff810c26f5: 41 55 push %r13
ffffffff810c26f7: 41 54 push %r12
ffffffff810c26f9: 53 push %rbx
ffffffff810c26fa: 48 83 ec 08 sub $0x8,%rsp
ffffffff810c26fe: 48 85 ff test %rdi,%rdi
ffffffff810c2701: 48 8b 05 78 6d 85 00 mov 0x856d78(%rip),%rax # ffffffff81919480 <pcpu_base_addr>
ffffffff810c2708: 0f 84 18 01 00 00 je ffffffff810c2826 <free_percpu+0x135>
ffffffff810c270e: 48 2d 00 00 00 00 sub $0x0,%rax

The instruction at ffffffff810c2701 will read the pcpu_base_addr
global variable into RAX.

Which this patch just moved out of the read-mostly section and into
the write-often .data section, and in the symbol map above it will
share a cacheline with first_vm - which vm_struct will be
dirtied if another CPU does a [real] percpu_alloc() call.

So ... we regressed the performance of percpu_free(NULL) with a
potential cross-CPU cacheline bounce. Without the patch,
percpu_free(NULL) would never do such a bounce. So i dont think the
patch is a good idea.

This is generally true of __read_mostly removal.

It is true that read-mostly variables can often be argued to always
be access-correlated, but often they are not. _Most_ in-kernel
facilities that are performance critical and have global variables
also have fastpaths as well.

2)

The other observation i'd like to make about the example above is
that _after_ the patch, we accidentally got into the same cacheline
as sb_lock: cacheline ffffffff81919500.

This is not a fault of the patch, but it is a demonstration of the
problem i was talking about: that our variable placement is
unplanned and random.

sb_lock can be a very hot lock in certain VFS workloads (on certain
filesystems - maybe even in other situations), so a workload that
also does frequent percpu_alloc() calls on another CPU will suffer
from the worst type of (and hardest to notice) cacheline bouncing:
two _different_ write-often-variable-accessing workloads bouncing
the same cacheline.

Before the patch we got lucky and we'd bounce two separate
cachelines - which on modern systems is twice as fast as bouncing
the same cacheline.

This example i believe strengthens the observation i made in the
original mail: my impression is that the best approach is to be
totally conscious about which variable is where, and to be generous
with __read_mostly, and to properly align any global write-often
variables.

I tried to fix these problems in the code and to sort out the
alignment of the write-mostly bits. One problem is that locks can
easily end up in the BSS - spread apart from non-zero initialized
.data global variables - which seems pretty dumb as it splits the
cacheline again.

It creates such kind of cacheline waste:

ffffffff817cd000 D tasklist_lock
ffffffff817cd040 D mmlist_lock
ffffffff817cd080 d softirq_vec
ffffffff817cd140 d pidmap_lock
ffffffff817cd180 D xtime_lock
ffffffff817cd1c0 d call_function
ffffffff817cd200 d hash_lock
ffffffff817cd240 d irq_desc_legacy
ffffffff817cde40 D kmalloc_caches
ffffffff817d1e80 d pcpu_alloc_mutex
ffffffff817d1ec0 D files_lock
ffffffff817d1f00 d nr_files

Each lock is on a separate cacheline - this is fine, but correlated
data is not there which is a bit dumb because we waste most of that
cacheline.

Can't we force the linker to put zero-initialized variables into
.data as well, moving all (non-annotated) variables into the same
section? (A downside is that the kernel image would not compress as
well, but that is not a big issue as it gets freed after init
anyway.)

I think it is another important fact that the read-mostly section is
well-packed (no unnecessary holes), and it is never intruded by
forced cacheline evictions. So in today's huge L2 caches it can
remain populated easily and can linger there for a long time. The
write-intense cachelines on the other hand will only be in a single
CPU's cache - and the more sockets, the less the chance is that it
will be in a nearby cache. So that extra cacheline footprint the
above example shows might be less of a cost in reality.

So the ideal variable layout IMHO would be roughly what i suggested
in the mail before:

- known-read-mostly variables in one continuous block in the
read-mostly section [not cacheline aligned]

- cacheline-aligned write-often variable starting the list of
variables, and all other global variables coming after it. This
would mean we start a new cacheline, but we'd also start

In fact ... i'd argue that each separate .o should _probably_ have
its .data section aligned to cacheline size, on SMP.

This would be a lot more maintainable than explicit cacheline
alignment for the first write-often variable: we'd only have to
annotate __read_mostly, and we'd have to move the most important
variables first (which is a good practice anyway).

The downside would be cacheline waste for tiny .o files with just
one or two global variables. Precise statistics would have to be
gathered on what the typical case looks like. A really wild
back-of-the-envelope ballpark guess would be the vmlinux i just built:

text data bss dec hex filename
7703478 1141192 974848 9819518 95d57e vmlinux

That kernel has 2403 object files - which gives 474 bytes of .data
per .o, and 405 bytes of .bss. If we combined the two, the average
would be around 1K. A 64-byte alignment for each such 1K block
would be more than acceptable and the granularity loss (32 bytes per
each block) would be around 75K of .data - or 3.6%.

In fact, the current bss would shrink due to those cacheline-aligned
locks going back to the .data and being packed right - so i think
the real figure would be even lower.

Looks like an acceptable equation - or at least one that could be
considered.

What do you think? I think the current situation is far from ideal,
the practices are not well settled and are also marred by harmful
misconceptions, and i think we could do a lot better via a few
simple measures.

Ingo

---
fs/super.c | 15 ++++++++++-----
mm/percpu.c | 46 +++++++++++++++++++++++++++++++---------------
2 files changed, 41 insertions(+), 20 deletions(-)

Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c
+++ linux/fs/super.c
@@ -42,9 +42,16 @@
#include <asm/uaccess.h>
#include "internal.h"

+/*
+ * Write-often and accessed-rarely global variables, cacheline aligned:
+ */

+__cacheline_aligned_in_smp DEFINE_SPINLOCK(sb_lock);
LIST_HEAD(super_blocks);
-DEFINE_SPINLOCK(sb_lock);
+
+static struct super_operations default_op;
+
+static DEFINE_MUTEX(sync_fs_mutex);

/**
* alloc_super - create new superblock
@@ -56,7 +63,6 @@ DEFINE_SPINLOCK(sb_lock);
static struct super_block *alloc_super(struct file_system_type *type)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
- static struct super_operations default_op;

if (s) {
if (security_sb_alloc(s)) {
@@ -466,9 +472,8 @@ restart:
void sync_filesystems(int wait)
{
struct super_block *sb;
- static DEFINE_MUTEX(mutex);

- mutex_lock(&mutex); /* Could be down_interruptible */
+ mutex_lock(&sync_fs_mutex); /* Could be down_interruptible */
spin_lock(&sb_lock);
list_for_each_entry(sb, &super_blocks, s_list) {
if (!sb->s_op->sync_fs)
@@ -498,7 +503,7 @@ restart:
goto restart;
}
spin_unlock(&sb_lock);
- mutex_unlock(&mutex);
+ mutex_unlock(&sync_fs_mutex);
}

/**
Index: linux/mm/percpu.c
===================================================================
--- linux.orig/mm/percpu.c
+++ linux/mm/percpu.c
@@ -100,20 +100,27 @@ struct pcpu_chunk {
struct page *page_ar[]; /* #cpus * UNIT_PAGES */
};

-static int pcpu_unit_pages __read_mostly;
-static int pcpu_unit_size __read_mostly;
-static int pcpu_chunk_size __read_mostly;
-static int pcpu_nr_slots __read_mostly;
-static size_t pcpu_chunk_struct_size __read_mostly;
+/*
+ * Read-mostly data:
+ */
+
+/* Constants, set at init time: */
+static int pcpu_unit_pages __read_mostly;
+static int pcpu_unit_size __read_mostly;
+static int pcpu_chunk_size __read_mostly;
+static int pcpu_nr_slots __read_mostly;
+static size_t pcpu_chunk_struct_size __read_mostly;

-/* the address of the first chunk which starts with the kernel static area */
-void *pcpu_base_addr __read_mostly;
+/* The address of the first chunk which starts with the kernel static area: */
+void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

-/* optional reserved chunk, only accessible for reserved allocations */
-static struct pcpu_chunk *pcpu_reserved_chunk;
-/* offset limit of the reserved chunk */
-static int pcpu_reserved_chunk_limit;
+/* Optional reserved chunk, only accessible for reserved allocations: */
+static struct pcpu_chunk *pcpu_reserved_chunk __read_mostly;
+
+/* Offset limit of the reserved chunk: */
+static int pcpu_reserved_chunk_limit
+ __read_mostly;

/*
* Synchronization rules.
@@ -136,12 +143,23 @@ static int pcpu_reserved_chunk_limit;
* allocation path might be referencing the chunk with only
* pcpu_alloc_mutex locked.
*/
-static DEFINE_MUTEX(pcpu_alloc_mutex); /* protects whole alloc and reclaim */
+
+/*
+ * Write-often and rarely accessed data, cacheline-aligned:
+ */
+
+/* protects whole alloc and reclaim: */
+static __cacheline_aligned_in_smp DEFINE_MUTEX(pcpu_alloc_mutex);
static DEFINE_SPINLOCK(pcpu_lock); /* protects index data structures */

-static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
static struct rb_root pcpu_addr_root = RB_ROOT; /* chunks by address */

+static struct vm_struct first_vm; /* first vm-list member */
+static int smap[2], dmap[2]; /* initial static and dynamic maps */
+
+/* chunk list slots: */
+static struct list_head *pcpu_slot;
+
/* reclaim work to release fully free chunks, scheduled from free path */
static void pcpu_reclaim(struct work_struct *work);
static DECLARE_WORK(pcpu_reclaim_work, pcpu_reclaim);
@@ -1087,8 +1105,6 @@ size_t __init pcpu_setup_first_chunk(pcp
void *base_addr,
pcpu_populate_pte_fn_t populate_pte_fn)
{
- static struct vm_struct first_vm;
- static int smap[2], dmap[2];
size_t size_sum = static_size + reserved_size +
(dyn_size >= 0 ? dyn_size : 0);
struct pcpu_chunk *schunk, *dchunk = NULL;

2009-04-01 22:58:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, Apr 02, 2009 at 12:32:41AM +0200, Ingo Molnar wrote:
> And free_percpu(NULL) does this:
>
> void free_percpu(void *ptr)
> {
> void *addr = __pcpu_ptr_to_addr(ptr);
> struct pcpu_chunk *chunk;
> unsigned long flags;
> int off;
>
> if (!ptr)
> return;

Why don't we rewrite this as:

- void *addr = __pcpu_ptr_to_addr(ptr);
+ void *addr;
...
if (!ptr)
return;
addr = __pcpu_ptr_to_addr(ptr);

if kfree(NULL) is really that important, we should avoid doing this
extra work, not just rely on the variable being cache-hot.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-04-02 00:20:45

by Rusty Russell

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wednesday 01 April 2009 18:47:33 Tejun Heo wrote:
> Martin Schwidefsky wrote:
> > Is the goal to use the same access macros for both dynamically and
> > statically allocated percpu variables? That would make the proposed
> > solution impractical.
>
> Yeah, it's one of the goals so that we don't have to have two sets of
> APIs (e.g. the fast percpu_*() accessors).

There's a weaker, but still useful, subset of this goal: to allow the
ptr versions to access any var (ie. you can do "DEFINE_PER_CPU(int, foo);
... some_func(&foo)") yet still have the get_cpu_var() be the optimized
actual-variable versions.

Don't know that the distinction is *useful* here tho...
Rusty.

2009-04-02 01:55:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Martin Schwidefsky wrote:
>> Also, there is no guarantee that the offset from the dynamic allocator
>> will fall in the same 4G. There is a reserve mechanism for static ones
>> for archs which need it but for dynamic ones, the offset can be any
>> value.
>
> If we map the modules with a 4GB distance to the main kernel image then
> we don't have to worry about the offsets anymore. The compiler could
> just use LARL to get the address of the static per-cpu variables and
> SHIFT_PERCPU_PTR could be a RELOC_HIDE.

Ah... okay. Great. No more GOTENT trick for static ones and for
dynamic ones the compiler will generate indirection.

> It's just that the remapping we need to do is sooo icky ..

Fingers and toes crossed. It would improve general percpu access
performance too.

Thanks.

--
tejun

2009-04-02 01:58:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hi,

Martin Schwidefsky wrote:
>> Maybe we can build indirection pointer manually by twiddling with
>> DEFINE_PER_CPU() in such a way that it doesn't have to distinguish
>> symbols and variables?
>
> Hmm, a provocative idea: can we ban the use of static per-cpu variables
> for modules in common code? There are not many common code modules
> doing this, with a visibility hack I found the following relevant
> locations:
>
> ....
>
> That would "fix" the problem as well..

Eh... I don't know. I'd really like to remove such API limitations as
most of the necessary pieces are already there anyway, so I'd much
prefer the re-mapping thing if it can be made to work.

Thanks.

--
tejun

2009-04-02 02:13:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Matthew Wilcox <[email protected]> wrote:

> On Thu, Apr 02, 2009 at 12:32:41AM +0200, Ingo Molnar wrote:
> > And free_percpu(NULL) does this:
> >
> > void free_percpu(void *ptr)
> > {
> > void *addr = __pcpu_ptr_to_addr(ptr);
> > struct pcpu_chunk *chunk;
> > unsigned long flags;
> > int off;
> >
> > if (!ptr)
> > return;
>
> Why don't we rewrite this as:
>
> - void *addr = __pcpu_ptr_to_addr(ptr);
> + void *addr;
> ...
> if (!ptr)
> return;
> addr = __pcpu_ptr_to_addr(ptr);
>
> if kfree(NULL) is really that important, we should avoid doing this
> extra work, not just rely on the variable being cache-hot.

Yes, of course we can fix that, the NULL fastpath needs no access to
anything but the call parameter.

Note that my argument was different though: that assumptions about
variable correlation are very hard to track and validate, and that
IMHO we should be using __read_mostly generously (we know _that_
attribute with a rather high likelihood), and we should group the
remaining variables together, starting at a cacheline aligned
address.
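
Concretely, something like this (a sketch with made-up variable
names):

/* read-mostly and accessed frequently: */
static unsigned int foo_cache_size __read_mostly;

/* the remaining (write-often / don't-care) variables, one
   cacheline-aligned group: */
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(foo_lock);
static LIST_HEAD(foo_list);
static unsigned int foo_count;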

A sub-argument was that the boundary between global variables from
different .o files should perhaps not be 'merged' together (on SMP),
because the sharing effects are not maintainable: an example of
badness is the sb_lock false cacheline sharing that Christoph's
patch triggered (randomly).

A sub-sub argument was that perhaps we should not split .data and
.bss variables into separate sections - it doubles the chance of
false cacheline sharing and spreads the cacheline footprint.

Ingo

2009-04-02 02:17:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Matthew Wilcox <[email protected]> wrote:

> On Wed, Apr 01, 2009 at 12:39:46PM -0700, Linus Torvalds wrote:
>
> > The thing is, things like "read_mostly" or "access_rarely" may
> > talk about how we access those individual variables, but you're
> > missing a _huge_ chunk of the puzzle if you ignore the
> > _correlations_ of those accesses with accesses to other
> > variables.
> >
> > The thing is, if you have variables 'a' and 'b', and they are
> > always accessed together, then it's probably worth it to put
> > them in the same cacheline.
>
> If you've got two global variables that are generally accessed
> together, they should probably be `annotated' as such by putting
> them in a struct.

It is certainly done so in a number of cases (say the RCU core and
the scheduler), but i dont think it should be forced or preferred in
any way.

IMHO it's equally good and clean code to have the global variables
separately at the top of a .c file. Sometimes it's cleaner.

Ingo

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 1 Apr 2009, Ingo Molnar wrote:

> > __read_mostly should be packed as tightly as possible to increase
> > the chance that one cacheline includes multiple of the critical
> > variables for the hot code paths. Too much __read_mostly defeats
> > its purpose.
>
> That stance is commonly held but quite wrong and harmful IMHO.

Well that is the reason I introduced __read_mostly.

> It stifles the proper identification of read-mostly variables _AND_
> it hurts the proper identification of critical write-often variables
> as well. Not good.

Well then lets create another way of annotating variables that does not
move them into a separate section.

> The solution for critical write-often variables is what we always
> used: to identify them explicitly and to place them consciously into
> separate cachelines. (Or to per-cpu-ify or object-ify them where
> possible/sensible.)

Right. But there are none here.

> Then annotate everything that is read-mostly and accessed-frequently
> with the __read_mostly attribute.

None of that is the case here. These are rarely used variables for
allocation and free of percpu variables.

> - Thinking that this solves false cacheline sharing reliably is
> wrong: there's nothing that guarantees and enforces that slapping
> a few variables between two critical variables puts them on
> separate cachelines:

__read_mostly reduces cacheline bouncing problems significantly by saying
that these variables are rarely updated and frequently used in critical
paths. Thus the special placement.

> - It actually prevents true read-mostly variables from being
> annotated properly. (In such a case a true read-mostly variable
> bouncing around with a frequently-written variable cache line is
> almost as bad in terms of MESI latencies and costs as false
> cacheline sharing between two write-mostly variables.)

What I often thought we need is another per cpu read mostly section that
is always local NUMA wise. This means the percpu read mostly section would
be replicated per node. The update of such a read mostly variable would
then take a loop over all these per node segments which would be more
expensive. However, reads would always be local and thus it would be an
advantage.
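
A very rough sketch of that idea (all names below are hypothetical,
nothing like this exists today): reads would always hit the local
node's replica, while updates would have to walk every node's copy:

static void *pcpu_ro_node_base[MAX_NUMNODES];	/* per-node replicas */

static void pcpu_ro_update_int(size_t offset, int val)
{
	int node;

	for_each_online_node(node)
		*(int *)(pcpu_ro_node_base[node] + offset) = val;
}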

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 2 Apr 2009, Ingo Molnar wrote:

> So ... we regressed the performance of percpu_free(NULL) with a
> potential cross-CPU cacheline bounce. Without the patch,
> percpu_free(NULL) would never do such a bounce. So i dont think the
> patch is a good idea.


But percpu_free is not an operation typical for hot code paths. Per cpu
variables are allocated and freed in rarely used code paths. Typically
the use of the per cpu variables occurs in hot code paths, not their
allocations.

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 2 Apr 2009, Ingo Molnar wrote:

> Note that my argument was different though: that assumptions about
> variable correlation are very hard to track and validate, and that
> IMHO we should be using __read_mostly generously (we know _that_
> attribute with a rather high likelihood), and we should group the
> remaining variables together, starting at a cacheline aligned
> address.

But then you decrease the density of accesses to the __read_mostly
sections. The cachelines are not hot in the caches anymore which is an
average performance reduction.

> A sub-sub argument was that perhaps we should not split .data and
> .bss variables into separate sections - it doubles the chance of
> false cacheline sharing and spreads the cacheline footprint.

False cacheline sharing is something normal that comes with the cpu
caching schemes. As long as there is no significant impact on performance
we are fine with it. Extensive measures to avoid false cacheline sharing
on unimportant variables increase the cache footprint of code.

2009-04-02 02:33:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

Matthew Wilcox wrote:
> On Thu, Apr 02, 2009 at 12:32:41AM +0200, Ingo Molnar wrote:
>> And free_percpu(NULL) does this:
>>
>> void free_percpu(void *ptr)
>> {
>> void *addr = __pcpu_ptr_to_addr(ptr);
>> struct pcpu_chunk *chunk;
>> unsigned long flags;
>> int off;
>>
>> if (!ptr)
>> return;
>
> Why don't we rewrite this as:
>
> - void *addr = __pcpu_ptr_to_addr(ptr);
> + void *addr;
> ...
> if (!ptr)
> return;
> addr = __pcpu_ptr_to_addr(ptr);
>
> if kfree(NULL) is really that important, we should avoid doing this
> extra work, not just rely on the variable being cache-hot.

Slightly off-topic but the expectation was that gcc would be smart
enough to optimize it as it sees fit regardless of the actual code
ordering. As the access to the global variable is only a read, there is no
actual difference between the original code and yours. Oh well... at
any rate, I don't think it's at the level we should be optimizing at.

Thanks.

--
tejun

2009-04-02 03:27:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> > A sub-sub argument was that perhaps we should not split .data
> > and .bss variables into separate sections - it doubles the
> > chance of false cacheline sharing and spreads the cacheline
> > footprint.
>
> False cacheline sharing is something normal that comes with the
> cpu caching schemes. As long as there is no significant impact on
> performance we are fine with it. Extensive measures to avoid false
> cacheline sharing on unimportant variables increase the cache
> footprint of code.

You didnt really answer to my suggestion(s) though. You only stated
that:

"problem XYZ is something normal that comes with the cpu caching
schemes. As long as there is no significant impact on
performance we are fine with it."

Which i'd proffer is true for any given value of XYZ ;-)

Ingo

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 2 Apr 2009, Ingo Molnar wrote:

> You didnt really answer to my suggestion(s) though. You only stated
> that:
>
> "problem XYZ is something normal that comes with the cpu caching
> schemes. As long as there is no significant impact on
> performance we are fine with it."
>
> Which i'd proffer is true for any given value of XYZ ;-)

Nope. If XYZ is a significant performance issue then we are not fine with
XYZ. __read_mostly was introduced to deal with significant false aliasing
issues.

False aliasing can occur anytime you place two variables in the
same cacheline. That is by design in the current cacheline schemes. One
would not place a variable in a separate cacheline without good
reason.




2009-04-02 03:45:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> On Thu, 2 Apr 2009, Ingo Molnar wrote:
>
> > So ... we regressed the performance of percpu_free(NULL) with a
> > potential cross-CPU cacheline bounce. Without the patch,
> > percpu_free(NULL) would never do such a bounce. So i dont think
> > the patch is a good idea.
>
> But percpu_free is not an operation typical for hot code paths.
> Per cpu variables are allocated and freed in rarely used code
> paths. Typically the use of the per cpu variables occurs in hot
> code paths, not their allocations.

to quote an earlier part of my mail:

> > We encourage kfree(NULL) and it triggers commonly in the kernel
> > today [on distro kernels we checked it can trigger once per
> > syscall!] - so i think we should consider free_percpu(NULL) a
> > possibly common pattern too. (even though today it's likely
> > _not_ common at all.)

I specifically mentioned that it is not at all common now.

But there's no reason why an object shutdown fastpath with an
optional percpu buffer (say for debug statistics, not enabled by
default) couldnt look like this:

percpu_free(NULL);

We actually have such patterns of kfree(ptr) use, where the _common_
case in a fastpath is kfree(NULL).

So, had your patch been applied as-is, we might have created this
situation.

Yes, it can be fixed, and yes, it is probably not worth even
bothering (percpu alloc/free goes over all online CPUs) - but i
found it kind of interesting that this random specific example we
ran into (i totally didnt go out and try to find some other, better
example) showed so many variable placement ambiguities and false
cacheline sharing problems, and showed visible inefficiencies in the
layout (i showed a whole line of variables wastefully spaced -
unrelated to the percpu allocator).

So these examples almost seemed to support my earlier points in the
thread where i suggested that variable placement is often random and
ambiguous, and that we could fix inefficiencies in the layout ;-)

Ingo

2009-04-02 04:23:25

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/2 x86#core/percpu] percpu: don't put the first chunk in reverse-map rbtree

Impact: both first chunks don't use rbtree, no functional change

There can be two first chunks - reserved and dynamic with the former
one being optional. Dynamic first chunk was linked on reverse-mapping
rbtree while the reserved one was mapped manually using the start
address and reserved offset limit.

This patch makes both first chunks be looked up manually without
using the rbtree. This is to help getting rid of the rbtree.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/percpu.c | 41 ++++++++++++++++++++++++-----------------
1 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 1aa5d8f..bf1bf1f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -110,9 +110,21 @@ static size_t pcpu_chunk_struct_size __read_mostly;
void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

-/* optional reserved chunk, only accessible for reserved allocations */
+/*
+ * The first chunk which always exists. Note that unlike other
+ * chunks, this one can be allocated and mapped in several different
+ * ways and thus often doesn't live in the vmalloc area.
+ */
+static struct pcpu_chunk *pcpu_first_chunk;
+
+/*
+ * Optional reserved chunk. This chunk reserves part of the first
+ * chunk and serves it for reserved allocations. The amount of
+ * reserved offset is in pcpu_reserved_chunk_limit. When reserved
+ * area doesn't exist, the following variables contain NULL and 0
+ * respectively.
+ */
static struct pcpu_chunk *pcpu_reserved_chunk;
-/* offset limit of the reserved chunk */
static int pcpu_reserved_chunk_limit;

/*
@@ -297,15 +309,16 @@ static struct rb_node **pcpu_chunk_rb_search(void *addr,
*/
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
+ void *first_start = pcpu_first_chunk->vm->addr;
struct rb_node *n, *parent;
struct pcpu_chunk *chunk;

- /* is it in the reserved chunk? */
- if (pcpu_reserved_chunk) {
- void *start = pcpu_reserved_chunk->vm->addr;
-
- if (addr >= start && addr < start + pcpu_reserved_chunk_limit)
+ /* is it in the first chunk? */
+ if (addr >= first_start && addr < first_start + pcpu_chunk_size) {
+ /* is it in the reserved area? */
+ if (addr < first_start + pcpu_reserved_chunk_limit)
return pcpu_reserved_chunk;
+ return pcpu_first_chunk;
}

/* nah... search the regular ones */
@@ -1147,7 +1160,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,

if (reserved_size) {
schunk->free_size = reserved_size;
- pcpu_reserved_chunk = schunk; /* not for dynamic alloc */
+ pcpu_reserved_chunk = schunk;
+ pcpu_reserved_chunk_limit = static_size + reserved_size;
} else {
schunk->free_size = dyn_size;
dyn_size = 0; /* dynamic area covered */
@@ -1158,8 +1172,6 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
if (schunk->free_size)
schunk->map[schunk->map_used++] = schunk->free_size;

- pcpu_reserved_chunk_limit = static_size + schunk->free_size;
-
/* init dynamic chunk if necessary */
if (dyn_size) {
dchunk = alloc_bootmem(sizeof(struct pcpu_chunk));
@@ -1226,13 +1238,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
}

/* link the first chunk in */
- if (!dchunk) {
- pcpu_chunk_relocate(schunk, -1);
- pcpu_chunk_addr_insert(schunk);
- } else {
- pcpu_chunk_relocate(dchunk, -1);
- pcpu_chunk_addr_insert(dchunk);
- }
+ pcpu_first_chunk = dchunk ?: schunk;
+ pcpu_chunk_relocate(pcpu_first_chunk, -1);

/* we're done */
pcpu_base_addr = (void *)pcpu_chunk_addr(schunk, 0, 0);
--
1.6.0.2

2009-04-02 04:23:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/2 x86#core/percpu] percpu: remove rbtree and use page->index instead

From: Christoph Lameter <[email protected]>

Impact: use page->index for addr to chunk mapping instead of dedicated rbtree

The rbtree is used to determine the chunk from the virtual address.
However, we can already determine the page struct from a virtual
address and there are several unused fields in the page struct for pages used by
vmalloc. Use the index field to store a pointer to the chunk. Then
there is no need anymore for an rbtree.

tj: * s/(set|get)_chunk/pcpu_\1_page_chunk/

* Drop inline from the above two functions and moved them upwards
so that they are with other simple helpers.

* Initial pages might not (actually most of the time don't) live
in the vmalloc area. With the previous patch to manually
reverse-map both first chunks, this is no longer an issue.
Removed pcpu_set_chunk() call on initial pages.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
I made some modifications and it needed the previous patch to get the
first chunk reverse-mapping working. Tested all three first chunk
allocators and everything seems to work fine. Nice cleanup. :-)

mm/percpu.c | 100 ++++++++++++-----------------------------------------------
1 files changed, 20 insertions(+), 80 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index bf1bf1f..c0b2c1a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -23,7 +23,7 @@
* Allocation is done in offset-size areas of single unit space. Ie,
* an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
* c1:u1, c1:u2 and c1:u3. Percpu access can be done by configuring
- * percpu base registers UNIT_SIZE apart.
+ * percpu base registers pcpu_unit_size apart.
*
* There are usually many small percpu allocations many of them as
* small as 4 bytes. The allocator organizes chunks into lists
@@ -38,8 +38,8 @@
* region and negative allocated. Allocation inside a chunk is done
* by scanning this map sequentially and serving the first matching
* entry. This is mostly copied from the percpu_modalloc() allocator.
- * Chunks are also linked into a rb tree to ease address to chunk
- * mapping during free.
+ * Chunks can be determined from the address using the index field
+ * in the page struct. The index field contains a pointer to the chunk.
*
* To use this allocator, arch code should do the followings.
*
@@ -61,7 +61,6 @@
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/pfn.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
@@ -88,7 +87,6 @@

struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
- struct rb_node rb_node; /* key is chunk->vm->addr */
int free_size; /* free bytes in the chunk */
int contig_hint; /* max contiguous size hint */
struct vm_struct *vm; /* mapped vmalloc region */
@@ -133,7 +131,7 @@ static int pcpu_reserved_chunk_limit;
* There are two locks - pcpu_alloc_mutex and pcpu_lock. The former
* protects allocation/reclaim paths, chunks and chunk->page arrays.
* The latter is a spinlock and protects the index data structures -
- * chunk slots, rbtree, chunks and area maps in chunks.
+ * chunk slots, chunks and area maps in chunks.
*
* During allocation, pcpu_alloc_mutex is kept locked all the time and
* pcpu_lock is grabbed and released as necessary. All actual memory
@@ -152,7 +150,6 @@ static DEFINE_MUTEX(pcpu_alloc_mutex); /* protects whole alloc and reclaim */
static DEFINE_SPINLOCK(pcpu_lock); /* protects index data structures */

static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
-static struct rb_root pcpu_addr_root = RB_ROOT; /* chunks by address */

/* reclaim work to release fully free chunks, scheduled from free path */
static void pcpu_reclaim(struct work_struct *work);
@@ -203,6 +200,18 @@ static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
return *pcpu_chunk_pagep(chunk, 0, page_idx) != NULL;
}

+/* set the pointer to a chunk in a page struct */
+static void pcpu_set_page_chunk(struct page *page, struct pcpu_chunk *pcpu)
+{
+ page->index = (unsigned long)pcpu;
+}
+
+/* obtain pointer to a chunk from a page struct */
+static struct pcpu_chunk *pcpu_get_page_chunk(struct page *page)
+{
+ return (struct pcpu_chunk *)page->index;
+}
+
/**
* pcpu_mem_alloc - allocate memory
* @size: bytes to allocate
@@ -269,40 +278,9 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
}
}

-static struct rb_node **pcpu_chunk_rb_search(void *addr,
- struct rb_node **parentp)
-{
- struct rb_node **p = &pcpu_addr_root.rb_node;
- struct rb_node *parent = NULL;
- struct pcpu_chunk *chunk;
-
- while (*p) {
- parent = *p;
- chunk = rb_entry(parent, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr)
- p = &(*p)->rb_left;
- else if (addr > chunk->vm->addr)
- p = &(*p)->rb_right;
- else
- break;
- }
-
- if (parentp)
- *parentp = parent;
- return p;
-}
-
/**
- * pcpu_chunk_addr_search - search for chunk containing specified address
- * @addr: address to search for
- *
- * Look for chunk which might contain @addr. More specifically, it
- * searchs for the chunk with the highest start address which isn't
- * beyond @addr.
- *
- * CONTEXT:
- * pcpu_lock.
+ * pcpu_chunk_addr_search - determine chunk containing specified address
+ * @addr: address for which the chunk needs to be determined.
*
* RETURNS:
* The address of the found chunk.
@@ -310,8 +288,6 @@ static struct rb_node **pcpu_chunk_rb_search(void *addr,
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
void *first_start = pcpu_first_chunk->vm->addr;
- struct rb_node *n, *parent;
- struct pcpu_chunk *chunk;

/* is it in the first chunk? */
if (addr >= first_start && addr < first_start + pcpu_chunk_size) {
@@ -321,42 +297,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
return pcpu_first_chunk;
}

- /* nah... search the regular ones */
- n = *pcpu_chunk_rb_search(addr, &parent);
- if (!n) {
- /* no exactly matching chunk, the parent is the closest */
- n = parent;
- BUG_ON(!n);
- }
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr) {
- /* the parent was the next one, look for the previous one */
- n = rb_prev(n);
- BUG_ON(!n);
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
- }
-
- return chunk;
-}
-
-/**
- * pcpu_chunk_addr_insert - insert chunk into address rb tree
- * @new: chunk to insert
- *
- * Insert @new into address rb tree.
- *
- * CONTEXT:
- * pcpu_lock.
- */
-static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
-{
- struct rb_node **p, *parent;
-
- p = pcpu_chunk_rb_search(new->vm->addr, &parent);
- BUG_ON(*p);
- rb_link_node(&new->rb_node, parent, p);
- rb_insert_color(&new->rb_node, &pcpu_addr_root);
+ return pcpu_get_page_chunk(vmalloc_to_page(addr));
}

/**
@@ -768,6 +709,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
alloc_mask, 0);
if (!*pagep)
goto err;
+ pcpu_set_page_chunk(*pagep, chunk);
}
}

@@ -892,7 +834,6 @@ restart:

spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
- pcpu_chunk_addr_insert(chunk);
goto restart;

area_found:
@@ -981,7 +922,6 @@ static void pcpu_reclaim(struct work_struct *work)
if (chunk == list_first_entry(head, struct pcpu_chunk, list))
continue;

- rb_erase(&chunk->rb_node, &pcpu_addr_root);
list_move(&chunk->list, &todo);
}

--
1.6.0.2

2009-04-02 07:24:27

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, Apr 02, 2009 at 10:57:13AM +0900, Tejun Heo wrote:
> Martin Schwidefsky wrote:
> >> Maybe we can build indirection pointer manually by twiddling with
> >> DEFINE_PER_CPU() in such a way that it doesn't have to distinguish
> >> symbols and variables?
> >
> > Hmm, a provocative idea: can we ban the use of static per-cpu variables
> > for modules in common code? There are not many common code modules
> > doing this, with a visibility hack I found the following relevant
> > locations:
> >
> > ....
> >
> > That would "fix" the problem as well..
>
> Eh... I don't know. I'd really like to remove such API limitations as
> most of the necessary pieces are already there anyway, so I'd much
> prefer the re-mapping thing if it can be made to work.

On alpha, the main kernel lives in the unmapped area (superpage)
and modules get loaded into the vmalloc area. Changing this means a total
rewrite of the alpha code, so the remapping thing is not affordable...

On the other hand, some tricks with DEFINE_PER_CPU() are indeed possible -
for instance, using weak references we could force the compiler to
generate proper addressing mode. So DEFINE_PER_CPU(int, foo) in module
would expand to something like this:

extern int per_cpu__foo;
asm(".weakref per_cpu__foo, per_cpu_mod__foo");
__attribute__((__section__(".data.percpu"))) int per_cpu_mod__foo

The main problem is that our DEFINE_PER_CPU() macro consists of more
than one definition, so it won't be possible to specify both storage class
and initializer with it.

If it's acceptable to change the semantics from

static DEFINE_PER_CPU(int, foo) = 1

to

DEFINE_PER_CPU(static, int, foo) = 1

then we're ok.

Or maybe just add STATIC_DEFINE_PER_CPU_xx() variants?
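
A rough sketch of such a variant, building on the weakref expansion
above (illustrative only, not an existing macro):

#define STATIC_DEFINE_PER_CPU(type, name)				\
	extern __typeof__(type) per_cpu__##name;			\
	asm(".weakref per_cpu__" #name ", per_cpu_mod__" #name);	\
	__attribute__((__section__(".data.percpu")))			\
	__typeof__(type) per_cpu_mod__##name

/* the variable definition comes last, so
 *	STATIC_DEFINE_PER_CPU(int, foo) = 1;
 * can still take an initializer */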

Ivan.

2009-04-02 11:14:37

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 2 Apr 2009 11:24:18 +0400
Ivan Kokshaysky <[email protected]> wrote:

> On the other hand, some tricks with DEFINE_PER_CPU() are indeed possible -
> for instance, using weak references we could force the compiler to
> generate proper addressing mode. So DEFINE_PER_CPU(int, foo) in module
> would expand to something like this:
>
> extern int per_cpu__foo;
> asm(".weakref per_cpu__foo, per_cpu_mod__foo");
> __attribute__((__section__(".data.percpu"))) int per_cpu_mod__foo
>
> The main problem is that our DEFINE_PER_CPU() macro consists of more
> than one definition, so it won't be possible to specify both storage class
> and initializer with it.
>
> If it's acceptable to change the semantics from
>
> static DEFINE_PER_CPU(int, foo) = 1
>
> to
>
> DEFINE_PER_CPU(static, int, foo) = 1
>
> then we're ok.
>
> Or maybe just add STATIC_DEFINE_PER_CPU_xx() variants?

That is what I'm after as well. Just drop the "static" from the
DEFINE_PER_CPU statement found inside modules and it works.

My experiments with the weak and visibility attributes failed because
the static storage class specifier together with the attribute either
causes a compile error or static just overrides the attribute.
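
For reference, the failing combination looks roughly like this
(illustrative only, not taken from my build logs):

/* gcc rejects a weak definition with internal linkage outright (weak
 * symbols must have external linkage), which is the compile error
 * mentioned above; the visibility attribute, in turn, is simply
 * overridden by "static". */
static int per_cpu__foo __attribute__((weak));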

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 2 Apr 2009, Ingo Molnar wrote:

> to quote an earlier part of my mail:
>
> > > We encourage kfree(NULL) and it triggers commonly in the kernel
> > > today [on distro kernels we checked it can trigger once per
> > > syscall!] - so i think we should consider free_percpu(NULL) a
> > > possibly common pattern too. (even though today it's likely
> > > _not_ common at all.)
>
> I specifically mentioned that it is not at all common now.

What is this? Nonsense day? Consider it a common pattern although it's
likely not common at all? April fools day?

> But there's no reason why an object shutdown fastpath with an
> optional percpu buffer (say for debug statistics, not enabled by
> default) couldn't look like this:
>
> percpu_free(NULL);
>
> We actually have such patterns of kfree(ptr) use, where the _common_
> case in a fastpath is kfree(NULL).

Speculation. A shutdown fastpath? The percpu allocation and free
operations are expensive and deal with teardown and setup of virtual
mappings. Those paths are *not* optimized for fastpath use. kfree is
different.
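
For concreteness, the pattern being argued about would be something like
the sketch below (hypothetical code, not taken from the kernel; it assumes
free_percpu() tolerates a NULL pointer, which is exactly the behavior in
dispute):

#include <linux/percpu.h>
#include <linux/slab.h>

struct foo {
	int *stats;	/* optional percpu counters, NULL unless enabled */
	/* ... */
};

static void foo_destroy(struct foo *foo)
{
	free_percpu(foo->stats);	/* NULL in the common, debug-off case */
	kfree(foo);
}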

2009-04-03 00:32:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Martin Schwidefsky wrote:
> On Thu, 2 Apr 2009 11:24:18 +0400
> Ivan Kokshaysky <[email protected]> wrote:
>
>> On the other hand, some tricks with DEFINE_PER_CPU() are indeed possible -
>> for instance, using weak references we could force the compiler to
>> generate proper addressing mode. So DEFINE_PER_CPU(int, foo) in module
>> would expand to something like this:
>>
>> extern int per_cpu__foo;
>> asm(".weakref per_cpu__foo, per_cpu_mod__foo");
>> __attribute__((__section__(".data.percpu"))) int per_cpu_mod__foo
>>
>> The main problem is that our DEFINE_PER_CPU() macro consists of more
>> than one definition, so it won't be possible to specify both storage class
>> and initializer with it.
>>
>> If it's acceptable to change the semantics from
>>
>> static DEFINE_PER_CPU(int, foo) = 1
>>
>> to
>>
>> DEFINE_PER_CPU(static, int, foo) = 1
>>
>> then we're ok.
>>
>> Or maybe just add STATIC_DEFINE_PER_CPU_xx() variants?
>
> That is what I'm after as well. Just drop the "static" from the
> DEFINE_PER_CPU statement found inside modules and it works.
>
> My experiments with the weak and visibility attribute failed because
> the static storage class specifier together with the attribute either
> causes a compile error or static just overrides the attribute.

Can STATIC_DEFINE_PER_CPU() be made to work? It's not pretty but if
that's the only sensible way to reach uniform static/dynamic handling,
I suppose we can ignore the slight ugliness.

Rusty, Ingo, what do you guys think?

Thanks.

--
tejun

2009-04-07 16:09:50

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Fri, Apr 03, 2009 at 09:31:10AM +0900, Tejun Heo wrote:
> Can STATIC_DEFINE_PER_CPU() be made to work? It's not pretty but if
> that's the only sensible way to reach uniform static/dynamic handling,
> I suppose we can ignore the slight ugliness.

Well, I've got a workaround that has zero impact on common code (patch
appended).

The idea is that we can effectively discard the "static" specifier
by declaring a dummy variable right before actual per-cpu variable
definition:

#define DEFINE_PER_CPU_SECTION(type, name, section) \
__attribute__((__section__(".garbage"), __unused__)) \
char __dummy__##name; \
__attribute__((__section__(PER_CPU_BASE_SECTION section))) \
__weak __typeof__(type) per_cpu__##name

In the "static" case the __dummy__ thing gets optimized away completely,
so the only downside is that the modules with public per-cpu variables
will have unused symbols in the .garbage section.
It's certainly possible to discard that section in the build process
(with objcopy or linker script), but right now there are only three
modules which end up with non-empty .garbage section, namely oprofile,
fs/xfs and net/rds. So I wouldn't bother about it.

That macro should work for s390 as well, except that maybe Martin
doesn't need the "weak" attribute.
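
To illustrate (the expansion is written out by hand, so treat it as
approximate), a module doing "static DEFINE_PER_CPU(int, foo) = 1;" would
effectively compile as:

static __attribute__((__section__(".garbage"), __unused__))
	char __dummy__foo;			/* soaks up the "static" */
__attribute__((__section__(".data.percpu")))
	__weak __typeof__(int) per_cpu__foo = 1;	/* the real per-cpu variable */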

Ivan.

---
alpha: fix for static percpu variables

Work around 32-bit GP-relative addressing of local per-cpu variables
in modules. This is needed to make the dynamic per-cpu allocator
work on alpha.

Signed-off-by: Ivan Kokshaysky <[email protected]>
---
arch/alpha/include/asm/percpu.h | 74 ++++++---------------------------------
include/linux/percpu.h | 2 +
2 files changed, 13 insertions(+), 63 deletions(-)

diff --git a/arch/alpha/include/asm/percpu.h b/arch/alpha/include/asm/percpu.h
index 3495e8e..28f2fbb 100644
--- a/arch/alpha/include/asm/percpu.h
+++ b/arch/alpha/include/asm/percpu.h
@@ -1,78 +1,26 @@
#ifndef __ALPHA_PERCPU_H
#define __ALPHA_PERCPU_H
#include <linux/compiler.h>
-#include <linux/threads.h>

-/*
- * Determine the real variable name from the name visible in the
- * kernel sources.
- */
-#define per_cpu_var(var) per_cpu__##var
-
-#ifdef CONFIG_SMP
-
-/*
- * per_cpu_offset() is the offset that has to be added to a
- * percpu variable to get to the instance for a certain processor.
- */
-extern unsigned long __per_cpu_offset[NR_CPUS];
-
-#define per_cpu_offset(x) (__per_cpu_offset[x])
-
-#define __my_cpu_offset per_cpu_offset(raw_smp_processor_id())
-#ifdef CONFIG_DEBUG_PREEMPT
-#define my_cpu_offset per_cpu_offset(smp_processor_id())
-#else
-#define my_cpu_offset __my_cpu_offset
-#endif
-
-#ifndef MODULE
-#define SHIFT_PERCPU_PTR(var, offset) RELOC_HIDE(&per_cpu_var(var), (offset))
-#define PER_CPU_ATTRIBUTES
-#else
+#if defined(MODULE) && defined(CONFIG_SMP)
/*
* To calculate addresses of locally defined variables, GCC uses 32-bit
* displacement from the GP. Which doesn't work for per cpu variables in
* modules, as an offset to the kernel per cpu area is way above 4G.
*
* This forces allocation of a GOT entry for per cpu variable using
- * ldq instruction with a 'literal' relocation.
+ * "weak" attribute (as the compiler must assume an external reference);
+ * to make this work we have to neutralize possible "static" storage
+ * class specifier with a dummy variable.
*/
-#define SHIFT_PERCPU_PTR(var, offset) ({ \
- extern int simple_identifier_##var(void); \
- unsigned long __ptr, tmp_gp; \
- asm ( "br %1, 1f \n\
- 1: ldgp %1, 0(%1) \n\
- ldq %0, per_cpu__" #var"(%1)\t!literal" \
- : "=&r"(__ptr), "=&r"(tmp_gp)); \
- (typeof(&per_cpu_var(var)))(__ptr + (offset)); })
-
-#define PER_CPU_ATTRIBUTES __used
-
-#endif /* MODULE */
-
-/*
- * A percpu variable may point to a discarded regions. The following are
- * established ways to produce a usable pointer from the percpu variable
- * offset.
- */
-#define per_cpu(var, cpu) \
- (*SHIFT_PERCPU_PTR(var, per_cpu_offset(cpu)))
-#define __get_cpu_var(var) \
- (*SHIFT_PERCPU_PTR(var, my_cpu_offset))
-#define __raw_get_cpu_var(var) \
- (*SHIFT_PERCPU_PTR(var, __my_cpu_offset))
-
-#else /* ! SMP */
-
-#define per_cpu(var, cpu) (*((void)(cpu), &per_cpu_var(var)))
-#define __get_cpu_var(var) per_cpu_var(var)
-#define __raw_get_cpu_var(var) per_cpu_var(var)
-
-#define PER_CPU_ATTRIBUTES
+#define DEFINE_PER_CPU_SECTION(type, name, section) \
+ __attribute__((__section__(".garbage"), __unused__)) \
+ char __dummy__##name; \
+ __attribute__((__section__(PER_CPU_BASE_SECTION section))) \
+ __weak __typeof__(type) per_cpu__##name

-#endif /* SMP */
+#endif /* MODULE && SMP */

-#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu_var(name)
+#include <asm-generic/percpu.h>

#endif /* __ALPHA_PERCPU_H */
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index ee5615d..865f749 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -33,9 +33,11 @@

#endif

+#ifndef DEFINE_PER_CPU_SECTION
#define DEFINE_PER_CPU_SECTION(type, name, section) \
__attribute__((__section__(PER_CPU_BASE_SECTION section))) \
PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
+#endif

#define DEFINE_PER_CPU(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "")

2009-04-08 16:29:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> On Thu, 2 Apr 2009, Ingo Molnar wrote:
>
> > to quote an earlier part of my mail:
> >
> > > > We encourage kfree(NULL) and it triggers commonly in the
> > > > kernel today [on distro kernels we checked it can trigger
> > > > once per syscall!] - so i think we should consider
> > > > free_percpu(NULL) a possibly common pattern too. (even
> > > > though today it's likely _not_ common at all.)
> >
> > I specifically mentioned that it is not at all common now.
>
> What is this? Nonsense day? Consider it a common pattern although
> it's likely not common at all? April fools day?

Dude, this is a new facility freshly modernized and freshly made
usable. What did you expect, for a thousand usecases to pop up in the
kernel overnight? _None_ of this code is "common" today per se. (the
networking folks are working on making it more and more common
though)

> > But there's no reason why an object shutdown fastpath with an
> > optional percpu buffer (say for debug statistics, not enabled by
> > default) couldn't look like this:
> >
> > percpu_free(NULL);
> >
> > We actually have such patterns of kfree(ptr) use, where the _common_
> > case in a fastpath is kfree(NULL).
>
> Speculation. A shutdown fastpath? The percpu allocation and free
> operations are expensive and deal with teardown and setup of
> virtual mappings. Those paths are *not* optimized for fastpath
> use. kfree is different.

Of course a lot of this is speculation; dynamic percpu so far has
been a rarely used facility compared to kmalloc()/kfree(). If you
don't accept my analogy that's fine - but that is opinion against
opinion - while you state your opinion as a truism.

So my point remains: your patch had effects you clearly did not
anticipate, and the cacheline alignment management situation is not
nearly as clear-cut as you imagine it to be.

Unfortunately you failed to answer my detailed mail that made very
specific points; you only got into generalities and flames
about my summary mail - so it's hard to judge what your opinion
about those specific facts is, since you have not stated one.

Ingo

2009-04-08 17:06:31

by Tejun Heo

[permalink] [raw]
Subject: [tip:core/percpu] percpu: don't put the first chunk in reverse-map rbtree

Commit-ID: ae9e6bc9f74f8247cbca50a6a93c80e0d686fa19
Gitweb: http://git.kernel.org/tip/ae9e6bc9f74f8247cbca50a6a93c80e0d686fa19
Author: Tejun Heo <[email protected]>
AuthorDate: Thu, 2 Apr 2009 13:19:54 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 8 Apr 2009 18:31:30 +0200

percpu: don't put the first chunk in reverse-map rbtree

Impact: both first chunks don't use rbtree, no functional change

There can be two first chunks - reserved and dynamic - with the former
being optional. The dynamic first chunk was linked into the reverse-mapping
rbtree while the reserved one was mapped manually using the start
address and the reserved offset limit.

This patch makes both first chunks be looked up manually without
using the rbtree. This is to help with getting rid of the rbtree.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: [email protected]
Cc: Paul Mundt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Christoph Lameter <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
mm/percpu.c | 41 ++++++++++++++++++++++++-----------------
1 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 1aa5d8f..bf1bf1f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -110,9 +110,21 @@ static size_t pcpu_chunk_struct_size __read_mostly;
void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

-/* optional reserved chunk, only accessible for reserved allocations */
+/*
+ * The first chunk which always exists. Note that unlike other
+ * chunks, this one can be allocated and mapped in several different
+ * ways and thus often doesn't live in the vmalloc area.
+ */
+static struct pcpu_chunk *pcpu_first_chunk;
+
+/*
+ * Optional reserved chunk. This chunk reserves part of the first
+ * chunk and serves it for reserved allocations. The amount of
+ * reserved offset is in pcpu_reserved_chunk_limit. When reserved
+ * area doesn't exist, the following variables contain NULL and 0
+ * respectively.
+ */
static struct pcpu_chunk *pcpu_reserved_chunk;
-/* offset limit of the reserved chunk */
static int pcpu_reserved_chunk_limit;

/*
@@ -297,15 +309,16 @@ static struct rb_node **pcpu_chunk_rb_search(void *addr,
*/
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
+ void *first_start = pcpu_first_chunk->vm->addr;
struct rb_node *n, *parent;
struct pcpu_chunk *chunk;

- /* is it in the reserved chunk? */
- if (pcpu_reserved_chunk) {
- void *start = pcpu_reserved_chunk->vm->addr;
-
- if (addr >= start && addr < start + pcpu_reserved_chunk_limit)
+ /* is it in the first chunk? */
+ if (addr >= first_start && addr < first_start + pcpu_chunk_size) {
+ /* is it in the reserved area? */
+ if (addr < first_start + pcpu_reserved_chunk_limit)
return pcpu_reserved_chunk;
+ return pcpu_first_chunk;
}

/* nah... search the regular ones */
@@ -1147,7 +1160,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,

if (reserved_size) {
schunk->free_size = reserved_size;
- pcpu_reserved_chunk = schunk; /* not for dynamic alloc */
+ pcpu_reserved_chunk = schunk;
+ pcpu_reserved_chunk_limit = static_size + reserved_size;
} else {
schunk->free_size = dyn_size;
dyn_size = 0; /* dynamic area covered */
@@ -1158,8 +1172,6 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
if (schunk->free_size)
schunk->map[schunk->map_used++] = schunk->free_size;

- pcpu_reserved_chunk_limit = static_size + schunk->free_size;
-
/* init dynamic chunk if necessary */
if (dyn_size) {
dchunk = alloc_bootmem(sizeof(struct pcpu_chunk));
@@ -1226,13 +1238,8 @@ size_t __init pcpu_setup_first_chunk(pcpu_get_page_fn_t get_page_fn,
}

/* link the first chunk in */
- if (!dchunk) {
- pcpu_chunk_relocate(schunk, -1);
- pcpu_chunk_addr_insert(schunk);
- } else {
- pcpu_chunk_relocate(dchunk, -1);
- pcpu_chunk_addr_insert(dchunk);
- }
+ pcpu_first_chunk = dchunk ?: schunk;
+ pcpu_chunk_relocate(pcpu_first_chunk, -1);

/* we're done */
pcpu_base_addr = (void *)pcpu_chunk_addr(schunk, 0, 0);

Subject: [tip:core/percpu] percpu: remove rbtree and use page->index instead

Commit-ID: e1b9aa3f47242e757c776a3771bb6613e675bf9c
Gitweb: http://git.kernel.org/tip/e1b9aa3f47242e757c776a3771bb6613e675bf9c
Author: Christoph Lameter <[email protected]>
AuthorDate: Thu, 2 Apr 2009 13:21:44 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 8 Apr 2009 18:31:31 +0200

percpu: remove rbtree and use page->index instead

Impact: use page->index for addr to chunk mapping instead of dedicated rbtree

The rbtree is used to determine the chunk from the virtual address.
However, we can already determine the page struct from a virtual
address, and the page struct has several fields which are unused for
pages in the vmalloc area. Use the index field to store a pointer to
the chunk. Then there is no need for an rbtree anymore.

tj: * s/(set|get)_chunk/pcpu_\1_page_chunk/

* Dropped inline from the above two functions and moved them upwards
so that they are with other simple helpers.

* Initial pages might not (actually most of the time don't) live
in the vmalloc area. With the previous patch to manually
reverse-map both first chunks, this is no longer an issue.
Removed pcpu_set_chunk() call on initial pages.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: [email protected]
Cc: Paul Mundt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Nick Piggin <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
mm/percpu.c | 100 ++++++++++++-----------------------------------------------
1 files changed, 20 insertions(+), 80 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index bf1bf1f..c0b2c1a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -23,7 +23,7 @@
* Allocation is done in offset-size areas of single unit space. Ie,
* an area of 512 bytes at 6k in c1 occupies 512 bytes at 6k of c1:u0,
* c1:u1, c1:u2 and c1:u3. Percpu access can be done by configuring
- * percpu base registers UNIT_SIZE apart.
+ * percpu base registers pcpu_unit_size apart.
*
* There are usually many small percpu allocations many of them as
* small as 4 bytes. The allocator organizes chunks into lists
@@ -38,8 +38,8 @@
* region and negative allocated. Allocation inside a chunk is done
* by scanning this map sequentially and serving the first matching
* entry. This is mostly copied from the percpu_modalloc() allocator.
- * Chunks are also linked into a rb tree to ease address to chunk
- * mapping during free.
+ * Chunks can be determined from the address using the index field
+ * in the page struct. The index field contains a pointer to the chunk.
*
* To use this allocator, arch code should do the followings.
*
@@ -61,7 +61,6 @@
#include <linux/mutex.h>
#include <linux/percpu.h>
#include <linux/pfn.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
@@ -88,7 +87,6 @@

struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
- struct rb_node rb_node; /* key is chunk->vm->addr */
int free_size; /* free bytes in the chunk */
int contig_hint; /* max contiguous size hint */
struct vm_struct *vm; /* mapped vmalloc region */
@@ -133,7 +131,7 @@ static int pcpu_reserved_chunk_limit;
* There are two locks - pcpu_alloc_mutex and pcpu_lock. The former
* protects allocation/reclaim paths, chunks and chunk->page arrays.
* The latter is a spinlock and protects the index data structures -
- * chunk slots, rbtree, chunks and area maps in chunks.
+ * chunk slots, chunks and area maps in chunks.
*
* During allocation, pcpu_alloc_mutex is kept locked all the time and
* pcpu_lock is grabbed and released as necessary. All actual memory
@@ -152,7 +150,6 @@ static DEFINE_MUTEX(pcpu_alloc_mutex); /* protects whole alloc and reclaim */
static DEFINE_SPINLOCK(pcpu_lock); /* protects index data structures */

static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
-static struct rb_root pcpu_addr_root = RB_ROOT; /* chunks by address */

/* reclaim work to release fully free chunks, scheduled from free path */
static void pcpu_reclaim(struct work_struct *work);
@@ -203,6 +200,18 @@ static bool pcpu_chunk_page_occupied(struct pcpu_chunk *chunk,
return *pcpu_chunk_pagep(chunk, 0, page_idx) != NULL;
}

+/* set the pointer to a chunk in a page struct */
+static void pcpu_set_page_chunk(struct page *page, struct pcpu_chunk *pcpu)
+{
+ page->index = (unsigned long)pcpu;
+}
+
+/* obtain pointer to a chunk from a page struct */
+static struct pcpu_chunk *pcpu_get_page_chunk(struct page *page)
+{
+ return (struct pcpu_chunk *)page->index;
+}
+
/**
* pcpu_mem_alloc - allocate memory
* @size: bytes to allocate
@@ -269,40 +278,9 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
}
}

-static struct rb_node **pcpu_chunk_rb_search(void *addr,
- struct rb_node **parentp)
-{
- struct rb_node **p = &pcpu_addr_root.rb_node;
- struct rb_node *parent = NULL;
- struct pcpu_chunk *chunk;
-
- while (*p) {
- parent = *p;
- chunk = rb_entry(parent, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr)
- p = &(*p)->rb_left;
- else if (addr > chunk->vm->addr)
- p = &(*p)->rb_right;
- else
- break;
- }
-
- if (parentp)
- *parentp = parent;
- return p;
-}
-
/**
- * pcpu_chunk_addr_search - search for chunk containing specified address
- * @addr: address to search for
- *
- * Look for chunk which might contain @addr. More specifically, it
- * searchs for the chunk with the highest start address which isn't
- * beyond @addr.
- *
- * CONTEXT:
- * pcpu_lock.
+ * pcpu_chunk_addr_search - determine chunk containing specified address
+ * @addr: address for which the chunk needs to be determined.
*
* RETURNS:
* The address of the found chunk.
@@ -310,8 +288,6 @@ static struct rb_node **pcpu_chunk_rb_search(void *addr,
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
{
void *first_start = pcpu_first_chunk->vm->addr;
- struct rb_node *n, *parent;
- struct pcpu_chunk *chunk;

/* is it in the first chunk? */
if (addr >= first_start && addr < first_start + pcpu_chunk_size) {
@@ -321,42 +297,7 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
return pcpu_first_chunk;
}

- /* nah... search the regular ones */
- n = *pcpu_chunk_rb_search(addr, &parent);
- if (!n) {
- /* no exactly matching chunk, the parent is the closest */
- n = parent;
- BUG_ON(!n);
- }
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
-
- if (addr < chunk->vm->addr) {
- /* the parent was the next one, look for the previous one */
- n = rb_prev(n);
- BUG_ON(!n);
- chunk = rb_entry(n, struct pcpu_chunk, rb_node);
- }
-
- return chunk;
-}
-
-/**
- * pcpu_chunk_addr_insert - insert chunk into address rb tree
- * @new: chunk to insert
- *
- * Insert @new into address rb tree.
- *
- * CONTEXT:
- * pcpu_lock.
- */
-static void pcpu_chunk_addr_insert(struct pcpu_chunk *new)
-{
- struct rb_node **p, *parent;
-
- p = pcpu_chunk_rb_search(new->vm->addr, &parent);
- BUG_ON(*p);
- rb_link_node(&new->rb_node, parent, p);
- rb_insert_color(&new->rb_node, &pcpu_addr_root);
+ return pcpu_get_page_chunk(vmalloc_to_page(addr));
}

/**
@@ -768,6 +709,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int off, int size)
alloc_mask, 0);
if (!*pagep)
goto err;
+ pcpu_set_page_chunk(*pagep, chunk);
}
}

@@ -892,7 +834,6 @@ restart:

spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
- pcpu_chunk_addr_insert(chunk);
goto restart;

area_found:
@@ -981,7 +922,6 @@ static void pcpu_reclaim(struct work_struct *work)
if (chunk == list_first_entry(head, struct pcpu_chunk, list))
continue;

- rb_erase(&chunk->rb_node, &pcpu_addr_root);
list_move(&chunk->list, &todo);
}

2009-04-08 20:20:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello,

Ivan Kokshaysky wrote:
> On Fri, Apr 03, 2009 at 09:31:10AM +0900, Tejun Heo wrote:
>> Can STATIC_DEFINE_PER_CPU() be made to work? It's not pretty but if
>> that's the only sensible way to reach uniform static/dynamic handling,
>> I suppose we can ignore the slight ugliness.
>
> Well, I've got a workaround that has zero impact on common code (patch
> appended).
>
> The idea is that we can effectively discard the "static" specifier
> by declaring a dummy variable right before actual per-cpu variable
> definition:
>
> #define DEFINE_PER_CPU_SECTION(type, name, section) \
> __attribute__((__section__(".garbage"), __unused__)) \
> char __dummy__##name; \
> __attribute__((__section__(PER_CPU_BASE_SECTION section))) \
> __weak __typeof__(type) per_cpu__##name

Nice. I think .discard would suit the dummy variable better tho
(x86 discards that section and it seems to be a more common name, with
ldscripts using the /DISCARD/ identifier for things to throw away). Also,
it seems like if two different files happen to use static percpu
variables with the same name, they would end up sharing it, right?
That looks a tad bit dangerous.

Thanks.

--
tejun

2009-04-09 09:47:38

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, Apr 08, 2009 at 01:18:23PM -0700, Tejun Heo wrote:
> > #define DEFINE_PER_CPU_SECTION(type, name, section) \
> > __attribute__((__section__(".garbage"), __unused__)) \
> > char __dummy__##name; \
> > __attribute__((__section__(PER_CPU_BASE_SECTION section))) \
> > __weak __typeof__(type) per_cpu__##name
>
> Nice. I think .discard would suit the dummy variable better tho
> (x86 discards that section and it seems to be a more common name, with
> ldscripts using the /DISCARD/ identifier for things to throw away). Also,

Indeed, .discard sounds better.

> it seems like if two different files happen to use static percpu
> variables with the same name, they would end up sharing it, right?
> That looks a tad bit dangerous.

True. One possible solution is to add another dummy variable,
like this:

#define DEFINE_PER_CPU_SECTION(type, name, section) \
__attribute__((__section__(".discard"), __unused__)) \
char __dummy__##name; \
+ __attribute__((__section__(".discard"))) \
+ char __per_cpu_multiple_def__##name; \
__attribute__((__section__(PER_CPU_BASE_SECTION section))) \
__weak __typeof__(type) per_cpu__##name

so in that situation we'll get a link failure.
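
To make the effect concrete (the file and variable names below are made
up): with the weak trick alone the two definitions would silently merge
into a single per_cpu__hits; with the extra non-weak dummy they instead
collide at link time.

/* drivers/foo/a.c */
static DEFINE_PER_CPU(unsigned long, hits);

/* drivers/foo/b.c */
static DEFINE_PER_CPU(unsigned long, hits);
/* ld: multiple definition of `__per_cpu_multiple_def__hits' */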

Ivan.

2009-04-09 11:55:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Hello, Ivan.

Ivan Kokshaysky wrote:
>> it seems like if two different files happen to use static percpu
>> variables with the same name, they would end up sharing it, right?
>> That looks a tad bit dangerous.
>
> True. One of possible solutions is to add another dummy variable,
> like this:
>
> #define DEFINE_PER_CPU_SECTION(type, name, section) \
> __attribute__((__section__(".discard"), __unused__)) \
> char __dummy__##name; \
> + __attribute__((__section__(".discard"))) \
> + char __per_cpu_multiple_def__##name; \
> __attribute__((__section__(PER_CPU_BASE_SECTION section))) \
> __weak __typeof__(type) per_cpu__##name
>
> so in that situation we'll get a link failure.

Hmmm... yeah, I agree not allowing common static names is the lesser
poison here. It's generally a good idea to use uniquely
distinguishable identifiers for static symbols anyway to help debugging.
If this limitation is acceptable, I think we should also add the dup
build failure thing to the generic definition too tho so that such
cases can be discovered before they hit alpha and s390 later.

Any objections?

Thanks.

--
tejun

2009-04-11 01:38:32

by Rusty Russell

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

On Thu, 9 Apr 2009 09:23:25 pm Tejun Heo wrote:
> It's generally a good idea to use uniquely
> distinguishable identifiers for static symbols anyway to help debugging.

Sorry, I can't let this statement stand. It's completely wrong: use the
shortest clear name, always.

> If this limitation is acceptable, I think we should also add the dup
> build failure thing to the generic definition too tho so that such
> cases can be discovered before they hit alpha and s390 later.
>
> Any objections?

Yes. If we decide that static per-cpu is unsupportable, let's not hide the
damn thing. We just make it give a compile warning if we can, patch out the
current cases, and make checkpatch.pl warn on new ones.

Don't silently override "static". Don't come up with stupid justifications.
Accept with open eyes that it's evil, just a lesser evil.

Thanks,
Rusty.

2009-04-11 01:54:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [GIT RFC] percpu: use dynamic percpu allocator as the default percpu allocator

Rusty Russell wrote:
> On Thu, 9 Apr 2009 09:23:25 pm Tejun Heo wrote:
>> It's generally a good idea to use uniquely
>> distinguishable identifiers for static symbols anyway to help debugging.
>
> Sorry, I can't let this statement stand. It's completely wrong: use the
> shortest clear name, always.

Heh... well, yeah, I don't think this is something everyone can agree
about. I generally like having a proper prefix for static symbols. It
makes symbols more consistent too.

>> If this limitation is acceptable, I think we should also add the dup
>> build failure thing to the generic definition too tho so that such
>> cases can be discovered before they hit alpha and s390 later.
>>
>> Any objections?
>
> Yes. If we decide that static per-cpu is unsupportable, let's not
> hide the damn thing. We just make it give a compile warning if we
> can, patch out the current cases, and make checkpatch.pl warn on new
> ones.
>
> Don't silently override "static". Don't come up with stupid
> justifications. Accept with open-eyes that it's evil, just a lesser
> evil.

It's different tho. With the right tricks, we can still make percpu
variables defined static unusable from other compile units, i.e. if we
make the combination of DECLARE_PER_CPU() and static DEFINE_PER_CPU()
trigger a compile error, most of what the programmer meant by 'static'
can be achieved whether the symbol itself ends up visible outside of
the compile unit or not.
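
One conceivable way to get that compile error (a sketch only; the helper
symbol and macro names are made up, not an existing interface) is to have
the declaration and definition macros emit a small companion symbol whose
storage classes then have to agree:

#define DECLARE_PER_CPU_CHECKED(type, name)				\
	extern char __pcpu_scope_##name;				\
	extern __typeof__(type) per_cpu__##name

#define DEFINE_PER_CPU_CHECKED(type, name)				\
	char __pcpu_scope_##name;					\
	__attribute__((__section__(".data.percpu")))			\
	__typeof__(type) per_cpu__##name

/* DECLARE_PER_CPU_CHECKED(int, foo) in a header followed by
 * "static DEFINE_PER_CPU_CHECKED(int, foo);" in a .c file fails to build:
 * the leading "static" binds to __pcpu_scope_foo, and a static declaration
 * following the earlier extern one is a hard error. */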

Thanks.

--
tejun

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Wed, 8 Apr 2009, Ingo Molnar wrote:

> Dude, this is a new facility freshly modernized and freshly made
> usable. What did you expect, for a thousand usecases pop up in the
> kernel overnight? _None_ of this code is "common" today per se. (the
> networking folks are working on making it more and more common
> though)

?? kfree(NULL) has been allowed for years. None of this is new.

> > Speculation. A shutdown fastpath? The percpu allocation and free
> > operations are expensive and deal with teardown and setup of
> > virtual mappings. Those paths are *not* optimized for fastpath
> > use. kfree is different.
>
> Of course a lot of this is speculation, dynamic percpu so far has
> been a rarely used facility compared to kmalloc()/kfree(). If you
> dont accept my analogy that's fine - but that is opinion against
> opinion - while you state you opinion as truism.


Please look at the kernel source for the use of percpu_free and
percpu_alloc.

> So my point remains: your patch had effects you clearly did not
> anticipate, and the cacheline alignment management situation is not
> nearly as clear-cut as you imagine it to be.

There was no effect that I did not anticipate. Just imagination on your
part that percpu_free is used like kfree.

Again: the frequent insertion of __read_mostly will
destroy the purpose and function of the read-mostly areas!

> Unfortunately you failed to answer my detailed mail that made very
> specific points though, you only got into generalities and flames
> about my summary mail - so it's hard to judge what your opinion
> about those specific facts is - you have not stated one.

I was pretty clear on those points. Not sure what material question I did
not answer. What you say here fits your posts. Generalizing from kfree to
percpu_free etc...

2009-04-14 14:06:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> On Wed, 8 Apr 2009, Ingo Molnar wrote:
>
> > Dude, this is a new facility freshly modernized and freshly made
> > usable. What did you expect, for a thousand usecases to pop up in the
> > kernel overnight? _None_ of this code is "common" today per se. (the
> > networking folks are working on making it more and more common
> > though)
>
> ?? kfree(NULL) has been allowed for years. None of this is new.

[ This reply of yours is completely inapposite. It does not fit into
the logical stream of arguments at all. Of course kfree(NULL) has
been allowed for years. Did i claim otherwise? I don't think you
understand my arguments - and i get the impression that you don't
even _try_ to understand them. ]

The thing is, i spent well in excess of an hour analyzing your
patch, counting cachelines, looking at effects and interactions,
thinking about the various implications. I came up with a good deal
of factoids, a handful of suggestions and a few summary paragraphs:

http://marc.info/?l=linux-kernel&m=123862536011780&w=2

A proper reply to that work would be one of several responses:

1) to ignore it. (you can always do that, you dont have to react
to everything on lkml - especially if you think it's bull.)

2) disagree with the factoids - preferably in a specific way.

3) agree with the factoids and disagree with my opinion.

4) agree with it all.

You did none of these: you never replied to my detailed analysis,
you only replied to my followup summary - disagreeing with my
opinion based not on a fair deconstruction of my factoids but on a
mere repetition of your arguments.

Furthermore, you also tried to 'win' this argument by increasing the
volume of shouting, by injecting unprovoked insults and by using a
patronizing and irritated tone.

You might be completely right in the end technically (i fully submit
that the discussion is open-ended), but this kind of generic
handwaving and your asocial behavior in this thread do not really
do your technical arguments any service. It can only really end in
me starting to ignore you.

Ingo

Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator

On Tue, 14 Apr 2009, Ingo Molnar wrote:

> The thing is, i spent well in excess of an hour analyzing your
> patch, counting cachelines, looking at effects and interactions,
> thinking about the various implications. I came up with a good deal
> of factoids, a handful of suggestions and a few summary paragraphs:
>
> http://marc.info/?l=linux-kernel&m=123862536011780&w=2

Good work.

> A proper reply to that work would be one of several responses:
>

...

>
> 3) agree with the factoids and disagree with my opinion.

Yep, I thought that's what I did...

> Furthermore, you also tried to 'win' this argument by increasing the
> volume of shouting, by injecting unprovoked insults and by using a
> patronizing and irritated tone.

Win an argument? I want to restore the function of __read_mostly to what
it was intended to be. It seems that you wanted instead to tag all
sorts of variables in the kernel with labels that show their usage pattern.

2009-04-14 17:16:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH UPDATED] percpu: use dynamic percpu allocator as the default percpu allocator


* Christoph Lameter <[email protected]> wrote:

> On Tue, 14 Apr 2009, Ingo Molnar wrote:
>
> > The thing is, i spent well in excess of an hour analyzing your
> > patch, counting cachelines, looking at effects and interactions,
> > thinking about the various implications. I came up with a good deal
> > of factoids, a handful of suggestions and a few summary paragraphs:
> >
> > http://marc.info/?l=linux-kernel&m=123862536011780&w=2
>
> Good work.
>
> > A proper reply to that work would be one of several responses:
> >
>
> ...
>
> >
> > 3) agree with the factoids and disagree with my opinion.
>
> Yep I thought that what I did...

Ok, thanks ... i never saw a reply from you to that
mail, so i couldn't assume you had.

There are really 3 key observations in that mail - let me sum
them up in order of importance.

1)

I'm wondering what your take on the bss+data suggestion is.
To me it seems tempting to merge them into a single
per .o section: it clearly wins us locality of reference.

It seems such an obvious thing to do on a modern SMP kernel - has
anyone tried that in the past?

Instead of:

.data1 .data2 .data3 .... .bss1 .bss2 .bss3

we'd have:

.databss1 .databss2 .databss3

This is clearly better compressed, and the layout is easier
to control in the .c file. We could also do tricks to
further compress data here: we could put variables right
after their __aligned__ locks - while currently they are in
the .bss wasting a full cache-line.

In the example i analyzed it would reduce the cache
footprint by one cacheline. This would apply to most .o's so
the combined effect on cache locality would be significant.

[ Another (sub-)advantage would be that it 'linearizes' and
hence properly colors the per .o module variable layout.
With an artificially split .data1 .bss1 the offset between
them is random, and it's harder to control the cache port
positions of closely related variables. ]
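
The lock-plus-data packing mentioned above might look something like the
sketch below (an illustrative structure, not real kernel code): with the
merged layout the counters can sit in the same cacheline as their lock
instead of landing in .bss a random distance away.

#include <linux/cache.h>
#include <linux/spinlock.h>

struct stats_bucket {
	spinlock_t	lock;		/* aligned to a cacheline boundary */
	unsigned long	hits;		/* shares the lock's cacheline */
	unsigned long	misses;
} ____cacheline_aligned_in_smp;

static struct stats_bucket bucket = {
	.lock = __SPIN_LOCK_UNLOCKED(bucket.lock),
};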

2)

Aligning (the now merged) data+bss per .o section on
cacheline boundary [up to 64 byte cacheline sizes or so]
sounds tempting as well - it eliminates accidental "tail
bites the next head" type of cross-object-file interactions.

The price is an estimated 3% blow-up in combined .data+bss
size. I suspect a patch and measurements would settle this
pretty neatly.

3)

The free_percpu() collateral-damage argument i made was
pretty speculative (and artificial as well - the allocation
of percpu resources is very global in nature, so a
kfree(NULL)-like fastpath is harder to imagine) - i tried
at all costs to demonstrate my point based on that narrow
example alone.

Thanks,

Ingo