2005-01-13 08:32:52

by Ravikiran G Thirumalai

Subject: [patch] mm: Reimplementation of dynamic percpu memory allocator

Hi Andrew,
Could you consider this for inclusion in the mm tree? The patch has been
tested in user space and kernel space. Manfred seems to like the fact that
the allocator doesn't depend on slab, so that it can be used for slab's
internal head arrays. He had questions about fragmentation, which I have
answered.

Patch follows

Thanks,
Kiran


The following patch re-implements the linux dynamic percpu memory allocator
so that:
1. Percpu memory dereference is faster
- One less memory reference compared to existing simple alloc_percpu
- As fast as with static percpu areas, one mem ref less actually.
2. Better memory usage
- Doesn't need a NR_CPUS pointer array for each allocation
- Interlaces objects making better utilization of memory/cachelines
- Userspace tests show 98% utilization with random sized allocations
after repeated random frees
3. Provides truly node local allocation
- The percpu memory with existing alloc_percpu does node local
allocation, but the NR_CPUS place holder is not node local. This
problem doesn't exist with the new implementation.
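
To make point 1 concrete, the dereference changes roughly as follows
(simplified from the old include/linux/percpu.h macro and the new one in
the patch below):

	/* old: extra memory load through the per-cpu pointer array */
	cpu_copy = ((struct percpu_data *)~(unsigned long)ptr)->ptrs[cpu];

	/* new: pure pointer arithmetic, no extra load */
	cpu_copy = (void *)ptr + cpu * PCPU_BLKSIZE;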

Design:
We have "blocks" of memory akin to slabs. Each block has
(percpu blocksize) * NR_CPUS + (block management data structures) of kernel
VA space allocated to it. Node local pages are allocated and mapped
against the corresponding cpus' VA space. Pages are allocated for block
management data structures and mapped to their corresponding VA. These
reside at (percpu blocksize) * NR_CPUS offset from the beginning of the block.
The allocator maintains a circular linked list of blocks, sorted in
descending order of block utilization. On requests for objects, allocator
tries to serve the request from the most utilized block.
The allocator allocates memory in multiples of a fixed currency size for
a request. The allocator returns address of the percpu
object corresponding to cpu0. The cpu local variable for any given cpu
can be obtained by simple arithmetic:
obj_address + cpu_id * PCPU_BLKSIZE.
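
For illustration, a minimal usage sketch of this interface (per_cpu_ptr()
as redefined in the patch below; error handling omitted):

	long *counters = alloc_percpu(long);	/* zeroed; address of cpu0's copy */
	int cpu = get_cpu();			/* pin this cpu while touching its copy */

	/* cpu's copy sits at a fixed stride from cpu0's copy, i.e.
	 * (long *)((char *)counters + cpu * PCPU_BLKSIZE) */
	(*per_cpu_ptr(counters, cpu))++;
	put_cpu();

	free_percpu(counters);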

Testing:
The block allocator has undergone some userspace stress testing with small
counter sized objects as well as a large number of random sized objects.

Changes:
- Cleaned up userspace compatibility bits
- Fixed bufctl corruption due to bad arithmetic in free_bufctl

Signed-off-by: Ravikiran Thirumalai <[email protected]>
---

include/linux/kernel.h | 2
include/linux/percpu.h | 18 -
mm/Makefile | 1
mm/percpu.c | 703 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/slab.c | 70 ----
5 files changed, 713 insertions(+), 81 deletions(-)

diff -ruN -X dontdiff2 linux-2.6.10-mm2/include/linux/kernel.h alloc_percpu-2.6.10-mm2/include/linux/kernel.h
--- linux-2.6.10-mm2/include/linux/kernel.h 2005-01-12 17:39:30.000000000 +0530
+++ alloc_percpu-2.6.10-mm2/include/linux/kernel.h 2005-01-12 17:40:13.000000000 +0530
@@ -29,6 +29,8 @@

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1))
+#define IS_ALIGNED(x,a) (!(((a) - 1) & (x)))
+#define IS_POWEROFTWO(x) (!(((x) - 1) & (x)))

#define KERN_EMERG "<0>" /* system is unusable */
#define KERN_ALERT "<1>" /* action must be taken immediately */
diff -ruN -X dontdiff2 linux-2.6.10-mm2/include/linux/percpu.h alloc_percpu-2.6.10-mm2/include/linux/percpu.h
--- linux-2.6.10-mm2/include/linux/percpu.h 2004-12-25 03:05:28.000000000 +0530
+++ alloc_percpu-2.6.10-mm2/include/linux/percpu.h 2005-01-12 17:40:13.000000000 +0530
@@ -15,23 +15,19 @@
#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()

-#ifdef CONFIG_SMP
-
-struct percpu_data {
- void *ptrs[NR_CPUS];
- void *blkp;
-};
+/* This is the upper bound for an object using alloc_percpu */
+#define PCPU_BLKSIZE (PAGE_SIZE << 1)
+#define PCPU_CURR_SIZE (sizeof (void *))

+#ifdef CONFIG_SMP
/*
* Use this to get to a cpu's version of the per-cpu object allocated using
* alloc_percpu. Non-atomic access to the current CPU's version should
* probably be combined with get_cpu()/put_cpu().
*/
#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
+ ((__typeof__(ptr)) \
+ (RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))

extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(const void *);
@@ -56,6 +52,6 @@

/* Simple wrapper for the common case: zeros memory. */
#define alloc_percpu(type) \
- ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))
+ ((type *)(__alloc_percpu(ALIGN(sizeof (type), PCPU_CURR_SIZE), __alignof__(type))))

#endif /* __LINUX_PERCPU_H */
diff -ruN -X dontdiff2 linux-2.6.10-mm2/mm/Makefile alloc_percpu-2.6.10-mm2/mm/Makefile
--- linux-2.6.10-mm2/mm/Makefile 2005-01-12 17:39:32.000000000 +0530
+++ alloc_percpu-2.6.10-mm2/mm/Makefile 2005-01-12 17:40:13.000000000 +0530
@@ -17,4 +17,5 @@
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
+obj-$(CONFIG_SMP) += percpu.o

diff -ruN -X dontdiff2 linux-2.6.10-mm2/mm/percpu.c alloc_percpu-2.6.10-mm2/mm/percpu.c
--- linux-2.6.10-mm2/mm/percpu.c 1970-01-01 05:30:00.000000000 +0530
+++ alloc_percpu-2.6.10-mm2/mm/percpu.c 2005-01-12 17:40:13.000000000 +0530
@@ -0,0 +1,703 @@
+/*
+ * Dynamic percpu memory allocator.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2003
+ *
+ * Author: Ravikiran Thirumalai <[email protected]>
+ *
+ * Originally by Dipankar Sarma and Ravikiran Thirumalai,
+ * This reimplements alloc_percpu to make it
+ * 1. Independent of slab/kmalloc
+ * 2. Use node local memory
+ * 3. Use simple pointer arithmetic
+ * 4. Minimise fragmentation.
+ *
+ * Allocator is slow -- expected to be called during module/subsystem
+ * init. alloc_percpu can block.
+ *
+ */
+
+#include <linux/percpu.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <asm/semaphore.h>
+#include <asm/pgtable.h>
+#include <asm/hardirq.h>
+
+#define MAX_OBJSIZE PCPU_BLKSIZE
+#define OBJS_PER_BLOCK (PCPU_BLKSIZE/PCPU_CURR_SIZE)
+#define BITMAP_ARR_SIZE (OBJS_PER_BLOCK/(sizeof (unsigned long) * 8))
+#define MAX_NR_BITS (OBJS_PER_BLOCK)
+#define PCPUPAGES_PER_BLOCK ((PCPU_BLKSIZE >> PAGE_SHIFT) * NR_CPUS)
+
+/* Block descriptor */
+struct pcpu_block {
+ void *start_addr;
+ struct page *pages[PCPUPAGES_PER_BLOCK * 2]; /* Extra for block mgt */
+ struct list_head blklist;
+ unsigned long bitmap[BITMAP_ARR_SIZE]; /* Object Freelist */
+ int bufctl_fl[OBJS_PER_BLOCK]; /* bufctl_fl freelist */
+ int bufctl_fl_head;
+ unsigned int size_used;
+};
+
+#define BLK_SIZE_USED(listpos) (list_entry(listpos, \
+ struct pcpu_block, blklist)->size_used)
+
+/* Block list maintenance */
+
+/* Ordered list of pcpu_blocks -- Full, partial first */
+static struct list_head blkhead = LIST_HEAD_INIT(blkhead);
+static struct list_head *firstnotfull = &blkhead;
+static DECLARE_MUTEX(blklist_lock);
+
+/*
+ * Bufctl descriptor and bufctl list for all allocated objs...
+ * Having one list for all buffers in the allocator might not be very efficient,
+ * but we are not expecting allocs and frees in the fast path (only during
+ * module load and unload, hopefully).
+ */
+struct buf_ctl {
+ void *addr;
+ size_t size;
+ struct buf_ctl *next;
+};
+
+static struct buf_ctl *buf_head = NULL;
+
+#define BLOCK_MANAGEMENT_SIZE \
+({ \
+ int extra = sizeof (struct buf_ctl)*OBJS_PER_BLOCK \
+ + sizeof (struct pcpu_block); \
+ ALIGN(extra, PAGE_SIZE); \
+})
+
+#define BLOCK_MANAGEMENT_PAGES (BLOCK_MANAGEMENT_SIZE >> PAGE_SHIFT)
+
+void init_pcpu_block(struct pcpu_block *blkp)
+{
+ int i;
+ memset(blkp, 0, sizeof (struct pcpu_block)); /* Delme --for US only */
+
+ /* Setup the freelist */
+ blkp->bufctl_fl_head = 0;
+ for (i = 0; i < OBJS_PER_BLOCK-1; i++)
+ blkp->bufctl_fl[i] = i+1;
+ blkp->bufctl_fl[i] = -1; /* Sentinel to mark End of list */
+}
+
+/*
+ * Allocate PCPU_BLKSIZE * NR_CPUS + BLOCK_MANAGEMENT_SIZE worth of
+ * contiguous kva space, and PCPU_BLKSIZE amount of node local
+ * memory (pages) for all cpus possible + BLOCK_MANAGEMENT_SIZE pages
+ */
+static void *
+valloc_percpu(void)
+{
+ int i,j = 0;
+ unsigned int nr_pages;
+ struct vm_struct *area, tmp;
+ struct page **tmppage;
+ struct page *pages[BLOCK_MANAGEMENT_PAGES];
+ unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
+ struct pcpu_block *blkp = NULL;
+
+ BUG_ON(!IS_ALIGNED(PCPU_BLKSIZE, PAGE_SIZE));
+ BUG_ON(!PCPU_BLKSIZE);
+ nr_pages = PCPUPAGES_PER_BLOCK + BLOCK_MANAGEMENT_PAGES;
+
+ /* Alloc Managent block pages */
+ for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
+ pages[i] = alloc_pages(GFP_KERNEL, 0);
+ if (!pages[i]) {
+ while ( --i >= 0 )
+ __free_pages(pages[i], 0);
+ return NULL;
+ }
+ /* Zero the alloced page */
+ clear_page(page_address(pages[i]));
+ }
+
+ /* Get the contiguous VA space for this block */
+ area = get_vm_area(nr_pages << PAGE_SHIFT, VM_MAP);
+ if (!area)
+ goto rollback_mgt;
+
+ /* Map pages for the block management pages */
+ tmppage = pages;
+ tmp.addr = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ if (map_vm_area(&tmp, PAGE_KERNEL, &tmppage))
+ goto rollback_vm_area;
+
+ /* Init the block descriptor */
+ blkp = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ init_pcpu_block(blkp);
+ for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ blkp->pages[i+PCPUPAGES_PER_BLOCK] = pages[i];
+
+ /* Alloc node local pages for all cpus possible */
+ for (i = 0; i < NR_CPUS; i++) {
+ if (cpu_possible(i)) {
+ int start_idx = i * cpu_pages;
+ for (j = start_idx; j < start_idx + cpu_pages; j++) {
+ blkp->pages[j] = alloc_pages_node(cpu_to_node(i)
+ ,GFP_KERNEL | __GFP_HIGHMEM,
+ 0);
+ if (unlikely(!blkp->pages[j]))
+ goto rollback_pages;
+ }
+ }
+ }
+
+ /* Map pages for each cpu by splitting vm_struct for each cpu */
+ for (i = 0; i < NR_CPUS; i++) {
+ if (cpu_possible(i)) {
+ tmppage = &blkp->pages[i*cpu_pages];
+ tmp.addr = area->addr + i * PCPU_BLKSIZE;
+ /* map_vm_area assumes a guard page of size PAGE_SIZE */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ if (map_vm_area(&tmp, PAGE_KERNEL, &tmppage))
+ goto fail_map;
+ }
+ }
+
+ return area->addr;
+
+fail_map:
+ i--;
+ for (; i >= 0; i--) {
+ if (cpu_possible(i)) {
+ tmp.addr = area->addr + i * PCPU_BLKSIZE;
+ /* we've mapped a guard page extra earlier... */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+ }
+ }
+
+ /* set i and j with proper values for the roll back at fail: */
+ i = NR_CPUS - 1;
+ j = PCPUPAGES_PER_BLOCK;
+
+rollback_pages:
+ j--;
+ for (; j >= 0; j--)
+ if (cpu_possible(j/cpu_pages))
+ __free_pages(blkp->pages[j], 0);
+
+ /* Unmap block management */
+ tmp.addr = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+
+rollback_vm_area:
+ /* Give back the contiguous mem area */
+ area = remove_vm_area(area->addr);
+ BUG_ON(!area);
+
+rollback_mgt:
+
+ /* Free the block management pages */
+ for (i = 0 ; i < BLOCK_MANAGEMENT_PAGES; i++)
+ __free_pages(pages[i], 0);
+
+ return NULL;
+}
+
+/* Free memory block allocated by valloc_percpu */
+static void
+vfree_percpu(void *addr)
+{
+ int i;
+ struct pcpu_block *blkp = addr + PCPUPAGES_PER_BLOCK * PAGE_SIZE;
+ struct vm_struct *area, tmp;
+ unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
+ struct page *pages[BLOCK_MANAGEMENT_PAGES];
+
+ /* Backup the block management struct pages */
+ for (i=0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ pages[i] = blkp->pages[i+PCPUPAGES_PER_BLOCK];
+
+ /* Unmap all cpu_pages from the block's vm space */
+ for (i = 0; i < NR_CPUS; i++) {
+ if (cpu_possible(i)) {
+ tmp.addr = addr + i * PCPU_BLKSIZE;
+ /* We've mapped a guard page extra earlier */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+ }
+ }
+
+ /* Give back all allocated pages */
+ for (i = 0; i < PCPUPAGES_PER_BLOCK; i++) {
+ if (cpu_possible(i/cpu_pages))
+ __free_pages(blkp->pages[i], 0);
+ }
+
+ /* Unmap block management pages */
+ tmp.addr = addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+
+ /* Free block management pages */
+ for (i=0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ __free_pages(pages[i], 0);
+
+ /* Give back vm area for this block */
+ area = remove_vm_area(addr);
+ BUG_ON(!area);
+
+}
+
+static int
+add_percpu_block(void)
+{
+ struct pcpu_block *blkp;
+ void *start_addr;
+
+ start_addr = valloc_percpu();
+ if (!start_addr)
+ return 0;
+ blkp = start_addr + PCPUPAGES_PER_BLOCK * PAGE_SIZE;
+ blkp->start_addr = start_addr;
+ down(&blklist_lock);
+ list_add_tail(&blkp->blklist, &blkhead);
+ if (firstnotfull == &blkhead)
+ firstnotfull = &blkp->blklist;
+ up(&blklist_lock);
+
+ return 1;
+}
+
+struct obj_map_elmt {
+ int startbit;
+ int obj_size;
+};
+
+/* Fill the array with obj map info and return no of elements in the array */
+static int
+make_obj_map(struct obj_map_elmt arr[], struct pcpu_block *blkp)
+{
+ int nr_elements = 0;
+ int i, j, obj_size;
+
+ for (i = 0, j = 0; i < MAX_NR_BITS; i++) {
+ if (!test_bit(i, blkp->bitmap)) {
+ /* Free block start */
+ arr[j].startbit = i;
+ nr_elements++;
+ obj_size = 1;
+ i++;
+ while (i < MAX_NR_BITS && (!test_bit(i, blkp->bitmap))) {
+ i++;
+ obj_size++;
+ }
+ arr[j].obj_size = obj_size * PCPU_CURR_SIZE;
+ j++;
+ }
+ }
+
+ return nr_elements;
+}
+
+/* Sort obj_map array in ascending order -- simple bubble sort */
+static void
+sort_obj_map(struct obj_map_elmt map[], int nr)
+{
+ int i, j, k;
+
+ for (i = 0; i < nr - 1; i++) {
+ k = i;
+
+ for (j = k + 1; j < nr; j++)
+ if (map[j].obj_size < map[k].obj_size)
+ k = j;
+ if (k != i) {
+ struct obj_map_elmt tmp;
+ tmp = map[i];
+ map[i] = map[k];
+ map[k] = tmp;
+ }
+ }
+}
+
+/* Add bufctl to list of bufctl */
+static void
+add_bufctl(struct buf_ctl *bufp)
+{
+ if (buf_head == NULL)
+ buf_head = bufp;
+ else {
+ bufp->next = buf_head;
+ buf_head = bufp;
+ }
+}
+
+/* After you alloc from a block, It can only go up the ordered list */
+static void
+sort_blk_list_up(struct pcpu_block *blkp)
+{
+ struct list_head *pos;
+
+ for (pos = blkp->blklist.prev; pos != &blkhead; pos = pos->prev) {
+ if (BLK_SIZE_USED(pos) < blkp->size_used) {
+ /* Move blkp up */
+ list_del(&blkp->blklist);
+ list_add_tail(&blkp->blklist, pos);
+ pos = &blkp->blklist;
+ } else
+ break;
+ }
+ /* Fix firstnotfull if needed */
+ if (blkp->size_used == PCPU_BLKSIZE) {
+ firstnotfull = blkp->blklist.next;
+ return;
+ }
+ if (blkp->size_used > BLK_SIZE_USED(firstnotfull)) {
+ firstnotfull = &blkp->blklist;
+ return;
+ }
+}
+
+struct buf_ctl *alloc_bufctl(struct pcpu_block *blkp)
+{
+ void *bufctl;
+ int head = blkp->bufctl_fl_head;
+ BUG_ON(head == -1); /* If bufctls for this block has exhausted */
+ blkp->bufctl_fl_head = blkp->bufctl_fl[blkp->bufctl_fl_head];
+ bufctl = (void *)blkp + sizeof (struct pcpu_block) +
+ sizeof (struct buf_ctl) * head;
+ return bufctl;
+}
+
+/* Don't want to kmalloc this -- to avoid dependence on slab for future */
+static struct obj_map_elmt obj_map[OBJS_PER_BLOCK];
+
+/* Scan the freelist and return suitable obj if found */
+static void
+*get_obj_from_block(size_t size, size_t align, struct pcpu_block *blkp)
+{
+ int nr_elements, nr_currency, obj_startbit, obj_endbit;
+ int i, j;
+ void *objp;
+ struct buf_ctl *bufctl;
+
+ nr_elements = make_obj_map(obj_map, blkp);
+ if (!nr_elements)
+ return NULL;
+
+ /* Sort list in ascending order */
+ sort_obj_map(obj_map, nr_elements);
+
+ /* Get the smallest obj_sized chunk for this size */
+ i = 0;
+ while (i < nr_elements - 1 && size > obj_map[i].obj_size)
+ i++;
+ if (obj_map[i].obj_size < size) /* No suitable obj_size found */
+ return NULL;
+
+ /* chunk of obj_size >= size is found, check for suitability (align)
+ * and alloc
+ */
+ nr_currency = size / PCPU_CURR_SIZE;
+ obj_startbit = obj_map[i].startbit;
+
+try_again_for_align:
+
+ obj_endbit = obj_map[i].startbit + obj_map[i].obj_size / PCPU_CURR_SIZE
+ - 1;
+ objp = obj_startbit * PCPU_CURR_SIZE + blkp->start_addr;
+
+ if (IS_ALIGNED((unsigned long) objp, align)) {
+ /* Alignment is ok so alloc this chunk */
+ bufctl = alloc_bufctl(blkp);
+ if (!bufctl)
+ return NULL;
+ bufctl->addr = objp;
+ bufctl->size = size;
+ bufctl->next = NULL;
+
+ /* Mark the bitmap as allocated */
+ for (j = obj_startbit; j < nr_currency + obj_startbit; j++)
+ set_bit(j, blkp->bitmap);
+ blkp->size_used += size;
+ /* Re-arrange list to preserve full, partial and free order */
+ sort_blk_list_up(blkp);
+ /* Add to the allocated buffers list and return */
+ add_bufctl(bufctl);
+ return objp;
+ } else {
+ /* Alignment is not ok */
+ int obj_size = (obj_endbit - obj_startbit + 1) * PCPU_CURR_SIZE;
+ if (obj_size > size && obj_startbit <= obj_endbit) {
+ /* Since obj_size is bigger than requested, check if
+ alignment can be met by changing startbit */
+ obj_startbit++;
+ goto try_again_for_align;
+ } else {
+ /* Try in the next chunk */
+ if (++i < nr_elements) {
+ /* Reset start bit and try again */
+ obj_startbit = obj_map[i].startbit;
+ goto try_again_for_align;
+ }
+ }
+ }
+
+ /* Everything failed so return NULL */
+ return NULL;
+}
+
+/*
+ * __alloc_percpu - allocate one copy of the object for every present
+ * cpu in the system, zeroing them.
+ * Objects should be dereferenced using per_cpu_ptr/get_cpu_ptr
+ * macros only
+ *
+ * This allocator is slow as we assume allocs to come
+ * by during boot/module init.
+ * Should not be called from interrupt context
+ */
+void *
+__alloc_percpu(size_t size, size_t align)
+{
+ struct pcpu_block *blkp;
+ struct list_head *l;
+ void *obj;
+
+ if (!size)
+ return NULL;
+
+ if (size < PCPU_CURR_SIZE)
+ size = PCPU_CURR_SIZE;
+
+ if (align == 0)
+ align = PCPU_CURR_SIZE;
+
+ if (size > MAX_OBJSIZE) {
+ printk("alloc_percpu: ");
+ printk("size %d requested is more than I can handle\n", size);
+ return NULL;
+ }
+
+ BUG_ON(!IS_ALIGNED(size, PCPU_CURR_SIZE));
+
+try_after_refill:
+
+ /* Get the block to allocate from */
+ down(&blklist_lock);
+ l = firstnotfull;
+
+try_next_block:
+
+ /* If you have reached end of list, add another block and try */
+ if (l == &blkhead)
+ goto unlock_and_get_mem;
+ blkp = list_entry(l, struct pcpu_block, blklist);
+ obj = get_obj_from_block(size, align, blkp);
+ if (!obj) {
+ l = l->next;
+ goto try_next_block;
+ }
+ up(&blklist_lock);
+ return obj;
+
+unlock_and_get_mem:
+
+ up(&blklist_lock);
+ if (add_percpu_block())
+ goto try_after_refill;
+ return NULL;
+
+}
+
+EXPORT_SYMBOL(__alloc_percpu);
+
+/* After you free from a block, It can only go down the ordered list */
+static void
+sort_blk_list_down(struct pcpu_block *blkp)
+{
+ struct list_head *pos, *prev, *next;
+ /* Store the actual prev and next pointers for fnof fixing later */
+ prev = blkp->blklist.prev;
+ next = blkp->blklist.next;
+
+ /* Fix the ordering on the list */
+ for (pos = blkp->blklist.next; pos != &blkhead; pos = pos->next) {
+ if (BLK_SIZE_USED(pos) > blkp->size_used) {
+ /* Move blkp down */
+ list_del(&blkp->blklist);
+ list_add(&blkp->blklist, pos);
+ pos = &blkp->blklist;
+ } else
+ break;
+ }
+ /* Fix firstnotfull if needed and return */
+ if (firstnotfull == &blkhead) {
+ /* There was no block free, so now this block is fnotfull */
+ firstnotfull = &blkp->blklist;
+ return;
+ }
+
+ if (firstnotfull == &blkp->blklist) {
+ /* This was firstnotfull so fix fnof pointer accdly */
+ if (prev != &blkhead && BLK_SIZE_USED(prev) != PCPU_BLKSIZE) {
+ /* Move fnof pointer up */
+ firstnotfull = prev;
+ prev = prev->prev;
+ /* If size_used of prev is same as fnof, fix fnof to
+ point to topmost of the equal sized blocks */
+ while (prev != &blkhead &&
+ BLK_SIZE_USED(prev) != PCPU_BLKSIZE) {
+ if (BLK_SIZE_USED(prev) !=
+ BLK_SIZE_USED(firstnotfull))
+ return;
+ firstnotfull = prev;
+ prev = prev->prev;
+ }
+ } else if (next != &blkhead) {
+ /* Move fnof pointer down */
+ firstnotfull = next;
+ next = next->next;
+ if (BLK_SIZE_USED(firstnotfull) != PCPU_BLKSIZE)
+ return;
+ /* fnof is pointing to block which is full...fix it */
+ while (next != &blkhead &&
+ BLK_SIZE_USED(next) == PCPU_BLKSIZE) {
+ firstnotfull = next;
+ next = next->next;
+ }
+ }
+
+ }
+
+}
+
+void free_bufctl(struct pcpu_block *blkp, struct buf_ctl *bufp)
+{
+ int idx = ((void *) bufp - (void *) blkp - sizeof (struct pcpu_block))
+ / sizeof (struct buf_ctl);
+ blkp->bufctl_fl[idx] = blkp->bufctl_fl_head;
+ blkp->bufctl_fl_head = idx;
+}
+
+/*
+ * Free the percpu obj and whatever memory can be freed
+ */
+static void
+free_percpu_obj(struct list_head *pos, struct buf_ctl *bufp)
+{
+ struct pcpu_block *blkp;
+ blkp = list_entry(pos, struct pcpu_block, blklist);
+
+ /* Update blkp->size_used and free if size_used is 0 */
+ blkp->size_used -= bufp->size;
+ if (blkp->size_used) {
+ /* Mark the bitmap corresponding to this object free */
+ int i, obj_startbit;
+ int nr_currency = bufp->size / PCPU_CURR_SIZE;
+ obj_startbit = (bufp->addr - blkp->start_addr) / PCPU_CURR_SIZE;
+ for (i = obj_startbit; i < obj_startbit + nr_currency; i++)
+ clear_bit(i, blkp->bitmap);
+ sort_blk_list_down(blkp);
+ } else {
+ /* Usecount is zero, so prepare to give this block back to vm */
+ /* Fix firstnotfull if freeing block was firstnotfull
+ * If there are more blocks with the same usecount as fnof,
+ * point to the first block from the head */
+ if (firstnotfull == pos) {
+ firstnotfull = pos->prev;
+ while (firstnotfull != &blkhead) {
+ unsigned int fnf_size_used;
+ fnf_size_used = BLK_SIZE_USED(firstnotfull);
+
+ if (fnf_size_used == PCPU_BLKSIZE)
+ firstnotfull = &blkhead;
+ else if (firstnotfull->prev == &blkhead)
+ break;
+ else if (BLK_SIZE_USED(firstnotfull->prev)
+ == fnf_size_used)
+ firstnotfull = firstnotfull->prev;
+ else
+ break;
+ }
+ }
+ list_del(pos);
+ }
+
+ /* Free bufctl after fixing the bufctl list */
+ if (bufp == buf_head) {
+ buf_head = bufp->next;
+ } else {
+ struct buf_ctl *tmp = buf_head;
+ while (tmp && tmp->next != bufp)
+ tmp = tmp->next;
+ BUG_ON(!tmp || tmp->next != bufp);
+ tmp->next = bufp->next;
+ }
+ free_bufctl(blkp, bufp);
+ /* If usecount is zero, give this block back to vm */
+ if (!blkp->size_used)
+ vfree_percpu(blkp->start_addr);
+ return;
+}
+
+/*
+ * Free memory allocated using alloc_percpu.
+ */
+
+void
+free_percpu(const void *objp)
+{
+ struct buf_ctl *bufp;
+ struct pcpu_block *blkp;
+ struct list_head *pos;
+ if (!objp)
+ return;
+
+ /* Find block from which obj was allocated by scanning bufctl list */
+ down(&blklist_lock);
+ bufp = buf_head;
+ while (bufp) {
+ if (bufp->addr == objp)
+ break;
+ bufp = bufp->next;
+ }
+ BUG_ON(!bufp);
+
+ /* We have the bufctl for the obj here, Now get the block */
+ list_for_each(pos, &blkhead) {
+ blkp = list_entry(pos, struct pcpu_block, blklist);
+ if (objp >= blkp->start_addr &&
+ objp < blkp->start_addr + PCPU_BLKSIZE)
+ break;
+ }
+
+ BUG_ON(pos == &blkhead); /* Couldn't find obj in block list */
+
+ /*
+ * Mark the bitmap free, Update use count, fix the ordered
+ * blklist, free the obj bufctl.
+ */
+ free_percpu_obj(pos, bufp);
+
+ up(&blklist_lock);
+ return;
+}
+
+EXPORT_SYMBOL(free_percpu);
diff -ruN -X dontdiff2 linux-2.6.10-mm2/mm/slab.c alloc_percpu-2.6.10-mm2/mm/slab.c
--- linux-2.6.10-mm2/mm/slab.c 2005-01-12 17:39:32.000000000 +0530
+++ alloc_percpu-2.6.10-mm2/mm/slab.c 2005-01-12 17:42:11.000000000 +0530
@@ -2482,51 +2482,6 @@

EXPORT_SYMBOL(__kmalloc);

-#ifdef CONFIG_SMP
-/**
- * __alloc_percpu - allocate one copy of the object for every present
- * cpu in the system, zeroing them.
- * Objects should be dereferenced using the per_cpu_ptr macro only.
- *
- * @size: how many bytes of memory are required.
- * @align: the alignment, which can't be greater than SMP_CACHE_BYTES.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- int i;
- struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);
-
- if (!pdata)
- return NULL;
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- pdata->ptrs[i] = kmem_cache_alloc_node(
- kmem_find_general_cachep(size, GFP_KERNEL),
- cpu_to_node(i));
-
- if (!pdata->ptrs[i])
- goto unwind_oom;
- memset(pdata->ptrs[i], 0, size);
- }
-
- /* Catch derefs w/o wrappers */
- return (void *) (~(unsigned long) pdata);
-
-unwind_oom:
- while (--i >= 0) {
- if (!cpu_possible(i))
- continue;
- kfree(pdata->ptrs[i]);
- }
- kfree(pdata);
- return NULL;
-}
-
-EXPORT_SYMBOL(__alloc_percpu);
-#endif
-
/**
* kmem_cache_free - Deallocate an object
* @cachep: The cache the allocation was from.
@@ -2590,31 +2545,6 @@

EXPORT_SYMBOL(kfree);

-#ifdef CONFIG_SMP
-/**
- * free_percpu - free previously allocated percpu memory
- * @objp: pointer returned by alloc_percpu.
- *
- * Don't free memory not originally allocated by alloc_percpu()
- * The complemented objp is to check for that.
- */
-void
-free_percpu(const void *objp)
-{
- int i;
- struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- kfree(p->ptrs[i]);
- }
- kfree(p);
-}
-
-EXPORT_SYMBOL(free_percpu);
-#endif
-
unsigned int kmem_cache_size(kmem_cache_t *cachep)
{
return obj_reallen(cachep);


2005-01-13 08:58:28

by Andrew Morton

Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

Ravikiran G Thirumalai <[email protected]> wrote:
>
> ...
> The following patch re-implements the linux dynamic percpu memory allocator

Heavens, it's complex.

> 1. Percpu memory dereference is faster
> - One less memory reference compared to existing simple alloc_percpu
> - As fast as with static percpu areas, one mem ref less actually.
> 2. Better memory usage
> - Doesn't need a NR_CPUS pointer array for each allocation
> - Interlaces objects making better utilization of memory/cachelines
> - Userspace tests show 98% utilization with random sized allocations
> after repeated random frees
> 3. Provides truly node local allocation
> - The percpu memory with existing alloc_percpu does node local
> allocation, but the NR_CPUS place holder is not node local. This
> problem doesn't exist with the new implementation.

But it does consume vmalloc space and will incur additional TLB reload
costs.

> +static void *
> +valloc_percpu(void)
> +{
> + int i,j = 0;
> + unsigned int nr_pages;
> + struct vm_struct *area, tmp;
> + struct page **tmppage;
> + struct page *pages[BLOCK_MANAGEMENT_PAGES];

How much stackspace is this guy using on 512-way?

> + unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
> + struct pcpu_block *blkp = NULL;
> +
> + BUG_ON(!IS_ALIGNED(PCPU_BLKSIZE, PAGE_SIZE));
> + BUG_ON(!PCPU_BLKSIZE);
> + nr_pages = PCPUPAGES_PER_BLOCK + BLOCK_MANAGEMENT_PAGES;
> +
> + /* Alloc Managent block pages */
> + for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
> + pages[i] = alloc_pages(GFP_KERNEL, 0);

Can use __GFP_ZERO here.

> + if (!pages[i]) {
> + while ( --i >= 0 )
> + __free_pages(pages[i], 0);
> + return NULL;
> + }
> + /* Zero the alloced page */
> + clear_page(page_address(pages[i]));

And so can remove this.

Cannot highmem pages be used here?

> + for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)

Patch has a fair amount of whitespace oddities.

> + /* Alloc node local pages for all cpus possible */
> + for (i = 0; i < NR_CPUS; i++) {
> + if (cpu_possible(i)) {

Isn't this equivalent to for_each_cpu()?

> + /* Map pages for each cpu by splitting vm_struct for each cpu */
> + for (i = 0; i < NR_CPUS; i++) {
> + if (cpu_possible(i)) {

etc.

> +/* Sort obj_map array in ascending order -- simple bubble sort */
> +static void
> +sort_obj_map(struct obj_map_elmt map[], int nr)

That'll be unpopular ;) Why not extract qsort from XFS?

Why cannot the code simply call vmalloc rather than copying its internals?

Have you considered trying a simple __alloc_pages, fall back to vmalloc if
that fails, or if the requested size is more than eight pages, or something
of that sort?
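
For concreteness, that kind of fallback might look roughly like the sketch
below (illustrative only, not code from either patch in this thread; the
free path would need to detect which allocator was used, e.g. by checking
the address against the vmalloc range):

	static void *alloc_percpu_block(unsigned long size)
	{
		void *ptr = NULL;

		/* Try physically contiguous pages for small blocks first */
		if (size <= 8 * PAGE_SIZE)
			ptr = (void *)__get_free_pages(GFP_KERNEL, get_order(size));
		/* Fall back to vmalloc for large sizes or under fragmentation */
		if (!ptr)
			ptr = vmalloc(size);
		return ptr;
	}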

2005-01-14 02:29:24

by Rusty Russell

Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Thu, 2005-01-13 at 14:04 +0530, Ravikiran G Thirumalai wrote:
> Hi Andrew,
> Could you consider this for inclusion in the mm tree? The patch has been
> tested in user space and kernel space. Manfred seems to like the fact that
> the allocator doesn't depend on slab, so that it can be used for slab's
> internal head arrays. He had questions about fragmentation, which I have
> answered.
>
> Patch follows
>
> Thanks,
> Kiran
>
>
> The following patch re-implements the linux dynamic percpu memory allocator
> so that:
> 1. Percpu memory dereference is faster
> - One less memory reference compared to existing simple alloc_percpu
> - As fast as with static percpu areas, one mem ref less actually.

Hmm, for me one point of a good dynamic per-cpu implementation is that
the same per_cpu_offset be used as for the static per-cpu variables.
This means that architectures can put it in a register. It also has
different properties than slab, because tiny allocations will be more
common (ie. one counter).

I've had this implementation sitting around for a while: it's not
enormously fast, but it is space-efficient, and would need a cache on
the front if people started doing a high rate of allocs. First patch is
merely a cleanup.

Rusty.
Name: Unification of per-cpu headers for SMP
Author: Rusty Russell
Status: Trivial
Depends: Percpu/percpu-up-unify.patch.gz

There's really only one sane way to implement accessing other CPUs'
variables, and there's no real reason for archs to use a method other
than __per_cpu_offset[], so move that from asm-*/percpu.h to
linux/percpu.h.

Index: linux-2.6.11-rc1-bk1-Percpu/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/include/asm-ia64/percpu.h 2004-02-18 23:54:32.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/include/asm-ia64/percpu.h 2005-01-14 13:19:04.681626896 +1100
@@ -35,9 +35,6 @@
* external routine, to avoid include-hell.
*/
#ifdef CONFIG_SMP
-
-extern unsigned long __per_cpu_offset[NR_CPUS];
-
/* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);

Index: linux-2.6.11-rc1-bk1-Percpu/init/main.c
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/init/main.c 2005-01-13 12:11:11.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/init/main.c 2005-01-14 13:19:04.768613672 +1100
@@ -305,11 +305,10 @@

#else

-#ifdef __GENERIC_PER_CPU
unsigned long __per_cpu_offset[NR_CPUS];
-
EXPORT_SYMBOL(__per_cpu_offset);

+#ifdef __GENERIC_PER_CPU
static void __init setup_per_cpu_areas(void)
{
unsigned long size, i;
Index: linux-2.6.11-rc1-bk1-Percpu/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/arch/ia64/kernel/setup.c 2005-01-14 11:08:00.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/arch/ia64/kernel/setup.c 2005-01-14 13:19:04.893594672 +1100
@@ -56,11 +56,6 @@
# error "struct cpuinfo_ia64 too big!"
#endif

-#ifdef CONFIG_SMP
-unsigned long __per_cpu_offset[NR_CPUS];
-EXPORT_SYMBOL(__per_cpu_offset);
-#endif
-
DEFINE_PER_CPU(struct cpuinfo_ia64, cpu_info);
DEFINE_PER_CPU(unsigned long, local_per_cpu_offset);
DEFINE_PER_CPU(unsigned long, ia64_phys_stacked_size_p8);
Index: linux-2.6.11-rc1-bk1-Percpu/include/linux/percpu.h
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/include/linux/percpu.h 2004-10-19 14:34:22.000000000 +1000
+++ linux-2.6.11-rc1-bk1-Percpu/include/linux/percpu.h 2005-01-14 13:19:04.985580688 +1100
@@ -16,6 +16,9 @@
#define put_cpu_var(var) preempt_enable()

#ifdef CONFIG_SMP
+/* var is in discarded region: offset to particular copy we want */
+#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
+extern unsigned long __per_cpu_offset[NR_CPUS];

struct percpu_data {
void *ptrs[NR_CPUS];
Index: linux-2.6.11-rc1-bk1-Percpu/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/include/asm-generic/percpu.h 2004-02-04 15:39:09.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/include/asm-generic/percpu.h 2005-01-14 13:19:05.099563360 +1100
@@ -5,14 +5,10 @@
#define __GENERIC_PER_CPU
#ifdef CONFIG_SMP

-extern unsigned long __per_cpu_offset[NR_CPUS];
-
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

-/* var is in discarded region: offset to particular copy we want */
-#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) per_cpu(var, smp_processor_id())

/* A macro to avoid #include hell... */


Name: Dynamic per-cpu allocation using static per-cpu mechanism
Author: Rusty Russell
Status: Tested on 2.6.10-rc2-bk13

This patch replaces the dynamic per-cpu allocator, alloc_percpu,
to make it use the same mechanism as the static per-cpu variables, ie.
ptr + __per_cpu_offset[smp_processor_id()] gives the variable address.
This increases space and time efficiency of reference at the same time.

This is a generalization of the allocator in kernel/module.c: it
gets moved to its own (SMP-only) file: mm/percpu.c.

The basic idea is that when we need more memory, we allocate
another NR_CPUS*sizeof(.data.percpu section), and hand allocations
out from that (minus the __per_cpu_offset, which is set at boot
from the difference between the .data.percpu section and the
initial NR_CPUS*sizeof(.data.percpu section) allocation).

The situation is made trickier by archs which want to allocate per-cpu
memory near the CPUs which use them: hooks are provided for the
initial alloc (arch_alloc_percpu_bootmem(), which can also change the
size of the allocation, eg. to page-align) and arch_alloc_percpu().
__GENERIC_PER_CPU gets sane default implementations.
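
A sketch of the resulting arithmetic, using the definitions in the patch
below (names like cpu0_copy are only illustrative):

	/*
	 * setup_per_cpu_areas() places the per-cpu copies percpu_size apart:
	 *   __per_cpu_offset[cpu] == __per_cpu_offset[0] + cpu * percpu_size
	 *
	 * __alloc_percpu() returns  objp = cpu0_copy - __per_cpu_offset[0],
	 * so dereferencing works exactly as for static per-cpu variables:
	 *   per_cpu_ptr(objp, cpu) == objp + __per_cpu_offset[cpu]
	 *                          == cpu0_copy + cpu * percpu_size
	 */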

Index: linux-2.6.11-rc1-bk1-Percpu/mm/percpu.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.11-rc1-bk1-Percpu/mm/percpu.c 2005-01-14 13:19:24.844561664 +1100
@@ -0,0 +1,334 @@
+/* Routines to do per-cpu allocation.
+
+ These are more complicated than I would like, because different
+ architectures want different things.
+
+ The basic idea is simple: we allocate a (virtually) contiguous
+ block of memory at boot, and index it by cpu. The offset between
+ the original percpu section and these duplicate sections is
+ recorded in __per_cpu_offset[cpu] (and in some archs, a register,
+ etc). eg. the per-cpu section ends up at 0xc00a0000, and we
+ allocate 64k for each CPU stating at 0x100a0000, then
+ __per_cpu_offset[0] is 0x50000000, __per_cpu_offset[1] is
+ 0x50010000, etc.
+
+ This original block is also handed out to modules which use
+ DEFINE_PER_CPU: on some archs it has to be this original block, as
+ they use magic tricks to dereference these static per-cpu variables
+ in some cases (eg. cpu_local_inc on ia64).
+
+ Other blocks can be allocated later to fulfill dynamic per-cpu
+ requests: they use the same __per_cpu_offset[] values as their
+ static cousins, so the layout has to be the same (this is why we
+ insist on contiguous memory in the first place: it's easy to get
+ more contiguous memory).
+
+ It makes sense to allocate the per-cpu memory so that the memory is
+ close to the corresponding CPU, and use the virt<->phys mapping to
+ turn it into a contiguous array. However, the pagesize in the
+ kernel is often large, and we don't want to reserve that all for
+ static allocations, so we only reserve part of it.
+
+ Note about the allocator: it's designed to be space-efficient,
+ since it's used for ints and longs (counters) and the like. For
+ speed, it needs a cache on top.
+ */
+#include <linux/percpu.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <asm/semaphore.h>
+
+/* Created by linker magic */
+extern char __per_cpu_start[], __per_cpu_end[];
+
+unsigned long __per_cpu_offset[NR_CPUS];
+EXPORT_SYMBOL(__per_cpu_offset);
+
+struct percpu_block {
+ struct list_head list;
+ /* Number of blocks used and allocated. */
+ unsigned int num_used, num_allocated;
+ /* Pointer to start of block. */
+ void *start;
+ /* Array containing sizes of each block. -ve means used. */
+ int *size;
+};
+
+static DECLARE_MUTEX(percpu_lock);
+static struct percpu_block percpu_core;
+/* All blocks have to be the same size per cpu, otherwise span would differ. */
+static unsigned long reserved_size, percpu_size;
+
+#ifdef __GENERIC_PER_CPU
+/* Ideally, an arch will sew together pages local to CPUs to form a
+ * continuous allocation. */
+static inline void *arch_alloc_percpu(unsigned long size)
+{
+ void *ret;
+
+ ret = kmalloc(size * NR_CPUS, GFP_KERNEL);
+ if (!ret)
+ ret = vmalloc(size * NR_CPUS);
+ return ret;
+}
+
+static inline void *arch_alloc_percpu_bootmem(unsigned long *size)
+{
+ return alloc_bootmem(*size * NR_CPUS);
+}
+#endif /* __GENERIC_PER_CPU */
+
+static struct percpu_block *new_block(void)
+{
+ struct percpu_block *b;
+
+ b = kmalloc(sizeof(*b), GFP_KERNEL);
+ if (!b)
+ return NULL;
+
+ b->num_used = 1;
+ b->num_allocated = 4;
+ b->size = kmalloc(sizeof(b->size[0]) * b->num_allocated, GFP_KERNEL);
+ if (!b->size) {
+ kfree(b);
+ return NULL;
+ }
+
+ b->size[0] = percpu_size;
+ return b;
+}
+
+/* Done early, so areas can be used. */
+void __init setup_per_cpu_areas(void)
+{
+ unsigned long i;
+ char *ptr;
+
+ /* Copy section for each CPU (we discard the original) */
+ reserved_size = ALIGN(__per_cpu_end - __per_cpu_start,SMP_CACHE_BYTES);
+#ifdef CONFIG_MODULES
+ /* Enough to cover all DEFINE_PER_CPUs in modules, too. */
+ reserved_size = min(reserved_size, 8192UL * sizeof(unsigned long));
+#endif
+ /* Arch may choose to allocate much more for each CPU
+ * (eg. large pages). */
+ percpu_size = reserved_size;
+ percpu_core.start = ptr = arch_alloc_percpu_bootmem(&percpu_size);
+ BUG_ON(percpu_size < reserved_size);
+ for (i = 0; i < NR_CPUS; i++, ptr += percpu_size) {
+ __per_cpu_offset[i] = ptr - __per_cpu_start;
+ memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ }
+}
+
+static int __init percpu_alloc_init(void)
+{
+ percpu_core.num_used = 2;
+ percpu_core.num_allocated = 4;
+ percpu_core.size = kmalloc(sizeof(percpu_core.size[0])
+ * percpu_core.num_allocated,
+ GFP_KERNEL);
+ /* Static in-kernel percpu data (used, so negative). */
+ percpu_core.size[0] = -(__per_cpu_end - __per_cpu_start);
+ /* Free room. */
+ percpu_core.size[1] = percpu_size + percpu_core.size[0];
+ INIT_LIST_HEAD(&percpu_core.list);
+
+ if (percpu_size > reserved_size) {
+ struct percpu_block *b;
+
+ /* Mark out extra space as allocated. */
+ percpu_core.size[1] = reserved_size + percpu_core.size[0];
+ percpu_core.size[2] = -(percpu_size - reserved_size);
+ percpu_core.num_used++;
+
+ /* Duplicate of core block, but with core space allocated. */
+ b = new_block();
+ b->size[0] = -reserved_size;
+ b->size[1] = percpu_size - reserved_size;
+ b->num_used = 2;
+ b->start = percpu_core.start;
+ list_add(&b->list, &percpu_core.list);
+ }
+ return 0;
+}
+core_initcall(percpu_alloc_init);
+
+static int split_block(unsigned int i, unsigned short size,
+ struct percpu_block *pb)
+{
+ /* Reallocation required? */
+ if (pb->num_used + 1 > pb->num_allocated) {
+ int *new = kmalloc(sizeof(new[0]) * pb->num_allocated*2,
+ GFP_KERNEL);
+ if (!new)
+ return 0;
+
+ memcpy(new, pb->size, sizeof(new[0])*pb->num_allocated);
+ pb->num_allocated *= 2;
+ kfree(pb->size);
+ pb->size = new;
+ }
+
+ /* Insert a new subblock */
+ memmove(&pb->size[i+1], &pb->size[i],
+ sizeof(pb->size[0]) * (pb->num_used - i));
+ pb->num_used++;
+
+ pb->size[i+1] -= size;
+ pb->size[i] = size;
+ return 1;
+}
+
+static inline unsigned int block_size(int val)
+{
+ if (val < 0)
+ return -val;
+ return val;
+}
+
+static void *alloc_from_block(unsigned long size, unsigned long align,
+ struct percpu_block *pb)
+{
+ unsigned long extra;
+ unsigned int i;
+ void *ptr;
+
+ BUG_ON(align > SMP_CACHE_BYTES);
+
+ ptr = pb->start;
+ for (i = 0; i < pb->num_used; ptr += block_size(pb->size[i]), i++) {
+ /* Extra for alignment requirement. */
+ extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
+ BUG_ON(i == 0 && extra != 0);
+
+ if (pb->size[i] < 0 || pb->size[i] < extra + size)
+ continue;
+
+ /* Transfer extra to previous block. */
+ if (pb->size[i-1] < 0)
+ pb->size[i-1] -= extra;
+ else
+ pb->size[i-1] += extra;
+ pb->size[i] -= extra;
+ ptr += extra;
+
+ /* Split block if warranted */
+ if (pb->size[i] - size > sizeof(unsigned long))
+ if (!split_block(i, size, pb))
+ return NULL;
+
+ /* Mark allocated */
+ pb->size[i] = -pb->size[i];
+ return ptr;
+ }
+ return NULL;
+}
+
+static void free_from_block(const void *freeme, struct percpu_block *pb)
+{
+ unsigned int i;
+ void *ptr = pb->start;
+
+ for (i = 0; ptr != freeme; ptr += block_size(pb->size[i]), i++)
+ BUG_ON(i == pb->num_used);
+
+ pb->size[i] = -pb->size[i];
+ /* Merge with previous? */
+ if (i > 0 && pb->size[i-1] >= 0) {
+ pb->size[i-1] += pb->size[i];
+ pb->num_used--;
+ memmove(&pb->size[i], &pb->size[i+1],
+ (pb->num_used - i) * sizeof(pb->size[0]));
+ i--;
+ }
+ /* Merge with next? */
+ if (i+1 < pb->num_used && pb->size[i+1] >= 0) {
+ pb->size[i] += pb->size[i+1];
+ pb->num_used--;
+ memmove(&pb->size[i+1], &pb->size[i+2],
+ (pb->num_used - (i+1)) * sizeof(pb->size[0]));
+ }
+}
+
+#ifdef CONFIG_MODULES
+void *percpu_modalloc(unsigned long size, unsigned long align)
+{
+ void *ret;
+
+ down(&percpu_lock);
+ ret = alloc_from_block(size, align, &percpu_core);
+ printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",size);
+ up(&percpu_lock);
+ return ret;
+}
+
+void percpu_modfree(void *freeme)
+{
+ down(&percpu_lock);
+ free_from_block(freeme, &percpu_core);
+ up(&percpu_lock);
+}
+#endif
+
+void *__alloc_percpu(unsigned long size, unsigned long align)
+{
+ void *ret = NULL;
+ struct percpu_block *b;
+ unsigned int cpu;
+
+ down(&percpu_lock);
+ /* Cleverly skips over kernel reserved space. */
+ list_for_each_entry(b, &percpu_core.list, list) {
+ ret = alloc_from_block(size, align, b);
+ if (ret)
+ goto success;
+ }
+
+ b = new_block();
+ if (!b)
+ goto unlock;
+
+ b->start = arch_alloc_percpu(percpu_size);
+ if (!b->start) {
+ kfree(b->size);
+ kfree(b);
+ goto unlock;
+ }
+
+ list_add(&b->list, &percpu_core.list);
+ ret = alloc_from_block(size, align, b);
+ BUG_ON(!ret);
+success:
+ /* Gives a pointer for use with per_cpu_ptr() etc. */
+ ret -= __per_cpu_offset[0];
+ for_each_cpu(cpu)
+ memset(per_cpu_ptr(ret, cpu), 0, size);
+unlock:
+ up(&percpu_lock);
+ return ret;
+}
+EXPORT_SYMBOL(__alloc_percpu);
+
+void free_percpu(const void *freeme)
+{
+ struct percpu_block *i;
+
+ freeme += __per_cpu_offset[0];
+
+ down(&percpu_lock);
+ /* Cleverly skips over kernel reserved space. */
+ list_for_each_entry(i, &percpu_core.list, list) {
+ if (freeme >= i->start && freeme < i->start + percpu_size) {
+ free_from_block(freeme, i);
+ goto unlock;
+ }
+ }
+ BUG();
+unlock:
+ up(&percpu_lock);
+}
+EXPORT_SYMBOL(free_percpu);
Index: linux-2.6.11-rc1-bk1-Percpu/init/main.c
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/init/main.c 2005-01-14 13:19:04.768613672 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/init/main.c 2005-01-14 13:19:24.843561816 +1100
@@ -300,38 +300,9 @@
#define smp_init() do { } while (0)
#endif

-static inline void setup_per_cpu_areas(void) { }
static inline void smp_prepare_cpus(unsigned int maxcpus) { }

#else
-
-unsigned long __per_cpu_offset[NR_CPUS];
-EXPORT_SYMBOL(__per_cpu_offset);
-
-#ifdef __GENERIC_PER_CPU
-static void __init setup_per_cpu_areas(void)
-{
- unsigned long size, i;
- char *ptr;
- /* Created by linker magic */
- extern char __per_cpu_start[], __per_cpu_end[];
-
- /* Copy section for each CPU (we discard the original) */
- size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
-#ifdef CONFIG_MODULES
- if (size < PERCPU_ENOUGH_ROOM)
- size = PERCPU_ENOUGH_ROOM;
-#endif
-
- ptr = alloc_bootmem(size * NR_CPUS);
-
- for (i = 0; i < NR_CPUS; i++, ptr += size) {
- __per_cpu_offset[i] = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
- }
-}
-#endif /* !__GENERIC_PER_CPU */
-
/* Called by boot processor to activate the rest. */
static void __init smp_init(void)
{
Index: linux-2.6.11-rc1-bk1-Percpu/kernel/module.c
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/kernel/module.c 2005-01-13 12:11:11.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/kernel/module.c 2005-01-14 13:19:24.845561512 +1100
@@ -209,152 +209,13 @@
}

#ifdef CONFIG_SMP
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block. -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
- /* Reallocation required? */
- if (pcpu_num_used + 1 > pcpu_num_allocated) {
- int *new = kmalloc(sizeof(new[0]) * pcpu_num_allocated*2,
- GFP_KERNEL);
- if (!new)
- return 0;
-
- memcpy(new, pcpu_size, sizeof(new[0])*pcpu_num_allocated);
- pcpu_num_allocated *= 2;
- kfree(pcpu_size);
- pcpu_size = new;
- }
-
- /* Insert a new subblock */
- memmove(&pcpu_size[i+1], &pcpu_size[i],
- sizeof(pcpu_size[0]) * (pcpu_num_used - i));
- pcpu_num_used++;
-
- pcpu_size[i+1] -= size;
- pcpu_size[i] = size;
- return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
- if (val < 0)
- return -val;
- return val;
-}
-
-/* Created by linker magic */
-extern char __per_cpu_start[], __per_cpu_end[];
-
-static void *percpu_modalloc(unsigned long size, unsigned long align)
-{
- unsigned long extra;
- unsigned int i;
- void *ptr;
-
- BUG_ON(align > SMP_CACHE_BYTES);
-
- ptr = __per_cpu_start;
- for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- /* Extra for alignment requirement. */
- extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
- BUG_ON(i == 0 && extra != 0);
-
- if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
- continue;
-
- /* Transfer extra to previous block. */
- if (pcpu_size[i-1] < 0)
- pcpu_size[i-1] -= extra;
- else
- pcpu_size[i-1] += extra;
- pcpu_size[i] -= extra;
- ptr += extra;
-
- /* Split block if warranted */
- if (pcpu_size[i] - size > sizeof(unsigned long))
- if (!split_block(i, size))
- return NULL;
-
- /* Mark allocated */
- pcpu_size[i] = -pcpu_size[i];
- return ptr;
- }
-
- printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
- size);
- return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
- unsigned int i;
- void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
-
- /* First entry is core kernel percpu data. */
- for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
- if (ptr == freeme) {
- pcpu_size[i] = -pcpu_size[i];
- goto free;
- }
- }
- BUG();
-
- free:
- /* Merge with previous? */
- if (pcpu_size[i-1] >= 0) {
- pcpu_size[i-1] += pcpu_size[i];
- pcpu_num_used--;
- memmove(&pcpu_size[i], &pcpu_size[i+1],
- (pcpu_num_used - i) * sizeof(pcpu_size[0]));
- i--;
- }
- /* Merge with next? */
- if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
- pcpu_size[i] += pcpu_size[i+1];
- pcpu_num_used--;
- memmove(&pcpu_size[i+1], &pcpu_size[i+2],
- (pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
- }
-}
-
static unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
const char *secstrings)
{
return find_sec(hdr, sechdrs, secstrings, ".data.percpu");
}
-
-static int percpu_modinit(void)
-{
- pcpu_num_used = 2;
- pcpu_num_allocated = 2;
- pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
- GFP_KERNEL);
- /* Static in-kernel percpu data (used). */
- pcpu_size[0] = -ALIGN(__per_cpu_end-__per_cpu_start, SMP_CACHE_BYTES);
- /* Free room. */
- pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
- if (pcpu_size[1] < 0) {
- printk(KERN_ERR "No per-cpu room for modules.\n");
- pcpu_num_used = 1;
- }
-
- return 0;
-}
-__initcall(percpu_modinit);
#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align)
-{
- return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
- BUG();
-}
static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
Elf_Shdr *sechdrs,
const char *secstrings)
Index: linux-2.6.11-rc1-bk1-Percpu/include/linux/percpu.h
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/include/linux/percpu.h 2005-01-14 13:19:04.985580688 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/include/linux/percpu.h 2005-01-14 13:19:24.846561360 +1100
@@ -1,7 +1,7 @@
#ifndef __LINUX_PERCPU_H
#define __LINUX_PERCPU_H
#include <linux/spinlock.h> /* For preempt_disable() */
-#include <linux/slab.h> /* For kmalloc() */
+#include <linux/slab.h> /* FIXME: remove this, fix hangers-on --RR */
#include <linux/smp.h>
#include <linux/string.h> /* For memset() */
#include <asm/percpu.h>
@@ -20,24 +20,18 @@
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
extern unsigned long __per_cpu_offset[NR_CPUS];

-struct percpu_data {
- void *ptrs[NR_CPUS];
- void *blkp;
-};
-
/*
* Use this to get to a cpu's version of the per-cpu object allocated using
- * alloc_percpu. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
+ * alloc_percpu. If you want to get "this cpu's version", maybe you want
+ * to use get_cpu_ptr...
*/
-#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
+#define per_cpu_ptr(ptr, cpu) \
+ ((__typeof__(ptr))((void *)ptr + __per_cpu_offset[(cpu)]))

-extern void *__alloc_percpu(size_t size, size_t align);
+extern void *__alloc_percpu(unsigned long size, unsigned long align);
extern void free_percpu(const void *);
+extern void *percpu_modalloc(unsigned long size, unsigned long align);
+extern void percpu_modfree(void *freeme);

#else /* CONFIG_SMP */

@@ -55,6 +49,14 @@
kfree(ptr);
}

+static inline void *percpu_modalloc(unsigned long size, unsigned long align)
+{
+ return NULL;
+}
+
+static inline void percpu_modfree(void *freeme)
+{
+}
#endif /* CONFIG_SMP */

/* Simple wrapper for the common case: zeros memory. */
Index: linux-2.6.11-rc1-bk1-Percpu/mm/slab.c
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/mm/slab.c 2005-01-13 12:11:12.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/mm/slab.c 2005-01-14 13:21:04.201457120 +1100
@@ -2476,51 +2476,6 @@

EXPORT_SYMBOL(__kmalloc);

-#ifdef CONFIG_SMP
-/**
- * __alloc_percpu - allocate one copy of the object for every present
- * cpu in the system, zeroing them.
- * Objects should be dereferenced using the per_cpu_ptr macro only.
- *
- * @size: how many bytes of memory are required.
- * @align: the alignment, which can't be greater than SMP_CACHE_BYTES.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- int i;
- struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);
-
- if (!pdata)
- return NULL;
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- pdata->ptrs[i] = kmem_cache_alloc_node(
- kmem_find_general_cachep(size, GFP_KERNEL),
- cpu_to_node(i));
-
- if (!pdata->ptrs[i])
- goto unwind_oom;
- memset(pdata->ptrs[i], 0, size);
- }
-
- /* Catch derefs w/o wrappers */
- return (void *) (~(unsigned long) pdata);
-
-unwind_oom:
- while (--i >= 0) {
- if (!cpu_possible(i))
- continue;
- kfree(pdata->ptrs[i]);
- }
- kfree(pdata);
- return NULL;
-}
-
-EXPORT_SYMBOL(__alloc_percpu);
-#endif
-
/**
* kmem_cache_free - Deallocate an object
* @cachep: The cache the allocation was from.
@@ -2584,31 +2539,6 @@

EXPORT_SYMBOL(kfree);

-#ifdef CONFIG_SMP
-/**
- * free_percpu - free previously allocated percpu memory
- * @objp: pointer returned by alloc_percpu.
- *
- * Don't free memory not originally allocated by alloc_percpu()
- * The complemented objp is to check for that.
- */
-void
-free_percpu(const void *objp)
-{
- int i;
- struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- kfree(p->ptrs[i]);
- }
- kfree(p);
-}
-
-EXPORT_SYMBOL(free_percpu);
-#endif
-
unsigned int kmem_cache_size(kmem_cache_t *cachep)
{
return obj_reallen(cachep);
Index: linux-2.6.11-rc1-bk1-Percpu/mm/Makefile
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/mm/Makefile 2005-01-13 12:11:12.000000000 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/mm/Makefile 2005-01-14 13:19:24.868558016 +1100
@@ -17,4 +17,4 @@
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
-
+obj-$(CONFIG_SMP) += percpu.o
Index: linux-2.6.11-rc1-bk1-Percpu/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.11-rc1-bk1-Percpu.orig/include/asm-generic/percpu.h 2005-01-14 13:19:05.099563360 +1100
+++ linux-2.6.11-rc1-bk1-Percpu/include/asm-generic/percpu.h 2005-01-14 13:19:24.869557864 +1100
@@ -10,6 +10,8 @@
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
+#define __get_cpu_ptr(ptr) \
+ ((__typeof__(ptr))((void *)ptr + __per_cpu_offset[smp_processor_id()]))

/* A macro to avoid #include hell... */
#define percpu_modcopy(pcpudst, src, size) \
@@ -20,6 +22,8 @@
memcpy((pcpudst)+__per_cpu_offset[__i], \
(src), (size)); \
} while (0)
+
+void setup_per_cpu_areas(void);
#else /* ! SMP */

#define DEFINE_PER_CPU(type, name) \
@@ -27,7 +31,10 @@

#define per_cpu(var, cpu) (*((void)cpu, &per_cpu__##var))
#define __get_cpu_var(var) per_cpu__##var
-
+#define __get_cpu_ptr(ptr) (ptr)
+static inline void setup_per_cpu_areas(void)
+{
+}
#endif /* SMP */

#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name

--
A bad analogy is like a leaky screwdriver -- Richard Braakman

2005-01-14 09:30:46

by Ravikiran G Thirumalai

Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Thu, Jan 13, 2005 at 12:57:30AM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > ...
> > The following patch re-implements the linux dynamic percpu memory allocator
>
> Heavens, it's complex.

:p

>
> > 1. Percpu memory dereference is faster
> > - One less memory reference compared to existing simple alloc_percpu
> > - As fast as with static percpu areas, one mem ref less actually.
> > 2. Better memory usage
> > - Doesn't need a NR_CPUS pointer array for each allocation
> > - Interlaces objects making better utilization of memory/cachelines
> > - Userspace tests show 98% utilization with random sized allocations
> > after repeated random frees
> > 3. Provides truly node local allocation
> > - The percpu memory with existing alloc_percpu does node local
> > allocation, but the NR_CPUS place holder is not node local. This
> > problem doesn't exist with the new implementation.
>
> But it does consume vmalloc space and will incur additional TLB reload
> costs.
>
> > +static void *
> > +valloc_percpu(void)
> > +{
> > + int i,j = 0;
> > + unsigned int nr_pages;
> > + struct vm_struct *area, tmp;
> > + struct page **tmppage;
> > + struct page *pages[BLOCK_MANAGEMENT_PAGES];
>
> How much stackspace is this guy using on 512-way?

36 bytes. BLOCK_MANAGEMENT_PAGES is not dependent on NR_CPUS; it depends
on the number of objects in a block, which in turn depends on the currency
size and the max obj size. With the current values, BLOCK_MANAGEMENT_PAGES
is 9.

>
> > + unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
> > + struct pcpu_block *blkp = NULL;
> > +
> > + BUG_ON(!IS_ALIGNED(PCPU_BLKSIZE, PAGE_SIZE));
> > + BUG_ON(!PCPU_BLKSIZE);
> > + nr_pages = PCPUPAGES_PER_BLOCK + BLOCK_MANAGEMENT_PAGES;
> > +
> > + /* Alloc Managent block pages */
> > + for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
> > + pages[i] = alloc_pages(GFP_KERNEL, 0);
>
> Can use __GFP_ZERO here.
>
> > + if (!pages[i]) {
> > + while ( --i >= 0 )
> > + __free_pages(pages[i], 0);
> > + return NULL;
> > + }
> > + /* Zero the alloced page */
> > + clear_page(page_address(pages[i]));
>
> And so can remove this.
>
> Cannot highmem pages be used here?

Yes. Will change patch to do that.
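
Roughly, the loop would become something like this (untested sketch, using the
same names as the snippet quoted above):

        /* management pages: let the page allocator zero them, and allow
         * highmem since they are only touched through the block's vmalloc
         * mapping */
        for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
                pages[i] = alloc_pages(GFP_KERNEL | __GFP_ZERO | __GFP_HIGHMEM, 0);
                if (!pages[i]) {
                        while (--i >= 0)
                                __free_pages(pages[i], 0);
                        return NULL;
                }
        }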

>
> > + for ( i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
>
> Patch has a fair amount of whitespace oddities.

Will fix this.

>
> > + /* Alloc node local pages for all cpus possible */
> > + for (i = 0; i < NR_CPUS; i++) {
> > + if (cpu_possible(i)) {
>
> Isn't this equivalent to for_each_cpu()?

Yes, will fix this too. This patch has been in the can for quite some time,
and for_each_cpu wasn't available when I coded this up...
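
i.e. the open-coded scan just becomes (sketch):

        int cpu;

        for_each_cpu(cpu) {
                /* allocate and map this cpu's pages; for_each_cpu already
                 * walks only the possible cpus */
        }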

>
> > + /* Map pages for each cpu by splitting vm_struct for each cpu */
> > + for (i = 0; i < NR_CPUS; i++) {
> > + if (cpu_possible(i)) {
>
> etc.
>
> > +/* Sort obj_map array in ascending order -- simple bubble sort */
> > +static void
> > +sort_obj_map(struct obj_map_elmt map[], int nr)
>
> That'll be unpopular ;) Why not extract qsort from XFS?

Ok, I'll make a patch to move qsort to a common place like linux/lib
and use it in the next iteration. Just out of curiosity, is bubble
sort unpopular or having two sort functions in the kernel source tree an
issue here?

>
> Why cannot the code simply call vmalloc rather than copying its internals?

Node local allocation. vmalloc cannot ensure pages for corresponding
cpus are node local. Also, the design goal was to allocate pages for
cpu_possible cpus only. With plain vmalloc, we will end up allocating
pages for NR_CPUS.
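
What the allocator needs per possible cpu is roughly this (sketch only; the
choice of alloc_pages_node/cpu_to_node here is mine, cpu_pages is
PCPU_BLKSIZE >> PAGE_SHIFT as in the patch):

        int cpu, j;

        for_each_cpu(cpu) {
                for (j = 0; j < cpu_pages; j++) {
                        /* backing pages come from this cpu's own node */
                        struct page *page =
                                alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
                        if (!page)
                                return NULL;    /* real code unwinds first */
                        /* ... later mapped at this cpu's offset within the
                         * block's VA range ... */
                }
        }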

>
> Have you considered trying a simple __alloc_pages, fall back to vmalloc if
> that fails, or if the requested size is more than eight pages, or something
> of that sort?

Again, node local allocation will be difficult and also this means we
allocate pages for !cpu_possible cpus too.

I am travelling (we have a long weekend here), so I will mail the revised patch
early next week. Thanks for the review comments.


Thanks,
Kiran

2005-01-14 09:34:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

Ravikiran G Thirumalai <[email protected]> wrote:
>
> Just out of curiosity, is bubble
> sort unpopular or having two sort functions in the kernel source tree an
> issue here?

yes and yes ;)

> >
> > Why cannot the code simply call vmalloc rather than copying its internals?
>
> Node local allocation. vmalloc cannot ensure pages for corresponding
> cpus are node local. Also, the design goal was to allocate pages for
> cpu_possible cpus only. With plain vmalloc, we will end up allocating
> pages for NR_CPUS.

So... is it not possible to enhance vmalloc() for node-awareness, then
just use it?

2005-01-14 09:38:35

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Fri, Jan 14, 2005 at 01:24:47PM +1100, Rusty Russell wrote:
> On Thu, 2005-01-13 at 14:04 +0530, Ravikiran G Thirumalai wrote:
> > ...
> >
> > The following patch re-implements the linux dynamic percpu memory allocator
> > so that:
> > 1. Percpu memory dereference is faster
> > - One less memory reference compared to existing simple alloc_percpu
> > - As fast as with static percpu areas, one mem ref less actually.
>
> Hmm, for me one point of a good dynamic per-cpu implementation is that
> the same per_cpu_offset be used as for the static per-cpu variables.
> This means that architectures can put it in a register.

The allocator I have posted can be easily fixed for that. In fact, I had a
version which used per_cpu_offset. My earlier experiments proved that
pointer arithmetic is marginally faster than the per_cpu_offset
based version. Also, why waste a register if we can achieve same dereference
speeds without using a register? But as I said, we can modify the allocator I
posted to use per_cpu_offset.
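
For reference, the two flavours of dereference being compared are roughly
(macro names made up here; PCPU_BLKSIZE is the allocator's per-cpu stride):

        /* (a) pointer arithmetic, as in the posted allocator: the address
         *     returned for cpu0 plus a constant stride */
        #define pcpu_ptr_arith(ptr, cpu) \
                ((__typeof__(ptr))((char *)(ptr) + (cpu) * PCPU_BLKSIZE))

        /* (b) per_cpu_offset based, i.e. the same offset the static per-cpu
         *     variables use; an arch can keep its own offset in a register */
        #define pcpu_ptr_offset(ptr, cpu) \
                ((__typeof__(ptr))((char *)(ptr) + __per_cpu_offset[cpu]))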


> It also has
> different properties than slab, because tiny allocations will be more
> common (ie. one counter).


As regards tiny allocations, many existing users of alloc_percpu are
structures which are aggregations of many counters -- like the mib statistics,
disk_stats etc. So IMHO, dynamic percpu allocator should work well for
both small and large objects. I have done some user space measurements
with my version and the allocator behaves well with both small and random sized
objects.
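
For instance (made-up struct, just to illustrate the aggregate-of-counters
pattern):

        struct foo_stats {              /* cf. mib statistics, disk_stats */
                unsigned long   reads;
                unsigned long   writes;
                unsigned long   errors;
        };

        struct foo_stats *stats = alloc_percpu(struct foo_stats);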

The advantages of my implementation, as I see them, are:
1. Independent of slab/kmalloc
2. Doesn't need arches to do their own node local allocation -- the generic
version does that.
3. Space efficient -- about 98% utilization achieved in userspace tests
4. Allocates pages for cpu_possible cpus only

Disadvantage:
1. Uses vmalloc space and hence has an additional TLB footprint -- but is there
a better way to get node local allocation, simple pointer arithmetic
and page allocation for cpu_possible cpus only?

Thanks,
Kiran

2005-01-14 10:42:24

by Rusty Russell

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Fri, 2005-01-14 at 15:28 +0530, Ravikiran G Thirumalai wrote:
> On Fri, Jan 14, 2005 at 01:24:47PM +1100, Rusty Russell wrote:
> > On Thu, 2005-01-13 at 14:04 +0530, Ravikiran G Thirumalai wrote:
> > > ...
> > >
> > > The following patch re-implements the linux dynamic percpu memory allocator
> > > so that:
> > > 1. Percpu memory dereference is faster
> > > - One less memory reference compared to existing simple alloc_percpu
> > > - As fast as with static percpu areas, one mem ref less actually.
> >
> > Hmm, for me one point of a good dynamic per-cpu implementation is that
> > the same per_cpu_offset be used as for the static per-cpu variables.
> > This means that architectures can put it in a register.
>
> The allocator I have posted can be easily fixed for that. In fact, I had a
> version which used per_cpu_offset. My earlier experiments proved that
> pointer arithmetic is marginally faster than the per_cpu_offset
> based version. Also, why waste a register if we can achieve same dereference
> speeds without using a register? But as I said, we can modify the allocator I
> posted to use per_cpu_offset.

Because we ALSO use per_cpu_offset for static per-cpu. In fact, it
starts to make sense to make smp_processor_id a normal per-cpu variable,
since increasingly we just want "this cpu's version of the
data structure".

i.e. increasingly, this cpu's per_cpu_offset is fundamental. The fact
that currently smp_processor_id() is fundamental and per_cpu_offset is
derived warps the current timings.

> As regards tiny allocations, many existing users of alloc_percpu are
> structures which are aggregations of many counters -- like the mib statistics,
> disk_stats etc. So IMHO, dynamic percpu allocator should work well for
> both small and large objects. I have done some user space measurements
> with my version and the allocator behaves well with both small and random sized
> objects.

See this experimental kref replacement to see what I mean.

BTW, I'm really glad someone's working on this. Thanks!
Rusty.

Author: Rusty Russell
Status: Experimental
Depends: Percpu/kmalloc_percpu-full.patch.gz

This is an implementation of cache-friendly reference counts, using
kmalloc_percpu.

The refcounters work in two modes: the normal mode just incs and
decs a per-cpu counter. When someone is waiting for the reference
count to hit 0, a common counter is used.
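
Usage ends up looking roughly like this (untested; the struct, the release
function and the helpers are invented for the example):

        #include <linux/bigref.h>
        #include <linux/kernel.h>
        #include <linux/slab.h>

        struct thing {
                struct bigref ref;
                /* ... */
        };

        static void thing_release(struct bigref *ref)
        {
                kfree(container_of(ref, struct thing, ref));
        }

        static struct thing *new_thing(void)
        {
                struct thing *t = kmalloc(sizeof(*t), GFP_KERNEL);
                if (t)
                        bigref_init(&t->ref);   /* count = 1, per-cpu fast mode */
                return t;
        }

        static void use_thing(struct thing *t)
        {
                bigref_get(&t->ref);            /* fast path: local_inc */
                /* ... */
                bigref_put(&t->ref, thing_release);
        }

        static void unpublish_thing(struct thing *t)
        {
                /* switch to the common counter so the count can actually
                 * hit zero, then drop the initial reference */
                bigref_disown(&t->ref);
                bigref_put(&t->ref, thing_release);
        }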

Index: linux-2.6.10-rc2-bk13-Percpu/kernel/bigref.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10-rc2-bk13-Percpu/kernel/bigref.c 2004-12-02 14:50:54.070536528 +1100
@@ -0,0 +1,98 @@
+#include <linux/bigref.h>
+#include <linux/compiler.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+#include <linux/module.h>
+#include <asm/system.h>
+
+static inline int is_slow_mode(const struct bigref *bigref)
+{
+ return atomic_read(&bigref->slow_ref) != 0;
+}
+
+/* Initialize reference to 1. */
+void bigref_init(struct bigref *bigref)
+{
+ bigref->local_ref = alloc_percpu(local_t);
+
+ /* If we can't allocate, we just stay in slow mode. */
+ if (!bigref->local_ref)
+ atomic_set(&bigref->slow_ref, 1);
+ else {
+ /* Bump any counter to 1. */
+ local_set(__get_cpu_ptr(bigref->local_ref), 1);
+ atomic_set(&bigref->slow_ref, 0);
+ }
+}
+
+/* Disown reference: next bigref_put can hit zero */
+void bigref_disown(struct bigref *bigref)
+{
+ int i, bias = 0x7FFFFFFF;
+ if (unlikely(is_slow_mode(bigref))) {
+ /* Must have been alloc fail, not double disown. */
+ BUG_ON(bigref->local_ref);
+ return;
+ }
+
+ /* Insert high number so this doesn't go to zero now. */
+ atomic_set(&bigref->slow_ref, bias);
+
+ /* Make sure everyone sees it and is using slow mode. */
+ synchronize_kernel();
+
+ /* Take away bias, and add sum of local counters. */
+ for_each_cpu(i)
+ bias -= local_read(per_cpu_ptr(bigref->local_ref, i));
+ atomic_sub(bias, &bigref->slow_ref);
+
+ /* This caller should be holding one reference. */
+ BUG_ON(atomic_read(&bigref->slow_ref) == 0);
+}
+
+/* Grab reference */
+void bigref_get(struct bigref *bigref)
+{
+ if (unlikely(is_slow_mode(bigref)))
+ atomic_inc(&bigref->slow_ref);
+ else
+ local_inc(__get_cpu_ptr(bigref->local_ref));
+}
+
+/* Drop reference */
+void bigref_put(struct bigref *bigref, void (*release)(struct bigref *bigref))
+{
+ if (unlikely(is_slow_mode(bigref))) {
+ if (atomic_dec_and_test(&bigref->slow_ref)) {
+ free_percpu(bigref->local_ref);
+ release(bigref);
+ }
+ } else
+ local_dec(__get_cpu_ptr(bigref->local_ref));
+}
+
+void bigref_destroy(struct bigref *bigref)
+{
+ if (bigref->local_ref)
+ free_percpu(bigref->local_ref);
+}
+
+/* Get current value of reference, useful for debugging info. */
+unsigned int bigref_val(struct bigref *bigref)
+{
+ unsigned int sum = 0, i;
+
+ if (unlikely(is_slow_mode(bigref)))
+ sum = atomic_read(&bigref->slow_ref);
+ else
+ for_each_cpu(i)
+ sum += local_read(per_cpu_ptr(bigref->local_ref, i));
+ return sum;
+}
+
+EXPORT_SYMBOL(bigref_init);
+EXPORT_SYMBOL(bigref_disown);
+EXPORT_SYMBOL(bigref_get);
+EXPORT_SYMBOL(bigref_put);
+EXPORT_SYMBOL(bigref_destroy);
+EXPORT_SYMBOL(bigref_val);
Index: linux-2.6.10-rc2-bk13-Percpu/include/linux/bigref.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10-rc2-bk13-Percpu/include/linux/bigref.h 2004-12-02 14:50:37.216098792 +1100
@@ -0,0 +1,76 @@
+#ifndef _LINUX_BIGREF_H
+#define _LINUX_BIGREF_H
+/* Per-cpu reference counters. Useful when speed is important, and
+ counter will only hit zero after some explicit event (such as being
+ discarded from a list).
+
+ (C) 2003 Rusty Russell, IBM Corporation.
+*/
+#include <linux/config.h>
+#include <asm/atomic.h>
+#include <asm/local.h>
+
+#ifdef CONFIG_SMP
+struct bigref
+{
+ /* If this is zero, we use per-cpu counters. */
+ atomic_t slow_ref;
+ local_t *local_ref;
+};
+
+/* Initialize reference to 1. */
+void bigref_init(struct bigref *bigref);
+/* Disown reference: next bigref_put can hit zero */
+void bigref_disown(struct bigref *bigref);
+/* Grab reference */
+void bigref_get(struct bigref *bigref);
+/* Drop reference */
+void bigref_put(struct bigref *bigref, void (*release)(struct bigref *bigref));
+/* Destroy bigref prematurely (which might not have hit zero) */
+void bigref_destroy(struct bigref *bigref);
+
+/* Get current value of reference, useful for debugging info. */
+unsigned int bigref_val(struct bigref *bigref);
+#else /* ... !SMP */
+struct bigref
+{
+ atomic_t ref;
+};
+
+/* Initialize reference to 1. */
+static inline void bigref_init(struct bigref *bigref)
+{
+ atomic_set(&bigref->ref, 1);
+}
+
+/* Disown reference: next bigref_put can hit zero */
+static inline void bigref_disown(struct bigref *bigref)
+{
+}
+
+/* Grab reference */
+static inline void bigref_get(struct bigref *bigref)
+{
+ atomic_inc(&bigref->ref);
+}
+
+/* Drop reference */
+static inline void bigref_put(struct bigref *bigref,
+ void (*release)(struct bigref *bigref))
+{
+ if (atomic_dec_and_test(&bigref->ref))
+ release(bigref);
+}
+
+/* Get current value of reference, useful for debugging info. */
+static inline unsigned int bigref_val(struct bigref *bigref)
+{
+ return atomic_read(&bigref->ref);
+}
+
+static inline void bigref_destroy(struct bigref *bigref)
+{
+}
+#endif /* !SMP */
+
+#endif /* _LINUX_BIGREF_H */
Index: linux-2.6.10-rc2-bk13-Percpu/kernel/Makefile
===================================================================
--- linux-2.6.10-rc2-bk13-Percpu.orig/kernel/Makefile 2004-11-16 15:30:10.000000000 +1100
+++ linux-2.6.10-rc2-bk13-Percpu/kernel/Makefile 2004-12-02 14:50:37.216098792 +1100
@@ -11,7 +11,7 @@

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
-obj-$(CONFIG_SMP) += cpu.o spinlock.o
+obj-$(CONFIG_SMP) += cpu.o spinlock.o bigref.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o

--
A bad analogy is like a leaky screwdriver -- Richard Braakman

2005-01-17 18:31:22

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Fri, Jan 14, 2005 at 01:34:25AM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
>
> > >
> > > Why cannot the code simply call vmalloc rather than copying its internals?
> >
> > Node local allocation. vmalloc cannot ensure pages for corresponding
> > cpus are node local. Also, the design goal was to allocate pages for
> > cpu_possible cpus only. With plain vmalloc, we will end up allocating
> > pages for NR_CPUS.
>
> So... is it not possible to enhance vmalloc() for node-awareness, then
> just use it?
>

Memory for block management (free lists, bufctl lists) is also resident
in one block. A typical block in this allocator looks like this:


VMALLOC_ADDR   PAGE_ADDR           BLOCK
============   =========          ========

0xa0000        0x10100    -----------------  ^               ^
   .                      |               |  |               |
   .                      |     cpu 0     |  PCPU_BLKSIZE    |
                          |               |  |               |
0xa0100        0x30100    -----------------  v               |
                          |               |                  |
                          |     cpu 1     |                  |
                          |               |                  |
0xa0200        -          -----------------                NR_CPUS
                          |               |                  |
                          | !cpu_possible |                  |
                          |               |                  |
0xa0300        -          -----------------                  |
                          |               |                  |
                          | !cpu_possible |                  |
                          |               |                  |
0xa0400        0x10300    -----------------  ^               v
                          |               |  |
                          |  Block mgmt  |   BLOCK_MANAGEMENT_SIZE
                          |               |  |
0xa05ff                   -----------------  v



This block is set up by valloc_percpu in the allocator code. There is a lot
of allocator-specific detail used here, like PCPU_BLKSIZE and
BLOCK_MANAGEMENT_SIZE, and I thought it was not appropriate to put that in
vmalloc.c. A common vmalloc_percpu which takes PCPU_BLKSIZE and
BLOCK_MANAGEMENT_SIZE as arguments would not be useful anywhere else.
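
In other words, the only address arithmetic the layout needs is (helper names
invented here for illustration):

        /* start of cpu N's slice within a block */
        static void *pcpu_cpu_addr(void *block_start, int cpu)
        {
                return (char *)block_start + cpu * PCPU_BLKSIZE;
        }

        /* the block management area sits after all NR_CPUS slices */
        static void *pcpu_mgmt_addr(void *block_start)
        {
                return (char *)block_start + NR_CPUS * PCPU_BLKSIZE;
        }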

Changed patchset with other modifications suggested will follow.

Thanks,
Kiran

2005-01-17 22:38:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

Ravikiran G Thirumalai <[email protected]> wrote:
>
> > So... is it not possible to enhance vmalloc() for node-awareness, then
> > just use it?
> >
>
> Memory for block management (free lists, bufctl lists) is also resident
> in one block. A typical block in this allocator looks like this:
>

I still don't get it. It is possible to calculate the total size of the
block beforehand, yes? So why not simply vmalloc_numa() a block of that
size then populate it?

2005-01-18 05:56:17

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [patch] mm: Reimplementation of dynamic percpu memory allocator

On Mon, Jan 17, 2005 at 02:11:17PM -0800, Andrew Morton wrote:
> Ravikiran G Thirumalai <[email protected]> wrote:
> >
> > > So... is it not possible to enhance vmalloc() for node-awareness, then
> > > just use it?
> > >
> >
> > Memory for block management (free lists, bufctl lists) is also resident
> > in one block. A typical block in this allocator looks like this:
> >
>
> I still don't get it. It is possible to calculate the total size of the
> block beforehand, yes? So why not simply vmalloc_numa() a block of that
> size then populate it?

I should be excited to add a new API ;), but then we would have something
like:

void *vmalloc_numa(unsigned long size, unsigned long extra)

which would:
1. Allocate (size * NR_CPUS + extra) worth of VA space (a vm_struct)
2. Allocate node local pages amounting to 'size' for each possible cpu
3. Allocate pages for 'extra'
4. Map the pages allocated for each cpu to the corresponding
VA space, beginning at vm_struct.addr and running to
(vm_struct.addr + NR_CPUS * size - 1)
5. Map vm_struct.addr + NR_CPUS * size to the pages allocated for 'extra' in (3).

It is the need for this 'extra' -- the block management memory -- which made
me decide against a common API outside the allocator. If you feel vmalloc_numa
is the right approach, I will make a patch to put it in vmalloc.c in the
next iteration.
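
For what it's worth, a bare skeleton of such a vmalloc_numa (hypothetical and
untested, just following the five steps above) would look something like:

        void *vmalloc_numa(unsigned long size, unsigned long extra)
        {
                struct vm_struct *area;
                int cpu;

                /* 1: VA space for the NR_CPUS slices plus the extra tail */
                area = get_vm_area(size * NR_CPUS + extra, VM_ALLOC);
                if (!area)
                        return NULL;

                /* 2 + 4: node local pages for each possible cpu, mapped at
                 * area->addr + cpu * size */
                for_each_cpu(cpu) {
                        /* alloc_pages_node(cpu_to_node(cpu), ...), then map */
                }

                /* 3 + 5: pages for 'extra', mapped at
                 * area->addr + NR_CPUS * size */

                return area->addr;
        }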

Thanks,
Kiran