I have been maintaining a set of patches based on the reimplemented dynamic
per-cpu allocator. I would like to have them included in -mm for testing
and maybe merged when the mainline merge window opens :)...
The reimplementation of the dynamic per-cpu allocator has been published
and discussed earlier, but it has not previously been posted together with
its applications and supporting numbers. Here is an attempt to do that.
This patchset contains
a) Reimplementation of alloc_percpu.
b) Rusty's implementation of distributed reference counters (bigrefs)
c) Code to change struct net_device.refcnt to a bigref -- this patch improves
tbench performance by 6% on an 8-way 3.00 GHz x86 Intel Xeon (x445).
d) Code to make struct dst_entry.__refcnt per-cpu. This patch was
originally submitted some time back by Christoph. The reworked patch uses
alloc_percpu, and struct dst_entry does not bloat up now. This patch, along with
the netdevice ref-counter patch above, gives a whopping 55% improvement on
tbench on an 8-way x86 Xeon (x445). The same patchset resulted in 30% better
tbench throughput on an x460 (8-way 3.3 GHz x86_64 Xeon).
All tbench numbers were on loopback with 8 clients.
The patches consist of
1. vmalloc_fixup
(Fix up __get_vm_area to take gfp flags as extra arg -- preparatory for 3.)
2. alloc_percpu
(Main allocator)
3. alloc_percpu_atomic
(Add GFP_FLAGS args to alloc_percpu -- for dst_entry.refcount)
4. change_alloc_percpu_users
(Change alloc_percpu users to use modified interface (with gfp_flags))
5. add_getcpuptr
(This is needed for bigrefs)
6. bigrefs
(Fixed-up bigref from Rusty)
7. netdev_refcnt_bigref.patch
(Bigref based netdev refcount)
8. dst_abstraction
9. dst_alloc_percpu
(dst_entry.refcount patches)
10. allow_early_mapvmarea
11. hotplug_alloc_percpu_blocks
(If alloc_percpu needs to be used very early, then these patches will be
needed. They allow us to use bigrefs/alloc_percpu for refcounters
like struct vfsmount.mnt_count)
Patch to add gfp_flags as an argument to __get_vm_area. alloc_percpu needs to
take GFP flags because dst_entry.refcount needs to be allocated with
GFP_ATOMIC. Since alloc_percpu uses get_vm_area underneath, this patch changes
__get_vm_area to accept gfp_flags as an argument, so that alloc_percpu can use
__get_vm_area. get_vm_area itself remains unchanged.
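For illustration, a caller that cannot sleep would now look roughly like the
sketch below. The wrapper function is purely hypothetical and not part of the
patch; only the __get_vm_area signature comes from this series.

#include <linux/vmalloc.h>

/* Hypothetical example: reserving VA space from a context that cannot sleep.
 * The gfp argument only controls the kmalloc of the vm_struct itself.
 */
static struct vm_struct *reserve_atomic_va(unsigned long size)
{
	return __get_vm_area(size, VM_MAP, VMALLOC_START, VMALLOC_END,
			     GFP_ATOMIC);
}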
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: alloc_percpu-2.6.13-rc6/arch/arm/kernel/module.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/arch/arm/kernel/module.c 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/arch/arm/kernel/module.c 2005-08-14 23:03:49.000000000 -0700
@@ -40,7 +40,8 @@
if (!size)
return NULL;
- area = __get_vm_area(size, VM_ALLOC, MODULE_START, MODULE_END);
+ area = __get_vm_area(size, VM_ALLOC, MODULE_START, MODULE_END,
+ GFP_KERNEL);
if (!area)
return NULL;
Index: alloc_percpu-2.6.13-rc6/arch/sh/kernel/cpu/sh4/sq.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/arch/sh/kernel/cpu/sh4/sq.c 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/arch/sh/kernel/cpu/sh4/sq.c 2005-08-14 23:04:48.000000000 -0700
@@ -189,7 +189,8 @@
* writeout before we hit the TLB flush, we do it anyways. This way
* we at least save ourselves the initial page fault overhead.
*/
- vma = __get_vm_area(map->size, VM_ALLOC, map->sq_addr, SQ_ADDRMAX);
+ vma = __get_vm_area(map->size, VM_ALLOC, map->sq_addr, SQ_ADDRMAX,
+ GFP_KERNEL);
if (!vma)
return ERR_PTR(-ENOMEM);
Index: alloc_percpu-2.6.13-rc6/arch/sparc64/kernel/module.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/arch/sparc64/kernel/module.c 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/arch/sparc64/kernel/module.c 2005-08-14 23:05:49.000000000 -0700
@@ -25,7 +25,8 @@
if (!size || size > MODULES_LEN)
return NULL;
- area = __get_vm_area(size, VM_ALLOC, MODULES_VADDR, MODULES_END);
+ area = __get_vm_area(size, VM_ALLOC, MODULES_VADDR, MODULES_END,
+ GFP_KERNEL);
if (!area)
return NULL;
Index: alloc_percpu-2.6.13-rc6/arch/x86_64/kernel/module.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/arch/x86_64/kernel/module.c 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/arch/x86_64/kernel/module.c 2005-08-14 23:06:28.000000000 -0700
@@ -48,7 +48,8 @@
if (size > MODULES_LEN)
return NULL;
- area = __get_vm_area(size, VM_ALLOC, MODULES_VADDR, MODULES_END);
+ area = __get_vm_area(size, VM_ALLOC, MODULES_VADDR, MODULES_END,
+ GFP_KERNEL);
if (!area)
return NULL;
Index: alloc_percpu-2.6.13-rc6/include/linux/vmalloc.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/vmalloc.h 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/include/linux/vmalloc.h 2005-08-14 23:12:29.000000000 -0700
@@ -39,7 +39,8 @@
*/
extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end,
+ unsigned int gfpflags);
extern struct vm_struct *remove_vm_area(void *addr);
extern struct vm_struct *__remove_vm_area(void *addr);
extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
Index: alloc_percpu-2.6.13-rc6/mm/vmalloc.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/vmalloc.c 2005-06-17 12:48:29.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/mm/vmalloc.c 2005-08-14 22:29:31.000000000 -0700
@@ -161,7 +161,8 @@
#define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT) /* 128 pages */
struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ unsigned int gfp_flags)
{
struct vm_struct **p, *tmp, *area;
unsigned long align = 1;
@@ -180,7 +181,7 @@
addr = ALIGN(start, align);
size = PAGE_ALIGN(size);
- area = kmalloc(sizeof(*area), GFP_KERNEL);
+ area = kmalloc(sizeof(*area), gfp_flags);
if (unlikely(!area))
return NULL;
@@ -245,7 +246,8 @@
*/
struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
{
- return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END);
+ return __get_vm_area(size, flags, VMALLOC_START, VMALLOC_END,
+ GFP_KERNEL);
}
/* Caller must hold vmlist_lock */
The following patch re-implements the Linux dynamic percpu memory allocator
so that:
1. Percpu memory dereference is faster
- One memory reference fewer than the existing simple alloc_percpu
- As fast as static percpu areas (actually one memory reference fewer).
2. Better memory usage
- Doesn't need a NR_CPUS pointer array for each allocation
- Interleaves objects, making better use of memory/cachelines
- Userspace tests show 98% utilization with random-sized allocations
after repeated random frees. Utilization with small, counter-sized
allocations is expected to be better than with random-sized
allocations.
3. Provides truly node-local allocation
- The percpu memory with the existing alloc_percpu is node local, but
the NR_CPUS pointer array placeholder is not. This problem doesn't
exist with the new implementation.
4. CPU Hotplug friendly
5. Independent of slab. Works early
Design:
We have "blocks" of memory akin to slabs. Each block has
(percpu blocksize) * NR_CPUS + (block management data structures) of kernel
VA space allocated to it. Node local pages are allocated and mapped
against the corresponding cpus' VA space. Pages are allocated for block
management data structures and mapped to their corresponding VA. These
reside at (percpu blocksize) * NR_CPUS offset from the beginning of the block.
The allocator maintains a circular linked list of blocks, sorted in
descending order of block utilization. On a request for an object, the
allocator tries to serve it from the most utilized block.
The allocator hands out memory in multiples of a fixed currency size per
request and returns the address of the percpu object corresponding to cpu0.
The cpu-local copy for any given cpu can be obtained by simple arithmetic:
obj_address + cpu_id * PCPU_BLKSIZE.
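For illustration, dereferencing therefore reduces to pointer arithmetic. A
rough sketch follows; struct my_stats and its field are made up for the
example, and stats is assumed to have come from alloc_percpu(struct my_stats):

#include <linux/percpu.h>

struct my_stats { unsigned long packets; };
static struct my_stats *stats;		/* cpu0's copy, from alloc_percpu() */

static void my_stats_inc(void)
{
	/* non-atomic access to this cpu's copy: pin the cpu first */
	per_cpu_ptr(stats, get_cpu())->packets++;
	put_cpu();
}

static unsigned long my_stats_sum(void)
{
	unsigned long sum = 0;
	int cpu;

	/* per_cpu_ptr(p, cpu) is simply p + cpu * PCPU_BLKSIZE */
	for_each_cpu(cpu)
		sum += per_cpu_ptr(stats, cpu)->packets;
	return sum;
}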
Testing:
The block allocator has undergone some userspace stress testing with small,
counter-sized objects as well as a large number of random-sized objects.
Signed-off-by: Ravikiran Thirumalai <[email protected]>
---
Index: alloc_percpu-2.6.13-rc6/include/linux/kernel.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/kernel.h 2005-08-15 00:54:41.154135250 -0400
+++ alloc_percpu-2.6.13-rc6/include/linux/kernel.h 2005-08-15 00:55:06.611726250 -0400
@@ -29,6 +29,8 @@
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1))
+#define IS_ALIGNED(x,a) (!(((a) - 1) & (x)))
+#define IS_POWEROFTWO(x) (!(((x) - 1) & (x)))
#define KERN_EMERG "<0>" /* system is unusable */
#define KERN_ALERT "<1>" /* action must be taken immediately */
Index: alloc_percpu-2.6.13-rc6/include/linux/percpu.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/percpu.h 2005-08-15 00:54:41.154135250 -0400
+++ alloc_percpu-2.6.13-rc6/include/linux/percpu.h 2005-08-15 00:55:06.615726500 -0400
@@ -15,23 +15,19 @@
#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()
-#ifdef CONFIG_SMP
-
-struct percpu_data {
- void *ptrs[NR_CPUS];
- void *blkp;
-};
+/* This is the upper bound for an object using alloc_percpu */
+#define PCPU_BLKSIZE (PAGE_SIZE << 1)
+#define PCPU_CURR_SIZE (sizeof (int))
+#ifdef CONFIG_SMP
/*
* Use this to get to a cpu's version of the per-cpu object allocated using
* alloc_percpu. Non-atomic access to the current CPU's version should
* probably be combined with get_cpu()/put_cpu().
*/
#define per_cpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
+ ((__typeof__(ptr)) \
+ (RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))
extern void *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(const void *);
@@ -56,6 +52,6 @@
/* Simple wrapper for the common case: zeros memory. */
#define alloc_percpu(type) \
- ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))
+ ((type *)(__alloc_percpu(ALIGN(sizeof (type), PCPU_CURR_SIZE), __alignof__(type))))
#endif /* __LINUX_PERCPU_H */
Index: alloc_percpu-2.6.13-rc6/mm/Makefile
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/Makefile 2005-08-15 00:54:41.154135250 -0400
+++ alloc_percpu-2.6.13-rc6/mm/Makefile 2005-08-15 00:55:06.615726500 -0400
@@ -18,5 +18,6 @@
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SHMEM) += shmem.o
obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
+obj-$(CONFIG_SMP) += percpu.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: alloc_percpu-2.6.13-rc6/mm/percpu.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/percpu.c 2005-08-12 07:14:03.324696250 -0400
+++ alloc_percpu-2.6.13-rc6/mm/percpu.c 2005-08-15 00:59:16.099318250 -0400
@@ -0,0 +1,693 @@
+/*
+ * Dynamic percpu memory allocator.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2003
+ *
+ * Author: Ravikiran Thirumalai <[email protected]>
+ *
+ * Originally by Dipankar Sarma and Ravikiran Thirumalai,
+ * This reimplements alloc_percpu to make it
+ * 1. Independent of slab/kmalloc
+ * 2. Use node local memory
+ * 3. Use simple pointer arithmetic
+ * 4. Minimise fragmentation.
+ *
+ * Allocator is slow -- expected to be called during module/subsystem
+ * init. alloc_percpu can block.
+ *
+ */
+
+#include <linux/percpu.h>
+#include <linux/vmalloc.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+
+#include <linux/sort.h>
+#ifdef CONFIG_HIGHMEM
+#include <asm/highmem.h>
+#endif
+#include <asm/pgtable.h>
+#include <asm/hardirq.h>
+
+#define MAX_OBJSIZE PCPU_BLKSIZE
+#define OBJS_PER_BLOCK (PCPU_BLKSIZE/PCPU_CURR_SIZE)
+#define BITMAP_ARR_SIZE (OBJS_PER_BLOCK/(sizeof (unsigned long) * 8))
+#define MAX_NR_BITS (OBJS_PER_BLOCK)
+#define PCPUPAGES_PER_BLOCK ((PCPU_BLKSIZE >> PAGE_SHIFT) * NR_CPUS)
+
+/* Block descriptor */
+struct pcpu_block {
+ void *start_addr;
+ struct page *pages[PCPUPAGES_PER_BLOCK * 2]; /* Extra for block mgt */
+ struct list_head blklist;
+ unsigned long bitmap[BITMAP_ARR_SIZE]; /* Object Freelist */
+ int bufctl_fl[OBJS_PER_BLOCK]; /* bufctl_fl freelist */
+ int bufctl_fl_head;
+ unsigned int size_used;
+};
+
+#define BLK_SIZE_USED(listpos) (list_entry(listpos, \
+ struct pcpu_block, blklist)->size_used)
+
+/* Block list maintenance */
+
+/* Ordered list of pcpu_blocks -- Full, partial first */
+static struct list_head blkhead = LIST_HEAD_INIT(blkhead);
+static struct list_head *firstnotfull = &blkhead;
+static spinlock_t blklist_lock = SPIN_LOCK_UNLOCKED;
+
+/*
+ * Bufctl descriptor and bufctl list for all allocated objs...
+ * Having one list for all buffers in the allocator might not be very efficient
+ * but we are not expecting allocs and frees in the fast path (only during
+ * module load and unload, hopefully).
+ */
+struct buf_ctl {
+ void *addr;
+ size_t size;
+ struct buf_ctl *next;
+};
+
+static struct buf_ctl *buf_head = NULL;
+
+#define BLOCK_MANAGEMENT_SIZE \
+({ \
+ int extra = sizeof (struct buf_ctl)*OBJS_PER_BLOCK \
+ + sizeof (struct pcpu_block); \
+ ALIGN(extra, PAGE_SIZE); \
+})
+
+#define BLOCK_MANAGEMENT_PAGES (BLOCK_MANAGEMENT_SIZE >> PAGE_SHIFT)
+
+void init_pcpu_block(struct pcpu_block *blkp)
+{
+ int i;
+
+ /* Setup the freelist */
+ blkp->bufctl_fl_head = 0;
+ for (i = 0; i < OBJS_PER_BLOCK - 1; i++)
+ blkp->bufctl_fl[i] = i + 1;
+ blkp->bufctl_fl[i] = -1; /* Sentinel to mark End of list */
+}
+
+/*
+ * Allocate PCPU_BLKSIZE * NR_CPUS + BLOCK_MANAGEMENT_SIZE worth of
+ * contiguous kva space, and PCPU_BLKSIZE amount of node local
+ * memory (pages) for all cpus possible + BLOCK_MANAGEMENT_SIZE pages
+ */
+static void *valloc_percpu(void)
+{
+ int i, j = 0;
+ unsigned int nr_pages;
+ struct vm_struct *area, tmp;
+ struct page **tmppage;
+ struct page *pages[BLOCK_MANAGEMENT_PAGES];
+ unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
+ struct pcpu_block *blkp = NULL;
+
+ BUG_ON(!IS_ALIGNED(PCPU_BLKSIZE, PAGE_SIZE));
+ BUG_ON(!PCPU_BLKSIZE);
+ nr_pages = PCPUPAGES_PER_BLOCK + BLOCK_MANAGEMENT_PAGES;
+
+ /* Alloc Managent block pages */
+ for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
+ pages[i] = alloc_pages(GFP_ATOMIC|__GFP_ZERO, 0);
+ if (!pages[i]) {
+ while (--i >= 0)
+ __free_pages(pages[i], 0);
+ return NULL;
+ }
+ }
+
+ /* Get the contiguous VA space for this block */
+ area = __get_vm_area(nr_pages << PAGE_SHIFT, VM_MAP, VMALLOC_START,
+ VMALLOC_END, GFP_KERNEL);
+ if (!area)
+ goto rollback_mgt;
+
+ /* Map pages for the block management pages */
+ tmppage = pages;
+ tmp.addr = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ if (map_vm_area(&tmp, PAGE_KERNEL, &tmppage))
+ goto rollback_vm_area;
+
+ /* Init the block descriptor */
+ blkp = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ init_pcpu_block(blkp);
+ for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ blkp->pages[i + PCPUPAGES_PER_BLOCK] = pages[i];
+
+ /* Alloc node local pages for all cpus possible */
+ for_each_cpu(i) {
+ int start_idx = i * cpu_pages;
+ for (j = start_idx; j < start_idx + cpu_pages; j++) {
+ blkp->pages[j] = alloc_pages_node(cpu_to_node(i)
+ ,
+ GFP_ATOMIC |
+ __GFP_HIGHMEM,
+ 0);
+ if (unlikely(!blkp->pages[j]))
+ goto rollback_pages;
+ }
+ }
+
+ /* Map pages for each cpu by splitting vm_struct for each cpu */
+ for_each_cpu(i) {
+ tmppage = &blkp->pages[i * cpu_pages];
+ tmp.addr = area->addr + i * PCPU_BLKSIZE;
+ /* map_vm_area assumes a guard page of size PAGE_SIZE */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ if (map_vm_area(&tmp, PAGE_KERNEL, &tmppage))
+ goto fail_map;
+ }
+
+ return area->addr;
+
+fail_map:
+ i--;
+ for (; i >= 0; i--) {
+ if (cpu_possible(i)) {
+ tmp.addr = area->addr + i * PCPU_BLKSIZE;
+ /* we've mapped a guard page extra earlier... */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+ }
+ }
+
+ /* set i and j with proper values for the roll back at fail: */
+ i = NR_CPUS - 1;
+ j = PCPUPAGES_PER_BLOCK;
+
+rollback_pages:
+ j--;
+ for (; j >= 0; j--)
+ if (cpu_possible(j / cpu_pages))
+ __free_pages(blkp->pages[j], 0);
+
+ /* Unmap block management */
+ tmp.addr = area->addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+
+rollback_vm_area:
+ /* Give back the contiguous mem area */
+ area = remove_vm_area(area->addr);
+ BUG_ON(!area);
+
+rollback_mgt:
+
+ /* Free the block management pages */
+ for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ __free_pages(pages[i], 0);
+
+ return NULL;
+}
+
+/* Free memory block allocated by valloc_percpu */
+static void vfree_percpu(void *addr)
+{
+ int i;
+ struct pcpu_block *blkp = addr + PCPUPAGES_PER_BLOCK * PAGE_SIZE;
+ struct vm_struct *area, tmp;
+ unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
+ struct page *pages[BLOCK_MANAGEMENT_PAGES];
+
+ /* Backup the block management struct pages */
+ for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ pages[i] = blkp->pages[i + PCPUPAGES_PER_BLOCK];
+
+ /* Unmap all cpu_pages from the block's vm space */
+ for_each_cpu(i) {
+ tmp.addr = addr + i * PCPU_BLKSIZE;
+ /* We've mapped a guard page extra earlier */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+ }
+
+ /* Give back all allocated pages */
+ for (i = 0; i < PCPUPAGES_PER_BLOCK; i++) {
+ if (cpu_possible(i / cpu_pages))
+ __free_pages(blkp->pages[i], 0);
+ }
+
+ /* Unmap block management pages */
+ tmp.addr = addr + NR_CPUS * PCPU_BLKSIZE;
+ tmp.size = BLOCK_MANAGEMENT_SIZE + PAGE_SIZE;
+ unmap_vm_area(&tmp);
+
+ /* Free block management pages */
+ for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++)
+ __free_pages(pages[i], 0);
+
+ /* Give back vm area for this block */
+ area = remove_vm_area(addr);
+ BUG_ON(!area);
+
+}
+
+static int add_percpu_block(void)
+{
+ struct pcpu_block *blkp;
+ void *start_addr;
+ unsigned long flags;
+
+ start_addr = valloc_percpu();
+ if (!start_addr)
+ return 0;
+ blkp = start_addr + PCPUPAGES_PER_BLOCK * PAGE_SIZE;
+ blkp->start_addr = start_addr;
+ spin_lock_irqsave(&blklist_lock, flags);
+ list_add_tail(&blkp->blklist, &blkhead);
+ if (firstnotfull == &blkhead)
+ firstnotfull = &blkp->blklist;
+ spin_unlock_irqrestore(&blklist_lock, flags);
+
+ return 1;
+}
+
+struct obj_map_elmt {
+ int startbit;
+ int obj_size;
+};
+
+/* Fill the array with obj map info and return no of elements in the array */
+static int
+make_obj_map(struct obj_map_elmt arr[], struct pcpu_block *blkp)
+{
+ int nr_elements = 0;
+ int i, j, obj_size;
+
+ for (i = 0, j = 0; i < MAX_NR_BITS; i++) {
+ if (!test_bit(i, blkp->bitmap)) {
+ /* Free block start */
+ arr[j].startbit = i;
+ nr_elements++;
+ obj_size = 1;
+ i++;
+ while (i < MAX_NR_BITS && (!test_bit(i, blkp->bitmap))) {
+ i++;
+ obj_size++;
+ }
+ arr[j].obj_size = obj_size * PCPU_CURR_SIZE;
+ j++;
+ }
+ }
+
+ return nr_elements;
+}
+
+/* Compare routine for sort -- for ascending order */
+static int obj_map_cmp(const void *a, const void *b)
+{
+ struct obj_map_elmt *sa, *sb;
+
+ sa = (struct obj_map_elmt *) a;
+ sb = (struct obj_map_elmt *) b;
+ return sa->obj_size - sb->obj_size;
+}
+
+/* Add bufctl to list of bufctl */
+static void add_bufctl(struct buf_ctl *bufp)
+{
+ if (buf_head == NULL)
+ buf_head = bufp;
+ else {
+ bufp->next = buf_head;
+ buf_head = bufp;
+ }
+}
+
+/* After you alloc from a block, it can only go up the ordered list */
+static void sort_blk_list_up(struct pcpu_block *blkp)
+{
+ struct list_head *pos;
+
+ for (pos = blkp->blklist.prev; pos != &blkhead; pos = pos->prev) {
+ if (BLK_SIZE_USED(pos) < blkp->size_used) {
+ /* Move blkp up */
+ list_del(&blkp->blklist);
+ list_add_tail(&blkp->blklist, pos);
+ pos = &blkp->blklist;
+ } else
+ break;
+ }
+ /* Fix firstnotfull if needed */
+ if (blkp->size_used == PCPU_BLKSIZE) {
+ firstnotfull = blkp->blklist.next;
+ return;
+ }
+ if (blkp->size_used > BLK_SIZE_USED(firstnotfull)) {
+ firstnotfull = &blkp->blklist;
+ return;
+ }
+}
+
+struct buf_ctl *alloc_bufctl(struct pcpu_block *blkp)
+{
+ void *bufctl;
+ int head = blkp->bufctl_fl_head;
+ BUG_ON(head == -1); /* Bufctls for this block have been exhausted */
+ blkp->bufctl_fl_head = blkp->bufctl_fl[blkp->bufctl_fl_head];
+ BUG_ON(head >= OBJS_PER_BLOCK);
+ bufctl = (void *) blkp + sizeof (struct pcpu_block) +
+ sizeof (struct buf_ctl) * head;
+ return bufctl;
+}
+
+/* Don't want to kmalloc this -- to avoid dependence on slab for future */
+static struct obj_map_elmt obj_map[OBJS_PER_BLOCK];
+
+/* Scan the freelist and return suitable obj if found */
+static void *
+get_obj_from_block(size_t size, size_t align, struct pcpu_block *blkp)
+{
+ int nr_elements, nr_currency, obj_startbit, obj_endbit;
+ int i, j;
+ void *objp;
+ struct buf_ctl *bufctl;
+
+ nr_elements = make_obj_map(obj_map, blkp);
+ if (!nr_elements)
+ return NULL;
+
+ /* Sort list in ascending order */
+ sort(obj_map, nr_elements, sizeof (obj_map[0]), obj_map_cmp, NULL);
+
+ /* Get the smallest obj_sized chunk for this size */
+ i = 0;
+ while (i < nr_elements - 1 && size > obj_map[i].obj_size)
+ i++;
+ if (obj_map[i].obj_size < size) /* No suitable obj_size found */
+ return NULL;
+
+ /* chunk of obj_size >= size is found, check for suitability (align)
+ * and alloc
+ */
+ nr_currency = size / PCPU_CURR_SIZE;
+ obj_startbit = obj_map[i].startbit;
+
+try_again_for_align:
+
+ obj_endbit = obj_map[i].startbit + obj_map[i].obj_size / PCPU_CURR_SIZE
+ - 1;
+ objp = obj_startbit * PCPU_CURR_SIZE + blkp->start_addr;
+
+ if (IS_ALIGNED((unsigned long) objp, align)) {
+ /* Alignment is ok so alloc this chunk */
+ bufctl = alloc_bufctl(blkp);
+ if (!bufctl)
+ return NULL;
+ bufctl->addr = objp;
+ bufctl->size = size;
+ bufctl->next = NULL;
+
+ /* Mark the bitmap as allocated */
+ for (j = obj_startbit; j < nr_currency + obj_startbit; j++)
+ set_bit(j, blkp->bitmap);
+ blkp->size_used += size;
+ /* Re-arrange list to preserve full, partial and free order */
+ sort_blk_list_up(blkp);
+ /* Add to the allocated buffers list and return */
+ add_bufctl(bufctl);
+ return objp;
+ } else {
+ /* Alignment is not ok */
+ int obj_size = (obj_endbit - obj_startbit + 1) * PCPU_CURR_SIZE;
+ if (obj_size > size && obj_startbit <= obj_endbit) {
+ /* Since obj_size is bigger than requested, check if
+ alignment can be met by changing startbit */
+ obj_startbit++;
+ goto try_again_for_align;
+ } else {
+ /* Try in the next chunk */
+ if (++i < nr_elements) {
+ /* Reset start bit and try again */
+ obj_startbit = obj_map[i].startbit;
+ goto try_again_for_align;
+ }
+ }
+ }
+
+ /* Everything failed so return NULL */
+ return NULL;
+}
+
+static void zero_obj(void *obj, size_t size)
+{
+ int cpu;
+ for_each_cpu(cpu)
+ memset(per_cpu_ptr(obj, cpu), 0, size);
+}
+
+/*
+ * __alloc_percpu - allocate one copy of the object for every present
+ * cpu in the system, zeroing them.
+ * Objects should be dereferenced using per_cpu_ptr/get_cpu_ptr
+ * macros only
+ *
+ * This allocator is slow as we assume allocs to come
+ * by during boot/module init.
+ * Should not be called from interrupt context
+ */
+void *__alloc_percpu(size_t size, size_t align)
+{
+ struct pcpu_block *blkp;
+ struct list_head *l;
+ void *obj;
+ unsigned long flags;
+
+ if (!size)
+ return NULL;
+
+ if (size < PCPU_CURR_SIZE)
+ size = PCPU_CURR_SIZE;
+
+ if (align == 0)
+ align = PCPU_CURR_SIZE;
+
+ if (size > MAX_OBJSIZE) {
+ printk("alloc_percpu: ");
+ printk("size %d requested is more than I can handle\n", size);
+ return NULL;
+ }
+
+ BUG_ON(!IS_ALIGNED(size, PCPU_CURR_SIZE));
+
+try_after_refill:
+
+ /* Get the block to allocate from */
+ spin_lock_irqsave(&blklist_lock, flags);
+ l = firstnotfull;
+
+try_next_block:
+
+ /* If you have reached end of list, add another block and try */
+ if (l == &blkhead)
+ goto unlock_and_get_mem;
+ blkp = list_entry(l, struct pcpu_block, blklist);
+ obj = get_obj_from_block(size, align, blkp);
+ if (!obj) {
+ l = l->next;
+ goto try_next_block;
+ }
+ spin_unlock_irqrestore(&blklist_lock, flags);
+ zero_obj(obj, size);
+ return obj;
+
+unlock_and_get_mem:
+
+ spin_unlock_irqrestore(&blklist_lock, flags);
+ if (add_percpu_block())
+ goto try_after_refill;
+ return NULL;
+
+}
+
+EXPORT_SYMBOL(__alloc_percpu);
+
+/* After you free from a block, it can only go down the ordered list */
+static void sort_blk_list_down(struct pcpu_block *blkp)
+{
+ struct list_head *pos, *prev, *next;
+ /* Store the actual prev and next pointers for fnof fixing later */
+ prev = blkp->blklist.prev;
+ next = blkp->blklist.next;
+
+ /* Fix the ordering on the list */
+ for (pos = blkp->blklist.next; pos != &blkhead; pos = pos->next) {
+ if (BLK_SIZE_USED(pos) > blkp->size_used) {
+ /* Move blkp down */
+ list_del(&blkp->blklist);
+ list_add(&blkp->blklist, pos);
+ pos = &blkp->blklist;
+ } else
+ break;
+ }
+ /* Fix firstnotfull if needed and return */
+ if (firstnotfull == &blkhead) {
+ /* There was no block free, so now this block is fnotfull */
+ firstnotfull = &blkp->blklist;
+ return;
+ }
+
+ if (firstnotfull == &blkp->blklist) {
+ /* This was firstnotfull so fix fnof pointer accdly */
+ if (prev != &blkhead && BLK_SIZE_USED(prev) != PCPU_BLKSIZE) {
+ /* Move fnof pointer up */
+ firstnotfull = prev;
+ prev = prev->prev;
+ /* If size_used of prev is same as fnof, fix fnof to
+ point to topmost of the equal sized blocks */
+ while (prev != &blkhead &&
+ BLK_SIZE_USED(prev) != PCPU_BLKSIZE) {
+ if (BLK_SIZE_USED(prev) !=
+ BLK_SIZE_USED(firstnotfull))
+ return;
+ firstnotfull = prev;
+ prev = prev->prev;
+ }
+ } else if (next != &blkhead) {
+ /* Move fnof pointer down */
+ firstnotfull = next;
+ next = next->next;
+ if (BLK_SIZE_USED(firstnotfull) != PCPU_BLKSIZE)
+ return;
+ /* fnof is pointing to block which is full...fix it */
+ while (next != &blkhead &&
+ BLK_SIZE_USED(next) == PCPU_BLKSIZE) {
+ firstnotfull = next;
+ next = next->next;
+ }
+ }
+
+ }
+
+}
+
+void free_bufctl(struct pcpu_block *blkp, struct buf_ctl *bufp)
+{
+ int idx = ((void *) bufp - (void *) blkp - sizeof (struct pcpu_block))
+ / sizeof (struct buf_ctl);
+ blkp->bufctl_fl[idx] = blkp->bufctl_fl_head;
+ blkp->bufctl_fl_head = idx;
+}
+
+/*
+ * Free the percpu obj and whatever memory can be freed
+ */
+static void free_percpu_obj(struct list_head *pos, struct buf_ctl *bufp)
+{
+ struct pcpu_block *blkp;
+ blkp = list_entry(pos, struct pcpu_block, blklist);
+
+ /* Update blkp->size_used and free if size_used is 0 */
+ blkp->size_used -= bufp->size;
+ if (blkp->size_used) {
+ /* Mark the bitmap corresponding to this object free */
+ int i, obj_startbit;
+ int nr_currency = bufp->size / PCPU_CURR_SIZE;
+ obj_startbit = (bufp->addr - blkp->start_addr) / PCPU_CURR_SIZE;
+ for (i = obj_startbit; i < obj_startbit + nr_currency; i++)
+ clear_bit(i, blkp->bitmap);
+ sort_blk_list_down(blkp);
+ } else {
+ /* Usecount is zero, so prepare to give this block back to vm */
+ /* Fix firstnotfull if freeing block was firstnotfull
+ * If there are more blocks with the same usecount as fnof,
+ * point to the first block from the head */
+ if (firstnotfull == pos) {
+ firstnotfull = pos->prev;
+ while (firstnotfull != &blkhead) {
+ unsigned int fnf_size_used;
+ fnf_size_used = BLK_SIZE_USED(firstnotfull);
+
+ if (fnf_size_used == PCPU_BLKSIZE)
+ firstnotfull = &blkhead;
+ else if (firstnotfull->prev == &blkhead)
+ break;
+ else if (BLK_SIZE_USED(firstnotfull->prev)
+ == fnf_size_used)
+ firstnotfull = firstnotfull->prev;
+ else
+ break;
+ }
+ }
+ list_del(pos);
+ }
+
+ /* Free bufctl after fixing the bufctl list */
+ if (bufp == buf_head) {
+ buf_head = bufp->next;
+ } else {
+ struct buf_ctl *tmp = buf_head;
+ while (tmp && tmp->next != bufp)
+ tmp = tmp->next;
+ BUG_ON(!tmp || tmp->next != bufp);
+ tmp->next = bufp->next;
+ }
+ free_bufctl(blkp, bufp);
+ /* If usecount is zero, give this block back to vm */
+ if (!blkp->size_used)
+ vfree_percpu(blkp->start_addr);
+ return;
+}
+
+/*
+ * Free memory allocated using alloc_percpu.
+ */
+
+void free_percpu(const void *objp)
+{
+ struct buf_ctl *bufp;
+ struct pcpu_block *blkp;
+ struct list_head *pos;
+ unsigned long flags;
+ if (!objp)
+ return;
+
+ /* Find block from which obj was allocated by scanning bufctl list */
+ spin_lock_irqsave(&blklist_lock, flags);
+ bufp = buf_head;
+ while (bufp) {
+ if (bufp->addr == objp)
+ break;
+ bufp = bufp->next;
+ }
+ BUG_ON(!bufp);
+
+ /* We have the bufctl for the obj here, Now get the block */
+ list_for_each(pos, &blkhead) {
+ blkp = list_entry(pos, struct pcpu_block, blklist);
+ if (objp >= blkp->start_addr &&
+ objp < blkp->start_addr + PCPU_BLKSIZE)
+ break;
+ }
+
+ BUG_ON(pos == &blkhead); /* Couldn't find obj in block list */
+
+ /*
+ * Mark the bitmap free, Update use count, fix the ordered
+ * blklist, free the obj bufctl.
+ */
+ free_percpu_obj(pos, bufp);
+
+ spin_unlock_irqrestore(&blklist_lock, flags);
+ return;
+}
+
+EXPORT_SYMBOL(free_percpu);
Index: alloc_percpu-2.6.13-rc6/mm/slab.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/slab.c 2005-08-15 00:54:41.158135500 -0400
+++ alloc_percpu-2.6.13-rc6/mm/slab.c 2005-08-15 00:55:06.615726500 -0400
@@ -2493,49 +2493,6 @@
}
EXPORT_SYMBOL(__kmalloc);
-#ifdef CONFIG_SMP
-/**
- * __alloc_percpu - allocate one copy of the object for every present
- * cpu in the system, zeroing them.
- * Objects should be dereferenced using the per_cpu_ptr macro only.
- *
- * @size: how many bytes of memory are required.
- * @align: the alignment, which can't be greater than SMP_CACHE_BYTES.
- */
-void *__alloc_percpu(size_t size, size_t align)
-{
- int i;
- struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);
-
- if (!pdata)
- return NULL;
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL,
- cpu_to_node(i));
-
- if (!pdata->ptrs[i])
- goto unwind_oom;
- memset(pdata->ptrs[i], 0, size);
- }
-
- /* Catch derefs w/o wrappers */
- return (void *) (~(unsigned long) pdata);
-
-unwind_oom:
- while (--i >= 0) {
- if (!cpu_possible(i))
- continue;
- kfree(pdata->ptrs[i]);
- }
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL(__alloc_percpu);
-#endif
-
/**
* kmem_cache_free - Deallocate an object
* @cachep: The cache the allocation was from.
@@ -2596,30 +2553,6 @@
}
EXPORT_SYMBOL(kfree);
-#ifdef CONFIG_SMP
-/**
- * free_percpu - free previously allocated percpu memory
- * @objp: pointer returned by alloc_percpu.
- *
- * Don't free memory not originally allocated by alloc_percpu()
- * The complemented objp is to check for that.
- */
-void
-free_percpu(const void *objp)
-{
- int i;
- struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
-
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_possible(i))
- continue;
- kfree(p->ptrs[i]);
- }
- kfree(p);
-}
-EXPORT_SYMBOL(free_percpu);
-#endif
-
unsigned int kmem_cache_size(kmem_cache_t *cachep)
{
return obj_reallen(cachep);
The following patch changes the alloc_percpu interface to accept gfp_flags as
an argument.
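For illustration, usage with the new interface looks roughly like this
(struct foo_stats is a made-up stand-in; most existing callers simply pass
GFP_KERNEL, which is what the follow-on conversion patch does):

struct foo_stats *stats;

/* process context, may sleep */
stats = alloc_percpu(struct foo_stats, GFP_KERNEL);
if (!stats)
	return -ENOMEM;

/* contexts that cannot sleep (e.g. under a spinlock) can now use */
stats = alloc_percpu(struct foo_stats, GFP_ATOMIC);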
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: alloc_percpu-2.6.13-rc6/include/linux/percpu.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/percpu.h 2005-08-15 17:28:42.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/include/linux/percpu.h 2005-08-15 18:44:17.000000000 -0700
@@ -29,16 +29,17 @@
((__typeof__(ptr)) \
(RELOC_HIDE(ptr, PCPU_BLKSIZE * cpu)))
-extern void *__alloc_percpu(size_t size, size_t align);
+extern void *__alloc_percpu(size_t size, size_t align, unsigned int gfpflags);
extern void free_percpu(const void *);
#else /* CONFIG_SMP */
#define per_cpu_ptr(ptr, cpu) (ptr)
-static inline void *__alloc_percpu(size_t size, size_t align)
+static inline void *
+__alloc_percpu(size_t size, size_t align, unsigned int gfpflags)
{
- void *ret = kmalloc(size, GFP_KERNEL);
+ void *ret = kmalloc(size, gfpflags);
if (ret)
memset(ret, 0, size);
return ret;
@@ -51,7 +52,11 @@
#endif /* CONFIG_SMP */
/* Simple wrapper for the common case: zeros memory. */
-#define alloc_percpu(type) \
- ((type *)(__alloc_percpu(ALIGN(sizeof (type), PCPU_CURR_SIZE), __alignof__(type))))
+#define alloc_percpu(type, gfpflags) \
+({ \
+ BUG_ON(~(GFP_ATOMIC|GFP_KERNEL) & gfpflags); \
+ ((type *)(__alloc_percpu(ALIGN(sizeof (type), PCPU_CURR_SIZE), \
+ __alignof__(type), gfpflags))); \
+})
#endif /* __LINUX_PERCPU_H */
Index: alloc_percpu-2.6.13-rc6/mm/percpu.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/percpu.c 2005-08-15 17:28:42.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/mm/percpu.c 2005-08-15 19:01:11.000000000 -0700
@@ -109,7 +109,7 @@
* contiguous kva space, and PCPU_BLKSIZE amount of node local
* memory (pages) for all cpus possible + BLOCK_MANAGEMENT_SIZE pages
*/
-static void *valloc_percpu(void)
+static void *valloc_percpu(unsigned int gfpflags)
{
int i, j = 0;
unsigned int nr_pages;
@@ -118,14 +118,21 @@
struct page *pages[BLOCK_MANAGEMENT_PAGES];
unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
struct pcpu_block *blkp = NULL;
+ unsigned int flags;
BUG_ON(!IS_ALIGNED(PCPU_BLKSIZE, PAGE_SIZE));
BUG_ON(!PCPU_BLKSIZE);
nr_pages = PCPUPAGES_PER_BLOCK + BLOCK_MANAGEMENT_PAGES;
+ /* gfpflags can be either GFP_KERNEL or GFP_ATOMIC only */
+ if (gfpflags & GFP_KERNEL)
+ flags = GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO;
+ else
+ flags = GFP_ATOMIC|__GFP_ZERO;
+
/* Alloc Managent block pages */
for (i = 0; i < BLOCK_MANAGEMENT_PAGES; i++) {
- pages[i] = alloc_pages(GFP_ATOMIC|__GFP_ZERO, 0);
+ pages[i] = alloc_pages(flags, 0);
if (!pages[i]) {
while (--i >= 0)
__free_pages(pages[i], 0);
@@ -135,7 +142,7 @@
/* Get the contiguous VA space for this block */
area = __get_vm_area(nr_pages << PAGE_SHIFT, VM_MAP, VMALLOC_START,
- VMALLOC_END, GFP_KERNEL);
+ VMALLOC_END, gfpflags);
if (!area)
goto rollback_mgt;
@@ -156,11 +163,8 @@
for_each_cpu(i) {
int start_idx = i * cpu_pages;
for (j = start_idx; j < start_idx + cpu_pages; j++) {
- blkp->pages[j] = alloc_pages_node(cpu_to_node(i)
- ,
- GFP_ATOMIC |
- __GFP_HIGHMEM,
- 0);
+ blkp->pages[j] = alloc_pages_node(cpu_to_node(i) ,
+ flags, 0);
if (unlikely(!blkp->pages[j]))
goto rollback_pages;
}
@@ -260,13 +264,13 @@
}
-static int add_percpu_block(void)
+static int add_percpu_block(unsigned int gfpflags)
{
struct pcpu_block *blkp;
void *start_addr;
unsigned long flags;
- start_addr = valloc_percpu();
+ start_addr = valloc_percpu(gfpflags);
if (!start_addr)
return 0;
blkp = start_addr + PCPUPAGES_PER_BLOCK * PAGE_SIZE;
@@ -464,7 +468,7 @@
* by during boot/module init.
* Should not be called from interrupt context
*/
-void *__alloc_percpu(size_t size, size_t align)
+void *__alloc_percpu(size_t size, size_t align, unsigned int gfpflags)
{
struct pcpu_block *blkp;
struct list_head *l;
@@ -512,7 +516,7 @@
unlock_and_get_mem:
spin_unlock_irqrestore(&blklist_lock, flags);
- if (add_percpu_block())
+ if (add_percpu_block(gfpflags))
goto try_after_refill;
return NULL;
Change all current users of alloc_percpu over to the new alloc_percpu
interface (passing GFP_KERNEL).
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: alloc_percpu-2.6.13-rc6/include/linux/genhd.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/genhd.h 2005-08-15 15:26:40.979596750 -0400
+++ alloc_percpu-2.6.13-rc6/include/linux/genhd.h 2005-08-15 15:27:07.153232500 -0400
@@ -199,7 +199,7 @@
#ifdef CONFIG_SMP
static inline int init_disk_stats(struct gendisk *disk)
{
- disk->dkstats = alloc_percpu(struct disk_stats);
+ disk->dkstats = alloc_percpu(struct disk_stats, GFP_KERNEL);
if (!disk->dkstats)
return 0;
return 1;
Index: alloc_percpu-2.6.13-rc6/include/linux/percpu_counter.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/percpu_counter.h 2005-08-15 15:26:40.983597000 -0400
+++ alloc_percpu-2.6.13-rc6/include/linux/percpu_counter.h 2005-08-15 15:27:07.153232500 -0400
@@ -30,7 +30,7 @@
{
spin_lock_init(&fbc->lock);
fbc->count = 0;
- fbc->counters = alloc_percpu(long);
+ fbc->counters = alloc_percpu(long, GFP_KERNEL);
}
static inline void percpu_counter_destroy(struct percpu_counter *fbc)
Index: alloc_percpu-2.6.13-rc6/net/core/neighbour.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/core/neighbour.c 2005-08-15 15:26:40.983597000 -0400
+++ alloc_percpu-2.6.13-rc6/net/core/neighbour.c 2005-08-15 15:27:07.153232500 -0400
@@ -1350,7 +1350,7 @@
if (!tbl->kmem_cachep)
panic("cannot create neighbour cache");
- tbl->stats = alloc_percpu(struct neigh_statistics);
+ tbl->stats = alloc_percpu(struct neigh_statistics, GFP_KERNEL);
if (!tbl->stats)
panic("cannot create neighbour cache statistics");
Index: alloc_percpu-2.6.13-rc6/net/ipv4/af_inet.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/af_inet.c 2005-08-15 15:26:40.983597000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/af_inet.c 2005-08-15 15:27:07.153232500 -0400
@@ -985,16 +985,16 @@
static int __init init_ipv4_mibs(void)
{
- net_statistics[0] = alloc_percpu(struct linux_mib);
- net_statistics[1] = alloc_percpu(struct linux_mib);
- ip_statistics[0] = alloc_percpu(struct ipstats_mib);
- ip_statistics[1] = alloc_percpu(struct ipstats_mib);
- icmp_statistics[0] = alloc_percpu(struct icmp_mib);
- icmp_statistics[1] = alloc_percpu(struct icmp_mib);
- tcp_statistics[0] = alloc_percpu(struct tcp_mib);
- tcp_statistics[1] = alloc_percpu(struct tcp_mib);
- udp_statistics[0] = alloc_percpu(struct udp_mib);
- udp_statistics[1] = alloc_percpu(struct udp_mib);
+ net_statistics[0] = alloc_percpu(struct linux_mib, GFP_KERNEL);
+ net_statistics[1] = alloc_percpu(struct linux_mib, GFP_KERNEL);
+ ip_statistics[0] = alloc_percpu(struct ipstats_mib, GFP_KERNEL);
+ ip_statistics[1] = alloc_percpu(struct ipstats_mib, GFP_KERNEL);
+ icmp_statistics[0] = alloc_percpu(struct icmp_mib, GFP_KERNEL);
+ icmp_statistics[1] = alloc_percpu(struct icmp_mib, GFP_KERNEL);
+ tcp_statistics[0] = alloc_percpu(struct tcp_mib, GFP_KERNEL);
+ tcp_statistics[1] = alloc_percpu(struct tcp_mib, GFP_KERNEL);
+ udp_statistics[0] = alloc_percpu(struct udp_mib, GFP_KERNEL);
+ udp_statistics[1] = alloc_percpu(struct udp_mib, GFP_KERNEL);
if (!
(net_statistics[0] && net_statistics[1] && ip_statistics[0]
&& ip_statistics[1] && tcp_statistics[0] && tcp_statistics[1]
Index: alloc_percpu-2.6.13-rc6/net/ipv4/ipcomp.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/ipcomp.c 2005-08-15 15:26:40.983597000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/ipcomp.c 2005-08-15 15:27:07.153232500 -0400
@@ -306,7 +306,7 @@
if (ipcomp_scratch_users++)
return ipcomp_scratches;
- scratches = alloc_percpu(void *);
+ scratches = alloc_percpu(void *, GFP_KERNEL);
if (!scratches)
return NULL;
@@ -380,7 +380,7 @@
INIT_LIST_HEAD(&pos->list);
list_add(&pos->list, &ipcomp_tfms_list);
- pos->tfms = tfms = alloc_percpu(struct crypto_tfm *);
+ pos->tfms = tfms = alloc_percpu(struct crypto_tfm *, GFP_KERNEL);
if (!tfms)
goto error;
Index: alloc_percpu-2.6.13-rc6/net/ipv4/route.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/route.c 2005-08-15 15:26:40.987597250 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/route.c 2005-08-15 15:27:07.157232750 -0400
@@ -3154,7 +3154,7 @@
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
- rt_cache_stat = alloc_percpu(struct rt_cache_stat);
+ rt_cache_stat = alloc_percpu(struct rt_cache_stat, GFP_KERNEL);
if (!rt_cache_stat)
return -ENOMEM;
Index: alloc_percpu-2.6.13-rc6/net/ipv6/ipcomp6.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv6/ipcomp6.c 2005-08-15 15:26:40.987597250 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv6/ipcomp6.c 2005-08-15 15:27:07.157232750 -0400
@@ -302,7 +302,7 @@
if (ipcomp6_scratch_users++)
return ipcomp6_scratches;
- scratches = alloc_percpu(void *);
+ scratches = alloc_percpu(void *, GFP_KERNEL);
if (!scratches)
return NULL;
@@ -376,7 +376,7 @@
INIT_LIST_HEAD(&pos->list);
list_add(&pos->list, &ipcomp6_tfms_list);
- pos->tfms = tfms = alloc_percpu(struct crypto_tfm *);
+ pos->tfms = tfms = alloc_percpu(struct crypto_tfm *, GFP_KERNEL);
if (!tfms)
goto error;
Index: alloc_percpu-2.6.13-rc6/net/sctp/protocol.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/sctp/protocol.c 2005-08-15 15:26:40.987597250 -0400
+++ alloc_percpu-2.6.13-rc6/net/sctp/protocol.c 2005-08-15 15:27:07.157232750 -0400
@@ -939,10 +939,10 @@
static int __init init_sctp_mibs(void)
{
- sctp_statistics[0] = alloc_percpu(struct sctp_mib);
+ sctp_statistics[0] = alloc_percpu(struct sctp_mib, GFP_KERNEL);
if (!sctp_statistics[0])
return -ENOMEM;
- sctp_statistics[1] = alloc_percpu(struct sctp_mib);
+ sctp_statistics[1] = alloc_percpu(struct sctp_mib, GFP_KERNEL);
if (!sctp_statistics[1]) {
free_percpu(sctp_statistics[0]);
return -ENOMEM;
Reintroduce __get_cpu_ptr for bigrefs
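__get_cpu_ptr(ptr) resolves to the current cpu's copy via smp_processor_id(),
so callers are expected to run with preemption disabled. A rough usage sketch,
with a hypothetical counter:

long *hits;	/* previously obtained from alloc_percpu(long, GFP_KERNEL) */

preempt_disable();
(*__get_cpu_ptr(hits))++;	/* touches only this cpu's copy */
preempt_enable();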
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Index: alloc_percpu-2.6.13-rc6/include/linux/percpu.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/percpu.h 2005-08-15 15:31:35.770020000 -0400
+++ alloc_percpu-2.6.13-rc6/include/linux/percpu.h 2005-08-15 17:12:57.745572000 -0400
@@ -59,4 +59,5 @@
__alignof__(type), gfpflags))); \
})
+#define __get_cpu_ptr(ptr) per_cpu_ptr(ptr, smp_processor_id())
#endif /* __LINUX_PERCPU_H */
Distributed refcounting infrastructure patch originally by Rusty Russell.
http://lkml.org/lkml/2005/1/14/47
Changes from the original:
- Rediffed and changed to use the new alloc_percpu interface
- Added bigref_set for applications which initialize refcounters to 0
- Do not call release in bigref_put if it is NULL
- Use synchronize_sched instead of synchronize_kernel
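For illustration, the intended lifecycle looks roughly like the sketch below;
the containing object and release function are made up for the example, only
the bigref_* calls come from this patch.

#include <linux/bigref.h>
#include <linux/slab.h>

struct my_obj {
	struct bigref ref;
	/* ... payload ... */
};

static void my_obj_release(struct bigref *ref)
{
	kfree(container_of(ref, struct my_obj, ref));
}

/* at creation time (obj is a freshly allocated struct my_obj) */
bigref_init(&obj->ref);			/* count starts at 1, per-cpu fast mode */

/* fast path: no shared cacheline is touched */
bigref_get(&obj->ref);
bigref_put(&obj->ref, my_obj_release);	/* never releases while in fast mode */

/* once the object is unlinked and the count may legitimately reach zero */
bigref_disown(&obj->ref);		/* switch to slow (atomic) mode */
bigref_put(&obj->ref, my_obj_release);	/* release runs when the count hits 0 */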
Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: alloc_percpu-2.6.13-rc7/include/linux/bigref.h
===================================================================
--- alloc_percpu-2.6.13-rc7.orig/include/linux/bigref.h 2005-08-28 09:48:29.174802240 -0700
+++ alloc_percpu-2.6.13-rc7/include/linux/bigref.h 2005-08-29 08:28:41.000000000 -0700
@@ -0,0 +1,84 @@
+#ifndef _LINUX_BIGREF_H
+#define _LINUX_BIGREF_H
+/* Per-cpu reference counters. Useful when speed is important, and
+ counter will only hit zero after some explicit event (such as being
+ discarded from a list).
+
+ (C) 2003 Rusty Russell, IBM Corporation.
+*/
+#include <linux/config.h>
+#include <asm/atomic.h>
+#include <asm/local.h>
+
+#ifdef CONFIG_SMP
+struct bigref
+{
+ /* If this is zero, we use per-cpu counters. */
+ atomic_t slow_ref;
+ local_t *local_ref;
+};
+
+/* Initialize reference to 1. */
+void bigref_init(struct bigref *bigref);
+/* Initialize reference to val */
+void bigref_set(struct bigref *bigref, int val);
+/* Disown reference: next bigref_put can hit zero */
+void bigref_disown(struct bigref *bigref);
+/* Grab reference */
+void bigref_get(struct bigref *bigref);
+/* Drop reference */
+void bigref_put(struct bigref *bigref, void (*release)(struct bigref *bigref));
+/* Destroy bigref prematurely (which might not have hit zero) */
+void bigref_destroy(struct bigref *bigref);
+
+/* Get current value of reference, useful for debugging info. */
+unsigned int bigref_val(struct bigref *bigref);
+#else /* ... !SMP */
+struct bigref
+{
+ atomic_t ref;
+};
+
+/* Initialize reference to 1. */
+static inline void bigref_init(struct bigref *bigref)
+{
+ atomic_set(&bigref->ref, 1);
+}
+
+/* Initialize reference to val */
+static inline void bigref_set(struct bigref *bigref, int val)
+{
+ atomic_set(&bigref->ref, val);
+}
+
+/* Disown reference: next bigref_put can hit zero */
+static inline void bigref_disown(struct bigref *bigref)
+{
+}
+
+/* Grab reference */
+static inline void bigref_get(struct bigref *bigref)
+{
+ atomic_inc(&bigref->ref);
+}
+
+/* Drop reference */
+static inline void bigref_put(struct bigref *bigref,
+ void (*release)(struct bigref *bigref))
+{
+ if (atomic_dec_and_test(&bigref->ref))
+ release(bigref);
+}
+
+/* Get current value of reference, useful for debugging info. */
+static inline unsigned int bigref_val(struct bigref *bigref)
+{
+ return atomic_read(&bigref->ref);
+}
+
+static inline void bigref_destroy(struct bigref *bigref)
+{
+}
+#endif /* !SMP */
+
+#endif /* _LINUX_BIGREF_H */
Index: alloc_percpu-2.6.13-rc7/kernel/Makefile
===================================================================
--- alloc_percpu-2.6.13-rc7.orig/kernel/Makefile 2005-08-29 08:28:30.000000000 -0700
+++ alloc_percpu-2.6.13-rc7/kernel/Makefile 2005-08-29 08:28:41.000000000 -0700
@@ -11,7 +11,7 @@
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
-obj-$(CONFIG_SMP) += cpu.o spinlock.o
+obj-$(CONFIG_SMP) += cpu.o spinlock.o bigref.o
obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_KALLSYMS) += kallsyms.o
Index: alloc_percpu-2.6.13-rc7/kernel/bigref.c
===================================================================
--- alloc_percpu-2.6.13-rc7.orig/kernel/bigref.c 2005-08-28 09:48:29.174802240 -0700
+++ alloc_percpu-2.6.13-rc7/kernel/bigref.c 2005-08-29 08:29:41.000000000 -0700
@@ -0,0 +1,112 @@
+#include <linux/bigref.h>
+#include <linux/compiler.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+#include <linux/module.h>
+#include <asm/system.h>
+
+static inline int is_slow_mode(const struct bigref *bigref)
+{
+ return atomic_read(&bigref->slow_ref) != 0;
+}
+
+/* Initialize reference to 1. */
+void bigref_init(struct bigref *bigref)
+{
+ bigref->local_ref = alloc_percpu(local_t, GFP_KERNEL);
+
+ /* If we can't allocate, we just stay in slow mode. */
+ if (!bigref->local_ref)
+ atomic_set(&bigref->slow_ref, 1);
+ else {
+ /* Bump any counter to 1. */
+ local_set(__get_cpu_ptr(bigref->local_ref), 1);
+ atomic_set(&bigref->slow_ref, 0);
+ }
+}
+/* Initialize reference to val */
+void bigref_set(struct bigref *bigref, int val)
+{
+
+ if (!bigref->local_ref)
+ atomic_set(&bigref->slow_ref, val);
+ else {
+ /* Bump any counter to val. */
+ local_set(__get_cpu_ptr(bigref->local_ref), val);
+ atomic_set(&bigref->slow_ref, 0);
+ }
+}
+
+
+/* Disown reference: next bigref_put can hit zero */
+void bigref_disown(struct bigref *bigref)
+{
+ int i, bias = 0x7FFFFFFF;
+ if (unlikely(is_slow_mode(bigref))) {
+ /* Must have been alloc fail, not double disown. */
+ BUG_ON(bigref->local_ref);
+ return;
+ }
+
+ /* Insert high number so this doesn't go to zero now. */
+ atomic_set(&bigref->slow_ref, bias);
+
+ /* Make sure everyone sees it and is using slow mode. */
+ synchronize_sched();
+
+ /* Take away bias, and add sum of local counters. */
+ for_each_cpu(i)
+ bias -= local_read(per_cpu_ptr(bigref->local_ref, i));
+ atomic_sub(bias, &bigref->slow_ref);
+
+ /* This caller should be holding one reference. */
+ BUG_ON(atomic_read(&bigref->slow_ref) == 0);
+}
+
+/* Grab reference */
+void bigref_get(struct bigref *bigref)
+{
+ if (unlikely(is_slow_mode(bigref)))
+ atomic_inc(&bigref->slow_ref);
+ else
+ local_inc(__get_cpu_ptr(bigref->local_ref));
+}
+
+/* Drop reference */
+void bigref_put(struct bigref *bigref, void (*release)(struct bigref *bigref))
+{
+ if (unlikely(is_slow_mode(bigref))) {
+ if (atomic_dec_and_test(&bigref->slow_ref)) {
+ free_percpu(bigref->local_ref);
+ if (release)
+ release(bigref);
+ }
+ } else
+ local_dec(__get_cpu_ptr(bigref->local_ref));
+}
+
+void bigref_destroy(struct bigref *bigref)
+{
+ if (bigref->local_ref)
+ free_percpu(bigref->local_ref);
+}
+
+/* Get current value of reference, useful for debugging info. */
+unsigned int bigref_val(struct bigref *bigref)
+{
+ unsigned int sum = 0, i;
+
+ if (unlikely(is_slow_mode(bigref)))
+ sum = atomic_read(&bigref->slow_ref);
+ else
+ for_each_cpu(i)
+ sum += local_read(per_cpu_ptr(bigref->local_ref, i));
+ return sum;
+}
+
+EXPORT_SYMBOL(bigref_init);
+EXPORT_SYMBOL(bigref_disown);
+EXPORT_SYMBOL(bigref_get);
+EXPORT_SYMBOL(bigref_put);
+EXPORT_SYMBOL(bigref_destroy);
+EXPORT_SYMBOL(bigref_val);
The net_device structure has a refcnt used to keep track of its users.
It is consulted at the time of unregistering the network device
(module unloading, ...); see netdev_wait_allrefs().
For loopback_dev, this refcnt increment/decrement causes unnecessary
traffic on the interconnect of a NUMA system, hurting its performance.
This patch improves tbench numbers by 6% on an 8-way x86 Xeon (x445).
This patch depends on the bigref patch.
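With bigrefs the hold/put fast path stays cpu-local. A rough sketch of the
resulting behaviour, mirroring the hunks below (illustrative only):

dev_hold(dev);		/* now bigref_get(&dev->netdev_refcnt) */
/* ... */
dev_put(dev);		/* now bigref_put(&dev->netdev_refcnt, NULL) */

/* netdev_wait_allrefs() polls the summed value instead of atomic_read() */
while (bigref_val(&dev->netdev_refcnt) != 0)
	msleep(250);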
Signed-off-by: Niraj Kumar <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Index: alloc_percpu-2.6.13/drivers/net/loopback.c
===================================================================
--- alloc_percpu-2.6.13.orig/drivers/net/loopback.c 2005-08-28 16:41:01.000000000 -0700
+++ alloc_percpu-2.6.13/drivers/net/loopback.c 2005-09-12 12:04:25.000000000 -0700
@@ -226,6 +226,12 @@
loopback_dev.priv = stats;
loopback_dev.get_stats = &get_stats;
}
+
+ /*
+ * This is the only struct net_device not allocated by alloc_netdev
+ * So explicitly init the bigref hanging off loopback_dev
+ */
+ bigref_init(&loopback_dev.netdev_refcnt);
return register_netdev(&loopback_dev);
};
Index: alloc_percpu-2.6.13/include/linux/netdevice.h
===================================================================
--- alloc_percpu-2.6.13.orig/include/linux/netdevice.h 2005-08-28 16:41:01.000000000 -0700
+++ alloc_percpu-2.6.13/include/linux/netdevice.h 2005-09-12 11:54:21.000000000 -0700
@@ -37,6 +37,7 @@
#include <linux/config.h>
#include <linux/device.h>
#include <linux/percpu.h>
+#include <linux/bigref.h>
struct divert_blk;
struct vlan_group;
@@ -377,7 +378,7 @@
/* device queue lock */
spinlock_t queue_lock;
/* Number of references to this device */
- atomic_t refcnt;
+ struct bigref netdev_refcnt;
/* delayed register/unregister */
struct list_head todo_list;
/* device name hash chain */
@@ -677,11 +678,11 @@
static inline void dev_put(struct net_device *dev)
{
- atomic_dec(&dev->refcnt);
+ bigref_put(&dev->netdev_refcnt, NULL);
}
-#define __dev_put(dev) atomic_dec(&(dev)->refcnt)
-#define dev_hold(dev) atomic_inc(&(dev)->refcnt)
+#define __dev_put(dev) bigref_put(&(dev)->netdev_refcnt, NULL);
+#define dev_hold(dev) bigref_get(&(dev)->netdev_refcnt);
/* Carrier loss detection, dial on demand. The functions netif_carrier_on
* and _off may be called from IRQ context, but it is caller
Index: alloc_percpu-2.6.13/net/core/dev.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/core/dev.c 2005-08-28 16:41:01.000000000 -0700
+++ alloc_percpu-2.6.13/net/core/dev.c 2005-09-12 11:54:21.000000000 -0700
@@ -2658,6 +2658,7 @@
goto out;
dev->iflink = -1;
+ bigref_set(&dev->netdev_refcnt, 0);
/* Init, if this function is available */
if (dev->init) {
@@ -2808,7 +2809,7 @@
unsigned long rebroadcast_time, warning_time;
rebroadcast_time = warning_time = jiffies;
- while (atomic_read(&dev->refcnt) != 0) {
+ while ( bigref_val(&dev->netdev_refcnt) != 0) {
if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
rtnl_shlock();
@@ -2838,7 +2839,7 @@
printk(KERN_EMERG "unregister_netdevice: "
"waiting for %s to become free. Usage "
"count = %d\n",
- dev->name, atomic_read(&dev->refcnt));
+ dev->name, bigref_val(&dev->netdev_refcnt));
warning_time = jiffies;
}
}
@@ -2909,7 +2910,7 @@
netdev_wait_allrefs(dev);
/* paranoia */
- BUG_ON(atomic_read(&dev->refcnt));
+ BUG_ON(bigref_val(&dev->netdev_refcnt));
BUG_TRAP(!dev->ip_ptr);
BUG_TRAP(!dev->ip6_ptr);
BUG_TRAP(!dev->dn_ptr);
@@ -2969,6 +2970,7 @@
setup(dev);
strcpy(dev->name, name);
+ bigref_init(&dev->netdev_refcnt);
return dev;
}
EXPORT_SYMBOL(alloc_netdev);
@@ -2986,6 +2988,7 @@
#ifdef CONFIG_SYSFS
/* Compatiablity with error handling in drivers */
if (dev->reg_state == NETREG_UNINITIALIZED) {
+ bigref_destroy(&dev->netdev_refcnt);
kfree((char *)dev - dev->padded);
return;
}
@@ -2996,6 +2999,7 @@
/* will free via class release */
class_device_put(&dev->class_dev);
#else
+ bigref_destroy(&dev->netdev_refcnt);
kfree((char *)dev - dev->padded);
#endif
}
@@ -3210,7 +3214,7 @@
set_bit(__LINK_STATE_START, &queue->backlog_dev.state);
queue->backlog_dev.weight = weight_p;
queue->backlog_dev.poll = process_backlog;
- atomic_set(&queue->backlog_dev.refcnt, 1);
+ bigref_init(&queue->backlog_dev.netdev_refcnt);
}
dev_boot_phase = 0;
Index: alloc_percpu-2.6.13/net/core/net-sysfs.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/core/net-sysfs.c 2005-08-28 16:41:01.000000000 -0700
+++ alloc_percpu-2.6.13/net/core/net-sysfs.c 2005-09-12 11:54:21.000000000 -0700
@@ -16,6 +16,7 @@
#include <net/sock.h>
#include <linux/rtnetlink.h>
#include <linux/wireless.h>
+#include <linux/bigref.h>
#define to_class_dev(obj) container_of(obj,struct class_device,kobj)
#define to_net_dev(class) container_of(class, struct net_device, class_dev)
@@ -400,6 +401,8 @@
= container_of(cd, struct net_device, class_dev);
BUG_ON(dev->reg_state != NETREG_RELEASED);
+
+ bigref_destroy(&dev->netdev_refcnt);
kfree((char *)dev - dev->padded);
}
This patch introduces macros to handle the use, lastuse and refcnt fields in
the dst_entry structure. Having macros manipulate these fields
allows cleaner source code and provides an easy way to modify how
these performance-critical fields are handled.
The introduction of macros removes some code that is repeated in
various places. Also:
- net/decnet/dn_route.c: introduces dn_dst_useful to check the usefulness of a
dst entry. dst_update_rtu is used to reduce code duplication.
- net/ipv4/route.c: adds ip_rt_copy. dst_update_rtu is used to reduce code
duplication.
The patch is a prerequisite for the dst NUMA patch.
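For illustration, a route-cache hit that used to touch the three fields by
hand now reduces to one helper (rt here stands for an entry already found in
the hash chain; the snippet is a sketch, not taken verbatim from the patch):

struct rtable *rt;	/* entry found in the route hash chain */

/* before: rt->u.dst.lastuse = now; dst_hold(&rt->u.dst); rt->u.dst.__use++; */
dst_update_rtu(&rt->u.dst);

/* dropping the reference is unchanged in behaviour, now via the macro layer */
dst_release(&rt->u.dst);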
Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Shobhit Dayal <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Index: alloc_percpu-2.6.13-rc6/include/net/dst.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/net/dst.h 2005-08-15 17:54:34.721623250 -0400
+++ alloc_percpu-2.6.13-rc6/include/net/dst.h 2005-08-15 17:58:14.499358500 -0400
@@ -103,6 +103,30 @@
#ifdef __KERNEL__
+#define dst_use(__dst) (__dst)->__use
+#define dst_use_inc(__dst) (__dst)->__use++
+
+#define dst_lastuse(__dst) (__dst)->lastuse
+#define dst_lastuse_set(__dst) (__dst)->lastuse = jiffies
+
+#define dst_update_tu(__dst) do { dst_lastuse_set(__dst);dst_use_inc(__dst); } while (0)
+#define dst_update_rtu(__dst) do { dst_lastuse_set(__dst);dst_hold(__dst);dst_use_inc(__dst); } while (0)
+
+#define dst_refcnt(__dst) atomic_read(&(__dst)->__refcnt)
+#define dst_refcnt_one(__dst) atomic_set(&(__dst)->__refcnt, 1)
+#define dst_refcnt_dec(__dst) atomic_dec(&(__dst)->__refcnt)
+#define dst_hold(__dst) atomic_inc(&(__dst)->__refcnt)
+
+static inline
+void dst_release(struct dst_entry * dst)
+{
+ if (dst) {
+ WARN_ON(dst_refcnt(dst) < 1);
+ smp_mb__before_atomic_dec();
+ dst_refcnt_dec(dst);
+ }
+}
+
static inline u32
dst_metric(const struct dst_entry *dst, int metric)
{
@@ -134,29 +158,14 @@
return dst_metric(dst, RTAX_LOCK) & (1<<metric);
}
-static inline void dst_hold(struct dst_entry * dst)
-{
- atomic_inc(&dst->__refcnt);
-}
-
static inline
struct dst_entry * dst_clone(struct dst_entry * dst)
{
if (dst)
- atomic_inc(&dst->__refcnt);
+ dst_hold(dst);
return dst;
}
-static inline
-void dst_release(struct dst_entry * dst)
-{
- if (dst) {
- WARN_ON(atomic_read(&dst->__refcnt) < 1);
- smp_mb__before_atomic_dec();
- atomic_dec(&dst->__refcnt);
- }
-}
-
/* Children define the path of the packet through the
* Linux networking. Thus, destinations are stackable.
*/
@@ -177,7 +186,7 @@
{
if (dst->obsolete > 1)
return;
- if (!atomic_read(&dst->__refcnt)) {
+ if (!dst_refcnt(dst)) {
dst = dst_destroy(dst);
if (!dst)
return;
Index: alloc_percpu-2.6.13-rc6/net/core/dst.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/core/dst.c 2005-08-15 17:54:34.761625750 -0400
+++ alloc_percpu-2.6.13-rc6/net/core/dst.c 2005-08-15 17:58:14.499358500 -0400
@@ -57,7 +57,7 @@
dstp = &dst_garbage_list;
work_performed = 0;
while ((dst = *dstp) != NULL) {
- if (atomic_read(&dst->__refcnt)) {
+ if (dst_refcnt(dst)) {
dstp = &dst->next;
delayed++;
continue;
@@ -176,9 +176,8 @@
struct neighbour *neigh;
struct hh_cache *hh;
- smp_rmb();
-
again:
+ smp_rmb();
neigh = dst->neighbour;
hh = dst->hh;
child = dst->child;
@@ -206,16 +205,16 @@
dst = child;
if (dst) {
int nohash = dst->flags & DST_NOHASH;
-
- if (atomic_dec_and_test(&dst->__refcnt)) {
- /* We were real parent of this dst, so kill child. */
- if (nohash)
+ dst_refcnt_dec(dst);
+ if (nohash) {
+ if (!dst_refcnt(dst)) {
+ /* We were real parent of this dst, so kill child. */
goto again;
- } else {
- /* Child is still referenced, return it for freeing. */
- if (nohash)
+ } else {
+ /* Child is still referenced, return it for freeing. */
return dst;
- /* Child is still in his hash table */
+ /* Child is still in his hash table */
+ }
}
}
return NULL;
Index: alloc_percpu-2.6.13-rc6/net/decnet/dn_route.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/decnet/dn_route.c 2005-08-15 17:54:34.789627500 -0400
+++ alloc_percpu-2.6.13-rc6/net/decnet/dn_route.c 2005-08-15 17:58:14.503358750 -0400
@@ -155,6 +155,11 @@
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}
+static inline int dn_dst_useful(struct dn_route *rth, unsigned long now, unsigned long expire)
+{
+ return (atomic_read(&rth->u.dst.__refcnt) || (now - rth->u.dst.lastuse) < expire) ;
+}
+
static void dn_dst_check_expire(unsigned long dummy)
{
int i;
@@ -167,8 +172,7 @@
spin_lock(&dn_rt_hash_table[i].lock);
while((rt=*rtp) != NULL) {
- if (atomic_read(&rt->u.dst.__refcnt) ||
- (now - rt->u.dst.lastuse) < expire) {
+ if (dn_dst_useful(rt, now, expire)) {
rtp = &rt->u.rt_next;
continue;
}
@@ -198,8 +202,7 @@
rtp = &dn_rt_hash_table[i].chain;
while((rt=*rtp) != NULL) {
- if (atomic_read(&rt->u.dst.__refcnt) ||
- (now - rt->u.dst.lastuse) < expire) {
+ if (dn_dst_useful(rt, now, expire)) {
rtp = &rt->u.rt_next;
continue;
}
@@ -277,10 +280,8 @@
static int dn_insert_route(struct dn_route *rt, unsigned hash, struct dn_route **rp)
{
struct dn_route *rth, **rthp;
- unsigned long now = jiffies;
-
- rthp = &dn_rt_hash_table[hash].chain;
+ rthp = &dn_rt_hash_table[hash].chain;
spin_lock_bh(&dn_rt_hash_table[hash].lock);
while((rth = *rthp) != NULL) {
if (compare_keys(&rth->fl, &rt->fl)) {
@@ -290,9 +291,7 @@
dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rth);
- rth->u.dst.__use++;
- dst_hold(&rth->u.dst);
- rth->u.dst.lastuse = now;
+ dst_update_rtu(&rth->u.dst);
spin_unlock_bh(&dn_rt_hash_table[hash].lock);
dnrt_drop(rt);
@@ -304,10 +303,8 @@
rcu_assign_pointer(rt->u.rt_next, dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rt);
-
- dst_hold(&rt->u.dst);
- rt->u.dst.__use++;
- rt->u.dst.lastuse = now;
+
+ dst_update_rtu(&rt->u.dst);
spin_unlock_bh(&dn_rt_hash_table[hash].lock);
*rp = rt;
return 0;
@@ -1091,7 +1088,7 @@
if (rt == NULL)
goto e_nobufs;
- atomic_set(&rt->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rt->u.dst);
rt->u.dst.flags = DST_HOST;
rt->fl.fld_src = oldflp->fld_src;
@@ -1115,7 +1112,7 @@
rt->u.dst.neighbour = neigh;
neigh = NULL;
- rt->u.dst.lastuse = jiffies;
+ dst_lastuse_set(&rt->u.dst);
rt->u.dst.output = dn_output;
rt->u.dst.input = dn_rt_bug;
rt->rt_flags = flags;
@@ -1173,9 +1170,7 @@
#endif
(rt->fl.iif == 0) &&
(rt->fl.oif == flp->oif)) {
- rt->u.dst.lastuse = jiffies;
- dst_hold(&rt->u.dst);
- rt->u.dst.__use++;
+ dst_update_rtu(&rt->u.dst);
rcu_read_unlock_bh();
*pprt = &rt->u.dst;
return 0;
@@ -1381,7 +1376,7 @@
rt->u.dst.flags = DST_HOST;
rt->u.dst.neighbour = neigh;
rt->u.dst.dev = out_dev;
- rt->u.dst.lastuse = jiffies;
+ dst_lastuse_set(&rt->u.dst);
rt->u.dst.output = dn_rt_bug;
switch(res.type) {
case RTN_UNICAST:
@@ -1452,9 +1447,7 @@
(rt->fl.fld_fwmark == skb->nfmark) &&
#endif
(rt->fl.iif == cb->iif)) {
- rt->u.dst.lastuse = jiffies;
- dst_hold(&rt->u.dst);
- rt->u.dst.__use++;
+ dst_update_rtu(&rt->u.dst);
rcu_read_unlock();
skb->dst = (struct dst_entry *)rt;
return 0;
@@ -1504,9 +1497,9 @@
RTA_PUT(skb, RTA_GATEWAY, 2, &rt->rt_gateway);
if (rtnetlink_put_metrics(skb, rt->u.dst.metrics) < 0)
goto rtattr_failure;
- ci.rta_lastuse = jiffies_to_clock_t(jiffies - rt->u.dst.lastuse);
- ci.rta_used = rt->u.dst.__use;
- ci.rta_clntref = atomic_read(&rt->u.dst.__refcnt);
+ ci.rta_lastuse = jiffies_to_clock_t(jiffies - dst_lastuse(&rt->u.dst));
+ ci.rta_used = dst_use(&rt->u.dst);
+ ci.rta_clntref = dst_refcnt(&rt->u.dst);
if (rt->u.dst.expires)
ci.rta_expires = jiffies_to_clock_t(rt->u.dst.expires - jiffies);
else
@@ -1729,8 +1722,8 @@
rt->u.dst.dev ? rt->u.dst.dev->name : "*",
dn_addr2asc(dn_ntohs(rt->rt_daddr), buf1),
dn_addr2asc(dn_ntohs(rt->rt_saddr), buf2),
- atomic_read(&rt->u.dst.__refcnt),
- rt->u.dst.__use,
+ dst_refcnt(&rt->u.dst),
+ dst_use(&rt->u.dst),
(int) dst_metric(&rt->u.dst, RTAX_RTT));
return 0;
}
Index: alloc_percpu-2.6.13-rc6/net/ipv4/ipvs/ip_vs_xmit.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/ipvs/ip_vs_xmit.c 2005-08-15 17:54:34.837630500 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/ipvs/ip_vs_xmit.c 2005-08-15 17:55:23.980701750 -0400
@@ -88,7 +88,7 @@
__ip_vs_dst_set(dest, rtos, dst_clone(&rt->u.dst));
IP_VS_DBG(10, "new dst %u.%u.%u.%u, refcnt=%d, rtos=%X\n",
NIPQUAD(dest->addr),
- atomic_read(&rt->u.dst.__refcnt), rtos);
+ dst_refcnt(&rt->u.dst), rtos);
}
spin_unlock(&dest->dst_lock);
} else {
Index: alloc_percpu-2.6.13-rc6/net/ipv4/multipath_drr.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/multipath_drr.c 2005-08-15 17:54:34.905634750 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/multipath_drr.c 2005-08-15 17:55:23.980701750 -0400
@@ -149,8 +149,7 @@
multipath_comparekeys(&nh->fl, flp)) {
int nh_ifidx = nh->u.dst.dev->ifindex;
- nh->u.dst.lastuse = jiffies;
- nh->u.dst.__use++;
+ dst_update_tu(&nh->u.dst);
if (result != NULL)
continue;
Index: alloc_percpu-2.6.13-rc6/net/ipv4/multipath_random.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/multipath_random.c 2005-08-15 17:54:34.909635000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/multipath_random.c 2005-08-15 17:55:23.980701750 -0400
@@ -94,7 +94,8 @@
for (rt = first; rt; rt = rt->u.rt_next) {
if ((rt->u.dst.flags & DST_BALANCED) != 0 &&
multipath_comparekeys(&rt->fl, flp)) {
- rt->u.dst.lastuse = jiffies;
+
+ dst_lastuse_set(&rt->u.dst);
if (i == candidate_no)
decision = rt;
@@ -107,7 +108,7 @@
}
}
- decision->u.dst.__use++;
+ dst_use_inc(&decision->u.dst);
*rp = decision;
}
Index: alloc_percpu-2.6.13-rc6/net/ipv4/multipath_rr.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/multipath_rr.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/multipath_rr.c 2005-08-15 17:55:24.056706500 -0400
@@ -62,10 +62,11 @@
nh = rcu_dereference(nh->u.rt_next)) {
if ((nh->u.dst.flags & DST_BALANCED) != 0 &&
multipath_comparekeys(&nh->fl, flp)) {
- nh->u.dst.lastuse = jiffies;
+ int __use = dst_use(&nh->u.dst);
+ dst_lastuse_set(&nh->u.dst);
- if (min_use == -1 || nh->u.dst.__use < min_use) {
- min_use = nh->u.dst.__use;
+ if (min_use == -1 || __use < min_use) {
+ min_use = __use;
min_use_cand = nh;
}
}
@@ -74,7 +75,7 @@
if (!result)
result = first;
- result->u.dst.__use++;
+ dst_use_inc(&result->u.dst);
*rp = result;
}
Index: alloc_percpu-2.6.13-rc6/net/ipv4/multipath_wrandom.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/multipath_wrandom.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/multipath_wrandom.c 2005-08-15 17:55:24.056706500 -0400
@@ -202,7 +202,7 @@
decision = first;
last_mpc = NULL;
for (mpc = first_mpc; mpc; mpc = mpc->next) {
- mpc->rt->u.dst.lastuse = jiffies;
+ dst_lastuse_set(&mpc->rt->u.dst);
if (last_power <= selector && selector < mpc->power)
decision = mpc->rt;
@@ -217,8 +217,7 @@
/* concurrent __multipath_flush may lead to !last_mpc */
kfree(last_mpc);
}
-
- decision->u.dst.__use++;
+ dst_use_inc(&decision->u.dst);
*rp = decision;
}
Index: alloc_percpu-2.6.13-rc6/net/ipv4/route.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/route.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/route.c 2005-08-15 17:58:14.503358750 -0400
@@ -334,8 +334,8 @@
"%08lX\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X",
r->u.dst.dev ? r->u.dst.dev->name : "*",
(unsigned long)r->rt_dst, (unsigned long)r->rt_gateway,
- r->rt_flags, atomic_read(&r->u.dst.__refcnt),
- r->u.dst.__use, 0, (unsigned long)r->rt_src,
+ r->rt_flags, dst_refcnt(&r->u.dst),
+ dst_use(&r->u.dst), 0, (unsigned long)r->rt_src,
(dst_metric(&r->u.dst, RTAX_ADVMSS) ?
(int)dst_metric(&r->u.dst, RTAX_ADVMSS) + 40 : 0),
dst_metric(&r->u.dst, RTAX_WINDOW),
@@ -512,7 +512,7 @@
unsigned long age;
int ret = 0;
- if (atomic_read(&rth->u.dst.__refcnt))
+ if (dst_refcnt(&rth->u.dst))
goto out;
ret = 1;
@@ -536,7 +536,7 @@
*/
static inline u32 rt_score(struct rtable *rt)
{
- u32 score = jiffies - rt->u.dst.lastuse;
+ u32 score = jiffies - dst_lastuse(&rt->u.dst);
score = ~score & ~(3<<30);
@@ -943,9 +943,7 @@
*/
rcu_assign_pointer(rt_hash_table[hash].chain, rth);
- rth->u.dst.__use++;
- dst_hold(&rth->u.dst);
- rth->u.dst.lastuse = now;
+ dst_update_rtu(&rth->u.dst);
spin_unlock_bh(rt_hash_lock_addr(hash));
rt_drop(rt);
@@ -953,7 +951,7 @@
return 0;
}
- if (!atomic_read(&rth->u.dst.__refcnt)) {
+ if (!dst_refcnt(&rth->u.dst)) {
u32 score = rt_score(rth);
if (score <= min_score) {
@@ -1108,6 +1106,12 @@
spin_unlock_bh(rt_hash_lock_addr(hash));
}
+void ip_rt_copy(struct rtable *to, struct rtable *from)
+{
+ *to = *from;
+ to->u.dst.__use = 1;
+}
+
void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
u32 saddr, u8 tos, struct net_device *dev)
{
@@ -1175,17 +1179,17 @@
}
/* Copy all the information. */
- *rt = *rth;
- INIT_RCU_HEAD(&rt->u.dst.rcu_head);
- rt->u.dst.__use = 1;
- atomic_set(&rt->u.dst.__refcnt, 1);
+ ip_rt_copy(rt, rth);
+
+ INIT_RCU_HEAD(&rt->u.dst.rcu_head);
+ dst_lastuse_set(&rt->u.dst);
+ dst_refcnt_one(&rt->u.dst);
rt->u.dst.child = NULL;
if (rt->u.dst.dev)
dev_hold(rt->u.dst.dev);
if (rt->idev)
in_dev_hold(rt->idev);
rt->u.dst.obsolete = 0;
- rt->u.dst.lastuse = jiffies;
rt->u.dst.path = &rt->u.dst;
rt->u.dst.neighbour = NULL;
rt->u.dst.hh = NULL;
@@ -1619,7 +1623,7 @@
rth->u.dst.output= ip_rt_bug;
- atomic_set(&rth->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rth->u.dst);
rth->u.dst.flags= DST_HOST;
if (in_dev->cnf.no_policy)
rth->u.dst.flags |= DST_NOPOLICY;
@@ -1818,7 +1822,7 @@
err = __mkroute_input(skb, res, in_dev, daddr, saddr, tos, &rth);
if (err)
return err;
- atomic_set(&rth->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rth->u.dst);
/* put it into the cache */
hash = rt_hash_code(daddr, saddr ^ (fl->iif << 5), tos);
@@ -1876,7 +1880,7 @@
* outside
*/
if (hop == lasthop)
- atomic_set(&(skb->dst->__refcnt), 1);
+ dst_refcnt_one(skb->dst);
}
return err;
#else /* CONFIG_IP_ROUTE_MULTIPATH_CACHED */
@@ -2012,7 +2016,7 @@
rth->u.dst.output= ip_rt_bug;
- atomic_set(&rth->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rth->u.dst);
rth->u.dst.flags= DST_HOST;
if (in_dev->cnf.no_policy)
rth->u.dst.flags |= DST_NOPOLICY;
@@ -2102,9 +2106,7 @@
rth->fl.fl4_fwmark == skb->nfmark &&
#endif
rth->fl.fl4_tos == tos) {
- rth->u.dst.lastuse = jiffies;
- dst_hold(&rth->u.dst);
- rth->u.dst.__use++;
+ dst_update_rtu(&rth->u.dst);
RT_CACHE_STAT_INC(in_hit);
rcu_read_unlock();
skb->dst = (struct dst_entry*)rth;
@@ -2288,7 +2290,7 @@
if (err == 0) {
u32 tos = RT_FL_TOS(oldflp);
- atomic_set(&rth->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rth->u.dst);
hash = rt_hash_code(oldflp->fl4_dst,
oldflp->fl4_src ^ (oldflp->oif << 5), tos);
@@ -2348,7 +2350,7 @@
if (err != 0)
return err;
}
- atomic_set(&(*rp)->u.dst.__refcnt, 1);
+ dst_refcnt_one(&(*rp)->u.dst);
return err;
} else {
return ip_mkroute_output_def(rp, res, fl, oldflp, dev_out,
@@ -2584,10 +2586,7 @@
rcu_read_unlock_bh();
return 0;
}
-
- rth->u.dst.lastuse = jiffies;
- dst_hold(&rth->u.dst);
- rth->u.dst.__use++;
+ dst_update_rtu(&rth->u.dst);
RT_CACHE_STAT_INC(out_hit);
rcu_read_unlock_bh();
*rp = rth;
@@ -2673,9 +2672,9 @@
RTA_PUT(skb, RTA_GATEWAY, 4, &rt->rt_gateway);
if (rtnetlink_put_metrics(skb, rt->u.dst.metrics) < 0)
goto rtattr_failure;
- ci.rta_lastuse = jiffies_to_clock_t(jiffies - rt->u.dst.lastuse);
- ci.rta_used = rt->u.dst.__use;
- ci.rta_clntref = atomic_read(&rt->u.dst.__refcnt);
+ ci.rta_lastuse = jiffies_to_clock_t(jiffies - dst_lastuse(&rt->u.dst));
+ ci.rta_used = dst_use(&rt->u.dst);
+ ci.rta_clntref = dst_refcnt(&rt->u.dst);
if (rt->u.dst.expires)
ci.rta_expires = jiffies_to_clock_t(rt->u.dst.expires - jiffies);
else
Index: alloc_percpu-2.6.13-rc6/net/ipv4/xfrm4_policy.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv4/xfrm4_policy.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv4/xfrm4_policy.c 2005-08-15 17:55:24.060706750 -0400
@@ -135,7 +135,7 @@
dev_hold(rt->u.dst.dev);
dst_prev->obsolete = -1;
dst_prev->flags |= DST_HOST;
- dst_prev->lastuse = jiffies;
+ dst_lastuse_set(dst_prev);
dst_prev->header_len = header_len;
dst_prev->trailer_len = trailer_len;
memcpy(&dst_prev->metrics, &x->route->metrics, sizeof(dst_prev->metrics));
Index: alloc_percpu-2.6.13-rc6/net/ipv6/ip6_fib.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv6/ip6_fib.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv6/ip6_fib.c 2005-08-15 17:58:14.503358750 -0400
@@ -1160,8 +1160,8 @@
}
gc_args.more++;
} else if (rt->rt6i_flags & RTF_CACHE) {
- if (atomic_read(&rt->u.dst.__refcnt) == 0 &&
- time_after_eq(now, rt->u.dst.lastuse + gc_args.timeout)) {
+ if (dst_refcnt(&rt->u.dst) == 0 &&
+ time_after_eq(now, dst_lastuse(&rt->u.dst) + gc_args.timeout)) {
RT6_TRACE("aging clone %p\n", rt);
return -1;
} else if ((rt->rt6i_flags & RTF_GATEWAY) &&
Index: alloc_percpu-2.6.13-rc6/net/ipv6/route.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv6/route.c 2005-08-15 17:54:34.973639000 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv6/route.c 2005-08-15 17:58:14.507359000 -0400
@@ -368,10 +368,9 @@
fn = fib6_lookup(&ip6_routing_table, daddr, saddr);
rt = rt6_device_match(fn->leaf, oif, strict);
dst_hold(&rt->u.dst);
- rt->u.dst.__use++;
- read_unlock_bh(&rt6_lock);
- rt->u.dst.lastuse = jiffies;
+ read_unlock_bh(&rt6_lock);
+ dst_update_tu(&rt->u.dst);
if (rt->u.dst.error == 0)
return rt;
dst_release(&rt->u.dst);
@@ -512,8 +511,7 @@
out:
read_unlock_bh(&rt6_lock);
out2:
- rt->u.dst.lastuse = jiffies;
- rt->u.dst.__use++;
+ dst_update_tu(&rt->u.dst);
skb->dst = (struct dst_entry *) rt;
}
@@ -572,8 +570,7 @@
out:
read_unlock_bh(&rt6_lock);
out2:
- rt->u.dst.lastuse = jiffies;
- rt->u.dst.__use++;
+ dst_update_tu(&rt->u.dst);
return &rt->u.dst;
}
@@ -685,7 +682,7 @@
rt->rt6i_dev = dev;
rt->rt6i_idev = idev;
rt->rt6i_nexthop = neigh;
- atomic_set(&rt->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rt->u.dst);
rt->u.dst.metrics[RTAX_HOPLIMIT-1] = 255;
rt->u.dst.metrics[RTAX_MTU-1] = ipv6_get_mtu(rt->rt6i_dev);
rt->u.dst.metrics[RTAX_ADVMSS-1] = ipv6_advmss(dst_mtu(&rt->u.dst));
@@ -719,7 +716,7 @@
pprev = &ndisc_dst_gc_list;
freed = 0;
while ((dst = *pprev) != NULL) {
- if (!atomic_read(&dst->__refcnt)) {
+ if (!dst_refcnt(dst)) {
*pprev = dst->next;
dst_free(dst);
freed++;
@@ -1261,7 +1258,7 @@
rt->rt6i_idev = ort->rt6i_idev;
if (rt->rt6i_idev)
in6_dev_hold(rt->rt6i_idev);
- rt->u.dst.lastuse = jiffies;
+ dst_lastuse_set(&rt->u.dst);
rt->rt6i_expires = 0;
ipv6_addr_copy(&rt->rt6i_gateway, &ort->rt6i_gateway);
@@ -1424,7 +1421,7 @@
ipv6_addr_copy(&rt->rt6i_dst.addr, addr);
rt->rt6i_dst.plen = 128;
- atomic_set(&rt->u.dst.__refcnt, 1);
+ dst_refcnt_one(&rt->u.dst);
return rt;
}
@@ -1637,13 +1634,13 @@
if (rt->u.dst.dev)
RTA_PUT(skb, RTA_OIF, sizeof(int), &rt->rt6i_dev->ifindex);
RTA_PUT(skb, RTA_PRIORITY, 4, &rt->rt6i_metric);
- ci.rta_lastuse = jiffies_to_clock_t(jiffies - rt->u.dst.lastuse);
+ ci.rta_lastuse = jiffies_to_clock_t(jiffies - dst_lastuse(&rt->u.dst));
if (rt->rt6i_expires)
ci.rta_expires = jiffies_to_clock_t(rt->rt6i_expires - jiffies);
else
ci.rta_expires = 0;
- ci.rta_used = rt->u.dst.__use;
- ci.rta_clntref = atomic_read(&rt->u.dst.__refcnt);
+ ci.rta_used = dst_use(&rt->u.dst);
+ ci.rta_clntref = dst_refcnt(&rt->u.dst);
ci.rta_error = rt->u.dst.error;
ci.rta_id = 0;
ci.rta_ts = 0;
@@ -1927,8 +1924,8 @@
}
arg->len += sprintf(arg->buffer + arg->len,
" %08x %08x %08x %08x %8s\n",
- rt->rt6i_metric, atomic_read(&rt->u.dst.__refcnt),
- rt->u.dst.__use, rt->rt6i_flags,
+ rt->rt6i_metric, dst_refcnt(&rt->u.dst),
+ dst_use(&rt->u.dst), rt->rt6i_flags,
rt->rt6i_dev ? rt->rt6i_dev->name : "");
return 0;
}
Index: alloc_percpu-2.6.13-rc6/net/ipv6/xfrm6_policy.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/ipv6/xfrm6_policy.c 2005-08-15 17:54:34.977639250 -0400
+++ alloc_percpu-2.6.13-rc6/net/ipv6/xfrm6_policy.c 2005-08-15 17:55:24.160713000 -0400
@@ -156,7 +156,7 @@
dev_hold(rt->u.dst.dev);
dst_prev->obsolete = -1;
dst_prev->flags |= DST_HOST;
- dst_prev->lastuse = jiffies;
+ dst_lastuse_set(dst_prev);
dst_prev->header_len = header_len;
dst_prev->trailer_len = trailer_len;
memcpy(&dst_prev->metrics, &x->route->metrics, sizeof(dst_prev->metrics));
Index: alloc_percpu-2.6.13-rc6/net/xfrm/xfrm_policy.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/net/xfrm/xfrm_policy.c 2005-08-15 17:54:34.977639250 -0400
+++ alloc_percpu-2.6.13-rc6/net/xfrm/xfrm_policy.c 2005-08-15 17:55:24.184714500 -0400
@@ -1090,7 +1090,7 @@
static int unused_bundle(struct dst_entry *dst)
{
- return !atomic_read(&dst->__refcnt);
+ return !dst_refcnt(dst);
}
static void __xfrm_garbage_collect(void)
Patch to use alloc_percpu for the dst_entry refcount. This patch reduces the
cacheline bouncing of the atomic_t dst_entry.__refcnt. It gets us 55% better
tbench throughput on an 8-way x445 box.
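Roughly, the scheme works as follows (a simplified sketch of what the
CONFIG_NUMA code below does; the sketch_* names are illustrative, not part of
the patch): each dst_entry carries a per-cpu counter array allocated with
alloc_percpu(), writers only touch their own cpu's slot, and the true count is
only summed on slow paths:

	/* hot path: bump the local cpu's counter, no cache line bouncing */
	static inline void sketch_dst_hold(struct dst_entry *dst)
	{
		per_cpu_ptr(dst->pcc, get_cpu())->refcnt++;
		put_cpu();
	}

	/* slow path (GC, /proc): sum the per-cpu counters */
	static inline int sketch_dst_refcnt(struct dst_entry *dst)
	{
		int cpu, sum = 0;

		for_each_online_cpu(cpu)
			sum += per_cpu_ptr(dst->pcc, cpu)->refcnt;
		return sum;
	}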
Signed-off-by: Pravin B. Shelar <[email protected]>
Signed-off-by: Shobhit Dayal <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Index: alloc_percpu-2.6.13/include/net/dst.h
===================================================================
--- alloc_percpu-2.6.13.orig/include/net/dst.h 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/include/net/dst.h 2005-09-12 16:44:05.000000000 -0700
@@ -35,11 +35,33 @@
struct sk_buff;
+#ifdef CONFIG_NUMA
+
+/* A per cpu instance of this exist for every dst_entry.
+ * These are the most written fields of dst_entry.
+ */
+struct per_cpu_cnt
+{
+ int refcnt;
+ int use;
+ unsigned long lastuse;
+};
+
+#endif
+
struct dst_entry
{
struct dst_entry *next;
+#ifdef CONFIG_NUMA
+ /* first cpu that should be checked for time-out */
+ int s_cpu;
+ /* per cpu client references */
+ struct per_cpu_cnt *pcc;
+#else
atomic_t __refcnt; /* client references */
int __use;
+ unsigned long lastuse;
+#endif
struct dst_entry *child;
struct net_device *dev;
short error;
@@ -50,7 +72,6 @@
#define DST_NOPOLICY 4
#define DST_NOHASH 8
#define DST_BALANCED 0x10
- unsigned long lastuse;
unsigned long expires;
unsigned short header_len; /* more space at head required */
@@ -103,25 +124,94 @@
#ifdef __KERNEL__
+#ifdef CONFIG_NUMA
+
+static inline int dst_use(struct dst_entry *dst)
+{
+ int total = 0, cpu;
+
+ for_each_online_cpu(cpu)
+ total += per_cpu_ptr(dst->pcc, cpu)->use;
+ return total;
+}
+
+#define dst_use_inc(__dst) do { \
+ per_cpu_ptr((__dst)->pcc, get_cpu())->use++ ; \
+ put_cpu(); \
+ } while(0);
+
+static inline unsigned long dst_lastuse(struct dst_entry *dst)
+{
+ unsigned long max = 0;
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ if (max < per_cpu_ptr(dst->pcc, cpu)->lastuse)
+ max = per_cpu_ptr(dst->pcc, cpu)->lastuse;
+ return max;
+}
+
+#define dst_lastuse_set(__dst) do { \
+ per_cpu_ptr((__dst)->pcc, get_cpu())->lastuse = jiffies ; \
+ put_cpu(); \
+ } while(0);
+
+static inline int dst_refcnt(struct dst_entry *dst)
+{
+ int cpu, sum = 0;
+
+ for_each_online_cpu(cpu)
+ sum += per_cpu_ptr(dst->pcc, cpu)->refcnt;
+
+ return sum;
+}
+
+#define dst_refcnt_one(__dst) do { \
+ per_cpu_ptr((__dst)->pcc, get_cpu())->refcnt = 1; \
+ put_cpu(); \
+ } while(0);
+
+#define dst_refcnt_dec(__dst) do { \
+ per_cpu_ptr((__dst)->pcc, get_cpu())->refcnt--; \
+ put_cpu(); \
+ } while(0);
+#define dst_hold(__dst) do { \
+ per_cpu_ptr((__dst)->pcc, get_cpu())->refcnt++ ; \
+ put_cpu(); \
+ } while(0);
+
+#else
+
#define dst_use(__dst) (__dst)->__use
#define dst_use_inc(__dst) (__dst)->__use++
#define dst_lastuse(__dst) (__dst)->lastuse
#define dst_lastuse_set(__dst) (__dst)->lastuse = jiffies
-#define dst_update_tu(__dst) do { dst_lastuse_set(__dst);dst_use_inc(__dst); } while (0)
-#define dst_update_rtu(__dst) do { dst_lastuse_set(__dst);dst_hold(__dst);dst_use_inc(__dst); } while (0)
-
#define dst_refcnt(__dst) atomic_read(&(__dst)->__refcnt)
#define dst_refcnt_one(__dst) atomic_set(&(__dst)->__refcnt, 1)
#define dst_refcnt_dec(__dst) atomic_dec(&(__dst)->__refcnt)
#define dst_hold(__dst) atomic_inc(&(__dst)->__refcnt)
+#endif
+#define dst_update_tu(__dst) do { \
+ dst_lastuse_set(__dst); \
+ dst_use_inc(__dst); \
+ } while (0);
+
+#define dst_update_rtu(__dst) do { \
+ dst_lastuse_set(__dst); \
+ dst_hold(__dst); \
+ dst_use_inc(__dst); \
+ } while (0)
+
static inline
void dst_release(struct dst_entry * dst)
{
if (dst) {
+#if (!defined (CONFIG_NUMA) || (RT_CACHE_DEBUG >= 2 ))
WARN_ON(dst_refcnt(dst) < 1);
+#endif
smp_mb__before_atomic_dec();
dst_refcnt_dec(dst);
}
@@ -271,6 +361,48 @@
extern void dst_init(void);
+/* This function allocates and initializes rtu array of given dst-entry.
+ */
+static inline int dst_init_rtu_array(struct dst_entry *dst)
+{
+#ifdef CONFIG_NUMA
+ int cpu;
+ dst->pcc = alloc_percpu(struct per_cpu_cnt, GFP_ATOMIC);
+ if(!dst->pcc)
+ return -ENOMEM;
+
+ for_each_cpu(cpu) {
+ per_cpu_ptr(dst->pcc, cpu)->use = 0;
+ per_cpu_ptr(dst->pcc, cpu)->refcnt = 0;
+ per_cpu_ptr(dst->pcc, cpu)->lastuse = jiffies;
+ }
+ dst->s_cpu = smp_processor_id();
+#else
+ atomic_set(&dst->__refcnt, 0);
+ dst->lastuse = jiffies;
+#endif
+ return 0;
+}
+
+static inline void dst_free_rtu_array(struct dst_entry *dst)
+{
+#ifdef CONFIG_NUMA
+ free_percpu(dst->pcc);
+#endif
+}
+
+#if defined (CONFIG_HOTPLUG_CPU) && defined (CONFIG_NUMA)
+inline static void dst_ref_xfr_cpu_down(struct dst_entry *__dst, int cpu)
+{
+ int refcnt = per_cpu_ptr((__dst)->pcc, cpu)->refcnt;
+ if (refcnt) {
+ per_cpu_ptr((__dst)->pcc, get_cpu())->refcnt += refcnt;
+ put_cpu();
+ per_cpu_ptr((__dst)->pcc, cpu)->refcnt = 0;
+ }
+}
+#endif
+
struct flowi;
#ifndef CONFIG_XFRM
static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl,
Index: alloc_percpu-2.6.13/net/bridge/br_netfilter.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/bridge/br_netfilter.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/bridge/br_netfilter.c 2005-09-12 12:24:01.000000000 -0700
@@ -85,7 +85,6 @@
static struct rtable __fake_rtable = {
.u = {
.dst = {
- .__refcnt = ATOMIC_INIT(1),
.dev = &__fake_net_device,
.path = &__fake_rtable.u.dst,
.metrics = {[RTAX_MTU - 1] = 1500},
@@ -1010,6 +1009,10 @@
{
int i;
+ if (dst_init_rtu_array(&__fake_rtable.u.dst) < 0)
+ panic("br_netfilter : cannot allocate memory for dst-entry rtu array");
+ dst_refcnt_one(&__fake_rtable.u.dst);
+
for (i = 0; i < ARRAY_SIZE(br_nf_ops); i++) {
int ret;
@@ -1046,4 +1049,5 @@
#ifdef CONFIG_SYSCTL
unregister_sysctl_table(brnf_sysctl_header);
#endif
+ dst_free_rtu_array(&__fake_rtable.u.dst);
}
Index: alloc_percpu-2.6.13/net/core/dst.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/core/dst.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/core/dst.c 2005-09-12 12:24:01.000000000 -0700
@@ -131,9 +131,9 @@
if (!dst)
return NULL;
memset(dst, 0, ops->entry_size);
- atomic_set(&dst->__refcnt, 0);
+ if (dst_init_rtu_array(dst) < 0)
+ return NULL;
dst->ops = ops;
- dst->lastuse = jiffies;
dst->path = dst;
dst->input = dst_discard_in;
dst->output = dst_discard_out;
@@ -200,6 +200,7 @@
#if RT_CACHE_DEBUG >= 2
atomic_dec(&dst_total);
#endif
+ dst_free_rtu_array(dst);
kmem_cache_free(dst->ops->kmem_cachep, dst);
dst = child;
Index: alloc_percpu-2.6.13/net/decnet/dn_route.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/decnet/dn_route.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/decnet/dn_route.c 2005-09-12 12:24:01.000000000 -0700
@@ -77,6 +77,7 @@
#include <linux/netfilter_decnet.h>
#include <linux/rcupdate.h>
#include <linux/times.h>
+#include <linux/cpu.h>
#include <asm/errno.h>
#include <net/neighbour.h>
#include <net/dst.h>
@@ -157,7 +158,29 @@
static inline int dn_dst_useful(struct dn_route *rth, unsigned long now, unsigned long expire)
{
+#ifdef CONFIG_NUMA
+ {
+ int max, sum = 0, age, cpu;
+ struct dst_entry *dst = &rth->u.dst;
+
+ cpu = dst->s_cpu;
+ max = cpu + NR_CPUS;
+ for(sum = 0; cpu < max; cpu++) {
+ int cpu_ = cpu % NR_CPUS;
+ if (cpu_online(cpu_)) {
+ sum += per_cpu_ptr(dst->pcc, cpu_)->refcnt;
+ age = now - per_cpu_ptr(dst->pcc, cpu_)->lastuse;
+ if (age <= expire) {
+ dst->s_cpu = cpu_ ;
+ return 1;
+ }
+ }
+ }
+ return (sum != 0);
+ }
+#else
return (atomic_read(&rth->u.dst.__refcnt) || (now - rth->u.dst.lastuse) < expire) ;
+#endif
}
static void dn_dst_check_expire(unsigned long dummy)
@@ -1766,6 +1789,43 @@
#endif /* CONFIG_PROC_FS */
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+static int __devinit dn_rtcache_cpu_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (int) hcpu;
+
+ switch(action) {
+ int i;
+ struct dn_route *rt, *next;
+
+ case CPU_DEAD:
+
+ for(i = 0; i < dn_rt_hash_mask; i++) {
+ spin_lock_bh(&dn_rt_hash_table[i].lock);
+
+ if ((rt = dn_rt_hash_table[i].chain) == NULL)
+ goto nothing_to_do;
+
+ for(; rt; rt=next) {
+ dst_ref_xfr_cpu_down(&rt->u.dst, cpu);
+ next = rt->u.rt_next;
+ }
+nothing_to_do:
+ spin_unlock_bh(&dn_rt_hash_table[i].lock);
+ }
+
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block dn_rtcache_cpu_notifier =
+ { &dn_rtcache_cpu_callback, NULL, 0 };
+
+#endif
+
void __init dn_route_init(void)
{
int i, goal, order;
@@ -1822,10 +1882,16 @@
dn_dst_ops.gc_thresh = (dn_rt_hash_mask + 1);
proc_net_fops_create("decnet_cache", S_IRUGO, &dn_rt_cache_seq_fops);
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+ register_cpu_notifier(&dn_rtcache_cpu_notifier);
+#endif
}
void __exit dn_route_cleanup(void)
{
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+ unregister_cpu_notifier(&dn_rtcache_cpu_notifier);
+#endif
del_timer(&dn_route_timer);
dn_run_flush(0);
Index: alloc_percpu-2.6.13/net/ipv4/route.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/ipv4/route.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/ipv4/route.c 2005-09-12 12:24:01.000000000 -0700
@@ -92,6 +92,7 @@
#include <linux/jhash.h>
#include <linux/rcupdate.h>
#include <linux/times.h>
+#include <linux/cpu.h>
#include <net/protocol.h>
#include <net/ip.h>
#include <net/route.h>
@@ -507,6 +508,54 @@
rth->u.dst.expires;
}
+#ifdef CONFIG_NUMA
+
+/*
+ * For NUMA systems, we do not want to sum up all local cpu refcnts every
+ * time. So we consider lastuse element of the dst_entry and start loop
+ * with the cpu where this entry was allocated. If dst_entry is not timed
+ * out then update s_cpu of this dst_entry so that next time we can start from
+ * that cpu.
+ */
+static inline int rt_check_age(struct rtable *rth,
+ unsigned long tmo1, unsigned long tmo2)
+{
+ int max, sum = 0, age, idx;
+ struct dst_entry *dst = &rth->u.dst;
+ unsigned long now = jiffies;
+
+ idx = dst->s_cpu;
+ max = idx + NR_CPUS;
+ for(sum = 0; idx < max; idx++) {
+ int cpu_ = idx % NR_CPUS;
+ if (cpu_online(cpu_)) {
+ sum += per_cpu_ptr(dst->pcc, cpu_)->refcnt;
+ age = now - per_cpu_ptr(dst->pcc, cpu_)->lastuse;
+ if ((age <= tmo1 && !rt_fast_clean(rth)) ||
+ (age <= tmo2 && rt_valuable(rth))) {
+ dst->s_cpu = cpu_ ;
+ return 0;
+ }
+ }
+ }
+ return (sum == 0);
+}
+
+/*
+ * In this function order of examining three factors (ref_cnt, expires,
+ * lastuse) is changed, considering the cost of analyzing refcnt and lastuse
+ * which are localized for each cpu on NUMA.
+ */
+static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
+{
+ if (rth->u.dst.expires && time_after_eq(jiffies, rth->u.dst.expires))
+ return (dst_refcnt(&rth->u.dst) == 0) ;
+
+ return rt_check_age(rth, tmo1, tmo2);
+}
+
+#else
+
static int rt_may_expire(struct rtable *rth, unsigned long tmo1, unsigned long tmo2)
{
unsigned long age;
@@ -529,6 +578,8 @@
out: return ret;
}
+#endif
+
/* Bits of score are:
* 31: very valuable
* 30: not quite useless
@@ -1108,8 +1159,19 @@
void ip_rt_copy(struct rtable *to, struct rtable *from)
{
+#ifdef CONFIG_NUMA
+ struct per_cpu_cnt *tmp_pnc;
+ tmp_pnc = to->u.dst.pcc;
+
+ *to = *from;
+ to->u.dst.pcc = tmp_pnc;
+ per_cpu_ptr(to->u.dst.pcc,get_cpu())->use = 1;
+ to->u.dst.s_cpu = smp_processor_id();
+ put_cpu();
+#else
*to = *from;
to->u.dst.__use = 1;
+#endif
}
void ip_rt_redirect(u32 old_gw, u32 daddr, u32 new_gw,
@@ -3108,6 +3170,33 @@
}
__setup("rhash_entries=", set_rhash_entries);
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+static int __devinit rtcache_cpu_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (int) hcpu;
+
+ switch(action) {
+ int i ;
+ struct rtable *rth;
+ case CPU_DEAD:
+ for(i = rt_hash_mask; i >= 0; i--) {
+ spin_lock_irq(rt_hash_lock_addr(i));
+ rth = rt_hash_table[i].chain;
+ while(rth) {
+ dst_ref_xfr_cpu_down(&rth->u.dst, cpu);
+ rth = rth->u.rt_next;
+ }
+ spin_unlock_irq(rt_hash_lock_addr(i));
+ }
+ break;
+ }
+ return NOTIFY_OK;
+}
+static struct notifier_block rtcache_cpu_notifier = { &rtcache_cpu_callback, NULL, 0 };
+#endif
+
int __init ip_rt_init(void)
{
int rc = 0;
@@ -3197,6 +3286,9 @@
xfrm_init();
xfrm4_init();
#endif
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+ register_cpu_notifier(&rtcache_cpu_notifier);
+#endif
return rc;
}
Index: alloc_percpu-2.6.13/net/ipv6/ip6_fib.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/ipv6/ip6_fib.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/ipv6/ip6_fib.c 2005-09-12 12:24:01.000000000 -0700
@@ -1209,6 +1209,35 @@
spin_unlock_bh(&fib6_gc_lock);
}
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+#include <linux/cpu.h>
+inline static int rt6_ref_xfr_cpu_down(struct rt6_info *rt, void *arg)
+{
+ dst_ref_xfr_cpu_down(&rt->u.dst, (int)arg);
+ return 0;
+}
+
+static int __devinit ipv6_rtcache_cpu_callback(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (int) hcpu;
+
+ switch(action) {
+ case CPU_DEAD:
+ write_lock_bh(&rt6_lock);
+ fib6_clean_tree(&ip6_routing_table, rt6_ref_xfr_cpu_down,
+ 0, (void *)cpu);
+ write_unlock_bh(&rt6_lock);
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block ipv6_rtcache_cpu_notifier =
+ { &ipv6_rtcache_cpu_callback, NULL, 0 };
+#endif
+
void __init fib6_init(void)
{
fib6_node_kmem = kmem_cache_create("fib6_nodes",
@@ -1217,10 +1246,16 @@
NULL, NULL);
if (!fib6_node_kmem)
panic("cannot create fib6_nodes cache");
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+ register_cpu_notifier(&ipv6_rtcache_cpu_notifier);
+#endif
}
void fib6_gc_cleanup(void)
{
+#if defined(CONFIG_NUMA) && defined(CONFIG_HOTPLUG_CPU)
+ unregister_cpu_notifier(&ipv6_rtcache_cpu_notifier);
+#endif
del_timer(&ip6_fib_timer);
kmem_cache_destroy(fib6_node_kmem);
}
Index: alloc_percpu-2.6.13/net/ipv6/route.c
===================================================================
--- alloc_percpu-2.6.13.orig/net/ipv6/route.c 2005-09-12 12:23:37.000000000 -0700
+++ alloc_percpu-2.6.13/net/ipv6/route.c 2005-09-12 12:24:01.000000000 -0700
@@ -110,8 +110,6 @@
struct rt6_info ip6_null_entry = {
.u = {
.dst = {
- .__refcnt = ATOMIC_INIT(1),
- .__use = 1,
.dev = &loopback_dev,
.obsolete = -1,
.error = -ENETUNREACH,
@@ -2104,6 +2102,10 @@
NULL, NULL);
if (!ip6_dst_ops.kmem_cachep)
panic("cannot create ip6_dst_cache");
+ if (dst_init_rtu_array(&ip6_null_entry.u.dst) < 0)
+ panic("ip6_route : can't allocate memory for dst-entry array");
+ dst_use_inc(&ip6_null_entry.u.dst);
+ dst_refcnt_one(&ip6_null_entry.u.dst);
fib6_init();
#ifdef CONFIG_PROC_FS
@@ -2130,4 +2132,5 @@
rt6_ifdown(NULL);
fib6_gc_cleanup();
kmem_cache_destroy(ip6_dst_ops.kmem_cachep);
+ dst_free_rtu_array(&ip6_null_entry.u.dst);
}
This patch provides for early calls to map_vm_area. Currently, map_vm_area
cannot be called early in the boot process since it depends on kmalloc for the
vm_struct objects. This patch is a bad hack to let map_vm_area work early, but
only for a few calls, so that the dynamic per-cpu subsystem can allocate a
block and satisfy some early requests. This is primarily to enable the slab
code to use alloc_percpu for the slab head arrays. The patch may not be
elegant, but it solves the chicken-and-egg problem of using alloc_percpu for
slab.
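In outline (a condensed sketch of the __get_vm_area() change in the diff
below, not the literal hunk): while the slab allocator is not yet up, the
vm_struct comes from a tiny static pool and is tagged VM_EARLY so it is never
handed to kfree():

	/* sketch: pick the source of the vm_struct */
	if (SLAB_READY)
		area = kmalloc(sizeof(*area), gfp_flags);
	else if (early_vmareas_idx < NR_EARLY_VMAREAS) {
		area = &early_vmareas[early_vmareas_idx++];
		flags |= VM_EARLY;	/* statically allocated, don't kfree */
	} else
		return NULL;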
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Index: alloc_percpu-2.6.13-rc6/include/linux/slab.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/slab.h 2005-08-14 21:47:56.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/include/linux/slab.h 2005-08-15 17:29:41.000000000 -0700
@@ -76,6 +76,8 @@
extern struct cache_sizes malloc_sizes[];
extern void *__kmalloc(size_t, unsigned int __nocast);
+#define SLAB_READY ({malloc_sizes[0].cs_cachep != NULL;})
+
static inline void *kmalloc(size_t size, unsigned int __nocast flags)
{
if (__builtin_constant_p(size)) {
Index: alloc_percpu-2.6.13-rc6/include/linux/vmalloc.h
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/include/linux/vmalloc.h 2005-08-15 17:28:42.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/include/linux/vmalloc.h 2005-08-15 17:29:41.000000000 -0700
@@ -8,6 +8,7 @@
#define VM_IOREMAP 0x00000001 /* ioremap() and friends */
#define VM_ALLOC 0x00000002 /* vmalloc() */
#define VM_MAP 0x00000004 /* vmap()ed pages */
+#define VM_EARLY 0x00000008 /* indicates static vm_struct */
/* bits [20..32] reserved for arch specific ioremap internals */
struct vm_struct {
Index: alloc_percpu-2.6.13-rc6/mm/vmalloc.c
===================================================================
--- alloc_percpu-2.6.13-rc6.orig/mm/vmalloc.c 2005-08-15 17:28:42.000000000 -0700
+++ alloc_percpu-2.6.13-rc6/mm/vmalloc.c 2005-08-15 17:35:29.000000000 -0700
@@ -160,6 +160,15 @@
#define IOREMAP_MAX_ORDER (7 + PAGE_SHIFT) /* 128 pages */
+/*
+ * Statically define a few vm_structs so that early per-cpu allocator code
+ * can get vm_areas even before slab is up. NR_EARLY_VMAREAS should remain
+ * in single digits
+ */
+#define NR_EARLY_VMAREAS (1)
+static struct vm_struct early_vmareas[NR_EARLY_VMAREAS];
+static int early_vmareas_idx = 0;
+
struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
unsigned long start, unsigned long end,
unsigned int gfp_flags)
@@ -168,6 +177,9 @@
unsigned long align = 1;
unsigned long addr;
+ if (unlikely(!size))
+ return NULL;
+
if (flags & VM_IOREMAP) {
int bit = fls(size);
@@ -184,10 +196,16 @@
area = kmalloc(sizeof(*area), gfp_flags);
if (unlikely(!area))
return NULL;
-
- if (unlikely(!size)) {
- kfree (area);
- return NULL;
+ if (likely(SLAB_READY)) {
+ area = kmalloc(sizeof(*area), GFP_KERNEL);
+ if (unlikely(!area))
+ return NULL;
+ } else {
+ if (early_vmareas_idx < NR_EARLY_VMAREAS) {
+ area = &early_vmareas[early_vmareas_idx++];
+ flags |= VM_EARLY;
+ } else
+ return NULL;
}
/*
@@ -228,7 +246,8 @@
out:
write_unlock(&vmlist_lock);
- kfree(area);
+ if (likely(!(flags & VM_EARLY)))
+ kfree(area);
if (printk_ratelimit())
printk(KERN_WARNING "allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.\n");
return NULL;
@@ -326,7 +345,8 @@
kfree(area->pages);
}
- kfree(area);
+ if (likely(!(area->flags & VM_EARLY)))
+ kfree(area);
return;
}
@@ -415,7 +435,8 @@
area->pages = pages;
if (!area->pages) {
remove_vm_area(area->addr);
- kfree(area);
+ if (likely(!(area->flags & VM_EARLY)))
+ kfree(area);
return NULL;
}
memset(area->pages, 0, array_size);
Patch to hotplug chunks of memory for alloc_percpu blocks when a cpu comes up.
This is needed when alloc_percpu is used very early, before the cpu_possible
mask is fully set up; the backing chunks of memory are then allocated as cpus
come up.
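The mechanism, in brief (a condensed sketch of what the diff below adds):
alloc_percpu_init() registers a CPU notifier, and at CPU_UP_PREPARE every
existing per-cpu block gets node-local backing pages mapped for the incoming
cpu, or the bring-up is vetoed:

	/* sketch: back all existing alloc_percpu blocks for a new cpu */
	static int alloc_percpu_notify(struct notifier_block *self,
				       unsigned long action, void *hcpu)
	{
		long cpu = (long)hcpu;

		if (action == CPU_UP_PREPARE && !allocate_chunk(cpu))
			return NOTIFY_BAD;	/* no memory -> veto cpu up */
		return NOTIFY_OK;
	}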
Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>
Index: alloc_percpu-2.6.13/include/linux/percpu.h
===================================================================
--- alloc_percpu-2.6.13.orig/include/linux/percpu.h 2005-09-12 12:23:34.000000000 -0700
+++ alloc_percpu-2.6.13/include/linux/percpu.h 2005-09-12 18:39:42.000000000 -0700
@@ -31,6 +31,7 @@
extern void *__alloc_percpu(size_t size, size_t align, unsigned int gfpflags);
extern void free_percpu(const void *);
+extern void __init alloc_percpu_init(void);
#else /* CONFIG_SMP */
@@ -49,6 +50,8 @@
kfree(ptr);
}
+#define alloc_percpu_init() do {} while (0)
+
#endif /* CONFIG_SMP */
/* Simple wrapper for the common case: zeros memory. */
Index: alloc_percpu-2.6.13/init/main.c
===================================================================
--- alloc_percpu-2.6.13.orig/init/main.c 2005-09-12 12:23:34.000000000 -0700
+++ alloc_percpu-2.6.13/init/main.c 2005-09-12 18:25:05.000000000 -0700
@@ -495,6 +495,9 @@
#endif
vfs_caches_init_early();
mem_init();
+
+ alloc_percpu_init();
+
kmem_cache_init();
setup_per_cpu_pageset();
numa_policy_init();
Index: alloc_percpu-2.6.13/mm/percpu.c
===================================================================
--- alloc_percpu-2.6.13.orig/mm/percpu.c 2005-09-12 12:23:34.000000000 -0700
+++ alloc_percpu-2.6.13/mm/percpu.c 2005-09-12 16:57:49.000000000 -0700
@@ -35,8 +35,9 @@
#include <linux/vmalloc.h>
#include <linux/module.h>
#include <linux/mm.h>
-
#include <linux/sort.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
#ifdef CONFIG_HIGHMEM
#include <asm/highmem.h>
#endif
@@ -200,8 +201,10 @@
rollback_pages:
j--;
for (; j >= 0; j--)
- if (cpu_possible(j / cpu_pages))
+ if (cpu_possible(j / cpu_pages)) {
__free_pages(blkp->pages[j], 0);
+ blkp->pages[j] = NULL;
+ }
/* Unmap block management */
tmp.addr = area->addr + NR_CPUS * PCPU_BLKSIZE;
@@ -222,6 +225,90 @@
return NULL;
}
+static int __devinit __allocate_chunk(int cpu, struct pcpu_block *blkp)
+{
+ unsigned int cpu_pages = PCPU_BLKSIZE >> PAGE_SHIFT;
+ int start_idx, j;
+ struct vm_struct tmp;
+ struct page **tmppage;
+
+ /* Alloc node local pages for the onlined cpu */
+ start_idx = cpu * cpu_pages;
+
+ if (blkp->pages[start_idx])
+ return 1; /* Already allocated */
+
+ for (j = start_idx; j < start_idx + cpu_pages; j++) {
+ BUG_ON(blkp->pages[j]);
+ blkp->pages[j] = alloc_pages_node(cpu_to_node(cpu),
+ GFP_ATOMIC |
+ __GFP_HIGHMEM,
+ 0);
+ if (unlikely(!blkp->pages[j]))
+ goto rollback_pages;
+ }
+
+ /* Map pages for each cpu by splitting vm_struct for each cpu */
+ tmppage = &blkp->pages[cpu * cpu_pages];
+ tmp.addr = blkp->start_addr + cpu * PCPU_BLKSIZE;
+ /* map_vm_area assumes a guard page of size PAGE_SIZE */
+ tmp.size = PCPU_BLKSIZE + PAGE_SIZE;
+ if (map_vm_area(&tmp, PAGE_KERNEL, &tmppage))
+ goto rollback_pages;
+
+ return 1; /* Success */
+
+rollback_pages:
+ j--;
+ for (; j >= 0; j--) {
+ __free_pages(blkp->pages[j], 0);
+ blkp->pages[j] = NULL;
+ }
+ return 0;
+}
+
+/* Allocate chunks for this cpu in all blocks */
+static int __devinit allocate_chunk(int cpu)
+{
+ struct pcpu_block *blkp = NULL;
+ unsigned long flags;
+ spin_lock_irqsave(&blklist_lock, flags);
+ list_for_each_entry(blkp, &blkhead, blklist) {
+ if (!__allocate_chunk(cpu, blkp)) {
+ spin_unlock_irqrestore(&blklist_lock, flags);
+ return 0;
+ }
+ }
+
+ spin_unlock_irqrestore(&blklist_lock, flags);
+ return 1;
+}
+
+
+static int __devinit alloc_percpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (!allocate_chunk(cpu))
+ return NOTIFY_BAD;
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata alloc_percpu_nb = {
+ .notifier_call = alloc_percpu_notify,
+};
+
+void __init alloc_percpu_init(void)
+{
+ register_cpu_notifier(&alloc_percpu_nb);
+}
+
/* Free memory block allocated by valloc_percpu */
static void vfree_percpu(void *addr)
{
On Tue, 13 Sep 2005 09:10:12 -0700
Ravikiran G Thirumalai <[email protected]> wrote:
> The net_device has a refcnt used to keep track of its uses.
> This is used at the time of unregistering the network device
> (module unloading, etc.; see netdev_wait_allrefs).
> For loopback_dev, this refcnt increment/decrement is causing
> unnecessary traffic on the interlink for NUMA systems,
> affecting performance. This patch improves tbench numbers by 6% on an
> 8-way x86 Xeon (x445).
>
Since when is bringing a network device up/down performance critical?
Stephen Hemminger wrote:
> On Tue, 13 Sep 2005 09:10:12 -0700
> Ravikiran G Thirumalai <[email protected]> wrote:
>
>
>>The net_device has a refcnt used to keep track of its uses.
>>This is used at the time of unregistering the network device
>>(module unloading, etc.; see netdev_wait_allrefs).
>>For loopback_dev, this refcnt increment/decrement is causing
>>unnecessary traffic on the interlink for NUMA systems,
>>affecting performance. This patch improves tbench numbers by 6% on an
>>8-way x86 Xeon (x445).
>>
>
>
> Since when is bringing a network device up/down performance critical?
We grab and drop a reference for each poll of a device, roughly.
See the dev_hold() in __netif_rx_schedule(struct net_device *dev)
in include/linux/netdevice.h, for instance.
Thanks,
Ben
--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com
On Tue, 13 Sep 2005 09:35:14 -0700
Ben Greear <[email protected]> wrote:
> Stephen Hemminger wrote:
> > On Tue, 13 Sep 2005 09:10:12 -0700
> > Ravikiran G Thirumalai <[email protected]> wrote:
> >
> >
> >>The net_device has a refcnt used to keep track of its uses.
> >>This is used at the time of unregistering the network device
> >>(module unloading, etc.; see netdev_wait_allrefs).
> >>For loopback_dev, this refcnt increment/decrement is causing
> >>unnecessary traffic on the interlink for NUMA systems,
> >>affecting performance. This patch improves tbench numbers by 6% on an
> >>8-way x86 Xeon (x445).
> >>
> >
> >
> > Since when is bringing a network device up/down performance critical?
>
> We grab and drop a reference for each poll of a device, roughly.
>
> See the dev_hold() in __netif_rx_schedule(struct net_device *dev)
> in include/linux/netdevice.h, for instance.
Yeah, that would be an issue, especially since the rest of that
path is nicely per-cpu.
Ravikiran G Thirumalai wrote:
> The net_device has a refcnt used to keep track of its uses.
> This is used at the time of unregistering the network device
> (module unloading, etc.; see netdev_wait_allrefs).
> For loopback_dev, this refcnt increment/decrement is causing
> unnecessary traffic on the interlink for NUMA systems,
> affecting performance. This patch improves tbench numbers by 6% on an
> 8-way x86 Xeon (x445).
===================================================================
> --- alloc_percpu-2.6.13.orig/include/linux/netdevice.h 2005-08-28 16:41:01.000000000 -0700
> +++ alloc_percpu-2.6.13/include/linux/netdevice.h 2005-09-12 11:54:21.000000000 -0700
> @@ -37,6 +37,7 @@
> #include <linux/config.h>
> #include <linux/device.h>
> #include <linux/percpu.h>
> +#include <linux/bigref.h>
>
> struct divert_blk;
> struct vlan_group;
> @@ -377,7 +378,7 @@
> /* device queue lock */
> spinlock_t queue_lock;
> /* Number of references to this device */
> - atomic_t refcnt;
> + struct bigref netdev_refcnt;
> /* delayed register/unregister */
> struct list_head todo_list;
> /* device name hash chain */
> @@ -677,11 +678,11 @@
Hum...
Did you try placing refcnt/netdev_refcnt in a separate cache line from
queue_lock? I got good results too...
> /* device queue lock */
> spinlock_t queue_lock;
> /* Number of references to this device */
> - atomic_t refcnt;
> + struct bigref netdev_refcnt ____cacheline_aligned_in_smp ;
> /* delayed register/unregister */
> struct list_head todo_list;
> /* device name hash chain */
Every time a cpu takes the queue_lock spinlock, it gets exclusive ownership of
one cache line. If another cpu tries to access netdev_refcnt, it has to grab
this cache line (even with a proper per-cpu design, there is still one shared
field). In fact the whole struct net_device should be re-ordered for SMP/NUMA
performance.
Eric
On Tue, Sep 13, 2005 at 08:27:52PM +0200, Eric Dumazet wrote:
> Ravikiran G Thirumalai wrote:
>
> Hum...
>
> Did you try placing refcnt/netdev_refcnt in a separate cache line from
> queue_lock? I got good results too...
>
> > /* device queue lock */
> > spinlock_t queue_lock;
> > /* Number of references to this device */
> > - atomic_t refcnt;
> > + struct bigref netdev_refcnt ____cacheline_aligned_in_smp ;
> > /* delayed register/unregister */
> > struct list_head todo_list;
> > /* device name hash chain */
>
> Every time a cpu takes the queue_lock spinlock, it gets exclusive ownership
> of one cache line. If another cpu tries to access netdev_refcnt, it has to
> grab this cache line (even with a proper per-cpu design, there is still one
> shared field). In fact the whole struct net_device should be re-ordered for
> SMP/NUMA performance.
I agree. Maybe placing the queue_lock in a different cacheline is the
right approach?
Thanks,
Kiran
Ravikiran G Thirumalai <[email protected]> wrote:
>
> Patch to add gfp_flags as args to __get_vm_area. alloc_percpu needs to use
> GFP flags as the dst_entry.refcount needs to be allocated with GFP_ATOMIC.
> Since alloc_percpu needs get_vm_area underneath, this patch changes
> __get_vmarea to accept gfp_flags as arg, so that alloc_percpu can use
> __get_vm_area. get_vm_area remains unchanged.
Is dst_alloc() ever called from IRQ or softirq contexts?
If so, __get_vm_area()'s write_lock(&vmlist_lock) is now deadlockable.
If not, then you've just added a restriction to dst_alloc()'s usage which
should be checked over by the net guys and which needs commenting in the code.
There is no way in the world this enormous amount of NUMA
complexity is being added to the destination cache layer.
Sorry.
From: Stephen Hemminger <[email protected]>
Date: Tue, 13 Sep 2005 09:26:59 -0700
> Since when is bringing a network device up/down performance critical?
The issue is the dev_get()'s that occur all over the place
during packet transmit/receive; that's what they are
trying to address.
I'm still against all of these invasive NUMA changes to the
networking though, they are simply too ugly and special cased
to consider seriously.
On Tue, Sep 13, 2005 at 01:24:42PM -0700, David S. Miller wrote:
>
> There is no way in the world this enormous amount of NUMA
> complexity is being added to the destination cache layer.
Agreed the dst changes are ugly; that can be worked on. But the
cacheline bouncing problem on the atomic_t dst_entry refcounter has been
around for quite a while -- even on SMPs, not just NUMA. We need a solution
for that. I thought you were against the dst_entry bloat caused by the
previous version of the dst patch. alloc_percpu takes that away. You had
concerns about workloads with low route locality. Unfortunately we don't have
access to infrastructure setup for such tests :(
As for the ugliness, would something along the lines of the net_device
refcounter patch in the series above be acceptable?
Thanks,
Kiran
From: Ravikiran G Thirumalai <[email protected]>
Date: Tue, 13 Sep 2005 15:07:37 -0700
> Agreed the dst changes are ugly; that can be worked on. But the
> cacheline bouncing problem on the atomic_t dst_entry refcounter has
> been around for quite a while -- even on SMPs, not just NUMA. We
> need a solution for that. I thought you were against the dst_entry
> bloat caused by the previous version of the dst patch. alloc_percpu
> takes that away. You had concerns about workloads with low route
> locality. Unfortunately we don't have access to infrastructure setup
> for such tests :(
You don't have two computers connected on a network?
All you need is that: load a bunch of routes into one system that
point to an IP address which you just force an ARP entry for (so it
just gets lost in the ether), and then generate a rDOS workload through
it from another machine using pktgen.
I'm fine with funny per-cpu memory allocation strategies, perhaps
(would have to see a patch doing _only_ that to be sure).
But using bigrefs, no way. We have enough trouble making the data
structures small without adding bloat like that. A busy server can
have hundreds of thousands of dst cache entries active on it, and they
chew up enough memory as is.
On Tue, Sep 13, 2005 at 01:26:07PM -0700, David S. Miller wrote:
> From: Stephen Hemminger <[email protected]>
> Date: Tue, 13 Sep 2005 09:26:59 -0700
>
> > Since when is bringing a network device up/down performance critical?
>
> The issue is the dev_get()'s that occur all over the place
> during packet transmit/receive; that's what they are
> trying to address.
>
> I'm still against all of these invasive NUMA changes to the
> networking though, they are simply too ugly and special cased
> to consider seriously.
All of them or the dst ones? Hopefully the netdevice refcounter patch
is not as ugly or complicated as the dst ones? And why are they special-cased?
Are networking workloads with high route locality not interesting?
Thanks,
Kiran
On Tue, Sep 13, 2005 at 03:12:16PM -0700, David S. Miller wrote:
> From: Ravikiran G Thirumalai <[email protected]>
> Date: Tue, 13 Sep 2005 15:07:37 -0700
> ...
> But using bigrefs, no way. We have enough trouble making the data
> structures small without adding bloat like that. A busy server can
> have hundreds of thousands of dst cache entries active on it, and they
> chew up enough memory as is.
>
But even 1 million dst cache entries would be 16+4 MB additional for a 4-cpu
box... is that too much? The alloc_percpu reimplementation interleaves
objects on cache lines, unlike the existing implementation, which pads per-cpu
objects out to cache lines...
If you are referring to embedded routing devices,
would they use CONFIG_NUMA or CONFIG_SMP? (bigrefs nicely fold back to
regular atomic_t's on UP)
Thanks,
Kiran
From: Ravikiran G Thirumalai <[email protected]>
Date: Tue, 13 Sep 2005 16:17:17 -0700
> But even 1 Million dst cache entries would be 16+4 MB additional for
> a 4 cpu box....is that too much?
Absolutely.
Per-cpu counters are great for things like single-instance
statistics et al. But once you start doing them per-object,
that's out-of-control bloat as far as I'm concerned.
On Tue, 2005-09-13 at 16:27 -0700, David S. Miller wrote:
> From: Ravikiran G Thirumalai <[email protected]>
> Date: Tue, 13 Sep 2005 16:17:17 -0700
>
> > But even 1 Million dst cache entries would be 16+4 MB additional for
> > a 4 cpu box....is that too much?
>
> Absolutely.
>
> Per-cpu counters are great for things like single instance
> statistics et al. But once you start doing them per-object
> that's out of control bloat as far as I'm concerned.
This is why my original per-cpu allocator patch was damn slow, and
GFP_KERNEL only. I wasn't convinced that high-churn objects are a good
fit for spreading across cpus.
I thought that net devices and modules (which currently use a primitive
hard-coded "bigref") were fair uses for bigrefs, though I'd
like to see some stats.
Cheers,
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman