2009-01-21 14:30:41

by Nick Piggin

Subject: [patch] SLQB slab allocator

Hi,

Since last posted, I've cleaned up a few bits and pieces, (hopefully)
fixed a known bug where it wouldn't boot on memoryless nodes (I don't
have a system to test with), and improved performance and reduced
locking somewhat for node-specific and interleaved allocations.

There are a few TODOs remaining (see "TODO"). Most are hopefully
obscure or relatively unimportant cases. The biggest thing really
is to test and tune on a wider range of workloads, so I'll ask for
it to be merged into the slab tree and from there into linux-next
to see what comes up. I'll work on tuning and the TODO items before
a possible mainline merge. Actually, it would be kind of instructive
if people ran into issues on the TODO list, because that would help
guide improvements...

BTW, if anybody wants explicit copyright attribution on the files,
that's fine, just send patches. I just dislike big header buildups,
which is why I make a broader acknowledgement. In fact, the other
allocators don't even explicitly acknowledge SLAB, so I didn't think
it would be a problem. I don't really know the legal issues, but
we've set plenty of precedent...

---
Introducing the SLQB slab allocator.

SLQB takes code and ideas from all other slab allocators in the tree.

The primary method for keeping lists of free objects within the allocator
is a singly-linked list, storing a pointer within the object memory itself
(or in a small additional space in the case of RCU-destroyed slabs). This is
like SLOB and SLUB, and as opposed to SLAB, which uses arrays of objects and
metadata. This reduces memory consumption and makes smaller object sizes
more realistic, as there is less overhead.
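
As a rough illustration (a minimal, user-space style sketch, not the kernel
code; it assumes the free pointer sits at offset 0 inside the object, and
"obj_list" here is just a stand-in for the kmlist structure added by this
patch):

    #include <stddef.h>

    /* Singly-linked list threaded through the free objects themselves. */
    struct obj_list {
        unsigned long nr;
        void **head, **tail;
    };

    static void list_push(struct obj_list *l, void *object)
    {
        *(void **)object = l->head;     /* the link is stored in the object */
        if (!l->nr)
            l->tail = object;
        l->head = object;
        l->nr++;
    }

    static void *list_pop(struct obj_list *l)
    {
        void *object = l->head;

        if (object) {
            l->head = *(void **)object; /* follow the stored link */
            if (!--l->nr)
                l->tail = NULL;
        }
        return object;
    }

No external per-object metadata is needed, which is what keeps small object
sizes cheap; RCU-destroyed slabs are the exception, where the link is kept in
a small extra space so the object contents are not overwritten.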

Using lists rather than arrays can reduce the cacheline footprint. When moving
objects around, SLQB can move a list of objects from one CPU to another by
simply manipulating a head pointer, whereas SLAB needs to memcpy arrays. Some
SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
can be touched during alloc/free. Newly freed objects tend to be cache hot,
and newly allocated ones tend to be touched soon anyway, so there is often
little cost to keeping metadata in the objects.
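
Building on the obj_list sketch above, handing a whole batch of objects from
one CPU to another is then just a splice of head/tail pointers (again purely
illustrative):

    /* Splice all of 'src' onto the tail of 'dst' in O(1); no object is copied. */
    static void list_splice_all(struct obj_list *dst, struct obj_list *src)
    {
        if (!src->nr)
            return;
        if (!dst->nr)
            dst->head = src->head;
        else
            *dst->tail = src->head;     /* link through dst's last object */
        dst->tail = src->tail;
        dst->nr += src->nr;
        src->head = src->tail = NULL;
        src->nr = 0;
    }

Contrast this with refilling or draining a SLAB per-CPU array, which means
copying up to hundreds of object pointers each time.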

SLQB has a per-CPU LIFO freelist of objects like SLAB (but using lists rather
than arrays). Freed objects are returned to this freelist if they belong to
the node which our CPU belongs to. So objects allocated on one CPU can be
added to the freelist of another CPU on the same node. When LIFO freelists need
to be refilled or trimmed, SLQB takes or returns objects from a list of slabs.
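
Roughly, the free-side policy is as follows (a pseudocode sketch mixing the
obj_list helpers above with structures from the patch; object_node() is a
placeholder for the page lookup the real code does, and the remote case is
sketched further below):

    static void cache_free_sketch(struct kmem_cache *s, void *object)
    {
        struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());

        if (object_node(object) == numa_node_id()) {
            /* Local node: push onto this CPU's LIFO freelist, no locking. */
            list_push(&c->list.freelist, object);
            if (c->list.freelist.nr > s->hiwater)
                flush_free_list(s, &c->list);   /* trim excess back to slabs */
        } else {
            /* Remote node: batch it up for the owning list (see below). */
            remote_free_sketch(s, object, c);
        }
    }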

SLQB has per-CPU lists of slabs (which use struct page as their metadata,
including the list head for this list). Each slab contains a singly-linked
list of the objects that are free in that slab (free, and not on a LIFO
freelist). Slabs are freed as soon as all their objects are freed, and only
allocated when there are no slabs remaining on the list. They are taken off
this slab list when no free objects are left, so the slab lists only ever
contain "partial" slabs: those which are neither completely full nor
completely empty. SLQB's per-CPU slab lists can be manipulated with no
locking, unlike other allocators, which tend to use per-node locks. As the
number of threads per socket increases, this should help improve the
scalability of slab operations.
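
The invariant this maintains can be written as a hypothetical debug check
(not part of the patch; the structure fields match those introduced below):

    /* Pages on a kmem_cache_list's partial list are always partial. */
    static void check_partial_invariant(struct kmem_cache *s, struct slqb_page *page)
    {
        VM_BUG_ON(page->inuse == 0);            /* fully free pages get freed */
        VM_BUG_ON(page->inuse == s->objects);   /* fully used pages are unlinked */
    }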

Freeing objects to remote slab lists first batches up the objects on the
freeing CPU, then moves them over at once to a list on the allocating CPU.
The allocating CPU will then notice those objects and pull them onto the end
of its freelist. This remote freeing scheme is designed to minimise the number
of cross-CPU cachelines touched, short of going to a "crossbar" arrangement
like SLAB has. SLAB's "crossbars" are NR_CPUS*MAX_NUMNODES arrays of object
arrays, which can become very bloated on huge systems (this could be hundreds
of GB of kmem caches on a 4096 CPU, 1024 node system).
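
The producer side of that scheme, roughly (a pseudocode sketch reusing
list_push()/list_splice_all() from above; freebatch() and owner_list() are
placeholders for the corresponding fields and lookups in the patch):

    static void remote_free_sketch(struct kmem_cache *s, void *object,
                    struct kmem_cache_cpu *c)
    {
        struct kmem_cache_list *dst = owner_list(c);    /* list the object came from */

        list_push(&c->rlist, object);           /* no lock: rlist is CPU-local */
        if (c->rlist.nr < freebatch(s))
            return;

        /* One lock acquisition and a few cachelines per batch, not per object. */
        spin_lock(&dst->remote_free.lock);
        list_splice_all(&dst->remote_free.list, &c->rlist);
        dst->remote_free_check = 1;             /* tell the owner CPU to claim these */
        spin_unlock(&dst->remote_free.lock);
    }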

SLQB also has similar freelist and slab-list structures per node, protected
by a lock and usable by any CPU in order to do node-specific allocations.
These allocations tend not to be too frequent (short-lived allocations should
be node-local, and long-lived allocations should not be too frequent).

There is a good overview and illustration of the design here:

http://lwn.net/Articles/311502/

By using LIFO freelists like SLAB, SLQB tries to be very page-size agnostic.
It tries very hard to use order-0 pages. This is good for both page allocator
fragmentation and slab fragmentation.

SLQB initialisation code attempts to be as simple and un-clever as possible.
There are no multiple phases where different things come up. There is no
weird self-bootstrapping stuff. It just statically allocates the structures
required to create the slabs that allocate other slab structures.

SLQB reuses much of the debugging infrastructure and the fine-grained sysfs
statistics from SLUB. There is also Documentation/vm/slqbinfo.c, derived
from slabinfo.c, which can query the sysfs data.

Signed-off-by: Nick Piggin <[email protected]>
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
#ifndef __LINUX_RCUPDATE_H
#define __LINUX_RCUPDATE_H

+#include <linux/rcu_types.h>
#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/threads.h>
@@ -42,16 +43,6 @@
#include <linux/lockdep.h>
#include <linux/completion.h>

-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
- struct rcu_head *next;
- void (*func)(struct rcu_head *head);
-};
-
#if defined(CONFIG_CLASSIC_RCU)
#include <linux/rcuclassic.h>
#elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <[email protected]>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+ ALLOC, /* Allocation count */
+ ALLOC_SLAB_FILL, /* Fill freelist from page list */
+ ALLOC_SLAB_NEW, /* New slab acquired from page allocator */
+ FREE, /* Free count */
+ FREE_REMOTE, /* NUMA: freeing to remote list */
+ FLUSH_FREE_LIST, /* Freelist flushed */
+ FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+ FLUSH_FREE_LIST_REMOTE, /* Objects flushed from freelist to remote */
+ FLUSH_SLAB_PARTIAL, /* Freeing moves slab to partial list */
+ FLUSH_SLAB_FREE, /* Slab freed to the page allocator */
+ FLUSH_RFREE_LIST, /* Rfree list flushed */
+ FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+ CLAIM_REMOTE_LIST, /* Remote freed list claimed */
+ CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+ NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+ unsigned long nr;
+ void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+ spinlock_t lock;
+ struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+ struct kmlist freelist; /* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+ int remote_free_check; /* remote_free has reached a watermark */
+#endif
+ struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+ unsigned long nr_partial; /* Number of partial slabs (pages) */
+ struct list_head partial; /* Slabs which have some free objects */
+
+ unsigned long nr_slabs; /* Total number of slabs allocated */
+
+ //struct list_head full;
+
+#ifdef CONFIG_SMP
+ /*
+ * In the case of per-cpu lists, remote_free is for objects freed by
+ * non-owner CPU back to its home list. For per-node lists, remote_free
+ * is always used to free objects.
+ */
+ struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+ struct kmem_cache_list list; /* List for node-local slabs. */
+
+ unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+ /*
+ * rlist is a list of objects that don't fit on list.freelist (ie.
+ * wrong node). The objects all correspond to a given kmem_cache_list,
+ * remote_cache_list. To free objects to another list, we must first
+ * flush the existing objects, then switch remote_cache_list.
+ *
+ * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+ * get to O(NR_CPUS^2) memory consumption situation.
+ */
+ struct kmlist rlist;
+ struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+ struct kmem_cache_list list;
+ spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+ unsigned long flags;
+ int hiwater; /* LIFO list high watermark */
+ int freebatch; /* LIFO freelist batch flush size */
+ int objsize; /* The size of an object without meta data */
+ int offset; /* Free pointer offset. */
+ int objects; /* Number of objects in slab */
+
+ int size; /* The size of an object including meta data */
+ int order; /* Allocation order */
+ gfp_t allocflags; /* gfp flags to use on allocation */
+ unsigned int colour_range; /* range of colour counter */
+ unsigned int colour_off; /* offset per colour */
+ void (*ctor)(void *);
+
+ const char *name; /* Name (only for display!) */
+ struct list_head list; /* List of slab caches */
+
+ int align; /* Alignment */
+ int inuse; /* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+ struct kobject kobj; /* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+ struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+ struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find the index into the kmalloc
+ * cache arrays. The get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+ if (unlikely(!size))
+ return 0;
+ if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+ return 0;
+
+ if (unlikely(size <= KMALLOC_MIN_SIZE))
+ return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+ if (size > 64 && size <= 96)
+ return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+ if (size > 128 && size <= 192)
+ return 2;
+#endif
+ if (size <= 8) return 3;
+ if (size <= 16) return 4;
+ if (size <= 32) return 5;
+ if (size <= 64) return 6;
+ if (size <= 128) return 7;
+ if (size <= 256) return 8;
+ if (size <= 512) return 9;
+ if (size <= 1024) return 10;
+ if (size <= 2 * 1024) return 11;
+ if (size <= 4 * 1024) return 12;
+ if (size <= 8 * 1024) return 13;
+ if (size <= 16 * 1024) return 14;
+ if (size <= 32 * 1024) return 15;
+ if (size <= 64 * 1024) return 16;
+ if (size <= 128 * 1024) return 17;
+ if (size <= 256 * 1024) return 18;
+ if (size <= 512 * 1024) return 19;
+ if (size <= 1024 * 1024) return 20;
+ if (size <= 2 * 1024 * 1024) return 21;
+ return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+ int index = kmalloc_index(size);
+
+ if (unlikely(index == 0))
+ return NULL;
+
+ if (likely(!(flags & SLQB_DMA)))
+ return &kmalloc_caches[index];
+ else
+ return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc(s, flags);
+ }
+ return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+ }
+ return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG

choice
prompt "Choose SLAB allocator"
- default SLUB
+ default SLQB
help
This option allows to select a slab allocator.

@@ -827,6 +827,11 @@ config SLUB
and has enhanced diagnostics. SLUB is the default choice for
a slab allocator.

+config SLQB
+ bool "SLQB (Queued allocator)"
+ help
+ SLQB is a proposed new slab allocator.
+
config SLOB
depends on EMBEDDED
bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
config SLABINFO
bool
depends on PROC_FS
- depends on SLAB || SLUB_DEBUG
+ depends on SLAB || SLUB_DEBUG || SLQB
default y

config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
out which slabs are relevant to a particular load.
Try running: slabinfo -DA

+config SLQB_DEBUG
+ default y
+ bool "Enable SLQB debugging support"
+ depends on SLQB
+
+config SLQB_DEBUG_ON
+ default n
+ bool "SLQB debugging on by default"
+ depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+ bool "Create SYSFS entries for slab caches"
+ default n
+ depends on SLQB
+
+config SLQB_STATS
+ bool "Enable SLQB performance statistics"
+ default n
+ depends on SLQB_SYSFS
+
config DEBUG_PREEMPT
bool "Debug preemptible kernel"
depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3436 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ * they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ * to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ * a default closest home node via which they can use fastpath functions.
+ * Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+ union {
+ struct {
+ unsigned long flags; /* mandatory */
+ atomic_t _count; /* mandatory */
+ unsigned int inuse; /* Nr of objects */
+ struct kmem_cache_list *list; /* Pointer to list */
+ void **freelist; /* freelist req. slab lock */
+ union {
+ struct list_head lru; /* misc. list */
+ struct rcu_head rcu_head; /* for rcu freeing */
+ };
+ };
+ struct page page;
+ };
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+ return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+ return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ * kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ * objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ * the object list, it tries from the partial list, then the free list.
+ * After freeing an object to the object list, if it is over a watermark,
+ * some objects are freed back to pages. If an allocation misses these lists,
+ * a new slab page is allocated from the page allocator. If the free list
+ * reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ * node are queued to. When this reaches a watermark, the objects are
+ * flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ * to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ * used to protect access to this queue.
+ *
+ * When the remotely freed queue reaches a watermark, a flag is set to tell
+ * the owner CPU to check it. The owner CPU will then check the queue on the
+ * next allocation that misses the object list. It will move all objects from
+ * this list onto the object list and then allocate one.
+ *
+ * This system of remote queueing is intended to reduce lock and remote
+ * cacheline acquisitions, and give a cooling off period for remotely freed
+ * objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ * allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+ unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+ return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+ return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+ return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+ return page_to_nid(virt_to_page_fast(addr));
+#else
+ return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+ struct page *p;
+
+ p = virt_to_head_page(addr);
+ return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+ unsigned int order)
+{
+ struct page *p;
+
+ if (nid == -1)
+ p = alloc_pages(flags, order);
+ else
+ p = alloc_pages_node(nid, flags, order);
+
+ return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+ struct page *p = &page->page;
+
+ reset_page_mapcount(p);
+ p->mapping = NULL;
+ VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+ p->flags &= ~PG_SLQB_BIT;
+
+ __free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return (s->flags &
+ (SLAB_DEBUG_FREE |
+ SLAB_RED_ZONE |
+ SLAB_POISON |
+ SLAB_STORE_USER |
+ SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+ SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON 0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size() L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+ void *addr; /* Called from address */
+ int cpu; /* Was running on cpu */
+ int pid; /* Pid context */
+ unsigned long when; /* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+ return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+ VM_BUG_ON(!s->cpu_slab[cpu]);
+ return s->cpu_slab[cpu];
+#else
+ return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+ struct slqb_page *page, const void *object)
+{
+ void *base;
+
+ base = slqb_page_address(page);
+ if (object < base || object >= base + s->objects * s->size ||
+ (object - base) % s->size) {
+ return 0;
+ }
+
+ return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+ return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+ *(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+ for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+ __p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+ for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+ __p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+ int i, offset;
+ int newline = 1;
+ char ascii[17];
+
+ ascii[16] = 0;
+
+ for (i = 0; i < length; i++) {
+ if (newline) {
+ printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+ newline = 0;
+ }
+ printk(KERN_CONT " %02x", addr[i]);
+ offset = i % 16;
+ ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+ if (offset == 15) {
+ printk(KERN_CONT " %s\n", ascii);
+ newline = 1;
+ }
+ }
+ if (!newline) {
+ i %= 16;
+ while (i < 16) {
+ printk(KERN_CONT " ");
+ ascii[i] = ' ';
+ i++;
+ }
+ printk(KERN_CONT " %s\n", ascii);
+ }
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+ enum track_item alloc)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+ enum track_item alloc, void *addr)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ p += alloc;
+ if (addr) {
+ p->addr = addr;
+ p->cpu = raw_smp_processor_id();
+ p->pid = current ? current->pid : -1;
+ p->when = jiffies;
+ } else
+ memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ set_track(s, object, TRACK_FREE, NULL);
+ set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+ if (!t->addr)
+ return;
+
+ printk(KERN_ERR "INFO: %s in ", s);
+ __print_symbol("%s", (unsigned long)t->addr);
+ printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+ print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+ printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+ page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "========================================"
+ "=====================================\n");
+ printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+ printk(KERN_ERR "----------------------------------------"
+ "-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned int off; /* Offset of last byte */
+ u8 *addr = slqb_page_address(page);
+
+ print_tracking(s, p);
+
+ print_page_info(page);
+
+ printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+ p, p - addr, get_freepointer(s, p));
+
+ if (p > addr + 16)
+ print_section("Bytes b4", p - 16, 16);
+
+ print_section("Object", p, min(s->objsize, 128));
+
+ if (s->flags & SLAB_RED_ZONE)
+ print_section("Redzone", p + s->objsize,
+ s->inuse - s->objsize);
+
+ if (s->offset)
+ off = s->offset + sizeof(void *);
+ else
+ off = s->inuse;
+
+ if (s->flags & SLAB_STORE_USER)
+ off += 2 * sizeof(struct track);
+
+ if (off != s->size)
+ /* Beginning of the filler is the free pointer */
+ print_section("Padding", p + off, s->size - off);
+
+ dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *reason)
+{
+ slab_bug(s, reason);
+ print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ slab_bug(s, "%s", buf);
+ print_page_info(page);
+ dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+ u8 *p = object;
+
+ if (s->flags & __OBJECT_POISON) {
+ memset(p, POISON_FREE, s->objsize - 1);
+ p[s->objsize - 1] = POISON_END;
+ }
+
+ if (s->flags & SLAB_RED_ZONE)
+ memset(p + s->objsize,
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+ s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+ while (bytes) {
+ if (*start != (u8)value)
+ return start;
+ start++;
+ bytes--;
+ }
+ return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+ void *from, void *to)
+{
+ slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+ memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *what,
+ u8 *start, unsigned int value, unsigned int bytes)
+{
+ u8 *fault;
+ u8 *end;
+
+ fault = check_bytes(start, value, bytes);
+ if (!fault)
+ return 1;
+
+ end = start + bytes;
+ while (end > fault && end[-1] == value)
+ end--;
+
+ slab_bug(s, "%s overwritten", what);
+ printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+ fault, end - 1, fault[0], value);
+ print_trailer(s, page, object);
+
+ restore_bytes(s, what, value, fault, end);
+ return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * Bytes of the object to be managed.
+ * If the freepointer may overlay the object then the free
+ * pointer is the first word of the object.
+ *
+ * Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * Padding to reach word boundary. This is also used for Redzoning.
+ * Padding is extended by another word if Redzoning is enabled and
+ * objsize == inuse.
+ *
+ * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * Meta data starts here.
+ *
+ * A. Free pointer (if we cannot overwrite object on free)
+ * B. Tracking data for SLAB_STORE_USER
+ * C. Padding to reach required alignment boundary or at minimum
+ * one word if debugging is on to be able to detect writes
+ * before the word boundary.
+ *
+ * Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned long off = s->inuse; /* The end of info */
+
+ if (s->offset)
+ /* Freepointer is placed after the object. */
+ off += sizeof(void *);
+
+ if (s->flags & SLAB_STORE_USER)
+ /* We also have user information there */
+ off += 2 * sizeof(struct track);
+
+ if (s->size == off)
+ return 1;
+
+ return check_bytes_and_report(s, page, p, "Object padding",
+ p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+ u8 *start;
+ u8 *fault;
+ u8 *end;
+ int length;
+ int remainder;
+
+ if (!(s->flags & SLAB_POISON))
+ return 1;
+
+ start = slqb_page_address(page);
+ end = start + (PAGE_SIZE << s->order);
+ length = s->objects * s->size;
+ remainder = end - (start + length);
+ if (!remainder)
+ return 1;
+
+ fault = check_bytes(start + length, POISON_INUSE, remainder);
+ if (!fault)
+ return 1;
+ while (end > fault && end[-1] == POISON_INUSE)
+ end--;
+
+ slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+ print_section("Padding", start, length);
+
+ restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+ return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active)
+{
+ u8 *p = object;
+ u8 *endobject = object + s->objsize;
+
+ if (s->flags & SLAB_RED_ZONE) {
+ unsigned int red =
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+ if (!check_bytes_and_report(s, page, object, "Redzone",
+ endobject, red, s->inuse - s->objsize))
+ return 0;
+ } else {
+ if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+ check_bytes_and_report(s, page, p, "Alignment padding",
+ endobject, POISON_INUSE, s->inuse - s->objsize);
+ }
+ }
+
+ if (s->flags & SLAB_POISON) {
+ if (!active && (s->flags & __OBJECT_POISON) &&
+ (!check_bytes_and_report(s, page, p, "Poison", p,
+ POISON_FREE, s->objsize - 1) ||
+ !check_bytes_and_report(s, page, p, "Poison",
+ p + s->objsize - 1, POISON_END, 1)))
+ return 0;
+ /*
+ * check_pad_bytes cleans up on its own.
+ */
+ check_pad_bytes(s, page, p);
+ }
+
+ return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ if (!(page->flags & PG_SLQB_BIT)) {
+ slab_err(s, page, "Not a valid slab page");
+ return 0;
+ }
+ if (page->inuse == 0) {
+ slab_err(s, page, "inuse before free / after alloc");
+ return 0;
+ }
+ if (page->inuse > s->objects) {
+ slab_err(s, page, "inuse %u > max %u",
+ page->inuse, s->objects);
+ return 0;
+ }
+ /* Slab_pad_check fixes things up after itself */
+ slab_pad_check(s, page);
+ return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+ if (s->flags & SLAB_TRACE) {
+ printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+ s->name,
+ alloc ? "alloc" : "free",
+ object, page->inuse,
+ page->freelist);
+
+ if (!alloc)
+ print_section("Object", (void *)object, s->objsize);
+
+ dump_stack();
+ }
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+ void *object)
+{
+ if (!slab_debug(s))
+ return;
+
+ if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+ return;
+
+ init_object(s, object, 0);
+ init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto bad;
+
+ if (!check_valid_pointer(s, page, object)) {
+ object_err(s, page, object, "Freelist Pointer check fails");
+ goto bad;
+ }
+
+ if (object && !check_object(s, page, object, 0))
+ goto bad;
+
+ /* Success perform special debug activities for allocs */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_ALLOC, addr);
+ trace(s, page, object, 1);
+ init_object(s, object, 1);
+ return 1;
+
+bad:
+ return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto fail;
+
+ if (!check_valid_pointer(s, page, object)) {
+ slab_err(s, page, "Invalid object pointer 0x%p", object);
+ goto fail;
+ }
+
+ if (!check_object(s, page, object, 1))
+ return 0;
+
+ /* Special debug activities for freeing objects */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_FREE, addr);
+ trace(s, page, object, 0);
+ init_object(s, object, 0);
+ return 1;
+
+fail:
+ slab_fix(s, "Object at 0x%p not freed", object);
+ return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+ slqb_debug = DEBUG_DEFAULT_FLAGS;
+ if (*str++ != '=' || !*str)
+ /*
+ * No options specified. Switch on full debugging.
+ */
+ goto out;
+
+ if (*str == ',')
+ /*
+ * No options but restriction on slabs. This means full
+ * debugging for slabs matching a pattern.
+ */
+ goto check_slabs;
+
+ slqb_debug = 0;
+ if (*str == '-')
+ /*
+ * Switch off all debugging measures.
+ */
+ goto out;
+
+ /*
+ * Determine which debug features should be switched on
+ */
+ for (; *str && *str != ','; str++) {
+ switch (tolower(*str)) {
+ case 'f':
+ slqb_debug |= SLAB_DEBUG_FREE;
+ break;
+ case 'z':
+ slqb_debug |= SLAB_RED_ZONE;
+ break;
+ case 'p':
+ slqb_debug |= SLAB_POISON;
+ break;
+ case 'u':
+ slqb_debug |= SLAB_STORE_USER;
+ break;
+ case 't':
+ slqb_debug |= SLAB_TRACE;
+ break;
+ default:
+ printk(KERN_ERR "slqb_debug option '%c' "
+ "unknown. skipped\n", *str);
+ }
+ }
+
+check_slabs:
+ if (*str == ',')
+ slqb_debug_slabs = str + 1;
+out:
+ return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name,
+ void (*ctor)(void *))
+{
+ /*
+ * Enable debugging if selected on the kernel commandline.
+ */
+ if (slqb_debug && (!slqb_debug_slabs ||
+ strncmp(slqb_debug_slabs, name,
+ strlen(slqb_debug_slabs)) == 0))
+ flags |= slqb_debug;
+
+ return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+ struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+ void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+ void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+ { return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name, void (*ctor)(void *))
+{
+ return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+ struct slqb_page *page;
+ int pages = 1 << s->order;
+
+ flags |= s->allocflags;
+
+ page = alloc_slqb_pages_node(node, flags, s->order);
+ if (!page)
+ return NULL;
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ pages);
+
+ return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object)
+{
+ setup_object_debug(s, page, object);
+ if (unlikely(s->ctor))
+ s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+ struct slqb_page *page;
+ void *start;
+ void *last;
+ void *p;
+
+ BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+ page = allocate_slab(s,
+ flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+ if (!page)
+ goto out;
+
+ page->flags |= PG_SLQB_BIT;
+
+ start = page_address(&page->page);
+
+ if (unlikely(slab_poison(s)))
+ memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+ start += colour;
+
+ last = start;
+ for_each_object(p, s, start) {
+ setup_object(s, page, p);
+ set_freepointer(s, last, p);
+ last = p;
+ }
+ set_freepointer(s, last, NULL);
+
+ page->freelist = start;
+ page->inuse = 0;
+out:
+ return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ int pages = 1 << s->order;
+
+ if (unlikely(slab_debug(s))) {
+ void *p;
+
+ slab_pad_check(s, page);
+ for_each_free_object(p, s, page->freelist)
+ check_object(s, page, p, 0);
+ }
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ -pages);
+
+ __free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+ struct slqb_page *page;
+
+ page = container_of((struct list_head *)h, struct slqb_page, lru);
+ __free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ VM_BUG_ON(page->inuse);
+ if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+ call_rcu(&page->rcu_head, rcu_free_slab);
+ else
+ __free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+ VM_BUG_ON(page->list != l);
+
+ set_freepointer(s, object, page->freelist);
+ page->freelist = object;
+ page->inuse--;
+
+ if (!page->inuse) {
+ if (likely(s->objects > 1)) {
+ l->nr_partial--;
+ list_del(&page->lru);
+ }
+ l->nr_slabs--;
+ free_slab(s, page);
+ slqb_stat_inc(l, FLUSH_SLAB_FREE);
+ return 1;
+ } else if (page->inuse + 1 == s->objects) {
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+ return 0;
+ }
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ struct kmem_cache_cpu *c;
+ void **head;
+ int nr;
+
+ nr = l->freelist.nr;
+ if (unlikely(!nr))
+ return;
+
+ nr = min(slab_freebatch(s), nr);
+
+ slqb_stat_inc(l, FLUSH_FREE_LIST);
+ slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+ c = get_cpu_slab(s, smp_processor_id());
+
+ l->freelist.nr -= nr;
+ head = l->freelist.head;
+
+ do {
+ struct slqb_page *page;
+ void **object;
+
+ object = head;
+ VM_BUG_ON(!object);
+ head = get_freepointer(s, object);
+ page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+ if (page->list != l) {
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+ } else
+#endif
+ free_object_to_page(s, l, page, object);
+
+ nr--;
+ } while (nr);
+
+ l->freelist.head = head;
+ if (!l->freelist.nr)
+ l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ while (l->freelist.nr)
+ flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ void **head, **tail;
+ int nr;
+
+ VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+ if (!l->remote_free.list.nr)
+ return;
+
+ l->remote_free_check = 0;
+ head = l->remote_free.list.head;
+ /* Get the head hot for the likely subsequent allocation or flush */
+ prefetchw(head);
+
+ spin_lock(&l->remote_free.lock);
+ l->remote_free.list.head = NULL;
+ tail = l->remote_free.list.tail;
+ l->remote_free.list.tail = NULL;
+ nr = l->remote_free.list.nr;
+ l->remote_free.list.nr = 0;
+ spin_unlock(&l->remote_free.lock);
+
+ if (!l->freelist.nr)
+ l->freelist.head = head;
+ else
+ set_freepointer(s, l->freelist.tail, head);
+ l->freelist.tail = tail;
+
+ l->freelist.nr += nr;
+
+ slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+ slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ void *object;
+
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+// if (next)
+// prefetchw(next);
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+ if (unlikely(l->remote_free_check)) {
+ claim_remote_free_list(s, l);
+
+ if (l->freelist.nr > slab_hiwater(s))
+ flush_free_list(s, l);
+
+ /* repetition here helps gcc :( */
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+// if (next)
+// prefetchw(next);
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+ }
+#endif
+
+ return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ struct slqb_page *page;
+ void *object;
+
+ if (unlikely(!l->nr_partial))
+ return NULL;
+
+ page = list_first_entry(&l->partial, struct slqb_page, lru);
+ VM_BUG_ON(page->inuse == s->objects);
+ if (page->inuse + 1 == s->objects) {
+ l->nr_partial--;
+ list_del(&page->lru);
+/*XXX list_move(&page->lru, &l->full); */
+ }
+
+ VM_BUG_ON(!page->freelist);
+
+ page->inuse++;
+
+// VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+ object = page->freelist;
+ page->freelist = get_freepointer(s, object);
+ if (page->freelist)
+ prefetchw(page->freelist);
+ VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+ slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+ return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ struct slqb_page *page;
+ struct kmem_cache_list *l;
+ struct kmem_cache_cpu *c;
+ unsigned int colour;
+ void *object;
+
+ c = get_cpu_slab(s, smp_processor_id());
+ colour = c->colour_next;
+ c->colour_next += s->colour_off;
+ if (c->colour_next >= s->colour_range)
+ c->colour_next = 0;
+
+ /* XXX: load any partial? */
+
+ /* Caller handles __GFP_ZERO */
+ gfpflags &= ~__GFP_ZERO;
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+ page = new_slab_page(s, gfpflags, node, colour);
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+ if (unlikely(!page))
+ return page;
+
+ if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+ struct kmem_cache_cpu *c;
+ int cpu = smp_processor_id();
+
+ c = get_cpu_slab(s, cpu);
+ l = &c->list;
+ page->list = l;
+
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+ object = __cache_list_get_page(s, l);
+#ifdef CONFIG_NUMA
+ } else {
+ struct kmem_cache_node *n;
+
+ n = s->node[slqb_page_to_nid(page)];
+ l = &n->list;
+ page->list = l;
+
+ spin_lock(&n->list_lock);
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+ object = __cache_list_get_page(s, l);
+ spin_unlock(&n->list_lock);
+#endif
+ }
+ VM_BUG_ON(!object);
+ return object;
+}
+
+#ifdef CONFIG_NUMA
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+ return node;
+ if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+ return cpuset_mem_spread_node();
+ else if (current->mempolicy)
+ return slab_node(current->mempolicy);
+ return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ struct kmem_cache_node *n;
+ struct kmem_cache_list *l;
+ void *object;
+
+ n = s->node[node];
+ if (unlikely(!n)) /* node has no memory */
+ return NULL;
+ l = &n->list;
+
+// if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+// return NULL;
+
+ spin_lock(&n->list_lock);
+
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object)) {
+ object = __cache_list_get_page(s, l);
+ if (unlikely(!object)) {
+ spin_unlock(&n->list_lock);
+ return __slab_alloc_page(s, gfpflags, node);
+ }
+ }
+ if (likely(object))
+ slqb_stat_inc(l, ALLOC);
+ spin_unlock(&n->list_lock);
+ return object;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ void *object;
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+ if (unlikely(node != -1) && unlikely(node != numa_node_id()))
+ return __remote_slab_alloc(s, gfpflags, node);
+#endif
+
+ c = get_cpu_slab(s, smp_processor_id());
+ VM_BUG_ON(!c);
+ l = &c->list;
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object)) {
+ object = __cache_list_get_page(s, l);
+ if (unlikely(!object))
+ return __slab_alloc_page(s, gfpflags, node);
+ }
+ if (likely(object))
+ slqb_stat_inc(l, ALLOC);
+ return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node, void *addr)
+{
+ void *object;
+ unsigned long flags;
+
+again:
+ local_irq_save(flags);
+ object = __slab_alloc(s, gfpflags, node);
+ local_irq_restore(flags);
+
+ if (unlikely(slab_debug(s)) && likely(object)) {
+ if (unlikely(!alloc_debug_processing(s, object, addr)))
+ goto again;
+ }
+
+ if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+ memset(object, 0, s->objsize);
+
+ return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
+{
+ int node = -1;
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, gfpflags, node);
+#endif
+ return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+ return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+ struct kmlist *src;
+ struct kmem_cache_list *dst;
+ unsigned int nr;
+ int set;
+
+ src = &c->rlist;
+ nr = src->nr;
+ if (unlikely(!nr))
+ return;
+
+#ifdef CONFIG_SLQB_STATS
+ {
+ struct kmem_cache_list *l = &c->list;
+ slqb_stat_inc(l, FLUSH_RFREE_LIST);
+ slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+ }
+#endif
+
+ dst = c->remote_cache_list;
+
+ spin_lock(&dst->remote_free.lock);
+ if (!dst->remote_free.list.head)
+ dst->remote_free.list.head = src->head;
+ else
+ set_freepointer(s, dst->remote_free.list.tail, src->head);
+ dst->remote_free.list.tail = src->tail;
+
+ src->head = NULL;
+ src->tail = NULL;
+ src->nr = 0;
+
+ if (dst->remote_free.list.nr < slab_freebatch(s))
+ set = 1;
+ else
+ set = 0;
+
+ dst->remote_free.list.nr += nr;
+
+ if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+ dst->remote_free_check = 1;
+
+ spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+{
+ struct kmlist *r;
+
+ /*
+ * Our remote free list corresponds to a different list. Must
+ * flush it and switch.
+ */
+ if (page->list != c->remote_cache_list) {
+ flush_remote_free_cache(s, c);
+ c->remote_cache_list = page->list;
+ }
+
+ r = &c->rlist;
+ if (!r->head)
+ r->head = object;
+ else
+ set_freepointer(s, r->tail, object);
+ set_freepointer(s, object, NULL);
+ r->tail = object;
+ r->nr++;
+
+ if (unlikely(r->nr > slab_freebatch(s)))
+ flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Free an object back to its list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+ int thiscpu = smp_processor_id();
+
+ c = get_cpu_slab(s, thiscpu);
+ l = &c->list;
+
+ slqb_stat_inc(l, FREE);
+
+ if (!NUMA_BUILD || !numa_platform ||
+ likely(slqb_page_to_nid(page) == numa_node_id())) {
+ /*
+ * Freeing fastpath. Collects all local-node objects, not
+ * just those allocated from our per-CPU list. This allows
+ * fast transfer of objects from one CPU to another within
+ * a given node.
+ */
+ set_freepointer(s, object, l->freelist.head);
+ l->freelist.head = object;
+ if (!l->freelist.nr)
+ l->freelist.tail = object;
+ l->freelist.nr++;
+
+ if (unlikely(l->freelist.nr > slab_hiwater(s)))
+ flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+ } else {
+ /*
+ * Freeing an object that was allocated on a remote node.
+ */
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FREE_REMOTE);
+#endif
+ }
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ unsigned long flags;
+
+ prefetchw(object);
+
+ debug_check_no_locks_freed(object, s->objsize);
+ if (likely(object) && unlikely(slab_debug(s))) {
+ if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+ return;
+ }
+
+ local_irq_save(flags);
+ __slab_free(s, page, object);
+ local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+ struct slqb_page *page = NULL;
+ if (numa_platform)
+ page = virt_to_head_slqb_page(object);
+ slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator, and they have fastpaths in the page allocator. But we
+ * also want to minimise external fragmentation with large objects.
+ */
+static inline int slab_order(int size, int max_order, int frac)
+{
+ int order;
+
+ if (fls(size - 1) <= PAGE_SHIFT)
+ order = 0;
+ else
+ order = fls(size - 1) - PAGE_SHIFT;
+ while (order <= max_order) {
+ unsigned long slab_size = PAGE_SIZE << order;
+ unsigned long objects;
+ unsigned long waste;
+
+ objects = slab_size / size;
+ if (!objects)
+ continue;
+
+ waste = slab_size - (objects * size);
+
+ if (waste * frac <= slab_size)
+ break;
+
+ order++;
+ }
+
+ return order;
+}
+
+static inline int calculate_order(int size)
+{
+ int order;
+
+ /*
+ * Attempt to find best configuration for a slab. This
+ * works by first attempting to generate a layout with
+ * the best configuration and backing off gradually.
+ */
+ order = slab_order(size, 1, 4);
+ if (order <= 1)
+ return order;
+
+ /*
+ * This size cannot fit in order-1. Allow bigger orders, but
+ * forget about trying to save space.
+ */
+ order = slab_order(size, MAX_ORDER, 0);
+ if (order <= MAX_ORDER)
+ return order;
+
+ return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+ unsigned long align, unsigned long size)
+{
+ /*
+ * If the user wants hardware cache aligned objects then follow that
+ * suggestion if the object is sufficiently large.
+ *
+ * The hardware cache alignment cannot override the specified
+ * alignment though. If the specified alignment is greater, use it.
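+ *
+ * For example, with 64-byte cache lines a 24-byte object ends up with
+ * 32-byte alignment: ralign is halved while the object still fits in
+ * half of it.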
+ */
+ if (flags & SLAB_HWCACHE_ALIGN) {
+ unsigned long ralign = cache_line_size();
+ while (size <= ralign / 2)
+ ralign /= 2;
+ align = max(align, ralign);
+ }
+
+ if (align < ARCH_SLAB_MINALIGN)
+ align = ARCH_SLAB_MINALIGN;
+
+ return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ l->cache = s;
+ l->freelist.nr = 0;
+ l->freelist.head = NULL;
+ l->freelist.tail = NULL;
+ l->nr_partial = 0;
+ l->nr_slabs = 0;
+ INIT_LIST_HEAD(&l->partial);
+// INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+ l->remote_free_check = 0;
+ spin_lock_init(&l->remote_free.lock);
+ l->remote_free.list.nr = 0;
+ l->remote_free.list.head = NULL;
+ l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ init_kmem_cache_list(s, &c->list);
+
+ c->colour_next = 0;
+#ifdef CONFIG_SMP
+ c->rlist.nr = 0;
+ c->rlist.head = NULL;
+ c->rlist.tail = NULL;
+ c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+ spin_lock_init(&n->list_lock);
+ init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Statically allocated per-CPU and per-node structures for the boot caches */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+ struct kmem_cache_cpu *c;
+
+ c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+ if (!c)
+ return NULL;
+
+ init_kmem_cache_cpu(s, c);
+ return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c) {
+ kmem_cache_free(&kmem_cpu_cache, c);
+ s->cpu_slab[cpu] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c)
+ continue;
+
+ c = alloc_kmem_cache_cpu(s, cpu);
+ if (!c) {
+ free_kmem_cache_cpus(s);
+ return 0;
+ }
+ s->cpu_slab[cpu] = c;
+ }
+ return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ init_kmem_cache_cpu(s, &s->cpu_slab);
+ return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = s->node[node];
+ if (n) {
+ kmem_cache_free(&kmem_node_cache, n);
+ s->node[node] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+ if (!n) {
+ free_kmem_cache_nodes(s);
+ return 0;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[node] = n;
+ }
+ return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
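+ *
+ * For example, a 40-byte SLAB_DESTROY_BY_RCU cache keeps the object
+ * itself untouched on free, places the free pointer at offset 40, and
+ * rounds the per-object size up to 48 bytes before alignment is applied.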
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+ unsigned long flags = s->flags;
+ unsigned long size = s->objsize;
+ unsigned long align = s->align;
+
+ /*
+ * Determine if we can poison the object itself. If the user of
+ * the slab may touch the object after free or before allocation
+ * then we should never poison the object itself.
+ */
+ if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+ s->flags |= __OBJECT_POISON;
+ else
+ s->flags &= ~__OBJECT_POISON;
+
+ /*
+ * Round up object size to the next word boundary. We can only
+ * place the free pointer at word boundaries and this determines
+ * the possible location of the free pointer.
+ */
+ size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+ /*
+ * If we are Redzoning then check if there is some space between the
+ * end of the object and the free pointer. If not then add an
+ * additional word to have some bytes to store Redzone information.
+ */
+ if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+ size += sizeof(void *);
+#endif
+
+ /*
+ * With that we have determined the number of bytes in actual use
+ * by the object. This is the potential offset to the free pointer.
+ */
+ s->inuse = size;
+
+ if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+ /*
+ * Relocate free pointer after the object if it is not
+ * permitted to overwrite the first word of the object on
+ * kmem_cache_free.
+ *
+ * This is the case if we use RCU, have a constructor, or are
+ * poisoning the objects.
+ */
+ s->offset = size;
+ size += sizeof(void *);
+ }
+
+#ifdef CONFIG_SLQB_DEBUG
+ if (flags & SLAB_STORE_USER)
+ /*
+ * Need to store information about allocs and frees after
+ * the object.
+ */
+ size += 2 * sizeof(struct track);
+
+ if (flags & SLAB_RED_ZONE)
+ /*
+ * Add some empty padding so that we can catch
+ * overwrites from earlier objects rather than let
+ * tracking information or the free pointer be
+ * corrupted if a user writes before the start
+ * of the object.
+ */
+ size += sizeof(void *);
+#endif
+
+ /*
+ * Determine the alignment based on various parameters that the
+ * user specified and the dynamic determination of cache line size
+ * on bootup.
+ */
+ align = calculate_alignment(flags, align, s->objsize);
+
+ /*
+ * SLQB stores one object immediately after another beginning from
+ * offset 0. In order to align the objects we have to simply size
+ * each object to conform to the alignment.
+ */
+ size = ALIGN(size, align);
+ s->size = size;
+ s->order = calculate_order(size);
+
+ if (s->order < 0)
+ return 0;
+
+ s->allocflags = 0;
+ if (s->order)
+ s->allocflags |= __GFP_COMP;
+
+ if (s->flags & SLAB_CACHE_DMA)
+ s->allocflags |= SLQB_DMA;
+
+ if (s->flags & SLAB_RECLAIM_ACCOUNT)
+ s->allocflags |= __GFP_RECLAIMABLE;
+
+ /*
+ * Determine the number of objects per slab
+ */
+ s->objects = (PAGE_SIZE << s->order) / size;
+
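+ /*
+ * Freeing batch size and per-list high-water mark. For example, with
+ * 4K pages a 128-byte cache gets freebatch = 256 and hiwater = 1024,
+ * while a 4K-byte cache gets freebatch = 64 and hiwater = 256.
+ */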
+ s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+ if (!s->freebatch)
+ s->freebatch = 1;
+ s->hiwater = s->freebatch << 2;
+
+ return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+ const char *name, size_t size,
+ size_t align, unsigned long flags,
+ void (*ctor)(void *), int alloc)
+{
+ unsigned int left_over;
+
+ memset(s, 0, kmem_size);
+ s->name = name;
+ s->ctor = ctor;
+ s->objsize = size;
+ s->align = align;
+ s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+ if (!calculate_sizes(s))
+ goto error;
+
+ if (!slab_debug(s)) {
+ left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+ s->colour_off = max(cache_line_size(), s->align);
+ s->colour_range = left_over;
+ } else {
+ s->colour_off = 0;
+ s->colour_range = 0;
+ }
+
+ if (likely(alloc)) {
+ if (!alloc_kmem_cache_nodes(s))
+ goto error;
+
+ if (!alloc_kmem_cache_cpus(s))
+ goto error_nodes;
+ }
+
+ down_write(&slqb_lock);
+ sysfs_slab_add(s);
+ list_add(&s->list, &slab_caches);
+ up_write(&slqb_lock);
+
+ return 1;
+
+error_nodes:
+ free_kmem_cache_nodes(s);
+error:
+ if (flags & SLAB_PANIC)
+ panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+ return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+ struct slqb_page *page = virt_to_head_slqb_page(object);
+
+ if (!(page->flags & PG_SLQB_BIT))
+ return 0;
+
+ /*
+ * We could also check if the object is on the slab's freelist.
+ * But this would be too expensive and it seems that the main
+ * purpose of kmem_ptr_validate() is to check if the object
+ * belongs to a certain slab.
+ */
+ return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+ return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+ return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+ int cpu;
+
+ down_write(&slqb_lock);
+ list_del(&s->list);
+ up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+#endif
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_nodes(s);
+#endif
+
+ sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ * Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+ const char *name, int size, gfp_t gfp_flags)
+{
+ unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+ if (gfp_flags & SLQB_DMA)
+ flags |= SLAB_CACHE_DMA;
+
+ kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+ return s;
+}
+
+/*
+ * Conversion table from small slab sizes (divided by 8) to the index in the
+ * kmalloc array. This is necessary for sizes up to 192 since we have
+ * non-power-of-two cache sizes there. The index for larger sizes is
+ * determined using fls.
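+ *
+ * For example, kmalloc(56) uses size_index[(56 - 1) / 8] == size_index[6],
+ * i.e. the 64-byte cache, while kmalloc(5000) uses fls(4999) == 13, i.e.
+ * the 8192-byte cache.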
+ */
+static s8 size_index[24] __cacheline_aligned = {
+ 3, /* 8 */
+ 4, /* 16 */
+ 5, /* 24 */
+ 5, /* 32 */
+ 6, /* 40 */
+ 6, /* 48 */
+ 6, /* 56 */
+ 6, /* 64 */
+#if L1_CACHE_BYTES < 64
+ 1, /* 72 */
+ 1, /* 80 */
+ 1, /* 88 */
+ 1, /* 96 */
+#else
+ 7,
+ 7,
+ 7,
+ 7,
+#endif
+ 7, /* 104 */
+ 7, /* 112 */
+ 7, /* 120 */
+ 7, /* 128 */
+#if L1_CACHE_BYTES < 128
+ 2, /* 136 */
+ 2, /* 144 */
+ 2, /* 152 */
+ 2, /* 160 */
+ 2, /* 168 */
+ 2, /* 176 */
+ 2, /* 184 */
+ 2 /* 192 */
+#else
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+ int index;
+
+#if L1_CACHE_BYTES >= 128
+ if (size <= 128) {
+#else
+ if (size <= 192) {
+#endif
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ index = size_index[(size - 1) / 8];
+ } else
+ index = fls(size - 1);
+
+ if (unlikely((flags & SLQB_DMA)))
+ return &kmalloc_caches_dma[index];
+ else
+ return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+ struct slqb_page *page;
+ struct kmem_cache *s;
+
+ BUG_ON(!object);
+ if (unlikely(object == ZERO_SIZE_PTR))
+ return 0;
+
+ page = virt_to_head_slqb_page(object);
+ BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+ s = page->list->cache;
+
+ /*
+ * Debugging requires use of the padding between object
+ * and whatever may come after it.
+ */
+ if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+ return s->objsize;
+
+ /*
+ * If we have the need to store the freelist pointer
+ * back there or track user information then we can
+ * only use the space before that information.
+ */
+ if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ return s->inuse;
+
+ /*
+ * Else we can use all the padding etc for the allocation
+ */
+ return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+ struct kmem_cache *s;
+ struct slqb_page *page;
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ page = virt_to_head_slqb_page(object);
+ s = page->list->cache;
+
+ slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = arg;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list(s, l);
+#ifdef CONFIG_SMP
+ flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+#endif
+
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s;
+ long phase = (long)arg;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ if (phase == 0) {
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+
+ if (phase == 1) {
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ }
+ }
+}
+
+static void kmem_cache_reap(void)
+{
+ struct kmem_cache *s;
+ int node;
+
+ down_read(&slqb_lock);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+ list_for_each_entry(s, &slab_caches, list) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+ }
+ up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+ struct delayed_work *work =
+ container_of(w, struct delayed_work, work);
+ struct kmem_cache *s;
+ int node;
+
+ if (!down_read_trylock(&slqb_lock))
+ goto out;
+
+ node = numa_node_id();
+ list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+#endif
+
+ local_irq_disable();
+ kmem_cache_trim_percpu(s);
+ local_irq_enable();
+ }
+
+ up_read(&slqb_lock);
+out:
+ schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+ struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+ /*
+ * When this gets called from do_initcalls via cpucache_init(),
+ * init_workqueues() has already run, so keventd will already have
+ * been set up.
+ */
+ if (keventd_up() && cache_trim_work->work.func == NULL) {
+ INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+ schedule_delayed_work_on(cpu, cache_trim_work,
+ __round_jiffies_relative(HZ, cpu));
+ }
+}
+
+static int __init cpucache_init(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ start_cpu_timer(cpu);
+ return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+ kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+
+ /*
+ * If the node still has available memory, its kmem_cache_node
+ * structures are still needed, so there is nothing to do.
+ */
+ if (nid < 0)
+ return;
+
+#if 0 // XXX: see cpu offline comment
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_node *n;
+ n = s->node[nid];
+ if (n) {
+ s->node[nid] = NULL;
+ kmem_cache_free(&kmem_node_cache, n);
+ }
+ }
+ up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct kmem_cache_node *n;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+ int ret = 0;
+
+ /*
+ * If the node's memory is already available, then kmem_cache_node is
+ * already created. Nothing to do.
+ */
+ if (nid < 0)
+ return 0;
+
+ /*
+ * We are bringing a node online. No memory is available yet. We must
+ * allocate a kmem_cache_node structure in order to bring the node
+ * online.
+ */
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ /*
+ * XXX: kmem_cache_alloc_node will fall back to other nodes
+ * since memory is not yet available from the node that is
+ * being brought up.
+ */
+ if (s->node[nid]) /* could be leftover from last online */
+ continue;
+ n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+ if (!n) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[nid] = n;
+ }
+out:
+ up_read(&slqb_lock);
+ return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = slab_mem_going_online_callback(arg);
+ break;
+ case MEM_GOING_OFFLINE:
+ slab_mem_going_offline_callback(arg);
+ break;
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ slab_mem_offline_callback(arg);
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ * Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+ int i;
+ unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+ if (num_possible_nodes() == 1)
+ numa_platform = 0;
+ else
+ numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+ kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+ nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+ kmem_size = sizeof(struct kmem_cache);
+#endif
+
+ kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+ kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+ for_each_possible_cpu(i) {
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+ kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+ init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+ kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+ init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+ kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+ }
+#else
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(i, N_NORMAL_MEMORY) {
+ init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+ kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+ init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+ kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+ init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+ kmem_node_cache.node[i] = &kmem_node_nodes[i];
+ }
+#endif
+
+ /* Caches that are not of a power-of-two size */
+ if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+ open_kmalloc_cache(&kmalloc_caches[1],
+ "kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[1],
+ "kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+ if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+ open_kmalloc_cache(&kmalloc_caches[2],
+ "kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[2],
+ "kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ open_kmalloc_cache(&kmalloc_caches[i],
+ "kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[i],
+ "kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+ /*
+ * Patch up the size_index table if we have strange large alignment
+ * requirements for the kmalloc array. This is only the case for
+ * MIPS, it seems. The standard arches will not generate any code here.
+ *
+ * Largest permitted alignment is 256 bytes due to the way we
+ * handle the index determination for the smaller caches.
+ *
+ * Make sure that nothing crazy happens if someone starts tinkering
+ * around with ARCH_KMALLOC_MINALIGN
+ */
+ BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+ (KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+ for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+ size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+ /* Provide the correct kmalloc names now that the caches are up */
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ kmalloc_caches[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+ kmalloc_caches_dma[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+ }
+
+#ifdef CONFIG_SMP
+ register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+ hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+ /*
+ * smp_init() has not yet been called, so no worries about memory
+ * ordering here (eg. slab_is_available vs numa_platform)
+ */
+ __slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+ size_t align, unsigned long flags)
+{
+ struct kmem_cache *tmp;
+
+ /*
+ * Sanity checks... these are all serious usage bugs.
+ */
+ if (!name || in_interrupt() || (size < sizeof(void *))) {
+ printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+ name);
+ dump_stack();
+ return 0;
+ }
+
+ down_read(&slqb_lock);
+ list_for_each_entry(tmp, &slab_caches, list) {
+ char x;
+ int res;
+
+ /*
+ * This happens when the module gets unloaded and doesn't
+ * destroy its slab cache and no-one else reuses the vmalloc
+ * area of the module. Print a warning.
+ */
+ res = probe_kernel_address(tmp->name, x);
+ if (res) {
+ printk(KERN_ERR
+ "SLAB: cache with size %d has lost its name\n",
+ tmp->size);
+ continue;
+ }
+
+ if (!strcmp(tmp->name, name)) {
+ printk(KERN_ERR
+ "kmem_cache_create(): duplicate cache %s\n", name);
+ dump_stack();
+ up_read(&slqb_lock);
+ return 0;
+ }
+ }
+ up_read(&slqb_lock);
+
+ WARN_ON(strchr(name, ' ')); /* It confuses parsers */
+ if (flags & SLAB_DESTROY_BY_RCU)
+ WARN_ON(flags & SLAB_POISON);
+
+ return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
+{
+ struct kmem_cache *s;
+
+ if (!kmem_cache_create_ok(name, size, align, flags))
+ goto err;
+
+ s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+ if (!s)
+ goto err;
+
+ if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+ return s;
+
+ kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+ if (flags & SLAB_PANIC)
+ panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+ return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ struct kmem_cache *s;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ if (s->cpu_slab[cpu]) /* could be leftover from last online */
+ continue;
+ s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+ if (!s->cpu_slab[cpu]) {
+ up_read(&slqb_lock);
+ return NOTIFY_BAD;
+ }
+ }
+ up_read(&slqb_lock);
+ break;
+
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ start_cpu_timer(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+ per_cpu(cache_trim_work, cpu).work.func = NULL;
+ break;
+
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+#if 0
+ down_read(&slqb_lock);
+ /* XXX: this doesn't work because objects can still be on this
+ * CPU's list. periodic timer needs to check if a CPU is offline
+ * and then try to cleanup from there. Same for node offline.
+ */
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ if (c) {
+ kmem_cache_free(&kmem_cpu_cache, c);
+ s->cpu_slab[cpu] = NULL;
+ }
+ }
+
+ up_read(&slqb_lock);
+#endif
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+ .notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+ struct kmem_cache *s;
+ int node = -1;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, flags, node);
+#endif
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+ unsigned long caller)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+ struct kmem_cache *s;
+ spinlock_t lock;
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ struct stats_gather *gather = arg;
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = gather->s;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+ struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ nr_slabs = l->nr_slabs;
+ nr_partial = l->nr_partial;
+ nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ nr_inuse += page->inuse;
+ }
+
+ spin_lock(&gather->lock);
+ gather->nr_slabs += nr_slabs;
+ gather->nr_partial += nr_partial;
+ gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ gather->stats[i] += l->stats[i];
+ }
+#endif
+ spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ memset(stats, 0, sizeof(struct stats_gather));
+ stats->s = s;
+ spin_lock_init(&stats->lock);
+
+ on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_online_node(node) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+ struct slqb_page *page;
+ unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ stats->stats[i] += l->stats[i];
+ }
+#endif
+ stats->nr_slabs += l->nr_slabs;
+ stats->nr_partial += l->nr_partial;
+ stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ stats->nr_inuse += page->inuse;
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+#endif
+
+ stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+ size_t count, loff_t *ppos)
+{
+ return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+ seq_puts(m, "slabinfo - version: 2.1\n");
+ seq_puts(m, "# name <active_objs> <num_objs> <objsize> "
+ "<objperslab> <pagesperslab>");
+ seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+ seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+ seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+ loff_t n = *pos;
+
+ down_read(&slqb_lock);
+ if (!n)
+ print_slabinfo_header(m);
+
+ return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+ return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+ up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+ struct stats_gather stats;
+ struct kmem_cache *s;
+
+ s = list_entry(p, struct kmem_cache, list);
+
+ gather_stats(s, &stats);
+
+ seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+ stats.nr_objects, s->size, s->objects, (1 << s->order));
+ seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
+ seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+ 0UL);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+ .open = slabinfo_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+ proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+ return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct kmem_cache *s, char *buf);
+ ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+ static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+ static struct slab_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+ if (s->ctor) {
+ int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+ return n + sprintf(buf + n, "\n");
+ }
+ return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
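+/*
+ * hiwater and freebatch are runtime tunables, exported per cache under
+ * /sys/kernel/slab/, e.g. for a hypothetical cache:
+ *
+ *   echo 1024 > /sys/kernel/slab/<cache>/hiwater
+ *   echo 256 > /sys/kernel/slab/<cache>/freebatch
+ */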
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+ long hiwater;
+ int err;
+
+ err = strict_strtol(buf, 10, &hiwater);
+ if (err)
+ return err;
+
+ if (hiwater < 0)
+ return -EINVAL;
+
+ s->hiwater = hiwater;
+
+ return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+ long freebatch;
+ int err;
+
+ err = strict_strtol(buf, 10, &freebatch);
+ if (err)
+ return err;
+
+ if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+ return -EINVAL;
+
+ s->freebatch = freebatch;
+
+ return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
+#ifdef CONFIG_SLQB_STATS
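+/*
+ * Show one statistics counter: the cache-wide total followed by the
+ * per-CPU contributions, e.g. "1024 C0=512 C1=512".
+ */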
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+ struct stats_gather stats;
+ int len;
+#ifdef CONFIG_SMP
+ int cpu;
+#endif
+
+ gather_stats(s, &stats);
+
+ len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+ if (len < PAGE_SIZE - 20)
+ len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+ }
+#endif
+ return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) \
+static ssize_t text##_show(struct kmem_cache *s, char *buf) \
+{ \
+ return show_stat(s, buf, si); \
+} \
+SLAB_ATTR_RO(text);
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+ &slab_size_attr.attr,
+ &object_size_attr.attr,
+ &objs_per_slab_attr.attr,
+ &order_attr.attr,
+ &objects_attr.attr,
+ &total_objects_attr.attr,
+ &slabs_attr.attr,
+ &ctor_attr.attr,
+ &align_attr.attr,
+ &hwcache_align_attr.attr,
+ &reclaim_account_attr.attr,
+ &destroy_by_rcu_attr.attr,
+ &red_zone_attr.attr,
+ &poison_attr.attr,
+ &store_user_attr.attr,
+ &hiwater_attr.attr,
+ &freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+ &cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+ &alloc_attr.attr,
+ &alloc_slab_fill_attr.attr,
+ &alloc_slab_new_attr.attr,
+ &free_attr.attr,
+ &free_remote_attr.attr,
+ &flush_free_list_attr.attr,
+ &flush_free_list_objects_attr.attr,
+ &flush_free_list_remote_attr.attr,
+ &flush_slab_partial_attr.attr,
+ &flush_slab_free_attr.attr,
+ &flush_rfree_list_attr.attr,
+ &flush_rfree_list_objects_attr.attr,
+ &claim_remote_list_attr.attr,
+ &claim_remote_list_objects_attr.attr,
+#endif
+ NULL
+};
+
+static struct attribute_group slab_attr_group = {
+ .attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+ struct attribute *attr,
+ char *buf)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->show)
+ return -EIO;
+
+ err = attribute->show(s, buf);
+
+ return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+ struct attribute *attr,
+ const char *buf, size_t len)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->store)
+ return -EIO;
+
+ err = attribute->store(s, buf, len);
+
+ return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+ struct kmem_cache *s = to_slab(kobj);
+
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+ .show = slab_attr_show,
+ .store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+ .sysfs_ops = &slab_sysfs_ops,
+ .release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+ struct kobj_type *ktype = get_ktype(kobj);
+
+ if (ktype == &slab_ktype)
+ return 1;
+ return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+ .filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+ int err;
+
+ if (!sysfs_available)
+ return 0;
+
+ s->kobj.kset = slab_kset;
+ err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, s->name);
+ if (err) {
+ kobject_put(&s->kobj);
+ return err;
+ }
+
+ err = sysfs_create_group(&s->kobj, &slab_attr_group);
+ if (err)
+ return err;
+ kobject_uevent(&s->kobj, KOBJ_ADD);
+
+ return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kobject_uevent(&s->kobj, KOBJ_REMOVE);
+ kobject_del(&s->kobj);
+ kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+ struct kmem_cache *s;
+ int err;
+
+ slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+ if (!slab_kset) {
+ printk(KERN_ERR "Cannot register slab subsystem.\n");
+ return -ENOSYS;
+ }
+
+ down_write(&slqb_lock);
+ sysfs_available = 1;
+ list_for_each_entry(s, &slab_caches, list) {
+ err = sysfs_slab_add(s);
+ if (err)
+ printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+ " to sysfs\n", s->name);
+ }
+ up_write(&slqb_lock);
+
+ return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
*/
#ifdef CONFIG_SLUB
#include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
#elif defined(CONFIG_SLOB)
#include <linux/slob_def.h>
#else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
#define kmalloc_track_caller(size, flags) \
__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
* standard allocator where we care about the real place the memory
* allocation request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
#define kmalloc_node_track_caller(size, flags, node) \
__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+ struct rcu_head *next;
+ void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
* virt_addr_valid(kaddr) returns true.
*/
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
extern bool __virt_addr_valid(unsigned long kaddr);
#define virt_addr_valid(kaddr) __virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page

static inline struct page *virt_to_head_page(const void *x)
{
+#ifdef virt_to_page_fast
+ struct page *page = virt_to_page_fast(x);
+#else
struct page *page = virt_to_page(x);
+#endif
return compound_head(page);
}

Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <[email protected]> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+ char *name;
+ int align, cache_dma, destroy_by_rcu;
+ int hwcache_align, object_size, objs_per_slab;
+ int slab_size, store_user;
+ int order, poison, reclaim_account, red_zone;
+ int batch;
+ unsigned long objects, slabs, total_objects;
+ unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+ unsigned long free, free_remote;
+ unsigned long claim_remote_list, claim_remote_list_objects;
+ unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+ unsigned long flush_rfree_list, flush_rfree_list_objects;
+ unsigned long flush_slab_free, flush_slab_partial;
+ int numa[MAX_NODES];
+ int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+ va_list ap;
+
+ va_start(ap, x);
+ vfprintf(stderr, x, ap);
+ va_end(ap);
+ exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+ printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+ "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "-A|--activity Most active slabs first\n"
+ "-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-D|--display-active Switch line format to activity\n"
+ "-e|--empty Show empty slabs\n"
+ "-h|--help Show usage information\n"
+ "-i|--inverted Inverted list\n"
+ "-l|--slabs Show slabs\n"
+ "-n|--numa Show NUMA information\n"
+ "-o|--ops Show kmem_cache_ops\n"
+ "-s|--shrink Shrink slabs\n"
+ "-r|--report Detailed report on single slabs\n"
+ "-S|--Size Sort by size\n"
+ "-t|--tracking Show alloc/free information\n"
+ "-T|--Totals Show summary information\n"
+ "-v|--validate Validate slabs\n"
+ "-z|--zero Include empty slabs\n"
+ "\nValid debug options (FZPUT may be combined)\n"
+ "a / A Switch on all debug options (=FZUP)\n"
+ "- Switch off all debug options\n"
+ "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
+ "z / Z Redzoning\n"
+ "p / P Poisoning\n"
+ "u / U Tracking\n"
+ "t / T Tracing\n"
+ );
+}
+
+unsigned long read_obj(const char *name)
+{
+ FILE *f = fopen(name, "r");
+
+ if (!f)
+ buffer[0] = 0;
+ else {
+ if (!fgets(buffer, sizeof(buffer), f))
+ buffer[0] = 0;
+ fclose(f);
+ if (buffer[0] && buffer[strlen(buffer) - 1] == '\n')
+ buffer[strlen(buffer) - 1] = 0;
+ }
+ return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+ if (!read_obj(name))
+ return 0;
+
+ return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+ unsigned long result = 0;
+ char *p;
+
+ *x = NULL;
+
+ if (!read_obj(name)) {
+ x = NULL;
+ return 0;
+ }
+ result = strtoul(buffer, &p, 10);
+ while (*p == ' ')
+ p++;
+ if (*p)
+ *x = strdup(p);
+ return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+ char x[100];
+ FILE *f;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "w");
+ if (!f)
+ fatal("Cannot write to %s\n", x);
+
+ fprintf(f, "%d\n", n);
+ fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+ char x[100];
+ FILE *f;
+ size_t l;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "r");
+ if (!f) {
+ buffer[0] = 0;
+ l = 0;
+ } else {
+ l = fread(buffer, 1, sizeof(buffer), f);
+ buffer[l] = 0;
+ fclose(f);
+ }
+ return l;
+}
+
+
+/*
+ * Put a size string together
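+ * (e.g. 1500000 is rendered as "1.5M")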
+ */
+int store_size(char *buffer, unsigned long value)
+{
+ unsigned long divisor = 1;
+ char trailer = 0;
+ int n;
+
+ if (value > 1000000000UL) {
+ divisor = 100000000UL;
+ trailer = 'G';
+ } else if (value > 1000000UL) {
+ divisor = 100000UL;
+ trailer = 'M';
+ } else if (value > 1000UL) {
+ divisor = 100;
+ trailer = 'K';
+ }
+
+ value /= divisor;
+ n = sprintf(buffer, "%ld",value);
+ if (trailer) {
+ buffer[n] = trailer;
+ n++;
+ buffer[n] = 0;
+ }
+ if (divisor != 1) {
+ memmove(buffer + n - 2, buffer + n - 3, 4);
+ buffer[n-2] = '.';
+ n++;
+ }
+ return n;
+}
+
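+/*
+ * Parse a per-node count list of the form "N0=123 N1=45 ..." into the
+ * numa[] array, tracking the highest node number seen.
+ */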
+void decode_numa_list(int *numa, char *t)
+{
+ int node;
+ int nr;
+
+ memset(numa, 0, MAX_NODES * sizeof(int));
+
+ if (!t)
+ return;
+
+ while (*t == 'N') {
+ t++;
+ node = strtoul(t, &t, 10);
+ if (*t == '=') {
+ t++;
+ nr = strtoul(t, &t, 10);
+ numa[node] = nr;
+ if (node > highest_node)
+ highest_node = node;
+ }
+ while (*t == ' ')
+ t++;
+ }
+}
+
+void slab_validate(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+ if (show_activity)
+ printf("Name Objects Alloc Free %%Fill %%New "
+ "FlushR %%FlushR FlushR_Objs O\n");
+ else
+ printf("Name Objects Objsize Space "
+ " O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+ return s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+ return s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+ int node;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (!highest_node) {
+ printf("\n%s: No NUMA information available.\n", s->name);
+ return;
+ }
+
+ if (skip_zero && !s->slabs)
+ return;
+
+ if (!line) {
+ printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ for(node = 0; node <= highest_node; node++)
+ printf(" %4d", node);
+ printf("\n----------------------");
+ for(node = 0; node <= highest_node; node++)
+ printf("-----");
+ printf("\n");
+ }
+ printf("%-21s ", mode ? "All slabs" : s->name);
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ if (mode) {
+ printf("%-21s ", "Partial slabs");
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa_partial[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ }
+ line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+ printf("\n%s: Kernel object allocation\n", s->name);
+ printf("-----------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "alloc_calls"))
+ printf(buffer);
+ else
+ printf("No Data\n");
+
+ printf("\n%s: Kernel object freeing\n", s->name);
+ printf("------------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "free_calls"))
+ printf(buffer);
+ else
+ printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (read_slab_obj(s, "ops")) {
+ printf("\n%s: kmem_cache operations\n", s->name);
+ printf("--------------------------------------------\n");
+ printf(buffer);
+ } else
+ printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+ if (x)
+ return "On ";
+ return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+ unsigned long total_alloc;
+ unsigned long total_free;
+ unsigned long total;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+ if (!total_alloc)
+ return;
+
+ printf("\n");
+ printf("Slab Perf Counter\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+ total_alloc,
+ s->alloc_slab_fill, s->alloc_slab_new);
+ printf("Free: %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+ total_free,
+ s->flush_slab_partial,
+ s->flush_slab_free,
+ s->free_remote);
+ printf("Claim: %8lu, objects %8lu\n",
+ s->claim_remote_list,
+ s->claim_remote_list_objects);
+ printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+ s->flush_free_list,
+ s->flush_free_list_objects,
+ s->flush_free_list_remote);
+ printf("FlushR:%8lu, objects %8lu\n",
+ s->flush_rfree_list,
+ s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ printf("\nSlabcache: %-20s Order : %2d Objects: %lu\n",
+ s->name, s->order, s->objects);
+ if (s->hwcache_align)
+ printf("** Hardware cacheline aligned\n");
+ if (s->cache_dma)
+ printf("** Memory is allocated in a special DMA zone\n");
+ if (s->destroy_by_rcu)
+ printf("** Slabs are destroyed via RCU\n");
+ if (s->reclaim_account)
+ printf("** Reclaim accounting active\n");
+
+ printf("\nSizes (bytes) Slabs Debug Memory\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
+ s->object_size, s->slabs, "N/A",
+ s->slabs * (page_size << s->order));
+ printf("SlabObj: %7d Full : %7s Redzoning : %s Used : %7ld\n",
+ s->slab_size, "N/A",
+ onoff(s->red_zone), s->objects * s->object_size);
+ printf("SlabSiz: %7d Partial: %7s Poisoning : %s Loss : %7ld\n",
+ page_size << s->order, "N/A", onoff(s->poison),
+ s->slabs * (page_size << s->order) - s->objects * s->object_size);
+ printf("Loss : %7d CpuSlab: %7s Tracking : %s Lalig: %7ld\n",
+ s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+ (s->slab_size - s->object_size) * s->objects);
+ printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
+ s->align, s->objs_per_slab, "N/A",
+ ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+ s->slabs);
+
+ ops(s);
+ show_tracking(s);
+ slab_numa(s, 1);
+ slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+ char size_str[20];
+ char flags[20];
+ char *p = flags;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (actual_slabs == 1) {
+ report(s);
+ return;
+ }
+
+ if (skip_zero && !show_empty && !s->slabs)
+ return;
+
+ if (show_empty && s->slabs)
+ return;
+
+ store_size(size_str, slab_size(s));
+
+ if (!line++)
+ first_line();
+
+ if (s->cache_dma)
+ *p++ = 'd';
+ if (s->hwcache_align)
+ *p++ = 'A';
+ if (s->poison)
+ *p++ = 'P';
+ if (s->reclaim_account)
+ *p++ = 'a';
+ if (s->red_zone)
+ *p++ = 'Z';
+ if (s->store_user)
+ *p++ = 'U';
+
+ *p = 0;
+ if (show_activity) {
+ unsigned long total_alloc;
+ unsigned long total_free;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+ printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+ s->name, s->objects,
+ total_alloc, total_free,
+ total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+ total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+ s->flush_rfree_list,
+ s->flush_rfree_list * 100 / (total_alloc + total_free),
+ s->flush_rfree_list_objects,
+ s->order);
+ }
+ else
+ printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+ s->name, s->objects, s->object_size, size_str,
+ s->objs_per_slab, s->order,
+ s->slabs ? (s->objects * s->object_size * 100) /
+ (s->slabs * (page_size << s->order)) : 100,
+ s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+ if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+ return 1;
+
+ if (strcasecmp(opt, "a") == 0) {
+ sanity = 1;
+ poison = 1;
+ redzone = 1;
+ tracking = 1;
+ return 1;
+ }
+
+ for ( ; *opt; opt++)
+ switch (*opt) {
+ case 'F' : case 'f':
+ if (sanity)
+ return 0;
+ sanity = 1;
+ break;
+ case 'P' : case 'p':
+ if (poison)
+ return 0;
+ poison = 1;
+ break;
+
+ case 'Z' : case 'z':
+ if (redzone)
+ return 0;
+ redzone = 1;
+ break;
+
+ case 'U' : case 'u':
+ if (tracking)
+ return 0;
+ tracking = 1;
+ break;
+
+ case 'T' : case 't':
+ if (tracing)
+ return 0;
+ tracing = 1;
+ break;
+ default:
+ return 0;
+ }
+ return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+ if (s->objects > 0)
+ return 0;
+
+ /*
+ * We may still have slabs even if there are no objects. Shrinking will
+ * remove them.
+ */
+ if (s->slabs != 0)
+ set_obj(s, "shrink", 1);
+
+ return 1;
+}
+
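+/* Enable or disable redzoning, poisoning and tracking via sysfs; the cache must be empty */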
+void slab_debug(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (redzone && !s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+ }
+ if (!redzone && s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+ }
+ if (poison && !s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+ }
+ if (!poison && s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+ }
+ if (tracking && !s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+ }
+ if (!tracking && s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+ }
+}
+
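+/* Print min/max/average/total statistics across all caches (-T) */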
+void totals(void)
+{
+ struct slabinfo *s;
+
+ int used_slabs = 0;
+ char b1[20], b2[20], b3[20], b4[20];
+ unsigned long long max = 1ULL << 63;
+
+ /* Object size */
+ unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+ /* Number of partial slabs in a slabcache */
+ unsigned long long min_partial = max, max_partial = 0,
+ avg_partial, total_partial = 0;
+
+ /* Number of slabs in a slab cache */
+ unsigned long long min_slabs = max, max_slabs = 0,
+ avg_slabs, total_slabs = 0;
+
+ /* Size of the whole slab */
+ unsigned long long min_size = max, max_size = 0,
+ avg_size, total_size = 0;
+
+ /* Bytes used for object storage in a slab */
+ unsigned long long min_used = max, max_used = 0,
+ avg_used, total_used = 0;
+
+ /* Waste: Bytes used for alignment and padding */
+ unsigned long long min_waste = max, max_waste = 0,
+ avg_waste, total_waste = 0;
+ /* Number of objects in a slab */
+ unsigned long long min_objects = max, max_objects = 0,
+ avg_objects, total_objects = 0;
+ /* Waste per object */
+ unsigned long long min_objwaste = max,
+ max_objwaste = 0, avg_objwaste,
+ total_objwaste = 0;
+
+ /* Memory per object */
+ unsigned long long min_memobj = max,
+ max_memobj = 0, avg_memobj,
+ total_objsize = 0;
+
+ for (s = slabinfo; s < slabinfo + slabs; s++) {
+ unsigned long long size;
+ unsigned long used;
+ unsigned long long wasted;
+ unsigned long long objwaste;
+
+ if (!s->slabs || !s->objects)
+ continue;
+
+ used_slabs++;
+
+ size = slab_size(s);
+ used = s->objects * s->object_size;
+ wasted = size - used;
+ objwaste = s->slab_size - s->object_size;
+
+ if (s->object_size < min_objsize)
+ min_objsize = s->object_size;
+ if (s->slabs < min_slabs)
+ min_slabs = s->slabs;
+ if (size < min_size)
+ min_size = size;
+ if (wasted < min_waste)
+ min_waste = wasted;
+ if (objwaste < min_objwaste)
+ min_objwaste = objwaste;
+ if (s->objects < min_objects)
+ min_objects = s->objects;
+ if (used < min_used)
+ min_used = used;
+ if (s->slab_size < min_memobj)
+ min_memobj = s->slab_size;
+
+ if (s->object_size > max_objsize)
+ max_objsize = s->object_size;
+ if (s->slabs > max_slabs)
+ max_slabs = s->slabs;
+ if (size > max_size)
+ max_size = size;
+ if (wasted > max_waste)
+ max_waste = wasted;
+ if (objwaste > max_objwaste)
+ max_objwaste = objwaste;
+ if (s->objects > max_objects)
+ max_objects = s->objects;
+ if (used > max_used)
+ max_used = used;
+ if (s->slab_size > max_memobj)
+ max_memobj = s->slab_size;
+
+ total_slabs += s->slabs;
+ total_size += size;
+ total_waste += wasted;
+
+ total_objects += s->objects;
+ total_used += used;
+
+ total_objwaste += s->objects * objwaste;
+ total_objsize += s->objects * s->slab_size;
+ }
+
+ if (!total_objects) {
+ printf("No objects\n");
+ return;
+ }
+ if (!used_slabs) {
+ printf("No slabs\n");
+ return;
+ }
+
+ /* Per slab averages */
+ avg_slabs = total_slabs / used_slabs;
+ avg_size = total_size / used_slabs;
+ avg_waste = total_waste / used_slabs;
+
+ avg_objects = total_objects / used_slabs;
+ avg_used = total_used / used_slabs;
+
+ /* Per object object sizes */
+ avg_objsize = total_used / total_objects;
+ avg_objwaste = total_objwaste / total_objects;
+ avg_memobj = total_objsize / total_objects;
+
+ printf("Slabcache Totals\n");
+ printf("----------------\n");
+ printf("Slabcaches : %3d Active: %3d\n",
+ slabs, used_slabs);
+
+ store_size(b1, total_size);store_size(b2, total_waste);
+ store_size(b3, total_waste * 100 / total_used);
+ printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
+
+ store_size(b1, total_objects);
+ printf("# Objects : %6s\n", b1);
+
+ printf("\n");
+ printf("Per Cache Average Min Max Total\n");
+ printf("---------------------------------------------------------\n");
+
+ store_size(b1, avg_objects);store_size(b2, min_objects);
+ store_size(b3, max_objects);store_size(b4, total_objects);
+ printf("#Objects %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_slabs);store_size(b2, min_slabs);
+ store_size(b3, max_slabs);store_size(b4, total_slabs);
+ printf("#Slabs %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_size);store_size(b2, min_size);
+ store_size(b3, max_size);store_size(b4, total_size);
+ printf("Memory %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_used);store_size(b2, min_used);
+ store_size(b3, max_used);store_size(b4, total_used);
+ printf("Used %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_waste);store_size(b2, min_waste);
+ store_size(b3, max_waste);store_size(b4, total_waste);
+ printf("Loss %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ printf("\n");
+ printf("Per Object Average Min Max\n");
+ printf("---------------------------------------------\n");
+
+ store_size(b1, avg_memobj);store_size(b2, min_memobj);
+ store_size(b3, max_memobj);
+ printf("Memory %10s %10s %10s\n",
+ b1, b2, b3);
+ store_size(b1, avg_objsize);store_size(b2, min_objsize);
+ store_size(b3, max_objsize);
+ printf("User %10s %10s %10s\n",
+ b1, b2, b3);
+
+ store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+ store_size(b3, max_objwaste);
+ printf("Loss %10s %10s %10s\n",
+ b1, b2, b3);
+}
+
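+/* Simple O(n^2) in-place sort of the cache list by size, activity or name */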
+void sort_slabs(void)
+{
+ struct slabinfo *s1,*s2;
+
+ for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+ for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+ int result;
+
+ if (sort_size)
+ result = slab_size(s1) < slab_size(s2);
+ else if (sort_active)
+ result = slab_activity(s1) < slab_activity(s2);
+ else
+ result = strcasecmp(s1->name, s2->name);
+
+ if (show_inverted)
+ result = -result;
+
+ if (result > 0) {
+ struct slabinfo t;
+
+ memcpy(&t, s1, sizeof(struct slabinfo));
+ memcpy(s1, s2, sizeof(struct slabinfo));
+ memcpy(s2, &t, sizeof(struct slabinfo));
+ }
+ }
+ }
+}
+
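+/* Return non-zero if the cache name does not match the user-supplied regex */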
+int slab_mismatch(char *slab)
+{
+ return regexec(&pattern, slab, 0, NULL, 0);
+}
+
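+/* Scan /sys/kernel/slab (or /sys/slab) and read each cache's attributes and statistics */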
+void read_slab_dir(void)
+{
+ DIR *dir;
+ struct dirent *de;
+ struct slabinfo *slab = slabinfo;
+ char *p;
+ char *t;
+ int count;
+
+ if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+ fatal("SYSFS support for SLUB not active\n");
+
+ dir = opendir(".");
+ while ((de = readdir(dir))) {
+ if (de->d_name[0] == '.' ||
+ (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+ continue;
+ switch (de->d_type) {
+ case DT_DIR:
+ if (chdir(de->d_name))
+ fatal("Unable to access slab %s\n", slab->name);
+ slab->name = strdup(de->d_name);
+ slab->align = get_obj("align");
+ slab->cache_dma = get_obj("cache_dma");
+ slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+ slab->hwcache_align = get_obj("hwcache_align");
+ slab->object_size = get_obj("object_size");
+ slab->objects = get_obj("objects");
+ slab->total_objects = get_obj("total_objects");
+ slab->objs_per_slab = get_obj("objs_per_slab");
+ slab->order = get_obj("order");
+ slab->poison = get_obj("poison");
+ slab->reclaim_account = get_obj("reclaim_account");
+ slab->red_zone = get_obj("red_zone");
+ slab->slab_size = get_obj("slab_size");
+ slab->slabs = get_obj_and_str("slabs", &t);
+ decode_numa_list(slab->numa, t);
+ free(t);
+ slab->store_user = get_obj("store_user");
+ slab->batch = get_obj("batch");
+ slab->alloc = get_obj("alloc");
+ slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+ slab->alloc_slab_new = get_obj("alloc_slab_new");
+ slab->free = get_obj("free");
+ slab->free_remote = get_obj("free_remote");
+ slab->claim_remote_list = get_obj("claim_remote_list");
+ slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+ slab->flush_free_list = get_obj("flush_free_list");
+ slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+ slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+ slab->flush_rfree_list = get_obj("flush_rfree_list");
+ slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+ slab->flush_slab_free = get_obj("flush_slab_free");
+ slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+ chdir("..");
+ slab++;
+ break;
+ default :
+ fatal("Unknown file type %lx\n", de->d_type);
+ }
+ }
+ closedir(dir);
+ slabs = slab - slabinfo;
+ actual_slabs = slabs;
+ if (slabs > MAX_SLABS)
+ fatal("Too many slabs\n");
+}
+
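+/* Run the selected display or maintenance operation on each cache */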
+void output_slabs(void)
+{
+ struct slabinfo *slab;
+
+ for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+ if (show_numa)
+ slab_numa(slab, 0);
+ else if (show_track)
+ show_tracking(slab);
+ else if (validate)
+ slab_validate(slab);
+ else if (shrink)
+ slab_shrink(slab);
+ else if (set_debug)
+ slab_debug(slab);
+ else if (show_ops)
+ ops(slab);
+ else if (show_slab)
+ slabcache(slab);
+ else if (show_report)
+ report(slab);
+ }
+}
+
+struct option opts[] = {
+ { "activity", 0, NULL, 'A' },
+ { "debug", 2, NULL, 'd' },
+ { "display-activity", 0, NULL, 'D' },
+ { "empty", 0, NULL, 'e' },
+ { "help", 0, NULL, 'h' },
+ { "inverted", 0, NULL, 'i'},
+ { "numa", 0, NULL, 'n' },
+ { "ops", 0, NULL, 'o' },
+ { "report", 0, NULL, 'r' },
+ { "shrink", 0, NULL, 's' },
+ { "slabs", 0, NULL, 'l' },
+ { "track", 0, NULL, 't'},
+ { "validate", 0, NULL, 'v' },
+ { "zero", 0, NULL, 'z' },
+ { "1ref", 0, NULL, '1'},
+ { NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+ int c;
+ int err;
+ char *pattern_source;
+
+ page_size = getpagesize();
+
+ while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+ opts, NULL)) != -1)
+ switch (c) {
+ case 'A':
+ sort_active = 1;
+ break;
+ case 'd':
+ set_debug = 1;
+ if (!debug_opt_scan(optarg))
+ fatal("Invalid debug option '%s'\n", optarg);
+ break;
+ case 'D':
+ show_activity = 1;
+ break;
+ case 'e':
+ show_empty = 1;
+ break;
+ case 'h':
+ usage();
+ return 0;
+ case 'i':
+ show_inverted = 1;
+ break;
+ case 'n':
+ show_numa = 1;
+ break;
+ case 'o':
+ show_ops = 1;
+ break;
+ case 'r':
+ show_report = 1;
+ break;
+ case 's':
+ shrink = 1;
+ break;
+ case 'l':
+ show_slab = 1;
+ break;
+ case 't':
+ show_track = 1;
+ break;
+ case 'v':
+ validate = 1;
+ break;
+ case 'z':
+ skip_zero = 0;
+ break;
+ case 'T':
+ show_totals = 1;
+ break;
+ case 'S':
+ sort_size = 1;
+ break;
+
+ default:
+ fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+ }
+
+ if (!show_slab && !show_track && !show_report
+ && !validate && !shrink && !set_debug && !show_ops)
+ show_slab = 1;
+
+ if (argc > optind)
+ pattern_source = argv[optind];
+ else
+ pattern_source = ".*";
+
+ err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+ if (err)
+ fatal("%s: Invalid pattern '%s' code %d\n",
+ argv[0], pattern_source, err);
+ read_slab_dir();
+ if (show_totals)
+ totals();
+ else {
+ sort_slabs();
+ output_slabs();
+ }
+ return 0;
+}


2009-01-21 14:59:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator


* Nick Piggin <[email protected]> wrote:

> +/*
> + * Management object for a slab cache.
> + */
> +struct kmem_cache {
> + unsigned long flags;
> + int hiwater; /* LIFO list high watermark */
> + int freebatch; /* LIFO freelist batch flush size */
> + int objsize; /* The size of an object without meta data */
> + int offset; /* Free pointer offset. */
> + int objects; /* Number of objects in slab */
> +
> + int size; /* The size of an object including meta data */
> + int order; /* Allocation order */
> + gfp_t allocflags; /* gfp flags to use on allocation */
> + unsigned int colour_range; /* range of colour counter */
> + unsigned int colour_off; /* offset per colour */
> + void (*ctor)(void *);
> +

Mind if i nitpick a bit about minor style issues? Since this is going to
be the next Linux SLAB allocator we might as well do it perfectly :-)

When introducing new structures it makes sense to properly vertically align
them, like:

> + unsigned long flags;
> + int hiwater; /* LIFO list high watermark */
> + int freebatch; /* LIFO freelist batch flush size */
> + int objsize; /* Object size without meta data */
> + int offset; /* Free pointer offset */
> + int objects; /* Number of objects in slab */
> + const char *name; /* Name (only for display!) */
> + struct list_head list; /* List of slab caches */
> +
> + int align; /* Alignment */
> + int inuse; /* Offset to metadata */

because proper vertical alignment/lineup can really help readability.
Like you do it yourself here:

> + if (size <= 8) return 3;
> + if (size <= 16) return 4;
> + if (size <= 32) return 5;
> + if (size <= 64) return 6;
> + if (size <= 128) return 7;
> + if (size <= 256) return 8;
> + if (size <= 512) return 9;
> + if (size <= 1024) return 10;
> + if (size <= 2 * 1024) return 11;
> + if (size <= 4 * 1024) return 12;
> + if (size <= 8 * 1024) return 13;
> + if (size <= 16 * 1024) return 14;
> + if (size <= 32 * 1024) return 15;
> + if (size <= 64 * 1024) return 16;
> + if (size <= 128 * 1024) return 17;
> + if (size <= 256 * 1024) return 18;
> + if (size <= 512 * 1024) return 19;
> + if (size <= 1024 * 1024) return 20;
> + if (size <= 2 * 1024 * 1024) return 21;

> +static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
> +{
> + va_list args;
> + char buf[100];

magic constant.

> + if (s->flags & SLAB_RED_ZONE)
> + memset(p + s->objsize,
> + active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
> + s->inuse - s->objsize);

We tend to add curly braces in such multi-line statement situations i
guess.

> +static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
> +{
> + if (s->flags & SLAB_TRACE) {
> + printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
> + s->name,
> + alloc ? "alloc" : "free",
> + object, page->inuse,
> + page->freelist);

Could use ftrace_printk() here i guess. That way it goes into a fast ring
buffer instead of printk, and it also gets embedded into whatever tracer
plugin is active (for example kmemtrace).
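
For illustration, a minimal sketch of what the printk part of trace() could
look like after that change, assuming ftrace_printk() takes the same
printf-style arguments as printk() (untested, just to show the idea):

	if (s->flags & SLAB_TRACE) {
		/* goes into the ftrace ring buffer instead of the printk log */
		ftrace_printk("TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
				s->name, alloc ? "alloc" : "free",
				object, page->inuse, page->freelist);
		...
	}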


> +static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
> + void *object)

there's a trick that can be done here to avoid the col-80 artifact:

static void
setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)

ditto all these prototypes:

> +static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
> +static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
> +static unsigned long kmem_cache_flags(unsigned long objsize,
> + unsigned long flags, const char *name,
> + void (*ctor)(void *))
> +static inline void setup_object_debug(struct kmem_cache *s,
> + struct slqb_page *page, void *object) {}
> +static inline int alloc_debug_processing(struct kmem_cache *s,
> + void *object, void *addr) { return 0; }
> +static inline int free_debug_processing(struct kmem_cache *s,
> + void *object, void *addr) { return 0; }
> +static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
> + void *object, int active) { return 1; }
> +static inline unsigned long kmem_cache_flags(unsigned long objsize,
> + unsigned long flags, const char *name, void (*ctor)(void *))

> +#define slqb_debug 0

should be 'static const int slqb_debug;' i guess?

more function prototype inconsistencies:

> +static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static void setup_object(struct kmem_cache *s, struct slqb_page *page,
> + void *object)
> +static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
> +static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)

> +#ifdef CONFIG_SMP
> +static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
> +#endif

does noinline have to be on the declaration?

i almost missed the lock taking here:

> + spin_lock(&l->remote_free.lock);
> + l->remote_free.list.head = NULL;
> + tail = l->remote_free.list.tail;
> + l->remote_free.list.tail = NULL;
> + nr = l->remote_free.list.nr;
> + l->remote_free.list.nr = 0;
> + spin_unlock(&l->remote_free.lock);

Putting an extra newline after the spin_lock() and one extra newline
before the spin_unlock() really helps draw attention to critical
sections.

various leftover bits:

> +// if (next)
> +// prefetchw(next);

> +// if (next)
> +// prefetchw(next);

> + list_del(&page->lru);
> +/*XXX list_move(&page->lru, &l->full); */

> +// VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));

overlong prototype:

> +static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)

putting the newline elsewhere would improve this too:

> +static noinline void *__remote_slab_alloc(struct kmem_cache *s,
> + gfp_t gfpflags, int node)

leftover:

> +// if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
> +// return NULL;

newline in wrong place:

> +static __always_inline void *__slab_alloc(struct kmem_cache *s,
> + gfp_t gfpflags, int node)

> +static __always_inline void *slab_alloc(struct kmem_cache *s,
> + gfp_t gfpflags, int node, void *addr)

> +static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)

> +#ifdef CONFIG_SLQB_STATS
> + {
> + struct kmem_cache_list *l = &c->list;
> + slqb_stat_inc(l, FLUSH_RFREE_LIST);
> + slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);

Please put a newline after local variable declarations.

newline in another place could improve this:

> +static __always_inline void __slab_free(struct kmem_cache *s,
> + struct slqb_page *page, void *object)

> +#ifdef CONFIG_NUMA
> + } else {
> + /*
> + * Freeing an object that was allocated on a remote node.
> + */
> + slab_free_to_remote(s, page, object, c);
> + slqb_stat_inc(l, FREE_REMOTE);
> +#endif
> + }

while it's correct code, the CONFIG_NUMA ifdef begs to be placed one line
further down.

newline in another place could improve this:

> +static __always_inline void slab_free(struct kmem_cache *s,
> + struct slqb_page *page, void *object)

> +void kmem_cache_free(struct kmem_cache *s, void *object)
> +{
> + struct slqb_page *page = NULL;
> + if (numa_platform)
> + page = virt_to_head_slqb_page(object);

newline after local variable definition please.

> +static inline int slab_order(int size, int max_order, int frac)
> +{
> + int order;
> +
> + if (fls(size - 1) <= PAGE_SHIFT)
> + order = 0;
> + else
> + order = fls(size - 1) - PAGE_SHIFT;
> + while (order <= max_order) {

Please put a newline before loops, so that they stand out better.

> +static inline int calculate_order(int size)
> +{
> + int order;
> +
> + /*
> + * Attempt to find best configuration for a slab. This
> + * works by first attempting to generate a layout with
> + * the best configuration and backing off gradually.
> + */
> + order = slab_order(size, 1, 4);
> + if (order <= 1)
> + return order;
> +
> + /*
> + * This size cannot fit in order-1. Allow bigger orders, but
> + * forget about trying to save space.
> + */
> + order = slab_order(size, MAX_ORDER, 0);
> + if (order <= MAX_ORDER)
> + return order;
> +
> + return -ENOSYS;
> +}

A function with very nice typography. All should be like this.

> + if (flags & SLAB_HWCACHE_ALIGN) {
> + unsigned long ralign = cache_line_size();
> + while (size <= ralign / 2)
> + ralign /= 2;

newline after variables please.

> +static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> + l->cache = s;
> + l->freelist.nr = 0;
> + l->freelist.head = NULL;
> + l->freelist.tail = NULL;
> + l->nr_partial = 0;
> + l->nr_slabs = 0;
> + INIT_LIST_HEAD(&l->partial);
> +// INIT_LIST_HEAD(&l->full);

leftover. Also, initializations tend to read nicer if they are aligned
like this:

> + l->cache = s;
> + l->freelist.nr = 0;
> + l->freelist.head = NULL;
> + l->freelist.tail = NULL;
> + l->nr_partial = 0;
> + l->nr_slabs = 0;
> +
> +#ifdef CONFIG_SMP
> + l->remote_free_check = 0;
> + spin_lock_init(&l->remote_free.lock);
> + l->remote_free.list.nr = 0;
> + l->remote_free.list.head = NULL;
> + l->remote_free.list.tail = NULL;
> +#endif

That way it really stands out that the only relevant non-zero
initializations are l->cache and the spinlock init.

> +static void init_kmem_cache_cpu(struct kmem_cache *s,
> + struct kmem_cache_cpu *c)

prototype newline.

dead code:

> +#if 0 // XXX: see cpu offline comment
> + down_read(&slqb_lock);
> + list_for_each_entry(s, &slab_caches, list) {
> + struct kmem_cache_node *n;
> + n = s->node[nid];
> + if (n) {
> + s->node[nid] = NULL;
> + kmem_cache_free(&kmem_node_cache, n);
> + }
> + }
> + up_read(&slqb_lock);
> +#endif

... and there are many more similar instances elsewhere in the patch.

Ingo

2009-01-21 15:17:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > +/*
> > + * Management object for a slab cache.
> > + */
> > +struct kmem_cache {
> > + unsigned long flags;
> > + int hiwater; /* LIFO list high watermark */
> > + int freebatch; /* LIFO freelist batch flush size */
> > + int objsize; /* The size of an object without meta data */
> > + int offset; /* Free pointer offset. */
> > + int objects; /* Number of objects in slab */
> > +
> > + int size; /* The size of an object including meta data */
> > + int order; /* Allocation order */
> > + gfp_t allocflags; /* gfp flags to use on allocation */
> > + unsigned int colour_range; /* range of colour counter */
> > + unsigned int colour_off; /* offset per colour */
> > + void (*ctor)(void *);
> > +
>
> Mind if i nitpick a bit about minor style issues? Since this is going to
> be the next Linux SLAB allocator we might as well do it perfectly :-)

Well, let's not get ahead of ourselves :) But it's very appreciated.

I think most if not all of your suggestions are good ones, although
I probably won't convert to ftrace just for the moment.

I'll come up with an incremental patch....

Thanks,
Nick

2009-01-21 16:56:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
>
> Mind if i nitpick a bit about minor style issues? Since this is going to
> be the next Linux SLAB allocator we might as well do it perfectly :-)

Well, here is an incremental patch which should address most of the issues
you pointed out, most of the sane ones that checkpatch pointed out, and a
few of my own ;)

---
include/linux/slqb_def.h | 90 +++++-----
mm/slqb.c | 386 +++++++++++++++++++++++++----------------------
2 files changed, 261 insertions(+), 215 deletions(-)

Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -37,8 +37,9 @@ enum stat_item {
* Singly-linked list with head, tail, and nr
*/
struct kmlist {
- unsigned long nr;
- void **head, **tail;
+ unsigned long nr;
+ void **head;
+ void **tail;
};

/*
@@ -46,8 +47,8 @@ struct kmlist {
* objects can be returned to the kmem_cache_list from remote CPUs.
*/
struct kmem_cache_remote_free {
- spinlock_t lock;
- struct kmlist list;
+ spinlock_t lock;
+ struct kmlist list;
} ____cacheline_aligned;

/*
@@ -56,18 +57,23 @@ struct kmem_cache_remote_free {
* kmem_cache_lists allow off-node allocations (but require locking).
*/
struct kmem_cache_list {
- struct kmlist freelist; /* Fastpath LIFO freelist of objects */
+ /* Fastpath LIFO freelist of objects */
+ struct kmlist freelist;
#ifdef CONFIG_SMP
- int remote_free_check; /* remote_free has reached a watermark */
+ /* remote_free has reached a watermark */
+ int remote_free_check;
#endif
- struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+ /* kmem_cache corresponding to this list */
+ struct kmem_cache *cache;

- unsigned long nr_partial; /* Number of partial slabs (pages) */
- struct list_head partial; /* Slabs which have some free objects */
+ /* Number of partial slabs (pages) */
+ unsigned long nr_partial;

- unsigned long nr_slabs; /* Total number of slabs allocated */
+ /* Slabs which have some free objects */
+ struct list_head partial;

- //struct list_head full;
+ /* Total number of slabs allocated */
+ unsigned long nr_slabs;

#ifdef CONFIG_SMP
/*
@@ -79,7 +85,7 @@ struct kmem_cache_list {
#endif

#ifdef CONFIG_SLQB_STATS
- unsigned long stats[NR_SLQB_STAT_ITEMS];
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
#endif
} ____cacheline_aligned;

@@ -87,9 +93,8 @@ struct kmem_cache_list {
* Primary per-cpu, per-kmem_cache structure.
*/
struct kmem_cache_cpu {
- struct kmem_cache_list list; /* List for node-local slabs. */
-
- unsigned int colour_next;
+ struct kmem_cache_list list; /* List for node-local slabs */
+ unsigned int colour_next; /* Next colour offset to use */

#ifdef CONFIG_SMP
/*
@@ -101,53 +106,53 @@ struct kmem_cache_cpu {
* An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
* get to O(NR_CPUS^2) memory consumption situation.
*/
- struct kmlist rlist;
- struct kmem_cache_list *remote_cache_list;
+ struct kmlist rlist;
+ struct kmem_cache_list *remote_cache_list;
#endif
} ____cacheline_aligned;

/*
- * Per-node, per-kmem_cache structure.
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
*/
struct kmem_cache_node {
- struct kmem_cache_list list;
- spinlock_t list_lock; /* protects access to list */
+ struct kmem_cache_list list;
+ spinlock_t list_lock; /* protects access to list */
} ____cacheline_aligned;

/*
* Management object for a slab cache.
*/
struct kmem_cache {
- unsigned long flags;
- int hiwater; /* LIFO list high watermark */
- int freebatch; /* LIFO freelist batch flush size */
- int objsize; /* The size of an object without meta data */
- int offset; /* Free pointer offset. */
- int objects; /* Number of objects in slab */
-
- int size; /* The size of an object including meta data */
- int order; /* Allocation order */
- gfp_t allocflags; /* gfp flags to use on allocation */
- unsigned int colour_range; /* range of colour counter */
- unsigned int colour_off; /* offset per colour */
- void (*ctor)(void *);
+ unsigned long flags;
+ int hiwater; /* LIFO list high watermark */
+ int freebatch; /* LIFO freelist batch flush size */
+ int objsize; /* Size of object without meta data */
+ int offset; /* Free pointer offset. */
+ int objects; /* Number of objects in slab */
+
+ int size; /* Size of object including meta data */
+ int order; /* Allocation order */
+ gfp_t allocflags; /* gfp flags to use on allocation */
+ unsigned int colour_range; /* range of colour counter */
+ unsigned int colour_off; /* offset per colour */
+ void (*ctor)(void *);

- const char *name; /* Name (only for display!) */
- struct list_head list; /* List of slab caches */
+ const char *name; /* Name (only for display!) */
+ struct list_head list; /* List of slab caches */

- int align; /* Alignment */
- int inuse; /* Offset to metadata */
+ int align; /* Alignment */
+ int inuse; /* Offset to metadata */

#ifdef CONFIG_SLQB_SYSFS
- struct kobject kobj; /* For sysfs */
+ struct kobject kobj; /* For sysfs */
#endif
#ifdef CONFIG_NUMA
- struct kmem_cache_node *node[MAX_NUMNODES];
+ struct kmem_cache_node *node[MAX_NUMNODES];
#endif
#ifdef CONFIG_SMP
- struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+ struct kmem_cache_cpu *cpu_slab[NR_CPUS];
#else
- struct kmem_cache_cpu cpu_slab;
+ struct kmem_cache_cpu cpu_slab;
#endif
};

@@ -245,7 +250,8 @@ void *__kmalloc(size_t size, gfp_t flags
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif

-#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? \
+ sizeof(void *) : ARCH_KMALLOC_MINALIGN)

static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -40,13 +40,13 @@
struct slqb_page {
union {
struct {
- unsigned long flags; /* mandatory */
- atomic_t _count; /* mandatory */
- unsigned int inuse; /* Nr of objects */
- struct kmem_cache_list *list; /* Pointer to list */
- void **freelist; /* freelist req. slab lock */
+ unsigned long flags; /* mandatory */
+ atomic_t _count; /* mandatory */
+ unsigned int inuse; /* Nr of objects */
+ struct kmem_cache_list *list; /* Pointer to list */
+ void **freelist; /* LIFO freelist */
union {
- struct list_head lru; /* misc. list */
+ struct list_head lru; /* misc. list */
struct rcu_head rcu_head; /* for rcu freeing */
};
};
@@ -62,7 +62,7 @@ static int kmem_size __read_mostly;
#ifdef CONFIG_NUMA
static int numa_platform __read_mostly;
#else
-#define numa_platform 0
+static const int numa_platform = 0;
#endif

static inline int slab_hiwater(struct kmem_cache *s)
@@ -120,15 +120,16 @@ static inline int slab_freebatch(struct
* - There is no remote free queue. Nodes don't free objects, CPUs do.
*/

-static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+ enum stat_item si)
{
#ifdef CONFIG_SLQB_STATS
list->stats[si]++;
#endif
}

-static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
- unsigned long nr)
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+ enum stat_item si, unsigned long nr)
{
#ifdef CONFIG_SLQB_STATS
list->stats[si] += nr;
@@ -433,10 +434,11 @@ static void print_page_info(struct slqb_

}

+#define MAX_ERR_STR 100
static void slab_bug(struct kmem_cache *s, char *fmt, ...)
{
va_list args;
- char buf[100];
+ char buf[MAX_ERR_STR];

va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
@@ -477,8 +479,7 @@ static void print_trailer(struct kmem_ca
print_section("Object", p, min(s->objsize, 128));

if (s->flags & SLAB_RED_ZONE)
- print_section("Redzone", p + s->objsize,
- s->inuse - s->objsize);
+ print_section("Redzone", p + s->objsize, s->inuse - s->objsize);

if (s->offset)
off = s->offset + sizeof(void *);
@@ -488,9 +489,10 @@ static void print_trailer(struct kmem_ca
if (s->flags & SLAB_STORE_USER)
off += 2 * sizeof(struct track);

- if (off != s->size)
+ if (off != s->size) {
/* Beginning of the filler is the free pointer */
print_section("Padding", p + off, s->size - off);
+ }

dump_stack();
}
@@ -502,14 +504,9 @@ static void object_err(struct kmem_cache
print_trailer(s, page, object);
}

-static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+ char *fmt, ...)
{
- va_list args;
- char buf[100];
-
- va_start(args, fmt);
- vsnprintf(buf, sizeof(buf), fmt, args);
- va_end(args);
slab_bug(s, fmt);
print_page_info(page);
dump_stack();
@@ -524,10 +521,11 @@ static void init_object(struct kmem_cach
p[s->objsize - 1] = POISON_END;
}

- if (s->flags & SLAB_RED_ZONE)
+ if (s->flags & SLAB_RED_ZONE) {
memset(p + s->objsize,
active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
s->inuse - s->objsize);
+ }
}

static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
@@ -542,7 +540,7 @@ static u8 *check_bytes(u8 *start, unsign
}

static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
- void *from, void *to)
+ void *from, void *to)
{
slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
memset(from, data, to - from);
@@ -610,13 +608,15 @@ static int check_pad_bytes(struct kmem_c
{
unsigned long off = s->inuse; /* The end of info */

- if (s->offset)
+ if (s->offset) {
/* Freepointer is placed after the object. */
off += sizeof(void *);
+ }

- if (s->flags & SLAB_STORE_USER)
+ if (s->flags & SLAB_STORE_USER) {
/* We also have user information there */
off += 2 * sizeof(struct track);
+ }

if (s->size == off)
return 1;
@@ -646,6 +646,7 @@ static int slab_pad_check(struct kmem_ca
fault = check_bytes(start + length, POISON_INUSE, remainder);
if (!fault)
return 1;
+
while (end > fault && end[-1] == POISON_INUSE)
end--;

@@ -677,12 +678,16 @@ static int check_object(struct kmem_cach
}

if (s->flags & SLAB_POISON) {
- if (!active && (s->flags & __OBJECT_POISON) &&
- (!check_bytes_and_report(s, page, p, "Poison", p,
- POISON_FREE, s->objsize - 1) ||
- !check_bytes_and_report(s, page, p, "Poison",
- p + s->objsize - 1, POISON_END, 1)))
- return 0;
+ if (!active && (s->flags & __OBJECT_POISON)) {
+ if (!check_bytes_and_report(s, page, p, "Poison", p,
+ POISON_FREE, s->objsize - 1))
+ return 0;
+
+ if (!check_bytes_and_report(s, page, p, "Poison",
+ p + s->objsize - 1, POISON_END, 1))
+ return 0;
+ }
+
/*
* check_pad_bytes cleans up on its own.
*/
@@ -712,7 +717,8 @@ static int check_slab(struct kmem_cache
return 1;
}

-static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int alloc)
{
if (s->flags & SLAB_TRACE) {
printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
@@ -729,7 +735,7 @@ static void trace(struct kmem_cache *s,
}

static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
- void *object)
+ void *object)
{
if (!slab_debug(s))
return;
@@ -741,7 +747,8 @@ static void setup_object_debug(struct km
init_tracking(s, object);
}

-static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int alloc_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
{
struct slqb_page *page;
page = virt_to_head_slqb_page(object);
@@ -768,7 +775,8 @@ bad:
return 0;
}

-static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
{
struct slqb_page *page;
page = virt_to_head_slqb_page(object);
@@ -799,25 +807,28 @@ fail:
static int __init setup_slqb_debug(char *str)
{
slqb_debug = DEBUG_DEFAULT_FLAGS;
- if (*str++ != '=' || !*str)
+ if (*str++ != '=' || !*str) {
/*
* No options specified. Switch on full debugging.
*/
goto out;
+ }

- if (*str == ',')
+ if (*str == ',') {
/*
* No options but restriction on slabs. This means full
* debugging for slabs matching a pattern.
*/
goto check_slabs;
+ }

slqb_debug = 0;
- if (*str == '-')
+ if (*str == '-') {
/*
* Switch off all debugging measures.
*/
goto out;
+ }

/*
* Determine which debug features should be switched on
@@ -855,8 +866,8 @@ out:
__setup("slqb_debug", setup_slqb_debug);

static unsigned long kmem_cache_flags(unsigned long objsize,
- unsigned long flags, const char *name,
- void (*ctor)(void *))
+ unsigned long flags, const char *name,
+ void (*ctor)(void *))
{
/*
* Enable debugging if selected on the kernel commandline.
@@ -870,31 +881,51 @@ static unsigned long kmem_cache_flags(un
}
#else
static inline void setup_object_debug(struct kmem_cache *s,
- struct slqb_page *page, void *object) {}
+ struct slqb_page *page, void *object)
+{
+}

static inline int alloc_debug_processing(struct kmem_cache *s,
- void *object, void *addr) { return 0; }
+ void *object, void *addr)
+{
+ return 0;
+}

static inline int free_debug_processing(struct kmem_cache *s,
- void *object, void *addr) { return 0; }
+ void *object, void *addr)
+{
+ return 0;
+}

static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
- { return 1; }
+{
+ return 1;
+}
+
static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
- void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+ void *object, int active)
+{
+ return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
static inline unsigned long kmem_cache_flags(unsigned long objsize,
unsigned long flags, const char *name, void (*ctor)(void *))
{
return flags;
}
-#define slqb_debug 0
+
+static const int slqb_debug = 0;
#endif

/*
* allocate a new slab (return its corresponding struct slqb_page)
*/
-static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+ gfp_t flags, int node)
{
struct slqb_page *page;
int pages = 1 << s->order;
@@ -916,8 +947,8 @@ static struct slqb_page *allocate_slab(s
/*
* Called once for each object on a new slab page
*/
-static void setup_object(struct kmem_cache *s, struct slqb_page *page,
- void *object)
+static void setup_object(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
{
setup_object_debug(s, page, object);
if (unlikely(s->ctor))
@@ -927,7 +958,8 @@ static void setup_object(struct kmem_cac
/*
* Allocate a new slab, set up its object list.
*/
-static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+ gfp_t flags, int node, unsigned int colour)
{
struct slqb_page *page;
void *start;
@@ -1010,7 +1042,9 @@ static void free_slab(struct kmem_cache
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+static int free_object_to_page(struct kmem_cache *s,
+ struct kmem_cache_list *l, struct slqb_page *page,
+ void *object)
{
VM_BUG_ON(page->list != l);

@@ -1027,6 +1061,7 @@ static int free_object_to_page(struct km
free_slab(s, page);
slqb_stat_inc(l, FLUSH_SLAB_FREE);
return 1;
+
} else if (page->inuse + 1 == s->objects) {
l->nr_partial++;
list_add(&page->lru, &l->partial);
@@ -1037,7 +1072,8 @@ static int free_object_to_page(struct km
}

#ifdef CONFIG_SMP
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+ void *object, struct kmem_cache_cpu *c);
#endif

/*
@@ -1110,7 +1146,8 @@ static void flush_free_list_all(struct k
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void claim_remote_free_list(struct kmem_cache *s,
+ struct kmem_cache_list *l)
{
void **head, **tail;
int nr;
@@ -1126,11 +1163,13 @@ static void claim_remote_free_list(struc
prefetchw(head);

spin_lock(&l->remote_free.lock);
+
l->remote_free.list.head = NULL;
tail = l->remote_free.list.tail;
l->remote_free.list.tail = NULL;
nr = l->remote_free.list.nr;
l->remote_free.list.nr = 0;
+
spin_unlock(&l->remote_free.lock);

if (!l->freelist.nr)
@@ -1153,18 +1192,19 @@ static void claim_remote_free_list(struc
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+ struct kmem_cache_list *l)
{
void *object;

object = l->freelist.head;
if (likely(object)) {
void *next = get_freepointer(s, object);
+
VM_BUG_ON(!l->freelist.nr);
l->freelist.nr--;
l->freelist.head = next;
-// if (next)
-// prefetchw(next);
+
return object;
}
VM_BUG_ON(l->freelist.nr);
@@ -1180,11 +1220,11 @@ static __always_inline void *__cache_lis
object = l->freelist.head;
if (likely(object)) {
void *next = get_freepointer(s, object);
+
VM_BUG_ON(!l->freelist.nr);
l->freelist.nr--;
l->freelist.head = next;
-// if (next)
-// prefetchw(next);
+
return object;
}
VM_BUG_ON(l->freelist.nr);
@@ -1203,7 +1243,8 @@ static __always_inline void *__cache_lis
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+ struct kmem_cache_list *l)
{
struct slqb_page *page;
void *object;
@@ -1216,15 +1257,12 @@ static noinline void *__cache_list_get_p
if (page->inuse + 1 == s->objects) {
l->nr_partial--;
list_del(&page->lru);
-/*XXX list_move(&page->lru, &l->full); */
}

VM_BUG_ON(!page->freelist);

page->inuse++;

-// VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
-
object = page->freelist;
page->freelist = get_freepointer(s, object);
if (page->freelist)
@@ -1244,7 +1282,8 @@ static noinline void *__cache_list_get_p
*
* Must be called with interrupts disabled.
*/
-static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
{
struct slqb_page *page;
struct kmem_cache_list *l;
@@ -1285,8 +1324,8 @@ static noinline void *__slab_alloc_page(
slqb_stat_inc(l, ALLOC);
slqb_stat_inc(l, ALLOC_SLAB_NEW);
object = __cache_list_get_page(s, l);
-#ifdef CONFIG_NUMA
} else {
+#ifdef CONFIG_NUMA
struct kmem_cache_node *n;

n = s->node[slqb_page_to_nid(page)];
@@ -1308,7 +1347,8 @@ static noinline void *__slab_alloc_page(
}

#ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+static noinline int alternate_nid(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
{
if (in_interrupt() || (gfpflags & __GFP_THISNODE))
return node;
@@ -1326,7 +1366,7 @@ static noinline int alternate_nid(struct
* Must be called with interrupts disabled.
*/
static noinline void *__remote_slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+ gfp_t gfpflags, int node)
{
struct kmem_cache_node *n;
struct kmem_cache_list *l;
@@ -1337,9 +1377,6 @@ static noinline void *__remote_slab_allo
return NULL;
l = &n->list;

-// if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
-// return NULL;
-
spin_lock(&n->list_lock);

object = __cache_list_get_object(s, l);
@@ -1363,7 +1400,7 @@ static noinline void *__remote_slab_allo
* Must be called with interrupts disabled.
*/
static __always_inline void *__slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+ gfp_t gfpflags, int node)
{
void *object;
struct kmem_cache_cpu *c;
@@ -1393,7 +1430,7 @@ static __always_inline void *__slab_allo
* (debug checking and memset()ing).
*/
static __always_inline void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr)
+ gfp_t gfpflags, int node, void *addr)
{
void *object;
unsigned long flags;
@@ -1414,7 +1451,8 @@ again:
return object;
}

-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, void *caller)
{
int node = -1;
#ifdef CONFIG_NUMA
@@ -1449,7 +1487,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
*
* Must be called with interrupts disabled.
*/
-static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static void flush_remote_free_cache(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
{
struct kmlist *src;
struct kmem_cache_list *dst;
@@ -1464,6 +1503,7 @@ static void flush_remote_free_cache(stru
#ifdef CONFIG_SLQB_STATS
{
struct kmem_cache_list *l = &c->list;
+
slqb_stat_inc(l, FLUSH_RFREE_LIST);
slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
}
@@ -1472,6 +1512,7 @@ static void flush_remote_free_cache(stru
dst = c->remote_cache_list;

spin_lock(&dst->remote_free.lock);
+
if (!dst->remote_free.list.head)
dst->remote_free.list.head = src->head;
else
@@ -1500,7 +1541,9 @@ static void flush_remote_free_cache(stru
*
* Must be called with interrupts disabled.
*/
-static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+ struct slqb_page *page, void *object,
+ struct kmem_cache_cpu *c)
{
struct kmlist *r;

@@ -1526,14 +1569,14 @@ static noinline void slab_free_to_remote
flush_remote_free_cache(s, c);
}
#endif
-
+
/*
* Main freeing path. Return an object, or NULL on allocation failure.
*
* Must be called with interrupts disabled.
*/
static __always_inline void __slab_free(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+ struct slqb_page *page, void *object)
{
struct kmem_cache_cpu *c;
struct kmem_cache_list *l;
@@ -1561,8 +1604,8 @@ static __always_inline void __slab_free(
if (unlikely(l->freelist.nr > slab_hiwater(s)))
flush_free_list(s, l);

-#ifdef CONFIG_NUMA
} else {
+#ifdef CONFIG_NUMA
/*
* Freeing an object that was allocated on a remote node.
*/
@@ -1577,7 +1620,7 @@ static __always_inline void __slab_free(
* (debug checking).
*/
static __always_inline void slab_free(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+ struct slqb_page *page, void *object)
{
unsigned long flags;

@@ -1597,6 +1640,7 @@ static __always_inline void slab_free(st
void kmem_cache_free(struct kmem_cache *s, void *object)
{
struct slqb_page *page = NULL;
+
if (numa_platform)
page = virt_to_head_slqb_page(object);
slab_free(s, page, object);
@@ -1610,7 +1654,7 @@ EXPORT_SYMBOL(kmem_cache_free);
* in the page allocator, and they have fastpaths in the page allocator. But
* also minimise external fragmentation with large objects.
*/
-static inline int slab_order(int size, int max_order, int frac)
+static int slab_order(int size, int max_order, int frac)
{
int order;

@@ -1618,6 +1662,7 @@ static inline int slab_order(int size, i
order = 0;
else
order = fls(size - 1) - PAGE_SHIFT;
+
while (order <= max_order) {
unsigned long slab_size = PAGE_SIZE << order;
unsigned long objects;
@@ -1638,7 +1683,7 @@ static inline int slab_order(int size, i
return order;
}

-static inline int calculate_order(int size)
+static int calculate_order(int size)
{
int order;

@@ -1666,7 +1711,7 @@ static inline int calculate_order(int si
* Figure out what the alignment of the objects will be.
*/
static unsigned long calculate_alignment(unsigned long flags,
- unsigned long align, unsigned long size)
+ unsigned long align, unsigned long size)
{
/*
* If the user wants hardware cache aligned objects then follow that
@@ -1677,6 +1722,7 @@ static unsigned long calculate_alignment
*/
if (flags & SLAB_HWCACHE_ALIGN) {
unsigned long ralign = cache_line_size();
+
while (size <= ralign / 2)
ralign /= 2;
align = max(align, ralign);
@@ -1688,21 +1734,21 @@ static unsigned long calculate_alignment
return ALIGN(align, sizeof(void *));
}

-static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+static void init_kmem_cache_list(struct kmem_cache *s,
+ struct kmem_cache_list *l)
{
- l->cache = s;
- l->freelist.nr = 0;
- l->freelist.head = NULL;
- l->freelist.tail = NULL;
- l->nr_partial = 0;
- l->nr_slabs = 0;
+ l->cache = s;
+ l->freelist.nr = 0;
+ l->freelist.head = NULL;
+ l->freelist.tail = NULL;
+ l->nr_partial = 0;
+ l->nr_slabs = 0;
INIT_LIST_HEAD(&l->partial);
-// INIT_LIST_HEAD(&l->full);

#ifdef CONFIG_SMP
- l->remote_free_check = 0;
+ l->remote_free_check = 0;
spin_lock_init(&l->remote_free.lock);
- l->remote_free.list.nr = 0;
+ l->remote_free.list.nr = 0;
l->remote_free.list.head = NULL;
l->remote_free.list.tail = NULL;
#endif
@@ -1713,21 +1759,22 @@ static void init_kmem_cache_list(struct
}

static void init_kmem_cache_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
+ struct kmem_cache_cpu *c)
{
init_kmem_cache_list(s, &c->list);

- c->colour_next = 0;
+ c->colour_next = 0;
#ifdef CONFIG_SMP
- c->rlist.nr = 0;
- c->rlist.head = NULL;
- c->rlist.tail = NULL;
- c->remote_cache_list = NULL;
+ c->rlist.nr = 0;
+ c->rlist.head = NULL;
+ c->rlist.tail = NULL;
+ c->remote_cache_list = NULL;
#endif
}

#ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+static void init_kmem_cache_node(struct kmem_cache *s,
+ struct kmem_cache_node *n)
{
spin_lock_init(&n->list_lock);
init_kmem_cache_list(s, &n->list);
@@ -1757,7 +1804,8 @@ static struct kmem_cache_node kmem_node_
#endif

#ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+ int cpu)
{
struct kmem_cache_cpu *c;

@@ -1918,14 +1966,15 @@ static int calculate_sizes(struct kmem_c
}

#ifdef CONFIG_SLQB_DEBUG
- if (flags & SLAB_STORE_USER)
+ if (flags & SLAB_STORE_USER) {
/*
* Need to store information about allocs and frees after
* the object.
*/
size += 2 * sizeof(struct track);
+ }

- if (flags & SLAB_RED_ZONE)
+ if (flags & SLAB_RED_ZONE) {
/*
* Add some empty padding so that we can catch
* overwrites from earlier objects rather than let
@@ -1934,6 +1983,7 @@ static int calculate_sizes(struct kmem_c
* of the object.
*/
size += sizeof(void *);
+ }
#endif

/*
@@ -1970,7 +2020,8 @@ static int calculate_sizes(struct kmem_c
*/
s->objects = (PAGE_SIZE << s->order) / size;

- s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+ s->freebatch = max(4UL*PAGE_SIZE / size,
+ min(256UL, 64*PAGE_SIZE / size));
if (!s->freebatch)
s->freebatch = 1;
s->hiwater = s->freebatch << 2;
@@ -1980,9 +2031,8 @@ static int calculate_sizes(struct kmem_c
}

static int kmem_cache_open(struct kmem_cache *s,
- const char *name, size_t size,
- size_t align, unsigned long flags,
- void (*ctor)(void *), int alloc)
+ const char *name, size_t size, size_t align,
+ unsigned long flags, void (*ctor)(void *), int alloc)
{
unsigned int left_over;

@@ -2024,7 +2074,7 @@ error_nodes:
free_kmem_cache_nodes(s);
error:
if (flags & SLAB_PANIC)
- panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+ panic("kmem_cache_create(): failed to create slab `%s'\n", name);
return 0;
}

@@ -2141,7 +2191,7 @@ EXPORT_SYMBOL(kmalloc_caches_dma);
#endif

static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
- const char *name, int size, gfp_t gfp_flags)
+ const char *name, int size, gfp_t gfp_flags)
{
unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;

@@ -2446,10 +2496,10 @@ static int __init cpucache_init(void)

for_each_online_cpu(cpu)
start_cpu_timer(cpu);
+
return 0;
}
-__initcall(cpucache_init);
-
+device_initcall(cpucache_init);

#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
static void slab_mem_going_offline_callback(void *arg)
@@ -2459,29 +2509,7 @@ static void slab_mem_going_offline_callb

static void slab_mem_offline_callback(void *arg)
{
- struct kmem_cache *s;
- struct memory_notify *marg = arg;
- int nid = marg->status_change_nid;
-
- /*
- * If the node still has available memory. we need kmem_cache_node
- * for it yet.
- */
- if (nid < 0)
- return;
-
-#if 0 // XXX: see cpu offline comment
- down_read(&slqb_lock);
- list_for_each_entry(s, &slab_caches, list) {
- struct kmem_cache_node *n;
- n = s->node[nid];
- if (n) {
- s->node[nid] = NULL;
- kmem_cache_free(&kmem_node_cache, n);
- }
- }
- up_read(&slqb_lock);
-#endif
+ /* XXX: should release structures, see CPU offline comment */
}

static int slab_mem_going_online_callback(void *arg)
@@ -2562,6 +2590,10 @@ void __init kmem_cache_init(void)
int i;
unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;

+ /*
+ * All the ifdefs are rather ugly here, but it's just the setup code,
+ * so it doesn't have to be too readable :)
+ */
#ifdef CONFIG_NUMA
if (num_possible_nodes() == 1)
numa_platform = 0;
@@ -2576,12 +2608,15 @@ void __init kmem_cache_init(void)
kmem_size = sizeof(struct kmem_cache);
#endif

- kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+ kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+ kmem_size, 0, flags, NULL, 0);
#ifdef CONFIG_SMP
- kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+ kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+ sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
#endif
#ifdef CONFIG_NUMA
- kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+ kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+ sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
#endif

#ifdef CONFIG_SMP
@@ -2634,14 +2669,13 @@ void __init kmem_cache_init(void)

for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
open_kmalloc_cache(&kmalloc_caches[i],
- "kmalloc", 1 << i, GFP_KERNEL);
+ "kmalloc", 1 << i, GFP_KERNEL);
#ifdef CONFIG_ZONE_DMA
open_kmalloc_cache(&kmalloc_caches_dma[i],
"kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
#endif
}

-
/*
* Patch up the size_index table if we have strange large alignment
* requirements for the kmalloc array. This is only the case for
@@ -2697,10 +2731,12 @@ static int kmem_cache_create_ok(const ch
printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
name);
dump_stack();
+
return 0;
}

down_read(&slqb_lock);
+
list_for_each_entry(tmp, &slab_caches, list) {
char x;
int res;
@@ -2723,9 +2759,11 @@ static int kmem_cache_create_ok(const ch
"kmem_cache_create(): duplicate cache %s\n", name);
dump_stack();
up_read(&slqb_lock);
+
return 0;
}
}
+
up_read(&slqb_lock);

WARN_ON(strchr(name, ' ')); /* It confuses parsers */
@@ -2754,7 +2792,8 @@ struct kmem_cache *kmem_cache_create(con

err:
if (flags & SLAB_PANIC)
- panic("kmem_cache_create(): failed to create slab `%s'\n",name);
+ panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
return NULL;
}
EXPORT_SYMBOL(kmem_cache_create);
@@ -2765,7 +2804,7 @@ EXPORT_SYMBOL(kmem_cache_create);
* necessary.
*/
static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
+ unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
struct kmem_cache *s;
@@ -2803,23 +2842,12 @@ static int __cpuinit slab_cpuup_callback
case CPU_UP_CANCELED_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
-#if 0
- down_read(&slqb_lock);
- /* XXX: this doesn't work because objects can still be on this
- * CPU's list. periodic timer needs to check if a CPU is offline
- * and then try to cleanup from there. Same for node offline.
+ /*
+ * XXX: Freeing here doesn't work because objects can still be
+ * on this CPU's list. periodic timer needs to check if a CPU
+ * is offline and then try to cleanup from there. Same for node
+ * offline.
*/
- list_for_each_entry(s, &slab_caches, list) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
- if (c) {
- kmem_cache_free(&kmem_cpu_cache, c);
- s->cpu_slab[cpu] = NULL;
- }
- }
-
- up_read(&slqb_lock);
-#endif
- break;
default:
break;
}
@@ -2904,9 +2932,8 @@ static void __gather_stats(void *arg)
gather->nr_partial += nr_partial;
gather->nr_inuse += nr_inuse;
#ifdef CONFIG_SLQB_STATS
- for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
gather->stats[i] += l->stats[i];
- }
#endif
spin_unlock(&gather->lock);
}
@@ -2935,9 +2962,8 @@ static void gather_stats(struct kmem_cac

spin_lock_irqsave(&n->list_lock, flags);
#ifdef CONFIG_SLQB_STATS
- for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
stats->stats[i] += l->stats[i];
- }
#endif
stats->nr_slabs += l->nr_slabs;
stats->nr_partial += l->nr_partial;
@@ -3007,10 +3033,11 @@ static int s_show(struct seq_file *m, vo
gather_stats(s, &stats);

seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
- stats.nr_objects, s->size, s->objects, (1 << s->order));
- seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
- seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
- 0UL);
+ stats.nr_objects, s->size, s->objects, (1 << s->order));
+ seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+ slab_freebatch(s), 0);
+ seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+ stats.nr_slabs, 0UL);
seq_putc(m, '\n');
return 0;
}
@@ -3036,7 +3063,8 @@ static const struct file_operations proc

static int __init slab_proc_init(void)
{
- proc_create("slabinfo",S_IWUSR|S_IRUGO,NULL,&proc_slabinfo_operations);
+ proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+ &proc_slabinfo_operations);
return 0;
}
module_init(slab_proc_init);
@@ -3106,7 +3134,9 @@ SLAB_ATTR_RO(ctor);
static ssize_t slabs_show(struct kmem_cache *s, char *buf)
{
struct stats_gather stats;
+
gather_stats(s, &stats);
+
return sprintf(buf, "%lu\n", stats.nr_slabs);
}
SLAB_ATTR_RO(slabs);
@@ -3114,7 +3144,9 @@ SLAB_ATTR_RO(slabs);
static ssize_t objects_show(struct kmem_cache *s, char *buf)
{
struct stats_gather stats;
+
gather_stats(s, &stats);
+
return sprintf(buf, "%lu\n", stats.nr_inuse);
}
SLAB_ATTR_RO(objects);
@@ -3122,7 +3154,9 @@ SLAB_ATTR_RO(objects);
static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
{
struct stats_gather stats;
+
gather_stats(s, &stats);
+
return sprintf(buf, "%lu\n", stats.nr_objects);
}
SLAB_ATTR_RO(total_objects);
@@ -3171,7 +3205,8 @@ static ssize_t store_user_show(struct km
}
SLAB_ATTR_RO(store_user);

-static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t hiwater_store(struct kmem_cache *s,
+ const char *buf, size_t length)
{
long hiwater;
int err;
@@ -3194,7 +3229,8 @@ static ssize_t hiwater_show(struct kmem_
}
SLAB_ATTR(hiwater);

-static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+static ssize_t freebatch_store(struct kmem_cache *s,
+ const char *buf, size_t length)
{
long freebatch;
int err;
@@ -3216,6 +3252,7 @@ static ssize_t freebatch_show(struct kme
return sprintf(buf, "%d\n", slab_freebatch(s));
}
SLAB_ATTR(freebatch);
+
#ifdef CONFIG_SLQB_STATS
static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
{
@@ -3233,8 +3270,9 @@ static int show_stat(struct kmem_cache *
for_each_online_cpu(cpu) {
struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
struct kmem_cache_list *l = &c->list;
+
if (len < PAGE_SIZE - 20)
- len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+ len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
}
#endif
return len + sprintf(buf + len, "\n");
@@ -3308,8 +3346,7 @@ static struct attribute_group slab_attr_
};

static ssize_t slab_attr_show(struct kobject *kobj,
- struct attribute *attr,
- char *buf)
+ struct attribute *attr, char *buf)
{
struct slab_attribute *attribute;
struct kmem_cache *s;
@@ -3327,8 +3364,7 @@ static ssize_t slab_attr_show(struct kob
}

static ssize_t slab_attr_store(struct kobject *kobj,
- struct attribute *attr,
- const char *buf, size_t len)
+ struct attribute *attr, const char *buf, size_t len)
{
struct slab_attribute *attribute;
struct kmem_cache *s;
@@ -3396,6 +3432,7 @@ static int sysfs_slab_add(struct kmem_ca
err = sysfs_create_group(&s->kobj, &slab_attr_group);
if (err)
return err;
+
kobject_uevent(&s->kobj, KOBJ_ADD);

return 0;
@@ -3420,17 +3457,20 @@ static int __init slab_sysfs_init(void)
}

down_write(&slqb_lock);
+
sysfs_available = 1;
+
list_for_each_entry(s, &slab_caches, list) {
err = sysfs_slab_add(s);
if (err)
printk(KERN_ERR "SLQB: Unable to add boot slab %s"
" to sysfs\n", s->name);
}
+
up_write(&slqb_lock);

return 0;
}
+device_initcall(slab_sysfs_init);

-__initcall(slab_sysfs_init);
#endif

2009-01-21 17:40:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator


* Nick Piggin <[email protected]> wrote:

> On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> >
> > Mind if i nitpick a bit about minor style issues? Since this is going to
> > be the next Linux SLAB allocator we might as well do it perfectly :-)
>
> Well here is an incremental patch which should get most of the issues
> you pointed out, most of the sane ones that checkpatch pointed out, and
> a few of my own ;)

here's an incremental one on top of your incremental patch, addressing some
more issues. I now find the code very readable! :-)

(In case you are wondering about the placement of bit_spinlock.h - that
file needs fixing: just move it to the top of the file and see the build
break. But that's a separate patch.)

Ingo

------------------->
Subject: slqb: cleanup
From: Ingo Molnar <[email protected]>
Date: Wed Jan 21 18:10:20 CET 2009

mm/slqb.o:

text data bss dec hex filename
17655 54159 200456 272270 4278e slqb.o.before
17653 54159 200456 272268 4278c slqb.o.after

Signed-off-by: Ingo Molnar <[email protected]>
---
mm/slqb.c | 588 ++++++++++++++++++++++++++++++++------------------------------
1 file changed, 308 insertions(+), 280 deletions(-)

Index: linux/mm/slqb.c
===================================================================
--- linux.orig/mm/slqb.c
+++ linux/mm/slqb.c
@@ -7,19 +7,20 @@
* Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
*/

-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/bit_spinlock.h>
#include <linux/interrupt.h>
-#include <linux/bitops.h>
-#include <linux/slab.h>
-#include <linux/seq_file.h>
-#include <linux/cpu.h>
-#include <linux/cpuset.h>
#include <linux/mempolicy.h>
-#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include <linux/seq_file.h>
+#include <linux/bitops.h>
+#include <linux/cpuset.h>
#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/mm.h>
+
+#include <linux/bit_spinlock.h>

/*
* TODO
@@ -40,14 +41,14 @@
struct slqb_page {
union {
struct {
- unsigned long flags; /* mandatory */
- atomic_t _count; /* mandatory */
- unsigned int inuse; /* Nr of objects */
+ unsigned long flags; /* mandatory */
+ atomic_t _count; /* mandatory */
+ unsigned int inuse; /* Nr of objects */
struct kmem_cache_list *list; /* Pointer to list */
- void **freelist; /* LIFO freelist */
+ void **freelist; /* LIFO freelist */
union {
- struct list_head lru; /* misc. list */
- struct rcu_head rcu_head; /* for rcu freeing */
+ struct list_head lru; /* misc. list */
+ struct rcu_head rcu_head; /* for rcu freeing */
};
};
struct page page;
@@ -120,16 +121,16 @@ static inline int slab_freebatch(struct
* - There is no remote free queue. Nodes don't free objects, CPUs do.
*/

-static inline void slqb_stat_inc(struct kmem_cache_list *list,
- enum stat_item si)
+static inline void
+slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
{
#ifdef CONFIG_SLQB_STATS
list->stats[si]++;
#endif
}

-static inline void slqb_stat_add(struct kmem_cache_list *list,
- enum stat_item si, unsigned long nr)
+static inline void
+slqb_stat_add(struct kmem_cache_list *list, enum stat_item si, unsigned long nr)
{
#ifdef CONFIG_SLQB_STATS
list->stats[si] += nr;
@@ -196,12 +197,12 @@ static inline void __free_slqb_pages(str
#ifdef CONFIG_SLQB_DEBUG
static inline int slab_debug(struct kmem_cache *s)
{
- return (s->flags &
+ return s->flags &
(SLAB_DEBUG_FREE |
SLAB_RED_ZONE |
SLAB_POISON |
SLAB_STORE_USER |
- SLAB_TRACE));
+ SLAB_TRACE);
}
static inline int slab_poison(struct kmem_cache *s)
{
@@ -574,34 +575,34 @@ static int check_bytes_and_report(struct
* Object layout:
*
* object address
- * Bytes of the object to be managed.
- * If the freepointer may overlay the object then the free
- * pointer is the first word of the object.
+ * Bytes of the object to be managed.
+ * If the freepointer may overlay the object then the free
+ * pointer is the first word of the object.
*
- * Poisoning uses 0x6b (POISON_FREE) and the last byte is
- * 0xa5 (POISON_END)
+ * Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 0xa5 (POISON_END)
*
* object + s->objsize
- * Padding to reach word boundary. This is also used for Redzoning.
- * Padding is extended by another word if Redzoning is enabled and
- * objsize == inuse.
+ * Padding to reach word boundary. This is also used for Redzoning.
+ * Padding is extended by another word if Redzoning is enabled and
+ * objsize == inuse.
*
- * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
- * 0xcc (RED_ACTIVE) for objects in use.
+ * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 0xcc (RED_ACTIVE) for objects in use.
*
* object + s->inuse
- * Meta data starts here.
+ * Meta data starts here.
*
- * A. Free pointer (if we cannot overwrite object on free)
- * B. Tracking data for SLAB_STORE_USER
- * C. Padding to reach required alignment boundary or at mininum
- * one word if debuggin is on to be able to detect writes
- * before the word boundary.
+ * A. Free pointer (if we cannot overwrite object on free)
+ * B. Tracking data for SLAB_STORE_USER
+ * C. Padding to reach required alignment boundary or at mininum
+ * one word if debuggin is on to be able to detect writes
+ * before the word boundary.
*
* Padding is done using 0x5a (POISON_INUSE)
*
* object + s->size
- * Nothing is used beyond s->size.
+ * Nothing is used beyond s->size.
*/

static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
@@ -717,25 +718,26 @@ static int check_slab(struct kmem_cache
return 1;
}

-static void trace(struct kmem_cache *s, struct slqb_page *page,
- void *object, int alloc)
+static void
+trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
{
- if (s->flags & SLAB_TRACE) {
- printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
- s->name,
- alloc ? "alloc" : "free",
- object, page->inuse,
- page->freelist);
+ if (likely(!(s->flags & SLAB_TRACE)))
+ return;

- if (!alloc)
- print_section("Object", (void *)object, s->objsize);
+ printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+ s->name,
+ alloc ? "alloc" : "free",
+ object, page->inuse,
+ page->freelist);

- dump_stack();
- }
+ if (!alloc)
+ print_section("Object", (void *)object, s->objsize);
+
+ dump_stack();
}

-static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
- void *object)
+static void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
{
if (!slab_debug(s))
return;
@@ -747,11 +749,10 @@ static void setup_object_debug(struct km
init_tracking(s, object);
}

-static int alloc_debug_processing(struct kmem_cache *s,
- void *object, void *addr)
+static int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
{
- struct slqb_page *page;
- page = virt_to_head_slqb_page(object);
+ struct slqb_page *page = virt_to_head_slqb_page(object);

if (!check_slab(s, page))
goto bad;
@@ -767,6 +768,7 @@ static int alloc_debug_processing(struct
/* Success perform special debug activities for allocs */
if (s->flags & SLAB_STORE_USER)
set_track(s, object, TRACK_ALLOC, addr);
+
trace(s, page, object, 1);
init_object(s, object, 1);
return 1;
@@ -775,11 +777,9 @@ bad:
return 0;
}

-static int free_debug_processing(struct kmem_cache *s,
- void *object, void *addr)
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
{
- struct slqb_page *page;
- page = virt_to_head_slqb_page(object);
+ struct slqb_page *page = virt_to_head_slqb_page(object);

if (!check_slab(s, page))
goto fail;
@@ -870,29 +870,34 @@ static unsigned long kmem_cache_flags(un
void (*ctor)(void *))
{
/*
- * Enable debugging if selected on the kernel commandline.
+ * Enable debugging if selected on the kernel commandline:
*/
- if (slqb_debug && (!slqb_debug_slabs ||
- strncmp(slqb_debug_slabs, name,
- strlen(slqb_debug_slabs)) == 0))
- flags |= slqb_debug;
+
+ if (!slqb_debug)
+ return flags;
+
+ if (!slqb_debug_slabs)
+ return flags | slqb_debug;
+
+ if (!strncmp(slqb_debug_slabs, name, strlen(slqb_debug_slabs)))
+ return flags | slqb_debug;

return flags;
}
#else
-static inline void setup_object_debug(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+static inline void
+setup_object_debug(struct kmem_cache *s, struct slqb_page *page, void *object)
{
}

-static inline int alloc_debug_processing(struct kmem_cache *s,
- void *object, void *addr)
+static inline int
+alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
{
return 0;
}

-static inline int free_debug_processing(struct kmem_cache *s,
- void *object, void *addr)
+static inline int
+free_debug_processing(struct kmem_cache *s, void *object, void *addr)
{
return 0;
}
@@ -903,7 +908,7 @@ static inline int slab_pad_check(struct
}

static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
- void *object, int active)
+ void *object, int active)
{
return 1;
}
@@ -924,11 +929,11 @@ static const int slqb_debug = 0;
/*
* allocate a new slab (return its corresponding struct slqb_page)
*/
-static struct slqb_page *allocate_slab(struct kmem_cache *s,
- gfp_t flags, int node)
+static struct slqb_page *
+allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
- struct slqb_page *page;
int pages = 1 << s->order;
+ struct slqb_page *page;

flags |= s->allocflags;

@@ -947,8 +952,8 @@ static struct slqb_page *allocate_slab(s
/*
* Called once for each object on a new slab page
*/
-static void setup_object(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+static void
+setup_object(struct kmem_cache *s, struct slqb_page *page, void *object)
{
setup_object_debug(s, page, object);
if (unlikely(s->ctor))
@@ -958,8 +963,8 @@ static void setup_object(struct kmem_cac
/*
* Allocate a new slab, set up its object list.
*/
-static struct slqb_page *new_slab_page(struct kmem_cache *s,
- gfp_t flags, int node, unsigned int colour)
+static struct slqb_page *
+new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
{
struct slqb_page *page;
void *start;
@@ -1030,6 +1035,7 @@ static void rcu_free_slab(struct rcu_hea
static void free_slab(struct kmem_cache *s, struct slqb_page *page)
{
VM_BUG_ON(page->inuse);
+
if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
call_rcu(&page->rcu_head, rcu_free_slab);
else
@@ -1060,12 +1066,14 @@ static int free_object_to_page(struct km
l->nr_slabs--;
free_slab(s, page);
slqb_stat_inc(l, FLUSH_SLAB_FREE);
+
return 1;

} else if (page->inuse + 1 == s->objects) {
l->nr_partial++;
list_add(&page->lru, &l->partial);
slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+
return 0;
}
return 0;
@@ -1146,8 +1154,8 @@ static void flush_free_list_all(struct k
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static void claim_remote_free_list(struct kmem_cache *s,
- struct kmem_cache_list *l)
+static void
+claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
{
void **head, **tail;
int nr;
@@ -1192,8 +1200,8 @@ static void claim_remote_free_list(struc
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
- struct kmem_cache_list *l)
+static __always_inline void *
+__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
{
void *object;

@@ -1243,8 +1251,8 @@ static __always_inline void *__cache_lis
* Caller must be the owner CPU in the case of per-CPU list, or hold the node's
* list_lock in the case of per-node list.
*/
-static noinline void *__cache_list_get_page(struct kmem_cache *s,
- struct kmem_cache_list *l)
+static noinline void *
+__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
{
struct slqb_page *page;
void *object;
@@ -1282,12 +1290,12 @@ static noinline void *__cache_list_get_p
*
* Must be called with interrupts disabled.
*/
-static noinline void *__slab_alloc_page(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+static noinline void *
+__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- struct slqb_page *page;
struct kmem_cache_list *l;
struct kmem_cache_cpu *c;
+ struct slqb_page *page;
unsigned int colour;
void *object;

@@ -1347,15 +1355,19 @@ static noinline void *__slab_alloc_page(
}

#ifdef CONFIG_NUMA
-static noinline int alternate_nid(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+static noinline int
+alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
{
if (in_interrupt() || (gfpflags & __GFP_THISNODE))
return node;
- if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+
+ if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD)) {
return cpuset_mem_spread_node();
- else if (current->mempolicy)
- return slab_node(current->mempolicy);
+ } else {
+ if (current->mempolicy)
+ return slab_node(current->mempolicy);
+ }
+
return node;
}

@@ -1365,8 +1377,8 @@ static noinline int alternate_nid(struct
*
* Must be called with interrupts disabled.
*/
-static noinline void *__remote_slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+static noinline void *
+__remote_slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
{
struct kmem_cache_node *n;
struct kmem_cache_list *l;
@@ -1375,6 +1387,7 @@ static noinline void *__remote_slab_allo
n = s->node[node];
if (unlikely(!n)) /* node has no memory */
return NULL;
+
l = &n->list;

spin_lock(&n->list_lock);
@@ -1389,7 +1402,9 @@ static noinline void *__remote_slab_allo
}
if (likely(object))
slqb_stat_inc(l, ALLOC);
+
spin_unlock(&n->list_lock);
+
return object;
}
#endif
@@ -1399,12 +1414,12 @@ static noinline void *__remote_slab_allo
*
* Must be called with interrupts disabled.
*/
-static __always_inline void *__slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node)
+static __always_inline void *
+__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
{
- void *object;
- struct kmem_cache_cpu *c;
struct kmem_cache_list *l;
+ struct kmem_cache_cpu *c;
+ void *object;

#ifdef CONFIG_NUMA
if (unlikely(node != -1) && unlikely(node != numa_node_id()))
@@ -1422,6 +1437,7 @@ static __always_inline void *__slab_allo
}
if (likely(object))
slqb_stat_inc(l, ALLOC);
+
return object;
}

@@ -1429,11 +1445,11 @@ static __always_inline void *__slab_allo
* Perform some interrupts-on processing around the main allocation path
* (debug checking and memset()ing).
*/
-static __always_inline void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr)
+static __always_inline void *
+slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr)
{
- void *object;
unsigned long flags;
+ void *object;

again:
local_irq_save(flags);
@@ -1451,10 +1467,11 @@ again:
return object;
}

-static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
- gfp_t gfpflags, void *caller)
+static __always_inline void *
+__kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags, void *caller)
{
int node = -1;
+
#ifdef CONFIG_NUMA
if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
node = alternate_nid(s, gfpflags, node);
@@ -1487,8 +1504,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
*
* Must be called with interrupts disabled.
*/
-static void flush_remote_free_cache(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
+static void
+flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
struct kmlist *src;
struct kmem_cache_list *dst;
@@ -1575,12 +1592,12 @@ static noinline void slab_free_to_remote
*
* Must be called with interrupts disabled.
*/
-static __always_inline void __slab_free(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+static __always_inline void
+__slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
{
- struct kmem_cache_cpu *c;
- struct kmem_cache_list *l;
int thiscpu = smp_processor_id();
+ struct kmem_cache_list *l;
+ struct kmem_cache_cpu *c;

c = get_cpu_slab(s, thiscpu);
l = &c->list;
@@ -1619,8 +1636,8 @@ static __always_inline void __slab_free(
* Perform some interrupts-on processing around the main freeing path
* (debug checking).
*/
-static __always_inline void slab_free(struct kmem_cache *s,
- struct slqb_page *page, void *object)
+static __always_inline void
+slab_free(struct kmem_cache *s, struct slqb_page *page, void *object)
{
unsigned long flags;

@@ -1683,7 +1700,7 @@ static int slab_order(int size, int max_
return order;
}

-static int calculate_order(int size)
+static int calc_order(int size)
{
int order;

@@ -1710,8 +1727,8 @@ static int calculate_order(int size)
/*
* Figure out what the alignment of the objects will be.
*/
-static unsigned long calculate_alignment(unsigned long flags,
- unsigned long align, unsigned long size)
+static unsigned long
+calc_alignment(unsigned long flags, unsigned long align, unsigned long size)
{
/*
* If the user wants hardware cache aligned objects then follow that
@@ -1737,18 +1754,18 @@ static unsigned long calculate_alignment
static void init_kmem_cache_list(struct kmem_cache *s,
struct kmem_cache_list *l)
{
- l->cache = s;
- l->freelist.nr = 0;
- l->freelist.head = NULL;
- l->freelist.tail = NULL;
- l->nr_partial = 0;
- l->nr_slabs = 0;
+ l->cache = s;
+ l->freelist.nr = 0;
+ l->freelist.head = NULL;
+ l->freelist.tail = NULL;
+ l->nr_partial = 0;
+ l->nr_slabs = 0;
INIT_LIST_HEAD(&l->partial);

#ifdef CONFIG_SMP
- l->remote_free_check = 0;
+ l->remote_free_check = 0;
spin_lock_init(&l->remote_free.lock);
- l->remote_free.list.nr = 0;
+ l->remote_free.list.nr = 0;
l->remote_free.list.head = NULL;
l->remote_free.list.tail = NULL;
#endif
@@ -1758,8 +1775,7 @@ static void init_kmem_cache_list(struct
#endif
}

-static void init_kmem_cache_cpu(struct kmem_cache *s,
- struct kmem_cache_cpu *c)
+static void init_kmem_cache_cpu(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
init_kmem_cache_list(s, &c->list);

@@ -1773,8 +1789,8 @@ static void init_kmem_cache_cpu(struct k
}

#ifdef CONFIG_NUMA
-static void init_kmem_cache_node(struct kmem_cache *s,
- struct kmem_cache_node *n)
+static void
+init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
{
spin_lock_init(&n->list_lock);
init_kmem_cache_list(s, &n->list);
@@ -1804,8 +1820,8 @@ static struct kmem_cache_node kmem_node_
#endif

#ifdef CONFIG_SMP
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
- int cpu)
+static struct kmem_cache_cpu *
+alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
{
struct kmem_cache_cpu *c;

@@ -1910,7 +1926,7 @@ static int alloc_kmem_cache_nodes(struct
#endif

/*
- * calculate_sizes() determines the order and the distribution of data within
+ * calc_sizes() determines the order and the distribution of data within
* a slab object.
*/
static int calculate_sizes(struct kmem_cache *s)
@@ -1991,7 +2007,7 @@ static int calculate_sizes(struct kmem_c
* user specified and the dynamic determination of cache line size
* on bootup.
*/
- align = calculate_alignment(flags, align, s->objsize);
+ align = calc_alignment(flags, align, s->objsize);

/*
* SLQB stores one object immediately after another beginning from
@@ -2000,7 +2016,7 @@ static int calculate_sizes(struct kmem_c
*/
size = ALIGN(size, align);
s->size = size;
- s->order = calculate_order(size);
+ s->order = calc_order(size);

if (s->order < 0)
return 0;
@@ -2210,38 +2226,38 @@ static struct kmem_cache *open_kmalloc_c
* fls.
*/
static s8 size_index[24] __cacheline_aligned = {
- 3, /* 8 */
- 4, /* 16 */
- 5, /* 24 */
- 5, /* 32 */
- 6, /* 40 */
- 6, /* 48 */
- 6, /* 56 */
- 6, /* 64 */
+ 3, /* 8 */
+ 4, /* 16 */
+ 5, /* 24 */
+ 5, /* 32 */
+ 6, /* 40 */
+ 6, /* 48 */
+ 6, /* 56 */
+ 6, /* 64 */
#if L1_CACHE_BYTES < 64
- 1, /* 72 */
- 1, /* 80 */
- 1, /* 88 */
- 1, /* 96 */
+ 1, /* 72 */
+ 1, /* 80 */
+ 1, /* 88 */
+ 1, /* 96 */
#else
- 7,
- 7,
- 7,
- 7,
-#endif
- 7, /* 104 */
- 7, /* 112 */
- 7, /* 120 */
- 7, /* 128 */
+ 7,
+ 7,
+ 7,
+ 7,
+#endif
+ 7, /* 104 */
+ 7, /* 112 */
+ 7, /* 120 */
+ 7, /* 128 */
#if L1_CACHE_BYTES < 128
- 2, /* 136 */
- 2, /* 144 */
- 2, /* 152 */
- 2, /* 160 */
- 2, /* 168 */
- 2, /* 176 */
- 2, /* 184 */
- 2 /* 192 */
+ 2, /* 136 */
+ 2, /* 144 */
+ 2, /* 152 */
+ 2, /* 160 */
+ 2, /* 168 */
+ 2, /* 176 */
+ 2, /* 184 */
+ 2 /* 192 */
#else
-1,
-1,
@@ -2278,9 +2294,8 @@ static struct kmem_cache *get_slab(size_

void *__kmalloc(size_t size, gfp_t flags)
{
- struct kmem_cache *s;
+ struct kmem_cache *s = get_slab(size, flags);

- s = get_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

@@ -2291,9 +2306,8 @@ EXPORT_SYMBOL(__kmalloc);
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
- struct kmem_cache *s;
+ struct kmem_cache *s = get_slab(size, flags);

- s = get_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

@@ -2340,8 +2354,8 @@ EXPORT_SYMBOL(ksize);

void kfree(const void *object)
{
- struct kmem_cache *s;
struct slqb_page *page;
+ struct kmem_cache *s;

if (unlikely(ZERO_OR_NULL_PTR(object)))
return;
@@ -2371,21 +2385,21 @@ static void kmem_cache_trim_percpu(void

int kmem_cache_shrink(struct kmem_cache *s)
{
-#ifdef CONFIG_NUMA
- int node;
-#endif
-
on_each_cpu(kmem_cache_trim_percpu, s, 1);

#ifdef CONFIG_NUMA
- for_each_node_state(node, N_NORMAL_MEMORY) {
- struct kmem_cache_node *n = s->node[node];
- struct kmem_cache_list *l = &n->list;
+ {
+ int node;

- spin_lock_irq(&n->list_lock);
- claim_remote_free_list(s, l);
- flush_free_list(s, l);
- spin_unlock_irq(&n->list_lock);
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
}
#endif

@@ -2397,8 +2411,8 @@ EXPORT_SYMBOL(kmem_cache_shrink);
static void kmem_cache_reap_percpu(void *arg)
{
int cpu = smp_processor_id();
- struct kmem_cache *s;
long phase = (long)arg;
+ struct kmem_cache *s;

list_for_each_entry(s, &slab_caches, list) {
struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
@@ -2442,8 +2456,7 @@ static void kmem_cache_reap(void)

static void cache_trim_worker(struct work_struct *w)
{
- struct delayed_work *work =
- container_of(w, struct delayed_work, work);
+ struct delayed_work *work;
struct kmem_cache *s;
int node;

@@ -2469,6 +2482,7 @@ static void cache_trim_worker(struct wor

up_read(&slqb_lock);
out:
+ work = container_of(w, struct delayed_work, work);
schedule_delayed_work(work, round_jiffies_relative(3*HZ));
}

@@ -2587,8 +2601,8 @@ static int slab_memory_callback(struct n

void __init kmem_cache_init(void)
{
- int i;
unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+ int i;

/*
* All the ifdefs are rather ugly here, but it's just the setup code,
@@ -2719,8 +2733,9 @@ void __init kmem_cache_init(void)
/*
* Some basic slab creation sanity checks
*/
-static int kmem_cache_create_ok(const char *name, size_t size,
- size_t align, unsigned long flags)
+static int
+kmem_cache_create_ok(const char *name, size_t size,
+ size_t align, unsigned long flags)
{
struct kmem_cache *tmp;

@@ -2773,8 +2788,9 @@ static int kmem_cache_create_ok(const ch
return 1;
}

-struct kmem_cache *kmem_cache_create(const char *name, size_t size,
- size_t align, unsigned long flags, void (*ctor)(void *))
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
{
struct kmem_cache *s;

@@ -2804,7 +2820,7 @@ EXPORT_SYMBOL(kmem_cache_create);
* necessary.
*/
static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
+ unsigned long action, void *hcpu)
{
long cpu = (long)hcpu;
struct kmem_cache *s;
@@ -2855,7 +2871,7 @@ static int __cpuinit slab_cpuup_callback
}

static struct notifier_block __cpuinitdata slab_notifier = {
- .notifier_call = slab_cpuup_callback
+ .notifier_call = slab_cpuup_callback
};

#endif
@@ -2878,11 +2894,10 @@ void *__kmalloc_track_caller(size_t size
}

void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
- unsigned long caller)
+ unsigned long caller)
{
- struct kmem_cache *s;
+ struct kmem_cache *s = get_slab(size, flags);

- s = get_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

@@ -2892,12 +2907,17 @@ void *__kmalloc_node_track_caller(size_t

#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
struct stats_gather {
- struct kmem_cache *s;
- spinlock_t lock;
- unsigned long nr_slabs;
- unsigned long nr_partial;
- unsigned long nr_inuse;
- unsigned long nr_objects;
+ /*
+ * Serialize on_each_cpu() instances updating the summary
+ * stats structure:
+ */
+ spinlock_t lock;
+
+ struct kmem_cache *s;
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ unsigned long nr_objects;

#ifdef CONFIG_SLQB_STATS
unsigned long stats[NR_SLQB_STAT_ITEMS];
@@ -2915,25 +2935,25 @@ static void __gather_stats(void *arg)
struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
struct kmem_cache_list *l = &c->list;
struct slqb_page *page;
-#ifdef CONFIG_SLQB_STATS
- int i;
-#endif

nr_slabs = l->nr_slabs;
nr_partial = l->nr_partial;
nr_inuse = (nr_slabs - nr_partial) * s->objects;

- list_for_each_entry(page, &l->partial, lru) {
+ list_for_each_entry(page, &l->partial, lru)
nr_inuse += page->inuse;
- }

spin_lock(&gather->lock);
gather->nr_slabs += nr_slabs;
gather->nr_partial += nr_partial;
gather->nr_inuse += nr_inuse;
#ifdef CONFIG_SLQB_STATS
- for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
- gather->stats[i] += l->stats[i];
+ {
+ int i;
+
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+ gather->stats[i] += l->stats[i];
+ }
#endif
spin_unlock(&gather->lock);
}
@@ -2956,14 +2976,15 @@ static void gather_stats(struct kmem_cac
struct kmem_cache_list *l = &n->list;
struct slqb_page *page;
unsigned long flags;
-#ifdef CONFIG_SLQB_STATS
- int i;
-#endif

spin_lock_irqsave(&n->list_lock, flags);
#ifdef CONFIG_SLQB_STATS
- for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
- stats->stats[i] += l->stats[i];
+ {
+ int i;
+
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+ stats->stats[i] += l->stats[i];
+ }
#endif
stats->nr_slabs += l->nr_slabs;
stats->nr_partial += l->nr_partial;
@@ -3039,14 +3060,15 @@ static int s_show(struct seq_file *m, vo
seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
stats.nr_slabs, 0UL);
seq_putc(m, '\n');
+
return 0;
}

static const struct seq_operations slabinfo_op = {
- .start = s_start,
- .next = s_next,
- .stop = s_stop,
- .show = s_show,
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
};

static int slabinfo_open(struct inode *inode, struct file *file)
@@ -3205,8 +3227,8 @@ static ssize_t store_user_show(struct km
}
SLAB_ATTR_RO(store_user);

-static ssize_t hiwater_store(struct kmem_cache *s,
- const char *buf, size_t length)
+static ssize_t
+hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
{
long hiwater;
int err;
@@ -3229,8 +3251,8 @@ static ssize_t hiwater_show(struct kmem_
}
SLAB_ATTR(hiwater);

-static ssize_t freebatch_store(struct kmem_cache *s,
- const char *buf, size_t length)
+static ssize_t
+freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
{
long freebatch;
int err;
@@ -3258,91 +3280,95 @@ static int show_stat(struct kmem_cache *
{
struct stats_gather stats;
int len;
-#ifdef CONFIG_SMP
- int cpu;
-#endif

gather_stats(s, &stats);

len = sprintf(buf, "%lu", stats.stats[si]);

#ifdef CONFIG_SMP
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
- struct kmem_cache_list *l = &c->list;
+ {
+ int cpu;

- if (len < PAGE_SIZE - 20)
- len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ if (len < PAGE_SIZE - 20) {
+ len += sprintf(buf+len,
+ " C%d=%lu", cpu, l->stats[si]);
+ }
+ }
}
#endif
return len + sprintf(buf + len, "\n");
}

-#define STAT_ATTR(si, text) \
+#define STAT_ATTR(si, text) \
static ssize_t text##_show(struct kmem_cache *s, char *buf) \
{ \
return show_stat(s, buf, si); \
} \
SLAB_ATTR_RO(text); \

-STAT_ATTR(ALLOC, alloc);
-STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
-STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
-STAT_ATTR(FREE, free);
-STAT_ATTR(FREE_REMOTE, free_remote);
-STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
-STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
-STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
-STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
-STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
-STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
-STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
-STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
-STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
#endif

static struct attribute *slab_attrs[] = {
- &slab_size_attr.attr,
- &object_size_attr.attr,
- &objs_per_slab_attr.attr,
- &order_attr.attr,
- &objects_attr.attr,
- &total_objects_attr.attr,
- &slabs_attr.attr,
- &ctor_attr.attr,
- &align_attr.attr,
- &hwcache_align_attr.attr,
- &reclaim_account_attr.attr,
- &destroy_by_rcu_attr.attr,
- &red_zone_attr.attr,
- &poison_attr.attr,
- &store_user_attr.attr,
- &hiwater_attr.attr,
- &freebatch_attr.attr,
+
+ & slab_size_attr.attr,
+ & object_size_attr.attr,
+ & objs_per_slab_attr.attr,
+ & order_attr.attr,
+ & objects_attr.attr,
+ & total_objects_attr.attr,
+ & slabs_attr.attr,
+ & ctor_attr.attr,
+ & align_attr.attr,
+ & hwcache_align_attr.attr,
+ & reclaim_account_attr.attr,
+ & destroy_by_rcu_attr.attr,
+ & red_zone_attr.attr,
+ & poison_attr.attr,
+ & store_user_attr.attr,
+ & hiwater_attr.attr,
+ & freebatch_attr.attr,
#ifdef CONFIG_ZONE_DMA
- &cache_dma_attr.attr,
+ & cache_dma_attr.attr,
#endif
#ifdef CONFIG_SLQB_STATS
- &alloc_attr.attr,
- &alloc_slab_fill_attr.attr,
- &alloc_slab_new_attr.attr,
- &free_attr.attr,
- &free_remote_attr.attr,
- &flush_free_list_attr.attr,
- &flush_free_list_objects_attr.attr,
- &flush_free_list_remote_attr.attr,
- &flush_slab_partial_attr.attr,
- &flush_slab_free_attr.attr,
- &flush_rfree_list_attr.attr,
- &flush_rfree_list_objects_attr.attr,
- &claim_remote_list_attr.attr,
- &claim_remote_list_objects_attr.attr,
+ & alloc_attr.attr,
+ & alloc_slab_fill_attr.attr,
+ & alloc_slab_new_attr.attr,
+ & free_attr.attr,
+ & free_remote_attr.attr,
+ & flush_free_list_attr.attr,
+ & flush_free_list_objects_attr.attr,
+ & flush_free_list_remote_attr.attr,
+ & flush_slab_partial_attr.attr,
+ & flush_slab_free_attr.attr,
+ & flush_rfree_list_attr.attr,
+ & flush_rfree_list_objects_attr.attr,
+ & claim_remote_list_attr.attr,
+ & claim_remote_list_objects_attr.attr,
#endif
NULL
};

static struct attribute_group slab_attr_group = {
- .attrs = slab_attrs,
+ .attrs = slab_attrs,
};

static ssize_t slab_attr_show(struct kobject *kobj,
@@ -3389,13 +3415,13 @@ static void kmem_cache_release(struct ko
}

static struct sysfs_ops slab_sysfs_ops = {
- .show = slab_attr_show,
- .store = slab_attr_store,
+ .show = slab_attr_show,
+ .store = slab_attr_store,
};

static struct kobj_type slab_ktype = {
- .sysfs_ops = &slab_sysfs_ops,
- .release = kmem_cache_release
+ .sysfs_ops = &slab_sysfs_ops,
+ .release = kmem_cache_release
};

static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -3413,7 +3439,7 @@ static struct kset_uevent_ops slab_ueven

static struct kset *slab_kset;

-static int sysfs_available __read_mostly = 0;
+static int sysfs_available __read_mostly;

static int sysfs_slab_add(struct kmem_cache *s)
{
@@ -3462,9 +3488,11 @@ static int __init slab_sysfs_init(void)

list_for_each_entry(s, &slab_caches, list) {
err = sysfs_slab_add(s);
- if (err)
- printk(KERN_ERR "SLQB: Unable to add boot slab %s"
- " to sysfs\n", s->name);
+ if (!err)
+ continue;
+
+ printk(KERN_ERR
+ "SLQB: Unable to add boot slab %s to sysfs\n", s->name);
}

up_write(&slqb_lock);

2009-01-21 17:58:50

by Joe Perches

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

One thing you might consider is that
Q is visually close enough to O to be
misread.

Perhaps a different letter would be good.

2009-01-21 18:11:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, 21 Jan 2009, Nick Piggin wrote:
>
> Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> have a system to test with), and improved performance and reduced
> locking somewhat for node-specific and interleaved allocations.

I haven't reviewed your postings, but I did give the previous version
of your patch a try on all my machines. Some observations and one patch.

I was initially _very_ impressed by how well it did on my venerable
tmpfs loop swapping loads, where I'd expected next to no effect; but
that turned out to be because on three machines I'd been using SLUB,
without remembering how default slub_max_order got raised from 1 to 3
in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).

That's been making SLUB behave pretty badly (e.g. elapsed time 30%
more than SLAB) with swapping loads on most of my machines. Though
oddly one seems immune, and another takes four times as long: guess
it depends on how close to thrashing, but probably more to investigate
there. I think my original SLUB versus SLAB comparisons were done on
the immune one: as I remember, SLUB and SLAB were equivalent on those
loads when SLUB came in, but even with boot option slub_max_order=1,
SLUB is still slower than SLAB on such tests (e.g. 2% slower).
FWIW - swapping loads are not what anybody should tune for.

So in fact SLQB comes in very much like SLAB, as I think you'd expect:
slightly ahead of it on most of the machines, but probably in the noise.
(SLOB behaves decently: not a winner, but no catastrophic behaviour.)

What I love most about SLUB is the way you can reasonably build with
CONFIG_SLUB_DEBUG=y, very little impact, then switch on the specific
debugging you want with a boot option when you want it. That was a
great stride forward, which you've followed in SLQB: so I'd have to
prefer SLQB to SLAB (on debuggability) and to SLUB (on high orders).
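
To illustrate - assuming the boot-option syntax follows SLUB's
slub_debug, which the slqb_debug/slqb_debug_slabs parsing in the patch
suggests, though I haven't checked the exact flag letters SLQB accepts -
that means something like:

    slqb_debug=FZP          sanity checks, red zoning and poisoning for all caches
    slqb_debug=U,dentry     user tracking for the dentry cache only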

I do hate the name SLQB. Despite having no experience of databases,
I find it almost impossible to type, coming out as SQLB most times.
Wish you'd invented a plausible vowel instead of the Q; but probably
too late for that.

init/Kconfig describes it as "Qeued allocator": should say "Queued".

Documentation/vm/slqbinfo.c gives several compilation warnings:
I'd rather leave it to you to fix them, maybe the unused variables
are about to be used, or maybe there's much worse wrong with it
than a few compilation warnings, I didn't investigate.

The only bug I found (but you'll probably want to change the patch
- which I've rediffed to today's slqb.c, but not retested).

On fake NUMA I hit kernel BUG at mm/slqb.c:1107! claim_remote_free_list()
is doing several things without remote_free.lock: that VM_BUG_ON is unsafe
for one, and even if others are somehow safe today, it will be more robust
to take the lock sooner.

I moved the prefetchw(head) down to where we know it's going to be the head,
and replaced the offending VM_BUG_ON by a later WARN_ON which you'd probably
better remove altogether: once we got the lock, it's hardly interesting.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/slqb.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

--- slqb/mm/slqb.c.orig 2009-01-21 15:23:54.000000000 +0000
+++ slqb/mm/slqb.c 2009-01-21 15:32:44.000000000 +0000
@@ -1115,17 +1115,12 @@ static void claim_remote_free_list(struc
void **head, **tail;
int nr;

- VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
-
if (!l->remote_free.list.nr)
return;

+ spin_lock(&l->remote_free.lock);
l->remote_free_check = 0;
head = l->remote_free.list.head;
- /* Get the head hot for the likely subsequent allocation or flush */
- prefetchw(head);
-
- spin_lock(&l->remote_free.lock);
l->remote_free.list.head = NULL;
tail = l->remote_free.list.tail;
l->remote_free.list.tail = NULL;
@@ -1133,9 +1128,15 @@ static void claim_remote_free_list(struc
l->remote_free.list.nr = 0;
spin_unlock(&l->remote_free.lock);

- if (!l->freelist.nr)
+ WARN_ON(!head + !tail != !nr + !nr);
+ if (!nr)
+ return;
+
+ if (!l->freelist.nr) {
+ /* Get head hot for likely subsequent allocation or flush */
+ prefetchw(head);
l->freelist.head = head;
- else
+ } else
set_freepointer(s, l->freelist.tail, head);
l->freelist.tail = tail;

2009-01-22 08:45:50

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> Hi,
>
> Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> have a system to test with),
It panics again on my Montvale Itanium NUMA machine if I start the kernel with the parameter
mem=2G.

The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on,
when mnt_init uses the kmem_cache sysfs_dir_cache, the kernel panics
at __slab_alloc => get_cpu_slab because parameter s is NULL.

Function __remote_slab_alloc returns NULL when s->node[node] == NULL. That causes
sysfs_init => kmem_cache_create to fail.


------------------log----------------

Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes)
Mount-cache hash table entries: 1024
mnt_init: sysfs_init error: -12
Unable to handle kernel NULL pointer dereference (address 0000000000002058)
swapper[0]: Oops 8813272891392 [1]
Modules linked in:

Pid: 0, CPU 0, comm: swapper
psr : 00001010084a2018 ifs : 8000000000000690 ip : [<a000000100180350>] Not tainted (2.6.29-rc2slqb0121)
ip is at kmem_cache_alloc+0x150/0x4e0
unat: 0000000000000000 pfs : 0000000000000690 rsc : 0000000000000003
rnat: 0009804c8a70433f bsps: a000000100f484b0 pr : 656960155aa65959
ldrs: 0000000000000000 ccv : 000000000000001a fpsr: 0009804c8a70433f
csd : 893fffff000f0000 ssd : 893fffff00090000
b0 : a000000100180270 b6 : a000000100507360 b7 : a000000100507360
f6 : 000000000000000000000 f7 : 1003e0000000000000800
f8 : 1003e0000000000000008 f9 : 1003e0000000000000001
f10 : 1003e0000000000000031 f11 : 1003e7d6343eb1a1f58d1
r1 : a0000001011bc810 r2 : 0000000000000008 r3 : ffffffffffffffff
r8 : 0000000000000000 r9 : a000000100ded800 r10 : 0000000000000000
r11 : a000000100ded800 r12 : a000000100db3d80 r13 : a000000100dac000
r14 : 0000000000000000 r15 : fffffffffffffffe r16 : a000000100fbcd30
r17 : a000000100dacc44 r18 : 0000000000002058 r19 : 0000000000000000
r20 : 0000000000000000 r21 : a000000100dacc44 r22 : 0000000000000002
r23 : 0000000000000066 r24 : 0000000000000073 r25 : 0000000000000000
r26 : e000000102014030 r27 : a0007fffffc9f120 r28 : 0000000000000000
r29 : 0000000000000000 r30 : 0000000000000008 r31 : 0000000000000001

Call Trace:
[<a000000100016240>] show_stack+0x40/0xa0
sp=a000000100db3950 bsp=a000000100dad140
[<a000000100016b50>] show_regs+0x850/0x8a0
sp=a000000100db3b20 bsp=a000000100dad0e8
[<a00000010003a5f0>] die+0x230/0x360
sp=a000000100db3b20 bsp=a000000100dad0a0
[<a00000010005e0e0>] ia64_do_page_fault+0x8e0/0xa40
sp=a000000100db3b20 bsp=a000000100dad050
[<a00000010000c700>] ia64_native_leave_kernel+0x0/0x280
sp=a000000100db3bb0 bsp=a000000100dad050
[<a000000100180350>] kmem_cache_alloc+0x150/0x4e0
sp=a000000100db3d80 bsp=a000000100dacfc8
[<a000000100238610>] sysfs_new_dirent+0x90/0x240
sp=a000000100db3d80 bsp=a000000100dacf80
[<a000000100239140>] create_dir+0x40/0x100
sp=a000000100db3d90 bsp=a000000100dacf48
[<a0000001002392b0>] sysfs_create_dir+0xb0/0x100
sp=a000000100db3db0 bsp=a000000100dacf28
[<a0000001004eca60>] kobject_add_internal+0x1e0/0x420
sp=a000000100db3dc0 bsp=a000000100dacee8
[<a0000001004eceb0>] kobject_add_varg+0x90/0xc0
sp=a000000100db3dc0 bsp=a000000100daceb0
[<a0000001004ed620>] kobject_add+0x100/0x140
sp=a000000100db3dc0 bsp=a000000100dace50
[<a0000001004ed6b0>] kobject_create_and_add+0x50/0xc0
sp=a000000100db3e00 bsp=a000000100dace20
[<a000000100c28ff0>] mnt_init+0x1b0/0x480
sp=a000000100db3e00 bsp=a000000100dacde0
[<a000000100c28610>] vfs_caches_init+0x230/0x280
sp=a000000100db3e20 bsp=a000000100dacdb8
[<a000000100c01410>] start_kernel+0x830/0x8c0
sp=a000000100db3e20 bsp=a000000100dacd40
[<a0000001009d7b60>] __kprobes_text_end+0x760/0x780
sp=a000000100db3e30 bsp=a000000100dacca0
Kernel panic - not syncing: Attempted to kill the idle task!

2009-01-22 10:01:39

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Hugh,

On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <[email protected]> wrote:
> I was initially _very_ impressed by how well it did on my venerable
> tmpfs loop swapping loads, where I'd expected next to no effect; but
> that turned out to be because on three machines I'd been using SLUB,
> without remembering how default slub_max_order got raised from 1 to 3
> in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
>
> That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> more than SLAB) with swapping loads on most of my machines. Though
> oddly one seems immune, and another takes four times as long: guess
> it depends on how close to thrashing, but probably more to investigate
> there. I think my original SLUB versus SLAB comparisons were done on
> the immune one: as I remember, SLUB and SLAB were equivalent on those
> loads when SLUB came in, but even with boot option slub_max_order=1,
> SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> FWIW - swapping loads are not what anybody should tune for.

What kind of machine are you seeing this on? It sounds like it could
be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
("slub: Calculate min_objects based on number of processors").

Pekka

2009-01-22 12:48:01

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <[email protected]> wrote:
> > I was initially _very_ impressed by how well it did on my venerable
> > tmpfs loop swapping loads, where I'd expected next to no effect; but
> > that turned out to be because on three machines I'd been using SLUB,
> > without remembering how default slub_max_order got raised from 1 to 3
> > in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
> >
> > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > more than SLAB) with swapping loads on most of my machines. Though
> > oddly one seems immune, and another takes four times as long: guess
> > it depends on how close to thrashing, but probably more to investigate
> > there. I think my original SLUB versus SLAB comparisons were done on
> > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > loads when SLUB came in, but even with boot option slub_max_order=1,
> > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > FWIW - swapping loads are not what anybody should tune for.
>
> What kind of machine are you seeing this on? It sounds like it could
> be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> ("slub: Calculate min_objects based on number of processors").

Thanks, yes, that could well account for the residual difference: the
machines in question have 2 or 4 cpus, so the old slub_min_objects=4
has effectively become slub_min_objects=12 or slub_min_objects=16.
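
For reference, a minimal sketch of how I understand that heuristic
(default_min_objects() is only an illustrative wrapper here, not a
function that exists in slub.c):

	/*
	 * Default from commit 9b2cd506e5f2 ("slub: Calculate min_objects
	 * based on number of processors"), as I read it; an explicit
	 * slub_min_objects= on the boot line still wins.
	 */
	static unsigned int default_min_objects(void)
	{
		unsigned int min_objects = slub_min_objects;	/* 0 if not given */

		if (!min_objects)
			min_objects = 4 * (fls(nr_cpu_ids) + 1);

		/* 2 cpus: 4 * (2 + 1) = 12;  4 cpus: 4 * (3 + 1) = 16 */
		return min_objects;
	}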

I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
lines (though I'll need to curtail tests on a couple of machines),
and will report back later.

It's great that SLUB provides these knobs; not so great that it needs them.

Hugh

2009-01-23 03:31:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > On Wed, Jan 21, 2009 at 03:59:18PM +0100, Ingo Molnar wrote:
> > >
> > > Mind if i nitpick a bit about minor style issues? Since this is going to
> > > be the next Linux SLAB allocator we might as well do it perfectly :-)
> >
> > Well here is an incremental patch which should get most of the issues
> > you pointed out, most of the sane ones that checkpatch pointed out, and
> > a few of my own ;)
>
> here's an incremental one on top of your incremental patch, addressing some
> more issues. I now find the code very readable! :-)

Thanks! I'll go through it and apply it. I'll raise any issues if I
am particularly against them ;)

> ( in case you are wondering about the placement of bit_spinlock.h - that
> file needs fixing, just move it to the top of the file and see the build
> break. But that's a separate patch.)

Ah, SLQB doesn't use bit spinlocks anyway, so I'll just get rid of that.
I'll see if there are any other obviously unneeded headers too.

Thanks,
Nick

2009-01-23 03:35:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 09:59:30AM -0800, Joe Perches wrote:
> One thing you might consider is that
> Q is visually close enough to O to be
> misread.
>
> Perhaps a different letter would be good.

That's a fair point. Hugh dislikes it too, I see ;) What to do... I
had been toying with the idea that if slqb (or slub) becomes "the"
allocator, then we could rename it all back to slAb after replacing
the existing slab?

Or I could make it a 128 bit allocator and call it SLZB, which would
definitely make it "the final" allocator ;)

2009-01-23 03:55:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 06:10:12PM +0000, Hugh Dickins wrote:
> On Wed, 21 Jan 2009, Nick Piggin wrote:
> >
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with), and improved performance and reduced
> > locking somewhat for node-specific and interleaved allocations.
>
> I haven't reviewed your postings, but I did give the previous version
> of your patch a try on all my machines. Some observations and one patch.

Great, thanks!


> I was initially _very_ impressed by how well it did on my venerable
> tmpfs loop swapping loads, where I'd expected next to no effect; but
> that turned out to be because on three machines I'd been using SLUB,
> without remembering how default slub_max_order got raised from 1 to 3
> in 2.6.26 (hmm, and Documentation/vm/slub.txt not updated).
>
> That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> more than SLAB) with swapping loads on most of my machines. Though
> oddly one seems immune, and another takes four times as long: guess
> it depends on how close to thrashing, but probably more to investigate
> there. I think my original SLUB versus SLAB comparisons were done on
> the immune one: as I remember, SLUB and SLAB were equivalent on those
> loads when SLUB came in, but even with boot option slub_max_order=1,
> SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> FWIW - swapping loads are not what anybody should tune for.

Yeah, that's to be expected with higher order allocations I think. Does
your immune machine simply have fewer CPUs and thus doesn't use such
high order allocations?


> So in fact SLQB comes in very much like SLAB, as I think you'd expect:
> slightly ahead of it on most of the machines, but probably in the noise.
> (SLOB behaves decently: not a winner, but no catastrophic behaviour.)
>
> What I love most about SLUB is the way you can reasonably build with
> CONFIG_SLUB_DEBUG=y, very little impact, then switch on the specific
> debugging you want with a boot option when you want it. That was a
> great stride forward, which you've followed in SLQB: so I'd have to
> prefer SLQB to SLAB (on debuggability) and to SLUB (on high orders).

It is nice. All credit to Christoph for that (and the fine grained
sysfs code).


> I do hate the name SLQB. Despite having no experience of databases,
> I find it almost impossible to type, coming out as SQLB most times.
> Wish you'd invented a plausible vowel instead of the Q; but probably
> too late for that.

Yeah, apologies for the name :P


> init/Kconfig describes it as "Qeued allocator": should say "Queued".

Thanks.


> Documentation/vm/slqbinfo.c gives several compilation warnings:
> I'd rather leave it to you to fix them, maybe the unused variables
> are about to be used, or maybe there's much worse wrong with it
> than a few compilation warnings, I didn't investigate.

OK.


> The only bug I found (but you'll probably want to change the patch
> - which I've rediffed to today's slqb.c, but not retested).
>
> On fake NUMA I hit kernel BUG at mm/slqb.c:1107! claim_remote_free_list()
> is doing several things without remote_free.lock: that VM_BUG_ON is unsafe
> for one, and even if others are somehow safe today, it will be more robust
> to take the lock sooner.

Good catch, thanks. The BUG should be OK where it is if we only
claim the remote free list when remote_free_check is set, but
some of the periodic reaping and teardown code calls it unconditionally.
But it's not performance-critical, so it should definitely go inside the lock.
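
Roughly, the two kinds of call sites look like this (a sketch from
memory, not the exact code):

	/* allocation slow path: only claim when the producer flagged us */
	if (unlikely(l->remote_free_check))
		claim_remote_free_list(s, l);

	/* periodic trimming / teardown: claims unconditionally, so any
	 * head/tail consistency check has to sit under remote_free.lock */
	claim_remote_free_list(s, l);
	flush_free_list(s, l);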


> I moved the prefetchw(head) down to where we know it's going to be the head,
> and replaced the offending VM_BUG_ON by a later WARN_ON which you'd probably
> better remove altogether: once we got the lock, it's hardly interesting.

Right, I'll probably do that. Thanks!

> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/slqb.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> --- slqb/mm/slqb.c.orig 2009-01-21 15:23:54.000000000 +0000
> +++ slqb/mm/slqb.c 2009-01-21 15:32:44.000000000 +0000
> @@ -1115,17 +1115,12 @@ static void claim_remote_free_list(struc
> void **head, **tail;
> int nr;
>
> - VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> -
> if (!l->remote_free.list.nr)
> return;
>
> + spin_lock(&l->remote_free.lock);
> l->remote_free_check = 0;
> head = l->remote_free.list.head;
> - /* Get the head hot for the likely subsequent allocation or flush */
> - prefetchw(head);
> -
> - spin_lock(&l->remote_free.lock);
> l->remote_free.list.head = NULL;
> tail = l->remote_free.list.tail;
> l->remote_free.list.tail = NULL;
> @@ -1133,9 +1128,15 @@ static void claim_remote_free_list(struc
> l->remote_free.list.nr = 0;
> spin_unlock(&l->remote_free.lock);
>
> - if (!l->freelist.nr)
> + WARN_ON(!head + !tail != !nr + !nr);
> + if (!nr)
> + return;
> +
> + if (!l->freelist.nr) {
> + /* Get head hot for likely subsequent allocation or flush */
> + prefetchw(head);
> l->freelist.head = head;
> - else
> + } else
> set_freepointer(s, l->freelist.tail, head);
> l->freelist.tail = tail;
>

2009-01-23 03:57:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, Jan 22, 2009 at 04:45:33PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> > Hi,
> >
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with),
> It panics again on my Montvale Itanium NUMA machine if I start the kernel with the parameter
> mem=2G.
>
> The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on,
> when mnt_init uses the kmem_cache sysfs_dir_cache, the kernel panics
> at __slab_alloc => get_cpu_slab because parameter s is NULL.
>
> Function __remote_slab_alloc returns NULL when s->node[node] == NULL. That causes
> sysfs_init => kmem_cache_create to fail.

Hmm, I'll probably have to add a bit more fallback logic. I'll have to
work out what semantics the callers require here. Thanks for the report.
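
For illustration only, the sort of thing I have in mind in
__slab_alloc() (untested sketch; whether falling through like this is
acceptable for __GFP_THISNODE callers is exactly what needs deciding):

#ifdef CONFIG_NUMA
	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
		object = __remote_slab_alloc(s, gfpflags, node);
		if (likely(object))
			return object;
		/* node has no memory: fall back to the local CPU's list */
	}
#endif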


2009-01-23 03:59:59

by Joe Perches

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 2009-01-23 at 04:35 +0100, Nick Piggin wrote:
> That's a fair point. Hugh dislikes it too, I see ;) What to do... I
> had been toying with the idea that if slqb (or slub) becomes "the"
> allocator, then we could rename it all back to slAb after replacing
> the existing slab?

maybe SLIB (slab-improved) or SLAB_NG or NSLAB or SLABX
Who says it has to be 4 letters?

> Or I could make it a 128 bit allocator and call it SLZB, which would
> definitely make it "the final" allocator ;)

That leads to the phone book game.

SLZZB - and a crystal bridge now spans the fissure.

Hmm, wrong game.

cheers, j

2009-01-23 06:14:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
> -static inline void slqb_stat_inc(struct kmem_cache_list *list,
> - enum stat_item si)
> +static inline void
> +slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
> {

Hmm, I'm not entirely fond of this style. The former scales to longer lines
with just a single style change (putting args into new lines), whereas the
latter first moves its prefixes to a newline, then moves args as the
line grows even longer.

I guess it is a matter of taste, not wrong either way... but I think most
of the mm code I'm used to looking at uses the former. Do you feel strongly?
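
(Purely to illustrate what I mean, using a slightly longer signature such as
slqb_stat_add from this patch: the former style grows as

static inline void slqb_stat_add(struct kmem_cache_list *list,
		enum stat_item si, unsigned long nr)

whereas the latter ends up as

static inline void
slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
		unsigned long nr)

so it takes two style changes to get there rather than one.)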


> +static void
> +trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
> {
> - if (s->flags & SLAB_TRACE) {
> - printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
> - s->name,
> - alloc ? "alloc" : "free",
> - object, page->inuse,
> - page->freelist);
> + if (likely(!(s->flags & SLAB_TRACE)))
> + return;

I think most of your flow control changes are improvements (others even
more than this, but this is the first one so I comment here). Thanks.


> @@ -1389,7 +1402,9 @@ static noinline void *__remote_slab_allo
> }
> if (likely(object))
> slqb_stat_inc(l, ALLOC);
> +
> spin_unlock(&n->list_lock);
> +
> return object;
> }
> #endif

Whitespace, I never really know if I'm "doing it right" or not :) And
often it is easy to tell a badly wrong one, but harder to tell what is
better between two reasonable ones. But I guess I'm the same way with
paragraphs in my writing...


> @@ -1399,12 +1414,12 @@ static noinline void *__remote_slab_allo
> *
> * Must be called with interrupts disabled.
> */
> -static __always_inline void *__slab_alloc(struct kmem_cache *s,
> - gfp_t gfpflags, int node)
> +static __always_inline void *
> +__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node)
> {
> - void *object;
> - struct kmem_cache_cpu *c;
> struct kmem_cache_list *l;
> + struct kmem_cache_cpu *c;
> + void *object;

Same with order of local variables. You like longest lines to
shortest I know. I think I vaguely try to arrange them from the
most important or high level "actor" to the least, and then in
order of when they get discovered/used.

For example, in the above function, "object" is the raison d'etre.
kmem_cache_cpu is found first, and from that, kmem_cache_list is
found. Which slightly explains the order.


> +static __always_inline void *
> +slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr)
> {
> - void *object;
> unsigned long flags;
> + void *object;

And here, eg. flags comes last because it's mostly inconsequential to
the bigger picture.

Your method is easier though, I'll grant you that :)


> static void init_kmem_cache_list(struct kmem_cache *s,
> struct kmem_cache_list *l)
> {
> - l->cache = s;
> - l->freelist.nr = 0;
> - l->freelist.head = NULL;
> - l->freelist.tail = NULL;
> - l->nr_partial = 0;
> - l->nr_slabs = 0;
> + l->cache = s;
> + l->freelist.nr = 0;
> + l->freelist.head = NULL;
> + l->freelist.tail = NULL;
> + l->nr_partial = 0;
> + l->nr_slabs = 0;
> INIT_LIST_HEAD(&l->partial);

Hmm, we seem to have gathered an extra space...

>
> #ifdef CONFIG_SMP
> - l->remote_free_check = 0;
> + l->remote_free_check = 0;
> spin_lock_init(&l->remote_free.lock);
> - l->remote_free.list.nr = 0;
> + l->remote_free.list.nr = 0;
> l->remote_free.list.head = NULL;
> l->remote_free.list.tail = NULL;
> #endif

... ah, to line up with this guy. TBH, I prefer not to religiously
line things up like this. If there is the odd long-line, just give
it the normal single space. I find it just keeps it easier to
maintain. Although you might counter that of course it is easier to
keep something clean if one relaxes their definition of "clean".


> static s8 size_index[24] __cacheline_aligned = {
> - 3, /* 8 */
> - 4, /* 16 */
> - 5, /* 24 */
> - 5, /* 32 */
> - 6, /* 40 */
> - 6, /* 48 */
> - 6, /* 56 */
> - 6, /* 64 */
> + 3, /* 8 */
> + 4, /* 16 */
> + 5, /* 24 */
> + 5, /* 32 */
> + 6, /* 40 */
> + 6, /* 48 */
> + 6, /* 56 */
> + 6, /* 64 */

However, justifying numbers like this I'm happy to do (may as well
align the numbers in the comments too while we're here).


> @@ -2278,9 +2294,8 @@ static struct kmem_cache *get_slab(size_
>
> void *__kmalloc(size_t size, gfp_t flags)
> {
> - struct kmem_cache *s;
> + struct kmem_cache *s = get_slab(size, flags);
>
> - s = get_slab(size, flags);
> if (unlikely(ZERO_OR_NULL_PTR(s)))
> return s;

I've got the same problem yet again with these... I mostly try to avoid
doing this, although there are some cases where it works well
(eg. constants, or a simple assignment of an argument to a local).

At some point, you start putting real code in there, at which point
the space after the local vars doesn't seem to serve much purpose.
get_slab I feel logically belongs close to the subsequent check,
because that's basically sanitizing its return value / extracting
the error case from it and leaving the rest of the function to work
on the common case.


> -static int sysfs_available __read_mostly = 0;
> +static int sysfs_available __read_mostly;

These, I actually like initializing to zero explicitly. I'm pretty
sure gcc no longer makes it any more expensive than leaving it out.
Yes of course everybody who knows C has to know this, but.... I
just don't feel much harm in leaving it.

Lots of good stuff, lots I'm on the fence with, some I dislike ;)
I'll concentrate on picking up the obvious ones, and get the bugs
fixed. Will see where the discussion goes with the rest.

Thanks,
Nick

2009-01-23 09:01:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, Jan 22, 2009 at 04:45:33PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 15:30 +0100, Nick Piggin wrote:
> > Hi,
> >
> > Since last posted, I've cleaned up a few bits and pieces, (hopefully)
> > fixed a known bug where it wouldn't boot on memoryless nodes (I don't
> > have a system to test with),
> Panic again on my Montvale Itanium NUMA machine if I start kernel with parameter
> mem=2G.
>
> The call chain is mnt_init => sysfs_init. kmem_cache_create fails, so later on
> when mnt_init uses kmem_cache sysfs_dir_cache, the kernel panics
> at __slab_alloc => get_cpu_slab because parameter s is equal to NULL.
>
> Function __remote_slab_alloc returns NULL when s->node[node]==NULL. That causes
> sysfs_init => kmem_cache_create to fail.

Booting with mem= is a good trick to create memoryless nodes easily.
Unfortunately it didn't trigger any bugs on my system, so I couldn't
actually verify that the fallback code solves your problem. Would
you be able to test with this updated patch (which also includes
Hugh's fix and some code style changes)?

The other thing is that this bug has uncovered a little buglet in the
sysfs setup code: if it is unable to continue in a degraded mode after
the allocation failure, it should be using SLAB_PANIC.
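
(Illustration only, not part of this patch, and going from memory of the
sysfs code: the cache creation in sysfs_init() would then look roughly like

	sysfs_dir_cachep = kmem_cache_create("sysfs_dir_cache",
				sizeof(struct sysfs_dirent), 0,
				SLAB_PANIC, NULL);

so the failure panics with a meaningful message rather than limping on to
the later NULL dereference.)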

Thanks,
Nick
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
#ifndef __LINUX_RCUPDATE_H
#define __LINUX_RCUPDATE_H

+#include <linux/rcu_types.h>
#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/threads.h>
@@ -42,16 +43,6 @@
#include <linux/lockdep.h>
#include <linux/completion.h>

-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
- struct rcu_head *next;
- void (*func)(struct rcu_head *head);
-};
-
#if defined(CONFIG_CLASSIC_RCU)
#include <linux/rcuclassic.h>
#elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,289 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <[email protected]>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+
+enum stat_item {
+ ALLOC, /* Allocation count */
+ ALLOC_SLAB_FILL, /* Fill freelist from page list */
+ ALLOC_SLAB_NEW, /* New slab acquired from page allocator */
+ FREE, /* Free count */
+ FREE_REMOTE, /* NUMA: freeing to remote list */
+ FLUSH_FREE_LIST, /* Freelist flushed */
+ FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+ FLUSH_FREE_LIST_REMOTE, /* Objects flushed from freelist to remote */
+ FLUSH_SLAB_PARTIAL, /* Freeing moves slab to partial list */
+ FLUSH_SLAB_FREE, /* Slab freed to the page allocator */
+ FLUSH_RFREE_LIST, /* Rfree list flushed */
+ FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+ CLAIM_REMOTE_LIST, /* Remote freed list claimed */
+ CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+ NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+ unsigned long nr;
+ void **head;
+ void **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+ spinlock_t lock;
+ struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+ /* Fastpath LIFO freelist of objects */
+ struct kmlist freelist;
+#ifdef CONFIG_SMP
+ /* remote_free has reached a watermark */
+ int remote_free_check;
+#endif
+ /* kmem_cache corresponding to this list */
+ struct kmem_cache *cache;
+
+ /* Number of partial slabs (pages) */
+ unsigned long nr_partial;
+
+ /* Slabs which have some free objects */
+ struct list_head partial;
+
+ /* Total number of slabs allocated */
+ unsigned long nr_slabs;
+
+#ifdef CONFIG_SMP
+ /*
+ * In the case of per-cpu lists, remote_free is for objects freed by
+ * non-owner CPU back to its home list. For per-node lists, remote_free
+ * is always used to free objects.
+ */
+ struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+ struct kmem_cache_list list; /* List for node-local slabs */
+ unsigned int colour_next; /* Next colour offset to use */
+
+#ifdef CONFIG_SMP
+ /*
+ * rlist is a list of objects that don't fit on list.freelist (ie.
+ * wrong node). The objects all correspond to a given kmem_cache_list,
+ * remote_cache_list. To free objects to another list, we must first
+ * flush the existing objects, then switch remote_cache_list.
+ *
+ * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+ * get to O(NR_CPUS^2) memory consumption situation.
+ */
+ struct kmlist rlist;
+ struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure. Used for node-specific allocations.
+ */
+struct kmem_cache_node {
+ struct kmem_cache_list list;
+ spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+ unsigned long flags;
+ int hiwater; /* LIFO list high watermark */
+ int freebatch; /* LIFO freelist batch flush size */
+ int objsize; /* Size of object without meta data */
+ int offset; /* Free pointer offset. */
+ int objects; /* Number of objects in slab */
+
+ int size; /* Size of object including meta data */
+ int order; /* Allocation order */
+ gfp_t allocflags; /* gfp flags to use on allocation */
+ unsigned int colour_range; /* range of colour counter */
+ unsigned int colour_off; /* offset per colour */
+ void (*ctor)(void *);
+
+ const char *name; /* Name (only for display!) */
+ struct list_head list; /* List of slab caches */
+
+ int align; /* Alignment */
+ int inuse; /* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+ struct kobject kobj; /* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+ struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+ struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
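+/*
+ * Largest size served by the kmalloc caches: 2^(PAGE_SHIFT + 9), ie. 2MB
+ * with 4K pages (matching the top of kmalloc_index below).
+ */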
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+ if (unlikely(!size))
+ return 0;
+ if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+ return 0;
+
+ if (unlikely(size <= KMALLOC_MIN_SIZE))
+ return KMALLOC_SHIFT_LOW;
+
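+ /*
+ * Indices 1 and 2 are the off-size 96 and 192 byte caches; they are
+ * only used when the L1 cacheline size is small enough for them to be
+ * worthwhile.
+ */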
+#if L1_CACHE_BYTES < 64
+ if (size > 64 && size <= 96)
+ return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+ if (size > 128 && size <= 192)
+ return 2;
+#endif
+ if (size <= 8) return 3;
+ if (size <= 16) return 4;
+ if (size <= 32) return 5;
+ if (size <= 64) return 6;
+ if (size <= 128) return 7;
+ if (size <= 256) return 8;
+ if (size <= 512) return 9;
+ if (size <= 1024) return 10;
+ if (size <= 2 * 1024) return 11;
+ if (size <= 4 * 1024) return 12;
+ if (size <= 8 * 1024) return 13;
+ if (size <= 16 * 1024) return 14;
+ if (size <= 32 * 1024) return 15;
+ if (size <= 64 * 1024) return 16;
+ if (size <= 128 * 1024) return 17;
+ if (size <= 256 * 1024) return 18;
+ if (size <= 512 * 1024) return 19;
+ if (size <= 1024 * 1024) return 20;
+ if (size <= 2 * 1024 * 1024) return 21;
+ return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+ int index = kmalloc_index(size);
+
+ if (unlikely(index == 0))
+ return NULL;
+
+ if (likely(!(flags & SLQB_DMA)))
+ return &kmalloc_caches[index];
+ else
+ return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? \
+ sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc(s, flags);
+ }
+ return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+ }
+ return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -806,7 +806,7 @@ config SLUB_DEBUG

choice
prompt "Choose SLAB allocator"
- default SLUB
+ default SLQB
help
This option allows to select a slab allocator.

@@ -827,6 +827,11 @@ config SLUB
and has enhanced diagnostics. SLUB is the default choice for
a slab allocator.

+config SLQB
+ bool "SLQB (Qeued allocator)"
+ help
+ SLQB is a proposed new slab allocator.
+
config SLOB
depends on EMBEDDED
bool "SLOB (Simple Allocator)"
@@ -868,7 +873,7 @@ config HAVE_GENERIC_DMA_COHERENT
config SLABINFO
bool
depends on PROC_FS
- depends on SLAB || SLUB_DEBUG
+ depends on SLAB || SLUB_DEBUG || SLQB
default y

config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
out which slabs are relevant to a particular load.
Try running: slabinfo -DA

+config SLQB_DEBUG
+ default y
+ bool "Enable SLQB debugging support"
+ depends on SLQB
+
+config SLQB_DEBUG_ON
+ default n
+ bool "SLQB debugging on by default"
+ depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+ bool "Create SYSFS entries for slab caches"
+ default n
+ depends on SLQB
+
+config SLQB_STATS
+ bool "Enable SLQB performance statistics"
+ default n
+ depends on SLQB_SYSFS
+
config DEBUG_PREEMPT
bool "Debug preemptible kernel"
depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3509 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * another CPU from that which allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+/*
+ * TODO
+ * - fix up releasing of offlined data structures. Not a big deal because
+ * they don't get cumulatively leaked with successive online/offline cycles
+ * - improve fallback paths, allow OOM conditions to flush back per-CPU pages
+ * to common lists to be reused by other CPUs.
+ * - investigate performance with memoryless nodes. Perhaps CPUs can be given
+ * a default closest home node via which they can use fastpath functions.
+ * Perhaps it is not a big problem.
+ */
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects. However, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+ union {
+ struct {
+ unsigned long flags; /* mandatory */
+ atomic_t _count; /* mandatory */
+ unsigned int inuse; /* Nr of objects */
+ struct kmem_cache_list *list; /* Pointer to list */
+ void **freelist; /* LIFO freelist */
+ union {
+ struct list_head lru; /* misc. list */
+ struct rcu_head rcu_head; /* for rcu freeing */
+ };
+ };
+ struct page page;
+ };
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
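+/*
+ * When numa_platform is zero, __slab_free treats every object as node-local
+ * and skips the remote freeing paths entirely.
+ */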
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+static const int numa_platform = 0;
+#endif
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+ return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+ return s->freebatch;
+}
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ * kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ * objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ * the object list, it tries from the partial list, then the free list.
+ * After freeing an object to the object list, if it is over a watermark,
+ * some objects are freed back to pages. If an allocation misses these lists,
+ * a new slab page is allocated from the page allocator. If the free list
+ * reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ * node are queued to. When this reaches a watermark, the objects are
+ * flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ * to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ * used to protect access to this queue.
+ *
+ * When the remotely freed queue reaches a watermark, a flag is set to tell
+ * the owner CPU to check it. The owner CPU will then check the queue on the
+ * next allocation that misses the object list. It will move all objects from
+ * this list onto the object list and then allocate one.
+ *
+ * This system of remote queueing is intended to reduce lock and remote
+ * cacheline acquisitions, and give a cooling off period for remotely freed
+ * objects before they are re-allocated.
+ *
+ * node specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ * allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list,
+ enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list,
+ enum stat_item si, unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+ return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+ return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+ return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+ return page_to_nid(virt_to_page_fast(addr));
+#else
+ return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+ struct page *p;
+
+ p = virt_to_head_page(addr);
+ return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+ unsigned int order)
+{
+ struct page *p;
+
+ if (nid == -1)
+ p = alloc_pages(flags, order);
+ else
+ p = alloc_pages_node(nid, flags, order);
+
+ return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+ struct page *p = &page->page;
+
+ reset_page_mapcount(p);
+ p->mapping = NULL;
+ VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+ p->flags &= ~PG_SLQB_BIT;
+
+ __free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return (s->flags &
+ (SLAB_DEBUG_FREE |
+ SLAB_RED_ZONE |
+ SLAB_POISON |
+ SLAB_STORE_USER |
+ SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+ SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON 0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size() L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+ void *addr; /* Called from address */
+ int cpu; /* Was running on cpu */
+ int pid; /* Pid context */
+ unsigned long when; /* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+ return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+ VM_BUG_ON(!s->cpu_slab[cpu]);
+ return s->cpu_slab[cpu];
+#else
+ return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+ struct slqb_page *page, const void *object)
+{
+ void *base;
+
+ base = slqb_page_address(page);
+ if (object < base || object >= base + s->objects * s->size ||
+ (object - base) % s->size) {
+ return 0;
+ }
+
+ return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+ return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+ *(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+ for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+ __p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+ for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+ __p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+ int i, offset;
+ int newline = 1;
+ char ascii[17];
+
+ ascii[16] = 0;
+
+ for (i = 0; i < length; i++) {
+ if (newline) {
+ printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+ newline = 0;
+ }
+ printk(KERN_CONT " %02x", addr[i]);
+ offset = i % 16;
+ ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+ if (offset == 15) {
+ printk(KERN_CONT " %s\n", ascii);
+ newline = 1;
+ }
+ }
+ if (!newline) {
+ i %= 16;
+ while (i < 16) {
+ printk(KERN_CONT " ");
+ ascii[i] = ' ';
+ i++;
+ }
+ printk(KERN_CONT " %s\n", ascii);
+ }
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+ enum track_item alloc)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+ enum track_item alloc, void *addr)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ p += alloc;
+ if (addr) {
+ p->addr = addr;
+ p->cpu = raw_smp_processor_id();
+ p->pid = current ? current->pid : -1;
+ p->when = jiffies;
+ } else
+ memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ set_track(s, object, TRACK_FREE, NULL);
+ set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+ if (!t->addr)
+ return;
+
+ printk(KERN_ERR "INFO: %s in ", s);
+ __print_symbol("%s", (unsigned long)t->addr);
+ printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+ print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+ printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+ page, page->inuse, page->freelist, page->flags);
+
+}
+
+#define MAX_ERR_STR 100
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[MAX_ERR_STR];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "========================================"
+ "=====================================\n");
+ printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+ printk(KERN_ERR "----------------------------------------"
+ "-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned int off; /* Offset of last byte */
+ u8 *addr = slqb_page_address(page);
+
+ print_tracking(s, p);
+
+ print_page_info(page);
+
+ printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+ p, p - addr, get_freepointer(s, p));
+
+ if (p > addr + 16)
+ print_section("Bytes b4", p - 16, 16);
+
+ print_section("Object", p, min(s->objsize, 128));
+
+ if (s->flags & SLAB_RED_ZONE)
+ print_section("Redzone", p + s->objsize, s->inuse - s->objsize);
+
+ if (s->offset)
+ off = s->offset + sizeof(void *);
+ else
+ off = s->inuse;
+
+ if (s->flags & SLAB_STORE_USER)
+ off += 2 * sizeof(struct track);
+
+ if (off != s->size) {
+ /* Beginning of the filler is the free pointer */
+ print_section("Padding", p + off, s->size - off);
+ }
+
+ dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *reason)
+{
+ slab_bug(s, reason);
+ print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page,
+ char *fmt, ...)
+{
+ slab_bug(s, fmt);
+ print_page_info(page);
+ dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+ u8 *p = object;
+
+ if (s->flags & __OBJECT_POISON) {
+ memset(p, POISON_FREE, s->objsize - 1);
+ p[s->objsize - 1] = POISON_END;
+ }
+
+ if (s->flags & SLAB_RED_ZONE) {
+ memset(p + s->objsize,
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+ s->inuse - s->objsize);
+ }
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+ while (bytes) {
+ if (*start != (u8)value)
+ return start;
+ start++;
+ bytes--;
+ }
+ return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+ void *from, void *to)
+{
+ slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+ memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *what,
+ u8 *start, unsigned int value, unsigned int bytes)
+{
+ u8 *fault;
+ u8 *end;
+
+ fault = check_bytes(start, value, bytes);
+ if (!fault)
+ return 1;
+
+ end = start + bytes;
+ while (end > fault && end[-1] == value)
+ end--;
+
+ slab_bug(s, "%s overwritten", what);
+ printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+ fault, end - 1, fault[0], value);
+ print_trailer(s, page, object);
+
+ restore_bytes(s, what, value, fault, end);
+ return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * Bytes of the object to be managed.
+ * If the freepointer may overlay the object then the free
+ * pointer is the first word of the object.
+ *
+ * Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * Padding to reach word boundary. This is also used for Redzoning.
+ * Padding is extended by another word if Redzoning is enabled and
+ * objsize == inuse.
+ *
+ * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * Meta data starts here.
+ *
+ * A. Free pointer (if we cannot overwrite object on free)
+ * B. Tracking data for SLAB_STORE_USER
+ * C. Padding to reach required alignment boundary or at minimum
+ * one word if debugging is on to be able to detect writes
+ * before the word boundary.
+ *
+ * Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned long off = s->inuse; /* The end of info */
+
+ if (s->offset) {
+ /* Freepointer is placed after the object. */
+ off += sizeof(void *);
+ }
+
+ if (s->flags & SLAB_STORE_USER) {
+ /* We also have user information there */
+ off += 2 * sizeof(struct track);
+ }
+
+ if (s->size == off)
+ return 1;
+
+ return check_bytes_and_report(s, page, p, "Object padding",
+ p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+ u8 *start;
+ u8 *fault;
+ u8 *end;
+ int length;
+ int remainder;
+
+ if (!(s->flags & SLAB_POISON))
+ return 1;
+
+ start = slqb_page_address(page);
+ end = start + (PAGE_SIZE << s->order);
+ length = s->objects * s->size;
+ remainder = end - (start + length);
+ if (!remainder)
+ return 1;
+
+ fault = check_bytes(start + length, POISON_INUSE, remainder);
+ if (!fault)
+ return 1;
+
+ while (end > fault && end[-1] == POISON_INUSE)
+ end--;
+
+ slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+ print_section("Padding", start, length);
+
+ restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+ return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active)
+{
+ u8 *p = object;
+ u8 *endobject = object + s->objsize;
+
+ if (s->flags & SLAB_RED_ZONE) {
+ unsigned int red =
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+ if (!check_bytes_and_report(s, page, object, "Redzone",
+ endobject, red, s->inuse - s->objsize))
+ return 0;
+ } else {
+ if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+ check_bytes_and_report(s, page, p, "Alignment padding",
+ endobject, POISON_INUSE, s->inuse - s->objsize);
+ }
+ }
+
+ if (s->flags & SLAB_POISON) {
+ if (!active && (s->flags & __OBJECT_POISON)) {
+ if (!check_bytes_and_report(s, page, p, "Poison", p,
+ POISON_FREE, s->objsize - 1))
+ return 0;
+
+ if (!check_bytes_and_report(s, page, p, "Poison",
+ p + s->objsize - 1, POISON_END, 1))
+ return 0;
+ }
+
+ /*
+ * check_pad_bytes cleans up on its own.
+ */
+ check_pad_bytes(s, page, p);
+ }
+
+ return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ if (!(page->flags & PG_SLQB_BIT)) {
+ slab_err(s, page, "Not a valid slab page");
+ return 0;
+ }
+ if (page->inuse == 0) {
+ slab_err(s, page, "inuse before free / after alloc", s->name);
+ return 0;
+ }
+ if (page->inuse > s->objects) {
+ slab_err(s, page, "inuse %u > max %u",
+ s->name, page->inuse, s->objects);
+ return 0;
+ }
+ /* Slab_pad_check fixes things up after itself */
+ slab_pad_check(s, page);
+ return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int alloc)
+{
+ if (s->flags & SLAB_TRACE) {
+ printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+ s->name,
+ alloc ? "alloc" : "free",
+ object, page->inuse,
+ page->freelist);
+
+ if (!alloc)
+ print_section("Object", (void *)object, s->objsize);
+
+ dump_stack();
+ }
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+ void *object)
+{
+ if (!slab_debug(s))
+ return;
+
+ if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+ return;
+
+ init_object(s, object, 0);
+ init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto bad;
+
+ if (!check_valid_pointer(s, page, object)) {
+ object_err(s, page, object, "Freelist Pointer check fails");
+ goto bad;
+ }
+
+ if (object && !check_object(s, page, object, 0))
+ goto bad;
+
+ /* Success perform special debug activities for allocs */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_ALLOC, addr);
+ trace(s, page, object, 1);
+ init_object(s, object, 1);
+ return 1;
+
+bad:
+ return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto fail;
+
+ if (!check_valid_pointer(s, page, object)) {
+ slab_err(s, page, "Invalid object pointer 0x%p", object);
+ goto fail;
+ }
+
+ if (!check_object(s, page, object, 1))
+ return 0;
+
+ /* Special debug activities for freeing objects */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_FREE, addr);
+ trace(s, page, object, 0);
+ init_object(s, object, 0);
+ return 1;
+
+fail:
+ slab_fix(s, "Object at 0x%p not freed", object);
+ return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+ slqb_debug = DEBUG_DEFAULT_FLAGS;
+ if (*str++ != '=' || !*str) {
+ /*
+ * No options specified. Switch on full debugging.
+ */
+ goto out;
+ }
+
+ if (*str == ',') {
+ /*
+ * No options but restriction on slabs. This means full
+ * debugging for slabs matching a pattern.
+ */
+ goto check_slabs;
+ }
+
+ slqb_debug = 0;
+ if (*str == '-') {
+ /*
+ * Switch off all debugging measures.
+ */
+ goto out;
+ }
+
+ /*
+ * Determine which debug features should be switched on
+ */
+ for (; *str && *str != ','; str++) {
+ switch (tolower(*str)) {
+ case 'f':
+ slqb_debug |= SLAB_DEBUG_FREE;
+ break;
+ case 'z':
+ slqb_debug |= SLAB_RED_ZONE;
+ break;
+ case 'p':
+ slqb_debug |= SLAB_POISON;
+ break;
+ case 'u':
+ slqb_debug |= SLAB_STORE_USER;
+ break;
+ case 't':
+ slqb_debug |= SLAB_TRACE;
+ break;
+ default:
+ printk(KERN_ERR "slqb_debug option '%c' "
+ "unknown. skipped\n", *str);
+ }
+ }
+
+check_slabs:
+ if (*str == ',')
+ slqb_debug_slabs = str + 1;
+out:
+ return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name,
+ void (*ctor)(void *))
+{
+ /*
+ * Enable debugging if selected on the kernel commandline.
+ */
+ if (slqb_debug && (!slqb_debug_slabs ||
+ strncmp(slqb_debug_slabs, name,
+ strlen(slqb_debug_slabs)) == 0))
+ flags |= slqb_debug;
+
+ return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
+{
+ return 0;
+}
+
+static inline int free_debug_processing(struct kmem_cache *s,
+ void *object, void *addr)
+{
+ return 0;
+}
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+ return 1;
+}
+
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active)
+{
+ return 1;
+}
+
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page)
+{
+}
+
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name, void (*ctor)(void *))
+{
+ return flags;
+}
+
+static const int slqb_debug = 0;
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s,
+ gfp_t flags, int node)
+{
+ struct slqb_page *page;
+ int pages = 1 << s->order;
+
+ flags |= s->allocflags;
+
+ page = alloc_slqb_pages_node(node, flags, s->order);
+ if (!page)
+ return NULL;
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ pages);
+
+ return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ setup_object_debug(s, page, object);
+ if (unlikely(s->ctor))
+ s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s,
+ gfp_t flags, int node, unsigned int colour)
+{
+ struct slqb_page *page;
+ void *start;
+ void *last;
+ void *p;
+
+ BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+ page = allocate_slab(s,
+ flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+ if (!page)
+ goto out;
+
+ page->flags |= PG_SLQB_BIT;
+
+ start = page_address(&page->page);
+
+ if (unlikely(slab_poison(s)))
+ memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+ start += colour;
+
+ last = start;
+ for_each_object(p, s, start) {
+ setup_object(s, page, p);
+ set_freepointer(s, last, p);
+ last = p;
+ }
+ set_freepointer(s, last, NULL);
+
+ page->freelist = start;
+ page->inuse = 0;
+out:
+ return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ int pages = 1 << s->order;
+
+ if (unlikely(slab_debug(s))) {
+ void *p;
+
+ slab_pad_check(s, page);
+ for_each_free_object(p, s, page->freelist)
+ check_object(s, page, p, 0);
+ }
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ -pages);
+
+ __free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+ struct slqb_page *page;
+
+ page = container_of((struct list_head *)h, struct slqb_page, lru);
+ __free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ VM_BUG_ON(page->inuse);
+ if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+ call_rcu(&page->rcu_head, rcu_free_slab);
+ else
+ __free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s,
+ struct kmem_cache_list *l, struct slqb_page *page,
+ void *object)
+{
+ VM_BUG_ON(page->list != l);
+
+ set_freepointer(s, object, page->freelist);
+ page->freelist = object;
+ page->inuse--;
+
+ if (!page->inuse) {
+ if (likely(s->objects > 1)) {
+ l->nr_partial--;
+ list_del(&page->lru);
+ }
+ l->nr_slabs--;
+ free_slab(s, page);
+ slqb_stat_inc(l, FLUSH_SLAB_FREE);
+ return 1;
+
+ } else if (page->inuse + 1 == s->objects) {
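+ /* Page was full until this free: put it back on the partial list. */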
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+ return 0;
+ }
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+static void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page,
+ void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their pages
+ * in case the pages also belong to the list, or to our CPU's remote-free list
+ * in the case they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ struct kmem_cache_cpu *c;
+ void **head;
+ int nr;
+
+ nr = l->freelist.nr;
+ if (unlikely(!nr))
+ return;
+
+ nr = min(slab_freebatch(s), nr);
+
+ slqb_stat_inc(l, FLUSH_FREE_LIST);
+ slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+ c = get_cpu_slab(s, smp_processor_id());
+
+ l->freelist.nr -= nr;
+ head = l->freelist.head;
+
+ do {
+ struct slqb_page *page;
+ void **object;
+
+ object = head;
+ VM_BUG_ON(!object);
+ head = get_freepointer(s, object);
+ page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+ if (page->list != l) {
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+ } else
+#endif
+ free_object_to_page(s, l, page, object);
+
+ nr--;
+ } while (nr);
+
+ l->freelist.head = head;
+ if (!l->freelist.nr)
+ l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ while (l->freelist.nr)
+ flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set. In which case, we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s,
+ struct kmem_cache_list *l)
+{
+ void **head, **tail;
+ int nr;
+
+ VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+ if (!l->remote_free.list.nr)
+ return;
+
+ spin_lock(&l->remote_free.lock);
+
+ l->remote_free_check = 0;
+ head = l->remote_free.list.head;
+ l->remote_free.list.head = NULL;
+ tail = l->remote_free.list.tail;
+ l->remote_free.list.tail = NULL;
+ nr = l->remote_free.list.nr;
+ l->remote_free.list.nr = 0;
+
+ spin_unlock(&l->remote_free.lock);
+
+ VM_BUG_ON(!nr);
+
+ if (!l->freelist.nr) {
+ /* Get head hot for likely subsequent allocation or flush */
+ prefetchw(head);
+ l->freelist.head = head;
+ } else
+ set_freepointer(s, l->freelist.tail, head);
+ l->freelist.tail = tail;
+
+ l->freelist.nr += nr;
+
+ slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+ slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s,
+ struct kmem_cache_list *l)
+{
+ void *object;
+
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+ if (unlikely(l->remote_free_check)) {
+ claim_remote_free_list(s, l);
+
+ if (l->freelist.nr > slab_hiwater(s))
+ flush_free_list(s, l);
+
+ /* repetition here helps gcc :( */
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+ }
+#endif
+
+ return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s,
+ struct kmem_cache_list *l)
+{
+ struct slqb_page *page;
+ void *object;
+
+ if (unlikely(!l->nr_partial))
+ return NULL;
+
+ page = list_first_entry(&l->partial, struct slqb_page, lru);
+ VM_BUG_ON(page->inuse == s->objects);
+ if (page->inuse + 1 == s->objects) {
+ l->nr_partial--;
+ list_del(&page->lru);
+ }
+
+ VM_BUG_ON(!page->freelist);
+
+ page->inuse++;
+
+ object = page->freelist;
+ page->freelist = get_freepointer(s, object);
+ if (page->freelist)
+ prefetchw(page->freelist);
+ VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+ slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+ return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__slab_alloc_page(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ struct slqb_page *page;
+ struct kmem_cache_list *l;
+ struct kmem_cache_cpu *c;
+ unsigned int colour;
+ void *object;
+
+ c = get_cpu_slab(s, smp_processor_id());
+ colour = c->colour_next;
+ c->colour_next += s->colour_off;
+ if (c->colour_next >= s->colour_range)
+ c->colour_next = 0;
+
+ /* XXX: load any partial? */
+
+ /* Caller handles __GFP_ZERO */
+ gfpflags &= ~__GFP_ZERO;
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+ page = new_slab_page(s, gfpflags, node, colour);
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+ if (unlikely(!page))
+ return page;
+
+ if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+ struct kmem_cache_cpu *c;
+ int cpu = smp_processor_id();
+
+ c = get_cpu_slab(s, cpu);
+ l = &c->list;
+ page->list = l;
+
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+ object = __cache_list_get_page(s, l);
+ } else {
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *n;
+
+ n = s->node[slqb_page_to_nid(page)];
+ l = &n->list;
+ page->list = l;
+
+ spin_lock(&n->list_lock);
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+ object = __cache_list_get_page(s, l);
+ spin_unlock(&n->list_lock);
+#endif
+ }
+ VM_BUG_ON(!object);
+ return object;
+}
+
+#ifdef CONFIG_NUMA
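+/*
+ * Choose an alternate starting node for the allocation when cpuset memory
+ * spreading or a task mempolicy (eg. interleaving) is in effect.
+ */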
+static noinline int alternate_nid(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+ return node;
+ if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+ return cpuset_mem_spread_node();
+ else if (current->mempolicy)
+ return slab_node(current->mempolicy);
+ return node;
+}
+
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static void *__remote_slab_alloc_node(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ struct kmem_cache_node *n;
+ struct kmem_cache_list *l;
+ void *object;
+
+ n = s->node[node];
+ if (unlikely(!n)) /* node has no memory */
+ return NULL;
+ l = &n->list;
+
+ spin_lock(&n->list_lock);
+
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object)) {
+ object = __cache_list_get_page(s, l);
+ if (unlikely(!object)) {
+ spin_unlock(&n->list_lock);
+ return __slab_alloc_page(s, gfpflags, node);
+ }
+ }
+ if (likely(object))
+ slqb_stat_inc(l, ALLOC);
+ spin_unlock(&n->list_lock);
+ return object;
+}
+
+static noinline void *__remote_slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ void *object;
+ struct zonelist *zonelist;
+ struct zoneref *z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(gfpflags);
+
+ object = __remote_slab_alloc_node(s, gfpflags, node);
+ if (likely(object || (gfpflags & __GFP_THISNODE)))
+ return object;
+
+ zonelist = node_zonelist(slab_node(current->mempolicy), gfpflags);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ if (!cpuset_zone_allowed_hardwall(zone, gfpflags))
+ continue;
+
+ node = zone_to_nid(zone);
+ object = __remote_slab_alloc_node(s, gfpflags, node);
+ if (likely(object))
+ return object;
+ }
+ return NULL;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ void *object;
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+
+#ifdef CONFIG_NUMA
+ if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+try_remote:
+ return __remote_slab_alloc(s, gfpflags, node);
+ }
+#endif
+
+ c = get_cpu_slab(s, smp_processor_id());
+ VM_BUG_ON(!c);
+ l = &c->list;
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object)) {
+ object = __cache_list_get_page(s, l);
+ if (unlikely(!object)) {
+ object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
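+ /* Couldn't get a new page here: as a last resort, try other nodes. */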
+ if (unlikely(!object))
+ goto try_remote;
+#endif
+ return object;
+ }
+ }
+ if (likely(object))
+ slqb_stat_inc(l, ALLOC);
+ return object;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node, void *addr)
+{
+ void *object;
+ unsigned long flags;
+
+again:
+ local_irq_save(flags);
+ object = __slab_alloc(s, gfpflags, node);
+ local_irq_restore(flags);
+
+ if (unlikely(slab_debug(s)) && likely(object)) {
+ if (unlikely(!alloc_debug_processing(s, object, addr)))
+ goto again;
+ }
+
+ if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+ memset(object, 0, s->objsize);
+
+ return object;
+}
+
+static __always_inline void *__kmem_cache_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, void *caller)
+{
+ int node = -1;
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, gfpflags, node);
+#endif
+ return slab_alloc(s, gfpflags, node, caller);
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+ return __kmem_cache_alloc(s, gfpflags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check flag if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ struct kmlist *src;
+ struct kmem_cache_list *dst;
+ unsigned int nr;
+ int set;
+
+ src = &c->rlist;
+ nr = src->nr;
+ if (unlikely(!nr))
+ return;
+
+#ifdef CONFIG_SLQB_STATS
+ {
+ struct kmem_cache_list *l = &c->list;
+
+ slqb_stat_inc(l, FLUSH_RFREE_LIST);
+ slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+ }
+#endif
+
+ dst = c->remote_cache_list;
+
+ spin_lock(&dst->remote_free.lock);
+
+ if (!dst->remote_free.list.head)
+ dst->remote_free.list.head = src->head;
+ else
+ set_freepointer(s, dst->remote_free.list.tail, src->head);
+ dst->remote_free.list.tail = src->tail;
+
+ src->head = NULL;
+ src->tail = NULL;
+ src->nr = 0;
+
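+ /*
+ * Only set remote_free_check when this flush takes the remote list
+ * across the freebatch threshold, so that repeated flushes do not
+ * keep dirtying the target's flag cacheline.
+ */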
+ if (dst->remote_free.list.nr < slab_freebatch(s))
+ set = 1;
+ else
+ set = 0;
+
+ dst->remote_free.list.nr += nr;
+
+ if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+ dst->remote_free_check = 1;
+
+ spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s,
+ struct slqb_page *page, void *object,
+ struct kmem_cache_cpu *c)
+{
+ struct kmlist *r;
+
+ /*
+ * Our remote free list corresponds to a different list. Must
+ * flush it and switch.
+ */
+ if (page->list != c->remote_cache_list) {
+ flush_remote_free_cache(s, c);
+ c->remote_cache_list = page->list;
+ }
+
+ r = &c->rlist;
+ if (!r->head)
+ r->head = object;
+ else
+ set_freepointer(s, r->tail, object);
+ set_freepointer(s, object, NULL);
+ r->tail = object;
+ r->nr++;
+
+ if (unlikely(r->nr > slab_freebatch(s)))
+ flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+ int thiscpu = smp_processor_id();
+
+ c = get_cpu_slab(s, thiscpu);
+ l = &c->list;
+
+ slqb_stat_inc(l, FREE);
+
+ if (!NUMA_BUILD || !numa_platform ||
+ likely(slqb_page_to_nid(page) == numa_node_id())) {
+ /*
+ * Freeing fastpath. Collects all local-node objects, not
+ * just those allocated from our per-CPU list. This allows
+ * fast transfer of objects from one CPU to another within
+ * a given node.
+ */
+ set_freepointer(s, object, l->freelist.head);
+ l->freelist.head = object;
+ if (!l->freelist.nr)
+ l->freelist.tail = object;
+ l->freelist.nr++;
+
+ if (unlikely(l->freelist.nr > slab_hiwater(s)))
+ flush_free_list(s, l);
+
+ } else {
+#ifdef CONFIG_NUMA
+ /*
+ * Freeing an object that was allocated on a remote node.
+ */
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FREE_REMOTE);
+#endif
+ }
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ unsigned long flags;
+
+ prefetchw(object);
+
+ debug_check_no_locks_freed(object, s->objsize);
+ if (likely(object) && unlikely(slab_debug(s))) {
+ if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+ return;
+ }
+
+ local_irq_save(flags);
+ __slab_free(s, page, object);
+ local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+ struct slqb_page *page = NULL;
+
+ if (numa_platform)
+ page = virt_to_head_slqb_page(object);
+ slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order-0 allocations are preferred since they do not cause fragmentation in
+ * the page allocator and they have fastpaths there. Higher orders are only
+ * used where needed to keep wastage acceptable for large objects.
+ */
+static int slab_order(int size, int max_order, int frac)
+{
+ int order;
+
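+ /*
+ * Start at the smallest order that can hold at least one object and
+ * increase it until internal waste becomes acceptable.
+ */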
+ if (fls(size - 1) <= PAGE_SHIFT)
+ order = 0;
+ else
+ order = fls(size - 1) - PAGE_SHIFT;
+
+ while (order <= max_order) {
+ unsigned long slab_size = PAGE_SIZE << order;
+ unsigned long objects;
+ unsigned long waste;
+
+ objects = slab_size / size;
+ if (!objects)
+ continue;
+
+ waste = slab_size - (objects * size);
+
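+ /*
+ * Accept this order once wasted space is no more than 1/frac of the
+ * slab size (frac == 0 accepts any amount of waste).
+ */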
+ if (waste * frac <= slab_size)
+ break;
+
+ order++;
+ }
+
+ return order;
+}
+
+static int calculate_order(int size)
+{
+ int order;
+
+ /*
+ * Attempt to find best configuration for a slab. This
+ * works by first attempting to generate a layout with
+ * the best configuration and backing off gradually.
+ */
+ order = slab_order(size, 1, 4);
+ if (order <= 1)
+ return order;
+
+ /*
+ * This size cannot fit in order-1. Allow bigger orders, but
+ * forget about trying to save space.
+ */
+ order = slab_order(size, MAX_ORDER, 0);
+ if (order < MAX_ORDER)
+ return order;
+
+ return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+ unsigned long align, unsigned long size)
+{
+ /*
+ * If the user wants hardware cache aligned objects then follow that
+ * suggestion if the object is sufficiently large.
+ *
+ * The hardware cache alignment cannot override the specified
+ * alignment though. If the specified alignment is greater, use that.
+ */
+ if (flags & SLAB_HWCACHE_ALIGN) {
+ unsigned long ralign = cache_line_size();
+
+ while (size <= ralign / 2)
+ ralign /= 2;
+ align = max(align, ralign);
+ }
+
+ if (align < ARCH_SLAB_MINALIGN)
+ align = ARCH_SLAB_MINALIGN;
+
+ return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s,
+ struct kmem_cache_list *l)
+{
+ l->cache = s;
+ l->freelist.nr = 0;
+ l->freelist.head = NULL;
+ l->freelist.tail = NULL;
+ l->nr_partial = 0;
+ l->nr_slabs = 0;
+ INIT_LIST_HEAD(&l->partial);
+
+#ifdef CONFIG_SMP
+ l->remote_free_check = 0;
+ spin_lock_init(&l->remote_free.lock);
+ l->remote_free.list.nr = 0;
+ l->remote_free.list.head = NULL;
+ l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ init_kmem_cache_list(s, &c->list);
+
+ c->colour_next = 0;
+#ifdef CONFIG_SMP
+ c->rlist.nr = 0;
+ c->rlist.head = NULL;
+ c->rlist.tail = NULL;
+ c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s,
+ struct kmem_cache_node *n)
+{
+ spin_lock_init(&n->list_lock);
+ init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/*
+ * Statically allocated bootstrap structures: the internal caches below need
+ * per-CPU and per-node data before the allocator is able to allocate it
+ * dynamically, so it is provided here.
+ */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
+ int cpu)
+{
+ struct kmem_cache_cpu *c;
+
+ c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+ if (!c)
+ return NULL;
+
+ init_kmem_cache_cpu(s, c);
+ return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c) {
+ kmem_cache_free(&kmem_cpu_cache, c);
+ s->cpu_slab[cpu] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c)
+ continue;
+
+ c = alloc_kmem_cache_cpu(s, cpu);
+ if (!c) {
+ free_kmem_cache_cpus(s);
+ return 0;
+ }
+ s->cpu_slab[cpu] = c;
+ }
+ return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ init_kmem_cache_cpu(s, &s->cpu_slab);
+ return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = s->node[node];
+ if (n) {
+ kmem_cache_free(&kmem_node_cache, n);
+ s->node[node] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+ if (!n) {
+ free_kmem_cache_nodes(s);
+ return 0;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[node] = n;
+ }
+ return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+ unsigned long flags = s->flags;
+ unsigned long size = s->objsize;
+ unsigned long align = s->align;
+
+ /*
+ * Determine if we can poison the object itself. If the user of
+ * the slab may touch the object after free or before allocation
+ * then we should never poison the object itself.
+ */
+ if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+ s->flags |= __OBJECT_POISON;
+ else
+ s->flags &= ~__OBJECT_POISON;
+
+ /*
+ * Round up object size to the next word boundary. We can only
+ * place the free pointer at word boundaries and this determines
+ * the possible location of the free pointer.
+ */
+ size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+ /*
+ * If we are Redzoning then check if there is some space between the
+ * end of the object and the free pointer. If not then add an
+ * additional word to have some bytes to store Redzone information.
+ */
+ if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+ size += sizeof(void *);
+#endif
+
+ /*
+ * With that we have determined the number of bytes in actual use
+ * by the object. This is the potential offset to the free pointer.
+ */
+ s->inuse = size;
+
+ if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+ /*
+ * Relocate free pointer after the object if it is not
+ * permitted to overwrite the first word of the object on
+ * kmem_cache_free.
+ *
+ * This is the case if we do RCU, have a constructor, or are
+ * poisoning the objects.
+ */
+ s->offset = size;
+ size += sizeof(void *);
+ }
+
+#ifdef CONFIG_SLQB_DEBUG
+ if (flags & SLAB_STORE_USER) {
+ /*
+ * Need to store information about allocs and frees after
+ * the object.
+ */
+ size += 2 * sizeof(struct track);
+ }
+
+ if (flags & SLAB_RED_ZONE) {
+ /*
+ * Add some empty padding so that we can catch
+ * overwrites from earlier objects rather than let
+ * tracking information or the free pointer be
+ * corrupted if a user writes before the start
+ * of the object.
+ */
+ size += sizeof(void *);
+ }
+#endif
+
+ /*
+ * Determine the alignment based on various parameters that the
+ * user specified and the dynamic determination of cache line size
+ * on bootup.
+ */
+ align = calculate_alignment(flags, align, s->objsize);
+
+ /*
+ * SLQB stores one object immediately after another beginning from
+ * offset 0. In order to align the objects we have to simply size
+ * each object to conform to the alignment.
+ */
+ size = ALIGN(size, align);
+ s->size = size;
+ s->order = calculate_order(size);
+
+ if (s->order < 0)
+ return 0;
+
+ s->allocflags = 0;
+ if (s->order)
+ s->allocflags |= __GFP_COMP;
+
+ if (s->flags & SLAB_CACHE_DMA)
+ s->allocflags |= SLQB_DMA;
+
+ if (s->flags & SLAB_RECLAIM_ACCOUNT)
+ s->allocflags |= __GFP_RECLAIMABLE;
+
+ /*
+ * Determine the number of objects per slab
+ */
+ s->objects = (PAGE_SIZE << s->order) / size;
+
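+ /*
+ * Free-batching heuristics: freebatch scales with object size (at
+ * least four pages worth of objects, and at least min(256, 64 pages
+ * worth)), and the per-CPU freelist may grow to four batches before
+ * it is flushed back to the slab lists.
+ */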
+ s->freebatch = max(4UL*PAGE_SIZE / size,
+ min(256UL, 64*PAGE_SIZE / size));
+ if (!s->freebatch)
+ s->freebatch = 1;
+ s->hiwater = s->freebatch << 2;
+
+ return !!s->objects;
+}
+
+static int kmem_cache_open(struct kmem_cache *s,
+ const char *name, size_t size, size_t align,
+ unsigned long flags, void (*ctor)(void *), int alloc)
+{
+ unsigned int left_over;
+
+ memset(s, 0, kmem_size);
+ s->name = name;
+ s->ctor = ctor;
+ s->objsize = size;
+ s->align = align;
+ s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+ if (!calculate_sizes(s))
+ goto error;
+
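+ /*
+ * When not debugging, use the space left over in each slab page for
+ * cache colouring: slabs are offset in colour_off steps within
+ * colour_range so objects do not all line up on the same cache sets.
+ */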
+ if (!slab_debug(s)) {
+ left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+ s->colour_off = max(cache_line_size(), s->align);
+ s->colour_range = left_over;
+ } else {
+ s->colour_off = 0;
+ s->colour_range = 0;
+ }
+
+ if (likely(alloc)) {
+ if (!alloc_kmem_cache_nodes(s))
+ goto error;
+
+ if (!alloc_kmem_cache_cpus(s))
+ goto error_nodes;
+ }
+
+ down_write(&slqb_lock);
+ sysfs_slab_add(s);
+ list_add(&s->list, &slab_caches);
+ up_write(&slqb_lock);
+
+ return 1;
+
+error_nodes:
+ free_kmem_cache_nodes(s);
+error:
+ if (flags & SLAB_PANIC)
+ panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+ return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+ struct slqb_page *page = virt_to_head_slqb_page(object);
+
+ if (!(page->flags & PG_SLQB_BIT))
+ return 0;
+
+ /*
+ * We could also check if the object is on the slabs freelist.
+ * But this would be too expensive and it seems that the main
+ * purpose of kmem_ptr_validate is to check if the object belongs
+ * to a certain slab.
+ */
+ return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+ return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+ return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+ int cpu;
+
+ down_write(&slqb_lock);
+ list_del(&s->list);
+ up_write(&slqb_lock);
+
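+ /*
+ * On SMP, first flush each CPU's local freelist and push its remotely
+ * freed objects back to their home lists; the loop below then claims
+ * those remotely freed objects and flushes them, so every list ends up
+ * empty.
+ */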
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+#endif
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_nodes(s);
+#endif
+
+ sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ * Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+ const char *name, int size, gfp_t gfp_flags)
+{
+ unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+ if (gfp_flags & SLQB_DMA)
+ flags |= SLAB_CACHE_DMA;
+
+ kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+ return s;
+}
+
+/*
+ * Conversion table for small slab sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have non power
+ * of two cache sizes there. The size of larger slabs can be determined using
+ * fls.
+ */
+static s8 size_index[24] __cacheline_aligned = {
+ 3, /* 8 */
+ 4, /* 16 */
+ 5, /* 24 */
+ 5, /* 32 */
+ 6, /* 40 */
+ 6, /* 48 */
+ 6, /* 56 */
+ 6, /* 64 */
+#if L1_CACHE_BYTES < 64
+ 1, /* 72 */
+ 1, /* 80 */
+ 1, /* 88 */
+ 1, /* 96 */
+#else
+ 7,
+ 7,
+ 7,
+ 7,
+#endif
+ 7, /* 104 */
+ 7, /* 112 */
+ 7, /* 120 */
+ 7, /* 128 */
+#if L1_CACHE_BYTES < 128
+ 2, /* 136 */
+ 2, /* 144 */
+ 2, /* 152 */
+ 2, /* 160 */
+ 2, /* 168 */
+ 2, /* 176 */
+ 2, /* 184 */
+ 2 /* 192 */
+#else
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+ int index;
+
+#if L1_CACHE_BYTES >= 128
+ if (size <= 128) {
+#else
+ if (size <= 192) {
+#endif
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ index = size_index[(size - 1) / 8];
+ } else
+ index = fls(size - 1);
+
+ if (unlikely((flags & SLQB_DMA)))
+ return &kmalloc_caches_dma[index];
+ else
+ return &kmalloc_caches[index];
+}
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return __kmem_cache_alloc(s, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+ struct slqb_page *page;
+ struct kmem_cache *s;
+
+ BUG_ON(!object);
+ if (unlikely(object == ZERO_SIZE_PTR))
+ return 0;
+
+ page = virt_to_head_slqb_page(object);
+ BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+ s = page->list->cache;
+
+ /*
+ * Debugging requires use of the padding between object
+ * and whatever may come after it.
+ */
+ if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+ return s->objsize;
+
+ /*
+ * If we have the need to store the freelist pointer
+ * back there or track user information then we can
+ * only use the space before that information.
+ */
+ if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ return s->inuse;
+
+ /*
+ * Else we can use all the padding etc for the allocation
+ */
+ return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+ struct kmem_cache *s;
+ struct slqb_page *page;
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ page = virt_to_head_slqb_page(object);
+ s = page->list->cache;
+
+ slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = arg;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list(s, l);
+#ifdef CONFIG_SMP
+ flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+#endif
+
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s;
+ long phase = (long)arg;
+
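+ /*
+ * Two-phase reap, driven by kmem_cache_reap(): phase 0 flushes local
+ * freelists and pushes remotely freed objects to their home lists,
+ * phase 1 claims those remotely freed objects and flushes them too.
+ */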
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ if (phase == 0) {
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+
+ if (phase == 1) {
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ }
+ }
+}
+
+static void kmem_cache_reap(void)
+{
+ struct kmem_cache *s;
+ int node;
+
+ down_read(&slqb_lock);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+ list_for_each_entry(s, &slab_caches, list) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+ }
+ up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+ struct delayed_work *work =
+ container_of(w, struct delayed_work, work);
+ struct kmem_cache *s;
+ int node;
+
+ if (!down_read_trylock(&slqb_lock))
+ goto out;
+
+ node = numa_node_id();
+ list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+#endif
+
+ local_irq_disable();
+ kmem_cache_trim_percpu(s);
+ local_irq_enable();
+ }
+
+ up_read(&slqb_lock);
+out:
+ schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+ struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+ /*
+ * When this gets called from do_initcalls via cpucache_init(),
+ * init_workqueues() has already run, so keventd will be setup
+ * at that time.
+ */
+ if (keventd_up() && cache_trim_work->work.func == NULL) {
+ INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+ schedule_delayed_work_on(cpu, cache_trim_work,
+ __round_jiffies_relative(HZ, cpu));
+ }
+}
+
+static int __init cpucache_init(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ start_cpu_timer(cpu);
+
+ return 0;
+}
+device_initcall(cpucache_init);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+ kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+ /* XXX: should release structures, see CPU offline comment */
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct kmem_cache_node *n;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+ int ret = 0;
+
+ /*
+ * If the node's memory is already available, then kmem_cache_node is
+ * already created. Nothing to do.
+ */
+ if (nid < 0)
+ return 0;
+
+ /*
+ * We are bringing a node online. No memory is available yet. We must
+ * allocate a kmem_cache_node structure in order to bring the node
+ * online.
+ */
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ /*
+ * XXX: this allocation will fall back to other nodes since
+ * memory is not yet available from the node that is being
+ * brought up.
+ */
+ if (s->node[nid]) /* could be leftover from last online */
+ continue;
+ n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+ if (!n) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[nid] = n;
+ }
+out:
+ up_read(&slqb_lock);
+ return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = slab_mem_going_online_callback(arg);
+ break;
+ case MEM_GOING_OFFLINE:
+ slab_mem_going_offline_callback(arg);
+ break;
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ slab_mem_offline_callback(arg);
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ * Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+ int i;
+ unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+ /*
+ * All the ifdefs are rather ugly here, but it's just the setup code,
+ * so it doesn't have to be too readable :)
+ */
+#ifdef CONFIG_NUMA
+ if (num_possible_nodes() == 1)
+ numa_platform = 0;
+ else
+ numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+ kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+ nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+ kmem_size = sizeof(struct kmem_cache);
+#endif
+
+ kmem_cache_open(&kmem_cache_cache, "kmem_cache",
+ kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+ kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu",
+ sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_cache_open(&kmem_node_cache, "kmem_cache_node",
+ sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+ for_each_possible_cpu(i) {
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+ kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+ init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+ kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+ init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+ kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+ }
+#else
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(i, N_NORMAL_MEMORY) {
+ init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+ kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+ init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+ kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+ init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+ kmem_node_cache.node[i] = &kmem_node_nodes[i];
+ }
+#endif
+
+ /* Caches that are not power-of-two sized */
+ if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+ open_kmalloc_cache(&kmalloc_caches[1],
+ "kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[1],
+ "kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+ if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+ open_kmalloc_cache(&kmalloc_caches[2],
+ "kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[2],
+ "kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ open_kmalloc_cache(&kmalloc_caches[i],
+ "kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[i],
+ "kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+ /*
+ * Patch up the size_index table if we have strange large alignment
+ * requirements for the kmalloc array. This is only the case for
+ * MIPS, it seems. The standard arches will not generate any code here.
+ *
+ * Largest permitted alignment is 256 bytes due to the way we
+ * handle the index determination for the smaller caches.
+ *
+ * Make sure that nothing crazy happens if someone starts tinkering
+ * around with ARCH_KMALLOC_MINALIGN
+ */
+ BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+ (KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+ for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+ size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+ /* Provide the correct kmalloc names now that the caches are up */
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ kmalloc_caches[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+ kmalloc_caches_dma[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+ }
+
+#ifdef CONFIG_SMP
+ register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+ hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+ /*
+ * smp_init() has not yet been called, so no worries about memory
+ * ordering here (eg. slab_is_available vs numa_platform)
+ */
+ __slab_is_available = 1;
+}
+
+/*
+ * Some basic slab creation sanity checks
+ */
+static int kmem_cache_create_ok(const char *name, size_t size,
+ size_t align, unsigned long flags)
+{
+ struct kmem_cache *tmp;
+
+ /*
+ * Sanity checks... these are all serious usage bugs.
+ */
+ if (!name || in_interrupt() || (size < sizeof(void *))) {
+ printk(KERN_ERR "kmem_cache_create(): early error in slab %s\n",
+ name);
+ dump_stack();
+
+ return 0;
+ }
+
+ down_read(&slqb_lock);
+
+ list_for_each_entry(tmp, &slab_caches, list) {
+ char x;
+ int res;
+
+ /*
+ * This happens when the module gets unloaded and doesn't
+ * destroy its slab cache and no-one else reuses the vmalloc
+ * area of the module. Print a warning.
+ */
+ res = probe_kernel_address(tmp->name, x);
+ if (res) {
+ printk(KERN_ERR
+ "SLAB: cache with size %d has lost its name\n",
+ tmp->size);
+ continue;
+ }
+
+ if (!strcmp(tmp->name, name)) {
+ printk(KERN_ERR
+ "kmem_cache_create(): duplicate cache %s\n", name);
+ dump_stack();
+ up_read(&slqb_lock);
+
+ return 0;
+ }
+ }
+
+ up_read(&slqb_lock);
+
+ WARN_ON(strchr(name, ' ')); /* It confuses parsers */
+ if (flags & SLAB_DESTROY_BY_RCU)
+ WARN_ON(flags & SLAB_POISON);
+
+ return 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
+{
+ struct kmem_cache *s;
+
+ if (!kmem_cache_create_ok(name, size, align, flags))
+ goto err;
+
+ s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+ if (!s)
+ goto err;
+
+ if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+ return s;
+
+ kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+ if (flags & SLAB_PANIC)
+ panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+
+ return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ struct kmem_cache *s;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ if (s->cpu_slab[cpu]) /* could be leftover from last online */
+ continue;
+ s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+ if (!s->cpu_slab[cpu]) {
+ up_read(&slqb_lock);
+ return NOTIFY_BAD;
+ }
+ }
+ up_read(&slqb_lock);
+ break;
+
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ start_cpu_timer(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+ per_cpu(cache_trim_work, cpu).work.func = NULL;
+ break;
+
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ /*
+ * XXX: Freeing here doesn't work because objects can still be
+ * on this CPU's list. A periodic timer needs to check if a CPU
+ * is offline and then try to clean up from there. Same for node
+ * offline.
+ */
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+ .notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+ struct kmem_cache *s;
+ int node = -1;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, flags, node);
+#endif
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+ unsigned long caller)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+ struct kmem_cache *s;
+ spinlock_t lock;
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ struct stats_gather *gather = arg;
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = gather->s;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+ struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ nr_slabs = l->nr_slabs;
+ nr_partial = l->nr_partial;
+ nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ nr_inuse += page->inuse;
+ }
+
+ spin_lock(&gather->lock);
+ gather->nr_slabs += nr_slabs;
+ gather->nr_partial += nr_partial;
+ gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+ gather->stats[i] += l->stats[i];
+#endif
+ spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ memset(stats, 0, sizeof(struct stats_gather));
+ stats->s = s;
+ spin_lock_init(&stats->lock);
+
+ on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_online_node(node) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+ struct slqb_page *page;
+ unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++)
+ stats->stats[i] += l->stats[i];
+#endif
+ stats->nr_slabs += l->nr_slabs;
+ stats->nr_partial += l->nr_partial;
+ stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ stats->nr_inuse += page->inuse;
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+#endif
+
+ stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+ size_t count, loff_t *ppos)
+{
+ return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+ seq_puts(m, "slabinfo - version: 2.1\n");
+ seq_puts(m, "# name <active_objs> <num_objs> <objsize> "
+ "<objperslab> <pagesperslab>");
+ seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+ seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+ seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+ loff_t n = *pos;
+
+ down_read(&slqb_lock);
+ if (!n)
+ print_slabinfo_header(m);
+
+ return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+ return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+ up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+ struct stats_gather stats;
+ struct kmem_cache *s;
+
+ s = list_entry(p, struct kmem_cache, list);
+
+ gather_stats(s, &stats);
+
+ seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+ stats.nr_objects, s->size, s->objects, (1 << s->order));
+ seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s),
+ slab_freebatch(s), 0);
+ seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs,
+ stats.nr_slabs, 0UL);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+ .open = slabinfo_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+ proc_create("slabinfo", S_IWUSR|S_IRUGO, NULL,
+ &proc_slabinfo_operations);
+ return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct kmem_cache *s, char *buf);
+ ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+ static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+ static struct slab_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+ if (s->ctor) {
+ int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+ return n + sprintf(buf + n, "\n");
+ }
+ return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+
+ gather_stats(s, &stats);
+
+ return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+
+ gather_stats(s, &stats);
+
+ return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+
+ gather_stats(s, &stats);
+
+ return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ long hiwater;
+ int err;
+
+ err = strict_strtol(buf, 10, &hiwater);
+ if (err)
+ return err;
+
+ if (hiwater < 0)
+ return -EINVAL;
+
+ s->hiwater = hiwater;
+
+ return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s,
+ const char *buf, size_t length)
+{
+ long freebatch;
+ int err;
+
+ err = strict_strtol(buf, 10, &freebatch);
+ if (err)
+ return err;
+
+ if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+ return -EINVAL;
+
+ s->freebatch = freebatch;
+
+ return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
+
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+ struct stats_gather stats;
+ int len;
+#ifdef CONFIG_SMP
+ int cpu;
+#endif
+
+ gather_stats(s, &stats);
+
+ len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ if (len < PAGE_SIZE - 20)
+ len += sprintf(buf+len, " C%d=%lu", cpu, l->stats[si]);
+ }
+#endif
+ return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) \
+static ssize_t text##_show(struct kmem_cache *s, char *buf) \
+{ \
+ return show_stat(s, buf, si); \
+} \
+SLAB_ATTR_RO(text);
+
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+ &slab_size_attr.attr,
+ &object_size_attr.attr,
+ &objs_per_slab_attr.attr,
+ &order_attr.attr,
+ &objects_attr.attr,
+ &total_objects_attr.attr,
+ &slabs_attr.attr,
+ &ctor_attr.attr,
+ &align_attr.attr,
+ &hwcache_align_attr.attr,
+ &reclaim_account_attr.attr,
+ &destroy_by_rcu_attr.attr,
+ &red_zone_attr.attr,
+ &poison_attr.attr,
+ &store_user_attr.attr,
+ &hiwater_attr.attr,
+ &freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+ &cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+ &alloc_attr.attr,
+ &alloc_slab_fill_attr.attr,
+ &alloc_slab_new_attr.attr,
+ &free_attr.attr,
+ &free_remote_attr.attr,
+ &flush_free_list_attr.attr,
+ &flush_free_list_objects_attr.attr,
+ &flush_free_list_remote_attr.attr,
+ &flush_slab_partial_attr.attr,
+ &flush_slab_free_attr.attr,
+ &flush_rfree_list_attr.attr,
+ &flush_rfree_list_objects_attr.attr,
+ &claim_remote_list_attr.attr,
+ &claim_remote_list_objects_attr.attr,
+#endif
+ NULL
+};
+
+static struct attribute_group slab_attr_group = {
+ .attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+ struct attribute *attr, char *buf)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->show)
+ return -EIO;
+
+ err = attribute->show(s, buf);
+
+ return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+ struct attribute *attr, const char *buf, size_t len)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->store)
+ return -EIO;
+
+ err = attribute->store(s, buf, len);
+
+ return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+ struct kmem_cache *s = to_slab(kobj);
+
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+ .show = slab_attr_show,
+ .store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+ .sysfs_ops = &slab_sysfs_ops,
+ .release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+ struct kobj_type *ktype = get_ktype(kobj);
+
+ if (ktype == &slab_ktype)
+ return 1;
+ return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+ .filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+ int err;
+
+ if (!sysfs_available)
+ return 0;
+
+ s->kobj.kset = slab_kset;
+ err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", s->name);
+ if (err) {
+ kobject_put(&s->kobj);
+ return err;
+ }
+
+ err = sysfs_create_group(&s->kobj, &slab_attr_group);
+ if (err)
+ return err;
+
+ kobject_uevent(&s->kobj, KOBJ_ADD);
+
+ return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kobject_uevent(&s->kobj, KOBJ_REMOVE);
+ kobject_del(&s->kobj);
+ kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+ struct kmem_cache *s;
+ int err;
+
+ slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+ if (!slab_kset) {
+ printk(KERN_ERR "Cannot register slab subsystem.\n");
+ return -ENOSYS;
+ }
+
+ down_write(&slqb_lock);
+
+ sysfs_available = 1;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ err = sysfs_slab_add(s);
+ if (err)
+ printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+ " to sysfs\n", s->name);
+ }
+
+ up_write(&slqb_lock);
+
+ return 0;
+}
+device_initcall(slab_sysfs_init);
+
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
*/
#ifdef CONFIG_SLUB
#include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
#elif defined(CONFIG_SLOB)
#include <linux/slob_def.h>
#else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
#define kmalloc_track_caller(size, flags) \
__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
* standard allocator where we care about the real place the memory
* allocation request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
#define kmalloc_node_track_caller(size, flags, node) \
__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+ struct rcu_head *next;
+ void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
* virt_addr_valid(kaddr) returns true.
*/
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
extern bool __virt_addr_valid(unsigned long kaddr);
#define virt_addr_valid(kaddr) __virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -305,7 +305,11 @@ static inline void get_page(struct page

static inline struct page *virt_to_head_page(const void *x)
{
+#ifdef virt_to_page_fast
+ struct page *page = virt_to_page_fast(x);
+#else
struct page *page = virt_to_page(x);
+#endif
return compound_head(page);
}

Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <[email protected]> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+ char *name;
+ int align, cache_dma, destroy_by_rcu;
+ int hwcache_align, object_size, objs_per_slab;
+ int slab_size, store_user;
+ int order, poison, reclaim_account, red_zone;
+ int batch;
+ unsigned long objects, slabs, total_objects;
+ unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+ unsigned long free, free_remote;
+ unsigned long claim_remote_list, claim_remote_list_objects;
+ unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+ unsigned long flush_rfree_list, flush_rfree_list_objects;
+ unsigned long flush_slab_free, flush_slab_partial;
+ int numa[MAX_NODES];
+ int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+ va_list ap;
+
+ va_start(ap, x);
+ vfprintf(stderr, x, ap);
+ va_end(ap);
+ exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+ printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+ "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "-A|--activity Most active slabs first\n"
+ "-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-D|--display-active Switch line format to activity\n"
+ "-e|--empty Show empty slabs\n"
+ "-h|--help Show usage information\n"
+ "-i|--inverted Inverted list\n"
+ "-l|--slabs Show slabs\n"
+ "-n|--numa Show NUMA information\n"
+ "-o|--ops Show kmem_cache_ops\n"
+ "-s|--shrink Shrink slabs\n"
+ "-r|--report Detailed report on single slabs\n"
+ "-S|--Size Sort by size\n"
+ "-t|--tracking Show alloc/free information\n"
+ "-T|--Totals Show summary information\n"
+ "-v|--validate Validate slabs\n"
+ "-z|--zero Include empty slabs\n"
+ "\nValid debug options (FZPUT may be combined)\n"
+ "a / A Switch on all debug options (=FZUP)\n"
+ "- Switch off all debug options\n"
+ "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
+ "z / Z Redzoning\n"
+ "p / P Poisoning\n"
+ "u / U Tracking\n"
+ "t / T Tracing\n"
+ );
+}
+
+unsigned long read_obj(const char *name)
+{
+ FILE *f = fopen(name, "r");
+
+ if (!f)
+ buffer[0] = 0;
+ else {
+ if (!fgets(buffer, sizeof(buffer), f))
+ buffer[0] = 0;
+ fclose(f);
+ if (strlen(buffer) && buffer[strlen(buffer) - 1] == '\n')
+ buffer[strlen(buffer) - 1] = 0;
+ }
+ return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+ if (!read_obj(name))
+ return 0;
+
+ return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+ unsigned long result = 0;
+ char *p;
+
+ *x = NULL;
+
+ if (!read_obj(name)) {
+ *x = NULL;
+ return 0;
+ }
+ result = strtoul(buffer, &p, 10);
+ while (*p == ' ')
+ p++;
+ if (*p)
+ *x = strdup(p);
+ return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+ char x[100];
+ FILE *f;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "w");
+ if (!f)
+ fatal("Cannot write to %s\n", x);
+
+ fprintf(f, "%d\n", n);
+ fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+ char x[100];
+ FILE *f;
+ size_t l;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "r");
+ if (!f) {
+ buffer[0] = 0;
+ l = 0;
+ } else {
+ l = fread(buffer, 1, sizeof(buffer), f);
+ buffer[l] = 0;
+ fclose(f);
+ }
+ return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+ unsigned long divisor = 1;
+ char trailer = 0;
+ int n;
+
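+ /*
+ * Scale the value down and print it with one decimal place and a
+ * K/M/G suffix; the memmove below splices the decimal point into
+ * the digit string (e.g. 2500000 becomes "2.5M").
+ */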
+ if (value > 1000000000UL) {
+ divisor = 100000000UL;
+ trailer = 'G';
+ } else if (value > 1000000UL) {
+ divisor = 100000UL;
+ trailer = 'M';
+ } else if (value > 1000UL) {
+ divisor = 100;
+ trailer = 'K';
+ }
+
+ value /= divisor;
+ n = sprintf(buffer, "%ld", value);
+ if (trailer) {
+ buffer[n] = trailer;
+ n++;
+ buffer[n] = 0;
+ }
+ if (divisor != 1) {
+ memmove(buffer + n - 2, buffer + n - 3, 4);
+ buffer[n-2] = '.';
+ n++;
+ }
+ return n;
+}
+
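+/*
+ * Parse a "N<node>=<count> N<node>=<count> ..." string into a per-node
+ * count array.
+ */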
+void decode_numa_list(int *numa, char *t)
+{
+ int node;
+ int nr;
+
+ memset(numa, 0, MAX_NODES * sizeof(int));
+
+ if (!t)
+ return;
+
+ while (*t == 'N') {
+ t++;
+ node = strtoul(t, &t, 10);
+ if (*t == '=') {
+ t++;
+ nr = strtoul(t, &t, 10);
+ numa[node] = nr;
+ if (node > highest_node)
+ highest_node = node;
+ }
+ while (*t == ' ')
+ t++;
+ }
+}
+
+void slab_validate(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+ if (show_activity)
+ printf("Name Objects Alloc Free %%Fill %%New "
+ "FlushR %%FlushR FlushR_Objs O\n");
+ else
+ printf("Name Objects Objsize Space "
+ " O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+ return s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+ return s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+ int node;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (!highest_node) {
+ printf("\n%s: No NUMA information available.\n", s->name);
+ return;
+ }
+
+ if (skip_zero && !s->slabs)
+ return;
+
+ if (!line) {
+ printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ for(node = 0; node <= highest_node; node++)
+ printf(" %4d", node);
+ printf("\n----------------------");
+ for(node = 0; node <= highest_node; node++)
+ printf("-----");
+ printf("\n");
+ }
+ printf("%-21s ", mode ? "All slabs" : s->name);
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ if (mode) {
+ printf("%-21s ", "Partial slabs");
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa_partial[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ }
+ line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+ printf("\n%s: Kernel object allocation\n", s->name);
+ printf("-----------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "alloc_calls"))
+ printf("%s", buffer);
+ else
+ printf("No Data\n");
+
+ printf("\n%s: Kernel object freeing\n", s->name);
+ printf("------------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "free_calls"))
+ printf("%s", buffer);
+ else
+ printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (read_slab_obj(s, "ops")) {
+ printf("\n%s: kmem_cache operations\n", s->name);
+ printf("--------------------------------------------\n");
+ printf("%s", buffer);
+ } else
+ printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+ if (x)
+ return "On ";
+ return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+ unsigned long total_alloc;
+ unsigned long total_free;
+ unsigned long total;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+ if (!total_alloc)
+ return;
+
+ printf("\n");
+ printf("Slab Perf Counter\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+ total_alloc,
+ s->alloc_slab_fill, s->alloc_slab_new);
+ printf("Free: %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+ total_free,
+ s->flush_slab_partial,
+ s->flush_slab_free,
+ s->free_remote);
+ printf("Claim: %8lu, objects %8lu\n",
+ s->claim_remote_list,
+ s->claim_remote_list_objects);
+ printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+ s->flush_free_list,
+ s->flush_free_list_objects,
+ s->flush_free_list_remote);
+ printf("FlushR:%8lu, objects %8lu\n",
+ s->flush_rfree_list,
+ s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ printf("\nSlabcache: %-20s Order : %2d Objects: %lu\n",
+ s->name, s->order, s->objects);
+ if (s->hwcache_align)
+ printf("** Hardware cacheline aligned\n");
+ if (s->cache_dma)
+ printf("** Memory is allocated in a special DMA zone\n");
+ if (s->destroy_by_rcu)
+ printf("** Slabs are destroyed via RCU\n");
+ if (s->reclaim_account)
+ printf("** Reclaim accounting active\n");
+
+ printf("\nSizes (bytes) Slabs Debug Memory\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
+ s->object_size, s->slabs, "N/A",
+ s->slabs * (page_size << s->order));
+ printf("SlabObj: %7d Full : %7s Redzoning : %s Used : %7ld\n",
+ s->slab_size, "N/A",
+ onoff(s->red_zone), s->objects * s->object_size);
+ printf("SlabSiz: %7d Partial: %7s Poisoning : %s Loss : %7ld\n",
+ page_size << s->order, "N/A", onoff(s->poison),
+ s->slabs * (page_size << s->order) - s->objects * s->object_size);
+ printf("Loss : %7d CpuSlab: %7s Tracking : %s Lalig: %7ld\n",
+ s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+ (s->slab_size - s->object_size) * s->objects);
+ printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
+ s->align, s->objs_per_slab, "N/A",
+ ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+ s->slabs);
+
+ ops(s);
+ show_tracking(s);
+ slab_numa(s, 1);
+ slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+ char size_str[20];
+ char flags[20];
+ char *p = flags;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (actual_slabs == 1) {
+ report(s);
+ return;
+ }
+
+ if (skip_zero && !show_empty && !s->slabs)
+ return;
+
+ if (show_empty && s->slabs)
+ return;
+
+ store_size(size_str, slab_size(s));
+
+ if (!line++)
+ first_line();
+
+ if (s->cache_dma)
+ *p++ = 'd';
+ if (s->hwcache_align)
+ *p++ = 'A';
+ if (s->poison)
+ *p++ = 'P';
+ if (s->reclaim_account)
+ *p++ = 'a';
+ if (s->red_zone)
+ *p++ = 'Z';
+ if (s->store_user)
+ *p++ = 'U';
+
+ *p = 0;
+ if (show_activity) {
+ unsigned long total_alloc;
+ unsigned long total_free;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+ printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5d %7ld %8d\n",
+ s->name, s->objects,
+ total_alloc, total_free,
+ total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+ total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+ s->flush_rfree_list,
+ (total_alloc + total_free) ? (s->flush_rfree_list * 100 / (total_alloc + total_free)) : 0,
+ s->flush_rfree_list_objects,
+ s->order);
+ }
+ else
+ printf("%-21s %8ld %7d %8s %4d %1d %3ld %4ld %s\n",
+ s->name, s->objects, s->object_size, size_str,
+ s->objs_per_slab, s->order,
+ s->slabs ? (s->objects * s->object_size * 100) /
+ (s->slabs * (page_size << s->order)) : 100,
+ s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+ if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+ return 1;
+
+ if (strcasecmp(opt, "a") == 0) {
+ sanity = 1;
+ poison = 1;
+ redzone = 1;
+ tracking = 1;
+ return 1;
+ }
+
+ for ( ; *opt; opt++)
+ switch (*opt) {
+ case 'F' : case 'f':
+ if (sanity)
+ return 0;
+ sanity = 1;
+ break;
+ case 'P' : case 'p':
+ if (poison)
+ return 0;
+ poison = 1;
+ break;
+
+ case 'Z' : case 'z':
+ if (redzone)
+ return 0;
+ redzone = 1;
+ break;
+
+ case 'U' : case 'u':
+ if (tracking)
+ return 0;
+ tracking = 1;
+ break;
+
+ case 'T' : case 't':
+ if (tracing)
+ return 0;
+ tracing = 1;
+ break;
+ default:
+ return 0;
+ }
+ return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+ if (s->objects > 0)
+ return 0;
+
+ /*
+ * We may still have slabs even if there are no objects. Shrinking will
+ * remove them.
+ */
+ if (s->slabs != 0)
+ set_obj(s, "shrink", 1);
+
+ return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (redzone && !s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+ }
+ if (!redzone && s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+ }
+ if (poison && !s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+ }
+ if (!poison && s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+ }
+ if (tracking && !s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+ }
+ if (!tracking && s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+ }
+}
+
+void totals(void)
+{
+ struct slabinfo *s;
+
+ int used_slabs = 0;
+ char b1[20], b2[20], b3[20], b4[20];
+ unsigned long long max = 1ULL << 63;
+
+ /* Object size */
+ unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+ /* Number of partial slabs in a slabcache */
+ unsigned long long min_partial = max, max_partial = 0,
+ avg_partial, total_partial = 0;
+
+ /* Number of slabs in a slab cache */
+ unsigned long long min_slabs = max, max_slabs = 0,
+ avg_slabs, total_slabs = 0;
+
+ /* Size of the whole slab */
+ unsigned long long min_size = max, max_size = 0,
+ avg_size, total_size = 0;
+
+ /* Bytes used for object storage in a slab */
+ unsigned long long min_used = max, max_used = 0,
+ avg_used, total_used = 0;
+
+ /* Waste: Bytes used for alignment and padding */
+ unsigned long long min_waste = max, max_waste = 0,
+ avg_waste, total_waste = 0;
+ /* Number of objects in a slab */
+ unsigned long long min_objects = max, max_objects = 0,
+ avg_objects, total_objects = 0;
+ /* Waste per object */
+ unsigned long long min_objwaste = max,
+ max_objwaste = 0, avg_objwaste,
+ total_objwaste = 0;
+
+ /* Memory per object */
+ unsigned long long min_memobj = max,
+ max_memobj = 0, avg_memobj,
+ total_objsize = 0;
+
+ for (s = slabinfo; s < slabinfo + slabs; s++) {
+ unsigned long long size;
+ unsigned long used;
+ unsigned long long wasted;
+ unsigned long long objwaste;
+
+ if (!s->slabs || !s->objects)
+ continue;
+
+ used_slabs++;
+
+ size = slab_size(s);
+ used = s->objects * s->object_size;
+ wasted = size - used;
+ objwaste = s->slab_size - s->object_size;
+
+ if (s->object_size < min_objsize)
+ min_objsize = s->object_size;
+ if (s->slabs < min_slabs)
+ min_slabs = s->slabs;
+ if (size < min_size)
+ min_size = size;
+ if (wasted < min_waste)
+ min_waste = wasted;
+ if (objwaste < min_objwaste)
+ min_objwaste = objwaste;
+ if (s->objects < min_objects)
+ min_objects = s->objects;
+ if (used < min_used)
+ min_used = used;
+ if (s->slab_size < min_memobj)
+ min_memobj = s->slab_size;
+
+ if (s->object_size > max_objsize)
+ max_objsize = s->object_size;
+ if (s->slabs > max_slabs)
+ max_slabs = s->slabs;
+ if (size > max_size)
+ max_size = size;
+ if (wasted > max_waste)
+ max_waste = wasted;
+ if (objwaste > max_objwaste)
+ max_objwaste = objwaste;
+ if (s->objects > max_objects)
+ max_objects = s->objects;
+ if (used > max_used)
+ max_used = used;
+ if (s->slab_size > max_memobj)
+ max_memobj = s->slab_size;
+
+ total_slabs += s->slabs;
+ total_size += size;
+ total_waste += wasted;
+
+ total_objects += s->objects;
+ total_used += used;
+
+ total_objwaste += s->objects * objwaste;
+ total_objsize += s->objects * s->slab_size;
+ }
+
+ if (!total_objects) {
+ printf("No objects\n");
+ return;
+ }
+ if (!used_slabs) {
+ printf("No slabs\n");
+ return;
+ }
+
+ /* Per slab averages */
+ avg_slabs = total_slabs / used_slabs;
+ avg_size = total_size / used_slabs;
+ avg_waste = total_waste / used_slabs;
+
+ avg_objects = total_objects / used_slabs;
+ avg_used = total_used / used_slabs;
+
+ /* Per object object sizes */
+ avg_objsize = total_used / total_objects;
+ avg_objwaste = total_objwaste / total_objects;
+ avg_memobj = total_objsize / total_objects;
+
+ printf("Slabcache Totals\n");
+ printf("----------------\n");
+ printf("Slabcaches : %3d Active: %3d\n",
+ slabs, used_slabs);
+
+ store_size(b1, total_size);store_size(b2, total_waste);
+ store_size(b3, total_waste * 100 / total_used);
+ printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
+
+ store_size(b1, total_objects);
+ printf("# Objects : %6s\n", b1);
+
+ printf("\n");
+ printf("Per Cache Average Min Max Total\n");
+ printf("---------------------------------------------------------\n");
+
+ store_size(b1, avg_objects);store_size(b2, min_objects);
+ store_size(b3, max_objects);store_size(b4, total_objects);
+ printf("#Objects %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_slabs);store_size(b2, min_slabs);
+ store_size(b3, max_slabs);store_size(b4, total_slabs);
+ printf("#Slabs %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_size);store_size(b2, min_size);
+ store_size(b3, max_size);store_size(b4, total_size);
+ printf("Memory %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_used);store_size(b2, min_used);
+ store_size(b3, max_used);store_size(b4, total_used);
+ printf("Used %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_waste);store_size(b2, min_waste);
+ store_size(b3, max_waste);store_size(b4, total_waste);
+ printf("Loss %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ printf("\n");
+ printf("Per Object Average Min Max\n");
+ printf("---------------------------------------------\n");
+
+ store_size(b1, avg_memobj);store_size(b2, min_memobj);
+ store_size(b3, max_memobj);
+ printf("Memory %10s %10s %10s\n",
+ b1, b2, b3);
+ store_size(b1, avg_objsize);store_size(b2, min_objsize);
+ store_size(b3, max_objsize);
+ printf("User %10s %10s %10s\n",
+ b1, b2, b3);
+
+ store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+ store_size(b3, max_objwaste);
+ printf("Loss %10s %10s %10s\n",
+ b1, b2, b3);
+}
+
+void sort_slabs(void)
+{
+ struct slabinfo *s1,*s2;
+
+ for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+ for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+ int result;
+
+ if (sort_size)
+ result = slab_size(s1) < slab_size(s2);
+ else if (sort_active)
+ result = slab_activity(s1) < slab_activity(s2);
+ else
+ result = strcasecmp(s1->name, s2->name);
+
+ if (show_inverted)
+ result = -result;
+
+ if (result > 0) {
+ struct slabinfo t;
+
+ memcpy(&t, s1, sizeof(struct slabinfo));
+ memcpy(s1, s2, sizeof(struct slabinfo));
+ memcpy(s2, &t, sizeof(struct slabinfo));
+ }
+ }
+ }
+}
+
+int slab_mismatch(char *slab)
+{
+ return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+ DIR *dir;
+ struct dirent *de;
+ struct slabinfo *slab = slabinfo;
+ char *p;
+ char *t;
+ int count;
+
+ if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+ fatal("SYSFS support for SLUB not active\n");
+
+ dir = opendir(".");
+ while ((de = readdir(dir))) {
+ if (de->d_name[0] == '.' ||
+ (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+ continue;
+ switch (de->d_type) {
+ case DT_DIR:
+ if (chdir(de->d_name))
+ fatal("Unable to access slab %s\n", de->d_name);
+ slab->name = strdup(de->d_name);
+ slab->align = get_obj("align");
+ slab->cache_dma = get_obj("cache_dma");
+ slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+ slab->hwcache_align = get_obj("hwcache_align");
+ slab->object_size = get_obj("object_size");
+ slab->objects = get_obj("objects");
+ slab->total_objects = get_obj("total_objects");
+ slab->objs_per_slab = get_obj("objs_per_slab");
+ slab->order = get_obj("order");
+ slab->poison = get_obj("poison");
+ slab->reclaim_account = get_obj("reclaim_account");
+ slab->red_zone = get_obj("red_zone");
+ slab->slab_size = get_obj("slab_size");
+ slab->slabs = get_obj_and_str("slabs", &t);
+ decode_numa_list(slab->numa, t);
+ free(t);
+ slab->store_user = get_obj("store_user");
+ slab->batch = get_obj("batch");
+ slab->alloc = get_obj("alloc");
+ slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+ slab->alloc_slab_new = get_obj("alloc_slab_new");
+ slab->free = get_obj("free");
+ slab->free_remote = get_obj("free_remote");
+ slab->claim_remote_list = get_obj("claim_remote_list");
+ slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+ slab->flush_free_list = get_obj("flush_free_list");
+ slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+ slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+ slab->flush_rfree_list = get_obj("flush_rfree_list");
+ slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+ slab->flush_slab_free = get_obj("flush_slab_free");
+ slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+ chdir("..");
+ slab++;
+ break;
+ default :
+ fatal("Unknown file type %lx\n", de->d_type);
+ }
+ }
+ closedir(dir);
+ slabs = slab - slabinfo;
+ actual_slabs = slabs;
+ if (slabs > MAX_SLABS)
+ fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+ struct slabinfo *slab;
+
+ for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+ if (show_numa)
+ slab_numa(slab, 0);
+ else if (show_track)
+ show_tracking(slab);
+ else if (validate)
+ slab_validate(slab);
+ else if (shrink)
+ slab_shrink(slab);
+ else if (set_debug)
+ slab_debug(slab);
+ else if (show_ops)
+ ops(slab);
+ else if (show_slab)
+ slabcache(slab);
+ else if (show_report)
+ report(slab);
+ }
+}
+
+struct option opts[] = {
+ { "activity", 0, NULL, 'A' },
+ { "debug", 2, NULL, 'd' },
+ { "display-activity", 0, NULL, 'D' },
+ { "empty", 0, NULL, 'e' },
+ { "help", 0, NULL, 'h' },
+ { "inverted", 0, NULL, 'i'},
+ { "numa", 0, NULL, 'n' },
+ { "ops", 0, NULL, 'o' },
+ { "report", 0, NULL, 'r' },
+ { "shrink", 0, NULL, 's' },
+ { "slabs", 0, NULL, 'l' },
+ { "track", 0, NULL, 't'},
+ { "validate", 0, NULL, 'v' },
+ { "zero", 0, NULL, 'z' },
+ { "1ref", 0, NULL, '1'},
+ { NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+ int c;
+ int err;
+ char *pattern_source;
+
+ page_size = getpagesize();
+
+ while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+ opts, NULL)) != -1)
+ switch (c) {
+ case 'A':
+ sort_active = 1;
+ break;
+ case 'd':
+ set_debug = 1;
+ if (!debug_opt_scan(optarg))
+ fatal("Invalid debug option '%s'\n", optarg);
+ break;
+ case 'D':
+ show_activity = 1;
+ break;
+ case 'e':
+ show_empty = 1;
+ break;
+ case 'h':
+ usage();
+ return 0;
+ case 'i':
+ show_inverted = 1;
+ break;
+ case 'n':
+ show_numa = 1;
+ break;
+ case 'o':
+ show_ops = 1;
+ break;
+ case 'r':
+ show_report = 1;
+ break;
+ case 's':
+ shrink = 1;
+ break;
+ case 'l':
+ show_slab = 1;
+ break;
+ case 't':
+ show_track = 1;
+ break;
+ case 'v':
+ validate = 1;
+ break;
+ case 'z':
+ skip_zero = 0;
+ break;
+ case 'T':
+ show_totals = 1;
+ break;
+ case 'S':
+ sort_size = 1;
+ break;
+
+ default:
+ fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+ }
+
+ if (!show_slab && !show_track && !show_report
+ && !validate && !shrink && !set_debug && !show_ops)
+ show_slab = 1;
+
+ if (argc > optind)
+ pattern_source = argv[optind];
+ else
+ pattern_source = ".*";
+
+ err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+ if (err)
+ fatal("%s: Invalid pattern '%s' code %d\n",
+ argv[0], pattern_source, err);
+ read_slab_dir();
+ if (show_totals)
+ totals();
+ else {
+ sort_slabs();
+ output_slabs();
+ }
+ return 0;
+}

2009-01-23 09:55:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Nick Piggin <[email protected]> writes:

Not a full review, just some things i noticed.

The code is very readable thanks (that's imho the main reason slab.c
should go btw, it's really messy and hard to get through)

> Using lists rather than arrays can reduce the cacheline footprint. When moving
> objects around, SLQB can move a list of objects from one CPU to another by
> simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> can be touched during alloc/free. Newly freed objects tend to be cache hot,
> and newly allocated ones tend to soon be touched anyway, so often there is
> little cost to using metadata in the objects.

You're probably aware of that, but the obvious counter argument
is that for manipulating a single object a double linked
list will always require touching three cache lines
(prev, current, next), while an array access only a single one.
A possible alternative would be a list of shorter arrays.

> + int objsize; /* The size of an object without meta data */
> + int offset; /* Free pointer offset. */
> + int objects; /* Number of objects in slab */
> +
> + int size; /* The size of an object including meta data */
> + int order; /* Allocation order */
> + gfp_t allocflags; /* gfp flags to use on allocation */
> + unsigned int colour_range; /* range of colour counter */
> + unsigned int colour_off; /* offset per colour */
> + void (*ctor)(void *);
> +
> + const char *name; /* Name (only for display!) */
> + struct list_head list; /* List of slab caches */
> +
> + int align; /* Alignment */
> + int inuse; /* Offset to metadata */

I suspect some of these fields could be short or char (E.g. alignment),
possibly lowering cache line impact.

> +
> +#ifdef CONFIG_SLQB_SYSFS
> + struct kobject kobj; /* For sysfs */
> +#endif
> +#ifdef CONFIG_NUMA
> + struct kmem_cache_node *node[MAX_NUMNODES];
> +#endif
> +#ifdef CONFIG_SMP
> + struct kmem_cache_cpu *cpu_slab[NR_CPUS];

Those both really need to be dynamically allocated, otherwise
it wastes a lot of memory in the common case
(e.g. NR_CPUS==128 kernel on dual core system). And of course
on the proposed NR_CPUS==4096 kernels it becomes prohibitive.

You could use alloc_percpu? There's no alloc_pernode
unfortunately, perhaps there should be one.

> +#if L1_CACHE_BYTES < 64
> + if (size > 64 && size <= 96)
> + return 1;
> +#endif
> +#if L1_CACHE_BYTES < 128
> + if (size > 128 && size <= 192)
> + return 2;
> +#endif
> + if (size <= 8) return 3;
> + if (size <= 16) return 4;
> + if (size <= 32) return 5;
> + if (size <= 64) return 6;
> + if (size <= 128) return 7;
> + if (size <= 256) return 8;
> + if (size <= 512) return 9;
> + if (size <= 1024) return 10;
> + if (size <= 2 * 1024) return 11;
> + if (size <= 4 * 1024) return 12;
> + if (size <= 8 * 1024) return 13;
> + if (size <= 16 * 1024) return 14;
> + if (size <= 32 * 1024) return 15;
> + if (size <= 64 * 1024) return 16;
> + if (size <= 128 * 1024) return 17;
> + if (size <= 256 * 1024) return 18;
> + if (size <= 512 * 1024) return 19;
> + if (size <= 1024 * 1024) return 20;
> + if (size <= 2 * 1024 * 1024) return 21;

Have you looked into other binsizes? iirc the original slab paper
mentioned that power of two is usually not the best.

> + return -1;

> +}
> +
> +#ifdef CONFIG_ZONE_DMA
> +#define SLQB_DMA __GFP_DMA
> +#else
> +/* Disable "DMA slabs" */
> +#define SLQB_DMA (__force gfp_t)0
> +#endif
> +
> +/*
> + * Find the kmalloc slab cache for a given combination of allocation flags and
> + * size.

You should mention that this would be a very bad idea to call for !__builtin_constant_p(size)

> + */
> +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> +{
> + int index = kmalloc_index(size);
> +
> + if (unlikely(index == 0))
> + return NULL;
> +
> + if (likely(!(flags & SLQB_DMA)))
> + return &kmalloc_caches[index];
> + else
> + return &kmalloc_caches_dma[index];

BTW i had an old patchkit to kill all GFP_DMA slab users. Perhaps should
warm that up again. That would lower the inline footprint.

> +#ifdef CONFIG_NUMA
> +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> +
> +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)

kmalloc_node should be infrequent, i suspect it can be safely out of lined.

> + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> + * a default closest home node via which it can use fastpath functions.

FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do
that too and be happy.

> + * Perhaps it is not a big problem.
> + */
> +
> +/*
> + * slqb_page overloads struct page, and is used to manage some slob allocation
> + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> + * we'll just define our own struct slqb_page type variant here.

Hopefully this works for the crash dumpers. Do they have a way to distinguish
slub/slqb/slab kernels with different struct page usage?

> +#define PG_SLQB_BIT (1 << PG_slab)
> +
> +static int kmem_size __read_mostly;
> +#ifdef CONFIG_NUMA
> +static int numa_platform __read_mostly;
> +#else
> +#define numa_platform 0
> +#endif

It would be cheaper if you put that as a flag into the kmem_caches flags, this
way you avoid an additional cache line touched.

> +static inline int slqb_page_to_nid(struct slqb_page *page)
> +{
> + return page_to_nid(&page->page);
> +}

etc. you got a lot of wrappers...

> +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> + unsigned int order)
> +{
> + struct page *p;
> +
> + if (nid == -1)
> + p = alloc_pages(flags, order);
> + else
> + p = alloc_pages_node(nid, flags, order);

alloc_pages_nodes does that check anyways.


> +/* Not all arches define cache_line_size */
> +#ifndef cache_line_size
> +#define cache_line_size() L1_CACHE_BYTES
> +#endif
> +

They should. better fix them?


> +
> + /*
> + * Determine which debug features should be switched on
> + */

It would be nicer if you could use long options. At least for me
that would increase the probability that I could remember them
without having to look them up.

> +/*
> + * Allocate a new slab, set up its object list.
> + */
> +static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
> +{
> + struct slqb_page *page;
> + void *start;
> + void *last;
> + void *p;
> +
> + BUG_ON(flags & GFP_SLAB_BUG_MASK);
> +
> + page = allocate_slab(s,
> + flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> + if (!page)
> + goto out;
> +
> + page->flags |= PG_SLQB_BIT;
> +
> + start = page_address(&page->page);
> +
> + if (unlikely(slab_poison(s)))
> + memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> +
> + start += colour;

One thing i was wondering. Did you try to disable the colouring and see
if it makes much difference on modern systems? They tend to have either
larger caches or higher associativity caches.

Or perhaps it could be made optional based on CPU type?


> +static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
> +{
> + struct slqb_page *page;
> + struct kmem_cache_list *l;
> + struct kmem_cache_cpu *c;
> + unsigned int colour;
> + void *object;
> +
> + c = get_cpu_slab(s, smp_processor_id());
> + colour = c->colour_next;
> + c->colour_next += s->colour_off;
> + if (c->colour_next >= s->colour_range)
> + c->colour_next = 0;
> +
> + /* XXX: load any partial? */
> +
> + /* Caller handles __GFP_ZERO */
> + gfpflags &= ~__GFP_ZERO;
> +
> + if (gfpflags & __GFP_WAIT)
> + local_irq_enable();

At least on P4 you could get some win by avoiding the local_irq_save() up in the fast
path when __GFP_WAIT is set (because storing the eflags is very expensive there)

> +
> +again:
> + local_irq_save(flags);
> + object = __slab_alloc(s, gfpflags, node);
> + local_irq_restore(flags);
> +
> + if (unlikely(slab_debug(s)) && likely(object)) {

AFAIK gcc cannot process multiple likelys in a single condition.

> +/* Initial slabs */
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> +#endif
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> +#endif
> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache kmem_cpu_cache;
> +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> +#endif
> +#endif
> +
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache kmem_node_cache;
> +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> +#endif

That all needs fixing too of course.

> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
> +{
> + struct kmem_cache_cpu *c;
> +
> + c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
> + if (!c)
> + return NULL;
> +
> + init_kmem_cache_cpu(s, c);
> + return c;
> +}
> +
> +static void free_kmem_cache_cpus(struct kmem_cache *s)
> +{
> + int cpu;
> +
> + for_each_online_cpu(cpu) {

Is this protected against racing cpu hotplugs? Doesn't look like it. Multiple occurrences.

> +static void cache_trim_worker(struct work_struct *w)
> +{
> + struct delayed_work *work =
> + container_of(w, struct delayed_work, work);
> + struct kmem_cache *s;
> + int node;
> +
> + if (!down_read_trylock(&slqb_lock))
> + goto out;

No counter for this?

> +
> + /*
> + * We are bringing a node online. No memory is availabe yet. We must
> + * allocate a kmem_cache_node structure in order to bring the node
> + * online.
> + */
> + down_read(&slqb_lock);
> + list_for_each_entry(s, &slab_caches, list) {
> + /*
> + * XXX: kmem_cache_alloc_node will fallback to other nodes
> + * since memory is not yet available from the node that
> + * is brought up.
> + */
> + if (s->node[nid]) /* could be lefover from last online */
> + continue;
> + n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> + if (!n) {
> + ret = -ENOMEM;

Surely that should panic? I don't think a slab less node will
be very useful later.

> +#ifdef CONFIG_SLQB_SYSFS
> +/*
> + * sysfs API
> + */
> +#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
> +#define to_slab(n) container_of(n, struct kmem_cache, kobj);
> +
> +struct slab_attribute {
> + struct attribute attr;
> + ssize_t (*show)(struct kmem_cache *s, char *buf);
> + ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
> +};
> +
> +#define SLAB_ATTR_RO(_name) \
> + static struct slab_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define SLAB_ATTR(_name) \
> + static struct slab_attribute _name##_attr = \
> + __ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
> +{
> + return sprintf(buf, "%d\n", s->size);
> +}
> +SLAB_ATTR_RO(slab_size);
> +
> +static ssize_t align_show(struct kmem_cache *s, char *buf)
> +{
> + return sprintf(buf, "%d\n", s->align);
> +}
> +SLAB_ATTR_RO(align);
> +

When you map back to the attribute you can use an index into a table
for the field, saving that many functions?

> +#define STAT_ATTR(si, text) \
> +static ssize_t text##_show(struct kmem_cache *s, char *buf) \
> +{ \
> + return show_stat(s, buf, si); \
> +} \
> +SLAB_ATTR_RO(text); \
> +
> +STAT_ATTR(ALLOC, alloc);
> +STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
> +STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
> +STAT_ATTR(FREE, free);
> +STAT_ATTR(FREE_REMOTE, free_remote);
> +STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
> +STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
> +STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
> +STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
> +STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
> +STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
> +STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
> +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);

This really should be table driven, shouldn't it? That would give much
smaller code.
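
Something like this is what I mean by table driven (sketch only; the
stat_attribute wrapper and the extra attribute argument to the show
routine are invented here, not in your patch): carry the counter index
in the attribute so a single show routine serves every stat file:

	struct stat_attribute {
		struct slab_attribute attr;
		enum stat_item si;
	};

	/* single show routine; the sysfs ->show op has to pass the
	 * attribute down so the index can be recovered */
	static ssize_t stat_attr_show(struct kmem_cache *s,
			struct slab_attribute *a, char *buf)
	{
		struct stat_attribute *sa;

		sa = container_of(a, struct stat_attribute, attr);
		return show_stat(s, buf, sa->si);
	}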

-Andi
--
[email protected] -- Speaking for myself only.

2009-01-23 10:13:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Andi,

On Fri, 2009-01-23 at 10:55 +0100, Andi Kleen wrote:
> > +#if L1_CACHE_BYTES < 64
> > + if (size > 64 && size <= 96)
> > + return 1;
> > +#endif
> > +#if L1_CACHE_BYTES < 128
> > + if (size > 128 && size <= 192)
> > + return 2;
> > +#endif
> > + if (size <= 8) return 3;
> > + if (size <= 16) return 4;
> > + if (size <= 32) return 5;
> > + if (size <= 64) return 6;
> > + if (size <= 128) return 7;
> > + if (size <= 256) return 8;
> > + if (size <= 512) return 9;
> > + if (size <= 1024) return 10;
> > + if (size <= 2 * 1024) return 11;
> > + if (size <= 4 * 1024) return 12;
> > + if (size <= 8 * 1024) return 13;
> > + if (size <= 16 * 1024) return 14;
> > + if (size <= 32 * 1024) return 15;
> > + if (size <= 64 * 1024) return 16;
> > + if (size <= 128 * 1024) return 17;
> > + if (size <= 256 * 1024) return 18;
> > + if (size <= 512 * 1024) return 19;
> > + if (size <= 1024 * 1024) return 20;
> > + if (size <= 2 * 1024 * 1024) return 21;
>
> Have you looked into other binsizes? iirc the original slab paper
> mentioned that power of two is usually not the best.

Judging by the limited boot-time testing I've done with kmemtrace, the
bulk of kmalloc() allocations are under 64 bytes or so and actually a
pretty ok fit with the current sizes. The badly fitting objects are
usually very big and of different sizes (so they won't share a cache
easily) so I'm not expecting big gains from non-power of two sizes.

Pekka

2009-01-23 11:26:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 10:55:26AM +0100, Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
>
> Not a full review, just some things i noticed.
>
> The code is very readable thanks (that's imho the main reason slab.c
> should go btw, it's really messy and hard to get through)

Thanks, appreciated. It is very helpful.


> > Using lists rather than arrays can reduce the cacheline footprint. When moving
> > objects around, SLQB can move a list of objects from one CPU to another by
> > simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> > SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> > can be touched during alloc/free. Newly freed objects tend to be cache hot,
> > and newly allocated ones tend to soon be touched anyway, so often there is
> > little cost to using metadata in the objects.
>
> You're probably aware of that, but the obvious counter argument
> is that for manipulating a single object a double linked
> list will always require touching three cache lines
> (prev, current, next), while an array access only a single one.
> A possible alternative would be a list of shorter arrays.

That's true, but SLQB uses singly linked lists, not doubly linked ones.
An allocation needs to load a "head" pointer to the first object, then
load a "next" pointer from that object and assign it to "head". The
2nd load touches memory which should be subsequently touched by the
caller anyway. A free just has to assign a pointer in the to-be-freed
object to point to the old head, and then update the head to the new
object. So this 1st touch should usually be cache hot memory.

But yes there are situations where SLAB scheme could result in
fewer cache misses. I haven't yet noticed it is a problem.
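
To make the above concrete, a minimal sketch of the list push/pop
(illustration only; the real code keeps the free pointer at s->offset
inside the object and a count next to the head):

	struct freelist {
		void *head;	/* first free object, or NULL */
		int nr;
	};

	static inline void *freelist_alloc(struct freelist *l)
	{
		void *object = l->head;

		if (object) {
			l->head = *(void **)object; /* load "next" from the object */
			l->nr--;
		}
		return object;
	}

	static inline void freelist_free(struct freelist *l, void *object)
	{
		*(void **)object = l->head;	/* old head becomes "next" */
		l->head = object;
		l->nr++;
	}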


> > + const char *name; /* Name (only for display!) */
> > + struct list_head list; /* List of slab caches */
> > +
> > + int align; /* Alignment */
> > + int inuse; /* Offset to metadata */
>
> I suspect some of these fields could be short or char (E.g. alignment),
> possibly lowering cache line impact.

Good point. I'll have to do a pass through all structures and
make sure sizes and alignments etc are optimal. I have somewhat
ordered it eg. so that LIFO freelist allocations only have to
touch the first few fields in structures, then partial page
list allocations touch the next few, then page allocator etc.

But that might have gone out of date a little bit.


> > +#ifdef CONFIG_SLQB_SYSFS
> > + struct kobject kobj; /* For sysfs */
> > +#endif
> > +#ifdef CONFIG_NUMA
> > + struct kmem_cache_node *node[MAX_NUMNODES];
> > +#endif
> > +#ifdef CONFIG_SMP
> > + struct kmem_cache_cpu *cpu_slab[NR_CPUS];
>
> Those both really need to be dynamically allocated, otherwise
> it wastes a lot of memory in the common case
> (e.g. NR_CPUS==128 kernel on dual core system). And of course
> on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
>
> You could use alloc_percpu? There's no alloc_pernode
> unfortunately, perhaps there should be one.

cpu_slab is dynamically allocated, by just changing the size of
the kmem_cache cache at boot time. Probably the best way would
be to have dynamic cpu and node allocs for them, I agree.

Any plans for an alloc_pernode?
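
Just to sketch the shape such an interface could take (hypothetical,
nothing like this exists today and the name is invented): one
node-local allocation per online node, returned as a nid-indexed array:

	static void **alloc_pernode(size_t size)
	{
		void **ptrs;
		int nid;

		ptrs = kcalloc(nr_node_ids, sizeof(void *), GFP_KERNEL);
		if (!ptrs)
			return NULL;

		for_each_online_node(nid) {
			ptrs[nid] = kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, nid);
			if (!ptrs[nid])
				goto error;
		}
		return ptrs;

	error:
		for_each_online_node(nid)
			kfree(ptrs[nid]);	/* kfree(NULL) is safe */
		kfree(ptrs);
		return NULL;
	}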


> > + if (size <= 2 * 1024 * 1024) return 21;
>
> Have you looked into other binsizes? iirc the original slab paper
> mentioned that power of two is usually not the best.

No I haven't. Although I have been spending most effort at this
point just to improve SLQB versus the other allocators without
changing things like this. But it would be fine to investigate
when SLQB is more mature or for somebody else to look at it.

> > +/*
> > + * Find the kmalloc slab cache for a given combination of allocation flags and
> > + * size.
>
> You should mention that this would be a very bad idea to call for !__builtin_constant_p(size)

OK. It's not meant to be used outside slqb_def.h, however.


> > +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> > +{
> > + int index = kmalloc_index(size);
> > +
> > + if (unlikely(index == 0))
> > + return NULL;
> > +
> > + if (likely(!(flags & SLQB_DMA)))
> > + return &kmalloc_caches[index];
> > + else
> > + return &kmalloc_caches_dma[index];
>
> BTW i had an old patchkit to kill all GFP_DMA slab users. Perhaps should
> warm that up again. That would lower the inline footprint.

That would be excellent. It would also reduce constant data overheads
for SLAB and SLQB, and some nasty code from SLUB.


> > +#ifdef CONFIG_NUMA
> > +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> > +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> > +
> > +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>
> kmalloc_node should be infrequent, i suspect it can be safely out of lined.

Hmm... I wonder how much it increases code size...


> > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > + * a default closest home node via which it can use fastpath functions.
>
> FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do
> that too and be happy.

What if the node is possible but not currently online?


> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> > + * we'll just define our own struct slqb_page type variant here.
>
> Hopefully this works for the crash dumpers. Do they have a way to distingush
> slub/slqb/slab kernels with different struct page usage?

Beyond looking at configs or hacks like looking at symbols, I don't
think so... It probably should go into vermagic I guess.


> > +#define PG_SLQB_BIT (1 << PG_slab)
> > +
> > +static int kmem_size __read_mostly;
> > +#ifdef CONFIG_NUMA
> > +static int numa_platform __read_mostly;
> > +#else
> > +#define numa_platform 0
> > +#endif
>
> It would be cheaper if you put that as a flag into the kmem_caches flags, this
> way you avoid an additional cache line touched.

Ok, that works.


> > +static inline int slqb_page_to_nid(struct slqb_page *page)
> > +{
> > + return page_to_nid(&page->page);
> > +}
>
> etc. you got a lot of wrappers...

I think they're not too bad though.


> > +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> > + unsigned int order)
> > +{
> > + struct page *p;
> > +
> > + if (nid == -1)
> > + p = alloc_pages(flags, order);
> > + else
> > + p = alloc_pages_node(nid, flags, order);
>
> alloc_pages_nodes does that check anyways.

OK, I rip out that wrapper completely.


> > +/* Not all arches define cache_line_size */
> > +#ifndef cache_line_size
> > +#define cache_line_size() L1_CACHE_BYTES
> > +#endif
> > +
>
> They should. better fix them?

git grep -l -e cache_line_size arch/ | egrep '\.h$'

Only ia64, mips, powerpc, sparc, x86...

> > + /*
> > + * Determine which debug features should be switched on
> > + */
>
> It would be nicer if you could use long options. At least for me
> that would increase the probability that I could remember them
> without having to look them up.

I haven't looked closely at the debug code which is mostly straight
out of SLUB and minimal changes to get it working. Of course it is
very important, but useless if the core allocator isn't good. I
also don't want to diverge from SLUB in these areas if possible until
we reduce the number of allocators in the tree...

Long options is probably not a bad idea, though.


> > + if (unlikely(slab_poison(s)))
> > + memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> > +
> > + start += colour;
>
> One thing i was wondering. Did you try to disable the colouring and see
> if it makes much difference on modern systems? They tend to have either
> larger caches or higher associativity caches.

I have tried, but I don't think I found a test where it made a
statistically significant difference. It is not very costly to
implement, though.


> Or perhaps it could be made optional based on CPU type?

It could easily be changed, yes.



> > +
> > +again:
> > + local_irq_save(flags);
> > + object = __slab_alloc(s, gfpflags, node);
> > + local_irq_restore(flags);
>
> At least on P4 you could get some win by avoiding the local_irq_save() up in the fast
> path when __GFP_WAIT is set (because storing the eflags is very expensive there)

That's a good point, although also something trivially applicable to
all allocators and as such I prefer not to add such differences to
the SLQB patch if we are going into an evaluation phase.
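
For reference, the shape of that change would be something like this
(sketch only, not something I'm adding for now): when __GFP_WAIT is set
interrupts are known to be enabled, so a plain disable/enable pair can
replace the costly save/restore of eflags:

	if (gfpflags & __GFP_WAIT) {
		local_irq_disable();
		object = __slab_alloc(s, gfpflags, node);
		local_irq_enable();
	} else {
		local_irq_save(flags);
		object = __slab_alloc(s, gfpflags, node);
		local_irq_restore(flags);
	}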


> > +/* Initial slabs */
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> > +#endif
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> > +#endif
> > +
> > +#ifdef CONFIG_SMP
> > +static struct kmem_cache kmem_cpu_cache;
> > +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> > +#endif
> > +#endif
> > +
> > +#ifdef CONFIG_NUMA
> > +static struct kmem_cache kmem_node_cache;
> > +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> > +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> > +#endif
>
> That all needs fixing too of course.

Hmm. I was hoping it could stay simple as it is just a static constant
(for a given NR_CPUS) overhead. I wonder if bootmem is still up here?
How fine grained is it these days?

Could bite the bullet and do a multi-stage bootstap like SLUB, but I
want to try avoiding that (but init code is also of course much less
important than core code and total overheads).


> > +static void free_kmem_cache_cpus(struct kmem_cache *s)
> > +{
> > + int cpu;
> > +
> > + for_each_online_cpu(cpu) {
>
> Is this protected against racing cpu hotplugs? Doesn't look like it. Multiple occurrences.

I think you're right.
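Something like the usual bracketing should close it (pattern sketch only):

	get_online_cpus();
	for_each_online_cpu(cpu) {
		/* allocate or tear down the per-CPU structure for this cpu */
	}
	put_online_cpus();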


> > +static void cache_trim_worker(struct work_struct *w)
> > +{
> > + struct delayed_work *work =
> > + container_of(w, struct delayed_work, work);
> > + struct kmem_cache *s;
> > + int node;
> > +
> > + if (!down_read_trylock(&slqb_lock))
> > + goto out;
>
> No counter for this?

It's quite unimportant. It will only race with creating or destroying
actual kmem caches, and cache trimming is infrequent too.


> > + down_read(&slqb_lock);
> > + list_for_each_entry(s, &slab_caches, list) {
> > + /*
> > + * XXX: kmem_cache_alloc_node will fallback to other nodes
> > + * since memory is not yet available from the node that
> > + * is brought up.
> > + */
> > + if (s->node[nid]) /* could be lefover from last online */
> > + continue;
> > + n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> > + if (!n) {
> > + ret = -ENOMEM;
>
> Surely that should panic? I don't think a slab less node will
> be very useful later.

Returning error here I think will just fail the online operation?
Better than a panic :)


> > +static ssize_t align_show(struct kmem_cache *s, char *buf)
> > +{
> > + return sprintf(buf, "%d\n", s->align);
> > +}
> > +SLAB_ATTR_RO(align);
> > +
>
> When you map back to the attribute you can use a index into a table
> for the field, saving that many functions?
>
> > +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> > +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
>
> This really should be table driven, shouldn't it? That would give much
> smaller code.

Tables probably would help. I will keep it close to SLUB for now,
though.

Thanks,
Nick

2009-01-23 11:42:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 12:25:55PM +0100, Nick Piggin wrote:
> > > +#ifdef CONFIG_SLQB_SYSFS
> > > + struct kobject kobj; /* For sysfs */
> > > +#endif
> > > +#ifdef CONFIG_NUMA
> > > + struct kmem_cache_node *node[MAX_NUMNODES];
> > > +#endif
> > > +#ifdef CONFIG_SMP
> > > + struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> >
> > Those both really need to be dynamically allocated, otherwise
> > it wastes a lot of memory in the common case
> > (e.g. NR_CPUS==128 kernel on dual core system). And of course
> > on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> >
> > You could use alloc_percpu? There's no alloc_pernode
> > unfortunately, perhaps there should be one.
>
> cpu_slab is dynamically allocated, by just changing the size of
> the kmem_cache cache at boot time.

You'll always have at least the MAX_NUMNODES waste because
you cannot tell the compiler that the cpu_slab field has
moved.

> Probably the best way would
> be to have dynamic cpu and node allocs for them, I agree.

It's really needed.

> Any plans for an alloc_pernode?

It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)

> > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > + * a default closest home node via which it can use fastpath functions.
> >
> > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do
> > that too and be happy.
>
> What if the node is possible but not currently online?

Nobody should allocate on it then.

> > > +/* Not all arches define cache_line_size */
> > > +#ifndef cache_line_size
> > > +#define cache_line_size() L1_CACHE_BYTES
> > > +#endif
> > > +
> >
> > They should. better fix them?
>
> git grep -l -e cache_line_size arch/ | egrep '\.h$'
>
> Only ia64, mips, powerpc, sparc, x86...

It's straightforward to add that define everywhere.

>
> > > + if (unlikely(slab_poison(s)))
> > > + memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> > > +
> > > + start += colour;
> >
> > One thing i was wondering. Did you try to disable the colouring and see
> > if it makes much difference on modern systems? They tend to have either
> > larger caches or higher associativity caches.
>
> I have tried, but I don't think I found a test where it made a
> statistically significant difference. It is not very costly to
> implement, though.

how about the memory usage?

also this is all so complicated already that every simplification helps.

> > > +#endif
> > > +
> > > +#ifdef CONFIG_NUMA
> > > +static struct kmem_cache kmem_node_cache;
> > > +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> > > +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> > > +#endif
> >
> > That all needs fixing too of course.
>
> Hmm. I was hoping it could stay simple as it is just a static constant
> (for a given NR_CPUS) overhead.

The issue is that distro kernels typically run with NR_CPUS >>> num_possible_cpus()
And we'll see likely higher NR_CPUS (and MAX_NUMNODES) in the future,
but also still want to run the same kernels on really small systems (e.g.
Atom based) without wasting their memory.

So for anything indexed by NR_CPUS you should use per_cpu data -- that is correctly
sized automatically.

For MAX_NUMNODES we don't have anything equivalent currently, so
you would also need alloc_pernode() I guess.

Ok you can just use per cpu for them too and only use the first
entry in each node. That's cheating, but not too bad.


> I wonder if bootmem is still up here?

bootmem is finished when slab comes up.
>
> Could bite the bullet and do a multi-stage bootstap like SLUB, but I
> want to try avoiding that (but init code is also of course much less
> important than core code and total overheads).

For DEFINE_PER_CPU you don't need special allocation.

Probably want a DEFINE_PER_NODE() for this or see above.

>
> > > +static ssize_t align_show(struct kmem_cache *s, char *buf)
> > > +{
> > > + return sprintf(buf, "%d\n", s->align);
> > > +}
> > > +SLAB_ATTR_RO(align);
> > > +
> >
> > When you map back to the attribute you can use a index into a table
> > for the field, saving that many functions?
> >
> > > +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> > > +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
> >
> > This really should be table driven, shouldn't it? That would give much
> > smaller code.
>
> Tables probably would help. I will keep it close to SLUB for now,
> though.

Hmm, then fix slub?

-Andi

--
[email protected] -- Speaking for myself only.

2009-01-23 12:55:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 10:55:26AM +0100, Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
> > +#ifdef CONFIG_NUMA
> > +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> > +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> > +
> > +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>
> kmalloc_node should be infrequent, i suspect it can be safely out of lined.

Hmm, it only takes up another couple of hundred bytes for a full
numa kernel. Completely out of lining it can take a slightly slower
path and makes the code slightly different from the kmalloc case.
So I'll leave this change for now.

2009-01-23 12:57:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator


* Nick Piggin <[email protected]> wrote:

> On Wed, Jan 21, 2009 at 06:40:10PM +0100, Ingo Molnar wrote:
> > -static inline void slqb_stat_inc(struct kmem_cache_list *list,
> > - enum stat_item si)
> > +static inline void
> > +slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
> > {
>
> Hmm, I'm not entirely fond of this style. [...]

well, it's a borderline situation and a nuance, and i think we agree on
the two (much more common) boundary conditions:

1) line fits into 80 cols - in that case we keep it all on a single line
(this is the ideal case)

2) line does not fit on two lines either - in that case we do the style
that you used above.

On the boundary there's a special case though, and i tend to prefer:

+static inline void
+slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)

over:

-static inline void slqb_stat_inc(struct kmem_cache_list *list,
- enum stat_item si)

for three reasons:

1) the line break is not just arbitrarily in the middle of the
enumeration of arguments - it is right after function return type.

2) the arguments fit on a single line - and often one wants to know that
signature. (return values are usually a separate thought)

3) the return type stands out much better.

But again ... this is a nuance.

> [...] The former scales to longer lines with just a single style change
> (putting args into new lines), wheras the latter first moves its
> prefixes to a newline, then moves args as the line grows even longer.

the moment this 'boundary style' "overflows", it falls back to the 'lots
of lines' case, where we generally put the function return type and the
function name on the first line.

> I guess it is a matter of taste, not wrong either way... but I think
> most of the mm code I'm used to looking at uses the former. Do you feel
> strongly?

there are a handful of cases where the return type (and the function
attributes) are _really_ long - in this case it really helps to have them
decoupled from the arguments.

Ingo

2009-01-23 13:18:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 12:57:31PM +0100, Andi Kleen wrote:
> On Fri, Jan 23, 2009 at 12:25:55PM +0100, Nick Piggin wrote:
> > > > +#ifdef CONFIG_SLQB_SYSFS
> > > > + struct kobject kobj; /* For sysfs */
> > > > +#endif
> > > > +#ifdef CONFIG_NUMA
> > > > + struct kmem_cache_node *node[MAX_NUMNODES];
> > > > +#endif
> > > > +#ifdef CONFIG_SMP
> > > > + struct kmem_cache_cpu *cpu_slab[NR_CPUS];
> > >
> > > Those both really need to be dynamically allocated, otherwise
> > > it wastes a lot of memory in the common case
> > > (e.g. NR_CPUS==128 kernel on dual core system). And of course
> > > on the proposed NR_CPUS==4096 kernels it becomes prohibitive.
> > >
> > > You could use alloc_percpu? There's no alloc_pernode
> > > unfortunately, perhaps there should be one.
> >
> > cpu_slab is dynamically allocated, by just changing the size of
> > the kmem_cache cache at boot time.
>
> You'll always have at least the MAX_NUMNODES waste because
> you cannot tell the compiler that the cpu_slab field has
> moved.

Right. It could go into a completely different per-cpu structure
if needed to work around that (using node is a relatively rare
operation). But an alloc_pernode would be nicer.


> > Probably the best way would
> > be to have dynamic cpu and node allocs for them, I agree.
>
> It's really needed.
>
> > Any plans for an alloc_pernode?
>
> It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)

Just if you knew about plans. I won't get too much time to work on
it next week, so I hope to have something in slab tree in the
meantime. I think it is OK to leave now, with a mind to improving
it before a possible mainline merge (there will possibly be more
serious issues discovered anyway).


> > > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > > + * a default closest home node via which it can use fastpath functions.
> > >
> > > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do
> > > that too and be happy.
> >
> > What if the node is possible but not currently online?
>
> Nobody should allocate on it then.

But then it goes online and what happens? Your numa_node_id() changes?
How does that work? Or you mean x86-64 does not do that same trick for
possible but offline nodes?


> > git grep -l -e cache_line_size arch/ | egrep '\.h$'
> >
> > Only ia64, mips, powerpc, sparc, x86...
>
> It's straight forward to that define everywhere.

OK, but this code is just copied straight from SLAB... I don't want
to add such a dependency at this point, when I'm trying to get something
reasonable to merge. But it would be a fine cleanup.


> > > One thing i was wondering. Did you try to disable the colouring and see
> > > if it makes much difference on modern systems? They tend to have either
> > > larger caches or higher associativity caches.
> >
> > I have tried, but I don't think I found a test where it made a
> > statistically significant difference. It is not very costly to
> > implement, though.
>
> how about the memory usage?
>
> also this is all so complicated already that every simplification helps.

Oh, it only uses slack space in the slabs as such, so it should be
almost zero cost. I tried testing extra colour at the cost of space, but
no obvious difference there either. But I think I'll leave in the code
because it might be a win for some embedded or unusual CPUs.


> > Could bite the bullet and do a multi-stage bootstap like SLUB, but I
> > want to try avoiding that (but init code is also of course much less
> > important than core code and total overheads).
>
> For DEFINE_PER_CPU you don't need special allocation.
>
> Probably want a DEFINE_PER_NODE() for this or see above.

Ah yes DEFINE_PER_CPU of course. Not quite correct for per-node data,
but it should be good enough for wider testing in linux-next.


> > Tables probably would help. I will keep it close to SLUB for now,
> > though.
>
> Hmm, then fix slub?

That's my plan, but I go about it a different way ;) I don't want to
spend too much time on other allocators or cleanup code
right now (except cleanups in SLQB, which of course are required).

Here is an incremental patch for your review points. Thanks very much,
it's a big improvement (getting rid of those static arrays vastly
decreases memory consumption with bigger NR_CPUS, so that's a good
start; will need to investigate alloc_percpu / pernode etc., but that
may have to wait until next week).

---
include/linux/slab.h | 4 +
include/linux/slqb_def.h | 10 +++
mm/slqb.c | 125 ++++++++++++++++++++++++++---------------------
3 files changed, 82 insertions(+), 57 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -65,6 +65,10 @@
/* The following flags affect the page allocator grouping pages by mobility */
#define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
+
+/* Following flags should only be used by allocator specific flags */
+#define SLAB_ALLOC_PRIVATE 0x000000ffUL
+
/*
* ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
*
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- linux-2.6.orig/include/linux/slqb_def.h
+++ linux-2.6/include/linux/slqb_def.h
@@ -15,6 +15,8 @@
#include <linux/kernel.h>
#include <linux/kobject.h>

+#define SLAB_NUMA 0x00000001UL /* shortcut */
+
enum stat_item {
ALLOC, /* Allocation count */
ALLOC_SLAB_FILL, /* Fill freelist from page list */
@@ -224,12 +226,16 @@ static __always_inline int kmalloc_index

/*
* Find the kmalloc slab cache for a given combination of allocation flags and
- * size.
+ * size. Should really only be used for constant 'size' arguments, due to
+ * bloat.
*/
static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
{
- int index = kmalloc_index(size);
+ int index;
+
+ BUILD_BUG_ON(!__builtin_constant_p(size));

+ index = kmalloc_index(size);
if (unlikely(index == 0))
return NULL;

Index: linux-2.6/mm/slqb.c
===================================================================
--- linux-2.6.orig/mm/slqb.c
+++ linux-2.6/mm/slqb.c
@@ -58,9 +58,15 @@ static inline void struct_slqb_page_wron

static int kmem_size __read_mostly;
#ifdef CONFIG_NUMA
-static int numa_platform __read_mostly;
+static inline int slab_numa(struct kmem_cache *s)
+{
+ return s->flags & SLAB_NUMA;
+}
#else
-static const int numa_platform = 0;
+static inline int slab_numa(struct kmem_cache *s)
+{
+ return 0;
+}
#endif

static inline int slab_hiwater(struct kmem_cache *s)
@@ -166,19 +172,6 @@ static inline struct slqb_page *virt_to_
return (struct slqb_page *)p;
}

-static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
- unsigned int order)
-{
- struct page *p;
-
- if (nid == -1)
- p = alloc_pages(flags, order);
- else
- p = alloc_pages_node(nid, flags, order);
-
- return (struct slqb_page *)p;
-}
-
static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
{
struct page *p = &page->page;
@@ -231,8 +224,16 @@ static inline int slab_poison(struct kme
static struct notifier_block slab_notifier;
#endif

-/* A list of all slab caches on the system */
+/*
+ * slqb_lock protects slab_caches list and serialises hotplug operations.
+ * hotplug operations take lock for write, other operations can hold off
+ * hotplug by taking it for read (or write).
+ */
static DECLARE_RWSEM(slqb_lock);
+
+/*
+ * A list of all slab caches on the system
+ */
static LIST_HEAD(slab_caches);

/*
@@ -875,6 +876,9 @@ static unsigned long kmem_cache_flags(un
strlen(slqb_debug_slabs)) == 0))
flags |= slqb_debug;

+ if (num_possible_nodes() > 1)
+ flags |= SLAB_NUMA;
+
return flags;
}
#else
@@ -913,6 +917,8 @@ static inline void add_full(struct kmem_
static inline unsigned long kmem_cache_flags(unsigned long objsize,
unsigned long flags, const char *name, void (*ctor)(void *))
{
+ if (num_possible_nodes() > 1)
+ flags |= SLAB_NUMA;
return flags;
}

@@ -930,7 +936,7 @@ static struct slqb_page *allocate_slab(s

flags |= s->allocflags;

- page = alloc_slqb_pages_node(node, flags, s->order);
+ page = (struct slqb_page *)alloc_pages_node(node, flags, s->order);
if (!page)
return NULL;

@@ -1296,8 +1302,6 @@ static noinline void *__slab_alloc_page(
if (c->colour_next >= s->colour_range)
c->colour_next = 0;

- /* XXX: load any partial? */
-
/* Caller handles __GFP_ZERO */
gfpflags &= ~__GFP_ZERO;

@@ -1622,7 +1626,7 @@ static __always_inline void __slab_free(

slqb_stat_inc(l, FREE);

- if (!NUMA_BUILD || !numa_platform ||
+ if (!NUMA_BUILD || !slab_numa(s) ||
likely(slqb_page_to_nid(page) == numa_node_id())) {
/*
* Freeing fastpath. Collects all local-node objects, not
@@ -1676,7 +1680,7 @@ void kmem_cache_free(struct kmem_cache *
{
struct slqb_page *page = NULL;

- if (numa_platform)
+ if (slab_numa(s))
page = virt_to_head_slqb_page(object);
slab_free(s, page, object);
}
@@ -1816,26 +1820,28 @@ static void init_kmem_cache_node(struct
}
#endif

-/* Initial slabs */
+/* Initial slabs. XXX: allocate dynamically (with bootmem maybe) */
#ifdef CONFIG_SMP
-static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
#endif
#ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+/* XXX: really need a DEFINE_PER_NODE for per-node data, but this is better than
+ * a static array */
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
#endif

#ifdef CONFIG_SMP
static struct kmem_cache kmem_cpu_cache;
-static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
#ifdef CONFIG_NUMA
-static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
#endif
#endif

#ifdef CONFIG_NUMA
static struct kmem_cache kmem_node_cache;
-static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
-static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
+static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
#endif

#ifdef CONFIG_SMP
@@ -2090,15 +2096,15 @@ static int kmem_cache_open(struct kmem_c
s->colour_range = 0;
}

+ down_write(&slqb_lock);
if (likely(alloc)) {
if (!alloc_kmem_cache_nodes(s))
- goto error;
+ goto error_lock;

if (!alloc_kmem_cache_cpus(s))
goto error_nodes;
}

- down_write(&slqb_lock);
sysfs_slab_add(s);
list_add(&s->list, &slab_caches);
up_write(&slqb_lock);
@@ -2107,6 +2113,8 @@ static int kmem_cache_open(struct kmem_c

error_nodes:
free_kmem_cache_nodes(s);
+error_lock:
+ up_write(&slqb_lock);
error:
if (flags & SLAB_PANIC)
panic("kmem_cache_create(): failed to create slab `%s'\n", name);
@@ -2180,7 +2188,6 @@ void kmem_cache_destroy(struct kmem_cach

down_write(&slqb_lock);
list_del(&s->list);
- up_write(&slqb_lock);

#ifdef CONFIG_SMP
for_each_online_cpu(cpu) {
@@ -2230,6 +2237,7 @@ void kmem_cache_destroy(struct kmem_cach
#endif

sysfs_slab_remove(s);
+ up_write(&slqb_lock);
}
EXPORT_SYMBOL(kmem_cache_destroy);

@@ -2603,7 +2611,7 @@ static int slab_mem_going_online_callbac
* allocate a kmem_cache_node structure in order to bring the node
* online.
*/
- down_read(&slqb_lock);
+ down_write(&slqb_lock);
list_for_each_entry(s, &slab_caches, list) {
/*
* XXX: kmem_cache_alloc_node will fallback to other nodes
@@ -2621,7 +2629,7 @@ static int slab_mem_going_online_callbac
s->node[nid] = n;
}
out:
- up_read(&slqb_lock);
+ up_write(&slqb_lock);
return ret;
}

@@ -2665,13 +2673,6 @@ void __init kmem_cache_init(void)
* All the ifdefs are rather ugly here, but it's just the setup code,
* so it doesn't have to be too readable :)
*/
-#ifdef CONFIG_NUMA
- if (num_possible_nodes() == 1)
- numa_platform = 0;
- else
- numa_platform = 1;
-#endif
-
#ifdef CONFIG_SMP
kmem_size = offsetof(struct kmem_cache, cpu_slab) +
nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
@@ -2692,15 +2693,20 @@ void __init kmem_cache_init(void)

#ifdef CONFIG_SMP
for_each_possible_cpu(i) {
- init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
- kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+ struct kmem_cache_cpu *c;

- init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
- kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+ c = &per_cpu(kmem_cache_cpus, i);
+ init_kmem_cache_cpu(&kmem_cache_cache, c);
+ kmem_cache_cache.cpu_slab[i] = c;
+
+ c = &per_cpu(kmem_cpu_cpus, i);
+ init_kmem_cache_cpu(&kmem_cpu_cache, c);
+ kmem_cpu_cache.cpu_slab[i] = c;

#ifdef CONFIG_NUMA
- init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
- kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+ c = &per_cpu(kmem_node_cpus, i);
+ init_kmem_cache_cpu(&kmem_node_cache, c);
+ kmem_node_cache.cpu_slab[i] = c;
#endif
}
#else
@@ -2709,14 +2715,19 @@ void __init kmem_cache_init(void)

#ifdef CONFIG_NUMA
for_each_node_state(i, N_NORMAL_MEMORY) {
- init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
- kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
-
- init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
- kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+ struct kmem_cache_node *n;

- init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
- kmem_node_cache.node[i] = &kmem_node_nodes[i];
+ n = &per_cpu(kmem_cache_nodes, i);
+ init_kmem_cache_node(&kmem_cache_cache, n);
+ kmem_cache_cache.node[i] = n;
+
+ n = &per_cpu(kmem_cpu_nodes, i);
+ init_kmem_cache_node(&kmem_cpu_cache, n);
+ kmem_cpu_cache.node[i] = n;
+
+ n = &per_cpu(kmem_node_nodes, i);
+ init_kmem_cache_node(&kmem_node_cache, n);
+ kmem_node_cache.node[i] = n;
}
#endif

@@ -2883,7 +2894,7 @@ static int __cpuinit slab_cpuup_callback
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- down_read(&slqb_lock);
+ down_write(&slqb_lock);
list_for_each_entry(s, &slab_caches, list) {
if (s->cpu_slab[cpu]) /* could be lefover last online */
continue;
@@ -2893,7 +2904,7 @@ static int __cpuinit slab_cpuup_callback
return NOTIFY_BAD;
}
}
- up_read(&slqb_lock);
+ up_write(&slqb_lock);
break;

case CPU_ONLINE:
@@ -3019,6 +3030,8 @@ static void gather_stats(struct kmem_cac
stats->s = s;
spin_lock_init(&stats->lock);

+ down_read(&slqb_lock); /* hold off hotplug */
+
on_each_cpu(__gather_stats, stats, 1);

#ifdef CONFIG_NUMA
@@ -3047,6 +3060,8 @@ static void gather_stats(struct kmem_cac
}
#endif

+ up_read(&slqb_lock);
+
stats->nr_objects = stats->nr_slabs * s->objects;
}
#endif

2009-01-23 13:35:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:
>
> ... Would you be able to test with this updated patch
> (which also includes Hugh's fix ...

In fact not: claim_remote_free_list() still has the offending unlocked
+ VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);

Hugh

2009-01-23 13:44:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 01:34:49PM +0000, Hugh Dickins wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
> >
> > ... Would you be able to test with this updated patch
> > (which also includes Hugh's fix ...
>
> In fact not: claim_remote_free_list() still has the offending unlocked
> + VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);

Doh, thanks. It turned out to still miss a few cases where it wasn't
checking for memoryless nodes (Andi explains why I didn't see it
on x86-64: because it handles the case differently and assigns
the default node to the nearest one with memory, I think).

Working on a new version, so I've definitely got your bug covered
now :)

2009-01-23 13:48:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

[dropping Lameter's outdated address]

On Fri, Jan 23, 2009 at 02:18:00PM +0100, Nick Piggin wrote:
>
> > > Probably the best way would
> > > be to have dynamic cpu and node allocs for them, I agree.
> >
> > It's really needed.
> >
> > > Any plans for an alloc_pernode?
> >
> > It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)
>
> Just if you knew about plans. I won't get too much time to work on

Not aware of anyone working on it.

> it next week, so I hope to have something in slab tree in the
> meantime. I think it is OK to leave now, with a mind to improving

Sorry, the NR_CPUS/MAX_NUMNODES arrays are a merge blocker imho
because they explode with CONFIG_MAXSMP.

> it before a possible mainline merge (there will possibly be more
> serious issues discovered anyway).

I see you fixed the static arrays.

Doing the same for the kmem_cache arrays by making them pointers
and then using num_possible_{cpus,nodes}() would seem straightforward,
wouldn't it?

Although I think I would prefer alloc_percpu, possibly with
per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)
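
Roughly like this (just a sketch, with per_cpu_ptr()'s argument order
spelled out; it assumes each node of interest has at least one possible CPU):

/*
 * Sketch only, not actual SLQB code: carve the per-node data out of
 * percpu space, set up once with
 *	node_data = alloc_percpu(struct kmem_cache_node);
 */
static struct kmem_cache_node *node_data;

static struct kmem_cache_node *get_node_data(int node)
{
	int cpu = first_cpu(node_to_cpumask(node));

	return per_cpu_ptr(node_data, cpu);
}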

> > > > > + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> > > > > + * a default closest home node via which it can use fastpath functions.
> > > >
> > > > FWIW that is what x86-64 always did. Perhaps you can just fix ia64 to do
> > > > that too and be happy.
> > >
> > > What if the node is possible but not currently online?
> >
> > Nobody should allocate on it then.
>
> But then it goes online and what happens?

You already have a node online notifier that should handle that then, don't you?

x86-64 btw currently doesn't support node hotplug (but I expect it will
be added at some point), but it should be ok even on architectures
that do.

> Your numa_node_id() changes?

What do you mean?

> How does that work? Or you mean x86-64 does not do that same trick for
> possible but offline nodes?

All I'm saying is that when x86-64 finds a memoryless node it assigns
its CPUs to other nodes. Hmm ok perhaps there's a backdoor when someone
sets it with kmalloc_node() but that should normally not happen I think.

>
> > > git grep -l -e cache_line_size arch/ | egrep '\.h$'
> > >
> > > Only ia64, mips, powerpc, sparc, x86...
> >
> > It's straightforward to add that define everywhere.
>
> OK, but this code is just copied straight from SLAB... I don't want
> to add such a dependency at this point; I'm trying to get something

I'm sure such a straightforward change could still be put into .29

> reasonable to merge. But it would be a fine cleanup.

Hmm to be honest it's a little weird to post so much code and then
say you can't change large parts of it.

Could you perhaps mark all the code you don't want to change?

I'm not sure I follow the rationale for not changing code that has been
copied from elsewhere. If you copied it why can't you change it?

> >
> > Hmm, then fix slub?
>
> That's my plan, but I go about it a different way ;) I don't want to
> spend too much time on other allocators or on cleanup etc. code right
> now (except cleanups in SLQB, which of course are required).

But still if you copy code from slub you can improve it, can't you?
The sysfs code definitely could be done much nicer (ok for small values
of "nice"; sysfs is always ugly of course @). But at least it can be
done in a way that doesn't bloat the text so much.

Thanks for the patch.

One thing I'm not sure about is using a private lock to hold off hotplug.
I don't have a concrete scenario, but it makes me uneasy considering
deadlocks when someone sleeps etc. Safer is get/put_online_cpus()
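
I.e. roughly this around e.g. the stats gathering (sketch only):

	get_online_cpus();		/* pin CPU hotplug instead of taking slqb_lock */
	on_each_cpu(__gather_stats, stats, 1);
	/* ... walk the per-node lists ... */
	put_online_cpus();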

-Andi
--
[email protected] -- Speaking for myself only.

2009-01-23 13:58:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:
> On Wed, Jan 21, 2009 at 06:10:12PM +0000, Hugh Dickins wrote:
> >
> > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > more than SLAB) with swapping loads on most of my machines. Though
> > oddly one seems immune, and another takes four times as long: guess
> > it depends on how close to thrashing, but probably more to investigate
> > there. I think my original SLUB versus SLAB comparisons were done on
> > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > loads when SLUB came in, but even with boot option slub_max_order=1,
> > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > FWIW - swapping loads are not what anybody should tune for.
>
> Yeah, that's to be expected with higher order allocations I think. Does
> your immune machine simply have fewer CPUs and thus doesn't use such
> high order allocations?

No, it's just one of the quads. Whereas the worst affected (laptop)
is a duo. I should probably be worrying more about that one: it may
be that I'm thrashing it and its results are meaningless, though still
curious that slab and slqb and slob all do so markedly better on it.

It's behaving much better with slub_max_order=1 slub_min_objects=4,
but to get competitive I've had to switch off most of the debugging
options I usually have on that one - and I've not yet tried slab,
slqb and slob with those off too. Hmm, it looks like it's getting
progressively slower.

I'll continue to investigate at leisure,
but can't give it too much attention.

Hugh

2009-01-23 14:23:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 22 Jan 2009, Hugh Dickins wrote:
> On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <[email protected]> wrote:
> > >
> > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > more than SLAB) with swapping loads on most of my machines. Though
> > > oddly one seems immune, and another takes four times as long: guess
> > > it depends on how close to thrashing, but probably more to investigate
> > > there. I think my original SLUB versus SLAB comparisons were done on
> > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > FWIW - swapping loads are not what anybody should tune for.
> >
> > What kind of machine are you seeing this on? It sounds like it could
> > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > ("slub: Calculate min_objects based on number of processors").
>
> Thanks, yes, that could well account for the residual difference: the
> machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> has effectively become slub_min_objects=12 or slub_min_objects=16.
>
> I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> lines (though I'll need to curtail tests on a couple of machines),
> and will report back later.

Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
swapping load. I've not tried slub_max_order=0, but I'm running
with 8kB stacks, so order 1 seems a reasonable choice.

I can't say where I pulled that "e.g. 2% slower" from: on different
machines slub was 5% or 10% or 20% slower than slab and slqb even with
slub_max_order=1 (but not significantly slower on the "immune" machine).
How much slub_min_objects=4 helps again varies widely, between halving
and eliminating the difference.

But I think it's more important that I focus on the worst case machine,
try to understand what's going on there.

Hugh

2009-01-23 14:28:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 03:04:06PM +0100, Andi Kleen wrote:
> [dropping Lameter's outdated address]
>
> On Fri, Jan 23, 2009 at 02:18:00PM +0100, Nick Piggin wrote:
> >
> > > > Probably the best way would
> > > > be to have dynamic cpu and node allocs for them, I agree.
> > >
> > > It's really needed.
> > >
> > > > Any plans for an alloc_pernode?
> > >
> > > It shouldn't be very hard to implement. Or do you ask if I'm volunteering? @)
> >
> > Just if you knew about plans. I won't get too much time to work on
>
> Not aware of anyone working on it.
>
> > it next week, so I hope to have something in slab tree in the
> > meantime. I think it is OK to leave now, with a mind to improving
>
> Sorry, the NR_CPUS/MAX_NUMNODES arrays are a merge blocker imho
> because they explode with CONFIG_MAXSMP.

This is a linux-next merge, I'm talking about. The point is to get
some parallelism between testing and making slqb perfect (not because
I don't agree with the problem you point out).


> > it before a possible mainline merge (there will possibly be more
> > serious issues discovered anyway).
>
> I see you fixed the static arrays.
>
> Doing the same for the kmem_cache arrays by making them pointers
> and then using num_possible_{cpus,nodes}() would seem straightforward,
> wouldn't it?

Hmm, yes, that might be the way to go. I'll do that with the node
array; the cpu array can stay where it is (this reduces cacheline
footprint for small NR_CPUS configs).
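
I.e. at cache creation time something like this (a sketch of the direction,
not a tested patch; s->node becomes a pointer sized by nr_node_ids, and the
boot-time caches would still need special handling):

	s->node = kzalloc(nr_node_ids * sizeof(struct kmem_cache_node *),
			  GFP_KERNEL);
	if (!s->node)
		goto error;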


> Although I think I would prefer alloc_percpu, possibly with
> per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)

I don't think we have the NUMA information available early enough
to do that. But it would be the best idea indeed because it would
take advantage of improvements in the percpu allocator.


> > But then it goes online and what happens?
>
> You already have a node online notifier that should handle that then, don't you?
>
> x86-64 btw currently doesn't support node hotplug (but I expect it will
> be added at some point), but it should be ok even on architectures
> that do.
>
> > Your numa_node_id() changes?
>
> What do you mean?
>
> > How does that work? Or you mean x86-64 does not do that same trick for
> > possible but offline nodes?
>
> All I'm saying is that when x86-64 finds a memoryless node it assigns
> its CPUs to other nodes. Hmm ok perhaps there's a backdoor when someone
> sets it with kmalloc_node() but that should normally not happen I think.

OK, but if it is _possible_ for the node to gain memory, then you
can't do that of course. If the node is always memoryless then yes
I think it is probably a good idea to just assign it to the closest node
with memory.


> > OK, but this code is just copied straight from SLAB... I don't want
> > to add such a dependency at this point; I'm trying to get something
>
> I'm sure such a straightforward change could still be put into .29
>
> > reasonable to merge. But it would be a fine cleanup.
>
> Hmm to be honest it's a little weird to post so much code and then
> say you can't change large parts of it.

The cache_line_size() change wouldn't change slqb code significantly.
I have no problem with it, but I simply won't have time to do it and
test all architectures and get them merged and hold off merging
SLQB until they all get merged.


> Could you perhaps mark all the code you don't want to change?

Primarily the debug code from SLUB.


> I'm not sure I follow the rationale for not changing code that has been
> copied from elsewhere. If you copied it why can't you change it?

I have, very extensively. Just diff mm/slqb.c mm/slub.c ;)

The point of not cleaning up peripheral (non-core) code that works and
exists upstream is that it will actually be less hassle for me to
maintain. By all means make improvements to the slub version, which I can
then pull into slqb.


> > That's my plan, but I go about it a different way ;) I don't want to
> > spend too much time on other allocators or on cleanup etc. code right
> > now (except cleanups in SLQB, which of course are required).
>
> But still if you copy code from slub you can improve it, can't you?
> The sysfs code definitely could be done much nicer (ok for small values
> of "nice"; sysfs is always ugly of course @). But at least it can be
> done in a way that doesn't bloat the text so much.

I'm definitely not averse to cleanups at all, but I just want to try to
avoid duplicating work or diverging when it is not necessary, which makes
it harder to track fixes etc. Just at this point in development...


> Thanks for the patch.
>
> One thing I'm not sure about is using a private lock to hold off hotplug.
> I don't have a concrete scenario, but it makes me uneasy considering
> deadlocks when someone sleeps etc. Safer is get/put_online_cpus()

I think it is OK, considering those locks must usually be taken anyway
in the path; I've just tended to widen the coverage. But I'll think about
whether anything can be improved with the get/put API.

2009-01-23 14:30:29

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Hugh,

On Wed, Jan 21, 2009 at 8:10 PM, Hugh Dickins <[email protected]> wrote:
> > > > That's been making SLUB behave pretty badly (e.g. elapsed time 30%
> > > > more than SLAB) with swapping loads on most of my machines. Though
> > > > oddly one seems immune, and another takes four times as long: guess
> > > > it depends on how close to thrashing, but probably more to investigate
> > > > there. I think my original SLUB versus SLAB comparisons were done on
> > > > the immune one: as I remember, SLUB and SLAB were equivalent on those
> > > > loads when SLUB came in, but even with boot option slub_max_order=1,
> > > > SLUB is still slower than SLAB on such tests (e.g. 2% slower).
> > > > FWIW - swapping loads are not what anybody should tune for.

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > What kind of machine are you seeing this on? It sounds like it could
> > > be a side-effect from commit 9b2cd506e5f2117f94c28a0040bf5da058105316
> > > ("slub: Calculate min_objects based on number of processors").

On Thu, 22 Jan 2009, Hugh Dickins wrote:
> > Thanks, yes, that could well account for the residual difference: the
> > machines in question have 2 or 4 cpus, so the old slub_min_objects=4
> > has effectively become slub_min_objects=12 or slub_min_objects=16.
> >
> > I'm now trying with slub_max_order=1 slub_min_objects=4 on the boot
> > lines (though I'll need to curtail tests on a couple of machines),
> > and will report back later.

On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> Yes, slub_max_order=1 with slub_min_objects=4 certainly helps this
> swapping load. I've not tried slub_max_order=0, but I'm running
> with 8kB stacks, so order 1 seems a reasonable choice.

Yanmin/Christoph, maybe we should revisit the min objects logic in
calculate_order()?

On Fri, 2009-01-23 at 14:23 +0000, Hugh Dickins wrote:
> I can't say where I pulled that "e.g. 2% slower" from: on different
> machines slub was 5% or 10% or 20% slower than slab and slqb even with
> slub_max_order=1 (but not significantly slower on the "immune" machine).
> How much slub_min_objects=4 helps again varies widely, between halving
> and eliminating the difference.
>
> But I think it's more important that I focus on the worst case machine,
> try to understand what's going on there.

Yeah. Oprofile and CONFIG_SLUB_STATS are usually quite helpful. You
might want to test the included patch which targets one known SLAB vs.
SLUB regression discovered quite recently.

Pekka

Subject: [PATCH] SLUB: revert direct page allocator pass through
From: Pekka Enberg <[email protected]>

This patch reverts page allocator pass-through logic from the SLUB allocator.

Commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
page size or higher kmalloc requests") added page allocator pass-through to the
SLUB allocator for large sized allocations. This, however, results in a
performance regression compared to SLAB in the netperf UDP-U-4k test.

The regression comes from the kfree(skb->head) call in skb_release_data() that
is subject to page allocator pass-through as the size passed to __alloc_skb()
is larger than 4 KB in this test. With this patch, the performance regression
is almost closed:

<insert numbers here>

Reported-by: "Zhang, Yanmin" <[email protected]>
Tested-by: "Zhang, Yanmin" <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
---

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
* We keep the general caches in an array of slab caches that are used for
* 2^x bytes of allocations.
*/
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];

/*
* Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
if (!size)
return 0;

+ if (size > KMALLOC_MAX_SIZE)
+ return -1;
+
if (size <= KMALLOC_MIN_SIZE)
return KMALLOC_SHIFT_LOW;

@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 1024) return 10;
if (size <= 2 * 1024) return 11;
if (size <= 4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
if (size <= 8 * 1024) return 13;
if (size <= 16 * 1024) return 14;
if (size <= 32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 512 * 1024) return 19;
if (size <= 1024 * 1024) return 20;
if (size <= 2 * 1024 * 1024) return 21;
+ if (size <= 4 * 1024 * 1024) return 22;
+ if (size <= 8 * 1024 * 1024) return 23;
+ if (size <= 16 * 1024 * 1024) return 24;
+ if (size <= 32 * 1024 * 1024) return 25;
return -1;

/*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
if (index == 0)
return NULL;

+ /*
+ * This function only gets expanded if __builtin_constant_p(size), so
+ * testing it here shouldn't be needed. But some versions of gcc need
+ * help.
+ */
+ if (__builtin_constant_p(size) && index < 0) {
+ /*
+ * Generate a link failure. Would be great if we could
+ * do something to stop the compile here.
+ */
+ extern void __kmalloc_size_too_large(void);
+ __kmalloc_size_too_large();
+ }
return &kmalloc_caches[index];
}

@@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
void *__kmalloc(size_t size, gfp_t flags);

-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
- return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
- if (size > PAGE_SIZE)
- return kmalloc_large(size, flags);
-
if (!(flags & SLUB_DMA)) {
struct kmem_cache *s = kmalloc_slab(size);

diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
* Kmalloc subsystem
*******************************************************************/

-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
EXPORT_SYMBOL(kmalloc_caches);

static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
}

#ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];

static void sysfs_add_func(struct work_struct *w)
{
@@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags)
return ZERO_SIZE_PTR;

index = size_index[(size - 1) / 8];
- } else
+ } else {
+ if (size > KMALLOC_MAX_SIZE)
+ return NULL;
+
index = fls(size - 1);
+ }

#ifdef CONFIG_ZONE_DMA
if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large(size, flags);
-
s = get_slab(size, flags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags)
}
EXPORT_SYMBOL(__kmalloc);

-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
- struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
- get_order(size));
-
- if (page)
- return page_address(page);
- else
- return NULL;
-}
-
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large_node(size, flags, node);
-
s = get_slab(size, flags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@ void kfree(const void *x)
return;

page = virt_to_head_page(x);
- if (unlikely(!PageSlab(page))) {
- BUG_ON(!PageCompound(page));
- put_page(page);
+ if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
return;
- }
slab_free(page->slab, page, object, _RET_IP_);
}
EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void)
caches++;
}

- for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
create_kmalloc_cache(&kmalloc_caches[i],
"kmalloc", 1 << i, GFP_KERNEL);
caches++;
@@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void)
slab_state = UP;

/* Provide the correct kmalloc names now that the caches are up */
- for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
kmalloc_caches[i]. name =
kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);

@@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large(size, gfpflags);
-
s = get_slab(size, gfpflags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large_node(size, gfpflags, node);
-
s = get_slab(size, gfpflags);

if (unlikely(ZERO_OR_NULL_PTR(s)))

2009-01-23 14:51:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 03:27:53PM +0100, Nick Piggin wrote:
>
> > Although I think I would prefer alloc_percpu, possibly with
> > per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)
>
> I don't think we have the NUMA information available early enough
> to do that.

How early? At mem_init time it should be there because bootmem needed
it already ("it" meaning the architectural-level NUMA information).

> OK, but if it is _possible_ for the node to gain memory, then you
> can't do that of course.

In theory it could gain memory through memory hotplug.

> > I'm sure such a straightforward change could still be put into .29
> >
> > > reasonable to merge. But it would be a fine cleanup.
> >
> > Hmm to be honest it's a little weird to post so much code and then
> > say you can't change large parts of it.
>
> The cache_line_size() change wouldn't change slqb code significantly.
> I have no problem with it, but I simply won't have time to do it and
> test all architectures and get them merged and hold off merging
> SLQB until they all get merged.

I was mainly refering to the sysfs code here.


> > Could you perhaps mark all the code you don't want to change?
>
> Primarily the debug code from SLUB.

Ok so you could fix the sysfs code? @)

Anyways, if you have such shared pieces perhaps it would be better
if you just pull them all out into a separate file.

-Andi
--
[email protected] -- Speaking for myself only.

2009-01-23 15:15:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 04:06:32PM +0100, Andi Kleen wrote:
> On Fri, Jan 23, 2009 at 03:27:53PM +0100, Nick Piggin wrote:
> >
> > > Although I think I would prefer alloc_percpu, possibly with
> > > per_cpu_ptr(first_cpu(node_to_cpumask(node)), ...)
> >
> > I don't think we have the NUMA information available early enough
> > to do that.
>
> How early? At mem_init time it should be there because bootmem needed
> it already ("it" meaning the architectural-level NUMA information).

node_to_cpumask(0) returned 0 at kmem_cache_init time.


> > OK, but if it is _possible_ for the node to gain memory, then you
> > can't do that of course.
>
> In theory it could gain memory through memory hotplug.

Yes.


> > The cache_line_size() change wouldn't change slqb code significantly.
> > I have no problem with it, but I simply won't have time to do it and
> > test all architectures and get them merged and hold off merging
> > SLQB until they all get merged.
>
> I was mainly refering to the sysfs code here.

OK.


> > > Could you perhaps mark all the code you don't want to change?
> >
> > Primarily the debug code from SLUB.
>
> Ok so you could fix the sysfs code? @)
>
> Anyways, if you have such shared pieces perhaps it would be better
> if you just pull them all out into a separate file.

I'll see. I do plan to try making improvements to this peripheral
code but it just has to wait a little bit for other improvements
first.

2009-01-23 15:37:59

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > That is, a list of pages that could be returned to the page allocator
> > but are pooled in SLUB to avoid the page allocator overhead. Note that
> > this will not help allocators that trigger page allocator pass-through.

On Fri, 2009-01-23 at 10:32 -0500, Christoph Lameter wrote:
> We use the partial list for that.

Even if the slab is totally empty?

2009-01-23 15:56:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> On Thu, 22 Jan 2009, Pekka Enberg wrote:
> > > That is, a list of pages that could be returned to the page allocator
> > > but are pooled in SLUB to avoid the page allocator overhead. Note that
> > > this will not help allocators that trigger page allocator pass-through.
>
> On Fri, 2009-01-23 at 10:32 -0500, Christoph Lameter wrote:
> > We use the partial list for that.
>
> Even if the slab is totally empty?

The MIN_PARTIAL thingy can keep pages around even if the slab becomes
totally empty in order to avoid page allocator trips.
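
Roughly this shape (a simplified sketch of the idea, paraphrased from the
slub.c unfreeze path rather than quoted exactly):

	if (unlikely(!page->inuse)) {			/* slab became empty */
		if (n->nr_partial < MIN_PARTIAL) {
			/* keep the empty slab queued as a cheap reserve */
			add_partial(n, page, 1);
		} else {
			/* enough partials around: hand the page back */
			discard_slab(s, page);
		}
	}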

2009-01-23 16:10:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 10:52:43AM -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
>
> > > Typically we traverse lists of objects that are in the same slab cache.
> >
> > Very often that is not the case. And the price you pay for that is that
> > you have to drain and switch freelists whenever you encounter an object
> > that is not on the same page.
>
> SLUB can directly free an object to any slab page. "Queuing" on free via
> the per cpu slab is only possible if the object came from that per cpu
> slab. This is typically only the case for objects that were recently
> allocated.

Ah yes ok that's right. But then you don't get LIFO allocation
behaviour for those cases.


> > This gives your freelists a chaotic and unpredictable behaviour IMO in
> > a running system where pages succumb to fragmentation so your freelist
> > maximum sizes are limited. It also means you can lose track of cache
> > hot objects when you switch to different "fast" pages. I don't consider
> > this to be "queueing done right".
>
> Yes you can lose track of caching hot objects. That is one of the
> concerns with the SLUB approach. On the other hand: Caching architectures
> get more and more complex these days (especially in a NUMA system). The

Because it is more important to get good cache behaviour.


> SLAB approach is essentially trying to guess which objects are cache hot
> and queue them. Sometimes the queueing is advantageous (may be a reason
> that SLAB is better than SLUB in some cases). In other cases SLAB keeps
> objects on queues but the objects have become stale (context switch, slab
> unused for a while). Then it's no advantage anymore.

But those cases would be expected to be encountered when that slab
is not used as frequently, and is hence less performance critical. And
slabs that are used frequently should be more likely to have recently
freed, cache-hot objects.


> > > If all objects are from the same page then you need not check
> > > the NUMA locality of any object on that queue.
> >
> > In SLAB and SLQB, all objects on the freelist are on the same node. So
> > tell me how does same-page objects simplify numa handling?
>
> F.e. On free you need to determine the node to find the right queue in
> SLAB. SLUB does not need to do that. It simply determines the page address
> and does not care about the node when freeing the object. It is irrelevant
> on which node the object sits.

OK, but how much does that help?


> Also on alloc: The per cpu slab can be from a foreign node. NUMA locality
> does only matter if the caller wants memory from a particular node. So
> cpus that have no local memory can still use the per cpu slabs to have
> fast allocations etc etc.

Yeah. In my experience I haven't needed to optimise this type of behaviour
yet, but other allocators could definitely do similar things to switch their
queues to different nodes.


> > > > And you found you have to increase the size of your pages because you
> > > > need bigger queues. (must we argue semantics? it is a list of free
> > > > objects)
> > >
> > > Right. That may be the case and its a similar tuning to what SLAB does.
> >
> > SLAB and SLQB doesn't need bigger pages to do that.
>
> But they require more metadata handling because they need to manage lists
> of order-0 pages. Metadata handling is reduced by orders of magnitude in
> SLUB.

SLQB's page lists typically get accessed e.g. 1% of the time (sometimes far
less, in other workloads more). So it is several orders of magnitude removed
from the fastpath, which is handled by the freelist.

So I think it is wrong to say it requires more metadata handling. SLUB
will have to switch pages more often or free objects to pages other than
the "fast" page (what do you call it?), so quite often I think you'll
find SLUB has just as much if not more metadata handling.

2009-01-23 17:10:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Saturday 24 January 2009 03:10:17 Nick Piggin wrote:
> On Fri, Jan 23, 2009 at 10:52:43AM -0500, Christoph Lameter wrote:
> > On Fri, 23 Jan 2009, Nick Piggin wrote:
> > > > Typically we traverse lists of objects that are in the same slab
> > > > cache.
> > >
> > > Very often that is not the case. And the price you pay for that is that
> > > you have to drain and switch freelists whenever you encounter an object
> > > that is not on the same page.
> >
> > SLUB can directly free an object to any slab page. "Queuing" on free via
> > the per cpu slab is only possible if the object came from that per cpu
> > slab. This is typically only the case for objects that were recently
> > allocated.
>
> Ah yes ok that's right. But then you don't get LIFO allocation
> behaviour for those cases.

And really this all just stems from the fact that conceptually you
_do_ switch to a different queue (from the one being allocated from)
to free the object if it is on a different page. Because you have a
set of queues (a queue per page), freeing to a different queue is
where you lose the LIFO property.

2009-01-26 17:37:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > SLUB can directly free an object to any slab page. "Queuing" on free via
> > the per cpu slab is only possible if the object came from that per cpu
> > slab. This is typically only the case for objects that were recently
> > allocated.
>
> Ah yes ok that's right. But then you don't get LIFO allocation
> behaviour for those cases.

But you get more TLB local allocations.

> > > hot objects when you switch to different "fast" pages. I don't consider
> > > this to be "queueing done right".
> >
> > Yes you can lose track of caching hot objects. That is one of the
> > concerns with the SLUB approach. On the other hand: Caching architectures
> > get more and more complex these days (especially in a NUMA system). The
>
> Because it is more important to get good cache behaviour.

It's going to be quite difficult to realize an algorithm that guesstimates
what information the processor keeps in its caches. The situation is quite
complex in NUMA systems.

> So I think it is wrong to say it requires more metadata handling. SLUB
> will have to switch pages more often or free objects to pages other than
> the "fast" page (what do you call it?), so quite often I think you'll
> find SLUB has just as much if not more metadata handling.

It's the per cpu slab. SLUB does not switch pages often; it frees objects
that are not from the per cpu slab directly, with minimal overhead compared
to a per cpu slab free. The overhead is much less than the SLAB slowpath,
which has to be taken for alien caches etc.

2009-01-26 17:57:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Sat, 24 Jan 2009, Nick Piggin wrote:

> > > SLUB can directly free an object to any slab page. "Queuing" on free via
> > > the per cpu slab is only possible if the object came from that per cpu
> > > slab. This is typically only the case for objects that were recently
> > > allocated.
> >
> > Ah yes ok that's right. But then you don't get LIFO allocation
> > behaviour for those cases.
>
> And really this all just stems from the fact that conceptually you
> _do_ switch to a different queue (from the one being allocated from)
> to free the object if it is on a different page. Because you have a
> set of queues (a queue per page), freeing to a different queue is
> where you lose the LIFO property.

Yes you basically go for locality instead of LIFO if the free does not hit
the per cpu slab. If the object is not in the per cpu slab then it is
likely that it had a long lifetime and thus LIFOness does not matter
too much. It is likely that many objects from that slab are going to be
freed at the same time. So the first free warms up the "queue" of the page
you are freeing to.
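
In code terms the off-slab free is roughly this (a heavily simplified
sketch of the slub.c free slowpath; debug handling and the empty-slab
case are omitted):

	slab_lock(page);
	prior = page->freelist;
	set_freepointer(s, object, prior);	/* push onto the page's own freelist */
	page->freelist = object;
	page->inuse--;
	if (unlikely(!prior))			/* page was full, make it allocatable again */
		add_partial(get_node(s, page_to_nid(page)), page, 1);
	slab_unlock(page);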

This is an increasingly important feature, since memory chips prefer
allocations that are next to each other. Same-page accesses are faster
in recent memory subsystems than random accesses across memory. LIFO used
to be the better choice, but locality of access is becoming increasingly
important. Especially with the NUMA characteristics of the existing AMD
and upcoming Nehalem processors, this will matter much more.