2009-01-14 09:05:46

by Nick Piggin

Subject: [patch] SLQB slab allocator

Hi,

This is the latest SLQB patch. Since last time, we have imported the sysfs
framework from SLUB and added SLQB-specific event counters. I had initially
been somewhat against this because it makes SLQB depend on another complex
subsystem (which itself depends back on the slab allocator). But I guess it
is not fundamentally different from /proc, and there needs to be some
reporting somewhere. The individual per-slab counters really do make
performance analysis much easier. There is a Documentation/vm/slqbinfo.c
file, which is a parser adapted from SLUB's slabinfo.c.

Fixed some bugs, including a nasty one that was causing remote objects to
sneak onto the local freelist, which meant NUMA allocation was basically
broken.

The NUMA side of things is now much more complete. NUMA policies are obeyed.
There is still a known bug where it won't run on a system with CPU-only
nodes.

CONFIG options are improved.

Credit to some of the engineers at Intel for helping to run tests and for
contributing ideas and patches to improve performance and fix bugs.

I think it is getting to the point where it is stable and featureful. It
still needs to be proven further in the performance area, and we'd welcome
any performance results or suggestions for tests to run.

After this round of review/feedback, I plan to set about getting SLQB merged.
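
For readers skimming the diff, here is a small stand-alone sketch of the
core per-CPU queueing discipline. This is not code from the patch; the
toy_queue/toy_push/toy_free names are made up for illustration. Objects
are pushed onto a LIFO freelist when freed, and once the list grows past
a high watermark a batch of them is flushed back to their slab pages. In
the real code this is struct kmlist together with flush_free_list() and
free_object_to_page() in mm/slqb.c, with hiwater and freebatch sized from
the object size in calculate_sizes().

  struct toy_queue {
          void *head;             /* LIFO list of free objects */
          unsigned long nr;       /* number of queued objects */
  };

  #define TOY_HIWATER   64
  #define TOY_FREEBATCH 16

  /* The first word of a free object stores the next pointer. */
  static void toy_push(struct toy_queue *q, void *obj)
  {
          *(void **)obj = q->head;
          q->head = obj;
          q->nr++;
  }

  static void *toy_pop(struct toy_queue *q)
  {
          void *obj = q->head;

          if (obj) {
                  q->head = *(void **)obj;
                  q->nr--;
          }
          return obj;
  }

  /* Free one object; flush a batch once we exceed the watermark. */
  static void toy_free(struct toy_queue *q, void *obj,
                       void (*release)(void *))
  {
          toy_push(q, obj);
          if (q->nr > TOY_HIWATER) {
                  unsigned long batch = TOY_FREEBATCH;

                  while (batch-- && q->nr)
                          release(toy_pop(q));    /* back to its slab page */
          }
  }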
---
Index: linux-2.6/include/linux/rcupdate.h
===================================================================
--- linux-2.6.orig/include/linux/rcupdate.h
+++ linux-2.6/include/linux/rcupdate.h
@@ -33,6 +33,7 @@
#ifndef __LINUX_RCUPDATE_H
#define __LINUX_RCUPDATE_H

+#include <linux/rcu_types.h>
#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/threads.h>
@@ -42,16 +43,6 @@
#include <linux/lockdep.h>
#include <linux/completion.h>

-/**
- * struct rcu_head - callback structure for use with RCU
- * @next: next update requests in a list
- * @func: actual update function to call after the grace period.
- */
-struct rcu_head {
- struct rcu_head *next;
- void (*func)(struct rcu_head *head);
-};
-
#if defined(CONFIG_CLASSIC_RCU)
#include <linux/rcuclassic.h>
#elif defined(CONFIG_TREE_RCU)
Index: linux-2.6/include/linux/slqb_def.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/slqb_def.h
@@ -0,0 +1,283 @@
+#ifndef _LINUX_SLQB_DEF_H
+#define _LINUX_SLQB_DEF_H
+
+/*
+ * SLQB : A slab allocator with object queues.
+ *
+ * (C) 2008 Nick Piggin <[email protected]>
+ */
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/kobject.h>
+#include <linux/rcu_types.h>
+#include <linux/mm_types.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+
+enum stat_item {
+ ALLOC, /* Allocation count */
+ ALLOC_SLAB_FILL, /* Fill freelist from page list */
+ ALLOC_SLAB_NEW, /* New slab acquired from page allocator */
+ FREE, /* Free count */
+ FREE_REMOTE, /* NUMA: freeing to remote list */
+ FLUSH_FREE_LIST, /* Freelist flushed */
+ FLUSH_FREE_LIST_OBJECTS, /* Objects flushed from freelist */
+ FLUSH_FREE_LIST_REMOTE, /* Objects flushed from freelist to remote */
+ FLUSH_SLAB_PARTIAL, /* Freeing moves slab to partial list */
+ FLUSH_SLAB_FREE, /* Slab freed to the page allocator */
+ FLUSH_RFREE_LIST, /* Rfree list flushed */
+ FLUSH_RFREE_LIST_OBJECTS, /* Rfree objects flushed */
+ CLAIM_REMOTE_LIST, /* Remote freed list claimed */
+ CLAIM_REMOTE_LIST_OBJECTS, /* Remote freed objects claimed */
+ NR_SLQB_STAT_ITEMS
+};
+
+/*
+ * Singly-linked list with head, tail, and nr
+ */
+struct kmlist {
+ unsigned long nr;
+ void **head, **tail;
+};
+
+/*
+ * Every kmem_cache_list has a kmem_cache_remote_free structure, by which
+ * objects can be returned to the kmem_cache_list from remote CPUs.
+ */
+struct kmem_cache_remote_free {
+ spinlock_t lock;
+ struct kmlist list;
+} ____cacheline_aligned;
+
+/*
+ * A kmem_cache_list manages all the slabs and objects allocated from a given
+ * source. Per-cpu kmem_cache_lists allow node-local allocations. Per-node
+ * kmem_cache_lists allow off-node allocations (but require locking).
+ */
+struct kmem_cache_list {
+ struct kmlist freelist; /* Fastpath LIFO freelist of objects */
+#ifdef CONFIG_SMP
+ int remote_free_check; /* remote_free has reached a watermark */
+#endif
+ struct kmem_cache *cache; /* kmem_cache corresponding to this list */
+
+ unsigned long nr_partial; /* Number of partial slabs (pages) */
+ struct list_head partial; /* Slabs which have some free objects */
+
+ unsigned long nr_slabs; /* Total number of slabs allocated */
+
+ //struct list_head full;
+
+#ifdef CONFIG_SMP
+ /*
+ * In the case of per-cpu lists, remote_free is for objects freed by
+ * non-owner CPU back to its home list. For per-node lists, remote_free
+ * is always used to free objects.
+ */
+ struct kmem_cache_remote_free remote_free;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Primary per-cpu, per-kmem_cache structure.
+ */
+struct kmem_cache_cpu {
+ struct kmem_cache_list list; /* List for node-local slabs. */
+
+ unsigned int colour_next;
+
+#ifdef CONFIG_SMP
+ /*
+ * rlist is a list of objects that don't fit on list.freelist (ie.
+ * wrong node). The objects all correspond to a given kmem_cache_list,
+ * remote_cache_list. To free objects to another list, we must first
+ * flush the existing objects, then switch remote_cache_list.
+ *
+ * An NR_CPUS or MAX_NUMNODES array would be nice here, but then we
+ * get to O(NR_CPUS^2) memory consumption situation.
+ */
+ struct kmlist rlist;
+ struct kmem_cache_list *remote_cache_list;
+#endif
+} ____cacheline_aligned;
+
+/*
+ * Per-node, per-kmem_cache structure.
+ */
+struct kmem_cache_node {
+ struct kmem_cache_list list;
+ spinlock_t list_lock; /* protects access to list */
+} ____cacheline_aligned;
+
+/*
+ * Management object for a slab cache.
+ */
+struct kmem_cache {
+ unsigned long flags;
+ int hiwater; /* LIFO list high watermark */
+ int freebatch; /* LIFO freelist batch flush size */
+ int objsize; /* The size of an object without meta data */
+ int offset; /* Free pointer offset. */
+ int objects; /* Number of objects in slab */
+
+ int size; /* The size of an object including meta data */
+ int order; /* Allocation order */
+ gfp_t allocflags; /* gfp flags to use on allocation */
+ unsigned int colour_range; /* range of colour counter */
+ unsigned int colour_off; /* offset per colour */
+ void (*ctor)(void *);
+
+ const char *name; /* Name (only for display!) */
+ struct list_head list; /* List of slab caches */
+
+ int align; /* Alignment */
+ int inuse; /* Offset to metadata */
+
+#ifdef CONFIG_SLQB_SYSFS
+ struct kobject kobj; /* For sysfs */
+#endif
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *node[MAX_NUMNODES];
+#endif
+#ifdef CONFIG_SMP
+ struct kmem_cache_cpu *cpu_slab[NR_CPUS];
+#else
+ struct kmem_cache_cpu cpu_slab;
+#endif
+};
+
+/*
+ * Kmalloc subsystem.
+ */
+#if defined(ARCH_KMALLOC_MINALIGN) && ARCH_KMALLOC_MINALIGN > 8
+#define KMALLOC_MIN_SIZE ARCH_KMALLOC_MINALIGN
+#else
+#define KMALLOC_MIN_SIZE 8
+#endif
+
+#define KMALLOC_SHIFT_LOW ilog2(KMALLOC_MIN_SIZE)
+#define KMALLOC_SHIFT_SLQB_HIGH (PAGE_SHIFT + 9)
+
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1];
+extern struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1];
+
+/*
+ * Constant size allocations use this path to find index into kmalloc caches
+ * arrays. get_slab() function is used for non-constant sizes.
+ */
+static __always_inline int kmalloc_index(size_t size)
+{
+ if (unlikely(!size))
+ return 0;
+ if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
+ return 0;
+
+ if (unlikely(size <= KMALLOC_MIN_SIZE))
+ return KMALLOC_SHIFT_LOW;
+
+#if L1_CACHE_BYTES < 64
+ if (size > 64 && size <= 96)
+ return 1;
+#endif
+#if L1_CACHE_BYTES < 128
+ if (size > 128 && size <= 192)
+ return 2;
+#endif
+ if (size <= 8) return 3;
+ if (size <= 16) return 4;
+ if (size <= 32) return 5;
+ if (size <= 64) return 6;
+ if (size <= 128) return 7;
+ if (size <= 256) return 8;
+ if (size <= 512) return 9;
+ if (size <= 1024) return 10;
+ if (size <= 2 * 1024) return 11;
+ if (size <= 4 * 1024) return 12;
+ if (size <= 8 * 1024) return 13;
+ if (size <= 16 * 1024) return 14;
+ if (size <= 32 * 1024) return 15;
+ if (size <= 64 * 1024) return 16;
+ if (size <= 128 * 1024) return 17;
+ if (size <= 256 * 1024) return 18;
+ if (size <= 512 * 1024) return 19;
+ if (size <= 1024 * 1024) return 20;
+ if (size <= 2 * 1024 * 1024) return 21;
+ return -1;
+}
+
+#ifdef CONFIG_ZONE_DMA
+#define SLQB_DMA __GFP_DMA
+#else
+/* Disable "DMA slabs" */
+#define SLQB_DMA (__force gfp_t)0
+#endif
+
+/*
+ * Find the kmalloc slab cache for a given combination of allocation flags and
+ * size.
+ */
+static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+ int index = kmalloc_index(size);
+
+ if (unlikely(index == 0))
+ return NULL;
+
+ if (likely(!(flags & SLQB_DMA)))
+ return &kmalloc_caches[index];
+ else
+ return &kmalloc_caches_dma[index];
+}
+
+void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+void *__kmalloc(size_t size, gfp_t flags);
+
+#ifndef ARCH_KMALLOC_MINALIGN
+#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
+#endif
+
+#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
+
+static __always_inline void *kmalloc(size_t size, gfp_t flags)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc(s, flags);
+ }
+ return __kmalloc(size, flags);
+}
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node);
+void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
+
+static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ if (__builtin_constant_p(size)) {
+ struct kmem_cache *s;
+
+ s = kmalloc_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+ }
+ return __kmalloc_node(size, flags, node);
+}
+#endif
+
+#endif /* _LINUX_SLQB_DEF_H */
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -828,6 +828,9 @@ config SLUB
and has enhanced diagnostics. SLUB is the default choice for
a slab allocator.

+config SLQB
+ bool "SLQB (Qeued allocator)"
+
config SLOB
depends on EMBEDDED
bool "SLOB (Simple Allocator)"
@@ -869,7 +872,7 @@ config HAVE_GENERIC_DMA_COHERENT
config SLABINFO
bool
depends on PROC_FS
- depends on SLAB || SLUB_DEBUG
+ depends on SLAB || SLUB_DEBUG || SLQB
default y

config RT_MUTEXES
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -298,6 +298,26 @@ config SLUB_STATS
out which slabs are relevant to a particular load.
Try running: slabinfo -DA

+config SLQB_DEBUG
+ default y
+ bool "Enable SLQB debugging support"
+ depends on SLQB
+
+config SLQB_DEBUG_ON
+ default n
+ bool "SLQB debugging on by default"
+ depends on SLQB_DEBUG
+
+config SLQB_SYSFS
+ bool "Create SYSFS entries for slab caches"
+ default n
+ depends on SLQB
+
+config SLQB_STATS
+ bool "Enable SLQB performance statistics"
+ default n
+ depends on SLQB_SYSFS
+
config DEBUG_PREEMPT
bool "Debug preemptible kernel"
depends on DEBUG_KERNEL && PREEMPT && (TRACE_IRQFLAGS_SUPPORT || PPC64)
Index: linux-2.6/mm/slqb.c
===================================================================
--- /dev/null
+++ linux-2.6/mm/slqb.c
@@ -0,0 +1,3368 @@
+/*
+ * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
+ * with order-0 allocations. Fastpath emphasis is placed on local allocation
+ * and freeing, but with a secondary goal of good remote freeing (freeing on
+ * a CPU other than the one that allocated).
+ *
+ * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
+ */
+
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/bit_spinlock.h>
+#include <linux/interrupt.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+
+static inline int slab_hiwater(struct kmem_cache *s)
+{
+ return s->hiwater;
+}
+
+static inline int slab_freebatch(struct kmem_cache *s)
+{
+ return s->freebatch;
+}
+
+/*
+ * slqb_page overloads struct page, and is used to manage some slab allocation
+ * aspects. However, to avoid the horrible mess in include/linux/mm_types.h,
+ * we'll just define our own struct slqb_page type variant here.
+ */
+struct slqb_page {
+ union {
+ struct {
+ unsigned long flags; /* mandatory */
+ atomic_t _count; /* mandatory */
+ unsigned int inuse; /* Nr of objects */
+ struct kmem_cache_list *list; /* Pointer to list */
+ void **freelist; /* freelist req. slab lock */
+ union {
+ struct list_head lru; /* misc. list */
+ struct rcu_head rcu_head; /* for rcu freeing */
+ };
+ };
+ struct page page;
+ };
+};
+static inline void struct_slqb_page_wrong_size(void)
+{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
+
+#define PG_SLQB_BIT (1 << PG_slab)
+
+static int kmem_size __read_mostly;
+#ifdef CONFIG_NUMA
+static int numa_platform __read_mostly;
+#else
+#define numa_platform 0
+#endif
+
+/*
+ * Lock order:
+ * kmem_cache_node->list_lock
+ * kmem_cache_remote_free->lock
+ *
+ * Data structures:
+ * SLQB is primarily per-cpu. For each kmem_cache, each CPU has:
+ *
+ * - A LIFO list of node-local objects. Allocation and freeing of node local
+ * objects goes first to this list.
+ *
+ * - 2 Lists of slab pages, free and partial pages. If an allocation misses
+ * the object list, it tries from the partial list, then the free list.
+ * After freeing an object to the object list, if it is over a watermark,
+ * some objects are freed back to pages. If an allocation misses these lists,
+ * a new slab page is allocated from the page allocator. If the free list
+ * reaches a watermark, some of its pages are returned to the page allocator.
+ *
+ * - A remote free queue, where objects freed that did not come from the local
+ * node are queued to. When this reaches a watermark, the objects are
+ * flushed.
+ *
+ * - A remotely freed queue, where objects allocated from this CPU are flushed
+ * to from other CPUs' remote free queues. kmem_cache_remote_free->lock is
+ * used to protect access to this queue.
+ *
+ * When the remotely freed queue reaches a watermark, a flag is set to tell
+ * the owner CPU to check it. The owner CPU will then check the queue on the
+ * next allocation that misses the object list. It will move all objects from
+ * this list onto the object list and then allocate one.
+ *
+ * This system of remote queueing is intended to reduce lock and remote
+ * cacheline acquisitions, and give a cooling off period for remotely freed
+ * objects before they are re-allocated.
+ *
+ * Node-specific allocations from somewhere other than the local node are
+ * handled by a per-node list which is the same as the above per-CPU data
+ * structures except for the following differences:
+ *
+ * - kmem_cache_node->list_lock is used to protect access for multiple CPUs to
+ * allocate from a given node.
+ *
+ * - There is no remote free queue. Nodes don't free objects, CPUs do.
+ */
+
+static inline void slqb_stat_inc(struct kmem_cache_list *list, enum stat_item si)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si]++;
+#endif
+}
+
+static inline void slqb_stat_add(struct kmem_cache_list *list, enum stat_item si,
+ unsigned long nr)
+{
+#ifdef CONFIG_SLQB_STATS
+ list->stats[si] += nr;
+#endif
+}
+
+static inline int slqb_page_to_nid(struct slqb_page *page)
+{
+ return page_to_nid(&page->page);
+}
+
+static inline void *slqb_page_address(struct slqb_page *page)
+{
+ return page_address(&page->page);
+}
+
+static inline struct zone *slqb_page_zone(struct slqb_page *page)
+{
+ return page_zone(&page->page);
+}
+
+static inline int virt_to_nid(const void *addr)
+{
+#ifdef virt_to_page_fast
+ return page_to_nid(virt_to_page_fast(addr));
+#else
+ return page_to_nid(virt_to_page(addr));
+#endif
+}
+
+static inline struct slqb_page *virt_to_head_slqb_page(const void *addr)
+{
+ struct page *p;
+
+ p = virt_to_head_page(addr);
+ return (struct slqb_page *)p;
+}
+
+static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
+ unsigned int order)
+{
+ struct page *p;
+
+ if (nid == -1)
+ p = alloc_pages(flags, order);
+ else
+ p = alloc_pages_node(nid, flags, order);
+
+ return (struct slqb_page *)p;
+}
+
+static inline void __free_slqb_pages(struct slqb_page *page, unsigned int order)
+{
+ struct page *p = &page->page;
+
+ reset_page_mapcount(p);
+ p->mapping = NULL;
+ VM_BUG_ON(!(p->flags & PG_SLQB_BIT));
+ p->flags &= ~PG_SLQB_BIT;
+
+ __free_pages(p, order);
+}
+
+#ifdef CONFIG_SLQB_DEBUG
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return (s->flags &
+ (SLAB_DEBUG_FREE |
+ SLAB_RED_ZONE |
+ SLAB_POISON |
+ SLAB_STORE_USER |
+ SLAB_TRACE));
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return s->flags & SLAB_POISON;
+}
+#else
+static inline int slab_debug(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline int slab_poison(struct kmem_cache *s)
+{
+ return 0;
+}
+#endif
+
+#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
+ SLAB_POISON | SLAB_STORE_USER)
+
+/* Internal SLQB flags */
+#define __OBJECT_POISON 0x80000000 /* Poison object */
+
+/* Not all arches define cache_line_size */
+#ifndef cache_line_size
+#define cache_line_size() L1_CACHE_BYTES
+#endif
+
+#ifdef CONFIG_SMP
+static struct notifier_block slab_notifier;
+#endif
+
+/* A list of all slab caches on the system */
+static DECLARE_RWSEM(slqb_lock);
+static LIST_HEAD(slab_caches);
+
+/*
+ * Tracking user of a slab.
+ */
+struct track {
+ void *addr; /* Called from address */
+ int cpu; /* Was running on cpu */
+ int pid; /* Pid context */
+ unsigned long when; /* When did the operation occur */
+};
+
+enum track_item { TRACK_ALLOC, TRACK_FREE };
+
+static struct kmem_cache kmem_cache_cache;
+
+#ifdef CONFIG_SLQB_SYSFS
+static int sysfs_slab_add(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
+#else
+static inline int sysfs_slab_add(struct kmem_cache *s)
+{
+ return 0;
+}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+#endif
+
+/********************************************************************
+ * Core slab cache functions
+ *******************************************************************/
+
+static int __slab_is_available __read_mostly;
+int slab_is_available(void)
+{
+ return __slab_is_available;
+}
+
+static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
+{
+#ifdef CONFIG_SMP
+ VM_BUG_ON(!s->cpu_slab[cpu]);
+ return s->cpu_slab[cpu];
+#else
+ return &s->cpu_slab;
+#endif
+}
+
+static inline int check_valid_pointer(struct kmem_cache *s,
+ struct slqb_page *page, const void *object)
+{
+ void *base;
+
+ base = slqb_page_address(page);
+ if (object < base || object >= base + s->objects * s->size ||
+ (object - base) % s->size) {
+ return 0;
+ }
+
+ return 1;
+}
+
+static inline void *get_freepointer(struct kmem_cache *s, void *object)
+{
+ return *(void **)(object + s->offset);
+}
+
+static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
+{
+ *(void **)(object + s->offset) = fp;
+}
+
+/* Loop over all objects in a slab */
+#define for_each_object(__p, __s, __addr) \
+ for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\
+ __p += (__s)->size)
+
+/* Scan freelist */
+#define for_each_free_object(__p, __s, __free) \
+ for (__p = (__free); (__p) != NULL; __p = get_freepointer((__s),\
+ __p))
+
+#ifdef CONFIG_SLQB_DEBUG
+/*
+ * Debug settings:
+ */
+#ifdef CONFIG_SLQB_DEBUG_ON
+static int slqb_debug __read_mostly = DEBUG_DEFAULT_FLAGS;
+#else
+static int slqb_debug __read_mostly;
+#endif
+
+static char *slqb_debug_slabs;
+
+/*
+ * Object debugging
+ */
+static void print_section(char *text, u8 *addr, unsigned int length)
+{
+ int i, offset;
+ int newline = 1;
+ char ascii[17];
+
+ ascii[16] = 0;
+
+ for (i = 0; i < length; i++) {
+ if (newline) {
+ printk(KERN_ERR "%8s 0x%p: ", text, addr + i);
+ newline = 0;
+ }
+ printk(KERN_CONT " %02x", addr[i]);
+ offset = i % 16;
+ ascii[offset] = isgraph(addr[i]) ? addr[i] : '.';
+ if (offset == 15) {
+ printk(KERN_CONT " %s\n", ascii);
+ newline = 1;
+ }
+ }
+ if (!newline) {
+ i %= 16;
+ while (i < 16) {
+ printk(KERN_CONT " ");
+ ascii[i] = ' ';
+ i++;
+ }
+ printk(KERN_CONT " %s\n", ascii);
+ }
+}
+
+static struct track *get_track(struct kmem_cache *s, void *object,
+ enum track_item alloc)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ return p + alloc;
+}
+
+static void set_track(struct kmem_cache *s, void *object,
+ enum track_item alloc, void *addr)
+{
+ struct track *p;
+
+ if (s->offset)
+ p = object + s->offset + sizeof(void *);
+ else
+ p = object + s->inuse;
+
+ p += alloc;
+ if (addr) {
+ p->addr = addr;
+ p->cpu = raw_smp_processor_id();
+ p->pid = current ? current->pid : -1;
+ p->when = jiffies;
+ } else
+ memset(p, 0, sizeof(struct track));
+}
+
+static void init_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ set_track(s, object, TRACK_FREE, NULL);
+ set_track(s, object, TRACK_ALLOC, NULL);
+}
+
+static void print_track(const char *s, struct track *t)
+{
+ if (!t->addr)
+ return;
+
+ printk(KERN_ERR "INFO: %s in ", s);
+ __print_symbol("%s", (unsigned long)t->addr);
+ printk(" age=%lu cpu=%u pid=%d\n", jiffies - t->when, t->cpu, t->pid);
+}
+
+static void print_tracking(struct kmem_cache *s, void *object)
+{
+ if (!(s->flags & SLAB_STORE_USER))
+ return;
+
+ print_track("Allocated", get_track(s, object, TRACK_ALLOC));
+ print_track("Freed", get_track(s, object, TRACK_FREE));
+}
+
+static void print_page_info(struct slqb_page *page)
+{
+ printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n",
+ page, page->inuse, page->freelist, page->flags);
+
+}
+
+static void slab_bug(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "========================================"
+ "=====================================\n");
+ printk(KERN_ERR "BUG %s: %s\n", s->name, buf);
+ printk(KERN_ERR "----------------------------------------"
+ "-------------------------------------\n\n");
+}
+
+static void slab_fix(struct kmem_cache *s, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ printk(KERN_ERR "FIX %s: %s\n", s->name, buf);
+}
+
+static void print_trailer(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned int off; /* Offset of last byte */
+ u8 *addr = slqb_page_address(page);
+
+ print_tracking(s, p);
+
+ print_page_info(page);
+
+ printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
+ p, p - addr, get_freepointer(s, p));
+
+ if (p > addr + 16)
+ print_section("Bytes b4", p - 16, 16);
+
+ print_section("Object", p, min(s->objsize, 128));
+
+ if (s->flags & SLAB_RED_ZONE)
+ print_section("Redzone", p + s->objsize,
+ s->inuse - s->objsize);
+
+ if (s->offset)
+ off = s->offset + sizeof(void *);
+ else
+ off = s->inuse;
+
+ if (s->flags & SLAB_STORE_USER)
+ off += 2 * sizeof(struct track);
+
+ if (off != s->size)
+ /* Beginning of the filler is the free pointer */
+ print_section("Padding", p + off, s->size - off);
+
+ dump_stack();
+}
+
+static void object_err(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *reason)
+{
+ slab_bug(s, reason);
+ print_trailer(s, page, object);
+}
+
+static void slab_err(struct kmem_cache *s, struct slqb_page *page, char *fmt, ...)
+{
+ va_list args;
+ char buf[100];
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ slab_bug(s, "%s", buf);
+ print_page_info(page);
+ dump_stack();
+}
+
+static void init_object(struct kmem_cache *s, void *object, int active)
+{
+ u8 *p = object;
+
+ if (s->flags & __OBJECT_POISON) {
+ memset(p, POISON_FREE, s->objsize - 1);
+ p[s->objsize - 1] = POISON_END;
+ }
+
+ if (s->flags & SLAB_RED_ZONE)
+ memset(p + s->objsize,
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE,
+ s->inuse - s->objsize);
+}
+
+static u8 *check_bytes(u8 *start, unsigned int value, unsigned int bytes)
+{
+ while (bytes) {
+ if (*start != (u8)value)
+ return start;
+ start++;
+ bytes--;
+ }
+ return NULL;
+}
+
+static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
+ void *from, void *to)
+{
+ slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
+ memset(from, data, to - from);
+}
+
+static int check_bytes_and_report(struct kmem_cache *s, struct slqb_page *page,
+ u8 *object, char *what,
+ u8 *start, unsigned int value, unsigned int bytes)
+{
+ u8 *fault;
+ u8 *end;
+
+ fault = check_bytes(start, value, bytes);
+ if (!fault)
+ return 1;
+
+ end = start + bytes;
+ while (end > fault && end[-1] == value)
+ end--;
+
+ slab_bug(s, "%s overwritten", what);
+ printk(KERN_ERR "INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n",
+ fault, end - 1, fault[0], value);
+ print_trailer(s, page, object);
+
+ restore_bytes(s, what, value, fault, end);
+ return 0;
+}
+
+/*
+ * Object layout:
+ *
+ * object address
+ * Bytes of the object to be managed.
+ * If the freepointer may overlay the object then the free
+ * pointer is the first word of the object.
+ *
+ * Poisoning uses 0x6b (POISON_FREE) and the last byte is
+ * 0xa5 (POISON_END)
+ *
+ * object + s->objsize
+ * Padding to reach word boundary. This is also used for Redzoning.
+ * Padding is extended by another word if Redzoning is enabled and
+ * objsize == inuse.
+ *
+ * We fill with 0xbb (RED_INACTIVE) for inactive objects and with
+ * 0xcc (RED_ACTIVE) for objects in use.
+ *
+ * object + s->inuse
+ * Meta data starts here.
+ *
+ * A. Free pointer (if we cannot overwrite object on free)
+ * B. Tracking data for SLAB_STORE_USER
+ * C. Padding to reach required alignment boundary or at minimum
+ * one word if debugging is on to be able to detect writes
+ * before the word boundary.
+ *
+ * Padding is done using 0x5a (POISON_INUSE)
+ *
+ * object + s->size
+ * Nothing is used beyond s->size.
+ */
+
+static int check_pad_bytes(struct kmem_cache *s, struct slqb_page *page, u8 *p)
+{
+ unsigned long off = s->inuse; /* The end of info */
+
+ if (s->offset)
+ /* Freepointer is placed after the object. */
+ off += sizeof(void *);
+
+ if (s->flags & SLAB_STORE_USER)
+ /* We also have user information there */
+ off += 2 * sizeof(struct track);
+
+ if (s->size == off)
+ return 1;
+
+ return check_bytes_and_report(s, page, p, "Object padding",
+ p + off, POISON_INUSE, s->size - off);
+}
+
+static int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+{
+ u8 *start;
+ u8 *fault;
+ u8 *end;
+ int length;
+ int remainder;
+
+ if (!(s->flags & SLAB_POISON))
+ return 1;
+
+ start = slqb_page_address(page);
+ end = start + (PAGE_SIZE << s->order);
+ length = s->objects * s->size;
+ remainder = end - (start + length);
+ if (!remainder)
+ return 1;
+
+ fault = check_bytes(start + length, POISON_INUSE, remainder);
+ if (!fault)
+ return 1;
+ while (end > fault && end[-1] == POISON_INUSE)
+ end--;
+
+ slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1);
+ print_section("Padding", start, length);
+
+ restore_bytes(s, "slab padding", POISON_INUSE, start, end);
+ return 0;
+}
+
+static int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active)
+{
+ u8 *p = object;
+ u8 *endobject = object + s->objsize;
+
+ if (s->flags & SLAB_RED_ZONE) {
+ unsigned int red =
+ active ? SLUB_RED_ACTIVE : SLUB_RED_INACTIVE;
+
+ if (!check_bytes_and_report(s, page, object, "Redzone",
+ endobject, red, s->inuse - s->objsize))
+ return 0;
+ } else {
+ if ((s->flags & SLAB_POISON) && s->objsize < s->inuse) {
+ check_bytes_and_report(s, page, p, "Alignment padding",
+ endobject, POISON_INUSE, s->inuse - s->objsize);
+ }
+ }
+
+ if (s->flags & SLAB_POISON) {
+ if (!active && (s->flags & __OBJECT_POISON) &&
+ (!check_bytes_and_report(s, page, p, "Poison", p,
+ POISON_FREE, s->objsize - 1) ||
+ !check_bytes_and_report(s, page, p, "Poison",
+ p + s->objsize - 1, POISON_END, 1)))
+ return 0;
+ /*
+ * check_pad_bytes cleans up on its own.
+ */
+ check_pad_bytes(s, page, p);
+ }
+
+ return 1;
+}
+
+static int check_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ if (!(page->flags & PG_SLQB_BIT)) {
+ slab_err(s, page, "Not a valid slab page");
+ return 0;
+ }
+ if (page->inuse == 0) {
+ slab_err(s, page, "inuse before free / after alloc", s->name);
+ return 0;
+ }
+ if (page->inuse > s->objects) {
+ slab_err(s, page, "inuse %u > max %u",
+ s->name, page->inuse, s->objects);
+ return 0;
+ }
+ /* Slab_pad_check fixes things up after itself */
+ slab_pad_check(s, page);
+ return 1;
+}
+
+static void trace(struct kmem_cache *s, struct slqb_page *page, void *object, int alloc)
+{
+ if (s->flags & SLAB_TRACE) {
+ printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+ s->name,
+ alloc ? "alloc" : "free",
+ object, page->inuse,
+ page->freelist);
+
+ if (!alloc)
+ print_section("Object", (void *)object, s->objsize);
+
+ dump_stack();
+ }
+}
+
+static void setup_object_debug(struct kmem_cache *s, struct slqb_page *page,
+ void *object)
+{
+ if (!slab_debug(s))
+ return;
+
+ if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)))
+ return;
+
+ init_object(s, object, 0);
+ init_tracking(s, object);
+}
+
+static int alloc_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto bad;
+
+ if (!check_valid_pointer(s, page, object)) {
+ object_err(s, page, object, "Freelist Pointer check fails");
+ goto bad;
+ }
+
+ if (object && !check_object(s, page, object, 0))
+ goto bad;
+
+ /* Success. Perform special debug activities for allocs. */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_ALLOC, addr);
+ trace(s, page, object, 1);
+ init_object(s, object, 1);
+ return 1;
+
+bad:
+ return 0;
+}
+
+static int free_debug_processing(struct kmem_cache *s, void *object, void *addr)
+{
+ struct slqb_page *page;
+ page = virt_to_head_slqb_page(object);
+
+ if (!check_slab(s, page))
+ goto fail;
+
+ if (!check_valid_pointer(s, page, object)) {
+ slab_err(s, page, "Invalid object pointer 0x%p", object);
+ goto fail;
+ }
+
+ if (!check_object(s, page, object, 1))
+ return 0;
+
+ /* Special debug activities for freeing objects */
+ if (s->flags & SLAB_STORE_USER)
+ set_track(s, object, TRACK_FREE, addr);
+ trace(s, page, object, 0);
+ init_object(s, object, 0);
+ return 1;
+
+fail:
+ slab_fix(s, "Object at 0x%p not freed", object);
+ return 0;
+}
+
+static int __init setup_slqb_debug(char *str)
+{
+ slqb_debug = DEBUG_DEFAULT_FLAGS;
+ if (*str++ != '=' || !*str)
+ /*
+ * No options specified. Switch on full debugging.
+ */
+ goto out;
+
+ if (*str == ',')
+ /*
+ * No options but restriction on slabs. This means full
+ * debugging for slabs matching a pattern.
+ */
+ goto check_slabs;
+
+ slqb_debug = 0;
+ if (*str == '-')
+ /*
+ * Switch off all debugging measures.
+ */
+ goto out;
+
+ /*
+ * Determine which debug features should be switched on
+ */
+ for (; *str && *str != ','; str++) {
+ switch (tolower(*str)) {
+ case 'f':
+ slqb_debug |= SLAB_DEBUG_FREE;
+ break;
+ case 'z':
+ slqb_debug |= SLAB_RED_ZONE;
+ break;
+ case 'p':
+ slqb_debug |= SLAB_POISON;
+ break;
+ case 'u':
+ slqb_debug |= SLAB_STORE_USER;
+ break;
+ case 't':
+ slqb_debug |= SLAB_TRACE;
+ break;
+ default:
+ printk(KERN_ERR "slqb_debug option '%c' "
+ "unknown. skipped\n", *str);
+ }
+ }
+
+check_slabs:
+ if (*str == ',')
+ slqb_debug_slabs = str + 1;
+out:
+ return 1;
+}
+
+__setup("slqb_debug", setup_slqb_debug);
+
+static unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name,
+ void (*ctor)(void *))
+{
+ /*
+ * Enable debugging if selected on the kernel commandline.
+ */
+ if (slqb_debug && (!slqb_debug_slabs ||
+ strncmp(slqb_debug_slabs, name,
+ strlen(slqb_debug_slabs)) == 0))
+ flags |= slqb_debug;
+
+ return flags;
+}
+#else
+static inline void setup_object_debug(struct kmem_cache *s,
+ struct slqb_page *page, void *object) {}
+
+static inline int alloc_debug_processing(struct kmem_cache *s,
+ void *object, void *addr) { return 0; }
+
+static inline int free_debug_processing(struct kmem_cache *s,
+ void *object, void *addr) { return 0; }
+
+static inline int slab_pad_check(struct kmem_cache *s, struct slqb_page *page)
+ { return 1; }
+static inline int check_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object, int active) { return 1; }
+static inline void add_full(struct kmem_cache_node *n, struct slqb_page *page) {}
+static inline unsigned long kmem_cache_flags(unsigned long objsize,
+ unsigned long flags, const char *name, void (*ctor)(void *))
+{
+ return flags;
+}
+#define slqb_debug 0
+#endif
+
+/*
+ * allocate a new slab (return its corresponding struct slqb_page)
+ */
+static struct slqb_page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+ struct slqb_page *page;
+ int pages = 1 << s->order;
+
+ flags |= s->allocflags;
+
+ page = alloc_slqb_pages_node(node, flags, s->order);
+ if (!page)
+ return NULL;
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ pages);
+
+ return page;
+}
+
+/*
+ * Called once for each object on a new slab page
+ */
+static void setup_object(struct kmem_cache *s, struct slqb_page *page,
+ void *object)
+{
+ setup_object_debug(s, page, object);
+ if (unlikely(s->ctor))
+ s->ctor(object);
+}
+
+/*
+ * Allocate a new slab, set up its object list.
+ */
+static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
+{
+ struct slqb_page *page;
+ void *start;
+ void *last;
+ void *p;
+
+ BUG_ON(flags & GFP_SLAB_BUG_MASK);
+
+ page = allocate_slab(s,
+ flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
+ if (!page)
+ goto out;
+
+ page->flags |= PG_SLQB_BIT;
+
+ start = page_address(&page->page);
+
+ if (unlikely(slab_poison(s)))
+ memset(start, POISON_INUSE, PAGE_SIZE << s->order);
+
+ start += colour;
+
+ last = start;
+ for_each_object(p, s, start) {
+ setup_object(s, page, p);
+ set_freepointer(s, last, p);
+ last = p;
+ }
+ set_freepointer(s, last, NULL);
+
+ page->freelist = start;
+ page->inuse = 0;
+out:
+ return page;
+}
+
+/*
+ * Free a slab page back to the page allocator
+ */
+static void __free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ int pages = 1 << s->order;
+
+ if (unlikely(slab_debug(s))) {
+ void *p;
+
+ slab_pad_check(s, page);
+ for_each_free_object(p, s, page->freelist)
+ check_object(s, page, p, 0);
+ }
+
+ mod_zone_page_state(slqb_page_zone(page),
+ (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+ NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
+ -pages);
+
+ __free_slqb_pages(page, s->order);
+}
+
+static void rcu_free_slab(struct rcu_head *h)
+{
+ struct slqb_page *page;
+
+ page = container_of((struct list_head *)h, struct slqb_page, lru);
+ __free_slab(page->list->cache, page);
+}
+
+static void free_slab(struct kmem_cache *s, struct slqb_page *page)
+{
+ VM_BUG_ON(page->inuse);
+ if (unlikely(s->flags & SLAB_DESTROY_BY_RCU))
+ call_rcu(&page->rcu_head, rcu_free_slab);
+ else
+ __free_slab(s, page);
+}
+
+/*
+ * Return an object to its slab.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static int free_object_to_page(struct kmem_cache *s, struct kmem_cache_list *l, struct slqb_page *page, void *object)
+{
+ VM_BUG_ON(page->list != l);
+
+ set_freepointer(s, object, page->freelist);
+ page->freelist = object;
+ page->inuse--;
+
+ if (!page->inuse) {
+ if (likely(s->objects > 1)) {
+ l->nr_partial--;
+ list_del(&page->lru);
+ }
+ l->nr_slabs--;
+ free_slab(s, page);
+ slqb_stat_inc(l, FLUSH_SLAB_FREE);
+ return 1;
+ } else if (page->inuse + 1 == s->objects) {
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, FLUSH_SLAB_PARTIAL);
+ return 0;
+ }
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c);
+#endif
+
+/*
+ * Flush the LIFO list of objects on a list. They are sent back to their
+ * pages if the pages belong to this list, or to our CPU's remote-free list
+ * if they do not.
+ *
+ * Doesn't flush the entire list. flush_free_list_all does.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void flush_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ struct kmem_cache_cpu *c;
+ void **head;
+ int nr;
+
+ nr = l->freelist.nr;
+ if (unlikely(!nr))
+ return;
+
+ nr = min(slab_freebatch(s), nr);
+
+ slqb_stat_inc(l, FLUSH_FREE_LIST);
+ slqb_stat_add(l, FLUSH_FREE_LIST_OBJECTS, nr);
+
+ c = get_cpu_slab(s, smp_processor_id());
+
+ l->freelist.nr -= nr;
+ head = l->freelist.head;
+
+ do {
+ struct slqb_page *page;
+ void **object;
+
+ object = head;
+ VM_BUG_ON(!object);
+ head = get_freepointer(s, object);
+ page = virt_to_head_slqb_page(object);
+
+#ifdef CONFIG_SMP
+ if (page->list != l) {
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FLUSH_FREE_LIST_REMOTE);
+ } else
+#endif
+ free_object_to_page(s, l, page, object);
+
+ nr--;
+ } while (nr);
+
+ l->freelist.head = head;
+ if (!l->freelist.nr)
+ l->freelist.tail = NULL;
+}
+
+static void flush_free_list_all(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ while (l->freelist.nr)
+ flush_free_list(s, l);
+}
+
+#ifdef CONFIG_SMP
+/*
+ * If enough objects have been remotely freed back to this list,
+ * remote_free_check will be set, in which case we'll eventually come here
+ * to take those objects off our remote_free list and onto our LIFO freelist.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ void **head, **tail;
+ int nr;
+
+ VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
+
+ if (!l->remote_free.list.nr)
+ return;
+
+ l->remote_free_check = 0;
+ head = l->remote_free.list.head;
+ prefetchw(head);
+
+ spin_lock(&l->remote_free.lock);
+ l->remote_free.list.head = NULL;
+ tail = l->remote_free.list.tail;
+ l->remote_free.list.tail = NULL;
+ nr = l->remote_free.list.nr;
+ l->remote_free.list.nr = 0;
+ spin_unlock(&l->remote_free.lock);
+
+ if (!l->freelist.nr)
+ l->freelist.head = head;
+ else
+ set_freepointer(s, l->freelist.tail, head);
+ l->freelist.tail = tail;
+
+ l->freelist.nr += nr;
+
+ slqb_stat_inc(l, CLAIM_REMOTE_LIST);
+ slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
+}
+#endif
+
+/*
+ * Allocation fastpath. Get an object from the list's LIFO freelist, or
+ * return NULL if it is empty.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ void *object;
+
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+ if (next)
+ prefetchw(next);
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+
+#ifdef CONFIG_SMP
+ if (unlikely(l->remote_free_check)) {
+ claim_remote_free_list(s, l);
+
+ if (l->freelist.nr > slab_hiwater(s))
+ flush_free_list(s, l);
+
+ /* repetition here helps gcc :( */
+ object = l->freelist.head;
+ if (likely(object)) {
+ void *next = get_freepointer(s, object);
+ VM_BUG_ON(!l->freelist.nr);
+ l->freelist.nr--;
+ l->freelist.head = next;
+ if (next)
+ prefetchw(next);
+ return object;
+ }
+ VM_BUG_ON(l->freelist.nr);
+ }
+#endif
+
+ return NULL;
+}
+
+/*
+ * Slow(er) path. Get a page from this list's existing pages. Will be a
+ * new empty page in the case that __slab_alloc_page has just been called
+ * (empty pages otherwise never get queued up on the lists), or a partial page
+ * already on the list.
+ *
+ * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
+ * list_lock in the case of per-node list.
+ */
+static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ struct slqb_page *page;
+ void *object;
+
+ if (unlikely(!l->nr_partial))
+ return NULL;
+
+ page = list_first_entry(&l->partial, struct slqb_page, lru);
+ VM_BUG_ON(page->inuse == s->objects);
+ if (page->inuse + 1 == s->objects) {
+ l->nr_partial--;
+ list_del(&page->lru);
+/*XXX list_move(&page->lru, &l->full); */
+ }
+
+ VM_BUG_ON(!page->freelist);
+
+ page->inuse++;
+
+// VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
+
+ object = page->freelist;
+ page->freelist = get_freepointer(s, object);
+ if (page->freelist)
+ prefetchw(page->freelist);
+ VM_BUG_ON((page->inuse == s->objects) != (page->freelist == NULL));
+ slqb_stat_inc(l, ALLOC_SLAB_FILL);
+
+ return object;
+}
+
+/*
+ * Allocation slowpath. Allocate a new slab page from the page allocator, and
+ * put it on the list's partial list. Must be followed by an allocation so
+ * that we don't have dangling empty pages on the partial list.
+ *
+ * Returns 0 on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline int __slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ struct slqb_page *page;
+ struct kmem_cache_list *l;
+ struct kmem_cache_cpu *c;
+ unsigned int colour;
+
+ c = get_cpu_slab(s, smp_processor_id());
+ colour = c->colour_next;
+ c->colour_next += s->colour_off;
+ if (c->colour_next >= s->colour_range)
+ c->colour_next = 0;
+
+ /* XXX: load any partial? */
+
+ /* Caller handles __GFP_ZERO */
+ gfpflags &= ~__GFP_ZERO;
+
+ if (gfpflags & __GFP_WAIT)
+ local_irq_enable();
+ page = new_slab_page(s, gfpflags, node, colour);
+ if (gfpflags & __GFP_WAIT)
+ local_irq_disable();
+ if (unlikely(!page))
+ return 0;
+
+ if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+ struct kmem_cache_cpu *c;
+ int cpu = smp_processor_id();
+
+ c = get_cpu_slab(s, cpu);
+ l = &c->list;
+ page->list = l;
+
+ if (unlikely(l->nr_partial)) {
+ __free_slqb_pages(page, s->order);
+ return 1;
+ }
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+#ifdef CONFIG_NUMA
+ } else {
+ struct kmem_cache_node *n;
+
+ n = s->node[slqb_page_to_nid(page)];
+ l = &n->list;
+ page->list = l;
+
+ spin_lock(&n->list_lock);
+ if (unlikely(l->nr_partial)) {
+ spin_unlock(&n->list_lock);
+ __free_slqb_pages(page, s->order);
+ return 1;
+ }
+ l->nr_slabs++;
+ l->nr_partial++;
+ list_add(&page->lru, &l->partial);
+ slqb_stat_inc(l, ALLOC_SLAB_NEW);
+ spin_unlock(&n->list_lock);
+ /* XXX: could have a race here where a full page is left on
+ * the list if we subsequently migrate to or from the node.
+ * Should make the above node selection and stick to it.
+ */
+#endif
+ }
+ return 1;
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Allocate an object from a remote node. Return NULL if none could be found
+ * (in which case, caller should allocate a new slab)
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void *__remote_slab_alloc(struct kmem_cache *s, int node)
+{
+ struct kmem_cache_node *n;
+ struct kmem_cache_list *l;
+ void *object;
+
+ n = s->node[node];
+ VM_BUG_ON(!n);
+ l = &n->list;
+
+// if (unlikely(!(l->freelist.nr | l->nr_partial | l->remote_free_check)))
+// return NULL;
+
+ spin_lock(&n->list_lock);
+
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object))
+ object = __cache_list_get_page(s, l);
+ slqb_stat_inc(l, ALLOC);
+ spin_unlock(&n->list_lock);
+ return object;
+}
+
+static noinline int alternate_nid(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ if (in_interrupt() || (gfpflags & __GFP_THISNODE))
+ return node;
+ if (cpuset_do_slab_mem_spread() && (s->flags & SLAB_MEM_SPREAD))
+ return cpuset_mem_spread_node();
+ else if (current->mempolicy)
+ return slab_node(current->mempolicy);
+ return node;
+}
+#endif
+
+/*
+ * Main allocation path. Return an object, or NULL on allocation failure.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void *__slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node)
+{
+ void *object;
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+
+again:
+#ifdef CONFIG_NUMA
+ if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+ object = __remote_slab_alloc(s, node);
+ if (unlikely(!object))
+ goto alloc_new;
+ return object;
+ }
+#endif
+
+ c = get_cpu_slab(s, smp_processor_id());
+ VM_BUG_ON(!c);
+ l = &c->list;
+ object = __cache_list_get_object(s, l);
+ if (unlikely(!object)) {
+ object = __cache_list_get_page(s, l);
+ if (unlikely(!object))
+ goto alloc_new;
+ }
+ slqb_stat_inc(l, ALLOC);
+ return object;
+
+alloc_new:
+ if (unlikely(!__slab_alloc_page(s, gfpflags, node)))
+ return NULL;
+ goto again;
+}
+
+/*
+ * Perform some interrupts-on processing around the main allocation path
+ * (debug checking and memset()ing).
+ */
+static __always_inline void *slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node, void *addr)
+{
+ void *object;
+ unsigned long flags;
+
+again:
+ local_irq_save(flags);
+ object = __slab_alloc(s, gfpflags, node);
+ local_irq_restore(flags);
+
+ if (unlikely(slab_debug(s)) && likely(object)) {
+ if (unlikely(!alloc_debug_processing(s, object, addr)))
+ goto again;
+ }
+
+ if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
+ memset(object, 0, s->objsize);
+
+ return object;
+}
+
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
+{
+ int node = -1;
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, gfpflags, node);
+#endif
+ return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+#ifdef CONFIG_NUMA
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
+{
+ return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+#endif
+
+#ifdef CONFIG_SMP
+/*
+ * Flush this CPU's remote free list of objects back to the list from where
+ * they originate. They end up on that list's remotely freed list, and
+ * eventually we set its remote_free_check if there are enough objects on it.
+ *
+ * This seems convoluted, but it keeps us from stomping on the target CPU's
+ * fastpath cachelines.
+ *
+ * Must be called with interrupts disabled.
+ */
+static void flush_remote_free_cache(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+ struct kmlist *src;
+ struct kmem_cache_list *dst;
+ unsigned int nr;
+ int set;
+
+ src = &c->rlist;
+ nr = src->nr;
+ if (unlikely(!nr))
+ return;
+
+#ifdef CONFIG_SLQB_STATS
+ {
+ struct kmem_cache_list *l = &c->list;
+ slqb_stat_inc(l, FLUSH_RFREE_LIST);
+ slqb_stat_add(l, FLUSH_RFREE_LIST_OBJECTS, nr);
+ }
+#endif
+
+ dst = c->remote_cache_list;
+
+ spin_lock(&dst->remote_free.lock);
+ if (!dst->remote_free.list.head)
+ dst->remote_free.list.head = src->head;
+ else
+ set_freepointer(s, dst->remote_free.list.tail, src->head);
+ dst->remote_free.list.tail = src->tail;
+
+ src->head = NULL;
+ src->tail = NULL;
+ src->nr = 0;
+
+ if (dst->remote_free.list.nr < slab_freebatch(s))
+ set = 1;
+ else
+ set = 0;
+
+ dst->remote_free.list.nr += nr;
+
+ if (unlikely(dst->remote_free.list.nr >= slab_freebatch(s) && set))
+ dst->remote_free_check = 1;
+
+ spin_unlock(&dst->remote_free.lock);
+}
+
+/*
+ * Free an object to this CPU's remote free list.
+ *
+ * Must be called with interrupts disabled.
+ */
+static noinline void slab_free_to_remote(struct kmem_cache *s, struct slqb_page *page, void *object, struct kmem_cache_cpu *c)
+{
+ struct kmlist *r;
+
+ /*
+ * Our remote free list corresponds to a different list. Must
+ * flush it and switch.
+ */
+ if (page->list != c->remote_cache_list) {
+ flush_remote_free_cache(s, c);
+ c->remote_cache_list = page->list;
+ }
+
+ r = &c->rlist;
+ if (!r->head)
+ r->head = object;
+ else
+ set_freepointer(s, r->tail, object);
+ set_freepointer(s, object, NULL);
+ r->tail = object;
+ r->nr++;
+
+ if (unlikely(r->nr > slab_freebatch(s)))
+ flush_remote_free_cache(s, c);
+}
+#endif
+
+/*
+ * Main freeing path. Frees an object back to its home list, or queues it
+ * for remote freeing.
+ *
+ * Must be called with interrupts disabled.
+ */
+static __always_inline void __slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ struct kmem_cache_cpu *c;
+ struct kmem_cache_list *l;
+ int thiscpu = smp_processor_id();
+
+ c = get_cpu_slab(s, thiscpu);
+ l = &c->list;
+
+ slqb_stat_inc(l, FREE);
+
+ if (!NUMA_BUILD || !numa_platform ||
+ likely(slqb_page_to_nid(page) == numa_node_id())) {
+ /*
+ * Freeing fastpath. Collects all local-node objects, not
+ * just those allocated from our per-CPU list. This allows
+ * fast transfer of objects from one CPU to another within
+ * a given node.
+ */
+ set_freepointer(s, object, l->freelist.head);
+ l->freelist.head = object;
+ if (!l->freelist.nr)
+ l->freelist.tail = object;
+ l->freelist.nr++;
+
+ if (unlikely(l->freelist.nr > slab_hiwater(s)))
+ flush_free_list(s, l);
+
+#ifdef CONFIG_NUMA
+ } else {
+ /*
+ * Freeing an object that was allocated on a remote node.
+ */
+ slab_free_to_remote(s, page, object, c);
+ slqb_stat_inc(l, FREE_REMOTE);
+#endif
+ }
+}
+
+/*
+ * Perform some interrupts-on processing around the main freeing path
+ * (debug checking).
+ */
+static __always_inline void slab_free(struct kmem_cache *s,
+ struct slqb_page *page, void *object)
+{
+ unsigned long flags;
+
+ prefetchw(object);
+
+ debug_check_no_locks_freed(object, s->objsize);
+ if (likely(object) && unlikely(slab_debug(s))) {
+ if (unlikely(!free_debug_processing(s, object, __builtin_return_address(0))))
+ return;
+ }
+
+ local_irq_save(flags);
+ __slab_free(s, page, object);
+ local_irq_restore(flags);
+}
+
+void kmem_cache_free(struct kmem_cache *s, void *object)
+{
+ struct slqb_page *page = NULL;
+ if (numa_platform)
+ page = virt_to_head_slqb_page(object);
+ slab_free(s, page, object);
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+/*
+ * Calculate the order of allocation given a slab object size.
+ *
+ * Order 0 allocations are preferred since order 0 does not cause fragmentation
+ * in the page allocator, and they have fastpaths in the page allocator. But
+ * we also want to minimise external fragmentation with large objects.
+ */
+static inline int slab_order(int size, int max_order, int frac)
+{
+ int order;
+
+ if (fls(size - 1) <= PAGE_SHIFT)
+ order = 0;
+ else
+ order = fls(size - 1) - PAGE_SHIFT;
+ while (order <= max_order) {
+ unsigned long slab_size = PAGE_SIZE << order;
+ unsigned long objects;
+ unsigned long waste;
+
+ objects = slab_size / size;
+ if (!objects)
+ continue;
+
+ waste = slab_size - (objects * size);
+
+ if (waste * frac <= slab_size)
+ break;
+
+ order++;
+ }
+
+ return order;
+}
+
+static inline int calculate_order(int size)
+{
+ int order;
+
+ /*
+ * Attempt to find best configuration for a slab. This
+ * works by first attempting to generate a layout with
+ * the best configuration and backing off gradually.
+ */
+ order = slab_order(size, 1, 4);
+ if (order <= 1)
+ return order;
+
+ /*
+ * This size cannot fit in order-1. Allow bigger orders, but
+ * forget about trying to save space.
+ */
+ order = slab_order(size, MAX_ORDER, 0);
+ if (order <= MAX_ORDER)
+ return order;
+
+ return -ENOSYS;
+}
+
+/*
+ * Figure out what the alignment of the objects will be.
+ */
+static unsigned long calculate_alignment(unsigned long flags,
+ unsigned long align, unsigned long size)
+{
+ /*
+ * If the user wants hardware cache aligned objects then follow that
+ * suggestion if the object is sufficiently large.
+ *
+ * The hardware cache alignment cannot override the specified
+ * alignment though. If that is greater, use it.
+ */
+ if (flags & SLAB_HWCACHE_ALIGN) {
+ unsigned long ralign = cache_line_size();
+ while (size <= ralign / 2)
+ ralign /= 2;
+ align = max(align, ralign);
+ }
+
+ if (align < ARCH_SLAB_MINALIGN)
+ align = ARCH_SLAB_MINALIGN;
+
+ return ALIGN(align, sizeof(void *));
+}
+
+static void init_kmem_cache_list(struct kmem_cache *s, struct kmem_cache_list *l)
+{
+ l->cache = s;
+ l->freelist.nr = 0;
+ l->freelist.head = NULL;
+ l->freelist.tail = NULL;
+ l->nr_partial = 0;
+ l->nr_slabs = 0;
+ INIT_LIST_HEAD(&l->partial);
+// INIT_LIST_HEAD(&l->full);
+
+#ifdef CONFIG_SMP
+ l->remote_free_check = 0;
+ spin_lock_init(&l->remote_free.lock);
+ l->remote_free.list.nr = 0;
+ l->remote_free.list.head = NULL;
+ l->remote_free.list.tail = NULL;
+#endif
+
+#ifdef CONFIG_SLQB_STATS
+ memset(l->stats, 0, sizeof(l->stats));
+#endif
+}
+
+static void init_kmem_cache_cpu(struct kmem_cache *s,
+ struct kmem_cache_cpu *c)
+{
+ init_kmem_cache_list(s, &c->list);
+
+ c->colour_next = 0;
+#ifdef CONFIG_SMP
+ c->rlist.nr = 0;
+ c->rlist.head = NULL;
+ c->rlist.tail = NULL;
+ c->remote_cache_list = NULL;
+#endif
+}
+
+#ifdef CONFIG_NUMA
+static void init_kmem_cache_node(struct kmem_cache *s, struct kmem_cache_node *n)
+{
+ spin_lock_init(&n->list_lock);
+ init_kmem_cache_list(s, &n->list);
+}
+#endif
+
+/* Initial slabs */
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache kmem_cpu_cache;
+static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
+#ifdef CONFIG_NUMA
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
+#endif
+#endif
+
+#ifdef CONFIG_NUMA
+static struct kmem_cache kmem_node_cache;
+static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
+#endif
+
+#ifdef CONFIG_SMP
+static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
+{
+ struct kmem_cache_cpu *c;
+
+ c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
+ if (!c)
+ return NULL;
+
+ init_kmem_cache_cpu(s, c);
+ return c;
+}
+
+static void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c) {
+ kmem_cache_free(&kmem_cpu_cache, c);
+ s->cpu_slab[cpu] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+
+ c = s->cpu_slab[cpu];
+ if (c)
+ continue;
+
+ c = alloc_kmem_cache_cpu(s, cpu);
+ if (!c) {
+ free_kmem_cache_cpus(s);
+ return 0;
+ }
+ s->cpu_slab[cpu] = c;
+ }
+ return 1;
+}
+
+#else
+static inline void free_kmem_cache_cpus(struct kmem_cache *s)
+{
+}
+
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
+{
+ init_kmem_cache_cpu(s, &s->cpu_slab);
+ return 1;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = s->node[node];
+ if (n) {
+ kmem_cache_free(&kmem_node_cache, n);
+ s->node[node] = NULL;
+ }
+ }
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+
+ n = kmem_cache_alloc_node(&kmem_node_cache, GFP_KERNEL, node);
+ if (!n) {
+ free_kmem_cache_nodes(s);
+ return 0;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[node] = n;
+ }
+ return 1;
+}
+#else
+static void free_kmem_cache_nodes(struct kmem_cache *s)
+{
+}
+
+static int alloc_kmem_cache_nodes(struct kmem_cache *s)
+{
+ return 1;
+}
+#endif
+
+/*
+ * calculate_sizes() determines the order and the distribution of data within
+ * a slab object.
+ */
+static int calculate_sizes(struct kmem_cache *s)
+{
+ unsigned long flags = s->flags;
+ unsigned long size = s->objsize;
+ unsigned long align = s->align;
+
+ /*
+ * Determine if we can poison the object itself. If the user of
+ * the slab may touch the object after free or before allocation
+ * then we should never poison the object itself.
+ */
+ if (slab_poison(s) && !(flags & SLAB_DESTROY_BY_RCU) && !s->ctor)
+ s->flags |= __OBJECT_POISON;
+ else
+ s->flags &= ~__OBJECT_POISON;
+
+ /*
+ * Round up object size to the next word boundary. We can only
+ * place the free pointer at word boundaries and this determines
+ * the possible location of the free pointer.
+ */
+ size = ALIGN(size, sizeof(void *));
+
+#ifdef CONFIG_SLQB_DEBUG
+ /*
+ * If we are Redzoning then check if there is some space between the
+ * end of the object and the free pointer. If not then add an
+ * additional word to have some bytes to store Redzone information.
+ */
+ if ((flags & SLAB_RED_ZONE) && size == s->objsize)
+ size += sizeof(void *);
+#endif
+
+ /*
+ * With that we have determined the number of bytes in actual use
+ * by the object. This is the potential offset to the free pointer.
+ */
+ s->inuse = size;
+
+ if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor)) {
+ /*
+ * Relocate free pointer after the object if it is not
+ * permitted to overwrite the first word of the object on
+ * kmem_cache_free.
+ *
+		 * This is the case if we use RCU, have a constructor or
+		 * are poisoning the objects.
+ */
+ s->offset = size;
+ size += sizeof(void *);
+ }
+
+#ifdef CONFIG_SLQB_DEBUG
+ if (flags & SLAB_STORE_USER)
+ /*
+ * Need to store information about allocs and frees after
+ * the object.
+ */
+ size += 2 * sizeof(struct track);
+
+ if (flags & SLAB_RED_ZONE)
+ /*
+ * Add some empty padding so that we can catch
+ * overwrites from earlier objects rather than let
+ * tracking information or the free pointer be
+		 * corrupted if a user writes before the start
+ * of the object.
+ */
+ size += sizeof(void *);
+#endif
+
+ /*
+ * Determine the alignment based on various parameters that the
+ * user specified and the dynamic determination of cache line size
+ * on bootup.
+ */
+ align = calculate_alignment(flags, align, s->objsize);
+
+ /*
+ * SLQB stores one object immediately after another beginning from
+ * offset 0. In order to align the objects we have to simply size
+ * each object to conform to the alignment.
+ */
+ size = ALIGN(size, align);
+ s->size = size;
+ s->order = calculate_order(size);
+
+ if (s->order < 0)
+ return 0;
+
+ s->allocflags = 0;
+ if (s->order)
+ s->allocflags |= __GFP_COMP;
+
+ if (s->flags & SLAB_CACHE_DMA)
+ s->allocflags |= SLQB_DMA;
+
+ if (s->flags & SLAB_RECLAIM_ACCOUNT)
+ s->allocflags |= __GFP_RECLAIMABLE;
+
+ /*
+ * Determine the number of objects per slab
+ */
+ s->objects = (PAGE_SIZE << s->order) / size;
+
+ s->freebatch = max(4UL*PAGE_SIZE / size, min(256UL, 64*PAGE_SIZE / size));
+ if (!s->freebatch)
+ s->freebatch = 1;
+ s->hiwater = s->freebatch << 2;
+
+ return !!s->objects;
+}
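+
+/*
+ * Illustrative sizing walk-through (commentary, not part of the original
+ * submission): assuming 4KB pages, 64-byte cache lines, SLAB_HWCACHE_ALIGN
+ * and no debug flags, a cache for 100-byte objects would likely come out
+ * as follows. calculate_alignment()/calculate_order() are defined
+ * elsewhere, so treat these numbers as a sketch rather than guarantees:
+ *
+ *	size      = ALIGN(100, sizeof(void *)) = 104
+ *	inuse     = 104, offset = 0 (free pointer overlays the object)
+ *	align     = 64 (hardware cacheline alignment)
+ *	size      = ALIGN(104, 64) = 128, order = 0
+ *	objects   = 4096 / 128 = 32 per slab
+ *	freebatch = max(4*4096/128, min(256, 64*4096/128)) = 256
+ *	hiwater   = freebatch << 2 = 1024
+ */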
+
+static int kmem_cache_open(struct kmem_cache *s,
+ const char *name, size_t size,
+ size_t align, unsigned long flags,
+ void (*ctor)(void *), int alloc)
+{
+ unsigned int left_over;
+
+ memset(s, 0, kmem_size);
+ s->name = name;
+ s->ctor = ctor;
+ s->objsize = size;
+ s->align = align;
+ s->flags = kmem_cache_flags(size, flags, name, ctor);
+
+ if (!calculate_sizes(s))
+ goto error;
+
+ if (!slab_debug(s)) {
+ left_over = (PAGE_SIZE << s->order) - (s->objects * s->size);
+ s->colour_off = max(cache_line_size(), s->align);
+ s->colour_range = left_over;
+ } else {
+ s->colour_off = 0;
+ s->colour_range = 0;
+ }
+
+ if (likely(alloc)) {
+ if (!alloc_kmem_cache_nodes(s))
+ goto error;
+
+ if (!alloc_kmem_cache_cpus(s))
+ goto error_nodes;
+ }
+
+ /* XXX: perform some basic checks like SLAB does, eg. duplicate names */
+ down_write(&slqb_lock);
+ sysfs_slab_add(s);
+ list_add(&s->list, &slab_caches);
+ up_write(&slqb_lock);
+
+ return 1;
+
+error_nodes:
+ free_kmem_cache_nodes(s);
+error:
+ if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+ return 0;
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+int kmem_ptr_validate(struct kmem_cache *s, const void *object)
+{
+ struct slqb_page *page = virt_to_head_slqb_page(object);
+
+ if (!(page->flags & PG_SLQB_BIT))
+ return 0;
+
+ /*
+	 * We could also check if the object is on the slab's freelist.
+	 * But this would be too expensive and it seems that the main
+	 * purpose of kmem_ptr_validate() is to check if the object belongs
+	 * to a certain slab.
+ */
+ return 1;
+}
+EXPORT_SYMBOL(kmem_ptr_validate);
+
+/*
+ * Determine the size of a slab object
+ */
+unsigned int kmem_cache_size(struct kmem_cache *s)
+{
+ return s->objsize;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(struct kmem_cache *s)
+{
+ return s->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+/*
+ * Release all resources used by a slab cache. No more concurrency on the
+ * slab, so we can touch remote kmem_cache_cpu structures.
+ */
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+ int cpu;
+
+ down_write(&slqb_lock);
+ list_del(&s->list);
+ up_write(&slqb_lock);
+
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+#endif
+
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_cpus(s);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+
+ WARN_ON(l->freelist.nr);
+ WARN_ON(l->nr_slabs);
+ WARN_ON(l->nr_partial);
+ }
+
+ free_kmem_cache_nodes(s);
+#endif
+
+ sysfs_slab_remove(s);
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+/********************************************************************
+ * Kmalloc subsystem
+ *******************************************************************/
+
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+#ifdef CONFIG_ZONE_DMA
+struct kmem_cache kmalloc_caches_dma[KMALLOC_SHIFT_SLQB_HIGH + 1] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches_dma);
+#endif
+
+#ifndef ARCH_KMALLOC_FLAGS
+#define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN
+#endif
+
+static struct kmem_cache *open_kmalloc_cache(struct kmem_cache *s,
+ const char *name, int size, gfp_t gfp_flags)
+{
+ unsigned int flags = ARCH_KMALLOC_FLAGS | SLAB_PANIC;
+
+ if (gfp_flags & SLQB_DMA)
+ flags |= SLAB_CACHE_DMA;
+
+ kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN, flags, NULL, 1);
+
+ return s;
+}
+
+/*
+ * Conversion table for small slab sizes / 8 to the index in the
+ * kmalloc array. This is necessary for slabs < 192 since we have
+ * non-power-of-two cache sizes there. The size of larger slabs can be
+ * determined using fls().
+ */
+static s8 size_index[24] __cacheline_aligned = {
+ 3, /* 8 */
+ 4, /* 16 */
+ 5, /* 24 */
+ 5, /* 32 */
+ 6, /* 40 */
+ 6, /* 48 */
+ 6, /* 56 */
+ 6, /* 64 */
+#if L1_CACHE_BYTES < 64
+ 1, /* 72 */
+ 1, /* 80 */
+ 1, /* 88 */
+ 1, /* 96 */
+#else
+ 7,
+ 7,
+ 7,
+ 7,
+#endif
+ 7, /* 104 */
+ 7, /* 112 */
+ 7, /* 120 */
+ 7, /* 128 */
+#if L1_CACHE_BYTES < 128
+ 2, /* 136 */
+ 2, /* 144 */
+ 2, /* 152 */
+ 2, /* 160 */
+ 2, /* 168 */
+ 2, /* 176 */
+ 2, /* 184 */
+ 2 /* 192 */
+#else
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1,
+ -1
+#endif
+};
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags)
+{
+ int index;
+
+#if L1_CACHE_BYTES >= 128
+ if (size <= 128) {
+#else
+ if (size <= 192) {
+#endif
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ index = size_index[(size - 1) / 8];
+ } else
+ index = fls(size - 1);
+
+ if (unlikely((flags & SLQB_DMA)))
+ return &kmalloc_caches_dma[index];
+ else
+ return &kmalloc_caches[index];
+}
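+
+/*
+ * Example of the index lookup above (illustrative only): kmalloc(100)
+ * takes the small-size path, index = size_index[(100 - 1) / 8] =
+ * size_index[12] = 7, i.e. the "kmalloc-128" cache. A larger request
+ * such as kmalloc(300) uses index = fls(299) = 9, i.e. "kmalloc-512".
+ * With SLQB_DMA set in the gfp flags, the same index selects the
+ * corresponding kmalloc_caches_dma[] entry instead.
+ */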
+
+void *__kmalloc(size_t size, gfp_t flags)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc(s, flags);
+}
+EXPORT_SYMBOL(__kmalloc);
+
+#ifdef CONFIG_NUMA
+void *__kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return kmem_cache_alloc_node(s, flags, node);
+}
+EXPORT_SYMBOL(__kmalloc_node);
+#endif
+
+size_t ksize(const void *object)
+{
+ struct slqb_page *page;
+ struct kmem_cache *s;
+
+ BUG_ON(!object);
+ if (unlikely(object == ZERO_SIZE_PTR))
+ return 0;
+
+ page = virt_to_head_slqb_page(object);
+ BUG_ON(!(page->flags & PG_SLQB_BIT));
+
+ s = page->list->cache;
+
+ /*
+ * Debugging requires use of the padding between object
+ * and whatever may come after it.
+ */
+ if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
+ return s->objsize;
+
+ /*
+ * If we have the need to store the freelist pointer
+ * back there or track user information then we can
+ * only use the space before that information.
+ */
+ if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
+ return s->inuse;
+
+ /*
+ * Else we can use all the padding etc for the allocation
+ */
+ return s->size;
+}
+EXPORT_SYMBOL(ksize);
+
+void kfree(const void *object)
+{
+ struct kmem_cache *s;
+ struct slqb_page *page;
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ page = virt_to_head_slqb_page(object);
+ s = page->list->cache;
+
+ slab_free(s, page, (void *)object);
+}
+EXPORT_SYMBOL(kfree);
+
+static void kmem_cache_trim_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = arg;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+#ifdef CONFIG_SMP
+ claim_remote_free_list(s, l);
+#endif
+ flush_free_list(s, l);
+#ifdef CONFIG_SMP
+ flush_remote_free_cache(s, c);
+#endif
+}
+
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ on_each_cpu(kmem_cache_trim_percpu, s, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+#endif
+
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void kmem_cache_reap_percpu(void *arg)
+{
+ int cpu = smp_processor_id();
+ struct kmem_cache *s;
+ long phase = (long)arg;
+
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+
+ if (phase == 0) {
+ flush_free_list_all(s, l);
+ flush_remote_free_cache(s, c);
+ }
+
+ if (phase == 1) {
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ }
+ }
+}
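+
+/*
+ * The two passes above appear intended to work as follows (commentary,
+ * not from the original submission): phase 0 flushes each CPU's local
+ * freelist and pushes its cached remote frees out to the owning CPUs'
+ * remote_free lists; phase 1 then lets every CPU claim whatever landed
+ * on its own remote_free list during phase 0 and release it back to the
+ * page allocator.
+ */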
+
+static void kmem_cache_reap(void)
+{
+ struct kmem_cache *s;
+ int node;
+
+ down_read(&slqb_lock);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)0, 1);
+ on_each_cpu(kmem_cache_reap_percpu, (void *)1, 1);
+
+ list_for_each_entry(s, &slab_caches, list) {
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list_all(s, l);
+ spin_unlock_irq(&n->list_lock);
+ }
+ }
+ up_read(&slqb_lock);
+}
+#endif
+
+static void cache_trim_worker(struct work_struct *w)
+{
+ struct delayed_work *work =
+ container_of(w, struct delayed_work, work);
+ struct kmem_cache *s;
+ int node;
+
+ if (!down_read_trylock(&slqb_lock))
+ goto out;
+
+ node = numa_node_id();
+ list_for_each_entry(s, &slab_caches, list) {
+#ifdef CONFIG_NUMA
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+
+ spin_lock_irq(&n->list_lock);
+ claim_remote_free_list(s, l);
+ flush_free_list(s, l);
+ spin_unlock_irq(&n->list_lock);
+#endif
+
+ local_irq_disable();
+ kmem_cache_trim_percpu(s);
+ local_irq_enable();
+ }
+
+ up_read(&slqb_lock);
+out:
+ schedule_delayed_work(work, round_jiffies_relative(3*HZ));
+}
+
+static DEFINE_PER_CPU(struct delayed_work, cache_trim_work);
+
+static void __cpuinit start_cpu_timer(int cpu)
+{
+ struct delayed_work *cache_trim_work = &per_cpu(cache_trim_work, cpu);
+
+ /*
+ * When this gets called from do_initcalls via cpucache_init(),
+ * init_workqueues() has already run, so keventd will be setup
+ * at that time.
+ */
+ if (keventd_up() && cache_trim_work->work.func == NULL) {
+ INIT_DELAYED_WORK(cache_trim_work, cache_trim_worker);
+ schedule_delayed_work_on(cpu, cache_trim_work,
+ __round_jiffies_relative(HZ, cpu));
+ }
+}
+
+static int __init cpucache_init(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ start_cpu_timer(cpu);
+ return 0;
+}
+__initcall(cpucache_init);
+
+
+#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
+static void slab_mem_going_offline_callback(void *arg)
+{
+ kmem_cache_reap();
+}
+
+static void slab_mem_offline_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we still need its
+	 * kmem_cache_node, so there is nothing to do.
+	 */
+ if (nid < 0)
+ return;
+
+#if 0 // XXX: see cpu offline comment
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_node *n;
+ n = s->node[nid];
+ if (n) {
+ s->node[nid] = NULL;
+ kmem_cache_free(&kmem_node_cache, n);
+ }
+ }
+ up_read(&slqb_lock);
+#endif
+}
+
+static int slab_mem_going_online_callback(void *arg)
+{
+ struct kmem_cache *s;
+ struct kmem_cache_node *n;
+ struct memory_notify *marg = arg;
+ int nid = marg->status_change_nid;
+ int ret = 0;
+
+ /*
+ * If the node's memory is already available, then kmem_cache_node is
+ * already created. Nothing to do.
+ */
+ if (nid < 0)
+ return 0;
+
+ /*
+	 * We are bringing a node online. No memory is available yet. We must
+ * allocate a kmem_cache_node structure in order to bring the node
+ * online.
+ */
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ /*
+ * XXX: kmem_cache_alloc_node will fallback to other nodes
+ * since memory is not yet available from the node that
+ * is brought up.
+ */
+ n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
+ if (!n) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ init_kmem_cache_node(s, n);
+ s->node[nid] = n;
+ }
+out:
+ up_read(&slqb_lock);
+ return ret;
+}
+
+static int slab_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ int ret = 0;
+
+ switch (action) {
+ case MEM_GOING_ONLINE:
+ ret = slab_mem_going_online_callback(arg);
+ break;
+ case MEM_GOING_OFFLINE:
+ slab_mem_going_offline_callback(arg);
+ break;
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ slab_mem_offline_callback(arg);
+ break;
+ case MEM_ONLINE:
+ case MEM_CANCEL_OFFLINE:
+ break;
+ }
+
+ ret = notifier_from_errno(ret);
+ return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+/********************************************************************
+ * Basic setup of slabs
+ *******************************************************************/
+
+void __init kmem_cache_init(void)
+{
+ int i;
+ unsigned int flags = SLAB_HWCACHE_ALIGN|SLAB_PANIC;
+
+#ifdef CONFIG_NUMA
+ if (num_possible_nodes() == 1)
+ numa_platform = 0;
+ else
+ numa_platform = 1;
+#endif
+
+#ifdef CONFIG_SMP
+ kmem_size = offsetof(struct kmem_cache, cpu_slab) +
+ nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#else
+ kmem_size = sizeof(struct kmem_cache);
+#endif
+
+ kmem_cache_open(&kmem_cache_cache, "kmem_cache", kmem_size, 0, flags, NULL, 0);
+#ifdef CONFIG_SMP
+ kmem_cache_open(&kmem_cpu_cache, "kmem_cache_cpu", sizeof(struct kmem_cache_cpu), 0, flags, NULL, 0);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_cache_open(&kmem_node_cache, "kmem_cache_node", sizeof(struct kmem_cache_node), 0, flags, NULL, 0);
+#endif
+
+#ifdef CONFIG_SMP
+ for_each_possible_cpu(i) {
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cpus[i]);
+ kmem_cache_cache.cpu_slab[i] = &kmem_cache_cpus[i];
+
+ init_kmem_cache_cpu(&kmem_cpu_cache, &kmem_cpu_cpus[i]);
+ kmem_cpu_cache.cpu_slab[i] = &kmem_cpu_cpus[i];
+
+#ifdef CONFIG_NUMA
+ init_kmem_cache_cpu(&kmem_node_cache, &kmem_node_cpus[i]);
+ kmem_node_cache.cpu_slab[i] = &kmem_node_cpus[i];
+#endif
+ }
+#else
+ init_kmem_cache_cpu(&kmem_cache_cache, &kmem_cache_cache.cpu_slab);
+#endif
+
+#ifdef CONFIG_NUMA
+ for_each_node_state(i, N_NORMAL_MEMORY) {
+ init_kmem_cache_node(&kmem_cache_cache, &kmem_cache_nodes[i]);
+ kmem_cache_cache.node[i] = &kmem_cache_nodes[i];
+
+ init_kmem_cache_node(&kmem_cpu_cache, &kmem_cpu_nodes[i]);
+ kmem_cpu_cache.node[i] = &kmem_cpu_nodes[i];
+
+ init_kmem_cache_node(&kmem_node_cache, &kmem_node_nodes[i]);
+ kmem_node_cache.node[i] = &kmem_node_nodes[i];
+ }
+#endif
+
+ /* Caches that are not of the two-to-the-power-of size */
+ if (L1_CACHE_BYTES < 64 && KMALLOC_MIN_SIZE <= 64) {
+ open_kmalloc_cache(&kmalloc_caches[1],
+ "kmalloc-96", 96, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[1],
+ "kmalloc_dma-96", 96, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+ if (L1_CACHE_BYTES < 128 && KMALLOC_MIN_SIZE <= 128) {
+ open_kmalloc_cache(&kmalloc_caches[2],
+ "kmalloc-192", 192, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[2],
+ "kmalloc_dma-192", 192, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ open_kmalloc_cache(&kmalloc_caches[i],
+ "kmalloc", 1 << i, GFP_KERNEL);
+#ifdef CONFIG_ZONE_DMA
+ open_kmalloc_cache(&kmalloc_caches_dma[i],
+ "kmalloc_dma", 1 << i, GFP_KERNEL|SLQB_DMA);
+#endif
+ }
+
+
+ /*
+ * Patch up the size_index table if we have strange large alignment
+ * requirements for the kmalloc array. This is only the case for
+ * mips it seems. The standard arches will not generate any code here.
+ *
+ * Largest permitted alignment is 256 bytes due to the way we
+ * handle the index determination for the smaller caches.
+ *
+ * Make sure that nothing crazy happens if someone starts tinkering
+ * around with ARCH_KMALLOC_MINALIGN
+ */
+ BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
+ (KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
+
+ for (i = 8; i < KMALLOC_MIN_SIZE; i += 8)
+ size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW;
+
+ /* Provide the correct kmalloc names now that the caches are up */
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_SLQB_HIGH; i++) {
+ kmalloc_caches[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+#ifdef CONFIG_ZONE_DMA
+ kmalloc_caches_dma[i].name =
+ kasprintf(GFP_KERNEL, "kmalloc_dma-%d", 1 << i);
+#endif
+ }
+
+#ifdef CONFIG_SMP
+ register_cpu_notifier(&slab_notifier);
+#endif
+#ifdef CONFIG_NUMA
+ hotplug_memory_notifier(slab_memory_callback, 1);
+#endif
+ /*
+ * smp_init() has not yet been called, so no worries about memory
+ * ordering here (eg. slab_is_available vs numa_platform)
+ */
+ __slab_is_available = 1;
+}
+
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+ size_t align, unsigned long flags, void (*ctor)(void *))
+{
+ struct kmem_cache *s;
+
+ s = kmem_cache_alloc(&kmem_cache_cache, GFP_KERNEL);
+ if (!s)
+ goto err;
+
+ if (kmem_cache_open(s, name, size, align, flags, ctor, 1))
+ return s;
+
+ kmem_cache_free(&kmem_cache_cache, s);
+
+err:
+ if (flags & SLAB_PANIC)
+		panic("kmem_cache_create(): failed to create slab `%s'\n", name);
+ return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
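+
+/*
+ * Typical use of the API exported above (sketch only; "struct foo" and
+ * "foo_cache" are made-up names, not part of this patch):
+ *
+ *	static struct kmem_cache *foo_cache;
+ *
+ *	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
+ *				      SLAB_HWCACHE_ALIGN, NULL);
+ *	obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);
+ *	...
+ *	kmem_cache_free(foo_cache, obj);
+ *	kmem_cache_destroy(foo_cache);
+ */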
+
+#ifdef CONFIG_SMP
+/*
+ * Use the cpu notifier to ensure that the cpu slabs are flushed when
+ * necessary.
+ */
+static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ long cpu = (long)hcpu;
+ struct kmem_cache *s;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ down_read(&slqb_lock);
+ list_for_each_entry(s, &slab_caches, list) {
+ s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu);
+ if (!s->cpu_slab[cpu]) {
+ up_read(&slqb_lock);
+ return NOTIFY_BAD;
+ }
+ }
+ up_read(&slqb_lock);
+ break;
+
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ start_cpu_timer(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ cancel_rearming_delayed_work(&per_cpu(cache_trim_work, cpu));
+ per_cpu(cache_trim_work, cpu).work.func = NULL;
+ break;
+
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+#if 0
+ down_read(&slqb_lock);
+	/* XXX: this doesn't work because objects can still be on this
+	 * CPU's list. The periodic timer needs to check if a CPU is offline
+	 * and then try to clean up from there. Same for node offline.
+ */
+ list_for_each_entry(s, &slab_caches, list) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ if (c) {
+ kmem_cache_free(&kmem_cpu_cache, c);
+ s->cpu_slab[cpu] = NULL;
+ }
+ }
+
+ up_read(&slqb_lock);
+#endif
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata slab_notifier = {
+ .notifier_call = slab_cpuup_callback
+};
+
+#endif
+
+#ifdef CONFIG_SLQB_DEBUG
+void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+{
+ struct kmem_cache *s;
+ int node = -1;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+#ifdef CONFIG_NUMA
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
+ node = alternate_nid(s, flags, node);
+#endif
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+
+void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+ unsigned long caller)
+{
+ struct kmem_cache *s;
+
+ s = get_slab(size, flags);
+ if (unlikely(ZERO_OR_NULL_PTR(s)))
+ return s;
+
+ return slab_alloc(s, flags, node, (void *)caller);
+}
+#endif
+
+#if defined(CONFIG_SLQB_SYSFS) || defined(CONFIG_SLABINFO)
+struct stats_gather {
+ struct kmem_cache *s;
+ spinlock_t lock;
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ unsigned long nr_objects;
+
+#ifdef CONFIG_SLQB_STATS
+ unsigned long stats[NR_SLQB_STAT_ITEMS];
+#endif
+};
+
+static void __gather_stats(void *arg)
+{
+ unsigned long nr_slabs;
+ unsigned long nr_partial;
+ unsigned long nr_inuse;
+ struct stats_gather *gather = arg;
+ int cpu = smp_processor_id();
+ struct kmem_cache *s = gather->s;
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+ struct slqb_page *page;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ nr_slabs = l->nr_slabs;
+ nr_partial = l->nr_partial;
+ nr_inuse = (nr_slabs - nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ nr_inuse += page->inuse;
+ }
+
+ spin_lock(&gather->lock);
+ gather->nr_slabs += nr_slabs;
+ gather->nr_partial += nr_partial;
+ gather->nr_inuse += nr_inuse;
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ gather->stats[i] += l->stats[i];
+ }
+#endif
+ spin_unlock(&gather->lock);
+}
+
+static void gather_stats(struct kmem_cache *s, struct stats_gather *stats)
+{
+#ifdef CONFIG_NUMA
+ int node;
+#endif
+
+ memset(stats, 0, sizeof(struct stats_gather));
+ stats->s = s;
+ spin_lock_init(&stats->lock);
+
+ on_each_cpu(__gather_stats, stats, 1);
+
+#ifdef CONFIG_NUMA
+ for_each_online_node(node) {
+ struct kmem_cache_node *n = s->node[node];
+ struct kmem_cache_list *l = &n->list;
+ struct slqb_page *page;
+ unsigned long flags;
+#ifdef CONFIG_SLQB_STATS
+ int i;
+#endif
+
+ spin_lock_irqsave(&n->list_lock, flags);
+#ifdef CONFIG_SLQB_STATS
+ for (i = 0; i < NR_SLQB_STAT_ITEMS; i++) {
+ stats->stats[i] += l->stats[i];
+ }
+#endif
+ stats->nr_slabs += l->nr_slabs;
+ stats->nr_partial += l->nr_partial;
+ stats->nr_inuse += (l->nr_slabs - l->nr_partial) * s->objects;
+
+ list_for_each_entry(page, &l->partial, lru) {
+ stats->nr_inuse += page->inuse;
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+#endif
+
+ stats->nr_objects = stats->nr_slabs * s->objects;
+}
+#endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_SLABINFO
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+ssize_t slabinfo_write(struct file *file, const char __user * buffer,
+ size_t count, loff_t *ppos)
+{
+ return -EINVAL;
+}
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+ seq_puts(m, "slabinfo - version: 2.1\n");
+ seq_puts(m, "# name <active_objs> <num_objs> <objsize> "
+ "<objperslab> <pagesperslab>");
+ seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+ seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+ seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+ loff_t n = *pos;
+
+ down_read(&slqb_lock);
+ if (!n)
+ print_slabinfo_header(m);
+
+ return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+ return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+ up_read(&slqb_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+ struct stats_gather stats;
+ struct kmem_cache *s;
+
+ s = list_entry(p, struct kmem_cache, list);
+
+ gather_stats(s, &stats);
+
+ seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, stats.nr_inuse,
+ stats.nr_objects, s->size, s->objects, (1 << s->order));
+ seq_printf(m, " : tunables %4u %4u %4u", slab_hiwater(s), slab_freebatch(s), 0);
+ seq_printf(m, " : slabdata %6lu %6lu %6lu", stats.nr_slabs, stats.nr_slabs,
+ 0UL);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations slabinfo_op = {
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
+};
+
+static int slabinfo_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &slabinfo_op);
+}
+
+static const struct file_operations proc_slabinfo_operations = {
+ .open = slabinfo_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static int __init slab_proc_init(void)
+{
+	proc_create("slabinfo", S_IWUSR | S_IRUGO, NULL, &proc_slabinfo_operations);
+ return 0;
+}
+module_init(slab_proc_init);
+#endif /* CONFIG_SLABINFO */
+
+#ifdef CONFIG_SLQB_SYSFS
+/*
+ * sysfs API
+ */
+#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
+#define to_slab(n) container_of(n, struct kmem_cache, kobj)
+
+struct slab_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct kmem_cache *s, char *buf);
+ ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
+};
+
+#define SLAB_ATTR_RO(_name) \
+ static struct slab_attribute _name##_attr = __ATTR_RO(_name)
+
+#define SLAB_ATTR(_name) \
+ static struct slab_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->size);
+}
+SLAB_ATTR_RO(slab_size);
+
+static ssize_t align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->align);
+}
+SLAB_ATTR_RO(align);
+
+static ssize_t object_size_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objsize);
+}
+SLAB_ATTR_RO(object_size);
+
+static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->objects);
+}
+SLAB_ATTR_RO(objs_per_slab);
+
+static ssize_t order_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", s->order);
+}
+SLAB_ATTR_RO(order);
+
+static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+{
+ if (s->ctor) {
+ int n = sprint_symbol(buf, (unsigned long)s->ctor);
+
+ return n + sprintf(buf + n, "\n");
+ }
+ return 0;
+}
+SLAB_ATTR_RO(ctor);
+
+static ssize_t slabs_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_slabs);
+}
+SLAB_ATTR_RO(slabs);
+
+static ssize_t objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_inuse);
+}
+SLAB_ATTR_RO(objects);
+
+static ssize_t total_objects_show(struct kmem_cache *s, char *buf)
+{
+ struct stats_gather stats;
+ gather_stats(s, &stats);
+ return sprintf(buf, "%lu\n", stats.nr_objects);
+}
+SLAB_ATTR_RO(total_objects);
+
+static ssize_t reclaim_account_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT));
+}
+SLAB_ATTR_RO(reclaim_account);
+
+static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_HWCACHE_ALIGN));
+}
+SLAB_ATTR_RO(hwcache_align);
+
+#ifdef CONFIG_ZONE_DMA
+static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA));
+}
+SLAB_ATTR_RO(cache_dma);
+#endif
+
+static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_DESTROY_BY_RCU));
+}
+SLAB_ATTR_RO(destroy_by_rcu);
+
+static ssize_t red_zone_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE));
+}
+SLAB_ATTR_RO(red_zone);
+
+static ssize_t poison_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON));
+}
+SLAB_ATTR_RO(poison);
+
+static ssize_t store_user_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER));
+}
+SLAB_ATTR_RO(store_user);
+
+static ssize_t hiwater_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+ long hiwater;
+ int err;
+
+ err = strict_strtol(buf, 10, &hiwater);
+ if (err)
+ return err;
+
+ if (hiwater < 0)
+ return -EINVAL;
+
+ s->hiwater = hiwater;
+
+ return length;
+}
+
+static ssize_t hiwater_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_hiwater(s));
+}
+SLAB_ATTR(hiwater);
+
+static ssize_t freebatch_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+ long freebatch;
+ int err;
+
+ err = strict_strtol(buf, 10, &freebatch);
+ if (err)
+ return err;
+
+ if (freebatch <= 0 || freebatch - 1 > s->hiwater)
+ return -EINVAL;
+
+ s->freebatch = freebatch;
+
+ return length;
+}
+
+static ssize_t freebatch_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", slab_freebatch(s));
+}
+SLAB_ATTR(freebatch);
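+
+/*
+ * With CONFIG_SLQB_SYSFS enabled, the two writable attributes above show
+ * up under /sys/kernel/slab/<cache>/, so the queue sizes can be tuned at
+ * runtime, e.g. (illustrative cache name):
+ *
+ *	echo 2048 > /sys/kernel/slab/kmalloc-128/hiwater
+ *	echo 512  > /sys/kernel/slab/kmalloc-128/freebatch
+ *
+ * freebatch_store() rejects values larger than hiwater + 1, so hiwater
+ * should be raised first when increasing both.
+ */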
+#ifdef CONFIG_SLQB_STATS
+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+ struct stats_gather stats;
+ int len;
+#ifdef CONFIG_SMP
+ int cpu;
+#endif
+
+ gather_stats(s, &stats);
+
+ len = sprintf(buf, "%lu", stats.stats[si]);
+
+#ifdef CONFIG_SMP
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_list *l = &c->list;
+ if (len < PAGE_SIZE - 20)
+ len += sprintf(buf + len, " C%d=%lu", cpu, l->stats[si]);
+ }
+#endif
+ return len + sprintf(buf + len, "\n");
+}
+
+#define STAT_ATTR(si, text) \
+static ssize_t text##_show(struct kmem_cache *s, char *buf) \
+{ \
+ return show_stat(s, buf, si); \
+} \
+SLAB_ATTR_RO(text)
+
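+/*
+ * For reference, STAT_ATTR(ALLOC, alloc) below expands to roughly:
+ *
+ *	static ssize_t alloc_show(struct kmem_cache *s, char *buf)
+ *	{
+ *		return show_stat(s, buf, ALLOC);
+ *	}
+ *	static struct slab_attribute alloc_attr = __ATTR_RO(alloc);
+ *
+ * i.e. one read-only sysfs file per event counter, reporting the summed
+ * total followed by per-CPU "C<n>=<count>" pairs.
+ */
+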
+STAT_ATTR(ALLOC, alloc);
+STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
+STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
+STAT_ATTR(FREE, free);
+STAT_ATTR(FREE_REMOTE, free_remote);
+STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
+STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
+STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
+STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
+STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
+STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
+STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
+STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
+STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);
+#endif
+
+static struct attribute *slab_attrs[] = {
+ &slab_size_attr.attr,
+ &object_size_attr.attr,
+ &objs_per_slab_attr.attr,
+ &order_attr.attr,
+ &objects_attr.attr,
+ &total_objects_attr.attr,
+ &slabs_attr.attr,
+ &ctor_attr.attr,
+ &align_attr.attr,
+ &hwcache_align_attr.attr,
+ &reclaim_account_attr.attr,
+ &destroy_by_rcu_attr.attr,
+ &red_zone_attr.attr,
+ &poison_attr.attr,
+ &store_user_attr.attr,
+ &hiwater_attr.attr,
+ &freebatch_attr.attr,
+#ifdef CONFIG_ZONE_DMA
+ &cache_dma_attr.attr,
+#endif
+#ifdef CONFIG_SLQB_STATS
+ &alloc_attr.attr,
+ &alloc_slab_fill_attr.attr,
+ &alloc_slab_new_attr.attr,
+ &free_attr.attr,
+ &free_remote_attr.attr,
+ &flush_free_list_attr.attr,
+ &flush_free_list_objects_attr.attr,
+ &flush_free_list_remote_attr.attr,
+ &flush_slab_partial_attr.attr,
+ &flush_slab_free_attr.attr,
+ &flush_rfree_list_attr.attr,
+ &flush_rfree_list_objects_attr.attr,
+ &claim_remote_list_attr.attr,
+ &claim_remote_list_objects_attr.attr,
+#endif
+ NULL
+};
+
+static struct attribute_group slab_attr_group = {
+ .attrs = slab_attrs,
+};
+
+static ssize_t slab_attr_show(struct kobject *kobj,
+ struct attribute *attr,
+ char *buf)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->show)
+ return -EIO;
+
+ err = attribute->show(s, buf);
+
+ return err;
+}
+
+static ssize_t slab_attr_store(struct kobject *kobj,
+ struct attribute *attr,
+ const char *buf, size_t len)
+{
+ struct slab_attribute *attribute;
+ struct kmem_cache *s;
+ int err;
+
+ attribute = to_slab_attr(attr);
+ s = to_slab(kobj);
+
+ if (!attribute->store)
+ return -EIO;
+
+ err = attribute->store(s, buf, len);
+
+ return err;
+}
+
+static void kmem_cache_release(struct kobject *kobj)
+{
+ struct kmem_cache *s = to_slab(kobj);
+
+ kmem_cache_free(&kmem_cache_cache, s);
+}
+
+static struct sysfs_ops slab_sysfs_ops = {
+ .show = slab_attr_show,
+ .store = slab_attr_store,
+};
+
+static struct kobj_type slab_ktype = {
+ .sysfs_ops = &slab_sysfs_ops,
+ .release = kmem_cache_release
+};
+
+static int uevent_filter(struct kset *kset, struct kobject *kobj)
+{
+ struct kobj_type *ktype = get_ktype(kobj);
+
+ if (ktype == &slab_ktype)
+ return 1;
+ return 0;
+}
+
+static struct kset_uevent_ops slab_uevent_ops = {
+ .filter = uevent_filter,
+};
+
+static struct kset *slab_kset;
+
+static int sysfs_available __read_mostly = 0;
+
+static int sysfs_slab_add(struct kmem_cache *s)
+{
+ int err;
+
+ if (!sysfs_available)
+ return 0;
+
+ s->kobj.kset = slab_kset;
+	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", s->name);
+ if (err) {
+ kobject_put(&s->kobj);
+ return err;
+ }
+
+ err = sysfs_create_group(&s->kobj, &slab_attr_group);
+ if (err)
+ return err;
+ kobject_uevent(&s->kobj, KOBJ_ADD);
+
+ return 0;
+}
+
+static void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kobject_uevent(&s->kobj, KOBJ_REMOVE);
+ kobject_del(&s->kobj);
+ kobject_put(&s->kobj);
+}
+
+static int __init slab_sysfs_init(void)
+{
+ struct kmem_cache *s;
+ int err;
+
+ slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
+ if (!slab_kset) {
+ printk(KERN_ERR "Cannot register slab subsystem.\n");
+ return -ENOSYS;
+ }
+
+ down_write(&slqb_lock);
+ sysfs_available = 1;
+ list_for_each_entry(s, &slab_caches, list) {
+ err = sysfs_slab_add(s);
+ if (err)
+ printk(KERN_ERR "SLQB: Unable to add boot slab %s"
+ " to sysfs\n", s->name);
+ }
+ up_write(&slqb_lock);
+
+ return 0;
+}
+
+__initcall(slab_sysfs_init);
+#endif
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h
+++ linux-2.6/include/linux/slab.h
@@ -150,6 +150,8 @@ size_t ksize(const void *);
*/
#ifdef CONFIG_SLUB
#include <linux/slub_def.h>
+#elif defined(CONFIG_SLQB)
+#include <linux/slqb_def.h>
#elif defined(CONFIG_SLOB)
#include <linux/slob_def.h>
#else
@@ -252,7 +254,7 @@ static inline void *kmem_cache_alloc_nod
* allocator where we care about the real place the memory allocation
* request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
#define kmalloc_track_caller(size, flags) \
__kmalloc_track_caller(size, flags, _RET_IP_)
@@ -270,7 +272,7 @@ extern void *__kmalloc_track_caller(size
* standard allocator where we care about the real place the memory
* allocation request comes from.
*/
-#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB)
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || defined(CONFIG_SLQB_DEBUG)
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
#define kmalloc_node_track_caller(size, flags, node) \
__kmalloc_node_track_caller(size, flags, node, \
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile
+++ linux-2.6/mm/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_SLUB) += slub.o
+obj-$(CONFIG_SLQB) += slqb.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
Index: linux-2.6/include/linux/rcu_types.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/rcu_types.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_RCU_TYPES_H
+#define __LINUX_RCU_TYPES_H
+
+#ifdef __KERNEL__
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+ struct rcu_head *next;
+ void (*func)(struct rcu_head *head);
+};
+
+#endif
+
+#endif
Index: linux-2.6/arch/x86/include/asm/page.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/page.h
+++ linux-2.6/arch/x86/include/asm/page.h
@@ -194,6 +194,7 @@ static inline pteval_t native_pte_flags(
* virt_addr_valid(kaddr) returns true.
*/
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
+#define virt_to_page_fast(kaddr) pfn_to_page(((unsigned long)(kaddr) - PAGE_OFFSET) >> PAGE_SHIFT)
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
extern bool __virt_addr_valid(unsigned long kaddr);
#define virt_addr_valid(kaddr) __virt_addr_valid((unsigned long) (kaddr))
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -306,7 +306,11 @@ static inline void get_page(struct page

static inline struct page *virt_to_head_page(const void *x)
{
+#ifdef virt_to_page_fast
+ struct page *page = virt_to_page_fast(x);
+#else
struct page *page = virt_to_page(x);
+#endif
return compound_head(page);
}

Index: linux-2.6/Documentation/vm/slqbinfo.c
===================================================================
--- /dev/null
+++ linux-2.6/Documentation/vm/slqbinfo.c
@@ -0,0 +1,1054 @@
+/*
+ * Slabinfo: Tool to get reports about slabs
+ *
+ * (C) 2007 sgi, Christoph Lameter
+ *
+ * Reworked by Lin Ming <[email protected]> for SLQB
+ *
+ * Compile by:
+ *
+ * gcc -o slqbinfo slqbinfo.c
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <strings.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <getopt.h>
+#include <regex.h>
+#include <errno.h>
+
+#define MAX_SLABS 500
+#define MAX_ALIASES 500
+#define MAX_NODES 1024
+
+struct slabinfo {
+ char *name;
+ int align, cache_dma, destroy_by_rcu;
+ int hwcache_align, object_size, objs_per_slab;
+ int slab_size, store_user;
+ int order, poison, reclaim_account, red_zone;
+ int batch;
+ unsigned long objects, slabs, total_objects;
+ unsigned long alloc, alloc_slab_fill, alloc_slab_new;
+ unsigned long free, free_remote;
+ unsigned long claim_remote_list, claim_remote_list_objects;
+ unsigned long flush_free_list, flush_free_list_objects, flush_free_list_remote;
+ unsigned long flush_rfree_list, flush_rfree_list_objects;
+ unsigned long flush_slab_free, flush_slab_partial;
+ int numa[MAX_NODES];
+ int numa_partial[MAX_NODES];
+} slabinfo[MAX_SLABS];
+
+int slabs = 0;
+int actual_slabs = 0;
+int highest_node = 0;
+
+char buffer[4096];
+
+int show_empty = 0;
+int show_report = 0;
+int show_slab = 0;
+int skip_zero = 1;
+int show_numa = 0;
+int show_track = 0;
+int validate = 0;
+int shrink = 0;
+int show_inverted = 0;
+int show_totals = 0;
+int sort_size = 0;
+int sort_active = 0;
+int set_debug = 0;
+int show_ops = 0;
+int show_activity = 0;
+
+/* Debug options */
+int sanity = 0;
+int redzone = 0;
+int poison = 0;
+int tracking = 0;
+int tracing = 0;
+
+int page_size;
+
+regex_t pattern;
+
+void fatal(const char *x, ...)
+{
+ va_list ap;
+
+ va_start(ap, x);
+ vfprintf(stderr, x, ap);
+ va_end(ap);
+ exit(EXIT_FAILURE);
+}
+
+void usage(void)
+{
+ printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n"
+ "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+ "-A|--activity Most active slabs first\n"
+ "-d<options>|--debug=<options> Set/Clear Debug options\n"
+ "-D|--display-active Switch line format to activity\n"
+ "-e|--empty Show empty slabs\n"
+ "-h|--help Show usage information\n"
+ "-i|--inverted Inverted list\n"
+ "-l|--slabs Show slabs\n"
+ "-n|--numa Show NUMA information\n"
+ "-o|--ops Show kmem_cache_ops\n"
+ "-s|--shrink Shrink slabs\n"
+ "-r|--report Detailed report on single slabs\n"
+ "-S|--Size Sort by size\n"
+ "-t|--tracking Show alloc/free information\n"
+ "-T|--Totals Show summary information\n"
+ "-v|--validate Validate slabs\n"
+ "-z|--zero Include empty slabs\n"
+ "\nValid debug options (FZPUT may be combined)\n"
+ "a / A Switch on all debug options (=FZUP)\n"
+ "- Switch off all debug options\n"
+ "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
+ "z / Z Redzoning\n"
+ "p / P Poisoning\n"
+ "u / U Tracking\n"
+ "t / T Tracing\n"
+ );
+}
+
+unsigned long read_obj(const char *name)
+{
+ FILE *f = fopen(name, "r");
+
+ if (!f)
+ buffer[0] = 0;
+ else {
+ if (!fgets(buffer, sizeof(buffer), f))
+ buffer[0] = 0;
+ fclose(f);
+		if (buffer[0] && buffer[strlen(buffer) - 1] == '\n')
+			buffer[strlen(buffer) - 1] = 0;
+ }
+ return strlen(buffer);
+}
+
+
+/*
+ * Get the contents of an attribute
+ */
+unsigned long get_obj(const char *name)
+{
+ if (!read_obj(name))
+ return 0;
+
+ return atol(buffer);
+}
+
+unsigned long get_obj_and_str(const char *name, char **x)
+{
+ unsigned long result = 0;
+ char *p;
+
+ *x = NULL;
+
+	if (!read_obj(name))
+		return 0;
+ result = strtoul(buffer, &p, 10);
+ while (*p == ' ')
+ p++;
+ if (*p)
+ *x = strdup(p);
+ return result;
+}
+
+void set_obj(struct slabinfo *s, const char *name, int n)
+{
+ char x[100];
+ FILE *f;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "w");
+ if (!f)
+ fatal("Cannot write to %s\n", x);
+
+ fprintf(f, "%d\n", n);
+ fclose(f);
+}
+
+unsigned long read_slab_obj(struct slabinfo *s, const char *name)
+{
+ char x[100];
+ FILE *f;
+ size_t l;
+
+ snprintf(x, 100, "%s/%s", s->name, name);
+ f = fopen(x, "r");
+ if (!f) {
+ buffer[0] = 0;
+ l = 0;
+ } else {
+ l = fread(buffer, 1, sizeof(buffer), f);
+ buffer[l] = 0;
+ fclose(f);
+ }
+ return l;
+}
+
+
+/*
+ * Put a size string together
+ */
+int store_size(char *buffer, unsigned long value)
+{
+ unsigned long divisor = 1;
+ char trailer = 0;
+ int n;
+
+ if (value > 1000000000UL) {
+ divisor = 100000000UL;
+ trailer = 'G';
+ } else if (value > 1000000UL) {
+ divisor = 100000UL;
+ trailer = 'M';
+ } else if (value > 1000UL) {
+ divisor = 100;
+ trailer = 'K';
+ }
+
+ value /= divisor;
+	n = sprintf(buffer, "%ld", value);
+ if (trailer) {
+ buffer[n] = trailer;
+ n++;
+ buffer[n] = 0;
+ }
+ if (divisor != 1) {
+ memmove(buffer + n - 2, buffer + n - 3, 4);
+ buffer[n-2] = '.';
+ n++;
+ }
+ return n;
+}
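+
+/*
+ * Worked example of the formatting above (illustrative): store_size()
+ * for 4096 picks divisor 100 and trailer 'K', prints "40", appends the
+ * trailer and then inserts the decimal point, yielding "4.0K"; values of
+ * 1000 or less are printed verbatim, e.g. 512 -> "512".
+ */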
+
+void decode_numa_list(int *numa, char *t)
+{
+ int node;
+ int nr;
+
+ memset(numa, 0, MAX_NODES * sizeof(int));
+
+ if (!t)
+ return;
+
+ while (*t == 'N') {
+ t++;
+ node = strtoul(t, &t, 10);
+ if (*t == '=') {
+ t++;
+ nr = strtoul(t, &t, 10);
+ numa[node] = nr;
+ if (node > highest_node)
+ highest_node = node;
+ }
+ while (*t == ' ')
+ t++;
+ }
+}
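+
+/*
+ * Example (illustrative): decode_numa_list(numa, "N0=12 N1=3") sets
+ * numa[0] = 12, numa[1] = 3 and bumps highest_node to 1. Note that the
+ * SLQB slabs_show() in this patch prints only the total, so the string
+ * passed in here is normally empty and the per-node columns stay zero.
+ */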
+
+void slab_validate(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "validate", 1);
+}
+
+void slab_shrink(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ set_obj(s, "shrink", 1);
+}
+
+int line = 0;
+
+void first_line(void)
+{
+ if (show_activity)
+ printf("Name Objects Alloc Free %%Fill %%New "
+ "FlushR %%FlushR FlushR_Objs O\n");
+ else
+ printf("Name Objects Objsize Space "
+ " O/S O %%Ef Batch Flg\n");
+}
+
+unsigned long slab_size(struct slabinfo *s)
+{
+ return s->slabs * (page_size << s->order);
+}
+
+unsigned long slab_activity(struct slabinfo *s)
+{
+ return s->alloc + s->free;
+}
+
+void slab_numa(struct slabinfo *s, int mode)
+{
+ int node;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (!highest_node) {
+ printf("\n%s: No NUMA information available.\n", s->name);
+ return;
+ }
+
+ if (skip_zero && !s->slabs)
+ return;
+
+ if (!line) {
+ printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+ for(node = 0; node <= highest_node; node++)
+ printf(" %4d", node);
+ printf("\n----------------------");
+ for(node = 0; node <= highest_node; node++)
+ printf("-----");
+ printf("\n");
+ }
+ printf("%-21s ", mode ? "All slabs" : s->name);
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ if (mode) {
+ printf("%-21s ", "Partial slabs");
+ for(node = 0; node <= highest_node; node++) {
+ char b[20];
+
+ store_size(b, s->numa_partial[node]);
+ printf(" %4s", b);
+ }
+ printf("\n");
+ }
+ line++;
+}
+
+void show_tracking(struct slabinfo *s)
+{
+ printf("\n%s: Kernel object allocation\n", s->name);
+ printf("-----------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "alloc_calls"))
+		printf("%s", buffer);
+ else
+ printf("No Data\n");
+
+ printf("\n%s: Kernel object freeing\n", s->name);
+ printf("------------------------------------------------------------------------\n");
+ if (read_slab_obj(s, "free_calls"))
+		printf("%s", buffer);
+ else
+ printf("No Data\n");
+
+}
+
+void ops(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (read_slab_obj(s, "ops")) {
+ printf("\n%s: kmem_cache operations\n", s->name);
+ printf("--------------------------------------------\n");
+		printf("%s", buffer);
+ } else
+ printf("\n%s has no kmem_cache operations\n", s->name);
+}
+
+const char *onoff(int x)
+{
+ if (x)
+ return "On ";
+ return "Off";
+}
+
+void slab_stats(struct slabinfo *s)
+{
+ unsigned long total_alloc;
+ unsigned long total_free;
+ unsigned long total;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+ if (!total_alloc)
+ return;
+
+ printf("\n");
+ printf("Slab Perf Counter\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Alloc: %8lu, partial %8lu, page allocator %8lu\n",
+ total_alloc,
+ s->alloc_slab_fill, s->alloc_slab_new);
+ printf("Free: %8lu, partial %8lu, page allocator %8lu, remote %5lu\n",
+ total_free,
+ s->flush_slab_partial,
+ s->flush_slab_free,
+ s->free_remote);
+ printf("Claim: %8lu, objects %8lu\n",
+ s->claim_remote_list,
+ s->claim_remote_list_objects);
+ printf("Flush: %8lu, objects %8lu, remote: %8lu\n",
+ s->flush_free_list,
+ s->flush_free_list_objects,
+ s->flush_free_list_remote);
+ printf("FlushR:%8lu, objects %8lu\n",
+ s->flush_rfree_list,
+ s->flush_rfree_list_objects);
+}
+
+void report(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ printf("\nSlabcache: %-20s Order : %2d Objects: %lu\n",
+ s->name, s->order, s->objects);
+ if (s->hwcache_align)
+ printf("** Hardware cacheline aligned\n");
+ if (s->cache_dma)
+ printf("** Memory is allocated in a special DMA zone\n");
+ if (s->destroy_by_rcu)
+ printf("** Slabs are destroyed via RCU\n");
+ if (s->reclaim_account)
+ printf("** Reclaim accounting active\n");
+
+ printf("\nSizes (bytes) Slabs Debug Memory\n");
+ printf("------------------------------------------------------------------------\n");
+ printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
+ s->object_size, s->slabs, "N/A",
+ s->slabs * (page_size << s->order));
+ printf("SlabObj: %7d Full : %7s Redzoning : %s Used : %7ld\n",
+ s->slab_size, "N/A",
+ onoff(s->red_zone), s->objects * s->object_size);
+ printf("SlabSiz: %7d Partial: %7s Poisoning : %s Loss : %7ld\n",
+ page_size << s->order, "N/A", onoff(s->poison),
+ s->slabs * (page_size << s->order) - s->objects * s->object_size);
+ printf("Loss : %7d CpuSlab: %7s Tracking : %s Lalig: %7ld\n",
+ s->slab_size - s->object_size, "N/A", onoff(s->store_user),
+ (s->slab_size - s->object_size) * s->objects);
+ printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
+ s->align, s->objs_per_slab, "N/A",
+ ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
+ s->slabs);
+
+ ops(s);
+ show_tracking(s);
+ slab_numa(s, 1);
+ slab_stats(s);
+}
+
+void slabcache(struct slabinfo *s)
+{
+ char size_str[20];
+ char flags[20];
+ char *p = flags;
+
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (actual_slabs == 1) {
+ report(s);
+ return;
+ }
+
+ if (skip_zero && !show_empty && !s->slabs)
+ return;
+
+ if (show_empty && s->slabs)
+ return;
+
+ store_size(size_str, slab_size(s));
+
+ if (!line++)
+ first_line();
+
+ if (s->cache_dma)
+ *p++ = 'd';
+ if (s->hwcache_align)
+ *p++ = 'A';
+ if (s->poison)
+ *p++ = 'P';
+ if (s->reclaim_account)
+ *p++ = 'a';
+ if (s->red_zone)
+ *p++ = 'Z';
+ if (s->store_user)
+ *p++ = 'U';
+
+ *p = 0;
+ if (show_activity) {
+ unsigned long total_alloc;
+ unsigned long total_free;
+
+ total_alloc = s->alloc;
+ total_free = s->free;
+
+		printf("%-21s %8ld %10ld %10ld %5ld %5ld %7ld %5ld %7ld %8d\n",
+			s->name, s->objects,
+			total_alloc, total_free,
+			total_alloc ? (s->alloc_slab_fill * 100 / total_alloc) : 0,
+			total_alloc ? (s->alloc_slab_new * 100 / total_alloc) : 0,
+			s->flush_rfree_list,
+			(total_alloc + total_free) ?
+				(s->flush_rfree_list * 100 / (total_alloc + total_free)) : 0,
+			s->flush_rfree_list_objects,
+			s->order);
+ }
+ else
+		printf("%-21s %8ld %7d %8s %4d %1d %3ld %4d %s\n",
+ s->name, s->objects, s->object_size, size_str,
+ s->objs_per_slab, s->order,
+ s->slabs ? (s->objects * s->object_size * 100) /
+ (s->slabs * (page_size << s->order)) : 100,
+ s->batch, flags);
+}
+
+/*
+ * Analyze debug options. Return false if something is amiss.
+ */
+int debug_opt_scan(char *opt)
+{
+ if (!opt || !opt[0] || strcmp(opt, "-") == 0)
+ return 1;
+
+ if (strcasecmp(opt, "a") == 0) {
+ sanity = 1;
+ poison = 1;
+ redzone = 1;
+ tracking = 1;
+ return 1;
+ }
+
+ for ( ; *opt; opt++)
+ switch (*opt) {
+ case 'F' : case 'f':
+ if (sanity)
+ return 0;
+ sanity = 1;
+ break;
+ case 'P' : case 'p':
+ if (poison)
+ return 0;
+ poison = 1;
+ break;
+
+ case 'Z' : case 'z':
+ if (redzone)
+ return 0;
+ redzone = 1;
+ break;
+
+ case 'U' : case 'u':
+ if (tracking)
+ return 0;
+ tracking = 1;
+ break;
+
+ case 'T' : case 't':
+ if (tracing)
+ return 0;
+ tracing = 1;
+ break;
+ default:
+ return 0;
+ }
+ return 1;
+}
+
+int slab_empty(struct slabinfo *s)
+{
+ if (s->objects > 0)
+ return 0;
+
+ /*
+ * We may still have slabs even if there are no objects. Shrinking will
+ * remove them.
+ */
+ if (s->slabs != 0)
+ set_obj(s, "shrink", 1);
+
+ return 1;
+}
+
+void slab_debug(struct slabinfo *s)
+{
+ if (strcmp(s->name, "*") == 0)
+ return;
+
+ if (redzone && !s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
+ }
+ if (!redzone && s->red_zone) {
+ if (slab_empty(s))
+ set_obj(s, "red_zone", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
+ }
+ if (poison && !s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
+ }
+ if (!poison && s->poison) {
+ if (slab_empty(s))
+ set_obj(s, "poison", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
+ }
+ if (tracking && !s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 1);
+ else
+ fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
+ }
+ if (!tracking && s->store_user) {
+ if (slab_empty(s))
+ set_obj(s, "store_user", 0);
+ else
+ fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
+ }
+}
+
+void totals(void)
+{
+ struct slabinfo *s;
+
+ int used_slabs = 0;
+ char b1[20], b2[20], b3[20], b4[20];
+ unsigned long long max = 1ULL << 63;
+
+ /* Object size */
+ unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
+
+ /* Number of partial slabs in a slabcache */
+ unsigned long long min_partial = max, max_partial = 0,
+ avg_partial, total_partial = 0;
+
+ /* Number of slabs in a slab cache */
+ unsigned long long min_slabs = max, max_slabs = 0,
+ avg_slabs, total_slabs = 0;
+
+ /* Size of the whole slab */
+ unsigned long long min_size = max, max_size = 0,
+ avg_size, total_size = 0;
+
+ /* Bytes used for object storage in a slab */
+ unsigned long long min_used = max, max_used = 0,
+ avg_used, total_used = 0;
+
+ /* Waste: Bytes used for alignment and padding */
+ unsigned long long min_waste = max, max_waste = 0,
+ avg_waste, total_waste = 0;
+ /* Number of objects in a slab */
+ unsigned long long min_objects = max, max_objects = 0,
+ avg_objects, total_objects = 0;
+ /* Waste per object */
+ unsigned long long min_objwaste = max,
+ max_objwaste = 0, avg_objwaste,
+ total_objwaste = 0;
+
+ /* Memory per object */
+ unsigned long long min_memobj = max,
+ max_memobj = 0, avg_memobj,
+ total_objsize = 0;
+
+ for (s = slabinfo; s < slabinfo + slabs; s++) {
+ unsigned long long size;
+ unsigned long used;
+ unsigned long long wasted;
+ unsigned long long objwaste;
+
+ if (!s->slabs || !s->objects)
+ continue;
+
+ used_slabs++;
+
+ size = slab_size(s);
+ used = s->objects * s->object_size;
+ wasted = size - used;
+ objwaste = s->slab_size - s->object_size;
+
+ if (s->object_size < min_objsize)
+ min_objsize = s->object_size;
+ if (s->slabs < min_slabs)
+ min_slabs = s->slabs;
+ if (size < min_size)
+ min_size = size;
+ if (wasted < min_waste)
+ min_waste = wasted;
+ if (objwaste < min_objwaste)
+ min_objwaste = objwaste;
+ if (s->objects < min_objects)
+ min_objects = s->objects;
+ if (used < min_used)
+ min_used = used;
+ if (s->slab_size < min_memobj)
+ min_memobj = s->slab_size;
+
+ if (s->object_size > max_objsize)
+ max_objsize = s->object_size;
+ if (s->slabs > max_slabs)
+ max_slabs = s->slabs;
+ if (size > max_size)
+ max_size = size;
+ if (wasted > max_waste)
+ max_waste = wasted;
+ if (objwaste > max_objwaste)
+ max_objwaste = objwaste;
+ if (s->objects > max_objects)
+ max_objects = s->objects;
+ if (used > max_used)
+ max_used = used;
+ if (s->slab_size > max_memobj)
+ max_memobj = s->slab_size;
+
+ total_slabs += s->slabs;
+ total_size += size;
+ total_waste += wasted;
+
+ total_objects += s->objects;
+ total_used += used;
+
+ total_objwaste += s->objects * objwaste;
+ total_objsize += s->objects * s->slab_size;
+ }
+
+ if (!total_objects) {
+ printf("No objects\n");
+ return;
+ }
+ if (!used_slabs) {
+ printf("No slabs\n");
+ return;
+ }
+
+ /* Per slab averages */
+ avg_slabs = total_slabs / used_slabs;
+ avg_size = total_size / used_slabs;
+ avg_waste = total_waste / used_slabs;
+
+ avg_objects = total_objects / used_slabs;
+ avg_used = total_used / used_slabs;
+
+ /* Per object object sizes */
+ avg_objsize = total_used / total_objects;
+ avg_objwaste = total_objwaste / total_objects;
+ avg_memobj = total_objsize / total_objects;
+
+ printf("Slabcache Totals\n");
+ printf("----------------\n");
+ printf("Slabcaches : %3d Active: %3d\n",
+ slabs, used_slabs);
+
+ store_size(b1, total_size);store_size(b2, total_waste);
+ store_size(b3, total_waste * 100 / total_used);
+ printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
+
+ store_size(b1, total_objects);
+ printf("# Objects : %6s\n", b1);
+
+ printf("\n");
+ printf("Per Cache Average Min Max Total\n");
+ printf("---------------------------------------------------------\n");
+
+ store_size(b1, avg_objects);store_size(b2, min_objects);
+ store_size(b3, max_objects);store_size(b4, total_objects);
+ printf("#Objects %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_slabs);store_size(b2, min_slabs);
+ store_size(b3, max_slabs);store_size(b4, total_slabs);
+ printf("#Slabs %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_size);store_size(b2, min_size);
+ store_size(b3, max_size);store_size(b4, total_size);
+ printf("Memory %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_used);store_size(b2, min_used);
+ store_size(b3, max_used);store_size(b4, total_used);
+ printf("Used %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ store_size(b1, avg_waste);store_size(b2, min_waste);
+ store_size(b3, max_waste);store_size(b4, total_waste);
+ printf("Loss %10s %10s %10s %10s\n",
+ b1, b2, b3, b4);
+
+ printf("\n");
+ printf("Per Object Average Min Max\n");
+ printf("---------------------------------------------\n");
+
+ store_size(b1, avg_memobj);store_size(b2, min_memobj);
+ store_size(b3, max_memobj);
+ printf("Memory %10s %10s %10s\n",
+ b1, b2, b3);
+ store_size(b1, avg_objsize);store_size(b2, min_objsize);
+ store_size(b3, max_objsize);
+ printf("User %10s %10s %10s\n",
+ b1, b2, b3);
+
+ store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
+ store_size(b3, max_objwaste);
+ printf("Loss %10s %10s %10s\n",
+ b1, b2, b3);
+}
+
+void sort_slabs(void)
+{
+ struct slabinfo *s1,*s2;
+
+ for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
+ for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
+ int result;
+
+ if (sort_size)
+ result = slab_size(s1) < slab_size(s2);
+ else if (sort_active)
+ result = slab_activity(s1) < slab_activity(s2);
+ else
+ result = strcasecmp(s1->name, s2->name);
+
+ if (show_inverted)
+ result = -result;
+
+ if (result > 0) {
+ struct slabinfo t;
+
+ memcpy(&t, s1, sizeof(struct slabinfo));
+ memcpy(s1, s2, sizeof(struct slabinfo));
+ memcpy(s2, &t, sizeof(struct slabinfo));
+ }
+ }
+ }
+}
+
+int slab_mismatch(char *slab)
+{
+ return regexec(&pattern, slab, 0, NULL, 0);
+}
+
+void read_slab_dir(void)
+{
+ DIR *dir;
+ struct dirent *de;
+ struct slabinfo *slab = slabinfo;
+ char *p;
+ char *t;
+ int count;
+
+ if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
+ fatal("SYSFS support for SLQB not active\n");
+
+ dir = opendir(".");
+ while ((de = readdir(dir))) {
+ if (de->d_name[0] == '.' ||
+ (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
+ continue;
+ switch (de->d_type) {
+ case DT_DIR:
+ if (chdir(de->d_name))
+ fatal("Unable to access slab %s\n", de->d_name);
+ slab->name = strdup(de->d_name);
+ slab->align = get_obj("align");
+ slab->cache_dma = get_obj("cache_dma");
+ slab->destroy_by_rcu = get_obj("destroy_by_rcu");
+ slab->hwcache_align = get_obj("hwcache_align");
+ slab->object_size = get_obj("object_size");
+ slab->objects = get_obj("objects");
+ slab->total_objects = get_obj("total_objects");
+ slab->objs_per_slab = get_obj("objs_per_slab");
+ slab->order = get_obj("order");
+ slab->poison = get_obj("poison");
+ slab->reclaim_account = get_obj("reclaim_account");
+ slab->red_zone = get_obj("red_zone");
+ slab->slab_size = get_obj("slab_size");
+ slab->slabs = get_obj_and_str("slabs", &t);
+ decode_numa_list(slab->numa, t);
+ free(t);
+ slab->store_user = get_obj("store_user");
+ slab->batch = get_obj("batch");
+ slab->alloc = get_obj("alloc");
+ slab->alloc_slab_fill = get_obj("alloc_slab_fill");
+ slab->alloc_slab_new = get_obj("alloc_slab_new");
+ slab->free = get_obj("free");
+ slab->free_remote = get_obj("free_remote");
+ slab->claim_remote_list = get_obj("claim_remote_list");
+ slab->claim_remote_list_objects = get_obj("claim_remote_list_objects");
+ slab->flush_free_list = get_obj("flush_free_list");
+ slab->flush_free_list_objects = get_obj("flush_free_list_objects");
+ slab->flush_free_list_remote = get_obj("flush_free_list_remote");
+ slab->flush_rfree_list = get_obj("flush_rfree_list");
+ slab->flush_rfree_list_objects = get_obj("flush_rfree_list_objects");
+ slab->flush_slab_free = get_obj("flush_slab_free");
+ slab->flush_slab_partial = get_obj("flush_slab_partial");
+
+ chdir("..");
+ slab++;
+ break;
+ default:
+ fatal("Unknown file type %x\n", de->d_type);
+ }
+ }
+ closedir(dir);
+ slabs = slab - slabinfo;
+ actual_slabs = slabs;
+ if (slabs > MAX_SLABS)
+ fatal("Too many slabs\n");
+}
+
+void output_slabs(void)
+{
+ struct slabinfo *slab;
+
+ for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+
+ if (show_numa)
+ slab_numa(slab, 0);
+ else if (show_track)
+ show_tracking(slab);
+ else if (validate)
+ slab_validate(slab);
+ else if (shrink)
+ slab_shrink(slab);
+ else if (set_debug)
+ slab_debug(slab);
+ else if (show_ops)
+ ops(slab);
+ else if (show_slab)
+ slabcache(slab);
+ else if (show_report)
+ report(slab);
+ }
+}
+
+struct option opts[] = {
+ { "activity", 0, NULL, 'A' },
+ { "debug", 2, NULL, 'd' },
+ { "display-activity", 0, NULL, 'D' },
+ { "empty", 0, NULL, 'e' },
+ { "help", 0, NULL, 'h' },
+ { "inverted", 0, NULL, 'i'},
+ { "numa", 0, NULL, 'n' },
+ { "ops", 0, NULL, 'o' },
+ { "report", 0, NULL, 'r' },
+ { "shrink", 0, NULL, 's' },
+ { "slabs", 0, NULL, 'l' },
+ { "track", 0, NULL, 't'},
+ { "validate", 0, NULL, 'v' },
+ { "zero", 0, NULL, 'z' },
+ { "1ref", 0, NULL, '1'},
+ { NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+ int c;
+ int err;
+ char *pattern_source;
+
+ page_size = getpagesize();
+
+ while ((c = getopt_long(argc, argv, "Ad::Dehil1noprstvzTS",
+ opts, NULL)) != -1)
+ switch (c) {
+ case 'A':
+ sort_active = 1;
+ break;
+ case 'd':
+ set_debug = 1;
+ if (!debug_opt_scan(optarg))
+ fatal("Invalid debug option '%s'\n", optarg);
+ break;
+ case 'D':
+ show_activity = 1;
+ break;
+ case 'e':
+ show_empty = 1;
+ break;
+ case 'h':
+ usage();
+ return 0;
+ case 'i':
+ show_inverted = 1;
+ break;
+ case 'n':
+ show_numa = 1;
+ break;
+ case 'o':
+ show_ops = 1;
+ break;
+ case 'r':
+ show_report = 1;
+ break;
+ case 's':
+ shrink = 1;
+ break;
+ case 'l':
+ show_slab = 1;
+ break;
+ case 't':
+ show_track = 1;
+ break;
+ case 'v':
+ validate = 1;
+ break;
+ case 'z':
+ skip_zero = 0;
+ break;
+ case 'T':
+ show_totals = 1;
+ break;
+ case 'S':
+ sort_size = 1;
+ break;
+
+ default:
+ fatal("%s: Invalid option '%c'\n", argv[0], optopt);
+
+ }
+
+ if (!show_slab && !show_track && !show_report
+ && !validate && !shrink && !set_debug && !show_ops)
+ show_slab = 1;
+
+ if (argc > optind)
+ pattern_source = argv[optind];
+ else
+ pattern_source = ".*";
+
+ err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
+ if (err)
+ fatal("%s: Invalid pattern '%s' code %d\n",
+ argv[0], pattern_source, err);
+ read_slab_dir();
+ if (show_totals)
+ totals();
+ else {
+ sort_slabs();
+ output_slabs();
+ }
+ return 0;
+}


2009-01-14 10:53:31

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Nick,

On Wed, Jan 14, 2009 at 11:04 AM, Nick Piggin <[email protected]> wrote:
> This is the latest SLQB patch. Since last time, we have imported the sysfs
> framework from SLUB, and added specific event counters things for SLQB. I
> had initially been somewhat against this because it makes SLQB depend on
> another complex subsystem (which itself depends back on the slab allocator).
> But I guess it is not fundamentally different than /proc, and there needs to
> be some reporting somewhere. The individual per-slab counters really do make
> performance analysis much easier. There is a Documentation/vm/slqbinfo.c
> file, which is a parser adapted from slabinfo.c for SLUB.
>
> Fixed some bugs, including a nasty one that was causing remote objects to
> sneak onto local freelist, which would mean NUMA allocation was basically
> broken.
>
> The NUMA side of things is now much more complete. NUMA policies are obeyed.
> There is still a known bug where it won't run on a system with CPU-only
> nodes.
>
> CONFIG options are improved.
>
> Credit to some of the engineers at Intel for helping run tests, contributing
> ideas and patches to improve performance and fix bugs.
>
> I think it is getting to the point where it is stable and featureful. It
> really needs to be further proven in the performance area. We'd welcome
> any performance results or suggestions for tests to run.
>
> After this round of review/feedback, I plan to set about getting SLQB merged.

The code looks sane but I am still a bit unhappy that it's not a patchset on top of
SLUB. We've discussed this in the past and you mentioned that the design is
"completely different." Looking at it, I don't see any fundamental reason we
can't do a struct kmem_cache_list layer on top of SLUB which would make
merging of all this much less painful. I mean, at least in the past Linus hasn't
been too keen on adding yet another slab allocator to the kernel and I must
say judging from the SLAB -> SLUB experience, I'm not looking forward to it
either.

Also, to merge this, we need to see numbers. I assume SLQB fixes the
long-standing SLUB vs. SLAB regression reported by Intel and doesn't
introduce new performance regressions? Also, it would be nice for me to
be able to reproduce the numbers, especially for those tests where SLUB
performs worse.

One thing that puzzles me a bit is that in addition to the struct
kmem_cache_list caching, I also see things like cache coloring, avoiding
page allocator pass-through, and lots of prefetch hints in the code
which makes evaluating the performance differences quite difficult. If
these optimizations *are* a win, then why don't we add them to SLUB?

A completely different topic is memory efficiency of SLQB. The current
situation is that SLOB out-performs SLAB by a huge margin whereas SLUB is
usually quite close. With the introduction of kmemtrace, I'm hopeful
that we will be able to fix up many of the badly fitting allocations in
the kernel to narrow the gap between SLUB and SLOB even more and I worry
SLQB will take us back to the SLAB numbers.

> +/*
> + * Primary per-cpu, per-kmem_cache structure.
> + */
> +struct kmem_cache_cpu {
> + struct kmem_cache_list list; /* List for node-local slabs. */
> +
> + unsigned int colour_next;

Do you see a performance improvement with cache coloring? IIRC,
Christoph has stated in the past that SLUB doesn't do it because newer
associative cache designs take care of the issue.

> +/*
> + * Constant size allocations use this path to find index into kmalloc caches
> + * arrays. get_slab() function is used for non-constant sizes.
> + */
> +static __always_inline int kmalloc_index(size_t size)
> +{
> + if (unlikely(!size))
> + return 0;
> + if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
> + return 0;

SLUB doesn't have the above check. Does it fix an actual bug? Should we
add that to SLUB as well?

> +
> + if (unlikely(size <= KMALLOC_MIN_SIZE))
> + return KMALLOC_SHIFT_LOW;
> +
> +#if L1_CACHE_BYTES < 64
> + if (size > 64 && size <= 96)
> + return 1;
> +#endif
> +#if L1_CACHE_BYTES < 128
> + if (size > 128 && size <= 192)
> + return 2;
> +#endif
> + if (size <= 8) return 3;
> + if (size <= 16) return 4;
> + if (size <= 32) return 5;
> + if (size <= 64) return 6;
> + if (size <= 128) return 7;
> + if (size <= 256) return 8;
> + if (size <= 512) return 9;
> + if (size <= 1024) return 10;
> + if (size <= 2 * 1024) return 11;
> + if (size <= 4 * 1024) return 12;
> + if (size <= 8 * 1024) return 13;
> + if (size <= 16 * 1024) return 14;
> + if (size <= 32 * 1024) return 15;
> + if (size <= 64 * 1024) return 16;
> + if (size <= 128 * 1024) return 17;
> + if (size <= 256 * 1024) return 18;
> + if (size <= 512 * 1024) return 19;
> + if (size <= 1024 * 1024) return 20;
> + if (size <= 2 * 1024 * 1024) return 21;
> + return -1;

I suppose we could just make this one return zero and drop the above
check?

> +#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
> +
> +static __always_inline void *kmalloc(size_t size, gfp_t flags)
> +{

So no page allocator pass-through, why is that? Looking at commit
aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
page size or higher kmalloc requests"), I'd assume SLQB would get many
of the same benefits as well? It seems like a bad idea to hang onto
large chunks of pages in caches, no?

> + if (__builtin_constant_p(size)) {
> + struct kmem_cache *s;
> +
> + s = kmalloc_slab(size, flags);
> + if (unlikely(ZERO_OR_NULL_PTR(s)))
> + return s;
> +
> + return kmem_cache_alloc(s, flags);
> + }
> + return __kmalloc(size, flags);
> +}
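
(For reference, the SLUB-style pass-through being asked about is roughly
the following shape -- an illustrative sketch rather than SLUB's actual
code, with the exact size threshold glossed over:)

	static __always_inline void *kmalloc(size_t size, gfp_t flags)
	{
		if (__builtin_constant_p(size)) {
			if (size > PAGE_SIZE)
				/* skip the kmalloc caches entirely */
				return kmalloc_large(size, flags);
			/* ... otherwise pick a kmalloc cache as usual ... */
		}
		return __kmalloc(size, flags);
	}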

> Index: linux-2.6/mm/slqb.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/mm/slqb.c
> @@ -0,0 +1,3368 @@
> +/*
> + * SLQB: A slab allocator that focuses on per-CPU scaling, and good performance
> + * with order-0 allocations. Fastpath emphasis is placed on local allocation
> + * and freeing, but with a secondary goal of good remote freeing (freeing on
> + * another CPU from that which allocated).
> + *
> + * Using ideas and code from mm/slab.c, mm/slob.c, and mm/slub.c.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/bit_spinlock.h>
> +#include <linux/interrupt.h>
> +#include <linux/bitops.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +#include <linux/cpu.h>
> +#include <linux/cpuset.h>
> +#include <linux/mempolicy.h>
> +#include <linux/ctype.h>
> +#include <linux/kallsyms.h>
> +#include <linux/memory.h>
> +
> +static inline int slab_hiwater(struct kmem_cache *s)
> +{
> + return s->hiwater;
> +}
> +
> +static inline int slab_freebatch(struct kmem_cache *s)
> +{
> + return s->freebatch;
> +}
> +
> +/*
> + * slqb_page overloads struct page, and is used to manage some slob allocation
> + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> + * we'll just define our own struct slqb_page type variant here.
> + */

You say horrible mess, I say convenient. I think it's good that core vm
hackers who have no interest in the slab allocator can clearly see we're
overloading some of the struct page fields. But as SLOB does it like
this as well, I suppose we can keep it as-is.

> +struct slqb_page {
> + union {
> + struct {
> + unsigned long flags; /* mandatory */
> + atomic_t _count; /* mandatory */
> + unsigned int inuse; /* Nr of objects */
> + struct kmem_cache_list *list; /* Pointer to list */
> + void **freelist; /* freelist req. slab lock */
> + union {
> + struct list_head lru; /* misc. list */
> + struct rcu_head rcu_head; /* for rcu freeing */
> + };
> + };
> + struct page page;
> + };
> +};
> +static inline void struct_slqb_page_wrong_size(void)
> +{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
> +
> +#define PG_SLQB_BIT (1 << PG_slab)
> +
> +static int kmem_size __read_mostly;
> +#ifdef CONFIG_NUMA
> +static int numa_platform __read_mostly;
> +#else
> +#define numa_platform 0
> +#endif

Hmm, why do we want to do this? If someone is running a CONFIG_NUMA
kernel on an UMA machine, let them suffer?

And if we *do* need to do this, can we move numa_platform() logic out of
the memory allocator?

> +#ifdef CONFIG_SMP
> +/*
> + * If enough objects have been remotely freed back to this list,
> + * remote_free_check will be set. In which case, we'll eventually come here
> + * to take those objects off our remote_free list and onto our LIFO freelist.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> + void **head, **tail;
> + int nr;
> +
> + VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> +
> + if (!l->remote_free.list.nr)
> + return;
> +
> + l->remote_free_check = 0;
> + head = l->remote_free.list.head;
> + prefetchw(head);

So this prefetchw() is for flush_free_list(), right? A comment would be
nice.

> +
> + spin_lock(&l->remote_free.lock);
> + l->remote_free.list.head = NULL;
> + tail = l->remote_free.list.tail;
> + l->remote_free.list.tail = NULL;
> + nr = l->remote_free.list.nr;
> + l->remote_free.list.nr = 0;
> + spin_unlock(&l->remote_free.lock);
> +
> + if (!l->freelist.nr)
> + l->freelist.head = head;
> + else
> + set_freepointer(s, l->freelist.tail, head);
> + l->freelist.tail = tail;
> +
> + l->freelist.nr += nr;
> +
> + slqb_stat_inc(l, CLAIM_REMOTE_LIST);
> + slqb_stat_add(l, CLAIM_REMOTE_LIST_OBJECTS, nr);
> +}
> +#endif
> +
> +/*
> + * Allocation fastpath. Get an object from the list's LIFO freelist, or
> + * return NULL if it is empty.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> + void *object;
> +
> + object = l->freelist.head;
> + if (likely(object)) {
> + void *next = get_freepointer(s, object);
> + VM_BUG_ON(!l->freelist.nr);
> + l->freelist.nr--;
> + l->freelist.head = next;
> + if (next)
> + prefetchw(next);

Why do we need this prefetchw() here?

> + return object;
> + }
> + VM_BUG_ON(l->freelist.nr);
> +
> +#ifdef CONFIG_SMP
> + if (unlikely(l->remote_free_check)) {
> + claim_remote_free_list(s, l);
> +
> + if (l->freelist.nr > slab_hiwater(s))
> + flush_free_list(s, l);
> +
> + /* repetition here helps gcc :( */
> + object = l->freelist.head;
> + if (likely(object)) {
> + void *next = get_freepointer(s, object);
> + VM_BUG_ON(!l->freelist.nr);
> + l->freelist.nr--;
> + l->freelist.head = next;
> + if (next)
> + prefetchw(next);

Or here?

> + return object;
> + }
> + VM_BUG_ON(l->freelist.nr);
> + }
> +#endif
> +
> + return NULL;
> +}
> +
> +/*
> + * Slow(er) path. Get a page from this list's existing pages. Will be a
> + * new empty page in the case that __slab_alloc_page has just been called
> + * (empty pages otherwise never get queued up on the lists), or a partial page
> + * already on the list.
> + *
> + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> + * list_lock in the case of per-node list.
> + */
> +static noinline void *__cache_list_get_page(struct kmem_cache *s, struct kmem_cache_list *l)
> +{
> + struct slqb_page *page;
> + void *object;
> +
> + if (unlikely(!l->nr_partial))
> + return NULL;
> +
> + page = list_first_entry(&l->partial, struct slqb_page, lru);
> + VM_BUG_ON(page->inuse == s->objects);
> + if (page->inuse + 1 == s->objects) {
> + l->nr_partial--;
> + list_del(&page->lru);
> +/*XXX list_move(&page->lru, &l->full); */
> + }
> +
> + VM_BUG_ON(!page->freelist);
> +
> + page->inuse++;
> +
> +// VM_BUG_ON(node != -1 && node != slqb_page_to_nid(page));
> +
> + object = page->freelist;
> + page->freelist = get_freepointer(s, object);
> + if (page->freelist)
> + prefetchw(page->freelist);

I don't understand this prefetchw(). Who exactly is going to be updating
contents of page->freelist?

> +/*
> + * Perform some interrupts-on processing around the main allocation path
> + * (debug checking and memset()ing).
> + */
> +static __always_inline void *slab_alloc(struct kmem_cache *s,
> + gfp_t gfpflags, int node, void *addr)
> +{
> + void *object;
> + unsigned long flags;
> +
> +again:
> + local_irq_save(flags);
> + object = __slab_alloc(s, gfpflags, node);
> + local_irq_restore(flags);
> +

As a cleanup, you could just do:

if (unlikely(object == NULL))
return NULL;

here to avoid the double comparison. Maybe it even generates better asm.
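
(I.e. roughly the following shape -- an editorial sketch of the suggested
cleanup, reusing the names from the quoted code:)

	object = __slab_alloc(s, gfpflags, node);
	local_irq_restore(flags);

	if (unlikely(!object))
		return NULL;

	if (unlikely(slab_debug(s))) {
		if (unlikely(!alloc_debug_processing(s, object, addr)))
			goto again;
	}

	if (unlikely(gfpflags & __GFP_ZERO))
		memset(object, 0, s->objsize);

	return object;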

> + if (unlikely(slab_debug(s)) && likely(object)) {
> + if (unlikely(!alloc_debug_processing(s, object, addr)))
> + goto again;
> + }
> +
> + if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
> + memset(object, 0, s->objsize);
> +
> + return object;
> +}
> +
> +void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> +{
> + int node = -1;
> +#ifdef CONFIG_NUMA
> + if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
> + node = alternate_nid(s, gfpflags, node);
> +#endif
> + return slab_alloc(s, gfpflags, node, __builtin_return_address(0));

The return address is wrong when kmem_cache_alloc() is called through
__kmalloc().

As a side note, you can use the shorter _RET_IP_ instead of
builtin_return_address(0) everywhere.
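
(I.e. something like the following one-liner; note that _RET_IP_ is an
unsigned long, so the addr parameter would either need to change type or
the value would need a cast:)

	return slab_alloc(s, gfpflags, node, (void *)_RET_IP_);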

Pekka

2009-01-14 11:47:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 12:53:18PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Wed, Jan 14, 2009 at 11:04 AM, Nick Piggin <[email protected]> wrote:
> > This is the latest SLQB patch. Since last time, we have imported the sysfs
> > framework from SLUB, and added specific event counters things for SLQB. I
> > had initially been somewhat against this because it makes SLQB depend on
> > another complex subsystem (which itself depends back on the slab allocator).
> > But I guess it is not fundamentally different than /proc, and there needs to
> > be some reporting somewhere. The individual per-slab counters really do make
> > performance analysis much easier. There is a Documentation/vm/slqbinfo.c
> > file, which is a parser adapted from slabinfo.c for SLUB.
> >
> > Fixed some bugs, including a nasty one that was causing remote objects to
> > sneak onto local freelist, which would mean NUMA allocation was basically
> > broken.
> >
> > The NUMA side of things is now much more complete. NUMA policies are obeyed.
> > There is still a known bug where it won't run on a system with CPU-only
> > nodes.
> >
> > CONFIG options are improved.
> >
> > Credit to some of the engineers at Intel for helping run tests, contributing
> > ideas and patches to improve performance and fix bugs.
> >
> > I think it is getting to the point where it is stable and featureful. It
> > really needs to be further proven in the performance area. We'd welcome
> > any performance results or suggestions for tests to run.
> >
> > After this round of review/feedback, I plan to set about getting SLQB merged.
>
> The code looks sane but I am still a bit unhappy that it's not a patchset on top of
> SLUB. We've discussed this in the past and you mentioned that the design is
> "completely different." Looking at it, I don't see any fundamental reason we
> can't do a struct kmem_cache_list layer on top of SLUB which would make
> merging of all this much less painful. I mean, at least in the past Linus hasn't
> been too keen on adding yet another slab allocator to the kernel and I must
> say judging from the SLAB -> SLUB experience, I'm not looking forward to it
> either.

Well, SLUB has all this stuff in it to attempt to make it "unqueued", or
semi unqueued. None of that is required with SLQB; if the object queues
were taken away, the rest of SLQB would be little more than a per-CPU SLOB
with individual slabs. But it also has important differences: it is per-CPU,
obeys NUMA policies strongly, and frees unused pages immediately (after
they drop off the object lists via periodic reaping). Another of the major
things I specifically avoid, for example, is higher-order allocations.

The core allocator algorithms are so completely different that it is
obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
support code and code structure). So it may as well be a patch against
SLAB.

I will also prefer to maintain it myself because as I've said I don't
really agree with choices made in SLUB (and ergo SLUB developers don't
agree with SLQB).

Note that I'm not trying to be nasty here. Of course I raised objections
to things I don't like, and I don't think I'm right by default. Just IMO
SLUB has some problems. As do SLAB and SLQB of course. Nothing is
perfect.

Also, I don't want to propose replacing any of the other allocators yet,
until more performance data is gathered. People need to compare each one.
SLQB definitely is not a clear winner in all tests. At the moment I want
to see healthy competition and hopefully one day decide on just one of
the main 3.


> Also, to merge this, we need to see numbers. I assume SLQB fixes the
> long-standing SLUB vs. SLAB regression reported by Intel and doesn't
> introduce new performance regressions? Also, it would be nice for me to
> be able to reproduce the numbers, especially for those tests where SLUB
> performs worse.

It is comparable to SLAB on Intel's OLTP test. I don't know exactly
where SLUB lies, but I think it is several % below that.

No big obvious new regressions yet, but of course we won't know that
without a lot more testing. SLQB isn't the outright winner in all cases.
For example, on machine A, tbench may be faster with SLAB, but on
machine B it turns out to be faster on SLQB. Another test might show
SLUB is better.


> One thing that puzzles me a bit is that in addition to the struct
> kmem_cache_list caching, I also see things like cache coloring, avoiding
> page allocator pass-through, and lots of prefetch hints in the code
> which makes evaluating the performance differences quite difficult. If
> these optimizations *are* a win, then why don't we add them to SLUB?

I don't know. I don't have enough time of day to work on SLQB enough,
let alone attempt to do all this for SLUB as well. Especially when I
think there are fundamental problems with the basic design of it.

None of those optimisations you mention really showed a noticeable win
anywhere (except avoiding page allocator pass-through perhaps, simply
because that is not an optimisation, rather it would be a de-optimisation
to *add* page allocator pass-through for SLQB, so maybe it would slow
down some loads).

Cache colouring was just brought over from SLAB. Prefetching was done
by looking at cache misses generally, and attempting to reduce them.
But you end up barely making a significant difference or just pushing
the cost elsewhere really. Down to the level of prefetching it is
going to hugely depend on the exact behaviour of the workload and
the allocator.


> A completely different topic is memory efficiency of SLQB. The current
> situation is that SLOB out-performs SLAB by a huge margin whereas SLUB is
> usually quite close. With the introduction of kmemtrace, I'm hopeful
> that we will be able to fix up many of the badly fitting allocations in
> the kernel to narrow the gap between SLUB and SLOB even more and I worry
> SLQB will take us back to the SLAB numbers.

Fundamentally it is more like SLOB and SLUB in that it uses object
pointers and can allocate down to very small sizes. It doesn't have
O(NR_CPUS^2) type behaviours or preallocated array caches like SLAB.
I didn't look closely at memory efficiency, but I have no reason to
think it would be a problem.


> > +/*
> > + * Primary per-cpu, per-kmem_cache structure.
> > + */
> > +struct kmem_cache_cpu {
> > + struct kmem_cache_list list; /* List for node-local slabs. */
> > +
> > + unsigned int colour_next;
>
> Do you see a performance improvement with cache coloring? IIRC,
> Christoph has stated in the past that SLUB doesn't do it because newer
> associative cache designs take care of the issue.

No I haven't seen an improvement.

> > +/*
> > + * Constant size allocations use this path to find index into kmalloc caches
> > + * arrays. get_slab() function is used for non-constant sizes.
> > + */
> > +static __always_inline int kmalloc_index(size_t size)
> > +{
> > + if (unlikely(!size))
> > + return 0;
> > + if (unlikely(size > 1UL << KMALLOC_SHIFT_SLQB_HIGH))
> > + return 0;
>
> SLUB doesn't have the above check. Does it fix an actual bug? Should we
> add that to SLUB as well?

I think SLUB is OK without it because of its page allocator passthrough.


> > + if (size <= 64) return 6;
> > + if (size <= 128) return 7;
> > + if (size <= 256) return 8;
> > + if (size <= 512) return 9;
> > + if (size <= 1024) return 10;
> > + if (size <= 2 * 1024) return 11;
> > + if (size <= 4 * 1024) return 12;
> > + if (size <= 8 * 1024) return 13;
> > + if (size <= 16 * 1024) return 14;
> > + if (size <= 32 * 1024) return 15;
> > + if (size <= 64 * 1024) return 16;
> > + if (size <= 128 * 1024) return 17;
> > + if (size <= 256 * 1024) return 18;
> > + if (size <= 512 * 1024) return 19;
> > + if (size <= 1024 * 1024) return 20;
> > + if (size <= 2 * 1024 * 1024) return 21;
> > + return -1;
>
> I suppose we could just make this one return zero and drop the above
> check?

I guess so... although this is for the constant folded path anyway,
so efficiency is not an issue.


> > +#define KMALLOC_HEADER (ARCH_KMALLOC_MINALIGN < sizeof(void *) ? sizeof(void *) : ARCH_KMALLOC_MINALIGN)
> > +
> > +static __always_inline void *kmalloc(size_t size, gfp_t flags)
> > +{
>
> So no page allocator pass-through, why is that? Looking at commit
> aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB: direct pass through of
> page size or higher kmalloc requests"), I'd assume SLQB would get many
> of the same benefits as well? It seems like a bad idea to hang onto
> large chunks of pages in caches, no?

I don't think so. From that commit:

Advantages:
- Reduces memory overhead for kmalloc array

Fair point. But I'm attempting to compete primarily with SLAB rather than SLOB.

- Large kmalloc operations are faster since they do not
need to pass through the slab allocator to get to the
page allocator.

SLQB is faster than the page allocator.

- Performance increase of 10%-20% on alloc and 50% on free for
PAGE_SIZEd allocations.
SLUB must call page allocator for each alloc anyways since
the higher order pages which that allowed avoiding the page alloc calls
are not available in a reliable way anymore. So we are basically removing
useless slab allocator overhead.

SLQB is more like SLAB in this regard so it doesn't have this problem.

- Large kmallocs yields page aligned object which is what
SLAB did. Bad things like using page sized kmalloc allocations to
stand in for page allocate allocs can be transparently handled and are not
distinguishable from page allocator uses.

I don't understand this one. Definitely SLQB should give page aligned
objects for large kmallocs too.

- Checking for too large objects can be removed since
it is done by the page allocator.

But the check is made for size > PAGE_SIZE anyway, so I don't see the
win.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

And doesn't suffer these drawbacks either of course.

> > +/*
> > + * slqb_page overloads struct page, and is used to manage some slob allocation
> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> > + * we'll just define our own struct slqb_page type variant here.
> > + */
>
> You say horrible mess, I say convenient. I think it's good that core vm
> hackers who have no interest in the slab allocator can clearly see we're
> overloading some of the struct page fields.

Yeah, but you can't really. There are so many places that overload them
for different things and don't tell you about it right in that file. But
it mostly works because we have nice layering and compartmentalisation.

Anyway IIRC my initial patches to do some of these conversions actually
either put the definitions into mm_types.h or at least added references
to them in mm_types.h. It is the better way to go really because you get
better type checking and it is readable. You may say the horrible mess is
readable. Barely. Imagine how it would be if we put everything in there.


> But as SLOB does it like
> this as well, I suppose we can keep it as-is.

I added that ;)


> > +struct slqb_page {
> > + union {
> > + struct {
> > + unsigned long flags; /* mandatory */
> > + atomic_t _count; /* mandatory */
> > + unsigned int inuse; /* Nr of objects */
> > + struct kmem_cache_list *list; /* Pointer to list */
> > + void **freelist; /* freelist req. slab lock */
> > + union {
> > + struct list_head lru; /* misc. list */
> > + struct rcu_head rcu_head; /* for rcu freeing */
> > + };
> > + };
> > + struct page page;
> > + };
> > +};
> > +static inline void struct_slqb_page_wrong_size(void)
> > +{ BUILD_BUG_ON(sizeof(struct slqb_page) != sizeof(struct page)); }
> > +
> > +#define PG_SLQB_BIT (1 << PG_slab)
> > +
> > +static int kmem_size __read_mostly;
> > +#ifdef CONFIG_NUMA
> > +static int numa_platform __read_mostly;
> > +#else
> > +#define numa_platform 0
> > +#endif
>
> Hmm, why do we want to do this? If someone is running a CONFIG_NUMA
> kernel on an UMA machine, let them suffer?

Distros, mainly. SLAB does the same thing of course. There is a tiny
downside for the NUMA case (not measurable, but obviously another branch).
Not worth another config option, although I guess there could be a
config option to basically say "this config is exactly my machine; not
the maximum capabilities of a machine intended to run on this kernel".
That could be useful to everyone, including here.


> And if we *do* need to do this, can we move numa_platform() logic out of
> the memory allocator?

Possible. If it is moved out of SLAB it would make my life (slightly)
easier.


> > +#ifdef CONFIG_SMP
> > +/*
> > + * If enough objects have been remotely freed back to this list,
> > + * remote_free_check will be set. In which case, we'll eventually come here
> > + * to take those objects off our remote_free list and onto our LIFO freelist.
> > + *
> > + * Caller must be the owner CPU in the case of per-CPU list, or hold the node's
> > + * list_lock in the case of per-node list.
> > + */
> > +static void claim_remote_free_list(struct kmem_cache *s, struct kmem_cache_list *l)
> > +{
> > + void **head, **tail;
> > + int nr;
> > +
> > + VM_BUG_ON(!l->remote_free.list.head != !l->remote_free.list.tail);
> > +
> > + if (!l->remote_free.list.nr)
> > + return;
> > +
> > + l->remote_free_check = 0;
> > + head = l->remote_free.list.head;
> > + prefetchw(head);
>
> So this prefetchw() is for flush_free_list(), right? A comment would be
> nice.

Either the flush or the next allocation, whichever comes first.

Added a comment.


> > +static __always_inline void *__cache_list_get_object(struct kmem_cache *s, struct kmem_cache_list *l)
> > +{
> > + void *object;
> > +
> > + object = l->freelist.head;
> > + if (likely(object)) {
> > + void *next = get_freepointer(s, object);
> > + VM_BUG_ON(!l->freelist.nr);
> > + l->freelist.nr--;
> > + l->freelist.head = next;
> > + if (next)
> > + prefetchw(next);
>
> Why do we need this prefetchw() here?

For the next allocation call. But TBH I have not seen a significant
difference in any test. I alternate between commenting it out and not.
I guess when in doubt there should be less code and complexity...

> > + if (next)
> > + prefetchw(next);
>
> Or here?

Ditto.


> > +
> > + object = page->freelist;
> > + page->freelist = get_freepointer(s, object);
> > + if (page->freelist)
> > + prefetchw(page->freelist);
>
> I don't understand this prefetchw(). Who exactly is going to be updating
> contents of page->freelist?

Again, it is for the next allocation. This was shown to reduce cache
misses here in IIRC tbench, but I'm not sure if that translated to a
significant performance improvement.

An alternate approach I have is a patch called "batchfeed", which
basically loads the entire page freelist in this path. But it cost
complexity and the last free word in struct page (which could be
gained back at the cost of yet more complexity). So I'm still on
the fence with this. I will have to take reports of regressions and
see if things like this help.


> > +/*
> > + * Perform some interrupts-on processing around the main allocation path
> > + * (debug checking and memset()ing).
> > + */
> > +static __always_inline void *slab_alloc(struct kmem_cache *s,
> > + gfp_t gfpflags, int node, void *addr)
> > +{
> > + void *object;
> > + unsigned long flags;
> > +
> > +again:
> > + local_irq_save(flags);
> > + object = __slab_alloc(s, gfpflags, node);
> > + local_irq_restore(flags);
> > +
>
> As a cleanup, you could just do:
>
> if (unlikely(object == NULL))
> return NULL;
>
> here to avoid the double comparison. Maybe it even generates better asm.

Sometimes the stupid compiler loads a new literal to return with code
like this. I'll see.

>
> > + if (unlikely(slab_debug(s)) && likely(object)) {
> > + if (unlikely(!alloc_debug_processing(s, object, addr)))
> > + goto again;
> > + }
> > +
> > + if (unlikely(gfpflags & __GFP_ZERO) && likely(object))
> > + memset(object, 0, s->objsize);
> > +
> > + return object;
> > +}
> > +
> > +void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > + int node = -1;
> > +#ifdef CONFIG_NUMA
> > + if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
> > + node = alternate_nid(s, gfpflags, node);
> > +#endif
> > + return slab_alloc(s, gfpflags, node, __builtin_return_address(0));
>
> The return address is wrong when kmem_cache_alloc() is called through
> __kmalloc().

Ah, good catch.


> As a side note, you can use the shorter _RET_IP_ instead of
> builtin_return_address(0) everywhere.

OK.

Thanks for the comments and discussion so far.

Nick

2009-01-14 13:44:57

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Nick,

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> The core allocator algorithms are so completely different that it is
> obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
> support code and code structure). So it may as well be a patch against
> SLAB.
>
> I will also prefer to maintain it myself because as I've said I don't
> really agree with choices made in SLUB (and ergo SLUB developers don't
> agree with SLQB).

Just for the record, I am only interested in getting rid of SLAB (and
SLOB if we can serve the embedded folks as well in the future, sorry
Matt). Now, if that means we have to replace SLUB with SLQB, I am fine
with that. Judging from the SLAB -> SLUB experience, though, I am not
so sure adding a completely separate allocator is the way to get
there.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> Note that I'm not trying to be nasty here. Of course I raised objections
> to things I don't like, and I don't think I'm right by default. Just IMO
> SLUB has some problems. As do SLAB and SLQB of course. Nothing is
> perfect.
>
> Also, I don't want to propose replacing any of the other allocators yet,
> until more performance data is gathered. People need to compare each one.
> SLQB definitely is not a clear winner in all tests. At the moment I want
> to see healthy competition and hopefully one day decide on just one of
> the main 3.

OK, then it is really up to Andrew and Linus to decide whether they
want to merge it or not. I'm not violently against it, it's just that
there's some maintenance overhead for API changes and for external
code like kmemcheck, kmemtrace, and failslab, that need hooks in the
slab allocator.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
>> One thing that puzzles me a bit is that in addition to the struct
>> kmem_cache_list caching, I also see things like cache coloring, avoiding
>> page allocator pass-through, and lots of prefetch hints in the code
>> which makes evaluating the performance differences quite difficult. If
>> these optimizations *are* a win, then why don't we add them to SLUB?
>
> I don't know. I don't have enough time of day to work on SLQB enough,
> let alone attempt to do all this for SLUB as well. Especially when I
> think there are fundamental problems with the basic design of it.
>
> None of those optimisations you mention really showed a noticeable win
> anywhere (except avoiding page allocator pass-through perhaps, simply
> because that is not an optimisation, rather it would be a de-optimisation
> to *add* page allocator pass-through for SLQB, so maybe it would slow
> down some loads).
>
> Cache colouring was just brought over from SLAB. Prefetching was done
> by looking at cache misses generally, and attempting to reduce them.
> But you end up barely making a significant difference or just pushing
> the cost elsewhere really. Down to the level of prefetching it is
> going to hugely depend on the exact behaviour of the workload and
> the allocator.

As far as I understood, the prefetch optimizations can produce
unexpected results on some systems (yes, bit of hand-waving here), so
I would consider ripping them out. Even if cache coloring isn't a huge
win on most systems, it's probably not going to hurt either.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
>> A completely different topic is memory efficiency of SLQB. The current
>> situation is that SLOB out-performs SLAB by a huge margin whereas SLUB is
>> usually quite close. With the introduction of kmemtrace, I'm hopeful
>> that we will be able to fix up many of the badly fitting allocations in
>> the kernel to narrow the gap between SLUB and SLOB even more and I worry
>> SLQB will take us back to the SLAB numbers.
>
> Fundamentally it is more like SLOB and SLUB in that it uses object
> pointers and can allocate down to very small sizes. It doesn't have
> O(NR_CPUS^2) type behaviours or preallocated array caches like SLAB.
> I didn't look closely at memory efficiency, but I have no reason to
> think it would be a problem.

Right, that's nice to hear.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
>> > +/*
>> > + * slqb_page overloads struct page, and is used to manage some slob allocation
>> > + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
>> > + * we'll just define our own struct slqb_page type variant here.
>> > + */
>>
>> You say horrible mess, I say convenient. I think it's good that core vm
>> hackers who have no interest in the slab allocator can clearly see we're
>> overloading some of the struct page fields.
>
> Yeah, but you can't really. There are so many places that overload them
> for different things and don't tell you about it right in that file. But
> it mostly works because we have nice layering and compartmentalisation.
>
> Anyway IIRC my initial patches to do some of these conversions actually
> either put the definitions into mm_types.h or at least added references
> to them in mm_types.h. It is the better way to go really because you get
> better type checking and it is readable. You may say the horrible mess is
> readable. Barely. Imagine how it would be if we put everything in there.

Well, if we only had one slab allocator... But yeah, point taken.

On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
>> > + object = page->freelist;
>> > + page->freelist = get_freepointer(s, object);
>> > + if (page->freelist)
>> > + prefetchw(page->freelist);
>>
>> I don't understand this prefetchw(). Who exactly is going to be updating
>> contents of page->freelist?
>
> Again, it is for the next allocation. This was shown to reduce cache
> misses here in IIRC tbench, but I'm not sure if that translated to a
> significant performance improvement.

I'm not sure why you would want to optimize for the next allocation. I
mean, I'd expect us to optimize for the kmalloc() + do some work +
kfree() case where prefetching is likely to hurt more than help. Not
that I have any numbers on this.

Pekka

2009-01-14 14:22:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 03:44:44PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> > The core allocator algorithms are so completely different that it is
> > obviously as different from SLUB as SLUB is from SLAB (apart from peripheral
> > support code and code structure). So it may as well be a patch against
> > SLAB.
> >
> > I will also prefer to maintain it myself because as I've said I don't
> > really agree with choices made in SLUB (and ergo SLUB developers don't
> > agree with SLQB).
>
> Just for the record, I am only interested in getting rid of SLAB (and
> SLOB if we can serve the embedded folks as well in the future, sorry
> Matt). Now, if that means we have to replace SLUB with SLQB, I am fine
> with that. Judging from the SLAB -> SLUB experience, though, I am not
> so sure adding a completely separate allocator is the way to get
> there.

The problem is there was apparently no plan for resolving the SLAB vs SLUB
strategy. And then features and things were added to one or the other one.
But on the other hand, the SLUB experience was a success in a way because
there were a lot of performance regressions found and fixed after it was
merged, for example.

I'd love to be able to justify replacing SLAB and SLUB today, but actually
it is simply never going to be trivial to discover performance regressions.
So I don't think outright replacement is great either (consider if SLUB
had replaced SLAB completely).


> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> > Note that I'm not trying to be nasty here. Of course I raised objections
> > to things I don't like, and I don't think I'm right by default. Just IMO
> > SLUB has some problems. As do SLAB and SLQB of course. Nothing is
> > perfect.
> >
> > Also, I don't want to propose replacing any of the other allocators yet,
> > until more performance data is gathered. People need to compare each one.
> > SLQB definitely is not a clear winner in all tests. At the moment I want
> > to see healthy competition and hopefully one day decide on just one of
> > the main 3.
>
> OK, then it is really up to Andrew and Linus to decide whether they
> want to merge it or not. I'm not violently against it, it's just that
> there's some maintenance overhead for API changes and for external
> code like kmemcheck, kmemtrace, and failslab, that need hooks in the
> slab allocator.

Sure. On the slab side, I would be happy to do that work.

We split the user base, which is a big problem if it drags out for years
like SLAB/SLUB. But if it is a very deliberate use of testing resources
in order to make progress on the issue, then that can be a positive.



> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> > Cache colouring was just brought over from SLAB. Prefetching was done
> > by looking at cache misses generally, and attempting to reduce them.
> > But you end up barely making a significant difference or just pushing
> > the cost elsewhere really. Down to the level of prefetching it is
> > going to hugely depend on the exact behaviour of the workload and
> > the allocator.
>
> As far as I understood, the prefetch optimizations can produce
> unexpected results on some systems (yes, bit of hand-waving here), so
> I would consider ripping them out. Even if cache coloring isn't a huge
> win on most systems, it's probably not going to hurt either.

I hit problems on some microbenchmark where I was prefetching a NULL pointer in
some cases, which must have been causing the CPU to trap internally and alloc
overhead suddenly got much higher ;)

Possibly sometimes if you prefetch too early or when various queues are
already full, then it ends up causing slowdowns too.


> On Wed, Jan 14, 2009 at 1:47 PM, Nick Piggin <[email protected]> wrote:
> >> > + object = page->freelist;
> >> > + page->freelist = get_freepointer(s, object);
> >> > + if (page->freelist)
> >> > + prefetchw(page->freelist);
> >>
> >> I don't understand this prefetchw(). Who exactly is going to be updating
> >> contents of page->freelist?
> >
> > Again, it is for the next allocation. This was shown to reduce cache
> > misses here in IIRC tbench, but I'm not sure if that translated to a
> > significant performance improvement.
>
> I'm not sure why you would want to optimize for the next allocation. I
> mean, I'd expect us to optimize for the kmalloc() + do some work +
> kfree() case where prefetching is likely to hurt more than help. Not
> that I have any numbers on this.

That's true, OTOH if the time between allocations is large, then an
extra prefetch is just a small cost. If the time between them is
short, it might be a bigger win.

2009-01-14 14:45:44

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Nick,

On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <[email protected]> wrote:
> The problem is there was apparently no plan for resolving the SLAB vs SLUB
> strategy. And then features and things were added to one or the other one.
> But on the other hand, the SLUB experience was a success in a way because
> there were a lot of performance regressions found and fixed after it was
> merged, for example.

That's not completely true. I can't speak for Christoph, but the
biggest problem I have is that I have _no way_ of reproducing or
analyzing the regression. I've tried out various benchmarks I have
access to but I haven't been able to find anything.

The hypothesis is that SLUB regresses because of kmalloc()/kfree()
ping-pong between CPUs and as far as I understood, Christoph thinks we
can improve SLUB with the per-cpu alloc patches and the freelist
management rework.

Don't get me wrong, though. I am happy you are able to work with the
Intel engineers to fix the long standing issue (I want it fixed too!)
but I would be happier if the end-result was a few simple patches
against mm/slub.c :-).

On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <[email protected]> wrote:
> I'd love to be able to justify replacing SLAB and SLUB today, but actually
> it is simply never going to be trivial to discover performance regressions.
> So I don't think outright replacement is great either (consider if SLUB
> had replaced SLAB completely).

If you ask me, I wish we *had* removed SLAB so relevant people could
have made a huge stink out of it and the regression would have been
taken care of quickly ;-).

Pekka

2009-01-14 15:09:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 04:45:15PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <[email protected]> wrote:
> > The problem is there was apparently no plan for resolving the SLAB vs SLUB
> > strategy. And then features and things were added to one or the other one.
> > But on the other hand, the SLUB experience was a success in a way because
> > there were a lot of performance regressions found and fixed after it was
> > merged, for example.
>
> That's not completely true. I can't speak for Christoph, but the
> biggest problem I have is that I have _no way_ of reproducing or
> analyzing the regression. I've tried out various benchmarks I have
> access to but I haven't been able to find anything.
>
> The hypothesis is that SLUB regresses because of kmalloc()/kfree()
> ping-pong between CPUs and as far as I understood, Christoph thinks we
> can improve SLUB with the per-cpu alloc patches and the freelist
> management rework.
>
> Don't get me wrong, though. I am happy you are able to work with the
> Intel engineers to fix the long standing issue (I want it fixed too!)
> but I would be happier if the end-result was a few simple patches
> against mm/slub.c :-).

Right, but that regression isn't my only problem with SLUB. I think
higher order allocations could be much more damaging for a wider
class of users. It is less common to see higher order allocation failure
reports in places other than lkml, where people tend to have systems
stay up longer and/or do a wider range of things with them.

The idea of removing queues doesn't seem so good to me. Queues are good.
You amortize or avoid all sorts of things with queues. We have them
everywhere in the kernel ;)


> On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <[email protected]> wrote:
> > I'd love to be able to justify replacing SLAB and SLUB today, but actually
> > it is simply never going to be trivial to discover performance regressions.
> > So I don't think outright replacement is great either (consider if SLUB
> > had replaced SLAB completely).
>
> If you ask me, I wish we *had* removed SLAB so relevant people could
> have made a huge stink out of it and the regression would have been
> taken care of quickly ;-).

Well, presumably the stink was made, because we've been stuck with SLAB
for 2 years for some reason. But it is not only that one; there were
other regressions too. The point simply is that it would have been much
harder for users to detect whether there even is a regression, what with all
the other changes happening.

2009-01-14 15:22:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hit send a bit too early here.

On Wed, Jan 14, 2009 at 04:09:00PM +0100, Nick Piggin wrote:
> On Wed, Jan 14, 2009 at 04:45:15PM +0200, Pekka Enberg wrote:
> >
> > Don't get me wrong, though. I am happy you are able to work with the
> > Intel engineers to fix the long standing issue (I want it fixed too!)
> > but I would be happier if the end-result was a few simple patches
> > against mm/slub.c :-).
>
> Right, but that regression isn't my only problem with SLUB. I think
> higher order allocations could be much more damaging for a wider
> class of users. It is less common to see higher order allocation failure

Sorry, *more* common.


> reports in places other than lkml, where people tend to have systems
> stay up longer and/or do a wider range of things with them.
>
> The idea of removing queues doesn't seem so good to me. Queues are good.
> You amortize or avoid all sorts of things with queues. We have them
> everywhere in the kernel ;)
>
>
> > On Wed, Jan 14, 2009 at 4:22 PM, Nick Piggin <[email protected]> wrote:
> > > I'd love to be able to justify replacing SLAB and SLUB today, but actually
> > > it is simply never going to be trivial to discover performance regressions.
> > > So I don't think outright replacement is great either (consider if SLUB
> > > had replaced SLAB completely).
> >
> > If you ask me, I wish we *had* removed SLAB so relevant people could
> > have made a huge stink out of it and the regression would have been
> > taken care of quickly ;-).
>
> Well, presumably the stink was made because we've been stuck with SLAB
> for 2 years for some reason. But it is not only that one but there were
> other regressions too. Point simply is that it would have been much
> harder for users to detect if there even is a regression, what with all
> the other changes happening.

It would have been harder to detect SLUB vs SLAB regressions if they
had been forced to bisect (which could lead eg. to CFS, or GSO), rather
than select between two allocators.

And... IIRC, the Intel guys did make a stink but it wasn't considered
so important or worthwhile to fix for some reason? Anyway, the fact is
that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
failure of SLUB that it hasn't managed to replace SLAB by this point.

2009-01-14 15:31:01

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Nick,

On Wed, Jan 14, 2009 at 5:22 PM, Nick Piggin <[email protected]> wrote:
> And... IIRC, the Intel guys did make a stink but it wasn't considered
> so important or worthwhile to fix for some reason? Anyway, the fact is
> that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
> failure of SLUB that it hasn't managed to replace SLAB by this point.

Again, not speaking for Christoph, but *I* do consider the regression
to be important and I do want it to be fixed. I have asked for a test
case to reproduce the regression and/or oprofile reports but have yet
to receive them. I did fix one regression I saw with the fio benchmark
but unfortunately it wasn't the same regression the Intel guys are
hitting. I suppose we're in limbo now because the people who are
affected by the regression can simply turn on CONFIG_SLAB.

In any case, I do agree that the inability to replace SLAB with SLUB
is a failure on the latter. I'm just not totally convinced that it's
because the SLUB code is unfixable ;).

Pekka

2009-01-14 15:59:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 05:30:48PM +0200, Pekka Enberg wrote:
> Hi Nick,
>
> On Wed, Jan 14, 2009 at 5:22 PM, Nick Piggin <[email protected]> wrote:
> > And... IIRC, the Intel guys did make a stink but it wasn't considered
> > so important or worthwhile to fix for some reason? Anyway, the fact is
> > that it hadn't been fixed in SLUB. Hmm, I guess it is a significant
> > failure of SLUB that it hasn't managed to replace SLAB by this point.
>
> Again, not speaking for Christoph, but *I* do consider the regression
> to be important and I do want it to be fixed. I have asked for a test
> case to reproduce the regression and/or oprofile reports but have yet
> to receive them. I did fix one regression I saw with the fio benchmark
> but unfortunately it wasn't the same regression the Intel guys are
> hitting. I suppose we're in limbo now because the people who are
> affected by the regression can simply turn on CONFIG_SLAB.

Mmm. SLES11 will ship with CONFIG_SLAB, FWIW. No, I actually didn't
have any input into the decision. And I have mixed feelings about
that because there are places where SLAB is better.

But I must say that SLAB seems to be a really good allocator, and
outside of some types of microbenchmarks where it would sometimes
be much slower than SLUB, SLAB was often my main performance competitor
and often very hard to match with SLQB, let alone beat. That's not
to say SLUB wasn't also often the faster of the two, but I was
surprised at how good SLAB is.


> In any case, I do agree that the inability to replace SLAB with SLUB
> is a failure of the latter. I'm just not totally convinced that it's
> because the SLUB code is unfixable ;).

Well if you would like to consider SLQB as a fix for SLUB, that's
fine by me ;) Actually I guess it is a valid way to look at the problem:
SLQB solves the OLTP regression, so the only question is "what is the
downside of it?".


2009-01-14 18:02:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, 14 Jan 2009, Nick Piggin wrote:

> Right, but that regression isn't my only problem with SLUB. I think
> higher order allocations could be much more damaging for a wider
> class of users. It is less common to see higher order allocation failure
> reports in places other than lkml, where people tend to have systems
> stay up longer and/or do a wider range of things with them.

The higher orders can fail and will then result in the allocator doing
order 0 allocs. It is not a failure condition. Higher orders are an
advantage because they localize variables of the same type and therefore
reduce TLB pressure.
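
To make that concrete, here is a rough userspace sketch of such a
fallback (the names are invented and malloc stands in for the page
allocator; this is not the actual SLUB code):

#include <stdlib.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/*
 * Toy model: try to grab a higher-order chunk for a new slab and fall
 * back to a single order-0 page if the larger chunk cannot be had.
 */
static void *alloc_slab_pages(unsigned int preferred_order,
			      unsigned int *order_out)
{
	void *chunk = malloc(PAGE_SIZE << preferred_order);

	if (chunk) {
		*order_out = preferred_order;
		return chunk;
	}
	/* Higher order failed: not an error, just take one page. */
	*order_out = 0;
	return malloc(PAGE_SIZE);
}

int main(void)
{
	unsigned int order;
	void *slab = alloc_slab_pages(3, &order);	/* ask for 32K */

	printf("got a slab of order %u (%lu bytes)\n",
	       order, PAGE_SIZE << order);
	free(slab);
	return 0;
}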

> The idea of removing queues doesn't seem so good to me. Queues are good.
> You amortize or avoid all sorts of things with queues. We have them
> everywhere in the kernel ;)

Queues require maintenance which introduces variability because queue
cleaning has to be done periodically and the queues grow in number if NUMA
scenarios have to be handled effectively. This is a big problem for low
latency applications (like in HPC). Spending far too much time optimizing
queue cleaning in SLAB led to the SLUB idea.

2009-01-14 18:41:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, 14 Jan 2009, Nick Piggin wrote:

> Well if you would like to consider SLQB as a fix for SLUB, that's
> fine by me ;) Actually I guess it is a valid way to look at the problem:
> SLQB solves the OLTP regression, so the only question is "what is the
> downside of it?".

The downside is that it brings the SLAB stuff back that SLUB was
designed to avoid. Queue expiration. The use of timers to expire at
uncontrollable intervals for user space. Object dispersal
in the kernel address space. Memory policy handling in the slab
allocator. Even seems to include periodic moving of objects between
queues. The NUMA stuff is still a bit foggy to me since it seems to assume
a mapping between cpus and nodes. There are cpuless nodes as well as
memoryless nodes.

SLQB may be a good cleanup for SLAB. It's good that it is based on the
cleaned up code in SLUB but the fundamental design is SLAB (or rather the
Solaris allocator from which we got the design for all the queuing stuff
in the first place). It preserves many of the drawbacks of that code.

If SLQB would replace SLAB then there would be a lot of shared code
(debugging for example). Having a generic slab allocator framework may
then be possible within which a variety of algorithms may be implemented.

2009-01-15 06:03:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 12:01:32PM -0600, Christoph Lameter wrote:
> On Wed, 14 Jan 2009, Nick Piggin wrote:
>
> > Right, but that regression isn't my only problem with SLUB. I think
> > higher order allocations could be much more damaging for a wider
> > class of users. It is less common to see higher order allocation failure
> > reports in places other than lkml, where people tend to have systems
> > stay up longer and/or do a wider range of things with them.
>
> The higher orders can fail and will then result in the allocator doing
> order 0 allocs. It is not a failure condition.

But they increase pressure on the resource and reduce availability to
other higher order allocations. They accelerate the breakdown of the
anti-frag heuristics, and they make slab internal fragmentation worse.
They also simply cost more to allocate and free and reclaim.


> Higher orders are an
> advantage because they localize variables of the same type and therefore
> reduce TLB pressure.

They are also a disadvantage. The disadvantages are very real. The
advantage is a bit theoretical (how much really is it going to help
going from 4K to 32K, if you still have hundreds or thousands of
slabs anyway?). Also, there is no reason why the other allocators
cannot use higher order allocations, but their big advantage is that
they don't need to.


> > The idea of removing queues doesn't seem so good to me. Queues are good.
> > You amortize or avoid all sorts of things with queues. We have them
> > everywhere in the kernel ;)
>
> Queues require maintenance which introduces variability because queue
> cleaning has to be done periodically and the queues grow in number if NUMA
> scenarios have to be handled effectively. This is a big problem for low
> latency applications (like in HPC). Spending far too much time optimizing
> queue cleaning in SLAB led to the SLUB idea.

I'd like to see any real numbers showing this is a problem. Queue
trimming in SLQB can easily be scaled or tweaked to change latency
characteristics. The fact is that it isn't a very critical or highly
tuned operation. It happens _very_ infrequently in the large scheme
of things, and could easily be changed if there is a problem.

What you have in SLUB IMO is not obviously better because it effectively
has sizeable queues in higher order partial and free pages and the
active page, which simply never get trimmed, AFAIKS. This can be harmful
for slab internal fragmentation as well in some situations.
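
For illustration, a minimal userspace model of the kind of queue being
argued about (invented names, not the actual SLQB code): frees on the
hot path are bounded by a watermark, and the periodic trim only drains
whatever is left once the cache goes idle.

#include <stdlib.h>
#include <stdio.h>

/*
 * Toy model of a per-cpu free list: frees are queued and the queue is
 * flushed in batches once it crosses a high watermark.
 */
struct object {
	struct object *next;
};

struct freelist {
	struct object *head;
	unsigned int nr;
	unsigned int high_watermark;	/* flush threshold */
	unsigned int batch;		/* how much to flush at once */
};

static void flush_some(struct freelist *q, unsigned int count)
{
	while (count-- && q->head) {
		struct object *obj = q->head;

		q->head = obj->next;
		q->nr--;
		free(obj);		/* "return to the page allocator" */
	}
}

static void queue_free(struct freelist *q, struct object *obj)
{
	obj->next = q->head;
	q->head = obj;
	if (++q->nr > q->high_watermark)
		flush_some(q, q->batch);	/* bounded by watermark */
}

static void periodic_trim(struct freelist *q)
{
	flush_some(q, q->nr);		/* only touches idle leftovers */
}

int main(void)
{
	struct freelist q = { NULL, 0, 64, 32 };
	int i;

	for (i = 0; i < 1000; i++)
		queue_free(&q, malloc(sizeof(struct object)));
	printf("after busy phase: %u objects queued\n", q.nr);
	periodic_trim(&q);
	printf("after trim: %u objects queued\n", q.nr);
	return 0;
}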

2009-01-15 06:19:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 14, 2009 at 12:40:12PM -0600, Christoph Lameter wrote:
> On Wed, 14 Jan 2009, Nick Piggin wrote:
>
> > Well if you would like to consider SLQB as a fix for SLUB, that's
> > fine by me ;) Actually I guess it is a valid way to look at the problem:
> > SLQB solves the OLTP regression, so the only question is "what is the
> > downside of it?".
>
> The downside is that it brings the SLAB stuff back that SLUB was
> designed to avoid. Queue expiration.

What's this mean? Something distinct from periodic timer?

> The use of timers to expire at
> uncontrollable intervals for user space.

I am not convinced this is a problem. I would like to see evidence
that it is a problem, but I have only seen assertions.

Definitely it is not uncontrollable. And not unchangeable. It is
about the least sensitive part of the allocator because in a serious
workload, the queues will continually be bounded by watermarks rather
than timer reaping.


> Object dispersal
> in the kernel address space.

You mean due to lower order allocations?
1. I have not seen any results showing this gives a practical performance
increase, let alone one that offsets the downsides of using higher
order allocations.
2. Increased internal fragmentation may also have the opposite effect and
result in worse packing.
3. There is no reason why SLQB can't use higher order allocations if this
is a significant win.


> Memory policy handling in the slab
> allocator.

I see no reason why this should be a problem. The SLUB merge just asserted
it would be a problem. But actually SLAB seems to handle it just fine, and
SLUB also doesn't always obey memory policies, so I consider that to be a
worse problem, at least until it is justified by performance numbers that
show otherwise.


> Even seems to include periodic moving of objects between
> queues.

The queues expire slowly. Same as SLAB's arrays. You are describing the
implementation, and not the problems it has.


> The NUMA stuff is still a bit foggy to me since it seems to assume
> a mapping between cpus and nodes. There are cpuless nodes as well as
> memoryless nodes.

That needs a little bit of work, but my primary focus is to come up
with a design that has competitive performance in the most important
cases.

There needs to be some fallback cases added to slowpaths to handle
these things, but I don't see why it would take much work.


> SLQB may be a good cleanup for SLAB. It's good that it is based on the
> cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> Solaris allocator from which we got the design for all the queuing stuff
> in the first place). It preserves many of the drawbacks of that code.

It is _like_ slab. It avoids the major drawbacks of large footprint of
array caches, and O(N^2) memory consumption behaviour, and corner cases
where scalability is poor. The queueing behaviour of SLAB IMO is not
a drawback and it is a big reason why SLAB is so good.


> If SLQB would replace SLAB then there would be a lot of shared code
> (debugging for example). Having a generic slab allocator framework may
> then be possible within which a variety of algorithms may be implemented.

The goal is to replace SLAB and SLUB. Anything less would be a failure
on behalf of SLQB. Shared code is not a bad thing, but the major problem
is the actual core behaviour of the allocator because it affects almost
everywhere in the kernel and splitting userbase is not a good thing.

2009-01-15 20:57:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 15 Jan 2009, Nick Piggin wrote:

> > The higher orders can fail and will then result in the allocator doing
> > order 0 allocs. It is not a failure condition.
>
> But they increase pressure on the resource and reduce availability to
> other higher order allocations. They accelerate the breakdown of the
> anti-frag heuristics, and they make slab internal fragmentation worse.
> They also simply cost more to allocate and free and reclaim.

The costs are less since there is no need to have metadata for each of
the 4k sized pages. Instead there is one contiguous chunk that can be
tracked as a whole.

> > Higher orders are an
> > advantage because they localize variables of the same type and therefore
> > reduce TLB pressure.
>
> They are also a disadvantage. The disadvantages are very real. The
> advantage is a bit theoretical (how much really is it going to help
> going from 4K to 32K, if you still have hundreds or thousands of
> slabs anyway?). Also, there is no reason why the other allocators
> > cannot use higher order allocations, but their big advantage is that
> they don't need to.

The benefit of going from 4k to 32k is that 8 times as many objects may be
handled by the same 2M TLB covering a 32k page. If the 4k pages are
dispersed then you may need 8 2M tlbs (which covers already a quarter of
the available 2M TLBs on nehalem f.e.) for which the larger alloc just
needs a single one.

> > > The idea of removing queues doesn't seem so good to me. Queues are good.
> > > You amortize or avoid all sorts of things with queues. We have them
> > > everywhere in the kernel ;)
> >
> > Queues require maintenance which introduces variability because queue
> > cleaning has to be done periodically and the queues grow in number if NUMA
> > scenarios have to be handled effectively. This is a big problem for low
> > latency applications (like in HPC). Spending far too much time optimizing
> > queue cleaning in SLAB led to the SLUB idea.
>
> I'd like to see any real numbers showing this is a problem. Queue
> trimming in SLQB can easily be scaled or tweaked to change latency
> characteristics. The fact is that it isn't a very critical or highly
> tuned operation. It happens _very_ infrequently in the large scheme
> of things, and could easily be changed if there is a problem.

Queue trimming can be configured in the same way in SLAB. But this means
that you are forever tuning these things as loads vary. That's one of the
frustrations that led to the SLUB design. Also if the objects in queues
are not bound to a particular page (as in slub) then traversal of the queues
can be very TLB fault intensive.

> What you have in SLUB IMO is not obviously better because it effectively
> has sizeable queues in higher order partial and free pages and the
> active page, which simply never get trimmed, AFAIKS. This can be harmful
> for slab internal fragmentation as well in some situations.

It has lists of free objects that are bound to a particular page. That
simplifies numa handling since all the objects in a "queue" (or page) have
the same NUMA characteristics. There is no moving between queues
(there is one exception but in general that is true) because the page list can
become the percpu list by just using the pointer to the head object.

Slab internal fragmentation is already a problem in SLAB. The solution
would be a targeted reclaim mechanism. Something like what I proposed
with the slab defrag patches.

There is no need for trimming since there is no queue in the SLAB sense. A
page is assigned to a processor and then that processor takes objects off
the freelist and may free objects back to the freelist of that page that
was assigned to a processor. Memory wastage may only occur because
each processor needs to have a separate page from which to allocate.
SLAB-like designs need to put a large number of objects in queues which may
keep a number of pages in the allocated pages pool although all objects
are unused. That does not occur with slub.
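
As a rough userspace sketch of the structure described above (invented
names, not the actual SLUB code), the "queue" is the chain of free
objects threaded through a single page, so everything on it shares that
page's NUMA node:

#include <stdlib.h>
#include <stdio.h>

#define OBJECTS_PER_PAGE 8

/*
 * Toy model of a per-page freelist: free objects are chained inside the
 * page itself, so the only "queue" a cpu works from is the one page
 * currently assigned to it.
 */
struct object {
	struct object *next;
	char payload[56];
};

struct page_slab {
	int node;			/* NUMA node the page came from */
	struct object *freelist;	/* free objects in this page only */
	struct object objects[OBJECTS_PER_PAGE];
};

static struct page_slab *new_page_slab(int node)
{
	struct page_slab *page = malloc(sizeof(*page));
	int i;

	page->node = node;
	page->freelist = NULL;
	for (i = 0; i < OBJECTS_PER_PAGE; i++) {
		page->objects[i].next = page->freelist;
		page->freelist = &page->objects[i];
	}
	return page;
}

static struct object *page_alloc_object(struct page_slab *page)
{
	struct object *obj = page->freelist;

	if (obj)
		page->freelist = obj->next;
	return obj;
}

static void page_free_object(struct page_slab *page, struct object *obj)
{
	obj->next = page->freelist;
	page->freelist = obj;
}

int main(void)
{
	struct page_slab *cpu_page = new_page_slab(0);
	struct object *obj = page_alloc_object(cpu_page);

	printf("object %p comes from node %d\n", (void *)obj, cpu_page->node);
	page_free_object(cpu_page, obj);
	free(cpu_page);
	return 0;
}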


2009-01-15 21:06:39

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 15 Jan 2009, Nick Piggin wrote:

> Definitely it is not uncontrollable. And not unchangeable. It is
> about the least sensitive part of the allocator because in a serious
> workload, the queues will continually be bounded by watermarks rather
> than timer reaping.

The application that is interrupted has no control over when SLQB runs its
expiration. The longer the queues the longer the holdoff. Look at the
changelogs for various queue expiration things in the kernel. I fixed up a
couple of those over the years for latency reasons.

> > Object dispersal
> > in the kernel address space.
>
> You mean due to lower order allocations?
> 1. I have not seen any results showing this gives a practical performance
> increase, let alone one that offsets the downsides of using higher
> order allocations.

Well yes with enterprise app you are likely not going to see it. Run HPC
and other low latency tests (Infiniband based and such).

> 2. Increased internal fragmentation may also have the opposite effect and
> result in worse packing.

Memory allocations in latency critical apps are generally done in
contexts where high latencies are tolerable (f.e. at startup).

> 3. There is no reason why SLQB can't use higher order allocations if this
> is a significant win.

It still will have to move objects between queues? Or does it adapt the
slub method of "queue" per page?

> > Memory policy handling in the slab
> > allocator.
>
> I see no reason why this should be a problem. The SLUB merge just asserted
> it would be a problem. But actually SLAB seems to handle it just fine, and
> SLUB also doesn't always obey memory policies, so I consider that to be a
> worse problem, at least until it is justified by performance numbers that
> show otherwise.

Well I wrote the code in SLAB that does this. And AFAICT this was a very
bad hack that I had to put in after all the original developers of the
NUMA slab stuff vanished and things began to segfault.

SLUB obeys memory policies. It just uses the page allocator for this by
doing an allocation *without* specifying the node that memory has to come
from. SLAB manages memory strictly per node. So it always has to ask for
memory from a particular node. Hence the need to implement memory policies
in the allocator.
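
For illustration, a toy model of the two approaches (the helper is
invented and only pretends to be MPOL_INTERLEAVE; neither the real page
allocator nor the real mempolicy code is used): one consults the policy
per object, the other once per slab page.

#include <stdio.h>

#define NR_NODES 4
#define OBJECTS_PER_PAGE 8

/* Stand-in for the memory policy; only pretends to be MPOL_INTERLEAVE. */
static int policy_next_node(void)
{
	static int next;

	return next++ % NR_NODES;
}

int main(void)
{
	int i, page_node;

	/*
	 * SLAB/SLQB-style: the slab layer consults the policy for every
	 * object it hands out, so each object can follow the interleave.
	 */
	printf("per-object policy: ");
	for (i = 0; i < OBJECTS_PER_PAGE; i++)
		printf("%d ", policy_next_node());
	printf("\n");

	/*
	 * SLUB-style: the policy is applied when a new slab page is taken
	 * from the page allocator; every object carved out of that page
	 * then lands on the same node.
	 */
	page_node = policy_next_node();
	printf("per-page policy:   ");
	for (i = 0; i < OBJECTS_PER_PAGE; i++)
		printf("%d ", page_node);
	printf("\n");
	return 0;
}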

> > Even seems to include periodic moving of objects between
> > queues.
>
> The queues expire slowly. Same as SLAB's arrays. You are describing the
> implementation, and not the problems it has.

Periodic movement again introduces processing spikes and pollution of the
cpu caches.

> There needs to be some fallback cases added to slowpaths to handle
> these things, but I don't see why it would take much work.

The need for that fallback comes from the SLAB methodology used....

> > SLQB may be a good cleanup for SLAB. It's good that it is based on the
> > cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> > Solaris allocator from which we got the design for all the queuing stuff
> > in the first place). It preserves many of the drawbacks of that code.
>
> It is _like_ slab. It avoids the major drawbacks of large footprint of
> array caches, and O(N^2) memory consumption behaviour, and corner cases
> where scalability is poor. The queueing behaviour of SLAB IMO is not
> a drawback and it is a big reason why SLAB is so good.

Queuing and the explosion of the number of queues with the alien caches
resulted in the potential of portions of memory vanishing into these
queues. Queueing means unused objects are in those queues stemming from
pages that would otherwise (if the free object would be "moved" back
to the page) be available for other kernel uses.

> > If SLQB would replace SLAB then there would be a lot of shared code
> > (debugging for example). Having a generic slab allocator framework may
> > then be possible within which a variety of algorithms may be implemented.
>
> The goal is to replace SLAB and SLUB. Anything less would be a failure
> on behalf of SLQB. Shared code is not a bad thing, but the major problem
> is the actual core behaviour of the allocator because it affects almost
> everywhere in the kernel and splitting userbase is not a good thing.

I still don't see the problem that SLQB is addressing (aside from code
cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
is okay.

2009-01-16 03:19:55

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, Jan 15, 2009 at 02:05:55PM -0600, Christoph Lameter wrote:
> On Thu, 15 Jan 2009, Nick Piggin wrote:
>
> > > The higher orders can fail and will then result in the allocator doing
> > > order 0 allocs. It is not a failure condition.
> >
> > But they increase pressure on the resource and reduce availability to
> > other higher order allocations. They accelerate the breakdown of the
> > anti-frag heuristics, and they make slab internal fragmentation worse.
> > They also simply cost more to allocate and free and reclaim.
>
> The costs are less since there is no need to have metadata for each of
> the 4k sized pages. Instead there is one contiguous chunk that can be
> tracked as a whole.

In terms of memory footprint of metadata, SLQB and SLUB both use the
struct page for this, so no advantage. In terms of performance, well
it seems that SLUB is not faster with higher order allocations, but
slower with order-0 allocations. So this seems to just be an argument
for using higher order allocations in SLUB, rather than an argument
against using order-0 in SLQB or SLAB.


> > > Higher orders are an
> > > advantage because they localize variables of the same type and therefore
> > > reduce TLB pressure.
> >
> > They are also a disadvantage. The disadvantages are very real. The
> > advantage is a bit theoretical (how much really is it going to help
> > going from 4K to 32K, if you still have hundreds or thousands of
> > slabs anyway?). Also, there is no reason why the other allocators
> > cannot use higher order allocations, but their big advantage is that
> > they don't need to.
>
> The benefit of going from 4k to 32k is that 8 times as many objects may be
> handled by the same 2M TLB covering a 32k page. If the 4k pages are
> dispersed then you may need 8 2M tlbs (which covers already a quarter of
> the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> needs a single one.

Yes I know that. But it's pretty theoretical IMO (and I could equally
describe a theoretical situation where increased fragmentation in higher
order slabs will result in worse TLB coverage).

Has there been a demonstrated advantage of this? That outweighs the
costs of using higher order allocs? If so, then I could just add some
parameters to SLQB to allow higher order allocs as well.


> > I'd like to see any real numbers showing this is a problem. Queue
> > trimming in SLQB can easily be scaled or tweaked to change latency
> > characteristics. The fact is that it isn't a very critical or highly
> > tuned operation. It happens _very_ infrequently in the large scheme
> > of things, and could easily be changed if there is a problem.
>
> Queue trimming can be configured in the same way in SLAB. But this means
> that you are forever tuning these things as loads vary. That's one of the

As I said, this is not my experience. Queue trimming is a third order
problem and barely makes any difference in a running workload. It just
serves to clean up unused memory when slabs stop being used. There is
basically no performance implication to it (watermarks are far more
important to queue management of a busy slab).

Tens of millions of objects can be allocated and freed between queue
trimming intervals.


> frustrations that led to the SLUB design. Also if the objects in queues
> are not bound to a particular page (as in slub) then traversal of the queues
> can be very TLB fault intensive.

Yeah but it happens so infrequently it should be in the noise. I have
never seen any real problems caused by this. Have you?


> > What you have in SLUB IMO is not obviously better because it effectively
> > has sizeable queues in higher order partial and free pages and the
> > active page, which simply never get trimmed, AFAIKS. This can be harmful
> > for slab internal fragmentation as well in some situations.
>
> It has lists of free objects that are bound to a particular page. That
> simplifies numa handling since all the objects in a "queue" (or page) have
> the same NUMA characteristics.

The same can be said of SLQB and SLAB as well.

> There is no moving between queues
> (there is one exception but in general that is true) because the page list can
> become the percpu list by just using the pointer to the head object.

Exactly the same way SLQB can move a queue of remotely freed objects back
to the allocating processor. It just manipulates a head pointer. It doesn't
walk every object in the list.
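
A minimal sketch of that constant-time hand-over (invented names, not
the actual SLQB code): the remote list is claimed by splicing its head
and tail onto the local list, without visiting the individual objects.

#include <stdlib.h>
#include <stdio.h>

/*
 * Toy model: objects freed by a remote cpu pile up on a "remote free"
 * list; the owning cpu later claims the whole list by splicing it onto
 * its local freelist in O(1).
 */
struct object {
	struct object *next;
};

struct list {
	struct object *head;
	struct object *tail;
	unsigned long nr;
};

static void list_push(struct list *l, struct object *obj)
{
	obj->next = l->head;
	l->head = obj;
	if (!l->tail)
		l->tail = obj;
	l->nr++;
}

/* Claim everything on @remote by attaching it in front of @local. */
static void claim_remote_list(struct list *local, struct list *remote)
{
	if (!remote->head)
		return;
	remote->tail->next = local->head;
	if (!local->tail)
		local->tail = remote->tail;
	local->head = remote->head;
	local->nr += remote->nr;
	remote->head = remote->tail = NULL;
	remote->nr = 0;
}

int main(void)
{
	struct list local = { NULL, NULL, 0 }, remote = { NULL, NULL, 0 };
	int i;

	for (i = 0; i < 5; i++)
		list_push(&remote, malloc(sizeof(struct object)));
	claim_remote_list(&local, &remote);
	printf("local now holds %lu objects, remote holds %lu\n",
	       local.nr, remote.nr);
	return 0;
}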


> Slab internal fragmentation is already a problem in SLAB. The solution
> would be a targeted reclaim mechanism. Something like what I proposed
> with the slab defrag patches.

It is still a much bigger problem with 16x larger pages. If you have targeted
reclaim and 32K slabs, then you will still have a bigger problem than
targeted reclaim and 4K slabs (and the targeted reclaim will be less
efficient because it will have to free more objects).


> There is no need for trimming since there is no queue in the SLAB sense. A
> page is assigned to a processor and then that processor takes objects off
> the freelist and may free objects back to the freelist of that page that
> was assigned to a processor. Memory wastage may only occur because
> each processor needs to have a separate page from which to allocate.
> SLAB-like designs need to put a large number of objects in queues which may
> keep a number of pages in the allocated pages pool although all objects
> are unused. That does not occur with slub.

That's wrong. SLUB keeps completely free pages on its partial lists, and
also IIRC can keep free pages pinned in the per-cpu page. I have actually
seen SLQB use less memory than SLUB in some situations for this reason.

2009-01-16 03:44:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, Jan 15, 2009 at 02:47:02PM -0600, Christoph Lameter wrote:
> On Thu, 15 Jan 2009, Nick Piggin wrote:
>
> > Definitely it is not uncontrollable. And not unchangeable. It is
> > about the least sensitive part of the allocator because in a serious
> > workload, the queues will continually be bounded by watermarks rather
> > than timer reaping.
>
> The application that is interrupted has no control over when SLQB runs its
> expiration. The longer the queues the longer the holdoff. Look at the
> changelogs for various queue expiration things in the kernel. I fixed up a
> couple of those over the years for latency reasons.

Interrupts and timers etc. as well as preemption by kernel threads happen
everywhere in the kernel. I have not seen any reason why slab queue reaping
in particular is a problem.

Any slab allocator is going to have a whole lot of theoretical problems and
you simply won't be able to fix them all because some require an oracle or
others fundamentally conflict with another theoretical problem.

I concentrate on the main practical problems and the end result. If I see
evidence of some problem caused, then I will do my best to fix it.


> > > Object dispersal
> > > in the kernel address space.
> >
> > You mean due to lower order allocations?
> > 1. I have not seen any results showing this gives a practical performance
> > increase, let alone one that offsets the downsides of using higher
> > order allocations.
>
> Well yes with enterprise app you are likely not going to see it. Run HPC
> and other low latency tests (Infiniband based and such).

So do you have any results or not?


> > 2. Increased internal fragmentation may also have the opposite effect and
> > result in worse packing.
>
> Memory allocations in latency critical apps are generally done in
> contexts where high latencies are tolerable (f.e. at startup).
>
> > 3. There is no reason why SLQB can't use higher order allocations if this
> > is a significant win.
>
> It still will have to move objects between queues? Or does it adapt the
> slub method of "queue" per page?

It has several queues that objects can move between. You keep asserting
that this is a problem.


> > > Memory policy handling in the slab
> > > allocator.
> >
> > I see no reason why this should be a problem. The SLUB merge just asserted
> > it would be a problem. But actually SLAB seems to handle it just fine, and
> > SLUB also doesn't always obey memory policies, so I consider that to be a
> > worse problem, at least until it is justified by performance numbers that
> > show otherwise.
>
> Well I wrote the code in SLAB that does this. And AFAICT this was a very
> bad hack that I had to put in after all the original developers of the
> NUMA slab stuff vanished and things began to segfault.
>
> SLUB obeys memory policies. It just uses the page allocator for this by
> doing an allocation *without* specifying the node that memory has to come
> from. SLAB manages memory strictly per node. So it always has to ask for
> memory from a particular node. Hence the need to implement memory policies
> in the allocator.

You only go to the allocator when the percpu queue goes empty though, so
if memory policy changes (eg context switch or something), then subsequent
allocations will be of the wrong policy.

That is what I call a hack, which is made in order to solve a perceived
performance problem. The SLAB/SLQB method of checking policy is simple,
obviously correct, and until there is a *demonstrated* performance problem
with that, then I'm not going to change it.


> > > Even seems to include periodic moving of objects between
> > > queues.
> >
> > The queues expire slowly. Same as SLAB's arrays. You are describing the
> > implementation, and not the problems it has.
>
> Periodic movement again introduces processing spikes and pollution of the
> cpu caches.

I don't think this is a problem. Anyway, rt systems that care about such
tiny latencies can easily prioritise this. And ones that don't care so
much have many other sources of interrupts and background processing by
the kernel or hardware interrupts.

If this actually *is* a problem, I will allow an option to turn off periodic
trimming of queues, and allow objects to remain in queues (like the page
allocator does with its queues). And just provide hooks to reap them at
low memory time.


> > There needs to be some fallback cases added to slowpaths to handle
> > these things, but I don't see why it would take much work.
>
> The need for that fallback comes from the SLAB methodology used....

The fallback will probably be adapted from SLUB.


> > > SLQB may be a good cleanup for SLAB. It's good that it is based on the
> > > cleaned up code in SLUB but the fundamental design is SLAB (or rather the
> > > Solaris allocator from which we got the design for all the queuing stuff
> > > in the first place). It preserves many of the drawbacks of that code.
> >
> > It is _like_ slab. It avoids the major drawbacks of large footprint of
> > array caches, and O(N^2) memory consumption behaviour, and corner cases
> > where scalability is poor. The queueing behaviour of SLAB IMO is not
> > a drawback and it is a big reason why SLAB is so good.
>
> Queuing and the explosion of the number of queues with the alien caches
> resulted in the potential of portions of memory vanishing into these
> queues. Queueing means unused objects are in those queues stemming from
> pages that would otherwise (if the free object would be "moved" back
> to the page) be available for other kernel uses.

It's strange. You perceive these theoretical problems with things that I
actually consider a distinct *advantage* of SLAB/SLQB: order-0 allocations,
queueing, strictly obeying NUMA policies...


> > > If SLQB would replace SLAB then there would be a lot of shared code
> > > (debugging for example). Having a generic slab allocator framework may
> > > then be possible within which a variety of algorithms may be implemented.
> >
> > The goal is to replace SLAB and SLUB. Anything less would be a failure
> > on behalf of SLQB. Shared code is not a bad thing, but the major problem
> > is the actual core behaviour of the allocator because it affects almost
> > everywhere in the kernel and splitting userbase is not a good thing.
>
> I still don't see the problem that SLQB is addressing (aside from code
> cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> is okay.

It addresses O(NR_CPUS^2) memory consumption of kmem caches, and large
constant consumption of array caches of SLAB. It addresses scalability
eg in situations with lots of cores per node. It allows resizeable
queues. It addresses the code complexity and bootstrap hoops of SLAB.

It addresses performance and higher order allocation problems of SLUB.
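
As a back-of-envelope illustration of that first point (the structure
and numbers are made up for the example, not the real SLAB layout):
once every cache keeps a queue for every pair of nodes (or cpus), the
number of queues grows quadratically.

#include <stdio.h>

/*
 * Rough model: if every cache keeps, for every node, one queue per
 * *other* node (the "alien" queues), the queue count grows as the
 * square of the node count.  The constants are illustrative only.
 */
int main(void)
{
	const unsigned long caches = 150;	/* rough number of kmem caches */
	const unsigned long entries = 12;	/* object pointers per queue */
	const unsigned long ptr = sizeof(void *);
	unsigned long nodes;

	for (nodes = 2; nodes <= 512; nodes *= 4) {
		unsigned long queues = nodes * (nodes - 1);
		unsigned long bytes = caches * queues * entries * ptr;

		printf("%4lu nodes: %7lu alien queues per cache, ~%lu KB total\n",
		       nodes, queues, bytes >> 10);
	}
	return 0;
}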

2009-01-16 21:08:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 16 Jan 2009, Nick Piggin wrote:

> > handled by the same 2M TLB covering a 32k page. If the 4k pages are
> > dispersed then you may need 8 2M tlbs (which covers already a quarter of
> > the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> > needs a single one.
>
> Yes I know that. But it's pretty theoretical IMO (and I could equally
> describe a theoretical situation where increased fragmentation in higher
> order slabs will result in worse TLB coverage).

Theoretical only for low sizes of memory. If you have terabytes of memory
then this becomes significant in a pretty fast way.

> > It has lists of free objects that are bound to a particular page. That
> > simplifies numa handling since all the objects in a "queue" (or page) have
> > the same NUMA characteristics.
>
> The same can be said of SLQB and SLAB as well.

Sorry not at all. SLAB and SLQB queue objects from different pages in the
same queue.

> > was assigned to a processor. Memory wastage may only occur because
> > each processor needs to have a separate page from which to allocate.
> > SLAB-like designs need to put a large number of objects in queues which may
> > keep a number of pages in the allocated pages pool although all objects
> > are unused. That does not occur with slub.
>
> That's wrong. SLUB keeps completely free pages on its partial lists, and
> also IIRC can keep free pages pinned in the per-cpu page. I have actually
> seen SLQB use less memory than SLUB in some situations for this reason.

As I said it pins a single page in the per cpu page and uses that in a way
that you call a queue and I call a freelist.

SLUB keeps a few pages on the partial list right now because it tries to
avoid trips to the page allocator (which is quite slow). These could be
eliminated if the page allocator would work effectively. However that
number is a per node limit.

SLAB and SLQB can have large quantities of objects in their queues that
each can keep a single page out of circulation if it's the last
object in that page. This is a per-queue thing and you have at least two
queues per cpu. SLAB has queues per cpu, per pair of cpus, per node and per
alien node for each node. That can pin quite a number of pages on large
systems. Note that SLAB has one per cpu whereas you have already 2 per
cpu? In SMP configurations this may mean that SLQB has more queues than
SLAB.

2009-01-16 21:26:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 16 Jan 2009, Nick Piggin wrote:

> > The application that is interrupted has no control over when SLQB runs its
> > expiration. The longer the queues the longer the holdoff. Look at the
> > changelogs for various queue expiration things in the kernel. I fixed up a
> > couple of those over the years for latency reasons.
>
> Interrupts and timers etc. as well as preemption by kernel threads happen
> everywhere in the kernel. I have not seen any reason why slab queue reaping
> in particular is a problem.

The slab queues are a particular problem since they are combined with
timers. So the latency insensitive phase of an HPC app completes and then
the latency critical part starts to run. SLAB will happily every 2 seconds
expire a certain amount of objects from all its various queues.

> Any slab allocator is going to have a whole lot of theoretical problems and
> you simply won't be able to fix them all because some require an oracle or
> others fundamentally conflict with another theoretical problem.

I agree there is no point in working on theoretical problems. We are
talking about practical problems.

> I concentrate on the main practical problems and the end result. If I see
> evidence of some problem caused, then I will do my best to fix it.

You concentrate on the problems that are given to you I guess...

> > Well yes with enterprise app you are likely not going to see it. Run HPC
> > and other low latency tests (Infiniband based and such).
>
> So do you have any results or not?

Of course. I need to repost them? I am no longer employed by the company I
did the work for. So the test data is no longer accessible to me. You have
to rely on the material that was posted in the past.

> > It still will have to move objects between queues? Or does it adapt the
> > slub method of "queue" per page?
>
> It has several queues that objects can move between. You keep asserting
> that this is a problem.

> > SLUB obeys memory policies. It just uses the page allocator for this by
> > doing an allocation *without* specifying the node that memory has to come
> > from. SLAB manages memory strictly per node. So it always has to ask for
> > memory from a particular node. Hence the need to implement memory policies
> > in the allocator.
>
> You only go to the allocator when the percpu queue goes empty though, so
> if memory policy changes (eg context switch or something), then subsequent
> allocations will be of the wrong policy.

The per cpu queue size in SLUB is limited by the queues only containing
objects from the same page. If you have large queues like SLAB/SLQB(?)
then this could be an issue.

> That is what I call a hack, which is made in order to solve a perceived
> performance problem. The SLAB/SLQB method of checking policy is simple,
> obviously correct, and until there is a *demonstrated* performance problem
> with that, then I'm not going to change it.

Well so far it seems that your tests never even exercise that part of the
allocators.

> I don't think this is a problem. Anyway, rt systems that care about such
> tiny latencies can easily prioritise this. And ones that don't care so
> much have many other sources of interrupts and background processing by
> the kernel or hardware interrupts.

How do they prioritize this?

> If this actually *is* a problem, I will allow an option to turn off periodic
> trimming of queues, and allow objects to remain in queues (like the page
> allocator does with its queues). And just provide hooks to reap them at
> low memory time.

That means large amounts of memory are going to be caught in these queues.
If it's per cpu and one cpu does allocation and the other frees then the
first cpu will consume more and more memory from the page allocator
whereas the second will build up huge per cpu lists.

> It's strange. You perceive these theoretical problems with things that I
> actually consider a distinct *advantage* of SLAB/SLQB: order-0 allocations,
> queueing, strictly obeying NUMA policies...

These are issues that we encountered in practice with large systems.
Pointer chasing performance on many apps is bounded by TLB faults etc.
Strictly obeying NUMA policies causes performance problems in SLAB. Try
MPOL_INTERLEAVE vs cpu-local allocations.

> > I still don't see the problem that SLQB is addressing (aside from code
> > cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> > is okay.
>
> It addresses O(NR_CPUS^2) memory consumption of kmem caches, and large
> constant consumption of array caches of SLAB. It addresses scalability
> eg in situations with lots of cores per node. It allows resizeable
> queues. It addresses the code complexity and bootstap hoops of SLAB.
>
> It addresses performance and higher order allocation problems of SLUB.

It seems that on SMP systems SLQB will actually increase the number of
queues since it needs 2 queues per cpu instead of the 1 of SLAB. SLAB also
has resizable queues. Code simplification and bootstrap: Great work on
that. Again good cleanup of SLAB.

2009-01-19 05:47:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 16, 2009 at 03:07:52PM -0600, Christoph Lameter wrote:
> On Fri, 16 Jan 2009, Nick Piggin wrote:
>
> > > handled by the same 2M TLB covering a 32k page. If the 4k pages are
> > > dispersed then you may need 8 2M tlbs (which covers already a quarter of
> > > the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> > > needs a single one.
> >
> > Yes I know that. But it's pretty theoretical IMO (and I could equally
> > describe a theoretical situation where increased fragmentation in higher
> > order slabs will result in worse TLB coverage).
>
> Theoretical only for low sizes of memory. If you have terabytes of memory
> then this becomes significant in a pretty fast way.

I don't really buy that as a general statement with no other qualifiers.
If the huge system has a correspondingly increased number of slab
objects, then the potential win is much smaller as system size increases.
Say if you have a 1GB RAM system, with 128 2MB TLBs, and suppose you have
a slab that takes 25% of the RAM. Then if you optimally pack the slab for
TLB access of those objects in a random pattern, you can get 100%
TLB hit for a given random access. And 25% in the case of random allocations.

In a 1TB RAM system, you have ~0.1% chance of TLB hit in the optimally packed
case and ~0.025% chance of hit in the random case. So in that case it is a
much smaller (negligible) possible gain from packing.
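
For reference, the arithmetic behind those percentages (just restating
the numbers above): 128 entries x 2MB cover 256MB, and the chance that
a random access hits is roughly that coverage divided by the memory the
objects are spread over.

#include <stdio.h>

/* Coverage of 128 x 2MB TLB entries vs the memory the slab spans. */
int main(void)
{
	const double coverage_mb = 128 * 2.0;
	const double ram_mb[] = { 1024.0, 1024.0 * 1024.0 };	/* 1GB, 1TB */
	int i;

	for (i = 0; i < 2; i++) {
		double slab_mb = ram_mb[i] * 0.25;	/* slab takes 25% of RAM */
		double packed = coverage_mb / slab_mb;	/* objects packed together */
		double random = coverage_mb / ram_mb[i];/* objects spread over all RAM */

		printf("%8.0f MB RAM: packed ~%.3f%%, random ~%.3f%% TLB hit\n",
		       ram_mb[i], 100.0 * (packed > 1.0 ? 1.0 : packed),
		       100.0 * random);
	}
	return 0;
}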

And note that we're talking about best possible packing scenario (ie. 2MB
pages vs 4K pages). The standard SLUB tune would not get anywhere near that.

The thing IMO you forget with all these doomsday scenarios about SGI's peta
scale systems is that no matter what you do, you can't avoid the fact that
computing is about locality. Even if you totally take the TLB out of the
equation, you still have the small detail of other caches. Code that jumps
all over that 1024 TB of memory with no locality is going to suck regardless
of what the kernel ever does, due to physical limitations of hardware.

Anyway, it is trivial for SLQB to tune this on those systems that really
care, giving SLUB no advantage. So we needn't really get sidetracked with
this in the context of SLQB.


> > > It has lists of free objects that are bound to a particular page. That
> > > simplifies numa handling since all the objects in a "queue" (or page) have
> > > the same NUMA characteristics.
> >
> > The same can be said of SLQB and SLAB as well.
>
> Sorry not at all. SLAB and SLQB queue objects from different pages in the
> same queue.

The last sentence is what I was replying to. Ie. "simplification of
numa handling" does not follow from the SLUB implementation of per-page
freelists.


> > > was assigned to a processor. Memory wastage may only occur because
> > > each processor needs to have a separate page from which to allocate.
> > > SLAB-like designs need to put a large number of objects in queues which may
> > > keep a number of pages in the allocated pages pool although all objects
> > > are unused. That does not occur with slub.
> >
> > That's wrong. SLUB keeps completely free pages on its partial lists, and
> > also IIRC can keep free pages pinned in the per-cpu page. I have actually
> > seen SLQB use less memory than SLUB in some situations for this reason.
>
> As I said it pins a single page in the per cpu page and uses that in a way
> that you call a queue and I call a freelist.

And you found you have to increase the size of your pages because you
need bigger queues. (must we argue semantics? it is a list of free
objects)


> SLUB keeps a few pages on the partial list right now because it tries to
> avoid trips to the page allocator (which is quite slow). These could be
> eliminated if the page allocator would work effectively. However that
> number is a per node limit.

This is the practical vs theoretical I'm talking about.


> SLAB and SLQB can have large quantities of objects in their queues that
> each can keep a single page out of circulation if it's the last
> object in that page. This is a per-queue thing and you have at least two

And if that were a problem, SLQB can easily be runtime tuned to keep no
objects in its object lists. But as I said, queueing is good, so why
would anybody want to get rid of it?


> queues per cpu. SLAB has queues per cpu, per pair of cpu, per node and per
> alien node for each node. That can pin quite a number of pages on large
> systems. Note that SLAB has one per cpu whereas you have already 2 per
> cpu? In SMP configurations this may mean that SLQB has more queues than
> SLAB.

Again, this doesn't really go anywhere while we disagree on the
fundamental goodliness of queueing. This is just describing the
implementation.

2009-01-19 06:19:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 16, 2009 at 03:25:05PM -0600, Christoph Lameter wrote:
> On Fri, 16 Jan 2009, Nick Piggin wrote:
>
> > > The application that is interrupted has no control over when SLQB runs its
> > > expiration. The longer the queues the longer the holdoff. Look at the
> > > changelogs for various queue expiration things in the kernel. I fixed up a
> > > couple of those over the years for latency reasons.
> >
> > Interrupts and timers etc. as well as preemption by kernel threads happen
> > everywhere in the kernel. I have not seen any reason why slab queue reaping
> > in particular is a problem.
>
> The slab queues are a particular problem since they are combined with
> timers. So the latency insensitive phase of an HPC app completes and then
> the latency critical part starts to run. SLAB will happily every 2 seconds
> expire a certain amount of objects from all its various queues.

Didn't you just earlier say that HPC apps will do all their memory
allocations up front before the critical part of the run? Anyway,
I have not seen any particular problem of cache reaping every 2 seconds
that would be worse than hardware interrupts or other timers.

But as I said, if I do see evidence of that, I will tweak the queue
reaping. But I prefer to keep it simple and closer to SLAB behaviour
until that time.


> > Any slab allocator is going to have a whole lot of theoretical problems and
> > you simply won't be able to fix them all because some require an oracle or
> > others fundamentally conflict with another theoretical problem.
>
> I agree there is no point in working on theoretical problems. We are
> talking about practical problems.

You are not saying what the practical problem is, though. Ie. what
exactly is the workload where cache reaping is causing a problem and
what are the results of eliminating that reaping?


> > I concentrate on the main practical problems and the end result. If I see
> > evidence of some problem caused, then I will do my best to fix it.
>
> You concentrate on the problems that are given to you I guess...

Of course. That would include any problems SGI or you would give me; I would
try to fix them. But I don't only concentrate on those, but also on ones that
I seek out myself. E.g. the OLTP problem.


> > > Well yes with enterprise app you are likely not going to see it. Run HPC
> > > and other low latency tests (Infiniband based and such).
> >
> > So do you have any results or not?
>
> Of course. I need to repost them? I am no longer employed by the company I
> did the work for. So the test data is no longer accessible to me. You have
> to rely on the material that was posted in the past.

Just links are fine. I could find nothing concrete in the mm/slub.c
changelogs.


> > > It still will have to move objects between queues? Or does it adapt the
> > > slub method of "queue" per page?
> >
> > It has several queues that objects can move between. You keep asserting
> > that this is a problem.
>
> > > SLUB obeys memory policies. It just uses the page allocator for this by
> > > doing an allocation *without* specifying the node that memory has to come
> > > from. SLAB manages memory strictly per node. So it always has to ask for
> > > memory from a particular node. Hence the need to implement memory policies
> > > in the allocator.
> >
> > You only go to the allocator when the percpu queue goes empty though, so
> > if memory policy changes (eg context switch or something), then subsequent
> > allocations will be of the wrong policy.
>
> The per cpu queue size in SLUB is limited by the queues only containing
> objects from the same page. If you have large queues like SLAB/SLQB(?)
> then this could be an issue.

And it could be a problem in SLUB too. Chances are that several allocations
will be wrong after every policy switch. I could describe situations in which
SLUB will allocate with the _wrong_ policy literally 100% of the time.


> > That is what I call a hack, which is made in order to solve a perceived
> > performance problem. The SLAB/SLQB method of checking policy is simple,
> > obviously correct, and until there is a *demonstrated* performance problem
> > with that, then I'm not going to change it.
>
> Well so far it seems that your tests never even exercise that part of the
> allocators.

It uses code and behaviour from SLAB, which I know has been
extensively tested in enterprise distros and on just about every
serious production NUMA installation and every hardware vendor
including SGI.

That will satisfy me until somebody reports a problem. Again, I could
only find handwaving in the SLUB changelog, which I definitely have
looked at because I want to find and solve as many issues in this
subsystem as I can.


> > I don't think this is a problem. Anyway, rt systems that care about such
> > tiny latencies can easily prioritise this. And ones that don't care so
> > much have many other sources of interrupts and background processing by
> > the kernel or hardware interrupts.
>
> How do they prioritize this?

In -rt, by prioritising interrupts. As I said, in the mainline kernel
I am skeptical that this is a problem. If it were a problem, I would have
expected to see a report, or the obvious simple improvement of moving the
reaping into a workqueue which can be prioritised, like was done with
the multi-cpu scheduler when there were problems with it.


> > If this actually *is* a problem, I will allow an option to turn off periodic
> > trimming of queues, and allow objects to remain in queues (like the page
> > allocator does with its queues). And just provide hooks to reap them at
> > low memory time.
>
> That means large amounts of memory are going to be caught in these queues.
> If it's per cpu and one cpu does allocation and the other frees then the
> first cpu will consume more and more memory from the page allocator
> whereas the second will build up huge per cpu lists.

Wrong. I said I would allow an option to turn off *periodic trimming*.
Or just modify the existing tunables or look at making the trimming
more fine grained etc etc. I won't know until I see a workload where it
hurts, and I will try to solve it then.


> > It's strange. You perceive these theoretical problems with things that I
> > actually consider a distinct *advantage* of SLAB/SLQB: order-0 allocations,
> > queueing, strictly obeying NUMA policies...
>
> These are issues that we encountered in practice with large systems.
> Pointer chasing performance on many apps is bounded by TLB faults etc.

I would be surprised if SLUB somehow fixed such apps. Especially on
large systems, if you look at the maths.

But I don't dismiss the possibility, and SLQB as I keep repeating,
can do higher order allocations. The reason I bring it up is
because SLUB will get significantly slower for many workloads if
higher order allocations *are not* done. Which is the advantage of
SLAB and SLQB here. This cannot be turned into an advantage for
SLUB because of this TLB issue.


> Strictly obeying NUMA policies causes performance problems in SLAB. Try
> MPOL_INTERLEAVE vs cpu-local allocations.

I will try testing that. Note that I have of course been testing
NUMA things on our small Altix, but I just haven't found anything
interesting enough to post...


> > > I still don't see the problem that SLQB is addressing (aside from code
> > > cleanup of SLAB). Seems that you feel that the queueing behavior of SLAB
> > > is okay.
> >
> > It addresses O(NR_CPUS^2) memory consumption of kmem caches, and large
> > constant consumption of array caches of SLAB. It addresses scalability
> > eg in situations with lots of cores per node. It allows resizeable
> > queues. It addresses the code complexity and bootstrap hoops of SLAB.
> >
> > It addresses performance and higher order allocation problems of SLUB.
>
> It seems that on SMP systems SLQB will actually increase the number of
> queues since it needs 2 queues per cpu instead of the 1 of SLAB.

I don't know what you mean when you say queues, but SLQB has more
than 2 queues per CPU. Great. I like them ;)


> SLAB also
> has resizable queues.

Not significantly because that would require large memory allocations for
large queues. And there is no code there to do runtime resizing.


> Code simplification and bootstrap: Great work on
> that. Again good cleanup of SLAB.

2009-01-22 00:31:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Mon, 19 Jan 2009, Nick Piggin wrote:

> > > You only go to the allocator when the percpu queue goes empty though, so
> > > if memory policy changes (eg context switch or something), then subsequent
> > > allocations will be of the wrong policy.
> >
> > The per cpu queue size in SLUB is limited by the queues only containing
> > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > then this could be an issue.
>
> And it could be a problem in SLUB too. Chances are that several allocations
> will be wrong after every policy switch. I could describe situations in which
> SLUB will allocate with the _wrong_ policy literally 100% of the time.

No it cannot because in SLUB objects must come from the same page.
Multiple objects in a queue will only ever require a single page and not
multiple like in SLAB.

> > That means large amounts of memory are going to be caught in these queues.
> > If it's per cpu and one cpu does allocation and the other frees then the
> > first cpu will consume more and more memory from the page allocator
> > whereas the second will build up huge per cpu lists.
>
> Wrong. I said I would allow an option to turn off *periodic trimming*.
> Or just modify the existing tunables or look at making the trimming
> more fine grained etc etc. I won't know until I see a workload where it
> hurts, and I will try to solve it then.

You are not responding to the issue. If you have queues that contain
objects from multiple pages then every object pointer in these queues can
pin a page even though it is actually a free object.
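
A trivial sketch of the worst case being described (illustrative
numbers only): if every queued free object happens to be the last free
object of a different page, the queue pins one page per object, whereas
a per-page freelist pins at most the pages it is built from.

#include <stdio.h>

#define OBJECTS_PER_PAGE 64

/* Compare the extremes for 1024 free objects sitting in queues. */
int main(void)
{
	unsigned long queued = 1024;	/* free objects sitting in queues */

	printf("worst case, objects from many pages: %lu pages pinned\n",
	       queued);
	printf("best case, objects from full pages:  %lu pages pinned\n",
	       queued / OBJECTS_PER_PAGE);
	return 0;
}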

> > It seems that on SMP systems SLQB will actually increase the number of
> > queues since it needs 2 queues per cpu instead of the 1 of SLAB.
>
> I don't know what you mean when you say queues, but SLQB has more
> than 2 queues per CPU. Great. I like them ;)

This gets better and better.

> > SLAB also
> > has resizable queues.
>
> Not significantly because that would require large memory allocations for
> large queues. And there is no code there to do runtime resizing.

Groan. Please have a look at do_tune_cpucache() in slab.c

2009-01-22 00:33:23

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Mon, 19 Jan 2009, Nick Piggin wrote:

> The thing IMO you forget with all these doomsday scenarios about SGI's peta
> scale systems is that no matter what you do, you can't avoid the fact that
> computing is about locality. Even if you totally take the TLB out of the
> equation, you still have the small detail of other caches. Code that jumps
> all over that 1024 TB of memory with no locality is going to suck regardless
> of what the kernel ever does, due to physical limitations of hardware.

Typically we traverse lists of objects that are in the same slab cache.

> > Sorry not at all. SLAB and SLQB queue objects from different pages in the
> > same queue.
>
> The last sentence is what I was replying to. Ie. "simplification of
> numa handling" does not follow from the SLUB implementation of per-page
> freelists.

If all objects are from the same page then you need not check
the NUMA locality of any object on that queue.

> > As I said it pins a single page in the per cpu page and uses that in a way
> > that you call a queue and I call a freelist.
>
> And you found you have to increase the size of your pages because you
> need bigger queues. (must we argue semantics? it is a list of free
> objects)

Right. That may be the case and it's a similar tuning to what SLAB does.

> > SLAB and SLQB can have large quantities of objects in their queues that
> > each can keep a single page out of circulation if it's the last
> > object in that page. This is a per-queue thing and you have at least two
>
> And if that were a problem, SLQB can easily be runtime tuned to keep no
> objects in its object lists. But as I said, queueing is good, so why
> would anybody want to get rid of it?

Queueing is sometimes good....

> Again, this doesn't really go anywhere while we disagree on the
> fundamental goodliness of queueing. This is just describing the
> implementation.

I am not sure that you understand the fine points of queuing in slub. I am
not a fundamentalist: Queues are good if used the right way and as you say
SLUB has "queues" designed in a particular fashion that solves issus that
we had with SLAB queues.

2009-01-22 09:27:22

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

Hi Christoph,

On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > if memory policy changes (eg context switch or something), then subsequent
> > > > allocations will be of the wrong policy.
> > >
> > > The per cpu queue size in SLUB is limited by the queues only containing
> > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > then this could be an issue.
> >
> > And it could be a problem in SLUB too. Chances are that several allocations
> > will be wrong after every policy switch. I could describe situations in which
> > SLUB will allocate with the _wrong_ policy literally 100% of the time.

On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> No it cannot because in SLUB objects must come from the same page.
> Multiple objects in a queue will only ever require a single page and not
> multiple like in SLAB.

There's one potential problem with "per-page queues", though. The bigger
the object, the smaller the "queue" (i.e. fewer objects per page). Also,
partial lists are less likely to help for big objects because they get
emptied so quickly and returned to the page allocator. Perhaps we should
do a small "full list" for caches with large objects?
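
To put rough numbers on that size effect, here is a back-of-the-envelope
illustration only, assuming 4 KiB order-0 slabs and ignoring per-object
metadata, so the real figures differ slightly:

#include <stdio.h>

/*
 * Back-of-the-envelope only: 4 KiB order-0 slabs, per-object metadata
 * ignored, so the real numbers differ slightly.
 */
int main(void)
{
	unsigned long page_size = 4096;
	unsigned long sizes[] = { 64, 256, 1024, 2048 };

	for (int i = 0; i < 4; i++)
		printf("object size %4lu -> per-page \"queue\" of %2lu objects\n",
		       sizes[i], page_size / sizes[i]);
	return 0;
}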

Pekka

2009-01-22 09:30:58

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 2009-01-22 at 11:27 +0200, Pekka Enberg wrote:
> Hi Christoph,
>
> On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > > if memory policy changes (eg context switch or something), then subsequent
> > > > > allocations will be of the wrong policy.
> > > >
> > > > The per cpu queue size in SLUB is limited by the queues only containing
> > > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > > then this could be an issue.
> > >
> > > And it could be a problem in SLUB too. Chances are that several allocations
> > > will be wrong after every policy switch. I could describe situations in which
> > > SLUB will allocate with the _wrong_ policy literally 100% of the time.
>
> On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
>
> There's one potential problem with "per-page queues", though. The bigger
> the object, the smaller the "queue" (i.e. less objects per page). Also,
> partial lists are less likely to help for big objects because they get
> emptied so quickly and returned to the page allocator. Perhaps we should
> do a small "full list" for caches with large objects?
That helps definitely. We could use a batch to control the list size.

2009-01-22 09:33:23

by Pekka Enberg

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 2009-01-22 at 17:30 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-01-22 at 11:27 +0200, Pekka Enberg wrote:
> > Hi Christoph,
> >
> > On Mon, 19 Jan 2009, Nick Piggin wrote:
> > > > > > You only go to the allocator when the percpu queue goes empty though, so
> > > > > > if memory policy changes (eg context switch or something), then subsequent
> > > > > > allocations will be of the wrong policy.
> > > > >
> > > > > The per cpu queue size in SLUB is limited by the queues only containing
> > > > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > > > then this could be an issue.
> > > >
> > > > And it could be a problem in SLUB too. Chances are that several allocations
> > > > will be wrong after every policy switch. I could describe situations in which
> > > > SLUB will allocate with the _wrong_ policy literally 100% of the time.
> >
> > On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > > No it cannot because in SLUB objects must come from the same page.
> > > Multiple objects in a queue will only ever require a single page and not
> > > multiple like in SLAB.
> >
> > There's one potential problem with "per-page queues", though. The bigger
> > the object, the smaller the "queue" (i.e. less objects per page). Also,
> > partial lists are less likely to help for big objects because they get
> > emptied so quickly and returned to the page allocator. Perhaps we should
> > do a small "full list" for caches with large objects?
> That helps definitely. We could use a batch to control the list size.

s/full list/empty list/g

That is, a list of pages that could be returned to the page allocator
but are pooled in SLUB to avoid the page allocator overhead. Note that
this will not help allocators that trigger page allocator pass-through.
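
A minimal sketch of what such an "empty list" could look like (purely
illustrative, not actual SLUB or SLQB code; the empty_pool names and the
cap of eight pages are invented here):

#include <stddef.h>

/*
 * Hypothetical "empty list": a small, capped pool of completely free slab
 * pages kept around to avoid a round trip to the page allocator.
 */
#define EMPTY_POOL_MAX	8

struct empty_pool {
	void *pages[EMPTY_POOL_MAX];
	int nr;
};

/* Called when the last object of a slab page has been freed. */
static int empty_pool_put(struct empty_pool *p, void *page)
{
	if (p->nr >= EMPTY_POOL_MAX)
		return 0;	/* pool full: hand the page back to the page allocator */
	p->pages[p->nr++] = page;
	return 1;
}

/* Called when a new slab page is needed. */
static void *empty_pool_get(struct empty_pool *p)
{
	return p->nr ? p->pages[--p->nr] : NULL;	/* NULL: go to the page allocator */
}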

Pekka

2009-01-23 04:09:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 07:13:44PM -0500, Christoph Lameter wrote:
> On Mon, 19 Jan 2009, Nick Piggin wrote:
>
> > > The per cpu queue size in SLUB is limited by the queues only containing
> > > objects from the same page. If you have large queues like SLAB/SLQB(?)
> > > then this could be an issue.
> >
> > And it could be a problem in SLUB too. Chances are that several allocations
> > will be wrong after every policy switch. I could describe situations in which
> > SLUB will allocate with the _wrong_ policy literally 100% of the time.
>
> No it cannot because in SLUB objects must come from the same page.
> Multiple objects in a queue will only ever require a single page and not
> multiple like in SLAB.

I don't know how that solves the problem. Task with memory policy A
allocates an object, which allocates the "fast" page with policy A
and allocates an object. Then context switch to task with memory
policy B which allocates another object, which is taken from the page
allocated with policy A. Right?

(OK this doesn't give the wrong policy 100% of the time; I thought
there could have been a context switch race during page allocation
that would result in 100% incorrect, but anyway it could still be
significantly incorrect couldn't it?)
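
To spell the scenario out in pseudo-C (an illustrative model only, not
SLUB's actual code or data structures; slab_alloc() and
page_alloc_with_policy() are made-up names):

#include <stdlib.h>

/*
 * Illustrative model: the per-cpu "fast" page is allocated under the
 * policy of whichever task refilled it, and later allocations consume it
 * regardless of the caller's own policy.
 */
struct fast_page {
	int alloc_policy;	/* policy in effect when the page was allocated */
	int free_objects;
};

static struct fast_page *cpu_fast_page;

static struct fast_page *page_alloc_with_policy(int policy)
{
	struct fast_page *page = calloc(1, sizeof(*page));

	if (page) {
		page->alloc_policy = policy;
		page->free_objects = 512;	/* say, 4096 / 8-byte objects */
	}
	return page;
}

static void *slab_alloc(int task_policy)
{
	if (!cpu_fast_page || !cpu_fast_page->free_objects)
		/* the only point at which the caller's policy is consulted */
		cpu_fast_page = page_alloc_with_policy(task_policy);

	if (!cpu_fast_page)
		return NULL;
	cpu_fast_page->free_objects--;
	/*
	 * A task with policy B that runs after a context switch still gets
	 * an object out of a page placed according to task A's policy.
	 */
	return cpu_fast_page;	/* stands in for "an object in that page" */
}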


> > > That means large amounts of memory are going to be caught in these queues.
> > > If its per cpu and one cpu does allocation and the other frees then the
> > > first cpu will consume more and more memory from the page allocator
> > > whereas the second will build up huge per cpu lists.
> >
> > Wrong. I said I would allow an option to turn off *periodic trimming*.
> > Or just modify the existing tunables or look at making the trimming
> > more fine grained etc etc. I won't know until I see a workload where it
> > hurts, and I will try to solve it then.
>
> You are not responding to the issue. If you have queues that contain
> objects from multiple pages then every object pointer in these queues can
> pin a page although this actually is a free object.

I am trying to respond to what you raise. "The" issue I thought you
raised above was that SLQB would grow freelists unbounded:

"the first cpu will consume more and more memory from the page allocator
whereas the second will build up huge per cpu lists"

And this is wrong. There is another possible issue where every single
object on the freelist might come from a different (and otherwise free)
page, and thus eg 100 8 byte objects might consume 400K.

That's not an invalid concern, but I think it will be quite rare, and
the periodic queue trimming should naturally help this because it will
cycle out those objects and if new allocations are needed, they will
come from new pages which can be packed more densely.
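
For concreteness, the arithmetic behind that worst case (4 KiB pages
assumed):

#include <stdio.h>

/*
 * Worst case sketched above, with assumed sizes: each of 100 queued
 * 8-byte objects is the last free object of an otherwise-empty 4 KiB page.
 */
int main(void)
{
	unsigned long objects = 100, object_size = 8, page_size = 4096;

	printf("live object data: %lu bytes\n", objects * object_size);	/* 800 */
	printf("pages pinned:     %lu bytes\n", objects * page_size);	/* ~400K */
	return 0;
}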


> > > It seems that on SMP systems SLQB will actually increase the number of
> > > queues since it needs 2 queues per cpu instead of the 1 of SLAB.
> >
> > I don't know what you mean when you say queues, but SLQB has more
> > than 2 queues per CPU. Great. I like them ;)
>
> This gets better and better.

So no response to my asking where the TLB improvement in SLUB helps,
or where queueing hurts? You complain about not being able to reproduce
Intel's OLTP problem, and yet you won't even _say_ what the problems
are for SLQB. Whereas Intel at least puts a lot of effort into running
tests and helping to analyse things.


> > > SLAB also
> > > has resizable queues.
> >
> > Not significantly because that would require large memory allocations for
> > large queues. And there is no code there to do runtime resizing.
>
> Groan. Please have a look at do_tune_cpucache() in slab.c

Cool, I didn't realise it had hooks to do runtime resizing. The more
important issue of course is the one of extra cache footprint and
metadata in SLAB's scheme.
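
For reference, IIRC the SLAB knob is exercised at runtime by writing
"<cache> <limit> <batchcount> <shared>" to /proc/slabinfo, which ends up
in do_tune_cpucache(). Conceptually the resize just builds per-cpu arrays
of the new size and swaps them in; a simplified sketch of that idea (not
the actual slab.c code, with no locking and no per-node shared/alien
caches):

#include <stdlib.h>
#include <string.h>

struct cpu_queue {
	unsigned int limit;	/* capacity */
	unsigned int avail;	/* objects currently queued */
	void *objects[];
};

static struct cpu_queue *alloc_queue(unsigned int limit)
{
	struct cpu_queue *q = calloc(1, sizeof(*q) + limit * sizeof(void *));

	if (q)
		q->limit = limit;
	return q;
}

/*
 * Build a queue of the new size and swap it in; objects that still fit
 * are carried over, the rest would be flushed back to their slabs.
 */
static struct cpu_queue *resize_queue(struct cpu_queue *old, unsigned int new_limit)
{
	struct cpu_queue *new = alloc_queue(new_limit);
	unsigned int keep;

	if (!new)
		return old;	/* keep the old size on allocation failure */

	keep = old->avail < new_limit ? old->avail : new_limit;
	memcpy(new->objects, old->objects, keep * sizeof(void *));
	new->avail = keep;
	/* ... flush old->objects[keep..old->avail) back to their slabs ... */
	free(old);
	return new;
}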

2009-01-23 04:18:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Wed, Jan 21, 2009 at 07:19:08PM -0500, Christoph Lameter wrote:
> On Mon, 19 Jan 2009, Nick Piggin wrote:
>
> > The thing IMO you forget with all these doomsday scenarios about SGI's peta
> > scale systems is that no matter what you do, you can't avoid the fact that
> > computing is about locality. Even if you totally take the TLB out of the
> > equation, you still have the small detail of other caches. Code that jumps
> > all over that 1024 TB of memory with no locality is going to suck regardless
> > of what the kernel ever does, due to physical limitations of hardware.
>
> Typically we traverse lists of objects that are in the same slab cache.

Very often that is not the case. And the price you pay for that is that
you have to drain and switch freelists whenever you encounter an object
that is not on the same page.

IMO this gives your freelists chaotic and unpredictable behaviour in a
running system: as pages succumb to fragmentation, your freelist maximum
sizes become limited. It also means you can lose track of cache-hot
objects when you switch to different "fast" pages. I don't consider
this to be "queueing done right".


> > > Sorry not at all. SLAB and SLQB queue objects from different pages in the
> > > same queue.
> >
> > The last sentence is what I was replying to. Ie. "simplification of
> > numa handling" does not follow from the SLUB implementation of per-page
> > freelists.
>
> If all objects are from the same page then you need not check
> the NUMA locality of any object on that queue.

In SLAB and SLQB, all objects on the freelist are on the same node. So
tell me how does same-page objects simplify numa handling?


> > > As I sad it pins a single page in the per cpu page and uses that in a way
> > > that you call a queue and I call a freelist.
> >
> > And you found you have to increase the size of your pages because you
> > need bigger queues. (must we argue semantics? it is a list of free
> > objects)
>
> Right. That may be the case and it's a similar tuning to what SLAB does.

SLAB and SLQB don't need bigger pages to do that.


> > > SLAB and SLUB can have large quantities of objects in their queues that
> > > each can keep a single page out of circulation if its the last
> > > object in that page. This is per queue thing and you have at least two
> >
> > And if that were a problem, SLQB can easily be runtime tuned to keep no
> > objects in its object lists. But as I said, queueing is good, so why
> > would anybody want to get rid of it?
>
> Queueing is sometimes good....
>
> > Again, this doesn't really go anywhere while we disagree on the
> > fundamental goodliness of queueing. This is just describing the
> > implementation.
>
> I am not sure that you understand the fine points of queuing in SLUB. I am
> not a fundamentalist: queues are good if used the right way and, as you say,
> SLUB has "queues" designed in a particular fashion that solves issues that
> we had with SLAB queues.

OK, and I just don't think they solved all the problems, and they added
other, worse ones. And if you would tell me what the problems are and
how to reproduce them (or point to someone who might be able to help
with reproducing them), then I'm confident that I can solve those problems
in SLQB, which has fewer downsides than SLUB. At least I will try my best.

So can you please give a better idea of the problems? "latency sensitive
HPC applications" is about as much help to me solving that as telling
you that "OLTP applications slow down" helps solve one of the problems in
SLUB.

2009-01-23 15:44:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
>
> I don't know how that solves the problem. Task with memory policy A
> allocates an object, which allocates the "fast" page with policy A
> and allocates an object. Then context switch to task with memory
> policy B which allocates another object, which is taken from the page
> allocated with policy A. Right?

Correct. But this is only an issue if you think about policies applying to
individual object allocations (like realized in SLAB). If policies only
apply to pages (which is sufficient for balancing IMHO) then this is okay.

> (OK this doesn't give the wrong policy 100% of the time; I thought
> there could have been a context switch race during page allocation
> that would result in 100% incorrect, but anyway it could still be
> significantly incorrect couldn't it?)

Memory policies are applied in a fuzzy way anyway. A context switch can
result in page allocation action that changes the expected interleave
pattern. Page populations in an address space depend on the task policy.
So the exact policy applied to a page depends on the task. This isn't an
exact thing.

> "the first cpu will consume more and more memory from the page allocator
> whereas the second will build up huge per cpu lists"
>
> And this is wrong. There is another possible issue where every single
> object on the freelist might come from a different (and otherwise free)
> page, and thus eg 100 8 byte objects might consume 400K.
>
> That's not an invalid concern, but I think it will be quite rare, and
> the periodic queue trimming should naturally help this because it will
> cycle out those objects and if new allocations are needed, they will
> come from new pages which can be packed more densely.

Well but you said that you would defer the trimming (due to latency
concerns). The longer you defer the larger the lists will get.

2009-01-23 15:44:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 22 Jan 2009, Pekka Enberg wrote:

> That is, a list of pages that could be returned to the page allocator
> but are pooled in SLUB to avoid the page allocator overhead. Note that
> this will not help allocators that trigger page allocator pass-through.

We use the partial list for that.

2009-01-23 15:44:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Thu, 22 Jan 2009, Pekka Enberg wrote:

> On Wed, 2009-01-21 at 19:13 -0500, Christoph Lameter wrote:
> > No it cannot because in SLUB objects must come from the same page.
> > Multiple objects in a queue will only ever require a single page and not
> > multiple like in SLAB.
>
> There's one potential problem with "per-page queues", though. The bigger
> the object, the smaller the "queue" (i.e. less objects per page). Also,
> partial lists are less likely to help for big objects because they get
> emptied so quickly and returned to the page allocator. Perhaps we should
> do a small "full list" for caches with large objects?

Right, that's why there is a need for higher-order allocs: they increase
the "queue" sizes. If the pages are larger then the partial lists will
also cover more ground. Much of the tuning in SLUB is the page size
setting (remember you can set the order for each slab in SLUB!). In
SLAB/SLQB the corresponding tuning is through the queue sizes.
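
A rough feel for how the order knob scales the per-slab "queue" (4 KiB
base pages assumed, per-object metadata ignored):

#include <stdio.h>

/* Rough numbers only: 4 KiB base pages, metadata ignored. */
int main(void)
{
	unsigned long object_size = 256;

	for (int order = 0; order <= 3; order++)
		printf("order %d: %5lu-byte slab -> %3lu objects per slab\n",
		       order, 4096UL << order, (4096UL << order) / object_size);
	return 0;
}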

2009-01-23 15:53:21

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, Jan 23, 2009 at 10:41:15AM -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Nick Piggin wrote:
>
> > > No it cannot because in SLUB objects must come from the same page.
> > > Multiple objects in a queue will only ever require a single page and not
> > > multiple like in SLAB.
> >
> > I don't know how that solves the problem. Task with memory policy A
> > allocates an object, which allocates the "fast" page with policy A
> > and allocates an object. Then context switch to task with memory
> > policy B which allocates another object, which is taken from the page
> > allocated with policy A. Right?
>
> Correct. But this is only an issue if you think about policies applying to
> individual object allocations (like realized in SLAB). If policies only
> apply to pages (which is sufficient for balancing IMHO) then this is okay.

According to memory policies, a task's memory policy is supposed to
apply to its slab allocations too.


> > (OK this doesn't give the wrong policy 100% of the time; I thought
> > there could have been a context switch race during page allocation
> > that would result in 100% incorrect, but anyway it could still be
> > significantly incorrect couldn't it?)
>
> Memory policies are applied in a fuzzy way anyway. A context switch can
> result in page allocation action that changes the expected interleave
> pattern. Page populations in an address space depend on the task policy.
> So the exact policy applied to a page depends on the task. This isn't an
> exact thing.

There are other memory policies than just interleave though.


> > "the first cpu will consume more and more memory from the page allocator
> > whereas the second will build up huge per cpu lists"
> >
> > And this is wrong. There is another possible issue where every single
> > object on the freelist might come from a different (and otherwise free)
> > page, and thus eg 100 8 byte objects might consume 400K.
> >
> > That's not an invalid concern, but I think it will be quite rare, and
> > the periodic queue trimming should naturally help this because it will
> > cycle out those objects and if new allocations are needed, they will
> > come from new pages which can be packed more densely.
>
> Well but you said that you would defer the trimming (due to latency
> concerns). The longer you defer the larger the lists will get.

But that is wrong. The lists obviously have high water marks that
get trimmed down. Periodic trimming, as I keep saying, basically is
already so infrequent that it is irrelevant (millions of objects
per cpu can be allocated anyway between existing trimming intervals).
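
A sketch of the high water mark behaviour described here (illustrative
only, not SLQB's actual code; the names and numbers are made up):

/*
 * The per-cpu free list is trimmed the moment it exceeds its high water
 * mark, independently of the periodic (timer-driven) trimming.
 */
struct freelist {
	void *head;		/* singly linked list threaded through the objects */
	unsigned int nr;
	unsigned int hiwater;	/* e.g. 1024 objects */
	unsigned int batch;	/* e.g. 256 objects flushed per trim */
};

static void flush_batch(struct freelist *l)
{
	unsigned int n = l->batch < l->nr ? l->batch : l->nr;

	while (n--) {
		void *object = l->head;

		l->head = *(void **)object;
		l->nr--;
		/* return `object' to its slab page here (elided) */
	}
}

static void queue_free(struct freelist *l, void *object)
{
	*(void **)object = l->head;	/* push onto the local free list */
	l->head = object;

	if (++l->nr > l->hiwater)
		flush_batch(l);
}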

2009-01-23 16:27:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:

> > > The thing IMO you forget with all these doomsday scenarios about SGI's peta
> > > scale systems is that no matter what you do, you can't avoid the fact that
> > > computing is about locality. Even if you totally take the TLB out of the
> > > equation, you still have the small detail of other caches. Code that jumps
> > > all over that 1024 TB of memory with no locality is going to suck regardless
> > > of what the kernel ever does, due to physical limitations of hardware.
> >
> > Typically we traverse lists of objects that are in the same slab cache.
>
> Very often that is not the case. And the price you pay for that is that
> you have to drain and switch freelists whenever you encounter an object
> that is not on the same page.

SLUB can directly free an object to any slab page. "Queuing" on free via
the per cpu slab is only possible if the object came from that per cpu
slab. This is typically only the case for objects that were recently
allocated.

There is no switching of queues because they do not exist in that form in
SLUB. We always determine the page address and put the object into the
freelist of that page. This also results in nice parallelism, since the
lock is not even cpu-specific.
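
A schematic of that free path (not the actual slub.c code; the real lookup
goes through virt_to_head_page() and struct page, while this sketch just
pretends the page metadata sits at the start of a 4 KiB page):

struct slab_page {
	void *freelist;			/* free objects within this page */
	/* the per-page lock lives here in the real thing */
};

static struct slab_page *cpu_slab;	/* the current per-cpu "fast" page */

static struct slab_page *addr_to_page(const void *obj)
{
	return (struct slab_page *)((unsigned long)obj & ~4095UL);
}

static void slab_free(void *object)
{
	struct slab_page *page = addr_to_page(object);

	if (page == cpu_slab) {
		/* fast path: the object came from the per-cpu page */
		*(void **)object = page->freelist;
		page->freelist = object;
		return;
	}

	/*
	 * Slow path: take that page's own lock (elided) and free into that
	 * page directly; no queue switching and no node lookup needed.
	 */
	*(void **)object = page->freelist;
	page->freelist = object;
}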

> IMO this gives your freelists chaotic and unpredictable behaviour in a
> running system: as pages succumb to fragmentation, your freelist maximum
> sizes become limited. It also means you can lose track of cache-hot
> objects when you switch to different "fast" pages. I don't consider
> this to be "queueing done right".

Yes, you can lose track of cache-hot objects. That is one of the
concerns with the SLUB approach. On the other hand: caching architectures
get more and more complex these days (especially in a NUMA system). The
SLAB approach is essentially trying to guess which objects are cache hot
and queue them. Sometimes the queueing is advantageous (which may be a
reason that SLAB is better than SLUB in some cases). In other cases SLAB
keeps objects on queues but the objects have become stale (context switch,
slab unused for a while). Then it is no advantage anymore.

> > If all objects are from the same page then you need not check
> > the NUMA locality of any object on that queue.
>
> In SLAB and SLQB, all objects on the freelist are on the same node. So
> tell me how does same-page objects simplify numa handling?

E.g. on free you need to determine the node to find the right queue in
SLAB. SLUB does not need to do that. It simply determines the page address
and does not care about the node when freeing the object. It is irrelevant
on which node the object sits.
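
For contrast, a schematic of the node lookup a per-node queue design needs
on free (illustrative pseudo-C only; object_to_node() is a stand-in for
roughly what the kernel would do via page_to_nid()):

#define MAX_NODES	4

struct node_queue { void *head; };

static struct node_queue local_queue[MAX_NODES];	/* per-cpu, per-node lists */

static int object_to_node(const void *obj)
{
	/* fake layout: pretend each node owns a 1 GiB chunk of memory */
	return (int)(((unsigned long)obj >> 30) & (MAX_NODES - 1));
}

static void queued_free(void *object, int this_node)
{
	int node = object_to_node(object);

	if (node == this_node) {
		/* local object: push onto this node's queue */
		*(void **)object = local_queue[node].head;
		local_queue[node].head = object;
	} else {
		/* remote object: hand it to the owning node's remote-free list (elided) */
	}
}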

Also on alloc: The per cpu slab can be from a foreign node. NUMA locality
does only matter if the caller wants memory from a particular node. So
cpus that have no local memory can still use the per cpu slabs to have
fast allocations etc etc.

> > > And you found you have to increase the size of your pages because you
> > > need bigger queues. (must we argue semantics? it is a list of free
> > > objects)
> >
> > Right. That may be the case and it's a similar tuning to what SLAB does.
>
> SLAB and SLQB don't need bigger pages to do that.

But they require more metadata handling because they need to manage lists
of order-0 pages. Metadata handling is reduced by orders of magnitude in
SLUB.

2009-01-26 17:31:40

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch] SLQB slab allocator

On Fri, 23 Jan 2009, Nick Piggin wrote:

> According to memory policies, a task's memory policy is supposed to
> apply to its slab allocations too.

It does apply to slab allocations. The question is whether it has to apply
to every object allocation or to every page allocation of the slab
allocators.

> > Memory policies are applied in a fuzzy way anyway. A context switch can
> > result in page allocation action that changes the expected interleave
> > pattern. Page populations in an address space depend on the task policy.
> > So the exact policy applied to a page depends on the task. This isn't an
> > exact thing.
>
> There are other memory policies than just interleave though.

Which have similar issues, since memory policy application depends on the
task policy and on memory migration that has been applied to an address
range.

> > > "the first cpu will consume more and more memory from the page allocator
> > > whereas the second will build up huge per cpu lists"
> > >
> > > And this is wrong. There is another possible issue where every single
> > > object on the freelist might come from a different (and otherwise free)
> > > page, and thus eg 100 8 byte objects might consume 400K.
> > >
> > > That's not an invalid concern, but I think it will be quite rare, and
> > > the periodic queue trimming should naturally help this because it will
> > > cycle out those objects and if new allocations are needed, they will
> > > come from new pages which can be packed more densely.
> >
> > Well but you said that you would defer the trimming (due to latency
> > concerns). The longer you defer the larger the lists will get.
>
> But that is wrong. The lists obviously have high water marks that
> get trimmed down. Periodic trimming, as I keep saying, basically is
> already so infrequent that it is irrelevant (millions of objects
> per cpu can be allocated anyway between existing trimming intervals).

Trimming through water marks and allocating memory from the page allocator
is going to be very frequent if you continually allocate on one processor
and free on another.