This is a proposal for a modularized slab allocator. I have tried
for a long time to find a way to untangle the code in the slab,
and I think I have found one, although I have some concerns about
its acceptability given the techniques and the scope of this
work. We have done some of this before for device drivers, but as
far as I can tell we have never done it in the VM.
The modularization is accomplished by borrowing a few concepts
from object oriented programming. Allocators are described by a set
of methods, and functions can derive new allocators from existing
ones by overriding some of those methods.
So what the patches provide here is:
1. A framework for page allocators and slab allocators
2. Various methods to derive new allocators from old ones
(add rcu support, destructors, constructors, dma etc)
3. A layer that emulates the existing slab interface (the slabulator).
4. A layer that provides kmalloc functionality.
5. Three different slab allocators.
A. The Slabifier. This is conceptually the Simple Slab (see my RFC
from last week), but with the additional allocator modifications
it can supply most of the functionality of the existing slab
allocator and even go beyond it.
It is called the Slabifier because it can slabify any
page allocator. VMALLOC is a valid page allocator so
it can do slabs on vmalloc ranges. You can define your
own page allocator (mempools??) and then slabify that
one.
B. The page slab allocator. This is a simple Pagesize based
allocator that uses the page allocator directly to manage its
objects. Doing so avoids all the slab overhead for large
allocations. The page slab can also slabify any other
page allocator.
C. The NUMA Slab. This allocator is based on the slabifier
and simply creates one Slabifier per node and manages
those. This allows a clean separation of NUMA.
The slabifier stays simple and the NUMA slab deals with
the NUMA allocation complexities, so systems without NUMA
are not affected by the added logic (a rough dispatch
sketch follows below).
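As a rough illustration only (nothing here is the actual NUMA slab code from
this patchset; struct numa_slab and its fields are made up for the sketch),
the per node dispatch could look like:

	/* One slabifier cache per node; alloc_node just picks the right one. */
	struct numa_slab {
		struct slab_cache sc;
		struct slab_cache *node_cache[MAX_NUMNODES];
	};

	static void *numa_slab_alloc_node(struct slab_cache *sc, gfp_t flags,
			int node)
	{
		struct numa_slab *n = (struct numa_slab *)sc;

		if (node < 0)
			node = numa_node_id();	/* no node given: use local */
		return slabifier_allocator.alloc(n->node_cache[node], flags);
	}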
All slab allocators are written to the same interface. New allocators can
be added at will. One could, for example, add the slob and the existing slab.
One can dynamically decide which slab allocator to use (if one abandons the
emulation layer) by combining page allocators, slab allocators and the various
methods of deriving allocators, as sketched below.
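As an illustration only (this fragment is not part of the patchset and the
cache parameters are made up), a DMA, RCU-freed slab cache could be composed
on top of the slabifier roughly like this:

	static struct slab_cache *create_example_cache(void)
	{
		/* Page source: regular page allocator, forced to DMA, freed via RCU */
		const struct page_allocator *pa =
			rcuify_page_allocator(dmaify_page_allocator(&page_allocator));

		/* node -1 = any node; size/align/objsize/inuse 256, order 0, offset 0 */
		return slabifier_allocator.create(&slabifier_allocator, pa, -1,
				"example_cache", 256, 256, 0, 256, 256, 0);
	}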
The patchset is definitely still in its infancy. Large sections of the
code are not tested yet. It boots fine on an SGI Altix if the slabifier or
the NUMA slab is used. Nothing beyond that has been done yet.
Some of the other issues in the slab layer are also addressed here:
1. shrink_slab takes a function to move objects. Using that
function slabs can be defragmented to ease slab reclaim (see the
sketch after this list).
2. Bootstrap occurs through dynamic slab creation.
3. New slabs that are created can be merged into the kmalloc array
if it is detected that they match. This decreases the number of caches
and benefits cache use.
4. The slabifier can flag double frees at the time they occur
and will attempt to continue.
5. There is no 2 second slab reaper tick. Each slab cache has a 10
second flusher attached. The flusher stays inactive while the
cache is inactive.
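For illustration of item 1, a move_object callback could be shaped like the
following sketch (alloc_replacement() and redirect_references() stand in for
cache specific helpers that are not part of this patchset):

	/*
	 * Sketch of a move_object callback as passed to the shrink method.
	 * Returning 1 lets the allocator free the old object; returning 0
	 * marks it as pinned so its slab will not be reclaimed.
	 */
	static int example_move_object(struct slab_cache *sc, void *old)
	{
		void *new = alloc_replacement(sc);

		if (!new)
			return 0;
		memcpy(new, old, sc->objsize);
		redirect_references(old, new);
		return 1;
	}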
Notably missing features:
- Slab Debugging
- Proper NUMA policy support.
- No support for pagese
- slab_reaper also does some other things that we would have to do
in a different way.
Performance tests with AIM7 on an 8p Itanium machine (4 NUMA nodes)
(Memory spreading is active, which means we forgo NUMA locality
in favor of load balancing.)
2.6.18-rc4 straight (w/existing NUMA Slab).
Tasks jobs/min jti jobs/min/task real cpu
1 2434.08 100 2434.0771 2.46 0.02 Tue Aug 15 15:53:12 2006
100 178571.43 95 1785.7143 3.36 7.21 Tue Aug 15 15:53:25 2006
200 280964.65 90 1404.8232 4.27 14.52 Tue Aug 15 15:53:43 2006
300 345356.87 86 1151.1896 5.21 21.87 Tue Aug 15 15:54:05 2006
2.6.18-rc4 with slabifier alone (NUMA system without NUMA slab)!
Tasks jobs/min jti jobs/min/task real cpu
1 2433.09 100 2433.0900 2.47 0.02 Tue Aug 15 17:10:44 2006
100 176108.01 92 1761.0801 3.41 7.47 Tue Aug 15 17:10:58 2006
200 275608.64 88 1378.0432 4.35 15.14 Tue Aug 15 17:11:16 2006
300 338919.22 85 1129.7307 5.31 22.77 Tue Aug 15 17:11:38 2006
We are almost getting there even with this immature version and without having the
object caches.
2.6.18-rc4 with NUMA slab and slabifier.
Tasks jobs/min jti jobs/min/task real cpu
1 2431.12 100 2431.1183 2.47 0.03 Tue Aug 15 19:14:38 2006
100 176418.70 93 1764.1870 3.40 7.53 Tue Aug 15 19:14:52 2006
200 275862.07 90 1379.3103 4.35 15.14 Tue Aug 15 19:15:10 2006
300 338345.86 86 1127.8195 5.32 22.85 Tue Aug 15 19:15:32 2006
Hmmm... This gets a bit less efficient, but then it cannot yet exploit its NUMA
advantage. Definitely more overhead, but also still a pretty naive implementation.
Using the allocator framework we could now add a generic caching layer.
Wonder what that would do.
I would appreciate feedback on the way this is done. Beware: There are
likely numerous issues with this patchset.
This patchset should just leave the existing slab allocator unharmed.
One needs to switch on the modular slab allocator to activate any of this.
Slabifier: A slab allocator with minimal meta information
Lately I have started tinkering with the slab, in particular after
Matt Mackall mentioned at the KS that the slab should be more modular.
One particular design issue with the current slab is that it is built on the
basic notion of shifting object references from list to list. Without NUMA this
is already wild enough with the per cpu caches and the shared cache, but with NUMA
we also have per node shared arrays, per node lists and, on each node, alien
caches for every other node. Somehow this all works, but one wonders: does it have
to be that way? On very large systems the number of these entities grows to
unbelievable numbers.
So I thought it may be best to try to develop another basic slab layer
that does not have all the object queues and that does not have to carry
so much state information. I have also had concerns for a while about the way
locking is handled. We could increase parallelism with finer grained locking,
which in turn may avoid the need for object queues.
After toying around for awhile I came to the realization that the page struct
contains all the information necessary to manage a slab block. One can put
all the management information there and that is also advantageous
for performance since we constantly have to use the page struct anyway for
reverse object lookups and during slab creation. So this also reduces the
cache footprint of the slab. The alignment is naturally the best since the
first object starts right at the page boundary. This reduces the complexity
of alignment calculations.
We use two locks:
1. The per slab cache list_lock to protect the lists. This lock does not
protect the individual slab and is only taken during list manipulation.
List manipulation is reduced to the necessary moves when the state of a
page changes. Allocating or freeing an object in a slab does not require
taking the list_lock as long as the slab does not become empty or full.
2. The page lock in struct page is used to protect the slab during
allocation and freeing. This lock is conveniently placed in a cacheline
that is already available. Other key information is also placed there.
struct page overloading:
- _mapcount => Used to count the objects in use in a slab
- mapping => Reference to the slab structure
- index => Pointer to the first free element in a slab
- lru => Used for list management.
Flag overloading:
PageReferenced => Used to control per cpu slab freeing.
PageActive => slab is under active allocation.
PageLocked => slab locking
The free objects in each page are managed as a linked list. The struct page
contains a pointer to the first free element. The first word of each
free element contains a pointer to the next free element, and so on until the
chain ends with NULL.
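For illustration (a simplified sketch, not code from the patch; the real fast
paths are slab_alloc()/slab_free() further down), popping an object off the
in-page freelist amounts to:

	/*
	 * Take the first free object from a slab page. "offset" is the word
	 * index of the freelist link inside an object (0 by default).
	 */
	static void *freelist_pop(struct page *page, int offset)
	{
		void **object = (void **)page->index;	/* first free object */

		if (object)
			page->index = (unsigned long)object[offset];
		return object;
	}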
There is no freelist for empty slabs. Empty slabs are immediately returned to the page
allocator. The page allocator has its own per cpu page queues that should
provide enough caching (only works for order 0 pages though).
Per cpu caches exist in the sense that each processor has a per processor
"cpuslab". Objects in this slab will only be allocated from this processor.
The page state is likely going to stay in the cache. Allocation will be
very fast since we only need the page struct reference for all our needs
which is likely not contended at all. Fetching the next free pointer from
the location of the object nicely prefetches the object.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.18-rc4/mm/slabifier.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/slabifier.c 2006-08-15 16:29:48.958782043 -0700
@@ -0,0 +1,1014 @@
+/*
+ * Generic Slabifier for any allocator with minimal management overhead.
+ *
+ * (C) 2006 Silicon Graphics Inc., Christoph Lameter <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/allocator.h>
+#include <linux/bit_spinlock.h>
+#include <linux/module.h>
+
+#define SLABIFIER_DEBUG
+
+#ifdef SLABIFIER_DEBUG
+#define DBUG_ON(_x) BUG_ON(_x)
+#else
+#define DBUG_ON(_x)
+#endif
+
+#define TPRINTK printk
+//#define TPRINTK(x, ...)
+
+struct slab {
+ struct slab_cache sc;
+ spinlock_t list_lock; /* Lock for partial slabs.
+ * Also protects active slabs and bit
+ */
+ struct list_head partial; /* List of partially allocated slabs */
+ struct list_head full; /* Fully allocated slabs */
+ unsigned long nr_partial; /* Partial slabs */
+ unsigned long slabs; /* Total slabs used */
+ int size; /* Slab size */
+ int offset; /* Free pointer offset. */
+ int objects; /* Number of objects in slab */
+ atomic_t refcount; /* Refcount for destroy */
+ /* Flusher related data */
+ int flusher_active;
+ struct work_struct flush;
+ struct page *active[NR_CPUS]; /* Per CPU slabs list protected by
+ * page lock
+ */
+};
+
+static struct slab cache_cache;
+
+/*
+ * A good compiler should figure out that the result of this function
+ * is constant.
+ */
+static int get_slab_order(void)
+{
+ return max(0, (fls(sizeof(struct slab)-1) - PAGE_SHIFT));
+}
+
+/*
+ * The page struct is used to keep necessary information about a slab.
+ * For a compound page the first page keeps the slab state.
+ *
+ * Overloaded fields in struct page:
+ *
+ * lru -> used to put a slab on the lists
+ * mapping -> pointer to struct slab
+ * index -> pointer to next free object
+ * _mapcount -> count number of elements in use
+ *
+ * Lock order:
+ * 1. slab_lock(page)
+ * 2. slab->list_lock
+ *
+ * The slabifier assigns one slab for allocation to each processor.
+ * This slab is kept in the active array and allocations
+ * occur only on the active slabs. If a cpu slab is active then
+ * a workqueue thread checks every 10 seconds if the cpu slab is
+ * still in use. It is dropped back to the inactive lists if not.
+ *
+ * Leftover slabs with free elements are kept on the partial list
+ * and full slabs on the full list.
+ *
+ * Slabs are freed when they become empty. Teardown and setup is
+ * minimal so we rely on the page cache per cpu caches for
+ * performance on frees/allocs.
+ */
+
+#define lru_to_last_page(_head) (list_entry((_head)->prev, struct page, lru))
+#define lru_to_first_page(_head) (list_entry((_head)->next, struct page, lru))
+
+/* Some definitions to overload certain fields in struct page */
+
+static void *get_object_pointer(struct page *page)
+{
+ return (void *)page->index;
+}
+
+static void set_object_pointer(struct page *page, void *object)
+{
+ page->index = (unsigned long)object;
+}
+
+static struct slab *get_slab(struct page *page)
+{
+ return (struct slab *)page->mapping;
+}
+
+static void set_slab(struct page *page, struct slab *s)
+{
+ page->mapping = (void *)s;
+}
+
+static int *object_counter(struct page *page)
+{
+ return (int *)&page->_mapcount;
+}
+
+static void inc_object_counter(struct page *page)
+{
+ (*object_counter(page))++;
+}
+
+static void dec_object_counter(struct page *page)
+{
+ (*object_counter(page))--;
+}
+
+static void set_object_counter(struct page *page, int counter)
+{
+ (*object_counter(page))= counter;
+}
+
+static int get_object_counter(struct page *page)
+{
+ return (*object_counter(page));
+}
+
+/*
+ * For a given active page get the corresponding cpu
+ */
+static int get_active_cpu(struct page *page)
+{
+ return (unsigned long)(page->lru.prev);
+}
+
+static void set_active_cpu(struct page *page, unsigned long cpu)
+{
+ page->lru.prev = (void *)cpu;
+}
+
+/*
+ * Locking for each individual slab using the pagelock
+ */
+static void slab_lock(struct page *page)
+{
+ bit_spin_lock(PG_locked, &page->flags);
+}
+
+static void slab_unlock(struct page *page)
+{
+ bit_spin_unlock(PG_locked, &page->flags);
+}
+
+static void check_slab(struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+ if (!PageSlab(page)) {
+ printk(KERN_CRIT "Not a valid slab page @%p flags=%lx"
+ " mapping=%p count=%d \n",
+ page, page->flags, page->mapping, page_count(page));
+ BUG();
+ }
+#endif
+}
+
+static void check_active_slab(struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+ if (!PageActive(page)) {
+ printk(KERN_CRIT "Not an active slab page @%p flags=%lx"
+ " mapping=%p count=%d \n",
+ page, page->flags, page->mapping, page_count(page));
+ BUG();
+ }
+#endif
+}
+
+/*
+ * Discard an unused slab page
+ * Must hold list_lock.
+ * Cannot hold the slab lock since the page is going away.
+ */
+static void discard_slab(struct slab *s, struct page *page)
+{
+ TPRINTK(KERN_CRIT "slab %s free %p page_alloc=%p free=%p\n", s->sc.name, page,
+ s->sc.page_alloc, s->sc.page_alloc->free);
+
+ DBUG_ON(PageActive(page));
+ DBUG_ON(PageLocked(page));
+ list_del(&page->lru);
+ s->slabs--;
+
+ /* Restore page state */
+ page->mapping = NULL; /* was used for slab pointer */
+ page->index = 0; /* was used for the object pointer */
+ reset_page_mapcount(page); /* Was used for inuse counter */
+ __ClearPageSlab(page);
+
+ s->sc.page_alloc->free(s->sc.page_alloc, page, s->sc.order);
+ sub_zone_page_state(page_zone(page), NR_SLAB, 1 << s->sc.order);
+}
+
+/*
+ * Move a page back to the lists.
+ *
+ * Must be called with the slab lock held.
+ * On exit the slab lock will have been dropped.
+ */
+static void deactivate_slab(struct slab *s, struct page *page, int contended)
+{
+ int inuse;
+#ifdef SLABIFIER_DEBUG
+ void *objp;
+ int cpu = get_active_cpu(page);
+#endif
+
+ spin_lock(&s->list_lock);
+// We are also using this from slab_shrink!
+// DBUG_ON(!s->active[get_active_cpu(page)]);
+// check_active_slab(page);
+
+ s->active[get_active_cpu(page)] = NULL;
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ inuse = get_object_counter(page);
+#ifdef SLABIFIER_DEBUG
+ /*
+ * Must get this before dropping slab lock otherwise others
+ * may already be freeing objects in the page again.
+ */
+ objp = get_object_pointer(page);
+#endif
+ slab_unlock(page);
+ if (inuse) {
+ if (inuse < s->objects) {
+ DBUG_ON(!objp);
+ TPRINTK(KERN_CRIT "slab %s: %p partial %d/%d %d cpu=%d\n",s->sc.name, page, inuse, s->objects, contended, cpu);
+ s->nr_partial++;
+ if (contended)
+ list_add_tail(&page->lru, &s->partial);
+ else
+ list_add(&page->lru, &s->partial);
+ } else {
+ DBUG_ON(objp);
+ TPRINTK(KERN_CRIT "slab %s: %p full %d cpu=%d\n",s->sc.name, page, contended, cpu);
+ list_add_tail(&page->lru, &s->full);
+ }
+ } else {
+ /* For discard_slab we must have the slab on some list */
+ list_add_tail(&page->lru, &s->full);
+ discard_slab(s, page);
+ }
+ spin_unlock(&s->list_lock);
+}
+
+static int check_valid_pointer(struct slab *s, struct page *page, void *object, void *origin)
+{
+#ifdef SLABIFIER_DEBUG
+ void *base = page_address(page);
+
+ check_slab(page);
+
+ if (object < base || object >= base + s->objects * s->size) {
+ printk(KERN_CRIT "slab %s size %d: pointer %p->%p\nnot in "
+ " range (%p-%p) in page %p\n", s->sc.name, s->size,
+ origin, object, base, base + s->objects * s->size,
+ page);
+ return 0;
+ }
+
+ if ((object - base) % s->size) {
+ printk(KERN_CRIT "slab %s size %d: pointer %p->%p\n"
+ "does not properly point"
+ "to an object in page %p\n",
+ s->sc.name, s->size, origin, object, page);
+ return 0;
+ }
+#endif
+ return 1;
+}
+
+/*
+ * Determine if a certain object on a page is on the freelist and
+ * therefore free. Must hold either the list_lock for inactive slabs
+ * or the slab lock for active slabs to guarantee that the chains
+ * are consistent.
+ */
+static int on_freelist(struct slab *s, struct page *page, void *search)
+{
+ int nr = 0;
+ void **object = get_object_pointer(page);
+ void *origin = &page->lru;
+
+ check_slab(page);
+
+ while (object) {
+ if (object == search)
+ return 1;
+ if (!check_valid_pointer(s, page, object, origin))
+ goto try_recover;
+ origin = object;
+ object = object[s->offset];
+ nr++;
+ }
+
+ if (get_object_counter(page) != s->objects - nr) {
+ printk(KERN_CRIT "slab %s: page %p wrong object count."
+ " counter is %d but counted were %d\n",
+ s->sc.name, page, get_object_counter(page), nr);
+try_recover:
+ printk(KERN_CRIT "****** Trying to continue by marking "
+ "all objects used (memory leak!)\n");
+ set_object_counter(page, s->objects);
+ set_object_pointer(page, NULL);
+ }
+ return 0;
+}
+
+void check_free_chain(struct slab *s, struct page *page)
+{
+#ifdef SLABIFIER_DEBUG
+ on_freelist(s, page, NULL);
+#endif
+}
+
+/*
+ * Allocate a new slab and prepare an empty freelist
+ * and the basic struct page settings.
+ */
+static struct page *new_slab(struct slab *s, gfp_t flags)
+{
+ void *p, *start, *end;
+ void **last;
+ struct page *page;
+
+ TPRINTK(KERN_CRIT "add slab %s flags=%x\n", s->sc.name, flags);
+
+ page = s->sc.page_alloc->allocate(s->sc.page_alloc, s->sc.order,
+ flags, s->sc.node);
+ if (!page)
+ return NULL;
+
+ set_slab(page, s);
+ start = page_address(page);
+ set_object_pointer(page, start);
+
+ end = start + s->objects * s->size;
+ last = start;
+ for (p = start + s->size; p < end; p += s->size) {
+ last[s->offset] = p;
+ last = p;
+ }
+ last[s->offset] = NULL;
+ set_object_counter(page, 0);
+ __SetPageSlab(page);
+ check_free_chain(s, page);
+ add_zone_page_state(page_zone(page), NR_SLAB, 1 << s->sc.order);
+ return page;
+}
+
+/*
+ * Acquire the slab lock for this cpu's slab from the active array. If there
+ * is no active slab for this processor then return NULL.
+ */
+static struct page *get_and_lock_active(struct slab *s, int cpu)
+{
+ struct page *page;
+
+redo:
+ page = s->active[cpu];
+ if (unlikely(!page))
+ return NULL;
+ slab_lock(page);
+ if (unlikely(s->active[cpu] != page)) {
+ slab_unlock(page);
+ goto redo;
+ }
+ check_active_slab(page);
+ return page;
+}
+
+/*
+ * Return a per cpu slab by taking it off the active list.
+ */
+static void flush_active(struct slab *s, int cpu)
+{
+ struct page *page;
+ unsigned long flags;
+
+ TPRINTK(KERN_CRIT "flush_active %s cpu=%d\n", s->sc.name, cpu);
+ local_irq_save(flags);
+ page = get_and_lock_active(s, cpu);
+ if (likely(page))
+ deactivate_slab(s, page, 0);
+ local_irq_restore(flags);
+}
+
+/*
+ * Flush per cpu slabs if they are not in use.
+ */
+void flusher(void *d)
+{
+ struct slab *s = d;
+ int cpu;
+ struct page *page;
+ int nr_active = 0;
+
+ for_each_online_cpu(cpu) {
+
+ page = s->active[cpu];
+ if (!page)
+ continue;
+
+ if (PageReferenced(page)) {
+ ClearPageReferenced(page);
+ nr_active++;
+ } else
+ flush_active(s, cpu);
+ }
+ if (nr_active)
+ schedule_delayed_work(&s->flush, 10 * HZ);
+ else
+ s->flusher_active = 0;
+}
+
+/*
+ * Drain all per cpu slabs
+ */
+static void drain_all(struct slab *s)
+{
+ int cpu;
+
+ if (s->flusher_active) {
+ cancel_delayed_work(&s->flush);
+ for_each_possible_cpu(cpu)
+ flush_active(s, cpu);
+ s->flusher_active = 0;
+ }
+}
+
+/*
+ * slab_create produces objects aligned at size and the first object
+ * is placed at offset 0 in the slab (We have no metainformation on the
+ * slab, all slabs are in essence off slab).
+ *
+ * In order to get the desired alignment one just needs to align the
+ * size. F.e.
+ *
+ * slab_create(&my_cache, ALIGN(sizeof(struct mystruct), CACHE_L1_SIZE),
+ * 2, page_allocator);
+ *
+ * Notice that the allocation order determines the sizes of the per cpu
+ * caches. Each processor has always one slab available for allocations.
+ * Increasing the allocation order reduces the number of times that slabs
+ * must be moved on and off lists and therefore influences overhead.
+ *
+ * The offset is used to relocate the free list link in each object. It is
+ * therefore possible to move the free list link behind the object. This
+ * is necessary for RCU to work properly and also useful for debugging.
+ */
+static int __slab_create(struct slab *s,
+ const struct slab_allocator *slab_alloc,
+ const struct page_allocator *page_alloc, int node, const char *name,
+ int size, int align, int order, int objsize, int inuse, int offset)
+{
+ int cpu;
+
+ slab_allocator_fill(&s->sc, slab_alloc, page_alloc, node, name, size,
+ align, order, objsize, inuse, offset);
+
+ s->size = ALIGN(size, sizeof(void *));
+
+ if (offset > s->size - sizeof(void *) || (offset % sizeof(void*)))
+ return -EINVAL;
+
+ s->offset = offset / sizeof(void *);
+ s->objects = (PAGE_SIZE << order) / s->size;
+ s->slabs = 0;
+ s->nr_partial = 0;
+ s->flusher_active = 0;
+
+ if (!s->objects)
+ return -EINVAL;
+
+
+ INIT_LIST_HEAD(&s->partial);
+ INIT_LIST_HEAD(&s->full);
+
+ atomic_set(&s->refcount, 1);
+ spin_lock_init(&s->list_lock);
+ INIT_WORK(&s->flush, &flusher, s);
+
+ for_each_possible_cpu(cpu)
+ s->active[cpu] = NULL;
+ return 0;
+}
+
+/*
+ * Deal with ugly bootstrap issues.
+ */
+static struct slab_cache *slab_create(const struct slab_allocator *slab_alloc,
+ const struct page_allocator *page_alloc, int node, const char *name,
+ int size, int align, int order, int objsize, int inuse, int offset)
+{
+ struct slab *s;
+
+ if (!cache_cache.sc.size) {
+ if (__slab_create(&cache_cache,
+ slab_alloc, page_alloc, node, "cache_cache",
+ ALIGN(sizeof(struct slab), L1_CACHE_BYTES),
+ L1_CACHE_BYTES,
+ get_slab_order(), sizeof(struct slab),
+ sizeof(struct slab), 0))
+ panic("Cannot create slab of slabifier slabs!");
+ }
+
+ s = cache_cache.sc.slab_alloc->alloc_node(&cache_cache.sc,
+ GFP_KERNEL, node);
+
+ if (!s)
+ return NULL;
+ __slab_create(s, slab_alloc, page_alloc, node, name, size, align,
+ order, objsize, inuse, offset);
+ return &s->sc;
+}
+
+/*
+ * Reload a new active cpu slab
+ *
+ * On entry the active slab for this cpu must be NULL.
+ * If we have reloaded successfully then we exit with holding the slab lock
+ * and return the pointer to the new page.
+ *
+ * Return NULL if we cannot reload.
+ */
+static struct page *reload(struct slab *s, unsigned long cpu, gfp_t flags)
+{
+ struct page *page;
+
+ spin_lock(&s->list_lock);
+
+redo:
+ if (s->active[cpu]) {
+ /* Someone else updated it first. */
+ spin_unlock(&s->list_lock);
+ page = get_and_lock_active(s, cpu);
+ if (!page)
+ goto redo;
+ return page;
+ }
+
+ if (unlikely(list_empty(&s->partial))) {
+
+ /* No slabs with free objects */
+ spin_unlock(&s->list_lock);
+ if ((flags & __GFP_WAIT)) {
+ local_irq_enable();
+ page = new_slab(s, flags);
+ local_irq_disable();
+ } else
+ page = new_slab(s, flags);
+
+ if (!page)
+ return NULL;
+
+ spin_lock(&s->list_lock);
+ list_add(&page->lru,&s->partial);
+ s->nr_partial++;
+ /*
+ * Need to recheck conditions since we dropped
+ * the list_lock
+ */
+ goto redo;
+ }
+ page = lru_to_first_page(&s->partial);
+ list_del(&page->lru);
+ s->nr_partial--;
+ slab_lock(page);
+ SetPageActive(page);
+ ClearPageReferenced(page);
+ set_active_cpu(page, cpu);
+ s->active[cpu] = page;
+ spin_unlock(&s->list_lock);
+ if (keventd_up() && !s->flusher_active)
+ schedule_delayed_work(&s->flush, 10 * HZ);
+ return page;
+}
+/*
+ * If the gfp mask has __GFP_WAIT set then slab_alloc() may enable interrupts
+ * if it needs to acquire more pages for new slabs.
+ */
+static void *slab_alloc(struct slab_cache *sc, gfp_t gfpflags)
+{
+ struct slab *s = (void *)sc;
+ int cpu = smp_processor_id();
+ struct page *page;
+ void **object = NULL;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ do {
+ page = get_and_lock_active(s, cpu);
+
+ if (unlikely(!page)) {
+ page = reload(s, cpu, gfpflags);
+
+ if (!page)
+ goto out;
+ }
+
+ check_free_chain(s, page);
+ object = get_object_pointer(page);
+ if (unlikely(!object))
+ /* Sorry, fully allocated slab! */
+ deactivate_slab(s, page, 0);
+ } while (!object);
+
+ set_object_pointer(page, object[s->offset]);
+ inc_object_counter(page);
+ SetPageReferenced(page);
+ check_free_chain(s, page);
+ slab_unlock(page);
+out:
+ local_irq_restore(flags);
+ return object;
+}
+
+/* Figure out on which slab object the object resides */
+static struct page *get_object_page(const void *x)
+{
+ struct page * page;
+
+ page = virt_to_page(x);
+ if (unlikely(PageCompound(page)))
+ page = (struct page *)page_private(page);
+
+ if (!PageSlab(page))
+ return NULL;
+
+ return page;
+}
+
+/*
+ * If the struct slab pointer is NULL then we will determine the
+ * proper cache. Otherwise the slab ownership will be verified
+ * before object removal.
+ */
+static void slab_free(struct slab_cache *sc, const void *x)
+{
+ struct slab *s = (void *)sc;
+ struct page * page;
+ int leftover;
+ void *prior;
+ void **object = (void *)x;
+ unsigned long flags;
+
+ if (!object)
+ return;
+
+ page = get_object_page(object);
+ if (unlikely(!page)) {
+ printk(KERN_CRIT "slab_free %s size %d: attempt to free object"
+ "(%p) outside of slab.\n", s->sc.name, s->size, object);
+ goto dumpret;
+ }
+
+ if (!s) {
+ s = get_slab(page);
+
+ if (unlikely(!s)) {
+#ifdef CONFIG_SLAB
+ extern void __kfree(void *);
+ /*
+ * Oops, we use multiple slab allocators. This
+ * object must belong to another one (the regular slab?).
+ */
+ __kfree(object);
+ return;
+#else
+ printk(KERN_CRIT
+ "slab_free : no slab(NULL) for object %p.\n",
+ object);
+#endif
+ goto dumpret;
+ }
+ }
+
+ if (unlikely(s != get_slab(page))) {
+ printk(KERN_CRIT "slab_free %s: object at %p"
+ " belongs to slab %p\n",
+ s->sc.name, object, get_slab(page));
+ dump_stack();
+ s = get_slab(page);
+ }
+
+ if (unlikely(!check_valid_pointer(s, page, object, NULL))) {
+dumpret:
+ dump_stack();
+ printk(KERN_CRIT "***** Trying to continue by not"
+ "freeing object.\n");
+ return;
+ }
+
+ local_irq_save(flags);
+ slab_lock(page);
+
+ /* Check for double free */
+#ifdef SLABIFIER_DEBUG
+ if (on_freelist(s, page, object)) {
+ printk(KERN_CRIT "slab_free %s: object %p already free.\n",
+ s->sc.name, object);
+ dump_stack();
+ goto out_unlock;
+ }
+#endif
+
+ prior = get_object_pointer(page);
+ object[s->offset] = prior;
+
+ set_object_pointer(page, object);
+ dec_object_counter(page);
+ leftover = get_object_counter(page);
+
+ if (unlikely(PageActive(page)))
+ goto out_unlock;
+
+ if (unlikely(leftover == 0)) {
+ /*
+ * We deallocated all objects in a slab and the slab
+ * is not under allocation. So we can free it.
+ */
+ spin_lock(&s->list_lock);
+ check_free_chain(s, page);
+ slab_unlock(page);
+ discard_slab(s, page);
+ spin_unlock(&s->list_lock);
+ goto out;
+ }
+ if (unlikely(!prior)) {
+ /*
+ * Page was fully used before. It will only have one free
+ * object now. So move to the front of the partial list.
+ * This will increase the chances of the first object
+ * to be reused soon. Its likely cache hot.
+ */
+ spin_lock(&s->list_lock);
+ list_move(&page->lru, &s->partial);
+ s->nr_partial++;
+ spin_unlock(&s->list_lock);
+ }
+out_unlock:
+ slab_unlock(page);
+out:
+ local_irq_restore(flags);
+}
+
+/*
+ * Check if a given pointer is valid
+ */
+static int slab_pointer_valid(struct slab_cache *sc, const void *object)
+{
+ struct slab *s = (void *)sc;
+ struct page * page;
+ void *addr;
+
+ TPRINTK(KERN_CRIT "slab_pointer_valid(%s,%p\n",s->sc.name, object);
+
+ page = get_object_page(object);
+
+ if (!page || s != get_slab(page))
+ return 0;
+
+ addr = page_address(page);
+ if (object < addr || object >= addr + s->objects * s->size)
+ return 0;
+
+ if ((object - addr) % s->size)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Determine the size of a slab object
+ */
+static unsigned long slab_object_size(struct slab_cache *sc,
+ const void *object)
+{
+ struct page *page;
+ struct slab *s;
+
+ TPRINTK(KERN_CRIT "slab_object_size(%p)\n",object);
+
+ page = get_object_page(object);
+ if (page) {
+ s = get_slab(page);
+ BUG_ON(sc && s != (void *)sc);
+ if (s)
+ return s->size;
+ }
+ BUG();
+}
+
+/*
+ * Move slab objects in a given slab by calling the move_objects
+ * function.
+ */
+static int move_slab_objects(struct slab *s, struct page *page,
+ int (*move_objects)(struct slab_cache *, void *))
+{
+ int unfreeable = 0;
+ void *addr = page_address(page);
+
+ while (get_object_counter(page) - unfreeable > 0) {
+ void *p;
+
+ for (p = addr; p < addr + s->objects * s->size; p += s->size) {
+ if (!on_freelist(s, page, p)) {
+ if (move_objects((struct slab_cache *)s, p))
+ slab_free(&s->sc, p);
+ else
+ unfreeable++;
+ }
+ }
+ }
+ return unfreeable;
+}
+
+/*
+ * Shrinking drops all the active per cpu slabs and also reaps all empty
+ * slabs off the partial list (preloading may have added those). Returns the
+ * number of slabs freed.
+ *
+ * If a move_object function is specified then the partial list is going
+ * to be compacted by calling the function on all slabs until only a single
+ * slab is on the partial list. The move_object function
+ * will be called for each object in the respective slab. move_object needs to
+ * perform a new allocation for the object and move the contents
+ * of the object to the new location. If move_object returns 1
+ * for success then the object is going to be removed. If 0 then the
+ * slab object cannot be freed at all.
+ *
+ * Try to remove as many slabs as possible. In particular try to undo the
+ * effect of slab_preload which may have added empty pages to the
+ * partial list.
+ *
+ * Returns the number of slabs freed.
+ */
+static int slab_shrink(struct slab_cache *sc,
+ int (*move_object)(struct slab_cache *, void *))
+{
+ struct slab *s = (void *)sc;
+ unsigned long flags;
+ int slabs_freed = 0;
+ int i;
+
+ drain_all(s);
+ /*
+ * Take the last element of the partial list and free entries until
+ * the free list contains only a single page.
+ *
+ * FIXME: If there are multiple objects pinned in multiple
+ * slabs then this loop may never terminate!
+ */
+ for(i = 0; s->nr_partial > 1 && i < s->nr_partial - 1 ; i++) {
+ struct page * page;
+
+ /* Take one page off the list */
+ spin_lock_irqsave(&s->list_lock, flags);
+
+ if (s->nr_partial == 0) {
+ spin_unlock_irqrestore(&s->list_lock, flags);
+ break;
+ }
+
+ page = lru_to_last_page(&s->partial);
+ s->nr_partial--;
+ list_del(&page->lru);
+
+ spin_unlock_irqrestore(&s->list_lock, flags);
+
+ /* Free the objects in it */
+ if (s->nr_partial) {
+ if (get_object_counter(page) < s->objects)
+ if (move_slab_objects(s,
+ page, move_object) == 0)
+ slabs_freed++;
+
+ }
+ /*
+ * This will put the slab on the front of the partial
+ * list, the used list or free it.
+ */
+ slab_lock(page);
+ deactivate_slab(s, page, 0);
+ }
+
+ return slabs_freed;
+
+}
+
+static void free_list(struct slab *s, struct list_head *list)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->list_lock, flags);
+ while (!list_empty(list))
+ discard_slab(s, lru_to_last_page(list));
+
+ spin_unlock_irqrestore(&s->list_lock, flags);
+}
+
+
+/* Duplicate a slab handle */
+static struct slab_cache *slab_dup(struct slab_cache *sc)
+{
+ struct slab *s = (void *)sc;
+
+ atomic_inc(&s->refcount);
+ return &s->sc;
+}
+
+/*
+ * Release all leftover slabs. If there are any leftover pointers dangling
+ * to these objects then we will get into a lot of trouble later.
+ */
+static int slab_destroy(struct slab_cache *sc)
+{
+ struct slab * s = (void *)sc;
+
+ if (!atomic_dec_and_test(&s->refcount))
+ return 0;
+
+ TPRINTK("Slab destroy %s\n",sc->name);
+ drain_all(s);
+
+ free_list(s, &s->full);
+ free_list(s, &s->partial);
+
+ if (s->slabs)
+ return 1;
+
+ /* Just to make sure that no one uses this again */
+ s->size = 0;
+ cache_cache.sc.slab_alloc->free(&cache_cache.sc, sc);
+ return 0;
+
+}
+
+static unsigned long count_objects(struct slab *s, struct list_head *list)
+{
+ int count = 0;
+ struct list_head *h;
+ unsigned long flags;
+
+ spin_lock_irqsave(&s->list_lock, flags);
+ list_for_each(h, list) {
+ struct page *page = list_entry(h, struct page, lru);
+
+ count += get_object_counter(page);
+ }
+ spin_unlock_irqrestore(&s->list_lock, flags);
+ return count;
+}
+
+static unsigned long slab_objects(struct slab_cache *sc,
+ unsigned long *p_active, unsigned long *p_partial)
+{
+ struct slab *s = (void *)sc;
+ int partial;
+ int active = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ if (s->active[cpu])
+ active++;
+
+ partial = count_objects(s, &s->partial);
+
+ if (p_partial)
+ *p_partial = partial;
+
+ if (p_active)
+ *p_active = active;
+
+ return active + partial + count_objects(s, &s->full);
+}
+
+static void *slab_alloc_node(struct slab_cache *sc, gfp_t flags, int node)
+{
+ return slab_alloc(sc, flags);
+}
+
+const struct slab_allocator slabifier_allocator = {
+ .name = "Slabifier",
+ .create = slab_create,
+ .alloc = slab_alloc,
+ .alloc_node = slab_alloc_node,
+ .free = slab_free,
+ .valid_pointer = slab_pointer_valid,
+ .object_size = slab_object_size,
+ .objects = slab_objects,
+ .shrink = slab_shrink,
+ .dup = slab_dup,
+ .destroy = slab_destroy,
+ .destructor = null_slab_allocator_destructor,
+};
+EXPORT_SYMBOL(slabifier_allocator);
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-15 16:20:52.526098853 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 16:20:53.007514428 -0700
@@ -24,5 +24,5 @@ obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o slabulator.o
+obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o slabulator.o slabifier.o
Add allocator abstraction
The allocator abstraction layer provides sources of pages for the slabifier.
Various basic sources of pages are provided (regular, vmalloc, numa
node memory) and some modifiers (rcuify, dma etc). The slabifier will use
those sources of pages as the source of memory on top of which it sets up
its slab structures.
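As an illustration of the "make your own" hooks only (this is not part of the
patch; my_pool_alloc_page()/my_pool_free_page() are hypothetical helpers), a
mempool-backed page source could be derived and then handed to the slabifier:

	static struct page *pool_allocate(const struct page_allocator *a,
			int order, gfp_t flags, int node)
	{
		return my_pool_alloc_page(order, flags, node);
	}

	static void pool_free(const struct page_allocator *a,
			struct page *page, int order)
	{
		my_pool_free_page(page, order);
	}

	struct page_allocator *mempool_page_allocator(void)
	{
		struct derived_page_allocator *d =
			derive_page_allocator(&page_allocator);

		d->a.allocate = pool_allocate;
		d->a.free = pool_free;
		d->a.name = "mempool";
		return &d->a;
	}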
Signed-off-by: Christoph Lameter <clameter>.
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-13 23:26:17.597328068 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 15:09:06.168282639 -0700
@@ -24,4 +24,5 @@ obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_MODULAR_SLAB) += allocator.o
Index: linux-2.6.18-rc4/include/linux/allocator.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/include/linux/allocator.h 2006-08-15 16:18:38.379110868 -0700
@@ -0,0 +1,233 @@
+#ifndef _LINUX_ALLOCATOR_H
+#define _LINUX_ALLOCATOR_H
+
+/*
+ * Generic API to memory allocators.
+ * (C) 2006 Silicon Graphics, Inc,
+ * Christoph Lameter <[email protected]>
+ */
+#include <linux/gfp.h>
+
+/*
+ * Page allocators
+ *
+ * Page allocators are sources of memory in pages. They basically only
+ * support allocation and freeing of pages. The interesting thing
+ * is how these pages are obtained. With these methods we can also
+ * interpose functionality between a page allocator and its users
+ * without too much effort. This allows us to encapsulate new
+ * features.
+ *
+ * New allocators could be added f.e. for specialized memory pools.
+ */
+
+struct page_allocator {
+ struct page *(*allocate)(const struct page_allocator *, int order,
+ gfp_t mask, int node);
+ void (*free)(const struct page_allocator *, struct page *, int order);
+ void (*destructor) (struct page_allocator *);
+ const char *name;
+};
+
+/* Some known page allocators */
+extern const struct page_allocator page_allocator;
+extern const struct page_allocator vmalloc_allocator;
+
+/*
+ * Generators for new allocators based on known allocators
+ *
+ * These behave like modifiers to already generated or
+ * existing allocators. May be combined at will.
+ */
+
+/*
+ * A way to free all pages via RCU. The RCU head is placed in the
+ * struct page so this is fully transparent and does not require any
+ * additional allocation or freeing of memory.
+ */
+struct page_allocator *rcuify_page_allocator
+ (const struct page_allocator *base);
+
+/*
+ * Make an allocation via a specific allocator always return
+ * DMA memory.
+ */
+struct page_allocator *dmaify_page_allocator
+ (const struct page_allocator *base);
+
+/*
+ * This provides a constructor and a destructor call for each object
+ * on a page. The constructors and destructors calling conventions
+ * are compatible with the existing slab implementation. However,
+ * this implementation assumes that the objects always start at offset 0.
+ *
+ * The main use of these is to provide a generic form of constructors
+ * and destructors. These run after a page was allocated and before
+ * a page is freed.
+ */
+struct page_allocator *ctor_and_dtor_for_page_allocator
+ (const struct page_allocator *, unsigned long size, void *private,
+ void (*ctor)(void *, void *, unsigned long),
+ void (*dtor)(void *, void *, unsigned long));
+
+#ifdef CONFIG_NUMA
+/*
+ * Allocator that allows the customization of the NUMA behavior of an
+ * allocator. If a node is specified then the allocator will always try
+ * to allocate on that node. Flags set are ORed for every allocation.
+ * F.e. one can set GFP_THISNODE to force an allocation on a particular node
+ * or on a local node.
+ */
+struct page_allocator *numactl_allocator(const struct page_allocator *,
+ int node, gfp_t flags);
+#endif
+
+/* Tools to make your own */
+struct derived_page_allocator {
+ struct page_allocator a;
+ const struct page_allocator *base;
+};
+
+void derived_destructor(struct page_allocator *a);
+
+struct derived_page_allocator *derive_page_allocator(
+ const struct page_allocator *base);
+
+/*
+ * Slab allocators
+ */
+
+
+/*
+ * A slab cache structure is generated by a slab allocator when the
+ * create method is invoked.
+ */
+struct slab_cache {
+ const struct slab_allocator *slab_alloc;
+ const struct page_allocator *page_alloc;
+ int node; /* Node passed to page allocator */
+ const char *name; /* Name (only for display!) */
+ int size; /* The size of a chunk on a slab */
+ int align; /* Alignment requirements */
+ int objsize; /* The size of an object that is in a chunk */
+ int inuse; /* Used portion of the chunk */
+ int offset; /* Offset to the freelist pointer */
+ unsigned int order; /* Size of the slab page */
+};
+
+struct slab_allocator {
+ /*
+ * Create an actually usable slab cache from a slab allocator
+ */
+ struct slab_cache *(*create)(const struct slab_allocator *,
+ const struct page_allocator *a, int node, const char *name,
+ int size, int align, int order, int objsize, int inuse,
+ int offset);
+
+ /* Allocation functions */
+ void *(*alloc)(struct slab_cache *, gfp_t);
+ void *(*alloc_node)(struct slab_cache *, gfp_t, int);
+ void (*free)(struct slab_cache *, const void *);
+
+ /* Object checks */
+ int (*valid_pointer)(struct slab_cache *, const void *object);
+ unsigned long (*object_size)(struct slab_cache *, const void *);
+
+ /*
+ * Determine slab statistics in units of slabs. Returns the
+ * number of total pages used by the slab cache.
+ * active are the pages under allocation or empty
+ * partial are the number of partial slabs.
+ */
+ unsigned long (*objects)(struct slab_cache *, unsigned long *active,
+ unsigned long *partial);
+
+ /*
+ * shrink defragments a slab cache by moving objects from sparsely
+ * populated slabs to others. slab shrink will terminate when there
+ * is only one fragmented slab left.
+ *
+ * The move_object function must be supplied otherwise shrink can only
+ * free pages that are completely empty.
+ *
+ * move_object gets a slab_cache pointer and an object pointer. The
+ * function must reallocate another object and move the contents
+ * from this object into the new object. Then the function should
+ * return 1 for success. If it returns 0 then the object is pinned and
+ * the slab that the object resides on will not be freed.
+ */
+ int (*shrink)(struct slab_cache *,
+ int (*move_object)(struct slab_cache *, void *));
+
+ /*
+ * The NUMA slab provides an array of slab caches. This is a means
+ * to get to the slab cache on a particular node. But beware!
+ * The resulting slab_cache belongs to another allocator.
+ */
+ struct slab_cache *(*node)(struct slab_cache *, int node);
+
+ /*
+ * Establish a new reference so that destroy does not
+ * unnecessarily destroy the slab_cache.
+ */
+ struct slab_cache * (*dup)(struct slab_cache *);
+ int (*destroy)(struct slab_cache *);
+ void (*destructor)(struct slab_allocator *);
+ const char *name;
+};
+
+/* Standard slab allocators */
+extern const struct slab_allocator slab_allocator; /* The original one */
+extern const struct slab_allocator slob_allocator;
+extern const struct slab_allocator slabifier_allocator;
+
+/* Use page allocator for slab (allocation size in pages) */
+extern struct slab_allocator page_slab_allocator;
+
+/* Access kmalloc's fixed slabs without creating new ones. */
+extern struct slab_allocator kmalloc_slab_allocator;
+
+#ifdef CONFIG_NUMA
+extern const struct slab_allocator numa_slab_allocator;
+#endif
+
+/* Generate new slab allocators based on old ones */
+struct slab_allocator *rcuify_slab_allocator(const struct slab_allocator *base);
+struct slab_allocator *dmaify_slab_allocator(const struct slab_allocator *base);
+struct slab_allocator *redzone_slab(struct slab_allocator *base);
+struct slab_allocator *track_slab(struct slab_allocator *base);
+struct slab_allocator *trace_slab(struct slab_allocator *base);
+
+/* Tools to make your own slab allocators */
+static inline void slab_allocator_fill(struct slab_cache *sc,
+ const struct slab_allocator *slab_alloc,
+ const struct page_allocator *page_alloc, int node,
+ const char *name, int size, int align,
+ int order, int objsize, int inuse, int offset)
+{
+ sc->slab_alloc = slab_alloc;
+ sc->page_alloc = page_alloc;
+ sc->node = node;
+ sc->name = name;
+ sc->size = size;
+ sc->align = align;
+ sc->order = order;
+ sc->objsize = objsize;
+ sc->inuse = inuse;
+ sc->offset = offset;
+}
+
+/* Indestructible static allocators use this. */
+void null_slab_allocator_destructor(struct slab_allocator *);
+
+struct derived_slab_allocator {
+ struct slab_allocator a;
+ const struct slab_allocator *base;
+};
+
+void derived_slab_destructor(struct slab_allocator *a);
+
+struct derived_slab_allocator *derive_slab_allocator(
+ const struct slab_allocator *base);
+
+#endif /* _LINUX_ALLOCATOR_H */
Index: linux-2.6.18-rc4/mm/allocator.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/allocator.c 2006-08-15 16:17:27.996739026 -0700
@@ -0,0 +1,411 @@
+/*
+ * Generic Allocator Functions for the slabifier
+ *
+ * This abstracts pagesize based order N allocators.
+ *
+ * (C) 2006 Silicon Graphics, Inc. Christoph Lameter <[email protected]>
+ */
+
+#include <linux/allocator.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+
+/* For static allocators */
+
+static void null_destructor(struct page_allocator *a)
+{
+}
+
+/*
+ * Static allocators
+ */
+
+/*
+ * The page allocator that can allocate all of memory
+ */
+static struct page *gen_alloc(const struct page_allocator *a, int order,
+ gfp_t flags, int node)
+{
+ if (order)
+ flags |= __GFP_COMP;
+#ifdef CONFIG_NUMA
+ if (node >= 0)
+ return alloc_pages_node(node, flags, order);
+#endif
+ return alloc_pages(flags, order);
+}
+
+static void gen_free(const struct page_allocator *a, struct page *page,
+ int order)
+{
+ __free_pages(page, order);
+}
+
+const struct page_allocator page_allocator = {
+ .allocate = gen_alloc,
+ .free = gen_free,
+ .destructor = null_destructor,
+ .name = "page_allocator"
+};
+
+/*
+ * Allocate vmalloc memory
+ */
+static struct page *vmalloc_alloc(const struct page_allocator *a, int order,
+ gfp_t flags, int node)
+{
+ void *addr;
+
+#ifdef CONFIG_NUMA
+ if (node >= 0)
+ addr = vmalloc_node(PAGE_SIZE << order, node);
+ else
+#endif
+ addr = vmalloc(PAGE_SIZE << order);
+
+ if (!addr)
+ return NULL;
+
+ return virt_to_page(addr);
+}
+
+static void vmalloc_free(const struct page_allocator *a,
+ struct page *page, int order)
+{
+ vfree(page_address(page));
+}
+
+const struct page_allocator vmalloc_allocator = {
+ .allocate = vmalloc_alloc,
+ .free = vmalloc_free,
+ .destructor = null_destructor,
+ .name = "vmalloc_allocator"
+};
+
+/*
+ * Functions to deal with dynamically generating allocators.
+ */
+void derived_destructor(struct page_allocator *a)
+{
+ struct derived_page_allocator *d = (void *)a;
+
+ d->base->destructor((struct page_allocator *)d->base);
+ kfree(a);
+}
+
+/*
+ * Create a new allocator based on another one. All functionality
+ * is duplicated except for the destructor. The caller needs to do
+ * modifications to some of the methods in the copy.
+ */
+struct derived_page_allocator *derive_page_allocator
+ (const struct page_allocator *base)
+{
+ struct derived_page_allocator *d =
+ kmalloc(sizeof(struct derived_page_allocator), GFP_KERNEL);
+
+ d->a.allocate = base->allocate;
+ d->a.free = base->free;
+ d->a.name = "derived";
+ d->a.destructor = derived_destructor;
+ d->base = base;
+ return d;
+};
+
+/*
+ * RCU allocator generator
+ *
+ * We overload struct page once more for the RCU data
+ * lru = RCU head
+ * index = order
+ * mapping = base allocator
+ */
+static void page_free_rcu(struct rcu_head *h)
+{
+ struct page *page;
+ struct page_allocator *base;
+ int order;
+
+ page = container_of((struct list_head *)h, struct page, lru);
+ base = (void *)page->mapping;
+ order = page->index;
+ page->index = 0;
+ page->mapping = NULL;
+ base->free(base, page, order);
+}
+
+/*
+ * Use page struct as intermediate rcu storage.
+ */
+static void rcu_free(const struct page_allocator *a, struct page *page,
+ int order)
+{
+ struct rcu_head *head = (void *)&page->lru;
+ struct derived_page_allocator *d = (void *)a;
+
+ page->index = order;
+ page->mapping = (void *)d->base;
+ call_rcu(head, page_free_rcu);
+}
+
+struct page_allocator *rcuify_page_allocator
+ (const struct page_allocator *base)
+{
+ struct derived_page_allocator *d = derive_page_allocator(base);
+
+ d->a.free = rcu_free;
+ d->a.name = "rcu";
+ return &d->a;
+};
+
+/*
+ * Restrict memory allocations to DMA
+ */
+static struct page *dma_alloc(const struct page_allocator *a, int order,
+ gfp_t flags, int node)
+{
+ struct derived_page_allocator *d = (void *)a;
+
+ return d->base->allocate(d->base, order, flags | __GFP_DMA, node);
+}
+
+struct page_allocator *dmaify_page_allocator
+ (const struct page_allocator *base)
+{
+ struct derived_page_allocator *d = derive_page_allocator(base);
+
+ d->a.allocate = dma_alloc;
+ d->a.name = "dma";
+ return &d->a;
+}
+
+/*
+ * Allocator that runs a constructor and a destructor over fixed size
+ * chunks in each page.
+ */
+struct deconstructor {
+ struct page_allocator a;
+ const struct page_allocator *base;
+ unsigned int size;
+ void *private;
+ void (*ctor)(void *, void *, unsigned long);
+ void (*dtor)(void *, void *, unsigned long);
+};
+
+static struct page *ctor_alloc(const struct page_allocator *a,
+ int order, gfp_t flags, int node)
+{
+ struct deconstructor *d = (void *)a;
+ struct page *page = d->base->allocate(d->base, order, flags, node);
+
+ if (page && d->ctor) {
+ void *start = page_address(page);
+ void *end = start + (PAGE_SIZE << order);
+ void *p;
+
+ for (p = start; p <= end - d->size; p += d->size)
+ /* Make the flags of the constructor compatible with SLAB use */
+ d->ctor(p, d->private, 1 + !(flags & __GFP_WAIT));
+ }
+ return page;
+}
+
+static void dtor_free(const struct page_allocator *a,
+ struct page *page, int order)
+{
+ struct deconstructor *d = (void *)a;
+
+ if (d->dtor) {
+ void *start = page_address(page);
+ void *end = start + (PAGE_SIZE << order);
+ void *p;
+
+ for (p = start; p <= end - d->size; p += d->size)
+ d->dtor(p, d->private, 0);
+ }
+ d->base->free(d->base, page, order);
+}
+
+struct page_allocator *ctor_and_dtor_for_page_allocator
+ (const struct page_allocator *base,
+ unsigned long size, void *private,
+ void (*ctor)(void *, void *, unsigned long),
+ void (*dtor)(void *, void *, unsigned long))
+{
+ struct deconstructor *d =
+ kmalloc(sizeof(struct deconstructor), GFP_KERNEL);
+
+ d->a.allocate = ctor ? ctor_alloc : base->allocate;
+ d->a.free = dtor ? dtor_free : base->free;
+ d->a.destructor = derived_destructor;
+ d->a.name = "ctor_dtor";
+ d->base = base;
+ d->ctor = ctor;
+ d->dtor = dtor;
+ d->size = size;
+ d->private = private;
+ return &d->a;
+}
+
+
+/*
+ * Slab allocators
+ */
+
+/* Tools to make your own */
+void null_slab_allocator_destructor(struct slab_allocator *a)
+{
+}
+
+void derived_slab_destructor(struct slab_allocator *a) {
+ struct derived_slab_allocator *d = (void *)a;
+
+ d->base->destructor((struct slab_allocator *)d->base);
+ kfree(d);
+}
+
+struct derived_slab_allocator *derive_slab_allocator(
+ const struct slab_allocator *base) {
+ struct derived_slab_allocator *d =
+ kmalloc(sizeof(struct derived_slab_allocator), GFP_KERNEL);
+
+ memcpy(&d->a, base, sizeof(struct slab_allocator));
+ d->base = base;
+ d->a.name = "derived";
+ d->a.destructor = derived_slab_destructor;
+ return d;
+}
+
+/* Generate a new slab allocators based on old ones */
+
+/*
+ * First a generic method to rcuify any slab. We add the rcuhead
+ * to the end of the object and use that on free.
+ */
+
+struct rcuified_slab {
+ struct slab_allocator *a;
+ const struct slab_allocator *base;
+ unsigned int rcu_offset;
+};
+
+/*
+ * Information that is added to the end of the slab
+ */
+struct slabr {
+ struct rcu_head r;
+ struct slab_cache *s;
+};
+
+struct slab_cache *rcuify_slab_create(const struct slab_allocator *sa,
+ const struct page_allocator *pa, int node,
+ const char *name, int size, int align, int order,
+ int objsize, int inuse, int offset)
+{
+ struct rcuified_slab *d = (void *)sa;
+
+ inuse = d->rcu_offset = ALIGN(inuse, sizeof(void *));
+ inuse += sizeof(struct slabr) + sizeof(void *);
+ while (inuse > size)
+ size += align;
+
+ return d->base->create(d->base, pa, node, name, size, align,
+ order, objsize, inuse, offset);
+}
+
+void rcu_slab_free(struct rcu_head *rcu)
+{
+ struct slabr *r = (void *) rcu;
+ struct slab_cache *s = r->s;
+ struct rcuified_slab *d = (void *)s->slab_alloc;
+ void *object = (void *)rcu - d->rcu_offset;
+
+ d->base->free(s, object);
+}
+
+void rcuify_slab_free(struct slab_cache *s, const void *object)
+{
+ struct rcuified_slab *r = (struct rcuified_slab *)(s->slab_alloc);
+
+ call_rcu((struct rcu_head *)(object + r->rcu_offset), rcu_slab_free);
+}
+
+struct slab_allocator *rcuify_slab_allocator
+ (const struct slab_allocator *base)
+{
+ struct derived_slab_allocator *d = derive_slab_allocator(base);
+
+ d->a.create = rcuify_slab_create;
+ d->a.free = rcuify_slab_free;
+ d->a.name = "rcu";
+ return &d->a;
+}
+
+
+/*
+ * dmaification of slab allocation. This is done by dmaifying the
+ * underlying page allocator.
+ */
+
+struct slab_cache *dmaify_slab_create(const struct slab_allocator *s,
+ const struct page_allocator *a, int node, const char *name,
+ int size, int align, int order, int objsize, int inuse,
+ int offset)
+{
+ struct derived_slab_allocator *d = (void *)s;
+
+ return d->base->create(s, dmaify_page_allocator(a), node, name,
+ size, align, order, objsize, inuse, offset);
+}
+
+struct slab_allocator *dmaify_slab_allocator
+ (const struct slab_allocator *base)
+{
+ struct derived_slab_allocator *d = derive_slab_allocator(base);
+
+ d->a.create = dmaify_slab_create;
+ d->a.name = "dma";
+ return &d->a;
+}
+
+#if 0
+/*
+ * Generic Redzoning support works by adding a word at the end
+ * of an object and filling up to the beginning of the next
+ * object (depending on the gap between objects).
+ *
+ * The fillings are different on free and allocate so that
+ * we can check on each if the object or the guard area was
+ * corrupted.
+ */
+struct slab_allocator *redzone_slab_allocator(struct slab_allocator *base)
+{
+ struct derived_slab_allocator *d = derive_slab_allocator(base);
+
+ d->a.create = redzone_create;
+ d->a.alloc = redzone_alloc;
+ d->a.free = redzone_free;
+ return &d->a;
+}
+
+/*
+ * Object tracking adds two words to the end of the slab that track
+ * the latest allocator and the last freeer of the object.
+ */
+struct slab_allocator *track_slab_allocator(struct slab_allocator *base)
+{
+ return NULL;
+}
+
+/*
+ * Write to the syslog whenever something happens on a slab.
+ *
+ * CAUTION: Do not try this on busy system slabs.
+ */
+struct slab_allocator *trace_slab_allocator(struct slab_allocator *base)
+{
+ return NULL;
+}
+#endif
+
+
Separate the allocpercpu functions from the slab allocator
The allocpercpu functions __alloc_percpu() and free_percpu() make
heavy use of the slab allocator. However, they are conceptually a
different allocator that can be used independently of the
slab. Moving them out also simplifies SLOB.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.18-rc4/mm/allocpercpu.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/allocpercpu.c 2006-08-13 23:26:17.588539549 -0700
@@ -0,0 +1,76 @@
+/*
+ * linux/mm/allocpercpu.c
+ *
+ * Separated out from slab.c August 11, 2006 Christoph Lameter <[email protected]>
+ */
+#include <linux/mm.h>
+#include <linux/module.h>
+
+/**
+ * __alloc_percpu - allocate one copy of the object for every present
+ * cpu in the system, zeroing them.
+ * Objects should be dereferenced using the per_cpu_ptr macro only.
+ *
+ * @size: how many bytes of memory are required.
+ */
+void *__alloc_percpu(size_t size)
+{
+ int i;
+ struct percpu_data *pdata = kmalloc(sizeof(*pdata), GFP_KERNEL);
+
+ if (!pdata)
+ return NULL;
+
+ /*
+ * Cannot use for_each_online_cpu since a cpu may come online
+ * and we have no way of figuring out how to fix the array
+ * that we have allocated then....
+ */
+ for_each_possible_cpu(i) {
+ int node = cpu_to_node(i);
+
+ if (node_online(node))
+ pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL, node);
+ else
+ pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
+
+ if (!pdata->ptrs[i])
+ goto unwind_oom;
+ memset(pdata->ptrs[i], 0, size);
+ }
+
+ /* Catch derefs w/o wrappers */
+ return (void *)(~(unsigned long)pdata);
+
+unwind_oom:
+ while (--i >= 0) {
+ if (!cpu_possible(i))
+ continue;
+ kfree(pdata->ptrs[i]);
+ }
+ kfree(pdata);
+ return NULL;
+}
+EXPORT_SYMBOL(__alloc_percpu);
+
+/**
+ * free_percpu - free previously allocated percpu memory
+ * @objp: pointer returned by alloc_percpu.
+ *
+ * Don't free memory not originally allocated by alloc_percpu()
+ * The complemented objp is to check for that.
+ */
+void free_percpu(const void *objp)
+{
+ int i;
+ struct percpu_data *p = (struct percpu_data *)(~(unsigned long)objp);
+
+ /*
+ * We allocate for all cpus so we cannot use for online cpu here.
+ */
+ for_each_possible_cpu(i)
+ kfree(p->ptrs[i]);
+ kfree(p);
+}
+EXPORT_SYMBOL(free_percpu);
+
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-13 23:26:17.597328068 -0700
@@ -13,6 +13,7 @@ obj-y := bootmem.o filemap.o mempool.o
prio_tree.o util.o mmzone.o vmstat.o $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
Index: linux-2.6.18-rc4/mm/slab.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/slab.c 2006-08-13 23:26:17.521160905 -0700
+++ linux-2.6.18-rc4/mm/slab.c 2006-08-13 23:26:17.600257574 -0700
@@ -3390,55 +3390,6 @@ void *__kmalloc_track_caller(size_t size
EXPORT_SYMBOL(__kmalloc_track_caller);
#endif
-#ifdef CONFIG_SMP
-/**
- * __alloc_percpu - allocate one copy of the object for every present
- * cpu in the system, zeroing them.
- * Objects should be dereferenced using the per_cpu_ptr macro only.
- *
- * @size: how many bytes of memory are required.
- */
-void *__alloc_percpu(size_t size)
-{
- int i;
- struct percpu_data *pdata = kmalloc(sizeof(*pdata), GFP_KERNEL);
-
- if (!pdata)
- return NULL;
-
- /*
- * Cannot use for_each_online_cpu since a cpu may come online
- * and we have no way of figuring out how to fix the array
- * that we have allocated then....
- */
- for_each_possible_cpu(i) {
- int node = cpu_to_node(i);
-
- if (node_online(node))
- pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL, node);
- else
- pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
-
- if (!pdata->ptrs[i])
- goto unwind_oom;
- memset(pdata->ptrs[i], 0, size);
- }
-
- /* Catch derefs w/o wrappers */
- return (void *)(~(unsigned long)pdata);
-
-unwind_oom:
- while (--i >= 0) {
- if (!cpu_possible(i))
- continue;
- kfree(pdata->ptrs[i]);
- }
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL(__alloc_percpu);
-#endif
-
/**
* kmem_cache_free - Deallocate an object
* @cachep: The cache the allocation was from.
@@ -3484,29 +3435,6 @@ void kfree(const void *objp)
}
EXPORT_SYMBOL(kfree);
-#ifdef CONFIG_SMP
-/**
- * free_percpu - free previously allocated percpu memory
- * @objp: pointer returned by alloc_percpu.
- *
- * Don't free memory not originally allocated by alloc_percpu()
- * The complemented objp is to check for that.
- */
-void free_percpu(const void *objp)
-{
- int i;
- struct percpu_data *p = (struct percpu_data *)(~(unsigned long)objp);
-
- /*
- * We allocate for all cpus so we cannot use for online cpu here.
- */
- for_each_possible_cpu(i)
- kfree(p->ptrs[i]);
- kfree(p);
-}
-EXPORT_SYMBOL(free_percpu);
-#endif
-
unsigned int kmem_cache_size(struct kmem_cache *cachep)
{
return obj_size(cachep);
Index: linux-2.6.18-rc4/mm/slob.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/slob.c 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/mm/slob.c 2006-08-13 23:26:17.611975599 -0700
@@ -343,48 +343,3 @@ void kmem_cache_init(void)
atomic_t slab_reclaim_pages = ATOMIC_INIT(0);
EXPORT_SYMBOL(slab_reclaim_pages);
-#ifdef CONFIG_SMP
-
-void *__alloc_percpu(size_t size)
-{
- int i;
- struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);
-
- if (!pdata)
- return NULL;
-
- for_each_possible_cpu(i) {
- pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
- if (!pdata->ptrs[i])
- goto unwind_oom;
- memset(pdata->ptrs[i], 0, size);
- }
-
- /* Catch derefs w/o wrappers */
- return (void *) (~(unsigned long) pdata);
-
-unwind_oom:
- while (--i >= 0) {
- if (!cpu_possible(i))
- continue;
- kfree(pdata->ptrs[i]);
- }
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL(__alloc_percpu);
-
-void
-free_percpu(const void *objp)
-{
- int i;
- struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
-
- for_each_possible_cpu(i)
- kfree(p->ptrs[i]);
-
- kfree(p);
-}
-EXPORT_SYMBOL(free_percpu);
-
-#endif
Kmalloc layer for the modular slab
New ideas:
1. Generate slabs on the fly (May have issues with GFP_KERNEL).
2. use fls to calculate array position.
Signed-off-by: Christoph Lameter <[email protected]>
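[Editorial illustration of idea 2; the helper name is made up, the patch itself
open-codes fls(size - 1) in kmalloc_slab() below. fls() returns the position of
the highest set bit, so fls(size - 1) is the smallest x with 2^x >= size, which
doubles as the index into the power-of-two cache array.]

#include <linux/bitops.h>	/* fls() */

/* Editorial sketch: map a request size to its power-of-two cache index. */
static inline int kmalloc_index_sketch(size_t size)
{
	return fls(size - 1);
}

/*
 * kmalloc_index_sketch(32)   == 5   -> 32-byte cache
 * kmalloc_index_sketch(33)   == 6   -> 64-byte cache
 * kmalloc_index_sketch(4096) == 12  -> 4096-byte cache
 */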
Index: linux-2.6.18-rc4/include/linux/kmalloc.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/include/linux/kmalloc.h 2006-08-15 18:37:47.757837824 -0700
@@ -0,0 +1,72 @@
+#ifndef _LINUX_KMALLOC_H
+#define _LINUX_KMALLOC_H
+/*
+ * In kernel dynamic memory allocator.
+ *
+ * (C) 2006 Silicon Graphics, Inc,
+ * Christoph Lameter <[email protected]>
+ */
+
+#include <linux/allocator.h>
+#include <linux/config.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_NUMA_SLAB
+#define kmalloc_allocator numa_slab_allocator
+#else
+#define kmalloc_allocator slabifier_allocator
+#endif
+
+/*
+ * We keep the general caches in an array of slab caches that are used for
+ * 2^x bytes of allocations. For each size we generate a DMA and a
+ * non DMA cache (Do not get confused: DMA simply means memory for
+ * legacy I/O. The regular caches can be used for devices that can
+ * do DMA to all of memory).
+ */
+extern struct slab_cache *kmalloc_caches[2][BITS_PER_LONG];
+
+struct slab_cache *kmalloc_generate_slab(int index, gfp_t flags);
+
+/*
+ * Find (and possibly create) the slab cache for a given combination of
+ * allocation flags and size.
+ *
+ * Believe it or not but this should result in a simple load
+ * from a global memory address for a constant flag and size,
+ * a comparison and two function calls with the same parameters.
+ */
+static inline struct slab_cache *kmalloc_slab(size_t size, gfp_t flags)
+{
+ struct slab_cache *s;
+ int index = fls(size - 1);
+
+ s = kmalloc_caches[!!(flags & __GFP_DMA)][index];
+ if (!s)
+ s = kmalloc_generate_slab(index, flags);
+ return s;
+}
+
+static inline void *kmalloc(size_t size, gfp_t flags)
+{
+ return kmalloc_allocator.alloc(kmalloc_slab(size, flags), flags);
+}
+
+static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+{
+ return kmalloc_allocator.alloc_node(kmalloc_slab(size, flags),
+ flags, node);
+}
+
+static inline void kfree(const void *x)
+{
+ kmalloc_allocator.free(NULL, x);
+}
+
+/* Allocate and zero the specified number of bytes */
+extern void *kzalloc(size_t, gfp_t);
+
+/* Figure out what size the chunk is */
+extern size_t ksize(const void *);
+
+#endif /* _LINUX_KMALLOC_H */
Index: linux-2.6.18-rc4/mm/kmalloc.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/kmalloc.c 2006-08-15 18:35:04.744461494 -0700
@@ -0,0 +1,114 @@
+/*
+ * Implement basic generic caches for memory allocation.
+ *
+ * This is an on demand implementation. General slabs
+ * are only created if there is an actual need for allocations
+ * of that size.
+ *
+ * (C) 2006 Silicon Graphics. Inc. Christoph Lameter <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/allocator.h>
+#include <linux/module.h>
+
+// #define KMALLOC_DEBUG
+
+struct slab_cache *kmalloc_caches[2][BITS_PER_LONG] __cacheline_aligned;
+EXPORT_SYMBOL(kmalloc_caches);
+
+/* Smallest size of a cache in 2^x */
+#define LOW 5 /* 32 byte */
+
+#if BITS_PER_LONG == 64
+/* 64 bit platform. Maximum size is 2 Gigabyte */
+#define HIGH 31
+#else
+/* 32 bit. Take the customary max of 128k */
+#define HIGH 17
+#endif
+
+/*
+ * Given a slab size find the correct order to use.
+ * We only support powers of two so there is really
+ * no need for anything special. Objects will always
+ * fit exactly into the slabs.
+ */
+static int order(size_t size)
+{
+ if (size >= PAGE_SIZE)
+ /* One object per slab */
+ return fls(size -1) - PAGE_SHIFT + 1;
+
+ /* Multiple objects but let slab_create() figure out how much */
+ return 0;
+}
+
+static DEFINE_SPINLOCK(kmalloc_lock);
+
+/*
+ * Create a general slab
+ */
+struct slab_cache *kmalloc_generate_slab(int index, gfp_t flags)
+{
+ int dma = !!(flags & __GFP_DMA);
+ struct slab_cache *s;
+ long realsize;
+ const struct page_allocator *a = &page_allocator;
+
+#ifdef KMALLOC_DEBUG
+ printk(KERN_CRIT "kmalloc_generate_slab(%d, %x)\n", index, flags);
+#endif
+ if (index < LOW) {
+ s = kmalloc_caches[dma][LOW];
+ if (s) {
+ kmalloc_caches[dma][index] = s;
+ return s;
+ }
+ index = LOW;
+ }
+
+ if (index > HIGH) {
+ printk(KERN_CRIT "Kmalloc size(%ld) too large.\n",
+ 1L << index);
+ BUG();
+ }
+
+ realsize = 1 << index;
+
+ if (dma)
+ a = dmaify_page_allocator(a);
+
+ s = kmalloc_allocator.create(&kmalloc_allocator, a, -1,
+ dma ? "kmalloc-DMA" : "kmalloc", realsize,
+ L1_CACHE_BYTES, order(realsize), realsize, realsize, 0);
+
+ if (!s)
+ panic("Creation of kmalloc slab dma=%d index=%d failed.\n",
+ dma, index);
+ spin_lock_irq(&kmalloc_lock);
+ if (kmalloc_caches[dma][index])
+ kmalloc_allocator.destroy(s);
+ else
+ kmalloc_caches[dma][index] = s;
+ spin_unlock_irq(&kmalloc_lock);
+ return s;
+}
+EXPORT_SYMBOL(kmalloc_generate_slab);
+
+void *kzalloc(size_t size, gfp_t flags)
+{
+ void *x = kmalloc(size, flags);
+
+ if (x)
+ memset(x, 0, size);
+ return x;
+}
+EXPORT_SYMBOL(kzalloc);
+
+size_t ksize(const void *object)
+{
+ return kmalloc_allocator.object_size(NULL, object);
+};
+EXPORT_SYMBOL(ksize);
+
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-15 18:35:03.591212358 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 18:35:04.744461494 -0700
@@ -24,5 +24,6 @@ obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_MODULAR_SLAB) += allocator.o
+obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o
+
The slab emulation layer.
This provides a layer that implements the existing slab API. A
hook in slab.h redirects the include to the emulation layer.
kmem_cache_create() dynamically derives page allocators with the
features requested.
Signed-off-by: Christoph Lameter <[email protected]>
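[Editorial note on what the emulation buys callers: existing slab users compile
unchanged, since slab.h now resolves the old entry points to the slabulator
wrappers. A sketch of unchanged caller-side code; struct foo and foo_cachep are
illustrative only.]

#include <linux/slab.h>
#include <linux/errno.h>

struct foo { int x; };			/* illustrative object */
static kmem_cache_t *foo_cachep;	/* illustrative cache */

static int foo_cache_init(void)
{
	/* same signature as before; the slabulator derives the page
	 * allocator (DMA, RCU, reclaim accounting) from the flags */
	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
				       SLAB_HWCACHE_ALIGN, NULL, NULL);
	return foo_cachep ? 0 : -ENOMEM;
}

static void foo_use(void)
{
	struct foo *p = kmem_cache_alloc(foo_cachep, GFP_KERNEL);

	if (p)
		kmem_cache_free(foo_cachep, p);
}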
Index: linux-2.6.18-rc4/mm/slabulator.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/slabulator.c 2006-08-15 18:36:39.655595826 -0700
@@ -0,0 +1,208 @@
+/*
+ * Slabulator: Emulate the old Slab API and provide various debugging help.
+ *
+ * (C) 2006 Silicon Graphics, Inc. Christoph Lameter <[email protected]>
+ *
+ * Important missing things: (breaking stuff!)
+ * 1. Support 2 second cache reaper calls (no pcp draining and VM stat
+ * update currently!).
+ */
+#include <linux/kmalloc.h>
+#include <linux/module.h>
+#include <linux/allocator.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+
+#define SLAB_MAX_ORDER 4
+
+// #define SLABMULATOR_DEBUG
+// #define SLABMULATOR_MERGE
+
+#ifndef ARCH_SLAB_MINALIGN
+#define ARCH_SLAB_MINALIGN sizeof(void *)
+#endif
+
+static int calculate_order(int size)
+{
+ int order;
+ int rem;
+
+ for(order = max(0, fls(size - 1) - PAGE_SHIFT);
+ order < MAX_ORDER; order++) {
+ unsigned long slab_size = PAGE_SIZE << order;
+
+ if (slab_size < size)
+ continue;
+
+ rem = slab_size % size;
+
+ if (rem * 8 <= PAGE_SIZE << order)
+ break;
+
+ }
+ if (order >= MAX_ORDER)
+ return -E2BIG;
+ return order;
+}
+
+/*
+ * Track reclaimable pages using a special allocator
+ */
+static struct derived_page_allocator *reclaim_accounting_allocator;
+
+atomic_t slab_reclaim_pages = ATOMIC_INIT(0);
+EXPORT_SYMBOL(slab_reclaim_pages);
+
+static struct page *rac_alloc(const struct page_allocator *a, int order,
+ gfp_t flags, int node)
+{
+ struct derived_page_allocator *d = (void *)a;
+
+ atomic_add(1 << order, &slab_reclaim_pages);
+ return d->base->allocate(d->base, order, flags, node);
+}
+
+static void rac_free(const struct page_allocator *a, struct page *page,
+ int order)
+{
+ struct derived_page_allocator *d = (void *)a;
+
+ atomic_sub(1 << order, &slab_reclaim_pages);
+ d->base->free(d->base, page, order);
+}
+
+/*
+ * kmem_cache_init() is not particularly useful here. We can actually operate
+ * slabs any time after the page allocator is up (all necessary structures
+ * are created on demand) and could shift the initialization of the reclaim
+ * accounting allocator somewhere else.
+ * But for tradition's sake we provide the same mechanism as the slab here.
+ */
+int slabulator_up = 0;
+
+int slab_is_available(void) {
+ return slabulator_up;
+}
+
+void kmem_cache_init(void)
+{
+ reclaim_accounting_allocator = derive_page_allocator(&page_allocator);
+ reclaim_accounting_allocator->a.allocate = rac_alloc;
+ reclaim_accounting_allocator->a.free = rac_free;
+ slabulator_up = 1;
+}
+
+struct slab_cache *kmem_cache_create(const char *name, size_t size,
+ size_t align, unsigned long flags,
+ void (*ctor)(void *, struct slab_cache *, unsigned long),
+ void (*dtor)(void *, struct slab_cache *, unsigned long))
+{
+ const struct page_allocator *a = &page_allocator;
+ unsigned int offset = 0;
+ unsigned int realsize;
+ int order;
+ int inuse;
+ struct slab_cache *s;
+
+ align = max(ARCH_SLAB_MINALIGN, ALIGN(align, sizeof(void *)));
+
+ if (flags & (SLAB_MUST_HWCACHE_ALIGN|SLAB_HWCACHE_ALIGN))
+ align = L1_CACHE_BYTES;
+
+ inuse = size;
+ realsize = ALIGN(size, align);
+
+ /* Pick the right allocator for our purposes */
+ if (flags & SLAB_RECLAIM_ACCOUNT)
+ a = &reclaim_accounting_allocator->a;
+
+ if (flags & SLAB_CACHE_DMA)
+ a = dmaify_page_allocator(a);
+
+ if (flags & SLAB_DESTROY_BY_RCU)
+ a = rcuify_page_allocator(a);
+
+ if ((flags & SLAB_DESTROY_BY_RCU) || ctor || dtor) {
+ /*
+ * For RCU processing and constructors / destructors:
+ * The object must remain intact even if it is free.
+ * The free pointer would hurt us there.
+ * Relocate the free object pointer out of
+ * the space used by the object.
+ */
+ offset = realsize - sizeof(void *);
+ if (offset < size) {
+ /*
+ * Would overlap the object. We need to waste some
+ * more space to make the object RCU safe
+ */
+ offset = realsize;
+ realsize += align;
+ }
+ inuse = realsize;
+ }
+
+ order = calculate_order(realsize);
+
+ if (order < 0)
+ goto error;
+
+#ifdef SLABMULATOR_MERGE
+ /*
+ * This works but is this really something we want?
+ */
+ if (((realsize & (realsize - 1))==0) && !ctor && !dtor &&
+ !(flags & (SLAB_DESTROY_BY_RCU|SLAB_RECLAIM_ACCOUNT))) {
+ a->destructor((struct page_allocator *)a);
+
+ printk(KERN_CRIT "Merging slab %s size %d to kmalloc\n",
+ name, realsize);
+ return slabulator_allocator.dup(kmalloc_slab(realsize, flags));
+ }
+#endif
+
+ s = slabulator_allocator.create(&slabulator_allocator, a, -1, name,
+ realsize, align, order, size, inuse, offset);
+#ifdef SLABMULATOR_DEBUG
+ printk(KERN_CRIT "Creating slab %s size=%ld realsize=%d "
+ "order=%d offset=%d flags=%lx\n",
+ name, size, realsize, order, offset, flags);
+#endif
+ if (!s)
+ goto error;
+
+ /*
+ * Now deal with constructors and destructors. We need to know the
+ * slab_cache address in order to be able to pass the slab_cache
+ * address down the chain.
+ */
+ if (ctor || dtor)
+ s->page_alloc =
+ ctor_and_dtor_for_page_allocator(s->page_alloc,
+ realsize, s,
+ (void *)ctor, (void *)dtor);
+
+ return s;
+
+error:
+ a->destructor((struct page_allocator *)a);
+ if (flags & SLAB_PANIC)
+ panic("Cannot create slab %s size=%ld realsize=%d "
+ "order=%d offset=%d flags=%lx\n",
+ name, size, realsize, order, offset, flags);
+
+
+ return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+void *kmem_cache_zalloc(struct slab_cache *s, gfp_t flags)
+{
+ void *x;
+
+ x = kmem_cache_alloc(s, flags);
+ if (x)
+ memset(x, 0, s->objsize);
+ return x;
+}
+
Index: linux-2.6.18-rc4/include/linux/slabulator.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/include/linux/slabulator.h 2006-08-15 18:37:11.758106635 -0700
@@ -0,0 +1,128 @@
+/*
+ * Slabulator: Emulate the existing Slab API.
+ *
+ * (C) 2006 Silicon Graphics, Inc.
+ * Christoph Lameter <[email protected]>
+ *
+ * The use of this API necessarily restricts the user to a single type
+ * of Slab allocator and the functions available.
+ */
+
+#include <linux/allocator.h>
+#include <linux/kmalloc.h>
+
+#define kmem_cache_t struct slab_cache
+#define kmem_cache slab_cache
+
+#define ____kmalloc kmalloc
+#define __kmalloc kmalloc
+
+#ifdef CONFIG_NUMA_SLAB
+#define slabulator_allocator numa_slab_allocator
+#else
+#define slabulator_allocator slabifier_allocator
+#endif
+
+/* We really should be getting rid of these */
+#define SLAB_KERNEL GFP_KERNEL
+#define SLAB_ATOMIC GFP_ATOMIC
+#define SLAB_NOFS GFP_NOFS
+
+/* No debug features for now */
+#define SLAB_HWCACHE_ALIGN 0x00002000UL /* align objs on a h/w cache lines */
+#define SLAB_CACHE_DMA 0x00004000UL /* use GFP_DMA memory */
+#define SLAB_MUST_HWCACHE_ALIGN 0x00008000UL /* force alignment */
+#define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* track pages allocated to indicate
+ what is reclaimable later*/
+#define SLAB_PANIC 0x00040000UL /* panic if kmem_cache_create() fails */
+#define SLAB_DESTROY_BY_RCU 0x00080000UL /* defer freeing pages to RCU */
+#define SLAB_MEM_SPREAD 0x00100000UL /* Spread some memory over cpuset */
+
+/* flags passed to a constructor func */
+#define SLAB_CTOR_CONSTRUCTOR 0x001UL /* if not set, then deconstructor */
+#define SLAB_CTOR_ATOMIC 0x002UL /* tell constructor it can't sleep */
+#define SLAB_CTOR_VERIFY 0x004UL /* tell constructor it's a verify call */
+
+
+/*
+ * slab_allocators are always available on demand after the page allocator
+ * has been brought up but for completenesses sake:
+ */
+extern int slab_is_available(void);
+extern void kmem_cache_init(void);
+
+/* System-wide caches (Should these really be here?) */
+extern struct slab_cache *vm_area_cachep;
+extern struct slab_cache *names_cachep;
+extern struct slab_cache *files_cachep;
+extern struct slab_cache *filp_cachep;
+extern struct slab_cache *fs_cachep;
+extern struct slab_cache *sighand_cachep;
+extern struct slab_cache *bio_cachep;
+
+/* And should this really be placed with the reclaim code? */
+extern atomic_t slab_reclaim_pages;
+
+
+extern struct slab_cache *kmem_cache_create(const char *name, size_t size, size_t align,
+ unsigned long flags,
+ void (*ctor)(void *, struct slab_cache *, unsigned long),
+ void (*dtor)(void *, struct slab_cache *, unsigned long));
+
+static inline unsigned int kmem_cache_size(struct slab_cache *s)
+{
+ return s->objsize;
+}
+
+static inline const char *kmem_cache_name(struct slab_cache *s)
+{
+ return s->name;
+}
+
+static inline void *kmem_cache_alloc(struct slab_cache *s, gfp_t flags)
+{
+ return slabulator_allocator.alloc(s, flags);
+}
+
+static inline void *kmem_cache_alloc_node(struct slab_cache *s,
+ gfp_t flags, int node)
+{
+ return slabulator_allocator.alloc_node(s, flags, node);
+}
+
+extern void *kmem_cache_zalloc(struct slab_cache *s, gfp_t flags);
+
+static inline void kmem_cache_free(struct slab_cache *s, const void *x)
+{
+ slabulator_allocator.free(s, x);
+}
+
+static inline int kmem_ptr_validate(struct slab_cache *s, void *x)
+{
+ return slabulator_allocator.valid_pointer(s, x);
+}
+
+static inline int kmem_cache_destroy(struct slab_cache *s)
+{
+ return slabulator_allocator.destroy(s);
+}
+
+static inline int kmem_cache_shrink(struct slab_cache *s)
+{
+ return slabulator_allocator.shrink(s, NULL);
+}
+
+/**
+ * kcalloc - allocate memory for an array. The memory is set to zero.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate.
+ */
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+{
+ if (n != 0 && size > ULONG_MAX / n)
+ return NULL;
+ return kzalloc(n * size, flags);
+}
+
+
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-15 18:35:04.744461494 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 18:36:39.656572329 -0700
@@ -24,6 +24,5 @@ obj-$(CONFIG_SLAB) += slab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o
-
+obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o slabulator.o
Index: linux-2.6.18-rc4/init/Kconfig
===================================================================
--- linux-2.6.18-rc4.orig/init/Kconfig 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/init/Kconfig 2006-08-15 18:36:39.657548831 -0700
@@ -403,6 +403,16 @@ config SLAB
SLOB is more space efficient but does not scale well and is
more susceptible to fragmentation.
+config MODULAR_SLAB
+ default n
+ bool "Use the modular slab allocator frameworkr"
+ depends on EXPERIMENTAL
+ help
+ The modular slab allocator framework allows the flexible use
+ of different slab allocators and page allocators for memory
+ allocation. This will completely replace the existing
+ slab allocator. Beware this is experimental code.
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
@@ -424,7 +434,7 @@ config BASE_SMALL
default 1 if !BASE_FULL
config SLOB
- default !SLAB
+ default !SLAB && !MODULAR_SLAB
bool
menu "Loadable module support"
Index: linux-2.6.18-rc4/include/linux/slab.h
===================================================================
--- linux-2.6.18-rc4.orig/include/linux/slab.h 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/include/linux/slab.h 2006-08-15 18:36:39.657548831 -0700
@@ -9,6 +9,10 @@
#if defined(__KERNEL__)
+#ifdef CONFIG_MODULAR_SLAB
+#include <linux/slabulator.h>
+#else
+
typedef struct kmem_cache kmem_cache_t;
#include <linux/gfp.h>
@@ -265,6 +269,8 @@ extern kmem_cache_t *bio_cachep;
extern atomic_t slab_reclaim_pages;
+#endif /* CONFIG_SLABULATOR */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_SLAB_H */
The page slab is a specialized slab allocator that can only handle
page-order-sized objects. It directly uses the page allocator to
track the objects and can therefore avoid the overhead of the
slabifier.
[Have not had time to test this one yet. Compiles... and
it's the simplest slab allocator of them all.]
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.18-rc4/mm/page_slab.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/page_slab.c 2006-08-15 18:57:03.553625883 -0700
@@ -0,0 +1,145 @@
+/*
+ * Page Slab implementation
+ *
+ * This is a special slab allocator that uses the page allocator to manage
+ * the objects. It is restricted to page-order-sized allocations.
+ *
+ * (C) 2006 Silicon Graphics Inc.
+ * Christoph Lameter <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/allocator.h>
+#include <linux/module.h>
+
+/*
+ * Allocator to use for the metadata
+ */
+#define baseslab slabifier_allocator
+
+static struct slab_cache *page_slab_cache;
+
+struct page_slab {
+ struct slab_cache sc;
+ spinlock_t lock;
+ atomic_t refcount;
+ atomic_t pages;
+};
+
+static struct slab_cache *page_slab_create
+ (const struct slab_allocator *slab_alloc,
+ const struct page_allocator *page_alloc, int node,
+ const char *name, int size, int align, int order,
+ int objsize, int inuse, int offset)
+{
+ struct page_slab *n;
+
+ if (!page_slab_cache) {
+ page_slab_cache = baseslab.create(&baseslab, page_alloc, node,
+ "page_slab_cache",
+ ALIGN(sizeof(struct page_slab), L1_CACHE_BYTES),
+ L1_CACHE_BYTES, 0, sizeof(struct page_slab),
+ sizeof(struct page_slab), 0);
+ if (!page_slab_cache)
+ panic("Cannot create page_slab cache\n");
+ }
+
+ n = baseslab.alloc(page_slab_cache, GFP_KERNEL);
+ if (!n)
+ return NULL;
+
+ memset(n, 0, sizeof(struct page_slab));
+ slab_allocator_fill(&n->sc, slab_alloc, page_alloc, node, name, size,
+ align, order, objsize, inuse, offset);
+ spin_lock_init(&n->lock);
+ atomic_set(&n->refcount, 1);
+ return &n->sc;
+}
+
+static void *page_slab_alloc_node(struct slab_cache *sc, gfp_t flags, int node)
+{
+ struct page_slab *n = (void *)sc;
+ struct page * page;
+
+	page = sc->page_alloc->allocate(sc->page_alloc, sc->order,
+					flags, node);
+	if (!page)
+		return NULL;
+	atomic_inc(&n->pages);
+	return page_address(page);
+}
+
+static void *page_slab_alloc(struct slab_cache *sc, gfp_t flags)
+{
+ return page_slab_alloc_node(sc, flags, -1);
+}
+
+static void page_slab_free(struct slab_cache *sc, const void *object)
+{
+ struct page_slab *p = (void *)sc;
+
+ atomic_dec(&p->pages);
+ return __free_pages(virt_to_page(object), sc ? sc->order : 0);
+}
+
+static int page_slab_destroy(struct slab_cache *sc)
+{
+ struct page_slab *p = (void *)sc;
+
+ if (!atomic_dec_and_test(&p->refcount))
+ return 0;
+
+ baseslab.free(page_slab_cache, sc);
+ return 0;
+}
+
+static int page_slab_pointer_valid(struct slab_cache *sc, const void *object)
+{
+ struct page *page = virt_to_page(object);
+
+ return page_address(page) == object;
+}
+
+static unsigned long page_slab_object_size(struct slab_cache *sc,
+ const void *object)
+{
+ return PAGE_SIZE << sc->order;
+}
+
+static struct slab_cache *page_slab_dup(struct slab_cache *sc)
+{
+ struct page_slab *p = (void *)sc;
+
+ atomic_inc(&p->refcount);
+ return sc;
+}
+
+static int page_slab_shrink(struct slab_cache *sc,
+ int (*move_object)(struct slab_cache *, void *))
+{
+ return 0;
+}
+
+static unsigned long page_slab_objects(struct slab_cache *sc,
+ unsigned long *active, unsigned long *partial)
+{
+ struct page_slab *p = (void *)sc;
+ return atomic_read(&p->pages);
+}
+
+struct slab_allocator page_slab_allocator = {
+ .name = "PageSlab",
+ .create = page_slab_create,
+ .alloc = page_slab_alloc,
+ .alloc_node = page_slab_alloc_node,
+ .free = page_slab_free,
+ .valid_pointer = page_slab_pointer_valid,
+ .object_size = page_slab_object_size,
+ .objects = page_slab_objects,
+ .shrink = page_slab_shrink,
+ .dup = page_slab_dup,
+ .destroy = page_slab_destroy,
+ .destructor = null_slab_allocator_destructor
+};
+EXPORT_SYMBOL(page_slab_allocator);
+
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-15 18:57:03.516518810 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 18:57:03.554602385 -0700
@@ -26,4 +26,5 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o slabulator.o slabifier.o
obj-$(CONFIG_NUMA_SLAB) += numa_slab.o
+obj-$(CONFIG_PAGE_SLAB) += page_slab.o
Index: linux-2.6.18-rc4/init/Kconfig
===================================================================
--- linux-2.6.18-rc4.orig/init/Kconfig 2006-08-15 18:57:03.517495312 -0700
+++ linux-2.6.18-rc4/init/Kconfig 2006-08-15 18:57:03.554602385 -0700
@@ -418,6 +418,15 @@ config NUMA_SLAB
bool "NUMA Slab allocator (for lots of memory)"
depends on MODULAR_SLAB && NUMA
+config PAGE_SLAB
+ default n
+ depends on MODULAR_SLAB
+ bool "Page slab allocator"
+ help
 The page slab allocator is a form of slab allocator but it manages
 pages in page-order units directly through the page allocator. This allows
+ for efficient management of slab caches with huge objects.
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
Simple NUMA slab allocator
The NUMA slab allocator simply generates a slab per node
on demand and uses the slabifier to manage per node
slab caches.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.18-rc4/mm/numa_slab.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc4/mm/numa_slab.c 2006-08-15 18:44:01.142994108 -0700
@@ -0,0 +1,311 @@
+/*
+ * NUMA slab implementation
+ *
+ * (C) 2006 Silicon Graphics Inc.
+ * Christoph Lameter <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/allocator.h>
+#include <linux/kmalloc.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+
+#define NUMA_SLAB_DEBUG
+
+//#define TPRINTK printk
+#define TPRINTK(x, ...)
+
+/*
+ * Prelude : Numacontrol for allocators
+ */
+struct numactl {
+ struct page_allocator a;
+ const struct page_allocator *base;
+ int node;
+ gfp_t flags;
+};
+
+static struct page *numactl_alloc(const struct page_allocator *a,
+ int order, gfp_t flags, int node)
+{
+ struct numactl *d = (void *)a;
+
+ if (d->node >= 0)
+ node = d->node;
+
+ return d->base->allocate(d->base, order, flags | d->flags, node);
+}
+
+
+struct page_allocator *numactl_allocator(const struct page_allocator *base,
+ int node, gfp_t flags)
+{
+ struct numactl *d =
+ kmalloc(sizeof(struct numactl), GFP_KERNEL);
+
+ d->a.allocate = numactl_alloc;
+ d->a.destructor = derived_destructor;
+ d->a.name = "numa";
+ d->base = base;
+ d->node = node;
+ d->flags = flags;
+ return &d->a;
+}
+
+/*
+ * Allocator on which we base this functionality.
+ */
+#define base slabifier_allocator
+
+static struct slab_cache *numa_cache;
+
+struct numa_slab {
+ struct slab_cache sc;
+ spinlock_t lock;
+ atomic_t refcount;
+ struct slab_cache *node[MAX_NUMNODES];
+};
+
+static int __numa_slab_destroy(struct numa_slab *n)
+{
+ int node;
+
+ TPRINTK(KERN_CRIT "__numa_slab_destroy(%s)\n", n->sc.name);
+
+ for_each_node(node) {
+ base.free(NULL,n->node[node]);
+ n->node[node] = NULL;
+ }
+ return 0;
+}
+
+static struct slab_cache *bring_up_node(struct numa_slab *n, int node)
+{
+ struct slab_cache *s = n->node[node];
+ struct slab_cache *sc = &n->sc;
+
+ TPRINTK(KERN_CRIT "bring_up_node(%s, %d)\n", n->sc.name, node);
+ if (s)
+ return s;
+
+ spin_lock(&n->lock);
+ s = n->node[node];
+ if (s) {
+ spin_unlock(&n->lock);
+ return s;
+ }
+ s = n->node[node] = base.create(&base, sc->page_alloc, node,
+ sc->name, sc->size, sc->align, sc->order, sc->objsize, sc->inuse,
+ sc->offset);
+
+ spin_unlock(&n->lock);
+ return s;
+}
+
+static struct slab_cache *numa_slab_create
+ (const struct slab_allocator *slab_alloc,
+ const struct page_allocator *page_alloc, int node,
+ const char *name, int size, int align, int order,
+ int objsize, int inuse, int offset)
+{
+ struct numa_slab *n;
+
+ TPRINTK(KERN_CRIT "numa_slab_create(%s, %s, %d, %s, %d, %d, %d ,%d ,%d ,%d)\n",
+ slab_alloc->name, page_alloc->name, node, name, size,
+ align, order, objsize, inuse, offset);
+
+ if (!numa_cache) {
+ numa_cache = base.create(&base, page_alloc, node, "numa_cache",
+ ALIGN(sizeof(struct numa_slab), L1_CACHE_BYTES),
+ L1_CACHE_BYTES, 0, sizeof(struct numa_slab),
+ sizeof(struct numa_slab), 0);
+ if (!numa_cache)
+ panic("Cannot create numa cache\n");
+ }
+
+ n = base.alloc(numa_cache, in_atomic() ? GFP_ATOMIC : GFP_KERNEL);
+ if (!n)
+ return NULL;
+
+ memset(n, 0, sizeof(struct numa_slab));
+ slab_allocator_fill(&n->sc, slab_alloc, page_alloc, node, name, size, align,
+ order, objsize, inuse, offset);
+ spin_lock_init(&n->lock);
+ atomic_set(&n->refcount, 1);
+
+ /* Do not bring up a node here. slabulator may set a constructor after the fact */
+ return &n->sc;
+}
+
+static void *numa_slab_alloc(struct slab_cache *sc, gfp_t flags);
+
+static void *numa_slab_alloc_node(struct slab_cache *sc, gfp_t flags, int node)
+{
+ struct numa_slab *n = (void *)sc;
+ struct slab_cache *s;
+
+ TPRINTK(KERN_CRIT "numa_slab_alloc_node(%s, %x, %d)\n", sc->name, flags, node);
+
+ if (node < 0)
+ node = numa_node_id();
+
+ s = n->node[node];
+
+ if (unlikely(!s)) {
+ s = bring_up_node(n, node);
+ if (!s)
+ return NULL;
+ }
+ return base.alloc(s, flags);
+}
+
+static void *numa_slab_alloc(struct slab_cache *sc, gfp_t flags)
+{
+ int node = numa_node_id();
+
+ TPRINTK(KERN_CRIT "numa_slab_alloc(%s, %x)\n", sc->name, flags);
+
+ if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))
+ && !in_interrupt()) {
+ if (cpuset_do_slab_mem_spread())
+ node = cpuset_mem_spread_node();
+ else if (current->mempolicy)
+ node = slab_node(current->mempolicy);
+ }
+ return numa_slab_alloc_node(sc, flags, node);
+}
+
+static int numa_slab_destroy(struct slab_cache *sc)
+{
+ struct numa_slab *n = (void *)sc;
+
+ TPRINTK(KERN_CRIT "numa_slab_destroy(%s)\n", sc->name);
+
+ if (!atomic_dec_and_test(&n->refcount))
+ return 0;
+
+ __numa_slab_destroy(n);
+ return 0;
+}
+
+static int numa_slab_pointer_valid(struct slab_cache *sc, const void *object)
+{
+ struct numa_slab *n = (void *)sc;
+ int node;
+
+ TPRINTK(KERN_CRIT "numa_slab_pointer_valid(%s, %p)\n", sc->name, object);
+
+ /* We can deduce from the allocator which node this is. */
+ node = ((struct numactl *)(sc->page_alloc))->node;
+ return base.valid_pointer(n->node[node], object);
+}
+
+static unsigned long numa_slab_object_size(struct slab_cache *sc,
+ const void *object)
+{
+ struct numa_slab *n = (void *)sc;
+ int node;
+
+ TPRINTK(KERN_CRIT "numa_slab_object_size(%s, %p)\n", sc->name, object);
+
+ /* We can deduce from the allocator which node this is. */
+ node = ((struct numactl *)(sc->page_alloc))->node;
+ return base.object_size(n->node[node], object);
+}
+
+static void numa_slab_free(struct slab_cache *sc, const void *object)
+{
+ TPRINTK(KERN_CRIT "numa_slab_free(%s, %p)\n", sc ? sc->name : "<none>", object);
+ base.free(NULL, object);
+}
+
+static struct slab_cache *numa_slab_dup(struct slab_cache *sc)
+{
+ struct numa_slab *n = (void *)sc;
+
+ TPRINTK(KERN_CRIT "numa_slab_dup(%s)\n", sc->name);
+
+ atomic_inc(&n->refcount);
+ return sc;
+}
+
+static struct slab_cache *numa_slab_node(struct slab_cache *sc, int node)
+{
+ struct numa_slab *n = (void *)sc;
+ struct slab_cache *s = n->node[node];
+
+ return s;
+}
+
+static int numa_slab_shrink(struct slab_cache *sc,
+ int (*move_object)(struct slab_cache *, void *))
+{
+ struct numa_slab *n = (void *)sc;
+ int node;
+ int count = 0;
+
+ TPRINTK(KERN_CRIT "numa_slab_shrink(%s, %p)\n", sc->name, move_object);
+
+ /*
+ * FIXME: What you really want to do here is to
+ * run the shrinking on each node separately
+ */
+ spin_lock(&n->lock);
+ for_each_node(node) {
+ struct slab_cache *s = n->node[node];
+
+ if (s)
+ count += base.shrink(n->node[node], move_object);
+ }
+ spin_unlock(&n->lock);
+ return count;
+}
+
+static unsigned long numa_slab_objects(struct slab_cache *sc,
+ unsigned long *active, unsigned long *partial)
+{
+ struct numa_slab *n = (void *)sc;
+ int node;
+ unsigned long count = 0;
+ unsigned long count_active = 0;
+ unsigned long count_partial = 0;
+
+ printk(KERN_CRIT "numa_slab_objects(%s)\n", sc->name);
+
+ for_each_node(node) {
+ unsigned long nactive, npartial;
+ struct slab_cache *s = n->node[node];
+
+ if (!s)
+ continue;
+
+ count += base.objects(n->node[node], &nactive, &npartial);
+ count_active += nactive;
+ count_partial += npartial;
+ }
+ if (active)
+ *active = count_active;
+ if (partial)
+ *partial = count_partial;
+ return count;
+}
+
+const struct slab_allocator numa_slab_allocator = {
+ .name = "NumaSlab",
+ .create = numa_slab_create,
+ .alloc = numa_slab_alloc,
+ .alloc_node = numa_slab_alloc_node,
+ .free = numa_slab_free,
+ .valid_pointer = numa_slab_pointer_valid,
+ .object_size = numa_slab_object_size,
+ .objects = numa_slab_objects,
+ .shrink = numa_slab_shrink,
+ .dup = numa_slab_dup,
+ .node = numa_slab_node,
+ .destroy = numa_slab_destroy,
+ .destructor = null_slab_allocator_destructor
+};
+EXPORT_SYMBOL(numa_slab_allocator);
Index: linux-2.6.18-rc4/mm/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/mm/Makefile 2006-08-15 18:37:53.847305724 -0700
+++ linux-2.6.18-rc4/mm/Makefile 2006-08-15 18:43:25.610031185 -0700
@@ -25,4 +25,5 @@ obj-$(CONFIG_MEMORY_HOTPLUG) += memory_h
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_MODULAR_SLAB) += allocator.o kmalloc.o slabulator.o slabifier.o
+obj-$(CONFIG_NUMA_SLAB) += numa_slab.o
Index: linux-2.6.18-rc4/init/Kconfig
===================================================================
--- linux-2.6.18-rc4.orig/init/Kconfig 2006-08-15 18:37:52.247795073 -0700
+++ linux-2.6.18-rc4/init/Kconfig 2006-08-15 18:43:25.611007687 -0700
@@ -405,7 +405,7 @@ config SLAB
config MODULAR_SLAB
default n
- bool "Use the modular slab allocator frameworkr"
+ bool "Modular SLAB allocator framework"
depends on EXPERIMENTAL
help
The modular slab allocator framework allows the flexible use
@@ -413,6 +413,11 @@ config MODULAR_SLAB
allocation. This will completely replace the existing
slab allocator. Beware this is experimental code.
+config NUMA_SLAB
+ default y
+ bool "NUMA Slab allocator (for lots of memory)"
+ depends on MODULAR_SLAB && NUMA
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
> 2. use fls to calculate array position.
I'm not sure that's a good idea. I have always had my doubts that powers of two
are a good size distribution. I remember the original slab paper from Bonwick
also discouraged them. With fls you would hard-code it.
I think it would be better to keep the more flexible arbitrary array so that
people can try to develop new, better distributions.
-Andi
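[To put a rough number on the concern -- an editorial example using the patch's
own fls mapping, not from the thread: a 520-byte object lands in the 1024-byte
cache and loses roughly half of every allocation, which an intermediate size
class (say, 640 bytes) would avoid.]

#include <linux/bitops.h>	/* fls() */

/* Editorial sketch: bytes wasted per object under pure power-of-two sizing. */
static int power_of_two_waste(int size)
{
	int index = fls(size - 1);	/* size = 520 -> index = 10 (1024-byte cache) */

	return (1 << index) - size;	/* 1024 - 520 = 504 bytes, ~49% wasted */
}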
> 1. A framework for page allocators and slab allocators
Great.
Hopefully this will combat the recent surge of new custom
allocators.
Maybe it would be a good idea to first check if Evgeny's tree allocator
is really that much better as he claims and if any ideas from
that one could be incorporated into the new design?
> 2. Various methods to derive new allocators from old ones
> (add rcu support, destructors, constructors, dma etc)
I'm surprised that you add that many indirect calls when writing
code for IA64. I remember during SLES kernel testing we found that
gcc's generation of indirect calls on IA64/McKinley seemed to be rather poor --
in particular we got quite a measurable slowdown just from the indirect calls
that the LSM layer added to many paths. I hope that's ok.
I'm not arguing against doing this way btw, just a remark.
> B. The page slab allocator. This is a simple Pagesize based
> allocator that uses the page allocator directly to manage its
> objects. Doing so avoids all the slab overhead for large
> allocations. The page slab can also slabify any other
> page allocator.
What other ones do we have?
> C. The NUMA Slab. This allocator is based on the slabifier
> and simply creates one Slabifier per node and manages
> those. This allows a clean separation of NUMA.
> The slabifier stays simple and the NUMA slab can deal
> with the allocation complexities. So system
> without NUMA are not affected by the logic that is
> put in.
I hope the NUMA slab will still perform well even on non-NUMA systems, though.
That will be a common situation on x86-64 (kernels compiled with NUMA
but running on flat Intel systems).
> 1. shrink_slab takes a function to move object. Using that
> function slabs can be defragmented to ease slab reclaim.
Does that help with the inefficient dcache/icache pruning?
> - No support for pagese
What does that mean?
> Performance tests with AIM7 on an 8p Itanium machine (4 NUMA nodes)
> (Memory spreading active which means that we do not take advantage of NUMA locality
> in favor of load balancing)
Hmm, I'm not sure how allocator-intensive AIM7 is. I guess networking
would be a good test because it is very sensitive to allocator performance.
Perhaps also check with the routing people on netdev -- they seem to be able
to stress the allocator very much.
-Andi
On Tue, Aug 15, 2006 at 07:22:38PM -0700, Christoph Lameter wrote:
> Some of the other issues in the slab layer are also addressed here:
>
> 1. shrink_slab takes a function to move object. Using that
> function slabs can be defragmented to ease slab reclaim.
>
> 2. Bootstrap occurs through dynamic slab creation.
>
> 3. New slabs that are created can be merged into the kmalloc array
> if it is detected that they match. This decreases the number of caches
> and benefits cache use.
While this will be good for reducing fragmentation, one important
thing is needed here for tracking down leaks and slab corruptions -
the ability to split the caches back out into individual slabs.
Maybe a boot parameter would be useful here - that way it is easy to
determine which type of slab object is causing the problems without
needing the end user to run special kernels.
Also, some slab users probably want their own pool of objects that
nobody else can use - mempools are a good example of this - so there
needs to a way of indicating slabs should not be merged into the
kmalloc array.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
> > 3. New slabs that are created can be merged into the kmalloc array
> > if it is detected that they match. This decreases the number of caches
> > and benefits cache use.
>
> While this will be good for reducing fragmentation,
Will it? The theory behind a zone allocator like slab is that objects of the
same type have similar lifetimes. Fragmentation mostly happens when objects
have very different lifetimes. If you mix objects of different types
into the same slab then you might get more fragmentation.
kmalloc already has that problem but it probably shouldn't be added
to other slabs too.
-Andi
On Wed, Aug 16, 2006 at 09:52:54AM +0200, Andi Kleen wrote:
> > 1. shrink_slab takes a function to move object. Using that
> > function slabs can be defragmented to ease slab reclaim.
>
> Does that help with the inefficient dcache/icache pruning?
There was a fair amount of debate on this at the VM summit.
The approach we thought was most promising started with splitting the
dcache into directory and leaf entries.
--
Mathematics is the supreme nostalgia of our time.
On Wed, Aug 16, 2006 at 10:32:59AM +0200, Andi Kleen wrote:
>
> > > 3. New slabs that are created can be merged into the kmalloc array
> > > if it is detected that they match. This decreases the number of caches
> > > and benefits cache use.
> >
> > While this will be good for reducing fragmentation,
>
> Will it? The theory behind a zone allocator like slab is that objects of the
> same type have similar livetimes.
True, but not all users of the slab caches hold similar-lifetime objects. E.g.
bufferheads, dentries, inodes, filps, radix tree nodes, etc. all have effectively
random lifetimes. I'd say that most Linux slab objects don't have that
property....
> Fragmentation mostly happens when objects
> have very different live times.
*nod*
Just look at how badly the inode and dentry slabs can fragment....
> If you mix objects of different types
> into the same slab then you might get more fragmentation.
Yes, but you don't tend to get the same worst-case behaviour
that you get with single-use slabs. With multiple-use slabs, long-lived
objects tend to accumulate on the same pages as different
objects come and go from the pages. IOWs, you waste fewer pages
in a fragmented multi-object cache than you do in N fragmented
single-use caches.
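[A rough worked example of that point -- editorial, not from the thread: if 16
single-use caches each keep one slab page pinned by a single long-lived 64-byte
object, 16 pages stay allocated for about 1 KB of live data; merged into one
shared 64-byte cache, those 16 survivors fit on a single page.]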
> kmalloc already has that problem but it probably shouldn't be added
> to other slabs too.
I've never seen the kmalloc slabs show anywhere near the levels of
fragmentation I've seen from the inode and dentry slabs.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, Aug 16, 2006 at 03:41:19AM -0500, Matt Mackall wrote:
> On Wed, Aug 16, 2006 at 09:52:54AM +0200, Andi Kleen wrote:
> > > 1. shrink_slab takes a function to move object. Using that
> > > function slabs can be defragmented to ease slab reclaim.
> >
> > Does that help with the inefficient dcache/icache pruning?
>
> There was a fair amount of debate on this at the VM summit.
>
> The approach we thought was most promising started with splitting the
> dcache into directory and leaf entries.
That's been tried before with no noticeable effect on fragmentation.
Patch:
http://marc.theaimsgroup.com/?l=linux-mm&m=112858024817277&w=2
And informative thread:
http://marc.theaimsgroup.com/?l=linux-mm&m=112660138015732&w=2
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, 16 Aug 2006, Andi Kleen wrote:
> What other ones do we have?
vmalloc and the page allocators you derive from others using the
framework.
> > 1. shrink_slab takes a function to move object. Using that
> > function slabs can be defragmented to ease slab reclaim.
>
> Does that help with the inefficient dcache/icache pruning?
It would fix that if the dcache/icache provided a move-object
function.
>
> > - No support for pagese
>
> What does that mean?
That was just clutter. Sorry.
>
> > Performance tests with AIM7 on an 8p Itanium machine (4 NUMA nodes)
> > (Memory spreading active which means that we do not take advantage of NUMA locality
> > in favor of load balancing)
>
> Hmm, i'm not sure how allocator intensive AIM7 is. I guess networking
> would be a good test because it is very sensitive to allocator performance.
> Perhaps also check with the routing people on netdev -- they seem to be able
> to stress the allocator very much.
Yeah, well I think the implementations could be much more sophisticated
and one should do better tests but I am not sure how much time I can spend
on them.
On Wed, 16 Aug 2006, David Chinner wrote:
> Also, some slab users probably want their own pool of objects that
> nobody else can use - mempools are a good example of this - so there
> needs to a way of indicating slabs should not be merged into the
> kmalloc array.
slabs are only merged if they are compatible with the kmalloc array. If a
slab uses a memory pool then that would prohibit the merging.
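[A sketch of the compatibility test implied here, mirroring the SLABMULATOR_MERGE
condition in the slabulator patch; the helper name is made up. A cache built on
its own page allocator or mempool would already fail the plain page_allocator
requirement and so stays separate.]

/* Hypothetical helper: a cache may be folded into the kmalloc array only
 * if its object size is a power of two and nothing about its behaviour
 * differs from a plain kmalloc cache.
 */
static int mergeable_with_kmalloc(unsigned int realsize, unsigned long flags,
	void (*ctor)(void *, struct slab_cache *, unsigned long),
	void (*dtor)(void *, struct slab_cache *, unsigned long))
{
	return (realsize & (realsize - 1)) == 0 && !ctor && !dtor &&
	       !(flags & (SLAB_DESTROY_BY_RCU | SLAB_RECLAIM_ACCOUNT));
}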
On Wed, 16 Aug 2006, Matt Mackall wrote:
> The approach we thought was most promising started with splitting the
> dcache into directory and leaf entries.
Dipankar later told me that they had tried that approach and it was
of no benefit.
Christoph Lameter wrote:
>5. Three different slab allocators.
>[snip]
> It is called the Slabifier because it can slabify any
> page allocator. VMALLOC is a valid page allocator so
> it can do slabs on vmalloc ranges. You can define your
> own page allocator (mempools??) and then slabify that
> one.
>
>
Which .config settings are necessary? I tried to use it (uniprocessor,
no debug options enabled), but the compilation failed. 2.6.18-rc4
kernel. All 7 patches applied.
And: Are you sure that the slabifier works on vmalloc ranges? The code
uses virt_to_page(). Does that function work for vmalloc on all archs?
The lack of virt_to_page() on vmalloc/mempool memory always prevented
the slab allocator from handling such memory.
--
Manfred
On Wed, 16 Aug 2006, Manfred Spraul wrote:
> Which .config settings are necessary? I tried to use it (uniprocessor, no
> debug options enabled), but the compilation failed. 2.6.18-rc4 kernel. All 7
> patches applied.
I only built it on IA64 and never tested it on another arch. What error
messages are you getting?
> And: Are you sure that the slabifier works on vmalloc ranges? The code uses
> virt_to_page(). Does that function work for vmalloc on all archs?
Hmm.... Have not tried it; I just got minimal things going to have some numbers.
You are right. A real virtual-address-to-page translation for
vmalloc would involve going through the page tables. It seems that
the virt_to_page() used in get_object_page() does not do that.
In order to get vmalloc working we would first need to check the
address. If it is in the vmalloc range, then walk the page table and get
the struct page address that way. There is a function vmalloc_to_page()
in mm/memory.c that would do that for us.
So we need to modify get_object_page() to check for a VMALLOC range
and then use vmalloc_to_page instead of virt_to_page.
On Wed, 16 Aug 2006, Manfred Spraul wrote:
> The lack of virt_to_page() on vmalloc/mempool memory. always prevented the
> slab allocator from handling such memory.
Guess you need this patch for the slabifier to make it work:
Index: linux-2.6.18-rc4/mm/slabifier.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/slabifier.c 2006-08-16 14:25:18.449419152 -0700
+++ linux-2.6.18-rc4/mm/slabifier.c 2006-08-16 15:13:18.687428561 -0700
@@ -638,7 +638,12 @@ static struct page *get_object_page(cons
{
struct page * page;
- page = virt_to_page(x);
+ if ((unsigned long)x >= VMALLOC_START &&
+ (unsigned long)x < VMALLOC_END)
+ page = vmalloc_to_page(x);
+ else
+ page = virt_to_page(x);
+
if (unlikely(PageCompound(page)))
page = (struct page *)page_private(page);
Index: linux-2.6.18-rc4/mm/memory.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/memory.c 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/mm/memory.c 2006-08-16 15:14:01.595912665 -0700
@@ -2432,7 +2432,7 @@ int make_pages_present(unsigned long add
/*
* Map a vmalloc()-space virtual address to the physical page.
*/
-struct page * vmalloc_to_page(void * vmalloc_addr)
+struct page * vmalloc_to_page(const void * vmalloc_addr)
{
unsigned long addr = (unsigned long) vmalloc_addr;
struct page *page = NULL;
Index: linux-2.6.18-rc4/include/linux/mm.h
===================================================================
--- linux-2.6.18-rc4.orig/include/linux/mm.h 2006-08-06 11:20:11.000000000 -0700
+++ linux-2.6.18-rc4/include/linux/mm.h 2006-08-16 15:14:25.875663886 -0700
@@ -1013,7 +1013,7 @@ static inline unsigned long vma_pages(st
}
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
-struct page *vmalloc_to_page(void *addr);
+struct page *vmalloc_to_page(const void *addr);
unsigned long vmalloc_to_pfn(void *addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
On Wed, 16 Aug 2006, Manfred Spraul wrote:
> Which .config settings are necessary? I tried to use it (uniprocessor, no
> debug options enabled), but the compilation failed. 2.6.18-rc4 kernel. All 7
> patches applied.
Ahh. Tried it on i386. This works if you disable full slab support.
Some more minor fixes that my newer compiler on the i386 box discovered.
Index: linux-2.6.18-rc4/include/linux/allocator.h
===================================================================
--- linux-2.6.18-rc4.orig/include/linux/allocator.h 2006-08-16 15:42:24.000000000 -0700
+++ linux-2.6.18-rc4/include/linux/allocator.h 2006-08-16 15:42:27.000000000 -0700
@@ -66,7 +66,7 @@
* a page is freed.
*/
struct page_allocator *ctor_and_dtor_for_page_allocator
- (const struct page_allocator *, unsigned long size, void *private,
+ (const struct page_allocator *, unsigned int size, void *private,
void (*ctor)(void *, void *, unsigned long),
void (*dtor)(void *, void *, unsigned long));
Index: linux-2.6.18-rc4/mm/allocator.c
===================================================================
--- linux-2.6.18-rc4.orig/mm/allocator.c 2006-08-16 15:42:24.000000000 -0700
+++ linux-2.6.18-rc4/mm/allocator.c 2006-08-16 15:42:27.000000000 -0700
@@ -123,8 +123,8 @@
static void page_free_rcu(struct rcu_head *h)
{
struct page *page;
- struct page_allocator * base;
- int order = page->index;
+ struct page_allocator *base;
+ int order;
page = container_of((struct list_head *)h, struct page, lru);
base = (void *)page->mapping;
@@ -228,7 +228,7 @@
struct page_allocator *ctor_and_dtor_for_page_allocator
(const struct page_allocator *base,
- size_t size, void *private,
+ unsigned int size, void *private,
void (*ctor)(void *, void *, unsigned long),
void (*dtor)(void *, void *, unsigned long))
{
@@ -318,7 +318,7 @@
struct slabr *r = (void *) rcu;
struct slab_cache *s = r->s;
struct rcuified_slab *d = (void *)s->slab_alloc;
- void *object = (void *)object - d->rcu_offset;
+ void *object = (void *)rcu - d->rcu_offset;
d->base->free(s, object);
}
On Wed, 16 Aug 2006, Andi Kleen wrote:
> > 2. use fls to calculate array position.
>
> I'm not sure that's a good idea. I always had my doubts that power of twos
> are a good size distribution. I remember the original slab paper from Bonwick
> also discouraged them. With fls you would hard code it.
The original paper from Bonwick is pretty dated now. There are some
interesting ideas in there and we have mostly followed them. But it is an
academic paper after all, so not all ideas may be relevant in practice.
Support for non-power-of-two sizes could lead to general slabs that do not
fit exactly into one page. Right now everything fits in nicely. What exactly
was his argument?
> I think it would be better to keep the more flexible arbitary array so that
> people can try to develop new better distributions.
Well, they have not so far. The irregular entries in the list are 96 and 192
bytes long. 96 is only used if the cacheline size is smaller than 64, which is
probably relevant for a very small number of machines. 192 is only used if
the cacheline size is less than 128. Most current machines have 128-byte
cachelines, right? So we already have the power of two there.
Also, why do we limit the sizes of slabs? Would it not be best to have
larger-than-page-size slabs handled by the page_slab_allocator
(similar to SLOB)? The page allocator is built to optimize page-sized
allocations. Reclaim is trivial then.
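[A sketch of what that routing could look like -- editorial; kmalloc_sketch()
and the threshold are made up, and the free path would need a matching check.
The raw page allocator stands in for the page_slab_allocator here for brevity;
kmalloc_slab() and kmalloc_allocator are the ones from the kmalloc patch above.]

#include <linux/mm.h>		/* get_order() */
#include <linux/gfp.h>		/* __get_free_pages() */
#include <linux/kmalloc.h>	/* kmalloc_slab(), kmalloc_allocator (this patchset) */

/* Editorial sketch: send anything larger than a page straight to the page
 * allocator and keep the power-of-two caches for small objects.
 */
static inline void *kmalloc_sketch(size_t size, gfp_t flags)
{
	if (size > PAGE_SIZE)
		return (void *)__get_free_pages(flags, get_order(size));

	return kmalloc_allocator.alloc(kmalloc_slab(size, flags), flags);
}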
// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 17
// EXTRAVERSION =-rc1-mm3
--- 2.6/mm/slab.c 2006-04-18 13:18:57.000000000 +0200
+++ build-2.6/mm/slab.c 2006-05-13 21:25:04.000000000 +0200
@@ -223,11 +223,9 @@
*/
struct slab {
struct list_head list;
- unsigned long colouroff;
void *s_mem; /* including colour offset */
unsigned int inuse; /* num of objs active in slab */
kmem_bufctl_t free;
- unsigned short nodeid;
};
/*
@@ -305,6 +303,14 @@
int free_touched; /* updated without locking */
};
+struct kmem_pagehdr {
+ struct slab *slabp;
+ kmem_cache_t *cachep;
+#ifdef CONFIG_NUMA
+ int nodeid;
+#endif
+};
+
/*
* Need this for bootstrapping a per node allocator.
*/
@@ -398,7 +404,7 @@
size_t colour; /* cache colouring range */
unsigned int colour_off; /* colour offset */
struct kmem_cache *slabp_cache;
- unsigned int slab_size;
+ unsigned int data_offset;
unsigned int dflags; /* dynamic flags */
/* constructor func */
@@ -571,51 +577,38 @@
#endif
/*
- * Do not go above this order unless 0 objects fit into the slab.
- */
-#define BREAK_GFP_ORDER_HI 1
-#define BREAK_GFP_ORDER_LO 0
-static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;
-
-/*
* Functions for storing/retrieving the cachep and or slab from the page
* allocator. These are used to find the slab an obj belongs to. With kfree(),
* these are used to find the cache which an obj belongs to.
*/
-static inline void page_set_cache(struct page *page, struct kmem_cache *cache)
-{
- page->lru.next = (struct list_head *)cache;
-}
-static inline struct kmem_cache *page_get_cache(struct page *page)
+static inline struct kmem_pagehdr *objp_to_pagehdr(const void *objp)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
- return (struct kmem_cache *)page->lru.next;
-}
+ unsigned long addr;
+ addr = (unsigned long)objp;
+ addr -= sizeof(struct kmem_pagehdr);
+ addr &= ~(PAGE_SIZE-1);
-static inline void page_set_slab(struct page *page, struct slab *slab)
-{
- page->lru.prev = (struct list_head *)slab;
+ return (struct kmem_pagehdr *)addr;
}
-static inline struct slab *page_get_slab(struct page *page)
+static inline struct kmem_cache *virt_to_cache(const void *objp)
{
- if (unlikely(PageCompound(page)))
- page = (struct page *)page_private(page);
- return (struct slab *)page->lru.prev;
+ return objp_to_pagehdr(objp)->cachep;
}
-static inline struct kmem_cache *virt_to_cache(const void *obj)
+static inline struct slab *virt_to_slab(const void *objp)
{
- struct page *page = virt_to_page(obj);
- return page_get_cache(page);
+ return objp_to_pagehdr(objp)->slabp;
}
-static inline struct slab *virt_to_slab(const void *obj)
+static inline int slabp_to_nodeid(struct slab *slabp)
{
- struct page *page = virt_to_page(obj);
- return page_get_slab(page);
+#ifdef CONFIG_NUMA
+ return objp_to_pagehdr(slabp->s_mem)->nodeid;
+#else
+ return 0;
+#endif
}
static inline void *index_to_obj(struct kmem_cache *cache, struct slab *slab,
@@ -738,72 +731,6 @@
}
EXPORT_SYMBOL(kmem_find_general_cachep);
-static size_t slab_mgmt_size(size_t nr_objs, size_t align)
-{
- return ALIGN(sizeof(struct slab)+nr_objs*sizeof(kmem_bufctl_t), align);
-}
-
-/*
- * Calculate the number of objects and left-over bytes for a given buffer size.
- */
-static void cache_estimate(unsigned long gfporder, size_t buffer_size,
- size_t align, int flags, size_t *left_over,
- unsigned int *num)
-{
- int nr_objs;
- size_t mgmt_size;
- size_t slab_size = PAGE_SIZE << gfporder;
-
- /*
- * The slab management structure can be either off the slab or
- * on it. For the latter case, the memory allocated for a
- * slab is used for:
- *
- * - The struct slab
- * - One kmem_bufctl_t for each object
- * - Padding to respect alignment of @align
- * - @buffer_size bytes for each object
- *
- * If the slab management structure is off the slab, then the
- * alignment will already be calculated into the size. Because
- * the slabs are all pages aligned, the objects will be at the
- * correct alignment when allocated.
- */
- if (flags & CFLGS_OFF_SLAB) {
- mgmt_size = 0;
- nr_objs = slab_size / buffer_size;
-
- if (nr_objs > SLAB_LIMIT)
- nr_objs = SLAB_LIMIT;
- } else {
- /*
- * Ignore padding for the initial guess. The padding
- * is at most @align-1 bytes, and @buffer_size is at
- * least @align. In the worst case, this result will
- * be one greater than the number of objects that fit
- * into the memory allocation when taking the padding
- * into account.
- */
- nr_objs = (slab_size - sizeof(struct slab)) /
- (buffer_size + sizeof(kmem_bufctl_t));
-
- /*
- * This calculated number will be either the right
- * amount, or one greater than what we want.
- */
- if (slab_mgmt_size(nr_objs, align) + nr_objs*buffer_size
- > slab_size)
- nr_objs--;
-
- if (nr_objs > SLAB_LIMIT)
- nr_objs = SLAB_LIMIT;
-
- mgmt_size = slab_mgmt_size(nr_objs, align);
- }
- *num = nr_objs;
- *left_over = slab_size - nr_objs*buffer_size - mgmt_size;
-}
-
#define slab_error(cachep, msg) __slab_error(__FUNCTION__, cachep, msg)
static void __slab_error(const char *function, struct kmem_cache *cachep,
@@ -1018,7 +945,7 @@
static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
{
struct slab *slabp = virt_to_slab(objp);
- int nodeid = slabp->nodeid;
+ int nodeid = slabp_to_nodeid(slabp);
struct kmem_list3 *l3;
struct array_cache *alien = NULL;
@@ -1026,7 +953,7 @@
* Make sure we are not freeing a object from another node to the array
* cache on this cpu.
*/
- if (likely(slabp->nodeid == numa_node_id()))
+ if (likely(nodeid == numa_node_id()))
return 0;
l3 = cachep->nodelists[numa_node_id()];
@@ -1272,17 +1199,105 @@
local_irq_enable();
}
+static int get_dataoff(struct kmem_cache *cachep, size_t align)
+{
+ int off;
+
+ off = sizeof(struct kmem_pagehdr);
+ if (!(cachep->flags & CFLGS_OFF_SLAB)) {
+ off = ALIGN(off, sizeof(void*));
+ off += (cachep->num*sizeof(kmem_bufctl_t) +
+ sizeof(struct slab));
+ }
+ off = ALIGN(off, align);
+
+ return off;
+}
+
+static int get_colourspace(struct kmem_cache *cachep)
+{
+ int used, avail, ret;
+
+ avail = PAGE_SIZE * (1<<cachep->gfporder);
+ used = cachep->data_offset + cachep->buffer_size*cachep->num;
+
+ ret = (avail-used)/(cachep->colour_off);
+
+ return ret;
+}
+
+
+/**
+ * get_pagecnt - calculate the number of pages required for a slab setup
+ * @size: size of objects to be created in this cache.
+ * @align: required alignment for the objects.
+ * @on_slab: slab struct on slab?
+ * @count: Number of objects
+ */
+static int get_pagecnt(size_t size, size_t align, int on_slab, int count)
+{
+ int hdr;
+ int pages;
+
+ hdr = sizeof(struct kmem_pagehdr);
+
+ if (on_slab) {
+ hdr = ALIGN(hdr, sizeof(void*));
+
+ hdr += sizeof(struct slab)+count*sizeof(kmem_bufctl_t);
+ }
+ hdr = ALIGN(hdr,align);
+
+ pages = (count*size+hdr+PAGE_SIZE-1)/PAGE_SIZE;
+
+ return pages;
+}
+
+/**
+ * calculate_sizes - calculate size (page order) of slabs
+ * @numobj: Number of objects in one slab (out)
+ * @gfporder: gfp order of the slab (out)
+ * @size: size of objects to be created in this cache.
+ * @align: required alignment for the objects.
+ * @on_slab: slab struct on slab?
+ *
+ */
+static void calculate_sizes(int *numobj, int *gfporder,
+ size_t size, size_t align, int on_slab)
+{
+ int pages, num;
+
+ pages = get_pagecnt(size, align, on_slab, 1);
+
+ if (pages == 1) {
+ num = 1;
+ *gfporder = 0;
+
+ while (get_pagecnt(size, align, on_slab, num+1) == 1) {
+ num++;
+ }
+ *numobj = num;
+ } else {
+ int i;
+ *numobj = 1;
+ *gfporder = 0;
+ i = 1;
+ while (pages > i) {
+ i <<= 1;
+ (*gfporder)++;
+ }
+ }
+}
+
/*
* Initialisation. Called after the page allocator have been initialised and
* before smp_init().
*/
void __init kmem_cache_init(void)
{
- size_t left_over;
struct cache_sizes *sizes;
struct cache_names *names;
int i;
- int order;
for (i = 0; i < NUM_INIT_LISTS; i++) {
kmem_list3_init(&initkmem_list3[i]);
@@ -1290,13 +1305,6 @@
cache_cache.nodelists[i] = NULL;
}
- /*
- * Fragmentation resistance on low memory - only use bigger
- * page orders on machines with more than 32MB of memory.
- */
- if (num_physpages > (32 << 20) >> PAGE_SHIFT)
- slab_break_gfp_order = BREAK_GFP_ORDER_HI;
-
/* Bootstrap is tricky, because several objects are allocated
* from caches that do not exist yet:
* 1) initialize the cache_cache cache: it contains the struct
@@ -1327,17 +1335,11 @@
cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
cache_line_size());
- for (order = 0; order < MAX_ORDER; order++) {
- cache_estimate(order, cache_cache.buffer_size,
- cache_line_size(), 0, &left_over, &cache_cache.num);
- if (cache_cache.num)
- break;
- }
+ calculate_sizes(&cache_cache.num, &cache_cache.gfporder,
+ cache_cache.buffer_size, cache_line_size(), 1);
BUG_ON(!cache_cache.num);
- cache_cache.gfporder = order;
- cache_cache.colour = left_over / cache_cache.colour_off;
- cache_cache.slab_size = ALIGN(cache_cache.num * sizeof(kmem_bufctl_t) +
- sizeof(struct slab), cache_line_size());
+ cache_cache.data_offset = get_dataoff(&cache_cache, cache_line_size());
+ cache_cache.colour = get_colourspace(&cache_cache);
/* 2+3) create the kmalloc caches */
sizes = malloc_sizes;
@@ -1753,7 +1755,7 @@
*/
static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp)
{
- void *addr = slabp->s_mem - slabp->colouroff;
+ void *addr = objp_to_pagehdr(slabp->s_mem);
slab_destroy_objs(cachep, slabp);
if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU)) {
@@ -1786,66 +1788,6 @@
}
}
-/**
- * calculate_slab_order - calculate size (page order) of slabs
- * @cachep: pointer to the cache that is being created
- * @size: size of objects to be created in this cache.
- * @align: required alignment for the objects.
- * @flags: slab allocation flags
- *
- * Also calculates the number of objects per slab.
- *
- * This could be made much more intelligent. For now, try to avoid using
- * high order pages for slabs. When the gfp() functions are more friendly
- * towards high-order requests, this should be changed.
- */
-static size_t calculate_slab_order(struct kmem_cache *cachep,
- size_t size, size_t align, unsigned long flags)
-{
- size_t left_over = 0;
- int gfporder;
-
- for (gfporder = 0; gfporder <= MAX_GFP_ORDER; gfporder++) {
- unsigned int num;
- size_t remainder;
-
- cache_estimate(gfporder, size, align, flags, &remainder, &num);
- if (!num)
- continue;
-
- /* More than offslab_limit objects will cause problems */
- if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
- break;
-
- /* Found something acceptable - save it away */
- cachep->num = num;
- cachep->gfporder = gfporder;
- left_over = remainder;
-
- /*
- * A VFS-reclaimable slab tends to have most allocations
- * as GFP_NOFS and we really don't want to have to be allocating
- * higher-order pages when we are unable to shrink dcache.
- */
- if (flags & SLAB_RECLAIM_ACCOUNT)
- break;
-
- /*
- * Large number of objects is good, but very large slabs are
- * currently bad for the gfp()s.
- */
- if (gfporder >= slab_break_gfp_order)
- break;
-
- /*
- * Acceptable internal fragmentation?
- */
- if (left_over * 8 <= (PAGE_SIZE << gfporder))
- break;
- }
- return left_over;
-}
-
static void setup_cpu_cache(struct kmem_cache *cachep)
{
if (g_cpucache_up == FULL) {
@@ -1935,7 +1877,7 @@
void (*ctor)(void*, struct kmem_cache *, unsigned long),
void (*dtor)(void*, struct kmem_cache *, unsigned long))
{
- size_t left_over, slab_size, ralign;
+ size_t ralign;
struct kmem_cache *cachep = NULL, *pc;
/*
@@ -1948,6 +1890,10 @@
BUG();
}
+ if (align > PAGE_SIZE) {
+ printk(KERN_ERR "%s: alignment too large (%lu)\n", name, (unsigned long)align);
+ goto out;
+ }
/*
* Prevent CPUs from coming and going.
* lock_cpu_hotplug() nests outside cache_chain_mutex
@@ -2090,17 +2036,24 @@
#endif
#endif
- /* Determine if the slab management is 'on' or 'off' slab. */
- if (size >= (PAGE_SIZE >> 3))
- /*
- * Size is large, assume best to place the slab management obj
- * off-slab (should allow better packing of objs).
- */
- flags |= CFLGS_OFF_SLAB;
-
size = ALIGN(size, align);
- left_over = calculate_slab_order(cachep, size, align, flags);
+ {
+ int num_on, gfp_on, num_off, gfp_off;
+
+ calculate_sizes(&num_on, &gfp_on, size, align, 1);
+ calculate_sizes(&num_off, &gfp_off, size, align, 0);
+
+ if (gfp_on > gfp_off ||
+ (num_on > num_off && size > PAGE_SIZE/8)) {
+ flags |= CFLGS_OFF_SLAB;
+ cachep->num = num_off;
+ cachep->gfporder = gfp_off;
+ } else {
+ cachep->num = num_on;
+ cachep->gfporder = gfp_on;
+ }
+ }
if (!cachep->num) {
printk("kmem_cache_create: couldn't create cache %s.\n", name);
@@ -2108,53 +2061,43 @@
cachep = NULL;
goto oops;
}
- slab_size = ALIGN(cachep->num * sizeof(kmem_bufctl_t)
- + sizeof(struct slab), align);
-
- /*
- * If the slab has been placed off-slab, and we have enough space then
- * move it on-slab. This is at the expense of any extra colouring.
- */
- if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
- flags &= ~CFLGS_OFF_SLAB;
- left_over -= slab_size;
- }
if (flags & CFLGS_OFF_SLAB) {
- /* really off slab. No need for manual alignment */
- slab_size =
- cachep->num * sizeof(kmem_bufctl_t) + sizeof(struct slab);
+ size_t slab_size;
+
+ slab_size = sizeof(struct slab);
+ slab_size += cachep->num * sizeof(kmem_bufctl_t);
+ cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
}
cachep->colour_off = cache_line_size();
/* Offset must be a multiple of the alignment. */
if (cachep->colour_off < align)
cachep->colour_off = align;
- cachep->colour = left_over / cachep->colour_off;
- cachep->slab_size = slab_size;
cachep->flags = flags;
cachep->gfpflags = 0;
if (flags & SLAB_CACHE_DMA)
cachep->gfpflags |= GFP_DMA;
cachep->buffer_size = size;
- if (flags & CFLGS_OFF_SLAB)
- cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
+ cachep->data_offset = get_dataoff(cachep, align);
+ cachep->colour = get_colourspace(cachep);
+
cachep->ctor = ctor;
cachep->dtor = dtor;
cachep->name = name;
-
setup_cpu_cache(cachep);
/* cache setup completed, link it into the list */
list_add(&cachep->next, &cache_chain);
oops:
+ mutex_unlock(&cache_chain_mutex);
+ unlock_cpu_hotplug();
+out:
if (!cachep && (flags & SLAB_PANIC))
panic("kmem_cache_create(): failed to create slab `%s'\n",
name);
- mutex_unlock(&cache_chain_mutex);
- unlock_cpu_hotplug();
return cachep;
}
EXPORT_SYMBOL(kmem_cache_create);
@@ -2356,7 +2299,7 @@
EXPORT_SYMBOL(kmem_cache_destroy);
/* Get the memory for a slab management obj. */
-static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *objp,
+static struct slab *alloc_slabmgmt(struct kmem_cache *cachep, void *addr,
int colour_off, gfp_t local_flags,
int nodeid)
{
@@ -2369,13 +2312,12 @@
if (!slabp)
return NULL;
} else {
- slabp = objp + colour_off;
- colour_off += cachep->slab_size;
+ slabp = addr
+ + ALIGN(sizeof(struct kmem_pagehdr), sizeof(void*))
+ + colour_off;
}
slabp->inuse = 0;
- slabp->colouroff = colour_off;
- slabp->s_mem = objp + colour_off;
- slabp->nodeid = nodeid;
+ slabp->s_mem = addr + cachep->data_offset + colour_off;
return slabp;
}
@@ -2451,7 +2393,7 @@
next = slab_bufctl(slabp)[slabp->free];
#if DEBUG
slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
- WARN_ON(slabp->nodeid != nodeid);
+ WARN_ON(slabp_to_nodeid(slabp) != nodeid);
#endif
slabp->free = next;
@@ -2483,23 +2425,18 @@
* for the slab allocator to be able to lookup the cache and slab of a
* virtual address for kfree, ksize, kmem_ptr_validate, and slab debugging.
*/
-static void slab_map_pages(struct kmem_cache *cache, struct slab *slab,
- void *addr)
+static void slab_map_pages(void *addr, struct kmem_cache *cachep,
+ struct slab *slabp, int nodeid)
{
- int nr_pages;
- struct page *page;
+ struct kmem_pagehdr *phdr;
- page = virt_to_page(addr);
+ phdr = (struct kmem_pagehdr*)addr;
- nr_pages = 1;
- if (likely(!PageCompound(page)))
- nr_pages <<= cache->gfporder;
-
- do {
- page_set_cache(page, cache);
- page_set_slab(page, slab);
- page++;
- } while (--nr_pages);
+ phdr->cachep = cachep;
+ phdr->slabp = slabp;
+#ifdef CONFIG_NUMA
+ phdr->nodeid = nodeid;
+#endif
}
/*
@@ -2509,7 +2446,7 @@
static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
struct slab *slabp;
- void *objp;
+ void *addr;
size_t offset;
gfp_t local_flags;
unsigned long ctor_flags;
@@ -2561,17 +2498,16 @@
* Get mem for the objs. Attempt to allocate a physical page from
* 'nodeid'.
*/
- objp = kmem_getpages(cachep, flags, nodeid);
- if (!objp)
+ addr = kmem_getpages(cachep, flags, nodeid);
+ if (!addr)
goto failed;
/* Get slab management. */
- slabp = alloc_slabmgmt(cachep, objp, offset, local_flags, nodeid);
+ slabp = alloc_slabmgmt(cachep, addr, offset, local_flags, nodeid);
if (!slabp)
goto opps1;
- slabp->nodeid = nodeid;
- slab_map_pages(cachep, slabp, objp);
+ slab_map_pages(addr, cachep, slabp, nodeid);
cache_init_objs(cachep, slabp, ctor_flags);
@@ -2587,7 +2523,7 @@
spin_unlock(&l3->list_lock);
return 1;
opps1:
- kmem_freepages(cachep, objp);
+ kmem_freepages(cachep, addr);
failed:
if (local_flags & __GFP_WAIT)
local_irq_disable();
@@ -2622,24 +2558,22 @@
static void *cache_free_debugcheck(struct kmem_cache *cachep, void *objp,
void *caller)
{
- struct page *page;
unsigned int objnr;
struct slab *slabp;
objp -= obj_offset(cachep);
kfree_debugcheck(objp);
- page = virt_to_page(objp);
- if (page_get_cache(page) != cachep) {
+ if (virt_to_cache(objp) != cachep) {
printk(KERN_ERR "mismatch in kmem_cache_free: expected "
"cache %p, got %p\n",
- page_get_cache(page), cachep);
+ virt_to_cache(objp), cachep);
printk(KERN_ERR "%p is %s.\n", cachep, cachep->name);
- printk(KERN_ERR "%p is %s.\n", page_get_cache(page),
- page_get_cache(page)->name);
+ printk(KERN_ERR "%p is %s.\n", virt_to_cache(objp),
+ virt_to_cache(objp)->name);
WARN_ON(1);
}
- slabp = page_get_slab(page);
+ slabp = virt_to_slab(objp);
if (cachep->flags & SLAB_RED_ZONE) {
if (*dbg_redzone1(cachep, objp) != RED_ACTIVE ||
@@ -2857,7 +2791,7 @@
int objnr;
struct slab *slabp;
- slabp = page_get_slab(virt_to_page(objp));
+ slabp = virt_to_slab(objp);
objnr = (objp - slabp->s_mem) / cachep->buffer_size;
slab_bufctl(slabp)[objnr] = (unsigned long)caller;
@@ -2867,7 +2801,7 @@
struct slab *slabp;
unsigned objnr;
- slabp = page_get_slab(virt_to_page(objp));
+ slabp = virt_to_slab(objp);
objnr = (unsigned)(objp - slabp->s_mem) / cachep->buffer_size;
slab_bufctl(slabp)[objnr] = BUFCTL_ACTIVE;
}
@@ -3199,7 +3133,7 @@
page = virt_to_page(ptr);
if (unlikely(!PageSlab(page)))
goto out;
- if (unlikely(page_get_cache(page) != cachep))
+ if (unlikely(virt_to_cache(ptr) != cachep))
goto out;
return 1;
out:
--- 2.6/include/asm-i386/thread_info.h 2006-03-20 06:53:29.000000000 +0100
+++ build-2.6/include/asm-i386/thread_info.h 2006-05-13 20:51:19.000000000 +0200
@@ -55,8 +55,10 @@
#define PREEMPT_ACTIVE 0x10000000
#ifdef CONFIG_4KSTACKS
#define THREAD_SIZE (4096)
+#define THREAD_GFPORDER 0
#else
#define THREAD_SIZE (8192)
+#define THREAD_GFPORDER 1
#endif
#define STACK_WARN (THREAD_SIZE/8)
@@ -101,16 +103,17 @@
({ \
struct thread_info *ret; \
\
- ret = kmalloc(THREAD_SIZE, GFP_KERNEL); \
+ ret = (struct thread_info *)__get_free_pages(GFP_KERNEL, THREAD_GFPORDER); \
if (ret) \
memset(ret, 0, THREAD_SIZE); \
ret; \
})
#else
-#define alloc_thread_info(tsk) kmalloc(THREAD_SIZE, GFP_KERNEL)
+#define alloc_thread_info(tsk) \
+ ((struct thread_info *)__get_free_pages(GFP_KERNEL, THREAD_GFPORDER))
#endif
-#define free_thread_info(info) kfree(info)
+#define free_thread_info(info) free_pages((unsigned long)(info), THREAD_GFPORDER)
#else /* !__ASSEMBLY__ */
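(Editorial aside, not part of the patch: THREAD_GFPORDER simply encodes the page
order of the stack allocation, i.e. THREAD_SIZE == PAGE_SIZE << THREAD_GFPORDER
on i386. A compile-time check along these lines could document that invariant,
assuming PAGE_SIZE is visible at this point in the header:)

#if THREAD_SIZE != (PAGE_SIZE << THREAD_GFPORDER)
#error THREAD_GFPORDER does not match THREAD_SIZE
#endif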
On Thursday 17 August 2006 07:19, Manfred Spraul wrote:
> Christoph Lameter wrote:
> >On Wed, 16 Aug 2006, Andi Kleen wrote:
> >>>2. use fls to calculate array position.
> >>
> >>I'm not sure that's a good idea. I always had my doubts that powers of
> >> two are a good size distribution. I remember the original slab paper
> >> from Bonwick also discouraged them. With fls you would hard code it.
> >
> >The original paper from Bonwick is pretty dated now. There are some
> >interesting ideas in there and we have mostly followed them. But it is an
> >academic paper after all.
>
> It's not just an academic paper, it's implemented in Solaris. Check the
> opensolaris sources.
If you look at some allocation size traces, it's not clear that power of two
is really a good fit.
>
> > So not all ideas may be relevant in practice.
> >Support for non power of two sizes could lead to general slabs that do not
> >exactly fit into one page.
>
> That's a good thing.
> I'm not sure that the current approach with
> virt_to_page()/vmalloc_to_page() is the right thing(tm): Both functions
> are slow.
> If you have non-power-of-two caches, you could store the control data at
> (addr&(~(PAGE_SIZE-1))) - the lookup would be much faster. I wrote a patch a
> few weeks ago, it's attached.
For networking it might be useful. There the allocator is quite performance
critical, and the common case for large objects is MTU-sized packets of
1.5k+sizeof(skb_shared_info). This would need a new
At some point I had a hack to hint to the slab allocator about the MTU
sizes of network interfaces, but I dropped it because it didn't seem
useful without anything else to store in the page (2K objects vs 1.5k
objects: still only two fit per page).
But if you have metadata you could use it. The drawback: even more
complexity, because you have another special case?
-Andi
On Thu, 17 Aug 2006, Manfred Spraul wrote:
> I'm not sure that the current approach with virt_to_page()/vmalloc_to_page()
> is the right thing(tm): Both functions are slow.
Right. It would be great to avoid it, but if we have to use it then I think we
should just store as much metainformation in the page as possible.
> If you have non-power-of-two caches, you could store the control data at
> (addr&(~(PAGE_SIZE-1))) - the lookup would be much faster. I wrote a patch a few
> weeks ago, it's attached.
That would only work for slabs that use order 0 pages.
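(A minimal editorial sketch of the lookup being discussed here, not code from
the attached patch: a small header at the start of every order-0 slab page lets
the allocator find the cache and slab with pure address arithmetic. The struct
layout and helper name below are assumptions modelled on the objp_to_pagehdr()
code quoted further up; PAGE_SIZE, struct kmem_cache and struct slab are the
usual kernel definitions.)

struct kmem_pagehdr {
        struct kmem_cache *cachep;      /* cache this page belongs to */
        struct slab *slabp;             /* slab management for this page */
};

static inline struct kmem_pagehdr *obj_to_hdr(const void *objp)
{
        unsigned long addr = (unsigned long)objp;

        /*
         * Step back by the header size before masking so an object that
         * starts exactly on a page boundary is attributed to the preceding
         * page, then round down to the page start where the header lives.
         * No struct page lookup, no virt_to_page().
         */
        addr -= sizeof(struct kmem_pagehdr);
        return (struct kmem_pagehdr *)(addr & ~(PAGE_SIZE - 1));
}

With a higher-order slab the masked address can land on a page that carries no
header, which is exactly the order-0 limitation noted above.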
> Right now we have a few slab users that perform kmalloc(PAGE_SIZE). But that's
> mostly for historic reasons: kmalloc was significantly (IIRC up to factor
> 10) faster than get_free_pages(). Now get_free_pages() also contains per-cpu
> structures, so we could convert __get_name or the pipe code back to
> get_free_pages().
Note that the caches of the page allocator only work for zero order pages.
And the reason for the existence of those caches is questionable.
We have repeatedly found in practical tests that switching off these
caches has no influence on performance whatsoever.
On Thu, 17 Aug 2006, Manfred Spraul wrote:
> I'm not sure that the current approach with virt_to_page()/vmalloc_to_page()
> is the right thing(tm): Both functions are slow.
I am not so sure.
C code
typedef union ia64_va {
struct {
unsigned long off : 61; /* intra-region offset */
unsigned long reg : 3; /* region number */
} f;
unsigned long l;
void *p;
} ia64_va;
#define __pa(x) ({ia64_va _v; _v.l = (long) (x); _v.f.reg = 0; _v.l;})
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
# define pfn_to_page(pfn) (vmem_map + (pfn))
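(Editorial note: composing the three macros above, the whole translation
reduces to a mask, a shift and one add, with vmem_map the only value fetched
from memory:)

        page = vmem_map + (__pa(kaddr) >> PAGE_SHIFT);  /* one load, rest is arithmetic */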
Here is a disassembly of kfree() (just the relevant instructions; the compiler
creates a mess and wastes scores of instructions repeating the same
thing about 3 times because gcc 3.3 fails to properly optimize the
debugging crud in the existing slab):
0xa000000100144850 <kfree+48>: [MII] addl r11=-1995424,r1
r11 = &vmem_map
0xa000000100144870 <kfree+80>: [MII] mov r10=r11
r10 = &vmem_map
0xa000000100144881 <kfree+97>: ld8 r18=[r10]
r18 = [vmem_map]
0xa000000100144890 <kfree+112>: [MMI] shladd r16=r3,3,r18;;
0xa000000100144891 <kfree+113>: ld4.acq r2=[r16]
struct page * = r16 = vmem_map + pfn
r2 = flags;
0xa0000001001448a0 <kfree+128>: [MII] nop.m 0x0
0xa0000001001448a1 <kfree+129>: tbit.z p8,p9=r2,14;;
if(PageCompound(page))
.....
(And some more crud: a useless check for PageCompound (in the hot path!),
although compound pages are not used for higher-order slabs except on
non-MMU arches.)
So we only have a single lookup of vmem_map from memory in order to
calculate the address of struct page. The cacheline for vmem_map is
heavily used and certainly in memory. virt_to_page seems to be a very
efficient means to get to struct page. The problem scope may simply be
to minimize the cachelines touched during free and alloc.
vmalloc_to_addr is certainly slower due to the page table walking. But the
user already is aware of the fact that vmalloc memory is not as fast as
direct mapped.
On Thu, 17 Aug 2006 23:17:54 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:
>
> So we only have a single lookup of vmem_map from memory in order to
> calculate the address of struct page. The cacheline for vmem_map is
> heavily used and certainly in memory. virt_to_page seems to be a very
> efficient means to get to struct page. The problem scope may simply be
> to minimize the cachelines touched during free and alloc.
Just a note: with SPARSEMEM, we need more calculation and access to
mem_section[] table and page structs(mem_map).
> vmalloc_to_addr is certainly slower due to the page table walking. But the
> user already is aware of the fact that vmalloc memory is not as fast as
> direct mapped.
Hmm.
I wonder....
==
The vmalloc() area is backed by the VHPT (16KB page size), while the direct-mapped
area is backed by the software TLB miss handler (16MB/64MB page size).
If TLB misses are frequent, prefetch may work better in the virtually-mapped area
(region 5) than in the direct-mapped area (region 7) because of the hardware assist
of the VHPT walker. (Just my guess; I have no data.)
Considering code that walks a list of objects scattered over all of memory,
is the virtually-mapped area really slow?
-Kame
On Fri, 18 Aug 2006, KAMEZAWA Hiroyuki wrote:
> Just a note: with SPARSEMEM, we need more calculation and access to
> mem_section[] table and page structs(mem_map).
Uhh, a regression against DISCONTIG. Could you address that issue?
> > vmalloc_to_addr is certainly slower due to the page table walking. But the
> > user already is aware of the fact that vmalloc memory is not as fast as
> > direct mapped.
> Considering code that walks a list of objects scattered over all of memory,
> is the virtually-mapped area really slow?
On most arches it is somewhat slower. IA64 is special in the sense that it
has virtually mapped kernel memory.
On Fri, 18 Aug 2006 09:58:13 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:
> On Fri, 18 Aug 2006, KAMEZAWA Hiroyuki wrote:
>
> > Just a note: with SPARSEMEM, we need more calculation and access to
> > mem_section[] table and page structs(mem_map).
>
> Uhh, a regression against DISCONTIG. Could you address that issue?
>
At first, ia64's DISCONTIG is special because of VIRTUAL_MEMMAP.
and ia64's SPARSEMEM is special,too. it's SPARSEMEM_EXTREME.
Considering generic arch, see include/asm-generic/memory_model.h,
which doesn't use virtual mem_map.
with FLATMEM, pfn_to_page() is pfn + mem_map. just an address calculation.
with *usual* DISCONTIG
--
pgdat = NODE_DATA(pfn_to_nid(pfn));
page = pgdat->node_mem_map + pfn - pgdat->node_start_pfn
--
If accessing the pgdat is fast, the cost will not be a big problem.
pfn_to_nid() is usually implemented by calculation or table look up.
and usual SPARSEMEM, (not EXTREME)
--
page = mem_section[(pfn >> SECTION_SHIFT)].mem_map + pfn
--
need one table look up. maybe not very big.
with SPARSEMEM_EXTREME
--
page = mem_section[(pfn >> SECTION_SHIFT)][(pfn & MASK)].mem_map + pfn
--
need one (big)table look up.
-Kame
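(To make the comparison concrete, here is a rough editorial sketch of the
lookups involved. It follows the shape of the pfn_to_page() variants described
above, using the kernel's mem_map, NODE_DATA(), pfn_to_nid(), __pfn_to_section()
and __section_mem_map_addr(), but it is not the exact kernel code:)

/* FLATMEM: one global array, pure arithmetic. */
struct page *flat_pfn_to_page(unsigned long pfn)
{
        return mem_map + pfn;
}

/* Usual DISCONTIG: find the node first, then index its mem_map. */
struct page *discontig_pfn_to_page(unsigned long pfn)
{
        pg_data_t *pgdat = NODE_DATA(pfn_to_nid(pfn));

        return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
}

/* SPARSEMEM: index the section table; EXTREME adds a second level. */
struct page *sparse_pfn_to_page(unsigned long pfn)
{
        return __section_mem_map_addr(__pfn_to_section(pfn)) + pfn;
}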
On Sat, 19 Aug 2006, KAMEZAWA Hiroyuki wrote:
> At first, ia64's DISCONTIG is special because of VIRTUAL_MEMMAP.
> and ia64's SPARSEMEM is special,too. it's SPARSEMEM_EXTREME.
Right. Would it be possible to get VIRTUAL_MEMMAP support into SPARSEMEM?
> with FLATMEM, pfn_to_page() is pfn + mem_map. just an address calculation.
So the virt_to_page is as fast as IA64 DISCONTIG on UP and SMP.
> with *usual* DISCONTIG
> --
> pgdat = NODE_DATA(pfn_to_nid(pfn));
> page = pgdat->node_mem_map + pfn - pgdat->node_start_pfn
> --
> If accessing the pgdat is fast, the cost will not be a big problem.
> pfn_to_nid() is usually implemented by calculation or table look up.
Hmmm.... pfn_to_nid usually involves a table lookup. So two table lookups
to get there.
> and usual SPARSEMEM, (not EXTREME)
> --
> page = mem_section[(pfn >> SECTION_SHIFT)].mem_map + pfn
> --
> need one table look up. maybe not very big.
Bigger than a cacheline that can be kept in the cache? On a large Altix we
may have 1k nodes meaning up to 4k zones!
> with SPARSEMEM_EXTREME
> --
> page = mem_section[(pfn >> SECTION_SHIFT)][(pfn & MASK)].mem_map + pfn
> --
> need one (big)table look up.
Owww... Cache issues.
Could we do the lookup using a sparse virtually mapped table like on
IA64. Then align section shift to whatever page table is in place (on
platforms that require page tables and IA64 could continue to use its
special handler)?
Then page could be reached via
page = vmem_map + pfn
again ?
On Fri, 18 Aug 2006 11:44:22 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:
> > and usual SPARSEMEM, (not EXTREME)
> > --
> > page = mem_section[(pfn >> SECTION_SHIFT)].mem_map + pfn
> > --
> > need one table look up. maybe not very big.
>
> Bigger than a cacheline that can be kept in the cache?
powerpc's section size is 16MB, so the table itself is bigger than a cacheline.
And powerpc doesn't have a DISCONTIG configuration.
> > with SPARSEMEM_EXTREME
> > --
> > page = mem_section[(pfn >> SECTION_SHIFT)][(pfn & MASK)].mem_map + pfn
> > --
> > need one (big)table look up.
>
> Owww... Cache issues.
>
> Could we do the lookup using a sparse virtually mapped table like on
> IA64. Then align section shift to whatever page table is in place (on
> platforms that require page tables and IA64 could continue to use its
> special handler)?
>
> Then page could be reached via
>
> page = vmem_map + pfn
>
> again ?
>
In the early days of implementing SPARSEMEM, I tried vmem_map + SPARSEMEM.
But it was very complicated, so it was dropped.
Before retrying, I think a performance/profile test with each memory model should
be done.
Considering ia64, an advantage of SPARSEMEM is (just?) memory hotplug.
If people need extreme performance, they can select DISCONTIGMEM.
But powerpc (NUMA) people have only SPARSEMEM, so a test on powerpc will be
necessary anyway. (Of course, they don't have vmem_map.)
-Kame
On Sat, 19 Aug 2006, KAMEZAWA Hiroyuki wrote:
> But powerpc (NUMA) people have only SPARSEMEM, so a test on powerpc will be
> necessary anyway. (Of course, they don't have vmem_map.)
Well they also have quite sophisticated page tables. They may be able to
do the same thing as we have on IA64.
Christoph Lameter wrote:
>>If you have non-power-of-two caches, you could store the control data at
>>(addr&(~(PAGE_SIZE-1))) - the lookup would be much faster. I wrote a patch a few
>>weeks ago, it's attached.
>>
>>
>
>That would only work for slabs that use order 0 pages.
>
>
>
Most slabs are order 0. Actually: there are only 6 slabs that are not
order 0 (excluding the kmalloc caches) on my system.
What about:
if (unlikely(addr & (~(PAGE_SIZE-1))))
slabp=virt_to_page(addr)->pagefield;
else
slabp=addr & (~(PAGE_SIZE-1));
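(Editorial sketch: as quoted, the test above is true for essentially any kernel
address, so presumably the intent is to take the arithmetic fast path only for
order-0 slabs. One possible shape, where slab_is_order0() is an assumed marker
such as a per-cache bit or a page flag, and page_get_slab() is the existing
helper in slab.c:)

static inline struct slab *obj_to_slab(struct kmem_cache *cachep, void *addr)
{
        if (likely(slab_is_order0(cachep)))
                /* control data sits at the start of the object's own page */
                return (struct slab *)((unsigned long)addr & ~(PAGE_SIZE - 1));

        /* higher-order slabs keep the struct page based lookup */
        return page_get_slab(virt_to_page(addr));
}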
Modify the kmalloc caches slightly and use non-power-of-2 cache sizes.
Move the kmalloc(PAGE_SIZE) users to gfp.
From my system:
good order 1 slab caches: (i.e.: forcing them to order 0 wastes some memory)
biovec-128
blkdev_queue
mqueue_inode_cache
RAWv6
UDPv6
bogus order 1 caches: (i.e.: they could be order 0 without wasting memory)
biovec-(256)
--
Manfred
On Sat, 19 Aug 2006, Manfred Spraul wrote:
> What about:
>
> if (unlikely(addr & (~(PAGE_SIZE-1))))
> slabp=virt_to_page(addr)->pagefield;
> else
> slabp=addr & (~(PAGE_SIZE-1));
Well this would not work with the simple slab design that puts the
first element at the page border to simplify alignment.
And as we have just seen, virt_to_page is mostly an address
calculation in many configurations. I doubt that there would be a great
advantage. Today's processors' biggest cause of latency is
cacheline misses. Some arithmetic with addresses does not seem to
be that important. Misaligning data in order to not put objects on such
boundaries could be an issue.
> Modify the kmalloc caches slightly and use non-power-of-2 cache sizes. Move
> the kmalloc(PAGE_SIZE) users to gfp.
Power of 2 cache sizes make the object align neatly to cacheline
boundaries and make them fit tightly into a page.
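(A quick worked example of the alignment point, assuming 4096-byte pages and
64-byte cache lines: 2048-byte objects fit exactly two per page and each starts
on a cache-line boundary; 2048 - 64 = 1984-byte objects also fit two per page
and, since 1984 is a multiple of 64, remain cache-line aligned, which is the
point Manfred makes in his reply below.)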
Christoph Lameter wrote:
>On Sat, 19 Aug 2006, Manfred Spraul wrote:
>
>
>>What about:
>>
>>if (unlikely(addr & (~(PAGE_SIZE-1))))
>> slabp=virt_to_page(addr)->pagefield;
>>else
>> slabp=addr & (~(PAGE_SIZE-1));
>>
>>
>
>Well this would not work with the simple slab design that puts the
>first element at the page border to simplify alignment.
>
>And as we have just seen, virt_to_page is mostly an address
>calculation in many configurations. I doubt that there would be a great
>advantage. Today's processors' biggest cause of latency is
>cacheline misses.
>
It involves table walking on discontigmem archs. "slabp=addr &
(~(PAGE_SIZE-1));" means no pointer chasing, and the access touches the
same page, i.e. definitely no TLB miss.
> Some arithmetic with addresses does not seem to
>be that important. Misaligning data in order to not put objects on such
>boundaries could be an issue.
>
> > Modify the kmalloc caches slightly and use non-power-of-2 cache sizes. Move
>
>
>>the kmalloc(PAGE_SIZE) users to gfp.
>>
>>
>
>Power of 2 cache sizes make the object align neatly to cacheline
>boundaries and make them fit tightly into a page.
>
>
IMHO not really an issue. 2kb-cache_line_size() also aligns perfectly.
--
Manfred
On Sat, 19 Aug 2006, Manfred Spraul wrote:
> > And as we have just seen, virt_to_page is mostly an address calculation in
> > many configurations. I doubt that there would be a great advantage. Today's
> > processors' biggest cause of latency is cacheline misses.
> >
> It involves table walking on discontigmem archs. "slabp=addr &
> (~(PAGE_SIZE-1));" means no pointer chasing, and the access touches the same
> page, i.e. definitely no TLB miss.
There is no table walking for discontigmem on ia64. Ia64 only creates page
tables if it needs to satisfy the Linux kernel's demands for such a thing.
And this is a kernel mapping. No page table is involved.
The current sparsemem approach also does not need table walking. It needs
to do lookups in a table.
UP and SMP currently work cleanly.
> > Power of 2 cache sizes make the object align neatly to cacheline boundaries
> > and make them fit tightly into a page.
> >
> IMHO not really an issue. 2kb-cache_line_size() also aligns perfectly.
That would work and also be in line with the existing overhead of the
slabs.