Date: Mon, 21 May 2007 15:10:39 +0100
From: mel@skynet.ie (Mel Gorman)
To: clameter@sgi.com
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, dgc@sgi.com, Hugh Dickins
Subject: Re: [patch 02/10] SLUB: slab defragmentation and kmem_cache_vacate
Message-ID: <20070521141039.GA18474@skynet.ie>
References: <20070518181040.465335396@sgi.com> <20070518181119.062736299@sgi.com>
In-Reply-To: <20070518181119.062736299@sgi.com>

On (18/05/07 11:10), clameter@sgi.com didst pronounce:
> Slab defragmentation occurs when the slabs are shrunk (after inode, dentry
> shrinkers have been run from the reclaim code) or when a manual shrinking
> is requested via slabinfo. During the shrink operation SLUB will generate a
> list of partially populated slabs sorted by the number of objects in use.
> 
> We extract pages off that list that are only filled less than a quarter and
> attempt to motivate the users of those slabs to either remove the objects
> or move the objects.
> 

I know I brought up this "less than a quarter" thing before and I haven't
thought of a better alternative. However, it occurs to me that shrink_slab()
is called when there is awareness of a reclaim priority. It may be worth
passing that down so that the fraction of candidate pages is calculated
based on priority.

That said..... where is kmem_cache_shrink() ever called? The freeing of
slab pages seems to be indirect these days. Way back, kmem_cache_shrink()
used to be called directly but I'm not sure where it happens now.

> Targeted reclaim allows to target a single slab for reclaim. This is done by
> calling
> 
> 	kmem_cache_vacate(page);
> 
> It will return 1 on success, 0 if the operation failed.
> 
> 
> In order for a slabcache to support defragmentation a couple of functions
> must be defined via kmem_cache_ops. These are
> 
> 	void *get(struct kmem_cache *s, int nr, void **objects)
> 
> 	Must obtain a reference to the listed objects. SLUB guarantees that
> 	the objects are still allocated. However, other threads may be blocked
> 	in slab_free attempting to free objects in the slab. These may succeed
> 	as soon as get() returns to the slab allocator. The function must
> 	be able to detect the situation and void the attempts to handle such
> 	objects (by for example voiding the corresponding entry in the objects
> 	array).
> 
> 	No slab operations may be performed in get_reference(). Interrupts
> 	are disabled. What can be done is very limited. The slab lock
> 	for the page with the object is taken. Any attempt to perform a slab
> 	operation may lead to a deadlock.
> 
> 	get() returns a private pointer that is passed to kick. Should we
> 	be unable to obtain all references then that pointer may indicate
> 	to the kick() function that it should not attempt any object removal
> 	or move but simply remove the reference counts.
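
The get()/kick() contract is now clear enough to sketch a user against it.
Purely for illustration -- the object type, its reference counting and the
helper names below are invented and untested, not something from the patch:

	/* Hypothetical object type for the sketch */
	struct foo {
		atomic_t refcount;
		/* ... payload ... */
	};

	static void *foo_get(struct kmem_cache *s, int nr, void **v)
	{
		int i;

		/* Slab lock held, interrupts off: only take references */
		for (i = 0; i < nr; i++) {
			struct foo *f = v[i];

			if (!atomic_inc_not_zero(&f->refcount))
				v[i] = NULL;	/* concurrently being freed, skip */
		}

		return NULL;	/* no private state needed for kick() */
	}

	static void foo_kick(struct kmem_cache *s, int nr, void **v,
							void *private)
	{
		int i;

		/* No locks held: tear each object down and drop the reference */
		for (i = 0; i < nr; i++) {
			struct foo *f = v[i];

			if (!f)
				continue;
			foo_evict(f);		/* made-up teardown helper */
			if (atomic_dec_and_test(&f->refcount))
				kmem_cache_free(s, f);
		}
	}

	static struct kmem_cache_ops foo_ops = {
		.get	= foo_get,
		.kick	= foo_kick,
	};

(How foo_ops ends up attached to the cache is outside this hunk, so I am
not showing that part.)
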
Much clearer than before.

> 	void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
> 
> 	After SLUB has established references to the objects in a
> 	slab it will drop all locks and then use kick() to move objects out
> 	of the slab. The existence of the object is guaranteed by virtue of
> 	the earlier obtained references via get(). The callback may perform
> 	any slab operation since no locks are held at the time of call.
> 
> 	The callback should remove the object from the slab in some way. This
> 	may be accomplished by reclaiming the object and then running
> 	kmem_cache_free() or reallocating it and then running
> 	kmem_cache_free(). Reallocation is advantageous because the partial
> 	slabs were just sorted to have the partial slabs with the most objects
> 	first. Allocation is likely to result in filling up a slab so that
> 	it can be removed from the partial list.
> 
> 	Kick() does not return a result. SLUB will check the number of
> 	remaining objects in the slab. If all objects were removed then
> 	we know that the operation was successful.
> 

Again, much clearer.

> If a kmem_cache_vacate on a page fails then the slab has usually a pretty
> low usage ratio. Go through the slab and resequence the freelist so that
> object addresses increase as we allocate objects. This will trigger the
> cacheline prefetcher when we start allocating from the slab again and
> thereby increase allocations speed.
> 

Nice idea.

> Signed-off-by: Christoph Lameter
> 
> ---
>  include/linux/slab.h |   31 +++++
>  mm/slab.c            |    9 +
>  mm/slob.c            |    9 +
>  mm/slub.c            |  264 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 303 insertions(+), 10 deletions(-)
> 
> Index: slub/include/linux/slab.h
> ===================================================================
> --- slub.orig/include/linux/slab.h	2007-05-18 00:13:39.000000000 -0700
> +++ slub/include/linux/slab.h	2007-05-18 00:13:40.000000000 -0700
> @@ -39,6 +39,36 @@ void __init kmem_cache_init(void);
>  int slab_is_available(void);
>  
>  struct kmem_cache_ops {
> +	/*
> +	 * Called with slab lock held and interrupts disabled.
> +	 * No slab operation may be performed.
> +	 *
> +	 * Parameters passed are the number of objects to process
> +	 * and a an array of pointers to objects for which we
> +	 * need references.
> +	 *

s/a an/an/

> +	 * Returns a pointer that is passed to the kick function.
> +	 * If all objects cannot be moved then the pointer may
> +	 * indicate that this wont work and then kick can simply
> +	 * remove the references that were already obtained.
> +	 *
> +	 * The array passed to get() is also passed to kick(). The
> +	 * function may remove objects by setting array elements to NULL.
> +	 */
> +	void *(*get)(struct kmem_cache *, int nr, void **);
> +
> +	/*
> +	 * Called with no locks held and interrupts enabled.
> +	 * Any operation may be performed in kick().
> +	 *
> +	 * Parameters passed are the number of objects in the array,
> +	 * the array of pointers to the objects and the pointer
> +	 * returned by get().
> +	 *
> +	 * Success is checked by examining the number of remaining
> +	 * objects in the slab.
> +	 */
> +	void (*kick)(struct kmem_cache *, int nr, void **, void *private);
>  };
>  
>  struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
> @@ -53,6 +83,7 @@ void kmem_cache_free(struct kmem_cache *
>  unsigned int kmem_cache_size(struct kmem_cache *);
>  const char *kmem_cache_name(struct kmem_cache *);
>  int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
> +int kmem_cache_vacate(struct page *);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> Index: slub/mm/slub.c
> ===================================================================
> --- slub.orig/mm/slub.c	2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slub.c	2007-05-18 09:55:47.000000000 -0700
> @@ -1043,12 +1043,11 @@ static struct page *new_slab(struct kmem
>  	n = get_node(s, page_to_nid(page));
>  	if (n)
>  		atomic_long_inc(&n->nr_slabs);
> +
> +	page->inuse = 0;
> +	page->lockless_freelist = NULL;
>  	page->offset = s->offset / sizeof(void *);
>  	page->slab = s;
> -	page->flags |= 1 << PG_slab;
> -	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> -			SLAB_STORE_USER | SLAB_TRACE))
> -		SetSlabDebug(page);
>  
>  	start = page_address(page);
>  	end = start + s->objects * s->size;
> @@ -1066,11 +1065,20 @@ static struct page *new_slab(struct kmem
>  	set_freepointer(s, last, NULL);
>  
>  	page->freelist = start;
> -	page->lockless_freelist = NULL;
> -	page->inuse = 0;
> -out:
> -	if (flags & __GFP_WAIT)
> -		local_irq_disable();
> +
> +	/*
> +	 * page->inuse must be 0 when PageSlab(page) becomes
> +	 * true so that defrag knows that this slab is not in use.
> +	 */
> +	smp_wmb();
> +	__SetPageSlab(page);
> +	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
> +			SLAB_STORE_USER | SLAB_TRACE))
> +		SetSlabDebug(page);
> +
> + out:
> +	if (flags & __GFP_WAIT)
> +		local_irq_disable();
>  	return page;
>  }
>  
> @@ -2323,6 +2331,191 @@ void kfree(const void *x)
>  EXPORT_SYMBOL(kfree);
>  
>  /*
> + * Order the freelist so that addresses increase as object are allocated.
> + * This is useful to trigger the cpu cacheline prefetching logic.
> + */

Makes sense. However, it occurs to me that maybe this should be a separate
patch so the benefit can be measured on its own.

> +void resequence_freelist(struct kmem_cache *s, struct page *page)
> +{
> +	void *p;
> +	void *last;
> +	void *addr = page_address(page);
> +	DECLARE_BITMAP(map, s->objects);
> +
> +	bitmap_zero(map, s->objects);
> +
> +	/* Figure out which objects are on the freelist */
> +	for_each_free_object(p, s, page->freelist)
> +		set_bit(slab_index(p, s, addr), map);
> +
> +	last = NULL;
> +	for_each_object(p, s, addr)
> +		if (test_bit(slab_index(p, s, addr), map)) {
> +			if (last)
> +				set_freepointer(s, last, p);
> +			else
> +				page->freelist = p;
> +			last = p;
> +		}
> +
> +	if (last)
> +		set_freepointer(s, last, NULL);
> +	else
> +		page->freelist = NULL;
> +}
> +
> +/*
> + * Vacate all objects in the given slab.
> + *
> + * Slab must be locked and frozen. Interrupts are disabled (flags must
> + * be passed).
> + *

It may not hurt to have a VM_BUG_ON() that fires if interrupts are still
enabled when this is called.

> + * Will drop and regain and drop the slab lock. At the end the slab will
> + * either be freed or returned to the partial lists.
> + *
> + * Returns the number of remaining objects
> + */
> +static int __kmem_cache_vacate(struct kmem_cache *s,
> +		struct page *page, unsigned long flags, void **vector)
> +{
> +	void *p;
> +	void *addr = page_address(page);
> +	DECLARE_BITMAP(map, s->objects);
> +	int leftover;
> +	int objects;
> +	void *private;
> +
> +	if (!page->inuse)
> +		goto out;
> +
> +	/* Determine used objects */
> +	bitmap_fill(map, s->objects);
> +	for_each_free_object(p, s, page->freelist)
> +		__clear_bit(slab_index(p, s, addr), map);
> +
> +	objects = 0;
> +	memset(vector, 0, s->objects * sizeof(void **));
> +	for_each_object(p, s, addr) {
> +		if (test_bit(slab_index(p, s, addr), map))
> +			vector[objects++] = p;
> +	}
> +
> +	private = s->ops->get(s, objects, vector);
> +
> +	/*
> +	 * Got references. Now we can drop the slab lock. The slab
> +	 * is frozen so it cannot vanish from under us nor will
> +	 * allocations be performed on the slab. However, unlocking the
> +	 * slab will allow concurrent slab_frees to proceed.
> +	 */
> +	slab_unlock(page);
> +	local_irq_restore(flags);

I recognise that you want to restore interrupts as early as possible but
it should be noted somewhere that kmem_cache_vacate() disables interrupts
and __kmem_cache_vacate() enables them again. I had to go searching to see
where interrupts are enabled again. Maybe even a slab_lock_irq() and
slab_unlock_irq() would clarify things a little.

> +
> +	/*
> +	 * Perform the KICK callbacks to remove the objects.
> +	 */
> +	s->ops->kick(s, objects, vector, private);
> +
> +	local_irq_save(flags);
> +	slab_lock(page);
> +out:
> +	/*
> +	 * Check the result and unfreeze the slab
> +	 */
> +	leftover = page->inuse;
> +	if (leftover > 0)
> +		/*
> +		 * Cannot free. Lets at least optimize the freelist. We have
> +		 * likely touched all the cachelines with the free pointers
> +		 * already so it is cheap to do here.
> +		 */
> +		resequence_freelist(s, page);
> +	unfreeze_slab(s, page);
> +	local_irq_restore(flags);
> +	return leftover;
> +}
> +
> +/*
> + * Get a page off a list and freeze it. Must be holding slab lock.
> + */
> +static void freeze_from_list(struct kmem_cache *s, struct page *page)
> +{
> +	if (page->inuse < s->objects)
> +		remove_partial(s, page);
> +	else if (s->flags & SLAB_STORE_USER)
> +		remove_full(s, page);
> +	SetSlabFrozen(page);
> +}
> +
> +/*
> + * Attempt to free objects in a page. Return 1 if succesful.
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +	unsigned long flags;
> +	struct kmem_cache *s;
> +	int vacated = 0;
> +	void **vector = NULL;
> +
> +	/*
> +	 * Get a reference to the page. Return if its freed or being freed.
> +	 * This is necessary to make sure that the page does not vanish
> +	 * from under us before we are able to check the result.
> +	 */
> +	if (!get_page_unless_zero(page))
> +		return 0;
> +
> +	if (!PageSlab(page))
> +		goto out;
> +
> +	s = page->slab;
> +	if (!s)
> +		goto out;
> +
> +	vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL);
> +	if (!vector)
> +		return 0;

Is it worth logging this event, returning -ENOMEM or something so that
callers are aware of why kmem_cache_vacate() failed in this instance?

Also... we have called get_page_unless_zero() but if we are out of memory
here, where have we called put_page()? Maybe we should "goto out" here
with a

	if (vector)
		kfree(vector);

> +
> +	local_irq_save(flags);
> +	/*
> +	 * The implicit memory barrier in slab_lock guarantees that page->inuse
> +	 * is loaded after PageSlab(page) has been established to be true. This is
> +	 * only revelant for a newly created slab.
> +	 */
> +	slab_lock(page);
> +
> +	/*
> +	 * We may now have locked a page that may be in various stages of
> +	 * being freed. If the PageSlab bit is off then we have already
> +	 * reached the page allocator. If page->inuse is zero then we are
> +	 * in SLUB but freeing or allocating the page.
> +	 * page->inuse is never modified without the slab lock held.
> +	 *
> +	 * Also abort if the page happens to be already frozen. If its
> +	 * frozen then a concurrent vacate may be in progress.
> +	 */
> +	if (!PageSlab(page) || SlabFrozen(page) || !page->inuse)
> +		goto out_locked;
> +
> +	/*
> +	 * We are holding a lock on a slab page and all operations on the
> +	 * slab are blocking.
> +	 */
> +	if (!s->ops->get || !s->ops->kick)
> +		goto out_locked;
> +	freeze_from_list(s, page);
> +	vacated = __kmem_cache_vacate(s, page, flags, vector) == 0;

That is a little funky looking. This may be nicer:

	vacated = __kmem_cache_vacate(s, page, flags, vector);

out:
	....
	return vacated == 0;

> +out:
> +	put_page(page);
> +	kfree(vector);
> +	return vacated;
> +out_locked:
> +	slab_unlock(page);
> +	local_irq_restore(flags);
> +	goto out;
> +
> +}
> +
> +/*
> + * kmem_cache_shrink removes empty slabs from the partial lists and sorts
> + * the remaining slabs by the number of items in use. The slabs with the
> + * most items in use come first. New allocations will then fill those up
> @@ -2337,11 +2530,12 @@ int kmem_cache_shrink(struct kmem_cache
>  	int node;
>  	int i;
>  	struct kmem_cache_node *n;
> -	struct page *page;
> +	struct page *page, *page2;
>  	struct page *t;
>  	struct list_head *slabs_by_inuse =
>  		kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL);
>  	unsigned long flags;
> +	LIST_HEAD(zaplist);
>  
>  	if (!slabs_by_inuse)
>  		return -ENOMEM;
> @@ -2392,8 +2586,44 @@ int kmem_cache_shrink(struct kmem_cache
>  	for (i = s->objects - 1; i >= 0; i--)
>  		list_splice(slabs_by_inuse + i, n->partial.prev);
>  
> +	/*
> +	 * If we have no functions available to defragment the slabs
> +	 * then we are done.
> +	 */
> +	if (!s->ops->get || !s->ops->kick)
> +		goto out;
> +
> +	/* Take objects with just a few objects off the tail */
> +	while (n->nr_partial > MAX_PARTIAL) {
> +		page = container_of(n->partial.prev, struct page, lru);
> +
> +		/*
> +		 * We are holding the list_lock so we can only
> +		 * trylock the slab
> +		 */
> +		if (page->inuse > s->objects / 4)
> +			break;
> +
> +		if (!slab_trylock(page))
> +			break;
> +
> +		list_move_tail(&page->lru, &zaplist);
> +		n->nr_partial--;
> +		SetSlabFrozen(page);
> +		slab_unlock(page);
> +	}
>  out:
>  	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	/* Now we can free objects in the slabs on the zaplist */
> +	list_for_each_entry_safe(page, page2, &zaplist, lru) {
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		slab_lock(page);
> +		__kmem_cache_vacate(s, page, flags,
> +			(void **)slabs_by_inuse);
> +	}
>  }
>  
>  	kfree(slabs_by_inuse);
> @@ -3229,6 +3459,20 @@ static ssize_t ops_show(struct kmem_cach
>  		x += sprint_symbol(buf + x, (unsigned long)s->ctor);
>  		x += sprintf(buf + x, "\n");
>  	}
> +
> +	if (s->ops->get) {
> +		x += sprintf(buf + x, "get : ");
> +		x += sprint_symbol(buf + x,
> +				(unsigned long)s->ops->get);
> +		x += sprintf(buf + x, "\n");
> +	}
> +
> +	if (s->ops->kick) {
> +		x += sprintf(buf + x, "kick : ");
> +		x += sprint_symbol(buf + x,
> +				(unsigned long)s->ops->kick);
> +		x += sprintf(buf + x, "\n");
> +	}
>  	return x;
>  }
>  SLAB_ATTR_RO(ops);
> Index: slub/mm/slab.c
> ===================================================================
> --- slub.orig/mm/slab.c	2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slab.c	2007-05-18 00:13:40.000000000 -0700
> @@ -2516,6 +2516,15 @@ int kmem_cache_shrink(struct kmem_cache
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/*
> + * SLAB does not support slab defragmentation
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL(kmem_cache_vacate);
> +
>  /**
>   * kmem_cache_destroy - delete a cache
>   * @cachep: the cache to destroy
> Index: slub/mm/slob.c
> ===================================================================
> --- slub.orig/mm/slob.c	2007-05-18 00:13:39.000000000 -0700
> +++ slub/mm/slob.c	2007-05-18 00:13:40.000000000 -0700
> @@ -394,6 +394,15 @@ int kmem_cache_shrink(struct kmem_cache
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/*
> + * SLOB does not support slab defragmentation
> + */
> +int kmem_cache_vacate(struct page *page)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL(kmem_cache_vacate);
> +
>  int kmem_ptr_validate(struct kmem_cache *a, const void *b)
>  {
>  	return 0;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
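
P.S. To check that I have the calling convention of kmem_cache_vacate()
right, this is roughly the sort of outside caller I have in mind, e.g.
something driving targeted reclaim over a PFN range before memory unplug.
It is only an uncompiled sketch against this patch and vacate_slab_range()
is a name I made up:

	/*
	 * Sketch only: try to empty every slab page in a PFN range.
	 * Relies on kmem_cache_vacate() taking its own page reference
	 * and returning 1 on success, 0 on failure. Assumes the usual
	 * mm headers are already included.
	 */
	static int vacate_slab_range(unsigned long start_pfn,
						unsigned long end_pfn)
	{
		unsigned long pfn;
		int failed = 0;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page;

			if (!pfn_valid(pfn))
				continue;

			page = pfn_to_page(pfn);
			if (!PageSlab(page))
				continue;

			/* Races with freeing are handled internally */
			if (!kmem_cache_vacate(page))
				failed++;
		}

		return failed ? -EAGAIN : 0;
	}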