2002-09-26 04:04:22

by Andrew Morton

Subject: [patch 3/4] slab reclaim balancing



A patch from Ed Tomlinson which improves the way in which the kernel
reclaims slab objects.

The theory is: a cached object's usefulness is measured in terms of the
number of disk seeks which it saves. Furthermore, we assume that one
dentry or inode saves as many seeks as one pagecache page.

So we reap slab objects at the same rate as we reclaim pages. For each
1% of reclaimed pagecache we reclaim 1% of slab. (Actually, we _scan_
1% of slab for each 1% of scanned pages).

Furthermore we assume that one swapout costs twice as many seeks as one
pagecache page, and twice as many seeks as one slab object. So we
double the pressure on slab when anonymous pages are being considered
for eviction.
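To make the arithmetic concrete (the numbers here are invented for
illustration; the formulas are the ones used in shrink_caches() and
shrink_dcache_memory() in the patch below):

pages   = nr_used_zone_pages();            /* say 100000 LRU pages         */
scanned = total_scanned + nr_mapped;       /* say 1000, i.e. ~1% of them   */
ratio   = pages / (scanned + 1) + 1;       /* -> about 100                 */

entries = dentry_stat.nr_dentry / ratio + 1;   /* ~1% of all dentries      */
prune_dcache(entries);                         /* likewise for inodes etc. */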

The code works nicely, and smoothly. Possibly it does not shrink slab
hard enough, but that is now very easy to tune up and down. It is just:

ratio *= 3;

in shrink_caches().

Slab caches no longer hold onto completely empty pages. Instead, pages
are freed as soon as they have zero objects. This is possibly a
performance hit for slabs which have constructors, but it's doubtful.
Most allocations after a batch of frees are satisfied from inside
internally-fragmented pages and by the time slab gets back onto using
the wholly-empty pages they'll be cache-cold. slab would be better off
going and requesting a new, cache-warm page and reconstructing the
objects therein. (Once we have the per-cpu hot-page allocator in
place. It's happening).

As a consequence of the above, kmem_cache_shrink() is now unused. No
great loss there - the serialising effect of kmem_cache_shrink and its
semaphore in front of page reclaim was measurably bad.


Still todo:

- batch up the shrinking so we don't call into prune_dcache and
friends at high frequency asking for a tiny number of objects.

- Maybe expose the shrink ratio via a tunable.

- clean up slab.c

- highmem page reclaim in prune_icache: highmem pages can pin
inodes.


fs/dcache.c | 30 ++++++----------------------
fs/dquot.c | 19 ++++--------------
fs/inode.c | 29 ++++++++-------------------
include/linux/dcache.h | 2 -
include/linux/mm.h | 1
mm/page_alloc.c | 11 ++++++++++
mm/slab.c | 8 +++++--
mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++---------------
8 files changed, 76 insertions(+), 75 deletions(-)

--- 2.5.38/fs/dcache.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/fs/dcache.c Wed Sep 25 20:15:25 2002
@@ -329,12 +329,11 @@ static inline void prune_one_dentry(stru
void prune_dcache(int count)
{
spin_lock(&dcache_lock);
- for (;;) {
+ for (; count ; count--) {
struct dentry *dentry;
struct list_head *tmp;

tmp = dentry_unused.prev;
-
if (tmp == &dentry_unused)
break;
list_del_init(tmp);
@@ -349,12 +348,8 @@ void prune_dcache(int count)
dentry_stat.nr_unused--;

/* Unused dentry with a count? */
- if (atomic_read(&dentry->d_count))
- BUG();
-
+ BUG_ON(atomic_read(&dentry->d_count));
prune_one_dentry(dentry);
- if (!--count)
- break;
}
spin_unlock(&dcache_lock);
}
@@ -573,19 +568,11 @@ void shrink_dcache_anon(struct list_head

/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dcache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory.
*/
-int shrink_dcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dcache_memory(int ratio, unsigned int gfp_mask)
{
- int count = 0;
-
+ int entries = dentry_stat.nr_dentry / ratio + 1;
/*
* Nasty deadlock avoidance.
*
@@ -600,11 +587,8 @@ int shrink_dcache_memory(int priority, u
if (!(gfp_mask & __GFP_FS))
return 0;

- count = dentry_stat.nr_unused / priority;
-
- prune_dcache(count);
- kmem_cache_shrink(dentry_cache);
- return 0;
+ prune_dcache(entries);
+ return entries;
}

#define NAME_ALLOC_LEN(len) ((len+16) & ~15)
--- 2.5.38/fs/dquot.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/fs/dquot.c Wed Sep 25 20:15:25 2002
@@ -480,26 +480,17 @@ static void prune_dqcache(int count)

/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our dqcache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory
*/

-int shrink_dqcache_memory(int priority, unsigned int gfp_mask)
+int shrink_dqcache_memory(int ratio, unsigned int gfp_mask)
{
- int count = 0;
+ int entries = dqstats.allocated_dquots / ratio + 1;

lock_kernel();
- count = dqstats.free_dquots / priority;
- prune_dqcache(count);
+ prune_dqcache(entries);
unlock_kernel();
- kmem_cache_shrink(dquot_cachep);
- return 0;
+ return entries;
}

/*
--- 2.5.38/fs/inode.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/fs/inode.c Wed Sep 25 20:15:25 2002
@@ -386,10 +386,11 @@ void prune_icache(int goal)

count = 0;
entry = inode_unused.prev;
- while (entry != &inode_unused)
- {
+ for(; goal; goal--) {
struct list_head *tmp = entry;

+ if (entry == &inode_unused)
+ break;
entry = entry->prev;
inode = INODE(tmp);
if (inode->i_state & (I_FREEING|I_CLEAR|I_LOCK))
@@ -403,8 +404,6 @@ void prune_icache(int goal)
list_add(tmp, freeable);
inode->i_state |= I_FREEING;
count++;
- if (!--goal)
- break;
}
inodes_stat.nr_unused -= count;
spin_unlock(&inode_lock);
@@ -414,19 +413,11 @@ void prune_icache(int goal)

/*
* This is called from kswapd when we think we need some
- * more memory, but aren't really sure how much. So we
- * carefully try to free a _bit_ of our icache, but not
- * too much.
- *
- * Priority:
- * 1 - very urgent: shrink everything
- * ...
- * 6 - base-level: try to shrink a bit.
+ * more memory.
*/
-int shrink_icache_memory(int priority, int gfp_mask)
+int shrink_icache_memory(int ratio, unsigned int gfp_mask)
{
- int count = 0;
-
+ int entries = inodes_stat.nr_inodes / ratio + 1;
/*
* Nasty deadlock avoidance..
*
@@ -437,12 +428,10 @@ int shrink_icache_memory(int priority, i
if (!(gfp_mask & __GFP_FS))
return 0;

- count = inodes_stat.nr_unused / priority;
-
- prune_icache(count);
- kmem_cache_shrink(inode_cachep);
- return 0;
+ prune_icache(entries);
+ return entries;
}
+EXPORT_SYMBOL(shrink_icache_memory);

/*
* Called with the inode lock held.
--- 2.5.38/include/linux/dcache.h~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/include/linux/dcache.h Wed Sep 25 20:15:25 2002
@@ -186,7 +186,7 @@ extern int shrink_dcache_memory(int, uns
extern void prune_dcache(int);

/* icache memory management (defined in linux/fs/inode.c) */
-extern int shrink_icache_memory(int, int);
+extern int shrink_icache_memory(int, unsigned int);
extern void prune_icache(int);

/* quota cache memory management (defined in linux/fs/dquot.c) */
--- 2.5.38/include/linux/mm.h~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/include/linux/mm.h Wed Sep 25 20:15:25 2002
@@ -524,6 +524,7 @@ extern struct vm_area_struct *find_exten

extern struct page * vmalloc_to_page(void *addr);
extern unsigned long get_page_cache_size(void);
+extern unsigned int nr_used_zone_pages(void);

#endif /* __KERNEL__ */

--- 2.5.38/mm/page_alloc.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/mm/page_alloc.c Wed Sep 25 20:15:25 2002
@@ -479,6 +479,17 @@ unsigned int nr_free_pages(void)
return sum;
}

+unsigned int nr_used_zone_pages(void)
+{
+ unsigned int pages = 0;
+ struct zone *zone;
+
+ for_each_zone(zone)
+ pages += zone->nr_active + zone->nr_inactive;
+
+ return pages;
+}
+
static unsigned int nr_free_zone_pages(int offset)
{
pg_data_t *pgdat;
--- 2.5.38/mm/slab.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/mm/slab.c Wed Sep 25 20:15:25 2002
@@ -1496,7 +1496,11 @@ static inline void kmem_cache_free_one(k
if (unlikely(!--slabp->inuse)) {
/* Was partial or full, now empty. */
list_del(&slabp->list);
- list_add(&slabp->list, &cachep->slabs_free);
+/* list_add(&slabp->list, &cachep->slabs_free); */
+ if (unlikely(list_empty(&cachep->slabs_partial)))
+ list_add(&slabp->list, &cachep->slabs_partial);
+ else
+ kmem_slab_destroy(cachep, slabp);
} else if (unlikely(inuse == cachep->num)) {
/* Was full. */
list_del(&slabp->list);
@@ -1970,7 +1974,7 @@ static int s_show(struct seq_file *m, vo
}
list_for_each(q,&cachep->slabs_partial) {
slabp = list_entry(q, slab_t, list);
- if (slabp->inuse == cachep->num || !slabp->inuse)
+ if (slabp->inuse == cachep->num)
BUG();
active_objs += slabp->inuse;
active_slabs++;
--- 2.5.38/mm/vmscan.c~slabasap Wed Sep 25 20:15:25 2002
+++ 2.5.38-akpm/mm/vmscan.c Wed Sep 25 20:15:25 2002
@@ -70,6 +70,10 @@
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif

+#ifndef CONFIG_QUOTA
+#define shrink_dqcache_memory(ratio, gfp_mask) do { } while (0)
+#endif
+
/* Must be called with page's pte_chain_lock held. */
static inline int page_mapping_inuse(struct page * page)
{
@@ -97,7 +101,7 @@ static inline int is_page_cache_freeable

static /* inline */ int
shrink_list(struct list_head *page_list, int nr_pages,
- unsigned int gfp_mask, int *max_scan)
+ unsigned int gfp_mask, int *max_scan, int *nr_mapped)
{
struct address_space *mapping;
LIST_HEAD(ret_pages);
@@ -116,6 +120,10 @@ shrink_list(struct list_head *page_list,
if (TestSetPageLocked(page))
goto keep;

+ /* Double the slab pressure for mapped and swapcache pages */
+ if (page_mapped(page) || PageSwapCache(page))
+ (*nr_mapped)++;
+
BUG_ON(PageActive(page));
may_enter_fs = (gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (gfp_mask & __GFP_IO));
@@ -320,7 +328,7 @@ keep:
*/
static /* inline */ int
shrink_cache(int nr_pages, struct zone *zone,
- unsigned int gfp_mask, int max_scan)
+ unsigned int gfp_mask, int max_scan, int *nr_mapped)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -371,7 +379,8 @@ shrink_cache(int nr_pages, struct zone *

max_scan -= nr_scan;
KERNEL_STAT_ADD(pgscan, nr_scan);
- nr_pages = shrink_list(&page_list,nr_pages,gfp_mask,&max_scan);
+ nr_pages = shrink_list(&page_list, nr_pages,
+ gfp_mask, &max_scan, nr_mapped);

if (nr_pages <= 0 && list_empty(&page_list))
goto done;
@@ -522,14 +531,10 @@ refill_inactive_zone(struct zone *zone,

static /* inline */ int
shrink_zone(struct zone *zone, int max_scan,
- unsigned int gfp_mask, int nr_pages)
+ unsigned int gfp_mask, int nr_pages, int *nr_mapped)
{
unsigned long ratio;

- /* This is bogus for ZONE_HIGHMEM? */
- if (kmem_cache_reap(gfp_mask) >= nr_pages)
- return 0;
-
/*
* Try to keep the active list 2/3 of the size of the cache. And
* make sure that refill_inactive is given a decent number of pages.
@@ -547,7 +552,8 @@ shrink_zone(struct zone *zone, int max_s
atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
}
- nr_pages = shrink_cache(nr_pages, zone, gfp_mask, max_scan);
+ nr_pages = shrink_cache(nr_pages, zone, gfp_mask,
+ max_scan, nr_mapped);
return nr_pages;
}

@@ -557,6 +563,9 @@ shrink_caches(struct zone *classzone, in
{
struct zone *first_classzone;
struct zone *zone;
+ int ratio;
+ int nr_mapped = 0;
+ int pages = nr_used_zone_pages();

first_classzone = classzone->zone_pgdat->node_zones;
for (zone = classzone; zone >= first_classzone; zone--) {
@@ -581,16 +590,28 @@ shrink_caches(struct zone *classzone, in
max_scan = zone->nr_inactive >> priority;
if (max_scan < to_reclaim * 2)
max_scan = to_reclaim * 2;
- unreclaimed = shrink_zone(zone, max_scan, gfp_mask, to_reclaim);
+ unreclaimed = shrink_zone(zone, max_scan,
+ gfp_mask, to_reclaim, &nr_mapped);
nr_pages -= to_reclaim - unreclaimed;
*total_scanned += max_scan;
}

- shrink_dcache_memory(priority, gfp_mask);
- shrink_icache_memory(1, gfp_mask);
-#ifdef CONFIG_QUOTA
- shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
+ /*
+ * Here we assume it costs one seek to replace a lru page and that
+ * it also takes a seek to recreate a cache object. With this in
+ * mind we age equal percentages of the lru and ageable caches.
+ * This should balance the seeks generated by these structures.
+ *
+ * NOTE: for now I do this for all zones. If we find this is too
+ * aggressive on large boxes we may want to exclude ZONE_HIGHMEM
+ *
+ * If we're encountering mapped pages on the LRU then increase the
+ * pressure on slab to avoid swapping.
+ */
+ ratio = (pages / (*total_scanned + nr_mapped + 1)) + 1;
+ shrink_dcache_memory(ratio, gfp_mask);
+ shrink_icache_memory(ratio, gfp_mask);
+ shrink_dqcache_memory(ratio, gfp_mask);
return nr_pages;
}




2002-09-26 11:35:41

by Ed Tomlinson

Subject: Re: [patch 3/4] slab reclaim balancing

On September 26, 2002 12:08 am, Andrew Morton wrote:
> A patch from Ed Tomlinson which improves the way in which the kernel
> reclaims slab objects.

Thanks Andrew.

Does this look any better than the last version? I have moved the shrinking
calls from slab to vmscan. The idea being that this is really VM related, not
necessarily linked directly to any one cache, so why should it be in slab? This
eliminated an export too...

As it stands now I have not implemented a delete function. It would be pretty
simple to do so though and may well be needed later.
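
For illustration, a client of the new interface looks roughly like this
(the "foo" cache, its statistics and its prune function are hypothetical,
used only to show the shape of the callback contract documented in the
mm.h hunk of the patch below):

/* hypothetical example of the callback contract -- not part of the patch */
static int shrink_foo_memory(int nr, unsigned int gfp_mask)
{
	if (!nr)			/* nr == 0: report how many objects we hold */
		return foo_stat.nr_objects;
	if (!(gfp_mask & __GFP_FS))	/* cannot age in this context: hand nr back */
		return nr;
	prune_foo_cache(nr);		/* age nr objects */
	return 0;
}

/* at cache creation time: */
set_shrinker(DEFAULT_SEEKS, shrink_foo_memory);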

Thoughts?
Ed

-------- against 38-mm2
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.549 -> 1.554
# include/linux/mm.h 1.81 -> 1.83
# fs/dcache.c 1.30 -> 1.32
# mm/vmscan.c 1.107 -> 1.111
# include/linux/slab.h 1.13 -> 1.16
# fs/dquot.c 1.45 -> 1.47
# mm/slab.c 1.28 -> 1.31
# fs/inode.c 1.69 -> 1.71
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/18 [email protected] 1.550
# slab_callbacks_A2
# --------------------------------------------
# 02/09/18 [email protected] 1.551
# convert ageable slab shrinking to use callbacks.
# --------------------------------------------
# 02/09/19 [email protected] 1.552
# move the shrink functions to vmscan which is where the really do belong
# --------------------------------------------
# 02/09/20 [email protected] 1.553
# small improvements
# --------------------------------------------
# 02/09/20 [email protected] 1.554
# cleanup
# --------------------------------------------
#
diff -Nru a/fs/dcache.c b/fs/dcache.c
--- a/fs/dcache.c Fri Sep 20 10:35:51 2002
+++ b/fs/dcache.c Fri Sep 20 10:35:51 2002
@@ -570,9 +570,10 @@
* This is called from kswapd when we think we need some
* more memory.
*/
-int shrink_dcache_memory(int ratio, unsigned int gfp_mask)
+int shrink_dcache_memory(int nr, unsigned int gfp_mask)
{
- int entries = dentry_stat.nr_dentry / ratio + 1;
+ if (!nr)
+ return dentry_stat.nr_dentry;
/*
* Nasty deadlock avoidance.
*
@@ -585,10 +586,10 @@
* block allocations, but for now:
*/
if (!(gfp_mask & __GFP_FS))
- return 0;
+ return nr;

- prune_dcache(entries);
- return entries;
+ prune_dcache(nr);
+ return 0;
}

#define NAME_ALLOC_LEN(len) ((len+16) & ~15)
@@ -1328,6 +1329,8 @@
NULL, NULL);
if (!dentry_cache)
panic("Cannot create dentry cache");
+
+ set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);

#if PAGE_SHIFT < 13
mempages >>= (13 - PAGE_SHIFT);
@@ -1401,6 +1404,8 @@
SLAB_HWCACHE_ALIGN, NULL, NULL);
if (!dquot_cachep)
panic("Cannot create dquot SLAB cache");
+
+ set_shrinker(DEFAULT_SEEKS, shrink_dquot_memory);
#endif

dcache_init(mempages);
diff -Nru a/fs/dquot.c b/fs/dquot.c
--- a/fs/dquot.c Fri Sep 20 10:35:51 2002
+++ b/fs/dquot.c Fri Sep 20 10:35:51 2002
@@ -55,6 +55,7 @@
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/time.h>
#include <linux/types.h>
#include <linux/string.h>
@@ -483,14 +484,15 @@
* more memory
*/

-int shrink_dqcache_memory(int ratio, unsigned int gfp_mask)
+int shrink_dqcache_memory(int nr, unsigned int gfp_mask)
{
- int entries = dqstats.allocated_dquots / ratio + 1;
+ if (!nr)
+ return dqstats.allocated_dquots;

lock_kernel();
- prune_dqcache(entries);
+ prune_dqcache(nr);
unlock_kernel();
- return entries;
+ return 0;
}

/*
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c Fri Sep 20 10:35:51 2002
+++ b/fs/inode.c Fri Sep 20 10:35:51 2002
@@ -419,9 +419,10 @@
* This is called from kswapd when we think we need some
* more memory.
*/
-int shrink_icache_memory(int ratio, unsigned int gfp_mask)
+int shrink_icache_memory(int nr, unsigned int gfp_mask)
{
- int entries = inodes_stat.nr_inodes / ratio + 1;
+ if (!nr)
+ return inodes_stat.nr_inodes;
/*
* Nasty deadlock avoidance..
*
@@ -430,10 +431,10 @@
* in clear_inode() and friends..
*/
if (!(gfp_mask & __GFP_FS))
- return 0;
+ return nr;

- prune_icache(entries);
- return entries;
+ prune_icache(nr);
+ return 0;
}
EXPORT_SYMBOL(shrink_icache_memory);

@@ -1098,4 +1099,6 @@
NULL);
if (!inode_cachep)
panic("cannot create inode slab cache");
+
+ set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
}
diff -Nru a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h Fri Sep 20 10:35:51 2002
+++ b/include/linux/mm.h Fri Sep 20 10:35:51 2002
@@ -395,6 +395,30 @@


/*
+ * Prototype to add a shrinker callback for ageable caches.
+ *
+ * These functions are passed a count and a gfpmask. They should
+ * return one of three results.
+ *
+ * when nr = 0 return number of entries in the cache(s)
+ * when nr > 0 and we can age return 0
+ * when nr > 0 and we cannot age return nr
+ *
+ * if the cache(s) 'disappears' passing nr = 0 must return 0
+ */
+typedef int (*shrinker_t)(int, unsigned int);
+
+/*
+ * Add an aging callback. The int is the number of 'seeks' it takes
+ * to recreate one of the objects that these functions age.
+ */
+
+#define DEFAULT_SEEKS 2
+
+extern void set_shrinker(int, shrinker_t);
+
+
+/*
* If the mapping doesn't provide a set_page_dirty a_op, then
* just fall through and assume that it wants buffer_heads.
* FIXME: make the method unconditional.
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c Fri Sep 20 10:35:51 2002
+++ b/mm/vmscan.c Fri Sep 20 10:35:51 2002
@@ -73,10 +73,36 @@
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif

-#ifndef CONFIG_QUOTA
-#define shrink_dqcache_memory(ratio, gfp_mask) do { } while (0)
-#endif
+/*
+ * The list of shrinker callbacks used by to apply pressure to
+ * ageable caches.
+ */
+struct shrinker_s {
+ shrinker_t shrinker;
+ struct list_head next;
+ int seeks; /* seeks to recreate an obj */
+ int nr; /* objs pending delete */
+};
+
+static LIST_HEAD(shrinker_list);
+static spinlock_t shrinker_lock = SPIN_LOCK_UNLOCKED;

+/*
+ * Add a shrinker to be called from the vm
+ */
+void set_shrinker(int seeks, shrinker_t theshrinker)
+{
+ struct shrinker_s *shrinkerp;
+ shrinkerp = kmalloc(sizeof(struct shrinker_s),GFP_KERNEL);
+ BUG_ON(!shrinkerp);
+ shrinkerp->shrinker = theshrinker;
+ shrinkerp->seeks = seeks;
+ shrinkerp->nr = 0;
+ spin_lock(&shrinker_lock);
+ list_add(&shrinkerp->next, &shrinker_list);
+ spin_unlock(&shrinker_lock);
+}
+
/* Must be called with page's pte_chain_lock held. */
static inline int page_mapping_inuse(struct page * page)
{
@@ -572,32 +598,6 @@
}

/*
- * FIXME: don't do this for ZONE_HIGHMEM
- */
-/*
- * Here we assume it costs one seek to replace a lru page and that it also
- * takes a seek to recreate a cache object. With this in mind we age equal
- * percentages of the lru and ageable caches. This should balance the seeks
- * generated by these structures.
- *
- * NOTE: for now I do this for all zones. If we find this is too aggressive
- * on large boxes we may want to exclude ZONE_HIGHMEM.
- *
- * If we're encountering mapped pages on the LRU then increase the pressure on
- * slab to avoid swapping.
- */
-static void shrink_slab(int total_scanned, int gfp_mask)
-{
- int shrink_ratio;
- int pages = nr_used_zone_pages();
-
- shrink_ratio = (pages / (total_scanned + 1)) + 1;
- shrink_dcache_memory(shrink_ratio, gfp_mask);
- shrink_icache_memory(shrink_ratio, gfp_mask);
- shrink_dqcache_memory(shrink_ratio, gfp_mask);
-}
-
-/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
@@ -638,6 +638,45 @@
break;
}
return ret;
+}
+
+
+#define SHRINK_BATCH 32
+/*
+ * Call the shrink functions to age shrinkable caches
+ *
+ * Here we assume it costs one seek to replace a lru page and that it also
+ * takes a seek to recreate a cache object. With this in mind we age equal
+ * percentages of the lru and ageable caches. This should balance the seeks
+ * generated by these structures.
+ *
+ * If the vm encounted mapped pages on the LRU it increase the pressure on
+ * slab to avoid swapping.
+ *
+ * FIXME: do not do for zone highmem
+ */
+int
+shrink_slab(int scanned, unsigned int gfp_mask)
+{
+ struct list_head *p, *n;
+ int pages = nr_used_zone_pages();
+
+ spin_lock(&shrinker_lock);
+ list_for_each_safe(p, n, &shrinker_list) {
+ struct shrinker_s *shrinkerp = list_entry(p, struct shrinker_s, next);
+ int entries = (*shrinkerp->shrinker)(0, gfp_mask);
+ if (!entries)
+ continue;
+ shrinkerp->nr += ((unsigned long)scanned*shrinkerp->seeks*entries) / pages + 1;
+ if (shrinkerp->nr > SHRINK_BATCH) {
+ spin_unlock(&shrinker_lock);
+ shrinkerp->nr = (*shrinkerp->shrinker)(shrinkerp->nr, gfp_mask);
+ spin_lock(&shrinker_lock);
+ }
+ }
+ spin_unlock(&shrinker_lock);
+
+ return 0;
}

/*

--------
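
A worked example of the shrink_slab() arithmetic above, with invented
numbers (seeks here is DEFAULT_SEEKS = 2):

pages   = nr_used_zone_pages();            /* say 100000                     */
scanned = 500;                             /* LRU pages scanned this pass    */
entries = (*shrinker)(0, gfp_mask);        /* say 40000 dentries             */

nr += scanned * seeks * entries / pages + 1;   /* 500*2*40000/100000 + 1 = 401 */
if (nr > SHRINK_BATCH)                         /* 401 > 32: prune now          */
	nr = (*shrinker)(nr, gfp_mask);

/* with only 10 pages scanned the increment is just 9, so several passes
   accumulate in ->nr before the callback gets invoked at all */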

2002-09-26 14:08:29

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

> Slab caches no longer hold onto completely empty pages. Instead, pages
> are freed as soon as they have zero objects. This is possibly a
> performance hit for slabs which have constructors, but it's doubtful.

It could be a performance hit for slabs with just one object per page - e.g. the
page-sized names cache, used in every syscall that has a path name as a
parameter.

Ed, have you benchmarked that there is no noticeable slowdown?
E.g. test the time needed for stat(".") on UP; otherwise the SMP arrays
would perform the caching.

--
Manfred

2002-09-26 14:15:43

by William Lee Irwin III

Subject: Re: [patch 3/4] slab reclaim balancing

At some point in the past, Ed Tomlinson wrote:
>> Slab caches no longer hold onto completely empty pages. Instead, pages
>> are freed as soon as they have zero objects. This is possibly a
>> performance hit for slabs which have constructors, but it's doubtful.

On Thu, Sep 26, 2002 at 04:13:28PM +0200, Manfred Spraul wrote:
> It could be a performance hit for slab with just one object - e.g the
> page sized names cache, used in every syscall that has a path name as a
> parameter.
> Ed, have you benchmarked that there is no noticable slowdown?
> e.g. test the time needed for stat("."). on UP, otherwise the SMP arrays
> would perform the caching.

This might need testing on large-memory 64-bit boxen for that, since
ZONE_NORMAL pressure outweighs many other considerations on my boxen.


Cheers,
Bill

2002-09-26 15:06:02

by Ed Tomlinson

Subject: Re: [patch 3/4] slab reclaim balancing

Actually, this is not quite as aggressive as mentioned below. There is an
optimization that leaves one slab on the cachep->slabs_partial list...
This should prevent problems with caches with small numbers of objects.

Ed Tomlinson


Andrew Morton wrote:
>
>
> A patch from Ed Tomlinson which improves the way in which the kernel
> reclaims slab objects.
> [...]
>
> Slab caches no longer hold onto completely empty pages. Instead, pages
> are freed as soon as they have zero objects. This is possibly a
> performance hit for slabs which have constructors, but it's doubtful.
> [...]

2002-09-26 15:19:30

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

William Lee Irwin III wrote:
>
> This might need testing on large-memory 64-bit boxen for that, since
> ZONE_NORMAL pressure outweighs many other considerations on my boxen.
>

Most of them will be SMP, correct? For SMP, 95% of the memory operations
occur in the per-cpu arrays. I'm not sure if a slab is really the right
backend behind the per-cpu arrays - perhaps an algorithm that's more
aggressive toward finding freeable pages would be a better choice, even
if it needs more cpu time for sorting/searching through the partially
allocated pages.

--
Manfred

2002-09-26 17:31:59

by Andrew Morton

Subject: Re: [patch 3/4] slab reclaim balancing

Manfred Spraul wrote:
>
> > Slab caches no longer hold onto completely empty pages. Instead, pages
> > are freed as soon as they have zero objects. This is possibly a
> > performance hit for slabs which have constructors, but it's doubtful.
>
> It could be a performance hit for slab with just one object - e.g the
> page sized names cache, used in every syscall that has a path name as a
> parameter.
>
> Ed, have you benchmarked that there is no noticable slowdown?
> e.g. test the time needed for stat("."). on UP, otherwise the SMP arrays
> would perform the caching.
>

(What Ed said - we do hang onto one page. And I _have_ measured
cost in kmem_cache_shrink...)

For those things, the caching should be performed in the page
allocator. This way, when names_cache requests a cache-hot page,
it may get a page which was very recently a (say) pagetable page,
rather than restricting itself only to pages which used to be
a names_cache page.

CPU caches are per-cpu global. So the hot pages list should be
per-cpu global also.

Martin Bligh seems to have the patches up and running. It isn't
very fine-tuned yet, but initial indications are promising:

Before:
Elapsed: 20.18s User: 192.914s System: 48.292s CPU: 1195.6%

After:
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%

That's for a kernel compile.

And from the profiles, it appears that the benefit is coming
from cache locality, not from the buddylist lock amortisation
which we've also designed into those patches.

I need to stop being slack, and get that code into the pipeline.

2002-09-26 18:42:39

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

Andrew Morton wrote:
>
> (What Ed said - we do hang onto one page. And I _have_ measured
> cost in kmem_cache_shrink...)
>
I totally agree about kmem_cache_shrink - it's total abuse that
fs/dcache.c calls it regularly. It was intended to be called before
module unload, or during ifdown, etc.
On NUMA, it's probably worse, because it does an IPI to all cpus.
dcache.c should not call kmem_cache_shrink, and kmem_cache_reap should
be improved.

> Before:
> Elapsed: 20.18s User: 192.914s System: 48.292s CPU: 1195.6%
>
> After:
> Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
>
> That's for a kernel compile.
>
UP or SMP?
And was that the complete patch, or just the modification to slab.c?

I've made a microbenchmark of kmem_cache_alloc/free of 4 kb objects, on
UP, AMD Duron:
              1 object       4 objects
cur           145 cycles     662 cycles
patched       133 cycles    2733 cycles

Summary:
* for one object, the patch is a slight performance improvement. The
reason is that the fallback from partial to free list in
kmem_cache_alloc_one is avoided.
* the overhead of kmem_cache_grow/shrink is around 500 cycles, nearly a
slowdown of factor 4. The cache had no constructor/destructor.
* everything in cache-hot state. [100 runs in a loop, loop overhead
subtracted. 98 or 99 runs completed in the given time, except for
patched-4obj, where 24 runs completed in 2735 cycles, 72 in 2733 cycles]
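
A sketch of the kind of loop being measured (assumptions: get_cycles()
for the counter, a cache without constructor/destructor; the actual
harness may differ):

kmem_cache_t *cp = kmem_cache_create("bench4k", 4096, 0, 0, NULL, NULL);
void *obj[4];
cycles_t t0, t1;
int i, n;			/* n = 1 or 4 objects per run */

t0 = get_cycles();
for (i = 0; i < n; i++)
	obj[i] = kmem_cache_alloc(cp, GFP_KERNEL);
for (i = 0; i < n; i++)
	kmem_cache_free(cp, obj[i]);
t1 = get_cycles();
/* repeat ~100 times, subtract the measured loop overhead */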


For SMP and slabs that are per-cpu cached, the change could be right,
because the arrays should absorb bursts. But I do not think that the
change is the right approach for UP.

--
Manfred

2002-09-26 19:44:02

by Andrew Morton

Subject: Re: [patch 3/4] slab reclaim balancing

Manfred Spraul wrote:
>
> Andrew Morton wrote:
> >
> > (What Ed said - we do hang onto one page. And I _have_ measured
> > cost in kmem_cache_shrink...)
> >
> I totally agree about kmem_cache_shrink - it's total abuse that
> fs/dcache.c calls it regularly. It was intended to be called before
> module unload, or during ifdown, etc.
> On NUMA, it's probably worse, because it does an IPI to all cpus.
> dcache.c should not call kmem_cache_shrink, and kmem_cache_reap should
> be improved.

Sorry, I meant kmem_cache_reap. It was the call into there from
vmscan.c which was the problem. On my 4-way (which is fairly
immune to lock contention for some reason), I was seeing *25%*
contention on the spinlock inside the cache_chain_sem semaphore.
With, of course, tons of context switching going on.

And it was giving a sort of turnstile effect, where each CPU needed
to pass through kmem_cache_reap before entering page reclaim proper.
Which maybe wasn't so bad in the 2.4 setup. But in 2.5 the VM is
basically per-zone, so different CPUs can run reclaim against different
zones with no contention of any form. This is especially important on
NUMA.

But the above could have been solved in different ways, and there are
still global effects with the requirement to prune slab. I expect that
the batching and perhaps a per-slab exclusion mechanism will fix up
any remaining problems.

> > Before:
> > Elapsed: 20.18s User: 192.914s System: 48.292s CPU: 1195.6%
> >
> > After:
> > Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
> >
> > That's for a kernel compile.
> >
> UP or SMP?
> And was that the complete patch, or just the modification to slab.c?

That is the effect of per-cpu hotlist within the page allocator
on a 16- or 32-way NUMAQ.

It shows some of the benefit which we can get by carefully recycling
known-to-be-hot pages back into places where the page is known to
soon be touched by this CPU. Most of the gains were in do_anonymous_page,
copy_foo_user(), zap_pte_range(), etc. Where you'd expect.

I expect to end up with two forms of page allocation and freeing:
alloc_hot_page/alloc_cold_page and free_hot_page/free_cold_page

> I've made a microbenchmark of kmem_cache_alloc/free of 4 kb objects, on
> UP, AMD Duron:
> 1 object 4 objects
> cur 145 cycles 662 cycles
> patched 133 cycles 2733 cycles
>
> Summary:
> * for one object, the patch is a slight performance improvement. The
> reason is that the fallback from partial to free list in
> kmem_cache_alloc_one is avoided.
> * the overhead of kmem_cache_grow/shrink is around 500 cycles, nearly a
> slowdown of factor 4. The cache had no constructor/destructor.
> * everything cache hot state. [100 runs in a loop, loop overhead
> substracted. 98 or 99 runs completed in the given time, except for
> patched-4obj, where 24 runs completed in 2735 cycles, 72 in 2733 cycles]

Was the microbenchmark actually touching the memory which it was
allocating from slab? If so then yes, we'd expect to see cache
misses against those cold pages coming out of the buddy.

> For SMP and slabs that are per-cpu cached, the change could be right,
> because the arrays should absorb bursts. But I do not think that the
> change is the right approach for UP.

I'd suggest that we wait until we have slab freeing its pages into
the hotlists, and allocating from them. That should pull things back.

2002-09-26 20:44:14

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

Andrew Morton wrote:
>
> Was the microbenchmark actually touching the memory which it was
> allocating from slab? If so then yes, we'd expect to see cache
> misses against those cold pages coming out of the buddy.
>

No, it was just measuring the cost of the kmem_cache_grow/shrink.

Btw, 140 cycles for kmem_cache_alloc+free is inflated - someone enabled
kmem_cache_alloc_head() even in the no-debugging version.
As expected, done by Andrea, who neither bothered to cc me, nor actually
understood the code.

>
>>For SMP and slabs that are per-cpu cached, the change could be right,
>>because the arrays should absorb bursts. But I do not think that the
>>change is the right approach for UP.
>
>
> I'd suggest that we wait until we have slab freeing its pages into
> the hotlists, and allocating from them. That should pull things back.
>
You are asking an interesting question:

The slab is by design far from LIFO - it tries to find pages with no
allocated objects that can be returned to the page allocator. It
doesn't try to optimize for cache hit rates.

Is that actually the right approach? For large objects, it would be
possible to cripple the freeable slabs list, and to perform the cache
hit optimization (i.e. per-cpu LIFO) in page_alloc.c, but that doesn't
work with small objects.

On SMP, the per-cpu arrays are the LIFO and should give good cache hit
rates. On UP, I haven't enabled them, because they could increase the
internal fragmentation of the slabs.

Perhaps we should enable the arrays on UP, too, and thus improve the
cache hit rates? If there is no increase in fragmentation, we could
ignore it. Otherwise we could replace the 3-list Bonwick slab with
another backend, something that's stronger at reducing the internal
fragmentation.

--
Manfred


2002-09-26 21:34:08

by Andrew Morton

Subject: Re: [patch 3/4] slab reclaim balancing

Manfred Spraul wrote:
>
> Andrew Morton wrote:
> >
> > Was the microbenchmark actually touching the memory which it was
> > allocating from slab? If so then yes, we'd expect to see cache
> > misses against those cold pages coming out of the buddy.
> >
>
> No, it was just measuring the cost of the kmem_cache_grow/shrink.
>
> Btw, 140 cycles for kmem_cache_alloc+free is inflated - someone enabled
> kmem_cache_alloc_head() even in the no-debugging version.
> As expected, done by Andrea, who neither bothered to cc me, nor actually
> understood the code.

hm, OK. Sorry, I did not realise that you were this closely
interested/involved with slab, so things have been sort of
going on behind your back :(

> >
> >>For SMP and slabs that are per-cpu cached, the change could be right,
> >>because the arrays should absorb bursts. But I do not think that the
> >>change is the right approach for UP.
> >
> >
> > I'd suggest that we wait until we have slab freeing its pages into
> > the hotlists, and allocating from them. That should pull things back.
> >
> You are asking a interesting question:
>
> The slab is by design far from LIFO - it tries to find pages with no
> allocated objects, that are possible to return to the page allocator. It
> doesn't try to optimize for cache hit rates.
>
> Is that actually the right approach? For large objects, it would be
> possible to cripple the freeable slabs list, and to perform the cache
> hit optimization (i.e. per-cpu LIFO) in page_alloc.c, but that doesn't
> work with small objects.

Well with a, what? 100:1 speed ratio, we'll generally get best results
from optimising for locality/recency of reference.

> On SMP, the per-cpu arrays are the LIFO and should give good cache hit
> rates. On UP, I haven't enabled them, because they could increase the
> internal fragmentation of the slabs.
>
> Perhaps we should enable the arrays on UP, too, and thus improve the
> cache hit rates? If there is no increase in fragmentation, we could
> ignore it. Otherwise we could replace the 3-list Bonwick slab with
> another backend, something that's stronger at reducing the internal
> fragmentation.

Definitely worthy of investigation. Memory sizes are increasing,
and the cached-versus-noncached latencies are increasing. Both
these say "optimise for cache hits".

Plus we'd lose a ton of ifdefs if we enabled it on UP as well...

Bill wrote a couple of handy slab-monitoring tools, btw.
http://www.zip.com.au/~akpm/linux/patches/ - I use bloatmeter.

2002-09-27 00:37:37

by Ed Tomlinson

Subject: Re: [patch 3/4] slab reclaim balancing

On September 26, 2002 05:39 pm, Andrew Morton wrote:
> Manfred Spraul wrote:
> > Andrew Morton wrote:
> > Btw, 140 cycles for kmem_cache_alloc+free is inflated - someone enabled
> > kmem_cache_alloc_head() even in the no-debugging version.
> > As expected, done by Andrea, who neither bothered to cc me, nor actually
> > understood the code.
>
> hm, OK. Sorry, I did not realise that you were this closely
> interested/involved with slab, so things have been sort of
> going on behind your back :(

Nor did I realize this... The reasoning behind quickly giving pages back to the
system was fairly simple. Previous VM experiments showed that lazy freeing
of pages from the LRU was not a good performer. When the first versions of
slablru went into mm we noticed that large numbers of pages (thousands) were
free but not reclaimed. This was working as designed - the conclusion was that
slablru was over-designed and that lazy removal was probably not such a good
idea. The slabasap patch that started this thread was the fix for this.

There is no dispute that in some cases it will be slower from a slab perspective. As
Andrew and you have discussed there are things that can be done to speed things
up. Is not the question really, "Are the vm and slab faster together when slab pages
are freed asap?"

As it stands now we could remove quite a bit of code from slab. There is no longer
a need for most of the kmem_cache_shrink family, nor is kmem_cache_reap needed
any more. This simplifies slab. If we also enable the per-cpu arrays for UP the code
is even cleaner (and hopefully faster).

Manfred, slab is currently using typedefs. Andrew has stated that he and Linus are
trying to remove these from the kernel. When I code the cleanup patches for slabasap
(provided it proves itself) shall I clean them out too?

Ed Tomlinson

2002-09-27 15:54:40

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

Andrew Morton wrote:
>>
>>Is that actually the right approach? For large objects, it would be
>>possible to cripple the freeable slabs list, and to perform the cache
>>hit optimization (i.e. per-cpu LIFO) in page_alloc.c, but that doesn't
>>work with small objects.
>
>
> Well with a, what? 100:1 speed ratio, we'll generally get best results
> from optimising for locality/recency of reference.
>
You misunderstood me:

AFAICS slab.c has two weak spots:
* cache hit rates are ignored on UP, and for objects > PAGE_SIZE on both
SMP and UP.
* freeable pages are not returned efficiently to page_alloc.c, neither
on SMP nor on UP. On SMP, this is a big problem, because the
cache_chain_semaphore is overloaded.

I just wanted to say that a hotlist in page_alloc.c is not able to
replace a hotlist in slab.c, because many objects are smaller than page
size. Both lists are needed.

--
Manfred


2002-09-27 17:19:46

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

Ed Tomlinson wrote:
>
> There is no dispute that in some cases it will be slower from a slab perspective. As
> Andrew and you have discussed there are things that can be done to speed things
> up. Is not the question really, "Are the vm and slab faster together when slab pages
> are freed asap?"
>

Some caches are quite bursty - what about the 2 kB generic cache that is
used for the MTU-sized socket buffers? With interrupt mitigation
enabled, I'd expect that a GigE NIC could allocate a few dozen 2 kB
objects in every interrupt, and I don't think it's the right approach to
effectively disable the cache in slab.c for such loads.

I do not have many data points, but in a netbench run on a 4-way Xeon,
kmem_cache_free is called 5 million times/minute, with an additional 4
million calls to kfree - I agree that _reap right now is bad, but IMHO
it's questionable if the fix should be inside the hot path of the allocator.

What about this approach:

* enable batching even on UP, with a LIFO array in front of the lists.

* After flushing a batch back into the lists, the number of free objects
in the lists is calculated. If freeable pages exist and the number
exceeds a target, then the freeable pages above the target are returned
to the page buddy.
* The target of freeable pages is increased by kmem_cache_grow - if we
had to get another page from gfp, then our own cache was too small.

Since the test for the number of freeable objects only happens after
batching, i.e. in the worst case once for every 30 kmem_cache_free
calls, it doesn't matter if it's a bit expensive.
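
A rough sketch of what that free path could look like (the batch fields
and helper names here are invented, purely to illustrate the proposal):

static void flush_batch(kmem_cache_t *cachep)
{
	/* push the batched objects back onto their slabs ... */
	free_batched_objects(cachep);
	/* ... then, at most once per batch, trim freeable pages above target */
	while (cachep->free_objects > cachep->free_target &&
			!list_empty(&cachep->slabs_free))
		return_one_free_slab(cachep);	/* back to the page buddy */
}

void kmem_cache_free(kmem_cache_t *cachep, void *objp)
{
	if (cachep->batch_count == BATCH_SIZE)	/* slow path, ~1 in 30 frees */
		flush_batch(cachep);
	cachep->batch[cachep->batch_count++] = objp;
}

/* and in kmem_cache_grow(): having to ask gfp for a fresh page means the
   reserve was too small, so raise cachep->free_target */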

Open problems:

* What about caches with large objects (>PAGE_SIZE, e.g. the bio
MAX_PAGES object, or the 16 kB socket buffers used over loopback)? Right
now, they are not cached in the per-cpu arrays, to reduce the memory
pressure. If the list processing becomes slower, we would slow down
these slab users. But OTOH if you memcpy 16 kB, then a few cycles in
kmalloc probably won't matter much.

* Where to flush the per-cpu caches? On a 16-way system, they can
contain up to 4000 objects for each cache. Right now that happens in
kmem_cache_reap(). One flush per second would be enough, just to avoid
objects on lightly loaded slabs remaining forever in the per-cpu
arrays and preventing pages from becoming freeable.

* Where is the freeable-pages limit decreased?


--
Manfred

2002-09-27 18:21:36

by Andrew Morton

Subject: Re: [patch 3/4] slab reclaim balancing

Manfred Spraul wrote:
>
> Ed Tomlinson wrote:
> >
> > There is no dispute that in some cases it will be slower from a slab perspective. As
> > Andrew and you have discussed there are things that can be done to speed things
> > up. Is not the question really, "Are the vm and slab faster together when slab pages
> > are freed asap?"
> >
>
> Some caches are quite bursty - what about the 2 kB generic cache that is
> used for the MTU sized socket buffers? With interrupt mitigation
> enabled, I'd expect that a GigE nic could allocate a few dozend 2kb
> objects in every interrupt, and I don't think it's the right approach to
> effectively disable the cache in slab.c for such loads.

Well that's all rather broken at present. Your gigE NIC has just
trashed 60k of cache-warm memory by putting it under busmastering
receive.

It would be better to use separate slabs for Rx and Tx. Rx ones
are backed by the cold page allocator and Tx ones by the hot page
allocator. This is an improvement, but Rx is still doing suboptimal
things because it "warms up" memory and we're not taking advantage
of that. That's starting to get tricky.

There are many things we can do. We need to get the core in place
and start using and tuning it in a few places before deciding
whether to go nuts using the same technique everywhere. I hope we do ;)

> I do not have many data points, but in a netbench run on 4-way Xeon,
> kmem_cache_free is called 5 million times/minute, and additional 4
> million calls to kfree - I agree that _reap right now is bad, but IMHO
> it's questionable if the fix should be inside the hot-path of the allocator.
>
> What about this approach:
>
> * enable batching even on UP, with a LIFO array in front of the lists.

That has to be right.

> * After flushing a batch back into the lists, the number of free objects
> in the lists is calculated. If freeable pages exist and the number
> exceeds a target, then the freeable pages above the target are returned
> to the page buddy.

Probably OK for now. But slab should _not_ hold onto an unused,
cache-warm page. Because do_anonymous_page() may want one.

> * The target of freeable pages is increased by kmem_cache_grow - if we
> had to get another page from gfp, then our own cache was too small.
>
> Since the test for the number of freeable objects only happens after
> batching, i.e. in the worst case once for every 30 kmem_cache_free
> calls, it doesn't matter if it's a bit expensive.
>
> Open problems:
>
> * What about cache with large objects (>PAGE_SIZE, e.g. the bio
> MAX_PAGES object, or the 16 kb socket buffers used over loopback)? Right
> now, they are not cached in the per-cpu arrays, to reduce the memory
> pressure. If the list processing becomes slower, we would slow down
> these slab users. But OTHO if you memcpy 16 kB, then a few cycles in
> kmalloc probably won't matter much.
>
> * Where to flush the per-cpu caches? On a 16-way system, they can
> contain up to 4000 objects, for each cache. Right now that happens in
> kmem_cache_reap(). One flush per second would be enough, just to avoid
> that on lightly loaded slabs, objects remain forever in the per-cpu
> arrays and prevent pages from becoming freeable.

kswapd could do that, or set up a timer and use pdflush_operation.

2002-09-27 19:33:19

by Manfred Spraul

Subject: Re: [patch 3/4] slab reclaim balancing

--- 2.4/mm/slab.c Fri Aug 30 18:39:22 2002
+++ build-2.4/mm/slab.c Fri Sep 27 21:01:31 2002
@@ -1727,6 +1735,9 @@
}
#endif

+unsigned long g_calls = 0;
+unsigned long g_pages = 0;
+unsigned long g_success = 0;
/**
* kmem_cache_reap - Reclaim memory from caches.
* @gfp_mask: the type of memory required.
@@ -1749,6 +1760,7 @@
if (down_trylock(&cache_chain_sem))
return 0;

+g_calls++;
scan = REAP_SCANLEN;
best_len = 0;
best_pages = 0;
@@ -1827,6 +1839,8 @@
perfect:
/* free only 50% of the free slabs */
best_len = (best_len + 1)/2;
+g_success++;
+g_pages+=best_len;
for (scan = 0; scan < best_len; scan++) {
struct list_head *p;

@@ -1907,6 +1921,7 @@
* Output format version, so at least we can change it
* without _too_ many complaints.
*/
+ seq_printf(m, "%lu %lu %lu.\n",g_calls, g_pages, g_success);
seq_puts(m, "slabinfo - version: 1.1"
#if STATS
" (statistics)"


Attachments:
patch-slab-stat (926.00 B)

2002-09-27 19:47:30

by Andrew Morton

Subject: Re: [patch 3/4] slab reclaim balancing

Manfred Spraul wrote:
>
> Andrew Morton wrote:
> >
> >>* After flushing a batch back into the lists, the number of free objects
> >>in the lists is calculated. If freeable pages exist and the number
> >>exceeds a target, then the freeable pages above the target are returned
> >>to the page buddy.
> >
> >
> > Probably OK for now. But slab should _not_ hold onto an unused,
> > cache-warm page. Because do_anonymous_page() may want one.
> >
> If the per-cpu caches are enabled on UP, too, then this is a moot point:
> by the time a batch is freed from the per-cpu array, it will be cache cold.

Well yes, it's all smoke, mirrors and wishful thinking. All we can
do is to apply local knowledge of typical behaviour in deciding whether
a page is likely to be usefully reused.

> And btw, why do you think a page is cache-warm when the last object on a
> page is freed? If the last 32-byte kmalloc is released on a page, 40xx
> bytes are probably cache-cold.

L2 caches are hundreds of K, and a single P4 cacheline is 1/32nd of
a page. Things are tending in that direction.

> Back to your first problem: You've mentioned excess hits on the
> cache_chain_semaphore. Which app did you use for stress testing?

I think it was dd-to-six-disks.

> Could you run a stress test with the applied patch?

Shall try to.

> I've tried dbench 50, with 128 MB RAM, on uniprocessor, with 2.4:
>
> There were 9100 calls to kmem_cache_reap, and in 90% of the calls, no
> freeable memory was found. Alltogether, only 1300 pages were freed from
> the slabs.
>
> Are there just too many calls to kmem_cache_reap()? Perhaps we should
> try to optimize the "nothing freeable exists" logic?

It certainly sounds like it. Some sort of counter which is accessed
outside locks would be appropriate. Test that before deciding to
take the lock.
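
Something along these lines, perhaps (a sketch only; the counter and
threshold are made up):

static atomic_t slab_free_pages = ATOMIC_INIT(0);
/* incremented/decremented (by the slab's page count) wherever a slab
   moves onto or off a cache's slabs_free list */

int kmem_cache_reap(int gfp_mask)
{
	if (atomic_read(&slab_free_pages) < REAP_THRESHOLD)
		return 0;	/* clearly nothing to reap: don't take the semaphore */
	if (down_trylock(&cache_chain_sem))
		return 0;
	/* ... existing scan and freeing ... */
}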