2003-02-26 16:39:24

by Andi Kleen

Subject: [PATCH] New dcache / inode hash tuning patch


Unlike the previous version this one should actually compile.

The purpose of the patch is to address the excessive cache misses
observed on some boxes when accessing the dcache hash table.
To do that it uses single-pointer hash bucket heads and reduces
the hash table sizes to values more likely to fit into the cache.

In addition it saves tons of memory.
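
(For scale, using numbers that appear later in the thread: the boot message
quoted below reports 131072 dentry hash buckets at two pointers each, i.e.
1 MB of bucket heads on a 32-bit box, while this patch allocates 16384
single-pointer buckets, i.e. 64 KB - roughly a factor of 16 less for the
dentry head array alone, before counting the inode table.)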

The hash table sizes are still experimental. If it performs badly
at first, please tweak D_NUMBUCKETS and I_NUMBUCKETS. I suspect
D_* could actually be increased, while I_* could be decreased.

Thanks a lot to Paul McKenney for review.

Patch for 2.5.63.

Changes compared to the last version:
- Should really compile now (last version had missing and wrong hunks
from last minute changes)
- Replace wmb with smp_wmb to make uniprocessor builds faster.
- Replace the max_dentries RCU race loop breaker with Brent's algorithm
(see the standalone sketch after this list). This is untested because I don't
know how to reproduce the race it fixes, but in theory it should work better
than the old code.
- Fix bug in find_inode_* to not leak memory (thanks Paul for noticing
that)
- Some other minor changes requested by Paul.
- Other minor cleanups
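
For reference, here is a minimal standalone sketch of the power-of-two
("Brent"-style) cycle check the d_lookup change is based on. The struct and
function names are made up for illustration, and the exact placement of the
marker update in the real hunk differs slightly; this only shows the shape of
the idea:

/* Hypothetical chain element standing in for a hash-chain entry. */
struct node {
	struct node *next;
};

/*
 * Return 1 if following ->next from head ever revisits a node.
 * The checkpoint ("marker") is moved whenever the step count is a
 * power of two, so a cycle of length L is detected after O(L) extra
 * steps without needing a second slow pointer.
 */
static int chain_has_cycle(struct node *head)
{
	struct node *marker = NULL;
	unsigned long count = 0;
	struct node *p;

	for (p = head; p != NULL; p = p->next) {
		if (count & (count - 1)) {
			/* count is not a power of two: just compare */
			if (p == marker)
				return 1;
		} else {
			/* count is 0, 1, 2, 4, 8, ...: move the checkpoint */
			marker = p;
		}
		count++;
	}
	return 0;
}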

Todo:
- Check if a better hash function for the dcache can be designed
- Verify the current hash table sizes under various workloads
- Add CONFIG_SMALL for small systems.

Please test, especially on big machines.

-Andi


diff -burpN -X ../KDIFX linux/fs/dcache.c linux-2.5.63-work/fs/dcache.c
--- linux/fs/dcache.c 2003-02-21 12:13:53.000000000 +0100
+++ linux-2.5.63-work/fs/dcache.c 2003-02-26 13:52:33.000000000 +0100
@@ -41,23 +41,24 @@ static kmem_cache_t *dentry_cache;
*
* This hash-function tries to avoid losing too many bits of hash
* information, yet avoid using a prime hash-size or similar.
+ *
+ * AK: using a prime hash with a prime near some power-of-two would be
+ * likely better. Any hash guru interested? Same for the inode hash.
+ *
+ * We probably should have CONFIG_SMALL and CONFIG_LARGE for this.
+ * Don't scale it by memory size, otherwise big systems are eaten
+ * by the cache misses.
+ *
+ * Sizes need to be power-of-two for now.
*/
-#define D_HASHBITS d_hash_shift
-#define D_HASHMASK d_hash_mask
+#define D_NUMBUCKETS (16*1024) /* 64K RAM on a 32bit machine */
+#define D_HASHBITS 14 /* = log2(D_NUMBUCKETS) */
+#define D_HASHMASK (D_NUMBUCKETS-1)

-static unsigned int d_hash_mask;
-static unsigned int d_hash_shift;
-static struct list_head *dentry_hashtable;
+static struct hlist_head *dentry_hashtable;
static LIST_HEAD(dentry_unused);
-static int max_dentries;
static void * hashtable_end;

-static inline int is_bucket(void * addr)
-{
- return ((addr < (void *)dentry_hashtable)
- || (addr > hashtable_end) ? 0 : 1);
-}
-
/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
.age_limit = 45,
@@ -292,6 +293,7 @@ struct dentry * d_find_alias(struct inod
while (next != head) {
tmp = next;
next = tmp->next;
+ prefetch(next);
alias = list_entry(tmp, struct dentry, d_alias);
if (!d_unhashed(alias)) {
if (alias->d_flags & DCACHE_DISCONNECTED)
@@ -378,6 +380,7 @@ static void prune_dcache(int count)
if (tmp == &dentry_unused)
break;
list_del_init(tmp);
+ prefetch(dentry_unused.prev);
dentry_stat.nr_unused--;
dentry = list_entry(tmp, struct dentry, d_lru);

@@ -603,15 +606,15 @@ void shrink_dcache_parent(struct dentry
* done under dcache_lock.
*
*/
-void shrink_dcache_anon(struct list_head *head)
+void shrink_dcache_anon(struct hlist_head *head)
{
- struct list_head *lp;
+ struct hlist_node *lp;
int found;
do {
found = 0;
spin_lock(&dcache_lock);
- list_for_each(lp, head) {
- struct dentry *this = list_entry(lp, struct dentry, d_hash);
+ hlist_for_each(lp, head) {
+ struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
list_del(&this->d_lru);

/* don't add non zero d_count dentries
@@ -727,7 +730,7 @@ struct dentry * d_alloc(struct dentry *
dentry->d_mounted = 0;
dentry->d_cookie = NULL;
dentry->d_bucket = NULL;
- INIT_LIST_HEAD(&dentry->d_hash);
+ INIT_HLIST_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
@@ -797,7 +800,7 @@ struct dentry * d_alloc_root(struct inod
return res;
}

-static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash)
+static inline struct hlist_head * d_hash(struct dentry * parent, unsigned long hash)
{
hash += (unsigned long) parent / L1_CACHE_BYTES;
hash = hash ^ (hash >> D_HASHBITS);
@@ -860,7 +863,7 @@ struct dentry * d_alloc_anon(struct inod
res->d_flags |= DCACHE_DISCONNECTED;
res->d_vfs_flags &= ~DCACHE_UNHASHED;
list_add(&res->d_alias, &inode->i_dentry);
- list_add(&res->d_hash, &inode->i_sb->s_anon);
+ hlist_add_head(&res->d_hash, &inode->i_sb->s_anon);
spin_unlock(&res->d_lock);
}
inode = NULL; /* don't drop reference */
@@ -947,21 +950,23 @@ struct dentry * d_lookup(struct dentry *
unsigned int len = name->len;
unsigned int hash = name->hash;
const unsigned char *str = name->name;
- struct list_head *head = d_hash(parent,hash);
+ struct hlist_head *head = d_hash(parent,hash);
struct dentry *found = NULL;
- struct list_head *tmp;
- int lookup_count = 0;
+ struct hlist_node *node;
+ int loop_count = 0;
+ struct dentry *loop_marker = NULL;

rcu_read_lock();

- /* lookup is terminated when flow reaches any bucket head */
- for(tmp = head->next; !is_bucket(tmp); tmp = tmp->next) {
+ hlist_for_each (node, head) {
struct dentry *dentry;
unsigned long move_count;
struct qstr * qstr;

+ prefetch(node->next);
+
smp_read_barrier_depends();
- dentry = list_entry(tmp, struct dentry, d_hash);
+ dentry = hlist_entry(node, struct dentry, d_hash);

/* if lookup ends up in a different bucket
* due to concurrent rename, fail it
@@ -969,11 +974,28 @@ struct dentry * d_lookup(struct dentry *
if (unlikely(dentry->d_bucket != head))
break;

- /* to avoid race if dentry keep coming back to original
- * bucket due to double moves
+ /* to avoid a race if the dentry keeps coming back to the
+ * original bucket due to double moves. Check for a cycle
+ * in each power of two. It could be still a false
+ * positive due to memory reuse, so also recheck
+ * again using an exclusive lock. This should happen
+ * only in exceptional cases. Recursion is bounded to
+ * one because the race cannot occur with dcache lock
+ * hold.
+ *
+ * UNTESTED CURRENTLY AS I CANNOT REPRODUCE THIS RACE!
*/
- if (unlikely(++lookup_count > max_dentries))
- break;
+ if (unlikely(loop_count & (loop_count - 1))) {
+ if (unlikely(dentry == loop_marker)) {
+ rcu_read_unlock();
+ spin_lock(&dcache_lock);
+ dentry = d_lookup(parent, name);
+ spin_unlock(&dcache_lock);
+ return dentry;
+ }
+ loop_marker = dentry;
+ }
+ loop_count++;

/*
* We must take a snapshot of d_move_count followed by
@@ -1031,7 +1053,8 @@ int d_validate(struct dentry *dentry, st
unsigned long dent_addr = (unsigned long) dentry;
unsigned long min_addr = PAGE_OFFSET;
unsigned long align_mask = 0x0F;
- struct list_head *base, *lhp;
+ struct hlist_head *base;
+ struct hlist_node *lhp;

if (dent_addr < min_addr)
goto out;
@@ -1047,12 +1070,13 @@ int d_validate(struct dentry *dentry, st
goto out;

spin_lock(&dcache_lock);
- lhp = base = d_hash(dparent, dentry->d_name.hash);
- while ((lhp = lhp->next) != base) {
+ base = d_hash(dparent, dentry->d_name.hash);
+ hlist_for_each(lhp,base) {
+ prefetch(lhp->next);
/* read_barrier_depends() not required for d_hash list
* as it is parsed under dcache_lock
*/
- if (dentry == list_entry(lhp, struct dentry, d_hash)) {
+ if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
dget(dentry);
spin_unlock(&dcache_lock);
return 1;
@@ -1113,12 +1137,11 @@ void d_delete(struct dentry * dentry)

void d_rehash(struct dentry * entry)
{
- struct list_head *list = d_hash(entry->d_parent, entry->d_name.hash);
+ struct hlist_head *list = d_hash(entry->d_parent, entry->d_name.hash);
spin_lock(&dcache_lock);
- if (!list_empty(&entry->d_hash) && !d_unhashed(entry)) BUG();
entry->d_vfs_flags &= ~DCACHE_UNHASHED;
entry->d_bucket = list;
- list_add_rcu(&entry->d_hash, list);
+ hlist_add_head_rcu(&entry->d_hash, list);
spin_unlock(&dcache_lock);
}

@@ -1171,10 +1194,6 @@ static inline void switch_names(struct d
* We could be nicer about the deleted file, and let it show
* up under the name it got deleted rather than the name that
* deleted it.
- *
- * Careful with the hash switch. The hash switch depends on
- * the fact that any list-entry can be a head of the list.
- * Think about it.
*/

/**
@@ -1197,8 +1216,8 @@ void d_move(struct dentry * dentry, stru
/* Move the dentry to the target hash queue, if on different bucket */
if (dentry->d_bucket != target->d_bucket) {
dentry->d_bucket = target->d_bucket;
- list_del_rcu(&dentry->d_hash);
- list_add_rcu(&dentry->d_hash, &target->d_hash);
+ hlist_del_rcu(&dentry->d_hash);
+ hlist_add_head_rcu(&dentry->d_hash, target->d_bucket);
}

/* Unhash the target: dput() will then get rid of it */
@@ -1281,6 +1300,7 @@ static char * __d_path( struct dentry *d
continue;
}
parent = dentry->d_parent;
+ prefetch(parent);
namelen = dentry->d_name.len;
buflen -= namelen + 1;
if (buflen < 0)
@@ -1500,9 +1520,6 @@ out:

static void __init dcache_init(unsigned long mempages)
{
- struct list_head *d;
- unsigned long order;
- unsigned int nr_hash;
int i;

/*
@@ -1521,49 +1538,17 @@ static void __init dcache_init(unsigned
if (!dentry_cache)
panic("Cannot create dentry cache");

- /* approximate maximum number of dentries in one hash bucket */
- max_dentries = (mempages * (PAGE_SIZE / sizeof(struct dentry)));
-
set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);

-#if PAGE_SHIFT < 13
- mempages >>= (13 - PAGE_SHIFT);
-#endif
- mempages *= sizeof(struct list_head);
- for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
- ;
-
- do {
- unsigned long tmp;
-
- nr_hash = (1UL << order) * PAGE_SIZE /
- sizeof(struct list_head);
- d_hash_mask = (nr_hash - 1);
-
- tmp = nr_hash;
- d_hash_shift = 0;
- while ((tmp >>= 1UL) != 0UL)
- d_hash_shift++;
-
- dentry_hashtable = (struct list_head *)
- __get_free_pages(GFP_ATOMIC, order);
- } while (dentry_hashtable == NULL && --order >= 0);
-
- printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
- nr_hash, order, (PAGE_SIZE << order));
-
+ dentry_hashtable = (struct hlist_head *)
+ __get_free_pages(GFP_ATOMIC,
+ get_order(D_NUMBUCKETS * sizeof(struct hlist_head)));
if (!dentry_hashtable)
panic("Failed to allocate dcache hash table\n");

- hashtable_end = dentry_hashtable + nr_hash;
-
- d = dentry_hashtable;
- i = nr_hash;
- do {
- INIT_LIST_HEAD(d);
- d++;
- i--;
- } while (i);
+ for (i = 0; i < D_NUMBUCKETS; i++)
+ INIT_HLIST_HEAD(&dentry_hashtable[i]);
+ hashtable_end = dentry_hashtable + i;
}

/* SLAB cache for __getname() consumers */
diff -burpN -X ../KDIFX linux/fs/fs-writeback.c linux-2.5.63-work/fs/fs-writeback.c
--- linux/fs/fs-writeback.c 2003-02-10 19:39:17.000000000 +0100
+++ linux-2.5.63-work/fs/fs-writeback.c 2003-02-26 13:53:13.000000000 +0100
@@ -90,7 +90,7 @@ void __mark_inode_dirty(struct inode *in
* Only add valid (hashed) inodes to the superblock's
* dirty list. Add blockdev inodes as well.
*/
- if (list_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode))
+ if (hnode_empty(&inode->i_hash) && !S_ISBLK(inode->i_mode))
goto out;

/*
diff -burpN -X ../KDIFX linux/fs/hugetlbfs/inode.c linux-2.5.63-work/fs/hugetlbfs/inode.c
--- linux/fs/hugetlbfs/inode.c 2003-02-15 10:37:04.000000000 +0100
+++ linux-2.5.63-work/fs/hugetlbfs/inode.c 2003-02-26 13:54:11.000000000 +0100
@@ -189,7 +189,7 @@ void truncate_hugepages(struct address_s

static void hugetlbfs_delete_inode(struct inode *inode)
{
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
list_del_init(&inode->i_list);
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
@@ -208,7 +208,7 @@ static void hugetlbfs_forget_inode(struc
{
struct super_block *super_block = inode->i_sb;

- if (list_empty(&inode->i_hash))
+ if (hnode_empty(&inode->i_hash))
goto out_truncate;

if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
@@ -223,7 +223,7 @@ static void hugetlbfs_forget_inode(struc

/* write_inode_now() ? */
inodes_stat.nr_unused--;
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
out_truncate:
list_del_init(&inode->i_list);
inode->i_state |= I_FREEING;
diff -burpN -X ../KDIFX linux/fs/inode.c linux-2.5.63-work/fs/inode.c
--- linux/fs/inode.c 2003-02-10 19:39:00.000000000 +0100
+++ linux-2.5.63-work/fs/inode.c 2003-02-26 13:16:49.000000000 +0100
@@ -49,11 +49,9 @@
* Inode lookup is no longer as critical as it used to be:
* most of the lookups are going to be through the dcache.
*/
-#define I_HASHBITS i_hash_shift
-#define I_HASHMASK i_hash_mask
-
-static unsigned int i_hash_mask;
-static unsigned int i_hash_shift;
+#define I_NUMBUCKETS (8*1024)
+#define I_HASHBITS 13 /* = log2(I_NUMBUCKETS) */
+#define I_HASHMASK (I_NUMBUCKETS-1)

/*
* Each inode can be on two separate lists. One is
@@ -69,8 +67,8 @@ static unsigned int i_hash_shift;

LIST_HEAD(inode_in_use);
LIST_HEAD(inode_unused);
-static struct list_head *inode_hashtable;
-static LIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */
+static struct hlist_head *inode_hashtable;
+static HLIST_HEAD(anon_hash_chain); /* for inodes with NULL i_sb */

/*
* A simple spinlock to protect the list manipulations.
@@ -162,7 +160,7 @@ void destroy_inode(struct inode *inode)
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_LIST_HEAD(&inode->i_hash);
+ INIT_HLIST_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_data.clean_pages);
INIT_LIST_HEAD(&inode->i_data.dirty_pages);
INIT_LIST_HEAD(&inode->i_data.locked_pages);
@@ -284,7 +282,7 @@ static int invalidate_list(struct list_h
continue;
invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
list_del(&inode->i_list);
list_add(&inode->i_list, dispose);
inode->i_state |= I_FREEING;
@@ -422,7 +420,7 @@ static void prune_icache(int nr_to_scan)
if (!can_unuse(inode))
continue;
}
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
list_move(&inode->i_list, &freeable);
inode->i_state |= I_FREEING;
nr_pruned++;
@@ -460,50 +458,42 @@ static int shrink_icache_memory(int nr,
* by hand after calling find_inode now! This simplifies iunique and won't
* add any additional branch in the common code.
*/
-static struct inode * find_inode(struct super_block * sb, struct list_head *head, int (*test)(struct inode *, void *), void *data)
+static struct inode * find_inode(struct super_block * sb, struct hlist_head *head, int (*test)(struct inode *, void *), void *data)
{
- struct list_head *tmp;
+ struct hlist_node *node;
struct inode * inode;

- tmp = head;
- for (;;) {
- tmp = tmp->next;
- inode = NULL;
- if (tmp == head)
- break;
- inode = list_entry(tmp, struct inode, i_hash);
+ hlist_for_each (node, head) {
+ prefetch(node->next);
+ inode = hlist_entry(node, struct inode, i_hash);
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
break;
}
- return inode;
+ return node ? inode : NULL;
}

/*
* find_inode_fast is the fast path version of find_inode, see the comment at
* iget_locked for details.
*/
-static struct inode * find_inode_fast(struct super_block * sb, struct list_head *head, unsigned long ino)
+static struct inode * find_inode_fast(struct super_block * sb, struct hlist_head *head, unsigned long ino)
{
- struct list_head *tmp;
+ struct hlist_node *node;
struct inode * inode;

- tmp = head;
- for (;;) {
- tmp = tmp->next;
- inode = NULL;
- if (tmp == head)
- break;
- inode = list_entry(tmp, struct inode, i_hash);
+ hlist_for_each (node, head) {
+ prefetch(node->next);
+ inode = list_entry(node, struct inode, i_hash);
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
break;
}
- return inode;
+ return node ? inode : NULL;
}

/**
@@ -553,7 +543,7 @@ EXPORT_SYMBOL(unlock_new_inode);
* We no longer cache the sb_flags in i_flags - see fs.h
* -- [email protected]
*/
-static struct inode * get_new_inode(struct super_block *sb, struct list_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data)
+static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *head, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *data)
{
struct inode * inode;

@@ -570,7 +560,7 @@ static struct inode * get_new_inode(stru

inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
- list_add(&inode->i_hash, head);
+ hlist_add_head(&inode->i_hash, head);
inode->i_state = I_LOCK|I_NEW;
spin_unlock(&inode_lock);

@@ -603,7 +593,7 @@ set_failed:
* get_new_inode_fast is the fast path version of get_new_inode, see the
* comment at iget_locked for details.
*/
-static struct inode * get_new_inode_fast(struct super_block *sb, struct list_head *head, unsigned long ino)
+static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_head *head, unsigned long ino)
{
struct inode * inode;

@@ -618,7 +608,7 @@ static struct inode * get_new_inode_fast
inode->i_ino = ino;
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
- list_add(&inode->i_hash, head);
+ hlist_add_head(&inode->i_hash, head);
inode->i_state = I_LOCK|I_NEW;
spin_unlock(&inode_lock);

@@ -670,7 +660,7 @@ ino_t iunique(struct super_block *sb, in
{
static ino_t counter = 0;
struct inode *inode;
- struct list_head * head;
+ struct hlist_head * head;
ino_t res;
spin_lock(&inode_lock);
retry:
@@ -724,7 +714,7 @@ struct inode *igrab(struct inode *inode)
* Note, @test is called with the inode_lock held, so can't sleep.
*/
static inline struct inode *ifind(struct super_block *sb,
- struct list_head *head, int (*test)(struct inode *, void *),
+ struct hlist_head *head, int (*test)(struct inode *, void *),
void *data)
{
struct inode *inode;
@@ -756,7 +746,7 @@ static inline struct inode *ifind(struct
* Otherwise NULL is returned.
*/
static inline struct inode *ifind_fast(struct super_block *sb,
- struct list_head *head, unsigned long ino)
+ struct hlist_head *head, unsigned long ino)
{
struct inode *inode;

@@ -794,7 +784,7 @@ static inline struct inode *ifind_fast(s
struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct list_head *head = inode_hashtable + hash(sb, hashval);
+ struct hlist_head *head = inode_hashtable + hash(sb, hashval);

return ifind(sb, head, test, data);
}
@@ -816,7 +806,7 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct list_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = inode_hashtable + hash(sb, ino);

return ifind_fast(sb, head, ino);
}
@@ -848,7 +838,7 @@ struct inode *iget5_locked(struct super_
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct list_head *head = inode_hashtable + hash(sb, hashval);
+ struct hlist_head *head = inode_hashtable + hash(sb, hashval);
struct inode *inode;

inode = ifind(sb, head, test, data);
@@ -881,7 +871,7 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct list_head *head = inode_hashtable + hash(sb, ino);
+ struct hlist_head *head = inode_hashtable + hash(sb, ino);
struct inode *inode;

inode = ifind_fast(sb, head, ino);
@@ -907,11 +897,11 @@ EXPORT_SYMBOL(iget_locked);

void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct list_head *head = &anon_hash_chain;
+ struct hlist_head *head = &anon_hash_chain;
if (inode->i_sb)
head = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_lock);
- list_add(&inode->i_hash, head);
+ hlist_add_head(&inode->i_hash, head);
spin_unlock(&inode_lock);
}

@@ -925,7 +915,7 @@ void __insert_inode_hash(struct inode *i
void remove_inode_hash(struct inode *inode)
{
spin_lock(&inode_lock);
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
spin_unlock(&inode_lock);
}

@@ -933,7 +923,7 @@ void generic_delete_inode(struct inode *
{
struct super_operations *op = inode->i_sb->s_op;

- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
list_del_init(&inode->i_list);
inode->i_state|=I_FREEING;
inodes_stat.nr_inodes--;
@@ -962,7 +952,7 @@ static void generic_forget_inode(struct
{
struct super_block *sb = inode->i_sb;

- if (!list_empty(&inode->i_hash)) {
+ if (!hnode_empty(&inode->i_hash)) {
if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
list_del(&inode->i_list);
list_add(&inode->i_list, &inode_unused);
@@ -974,7 +964,7 @@ static void generic_forget_inode(struct
write_inode_now(inode, 1);
spin_lock(&inode_lock);
inodes_stat.nr_unused--;
- list_del_init(&inode->i_hash);
+ hlist_del_init(&inode->i_hash);
}
list_del_init(&inode->i_list);
inode->i_state|=I_FREEING;
@@ -1220,48 +1210,18 @@ void wake_up_inode(struct inode *inode)
*/
void __init inode_init(unsigned long mempages)
{
- struct list_head *head;
- unsigned long order;
- unsigned int nr_hash;
int i;
-
for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
init_waitqueue_head(&i_wait_queue_heads[i].wqh);

- mempages >>= (14 - PAGE_SHIFT);
- mempages *= sizeof(struct list_head);
- for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
- ;
-
- do {
- unsigned long tmp;
-
- nr_hash = (1UL << order) * PAGE_SIZE /
- sizeof(struct list_head);
- i_hash_mask = (nr_hash - 1);
-
- tmp = nr_hash;
- i_hash_shift = 0;
- while ((tmp >>= 1UL) != 0UL)
- i_hash_shift++;
-
- inode_hashtable = (struct list_head *)
- __get_free_pages(GFP_ATOMIC, order);
- } while (inode_hashtable == NULL && --order >= 0);
-
- printk("Inode-cache hash table entries: %d (order: %ld, %ld bytes)\n",
- nr_hash, order, (PAGE_SIZE << order));
-
+ inode_hashtable = (struct hlist_head *)
+ __get_free_pages(GFP_ATOMIC,
+ get_order(I_NUMBUCKETS*sizeof(struct hlist_head)));
if (!inode_hashtable)
panic("Failed to allocate inode hash table\n");

- head = inode_hashtable;
- i = nr_hash;
- do {
- INIT_LIST_HEAD(head);
- head++;
- i--;
- } while (i);
+ for (i = 0; i < I_NUMBUCKETS; i++)
+ INIT_HLIST_HEAD(&inode_hashtable[i]);

/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
diff -burpN -X ../KDIFX linux/fs/super.c linux-2.5.63-work/fs/super.c
--- linux/fs/super.c 2003-02-10 19:38:30.000000000 +0100
+++ linux-2.5.63-work/fs/super.c 2003-02-26 13:51:33.000000000 +0100
@@ -63,7 +63,7 @@ static struct super_block *alloc_super(v
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
- INIT_LIST_HEAD(&s->s_anon);
+ INIT_HLIST_HEAD(&s->s_anon);
init_rwsem(&s->s_umount);
sema_init(&s->s_lock, 1);
down_write(&s->s_umount);
diff -burpN -X ../KDIFX linux/include/linux/dcache.h linux-2.5.63-work/include/linux/dcache.h
--- linux/include/linux/dcache.h 2003-02-21 12:13:54.000000000 +0100
+++ linux-2.5.63-work/include/linux/dcache.h 2003-02-26 13:16:27.000000000 +0100
@@ -80,8 +80,8 @@ struct dentry {
unsigned long d_move_count; /* to indicated moved dentry while lockless lookup */
struct inode * d_inode; /* Where the name belongs to - NULL is negative */
struct dentry * d_parent; /* parent directory */
- struct list_head * d_bucket; /* lookup hash bucket */
- struct list_head d_hash; /* lookup hash list */
+ struct hlist_head * d_bucket; /* lookup hash bucket */
+ struct hlist_node d_hash; /* lookup hash list */
struct list_head d_lru; /* LRU list */
struct list_head d_child; /* child of parent list */
struct list_head d_subdirs; /* our children */
@@ -171,7 +171,7 @@ extern rwlock_t dparent_lock;
static __inline__ void __d_drop(struct dentry * dentry)
{
dentry->d_vfs_flags |= DCACHE_UNHASHED;
- list_del_rcu(&dentry->d_hash);
+ hlist_del_rcu(&dentry->d_hash);
}

static __inline__ void d_drop(struct dentry * dentry)
@@ -198,7 +198,7 @@ extern struct dentry * d_alloc_anon(stru
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
-extern void shrink_dcache_anon(struct list_head *);
+extern void shrink_dcache_anon(struct hlist_head *);
extern int d_invalidate(struct dentry *);

/* only used at mount-time */
diff -burpN -X ../KDIFX linux/include/linux/fs.h linux-2.5.63-work/include/linux/fs.h
--- linux/include/linux/fs.h 2003-02-26 12:55:30.000000000 +0100
+++ linux-2.5.63-work/include/linux/fs.h 2003-02-26 13:16:27.000000000 +0100
@@ -353,7 +353,7 @@ struct block_device {
};

struct inode {
- struct list_head i_hash;
+ struct hlist_node i_hash;
struct list_head i_list;
struct list_head i_dentry;
unsigned long i_ino;
@@ -601,7 +601,7 @@ struct super_block {

struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
- struct list_head s_anon; /* anonymous dentries for (nfs) exporting */
+ struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;

struct block_device *s_bdev;
diff -burpN -X ../KDIFX linux/include/linux/list.h linux-2.5.63-work/include/linux/list.h
--- linux/include/linux/list.h 2003-02-10 19:37:56.000000000 +0100
+++ linux-2.5.63-work/include/linux/list.h 2003-02-26 13:19:48.000000000 +0100
@@ -319,6 +319,95 @@ static inline void list_splice_init(stru
for (pos = (head)->next, n = pos->next; pos != (head); \
pos = n, ({ read_barrier_depends(); 0;}), n = pos->next)

+/*
+ * Double linked lists with a single pointer list head.
+ * Mostly useful for hash tables where the two pointer list head is
+ * too wasteful.
+ * You lose the ability to access the tail in O(1).
+ */
+
+struct hlist_head {
+ struct hlist_node *first;
+};
+
+struct hlist_node {
+ struct hlist_node *next, **pprev;
+};
+
+#define HLIST_HEAD_INIT { first: NULL }
+#define HLIST_HEAD(name) struct hlist_head name = { first: NULL }
+#define INIT_HLIST_HEAD(ptr) ((ptr)->first = NULL)
+#define INIT_HLIST_NODE(ptr) ((ptr)->next = NULL, (ptr)->pprev = NULL)
+
+/* This is really misnamed */
+static __inline__ int hnode_empty(struct hlist_node *h)
+{
+ return h->pprev==0;
+}
+
+static __inline__ int hlist_empty(struct hlist_head *h)
+{
+ return !h->first;
+}
+
+static __inline__ void hlist_del(struct hlist_node *n)
+{
+ /* The if could be avoided by a common dummy pprev target. */
+ if (!n->pprev)
+ return;
+ *(n->pprev) = n->next;
+ if (n->next)
+ n->next->pprev = n->pprev;
+}
+
+#define hlist_del_rcu hlist_del /* list_del_rcu is identical too? */
+
+static __inline__ void hlist_del_init(struct hlist_node *n)
+{
+ /* The if could be avoided by a common dummy pprev target. */
+ if (!n->pprev)
+ return;
+ *(n->pprev) = n->next;
+ if (n->next)
+ n->next->pprev = n->pprev;
+ INIT_HLIST_NODE(n);
+}
+
+static __inline__ void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
+{
+ n->next = h->first;
+ if (h->first)
+ h->first->pprev = &n->next;
+ h->first = n;
+ n->pprev = &h->first;
+}
+
+static __inline__ void hlist_add_head_rcu(struct hlist_node *n, struct hlist_head *h)
+{
+ n->next = h->first;
+ n->pprev = &h->first;
+ smp_wmb();
+ if (h->first)
+ h->first->pprev = &n->next;
+ h->first = n;
+}
+
+/* next must be != NULL */
+static __inline__ void hlist_add_before(struct hlist_node *n, struct hlist_node *next)
+{
+ n->pprev = next->pprev;
+ n->next = next;
+ next->pprev = &n->next;
+ *(n->pprev) = n;
+}
+
+#define hlist_entry(ptr, type, member) container_of(ptr,type,member)
+
+/* Cannot easily do prefetch unfortunately */
+#define hlist_for_each(pos, head) \
+ for (pos = (head)->first; pos; \
+ pos = pos->next)
+
#else
#warning "don't include kernel headers in userspace"
#endif /* __KERNEL__ */
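
For readers new to the hlist interface added in the list.h hunk above, here
is a minimal usage sketch. It assumes only the definitions from that hunk;
the object type, table size and function names are made up:

#define MY_HASH_SIZE 256	/* must be a power of two */

/* Static zero-initialization leaves every ->first NULL, which is the
 * same state INIT_HLIST_HEAD produces. */
static struct hlist_head my_table[MY_HASH_SIZE];

struct my_obj {
	unsigned long key;
	struct hlist_node hash;		/* one node, two words per object */
};

static void my_insert(struct my_obj *obj)
{
	hlist_add_head(&obj->hash, &my_table[obj->key & (MY_HASH_SIZE - 1)]);
}

static struct my_obj *my_find(unsigned long key)
{
	struct hlist_node *node;
	struct my_obj *obj;

	hlist_for_each(node, &my_table[key & (MY_HASH_SIZE - 1)]) {
		obj = hlist_entry(node, struct my_obj, hash);
		if (obj->key == key)
			return obj;
	}
	return NULL;	/* the walk terminates on NULL, not on a head test */
}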


2003-02-26 18:14:59

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

It actually seems a fraction slower (see systimes for Kernbench-16,
for instance).

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
Elapsed User System CPU
2.5.62-mjb3 43.92 557.65 94.12 1483.50
2.5.62-mjb3-andi 44.28 557.90 95.79 1475.67

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
Elapsed User System CPU
2.5.62-mjb3 45.21 560.46 114.58 1492.67
2.5.62-mjb3-andi 45.39 561.29 117.73 1495.67

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 1 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 3.1%
2.5.62-mjb3-andi 101.8% 6.6%

SDET 2 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 4.0%
2.5.62-mjb3-andi 96.4% 4.8%

SDET 4 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 2.0%
2.5.62-mjb3-andi 100.4% 2.4%

SDET 8 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 4.6%
2.5.62-mjb3-andi 100.3% 2.3%

SDET 16 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 2.7%
2.5.62-mjb3-andi 96.6% 4.3%

SDET 32 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 0.9%
2.5.62-mjb3-andi 97.0% 2.1%

SDET 64 (see disclaimer)
Throughput Std. Dev
2.5.62-mjb3 100.0% 1.1%
2.5.62-mjb3-andi 100.2% 0.6%


Diffprofile (+ worse with your patch, - better)

500 3.1% total
138 6.9% .text.lock.file_table
103 2.2% default_idle
73 9.3% d_lookup
50 11.0% __copy_to_user_ll
45 27.4% file_move
40 28.2% path_lookup
...
-7 -3.9% do_schedule
-9 -2.0% get_empty_filp
-10 -8.6% dput
-10 -9.8% link_path_walk

That text.lock.file_table has been bugging me for a bit, and I need to
drill down into it some more.

M.

2003-02-26 18:22:50

by Andi Kleen

Subject: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

On Wed, Feb 26, 2003 at 10:23:09AM -0800, Martin J. Bligh wrote:
> It actually seems a fraction slower (see systimes for Kernbench-16,
> for instance).

Can you play a bit with the hash table sizes? Perhaps double the
dcache hash and half the inode hash ?

I suspect it really just needs a better hash function. I'll cook
something up based on FNV hash.

-Andi

2003-02-26 21:06:06

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

>> It actually seems a fraction slower (see systimes for Kernbench-16,
>> for instance).
>
> Can you play a bit with the hash table sizes? Perhaps double the
> dcache hash and half the inode hash ?

I quadrupled the dcache hash size ... it helped a bit. It's still a little
slower, which I don't understand ... your plan should make it go faster,
AFAICS. Those off-node accesses to ZONE_NORMAL are really expensive for me.

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
Elapsed User System CPU
2.5.62-mjb3 44.24 558.12 94.17 1473.83
2.5.62-mjb3-andi 44.28 557.90 95.79 1475.67
2.5.62-mjb3-andi-4x 44.20 557.96 94.77 1475.83

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
Elapsed User System CPU
2.5.62-mjb3 45.19 560.95 114.77 1495.00
2.5.62-mjb3-andi 45.39 561.29 117.73 1495.67
2.5.62-mjb3-andi-4x 45.34 560.54 116.85 1493.50

> I suspect it really just needs a better hash function. I'll cook
> something up based on FNV hash.

Sounds good ;-)

M.

2003-02-27 11:10:55

by Gianni Tedesco

Subject: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

On Wed, 2003-02-26 at 18:33, Andi Kleen wrote:
> On Wed, Feb 26, 2003 at 10:23:09AM -0800, Martin J. Bligh wrote:
> > It actually seems a fraction slower (see systimes for Kernbench-16,
> > for instance).
>
> Can you play a bit with the hash table sizes? Perhaps double the
> dcache hash and half the inode hash ?
>
> I suspect it really just needs a better hash function. I'll cook
> something up based on FNV hash.

Didn't wli do some work in this area? I seem to recall him recommending
FNV1a for dcache...

--
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source http://www.scaramanga.co.uk/gianni-at-ecsc.asc | gpg --import
8646BE7D: 6D9F 2287 870E A2C9 8F60 3A3C 91B5 7669 8646 BE7D



2003-02-27 11:20:18

by William Lee Irwin III

Subject: Re: [Lse-tech] [PATCH] New dcache / inode hash tuning patch

On Wed, 2003-02-26 at 18:33, Andi Kleen wrote:
>> Can you play a bit with the hash table sizes? Perhaps double the
>> dcache hash and half the inode hash ?
>> I suspect it really just needs a better hash function. I'll cook
>> something up based on FNV hash.

On Thu, Feb 27, 2003 at 11:21:38AM +0000, Gianni Tedesco wrote:
> Didn't wli do some work in this area? I seem to recall him recommending
> FNV1a for dcache...

The work I did here was inconclusive but _suggested_ (non-rigorously)
that FNV was merely more expensive with little benefit for the most
straightforward hash function replacement strategy.

More intelligent mixing strategies for parent + child names etc. may
reveal some use for it yet, but I have zero results in that area.

There is much room for experimentation AFAICT.


-- wli

2003-02-27 23:04:46

by Linus Torvalds

Subject: Re: [PATCH] New dcache / inode hash tuning patch

In article <[email protected]>,
Andi Kleen <[email protected]> wrote:
>
>Unlike the previous version this one should actually compile.
>
>The purpose of the patch is to address the excessive cache misses
>observed on some boxes when accessing the dcache hash table.

I don't think that the hash-list approach is really worth it.

There are two things the hash-list helps with:

- memory usage

But quite frankly, if the hash list heads are actually noticeable
memory users, the hash is likely to be _way_ too big. The list heads
are _much_ smaller than the entries they point to, the hash list just
shouldn't be big enough that it really matters.

- hash list cache footprint

Again: the hash head array itself is at least dense in the cache, and
each entry is _much_ smaller than the actual data structures it
points to. So even if you improve the hash heads to be better from a
cache standpoint, you're only getting a very small percentage of the
real cache costs.

So let's say that the cache costs of the dcache are 4% (according to
the oprofile run), then saving a few percent of that is not actually
going to be noticeable at a user level.

And the downsides of the hash list is that addition/removal is costlier
due to the conditionals, and a non-conditional version (a common
optimization using a special "tail marker entry" that is shared across
the different chains) has absolutely _horrible_ false sharing
characteristics on SMP systems.

In other words: it may be that our current dentry hashes are too big,
and that is certainly worth fixing if so. But the "hlist" approach very
fundamentally cannot really help the _real_ problem very much, and it
will (slightly) hurt the case where the hashes are actually cached.

So I really think that the only longterm fix is to make the lookup data
structures be "local" to the base of the lookup, in order to get away
from the inherently non-local nature of the current hash lookups.

That "local" thing doesn't have to be on a direct per-directory level
either, btw. A fairly simple scheme might be to _force_ some locality
by having some of the hash bits always come from the parent directory,
rather than trying to get a "perfect distribution".

For example, one approach might be to just change the "d_hash()"
function, to something more like

#define HASHBITS_DENTRY 5 // just an example value
#define HASHBITS_PARENT 5 // just an example value
#define HASHBITS_TOTAL (HASHBITS_DENTRY + HASHBITS_PARENT)

static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash)
{
	unsigned long parenthash = (unsigned long) parent / L1_CACHE_BYTES;
	parenthash ^= parent->d_hash;

	parenthash = parenthash ^ (parenthash >> HASHBITS_TOTAL);
	parenthash &= (1 << HASHBITS_TOTAL)-1;
	hash = hash ^ (hash >> HASHBITS_DENTRY);
	hash &= (1 << HASHBITS_DENTRY)-1;
	return parenthash ^ hash;
}

ie the high bits are _solely_ based on the parent, while the low bits
are based on some hash of the parent and the dentry itself.

(Yeah, don't tell me the hash for the parent sucks. I'm a retard, and
the point is really more to try to make the hash "worse" by actually
giving it more locality, so that lookups within one directory won't blow
the whole cache).

Will this help? I dunno. It's kind of a "cheap hash partitioning"
thing. It will obviously result in huge hash chain imbalances if you
for example have only one directory with millions of entries, and
because of the above that directory will only populate 5% of the whole
hash list.

And traditionally a hash imbalance has always been considered a bad
thing. I'm claiming that _maybe_ the hash imbalance can actually be a
good thing too, as long as we try to make sure it's not _too_ extreme.

I'd frankly prefer truly local lookups (ie each dentry has its own
rbtree associated with it), since that would also potentially help
locking a lot (ability to use private locks rather than a global dcache
lock), but I don't see any good low-cost algorithms for something like
the dcache that is _extremely_ performance-sensitive.

Linus

2003-02-28 08:24:07

by Andi Kleen

Subject: Re: [PATCH] New dcache / inode hash tuning patch

[email protected] (Linus Torvalds) writes:

[please read the whole post before answering. thanks]

> But quite frankly, if the hash list heads are actually noticeable
> memory users, the hash is likely to be _way_ too big. The list heads
> are _much_ smaller than the entries they point to, the hash list just
> shouldn't be big enough that it really matters.

On big machines it is currently 1+MB. IMHO this is way too big.

>
> - hash list cache footprint
>
> Again: the hash head array itself is at least dense in the cache, and
> each entry is _much_ smaller than the actual data structures it
> points to. So even if you improve the hash heads to be better from a
> cache standpoint, you're only getting a very small percentage of the
> real cache costs.

It would be possible to cache line optimize the layout of struct dentry
in addition. May be an interesting add-on project for someone...
But for lookup walking even one cache line - the one containing d_hash -
should be needed. Unless d_hash is unlucky enough to cross a cache
line for its two members ... but I doubt that.

>
> So let's say that the cache costs of the dcache is 4% (according to
> the oprofile run), then saving a few procent of that is not actually
> going to be noticeable at a user level.
>
> And the downsides of the hash list is that addition/removal is costlier
> due to the conditionals, and a non-conditional version (a common

But the list walking is faster. Testing for NULL generates much better
code on i386 than having to dedicate a register for storing the head
to test against. List walking happens more often than insertion/deletion.

I believe the conditionals are completely left in the noise compared
to the cache misses the two pointer head version causes. You can execute
a lot of conditionals in the time needed to serve one cache miss!

Please take a look at the x86 assembly generated by list_for_each
vs hlist_for_each. hlist_for_each looks much nicer, especially when you
can use the register normally wasted on the head for something else
in the loop body.
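
To make the codegen point concrete, here are the two loop shapes side by
side as small, self-contained functions (prefetch hints omitted; the struct
definitions are repeated only so the fragment stands alone):

struct list_head  { struct list_head *next, *prev; };
struct hlist_head { struct hlist_node *first; };
struct hlist_node { struct hlist_node *next, **pprev; };

static unsigned long list_len(struct list_head *head)
{
	unsigned long n = 0;
	struct list_head *pos;

	/* the head must stay live for the termination compare on every step */
	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

static unsigned long hlist_len(struct hlist_head *head)
{
	unsigned long n = 0;
	struct hlist_node *pos;

	/* termination is a plain NULL test; no register pinned on the head */
	for (pos = head->first; pos; pos = pos->next)
		n++;
	return n;
}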

In the case of dcache RCU it also made things simpler/faster because it didn't
require the complicated is_bucket race-breaker check.

> In other words: it may be that our current dentry hashes are too big,
> and that is certainly worth fixing if so. But the "hlist" approach very
> fundamentally cannot really help the _real_ problem very much, and it
> will (slightly) hurt the case where the hashes are actually cached.

I admit it is a kind of micro-optimization, but I believe it is a useful
one. Frankly, wasting two pointers for a hash bucket in a potentially
big hash table is just s*d.

>
> So I really think that the only longterm fix is to make the lookup data
> structures be "local" to the base of the lookup, in order to get away
> from the inherently non-local nature of the current hash lookups.

Yes, that may be a good idea. I don't have time to work on this though.

Still even with local lookups single pointer buckets will likely help ;)

Also isn't it a bit late in the 2.5 cycle to think about such radical
changes like local lookup? It sounds more like a nice 2.7 project. I
believe my patch with a bit more tweaking (my current 64K hash table
seems to be too small) is suitable even for a soon-to-be-stable
kernel.

Also my patch had some other changes that I believe should be included
anyway, because they're independent improvements. It replaces the
max_dentries race-break hack with a better algorithm to detect cycles during the walk.

Also it does more prefetches while list walking which I believe to be
useful.

-Andi

2003-02-28 10:17:29

by Paul Menage

Subject: Re: [Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

>
>It would be possible to cache line optimize the layout of struct dentry
>in addition. May be an interesting add-on project for someone...

I played with this a few months ago, and sent a preliminary patch to
linux-kernel, but in my (fairly brief) testing on a 4-way system I
wasn't able to produce a measurable benefit from it. I think part of
this may have been due to contention on dcache_lock, so maybe dcache_rcu
will help there.

The main changes were to bring together all the data needed for checking
a non-matching dcache hash entry (modulo hash collisions) into one
cacheline, and to separate out the mostly-read-only fields from the
change-frequently fields.

The original patch is at

http://marc.theaimsgroup.com/?l=linux-kernel&m=102650654002932&w=2

>But for lookup walking even one cache line - the one containing d_hash -
>should be needed. Unless d_hash is unlucky enough to cross a cache
>line for its two members ... but I doubt that.

No, but on a 32-byte cache line system, d_parent, d_hash and d_name are
all on different cache lines, and they're used when checking each entry.
On 64-byte systems, d_parent and d_hash will be on the same line, but
d_name is still on a separate line and d_name.hash gets checked before
d_parent. So bringing these three fields on to the same cacheline
would theoretically be a win.

Paul
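
As a rough illustration of that grouping (field set and offsets are
approximate, assuming a 32-bit kernel and the dcache-rcu field names; this is
not a tested layout):

/* Sketch of a lookup-friendly struct dentry: everything d_lookup reads
 * to reject a non-matching chain entry is packed at the front, so a
 * miss touches one cache line instead of two or three. */
struct dentry {
	/* hot, read-mostly lookup fields (~32 bytes on 32-bit) */
	unsigned long		d_move_count;	/* rename sequence check */
	struct hlist_head	*d_bucket;	/* bucket sanity check */
	struct hlist_node	d_hash;		/* chain linkage */
	struct dentry		*d_parent;	/* compared with the lookup parent */
	struct qstr		d_name;		/* d_name.hash is compared first */

	/* colder and frequently written fields follow:
	 * d_count, d_lock, d_inode, d_lru, d_child, d_subdirs, d_alias, ... */
};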


2003-02-28 10:30:41

by Andi Kleen

Subject: Re: [Lse-tech] Re: [PATCH] New dcache / inode hash tuning patch

On Fri, Feb 28, 2003 at 02:27:27AM -0800, Paul Menage wrote:
> >But for lookup walking even one cache line - the one containing d_hash -
> >should be needed. Unless d_hash is unlucky enough to cross a cache
> >line for its two members ... but I doubt that.
>
> No, but on a 32-byte cache line system, d_parent, d_hash and d_name are
> all on different cache lines, and they're used when checking each entry.

... and dcache RCU checks d_bucket and d_move_count too in the hash
walking loop.


> On 64-byte systems, d_parent and d_hash will be on the same line, but
> d_name is still on a separate line and d_name.hash gets checked before
> d_parent. So bringing these three fields on to the same cacheline
> would theoretically be a win.

Ok, you're right. Optimizing the layout a bit would probably be a good
idea. I won't include it in the hash patchkit for now, so as not to do too
many things in the same patch.

-Andi

2003-02-28 18:39:57

by Linus Torvalds

Subject: Re: [PATCH] New dcache / inode hash tuning patch


On 28 Feb 2003, Andi Kleen wrote:
>
> Also isn't it a bit late in the 2.5 cycle to think about such radical
> changes like local lookup?

Look again at my suggestion: forcing locality by just changing the hash
function to _intentionally_ bunch directory hashes together.

Yes, it's "pseudo-locality" (and my example hash was truly broken: we
must _not_ use the directory dentry hash value as part of the hash, as the
hash needs to be stable over the whole lifetime of the directory dentry),
but the point is that by just changing the hashing algorithm you can
potentially get a good portion of the locality we want.

Right now the dcache hash is often something like 17 bits - and we could
easily make it so that roughly "half" the bits would be based purely on
the directory. That would still give each directory ~8 bits worth of
"local hashing", which is fairly reasonable.

> It sounds more like a nice 2.7 project.

It sounds more like changing two lines of code to me.
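
A hedged sketch of what that small change might look like, reusing D_HASHBITS
and dentry_hashtable from the patch earlier in the thread (the even split and
the mixing below are illustrative, not a proposed final function):

#define D_DIRBITS	(D_HASHBITS / 2)
#define D_NAMEBITS	(D_HASHBITS - D_DIRBITS)

static inline struct hlist_head *d_hash(struct dentry *parent, unsigned long hash)
{
	unsigned long dir = (unsigned long) parent / L1_CACHE_BYTES;

	/* upper part of the index: purely a function of the parent, so one
	 * directory's entries land in a contiguous slice of the table */
	dir = (dir ^ (dir >> D_DIRBITS)) & ((1UL << D_DIRBITS) - 1);

	/* lower part: the name hash itself */
	hash = (hash ^ (hash >> D_NAMEBITS)) & ((1UL << D_NAMEBITS) - 1);

	return dentry_hashtable + ((dir << D_NAMEBITS) | hash);
}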

> I believe my patch with a bit more tweaking (my current 64K hash table
> seems to be too small) is suitable even for a soon-to-be-stable
> kernel.

Quite frankly, right now the only report I've seen about your patch is
that it made things slightly _slower_.

For a patch that is supposed to speed stuff up, that's a damn bad track
record. Sorry.

I'd suggest you drop the hash size changes, and try with _just_ the hlist
stuff, and once that is verified to perform well, _then_ worry about
hashing changes. Because quite frankly, I suspect my "local hash" thing
performs better than "make the hashes smaller". And the hash algorithm and
size is _totally_ independent from whether it uses the regular lists or
the hlists..

Linus

2003-02-28 18:48:54

by Andi Kleen

Subject: Re: [PATCH] New dcache / inode hash tuning patch

On Fri, Feb 28, 2003 at 10:47:38AM -0800, Linus Torvalds wrote:
> Right now the dcache hash is often something like 17 bits - and we could
> easily make it so that roughly "half" the bits would be based purely on
> the directory. That would still give each directory ~8 bits worth of
> "local hashing", which is fairly reasonable.

Ok I will see if that helps.
>
> > I believe my patch with a bit more tweaking (my current 64K hash table
> > seems to be too small) is suitable even for an soon to be stable
> > kernel.
>
> Quite frankly, right now the only report I've seen about your patch is
> that it made things slightly _slower_.

Actually that's not quite true. The report had a completely different
profile (lots of other functions had different percentages), so it likely
wasn't a comparable workload. I also don't think the NUMAQs are a good test
platform for this because they have 2MB of fast cache per CPU, while
the typical Linux multiprocessor machine has much less. Yes, you can
fit a 1MB hash table into a 2MB cache...

I'll generate some new numbers here locally over the weekend on a P4 and see
how it performs, but I only have a dual to test on.

-Andi

2003-03-01 00:39:24

by Jan Harkes

Subject: Re: [PATCH] New dcache / inode hash tuning patch

On Fri, Feb 28, 2003 at 07:59:10PM +0100, Andi Kleen wrote:
> > Quite frankly, right now the only report I've seen about your patch is
> > that it made things slightly _slower_.
>
> Actually that's not quite true. The report had a completely different
> profile (lots of other functions had different percentages), so it likely
> wasn't a comparable workload. I also don't think the NUMAQs are a good test

Did you read any of the comments on your patch?

Look at slabinfo to see how much space those objects are actually using.

Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode cache hash table entries: 65536 (order: 7, 524288 bytes)
inode_cache 33977 82901 512 11810 11843 1 : 124 62
dentry_cache 14490 14490 128 483 483 1 : 252 126

So the actual amount of memory used by f.i. the inodes is about 47MB
(11810 * 4KB).

The dentry cache hashtable currently has a fill-rate of about 10%, but
I've been running a lot of compilations and updatedb hasn't been around
for a long time. The VM has already aggressively pruned it, which is to be
expected. The inode cache hash table is at about 50%. So if the hash
function is good we can assume that most hash chains are either empty or
only contain a single object and any hit will give us either a pointer
to the likely candidate or we know that the object isn't cached.

Now from what I read of your patch, it limits the size of the inode hash
table to 64KB, but with single pointers it can still address about 16000
hash chains. Assuming the same perfect hash distribution, all chains
will contain about 2 objects. So although the hash table fits much
better, each time we are looking for an uncached object we need to walk
2 objects that are spread across that 47MB chunk of allocated inodes,
instead of basically having a 'yes or no' answer pretty much from
looking at the hash table.

> platform for this because they have 2MB of fast cache per CPU, while
> the typical linux multiprocessor machine has much less. Yes you can
> fit an 1MB hash table into a 2 MB cache....

Yes but you will never fit 47MB worth of inodes into that same 2MB
cache. And I realize that you can align things so that you don't have
to pull in more than a few cache-lines per object, but traversing ~0.5
pointers versus 2 pointers for every L1/L2 cache miss will be felt
because each traversal will hit another L2 cache-line.

It is probably far more useful to restrain the sizes of the inode and
dentry caches. Why would my system need 2-3 times the number of inodes
compared to dentries, while you pretty much always need to go through
the dentry cache to find the inodes? So a lot of those inodes are likely
unreferenced and we already have to hit the disk for the dentry lookup
anyways.

Jan

2003-03-01 01:07:03

by Andi Kleen

Subject: Re: [PATCH] New dcache / inode hash tuning patch

> The dentry cache hashtable currently has a fill-rate of about 10%, but
> I've been running a lot of compilations and updatedb hasn't been around
> for a long time. The VM has already aggressivly pruned it, which is to be
> expected. The inode cache hash table is at about 50%. So if the hash
> function is good we can assume that most hash chains are either empty or

The hash function isn't good and the hash chains are far from evenly
distributed. There are typically 30-40% empty buckets, while some
others are rather long.

In addition I think only a small fraction of the dentries are actually hot.
(no numbers on that, sorry)

Your 47MB number doesn't make much sense for lookup tuning because these
are not walked all the time. Only a few cache lines of the dentry are accessed
for a lookup.

If you think you have a better concept than mine for tuning it, then feel
free to post a patch.


> Now from what I read of your patch, it limits the size of the inode hash
> table to 64KB, but with single pointers it can still address about 16000
> hash chains. Assuming the same perfect hash distribution, all chains
> will contain about 2 objects. So although the hash table fits much
> better, each time we are looking for an uncached object we need to walk
> 2 objects that are spread across that 47MB chunk of allocated inodes,
> instead of basically having a 'yes or no' answer pretty much from
> looking at the hash table.

Apply the latest hash patch and run some statistics over the bucket
distributions from /proc/dcache. You'll quickly see that your ideal model
does not match reality at all.

> Yes but you will never fit 47MB worth of inodes into that same 2MB
> cache. And I realize that you can align things so that you don't have

The overall cached inodes are not too interesting - it is a long-known
problem of Linux that it caches too many inodes. But it doesn't matter
here because the dcache does all the hard lookup work and it has
direct pointers to the inodes (no inode hash lookup needed). Most of them
are just wasted memory, but luckily not wasted cache.

What's interesting is to make the dentry lookup as fast as possible.

> to pull in more than a few cache-lines per object, but traversing ~0.5
> pointers versus 2 pointers for every L1/L2 cache miss will be felt
> because each traversal will hit another L2 cache-line.

Sorry, but the assumed average bucket length of 2 is completely unrealistic.
It doesn't make any sense to use it in an argument.


>
> It is probably far more useful to restrain the sizes of the inode and
> dentry caches. Why would my system need 2-3 times the number of inodes

Limiting inode caches is probably a good idea. It's a long-standing known
bug. At the very least, inodes not referenced by the dcache should be pruned.
I believe some 2.4 trees fixed that in fact, but those fixes have
likely not moved to 2.5 yet. But really, it makes no difference
for these benchmarks, which have enough memory.

Limiting dcache is probably a bad idea. Recreating the dcache is far
slower than any hash lookup. And a lot of Linux's good interactive
performance comes from the aggressive dcache caching.

-Andi

2003-03-01 04:20:51

by Ed Tomlinson

Subject: Re: [PATCH] New dcache / inode hash tuning patch

Andi Kleen wrote:

> The hash function isn't good and the hash chains are far from evenly
> distributed. There are typically 30-40% empty buckets, while some
> others are rather long.
>
> In addition I think only a small fraction of the dentries are actually
> hot. (no numbers on that, sorry)

I wonder what would happen if you reordered the chains moving a 'found'
dentry to the front of the chain? If this could be done without
excessive locking it might help keep hot entries quickly accessible.
This operation should be cheaper given you are using hlists.

Ed Tomlinson
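
A rough sketch of the mechanics being suggested, using the hlist primitives
from the patch; d_promote is a made-up name, and the hard part - keeping this
safe against the lockless RCU lookup - is deliberately glossed over here,
which is exactly what the follow-ups below turn on:

/* Hypothetical helper: after a successful lookup, re-insert the dentry
 * at the head of its hash chain so hot entries are found first.
 * Chain writers are serialized by dcache_lock. */
static void d_promote(struct dentry *dentry)
{
	struct hlist_head *bucket = dentry->d_bucket;

	/* already at the front: nothing to do, no lock taken */
	if (bucket->first == &dentry->d_hash)
		return;

	spin_lock(&dcache_lock);
	hlist_del_rcu(&dentry->d_hash);
	hlist_add_head_rcu(&dentry->d_hash, bucket);
	spin_unlock(&dcache_lock);
}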

2003-03-01 08:58:19

by Andi Kleen

Subject: Re: [PATCH] New dcache / inode hash tuning patch

> I wonder what would happen if you reordered the chains moving a 'found'
> dentry to the front of the chain? If this could be done without
> excessive locking it might help keep hot entries quickly accessible.
> This operation should be cheaper given you are using hlists.

You would need to fetch a spinlock for LRU, while the current lookup
runs completely lockless.

-Andi

2003-03-01 12:24:02

by Ed Tomlinson

Subject: Re: [PATCH] New dcache / inode hash tuning patch

On March 1, 2003 04:08 am, Andi Kleen wrote:
> > I wonder what would happen if you reordered the chains moving a 'found'
> > dentry to the front of the chain? If this could be done without
> > excessive locking it might help keep hot entries quickly accessible.
> > This operation should be cheaper given you are using hlists.
>
> You would need to fetch a spinlock for LRU, while the current lookup
> runs completely lockless.

You would have to lock the chain IF the dentry was not at the head. It would
be interesting to see if this locking would hurt much - since hot
dentries would not require a lock...

Ed

2003-03-01 18:24:51

by Linus Torvalds

Subject: Re: [PATCH] New dcache / inode hash tuning patch

In article <[email protected]>,
Ed Tomlinson <[email protected]> wrote:
>
>I wonder what would happen if you reordered the chains moving a 'found'
>dentry to the front of the chain? If this could be done without
>excessive locking it might help keep hot entries quickly accessable.

The original dentry code actually did that. It sucks.

The reason it sucks is not so much just the locking, but the fact that
you dirty the cache lines, which means that not only are you blowing
your cache on that CPU, you also cause the other CPUs to blow _their_
caches (the lines that are in the cache can no longer be shared) AND you
cause excessive bus traffic for the writeouts.

In other words: it makes sense if there are one or two really hot
entries. But it does not make sense in general. But you might have
some heuristic that does it "every 1000 lookups" or something like that,
to avoid the problems but still statistically get the really hot
entries closer to the top.

This cache behaviour is, btw, something that rcu made worse - with the
pre-rcu stuff, we avoided taking the dcache locks and incrementing the
dcache counters for intermediate cached lookups, and we only did it for
the leaf entry (or misses).

I hope that we can re-do that optimization _with_ rcu in 2.7.x.

Linus