2013-07-18 00:55:38

by T Makphaibulchoke

Subject: [PATCH 0/2] ext4: increase mbcache scalability

This patch intends to improve the scalability of an ext4 filesystem by
introducing a higher degree of parallelism to the usage of its mb_cache and
mb_cache_entry structures.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| custom | 13.30 |
---------------------------
| disk | 5.00 |
---------------------------
| new_fserver | 6.73 |
---------------------------
| shared | 7.12 |
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 9.14 | 9.26 | 19.9 | 5.31 | 7.79 | 0.23 |-4.39 |-4.48 |-4.36 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 6.73 | 16.37 |
--------------------------------------------



On an 80 core machine:

Here are the improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| fserver | 6.47 |
---------------------------
| new_fserver | 4.62 |
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 4.29 | 2.28 | 3.25 | 1.45 |-0.84 | 3.11 |-0.61 | 3.80 | 2.58 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 0.49 | 3.21 |
--------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.


fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 8 ++
fs/ext4/xattr.c | 41 +++++---
fs/ext4/xattr.h | 3 +
fs/mbcache.c | 432 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
include/linux/mbcache.h | 5 +
6 files changed, 385 insertions(+), 105 deletions(-)


2013-08-22 15:55:05

by T Makphaibulchoke

Subject: [PATCH v2 0/2] ext4: increase mbcache scalability

The patch consists of two parts. The first part introduces a higher degree
of parallelism to the usage of the mb_cache and mb_cache_entry structures
and impacts all ext filesystems.

The second part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mb_cache structure, instead of sharing a single mb_cache structure across
all ext4 filesystems.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 11.85 |
---------------------------
| custom | 14.42 |
---------------------------
| fserver | 21.36 |
---------------------------
| new_dbase | 5.59 |
---------------------------
| new_fserver | 21.45 |
---------------------------
| shared | 12.84 |
---------------------------
For Swingbench dss workload, with a 16 GB database,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 |-0.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using| | | | | | | | | |
| shared memory| | | | | | | | | |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 5.99 | N/A |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be the same.

Here are the results of Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (2):
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++--
fs/ext4/xattr.c | 51 +++++----
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 293 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
6 files changed, 269 insertions(+), 116 deletions(-)

--
1.7.11.3

2013-08-22 15:55:14

by T Makphaibulchoke

Subject: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
using the built-in bit locks of the hlist_bl hash chains to protect the
mb_cache's local block and index hash chains, while the global
mb_cache_lru_list and mb_cache_list continue to be protected by the global
mb_cache_spinlock.
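
For illustration, a minimal sketch of the per-chain locking pattern this
enables (not part of the patch; the identifiers follow the diff below):

    struct hlist_bl_head *head = &cache->c_block_hash[bucket];
    struct hlist_bl_node *p;
    struct mb_cache_entry *ce;

    hlist_bl_lock(head);            /* serializes only this hash chain */
    hlist_bl_for_each_entry(ce, p, head, e_block_list) {
            if (ce->e_bdev == bdev && ce->e_block == block)
                    break;          /* found a matching entry */
    }
    hlist_bl_unlock(head);          /* other chains remain uncontended */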

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v2:
- As per Linus Torvalds' suggestion, instead of allocating spinlock
arrays to protect the hash chains, use hlist_bl_head, which already
contains a built-in lock.

fs/mbcache.c | 293 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
2 files changed, 221 insertions(+), 82 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..cd4446e 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,16 @@
* back on the lru list.
*/

+/* Locking protocol:
+ *
+ * The nth hash chain of both the c_block_hash and c_index_hash are
+ * protected by the mth entry of the c_bdev_locks and c_key_locks respectively,
+ * where m is equal to n & c_lock_mask.
+ *
+ * While holding a c_bdev_locks, a thread can acquire either a c_key_locks
+ * or mb_cache_spinlock.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -35,9 +45,9 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>

-
#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
printk(KERN_DEBUG f); \
@@ -57,6 +67,40 @@

#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

+#define MB_LOCK_HASH_CHAIN(hash_head) hlist_bl_lock(hash_head)
+#define MB_UNLOCK_HASH_CHAIN(hash_head) hlist_bl_unlock(hash_head)
+#ifdef MB_CACHE_DEBUG
+#define MB_LOCK_BLOCK_HASH(ce) do { \
+ struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;\
+ MB_LOCK_HASH_CHAIN(block_hash_p); \
+ mb_assert(ce->e_block_hash_p == block_hash_p); \
+ } while (0)
+#else
+#define MB_LOCK_BLOCK_HASH(ce) MB_LOCK_HASH_CHAIN(ce->e_block_hash_p)
+#endif
+#define MB_UNLOCK_BLOCK_HASH(ce) MB_UNLOCK_HASH_CHAIN(ce->e_block_hash_p)
+#define MB_LOCK_INDEX_HASH(ce) MB_LOCK_HASH_CHAIN(ce->e_index_hash_p)
+#define MB_UNLOCK_INDEX_HASH(ce) MB_UNLOCK_HASH_CHAIN(ce->e_index_hash_p)
+
+#define MAX_LOCK_RETRY 1024
+
+static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
+{
+ int nretry = 0;
+ struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
+
+ do {
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ if (ce->e_block_hash_p == block_hash_p)
+ return 0;
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
+ block_hash_p = ce->e_block_hash_p;
+ } while (++nretry < MAX_LOCK_RETRY);
+ return -1;
+}
+
+#define MB_LOCK_ENTRY_BLOCK_HASH(ce) __mb_cache_lock_entry_block_hash(ce)
+
static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
@@ -101,7 +145,7 @@ static struct shrinker mb_cache_shrinker = {
static inline int
__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


@@ -109,11 +153,26 @@ static void
__mb_cache_entry_unhash(struct mb_cache_entry *ce)
{
if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
+ hlist_bl_del_init(&ce->e_block_list);
+ MB_LOCK_INDEX_HASH(ce);
+ hlist_bl_del(&ce->e_index.o_list);
+ MB_UNLOCK_INDEX_HASH(ce);
}
}

+#ifdef MAK_SRC
+static void
+__mb_cache_entry_unhash_nolock(struct mb_cache_entry *ce)
+{
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
+ mb_assert(!ce->e_block_hash_p);
+ mb_assert(!ce->e_index_hash_p);
+
+ hlist_bl_del_init(&ce->e_block_list);
+ hlist_bl_del(&ce->e_index.o_list);
+}
+#endif /* MAK_SRC */
+

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -127,9 +186,10 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)


static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release_unlock(struct mb_cache_entry *ce,
+ struct hlist_bl_head *hash_p)
{
+ mb_assert(hash_p);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
@@ -139,13 +199,17 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
if (!(ce->e_used || ce->e_queued)) {
if (!__mb_cache_entry_is_hashed(ce))
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
- }
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(hash_p);
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ } else
+ MB_UNLOCK_HASH_CHAIN(hash_p);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(hash_p);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

@@ -164,31 +228,45 @@ forget:
static int
mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
struct mb_cache *cache;
- struct mb_cache_entry *entry, *tmp;
int count = 0;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
+ int max_loop = nr_to_scan << 1;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
+ while ((nr_to_scan-- > 0) && (max_loop-- > 0)) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(!(ce->e_used || ce->e_queued));
__mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ nr_to_scan--;
}
+ spin_lock(&mb_cache_spinlock);
list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
mb_debug("cache %s (%d)", cache->c_name,
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
- }
return (count / 100) * sysctl_vfs_cache_pressure;
}

@@ -216,18 +294,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +354,22 @@ mb_cache_shrink(struct block_device *bdev)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
spin_unlock(&mb_cache_spinlock);
list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(!ce->e_used && !ce->e_queued);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -306,15 +393,18 @@ mb_cache_destroy(struct mb_cache *cache)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_cache == cache) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ MB_LOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +434,26 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
struct mb_cache_entry *ce = NULL;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+ struct list_head *ltmp;
+
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list,
+ &mb_cache_lru_list);
+ continue;
+ }
+ mb_assert(!ce->e_used && !ce->e_queued);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ break;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
@@ -359,9 +463,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
}
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
@@ -388,28 +495,36 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ block_hash_p = &cache->c_block_hash[bucket];
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ ce->e_index_hash_p = index_hash_p;
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
return error;
}

@@ -424,8 +539,8 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ MB_LOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
}


@@ -438,10 +553,12 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
+ mb_assert(ce->e_block_hash_p);
+ MB_LOCK_BLOCK_HASH(ce);
mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
}


@@ -458,34 +575,39 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ block_hash_p = &cache->c_block_hash[bucket];
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

while (ce->e_used > 0) {
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_BLOCK_HASH(ce);
schedule();
- spin_lock(&mb_cache_spinlock);
+ MB_LOCK_BLOCK_HASH(ce);
ce->e_queued--;
}
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

if (!__mb_cache_entry_is_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release_unlock(ce,
+ block_hash_p);
return NULL;
}
goto cleanup;
@@ -494,25 +616,30 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
ce = NULL;

cleanup:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
return ce;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

+ MB_UNLOCK_HASH_CHAIN(head);
+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

+ MB_LOCK_HASH_CHAIN(head);
/* Incrementing before holding the lock gives readers
priority over writers. */
ce->e_used++;
@@ -520,18 +647,20 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(head);
schedule();
- spin_lock(&mb_cache_spinlock);
+ MB_LOCK_HASH_CHAIN(head);
+ mb_assert(ce->e_index_hash_p == head);
ce->e_queued--;
}
finish_wait(&mb_cache_queue, &wait);

if (!__mb_cache_entry_is_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ __mb_cache_entry_release_unlock(ce, head);
+ MB_LOCK_HASH_CHAIN(head);
return ERR_PTR(-EAGAIN);
}
+ mb_assert(ce->e_index_hash_p == head);
return ce;
}
l = l->next;
@@ -557,13 +686,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- spin_unlock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
return ce;
}

@@ -592,13 +725,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
return ce;
}

diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2013-08-22 15:55:22

by T Makphaibulchoke

Subject: [PATCH v2 2/2] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.
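
For illustration, a minimal sketch of how a mounted filesystem uses the new
interfaces (not part of the patch; the actual hunks are in the diff below):

    /* at mount time, in ext4_fill_super() */
    sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
    if (!sbi->s_mb_cache) {
            ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
            goto failed_mount_wq;
    }

    /* at unmount time, in ext4_put_super() */
    if (sbi->s_mb_cache) {
            ext4_xattr_destroy_cache(sbi->s_mb_cache);
            sbi->s_mb_cache = NULL;
    }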

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5448,10 +5458,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
--
1.7.11.3

2013-08-22 16:53:48

by Linus Torvalds

Subject: Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

On Thu, Aug 22, 2013 at 8:54 AM, T Makphaibulchoke <[email protected]> wrote:
>
> +#define MB_LOCK_HASH_CHAIN(hash_head) hlist_bl_lock(hash_head)
> +#define MB_UNLOCK_HASH_CHAIN(hash_head) hlist_bl_unlock(hash_head)
> +#ifdef MB_CACHE_DEBUG
> +#define MB_LOCK_BLOCK_HASH(ce) do {

Please don't do these ugly and pointless preprocessor macro expanders
that hide what the actual operation is.

The DEBUG case seems to be just for your own testing anyway, so even
that shouldn't exist in the merged version.

> +#define MAX_LOCK_RETRY 1024
> +
> +static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
> +{
> + int nretry = 0;
> + struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
> +
> + do {
> + MB_LOCK_HASH_CHAIN(block_hash_p);
> + if (ce->e_block_hash_p == block_hash_p)
> + return 0;
> + MB_UNLOCK_HASH_CHAIN(block_hash_p);
> + block_hash_p = ce->e_block_hash_p;
> + } while (++nretry < MAX_LOCK_RETRY);
> + return -1;
> +}

And this is really ugly. Again it's also then hidden behind the ugly macro.

First off, the thousand-time retry seems completely excessive. Does it
actually need any retry AT ALL? If the hash entry changes, either you
should retry forever, or if you feel that can result in livelocks
(fair enough) and you need a fallback case to a bigger lock, then why
not just do the fallback immediately?

More importantly, regardless of that retry issue, this seems to be
abstracted at the wrong level, resulting in every single user of this
repeating the same complex and hard-to-understand incantation:

> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> + spin_unlock(&mb_cache_spinlock);
> + continue;
> + }
..
> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> + spin_unlock(&mb_cache_spinlock);
> + continue;
> + }
..
> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list,
> + &mb_cache_lru_list);
> + continue;
> + }

where the only difference is that the last one doesn't unlock
afterwards because it runs in a loop with that LRU list lock held.
Ugh.

The locking logic also isn't explained anywhere, making the
hard-to-read code even harder to read.

Linus

Subject: Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

Thanks for the comments.

On 08/22/2013 04:53 PM, Linus Torvalds wrote:
>
> Please don't do these ugly and pointless preprocessor macro expanders
> that hide what the actual operation is.
>
> The DEBUG case seems to be just for your own testing anyway, so even
> that shouldn't exist in the merged version.
>

Sorry, will clean them all up.

>
> And this is really ugly. Again it's also then hidden behind the ugly macro.
>
> First off, the thousand-time retry seems completely excessive. Does it
> actually need any retry AT ALL? If the hash entry changes, either you
> should retry forever, or if you feel that can result in livelocks
> (fair enough) and you need a fallback case to a bigger lock, then why
> not just do the fallback immediately?
>
> More importantly, regardless of that retry issue, this seems to be
> abstracted at the wrong level, resulting in every single user of this
> repeating the same complex and hard-to-understand incantation:
>

Looks like this is a misjudgement on my part. There is really no need to guard against an mb_cache_entry moving to a different hash chain, as the shrinking and allocation functions already protect against each other through mb_cache_spinlock. The retry is not needed.
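
For example, the shrinker path could simply take the entry's chain locks
directly, with no retry (sketch only; the split unhash helpers are
hypothetical here, but this is essentially what the later v3 patch does):

    /* ce was already taken off mb_cache_lru_list under mb_cache_spinlock */
    hlist_bl_lock(ce->e_block_hash_p);
    hlist_bl_lock(ce->e_index_hash_p);
    if (!(ce->e_used || ce->e_queued)) {
            __mb_cache_entry_unhash_index(ce);
            hlist_bl_unlock(ce->e_index_hash_p);
            __mb_cache_entry_unhash_block(ce);
            hlist_bl_unlock(ce->e_block_hash_p);
            __mb_cache_entry_forget(ce, gfp_mask);
    } else {
            hlist_bl_unlock(ce->e_index_hash_p);
            hlist_bl_unlock(ce->e_block_hash_p);
    }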

>
> where the only difference is that the last one doesn't unlock
> afterwards because it runs in a loop with that LRU list lock held.
> Ugh.

Following the above logic, these pieces of code are also not necessary and could be reduced to a simple unhash, as in the original.

>
> The locking logic also isn't explained anywhere, making the
> hard-to-read code even harder to read.


Will add a comment explaining the locking logic.

>
> Linus
>

Thanks,
Mak.

2013-09-04 16:39:35

by T Makphaibulchoke

Subject: [PATCH v3 0/2] ext4: increase mbcache scalability

This patch intends to improve the scalability of an ext filesystem,
particularly ext4.

The patch consists of two parts. The first part introduces a higher degree
of parallelism to the usage of the mb_cache and mb_cache_entry structures
and impacts all ext filesystems.

The second part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mb_cache structure, instead of sharing a single mb_cache structure across
all ext4 filesystems.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 11.85 |
---------------------------
| custom | 14.42 |
---------------------------
| fserver | 21.36 |
---------------------------
| new_dbase | 5.59 |
---------------------------
| new_fserver | 21.45 |
---------------------------
| shared | 12.84 |
---------------------------
For Swingbench dss workload, with a 16 GB database,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 |-0.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using| | | | | | | | | |
| shared memory| | | | | | | | | |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 5.99 | N/A |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be the same.

Here are the results of Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (2):
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++--
fs/ext4/xattr.c | 51 ++++----
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
6 files changed, 277 insertions(+), 121 deletions(-)

--
1.7.11.3

2013-09-04 16:39:55

by T Makphaibulchoke

Subject: [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v3:
- No change

Changed in v2:
- No change

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5448,10 +5458,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
--
1.7.11.3

2013-09-04 16:39:54

by T Makphaibulchoke

Subject: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hash chain's hlist_bl_head contains a
built-in bit lock used to protect the mb_cache's local block and index
hash chains. The global data mb_cache_lru_list and mb_cache_list continue
to be protected by the global mb_cache_spinlock.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v3:
- Removed all hash lock macros.
- Fixed a possible race condition updating the e_used and
e_queued members of an mb_cache_entry between the mb_cache_entry_get
function, traversing a block hash chain, and the mb_cache_entry_find
function, traversing an index hash chain.

Changed in v2:
- As per Linus Torvalds' suggestion, instead of allocating spinlock
arrays to protect the hash chains, use hlist_bl_head, which already
contains a built-in lock.

fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
2 files changed, 229 insertions(+), 87 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..dd45fe9 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,38 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest order, followed by each
+ * index hash chain's lock, with mb_cache_spinlock the lowest.
+ * While holding a block hash chain lock a thread can acquire either
+ * an index hash chain lock or mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both the e_used and e_queued members of each mb_cache_entry can
+ * be updated while traversing either a block hash chain or an index hash
+ * chain and the index hash chain lock has lower order, each index hash
+ * chain's lock, in addition to being used to serialize accesses to the
+ * index hash chain, is also used to serialize accesses to both e_used
+ * and e_queued.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced. Both e_used
+ * and e_queued are 0's. When an mb_cache_entry is explicitly
+ * freed it is first removed from a block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -35,9 +67,9 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>

-
#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
printk(KERN_DEBUG f); \
@@ -99,23 +131,34 @@ static struct shrinker mb_cache_shrinker = {
};

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}


static void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del(&ce->e_index.o_list);
+}
+
+static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
{
struct mb_cache *cache = ce->e_cache;
@@ -128,8 +171,8 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)

static void
__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
{
+ hlist_bl_lock(ce->e_index_hash_p);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
@@ -137,15 +180,20 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ hlist_bl_unlock(ce->e_index_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
+ spin_lock(&mb_cache_spinlock);
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
- }
- spin_unlock(&mb_cache_spinlock);
+ spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

@@ -164,31 +212,46 @@ forget:
static int
mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
struct mb_cache *cache;
- struct mb_cache_entry *entry, *tmp;
int count = 0;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ while (nr_to_scan > 0) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ --nr_to_scan;
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
+ spin_lock(&mb_cache_spinlock);
list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
mb_debug("cache %s (%d)", cache->c_name,
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
- }
return (count / 100) * sysctl_vfs_cache_pressure;
}

@@ -216,18 +279,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +339,24 @@ mb_cache_shrink(struct block_device *bdev)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
spin_unlock(&mb_cache_spinlock);
list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
}

@@ -306,15 +380,21 @@ mb_cache_destroy(struct mb_cache *cache)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_cache == cache) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +424,27 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
struct mb_cache_entry *ce = NULL;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+ struct list_head *ltmp;
+
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ break;
+ }
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
spin_unlock(&mb_cache_spinlock);
}
@@ -359,9 +454,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
}
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
@@ -388,28 +486,37 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ block_hash_p = &cache->c_block_hash[bucket];
+ hlist_bl_lock(block_hash_p);
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
- __mb_cache_entry_unhash(ce);
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_unlock(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return error;
}

@@ -424,7 +531,7 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_block_hash_p);
__mb_cache_entry_release_unlock(ce);
}

@@ -438,9 +545,13 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
__mb_cache_entry_release_unlock(ce);
}

@@ -458,33 +569,46 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;
+ int held_block_lock = 1;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ block_hash_p = &cache->c_block_hash[bucket];
+ hlist_bl_lock(block_hash_p);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

+ hlist_bl_lock(ce->e_index_hash_p);
while (ce->e_used > 0) {
ce->e_queued++;
+ hlist_bl_unlock(ce->e_index_hash_p);
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ if (held_block_lock) {
+ hlist_bl_unlock(block_hash_p);
+ held_block_lock = 0;
+ }
schedule();
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_index_hash_p);
ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;
+ hlist_bl_unlock(ce->e_index_hash_p);
+ finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!held_block_lock)
+ hlist_bl_lock(block_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -494,24 +618,27 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
ce = NULL;

cleanup:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return ce;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

/* Incrementing before holding the lock gives readers
priority over writers. */
@@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(head);
schedule();
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(head);
+ mb_assert(ce->e_index_hash_p == head);
ce->e_queued--;
}
+ hlist_bl_unlock(head);
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ hlist_bl_lock(ce->e_block_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(head);
return ERR_PTR(-EAGAIN);
- }
+ } else
+ hlist_bl_unlock(ce->e_block_hash_p);
+ mb_assert(ce->e_index_hash_p == head);
return ce;
}
l = l->next;
@@ -557,13 +689,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- spin_unlock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -592,13 +728,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
+ hlist_bl_lock(index_hash_p);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ hlist_bl_unlock(index_hash_p);
return ce;
}

diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3
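
(Condensed illustration, not code from the patch itself: the eviction
pattern that the shrinker, mb_cache_shrink() and mb_cache_entry_alloc()
hunks above now all follow -- take the entry's two per-chain bit locks
instead of the global mb_cache_spinlock, and only free an entry that is
neither in use nor being waited on.)

        static void example_evict(struct mb_cache_entry *ce, gfp_t gfp_mask)
        {
                hlist_bl_lock(ce->e_block_hash_p);
                hlist_bl_lock(ce->e_index_hash_p);
                if (!(ce->e_used || ce->e_queued)) {
                        __mb_cache_entry_unhash_index(ce);
                        hlist_bl_unlock(ce->e_index_hash_p);
                        __mb_cache_entry_unhash_block(ce);
                        hlist_bl_unlock(ce->e_block_hash_p);
                        /* entry is off both chains; safe to free */
                        __mb_cache_entry_forget(ce, gfp_mask);
                } else {
                        hlist_bl_unlock(ce->e_index_hash_p);
                        hlist_bl_unlock(ce->e_block_hash_p);
                }
        }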

2013-09-04 20:00:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-04, at 10:39 AM, T Makphaibulchoke wrote:
> This patch intends to improve the scalability of an ext filesystem,
> particularly ext4.

In the past, I've raised the question of whether mbcache is even
useful on real-world systems. Essentially, this is providing a
"deduplication" service for ext2/3/4 xattr blocks that are identical.
The question is how often this is actually the case in modern use?
The original design was for allowing external ACL blocks to be
shared between inodes, at a time when ACLs were pretty much the
only xattrs stored on inodes.

The question now is whether there are common uses where all of the
xattrs stored on multiple inodes are identical? If that is not the
case, mbcache is just adding overhead and should be disabled
entirely, rather than merely having its overhead reduced.

There aren't good statistics on the hit rate for mbcache, but it
might be possible to generate some with systemtap or similar to
see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.

Cheers, Andreas

> The patch consists of two parts. The first part introduces higher
> degree of parallelism to the usages of the mb_cache and
> mb_cache_entries and impacts all ext filesystems.
>
> The second part of the patch further increases the scalablity of
> an ext4 filesystem by having each ext4 fielsystem allocate and use
> its own private mbcache structure, instead of sharing a single
> mcache structures across all ext4 filesystems
>
> Here are some of the benchmark results with the changes.
>
> On a 90 core machine:
>
> Here are the performance improvements in some of the aim7 workloads,
>
> ---------------------------
> | | % increase |
> ---------------------------
> | alltests | 11.85 |
> ---------------------------
> | custom | 14.42 |
> ---------------------------
> | fserver | 21.36 |
> ---------------------------
> | new_dbase | 5.59 |
> ---------------------------
> | new_fserver | 21.45 |
> ---------------------------
> | shared | 12.84 |
> ---------------------------
> For Swingbench dss workload, with 16 GB database,
>
> -------------------------------------------------------------------------------
> | Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
> -------------------------------------------------------------------------------
> | % imprvoment | 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
> -------------------------------------------------------------------------------
> | % imprvoment |45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
> | without using| | | | | | | | | |
> | shared memory| | | | | | | | | |
> -------------------------------------------------------------------------------
> For SPECjbb2013, composite run,
>
> --------------------------------------------
> | | max-jOPS | critical-jOPS |
> --------------------------------------------
> | % improvement | 5.99 | N/A |
> --------------------------------------------
>
>
> On an 80 core machine:
>
> The aim7's results for most of the workloads turn out to the same.
>
> Here are the results of Swingbench dss workload,
>
> -------------------------------------------------------------------------------
> | Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
> -------------------------------------------------------------------------------
> | % imprvoment |-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
> -------------------------------------------------------------------------------
>
> The changes have been tested with ext4 xfstests to verify that no regression
> has been introduced.
>
> Changed in v3:
> - New diff summary
>
> Changed in v2:
> - New performance data
> - New diff summary
>
> T Makphaibulchoke (2):
> mbcache: decoupling the locking of local from global data
> ext4: each filesystem creates and uses its own mb_cache
>
> fs/ext4/ext4.h | 1 +
> fs/ext4/super.c | 24 ++--
> fs/ext4/xattr.c | 51 ++++----
> fs/ext4/xattr.h | 6 +-
> fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
> include/linux/mbcache.h | 10 +-
> 6 files changed, 277 insertions(+), 121 deletions(-)
>
> --
> 1.7.11.3
>


Cheers, Andreas




Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/04/2013 08:00 PM, Andreas Dilger wrote:

>
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems. Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical? If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas
>

Thanks Andreas for the comments. Since I'm not familiar with systemtap, I'm thinking the quickest and simplest way is to re-run aim7 and Swingbench with mbcache disabled for comparison. Please let me know if you have any other benchmark suggestions, or if you think systemtap on ext4_xattr_cache_find() would give a more accurate measurement.

Thanks,
Mak.

2013-09-05 02:35:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, Sep 04, 2013 at 10:39:14AM -0600, T Makphaibulchoke wrote:
>
> Here are the performance improvements in some of the aim7 workloads,

How did you gather these results? The mbcache is only used if you are
using extended attributes, and only if the extended attributes don't
fit in the inode's extra space.

I checked aim7, and it doesn't do any extended attribute operations.
So why are you seeing differences? Are you doing something like
deliberately using 128 byte inodes (which is not the default inode
size), and then enabling SELinux, or some such?

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
> How did you gather these results? The mbcache is only used if you are
> using extended attributes, and only if the extended attributes don't
> fit in the inode's extra space.
>
> I checked aim7, and it doesn't do any extended attribute operations.
> So why are you seeing differences? Are you doing something like
> deliberately using 128 byte inodes (which is not the default inode
> size), and then enabling SELinux, or some such?
>
> - Ted
>

No, I did not do anything special, including changing the inode size. I just used the profile data, which indicated the mb_cache module as one of the bottlenecks. Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.


|--3.51%-- __mb_cache_entry_find
| mb_cache_entry_find_first
| ext4_xattr_cache_find
| ext4_xattr_block_set
| ext4_xattr_set_handle
| ext4_initxattrs
| security_inode_init_security
| ext4_init_security
| __ext4_new_inode
| ext4_create
| vfs_create
| lookup_open
| do_last
| path_openat
| do_filp_open
| do_sys_open
| SyS_open
| sys_creat
| system_call
| __creat_nocancel
| |
| |--16.67%-- 0x11fe2c0
| |

Thanks,
Mak.

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/04/2013 08:00 PM, Andreas Dilger wrote:
>
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems. Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical? If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas
>

Looks like it's a bit harder to disable mbcache than I thought. I ended up adding code to collect the statistics.

With SELinux enabled, for the new_fserver workload of aim7, there are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls that result in a hit and 0xc100000000000000 calls that do not. The numbers do not seem to favor completely disabling mbcache in this case.
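
The instrumentation amounts to something like the sketch below; the counter
names and placement are only illustrative, not the exact code I used -- two
counters bumped on the two outcomes of ext4_xattr_cache_find() and dumped
via printk (or a debugfs file) at the end of the run:

        /* illustration only, in fs/ext4/xattr.c */
        static atomic64_t ext4_xattr_cache_hits = ATOMIC64_INIT(0);
        static atomic64_t ext4_xattr_cache_misses = ATOMIC64_INIT(0);

        /* ... where ext4_xattr_cache_find() returns its buffer_head: */
        if (bh)                 /* found a reusable xattr block */
                atomic64_inc(&ext4_xattr_cache_hits);
        else                    /* no identical block cached */
                atomic64_inc(&ext4_xattr_cache_misses);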

Thanks,
Mak.

2013-09-06 05:10:43

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
> On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
>> How did you gather these results? The mbcache is only used if you
>> are using extended attributes, and only if the extended attributes don't fit in the inode's extra space.
>>
>> I checked aim7, and it doesn't do any extended attribute operations.
>> So why are you seeing differences? Are you doing something like
>> deliberately using 128 byte inodes (which is not the default inode
>> size), and then enabling SELinux, or some such?
>
> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>
>
> |--3.51%-- __mb_cache_entry_find
> | mb_cache_entry_find_first
> | ext4_xattr_cache_find
> | ext4_xattr_block_set
> | ext4_xattr_set_handle
> | ext4_initxattrs
> | security_inode_init_security
> | ext4_init_security

Looks like this is some large security xattr, or enough smaller
xattrs to exceed the ~120 bytes of in-inode xattr storage. How
big is the SELinux xattr (assuming that is what it is)?

> Looks like it's a bit harder to disable mbcache than I thought.
> I ended up adding code to collect the statics.
>
> With selinux enabled, for new_fserver workload of aim7, there
> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
> that result in a hit and 0xc100000000000000 calls that are not.
> The number does not seem to favor the complete disabling of
> mbcache in this case.

This is about a 65% hit rate, which seems reasonable.

You could try a few different things here:
- disable selinux completely (boot with "selinux=0" on the kernel
command line) and see how much faster it is
- format your ext4 filesystem with larger inodes (-I 512) and see
if this is an improvement or not. That depends on the size of
the selinux xattrs and if they will fit into the extra 256 bytes
of xattr space these larger inodes will give you. The performance
might also be worse, since there will be more data to read/write
for each inode, but it would avoid seeking to the xattr blocks.

Cheers, Andreas




Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/06/2013 05:10 AM, Andreas Dilger wrote:
> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>>
>>
>> |--3.51%-- __mb_cache_entry_find
>> | mb_cache_entry_find_first
>> | ext4_xattr_cache_find
>> | ext4_xattr_block_set
>> | ext4_xattr_set_handle
>> | ext4_initxattrs
>> | security_inode_init_security
>> | ext4_init_security
>
> Looks like this is some large security xattr, or enough smaller
> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
> big is the SELinux xattr (assuming that is what it is)?
>

Sorry I'm familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs. With SELinux disabled, there seems to be no call ext4_xattr_cache_find().

>> Looks like it's a bit harder to disable mbcache than I thought.
>> I ended up adding code to collect the statics.
>>
>> With selinux enabled, for new_fserver workload of aim7, there
>> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
>> that result in a hit and 0xc100000000000000 calls that are not.
>> The number does not seem to favor the complete disabling of
>> mbcache in this case.
>
> This is about a 65% hit rate, which seems reasonable.
>
> You could try a few different things here:
> - disable selinux completely (boot with "selinux=0" on the kernel
> command line) and see how much faster it is
> - format your ext4 filesystem with larger inodes (-I 512) and see
> if this is an improvement or not. That depends on the size of
> the selinux xattrs and if they will fit into the extra 256 bytes
> of xattr space these larger inodes will give you. The performance
> might also be worse, since there will be more data to read/write
> for each inode, but it would avoid seeking to the xattr blocks.
>

Thanks for the above suggestions. Could you please clarify whether we are looking for a workaround here? Since we agree that the way mb_cache uses one global spinlock is incorrect, and SELinux merely exposes the problem (which is not uncommon in enterprise installations), I believe we should look at fixing it (patch 1/2). As you mentioned, this will also impact both the ext2 and ext3 filesystems.

Anyway, please let me know if you still think any of the above experiments is relevant.

Thanks,
Mak.


> Cheers, Andreas
>
>
>
>
>

2013-09-10 20:47:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-06, at 6:23 AM, Thavatchai Makphaibulchoke wrote:
> On 09/06/2013 05:10 AM, Andreas Dilger wrote:
>> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>>> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>>>
>>>
>>> |--3.51%-- __mb_cache_entry_find
>>> | mb_cache_entry_find_first
>>> | ext4_xattr_cache_find
>>> | ext4_xattr_block_set
>>> | ext4_xattr_set_handle
>>> | ext4_initxattrs
>>> | security_inode_init_security
>>> | ext4_init_security
>>
>> Looks like this is some large security xattr, or enough smaller
>> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
>> big is the SELinux xattr (assuming that is what it is)?
>>
>> You could try a few different things here:
>> - disable selinux completely (boot with "selinux=0" on the kernel
>> command line) and see how much faster it is

> Sorry I'm

not?

> familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs. With SELinux disabled, there seems to be no call ext4_xattr_cache_find().

What is the relative performance of your benchmark with SELinux disabled?
While the oprofile graphs will be of passing interest to see that the
mbcache overhead is gone, they will not show the reduction in disk IO
from not writing/updating the external xattr blocks at all.

>> - format your ext4 filesystem with larger inodes (-I 512) and see
>> if this is an improvement or not. That depends on the size of
>> the selinux xattrs and if they will fit into the extra 256 bytes
>> of xattr space these larger inodes will give you. The performance
>> might also be worse, since there will be more data to read/write
>> for each inode, but it would avoid seeking to the xattr blocks.
>
> Thanks for the above suggestions. Could you please clarify if we are
> attempting to look for a workaround here? Since we agree the way
> mb_cache uses one global spinlock is incorrect and SELinux exposes
> the problem (which is not uncommon with Enterprise installations),
> I believe we should look at fixing it (patch 1/2). As you also
> mentioned, this will also impact both ext2 and ext3 filesystems.

I agree that SELinux is enabled on enterprise distributions by default,
but I'm also interested to know how much overhead this imposes. I would
expect that writing large external xattrs for each file would have quite
a significant performance overhead that should not be ignored. Reducing
the mbcache overhead is good, but eliminating it entirely is better.

Depending on how much overhead SELinux has, it might be important to
spend more time to optimize it (not just the mbcache part), or users
may consider disabling SELinux entirely on systems where they care
about peak performance.

> Anyway, please let me know if you still think any of the above
> experiments is relevant.

You have already done one of the tests that I'm interested in (the above
test which showed that disabling SELinux removed the mbcache overhead).
What I'm interested in is the actual performance (or relative performance
if you are not allowed to publish the actual numbers) of your AIM7
benchmark between SELinux enabled and SELinux disabled.

Next would be a new test that has SELinux enabled, but formatting the
filesystem with 512-byte inodes instead of the ext4 default of 256-byte
inodes. If this makes a significant improvement, it would potentially
mean users and the upstream distros should use different formatting
options along with SELinux. This is less clearly a win, since I don't
know enough details of how SELinux uses xattrs (I always disable it,
so I don't have any systems to check).

Cheers, Andreas




2013-09-10 21:04:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
> I agree that SELinux is enabled on enterprise distributions by default,
> but I'm also interested to know how much overhead this imposes. I would
> expect that writing large external xattrs for each file would have quite
> a significant performance overhead that should not be ignored. Reducing
> the mbcache overhead is good, but eliminating it entirely is better.

I was under the impression that using a 256 byte inode (which gives a
bit over 100 bytes worth of xattr space) was plenty for SELinux. If
it turns out that SELinux's use of xattrs has gotten especially
piggy, then we may need to revisit the recommended inode size for
those systems that insist on using SELinux... even if we eliminate the
overhead associated with mbcache, the fact that files require a
separate xattr block is going to seriously degrade performance.

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/10/2013 09:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.
>
> - Ted
>

Thank you Andreas and Ted for the explanations and comments. Yes, I see both of your points now. Even though we can reduce the mbcache overhead, the additional xattr I/O still has a cost, so it would be better to provide some data to help users or distros determine whether they would be better off completely disabling SELinux or increasing the inode size. I will go ahead and run the suggested experiments and get back with the results.

Thanks,
Mak.

2013-09-11 03:16:04

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

# find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
11082179

bytes of values for:

# df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_bp05-lv_root
3276800 280785 2996015 9% /

280785 inodes used,

so:
11082179/280785 = ~39.5 bytes per value on average, plus:

# echo -n "security.selinux" | wc -c
16

adding the 16 bytes for the name gives only about 55-56 bytes per selinux attr on average.

So nope, not "especially piggy" on average.

Another way to do it is this: list all possible file contexts, and make
a histogram of their sizes:

# for CONTEXT in `semanage fcontext -l | awk '{print $NF}' `; do echo -n $CONTEXT | wc -c; done | sort -n | uniq -c
1 7
33 8
356 26
14 27
14 28
37 29
75 30
237 31
295 32
425 33
324 34
445 35
548 36
229 37
193 38
181 39
259 40
81 41
108 42
96 43
55 44
55 45
16 46
41 47
23 48
28 49
36 50
10 51
10 52
5 54
2 57

so a 57 byte value is max, but there aren't many of the larger values.

Above doesn't tell us the prevalence of various contexts on the actual system,
but they are all under 100 bytes in any case.

-Eric

> - Ted

2013-09-11 11:31:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>
> Above doesn't tell us the prevalence of various contexts on the actual system,
> but they are all under 100 bytes in any case.

OK, so in other words, on your system i_file_acl and i_file_acl_high
(which is where we store the block # for the external xattr block),
should always be zero for all inodes, yes?

Thavatchai, can you check to see whether or not this is true on your
system? You can use debugfs on the file system, and then use the
"stat" command to sample various inodes in your system. Or I can make
a version of e2fsck which counts the number of inodes with external
xattr blocks --- it sounds like this is something we should do anyway.
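
For what it's worth, a quick user-space scan with libext2fs can also count
them without modifying e2fsck. A rough sketch -- illustrative only, error
handling trimmed, and it only looks at the low 32 bits of the xattr block
number:

        #include <stdio.h>
        #include <ext2fs/ext2fs.h>

        int main(int argc, char **argv)
        {
                ext2_filsys fs;
                ext2_inode_scan scan;
                struct ext2_inode inode;
                ext2_ino_t ino;
                unsigned long total = 0, external = 0;

                if (argc != 2 ||
                    ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs) ||
                    ext2fs_open_inode_scan(fs, 0, &scan))
                        return 1;
                while (!ext2fs_get_next_inode(scan, &ino, &inode) && ino) {
                        if (!inode.i_links_count)
                                continue;       /* skip unused inodes */
                        total++;
                        if (inode.i_file_acl)   /* external xattr block # */
                                external++;
                }
                printf("%lu of %lu in-use inodes have an external xattr block\n",
                       external, total);
                ext2fs_close_inode_scan(scan);
                ext2fs_close(fs);
                return 0;
        }

Build with something like "gcc scan.c -lext2fs -lcom_err" and run it against
a device that is unmounted or mounted read-only.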

One difference might be that Eric ran this test on RHEL 6, and
Thavatchai is using an upstream kernel, so maybe this is bloat that
has been added recently?

The reason why I'm pushing here is that mbcache shouldn't be showing
up in the profiles at all if there is no external xattr block. And so
if newer versions of SELinux (like Andreas, I've been burned by
SELinux too many times in the past, so I don't use SELinux on any of
my systems) are somehow causing mbcache to get triggered, we should
figure this out and understand what's really going on.

Sigh, I suppose I should figure out how to create a minimal KVM setup
which uses SELinux just so I can see what the heck is going on....

- Ted

2013-09-11 16:52:28

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>
>> Above doesn't tell us the prevalence of various contexts on the actual system,
>> but they are all under 100 bytes in any case.
>
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block),
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at
least an additional thing).

# find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

Finds quite a lot that claim to have external blocks, but it seems broken:

# filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
Filesystem type is: ef53
File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
ext logical physical expected length flags
0 0 32212996252 100 not_aligned,inline
/var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

# getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
getfattr: Removing leading '/' from absolute path names
# file: var/lib/yum/repos/x86_64/6Server/epel
security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

debugfs: stat var/lib/yum/repos/x86_64/6Server/epel
Inode: 1968466 Type: directory Mode: 0755 Flags: 0x80000
Generation: 2728788146 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 4096
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
EXTENTS:
(0): 7873422

sooo seems like filefrag -x is buggy and can't be trusted. :(

> Thavatchai, can you check to see whether or not this is true on your
> system? You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system. Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
>
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this is bloat has
> been added recently?

It's a userspace policy so the kernel shouldn't matter... "bloat" would
only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block. And so
> if newer versions of SELinux (like Adnreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) is somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr.

And as I showed earlier, name+value for all of the attrs set by at least RHEL6
selinux policy are well under 100 bytes.

(Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode,
sure, but selinux on its own should not).

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on....

http://fedoraproject.org/en/get-fedora ;)

-Eric

> - Ted
>

2013-09-11 19:38:06

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>>
>>> Above doesn't tell us the prevalence of various contexts on the actual system,
>>> but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block),
>> should always be zero for all inodes, yes?
>
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
>
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c
>
> Finds quite a lot that claim to have external blocks, but it seems broken:
>
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
> ext logical physical expected length flags
> 0 0 32212996252 100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
>
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes
not blocks" :(

I'll send a patch to ignore inline extents on fiemap calls to make this
easier, but in the meantime, neither my RHEL6 root nor my F17 root have
any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256 byte
inodes.

-Eric

2013-09-11 20:33:58

by David Lang

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, 11 Sep 2013, Eric Sandeen wrote:

>> The reason why I'm pushing here is that mbcache shouldn't be showing
>> up in the profiles at all if there is no external xattr block. And so
>> if newer versions of SELinux (like Adnreas, I've been burned by
>> SELinux too many times in the past, so I don't use SELinux on any of
>> my systems) is somehow causing mbcache to get triggered, we should
>> figure this out and understand what's really going on.
>
> selinux, from an fs allocation behavior perspective, is simply setxattr.

what you are missing is that Ted is saying that unless you are using xattrs, the
mbcache should not show up at all.

The fact that you are using SELinux, and SELinux sets the xattrs, is what makes
this show up on your system, but other people who don't use SELinux (and so
don't have any xattrs set) don't see the same bottleneck.

David Lang

2013-09-11 20:51:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
>
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block. And so
>>> if newer versions of SELinux (like Adnreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) is somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
>
> what you are missing is that Ted is saying that unless you are using xattrs, the mbcache should not show up at all.
>
> The fact that you are using SElinux, and SELinux sets the xattrs is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well. But Ted was also saying that perhaps selinux had "gotten piggy" and was causing this. I've shown that it hasn't.


This matters because unless the selinux xattrs go out of the inode into their own block, mbcache should STILL not come into it at all.

And for attrs < 100 bytes, they stay in the inode. And on inspection, my SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated blocks, and selinux does not cause that on its own.

Soo... selinux by itself should not be triggering any mbcache codepaths.

Ted suggested that "selinux had gotten piggy" so I checked, and showed that it hadn't. That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.

-Eric

> David Lang

2013-09-11 21:27:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>
> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
>

Yes, the next thing is to see if on his systems, whether or not he's
seeing external xattr blocks.

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/11/2013 09:25 PM, Theodore Ts'o wrote:
> On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>>
>> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
>>
>
> Yes, the next thing is to see if on his systems, whether or not he's
> seeing external xattr blocks.
>
> - Ted
>

I seem to be seeing the same thing as Eric is seeing.

On one of my systems,

# find / -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
2725655
# df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_root
1974272 84737 1889535 5% /
# find /home -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
274173
# df -i /home
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_home
192384 7862 184522 5% /home

For both filesystems, the security xattr values average about 32.17 and 34.87 bytes respectively.

I also see a similar problem with filefrag.

# filefrag -xv /bin/sh
Filesystem type is: ef53
File size of /bin/sh is 938736 (230 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 23622459548 100 not_aligned,inline
/bin/sh: 1 extent found

# getfattr -m - -d /bin/sh
getfattr: Removing leading '/' from absolute path names
# file: bin/sh
security.selinux="system_u:object_r:shell_exec_t:s0"

debugfs: stat /bin/sh
Inode: 1441795 Type: symlink Mode: 0777 Flags: 0x0
Generation: 3470616846 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 4
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
atime: 0x52311211:006d1658 -- Wed Sep 11 19:00:01 2013
mtime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
crtime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
selinux = "system_u:object_r:bin_t:s0\000" (27)
Fast_link_dest: bash

At this point, I'm not sure why we get into the mbcache path when SELinux is enabled. As mentioned in one of my earlier replies to Andreas, I did see actual calls into ext4_xattr_cache.

There seems to be one difference between 3.11 kernel and 2.6 kernel in set_inode_init_security(). There is an additional attempt to initialize evm xattr. But I do not seem to be seeing any evm xattr in any file.

I will continue to try to find out how we get into the mbcache path. Please let me know if anyone has any suggestion.

Thanks,
Mak.


2013-09-12 03:45:16

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:

> I seem to be seeing the same thing as Eric is seeing.

...
> For both filesystems, the security xattr are about 32.17 and 34.87 bytes respectively.
...

Can you triple-check the inode size on your fs, for good measure?

dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

turns out it's not a problem, it's an undocumented & surprising "feature." :(

/* For inline data all offsets should be in bytes, not blocks */
if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
blk_shift = 0;

because ... ? (the commit which added it didn't mention anything about it).

But I guess it does mean that, for at least those files, the xattr data is actually inline.

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
>
> There seems to be one difference between 3.11 kernel and 2.6 kernel
> in set_inode_init_security(). There is an additional attempt to
> initialize evm xattr. But I do not seem to be seeing any evm xattr in
> any file.
>
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches.

But it seems like a mystery worth solving.

Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it
gets into ext4_xattr_block_set().

Or most simply, just printk the inode number in ext4_xattr_block_set() so
you can look at those inodes later via debugfs.

And in here,

} else {
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (!error && !bs.s.not_found) {
i.value = NULL;
error = ext4_xattr_block_set(handle, inode, &i, &bs);
} else if (error == -ENOSPC) {
if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
error = ext4_xattr_block_find(inode, &i, &bs);
if (error)
goto cleanup;
}
error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out in the ext4_xattr_ibody_set error case what the size
of the xattr is, and probably the inode number again for good measure, to get
an idea of what's causing it to fail to land in the inode?
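
Something along these lines (untested, just to illustrate -- the field names
come from struct ext4_xattr_info and the local variables in
ext4_xattr_set_handle()):

        /* in the -ENOSPC branch quoted above */
        printk(KERN_DEBUG "ext4 xattr: ino %lu: %s (%zu bytes) won't fit in the inode\n",
               inode->i_ino, i.name, i.value_len);

        /* and at the top of ext4_xattr_block_set(), to log every inode that
         * ends up with an external xattr block at all */
        printk(KERN_DEBUG "ext4 xattr: ino %lu entered ext4_xattr_block_set\n",
               inode->i_ino);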

-Eric

>
> Thanks,
> Mak.

2013-10-30 16:51:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables. Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.
>
> Signed-off-by: T. Makphaibulchoke <[email protected]>

I tried running xfstests with this patch, and it blew up on
generic/020 test:

generic/020 [10:21:50][ 105.170352] ------------[ cut here ]------------
[ 105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
[ 105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 105.173346] Modules linked in:
[ 105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
[ 105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
[ 105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
[ 105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
[ 105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
[ 105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
[ 105.173346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
[ 105.173346] Stack:
[ 105.173346] c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
[ 105.173346] f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
[ 105.173346] f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
[ 105.173346] Call Trace:
[ 105.173346] [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
[ 105.173346] [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
[ 105.173346] [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
[ 105.173346] [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
[ 105.173346] [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
[ 105.173346] [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
[ 105.173346] [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
[ 105.173346] [<c024a4da>] generic_setxattr+0x4c/0x5e
[ 105.173346] [<c024a48e>] ? generic_listxattr+0x95/0x95
[ 105.173346] [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
[ 105.173346] [<c024abd2>] vfs_setxattr+0x63/0x7e
[ 105.173346] [<c024ace8>] setxattr+0xfb/0x139
[ 105.173346] [<c01b200a>] ? __lock_acquire+0x540/0xca6
[ 105.173346] [<c01877a3>] ? lg_local_unlock+0x1b/0x34
[ 105.173346] [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
[ 105.173346] [<c0227e69>] ? kmem_cache_free+0xd4/0x149
[ 105.173346] [<c01b2c2b>] ? lock_acquire+0xdd/0x107
[ 105.173346] [<c023225e>] ? __sb_start_write+0xee/0x11d
[ 105.173346] [<c0247383>] ? mnt_want_write+0x1e/0x3e
[ 105.173346] [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
[ 105.173346] [<c0247353>] ? __mnt_want_write+0x4e/0x60
[ 105.173346] [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
[ 105.173346] [<c078d0e8>] syscall_call+0x7/0xb
[ 105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
[ 105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
[ 105.273781] ---[ end trace 1ee45ddfc1df0935 ]---

When I tried to find a potential problem, I immediately ran into this.
I'm not entirely sure it's the problem, but it's raised a number of
red flags for me in terms of (a) how much testing you've employed with
this patch set, and (b) how maintainable and easy-to-audit the code
will be with this extra locking. The comments are a good start, but
some additional comments about exactly what assumptions a function
makes about locks that are held on function entry, or especially if
the locking is different on function entry and function exit, might
make it easier for people to audit this patch.

Or maybe this commit needs to be split up, with first a conversion from
using list_head to hlist_bl_node, and then a change to the locking? The
bottom line is that we need to somehow make this patch easier to
validate/review.
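
For anyone auditing this, it may help to keep in mind what these primitives
expand to: hlist_bl_lock()/hlist_bl_unlock() are thin wrappers around a bit
spinlock kept in bit 0 of the head's first pointer, and with
CONFIG_DEBUG_SPINLOCK the unlock path BUG()s if that bit is not actually
set, which appears to be what the trace above tripped. Roughly (paraphrased
from include/linux/list_bl.h and include/linux/bit_spinlock.h, not copied
verbatim):

        static inline void hlist_bl_lock(struct hlist_bl_head *b)
        {
                bit_spin_lock(0, (unsigned long *)b);   /* lock bit in ->first */
        }

        static inline void hlist_bl_unlock(struct hlist_bl_head *b)
        {
                __bit_spin_unlock(0, (unsigned long *)b);
        }

        static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
        {
        #ifdef CONFIG_DEBUG_SPINLOCK
                BUG_ON(!test_bit(bitnum, addr)); /* unlocking a lock not held */
        #endif
                /* ... clear the bit and re-enable preemption ... */
        }

So any path that unlocks a chain it did not lock, or unlocks it twice, hits
exactly that assertion.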

> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
> ce->e_queued++;
> prepare_to_wait(&mb_cache_queue, &wait,
> TASK_UNINTERRUPTIBLE);
> - spin_unlock(&mb_cache_spinlock);
> + hlist_bl_unlock(head);
> schedule();
> - spin_lock(&mb_cache_spinlock);
> + hlist_bl_lock(head);
> + mb_assert(ce->e_index_hash_p == head);
> ce->e_queued--;
> }
> + hlist_bl_unlock(head);
> finish_wait(&mb_cache_queue, &wait);
>
> - if (!__mb_cache_entry_is_hashed(ce)) {
> + hlist_bl_lock(ce->e_block_hash_p);
> + if (!__mb_cache_entry_is_block_hashed(ce)) {
> __mb_cache_entry_release_unlock(ce);
> - spin_lock(&mb_cache_spinlock);
> + hlist_bl_lock(head);

<--- are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?

<---- Why do we call "hlist_bl_lock(head);" here but not in the else clause?

> return ERR_PTR(-EAGAIN);
> - }
> + } else
> + hlist_bl_unlock(ce->e_block_hash_p);
> + mb_assert(ce->e_index_hash_p == head);
> return ce;
> }
> l = l->next;


Cheers,

- Ted

2013-10-30 16:51:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache

On Wed, Sep 04, 2013 at 10:39:16AM -0600, T Makphaibulchoke wrote:
> This patch adds new interfaces to create and destroy a cache,
> ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
> cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
> in fs/ext4/xattr.c.
> fs/ext4/super.c has been changed so that when a filesystem is mounted a
> cache is allocated and attached to its ext4_sb_info structure.

One problem with this patch is that it creates separate slabs for each
ext4 file system where previously we had a single slab cache for
mbcache entries. I've tried adding the following patch on top of
yours so that we can retain the previous behavior of creating a single
kmem_cache for all ext4 file systems.

Please comment/review...

- Ted

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3b525b6..d529807 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -31,6 +31,7 @@
#include <linux/vfs.h>
#include <linux/random.h>
#include <linux/mount.h>
+#include <linux/mbcache.h>
#include <linux/namei.h>
#include <linux/quotaops.h>
#include <linux/seq_file.h>
@@ -5492,9 +5493,16 @@ static int __init ext4_init_fs(void)
init_waitqueue_head(&ext4__ioend_wq[i]);
}

+ ext4_xattr_kmem_cache = kmem_cache_create("ext4_xattr",
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (ext4_xattr_kmem_cache == NULL)
+ return -ENOMEM;
+
+
err = ext4_init_es();
if (err)
- return err;
+ goto out8;

err = ext4_init_pageio();
if (err)
@@ -5548,12 +5556,15 @@ out6:
ext4_exit_pageio();
out7:
ext4_exit_es();
+out8:
+ kmem_cache_destroy(ext4_xattr_kmem_cache);

return err;
}

static void __exit ext4_exit_fs(void)
{
+ kmem_cache_destroy(ext4_xattr_kmem_cache);
ext4_destroy_lazyinit_thread();
unregister_as_ext2();
unregister_as_ext3();
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 07ca399..3d2bd5a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,6 +81,8 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

+struct kmem_cache *ext4_xattr_kmem_cache;
+
static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
@@ -1687,7 +1689,8 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
struct mb_cache *
ext4_xattr_create_cache(char *name)
{
- return mb_cache_create(name, HASH_BUCKET_BITS);
+ return mb_cache_create_instance(name, HASH_BUCKET_BITS,
+ ext4_xattr_kmem_cache);
}

void ext4_xattr_destroy_cache(struct mb_cache *cache)
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index b930320..15616eb 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -94,6 +94,8 @@ struct ext4_xattr_ibody_find {
struct ext4_iloc iloc;
};

+extern struct kmem_cache *ext4_xattr_kmem_cache;
+
extern const struct xattr_handler ext4_xattr_user_handler;
extern const struct xattr_handler ext4_xattr_trusted_handler;
extern const struct xattr_handler ext4_xattr_acl_access_handler;
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 44e7153..8e373c6 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -96,6 +96,7 @@ MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
MODULE_LICENSE("GPL");

EXPORT_SYMBOL(mb_cache_create);
+EXPORT_SYMBOL(mb_cache_create_instance);
EXPORT_SYMBOL(mb_cache_shrink);
EXPORT_SYMBOL(mb_cache_destroy);
EXPORT_SYMBOL(mb_cache_entry_alloc);
@@ -260,18 +261,20 @@ static struct shrinker mb_cache_shrinker = {
};

/*
- * mb_cache_create() create a new cache
+ * mb_cache_create_instance() create a new cache for a device
*
* All entries in one cache are equal size. Cache entries may be from
- * multiple devices. If this is the first mbcache created, registers
- * the cache with kernel memory management. Returns NULL if no more
+ * multiple devices. The caller must create a kmem_cache appropriate
+ * for allocating struct mb_cache_entry. Returns NULL if no more
* memory was available.
*
* @name: name of the cache (informal)
* @bucket_bits: log2(number of hash buckets)
+ * @mb_kmem_cache: slab cache for mb_cache_entry structures
*/
struct mb_cache *
-mb_cache_create(const char *name, int bucket_bits)
+mb_cache_create_instance(const char *name, int bucket_bits,
+ struct kmem_cache *mb_kmem_cache)
{
int n, bucket_count = 1 << bucket_bits;
struct mb_cache *cache = NULL;
@@ -294,11 +297,8 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ BUG_ON(mb_kmem_cache == NULL);
+ cache->c_entry_cache = mb_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -311,15 +311,38 @@ mb_cache_create(const char *name, int bucket_bits)
spin_unlock(&mb_cache_spinlock);
return cache;

-fail2:
- kfree(cache->c_index_hash);
-
fail:
kfree(cache->c_block_hash);
kfree(cache);
return NULL;
}

+/*
+ * mb_cache_create() create a new cache
+ *
+ * All entries in one cache are equal size. Cache entries may be from
+ * multiple devices. Returns NULL if no more memory was available.
+ *
+ * @name: name of the cache (informal)
+ * @bucket_bits: log2(number of hash buckets)
+ */
+struct mb_cache *
+mb_cache_create(const char *name, int bucket_bits)
+{
+ struct kmem_cache *mb_kmem_cache;
+ struct mb_cache *cache;
+
+ mb_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (mb_kmem_cache == NULL)
+ return NULL;
+ cache = mb_cache_create_instance(name, bucket_bits, mb_kmem_cache);
+ if (cache == NULL)
+ kmem_cache_destroy(mb_kmem_cache);
+ cache->c_entry_cache_autofree = 1;
+ return cache;
+}

/*
* mb_cache_shrink()
@@ -406,7 +429,8 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
+ if (cache->c_entry_cache_autofree)
+ kmem_cache_destroy(cache->c_entry_cache);

kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 89826c0..b326ce7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -29,11 +29,14 @@ struct mb_cache {
struct kmem_cache *c_entry_cache;
struct hlist_bl_head *c_block_hash;
struct hlist_bl_head *c_index_hash;
+ unsigned int c_entry_cache_autofree:1;
};

/* Functions on caches */

struct mb_cache *mb_cache_create(const char *, int);
+struct mb_cache *mb_cache_create_instance(const char *,
+ int, struct kmem_cache *);
void mb_cache_shrink(struct block_device *);
void mb_cache_destroy(struct mb_cache *);

2013-10-30 16:51:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables. Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.

In the process of applying this patch to the ext4 tree, I had to
rework one of the patches to account for a change upstream to the
shrinker interface (which modified mb_cache_shrink_fn() to be
mb_cache_shrink_scan()).

Can you verify that the changes I made look sane?

Thanks,

- Ted

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 1f90cd0..44e7153 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -200,25 +200,38 @@ forget:
static unsigned long
mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
- struct mb_cache_entry *entry, *tmp;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
unsigned long freed = 0;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
- }
- spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
+ while (nr_to_scan > 0) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ --nr_to_scan;
+ freed++;
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
return freed;
}

Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On 10/30/2013 08:42 AM, Theodore Ts'o wrote:
> I tried running xfstests with this patch, and it blew up on
> generic/020 test:
>
> generic/020 [10:21:50][ 105.170352] ------------[ cut here ]------------
> [ 105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
> [ 105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 105.173346] Modules linked in:
> [ 105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
> [ 105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
> [ 105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
> [ 105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
> [ 105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
> [ 105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
> [ 105.173346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
> [ 105.173346] Stack:
> [ 105.173346] c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
> [ 105.173346] f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
> [ 105.173346] f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
> [ 105.173346] Call Trace:
> [ 105.173346] [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
> [ 105.173346] [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
> [ 105.173346] [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
> [ 105.173346] [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
> [ 105.173346] [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
> [ 105.173346] [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
> [ 105.173346] [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
> [ 105.173346] [<c024a4da>] generic_setxattr+0x4c/0x5e
> [ 105.173346] [<c024a48e>] ? generic_listxattr+0x95/0x95
> [ 105.173346] [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
> [ 105.173346] [<c024abd2>] vfs_setxattr+0x63/0x7e
> [ 105.173346] [<c024ace8>] setxattr+0xfb/0x139
> [ 105.173346] [<c01b200a>] ? __lock_acquire+0x540/0xca6
> [ 105.173346] [<c01877a3>] ? lg_local_unlock+0x1b/0x34
> [ 105.173346] [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
> [ 105.173346] [<c0227e69>] ? kmem_cache_free+0xd4/0x149
> [ 105.173346] [<c01b2c2b>] ? lock_acquire+0xdd/0x107
> [ 105.173346] [<c023225e>] ? __sb_start_write+0xee/0x11d
> [ 105.173346] [<c0247383>] ? mnt_want_write+0x1e/0x3e
> [ 105.173346] [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
> [ 105.173346] [<c0247353>] ? __mnt_want_write+0x4e/0x60
> [ 105.173346] [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
> [ 105.173346] [<c078d0e8>] syscall_call+0x7/0xb
> [ 105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
> [ 105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
> [ 105.273781] ---[ end trace 1ee45ddfc1df0935 ]---
>
> When I tried to find a potential problem, I immediately ran into this.
> I'm not entirely sure it's the problem, but it has raised a number of
> red flags for me in terms of (a) how much testing you've employed with
> this patch set, and (b) how maintainable and easy-to-audit the code
> will be with this extra locking. The comments are a good start, but
> some additional comments about exactly which locks a function assumes
> are held on function entry, and especially whether the locking differs
> between function entry and function exit, might make it easier for
> people to audit this patch.
>
> Or maybe this commit needs to be split up, with first a conversion from
> using list_head to hlist_bl_node, and then the change to the locking? The
> bottom line is that we need to somehow make this patch easier to
> validate/review.
>

Thanks for the comments. Yes, I did run it through xfstests. My guess is that you probably ran into a race condition that I did not.

I will try to port the patch to a more recent kernel, including the mb_cache_shrink_scan() change you sent earlier (BTW, it looks good), and debug the problem.

Yes, those are good suggestions. Once I find the problem, I will resubmit with more comments and also split it into two patches, as suggested.

>> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
>> ce->e_queued++;
>> prepare_to_wait(&mb_cache_queue, &wait,
>> TASK_UNINTERRUPTIBLE);
>> - spin_unlock(&mb_cache_spinlock);
>> + hlist_bl_unlock(head);
>> schedule();
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>> + mb_assert(ce->e_index_hash_p == head);
>> ce->e_queued--;
>> }
>> + hlist_bl_unlock(head);
>> finish_wait(&mb_cache_queue, &wait);
>>
>> - if (!__mb_cache_entry_is_hashed(ce)) {
>> + hlist_bl_lock(ce->e_block_hash_p);
>> + if (!__mb_cache_entry_is_block_hashed(ce)) {
>> __mb_cache_entry_release_unlock(ce);
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>
> <--- are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?
>
> <---- Why is it that we call "hlist_bl_lock(head);" but not in the else clause?

The function __mb_cache_entry_release_unlock() releases both the e_block_hash and e_index_hash locks, so no extra hlist_bl_unlock() is missing there; we only have to reacquire the e_index_hash (head) lock in the then branch, and release the e_block_hash lock in the else branch.
>
>> return ERR_PTR(-EAGAIN);
>> - }
>> + } else
>> + hlist_bl_unlock(ce->e_block_hash_p);
>> + mb_assert(ce->e_index_hash_p == head);
>> return ce;
>> }
>> l = l->next;
>
>
> Cheers,
>
> - Ted
>

Thanks,
Mak.

2014-01-24 18:32:04

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 0/3] ext4: increase mbcache scalability

The patch consists of three parts.

The first part changes the implementation of both the block and index hash
chains of an mb_cache from list_head to hlist_bl_head and also introduces new
members, including a spinlock, to mb_cache_entry, as required by the second
part.

The second part introduces a higher degree of parallelism to the use of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for
any workload on any of the platforms tested.

With regular disk filesystems using an inode size of 128 bytes, forcing the
use of external xattrs, there are some good performance increases with some
of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 60 core machine:

---------------------------
| | % increase |
---------------------------
| alltests | 30.97 |
---------------------------
| custom | 41.18 |
---------------------------
| dbase | 32.06 |
---------------------------
| disk | 125.02 |
---------------------------
| fserver | 10.23 |
---------------------------
| new_dbase | 14.58 |
---------------------------
| new_fserve | 9.62 |
---------------------------
| shared | 52.56 |
---------------------------

On a 40 core machine:

---------------------------
| | % increase |
---------------------------
| custom | 63.45 |
---------------------------
| disk | 133.25 |
---------------------------
| fserver | 78.29 |
---------------------------
| new_fserver | 80.66 |
---------------------------
| shared | 56.34 |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v4:
- New performance data
- New diff summary
- New patch architecture

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary
T Makphaibulchoke (3):
fs/mbcache.c change block and index hash chain to hlist_bl_node
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++-
fs/ext4/xattr.c | 51 ++---
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 491 ++++++++++++++++++++++++++++++++++--------------
include/linux/mbcache.h | 11 +-
6 files changed, 402 insertions(+), 182 deletions(-)

--
1.7.11.3

2014-01-24 18:32:19

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node

This patch changes both the block and index hash chains of each mb_cache from
regular linked lists to hlist_bl_node, which contains a built-in lock. This is
the first step in decoupling the locks serializing accesses to mb_cache global
data from those protecting each mb_cache_entry's local data.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/mbcache.c | 117 ++++++++++++++++++++++++++++++++----------------
include/linux/mbcache.h | 11 +++--
2 files changed, 85 insertions(+), 43 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>
-
+#include <linux/init.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}

+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
+}

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
}
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

+ mb_assert(ce);
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_block_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
error = 0;
out:
spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -506,12 +535,14 @@ cleanup:
#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
+ mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
}
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
spin_lock(&mb_cache_spinlock);
return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
spin_unlock(&mb_cache_spinlock);
return ce;
}
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
spin_lock(&mb_cache_spinlock);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
__mb_cache_entry_release_unlock(prev);
return ce;
}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..e76cc0e 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,14 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
+ spinlock_t e_entry_lock;
};

struct mb_cache {
@@ -25,8 +28,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2014-01-24 18:32:25

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 2/3] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new spinlock is also added to the mb_cache_entry to protect its local data,
e_used and e_queued.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/mbcache.c | 363 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 260 insertions(+), 103 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..0c4cec2 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,40 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry contains a spinlock, e_entry_lock, to serialize
+ * accesses to its local data, such as e_used and e_queued.
+ *
+ * Lock ordering:
+ *
+ * mb_cache_spinlock has the highest lock order, followed by each block and
+ * index hash chain's lock, with e_entry_lock the lowest. While holding
+ * mb_cache_spinlock, a thread can acquire either a block and/or index hash
+ * chain lock, and in turn can also acquire each entry spinlock, e_entry_lock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chains, they need to lock the corresponding hash chain. For each
+ * mb_cache_entry within the chain, they need to lock the mb_cache_entry to
+ * prevent any simultaneous release or free of the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, i.e. both e_used
+ * and e_queued are 0. When an mb_cache_entry is explicitly freed it is
+ * first removed from its block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -113,11 +147,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
hlist_bl_del_init(&ce->e_index.o_list);
}

+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash the entry from both the block and
+ * index hash chains.
+ * It assumes both the block and index hash chains are locked upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
{
__mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
__mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
}

static void
@@ -130,31 +174,42 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
atomic_dec(&cache->c_entry_count);
}

-
static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
{
+ /* First lock the entry to serialize access to its local data. */
+ spin_lock(&ce->e_entry_lock);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
if (ce->e_used >= MB_CACHE_WRITER)
ce->e_used -= MB_CACHE_WRITER;
+ /*
+ * Make sure that all cache entries on lru_list have
+ * both e_used and e_queued equal to 0.
+ */
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_block_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
+ spin_unlock(&ce->e_entry_lock);
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ }
+ /*
+ * Need access to lru list, first drop entry lock,
+ * then reacquire the lock in the proper order.
+ */
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- spin_unlock(&mb_cache_spinlock);
+ spin_unlock(&ce->e_entry_lock);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

-
/*
* mb_cache_shrink_scan() memory pressure callback
*
@@ -177,17 +232,32 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

mb_debug("trying to free %d entries", nr_to_scan);
spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+ while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
struct mb_cache_entry *ce =
list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /* Prevent any find or get operation on the entry */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ spin_lock(&mb_cache_spinlock);
}
spin_unlock(&mb_cache_spinlock);
+
list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
__mb_cache_entry_forget(entry, gfp_mask);
+ freed++;
}
return freed;
}
@@ -290,21 +360,41 @@ void
mb_cache_shrink(struct block_device *bdev)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct list_head *l;
+ struct mb_cache_entry *ce, *tmp;

+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
}
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -320,23 +410,26 @@ void
mb_cache_destroy(struct mb_cache *cache)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct mb_cache_entry *ce, *tmp;

spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
- if (ce->e_cache == cache) {
+ list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+ if (ce->e_cache == cache)
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- }
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ list_del_init(&ce->e_lru_list);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ mb_assert(!(ce->e_used || ce->e_queued));
+ __mb_cache_entry_unhash_unlock(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -363,31 +456,56 @@ mb_cache_destroy(struct mb_cache *cache)
struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
- struct mb_cache_entry *ce = NULL;
+ struct mb_cache_entry *ce;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+
+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the
+ * entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(list_empty(&ce->e_lru_list));
+ mb_assert(!(ce->e_used || ce->e_queued));
+ __mb_cache_entry_unhash_unlock(ce);
+ goto found;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
- if (!ce) {
- ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
- if (!ce)
- return NULL;
- atomic_inc(&cache->c_entry_count);
- INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_HLIST_BL_NODE(&ce->e_block_list);
- INIT_HLIST_BL_NODE(&ce->e_index.o_list);
- ce->e_cache = cache;
- ce->e_queued = 0;
- }
+
+ ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+ if (!ce)
+ return NULL;
+ atomic_inc(&cache->c_entry_count);
+ INIT_LIST_HEAD(&ce->e_lru_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+ ce->e_cache = cache;
+ ce->e_queued = 0;
+found:
ce->e_block_hash_p = &cache->c_block_hash[0];
ce->e_index_hash_p = &cache->c_index_hash[0];
+ spin_lock_init(&ce->e_entry_lock);
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -423,25 +541,28 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
mb_assert(!__mb_cache_entry_is_block_hashed(ce));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_block(ce);
+ __mb_cache_entry_unhash_index(ce);
ce->e_bdev = bdev;
ce->e_block = block;
ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
ce->e_index_hash_p = index_hash_p;
hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ hlist_bl_unlock(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return error;
}

@@ -456,24 +577,26 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
}


/*
* mb_cache_entry_free()
*
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
*/
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_lock(ce->e_block_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_release(ce);
}


@@ -493,43 +616,58 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_node *l;
struct mb_cache_entry *ce;
struct hlist_bl_head *block_hash_p;
+ int held_lock = 1;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ /* First serialize access to the block corresponding hash chain. */
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
+ /*
+ * Prevent a release on the entry.
+ */
+ spin_lock(&ce->e_entry_lock);
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
- while (ce->e_used > 0) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
+ if (ce->e_used > 0) {
+ DEFINE_WAIT(wait);
+ while (ce->e_used > 0) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&ce->e_entry_lock);
+ if (held_lock) {
+ hlist_bl_unlock(block_hash_p);
+ held_lock = 0;
+ }
+ schedule();
+ spin_lock(&ce->e_entry_lock);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
ce->e_used += 1 + MB_CACHE_WRITER;
+ spin_unlock(&ce->e_entry_lock);

if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ if (held_lock)
+ hlist_bl_unlock(block_hash_p);
+ __mb_cache_entry_release(ce);
return NULL;
}
- goto cleanup;
+ if (held_lock)
+ hlist_bl_unlock(block_hash_p);
+ return ce;
}
}
- ce = NULL;
-
-cleanup:
- spin_unlock(&mb_cache_spinlock);
- return ce;
+ hlist_bl_unlock(block_hash_p);
+ return NULL;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +676,59 @@ static struct mb_cache_entry *
__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
+ int held_lock = 1;
+
+ /* The index hash chain lock is already held by the caller. */
while (l != NULL) {
struct mb_cache_entry *ce =
hlist_bl_entry(l, struct mb_cache_entry,
e_index.o_list);
mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
+ /*
+ * Prevent a release on the entry.
+ */
+ spin_lock(&ce->e_entry_lock);
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
+ spin_unlock(&mb_cache_spinlock);
+ }
+ ce->e_used++;
/* Incrementing before holding the lock gives readers
priority over writers. */
- ce->e_used++;
- while (ce->e_used >= MB_CACHE_WRITER) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
+ if (ce->e_used >= MB_CACHE_WRITER) {
+ DEFINE_WAIT(wait);
+
+ while (ce->e_used >= MB_CACHE_WRITER) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&ce->e_entry_lock);
+ if (held_lock) {
+ hlist_bl_unlock(head);
+ held_lock = 0;
+ }
+ schedule();
+ spin_lock(&ce->e_entry_lock);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
}
- finish_wait(&mb_cache_queue, &wait);
-
+ spin_unlock(&ce->e_entry_lock);
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ if (held_lock)
+ hlist_bl_unlock(head);
+ __mb_cache_entry_release(ce);
return ERR_PTR(-EAGAIN);
}
+ if (held_lock)
+ hlist_bl_unlock(head);
return ce;
}
l = l->next;
}
+ hlist_bl_unlock(head);
return NULL;
}

@@ -598,12 +755,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_head *index_hash_p;

index_hash_p = &cache->c_index_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
if (!hlist_bl_empty(index_hash_p)) {
l = hlist_bl_first(index_hash_p);
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- }
- spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -638,11 +795,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,

index_hash_p = &cache->c_index_hash[bucket];
mb_assert(prev->e_index_hash_p == index_hash_p);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ __mb_cache_entry_release(prev);
return ce;
}

--
1.7.11.3

2014-01-24 18:32:49

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 3/3] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy caches,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the cache
creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr() in
fs/ext4/xattr.c. It also increases the size of the mbcache's hash tables for
ext4.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only a single kmem_cache is allocated
and shared by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
fs/mbcache.c | 19 +++++++++++++------
5 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;

/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5521,10 +5531,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 0c4cec2..d620b7f 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -92,6 +92,7 @@
#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
+static struct kmem_cache *mb_cache_kmem_cache;

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -320,11 +321,14 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ if (!mb_cache_kmem_cache) {
+ mb_cache_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (!mb_cache_kmem_cache)
+ goto fail2;
+ }
+ cache->c_entry_cache = mb_cache_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -438,7 +442,10 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
+ if (list_empty(&mb_cache_list)) {
+ kmem_cache_destroy(mb_cache_kmem_cache);
+ mb_cache_kmem_cache = NULL;
+ }

kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
--
1.7.11.3

2014-01-24 21:39:02

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

T Makphaibulchoke <[email protected]> writes:

> The patch consists of three parts.
>
> The first part changes the implementation of both the block and hash chains of
> an mb_cache from list_head to hlist_bl_head and also introduces new members,
> including a spinlock to mb_cache_entry, as required by the second part.

spinlock per entry is usually overkill for larger hash tables.

Can you use a second smaller lock table that just has locks and is
indexed by a subset of the hash key. Most likely a very small
table is good enough.
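
Something like the following is what I have in mind -- a minimal sketch
only, with made-up names and sizes, not code from any posted patch:

#define MB_ENTRY_LOCK_BITS	7	/* 128 locks, independent of hash table size */
#define MB_ENTRY_LOCK_COUNT	(1U << MB_ENTRY_LOCK_BITS)

static spinlock_t mb_entry_locks[MB_ENTRY_LOCK_COUNT];

/* Pick the lock for an entry by reusing the low bits of its hash bucket. */
static inline spinlock_t *mb_entry_lock(unsigned int bucket)
{
	return &mb_entry_locks[bucket & (MB_ENTRY_LOCK_COUNT - 1)];
}

static void mb_entry_locks_init(void)
{
	int i;

	for (i = 0; i < MB_ENTRY_LOCK_COUNT; i++)
		spin_lock_init(&mb_entry_locks[i]);
}

Entries hashing to the same slot simply share a lock, so the spinlock
(and its memory) disappears from struct mb_cache_entry; you only pay for
contention when two hot entries happen to collide in the small lock table.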

Also it would be good to have some data on the additional memory consumption.

-Andi

--
[email protected] -- Speaking for myself only

2014-01-25 06:09:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.

While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.

Cheers, Andreas

> On Jan 24, 2014, at 14:38, Andi Kleen <[email protected]> wrote:
>
> T Makphaibulchoke <[email protected]> writes:
>
>> The patch consists of three parts.
>>
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
>
> spinlock per entry is usually overkill for larger hash tables.
>
> Can you use a second smaller lock table that just has locks and is
> indexed by a subset of the hash key. Most likely a very small
> table is good enough.
>
> Also it would be good to have some data on the additional memory consumption.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 02:38 PM, Andi Kleen wrote:
> T Makphaibulchoke <[email protected]> writes:
>
>> The patch consists of three parts.
>>
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
>
> spinlock per entry is usually overkill for larger hash tables.
>
> Can you use a second smaller lock table that just has locks and is
> indexed by a subset of the hash key. Most likely a very small
> table is good enough.
>
> Also it would be good to have some data on the additional memory consumption.
>
> -Andi
>

Thanks Andi for the comments. Will look into that.

Thanks,
Mak.

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>
> Cheers, Andreas
>


Thanks Andreas for the suggestion. Will try that versus adding just a new private spinlock array in mbcache and compare the performance.

Thanks,
Mak.

2014-01-28 12:26:27

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

> The third part of the patch further increases the scalability of an ext4
> filesystem by having each ext4 filesystem allocate and use its own private
> mbcache structure, instead of sharing a single mbcache structure across all
> ext4 filesystems, and increases the size of its mbcache hash tables.

Are you sure this helps? The idea behind having one large mbcache is
that one large hash table will always be at least as well balanced as
multiple separate tables, if the total size is the same.

If you have two size 2^n hash tables, the chance of collision is equal to
one size 2^(n+1) table if they're equally busy, and if they're unequally
busy, the latter is better. The busier file system will take less time
per search, and since it's searched more often than the less-busy one,
net win.

How does it compare with just increasing the hash table size but leaving
them combined?
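
[To make the expected-cost argument above concrete (a sketch, assuming
filesystem i contributes m_i entries and is searched in proportion to
m_i), the search-frequency-weighted expected chain length satisfies

\[
\frac{m_1\frac{m_1}{2^n} + m_2\frac{m_2}{2^n}}{m_1+m_2}
  = \frac{m_1^2+m_2^2}{2^n(m_1+m_2)}
  \;\ge\; \frac{(m_1+m_2)^2/2}{2^n(m_1+m_2)}
  = \frac{m_1+m_2}{2^{n+1}},
\]

where the left side is two tables of size 2^n, the right side one table
of size 2^(n+1), and equality holds only when m_1 = m_2 (using
m_1^2 + m_2^2 >= (m_1+m_2)^2 / 2).]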

2014-01-28 21:10:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>> The third part of the patch further increases the scalability of an ext4
>> filesystem by having each ext4 filesystem allocate and use its own private
>> mbcache structure, instead of sharing a single mbcache structure across all
>> ext4 filesystems, and increases the size of its mbcache hash tables.
>
> Are you sure this helps? The idea behind having one large mbcache is
> that one large hash table will always be at least as well balanced as
> multiple separate tables, if the total size is the same.
>
> If you have two size 2^n hash tables, the chance of collision is equal to
> one size 2^(n+1) table if they're equally busy, and if they're unequally
> busy, the latter is better. The busier file system will take less time
> per search, and since it's searched more often than the less-busy one,
> net win.
>
> How does it compare with just increasing the hash table size but leaving
> them combined?

Except that having one mbcache per block device would avoid the need
to store the e_bdev pointer in thousands/millions of entries. Since
the blocks are never shared between different block devices, there
is no caching benefit even if the same block is on two block devices.
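
[A rough sketch of that saving: e_bdev is a pointer, i.e. 8 bytes per
entry on a 64-bit kernel, so at the million-entry scale mentioned above
it is on the order of 10^6 * 8 B, roughly 8 MB.]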

Cheers, Andreas






Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/28/2014 02:09 PM, Andreas Dilger wrote:
> On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>>> The third part of the patch further increases the scalability of an ext4
>>> filesystem by having each ext4 filesystem allocate and use its own private
>>> mbcache structure, instead of sharing a single mbcache structure across all
>>> ext4 filesystems, and increases the size of its mbcache hash tables.
>>
>> Are you sure this helps? The idea behind having one large mbcache is
>> that one large hash table will always be at least as well balanced as
>> multiple separate tables, if the total size is the same.
>>
>> If you have two size 2^n hash tables, the chance of collision is equal to
>> one size 2^(n+1) table if they're equally busy, and if they're unequally
>> busy, the latter is better. The busier file system will take less time
>> per search, and since it's searched more often than the less-busy one,
>> net win.
>>
>> How does it compare with just increasing the hash table size but leaving
>> them combined?
>
> Except that having one mbcache per block device would avoid the need
> to store the e_bdev pointer in thousands/millions of entries. Since
> the blocks are never shared between different block devices, there
> is no caching benefit even if the same block is on two block devices.
>
> Cheers, Andreas
>
>
>
>
>

Thanks George and Andreas for the comments. Andreas, you make a good point: the e_bdev pointer is not needed when there is one mb_cache per block device. I'll integrate that into my patch, removing the e_bdev pointer, and run some comparisons between one large hash table and multiple hash tables, as suggested by George.

Thanks,
Mak.

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/28/2014 02:09 PM, Andreas Dilger wrote:
> On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>>> The third part of the patch further increases the scalability of an ext4
>>> filesystem by having each ext4 filesystem allocate and use its own private
>>> mbcache structure, instead of sharing a single mbcache structure across all
>>> ext4 filesystems, and increases the size of its mbcache hash tables.
>>
>> Are you sure this helps? The idea behind having one large mbcache is
>> that one large hash table will always be at least as well balanced as
>> multiple separate tables, if the total size is the same.
>>
>> If you have two size 2^n hash tables, the chance of collision is equal to
>> one size 2^(n+1) table if they're equally busy, and if they're unequally
>> busy, the latter is better. The busier file system will take less time
>> per search, and since it's searched more often than the less-busy one,
>> net win.
>>
>> How does it compare with just increasing the hash table size but leaving
>> them combined?
>
> Except that having one mbcache per block device would avoid the need
> to store the e_bdev pointer in thousands/millions of entries. Since
> the blocks are never shared between different block devices, there
> is no caching benefit even if the same block is on two block devices.
>
> Cheers, Andreas
>

On all 3 systems that I ran aim7 on, with 80, 60 and 20 cores, spreading test files across 4 ext4 filesystems, there seems to be no difference in performance between a single large hash table and a smaller one per filesystem.

Having said that, I still believe that having a separate hash table for each filesystem should scale better, as the size of a single larger hash table would be rather arbitrary. As Andreas mentioned above, with an mbcache per filesystem we would be able to remove the e_bdev member from the mb_cache_entry. It would also work well, and result in less mb_cache_entry lock contention, if we were to use the blockgroup locks, which are also per-filesystem, to implement the mb_cache_entry lock as Andreas suggested.

Please let me know if you have any further comment or concerns.

Thanks,
Mak.


Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>
> Cheers, Andreas
>

Andreas, it looks like your assumption is correct. On all 3 systems, with 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock. I'm inclined to go with the block group lock as it does not require any extra space.

One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock. In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock. This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem. It would also require the same change, allocating an mb_cache for each mounted filesystem, in both ext2 and ext3, which would increase the scope of the patch. The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.

I'm working on resubmitting my patches using the block group locks and extending the changes to also cover both ext2 and ext3. With this approach, not only is no additional space for a dedicated new spinlock array required, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.

Please let me know if you have any concern or suggestion.

Thanks,
Mak.

2014-02-13 02:01:35

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability


On Feb 11, 2014, at 12:58 PM, Thavatchai Makphaibulchoke <[email protected]> wrote:
> On 01/24/2014 11:09 PM, Andreas Dilger wrote:
>> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>>
>> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>>
>> Cheers, Andreas
>>
>
> Andreas, it looks like your assumption is correct. On all 3 systems, with 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock. I'm inclined to go with the block group lock as it does not require any extra space.
>
> One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock. In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock.

Note that you don't have to use the ext4_sb_info->s_blockgroup_lock.
You can allocate and use a separate struct blockgroup_lock for mbcache
instead of allocating a spinlock array (and essentially reimplementing
the bgl_lock_*() code). While it isn't a huge amount of duplication,
that code is already tuned for different SMP core configurations and
there is no reason NOT to use struct blockgroup_lock.
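
[A minimal sketch of that suggestion, using the existing
<linux/blockgroup_lock.h> helpers; the init/lookup wrapper names below
are illustrative, though this is essentially what the V5 patch later in
this thread ends up doing.]

#include <linux/spinlock.h>
#include <linux/blockgroup_lock.h>
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/slab.h>

static struct blockgroup_lock *mb_cache_bg_lock;

static int mb_cache_bg_lock_init(void)
{
	mb_cache_bg_lock = kmalloc(sizeof(*mb_cache_bg_lock), GFP_KERNEL);
	if (!mb_cache_bg_lock)
		return -ENOMEM;
	/* bgl_lock_init() sets up NR_BG_LOCKS spinlocks, sized for the SMP config. */
	bgl_lock_init(mb_cache_bg_lock);
	return 0;
}

/* Hash the entry address down to one of the NR_BG_LOCKS spinlocks. */
static inline spinlock_t *mb_cache_entry_lock_ptr(void *entry)
{
	return bgl_lock_ptr(mb_cache_bg_lock,
			    hash_long((unsigned long)entry, ilog2(NR_BG_LOCKS)));
}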

> This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem. It would also require the same change, allocating an mb_cache for each mounted filesystem, in both ext2 and ext3, which would increase the scope of the patch. The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.
>
> I'm working on resubmitting my patches using the block group locks and extending the changes to also cover both ext2 and ext3. With this approach, not only is no additional space for a dedicated new spinlock array required, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.
>
> Please let me know if you have any concern or suggestion.

I'm not against re-using the s_blockgroup_lock in the superblock, since
the chance of contention on this lock between threads is very small, as
there are normally 4x the number of spinlocks as cores.

You might consider starting with a dedicated struct blockgroup_lock in
the mbcache code, then move to use the in-superblock struct in a later
patch. That would allow you to push and land the mbcache and ext4 patches
independently of the ext2 and ext3 patches (if they are big). If the
ext2 and ext3 patches are relatively small then this extra complexity
in the patches may not be needed.

Cheers, Andreas







2014-02-20 18:38:10

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 0/3] ext4: increase mbcache scalability

The patch consists of three parts.

The first part changes the implementation of both the block and hash chains of
an mb_cache from list_head to hlist_bl_head and also introduces new members,
including a spinlock to mb_cache_entry, as required by the second part.

The second part introduces higher degree of parallelism to the usages of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for all
workloads on all platforms tested.

With regular disk filesystems with an inode size of 128 bytes, forcing the use
of external xattrs, there seem to be some good performance increases with
some of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 20 core machine, there are no performance differences.

On a 60 core machine:

---------------------------
| | % increase |
---------------------------
| alltests | 74.69 |
---------------------------
| custom | 77.1 |
---------------------------
| disk | 125.02 |
---------------------------
| fserver | 113.22 |
---------------------------
| new_dbase | 21.17 |
---------------------------
| new_fserver | 70.31 |
---------------------------
| shared | 52.56 |
---------------------------

On an 80 core machine:

---------------------------
| | % increase |
---------------------------
| custom | 74.29 |
---------------------------
| disk | 61.01 |
---------------------------
| fserver | 11.59 |
---------------------------
| new_fserver | 32.76 |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v5:
- New performance data
- New diff summary

Changed in v4:
- New performance data
- New diff summary
- New patch architecture

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (3):
fs/mbcache.c change block and index hash chain to hlist_bl_node
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++-
fs/ext4/xattr.c | 51 ++---
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 540 ++++++++++++++++++++++++++++++++++--------------
include/linux/mbcache.h | 12 +-
6 files changed, 441 insertions(+), 193 deletions(-)

--
1.7.11.3

2014-02-20 18:38:23

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the cache
creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr() in
fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab allocator is allocated
and used by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
fs/mbcache.c | 18 +++++++++++++-----
5 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;

/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5521,10 +5531,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 786ecab..bf166e3 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -99,6 +99,7 @@

static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
static struct blockgroup_lock *mb_cache_bg_lock;
+static struct kmem_cache *mb_cache_kmem_cache;

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -351,11 +352,14 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ if (!mb_cache_kmem_cache) {
+ mb_cache_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (!mb_cache_kmem_cache)
+ goto fail2;
+ }
+ cache->c_entry_cache = mb_cache_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -476,6 +480,10 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

+ if (list_empty(&mb_cache_list)) {
+ kmem_cache_destroy(mb_cache_kmem_cache);
+ mb_cache_kmem_cache = NULL;
+ }
kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
kfree(cache);
--
1.7.11.3

2014-02-20 18:38:35

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node

This patch changes both the block and index hash chains of each mb_cache from
a regular linked list to hlist_bl_node, which contains a built-in lock. This is
the first step in decoupling the locks serializing accesses to mb_cache global
data from those protecting each mb_cache_entry's local data.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v5:
- Added e_refcnt member to mb_cache_entry structure to be used in
the decoupling in patch 2/3.
- Removed e_entry_lock from mb_cache_entry structure.

fs/mbcache.c | 117 ++++++++++++++++++++++++++++++++----------------
include/linux/mbcache.h | 12 ++---
2 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>
-
+#include <linux/init.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}

+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
+}

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
}
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

+ mb_assert(ce);
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_block_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
error = 0;
out:
spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -506,12 +535,14 @@ cleanup:
#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
+ mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
}
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
spin_lock(&mb_cache_spinlock);
return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
spin_unlock(&mb_cache_spinlock);
return ce;
}
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
spin_lock(&mb_cache_spinlock);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
__mb_cache_entry_release_unlock(prev);
return ce;
}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..6a392e7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -3,19 +3,21 @@

(C) 2001 by Andreas Gruenbacher, <[email protected]>
*/
-
struct mb_cache_entry {
struct list_head e_lru_list;
struct mb_cache *e_cache;
unsigned short e_used;
unsigned short e_queued;
+ atomic_t e_refcnt;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2014-02-20 18:38:47

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 2/3] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new block group spinlock, mb_cache_bg_lock, is also added to serialize accesses
to each mb_cache_entry's local data.

A new member, e_refcnt, is added to the mb_cache_entry structure to help
prevent an mb_cache_entry from being deallocated by a free while it is
being referenced by either mb_cache_entry_get() or mb_cache_entry_find().

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v5:
- Added mb_cache_bg_lock, which is used to serialize accesses to an
mb_cache_entry's local data.
- Used e_refcnt to help prevent an mb_cache_entry from being
deallocated by a free inside mb_cache_entry_get() and
mb_cache_entry_find(), and at the same time reduce the hash chain's
lock hold time.

fs/mbcache.c | 417 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 301 insertions(+), 116 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..786ecab 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,41 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry's local data, such as e_used and e_queued, is
+ * serialized by a per-entry lock taken from mb_cache_bg_lock (see below).
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest lock order, followed by an
+ * index hash chain's lock, mb_cache_bg_lock (used to implement mb_cache_entry's
+ * lock), and mb_cache_spinlock, with the lowest order. While holding
+ * either a block or index hash chain lock, a thread can acquire an
+ * mb_cache_bg_lock, which in turn can also acquire mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chain, they need to lock the corresponding hash chain. For each
+ * mb_cache_entry within the chain, it needs to lock the mb_cache_entry to
+ * prevent either any simultaneous release or free on the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, both e_used,
+ * and e_queued are 0's. When an mb_cache_entry is explicitly freed it is
+ * first removed from a block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -37,6 +72,7 @@
#include <linux/list_bl.h>
#include <linux/mbcache.h>
#include <linux/init.h>
+#include <linux/blockgroup_lock.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -57,8 +93,13 @@

#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

+#define MB_CACHE_ENTRY_LOCK_BITS __builtin_log2(NR_BG_LOCKS)
+#define MB_CACHE_ENTRY_LOCK_INDEX(ce) \
+ (hash_long((unsigned long)ce, MB_CACHE_ENTRY_LOCK_BITS))
+
static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
-
+static struct blockgroup_lock *mb_cache_bg_lock;
+
MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
MODULE_LICENSE("GPL");
@@ -86,6 +127,20 @@ static LIST_HEAD(mb_cache_list);
static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

+static inline void
+__spin_lock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+ spin_lock(bgl_lock_ptr(mb_cache_bg_lock,
+ MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
+static inline void
+__spin_unlock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+ spin_unlock(bgl_lock_ptr(mb_cache_bg_lock,
+ MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
static inline int
__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
@@ -113,11 +168,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
hlist_bl_del_init(&ce->e_index.o_list);
}

+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash both the block and index hash
+ * chain.
+ * It assumes both the block and index hash chains are locked upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
{
__mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
__mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
}

static void
@@ -125,36 +190,47 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
{
struct mb_cache *cache = ce->e_cache;

- mb_assert(!(ce->e_used || ce->e_queued));
+ mb_assert(!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt)));
kmem_cache_free(cache->c_entry_cache, ce);
atomic_dec(&cache->c_entry_count);
}

-
static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
{
+ /* First lock the entry to serialize access to its local data. */
+ __spin_lock_mb_cache_entry(ce);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
if (ce->e_used >= MB_CACHE_WRITER)
ce->e_used -= MB_CACHE_WRITER;
+ /*
+ * Make sure that all cache entries on lru_list have
+ * both e_used and e_queued equal to 0.
+ */
ce->e_used--;
- if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_block_hashed(ce))
+ if (!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
+ __spin_unlock_mb_cache_entry(ce);
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ }
+ /*
+ * Need access to lru list, first drop entry lock,
+ * then reacquire the lock in the proper order.
+ */
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- spin_unlock(&mb_cache_spinlock);
+ __spin_unlock_mb_cache_entry(ce);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

-
/*
* mb_cache_shrink_scan() memory pressure callback
*
@@ -177,17 +253,34 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

mb_debug("trying to free %d entries", nr_to_scan);
spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+ while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
struct mb_cache_entry *ce =
list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /* Prevent any find or get operation on the entry */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ spin_lock(&mb_cache_spinlock);
}
spin_unlock(&mb_cache_spinlock);
+
list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
__mb_cache_entry_forget(entry, gfp_mask);
+ freed++;
}
return freed;
}
@@ -232,6 +325,14 @@ mb_cache_create(const char *name, int bucket_bits)
int n, bucket_count = 1 << bucket_bits;
struct mb_cache *cache = NULL;

+ if (!mb_cache_bg_lock) {
+ mb_cache_bg_lock = kmalloc(sizeof(struct blockgroup_lock),
+ GFP_KERNEL);
+ if (!mb_cache_bg_lock)
+ return NULL;
+ bgl_lock_init(mb_cache_bg_lock);
+ }
+
cache = kmalloc(sizeof(struct mb_cache), GFP_KERNEL);
if (!cache)
return NULL;
@@ -290,21 +391,47 @@ void
mb_cache_shrink(struct block_device *bdev)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct list_head *l;
+ struct mb_cache_entry *ce, *tmp;

+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ list_add_tail(&ce->e_lru_list, &free_list);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
}
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -320,23 +447,27 @@ void
mb_cache_destroy(struct mb_cache *cache)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct mb_cache_entry *ce, *tmp;

spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
- if (ce->e_cache == cache) {
+ list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+ if (ce->e_cache == cache)
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- }
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ list_del_init(&ce->e_lru_list);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ __mb_cache_entry_unhash_unlock(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -345,8 +476,6 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
-
kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
kfree(cache);
@@ -363,29 +492,59 @@ mb_cache_destroy(struct mb_cache *cache)
struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
- struct mb_cache_entry *ce = NULL;
+ struct mb_cache_entry *ce;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+
+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the
+ * entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(list_empty(&ce->e_lru_list));
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ __mb_cache_entry_unhash_unlock(ce);
+ goto found;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
- if (!ce) {
- ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
- if (!ce)
- return NULL;
- atomic_inc(&cache->c_entry_count);
- INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_HLIST_BL_NODE(&ce->e_block_list);
- INIT_HLIST_BL_NODE(&ce->e_index.o_list);
- ce->e_cache = cache;
- ce->e_queued = 0;
- }
+
+ ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+ if (!ce)
+ return NULL;
+ atomic_inc(&cache->c_entry_count);
+ INIT_LIST_HEAD(&ce->e_lru_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+ ce->e_cache = cache;
+ ce->e_queued = 0;
+ atomic_set(&ce->e_refcnt, 0);
+found:
ce->e_block_hash_p = &cache->c_block_hash[0];
ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
@@ -414,7 +573,6 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
struct hlist_bl_node *l;
- int error = -EBUSY;
struct hlist_bl_head *block_hash_p;
struct hlist_bl_head *index_hash_p;
struct mb_cache_entry *lce;
@@ -423,26 +581,29 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
- if (lce->e_bdev == bdev && lce->e_block == block)
- goto out;
+ if (lce->e_bdev == bdev && lce->e_block == block) {
+ hlist_bl_unlock(block_hash_p);
+ return -EBUSY;
+ }
}
mb_assert(!__mb_cache_entry_is_block_hashed(ce));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_block(ce);
+ __mb_cache_entry_unhash_index(ce);
ce->e_bdev = bdev;
ce->e_block = block;
ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ hlist_bl_unlock(block_hash_p);
bucket = hash_long(key, cache->c_bucket_bits);
index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
ce->e_index_hash_p = index_hash_p;
hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
- hlist_bl_add_head(&ce->e_block_list, block_hash_p);
- error = 0;
-out:
- spin_unlock(&mb_cache_spinlock);
- return error;
+ hlist_bl_unlock(index_hash_p);
+ return 0;
}


@@ -456,24 +617,26 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
}


/*
* mb_cache_entry_free()
*
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
*/
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_lock(ce->e_block_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_release(ce);
}


@@ -497,39 +660,48 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ /* First serialize access to the block corresponding hash chain. */
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
- DEFINE_WAIT(wait);
+ /*
+ * Prevent a free from removing the entry.
+ */
+ atomic_inc(&ce->e_refcnt);
+ hlist_bl_unlock(block_hash_p);
+ __spin_lock_mb_cache_entry(ce);
+ atomic_dec(&ce->e_refcnt);
+ if (ce->e_used > 0) {
+ DEFINE_WAIT(wait);
+ while (ce->e_used > 0) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ __spin_unlock_mb_cache_entry(ce);
+ schedule();
+ __spin_lock_mb_cache_entry(ce);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
+ ce->e_used += 1 + MB_CACHE_WRITER;
+ __spin_unlock_mb_cache_entry(ce);

- if (!list_empty(&ce->e_lru_list))
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
- while (ce->e_used > 0) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
- ce->e_used += 1 + MB_CACHE_WRITER;
-
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
return NULL;
}
- goto cleanup;
+ return ce;
}
}
- ce = NULL;
-
-cleanup:
- spin_unlock(&mb_cache_spinlock);
- return ce;
+ hlist_bl_unlock(block_hash_p);
+ return NULL;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +710,53 @@ static struct mb_cache_entry *
__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
+
+ /* The index hash chain is already acquired by the caller. */
while (l != NULL) {
struct mb_cache_entry *ce =
hlist_bl_entry(l, struct mb_cache_entry,
e_index.o_list);
mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
- list_del_init(&ce->e_lru_list);
-
+ /*
+ * Prevent a free from removing the entry.
+ */
+ atomic_inc(&ce->e_refcnt);
+ hlist_bl_unlock(head);
+ __spin_lock_mb_cache_entry(ce);
+ atomic_dec(&ce->e_refcnt);
+ ce->e_used++;
/* Incrementing before holding the lock gives readers
priority over writers. */
- ce->e_used++;
- while (ce->e_used >= MB_CACHE_WRITER) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
- schedule();
+ if (ce->e_used >= MB_CACHE_WRITER) {
+ DEFINE_WAIT(wait);
+
+ while (ce->e_used >= MB_CACHE_WRITER) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ __spin_unlock_mb_cache_entry(ce);
+ schedule();
+ __spin_lock_mb_cache_entry(ce);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
+ __spin_unlock_mb_cache_entry(ce);
+ if (!list_empty(&ce->e_lru_list)) {
spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- finish_wait(&mb_cache_queue, &wait);
-
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ __mb_cache_entry_release(ce);
return ERR_PTR(-EAGAIN);
}
return ce;
}
l = l->next;
}
+ hlist_bl_unlock(head);
return NULL;
}

@@ -598,12 +783,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_head *index_hash_p;

index_hash_p = &cache->c_index_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
if (!hlist_bl_empty(index_hash_p)) {
l = hlist_bl_first(index_hash_p);
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- }
- spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -638,11 +823,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,

index_hash_p = &cache->c_index_hash[bucket];
mb_assert(prev->e_index_hash_p == index_hash_p);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ __mb_cache_entry_release(prev);
return ce;
}

--
1.7.11.3