2013-07-18 00:55:38

by T Makphaibulchoke

Subject: [PATCH 0/2] ext4: increase mbcache scalability

This patch intends to improve the scalability of an ext4 filesystem by
introducing a higher degree of parallelism to the usage of its mb_cache and
mb_cache_entry structures.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| custom | 13.30 |
---------------------------
| disk | 5.00 |
---------------------------
| new_fserver | 6.73 |
---------------------------
| shared | 7.12 |
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 9.14 | 9.26 | 19.9 | 5.31 | 7.79 | 0.23 |-4.39 |-4.48 |-4.36 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 6.73 | 16.37 |
--------------------------------------------



On an 80 core machine:

Here are the improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| fserver | 6.47 |
---------------------------
| new_fserver | 4.62 |
---------------------------

For Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 4.29 | 2.28 | 3.25 | 1.45 |-0.84 | 3.11 |-0.61 | 3.80 | 2.58 |
-------------------------------------------------------------------------------

For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 0.49 | 3.21 |
--------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.


fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 8 ++
fs/ext4/xattr.c | 41 +++++---
fs/ext4/xattr.h | 3 +
fs/mbcache.c | 432 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------
include/linux/mbcache.h | 5 +
6 files changed, 385 insertions(+), 105 deletions(-)


2013-08-22 15:55:05

by T Makphaibulchoke

Subject: [PATCH v2 0/2] ext4: increase mbcache scalability

The patch consists of two parts. The first part introduces a higher degree
of parallelism to the usage of the mb_cache and mb_cache_entry structures
and impacts all ext filesystems.

The second part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mb_cache structure, instead of sharing a single mb_cache structure across
all ext4 filesystems.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 11.85 |
---------------------------
| custom | 14.42 |
---------------------------
| fserver | 21.36 |
---------------------------
| new_dbase | 5.59 |
---------------------------
| new_fserver | 21.45 |
---------------------------
| shared | 12.84 |
---------------------------
For Swingbench dss workload, with a 16 GB database,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 |-0.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using| | | | | | | | | |
| shared memory| | | | | | | | | |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 5.99 | N/A |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be the same.

Here are the results of Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (2):
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++--
fs/ext4/xattr.c | 51 +++++----
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 293 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
6 files changed, 269 insertions(+), 116 deletions(-)

--
1.7.11.3

2013-08-22 15:55:14

by T Makphaibulchoke

Subject: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
using the built-in bit locks of the hlist_bl hash chains to protect the
mb_cache's local block and index hash chains, while the global
mb_cache_lru_list and mb_cache_list continue to be protected by the global
mb_cache_spinlock.
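
For illustration, a minimal sketch of the per-chain locking pattern this
enables (not part of the patch; the identifiers follow the diff below):

    struct hlist_bl_head *head = &cache->c_block_hash[bucket];
    struct hlist_bl_node *p;
    struct mb_cache_entry *ce;

    hlist_bl_lock(head);            /* serializes only this hash chain */
    hlist_bl_for_each_entry(ce, p, head, e_block_list) {
            if (ce->e_bdev == bdev && ce->e_block == block)
                    break;          /* found a matching entry */
    }
    hlist_bl_unlock(head);          /* other chains remain uncontended */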

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v2:
- As per Linus Torvalds' suggestion, instead of allocating spinlock
arrays to protect the hash chains, use hlist_bl_head, which already
contains a built-in lock.

fs/mbcache.c | 293 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
2 files changed, 221 insertions(+), 82 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..cd4446e 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,16 @@
* back on the lru list.
*/

+/* Locking protocol:
+ *
+ * The nth hash chain of both the c_block_hash and c_index_hash are
+ * protected by the mth entry of the c_bdev_locks and c_key_locks respectively,
+ * where m is equal to n & c_lock_mask.
+ *
+ * While holding a c_bdev_locks, a thread can acquire either a c_key_locks
+ * or mb_cache_spinlock.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -35,9 +45,9 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>

-
#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
printk(KERN_DEBUG f); \
@@ -57,6 +67,40 @@

#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

+#define MB_LOCK_HASH_CHAIN(hash_head) hlist_bl_lock(hash_head)
+#define MB_UNLOCK_HASH_CHAIN(hash_head) hlist_bl_unlock(hash_head)
+#ifdef MB_CACHE_DEBUG
+#define MB_LOCK_BLOCK_HASH(ce) do { \
+ struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;\
+ MB_LOCK_HASH_CHAIN(block_hash_p); \
+ mb_assert(ce->e_block_hash_p == block_hash_p); \
+ } while (0)
+#else
+#define MB_LOCK_BLOCK_HASH(ce) MB_LOCK_HASH_CHAIN(ce->e_block_hash_p)
+#endif
+#define MB_UNLOCK_BLOCK_HASH(ce) MB_UNLOCK_HASH_CHAIN(ce->e_block_hash_p)
+#define MB_LOCK_INDEX_HASH(ce) MB_LOCK_HASH_CHAIN(ce->e_index_hash_p)
+#define MB_UNLOCK_INDEX_HASH(ce) MB_UNLOCK_HASH_CHAIN(ce->e_index_hash_p)
+
+#define MAX_LOCK_RETRY 1024
+
+static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
+{
+ int nretry = 0;
+ struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
+
+ do {
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ if (ce->e_block_hash_p == block_hash_p)
+ return 0;
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
+ block_hash_p = ce->e_block_hash_p;
+ } while (++nretry < MAX_LOCK_RETRY);
+ return -1;
+}
+
+#define MB_LOCK_ENTRY_BLOCK_HASH(ce) __mb_cache_lock_entry_block_hash(ce)
+
static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
@@ -101,7 +145,7 @@ static struct shrinker mb_cache_shrinker = {
static inline int
__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


@@ -109,11 +153,26 @@ static void
__mb_cache_entry_unhash(struct mb_cache_entry *ce)
{
if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
+ hlist_bl_del_init(&ce->e_block_list);
+ MB_LOCK_INDEX_HASH(ce);
+ hlist_bl_del(&ce->e_index.o_list);
+ MB_UNLOCK_INDEX_HASH(ce);
}
}

+#ifdef MAK_SRC
+static void
+__mb_cache_entry_unhash_nolock(struct mb_cache_entry *ce)
+{
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
+ mb_assert(!ce->e_block_hash_p);
+ mb_assert(!ce->e_index_hash_p);
+
+ hlist_bl_del_init(&ce->e_block_list);
+ hlist_bl_del(&ce->e_index.o_list);
+}
+#endif /* MAK_SRC */
+

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -127,9 +186,10 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)


static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release_unlock(struct mb_cache_entry *ce,
+ struct hlist_bl_head *hash_p)
{
+ mb_assert(hash_p);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
@@ -139,13 +199,17 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
if (!(ce->e_used || ce->e_queued)) {
if (!__mb_cache_entry_is_hashed(ce))
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
- }
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(hash_p);
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ } else
+ MB_UNLOCK_HASH_CHAIN(hash_p);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(hash_p);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

@@ -164,31 +228,45 @@ forget:
static int
mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
struct mb_cache *cache;
- struct mb_cache_entry *entry, *tmp;
int count = 0;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
+ int max_loop = nr_to_scan << 1;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
+ while ((nr_to_scan-- > 0) && (max_loop-- > 0)) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(!(ce->e_used || ce->e_queued));
__mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ nr_to_scan--;
}
+ spin_lock(&mb_cache_spinlock);
list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
mb_debug("cache %s (%d)", cache->c_name,
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
- }
return (count / 100) * sysctl_vfs_cache_pressure;
}

@@ -216,18 +294,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +354,22 @@ mb_cache_shrink(struct block_device *bdev)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
spin_unlock(&mb_cache_spinlock);
list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(!ce->e_used && !ce->e_queued);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -306,15 +393,18 @@ mb_cache_destroy(struct mb_cache *cache)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_cache == cache) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ MB_LOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +434,26 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
struct mb_cache_entry *ce = NULL;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+ struct list_head *ltmp;
+
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
+ spin_lock(&mb_cache_spinlock);
+ list_add_tail(&ce->e_lru_list,
+ &mb_cache_lru_list);
+ continue;
+ }
+ mb_assert(!ce->e_used && !ce->e_queued);
+ __mb_cache_entry_unhash(ce);
+ MB_UNLOCK_BLOCK_HASH(ce);
+ break;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
@@ -359,9 +463,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
}
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
@@ -388,28 +495,36 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ block_hash_p = &cache->c_block_hash[bucket];
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ ce->e_index_hash_p = index_hash_p;
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
return error;
}

@@ -424,8 +539,8 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ MB_LOCK_BLOCK_HASH(ce);
+ __mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
}


@@ -438,10 +553,12 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
+ mb_assert(ce->e_block_hash_p);
+ MB_LOCK_BLOCK_HASH(ce);
mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release_unlock(ce, ce->e_block_hash_p);
}


@@ -458,34 +575,39 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ block_hash_p = &cache->c_block_hash[bucket];
+ MB_LOCK_HASH_CHAIN(block_hash_p);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

while (ce->e_used > 0) {
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_BLOCK_HASH(ce);
schedule();
- spin_lock(&mb_cache_spinlock);
+ MB_LOCK_BLOCK_HASH(ce);
ce->e_queued--;
}
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

if (!__mb_cache_entry_is_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release_unlock(ce,
+ block_hash_p);
return NULL;
}
goto cleanup;
@@ -494,25 +616,30 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
ce = NULL;

cleanup:
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(block_hash_p);
return ce;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

+ MB_UNLOCK_HASH_CHAIN(head);
+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

+ MB_LOCK_HASH_CHAIN(head);
/* Incrementing before holding the lock gives readers
priority over writers. */
ce->e_used++;
@@ -520,18 +647,20 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ MB_UNLOCK_HASH_CHAIN(head);
schedule();
- spin_lock(&mb_cache_spinlock);
+ MB_LOCK_HASH_CHAIN(head);
+ mb_assert(ce->e_index_hash_p == head);
ce->e_queued--;
}
finish_wait(&mb_cache_queue, &wait);

if (!__mb_cache_entry_is_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ __mb_cache_entry_release_unlock(ce, head);
+ MB_LOCK_HASH_CHAIN(head);
return ERR_PTR(-EAGAIN);
}
+ mb_assert(ce->e_index_hash_p == head);
return ce;
}
l = l->next;
@@ -557,13 +686,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- spin_unlock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
return ce;
}

@@ -592,13 +725,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
+ MB_LOCK_HASH_CHAIN(index_hash_p);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ MB_UNLOCK_HASH_CHAIN(index_hash_p);
return ce;
}

diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2013-08-22 15:55:22

by T Makphaibulchoke

Subject: [PATCH v2 2/2] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.
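
For illustration, a minimal sketch of how a mounted filesystem uses the new
interfaces (not part of the patch; the actual hunks are in the diff below):

    /* at mount time, in ext4_fill_super() */
    sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
    if (!sbi->s_mb_cache) {
            ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
            goto failed_mount_wq;
    }

    /* at unmount time, in ext4_put_super() */
    if (sbi->s_mb_cache) {
            ext4_xattr_destroy_cache(sbi->s_mb_cache);
            sbi->s_mb_cache = NULL;
    }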

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5448,10 +5458,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
--
1.7.11.3

2013-08-22 16:53:48

by Linus Torvalds

Subject: Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

On Thu, Aug 22, 2013 at 8:54 AM, T Makphaibulchoke <[email protected]> wrote:
>
> +#define MB_LOCK_HASH_CHAIN(hash_head) hlist_bl_lock(hash_head)
> +#define MB_UNLOCK_HASH_CHAIN(hash_head) hlist_bl_unlock(hash_head)
> +#ifdef MB_CACHE_DEBUG
> +#define MB_LOCK_BLOCK_HASH(ce) do {

Please don't do these ugly and pointless preprocessor macro expanders
that hide what the actual operation is.

The DEBUG case seems to be just for your own testing anyway, so even
that shouldn't exist in the merged version.

> +#define MAX_LOCK_RETRY 1024
> +
> +static inline int __mb_cache_lock_entry_block_hash(struct mb_cache_entry *ce)
> +{
> + int nretry = 0;
> + struct hlist_bl_head *block_hash_p = ce->e_block_hash_p;
> +
> + do {
> + MB_LOCK_HASH_CHAIN(block_hash_p);
> + if (ce->e_block_hash_p == block_hash_p)
> + return 0;
> + MB_UNLOCK_HASH_CHAIN(block_hash_p);
> + block_hash_p = ce->e_block_hash_p;
> + } while (++nretry < MAX_LOCK_RETRY);
> + return -1;
> +}

And this is really ugly. Again it's also then hidden behind the ugly macro.

First off, the thousand-time retry seems completely excessive. Does it
actually need any retry AT ALL? If the hash entry changes, either you
should retry forever, or if you feel that can result in livelocks
(fair enough) and you need a fallback case to a bigger lock, then why
not just do the fallback immediately?

More importantly, regardless of that retry issue, this seems to be
abstracted at the wrong level, resulting in every single user of this
repeating the same complex and hard-to-understand incantation:

> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> + spin_unlock(&mb_cache_spinlock);
> + continue;
> + }
..
> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
> + spin_unlock(&mb_cache_spinlock);
> + continue;
> + }
..
> + if (MB_LOCK_ENTRY_BLOCK_HASH(ce)) {
> + spin_lock(&mb_cache_spinlock);
> + list_add_tail(&ce->e_lru_list,
> + &mb_cache_lru_list);
> + continue;
> + }

where the only difference is that the last one doesn't unlock
afterwards because it runs in a loop with that LRU list lock held.
Ugh.

The locking logic also isn't explained anywhere, making the
hard-to-read code even harder to read.

Linus

Subject: Re: [PATCH v2 1/2] mbcache: decoupling the locking of local from global data

Thanks for the comments.

On 08/22/2013 04:53 PM, Linus Torvalds wrote:
>
> Please don't do these ugly and pointless preprocessor macro expanders
> that hide what the actual operation is.
>
> The DEBUG case seems to be just for your own testing anyway, so even
> that shouldn't exist in the merged version.
>

Sorry, will clean them all up.

>
> And this is really ugly. Again it's also then hidden behind the ugly macro.
>
> First off, the thousand-time retry seems completely excessive. Does it
> actually need any retry AT ALL? If the hash entry changes, either you
> should retry forever, or if you feel that can result in livelocks
> (fair enough) and you need a fallback case to a bigger lock, then why
> not just do the fallback immediately?
>
> More importantly, regardless of that retry issue, this seems to be
> abstracted at the wrong level, resulting in every single user of this
> repeating the same complex and hard-to-understand incantation:
>

Looks like this is a misjudgement on my part. There is really no need to guard against an mb_cache_entry moving to a different hash chain, as the shrinking and allocation functions already protect against each other through mb_cache_spinlock. The retry is not needed.
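
For example, the shrinker path could simply take the entry's chain locks
directly, with no retry (sketch only; the split unhash helpers are
hypothetical here, but this is essentially what the later v3 patch does):

    /* ce was already taken off mb_cache_lru_list under mb_cache_spinlock */
    hlist_bl_lock(ce->e_block_hash_p);
    hlist_bl_lock(ce->e_index_hash_p);
    if (!(ce->e_used || ce->e_queued)) {
            __mb_cache_entry_unhash_index(ce);
            hlist_bl_unlock(ce->e_index_hash_p);
            __mb_cache_entry_unhash_block(ce);
            hlist_bl_unlock(ce->e_block_hash_p);
            __mb_cache_entry_forget(ce, gfp_mask);
    } else {
            hlist_bl_unlock(ce->e_index_hash_p);
            hlist_bl_unlock(ce->e_block_hash_p);
    }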

>
> where the only difference is that the last one doesn't unlock
> afterwards because it runs in a loop with that LRU list lock held.
> Ugh.

Following the above logic, these pieces of code are also not necessary and could be reduced to a simple unhash, as in the original.

>
> The locking logic also isn't explained anywhere, making the
> hard-to-read code even harder to read.


Will add a comment explaining the locking logic.

>
> Linus
>

Thanks,
Mak.

2013-09-04 16:39:35

by T Makphaibulchoke

Subject: [PATCH v3 0/2] ext4: increase mbcache scalability

This patch intends to improve the scalability of an ext filesystem,
particularly ext4.

The patch consists of two parts. The first part introduces a higher degree
of parallelism to the usage of the mb_cache and mb_cache_entry structures
and impacts all ext filesystems.

The second part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mb_cache structure, instead of sharing a single mb_cache structure across
all ext4 filesystems.

Here are some of the benchmark results with the changes.

On a 90 core machine:

Here are the performance improvements in some of the aim7 workloads,

---------------------------
| | % increase |
---------------------------
| alltests | 11.85 |
---------------------------
| custom | 14.42 |
---------------------------
| fserver | 21.36 |
---------------------------
| new_dbase | 5.59 |
---------------------------
| new_fserver | 21.45 |
---------------------------
| shared | 12.84 |
---------------------------
For Swingbench dss workload, with a 16 GB database,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement| 8.46 | 8.00 | 7.35 |-0.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
-------------------------------------------------------------------------------
| % improvement|45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
| without using| | | | | | | | | |
| shared memory| | | | | | | | | |
-------------------------------------------------------------------------------
For SPECjbb2013, composite run,

--------------------------------------------
| | max-jOPS | critical-jOPS |
--------------------------------------------
| % improvement | 5.99 | N/A |
--------------------------------------------


On an 80 core machine:

The aim7 results for most of the workloads turn out to be the same.

Here are the results of Swingbench dss workload,

-------------------------------------------------------------------------------
| Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
-------------------------------------------------------------------------------
| % improvement|-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
-------------------------------------------------------------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (2):
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++--
fs/ext4/xattr.c | 51 ++++----
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
6 files changed, 277 insertions(+), 121 deletions(-)

--
1.7.11.3

2013-09-04 16:39:55

by T Makphaibulchoke

Subject: [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
in fs/ext4/xattr.c.
fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v3:
- No change

Changed in v2:
- No change

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
4 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..70f2709 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1306,6 +1306,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;
};

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 36b141e..3e7d338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -828,6 +829,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3916,6 +3921,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5428,11 +5440,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5448,10 +5458,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5474,7 +5483,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..3e0ff31 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1492,13 +1498,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1570,12 +1576,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1673,19 +1680,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 6
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
--
1.7.11.3

2013-09-04 16:39:54

by T Makphaibulchoke

Subject: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hash chain's hlist_bl_head contains a
built-in bit lock used to protect the mb_cache's local block and index
hash chains. The global data mb_cache_lru_list and mb_cache_list continue
to be protected by the global mb_cache_spinlock.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v3:
- Removed all hash lock macros.
- Fixed a possible race condition updating the e_used and
e_queued members of an mb_cache_entry between the mb_cache_entry_get
function, traversing a block hash chain, and the mb_cache_entry_find
function, traversing an index hash chain.

Changed in v2:
- As per Linus Torvalds' suggestion, instead of allocating spinlock
arrays to protect the hash chains, use hlist_bl_head, which already
contains a built-in lock.

fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
include/linux/mbcache.h | 10 +-
2 files changed, 229 insertions(+), 87 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..dd45fe9 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,38 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest order, followed by each
+ * index hash chain's lock, with mb_cache_spinlock the lowest.
+ * While holding a block hash chain lock a thread can acquire either
+ * an index hash chain lock or mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both the e_used and e_queued members of each mb_cache_entry can
+ * be updated while traversing either a block hash chain or an index hash
+ * chain and the index hash chain lock has lower order, each index hash
+ * chain's lock, in addition to being used to serialize accesses to the
+ * index hash chain, is also used to serialize accesses to both e_used
+ * and e_queued.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced. Both e_used
+ * and e_queued are 0's. When an mb_cache_entry is explicitly
+ * freed it is first removed from a block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -35,9 +67,9 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>

-
#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
printk(KERN_DEBUG f); \
@@ -99,23 +131,34 @@ static struct shrinker mb_cache_shrinker = {
};

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}


static void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del(&ce->e_index.o_list);
+}
+
+static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
{
struct mb_cache *cache = ce->e_cache;
@@ -128,8 +171,8 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)

static void
__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
{
+ hlist_bl_lock(ce->e_index_hash_p);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
@@ -137,15 +180,20 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ hlist_bl_unlock(ce->e_index_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
+ spin_lock(&mb_cache_spinlock);
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
- }
- spin_unlock(&mb_cache_spinlock);
+ spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

@@ -164,31 +212,46 @@ forget:
static int
mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
struct mb_cache *cache;
- struct mb_cache_entry *entry, *tmp;
int count = 0;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ while (nr_to_scan > 0) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ --nr_to_scan;
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
+ spin_lock(&mb_cache_spinlock);
list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
mb_debug("cache %s (%d)", cache->c_name,
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
- }
return (count / 100) * sysctl_vfs_cache_pressure;
}

@@ -216,18 +279,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -276,13 +339,24 @@ mb_cache_shrink(struct block_device *bdev)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
spin_unlock(&mb_cache_spinlock);
list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
}

@@ -306,15 +380,21 @@ mb_cache_destroy(struct mb_cache *cache)
list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_cache == cache) {
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
}
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ struct mb_cache_entry *ce =
+ list_entry(l, struct mb_cache_entry, e_lru_list);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -344,12 +424,27 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
struct mb_cache_entry *ce = NULL;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+ struct list_head *ltmp;
+
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ break;
+ }
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
spin_unlock(&mb_cache_spinlock);
}
@@ -359,9 +454,12 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
}
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
@@ -388,28 +486,37 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ block_hash_p = &cache->c_block_hash[bucket];
+ hlist_bl_lock(block_hash_p);
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
- __mb_cache_entry_unhash(ce);
+ mb_assert(!__mb_cache_entry_is_hashed(ce));
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_unlock(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return error;
}

@@ -424,7 +531,7 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_block_hash_p);
__mb_cache_entry_release_unlock(ce);
}

@@ -438,9 +545,13 @@ mb_cache_entry_release(struct mb_cache_entry *ce)
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
__mb_cache_entry_release_unlock(ce);
}

@@ -458,33 +569,46 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;
+ int held_block_lock = 1;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
- spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ block_hash_p = &cache->c_block_hash[bucket];
+ hlist_bl_lock(block_hash_p);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

+ hlist_bl_lock(ce->e_index_hash_p);
while (ce->e_used > 0) {
ce->e_queued++;
+ hlist_bl_unlock(ce->e_index_hash_p);
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ if (held_block_lock) {
+ hlist_bl_unlock(block_hash_p);
+ held_block_lock = 0;
+ }
schedule();
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(ce->e_index_hash_p);
ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;
+ hlist_bl_unlock(ce->e_index_hash_p);
+ finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!held_block_lock)
+ hlist_bl_lock(block_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -494,24 +618,27 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
ce = NULL;

cleanup:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return ce;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

+ spin_lock(&mb_cache_spinlock);
if (!list_empty(&ce->e_lru_list))
list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);

/* Incrementing before holding the lock gives readers
priority over writers. */
@@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
ce->e_queued++;
prepare_to_wait(&mb_cache_queue, &wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(head);
schedule();
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(head);
+ mb_assert(ce->e_index_hash_p == head);
ce->e_queued--;
}
+ hlist_bl_unlock(head);
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ hlist_bl_lock(ce->e_block_hash_p);
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(head);
return ERR_PTR(-EAGAIN);
- }
+ } else
+ hlist_bl_unlock(ce->e_block_hash_p);
+ mb_assert(ce->e_index_hash_p == head);
return ce;
}
l = l->next;
@@ -557,13 +689,17 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- spin_unlock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -592,13 +728,17 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

- spin_lock(&mb_cache_spinlock);
+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
+ hlist_bl_lock(index_hash_p);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ hlist_bl_unlock(index_hash_p);
return ce;
}

diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..89826c0 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,13 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3
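
(Condensed illustration, not code from the patch itself: the eviction
pattern that the shrinker, mb_cache_shrink() and mb_cache_entry_alloc()
hunks above now all follow -- take the entry's two per-chain bit locks
instead of the global mb_cache_spinlock, and only free an entry that is
neither in use nor being waited on.)

        static void example_evict(struct mb_cache_entry *ce, gfp_t gfp_mask)
        {
                hlist_bl_lock(ce->e_block_hash_p);
                hlist_bl_lock(ce->e_index_hash_p);
                if (!(ce->e_used || ce->e_queued)) {
                        __mb_cache_entry_unhash_index(ce);
                        hlist_bl_unlock(ce->e_index_hash_p);
                        __mb_cache_entry_unhash_block(ce);
                        hlist_bl_unlock(ce->e_block_hash_p);
                        /* entry is off both chains; safe to free */
                        __mb_cache_entry_forget(ce, gfp_mask);
                } else {
                        hlist_bl_unlock(ce->e_index_hash_p);
                        hlist_bl_unlock(ce->e_block_hash_p);
                }
        }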

2013-09-04 20:00:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-04, at 10:39 AM, T Makphaibulchoke wrote:
> This patch intends to improve the scalability of an ext filesystem,
> particularly ext4.

In the past, I've raised the question of whether mbcache is even
useful on real-world systems. Essentially, this is providing a
"deduplication" service for ext2/3/4 xattr blocks that are identical.
The question is how often this is actually the case in modern use?
The original design was for allowing external ACL blocks to be
shared between inodes, at a time when ACLs were pretty much the
only xattrs stored on inodes.

The question now is whether there are common uses where all of the
xattrs stored on multiple inodes are identical? If that is not the
case, mbcache is just adding overhead and should be disabled
entirely, rather than merely having its overhead reduced.

There aren't good statistics on the hit rate for mbcache, but it
might be possible to generate some with systemtap or similar to
see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.

Cheers, Andreas

> The patch consists of two parts. The first part introduces higher
> degree of parallelism to the usages of the mb_cache and
> mb_cache_entries and impacts all ext filesystems.
>
> The second part of the patch further increases the scalablity of
> an ext4 filesystem by having each ext4 fielsystem allocate and use
> its own private mbcache structure, instead of sharing a single
> mcache structures across all ext4 filesystems
>
> Here are some of the benchmark results with the changes.
>
> On a 90 core machine:
>
> Here are the performance improvements in some of the aim7 workloads,
>
> ---------------------------
> | | % increase |
> ---------------------------
> | alltests | 11.85 |
> ---------------------------
> | custom | 14.42 |
> ---------------------------
> | fserver | 21.36 |
> ---------------------------
> | new_dbase | 5.59 |
> ---------------------------
> | new_fserver | 21.45 |
> ---------------------------
> | shared | 12.84 |
> ---------------------------
> For Swingbench dss workload, with 16 GB database,
>
> -------------------------------------------------------------------------------
> | Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
> -------------------------------------------------------------------------------
> | % imprvoment | 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
> -------------------------------------------------------------------------------
> | % imprvoment |45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
> | without using| | | | | | | | | |
> | shared memory| | | | | | | | | |
> -------------------------------------------------------------------------------
> For SPECjbb2013, composite run,
>
> --------------------------------------------
> | | max-jOPS | critical-jOPS |
> --------------------------------------------
> | % improvement | 5.99 | N/A |
> --------------------------------------------
>
>
> On an 80 core machine:
>
> The aim7's results for most of the workloads turn out to the same.
>
> Here are the results of Swingbench dss workload,
>
> -------------------------------------------------------------------------------
> | Users | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 |
> -------------------------------------------------------------------------------
> | % imprvoment |-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
> -------------------------------------------------------------------------------
>
> The changes have been tested with ext4 xfstests to verify that no regression
> has been introduced.
>
> Changed in v3:
> - New diff summary
>
> Changed in v2:
> - New performance data
> - New diff summary
>
> T Makphaibulchoke (2):
> mbcache: decoupling the locking of local from global data
> ext4: each filesystem creates and uses its own mb_cache
>
> fs/ext4/ext4.h | 1 +
> fs/ext4/super.c | 24 ++--
> fs/ext4/xattr.c | 51 ++++----
> fs/ext4/xattr.h | 6 +-
> fs/mbcache.c | 306 +++++++++++++++++++++++++++++++++++-------------
> include/linux/mbcache.h | 10 +-
> 6 files changed, 277 insertions(+), 121 deletions(-)
>
> --
> 1.7.11.3
>


Cheers, Andreas




Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/04/2013 08:00 PM, Andreas Dilger wrote:

>
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems. Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical? If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas
>

Thanks Andreas for the comments. Since I'm not familiar with systemtap, I'm thinking the quickest and simplest way is to re-run aim7 and Swingbench with mbcache disabled for comparison. Please let me know if you have any other benchmark suggestions, or if you think systemtap on ext4_xattr_cache_find() would give a more accurate measurement.

Thanks,
Mak.

2013-09-05 02:35:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, Sep 04, 2013 at 10:39:14AM -0600, T Makphaibulchoke wrote:
>
> Here are the performance improvements in some of the aim7 workloads,

How did you gather these results? The mbcache is only used if you are
using extended attributes, and only if the extended attributes don't
fit in the inode's extra space.

I checked aim7, and it doesn't do any extended attribute operations.
So why are you seeing differences? Are you doing something like
deliberately using 128 byte inodes (which is not the default inode
size), and then enabling SELinux, or some such?

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
> How did you gather these results? The mbcache is only used if you are
> using extended attributes, and only if the extended attributes don't
> fit in the inode's extra space.
>
> I checked aim7, and it doesn't do any extended attribute operations.
> So why are you seeing differences? Are you doing something like
> deliberately using 128 byte inodes (which is not the default inode
> size), and then enabling SELinux, or some such?
>
> - Ted
>

No, I did not do anything special, including changing the inode size. I just used the profile data, which indicated the mb_cache module as one of the bottlenecks. Please see below for perf data from one of the new_fserver runs, which also shows some mb_cache activity.


|--3.51%-- __mb_cache_entry_find
| mb_cache_entry_find_first
| ext4_xattr_cache_find
| ext4_xattr_block_set
| ext4_xattr_set_handle
| ext4_initxattrs
| security_inode_init_security
| ext4_init_security
| __ext4_new_inode
| ext4_create
| vfs_create
| lookup_open
| do_last
| path_openat
| do_filp_open
| do_sys_open
| SyS_open
| sys_creat
| system_call
| __creat_nocancel
| |
| |--16.67%-- 0x11fe2c0
| |

Thanks,
Mak.

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/04/2013 08:00 PM, Andreas Dilger wrote:
>
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems. Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs where pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical? If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas
>

Looks like it's a bit harder to disable mbcache than I thought. I ended up adding code to collect the statistics.

With SELinux enabled, for the new_fserver workload of aim7, there are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls that result in a hit and 0xc100000000000000 calls that do not. The numbers do not seem to favor completely disabling mbcache in this case.
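
The instrumentation amounts to something like the sketch below; the counter
names and placement are only illustrative, not the exact code I used -- two
counters bumped on the two outcomes of ext4_xattr_cache_find() and dumped
via printk (or a debugfs file) at the end of the run:

        /* illustration only, in fs/ext4/xattr.c */
        static atomic64_t ext4_xattr_cache_hits = ATOMIC64_INIT(0);
        static atomic64_t ext4_xattr_cache_misses = ATOMIC64_INIT(0);

        /* ... where ext4_xattr_cache_find() returns its buffer_head: */
        if (bh)                 /* found a reusable xattr block */
                atomic64_inc(&ext4_xattr_cache_hits);
        else                    /* no identical block cached */
                atomic64_inc(&ext4_xattr_cache_misses);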

Thanks,
Mak.

2013-09-06 05:10:43

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
> On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
>> How did you gather these results? The mbcache is only used if you
>> are using extended attributes, and only if the extended attributes don't fit in the inode's extra space.
>>
>> I checked aim7, and it doesn't do any extended attribute operations.
>> So why are you seeing differences? Are you doing something like
>> deliberately using 128 byte inodes (which is not the default inode
>> size), and then enabling SELinux, or some such?
>
> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>
>
> |--3.51%-- __mb_cache_entry_find
> | mb_cache_entry_find_first
> | ext4_xattr_cache_find
> | ext4_xattr_block_set
> | ext4_xattr_set_handle
> | ext4_initxattrs
> | security_inode_init_security
> | ext4_init_security

Looks like this is some large security xattr, or enough smaller
xattrs to exceed the ~120 bytes of in-inode xattr storage. How
big is the SELinux xattr (assuming that is what it is)?

> Looks like it's a bit harder to disable mbcache than I thought.
> I ended up adding code to collect the statics.
>
> With selinux enabled, for new_fserver workload of aim7, there
> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
> that result in a hit and 0xc100000000000000 calls that are not.
> The number does not seem to favor the complete disabling of
> mbcache in this case.

This is about a 65% hit rate, which seems reasonable.

You could try a few different things here:
- disable selinux completely (boot with "selinux=0" on the kernel
command line) and see how much faster it is
- format your ext4 filesystem with larger inodes (-I 512) and see
if this is an improvement or not. That depends on the size of
the selinux xattrs and if they will fit into the extra 256 bytes
of xattr space these larger inodes will give you. The performance
might also be worse, since there will be more data to read/write
for each inode, but it would avoid seeking to the xattr blocks.

Cheers, Andreas




Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/06/2013 05:10 AM, Andreas Dilger wrote:
> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>>
>>
>> |--3.51%-- __mb_cache_entry_find
>> | mb_cache_entry_find_first
>> | ext4_xattr_cache_find
>> | ext4_xattr_block_set
>> | ext4_xattr_set_handle
>> | ext4_initxattrs
>> | security_inode_init_security
>> | ext4_init_security
>
> Looks like this is some large security xattr, or enough smaller
> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
> big is the SELinux xattr (assuming that is what it is)?
>

Sorry I'm familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs. With SELinux disabled, there seems to be no call ext4_xattr_cache_find().

>> Looks like it's a bit harder to disable mbcache than I thought.
>> I ended up adding code to collect the statics.
>>
>> With selinux enabled, for new_fserver workload of aim7, there
>> are a total of 0x7e05420100000000 ext4_xattr_cache_find() calls
>> that result in a hit and 0xc100000000000000 calls that are not.
>> The number does not seem to favor the complete disabling of
>> mbcache in this case.
>
> This is about a 65% hit rate, which seems reasonable.
>
> You could try a few different things here:
> - disable selinux completely (boot with "selinux=0" on the kernel
> command line) and see how much faster it is
> - format your ext4 filesystem with larger inodes (-I 512) and see
> if this is an improvement or not. That depends on the size of
> the selinux xattrs and if they will fit into the extra 256 bytes
> of xattr space these larger inodes will give you. The performance
> might also be worse, since there will be more data to read/write
> for each inode, but it would avoid seeking to the xattr blocks.
>

Thanks for the above suggestions. Could you please clarify whether we are looking for a workaround here? Since we agree that the way mb_cache uses one global spinlock is incorrect, and SELinux merely exposes the problem (which is not uncommon in enterprise installations), I believe we should look at fixing it (patch 1/2). As you mentioned, this will also impact both the ext2 and ext3 filesystems.

Anyway, please let me know if you still think any of the above experiments is relevant.

Thanks,
Mak.


> Cheers, Andreas
>
>
>
>
>

2013-09-10 20:47:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 2013-09-06, at 6:23 AM, Thavatchai Makphaibulchoke wrote:
> On 09/06/2013 05:10 AM, Andreas Dilger wrote:
>> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>>> No, I did not do anything special, including changing an inode's size. I just used the profile data, which indicated mb_cache module as one of the bottleneck. Please see below for perf data from one of th new_fserver run, which also shows some mb_cache activities.
>>>
>>>
>>> |--3.51%-- __mb_cache_entry_find
>>> | mb_cache_entry_find_first
>>> | ext4_xattr_cache_find
>>> | ext4_xattr_block_set
>>> | ext4_xattr_set_handle
>>> | ext4_initxattrs
>>> | security_inode_init_security
>>> | ext4_init_security
>>
>> Looks like this is some large security xattr, or enough smaller
>> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
>> big is the SELinux xattr (assuming that is what it is)?
>>
>> You could try a few different things here:
>> - disable selinux completely (boot with "selinux=0" on the kernel
>> command line) and see how much faster it is

> Sorry I'm

not?

> familiar with SELinux enough to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs. With SELinux disabled, there seems to be no call ext4_xattr_cache_find().

What is the relative performance of your benchmark with SELinux disabled?
While the oprofile graphs will be of passing interest to see that the
mbcache overhead is gone, they will not show the reduction in disk IO
from not writing/updating the external xattr blocks at all.

>> - format your ext4 filesystem with larger inodes (-I 512) and see
>> if this is an improvement or not. That depends on the size of
>> the selinux xattrs and if they will fit into the extra 256 bytes
>> of xattr space these larger inodes will give you. The performance
>> might also be worse, since there will be more data to read/write
>> for each inode, but it would avoid seeking to the xattr blocks.
>
> Thanks for the above suggestions. Could you please clarify if we are
> attempting to look for a workaround here? Since we agree the way
> mb_cache uses one global spinlock is incorrect and SELinux exposes
> the problem (which is not uncommon with Enterprise installations),
> I believe we should look at fixing it (patch 1/2). As you also
> mentioned, this will also impact both ext2 and ext3 filesystems.

I agree that SELinux is enabled on enterprise distributions by default,
but I'm also interested to know how much overhead this imposes. I would
expect that writing large external xattrs for each file would have quite
a significant performance overhead that should not be ignored. Reducing
the mbcache overhead is good, but eliminating it entirely is better.

Depending on how much overhead SELinux has, it might be important to
spend more time to optimize it (not just the mbcache part), or users
may consider disabling SELinux entirely on systems where they care
about peak performance.

> Anyway, please let me know if you still think any of the above
> experiments is relevant.

You have already done one of the tests that I'm interested in (the above
test which showed that disabling SELinux removed the mbcache overhead).
What I'm interested in is the actual performance (or relative performance
if you are not allowed to publish the actual numbers) of your AIM7
benchmark between SELinux enabled and SELinux disabled.

Next would be a new test that has SELinux enabled, but formatting the
filesystem with 512-byte inodes instead of the ext4 default of 256-byte
inodes. If this makes a significant improvement, it would potentially
mean users and the upstream distros should use different formatting
options along with SELinux. This is less clearly a win, since I don't
know enough details of how SELinux uses xattrs (I always disable it,
so I don't have any systems to check).

Cheers, Andreas




2013-09-10 21:04:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
> I agree that SELinux is enabled on enterprise distributions by default,
> but I'm also interested to know how much overhead this imposes. I would
> expect that writing large external xattrs for each file would have quite
> a significant performance overhead that should not be ignored. Reducing
> the mbcache overhead is good, but eliminating it entirely is better.

I was under the impression that using a 256 byte inode (which gives a
bit over 100 bytes worth of xattr space) was plenty for SELinux. If
it turns out that SELinux's use of xattrs has gotten especially
piggy, then we may need to revisit the recommended inode size for
those systems that insist on using SELinux... even if we eliminate the
overhead associated with mbcache, the fact that files require a
separate xattr block is going to seriously degrade performance.

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/10/2013 09:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.
>
> - Ted
>

Thank you Andreas and Ted for the explanations and comments. Yes, I see both of your points now. Even though we can reduce the mbcache overhead, the additional xattr I/O still has a cost, so it would be better to provide some data to help users or distros determine whether they would be better off completely disabling SELinux or increasing the inode size. I will go ahead and run the suggested experiments and get back with the results.

Thanks,
Mak.

2013-09-11 03:16:04

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

# find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
11082179

bytes of values for:

# df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_bp05-lv_root
3276800 280785 2996015 9% /

280785 inodes used,

so:
11082179/280785 = ~39.5 bytes per value on average, plus:

# echo -n "security.selinux" | wc -c
16

adding the 16 bytes for the name gives only about 55-56 bytes per selinux attr on average.

So nope, not "especially piggy" on average.

Another way to do it is this: list all possible file contexts, and make
a histogram of their sizes:

# for CONTEXT in `semanage fcontext -l | awk '{print $NF}' `; do echo -n $CONTEXT | wc -c; done | sort -n | uniq -c
1 7
33 8
356 26
14 27
14 28
37 29
75 30
237 31
295 32
425 33
324 34
445 35
548 36
229 37
193 38
181 39
259 40
81 41
108 42
96 43
55 44
55 45
16 46
41 47
23 48
28 49
36 50
10 51
10 52
5 54
2 57

so a 57 byte value is max, but there aren't many of the larger values.

Above doesn't tell us the prevalence of various contexts on the actual system,
but they are all under 100 bytes in any case.

-Eric

> - Ted

2013-09-11 11:31:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>
> Above doesn't tell us the prevalence of various contexts on the actual system,
> but they are all under 100 bytes in any case.

OK, so in other words, on your system i_file_acl and i_file_acl_high
(which is where we store the block # for the external xattr block),
should always be zero for all inodes, yes?

Thavatchai, can you check to see whether or not this is true on your
system? You can use debugfs on the file system, and then use the
"stat" command to sample various inodes in your system. Or I can make
a version of e2fsck which counts the number of inodes with external
xattr blocks --- it sounds like this is something we should do anyway.
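
For what it's worth, a quick user-space scan with libext2fs can also count
them without modifying e2fsck. A rough sketch -- illustrative only, error
handling trimmed, and it only looks at the low 32 bits of the xattr block
number:

        #include <stdio.h>
        #include <ext2fs/ext2fs.h>

        int main(int argc, char **argv)
        {
                ext2_filsys fs;
                ext2_inode_scan scan;
                struct ext2_inode inode;
                ext2_ino_t ino;
                unsigned long total = 0, external = 0;

                if (argc != 2 ||
                    ext2fs_open(argv[1], 0, 0, 0, unix_io_manager, &fs) ||
                    ext2fs_open_inode_scan(fs, 0, &scan))
                        return 1;
                while (!ext2fs_get_next_inode(scan, &ino, &inode) && ino) {
                        if (!inode.i_links_count)
                                continue;       /* skip unused inodes */
                        total++;
                        if (inode.i_file_acl)   /* external xattr block # */
                                external++;
                }
                printf("%lu of %lu in-use inodes have an external xattr block\n",
                       external, total);
                ext2fs_close_inode_scan(scan);
                ext2fs_close(fs);
                return 0;
        }

Build with something like "gcc scan.c -lext2fs -lcom_err" and run it against
a device that is unmounted or mounted read-only.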

One difference might be that Eric ran this test on RHEL 6, and
Thavatchai is using an upstream kernel, so maybe this is bloat that
has been added recently?

The reason why I'm pushing here is that mbcache shouldn't be showing
up in the profiles at all if there is no external xattr block. And so
if newer versions of SELinux (like Andreas, I've been burned by
SELinux too many times in the past, so I don't use SELinux on any of
my systems) are somehow causing mbcache to get triggered, we should
figure this out and understand what's really going on.

Sigh, I suppose I should figure out how to create a minimal KVM setup
which uses SELinux just so I can see what the heck is going on....

- Ted

2013-09-11 16:52:28

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>
>> Above doesn't tell us the prevalence of various contexts on the actual system,
>> but they are all under 100 bytes in any case.
>
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block),
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at
least an additional thing).

# find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

Finds quite a lot that claim to have external blocks, but it seems broken:

# filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
Filesystem type is: ef53
File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
ext logical physical expected length flags
0 0 32212996252 100 not_aligned,inline
/var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

# getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
getfattr: Removing leading '/' from absolute path names
# file: var/lib/yum/repos/x86_64/6Server/epel
security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

debugfs: stat var/lib/yum/repos/x86_64/6Server/epel
Inode: 1968466 Type: directory Mode: 0755 Flags: 0x80000
Generation: 2728788146 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 4096
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
EXTENTS:
(0): 7873422

sooo seems like filefrag -x is buggy and can't be trusted. :(

> Thavatchai, can you check to see whether or not this is true on your
> system? You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system. Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
>
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this is bloat has
> been added recently?

It's a userspace policy so the kernel shouldn't matter... "bloat" would
only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block. And so
> if newer versions of SELinux (like Adnreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) is somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr.

And as I showed earlier, name+value for all of the attrs set by at least RHEL6
selinux policy are well under 100 bytes.

(Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode,
sure, but selinux on its own should not).

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on....

http://fedoraproject.org/en/get-fedora ;)

-Eric

> - Ted
>

2013-09-11 19:38:06

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>>
>>> Above doesn't tell us the prevalence of various contexts on the actual system,
>>> but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block),
>> should always be zero for all inodes, yes?
>
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
>
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c
>
> Finds quite a lot that claim to have external blocks, but it seems broken:
>
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
> ext logical physical expected length flags
> 0 0 32212996252 100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
>
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes
not blocks" :(

I'll send a patch to ignore inline extents on fiemap calls to make this
easier, but in the meantime, neither my RHEL6 root nor my F17 root have
any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256 byte
inodes.

-Eric

2013-09-11 20:33:58

by David Lang

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, 11 Sep 2013, Eric Sandeen wrote:

>> The reason why I'm pushing here is that mbcache shouldn't be showing
>> up in the profiles at all if there is no external xattr block. And so
>> if newer versions of SELinux (like Adnreas, I've been burned by
>> SELinux too many times in the past, so I don't use SELinux on any of
>> my systems) is somehow causing mbcache to get triggered, we should
>> figure this out and understand what's really going on.
>
> selinux, from an fs allocation behavior perspective, is simply setxattr.

what you are missing is that Ted is saying that unless you are using xattrs, the
mbcache should not show up at all.

The fact that you are using SELinux, and SELinux sets the xattrs, is what makes
this show up on your system, but other people who don't use SELinux (and so
don't have any xattrs set) don't see the same bottleneck.

David Lang

2013-09-11 20:51:39

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
>
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block. And so
>>> if newer versions of SELinux (like Adnreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) is somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
>
> what you are missing is that Ted is saying that unless you are using xattrs, the mbcache should not show up at all.
>
> The fact that you are using SElinux, and SELinux sets the xattrs is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well. But Ted was also saying that perhaps selinux had "gotten piggy" and was causing this. I've shown that it hasn't.


This matters because unless the selinux xattrs go out of the inode into their own block, mbcache should STILL not come into it at all.

And for attrs < 100 bytes, they stay in the inode. And on inspection, my SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated blocks, and selinux does not cause that on its own.

Soo... selinux by itself should not be triggering any mbcache codepaths.

Ted suggested that "selinux had gotten piggy" so I checked, and showed that it hadn't. That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.

-Eric

> David Lang

2013-09-11 21:27:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>
> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
>

Yes, the next thing is to see if on his systems, whether or not he's
seeing external xattr blocks.

- Ted

Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 09/11/2013 09:25 PM, Theodore Ts'o wrote:
> On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>>
>> So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.
>>
>
> Yes, the next thing is to see if on his systems, whether or not he's
> seeing external xattr blocks.
>
> - Ted
>

I seem to be seeing the same thing as Eric is seeing.

On one of my systems,

# find / -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
2725655
# df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_root
1974272 84737 1889535 5% /
# find /home -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
274173
# df -i /home
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_home
192384 7862 184522 5% /home

For both filesystems, the security xattr values average about 32.17 and 34.87 bytes respectively.

I also see a similar problem with filefrag.

# filefrag -xv /bin/sh
Filesystem type is: ef53
File size of /bin/sh is 938736 (230 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 23622459548 100 not_aligned,inline
/bin/sh: 1 extent found

# getfattr -m - -d /bin/sh
getfattr: Removing leading '/' from absolute path names
# file: bin/sh
security.selinux="system_u:object_r:shell_exec_t:s0"

debugfs: stat /bin/sh
Inode: 1441795 Type: symlink Mode: 0777 Flags: 0x0
Generation: 3470616846 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 4
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
atime: 0x52311211:006d1658 -- Wed Sep 11 19:00:01 2013
mtime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
crtime: 0x50c2779d:ad792a58 -- Fri Dec 7 16:11:25 2012
Size of extra inode fields: 28
Extended attributes stored in inode body:
selinux = "system_u:object_r:bin_t:s0\000" (27)
Fast_link_dest: bash

At this point, I'm not sure why we get into the mbcache path when SELinux is enabled. As mentioned in one of my earlier replies to Andreas, I did see actual calls into ext4_xattr_cache.

There seems to be one difference between 3.11 kernel and 2.6 kernel in set_inode_init_security(). There is an additional attempt to initialize evm xattr. But I do not seem to be seeing any evm xattr in any file.

I will continue to try to find out how we get into the mbcache path. Please let me know if anyone has any suggestion.

Thanks,
Mak.


2013-09-12 03:45:16

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] ext4: increase mbcache scalability

On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:

> I seem to be seeing the same thing as Eric is seeing.

...
> For both filesystems, the security xattr are about 32.17 and 34.87 bytes respectively.
...

Can you triple-check the inode size on your fs, for good measure?

dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

turns out it's not a problem, it's an undocumented & surprising "feature." :(

/* For inline data all offsets should be in bytes, not blocks */
if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
blk_shift = 0;

because ... ? (the commit which added it didn't mention anything about it).

But I guess it does mean that, for at least those files, the xattr data is actually inline.

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
>
> There seems to be one difference between 3.11 kernel and 2.6 kernel
> in set_inode_init_security(). There is an additional attempt to
> initialize evm xattr. But I do not seem to be seeing any evm xattr in
> any file.
>
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches.

But it seems like a mystery worth solving.

Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it
gets into ext4_xattr_block_set().

Or most simply, just printk the inode number in ext4_xattr_block_set() so
you can look at those inodes later via debugfs.

And in here,

} else {
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (!error && !bs.s.not_found) {
i.value = NULL;
error = ext4_xattr_block_set(handle, inode, &i, &bs);
} else if (error == -ENOSPC) {
if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
error = ext4_xattr_block_find(inode, &i, &bs);
if (error)
goto cleanup;
}
error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out in the ext4_xattr_ibody_set error case what the size
of the xattr is, and probably the inode number again for good measure, to get
an idea of what's causing it to fail to land in the inode?
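
Something along these lines (untested, just to illustrate -- the field names
come from struct ext4_xattr_info and the local variables in
ext4_xattr_set_handle()):

        /* in the -ENOSPC branch quoted above */
        printk(KERN_DEBUG "ext4 xattr: ino %lu: %s (%zu bytes) won't fit in the inode\n",
               inode->i_ino, i.name, i.value_len);

        /* and at the top of ext4_xattr_block_set(), to log every inode that
         * ends up with an external xattr block at all */
        printk(KERN_DEBUG "ext4 xattr: ino %lu entered ext4_xattr_block_set\n",
               inode->i_ino);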

-Eric

>
> Thanks,
> Mak.

2013-10-30 16:51:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables. Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.
>
> Signed-off-by: T. Makphaibulchoke <[email protected]>

I tried running xfstests with this patch, and it blew up on
generic/020 test:

generic/020 [10:21:50][ 105.170352] ------------[ cut here ]------------
[ 105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
[ 105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 105.173346] Modules linked in:
[ 105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
[ 105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
[ 105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
[ 105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
[ 105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
[ 105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
[ 105.173346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
[ 105.173346] Stack:
[ 105.173346] c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
[ 105.173346] f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
[ 105.173346] f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
[ 105.173346] Call Trace:
[ 105.173346] [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
[ 105.173346] [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
[ 105.173346] [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
[ 105.173346] [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
[ 105.173346] [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
[ 105.173346] [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
[ 105.173346] [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
[ 105.173346] [<c024a4da>] generic_setxattr+0x4c/0x5e
[ 105.173346] [<c024a48e>] ? generic_listxattr+0x95/0x95
[ 105.173346] [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
[ 105.173346] [<c024abd2>] vfs_setxattr+0x63/0x7e
[ 105.173346] [<c024ace8>] setxattr+0xfb/0x139
[ 105.173346] [<c01b200a>] ? __lock_acquire+0x540/0xca6
[ 105.173346] [<c01877a3>] ? lg_local_unlock+0x1b/0x34
[ 105.173346] [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
[ 105.173346] [<c0227e69>] ? kmem_cache_free+0xd4/0x149
[ 105.173346] [<c01b2c2b>] ? lock_acquire+0xdd/0x107
[ 105.173346] [<c023225e>] ? __sb_start_write+0xee/0x11d
[ 105.173346] [<c0247383>] ? mnt_want_write+0x1e/0x3e
[ 105.173346] [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
[ 105.173346] [<c0247353>] ? __mnt_want_write+0x4e/0x60
[ 105.173346] [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
[ 105.173346] [<c078d0e8>] syscall_call+0x7/0xb
[ 105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
[ 105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
[ 105.273781] ---[ end trace 1ee45ddfc1df0935 ]---

When I tried to find a potential problem, I immediately ran into this.
I'm not entirely sure it's the problem, but it's raised a number of
red flags for me in terms of (a) how much testing you've employed with
this patch set, and (b) how maintainable and easy-to-audit the code
will be with this extra locking. The comments are a good start, but
some additional comments about exactly what assumptions a function
makes about locks that are held on function entry, or especially if
the locking is different on function entry and function exit, might
make it easier for people to audit this patch.

Or maybe this commit needs to be split up, with first a conversion from
using list_head to hlist_bl_node, and then a change to the locking? The
bottom line is that we need to somehow make this patch easier to
validate/review.
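
For anyone auditing this, it may help to keep in mind what these primitives
expand to: hlist_bl_lock()/hlist_bl_unlock() are thin wrappers around a bit
spinlock kept in bit 0 of the head's first pointer, and with
CONFIG_DEBUG_SPINLOCK the unlock path BUG()s if that bit is not actually
set, which appears to be what the trace above tripped. Roughly (paraphrased
from include/linux/list_bl.h and include/linux/bit_spinlock.h, not copied
verbatim):

        static inline void hlist_bl_lock(struct hlist_bl_head *b)
        {
                bit_spin_lock(0, (unsigned long *)b);   /* lock bit in ->first */
        }

        static inline void hlist_bl_unlock(struct hlist_bl_head *b)
        {
                __bit_spin_unlock(0, (unsigned long *)b);
        }

        static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
        {
        #ifdef CONFIG_DEBUG_SPINLOCK
                BUG_ON(!test_bit(bitnum, addr)); /* unlocking a lock not held */
        #endif
                /* ... clear the bit and re-enable preemption ... */
        }

So any path that unlocks a chain it did not lock, or unlocks it twice, hits
exactly that assertion.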

> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
> ce->e_queued++;
> prepare_to_wait(&mb_cache_queue, &wait,
> TASK_UNINTERRUPTIBLE);
> - spin_unlock(&mb_cache_spinlock);
> + hlist_bl_unlock(head);
> schedule();
> - spin_lock(&mb_cache_spinlock);
> + hlist_bl_lock(head);
> + mb_assert(ce->e_index_hash_p == head);
> ce->e_queued--;
> }
> + hlist_bl_unlock(head);
> finish_wait(&mb_cache_queue, &wait);
>
> - if (!__mb_cache_entry_is_hashed(ce)) {
> + hlist_bl_lock(ce->e_block_hash_p);
> + if (!__mb_cache_entry_is_block_hashed(ce)) {
> __mb_cache_entry_release_unlock(ce);
> - spin_lock(&mb_cache_spinlock);
> + hlist_bl_lock(head);

<--- are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?

<---- Why do we call "hlist_bl_lock(head);" here but not in the else clause?

> return ERR_PTR(-EAGAIN);
> - }
> + } else
> + hlist_bl_unlock(ce->e_block_hash_p);
> + mb_assert(ce->e_index_hash_p == head);
> return ce;
> }
> l = l->next;


Cheers,

- Ted

2013-10-30 16:51:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] ext4: each filesystem creates and uses its own mb_cache

On Wed, Sep 04, 2013 at 10:39:16AM -0600, T Makphaibulchoke wrote:
> This patch adds new interfaces to create and destroy a cache,
> ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the
> cache creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr()
> in fs/ext4/xattr.c.
> fs/ext4/super.c has been changed so that when a filesystem is mounted a
> cache is allocated and attached to its ext4_sb_info structure.

One problem with this patch is that it creates separate slabs for each
ext4 file system where previously we had a single slab cache for
mbcache entries. I've tried adding the following patch on top of
yours so that we can retain the previous behavior of creating a single
kmem_cache for all ext4 file systems.

Please comment/review...

- Ted

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3b525b6..d529807 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -31,6 +31,7 @@
#include <linux/vfs.h>
#include <linux/random.h>
#include <linux/mount.h>
+#include <linux/mbcache.h>
#include <linux/namei.h>
#include <linux/quotaops.h>
#include <linux/seq_file.h>
@@ -5492,9 +5493,16 @@ static int __init ext4_init_fs(void)
init_waitqueue_head(&ext4__ioend_wq[i]);
}

+ ext4_xattr_kmem_cache = kmem_cache_create("ext4_xattr",
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (ext4_xattr_kmem_cache == NULL)
+ return -ENOMEM;
+
+
err = ext4_init_es();
if (err)
- return err;
+ goto out8;

err = ext4_init_pageio();
if (err)
@@ -5548,12 +5556,15 @@ out6:
ext4_exit_pageio();
out7:
ext4_exit_es();
+out8:
+ kmem_cache_destroy(ext4_xattr_kmem_cache);

return err;
}

static void __exit ext4_exit_fs(void)
{
+ kmem_cache_destroy(ext4_xattr_kmem_cache);
ext4_destroy_lazyinit_thread();
unregister_as_ext2();
unregister_as_ext3();
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 07ca399..3d2bd5a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,6 +81,8 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

+struct kmem_cache *ext4_xattr_kmem_cache;
+
static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
@@ -1687,7 +1689,8 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,
struct mb_cache *
ext4_xattr_create_cache(char *name)
{
- return mb_cache_create(name, HASH_BUCKET_BITS);
+ return mb_cache_create_instance(name, HASH_BUCKET_BITS,
+ ext4_xattr_kmem_cache);
}

void ext4_xattr_destroy_cache(struct mb_cache *cache)
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index b930320..15616eb 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -94,6 +94,8 @@ struct ext4_xattr_ibody_find {
struct ext4_iloc iloc;
};

+extern struct kmem_cache *ext4_xattr_kmem_cache;
+
extern const struct xattr_handler ext4_xattr_user_handler;
extern const struct xattr_handler ext4_xattr_trusted_handler;
extern const struct xattr_handler ext4_xattr_acl_access_handler;
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 44e7153..8e373c6 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -96,6 +96,7 @@ MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
MODULE_LICENSE("GPL");

EXPORT_SYMBOL(mb_cache_create);
+EXPORT_SYMBOL(mb_cache_create_instance);
EXPORT_SYMBOL(mb_cache_shrink);
EXPORT_SYMBOL(mb_cache_destroy);
EXPORT_SYMBOL(mb_cache_entry_alloc);
@@ -260,18 +261,20 @@ static struct shrinker mb_cache_shrinker = {
};

/*
- * mb_cache_create() create a new cache
+ * mb_cache_create_instance() create a new cache for a device
*
* All entries in one cache are equal size. Cache entries may be from
- * multiple devices. If this is the first mbcache created, registers
- * the cache with kernel memory management. Returns NULL if no more
+ * multiple devices. The caller must create a kmem_cache appropriate
+ * for allocating struct mb_cache_entry. Returns NULL if no more
* memory was available.
*
* @name: name of the cache (informal)
* @bucket_bits: log2(number of hash buckets)
+ * @mb_kmem_cache: slab cache for mb_cache_entry structures
*/
struct mb_cache *
-mb_cache_create(const char *name, int bucket_bits)
+mb_cache_create_instance(const char *name, int bucket_bits,
+ struct kmem_cache *mb_kmem_cache)
{
int n, bucket_count = 1 << bucket_bits;
struct mb_cache *cache = NULL;
@@ -294,11 +297,8 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ BUG_ON(mb_kmem_cache == NULL);
+ cache->c_entry_cache = mb_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -311,15 +311,38 @@ mb_cache_create(const char *name, int bucket_bits)
spin_unlock(&mb_cache_spinlock);
return cache;

-fail2:
- kfree(cache->c_index_hash);
-
fail:
kfree(cache->c_block_hash);
kfree(cache);
return NULL;
}

+/*
+ * mb_cache_create() create a new cache
+ *
+ * All entries in one cache are equal size. Cache entries may be from
+ * multiple devices. Returns NULL if no more memory was available.
+ *
+ * @name: name of the cache (informal)
+ * @bucket_bits: log2(number of hash buckets)
+ */
+struct mb_cache *
+mb_cache_create(const char *name, int bucket_bits)
+{
+ struct kmem_cache *mb_kmem_cache;
+ struct mb_cache *cache;
+
+ mb_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (mb_kmem_cache == NULL)
+ return NULL;
+ cache = mb_cache_create_instance(name, bucket_bits, mb_kmem_cache);
+ if (cache == NULL)
+ kmem_cache_destroy(mb_kmem_cache);
+ cache->c_entry_cache_autofree = 1;
+ return cache;
+}

/*
* mb_cache_shrink()
@@ -406,7 +429,8 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
+ if (cache->c_entry_cache_autofree)
+ kmem_cache_destroy(cache->c_entry_cache);

kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 89826c0..b326ce7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -29,11 +29,14 @@ struct mb_cache {
struct kmem_cache *c_entry_cache;
struct hlist_bl_head *c_block_hash;
struct hlist_bl_head *c_index_hash;
+ unsigned int c_entry_cache_autofree:1;
};

/* Functions on caches */

struct mb_cache *mb_cache_create(const char *, int);
+struct mb_cache *mb_cache_create_instance(const char *,
+ int, struct kmem_cache *);
void mb_cache_shrink(struct block_device *);
void mb_cache_destroy(struct mb_cache *);

2013-10-30 16:51:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On Wed, Sep 04, 2013 at 10:39:15AM -0600, T Makphaibulchoke wrote:
> The patch increases the parallelism of mb_cache_entry utilization by
> replacing list_head with hlist_bl_node for the implementation of both the
> block and index hash tables. Each hlist_bl_node contains a built-in lock
> used to protect mb_cache's local block and index hash chains. The global
> data mb_cache_lru_list and mb_cache_list continue to be protected by the
> global mb_cache_spinlock.

In the process of applying this patch to the ext4 tree, I had to
rework one of the patches to account for a change upstream to the
shrinker interface (which modified mb_cache_shrink_fn() to be
mb_cache_shrink_scan()).

Can you verify that the changes I made look sane?

Thanks,

- Ted

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 1f90cd0..44e7153 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -200,25 +200,38 @@ forget:
static unsigned long
mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
- LIST_HEAD(free_list);
- struct mb_cache_entry *entry, *tmp;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
unsigned long freed = 0;

mb_debug("trying to free %d entries", nr_to_scan);
- spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
- struct mb_cache_entry *ce =
- list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
- }
- spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
+ while (nr_to_scan > 0) {
+ struct mb_cache_entry *ce;
+
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&mb_cache_lru_list)) {
+ spin_unlock(&mb_cache_spinlock);
+ break;
+ }
+ ce = list_entry(mb_cache_lru_list.next,
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ if (!(ce->e_used || ce->e_queued)) {
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_forget(ce, gfp_mask);
+ --nr_to_scan;
+ freed++;
+ } else {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ }
}
return freed;
}

Subject: Re: [PATCH v3 1/2] mbcache: decoupling the locking of local from global data

On 10/30/2013 08:42 AM, Theodore Ts'o wrote:
> I tried running xfstests with this patch, and it blew up on
> generic/020 test:
>
> generic/020 [10:21:50][ 105.170352] ------------[ cut here ]------------
> [ 105.171683] kernel BUG at /usr/projects/linux/ext4/include/linux/bit_spinlock.h:76!
> [ 105.173346] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 105.173346] Modules linked in:
> [ 105.173346] CPU: 1 PID: 8519 Comm: attr Not tainted 3.12.0-rc5-00008-gffbe1d7-dirty #1492
> [ 105.173346] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 105.173346] task: f5abe560 ti: f2274000 task.ti: f2274000
> [ 105.173346] EIP: 0060:[<c026b464>] EFLAGS: 00010246 CPU: 1
> [ 105.173346] EIP is at hlist_bl_unlock+0x7/0x1c
> [ 105.173346] EAX: f488d360 EBX: f488d360 ECX: 00000000 EDX: f2998800
> [ 105.173346] ESI: f29987f0 EDI: 6954c848 EBP: f2275cc8 ESP: f2275cb8
> [ 105.173346] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [ 105.173346] CR0: 80050033 CR2: b76bcf54 CR3: 34844000 CR4: 000006f0
> [ 105.173346] Stack:
> [ 105.173346] c026bc78 f2275d48 6954c848 f29987f0 f2275d24 c02cd7a9 f2275ce4 c02e2881
> [ 105.173346] f255d8c8 00000000 f1109020 f4a67f00 f2275d54 f2275d08 c02cd020 6954c848
> [ 105.173346] f4a67f00 f1109000 f2b0eba8 f2ee3800 f2275d28 f4f811e8 f2275d38 00000000
> [ 105.173346] Call Trace:
> [ 105.173346] [<c026bc78>] ? mb_cache_entry_find_first+0x4b/0x55
> [ 105.173346] [<c02cd7a9>] ext4_xattr_block_set+0x248/0x6e7
> [ 105.173346] [<c02e2881>] ? jbd2_journal_put_journal_head+0xe2/0xed
> [ 105.173346] [<c02cd020>] ? ext4_xattr_find_entry+0x52/0xac
> [ 105.173346] [<c02ce307>] ext4_xattr_set_handle+0x1c7/0x30f
> [ 105.173346] [<c02ce4f4>] ext4_xattr_set+0xa5/0xe1
> [ 105.173346] [<c02ceb36>] ext4_xattr_user_set+0x46/0x5f
> [ 105.173346] [<c024a4da>] generic_setxattr+0x4c/0x5e
> [ 105.173346] [<c024a48e>] ? generic_listxattr+0x95/0x95
> [ 105.173346] [<c024ab0f>] __vfs_setxattr_noperm+0x56/0xb6
> [ 105.173346] [<c024abd2>] vfs_setxattr+0x63/0x7e
> [ 105.173346] [<c024ace8>] setxattr+0xfb/0x139
> [ 105.173346] [<c01b200a>] ? __lock_acquire+0x540/0xca6
> [ 105.173346] [<c01877a3>] ? lg_local_unlock+0x1b/0x34
> [ 105.173346] [<c01af8dd>] ? trace_hardirqs_off_caller+0x2e/0x98
> [ 105.173346] [<c0227e69>] ? kmem_cache_free+0xd4/0x149
> [ 105.173346] [<c01b2c2b>] ? lock_acquire+0xdd/0x107
> [ 105.173346] [<c023225e>] ? __sb_start_write+0xee/0x11d
> [ 105.173346] [<c0247383>] ? mnt_want_write+0x1e/0x3e
> [ 105.173346] [<c01b3019>] ? trace_hardirqs_on_caller+0x12a/0x17e
> [ 105.173346] [<c0247353>] ? __mnt_want_write+0x4e/0x60
> [ 105.173346] [<c024af3b>] SyS_lsetxattr+0x6a/0x9f
> [ 105.173346] [<c078d0e8>] syscall_call+0x7/0xb
> [ 105.173346] Code: 00 00 00 00 5b 5d c3 55 89 e5 53 3e 8d 74 26 00 8b 58 08 89 c2 8b 43 18 e8 3f c9 fb ff f0 ff 4b 0c 5b 5d c3 8b 10 80 e2 01 75 02 <0f> 0b 55 89 e5 0f ba 30 00 89 e0 25 00 e0 ff ff ff 48 14 5d c3
> [ 105.173346] EIP: [<c026b464>] hlist_bl_unlock+0x7/0x1c SS:ESP 0068:f2275cb8
> [ 105.273781] ---[ end trace 1ee45ddfc1df0935 ]---
>
> When I tried to find a potential problem, I immediately ran into this.
> I'm not entirely sure it's the problem, but it has raised a number of
> red flags for me in terms of (a) how much testing you've employed with
> this patch set, and (b) how maintainable and easy-to-audit the code
> will be with this extra locking. The comments are a good start, but
> some additional comments about exactly which locks a function assumes
> are held on function entry, and especially whether the locking differs
> between function entry and function exit, might make it easier for
> people to audit this patch.
>
> Or maybe this commit needs to be split up, with first a conversion from
> using list_head to hlist_bl_node, and then the change to the locking? The
> bottom line is that we need to somehow make this patch easier to
> validate/review.
>

Thanks for the comments. Yes, I did run it through xfstests. My guess is that you probably ran into a race condition that I did not.

I will try to port the patch to a more recent kernel, including the mb_cache_shrink_scan() change you sent earlier (BTW, it looks good), and debug the problem.

Yes, those are good suggestions. Once I find the problem, I will resubmit with more comments and also split it into two patches, as suggested.

>> @@ -520,18 +647,23 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
>> ce->e_queued++;
>> prepare_to_wait(&mb_cache_queue, &wait,
>> TASK_UNINTERRUPTIBLE);
>> - spin_unlock(&mb_cache_spinlock);
>> + hlist_bl_unlock(head);
>> schedule();
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>> + mb_assert(ce->e_index_hash_p == head);
>> ce->e_queued--;
>> }
>> + hlist_bl_unlock(head);
>> finish_wait(&mb_cache_queue, &wait);
>>
>> - if (!__mb_cache_entry_is_hashed(ce)) {
>> + hlist_bl_lock(ce->e_block_hash_p);
>> + if (!__mb_cache_entry_is_block_hashed(ce)) {
>> __mb_cache_entry_release_unlock(ce);
>> - spin_lock(&mb_cache_spinlock);
>> + hlist_bl_lock(head);
>
> <--- are we missing a "hlist_bl_unlock(ce->e_block_hash_p);" here?
>
> <---- Why is it that we call "hlist_bl_lock(head);" but not in the else clause?

The function __mb_cache_entry_release_unlock() releases both the e_block_hash and e_index_hash locks, so no extra hlist_bl_unlock() is missing there; we only have to reacquire the e_index_hash (head) lock in the then branch, and release the e_block_hash lock in the else branch.
>
>> return ERR_PTR(-EAGAIN);
>> - }
>> + } else
>> + hlist_bl_unlock(ce->e_block_hash_p);
>> + mb_assert(ce->e_index_hash_p == head);
>> return ce;
>> }
>> l = l->next;
>
>
> Cheers,
>
> - Ted
>

Thanks,
Mak.

2014-01-24 18:32:04

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 0/3] ext4: increase mbcache scalability

The patch consists of three parts.

The first part changes the implementation of both the block and index hash
chains of an mb_cache from list_head to hlist_bl_head and also introduces new
members, including a spinlock, to mb_cache_entry, as required by the second
part.

The second part introduces a higher degree of parallelism to the use of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for
any workload on any of the platforms tested.

With regular disk filesystems using an inode size of 128 bytes, forcing the
use of external xattrs, there are some good performance increases with some
of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 60 core machine:

---------------------------
| | % increase |
---------------------------
| alltests | 30.97 |
---------------------------
| custom | 41.18 |
---------------------------
| dbase | 32.06 |
---------------------------
| disk | 125.02 |
---------------------------
| fserver | 10.23 |
---------------------------
| new_dbase | 14.58 |
---------------------------
| new_fserve | 9.62 |
---------------------------
| shared | 52.56 |
---------------------------

On a 40 core machine:

---------------------------
| | % increase |
---------------------------
| custom | 63.45 |
---------------------------
| disk | 133.25 |
---------------------------
| fserver | 78.29 |
---------------------------
| new_fserver | 80.66 |
---------------------------
| shared | 56.34 |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v4:
- New performance data
- New diff summary
- New patch architecture

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary
T Makphaibulchoke (3):
fs/mbcache.c change block and index hash chain to hlist_bl_node
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++-
fs/ext4/xattr.c | 51 ++---
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 491 ++++++++++++++++++++++++++++++++++--------------
include/linux/mbcache.h | 11 +-
6 files changed, 402 insertions(+), 182 deletions(-)

--
1.7.11.3

2014-01-24 18:32:19

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node

This patch changes both the block and index hash chains of each mb_cache from
regular linked lists to hlist_bl_node, which contains a built-in lock. This is
the first step in decoupling the locks serializing accesses to mb_cache global
data from those protecting each mb_cache_entry's local data.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/mbcache.c | 117 ++++++++++++++++++++++++++++++++----------------
include/linux/mbcache.h | 11 +++--
2 files changed, 85 insertions(+), 43 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>
-
+#include <linux/init.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}

+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
+}

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
}
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

+ mb_assert(ce);
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_block_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
error = 0;
out:
spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -506,12 +535,14 @@ cleanup:
#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
+ mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
}
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
spin_lock(&mb_cache_spinlock);
return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
spin_unlock(&mb_cache_spinlock);
return ce;
}
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
spin_lock(&mb_cache_spinlock);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
__mb_cache_entry_release_unlock(prev);
return ce;
}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..e76cc0e 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -11,11 +11,14 @@ struct mb_cache_entry {
unsigned short e_queued;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
+ spinlock_t e_entry_lock;
};

struct mb_cache {
@@ -25,8 +28,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2014-01-24 18:32:25

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 2/3] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new spinlock is also added to the mb_cache_entry to protect its local data,
e_used and e_queued.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/mbcache.c | 363 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 260 insertions(+), 103 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..0c4cec2 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,40 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry contains a spinlock, e_entry_lock, to serialize
+ * accesses to its local data, such as e_used and e_queued.
+ *
+ * Lock ordering:
+ *
+ * mb_cache_spinlock has the highest lock order, followed by each block and
+ * index hash chain's lock, with e_entry_lock the lowest. While holding
+ * mb_cache_spinlock, a thread can acquire either a block and/or index hash
+ * chain lock, and in turn can also acquire each entry spinlock, e_entry_lock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chains, they need to lock the corresponding hash chain. For each
+ * mb_cache_entry within the chain, they need to lock the mb_cache_entry to
+ * prevent any simultaneous release or free of the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, i.e. both e_used
+ * and e_queued are 0. When an mb_cache_entry is explicitly freed it is
+ * first removed from its block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -113,11 +147,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
hlist_bl_del_init(&ce->e_index.o_list);
}

+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash the entry from both the block and
+ * index hash chains.
+ * It assumes both the block and index hash chains are locked upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
{
__mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
__mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
}

static void
@@ -130,31 +174,42 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
atomic_dec(&cache->c_entry_count);
}

-
static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
{
+ /* First lock the entry to serialize access to its local data. */
+ spin_lock(&ce->e_entry_lock);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
if (ce->e_used >= MB_CACHE_WRITER)
ce->e_used -= MB_CACHE_WRITER;
+ /*
+ * Make sure that all cache entries on lru_list have
+ * both e_used and e_queued equal to 0.
+ */
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_block_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
+ spin_unlock(&ce->e_entry_lock);
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ }
+ /*
+ * Need access to lru list, first drop entry lock,
+ * then reacquire the lock in the proper order.
+ */
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- spin_unlock(&mb_cache_spinlock);
+ spin_unlock(&ce->e_entry_lock);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

-
/*
* mb_cache_shrink_scan() memory pressure callback
*
@@ -177,17 +232,32 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

mb_debug("trying to free %d entries", nr_to_scan);
spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+ while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
struct mb_cache_entry *ce =
list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /* Prevent any find or get operation on the entry */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ spin_lock(&mb_cache_spinlock);
}
spin_unlock(&mb_cache_spinlock);
+
list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
__mb_cache_entry_forget(entry, gfp_mask);
+ freed++;
}
return freed;
}
@@ -290,21 +360,41 @@ void
mb_cache_shrink(struct block_device *bdev)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct list_head *l;
+ struct mb_cache_entry *ce, *tmp;

+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
}
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -320,23 +410,26 @@ void
mb_cache_destroy(struct mb_cache *cache)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct mb_cache_entry *ce, *tmp;

spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
- if (ce->e_cache == cache) {
+ list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+ if (ce->e_cache == cache)
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- }
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ list_del_init(&ce->e_lru_list);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ mb_assert(!(ce->e_used || ce->e_queued));
+ __mb_cache_entry_unhash_unlock(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -363,31 +456,56 @@ mb_cache_destroy(struct mb_cache *cache)
struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
- struct mb_cache_entry *ce = NULL;
+ struct mb_cache_entry *ce;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+
+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the
+ * entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(list_empty(&ce->e_lru_list));
+ mb_assert(!(ce->e_used || ce->e_queued));
+ __mb_cache_entry_unhash_unlock(ce);
+ goto found;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
- if (!ce) {
- ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
- if (!ce)
- return NULL;
- atomic_inc(&cache->c_entry_count);
- INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_HLIST_BL_NODE(&ce->e_block_list);
- INIT_HLIST_BL_NODE(&ce->e_index.o_list);
- ce->e_cache = cache;
- ce->e_queued = 0;
- }
+
+ ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+ if (!ce)
+ return NULL;
+ atomic_inc(&cache->c_entry_count);
+ INIT_LIST_HEAD(&ce->e_lru_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+ ce->e_cache = cache;
+ ce->e_queued = 0;
+found:
ce->e_block_hash_p = &cache->c_block_hash[0];
ce->e_index_hash_p = &cache->c_index_hash[0];
+ spin_lock_init(&ce->e_entry_lock);
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -423,25 +541,28 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
mb_assert(!__mb_cache_entry_is_block_hashed(ce));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_block(ce);
+ __mb_cache_entry_unhash_index(ce);
ce->e_bdev = bdev;
ce->e_block = block;
ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
ce->e_index_hash_p = index_hash_p;
hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ hlist_bl_unlock(index_hash_p);
error = 0;
out:
- spin_unlock(&mb_cache_spinlock);
+ hlist_bl_unlock(block_hash_p);
return error;
}

@@ -456,24 +577,26 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
}


/*
* mb_cache_entry_free()
*
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
*/
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_lock(ce->e_block_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_release(ce);
}


@@ -493,43 +616,58 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_node *l;
struct mb_cache_entry *ce;
struct hlist_bl_head *block_hash_p;
+ int held_lock = 1;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ /* First serialize access to the block corresponding hash chain. */
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
+ /*
+ * Prevent a release on the entry.
+ */
+ spin_lock(&ce->e_entry_lock);
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
- while (ce->e_used > 0) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
+ if (ce->e_used > 0) {
+ DEFINE_WAIT(wait);
+ while (ce->e_used > 0) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&ce->e_entry_lock);
+ if (held_lock) {
+ hlist_bl_unlock(block_hash_p);
+ held_lock = 0;
+ }
+ schedule();
+ spin_lock(&ce->e_entry_lock);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
ce->e_used += 1 + MB_CACHE_WRITER;
+ spin_unlock(&ce->e_entry_lock);

if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ if (held_lock)
+ hlist_bl_unlock(block_hash_p);
+ __mb_cache_entry_release(ce);
return NULL;
}
- goto cleanup;
+ if (held_lock)
+ hlist_bl_unlock(block_hash_p);
+ return ce;
}
}
- ce = NULL;
-
-cleanup:
- spin_unlock(&mb_cache_spinlock);
- return ce;
+ hlist_bl_unlock(block_hash_p);
+ return NULL;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +676,59 @@ static struct mb_cache_entry *
__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
+ int held_lock = 1;
+
+ /* The index hash chain lock is already held by the caller. */
while (l != NULL) {
struct mb_cache_entry *ce =
hlist_bl_entry(l, struct mb_cache_entry,
e_index.o_list);
mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
+ /*
+ * Prevent a release on the entry.
+ */
+ spin_lock(&ce->e_entry_lock);
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
+ spin_unlock(&mb_cache_spinlock);
+ }
+ ce->e_used++;
/* Incrementing before holding the lock gives readers
priority over writers. */
- ce->e_used++;
- while (ce->e_used >= MB_CACHE_WRITER) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
+ if (ce->e_used >= MB_CACHE_WRITER) {
+ DEFINE_WAIT(wait);
+
+ while (ce->e_used >= MB_CACHE_WRITER) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock(&ce->e_entry_lock);
+ if (held_lock) {
+ hlist_bl_unlock(head);
+ held_lock = 0;
+ }
+ schedule();
+ spin_lock(&ce->e_entry_lock);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
}
- finish_wait(&mb_cache_queue, &wait);
-
+ spin_unlock(&ce->e_entry_lock);
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ if (held_lock)
+ hlist_bl_unlock(head);
+ __mb_cache_entry_release(ce);
return ERR_PTR(-EAGAIN);
}
+ if (held_lock)
+ hlist_bl_unlock(head);
return ce;
}
l = l->next;
}
+ hlist_bl_unlock(head);
return NULL;
}

@@ -598,12 +755,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_head *index_hash_p;

index_hash_p = &cache->c_index_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
if (!hlist_bl_empty(index_hash_p)) {
l = hlist_bl_first(index_hash_p);
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- }
- spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -638,11 +795,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,

index_hash_p = &cache->c_index_hash[bucket];
mb_assert(prev->e_index_hash_p == index_hash_p);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ __mb_cache_entry_release(prev);
return ce;
}

--
1.7.11.3

2014-01-24 18:32:49

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH v4 3/3] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy caches,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the cache
creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr() in
fs/ext4/xattr.c. It also increases the size of the mbcache's hash tables for
ext4.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only a single kmem_cache is allocated
and shared by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
fs/mbcache.c | 19 +++++++++++++------
5 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;

/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5521,10 +5531,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 0c4cec2..d620b7f 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -92,6 +92,7 @@
#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
+static struct kmem_cache *mb_cache_kmem_cache;

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -320,11 +321,14 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ if (!mb_cache_kmem_cache) {
+ mb_cache_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (!mb_cache_kmem_cache)
+ goto fail2;
+ }
+ cache->c_entry_cache = mb_cache_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -438,7 +442,10 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
+ if (list_empty(&mb_cache_list)) {
+ kmem_cache_destroy(mb_cache_kmem_cache);
+ mb_cache_kmem_cache = NULL;
+ }

kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
--
1.7.11.3

2014-01-24 21:39:02

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

T Makphaibulchoke <[email protected]> writes:

> The patch consists of three parts.
>
> The first part changes the implementation of both the block and hash chains of
> an mb_cache from list_head to hlist_bl_head and also introduces new members,
> including a spinlock to mb_cache_entry, as required by the second part.

spinlock per entry is usually overkill for larger hash tables.

Can you use a second smaller lock table that just has locks and is
indexed by a subset of the hash key. Most likely a very small
table is good enough.
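
Something like the following is what I have in mind -- a minimal sketch
only, with made-up names and sizes, not code from any posted patch:

#define MB_ENTRY_LOCK_BITS	7	/* 128 locks, independent of hash table size */
#define MB_ENTRY_LOCK_COUNT	(1U << MB_ENTRY_LOCK_BITS)

static spinlock_t mb_entry_locks[MB_ENTRY_LOCK_COUNT];

/* Pick the lock for an entry by reusing the low bits of its hash bucket. */
static inline spinlock_t *mb_entry_lock(unsigned int bucket)
{
	return &mb_entry_locks[bucket & (MB_ENTRY_LOCK_COUNT - 1)];
}

static void mb_entry_locks_init(void)
{
	int i;

	for (i = 0; i < MB_ENTRY_LOCK_COUNT; i++)
		spin_lock_init(&mb_entry_locks[i]);
}

Entries hashing to the same slot simply share a lock, so the spinlock
(and its memory) disappears from struct mb_cache_entry; you only pay for
contention when two hot entries happen to collide in the small lock table.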

Also it would be good to have some data on the additional memory consumption.

-Andi

--
[email protected] -- Speaking for myself only

2014-01-25 06:09:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.

While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.

Cheers, Andreas

> On Jan 24, 2014, at 14:38, Andi Kleen <[email protected]> wrote:
>
> T Makphaibulchoke <[email protected]> writes:
>
>> The patch consists of three parts.
>>
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
>
> spinlock per entry is usually overkill for larger hash tables.
>
> Can you use a second smaller lock table that just has locks and is
> indexed by a subset of the hash key. Most likely a very small
> table is good enough.
>
> Also it would be good to have some data on the additional memory consumption.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 02:38 PM, Andi Kleen wrote:
> T Makphaibulchoke <[email protected]> writes:
>
>> The patch consists of three parts.
>>
>> The first part changes the implementation of both the block and hash chains of
>> an mb_cache from list_head to hlist_bl_head and also introduces new members,
>> including a spinlock to mb_cache_entry, as required by the second part.
>
> spinlock per entry is usually overkill for larger hash tables.
>
> Can you use a second smaller lock table that just has locks and is
> indexed by a subset of the hash key. Most likely a very small
> table is good enough.
>
> Also it would be good to have some data on the additional memory consumption.
>
> -Andi
>

Thanks Andi for the comments. Will look into that.

Thanks,
Mak.

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>
> Cheers, Andreas
>


Thanks Andreas for the suggestion. Will try that versus adding just a new private spinlock array in mbcache and compare the performance.

Thanks,
Mak.

2014-01-28 12:26:27

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

> The third part of the patch further increases the scalability of an ext4
> filesystem by having each ext4 filesystem allocate and use its own private
> mbcache structure, instead of sharing a single mbcache structure across all
> ext4 filesystems, and increases the size of its mbcache hash tables.

Are you sure this helps? The idea behind having one large mbcache is
that one large hash table will always be at least as well balanced as
multiple separate tables, if the total size is the same.

If you have two size 2^n hash tables, the chance of collision is equal to
one size 2^(n+1) table if they're equally busy, and if they're unequally
busy, the latter is better. The busier file system will take less time
per search, and since it's searched more often than the less-busy one,
net win.

How does it compare with just increasing the hash table size but leaving
them combined?
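
[To make the expected-cost argument above concrete (a sketch, assuming
filesystem i contributes m_i entries and is searched in proportion to
m_i), the search-frequency-weighted expected chain length satisfies

\[
\frac{m_1\frac{m_1}{2^n} + m_2\frac{m_2}{2^n}}{m_1+m_2}
  = \frac{m_1^2+m_2^2}{2^n(m_1+m_2)}
  \;\ge\; \frac{(m_1+m_2)^2/2}{2^n(m_1+m_2)}
  = \frac{m_1+m_2}{2^{n+1}},
\]

where the left side is two tables of size 2^n, the right side one table
of size 2^(n+1), and equality holds only when m_1 = m_2 (using
m_1^2 + m_2^2 >= (m_1+m_2)^2 / 2).]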

2014-01-28 21:10:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>> The third part of the patch further increases the scalability of an ext4
>> filesystem by having each ext4 filesystem allocate and use its own private
>> mbcache structure, instead of sharing a single mbcache structure across all
>> ext4 filesystems, and increases the size of its mbcache hash tables.
>
> Are you sure this helps? The idea behind having one large mbcache is
> that one large hash table will always be at least as well balanced as
> multiple separate tables, if the total size is the same.
>
> If you have two size 2^n hash tables, the chance of collision is equal to
> one size 2^(n+1) table if they're equally busy, and if they're unequally
> busy, the latter is better. The busier file system will take less time
> per search, and since it's searched more often than the less-busy one,
> net win.
>
> How does it compare with just increasing the hash table size but leaving
> them combined?

Except that having one mbcache per block device would avoid the need
to store the e_bdev pointer in thousands/millions of entries. Since
the blocks are never shared between different block devices, there
is no caching benefit even if the same block is on two block devices.
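
[A rough sketch of that saving: e_bdev is a pointer, i.e. 8 bytes per
entry on a 64-bit kernel, so at the million-entry scale mentioned above
it is on the order of 10^6 * 8 B, roughly 8 MB.]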

Cheers, Andreas






Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/28/2014 02:09 PM, Andreas Dilger wrote:
> On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>>> The third part of the patch further increases the scalability of an ext4
>>> filesystem by having each ext4 filesystem allocate and use its own private
>>> mbcache structure, instead of sharing a single mbcache structure across all
>>> ext4 filesystems, and increases the size of its mbcache hash tables.
>>
>> Are you sure this helps? The idea behind having one large mbcache is
>> that one large hash table will always be at least as well balanced as
>> multiple separate tables, if the total size is the same.
>>
>> If you have two size 2^n hash tables, the chance of collision is equal to
>> one size 2^(n+1) table if they're equally busy, and if they're unequally
>> busy, the latter is better. The busier file system will take less time
>> per search, and since it's searched more often than the less-busy one,
>> net win.
>>
>> How does it compare with just increasing the hash table size but leaving
>> them combined?
>
> Except that having one mbcache per block device would avoid the need
> to store the e_bdev pointer in thousands/millions of entries. Since
> the blocks are never shared between different block devices, there
> is no caching benefit even if the same block is on two block devices.
>
> Cheers, Andreas
>
>
>
>
>

Thanks George and Andreas for the comments. Andreas, you make a good point: the e_bdev pointer is not needed when there is one mb_cache per block device. I'll integrate that into my patch, removing the e_bdev pointer, and run some comparisons between one large hash table and multiple hash tables, as suggested by George.

Thanks,
Mak.

Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/28/2014 02:09 PM, Andreas Dilger wrote:
> On Jan 28, 2014, at 5:26 AM, George Spelvin <[email protected]> wrote:
>>> The third part of the patch further increases the scalability of an ext4
>>> filesystem by having each ext4 filesystem allocate and use its own private
>>> mbcache structure, instead of sharing a single mbcache structure across all
>>> ext4 filesystems, and increases the size of its mbcache hash tables.
>>
>> Are you sure this helps? The idea behind having one large mbcache is
>> that one large hash table will always be at least as well balanced as
>> multiple separate tables, if the total size is the same.
>>
>> If you have two size 2^n hash tables, the chance of collision is equal to
>> one size 2^(n+1) table if they're equally busy, and if they're unequally
>> busy, the latter is better. The busier file system will take less time
>> per search, and since it's searched more often than the less-busy one,
>> net win.
>>
>> How does it compare with just increasing the hash table size but leaving
>> them combined?
>
> Except that having one mbcache per block device would avoid the need
> to store the e_bdev pointer in thousands/millions of entries. Since
> the blocks are never shared between different block devices, there
> is no caching benefit even if the same block is on two block devices.
>
> Cheers, Andreas
>

On all 3 systems that I ran aim7 on, with 80, 60 and 20 cores, spreading test files across 4 ext4 filesystems, there seems to be no difference in performance between a single large hash table and a smaller one per filesystem.

Having said that, I still believe that having a separate hash table for each filesystem should scale better, as the size of a single larger hash table would be rather arbitrary. As Andreas mentioned above, with an mbcache per filesystem we would be able to remove the e_bdev member from the mb_cache_entry. It would also work well, and result in less mb_cache_entry lock contention, if we were to use the blockgroup locks, which are also per-filesystem, to implement the mb_cache_entry lock as Andreas suggested.

Please let me know if you have any further comment or concerns.

Thanks,
Mak.


Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability

On 01/24/2014 11:09 PM, Andreas Dilger wrote:
> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>
> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>
> Cheers, Andreas
>

Andreas, it looks like your assumption is correct. On all 3 systems, with 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock. I'm inclined to go with the block group lock as it does not require any extra space.

One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock. In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock. This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem. It would also require the same change, allocating an mb_cache for each mounted filesystem, in both ext2 and ext3, which would increase the scope of the patch. The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.

I'm working on resubmitting my patches using the block group locks and extending the changes to also cover both ext2 and ext3. With this approach, not only is no additional space for a dedicated new spinlock array required, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.

Please let me know if you have any concern or suggestion.

Thanks,
Mak.

2014-02-13 02:01:35

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] ext4: increase mbcache scalability


On Feb 11, 2014, at 12:58 PM, Thavatchai Makphaibulchoke <[email protected]> wrote:
> On 01/24/2014 11:09 PM, Andreas Dilger wrote:
>> I think the ext4 block groups are locked with the blockgroup_lock that has about the same number of locks as the number of cores, with a max of 128, IIRC. See blockgroup_lock.h.
>>
>> While there is some chance of contention, it is also unlikely that all of the cores are locking this area at the same time.
>>
>> Cheers, Andreas
>>
>
> Andreas, it looks like your assumption is correct. On all 3 systems, with 80, 60 and 20 cores, I got almost identical aim7 results using either a smaller dedicated lock array or the block group lock. I'm inclined to go with the block group lock as it does not require any extra space.
>
> One problem is that, with the current implementation, mbcache has no knowledge of the filesystem's super block, including its block group lock. In my implementation I have to change the first argument of mb_cache_create() from char * to struct super_block * to be able to access the super block's block group lock.

Note that you don't have to use the ext4_sb_info->s_blockgroup_lock.
You can allocate and use a separate struct blockgroup_lock for mbcache
instead of allocating a spinlock array (and essentially reimplementing
the bgl_lock_*() code). While it isn't a huge amount of duplication,
that code is already tuned for different SMP core configurations and
there is no reason NOT to use struct blockgroup_lock.
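
[A minimal sketch of that suggestion, using the existing
<linux/blockgroup_lock.h> helpers; the init/lookup wrapper names below
are illustrative, though this is essentially what the V5 patch later in
this thread ends up doing.]

#include <linux/spinlock.h>
#include <linux/blockgroup_lock.h>
#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/slab.h>

static struct blockgroup_lock *mb_cache_bg_lock;

static int mb_cache_bg_lock_init(void)
{
	mb_cache_bg_lock = kmalloc(sizeof(*mb_cache_bg_lock), GFP_KERNEL);
	if (!mb_cache_bg_lock)
		return -ENOMEM;
	/* bgl_lock_init() sets up NR_BG_LOCKS spinlocks, sized for the SMP config. */
	bgl_lock_init(mb_cache_bg_lock);
	return 0;
}

/* Hash the entry address down to one of the NR_BG_LOCKS spinlocks. */
static inline spinlock_t *mb_cache_entry_lock_ptr(void *entry)
{
	return bgl_lock_ptr(mb_cache_bg_lock,
			    hash_long((unsigned long)entry, ilog2(NR_BG_LOCKS)));
}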

> This works with my proposed change to allocate an mb_cache for each mounted ext4 filesystem. It would also require the same change, allocating an mb_cache for each mounted filesystem, in both ext2 and ext3, which would increase the scope of the patch. The other alternative, allocating a new smaller spinlock array, would not require any change to either ext2 or ext3.
>
> I'm working on resubmitting my patches using the block group locks and extending the changes to also cover both ext2 and ext3. With this approach, not only is no additional space for a dedicated new spinlock array required, but the e_bdev member of struct mb_cache_entry could also be removed, reducing the space required for each mb_cache_entry.
>
> Please let me know if you have any concern or suggestion.

I'm not against re-using the s_blockgroup_lock in the superblock, since
the chance of contention on this lock between threads is very small, as
there are normally 4x the number of spinlocks as cores.

You might consider starting with a dedicated struct blockgroup_lock in
the mbcache code, then move to use the in-superblock struct in a later
patch. That would allow you to push and land the mbcache and ext4 patches
independently of the ext2 and ext3 patches (if they are big). If the
ext2 and ext3 patches are relatively small then this extra complexity
in the patches may not be needed.

Cheers, Andreas







2014-02-20 18:38:10

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 0/3] ext4: increase mbcache scalability

The patch consists of three parts.

The first part changes the implementation of both the block and hash chains of
an mb_cache from list_head to hlist_bl_head and also introduces new members,
including a spinlock to mb_cache_entry, as required by the second part.

The second part introduces higher degree of parallelism to the usages of the
mb_cache and mb_cache_entries and impacts all ext filesystems.

The third part of the patch further increases the scalability of an ext4
filesystem by having each ext4 filesystem allocate and use its own private
mbcache structure, instead of sharing a single mbcache structure across all
ext4 filesystems, and increases the size of its mbcache hash tables.

Here are some of the benchmark results with the changes.

Using a ram disk, there seem to be no performance differences with aim7 for all
workloads on all platforms tested.

With regular disk filesystems with an inode size of 128 bytes, forcing the use
of external xattrs, there seem to be some good performance increases with
some of the aim7 workloads on all platforms tested.

Here are some of the performance improvements on aim7 with 2000 users.

On a 20 core machine, there are no performance differences.

On a 60 core machine:

---------------------------
| | % increase |
---------------------------
| alltests | 74.69 |
---------------------------
| custom | 77.1 |
---------------------------
| disk | 125.02 |
---------------------------
| fserver | 113.22 |
---------------------------
| new_dbase | 21.17 |
---------------------------
| new_fserver | 70.31 |
---------------------------
| shared | 52.56 |
---------------------------

On an 80 core machine:

---------------------------
| | % increase |
---------------------------
| custom | 74.29 |
---------------------------
| disk | 61.01 |
---------------------------
| fserver | 11.59 |
---------------------------
| new_fserver | 32.76 |
---------------------------

The changes have been tested with ext4 xfstests to verify that no regression
has been introduced.

Changed in v5:
- New performance data
- New diff summary

Changed in v4:
- New performance data
- New diff summary
- New patch architecture

Changed in v3:
- New diff summary

Changed in v2:
- New performance data
- New diff summary

T Makphaibulchoke (3):
fs/mbcache.c change block and index hash chain to hlist_bl_node
mbcache: decoupling the locking of local from global data
ext4: each filesystem creates and uses its own mb_cache

fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++-
fs/ext4/xattr.c | 51 ++---
fs/ext4/xattr.h | 6 +-
fs/mbcache.c | 540 ++++++++++++++++++++++++++++++++++--------------
include/linux/mbcache.h | 12 +-
6 files changed, 441 insertions(+), 193 deletions(-)

--
1.7.11.3

2014-02-20 18:38:23

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 3/3] ext4: each filesystem creates and uses its own mb_cache

This patch adds new interfaces to create and destroy a cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes the cache
creation and destroy calls from ext4_init_xattr() and ext4_exit_xattr() in
fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted a
cache is allocated and attached to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab allocator is allocated
and used by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 24 ++++++++++++++++--------
fs/ext4/xattr.c | 51 ++++++++++++++++++++++++++++-----------------------
fs/ext4/xattr.h | 6 +++---
fs/mbcache.c | 18 +++++++++++++-----
5 files changed, 61 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..fa20aab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1314,6 +1314,7 @@ struct ext4_sb_info {
struct list_head s_es_lru;
unsigned long s_es_last_sorted;
struct percpu_counter s_extent_cache_cnt;
+ struct mb_cache *s_mb_cache;
spinlock_t s_es_lru_lock ____cacheline_aligned_in_smp;

/* Ratelimit ext4 messages. */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..176048e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -59,6 +59,7 @@ static struct kset *ext4_kset;
static struct ext4_lazy_init *ext4_li_info;
static struct mutex ext4_li_mtx;
static struct ext4_features *ext4_feat;
+static int ext4_mballoc_ready;

static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
unsigned long journal_devnum);
@@ -845,6 +846,10 @@ static void ext4_put_super(struct super_block *sb)
invalidate_bdev(sbi->journal_bdev);
ext4_blkdev_remove(sbi);
}
+ if (sbi->s_mb_cache) {
+ ext4_xattr_destroy_cache(sbi->s_mb_cache);
+ sbi->s_mb_cache = NULL;
+ }
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
sb->s_fs_info = NULL;
@@ -3981,6 +3986,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);

sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ if (ext4_mballoc_ready) {
+ sbi->s_mb_cache = ext4_xattr_create_cache(sb->s_id);
+ if (!sbi->s_mb_cache) {
+ ext4_msg(sb, KERN_ERR, "Failed to create an mb_cache");
+ goto failed_mount_wq;
+ }
+ }

/*
* The journal may have updated the bg summary counts, so we
@@ -5501,11 +5513,9 @@ static int __init ext4_init_fs(void)

err = ext4_init_mballoc();
if (err)
- goto out3;
-
- err = ext4_init_xattr();
- if (err)
goto out2;
+ else
+ ext4_mballoc_ready = 1;
err = init_inodecache();
if (err)
goto out1;
@@ -5521,10 +5531,9 @@ out:
unregister_as_ext3();
destroy_inodecache();
out1:
- ext4_exit_xattr();
-out2:
+ ext4_mballoc_ready = 0;
ext4_exit_mballoc();
-out3:
+out2:
ext4_exit_feat_adverts();
out4:
if (ext4_proc_root)
@@ -5547,7 +5556,6 @@ static void __exit ext4_exit_fs(void)
unregister_as_ext3();
unregister_filesystem(&ext4_fs_type);
destroy_inodecache();
- ext4_exit_xattr();
ext4_exit_mballoc();
ext4_exit_feat_adverts();
remove_proc_entry("fs/ext4", NULL);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 1423c48..6e2788f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -81,7 +81,7 @@
# define ea_bdebug(bh, fmt, ...) no_printk(fmt, ##__VA_ARGS__)
#endif

-static void ext4_xattr_cache_insert(struct buffer_head *);
+static void ext4_xattr_cache_insert(struct mb_cache *, struct buffer_head *);
static struct buffer_head *ext4_xattr_cache_find(struct inode *,
struct ext4_xattr_header *,
struct mb_cache_entry **);
@@ -90,8 +90,6 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *,
static int ext4_xattr_list(struct dentry *dentry, char *buffer,
size_t buffer_size);

-static struct mb_cache *ext4_xattr_cache;
-
static const struct xattr_handler *ext4_xattr_handler_map[] = {
[EXT4_XATTR_INDEX_USER] = &ext4_xattr_user_handler,
#ifdef CONFIG_EXT4_FS_POSIX_ACL
@@ -117,6 +115,9 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+#define EXT4_GET_MB_CACHE(inode) (((struct ext4_sb_info *) \
+ inode->i_sb->s_fs_info)->s_mb_cache)
+
static __le32 ext4_xattr_block_csum(struct inode *inode,
sector_t block_nr,
struct ext4_xattr_header *hdr)
@@ -265,6 +266,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
struct ext4_xattr_entry *entry;
size_t size;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
name_index, name, buffer, (long)buffer_size);
@@ -286,7 +288,7 @@ bad_block:
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
entry = BFIRST(bh);
error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
if (error == -EIO)
@@ -409,6 +411,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
struct inode *inode = dentry->d_inode;
struct buffer_head *bh = NULL;
int error;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

ea_idebug(inode, "buffer=%p, buffer_size=%ld",
buffer, (long)buffer_size);
@@ -430,7 +433,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
error = -EIO;
goto cleanup;
}
- ext4_xattr_cache_insert(bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, bh);
error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);

cleanup:
@@ -526,8 +529,9 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
{
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

- ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
+ ce = mb_cache_entry_get(ext4_mb_cache, bh->b_bdev, bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bh);
if (error)
goto out;
@@ -745,13 +749,14 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_search *s = &bs->s;
struct mb_cache_entry *ce = NULL;
int error = 0;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

#define header(x) ((struct ext4_xattr_header *)(x))

if (i->value && i->value_len > sb->s_blocksize)
return -ENOSPC;
if (s->base) {
- ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
+ ce = mb_cache_entry_get(ext4_mb_cache, bs->bh->b_bdev,
bs->bh->b_blocknr);
error = ext4_journal_get_write_access(handle, bs->bh);
if (error)
@@ -769,7 +774,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (!IS_LAST_ENTRY(s->first))
ext4_xattr_rehash(header(s->base),
s->here);
- ext4_xattr_cache_insert(bs->bh);
+ ext4_xattr_cache_insert(ext4_mb_cache,
+ bs->bh);
}
unlock_buffer(bs->bh);
if (error == -EIO)
@@ -905,7 +911,7 @@ getblk_failed:
memcpy(new_bh->b_data, s->base, new_bh->b_size);
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
- ext4_xattr_cache_insert(new_bh);
+ ext4_xattr_cache_insert(ext4_mb_cache, new_bh);
error = ext4_handle_dirty_xattr_block(handle,
inode, new_bh);
if (error)
@@ -1495,13 +1501,13 @@ ext4_xattr_put_super(struct super_block *sb)
* Returns 0, or a negative error number on failure.
*/
static void
-ext4_xattr_cache_insert(struct buffer_head *bh)
+ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
{
__u32 hash = le32_to_cpu(BHDR(bh)->h_hash);
struct mb_cache_entry *ce;
int error;

- ce = mb_cache_entry_alloc(ext4_xattr_cache, GFP_NOFS);
+ ce = mb_cache_entry_alloc(ext4_mb_cache, GFP_NOFS);
if (!ce) {
ea_bdebug(bh, "out of memory");
return;
@@ -1573,12 +1579,13 @@ ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
{
__u32 hash = le32_to_cpu(header->h_hash);
struct mb_cache_entry *ce;
+ struct mb_cache *ext4_mb_cache = EXT4_GET_MB_CACHE(inode);

if (!header->h_hash)
return NULL; /* never share */
ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
again:
- ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
+ ce = mb_cache_entry_find_first(ext4_mb_cache, inode->i_sb->s_bdev,
hash);
while (ce) {
struct buffer_head *bh;
@@ -1676,19 +1683,17 @@ static void ext4_xattr_rehash(struct ext4_xattr_header *header,

#undef BLOCK_HASH_SHIFT

-int __init
-ext4_init_xattr(void)
+#define HASH_BUCKET_BITS 10
+
+struct mb_cache *
+ext4_xattr_create_cache(char *name)
{
- ext4_xattr_cache = mb_cache_create("ext4_xattr", 6);
- if (!ext4_xattr_cache)
- return -ENOMEM;
- return 0;
+ return mb_cache_create(name, HASH_BUCKET_BITS);
}

-void
-ext4_exit_xattr(void)
+void ext4_xattr_destroy_cache(struct mb_cache *cache)
{
- if (ext4_xattr_cache)
- mb_cache_destroy(ext4_xattr_cache);
- ext4_xattr_cache = NULL;
+ if (cache)
+ mb_cache_destroy(cache);
}
+
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index c767dbd..b930320 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -112,9 +112,6 @@ extern void ext4_xattr_put_super(struct super_block *);
extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
struct ext4_inode *raw_inode, handle_t *handle);

-extern int __init ext4_init_xattr(void);
-extern void ext4_exit_xattr(void);
-
extern const struct xattr_handler *ext4_xattr_handlers[];

extern int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
@@ -126,6 +123,9 @@ extern int ext4_xattr_ibody_inline_set(handle_t *handle, struct inode *inode,
struct ext4_xattr_info *i,
struct ext4_xattr_ibody_find *is);

+extern struct mb_cache *ext4_xattr_create_cache(char *name);
+extern void ext4_xattr_destroy_cache(struct mb_cache *);
+
#ifdef CONFIG_EXT4_FS_SECURITY
extern int ext4_init_security(handle_t *handle, struct inode *inode,
struct inode *dir, const struct qstr *qstr);
diff --git a/fs/mbcache.c b/fs/mbcache.c
index 786ecab..bf166e3 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -99,6 +99,7 @@

static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
static struct blockgroup_lock *mb_cache_bg_lock;
+static struct kmem_cache *mb_cache_kmem_cache;

MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
@@ -351,11 +352,14 @@ mb_cache_create(const char *name, int bucket_bits)
goto fail;
for (n=0; n<bucket_count; n++)
INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
- cache->c_entry_cache = kmem_cache_create(name,
- sizeof(struct mb_cache_entry), 0,
- SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
- if (!cache->c_entry_cache)
- goto fail2;
+ if (!mb_cache_kmem_cache) {
+ mb_cache_kmem_cache = kmem_cache_create(name,
+ sizeof(struct mb_cache_entry), 0,
+ SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+ if (!mb_cache_kmem_cache)
+ goto fail2;
+ }
+ cache->c_entry_cache = mb_cache_kmem_cache;

/*
* Set an upper limit on the number of cache entries so that the hash
@@ -476,6 +480,10 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

+ if (list_empty(&mb_cache_list)) {
+ kmem_cache_destroy(mb_cache_kmem_cache);
+ mb_cache_kmem_cache = NULL;
+ }
kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
kfree(cache);
--
1.7.11.3

2014-02-20 18:38:35

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 1/3] fs/mbcache.c change block and index hash chain to hlist_bl_node

This patch changes both the block and index hash chains of each mb_cache from
a regular linked list to hlist_bl_node, which contains a built-in lock. This is
the first step in decoupling the locks serializing accesses to mb_cache global
data from those protecting each mb_cache_entry's local data.

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v5:
- Added e_refcnt member to mb_cache_entry structure to be used in
the decoupling in patch 2/3.
- Removed e_entry_lock from mb_cache_entry structure.

fs/mbcache.c | 117 ++++++++++++++++++++++++++++++++----------------
include/linux/mbcache.h | 12 ++---
2 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index e519e45..55db0da 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -34,9 +34,9 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/list_bl.h>
#include <linux/mbcache.h>
-
+#include <linux/init.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -87,21 +87,38 @@ static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

static inline int
-__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
+__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
- return !list_empty(&ce->e_block_list);
+ return !hlist_bl_unhashed(&ce->e_block_list);
}


-static void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+static inline void
+__mb_cache_entry_unhash_block(struct mb_cache_entry *ce)
{
- if (__mb_cache_entry_is_hashed(ce)) {
- list_del_init(&ce->e_block_list);
- list_del(&ce->e_index.o_list);
- }
+ if (__mb_cache_entry_is_block_hashed(ce))
+ hlist_bl_del_init(&ce->e_block_list);
+}
+
+static inline int
+__mb_cache_entry_is_index_hashed(struct mb_cache_entry *ce)
+{
+ return !hlist_bl_unhashed(&ce->e_index.o_list);
}

+static inline void
+__mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
+{
+ if (__mb_cache_entry_is_index_hashed(ce))
+ hlist_bl_del_init(&ce->e_index.o_list);
+}
+
+static inline void
+__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+{
+ __mb_cache_entry_unhash_index(ce);
+ __mb_cache_entry_unhash_block(ce);
+}

static void
__mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
@@ -125,7 +142,7 @@ __mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
ce->e_used -= MB_CACHE_WRITER;
ce->e_used--;
if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_hashed(ce))
+ if (!__mb_cache_entry_is_block_hashed(ce))
goto forget;
mb_assert(list_empty(&ce->e_lru_list));
list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
@@ -221,18 +238,18 @@ mb_cache_create(const char *name, int bucket_bits)
cache->c_name = name;
atomic_set(&cache->c_entry_count, 0);
cache->c_bucket_bits = bucket_bits;
- cache->c_block_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ cache->c_block_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_block_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_block_hash[n]);
- cache->c_index_hash = kmalloc(bucket_count * sizeof(struct list_head),
- GFP_KERNEL);
+ INIT_HLIST_BL_HEAD(&cache->c_block_hash[n]);
+ cache->c_index_hash = kmalloc(bucket_count *
+ sizeof(struct hlist_bl_head), GFP_KERNEL);
if (!cache->c_index_hash)
goto fail;
for (n=0; n<bucket_count; n++)
- INIT_LIST_HEAD(&cache->c_index_hash[n]);
+ INIT_HLIST_BL_HEAD(&cache->c_index_hash[n]);
cache->c_entry_cache = kmem_cache_create(name,
sizeof(struct mb_cache_entry), 0,
SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
@@ -364,10 +381,13 @@ mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
return NULL;
atomic_inc(&cache->c_entry_count);
INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_LIST_HEAD(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
ce->e_cache = cache;
ce->e_queued = 0;
}
+ ce->e_block_hash_p = &cache->c_block_hash[0];
+ ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
return ce;
}
@@ -393,25 +413,32 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
{
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
int error = -EBUSY;
+ struct hlist_bl_head *block_hash_p;
+ struct hlist_bl_head *index_hash_p;
+ struct mb_cache_entry *lce;

+ mb_assert(ce);
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each_prev(l, &cache->c_block_hash[bucket]) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_block_list);
- if (ce->e_bdev == bdev && ce->e_block == block)
+ hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
+ if (lce->e_bdev == bdev && lce->e_block == block)
goto out;
}
+ mb_assert(!__mb_cache_entry_is_block_hashed(ce));
__mb_cache_entry_unhash(ce);
ce->e_bdev = bdev;
ce->e_block = block;
- list_add(&ce->e_block_list, &cache->c_block_hash[bucket]);
+ ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
bucket = hash_long(key, cache->c_bucket_bits);
- list_add(&ce->e_index.o_list, &cache->c_index_hash[bucket]);
+ index_hash_p = &cache->c_index_hash[bucket];
+ ce->e_index_hash_p = index_hash_p;
+ hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
error = 0;
out:
spin_unlock(&mb_cache_spinlock);
@@ -463,14 +490,16 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
sector_t block)
{
unsigned int bucket;
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *block_hash_p;

bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
+ block_hash_p = &cache->c_block_hash[bucket];
spin_lock(&mb_cache_spinlock);
- list_for_each(l, &cache->c_block_hash[bucket]) {
- ce = list_entry(l, struct mb_cache_entry, e_block_list);
+ hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
+ mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
DEFINE_WAIT(wait);

@@ -489,7 +518,7 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
finish_wait(&mb_cache_queue, &wait);
ce->e_used += 1 + MB_CACHE_WRITER;

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
return NULL;
}
@@ -506,12 +535,14 @@ cleanup:
#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)

static struct mb_cache_entry *
-__mb_cache_entry_find(struct list_head *l, struct list_head *head,
+__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
- while (l != head) {
+ while (l != NULL) {
struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_index.o_list);
+ hlist_bl_entry(l, struct mb_cache_entry,
+ e_index.o_list);
+ mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
DEFINE_WAIT(wait);

@@ -532,7 +563,7 @@ __mb_cache_entry_find(struct list_head *l, struct list_head *head,
}
finish_wait(&mb_cache_queue, &wait);

- if (!__mb_cache_entry_is_hashed(ce)) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
__mb_cache_entry_release_unlock(ce);
spin_lock(&mb_cache_spinlock);
return ERR_PTR(-EAGAIN);
@@ -562,12 +593,16 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
unsigned int key)
{
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
- struct mb_cache_entry *ce;
+ struct hlist_bl_node *l;
+ struct mb_cache_entry *ce = NULL;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
spin_lock(&mb_cache_spinlock);
- l = cache->c_index_hash[bucket].next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ if (!hlist_bl_empty(index_hash_p)) {
+ l = hlist_bl_first(index_hash_p);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
+ }
spin_unlock(&mb_cache_spinlock);
return ce;
}
@@ -597,12 +632,16 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,
{
struct mb_cache *cache = prev->e_cache;
unsigned int bucket = hash_long(key, cache->c_bucket_bits);
- struct list_head *l;
+ struct hlist_bl_node *l;
struct mb_cache_entry *ce;
+ struct hlist_bl_head *index_hash_p;

+ index_hash_p = &cache->c_index_hash[bucket];
+ mb_assert(prev->e_index_hash_p == index_hash_p);
spin_lock(&mb_cache_spinlock);
+ mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
- ce = __mb_cache_entry_find(l, &cache->c_index_hash[bucket], bdev, key);
+ ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
__mb_cache_entry_release_unlock(prev);
return ce;
}
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 5525d37..6a392e7 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -3,19 +3,21 @@

(C) 2001 by Andreas Gruenbacher, <[email protected]>
*/
-
struct mb_cache_entry {
struct list_head e_lru_list;
struct mb_cache *e_cache;
unsigned short e_used;
unsigned short e_queued;
+ atomic_t e_refcnt;
struct block_device *e_bdev;
sector_t e_block;
- struct list_head e_block_list;
+ struct hlist_bl_node e_block_list;
struct {
- struct list_head o_list;
+ struct hlist_bl_node o_list;
unsigned int o_key;
} e_index;
+ struct hlist_bl_head *e_block_hash_p;
+ struct hlist_bl_head *e_index_hash_p;
};

struct mb_cache {
@@ -25,8 +27,8 @@ struct mb_cache {
int c_max_entries;
int c_bucket_bits;
struct kmem_cache *c_entry_cache;
- struct list_head *c_block_hash;
- struct list_head *c_index_hash;
+ struct hlist_bl_head *c_block_hash;
+ struct hlist_bl_head *c_index_hash;
};

/* Functions on caches */
--
1.7.11.3

2014-02-20 18:38:47

by T Makphaibulchoke

[permalink] [raw]
Subject: [PATCH V5 2/3] mbcache: decoupling the locking of local from global data

The patch increases the parallelism of mb_cache_entry utilization by
replacing list_head with hlist_bl_node for the implementation of both the
block and index hash tables. Each hlist_bl_node contains a built-in lock
used to protect mb_cache's local block and index hash chains. The global
data mb_cache_lru_list and mb_cache_list continue to be protected by the
global mb_cache_spinlock.

A new block group spinlock, mb_cache_bg_lock, is also added to serialize accesses
to each mb_cache_entry's local data.

A new member, e_refcnt, is added to the mb_cache_entry structure to help
prevent an mb_cache_entry from being deallocated by a free while it is
being referenced by either mb_cache_entry_get() or mb_cache_entry_find().

Signed-off-by: T. Makphaibulchoke <[email protected]>
---
Changed in v5:
- Added mb_cache_bg_lock, which is used to serialize accesses to an
mb_cache_entry's local data.
- Used e_refcnt to help prevent an mb_cache_entry from being
deallocated by a free inside mb_cache_entry_get() and
mb_cache_entry_find(), and at the same time reduce the hash chain's
lock hold time.

fs/mbcache.c | 417 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 301 insertions(+), 116 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 55db0da..786ecab 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -26,6 +26,41 @@
* back on the lru list.
*/

+/*
+ * Lock descriptions and usage:
+ *
+ * Each hash chain of both the block and index hash tables now contains
+ * a built-in lock used to serialize accesses to the hash chain.
+ *
+ * Accesses to global data structures mb_cache_list and mb_cache_lru_list
+ * are serialized via the global spinlock mb_cache_spinlock.
+ *
+ * Each mb_cache_entry's local data, such as e_used and e_queued, is
+ * serialized by a per-entry lock taken from mb_cache_bg_lock (see below).
+ *
+ * Lock ordering:
+ *
+ * Each block hash chain's lock has the highest lock order, followed by an
+ * index hash chain's lock, mb_cache_bg_lock (used to implement mb_cache_entry's
+ * lock), and mb_cache_spinlock, with the lowest order. While holding
+ * either a block or index hash chain lock, a thread can acquire an
+ * mb_cache_bg_lock, which in turn can also acquire mb_cache_spinlock.
+ *
+ * Synchronization:
+ *
+ * Since both mb_cache_entry_get and mb_cache_entry_find scan the block and
+ * index hash chain, they need to lock the corresponding hash chain. For each
+ * mb_cache_entry within the chain, it needs to lock the mb_cache_entry to
+ * prevent either any simultaneous release or free on the entry and also
+ * to serialize accesses to either the e_used or e_queued member of the entry.
+ *
+ * To avoid having a dangling reference to an already freed
+ * mb_cache_entry, an mb_cache_entry is only freed when it is not on a
+ * block hash chain and also no longer being referenced, both e_used,
+ * and e_queued are 0's. When an mb_cache_entry is explicitly freed it is
+ * first removed from a block hash chain.
+ */
+
#include <linux/kernel.h>
#include <linux/module.h>

@@ -37,6 +72,7 @@
#include <linux/list_bl.h>
#include <linux/mbcache.h>
#include <linux/init.h>
+#include <linux/blockgroup_lock.h>

#ifdef MB_CACHE_DEBUG
# define mb_debug(f...) do { \
@@ -57,8 +93,13 @@

#define MB_CACHE_WRITER ((unsigned short)~0U >> 1)

+#define MB_CACHE_ENTRY_LOCK_BITS __builtin_log2(NR_BG_LOCKS)
+#define MB_CACHE_ENTRY_LOCK_INDEX(ce) \
+ (hash_long((unsigned long)ce, MB_CACHE_ENTRY_LOCK_BITS))
+
static DECLARE_WAIT_QUEUE_HEAD(mb_cache_queue);
-
+static struct blockgroup_lock *mb_cache_bg_lock;
+
MODULE_AUTHOR("Andreas Gruenbacher <[email protected]>");
MODULE_DESCRIPTION("Meta block cache (for extended attributes)");
MODULE_LICENSE("GPL");
@@ -86,6 +127,20 @@ static LIST_HEAD(mb_cache_list);
static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

+static inline void
+__spin_lock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+ spin_lock(bgl_lock_ptr(mb_cache_bg_lock,
+ MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
+static inline void
+__spin_unlock_mb_cache_entry(struct mb_cache_entry *ce)
+{
+ spin_unlock(bgl_lock_ptr(mb_cache_bg_lock,
+ MB_CACHE_ENTRY_LOCK_INDEX(ce)));
+}
+
static inline int
__mb_cache_entry_is_block_hashed(struct mb_cache_entry *ce)
{
@@ -113,11 +168,21 @@ __mb_cache_entry_unhash_index(struct mb_cache_entry *ce)
hlist_bl_del_init(&ce->e_index.o_list);
}

+/*
+ * __mb_cache_entry_unhash_unlock()
+ *
+ * This function is called to unhash both the block and index hash
+ * chain.
+ * It assumes both the block and index hash chains are locked upon entry.
+ * It also unlocks both hash chains upon exit.
+ */
static inline void
-__mb_cache_entry_unhash(struct mb_cache_entry *ce)
+__mb_cache_entry_unhash_unlock(struct mb_cache_entry *ce)
{
__mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
__mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
}

static void
@@ -125,36 +190,47 @@ __mb_cache_entry_forget(struct mb_cache_entry *ce, gfp_t gfp_mask)
{
struct mb_cache *cache = ce->e_cache;

- mb_assert(!(ce->e_used || ce->e_queued));
+ mb_assert(!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt)));
kmem_cache_free(cache->c_entry_cache, ce);
atomic_dec(&cache->c_entry_count);
}

-
static void
-__mb_cache_entry_release_unlock(struct mb_cache_entry *ce)
- __releases(mb_cache_spinlock)
+__mb_cache_entry_release(struct mb_cache_entry *ce)
{
+ /* First lock the entry to serialize access to its local data. */
+ __spin_lock_mb_cache_entry(ce);
/* Wake up all processes queuing for this cache entry. */
if (ce->e_queued)
wake_up_all(&mb_cache_queue);
if (ce->e_used >= MB_CACHE_WRITER)
ce->e_used -= MB_CACHE_WRITER;
+ /*
+ * Make sure that all cache entries on lru_list have
+ * both e_used and e_queued equal to 0.
+ */
ce->e_used--;
- if (!(ce->e_used || ce->e_queued)) {
- if (!__mb_cache_entry_is_block_hashed(ce))
+ if (!(ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))) {
+ if (!__mb_cache_entry_is_block_hashed(ce)) {
+ __spin_unlock_mb_cache_entry(ce);
goto forget;
- mb_assert(list_empty(&ce->e_lru_list));
- list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ }
+ /*
+ * Need access to lru list, first drop entry lock,
+ * then reacquire the lock in the proper order.
+ */
+ spin_lock(&mb_cache_spinlock);
+ if (list_empty(&ce->e_lru_list))
+ list_add_tail(&ce->e_lru_list, &mb_cache_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- spin_unlock(&mb_cache_spinlock);
+ __spin_unlock_mb_cache_entry(ce);
return;
forget:
- spin_unlock(&mb_cache_spinlock);
+ mb_assert(list_empty(&ce->e_lru_list));
__mb_cache_entry_forget(ce, GFP_KERNEL);
}

-
/*
* mb_cache_shrink_scan() memory pressure callback
*
@@ -177,17 +253,34 @@ mb_cache_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)

mb_debug("trying to free %d entries", nr_to_scan);
spin_lock(&mb_cache_spinlock);
- while (nr_to_scan-- && !list_empty(&mb_cache_lru_list)) {
+ while ((nr_to_scan-- > 0) && !list_empty(&mb_cache_lru_list)) {
struct mb_cache_entry *ce =
list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- freed++;
+ struct mb_cache_entry, e_lru_list);
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /* Prevent any find or get operation on the entry */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued || atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ list_add_tail(&ce->e_lru_list, &free_list);
+ spin_lock(&mb_cache_spinlock);
}
spin_unlock(&mb_cache_spinlock);
+
list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
__mb_cache_entry_forget(entry, gfp_mask);
+ freed++;
}
return freed;
}
@@ -232,6 +325,14 @@ mb_cache_create(const char *name, int bucket_bits)
int n, bucket_count = 1 << bucket_bits;
struct mb_cache *cache = NULL;

+ if (!mb_cache_bg_lock) {
+ mb_cache_bg_lock = kmalloc(sizeof(struct blockgroup_lock),
+ GFP_KERNEL);
+ if (!mb_cache_bg_lock)
+ return NULL;
+ bgl_lock_init(mb_cache_bg_lock);
+ }
+
cache = kmalloc(sizeof(struct mb_cache), GFP_KERNEL);
if (!cache)
return NULL;
@@ -290,21 +391,47 @@ void
mb_cache_shrink(struct block_device *bdev)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct list_head *l;
+ struct mb_cache_entry *ce, *tmp;

+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
if (ce->e_bdev == bdev) {
- list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ __mb_cache_entry_unhash_unlock(ce);
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ list_add_tail(&ce->e_lru_list, &free_list);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
}
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}
}

@@ -320,23 +447,27 @@ void
mb_cache_destroy(struct mb_cache *cache)
{
LIST_HEAD(free_list);
- struct list_head *l, *ltmp;
+ struct mb_cache_entry *ce, *tmp;

spin_lock(&mb_cache_spinlock);
- list_for_each_safe(l, ltmp, &mb_cache_lru_list) {
- struct mb_cache_entry *ce =
- list_entry(l, struct mb_cache_entry, e_lru_list);
- if (ce->e_cache == cache) {
+ list_for_each_entry_safe(ce, tmp, &mb_cache_lru_list, e_lru_list) {
+ if (ce->e_cache == cache)
list_move_tail(&ce->e_lru_list, &free_list);
- __mb_cache_entry_unhash(ce);
- }
}
list_del(&cache->c_cache_list);
spin_unlock(&mb_cache_spinlock);

- list_for_each_safe(l, ltmp, &free_list) {
- __mb_cache_entry_forget(list_entry(l, struct mb_cache_entry,
- e_lru_list), GFP_KERNEL);
+ list_for_each_entry_safe(ce, tmp, &free_list, e_lru_list) {
+ list_del_init(&ce->e_lru_list);
+ /*
+ * Prevent any find or get operation on the entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ __mb_cache_entry_unhash_unlock(ce);
+ __mb_cache_entry_forget(ce, GFP_KERNEL);
}

if (atomic_read(&cache->c_entry_count) > 0) {
@@ -345,8 +476,6 @@ mb_cache_destroy(struct mb_cache *cache)
atomic_read(&cache->c_entry_count));
}

- kmem_cache_destroy(cache->c_entry_cache);
-
kfree(cache->c_index_hash);
kfree(cache->c_block_hash);
kfree(cache);
@@ -363,29 +492,59 @@ mb_cache_destroy(struct mb_cache *cache)
struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
- struct mb_cache_entry *ce = NULL;
+ struct mb_cache_entry *ce;

if (atomic_read(&cache->c_entry_count) >= cache->c_max_entries) {
+ struct list_head *l;
+
+ l = &mb_cache_lru_list;
spin_lock(&mb_cache_spinlock);
- if (!list_empty(&mb_cache_lru_list)) {
- ce = list_entry(mb_cache_lru_list.next,
- struct mb_cache_entry, e_lru_list);
- list_del_init(&ce->e_lru_list);
- __mb_cache_entry_unhash(ce);
+ while (!list_is_last(l, &mb_cache_lru_list)) {
+ l = l->next;
+ ce = list_entry(l, struct mb_cache_entry, e_lru_list);
+ if (ce->e_cache == cache) {
+ list_del_init(&ce->e_lru_list);
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt))
+ continue;
+ spin_unlock(&mb_cache_spinlock);
+ /*
+ * Prevent any find or get operation on the
+ * entry.
+ */
+ hlist_bl_lock(ce->e_block_hash_p);
+ hlist_bl_lock(ce->e_index_hash_p);
+ /* Ignore if it is touched by a find/get */
+ if (ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt) ||
+ !list_empty(&ce->e_lru_list)) {
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ l = &mb_cache_lru_list;
+ spin_lock(&mb_cache_spinlock);
+ continue;
+ }
+ mb_assert(list_empty(&ce->e_lru_list));
+ mb_assert(!(ce->e_used || ce->e_queued ||
+ atomic_read(&ce->e_refcnt)));
+ __mb_cache_entry_unhash_unlock(ce);
+ goto found;
+ }
}
spin_unlock(&mb_cache_spinlock);
}
- if (!ce) {
- ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
- if (!ce)
- return NULL;
- atomic_inc(&cache->c_entry_count);
- INIT_LIST_HEAD(&ce->e_lru_list);
- INIT_HLIST_BL_NODE(&ce->e_block_list);
- INIT_HLIST_BL_NODE(&ce->e_index.o_list);
- ce->e_cache = cache;
- ce->e_queued = 0;
- }
+
+ ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
+ if (!ce)
+ return NULL;
+ atomic_inc(&cache->c_entry_count);
+ INIT_LIST_HEAD(&ce->e_lru_list);
+ INIT_HLIST_BL_NODE(&ce->e_block_list);
+ INIT_HLIST_BL_NODE(&ce->e_index.o_list);
+ ce->e_cache = cache;
+ ce->e_queued = 0;
+ atomic_set(&ce->e_refcnt, 0);
+found:
ce->e_block_hash_p = &cache->c_block_hash[0];
ce->e_index_hash_p = &cache->c_index_hash[0];
ce->e_used = 1 + MB_CACHE_WRITER;
@@ -414,7 +573,6 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
struct mb_cache *cache = ce->e_cache;
unsigned int bucket;
struct hlist_bl_node *l;
- int error = -EBUSY;
struct hlist_bl_head *block_hash_p;
struct hlist_bl_head *index_hash_p;
struct mb_cache_entry *lce;
@@ -423,26 +581,29 @@ mb_cache_entry_insert(struct mb_cache_entry *ce, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(lce, l, block_hash_p, e_block_list) {
- if (lce->e_bdev == bdev && lce->e_block == block)
- goto out;
+ if (lce->e_bdev == bdev && lce->e_block == block) {
+ hlist_bl_unlock(block_hash_p);
+ return -EBUSY;
+ }
}
mb_assert(!__mb_cache_entry_is_block_hashed(ce));
- __mb_cache_entry_unhash(ce);
+ __mb_cache_entry_unhash_block(ce);
+ __mb_cache_entry_unhash_index(ce);
ce->e_bdev = bdev;
ce->e_block = block;
ce->e_block_hash_p = block_hash_p;
ce->e_index.o_key = key;
+ hlist_bl_add_head(&ce->e_block_list, block_hash_p);
+ hlist_bl_unlock(block_hash_p);
bucket = hash_long(key, cache->c_bucket_bits);
index_hash_p = &cache->c_index_hash[bucket];
+ hlist_bl_lock(index_hash_p);
ce->e_index_hash_p = index_hash_p;
hlist_bl_add_head(&ce->e_index.o_list, index_hash_p);
- hlist_bl_add_head(&ce->e_block_list, block_hash_p);
- error = 0;
-out:
- spin_unlock(&mb_cache_spinlock);
- return error;
+ hlist_bl_unlock(index_hash_p);
+ return 0;
}


@@ -456,24 +617,26 @@ out:
void
mb_cache_entry_release(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
}


/*
* mb_cache_entry_free()
*
- * This is equivalent to the sequence mb_cache_entry_takeout() --
- * mb_cache_entry_release().
*/
void
mb_cache_entry_free(struct mb_cache_entry *ce)
{
- spin_lock(&mb_cache_spinlock);
+ mb_assert(ce);
mb_assert(list_empty(&ce->e_lru_list));
- __mb_cache_entry_unhash(ce);
- __mb_cache_entry_release_unlock(ce);
+ hlist_bl_lock(ce->e_index_hash_p);
+ __mb_cache_entry_unhash_index(ce);
+ hlist_bl_unlock(ce->e_index_hash_p);
+ hlist_bl_lock(ce->e_block_hash_p);
+ __mb_cache_entry_unhash_block(ce);
+ hlist_bl_unlock(ce->e_block_hash_p);
+ __mb_cache_entry_release(ce);
}


@@ -497,39 +660,48 @@ mb_cache_entry_get(struct mb_cache *cache, struct block_device *bdev,
bucket = hash_long((unsigned long)bdev + (block & 0xffffffff),
cache->c_bucket_bits);
block_hash_p = &cache->c_block_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ /* First serialize access to the block corresponding hash chain. */
+ hlist_bl_lock(block_hash_p);
hlist_bl_for_each_entry(ce, l, block_hash_p, e_block_list) {
mb_assert(ce->e_block_hash_p == block_hash_p);
if (ce->e_bdev == bdev && ce->e_block == block) {
- DEFINE_WAIT(wait);
+ /*
+ * Prevent a free from removing the entry.
+ */
+ atomic_inc(&ce->e_refcnt);
+ hlist_bl_unlock(block_hash_p);
+ __spin_lock_mb_cache_entry(ce);
+ atomic_dec(&ce->e_refcnt);
+ if (ce->e_used > 0) {
+ DEFINE_WAIT(wait);
+ while (ce->e_used > 0) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ __spin_unlock_mb_cache_entry(ce);
+ schedule();
+ __spin_lock_mb_cache_entry(ce);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
+ ce->e_used += 1 + MB_CACHE_WRITER;
+ __spin_unlock_mb_cache_entry(ce);

- if (!list_empty(&ce->e_lru_list))
+ if (!list_empty(&ce->e_lru_list)) {
+ spin_lock(&mb_cache_spinlock);
list_del_init(&ce->e_lru_list);
-
- while (ce->e_used > 0) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
spin_unlock(&mb_cache_spinlock);
- schedule();
- spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
}
- finish_wait(&mb_cache_queue, &wait);
- ce->e_used += 1 + MB_CACHE_WRITER;
-
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
+ __mb_cache_entry_release(ce);
return NULL;
}
- goto cleanup;
+ return ce;
}
}
- ce = NULL;
-
-cleanup:
- spin_unlock(&mb_cache_spinlock);
- return ce;
+ hlist_bl_unlock(block_hash_p);
+ return NULL;
}

#if !defined(MB_CACHE_INDEXES_COUNT) || (MB_CACHE_INDEXES_COUNT > 0)
@@ -538,40 +710,53 @@ static struct mb_cache_entry *
__mb_cache_entry_find(struct hlist_bl_node *l, struct hlist_bl_head *head,
struct block_device *bdev, unsigned int key)
{
+
+ /* The index hash chain is already acquired by the caller. */
while (l != NULL) {
struct mb_cache_entry *ce =
hlist_bl_entry(l, struct mb_cache_entry,
e_index.o_list);
mb_assert(ce->e_index_hash_p == head);
if (ce->e_bdev == bdev && ce->e_index.o_key == key) {
- DEFINE_WAIT(wait);
-
- if (!list_empty(&ce->e_lru_list))
- list_del_init(&ce->e_lru_list);
-
+ /*
+ * Prevent a free from removing the entry.
+ */
+ atomic_inc(&ce->e_refcnt);
+ hlist_bl_unlock(head);
+ __spin_lock_mb_cache_entry(ce);
+ atomic_dec(&ce->e_refcnt);
+ ce->e_used++;
/* Incrementing before holding the lock gives readers
priority over writers. */
- ce->e_used++;
- while (ce->e_used >= MB_CACHE_WRITER) {
- ce->e_queued++;
- prepare_to_wait(&mb_cache_queue, &wait,
- TASK_UNINTERRUPTIBLE);
- spin_unlock(&mb_cache_spinlock);
- schedule();
+ if (ce->e_used >= MB_CACHE_WRITER) {
+ DEFINE_WAIT(wait);
+
+ while (ce->e_used >= MB_CACHE_WRITER) {
+ ce->e_queued++;
+ prepare_to_wait(&mb_cache_queue, &wait,
+ TASK_UNINTERRUPTIBLE);
+ __spin_unlock_mb_cache_entry(ce);
+ schedule();
+ __spin_lock_mb_cache_entry(ce);
+ ce->e_queued--;
+ }
+ finish_wait(&mb_cache_queue, &wait);
+ }
+ __spin_unlock_mb_cache_entry(ce);
+ if (!list_empty(&ce->e_lru_list)) {
spin_lock(&mb_cache_spinlock);
- ce->e_queued--;
+ list_del_init(&ce->e_lru_list);
+ spin_unlock(&mb_cache_spinlock);
}
- finish_wait(&mb_cache_queue, &wait);
-
if (!__mb_cache_entry_is_block_hashed(ce)) {
- __mb_cache_entry_release_unlock(ce);
- spin_lock(&mb_cache_spinlock);
+ __mb_cache_entry_release(ce);
return ERR_PTR(-EAGAIN);
}
return ce;
}
l = l->next;
}
+ hlist_bl_unlock(head);
return NULL;
}

@@ -598,12 +783,12 @@ mb_cache_entry_find_first(struct mb_cache *cache, struct block_device *bdev,
struct hlist_bl_head *index_hash_p;

index_hash_p = &cache->c_index_hash[bucket];
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
if (!hlist_bl_empty(index_hash_p)) {
l = hlist_bl_first(index_hash_p);
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- }
- spin_unlock(&mb_cache_spinlock);
+ } else
+ hlist_bl_unlock(index_hash_p);
return ce;
}

@@ -638,11 +823,11 @@ mb_cache_entry_find_next(struct mb_cache_entry *prev,

index_hash_p = &cache->c_index_hash[bucket];
mb_assert(prev->e_index_hash_p == index_hash_p);
- spin_lock(&mb_cache_spinlock);
+ hlist_bl_lock(index_hash_p);
mb_assert(!hlist_bl_empty(index_hash_p));
l = prev->e_index.o_list.next;
ce = __mb_cache_entry_find(l, index_hash_p, bdev, key);
- __mb_cache_entry_release_unlock(prev);
+ __mb_cache_entry_release(prev);
return ce;
}

--
1.7.11.3