2008-07-18 16:11:32

by Eric Sandeen

Subject: delalloc is crippling fs_mark performance

running fs_mark like this:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

(256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)

on a 1T fs, with and without delalloc (mount option), is pretty interesting:

http://people.redhat.com/esandeen/ext4/fs_mark.png

somehow delalloc is crushing performance here. I'm planning to wait
'til the fs is full and see what the effect is on fsck, and look at the
directory layout for differences compared to w/o delalloc.

But something seems to have gone awry here ...

This is on 2.6.26 with the patch queue applied up to stable.

-Eric


2008-07-18 23:00:14

by Eric Sandeen

Subject: Re: delalloc is crippling fs_mark performance

Eric Sandeen wrote:
> running fs_mark like this:
>
> fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
>
> (256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)
>
> on a 1T fs, with and without delalloc (mount option), is pretty interesting:
>
> http://people.redhat.com/esandeen/ext4/fs_mark.png
>
> somehow delalloc is crushing performance here. I'm planning to wait
> 'til the fs is full and see what the effect is on fsck, and look at the
> directory layout for differences compared to w/o delalloc.
>
> But something seems to have gone awry here ...
>
> This is on 2.6.26 with the patch queue applied up to stable.
>
> -Eric

I oprofiled both with and without delalloc for the first 15% of the fs fill:

==> delalloc.op <==
CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples   %        image name   app name   symbol name
56094537  73.6320  ext4dev.ko   ext4dev    ext4_mb_use_preallocated
  642479   0.8433  vmlinux      vmlinux    __copy_user_nocache
  523803   0.6876  vmlinux      vmlinux    memcmp
  482874   0.6338  jbd2.ko      jbd2       do_get_write_access
  480687   0.6310  vmlinux      vmlinux    kmem_cache_free
  403604   0.5298  ext4dev.ko   ext4dev    str2hashbuf
  400471   0.5257  vmlinux      vmlinux    __find_get_block

==> nodelalloc.op <==
CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples   %        image name   app name   symbol name
56167198  56.8949  ext4dev.ko   ext4dev    ext4_mb_use_preallocated
 1524662   1.5444  jbd2.ko      jbd2       do_get_write_access
 1234776   1.2508  vmlinux      vmlinux    __copy_user_nocache
 1115267   1.1297  jbd2.ko      jbd2       jbd2_journal_add_journal_head
 1053102   1.0667  vmlinux      vmlinux    __find_get_block
  963646   0.9761  vmlinux      vmlinux    kmem_cache_free
  958804   0.9712  vmlinux      vmlinux    memcmp

not sure if this points to anything or not - but
ext4_mb_use_preallocated is working awfully hard in both cases :)

-Eric

2008-07-19 15:44:36

by Eric Sandeen

Subject: Re: delalloc is crippling fs_mark performance

Eric Sandeen wrote:
> Eric Sandeen wrote:
>> running fs_mark like this:
>>
>> fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
>>
>> (256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)
>>
>> on a 1T fs, with and without delalloc (mount option), is pretty interesting:
>>
>> http://people.redhat.com/esandeen/ext4/fs_mark.png
>>
>> somehow delalloc is crushing performance here. I'm planning to wait
>> 'til the fs is full and see what the effect is on fsck, and look at the
>> directory layout for differences compared to w/o delalloc.
>>
>> But something seems to have gone awry here ...
>>
>> This is on 2.6.26 with the patch queue applied up to stable.
>>
>> -Eric
>
> I oprofiled both with and without delalloc for the first 15% of the fs fill:
>
> ==> delalloc.op <==
> CPU: AMD64 processors, speed 2000 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> unit mask of 0x00 (No unit mask) count 100000
> samples   %        image name   app name   symbol name
> 56094537  73.6320  ext4dev.ko   ext4dev    ext4_mb_use_preallocated
>   642479   0.8433  vmlinux      vmlinux    __copy_user_nocache
>   523803   0.6876  vmlinux      vmlinux    memcmp
>   482874   0.6338  jbd2.ko      jbd2       do_get_write_access
>   480687   0.6310  vmlinux      vmlinux    kmem_cache_free
>   403604   0.5298  ext4dev.ko   ext4dev    str2hashbuf
>   400471   0.5257  vmlinux      vmlinux    __find_get_block
>
> ==> nodelalloc.op <==
> CPU: AMD64 processors, speed 2000 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
> unit mask of 0x00 (No unit mask) count 100000
> samples   %        image name   app name   symbol name
> 56167198  56.8949  ext4dev.ko   ext4dev    ext4_mb_use_preallocated

This was wrong; I forgot to clear stats before re-running.

With delalloc, the lg_prealloc list seems to just grow & grow in
ext4_mb_use_preallocated, searching up to 90,000 entries before finding
something. I think this is what's hurting - I need to look into how this
should work.
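
The lookup there is just a flat rcu list walk - roughly this (the loop
from ext4_mb_use_preallocated in this tree, slightly trimmed) - so any
allocation that doesn't find a big-enough pa early ends up walking the
entire list:

	rcu_read_lock();
	list_for_each_entry_rcu(pa, &lg->lg_prealloc_list, pa_inode_list) {
		spin_lock(&pa->pa_lock);
		if (pa->pa_deleted == 0 && pa->pa_free >= ac->ac_o_ex.fe_len) {
			/* big enough for this request - take it */
			atomic_inc(&pa->pa_count);
			ext4_mb_use_group_pa(ac, pa);
			spin_unlock(&pa->pa_lock);
			ac->ac_criteria = 20;
			rcu_read_unlock();
			return 1;
		}
		spin_unlock(&pa->pa_lock);
	}
	rcu_read_unlock();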

-Eric

2008-07-19 17:20:17

by Theodore Ts'o

Subject: Re: delalloc is crippling fs_mark performance

On Sat, Jul 19, 2008 at 10:44:34AM -0500, Eric Sandeen wrote:
>
> With delalloc, the lg_prealloc list seems to just grow & grow in
> ext4_mb_use_preallocated, searching up to 90,000 entries before finding
> something. I think this is what's hurting - I need to look into how this
> should work.

Hmm, this may explain Holger's benchmark regressions.

- Ted

2008-07-21 09:38:07

by Aneesh Kumar K.V

Subject: Re: delalloc is crippling fs_mark performance

On Sat, Jul 19, 2008 at 10:44:34AM -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
>
> With delalloc, the lg_prealloc list seems to just grow & grow in
> ext4_mb_use_preallocated, searching up to 90,000 entries before finding
> something. I think this is what's hurting - I need to look into how this
> should work.
>

How about this

From 2a841f47e612fa49c7a469054e441a3dc3e65f3e Mon Sep 17 00:00:00 2001
From: Aneesh Kumar K.V <[email protected]>
Date: Mon, 21 Jul 2008 15:06:45 +0530
Subject: [PATCH] ext4: Don't allow lg prealloc list to grow too large.

The locality group prealloc list is freed only when there is a block
allocation failure. This can result in a large number of per-CPU locality
group prealloc spaces and also makes ext4_mb_use_preallocated expensive.
Add a tunable, max_lg_prealloc, which defaults to 1000. If a locality group
has accumulated more than max_lg_prealloc prealloc spaces and we fail to
find a suitable prealloc space during allocation, free all of the prealloc
space in the locality group.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/ext4_sb.h | 1 +
fs/ext4/mballoc.c | 151 +++++++++++++++++++++++++++++++++++++++-------------
fs/ext4/mballoc.h | 6 ++
3 files changed, 120 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 6300226..f8bf8b0 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -115,6 +115,7 @@ struct ext4_sb_info {
/* where last allocation was done - for stream allocation */
unsigned long s_mb_last_group;
unsigned long s_mb_last_start;
+ unsigned long s_mb_max_lg_prealloc;

/* history to debug policy */
struct ext4_mb_history *s_mb_history;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 9db0f4d..4139da0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2540,6 +2540,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
sbi->s_mb_history_filter = EXT4_MB_HISTORY_DEFAULT;
sbi->s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC;
+ sbi->s_mb_max_lg_prealloc = MB_DEFAULT_LG_PREALLOC;

i = sizeof(struct ext4_locality_group) * NR_CPUS;
sbi->s_locality_groups = kmalloc(i, GFP_KERNEL);
@@ -2720,6 +2721,7 @@ ext4_mb_free_committed_blocks(struct super_block *sb)
#define EXT4_MB_ORDER2_REQ "order2_req"
#define EXT4_MB_STREAM_REQ "stream_req"
#define EXT4_MB_GROUP_PREALLOC "group_prealloc"
+#define EXT4_MB_MAX_LG_PREALLOC "max_lg_prealloc"



@@ -2769,6 +2771,7 @@ MB_PROC_FOPS(min_to_scan);
MB_PROC_FOPS(order2_reqs);
MB_PROC_FOPS(stream_request);
MB_PROC_FOPS(group_prealloc);
+MB_PROC_FOPS(max_lg_prealloc);

#define MB_PROC_HANDLER(name, var) \
do { \
@@ -2800,11 +2803,13 @@ static int ext4_mb_init_per_dev_proc(struct super_block *sb)
MB_PROC_HANDLER(EXT4_MB_ORDER2_REQ, order2_reqs);
MB_PROC_HANDLER(EXT4_MB_STREAM_REQ, stream_request);
MB_PROC_HANDLER(EXT4_MB_GROUP_PREALLOC, group_prealloc);
+ MB_PROC_HANDLER(EXT4_MB_MAX_LG_PREALLOC, max_lg_prealloc);

return 0;

err_out:
printk(KERN_ERR "EXT4-fs: Unable to create %s\n", devname);
+ remove_proc_entry(EXT4_MB_MAX_LG_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);
@@ -2826,6 +2831,7 @@ static int ext4_mb_destroy_per_dev_proc(struct super_block *sb)
return -EINVAL;

bdevname(sb->s_bdev, devname);
+ remove_proc_entry(EXT4_MB_MAX_LG_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_GROUP_PREALLOC, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_STREAM_REQ, sbi->s_mb_proc);
remove_proc_entry(EXT4_MB_ORDER2_REQ, sbi->s_mb_proc);
@@ -3280,6 +3286,107 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
mb_debug("use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
}

+static noinline_for_stack int
+ext4_mb_release_group_pa(struct ext4_buddy *e4b,
+ struct ext4_prealloc_space *pa,
+ struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = e4b->bd_sb;
+ ext4_group_t group;
+ ext4_grpblk_t bit;
+
+ if (ac)
+ ac->ac_op = EXT4_MB_HISTORY_DISCARD;
+
+ BUG_ON(pa->pa_deleted == 0);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
+ BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
+ mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
+ atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
+
+ if (ac) {
+ ac->ac_sb = sb;
+ ac->ac_inode = NULL;
+ ac->ac_b_ex.fe_group = group;
+ ac->ac_b_ex.fe_start = bit;
+ ac->ac_b_ex.fe_len = pa->pa_len;
+ ac->ac_b_ex.fe_logical = 0;
+ ext4_mb_store_history(ac);
+ }
+
+ return 0;
+}
+
+static void ext4_mb_pa_callback(struct rcu_head *head)
+{
+ struct ext4_prealloc_space *pa;
+ pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
+ kmem_cache_free(ext4_pspace_cachep, pa);
+}
+
+/*
+ * release the locality group prealloc space.
+ * called with lg_mutex held
+ */
+static noinline_for_stack void
+ext4_mb_discard_lg_preallocations(struct super_block *sb,
+ struct ext4_locality_group *lg)
+{
+ ext4_group_t group = 0;
+ struct list_head list;
+ struct ext4_buddy e4b;
+ struct ext4_allocation_context *ac;
+ struct ext4_prealloc_space *pa, *tmp;
+
+ INIT_LIST_HEAD(&list);
+ ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
+
+ list_for_each_entry_rcu(pa, &lg->lg_prealloc_list, pa_inode_list) {
+ spin_lock(&pa->pa_lock);
+ if (atomic_read(&pa->pa_count)) {
+ /* This should not happen */
+ spin_unlock(&pa->pa_lock);
+ printk(KERN_ERR "uh-oh! used pa while discarding\n");
+ WARN_ON(1);
+ continue;
+ }
+ if (pa->pa_deleted) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+ /* only lg prealloc space */
+ BUG_ON(!pa->pa_linear);
+
+ /* seems this one can be freed ... */
+ pa->pa_deleted = 1;
+ spin_unlock(&pa->pa_lock);
+
+ list_del_rcu(&pa->pa_inode_list);
+ list_add(&pa->u.pa_tmp_list, &list);
+ }
+
+ list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
+
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
+ if (ext4_mb_load_buddy(sb, group, &e4b)) {
+ ext4_error(sb, __func__, "Error in loading buddy "
+ "information for %lu\n", group);
+ continue;
+ }
+ ext4_lock_group(sb, group);
+ list_del(&pa->pa_group_list);
+ ext4_mb_release_group_pa(&e4b, pa, ac);
+ ext4_unlock_group(sb, group);
+
+ ext4_mb_release_desc(&e4b);
+ list_del(&pa->u.pa_tmp_list);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ }
+ if (ac)
+ kmem_cache_free(ext4_ac_cachep, ac);
+ return;
+}
+
/*
* search goal blocks in preallocated space
*/
@@ -3287,8 +3394,10 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
{
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_locality_group *lg;
struct ext4_prealloc_space *pa;
+ unsigned long lg_prealloc_count = 0;

/* only data can be preallocated */
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
@@ -3339,9 +3448,13 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
return 1;
}
spin_unlock(&pa->pa_lock);
+ lg_prealloc_count++;
}
rcu_read_unlock();

+ if (lg_prealloc_count > sbi->s_mb_max_lg_prealloc)
+ ext4_mb_discard_lg_preallocations(ac->ac_sb, lg);
+
return 0;
}

@@ -3388,13 +3501,6 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
mb_debug("prellocated %u for group %lu\n", preallocated, group);
}

-static void ext4_mb_pa_callback(struct rcu_head *head)
-{
- struct ext4_prealloc_space *pa;
- pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
- kmem_cache_free(ext4_pspace_cachep, pa);
-}
-
/*
* drops a reference to preallocated space descriptor
* if this was the last reference and the space is consumed
@@ -3676,37 +3782,6 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
return err;
}

-static noinline_for_stack int
-ext4_mb_release_group_pa(struct ext4_buddy *e4b,
- struct ext4_prealloc_space *pa,
- struct ext4_allocation_context *ac)
-{
- struct super_block *sb = e4b->bd_sb;
- ext4_group_t group;
- ext4_grpblk_t bit;
-
- if (ac)
- ac->ac_op = EXT4_MB_HISTORY_DISCARD;
-
- BUG_ON(pa->pa_deleted == 0);
- ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
- BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
- mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
- atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
-
- if (ac) {
- ac->ac_sb = sb;
- ac->ac_inode = NULL;
- ac->ac_b_ex.fe_group = group;
- ac->ac_b_ex.fe_start = bit;
- ac->ac_b_ex.fe_len = pa->pa_len;
- ac->ac_b_ex.fe_logical = 0;
- ext4_mb_store_history(ac);
- }
-
- return 0;
-}
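
If you want to experiment with the limit, this exposes it next to the
existing mballoc tunables, so something like the following should work
(assuming the usual /proc/fs/ext4/<dev>/ location on this kernel; the
device name is only an example):

  echo 2000 > /proc/fs/ext4/sdb1/max_lg_prealloc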

2008-07-21 16:22:41

by Eric Sandeen

Subject: Re: delalloc is crippling fs_mark performance

Eric Sandeen wrote:
> running fs_mark like this:
>
> fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
>
> (256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)
>
> on a 1T fs, with and without delalloc (mount option), is pretty interesting:
>
> http://people.redhat.com/esandeen/ext4/fs_mark.png

I've updated this graph with another run where the group_prealloc
tunable was set to an exact multiple of the allocation size, 500 blocks.
This way the leftover 2-block preallocations don't wind up growing the
list with unusable tiny leftovers. The results after tuning this way
make it pretty clear that this list growth is the problem here.
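
The arithmetic behind those numbers, for reference (assuming 4k blocks
on this fs):

  20480 bytes per file / 4096 bytes per block = 5 blocks per file
  default group_prealloc of 512 blocks -> 512 mod 5 = 2 blocks left over per pa
  tuned group_prealloc of 500 blocks   -> 500 mod 5 = 0, 100 files per pa, no leftovers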

-Eric


2008-07-21 22:39:39

by Andreas Dilger

Subject: Re: delalloc is crippling fs_mark performance

On Jul 21, 2008 11:22 -0500, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > running fs_mark like this:
> >
> > fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
> >
> > (256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)
> >
> > on a 1T fs, with and without delalloc (mount option), is pretty interesting:
> >
> > http://people.redhat.com/esandeen/ext4/fs_mark.png
>
> I've updated this graph with another run where the group_prealloc
> tunable was set to an exact multiple of the allocation size, 500 blocks.
> This way the leftover 2-block preallocations don't wind up growing the
> list with unusable tiny leftovers. The results after tuning this way
> make it pretty clear that this list growth is the problem here.

Looking at that graph, it would seem that allowing 1000 PAs to accumulate
with Aneesh's patch adds a constant slowdown. Compared with the "perfect"
case, where the PA list is always empty, it is noticeably slower.

I'd guess that the right thing to do is have a few buckets for PAs of
different sizes, and keep them very short (e.g. <= 8) to avoid a lot of
list walking overhead on each access.

I think keeping a single PA of "each size" would likely run out if
different-sized allocations are being done, requiring a re-search.
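
Roughly the shape I have in mind (just an illustration; the array size,
the fls() indexing and the names are made up, not a worked-out patch):

	#define LG_PREALLOC_BUCKETS	10

	struct ext4_locality_group {
		struct mutex		lg_mutex;
		/* one short list per size order, indexed by fls(pa_free) - 1 */
		struct list_head	lg_prealloc_list[LG_PREALLOC_BUCKETS];
		spinlock_t		lg_prealloc_lock;
	};

A request for N blocks then only has to look at the buckets for order
fls(N) - 1 and up, and a bucket gets trimmed as soon as it passes the
8-entry limit.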

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-07-22 11:13:15

by Aneesh Kumar K.V

Subject: Re: delalloc is crippling fs_mark performance

On Mon, Jul 21, 2008 at 04:39:35PM -0600, Andreas Dilger wrote:
> On Jul 21, 2008 11:22 -0500, Eric Sandeen wrote:
> > Eric Sandeen wrote:
> > > running fs_mark like this:
> > >
> > > fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0
> > >
> > > (256 subdirs, 100000 files/iteration, 4 threads, 20k files, no sync)
> > >
> > > on a 1T fs, with and without delalloc (mount option), is pretty interesting:
> > >
> > > http://people.redhat.com/esandeen/ext4/fs_mark.png
> >
> > I've updated this graph with another run where the group_prealloc
> > tunable was set to an exact multiple of the allocation size, 500 blocks.
> > This way the leftover 2-block preallocations don't wind up growing the
> > list with unusable tiny leftovers. The results after tuning this way
> > make it pretty clear that this list growth is the problem here.
>
> Looking at that graph, it would seem that allowing 1000 PAs to accumulate
> with Aneesh's patch adds a constant slowdown. Compared with the "perfect"
> case, where the PA list is always empty, it is noticeably slower.
>
> I'd guess that the right thing to do is have a few buckets for PAs of
> different sizes, and keep them very short (e.g. <= 8) to avoid a lot of
> list walking overhead on each access.
>
> I think keeping a single PA of "each size" would likely run out if
> different-sized allocations are being done, requiring a re-search.
>

How about this instead:

From 049cfcf425e57fba0c3d1555e3f9f72f8104b4ed Mon Sep 17 00:00:00 2001
From: Aneesh Kumar K.V <[email protected]>
Date: Tue, 22 Jul 2008 16:39:07 +0530
Subject: [PATCH] ext4: Don't allow lg prealloc list to grow too large.

Currently the locality group prealloc list is freed only when there is a
block allocation failure. This can result in a large number of per-CPU
locality group prealloc spaces and also makes ext4_mb_use_preallocated
expensive. Convert the locality group prealloc list into a table of lists
hashed by size: the index is the order of the number of free blocks in the
prealloc space, with a maximum order of 9. When adding a prealloc space to
a list, we make sure the total number of entries for that order does not
exceed 8; if it does, we discard entries until 5 or fewer remain.


Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/mballoc.c | 266 +++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/mballoc.h | 10 ++-
2 files changed, 215 insertions(+), 61 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5b854b7..e058509 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2481,7 +2481,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
int ext4_mb_init(struct super_block *sb, int needs_recovery)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- unsigned i;
+ unsigned i, j;
unsigned offset;
unsigned max;
int ret;
@@ -2553,7 +2553,8 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
struct ext4_locality_group *lg;
lg = &sbi->s_locality_groups[i];
mutex_init(&lg->lg_mutex);
- INIT_LIST_HEAD(&lg->lg_prealloc_list);
+ for (j = 0; j < PREALLOC_TB_SIZE; j++)
+ INIT_LIST_HEAD(&lg->lg_prealloc_list[j]);
spin_lock_init(&lg->lg_prealloc_lock);
}

@@ -3258,12 +3259,68 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
}

/*
+ * release the locality group prealloc space.
+ * called with lg_mutex held
+ * called with lg->lg_prealloc_lock held
+ */
+static noinline_for_stack void
+ext4_mb_discard_lg_preallocations_prep(struct list_head *discard_list,
+ struct list_head *lg_prealloc_list,
+ int total_entries)
+{
+ struct ext4_prealloc_space *pa;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(pa, lg_prealloc_list, pa_inode_list) {
+ spin_lock(&pa->pa_lock);
+ if (atomic_read(&pa->pa_count)) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+ if (pa->pa_deleted) {
+ spin_unlock(&pa->pa_lock);
+ continue;
+ }
+ /* only lg prealloc space */
+ BUG_ON(!pa->pa_linear);
+
+ /* seems this one can be freed ... */
+ pa->pa_deleted = 1;
+ spin_unlock(&pa->pa_lock);
+
+
+ list_del_rcu(&pa->pa_inode_list);
+ list_add(&pa->u.pa_tmp_list, discard_list);
+
+ total_entries--;
+ if (total_entries <= 5) {
+ /*
+ * we want to keep only 5 entries
+ * allowing it to grow to 8. This
+ * makes sure we don't call discard
+ * soon for this list.
+ */
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ return;
+}
+
+/*
* use blocks preallocated to locality group
+ * called with lg->lg_prealloc_lock held
*/
static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
- struct ext4_prealloc_space *pa)
+ struct ext4_prealloc_space *pa,
+ struct list_head *discard_list)
{
+ int order, added = 0, lg_prealloc_count = 1;
unsigned int len = ac->ac_o_ex.fe_len;
+ struct ext4_prealloc_space *tmp_pa;
+ struct ext4_locality_group *lg = ac->ac_lg;
+
ext4_get_group_no_and_offset(ac->ac_sb, pa->pa_pstart,
&ac->ac_b_ex.fe_group,
&ac->ac_b_ex.fe_start);
@@ -3278,6 +3335,112 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
* Other CPUs are prevented from allocating from this pa by lg_mutex
*/
mb_debug("use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
+
+ /* remove the pa from the current list and add it to the new list */
+ pa->pa_free -= len;
+ order = fls(pa->pa_free) - 1;
+
+ /* remove from the old list */
+ list_del_rcu(&pa->pa_inode_list);
+
+ list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
+ pa_inode_list) {
+ if (!added && pa->pa_free < tmp_pa->pa_free) {
+ /* Add to the tail of the previous entry */
+ list_add_tail_rcu(&pa->pa_inode_list,
+ tmp_pa->pa_inode_list.prev);
+ added = 1;
+ /* we want to count the total
+ * number of entries in the list
+ */
+ }
+ lg_prealloc_count++;
+ }
+ if (!added)
+ list_add_tail_rcu(&pa->pa_inode_list, &tmp_pa->pa_inode_list);
+
+ /* Now trim the list to be not more than 8 elements */
+ if (lg_prealloc_count > 8) {
+ /*
+ * We can remove the prealloc space from grp->bb_prealloc_list
+ * here because we are holding lg_prealloc_lock and can't take
+ * group lock.
+ */
+ ext4_mb_discard_lg_preallocations_prep(discard_list,
+ &lg->lg_prealloc_list[order],
+ lg_prealloc_count);
+ }
+}
+
+static noinline_for_stack int
+ext4_mb_release_group_pa(struct ext4_buddy *e4b,
+ struct ext4_prealloc_space *pa,
+ struct ext4_allocation_context *ac)
+{
+ struct super_block *sb = e4b->bd_sb;
+ ext4_group_t group;
+ ext4_grpblk_t bit;
+
+ if (ac)
+ ac->ac_op = EXT4_MB_HISTORY_DISCARD;
+
+ BUG_ON(pa->pa_deleted == 0);
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
+ BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
+ mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
+ atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
+
+ if (ac) {
+ ac->ac_sb = sb;
+ ac->ac_inode = NULL;
+ ac->ac_b_ex.fe_group = group;
+ ac->ac_b_ex.fe_start = bit;
+ ac->ac_b_ex.fe_len = pa->pa_len;
+ ac->ac_b_ex.fe_logical = 0;
+ ext4_mb_store_history(ac);
+ }
+
+ return 0;
+}
+
+static void ext4_mb_pa_callback(struct rcu_head *head)
+{
+ struct ext4_prealloc_space *pa;
+ pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
+ kmem_cache_free(ext4_pspace_cachep, pa);
+}
+
+static noinline_for_stack void
+ext4_mb_discard_lg_preallocations_commit(struct super_block *sb,
+ struct list_head *discard_list)
+{
+ ext4_group_t group = 0;
+ struct ext4_buddy e4b;
+ struct ext4_allocation_context *ac;
+ struct ext4_prealloc_space *pa, *tmp;
+
+ ac = kmem_cache_alloc(ext4_ac_cachep, GFP_NOFS);
+
+ list_for_each_entry_safe(pa, tmp, discard_list, u.pa_tmp_list) {
+
+ ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, NULL);
+ if (ext4_mb_load_buddy(sb, group, &e4b)) {
+ ext4_error(sb, __func__, "Error in loading buddy "
+ "information for %lu\n", group);
+ continue;
+ }
+ ext4_lock_group(sb, group);
+ list_del(&pa->pa_group_list);
+ ext4_mb_release_group_pa(&e4b, pa, ac);
+ ext4_unlock_group(sb, group);
+
+ ext4_mb_release_desc(&e4b);
+ list_del(&pa->u.pa_tmp_list);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ }
+ if (ac)
+ kmem_cache_free(ext4_ac_cachep, ac);
+ return;
}

/*
@@ -3286,14 +3449,17 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
static noinline_for_stack int
ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
{
+ int order, lg_prealloc_count = 0, i;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_locality_group *lg;
struct ext4_prealloc_space *pa;
+ struct list_head discard_list;

/* only data can be preallocated */
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return 0;

+ INIT_LIST_HEAD(&discard_list);
/* first, try per-file preallocation */
rcu_read_lock();
list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
@@ -3326,22 +3492,39 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
lg = ac->ac_lg;
if (lg == NULL)
return 0;
-
- rcu_read_lock();
- list_for_each_entry_rcu(pa, &lg->lg_prealloc_list, pa_inode_list) {
- spin_lock(&pa->pa_lock);
- if (pa->pa_deleted == 0 && pa->pa_free >= ac->ac_o_ex.fe_len) {
- atomic_inc(&pa->pa_count);
- ext4_mb_use_group_pa(ac, pa);
+ order = fls(ac->ac_o_ex.fe_len) - 1;
+ if (order > PREALLOC_TB_SIZE - 1)
+ /* The max size of hash table is PREALLOC_TB_SIZE */
+ order = PREALLOC_TB_SIZE - 1;
+ /*
+ * We take the lock on the locality object to prevent a
+ * discard via ext4_mb_discard_group_preallocations
+ */
+ spin_lock(&lg->lg_prealloc_lock);
+ for (i = order; i < PREALLOC_TB_SIZE; i++) {
+ lg_prealloc_count = 0;
+ rcu_read_lock();
+ list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
+ pa_inode_list) {
+ spin_lock(&pa->pa_lock);
+ if (pa->pa_deleted == 0 &&
+ pa->pa_free >= ac->ac_o_ex.fe_len) {
+ atomic_inc(&pa->pa_count);
+ ext4_mb_use_group_pa(ac, pa, &discard_list);
+ spin_unlock(&pa->pa_lock);
+ ac->ac_criteria = 20;
+ rcu_read_unlock();
+ spin_unlock(&lg->lg_prealloc_lock);
+ return 1;
+ }
spin_unlock(&pa->pa_lock);
- ac->ac_criteria = 20;
- rcu_read_unlock();
- return 1;
+ lg_prealloc_count++;
}
- spin_unlock(&pa->pa_lock);
+ rcu_read_unlock();
}
- rcu_read_unlock();
-
+ spin_unlock(&lg->lg_prealloc_lock);
+ ext4_mb_discard_lg_preallocations_commit(ac->ac_sb,
+ &discard_list);
return 0;
}

@@ -3388,13 +3571,6 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
mb_debug("prellocated %u for group %lu\n", preallocated, group);
}

-static void ext4_mb_pa_callback(struct rcu_head *head)
-{
- struct ext4_prealloc_space *pa;
- pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
- kmem_cache_free(ext4_pspace_cachep, pa);
-}
-
/*
* drops a reference to preallocated space descriptor
* if this was the last reference and the space is consumed
@@ -3543,6 +3719,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
struct ext4_locality_group *lg;
struct ext4_prealloc_space *pa;
struct ext4_group_info *grp;
+ struct list_head discard_list;

/* preallocate only when found space is larger then requested */
BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
@@ -3554,6 +3731,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
if (pa == NULL)
return -ENOMEM;

+ INIT_LIST_HEAD(&discard_list);
/* preallocation can change ac_b_ex, thus we store actually
* allocated blocks for history */
ac->ac_f_ex = ac->ac_b_ex;
@@ -3564,13 +3742,13 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_free = pa->pa_len;
atomic_set(&pa->pa_count, 1);
spin_lock_init(&pa->pa_lock);
+ INIT_LIST_HEAD(&pa->pa_inode_list);
pa->pa_deleted = 0;
pa->pa_linear = 1;

mb_debug("new group pa %p: %llu/%u for %u\n", pa,
pa->pa_pstart, pa->pa_len, pa->pa_lstart);

- ext4_mb_use_group_pa(ac, pa);
atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);

grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
@@ -3584,10 +3762,12 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);

- spin_lock(pa->pa_obj_lock);
- list_add_tail_rcu(&pa->pa_inode_list, &lg->lg_prealloc_list);
- spin_unlock(pa->pa_obj_lock);
+ /* ext4_mb_use_group_pa will also add the pa to the lg list */
+ spin_lock(&lg->lg_prealloc_lock);
+ ext4_mb_use_group_pa(ac, pa, &discard_list);
+ spin_unlock(&lg->lg_prealloc_lock);

+ ext4_mb_discard_lg_preallocations_commit(sb, &discard_list);
return 0;
}

@@ -3676,37 +3856,6 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
return err;
}

-static noinline_for_stack int
-ext4_mb_release_group_pa(struct ext4_buddy *e4b,
- struct ext4_prealloc_space *pa,
- struct ext4_allocation_context *ac)
-{
- struct super_block *sb = e4b->bd_sb;
- ext4_group_t group;
- ext4_grpblk_t bit;
-
- if (ac)
- ac->ac_op = EXT4_MB_HISTORY_DISCARD;
-
- BUG_ON(pa->pa_deleted == 0);
- ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
- BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
- mb_free_blocks(pa->pa_inode, e4b, bit, pa->pa_len);
- atomic_add(pa->pa_len, &EXT4_SB(sb)->s_mb_discarded);
-
- if (ac) {
- ac->ac_sb = sb;
- ac->ac_inode = NULL;
- ac->ac_b_ex.fe_group = group;
- ac->ac_b_ex.fe_start = bit;
- ac->ac_b_ex.fe_len = pa->pa_len;
- ac->ac_b_ex.fe_logical = 0;
- ext4_mb_store_history(ac);
- }
-
- return 0;
-}
-
/*
* releases all preallocations in given group
*
@@ -4136,7 +4285,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
spin_lock(&ac->ac_pa->pa_lock);
ac->ac_pa->pa_pstart += ac->ac_b_ex.fe_len;
ac->ac_pa->pa_lstart += ac->ac_b_ex.fe_len;
- ac->ac_pa->pa_free -= ac->ac_b_ex.fe_len;
ac->ac_pa->pa_len -= ac->ac_b_ex.fe_len;
spin_unlock(&ac->ac_pa->pa_lock);
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 1141ad5..6b46c86 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -164,11 +164,17 @@ struct ext4_free_extent {
* Locality group:
* we try to group all related changes together
* so that writeback can flush/allocate them together as well
+ * Size of lg_prealloc_list hash is determined by MB_DEFAULT_GROUP_PREALLOC
+ * (512). We store prealloc space into the hash based on the pa_free blocks
+ * order value, i.e. fls(pa_free) - 1.
*/
+#define PREALLOC_TB_SIZE 10
struct ext4_locality_group {
/* for allocator */
- struct mutex lg_mutex; /* to serialize allocates */
- struct list_head lg_prealloc_list;/* list of preallocations */
+ /* to serialize allocates */
+ struct mutex lg_mutex;
+ /* list of preallocations */
+ struct list_head lg_prealloc_list[PREALLOC_TB_SIZE];
spinlock_t lg_prealloc_lock;
};

--
1.5.6.3.439.g1e10.dirty