2009-08-07 13:07:57

by Theodore Ts'o

[permalink] [raw]
Subject: Questions about mballoc's stream allocation


I've got two questions about mballoc's stream allocation.

First of all, in ext4_mb_regular_allocator(), I'm 99% sure this is a
bug:

/* if stream allocation is enabled, use global goal */
size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
isize = i_size_read(ac->ac_inode) >> bsbits;
if (size < isize)
size = isize;

if (size < sbi->s_mb_stream_request &&
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ac->ac_flags & EXT4_MB_HINT_DATA)) {
/* TBD: may be hot point */
spin_lock(&sbi->s_md_lock);
ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
spin_unlock(&sbi->s_md_lock);
}

Shouldn't that be ">=", not "<". We want to use the values saved in
sbi->s_mb_last_{group,start} only if we are doing a stream allocation,
which would be true only if the file is *larger* than
s_mb_stream_request, no?


The second question I have is with regards to ext4_mb_use_best_found(),
we set sbi->s_mb_last_{group,start} on any data allocation; shouldn't we
only be setting those values only if we were doing a stream allocation
in the first place?

Otherwise, any kind of allocation will end up moving the global goal
block for stream allocations; even if it is a small allocation in the
middle of some block group caused by the flag EXT4_MB_HINT_NO_PREALLOC
being set.

Am I missing anything?

- Ted


2009-08-08 07:51:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: Questions about mballoc's stream allocation

On Aug 07, 2009 09:07 -0400, Theodore Ts'o wrote:
> I've got two questions about mballoc's stream allocation.

Sorry, I don't know the answers. Hopefully Alex can chime in.

> First of all, in ext4_mb_regular_allocator(), I'm 99% sure this is a
> bug:
>
> /* if stream allocation is enabled, use global goal */
> size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> isize = i_size_read(ac->ac_inode) >> bsbits;
> if (size < isize)
> size = isize;
>
> if (size < sbi->s_mb_stream_request &&
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (ac->ac_flags & EXT4_MB_HINT_DATA)) {
> /* TBD: may be hot point */
> spin_lock(&sbi->s_md_lock);
> ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> spin_unlock(&sbi->s_md_lock);
> }
>
> Shouldn't that be ">=", not "<". We want to use the values saved in
> sbi->s_mb_last_{group,start} only if we are doing a stream allocation,
> which would be true only if the file is *larger* than
> s_mb_stream_request, no?
>
>
> The second question I have is with regards to ext4_mb_use_best_found(),
> we set sbi->s_mb_last_{group,start} on any data allocation; shouldn't we
> only be setting those values only if we were doing a stream allocation
> in the first place?
>
> Otherwise, any kind of allocation will end up moving the global goal
> block for stream allocations; even if it is a small allocation in the
> middle of some block group caused by the flag EXT4_MB_HINT_NO_PREALLOC
> being set.
>
> Am I missing anything?
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-08 22:52:47

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH,RFC 1/2] ext4: Fix bugs in mballoc's stream allocation mode

The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
screwed up. These fields were getting unconditionally allt he time,
set even when stream allocation had not taken place, and if they were
being used when the file was smaller than s_mb_stream_request, which
is when the allocation should _not_ be doing stream allocation.

Fix this by determining whether or not we stream allocation should
take place once, in ext4_mb_group_or_file(), and setting a flag which
gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
This simplifies the code and assures that we are consistently using
(or not using) the stream allocation logic.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/mballoc.c | 23 ++++++++++-------------
2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..1ca6995 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -88,6 +88,8 @@ typedef unsigned int ext4_group_t;
#define EXT4_MB_HINT_TRY_GOAL 512
/* blocks already pre-reserved by delayed allocation */
#define EXT4_MB_DELALLOC_RESERVED 1024
+/* We are doing stream allocation */
+#define EXT4_MB_STREAM_ALLOC 2048


struct ext4_allocation_request {
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 68cde59..f019a50 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1360,7 +1360,7 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
ac->alloc_semp = e4b->alloc_semp;
e4b->alloc_semp = NULL;
/* store last allocated for subsequent stream allocation */
- if ((ac->ac_flags & EXT4_MB_HINT_DATA)) {
+ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
spin_lock(&sbi->s_md_lock);
sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
@@ -1938,7 +1938,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
struct ext4_sb_info *sbi;
struct super_block *sb;
struct ext4_buddy e4b;
- loff_t size, isize;

sb = ac->ac_sb;
sbi = EXT4_SB(sb);
@@ -1974,20 +1973,16 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
}

bsbits = ac->ac_sb->s_blocksize_bits;
- /* if stream allocation is enabled, use global goal */
- size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
- isize = i_size_read(ac->ac_inode) >> bsbits;
- if (size < isize)
- size = isize;

- if (size < sbi->s_mb_stream_request &&
- (ac->ac_flags & EXT4_MB_HINT_DATA)) {
+ /* if stream allocation is enabled, use global goal */
+ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
/* TBD: may be hot point */
spin_lock(&sbi->s_md_lock);
ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
spin_unlock(&sbi->s_md_lock);
}
+
/* Let's just scan groups to find more-less suitable blocks */
cr = ac->ac_2order ? 0 : 1;
/*
@@ -4155,16 +4150,18 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return;

+ if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+ return;
+
size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
isize = i_size_read(ac->ac_inode) >> bsbits;
size = max(size, isize);

/* don't use group allocation for large files */
- if (size >= sbi->s_mb_stream_request)
- return;

2009-08-08 22:52:47

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH,RFC 2/2] ext4: Avoid group preallocation for closed files

Currently the group preallocation code tries to find a large (512)
free block from which to do per-cpu group allocation for small files.
The problem with this scheme is that it leaves the filesystem horribly
fragmented. In the worst case, if the filesystem is unmounted and
remounted (after a system shutdown, for example) we forget the fact
that wee were using a particular (now-partially filled) 512 block
extent. So the next time we try to allocate space for a small file,
we will find *another* completely free 512 block chunk to allocate
small files. Given that there are 32,768 blocks in a block group,
after 64 iterations of "mount, write one 4k file in a directory,
unmount", the block group will have 64 files, each separated by 511
blocks, and the block group will no longer have any free 512
completely free chunks of blocks for group preallocation space.

So if we try to allocate blocks for a file that has been closed, such
that we know the final size of the file, and the filesystem is not
busy, avoid using group preallocation.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 24 +++++++++++++++++++++++-
fs/ext4/mballoc.c | 10 +++++++++-
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1ca6995..c2d98f8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -90,6 +90,8 @@ typedef unsigned int ext4_group_t;
#define EXT4_MB_DELALLOC_RESERVED 1024
/* We are doing stream allocation */
#define EXT4_MB_STREAM_ALLOC 2048
+/* We suppressed group preallocation */
+#define EXT4_MB_SUPPRESS_GROUP_PREALLOC 4096


struct ext4_allocation_request {
@@ -952,6 +954,7 @@ struct ext4_sb_info {
atomic_t s_mb_lost_chunks;
atomic_t s_mb_preallocated;
atomic_t s_mb_discarded;
+ atomic_t s_lock_busy;

/* locality groups */
struct ext4_locality_group *s_locality_groups;
@@ -1593,15 +1596,34 @@ struct ext4_group_info {
#define EXT4_MB_GRP_NEED_INIT(grp) \
(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))

+#define EXT4_MAX_CONTENTION 8
+#define EXT4_CONTENTION_THRESHOLD 2
+
static inline spinlock_t *ext4_group_lock_ptr(struct super_block *sb,
ext4_group_t group)
{
return bgl_lock_ptr(EXT4_SB(sb)->s_blockgroup_lock, group);
}

+/*
+ * Returns true if the filesystem is busy enough that attempts to
+ * access the block group locks has run into contention.
+ */
+static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
+{
+ return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
+}
+
static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
{
- spin_lock(ext4_group_lock_ptr(sb, group));
+ spinlock_t *lock = ext4_group_lock_ptr(sb, group);
+ if (spin_trylock(lock))
+ atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
+ EXT4_MAX_CONTENTION);
+ else {
+ atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+ spin_lock(lock);
+ }
}

static inline void ext4_unlock_group(struct super_block *sb,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f019a50..482bf40 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4154,9 +4154,17 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
return;

size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
- isize = i_size_read(ac->ac_inode) >> bsbits;
+ isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
+ >> bsbits;
size = max(size, isize);

+ if ((size == isize) &&
+ ext4_fs_is_busy(sbi) &&
+ (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
+ ac->ac_flags |= EXT4_MB_SUPPRESS_GROUP_PREALLOC;
+ return;
+ }
+
/* don't use group allocation for large files */
if (size >= sbi->s_mb_stream_request) {
ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
--
1.6.3.2.1.gb9f7d.dirty


2009-08-11 15:39:12

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Questions about mballoc's stream allocation

On Fri, Aug 07, 2009 at 09:07:53AM -0400, Theodore Ts'o wrote:
>
> I've got two questions about mballoc's stream allocation.
>
> First of all, in ext4_mb_regular_allocator(), I'm 99% sure this is a
> bug:
>
> /* if stream allocation is enabled, use global goal */
> size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> isize = i_size_read(ac->ac_inode) >> bsbits;
> if (size < isize)
> size = isize;
>
> if (size < sbi->s_mb_stream_request &&
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (ac->ac_flags & EXT4_MB_HINT_DATA)) {
> /* TBD: may be hot point */
> spin_lock(&sbi->s_md_lock);
> ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> spin_unlock(&sbi->s_md_lock);
> }
>
> Shouldn't that be ">=", not "<". We want to use the values saved in
> sbi->s_mb_last_{group,start} only if we are doing a stream allocation,
> which would be true only if the file is *larger* than
> s_mb_stream_request, no?
>
>
> The second question I have is with regards to ext4_mb_use_best_found(),
> we set sbi->s_mb_last_{group,start} on any data allocation; shouldn't we
> only be setting those values only if we were doing a stream allocation
> in the first place?
>
> Otherwise, any kind of allocation will end up moving the global goal
> block for stream allocations; even if it is a small allocation in the
> middle of some block group caused by the flag EXT4_MB_HINT_NO_PREALLOC
> being set.
>
> Am I missing anything?
>

I guess we should be setting the sbi->s_mb_last_{group,start} only when doing
small file allocation. We want to make sure small file allocation always
use the goal block near to the previous small file allocation request. So
(size < sbi->s_mb_stream_request) is correct. But we should not be doing

sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
always.

For large file allocation we wanted the new blocks to closer to that files previous
allocated block which ext4_ext_find_goal return as the goal value. So for
large files the goal value passed should represent the correct value.

NOTE: I am still behind ext4 list mails due to other commitments. Will start looking
at the mails, but mostly would be slow in replies

-aneesh

2009-08-11 17:37:13

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Questions about mballoc's stream allocation

On Tue, Aug 11, 2009 at 09:09:05PM +0530, Aneesh Kumar K.V wrote:
> On Fri, Aug 07, 2009 at 09:07:53AM -0400, Theodore Ts'o wrote:
> >
> > I've got two questions about mballoc's stream allocation.
> >
> > First of all, in ext4_mb_regular_allocator(), I'm 99% sure this is a
> > bug:
> >
> > /* if stream allocation is enabled, use global goal */
> > size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
> > isize = i_size_read(ac->ac_inode) >> bsbits;
> > if (size < isize)
> > size = isize;
> >
> > if (size < sbi->s_mb_stream_request &&
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > (ac->ac_flags & EXT4_MB_HINT_DATA)) {
> > /* TBD: may be hot point */
> > spin_lock(&sbi->s_md_lock);
> > ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> > ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
> > spin_unlock(&sbi->s_md_lock);
> > }
> >
> > Shouldn't that be ">=", not "<". We want to use the values saved in
> > sbi->s_mb_last_{group,start} only if we are doing a stream allocation,
> > which would be true only if the file is *larger* than
> > s_mb_stream_request, no?
> >
> >
> > The second question I have is with regards to ext4_mb_use_best_found(),
> > we set sbi->s_mb_last_{group,start} on any data allocation; shouldn't we
> > only be setting those values only if we were doing a stream allocation
> > in the first place?
> >
> > Otherwise, any kind of allocation will end up moving the global goal
> > block for stream allocations; even if it is a small allocation in the
> > middle of some block group caused by the flag EXT4_MB_HINT_NO_PREALLOC
> > being set.
> >
> > Am I missing anything?
> >
>
> I guess we should be setting the sbi->s_mb_last_{group,start} only when doing
> small file allocation. We want to make sure small file allocation always
> use the goal block near to the previous small file allocation request. So
> (size < sbi->s_mb_stream_request) is correct. But we should not be doing
>
> sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> always.
>
> For large file allocation we wanted the new blocks to closer to that files previous
> allocated block which ext4_ext_find_goal return as the goal value. So for
> large files the goal value passed should represent the correct value.
>

Or may be the current code is fine. I guess what the aim was to make sure we spread
small file allocation near to the context. We use the ac_g_ex only when we don't find
blocks in prealloc space. The goal could be that, if we create lot of small files it would be that
the new small files created may be related to the process that did last block allocation.

-aneesh