2010-03-23 00:21:24

by djwong

Subject: bug in inode allocator?

Hi,

I'm trying to understand how ext4's inode allocator selects a blockgroup when
creating a top level directory, and I've noticed a couple of odd behaviors with
the algorithm (2.6.34-rc2 if anyone cares):

First, the allocator will pick a random blockgroup from which to begin a linear
scan of all the blockgroups to find the least heavily loaded one. However, if
there are ties for the least heavily loaded bg, the allocator picks the first
one it scanned, not necessarily the one with the lowest bg number. This seems
to be a strategy to scatter top level directories all over the disk in an
attempt to keep top level directories from ending up in the same bg and
fragmenting each other. However, if the tie is between empty blockgroups and
the media is a rotating disk, this can result in top level directories being
created far away from the high-bandwidth beginning of the disk. If one creates
only a handful of directories which all end up hashing to higher number
blockgroups, then the filesystem won't use the high performance areas of the
disk until there's enough data to wrap around to the blockgroups at the
beginning.
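
To make that behavior concrete, here is a toy model of the scan as I read it
(plain userspace C, not the actual find_group_orlov() code; the struct and
helper names are invented):

/* toy model only; NOT the real find_group_orlov() logic */
struct gstat {
	unsigned int used_dirs;
	unsigned int free_inodes;
	unsigned int free_blocks;
};

/* "less loaded" here: fewer dirs, then more free inodes, then more free blocks */
static int less_loaded(const struct gstat *a, const struct gstat *b)
{
	if (a->used_dirs != b->used_dirs)
		return a->used_dirs < b->used_dirs;
	if (a->free_inodes != b->free_inodes)
		return a->free_inodes > b->free_inodes;
	return a->free_blocks > b->free_blocks;
}

static int pick_group(const struct gstat *g, int ngroups, int hashed_start)
{
	int best = hashed_start;

	for (int i = 1; i < ngroups; i++) {
		int grp = (hashed_start + i) % ngroups;

		/* strict comparison: a later, equally-loaded group never
		 * replaces the earlier find, so on a fresh filesystem the
		 * hashed starting group wins wherever it sits on the disk */
		if (less_loaded(&g[grp], &g[best]))
			best = grp;
	}
	return best;
}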

An "easy" fix seems to be: If there is a tie in comparing blockgroups, then the
one with the lowest bg number wins, though that heavily biases directory
creation towards the beginning of the disk, so further study on my part is
needed. In performing _that_ analysis, I came across a second problem:

The get_orlov_stats() function returns three metrics for a given block group;
these metrics (used_dirs, free_inodes, and free_blocks) are used to figure out
if one blockgroup is less heavily loaded than another. If I create a bunch of
1-byte files, the free_inodes and free_blocks counts decrease by 1 every time,
as you'd expect. However, when I create directories, only the free_blocks
count decreases--used_dirs and free_inodes remain the same! This seemed very
suspicious to me, so I umounted and mounted the filesystem and reran my test.
free_blocks and used_dirs suddenly decreased by the number of directories that
I had created before the umount, but after the first mkdir, the free_inodes and
used_dirs counts did not change, just like before.

I then ran a loop wherein I created a directory and then a small file. For
each dir+file pair, the free_inodes count decreased by 1, the used_dirs
count remained unchanged, and the free_blocks count decreased by 2. Weird,
since I was pretty sure that even directories require an inode and a block.
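
For reference, the loop was roughly the following (a reconstructed sketch
rather than the exact test program; the names and iteration count are
arbitrary):

/* reconstructed sketch: alternate mkdir and 1-byte file creation */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char name[64];

	for (int i = 0; i < 64; i++) {
		snprintf(name, sizeof(name), "dir%03d", i);
		if (mkdir(name, 0755))
			perror("mkdir");

		snprintf(name, sizeof(name), "file%03d", i);
		int fd = open(name, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			continue;
		}
		if (write(fd, "x", 1) != 1)	/* the 1-byte file */
			perror("write");
		close(fd);
	}
	return 0;
}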

I interpret this behavior to mean that free_inodes/used_dirs only get updated
at mount time and at file creation time, and furthermore are not being updated
when directories get created. The fact that the counts _do_ suddenly decrease
across a umount/mount cycle confirms that directories do use inodes, which is
what I expect. I wondered if this was a behavior of delalloc or something, but
-o nodelalloc did not change the behavior. Nor did adding copious calls to
sync.

Unfortunately, this second behavior means that the "find the least full
blockgroup" code can use stale data in its comparisons. Am I correct that
something is wrong here, or have I misinterpreted the code? Is it /supposed/
to be the case that used_dirs reflects the number of directories in the
blockgroup at *mount time* and not at the current time?

--D


2010-03-23 03:34:40

by Eric Sandeen

Subject: [PATCH] fix up flex groups used_dirs manipulation

Darrick J. Wong wrote:

...
> Unfortunately, this second behavior means that the "find the least full
> blockgroup" code can use stale data in its comparisons. Am I correct that
> something is wrong here, or have I misinterpreted the code? Is it /supposed/
> to be the case that used_dirs reflects the number of directories in the
> blockgroup at *mount time* and not at the current time?
>
This does seem weird; the flex_group dir counters are indeed only updated
at mount time:

ext4_fill_super
  ext4_fill_flex_info
    atomic_add(ext4_used_dirs_count(sb, gdp),
               &sbi->s_flex_groups[flex_group].used_dirs);

and yet it's read repeatedly in get_orlov_stats:

ialloc.c:430 (get_orlov_stats): stats->used_dirs = atomic_read(&flex_group[g].used_dirs);

I think this patch:

commit 7d39db14a42cbd719c7515b9da8f85a2eb6a0633
[PATCH] ext4: Use struct flex_groups to calculate get_orlov_stats()

"missed" a bit, maybe a cut and paste error:

@@ -267,6 +267,13 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	if (is_directory) {
 		count = ext4_used_dirs_count(sb, gdp) - 1;
 		ext4_used_dirs_set(sb, gdp, count);
+		if (sbi->s_log_groups_per_flex) {
+			ext4_group_t f;
+
+			f = ext4_flex_group(sbi, block_group);
+			atomic_dec(&sbi->s_flex_groups[f].free_inodes);
+		}

why would we be decrementing free inodes in free_inode? And then later
in the function we atomic_inc it again. Very odd, and likely a thinko.

I think the following patch fixes it up, although it seems like we should
probably introduce (another) wrapper to set these counts in the gdp as
well as the flex groups if they are present, so we don't always have
to remember to manually hit both.
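
Something like this rough sketch is what I have in mind (the name
ext4_adjust_used_dirs is made up):

/* hypothetical helper: adjust the on-disk group descriptor's dir count and,
 * when flex_bg is enabled, the in-core flex group counter, in one place */
static void ext4_adjust_used_dirs(struct super_block *sb,
				  struct ext4_group_desc *gdp,
				  ext4_group_t group, int delta)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	ext4_used_dirs_set(sb, gdp, ext4_used_dirs_count(sb, gdp) + delta);

	if (sbi->s_log_groups_per_flex) {
		ext4_group_t f = ext4_flex_group(sbi, group);

		if (delta > 0)
			atomic_inc(&sbi->s_flex_groups[f].used_dirs);
		else
			atomic_dec(&sbi->s_flex_groups[f].used_dirs);
	}
}

Callers in ext4_claim_inode() and ext4_free_inode() would then just pass +1 or
-1 and couldn't forget one of the two counters.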

There also seems to be some inconsistency about when we update the flex
grp vs the group descriptor, but I may be reading things wrong; ext4_new_inode
decrements the flex group free inode count, but ext4_claim_inode decrements
the gdp free inode count? I may be missing something there.
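
In call-map form, what I think I'm seeing (from a quick read, so take this
with a grain of salt):

ext4_new_inode()
  ext4_claim_inode()
    ext4_free_inodes_set(sb, gdp, ...)              <- gdp free inode count
  atomic_dec(&sbi->s_flex_groups[f].free_inodes)    <- flex group free inode count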

Anyway - does this make things behave more as expected?

-------- patch follows ----------

When used_dirs was introduced for the flex_groups struct, it looks
like the accounting was not put into place properly, in some places
manipulating free_inodes rather than used_dirs.


Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f3624ea..3a5c7ec 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -268,7 +268,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 			ext4_group_t f;
 
 			f = ext4_flex_group(sbi, block_group);
-			atomic_dec(&sbi->s_flex_groups[f].free_inodes);
+			atomic_dec(&sbi->s_flex_groups[f].used_dirs);
 		}
 
 	}
@@ -779,7 +779,7 @@ static int ext4_claim_inode(struct super_block *sb,
 		if (sbi->s_log_groups_per_flex) {
 			ext4_group_t f = ext4_flex_group(sbi, group);
 
-			atomic_inc(&sbi->s_flex_groups[f].free_inodes);
+			atomic_inc(&sbi->s_flex_groups[f].used_dirs);
 		}
 	}
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);


2010-03-23 05:13:20

by djwong

Subject: Re: [PATCH] fix up flex groups used_dirs manipulation

On Mon, Mar 22, 2010 at 10:34:33PM -0500, Eric Sandeen wrote:
> Darrick J. Wong wrote:
>
> ...
> > Unfortunately, this second behavior means that the "find the least full
> > blockgroup" code can use stale data in its comparisons. Am I correct that
> > something is wrong here, or have I misinterpreted the code? Is it /supposed/
> > to be the case that used_dirs reflects the number of directories in the
> > blockgroup at *mount time* and not at the current time?
> >
> This does seem weird; the flex_group dir counters are indeed only updated
> at mount time:
>
> ext4_fill_super
> ext4_fill_flex_info
> atomic_add(ext4_used_dirs_count(sb, gdp),
> &sbi->s_flex_groups[flex_group].used_dirs);
>
> and yet it's read repeatedly in get_orlov_stats:
>
> 2 ialloc.c get_orlov_stats 430 stats->used_dirs = atomic_read(&flex_group[g].used_dirs);
>
> I think this patch:
>
> commit 7d39db14a42cbd719c7515b9da8f85a2eb6a0633
> [PATCH] ext4: Use struct flex_groups to calculate get_orlov_stats()
>
> "missed" a bit, maybe a cut and paste error:
>
> @@ -267,6 +267,13 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> if (is_directory) {
> count = ext4_used_dirs_count(sb, gdp) - 1;
> ext4_used_dirs_set(sb, gdp, count);
> + if (sbi->s_log_groups_per_flex) {
> + ext4_group_t f;
> +
> + f = ext4_flex_group(sbi, block_group);
> + atomic_dec(&sbi->s_flex_groups[f].free_inodes);
> + }
>
> why would we be decrementing free inodes in free_inode? And then later
> in the function we atomic_inc it again. Very odd, and likely a thinko.
>
> I think the following patch fixes it up, although it seems like we should
> probably introduce (another) wrapper to set these counts in the gdp as
> well as the flex groups if they are present, so we don't always have
> to remember to manually hit both.
>
> There also seems to be some inconsistency about when we update the flex
> grp vs the group descriptor, but I may be reading things wrong; ext4_new_inode
> decrements the flex group free inode count, but ext4_claim_inode decrements
> the gdp free inode count? I may be missing something there.
>
> Anyway - does this make things behave more as expected?

Offhand, it looks like this works, so:
Acked-By: Darrick J. Wong <[email protected]>

--D

2010-03-24 01:31:53

by Theodore Ts'o

Subject: Re: [PATCH] fix up flex groups used_dirs manipulation

I've added this to the ext4 patch queue.

Yeah, I screwed up when I implemented the new flex_bg based allocator
last year. Thanks for catching this!

- Ted

2010-03-24 13:57:50

by Eric Sandeen

Subject: Re: [PATCH] fix up flex groups used_dirs manipulation

[email protected] wrote:
> I've added this to the ext4 patch queue.

thanks. probably a -stable candidate too, whenever that's ready
to go ...

-Eric

> Yeah, I screwed up when I implemented the new flex_bg based allocator
> last year. Thanks for catching this!
>
> - Ted