2007-06-29 22:11:07

by Jose R. Santos

Subject: [RFC] BIG_BG vs extended META_BG in ext4

Hi folks,

I've been looking at getting around some of the limitations imposed by
the block groups and was wondering what people's thoughts are about
implementing this using either bigger block groups or storing the
bitmaps and inode tables outside of the block groups.

I think the BIG_BG feature is better suited to the design philosophy of
ext2/3. Since all the important meta-data is easily accessible thanks
to the static filesystem layout, I would expect easier fsck
recovery. This should also provide some performance improvements
for both extents (allowing each extent to be larger than 128M) as well
as fsck, since bitmaps would be placed closer together.
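
The 128M figure above can be reproduced with a little arithmetic. The following is only a sketch, assuming a group whose block bitmap occupies exactly one block and a 4 KiB block size; the function names are hypothetical and not taken from the kernel or e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds 8 bits per byte and each bit tracks one block,
 * so a group with a single-block bitmap spans 8 * block_size blocks. */
static uint64_t blocks_per_group(uint32_t block_size)
{
	/* sketch only: require a power-of-two block size of at least 1 KiB */
	assert(block_size >= 1024 && (block_size & (block_size - 1)) == 0);
	return 8ULL * block_size;
}

/* Bytes of disk covered by one such group. */
static uint64_t group_size_bytes(uint32_t block_size)
{
	return blocks_per_group(block_size) * block_size;
}
```

With 4096-byte blocks this gives 32768 blocks per group, or 128 MiB, which is the ceiling on a single contiguous run of blocks inside one group.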

An extended version of metadata block group could provide better
performance improvements during fsck time since we could pack all of
the filesystem bitmaps together. Having the inode tables separated
from the block groups could mean that we could implement dynamic inodes
in the future as well. This feature seems like it would be more
invasive for e2fsprogs at first glance (at least for fsck). Also, with
no metadata in the block groups, there is essentially no need to have a
concept of block groups anymore which would mean that this is a
completely different filesystem layout compared to ext2/3.

Since I don't have much experience with ext4 development, I was wondering
if anybody had an opinion as to which of these two methods would
better serve the needs of the intended users, and which one would be
worth prototyping first.

Comments?


-JRS


2007-06-30 05:51:27

by Andreas Dilger

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Jun 29, 2007 17:09 -0500, Jose R. Santos wrote:
> I think the BIG_BG feature is better suited to the design philosophy of
> ext2/3. Since all the important meta-data is easily accessible thanks
> to the static filesystem layout, I would expect for easier fsck
> recovery. This should also provide with some performance improvements
> for both extents (allowing each extent to be larger than 128M) as well
> as fsck since bitmaps would be placed closer together.
>
> An extended version of metadata block group could provide better
> performance improvements during fsck time since we could pack all of
> the filesystem bitmaps together. Having the inode tables separated
> from the block groups could mean that we could implement dynamic inodes
> in the future as well. This feature seems like it would be more
> invasive for e2fsprogs at first glance (at least for fsck). Also, with
> no metadata in the block groups, there is essentially no need to have a
> concept of block groups anymore which would mean that this is a
> completely different filesystem layout compared to ext2/3.
>
> Since I have not much experience with ext4 development, I was wondering
> if anybody had any opinion as to which of these two methods would
> better serve the need of the intended users and see which one would be
> worth to prototype first.

I don't think there is actually any fundamental difference between these
proposals. The reality is that we cannot change the semantics of the
META_BG flag at this point, since both e2fsprogs and ext3/ext4 in the
kernel understand META_BG to mean only "group descriptor backups are
in groups {0, 1, last} of the metagroup" and nothing else.

If we want to allow the bitmaps and inode table outside the group they
represent then this needs to be a separate feature flag, and we may as
well include the additional improvement of the BIG_BG feature at the
same time. I don't think this is really any reason to claim there is "no
need to have a concept of block groups".

Also note that e2fsprogs already reserves the bg_free_*_count_hi fields for
BIG_BG in the expanded group descriptors, though there is no official
definition for BIG_BG:

struct ext4_group_desc
{
	[ ext3_group_desc ]
	__u32	bg_block_bitmap_hi;	/* Blocks bitmap block MSB */
	__u32	bg_inode_bitmap_hi;	/* Inodes bitmap block MSB */
	__u32	bg_inode_table_hi;	/* Inodes table block MSB */
	__u16	bg_free_blocks_count_hi;/* Free blocks count MSB */
	__u16	bg_free_inodes_count_hi;/* Free inodes count MSB */
	__u16	bg_used_dirs_count_hi;	/* Directories count MSB */
	__u16	bg_pad;
	__u32	bg_reserved2[3];
};
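
To illustrate how the _hi fields would extend the existing counters, here is a minimal userspace sketch. It assumes the first 32 bytes follow the ext3 descriptor layout and that all values have already been converted from on-disk little-endian to host byte order; the struct name and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the 64-byte descriptor (host byte order assumed). */
struct group_desc_sketch {
	/* first 32 bytes: the ext3-style descriptor */
	uint32_t bg_block_bitmap;
	uint32_t bg_inode_bitmap;
	uint32_t bg_inode_table;
	uint16_t bg_free_blocks_count;
	uint16_t bg_free_inodes_count;
	uint16_t bg_used_dirs_count;
	uint16_t bg_pad0;
	uint32_t bg_reserved[3];
	/* second 32 bytes: the high halves shown above */
	uint32_t bg_block_bitmap_hi;
	uint32_t bg_inode_bitmap_hi;
	uint32_t bg_inode_table_hi;
	uint16_t bg_free_blocks_count_hi;
	uint16_t bg_free_inodes_count_hi;
	uint16_t bg_used_dirs_count_hi;
	uint16_t bg_pad;
	uint32_t bg_reserved2[3];
};

/* Free block count as a 32-bit value: low 16 bits plus the _hi half. */
static uint32_t desc_free_blocks(const struct group_desc_sketch *gd)
{
	assert(gd != 0);
	return ((uint32_t)gd->bg_free_blocks_count_hi << 16) |
	       gd->bg_free_blocks_count;
}

/* Block bitmap location as a 64-bit block number. */
static uint64_t desc_block_bitmap(const struct group_desc_sketch *gd)
{
	assert(gd != 0);
	return ((uint64_t)gd->bg_block_bitmap_hi << 32) | gd->bg_block_bitmap;
}
```

With a 16-bit low half alone the free count tops out at 65535; the extra half is what would let a group larger than that report its free blocks.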



Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-06-30 14:24:57

by Mingming Cao

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sat, 2007-06-30 at 01:51 -0400, Andreas Dilger wrote:
> On Jun 29, 2007 17:09 -0500, Jose R. Santos wrote:
> > I think the BIG_BG feature is better suited to the design philosophy of
> > ext2/3. Since all the important meta-data is easily accessible thanks
> > to the static filesystem layout, I would expect for easier fsck
> > recovery. This should also provide with some performance improvements
> > for both extents (allowing each extent to be larger than 128M) as well
> > as fsck since bitmaps would be placed closer together.
> >
> > An extended version of metadata block group could provide better
> > performance improvements during fsck time since we could pack all of
> > the filesystem bitmaps together. Having the inode tables separated
> > from the block groups could mean that we could implement dynamic inodes
> > in the future as well. This feature seems like it would be more
> > invasive for e2fsprogs at first glance (at least for fsck). Also, with
> > no metadata in the block groups, there is essentially no need to have a
> > concept of block groups anymore which would mean that this is a
> > completely different filesystem layout compared to ext2/3.
> >
> > Since I have not much experience with ext4 development, I was wondering
> > if anybody had any opinion as to which of these two methods would
> > better serve the need of the intended users and see which one would be
> > worth to prototype first.
>
> I don't think there is actually any fundamental difference between these
> proposals.

I agree. The more I think about the extended META_BG, the more I think
it's pretty much the same as BIG_BG. The only difference is that extended
META_BG removes the restriction that all of the filesystem's block group
descriptors have to be stored in the first block group. Thus the volume
size reachable by online resize doesn't have to depend on the block group
size.


> The reality is that we cannot change the semantics of the
> META_BG flag at this point, since both e2fsprogs and ext3/ext4 in the
> kernel understand META_BG to mean only "group descriptor backups are
> in groups {0, 1, last} of the metagroup" and nothing else.
>
> If we want to allow the bitmaps and inode table outside the group they
> represent then this needs to be a separate feature flag, and we may as
> well include the additional improvement of the BIG_BG feature at the
> same time. I don't think this really any reason to claim there is "no
> need to have a concept of block groups".
>
> Also note that e2fsprogs already reserves the bg_free_*_bg fields for
> BIG_BG in the expanded group descriptors, though there is no official
> definition for BIG_BG:
>
> struct ext4_group_desc
> {
> [ ext3_group_desc ]
> __u32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
> __u32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
> __u32 bg_inode_table_hi; /* Inodes table block MSB */
> __u16 bg_free_blocks_count_hi;/* Free blocks count MSB */
> __u16 bg_free_inodes_count_hi;/* Free inodes count MSB */
> __u16 bg_used_dirs_count_hi; /* Directories count MSB */
> __u16 bg_pad;
> __u32 bg_reserved2[3];
> };
>
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>

2007-07-01 04:40:25

by Jose R. Santos

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sat, 30 Jun 2007 01:51:25 -0400
Andreas Dilger <[email protected]> wrote:
> I don't think there is actually any fundamental difference between these
> proposals. The reality is that we cannot change the semantics of the
> META_BG flag at this point, since both e2fsprogs and ext3/ext4 in the
> kernel understand META_BG to mean only "group descriptor backups are
> in groups {0, 1, last} of the metagroup" and nothing else.

Agreed. I call it extended META_BG for lack of a better name, but a new
feature flag will be required.

> If we want to allow the bitmaps and inode table outside the group they
> represent then this needs to be a separate feature flag, and we may as
> well include the additional improvement of the BIG_BG feature at the
> same time. I don't think this really any reason to claim there is "no
> need to have a concept of block groups".

Well, when I think about block groups, it seems to me that a block group
is basically a range of blocks with some blocks dedicated to holding
important meta data. If you remove the meta data, then all that is left
is a range of blocks with some backup data scattered around specific
locations on the disk. Of course, my definition of what a block group is
could just be wrong. :)

We could blur the difference between these two features though.

> Also note that e2fsprogs already reserves the bg_free_*_bg fields for
> BIG_BG in the expanded group descriptors, though there is no official
> definition for BIG_BG:
>
> struct ext4_group_desc
> {
> [ ext3_group_desc ]
> __u32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
> __u32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
> __u32 bg_inode_table_hi; /* Inodes table block MSB */
> __u16 bg_free_blocks_count_hi;/* Free blocks count MSB */
> __u16 bg_free_inodes_count_hi;/* Free inodes count MSB */
> __u16 bg_used_dirs_count_hi; /* Directories count MSB */
> __u16 bg_pad;
> __u32 bg_reserved2[3];
> };
>
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>

Thanks for the pointer.

-JRS

2007-07-01 04:41:23

by Jose R. Santos

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sat, 30 Jun 2007 11:06:16 -0400
Laurent Vivier <[email protected]> wrote:

> On 29 June 07 at 18:09, Jose R. Santos wrote:
> Hi Jose,

Hi Laurent,

Seems like your emails are not making it to the mailing list. I got
them fine though.

> Thank you for the question ;-)
>
> BIG_BG allows to limit the number of groups (at least in the group
> counter).
> IMHO, I think it could be important in some cases.

Yes, I think bigger block groups will benefit extents a great deal,
since not only can we have larger extents, but I also believe that as the
filesystem ages, small block groups reduce the chances of finding a large
number of contiguous blocks.

> For instance, if we keep the same inode table allocation policy, we
> divide the total number of inodes in the FS by the total number of
> groups.
> For the moment, number of inodes < 2^32, and if we have number of block
> groups > 2^32 the number of inodes per group is 0... is META_BG able
> to manage this case?

Good point. It is a scenario that needs to be looked at, although I
sincerely hope that we get 64-bit inodes implemented by the time
storage devices get that big. ;)
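
The zero-inodes-per-group case is just integer division. A small sketch, assuming the policy described in the quoted text (total inodes divided evenly across groups); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Even split of the filesystem's inodes across its block groups.
 * Integer division truncates, so more groups than inodes yields 0. */
static uint64_t inodes_per_group(uint64_t inode_count, uint64_t group_count)
{
	assert(group_count > 0);
	return inode_count / group_count;
}
```

For example, with a 32-bit inode count (at most 2^32 - 1 inodes) and 2^33 groups, the result is 0, so no group could hold an inode table at all.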

> With META_BG, a 2^48 blocks FS will have 2^48 / 2^15 = 2^33 groups
> (at 2^15 blocks per group with 4 KiB blocks).
> Perhaps it could be interesting to have fewer groups?

Agree...

> With less groups, we load less group descriptors in memory, we have
> less I/O to read bitmap and inode array (because we manage less group
> descriptors again, because we load bigger bitmap and array in one time)

Presumably, we would still need to access the same amount of data, but
latencies should be reduced since we could do larger IOs and fewer seeks
to read the bitmaps. I also wonder if there are benefits in terms of
locality to having the bitmaps closer to their blocks vs having them far
away like in xMETA_BG.

> Regards,
> Laurent

-JRS

2007-07-01 12:30:59

by Theodore Ts'o

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sat, Jun 30, 2007 at 11:39:08PM -0500, Jose R. Santos wrote:
> On Sat, 30 Jun 2007 01:51:25 -0400
> Andreas Dilger <[email protected]> wrote:
> > I don't think there is actually any fundamental difference between these
> > proposals. The reality is that we cannot change the semantics of the
> > META_BG flag at this point, since both e2fsprogs and ext3/ext4 in the
> > kernel understand META_BG to mean only "group descriptor backups are
> > in groups {0, 1, last} of the metagroup" and nothing else.
>
> Agree. I call it extended META_BG for lack of a better name, but a new
> feature flag will be required.

It was the intention that META_BG include allowing the bitmap and
inode tables to range anywhere outside of the block group, but that
never got coded. It would be confusing though if we relaxed it
without adding a feature bit, and I agree that we might as well
overload the BIG_BG flag to indicate this feature.

The fact that BIG_BG requires contiguous blocks for the bitmaps when
a group exceeds blocksize*8 blocks still concerns me a minor amount,
especially given the hopeful inclusion of kernel patches that allow
blocksize > pagesize. Furthermore, I still wonder whether we will want to
make blockgroups that much bigger (since reducing the allocation groups is
not necessarily a smart thing; we will need to do some benchmarks with
filesystem aging to see how this affects antifragmentation efforts),
but the complexity engendered by adding BIG_BG isn't that bad (again,
my only concern is with the contiguous bitmap blocks requirements).
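
To make the contiguity concern concrete: once a group covers more than blocksize*8 blocks, its block bitmap no longer fits in one block. A sketch of the count, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Number of bitmap blocks needed to track blocks_in_group blocks,
 * at 8 * block_size bits per bitmap block (rounded up). */
static uint64_t bitmap_blocks_needed(uint64_t blocks_in_group,
				     uint32_t block_size)
{
	uint64_t bits_per_block = 8ULL * block_size;

	assert(block_size > 0);
	return (blocks_in_group + bits_per_block - 1) / bits_per_block;
}
```

With 4 KiB blocks, a group of 2^20 blocks (4 GiB) would need 32 bitmap blocks, and under the requirement discussed above those 32 blocks would have to be found (or re-found by e2fsck) as one contiguous run.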

- Ted

2007-07-01 14:49:50

by Jose R. Santos

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sun, 1 Jul 2007 08:30:54 -0400
Theodore Tso <[email protected]> wrote:
> It was the intention that META_BG include allowing the bitmap and
> inode tables to range anywhere outside of the block group, but that
> never got coded. It would be confusing though if we relaxed it
> without adding a feature bit, and I agree that we might as well
> overload the BIG_BG flag to indicate this feature.
>
> The fact that BIG_BG requires contiguous blocks for the bitmaps when
> they exceed blocksize*8 blocks still concerns me a minor amount, and

Is your concern due to being unable to find contiguous blocks in the
case that a bad disk area is in one of the bitmap blocks? One thing we
can do is try to search for another set of contiguous blocks and if we
fail to find one, we can flag the block group and move to an indirect
block approach to allocating the bitmaps. At this point, we do lose
some of the performance benefits of BIG_BG, but we would still be able
to use the block group.

> given the hopeful inclusion of kernel patches that allow blocksize >

One thing that concerns me about blocksize larger than page size is
that it requires too much planning from the person creating the
filesystem. While larger blocksizes do address some of the issues of
the block group size limits, they do so in a less flexible manner.
With something like BIG_BG, we can provide larger extents for large
files while still keeping disk space consumption under control with
small files.

> pagesize. Furthermore, I still wonder whether we will want to make
> blockgroups that much bigger (since reducing the allocation groups is
> not necessarily a smart thing; we will need to do some benchmarks with
> filesystem aging to see how this affects antifragmentation efforts),
> but the complexity engendered by adding BIG_BG isn't that bad (again,
> my only concern is with the contiguous bitmap blocks requirements).

Point taken... I'll run some aged-filesystem tests to see whether there
are real benefits to be gained in reducing fragmentation.

>
> - Ted

-JRS

2007-07-01 16:31:56

by Andreas Dilger

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Jun 30, 2007 23:40 -0500, Jose R. Santos wrote:
> Yes, I think bigger block groups will benefit extents a great deal
> since not only can we have larger extents, but I believe that as the
> filesystem ages the chances of getting large number contiguous block can
> be reduce with small block groups.

This turns out not to be true, and in fact we need to change the unwritten
extents patch a tiny bit. The reason is that we have limited the maximum
extent size to 2^15-1 = 32767 blocks. The current maximum for the number
of blocks in a group is 65528, so that we can always fit the "free blocks"
count into a __u16 if the bitmaps and inode table are moved out of the
group. Moving the bitmaps and itable will hit the max extent length.

There are still other benefits to moving the metadata together.

Now, the one minor problem with the unwritten extent patches is that by
using the high bit of the ee_len this limits the extent length to 2^15-1
blocks, but it would be MUCH better if this limit was 2^15 blocks and
it fit evenly into an empty group, consecutive extents were aligned, etc.
It also doesn't make sense to have an uninitialized 0-length extent, so
I think the unwritten extent (fallocate) patch needs to special case
the ee_len = 32768 (0x8000) to be a "regular" extent instead of "unwritten".

> > With less groups, we load less group descriptors in memory, we have
> > less I/O to read bitmap and inode array (because we manage less group
> > descriptors again, because we load bigger bitmap and array in one time)
>
> Presumably, we would still need to access the same amount data but
> latencies should be reduce since we could do larger IO's and less seeks
> to read the bitmaps. I also wonder if there are benefits in terms of
> locality to having the bitmaps closer to its blocks vs having them far
> away like in xMETA_BG.

Having the bitmaps together will fix this independent of "BIG_BG".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-02 14:42:01

by Jose R. Santos

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sun, 1 Jul 2007 12:31:53 -0400
Andreas Dilger <[email protected]> wrote:

> On Jun 30, 2007 23:40 -0500, Jose R. Santos wrote:
> > Yes, I think bigger block groups will benefit extents a great deal
> > since not only can we have larger extents, but I believe that as the
> > filesystem ages the chances of getting large number contiguous
> > block can be reduce with small block groups.
>
> This turns out not to be true, and in fact we need to change the
> unwritten extents patch a tiny bit. The reason is that we have
> limited the maximum extent size to 2^15-1 = 32767 blocks. The
> current maximum for the number of blocks in a group is 65528, so that
> we can always fit the "free blocks" count into a __u16 if the bitmaps
> and inode table are moved out of the group. Moving the bitmaps and
> itable will hit the max extent length.

I missed this while looking at the extent code. I thought that the
extent limit was caused by being unable to allocate enough contiguous
blocks due to the small block groups.

Are there no plans to support very large extents? It seems like this
would be a good reason to support either BIG_BG or xMETA_BG. Aside
from some possible alignment issues with the structure, what else would
keep ee_len from being larger?

> There are still other benefits to moving the metadata together.
>
> Now, the one minor problem with the unwritten extent patches is that
> by using the high bit of the ee_len this limits the extent length to
> 2^15-1 blocks, but it would be MUCH better if this limit was 2^15
> blocks and it fit evenly into an empty group, consecutive extents
> were aligned, etc. It also doesn't make sense to have an
> uninitialized 0-length extent, so I think the unwritten extent
> (fallocate) patch needs to special case the ee_len = 32768 to be a
> "regular" extent instead of "unwritten"
>
> > > With less groups, we load less group descriptors in memory, we
> > > have less I/O to read bitmap and inode array (because we manage
> > > less group descriptors again, because we load bigger bitmap and
> > > array in one time)
> >
> > Presumably, we would still need to access the same amount data but
> > latencies should be reduce since we could do larger IO's and less
> > seeks to read the bitmaps. I also wonder if there are benefits in
> > terms of locality to having the bitmaps closer to its blocks vs
> > having them far away like in xMETA_BG.
>
> Having the bitmaps together will fix this independent of "BIG_BG".

I was referring to the locality of the block bitmaps and the actual free
blocks. If we move the block bitmaps out of the block group, wouldn't we
be promoting larger seeks on operations that heavily write to both the
bitmaps and blocks?

This would not be a problem for the inode bitmap and itables since those
would be moved together in xMETA_BG.

Thanks

-JRS

2007-07-02 15:49:50

by Theodore Ts'o

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Sun, Jul 01, 2007 at 09:48:33AM -0500, Jose R. Santos wrote:
> Is your concern due to being unable to find contiguous block in the
> case that a bad disk area is in one of the bitmap blocks? One thing we
> can do is try to search for another set of contiguous blocks and if we
> fail to find one, we can flag the block group and move to an indirect
> block approach to allocating the bitmaps. At this point, we do lose
> some of the performance benefits of BIG_BG, but we would still be able
> to use the block group.

Yes, my concern is what we might need to do if for some reason e2fsck
needs to reallocate the bitmap blocks. I don't think an indirect
block scheme is the right approach, though; we're adding a lot of
complexity for a case that would probably be used only very, very
rarely.

My proposal (as we discussed in the call) is to implement BIG_BG as
meaning the following:

1) Implementations must understand and use the s_desc_size
superblock field to determine whether block group descriptors
are the old 32 bytes or the newer 64 bytes format.

2) Implementations must support the newer ext4_group_desc
format in particular to support bg_free_blocks_count_hi and
bg_free_inodes_count_hi

3) Implementations will relax constraints on where the
superblock, bitmaps, and inode tables for a particular block
group will be stored.
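
Point 1 above might look roughly like the following in a userspace tool. This is only a sketch: treating an s_desc_size of 0 as "old format" is an assumption, and the helper and struct names are hypothetical rather than anything from the kernel or e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_SIZE_OLD 32u	/* ext3-style descriptor */
#define DESC_SIZE_NEW 64u	/* expanded ext4_group_desc */

/* Pick the descriptor size from the superblock's s_desc_size value.
 * Assumption: 0 means the field was never set, i.e. the old format. */
static uint32_t desc_size_from_sb(uint16_t s_desc_size)
{
	if (s_desc_size == 0 || s_desc_size == DESC_SIZE_OLD)
		return DESC_SIZE_OLD;
	if (s_desc_size == DESC_SIZE_NEW)
		return DESC_SIZE_NEW;
	return 0;	/* not handled by this sketch */
}

struct desc_location {
	uint64_t table_block;	/* block index within the descriptor table */
	uint32_t offset;	/* byte offset of the descriptor in that block */
};

/* Where the descriptor for a given group sits in the descriptor table. */
static struct desc_location locate_desc(uint64_t group, uint32_t block_size,
					uint32_t desc_size)
{
	struct desc_location loc;
	uint32_t per_block;

	assert(desc_size != 0 && block_size % desc_size == 0);
	per_block = block_size / desc_size;
	loc.table_block = group / per_block;
	loc.offset = (uint32_t)(group % per_block) * desc_size;
	return loc;
}
```

The point of the sketch is that every descriptor lookup has to go through the size read from the superblock; code that hard-codes 32 bytes would read the wrong offsets on a filesystem using the 64-byte format.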

So with that, we can experiment with what size block groups really
make sense, versus using the extended metablockgroup idea, or possibly
doing both.

- Ted

2007-07-02 17:12:57

by Mingming Cao

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Mon, 2007-07-02 at 11:49 -0400, Theodore Tso wrote:
> On Sun, Jul 01, 2007 at 09:48:33AM -0500, Jose R. Santos wrote:
> > Is your concern due to being unable to find contiguous block in the
> > case that a bad disk area is in one of the bitmap blocks? One thing we
> > can do is try to search for another set of contiguous blocks and if we
> > fail to find one, we can flag the block group and move to an indirect
> > block approach to allocating the bitmaps. At this point, we do lose
> > some of the performance benefits of BIG_BG, but we would still be able
> > to use the block group.
>
> Yes, my concern is what we might need to do if for some reason e2fsck
> needs to reallocate the bitmap blocks. I don't think an indirect
> block scheme is the right approach, though; we're adding a lot of
> complexity for a case that probably wouldn't be used but very, very
> rarely.
>
> My proposal (as we discussed in the call) is to implement BIG_BG as
> meaning the following:
>
> 1) Implementations must understand and use the s_desc_size
> superblock field to determine whether block group descriptors
> are the old 32 bytes or the newer 64 bytes format.
>
> 2) Implementations must support the newer ext4_group_desc
> format in particular to support bg_free_blocks_count_hi and
> bg_free_inodes_count_hi
>
> 3) Implementations will relax constraints on where the
> superblock, bitmaps, and inode tables for a particular block
> group will be stored.
>

I agree.

> So with that, we can experiment with what size block groups really
> make sense, versus using the extended metablockgroup idea, or possibly
> doing both.
>

How about incorporating some of the chunkfs ideas into this BIG_BG or
extended metablockgroups? The original block group size (128MB) is
probably too small and would result in many continuation inodes. By
enlarging the size of groups via BIG_BG or extended metablockgroups, we
could add a dirty/clean bit to allow partial/parallel fsck, and things
like that. Any thoughts on this?


Mingming

2007-07-03 17:56:00

by Andreas Dilger

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Jul 02, 2007 09:39 -0500, Jose R. Santos wrote:
> On Sun, 1 Jul 2007 12:31:53 -0400
> Andreas Dilger <[email protected]> wrote:
> > This turns out not to be true, and in fact we need to change the
> > unwritten extents patch a tiny bit. The reason is that we have
> > limited the maximum extent size to 2^15-1 = 32767 blocks. The
> > current maximum for the number of blocks in a group is 65528, so that
> > we can always fit the "free blocks" count into a __u16 if the bitmaps
> > and inode table are moved out of the group. Moving the bitmaps and
> > itable will hit the max extent length.
>
> I miss this while looking at the extent code. I thought that the
> extents limit was caused by being unable to allocate enough contiguous
> blocks due to the small block groups.
>
> Are there no plans to support very large extents?

At some point, there may be a desire to move to full 64-bit extents in
order to allow gigantic files (over 2^32 blocks = 16TB@4kb, 64TB@16kb),
but I don't think this will happen for a while yet.

> Aside from some possible alignment issues with the structure, what else
> would keep ee_len from being larger?

There just isn't any free space in the extents structure for ee_len.

> > There are still other benefits to moving the metadata together.
> >
> > Now, the one minor problem with the unwritten extent patches is that
> > by using the high bit of the ee_len this limits the extent length to
> > 2^15-1 blocks, but it would be MUCH better if this limit was 2^15
> > blocks and it fit evenly into an empty group, consecutive extents
> > were aligned, etc. It also doesn't make sense to have an
> > uninitialized 0-length extent, so I think the unwritten extent
> > (fallocate) patch needs to special case the ee_len = 32768 to be a
> > "regular" extent instead of "unwritten"

The extent-to-group alignment problem is definitely an issue once
we get past 2^15 blocks per group and/or move all the metadata out of
the group. Otherwise, we will be stuck allocating 2^15-1 or 2^15-256
blocks per extent, and mballoc will not like this very much.

The change I'm asking for is fairly simple at this stage, but would be
much more complex later:

-#define EXT_MAX_LEN	((1 << 15) - 1)
+#define EXT_MAX_LEN	(1 << 15)

 static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
 {
+	/* a length of 0 or 0x8000 cannot be marked uninitialized */
+	BUG_ON((le16_to_cpu(ext->ee_len) & ~0x8000) == 0);
 	ext->ee_len |= cpu_to_le16(0x8000);
 }

 static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
 {
-	return (int)(le16_to_cpu(ext->ee_len) & 0x8000);
+	return (le16_to_cpu(ext->ee_len) > 0x8000);
 }

 static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
 {
-	return (int)(le16_to_cpu(ext->ee_len) & 0x7FFF);
+	return (le16_to_cpu(ext->ee_len) <= 0x8000 ? le16_to_cpu(ext->ee_len) :
+		(le16_to_cpu(ext->ee_len) - 0x8000));
 }

Hmm, but now I'm not sure how to mark an uninitialized extent of length
2^15 blocks... I suppose it would be possible to limit uninitialized
extents to 2^14 blocks (since uninitialized extents will be a much rarer
case than initialized extents), or teach mballoc/delalloc to allocate
even-sized extents like (2^15-s_raid_stripe) blocks or something.
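
The encoding in the patch fragment above can be tried out in userspace. The following is a sketch only: it works on a plain host-order uint16_t instead of the on-disk little-endian ee_len, and all names are hypothetical. It shows that under this encoding 0x8000 reads back as an initialized extent of 2^15 blocks, while the longest length that can still be marked uninitialized is 2^15-1.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INIT_MAX_LEN	(1u << 15)		/* 32768 blocks */
#define SKETCH_UNINIT_MAX_LEN	((1u << 15) - 1)	/* 32767 blocks */

/* Values above 0x8000 are uninitialized; 0x8000 itself is initialized. */
static int sketch_is_uninitialized(uint16_t ee_len)
{
	return ee_len > 0x8000;
}

/* Real length in blocks, whichever state the extent is in. */
static unsigned sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= 0x8000 ? ee_len : ee_len - 0x8000u;
}

/* Mark an extent uninitialized by setting the high bit. Lengths of 0
 * and 0x8000 are rejected: both would read back as initialized. */
static uint16_t sketch_mark_uninitialized(uint16_t ee_len)
{
	assert((ee_len & 0x7FFF) != 0);
	return (uint16_t)(ee_len | 0x8000);
}
```

So an initialized extent can cover a full 2^15-block group, but an uninitialized one falls one block short, which is the alignment problem raised in the paragraph above.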

> I was referring to the locality of block bit maps and the actual free
> blocks. If we move the block bitmaps out of block group, wouldn't we
> be promoting larger seeks on operations that heavily write to both the
> bitmaps and blocks?

I don't think this is necessarily true. The block bitmaps are usually
read from disk only rarely and cached after that. When they are written
they are written first to the journal and only later to disk, so there is
little coherency between the data writes and the bitmap writes. I would
expect that putting the metadata together would _improve_ performance
because the journal checkpoint could avoid many seeks when flushing the
bitmap/itable to disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-05 06:56:59

by Valerie Henson

Subject: Re: [RFC] BIG_BG vs extended META_BG in ext4

On Mon, Jul 02, 2007 at 10:12:57AM -0400, Mingming Cao wrote:
>
> How about incorporating some of the chunkfs ideas into this BIG_BG or
> extended metablockgroups? The original block group size (128MB) is
> probably too small and would result in many continuation inodes. By
> enlarging the size of groups via BIG_BG or extended metablockgroups, we
> could add a dirty/clean bit to allow partial/parallel fsck, and things
> like that. Any thoughts on this?

We looked into this in the 2006 OLS paper
(http://infohost.nmt.edu/~val/review/ext2fsck.pdf) and concluded that
it's pretty hard to do anything useful on a per-block group basis.
The only metadata that could be checked and repaired on a per-bg basis
are the block group summaries of free/used inodes and free/used
blocks. So if we had a situation in which the metadata in the block
group was consistent in all ways (link counts, directory entries,
etc.) except that we hadn't updated the bg summary info, then it would
be useful - but I don't think that happens very often. Anything more
useful than that will require on-disk format changes; at which point
why restrict ourselves to a bit in the block group descriptor?

-VAL