This is a broken-up set of patches replacing the js/flex-bg branch
that I've been including in the pu series.
The main way that they have been simplified is that the BLOCK_UNINIT
flag only gets set on block groups which contain only their own
metadata and nothing else. That is, if a block group contains
another block group's inode table, et al., then the BLOCK_UNINIT flag
is not set. This is *much* simpler since it means that e2fsck pass 5
isn't forced to try to synthesize the block bitmap for the initial
block groups containing foreign block groups' metadata.
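To make the rule concrete, here is a rough sketch of the predicate
(this helper is not in the patch and the name is made up, but the
descriptor fields and library calls are the real e2fsprogs ones):

static int may_keep_block_uninit(ext2_filsys fs, dgrp_t grp)
{
	blk_t	first = ext2fs_group_first_block(fs, grp);
	blk_t	last = ext2fs_group_last_block(fs, grp);
	dgrp_t	i;

	for (i = 0; i < fs->group_desc_count; i++) {
		struct ext2_group_desc *gd = &fs->group_desc[i];
		blk_t itable_end = gd->bg_inode_table +
			fs->inode_blocks_per_group - 1;
		/* Does group i's metadata land inside group grp? */
		int touches = (gd->bg_block_bitmap >= first &&
			       gd->bg_block_bitmap <= last) ||
			      (gd->bg_inode_bitmap >= first &&
			       gd->bg_inode_bitmap <= last) ||
			      (gd->bg_inode_table <= last &&
			       itable_end >= first);

		if (touches && i != grp)
			return 0;	/* foreign metadata: no BLOCK_UNINIT */
	}
	return 1;
}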
I've done basic tests and it seems to do the right thing, but Jose,
please look it over and see what you think.
- Ted
Change the way we allocate bitmaps and inode tables if the FLEX_BG
feature is used at mke2fs time. It calculates a new offset for the
bitmaps and inode table based on the number of groups that the user
wishes to pack together using the new "-G" option. Creating a
filesystem with 64 block groups in a flex group can be done by:
mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX
Signed-off-by: Jose R. Santos <[email protected]>
Signed-off-by: Valerie Clement <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
lib/ext2fs/alloc_tables.c | 118 +++++++++++++++++++++++++++++++++++++++++++-
lib/ext2fs/initialize.c | 7 +++
misc/mke2fs.8.in | 16 ++++++
misc/mke2fs.c | 35 ++++++++++++-
misc/mke2fs.conf.5.in | 7 +++
5 files changed, 177 insertions(+), 6 deletions(-)
diff --git a/lib/ext2fs/alloc_tables.c b/lib/ext2fs/alloc_tables.c
index 9b4f0e5..d87585b 100644
--- a/lib/ext2fs/alloc_tables.c
+++ b/lib/ext2fs/alloc_tables.c
@@ -27,18 +27,80 @@
#include "ext2_fs.h"
#include "ext2fs.h"
+/*
+ * This routine searches for free blocks that can hold a full set of
+ * bitmaps or inode tables for a flexbg group. Returns the block
+ * number with a correct offset where the bitmaps and inode tables
+ * can be allocated contiguously and in order.
+ */
+static blk_t flexbg_offset(ext2_filsys fs, dgrp_t group, blk_t start_blk,
+ ext2fs_block_bitmap bmap, int offset, int size)
+{
+ int flexbg, flexbg_size, elem_size;
+ blk_t last_blk, first_free = 0;
+ dgrp_t last_grp;
+
+ flexbg_size = 1 << fs->super->s_log_groups_per_flex;
+ flexbg = group / flexbg_size;
+
+ if (size > fs->super->s_blocks_per_group / 8)
+ size = fs->super->s_blocks_per_group / 8;
+
+ /*
+ * Don't do a long search if the previous block
+ * search is still valid.
+ */
+ if (start_blk && group % flexbg_size) {
+ if (size > flexbg_size)
+ elem_size = fs->inode_blocks_per_group;
+ else
+ elem_size = 1;
+ if (ext2fs_test_block_bitmap_range(bmap, start_blk + elem_size,
+ size))
+ return start_blk + elem_size;
+ }
+
+ start_blk = ext2fs_group_first_block(fs, flexbg_size * flexbg);
+ last_grp = group | (flexbg_size - 1);
+ if (last_grp > fs->group_desc_count)
+ last_grp = fs->group_desc_count;
+ last_blk = ext2fs_group_last_block(fs, last_grp);
+
+ /* Find the first available block */
+ if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap,
+ &first_free))
+ return first_free;
+
+ if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size,
+ bmap, &first_free))
+ return first_free;
+
+ return first_free;
+}
+
errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
ext2fs_block_bitmap bmap)
{
errcode_t retval;
blk_t group_blk, start_blk, last_blk, new_blk, blk;
- int j;
+ dgrp_t last_grp;
+ int j, rem_grps, flexbg_size = 0;
group_blk = ext2fs_group_first_block(fs, group);
last_blk = ext2fs_group_last_block(fs, group);
if (!bmap)
bmap = fs->block_map;
+
+ if (EXT2_HAS_INCOMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_INCOMPAT_FLEX_BG) &&
+ fs->super->s_log_groups_per_flex) {
+ flexbg_size = 1 << fs->super->s_log_groups_per_flex;
+ last_grp = group | (flexbg_size - 1);
+ rem_grps = last_grp - group;
+ if (last_grp > fs->group_desc_count)
+ last_grp = fs->group_desc_count;
+ }
/*
* Allocate the block and inode bitmaps, if necessary
@@ -56,6 +118,15 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
} else
start_blk = group_blk;
+ if (flexbg_size) {
+ int prev_block = 0;
+ if (group && fs->group_desc[group-1].bg_block_bitmap)
+ prev_block = fs->group_desc[group-1].bg_block_bitmap;
+ start_blk = flexbg_offset(fs, group, prev_block, bmap,
+ 0, rem_grps);
+ last_blk = ext2fs_group_last_block(fs, last_grp);
+ }
+
if (!fs->group_desc[group].bg_block_bitmap) {
retval = ext2fs_get_free_blocks(fs, start_blk, last_blk,
1, bmap, &new_blk);
@@ -66,6 +137,22 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
return retval;
ext2fs_mark_block_bitmap(bmap, new_blk);
fs->group_desc[group].bg_block_bitmap = new_blk;
+ if (flexbg_size) {
+ dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
+ fs->group_desc[gr].bg_free_blocks_count--;
+ fs->super->s_free_blocks_count--;
+ fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
+ ext2fs_group_desc_csum_set(fs, gr);
+ }
+ }
+
+ if (flexbg_size) {
+ int prev_block = 0;
+ if (group && fs->group_desc[group-1].bg_inode_bitmap)
+ prev_block = fs->group_desc[group-1].bg_inode_bitmap;
+ start_blk = flexbg_offset(fs, group, prev_block, bmap,
+ flexbg_size, rem_grps);
+ last_blk = ext2fs_group_last_block(fs, last_grp);
}
if (!fs->group_desc[group].bg_inode_bitmap) {
@@ -78,11 +165,29 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
return retval;
ext2fs_mark_block_bitmap(bmap, new_blk);
fs->group_desc[group].bg_inode_bitmap = new_blk;
+ if (flexbg_size) {
+ dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
+ fs->group_desc[gr].bg_free_blocks_count--;
+ fs->super->s_free_blocks_count--;
+ fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
+ ext2fs_group_desc_csum_set(fs, gr);
+ }
}
/*
* Allocate the inode table
*/
+ if (flexbg_size) {
+ int prev_block = 0;
+ if (group && fs->group_desc[group-1].bg_inode_table)
+ prev_block = fs->group_desc[group-1].bg_inode_table;
+ group_blk = flexbg_offset(fs, group, prev_block, bmap,
+ flexbg_size * 2,
+ fs->inode_blocks_per_group *
+ rem_grps);
+ last_blk = ext2fs_group_last_block(fs, last_grp);
+ }
+
if (!fs->group_desc[group].bg_inode_table) {
retval = ext2fs_get_free_blocks(fs, group_blk, last_blk,
fs->inode_blocks_per_group,
@@ -91,12 +196,19 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
return retval;
for (j=0, blk = new_blk;
j < fs->inode_blocks_per_group;
- j++, blk++)
+ j++, blk++) {
ext2fs_mark_block_bitmap(bmap, blk);
+ if (flexbg_size) {
+ dgrp_t gr = ext2fs_group_of_blk(fs, blk);
+ fs->group_desc[gr].bg_free_blocks_count--;
+ fs->super->s_free_blocks_count--;
+ fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
+ ext2fs_group_desc_csum_set(fs, gr);
+ }
+ }
fs->group_desc[group].bg_inode_table = new_blk;
}
ext2fs_group_desc_csum_set(fs, group);
-
return 0;
}
diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
index 09e1008..396dd59 100644
--- a/lib/ext2fs/initialize.c
+++ b/lib/ext2fs/initialize.c
@@ -159,6 +159,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
set_field(s_first_meta_bg, 0);
set_field(s_raid_stride, 0); /* default stride size: 0 */
set_field(s_raid_stripe_width, 0); /* default stripe width: 0 */
+ set_field(s_log_groups_per_flex, 0);
set_field(s_flags, 0);
if (super->s_feature_incompat & ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
retval = EXT2_ET_UNSUPP_FEATURE;
@@ -374,6 +375,10 @@ ipg_retry:
* Note that although the block bitmap, inode bitmap, and
* inode table have not been allocated (and in fact won't be
* by this routine), they are accounted for nevertheless.
+ *
+ * If FLEX_BG meta-data grouping is used, only account for the
+ * superblock and group descriptors (the inode tables and
+ * bitmaps will be accounted for when allocated).
*/
super->s_free_blocks_count = 0;
csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
@@ -390,6 +395,8 @@ ipg_retry:
fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
}
numblocks = ext2fs_reserve_super_and_bgd(fs, i, fs->block_map);
+ if (fs->super->s_log_groups_per_flex)
+ numblocks += 2 + fs->inode_blocks_per_group;
super->s_free_blocks_count += numblocks;
fs->group_desc[i].bg_free_blocks_count = numblocks;
diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index 1e9a203..aa068b3 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -26,6 +26,10 @@ mke2fs \- create an ext2/ext3 filesystem
.I blocks-per-group
]
[
+.B \-G
+.I number-of-groups
+]
+[
.B \-i
.I bytes-per-inode
]
@@ -245,6 +249,13 @@ option rather than manipulating the number of blocks per group.)
This option is generally used by developers who
are developing test cases.
.TP
+.BI \-G " number-of-groups"
+Specify the number of block groups that will be packed together to
+create one large virtual block group on an ext4 filesystem. This
+improves meta-data locality and performance on meta-data heavy
+workloads. The number of groups must be a power of 2 and may only be
+specified if the flex_bg filesystem feature is enabled.
+.TP
.BI \-i " bytes-per-inode"
Specify the bytes/inode ratio.
.B mke2fs
@@ -445,6 +456,11 @@ Use hashed b-trees to speed up lookups in large directories.
.B filetype
Store file type information in directory entries.
.TP
+.B flex_bg
+Allow bitmaps and inode tables for a block group to be placed anywhere
+on the storage media (use with the -G option to group meta-data in order
+to create a large virtual block group).
+.TP
.B has_journal
Create an ext3 journal (as if using the
.B \-j
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 61f45aa..e37e510 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -98,8 +98,9 @@ static void usage(void)
fprintf(stderr, _("Usage: %s [-c|-l filename] [-b block-size] "
"[-f fragment-size]\n\t[-i bytes-per-inode] [-I inode-size] "
"[-J journal-options]\n"
- "\t[-N number-of-inodes] [-m reserved-blocks-percentage] "
- "[-o creator-os]\n\t[-g blocks-per-group] [-L volume-label] "
+ "\t[-G meta group size] [-N number-of-inodes]\n"
+ "\t[-m reserved-blocks-percentage] [-o creator-os]\n"
+ "\t[-g blocks-per-group] [-L volume-label] "
"[-M last-mounted-directory]\n\t[-O feature[,...]] "
"[-r fs-revision] [-E extended-option[,...]]\n"
"\t[-T fs-type] [-jnqvFSV] device [blocks-count]\n"),
@@ -1096,6 +1097,7 @@ static void PRS(int argc, char *argv[])
int blocksize = 0;
int inode_ratio = 0;
int inode_size = 0;
+ unsigned long flex_bg_size = 0;
double reserved_ratio = 5.0;
int sector_size = 0;
int show_version_only = 0;
@@ -1180,7 +1182,7 @@ static void PRS(int argc, char *argv[])
}
while ((c = getopt (argc, argv,
- "b:cf:g:i:jl:m:no:qr:s:t:vE:FI:J:L:M:N:O:R:ST:V")) != EOF) {
+ "b:cf:g:G:i:jl:m:no:qr:s:t:vE:FI:J:L:M:N:O:R:ST:V")) != EOF) {
switch (c) {
case 'b':
blocksize = strtol(optarg, &tmp, 0);
@@ -1230,6 +1232,20 @@ static void PRS(int argc, char *argv[])
exit(1);
}
break;
+ case 'G':
+ flex_bg_size = strtoul(optarg, &tmp, 0);
+ if (*tmp) {
+ com_err(program_name, 0,
+ _("Illegal number for flex_bg size"));
+ exit(1);
+ }
+ if (flex_bg_size < 2 ||
+ (flex_bg_size & (flex_bg_size-1)) != 0) {
+ com_err(program_name, 0,
+ _("flex_bg size must be a power of 2"));
+ exit(1);
+ }
+ break;
case 'i':
inode_ratio = strtoul(optarg, &tmp, 0);
if (inode_ratio < EXT2_MIN_BLOCK_SIZE ||
@@ -1638,6 +1654,19 @@ static void PRS(int argc, char *argv[])
if (inode_size == 0)
inode_size = get_int_from_profile(fs_types, "inode_size", 0);
+ if (!flex_bg_size && (fs_param.s_feature_incompat &
+ EXT4_FEATURE_INCOMPAT_FLEX_BG))
+ get_int_from_profile(fs_types, "flex_bg_size", 8);
+ if (flex_bg_size) {
+ if (!(fs_param.s_feature_incompat &
+ EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+ com_err(program_name, 0,
+ _("Flex_bg feature not enabled, so "
+ "flex_bg size may not be specified"));
+ exit(1);
+ }
+ fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
+ }
if (inode_size && fs_param.s_rev_level >= EXT2_DYNAMIC_REV) {
if (inode_size < EXT2_GOOD_OLD_INODE_SIZE ||
diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
index 6734bf3..5dd92d8 100644
--- a/misc/mke2fs.conf.5.in
+++ b/misc/mke2fs.conf.5.in
@@ -301,6 +301,13 @@ specify one on the command line.
.I inode_size
This relation specifies the default inode size if the user does not
specify one on the command line.
+.TP
+.I flex_bg_size
+This relation specifies the number of block groups that will be packed
+together to create one large virtual block group on an ext4 filesystem.
+This improves meta-data locality and performance on meta-data heavy
+workloads. The number of groups must be a power of 2 and may only be
+specified if the flex_bg filesystem feature is enabled.
.SH FILES
.TP
.I /etc/mke2fs.conf
--
1.5.4.1.144.gdfee-dirty
Add superblock definition, and dumpe2fs and debugfs support.
Signed-off-by: Jose R. Santos <[email protected]>
Signed-off-by: Valerie Clement <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
---
debugfs/set_fields.c | 1 +
lib/e2p/ls.c | 3 +++
lib/ext2fs/ext2_fs.h | 5 ++++-
3 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/debugfs/set_fields.c b/debugfs/set_fields.c
index ee51c45..25343f0 100644
--- a/debugfs/set_fields.c
+++ b/debugfs/set_fields.c
@@ -132,6 +132,7 @@ static struct field_set_info super_fields[] = {
{ "mmp_interval", &set_sb.s_mmp_interval, 2, parse_uint },
{ "mmp_block", &set_sb.s_mmp_block, 8, parse_uint },
{ "raid_stripe_width", &set_sb.s_raid_stripe_width, 4, parse_uint },
+ { "log_groups_per_flex", &set_sb.s_log_groups_per_flex, 1, parse_uint },
{ 0, 0, 0, 0 }
};
diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
index b119606..c211dce 100644
--- a/lib/e2p/ls.c
+++ b/lib/e2p/ls.c
@@ -242,6 +242,9 @@ void list_super2(struct ext2_super_block * sb, FILE *f)
if (sb->s_first_meta_bg)
fprintf(f, "First meta block group: %u\n",
sb->s_first_meta_bg);
+ if (sb->s_log_groups_per_flex)
+ fprintf(f, "Flex block group size: %u\n",
+ 1 << sb->s_log_groups_per_flex);
if (sb->s_mkfs_time) {
tm = sb->s_mkfs_time;
fprintf(f, "Filesystem created: %s", ctime(&tm));
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index ad42cf8..f23c8fd 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -564,7 +564,10 @@ struct ext2_super_block {
__u16 s_mmp_interval; /* # seconds to wait in MMP checking */
__u64 s_mmp_block; /* Block for multi-mount protection */
__u32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
- __u32 s_reserved[163]; /* Padding to the end of the block */
+ __u8 s_log_groups_per_flex; /* FLEX_BG group size */
+ __u8 s_reserved_char_pad;
+ __u16 s_reserved_pad; /* Padding to next 32bits */
+ __u32 s_reserved[162]; /* Padding to the end of the block */
};
/*
--
1.5.4.1.144.gdfee-dirty
On Tue, 22 Apr 2008 08:46:18 -0400
"Theodore Ts'o" <[email protected]> wrote:
> Add superblock definition, and dumpe2fs and debugfs support.
Looks good.
> Signed-off-by: Jose R. Santos <[email protected]>
> Signed-off-by: Valerie Clement <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> debugfs/set_fields.c | 1 +
> lib/e2p/ls.c | 3 +++
> lib/ext2fs/ext2_fs.h | 5 ++++-
> 3 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/debugfs/set_fields.c b/debugfs/set_fields.c
> index ee51c45..25343f0 100644
> --- a/debugfs/set_fields.c
> +++ b/debugfs/set_fields.c
> @@ -132,6 +132,7 @@ static struct field_set_info super_fields[] = {
> { "mmp_interval", &set_sb.s_mmp_interval, 2, parse_uint },
> { "mmp_block", &set_sb.s_mmp_block, 8, parse_uint },
> { "raid_stripe_width", &set_sb.s_raid_stripe_width, 4, parse_uint },
> + { "log_groups_per_flex", &set_sb.s_log_groups_per_flex, 1, parse_uint },
> { 0, 0, 0, 0 }
> };
>
> diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
> index b119606..c211dce 100644
> --- a/lib/e2p/ls.c
> +++ b/lib/e2p/ls.c
> @@ -242,6 +242,9 @@ void list_super2(struct ext2_super_block * sb, FILE *f)
> if (sb->s_first_meta_bg)
> fprintf(f, "First meta block group: %u\n",
> sb->s_first_meta_bg);
> + if (sb->s_log_groups_per_flex)
> + fprintf(f, "Flex block group size: %u\n",
> + 1 << sb->s_log_groups_per_flex);
> if (sb->s_mkfs_time) {
> tm = sb->s_mkfs_time;
> fprintf(f, "Filesystem created: %s", ctime(&tm));
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index ad42cf8..f23c8fd 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -564,7 +564,10 @@ struct ext2_super_block {
> __u16 s_mmp_interval; /* # seconds to wait in MMP checking */
> __u64 s_mmp_block; /* Block for multi-mount protection */
> __u32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
> - __u32 s_reserved[163]; /* Padding to the end of the block */
> + __u8 s_log_groups_per_flex; /* FLEX_BG group size */
> + __u8 s_reserved_char_pad;
> + __u16 s_reserved_pad; /* Padding to next 32bits */
> + __u32 s_reserved[162]; /* Padding to the end of the block */
> };
>
> /*
-JRS
On Tue, 22 Apr 2008 08:46:19 -0400
"Theodore Ts'o" <[email protected]> wrote:
> Change the way we allocate bitmaps and inode tables if the FLEX_BG
> feature is used at mke2fs time. It places calculates a new offset for
> bitmaps and inode table base on the number of groups that the user
> wishes to pack together using the new "-G" option. Creating a
> filesystem with 64 block groups in a flex group can be done by:
>
> mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX
>
> Signed-off-by: Jose R. Santos <[email protected]>
> Signed-off-by: Valerie Clement <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>
> ---
> lib/ext2fs/alloc_tables.c | 118 +++++++++++++++++++++++++++++++++++++++++++-
> lib/ext2fs/initialize.c | 7 +++
> misc/mke2fs.8.in | 16 ++++++
> misc/mke2fs.c | 35 ++++++++++++-
> misc/mke2fs.conf.5.in | 7 +++
> 5 files changed, 177 insertions(+), 6 deletions(-)
>
> diff --git a/lib/ext2fs/alloc_tables.c b/lib/ext2fs/alloc_tables.c
> index 9b4f0e5..d87585b 100644
> --- a/lib/ext2fs/alloc_tables.c
> +++ b/lib/ext2fs/alloc_tables.c
> @@ -27,18 +27,80 @@
> #include "ext2_fs.h"
> #include "ext2fs.h"
>
> +/*
> + * This routine searches for free blocks that can allocate a full
> + * group of bitmaps or inode tables for a flexbg group. Returns the
> + * block number with a correct offset were the bitmaps and inode
> + * tables can be allocated continously and in order.
> + */
> +static blk_t flexbg_offset(ext2_filsys fs, dgrp_t group, blk_t start_blk,
> + ext2fs_block_bitmap bmap, int offset, int size)
> +{
> + int flexbg, flexbg_size, elem_size;
> + blk_t last_blk, first_free = 0;
> + dgrp_t last_grp;
> +
> + flexbg_size = 1 << fs->super->s_log_groups_per_flex;
> + flexbg = group / flexbg_size;
> +
> + if (size > fs->super->s_blocks_per_group / 8)
> + size = fs->super->s_blocks_per_group / 8;
> +
> + /*
> + * Dont do a long search if the previous block
> + * search is still valid.
> + */
> + if (start_blk && group % flexbg_size) {
> + if (size > flexbg_size)
> + elem_size = fs->inode_blocks_per_group;
> + else
> + elem_size = 1;
> + if (ext2fs_test_block_bitmap_range(bmap, start_blk + elem_size,
> + size))
> + return start_blk + elem_size;
> + }
> +
> + start_blk = ext2fs_group_first_block(fs, flexbg_size * flexbg);
> + last_grp = group | (flexbg_size - 1);
> + if (last_grp > fs->group_desc_count)
> + last_grp = fs->group_desc_count;
> + last_blk = ext2fs_group_last_block(fs, last_grp);
> +
> + /* Find the first available block */
> + if (ext2fs_get_free_blocks(fs, start_blk, last_blk, 1, bmap,
> + &first_free))
> + return first_free;
> +
> + if (ext2fs_get_free_blocks(fs, first_free + offset, last_blk, size,
> + bmap, &first_free))
> + return first_free;
> +
> + return first_free;
> +}
> +
> errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> ext2fs_block_bitmap bmap)
> {
> errcode_t retval;
> blk_t group_blk, start_blk, last_blk, new_blk, blk;
> - int j;
> + dgrp_t last_grp;
> + int j, rem_grps, flexbg_size = 0;
>
> group_blk = ext2fs_group_first_block(fs, group);
> last_blk = ext2fs_group_last_block(fs, group);
>
> if (!bmap)
> bmap = fs->block_map;
> +
> + if (EXT2_HAS_INCOMPAT_FEATURE(fs->super,
> + EXT4_FEATURE_INCOMPAT_FLEX_BG) &&
> + fs->super->s_log_groups_per_flex) {
> + flexbg_size = 1 << fs->super->s_log_groups_per_flex;
> + last_grp = group | (flexbg_size - 1);
> + rem_grps = last_grp - group;
> + if (last_grp > fs->group_desc_count)
> + last_grp = fs->group_desc_count;
> + }
>
> /*
> * Allocate the block and inode bitmaps, if necessary
> @@ -56,6 +118,15 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> } else
> start_blk = group_blk;
>
> + if (flexbg_size) {
> + int prev_block = 0;
> + if (group && fs->group_desc[group-1].bg_block_bitmap)
> + prev_block = fs->group_desc[group-1].bg_block_bitmap;
> + start_blk = flexbg_offset(fs, group, prev_block, bmap,
> + 0, rem_grps);
> + last_blk = ext2fs_group_last_block(fs, last_grp);
> + }
> +
> if (!fs->group_desc[group].bg_block_bitmap) {
> retval = ext2fs_get_free_blocks(fs, start_blk, last_blk,
> 1, bmap, &new_blk);
> @@ -66,6 +137,22 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> return retval;
> ext2fs_mark_block_bitmap(bmap, new_blk);
> fs->group_desc[group].bg_block_bitmap = new_blk;
> + if (flexbg_size) {
> + dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
> + fs->group_desc[gr].bg_free_blocks_count--;
> + fs->super->s_free_blocks_count--;
> + fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
> + ext2fs_group_desc_csum_set(fs, gr);
> + }
> + }
> +
> + if (flexbg_size) {
> + int prev_block = 0;
> + if (group && fs->group_desc[group-1].bg_inode_bitmap)
> + prev_block = fs->group_desc[group-1].bg_inode_bitmap;
> + start_blk = flexbg_offset(fs, group, prev_block, bmap,
> + flexbg_size, rem_grps);
> + last_blk = ext2fs_group_last_block(fs, last_grp);
> }
>
> if (!fs->group_desc[group].bg_inode_bitmap) {
> @@ -78,11 +165,29 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> return retval;
> ext2fs_mark_block_bitmap(bmap, new_blk);
> fs->group_desc[group].bg_inode_bitmap = new_blk;
> + if (flexbg_size) {
> + dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
> + fs->group_desc[gr].bg_free_blocks_count--;
> + fs->super->s_free_blocks_count--;
> + fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
> + ext2fs_group_desc_csum_set(fs, gr);
> + }
> }
>
> /*
> * Allocate the inode table
> */
> + if (flexbg_size) {
> + int prev_block = 0;
> + if (group && fs->group_desc[group-1].bg_inode_table)
> + prev_block = fs->group_desc[group-1].bg_inode_table;
> + group_blk = flexbg_offset(fs, group, prev_block, bmap,
> + flexbg_size * 2,
> + fs->inode_blocks_per_group *
> + rem_grps);
> + last_blk = ext2fs_group_last_block(fs, last_grp);
> + }
> +
> if (!fs->group_desc[group].bg_inode_table) {
> retval = ext2fs_get_free_blocks(fs, group_blk, last_blk,
> fs->inode_blocks_per_group,
> @@ -91,12 +196,19 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> return retval;
> for (j=0, blk = new_blk;
> j < fs->inode_blocks_per_group;
> - j++, blk++)
> + j++, blk++) {
> ext2fs_mark_block_bitmap(bmap, blk);
> + if (flexbg_size) {
> + dgrp_t gr = ext2fs_group_of_blk(fs, blk);
> + fs->group_desc[gr].bg_free_blocks_count--;
> + fs->super->s_free_blocks_count--;
> + fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
> + ext2fs_group_desc_csum_set(fs, gr);
> + }
> + }
> fs->group_desc[group].bg_inode_table = new_blk;
> }
> ext2fs_group_desc_csum_set(fs, group);
> -
> return 0;
> }
>
> diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
> index 09e1008..396dd59 100644
> --- a/lib/ext2fs/initialize.c
> +++ b/lib/ext2fs/initialize.c
> @@ -159,6 +159,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
> set_field(s_first_meta_bg, 0);
> set_field(s_raid_stride, 0); /* default stride size: 0 */
> set_field(s_raid_stripe_width, 0); /* default stripe width: 0 */
> + set_field(s_log_groups_per_flex, 0);
> set_field(s_flags, 0);
> if (super->s_feature_incompat & ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
> retval = EXT2_ET_UNSUPP_FEATURE;
> @@ -374,6 +375,10 @@ ipg_retry:
> * Note that although the block bitmap, inode bitmap, and
> * inode table have not been allocated (and in fact won't be
> * by this routine), they are accounted for nevertheless.
> + *
> + * If FLEX_BG meta-data grouping is used, only account for the
> + * superblock and group descriptors (the inode tables and
> + * bitmaps will be accounted for when allocated).
> */
> super->s_free_blocks_count = 0;
> csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
> @@ -390,6 +395,8 @@ ipg_retry:
> fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
> }
> numblocks = ext2fs_reserve_super_and_bgd(fs, i, fs->block_map);
> + if (fs->super->s_log_groups_per_flex)
> + numblocks += 2 + fs->inode_blocks_per_group;
>
> super->s_free_blocks_count += numblocks;
> fs->group_desc[i].bg_free_blocks_count = numblocks;
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index 1e9a203..aa068b3 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -26,6 +26,10 @@ mke2fs \- create an ext2/ext3 filesystem
> .I blocks-per-group
> ]
> [
> +.B \-G
> +.I number-of-groups
> +]
> +[
> .B \-i
> .I bytes-per-inode
> ]
> @@ -245,6 +249,13 @@ option rather than manipulating the number of blocks per group.)
> This option is generally used by developers who
> are developing test cases.
> .TP
> +.BI \-G " number-of-groups"
> +Specify the number of block goups that will be packed together to
> +create one large virtual block group on an ext4 filesystem. This
> +improves meta-data locality and performance on meta-data heavy
> +workloads. The number of goups must be a power of 2 and may only be
> +specified if the flex_bg filesystem feature is enabled.
> +.TP
> .BI \-i " bytes-per-inode"
> Specify the bytes/inode ratio.
> .B mke2fs
> @@ -445,6 +456,11 @@ Use hashed b-trees to speed up lookups in large directories.
> .B filetype
> Store file type information in directory entries.
> .TP
> +.B flex_bg
> +Allow bitmaps and inode tables for a block group to be placed anywhere
> +on the storage media (use with -G option to group meta-data in order
> +to create a large virtual block group).
> +.TP
> .B has_journal
> Create an ext3 journal (as if using the
> .B \-j
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 61f45aa..e37e510 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -98,8 +98,9 @@ static void usage(void)
> fprintf(stderr, _("Usage: %s [-c|-l filename] [-b block-size] "
> "[-f fragment-size]\n\t[-i bytes-per-inode] [-I inode-size] "
> "[-J journal-options]\n"
> - "\t[-N number-of-inodes] [-m reserved-blocks-percentage] "
> - "[-o creator-os]\n\t[-g blocks-per-group] [-L volume-label] "
> + "\t[-G meta group size] [-N number-of-inodes]\n"
> + "\t[-m reserved-blocks-percentage] [-o creator-os]\n"
> + "\t[-g blocks-per-group] [-L volume-label] "
> "[-M last-mounted-directory]\n\t[-O feature[,...]] "
> "[-r fs-revision] [-E extended-option[,...]]\n"
> "\t[-T fs-type] [-jnqvFSV] device [blocks-count]\n"),
> @@ -1096,6 +1097,7 @@ static void PRS(int argc, char *argv[])
> int blocksize = 0;
> int inode_ratio = 0;
> int inode_size = 0;
> + unsigned long flex_bg_size = 0;
> double reserved_ratio = 5.0;
> int sector_size = 0;
> int show_version_only = 0;
> @@ -1180,7 +1182,7 @@ static void PRS(int argc, char *argv[])
> }
>
> while ((c = getopt (argc, argv,
> - "b:cf:g:i:jl:m:no:qr:s:t:vE:FI:J:L:M:N:O:R:ST:V")) != EOF) {
> + "b:cf:g:G:i:jl:m:no:qr:s:t:vE:FI:J:L:M:N:O:R:ST:V")) != EOF) {
> switch (c) {
> case 'b':
> blocksize = strtol(optarg, &tmp, 0);
> @@ -1230,6 +1232,20 @@ static void PRS(int argc, char *argv[])
> exit(1);
> }
> break;
> + case 'G':
> + flex_bg_size = strtoul(optarg, &tmp, 0);
> + if (*tmp) {
> + com_err(program_name, 0,
> + _("Illegal number for flex_bg size"));
> + exit(1);
> + }
> + if (flex_bg_size < 2 ||
> + (flex_bg_size & (flex_bg_size-1)) != 0) {
> + com_err(program_name, 0,
> + _("flex_bg size must be a power of 2"));
> + exit(1);
> + }
> + break;
> case 'i':
> inode_ratio = strtoul(optarg, &tmp, 0);
> if (inode_ratio < EXT2_MIN_BLOCK_SIZE ||
> @@ -1638,6 +1654,19 @@ static void PRS(int argc, char *argv[])
>
> if (inode_size == 0)
> inode_size = get_int_from_profile(fs_types, "inode_size", 0);
> + if (!flex_bg_size && (fs_param.s_feature_incompat &
> + EXT4_FEATURE_INCOMPAT_FLEX_BG))
> + get_int_from_profile(fs_types, "flex_bg_size", 8);
A default of 256 block groups to pack seems a bit high based on some of
the performance testing that I've done. At some point having the inodes
too far away from the data blocks begins to affect performance
(especially on read operations). The optimum number of groups depends
a lot on the platter density of the hard drive, so I expect that we can
increase the default grouping size as time goes by. Using 128 groups
was already showing performance degradation on read operations on some
of my smaller disks (147GB). For now, I would change this to 6 (64
groups) as this is a good balance for both big and small disks.
> + if (flex_bg_size) {
> + if (!(fs_param.s_feature_incompat &
> + EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
> + com_err(program_name, 0,
> + _("Flex_bg feature not enabled, so "
> + "flex_bg size may not be specified"));
> + exit(1);
> + }
> + fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
> + }
>
> if (inode_size && fs_param.s_rev_level >= EXT2_DYNAMIC_REV) {
> if (inode_size < EXT2_GOOD_OLD_INODE_SIZE ||
> diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
> index 6734bf3..5dd92d8 100644
> --- a/misc/mke2fs.conf.5.in
> +++ b/misc/mke2fs.conf.5.in
> @@ -301,6 +301,13 @@ specify one on the command line.
> .I inode_size
> This relation specifies the default inode size if the user does not
> specify one on the command line.
> +.TP
> +.I flex_bg_size
> +This relation specifies the number of block goups that will be packed
> +together to create one large virtual block group on an ext4 filesystem.
> +This improves meta-data locality and performance on meta-data heavy
> +workloads. The number of goups must be a power of 2 and may only be
> +specified if the flex_bg filesystem feature is enabled.
> .SH FILES
> .TP
> .I /etc/mke2fs.conf
-JRS
On Tue, Apr 22, 2008 at 09:18:47AM -0500, Jose R. Santos wrote:
> > @@ -1638,6 +1654,19 @@ static void PRS(int argc, char *argv[])
> >
> > if (inode_size == 0)
> > inode_size = get_int_from_profile(fs_types, "inode_size", 0);
> > + if (!flex_bg_size && (fs_param.s_feature_incompat &
> > + EXT4_FEATURE_INCOMPAT_FLEX_BG))
> > + get_int_from_profile(fs_types, "flex_bg_size", 8);
>
> A default of 256 block groups to pack seems a bit high base on some of
> the performance testing that I've done. At some point having the inodes
> too far away from the data blocks begins to affect performance
> (especially on read operations). The optimum number of groups depends
> a lot on platter density of the hard drive so I expect that we can
> increase the default grouping size as time goes by. Using 128 groups
> as already showing performance degradation on read operations on some
> of my smaller disks (147GB). For now, I would change this to 6 (64
> groups) as this is a good balance for both big an small disks.
Actually this is 8 (as in 2**3), which was intentionally very small,
because I was being conservative. I could change it to be 64 if you
think it is a better balance. As you can see, it gets set later on
down here.
> > + fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
And, in fact the biggest bug which both you and I missed was that this:
> > + get_int_from_profile(fs_types, "flex_bg_size", 8);
Should have been this:
flex_bg_size = get_int_from_profile(fs_types, "flex_bg_size", 8);
<Dons paper bag>
- Ted
On Tue, 22 Apr 2008 10:51:25 -0400
Theodore Tso <[email protected]> wrote:
> On Tue, Apr 22, 2008 at 09:18:47AM -0500, Jose R. Santos wrote:
> > > @@ -1638,6 +1654,19 @@ static void PRS(int argc, char *argv[])
> > >
> > > if (inode_size == 0)
> > > inode_size = get_int_from_profile(fs_types, "inode_size", 0);
> > > + if (!flex_bg_size && (fs_param.s_feature_incompat &
> > > + EXT4_FEATURE_INCOMPAT_FLEX_BG))
> > > + get_int_from_profile(fs_types, "flex_bg_size", 8);
> >
> > A default of 256 block groups to pack seems a bit high base on some of
> > the performance testing that I've done. At some point having the inodes
> > too far away from the data blocks begins to affect performance
> > (especially on read operations). The optimum number of groups depends
> > a lot on platter density of the hard drive so I expect that we can
> > increase the default grouping size as time goes by. Using 128 groups
> > as already showing performance degradation on read operations on some
> > of my smaller disks (147GB). For now, I would change this to 6 (64
> > groups) as this is a good balance for both big an small disks.
>
> Actually this is 8 (as in 2**3), which was intentionally very small,
> because I was being conservative. I could change it to be 64 if you
> think it is a better balance. As you can see, it gets set later on
> down here.
I see that now; guess I should not read code without having
breakfast. I think 8 is a very safe and conservative number, maybe too
conservative. The 64-group packing was the number I found to be an
overall improvement with the limited number of drives that I had to
test with. I haven't done any testing on old drives or laptop drives with
slow spindle speeds, but I would think 16 or 32 would be safe here unless
the drive is really old and small.
>
> > > + fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
>
> And, in fact the biggest bug which both you and I missed was that this:
>
> > > + get_int_from_profile(fs_types, "flex_bg_size", 8);
>
> Should have been this:
>
> flex_bg_size = get_int_from_profile(fs_types, "flex_bg_size", 8);
>
> <Dons paper bag>
>
> - Ted
-JRS
On Tue, Apr 22, 2008 at 10:32:12AM -0500, Jose R. Santos wrote:
> I see that now, guess I should not read code with out having
> breakfast. I think 8 is a very safe and conservative number, maybe to
> conservative. The 64 group packing was the number I found to be a
> overall improvement with the limited number of drives that I had to
> test with. Haven't done any testing on old drives or laptop drive with
> slow spindle speed but I would think 16 or 32 would be safe here unless
> the drive is really old and small.
Let's stay with 16 then for now. Spindle speed doesn't actually
matter here; what matters is seek speed, and the density of the disk
drive. The other thing which worries me, though, is that the size of
each flex_bg block group cluster depends on the size of the block
group, which in turn is related to the square of the filesystem
blocksize. I.e., assuming a flex_bg size of 16 block groups, then:
Blocksize   Blocks/blockgroup   Blockgroup Size   Flex_BG cluster size
1k          8192                8 Meg             128 Meg
2k          16384               32 Meg            512 Meg
4k          32768               128 Meg           2 Gig
8k          65536               512 Meg           8 Gig
16k         131072              2 Gig             32 Gig
32k         262144              8 Gig             128 Gig
64k         524288              32 Gig            512 Gig
So using a fixed default of 16, the flexible blockgroup size can range
anywhere from 128 megs to half a terabyte!
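The arithmetic behind the table can be reproduced with a small
stand-alone sketch (an illustration only, assuming one bitmap block per
group and the 16-group flex_bg used above):

#include <stdio.h>

int main(void)
{
	unsigned long long bs, flex_bg_size = 16;

	for (bs = 1024; bs <= 65536; bs *= 2) {
		/* one block bitmap block covers 8 * blocksize blocks */
		unsigned long long blocks_per_group = 8 * bs;
		unsigned long long group_bytes = blocks_per_group * bs;
		unsigned long long cluster_bytes = group_bytes * flex_bg_size;

		printf("%3lluk %8llu %8llu MB %8llu MB\n", bs >> 10,
		       blocks_per_group, group_bytes >> 20,
		       cluster_bytes >> 20);
	}
	return 0;
}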
How much of a difference in your numbers are you seeing, anyway? Is it
big enough that we really need to worry about it?
- Ted
On Tue, 22 Apr 2008 14:57:28 -0400
Theodore Tso <[email protected]> wrote:
> On Tue, Apr 22, 2008 at 10:32:12AM -0500, Jose R. Santos wrote:
> > I see that now, guess I should not read code with out having
> > breakfast. I think 8 is a very safe and conservative number, maybe to
> > conservative. The 64 group packing was the number I found to be a
> > overall improvement with the limited number of drives that I had to
> > test with. Haven't done any testing on old drives or laptop drive with
> > slow spindle speed but I would think 16 or 32 would be safe here unless
> > the drive is really old and small.
>
> Let's stay with 16 then for now. Spindle speed doesn't actually
> matter here; what matters is seek speed, and the density of the disk
Well, higher spindle speed affects cylinder seek times, and thus overall
seek time, which is why I think it should be tested as well.
> drive. The other thing which worries me though is that the size of
> each flex_bg block group cluster is dependent on the size of the block
> group, which in turn is related to the square of the filesystem
> blocksize. i.e., assuming a fs blockgroup size of 16, then:
>
> Blocksize Blocks/blockgroup Blockgroup Size Flex_BG cluster size
>
> 1k 8192 8 Meg 128 Meg
> 2k 16384 32 Meg 512 Meg
> 4k 32768 128 Meg 2 Gig
> 8k 65536 512 Meg 8 Gig
> 16k 131072 2 Gig 32 Gig
> 32k 262144 8 Gig 128 Gig
> 64k 524288 32 Gig 512 Gig
>
> So using a fixed default of 16, the flexible blockgroup size can range
> anything from 128 megs to half a terabyte!
>
> How much a difference in your numbers are you seeing, anyway? Is it
> big enough that we really need to worry about it?
>
> - Ted
I do not have any data on multiple block sizes, and I have not done
testing with the 64K equivalent of 4096 groups for a 4k filesystem. The
testing scenarios for a 4k filesystem should also be different from
those for a 64k filesystem, so the testing I did with 4k does not
necessarily apply to a bigger block size.
The default of 16 is a safe number for a 4k block size. I would think
that the larger the block size, the smaller the flex_bg packing size
should be, since larger block sizes already address some of the issues
that flex_bg tries to address.
On Tue, Apr 22, 2008 at 05:27:51PM -0500, Jose R. Santos wrote:
> > Let's stay with 16 then for now. Spindle speed doesn't actually
> > matter here; what matters is seek speed, and the density of the disk
>
> Well higher spindle speed affect cylinder seek times which affect
> overall seek time, which is why I think it should be tested as well.
Well, I looked at some laptop drives with spindle speeds of 4200rpm,
5400rpm, and 7200rpm, and they have an average read/write seek time of
10.5/12.5ms.
Comparing Western Digital's current enterprise disk drives (the RE-2),
which are 7200rpm, and their Enterprise "Green Power" drives (the
RE2-GP), which try to hide the fact that their disks are 5400RPM but
which web sites have outed by doing a frequency analysis of their
acoustic output --- both have the same read/write seek time of 8.9ms.
Interestingly, some of the older disks have faster seek times (i.e.,
4ms) with the same size disk platters, and I doubt it's because hard drive
head positioning motors have gotten slower; rather, it's probably that
as the platter density has increased, the time to position the hard
drive heads is what's taking longer.
Something that would be interesting to do is to run some experiments
measuring small seeks (i.e., within a 1 gigabyte or so), and long
seeks (i.e., across 10-20% and 50% of the disk drive). The difference
between those two times is what's probably driving your flex_bg
performance numbers, and it might be easier simply to measure that
directly.
- Ted
On Tue, 22 Apr 2008 21:21:49 -0400
Theodore Tso <[email protected]> wrote:
> On Tue, Apr 22, 2008 at 05:27:51PM -0500, Jose R. Santos wrote:
> > > Let's stay with 16 then for now. Spindle speed doesn't actually
> > > matter here; what matters is seek speed, and the density of the disk
> >
> > Well higher spindle speed affect cylinder seek times which affect
> > overall seek time, which is why I think it should be tested as well.
>
> Well, I looked at some laptop drives with spindle speeds of 4200rpm,
> 5400rpm, and 7200rpm, and they have an average read/write seek time of
> of 10.5/12.5ms.
>
> Comparing Western Digital's current enterprise disk drives (the RE-2)
> which are 7200rpm, and their Enterprise "Green Power" drives (the
> RE2-GP) which try to hide the fact that their disks are 5400RPM, but
> which web sites have outed by using doing a frequency analysis of its
> acoustic output --- both have the same read/write seek time of 8.9ms.
Well, these Green Power drives from Western Digital don't have a constant
spindle speed, and I believe that they run at 7200 rpm under load and
5400 when mostly idle. That makes sense of why the seek times would be the
same. On the other hand, the VelociRaptor drives at 10k rpm have a
latency of 5.5ms.
Looking at the specs of the Seagate Savvio and Cheetah families of drives,
a 33% increase in spindle speed from 10k to 15k rpm gives around a 25%
improvement in average seek latency. Also note that published benchmarks
that are sensitive to IO latencies tend to use smaller 15k rpm disks
rather than their larger but slower counterparts. RPM speed
usually beats density when it comes to seek time improvements.
>
> Interestingly, some of the older disks have faster seek times (i.e.,
> 4ms), at the same disk platters, and I doubt it's because hard drive
> head positioning motors have gotten slower; rather, it's probably that
> as the platter density has increased, the time to position the hard
> drive heads is what's taking longer.
Or it could be that hard drive manufacturers in the digital media age
care more about capacity at a cheaper price than about tuning a drive for
the best seek performance. For those users that demand speed, there are
options available like the VelociRaptor family of drives, but those come
at a cost of both capacity and price.
I have to say 4ms is really good. Was this on IDE? Most drives I've
seen in this category have been stuck at the 7-8ms barrier. I don't
recall seeing them lower than this, but I have not been paying much
attention in the last couple of years.
> Something that would be interesting to do is to do some experiments
> measuring small seeks (i.e., within a 1 gigabyte or so), and long
> seeks (i.e., across 10-20% and 50% of the disk drive). The difference
> between those two times is what's probably driving your flex_bg
> performance numbers, and it might be easier simply to measure that
> directly.
I may have some data related to that, since I did run blktrace on
some of my runs. I need to check if I still have the data so that I can
run seekwatcher on it. I had to erase most of it since the traces were
huge. :(
>
> - Ted
-JRS
On Wed, Apr 23, 2008 at 12:48:43AM -0500, Jose R. Santos wrote:
>
> Well, these Green Power drives from Western Digital dont have constant
> spindle speed and I believe that they run at 7200 rpm under load and
> 5400 when mostly idle. Makes sense why the seek times would be the
> same. On the other hand, the VelociRaptor drives with 10k rpm have a
> latency of 5.5ms.
Actually, no, check out some of the web pages, especially:
http://www.silentpcreview.com/article786-page1.html
"Western Digital has caught a lot of flak for withholding the
rotation speed of the Green Power, especially when the product
was first launched and the marketing material listed the
rotation speed as 5,400-7,200 RPM. This led some to speculate
that the rotation speed changed dynamically during use — which
would have been an impressive engineering feat had it been
true. The reality is revealed by a sentence that Western
Digital added to the description of IntelliPower: "For each
GreenPower™ drive model, WD may use a different, invariable
RPM." In other words, Western Digital reserves the right to
release both 5,400 RPM and 7,200 RPM drives under the Green
Power name — without telling you which are which."
In fact, all of the Western Digital Green Power disks released to date
are using 5400rpm, based on people who have put a microphone to
the disk drive and then done a frequency analysis. The "Intellipower"
nonsense is just marketing fluff so that people don't think the drive
is going to be vastly slower just because the platter turns more
slowly. I'm pretty sure that's because there are other tradeoffs made
in laptop drives for power savings, more than just the spindle speed,
but for whatever reason people associate 5400rpm drives with SLOW. :-)
> Looking at the specs of Seagate Savvio and Cheetah family of drives, a
> 33% increase in spindle speed from 10k to 15K rpms give out around 25%
> improvement in average seek latency. Also note that benchmark
> publishes that are sensitive to IO latencies tend to use smaller 15k
> rpm disk than their larger but slower counter parts. RPM speeds
> usually beats density when it comes to seek time improvements.
Yeah, but that's not a fair comparison, because you're comparing
different generations of disk drives, as well as the fact that the Savvio
drives are enterprise disks which cost much more than the Cheetah drives.
A much better comparison would be the Seagate Cheetah 15k.5 and the
Seagate Cheetah NS. To quote from the Seagate Cheetah NS description,
"The Seagate Cheetah NS shares the Cheetah 15K.5 design, optimized for
storage capacity and power consumption but maintaining better
performance than standard 10K enterprise products." So the Cheetah NS
is based on the same technology and design as the 15k.5,
but the spindle speed has been slowed to 10k to save power. And what
do you see there?
Model                   RPM     Seek times (read/write, in ms)
Seagate Cheetah NS      10k     3.9/4.2
Seagate Cheetah 15k.5   15k     3.5/4.0
That's only a 10% improvement going from 10k to 15k, when the spindle
speed has gone up by 50%. (And that's for average read seek times;
for writing, it's only a 5% improvement.) It also shows that it is
certainly possible to create a 10k rpm hard drive with a 4ms seek
time.
- Ted
On Wed, 23 Apr 2008 08:23:49 -0400
Theodore Tso <[email protected]> wrote:
> On Wed, Apr 23, 2008 at 12:48:43AM -0500, Jose R. Santos wrote:
> >
> > Well, these Green Power drives from Western Digital dont have constant
> > spindle speed and I believe that they run at 7200 rpm under load and
> > 5400 when mostly idle. Makes sense why the seek times would be the
> > same. On the other hand, the VelociRaptor drives with 10k rpm have a
> > latency of 5.5ms.
>
> Actually, no, check out some of the web pages, especially:
>
> http://www.silentpcreview.com/article786-page1.html
>
> "Western Digital has caught a lot of flak for withholding the
> rotation speed of the Green Power, especially when the product
> was first launched and the marketing material listed the
> rotation speed as 5,400-7,200 RPM. This led some to speculate
> that the rotation speed changed dynamically during use — which
> would have been an impressive engineering feat had it been
> true. The reality is revealed by a sentence that Western
> Digital added to the description of IntelliPower: "For each
> GreenPower™ drive model, WD may use a different, invariable
> RPM." In other words, Western Digital reserves the right to
> release both 5,400 RPM and 7,200 RPM drives under the Green
> Power name — without telling you which are which."
>
> In fact, all of the Western Digital Green Power disks released to date
> are all using 5400rpm, based on people who have put a microphone to
> the disk drive and then done a frequency analysis. The "Intellipower"
> nonsense is just marketing fluff so that people don't think the drive
> is going to be vastly slower just because the platter turns more
> slowly. I'm pretty sure that's because there are other tradeoffs made
> in laptop drives for powersavings, more than just the spindle speed,
> but for whatever reason people associate 5400rpm drives with SLOW. :-)
The biggest power savings tradeoff in laptop drives IS rpm speed. :)
All the tests I've seen of the WD GP 1TB drives seem to point out that
the performance is very disappointing. So the "5400rpm drives are SLOW"
association seems to be a correct assessment. The same tests that compare
it to a WD Raptor drive at 10K rpm show that a drive with five times less
capacity but higher rpms can run circles around a large-capacity
hard drive with a slower rpm.
This goes to my original point of testing FLEX_BG on laptop hard drives
with slower RPM speeds, since the response time on random access
workloads is higher than on their desktop counterparts. Higher platter
density seems to help them very little here.
>
> > Looking at the specs of Seagate Savvio and Cheetah family of drives, a
> > 33% increase in spindle speed from 10k to 15K rpms give out around 25%
> > improvement in average seek latency. Also note that benchmark
> > publishes that are sensitive to IO latencies tend to use smaller 15k
> > rpm disk than their larger but slower counter parts. RPM speeds
> > usually beats density when it comes to seek time improvements.
>
> Yeah, but that's not a fair comparison, because you're comparing
> different generations of disk drives, as well as the fact that Savvio
> are enterprise disks which costs much more than the Cheetah drives.
>
> A much better comparison would be the Seagate Cheetah 15k.5 and the
> Seagate Cheetah NS. To quote from the Seagate Cheetah NS description,
> "The Seagate Cheetah NS shares the Cheetah 15K.5 design, optimized for
> storage capacity and power consumption but maintaining better
> performance than standard 10K enterprise products." So the Cheetah NS
> is based off of the same technology and design as the 15k.5 design,
> but the spindle speed has been slowed to 10k to save power. And what
> do you see there?
>
> Model RPM Seek times (read/write, in ms)
>
> Seagate Cheetah NS 10k 3.9/4.2
> Seegate Cheetah 15k.5 15k 3.5/4.0
>
> That's only a 10% improvement going from 10k to 15k, when speed has
> gone up by a factor of 50%. (And that's for average read seek times;
> for writing, it's only a 5% improvement.) It also shows that it is
> certainly possible to create a 10k rpm hard drive with a 4ms seek
> time.
But the NS also has a larger per-platter capacity, which contributes to the
4ms. The 15k drive also has about a 30% better average latency than
the NS, which still supports my previous statement that higher RPM
_usually_ beats density when it comes to seek time improvements.
> - Ted
-JRS
On Apr 22, 2008 08:46 -0400, Theodore Ts'o wrote:
> Change the way we allocate bitmaps and inode tables if the FLEX_BG
> feature is used at mke2fs time. It places calculates a new offset for
> bitmaps and inode table base on the number of groups that the user
> wishes to pack together using the new "-G" option. Creating a
> filesystem with 64 block groups in a flex group can be done by:
>
> mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX
Presumably you mean "-G 64" based on your description of 64 groups/flex_bg?
> @@ -66,6 +137,22 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> + if (flexbg_size) {
> + dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
> + fs->group_desc[gr].bg_free_blocks_count--;
> + fs->super->s_free_blocks_count--;
> + fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
> + ext2fs_group_desc_csum_set(fs, gr);
> + }
It makes total sense to me that the BG_BLOCK_UNINIT flag would not be set
on a group that does not have the default bitmap layouts, so I agree with
this change. I might suggest that we add a new flag BG_BLOCK_EMPTY or
similar (which is really part of the FLEXBG feature so it doesn't affect
the existing uninit_groups code) that indicates that the block bitmap
contains NO allocated blocks, so that the kernel can know immediately
when reconstructing the bitmap that there are no bitmaps or itable in
that group (i.e. the bitmap is all zero).
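A rough sketch of what that could look like (the flag name and bit value
here are hypothetical, purely to illustrate the suggestion; nothing in
this snippet is part of the on-disk format today):

/* hypothetical: no blocks at all are in use in this group */
#define EXT2_BG_BLOCK_EMPTY	0x0008

static int group_bitmap_is_trivially_zero(ext2_filsys fs, dgrp_t grp)
{
	/*
	 * A reader (kernel or e2fsck pass 5) could skip bitmap
	 * reconstruction entirely for such a group: the block
	 * bitmap is known to be all zero.
	 */
	return fs->group_desc[grp].bg_flags & EXT2_BG_BLOCK_EMPTY;
}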
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Apr 22, 2008 14:57 -0400, Theodore Ts'o wrote:
> On Tue, Apr 22, 2008 at 10:32:12AM -0500, Jose R. Santos wrote:
> > I see that now, guess I should not read code with out having
> > breakfast. I think 8 is a very safe and conservative number, maybe to
> > conservative. The 64 group packing was the number I found to be a
> > overall improvement with the limited number of drives that I had to
> > test with. Haven't done any testing on old drives or laptop drive with
> > slow spindle speed but I would think 16 or 32 would be safe here unless
> > the drive is really old and small.
>
> Let's stay with 16 then for now. Spindle speed doesn't actually
> matter here; what matters is seek speed, and the density of the disk
> drive. The other thing which worries me though is that the size of
> each flex_bg block group cluster is dependent on the size of the block
> group, which in turn is related to the square of the filesystem
> blocksize. i.e., assuming a fs blockgroup size of 16, then:
>
> Blocksize Blocks/blockgroup Blockgroup Size Flex_BG cluster size
>
> 1k 8192 8 Meg 128 Meg
> 2k 16384 32 Meg 512 Meg
> 4k 32768 128 Meg 2 Gig
> 8k 65536 512 Meg 8 Gig
> 16k 131072 2 Gig 32 Gig
> 32k 262144 8 Gig 128 Gig
> 64k 524288 32 Gig 512 Gig
>
> So using a fixed default of 16, the flexible blockgroup size can range
> anything from 128 megs to half a terabyte!
>
> How much a difference in your numbers are you seeing, anyway? Is it
> big enough that we really need to worry about it?
It probably makes sense to change the mke2fs/tune2fs parameter to be in
MB or GB instead of a count of groups, and/or change the internal default
to be a function of the group size instead of just a constant.
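For instance, something along these lines could map a size given in MB
to the on-disk log value (a sketch only, not code from any posted patch):

static int log_groups_per_flex_from_mb(unsigned long long flex_mb,
				       unsigned int blocksize,
				       unsigned int blocks_per_group)
{
	unsigned long long group_bytes =
		(unsigned long long) blocks_per_group * blocksize;
	unsigned long long groups = (flex_mb << 20) / group_bytes;
	int log2 = 0;

	if (groups < 1)
		groups = 1;
	while ((groups >>= 1) != 0)	/* round down to a power of two */
		log2++;
	return log2;
}

With a 4k blocksize and 32768 blocks per group, asking for a 2 GB flex
cluster (2048 MB) yields log2 = 4, i.e. 16 groups per flex_bg.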
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Apr 22, 2008 08:46 -0400, Theodore Ts'o wrote:
> Change the way we allocate bitmaps and inode tables if the FLEX_BG
> feature is used at mke2fs time. It places calculates a new offset for
> bitmaps and inode table base on the number of groups that the user
> wishes to pack together using the new "-G" option. Creating a
> filesystem with 64 block groups in a flex group can be done by:
>
> mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX
>
> @@ -1638,6 +1654,19 @@ static void PRS(int argc, char *argv[])
>
> if (inode_size == 0)
> inode_size = get_int_from_profile(fs_types, "inode_size", 0);
> + if (!flex_bg_size && (fs_param.s_feature_incompat &
> + EXT4_FEATURE_INCOMPAT_FLEX_BG))
> + flex_bg_size = get_int_from_profile(fs_types, "flex_bg_size",8);
> + if (flex_bg_size) {
> + if (!(fs_param.s_feature_incompat &
> + EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
> + com_err(program_name, 0,
> + _("Flex_bg feature not enabled, so "
> + "flex_bg size may not be specified"));
> + exit(1);
> + }
> + fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
> + }
Should specifying "-G" enable FLEX_BG, like specifying "-j" or "-J size"
will enable HAS_JOURNAL instead of requiring that "-O has_journal" needs
to be explicitly given?
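If that were the desired behaviour, the change would presumably be a
one-liner in the -G handling (a sketch of the idea only, not part of the
posted patch):

	case 'G':
		flex_bg_size = strtoul(optarg, &tmp, 0);
		/* ... existing range/power-of-2 checks from the patch ... */
		/* hypothetical: -G implies flex_bg, as -j implies has_journal */
		fs_param.s_feature_incompat |= EXT4_FEATURE_INCOMPAT_FLEX_BG;
		break;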
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Wed, 23 Apr 2008 14:39:55 -0600
Andreas Dilger <[email protected]> wrote:
> On Apr 22, 2008 08:46 -0400, Theodore Ts'o wrote:
> > Change the way we allocate bitmaps and inode tables if the FLEX_BG
> > feature is used at mke2fs time. It places calculates a new offset for
> > bitmaps and inode table base on the number of groups that the user
> > wishes to pack together using the new "-G" option. Creating a
> > filesystem with 64 block groups in a flex group can be done by:
> >
> > mke2fs -j -I 256 -O flex_bg -G 32 /dev/sdX
>
> Presumably you mean "-G 64" based on your description of 64 groups/flex_bg?
Thanks for catching that.
> > @@ -66,6 +137,22 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
> > + if (flexbg_size) {
> > + dgrp_t gr = ext2fs_group_of_blk(fs, new_blk);
> > + fs->group_desc[gr].bg_free_blocks_count--;
> > + fs->super->s_free_blocks_count--;
> > + fs->group_desc[gr].bg_flags &= ~EXT2_BG_BLOCK_UNINIT;
> > + ext2fs_group_desc_csum_set(fs, gr);
> > + }
>
> It makes total sense to me that the BG_BLOCK_UNINIT flag would not be set
> on a group that does not have the default bitmap layouts, so I agree with
> this change. I might suggest that we add a new flag BG_BLOCK_EMPTY or
> similar (which is really part of the FLEXBG feature so it doesn't affect
> the existing uninit_groups code) that indicates that the block bitmap
> contains NO allocated blocks, so that the kernel can know immediately
> when reconstructing the bitmap that there are no bitmaps or itable in
> that group (i.e. the bitmap is all zero).
I originally had a similar idea, but it was vetoed because there was no
kernel user of the flag. The flag that I used was set if the block
group had meta-data, as opposed to just being empty, since there are still
block groups out there that have no bitmaps or inode table but still hold
group descriptor or backup superblock copies. Would BG_BLOCK_EMPTY mean
no bitmaps/inode tables, or does it imply a completely empty block group?
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
-JRS
On Wed, 23 Apr 2008 14:57:35 -0600
Andreas Dilger <[email protected]> wrote:
> On Apr 22, 2008 14:57 -0400, Theodore Ts'o wrote:
> > On Tue, Apr 22, 2008 at 10:32:12AM -0500, Jose R. Santos wrote:
> > > I see that now, guess I should not read code with out having
> > > breakfast. I think 8 is a very safe and conservative number, maybe to
> > > conservative. The 64 group packing was the number I found to be a
> > > overall improvement with the limited number of drives that I had to
> > > test with. Haven't done any testing on old drives or laptop drive with
> > > slow spindle speed but I would think 16 or 32 would be safe here unless
> > > the drive is really old and small.
> >
> > Let's stay with 16 then for now. Spindle speed doesn't actually
> > matter here; what matters is seek speed, and the density of the disk
> > drive. The other thing which worries me though is that the size of
> > each flex_bg block group cluster is dependent on the size of the block
> > group, which in turn is related to the square of the filesystem
> > blocksize. i.e., assuming a fs blockgroup size of 16, then:
> >
> > Blocksize Blocks/blockgroup Blockgroup Size Flex_BG cluster size
> >
> > 1k 8192 8 Meg 128 Meg
> > 2k 16384 32 Meg 512 Meg
> > 4k 32768 128 Meg 2 Gig
> > 8k 65536 512 Meg 8 Gig
> > 16k 131072 2 Gig 32 Gig
> > 32k 262144 8 Gig 128 Gig
> > 64k 524288 32 Gig 512 Gig
> >
> > So using a fixed default of 16, the flexible blockgroup size can range
> > anything from 128 megs to half a terabyte!
> >
> > How much a difference in your numbers are you seeing, anyway? Is it
> > big enough that we really need to worry about it?
>
> It probably makes sense to change the mke2fs/tune2fs parameter to be in
> MB or GB instead of a count of groups, and/or change the internal default
> to be a function of the groups size instead of just a constant.
Did you mean making it a function of the block size? I agree that this
would make more sense than just the constant.
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
-JRS
On Apr 23, 2008 16:05 -0500, Jose R. Santos wrote:
> On Wed, 23 Apr 2008 14:39:55 -0600
> Andreas Dilger <[email protected]> wrote:
> > It makes total sense to me that the BG_BLOCK_UNINIT flag would not be set
> > on a group that does not have the default bitmap layouts, so I agree with
> > this change. I might suggest that we add a new flag BG_BLOCK_EMPTY or
> > similar (which is really part of the FLEXBG feature so it doesn't affect
> > the existing uninit_groups code) that indicates that the block bitmap
> > contains NO allocated blocks, so that the kernel can know immediately
> > when reconstructing the bitmap that there are no bitmaps or itable in
> > that group (i.e. the bitmap is all zero).
>
> I originally had a similar idea but was vetoed because there was no
> kernel user on the flag. The flag that I used was set if the block
> group had meta-data as opposed to just being empty since there are still
> block groups out there that can have no meta-data but still have bgd or
> backup super blocks. Would BG_BLOCK_EMPTY mean no bitmaps/inode tables
> or does it imply completely empty block group?
It could mean either... What is important is that, if it is useful, it
should be done before FLEXBG goes into the field.
The kernel can already determine somewhat efficiently whether a group
has sb or gdt backups, though it can't hurt to flag this also. What
seems to be quite difficult is to know, in the presence of FLEXBG, whether
a group has an itable or bitmap in it.
I'd HOPE (and I believe this is what Ted's recent patch did) that any
group which is being used to store flexbg data will have an initialized
block bitmap in it, because it is "non-standard".
What is more tricky is, if a group has BLOCK_UNINIT and/or INODE_UNINIT
set, what should happen when that group's block bitmap is initialized.
Should it assume there is a block + inode bitmap and an itable, or is
it enough to check its own group descriptor to determine whether the bitmap
and itable are in the group itself?
Maybe I'm being paranoid, and we don't need the flag(s), but better to
think the issues through now and decide we don't need them, than to
decide later that we do.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Fri, Apr 25, 2008 at 02:10:26PM -0600, Andreas Dilger wrote:
> I'd HOPE (and I believe this is what Ted's recent patch did) is that any
> group which is being used to store flexbg data will have an initialized
> block bitmap in it, because it is "non-standard".
Correct. If there are *any* blocks allocated other than the block group's
own metadata, BLOCK_UNINIT will never be set. And that's precisely to
avoid the tricky case described in your next paragraph:
> What is more tricky is if a group has BLOCK_UNINIT and/or INODE_UNINIT
> set what should happen when that group's block bitmap is initialized.
> Should it assume there is a block + inode bitmap and an itable, or is
> it enough to check its own group descriptor to determine if the bitmap
> and itable are not in the group itself.
In the kernel, it should be enough only to check bg_inode_bitmap,
bg_block_bitmap, and bg_inode_table to construct the block bitmap.
The point was to keep things simple.
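As a sketch of what that reconstruction looks like (an illustration only,
not the actual kernel or e2fsck code; the helper name is made up, but the
descriptor fields and library calls are the real e2fsprogs ones):

static void synth_block_bitmap(ext2_filsys fs, dgrp_t grp,
			       ext2fs_block_bitmap bmap)
{
	struct ext2_group_desc *gd = &fs->group_desc[grp];
	blk_t	first = ext2fs_group_first_block(fs, grp);
	blk_t	last = ext2fs_group_last_block(fs, grp);
	blk_t	blk;

	/* superblock and group descriptor backups, if this group has any */
	ext2fs_reserve_super_and_bgd(fs, grp, bmap);

	/* the group's own bitmaps and itable, wherever flex_bg put them */
	if (gd->bg_block_bitmap >= first && gd->bg_block_bitmap <= last)
		ext2fs_mark_block_bitmap(bmap, gd->bg_block_bitmap);
	if (gd->bg_inode_bitmap >= first && gd->bg_inode_bitmap <= last)
		ext2fs_mark_block_bitmap(bmap, gd->bg_inode_bitmap);
	for (blk = gd->bg_inode_table;
	     blk < gd->bg_inode_table + fs->inode_blocks_per_group; blk++)
		if (blk >= first && blk <= last)
			ext2fs_mark_block_bitmap(bmap, blk);
}

Everything needed comes from the group's own descriptor; no other
group's descriptor has to be consulted.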
The cost of doing this is that you will end up needing to initialize
the block bitmaps for an extra 1 out of every flex_bg_size block
groups, but that's not a major cost. It also means that BLOCK_UNINIT
and BG_BLOCK_EMPTY as defined by Andreas are the same thing. This was
a Keep It Simple, Stupid design point; I don't think the complexity is
worth it.
If someone wants to convince me that the benefits of forcing the
kernel and e2fsck pass5 to paw through all of the block group
descriptors to construct the block bitmap outweigh the costs (and
more importantly, volunteers to write the code :-), I'm willing to be
convinced otherwise....
- Ted