2011-03-19 21:28:39

by Theodore Ts'o

Subject: [PATCH, RFC 00/12] bigalloc patchset

This is an initial patchset of the bigalloc patches to ext4. This patch
adds support for clustered allocation, so that each bit in the ext4
block allocation bitmap addresses a power of two number of blocks. For
example, if the file system is mainly going to be storing large files in
the 4-32 megabyte range, it might make sense to set a cluster size of 1
megabyte. This means that each bit in the block allocation bitmap would
now address 256 4k blocks, and it means that the size of the block
bitmaps for a 2T file system shrinks from 64 megabytes to 256k. It also
means that a block group addresses 32 gigabytes instead of 128
megabytes, also shrinking the amount of file system overhead for
metadata.
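
The arithmetic above can be checked with a short sketch. The helper names
below are hypothetical (they do not appear in the patches), and the sketch
assumes one bitmap bit per allocation unit and a group bitmap that fills
exactly one 4k file system block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of allocation bitmap needed for a file system
 * of fs_bytes when each bit covers one unit (block or cluster) of
 * unit_size bytes. unit_size must be a power of two. */
static uint64_t bitmap_bytes(uint64_t fs_bytes, uint64_t unit_size)
{
    assert(unit_size != 0 && (unit_size & (unit_size - 1)) == 0);
    uint64_t units = fs_bytes / unit_size;  /* one bit per unit */
    return (units + 7) / 8;                 /* round up to whole bytes */
}

/* Hypothetical helper: bytes addressed by one block group, assuming the
 * group's bitmap occupies exactly one block of block_size bytes. */
static uint64_t group_bytes(uint64_t block_size, uint64_t unit_size)
{
    assert(block_size != 0 && unit_size >= block_size);
    return block_size * 8 * unit_size;      /* bits per bitmap * unit size */
}
```

With 4k blocks and a 2T file system, bitmap_bytes() gives 64 megabytes
without clustering and 256k with 1 megabyte clusters, and group_bytes()
gives 128 megabytes and 32 gigabytes respectively.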

The cost is decreased disk space efficiency. Each directory will consume
at least one full cluster (1 megabyte in the example above), as will each
extent tree block. (I am on the fence as to whether I should add
complexity so that in the rare case that an inode needs more than 344
extents --- a highly fragmented file indeed --- and needs a second
extent tree block, we can avoid allocating another cluster and instead
use another block from the cluster used by the inode. The concern is the
amount of complexity this adds to e2fsck, not just to the kernel.)
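
To make the space cost concrete, here is a rough sketch of how much disk
space an object consumes once allocation is done in whole clusters. The
function name is hypothetical, and the sketch ignores metadata such as
extent tree blocks; it only rounds the data size up to a cluster multiple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of disk space consumed by an object of
 * `size` bytes when space is handed out in units of `cluster_size`
 * bytes. cluster_size must be a power of two. */
static uint64_t consumed_bytes(uint64_t size, uint64_t cluster_size)
{
    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    /* Round up to the next multiple of cluster_size. */
    return (size + cluster_size - 1) & ~(cluster_size - 1);
}
```

For example, a directory holding a single 4k block consumes 4k on a
non-clustered file system but a full megabyte with 1 megabyte clusters,
which is why this layout only makes sense when most files are large.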

To test these patches, I have used an *extremely* kludgy set of patches
to e2fsprogs, which are attached below. These patches need *extensive*
revision before I would consider them clean enough for
committing into e2fsprogs, but they were sufficient for me to do the
kernel-side changes --- mke2fs, dumpe2fs, and debugfs work. E2fsck most
definitely does _not_ work at this stage.

Please comment! I do not intend for these patches to be merged during
the 2.6.39 merge window. I am targeting 2.6.40, 3 months from now,
since these patches are quite extensive.

- Ted

Theodore Ts'o (12):
ext4: read-only support for bigalloc file systems
ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to
EXT4_CLUSTERS_PER_GROUP
ext4: Remove block bitmap initialization in ext4_new_inode()
ext4: factor out block group accounting into functions
ext4: split out ext4_free_blocks_after_init()
ext4: bigalloc changes to block bitmap initialization functions
ext4: Convert block group-relative offsets to use clusters
ext4: teach ext4_ext_map_blocks() about the bigalloc feature
ext4: teach ext4_statfs() to deal with clusters if bigalloc is
enabled
ext4: tune mballoc's default group prealloc size for bigalloc file
systems
ext4: enable mounting bigalloc as read/write

fs/ext4/balloc.c | 268 +++++++++++++++++++++++++++++++++--------------------
fs/ext4/ext4.h | 47 ++++++++--
fs/ext4/extents.c | 132 +++++++++++++++++++++++---
fs/ext4/ialloc.c | 37 --------
fs/ext4/inode.c | 7 ++
fs/ext4/ioctl.c | 33 ++++++-
fs/ext4/mballoc.c | 49 ++++++----
fs/ext4/mballoc.h | 3 +-
fs/ext4/super.c | 100 ++++++++++++++++----
9 files changed, 472 insertions(+), 204 deletions(-)

--
1.7.3.1

------------------- e2fsprogs patches follow below

diff --git a/lib/ext2fs/bmap64.h b/lib/ext2fs/bmap64.h
index b0aa84c..cfbdfd6 100644
--- a/lib/ext2fs/bmap64.h
+++ b/lib/ext2fs/bmap64.h
@@ -31,6 +31,10 @@ struct ext2fs_struct_generic_bitmap {
((bmap)->magic == EXT2_ET_MAGIC_BLOCK_BITMAP64) || \
((bmap)->magic == EXT2_ET_MAGIC_INODE_BITMAP64))

+/* Bitmap flags */
+
+#define EXT2_BMFLAG_CLUSTER 0x0001
+
struct ext2_bitmap_ops {
int type;
/* Generic bmap operators */
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index a89e33b..0970506 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -228,9 +228,13 @@ struct ext2_dx_countlimit {

#define EXT2_BLOCKS_PER_GROUP(s) (EXT2_SB(s)->s_blocks_per_group)
#define EXT2_INODES_PER_GROUP(s) (EXT2_SB(s)->s_inodes_per_group)
+#define EXT2_CLUSTERS_PER_GROUP(s) (EXT2_SB(s)->s_clusters_per_group)
#define EXT2_INODES_PER_BLOCK(s) (EXT2_BLOCK_SIZE(s)/EXT2_INODE_SIZE(s))
/* limits imposed by 16-bit value gd_free_{blocks,inode}_count */
-#define EXT2_MAX_BLOCKS_PER_GROUP(s) ((1 << 16) - 8)
+#define EXT2_MAX_BLOCKS_PER_GROUP(s) (((1 << 16) - 8) * \
+ (EXT2_CLUSTER_SIZE(s) / \
+ EXT2_BLOCK_SIZE(s)))
+#define EXT2_MAX_CLUSTERS_PER_GROUP(s) ((1 << 16) - 8)
#define EXT2_MAX_INODES_PER_GROUP(s) ((1 << 16) - EXT2_INODES_PER_BLOCK(s))
#ifdef __KERNEL__
#define EXT2_DESC_PER_BLOCK(s) (EXT2_SB(s)->s_desc_per_block)
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index d3eb31d..a065e87 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -207,7 +207,7 @@ struct struct_ext2_filsys {
char * device_name;
struct ext2_super_block * super;
unsigned int blocksize;
- int clustersize;
+ int cluster_ratio;
dgrp_t group_desc_count;
unsigned long desc_blocks;
struct opaque_ext2_group_desc * group_desc;
@@ -232,7 +232,8 @@ struct struct_ext2_filsys {
/*
* Reserved for future expansion
*/
- __u32 reserved[7];
+ __u32 clustersize;
+ __u32 reserved[6];

/*
* Reserved for the use of the calling application.
@@ -553,7 +554,8 @@ typedef struct ext2_icount *ext2_icount_t;
EXT2_FEATURE_RO_COMPAT_LARGE_FILE|\
EXT4_FEATURE_RO_COMPAT_DIR_NLINK|\
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|\
- EXT4_FEATURE_RO_COMPAT_GDT_CSUM)
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM|\
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)

/*
* These features are only allowed if EXT2_FLAG_SOFTSUPP_FEATURES is passed
diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
index df095ac..60321df 100644
--- a/lib/ext2fs/gen_bitmap64.c
+++ b/lib/ext2fs/gen_bitmap64.c
@@ -559,3 +559,85 @@ int ext2fs_warn_bitmap32(ext2fs_generic_bitmap bitmap, const char *func)
"called %s with 64-bit bitmap", func);
#endif
}
+
+errcode_t ext2fs_allocate_cluster_bitmap(ext2_filsys fs,
+ const char *descr,
+ ext2fs_block_bitmap *ret)
+{
+ __u64 start, end, real_end;
+ errcode_t retval;
+
+ EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
+
+ if (!(fs->flags & EXT2_FLAG_64BITS))
+ return EXT2_ET_CANT_USE_LEGACY_BITMAPS;
+
+ fs->write_bitmaps = ext2fs_write_bitmaps;
+
+ start = (fs->super->s_first_data_block >>
+ EXT2_CLUSTER_SIZE_BITS(fs->super));
+ end = (ext2fs_blocks_count(fs->super) - 1) / fs->cluster_ratio;
+ real_end = ((__u64) EXT2_CLUSTERS_PER_GROUP(fs->super)
+ * (__u64) fs->group_desc_count)-1 + start;
+
+ retval = ext2fs_alloc_generic_bmap(fs,
+ EXT2_ET_MAGIC_BLOCK_BITMAP64,
+ EXT2FS_BMAP64_BITARRAY,
+ start, end, real_end, descr, ret);
+ if (retval)
+ return retval;
+
+ (*ret)->flags = EXT2_BMFLAG_CLUSTER;
+
+ printf("Returning 0...\n");
+ return 0;
+}
+
+int ext2fs_is_cluster_bitmap(ext2fs_block_bitmap bm)
+{
+ if (EXT2FS_IS_32_BITMAP(bm))
+ return 0;
+
+ return (bm->flags & EXT2_BMFLAG_CLUSTER);
+}
+
+errcode_t ext2fs_convert_to_cluster_bitmap(ext2_filsys fs,
+ ext2fs_block_bitmap bmap,
+ ext2fs_block_bitmap *ret)
+{
+ ext2fs_block_bitmap cmap;
+ errcode_t retval;
+ blk64_t i, j, b_end, c_end;
+ int n;
+
+ retval = ext2fs_allocate_cluster_bitmap(fs, "converted cluster bitmap",
+ ret);
+ if (retval)
+ return retval;
+
+ cmap = *ret;
+ i = bmap->start;
+ b_end = bmap->end;
+ bmap->end = bmap->real_end;
+ j = cmap->start;
+ c_end = cmap->end;
+ cmap->end = cmap->real_end;
+ n = 0;
+ while (i < bmap->real_end) {
+ if (ext2fs_test_block_bitmap2(bmap, i)) {
+ ext2fs_mark_block_bitmap2(cmap, j);
+ i += fs->cluster_ratio - n;
+ j++;
+ n = 0;
+ continue;
+ }
+ i++; n++;
+ if (n >= fs->cluster_ratio) {
+ j++;
+ n = 0;
+ }
+ }
+ bmap->end = b_end;
+ cmap->end = c_end;
+ return 0;
+}
diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
index e1f229b..00a8b38 100644
--- a/lib/ext2fs/initialize.c
+++ b/lib/ext2fs/initialize.c
@@ -94,6 +94,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
blk_t numblocks;
int rsv_gdt;
int csum_flag;
+ int bigalloc_flag;
int io_flags;
char *buf = 0;
char c;
@@ -134,12 +135,25 @@ errcode_t ext2fs_initialize(const char *name, int flags,

#define set_field(field, default) (super->field = param->field ? \
param->field : (default))
+#define assign_field(field) (super->field = param->field)

super->s_magic = EXT2_SUPER_MAGIC;
super->s_state = EXT2_VALID_FS;

- set_field(s_log_block_size, 0); /* default blocksize: 1024 bytes */
- set_field(s_log_cluster_size, 0);
+ bigalloc_flag = EXT2_HAS_RO_COMPAT_FEATURE(param,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC);
+
+ assign_field(s_log_block_size);
+
+ if (bigalloc_flag) {
+ set_field(s_log_cluster_size, super->s_log_block_size+4);
+ if (super->s_log_block_size > super->s_log_cluster_size) {
+ retval = EXT2_ET_INVALID_ARGUMENT;
+ goto cleanup;
+ }
+ } else
+ super->s_log_cluster_size = super->s_log_block_size;
+
set_field(s_first_data_block, super->s_log_block_size ? 0 : 1);
set_field(s_max_mnt_count, 0);
set_field(s_errors, EXT2_ERRORS_DEFAULT);
@@ -183,14 +197,36 @@ errcode_t ext2fs_initialize(const char *name, int flags,

fs->blocksize = EXT2_BLOCK_SIZE(super);
fs->clustersize = EXT2_CLUSTER_SIZE(super);
+ fs->cluster_ratio = fs->clustersize / fs->blocksize;
+
+ if (bigalloc_flag) {
+ if (param->s_blocks_per_group &&
+ param->s_clusters_per_group &&
+ ((param->s_clusters_per_group * fs->cluster_ratio) !=
+ param->s_blocks_per_group)) {
+ retval = EXT2_ET_INVALID_ARGUMENT;
+ goto cleanup;
+ }
+ if (param->s_clusters_per_group)
+ assign_field(s_clusters_per_group);
+ else if (param->s_blocks_per_group)
+ super->s_clusters_per_group =
+ param->s_blocks_per_group / fs->cluster_ratio;
+ else
+ super->s_clusters_per_group = fs->blocksize * 8;
+ if (super->s_clusters_per_group > EXT2_MAX_CLUSTERS_PER_GROUP(super))
+ super->s_clusters_per_group = EXT2_MAX_CLUSTERS_PER_GROUP(super);
+ super->s_blocks_per_group = super->s_clusters_per_group;
+ super->s_blocks_per_group *= fs->cluster_ratio;
+ } else {
+ set_field(s_blocks_per_group, fs->blocksize * 8);
+ if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
+ super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
+ super->s_clusters_per_group = super->s_blocks_per_group;
+ }

- /* default: (fs->blocksize*8) blocks/group, up to 2^16 (GDT limit) */
- set_field(s_blocks_per_group, fs->blocksize * 8);
- if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
- super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
- super->s_clusters_per_group = super->s_blocks_per_group;
-
- ext2fs_blocks_count_set(super, ext2fs_blocks_count(param));
+ ext2fs_blocks_count_set(super, ext2fs_blocks_count(param) &
+ ~((blk64_t) fs->cluster_ratio - 1));
ext2fs_r_blocks_count_set(super, ext2fs_r_blocks_count(param));
if (ext2fs_r_blocks_count(super) >= ext2fs_blocks_count(param)) {
retval = EXT2_ET_INVALID_ARGUMENT;
@@ -246,7 +282,7 @@ retry:
*/
ipg = ext2fs_div_ceil(super->s_inodes_count, fs->group_desc_count);
if (ipg > fs->blocksize * 8) {
- if (super->s_blocks_per_group >= 256) {
+ if (!bigalloc_flag && super->s_blocks_per_group >= 256) {
/* Try again with slightly different parameters */
super->s_blocks_per_group -= 8;
ext2fs_blocks_count_set(super,
diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
index 90abed1..8b37852 100644
--- a/lib/ext2fs/openfs.c
+++ b/lib/ext2fs/openfs.c
@@ -251,6 +251,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
goto cleanup;
}
fs->clustersize = EXT2_CLUSTER_SIZE(fs->super);
+ fs->cluster_ratio = fs->clustersize / fs->blocksize;
fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
EXT2_INODE_SIZE(fs->super) +
EXT2_BLOCK_SIZE(fs->super) - 1) /
diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c
index 3031b7d..aeea997 100644
--- a/lib/ext2fs/rw_bitmaps.c
+++ b/lib/ext2fs/rw_bitmaps.c
@@ -51,7 +51,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)

inode_nbytes = block_nbytes = 0;
if (do_block) {
- block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
+ block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
retval = ext2fs_get_memalign(fs->blocksize, fs->blocksize,
&block_buf);
if (retval)
@@ -85,7 +85,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)
/* Force bitmap padding for the last group */
nbits = ((ext2fs_blocks_count(fs->super)
- (__u64) fs->super->s_first_data_block)
- % (__u64) EXT2_BLOCKS_PER_GROUP(fs->super));
+ % (__u64) EXT2_CLUSTERS_PER_GROUP(fs->super));
if (nbits)
for (j = nbits; j < fs->blocksize * 8; j++)
ext2fs_set_bit(j, block_buf);
@@ -141,7 +141,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
char *block_bitmap = 0, *inode_bitmap = 0;
char *buf;
errcode_t retval;
- int block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
+ int block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
int inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;
int csum_flag = 0;
int do_image = fs->flags & EXT2_FLAG_IMAGE_FILE;
@@ -219,7 +219,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
}
blk = (fs->image_header->offset_blockmap /
fs->blocksize);
- blk_cnt = (blk64_t)EXT2_BLOCKS_PER_GROUP(fs->super) *
+ blk_cnt = (blk64_t)EXT2_CLUSTERS_PER_GROUP(fs->super) *
fs->group_desc_count;
while (block_nbytes > 0) {
retval = io_channel_read_blk64(fs->image_io, blk++,
diff --git a/misc/dumpe2fs.c b/misc/dumpe2fs.c
index c01ffe5..d3f617a 100644
--- a/misc/dumpe2fs.c
+++ b/misc/dumpe2fs.c
@@ -71,25 +71,26 @@ static void print_range(unsigned long long a, unsigned long long b)
printf("%llu-%llu", a, b);
}

-static void print_free (unsigned long group, char * bitmap,
- unsigned long nbytes, unsigned long offset)
+static void print_free(unsigned long group, char * bitmap,
+ unsigned long nbytes, unsigned long offset, int ratio)
{
int p = 0;
unsigned long i;
unsigned long j;

+ offset /= ratio;
offset += group * nbytes;
for (i = 0; i < nbytes; i++)
if (!in_use (bitmap, i))
{
if (p)
printf (", ");
- print_number(i + offset);
+ print_number((i + offset) * ratio);
for (j = i; j < nbytes && !in_use (bitmap, j); j++)
;
if (--j != i) {
fputc('-', stdout);
- print_number(j + offset);
+ print_number((j + offset) * ratio);
i = j;
}
p = 1;
@@ -153,7 +154,7 @@ static void list_desc (ext2_filsys fs)
blk64_t blk_itr = fs->super->s_first_data_block;
ext2_ino_t ino_itr = 1;

- block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
+ block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;

if (fs->block_map)
@@ -238,18 +239,19 @@ static void list_desc (ext2_filsys fs)
fputs(_(" Free blocks: "), stdout);
ext2fs_get_block_bitmap_range2(fs->block_map,
blk_itr, block_nbytes << 3, block_bitmap);
- print_free (i, block_bitmap,
- fs->super->s_blocks_per_group,
- fs->super->s_first_data_block);
+ print_free(i, block_bitmap,
+ fs->super->s_clusters_per_group,
+ fs->super->s_first_data_block,
+ fs->cluster_ratio);
fputc('\n', stdout);
- blk_itr += fs->super->s_blocks_per_group;
+ blk_itr += fs->super->s_clusters_per_group;
}
if (inode_bitmap) {
fputs(_(" Free inodes: "), stdout);
ext2fs_get_inode_bitmap_range2(fs->inode_map,
ino_itr, inode_nbytes << 3, inode_bitmap);
- print_free (i, inode_bitmap,
- fs->super->s_inodes_per_group, 1);
+ print_free(i, inode_bitmap,
+ fs->super->s_inodes_per_group, 1, 1);
fputc('\n', stdout);
ino_itr += fs->super->s_inodes_per_group;
}
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 9798b88..079638f 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -815,7 +815,8 @@ static __u32 ok_features[3] = {
EXT4_FEATURE_RO_COMPAT_DIR_NLINK|
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER|
- EXT4_FEATURE_RO_COMPAT_GDT_CSUM
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM|
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC
};


@@ -1252,7 +1253,7 @@ profile_error:
}

while ((c = getopt (argc, argv,
- "b:cf:g:G:i:jl:m:no:qr:s:t:vE:FI:J:KL:M:N:O:R:ST:U:V")) != EOF) {
+ "b:cg:i:jl:m:no:qr:s:t:vC:E:FG:I:J:KL:M:N:O:R:ST:U:V")) != EOF) {
switch (c) {
case 'b':
blocksize = strtol(optarg, &tmp, 0);
@@ -1275,17 +1276,17 @@ profile_error:
case 'c': /* Check for bad blocks */
cflag++;
break;
- case 'f':
+ case 'C':
size = strtoul(optarg, &tmp, 0);
- if (size < EXT2_MIN_BLOCK_SIZE ||
- size > EXT2_MAX_BLOCK_SIZE || *tmp) {
+ if (size < EXT2_MIN_CLUSTER_SIZE ||
+ size > EXT2_MAX_CLUSTER_SIZE || *tmp) {
com_err(program_name, 0,
_("invalid cluster size - %s"),
optarg);
exit(1);
}
- fprintf(stderr, _("Warning: fragments not supported. "
- "Ignoring -f option\n"));
+ fs_param.s_log_cluster_size =
+ int_log2(size >> EXT2_MIN_CLUSTER_LOG_SIZE);
break;
case 'g':
fs_param.s_blocks_per_group = strtoul(optarg, &tmp, 0);
@@ -1515,8 +1516,6 @@ profile_error:
check_plausibility(device_name);
check_mount(device_name, force, _("filesystem"));

- fs_param.s_log_cluster_size = fs_param.s_log_block_size;
-
/* Determine the size of the device (if possible) */
if (noaction && fs_blocks_count) {
dev_size = fs_blocks_count;
@@ -1752,16 +1751,24 @@ profile_error:
}
}

+ fs_param.s_log_block_size =
+ int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
+ if (fs_param.s_feature_ro_compat & EXT4_FEATURE_RO_COMPAT_BIGALLOC) {
+ if (fs_param.s_log_cluster_size == 0)
+ fs_param.s_log_cluster_size =
+ fs_param.s_log_block_size + 4;
+ } else
+ fs_param.s_log_cluster_size = fs_param.s_log_block_size;
+
if (inode_ratio == 0) {
inode_ratio = get_int_from_profile(fs_types, "inode_ratio",
8192);
if (inode_ratio < blocksize)
inode_ratio = blocksize;
+ if (inode_ratio < EXT2_CLUSTER_SIZE(&fs_param))
+ inode_ratio = EXT2_CLUSTER_SIZE(&fs_param);
}

- fs_param.s_log_cluster_size = fs_param.s_log_block_size =
- int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
-
#ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
retval = get_device_geometry(device_name, &fs_param, psector_size);
if (retval < 0) {
@@ -2049,6 +2056,33 @@ static int mke2fs_discard_device(ext2_filsys fs)
return retval;
}

+static void fix_cluster_bg_counts(ext2_filsys fs)
+{
+ blk64_t cluster, num_clusters, tot_free;
+ int grp_free, num_free, group, num;
+
+ num_clusters = ext2fs_blocks_count(fs->super) / fs->cluster_ratio;
+ tot_free = num_free = num = group = grp_free = 0;
+ for (cluster = fs->super->s_first_data_block / fs->cluster_ratio;
+ cluster < num_clusters; cluster++) {
+ if (!ext2fs_test_block_bitmap2(fs->block_map, cluster)) {
+ grp_free++;
+ tot_free++;
+ }
+ num++;
+ if ((num == fs->super->s_clusters_per_group) ||
+ (cluster == num_clusters-1)) {
+ printf("Group %d has free #: %d\n", group, grp_free);
+ ext2fs_bg_free_blocks_count_set(fs, group, grp_free);
+ ext2fs_group_desc_csum_set(fs, group);
+ num = 0;
+ grp_free = 0;
+ group++;
+ }
+ }
+ ext2fs_free_blocks_count_set(fs->super, tot_free);
+}
+
int main (int argc, char *argv[])
{
errcode_t retval = 0;
@@ -2367,6 +2401,17 @@ int main (int argc, char *argv[])
}
no_journal:

+ if (EXT2_HAS_RO_COMPAT_FEATURE(&fs_param,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext2fs_block_bitmap cluster_map;
+
+ retval = ext2fs_convert_to_cluster_bitmap(fs, fs->block_map,
+ &cluster_map);
+ ext2fs_free_block_bitmap(fs->block_map);
+ fs->block_map = cluster_map;
+ fix_cluster_bg_counts(fs);
+ }
+
if (!quiet)
printf(_("Writing superblocks and "
"filesystem accounting information: "));


2011-03-19 21:28:44

by Theodore Ts'o

Subject: [PATCH, RFC 01/12] ext4: read-only support for bigalloc file systems

This adds read-only support for bigalloc file systems. It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.
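
As an illustration only, the consistency checks this patch adds to the
mount path can be sketched in user space as follows. The struct and
function names are hypothetical, the return codes are arbitrary, and the
sketch assumes the bigalloc feature flag is set; the real checks live in
ext4_fill_super() and report errors through ext4_msg().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the superblock fields involved. */
struct geom {
    uint32_t log_block_size;     /* block size   = 1024 << log_block_size */
    uint32_t log_cluster_size;   /* cluster size = 1024 << log_cluster_size */
    uint32_t blocks_per_group;
    uint32_t clusters_per_group;
};

/* Returns 0 if the geometry is self-consistent, otherwise a negative
 * value identifying the first check that failed. */
static int check_bigalloc_geometry(const struct geom *g)
{
    assert(g != NULL);
    assert(g->log_cluster_size <= 30);      /* keep the shifts in range */

    /* A cluster may not be smaller than a block. */
    if (g->log_cluster_size < g->log_block_size)
        return -1;

    uint64_t blocksize = 1024ULL << g->log_block_size;
    uint64_t ratio = 1ULL << (g->log_cluster_size - g->log_block_size);

    /* The cluster bitmap for one group must fit in a single block. */
    if (g->clusters_per_group > blocksize * 8)
        return -2;

    /* blocks_per_group must equal clusters_per_group * blocks per cluster. */
    if ((uint64_t)g->blocks_per_group != g->clusters_per_group * ratio)
        return -3;

    return 0;
}
```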

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 18 ++++++++++++++--
fs/ext4/super.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3aa0b72..94a7a7b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -231,12 +231,16 @@ struct ext4_io_submit {
#define EXT4_MAX_BLOCK_LOG_SIZE 16
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
+# define EXT4_CLUSTER_SIZE(s) (EXT4_SB(s)->s_clustersize)
#else
+# define EXT4_CLUSTER_SIZE(s) (EXT4_MIN_BLOCK_SIZE << \
+ (s)->s_log_cluster_size)
# define EXT4_BLOCK_SIZE(s) (EXT4_MIN_BLOCK_SIZE << (s)->s_log_block_size)
#endif
#define EXT4_ADDR_PER_BLOCK(s) (EXT4_BLOCK_SIZE(s) / sizeof(__u32))
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_blocksize_bits)
+# define EXT4_CLUSTER_SIZE_BITS(s) (EXT4_SB(s)->s_clustersize_bits)
#else
# define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_log_block_size + 10)
#endif
@@ -302,6 +306,7 @@ struct flex_groups {
#define EXT4_DESC_SIZE(s) (EXT4_SB(s)->s_desc_size)
#ifdef __KERNEL__
# define EXT4_BLOCKS_PER_GROUP(s) (EXT4_SB(s)->s_blocks_per_group)
+# define EXT4_CLUSTERS_PER_GROUP(s) (EXT4_SB(s)->s_clusters_per_group)
# define EXT4_DESC_PER_BLOCK(s) (EXT4_SB(s)->s_desc_per_block)
# define EXT4_INODES_PER_GROUP(s) (EXT4_SB(s)->s_inodes_per_group)
# define EXT4_DESC_PER_BLOCK_BITS(s) (EXT4_SB(s)->s_desc_per_block_bits)
@@ -957,9 +962,9 @@ struct ext4_super_block {
/*10*/ __le32 s_free_inodes_count; /* Free inodes count */
__le32 s_first_data_block; /* First Data Block */
__le32 s_log_block_size; /* Block size */
- __le32 s_obso_log_frag_size; /* Obsoleted fragment size */
+ __le32 s_log_cluster_size; /* Allocation cluster size */
/*20*/ __le32 s_blocks_per_group; /* # Blocks per group */
- __le32 s_obso_frags_per_group; /* Obsoleted fragments per group */
+ __le32 s_clusters_per_group; /* # Clusters per group */
__le32 s_inodes_per_group; /* # Inodes per group */
__le32 s_mtime; /* Mount time */
/*30*/ __le32 s_wtime; /* Write time */
@@ -1055,7 +1060,10 @@ struct ext4_super_block {
__u8 s_last_error_func[32]; /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
__u8 s_mount_opts[64];
- __le32 s_reserved[112]; /* Padding to the end of the block */
+ __le32 s_usr_quota_inum; /* inode for tracking user quota */
+ __le32 s_grp_quota_inum; /* inode for tracking group quota */
+ __le32 s_overhead_blocks; /* overhead blocks/clusters in fs */
+ __le32 s_reserved[109]; /* Padding to the end of the block */
};

#define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
@@ -1075,6 +1083,7 @@ struct ext4_sb_info {
unsigned long s_desc_size; /* Size of a group descriptor in bytes */
unsigned long s_inodes_per_block;/* Number of inodes per block */
unsigned long s_blocks_per_group;/* Number of blocks in a group */
+ unsigned long s_clusters_per_group; /* Number of clusters in a group */
unsigned long s_inodes_per_group;/* Number of inodes in a group */
unsigned long s_itb_per_group; /* Number of inode table blocks per group */
unsigned long s_gdb_count; /* Number of group descriptor blocks */
@@ -1083,6 +1092,8 @@ struct ext4_sb_info {
ext4_group_t s_blockfile_groups;/* Groups acceptable for non-extent files */
unsigned long s_overhead_last; /* Last calculated overhead */
unsigned long s_blocks_last; /* Last seen block count */
+ unsigned int s_cluster_ratio; /* Number of blocks per cluster */
+ unsigned int s_cluster_bits; /* log2 of s_cluster_ratio */
loff_t s_bitmap_maxbytes; /* max bytes for bitmap files */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext4_super_block *s_es; /* Pointer to the super block in the buffer */
@@ -1338,6 +1349,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
+#define EXT4_FEATURE_RO_COMPAT_BIGALLOC 0x0200

#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b357c27..7273728 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1875,7 +1875,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
res = MS_RDONLY;
}
if (read_only)
- return res;
+ goto done;
if (!(sbi->s_mount_state & EXT4_VALID_FS))
ext4_msg(sb, KERN_WARNING, "warning: mounting unchecked fs, "
"running e2fsck is recommended");
@@ -1906,6 +1906,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);

ext4_commit_super(sb, 1);
+done:
if (test_opt(sb, DEBUG))
printk(KERN_INFO "[EXT4 FS bs=%lu, gc=%u, "
"bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x]\n",
@@ -3022,10 +3023,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
char *cp;
const char *descr;
int ret = -ENOMEM;
- int blocksize;
+ int blocksize, clustersize;
unsigned int db_count;
unsigned int i;
- int needs_recovery, has_huge_files;
+ int needs_recovery, has_huge_files, has_bigalloc;
__u64 blocks_count;
int err;
unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
@@ -3276,12 +3277,53 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_dirt = 1;
}

- if (sbi->s_blocks_per_group > blocksize * 8) {
- ext4_msg(sb, KERN_ERR,
- "#blocks per group too big: %lu",
- sbi->s_blocks_per_group);
- goto failed_mount;
+ /* Handle clustersize */
+ clustersize = BLOCK_SIZE << le32_to_cpu(es->s_log_cluster_size);
+ has_bigalloc = EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC);
+ if (has_bigalloc) {
+ if (clustersize < blocksize) {
+ ext4_msg(sb, KERN_ERR,
+ "cluster size (%d) smaller than "
+ "block size (%d)", clustersize, blocksize);
+ goto failed_mount;
+ }
+ sbi->s_cluster_bits = le32_to_cpu(es->s_log_cluster_size) -
+ le32_to_cpu(es->s_log_block_size);
+ sbi->s_clusters_per_group =
+ le32_to_cpu(es->s_clusters_per_group);
+ if (sbi->s_clusters_per_group > blocksize * 8) {
+ ext4_msg(sb, KERN_ERR,
+ "#clusters per group too big: %lu",
+ sbi->s_clusters_per_group);
+ goto failed_mount;
+ }
+ if (sbi->s_blocks_per_group !=
+ (sbi->s_clusters_per_group * (clustersize / blocksize))) {
+ ext4_msg(sb, KERN_ERR, "blocks per group (%lu) and "
+ "clusters per group (%lu) inconsistent",
+ sbi->s_blocks_per_group,
+ sbi->s_clusters_per_group);
+ goto failed_mount;
+ }
+ } else {
+ if (clustersize != blocksize) {
+ ext4_warning(sb, "fragment/cluster size (%d) != "
+ "block size (%d)", clustersize,
+ blocksize);
+ clustersize = blocksize;
+ }
+ if (sbi->s_blocks_per_group > blocksize * 8) {
+ ext4_msg(sb, KERN_ERR,
+ "#blocks per group too big: %lu",
+ sbi->s_blocks_per_group);
+ goto failed_mount;
+ }
+ sbi->s_clusters_per_group = sbi->s_blocks_per_group;
+ sbi->s_cluster_bits = 0;
}
+ sbi->s_cluster_ratio = clustersize / blocksize;
+
if (sbi->s_inodes_per_group > blocksize * 8) {
ext4_msg(sb, KERN_ERR,
"#inodes per group too big: %lu",
--
1.7.3.1


2011-03-19 21:37:23

by Theodore Ts'o

Subject: [PATCH, RFC 07/12] ext4: bigalloc changes to block bitmap initialization functions

Add bigalloc support to ext4_init_block_bitmap() and
ext4_free_blocks_after_init().
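
The diff below relies on EXT4_B2C() and EXT4_NUM_B2C(), whose definitions
are in the ext4.h hunk not reproduced here. The following is a minimal
sketch of the conversions they are assumed to perform (a shift by
s_cluster_bits, rounding up for counts); the function names are
hypothetical and this is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical equivalent of a block-to-cluster conversion: map a
 * group-relative block offset to the cluster (bitmap bit) containing it.
 * cluster_bits is log2 of the number of blocks per cluster. */
static uint64_t block_to_cluster(uint64_t blk, unsigned cluster_bits)
{
    assert(cluster_bits < 32);
    return blk >> cluster_bits;
}

/* Hypothetical equivalent of a block-count-to-cluster-count conversion:
 * number of clusters needed to hold `blks` blocks, rounding up so that
 * a partially used cluster still counts as one. */
static uint64_t blocks_to_clusters(uint64_t blks, unsigned cluster_bits)
{
    assert(cluster_bits < 32);
    uint64_t ratio = 1ULL << cluster_bits;  /* blocks per cluster */
    return (blks + ratio - 1) >> cluster_bits;
}
```

With 256 blocks per cluster (cluster_bits of 8), blocks 0 through 255 all
map to bit 0 of the bitmap, and a short last group of 257 blocks still
needs 2 bits.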

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 131 +++++++++++++++++++++++++++++++++++++-----------------
fs/ext4/ext4.h | 13 +++++
2 files changed, 103 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 5b60c8b..930615d 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -21,9 +21,6 @@
#include "ext4_jbd2.h"
#include "mballoc.h"

-static unsigned int num_base_meta_blocks(struct super_block *sb,
- ext4_group_t block_group);
-
/*
* balloc.c contains the blocks allocation and deallocation routines
*/
@@ -56,37 +53,87 @@ static int ext4_block_in_group(struct super_block *sb, ext4_fsblk_t block,
return 0;
}

-static int ext4_group_used_meta_blocks(struct super_block *sb,
- ext4_group_t block_group,
- struct ext4_group_desc *gdp)
+/* Return the number of clusters used for file system metadata; this
+ * represents the overhead needed by the file system.
+ */
+unsigned ext4_num_overhead_clusters(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
{
- ext4_fsblk_t tmp;
+ unsigned num_clusters;
+ int block_cluster = -1, inode_cluster = -1, itbl_cluster = -1, i, c;
+ ext4_fsblk_t start = ext4_group_first_block_no(sb, block_group);
+ ext4_fsblk_t itbl_blk;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- /* block bitmap, inode bitmap, and inode table blocks */
- int used_blocks = sbi->s_itb_per_group + 2;

- if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
- if (!ext4_block_in_group(sb, ext4_block_bitmap(sb, gdp),
- block_group))
- used_blocks--;
-
- if (!ext4_block_in_group(sb, ext4_inode_bitmap(sb, gdp),
- block_group))
- used_blocks--;
-
- tmp = ext4_inode_table(sb, gdp);
- for (; tmp < ext4_inode_table(sb, gdp) +
- sbi->s_itb_per_group; tmp++) {
- if (!ext4_block_in_group(sb, tmp, block_group))
- used_blocks -= 1;
+ /* This is the number of clusters used by the superblock,
+ * block group descriptors, and reserved block group
+ * descriptor blocks */
+ num_clusters = ext4_num_base_meta_clusters(sb, block_group);
+
+ /*
+ * For the allocation bitmaps and inode table, we first need
+ * to check to see if the block is in the block group. If it
+ * is, then check to see if the cluster is already accounted
+ * for in the clusters used for the base metadata cluster, or
+ * if we can increment the base metadata cluster to include
+ * that block. Otherwise, we will have to track the cluster
+ * used for the allocation bitmap or inode table explicitly.
+ * Normally all of these blocks are contiguous, so the special
+ * case handling shouldn't be necessary except for *very*
+ * unusual file system layouts.
+ */
+ if (ext4_block_in_group(sb, ext4_block_bitmap(sb, gdp), block_group)) {
+ block_cluster = EXT4_B2C(sbi, (ext4_block_bitmap(sb, gdp) -
+ start));
+ if (block_cluster < num_clusters)
+ block_cluster = -1;
+ else if (block_cluster == num_clusters) {
+ num_clusters++;
+ block_cluster = -1;
+ }
+ }
+
+ if (ext4_block_in_group(sb, ext4_inode_bitmap(sb, gdp), block_group)) {
+ inode_cluster = EXT4_B2C(sbi,
+ ext4_inode_bitmap(sb, gdp) - start);
+ if (inode_cluster < num_clusters)
+ inode_cluster = -1;
+ else if (inode_cluster == num_clusters) {
+ num_clusters++;
+ inode_cluster = -1;
+ }
+ }
+
+ itbl_blk = ext4_inode_table(sb, gdp);
+ for (i = 0; i < sbi->s_itb_per_group; i++) {
+ if (ext4_block_in_group(sb, itbl_blk + i, block_group)) {
+ c = EXT4_B2C(sbi, itbl_blk + i - start);
+ if ((c < num_clusters) || (c == inode_cluster) ||
+ (c == block_cluster) || (c == itbl_cluster))
+ continue;
+ if (c == num_clusters) {
+ num_clusters++;
+ continue;
+ }
+ num_clusters++;
+ itbl_cluster = c;
}
}
- return used_blocks;
+
+ if (block_cluster != -1)
+ num_clusters++;
+ if (inode_cluster != -1)
+ num_clusters++;
+
+ return num_clusters;
}

-static unsigned int num_blocks_in_group(struct super_block *sb,
- ext4_group_t block_group)
+static unsigned int num_clusters_in_group(struct super_block *sb,
+ ext4_group_t block_group)
{
+ unsigned int blocks;
+
if (block_group == ext4_get_groups_count(sb) - 1) {
/*
* Even though mke2fs always initializes the first and
@@ -94,10 +141,11 @@ static unsigned int num_blocks_in_group(struct super_block *sb,
* we need to make sure we calculate the right free
* blocks.
*/
- return ext4_blocks_count(EXT4_SB(sb)->s_es) -
+ blocks = ext4_blocks_count(EXT4_SB(sb)->s_es) -
ext4_group_first_block_no(sb, block_group);
} else
- return EXT4_BLOCKS_PER_GROUP(sb);
+ blocks = EXT4_BLOCKS_PER_GROUP(sb);
+ return EXT4_NUM_B2C(EXT4_SB(sb), blocks);
}

/* Initializes an uninitialized block bitmap */
@@ -105,7 +153,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
- unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
+ unsigned int bit, bit_max;
struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_fsblk_t start, tmp;
int flex_bg = 0;
@@ -124,6 +172,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
}
memset(bh->b_data, 0, sb->s_blocksize);

+ bit_max = ext4_num_base_meta_clusters(sb, block_group);
for (bit = 0; bit < bit_max; bit++)
ext4_set_bit(bit, bh->b_data);

@@ -135,24 +184,25 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
/* Set bits for block and inode bitmaps, and inode table */
tmp = ext4_block_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);

tmp = ext4_inode_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);

tmp = ext4_inode_table(sb, gdp);
for (; tmp < ext4_inode_table(sb, gdp) +
sbi->s_itb_per_group; tmp++) {
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);
}
+
/*
* Also if the number of blocks within the group is less than
* the blocksize * 8 ( which is the size of bitmap ), set rest
* of the block bitmap to 1
*/
- ext4_mark_bitmap_end(num_blocks_in_group(sb, block_group),
+ ext4_mark_bitmap_end(num_clusters_in_group(sb, block_group),
sb->s_blocksize * 8, bh->b_data);
}

@@ -163,9 +213,8 @@ unsigned ext4_free_blocks_after_init(struct super_block *sb,
ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
- return num_blocks_in_group(sb, block_group) -
- num_base_meta_blocks(sb, block_group) -
- ext4_group_used_meta_blocks(sb, block_group, gdp);
+ return num_clusters_in_group(sb, block_group) -
+ ext4_num_overhead_clusters(sb, block_group, gdp);
}

/*
@@ -732,14 +781,14 @@ unsigned long ext4_bg_num_gdb(struct super_block *sb, ext4_group_t group)
}

/*
- * This function returns the number of file system metadata blocks at
+ * This function returns the number of file system metadata clusters at
* the beginning of a block group, including the reserved gdt blocks.
*/
-static unsigned int num_base_meta_blocks(struct super_block *sb,
- ext4_group_t block_group)
+unsigned ext4_num_base_meta_clusters(struct super_block *sb,
+ ext4_group_t block_group)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- int num;
+ unsigned num;

/* Check for superblock and gdt backups in this group */
num = ext4_bg_has_super(sb, block_group);
@@ -754,5 +803,5 @@ static unsigned int num_base_meta_blocks(struct super_block *sb,
} else { /* For META_BG_BLOCK_GROUPS */
num += ext4_bg_num_gdb(sb, block_group);
}
- return num;
+ return EXT4_NUM_B2C(sbi, num);
}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0eb7407..d373931 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -258,6 +258,14 @@ struct ext4_io_submit {
#endif
#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size), (1 << (blkbits)))

+/* Translate a block number to a cluster number */
+#define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
+/* Translate a cluster number to a block number */
+#define EXT4_C2B(sbi, cluster) ((cluster) << (sbi)->s_cluster_bits)
+/* Translate # of blks to # of clusters */
+#define EXT4_NUM_B2C(sbi, blks) (((blks) + (sbi)->s_cluster_ratio - 1) >> \
+ (sbi)->s_cluster_bits)
+
/*
* Structure of a blocks group descriptor
*/
@@ -1669,6 +1677,11 @@ extern void ext4_init_block_bitmap(struct super_block *sb,
extern unsigned ext4_free_blocks_after_init(struct super_block *sb,
ext4_group_t block_group,
struct ext4_group_desc *gdp);
+extern unsigned ext4_num_base_meta_clusters(struct super_block *sb,
+ ext4_group_t block_group);
+extern unsigned ext4_num_overhead_clusters(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp);

/* dir.c */
extern int __ext4_check_dir_entry(const char *, unsigned int, struct inode *,
--
1.7.3.1
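
As a reading aid for the EXT4_B2C, EXT4_C2B and EXT4_NUM_B2C macros added
to ext4.h above, here is a minimal user-space sketch of the same
arithmetic. The function names and the bare cluster_bits parameter are
assumptions for illustration only; they stand in for the s_cluster_bits
and s_cluster_ratio fields of struct ext4_sb_info and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-ins for the macros.  cluster_bits plays
 * the role of sbi->s_cluster_bits; the cluster ratio is 1 << cluster_bits.
 */

/* Block number -> cluster number (rounds down). */
static uint64_t b2c(unsigned cluster_bits, uint64_t blk)
{
	assert(cluster_bits < 64);
	return blk >> cluster_bits;
}

/* Cluster number -> number of the first block in that cluster. */
static uint64_t c2b(unsigned cluster_bits, uint64_t cluster)
{
	assert(cluster_bits < 64);
	return cluster << cluster_bits;
}

/* Count of blocks -> count of clusters needed to hold them (rounds up). */
static uint64_t num_b2c(unsigned cluster_bits, uint64_t blks)
{
	uint64_t ratio;

	assert(cluster_bits < 64);
	ratio = (uint64_t)1 << cluster_bits;
	return (blks + ratio - 1) >> cluster_bits;
}
```

With 16 blocks per cluster (cluster_bits = 4), block 35 falls in cluster 2,
while 17 blocks of metadata occupy 2 clusters. The round-up form is the one
used when counting overhead, since a partly used cluster is not available
for allocation.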


2011-03-19 21:37:23

by Theodore Ts'o

Subject: [PATCH, RFC 12/12] ext4: enable mounting bigalloc as read/write

Now that we have implemented all of the changes needed for bigalloc,
we can finally enable it!

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d373931..05e9ba5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1384,7 +1384,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)

/*
* Default values for user and/or group using reserved blocks
--
1.7.3.1


2011-03-19 21:37:23

by Theodore Ts'o

Subject: [PATCH, RFC 08/12] ext4: Convert block group-relative offsets to use clusters

Certain parts of the ext4 code base, primarily in mballoc.c, use a
block group number and offset from the beginning of the block group.
This offset is invariably used to index into the allocation bitmap, so
change the offset to be denominated in units of clusters.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 6 ++++--
fs/ext4/mballoc.h | 3 ++-
2 files changed, 6 insertions(+), 3 deletions(-)
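
To illustrate the conversion described in the commit message, here is a
small user-space sketch of mapping a block number to (group, cluster
offset) and back. The struct geom type and both function names are
hypothetical stand-ins for the superblock fields and for
ext4_get_group_no_and_offset() / ext4_grp_offs_to_block(); this is a sketch
under those assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few superblock fields used here. */
struct geom {
	uint64_t first_data_block;	/* like es->s_first_data_block */
	uint32_t blocks_per_group;	/* like EXT4_BLOCKS_PER_GROUP(sb) */
	unsigned cluster_bits;		/* like sbi->s_cluster_bits */
};

/* Block number -> group number and cluster offset within the group. */
static void group_and_cluster_offset(const struct geom *g, uint64_t blocknr,
				     uint32_t *group, uint32_t *offset)
{
	assert(blocknr >= g->first_data_block);
	blocknr -= g->first_data_block;
	*group = (uint32_t)(blocknr / g->blocks_per_group);
	/* The remainder is in blocks; shift it down to clusters. */
	*offset = (uint32_t)(blocknr % g->blocks_per_group) >> g->cluster_bits;
}

/* Group number and cluster offset -> first block of that cluster. */
static uint64_t cluster_offset_to_block(const struct geom *g, uint32_t group,
					uint32_t offset)
{
	return g->first_data_block +
	       (uint64_t)group * g->blocks_per_group +
	       ((uint64_t)offset << g->cluster_bits);
}
```

Note that the round trip is lossy by design: going back from an offset
yields the first block of the cluster, not necessarily the block that was
passed in.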

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 930615d..1cd7afb 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -26,7 +26,8 @@
*/

/*
- * Calculate the block group number and offset, given a block number
+ * Calculate the block group number and offset into the block/cluster
+ * allocation bitmap, given a block number
*/
void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
ext4_group_t *blockgrpp, ext4_grpblk_t *offsetp)
@@ -35,7 +36,8 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
ext4_grpblk_t offset;

blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
- offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb));
+ offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >>
+ EXT4_SB(sb)->s_cluster_bits;
if (offsetp)
*offsetp = offset;
if (blockgrpp)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 22bd4d7..1a75182 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -223,6 +223,7 @@ struct ext4_buddy {
static inline ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb,
struct ext4_free_extent *fex)
{
- return ext4_group_first_block_no(sb, fex->fe_group) + fex->fe_start;
+ return ext4_group_first_block_no(sb, fex->fe_group) +
+ (fex->fe_start << EXT4_SB(sb)->s_cluster_bits);
}
#endif
--
1.7.3.1


2011-03-19 21:37:24

by Theodore Ts'o

Subject: [PATCH, RFC 02/12] ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)

At least initially, if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl. This simplifies the initial implementation.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/inode.c | 7 +++++++
fs/ext4/ioctl.c | 33 +++++++++++++++++++++++++++++----
fs/ext4/super.c | 7 +++++++
3 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 67e7a3c..3bf751c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1008,6 +1008,13 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
/*
* Okay, we need to do block allocation.
*/
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ EXT4_ERROR_INODE(inode, "Can't allocate blocks for "
+ "non-extent mapped inodes with bigalloc");
+ return -ENOSPC;
+ }
+
goal = ext4_find_goal(inode, map->m_lblk, partial);

/* the number of blocks need to allocate for [d,t]indirect blocks */
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index c052c9f..1231e25 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -21,6 +21,7 @@
long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned int flags;

@@ -183,7 +184,6 @@ setversion_out:
* Returns 1 if it slept, else zero.
*/
{
- struct super_block *sb = inode->i_sb;
DECLARE_WAITQUEUE(wait, current);
int ret = 0;

@@ -199,7 +199,6 @@ setversion_out:
#endif
case EXT4_IOC_GROUP_EXTEND: {
ext4_fsblk_t n_blocks_count;
- struct super_block *sb = inode->i_sb;
int err, err2=0;

if (!capable(CAP_SYS_RESOURCE))
@@ -212,6 +211,13 @@ setversion_out:
if (err)
return err;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online resizing not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -252,6 +258,13 @@ setversion_out:
if (err)
goto mext_out;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_move_extents(filp, donor_filp, me.orig_start,
me.donor_start, me.len, &me.moved_len);
mnt_drop_write(filp->f_path.mnt);
@@ -268,7 +281,6 @@ mext_out:

case EXT4_IOC_GROUP_ADD: {
struct ext4_new_group_data input;
- struct super_block *sb = inode->i_sb;
int err, err2=0;

if (!capable(CAP_SYS_RESOURCE))
@@ -282,6 +294,13 @@ mext_out:
if (err)
return err;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online resizing not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_group_add(sb, &input);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -333,7 +352,6 @@ mext_out:

case FITRIM:
{
- struct super_block *sb = inode->i_sb;
struct request_queue *q = bdev_get_queue(sb->s_bdev);
struct fstrim_range range;
int ret = 0;
@@ -344,6 +362,13 @@ mext_out:
if (!blk_queue_discard(q))
return -EOPNOTSUPP;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "FITRIM not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
if (copy_from_user(&range, (struct fstrim_range *)arg,
sizeof(range)))
return -EFAULT;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7273728..f9b25cd 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2593,6 +2593,13 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
return 0;
}
}
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC) &&
+ !EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
+ ext4_msg(sb, KERN_ERR,
+ "Can't support bigalloc feature without "
+ "extents feature\n");
+ return 0;
+ }
return 1;
}

--
1.7.3.1


2011-03-19 21:37:23

by Theodore Ts'o

Subject: [PATCH, RFC 11/12] ext4: tune mballoc's default group prealloc size for bigalloc file systems

The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size. This is
probably too big for large cluster sizes. So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.
---
fs/ext4/mballoc.c | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)
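
A worked version of the arithmetic in the commit message, as a user-space
sketch. The constant 512 and the floor of 32 clusters come from the patch;
the function names and the block_size parameter are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_GROUP_PREALLOC	512u	/* value of MB_DEFAULT_GROUP_PREALLOC */
#define MIN_PREALLOC_CLUSTERS	32u	/* floor proposed in this patch */

/* Default group preallocation, in clusters, for a given cluster size. */
static unsigned default_group_prealloc(unsigned cluster_bits)
{
	unsigned scaled;

	assert(cluster_bits < 32);
	scaled = DEFAULT_GROUP_PREALLOC >> cluster_bits;
	return scaled > MIN_PREALLOC_CLUSTERS ? scaled : MIN_PREALLOC_CLUSTERS;
}

/* The same default expressed in bytes. */
static uint64_t prealloc_bytes(unsigned block_size, unsigned cluster_bits)
{
	uint64_t cluster_size = (uint64_t)block_size << cluster_bits;

	return (uint64_t)default_group_prealloc(cluster_bits) * cluster_size;
}
```

For 4k blocks this gives 2 megabytes for cluster sizes up to 64k
(cluster_bits 0 through 4), 8 megabytes at 256k, and 32 megabytes at
1 megabyte, matching the figures in the comment added by the patch.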

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 02f099f..ad4a6eb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2529,7 +2529,20 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
sbi->s_mb_stats = MB_DEFAULT_STATS;
sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
- sbi->s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC;
+ /*
+ * The default group preallocation is 512, which for 4k block
+ * sizes translates to 2 megabytes. However for bigalloc file
+ * systems, this is probably too big (e.g., if the cluster size
+ * is 1 megabyte, then group preallocation size becomes half a
+ * gigabyte!). As a default, we will keep a two megabyte
+ * group prealloc size for cluster sizes up to 64k, and after
+ * that, we will force a minimum group preallocation size of
+ * 32 clusters. This translates to 8 megs when the cluster
+ * size is 256k, and 32 megs when the cluster size is 1 meg,
+ * which seems reasonable as a default.
+ */
+ sbi->s_mb_group_prealloc = max(MB_DEFAULT_GROUP_PREALLOC >>
+ sbi->s_cluster_bits, 32);

sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
if (sbi->s_locality_groups == NULL) {
--
1.7.3.1


2011-03-19 21:37:23

by Theodore Ts'o

Subject: [PATCH, RFC 05/12] ext4: factor out block group accounting into functions

This makes it easier to understand how ext4_init_block_bitmap() works,
and it will assist when we split out ext4_free_blocks_after_init() in
the next commit.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 80 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 48 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index adf96b8..c608fda 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -21,6 +21,9 @@
#include "ext4_jbd2.h"
#include "mballoc.h"

+static unsigned int num_base_meta_blocks(struct super_block *sb,
+ ext4_group_t block_group);
+
/*
* balloc.c contains the blocks allocation and deallocation routines
*/
@@ -81,14 +84,30 @@ static int ext4_group_used_meta_blocks(struct super_block *sb,
return used_blocks;
}

+static unsigned int num_blocks_in_group(struct super_block *sb,
+ ext4_group_t block_group)
+{
+ if (block_group == ext4_get_groups_count(sb) - 1) {
+ /*
+ * Even though mke2fs always initializes the first and
+ * last group, just in case some other tool was used,
+ * we need to make sure we calculate the right free
+ * blocks.
+ */
+ return ext4_blocks_count(EXT4_SB(sb)->s_es) -
+ ext4_group_first_block_no(sb, block_group);
+ } else
+ return EXT4_BLOCKS_PER_GROUP(sb);
+}
+
/* Initializes an uninitialized block bitmap if given, and returns the
* number of blocks free in the group. */
unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group, struct ext4_group_desc *gdp)
{
- int bit, bit_max;
+ unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
ext4_group_t ngroups = ext4_get_groups_count(sb);
- unsigned free_blocks, group_blocks;
+ unsigned group_blocks = num_blocks_in_group(sb, block_group);
struct ext4_sb_info *sbi = EXT4_SB(sb);

if (bh) {
@@ -108,35 +127,6 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
memset(bh->b_data, 0, sb->s_blocksize);
}

- /* Check for superblock and gdt backups in this group */
- bit_max = ext4_bg_has_super(sb, block_group);
-
- if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
- block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
- sbi->s_desc_per_block) {
- if (bit_max) {
- bit_max += ext4_bg_num_gdb(sb, block_group);
- bit_max +=
- le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
- }
- } else { /* For META_BG_BLOCK_GROUPS */
- bit_max += ext4_bg_num_gdb(sb, block_group);
- }
-
- if (block_group == ngroups - 1) {
- /*
- * Even though mke2fs always initialize first and last group
- * if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
- * to make sure we calculate the right free blocks
- */
- group_blocks = ext4_blocks_count(sbi->s_es) -
- ext4_group_first_block_no(sb, ngroups - 1);
- } else {
- group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
- }
-
- free_blocks = group_blocks - bit_max;
-
if (bh) {
ext4_fsblk_t start, tmp;
int flex_bg = 0;
@@ -174,7 +164,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
bh->b_data);
}
- return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
+ return group_blocks - bit_max -
+ ext4_group_used_meta_blocks(sb, block_group, gdp);
}


@@ -741,3 +732,28 @@ unsigned long ext4_bg_num_gdb(struct super_block *sb, ext4_group_t group)

}

+/*
+ * This function returns the number of file system metadata blocks at
+ * the beginning of a block group, including the reserved gdt blocks.
+ */
+static unsigned int num_base_meta_blocks(struct super_block *sb,
+ ext4_group_t block_group)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int num;
+
+ /* Check for superblock and gdt backups in this group */
+ num = ext4_bg_has_super(sb, block_group);
+
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
+ block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
+ sbi->s_desc_per_block) {
+ if (num) {
+ num += ext4_bg_num_gdb(sb, block_group);
+ num += le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
+ }
+ } else { /* For META_BG_BLOCK_GROUPS */
+ num += ext4_bg_num_gdb(sb, block_group);
+ }
+ return num;
+}
--
1.7.3.1


2011-03-19 21:37:24

by Theodore Ts'o

Subject: [PATCH, RFC 03/12] ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP

Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP is
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems). There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/mballoc.c | 34 +++++++++++++++++-----------------
1 files changed, 17 insertions(+), 17 deletions(-)
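
To make the distinction between the two macros concrete, the sketch below
works out the group geometry for a full-size group. It assumes the bitmap
fills exactly one file system block, so the number of clusters per group
is block_size * 8; in the kernel the actual values are read from the
superblock. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per cluster. */
static uint64_t clusters_per_group(unsigned block_size)
{
	return (uint64_t)block_size * 8;
}

/* Blocks covered by a group: clusters per group times the cluster ratio. */
static uint64_t blocks_per_group(unsigned block_size, unsigned cluster_bits)
{
	assert(cluster_bits < 32);
	return clusters_per_group(block_size) << cluster_bits;
}

/* Bytes of disk addressed by one block group. */
static uint64_t group_bytes(unsigned block_size, unsigned cluster_bits)
{
	return blocks_per_group(block_size, cluster_bits) * block_size;
}
```

With 4k blocks and no clustering the two counts coincide at 32768 and a
group spans 128 megabytes. With 1 megabyte clusters (cluster_bits = 8) the
bitmap still has 32768 bits, but the group spans 8388608 blocks, or 32
gigabytes, which is why any bound on a bitmap index has to use the cluster
count.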

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2f6f0dd..02f099f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -651,7 +651,7 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
ext4_grpblk_t chunk;
unsigned short border;

- BUG_ON(len > EXT4_BLOCKS_PER_GROUP(sb));
+ BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb));

border = 2 << sb->s_blocksize_bits;

@@ -703,7 +703,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
void *buddy, void *bitmap, ext4_group_t group)
{
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
- ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
ext4_grpblk_t i = 0;
ext4_grpblk_t first;
ext4_grpblk_t len;
@@ -1680,8 +1680,8 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
struct ext4_free_extent *gex = &ac->ac_g_ex;

BUG_ON(ex->fe_len <= 0);
- BUG_ON(ex->fe_len > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
- BUG_ON(ex->fe_start >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);

ac->ac_found++;
@@ -1879,8 +1879,8 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,

while (free && ac->ac_status == AC_STATUS_CONTINUE) {
i = mb_find_next_zero_bit(bitmap,
- EXT4_BLOCKS_PER_GROUP(sb), i);
- if (i >= EXT4_BLOCKS_PER_GROUP(sb)) {
+ EXT4_CLUSTERS_PER_GROUP(sb), i);
+ if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) {
/*
* IF we have corrupt bitmap, we won't find any
* free blocks even though group info says we
@@ -1943,7 +1943,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
do_div(a, sbi->s_stripe);
i = (a * sbi->s_stripe) - first_group_block;

- while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
+ while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
if (!mb_test_bit(i, bitmap)) {
max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
if (max >= sbi->s_stripe) {
@@ -3071,7 +3071,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
}
BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
start > ac->ac_o_ex.fe_logical);
- BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(size <= 0 || size > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));

/* now prepare goal request */

@@ -3724,7 +3724,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
}

if (needed == 0)
- needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
+ needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;

INIT_LIST_HEAD(&list);
repeat:
@@ -4039,8 +4039,8 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
len = ar->len;

/* just a dirty hack to filter too big requests */
- if (len >= EXT4_BLOCKS_PER_GROUP(sb) - 10)
- len = EXT4_BLOCKS_PER_GROUP(sb) - 10;
+ if (len >= EXT4_CLUSTERS_PER_GROUP(sb) - 10)
+ len = EXT4_CLUSTERS_PER_GROUP(sb) - 10;

/* start searching from the goal */
goal = ar->goal;
@@ -4579,8 +4579,8 @@ do_more:
* Check to see if we are freeing blocks across a group
* boundary.
*/
- if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
- overflow = bit + count - EXT4_BLOCKS_PER_GROUP(sb);
+ if (bit + count > EXT4_CLUSTERS_PER_GROUP(sb)) {
+ overflow = bit + count - EXT4_CLUSTERS_PER_GROUP(sb);
count -= overflow;
}
bitmap_bh = ext4_read_block_bitmap(sb, block_group);
@@ -4844,7 +4844,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
minlen = range->minlen >> sb->s_blocksize_bits;
trimmed = 0;

- if (unlikely(minlen > EXT4_BLOCKS_PER_GROUP(sb)))
+ if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)))
return -EINVAL;
if (start < first_data_blk) {
len -= first_data_blk - start;
@@ -4857,7 +4857,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) (start + len),
&last_group, &last_block);
last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group;
- last_block = EXT4_BLOCKS_PER_GROUP(sb);
+ last_block = EXT4_CLUSTERS_PER_GROUP(sb);

if (first_group > last_group)
return -EINVAL;
@@ -4870,8 +4870,8 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
break;
}

- if (len >= EXT4_BLOCKS_PER_GROUP(sb))
- len -= (EXT4_BLOCKS_PER_GROUP(sb) - first_block);
+ if (len >= EXT4_CLUSTERS_PER_GROUP(sb))
+ len -= (EXT4_CLUSTERS_PER_GROUP(sb) - first_block);
else
last_block = first_block + len;

--
1.7.3.1


2011-03-19 21:37:24

by Theodore Ts'o

Subject: [PATCH, RFC 04/12] ext4: Remove block bitmap initialization in ext4_new_inode()

We are initializing the block bitmap in ext4_new_inode(), and as far
as I can tell, there's no reason to do it. So remove it to simplify
things.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ialloc.c | 37 -------------------------------------
1 files changed, 0 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index a679a48..88ad0e3 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -815,7 +815,6 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode,
int ret2, err = 0;
struct inode *ret;
ext4_group_t i;
- int free = 0;
static int once = 1;
ext4_group_t flex_group;

@@ -936,42 +935,6 @@ repeat_in_this_group:
goto out;

got:
- /* We may have to initialize the block bitmap if it isn't already */
- if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
- gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
- struct buffer_head *block_bitmap_bh;
-
- block_bitmap_bh = ext4_read_block_bitmap(sb, group);
- BUFFER_TRACE(block_bitmap_bh, "get block bitmap access");
- err = ext4_journal_get_write_access(handle, block_bitmap_bh);
- if (err) {
- brelse(block_bitmap_bh);
- goto fail;
- }
-
- free = 0;
- ext4_lock_group(sb, group);
- /* recheck and clear flag under lock if we still need to */
- if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
- free = ext4_free_blocks_after_init(sb, group, gdp);
- gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_blks_set(sb, gdp, free);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
- gdp);
- }
- ext4_unlock_group(sb, group);
-
- /* Don't need to dirty bitmap block if we didn't change it */
- if (free) {
- BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
- err = ext4_handle_dirty_metadata(handle,
- NULL, block_bitmap_bh);
- }

2011-03-19 21:37:24

by Theodore Ts'o

Subject: [PATCH, RFC 09/12] ext4: teach ext4_ext_map_blocks() about the bigalloc feature

If we need to allocate a new block in ext4_ext_map_blocks(), the
function needs to see if the cluster has already been allocated.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 119 insertions(+), 13 deletions(-)
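
The check described above can be sketched in user space as follows. This
mirrors the logic of get_implied_cluster_alloc() as posted in this RFC, but
struct extent, struct request and the function name are simplified,
hypothetical stand-ins for struct ext4_extent and struct ext4_map_blocks,
so treat it as an illustration rather than the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified extent: only the fields the check needs. */
struct extent {
	uint32_t lblk;	/* first logical block covered */
	uint64_t pblk;	/* first physical block */
	uint32_t len;	/* length in blocks, must be > 0 */
};

/* Hypothetical, simplified mapping request. */
struct request {
	uint32_t lblk;	/* first logical block requested */
	uint32_t len;	/* blocks requested, must be > 0 */
	uint64_t pblk;	/* filled in when the function returns 1 */
};

/*
 * Returns 1 and trims *rq if its first block lies in a cluster that ex
 * already touches; returns 0 otherwise.
 */
static int implied_cluster_alloc(unsigned cluster_bits, struct request *rq,
				 const struct extent *ex)
{
	uint32_t ratio, c_offset, ex_first, ex_last, rq_first, room;
	uint64_t pblk = ex->pblk;

	assert(cluster_bits < 32 && ex->len > 0 && rq->len > 0);
	ratio = (uint32_t)1 << cluster_bits;
	c_offset = rq->lblk & (ratio - 1);

	/* As in the patch, cluster-aligned requests are not handled here. */
	if (c_offset == 0)
		return 0;

	ex_first = ex->lblk >> cluster_bits;
	ex_last = (ex->lblk + ex->len - 1) >> cluster_bits;
	rq_first = rq->lblk >> cluster_bits;
	if (rq_first != ex_first && rq_first != ex_last)
		return 0;

	/* Pick a physical block known to lie inside the shared cluster. */
	if (rq_first == ex_last)
		pblk += ex->len - 1;
	rq->pblk = (pblk & ~(uint64_t)(ratio - 1)) + c_offset;

	/* Do not run past the end of the cluster... */
	room = ratio - c_offset;
	if (rq->len > room)
		rq->len = room;
	/* ...nor into blocks the extent already maps. */
	if (rq->lblk < ex->lblk && rq->lblk + rq->len > ex->lblk)
		rq->len = ex->lblk - rq->lblk;
	return 1;
}
```

For example, with 16 blocks per cluster and an extent mapping logical
blocks 0-3 to physical blocks 1600-1603, a request starting at logical
block 8 is placed at physical block 1608 and trimmed to the 8 blocks left
in that cluster, without allocating a new cluster.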

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9ea1bc6..3221b79 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1273,7 +1273,8 @@ static int ext4_ext_search_left(struct inode *inode,
*/
static int ext4_ext_search_right(struct inode *inode,
struct ext4_ext_path *path,
- ext4_lblk_t *logical, ext4_fsblk_t *phys)
+ ext4_lblk_t *logical, ext4_fsblk_t *phys,
+ struct ext4_extent **ret_ex)
{
struct buffer_head *bh = NULL;
struct ext4_extent_header *eh;
@@ -1315,9 +1316,7 @@ static int ext4_ext_search_right(struct inode *inode,
return -EIO;
}
}
- *logical = le32_to_cpu(ex->ee_block);
- *phys = ext4_ext_pblock(ex);
- return 0;
+ goto found_extent;
}

if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
@@ -1330,9 +1329,7 @@ static int ext4_ext_search_right(struct inode *inode,
if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
/* next allocated block in this leaf */
ex++;
- *logical = le32_to_cpu(ex->ee_block);
- *phys = ext4_ext_pblock(ex);
- return 0;
+ goto found_extent;
}

/* go up and search for index to the right */
@@ -1375,9 +1372,12 @@ got_index:
return -EIO;
}
ex = EXT_FIRST_EXTENT(eh);
+found_extent:
*logical = le32_to_cpu(ex->ee_block);
*phys = ext4_ext_pblock(ex);
- put_bh(bh);
+ *ret_ex = ex;
+ if (bh)
+ put_bh(bh);
return 0;
}

@@ -1606,7 +1606,8 @@ static int ext4_ext_try_to_merge(struct inode *inode,
* such that there will be no overlap, and then returns 1.
* If there is no overlap found, it returns 0.
*/
-static unsigned int ext4_ext_check_overlap(struct inode *inode,
+static unsigned int ext4_ext_check_overlap(struct ext4_sb_info *sbi,
+ struct inode *inode,
struct ext4_extent *newext,
struct ext4_ext_path *path)
{
@@ -1620,6 +1621,7 @@ static unsigned int ext4_ext_check_overlap(struct inode *inode,
if (!path[depth].p_ext)
goto out;
b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+ b2 &= ~(sbi->s_cluster_ratio - 1);

/*
* get the next allocated block if the extent in the path
@@ -3274,6 +3276,82 @@ out2:
}

/*
+ * get_implied_cluster_alloc - check to see if the requested
+ * allocation (in the map structure) overlaps with a cluster already
+ * allocated in an extent.
+ * @map The requested lblk->pblk mapping
+ * @c_len The number of blocks in a cluster
+ * @ex The extent structure which might contain an implied
+ * cluster allocation
+ *
+ * This function is called by ext4_ext_map_blocks() after we failed to
+ * find blocks that were already in the inode's extent tree. Hence,
+ * we know that the beginning of the requested region cannot overlap
+ * the extent from the inode's extent tree. There are two cases we
+ * want to catch. The first is this case:
+ *
+ * |--- cluster # N--|
+ * |--- extent ---| |---- requested region ---|
+ * |==========|
+ *
+ * The second case that we need to test for is this one:
+ *
+ * |--------- cluster # N ----------------|
+ * |--- requested region --| |------- extent ----|
+ * |=======================|
+ *
+ * In each case, we need to return the extent labelled as "|====|"
+ * from cluster #N.
+ */
+static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
+ struct ext4_map_blocks *map,
+ struct ext4_extent *ex)
+{
+ ext4_lblk_t c_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
+ ext4_lblk_t ex_cluster_start, ex_cluster_end;
+ ext4_lblk_t rr_cluster_start, rr_cluster_end;
+ ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
+ ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
+ unsigned short ee_len = ext4_ext_get_actual_len(ex);
+
+ if (!c_offset)
+ return 0;
+
+ /* The extent passed in that we are trying to match */
+ ex_cluster_start = EXT4_B2C(sbi, ee_block);
+ ex_cluster_end = EXT4_B2C(sbi, ee_block + ee_len - 1);
+
+ /* The requested region passed into ext4_map_blocks() */
+ rr_cluster_start = EXT4_B2C(sbi, map->m_lblk);
+ rr_cluster_end = EXT4_B2C(sbi, map->m_lblk + map->m_len - 1);
+
+ if ((rr_cluster_start == ex_cluster_end) ||
+ (rr_cluster_start == ex_cluster_start)) {
+ if (rr_cluster_start == ex_cluster_end)
+ ee_start += ee_len - 1;
+ map->m_pblk = (ee_start & ~(sbi->s_cluster_ratio - 1)) +
+ c_offset;
+ map->m_len = min(map->m_len,
+ (unsigned) sbi->s_cluster_ratio - c_offset);
+ /*
+ * Check for and handle this case:
+ *
+ * |--------- cluster # N-------------|
+ * |------- extent ----|
+ * |--- requested region ---|
+ * |===========|
+ */
+
+ if ((map->m_lblk < ee_block) &&
+ (map->m_lblk + map->m_len > ee_block))
+ map->m_len = ee_block - map->m_lblk;
+ return 1;
+ }
+ return 0;
+}
+
+
+/*
* Block allocation/map/preallocation routine for extents based files
*
*
@@ -3296,12 +3374,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
{
struct ext4_ext_path *path = NULL;
struct ext4_extent_header *eh;
- struct ext4_extent newex, *ex;
+ struct ext4_extent newex, *ex, *ex2;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
ext4_fsblk_t newblock;
int err = 0, depth, ret;
unsigned int allocated = 0;
struct ext4_allocation_request ar;
ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_lblk_t cluster_offset;

ext_debug("blocks %u/%u requested for inode %lu\n",
map->m_lblk, map->m_len, inode->i_ino);
@@ -3398,9 +3478,21 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
goto out2;
}
+
/*
* Okay, we need to do block allocation.
*/
+ cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
+ /*
+ * If we are doing bigalloc, check to see if the extent returned
+ * by ext4_find_extent() implies a cluster we can use.
+ */
+ if (cluster_offset && ex &&
+ get_implied_cluster_alloc(sbi, map, ex)) {
+ ar.len = allocated = map->m_len;
+ newblock = map->m_pblk;
+ goto got_allocated_blocks;
+ }

/* find neighbour allocated blocks */
ar.lleft = map->m_lblk;
@@ -3408,10 +3500,20 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto out2;
ar.lright = map->m_lblk;
- err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright);
+ ex2 = 0;
+ err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
if (err)
goto out2;

+ /* Check if the extent after searching to the right implies a
+ * cluster we can use. */
+ if (cluster_offset && ex2 &&
+ get_implied_cluster_alloc(sbi, map, ex2)) {
+ ar.len = allocated = map->m_len;
+ newblock = map->m_pblk;
+ goto got_allocated_blocks;
+ }
+
/*
* See if request is beyond maximum number of blocks we can have in
* a single extent. For an initialized extent this limit is
@@ -3428,7 +3530,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
newex.ee_block = cpu_to_le32(map->m_lblk);
newex.ee_len = cpu_to_le16(map->m_len);
- err = ext4_ext_check_overlap(inode, &newex, path);
+ err = ext4_ext_check_overlap(sbi, inode, &newex, path);
if (err)
allocated = ext4_ext_get_actual_len(&newex);
else
@@ -3438,7 +3540,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ar.inode = inode;
ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
ar.logical = map->m_lblk;
- ar.len = allocated;
+ ar.len = EXT4_NUM_B2C(sbi, allocated);
if (S_ISREG(inode->i_mode))
ar.flags = EXT4_MB_HINT_DATA;
else
@@ -3449,7 +3551,11 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
goto out2;
ext_debug("allocate new block: goal %llu, found %llu/%u\n",
ar.goal, newblock, allocated);
+ ar.len <<= sbi->s_cluster_bits;
+ if (ar.len > allocated)
+ ar.len = allocated;

+got_allocated_blocks:
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(ar.len);
--
1.7.3.1


2011-03-19 21:37:24

by Theodore Ts'o

Subject: [PATCH, RFC 06/12] ext4: split out ext4_free_blocks_after_init()

The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard to make changes to
support clusters. So as an initial cleanup, I've separated out the
functionality of initializing the block bitmap from calculating the
number of free blocks in the new block group.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 105 ++++++++++++++++++++++++++---------------------------
fs/ext4/ext4.h | 13 ++++---
2 files changed, 59 insertions(+), 59 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c608fda..5b60c8b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -100,74 +100,73 @@ static unsigned int num_blocks_in_group(struct super_block *sb,
return EXT4_BLOCKS_PER_GROUP(sb);
}

-/* Initializes an uninitialized block bitmap if given, and returns the
- * number of blocks free in the group. */
-unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
- ext4_group_t block_group, struct ext4_group_desc *gdp)
+/* Initializes an uninitialized block bitmap */
+void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
{
unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
- ext4_group_t ngroups = ext4_get_groups_count(sb);
- unsigned group_blocks = num_blocks_in_group(sb, block_group);
struct ext4_sb_info *sbi = EXT4_SB(sb);
-
- if (bh) {
- J_ASSERT_BH(bh, buffer_locked(bh));
-
- /* If checksum is bad mark all blocks used to prevent allocation
- * essentially implementing a per-group read-only flag. */
- if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
- ext4_error(sb, "Checksum bad for group %u",
- block_group);
- ext4_free_blks_set(sb, gdp, 0);
- ext4_free_inodes_set(sb, gdp, 0);
- ext4_itable_unused_set(sb, gdp, 0);
- memset(bh->b_data, 0xff, sb->s_blocksize);
- return 0;
- }
- memset(bh->b_data, 0, sb->s_blocksize);
+ ext4_fsblk_t start, tmp;
+ int flex_bg = 0;
+
+ J_ASSERT_BH(bh, buffer_locked(bh));
+
+ /* If checksum is bad mark all blocks used to prevent allocation
+ * essentially implementing a per-group read-only flag. */
+ if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+ ext4_error(sb, "Checksum bad for group %u", block_group);
+ ext4_free_blks_set(sb, gdp, 0);
+ ext4_free_inodes_set(sb, gdp, 0);
+ ext4_itable_unused_set(sb, gdp, 0);
+ memset(bh->b_data, 0xff, sb->s_blocksize);
+ return;
}
+ memset(bh->b_data, 0, sb->s_blocksize);

- if (bh) {
- ext4_fsblk_t start, tmp;
- int flex_bg = 0;
+ for (bit = 0; bit < bit_max; bit++)
+ ext4_set_bit(bit, bh->b_data);

- for (bit = 0; bit < bit_max; bit++)
- ext4_set_bit(bit, bh->b_data);
+ start = ext4_group_first_block_no(sb, block_group);

- start = ext4_group_first_block_no(sb, block_group);
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
+ flex_bg = 1;

- if (EXT4_HAS_INCOMPAT_FEATURE(sb,
- EXT4_FEATURE_INCOMPAT_FLEX_BG))
- flex_bg = 1;
+ /* Set bits for block and inode bitmaps, and inode table */
+ tmp = ext4_block_bitmap(sb, gdp);
+ if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
+ ext4_set_bit(tmp - start, bh->b_data);

- /* Set bits for block and inode bitmaps, and inode table */
- tmp = ext4_block_bitmap(sb, gdp);
- if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ tmp = ext4_inode_bitmap(sb, gdp);
+ if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
+ ext4_set_bit(tmp - start, bh->b_data);

- tmp = ext4_inode_bitmap(sb, gdp);
+ tmp = ext4_inode_table(sb, gdp);
+ for (; tmp < ext4_inode_table(sb, gdp) +
+ sbi->s_itb_per_group; tmp++) {
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
-
- tmp = ext4_inode_table(sb, gdp);
- for (; tmp < ext4_inode_table(sb, gdp) +
- sbi->s_itb_per_group; tmp++) {
- if (!flex_bg ||
- ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
- }
- /*
- * Also if the number of blocks within the group is
- * less than the blocksize * 8 ( which is the size
- * of bitmap ), set rest of the block bitmap to 1
- */
- ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
- bh->b_data);
}
- return group_blocks - bit_max -
- ext4_group_used_meta_blocks(sb, block_group, gdp);
+ /*
+ * Also if the number of blocks within the group is less than
+ * the blocksize * 8 ( which is the size of bitmap ), set rest
+ * of the block bitmap to 1
+ */
+ ext4_mark_bitmap_end(num_blocks_in_group(sb, block_group),
+ sb->s_blocksize * 8, bh->b_data);
}

+/* Return the number of free blocks in a block group. It is used when
+ * the block bitmap is uninitialized, so we can't just count the bits
+ * in the bitmap. */
+unsigned ext4_free_blocks_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
+{
+ return num_blocks_in_group(sb, block_group) -
+ num_base_meta_blocks(sb, block_group) -
+ ext4_group_used_meta_blocks(sb, block_group, gdp);
+}

/*
* The free blocks are managed by bitmaps. A file system contains several
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94a7a7b..0eb7407 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1662,12 +1662,13 @@ extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
struct buffer_head *ext4_read_block_bitmap(struct super_block *sb,
ext4_group_t block_group);
-extern unsigned ext4_init_block_bitmap(struct super_block *sb,
- struct buffer_head *bh,
- ext4_group_t group,
- struct ext4_group_desc *desc);
-#define ext4_free_blocks_after_init(sb, group, desc) \
- ext4_init_block_bitmap(sb, NULL, group, desc)
+extern void ext4_init_block_bitmap(struct super_block *sb,
+ struct buffer_head *bh,
+ ext4_group_t group,
+ struct ext4_group_desc *desc);
+extern unsigned ext4_free_blocks_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp);

/* dir.c */
extern int __ext4_check_dir_entry(const char *, unsigned int, struct inode *,
--
1.7.3.1
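
The calculation that this patch moves into ext4_free_blocks_after_init() can be sketched in userspace as follows. This is only a model under stated assumptions: struct group_layout and its fields are hypothetical stand-ins for the values the kernel obtains from num_blocks_in_group(), num_base_meta_blocks() and ext4_group_used_meta_blocks().

```c
#include <assert.h>

/* Hypothetical per-group figures, supplied by the caller in this model. */
struct group_layout {
	unsigned int blocks_in_group;	/* total blocks covered by the group */
	unsigned int base_meta_blocks;	/* e.g. superblock and descriptor blocks */
	unsigned int used_meta_blocks;	/* bitmaps and inode table in this group */
};

/*
 * Free blocks in a group whose bitmap has not been initialized yet:
 * everything that is not metadata.  No bitmap buffer is needed, which
 * is the point of separating this from bitmap initialization.
 */
static unsigned int free_blocks_after_init(const struct group_layout *g)
{
	assert(g->base_meta_blocks + g->used_meta_blocks <= g->blocks_in_group);
	return g->blocks_in_group - g->base_meta_blocks - g->used_meta_blocks;
}
```

Because the count no longer depends on a buffer_head, callers do not need the old trick of passing NULL as the bitmap to ext4_init_block_bitmap().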


2011-03-19 21:43:04

by Theodore Ts'o

Subject: [PATCH, RFC 10/12] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 35 +++++++++++++++++++++++------------
1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f9b25cd..24964da 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4438,15 +4438,33 @@ restore_opts:
return err;
}

+/*
+ * Note: calculating the overhead so we can be compatible with
+ * historical BSD practice is quite difficult in the face of
+ * clusters/bigalloc. This is because multiple metadata blocks from
+ * different block groups can end up in the same allocation cluster.
+ * Calculating the exact overhead in the face of clustered allocation
+ * requires either O(all block bitmaps) in memory or O(number of block
+ * groups**2) in time. We will still calculate the overhead for
+ * older file systems --- and if we come across a bigalloc file
+ * system with zero in s_overhead_blocks the estimate will be close to
+ * correct especially for very large cluster sizes --- but for newer
+ * file systems, it's better to calculate this figure once at mkfs
+ * time, and store it in the superblock. If the superblock value is
+ * present (even for non-bigalloc file systems), we will use it.
+ */
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
+ struct ext4_group_desc *gdp;
u64 fsid;

if (test_opt(sb, MINIX_DF)) {
sbi->s_overhead_last = 0;
+ } else if (es->s_overhead_blocks) {
+ sbi->s_overhead_last = le32_to_cpu(es->s_overhead_blocks);
} else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
ext4_fsblk_t overhead = 0;
@@ -4461,24 +4479,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
* All of the blocks before first_data_block are
* overhead
*/
- overhead = le32_to_cpu(es->s_first_data_block);
+ overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));

/*
- * Add the overhead attributed to the superblock and
- * block group descriptors. If the sparse superblocks
- * feature is turned on, then not all groups have this.
+ * Add the overhead found in each block group
*/
for (i = 0; i < ngroups; i++) {
- overhead += ext4_bg_has_super(sb, i) +
- ext4_bg_num_gdb(sb, i);
+ gdp = ext4_get_group_desc(sb, i, NULL);
+ overhead += ext4_num_overhead_clusters(sb, i, gdp);
cond_resched();
}

2011-03-20 10:26:58

by Amir Goldstein

Subject: Re: [PATCH, RFC 03/12] ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP

On Sat, Mar 19, 2011 at 11:28 PM, Theodore Ts'o <[email protected]> wrote:
> Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are
> used to indicate the number of bits in a block bitmap (which is really
> a cluster allocation bitmap in bigalloc file systems).  There are
> still some places in the ext4 codebase where usage of
> EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
> aren't used given the initial restricted assumptions for bigalloc.
> These will need to be fixed before we can relax those restrictions.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
>  fs/ext4/mballoc.c |   34 +++++++++++++++++-----------------
>  1 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 2f6f0dd..02f099f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -651,7 +651,7 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
>        ext4_grpblk_t chunk;
>        unsigned short border;
>
> -       BUG_ON(len > EXT4_BLOCKS_PER_GROUP(sb));
> +       BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb));
>
>        border = 2 << sb->s_blocksize_bits;
>
> @@ -703,7 +703,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>                                void *buddy, void *bitmap, ext4_group_t group)
>  {
>        struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> -       ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
> +       ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
>        ext4_grpblk_t i = 0;
>        ext4_grpblk_t first;
>        ext4_grpblk_t len;
> @@ -1680,8 +1680,8 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
>        struct ext4_free_extent *gex = &ac->ac_g_ex;
>
>        BUG_ON(ex->fe_len <= 0);
> -       BUG_ON(ex->fe_len > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
> -       BUG_ON(ex->fe_start >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
> +       BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
> +       BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
>        BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);
>
>        ac->ac_found++;
> @@ -1879,8 +1879,8 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
>
>        while (free && ac->ac_status == AC_STATUS_CONTINUE) {
>                i = mb_find_next_zero_bit(bitmap,
> -                                               EXT4_BLOCKS_PER_GROUP(sb), i);
> -               if (i >= EXT4_BLOCKS_PER_GROUP(sb)) {
> +                                               EXT4_CLUSTERS_PER_GROUP(sb), i);
> +               if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) {
>                        /*
>                         * IF we have corrupt bitmap, we won't find any
>                         * free blocks even though group info says we
> @@ -1943,7 +1943,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>        do_div(a, sbi->s_stripe);
>        i = (a * sbi->s_stripe) - first_group_block;
>
> -       while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> +       while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
>                if (!mb_test_bit(i, bitmap)) {
>                        max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>                        if (max >= sbi->s_stripe) {
> @@ -3071,7 +3071,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>        }
>        BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
>                        start > ac->ac_o_ex.fe_logical);
> -       BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
> +       BUG_ON(size <= 0 || size > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
>
>        /* now prepare goal request */
>
> @@ -3724,7 +3724,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>        }
>
>        if (needed == 0)
> -               needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
> +               needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;
>
>        INIT_LIST_HEAD(&list);
>  repeat:
> @@ -4039,8 +4039,8 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
>        len = ar->len;
>
>        /* just a dirty hack to filter too big requests  */
> -       if (len >= EXT4_BLOCKS_PER_GROUP(sb) - 10)
> -               len = EXT4_BLOCKS_PER_GROUP(sb) - 10;
> +       if (len >= EXT4_CLUSTERS_PER_GROUP(sb) - 10)
> +               len = EXT4_CLUSTERS_PER_GROUP(sb) - 10;
>
>        /* start searching from the goal */
>        goal = ar->goal;
> @@ -4579,8 +4579,8 @@ do_more:
>         * Check to see if we are freeing blocks across a group
>         * boundary.
>         */
> -       if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
> -               overflow = bit + count - EXT4_BLOCKS_PER_GROUP(sb);
> +       if (bit + count > EXT4_CLUSTERS_PER_GROUP(sb)) {
> +               overflow = bit + count - EXT4_CLUSTERS_PER_GROUP(sb);




If I am not mistaken, while 'bit' was already converted to cluster units,
'count' is still in block units.

I think ext4_free_blocks() needs to do 2 things:
1. convert 'count' to clusters (after issuing journal_forget())
2. round 'bit' up (and round 'count' down) if the start block is not
on a cluster boundary, so that truncate/punch hole will not free a
cluster when its 'base' block is still allocated.
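
One literal reading of the two steps above can be sketched in userspace C as follows. This is not code from the patchset: struct cluster_range and clusters_to_free() are hypothetical, and the rule implemented is "free exactly those clusters whose first ('base') block lies inside the block range being freed".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: a run of whole clusters to release. */
struct cluster_range {
	uint64_t first;		/* first cluster to free */
	uint64_t count;		/* number of clusters to free, may be 0 */
};

/*
 * Map the block range [start, start + count) onto the clusters whose
 * base block falls inside it.  A partial leading cluster is skipped
 * because its base block lies before 'start' and is still allocated.
 */
static struct cluster_range clusters_to_free(uint64_t start, uint64_t count,
					     unsigned int cluster_bits)
{
	uint64_t ratio = (uint64_t)1 << cluster_bits;
	struct cluster_range r;
	uint64_t end;

	assert(cluster_bits < 32);
	/* Round the start up to the next cluster boundary. */
	r.first = (start + ratio - 1) >> cluster_bits;
	/* First cluster whose base block is at or beyond the end of the range. */
	end = (start + count + ratio - 1) >> cluster_bits;
	r.count = end - r.first;
	return r;
}
```

For example, with 16 blocks per cluster, freeing blocks 5 through 34 would release clusters 1 and 2 but leave cluster 0 alone, since block 0 is not part of the range. Whether a trailing partial cluster should be released as well depends on the caller (truncate versus punch hole), which this sketch does not try to settle.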


>                count -= overflow;
>        }
>        bitmap_bh = ext4_read_block_bitmap(sb, block_group);
> @@ -4844,7 +4844,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>        minlen = range->minlen >> sb->s_blocksize_bits;
>        trimmed = 0;
>
> -       if (unlikely(minlen > EXT4_BLOCKS_PER_GROUP(sb)))
> +       if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)))
>                return -EINVAL;
>        if (start < first_data_blk) {
>                len -= first_data_blk - start;
> @@ -4857,7 +4857,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>        ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) (start + len),
>                                     &last_group, &last_block);
>        last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group;
> -       last_block = EXT4_BLOCKS_PER_GROUP(sb);
> +       last_block = EXT4_CLUSTERS_PER_GROUP(sb);
>
>        if (first_group > last_group)
>                return -EINVAL;
> @@ -4870,8 +4870,8 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>                        break;
>                }
>
> -               if (len >= EXT4_BLOCKS_PER_GROUP(sb))
> -                       len -= (EXT4_BLOCKS_PER_GROUP(sb) - first_block);
> +               if (len >= EXT4_CLUSTERS_PER_GROUP(sb))
> +                       len -= (EXT4_CLUSTERS_PER_GROUP(sb) - first_block);
>                else
>                        last_block = first_block + len;
>
> --
> 1.7.3.1

2011-03-20 10:33:45

by Amir Goldstein

Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On Sat, Mar 19, 2011 at 11:28 PM, Theodore Ts'o <[email protected]> wrote:
> This is an initial patchset of the bigalloc patches to ext4.  This patch
> adds support for clustered allocation, so that each bit in the ext4
> block allocation bitmap addresses a power of two number of blocks.  For
> example, if the file system is mainly going to be storing large files in
> the 4-32 megabyte range, it might make sense to set a cluster size of 1
> megabyte.  This means that each bit in the block allocation bitmap would
> now address 256 4k blocks, and it means that the size of the block
> bitmaps for a 2T file system shrinks from 64 megabytes to 256k.  It also
> means that a block group addresses 32 gigabytes instead of 128
> megabytes, also shrinking the amount of file system overhead for
> metadata.
>
> The cost is increased disk space efficiency.  Directories will consume
> 1T, as will extent tree blocks.  (I am on the fence as to whether I
> should add complexity so that in the rare case that an inode needs more
> than 344 extents --- a highly fragmented file indeed --- and need a
> second extent tree block, we can avoid allocating any cluster and
> instead use another block from the cluster used by the inode.  The
> concern is the amount of complexity this adds to the e2fsck, not just to
> the kernel.)

Unless you define extent tree block size = cluster size.
Shouldn't be too hard to teach that to kernel and fsck, right?
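
The sizing figures in the quoted cover letter (64 megabytes versus 256k of bitmaps on a 2T file system, and 128 megabytes versus 32 gigabytes per block group) can be checked with a short calculation. The sketch below is only a worked example; bitmap_bytes() and group_bytes() are hypothetical helpers, and it assumes one bitmap bit per allocation unit and one bitmap block per group.

```c
#include <assert.h>
#include <stdint.h>

/* Total bitmap size: one bit per allocation unit, eight bits per byte. */
static uint64_t bitmap_bytes(uint64_t fs_bytes, uint64_t unit_bytes)
{
	assert(unit_bytes != 0);
	return fs_bytes / unit_bytes / 8;
}

/*
 * Space addressed by one block group: a single bitmap block holds
 * block_bytes * 8 bits, each covering one allocation unit.
 */
static uint64_t group_bytes(uint64_t block_bytes, uint64_t unit_bytes)
{
	return block_bytes * 8 * unit_bytes;
}
```

With 4k blocks, a 2T file system needs bitmap_bytes(2T, 4k) = 64M of block bitmaps and each group spans group_bytes(4k, 4k) = 128M. With a 1M cluster as the allocation unit, the same calls give 256k and 32G, matching the numbers in the quoted text.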

>
> To test these patches, I have used an *extremely* kludgy set of patches
> to e2fsprogs, which are attached below.  These patches need *extensive*
> revision before I would consider them clean enough suitable for
> committing into e2fsprogs, but they were sufficient for me to do the
> kernel-side changes --- mke2fs, dumpe2fs, and debugfs work.  E2fsck most
> definitely does _not_ work at this stage.
>
> Please comment!  I do not intend for these patches to be merged during
> the 2.6.39 merge window.  I am targeting 2.6.40, 3 months from now,
> since these patches are quite extensive.
>
>                                                        - Ted
>
> Theodore Ts'o (12):
>  ext4: read-only support for bigalloc file systems
>  ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
>  ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to
>    EXT4_CLUSTERS_PER_GROUP
>  ext4: Remove block bitmap initialization in ext4_new_inode()
>  ext4: factor out block group accounting into functions
>  ext4: split out ext4_free_blocks_after_init()
>  ext4: bigalloc changes to block bitmap initialization functions
>  ext4: Convert block group-relative offsets to use clusters
>  ext4: teach ext4_ext_map_blocks() about the bigalloc feature


I think you are missing an important patch here:

ext4: teach ext4_free_blocks() about the bigalloc feature

Free only clusters whose 'base' block is contained within the
requested blocks range.


>  ext4: teach ext4_statfs() to deal with clusters if bigalloc is
>    enabled
>  ext4: tune mballoc's default group prealloc size for bigalloc file
>    systems
>  ext4: enable mounting bigalloc as read/write
>
>  fs/ext4/balloc.c  |  268 +++++++++++++++++++++++++++++++++--------------------
>  fs/ext4/ext4.h    |   47 ++++++++--
>  fs/ext4/extents.c |  132 +++++++++++++++++++++++---
>  fs/ext4/ialloc.c  |   37 --------
>  fs/ext4/inode.c   |    7 ++
>  fs/ext4/ioctl.c   |   33 ++++++-
>  fs/ext4/mballoc.c |   49 ++++++----
>  fs/ext4/mballoc.h |    3 +-
>  fs/ext4/super.c   |  100 ++++++++++++++++----
>  9 files changed, 472 insertions(+), 204 deletions(-)
>
> --
> 1.7.3.1
>
> ------------------- e2fsprogs patches follow below
>
> diff --git a/lib/ext2fs/bmap64.h b/lib/ext2fs/bmap64.h
> index b0aa84c..cfbdfd6 100644
> --- a/lib/ext2fs/bmap64.h
> +++ b/lib/ext2fs/bmap64.h
> @@ -31,6 +31,10 @@ struct ext2fs_struct_generic_bitmap {
>         ((bmap)->magic == EXT2_ET_MAGIC_BLOCK_BITMAP64) || \
>         ((bmap)->magic == EXT2_ET_MAGIC_INODE_BITMAP64))
>
> +/* Bitmap flags */
> +
> +#define EXT2_BMFLAG_CLUSTER 0x0001
> +
>  struct ext2_bitmap_ops {
>        int     type;
>        /* Generic bmap operators */
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index a89e33b..0970506 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -228,9 +228,13 @@ struct ext2_dx_countlimit {
>
>  #define EXT2_BLOCKS_PER_GROUP(s)       (EXT2_SB(s)->s_blocks_per_group)
>  #define EXT2_INODES_PER_GROUP(s)       (EXT2_SB(s)->s_inodes_per_group)
> +#define EXT2_CLUSTERS_PER_GROUP(s)     (EXT2_SB(s)->s_clusters_per_group)
>  #define EXT2_INODES_PER_BLOCK(s)       (EXT2_BLOCK_SIZE(s)/EXT2_INODE_SIZE(s))
>  /* limits imposed by 16-bit value gd_free_{blocks,inode}_count */
> -#define EXT2_MAX_BLOCKS_PER_GROUP(s)   ((1 << 16) - 8)
> +#define EXT2_MAX_BLOCKS_PER_GROUP(s)   (((1 << 16) - 8) *      \
> +                                        (EXT2_CLUSTER_SIZE(s) / \
> +                                         EXT2_BLOCK_SIZE(s)))
> +#define EXT2_MAX_CLUSTERS_PER_GROUP(s) ((1 << 16) - 8)
>  #define EXT2_MAX_INODES_PER_GROUP(s)   ((1 << 16) - EXT2_INODES_PER_BLOCK(s))
>  #ifdef __KERNEL__
>  #define EXT2_DESC_PER_BLOCK(s)         (EXT2_SB(s)->s_desc_per_block)
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index d3eb31d..a065e87 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -207,7 +207,7 @@ struct struct_ext2_filsys {
>        char *                          device_name;
>        struct ext2_super_block *       super;
>        unsigned int                    blocksize;
> -       int                             clustersize;
> +       int                             cluster_ratio;
>        dgrp_t                          group_desc_count;
>        unsigned long                   desc_blocks;
>        struct opaque_ext2_group_desc * group_desc;
> @@ -232,7 +232,8 @@ struct struct_ext2_filsys {
>        /*
>         * Reserved for future expansion
>         */
> -       __u32                           reserved[7];
> +       __u32                           clustersize;
> +       __u32                           reserved[6];
>
>        /*
>         * Reserved for the use of the calling application.
> @@ -553,7 +554,8 @@ typedef struct ext2_icount *ext2_icount_t;
>                                         EXT2_FEATURE_RO_COMPAT_LARGE_FILE|\
>                                         EXT4_FEATURE_RO_COMPAT_DIR_NLINK|\
>                                         EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|\
> -                                        EXT4_FEATURE_RO_COMPAT_GDT_CSUM)
> +                                        EXT4_FEATURE_RO_COMPAT_GDT_CSUM|\
> +                                        EXT4_FEATURE_RO_COMPAT_BIGALLOC)
>
>  /*
>  * These features are only allowed if EXT2_FLAG_SOFTSUPP_FEATURES is passed
> diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
> index df095ac..60321df 100644
> --- a/lib/ext2fs/gen_bitmap64.c
> +++ b/lib/ext2fs/gen_bitmap64.c
> @@ -559,3 +559,85 @@ int ext2fs_warn_bitmap32(ext2fs_generic_bitmap bitmap, const char *func)
>                        "called %s with 64-bit bitmap", func);
>  #endif
>  }
> +
> +errcode_t ext2fs_allocate_cluster_bitmap(ext2_filsys fs,
> +                                        const char *descr,
> +                                        ext2fs_block_bitmap *ret)
> +{
> +       __u64           start, end, real_end;
> +       errcode_t       retval;
> +
> +       EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
> +
> +       if (!(fs->flags & EXT2_FLAG_64BITS))
> +               return EXT2_ET_CANT_USE_LEGACY_BITMAPS;
> +
> +       fs->write_bitmaps = ext2fs_write_bitmaps;
> +
> +       start = (fs->super->s_first_data_block >>
> +                EXT2_CLUSTER_SIZE_BITS(fs->super));
> +       end = (ext2fs_blocks_count(fs->super) - 1) / fs->cluster_ratio;
> +       real_end = ((__u64) EXT2_CLUSTERS_PER_GROUP(fs->super)
> +                   * (__u64) fs->group_desc_count)-1 + start;
> +
> +       retval = ext2fs_alloc_generic_bmap(fs,
> +                                          EXT2_ET_MAGIC_BLOCK_BITMAP64,
> +                                          EXT2FS_BMAP64_BITARRAY,
> +                                          start, end, real_end, descr, ret);
> +       if (retval)
> +               return retval;
> +
> +       (*ret)->flags = EXT2_BMFLAG_CLUSTER;
> +
> +       printf("Returning 0...\n");
> +       return 0;
> +}
> +
> +int ext2fs_is_cluster_bitmap(ext2fs_block_bitmap bm)
> +{
> +       if (EXT2FS_IS_32_BITMAP(bm))
> +               return 0;
> +
> +       return (bm->flags & EXT2_BMFLAG_CLUSTER);
> +}
> +
> +errcode_t ext2fs_convert_to_cluster_bitmap(ext2_filsys fs,
> +                                         ext2fs_block_bitmap bmap,
> +                                         ext2fs_block_bitmap *ret)
> +{
> +       ext2fs_block_bitmap     cmap;
> +       errcode_t               retval;
> +       blk64_t                 i, j, b_end, c_end;
> +       int                     n;
> +
> +       retval = ext2fs_allocate_cluster_bitmap(fs, "converted cluster bitmap",
> +                                               ret);
> +       if (retval)
> +               return retval;
> +
> +       cmap = *ret;
> +       i = bmap->start;
> +       b_end = bmap->end;
> +       bmap->end = bmap->real_end;
> +       j = cmap->start;
> +       c_end = cmap->end;
> +       cmap->end = cmap->real_end;
> +       n = 0;
> +       while (i < bmap->real_end) {
> +               if (ext2fs_test_block_bitmap2(bmap, i)) {
> +                       ext2fs_mark_block_bitmap2(cmap, j);
> +                       i += fs->cluster_ratio - n;
> +                       j++;
> +                       n = 0;
> +                       continue;
> +               }
> +               i++; n++;
> +               if (n >= fs->cluster_ratio) {
> +                       j++;
> +                       n = 0;
> +               }
> +       }
> +       bmap->end = b_end;
> +       cmap->end = c_end;
> +       return 0;
> +}
> diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
> index e1f229b..00a8b38 100644
> --- a/lib/ext2fs/initialize.c
> +++ b/lib/ext2fs/initialize.c
> @@ -94,6 +94,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
>         blk_t           numblocks;
>         int             rsv_gdt;
>         int             csum_flag;
> +       int             bigalloc_flag;
>         int             io_flags;
>         char            *buf = 0;
>         char            c;
> @@ -134,12 +135,25 @@ errcode_t ext2fs_initialize(const char *name, int flags,
>
>  #define set_field(field, default) (super->field = param->field ? \
>                                    param->field : (default))
> +#define assign_field(field)    (super->field = param->field)
>
>         super->s_magic = EXT2_SUPER_MAGIC;
>         super->s_state = EXT2_VALID_FS;
>
> -       set_field(s_log_block_size, 0); /* default blocksize: 1024 bytes */
> -       set_field(s_log_cluster_size, 0);
> +       bigalloc_flag = EXT2_HAS_RO_COMPAT_FEATURE(param,
> +                                  EXT4_FEATURE_RO_COMPAT_BIGALLOC);
> +
> +       assign_field(s_log_block_size);
> +
> +       if (bigalloc_flag) {
> +               set_field(s_log_cluster_size, super->s_log_block_size+4);
> +               if (super->s_log_block_size > super->s_log_cluster_size) {
> +                       retval = EXT2_ET_INVALID_ARGUMENT;
> +                       goto cleanup;
> +               }
> +       } else
> +               super->s_log_cluster_size = super->s_log_block_size;
> +
>         set_field(s_first_data_block, super->s_log_block_size ? 0 : 1);
>         set_field(s_max_mnt_count, 0);
>         set_field(s_errors, EXT2_ERRORS_DEFAULT);
> @@ -183,14 +197,36 @@ errcode_t ext2fs_initialize(const char *name, int flags,
>
>         fs->blocksize = EXT2_BLOCK_SIZE(super);
>         fs->clustersize = EXT2_CLUSTER_SIZE(super);
> +       fs->cluster_ratio = fs->clustersize / fs->blocksize;
> +
> +       if (bigalloc_flag) {
> +               if (param->s_blocks_per_group &&
> +                   param->s_clusters_per_group &&
> +                   ((param->s_clusters_per_group * fs->cluster_ratio) !=
> +                    param->s_blocks_per_group)) {
> +                       retval = EXT2_ET_INVALID_ARGUMENT;
> +                       goto cleanup;
> +               }
> +               if (param->s_clusters_per_group)
> +                       assign_field(s_clusters_per_group);
> +               else if (param->s_blocks_per_group)
> +                       super->s_clusters_per_group =
> +                               param->s_blocks_per_group / fs->cluster_ratio;
> +               else
> +                       super->s_clusters_per_group = fs->blocksize * 8;
> +               if (super->s_clusters_per_group > EXT2_MAX_CLUSTERS_PER_GROUP(super))
> +                       super->s_blocks_per_group = EXT2_MAX_CLUSTERS_PER_GROUP(super);
> +               super->s_blocks_per_group = super->s_clusters_per_group;
> +               super->s_blocks_per_group *= fs->cluster_ratio;
> +       } else {
> +               set_field(s_blocks_per_group, fs->blocksize * 8);
> +               if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
> +                       super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
> +               super->s_clusters_per_group = super->s_blocks_per_group;
> +       }
>
> -       /* default: (fs->blocksize*8) blocks/group, up to 2^16 (GDT limit) */
> -       set_field(s_blocks_per_group, fs->blocksize * 8);
> -       if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
> -               super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
> -       super->s_clusters_per_group = super->s_blocks_per_group;
> -
> -       ext2fs_blocks_count_set(super, ext2fs_blocks_count(param));
> +       ext2fs_blocks_count_set(super, ext2fs_blocks_count(param) &
> +                               ~((blk64_t) fs->cluster_ratio - 1));
>         ext2fs_r_blocks_count_set(super, ext2fs_r_blocks_count(param));
>         if (ext2fs_r_blocks_count(super) >= ext2fs_blocks_count(param)) {
>                 retval = EXT2_ET_INVALID_ARGUMENT;
> @@ -246,7 +282,7 @@ retry:
>          */
>         ipg = ext2fs_div_ceil(super->s_inodes_count, fs->group_desc_count);
>         if (ipg > fs->blocksize * 8) {
> -               if (super->s_blocks_per_group >= 256) {
> +               if (!bigalloc_flag && super->s_blocks_per_group >= 256) {
>                         /* Try again with slightly different parameters */
>                         super->s_blocks_per_group -= 8;
>                         ext2fs_blocks_count_set(super,
> diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
> index 90abed1..8b37852 100644
> --- a/lib/ext2fs/openfs.c
> +++ b/lib/ext2fs/openfs.c
> @@ -251,6 +251,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
>                 goto cleanup;
>         }
>         fs->clustersize = EXT2_CLUSTER_SIZE(fs->super);
> +       fs->cluster_ratio = fs->clustersize / fs->blocksize;
>         fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
>                                        EXT2_INODE_SIZE(fs->super) +
>                                        EXT2_BLOCK_SIZE(fs->super) - 1) /
> diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c
> index 3031b7d..aeea997 100644
> --- a/lib/ext2fs/rw_bitmaps.c
> +++ b/lib/ext2fs/rw_bitmaps.c
> @@ -51,7 +51,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)
>
>         inode_nbytes = block_nbytes = 0;
>         if (do_block) {
> -               block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> +               block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
>                 retval = ext2fs_get_memalign(fs->blocksize, fs->blocksize,
>                                              &block_buf);
>                 if (retval)
> @@ -85,7 +85,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)
>                         /* Force bitmap padding for the last group */
>                         nbits = ((ext2fs_blocks_count(fs->super)
>                                   - (__u64) fs->super->s_first_data_block)
> -                                % (__u64) EXT2_BLOCKS_PER_GROUP(fs->super));
> +                                % (__u64) EXT2_CLUSTERS_PER_GROUP(fs->super));
>                         if (nbits)
>                                 for (j = nbits; j < fs->blocksize * 8; j++)
>                                         ext2fs_set_bit(j, block_buf);
> @@ -141,7 +141,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
>         char *block_bitmap = 0, *inode_bitmap = 0;
>         char *buf;
>         errcode_t retval;
> -       int block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> +       int block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
>         int inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;
>         int csum_flag = 0;
>         int do_image = fs->flags & EXT2_FLAG_IMAGE_FILE;
> @@ -219,7 +219,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
>                 }
>                 blk = (fs->image_header->offset_blockmap /
>                        fs->blocksize);
> -               blk_cnt = (blk64_t)EXT2_BLOCKS_PER_GROUP(fs->super) *
> +               blk_cnt = (blk64_t)EXT2_CLUSTERS_PER_GROUP(fs->super) *
>                         fs->group_desc_count;
>                 while (block_nbytes > 0) {
>                         retval = io_channel_read_blk64(fs->image_io, blk++,
> diff --git a/misc/dumpe2fs.c b/misc/dumpe2fs.c
> index c01ffe5..d3f617a 100644
> --- a/misc/dumpe2fs.c
> +++ b/misc/dumpe2fs.c
> @@ -71,25 +71,26 @@ static void print_range(unsigned long long a, unsigned long long b)
>                 printf("%llu-%llu", a, b);
>  }
>
> -static void print_free (unsigned long group, char * bitmap,
> -                       unsigned long nbytes, unsigned long offset)
> +static void print_free(unsigned long group, char * bitmap,
> +                      unsigned long nbytes, unsigned long offset, int ratio)
>  {
>         int p = 0;
>         unsigned long i;
>         unsigned long j;
>
> +       offset /= ratio;
>         offset += group * nbytes;
>         for (i = 0; i < nbytes; i++)
>                 if (!in_use (bitmap, i))
>                 {
>                         if (p)
>                                 printf (", ");
> -                       print_number(i + offset);
> +                       print_number((i + offset) * ratio);
>                         for (j = i; j < nbytes && !in_use (bitmap, j); j++)
>                                 ;
>                         if (--j != i) {
>                                 fputc('-', stdout);
> -                               print_number(j + offset);
> +                               print_number((j + offset) * ratio);
>                                 i = j;
>                         }
>                         p = 1;
> @@ -153,7 +154,7 @@ static void list_desc (ext2_filsys fs)
>         blk64_t         blk_itr = fs->super->s_first_data_block;
>         ext2_ino_t      ino_itr = 1;
>
> -       block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> +       block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
>         inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;
>
>         if (fs->block_map)
> @@ -238,18 +239,19 @@ static void list_desc (ext2_filsys fs)
>                         fputs(_("  Free blocks: "), stdout);
>                         ext2fs_get_block_bitmap_range2(fs->block_map,
>                                  blk_itr, block_nbytes << 3, block_bitmap);
> -                       print_free (i, block_bitmap,
> -                                   fs->super->s_blocks_per_group,
> -                                   fs->super->s_first_data_block);
> +                       print_free(i, block_bitmap,
> +                                  fs->super->s_clusters_per_group,
> +                                  fs->super->s_first_data_block,
> +                                  fs->cluster_ratio);
>                         fputc('\n', stdout);
> -                       blk_itr += fs->super->s_blocks_per_group;
> +                       blk_itr += fs->super->s_clusters_per_group;
>                 }
>                 if (inode_bitmap) {
>                         fputs(_("  Free inodes: "), stdout);
>                         ext2fs_get_inode_bitmap_range2(fs->inode_map,
>                                  ino_itr, inode_nbytes << 3, inode_bitmap);
> -                       print_free (i, inode_bitmap,
> -                                   fs->super->s_inodes_per_group, 1);
> +                       print_free(i, inode_bitmap,
> +                                  fs->super->s_inodes_per_group, 1, 1);
>                         fputc('\n', stdout);
>                         ino_itr += fs->super->s_inodes_per_group;
>                 }
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 9798b88..079638f 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -815,7 +815,8 @@ static __u32 ok_features[3] = {
>                 EXT4_FEATURE_RO_COMPAT_DIR_NLINK|
>                 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|
>                 EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER|
> -               EXT4_FEATURE_RO_COMPAT_GDT_CSUM
> +               EXT4_FEATURE_RO_COMPAT_GDT_CSUM|
> +               EXT4_FEATURE_RO_COMPAT_BIGALLOC
>  };
>
>
> @@ -1252,7 +1253,7 @@ profile_error:
>         }
>
>         while ((c = getopt (argc, argv,
> -                   "b:cf:g:G:i:jl:m:no:qr:s:t:vE:FI:J:KL:M:N:O:R:ST:U:V")) != EOF) {
> +                   "b:cg:i:jl:m:no:qr:s:t:vC:E:FG:I:J:KL:M:N:O:R:ST:U:V")) != EOF) {
>                 switch (c) {
>                 case 'b':
>                         blocksize = strtol(optarg, &tmp, 0);
> @@ -1275,17 +1276,17 @@ profile_error:
>                 case 'c':       /* Check for bad blocks */
>                         cflag++;
>                         break;
> -               case 'f':
> +               case 'C':
>                         size = strtoul(optarg, &tmp, 0);
> -                       if (size < EXT2_MIN_BLOCK_SIZE ||
> -                           size > EXT2_MAX_BLOCK_SIZE || *tmp) {
> +                       if (size < EXT2_MIN_CLUSTER_SIZE ||
> +                           size > EXT2_MAX_CLUSTER_SIZE || *tmp) {
>                                 com_err(program_name, 0,
>                                         _("invalid fragment size - %s"),
>                                         optarg);
>                                 exit(1);
>                         }
> -                       fprintf(stderr, _("Warning: fragments not supported.  "
> -                              "Ignoring -f option\n"));
> +                       fs_param.s_log_cluster_size =
> +                               int_log2(size >> EXT2_MIN_CLUSTER_LOG_SIZE);
>                         break;
>                 case 'g':
>                         fs_param.s_blocks_per_group = strtoul(optarg, &tmp, 0);
> @@ -1515,8 +1516,6 @@ profile_error:
>                 check_plausibility(device_name);
>         check_mount(device_name, force, _("filesystem"));
>
> -       fs_param.s_log_cluster_size = fs_param.s_log_block_size;
> -
>         /* Determine the size of the device (if possible) */
>         if (noaction && fs_blocks_count) {
>                 dev_size = fs_blocks_count;
> @@ -1752,16 +1751,24 @@ profile_error:
>                 }
>         }
>
> +       fs_param.s_log_block_size =
> +               int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
> +       if (fs_param.s_feature_ro_compat & EXT4_FEATURE_RO_COMPAT_BIGALLOC) {
> +               if (fs_param.s_log_cluster_size == 0)
> +                       fs_param.s_log_cluster_size =
> +                               fs_param.s_log_block_size + 4;
> +       } else
> +               fs_param.s_log_cluster_size = fs_param.s_log_block_size;
> +
>         if (inode_ratio == 0) {
>                 inode_ratio = get_int_from_profile(fs_types, "inode_ratio",
>                                                    8192);
>                 if (inode_ratio < blocksize)
>                         inode_ratio = blocksize;
> +               if (inode_ratio < EXT2_CLUSTER_SIZE(&fs_param))
> +                       inode_ratio = EXT2_CLUSTER_SIZE(&fs_param);
>         }
>
> -       fs_param.s_log_cluster_size = fs_param.s_log_block_size =
> -               int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
> -
>  #ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
>         retval = get_device_geometry(device_name, &fs_param, psector_size);
>         if (retval < 0) {
> @@ -2049,6 +2056,33 @@ static int mke2fs_discard_device(ext2_filsys fs)
>         return retval;
>  }
>
> +static fix_cluster_bg_counts(ext2_filsys fs)
> +{
> +       blk64_t cluster, num_clusters, tot_free;
> +       int     grp_free, num_free, group, num;
> +
> +       num_clusters = ext2fs_blocks_count(fs->super) / fs->cluster_ratio;
> +       tot_free = num_free = num = group = grp_free = 0;
> +       for (cluster = fs->super->s_first_data_block / fs->cluster_ratio;
> +            cluster < num_clusters; cluster++) {
> +               if (!ext2fs_test_block_bitmap2(fs->block_map, cluster)) {
> +                       grp_free++;
> +                       tot_free++;
> +               }
> +               num++;
> +               if ((num == fs->super->s_clusters_per_group) ||
> +                   (cluster == num_clusters-1)) {
> +                       printf("Group %d has free #: %d\n", group, grp_free);
> +                       ext2fs_bg_free_blocks_count_set(fs, group, grp_free);
> +                       ext2fs_group_desc_csum_set(fs, group);
> +                       num = 0;
> +                       grp_free = 0;
> +                       group++;
> +               }
> +       }
> +       ext2fs_free_blocks_count_set(fs->super, tot_free);
> +}
> +
>  int main (int argc, char *argv[])
>  {
>         errcode_t       retval = 0;
> @@ -2367,6 +2401,17 @@ int main (int argc, char *argv[])
>         }
>  no_journal:
>
> +       if (EXT2_HAS_RO_COMPAT_FEATURE(&fs_param,
> +                                      EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
> +               ext2fs_block_bitmap cluster_map;
> +
> +               retval = ext2fs_convert_to_cluster_bitmap(fs, fs->block_map,
> +                                                         &cluster_map);
> +               ext2fs_free_block_bitmap(fs->block_map);
> +               fs->block_map = cluster_map;
> +               fix_cluster_bg_counts(fs);
> +       }
> +
>         if (!quiet)
>                 printf(_("Writing superblocks and "
>                        "filesystem accounting information: "));

2011-03-21 08:55:15

by Andreas Dilger

Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On 2011-03-19, at 10:28 PM, Theodore Ts'o wrote:
> This is an initial patchset of the bigalloc patches to ext4. This patch
> adds support for clustered allocation, so that each bit in the ext4
> block allocation bitmap addresses a power of two number of blocks. For
> example, if the file system is mainly going to be storing large files in
> the 4-32 megabyte range, it might make sense to set a cluster size of 1
> megabyte. This means that each bit in the block allocation bitmap would
> now address 256 4k blocks, and it means that the size of the block
> bitmaps for a 2T file system shrinks from 64 megabytes to 256k. It also
> means that a block group addresses 32 gigabytes instead of 128
> megabytes, also shrinking the amount of file system overhead for
> metadata.
>
> The cost is increased disk space efficiency. Directories will consume
> 1T, as will extent tree blocks.

Presumably you mean "1M" here and not "1T"?
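
For reference, the sizes quoted above can be checked with a little arithmetic. This is only a sketch: `bitmap_bytes` and `group_bytes` are hypothetical helper names, not e2fsprogs functions, and it assumes 4k blocks, 1M clusters, and one bitmap block per group.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes of allocation bitmap needed when one bit
 * covers unit_size bytes (a block without bigalloc, a cluster with it). */
static uint64_t bitmap_bytes(uint64_t fs_size, uint64_t unit_size)
{
	assert(unit_size > 0 && fs_size % unit_size == 0);
	return (fs_size / unit_size) / 8;	/* 8 bits per byte */
}

/* Hypothetical helper: bytes addressed by one group, assuming the group's
 * bitmap occupies exactly one block (block_size * 8 bits). */
static uint64_t group_bytes(uint64_t block_size, uint64_t unit_size)
{
	assert(block_size > 0 && unit_size >= block_size);
	return block_size * 8 * unit_size;
}
```

For a 2T file system (`1ULL << 41` bytes), `bitmap_bytes` gives 64M with 4k blocks and 256k with 1M clusters, and `group_bytes` gives 128M versus 32G, which matches the quoted text. The smallest possible directory would then be one cluster, i.e. 1M.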

> (I am on the fence as to whether I
> should add complexity so that in the rare case that an inode needs more
> than 344 extents --- a highly fragmented file indeed --- and need a
> second extent tree block, we can avoid allocating any cluster and
> instead use another block from the cluster used by the inode. The
> concern is the amount of complexity this adds to the e2fsck, not just to
> the kernel.)

It would be a shame to waste another MB of space just to allocate 4kB for the next indirect block... I guess it isn't clear to me why the index blocks need to be treated differently from file data blocks or directory blocks in this regard, since they both can use multiple blocks from the same cluster. Being able to use the full cluster would allow 256 * 344 = 88064 extents, or 11TB to be addressed by the cluster of index blocks, which should be plenty.
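
The figure above can be reproduced with the following sketch. The helper names are hypothetical, the 344 extents per block is the number used in this thread, and the 32768-block maximum extent length is my assumption about ext4's extent format, so treat the result as an estimate rather than a hard limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extents reachable if every block in one cluster
 * is used as an extent tree block. */
static uint64_t extents_per_cluster(uint64_t cluster_size, uint64_t block_size,
				    uint64_t extents_per_block)
{
	assert(block_size > 0 && cluster_size % block_size == 0);
	return (cluster_size / block_size) * extents_per_block;
}

/* Hypothetical helper: upper bound on bytes mapped when every extent has
 * the maximum length (assumed here to be max_extent_blocks blocks). */
static uint64_t max_mapped_bytes(uint64_t extents, uint64_t max_extent_blocks,
				 uint64_t block_size)
{
	return extents * max_extent_blocks * block_size;
}
```

With a 1M cluster and 4k blocks, `extents_per_cluster(1 << 20, 4096, 344)` is 88064, and `max_mapped_bytes(88064, 32768, 4096)` is 86 * 2^37 bytes, i.e. 10.75 TiB or roughly 11TB.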

Unfortunately, the overhead of allocating a whole cluster for every index block and every directory is fairly high. For Lustre it matters very little: there are only a handful of directories (under 40) on the data filesystems where this would be used, and the real directory tree is located on a different metadata filesystem, which probably wouldn't use this feature. For most "normal" users, though, this overhead may become prohibitive. That is why I've been trying to think of a way to allow sub-cluster allocations for these uses.
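
To put rough numbers on that overhead, here is a sketch that estimates the space lost when small objects each occupy a whole cluster. `cluster_slack_bytes` is a hypothetical helper, and the 100,000-directory figure used below is an illustrative assumption, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes wasted when each of `count` objects needs
 * used_blocks blocks but is rounded up to whole clusters. */
static uint64_t cluster_slack_bytes(uint64_t count, uint64_t used_blocks,
				    uint64_t block_size, uint64_t cluster_size)
{
	uint64_t used, clusters;

	assert(block_size > 0 && cluster_size >= block_size &&
	       cluster_size % block_size == 0);
	used = used_blocks * block_size;
	clusters = (used + cluster_size - 1) / cluster_size;	/* round up */
	return count * (clusters * cluster_size - used);
}
```

For 40 single-block directories with 4k blocks and 1M clusters this comes to 41,779,200 bytes (about 40M), which is negligible. For a hypothetical tree of 100,000 single-block directories it comes to 104,448,000,000 bytes, roughly 97G of slack.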

> To test these patches, I have used an *extremely* kludgy set of patches
> to e2fsprogs, which are attached below. These patches need *extensive*
> revision before I would consider them clean enough suitable for
> committing into e2fsprogs, but they were sufficient for me to do the
> kernel-side changes --- mke2fs, dumpe2fs, and debugfs work. E2fsck most
> definitely does _not_ work at this stage.
>
> Please comment! I do not intend for these patches to be merged during
> the 2.6.39 merge window. I am targeting 2.6.40, 3 months from now,
> since these patches are quite extensive.

Is that before or after e2fsck support for this will be done? I'm rather reluctant to commit anything to the kernel that doesn't have e2fsck support in a released e2fsprogs.

> - Ted
>
> Theodore Ts'o (12):
> ext4: read-only support for bigalloc file systems
> ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
> ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to
> EXT4_CLUSTERS_PER_GROUP
> ext4: Remove block bitmap initialization in ext4_new_inode()
> ext4: factor out block group accounting into functions
> ext4: split out ext4_free_blocks_after_init()
> ext4: bigalloc changes to block bitmap initialization functions
> ext4: Convert block group-relative offsets to use clusters
> ext4: teach ext4_ext_map_blocks() about the bigalloc feature
> ext4: teach ext4_statfs() to deal with clusters if bigalloc is
> enabled
> ext4: tune mballoc's default group prealloc size for bigalloc file
> systems
> ext4: enable mounting bigalloc as read/write
>
> fs/ext4/balloc.c | 268 +++++++++++++++++++++++++++++++++--------------------
> fs/ext4/ext4.h | 47 ++++++++--
> fs/ext4/extents.c | 132 +++++++++++++++++++++++---
> fs/ext4/ialloc.c | 37 --------
> fs/ext4/inode.c | 7 ++
> fs/ext4/ioctl.c | 33 ++++++-
> fs/ext4/mballoc.c | 49 ++++++----
> fs/ext4/mballoc.h | 3 +-
> fs/ext4/super.c | 100 ++++++++++++++++----
> 9 files changed, 472 insertions(+), 204 deletions(-)
>
> --
> 1.7.3.1
>
> ------------------- e2fsprogs patches follow below
>
> diff --git a/lib/ext2fs/bmap64.h b/lib/ext2fs/bmap64.h
> index b0aa84c..cfbdfd6 100644
> --- a/lib/ext2fs/bmap64.h
> +++ b/lib/ext2fs/bmap64.h
> @@ -31,6 +31,10 @@ struct ext2fs_struct_generic_bitmap {
> ((bmap)->magic == EXT2_ET_MAGIC_BLOCK_BITMAP64) || \
> ((bmap)->magic == EXT2_ET_MAGIC_INODE_BITMAP64))
>
> +/* Bitmap flags */
> +
> +#define EXT2_BMFLAG_CLUSTER 0x0001
> +
> struct ext2_bitmap_ops {
> int type;
> /* Generic bmap operators */
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index a89e33b..0970506 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -228,9 +228,13 @@ struct ext2_dx_countlimit {
>
> #define EXT2_BLOCKS_PER_GROUP(s) (EXT2_SB(s)->s_blocks_per_group)
> #define EXT2_INODES_PER_GROUP(s) (EXT2_SB(s)->s_inodes_per_group)
> +#define EXT2_CLUSTERS_PER_GROUP(s) (EXT2_SB(s)->s_clusters_per_group)
> #define EXT2_INODES_PER_BLOCK(s) (EXT2_BLOCK_SIZE(s)/EXT2_INODE_SIZE(s))
> /* limits imposed by 16-bit value gd_free_{blocks,inode}_count */
> -#define EXT2_MAX_BLOCKS_PER_GROUP(s) ((1 << 16) - 8)
> +#define EXT2_MAX_BLOCKS_PER_GROUP(s) (((1 << 16) - 8) * \
> + (EXT2_CLUSTER_SIZE(s) / \
> + EXT2_BLOCK_SIZE(s)))
> +#define EXT2_MAX_CLUSTERS_PER_GROUP(s) ((1 << 16) - 8)
> #define EXT2_MAX_INODES_PER_GROUP(s) ((1 << 16) - EXT2_INODES_PER_BLOCK(s))
> #ifdef __KERNEL__
> #define EXT2_DESC_PER_BLOCK(s) (EXT2_SB(s)->s_desc_per_block)
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index d3eb31d..a065e87 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -207,7 +207,7 @@ struct struct_ext2_filsys {
> char * device_name;
> struct ext2_super_block * super;
> unsigned int blocksize;
> - int clustersize;
> + int cluster_ratio;
> dgrp_t group_desc_count;
> unsigned long desc_blocks;
> struct opaque_ext2_group_desc * group_desc;
> @@ -232,7 +232,8 @@ struct struct_ext2_filsys {
> /*
> * Reserved for future expansion
> */
> - __u32 reserved[7];
> + __u32 clustersize;
> + __u32 reserved[6];
>
> /*
> * Reserved for the use of the calling application.
> @@ -553,7 +554,8 @@ typedef struct ext2_icount *ext2_icount_t;
> EXT2_FEATURE_RO_COMPAT_LARGE_FILE|\
> EXT4_FEATURE_RO_COMPAT_DIR_NLINK|\
> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|\
> - EXT4_FEATURE_RO_COMPAT_GDT_CSUM)
> + EXT4_FEATURE_RO_COMPAT_GDT_CSUM|\
> + EXT4_FEATURE_RO_COMPAT_BIGALLOC)
>
> /*
> * These features are only allowed if EXT2_FLAG_SOFTSUPP_FEATURES is passed
> diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
> index df095ac..60321df 100644
> --- a/lib/ext2fs/gen_bitmap64.c
> +++ b/lib/ext2fs/gen_bitmap64.c
> @@ -559,3 +559,85 @@ int ext2fs_warn_bitmap32(ext2fs_generic_bitmap bitmap, const char *func)
> "called %s with 64-bit bitmap", func);
> #endif
> }
> +
> +errcode_t ext2fs_allocate_cluster_bitmap(ext2_filsys fs,
> + const char *descr,
> + ext2fs_block_bitmap *ret)
> +{
> + __u64 start, end, real_end;
> + errcode_t retval;
> +
> + EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
> +
> + if (!(fs->flags & EXT2_FLAG_64BITS))
> + return EXT2_ET_CANT_USE_LEGACY_BITMAPS;
> +
> + fs->write_bitmaps = ext2fs_write_bitmaps;
> +
> + start = (fs->super->s_first_data_block >>
> + EXT2_CLUSTER_SIZE_BITS(fs->super));
> + end = (ext2fs_blocks_count(fs->super) - 1) / fs->cluster_ratio;
> + real_end = ((__u64) EXT2_CLUSTERS_PER_GROUP(fs->super)
> + * (__u64) fs->group_desc_count)-1 + start;
> +
> + retval = ext2fs_alloc_generic_bmap(fs,
> + EXT2_ET_MAGIC_BLOCK_BITMAP64,
> + EXT2FS_BMAP64_BITARRAY,
> + start, end, real_end, descr, ret);
> + if (retval)
> + return retval;
> +
> + (*ret)->flags = EXT2_BMFLAG_CLUSTER;
> +
> + printf("Returning 0...\n");
> + return 0;
> +}
> +
> +int ext2fs_is_cluster_bitmap(ext2fs_block_bitmap bm)
> +{
> + if (EXT2FS_IS_32_BITMAP(bm))
> + return 0;
> +
> + return (bm->flags & EXT2_BMFLAG_CLUSTER);
> +}
> +
> +errcode_t ext2fs_convert_to_cluster_bitmap(ext2_filsys fs,
> + ext2fs_block_bitmap bmap,
> + ext2fs_block_bitmap *ret)
> +{
> + ext2fs_block_bitmap cmap;
> + errcode_t retval;
> + blk64_t i, j, b_end, c_end;
> + int n;
> +
> + retval = ext2fs_allocate_cluster_bitmap(fs, "converted cluster bitmap",
> + ret);
> + if (retval)
> + return retval;
> +
> + cmap = *ret;
> + i = bmap->start;
> + b_end = bmap->end;
> + bmap->end = bmap->real_end;
> + j = cmap->start;
> + c_end = cmap->end;
> + cmap->end = cmap->real_end;
> + n = 0;
> + while (i < bmap->real_end) {
> + if (ext2fs_test_block_bitmap2(bmap, i)) {
> + ext2fs_mark_block_bitmap2(cmap, j);
> + i += fs->cluster_ratio - n;
> + j++;
> + n = 0;
> + continue;
> + }
> + i++; n++;
> + if (n >= fs->cluster_ratio) {
> + j++;
> + n = 0;
> + }
> + }
> + bmap->end = b_end;
> + cmap->end = c_end;
> + return 0;
> +}
> diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
> index e1f229b..00a8b38 100644
> --- a/lib/ext2fs/initialize.c
> +++ b/lib/ext2fs/initialize.c
> @@ -94,6 +94,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
> blk_t numblocks;
> int rsv_gdt;
> int csum_flag;
> + int bigalloc_flag;
> int io_flags;
> char *buf = 0;
> char c;
> @@ -134,12 +135,25 @@ errcode_t ext2fs_initialize(const char *name, int flags,
>
> #define set_field(field, default) (super->field = param->field ? \
> param->field : (default))
> +#define assign_field(field) (super->field = param->field)
>
> super->s_magic = EXT2_SUPER_MAGIC;
> super->s_state = EXT2_VALID_FS;
>
> - set_field(s_log_block_size, 0); /* default blocksize: 1024 bytes */
> - set_field(s_log_cluster_size, 0);
> + bigalloc_flag = EXT2_HAS_RO_COMPAT_FEATURE(param,
> + EXT4_FEATURE_RO_COMPAT_BIGALLOC);
> +
> + assign_field(s_log_block_size);
> +
> + if (bigalloc_flag) {
> + set_field(s_log_cluster_size, super->s_log_block_size+4);
> + if (super->s_log_block_size > super->s_log_cluster_size) {
> + retval = EXT2_ET_INVALID_ARGUMENT;
> + goto cleanup;
> + }
> + } else
> + super->s_log_cluster_size = super->s_log_block_size;
> +
> set_field(s_first_data_block, super->s_log_block_size ? 0 : 1);
> set_field(s_max_mnt_count, 0);
> set_field(s_errors, EXT2_ERRORS_DEFAULT);
> @@ -183,14 +197,36 @@ errcode_t ext2fs_initialize(const char *name, int flags,
>
> fs->blocksize = EXT2_BLOCK_SIZE(super);
> fs->clustersize = EXT2_CLUSTER_SIZE(super);
> + fs->cluster_ratio = fs->clustersize / fs->blocksize;
> +
> + if (bigalloc_flag) {
> + if (param->s_blocks_per_group &&
> + param->s_clusters_per_group &&
> + ((param->s_clusters_per_group * fs->cluster_ratio) !=
> + param->s_blocks_per_group)) {
> + retval = EXT2_ET_INVALID_ARGUMENT;
> + goto cleanup;
> + }
> + if (param->s_clusters_per_group)
> + assign_field(s_clusters_per_group);
> + else if (param->s_blocks_per_group)
> + super->s_clusters_per_group =
> + param->s_blocks_per_group / fs->cluster_ratio;
> + else
> + super->s_clusters_per_group = fs->blocksize * 8;
> + if (super->s_clusters_per_group > EXT2_MAX_CLUSTERS_PER_GROUP(super))
> + super->s_blocks_per_group = EXT2_MAX_CLUSTERS_PER_GROUP(super);
> + super->s_blocks_per_group = super->s_clusters_per_group;
> + super->s_blocks_per_group *= fs->cluster_ratio;
> + } else {
> + set_field(s_blocks_per_group, fs->blocksize * 8);
> + if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
> + super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
> + super->s_clusters_per_group = super->s_blocks_per_group;
> + }
>
> - /* default: (fs->blocksize*8) blocks/group, up to 2^16 (GDT limit) */
> - set_field(s_blocks_per_group, fs->blocksize * 8);
> - if (super->s_blocks_per_group > EXT2_MAX_BLOCKS_PER_GROUP(super))
> - super->s_blocks_per_group = EXT2_MAX_BLOCKS_PER_GROUP(super);
> - super->s_clusters_per_group = super->s_blocks_per_group;
> -
> - ext2fs_blocks_count_set(super, ext2fs_blocks_count(param));
> + ext2fs_blocks_count_set(super, ext2fs_blocks_count(param) &
> + ~((blk64_t) fs->cluster_ratio - 1));
> ext2fs_r_blocks_count_set(super, ext2fs_r_blocks_count(param));
> if (ext2fs_r_blocks_count(super) >= ext2fs_blocks_count(param)) {
> retval = EXT2_ET_INVALID_ARGUMENT;
> @@ -246,7 +282,7 @@ retry:
> */
> ipg = ext2fs_div_ceil(super->s_inodes_count, fs->group_desc_count);
> if (ipg > fs->blocksize * 8) {
> - if (super->s_blocks_per_group >= 256) {
> + if (!bigalloc_flag && super->s_blocks_per_group >= 256) {
> /* Try again with slightly different parameters */
> super->s_blocks_per_group -= 8;
> ext2fs_blocks_count_set(super,
> diff --git a/lib/ext2fs/openfs.c b/lib/ext2fs/openfs.c
> index 90abed1..8b37852 100644
> --- a/lib/ext2fs/openfs.c
> +++ b/lib/ext2fs/openfs.c
> @@ -251,6 +251,7 @@ errcode_t ext2fs_open2(const char *name, const char *io_options,
> goto cleanup;
> }
> fs->clustersize = EXT2_CLUSTER_SIZE(fs->super);
> + fs->cluster_ratio = fs->clustersize / fs->blocksize;
> fs->inode_blocks_per_group = ((EXT2_INODES_PER_GROUP(fs->super) *
> EXT2_INODE_SIZE(fs->super) +
> EXT2_BLOCK_SIZE(fs->super) - 1) /
> diff --git a/lib/ext2fs/rw_bitmaps.c b/lib/ext2fs/rw_bitmaps.c
> index 3031b7d..aeea997 100644
> --- a/lib/ext2fs/rw_bitmaps.c
> +++ b/lib/ext2fs/rw_bitmaps.c
> @@ -51,7 +51,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)
>
> inode_nbytes = block_nbytes = 0;
> if (do_block) {
> - block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> + block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
> retval = ext2fs_get_memalign(fs->blocksize, fs->blocksize,
> &block_buf);
> if (retval)
> @@ -85,7 +85,7 @@ static errcode_t write_bitmaps(ext2_filsys fs, int do_inode, int do_block)
> /* Force bitmap padding for the last group */
> nbits = ((ext2fs_blocks_count(fs->super)
> - (__u64) fs->super->s_first_data_block)
> - % (__u64) EXT2_BLOCKS_PER_GROUP(fs->super));
> + % (__u64) EXT2_CLUSTERS_PER_GROUP(fs->super));
> if (nbits)
> for (j = nbits; j < fs->blocksize * 8; j++)
> ext2fs_set_bit(j, block_buf);
> @@ -141,7 +141,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
> char *block_bitmap = 0, *inode_bitmap = 0;
> char *buf;
> errcode_t retval;
> - int block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> + int block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
> int inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;
> int csum_flag = 0;
> int do_image = fs->flags & EXT2_FLAG_IMAGE_FILE;
> @@ -219,7 +219,7 @@ static errcode_t read_bitmaps(ext2_filsys fs, int do_inode, int do_block)
> }
> blk = (fs->image_header->offset_blockmap /
> fs->blocksize);
> - blk_cnt = (blk64_t)EXT2_BLOCKS_PER_GROUP(fs->super) *
> + blk_cnt = (blk64_t)EXT2_CLUSTERS_PER_GROUP(fs->super) *
> fs->group_desc_count;
> while (block_nbytes > 0) {
> retval = io_channel_read_blk64(fs->image_io, blk++,
> diff --git a/misc/dumpe2fs.c b/misc/dumpe2fs.c
> index c01ffe5..d3f617a 100644
> --- a/misc/dumpe2fs.c
> +++ b/misc/dumpe2fs.c
> @@ -71,25 +71,26 @@ static void print_range(unsigned long long a, unsigned long long b)
> printf("%llu-%llu", a, b);
> }
>
> -static void print_free (unsigned long group, char * bitmap,
> - unsigned long nbytes, unsigned long offset)
> +static void print_free(unsigned long group, char * bitmap,
> + unsigned long nbytes, unsigned long offset, int ratio)
> {
> int p = 0;
> unsigned long i;
> unsigned long j;
>
> + offset /= ratio;
> offset += group * nbytes;
> for (i = 0; i < nbytes; i++)
> if (!in_use (bitmap, i))
> {
> if (p)
> printf (", ");
> - print_number(i + offset);
> + print_number((i + offset) * ratio);
> for (j = i; j < nbytes && !in_use (bitmap, j); j++)
> ;
> if (--j != i) {
> fputc('-', stdout);
> - print_number(j + offset);
> + print_number((j + offset) * ratio);
> i = j;
> }
> p = 1;
> @@ -153,7 +154,7 @@ static void list_desc (ext2_filsys fs)
> blk64_t blk_itr = fs->super->s_first_data_block;
> ext2_ino_t ino_itr = 1;
>
> - block_nbytes = EXT2_BLOCKS_PER_GROUP(fs->super) / 8;
> + block_nbytes = EXT2_CLUSTERS_PER_GROUP(fs->super) / 8;
> inode_nbytes = EXT2_INODES_PER_GROUP(fs->super) / 8;
>
> if (fs->block_map)
> @@ -238,18 +239,19 @@ static void list_desc (ext2_filsys fs)
> fputs(_(" Free blocks: "), stdout);
> ext2fs_get_block_bitmap_range2(fs->block_map,
> blk_itr, block_nbytes << 3, block_bitmap);
> - print_free (i, block_bitmap,
> - fs->super->s_blocks_per_group,
> - fs->super->s_first_data_block);
> + print_free(i, block_bitmap,
> + fs->super->s_clusters_per_group,
> + fs->super->s_first_data_block,
> + fs->cluster_ratio);
> fputc('\n', stdout);
> - blk_itr += fs->super->s_blocks_per_group;
> + blk_itr += fs->super->s_clusters_per_group;
> }
> if (inode_bitmap) {
> fputs(_(" Free inodes: "), stdout);
> ext2fs_get_inode_bitmap_range2(fs->inode_map,
> ino_itr, inode_nbytes << 3, inode_bitmap);
> - print_free (i, inode_bitmap,
> - fs->super->s_inodes_per_group, 1);
> + print_free(i, inode_bitmap,
> + fs->super->s_inodes_per_group, 1, 1);
> fputc('\n', stdout);
> ino_itr += fs->super->s_inodes_per_group;
> }
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 9798b88..079638f 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -815,7 +815,8 @@ static __u32 ok_features[3] = {
> EXT4_FEATURE_RO_COMPAT_DIR_NLINK|
> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE|
> EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER|
> - EXT4_FEATURE_RO_COMPAT_GDT_CSUM
> + EXT4_FEATURE_RO_COMPAT_GDT_CSUM|
> + EXT4_FEATURE_RO_COMPAT_BIGALLOC
> };
>
>
> @@ -1252,7 +1253,7 @@ profile_error:
> }
>
> while ((c = getopt (argc, argv,
> - "b:cf:g:G:i:jl:m:no:qr:s:t:vE:FI:J:KL:M:N:O:R:ST:U:V")) != EOF) {
> + "b:cg:i:jl:m:no:qr:s:t:vC:E:FG:I:J:KL:M:N:O:R:ST:U:V")) != EOF) {
> switch (c) {
> case 'b':
> blocksize = strtol(optarg, &tmp, 0);
> @@ -1275,17 +1276,17 @@ profile_error:
> case 'c': /* Check for bad blocks */
> cflag++;
> break;
> - case 'f':
> + case 'C':
> size = strtoul(optarg, &tmp, 0);
> - if (size < EXT2_MIN_BLOCK_SIZE ||
> - size > EXT2_MAX_BLOCK_SIZE || *tmp) {
> + if (size < EXT2_MIN_CLUSTER_SIZE ||
> + size > EXT2_MAX_CLUSTER_SIZE || *tmp) {
> com_err(program_name, 0,
> _("invalid fragment size - %s"),
> optarg);
> exit(1);
> }
> - fprintf(stderr, _("Warning: fragments not supported. "
> - "Ignoring -f option\n"));
> + fs_param.s_log_cluster_size =
> + int_log2(size >> EXT2_MIN_CLUSTER_LOG_SIZE);
> break;
> case 'g':
> fs_param.s_blocks_per_group = strtoul(optarg, &tmp, 0);
> @@ -1515,8 +1516,6 @@ profile_error:
> check_plausibility(device_name);
> check_mount(device_name, force, _("filesystem"));
>
> - fs_param.s_log_cluster_size = fs_param.s_log_block_size;
> -
> /* Determine the size of the device (if possible) */
> if (noaction && fs_blocks_count) {
> dev_size = fs_blocks_count;
> @@ -1752,16 +1751,24 @@ profile_error:
> }
> }
>
> + fs_param.s_log_block_size =
> + int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
> + if (fs_param.s_feature_ro_compat & EXT4_FEATURE_RO_COMPAT_BIGALLOC) {
> + if (fs_param.s_log_cluster_size == 0)
> + fs_param.s_log_cluster_size =
> + fs_param.s_log_block_size + 4;
> + } else
> + fs_param.s_log_cluster_size = fs_param.s_log_block_size;
> +
> if (inode_ratio == 0) {
> inode_ratio = get_int_from_profile(fs_types, "inode_ratio",
> 8192);
> if (inode_ratio < blocksize)
> inode_ratio = blocksize;
> + if (inode_ratio < EXT2_CLUSTER_SIZE(&fs_param))
> + inode_ratio = EXT2_CLUSTER_SIZE(&fs_param);
> }
>
> - fs_param.s_log_cluster_size = fs_param.s_log_block_size =
> - int_log2(blocksize >> EXT2_MIN_BLOCK_LOG_SIZE);
> -
> #ifdef HAVE_BLKID_PROBE_GET_TOPOLOGY
> retval = get_device_geometry(device_name, &fs_param, psector_size);
> if (retval < 0) {
> @@ -2049,6 +2056,33 @@ static int mke2fs_discard_device(ext2_filsys fs)
> return retval;
> }
>
> +static void fix_cluster_bg_counts(ext2_filsys fs)
> +{
> + blk64_t cluster, num_clusters, tot_free;
> + int grp_free, num_free, group, num;
> +
> + num_clusters = ext2fs_blocks_count(fs->super) / fs->cluster_ratio;
> + tot_free = num_free = num = group = grp_free = 0;
> + for (cluster = fs->super->s_first_data_block / fs->cluster_ratio;
> + cluster < num_clusters; cluster++) {
> + if (!ext2fs_test_block_bitmap2(fs->block_map, cluster)) {
> + grp_free++;
> + tot_free++;
> + }
> + num++;
> + if ((num == fs->super->s_clusters_per_group) ||
> + (cluster == num_clusters-1)) {
> + printf("Group %d has free #: %d\n", group, grp_free);
> + ext2fs_bg_free_blocks_count_set(fs, group, grp_free);
> + ext2fs_group_desc_csum_set(fs, group);
> + num = 0;
> + grp_free = 0;
> + group++;
> + }
> + }
> + ext2fs_free_blocks_count_set(fs->super, tot_free);
> +}
> +
> int main (int argc, char *argv[])
> {
> errcode_t retval = 0;
> @@ -2367,6 +2401,17 @@ int main (int argc, char *argv[])
> }
> no_journal:
>
> + if (EXT2_HAS_RO_COMPAT_FEATURE(&fs_param,
> + EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
> + ext2fs_block_bitmap cluster_map;
> +
> + retval = ext2fs_convert_to_cluster_bitmap(fs, fs->block_map,
> + &cluster_map);
> + ext2fs_free_block_bitmap(fs->block_map);
> + fs->block_map = cluster_map;
> + fix_cluster_bg_counts(fs);
> + }
> +
> if (!quiet)
> printf(_("Writing superblocks and "
> "filesystem accounting information: "));
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






2011-03-21 11:31:21

by Rogier Wolff

[permalink] [raw]
Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On Mon, Mar 21, 2011 at 09:55:07AM +0100, Andreas Dilger wrote:

> Unfortunately, the overhead of allocating a whole cluster for every
> index block and every directory is fairly high. For Lustre it
> matters very little, since there are only a handful of directories
> (under 40) on the data filesystems where this would be used and the
> real directory tree is located on a different metadata filesystem
> which probably wouldn't use this feature, but for most "normal"
> users this overhead may become prohibitive. That is why I've been
> trying to think of a way to allow sub-cluster allocations for these
> uses.

FYI, one filesystem I know has a neat trick that works well (IMHO).

Whenever you have a file or a "tail-of-a-file" that is smaller than a
cluster, you allocate it from a special file that contains all
tails-of-files that are a specific size.

That filesystem happened to have 64k clusters and 512-byte
tail-file granularity, so it had 128 files that contained the tails of
all possible different sizes.

In our case I would suggest using a sequence something like:
4k, 8k, 12k, 16k, 24k, 32k, 48k, 64k, 96k, 128k and so on.

That way with about 15 tail-files we are reasonably efficient (never
having more than 4k or 33% allocated over the filesize). Also, adding
8k to a 128k file doesn't require moving the whole file from one spot
in the filesystem to another, which would be required if we were to
enumerate all 4k-multiples up to 1M.
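
The size sequence above can be sketched as follows. This is only an
illustration of the arithmetic, not code from any filesystem: the
function names are hypothetical, sizes are in KiB, and max_kb (the
cluster size) is assumed to be a power of two of at least 8.

```c
#include <assert.h>

/*
 * Hypothetical helper: return the smallest tail-file size class (KiB)
 * that can hold a tail of size_kb KiB, using the sequence
 * 4, 8, 12, 16, 24, 32, 48, 64, 96, 128, ... up to max_kb.
 * Returns 0 if size_kb is 0 or larger than max_kb.
 */
unsigned long tail_class_kb(unsigned long size_kb, unsigned long max_kb)
{
    unsigned long p;

    /* Assumption: max_kb is a power of two and at least 8. */
    assert(max_kb >= 8 && (max_kb & (max_kb - 1)) == 0);

    if (size_kb == 0 || size_kb > max_kb)
        return 0;
    if (size_kb <= 4)
        return 4;
    /* For every power of two p >= 8 the classes are p and p + p/2. */
    for (p = 8; p <= max_kb; p *= 2) {
        if (size_kb <= p)
            return p;
        if (p + p / 2 <= max_kb && size_kb <= p + p / 2)
            return p + p / 2;
    }
    return 0;
}

/* Number of size classes strictly smaller than max_kb. */
unsigned int tail_class_count(unsigned long max_kb)
{
    unsigned int n = 1;     /* the 4 KiB class */
    unsigned long p;

    for (p = 8; p < max_kb; p *= 2)
        n += 2;             /* p and p + p/2 */
    return n;
}
```

With a 1M cluster (max_kb = 1024), tail_class_count() gives 15 classes
below the cluster size, which matches the "about 15 tail-files"
estimate.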

The higher bits of the block numbers will indicate that it refers to a
tail-block blocknumber, and in which tail-file to look for the
tail-block.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-03-21 13:12:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, RFC 03/12] ext4: Convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP

On Sun, Mar 20, 2011 at 12:26:57PM +0200, Amir Goldstein wrote:
> If I am not mistaken, while 'bit' was already converted to cluster units,
> 'count' is still in block units.
>
> I think ext4_free_blocks() need to do 2 things:
> 1. convert 'count' to clusters (after issuing journal_forget())
> 2. round 'bit' up (and round 'count' down) if start block is not
> on cluster boundary, so truncate/punch hole, will not free a
> cluster when it's 'base' block is still allocated.

Good catch! It's actually more complicated than that, though.
Whether or not we can free the first and the last cluster in an
unaligned extent is going to depend on whether or not there are any
blocks still in use --- which is something that has to be communicated
by extents.c, since mballoc has no way of knowing this. (Consider
what happens with sparse blocks.) What I think I'm going to have to
do is to teach truncate to check to see whether this is an unaligned
truncate, and if so, whether there are any blocks still left used. If
so, it will need to pass a flag to ext4_free_blocks() to tell it to
skip the first incomplete cluster.

Similarly, when the punch code lands, we'll need to worry about this
at the end of the region which gets "punched" out, and we'll need a
similar flag telling ext4_free_blocks() whether or not to release the
last incomplete cluster.
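
As a rough sketch of the rounding involved (not the eventual
ext4_free_blocks() change), the following computes which clusters a
freed block range may release. The struct, the function name, and the
keep_first/keep_last parameters are hypothetical stand-ins for the
flags described above; the caller is assumed to set them when the
partial cluster at that end still holds in-use blocks.

```c
#include <assert.h>
#include <stdint.h>

struct cluster_range {
    uint64_t first;   /* first cluster that may be freed */
    uint64_t count;   /* number of clusters that may be freed */
};

/*
 * Hypothetical sketch: map the freed block range [start, start + count)
 * to the clusters that can be released. cluster_bits is log2 of the
 * number of blocks per cluster.
 */
struct cluster_range clusters_to_free(uint64_t start, uint64_t count,
                                      unsigned int cluster_bits,
                                      int keep_first, int keep_last)
{
    struct cluster_range r = { 0, 0 };
    uint64_t mask, first, last;
    int first_partial, last_partial;

    assert(cluster_bits < 64);
    if (count == 0)
        return r;

    mask = ((uint64_t)1 << cluster_bits) - 1;
    first = start >> cluster_bits;
    last = (start + count - 1) >> cluster_bits;
    first_partial = (start & mask) != 0;            /* range starts mid-cluster */
    last_partial = ((start + count) & mask) != 0;   /* range ends mid-cluster */

    if (first == last) {
        /* The range lies within a single cluster. */
        if ((first_partial && keep_first) || (last_partial && keep_last))
            return r;
        r.first = first;
        r.count = 1;
        return r;
    }

    if (first_partial && keep_first)
        first++;    /* skip the first incomplete cluster */
    if (last_partial && keep_last)
        last--;     /* skip the last incomplete cluster */
    if (first > last)
        return r;

    r.first = first;
    r.count = last - first + 1;
    return r;
}
```

For example, with 16 blocks per cluster, freeing blocks 5..34 while
both partial clusters are still in use releases only cluster 1.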

Thanks for catching this; I have a bit more coding work to do to deal
with this case.

- Ted

2011-03-21 13:24:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On Mon, Mar 21, 2011 at 09:55:07AM +0100, Andreas Dilger wrote:
> >
> > The cost is increased disk space efficiency. Directories will consume
> > 1T, as will extent tree blocks.
>
> Presumably you mean "1M" here and not "1T"?

Yes; or more accurately, one allocation cluster (no matter what size
it might be).


> It would be a shame to waste another MB of space just to allocate
> 4kB for the next indirect block... I guess it isn't clear to me why
> the index blocks need to be treated differently from file data
> blocks or directory blocks in this regard, since they both can use
> multiple blocks from the same cluster. Being able to use the full
> cluster would allow 256 * 344 = 88064 extents, or 11TB to be
> addressed by the cluster of index blocks, which should be plenty.

There's a reason why I'm explicitly not supporting indirect blocks
with bigalloc, at least initially. :-)

The reason why this gets difficult with metadata blocks (directory
blocks excepted) is the problem of determining whether or not a block
in a cluster is in use or not at allocation time, and whether all of
the blocks in a cluster are no longer in use when deciding whether or
not to free a cluster. For data blocks we rely on the extent tree to
determine this, since clusters are aligned with respect to logical
block numbers --- that is, a physical cluster which is 1M starts on a
1M logical block boundary, and covers the logical blocks in that 1M
region. So if you have a file which has a 4k sparse block at offset
4, and another 4k sparse block located at offset 1M+42, that file will
consume _two_ clusters, not one.
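
A small sketch of that counting rule, under the assumption that the
offsets in the example are logical block numbers of 4k blocks and that
a 1M cluster therefore covers 256 logical blocks (cluster_bits = 8).
The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: count the clusters consumed by a file whose
 * allocated logical blocks are listed in ascending order in lblk[].
 * Because clusters are aligned to logical block numbers, the cluster
 * of a block is simply lblk >> cluster_bits.
 */
size_t clusters_used(const uint64_t *lblk, size_t n, unsigned int cluster_bits)
{
    size_t used = 0;
    uint64_t prev = 0;
    size_t i;

    assert(cluster_bits < 64);
    for (i = 0; i < n; i++) {
        uint64_t c = lblk[i] >> cluster_bits;

        /* Assumption: the input is sorted in ascending order. */
        assert(i == 0 || lblk[i] >= lblk[i - 1]);
        if (i == 0 || c != prev)
            used++;
        prev = c;
    }
    return used;
}
```

Blocks 4 and 256 + 42 fall into logical clusters 0 and 1, so the file
consumes two clusters; blocks 4 and 42 would share one.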

But for file system metadata blocks, such as extent tree blocks, if we
want to allocate multiple blocks from the same cluster, we would need
some way of determining which blocks from that cluster have been
allocated so far. I could add a bitmap to the first block in the
cluster, but that adds a lot of complexity.

One thing which I've thought about doing is to initialize a bitmap in
the first block of a cluster (and then use the second block), but to
only use one block per cluster for extent tree blocks --- at least for
now. That would allow a future read-only extension to use multiple
blocks/cluster, and if I also implement checking the bitmap at free
time, it could be a fully backwards compatible extension.

> Unfortunately, the overhead of allocating a whole cluster for every
> index block and every directory is fairly high. For Lustre it
> matters very little, since there are only a handful of directories
> (under 40) on the data filesystems where this would be used and the
> real directory tree is located on a different metadata filesystem
> which probably wouldn't use this feature, but for most "normal"
> users this overhead may become prohibitive. That is why I've been
> trying to think of a way to allow sub-cluster allocations for these
> uses.

I don't think it's that bad, if the cluster size is well chosen. If
you know that most of your files are 4-8M, and you are using a 1M
cluster allocation size, most of the time you will be able to fit all
of the extents you need into the inode. It's only for highly
fragmented file systems that you'll need more than 3 extents to store
8 clusters, no? And for very large files, say 256M, an extra 1M
extent would be unfortunate, if it is needed, but as a percentage of
the file space used, it's not a complete deal breaker.

> > Please comment! I do not intend for these patches to be merged during
> > the 2.6.39 merge window. I am targetting 2.6.40, 3 months from now,
> > since these patches are quite extensive.
>
> Is that before or after e2fsck support for this will be done? I'm
> rather reluctant to commit anything to the kernel that doesn't have
> e2fsck support in a released e2fsprogs.

I think getting the e2fsck changes done in 3 months really ought not
to be a problem...

- Ted

2011-03-21 19:35:24

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH, RFC 01/12] ext4: read-only support for bigalloc file systems

On Sat, 19 Mar 2011, Theodore Ts'o wrote:

> This adds supports for bigalloc file systems. It teaches the mount
> code just enough about bigalloc superblock fields that it will mount
> the file system without freaking out that the number of blocks per
> group is too big.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ext4.h | 18 ++++++++++++++--
> fs/ext4/super.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 65 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3aa0b72..94a7a7b 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -231,12 +231,16 @@ struct ext4_io_submit {
> #define EXT4_MAX_BLOCK_LOG_SIZE 16
> #ifdef __KERNEL__
> # define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
> +# define EXT4_CLUSTER_SIZE(s) (EXT4_SB(s)->s_clustersize)
> #else
> +# define EXT2_CLUSTER_SIZE(s) (EXT2_MIN_BLOCK_SIZE << \
> + (s)->s_log_cluster_size)
> # define EXT4_BLOCK_SIZE(s) (EXT4_MIN_BLOCK_SIZE << (s)->s_log_block_size)
> #endif
> #define EXT4_ADDR_PER_BLOCK(s) (EXT4_BLOCK_SIZE(s) / sizeof(__u32))
> #ifdef __KERNEL__
> # define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_blocksize_bits)
> +# define EXT4_CLUSTER_SIZE_BITS(s) (EXT4_SB(s)->s_clustersize_bits)
> #else
> # define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_log_block_size + 10)
> #endif
> @@ -302,6 +306,7 @@ struct flex_groups {
> #define EXT4_DESC_SIZE(s) (EXT4_SB(s)->s_desc_size)
> #ifdef __KERNEL__
> # define EXT4_BLOCKS_PER_GROUP(s) (EXT4_SB(s)->s_blocks_per_group)
> +# define EXT4_CLUSTERS_PER_GROUP(s) (EXT4_SB(s)->s_clusters_per_group)
> # define EXT4_DESC_PER_BLOCK(s) (EXT4_SB(s)->s_desc_per_block)
> # define EXT4_INODES_PER_GROUP(s) (EXT4_SB(s)->s_inodes_per_group)
> # define EXT4_DESC_PER_BLOCK_BITS(s) (EXT4_SB(s)->s_desc_per_block_bits)
> @@ -957,9 +962,9 @@ struct ext4_super_block {
> /*10*/ __le32 s_free_inodes_count; /* Free inodes count */
> __le32 s_first_data_block; /* First Data Block */
> __le32 s_log_block_size; /* Block size */
> - __le32 s_obso_log_frag_size; /* Obsoleted fragment size */
> + __le32 s_log_cluster_size; /* Allocation cluster size */
> /*20*/ __le32 s_blocks_per_group; /* # Blocks per group */
> - __le32 s_obso_frags_per_group; /* Obsoleted fragments per group */
> + __le32 s_clusters_per_group; /* # Clusters per group */
> __le32 s_inodes_per_group; /* # Inodes per group */
> __le32 s_mtime; /* Mount time */
> /*30*/ __le32 s_wtime; /* Write time */
> @@ -1055,7 +1060,10 @@ struct ext4_super_block {
> __u8 s_last_error_func[32]; /* function where the error happened */
> #define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
> __u8 s_mount_opts[64];
> - __le32 s_reserved[112]; /* Padding to the end of the block */
> + __le32 s_usr_quota_inum; /* inode for tracking user quota */
> + __le32 s_grp_quota_inum; /* inode for tracking group quota */
> + __le32 s_overhead_blocks; /* overhead blocks/clusters in fs */
> + __le32 s_reserved[109]; /* Padding to the end of the block */
> };
>
> #define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
> @@ -1075,6 +1083,7 @@ struct ext4_sb_info {
> unsigned long s_desc_size; /* Size of a group descriptor in bytes */
> unsigned long s_inodes_per_block;/* Number of inodes per block */
> unsigned long s_blocks_per_group;/* Number of blocks in a group */
> + unsigned long s_clusters_per_group; /* Number of clusters in a group */
> unsigned long s_inodes_per_group;/* Number of inodes in a group */
> unsigned long s_itb_per_group; /* Number of inode table blocks per group */
> unsigned long s_gdb_count; /* Number of group descriptor blocks */
> @@ -1083,6 +1092,8 @@ struct ext4_sb_info {
> ext4_group_t s_blockfile_groups;/* Groups acceptable for non-extent files */
> unsigned long s_overhead_last; /* Last calculated overhead */
> unsigned long s_blocks_last; /* Last seen block count */
> + unsigned int s_cluster_ratio; /* Number of blocks per cluster */
> + unsigned int s_cluster_bits; /* log2 of s_cluster_ratio */
> loff_t s_bitmap_maxbytes; /* max bytes for bitmap files */
> struct buffer_head * s_sbh; /* Buffer containing the super block */
> struct ext4_super_block *s_es; /* Pointer to the super block in the buffer */
> @@ -1338,6 +1349,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
> #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
> #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
> +#define EXT4_FEATURE_RO_COMPAT_BIGALLOC 0x0200
>
> #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
> #define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b357c27..7273728 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1875,7 +1875,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
> res = MS_RDONLY;
> }
> if (read_only)
> - return res;
> + goto done;
> if (!(sbi->s_mount_state & EXT4_VALID_FS))
> ext4_msg(sb, KERN_WARNING, "warning: mounting unchecked fs, "
> "running e2fsck is recommended");
> @@ -1906,6 +1906,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
> EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>
> ext4_commit_super(sb, 1);
> +done:
> if (test_opt(sb, DEBUG))
> printk(KERN_INFO "[EXT4 FS bs=%lu, gc=%u, "
> "bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x]\n",

Maybe this is a bit nitpicky, but shouldn't this rather be done in a
separate commit, since it has nothing to do with bigalloc?

> @@ -3022,10 +3023,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> char *cp;
> const char *descr;
> int ret = -ENOMEM;
> - int blocksize;
> + int blocksize, clustersize;
> unsigned int db_count;
> unsigned int i;
> - int needs_recovery, has_huge_files;
> + int needs_recovery, has_huge_files, has_bigalloc;
> __u64 blocks_count;
> int err;
> unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> @@ -3276,12 +3277,53 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> sb->s_dirt = 1;
> }
>
> - if (sbi->s_blocks_per_group > blocksize * 8) {
> - ext4_msg(sb, KERN_ERR,
> - "#blocks per group too big: %lu",
> - sbi->s_blocks_per_group);
> - goto failed_mount;
> + /* Handle clustersize */
> + clustersize = BLOCK_SIZE << le32_to_cpu(es->s_log_cluster_size);
> + has_bigalloc = EXT4_HAS_RO_COMPAT_FEATURE(sb,
> + EXT4_FEATURE_RO_COMPAT_BIGALLOC);
> + if (has_bigalloc) {
> + if (clustersize < blocksize) {
> + ext4_msg(sb, KERN_ERR,
> + "cluster size (%d) smaller than "
> + "block size (%d)", clustersize, blocksize);
> + goto failed_mount;
> + }
> + sbi->s_cluster_bits = le32_to_cpu(es->s_log_cluster_size) -
> + le32_to_cpu(es->s_log_block_size);
> + sbi->s_clusters_per_group =
> + le32_to_cpu(es->s_clusters_per_group);
> + if (sbi->s_clusters_per_group > blocksize * 8) {
> + ext4_msg(sb, KERN_ERR,
> + "#clusters per group too big: %lu",
> + sbi->s_clusters_per_group);
> + goto failed_mount;
> + }
> + if (sbi->s_blocks_per_group !=
> + (sbi->s_clusters_per_group * (clustersize / blocksize))) {
> + ext4_msg(sb, KERN_ERR, "blocks per group (%lu) and "
> + "clusters per group (%lu) inconsistent",
> + sbi->s_blocks_per_group,
> + sbi->s_clusters_per_group);
> + goto failed_mount;
> + }
> + } else {
> + if (clustersize != blocksize) {
> + ext4_warning(sb, "fragment/cluster size (%d) != "
> + "block size (%d)", clustersize,
> + blocksize);
> + clustersize = blocksize;

I wonder whether we should continue at this point. Something has
definitely gone wrong: the file system does not have the bigalloc
feature, yet s_log_cluster_size does not match s_log_block_size, which
means corruption or an error somewhere.

> + }
> + if (sbi->s_blocks_per_group > blocksize * 8) {
> + ext4_msg(sb, KERN_ERR,
> + "#blocks per group too big: %lu",
> + sbi->s_blocks_per_group);
> + goto failed_mount;
> + }
> + sbi->s_clusters_per_group = sbi->s_blocks_per_group;
> + sbi->s_cluster_bits = 0;
> }
> + sbi->s_cluster_ratio = clustersize / blocksize;
> +
> if (sbi->s_inodes_per_group > blocksize * 8) {
> ext4_msg(sb, KERN_ERR,
> "#inodes per group too big: %lu",
>

Thanks!
-Lukas

2011-03-21 20:17:34

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH, RFC 10/12] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

On Sat, 19 Mar 2011, Theodore Ts'o wrote:

> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/super.c | 35 +++++++++++++++++++++++------------
> 1 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index f9b25cd..24964da 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4438,15 +4438,33 @@ restore_opts:
> return err;
> }
>
> +/*
> + * Note: calculating the overhead so we can be compatible with
> + * historical BSD practice is quite difficult in the face of
> + * clusters/bigalloc. This is because multiple metadata blocks from
> + * different block group can end up in the same allocation cluster.
> + * Calculating the exact overhead in the face of clustered allocation
> + * requires either O(all block bitmaps) in memory or O(number of block
> + * groups**2) in time. We will still calculate the superblock for
> + * older file systems --- and if we come across with a bigalloc file
> + * system with zero in s_overhead_blocks the estimate will be close to
> + * correct especially for very large cluster sizes --- but for newer
> + * file systems, it's better to calculate this figure once at mkfs
> + * time, and store it in the superblock. If the superblock value is
> + * present (even for non-bigalloc file systems), we will use it.
> + */
> static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> struct super_block *sb = dentry->d_sb;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct ext4_super_block *es = sbi->s_es;
> + struct ext4_group_desc *gdp;
> u64 fsid;
>
> if (test_opt(sb, MINIX_DF)) {
> sbi->s_overhead_last = 0;
> + } else if (es->s_overhead_blocks) {
> + sbi->s_overhead_last = le32_to_cpu(es->s_overhead_blocks);
> } else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> ext4_fsblk_t overhead = 0;
> @@ -4461,24 +4479,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> * All of the blocks before first_data_block are
> * overhead
> */
> - overhead = le32_to_cpu(es->s_first_data_block);
> + overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
>
> /*
> - * Add the overhead attributed to the superblock and
> - * block group descriptors. If the sparse superblocks
> - * feature is turned on, then not all groups have this.
> + * Add the overhead found in each block group
> */
> for (i = 0; i < ngroups; i++) {
> - overhead += ext4_bg_has_super(sb, i) +
> - ext4_bg_num_gdb(sb, i);
> + gdp = ext4_get_group_desc(sb, i, NULL);
> + overhead += ext4_num_overhead_clusters(sb, i, gdp);
> cond_resched();
> }
> -
> - /*
> - * Every block group has an inode bitmap, a block
> - * bitmap, and an inode table.
> - */
> - overhead += ngroups * (2 + sbi->s_itb_per_group);
> sbi->s_overhead_last = overhead;

overhead is in cluster units, but

> smp_wmb();
> sbi->s_blocks_last = ext4_blocks_count(es);
> @@ -4489,6 +4499,7 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;

here it seems to be treated as blocks if I am not missing something.

> buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
> percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
> + buf->f_bfree = buf->f_bfree << sbi->s_cluster_bits;
> buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
> if (buf->f_bfree < ext4_r_blocks_count(es))
> buf->f_bavail = 0;
>

2011-03-22 07:04:52

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On 2011-03-21, at 2:24 PM, Ted Ts'o <[email protected]> wrote:
> On Mon, Mar 21, 2011 at 09:55:07AM +0100, Andreas Dilger wrote:
>> It would be a shame to waste another MB of space just to allocate
>> 4kB for the next index block... I guess it isn't clear to me why
>> the index blocks need to be treated differently from file data
>> blocks or directory blocks in this regard, since they both can use
>> multiple blocks from the same cluster. Being able to use the full
>> cluster would allow 256 * 344 = 88064 extents, or 11TB to be
>> addressed by the cluster of index blocks, which should be plenty.
>
> There's a reason why I'm explicitly not supporting indirect blocks
> with bigalloc, at least initially. :-)

To clarify, that means only extent-mapped files with bigalloc? I was actually referring to the extent index cluster allocations. I'd assume that at least a single index block needs to be handled, otherwise the maximum file size would be 4 * 128MB = 512MB.
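
The figures quoted here can be checked with a short calculation. This
is a sketch only: the function name is hypothetical, and the 32768-block
extent length is an assumption derived from the 128MB figure with 4k
blocks; the 344 extents per index block is the number used in this
thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: maximum number of bytes that nr_extents extents
 * can map if each extent covers at most max_extent_blocks blocks of
 * block_size bytes.
 */
uint64_t max_mapped_bytes(uint64_t nr_extents, uint32_t max_extent_blocks,
                          uint32_t block_size)
{
    assert(block_size != 0);
    return nr_extents * max_extent_blocks * block_size;
}
```

With 4 extents in the inode, max_mapped_bytes(4, 32768, 4096) is 512MB.
With a full 1M cluster of index blocks, 256 * 344 = 88064 extents map
88064 * 128MB, which is roughly 10.75TB and rounds to the 11TB quoted.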

> The reason why this gets difficult with metadata blocks (directory
> blocks excepted) is the problem of determining whether or not a block
> in a cluster is in use or not at allocation time, and whether all of
> the blocks in a cluster are no longer in use when deciding whether or
> not to free a cluster. For data blocks we rely on the extent tree to
> determine this, since clusters are aligned with respect to logical
> block numbers --- that is, a physical cluster which is 1M starts on a
> 1M logical block boundary, and covers the logical blocks in that 1M
> region. So if you have a file which has a 4k sparse block at offset
> 4, and another 4k sparse block located at offset 1M+42, that file will
> consume _two_ clusters, not one.

Will the actual file allocation be pointing to the cluster or the blocks within the cluster? Pointing at the individual blocks is probably best (allows FIEMAP to return the actual used blocks, punch/truncate of the real blocks will free the cluster), so long as later allocation of adjacent blocks will first consume the unused blocks in that cluster instead of allocating new clusters.

> But for file system metadata blocks, such as extent tree blocks, if we
> want to allocate multiple blocks from the same cluster, we would need
> some way of determining which blocks from that cluster have been
> allocated so far. I could add a bitmap to the first block in the
> cluster, but that adds a lot of complexity.

Given that the number of index blocks for a single inode will be tiny, doing a list walk to see which blocks in the cluster are used would be pretty reasonable. Compared with the need to allocate a new cluster on disk, and later to seek to read the new index block from a different cluster, I think it is better to just do the full search.
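
A minimal sketch of such a list walk, assuming the cluster is used only
for this inode's index blocks and that their physical block numbers
are already available in an array. All names here are hypothetical;
this is not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: find a block in the cluster starting at
 * cluster_start (ratio blocks long) that none of the inode's index
 * blocks occupies. Returns 1 and stores the block in *out, or
 * returns 0 if every block in the cluster is taken.
 */
int find_free_in_cluster(const uint64_t *index_blks, size_t n,
                         uint64_t cluster_start, unsigned int ratio,
                         uint64_t *out)
{
    unsigned int off;
    size_t i;

    assert(out != NULL);
    for (off = 0; off < ratio; off++) {
        uint64_t cand = cluster_start + off;
        int used = 0;

        /* The list is tiny, so a linear scan per candidate is fine. */
        for (i = 0; i < n; i++) {
            if (index_blks[i] == cand) {
                used = 1;
                break;
            }
        }
        if (!used) {
            *out = cand;
            return 1;
        }
    }
    return 0;
}
```

The cost is O(n * ratio) comparisons, which stays small as long as an
inode has only a handful of index blocks.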

> One thing which I've thought about doing is to initialize a bitmap in
> the first block of a cluster (and then use the second block), but to
> only use one block per cluster for extent tree blocks --- at least for
> now. That would allow a future read-only extension to use multiple
> blocks/cluster, and if I also implement checking the bitmap at free
> time, it could be a fully backwards compatible extension.

This is probably overkill. For example, a cluster size of 1MB means only 256 index blocks per cluster, all of which would fit into the first-level index block if second-level index blocks are needed.

>
>> Unfortunately, the overhead of allocating a whole cluster for every
>> index block and every directory is fairly high. For Lustre it
>> matters very little, since there are only a handful of directories
>> (under 40) on the data filesystems where this would be used and the
>> real directory tree is located on a different metadata filesystem
>> which probably wouldn't use this feature, but for most "normal"
>> users this overhead may become prohibitive. That is why I've been
>> trying to think of a way to allow sub-cluster allocations for these
>> uses.
>
> I don't think it's that bad, if the cluster size is well chosen. If
> you know that most of your files are 4-8M, and you are using a 1M
> cluster allocation size, most of the time you will be able to fit all
> of the extents you need into the inode.

Maybe I'm missing something, but isn't the direct-addressable extent size limit independent of the cluster size? The extents are referencing blocks, so the extent size is still capped at 128MB.

> It's only for highly
> fragmented file systems that you'll need more than 3 extents to store
> 8 clusters, no? And for very large files, say 256M, an extra 1M
> extent would be unfortunate, if it is needed, but as a percentage of
> the file space used, it's not a complete deal breaker.

Well, it's true that with bigalloc it will be more likely to have contiguous extents (guaranteed to have at least extents of the cluster size :-) but it isn't clear whether this will avoid the need for multiple extents at a larger scale.

>>> Please comment! I do not intend for these patches to be merged during
>>> the 2.6.39 merge window. I am targetting 2.6.40, 3 months from now,
>>> since these patches are quite extensive.
>>
>> Is that before or after e2fsck support for this will be done? I'm
>> rather reluctant to commit anything to the kernel that doesn't have
>> e2fsck support in a released e2fsprogs.
>
> I think getting the e2fsck changes done in 3 months really ought not
> to be a problem...

Will that release include the 64bit feature, or will it be based off 1.41?

Cheers, Andreas

2011-03-22 17:02:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, RFC 01/12] ext4: read-only support for bigalloc file systems

On Mon, Mar 21, 2011 at 08:35:20PM +0100, Lukas Czerner wrote:
>
> Maybe this is a bit nitpicky, but should not this be rather done in
> separate commit as it has nothing to do with bigalloc ?

Perhaps. The reason why I had it was because I wanted to see the
blocks per group information when I was testing a read-only bigalloc
mount. I'll grant this is tenuous; I suppose I could separate out
these two patch hunks into a separate patch, but I didn't think it was
really worth it.

>
> I wonder if we should continue at this point, because something
> definitely went wrong as it has not bigalloc feature but yet
> s_log_cluster_size does not match s_log_block_size which means
> definitely corruption or an error somewhere.

I considered this, but I was just paranoid because I didn't want to
change anything in the !bigalloc case. There were one or two users who
reported that somehow the second 512 byte sector was containing
garbage, and nothing had cared in the past, but when we first broke
into the second 512 bytes of the superblock we did have some
complaints, so that's why I decided to err on the side of
conservatism.

- Ted

2011-03-22 22:09:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, RFC 10/12] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

On Mon, Mar 21, 2011 at 09:17:30PM +0100, Lukas Czerner wrote:
>
> overhead is in clusters units, but
>
> > smp_wmb();
> > sbi->s_blocks_last = ext4_blocks_count(es);
> > @@ -4489,6 +4499,7 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> > buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
>
> here it seems to be treated as blocks if I am not missing something.

Oops, good point! Thanks for catching that.
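
The unit mismatch can be shown with a small stand-alone sketch. This is not the eventual fix; the helper names `clusters_to_blocks` and `statfs_f_blocks` are hypothetical, and `cluster_bits` is assumed to be log2 of the number of blocks per cluster (0 when bigalloc is off). The point is only that an overhead counted in clusters has to be converted to blocks before it is subtracted from a block count.

```c
#include <stdint.h>

/* Convert a count of clusters into a count of blocks.
 * cluster_bits is log2(blocks per cluster); 0 means one block per cluster. */
uint64_t clusters_to_blocks(uint64_t clusters, unsigned cluster_bits)
{
    return clusters << cluster_bits;
}

/* Total usable blocks as statfs would report them: the overhead is kept in
 * clusters, so it is converted to blocks before the subtraction. */
uint64_t statfs_f_blocks(uint64_t blocks_count, uint64_t overhead_clusters,
                         unsigned cluster_bits)
{
    return blocks_count - clusters_to_blocks(overhead_clusters, cluster_bits);
}
```

With 256 blocks per cluster (`cluster_bits` of 8), an overhead of 10 clusters removes 2560 blocks from the total, not 10.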

- Ted

2011-03-23 10:28:18

by Lukas Czerner

Subject: Re: [PATCH, RFC 01/12] ext4: read-only support for bigalloc file systems

On Tue, 22 Mar 2011, Ted Ts'o wrote:

> On Mon, Mar 21, 2011 at 08:35:20PM +0100, Lukas Czerner wrote:
> >
> > Maybe this is a bit nitpicky, but shouldn't this rather be done in a
> > separate commit, as it has nothing to do with bigalloc?
>
> Perhaps. The reason why I had it was because I wanted to see the
> blocks per group information when I was testing a read-only bigalloc
> mount. I'll grant this is tenuous; I suppose I could separate out
> these two patch hunks into a separate patch, but I didn't think it was
> really worth it.
>
> >
> > I wonder if we should continue at this point, because something
> > definitely went wrong: the file system does not have the bigalloc
> > feature, yet s_log_cluster_size does not match s_log_block_size, which
> > definitely means corruption or an error somewhere.
>
> I considered this, but I was just being paranoid because I didn't want
> to change anything in the !bigalloc case. There were one or two users
> who reported that the second 512-byte sector somehow contained
> garbage. Nothing had cared in the past, but when we first started using
> the second 512 bytes of the superblock we did get some
> complaints, so that's why I decided to err on the side of
> conservatism.

Right, I can see your point. So maybe we should add this check to e2fsck
to set the s_log_cluster_size properly if bigalloc is not set.
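
A rough sketch of what such a check could look like follows. It is not e2fsck code: `sb_fix_sketch`, `fix_log_cluster_size` and the `has_bigalloc` flag are hypothetical, and a real checker would presumably ask the user before changing anything and then mark the superblock dirty.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two superblock fields under discussion. */
struct sb_fix_sketch {
    uint32_t s_log_block_size;
    uint32_t s_log_cluster_size;
    bool     has_bigalloc;       /* stand-in for the feature-flag test */
};

/* Returns true if the superblock was modified and needs to be written back. */
bool fix_log_cluster_size(struct sb_fix_sketch *sb)
{
    if (sb->has_bigalloc)
        return false;            /* the field is meaningful; leave it alone */
    if (sb->s_log_cluster_size == sb->s_log_block_size)
        return false;            /* already consistent */
    /* Without bigalloc, a cluster is exactly one block. */
    sb->s_log_cluster_size = sb->s_log_block_size;
    return true;
}
```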

>
> - Ted

-Lukas

2011-04-05 16:54:34

by Coly Li

Subject: Re: [PATCH, RFC 00/12] bigalloc patchset

On 2011-03-20 05:28, Theodore Ts'o wrote:
> This is an initial patchset of the bigalloc patches to ext4. This patch
> adds support for clustered allocation, so that each bit in the ext4
> block allocation bitmap addresses a power of two number of blocks. For
> example, if the file system is mainly going to be storing large files in
> the 4-32 megabyte range, it might make sense to set a cluster size of 1
> megabyte. This means that each bit in the block allocation bitmap would
> now address 256 4k blocks, and it means that the size of the block
> bitmaps for a 2T file system shrinks from 64 megabytes to 256k. It also
> means that a block group addresses 32 gigabytes instead of 128
> megabytes, also shrinking the amount of file system overhead for
> metadata.
>
> The cost is decreased disk space efficiency. Directories will consume
> 1M, as will extent tree blocks. (I am on the fence as to whether I
> should add complexity so that in the rare case that an inode needs more
> than 344 extents --- a highly fragmented file indeed --- and needs a
> second extent tree block, we can avoid allocating a new cluster and
> instead use another block from the cluster used by the inode. The
> concern is the amount of complexity this adds to the e2fsck, not just to
> the kernel.)
>
> To test these patches, I have used an *extremely* kludgy set of patches
> to e2fsprogs, which are attached below. These patches need *extensive*
> revision before I would consider them clean enough for
> committing into e2fsprogs, but they were sufficient for me to do the
> kernel-side changes --- mke2fs, dumpe2fs, and debugfs work. E2fsck most
> definitely does _not_ work at this stage.
>
> Please comment! I do not intend for these patches to be merged during
> the 2.6.39 merge window. I am targeting 2.6.40, 3 months from now,
> since these patches are quite extensive.
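
The size figures in the quoted cover letter can be checked with a little arithmetic. The sketch below is only a sanity check; the helper names `bitmap_bytes` and `group_bytes` are made up for illustration, and it assumes one bitmap bit per allocation unit and one bitmap block per block group.

```c
#include <stdint.h>

/* Total bytes of block bitmap needed for a file system of fs_bytes,
 * with one bit per allocation unit (a block or a cluster). */
uint64_t bitmap_bytes(uint64_t fs_bytes, uint64_t unit_bytes)
{
    return fs_bytes / unit_bytes / 8;
}

/* Bytes addressed by one block group: its bitmap fills a single block,
 * so it holds block_bytes * 8 bits, each covering unit_bytes. */
uint64_t group_bytes(uint64_t block_bytes, uint64_t unit_bytes)
{
    return block_bytes * 8 * unit_bytes;
}
```

For a 2T file system with 4k blocks, `bitmap_bytes` gives 64 megabytes with one bit per block and 256k with one bit per 1 megabyte cluster. Likewise `group_bytes` gives 128 megabytes per group without clustering and 32 gigabytes with 1 megabyte clusters, which matches the numbers quoted above.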

Hi Ted,

Is it possible to open a git repo/branch for both the kernel and the
user space tools? Then others can follow along, help, or provide
performance numbers in time :-)

Thanks.

Coly

--
Coly Li