2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 00/23] New spin of the bigalloc patches

This version of the bigalloc patch set has a bunch of fixes so that
delayed allocation and i_blocks/quota accounting work correctly; these
were the major failures we had been seeing. There are still a few
warnings triggering in ext4_da_update_reserve_space():

EXT4-fs (sdb2): ext4_da_update_reserve_space: ino 13, used 1 with only 0 reserved data

caused by fsstress, but this should be much more stable. Thanks to
Aditya Kali for his yeoman work to fix this problem.

The next branch of the e2fsprogs tree also has a number of bigalloc fixes.
E2fsck should now be able to properly check bigalloc file systems, and
repair many of the more common bigalloc file system corruption issues.

- Ted

Aditya Kali (2):
ext4: Fix bigalloc quota accounting and i_blocks value
ext4: add some tracepoints in ext4/extents.c

Theodore Ts'o (21):
ext4: read-only support for bigalloc file systems
ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
ext4: convert instances of EXT4_BLOCKS_PER_GROUP to
EXT4_CLUSTERS_PER_GROUP
ext4: factor out block group accounting into functions
ext4: split out ext4_free_blocks_after_init()
ext4: bigalloc changes to block bitmap initialization functions
ext4: convert block group-relative offsets to use clusters
ext4: teach mballoc preallocation code about bigalloc clusters
ext4: teach ext4_free_blocks() about bigalloc and clusters
ext4: teach ext4_ext_map_blocks() about the bigalloc feature
ext4: teach ext4_ext_truncate() about the bigalloc feature
ext4: convert s_{dirty,free}blocks_counter to
s_{dirty,free}clusters_counter
ext4: convert the free_blocks field in s_flex_groups to be
free_clusters
ext4: teach ext4_statfs() to deal with clusters if bigalloc is
enabled
ext4: tune mballoc's default group prealloc size for bigalloc file
systems
ext4: enable mounting bigalloc as read/write
ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
ext4: rename ext4_free_blocks_after_init() to
ext4_free_clusters_after_init()
ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()
ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters()

fs/ext4/balloc.c | 345 +++++++++++++++----------
fs/ext4/ext4.h | 94 +++++--
fs/ext4/ext4_extents.h | 2 +
fs/ext4/extents.c | 607 ++++++++++++++++++++++++++++++++++++++++---
fs/ext4/ialloc.c | 73 +++---
fs/ext4/indirect.c | 7 +
fs/ext4/inode.c | 87 +++++--
fs/ext4/ioctl.c | 33 ++-
fs/ext4/mballoc.c | 290 +++++++++++++--------
fs/ext4/mballoc.h | 9 +-
fs/ext4/resize.c | 10 +-
fs/ext4/super.c | 151 ++++++++---
include/trace/events/ext4.h | 415 ++++++++++++++++++++++++++++-
13 files changed, 1677 insertions(+), 446 deletions(-)

--
1.7.4.1.22.gec8e1.dirty



2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 17/23] ext4: enable mounting bigalloc as read/write

Now that we have implemented all of the changes needed for bigalloc,
we can finally enable it!

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 78ac8ed..d79e773 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1424,7 +1424,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)

/*
* Default values for user and/or group using reserved blocks
--
1.7.4.1.22.gec8e1.dirty
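
[A minimal userspace sketch of how a ro-compat supported mask like the
one extended above gates mounting. The feature bit values here are
assumptions for illustration, not authoritative kernel constants:]

#include <stdio.h>

#define RO_COMPAT_HUGE_FILE	0x0008	/* bit values assumed */
#define RO_COMPAT_BIGALLOC	0x0200

/* The supported set now includes BIGALLOC, mirroring the patch. */
#define RO_COMPAT_SUPP		(RO_COMPAT_HUGE_FILE | RO_COMPAT_BIGALLOC)

int main(void)
{
	unsigned int fs_features = RO_COMPAT_BIGALLOC;	/* from the superblock */

	/* Any ro-compat bit outside the supported mask forces read-only. */
	if (fs_features & ~RO_COMPAT_SUPP)
		printf("unknown ro-compat feature: mount read-only\n");
	else
		printf("all ro-compat features supported: read/write ok\n");
	return 0;
}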


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 19/23] ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()

This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 11 ++++++-----
fs/ext4/ext4.h | 2 +-
fs/ext4/inode.c | 6 +++---
fs/ext4/super.c | 7 ++++---
4 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 84dc3d6..4f30380 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -524,12 +524,12 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
}

/**
- * ext4_count_free_blocks() -- count filesystem free blocks
+ * ext4_count_free_clusters() -- count filesystem free clusters
* @sb: superblock
*
- * Adds up the number of free blocks from each block group.
+ * Adds up the number of free clusters from each block group.
*/
-ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
+ext4_fsblk_t ext4_count_free_clusters(struct super_block *sb)
{
ext4_fsblk_t desc_count;
struct ext4_group_desc *gdp;
@@ -562,8 +562,9 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
bitmap_count += x;
}
brelse(bitmap_bh);
- printk(KERN_DEBUG "ext4_count_free_blocks: stored = %llu"
- ", computed = %llu, %llu\n", ext4_free_blocks_count(es),
+ printk(KERN_DEBUG "ext4_count_free_clusters: stored = %llu"
+ ", computed = %llu, %llu\n",
+ EXT4_B2C(sbi, ext4_free_blocks_count(es)),
desc_count, bitmap_count);
return bitmap_count;
#else
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 290b7e4..04bf654 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1760,7 +1760,7 @@ extern ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
int *errp);
extern int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
s64 nblocks, unsigned int flags);
-extern ext4_fsblk_t ext4_count_free_blocks(struct super_block *);
+extern ext4_fsblk_t ext4_count_free_clusters(struct super_block *);
extern void ext4_check_blocks_bitmap(struct super_block *);
extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
ext4_group_t block_group,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 513f4e6..1cbad80 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1327,7 +1327,8 @@ static void ext4_print_free_blocks(struct inode *inode)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
printk(KERN_CRIT "Total free blocks count %lld\n",
- ext4_count_free_blocks(inode->i_sb));
+ EXT4_C2B(EXT4_SB(inode->i_sb),
+ ext4_count_free_clusters(inode->i_sb)));
printk(KERN_CRIT "Free/Dirty block details\n");
printk(KERN_CRIT "free_blocks=%lld\n",
(long long) EXT4_C2B(EXT4_SB(inode->i_sb),
@@ -1413,8 +1414,7 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
if (err == -EAGAIN)
goto submit_io;

- if (err == -ENOSPC &&
- ext4_count_free_blocks(sb)) {
+ if (err == -ENOSPC && ext4_count_free_clusters(sb)) {
mpd->retval = err;
goto submit_io;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3f02afa..1d7dc89 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2113,7 +2113,8 @@ static int ext4_check_descriptors(struct super_block *sb,
if (NULL != first_not_zeroed)
*first_not_zeroed = grp;

- ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
+ ext4_free_blocks_count_set(sbi->s_es,
+ EXT4_C2B(sbi, ext4_count_free_clusters(sb)));
sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
return 1;
}
@@ -3503,7 +3504,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_err_report.data = (unsigned long) sb;

err = percpu_counter_init(&sbi->s_freeclusters_counter,
- ext4_count_free_blocks(sb));
+ ext4_count_free_clusters(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext4_count_free_inodes(sb));
@@ -3629,7 +3630,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
* need to update the global counters.
*/
percpu_counter_set(&sbi->s_freeclusters_counter,
- ext4_count_free_blocks(sb));
+ ext4_count_free_clusters(sb));
percpu_counter_set(&sbi->s_freeinodes_counter,
ext4_count_free_inodes(sb));
percpu_counter_set(&sbi->s_dirs_counter,
--
1.7.4.1.22.gec8e1.dirty
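
[For readers following along, a minimal userspace sketch of the
per-group summing pattern that ext4_count_free_clusters() implements.
The toy_group_desc struct and its values are invented; the real
function reads the on-disk group descriptors and cross-checks against
the bitmaps when debugging is enabled:]

#include <stdio.h>

struct toy_group_desc {
	unsigned long long free_clusters;	/* hypothetical field name */
};

static unsigned long long count_free_clusters(const struct toy_group_desc *gdp,
					      unsigned int ngroups)
{
	unsigned long long desc_count = 0;
	unsigned int i;

	for (i = 0; i < ngroups; i++)		/* one descriptor per group */
		desc_count += gdp[i].free_clusters;
	return desc_count;
}

int main(void)
{
	struct toy_group_desc groups[] = { { 100 }, { 42 }, { 7 } };

	printf("free clusters: %llu\n", count_free_clusters(groups, 3));
	return 0;
}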


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 07/23] ext4: convert block group-relative offsets to use clusters

Certain parts of the ext4 code base, primarily in mballoc.c, use a
block group number and offset from the beginning of the block group.
This offset is invariably used to index into the allocation bitmap, so
change the offset to be denominated in units of clusters.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 6 ++++--
fs/ext4/mballoc.h | 3 ++-
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1c6d777..89abf1f 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -28,7 +28,8 @@
*/

/*
- * Calculate the block group number and offset, given a block number
+ * Calculate the block group number and offset into the block/cluster
+ * allocation bitmap, given a block number
*/
void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
ext4_group_t *blockgrpp, ext4_grpblk_t *offsetp)
@@ -37,7 +38,8 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
ext4_grpblk_t offset;

blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
- offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb));
+ offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >>
+ EXT4_SB(sb)->s_cluster_bits;
if (offsetp)
*offsetp = offset;
if (blockgrpp)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 20b5e7b..4423c6f 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -217,6 +217,7 @@ struct ext4_buddy {
static inline ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb,
struct ext4_free_extent *fex)
{
- return ext4_group_first_block_no(sb, fex->fe_group) + fex->fe_start;
+ return ext4_group_first_block_no(sb, fex->fe_group) +
+ (fex->fe_start << EXT4_SB(sb)->s_cluster_bits);
}
#endif
--
1.7.4.1.22.gec8e1.dirty
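
[A standalone sketch of the arithmetic this patch changes in
ext4_get_group_no_and_offset(). The geometry below (32768 blocks per
group, 16-block clusters) is assumed for illustration:]

#include <stdio.h>

int main(void)
{
	unsigned long long blocknr = 100000;
	unsigned long long first_data_block = 0;
	unsigned long blocks_per_group = 32768;
	unsigned int cluster_bits = 4;		/* 16 blocks per cluster */

	unsigned long long rel = blocknr - first_data_block;
	unsigned long group = (unsigned long)(rel / blocks_per_group);
	/* The group-relative block offset is shifted into cluster units
	 * so it can index the (cluster) allocation bitmap; the kernel
	 * uses do_div() here, plain division suffices for the sketch. */
	unsigned long offset = (unsigned long)
		((rel % blocks_per_group) >> cluster_bits);

	printf("block %llu -> group %lu, cluster offset %lu\n",
	       blocknr, group, offset);		/* group 3, offset 106 */
	return 0;
}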


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 05/23] ext4: split out ext4_free_blocks_after_init()

The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard to make changes
to support clusters. So as an initial cleanup, I've separated out the
functionality of initializing the block bitmap from calculating the number
of free blocks in the new block group.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 105 ++++++++++++++++++++++++++---------------------------
fs/ext4/ext4.h | 13 ++++---
fs/ext4/ialloc.c | 20 ++++-------
3 files changed, 66 insertions(+), 72 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 8573e2b..735d9fc 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -102,74 +102,73 @@ static unsigned int num_blocks_in_group(struct super_block *sb,
return EXT4_BLOCKS_PER_GROUP(sb);
}

-/* Initializes an uninitialized block bitmap if given, and returns the
- * number of blocks free in the group. */
-unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
- ext4_group_t block_group, struct ext4_group_desc *gdp)
+/* Initializes an uninitialized block bitmap */
+void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
{
unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
- ext4_group_t ngroups = ext4_get_groups_count(sb);
- unsigned group_blocks = num_blocks_in_group(sb, block_group);
struct ext4_sb_info *sbi = EXT4_SB(sb);
-
- if (bh) {
- J_ASSERT_BH(bh, buffer_locked(bh));
-
- /* If checksum is bad mark all blocks used to prevent allocation
- * essentially implementing a per-group read-only flag. */
- if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
- ext4_error(sb, "Checksum bad for group %u",
- block_group);
- ext4_free_blks_set(sb, gdp, 0);
- ext4_free_inodes_set(sb, gdp, 0);
- ext4_itable_unused_set(sb, gdp, 0);
- memset(bh->b_data, 0xff, sb->s_blocksize);
- return 0;
- }
- memset(bh->b_data, 0, sb->s_blocksize);
+ ext4_fsblk_t start, tmp;
+ int flex_bg = 0;
+
+ J_ASSERT_BH(bh, buffer_locked(bh));
+
+ /* If checksum is bad mark all blocks used to prevent allocation
+ * essentially implementing a per-group read-only flag. */
+ if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+ ext4_error(sb, "Checksum bad for group %u", block_group);
+ ext4_free_blks_set(sb, gdp, 0);
+ ext4_free_inodes_set(sb, gdp, 0);
+ ext4_itable_unused_set(sb, gdp, 0);
+ memset(bh->b_data, 0xff, sb->s_blocksize);
+ return;
}
+ memset(bh->b_data, 0, sb->s_blocksize);

- if (bh) {
- ext4_fsblk_t start, tmp;
- int flex_bg = 0;
+ for (bit = 0; bit < bit_max; bit++)
+ ext4_set_bit(bit, bh->b_data);

- for (bit = 0; bit < bit_max; bit++)
- ext4_set_bit(bit, bh->b_data);
+ start = ext4_group_first_block_no(sb, block_group);

- start = ext4_group_first_block_no(sb, block_group);
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG))
+ flex_bg = 1;

- if (EXT4_HAS_INCOMPAT_FEATURE(sb,
- EXT4_FEATURE_INCOMPAT_FLEX_BG))
- flex_bg = 1;
+ /* Set bits for block and inode bitmaps, and inode table */
+ tmp = ext4_block_bitmap(sb, gdp);
+ if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
+ ext4_set_bit(tmp - start, bh->b_data);

- /* Set bits for block and inode bitmaps, and inode table */
- tmp = ext4_block_bitmap(sb, gdp);
- if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ tmp = ext4_inode_bitmap(sb, gdp);
+ if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
+ ext4_set_bit(tmp - start, bh->b_data);

- tmp = ext4_inode_bitmap(sb, gdp);
+ tmp = ext4_inode_table(sb, gdp);
+ for (; tmp < ext4_inode_table(sb, gdp) +
+ sbi->s_itb_per_group; tmp++) {
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
-
- tmp = ext4_inode_table(sb, gdp);
- for (; tmp < ext4_inode_table(sb, gdp) +
- sbi->s_itb_per_group; tmp++) {
- if (!flex_bg ||
- ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
- }
- /*
- * Also if the number of blocks within the group is
- * less than the blocksize * 8 ( which is the size
- * of bitmap ), set rest of the block bitmap to 1
- */
- ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
- bh->b_data);
}
- return group_blocks - bit_max -
- ext4_group_used_meta_blocks(sb, block_group, gdp);
+ /*
+ * Also if the number of blocks within the group is less than
+ * the blocksize * 8 ( which is the size of bitmap ), set rest
+ * of the block bitmap to 1
+ */
+ ext4_mark_bitmap_end(num_blocks_in_group(sb, block_group),
+ sb->s_blocksize * 8, bh->b_data);
}

+/* Return the number of free blocks in a block group. It is used when
+ * the block bitmap is uninitialized, so we can't just count the bits
+ * in the bitmap. */
+unsigned ext4_free_blocks_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
+{
+ return num_blocks_in_group(sb, block_group) -
+ num_base_meta_blocks(sb, block_group) -
+ ext4_group_used_meta_blocks(sb, block_group, gdp);
+}

/*
* The free blocks are managed by bitmaps. A file system contains several
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03a0857..db090e7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1748,12 +1748,13 @@ extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
struct buffer_head *ext4_read_block_bitmap(struct super_block *sb,
ext4_group_t block_group);
-extern unsigned ext4_init_block_bitmap(struct super_block *sb,
- struct buffer_head *bh,
- ext4_group_t group,
- struct ext4_group_desc *desc);
-#define ext4_free_blocks_after_init(sb, group, desc) \
- ext4_init_block_bitmap(sb, NULL, group, desc)
+extern void ext4_init_block_bitmap(struct super_block *sb,
+ struct buffer_head *bh,
+ ext4_group_t group,
+ struct ext4_group_desc *desc);
+extern unsigned ext4_free_blocks_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp);
ext4_fsblk_t ext4_inode_to_goal_block(struct inode *);

/* dir.c */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 21bb2f6..c176cf4 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -816,7 +816,6 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode,
int ret2, err = 0;
struct inode *ret;
ext4_group_t i;
- int free = 0;
static int once = 1;
ext4_group_t flex_group;

@@ -950,26 +949,21 @@ got:
goto fail;
}

- free = 0;
- ext4_lock_group(sb, group);
+ BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
+ err = ext4_handle_dirty_metadata(handle, NULL, block_bitmap_bh);
+ brelse(block_bitmap_bh);
+
/* recheck and clear flag under lock if we still need to */
+ ext4_lock_group(sb, group);
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
- free = ext4_free_blocks_after_init(sb, group, gdp);
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_blks_set(sb, gdp, free);
+ ext4_free_blks_set(sb, gdp,
+ ext4_free_blocks_after_init(sb, group, gdp));
gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
gdp);
}
ext4_unlock_group(sb, group);

- /* Don't need to dirty bitmap block if we didn't change it */
- if (free) {
- BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
- err = ext4_handle_dirty_metadata(handle,
- NULL, block_bitmap_bh);
- }
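
[A schematic of the split this commit makes, with toy types in place of
the ext4 structures: initialization now only touches the bitmap, and
the free count is pure arithmetic that needs no buffer at all, so the
old "pass NULL for bh" convention disappears:]

#include <stdio.h>
#include <string.h>

struct toy_bitmap { unsigned char data[64]; };

/* After the split: initialization only marks the metadata bits... */
static void init_bitmap(struct toy_bitmap *b, unsigned int meta_bits)
{
	unsigned int i;

	memset(b->data, 0, sizeof(b->data));
	for (i = 0; i < meta_bits; i++)
		b->data[i / 8] |= 1u << (i % 8);
}

/* ...and the free count no longer needs a bitmap at all. */
static unsigned int free_after_init(unsigned int group_blocks,
				    unsigned int meta_blocks)
{
	return group_blocks - meta_blocks;
}

int main(void)
{
	struct toy_bitmap bm;

	init_bitmap(&bm, 3);
	printf("free blocks: %u\n", free_after_init(128, 3));	/* 125 */
	return 0;
}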

2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 06/23] ext4: bigalloc changes to block bitmap initialization functions

Add bigalloc support to ext4_init_block_bitmap() and
ext4_free_blocks_after_init().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 131 +++++++++++++++++++++++++++++++++++++-----------------
fs/ext4/ext4.h | 13 +++++
2 files changed, 103 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 735d9fc..1c6d777 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -23,9 +23,6 @@

#include <trace/events/ext4.h>

-static unsigned int num_base_meta_blocks(struct super_block *sb,
- ext4_group_t block_group);
-
/*
* balloc.c contains the blocks allocation and deallocation routines
*/
@@ -58,37 +55,87 @@ static int ext4_block_in_group(struct super_block *sb, ext4_fsblk_t block,
return 0;
}

-static int ext4_group_used_meta_blocks(struct super_block *sb,
- ext4_group_t block_group,
- struct ext4_group_desc *gdp)
+/* Return the number of clusters used for file system metadata; this
+ * represents the overhead needed by the file system.
+ */
+unsigned ext4_num_overhead_clusters(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
{
- ext4_fsblk_t tmp;
+ unsigned num_clusters;
+ int block_cluster = -1, inode_cluster = -1, itbl_cluster = -1, i, c;
+ ext4_fsblk_t start = ext4_group_first_block_no(sb, block_group);
+ ext4_fsblk_t itbl_blk;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- /* block bitmap, inode bitmap, and inode table blocks */
- int used_blocks = sbi->s_itb_per_group + 2;

- if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
- if (!ext4_block_in_group(sb, ext4_block_bitmap(sb, gdp),
- block_group))
- used_blocks--;
-
- if (!ext4_block_in_group(sb, ext4_inode_bitmap(sb, gdp),
- block_group))
- used_blocks--;
-
- tmp = ext4_inode_table(sb, gdp);
- for (; tmp < ext4_inode_table(sb, gdp) +
- sbi->s_itb_per_group; tmp++) {
- if (!ext4_block_in_group(sb, tmp, block_group))
- used_blocks -= 1;
+ /* This is the number of clusters used by the superblock,
+ * block group descriptors, and reserved block group
+ * descriptor blocks */
+ num_clusters = ext4_num_base_meta_clusters(sb, block_group);
+
+ /*
+ * For the allocation bitmaps and inode table, we first need
+ * to check to see if the block is in the block group. If it
+ * is, then check to see if the cluster is already accounted
+ * for in the clusters used for the base metadata cluster, or
+ * if we can increment the base metadata cluster to include
+ * that block. Otherwise, we will have to track the cluster
+ * used for the allocation bitmap or inode table explicitly.
+ * Normally all of these blocks are contiguous, so the special
+ * case handling shouldn't be necessary except for *very*
+ * unusual file system layouts.
+ */
+ if (ext4_block_in_group(sb, ext4_block_bitmap(sb, gdp), block_group)) {
+ block_cluster = EXT4_B2C(sbi, (start -
+ ext4_block_bitmap(sb, gdp)));
+ if (block_cluster < num_clusters)
+ block_cluster = -1;
+ else if (block_cluster == num_clusters) {
+ num_clusters++;
+ block_cluster = -1;
+ }
+ }
+
+ if (ext4_block_in_group(sb, ext4_inode_bitmap(sb, gdp), block_group)) {
+ inode_cluster = EXT4_B2C(sbi,
+ start - ext4_inode_bitmap(sb, gdp));
+ if (inode_cluster < num_clusters)
+ inode_cluster = -1;
+ else if (inode_cluster == num_clusters) {
+ num_clusters++;
+ inode_cluster = -1;
+ }
+ }
+
+ itbl_blk = ext4_inode_table(sb, gdp);
+ for (i = 0; i < sbi->s_itb_per_group; i++) {
+ if (ext4_block_in_group(sb, itbl_blk + i, block_group)) {
+ c = EXT4_B2C(sbi, start - itbl_blk + i);
+ if ((c < num_clusters) || (c == inode_cluster) ||
+ (c == block_cluster) || (c == itbl_cluster))
+ continue;
+ if (c == num_clusters) {
+ num_clusters++;
+ continue;
+ }
+ num_clusters++;
+ itbl_cluster = c;
}
}
- return used_blocks;
+
+ if (block_cluster != -1)
+ num_clusters++;
+ if (inode_cluster != -1)
+ num_clusters++;
+
+ return num_clusters;
}

-static unsigned int num_blocks_in_group(struct super_block *sb,
- ext4_group_t block_group)
+static unsigned int num_clusters_in_group(struct super_block *sb,
+ ext4_group_t block_group)
{
+ unsigned int blocks;
+
if (block_group == ext4_get_groups_count(sb) - 1) {
/*
* Even though mke2fs always initializes the first and
@@ -96,10 +143,11 @@ static unsigned int num_blocks_in_group(struct super_block *sb,
* we need to make sure we calculate the right free
* blocks.
*/
- return ext4_blocks_count(EXT4_SB(sb)->s_es) -
+ blocks = ext4_blocks_count(EXT4_SB(sb)->s_es) -
ext4_group_first_block_no(sb, block_group);
} else
- return EXT4_BLOCKS_PER_GROUP(sb);
+ blocks = EXT4_BLOCKS_PER_GROUP(sb);
+ return EXT4_NUM_B2C(EXT4_SB(sb), blocks);
}

/* Initializes an uninitialized block bitmap */
@@ -107,7 +155,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
- unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
+ unsigned int bit, bit_max;
struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_fsblk_t start, tmp;
int flex_bg = 0;
@@ -126,6 +174,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
}
memset(bh->b_data, 0, sb->s_blocksize);

+ bit_max = ext4_num_base_meta_clusters(sb, block_group);
for (bit = 0; bit < bit_max; bit++)
ext4_set_bit(bit, bh->b_data);

@@ -137,24 +186,25 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
/* Set bits for block and inode bitmaps, and inode table */
tmp = ext4_block_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);

tmp = ext4_inode_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);

tmp = ext4_inode_table(sb, gdp);
for (; tmp < ext4_inode_table(sb, gdp) +
sbi->s_itb_per_group; tmp++) {
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
- ext4_set_bit(tmp - start, bh->b_data);
+ ext4_set_bit(EXT4_B2C(sbi, tmp - start), bh->b_data);
}
+
/*
* Also if the number of blocks within the group is less than
* the blocksize * 8 ( which is the size of bitmap ), set rest
* of the block bitmap to 1
*/
- ext4_mark_bitmap_end(num_blocks_in_group(sb, block_group),
+ ext4_mark_bitmap_end(num_clusters_in_group(sb, block_group),
sb->s_blocksize * 8, bh->b_data);
}

@@ -165,9 +215,8 @@ unsigned ext4_free_blocks_after_init(struct super_block *sb,
ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
- return num_blocks_in_group(sb, block_group) -
- num_base_meta_blocks(sb, block_group) -
- ext4_group_used_meta_blocks(sb, block_group, gdp);
+ return num_clusters_in_group(sb, block_group) -
+ ext4_num_overhead_clusters(sb, block_group, gdp);
}

/*
@@ -611,14 +660,14 @@ unsigned long ext4_bg_num_gdb(struct super_block *sb, ext4_group_t group)
}

/*
- * This function returns the number of file system metadata blocks at
+ * This function returns the number of file system metadata clusters at
* the beginning of a block group, including the reserved gdt blocks.
*/
-static unsigned int num_base_meta_blocks(struct super_block *sb,
- ext4_group_t block_group)
+unsigned ext4_num_base_meta_clusters(struct super_block *sb,
+ ext4_group_t block_group)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);
- int num;
+ unsigned num;

/* Check for superblock and gdt backups in this group */
num = ext4_bg_has_super(sb, block_group);
@@ -633,7 +682,7 @@ static unsigned int num_base_meta_blocks(struct super_block *sb,
} else { /* For META_BG_BLOCK_GROUPS */
num += ext4_bg_num_gdb(sb, block_group);
}
- return num;
+ return EXT4_NUM_B2C(sbi, num);
}
/**
* ext4_inode_to_goal_block - return a hint for block allocation
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index db090e7..d404288 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -260,6 +260,14 @@ struct ext4_io_submit {
#endif
#define EXT4_BLOCK_ALIGN(size, blkbits) ALIGN((size), (1 << (blkbits)))

+/* Translate a block number to a cluster number */
+#define EXT4_B2C(sbi, blk) ((blk) >> (sbi)->s_cluster_bits)
+/* Translate a cluster number to a block number */
+#define EXT4_C2B(sbi, cluster) ((cluster) << (sbi)->s_cluster_bits)
+/* Translate # of blks to # of clusters */
+#define EXT4_NUM_B2C(sbi, blks) (((blks) + (sbi)->s_cluster_ratio - 1) >> \
+ (sbi)->s_cluster_bits)
+
/*
* Structure of a blocks group descriptor
*/
@@ -1755,6 +1763,11 @@ extern void ext4_init_block_bitmap(struct super_block *sb,
extern unsigned ext4_free_blocks_after_init(struct super_block *sb,
ext4_group_t block_group,
struct ext4_group_desc *gdp);
+extern unsigned ext4_num_base_meta_clusters(struct super_block *sb,
+ ext4_group_t block_group);
+extern unsigned ext4_num_overhead_clusters(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp);
ext4_fsblk_t ext4_inode_to_goal_block(struct inode *);

/* dir.c */
--
1.7.4.1.22.gec8e1.dirty
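
[The three conversion macros added to ext4.h deserve a worked example.
This standalone version assumes a 16-block cluster (s_cluster_bits = 4,
s_cluster_ratio = 16) and shows in particular that EXT4_NUM_B2C rounds
up, which is what makes a single metadata block cost a whole cluster:]

#include <stdio.h>

#define CLUSTER_BITS	4
#define CLUSTER_RATIO	(1 << CLUSTER_BITS)

#define B2C(blk)	((blk) >> CLUSTER_BITS)		/* block -> cluster number */
#define C2B(cluster)	((cluster) << CLUSTER_BITS)	/* cluster -> first block */
#define NUM_B2C(blks)	(((blks) + CLUSTER_RATIO - 1) >> CLUSTER_BITS)	/* round up */

int main(void)
{
	printf("B2C(100)    = %d\n", B2C(100));		/* 6  */
	printf("C2B(6)      = %d\n", C2B(6));		/* 96 */
	printf("NUM_B2C(1)  = %d\n", NUM_B2C(1));	/* 1: one block costs a cluster */
	printf("NUM_B2C(17) = %d\n", NUM_B2C(17));	/* 2 */
	return 0;
}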


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 03/23] ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP

Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP is
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems). There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/mballoc.c | 48 ++++++++++++++++++++++++------------------------
1 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 389386b..01dbee6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -651,7 +651,7 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
ext4_grpblk_t chunk;
unsigned short border;

- BUG_ON(len > EXT4_BLOCKS_PER_GROUP(sb));
+ BUG_ON(len > EXT4_CLUSTERS_PER_GROUP(sb));

border = 2 << sb->s_blocksize_bits;

@@ -703,7 +703,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
void *buddy, void *bitmap, ext4_group_t group)
{
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
- ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
ext4_grpblk_t i = 0;
ext4_grpblk_t first;
ext4_grpblk_t len;
@@ -1622,8 +1622,8 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac,
struct ext4_free_extent *gex = &ac->ac_g_ex;

BUG_ON(ex->fe_len <= 0);
- BUG_ON(ex->fe_len > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
- BUG_ON(ex->fe_start >= EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ex->fe_len > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
+ BUG_ON(ex->fe_start >= EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));
BUG_ON(ac->ac_status != AC_STATUS_CONTINUE);

ac->ac_found++;
@@ -1821,8 +1821,8 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,

while (free && ac->ac_status == AC_STATUS_CONTINUE) {
i = mb_find_next_zero_bit(bitmap,
- EXT4_BLOCKS_PER_GROUP(sb), i);
- if (i >= EXT4_BLOCKS_PER_GROUP(sb)) {
+ EXT4_CLUSTERS_PER_GROUP(sb), i);
+ if (i >= EXT4_CLUSTERS_PER_GROUP(sb)) {
/*
* IF we have corrupt bitmap, we won't find any
* free blocks even though group info says we
@@ -1885,7 +1885,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
do_div(a, sbi->s_stripe);
i = (a * sbi->s_stripe) - first_group_block;

- while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
+ while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
if (!mb_test_bit(i, bitmap)) {
max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
if (max >= sbi->s_stripe) {
@@ -3007,7 +3007,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
}
BUG_ON(start + size <= ac->ac_o_ex.fe_logical &&
start > ac->ac_o_ex.fe_logical);
- BUG_ON(size <= 0 || size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb));
+ BUG_ON(size <= 0 || size > EXT4_CLUSTERS_PER_GROUP(ac->ac_sb));

/* now prepare goal request */

@@ -3660,7 +3660,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
}

if (needed == 0)
- needed = EXT4_BLOCKS_PER_GROUP(sb) + 1;
+ needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;

INIT_LIST_HEAD(&list);
repeat:
@@ -3975,8 +3975,8 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
len = ar->len;

/* just a dirty hack to filter too big requests */
- if (len >= EXT4_BLOCKS_PER_GROUP(sb) - 10)
- len = EXT4_BLOCKS_PER_GROUP(sb) - 10;
+ if (len >= EXT4_CLUSTERS_PER_GROUP(sb) - 10)
+ len = EXT4_CLUSTERS_PER_GROUP(sb) - 10;

/* start searching from the goal */
goal = ar->goal;
@@ -4520,8 +4520,8 @@ do_more:
* Check to see if we are freeing blocks across a group
* boundary.
*/
- if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
- overflow = bit + count - EXT4_BLOCKS_PER_GROUP(sb);
+ if (bit + count > EXT4_CLUSTERS_PER_GROUP(sb)) {
+ overflow = bit + count - EXT4_CLUSTERS_PER_GROUP(sb);
count -= overflow;
}
bitmap_bh = ext4_read_block_bitmap(sb, block_group);
@@ -4890,7 +4890,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
struct ext4_group_info *grp;
ext4_group_t first_group, last_group;
ext4_group_t group, ngroups = ext4_get_groups_count(sb);
- ext4_grpblk_t cnt = 0, first_block, last_block;
+ ext4_grpblk_t cnt = 0, first_cluster, last_cluster;
uint64_t start, len, minlen, trimmed = 0;
ext4_fsblk_t first_data_blk =
le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
@@ -4900,7 +4900,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
len = range->len >> sb->s_blocksize_bits;
minlen = range->minlen >> sb->s_blocksize_bits;

- if (unlikely(minlen > EXT4_BLOCKS_PER_GROUP(sb)))
+ if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)))
return -EINVAL;
if (start < first_data_blk) {
len -= first_data_blk - start;
@@ -4909,11 +4909,11 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)

/* Determine first and last group to examine based on start and len */
ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start,
- &first_group, &first_block);
+ &first_group, &first_cluster);
ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) (start + len),
- &last_group, &last_block);
+ &last_group, &last_cluster);
last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group;
- last_block = EXT4_BLOCKS_PER_GROUP(sb);
+ last_cluster = EXT4_CLUSTERS_PER_GROUP(sb);

if (first_group > last_group)
return -EINVAL;
@@ -4933,20 +4933,20 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
* change it for the last group in which case start +
* len < EXT4_BLOCKS_PER_GROUP(sb).
*/
- if (first_block + len < EXT4_BLOCKS_PER_GROUP(sb))
- last_block = first_block + len;
- len -= last_block - first_block;
+ if (first_cluster + len < EXT4_CLUSTERS_PER_GROUP(sb))
+ last_cluster = first_cluster + len;
+ len -= last_cluster - first_cluster;

if (grp->bb_free >= minlen) {
- cnt = ext4_trim_all_free(sb, group, first_block,
- last_block, minlen);
+ cnt = ext4_trim_all_free(sb, group, first_cluster,
+ last_cluster, minlen);
if (cnt < 0) {
ret = cnt;
break;
}
}
trimmed += cnt;
- first_block = 0;
+ first_cluster = 0;
}
range->len = trimmed * sb->s_blocksize;

--
1.7.4.1.22.gec8e1.dirty
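
[The underlying invariant, sketched with assumed geometry: with
bigalloc the group bitmap has one bit per cluster, so every bound that
was EXT4_BLOCKS_PER_GROUP drops by the cluster ratio:]

#include <stdio.h>

int main(void)
{
	/* Assumed geometry: 4KiB blocks, 32768 blocks per group,
	 * 64KiB clusters (cluster_bits = 4). */
	unsigned int blocks_per_group = 32768;
	unsigned int cluster_bits = 4;
	unsigned int clusters_per_group = blocks_per_group >> cluster_bits;

	/* Every bitmap bit now covers a cluster, so bitmap scans, BUG_ON
	 * bounds, and extent lengths must be capped at this count. */
	printf("bits per group bitmap: %u\n", clusters_per_group); /* 2048 */
	return 0;
}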


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 12/23] ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter

Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_sb_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 10 +++++-----
fs/ext4/ext4.h | 5 +++--
fs/ext4/ialloc.c | 3 ++-
fs/ext4/inode.c | 18 ++++++++++--------
fs/ext4/mballoc.c | 12 +++++++-----
fs/ext4/resize.c | 4 ++--
fs/ext4/super.c | 29 +++++++++++++++--------------
7 files changed, 44 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 89abf1f..9080a85 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -414,16 +414,16 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
s64 nblocks, unsigned int flags)
{
s64 free_blocks, dirty_blocks, root_blocks;
- struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
- struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
+ struct percpu_counter *fcc = &sbi->s_freeclusters_counter;
+ struct percpu_counter *dbc = &sbi->s_dirtyclusters_counter;

- free_blocks = percpu_counter_read_positive(fbc);
+ free_blocks = percpu_counter_read_positive(fcc);
dirty_blocks = percpu_counter_read_positive(dbc);
root_blocks = ext4_r_blocks_count(sbi->s_es);

if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
- free_blocks = percpu_counter_sum_positive(fbc);
+ free_blocks = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
dirty_blocks = percpu_counter_sum_positive(dbc);
}
/* Check whether we have space after
@@ -449,7 +449,7 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
s64 nblocks, unsigned int flags)
{
if (ext4_has_free_blocks(sbi, nblocks, flags)) {
- percpu_counter_add(&sbi->s_dirtyblocks_counter, nblocks);
+ percpu_counter_add(&sbi->s_dirtyclusters_counter, nblocks);
return 0;
} else
return -ENOSPC;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e77df83..cabc9fd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -849,6 +849,7 @@ struct ext4_inode_info {
ext4_group_t i_last_alloc_group;

/* allocation reservation info for delalloc */
+ /* In case of bigalloc, these refer to clusters rather than blocks */
unsigned int i_reserved_data_blocks;
unsigned int i_reserved_meta_blocks;
unsigned int i_allocated_meta_blocks;
@@ -1133,10 +1134,10 @@ struct ext4_sb_info {
u32 s_hash_seed[4];
int s_def_hash_version;
int s_hash_unsigned; /* 3 if hash should be signed, 0 if not */
- struct percpu_counter s_freeblocks_counter;
+ struct percpu_counter s_freeclusters_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
- struct percpu_counter s_dirtyblocks_counter;
+ struct percpu_counter s_dirtyclusters_counter;
struct blockgroup_lock *s_blockgroup_lock;
struct proc_dir_entry *s_proc;
struct kobject s_kobj;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index c176cf4..31a4dc3 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -490,7 +490,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,

freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
- freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ freeb = EXT4_C2B(sbi,
+ percpu_counter_read_positive(&sbi->s_freeclusters_counter));
avefreeb = freeb;
do_div(avefreeb, ngroups);
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7d1e9e0..2c0150c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -251,7 +251,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
/* Update per-inode reservations */
ei->i_reserved_data_blocks -= used;
ei->i_reserved_meta_blocks -= ei->i_allocated_meta_blocks;
- percpu_counter_sub(&sbi->s_dirtyblocks_counter,
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter,
used + ei->i_allocated_meta_blocks);
ei->i_allocated_meta_blocks = 0;

@@ -261,7 +261,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
* only when we have written all of the delayed
* allocation blocks.
*/
- percpu_counter_sub(&sbi->s_dirtyblocks_counter,
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter,
ei->i_reserved_meta_blocks);
ei->i_reserved_meta_blocks = 0;
ei->i_da_metadata_calc_len = 0;
@@ -1086,14 +1086,14 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
* only when we have written all of the delayed
* allocation blocks.
*/
- percpu_counter_sub(&sbi->s_dirtyblocks_counter,
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter,
ei->i_reserved_meta_blocks);
ei->i_reserved_meta_blocks = 0;
ei->i_da_metadata_calc_len = 0;
}

/* update fs dirty data blocks counter */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, to_free);
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free);

spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);

@@ -1311,9 +1311,10 @@ static void ext4_print_free_blocks(struct inode *inode)
ext4_count_free_blocks(inode->i_sb));
printk(KERN_CRIT "Free/Dirty block details\n");
printk(KERN_CRIT "free_blocks=%lld\n",
- (long long) percpu_counter_sum(&sbi->s_freeblocks_counter));
+ (long long) EXT4_C2B(EXT4_SB(inode->i_sb),
+ percpu_counter_sum(&sbi->s_freeclusters_counter)));
printk(KERN_CRIT "dirty_blocks=%lld\n",
- (long long) percpu_counter_sum(&sbi->s_dirtyblocks_counter));
+ (long long) percpu_counter_sum(&sbi->s_dirtyclusters_counter));
printk(KERN_CRIT "Block reservation details\n");
printk(KERN_CRIT "i_reserved_data_blocks=%u\n",
EXT4_I(inode)->i_reserved_data_blocks);
@@ -2185,8 +2186,9 @@ static int ext4_nonda_switch(struct super_block *sb)
* Delalloc need an accurate free block accounting. So switch
* to non delalloc when we are near to error range.
*/
- free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- dirty_blocks = percpu_counter_read_positive(&sbi->s_dirtyblocks_counter);
+ free_blocks = EXT4_C2B(sbi,
+ percpu_counter_read_positive(&sbi->s_freeclusters_counter));
+ dirty_blocks = percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
if (2 * free_blocks < 3 * dirty_blocks ||
free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
/*
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 635cc9f..a0f92c0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2804,13 +2804,14 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);

ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
- percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
+ percpu_counter_sub(&sbi->s_freeclusters_counter, ac->ac_b_ex.fe_len);
/*
* Now reduce the dirty block count also. Should not go negative
*/
if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
/* release all the reserved blocks if non delalloc */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_clstrs);
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter,
+ reserv_clstrs);

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi,
@@ -4352,7 +4353,7 @@ out:
if (!ext4_test_inode_state(ar->inode,
EXT4_STATE_DELALLOC_RESERVED))
/* release all the reserved blocks if non delalloc */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter,
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter,
reserv_clstrs);
}

@@ -4659,7 +4660,7 @@ do_more:
ext4_free_blks_set(sb, gdp, ret);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
- percpu_counter_add(&sbi->s_freeblocks_counter, count);
+ percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
@@ -4788,7 +4789,8 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
ext4_free_blks_set(sb, desc, blk_free_count);
desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
ext4_unlock_group(sb, block_group);
- percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
+ percpu_counter_add(&sbi->s_freeclusters_counter,
+ EXT4_B2C(sbi, blocks_freed));

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 80bbc9c..405c1cb 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -919,8 +919,8 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
input->reserved_blocks);

/* Update the free space counts */
- percpu_counter_add(&sbi->s_freeblocks_counter,
- input->free_blocks_count);
+ percpu_counter_add(&sbi->s_freeclusters_counter,
+ EXT4_B2C(sbi, input->free_blocks_count));
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT4_INODES_PER_GROUP(sb));

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 231aac2..307e453 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -794,10 +794,10 @@ static void ext4_put_super(struct super_block *sb)
vfree(sbi->s_flex_groups);
else
kfree(sbi->s_flex_groups);
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ percpu_counter_destroy(&sbi->s_freeclusters_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
- percpu_counter_destroy(&sbi->s_dirtyblocks_counter);
+ percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
brelse(sbi->s_sbh);
#ifdef CONFIG_QUOTA
for (i = 0; i < MAXQUOTAS; i++)
@@ -2425,7 +2425,7 @@ static ssize_t delayed_allocation_blocks_show(struct ext4_attr *a,
char *buf)
{
return snprintf(buf, PAGE_SIZE, "%llu\n",
- (s64) percpu_counter_sum(&sbi->s_dirtyblocks_counter));
+ (s64) percpu_counter_sum(&sbi->s_dirtyclusters_counter));
}

static ssize_t session_write_kbytes_show(struct ext4_attr *a,
@@ -3501,7 +3501,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_err_report.function = print_daily_error_info;
sbi->s_err_report.data = (unsigned long) sb;

- err = percpu_counter_init(&sbi->s_freeblocks_counter,
+ err = percpu_counter_init(&sbi->s_freeclusters_counter,
ext4_count_free_blocks(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
@@ -3512,7 +3512,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
ext4_count_dirs(sb));
}
if (!err) {
- err = percpu_counter_init(&sbi->s_dirtyblocks_counter, 0);
+ err = percpu_counter_init(&sbi->s_dirtyclusters_counter, 0);
}
if (err) {
ext4_msg(sb, KERN_ERR, "insufficient memory");
@@ -3627,13 +3627,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
* The journal may have updated the bg summary counts, so we
* need to update the global counters.
*/
- percpu_counter_set(&sbi->s_freeblocks_counter,
+ percpu_counter_set(&sbi->s_freeclusters_counter,
ext4_count_free_blocks(sb));
percpu_counter_set(&sbi->s_freeinodes_counter,
ext4_count_free_inodes(sb));
percpu_counter_set(&sbi->s_dirs_counter,
ext4_count_dirs(sb));
- percpu_counter_set(&sbi->s_dirtyblocks_counter, 0);
+ percpu_counter_set(&sbi->s_dirtyclusters_counter, 0);

no_journal:
/*
@@ -3796,10 +3796,10 @@ failed_mount3:
else
kfree(sbi->s_flex_groups);
}
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ percpu_counter_destroy(&sbi->s_freeclusters_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
- percpu_counter_destroy(&sbi->s_dirtyblocks_counter);
+ percpu_counter_destroy(&sbi->s_dirtyclusters_counter);
if (sbi->s_mmp_tsk)
kthread_stop(sbi->s_mmp_tsk);
failed_mount2:
@@ -4122,8 +4122,9 @@ static int ext4_commit_super(struct super_block *sb, int sync)
else
es->s_kbytes_written =
cpu_to_le64(EXT4_SB(sb)->s_kbytes_written);
- ext4_free_blocks_count_set(es, percpu_counter_sum_positive(
- &EXT4_SB(sb)->s_freeblocks_counter));
+ ext4_free_blocks_count_set(es,
+ EXT4_C2B(EXT4_SB(sb), percpu_counter_sum_positive(
+ &EXT4_SB(sb)->s_freeclusters_counter)));
es->s_free_inodes_count =
cpu_to_le32(percpu_counter_sum_positive(
&EXT4_SB(sb)->s_freeinodes_counter));
@@ -4578,10 +4579,10 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_type = EXT4_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
- bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter) -
- percpu_counter_sum_positive(&sbi->s_dirtyblocks_counter);
+ bfree = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
+ percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
/* prevent underflow in case that few free space is available */
- buf->f_bfree = max_t(s64, bfree, 0);
+ buf->f_bfree = EXT4_C2B(sbi, max_t(s64, bfree, 0));
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
--
1.7.4.1.22.gec8e1.dirty
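
[A minimal sketch of the unit discipline this patch establishes, with a
plain variable standing in for the percpu counter and an assumed
CLUSTER_BITS: the counters live in cluster units, and only reporting
boundaries such as ext4_statfs() and ext4_commit_super() convert back
to blocks with EXT4_C2B:]

#include <stdio.h>

#define CLUSTER_BITS 4	/* assumed cluster size: 16 blocks */

/* Stand-in for s_freeclusters_counter; kept in cluster units. */
static long long free_clusters;

static void free_extent(long long nclusters)
{
	free_clusters += nclusters;	/* allocator paths work in clusters */
}

static long long statfs_free_blocks(void)
{
	/* ...and the reporting edge converts back, like EXT4_C2B(). */
	return free_clusters << CLUSTER_BITS;
}

int main(void)
{
	free_extent(10);
	printf("free blocks reported: %lld\n", statfs_free_blocks()); /* 160 */
	return 0;
}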


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 14/23] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 37 ++++++++++++++++++++++++-------------
1 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a803d65..7a4d54e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4529,16 +4529,34 @@ restore_opts:
return err;
}

+/*
+ * Note: calculating the overhead so we can be compatible with
+ * historical BSD practice is quite difficult in the face of
+ * clusters/bigalloc. This is because multiple metadata blocks from
+ * different block group can end up in the same allocation cluster.
+ * Calculating the exact overhead in the face of clustered allocation
+ * requires either O(all block bitmaps) in memory or O(number of block
+ * groups**2) in time. We will still calculate the superblock for
+ * older file systems --- and if we come across with a bigalloc file
+ * system with zero in s_overhead_clusters the estimate will be close to
+ * correct especially for very large cluster sizes --- but for newer
+ * file systems, it's better to calculate this figure once at mkfs
+ * time, and store it in the superblock. If the superblock value is
+ * present (even for non-bigalloc file systems), we will use it.
+ */
static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct super_block *sb = dentry->d_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
+ struct ext4_group_desc *gdp;
u64 fsid;
s64 bfree;

if (test_opt(sb, MINIX_DF)) {
sbi->s_overhead_last = 0;
+ } else if (es->s_overhead_clusters) {
+ sbi->s_overhead_last = le32_to_cpu(es->s_overhead_clusters);
} else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
ext4_fsblk_t overhead = 0;
@@ -4553,24 +4571,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
* All of the blocks before first_data_block are
* overhead
*/
- overhead = le32_to_cpu(es->s_first_data_block);
+ overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));

/*
- * Add the overhead attributed to the superblock and
- * block group descriptors. If the sparse superblocks
- * feature is turned on, then not all groups have this.
+ * Add the overhead found in each block group
*/
for (i = 0; i < ngroups; i++) {
- overhead += ext4_bg_has_super(sb, i) +
- ext4_bg_num_gdb(sb, i);
+ gdp = ext4_get_group_desc(sb, i, NULL);
+ overhead += ext4_num_overhead_clusters(sb, i, gdp);
cond_resched();
}
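
[The decision structure added above can be sketched standalone, with
toy fields and made-up per-group values rather than the real on-disk
layout: prefer the mkfs-computed s_overhead_clusters when present, and
fall back to summing the per-group estimate otherwise:]

#include <stdio.h>

struct toy_super {
	unsigned int s_overhead_clusters;	/* 0 => not precomputed */
	unsigned int ngroups;
};

/* Stand-in for ext4_num_overhead_clusters(); toy per-group values. */
static unsigned int group_overhead(unsigned int group)
{
	return group == 0 ? 3 : 1;
}

static unsigned int fs_overhead(const struct toy_super *es)
{
	unsigned int i, overhead = 0;

	if (es->s_overhead_clusters)	/* mkfs stored the exact figure */
		return es->s_overhead_clusters;
	for (i = 0; i < es->ngroups; i++)	/* otherwise, estimate */
		overhead += group_overhead(i);
	return overhead;
}

int main(void)
{
	struct toy_super es = { 0, 4 };

	printf("overhead = %u clusters\n", fs_overhead(&es)); /* 3+1+1+1 = 6 */
	return 0;
}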

2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 13/23] ext4: convert the free_blocks field in s_flex_groups to be free_clusters

Convert the free_blocks field in struct flex_groups to be
free_clusters, to make the final bigalloc changes easier to read and
understand.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 +-
fs/ext4/ialloc.c | 40 +++++++++++++++++++++-------------------
fs/ext4/mballoc.c | 9 +++++----
fs/ext4/resize.c | 4 ++--
fs/ext4/super.c | 2 +-
5 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cabc9fd..ca5ab11 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -299,7 +299,7 @@ struct ext4_group_desc

struct flex_groups {
atomic_t free_inodes;
- atomic_t free_blocks;
+ atomic_t free_clusters;
atomic_t used_dirs;
};

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 31a4dc3..048ee7f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -346,7 +346,7 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
int flex_size = ext4_flex_bg_size(sbi);
ext4_group_t best_flex = parent_fbg_group;
int blocks_per_flex = sbi->s_blocks_per_group * flex_size;
- int flexbg_free_blocks;
+ int flexbg_free_clusters;
int flex_freeb_ratio;
ext4_group_t n_fbg_groups;
ext4_group_t i;
@@ -355,8 +355,9 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
sbi->s_log_groups_per_flex;

find_close_to_parent:
- flexbg_free_blocks = atomic_read(&flex_group[best_flex].free_blocks);
- flex_freeb_ratio = flexbg_free_blocks * 100 / blocks_per_flex;
+ flexbg_free_clusters = atomic_read(&flex_group[best_flex].free_clusters);
+ flex_freeb_ratio = EXT4_C2B(sbi, flexbg_free_clusters) * 100 /
+ blocks_per_flex;
if (atomic_read(&flex_group[best_flex].free_inodes) &&
flex_freeb_ratio > free_block_ratio)
goto found_flexbg;
@@ -370,8 +371,9 @@ find_close_to_parent:
if (i == parent_fbg_group || i == parent_fbg_group - 1)
continue;

- flexbg_free_blocks = atomic_read(&flex_group[i].free_blocks);
- flex_freeb_ratio = flexbg_free_blocks * 100 / blocks_per_flex;
+ flexbg_free_clusters = atomic_read(&flex_group[i].free_clusters);
+ flex_freeb_ratio = EXT4_C2B(sbi, flexbg_free_clusters) * 100 /
+ blocks_per_flex;

if (flex_freeb_ratio > free_block_ratio &&
(atomic_read(&flex_group[i].free_inodes))) {
@@ -380,14 +382,14 @@ find_close_to_parent:
}

if ((atomic_read(&flex_group[best_flex].free_inodes) == 0) ||
- ((atomic_read(&flex_group[i].free_blocks) >
- atomic_read(&flex_group[best_flex].free_blocks)) &&
+ ((atomic_read(&flex_group[i].free_clusters) >
+ atomic_read(&flex_group[best_flex].free_clusters)) &&
atomic_read(&flex_group[i].free_inodes)))
best_flex = i;
}

if (!atomic_read(&flex_group[best_flex].free_inodes) ||
- !atomic_read(&flex_group[best_flex].free_blocks))
+ !atomic_read(&flex_group[best_flex].free_clusters))
return -1;

found_flexbg:
@@ -407,7 +409,7 @@ out:

struct orlov_stats {
__u32 free_inodes;
- __u32 free_blocks;
+ __u32 free_clusters;
__u32 used_dirs;
};

@@ -424,7 +426,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g,

if (flex_size > 1) {
stats->free_inodes = atomic_read(&flex_group[g].free_inodes);
- stats->free_blocks = atomic_read(&flex_group[g].free_blocks);
+ stats->free_clusters = atomic_read(&flex_group[g].free_clusters);
stats->used_dirs = atomic_read(&flex_group[g].used_dirs);
return;
}
@@ -432,11 +434,11 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g,
desc = ext4_get_group_desc(sb, g, NULL);
if (desc) {
stats->free_inodes = ext4_free_inodes_count(sb, desc);
- stats->free_blocks = ext4_free_blks_count(sb, desc);
+ stats->free_clusters = ext4_free_blks_count(sb, desc);
stats->used_dirs = ext4_used_dirs_count(sb, desc);
} else {
stats->free_inodes = 0;
- stats->free_blocks = 0;
+ stats->free_clusters = 0;
stats->used_dirs = 0;
}
}
@@ -471,10 +473,10 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
ext4_group_t real_ngroups = ext4_get_groups_count(sb);
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
unsigned int freei, avefreei;
- ext4_fsblk_t freeb, avefreeb;
+ ext4_fsblk_t freeb, avefreec;
unsigned int ndirs;
int max_dirs, min_inodes;
- ext4_grpblk_t min_blocks;
+ ext4_grpblk_t min_clusters;
ext4_group_t i, grp, g, ngroups;
struct ext4_group_desc *desc;
struct orlov_stats stats;
@@ -492,8 +494,8 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
avefreei = freei / ngroups;
freeb = EXT4_C2B(sbi,
percpu_counter_read_positive(&sbi->s_freeclusters_counter));
- avefreeb = freeb;
- do_div(avefreeb, ngroups);
+ avefreec = freeb;
+ do_div(avefreec, ngroups);
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

if (S_ISDIR(mode) &&
@@ -519,7 +521,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
continue;
if (stats.free_inodes < avefreei)
continue;
- if (stats.free_blocks < avefreeb)
+ if (stats.free_clusters < avefreec)
continue;
grp = g;
ret = 0;
@@ -557,7 +559,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
min_inodes = avefreei - inodes_per_group*flex_size / 4;
if (min_inodes < 1)
min_inodes = 1;
- min_blocks = avefreeb - EXT4_BLOCKS_PER_GROUP(sb)*flex_size / 4;
+ min_clusters = avefreec - EXT4_CLUSTERS_PER_GROUP(sb)*flex_size / 4;

/*
* Start looking in the flex group where we last allocated an
@@ -576,7 +578,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
continue;
if (stats.free_inodes < min_inodes)
continue;
- if (stats.free_blocks < min_blocks)
+ if (stats.free_clusters < min_clusters)
continue;
goto found_flex_bg;
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a0f92c0..48172a1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2817,7 +2817,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
ext4_group_t flex_group = ext4_flex_group(sbi,
ac->ac_b_ex.fe_group);
atomic_sub(ac->ac_b_ex.fe_len,
- &sbi->s_flex_groups[flex_group].free_blocks);
+ &sbi->s_flex_groups[flex_group].free_clusters);
}

err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
@@ -4664,7 +4664,8 @@ do_more:

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
- atomic_add(count, &sbi->s_flex_groups[flex_group].free_blocks);
+ atomic_add(count_clusters,
+ &sbi->s_flex_groups[flex_group].free_clusters);
}

ext4_mb_unload_buddy(&e4b);
@@ -4794,8 +4795,8 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
- atomic_add(blocks_freed,
- &sbi->s_flex_groups[flex_group].free_blocks);
+ atomic_add(EXT4_B2C(sbi, blocks_freed),
+ &sbi->s_flex_groups[flex_group].free_clusters);
}

ext4_mb_unload_buddy(&e4b);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 405c1cb..cef49d3 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -928,8 +928,8 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
sbi->s_log_groups_per_flex) {
ext4_group_t flex_group;
flex_group = ext4_flex_group(sbi, input->group);
- atomic_add(input->free_blocks_count,
- &sbi->s_flex_groups[flex_group].free_blocks);
+ atomic_add(EXT4_B2C(sbi, input->free_blocks_count),
+ &sbi->s_flex_groups[flex_group].free_clusters);
atomic_add(EXT4_INODES_PER_GROUP(sb),
&sbi->s_flex_groups[flex_group].free_inodes);
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 307e453..a803d65 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1995,7 +1995,7 @@ static int ext4_fill_flex_info(struct super_block *sb)
atomic_add(ext4_free_inodes_count(sb, gdp),
&sbi->s_flex_groups[flex_group].free_inodes);
atomic_add(ext4_free_blks_count(sb, gdp),
- &sbi->s_flex_groups[flex_group].free_blocks);
+ &sbi->s_flex_groups[flex_group].free_clusters);
atomic_add(ext4_used_dirs_count(sb, gdp),
&sbi->s_flex_groups[flex_group].used_dirs);
}
--
1.7.4.1.22.gec8e1.dirty
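
[One subtlety in find_group_flex() after this change, shown with
invented numbers: the flex-group counter is now in clusters, so it is
converted back to blocks (EXT4_C2B) before computing a percentage
against the block-denominated blocks_per_flex:]

#include <stdio.h>

#define CLUSTER_BITS 4	/* assumed: 16 blocks per cluster */

int main(void)
{
	int blocks_per_flex = 16 * 32768;	/* 16 groups, 32768 blocks each */
	int flexbg_free_clusters = 8192;	/* counter is in clusters now */

	/* Shift back to blocks before taking the percentage, as the
	 * patched ratio computation does. */
	int flex_freeb_ratio =
		(flexbg_free_clusters << CLUSTER_BITS) * 100 / blocks_per_flex;

	printf("flex group is %d%% free\n", flex_freeb_ratio);	/* 25 */
	return 0;
}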


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 04/23] ext4: factor out block group accounting into functions

This makes it easier to understand how ext4_init_block_bitmap() works,
and it will assist when we split out ext4_free_blocks_after_init() in
the next commit.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 80 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 48 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index f8224ad..8573e2b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -23,6 +23,9 @@

#include <trace/events/ext4.h>

+static unsigned int num_base_meta_blocks(struct super_block *sb,
+ ext4_group_t block_group);
+
/*
* balloc.c contains the blocks allocation and deallocation routines
*/
@@ -83,14 +86,30 @@ static int ext4_group_used_meta_blocks(struct super_block *sb,
return used_blocks;
}

+static unsigned int num_blocks_in_group(struct super_block *sb,
+ ext4_group_t block_group)
+{
+ if (block_group == ext4_get_groups_count(sb) - 1) {
+ /*
+ * Even though mke2fs always initializes the first and
+ * last group, just in case some other tool was used,
+ * we need to make sure we calculate the right free
+ * blocks.
+ */
+ return ext4_blocks_count(EXT4_SB(sb)->s_es) -
+ ext4_group_first_block_no(sb, block_group);
+ } else
+ return EXT4_BLOCKS_PER_GROUP(sb);
+}
+
/* Initializes an uninitialized block bitmap if given, and returns the
* number of blocks free in the group. */
unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group, struct ext4_group_desc *gdp)
{
- int bit, bit_max;
+ unsigned int bit, bit_max = num_base_meta_blocks(sb, block_group);
ext4_group_t ngroups = ext4_get_groups_count(sb);
- unsigned free_blocks, group_blocks;
+ unsigned group_blocks = num_blocks_in_group(sb, block_group);
struct ext4_sb_info *sbi = EXT4_SB(sb);

if (bh) {
@@ -110,35 +129,6 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
memset(bh->b_data, 0, sb->s_blocksize);
}

- /* Check for superblock and gdt backups in this group */
- bit_max = ext4_bg_has_super(sb, block_group);
-
- if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
- block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
- sbi->s_desc_per_block) {
- if (bit_max) {
- bit_max += ext4_bg_num_gdb(sb, block_group);
- bit_max +=
- le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
- }
- } else { /* For META_BG_BLOCK_GROUPS */
- bit_max += ext4_bg_num_gdb(sb, block_group);
- }
-
- if (block_group == ngroups - 1) {
- /*
- * Even though mke2fs always initialize first and last group
- * if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
- * to make sure we calculate the right free blocks
- */
- group_blocks = ext4_blocks_count(sbi->s_es) -
- ext4_group_first_block_no(sb, ngroups - 1);
- } else {
- group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
- }
-
- free_blocks = group_blocks - bit_max;
-
if (bh) {
ext4_fsblk_t start, tmp;
int flex_bg = 0;
@@ -176,7 +166,8 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
bh->b_data);
}
- return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
+ return group_blocks - bit_max -
+ ext4_group_used_meta_blocks(sb, block_group, gdp);
}


@@ -620,6 +611,31 @@ unsigned long ext4_bg_num_gdb(struct super_block *sb, ext4_group_t group)

}

+/*
+ * This function returns the number of file system metadata blocks at
+ * the beginning of a block group, including the reserved gdt blocks.
+ */
+static unsigned int num_base_meta_blocks(struct super_block *sb,
+ ext4_group_t block_group)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int num;
+
+ /* Check for superblock and gdt backups in this group */
+ num = ext4_bg_has_super(sb, block_group);
+
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
+ block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
+ sbi->s_desc_per_block) {
+ if (num) {
+ num += ext4_bg_num_gdb(sb, block_group);
+ num += le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
+ }
+ } else { /* For META_BG_BLOCK_GROUPS */
+ num += ext4_bg_num_gdb(sb, block_group);
+ }
+ return num;
+}
/**
* ext4_inode_to_goal_block - return a hint for block allocation
* @inode: inode for block allocation
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 09/23] ext4: teach ext4_free_blocks() about bigalloc and clusters

The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extent should be freed or not. That will be up to the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.

We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().
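
To make the partial-cluster handling concrete, here is a minimal
userspace sketch of the head/tail trimming that ext4_free_blocks()
now performs. The flag values match the patch; everything else (the
standalone helper, main(), the power-of-two cluster_ratio) is
illustrative only, not kernel code:

#include <stdio.h>

#define NOFREE_FIRST_CLUSTER 0x0008 /* EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER */
#define NOFREE_LAST_CLUSTER  0x0010 /* EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER */

/*
 * Round the extent [*block, *block + *count) out to cluster
 * boundaries, unless a NOFREE flag asks us to leave a partial first
 * or last cluster alone, in which case shrink the extent instead.
 */
static void trim_to_clusters(unsigned long long *block, unsigned long *count,
			     unsigned cluster_ratio, int flags)
{
	unsigned overflow = *block & (cluster_ratio - 1);

	if (overflow) {
		if (flags & NOFREE_FIRST_CLUSTER) {
			overflow = cluster_ratio - overflow;
			*block += overflow;
			*count = (*count > overflow) ? *count - overflow : 0;
		} else {
			*block -= overflow;
			*count += overflow;
		}
	}
	overflow = *count & (cluster_ratio - 1);
	if (overflow) {
		if (flags & NOFREE_LAST_CLUSTER)
			*count = (*count > overflow) ? *count - overflow : 0;
		else
			*count += cluster_ratio - overflow;
	}
}

int main(void)
{
	unsigned long long block = 10;	/* starts mid-cluster (ratio 4) */
	unsigned long count = 7;	/* blocks 10..16 */

	trim_to_clusters(&block, &count, 4, 0);
	/* prints 8..19: both partial clusters were rounded outward */
	printf("freeing blocks %llu..%llu\n", block, block + count - 1);
	return 0;
}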

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++++++++++++++++---------------
fs/ext4/mballoc.h | 2 +-
3 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d404288..e77df83 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -538,6 +538,8 @@ struct ext4_new_group_data {
#define EXT4_FREE_BLOCKS_METADATA 0x0001
#define EXT4_FREE_BLOCKS_FORGET 0x0002
#define EXT4_FREE_BLOCKS_VALIDATED 0x0004
+#define EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER 0x0008
+#define EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER 0x0010

/*
* ioctl commands
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 646366c..635cc9f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2582,11 +2582,13 @@ int ext4_mb_release(struct super_block *sb)
}

static inline int ext4_issue_discard(struct super_block *sb,
- ext4_group_t block_group, ext4_grpblk_t block, int count)
+ ext4_group_t block_group, ext4_grpblk_t cluster, int count)
{
ext4_fsblk_t discard_block;

- discard_block = block + ext4_group_first_block_no(sb, block_group);
+ discard_block = (EXT4_C2B(EXT4_SB(sb), cluster) +
+ ext4_group_first_block_no(sb, block_group));
+ count = EXT4_C2B(EXT4_SB(sb), count);
trace_ext4_discard_blocks(sb,
(unsigned long long) discard_block, count);
return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
@@ -2613,7 +2615,7 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)

if (test_opt(sb, DISCARD))
ext4_issue_discard(sb, entry->group,
- entry->start_blk, entry->count);
+ entry->start_cluster, entry->count);

err = ext4_mb_load_buddy(sb, entry->group, &e4b);
/* we expect to find existing buddy because it's pinned */
@@ -2626,7 +2628,7 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
ext4_lock_group(sb, entry->group);
/* Take it out of per group rb tree */
rb_erase(&entry->node, &(db->bb_free_root));
- mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);
+ mb_free_blocks(NULL, &e4b, entry->start_cluster, entry->count);

if (!db->bb_free_root.rb_node) {
/* No more items in the per group rb tree
@@ -3271,7 +3273,7 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,

while (n) {
entry = rb_entry(n, struct ext4_free_data, node);
- mb_set_bits(bitmap, entry->start_blk, entry->count);
+ mb_set_bits(bitmap, entry->start_cluster, entry->count);
n = rb_next(n);
}
return;
@@ -4369,7 +4371,7 @@ static int can_merge(struct ext4_free_data *entry1,
{
if ((entry1->t_tid == entry2->t_tid) &&
(entry1->group == entry2->group) &&
- ((entry1->start_blk + entry1->count) == entry2->start_blk))
+ ((entry1->start_cluster + entry1->count) == entry2->start_cluster))
return 1;
return 0;
}
@@ -4379,7 +4381,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
struct ext4_free_data *new_entry)
{
ext4_group_t group = e4b->bd_group;
- ext4_grpblk_t block;
+ ext4_grpblk_t cluster;
struct ext4_free_data *entry;
struct ext4_group_info *db = e4b->bd_info;
struct super_block *sb = e4b->bd_sb;
@@ -4392,7 +4394,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
BUG_ON(e4b->bd_buddy_page == NULL);

new_node = &new_entry->node;
- block = new_entry->start_blk;
+ cluster = new_entry->start_cluster;

if (!*n) {
/* first free block exent. We need to
@@ -4406,13 +4408,14 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
while (*n) {
parent = *n;
entry = rb_entry(parent, struct ext4_free_data, node);
- if (block < entry->start_blk)
+ if (cluster < entry->start_cluster)
n = &(*n)->rb_left;
- else if (block >= (entry->start_blk + entry->count))
+ else if (cluster >= (entry->start_cluster + entry->count))
n = &(*n)->rb_right;
else {
ext4_grp_locked_error(sb, group, 0,
- ext4_group_first_block_no(sb, group) + block,
+ ext4_group_first_block_no(sb, group) +
+ EXT4_C2B(sbi, cluster),
"Block already on to-be-freed list");
return 0;
}
@@ -4426,7 +4429,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
if (node) {
entry = rb_entry(node, struct ext4_free_data, node);
if (can_merge(entry, new_entry)) {
- new_entry->start_blk = entry->start_blk;
+ new_entry->start_cluster = entry->start_cluster;
new_entry->count += entry->count;
rb_erase(node, &(db->bb_free_root));
spin_lock(&sbi->s_md_lock);
@@ -4477,6 +4480,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
ext4_group_t block_group;
struct ext4_sb_info *sbi;
struct ext4_buddy e4b;
+ unsigned int count_clusters;
int err = 0;
int ret;

@@ -4525,6 +4529,38 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
if (!ext4_should_writeback_data(inode))
flags |= EXT4_FREE_BLOCKS_METADATA;

+ /*
+ * If the extent to be freed does not begin on a cluster
+ * boundary, we need to deal with partial clusters at the
+ * beginning and end of the extent. Normally we will free
+ * blocks at the beginning or the end unless we are explicitly
+ * requested to avoid doing so.
+ */
+ overflow = block & (sbi->s_cluster_ratio - 1);
+ if (overflow) {
+ if (flags & EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER) {
+ overflow = sbi->s_cluster_ratio - overflow;
+ block += overflow;
+ if (count > overflow)
+ count -= overflow;
+ else
+ return;
+ } else {
+ block -= overflow;
+ count += overflow;
+ }
+ }
+ overflow = count & (sbi->s_cluster_ratio - 1);
+ if (overflow) {
+ if (flags & EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER) {
+ if (count > overflow)
+ count -= overflow;
+ else
+ return;
+ } else
+ count += sbi->s_cluster_ratio - overflow;
+ }
+
do_more:
overflow = 0;
ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
@@ -4533,10 +4569,12 @@ do_more:
* Check to see if we are freeing blocks across a group
* boundary.
*/
- if (bit + count > EXT4_CLUSTERS_PER_GROUP(sb)) {
- overflow = bit + count - EXT4_CLUSTERS_PER_GROUP(sb);
+ if (EXT4_C2B(sbi, bit) + count > EXT4_BLOCKS_PER_GROUP(sb)) {
+ overflow = EXT4_C2B(sbi, bit) + count -
+ EXT4_BLOCKS_PER_GROUP(sb);
count -= overflow;
}
+ count_clusters = EXT4_B2C(sbi, count);
bitmap_bh = ext4_read_block_bitmap(sb, block_group);
if (!bitmap_bh) {
err = -EIO;
@@ -4551,9 +4589,9 @@ do_more:
if (in_range(ext4_block_bitmap(sb, gdp), block, count) ||
in_range(ext4_inode_bitmap(sb, gdp), block, count) ||
in_range(block, ext4_inode_table(sb, gdp),
- EXT4_SB(sb)->s_itb_per_group) ||
+ EXT4_SB(sb)->s_itb_per_group) ||
in_range(block + count - 1, ext4_inode_table(sb, gdp),
- EXT4_SB(sb)->s_itb_per_group)) {
+ EXT4_SB(sb)->s_itb_per_group)) {

ext4_error(sb, "Freeing blocks in system zone - "
"Block = %llu, count = %lu", block, count);
@@ -4578,11 +4616,11 @@ do_more:
#ifdef AGGRESSIVE_CHECK
{
int i;
- for (i = 0; i < count; i++)
+ for (i = 0; i < count_clusters; i++)
BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
}
#endif
- trace_ext4_mballoc_free(sb, inode, block_group, bit, count);
+ trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);

err = ext4_mb_load_buddy(sb, block_group, &e4b);
if (err)
@@ -4599,13 +4637,13 @@ do_more:
err = -ENOMEM;
goto error_return;
}
- new_entry->start_blk = bit;
+ new_entry->start_cluster = bit;
new_entry->group = block_group;
- new_entry->count = count;
+ new_entry->count = count_clusters;
new_entry->t_tid = handle->h_transaction->t_tid;

ext4_lock_group(sb, block_group);
- mb_clear_bits(bitmap_bh->b_data, bit, count);
+ mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
ext4_mb_free_metadata(handle, &e4b, new_entry);
} else {
/* need to update group_info->bb_free and bitmap
@@ -4613,11 +4651,11 @@ do_more:
* them with group lock_held
*/
ext4_lock_group(sb, block_group);
- mb_clear_bits(bitmap_bh->b_data, bit, count);
- mb_free_blocks(inode, &e4b, bit, count);
+ mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
+ mb_free_blocks(inode, &e4b, bit, count_clusters);
}

- ret = ext4_free_blks_count(sb, gdp) + count;
+ ret = ext4_free_blks_count(sb, gdp) + count_clusters;
ext4_free_blks_set(sb, gdp, ret);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index cd93532..5d27fc0 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -106,7 +106,7 @@ struct ext4_free_data {
ext4_group_t group;

/* free block extent */
- ext4_grpblk_t start_blk;
+ ext4_grpblk_t start_cluster;
ext4_grpblk_t count;

/* transaction which freed this extent */
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 15/23] ext4: tune mballoc's default group prealloc size for bigalloc file systems

The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size. This is
probably too big for large cluster sizes. So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.
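
As a quick check of the arithmetic, a standalone sketch of the new
default (assuming MB_DEFAULT_GROUP_PREALLOC is 512 and a 4k block
size; the helper and the loop are illustrative, not kernel code):

#include <stdio.h>

#define MB_DEFAULT_GROUP_PREALLOC 512

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

int main(void)
{
	unsigned cluster_bits;	/* log2(blocks per cluster) */

	for (cluster_bits = 0; cluster_bits <= 8; cluster_bits += 4) {
		unsigned prealloc =
			max_u(MB_DEFAULT_GROUP_PREALLOC >> cluster_bits, 32);

		/* with 4k blocks, cluster size is 4 KiB << cluster_bits */
		printf("cluster %4u KiB -> prealloc %3u clusters (%5u KiB)\n",
		       4u << cluster_bits, prealloc,
		       prealloc * (4u << cluster_bits));
	}
	return 0;	/* 2 MiB, 2 MiB, 32 MiB for 4k/64k/1M clusters */
}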

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/mballoc.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 48172a1..2d23d0a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -126,7 +126,8 @@
* list. In case of inode preallocation we follow a list of heuristics
* based on file size. This can be found in ext4_mb_normalize_request. If
* we are doing a group prealloc we try to normalize the request to
- * sbi->s_mb_group_prealloc. Default value of s_mb_group_prealloc is
+ * sbi->s_mb_group_prealloc. The default value of s_mb_group_prealloc is
+ * dependent on the cluster size; for non-bigalloc file systems, it is
* 512 blocks. This can be tuned via
* /sys/fs/ext4/<partition/mb_group_prealloc. The value is represented in
* terms of number of blocks. If we have mounted the file system with -O
@@ -2471,7 +2472,20 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery)
sbi->s_mb_stats = MB_DEFAULT_STATS;
sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
- sbi->s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC;
+ /*
+ * The default group preallocation is 512, which for 4k block
+ * sizes translates to 2 megabytes. However for bigalloc file
+ * systems, this is probably too big (i.e., if the cluster size
+ * is 1 megabyte, then group preallocation size becomes half a
+ * gigabyte!). As a default, we will keep a two megabyte
+ * group prealloc size for cluster sizes up to 64k, and after
+ * that, we will force a minimum group preallocation size of
+ * 32 clusters. This translates to 8 megs when the cluster
+ * size is 256k, and 32 megs when the cluster size is 1 meg,
+ * which seems reasonable as a default.
+ */
+ sbi->s_mb_group_prealloc = max(MB_DEFAULT_GROUP_PREALLOC >>
+ sbi->s_cluster_bits, 32);

sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
if (sbi->s_locality_groups == NULL) {
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 22/23] ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters()

Rename the function so it is more clear what is going on. Also rename
the various variables so it's clearer what's happening.

Also fix a missing blocks-to-clusters conversion when reading the
number of reserved blocks for root.
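
The conversion in question is the EXT4_B2C() applied to
ext4_r_blocks_count() in the diff below. As an illustration only
(assuming EXT4_B2C() is a right shift by s_cluster_bits, which is how
the bigalloc series defines it; the struct and main() are stand-ins):

#include <stdio.h>

/* Illustrative stand-ins; the real definitions live in fs/ext4/ext4.h. */
struct sbi { unsigned cluster_bits; };
#define EXT4_B2C(s, blk) ((blk) >> (s)->cluster_bits)

int main(void)
{
	struct sbi s = { .cluster_bits = 4 };	/* 16 blocks per cluster */
	unsigned long long r_blocks = 51200;	/* blocks reserved for root */

	/* ext4_has_free_clusters() compares everything in cluster units,
	 * so the block-denominated root reserve must be converted: */
	printf("root reserve: %llu blocks = %llu clusters\n",
	       r_blocks, EXT4_B2C(&s, r_blocks));
	return 0;	/* 51200 blocks = 3200 clusters */
}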

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 43 ++++++++++++++++++++++---------------------
fs/ext4/ext4.h | 8 ++++----
fs/ext4/inode.c | 2 +-
3 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index af0c535..f6dba45 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -403,42 +403,43 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
}

/**
- * ext4_has_free_blocks()
+ * ext4_has_free_clusters()
* @sbi: in-core super block structure.
- * @nblocks: number of needed blocks
+ * @nclusters: number of needed clusters
+ * @flags: flags from ext4_mb_new_blocks()
*
- * Check if filesystem has nblocks free & available for allocation.
+ * Check if filesystem has nclusters free & available for allocation.
* On success return 1, return 0 on failure.
*/
-static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
- s64 nblocks, unsigned int flags)
+static int ext4_has_free_clusters(struct ext4_sb_info *sbi,
+ s64 nclusters, unsigned int flags)
{
- s64 free_blocks, dirty_blocks, root_blocks;
+ s64 free_clusters, dirty_clusters, root_clusters;
struct percpu_counter *fcc = &sbi->s_freeclusters_counter;
- struct percpu_counter *dbc = &sbi->s_dirtyclusters_counter;
+ struct percpu_counter *dcc = &sbi->s_dirtyclusters_counter;

- free_blocks = percpu_counter_read_positive(fcc);
- dirty_blocks = percpu_counter_read_positive(dbc);
- root_blocks = ext4_r_blocks_count(sbi->s_es);
+ free_clusters = percpu_counter_read_positive(fcc);
+ dirty_clusters = percpu_counter_read_positive(dcc);
+ root_clusters = EXT4_B2C(sbi, ext4_r_blocks_count(sbi->s_es));

- if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
- EXT4_FREEBLOCKS_WATERMARK) {
- free_blocks = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
- dirty_blocks = percpu_counter_sum_positive(dbc);
+ if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
+ EXT4_FREECLUSTERS_WATERMARK) {
+ free_clusters = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
+ dirty_clusters = percpu_counter_sum_positive(dcc);
}
- /* Check whether we have space after
- * accounting for current dirty blocks & root reserved blocks.
+ /* Check whether we have space after accounting for current
+ * dirty clusters & root reserved clusters.
*/
- if (free_blocks >= ((root_blocks + nblocks) + dirty_blocks))
+ if (free_clusters >= ((root_clusters + nclusters) + dirty_clusters))
return 1;

- /* Hm, nope. Are (enough) root reserved blocks available? */
+ /* Hm, nope. Are (enough) root reserved clusters available? */
if (sbi->s_resuid == current_fsuid() ||
((sbi->s_resgid != 0) && in_group_p(sbi->s_resgid)) ||
capable(CAP_SYS_RESOURCE) ||
(flags & EXT4_MB_USE_ROOT_BLOCKS)) {

- if (free_blocks >= (nblocks + dirty_blocks))
+ if (free_clusters >= (nclusters + dirty_clusters))
return 1;
}

@@ -448,7 +449,7 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
s64 nclusters, unsigned int flags)
{
- if (ext4_has_free_blocks(sbi, nclusters, flags)) {
+ if (ext4_has_free_clusters(sbi, nclusters, flags)) {
percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters);
return 0;
} else
@@ -469,7 +470,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
*/
int ext4_should_retry_alloc(struct super_block *sb, int *retries)
{
- if (!ext4_has_free_blocks(EXT4_SB(sb), 1, 0) ||
+ if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) ||
(*retries)++ > 3 ||
!EXT4_SB(sb)->s_journal)
return 0;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8d2e34f..b11ee7e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2079,13 +2079,13 @@ do { \
} while (0)

#ifdef CONFIG_SMP
-/* Each CPU can accumulate percpu_counter_batch blocks in their local
- * counters. So we need to make sure we have free blocks more
+/* Each CPU can accumulate percpu_counter_batch clusters in their local
+ * counters. So we need to make sure we have free clusters more
* than percpu_counter_batch * nr_cpu_ids. Also add a window of 4 times.
*/
-#define EXT4_FREEBLOCKS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
+#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
#else
-#define EXT4_FREEBLOCKS_WATERMARK 0
+#define EXT4_FREECLUSTERS_WATERMARK 0
#endif

static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4a68531..7ae8dec 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2214,7 +2214,7 @@ static int ext4_nonda_switch(struct super_block *sb)
percpu_counter_read_positive(&sbi->s_freeclusters_counter));
dirty_blocks = percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
if (2 * free_blocks < 3 * dirty_blocks ||
- free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
+ free_blocks < (dirty_blocks + EXT4_FREECLUSTERS_WATERMARK)) {
/*
* free block count is less than 150% of dirty blocks
* or free blocks is less than watermark
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 20/23] ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()

This function really returns the number of free clusters in a block group
after an uninitialized block bitmap has been initialized.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 6 +++---
fs/ext4/ext4.h | 6 +++---
fs/ext4/ialloc.c | 2 +-
fs/ext4/mballoc.c | 4 ++--
4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4f30380..3d07c35 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -213,9 +213,9 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
/* Return the number of free blocks in a block group. It is used when
* the block bitmap is uninitialized, so we can't just count the bits
* in the bitmap. */
-unsigned ext4_free_blocks_after_init(struct super_block *sb,
- ext4_group_t block_group,
- struct ext4_group_desc *gdp)
+unsigned ext4_free_clusters_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
{
return num_clusters_in_group(sb, block_group) -
ext4_num_overhead_clusters(sb, block_group, gdp);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 04bf654..b18c6d5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1772,9 +1772,9 @@ extern void ext4_init_block_bitmap(struct super_block *sb,
struct buffer_head *bh,
ext4_group_t group,
struct ext4_group_desc *desc);
-extern unsigned ext4_free_blocks_after_init(struct super_block *sb,
- ext4_group_t block_group,
- struct ext4_group_desc *gdp);
+extern unsigned ext4_free_clusters_after_init(struct super_block *sb,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp);
extern unsigned ext4_num_base_meta_clusters(struct super_block *sb,
ext4_group_t block_group);
extern unsigned ext4_num_overhead_clusters(struct super_block *sb,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index aa74ca6..b30115b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -961,7 +961,7 @@ got:
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_group_clusters_set(sb, gdp,
- ext4_free_blocks_after_init(sb, group, gdp));
+ ext4_free_clusters_after_init(sb, group, gdp));
gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
gdp);
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 852f303..33cdc81 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2251,7 +2251,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
*/
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
meta_group_info[i]->bb_free =
- ext4_free_blocks_after_init(sb, group, desc);
+ ext4_free_clusters_after_init(sb, group, desc);
} else {
meta_group_info[i]->bb_free =
ext4_free_group_clusters(sb, desc);
@@ -2810,7 +2810,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_group_clusters_set(sb, gdp,
- ext4_free_blocks_after_init(sb,
+ ext4_free_clusters_after_init(sb,
ac->ac_b_ex.fe_group, gdp));
}
len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 21/23] ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()

This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 8 ++++----
fs/ext4/ext4.h | 4 ++--
fs/ext4/inode.c | 2 +-
fs/ext4/mballoc.c | 2 +-
4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 3d07c35..af0c535 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -445,11 +445,11 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi,
return 0;
}

-int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
- s64 nblocks, unsigned int flags)
+int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
+ s64 nclusters, unsigned int flags)
{
- if (ext4_has_free_blocks(sbi, nblocks, flags)) {
- percpu_counter_add(&sbi->s_dirtyclusters_counter, nblocks);
+ if (ext4_has_free_blocks(sbi, nclusters, flags)) {
+ percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters);
return 0;
} else
return -ENOSPC;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b18c6d5..8d2e34f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1758,8 +1758,8 @@ extern ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
unsigned int flags,
unsigned long *count,
int *errp);
-extern int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
- s64 nblocks, unsigned int flags);
+extern int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
+ s64 nclusters, unsigned int flags);
extern ext4_fsblk_t ext4_count_free_clusters(struct super_block *);
extern void ext4_check_blocks_bitmap(struct super_block *);
extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1cbad80..4a68531 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1038,7 +1038,7 @@ repeat:
* We do still charge estimated metadata to the sb though;
* we cannot afford to run out of free blocks.
*/
- if (ext4_claim_free_blocks(sbi, md_needed + 1, 0)) {
+ if (ext4_claim_free_clusters(sbi, md_needed + 1, 0)) {
dquot_release_reservation_block(inode, EXT4_C2B(sbi, 1));
if (ext4_should_retry_alloc(inode->i_sb, &retries)) {
yield();
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33cdc81..ace05a4 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4265,7 +4265,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
* and verify allocation doesn't exceed the quota limits.
*/
while (ar->len &&
- ext4_claim_free_blocks(sbi, ar->len, ar->flags)) {
+ ext4_claim_free_clusters(sbi, ar->len, ar->flags)) {

/* let others to free the space */
yield();
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 08/23] ext4: teach mballoc preallocation code about bigalloc clusters

In most of mballoc.c, we do everything in units of clusters, since the
block allocation bitmaps and buddy bitmaps are all denominated in
clusters. The one place where we do deal with absolute block numbers
is in the code that handles the preallocation regions, since in the
case of inode-based preallocation regions, the start of the
preallocation region can't be relative to the beginning of the group.

So this adds a bit of complexity, where pa_pstart and pa_lstart are
block numbers, while pa_free, pa_len, and fe_len are denominated in
units of clusters.
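
A tiny standalone sketch of the unit mixing described above
(EXT4_C2B() assumed to be a left shift by s_cluster_bits, as the
bigalloc series defines it; the structs and main() are illustrative):

#include <stdio.h>

struct sbi { unsigned cluster_bits; };
#define EXT4_C2B(s, c) ((unsigned long long)(c) << (s)->cluster_bits)

/* pa_pstart is an absolute block number; pa_len counts clusters. */
struct pa { unsigned long long pa_pstart; unsigned pa_len; };

int main(void)
{
	struct sbi s = { .cluster_bits = 2 };	/* 4 blocks per cluster */
	struct pa pa = { .pa_pstart = 1000, .pa_len = 8 };

	/* The end of the preallocation region in block units: the
	 * cluster-denominated length must be converted before it can
	 * be added to a block number, exactly as in the diff below. */
	unsigned long long pa_end = pa.pa_pstart + EXT4_C2B(&s, pa.pa_len);

	printf("PA covers blocks [%llu, %llu)\n", pa.pa_pstart, pa_end);
	return 0;	/* [1000, 1032): 8 clusters = 32 blocks */
}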

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/mballoc.c | 95 ++++++++++++++++++++++++++++++-----------------------
fs/ext4/mballoc.h | 4 +-
2 files changed, 56 insertions(+), 43 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 01dbee6..646366c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -70,8 +70,8 @@
*
* pa_lstart -> the logical start block for this prealloc space
* pa_pstart -> the physical start block for this prealloc space
- * pa_len -> length for this prealloc space
- * pa_free -> free space available in this prealloc space
+ * pa_len -> length for this prealloc space (in clusters)
+ * pa_free -> free space available in this prealloc space (in clusters)
*
* The inode preallocation space is used looking at the _logical_ start
* block. If only the logical file block falls within the range of prealloc
@@ -458,7 +458,7 @@ static void mb_free_blocks_double(struct inode *inode, struct ext4_buddy *e4b,
ext4_fsblk_t blocknr;

blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
- blocknr += first + i;
+ blocknr += EXT4_C2B(EXT4_SB(sb), first + i);
ext4_grp_locked_error(sb, e4b->bd_group,
inode ? inode->i_ino : 0,
blocknr,
@@ -732,7 +732,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,

if (free != grp->bb_free) {
ext4_grp_locked_error(sb, group, 0, 0,
- "%u blocks in bitmap, %u in gd",
+ "%u clusters in bitmap, %u in gd",
free, grp->bb_free);
/*
* If we intent to continue, we consider group descritor
@@ -1337,7 +1337,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
ext4_fsblk_t blocknr;

blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
- blocknr += block;
+ blocknr += EXT4_C2B(EXT4_SB(sb), block);
ext4_grp_locked_error(sb, e4b->bd_group,
inode ? inode->i_ino : 0,
blocknr,
@@ -1829,7 +1829,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
* we have free blocks
*/
ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
- "%d free blocks as per "
+ "%d free clusters as per "
"group info. But bitmap says 0",
free);
break;
@@ -1839,7 +1839,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
BUG_ON(ex.fe_len <= 0);
if (free < ex.fe_len) {
ext4_grp_locked_error(sb, e4b->bd_group, 0, 0,
- "%d free blocks as per "
+ "%d free clusters as per "
"group info. But got %d blocks",
free, ex.fe_len);
/*
@@ -2723,7 +2723,7 @@ void ext4_exit_mballoc(void)
*/
static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
- handle_t *handle, unsigned int reserv_blks)
+ handle_t *handle, unsigned int reserv_clstrs)
{
struct buffer_head *bitmap_bh = NULL;
struct ext4_group_desc *gdp;
@@ -2762,7 +2762,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,

block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);

- len = ac->ac_b_ex.fe_len;
+ len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
if (!ext4_data_block_valid(sbi, block, len)) {
ext4_error(sb, "Allocating blocks %llu-%llu which overlap "
"fs metadata\n", block, block+len);
@@ -2808,7 +2808,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
*/
if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
/* release all the reserved blocks if non delalloc */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);
+ percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_clstrs);

if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi,
@@ -2858,6 +2858,7 @@ static noinline_for_stack void
ext4_mb_normalize_request(struct ext4_allocation_context *ac,
struct ext4_allocation_request *ar)
{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
int bsbits, max;
ext4_lblk_t end;
loff_t size, orig_size, start_off;
@@ -2888,7 +2889,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,

/* first, let's learn actual file size
* given current request is allocated */
- size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
+ size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
size = size << bsbits;
if (size < i_size_read(ac->ac_inode))
size = i_size_read(ac->ac_inode);
@@ -2960,7 +2961,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
continue;
}

- pa_end = pa->pa_lstart + pa->pa_len;
+ pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
+ pa->pa_len);

/* PA must not overlap original request */
BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
@@ -2990,9 +2992,11 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
rcu_read_lock();
list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
ext4_lblk_t pa_end;
+
spin_lock(&pa->pa_lock);
if (pa->pa_deleted == 0) {
- pa_end = pa->pa_lstart + pa->pa_len;
+ pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
+ pa->pa_len);
BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
}
spin_unlock(&pa->pa_lock);
@@ -3014,7 +3018,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
/* XXX: is it better to align blocks WRT to logical
* placement or satisfy big request as is */
ac->ac_g_ex.fe_logical = start;
- ac->ac_g_ex.fe_len = size;
+ ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);

/* define goal start in order to merge */
if (ar->pright && (ar->lright == (start + size))) {
@@ -3083,14 +3087,16 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
struct ext4_prealloc_space *pa)
{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
ext4_fsblk_t start;
ext4_fsblk_t end;
int len;

/* found preallocated blocks, use them */
start = pa->pa_pstart + (ac->ac_o_ex.fe_logical - pa->pa_lstart);
- end = min(pa->pa_pstart + pa->pa_len, start + ac->ac_o_ex.fe_len);
- len = end - start;
+ end = min(pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len),
+ start + EXT4_C2B(sbi, ac->ac_o_ex.fe_len));
+ len = EXT4_NUM_B2C(sbi, end - start);
ext4_get_group_no_and_offset(ac->ac_sb, start, &ac->ac_b_ex.fe_group,
&ac->ac_b_ex.fe_start);
ac->ac_b_ex.fe_len = len;
@@ -3098,7 +3104,7 @@ static void ext4_mb_use_inode_pa(struct ext4_allocation_context *ac,
ac->ac_pa = pa;

BUG_ON(start < pa->pa_pstart);
- BUG_ON(start + len > pa->pa_pstart + pa->pa_len);
+ BUG_ON(end > pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len));
BUG_ON(pa->pa_free < len);
pa->pa_free -= len;

@@ -3164,6 +3170,7 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
static noinline_for_stack int
ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
int order, i;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_locality_group *lg;
@@ -3181,12 +3188,14 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
/* all fields in this condition don't change,
* so we can skip locking for them */
if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
- ac->ac_o_ex.fe_logical >= pa->pa_lstart + pa->pa_len)
+ ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
+ EXT4_C2B(sbi, pa->pa_len)))
continue;

/* non-extent files can't have physical blocks past 2^32 */
if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
- pa->pa_pstart + pa->pa_len > EXT4_MAX_BLOCK_FILE_PHYS)
+ (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) >
+ EXT4_MAX_BLOCK_FILE_PHYS))
continue;

/* found preallocated blocks, use them */
@@ -3383,6 +3392,7 @@ static noinline_for_stack int
ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
{
struct super_block *sb = ac->ac_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_prealloc_space *pa;
struct ext4_group_info *grp;
struct ext4_inode_info *ei;
@@ -3414,16 +3424,18 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
winl = ac->ac_o_ex.fe_logical - ac->ac_g_ex.fe_logical;

/* also, we should cover whole original request */
- wins = ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len;
+ wins = EXT4_C2B(sbi, ac->ac_b_ex.fe_len - ac->ac_o_ex.fe_len);

/* the smallest one defines real window */
win = min(winl, wins);

- offs = ac->ac_o_ex.fe_logical % ac->ac_b_ex.fe_len;
+ offs = ac->ac_o_ex.fe_logical %
+ EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
if (offs && offs < win)
win = offs;

- ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical - win;
+ ac->ac_b_ex.fe_logical = ac->ac_o_ex.fe_logical -
+ EXT4_B2C(sbi, win);
BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
}
@@ -3448,7 +3460,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
trace_ext4_mb_new_inode_pa(ac, pa);

ext4_mb_use_inode_pa(ac, pa);
- atomic_add(pa->pa_free, &EXT4_SB(sb)->s_mb_preallocated);
+ atomic_add(pa->pa_free, &sbi->s_mb_preallocated);

ei = EXT4_I(ac->ac_inode);
grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
@@ -3563,7 +3575,7 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,

BUG_ON(pa->pa_deleted == 0);
ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
- grp_blk_start = pa->pa_pstart - bit;
+ grp_blk_start = pa->pa_pstart - EXT4_C2B(sbi, bit);
BUG_ON(group != e4b->bd_group && pa->pa_len != 0);
end = bit + pa->pa_len;

@@ -3578,7 +3590,8 @@ ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
free += next - bit;

trace_ext4_mballoc_discard(sb, NULL, group, bit, next - bit);
- trace_ext4_mb_release_inode_pa(pa, grp_blk_start + bit,
+ trace_ext4_mb_release_inode_pa(pa, (grp_blk_start +
+ EXT4_C2B(sbi, bit)),
next - bit);
mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
bit = next + 1;
@@ -3926,7 +3939,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
return;

- size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
+ size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
>> bsbits;

@@ -3987,18 +4000,15 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,

/* set up allocation goals */
memset(ac, 0, sizeof(struct ext4_allocation_context));
- ac->ac_b_ex.fe_logical = ar->logical;
+ ac->ac_b_ex.fe_logical = ar->logical & ~(sbi->s_cluster_ratio - 1);
ac->ac_status = AC_STATUS_CONTINUE;
ac->ac_sb = sb;
ac->ac_inode = ar->inode;
- ac->ac_o_ex.fe_logical = ar->logical;
+ ac->ac_o_ex.fe_logical = ac->ac_b_ex.fe_logical;
ac->ac_o_ex.fe_group = group;
ac->ac_o_ex.fe_start = block;
ac->ac_o_ex.fe_len = len;
- ac->ac_g_ex.fe_logical = ar->logical;
- ac->ac_g_ex.fe_group = group;
- ac->ac_g_ex.fe_start = block;
- ac->ac_g_ex.fe_len = len;
+ ac->ac_g_ex = ac->ac_o_ex;
ac->ac_flags = ar->flags;

/* we have to define context: we'll we work with a file or
@@ -4150,13 +4160,14 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
*/
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *pa = ac->ac_pa;
if (pa) {
if (pa->pa_type == MB_GROUP_PA) {
/* see comment in ext4_mb_use_group_pa() */
spin_lock(&pa->pa_lock);
- pa->pa_pstart += ac->ac_b_ex.fe_len;
- pa->pa_lstart += ac->ac_b_ex.fe_len;
+ pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
+ pa->pa_lstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
pa->pa_free -= ac->ac_b_ex.fe_len;
pa->pa_len -= ac->ac_b_ex.fe_len;
spin_unlock(&pa->pa_lock);
@@ -4217,7 +4228,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
struct super_block *sb;
ext4_fsblk_t block = 0;
unsigned int inquota = 0;
- unsigned int reserv_blks = 0;
+ unsigned int reserv_clstrs = 0;

sb = ar->inode->i_sb;
sbi = EXT4_SB(sb);
@@ -4247,12 +4258,14 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
*errp = -ENOSPC;
return 0;
}
- reserv_blks = ar->len;
+ reserv_clstrs = ar->len;
if (ar->flags & EXT4_MB_USE_ROOT_BLOCKS) {
- dquot_alloc_block_nofail(ar->inode, ar->len);
+ dquot_alloc_block_nofail(ar->inode,
+ EXT4_C2B(sbi, ar->len));
} else {
while (ar->len &&
- dquot_alloc_block(ar->inode, ar->len)) {
+ dquot_alloc_block(ar->inode,
+ EXT4_C2B(sbi, ar->len))) {

ar->flags |= EXT4_MB_HINT_NOPREALLOC;
ar->len--;
@@ -4296,7 +4309,7 @@ repeat:
ext4_mb_new_preallocation(ac);
}
if (likely(ac->ac_status == AC_STATUS_FOUND)) {
- *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_blks);
+ *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
if (*errp == -EAGAIN) {
/*
* drop the reference that we took
@@ -4332,13 +4345,13 @@ out:
if (ac)
kmem_cache_free(ext4_ac_cachep, ac);
if (inquota && ar->len < inquota)
- dquot_free_block(ar->inode, inquota - ar->len);
+ dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len));
if (!ar->len) {
if (!ext4_test_inode_state(ar->inode,
EXT4_STATE_DELALLOC_RESERVED))
/* release all the reserved blocks if non delalloc */
percpu_counter_sub(&sbi->s_dirtyblocks_counter,
- reserv_blks);
+ reserv_clstrs);
}

trace_ext4_allocate_blocks(ar, (unsigned long long)block);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 4423c6f..cd93532 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -139,9 +139,9 @@ enum {

struct ext4_free_extent {
ext4_lblk_t fe_logical;
- ext4_grpblk_t fe_start;
+ ext4_grpblk_t fe_start; /* In cluster units */
ext4_group_t fe_group;
- ext4_grpblk_t fe_len;
+ ext4_grpblk_t fe_len; /* In cluster units */
};

/*
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

Subject: [PATCH 16/23] ext4: Fix bigalloc quota accounting and i_blocks value

From: Aditya Kali <[email protected]>

With the bigalloc changes, the i_blocks value was not correctly set (it was
still set to the number of blocks being used, but in the case of bigalloc we
want i_blocks to represent the number of clusters being used). Since the
quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.
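
As a sketch of the convention the patch adopts: the per-inode
reservation counters stay denominated in clusters, while the quota
subsystem is always charged in blocks, so every dquot_*() call in the
diff converts at the boundary with EXT4_C2B(). A toy illustration
(the shift-based EXT4_C2B() and the numbers are assumptions for the
example, not kernel code):

#include <stdio.h>

struct sbi { unsigned cluster_bits; };
#define EXT4_C2B(s, c) ((c) << (s)->cluster_bits)

int main(void)
{
	struct sbi s = { .cluster_bits = 4 };	/* 16 blocks per cluster */
	unsigned reserved_clusters = 3;		/* delalloc reservation */

	/* dquot_claim_block() and friends take block counts, so the
	 * cluster-denominated reservation is converted when claimed: */
	printf("claiming %u clusters charges %u blocks of quota\n",
	       reserved_clusters, EXT4_C2B(&s, reserved_clusters));
	return 0;	/* 3 clusters -> 48 blocks */
}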

Signed-off-by: Aditya Kali <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 5 +-
fs/ext4/ext4.h | 16 +++-
fs/ext4/ext4_extents.h | 2 +
fs/ext4/extents.c | 306 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 54 ++++++---
fs/ext4/mballoc.c | 4 +-
fs/ext4/super.c | 3 +-
7 files changed, 365 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 9080a85..bf42b32 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -485,7 +485,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
* @handle: handle to this transaction
* @inode: file inode
* @goal: given target block(filesystem wide)
- * @count: pointer to total number of blocks needed
+ * @count: pointer to total number of clusters needed
* @errp: error code
*
* Return 1st allocated block number on success, *count stores total account
@@ -517,7 +517,8 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
EXT4_I(inode)->i_allocated_meta_blocks += ar.len;
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
- dquot_alloc_block_nofail(inode, ar.len);
+ dquot_alloc_block_nofail(inode,
+ EXT4_C2B(EXT4_SB(inode->i_sb), ar.len));
}
return ret;
}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ca5ab11..78ac8ed 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -144,9 +144,17 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
+/* Sometimes (in the bigalloc case, from ext4_da_get_block_prep) the caller of
+ * ext4_map_blocks wants to know whether or not the underlying cluster has
+ * already been accounted for. EXT4_MAP_FROM_CLUSTER conveys to the caller that
+ * the requested mapping was from a previously mapped (or delayed allocated)
+ * cluster. We use BH_AllocFromCluster only for this flag. BH_AllocFromCluster
+ * should never appear on buffer_head's state flags.
+ */
+#define EXT4_MAP_FROM_CLUSTER (1 << BH_AllocFromCluster)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_UNINIT)
+ EXT4_MAP_UNINIT | EXT4_MAP_FROM_CLUSTER)

struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -1863,6 +1871,7 @@ extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+extern int ext4_da_reserve_space(struct inode *inode, ext4_lblk_t lblock);

/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
@@ -2252,6 +2261,11 @@ extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
+ BH_AllocFromCluster, /* allocated blocks were part of already
+ * allocated cluster. Note that this flag will
+ * never, ever appear in a buffer_head's state
+ * flag. See EXT4_MAP_FROM_CLUSTER to see where
+ * this is used. */
};

BUFFER_FNS(Uninit, uninit)
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 095c36f..a52db3a 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -290,5 +290,7 @@ extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
struct ext4_ext_path *);
extern void ext4_ext_drop_refs(struct ext4_ext_path *);
extern int ext4_ext_check_inode(struct inode *inode);
+extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk,
+ int search_hint_reverse);
#endif /* _EXT4_EXTENTS */

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8fc735f..11bd617 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2676,6 +2676,21 @@ again:
}
}

+ /* If we still have something in the partial cluster and we have removed
+ * even the first extent, then we should free the blocks in the partial
+ * cluster as well. */
+ if (partial_cluster && path->p_hdr->eh_entries == 0) {
+ int flags = EXT4_FREE_BLOCKS_FORGET;
+
+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+ flags |= EXT4_FREE_BLOCKS_METADATA;
+
+ ext4_free_blocks(handle, inode, NULL,
+ EXT4_C2B(EXT4_SB(sb), partial_cluster),
+ EXT4_SB(sb)->s_cluster_ratio, flags);
+ partial_cluster = 0;
+ }
+
/* TODO: flexible tree reduction should be here */
if (path->p_hdr->eh_entries == 0) {
/*
@@ -3223,6 +3238,195 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
return ext4_mark_inode_dirty(handle, inode);
}

+/**
+ * ext4_find_delalloc_range: find delayed allocated block in the given range.
+ *
+ * Goes through the buffer heads in the range [lblk_start, lblk_end] and returns
+ * whether there are any buffers marked for delayed allocation. It returns '1'
+ * on the first delalloc'ed buffer head found. If no buffer head in the given
+ * range is marked for delalloc, it returns 0.
+ * lblk_start should always be <= lblk_end.
+ * search_hint_reverse is to indicate that searching in reverse from lblk_end to
+ * lblk_start might be more efficient (i.e., we will likely hit the delalloc'ed
+ * block sooner). This is useful when blocks are truncated sequentially from
+ * lblk_start towards lblk_end.
+ */
+static int ext4_find_delalloc_range(struct inode *inode,
+ ext4_lblk_t lblk_start,
+ ext4_lblk_t lblk_end,
+ int search_hint_reverse)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct buffer_head *head, *bh = NULL;
+ struct page *page;
+ ext4_lblk_t i, pg_lblk;
+ pgoff_t index;
+
+ /* reverse search won't work if fs block size is less than page size */
+ if (inode->i_blkbits < PAGE_CACHE_SHIFT)
+ search_hint_reverse = 0;
+
+ if (search_hint_reverse)
+ i = lblk_end;
+ else
+ i = lblk_start;
+
+ index = i >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
+
+ while ((i >= lblk_start) && (i <= lblk_end)) {
+ page = find_get_page(mapping, index);
+ if (!page || !PageDirty(page))
+ goto nextpage;
+
+ if (PageWriteback(page)) {
+ /*
+ * This might be a race with allocation and writeout. In
+ * this case we just assume that the rest of the range
+ * will eventually be written and there won't be any
+ * delalloc blocks left.
+ * TODO: the above assumption is troublesome, but might
+ * work better in practice. Another option would be to note
+ * somewhere that the cluster is getting written out and
+ * to detect that here.
+ */
+ page_cache_release(page);
+ return 0;
+ }
+
+ if (!page_has_buffers(page))
+ goto nextpage;
+
+ head = page_buffers(page);
+ if (!head)
+ goto nextpage;
+
+ bh = head;
+ pg_lblk = index << (PAGE_CACHE_SHIFT -
+ inode->i_blkbits);
+ do {
+ if (unlikely(pg_lblk < lblk_start)) {
+ /*
+ * This is possible when fs block size is less
+ * than page size and our cluster starts/ends in
+ * middle of the page. So we need to skip the
+ * initial few blocks till we reach the 'lblk'
+ */
+ pg_lblk++;
+ continue;
+ }
+
+ if (buffer_delay(bh)) {
+ page_cache_release(page);
+ return 1;
+ }
+ if (search_hint_reverse)
+ i--;
+ else
+ i++;
+ } while ((i >= lblk_start) && (i <= lblk_end) &&
+ ((bh = bh->b_this_page) != head));
+nextpage:
+ if (page)
+ page_cache_release(page);
+ /*
+ * Move to next page. 'i' will be the first lblk in the next
+ * page.
+ */
+ if (search_hint_reverse)
+ index--;
+ else
+ index++;
+ i = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ }
+
+ return 0;
+}
+
+int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk,
+ int search_hint_reverse)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ ext4_lblk_t lblk_start, lblk_end;
+ lblk_start = lblk & (~(sbi->s_cluster_ratio - 1));
+ lblk_end = lblk_start + sbi->s_cluster_ratio - 1;
+
+ return ext4_find_delalloc_range(inode, lblk_start, lblk_end,
+ search_hint_reverse);
+}
+
+/**
+ * Determines how many complete clusters (out of those specified by the 'map')
+ * are under delalloc and were reserved quota for.
+ * This function is called when we are writing out the blocks that were
+ * originally written with their allocation delayed, but then the space was
+ * allocated using fallocate() before the delayed allocation could be resolved.
+ * The cases to look for are:
+ * ('=' indicates delayed allocated blocks
+ * '-' indicates non-delayed allocated blocks)
+ * (a) partial clusters towards beginning and/or end outside of allocated range
+ * are not delalloc'ed.
+ * Ex:
+ * |----c---=|====c====|====c====|===-c----|
+ * |++++++ allocated ++++++|
+ * ==> 4 complete clusters in above example
+ *
+ * (b) partial cluster (outside of allocated range) towards either end is
+ * marked for delayed allocation. In this case, we will exclude that
+ * cluster.
+ * Ex:
+ * |----====c========|========c========|
+ * |++++++ allocated ++++++|
+ * ==> 1 complete cluster in above example
+ *
+ * Ex:
+ * |================c================|
+ * |++++++ allocated ++++++|
+ * ==> 0 complete clusters in above example
+ *
+ * The ext4_da_update_reserve_space will be called only if we
+ * determine here that there were some "entire" clusters that span
+ * this 'allocated' range.
+ * In the non-bigalloc case, this function will just end up returning num_blks
+ * without ever calling ext4_find_delalloc_range.
+ */
+static unsigned int
+get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
+ unsigned int num_blks)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ ext4_lblk_t alloc_cluster_start, alloc_cluster_end;
+ ext4_lblk_t lblk_from, lblk_to, c_offset;
+ unsigned int allocated_clusters = 0;
+
+ alloc_cluster_start = EXT4_B2C(sbi, lblk_start);
+ alloc_cluster_end = EXT4_B2C(sbi, lblk_start + num_blks - 1);
+
+ /* max possible clusters for this allocation */
+ allocated_clusters = alloc_cluster_end - alloc_cluster_start + 1;
+
+ /* Check towards left side */
+ c_offset = lblk_start & (sbi->s_cluster_ratio - 1);
+ if (c_offset) {
+ lblk_from = lblk_start & (~(sbi->s_cluster_ratio - 1));
+ lblk_to = lblk_from + c_offset - 1;
+
+ if (ext4_find_delalloc_range(inode, lblk_from, lblk_to, 0))
+ allocated_clusters--;
+ }
+
+ /* Now check towards right. */
+ c_offset = (lblk_start + num_blks) & (sbi->s_cluster_ratio - 1);
+ if (c_offset) {
+ lblk_from = lblk_start + num_blks;
+ lblk_to = lblk_from + (sbi->s_cluster_ratio - c_offset) - 1;
+
+ if (ext4_find_delalloc_range(inode, lblk_from, lblk_to, 0))
+ allocated_clusters--;
+ }
+
+ return allocated_clusters;
+}
+
static int
ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map,
@@ -3328,8 +3532,15 @@ out:
* But fallocate would have already updated quota and block
* count for this offset. So cancel these reservation
*/
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
- ext4_da_update_reserve_space(inode, allocated, 0);
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+ unsigned int reserved_clusters;
+ reserved_clusters = get_reserved_cluster_alloc(inode,
+ map->m_lblk, map->m_len);
+ if (reserved_clusters)
+ ext4_da_update_reserve_space(inode,
+ reserved_clusters,
+ 0);
+ }

map_out:
map->m_flags |= EXT4_MAP_MAPPED;
@@ -3474,6 +3685,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t newblock = 0;
int free_on_err = 0, err = 0, depth, ret;
unsigned int allocated = 0, offset = 0;
+ unsigned int allocated_clusters = 0, reserved_clusters = 0;
unsigned int punched_out = 0;
unsigned int result = 0;
struct ext4_allocation_request ar;
@@ -3489,6 +3701,10 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (ext4_ext_in_cache(inode, map->m_lblk, &newex) &&
((flags & EXT4_GET_BLOCKS_PUNCH_OUT_EXT) == 0)) {
if (!newex.ee_start_lo && !newex.ee_start_hi) {
+ if ((sbi->s_cluster_ratio > 1) &&
+ ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
+ map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
/*
* block isn't allocated yet and
@@ -3499,6 +3715,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* we should allocate requested block */
} else {
/* block is already allocated */
+ if (sbi->s_cluster_ratio > 1)
+ map->m_flags |= EXT4_MAP_FROM_CLUSTER;
newblock = map->m_lblk
- le32_to_cpu(newex.ee_block)
+ ext4_ext_pblock(&newex);
@@ -3633,6 +3851,10 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}
}

+ if ((sbi->s_cluster_ratio > 1) &&
+ ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
+ map->m_flags |= EXT4_MAP_FROM_CLUSTER;
+
/*
* requested block isn't allocated yet;
* we couldn't try to create block if create flag is zero
@@ -3649,6 +3871,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/*
* Okay, we need to do block allocation.
*/
+ map->m_flags &= ~EXT4_MAP_FROM_CLUSTER;
newex.ee_block = cpu_to_le32(map->m_lblk);
cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);

@@ -3660,6 +3883,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(sbi, map, ex, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
+ map->m_flags |= EXT4_MAP_FROM_CLUSTER;
goto got_allocated_blocks;
}

@@ -3680,6 +3904,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
get_implied_cluster_alloc(sbi, map, ex2, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
+ map->m_flags |= EXT4_MAP_FROM_CLUSTER;
goto got_allocated_blocks;
}

@@ -3733,6 +3958,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext_debug("allocate new block: goal %llu, found %llu/%u\n",
ar.goal, newblock, allocated);
free_on_err = 1;
+ allocated_clusters = ar.len;
ar.len = EXT4_C2B(sbi, ar.len) - offset;
if (ar.len > allocated)
ar.len = allocated;
@@ -3789,8 +4015,80 @@ got_allocated_blocks:
* Update reserved blocks/metadata blocks after successful
* block allocation which had been deferred till now.
*/
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
- ext4_da_update_reserve_space(inode, allocated, 1);
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+ /*
+ * Check how many clusters we had reserved for this allocated range.
+ */
+ reserved_clusters = get_reserved_cluster_alloc(inode,
+ map->m_lblk, allocated);
+ if (map->m_flags & EXT4_MAP_FROM_CLUSTER) {
+ if (reserved_clusters) {
+ /*
+ * We have clusters reserved for this range.
+ * But since we are not doing actual allocation
+ * and are simply using blocks from previously
+ * allocated cluster, we should release the
+ * reservation and not claim quota.
+ */
+ ext4_da_update_reserve_space(inode,
+ reserved_clusters, 0);
+ }
+ } else {
+ BUG_ON(allocated_clusters < reserved_clusters);
+ /* We will claim quota for all newly allocated blocks.*/
+ ext4_da_update_reserve_space(inode, allocated_clusters,
+ 1);
+ if (reserved_clusters < allocated_clusters) {
+ int reservation = allocated_clusters -
+ reserved_clusters;
+ /*
+ * It seems we claimed a few clusters outside of
+ * the range of this allocation. We should give
+ * it back to the reservation pool. This can
+ * happen in the following case:
+ *
+ * * Suppose s_cluster_ratio is 4 (i.e., each
+ * cluster has 4 blocks. Thus, the clusters
+ * are [0-3],[4-7],[8-11]...
+ * * First comes delayed allocation write for
+ * logical blocks 10 & 11. Since there were no
+ * previous delayed allocated blocks in the
+ * range [8-11], we would reserve 1 cluster
+ * for this write.
+ * * Next comes write for logical blocks 3 to 8.
+ * In this case, we will reserve 2 clusters
+ * (for [0-3] and [4-7]), and not for [8-11] as
+ * that range has delayed allocated blocks.
+ * Thus total reserved clusters now becomes 3.
+ * * Now, during the delayed allocation writeout
+ * time, we will first write blocks [3-8] and
+ * allocate 3 clusters for writing these
+ * blocks. Also, we would claim all these
+ * three clusters above.
+ * * Now when we come here to writeout the
+ * blocks [10-11], we would expect to claim
+ * the reservation of 1 cluster we had made
+ * (and we would claim it since there are no
+ * more delayed allocated blocks in the range
+ * [8-11]). But our reserved cluster count had
+ * already gone to 0.
+ *
+ * Thus, at the last step above, when we determine
+ * that there are still some unwritten delayed
+ * allocated blocks outside of our current
+ * block range, we should increment the
+ * reserved clusters count so that when the
+ * remaining blocks finally get written, we
+ * could claim them.
+ */
+ while (reservation) {
+ ext4_da_reserve_space(inode,
+ map->m_lblk);
+ reservation--;
+ }
+ }
+ }
+ }

/*
* Cache the extent and update transaction to commit on fdatasync only
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2c0150c..513f4e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -270,14 +270,14 @@ void ext4_da_update_reserve_space(struct inode *inode,

/* Update quota subsystem for data blocks */
if (quota_claim)
- dquot_claim_block(inode, used);
+ dquot_claim_block(inode, EXT4_C2B(sbi, used));
else {
/*
* We did fallocate with an offset that is already delayed
* allocated. So on delayed allocated writeback we should
* not re-claim the quota for fallocated blocks.
*/
- dquot_release_reservation_block(inode, used);
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, used));
}

/*
@@ -1004,14 +1004,14 @@ static int ext4_journalled_write_end(struct file *file,
}

/*
- * Reserve a single block located at lblock
+ * Reserve a single cluster located at lblock
*/
-static int ext4_da_reserve_space(struct inode *inode, ext4_lblk_t lblock)
+int ext4_da_reserve_space(struct inode *inode, ext4_lblk_t lblock)
{
int retries = 0;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);
- unsigned long md_needed;
+ unsigned int md_needed;
int ret;

/*
@@ -1021,7 +1021,8 @@ static int ext4_da_reserve_space(struct inode *inode, ext4_lblk_t lblock)
*/
repeat:
spin_lock(&ei->i_block_reservation_lock);
- md_needed = ext4_calc_metadata_amount(inode, lblock);
+ md_needed = EXT4_NUM_B2C(sbi,
+ ext4_calc_metadata_amount(inode, lblock));
trace_ext4_da_reserve_space(inode, md_needed);
spin_unlock(&ei->i_block_reservation_lock);

@@ -1030,7 +1031,7 @@ repeat:
* us from metadata over-estimation, though we may go over by
* a small amount in the end. Here we just reserve for data.
*/
- ret = dquot_reserve_block(inode, 1);
+ ret = dquot_reserve_block(inode, EXT4_C2B(sbi, 1));
if (ret)
return ret;
/*
@@ -1038,7 +1039,7 @@ repeat:
* we cannot afford to run out of free blocks.
*/
if (ext4_claim_free_blocks(sbi, md_needed + 1, 0)) {
- dquot_release_reservation_block(inode, 1);
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, 1));
if (ext4_should_retry_alloc(inode->i_sb, &retries)) {
yield();
goto repeat;
@@ -1085,6 +1086,8 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
* We can release all of the reserved metadata blocks
* only when we have written all of the delayed
* allocation blocks.
+ * Note that in the case of bigalloc, i_reserved_meta_blocks,
+ * i_reserved_data_blocks, etc., refer to numbers of clusters.
*/
percpu_counter_sub(&sbi->s_dirtyclusters_counter,
ei->i_reserved_meta_blocks);
@@ -1097,7 +1100,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free)

spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);

- dquot_release_reservation_block(inode, to_free);
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
}

static void ext4_da_page_release_reservation(struct page *page,
@@ -1106,6 +1109,9 @@ static void ext4_da_page_release_reservation(struct page *page,
int to_release = 0;
struct buffer_head *head, *bh;
unsigned int curr_off = 0;
+ struct inode *inode = page->mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ int num_clusters;

head = page_buffers(page);
bh = head;
@@ -1118,7 +1124,20 @@ static void ext4_da_page_release_reservation(struct page *page,
}
curr_off = next_off;
} while ((bh = bh->b_this_page) != head);
- ext4_da_release_space(page->mapping->host, to_release);
+
+ /* If we have released all the blocks belonging to a cluster, then we
+ * need to release the reserved space for that cluster. */
+ num_clusters = EXT4_NUM_B2C(sbi, to_release);
+ while (num_clusters > 0) {
+ ext4_fsblk_t lblk;
+ lblk = (page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits)) +
+ ((num_clusters - 1) << sbi->s_cluster_bits);
+ if (sbi->s_cluster_ratio == 1 ||
+ !ext4_find_delalloc_cluster(inode, lblk, 1))
+ ext4_da_release_space(inode, 1);
+
+ num_clusters--;
+ }
}

/*
@@ -1314,7 +1333,8 @@ static void ext4_print_free_blocks(struct inode *inode)
(long long) EXT4_C2B(EXT4_SB(inode->i_sb),
percpu_counter_sum(&sbi->s_freeclusters_counter)));
printk(KERN_CRIT "dirty_blocks=%lld\n",
- (long long) percpu_counter_sum(&sbi->s_dirtyclusters_counter));
+ (long long) EXT4_C2B(EXT4_SB(inode->i_sb),
+ percpu_counter_sum(&sbi->s_dirtyclusters_counter)));
printk(KERN_CRIT "Block reservation details\n");
printk(KERN_CRIT "i_reserved_data_blocks=%u\n",
EXT4_I(inode)->i_reserved_data_blocks);
@@ -1588,10 +1608,14 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
/*
* XXX: __block_write_begin() unmaps passed block, is it OK?
*/
- ret = ext4_da_reserve_space(inode, iblock);
- if (ret)
- /* not enough space to reserve */
- return ret;
+ /* If the block was allocated from a previously allocated cluster,
+ * then we don't need to reserve it again. */
+ if (!(map.m_flags & EXT4_MAP_FROM_CLUSTER)) {
+ ret = ext4_da_reserve_space(inode, iblock);
+ if (ret)
+ /* not enough space to reserve */
+ return ret;
+ }

map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2d23d0a..73fbb99 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4686,6 +4686,8 @@ do_more:

freed += count;

+ dquot_free_block(inode, EXT4_C2B(sbi, count_clusters));
+
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
@@ -4704,8 +4706,6 @@ do_more:
}
ext4_mark_super_dirty(sb);
error_return:
- if (freed)
- dquot_free_block(inode, freed);
brelse(bitmap_bh);
ext4_std_error(sb, err);
return;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7a4d54e..e10062d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2425,7 +2425,8 @@ static ssize_t delayed_allocation_blocks_show(struct ext4_attr *a,
char *buf)
{
return snprintf(buf, PAGE_SIZE, "%llu\n",
- (s64) percpu_counter_sum(&sbi->s_dirtyclusters_counter));
+ (s64) EXT4_C2B(sbi,
+ percpu_counter_sum(&sbi->s_dirtyclusters_counter)));
}

static ssize_t session_write_kbytes_show(struct ext4_attr *a,
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 01/23] ext4: read-only support for bigalloc file systems

This adds read-only support for bigalloc file systems. It teaches the
mount code just enough about the bigalloc superblock fields that it
will mount the file system without freaking out that the number of
blocks per group is too big.
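
To make the superblock arithmetic in the super.c hunk below easier to
follow, here is a minimal user-space sketch of the cluster-geometry
derivation the mount path performs. It is illustrative only, not
kernel code; the variable names mirror the patch, and BLOCK_SIZE is
assumed to be 1024 as in the kernel headers.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: derive cluster parameters the way
 * ext4_fill_super() does for a bigalloc file system. */
static int derive_cluster_geometry(uint32_t s_log_block_size,
				   uint32_t s_log_cluster_size)
{
	unsigned int blocksize = 1024 << s_log_block_size;
	unsigned int clustersize = 1024 << s_log_cluster_size;

	if (clustersize < blocksize)
		return -1;	/* invalid: cluster smaller than block */

	unsigned int cluster_bits = s_log_cluster_size - s_log_block_size;
	unsigned int cluster_ratio = clustersize / blocksize;

	printf("block %u, cluster %u, ratio %u (2^%u blocks/cluster)\n",
	       blocksize, clustersize, cluster_ratio, cluster_bits);
	return 0;
}

int main(void)
{
	/* 4k blocks with 64k clusters gives s_cluster_ratio == 16 */
	return derive_cluster_geometry(2, 6);
}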

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4.h | 17 +++++++++++++--
fs/ext4/super.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 49d2cea..03a0857 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -238,8 +238,11 @@ struct ext4_io_submit {
# define EXT4_BLOCK_SIZE(s) (EXT4_MIN_BLOCK_SIZE << (s)->s_log_block_size)
#endif
#define EXT4_ADDR_PER_BLOCK(s) (EXT4_BLOCK_SIZE(s) / sizeof(__u32))
+#define EXT4_CLUSTER_SIZE(s) (EXT4_BLOCK_SIZE(s) << \
+ EXT4_SB(s)->s_cluster_bits)
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_blocksize_bits)
+# define EXT4_CLUSTER_BITS(s) (EXT4_SB(s)->s_cluster_bits)
#else
# define EXT4_BLOCK_SIZE_BITS(s) ((s)->s_log_block_size + 10)
#endif
@@ -305,6 +308,7 @@ struct flex_groups {
#define EXT4_DESC_SIZE(s) (EXT4_SB(s)->s_desc_size)
#ifdef __KERNEL__
# define EXT4_BLOCKS_PER_GROUP(s) (EXT4_SB(s)->s_blocks_per_group)
+# define EXT4_CLUSTERS_PER_GROUP(s) (EXT4_SB(s)->s_clusters_per_group)
# define EXT4_DESC_PER_BLOCK(s) (EXT4_SB(s)->s_desc_per_block)
# define EXT4_INODES_PER_GROUP(s) (EXT4_SB(s)->s_inodes_per_group)
# define EXT4_DESC_PER_BLOCK_BITS(s) (EXT4_SB(s)->s_desc_per_block_bits)
@@ -964,9 +968,9 @@ struct ext4_super_block {
/*10*/ __le32 s_free_inodes_count; /* Free inodes count */
__le32 s_first_data_block; /* First Data Block */
__le32 s_log_block_size; /* Block size */
- __le32 s_obso_log_frag_size; /* Obsoleted fragment size */
+ __le32 s_log_cluster_size; /* Allocation cluster size */
/*20*/ __le32 s_blocks_per_group; /* # Blocks per group */
- __le32 s_obso_frags_per_group; /* Obsoleted fragments per group */
+ __le32 s_clusters_per_group; /* # Clusters per group */
__le32 s_inodes_per_group; /* # Inodes per group */
__le32 s_mtime; /* Mount time */
/*30*/ __le32 s_wtime; /* Write time */
@@ -1062,7 +1066,10 @@ struct ext4_super_block {
__u8 s_last_error_func[32]; /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
__u8 s_mount_opts[64];
- __le32 s_reserved[112]; /* Padding to the end of the block */
+ __le32 s_usr_quota_inum; /* inode for tracking user quota */
+ __le32 s_grp_quota_inum; /* inode for tracking group quota */
+ __le32 s_overhead_clusters; /* overhead blocks/clusters in fs */
+ __le32 s_reserved[109]; /* Padding to the end of the block */
};

#define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
@@ -1082,6 +1089,7 @@ struct ext4_sb_info {
unsigned long s_desc_size; /* Size of a group descriptor in bytes */
unsigned long s_inodes_per_block;/* Number of inodes per block */
unsigned long s_blocks_per_group;/* Number of blocks in a group */
+ unsigned long s_clusters_per_group; /* Number of clusters in a group */
unsigned long s_inodes_per_group;/* Number of inodes in a group */
unsigned long s_itb_per_group; /* Number of inode table blocks per group */
unsigned long s_gdb_count; /* Number of group descriptor blocks */
@@ -1090,6 +1098,8 @@ struct ext4_sb_info {
ext4_group_t s_blockfile_groups;/* Groups acceptable for non-extent files */
unsigned long s_overhead_last; /* Last calculated overhead */
unsigned long s_blocks_last; /* Last seen block count */
+ unsigned int s_cluster_ratio; /* Number of blocks per cluster */
+ unsigned int s_cluster_bits; /* log2 of s_cluster_ratio */
loff_t s_bitmap_maxbytes; /* max bytes for bitmap files */
struct buffer_head * s_sbh; /* Buffer containing the super block */
struct ext4_super_block *s_es; /* Pointer to the super block in the buffer */
@@ -1352,6 +1362,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
#define EXT4_FEATURE_RO_COMPAT_QUOTA 0x0100
+#define EXT4_FEATURE_RO_COMPAT_BIGALLOC 0x0200

#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9ea71aa..4618f7c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1909,7 +1909,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
res = MS_RDONLY;
}
if (read_only)
- return res;
+ goto done;
if (!(sbi->s_mount_state & EXT4_VALID_FS))
ext4_msg(sb, KERN_WARNING, "warning: mounting unchecked fs, "
"running e2fsck is recommended");
@@ -1940,6 +1940,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);

ext4_commit_super(sb, 1);
+done:
if (test_opt(sb, DEBUG))
printk(KERN_INFO "[EXT4 FS bs=%lu, gc=%u, "
"bpg=%lu, ipg=%lu, mo=%04x, mo2=%04x]\n",
@@ -3057,10 +3058,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
char *cp;
const char *descr;
int ret = -ENOMEM;
- int blocksize;
+ int blocksize, clustersize;
unsigned int db_count;
unsigned int i;
- int needs_recovery, has_huge_files;
+ int needs_recovery, has_huge_files, has_bigalloc;
__u64 blocks_count;
int err;
unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
@@ -3339,12 +3340,53 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_dirt = 1;
}

- if (sbi->s_blocks_per_group > blocksize * 8) {
- ext4_msg(sb, KERN_ERR,
- "#blocks per group too big: %lu",
- sbi->s_blocks_per_group);
- goto failed_mount;
+ /* Handle clustersize */
+ clustersize = BLOCK_SIZE << le32_to_cpu(es->s_log_cluster_size);
+ has_bigalloc = EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC);
+ if (has_bigalloc) {
+ if (clustersize < blocksize) {
+ ext4_msg(sb, KERN_ERR,
+ "cluster size (%d) smaller than "
+ "block size (%d)", clustersize, blocksize);
+ goto failed_mount;
+ }
+ sbi->s_cluster_bits = le32_to_cpu(es->s_log_cluster_size) -
+ le32_to_cpu(es->s_log_block_size);
+ sbi->s_clusters_per_group =
+ le32_to_cpu(es->s_clusters_per_group);
+ if (sbi->s_clusters_per_group > blocksize * 8) {
+ ext4_msg(sb, KERN_ERR,
+ "#clusters per group too big: %lu",
+ sbi->s_clusters_per_group);
+ goto failed_mount;
+ }
+ if (sbi->s_blocks_per_group !=
+ (sbi->s_clusters_per_group * (clustersize / blocksize))) {
+ ext4_msg(sb, KERN_ERR, "blocks per group (%lu) and "
+ "clusters per group (%lu) inconsistent",
+ sbi->s_blocks_per_group,
+ sbi->s_clusters_per_group);
+ goto failed_mount;
+ }
+ } else {
+ if (clustersize != blocksize) {
+ ext4_warning(sb, "fragment/cluster size (%d) != "
+ "block size (%d)", clustersize,
+ blocksize);
+ clustersize = blocksize;
+ }
+ if (sbi->s_blocks_per_group > blocksize * 8) {
+ ext4_msg(sb, KERN_ERR,
+ "#blocks per group too big: %lu",
+ sbi->s_blocks_per_group);
+ goto failed_mount;
+ }
+ sbi->s_clusters_per_group = sbi->s_blocks_per_group;
+ sbi->s_cluster_bits = 0;
}
+ sbi->s_cluster_ratio = clustersize / blocksize;
+
if (sbi->s_inodes_per_group > blocksize * 8) {
ext4_msg(sb, KERN_ERR,
"#inodes per group too big: %lu",
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 11/23] ext4: teach ext4_ext_truncate() about the bigalloc feature

When we are truncating (as opposed to unlinking) a file, we need to
worry about partial truncates, especially in light of sparse files.
The changes here make sure that arbitrary truncates of sparse files
work correctly. Yeah, it's messy.

Note that these functions will need to be revisited when the punch
ioctl is integrated --- in fact this commit will probably have merge
conflicts with the punch changes which Allison Henderson and the IBM
LTC have been working on. I will need to fix this up when either patch
hits mainline.
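
As a rough model of the partial_cluster bookkeeping this patch
introduces, consider the following user-space sketch. It is a toy, not
the kernel logic: B2C() stands in for EXT4_B2C(), a power-of-two
cluster ratio is assumed, and "freeing" is just a printf. Truncate
walks extents from right to left, which is why the deferred cluster is
checked against the last block of the next range removed.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t fsblk_t;

#define CLUSTER_BITS	2			/* 4 blocks per cluster */
#define CLUSTER_RATIO	(1ULL << CLUSTER_BITS)
#define B2C(pblk)	((pblk) >> CLUSTER_BITS)

/*
 * Never free the first, possibly shared, cluster of a removed range
 * right away; remember it, and free it only once a later removal
 * shows that no remaining extent still uses it.
 */
static void remove_range(fsblk_t first, fsblk_t last, fsblk_t *partial)
{
	if (*partial && B2C(last) != *partial) {
		printf("free deferred cluster %llu\n",
		       (unsigned long long) *partial);
		*partial = 0;
	}
	printf("free blocks %llu-%llu\n",
	       (unsigned long long) first, (unsigned long long) last);
	if (first & (CLUSTER_RATIO - 1))	/* starts mid-cluster */
		*partial = B2C(first);
}

int main(void)
{
	fsblk_t partial = 0;

	remove_range(10, 15, &partial);	/* leaves cluster 2 pending */
	remove_range(4, 7, &partial);	/* now cluster 2 can be freed */
	return 0;
}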

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 82 +++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9458586..8fc735f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2192,14 +2192,39 @@ int ext4_ext_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
}

static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
- struct ext4_extent *ex,
- ext4_lblk_t from, ext4_lblk_t to)
+ struct ext4_extent *ex,
+ ext4_fsblk_t *partial_cluster,
+ ext4_lblk_t from, ext4_lblk_t to)
{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
unsigned short ee_len = ext4_ext_get_actual_len(ex);
+ ext4_fsblk_t pblk;
int flags = EXT4_FREE_BLOCKS_FORGET;

if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
flags |= EXT4_FREE_BLOCKS_METADATA;
+ /*
+ * For bigalloc file systems, we never free a partial cluster
+ * at the beginning of the extent. Instead, we make a note
+ * that we tried freeing the cluster, and check to see if we
+ * need to free it on a subsequent call to ext4_remove_blocks,
+ * or at the end of the ext4_truncate() operation.
+ */
+ flags |= EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER;
+
+ /*
+ * If we have a partial cluster, and it's different from the
+ * cluster of the last block, we need to explicitly free the
+ * partial cluster here.
+ */
+ pblk = ext4_ext_pblock(ex) + ee_len - 1;
+ if (*partial_cluster && (EXT4_B2C(sbi, pblk) != *partial_cluster)) {
+ ext4_free_blocks(handle, inode, NULL,
+ EXT4_C2B(sbi, *partial_cluster),
+ sbi->s_cluster_ratio, flags);
+ *partial_cluster = 0;
+ }
+
#ifdef EXTENTS_STATS
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -2219,12 +2244,24 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
&& to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
/* tail removal */
ext4_lblk_t num;
- ext4_fsblk_t start;

num = le32_to_cpu(ex->ee_block) + ee_len - from;
- start = ext4_ext_pblock(ex) + ee_len - num;
- ext_debug("free last %u blocks starting %llu\n", num, start);
- ext4_free_blocks(handle, inode, NULL, start, num, flags);
+ pblk = ext4_ext_pblock(ex) + ee_len - num;
+ ext_debug("free last %u blocks starting %llu\n", num, pblk);
+ ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
+ /*
+ * If the block range to be freed didn't start at the
+ * beginning of a cluster, and we removed the entire
+ * extent, save the partial cluster here, since we
+ * might need to delete it if we determine that the
+ * truncate operation has removed all of the blocks in
+ * the cluster.
+ */
+ if (pblk & (sbi->s_cluster_ratio - 1) &&
+ (ee_len == num))
+ *partial_cluster = EXT4_B2C(sbi, pblk);
+ else
+ *partial_cluster = 0;
} else if (from == le32_to_cpu(ex->ee_block)
&& to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
/* head removal */
@@ -2259,9 +2296,10 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
*/
static int
ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
- struct ext4_ext_path *path, ext4_lblk_t start,
- ext4_lblk_t end)
+ struct ext4_ext_path *path, ext4_fsblk_t *partial_cluster,
+ ext4_lblk_t start, ext4_lblk_t end)
{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
int err = 0, correct_index = 0;
int depth = ext_depth(inode), credits;
struct ext4_extent_header *eh;
@@ -2413,7 +2451,8 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
if (err)
goto out;

- err = ext4_remove_blocks(handle, inode, ex, a, b);
+ err = ext4_remove_blocks(handle, inode, ex, partial_cluster,
+ a, b);
if (err)
goto out;

@@ -2461,7 +2500,8 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
sizeof(struct ext4_extent));
}
le16_add_cpu(&eh->eh_entries, -1);
- }
+ } else
+ *partial_cluster = 0;

ext_debug("new extent: %u:%u:%llu\n", block, num,
ext4_ext_pblock(ex));
@@ -2473,6 +2513,25 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
if (correct_index && eh->eh_entries)
err = ext4_ext_correct_indexes(handle, inode, path);

+ /*
+ * If there is still an entry in the leaf node, check to see if
+ * it references the partial cluster. This is the only place
+ * where it could; if it doesn't, we can free the cluster.
+ */
+ if (*partial_cluster && ex >= EXT_FIRST_EXTENT(eh) &&
+ (EXT4_B2C(sbi, ext4_ext_pblock(ex) + ex_ee_len - 1) !=
+ *partial_cluster)) {
+ int flags = EXT4_FREE_BLOCKS_FORGET;
+
+ if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
+ flags |= EXT4_FREE_BLOCKS_METADATA;
+
+ ext4_free_blocks(handle, inode, NULL,
+ EXT4_C2B(sbi, *partial_cluster),
+ sbi->s_cluster_ratio, flags);
+ *partial_cluster = 0;
+ }
+
/* if this leaf is free, then we should
* remove it from index block above */
if (err == 0 && eh->eh_entries == 0 && path[depth].p_bh != NULL)
@@ -2509,6 +2568,7 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
struct super_block *sb = inode->i_sb;
int depth = ext_depth(inode);
struct ext4_ext_path *path;
+ ext4_fsblk_t partial_cluster = 0;
handle_t *handle;
int i, err;

@@ -2544,7 +2604,7 @@ again:
if (i == depth) {
/* this is leaf block */
err = ext4_ext_rm_leaf(handle, inode, path,
- start, end);
+ &partial_cluster, start, end);
/* root level has p_bh == NULL, brelse() eats this */
brelse(path[i].p_bh);
path[i].p_bh = NULL;
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 10/23] ext4: teach ext4_ext_map_blocks() about the bigalloc feature

If we need to allocate a new block in ext4_ext_map_blocks(), the
function needs to see if the cluster has already been allocated.
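
The heart of that check is cluster arithmetic on block numbers. Below
is a simplified user-space sketch of the test (B2C() stands in for
EXT4_B2C(); the real get_implied_cluster_alloc() in the diff
additionally adjusts map->m_pblk and map->m_len to the usable region):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t lblk_t;

#define CLUSTER_BITS	2		/* assume 4 blocks per cluster */
#define B2C(lblk)	((lblk) >> CLUSTER_BITS)

/*
 * The requested region can reuse an already-allocated cluster iff
 * its first cluster coincides with the first or the last cluster
 * of the neighbouring extent.
 */
static int implied_cluster(lblk_t req_start, lblk_t ex_start,
			   lblk_t ex_len)
{
	lblk_t rr_cluster = B2C(req_start);

	return rr_cluster == B2C(ex_start) ||
	       rr_cluster == B2C(ex_start + ex_len - 1);
}

int main(void)
{
	/* extent covers blocks 8-9 (cluster 2); request starts at 10 */
	printf("implied: %d\n", implied_cluster(10, 8, 2));	/* 1 */
	return 0;
}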

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 181 +++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 162 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 31ae5fb..9458586 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1264,7 +1264,8 @@ static int ext4_ext_search_left(struct inode *inode,
*/
static int ext4_ext_search_right(struct inode *inode,
struct ext4_ext_path *path,
- ext4_lblk_t *logical, ext4_fsblk_t *phys)
+ ext4_lblk_t *logical, ext4_fsblk_t *phys,
+ struct ext4_extent **ret_ex)
{
struct buffer_head *bh = NULL;
struct ext4_extent_header *eh;
@@ -1306,9 +1307,7 @@ static int ext4_ext_search_right(struct inode *inode,
return -EIO;
}
}
- *logical = le32_to_cpu(ex->ee_block);
- *phys = ext4_ext_pblock(ex);
- return 0;
+ goto found_extent;
}

if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
@@ -1321,9 +1320,7 @@ static int ext4_ext_search_right(struct inode *inode,
if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
/* next allocated block in this leaf */
ex++;
- *logical = le32_to_cpu(ex->ee_block);
- *phys = ext4_ext_pblock(ex);
- return 0;
+ goto found_extent;
}

/* go up and search for index to the right */
@@ -1366,9 +1363,12 @@ got_index:
return -EIO;
}
ex = EXT_FIRST_EXTENT(eh);
+found_extent:
*logical = le32_to_cpu(ex->ee_block);
*phys = ext4_ext_pblock(ex);
- put_bh(bh);
+ *ret_ex = ex;
+ if (bh)
+ put_bh(bh);
return 0;
}

@@ -1622,7 +1622,8 @@ static int ext4_ext_try_to_merge(struct inode *inode,
* such that there will be no overlap, and then returns 1.
* If there is no overlap found, it returns 0.
*/
-static unsigned int ext4_ext_check_overlap(struct inode *inode,
+static unsigned int ext4_ext_check_overlap(struct ext4_sb_info *sbi,
+ struct inode *inode,
struct ext4_extent *newext,
struct ext4_ext_path *path)
{
@@ -1636,6 +1637,7 @@ static unsigned int ext4_ext_check_overlap(struct inode *inode,
if (!path[depth].p_ext)
goto out;
b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+ b2 &= ~(sbi->s_cluster_ratio - 1);

/*
* get the next allocated block if the extent in the path
@@ -1645,6 +1647,7 @@ static unsigned int ext4_ext_check_overlap(struct inode *inode,
b2 = ext4_ext_next_allocated_block(path);
if (b2 == EXT_MAX_BLOCKS)
goto out;
+ b2 &= ~(sbi->s_cluster_ratio - 1);
}

/* check for wrap through zero on extent logical start block*/
@@ -3285,6 +3288,106 @@ out2:
}

/*
+ * get_implied_cluster_alloc - check to see if the requested
+ * allocation (in the map structure) overlaps with a cluster already
+ * allocated in an extent.
+ * @sbi The ext4-specific superblock structure
+ * @map The requested lblk->pblk mapping
+ * @ex The extent structure which might contain an implied
+ * cluster allocation
+ *
+ * This function is called by ext4_ext_map_blocks() after we failed to
+ * find blocks that were already in the inode's extent tree. Hence,
+ * we know that the beginning of the requested region cannot overlap
+ * the extent from the inode's extent tree. There are three cases we
+ * want to catch. The first is this case:
+ *
+ * |--- cluster # N--|
+ * |--- extent ---| |---- requested region ---|
+ * |==========|
+ *
+ * The second case that we need to test for is this one:
+ *
+ * |--------- cluster # N ----------------|
+ * |--- requested region --| |------- extent ----|
+ * |=======================|
+ *
+ * The third case is when the requested region lies between two extents
+ * within the same cluster:
+ * |------------- cluster # N-------------|
+ * |----- ex -----| |---- ex_right ----|
+ * |------ requested region ------|
+ * |================|
+ *
+ * In each of the above cases, we need to set the map->m_pblk and
+ * map->m_len so they correspond to the extent labelled as
+ * "|====|" from cluster #N, since it is already in use for data in
+ * cluster EXT4_B2C(sbi, map->m_lblk). We will then return 1 to
+ * signal to ext4_ext_map_blocks() that map->m_pblk should be treated
+ * as a new "allocated" block region. Otherwise, we will return 0 and
+ * ext4_ext_map_blocks() will then allocate one or more new clusters
+ * by calling ext4_mb_new_blocks().
+ */
+static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
+ struct ext4_map_blocks *map,
+ struct ext4_extent *ex,
+ struct ext4_ext_path *path)
+{
+ ext4_lblk_t c_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
+ ext4_lblk_t ex_cluster_start, ex_cluster_end;
+ ext4_lblk_t rr_cluster_start, rr_cluster_end;
+ ext4_lblk_t ee_block = le32_to_cpu(ex->ee_block);
+ ext4_fsblk_t ee_start = ext4_ext_pblock(ex);
+ unsigned short ee_len = ext4_ext_get_actual_len(ex);
+
+ /* The extent passed in that we are trying to match */
+ ex_cluster_start = EXT4_B2C(sbi, ee_block);
+ ex_cluster_end = EXT4_B2C(sbi, ee_block + ee_len - 1);
+
+ /* The requested region passed into ext4_map_blocks() */
+ rr_cluster_start = EXT4_B2C(sbi, map->m_lblk);
+ rr_cluster_end = EXT4_B2C(sbi, map->m_lblk + map->m_len - 1);
+
+ if ((rr_cluster_start == ex_cluster_end) ||
+ (rr_cluster_start == ex_cluster_start)) {
+ if (rr_cluster_start == ex_cluster_end)
+ ee_start += ee_len - 1;
+ map->m_pblk = (ee_start & ~(sbi->s_cluster_ratio - 1)) +
+ c_offset;
+ map->m_len = min(map->m_len,
+ (unsigned) sbi->s_cluster_ratio - c_offset);
+ /*
+ * Check for and handle this case:
+ *
+ * |--------- cluster # N-------------|
+ * |------- extent ----|
+ * |--- requested region ---|
+ * |===========|
+ */
+
+ if (map->m_lblk < ee_block)
+ map->m_len = min(map->m_len, ee_block - map->m_lblk);
+
+ /*
+ * Check for the case where there is already another allocated
+ * block to the right of 'ex' but before the end of the cluster.
+ *
+ * |------------- cluster # N-------------|
+ * |----- ex -----| |---- ex_right ----|
+ * |------ requested region ------|
+ * |================|
+ */
+ if (map->m_lblk > ee_block) {
+ ext4_lblk_t next = ext4_ext_next_allocated_block(path);
+ map->m_len = min(map->m_len, next - map->m_lblk);
+ }
+ return 1;
+ }
+ return 0;
+}
+
+
+/*
* Block allocation/map/preallocation routine for extents based files
*
*
@@ -3306,14 +3409,16 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
struct ext4_ext_path *path = NULL;
- struct ext4_extent newex, *ex;
+ struct ext4_extent newex, *ex, *ex2;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
ext4_fsblk_t newblock = 0;
- int err = 0, depth, ret;
- unsigned int allocated = 0;
+ int free_on_err = 0, err = 0, depth, ret;
+ unsigned int allocated = 0, offset = 0;
unsigned int punched_out = 0;
unsigned int result = 0;
struct ext4_allocation_request ar;
ext4_io_end_t *io = EXT4_I(inode)->cur_aio_dio;
+ ext4_lblk_t cluster_offset;
struct ext4_map_blocks punch_map;

ext_debug("blocks %u/%u requested for inode %lu\n",
@@ -3480,9 +3585,23 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
goto out2;
}
+
/*
* Okay, we need to do block allocation.
*/
+ newex.ee_block = cpu_to_le32(map->m_lblk);
+ cluster_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
+
+ /*
+ * If we are doing bigalloc, check to see if the extent returned
+ * by ext4_ext_find_extent() implies a cluster we can use.
+ */
+ if (cluster_offset && ex &&
+ get_implied_cluster_alloc(sbi, map, ex, path)) {
+ ar.len = allocated = map->m_len;
+ newblock = map->m_pblk;
+ goto got_allocated_blocks;
+ }

/* find neighbour allocated blocks */
ar.lleft = map->m_lblk;
@@ -3490,10 +3609,20 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto out2;
ar.lright = map->m_lblk;
- err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright);
+ ex2 = NULL;
+ err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright, &ex2);
if (err)
goto out2;

+ /* Check if the extent after searching to the right implies a
+ * cluster we can use. */
+ if ((sbi->s_cluster_ratio > 1) && ex2 &&
+ get_implied_cluster_alloc(sbi, map, ex2, path)) {
+ ar.len = allocated = map->m_len;
+ newblock = map->m_pblk;
+ goto got_allocated_blocks;
+ }
+
/*
* See if request is beyond maximum number of blocks we can have in
* a single extent. For an initialized extent this limit is
@@ -3508,9 +3637,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
map->m_len = EXT_UNINIT_MAX_LEN;

/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
- newex.ee_block = cpu_to_le32(map->m_lblk);
newex.ee_len = cpu_to_le16(map->m_len);
- err = ext4_ext_check_overlap(inode, &newex, path);
+ err = ext4_ext_check_overlap(sbi, inode, &newex, path);
if (err)
allocated = ext4_ext_get_actual_len(&newex);
else
@@ -3520,7 +3648,18 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ar.inode = inode;
ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
ar.logical = map->m_lblk;
- ar.len = allocated;
+ /*
+ * We calculate the offset from the beginning of the cluster
+ * for the logical block number, since when we allocate a
+ * physical cluster, the physical block should start at the
+ * same offset from the beginning of the cluster. This is
+ * needed so that future calls to get_implied_cluster_alloc()
+ * work correctly.
+ */
+ offset = map->m_lblk & (sbi->s_cluster_ratio - 1);
+ ar.len = EXT4_NUM_B2C(sbi, offset+allocated);
+ ar.goal -= offset;
+ ar.logical -= offset;
if (S_ISREG(inode->i_mode))
ar.flags = EXT4_MB_HINT_DATA;
else
@@ -3533,9 +3672,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
goto out2;
ext_debug("allocate new block: goal %llu, found %llu/%u\n",
ar.goal, newblock, allocated);
+ free_on_err = 1;
+ ar.len = EXT4_C2B(sbi, ar.len) - offset;
+ if (ar.len > allocated)
+ ar.len = allocated;

+got_allocated_blocks:
/* try to insert new extent into found leaf and return */
- ext4_ext_store_pblock(&newex, newblock);
+ ext4_ext_store_pblock(&newex, newblock + offset);
newex.ee_len = cpu_to_le16(ar.len);
/* Mark uninitialized */
if (flags & EXT4_GET_BLOCKS_UNINIT_EXT){
@@ -3564,7 +3708,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
goto out2;

err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
- if (err) {
+ if (err && free_on_err) {
/* free data blocks we just allocated */
/* not a good idea to call discard here directly,
* but otherwise we'd need to call it every free() */
@@ -4077,7 +4221,6 @@ found_delayed_extent:
return EXT_BREAK;
return EXT_CONTINUE;
}

2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 23/23] ext4: add some tracepoints in ext4/extents.c

From: Aditya Kali <[email protected]>

This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.

Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.

Signed-off-by: Aditya Kali <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 46 +++++-
fs/ext4/inode.c | 2 +-
include/trace/events/ext4.h | 415 +++++++++++++++++++++++++++++++++++++++++--
3 files changed, 443 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 11bd617..0d3ba59 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1964,6 +1964,7 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
struct ext4_ext_cache *cex;
BUG_ON(len == 0);
spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
+ trace_ext4_ext_put_in_cache(inode, block, len, start);
cex = &EXT4_I(inode)->i_cached_extent;
cex->ec_block = block;
cex->ec_len = len;
@@ -2065,6 +2066,7 @@ errout:
sbi->extent_cache_misses++;
else
sbi->extent_cache_hits++;
+ trace_ext4_ext_in_cache(inode, block, ret);
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
return ret;
}
@@ -2127,6 +2129,8 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
if (err)
return err;
ext_debug("index is empty, remove it, free block %llu\n", leaf);
+ trace_ext4_ext_rm_idx(inode, leaf);
+
ext4_free_blocks(handle, inode, NULL, leaf, 1,
EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
return err;
@@ -2212,6 +2216,9 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
*/
flags |= EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER;

+ trace_ext4_remove_blocks(inode, le32_to_cpu(ex->ee_block),
+ ext4_ext_pblock(ex), ee_len, from,
+ to, *partial_cluster);
/*
* If we have a partial cluster, and it's different from the
* cluster of the last block, we need to explicitly free the
@@ -2326,6 +2333,9 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
ex_ee_block = le32_to_cpu(ex->ee_block);
ex_ee_len = ext4_ext_get_actual_len(ex);

+ trace_ext4_ext_rm_leaf(inode, start, ex_ee_block, ext4_ext_pblock(ex),
+ ex_ee_len, *partial_cluster);
+
while (ex >= EXT_FIRST_EXTENT(eh) &&
ex_ee_block + ex_ee_len > start) {

@@ -2582,6 +2592,8 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
again:
ext4_ext_invalidate_cache(inode);

+ trace_ext4_ext_remove_space(inode, start, depth);
+
/*
* We start scanning from right side, freeing all the blocks
* after i_size and walking into the tree depth-wise.
@@ -2676,6 +2688,9 @@ again:
}
}

+ trace_ext4_ext_remove_space_done(inode, start, depth, partial_cluster,
+ path->p_hdr->eh_entries);
+
/* If we still have something in the partial cluster and we have removed
* even the first extent, then we should free the blocks in the partial
* cluster as well. */
@@ -3290,6 +3305,10 @@ static int ext4_find_delalloc_range(struct inode *inode,
* detect that here.
*/
page_cache_release(page);
+ trace_ext4_find_delalloc_range(inode,
+ lblk_start, lblk_end,
+ search_hint_reverse,
+ 0, i);
return 0;
}

@@ -3317,6 +3336,10 @@ static int ext4_find_delalloc_range(struct inode *inode,

if (buffer_delay(bh)) {
page_cache_release(page);
+ trace_ext4_find_delalloc_range(inode,
+ lblk_start, lblk_end,
+ search_hint_reverse,
+ 1, i);
return 1;
}
if (search_hint_reverse)
@@ -3339,6 +3362,8 @@ nextpage:
i = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
}

+ trace_ext4_find_delalloc_range(inode, lblk_start, lblk_end,
+ search_hint_reverse, 0, 0);
return 0;
}

@@ -3404,6 +3429,8 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
/* max possible clusters for this allocation */
allocated_clusters = alloc_cluster_end - alloc_cluster_start + 1;

+ trace_ext4_get_reserved_cluster_alloc(inode, lblk_start, num_blks);
+
/* Check towards left side */
c_offset = lblk_start & (sbi->s_cluster_ratio - 1);
if (c_offset) {
@@ -3443,6 +3470,9 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
flags, allocated);
ext4_ext_show_leaf(inode, path);

+ trace_ext4_ext_handle_uninitialized_extents(inode, map, allocated,
+ newblock);
+
/* get_block() before submit the IO, split the extent */
if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
ret = ext4_split_unwritten_extents(handle, inode, map,
@@ -3562,7 +3592,7 @@ out2:
* get_implied_cluster_alloc - check to see if the requested
* allocation (in the map structure) overlaps with a cluster already
* allocated in an extent.
- * @sbi The ext4-specific superblock structure
+ * @sb The filesystem superblock structure
* @map The requested lblk->pblk mapping
* @ex The extent structure which might contain an implied
* cluster allocation
@@ -3599,11 +3629,12 @@ out2:
* ext4_ext_map_blocks() will then allocate one or more new clusters
* by calling ext4_mb_new_blocks().
*/
-static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
+static int get_implied_cluster_alloc(struct super_block *sb,
struct ext4_map_blocks *map,
struct ext4_extent *ex,
struct ext4_ext_path *path)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_lblk_t c_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
ext4_lblk_t ex_cluster_start, ex_cluster_end;
ext4_lblk_t rr_cluster_start, rr_cluster_end;
@@ -3652,8 +3683,12 @@ static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
ext4_lblk_t next = ext4_ext_next_allocated_block(path);
map->m_len = min(map->m_len, next - map->m_lblk);
}
+
+ trace_ext4_get_implied_cluster_alloc_exit(sb, map, 1);
return 1;
}
+
+ trace_ext4_get_implied_cluster_alloc_exit(sb, map, 0);
return 0;
}

@@ -3762,6 +3797,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* we split out initialized portions during a write.
*/
ee_len = ext4_ext_get_actual_len(ex);
+
+ trace_ext4_ext_show_extent(inode, ee_block, ee_start, ee_len);
+
/* if found extent covers block, simply return it */
if (in_range(map->m_lblk, ee_block, ee_len)) {
newblock = map->m_lblk - ee_block + ee_start;
@@ -3880,7 +3918,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* by ext4_ext_find_extent() implies a cluster we can use.
*/
if (cluster_offset && ex &&
- get_implied_cluster_alloc(sbi, map, ex, path)) {
+ get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
map->m_flags |= EXT4_MAP_FROM_CLUSTER;
@@ -3901,7 +3939,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* Check if the extent after searching to the right implies a
* cluster we can use. */
if ((sbi->s_cluster_ratio > 1) && ex2 &&
- get_implied_cluster_alloc(sbi, map, ex2, path)) {
+ get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
ar.len = allocated = map->m_len;
newblock = map->m_pblk;
map->m_flags |= EXT4_MAP_FROM_CLUSTER;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7ae8dec..2495cfb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -238,7 +238,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
struct ext4_inode_info *ei = EXT4_I(inode);

spin_lock(&ei->i_block_reservation_lock);
- trace_ext4_da_update_reserve_space(inode, used);
+ trace_ext4_da_update_reserve_space(inode, used, quota_claim);
if (unlikely(used > ei->i_reserved_data_blocks)) {
ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
"with only %d reserved data blocks\n",
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5ce2b2f..3368399 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -12,6 +12,7 @@ struct ext4_allocation_request;
struct ext4_prealloc_space;
struct ext4_inode_info;
struct mpage_da_data;
+struct ext4_map_blocks;

#define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))

@@ -1034,19 +1035,20 @@ TRACE_EVENT(ext4_forget,
);

TRACE_EVENT(ext4_da_update_reserve_space,
- TP_PROTO(struct inode *inode, int used_blocks),
+ TP_PROTO(struct inode *inode, int used_blocks, int quota_claim),

- TP_ARGS(inode, used_blocks),
+ TP_ARGS(inode, used_blocks, quota_claim),

TP_STRUCT__entry(
- __field( dev_t, dev )
- __field( ino_t, ino )
- __field( umode_t, mode )
- __field( __u64, i_blocks )
- __field( int, used_blocks )
- __field( int, reserved_data_blocks )
- __field( int, reserved_meta_blocks )
- __field( int, allocated_meta_blocks )
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( umode_t, mode )
+ __field( __u64, i_blocks )
+ __field( int, used_blocks )
+ __field( int, reserved_data_blocks )
+ __field( int, reserved_meta_blocks )
+ __field( int, allocated_meta_blocks )
+ __field( int, quota_claim )
),

TP_fast_assign(
@@ -1055,19 +1057,24 @@ TRACE_EVENT(ext4_da_update_reserve_space,
__entry->mode = inode->i_mode;
__entry->i_blocks = inode->i_blocks;
__entry->used_blocks = used_blocks;
- __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
- __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
- __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
+ __entry->reserved_data_blocks =
+ EXT4_I(inode)->i_reserved_data_blocks;
+ __entry->reserved_meta_blocks =
+ EXT4_I(inode)->i_reserved_meta_blocks;
+ __entry->allocated_meta_blocks =
+ EXT4_I(inode)->i_allocated_meta_blocks;
+ __entry->quota_claim = quota_claim;
),

TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d "
"reserved_data_blocks %d reserved_meta_blocks %d "
- "allocated_meta_blocks %d",
+ "allocated_meta_blocks %d quota_claim %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->mode, __entry->i_blocks,
__entry->used_blocks, __entry->reserved_data_blocks,
- __entry->reserved_meta_blocks, __entry->allocated_meta_blocks)
+ __entry->reserved_meta_blocks, __entry->allocated_meta_blocks,
+ __entry->quota_claim)
);

TRACE_EVENT(ext4_da_reserve_space,
@@ -1520,6 +1527,384 @@ TRACE_EVENT(ext4_load_inode,
(unsigned long) __entry->ino)
);

+TRACE_EVENT(ext4_ext_handle_uninitialized_extents,
+ TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
+ unsigned int allocated, ext4_fsblk_t newblock),
+
+ TP_ARGS(inode, map, allocated, newblock),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_fsblk_t, pblk )
+ __field( unsigned int, len )
+ __field( int, flags )
+ __field( unsigned int, allocated )
+ __field( ext4_fsblk_t, newblk )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->lblk = map->m_lblk;
+ __entry->pblk = map->m_pblk;
+ __entry->len = map->m_len;
+ __entry->flags = map->m_flags;
+ __entry->allocated = allocated;
+ __entry->newblk = newblock;
+ ),
+
+ TP_printk("dev %d,%d ino %lu m_lblk %u m_pblk %llu m_len %u flags %d"
+ "allocated %d newblock %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->lblk, (unsigned long long) __entry->pblk,
+ __entry->len, __entry->flags,
+ (unsigned int) __entry->allocated,
+ (unsigned long long) __entry->newblk)
+);
+
+TRACE_EVENT(ext4_get_implied_cluster_alloc_exit,
+ TP_PROTO(struct super_block *sb, struct ext4_map_blocks *map, int ret),
+
+ TP_ARGS(sb, map, ret),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_fsblk_t, pblk )
+ __field( unsigned int, len )
+ __field( unsigned int, flags )
+ __field( int, ret )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->lblk = map->m_lblk;
+ __entry->pblk = map->m_pblk;
+ __entry->len = map->m_len;
+ __entry->flags = map->m_flags;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("dev %d,%d m_lblk %u m_pblk %llu m_len %u m_flags %u ret %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->lblk, (unsigned long long) __entry->pblk,
+ __entry->len, __entry->flags, __entry->ret)
+);
+
+TRACE_EVENT(ext4_ext_put_in_cache,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, unsigned int len,
+ ext4_fsblk_t start),
+
+ TP_ARGS(inode, lblk, len, start),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( unsigned int, len )
+ __field( ext4_fsblk_t, start )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->lblk = lblk;
+ __entry->len = len;
+ __entry->start = start;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u len %u start %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->lblk,
+ __entry->len,
+ (unsigned long long) __entry->start)
+);
+
+TRACE_EVENT(ext4_ext_in_cache,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, int ret),
+
+ TP_ARGS(inode, lblk, ret),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( int, ret )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->lblk = lblk;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u ret %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->lblk,
+ __entry->ret)
+
+);
+
+TRACE_EVENT(ext4_find_delalloc_range,
+ TP_PROTO(struct inode *inode, ext4_lblk_t from, ext4_lblk_t to,
+ int reverse, int found, ext4_lblk_t found_blk),
+
+ TP_ARGS(inode, from, to, reverse, found, found_blk),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, from )
+ __field( ext4_lblk_t, to )
+ __field( int, reverse )
+ __field( int, found )
+ __field( ext4_lblk_t, found_blk )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->from = from;
+ __entry->to = to;
+ __entry->reverse = reverse;
+ __entry->found = found;
+ __entry->found_blk = found_blk;
+ ),
+
+ TP_printk("dev %d,%d ino %lu from %u to %u reverse %d found %d "
+ "(blk = %u)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->from, (unsigned) __entry->to,
+ __entry->reverse, __entry->found,
+ (unsigned) __entry->found_blk)
+);
+
+TRACE_EVENT(ext4_get_reserved_cluster_alloc,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, unsigned int len),
+
+ TP_ARGS(inode, lblk, len),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( unsigned int, len )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->lblk = lblk;
+ __entry->len = len;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u len %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->lblk,
+ __entry->len)
+);
+
+TRACE_EVENT(ext4_ext_show_extent,
+ TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
+ unsigned short len),
+
+ TP_ARGS(inode, lblk, pblk, len),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, lblk )
+ __field( ext4_fsblk_t, pblk )
+ __field( unsigned short, len )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->lblk = lblk;
+ __entry->pblk = pblk;
+ __entry->len = len;
+ ),
+
+ TP_printk("dev %d,%d ino %lu lblk %u pblk %llu len %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->lblk,
+ (unsigned long long) __entry->pblk,
+ (unsigned short) __entry->len)
+);
+
+TRACE_EVENT(ext4_remove_blocks,
+ TP_PROTO(struct inode *inode, ext4_lblk_t ex_lblk,
+ ext4_fsblk_t ex_pblk, unsigned short ex_len,
+ ext4_lblk_t from, ext4_fsblk_t to,
+ ext4_fsblk_t partial_cluster),
+
+ TP_ARGS(inode, ex_lblk, ex_pblk, ex_len, from, to, partial_cluster),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, ee_lblk )
+ __field( ext4_fsblk_t, ee_pblk )
+ __field( unsigned short, ee_len )
+ __field( ext4_lblk_t, from )
+ __field( ext4_lblk_t, to )
+ __field( ext4_fsblk_t, partial )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ee_lblk = ex_lblk;
+ __entry->ee_pblk = ex_pblk;
+ __entry->ee_len = ex_len;
+ __entry->from = from;
+ __entry->to = to;
+ __entry->partial = partial_cluster;
+ ),
+
+ TP_printk("dev %d,%d ino %lu extent [%u(%llu), %u]"
+ "from %u to %u partial_cluster %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->ee_lblk,
+ (unsigned long long) __entry->ee_pblk,
+ (unsigned short) __entry->ee_len,
+ (unsigned) __entry->from,
+ (unsigned) __entry->to,
+ (unsigned) __entry->partial)
+);
+
+TRACE_EVENT(ext4_ext_rm_leaf,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start,
+ ext4_lblk_t ex_lblk, ext4_fsblk_t ex_pblk,
+ unsigned short ex_len, ext4_fsblk_t partial_cluster),
+
+ TP_ARGS(inode, start, ex_lblk, ex_pblk, ex_len, partial_cluster),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, ee_lblk )
+ __field( ext4_fsblk_t, ee_pblk )
+ __field( short, ee_len )
+ __field( ext4_fsblk_t, partial )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->start = start;
+ __entry->ee_lblk = ex_lblk;
+ __entry->ee_pblk = ex_pblk;
+ __entry->ee_len = ex_len;
+ __entry->partial = partial_cluster;
+ ),
+
+ TP_printk("dev %d,%d ino %lu start_lblk %u last_extent [%u(%llu), %u]"
+ "partial_cluster %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->start,
+ (unsigned) __entry->ee_lblk,
+ (unsigned long long) __entry->ee_pblk,
+ (unsigned short) __entry->ee_len,
+ (unsigned) __entry->partial)
+);
+
+TRACE_EVENT(ext4_ext_rm_idx,
+ TP_PROTO(struct inode *inode, ext4_fsblk_t pblk),
+
+ TP_ARGS(inode, pblk),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_fsblk_t, pblk )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->pblk = pblk;
+ ),
+
+ TP_printk("dev %d,%d ino %lu index_pblk %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned long long) __entry->pblk)
+);
+
+TRACE_EVENT(ext4_ext_remove_space,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth),
+
+ TP_ARGS(inode, start, depth),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, start )
+ __field( int, depth )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->start = start;
+ __entry->depth = depth;
+ ),
+
+ TP_printk("dev %d,%d ino %lu since %u depth %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->start,
+ __entry->depth)
+);
+
+TRACE_EVENT(ext4_ext_remove_space_done,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth,
+ ext4_lblk_t partial, unsigned short eh_entries),
+
+ TP_ARGS(inode, start, depth, partial, eh_entries),
+
+ TP_STRUCT__entry(
+ __field( ino_t, ino )
+ __field( dev_t, dev )
+ __field( ext4_lblk_t, start )
+ __field( int, depth )
+ __field( ext4_lblk_t, partial )
+ __field( unsigned short, eh_entries )
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->start = start;
+ __entry->depth = depth;
+ __entry->partial = partial;
+ __entry->eh_entries = eh_entries;
+ ),
+
+ TP_printk("dev %d,%d ino %lu since %u depth %d partial %u "
+ "remaining_entries %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ (unsigned) __entry->start,
+ __entry->depth,
+ (unsigned) __entry->partial,
+ (unsigned short) __entry->eh_entries)
+);
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 18/23] ext4: Rename ext4_free_blks_{count,set}() to refer to clusters

The bg_free_blocks_count_{lo,hi} fields in the block group descriptor
have been repurposed to hold the number of free clusters for bigalloc
file systems. So rename the functions that access them, to make the
block allocation and block freeing code easier to read and audit.

Note: at this point in bigalloc development we don't support online
resizing, so this also makes it really obvious all of the places we
need to fix up to add support for online resizing.
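
For readers auditing the renamed accessors, this is roughly what they
do: the count is split across two little-endian 16-bit halves of the
group descriptor, and the high half only exists with 64-bit-capable
descriptors. A simplified user-space sketch follows (the struct is
reduced to the two count fields and the le16_to_cpu() conversions are
omitted, so it assumes a little-endian host):

#include <stdint.h>
#include <stdio.h>

/* Reduced group descriptor: just the free-count halves. */
struct gdesc {
	uint16_t bg_free_blocks_count_lo;	/* now counts clusters */
	uint16_t bg_free_blocks_count_hi;
};

static uint32_t free_group_clusters(const struct gdesc *bg, int has_hi)
{
	uint32_t count = bg->bg_free_blocks_count_lo;

	if (has_hi)		/* only with 64-bit descriptors */
		count |= (uint32_t) bg->bg_free_blocks_count_hi << 16;
	return count;
}

int main(void)
{
	struct gdesc bg = {
		.bg_free_blocks_count_lo = 0x1234,
		.bg_free_blocks_count_hi = 0x0001,
	};

	printf("free clusters: 0x%x\n", free_group_clusters(&bg, 1));
	return 0;	/* prints 0x11234 */
}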

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 8 ++++----
fs/ext4/ext4.h | 9 +++++----
fs/ext4/ialloc.c | 14 +++++++-------
fs/ext4/mballoc.c | 22 +++++++++++-----------
fs/ext4/resize.c | 2 +-
fs/ext4/super.c | 10 +++++-----
6 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index bf42b32..84dc3d6 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -168,7 +168,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
* essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
ext4_error(sb, "Checksum bad for group %u", block_group);
- ext4_free_blks_set(sb, gdp, 0);
+ ext4_free_group_clusters_set(sb, gdp, 0);
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
@@ -550,7 +550,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
- desc_count += ext4_free_blks_count(sb, gdp);
+ desc_count += ext4_free_group_clusters(sb, gdp);
brelse(bitmap_bh);
bitmap_bh = ext4_read_block_bitmap(sb, i);
if (bitmap_bh == NULL)
@@ -558,7 +558,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)

x = ext4_count_free(bitmap_bh, sb->s_blocksize);
printk(KERN_DEBUG "group %u: stored = %d, counted = %u\n",
- i, ext4_free_blks_count(sb, gdp), x);
+ i, ext4_free_group_clusters(sb, gdp), x);
bitmap_count += x;
}
brelse(bitmap_bh);
@@ -572,7 +572,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
- desc_count += ext4_free_blks_count(sb, gdp);
+ desc_count += ext4_free_group_clusters(sb, gdp);
}

return desc_count;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d79e773..290b7e4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1954,8 +1954,8 @@ extern ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_table(struct super_block *sb,
struct ext4_group_desc *bg);
-extern __u32 ext4_free_blks_count(struct super_block *sb,
- struct ext4_group_desc *bg);
+extern __u32 ext4_free_group_clusters(struct super_block *sb,
+ struct ext4_group_desc *bg);
extern __u32 ext4_free_inodes_count(struct super_block *sb,
struct ext4_group_desc *bg);
extern __u32 ext4_used_dirs_count(struct super_block *sb,
@@ -1968,8 +1968,9 @@ extern void ext4_inode_bitmap_set(struct super_block *sb,
struct ext4_group_desc *bg, ext4_fsblk_t blk);
extern void ext4_inode_table_set(struct super_block *sb,
struct ext4_group_desc *bg, ext4_fsblk_t blk);
-extern void ext4_free_blks_set(struct super_block *sb,
- struct ext4_group_desc *bg, __u32 count);
+extern void ext4_free_group_clusters_set(struct super_block *sb,
+ struct ext4_group_desc *bg,
+ __u32 count);
extern void ext4_free_inodes_set(struct super_block *sb,
struct ext4_group_desc *bg, __u32 count);
extern void ext4_used_dirs_set(struct super_block *sb,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 048ee7f..aa74ca6 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -78,7 +78,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
* allocation, essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
ext4_error(sb, "Checksum bad for group %u", block_group);
- ext4_free_blks_set(sb, gdp, 0);
+ ext4_free_group_clusters_set(sb, gdp, 0);
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
@@ -322,8 +322,8 @@ static int find_group_dir(struct super_block *sb, struct inode *parent,
if (ext4_free_inodes_count(sb, desc) < avefreei)
continue;
if (!best_desc ||
- (ext4_free_blks_count(sb, desc) >
- ext4_free_blks_count(sb, best_desc))) {
+ (ext4_free_group_clusters(sb, desc) >
+ ext4_free_group_clusters(sb, best_desc))) {
*best_group = group;
best_desc = desc;
ret = 0;
@@ -434,7 +434,7 @@ static void get_orlov_stats(struct super_block *sb, ext4_group_t g,
desc = ext4_get_group_desc(sb, g, NULL);
if (desc) {
stats->free_inodes = ext4_free_inodes_count(sb, desc);
- stats->free_clusters = ext4_free_blks_count(sb, desc);
+ stats->free_clusters = ext4_free_group_clusters(sb, desc);
stats->used_dirs = ext4_used_dirs_count(sb, desc);
} else {
stats->free_inodes = 0;
@@ -662,7 +662,7 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
*group = parent_group;
desc = ext4_get_group_desc(sb, *group, NULL);
if (desc && ext4_free_inodes_count(sb, desc) &&
- ext4_free_blks_count(sb, desc))
+ ext4_free_group_clusters(sb, desc))
return 0;

/*
@@ -686,7 +686,7 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
*group -= ngroups;
desc = ext4_get_group_desc(sb, *group, NULL);
if (desc && ext4_free_inodes_count(sb, desc) &&
- ext4_free_blks_count(sb, desc))
+ ext4_free_group_clusters(sb, desc))
return 0;
}

@@ -960,7 +960,7 @@ got:
ext4_lock_group(sb, group);
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_blks_set(sb, gdp,
+ ext4_free_group_clusters_set(sb, gdp,
ext4_free_blocks_after_init(sb, group, gdp));
gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
gdp);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 73fbb99..852f303 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2254,7 +2254,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
ext4_free_blocks_after_init(sb, group, desc);
} else {
meta_group_info[i]->bb_free =
- ext4_free_blks_count(sb, desc);
+ ext4_free_group_clusters(sb, desc);
}

INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
@@ -2770,7 +2770,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
goto out_err;

ext4_debug("using block group %u(%d)\n", ac->ac_b_ex.fe_group,
- ext4_free_blks_count(sb, gdp));
+ ext4_free_group_clusters(sb, gdp));

err = ext4_journal_get_write_access(handle, gdp_bh);
if (err)
@@ -2809,12 +2809,12 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
mb_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,ac->ac_b_ex.fe_len);
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_blks_set(sb, gdp,
- ext4_free_blocks_after_init(sb,
- ac->ac_b_ex.fe_group, gdp));
+ ext4_free_group_clusters_set(sb, gdp,
+ ext4_free_blocks_after_init(sb,
+ ac->ac_b_ex.fe_group, gdp));
}
- len = ext4_free_blks_count(sb, gdp) - ac->ac_b_ex.fe_len;
- ext4_free_blks_set(sb, gdp, len);
+ len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
+ ext4_free_group_clusters_set(sb, gdp, len);
gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);

ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
@@ -4670,8 +4670,8 @@ do_more:
mb_free_blocks(inode, &e4b, bit, count_clusters);
}

- ret = ext4_free_blks_count(sb, gdp) + count_clusters;
- ext4_free_blks_set(sb, gdp, ret);
+ ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
+ ext4_free_group_clusters_set(sb, gdp, ret);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters);
@@ -4800,8 +4800,8 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
ext4_lock_group(sb, block_group);
mb_clear_bits(bitmap_bh->b_data, bit, count);
mb_free_blocks(NULL, &e4b, bit, count);
- blk_free_count = blocks_freed + ext4_free_blks_count(sb, desc);
- ext4_free_blks_set(sb, desc, blk_free_count);
+ blk_free_count = blocks_freed + ext4_free_group_clusters(sb, desc);
+ ext4_free_group_clusters_set(sb, desc, blk_free_count);
desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeclusters_counter,
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index cef49d3..14b66ba 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -853,7 +853,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
ext4_block_bitmap_set(sb, gdp, input->block_bitmap); /* LV FIXME */
ext4_inode_bitmap_set(sb, gdp, input->inode_bitmap); /* LV FIXME */
ext4_inode_table_set(sb, gdp, input->inode_table); /* LV FIXME */
- ext4_free_blks_set(sb, gdp, input->free_blocks_count);
+ ext4_free_group_clusters_set(sb, gdp, input->free_blocks_count);
ext4_free_inodes_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_ZEROED);
gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e10062d..3f02afa 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -134,8 +134,8 @@ ext4_fsblk_t ext4_inode_table(struct super_block *sb,
(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_table_hi) << 32 : 0);
}

-__u32 ext4_free_blks_count(struct super_block *sb,
- struct ext4_group_desc *bg)
+__u32 ext4_free_group_clusters(struct super_block *sb,
+ struct ext4_group_desc *bg)
{
return le16_to_cpu(bg->bg_free_blocks_count_lo) |
(EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
@@ -190,8 +190,8 @@ void ext4_inode_table_set(struct super_block *sb,
bg->bg_inode_table_hi = cpu_to_le32(blk >> 32);
}

-void ext4_free_blks_set(struct super_block *sb,
- struct ext4_group_desc *bg, __u32 count)
+void ext4_free_group_clusters_set(struct super_block *sb,
+ struct ext4_group_desc *bg, __u32 count)
{
bg->bg_free_blocks_count_lo = cpu_to_le16((__u16)count);
if (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT)
@@ -1994,7 +1994,7 @@ static int ext4_fill_flex_info(struct super_block *sb)
flex_group = ext4_flex_group(sbi, i);
atomic_add(ext4_free_inodes_count(sb, gdp),
&sbi->s_flex_groups[flex_group].free_inodes);
- atomic_add(ext4_free_blks_count(sb, gdp),
+ atomic_add(ext4_free_group_clusters(sb, gdp),
&sbi->s_flex_groups[flex_group].free_clusters);
atomic_add(ext4_used_dirs_count(sb, gdp),
&sbi->s_flex_groups[flex_group].used_dirs);
--
1.7.4.1.22.gec8e1.dirty


2011-07-06 16:36:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 02/23] ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)

At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl. This simplifies the initial implementation.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/indirect.c | 7 +++++++
fs/ext4/inode.c | 5 +++++
fs/ext4/ioctl.c | 33 +++++++++++++++++++++++++++++----
fs/ext4/super.c | 7 +++++++
4 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..9572827 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -699,6 +699,13 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
/*
* Okay, we need to do block allocation.
*/
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ EXT4_ERROR_INODE(inode, "Can't allocate blocks for "
+ "non-extent mapped inodes with bigalloc");
+ return -ENOSPC;
+ }
+
goal = ext4_find_goal(inode, map->m_lblk, partial);

/* the number of blocks need to allocate for [d,t]indirect blocks */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index de50b16..7d1e9e0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3060,6 +3060,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
return -ENOTSUPP;
}

+ if (EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) {
+ /* TODO: Add support for bigalloc file systems */
+ return -ENOTSUPP;
+ }
+
return ext4_ext_punch_hole(file, offset, length);
}

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 808c554..dcd0916 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -21,6 +21,7 @@
long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned int flags;

@@ -183,7 +184,6 @@ setversion_out:
* Returns 1 if it slept, else zero.
*/
{
- struct super_block *sb = inode->i_sb;
DECLARE_WAITQUEUE(wait, current);
int ret = 0;

@@ -199,7 +199,6 @@ setversion_out:
#endif
case EXT4_IOC_GROUP_EXTEND: {
ext4_fsblk_t n_blocks_count;
- struct super_block *sb = inode->i_sb;
int err, err2=0;

if (!capable(CAP_SYS_RESOURCE))
@@ -212,6 +211,13 @@ setversion_out:
if (err)
return err;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online resizing not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -252,6 +258,13 @@ setversion_out:
if (err)
goto mext_out;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_move_extents(filp, donor_filp, me.orig_start,
me.donor_start, me.len, &me.moved_len);
mnt_drop_write(filp->f_path.mnt);
@@ -268,7 +281,6 @@ mext_out:

case EXT4_IOC_GROUP_ADD: {
struct ext4_new_group_data input;
- struct super_block *sb = inode->i_sb;
int err, err2=0;

if (!capable(CAP_SYS_RESOURCE))
@@ -282,6 +294,13 @@ mext_out:
if (err)
return err;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online resizing not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
err = ext4_group_add(sb, &input);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -333,7 +352,6 @@ mext_out:

case FITRIM:
{
- struct super_block *sb = inode->i_sb;
struct request_queue *q = bdev_get_queue(sb->s_bdev);
struct fstrim_range range;
int ret = 0;
@@ -344,6 +362,13 @@ mext_out:
if (!blk_queue_discard(q))
return -EOPNOTSUPP;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_BIGALLOC)) {
+ ext4_msg(sb, KERN_ERR,
+ "FITRIM not supported with bigalloc");
+ return -EOPNOTSUPP;
+ }
+
if (copy_from_user(&range, (struct fstrim_range *)arg,
sizeof(range)))
return -EFAULT;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4618f7c1..231aac2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2653,6 +2653,13 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
return 0;
}
}
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC) &&
+ !EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
+ ext4_msg(sb, KERN_ERR,
+ "Can't support bigalloc feature without "
+ "extents feature\n");
+ return 0;
+ }
return 1;
}

--
1.7.4.1.22.gec8e1.dirty


2011-07-06 18:12:37

by Eric Gouriou

[permalink] [raw]
Subject: Re: [PATCH 23/23] ext4: add some tracepoints in ext4/extents.c

On Wed, Jul 6, 2011 at 09:36, Theodore Ts'o <[email protected]> wrote:
> From: Aditya Kali <[email protected]>
>
> This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
> ext4/inode.c.
>
> Tested: Built and ran the kernel and verified that these tracepoints work.
> Also ran xfstests.
>
> Signed-off-by: Aditya Kali <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
>  fs/ext4/extents.c           |   46 +++++-
>  fs/ext4/inode.c             |    2 +-
>  include/trace/events/ext4.h |  415 +++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 443 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 11bd617..0d3ba59 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1964,6 +1964,7 @@ ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block,
>        struct ext4_ext_cache *cex;
>        BUG_ON(len == 0);
>        spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
> +       trace_ext4_ext_put_in_cache(inode, block, len, start);
>        cex = &EXT4_I(inode)->i_cached_extent;
>        cex->ec_block = block;
>        cex->ec_len = len;
> @@ -2065,6 +2066,7 @@ errout:
>                sbi->extent_cache_misses++;
>        else
>                sbi->extent_cache_hits++;
> +       trace_ext4_ext_in_cache(inode, block, ret);
>        spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
>        return ret;
>  }
> @@ -2127,6 +2129,8 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
>        if (err)
>                return err;
>        ext_debug("index is empty, remove it, free block %llu\n", leaf);
> +       trace_ext4_ext_rm_idx(inode, leaf);
> +
>        ext4_free_blocks(handle, inode, NULL, leaf, 1,
>                         EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
>        return err;
> @@ -2212,6 +2216,9 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
>         */
>        flags |= EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER;
>
> +       trace_ext4_remove_blocks(inode, cpu_to_le32(ex->ee_block),
> +                                ext4_ext_pblock(ex), ee_len, from,
> +                                to, *partial_cluster);

In all the trace points that capture the logical block/physical
block/length of an extent, it would be nicer to pass the extent itself
as a parameter and perform the extraction in the TP_fast_assign()
block. This requires adding an #include <ext4_extents.h> above the
include of <trace/events/ext4.h> in fs/ext4/super.c in order to
have ext4_ext_pblock() and ext4_ext_get_actual_len() available.
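
For instance, here is a rough, untested sketch of what
ext4_ext_show_extent() might look like with the extraction moved into
the tracepoint (the field names and TP_printk format are copied from
the patch above, so the record format would stay the same):

TRACE_EVENT(ext4_ext_show_extent,
	TP_PROTO(struct inode *inode, struct ext4_extent *ex),

	TP_ARGS(inode, ex),

	TP_STRUCT__entry(
		__field(	ino_t,		ino	)
		__field(	dev_t,		dev	)
		__field(	ext4_lblk_t,	lblk	)
		__field(	ext4_fsblk_t,	pblk	)
		__field(	unsigned short,	len	)
	),

	TP_fast_assign(
		__entry->ino	= inode->i_ino;
		__entry->dev	= inode->i_sb->s_dev;
		/* extraction moved out of the caller and into the
		 * tracepoint, so it only runs when tracing is enabled */
		__entry->lblk	= le32_to_cpu(ex->ee_block);
		__entry->pblk	= ext4_ext_pblock(ex);
		__entry->len	= ext4_ext_get_actual_len(ex);
	),

	TP_printk("dev %d,%d ino %lu lblk %u pblk %llu len %u",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long) __entry->ino,
		  (unsigned) __entry->lblk,
		  (unsigned long long) __entry->pblk,
		  (unsigned short) __entry->len)
);

and the call site in ext4_ext_map_blocks() would then shrink to just
trace_ext4_ext_show_extent(inode, ex).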

By itself this should not be grounds to respin the whole patch series;
a follow-on cleanup patch would do (I volunteer to create one if
desired). Fortunately this wouldn't cause the trace record format to
change, so it wouldn't cause any backward/forward compatibility issue.

Regards - Eric

>        /*
>         * If we have a partial cluster, and it's different from the
>         * cluster of the last block, we need to explicitly free the
> @@ -2326,6 +2333,9 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
>        ex_ee_block = le32_to_cpu(ex->ee_block);
>        ex_ee_len = ext4_ext_get_actual_len(ex);
>
> +       trace_ext4_ext_rm_leaf(inode, start, ex_ee_block, ext4_ext_pblock(ex),
> +                              ex_ee_len, *partial_cluster);
> +
>        while (ex >= EXT_FIRST_EXTENT(eh) &&
>                        ex_ee_block + ex_ee_len > start) {
>
> @@ -2582,6 +2592,8 @@ static int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>  again:
>        ext4_ext_invalidate_cache(inode);
>
> +       trace_ext4_ext_remove_space(inode, start, depth);
> +
>        /*
>         * We start scanning from right side, freeing all the blocks
>         * after i_size and walking into the tree depth-wise.
> @@ -2676,6 +2688,9 @@ again:
>                }
>        }
>
> +       trace_ext4_ext_remove_space_done(inode, start, depth, partial_cluster,
> +                       path->p_hdr->eh_entries);
> +
>        /* If we still have something in the partial cluster and we have removed
>         * even the first extent, then we should free the blocks in the partial
>         * cluster as well. */
> @@ -3290,6 +3305,10 @@ static int ext4_find_delalloc_range(struct inode *inode,
>                         * detect that here.
>                         */
>                        page_cache_release(page);
> +                       trace_ext4_find_delalloc_range(inode,
> +                                       lblk_start, lblk_end,
> +                                       search_hint_reverse,
> +                                       0, i);
>                        return 0;
>                }
>
> @@ -3317,6 +3336,10 @@ static int ext4_find_delalloc_range(struct inode *inode,
>
>                        if (buffer_delay(bh)) {
>                                page_cache_release(page);
> +                               trace_ext4_find_delalloc_range(inode,
> +                                               lblk_start, lblk_end,
> +                                               search_hint_reverse,
> +                                               1, i);
>                                return 1;
>                        }
>                        if (search_hint_reverse)
> @@ -3339,6 +3362,8 @@ nextpage:
>                i = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
>        }
>
> +       trace_ext4_find_delalloc_range(inode, lblk_start, lblk_end,
> +                                       search_hint_reverse, 0, 0);
>        return 0;
>  }
>
> @@ -3404,6 +3429,8 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
>        /* max possible clusters for this allocation */
>        allocated_clusters = alloc_cluster_end - alloc_cluster_start + 1;
>
> +       trace_ext4_get_reserved_cluster_alloc(inode, lblk_start, num_blks);
> +
>        /* Check towards left side */
>        c_offset = lblk_start & (sbi->s_cluster_ratio - 1);
>        if (c_offset) {
> @@ -3443,6 +3470,9 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode,
>                  flags, allocated);
>        ext4_ext_show_leaf(inode, path);
>
> +       trace_ext4_ext_handle_uninitialized_extents(inode, map, allocated,
> +                                                   newblock);
> +
>        /* get_block() before submit the IO, split the extent */
>        if ((flags & EXT4_GET_BLOCKS_PRE_IO)) {
>                ret = ext4_split_unwritten_extents(handle, inode, map,
> @@ -3562,7 +3592,7 @@ out2:
>  * get_implied_cluster_alloc - check to see if the requested
>  * allocation (in the map structure) overlaps with a cluster already
>  * allocated in an extent.
> - *     @sbi    The ext4-specific superblock structure
> + *     @sb     The filesystem superblock structure
>  *     @map    The requested lblk->pblk mapping
>  *     @ex     The extent structure which might contain an implied
>  *                     cluster allocation
> @@ -3599,11 +3629,12 @@ out2:
>  * ext4_ext_map_blocks() will then allocate one or more new clusters
>  * by calling ext4_mb_new_blocks().
>  */
> -static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
> +static int get_implied_cluster_alloc(struct super_block *sb,
>                                     struct ext4_map_blocks *map,
>                                     struct ext4_extent *ex,
>                                     struct ext4_ext_path *path)
>  {
> +       struct ext4_sb_info *sbi = EXT4_SB(sb);
>        ext4_lblk_t c_offset = map->m_lblk & (sbi->s_cluster_ratio-1);
>        ext4_lblk_t ex_cluster_start, ex_cluster_end;
>        ext4_lblk_t rr_cluster_start, rr_cluster_end;
> @@ -3652,8 +3683,12 @@ static int get_implied_cluster_alloc(struct ext4_sb_info *sbi,
>                        ext4_lblk_t next = ext4_ext_next_allocated_block(path);
>                        map->m_len = min(map->m_len, next - map->m_lblk);
>                }
> +
> +               trace_ext4_get_implied_cluster_alloc_exit(sb, map, 1);
>                return 1;
>        }
> +
> +       trace_ext4_get_implied_cluster_alloc_exit(sb, map, 0);
>        return 0;
>  }
>
> @@ -3762,6 +3797,9 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>                 * we split out initialized portions during a write.
>                 */
>                ee_len = ext4_ext_get_actual_len(ex);
> +
> +               trace_ext4_ext_show_extent(inode, ee_block, ee_start, ee_len);
> +
>                /* if found extent covers block, simply return it */
>                if (in_range(map->m_lblk, ee_block, ee_len)) {
>                        newblock = map->m_lblk - ee_block + ee_start;
> @@ -3880,7 +3918,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>         * by ext4_ext_find_extent() implies a cluster we can use.
>         */
>        if (cluster_offset && ex &&
> -           get_implied_cluster_alloc(sbi, map, ex, path)) {
> +           get_implied_cluster_alloc(inode->i_sb, map, ex, path)) {
>                ar.len = allocated = map->m_len;
>                newblock = map->m_pblk;
>                map->m_flags |= EXT4_MAP_FROM_CLUSTER;
> @@ -3901,7 +3939,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>        /* Check if the extent after searching to the right implies a
>         * cluster we can use. */
>        if ((sbi->s_cluster_ratio > 1) && ex2 &&
> -           get_implied_cluster_alloc(sbi, map, ex2, path)) {
> +           get_implied_cluster_alloc(inode->i_sb, map, ex2, path)) {
>                ar.len = allocated = map->m_len;
>                newblock = map->m_pblk;
>                map->m_flags |= EXT4_MAP_FROM_CLUSTER;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7ae8dec..2495cfb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -238,7 +238,7 @@ void ext4_da_update_reserve_space(struct inode *inode,
>        struct ext4_inode_info *ei = EXT4_I(inode);
>
>        spin_lock(&ei->i_block_reservation_lock);
> -       trace_ext4_da_update_reserve_space(inode, used);
> +       trace_ext4_da_update_reserve_space(inode, used, quota_claim);
>        if (unlikely(used > ei->i_reserved_data_blocks)) {
>                ext4_msg(inode->i_sb, KERN_NOTICE, "%s: ino %lu, used %d "
>                         "with only %d reserved data blocks\n",
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 5ce2b2f..3368399 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -12,6 +12,7 @@ struct ext4_allocation_request;
>  struct ext4_prealloc_space;
>  struct ext4_inode_info;
>  struct mpage_da_data;
> +struct ext4_map_blocks;
>
>  #define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))
>
> @@ -1034,19 +1035,20 @@ TRACE_EVENT(ext4_forget,
>  );
>
>  TRACE_EVENT(ext4_da_update_reserve_space,
> -       TP_PROTO(struct inode *inode, int used_blocks),
> +       TP_PROTO(struct inode *inode, int used_blocks, int quota_claim),
>
> -       TP_ARGS(inode, used_blocks),
> +       TP_ARGS(inode, used_blocks, quota_claim),
>
>        TP_STRUCT__entry(
> -               __field(        dev_t,  dev                     )
> -               __field(        ino_t,  ino                     )
> -               __field(        umode_t, mode                   )
> -               __field(        __u64,  i_blocks                )
> -               __field(        int,    used_blocks             )
> -               __field(        int,    reserved_data_blocks    )
> -               __field(        int,    reserved_meta_blocks    )
> -               __field(        int,    allocated_meta_blocks   )
> +               __field(        dev_t,          dev                     )
> +               __field(        ino_t,          ino                     )
> +               __field(        umode_t,        mode                    )
> +               __field(        __u64,          i_blocks                )
> +               __field(        int,            used_blocks             )
> +               __field(        int,            reserved_data_blocks    )
> +               __field(        int,            reserved_meta_blocks    )
> +               __field(        int,            allocated_meta_blocks   )
> +               __field(        int,            quota_claim             )
>        ),
>
>        TP_fast_assign(
> @@ -1055,19 +1057,24 @@ TRACE_EVENT(ext4_da_update_reserve_space,
>                __entry->mode   = inode->i_mode;
>                __entry->i_blocks = inode->i_blocks;
>                __entry->used_blocks = used_blocks;
> -               __entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
> -               __entry->reserved_meta_blocks = EXT4_I(inode)->i_reserved_meta_blocks;
> -               __entry->allocated_meta_blocks = EXT4_I(inode)->i_allocated_meta_blocks;
> +               __entry->reserved_data_blocks =
> +                               EXT4_I(inode)->i_reserved_data_blocks;
> +               __entry->reserved_meta_blocks =
> +                               EXT4_I(inode)->i_reserved_meta_blocks;
> +               __entry->allocated_meta_blocks =
> +                               EXT4_I(inode)->i_allocated_meta_blocks;
> +               __entry->quota_claim = quota_claim;
>        ),
>
>        TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d "
>                  "reserved_data_blocks %d reserved_meta_blocks %d "
> -                 "allocated_meta_blocks %d",
> +                 "allocated_meta_blocks %d quota_claim %d",
>                  MAJOR(__entry->dev), MINOR(__entry->dev),
>                  (unsigned long) __entry->ino,
>                  __entry->mode, __entry->i_blocks,
>                  __entry->used_blocks, __entry->reserved_data_blocks,
> -                 __entry->reserved_meta_blocks, __entry->allocated_meta_blocks)
> +                 __entry->reserved_meta_blocks, __entry->allocated_meta_blocks,
> +                 __entry->quota_claim)
>  );
>
>  TRACE_EVENT(ext4_da_reserve_space,
> @@ -1520,6 +1527,384 @@ TRACE_EVENT(ext4_load_inode,
>                  (unsigned long) __entry->ino)
>  );
>
> +TRACE_EVENT(ext4_ext_handle_uninitialized_extents,
> +       TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
> +                unsigned int allocated, ext4_fsblk_t newblock),
> +
> +       TP_ARGS(inode, map, allocated, newblock),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino             )
> +               __field(        dev_t,          dev             )
> +               __field(        ext4_lblk_t,    lblk            )
> +               __field(        ext4_fsblk_t,   pblk            )
> +               __field(        unsigned int,   len             )
> +               __field(        int,            flags           )
> +               __field(        unsigned int,   allocated       )
> +               __field(        ext4_fsblk_t,   newblk          )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino            = inode->i_ino;
> +               __entry->dev            = inode->i_sb->s_dev;
> +               __entry->lblk           = map->m_lblk;
> +               __entry->pblk           = map->m_pblk;
> +               __entry->len            = map->m_len;
> +               __entry->flags          = map->m_flags;
> +               __entry->allocated      = allocated;
> +               __entry->newblk         = newblock;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu m_lblk %u m_pblk %llu m_len %u flags %d"
> +                 "allocated %d newblock %llu",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->lblk, (unsigned long long) __entry->pblk,
> +                 __entry->len, __entry->flags,
> +                 (unsigned int) __entry->allocated,
> +                 (unsigned long long) __entry->newblk)
> +);
> +
> +TRACE_EVENT(ext4_get_implied_cluster_alloc_exit,
> +       TP_PROTO(struct super_block *sb, struct ext4_map_blocks *map, int ret),
> +
> +       TP_ARGS(sb, map, ret),
> +
> +       TP_STRUCT__entry(
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    lblk    )
> +               __field(        ext4_fsblk_t,   pblk    )
> +               __field(        unsigned int,   len     )
> +               __field(        unsigned int,   flags   )
> +               __field(        int,            ret     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->dev    = sb->s_dev;
> +               __entry->lblk   = map->m_lblk;
> +               __entry->pblk   = map->m_pblk;
> +               __entry->len    = map->m_len;
> +               __entry->flags  = map->m_flags;
> +               __entry->ret    = ret;
> +       ),
> +
> +       TP_printk("dev %d,%d m_lblk %u m_pblk %llu m_len %u m_flags %u ret %d",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 __entry->lblk, (unsigned long long) __entry->pblk,
> +                 __entry->len, __entry->flags, __entry->ret)
> +);
> +
> +TRACE_EVENT(ext4_ext_put_in_cache,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t lblk, unsigned int len,
> +                ext4_fsblk_t start),
> +
> +       TP_ARGS(inode, lblk, len, start),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    lblk    )
> +               __field(        unsigned int,   len     )
> +               __field(        ext4_fsblk_t,   start   )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->lblk   = lblk;
> +               __entry->len    = len;
> +               __entry->start  = start;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu lblk %u len %u start %llu",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->lblk,
> +                 __entry->len,
> +                 (unsigned long long) __entry->start)
> +);
> +
> +TRACE_EVENT(ext4_ext_in_cache,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t lblk, int ret),
> +
> +       TP_ARGS(inode, lblk, ret),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    lblk    )
> +               __field(        int,            ret     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->lblk   = lblk;
> +               __entry->ret    = ret;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu lblk %u ret %d",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->lblk,
> +                 __entry->ret)
> +
> +);
> +
> +TRACE_EVENT(ext4_find_delalloc_range,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t from, ext4_lblk_t to,
> +               int reverse, int found, ext4_lblk_t found_blk),
> +
> +       TP_ARGS(inode, from, to, reverse, found, found_blk),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino             )
> +               __field(        dev_t,          dev             )
> +               __field(        ext4_lblk_t,    from            )
> +               __field(        ext4_lblk_t,    to              )
> +               __field(        int,            reverse         )
> +               __field(        int,            found           )
> +               __field(        ext4_lblk_t,    found_blk       )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino            = inode->i_ino;
> +               __entry->dev            = inode->i_sb->s_dev;
> +               __entry->from           = from;
> +               __entry->to             = to;
> +               __entry->reverse        = reverse;
> +               __entry->found          = found;
> +               __entry->found_blk      = found_blk;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu from %u to %u reverse %d found %d "
> +                 "(blk = %u)",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->from, (unsigned) __entry->to,
> +                 __entry->reverse, __entry->found,
> +                 (unsigned) __entry->found_blk)
> +);
> +
> +TRACE_EVENT(ext4_get_reserved_cluster_alloc,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t lblk, unsigned int len),
> +
> +       TP_ARGS(inode, lblk, len),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    lblk    )
> +               __field(        unsigned int,   len     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->lblk   = lblk;
> +               __entry->len    = len;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu lblk %u len %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->lblk,
> +                 __entry->len)
> +);
> +
> +TRACE_EVENT(ext4_ext_show_extent,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
> +                unsigned short len),
> +
> +       TP_ARGS(inode, lblk, pblk, len),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    lblk    )
> +               __field(        ext4_fsblk_t,   pblk    )
> +               __field(        unsigned short, len     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->lblk   = lblk;
> +               __entry->pblk   = pblk;
> +               __entry->len    = len;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu lblk %u pblk %llu len %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->lblk,
> +                 (unsigned long long) __entry->pblk,
> +                 (unsigned short) __entry->len)
> +);
> +
> +TRACE_EVENT(ext4_remove_blocks,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t ex_lblk,
> +               ext4_fsblk_t ex_pblk, unsigned short ex_len,
> +               ext4_lblk_t from, ext4_fsblk_t to,
> +               ext4_fsblk_t partial_cluster),
> +
> +       TP_ARGS(inode, ex_lblk, ex_pblk, ex_len, from, to, partial_cluster),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    ee_lblk )
> +               __field(        ext4_fsblk_t,   ee_pblk )
> +               __field(        unsigned short, ee_len  )
> +               __field(        ext4_lblk_t,    from    )
> +               __field(        ext4_lblk_t,    to      )
> +               __field(        ext4_fsblk_t,   partial )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino            = inode->i_ino;
> +               __entry->dev            = inode->i_sb->s_dev;
> +               __entry->ee_lblk        = ex_lblk;
> +               __entry->ee_pblk        = ex_pblk;
> +               __entry->ee_len         = ex_len;
> +               __entry->from           = from;
> +               __entry->to             = to;
> +               __entry->partial        = partial_cluster;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu extent [%u(%llu), %u]"
> +                 "from %u to %u partial_cluster %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->ee_lblk,
> +                 (unsigned long long) __entry->ee_pblk,
> +                 (unsigned short) __entry->ee_len,
> +                 (unsigned) __entry->from,
> +                 (unsigned) __entry->to,
> +                 (unsigned) __entry->partial)
> +);
> +
> +TRACE_EVENT(ext4_ext_rm_leaf,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t start,
> +                ext4_lblk_t ex_lblk, ext4_fsblk_t ex_pblk,
> +                unsigned short ex_len, ext4_fsblk_t partial_cluster),
> +
> +       TP_ARGS(inode, start, ex_lblk, ex_pblk, ex_len, partial_cluster),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    start   )
> +               __field(        ext4_lblk_t,    ee_lblk )
> +               __field(        ext4_fsblk_t,   ee_pblk )
> +               __field(        short,          ee_len  )
> +               __field(        ext4_fsblk_t,   partial )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino            = inode->i_ino;
> +               __entry->dev            = inode->i_sb->s_dev;
> +               __entry->start          = start;
> +               __entry->ee_lblk        = ex_lblk;
> +               __entry->ee_pblk        = ex_pblk;
> +               __entry->ee_len         = ex_len;
> +               __entry->partial        = partial_cluster;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu start_lblk %u last_extent [%u(%llu), %u]"
> +                 "partial_cluster %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->start,
> +                 (unsigned) __entry->ee_lblk,
> +                 (unsigned long long) __entry->ee_pblk,
> +                 (unsigned short) __entry->ee_len,
> +                 (unsigned) __entry->partial)
> +);
> +
> +TRACE_EVENT(ext4_ext_rm_idx,
> +       TP_PROTO(struct inode *inode, ext4_fsblk_t pblk),
> +
> +       TP_ARGS(inode, pblk),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_fsblk_t,   pblk    )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->pblk   = pblk;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu index_pblk %llu",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned long long) __entry->pblk)
> +);
> +
> +TRACE_EVENT(ext4_ext_remove_space,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth),
> +
> +       TP_ARGS(inode, start, depth),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino     )
> +               __field(        dev_t,          dev     )
> +               __field(        ext4_lblk_t,    start   )
> +               __field(        int,            depth   )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino    = inode->i_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->start  = start;
> +               __entry->depth  = depth;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu since %u depth %d",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->start,
> +                 __entry->depth)
> +);
> +
> +TRACE_EVENT(ext4_ext_remove_space_done,
> +       TP_PROTO(struct inode *inode, ext4_lblk_t start, int depth,
> +               ext4_lblk_t partial, unsigned short eh_entries),
> +
> +       TP_ARGS(inode, start, depth, partial, eh_entries),
> +
> +       TP_STRUCT__entry(
> +               __field(        ino_t,          ino             )
> +               __field(        dev_t,          dev             )
> +               __field(        ext4_lblk_t,    start           )
> +               __field(        int,            depth           )
> +               __field(        ext4_lblk_t,    partial         )
> +               __field(        unsigned short, eh_entries      )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->ino            = inode->i_ino;
> +               __entry->dev            = inode->i_sb->s_dev;
> +               __entry->start          = start;
> +               __entry->depth          = depth;
> +               __entry->partial        = partial;
> +               __entry->eh_entries     = eh_entries;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu since %u depth %d partial %u "
> +                 "remaining_entries %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->ino,
> +                 (unsigned) __entry->start,
> +                 __entry->depth,
> +                 (unsigned) __entry->partial,
> +                 (unsigned short) __entry->eh_entries)
> +);
> +
>  #endif /* _TRACE_EXT4_H */
>
>  /* This part must be outside protection */
> --
> 1.7.4.1.22.gec8e1.dirty
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

2011-07-06 22:58:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 14/23] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

On 2011-07-06, at 10:35 AM, Theodore Ts'o wrote:
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/super.c | 37 ++++++++++++++++++++++++-------------
> 1 files changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a803d65..7a4d54e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4529,16 +4529,34 @@ restore_opts:
> return err;
> }
>
> +/*
> + * Note: calculating the overhead so we can be compatible with
> + * historical BSD practice is quite difficult in the face of
> + * clusters/bigalloc. This is because multiple metadata blocks from
> + * different block groups can end up in the same allocation cluster.
> + * Calculating the exact overhead in the face of clustered allocation
> + * requires either O(all block bitmaps) in memory or O(number of block
> + * groups**2) in time. We will still calculate the overhead for
> + * older file systems --- and if we come across a bigalloc file
> + * system with zero in s_overhead_clusters, the estimate will be close
> + * to correct, especially for very large cluster sizes --- but for newer
> + * file systems, it's better to calculate this figure once at mkfs
> + * time, and store it in the superblock. If the superblock value is
> + * present (even for non-bigalloc file systems), we will use it.
> + */
> static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> struct super_block *sb = dentry->d_sb;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> struct ext4_super_block *es = sbi->s_es;
> + struct ext4_group_desc *gdp;
> u64 fsid;
> s64 bfree;
>
> if (test_opt(sb, MINIX_DF)) {

Should we change this check to unlikely()?
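
E.g. (illustrative only --- the hint just tells the compiler this
branch is cold):

	if (unlikely(test_opt(sb, MINIX_DF))) {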

> sbi->s_overhead_last = 0;
> + } else if (es->s_overhead_clusters) {
> + sbi->s_overhead_last = le32_to_cpu(es->s_overhead_clusters);
> } else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> ext4_fsblk_t overhead = 0;
> @@ -4553,24 +4571,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> * All of the blocks before first_data_block are
> * overhead
> */
> - overhead = le32_to_cpu(es->s_first_data_block);
> + overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
>
> /*
> - * Add the overhead attributed to the superblock and
> - * block group descriptors. If the sparse superblocks
> - * feature is turned on, then not all groups have this.
> + * Add the overhead found in each block group
> */
> for (i = 0; i < ngroups; i++) {
> - overhead += ext4_bg_has_super(sb, i) +
> - ext4_bg_num_gdb(sb, i);
> + gdp = ext4_get_group_desc(sb, i, NULL);
> + overhead += ext4_num_overhead_clusters(sb, i, gdp);
> cond_resched();
> }
> -
> - /*
> - * Every block group has an inode bitmap, a block
> - * bitmap, and an inode table.
> - */
> - overhead += ngroups * (2 + sbi->s_itb_per_group);
> sbi->s_overhead_last = overhead;
> smp_wmb();
> sbi->s_blocks_last = ext4_blocks_count(es);
> @@ -4578,7 +4588,8 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
>
> buf->f_type = EXT4_SUPER_MAGIC;
> buf->f_bsize = sb->s_blocksize;
> - buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
> + buf->f_blocks = (ext4_blocks_count(es) -
> + EXT4_C2B(sbi, sbi->s_overhead_last));
> bfree = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
> percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
> /* prevent underflow in case that few free space is available */
> --
> 1.7.4.1.22.gec8e1.dirty
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






2011-07-06 23:00:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 12/23] ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter

On 2011-07-06, at 10:35 AM, Theodore Ts'o wrote:
> Convert the percpu counters s_dirtyblocks_counter and
> s_freeblocks_counter in struct ext4_super_info to be
> s_dirtyblocks_counter and s_freeclusters_counter.

Commit comment is incorrect. Should be "s_dirtyclusters_counter"
at the second usage.

Cheers, Andreas






2011-07-06 23:06:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 18/23] ext4: Rename ext4_free_blks_{count,set}() to refer to clusters

On 2011-07-06, at 10:36 AM, Theodore Ts'o wrote:
> The field bg_free_free_blocks_count_{lo,high} in the block group

s/free_free/free_/

> descriptor has been repurposed to hold the number of free clusters for
> bigalloc file systems. So rename the functions to make it easier to
> read and audit the block allocation and block freeing code.
>
> Note: at this point in bigalloc development we don't support
> online-resize, so this also makes it really obvious all of the places
> we need to fix up to add support for online resize.

Cheers, Andreas






2011-07-08 22:40:15

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 14/23] ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled

On Wed, Jul 06, 2011 at 04:58:13PM -0600, Andreas Dilger wrote:
> >
> > if (test_opt(sb, MINIX_DF)) {
>
> Should we change this check to unlikely()?

It's unrelated to this commit, so it should probably be separated out
into
a different commit. It's not clear to me that statfs() is a common
enough code path that it would really matter, though.

- Ted

2011-07-08 22:41:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 12/23] ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter

On Wed, Jul 06, 2011 at 04:59:59PM -0600, Andreas Dilger wrote:
> On 2011-07-06, at 10:35 AM, Theodore Ts'o wrote:
> > Convert the percpu counters s_dirtyblocks_counter and
> > s_freeblocks_counter in struct ext4_super_info to be
> > s_dirtyblocks_counter and s_freeclusters_counter.
>
> Commit comment is incorrect. Should be "s_dirtyclusters_counter"
> at the second usage.

Thanks for pointing that out, fixed.

- Ted

2011-07-08 22:42:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 18/23] ext4: Rename ext4_free_blks_{count,set}() to refer to clusters

On Wed, Jul 06, 2011 at 05:06:37PM -0600, Andreas Dilger wrote:
> On 2011-07-06, at 10:36 AM, Theodore Ts'o wrote:
> > The field bg_free_free_blocks_count_{lo,high} in the block group
>
> s/free_free/free_/

Thanks, fixed.

- Ted

2011-07-08 23:02:03

by Theodore Ts'o

[permalink] [raw]
Subject: bigalloc performance stats (was Re: [PATCH 00/23] New spin of the bigalloc patches)

I have some initial benchmark figures that may help provide some
insight into why I am especially interested in getting bigalloc into
ext4.

The following statistics were collected on a Google file server. As
Michael Rubin mentioned in his talks at the LinuxCon this year, and at
the Kernel Summit two years ago, one of the things that we do our
servers is to really pack in a large number of jobs onto a single
machine, for cost and power efficiency.

As a result, we generally don't have machines which are *only* a file
server; that would leave wasted memory and CPU on the table. I
believe the same thing will be found in people who are implementing
cloud computing using virtualization; the whole point is to do things
efficiently, which means a large number of guest OS's will be packed
onto a single physical machine, so memory and disk bandwidth will
often be at a premium. This is the environment in which these figures
were captured.

I compared a stock ext4 file system against ext4 file systems with
bigalloc using 64k, 256k, and 1M clusters. First, let's look at the
average time needed to execute the fallocate system call and the inode
truncation portion of the ftruncate and unlink system calls (this data
was gathered using tracepoints, so the overhead of syscall entry and
exit is not included in these numbers):

              ext4               |        64k        |        256k       |         1M
            time    meta    max  |  time  meta  max  |  time  meta  max  |  time  meta      max
fallocate  14,262  1.1494    11  |   895  0.0417  2  |   318  0.0084  2  |   122  0.00077    1
truncate   12,944  0.8256    27  |  6911  0.4877  3  |  4541  0.2822  3  |  4558  0.2744     3

The time column is in microseconds (i.e., on this server, using stock
ext4, fallocate was taking 14.2 milliseconds on average); the "meta"
column indicates the average number of metadata reads that were necessary
to complete the operation, and the "max" column indicates the maximum
number of metadata reads needed to complete the operation.

Note that the average time to execute the fallocate() system call
went down by over two orders of magnitude comparing ext4
against bigalloc with a 1M cluster size, using the same workload (from
14.2 ms to 122 usec). And even the 64k and 256k cluster sizes did
quite well (factors of 16 and 45, respectively) compared to stock
ext4.
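
(As an aside, here is a minimal userspace sketch of this sort of
measurement --- my illustration, not the harness used for the numbers
above, and unlike the tracepoint figures it does include the syscall
entry/exit overhead; the file name is arbitrary:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec t0, t1;
	long usec;
	int fd = open("bigalloc-test", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fallocate(fd, 0, 0, 1024 * 1024 * 1024))	/* preallocate 1 GiB */
		perror("fallocate");
	clock_gettime(CLOCK_MONOTONIC, &t1);
	usec = (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000;
	printf("fallocate took %ld usec\n", usec);
	close(fd);
	unlink("bigalloc-test");
	return 0;
}

(Older glibc needs -lrt for clock_gettime().)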

Also of interest was the percentage of direct I/O reads and writes
that took over 100ms:

                         ext4      64k      256k      1M
DIO reads  > 100ms:     0.498%   0.228%   0.257%   0.269%
DIO writes > 100ms:     0.202%   0.134%   0.109%   0.0582%

Since we don't need to read or write the block allocation bitmaps when
we do our DIO (since we fallocate the files in advance), this
improvement must be largely due to reduced fragmentation of the files
(we let the workload run for a couple of days on a set of disks so we
could get something closer to "steady state" as opposed to "freshly
formatted" results). The reason the DIO reads improve so much
more is the need to read in the extent tree blocks, which
would tend to be in memory already most of the time since the inode
would have been freshly fallocated while the DIO write was going on.

These are only initial results, but they were gathered on a production
workload --- and I hope this demonstrates why I consider bigalloc to
be especially interesting in environments where server resources
(especially memory) are constrained due to the desire to use those
resources as efficiently as possible.

- Ted

2011-07-08 23:20:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 23/23] ext4: add some tracepoints in ext4/extents.c

On Wed, Jul 06, 2011 at 11:12:30AM -0700, Eric Gouriou wrote:
>
> In all the trace points that capture the logical block/physical
> block/length of an extent, it would be nicer to pass the extent itself
> as a parameter and perform the extraction in the TP_fast_assign()
> block. This requires adding an #include <ext4_extents.h> above the
> include of <trace/events/ext4.h> in fs/ext4/super.c in order to
> have ext4_ext_pblock() and ext4_ext_get_actual_len() available.
>
> By itself this should not be grounds to respin the whole patch series;
> a follow-on cleanup patch would do (I volunteer to create one if
> desired). Fortunately this wouldn't cause the trace record format to
> change, so it wouldn't cause any backward/forward compatibility issue.

Good suggestion, I've made this change.

- Ted

2011-09-28 12:55:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [02/23] ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)

On Tue, Sep 20, 2011 at 10:20:01AM +0800, Robin Dong wrote:
> From: Theodore Ts'o <[email protected]>
>
> At least initially if the bigalloc feature is enabled, we will not
> support non-extent mapped inodes, online resizing, online defrag, or
> the FITRIM ioctl. This simplifies the initial implementation.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Signed-off-by: Robin Dong <[email protected]>
> Acked-by:

Hi Robin,

What was your purpose in resending these patches? Did you modify them
in any way? I can diff them all, but it's a pain...

- Ted