v3->v4:
1. Collect Reviewed-by from Ojaswin
2. Apply improvements Ojaswin kindly suggested: fix typo in commit
message, WARN when trying to clear a bit of an uninitialized group and
improve refactoring of the AGGRESSIVE_CHECK code.
3. Fix conflict on patch 16
4. Improve git log in patch 16,17
v2->v3:
1. Rebase the patches onto a new branch head and fix a conflict on "ext4: add
EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated"
2. Fix build warnings on "ext4: add some kunit stub for mballoc kunit
test" and "ext4: add first unit test for ext4_mb_new_blocks_simple in
mballoc"
There are three parts in this patchset:
Part1: Patch 1-7 is v2 of the previously sent series
v1->v2:
1. collect reviewed-by from Ojaswin. Only "ext4: add
EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated" needs further
review. See [1] for previous comments.
2. drop "ext4: fix wrong unit use in ext4_mb_new_inode_pa" which is
already done in [2].
Part2: Patch 8-17 are more fixes and cleanups to mballoc
Some patches in this part conflict with patches in part1, so the new
patches are appended to this series instead of being sent as a new one.
Patch 8-11 are some random fixes and cleanups, see the respective log
messages for details.
Patch 12-17 factor out a common helper to mark bits in a group as used
or free, which updates the on-disk block bitmap and group descriptor.
Several reasons to do this:
1. Pair the behavior of allocating and freeing bits. For example,
ext4_mb_new_blocks_simple updates free_clusters in struct flex_groups
via ext4_mb_mark_bb while ext4_free_blocks_simple forgets to do this.
2. Remove repeated code that reads from disk, updates and writes back
to disk.
3. Reduce the number of future unit test mocks needed to avoid real IO
when updating structures on disk.
A simplified sketch of the resulting interface and a typical caller is
shown right after this list.
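For reference, a simplified sketch of the helper interface introduced
by patch 12-17 and of a typical caller, taken from the free path; error
handling and journalling details are omitted here:

	struct ext4_mark_context {
		handle_t *handle;	/* NULL outside the journal */
		struct super_block *sb;
		int state;		/* 1: mark used, 0: mark free */
		ext4_grpblk_t changed;	/* bits actually flipped */
	};

	/* e.g. freeing blocks in the fast commit replay path */
	struct ext4_mark_context mc = {
		.sb = inode->i_sb,
		.state = 0,
	};

	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
	ext4_mb_mark_group_bb(&mc, group, blkoff, count,
			      EXT4_MB_BITMAP_MARKED_CHECK |
			      EXT4_MB_SYNC_UPDATE);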
Part3: Patch 18-19 add one unit test for mballoc
Patch 18 adds mocks for functions which issue IO to disk.
Patch 19 adds a unit test for ext4_mb_new_blocks_simple in mballoc.
Details can be found in the respective log messages.
Before adding more unit tests, there are a few things that should be
discussed:
1. How to test static functions in mballoc.c
Currently, mballoc-test.c is included in mballoc.c so that static
functions in mballoc.c can be tested from mballoc-test.c, which is one
way suggested in [3]. Not sure if there is a more elegant way to test
static functions without touching mballoc.c.
2. How to add mocks to functions in mballoc.c which may issue IO to disk
Currently, KUNIT_STATIC_STUB_REDIRECT is added to these functions as
suggested in the kunit documentation [4].
3. How to simulate a block bitmap.
Currently, a fake buffer_head with bitmap data is returned, so no
further change is needed.
If we instead simulate a block bitmap with an array of data structures
like:
struct test_bitmap {
unsigned int start;
unsigned int len;
}
which is suggested by Theodore in [5], then we need to add mocks for
the functions which expect a bitmap in bitmap_bh->b_data, like
mb_find_next_bit, mb_find_next_zero_bit and maybe more (a rough sketch
of such a mock follows this list).
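For illustration only, a rough sketch of what such a mock could look
like. The helper name fake_find_next_zero_bit and the sorted,
non-overlapping layout of the used ranges are assumptions, not part of
the posted patches:

	/*
	 * Hypothetical mock answering mb_find_next_zero_bit()-style
	 * queries from an array of used ranges instead of a real
	 * bitmap. Assumes "used" is sorted and non-overlapping.
	 */
	struct test_bitmap {
		unsigned int start;
		unsigned int len;
	};

	static int fake_find_next_zero_bit(const struct test_bitmap *used,
					   unsigned int nr_used,
					   int max, int start)
	{
		unsigned int i;

		for (i = 0; i < nr_used; i++) {
			if (start < used[i].start)
				break;	/* start falls in a free gap */
			if (start < used[i].start + used[i].len)
				start = used[i].start + used[i].len;
		}
		/* mirror mb_find_next_zero_bit(): >= max means none found */
		return start < max ? start : max;
	}

With something like this, functions that currently read
bitmap_bh->b_data directly would have to be redirected to the mock,
which is the extra work mentioned above.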
Would like to hear any suggestion! Thanks!
[1]
https://lore.kernel.org/linux-ext4/[email protected]/
[2]
https://lore.kernel.org/linux-ext4/9b35f3955a1d7b66bbd713eca1e63026e01f78c1.1679731817.git.ojaswin@linux.ibm.com
[3]
https://docs.kernel.org/dev-tools/kunit/usage.html#testing-static-functions
[4]
https://docs.kernel.org/dev-tools/kunit/api/functionredirection.html#c.KUNIT_STATIC_STUB_REDIRECT
[5]
https://lore.kernel.org/linux-ext4/[email protected]/
By the way, the "xfstests smoke" test passes. Please let me know if any
more tests are needed.
The unit test result is as follows:
# ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig --raw_output
[18:44:39] Configuring KUnit Kernel ...
[18:44:39] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make ARCH=um O=.kunit --jobs=88
[18:44:47] Starting KUnit Kernel (1/1)...
KTAP version 1
1..2
KTAP version 1
# Subtest: ext4_mballoc_test
1..1
ok 1 test_new_blocks_simple
ok 1 ext4_mballoc_test
KTAP version 1
# Subtest: ext4_inode_test
1..1
KTAP version 1
# Subtest: inode_test_xtimestamp_decoding
ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits
ok 2 1969-12-31 Upper bound of 32bit < 0 timestamp, no extra bits
ok 3 1970-01-01 Lower bound of 32bit >=0 timestamp, no extra bits
ok 4 2038-01-19 Upper bound of 32bit >=0 timestamp, no extra bits
ok 5 2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on
ok 6 2106-02-07 Upper bound of 32bit <0 timestamp, lo extra sec bit on
ok 7 2106-02-07 Lower bound of 32bit >=0 timestamp, lo extra sec bit on
ok 8 2174-02-25 Upper bound of 32bit >=0 timestamp, lo extra sec bit on
ok 9 2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on
ok 10 2242-03-16 Upper bound of 32bit <0 timestamp, hi extra sec bit on
ok 11 2242-03-16 Lower bound of 32bit >=0 timestamp, hi extra sec bit on
ok 12 2310-04-04 Upper bound of 32bit >=0 timestamp, hi extra sec bit on
ok 13 2310-04-04 Upper bound of 32bit>=0 timestamp, hi extra sec bit 1. 1 ns
ok 14 2378-04-22 Lower bound of 32bit>= timestamp. Extra sec bits 1. Max ns
ok 15 2378-04-22 Lower bound of 32bit >=0 timestamp. All extra sec bits on
ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra sec bits on
# inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16
ok 1 inode_test_xtimestamp_decoding
# Totals: pass:16 fail:0 skip:0 total:16
ok 2 ext4_inode_test
[18:44:48] Elapsed time: 8.602s total, 0.001s configuring, 8.483s building, 0.072s running
Kemeng Shi (19):
ext4: fix wrong unit use in ext4_mb_normalize_request
ext4: fix unit mismatch in ext4_mb_new_blocks_simple
ext4: fix wrong unit use in ext4_mb_find_by_goal
ext4: treat stripe in block unit
ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
ext4: remove ext4_block_group and ext4_block_group_offset declaration
ext4: try all groups in ext4_mb_new_blocks_simple
ext4: get block from bh before pass it to ext4_free_blocks_simple in
ext4_free_blocks
ext4: remove unused parameter and unnecessary forward declaration of
ext4_mb_new_blocks_simple
ext4: fix wrong unit use in ext4_mb_clear_bb
ext4: fix wrong unit use in ext4_mb_new_blocks
ext4: factor out codes to update block bitmap and group descriptor on
disk from ext4_mb_mark_bb
ext4: call ext4_mb_mark_group_bb in ext4_free_blocks_simple
ext4: extend ext4_mb_mark_group_bb to support allocation under journal
ext4: call ext4_mb_mark_group_bb in ext4_mb_mark_diskspace_used
ext4: call ext4_mb_mark_group_bb in ext4_mb_clear_bb
ext4: call ext4_mb_mark_group_bb in ext4_group_add_blocks
ext4: add some kunit stub for mballoc kunit test
ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc
fs/ext4/balloc.c | 16 +
fs/ext4/ext4.h | 4 -
fs/ext4/mballoc-test.c | 323 +++++++++++++++++++
fs/ext4/mballoc.c | 714 ++++++++++++++++++-----------------------
fs/ext4/super.c | 13 +
5 files changed, 672 insertions(+), 398 deletions(-)
create mode 100644 fs/ext4/mballoc-test.c
--
2.30.0
ext4_free_blocks will retrieve the block number from bh if the block
parameter is zero. Retrieve the block number before calling
ext4_free_blocks_simple to avoid potentially passing a wrong block to
ext4_free_blocks_simple.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bcc9622daa4a..dadda9e8cf29 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6368,12 +6368,6 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
sbi = EXT4_SB(sb);
- if (sbi->s_mount_state & EXT4_FC_REPLAY) {
- ext4_free_blocks_simple(inode, block, count);
- return;
- }
-
- might_sleep();
if (bh) {
if (block)
BUG_ON(block != bh->b_blocknr);
@@ -6381,6 +6375,13 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
block = bh->b_blocknr;
}
+ if (sbi->s_mount_state & EXT4_FC_REPLAY) {
+ ext4_free_blocks_simple(inode, block, count);
+ return;
+ }
+
+ might_sleep();
+
if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
!ext4_inode_block_valid(inode, block, count)) {
ext4_error(sb, "Freeing blocks not in datazone - "
--
2.30.0
Call ext4_mb_mark_group_bb in ext4_free_blocks_simple to:
1. remove repeated code
2. pair the free_clusters update done in ext4_mb_new_blocks_simple.
3. add missing ext4_lock_group/ext4_unlock_group protection.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 39 +++++++--------------------------------
1 file changed, 7 insertions(+), 32 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 685dcc17bf7c..dae4533411f7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3798,6 +3798,8 @@ ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
ext4_lock_group(sb, group);
if (ext4_has_group_desc_csum(sb) &&
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+ WARN_ON(mc->state == 0);
+
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_group_clusters_set(sb, gdp,
ext4_free_clusters_after_init(sb, group, gdp));
@@ -6120,43 +6122,16 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
static void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
unsigned long count)
{
- struct buffer_head *bitmap_bh;
+ struct ext4_mark_context mc = {
+ .sb = inode->i_sb,
+ .state = 0,
+ };
struct super_block *sb = inode->i_sb;
- struct ext4_group_desc *gdp;
- struct buffer_head *gdp_bh;
ext4_group_t group;
ext4_grpblk_t blkoff;
- int already_freed = 0, err, i;
ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
- bitmap_bh = ext4_read_block_bitmap(sb, group);
- if (IS_ERR(bitmap_bh)) {
- pr_warn("Failed to read block bitmap\n");
- return;
- }
- gdp = ext4_get_group_desc(sb, group, &gdp_bh);
- if (!gdp)
- goto err_out;
-
- for (i = 0; i < count; i++) {
- if (!mb_test_bit(blkoff + i, bitmap_bh->b_data))
- already_freed++;
- }
- mb_clear_bits(bitmap_bh->b_data, blkoff, count);
- err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
- if (err)
- goto err_out;
- ext4_free_group_clusters_set(
- sb, gdp, ext4_free_group_clusters(sb, gdp) +
- count - already_freed);
- ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
- ext4_group_desc_csum_set(sb, group, gdp);
- ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
- sync_dirty_buffer(bitmap_bh);
- sync_dirty_buffer(gdp_bh);
-
-err_out:
- brelse(bitmap_bh);
+ ext4_mb_mark_group_bb(&mc, group, blkoff, count);
}
/**
--
2.30.0
Call ext4_mb_mark_group_bb in ext4_mb_mark_diskspace_used to:
1. remove repeated code that normally updates the bitmap and group
descriptor on disk.
2. call ext4_mb_mark_group_bb instead of only setting bits in the block
bitmap to fix the bitmap. ext4_mb_mark_group_bb also updates the bitmap
checksum and other counters along with the bit change to keep them
consistent with the bit change; otherwise the block bitmap will be
marked corrupted as its checksum is left in an inconsistent state.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 89 ++++++++++++-----------------------------------
1 file changed, 23 insertions(+), 66 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e51cb0add4eb..46b37f5c9223 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3887,13 +3887,17 @@ static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
handle_t *handle, unsigned int reserv_clstrs)
{
- struct buffer_head *bitmap_bh = NULL;
+ struct ext4_mark_context mc = {
+ .handle = handle,
+ .sb = ac->ac_sb,
+ .state = 1,
+ };
struct ext4_group_desc *gdp;
- struct buffer_head *gdp_bh;
struct ext4_sb_info *sbi;
struct super_block *sb;
ext4_fsblk_t block;
int err, len;
+ int flags = 0;
BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(ac->ac_b_ex.fe_len <= 0);
@@ -3901,32 +3905,13 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
sb = ac->ac_sb;
sbi = EXT4_SB(sb);
- bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
- if (IS_ERR(bitmap_bh)) {
- return PTR_ERR(bitmap_bh);
- }
-
- BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
- EXT4_JTR_NONE);
- if (err)
- goto out_err;
-
- err = -EIO;
- gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, &gdp_bh);
+ gdp = ext4_get_group_desc(sb, ac->ac_b_ex.fe_group, NULL);
if (!gdp)
- goto out_err;
-
+ return -EIO;
ext4_debug("using block group %u(%d)\n", ac->ac_b_ex.fe_group,
ext4_free_group_clusters(sb, gdp));
- BUFFER_TRACE(gdp_bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, sb, gdp_bh, EXT4_JTR_NONE);
- if (err)
- goto out_err;
-
block = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
-
len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
if (!ext4_inode_block_valid(ac->ac_inode, block, len)) {
ext4_error(sb, "Allocating blocks %llu-%llu which overlap "
@@ -3935,41 +3920,28 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
* Fix the bitmap and return EFSCORRUPTED
* We leak some of the blocks here.
*/
- ext4_lock_group(sb, ac->ac_b_ex.fe_group);
- mb_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
- ac->ac_b_ex.fe_len);
- ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ err = ext4_mb_mark_group_bb(&mc, ac->ac_b_ex.fe_group,
+ ac->ac_b_ex.fe_start,
+ ac->ac_b_ex.fe_len,
+ 0);
if (!err)
err = -EFSCORRUPTED;
- goto out_err;
+ return err;
}
- ext4_lock_group(sb, ac->ac_b_ex.fe_group);
#ifdef AGGRESSIVE_CHECK
- {
- int i;
- for (i = 0; i < ac->ac_b_ex.fe_len; i++) {
- BUG_ON(mb_test_bit(ac->ac_b_ex.fe_start + i,
- bitmap_bh->b_data));
- }
- }
+ flags |= EXT4_MB_BITMAP_MARKED_CHECK;
#endif
- mb_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,
- ac->ac_b_ex.fe_len);
- if (ext4_has_group_desc_csum(sb) &&
- (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
- gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_group_clusters_set(sb, gdp,
- ext4_free_clusters_after_init(sb,
- ac->ac_b_ex.fe_group, gdp));
- }
- len = ext4_free_group_clusters(sb, gdp) - ac->ac_b_ex.fe_len;
- ext4_free_group_clusters_set(sb, gdp, len);
- ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
- ext4_group_desc_csum_set(sb, ac->ac_b_ex.fe_group, gdp);
+ err = ext4_mb_mark_group_bb(&mc, ac->ac_b_ex.fe_group,
+ ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len,
+ flags);
+
+ if (err && mc.changed == 0)
+ return err;
- ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
+#ifdef AGGRESSIVE_CHECK
+ BUG_ON(mc.changed != ac->ac_b_ex.fe_len);
+#endif
percpu_counter_sub(&sbi->s_freeclusters_counter, ac->ac_b_ex.fe_len);
/*
* Now reduce the dirty block count also. Should not go negative
@@ -3979,21 +3951,6 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
percpu_counter_sub(&sbi->s_dirtyclusters_counter,
reserv_clstrs);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t flex_group = ext4_flex_group(sbi,
- ac->ac_b_ex.fe_group);
- atomic64_sub(ac->ac_b_ex.fe_len,
- &sbi_array_rcu_deref(sbi, s_flex_groups,
- flex_group)->free_clusters);
- }
-
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
- if (err)
- goto out_err;
- err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
-
-out_err:
- brelse(bitmap_bh);
return err;
}
--
2.30.0
Two cleanups for ext4_mb_new_blocks_simple:
Remove unused parameter handle of ext4_mb_new_blocks_simple.
Move ext4_mb_new_blocks_simple definition before ext4_mb_new_blocks to
remove unnecessary forward declaration of ext4_mb_new_blocks_simple.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 137 +++++++++++++++++++++++-----------------------
1 file changed, 67 insertions(+), 70 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dadda9e8cf29..1345c4e84cc3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5786,8 +5786,72 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
return ret;
}
-static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
- struct ext4_allocation_request *ar, int *errp);
+/*
+ * Simple allocator for Ext4 fast commit replay path. It searches for blocks
+ * linearly starting at the goal block and also excludes the blocks which
+ * are going to be in use after fast commit replay.
+ */
+static ext4_fsblk_t
+ext4_mb_new_blocks_simple(struct ext4_allocation_request *ar, int *errp)
+{
+ struct buffer_head *bitmap_bh;
+ struct super_block *sb = ar->inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ ext4_group_t group, nr;
+ ext4_grpblk_t blkoff;
+ ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
+ ext4_grpblk_t i = 0;
+ ext4_fsblk_t goal, block;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+ goal = ar->goal;
+ if (goal < le32_to_cpu(es->s_first_data_block) ||
+ goal >= ext4_blocks_count(es))
+ goal = le32_to_cpu(es->s_first_data_block);
+
+ ar->len = 0;
+ ext4_get_group_no_and_offset(sb, goal, &group, &blkoff);
+ for (nr = ext4_get_groups_count(sb); nr > 0; nr--) {
+ bitmap_bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bitmap_bh)) {
+ *errp = PTR_ERR(bitmap_bh);
+ pr_warn("Failed to read block bitmap\n");
+ return 0;
+ }
+
+ while (1) {
+ i = mb_find_next_zero_bit(bitmap_bh->b_data, max,
+ blkoff);
+ if (i >= max)
+ break;
+ if (ext4_fc_replay_check_excluded(sb,
+ ext4_group_first_block_no(sb, group) +
+ EXT4_C2B(sbi, i))) {
+ blkoff = i + 1;
+ } else
+ break;
+ }
+ brelse(bitmap_bh);
+ if (i < max)
+ break;
+
+ if (++group >= ext4_get_groups_count(sb))
+ group = 0;
+
+ blkoff = 0;
+ }
+
+ if (i >= max) {
+ *errp = -ENOSPC;
+ return 0;
+ }
+
+ block = ext4_group_first_block_no(sb, group) + EXT4_C2B(sbi, i);
+ ext4_mb_mark_bb(sb, block, 1, 1);
+ ar->len = 1;
+
+ return block;
+}
/*
* Main entry point into mballoc to allocate blocks
@@ -5812,7 +5876,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
trace_ext4_request_blocks(ar);
if (sbi->s_mount_state & EXT4_FC_REPLAY)
- return ext4_mb_new_blocks_simple(handle, ar, errp);
+ return ext4_mb_new_blocks_simple(ar, errp);
/* Allow to use superuser reservation for quota file */
if (ext4_is_quota_file(ar->inode))
@@ -6036,73 +6100,6 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
spin_unlock(&sbi->s_md_lock);
}
-/*
- * Simple allocator for Ext4 fast commit replay path. It searches for blocks
- * linearly starting at the goal block and also excludes the blocks which
- * are going to be in use after fast commit replay.
- */
-static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
- struct ext4_allocation_request *ar, int *errp)
-{
- struct buffer_head *bitmap_bh;
- struct super_block *sb = ar->inode->i_sb;
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- ext4_group_t group, nr;
- ext4_grpblk_t blkoff;
- ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
- ext4_grpblk_t i = 0;
- ext4_fsblk_t goal, block;
- struct ext4_super_block *es = EXT4_SB(sb)->s_es;
-
- goal = ar->goal;
- if (goal < le32_to_cpu(es->s_first_data_block) ||
- goal >= ext4_blocks_count(es))
- goal = le32_to_cpu(es->s_first_data_block);
-
- ar->len = 0;
- ext4_get_group_no_and_offset(sb, goal, &group, &blkoff);
- for (nr = ext4_get_groups_count(sb); nr > 0; nr--) {
- bitmap_bh = ext4_read_block_bitmap(sb, group);
- if (IS_ERR(bitmap_bh)) {
- *errp = PTR_ERR(bitmap_bh);
- pr_warn("Failed to read block bitmap\n");
- return 0;
- }
-
- while (1) {
- i = mb_find_next_zero_bit(bitmap_bh->b_data, max,
- blkoff);
- if (i >= max)
- break;
- if (ext4_fc_replay_check_excluded(sb,
- ext4_group_first_block_no(sb, group) +
- EXT4_C2B(sbi, i))) {
- blkoff = i + 1;
- } else
- break;
- }
- brelse(bitmap_bh);
- if (i < max)
- break;
-
- if (++group >= ext4_get_groups_count(sb))
- group = 0;
-
- blkoff = 0;
- }
-
- if (i >= max) {
- *errp = -ENOSPC;
- return 0;
- }
-
- block = ext4_group_first_block_no(sb, group) + EXT4_C2B(sbi, i);
- ext4_mb_mark_bb(sb, block, 1, 1);
- ar->len = 1;
-
- return block;
-}
-
static void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
unsigned long count)
{
--
2.30.0
Multiblock allocation reads and writes the block bitmap and group
descriptor which reside on disk. Add kunit stubs to ext4_get_group_desc,
ext4_read_block_bitmap_nowait, ext4_wait_block_bitmap and
ext4_mb_mark_group_bb to avoid real IO to disk.
Signed-off-by: Kemeng Shi <[email protected]>
---
fs/ext4/balloc.c | 16 ++++++++++++++++
fs/ext4/mballoc.c | 6 ++++++
2 files changed, 22 insertions(+)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index c1edde817be8..49411b21a190 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -22,6 +22,7 @@
#include "mballoc.h"
#include <trace/events/ext4.h>
+#include <kunit/static_stub.h>
static unsigned ext4_num_base_meta_clusters(struct super_block *sb,
ext4_group_t block_group);
@@ -274,6 +275,11 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct buffer_head *bh_p;
+#ifdef CONFIG_EXT4_KUNIT_TESTS
+ KUNIT_STATIC_STUB_REDIRECT(ext4_get_group_desc,
+ sb, block_group, bh);
+#endif
+
if (block_group >= ngroups) {
ext4_error(sb, "block_group >= groups_count - block_group = %u,"
" groups_count = %u", block_group, ngroups);
@@ -470,6 +476,11 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group,
ext4_fsblk_t bitmap_blk;
int err;
+#ifdef CONFIG_EXT4_KUNIT_TESTS
+ KUNIT_STATIC_STUB_REDIRECT(ext4_read_block_bitmap_nowait,
+ sb, block_group, ignore_locked);
+#endif
+
desc = ext4_get_group_desc(sb, block_group, NULL);
if (!desc)
return ERR_PTR(-EFSCORRUPTED);
@@ -565,6 +576,11 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group,
{
struct ext4_group_desc *desc;
+#ifdef CONFIG_EXT4_KUNIT_TESTS
+ KUNIT_STATIC_STUB_REDIRECT(ext4_wait_block_bitmap,
+ sb, block_group, bh);
+#endif
+
if (!buffer_new(bh))
return 0;
desc = ext4_get_group_desc(sb, block_group, NULL);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 18713b671e46..c9b7fc0de49e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -17,6 +17,7 @@
#include <linux/nospec.h>
#include <linux/backing-dev.h>
#include <trace/events/ext4.h>
+#include <kunit/static_stub.h>
/*
* MUSTDO:
@@ -3791,6 +3792,11 @@ ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
int err;
unsigned int i, already, changed = len;
+#ifdef CONFIG_EXT4_KUNIT_TESTS
+ KUNIT_STATIC_STUB_REDIRECT(ext4_mb_mark_group_bb,
+ mc, group, blkoff, len, flags);
+#endif
+
mc->changed = 0;
bitmap_bh = ext4_read_block_bitmap(sb, group);
if (IS_ERR(bitmap_bh))
--
2.30.0
Call ext4_mb_mark_group_bb in ext4_group_add_blocks to remove repeated
code that updates the block bitmap and group descriptor on disk.
Note: ext4_group_add_blocks now updates the buddy and the bitmap in two
critical sections instead of in the same critical section.
Originally:
ext4_mb_load_buddy_gfp
ext4_lock_group
mb_clear_bits(bitmap_bh, ...)
mb_free_blocks/ext4_mb_free_metadata
ext4_unlock_group
ext4_mb_unload_buddy
Now:
ext4_mb_load_buddy_gfp
ext4_lock_group
mb_clear_bits(bitmap_bh, ...)
ext4_unlock_group
/* no ext4_mb_init_cache for the same group will be called as
ext4_mb_load_buddy_gfp ensures the buddy page is up-to-date */
ext4_lock_group
mb_free_blocks/ext4_mb_free_metadata
ext4_unlock_group
ext4_mb_unload_buddy
As the buddy page for the group is always up-to-date between
ext4_mb_load_buddy_gfp and ext4_mb_unload_buddy, no ext4_mb_init_cache
will be called for the same group concurrently while we update the
bitmap and buddy page between buddy load and unload.
Signed-off-by: Kemeng Shi <[email protected]>
---
fs/ext4/mballoc.c | 92 +++++++++--------------------------------------
1 file changed, 17 insertions(+), 75 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e4f1b34448e3..18713b671e46 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6397,23 +6397,23 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
ext4_fsblk_t block, unsigned long count)
{
- struct buffer_head *bitmap_bh = NULL;
- struct buffer_head *gd_bh;
+ struct ext4_mark_context mc = {
+ .handle = handle,
+ .sb = sb,
+ .state = 0,
+ };
ext4_group_t block_group;
ext4_grpblk_t bit;
- unsigned int i;
- struct ext4_group_desc *desc;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_buddy e4b;
- int err = 0, ret, free_clusters_count;
- ext4_grpblk_t clusters_freed;
+ int err = 0;
ext4_fsblk_t first_cluster = EXT4_B2C(sbi, block);
ext4_fsblk_t last_cluster = EXT4_B2C(sbi, block + count - 1);
unsigned long cluster_count = last_cluster - first_cluster + 1;
ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);
- if (count == 0)
+ if (cluster_count == 0)
return 0;
ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
@@ -6428,19 +6428,6 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
goto error_return;
}
- bitmap_bh = ext4_read_block_bitmap(sb, block_group);
- if (IS_ERR(bitmap_bh)) {
- err = PTR_ERR(bitmap_bh);
- bitmap_bh = NULL;
- goto error_return;
- }
-
- desc = ext4_get_group_desc(sb, block_group, &gd_bh);
- if (!desc) {
- err = -EIO;
- goto error_return;
- }
-
if (!ext4_sb_block_valid(sb, NULL, block, count)) {
ext4_error(sb, "Adding blocks in system zones - "
"Block = %llu, count = %lu",
@@ -6449,75 +6436,30 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
goto error_return;
}
- BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
- EXT4_JTR_NONE);
- if (err)
- goto error_return;
-
- /*
- * We are about to modify some metadata. Call the journal APIs
- * to unshare ->b_data if a currently-committing transaction is
- * using it
- */
- BUFFER_TRACE(gd_bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, sb, gd_bh, EXT4_JTR_NONE);
+ err = ext4_mb_load_buddy(sb, block_group, &e4b);
if (err)
goto error_return;
- for (i = 0, clusters_freed = 0; i < cluster_count; i++) {
- BUFFER_TRACE(bitmap_bh, "clear bit");
- if (!mb_test_bit(bit + i, bitmap_bh->b_data)) {
- ext4_error(sb, "bit already cleared for block %llu",
- (ext4_fsblk_t)(block + i));
- BUFFER_TRACE(bitmap_bh, "bit already cleared");
- } else {
- clusters_freed++;
- }
- }
+ err = ext4_mb_mark_group_bb(&mc, block_group, bit, cluster_count,
+ EXT4_MB_BITMAP_MARKED_CHECK);
- err = ext4_mb_load_buddy(sb, block_group, &e4b);
- if (err)
+ if (err && mc.changed == 0) {
+ ext4_mb_unload_buddy(&e4b);
goto error_return;
+ }
- /*
- * need to update group_info->bb_free and bitmap
- * with group lock held. generate_buddy look at
- * them with group lock_held
- */
+ if (mc.changed != cluster_count)
+ ext4_error(sb, "bit already cleared in group %u",
+ block_group);
ext4_lock_group(sb, block_group);
- mb_clear_bits(bitmap_bh->b_data, bit, cluster_count);
mb_free_blocks(NULL, &e4b, bit, cluster_count);
- free_clusters_count = clusters_freed +
- ext4_free_group_clusters(sb, desc);
- ext4_free_group_clusters_set(sb, desc, free_clusters_count);
- ext4_block_bitmap_csum_set(sb, desc, bitmap_bh);
- ext4_group_desc_csum_set(sb, block_group, desc);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeclusters_counter,
- clusters_freed);
-
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
- atomic64_add(clusters_freed,
- &sbi_array_rcu_deref(sbi, s_flex_groups,
- flex_group)->free_clusters);
- }
+ mc.changed);
ext4_mb_unload_buddy(&e4b);
- /* We dirtied the bitmap block */
- BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
-
- /* And the group descriptor block */
- BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
- ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
- if (!err)
- err = ret;
-
error_return:
- brelse(bitmap_bh);
ext4_std_error(sb, err);
return err;
}
--
2.30.0
There are several reasons to add a general function to update the block
bitmap and group descriptor on disk:
1. Pair the behavior of allocating and freeing bits. For example,
ext4_mb_new_blocks_simple updates free_clusters in struct flex_groups
via ext4_mb_mark_bb while ext4_free_blocks_simple forgets to do this.
2. Remove repeated code that reads from disk, updates and writes back
to disk.
3. Reduce the number of future unit test mocks needed to catch real IO
that updates structures on disk.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 157 +++++++++++++++++++++++++---------------------
1 file changed, 87 insertions(+), 70 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4322eb7559fc..685dcc17bf7c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3769,6 +3769,86 @@ void ext4_exit_mballoc(void)
ext4_groupinfo_destroy_slabs();
}
+struct ext4_mark_context {
+ struct super_block *sb;
+ int state;
+};
+
+static int
+ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
+ ext4_grpblk_t blkoff, ext4_grpblk_t len)
+{
+ struct super_block *sb = mc->sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct buffer_head *bitmap_bh = NULL;
+ struct ext4_group_desc *gdp;
+ struct buffer_head *gdp_bh;
+ int err;
+ unsigned int i, already, changed;
+
+ bitmap_bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bitmap_bh))
+ return PTR_ERR(bitmap_bh);
+
+ err = -EIO;
+ gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+ if (!gdp)
+ goto out_err;
+
+ ext4_lock_group(sb, group);
+ if (ext4_has_group_desc_csum(sb) &&
+ (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+ gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+ ext4_free_group_clusters_set(sb, gdp,
+ ext4_free_clusters_after_init(sb, group, gdp));
+ }
+
+ already = 0;
+ for (i = 0; i < len; i++)
+ if (mb_test_bit(blkoff + i, bitmap_bh->b_data) ==
+ mc->state)
+ already++;
+ changed = len - already;
+
+ if (mc->state) {
+ mb_set_bits(bitmap_bh->b_data, blkoff, len);
+ ext4_free_group_clusters_set(sb, gdp,
+ ext4_free_group_clusters(sb, gdp) - changed);
+ } else {
+ mb_clear_bits(bitmap_bh->b_data, blkoff, len);
+ ext4_free_group_clusters_set(sb, gdp,
+ ext4_free_group_clusters(sb, gdp) + changed);
+ }
+
+ ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
+ ext4_group_desc_csum_set(sb, group, gdp);
+ ext4_unlock_group(sb, group);
+
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t flex_group = ext4_flex_group(sbi, group);
+ struct flex_groups *fg = sbi_array_rcu_deref(sbi,
+ s_flex_groups, flex_group);
+
+ if (mc->state)
+ atomic64_sub(changed, &fg->free_clusters);
+ else
+ atomic64_add(changed, &fg->free_clusters);
+ }
+
+ err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+ if (err)
+ goto out_err;
+ err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+ if (err)
+ goto out_err;
+
+ sync_dirty_buffer(bitmap_bh);
+ sync_dirty_buffer(gdp_bh);
+
+out_err:
+ brelse(bitmap_bh);
+ return err;
+}
/*
* Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
@@ -3895,15 +3975,15 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
int len, int state)
{
- struct buffer_head *bitmap_bh = NULL;
- struct ext4_group_desc *gdp;
- struct buffer_head *gdp_bh;
+ struct ext4_mark_context mc = {
+ .sb = sb,
+ .state = state,
+ };
struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_group_t group;
ext4_grpblk_t blkoff;
- int i, err;
- int already;
- unsigned int clen, clen_changed, thisgrp_len;
+ int err;
+ unsigned int clen, thisgrp_len;
while (len > 0) {
ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
@@ -3924,80 +4004,17 @@ void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
ext4_error(sb, "Marking blocks in system zone - "
"Block = %llu, len = %u",
block, thisgrp_len);
- bitmap_bh = NULL;
break;
}
- bitmap_bh = ext4_read_block_bitmap(sb, group);
- if (IS_ERR(bitmap_bh)) {
- err = PTR_ERR(bitmap_bh);
- bitmap_bh = NULL;
- break;
- }
-
- err = -EIO;
- gdp = ext4_get_group_desc(sb, group, &gdp_bh);
- if (!gdp)
- break;
-
- ext4_lock_group(sb, group);
- already = 0;
- for (i = 0; i < clen; i++)
- if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) ==
- !state)
- already++;
-
- clen_changed = clen - already;
- if (state)
- mb_set_bits(bitmap_bh->b_data, blkoff, clen);
- else
- mb_clear_bits(bitmap_bh->b_data, blkoff, clen);
- if (ext4_has_group_desc_csum(sb) &&
- (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
- gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
- ext4_free_group_clusters_set(sb, gdp,
- ext4_free_clusters_after_init(sb, group, gdp));
- }
- if (state)
- clen = ext4_free_group_clusters(sb, gdp) - clen_changed;
- else
- clen = ext4_free_group_clusters(sb, gdp) + clen_changed;
-
- ext4_free_group_clusters_set(sb, gdp, clen);
- ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
- ext4_group_desc_csum_set(sb, group, gdp);
-
- ext4_unlock_group(sb, group);
-
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t flex_group = ext4_flex_group(sbi, group);
- struct flex_groups *fg = sbi_array_rcu_deref(sbi,
- s_flex_groups, flex_group);
-
- if (state)
- atomic64_sub(clen_changed, &fg->free_clusters);
- else
- atomic64_add(clen_changed, &fg->free_clusters);
-
- }
-
- err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
- if (err)
- break;
- sync_dirty_buffer(bitmap_bh);
- err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
- sync_dirty_buffer(gdp_bh);
+ err = ext4_mb_mark_group_bb(&mc, group, blkoff, clen);
if (err)
break;
block += thisgrp_len;
len -= thisgrp_len;
- brelse(bitmap_bh);
BUG_ON(len < 0);
}
-
- if (err)
- brelse(bitmap_bh);
}
/*
--
2.30.0
Stripe is used in block units in some code paths and in cluster units
in others. A user aware of stripe may not be aware of the bigalloc
feature, so treat stripe only in block units to fix this.
Besides, it's hard to get stripe-aligned blocks (start and length both
aligned with the stripe) if the stripe is not aligned with the cluster
size, so just disable stripe and alert the user in this case to simplify
the code and avoid unnecessary work to get stripe-aligned blocks which
would likely fail.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 18 +++++++++++-------
fs/ext4/super.c | 13 +++++++++++++
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f6dc4f276ca4..7ef95bdde91d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2207,7 +2207,8 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
ac->ac_g_ex.fe_len, &ex);
ex.fe_logical = 0xDEADFA11; /* debug value */
- if (max >= ac->ac_g_ex.fe_len && ac->ac_g_ex.fe_len == sbi->s_stripe) {
+ if (max >= ac->ac_g_ex.fe_len &&
+ ac->ac_g_ex.fe_len == EXT4_B2C(sbi, sbi->s_stripe)) {
ext4_fsblk_t start;
start = ext4_grp_offs_to_block(ac->ac_sb, &ex);
@@ -2372,7 +2373,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
struct ext4_free_extent ex;
ext4_fsblk_t first_group_block;
ext4_fsblk_t a;
- ext4_grpblk_t i;
+ ext4_grpblk_t i, stripe;
int max;
BUG_ON(sbi->s_stripe == 0);
@@ -2384,10 +2385,12 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
do_div(a, sbi->s_stripe);
i = (a * sbi->s_stripe) - first_group_block;
+ stripe = EXT4_B2C(sbi, sbi->s_stripe);
+ i = EXT4_B2C(sbi, i);
while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
if (!mb_test_bit(i, bitmap)) {
- max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
- if (max >= sbi->s_stripe) {
+ max = mb_find_extent(e4b, i, stripe, &ex);
+ if (max >= stripe) {
ac->ac_found++;
ex.fe_logical = 0xDEADF00D; /* debug value */
ac->ac_b_ex = ex;
@@ -2395,7 +2398,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
break;
}
}
- i += sbi->s_stripe;
+ i += stripe;
}
}
@@ -2758,7 +2761,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if (cr == 0)
ext4_mb_simple_scan_group(ac, &e4b);
else if (cr == 1 && sbi->s_stripe &&
- !(ac->ac_g_ex.fe_len % sbi->s_stripe))
+ !(ac->ac_g_ex.fe_len %
+ EXT4_B2C(sbi, sbi->s_stripe)))
ext4_mb_scan_aligned(ac, &e4b);
else
ext4_mb_complex_scan_group(ac, &e4b);
@@ -3477,7 +3481,7 @@ int ext4_mb_init(struct super_block *sb)
*/
if (sbi->s_stripe > 1) {
sbi->s_mb_group_prealloc = roundup(
- sbi->s_mb_group_prealloc, sbi->s_stripe);
+ sbi->s_mb_group_prealloc, EXT4_B2C(sbi, sbi->s_stripe));
}
sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56a5d1c469fc..e4441836239e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5297,6 +5297,19 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
goto failed_mount3;
sbi->s_stripe = ext4_get_stripe_size(sbi);
+ /*
+ * It's hard to get stripe aligned blocks if stripe is not aligned with
+ * cluster, just disable stripe and alert user to simpfy code and avoid
+ * stripe aligned allocation which will rarely successes.
+ */
+ if (sbi->s_stripe > 0 && sbi->s_cluster_ratio > 1 &&
+ sbi->s_stripe % sbi->s_cluster_ratio != 0) {
+ ext4_msg(sb, KERN_WARNING,
+ "stripe (%lu) is not aligned with cluster size (%u), "
+ "stripe is disabled",
+ sbi->s_stripe, sbi->s_cluster_ratio);
+ sbi->s_stripe = 0;
+ }
sbi->s_extent_max_zeroout_kb = 32;
/*
--
2.30.0
Previously, ext4_mb_mark_group_bb was only called in the fast commit
replay path, so there was no valid handle when we update the block
bitmap and group descriptor. This patch extends ext4_mb_mark_group_bb
so it can be used by code running under the journal. There are several
improvements:
1. Add "handle_t *handle" to struct ext4_mark_context to accept a handle
to journal the block bitmap and group descriptor updates inside
ext4_mb_mark_group_bb (the added journal code is based on the journal
code in ext4_mb_mark_diskspace_used where ext4_mb_mark_group_bb
is going to be used).
2. Add an EXT4_MB_BITMAP_MARKED_CHECK flag to control whether to check
if bits in the block bitmap are already marked, as allocation code under
the journal asserts that none of the bits to be changed are marked before.
3. Add "ext4_grpblk_t changed" to struct ext4_mark_context to report the
number of bits in the block bitmap that were changed.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 59 ++++++++++++++++++++++++++++++++++++-----------
1 file changed, 45 insertions(+), 14 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dae4533411f7..e51cb0add4eb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3769,32 +3769,54 @@ void ext4_exit_mballoc(void)
ext4_groupinfo_destroy_slabs();
}
+#define EXT4_MB_BITMAP_MARKED_CHECK 0x0001
+#define EXT4_MB_SYNC_UPDATE 0x0002
struct ext4_mark_context {
+ handle_t *handle;
struct super_block *sb;
int state;
+ ext4_grpblk_t changed;
};
static int
ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
- ext4_grpblk_t blkoff, ext4_grpblk_t len)
+ ext4_grpblk_t blkoff, ext4_grpblk_t len, int flags)
{
+ handle_t *handle = mc->handle;
struct super_block *sb = mc->sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct buffer_head *bitmap_bh = NULL;
struct ext4_group_desc *gdp;
struct buffer_head *gdp_bh;
int err;
- unsigned int i, already, changed;
+ unsigned int i, already, changed = len;
+ mc->changed = 0;
bitmap_bh = ext4_read_block_bitmap(sb, group);
if (IS_ERR(bitmap_bh))
return PTR_ERR(bitmap_bh);
+ if (handle) {
+ BUFFER_TRACE(bitmap_bh, "getting write access");
+ err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
+ EXT4_JTR_NONE);
+ if (err)
+ goto out_err;
+ }
+
err = -EIO;
gdp = ext4_get_group_desc(sb, group, &gdp_bh);
if (!gdp)
goto out_err;
+ if (handle) {
+ BUFFER_TRACE(gdp_bh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, sb, gdp_bh,
+ EXT4_JTR_NONE);
+ if (err)
+ goto out_err;
+ }
+
ext4_lock_group(sb, group);
if (ext4_has_group_desc_csum(sb) &&
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
@@ -3805,12 +3827,14 @@ ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
ext4_free_clusters_after_init(sb, group, gdp));
}
- already = 0;
- for (i = 0; i < len; i++)
- if (mb_test_bit(blkoff + i, bitmap_bh->b_data) ==
- mc->state)
- already++;
- changed = len - already;
+ if (flags & EXT4_MB_BITMAP_MARKED_CHECK) {
+ already = 0;
+ for (i = 0; i < len; i++)
+ if (mb_test_bit(blkoff + i, bitmap_bh->b_data) ==
+ mc->state)
+ already++;
+ changed = len - already;
+ }
if (mc->state) {
mb_set_bits(bitmap_bh->b_data, blkoff, len);
@@ -3825,6 +3849,7 @@ ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
ext4_group_desc_csum_set(sb, group, gdp);
ext4_unlock_group(sb, group);
+ mc->changed = changed;
if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi, group);
@@ -3837,15 +3862,17 @@ ext4_mb_mark_group_bb(struct ext4_mark_context *mc, ext4_group_t group,
atomic64_add(changed, &fg->free_clusters);
}
- err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+ err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
if (err)
goto out_err;
- err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+ err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
if (err)
goto out_err;
- sync_dirty_buffer(bitmap_bh);
- sync_dirty_buffer(gdp_bh);
+ if (flags & EXT4_MB_SYNC_UPDATE) {
+ sync_dirty_buffer(bitmap_bh);
+ sync_dirty_buffer(gdp_bh);
+ }
out_err:
brelse(bitmap_bh);
@@ -4009,7 +4036,9 @@ void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
break;
}
- err = ext4_mb_mark_group_bb(&mc, group, blkoff, clen);
+ err = ext4_mb_mark_group_bb(&mc, group, blkoff, clen,
+ EXT4_MB_BITMAP_MARKED_CHECK |
+ EXT4_MB_SYNC_UPDATE);
if (err)
break;
@@ -6131,7 +6160,9 @@ static void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
ext4_grpblk_t blkoff;
ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
- ext4_mb_mark_group_bb(&mc, group, blkoff, count);
+ ext4_mb_mark_group_bb(&mc, group, blkoff, count,
+ EXT4_MB_BITMAP_MARKED_CHECK |
+ EXT4_MB_SYNC_UPDATE);
}
/**
--
2.30.0
Here is the preparation work:
1. Include mballoc-test.c in mballoc.c to be able to test static
functions in mballoc.c.
2. Implement static stubs to avoid read IO to disk.
3. Construct a fake super_block. Only some members are set; more members
will be set when more functions are tested.
Then a unit test for ext4_mb_new_blocks_simple is added.
Signed-off-by: Kemeng Shi <[email protected]>
---
fs/ext4/mballoc-test.c | 323 +++++++++++++++++++++++++++++++++++++++++
fs/ext4/mballoc.c | 4 +
2 files changed, 327 insertions(+)
create mode 100644 fs/ext4/mballoc-test.c
diff --git a/fs/ext4/mballoc-test.c b/fs/ext4/mballoc-test.c
new file mode 100644
index 000000000000..4bb131fc501b
--- /dev/null
+++ b/fs/ext4/mballoc-test.c
@@ -0,0 +1,323 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit test of ext4 multiblocks allocation.
+ */
+
+#include <kunit/test.h>
+#include <kunit/static_stub.h>
+
+#include "ext4.h"
+
+struct mb_grp_ctx {
+ struct buffer_head bitmap_bh;
+ struct ext4_group_desc desc;
+ /* one group descriptor for each group descriptor for simplicity */
+ struct buffer_head gd_bh;
+};
+
+struct mb_ctx {
+ struct mb_grp_ctx *grp_ctx;
+};
+
+struct fake_super_block {
+ struct super_block sb;
+ struct mb_ctx mb_ctx;
+};
+
+#define MB_CTX(_sb) (&(container_of((_sb), struct fake_super_block, sb)->mb_ctx))
+#define MB_GRP_CTX(_sb, _group) (&MB_CTX(_sb)->grp_ctx[_group])
+
+static struct super_block *alloc_fake_super_block(void)
+{
+ struct ext4_super_block *es = kzalloc(sizeof(*es), GFP_KERNEL);
+ struct ext4_sb_info *sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+ struct fake_super_block *fsb = kzalloc(sizeof(*fsb), GFP_KERNEL);
+
+ if (fsb == NULL || sbi == NULL || es == NULL)
+ goto out;
+
+ sbi->s_es = es;
+ fsb->sb.s_fs_info = sbi;
+ return &fsb->sb;
+
+out:
+ kfree(fsb);
+ kfree(sbi);
+ kfree(es);
+ return NULL;
+}
+
+static void free_fake_super_block(struct super_block *sb)
+{
+ struct fake_super_block *fsb = container_of(sb, struct fake_super_block, sb);
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ kfree(sbi->s_es);
+ kfree(sbi);
+ kfree(fsb);
+}
+
+struct ext4_block_layout {
+ unsigned char blocksize_bits;
+ unsigned int cluster_bits;
+ unsigned long blocks_per_group;
+ ext4_group_t group_count;
+ unsigned long desc_size;
+};
+
+static void init_sb_layout(struct super_block *sb,
+ struct ext4_block_layout *layout)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_super_block *es = sbi->s_es;
+
+ sb->s_blocksize = 1UL << layout->blocksize_bits;
+ sb->s_blocksize_bits = layout->blocksize_bits;
+
+ sbi->s_groups_count = layout->group_count;
+ sbi->s_blocks_per_group = layout->blocks_per_group;
+ sbi->s_cluster_bits = layout->cluster_bits;
+ sbi->s_cluster_ratio = 1U << layout->cluster_bits;
+ sbi->s_clusters_per_group = layout->blocks_per_group >>
+ layout->cluster_bits;
+ sbi->s_desc_size = layout->desc_size;
+
+ es->s_first_data_block = cpu_to_le32(0);
+ es->s_blocks_count_lo = cpu_to_le32(layout->blocks_per_group *
+ layout->group_count);
+}
+
+static int mb_grp_ctx_init(struct super_block *sb,
+ struct mb_grp_ctx *grp_ctx)
+{
+ grp_ctx->bitmap_bh.b_data = kzalloc(EXT4_BLOCK_SIZE(sb), GFP_KERNEL);
+ if (grp_ctx->bitmap_bh.b_data == NULL)
+ return -ENOMEM;
+
+ get_bh(&grp_ctx->bitmap_bh);
+ get_bh(&grp_ctx->gd_bh);
+ return 0;
+}
+
+static void mb_grp_ctx_release(struct mb_grp_ctx *grp_ctx)
+{
+ kfree(grp_ctx->bitmap_bh.b_data);
+ grp_ctx->bitmap_bh.b_data = NULL;
+}
+
+static void mb_ctx_mark_used(struct super_block *sb, ext4_group_t group,
+ unsigned int start, unsigned int len)
+{
+ struct mb_grp_ctx *grp_ctx = MB_GRP_CTX(sb, group);
+
+ mb_set_bits(grp_ctx->bitmap_bh.b_data, start, len);
+}
+
+/* called after init_sb_layout */
+static int mb_ctx_init(struct super_block *sb)
+{
+ struct mb_ctx *ctx = MB_CTX(sb);
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);
+
+ ctx->grp_ctx = kcalloc(ngroups, sizeof(struct mb_grp_ctx),
+ GFP_KERNEL);
+ if (ctx->grp_ctx == NULL)
+ return -ENOMEM;
+
+ for (i = 0; i < ngroups; i++)
+ if (mb_grp_ctx_init(sb, &ctx->grp_ctx[i]))
+ goto out;
+
+ /*
+ * first data block(first cluster in first group) is used by
+ * metadata, mark it used to avoid to alloc data block at first
+ * block which will fail ext4_sb_block_valid check.
+ */
+ mb_set_bits(ctx->grp_ctx[0].bitmap_bh.b_data, 0, 1);
+
+ return 0;
+out:
+ while (i-- > 0)
+ mb_grp_ctx_release(&ctx->grp_ctx[i]);
+ kfree(ctx->grp_ctx);
+ return -ENOMEM;
+}
+
+static void mb_ctx_release(struct super_block *sb)
+{
+ struct mb_ctx *ctx = MB_CTX(sb);
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);
+
+ for (i = 0; i < ngroups; i++)
+ mb_grp_ctx_release(&ctx->grp_ctx[i]);
+ kfree(ctx->grp_ctx);
+}
+
+static struct buffer_head *
+ext4_read_block_bitmap_nowait_stub(struct super_block *sb, ext4_group_t block_group,
+ bool ignore_locked)
+{
+ struct mb_grp_ctx *grp_ctx = MB_GRP_CTX(sb, block_group);
+
+ get_bh(&grp_ctx->bitmap_bh);
+ return &grp_ctx->bitmap_bh;
+}
+
+static int ext4_wait_block_bitmap_stub(struct super_block *sb,
+ ext4_group_t block_group,
+ struct buffer_head *bh)
+{
+ return 0;
+}
+
+static struct ext4_group_desc *
+ext4_get_group_desc_stub(struct super_block *sb, ext4_group_t block_group,
+ struct buffer_head **bh)
+{
+ struct mb_grp_ctx *grp_ctx = MB_GRP_CTX(sb, block_group);
+
+ if (bh != NULL)
+ *bh = &grp_ctx->gd_bh;
+
+ return &grp_ctx->desc;
+}
+
+static int ext4_mb_mark_group_bb_stub(struct ext4_mark_context *mc,
+ ext4_group_t group, ext4_grpblk_t blkoff,
+ ext4_grpblk_t len, int flags)
+{
+ struct mb_grp_ctx *grp_ctx = MB_GRP_CTX(mc->sb, group);
+ struct buffer_head *bitmap_bh = &grp_ctx->bitmap_bh;
+
+ if (mc->state)
+ mb_set_bits(bitmap_bh->b_data, blkoff, len);
+ else
+ mb_clear_bits(bitmap_bh->b_data, blkoff, len);
+
+ return 0;
+}
+
+#define TEST_BLOCKSIZE_BITS 10
+#define TEST_CLUSTER_BITS 3
+#define TEST_BLOCKS_PER_GROUP 8192
+#define TEST_GROUP_COUNT 4
+#define TEST_DESC_SIZE 64
+#define TEST_GOAL_GROUP 1
+static int mballoc_test_init(struct kunit *test)
+{
+ struct ext4_block_layout layout = {
+ .blocksize_bits = TEST_BLOCKSIZE_BITS,
+ .cluster_bits = TEST_CLUSTER_BITS,
+ .blocks_per_group = TEST_BLOCKS_PER_GROUP,
+ .group_count = TEST_GROUP_COUNT,
+ .desc_size = TEST_DESC_SIZE,
+ };
+ struct super_block *sb;
+ int ret;
+
+ sb = alloc_fake_super_block();
+ if (sb == NULL)
+ return -ENOMEM;
+
+ init_sb_layout(sb, &layout);
+
+ ret = mb_ctx_init(sb);
+ if (ret != 0) {
+ free_fake_super_block(sb);
+ return ret;
+ }
+
+ test->priv = sb;
+ kunit_activate_static_stub(test,
+ ext4_read_block_bitmap_nowait,
+ ext4_read_block_bitmap_nowait_stub);
+ kunit_activate_static_stub(test,
+ ext4_wait_block_bitmap,
+ ext4_wait_block_bitmap_stub);
+ kunit_activate_static_stub(test,
+ ext4_get_group_desc,
+ ext4_get_group_desc_stub);
+ kunit_activate_static_stub(test,
+ ext4_mb_mark_group_bb,
+ ext4_mb_mark_group_bb_stub);
+ return 0;
+}
+
+static void mballoc_test_exit(struct kunit *test)
+{
+ struct super_block *sb = (struct super_block *)test->priv;
+
+ mb_ctx_release(sb);
+ free_fake_super_block(sb);
+}
+
+static void test_new_blocks_simple(struct kunit *test)
+{
+ struct super_block *sb = (struct super_block *)test->priv;
+ struct inode inode = { .i_sb = sb, };
+ struct ext4_allocation_request ar;
+ ext4_group_t i, goal_group = TEST_GOAL_GROUP;
+ int err = 0;
+ ext4_fsblk_t found;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ ar.inode = &inode;
+
+ /* get block at goal */
+ ar.goal = ext4_group_first_block_no(sb, goal_group);
+ found = ext4_mb_new_blocks_simple(&ar, &err);
+ KUNIT_ASSERT_EQ_MSG(test, ar.goal, found,
+ "failed to alloc block at goal, expected %llu found %llu",
+ ar.goal, found);
+
+ /* get block after goal in goal group */
+ ar.goal = ext4_group_first_block_no(sb, goal_group);
+ found = ext4_mb_new_blocks_simple(&ar, &err);
+ KUNIT_ASSERT_EQ_MSG(test, ar.goal + EXT4_C2B(sbi, 1), found,
+ "failed to alloc block after goal in goal group, expected %llu found %llu",
+ ar.goal + 1, found);
+
+ /* get block after goal group */
+ mb_ctx_mark_used(sb, goal_group, 0, EXT4_CLUSTERS_PER_GROUP(sb));
+ ar.goal = ext4_group_first_block_no(sb, goal_group);
+ found = ext4_mb_new_blocks_simple(&ar, &err);
+ KUNIT_ASSERT_EQ_MSG(test,
+ ext4_group_first_block_no(sb, goal_group + 1), found,
+ "failed to alloc block after goal group, expected %llu found %llu",
+ ext4_group_first_block_no(sb, goal_group + 1), found);
+
+ /* get block before goal group */
+ for (i = goal_group; i < ext4_get_groups_count(sb); i++)
+ mb_ctx_mark_used(sb, i, 0, EXT4_CLUSTERS_PER_GROUP(sb));
+ ar.goal = ext4_group_first_block_no(sb, goal_group);
+ found = ext4_mb_new_blocks_simple(&ar, &err);
+ KUNIT_ASSERT_EQ_MSG(test,
+ ext4_group_first_block_no(sb, 0) + EXT4_C2B(sbi, 1), found,
+ "failed to alloc block before goal group, expected %llu found %llu",
+ ext4_group_first_block_no(sb, 0 + EXT4_C2B(sbi, 1)), found);
+
+ /* no block available, fail to allocate block */
+ for (i = 0; i < ext4_get_groups_count(sb); i++)
+ mb_ctx_mark_used(sb, i, 0, EXT4_CLUSTERS_PER_GROUP(sb));
+ ar.goal = ext4_group_first_block_no(sb, goal_group);
+ found = ext4_mb_new_blocks_simple(&ar, &err);
+ KUNIT_ASSERT_NE_MSG(test, err, 0,
+ "unexpectedly get block when no block is available");
+}
+
+
+static struct kunit_case ext4_mballoc_test_cases[] = {
+ KUNIT_CASE(test_new_blocks_simple),
+ {}
+};
+
+static struct kunit_suite ext4_mballoc_test_suite = {
+ .name = "ext4_mballoc_test",
+ .init = mballoc_test_init,
+ .exit = mballoc_test_exit,
+ .test_cases = ext4_mballoc_test_cases,
+};
+
+kunit_test_suites(&ext4_mballoc_test_suite);
+
+MODULE_LICENSE("GPL v2");
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c9b7fc0de49e..fb766d12cdf9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6764,3 +6764,7 @@ ext4_mballoc_query_range(
return error;
}
+
+#ifdef CONFIG_EXT4_KUNIT_TESTS
+#include "mballoc-test.c"
+#endif
--
2.30.0
Call ext4_mb_mark_group_bb in ext4_mb_clear_bb to remove repeated code
that updates the block bitmap and group descriptor on disk.
Note: ext4_mb_clear_bb now updates the buddy and the bitmap in two
critical sections instead of in the same critical section.
The original lock behavior was introduced in 7a2fcbf7f857 ("ext4: don't
use blocks freed but not yet committed in buddy cache init") to avoid a
race between ext4_mb_free_blocks and ext4_mb_init_cache:
ext4_mb_load_buddy_gfp
ext4_lock_group
mb_clear_bits(bitmap_bh, ...)
mb_free_blocks/ext4_mb_free_metadata
ext4_unlock_group
ext4_mb_unload_buddy
New lock behavior in this patch:
ext4_mb_load_buddy_gfp
ext4_lock_group
mb_clear_bits(bitmap_bh, ...)
ext4_unlock_group
/* no ext4_mb_init_cache for the same group will be called as
ext4_mb_load_buddy_gfp ensures the buddy page is up-to-date */
ext4_lock_group
mb_free_blocks/ext4_mb_free_metadata
ext4_unlock_group
ext4_mb_unload_buddy
As the buddy page for the group is always up-to-date between
ext4_mb_load_buddy_gfp and ext4_mb_unload_buddy, no ext4_mb_init_cache
will be called for the same group concurrently while we update the
bitmap and buddy page between buddy load and unload.
Signed-off-by: Kemeng Shi <[email protected]>
---
fs/ext4/mballoc.c | 90 ++++++++++++-----------------------------------
1 file changed, 23 insertions(+), 67 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 46b37f5c9223..e4f1b34448e3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6135,19 +6135,21 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
ext4_fsblk_t block, unsigned long count,
int flags)
{
- struct buffer_head *bitmap_bh = NULL;
+ struct ext4_mark_context mc = {
+ .handle = handle,
+ .sb = inode->i_sb,
+ .state = 0,
+ };
struct super_block *sb = inode->i_sb;
- struct ext4_group_desc *gdp;
struct ext4_group_info *grp;
unsigned int overflow;
ext4_grpblk_t bit;
- struct buffer_head *gd_bh;
ext4_group_t block_group;
struct ext4_sb_info *sbi;
struct ext4_buddy e4b;
unsigned int count_clusters;
int err = 0;
- int ret;
+ int mark_flags = 0;
sbi = EXT4_SB(sb);
@@ -6179,18 +6181,6 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
/* The range changed so it's no longer validated */
flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
}
- count_clusters = EXT4_NUM_B2C(sbi, count);
- bitmap_bh = ext4_read_block_bitmap(sb, block_group);
- if (IS_ERR(bitmap_bh)) {
- err = PTR_ERR(bitmap_bh);
- bitmap_bh = NULL;
- goto error_return;
- }
- gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
- if (!gdp) {
- err = -EIO;
- goto error_return;
- }
if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
!ext4_inode_block_valid(inode, block, count)) {
@@ -6200,28 +6190,7 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
goto error_return;
}
- BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
- EXT4_JTR_NONE);
- if (err)
- goto error_return;
-
- /*
- * We are about to modify some metadata. Call the journal APIs
- * to unshare ->b_data if a currently-committing transaction is
- * using it
- */
- BUFFER_TRACE(gd_bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, sb, gd_bh, EXT4_JTR_NONE);
- if (err)
- goto error_return;
-#ifdef AGGRESSIVE_CHECK
- {
- int i;
- for (i = 0; i < count_clusters; i++)
- BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
- }
-#endif
+ count_clusters = EXT4_NUM_B2C(sbi, count);
trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
@@ -6230,6 +6199,22 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
if (err)
goto error_return;
+#ifdef AGGRESSIVE_CHECK
+ mark_flags |= EXT4_MB_BITMAP_MARKED_CHECK;
+#endif
+ err = ext4_mb_mark_group_bb(&mc, block_group, bit, count_clusters,
+ mark_flags);
+
+
+ if (err && mc.changed == 0) {
+ ext4_mb_unload_buddy(&e4b);
+ goto error_return;
+ }
+
+#ifdef AGGRESSIVE_CHECK
+ BUG_ON(mc.changed != count_clusters);
+#endif
+
/*
* We need to make sure we don't reuse the freed block until after the
* transaction is committed. We make an exception if the inode is to be
@@ -6252,13 +6237,8 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
new_entry->efd_tid = handle->h_transaction->t_tid;
ext4_lock_group(sb, block_group);
- mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
ext4_mb_free_metadata(handle, &e4b, new_entry);
} else {
- /* need to update group_info->bb_free and bitmap
- * with group lock held. generate_buddy look at
- * them with group lock_held
- */
if (test_opt(sb, DISCARD)) {
err = ext4_issue_discard(sb, block_group, bit,
count_clusters, NULL);
@@ -6271,23 +6251,11 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);
ext4_lock_group(sb, block_group);
- mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
mb_free_blocks(inode, &e4b, bit, count_clusters);
}
- ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
- ext4_free_group_clusters_set(sb, gdp, ret);
- ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
- ext4_group_desc_csum_set(sb, block_group, gdp);
ext4_unlock_group(sb, block_group);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
- atomic64_add(count_clusters,
- &sbi_array_rcu_deref(sbi, s_flex_groups,
- flex_group)->free_clusters);
- }
-
/*
* on a bigalloc file system, defer the s_freeclusters_counter
* update to the caller (ext4_remove_space and friends) so they
@@ -6302,26 +6270,14 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
ext4_mb_unload_buddy(&e4b);
- /* We dirtied the bitmap block */
- BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
-
- /* And the group descriptor block */
- BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
- ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
- if (!err)
- err = ret;
-
if (overflow && !err) {
block += count;
count = overflow;
- put_bh(bitmap_bh);
/* The range changed so it's no longer validated */
flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
goto do_more;
}
error_return:
- brelse(bitmap_bh);
ext4_std_error(sb, err);
return;
}
--
2.30.0
ext4_mb_new_blocks_simple ignores the groups before the goal group, so it
will fail if the only free blocks reside in a group before the goal. Try
all groups to avoid this unexpected failure.
The search finishes either when a free block is found or when no available
blocks are left. Simply check "i >= max" to distinguish the two cases.
Signed-off-by: Kemeng Shi <[email protected]>
Suggested-by: Theodore Ts'o <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 839db9c46aea..bcc9622daa4a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6047,7 +6047,7 @@ static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
struct buffer_head *bitmap_bh;
struct super_block *sb = ar->inode->i_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- ext4_group_t group;
+ ext4_group_t group, nr;
ext4_grpblk_t blkoff;
ext4_grpblk_t max = EXT4_CLUSTERS_PER_GROUP(sb);
ext4_grpblk_t i = 0;
@@ -6061,7 +6061,7 @@ static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
ar->len = 0;
ext4_get_group_no_and_offset(sb, goal, &group, &blkoff);
- for (; group < ext4_get_groups_count(sb); group++) {
+ for (nr = ext4_get_groups_count(sb); nr > 0; nr--) {
bitmap_bh = ext4_read_block_bitmap(sb, group);
if (IS_ERR(bitmap_bh)) {
*errp = PTR_ERR(bitmap_bh);
@@ -6085,10 +6085,13 @@ static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
if (i < max)
break;
+ if (++group >= ext4_get_groups_count(sb))
+ group = 0;
+
blkoff = 0;
}
- if (group >= ext4_get_groups_count(sb) || i >= max) {
+ if (i >= max) {
*errp = -ENOSPC;
return 0;
}
--
2.30.0
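To illustrate the search order this patch produces, here is a small
standalone C toy (an editorial illustration, not kernel code; NGROUPS,
free_clusters and find_group are made-up names): start at the goal group,
wrap around to group 0, and give up only after every group has been
visited once.

#include <stdio.h>

#define NGROUPS 8

/* free_clusters[g] stands in for "group g has a free cluster". */
static const int free_clusters[NGROUPS] = { 1, 0, 0, 0, 0, 0, 0, 0 };

static int find_group(unsigned int goal)
{
    unsigned int group = goal, nr;

    for (nr = NGROUPS; nr > 0; nr--) {
        if (free_clusters[group])
            return group;       /* models the "i < max" break */
        if (++group >= NGROUPS)
            group = 0;          /* wrap, which the old loop never did */
    }
    return -1;                  /* models returning -ENOSPC */
}

int main(void)
{
    /* Goal group 5 is full; only group 0 has space.  The old loop
     * (goal..last group) would fail, the new one wraps and finds it. */
    printf("found group %d\n", find_group(5));
    return 0;
}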
Function ext4_issue_discard needs the count in clusters. Pass
count_clusters instead of count to fix the mismatch.
Signed-off-by: Kemeng Shi <[email protected]>
Reviewed-by: Ojaswin Mujoo <[email protected]>
---
fs/ext4/mballoc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 1345c4e84cc3..474aebfdc1dd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -6280,8 +6280,8 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
* them with group lock_held
*/
if (test_opt(sb, DISCARD)) {
- err = ext4_issue_discard(sb, block_group, bit, count,
- NULL);
+ err = ext4_issue_discard(sb, block_group, bit,
+ count_clusters, NULL);
if (err && err != -EOPNOTSUPP)
ext4_msg(sb, KERN_WARNING, "discard request in"
" group:%u block:%d count:%lu failed"
--
2.30.0
on 6/6/2023 5:39 PM, Ojaswin Mujoo wrote:
> On Sat, Jun 03, 2023 at 11:03:24PM +0800, Kemeng Shi wrote:
>> call ext4_mb_mark_group_bb in ext4_mb_clear_bb to remove repeat code
>> to update block bitmap and group descriptor on disk.
>>
>> Note: ext4_mb_clear_bb will update buddy and bitmap in two critical sections
>> instead of update in the same critical section.
>>
>> Original lock behavior introduced in 7a2fcbf7f857 ("ext4: don't use
>> blocks freed but not yet committed in buddy cache init") to avoid
>> race betwwen ext4_mb_free_blocks and ext4_mb_init_cache:
>> ext4_mb_load_buddy_gfp
>> ext4_lock_group
>> mb_clear_bits(bitmap_bh, ...)
>> mb_free_blocks/ext4_mb_free_metadata
>> ext4_unlock_group
>> ext4_mb_unload_buddy
>>
>> New lock behavior in this patch:
>> ext4_mb_load_buddy_gfp
>> ext4_lock_group
>> mb_clear_bits(bitmap_bh, ...)
>> ext4_unlock_group
>>
>> /* no ext4_mb_init_cache for the same group will be called as
>> ext4_mb_load_buddy_gfp will ensure buddy page is update-to-date */
>>
>> ext4_lock_group
>> mb_free_blocks/ext4_mb_free_metadata
>> ext4_unlock_group
>> ext4_mb_unload_buddy
>>
>> As buddy page for group is always update-to-date between
>> ext4_mb_load_buddy_gfp and ext4_mb_unload_buddy. Then no
>> ext4_mb_init_cache will be called for the same group concurrentlly when
>> we update bitmap and buddy page betwwen buddy load and unload.
>>
>> Signed-off-by: Kemeng Shi <[email protected]>
>
> Hi Kemeng,
>
> Sorry for the late reply I was trying to understand the codepath
> properly. So I have a question here:
>
> With the changes you've made in the patch, the flow would look something
> like:
>
> ext4_mb_clear_bb():
> ext4_mb_mark_group_bb():
> ext4_group_lock()
> - Mark bitmap free
> - Modify gdp
> ext4_group_unlock()
> ext4_handle_dirty_metadata()
> - I understand this will add the bitmap and gdp buffers to journal's
> dirty metadata list
> ...
> ext4_group_lock()
> ext4_mb_free_metadata()
> - Add ext4_free_data entries to sbi->s_freed_data_list. (On commit
> ext4_journal_commit_callback() will then free the buddy for these)
> ext4_group_unlock()
>
> My question is what happens if journal commits between
> ext4_handle_dirty_metadata() and ext4_mb_free_metadata() call (Possible?). Then we might
> never end up freeing the metadata in the buddy bitmap because the commit callback wont
> be able to find the ext4_free_data entries in sbi->s_freed_data_list.
>
> Regards,
> ojaswin
>
Hi Ojaswin, thanks for the reply. To my knowledge, the commit is normally
done only after the handle is stopped, as follows:

ext4_journal_start_sb
  start_this_handle
    read_lock(&journal->j_state_lock);
    atomic_inc(&transaction->t_updates);
    read_unlock(&journal->j_state_lock);

ext4_journal_stop
  jbd2_journal_stop
    stop_this_handle
      if (atomic_dec_and_test(&transaction->t_updates))
        wake_up(&journal->j_wait_updates);

jbd2_journal_commit_transaction
  jbd2_journal_wait_updates
    while (1)
      if (!atomic_read(&transaction->t_updates))
        /* break loop */
  ...
  if (journal->j_commit_callback)
    journal->j_commit_callback(journal, commit_transaction);

So the commit of the transaction cannot happen between
ext4_handle_dirty_metadata and ext4_mb_free_metadata, as both run under
the same still-running handle.
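To make the ordering argument above concrete, here is a minimal userspace
sketch (an editorial toy model, not kernel code; t_updates, dirtied and
freed merely mimic the kernel structures): the commit side waits for the
running-handle count to drop to zero before invoking its callback, so two
steps performed under one handle can never be split by a commit.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t no_updates = PTHREAD_COND_INITIALIZER;
static int t_updates;           /* models transaction->t_updates */
static int dirtied, freed;      /* the two steps done under one handle */

/* Models jbd2_journal_commit_transaction(): wait for running handles,
 * then run the commit callback. */
static void *commit_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (t_updates)           /* models jbd2_journal_wait_updates() */
        pthread_cond_wait(&no_updates, &lock);
    /* Both steps of the handle are visible before the callback runs. */
    printf("commit callback sees dirtied=%d freed=%d\n", dirtied, freed);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t commit;

    /* start_this_handle(): join the running transaction */
    pthread_mutex_lock(&lock);
    t_updates++;
    pthread_mutex_unlock(&lock);

    pthread_create(&commit, NULL, commit_thread, NULL);

    dirtied = 1;                /* models ext4_handle_dirty_metadata() */
    freed = 1;                  /* models ext4_mb_free_metadata() */

    /* stop_this_handle(): drop t_updates and wake the commit side */
    pthread_mutex_lock(&lock);
    if (--t_updates == 0)
        pthread_cond_signal(&no_updates);
    pthread_mutex_unlock(&lock);

    pthread_join(commit, NULL);
    return 0;
}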
>> ---
>> fs/ext4/mballoc.c | 90 ++++++++++++-----------------------------------
>> 1 file changed, 23 insertions(+), 67 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 46b37f5c9223..e4f1b34448e3 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -6135,19 +6135,21 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> ext4_fsblk_t block, unsigned long count,
>> int flags)
>> {
>> - struct buffer_head *bitmap_bh = NULL;
>> + struct ext4_mark_context mc = {
>> + .handle = handle,
>> + .sb = inode->i_sb,
>> + .state = 0,
>> + };
>> struct super_block *sb = inode->i_sb;
>> - struct ext4_group_desc *gdp;
>> struct ext4_group_info *grp;
>> unsigned int overflow;
>> ext4_grpblk_t bit;
>> - struct buffer_head *gd_bh;
>> ext4_group_t block_group;
>> struct ext4_sb_info *sbi;
>> struct ext4_buddy e4b;
>> unsigned int count_clusters;
>> int err = 0;
>> - int ret;
>> + int mark_flags = 0;
>>
>> sbi = EXT4_SB(sb);
>>
>> @@ -6179,18 +6181,6 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> /* The range changed so it's no longer validated */
>> flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
>> }
>> - count_clusters = EXT4_NUM_B2C(sbi, count);
>> - bitmap_bh = ext4_read_block_bitmap(sb, block_group);
>> - if (IS_ERR(bitmap_bh)) {
>> - err = PTR_ERR(bitmap_bh);
>> - bitmap_bh = NULL;
>> - goto error_return;
>> - }
>> - gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
>> - if (!gdp) {
>> - err = -EIO;
>> - goto error_return;
>> - }
>>
>> if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
>> !ext4_inode_block_valid(inode, block, count)) {
>> @@ -6200,28 +6190,7 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> goto error_return;
>> }
>>
>> - BUFFER_TRACE(bitmap_bh, "getting write access");
>> - err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
>> - EXT4_JTR_NONE);
>> - if (err)
>> - goto error_return;
>> -
>> - /*
>> - * We are about to modify some metadata. Call the journal APIs
>> - * to unshare ->b_data if a currently-committing transaction is
>> - * using it
>> - */
>> - BUFFER_TRACE(gd_bh, "get_write_access");
>> - err = ext4_journal_get_write_access(handle, sb, gd_bh, EXT4_JTR_NONE);
>> - if (err)
>> - goto error_return;
>> -#ifdef AGGRESSIVE_CHECK
>> - {
>> - int i;
>> - for (i = 0; i < count_clusters; i++)
>> - BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
>> - }
>> -#endif
>> + count_clusters = EXT4_NUM_B2C(sbi, count);
>> trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
>>
>> /* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
>> @@ -6230,6 +6199,22 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> if (err)
>> goto error_return;
>>
>> +#ifdef AGGRESSIVE_CHECK
>> + mark_flags |= EXT4_MB_BITMAP_MARKED_CHECK;
>> +#endif
>> + err = ext4_mb_mark_group_bb(&mc, block_group, bit, count_clusters,
>> + mark_flags);
>> +
>> +
>> + if (err && mc.changed == 0) {
>> + ext4_mb_unload_buddy(&e4b);
>> + goto error_return;
>> + }
>> +
>> +#ifdef AGGRESSIVE_CHECK
>> + BUG_ON(mc.changed != count_clusters);
>> +#endif
>> +
>> /*
>> * We need to make sure we don't reuse the freed block until after the
>> * transaction is committed. We make an exception if the inode is to be
>> @@ -6252,13 +6237,8 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> new_entry->efd_tid = handle->h_transaction->t_tid;
>>
>> ext4_lock_group(sb, block_group);
>> - mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
>> ext4_mb_free_metadata(handle, &e4b, new_entry);
>> } else {
>> - /* need to update group_info->bb_free and bitmap
>> - * with group lock held. generate_buddy look at
>> - * them with group lock_held
>> - */
>> if (test_opt(sb, DISCARD)) {
>> err = ext4_issue_discard(sb, block_group, bit,
>> count_clusters, NULL);
>> @@ -6271,23 +6251,11 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>> EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);
>>
>> ext4_lock_group(sb, block_group);
>> - mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
>> mb_free_blocks(inode, &e4b, bit, count_clusters);
>> }
>>
>> - ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
>> - ext4_free_group_clusters_set(sb, gdp, ret);
>> - ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
>> - ext4_group_desc_csum_set(sb, block_group, gdp);
>> ext4_unlock_group(sb, block_group);
>>
>> - if (sbi->s_log_groups_per_flex) {
>> - ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
>> - atomic64_add(count_clusters,
>> - &sbi_array_rcu_deref(sbi, s_flex_groups,
>> - flex_group)->free_clusters);
>> - }
>> -
>> /*
>> * on a bigalloc file system, defer the s_freeclusters_counter
>> * update to the caller (ext4_remove_space and friends) so they
>> @@ -6302,26 +6270,14 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
>>
>> ext4_mb_unload_buddy(&e4b);
>>
>> - /* We dirtied the bitmap block */
>> - BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
>> - err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
>> -
>> - /* And the group descriptor block */
>> - BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
>> - ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
>> - if (!err)
>> - err = ret;
>> -
>> if (overflow && !err) {
>> block += count;
>> count = overflow;
>> - put_bh(bitmap_bh);
>> /* The range changed so it's no longer validated */
>> flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
>> goto do_more;
>> }
>> error_return:
>> - brelse(bitmap_bh);
>> ext4_std_error(sb, err);
>> return;
>> }
>> --
>> 2.30.0
>>
>
--
Best wishes
Kemeng Shi
On Tue, Jun 06, 2023 at 10:16:38PM +0800, Kemeng Shi wrote:
>
>
> on 6/6/2023 5:39 PM, Ojaswin Mujoo wrote:
> > On Sat, Jun 03, 2023 at 11:03:24PM +0800, Kemeng Shi wrote:
> >> call ext4_mb_mark_group_bb in ext4_mb_clear_bb to remove repeat code
> >> to update block bitmap and group descriptor on disk.
> >>
> >> Note: ext4_mb_clear_bb will update buddy and bitmap in two critical sections
> >> instead of update in the same critical section.
> >>
> >> Original lock behavior introduced in 7a2fcbf7f857 ("ext4: don't use
> >> blocks freed but not yet committed in buddy cache init") to avoid
> >> race betwwen ext4_mb_free_blocks and ext4_mb_init_cache:
> >> ext4_mb_load_buddy_gfp
> >> ext4_lock_group
> >> mb_clear_bits(bitmap_bh, ...)
> >> mb_free_blocks/ext4_mb_free_metadata
> >> ext4_unlock_group
> >> ext4_mb_unload_buddy
> >>
> >> New lock behavior in this patch:
> >> ext4_mb_load_buddy_gfp
> >> ext4_lock_group
> >> mb_clear_bits(bitmap_bh, ...)
> >> ext4_unlock_group
> >>
> >> /* no ext4_mb_init_cache for the same group will be called as
> >> ext4_mb_load_buddy_gfp will ensure buddy page is update-to-date */
> >>
> >> ext4_lock_group
> >> mb_free_blocks/ext4_mb_free_metadata
> >> ext4_unlock_group
> >> ext4_mb_unload_buddy
> >>
> >> As buddy page for group is always update-to-date between
> >> ext4_mb_load_buddy_gfp and ext4_mb_unload_buddy. Then no
> >> ext4_mb_init_cache will be called for the same group concurrentlly when
> >> we update bitmap and buddy page betwwen buddy load and unload.
> >>
> >> Signed-off-by: Kemeng Shi <[email protected]>
> >
> > Hi Kemeng,
> >
> > Sorry for the late reply I was trying to understand the codepath
> > properly. So I have a question here:
> >
> > With the changes you've made in the patch, the flow would look something
> > like:
> >
> > ext4_mb_clear_bb():
> > ext4_mb_mark_group_bb():
> > ext4_group_lock()
> > - Mark bitmap free
> > - Modify gdp
> > ext4_group_unlock()
> > ext4_handle_dirty_metadata()
> > - I understand this will add the bitmap and gdp buffers to journal's
> > dirty metadata list
> > ...
> > ext4_group_lock()
> > ext4_mb_free_metadata()
> > - Add ext4_free_data entries to sbi->s_freed_data_list. (On commit
> > ext4_journal_commit_callback() will then free the buddy for these)
> > ext4_group_unlock()
> >
> > My question is what happens if journal commits between
> > ext4_handle_dirty_metadata() and ext4_mb_free_metadata() call (Possible?). Then we might
> > never end up freeing the metadata in the buddy bitmap because the commit callback wont
> > be able to find the ext4_free_data entries in sbi->s_freed_data_list.
> >
> > Regards,
> > ojaswin
> >
> Hi Ojaswin, thanks for the reply. To my knowledge, commit should be normally done after handle
> is stopped as following:
> ext4_journal_start_sb
> start_this_handle
> read_lock(&journal->j_state_lock);
> atomic_inc(&transaction->t_updates);
> read_unlock(&journal->j_state_lock);
>
> ext4_journal_stop
> jbd2_journal_stop
> stop_this_handle
> if (atomic_dec_and_test(&transaction->t_updates))
> wake_up(&journal->j_wait_updates);
>
> jbd2_journal_commit_transaction
> jbd2_journal_wait_updates
> while (1)
> if (!atomic_read(&transaction->t_updates))
> /* break loop */
> ...
>
> if (journal->j_commit_callback)
> journal->j_commit_callback(journal, commit_transaction);
>
> So no commit of transaction should not happen between ext4_handle_dirty_metadata and
> ext4_mb_free_metadata.
Hi Kemeng,
Okay makes sense. Thanks for the explanation :)
Regards,
ojaswin
>
> >> ---
> >> fs/ext4/mballoc.c | 90 ++++++++++++-----------------------------------
> >> 1 file changed, 23 insertions(+), 67 deletions(-)
> >>
> >> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> >> index 46b37f5c9223..e4f1b34448e3 100644
> >> --- a/fs/ext4/mballoc.c
> >> +++ b/fs/ext4/mballoc.c
> >> @@ -6135,19 +6135,21 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> ext4_fsblk_t block, unsigned long count,
> >> int flags)
> >> {
> >> - struct buffer_head *bitmap_bh = NULL;
> >> + struct ext4_mark_context mc = {
> >> + .handle = handle,
> >> + .sb = inode->i_sb,
> >> + .state = 0,
> >> + };
> >> struct super_block *sb = inode->i_sb;
> >> - struct ext4_group_desc *gdp;
> >> struct ext4_group_info *grp;
> >> unsigned int overflow;
> >> ext4_grpblk_t bit;
> >> - struct buffer_head *gd_bh;
> >> ext4_group_t block_group;
> >> struct ext4_sb_info *sbi;
> >> struct ext4_buddy e4b;
> >> unsigned int count_clusters;
> >> int err = 0;
> >> - int ret;
> >> + int mark_flags = 0;
> >>
> >> sbi = EXT4_SB(sb);
> >>
> >> @@ -6179,18 +6181,6 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> /* The range changed so it's no longer validated */
> >> flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
> >> }
> >> - count_clusters = EXT4_NUM_B2C(sbi, count);
> >> - bitmap_bh = ext4_read_block_bitmap(sb, block_group);
> >> - if (IS_ERR(bitmap_bh)) {
> >> - err = PTR_ERR(bitmap_bh);
> >> - bitmap_bh = NULL;
> >> - goto error_return;
> >> - }
> >> - gdp = ext4_get_group_desc(sb, block_group, &gd_bh);
> >> - if (!gdp) {
> >> - err = -EIO;
> >> - goto error_return;
> >> - }
> >>
> >> if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
> >> !ext4_inode_block_valid(inode, block, count)) {
> >> @@ -6200,28 +6190,7 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> goto error_return;
> >> }
> >>
> >> - BUFFER_TRACE(bitmap_bh, "getting write access");
> >> - err = ext4_journal_get_write_access(handle, sb, bitmap_bh,
> >> - EXT4_JTR_NONE);
> >> - if (err)
> >> - goto error_return;
> >> -
> >> - /*
> >> - * We are about to modify some metadata. Call the journal APIs
> >> - * to unshare ->b_data if a currently-committing transaction is
> >> - * using it
> >> - */
> >> - BUFFER_TRACE(gd_bh, "get_write_access");
> >> - err = ext4_journal_get_write_access(handle, sb, gd_bh, EXT4_JTR_NONE);
> >> - if (err)
> >> - goto error_return;
> >> -#ifdef AGGRESSIVE_CHECK
> >> - {
> >> - int i;
> >> - for (i = 0; i < count_clusters; i++)
> >> - BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
> >> - }
> >> -#endif
> >> + count_clusters = EXT4_NUM_B2C(sbi, count);
> >> trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters);
> >>
> >> /* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
> >> @@ -6230,6 +6199,22 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> if (err)
> >> goto error_return;
> >>
> >> +#ifdef AGGRESSIVE_CHECK
> >> + mark_flags |= EXT4_MB_BITMAP_MARKED_CHECK;
> >> +#endif
> >> + err = ext4_mb_mark_group_bb(&mc, block_group, bit, count_clusters,
> >> + mark_flags);
> >> +
> >> +
> >> + if (err && mc.changed == 0) {
> >> + ext4_mb_unload_buddy(&e4b);
> >> + goto error_return;
> >> + }
> >> +
> >> +#ifdef AGGRESSIVE_CHECK
> >> + BUG_ON(mc.changed != count_clusters);
> >> +#endif
> >> +
> >> /*
> >> * We need to make sure we don't reuse the freed block until after the
> >> * transaction is committed. We make an exception if the inode is to be
> >> @@ -6252,13 +6237,8 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> new_entry->efd_tid = handle->h_transaction->t_tid;
> >>
> >> ext4_lock_group(sb, block_group);
> >> - mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
> >> ext4_mb_free_metadata(handle, &e4b, new_entry);
> >> } else {
> >> - /* need to update group_info->bb_free and bitmap
> >> - * with group lock held. generate_buddy look at
> >> - * them with group lock_held
> >> - */
> >> if (test_opt(sb, DISCARD)) {
> >> err = ext4_issue_discard(sb, block_group, bit,
> >> count_clusters, NULL);
> >> @@ -6271,23 +6251,11 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >> EXT4_MB_GRP_CLEAR_TRIMMED(e4b.bd_info);
> >>
> >> ext4_lock_group(sb, block_group);
> >> - mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
> >> mb_free_blocks(inode, &e4b, bit, count_clusters);
> >> }
> >>
> >> - ret = ext4_free_group_clusters(sb, gdp) + count_clusters;
> >> - ext4_free_group_clusters_set(sb, gdp, ret);
> >> - ext4_block_bitmap_csum_set(sb, gdp, bitmap_bh);
> >> - ext4_group_desc_csum_set(sb, block_group, gdp);
> >> ext4_unlock_group(sb, block_group);
> >>
> >> - if (sbi->s_log_groups_per_flex) {
> >> - ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
> >> - atomic64_add(count_clusters,
> >> - &sbi_array_rcu_deref(sbi, s_flex_groups,
> >> - flex_group)->free_clusters);
> >> - }
> >> -
> >> /*
> >> * on a bigalloc file system, defer the s_freeclusters_counter
> >> * update to the caller (ext4_remove_space and friends) so they
> >> @@ -6302,26 +6270,14 @@ static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
> >>
> >> ext4_mb_unload_buddy(&e4b);
> >>
> >> - /* We dirtied the bitmap block */
> >> - BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
> >> - err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
> >> -
> >> - /* And the group descriptor block */
> >> - BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
> >> - ret = ext4_handle_dirty_metadata(handle, NULL, gd_bh);
> >> - if (!err)
> >> - err = ret;
> >> -
> >> if (overflow && !err) {
> >> block += count;
> >> count = overflow;
> >> - put_bh(bitmap_bh);
> >> /* The range changed so it's no longer validated */
> >> flags &= ~EXT4_FREE_BLOCKS_VALIDATED;
> >> goto do_more;
> >> }
> >> error_return:
> >> - brelse(bitmap_bh);
> >> ext4_std_error(sb, err);
> >> return;
> >> }
> >> --
> >> 2.30.0
> >>
> >
>
> --
> Best wishes
> Kemeng Shi
>
On Sat, 03 Jun 2023 23:03:08 +0800, Kemeng Shi wrote:
> v3->v4:
> 1. Collect Reviewed-by from Ojaswin
> 2. Do improve as Ojaswin kindly suggested: Fix typo in commit,
> WARN if try to clear bit of uninitialized group and improve
> refactoring of AGGRESSIVE_CHECK code.
> 3. Fix conflic on patch 16
> 4. Improve git log in patch 16,17
>
> [...]
Applied, thanks!
[01/19] ext4: fix wrong unit use in ext4_mb_normalize_request
commit: cc891ea78ad04df2fedd419c23237cdc5a5dda62
[02/19] ext4: fix unit mismatch in ext4_mb_new_blocks_simple
commit: abf6b44598cc3aa8684ff3fa5605a5a0104d69e7
[03/19] ext4: fix wrong unit use in ext4_mb_find_by_goal
commit: ae04f8c997b0043e429e484019560417ffdb416f
[04/19] ext4: treat stripe in block unit
commit: 7e1fb2de7d72fbb354c8ed883be34daf95cf3121
[05/19] ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
commit: 4423c2f31548dd464ab56a716eb5868d7f599417
[06/19] ext4: remove ext4_block_group and ext4_block_group_offset declaration
commit: 61896327e1c69492c5433aa60516a9256f0781b0
[07/19] ext4: try all groups in ext4_mb_new_blocks_simple
commit: 7de4b9d4409a267abe088cbd5b06fba849626d62
[08/19] ext4: get block from bh before pass it to ext4_free_blocks_simple in ext4_free_blocks
commit: 9c90b9af90f233c66ce7e1a30e6477f9f8b482dc
[09/19] ext4: remove unsed parameter and unnecessary forward declaration of ext4_mb_new_blocks_simple
commit: e8a7cea1d0a43b78c934425b8640c8764b39772a
[10/19] ext4: fix wrong unit use in ext4_mb_clear_bb
commit: 99101b01b1b63aaee4d573f4c857738171b455bf
[11/19] ext4: fix wrong unit use in ext4_mb_new_blocks
commit: bec857ed03e324d2b111e920f467d201c900c8ed
[12/19] ext4: factor out codes to update block bitmap and group descriptor on disk from ext4_mb_mark_bb
commit: 80b9027c69d439d5fce276245f10c51153fa5301
[13/19] ext4: call ext4_mb_mark_group_bb in ext4_free_blocks_simple
commit: 60ba685c5998aa8f8f0b4dd1bd8653062ee720dd
[14/19] ext4: extent ext4_mb_mark_group_bb to support allocation under journal
commit: b989e0b98798c235a3f8b3b424590c4721c861a5
[15/19] ext4: call ext4_mb_mark_group_bb in ext4_mb_mark_diskspace_used
commit: 2df9dcb865f335eea4edb4ce03f0e0bbb63a7d86
[16/19] ext4: call ext4_mb_mark_group_bb in ext4_mb_clear_bb
commit: d83b42442e9d1f7a4a418c34514c7a7f3282f39a
[17/19] ext4: call ext4_mb_mark_group_bb in ext4_group_add_blocks
commit: 797f324727ba1c07208431deec8d3ccf5ef4c3c6
[18/19] ext4: add some kunit stub for mballoc kunit test
commit: b8eb90a9a9c2ceddd93bf5cbc16bbec69a2af0f5
[19/19] ext4: add first unit test for ext4_mb_new_blocks_simple in mballoc
commit: d21e7bd66903c8f1f7264e63f310a463086d7bd2
Best regards,
--
Theodore Ts'o <[email protected]>
On Sat, Jun 03, 2023 at 11:03:21PM +0800, Kemeng Shi wrote:
> call ext4_mb_mark_group_bb in ext4_free_blocks_simple to:
> 1. remove repeat code
> 2. pair update of free_clusters in ext4_mb_new_blocks_simple.
> 3. add missing ext4_lock_group/ext4_unlock_group protection.
>
> Signed-off-by: Kemeng Shi <[email protected]>
> Reviewed-by: Ojaswin Mujoo <[email protected]>
Note: after bisecting, I've found that this commit is causing an OOPS
when running "kvm-xfstests -c ext4/adv generic/468". It appears to be
an issue with the fast commit feature not playing nice with this
patch. The stack trace looks like this:
[ 7.409663] ------------[ cut here ]------------
[ 7.409969] WARNING: CPU: 0 PID: 3069 at fs/ext4/mballoc.c:3801 ext4_mb_mark_group_bb+0x48e/0x4a0
[ 7.410480] CPU: 0 PID: 3069 Comm: mount Not tainted 6.4.0-rc5-xfstests-lockdep-00021-g60ba685c5998 #146
[ 7.411067] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 7.411639] RIP: 0010:ext4_mb_mark_group_bb+0x48e/0x4a0
[ 7.411968] Code: 48 c7 c7 35 b0 88 82 c6 05 16 f4 9b 01 01 e8 f9 16 c9 ff e9 7f fe ff ff 8b 45 08 c7 44 24 10 00 00 00 00 31 c9 e9 ef fc ff ff <0f> 0b e9 76 fc ff ff e8 96 64 b6 00 66 0f 1f 44 00 00 90 90 90 90
[ 7.413128] RSP: 0018:ffffc90003b0f9f8 EFLAGS: 00010246
[ 7.413458] RAX: 0000000000000003 RBX: 0000000000006002 RCX: 0000000000000001
[ 7.413902] RDX: ffff88800965b000 RSI: 0000000000000000 RDI: ffff88800d690100
[ 7.414346] RBP: ffffc90003b0fa68 R08: 000000000aebbd6e R09: 0000000000000246
[ 7.414791] R10: 00000000d148c994 R11: 00000000941da2bb R12: ffff88800d7fd000
[ 7.415234] R13: 0000000000000000 R14: ffff88800f3e4080 R15: ffff88800b5ca160
[ 7.415724] FS: 00007f3d04516840(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
[ 7.416227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.416588] CR2: 00007ffcb3979ac8 CR3: 000000000f290003 CR4: 0000000000770ef0
[ 7.417032] PKRU: 55555554
[ 7.417205] Call Trace:
[ 7.417363] <TASK>
[ 7.417502] ? ext4_mb_mark_group_bb+0x48e/0x4a0
[ 7.417807] ? __warn+0x80/0x170
[ 7.418051] ? ext4_mb_mark_group_bb+0x48e/0x4a0
[ 7.418337] ? report_bug+0x173/0x1d0
[ 7.418567] ? handle_bug+0x3c/0x70
[ 7.418797] ? exc_invalid_op+0x17/0x70
[ 7.419037] ? asm_exc_invalid_op+0x1a/0x20
[ 7.419226] ? ext4_mb_mark_group_bb+0x48e/0x4a0
[ 7.419437] ? ext4_mb_mark_group_bb+0xae/0x4a0
[ 7.419708] ext4_mb_mark_bb+0xc0/0x120
[ 7.419946] ext4_ext_clear_bb+0x210/0x280
[ 7.420198] ext4_fc_replay_inode+0xa1/0x380
[ 7.420466] ext4_fc_replay+0x435/0x880
[ 7.420703] ? __getblk_gfp+0x37/0x110
[ 7.420938] ? jread+0x7a/0x180
[ 7.421138] do_one_pass+0x7df/0x1040
[ 7.421365] jbd2_journal_recover+0x150/0x250
[ 7.421637] jbd2_journal_load+0xbe/0x190
[ 7.421886] ext4_load_journal+0x214/0x610
[ 7.422152] ext4_load_and_init_journal+0x29/0x380
[ 7.422490] __ext4_fill_super+0x15ca/0x15e0
[ 7.422756] ? __pfx_ext4_fill_super+0x10/0x10
[ 7.423032] ext4_fill_super+0xcf/0x280
[ 7.423270] get_tree_bdev+0x188/0x290
[ 7.423505] vfs_get_tree+0x29/0xe0
[ 7.423723] ? capable+0x37/0x70
[ 7.423927] do_new_mount+0x174/0x300
[ 7.424157] __x64_sys_mount+0x11a/0x150
[ 7.424401] do_syscall_64+0x3b/0x90
[ 7.424624] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 7.424935] RIP: 0033:0x7f3d0475562a
[ 7.425160] Code: 48 8b 0d 69 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 36 18 0d 00 f7 d8 64 89 01 48
[ 7.426298] RSP: 002b:00007ffcb397aaf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 7.426761] RAX: ffffffffffffffda RBX: 00007f3d04889264 RCX: 00007f3d0475562a
[ 7.427197] RDX: 0000558ea381db90 RSI: 0000558ea381dbb0 RDI: 0000558ea381dbd0
[ 7.427631] RBP: 0000558ea381d960 R08: 0000558ea381dbf0 R09: 00007f3d04827be0
[ 7.428063] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7.428499] R13: 0000558ea381dbd0 R14: 0000558ea381db90 R15: 0000558ea381d960
[ 7.428941] </TASK>
[ 7.429083] irq event stamp: 10951
[ 7.429296] hardirqs last enabled at (10959): [<ffffffff811643c2>] __up_console_sem+0x52/0x60
[ 7.429824] hardirqs last disabled at (10966): [<ffffffff811643a7>] __up_console_sem+0x37/0x60
[ 7.430325] softirqs last enabled at (10574): [<ffffffff8204a529>] __do_softirq+0x2d9/0x39e
[ 7.430839] softirqs last disabled at (10407): [<ffffffff810dcc57>] __irq_exit_rcu+0x87/0xb0
[ 7.431354] ---[ end trace 0000000000000000 ]---
on 6/11/2023 1:05 PM, Theodore Ts'o wrote:
> On Sat, Jun 03, 2023 at 11:03:21PM +0800, Kemeng Shi wrote:
>> call ext4_mb_mark_group_bb in ext4_free_blocks_simple to:
>> 1. remove repeat code
>> 2. pair update of free_clusters in ext4_mb_new_blocks_simple.
>> 3. add missing ext4_lock_group/ext4_unlock_group protection.
>>
>> Signed-off-by: Kemeng Shi <[email protected]>
>> Reviewed-by: Ojaswin Mujoo <[email protected]>
>
> Note: after bisecting, I've found that this commit is causing a OOPS
> when running "kvm-xfstests -c ext4/adv generic/468". It appears to be
> an issue with the fast commit feature not playing nice with this
> patch. The stack trace looks like this:
>
> [ 7.409663] ------------[ cut here ]------------
> [ 7.409969] WARNING: CPU: 0 PID: 3069 at fs/ext4/mballoc.c:3801 ext4_mb_mark_group_bb+0x48e/0x4a0
> [ 7.410480] CPU: 0 PID: 3069 Comm: mount Not tainted 6.4.0-rc5-xfstests-lockdep-00021-g60ba685c5998 #146
> [ 7.411067] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 7.411639] RIP: 0010:ext4_mb_mark_group_bb+0x48e/0x4a0
> [ 7.411968] Code: 48 c7 c7 35 b0 88 82 c6 05 16 f4 9b 01 01 e8 f9 16 c9 ff e9 7f fe ff ff 8b 45 08 c7 44 24 10 00 00 00 00 31 c9 e9 ef fc ff ff <0f> 0b e9 76 fc ff ff e8 96 64 b6 00 66 0f 1f 44 00 00 90 90 90 90
> [ 7.413128] RSP: 0018:ffffc90003b0f9f8 EFLAGS: 00010246
> [ 7.413458] RAX: 0000000000000003 RBX: 0000000000006002 RCX: 0000000000000001
> [ 7.413902] RDX: ffff88800965b000 RSI: 0000000000000000 RDI: ffff88800d690100
> [ 7.414346] RBP: ffffc90003b0fa68 R08: 000000000aebbd6e R09: 0000000000000246
> [ 7.414791] R10: 00000000d148c994 R11: 00000000941da2bb R12: ffff88800d7fd000
> [ 7.415234] R13: 0000000000000000 R14: ffff88800f3e4080 R15: ffff88800b5ca160
> [ 7.415724] FS: 00007f3d04516840(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
> [ 7.416227] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7.416588] CR2: 00007ffcb3979ac8 CR3: 000000000f290003 CR4: 0000000000770ef0
> [ 7.417032] PKRU: 55555554
> [ 7.417205] Call Trace:
> [ 7.417363] <TASK>
> [ 7.417502] ? ext4_mb_mark_group_bb+0x48e/0x4a0
> [ 7.417807] ? __warn+0x80/0x170
> [ 7.418051] ? ext4_mb_mark_group_bb+0x48e/0x4a0
> [ 7.418337] ? report_bug+0x173/0x1d0
> [ 7.418567] ? handle_bug+0x3c/0x70
> [ 7.418797] ? exc_invalid_op+0x17/0x70
> [ 7.419037] ? asm_exc_invalid_op+0x1a/0x20
> [ 7.419226] ? ext4_mb_mark_group_bb+0x48e/0x4a0
> [ 7.419437] ? ext4_mb_mark_group_bb+0xae/0x4a0
> [ 7.419708] ext4_mb_mark_bb+0xc0/0x120
> [ 7.419946] ext4_ext_clear_bb+0x210/0x280
> [ 7.420198] ext4_fc_replay_inode+0xa1/0x380
> [ 7.420466] ext4_fc_replay+0x435/0x880
> [ 7.420703] ? __getblk_gfp+0x37/0x110
> [ 7.420938] ? jread+0x7a/0x180
> [ 7.421138] do_one_pass+0x7df/0x1040
> [ 7.421365] jbd2_journal_recover+0x150/0x250
> [ 7.421637] jbd2_journal_load+0xbe/0x190
> [ 7.421886] ext4_load_journal+0x214/0x610
> [ 7.422152] ext4_load_and_init_journal+0x29/0x380
> [ 7.422490] __ext4_fill_super+0x15ca/0x15e0
> [ 7.422756] ? __pfx_ext4_fill_super+0x10/0x10
> [ 7.423032] ext4_fill_super+0xcf/0x280
> [ 7.423270] get_tree_bdev+0x188/0x290
> [ 7.423505] vfs_get_tree+0x29/0xe0
> [ 7.423723] ? capable+0x37/0x70
> [ 7.423927] do_new_mount+0x174/0x300
> [ 7.424157] __x64_sys_mount+0x11a/0x150
> [ 7.424401] do_syscall_64+0x3b/0x90
> [ 7.424624] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [ 7.424935] RIP: 0033:0x7f3d0475562a
> [ 7.425160] Code: 48 8b 0d 69 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 36 18 0d 00 f7 d8 64 89 01 48
> [ 7.426298] RSP: 002b:00007ffcb397aaf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> [ 7.426761] RAX: ffffffffffffffda RBX: 00007f3d04889264 RCX: 00007f3d0475562a
> [ 7.427197] RDX: 0000558ea381db90 RSI: 0000558ea381dbb0 RDI: 0000558ea381dbd0
> [ 7.427631] RBP: 0000558ea381d960 R08: 0000558ea381dbf0 R09: 00007f3d04827be0
> [ 7.428063] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 7.428499] R13: 0000558ea381dbd0 R14: 0000558ea381db90 R15: 0000558ea381d960
> [ 7.428941] </TASK>
> [ 7.429083] irq event stamp: 10951
> [ 7.429296] hardirqs last enabled at (10959): [<ffffffff811643c2>] __up_console_sem+0x52/0x60
> [ 7.429824] hardirqs last disabled at (10966): [<ffffffff811643a7>] __up_console_sem+0x37/0x60
> [ 7.430325] softirqs last enabled at (10574): [<ffffffff8204a529>] __do_softirq+0x2d9/0x39e
> [ 7.430839] softirqs last disabled at (10407): [<ffffffff810dcc57>] __irq_exit_rcu+0x87/0xb0
> [ 7.431354] ---[ end trace 0000000000000000 ]---
>
>
Hi Ted, sorry for this issue. This patch added a WARN_ON for the case that we free blocks
to an uninitialized block group, which should be invalid.
As an emergency fix, we can simply remove the WARN_ON and allow freeing to an uninitialized
block group as before. I will then find out why the fast commit code path frees blocks to an
uninitialized block group and whether that is valid behavior.
--
Best wishes
Kemeng Shi
On Mon, Jun 12, 2023 at 10:24:55AM +0800, Kemeng Shi wrote:
> Hi ted, sorry for this issue. This patch added a WARN_ON for case that we free block
> to uninitialized block group which should be invalid.
> We can simply remove the WARN_ON to allow free on uninitialized block group as old
> way for emergency fix and I will find out why we free blocks to uninitialized block
> group in fast commit code path and is it a valid behavior.
What I've done for now in the dev branch was to drop patches 12
through 19 of this patch series. That seemed to be a good break
point, and I wanted to make sure we had something working so we can
start doing a lot more intensive testing on the patches so far.
Also, that way, when you resend the last 8 patches in the patch
series, we can make sure they get a proper review as opposed to making
changes on the fly.
The current contents of the dev branch are:
% git log --reverse --oneline origin..dev
40fa8be3852f ext4: kill unused function ext4_journalled_write_inline_data
a030569c34be ext4: Change remaining tracepoints to use folio
d1ffc6fb5ded ext4: Make mpage_journal_page_buffers use folio
5ac99c22fa84 ext4: Make ext4_write_inline_data_end() use folio
d578dfc510cf ext4: Call fsverity_verify_folio()
30f0bd64ed09 ext4: fix wrong unit use in ext4_mb_normalize_request
b9dc976cc348 ext4: fix unit mismatch in ext4_mb_new_blocks_simple
9afc5e21107a ext4: fix wrong unit use in ext4_mb_find_by_goal
860f86ccff6e ext4: treat stripe in block unit
710c384f1536 ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
f242d8a98a6f ext4: remove ext4_block_group and ext4_block_group_offset declaration
5b859728b98b ext4: try all groups in ext4_mb_new_blocks_simple
ea7bbd168135 ext4: get block from bh before pass it to ext4_free_blocks_simple in ext4_free_blocks
757d9100a5d1 ext4: remove unsed parameter and unnecessary forward declaration of ext4_mb_new_blocks_simple
5d62e6da25f5 ext4: fix wrong unit use in ext4_mb_clear_bb
993d22f0a250 ext4: fix wrong unit use in ext4_mb_new_blocks
bf4f2aa4844a ext4: mballoc: Remove useless setting of ac_criteria
743f4dd07bf9 ext4: Remove unused extern variables declaration
bc40109767b3 ext4: Convert mballoc cr (criteria) to enum
52e3814a1342 ext4: Add per CR extent scanned counter
a15c09da1255 ext4: Add counter to track successful allocation of goal length
26cbe38f0275 ext4: Avoid scanning smaller extents in BG during CR1
9c8f8195852c ext4: Don't skip prefetching BLOCK_UNINIT groups
cd303d98b9b5 ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
ea639ce794e5 ext4: Abstract out logic to search average fragment list
b080c84db854 ext4: Add allocation criteria 1.5 (CR1_5)
3a08f7ac3bfa ext4: Give symbolic names to mballoc criterias
d14b5d0b1373 ext4: only update i_reserved_data_blocks on successful block allocation
b352d1f09a20 ext4: add a new helper to check if es must be kept
579c020ea7b7 ext4: factor out __es_alloc_extent() and __es_free_extent()
f4ddcde91d00 ext4: use pre-allocated es in __es_insert_extent()
e77481862663 ext4: use pre-allocated es in __es_remove_extent()
28774513875c ext4: using nofail preallocation in ext4_es_remove_extent()
e109a1db5b09 ext4: using nofail preallocation in ext4_es_insert_delayed_block()
14d876070f03 ext4: using nofail preallocation in ext4_es_insert_extent()
2af6f615b18b ext4: make ext4_es_remove_extent() return void
0ee9cccd1971 ext4: make ext4_es_insert_delayed_block() return void
7a7c285c485d ext4: make ext4_es_insert_extent() return void
9d1c6dea1aa3 ext4: make ext4_zeroout_es() return void
2e3f4cdef544 ext4: clean up mballoc criteria comments
acef67482edf ext4: allow concurrent unaligned dio overwrites
63bc068f0d1a ext4: Fix reusing stale buffer heads from last failed mounting
3a57c2f88be3 ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev'
6b960d2155f9 jbd2: remove unused feature macros
4b049709e652 jbd2: switch to check format version in superblock directly
d9eafe0afafa jbd2: factor out journal initialization from journal_get_superblock()
6eecd1f4c7ef jbd2: remove j_format_version
431ca11fafd3 jbd2: continue to record log between each mount
2ea31402649c ext4: add journal cycled recording support
a228f0e153f6 ext4: update doc about journal superblock description
f9c45d83f4da ext4: turning quotas off if mount failed after enable quotas
5404e4738054 ext4: refactoring to use the unified helper ext4_quotas_off()
d3ab1bca26b4 jbd2: recheck chechpointing non-dirty buffer
7b0cfe40a991 jbd2: remove t_checkpoint_io_list
e86f802ab8d4 jbd2: remove journal_clean_one_cp_list()
e8ece5c78f36 jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
cdffaad9649e jbd2: fix a race when checking checkpoint buffer busy
11761ed6026e jbd2: remove __journal_try_to_free_buffer()
Cheers,
- Ted
on 6/12/2023 11:49 AM, Theodore Ts'o wrote:
> On Mon, Jun 12, 2023 at 10:24:55AM +0800, Kemeng Shi wrote:
>
>> Hi ted, sorry for this issue. This patch added a WARN_ON for case that we free block
>> to uninitialized block group which should be invalid.
>> We can simply remove the WARN_ON to allow free on uninitialized block group as old
>> way for emergency fix and I will find out why we free blocks to uninitialized block
>> group in fast commit code path and is it a valid behavior.
>
> What I've done for now in the dev branch was to drop patches 12
> through 19 of this patch series. That seemed to be a good break
> point, and I wanted to make sure we had something working so we can
> start doing a lot more intesive testing on the patches so far.
>
> Also, that way, when you resend the last 8 patches in the patch
> series, we can make sure they get a proper review as opposed to making
> changes on the fly.
Sure, I will resend the last 8 patches after I solve the issue. I can also take my time
to look into the problem this way :)
> The current contents of the dev branch are:
>
> % git log --reverse --oneline origin..dev
> 40fa8be3852f ext4: kill unused function ext4_journalled_write_inline_data
> a030569c34be ext4: Change remaining tracepoints to use folio
> d1ffc6fb5ded ext4: Make mpage_journal_page_buffers use folio
> 5ac99c22fa84 ext4: Make ext4_write_inline_data_end() use folio
> d578dfc510cf ext4: Call fsverity_verify_folio()
> 30f0bd64ed09 ext4: fix wrong unit use in ext4_mb_normalize_request
> b9dc976cc348 ext4: fix unit mismatch in ext4_mb_new_blocks_simple
> 9afc5e21107a ext4: fix wrong unit use in ext4_mb_find_by_goal
> 860f86ccff6e ext4: treat stripe in block unit
> 710c384f1536 ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
> f242d8a98a6f ext4: remove ext4_block_group and ext4_block_group_offset declaration
> 5b859728b98b ext4: try all groups in ext4_mb_new_blocks_simple
> ea7bbd168135 ext4: get block from bh before pass it to ext4_free_blocks_simple in ext4_free_blocks
> 757d9100a5d1 ext4: remove unsed parameter and unnecessary forward declaration of ext4_mb_new_blocks_simple
> 5d62e6da25f5 ext4: fix wrong unit use in ext4_mb_clear_bb
> 993d22f0a250 ext4: fix wrong unit use in ext4_mb_new_blocks
> bf4f2aa4844a ext4: mballoc: Remove useless setting of ac_criteria
> 743f4dd07bf9 ext4: Remove unused extern variables declaration
> bc40109767b3 ext4: Convert mballoc cr (criteria) to enum
> 52e3814a1342 ext4: Add per CR extent scanned counter
> a15c09da1255 ext4: Add counter to track successful allocation of goal length
> 26cbe38f0275 ext4: Avoid scanning smaller extents in BG during CR1
> 9c8f8195852c ext4: Don't skip prefetching BLOCK_UNINIT groups
> cd303d98b9b5 ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
> ea639ce794e5 ext4: Abstract out logic to search average fragment list
> b080c84db854 ext4: Add allocation criteria 1.5 (CR1_5)
> 3a08f7ac3bfa ext4: Give symbolic names to mballoc criterias
> d14b5d0b1373 ext4: only update i_reserved_data_blocks on successful block allocation
> b352d1f09a20 ext4: add a new helper to check if es must be kept
> 579c020ea7b7 ext4: factor out __es_alloc_extent() and __es_free_extent()
> f4ddcde91d00 ext4: use pre-allocated es in __es_insert_extent()
> e77481862663 ext4: use pre-allocated es in __es_remove_extent()
> 28774513875c ext4: using nofail preallocation in ext4_es_remove_extent()
> e109a1db5b09 ext4: using nofail preallocation in ext4_es_insert_delayed_block()
> 14d876070f03 ext4: using nofail preallocation in ext4_es_insert_extent()
> 2af6f615b18b ext4: make ext4_es_remove_extent() return void
> 0ee9cccd1971 ext4: make ext4_es_insert_delayed_block() return void
> 7a7c285c485d ext4: make ext4_es_insert_extent() return void
> 9d1c6dea1aa3 ext4: make ext4_zeroout_es() return void
> 2e3f4cdef544 ext4: clean up mballoc criteria comments
> acef67482edf ext4: allow concurrent unaligned dio overwrites
> 63bc068f0d1a ext4: Fix reusing stale buffer heads from last failed mounting
> 3a57c2f88be3 ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev'
> 6b960d2155f9 jbd2: remove unused feature macros
> 4b049709e652 jbd2: switch to check format version in superblock directly
> d9eafe0afafa jbd2: factor out journal initialization from journal_get_superblock()
> 6eecd1f4c7ef jbd2: remove j_format_version
> 431ca11fafd3 jbd2: continue to record log between each mount
> 2ea31402649c ext4: add journal cycled recording support
> a228f0e153f6 ext4: update doc about journal superblock description
> f9c45d83f4da ext4: turning quotas off if mount failed after enable quotas
> 5404e4738054 ext4: refactoring to use the unified helper ext4_quotas_off()
> d3ab1bca26b4 jbd2: recheck chechpointing non-dirty buffer
> 7b0cfe40a991 jbd2: remove t_checkpoint_io_list
> e86f802ab8d4 jbd2: remove journal_clean_one_cp_list()
> e8ece5c78f36 jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
> cdffaad9649e jbd2: fix a race when checking checkpoint buffer busy
> 11761ed6026e jbd2: remove __journal_try_to_free_buffer()
>
> Cheers,
>
> - Ted
>
--
Best wishes
Kemeng Shi
on 6/13/2023 9:22 AM, Kemeng Shi wrote:
>
>
> on 6/12/2023 11:49 AM, Theodore Ts'o wrote:
>> On Mon, Jun 12, 2023 at 10:24:55AM +0800, Kemeng Shi wrote:
>>
>>> Hi ted, sorry for this issue. This patch added a WARN_ON for case that we free block
>>> to uninitialized block group which should be invalid.
>>> We can simply remove the WARN_ON to allow free on uninitialized block group as old
>>> way for emergency fix and I will find out why we free blocks to uninitialized block
>>> group in fast commit code path and is it a valid behavior.
>>
>> What I've done for now in the dev branch was to drop patches 12
>> through 19 of this patch series. That seemed to be a good break
>> point, and I wanted to make sure we had something working so we can
>> start doing a lot more intesive testing on the patches so far.
>>
>> Also, that way, when you resend the last 8 patches in the patch
>> series, we can make sure they get a proper review as opposed to making
>> changes on the fly.
> Sure, I will resend last 8 patches after I solve the issue. I can also take my time
> to look at problem in this way :)
Here is an update on how the WARN_ON for freeing blocks to an uninitialized block group is
triggered in the fast commit path by test generic/468.
# /sbin/mkfs.ext4 -F -q -O inline_data,fast_commit /dev/vdc
# /bin/mount -t ext4 -o acl,user_xattr -o block_validity /dev/vdc /vdc
# /root/xfstests/bin/xfs_io -i -f -c 'truncate 4202496' -c 'pwrite 0 4202496' -c fsync -c 'falloc 4202496 104857600' /vdc/testfile
The "falloc 4202496 104857600" will trigger block allocation in a
new uninitialized block group for file range "4202496 104857600" as
following:
ext4_map_blocks
/*
* Allocate blocks from an uninitialized block group. The change that
* marks the group initialized will be fully journaled.
*/
ext4_mb_new_blocks
[...]
/*
* New extents will be tracked in fast commit.
*/
ext4_fc_track_range
/*
* Add new extents for the allocated range to the inode, which still has
* space in ext_inode_hdr
*/
ext4_ext_insert_extent
[...]
/*
* depth is 0 as the inode has space in ext_inode_hdr; this will track
* the inode in fast commit.
*/
ext4_ext_dirty(handle, inode, path + path->p_depth);
ext4_mark_inode_dirty
ext4_mark_iloc_dirty
ext4_fc_track_inode
# /root/xfstests/bin/xfs_io -i -c fsync /vdc/testfile
The fast commit is performed in fsync as follows:
vfs_fsync
ext4_fsync_journal
ext4_fc_commit
ext4_fc_perform_commit
add EXT4_FC_TAG_ADD_RANGE of new extent range
add EXT4_FC_TAG_INODE of changed inode
# /root/xfstests/src/godown /vdc
The journaled change that marks the group initialized is discarded as follows:
ext4_shutdown
jbd2_journal_abort
# /bin/umount /dev/vdc
# /bin/mount -t ext4 -o acl,user_xattr -o block_validity /dev/vdc /vdc
The fast commit is replayed when mounting and the newly added WARN_ON is triggered as
follows:
ext4_fc_replay
/*
* replay EXT4_FC_TAG_ADD_RANGE, adding extents that contain blocks from
* the uninitialized group back to the inode
*/
ext4_fc_replay_add_range
/*
* replay EXT4_FC_TAG_INODE; this will trigger the WARN_ON
*/
ext4_fc_replay_inode
/*
* mark all blocks in the old inode free; blocks from the uninitialized
* group are then freed and the WARN_ON fires
*/
ext4_ext_clear_bb
/* update inode with data journaled in fast commit */
[...]
/*
* mark all blocks in the new inode in use, and the gdp will be marked
* initialized normally
*/
ext4_fc_record_modified_inode
[...]
ext4_fc_set_bitmaps_and_counters
In this situation, freeing blocks to an uninitialized block group does no harm.
There may be more harmless situations like this, so I would like to simply
drop the WARN_ON in the next version.
--
Best wishes
Kemeng Shi